# Datasets Generator

- arXiv
    - [GitHub](https://github.com/lukasschwab/arxiv.py)
    - [API](https://arxiv.org/help/api/user-manual)
- Semantic Scholar
    - [API](https://api.semanticscholar.org/)

In [1]:
import arxiv
import pprint
import semanticscholar as sch

## arXiv

In [9]:
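# arxiv.query is the pre-1.0 arxiv.py interface; with iterative=True it returns
# a generator function, which is why the loop below calls papers().
papers = arxiv.query(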
    query="cat:cs.LG",
    max_results=1,
    iterative=True
)

paper_keys = set()

for paper in papers():
    pprint.pprint(paper)
    # Collect every field name seen across the results.
    paper_keys.update(paper.keys())

print(paper_keys)
print(len(paper_keys))

{'affiliation': 'None',
 'arxiv_comment': '63 pages, 15 figures',
 'arxiv_primary_category': {'scheme': 'http://arxiv.org/schemas/atom',
                            'term': 'cs.LG'},
 'arxiv_url': 'http://arxiv.org/abs/cs/9905014v1',
 'author': 'Thomas G. Dietterich',
 'author_detail': {'name': 'Thomas G. Dietterich'},
 'authors': ['Thomas G. Dietterich'],
 'doi': None,
 'guidislink': True,
 'id': 'http://arxiv.org/abs/cs/9905014v1',
 'journal_reference': None,
 'links': [{'href': 'http://arxiv.org/abs/cs/9905014v1',
            'rel': 'alternate',
            'type': 'text/html'},
           {'href': 'http://arxiv.org/pdf/cs/9905014v1',
            'rel': 'related',
            'title': 'pdf',
            'type': 'application/pdf'}],
 'pdf_url': 'http://arxiv.org/pdf/cs/9905014v1',
 'published': '1999-05-21T14:26:07Z',
 'published_parsed': time.struct_time(tm_year=1999, tm_mon=5, tm_mday=21, tm_hour=14, tm_min=26, tm_sec=7, tm_wday=4, tm_yday=141, tm_isdst=0),
 'summary': 'This paper 

## Understanding arXiv API Response

|element|explanation|
|---|---|
|`arxiv_primary_category`|the primary arXiv category|
|`id`|the article's unique identifier, given as a URL of the form `http://arxiv.org/abs/<id>`|
|`pdf_url`|URL of the article's PDF|
|`journal_reference`|a journal reference if present|
|`authors`|list of author names|
|`arxiv_comment`|the authors' comment if present|
|`published`|the date that `version 1` of the article was submitted|
|`tags`||
|`updated`|the date that the retrieved version of the article was submitted; same as `published` when the retrieved version is version 1|
|`summary`|the article abstract|
|`published_parsed`|`published` parsed into a `time.struct_time`|
|`title`|the title of the article|
|`affiliation`|the author's affiliation included as a subelement of `<author>` if present|
|`guidislink`||
|`author_detail`||
|`doi`|a url for the resolved DOI to an external resource if present|
|`arxiv_url`|URL of the article's abstract page|
|`title_detail`||
|`author`|an author name as a single string; in the Atom feed there is one `<author>` element per author, each with a child element `<name>` containing the author name|
|`summary_detail`||
|`links`|up to 3 URLs associated with this article|
|`updated_parsed`|`updated` parsed into a `time.struct_time`|
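
For building a dataset, the nested response can be reduced to one flat row per paper. The following is a minimal sketch, not part of the original notebook: `to_row` is a hypothetical helper, the choice of columns is an assumption, and the sample record is trimmed to fields visible in the output above.

```python
from datetime import datetime, timezone

# Sample record trimmed to fields that appear in the printed output above.
record = {
    "id": "http://arxiv.org/abs/cs/9905014v1",
    "authors": ["Thomas G. Dietterich"],
    "published": "1999-05-21T14:26:07Z",
    "pdf_url": "http://arxiv.org/pdf/cs/9905014v1",
    "doi": None,
    "arxiv_primary_category": {
        "scheme": "http://arxiv.org/schemas/atom",
        "term": "cs.LG",
    },
}


def to_row(rec):
    """Flatten one arXiv result dict into a flat row (hypothetical helper)."""
    # `published` is an ISO 8601 timestamp in UTC, marked by the trailing "Z".
    published = datetime.strptime(
        rec["published"], "%Y-%m-%dT%H:%M:%SZ"
    ).replace(tzinfo=timezone.utc)
    return {
        "url": rec["id"],
        # Join the author list into one string so the row stays flat.
        "authors": "; ".join(rec.get("authors", [])),
        "published": published,
        # Keep only the category code, e.g. "cs.LG".
        "primary_category": rec["arxiv_primary_category"]["term"],
        "pdf_url": rec.get("pdf_url"),
        "doi": rec.get("doi"),
    }


row = to_row(record)
pprint_order = ["url", "authors", "published", "primary_category", "pdf_url", "doi"]
for column in pprint_order:
    print(column, "=", row[column])
```

Optional fields such as `doi` are read with `.get` so that a missing key yields `None` instead of raising.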

## Semantic Scholar
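
The lookup in this section passes an identifier of the form `arXiv:<id>`. To connect the two sources, that identifier can be derived from the `arxiv_url` of an arXiv result. The sketch below is an illustration only: `to_s2_id` is a hypothetical helper, and it assumes that the version suffix should be dropped and that old-style IDs such as `cs/9905014` are accepted in the same way as the new-style ID used in this notebook.

```python
import re

# Matches new-style IDs (1507.06228) and old-style IDs (cs/9905014),
# each with an optional version suffix such as "v1".
_ABS_URL = re.compile(r"arxiv\.org/abs/(?P<id>.+?)(?:v(?P<version>\d+))?$")


def to_s2_id(arxiv_url):
    """Turn an arXiv abstract URL into an 'arXiv:<id>' string (hypothetical helper)."""
    match = _ABS_URL.search(arxiv_url)
    if match is None:
        raise ValueError(f"not an arXiv abstract URL: {arxiv_url!r}")
    # Assumption: the lookup expects the ID without its version suffix.
    return "arXiv:" + match.group("id")


print(to_s2_id("http://arxiv.org/abs/1507.06228v2"))
print(to_s2_id("http://arxiv.org/abs/cs/9905014v1"))
```

The result can then be passed to the lookup call in the cell that follows.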

In [4]:
# Look up a single paper by its arXiv ID, prefixed with "arXiv:".
paper = sch.paper('arXiv:1507.06228')
pprint.pprint(paper)

{'abstract': 'Theoretical and empirical evidence indicates that the depth of '
             'neural networks is crucial for their success. However, training '
             'becomes more difficult as depth increases, and training of very '
             'deep networks remains an open problem. Here we introduce a new '
             'architecture designed to overcome this. Our so-called highway '
             'networks allow unimpeded information flow across many layers on '
             'information highways. They are inspired by Long Short-Term '
             'Memory recurrent networks and use adaptive gating units to '
             'regulate the information flow. Even with hundreds of layers, '
             'highway networks can be trained directly through simple gradient '
             'descent. This enables the study of extremely deep and efficient '
             'architectures.',
 'arxivId': '1507.06228',
 'authors': [{'authorId': '2100612',
              'name': 'Rupesh Kumar Srivas