# Data-Mining arXiv for Abstracts

Using the pyarxiv module, we will create a data set of thousands of abstracts from recent arXiv papers.

In [15]:
import arxiv
import pandas as pd
from pyarxiv import query, download_entries
from pyarxiv.arxiv_categories import ArxivCategory, arxiv_category_map

#set the display option to show all columns
pd.set_option('display.max_columns', None)

In [2]:
"""
The commented-out code below lists the options available for querying arXiv.
"""


#query(max_results=100, ids=[], categories=[],
#                title='', authors='', abstract='', journal_ref='',
#                querystring='')
# entries = query(max_results=1000, abstract='electron')
# titles = map(lambda x: x['title'], entries)
# print(list(titles))


# #download_entries(entries_or_ids_or_uris=[], target_folder='.',
# #                     use_title_for_filename=False, append_id=False,
# #                     progress_callback=(lambda x, y: id))
# download_entries(entries)


# entries_with_category = query(categories=[ArxivCategory.cs_AI])
# print(arxiv_category_map(ArxivCategory.cs_AI))

In [69]:
def arxiv_querier(n_results):
    """Prompt for a keyword and return a pandas dataframe of arXiv papers whose titles match it."""
    keyword = input("Enter the topic for which you want to search papers on arXiv: ")

    #query returns a list of dictionaries, each dictionary corresponding to one paper
    results = query(title=keyword, max_results=n_results)

    #create a pandas dataframe from the entries
    arxiv_results = pd.DataFrame(results)

    #keep only the columns we need; .copy() avoids a SettingWithCopyWarning when adding a column below
    columns = ['id', 'published', 'updated', 'title', 'summary', 'authors', 'arxiv_primary_category']
    arxiv_results = arxiv_results[columns].copy()

    #clean the authors column (not yet implemented)
    # arxiv_results['authors'] = arxiv_results['authors'].map(lambda x:  for y in x)

    #extract the category name from the primary category dictionary
    arxiv_results['primary_category'] = arxiv_results['arxiv_primary_category'].map(lambda x: x['term'])

    return arxiv_results

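def abstract_preview(n_to_show, dataframe):
    """Print the first n_to_show abstracts in the dataframe."""
    #head() selects by position, so this works for any index and for short dataframes
    for summary in dataframe['summary'].head(n_to_show):
        print(summary + '\n')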
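
The authors cleanup that arxiv_querier leaves commented out could look roughly like the sketch below. It assumes each value in the authors column is a list of dictionaries with a 'name' key, which is an assumption about the shape of the query results and is not confirmed by this notebook. The sample records and the helper name clean_authors are hypothetical.

```python
import pandas as pd

# Hypothetical sample records standing in for query results.
# Assumption: 'authors' is a list of {'name': ...} dictionaries.
sample = pd.DataFrame([
    {"id": "a1",
     "authors": [{"name": "A. Author"}, {"name": "B. Author"}],
     "arxiv_primary_category": {"term": "quant-ph"}},
    {"id": "a2",
     "authors": [{"name": "C. Author"}],
     "arxiv_primary_category": {"term": "cond-mat.stat-mech"}},
])


def clean_authors(authors):
    """Join the 'name' field of every author dictionary into one string."""
    return ", ".join(author["name"] for author in authors)


# Replace the nested author dictionaries with a plain string per paper.
sample["authors"] = sample["authors"].map(clean_authors)

# Same pattern as in arxiv_querier: pull the category name out of its dictionary.
sample["primary_category"] = sample["arxiv_primary_category"].map(lambda c: c["term"])

print(sample[["id", "authors", "primary_category"]])
```

If the real entries store authors differently, only clean_authors needs to change.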

In [63]:
entanglement = arxiv_querier(5000)

Enter the topic for which you want to search papers on arXiv: entanglement


In [64]:
quantum = arxiv_querier(5000)

Enter the topic for which you want to search papers on arXiv: quantum


In [65]:
quantum_thermo = arxiv_querier(5000)

Enter the topic for which you want to search papers on arXiv: quantum thermodynamics


In [66]:
quantum_computing = arxiv_querier(5000)

Enter the topic for which you want to search papers on arXiv: quantum computing
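
Each call above asks for its keyword through input, which makes the four searches awkward to rerun unattended. One possible alternative, sketched here under assumptions, passes the keywords as a list instead. The names build_frames and fake_query are hypothetical; fake_query only imitates the title and max_results keywords used with query above so that the sketch runs without network access.

```python
import pandas as pd


def build_frames(keywords, query_fn, n_results=5000):
    """Run query_fn once per keyword and return a {keyword: dataframe} dictionary."""
    frames = {}
    for keyword in keywords:
        results = query_fn(title=keyword, max_results=n_results)
        frames[keyword] = pd.DataFrame(results)
    return frames


def fake_query(title, max_results):
    """Hypothetical stand-in for the real query function: returns at most two fake records."""
    return [
        {"id": f"{title}-{i}", "title": f"{title} paper {i}"}
        for i in range(min(max_results, 2))
    ]


frames = build_frames(["entanglement", "quantum"], fake_query, n_results=2)
for keyword, frame in frames.items():
    print(keyword, len(frame))
```

In the notebook, the real query function would presumably be passed in place of fake_query.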


In [68]:
abstract_preview(3,quantum_thermo)

Quantum thermodynamics addresses the emergence of thermodynamical laws from
quantum mechanics. The link is based on the intimate connection of quantum
thermodynamics with the theory of open quantum systems. Quantum mechanics
inserts dynamics into thermodynamics giving a sound foundation to
finite-time-thermodynamics. The emergence of the 0-law I-law II-law and III-law
of thermodynamics from quantum considerations is presented. The emphasis is on
consistence between the two theories which address the same subject from
different foundations. We claim that inconsistency is the result of faulty
analysis pointing to flaws in approximations.

Quantum thermodynamics is an emerging research field aiming to extend
standard thermodynamics and non-equilibrium statistical physics to ensembles of
sizes well below the thermodynamic limit, in non-equilibrium situations, and
with the full inclusion of quantum effects. Fuelled by experimental advances
and the potential of future nanoscale applications 

In [70]:
#combine the dataframes; ignore_index=True gives the result a fresh 0..n-1 index
combined_arxiv = pd.concat([quantum, entanglement, quantum_thermo, quantum_computing], ignore_index=True)
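
The four title searches can overlap (a paper with "quantum computing" in its title may also match "quantum"), so the combined dataframe may list the same paper more than once. A minimal sketch of removing such repeats by the id column, using small hypothetical dataframes in place of the real query results:

```python
import pandas as pd

# Hypothetical stand-ins for two overlapping query results.
quantum_demo = pd.DataFrame({
    "id": ["p1", "p2"],
    "title": ["Quantum walks", "Quantum computing with ions"],
})
computing_demo = pd.DataFrame({
    "id": ["p2", "p3"],
    "title": ["Quantum computing with ions", "Quantum computing overview"],
})

# Stack the frames, then keep the first occurrence of every paper id.
combined_demo = pd.concat([quantum_demo, computing_demo], ignore_index=True)
deduplicated = combined_demo.drop_duplicates(subset="id").reset_index(drop=True)

print(len(combined_demo), "rows before,", len(deduplicated), "rows after")
```

Whether deduplication is wanted depends on how the data set will be used; if each row should record which search found the paper, the repeats carry information.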

In [84]:
#export the dataframe to CSV without the row index
combined_arxiv.to_csv('arxiv_quantum_data', index=False)
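
The exported file can be read back with pd.read_csv to check that it survived the round trip. A minimal sketch, using a hypothetical two-row dataframe and a temporary file rather than the real export:

```python
import os
import tempfile

import pandas as pd

# Hypothetical two-row frame standing in for combined_arxiv.
demo = pd.DataFrame({
    "id": ["p1", "p2"],
    "summary": ["First abstract.", "Second abstract."],
})

# Write to a temporary location so the sketch leaves the real export untouched.
path = os.path.join(tempfile.mkdtemp(), "demo_arxiv_data.csv")
demo.to_csv(path, index=False)

# Read the file back and inspect its shape and columns.
restored = pd.read_csv(path)
print(restored.shape)
print(list(restored.columns))
```

Writing with index=False keeps the row index out of the file; without it, read_csv returns the index as an extra unnamed column.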