# Data collection of arXiv full text article metadata

## 1. Introduction

arXiv is a free distribution service and an open-access archive for nearly 2.4 million scholarly articles in the fields of physics, mathematics, computer science, quantitative biology, quantitative finance, statistics, electrical engineering and systems science, and economics. The articles are preprints and postprints that are approved for posting after moderation but are not peer-reviewed by arXiv.

Metadata for full-text scientific papers is collected from [arXiv](https://arxiv.org/) via the [arXiv API](https://info.arxiv.org/help/api/index.html). The query used for this project targets papers on COVID-19 and drug repurposing published from 2019 to 2022.



## 2. Install/import libraries

In [None]:
!pip install arxiv

In [None]:
import pandas as pd
import pickle
import arxiv
import datetime
import time

## 3. Download metadata

Define a function that searches the publications database through the [arXiv API](https://info.arxiv.org/help/api/index.html), extracts the fields 'arxiv-id', 'published', 'revised', 'title', 'journal', 'authors', 'doi', and 'pdf_url' from each result, and returns the collected records as a DataFrame.






In [None]:
# Python wrapper adapted from https://github.com/lukasschwab/arxiv.py

def get_results(search_query: str, date_from: str = None, date_until: str = None):
    """Search arXiv and return the metadata of matching articles as a DataFrame.

    date_from and date_until are optional 'YYYY-MM-DD' strings; articles
    published outside this inclusive range are skipped.
    """
    # Validate the date range before sending any requests
    if date_from is not None and date_until is not None and date_from > date_until:
        raise ValueError("Check that date_from precedes date_until")

    columns = ['arxiv-id', 'published', 'revised', 'title',
               'journal', 'authors', 'doi', 'pdf_url']

    results_generator = arxiv.Client(
        # The number of papers to fetch from arXiv per page of results.
        # The API's limit is 30,000 in slices of at most 2000 at a time.
        page_size=100,
        # The number of seconds to wait between requests for pages.
        # arXiv's Terms of Use ask that you "make no more than one request every three seconds."
        delay_seconds=3,
        # The number of times the client will retry a request that fails
        num_retries=3
    ).results(arxiv.Search(
        # Pass the query as-is: the client sends it as the search_query URL parameter
        query=search_query,
        # Fetch every match. float('inf') is what arxiv.py 1.x expects;
        # newer releases may require None instead.
        max_results=float('inf'),
        sort_by=arxiv.SortCriterion.LastUpdatedDate,
        sort_order=arxiv.SortOrder.Descending,
    ))

    # Collect one dictionary per article and build the DataFrame once at the end
    # (DataFrame.append was removed in pandas 2.0)
    rows = []
    for result in results_generator:
        published = result.published.strftime('%Y-%m-%d')
        if date_from is not None and published < date_from:
            continue
        if date_until is not None and published > date_until:
            continue

        row = {'arxiv-id': result.entry_id.split('/abs/')[-1],
               'published': published,
               'revised': result.updated.strftime('%Y-%m-%d'),
               'title': result.title,
               'journal': result.journal_ref,
               'authors': ', '.join(author.name for author in result.authors),
               'doi': result.doi,
               'pdf_url': result.pdf_url}

        # Print each record as it arrives to show progress
        for field, value in row.items():
            print(f"{field}: {value}")
        rows.append(row)

    return pd.DataFrame(rows, columns=columns)
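
The date filter in `get_results` compares dates as 'YYYY-MM-DD' strings rather than as datetime objects. This works because zero-padded ISO 8601 dates sort in the same order lexicographically as they do chronologically. A minimal sketch of the same check in isolation follows; `in_date_range` is a hypothetical helper name used only for illustration and is not part of the notebook.

```python
from typing import Optional


def in_date_range(published: str,
                  date_from: Optional[str] = None,
                  date_until: Optional[str] = None) -> bool:
    """Return True if an ISO 'YYYY-MM-DD' date lies in the inclusive range.

    A bound of None means that side of the range is open.
    """
    if date_from is not None and published < date_from:
        return False
    if date_until is not None and published > date_until:
        return False
    return True


# Inside the range
print(in_date_range('2021-09-14', '2019-01-01', '2022-12-31'))
# After the upper bound, with no lower bound set
print(in_date_range('2023-01-05', None, '2022-12-31'))
```

The comparison is only reliable when every date uses the same zero-padded format; a value such as '2021-9-14' would need to be normalized first.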

## 4. Construct search query

In the arXiv search engine, each article is divided up into a number of fields that can be searched by prepending the field prefix, followed by a colon, to the search term.

The API allows advanced query construction by combining these search fields with Boolean operators (AND, OR, ANDNOT) and, for even more complex queries, by using parentheses for grouping the Boolean expressions.

For advanced query syntax documentation, see the [arXiv API User's Manual](https://arxiv.org/help/api/user-manual#query_details).
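
As a rough illustration of how field prefixes, Boolean operators, and parentheses combine, the sketch below assembles a query string from lists of terms. The helper names `or_group` and `and_groups` and the variable `example_query` are assumptions made for this example, not part of the arXiv API or the `arxiv` package. The sketch wraps multi-word terms in double quotes, which is how the API manual describes grouping several words into a phrase.

```python
from typing import Iterable, List


def or_group(terms: Iterable[str], field: str = "all") -> str:
    """Prefix each term with the field and join the terms with OR in parentheses."""
    parts: List[str] = []
    for term in terms:
        # Wrap multi-word terms in double quotes so they are searched as a phrase
        value = f'"{term}"' if " " in term else term
        parts.append(f"{field}:{value}")
    return "(" + " OR ".join(parts) + ")"


def and_groups(groups: Iterable[str]) -> str:
    """Require every parenthesized group to match by joining them with AND."""
    return " AND ".join(groups)


disease = or_group(["covid", "coronavirus", "sars-cov-2"])
approach = or_group(["drug discovery", "drug repurposing"])
example_query = and_groups([disease, approach])
print(example_query)
```

Building the string from lists keeps the parentheses balanced and makes it easier to add or remove search terms than editing one long literal by hand.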





In [None]:
search_query = "(all:covid OR all:coronavirus OR all:sars-cov-2) AND (all:'drug discovery' OR all:'drug repurposing' OR all:'drug repositioning')"
date_from = None
date_until = '2022-12-31'

article_results = get_results(search_query, date_from, date_until)

arxiv-id: 2109.06377v4
published: 2021-09-14
revised: 2022-12-22
title: ASGARD: A Single-cell Guided pipeline to Aid Repurposing of Drugs
journal: None
authors: Bing He, Yao Xiao, Haodong Liang, Qianhui Huang, Yuheng Du, Yijun Li, David Garmire, Duxin Sun, Lana X. Garmire
doi: None
pdf_url: http://arxiv.org/pdf/2109.06377v4
arxiv-id: 2212.09867v1
published: 2022-12-19
revised: 2022-12-19
title: Detecting Contradictory COVID-19 Drug Efficacy Claims from Biomedical Literature
journal: None
authors: Daniel N. Sosa, Malavika Suresh, Christopher Potts, Russ B. Altman
doi: None
pdf_url: http://arxiv.org/pdf/2212.09867v1
arxiv-id: 2212.09610v1
published: 2022-12-19
revised: 2022-12-19
title: Drying of Bio-colloidal Sessile Droplets: Advances, Applications, and Perspectives
journal: None
authors: Anusuya Pal, Amalesh Gope, Anupam Sengupta
doi: None
pdf_url: http://arxiv.org/pdf/2212.09610v1
arxiv-id: 2212.00023v2
published: 2022-11-30
revised: 2022-12-08
title: Random Copolymer inverse design 

In [None]:
len(article_results)

304

In [None]:
article_results

| | arxiv-id | published | revised | title | journal | authors | doi | pdf_url |
|---|---|---|---|---|---|---|---|---|
| 0 | 2109.06377v4 | 2021-09-14 | 2022-12-22 | ASGARD: A Single-cell Guided pipeline to Aid R... | | Bing He, Yao Xiao, Haodong Liang, Qianhui Huan... | | http://arxiv.org/pdf/2109.06377v4 |
| 1 | 2212.09867v1 | 2022-12-19 | 2022-12-19 | Detecting Contradictory COVID-19 Drug Efficacy... | | Daniel N. Sosa, Malavika Suresh, Christopher P... | | http://arxiv.org/pdf/2212.09867v1 |
| 2 | 2212.09610v1 | 2022-12-19 | 2022-12-19 | Drying of Bio-colloidal Sessile Droplets: Adva... | | Anusuya Pal, Amalesh Gope, Anupam Sengupta | | http://arxiv.org/pdf/2212.09610v1 |
| 3 | 2212.00023v2 | 2022-11-30 | 2022-12-08 | Random Copolymer inverse design system orienti... | | Tianyu Wu, Yang Tang | | http://arxiv.org/pdf/2212.00023v2 |
| 4 | 2212.03911v1 | 2022-12-07 | 2022-12-07 | Analysis of Drug repurposing Knowledge graphs ... | | Ajay Kumar Gogineni | | http://arxiv.org/pdf/2212.03911v1 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 299 | 2003.13665v1 | 2020-03-30 | 2020-03-30 | Genomics-guided molecular maps of coronavirus ... | | Gennadi Glinsky | | http://arxiv.org/pdf/2003.13665v1 |
| 300 | 2003.14258v1 | 2020-03-30 | 2020-03-30 | Nanomechanical sonification of the 2019-nCoV c... | | Markus J. Buehler | | http://arxiv.org/pdf/2003.14258v1 |
| 301 | 2003.12454v1 | 2020-03-26 | 2020-03-26 | A Machine Learning alternative to placebo-cont... | | Ezequiel Alvarez, Federico Lamagna, Manuel Szewc | | http://arxiv.org/pdf/2003.12454v1 |
| 302 | 2003.04524v1 | 2020-03-10 | 2020-03-10 | Old Drugs for Newly Emerging Viral Disease, CO... | | Mohammad Reza Dayer | | http://arxiv.org/pdf/2003.04524v1 |


In [None]:
article_results_for_dl = article_results.copy()

The pdf_url column in the API output contains URLs on the arxiv.org domain. Before downloading the PDF full text programmatically, these must be rewritten to use the export.arxiv.org subdomain; otherwise the requests may be blocked by the firewall.

In [None]:
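# Rewrite only the pdf_url column so that URLs appearing in other fields are left untouched
article_results_for_dl['pdf_url'] = article_results_for_dl['pdf_url'].str.replace('http://', 'http://export.', regex=False)
article_results_for_dl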

| | arxiv-id | published | revised | title | journal | authors | doi | pdf_url |
|---|---|---|---|---|---|---|---|---|
| 0 | 2109.06377v4 | 2021-09-14 | 2022-12-22 | ASGARD: A Single-cell Guided pipeline to Aid R... | | Bing He, Yao Xiao, Haodong Liang, Qianhui Huan... | | http://export.arxiv.org/pdf/2109.06377v4 |
| 1 | 2212.09867v1 | 2022-12-19 | 2022-12-19 | Detecting Contradictory COVID-19 Drug Efficacy... | | Daniel N. Sosa, Malavika Suresh, Christopher P... | | http://export.arxiv.org/pdf/2212.09867v1 |
| 2 | 2212.09610v1 | 2022-12-19 | 2022-12-19 | Drying of Bio-colloidal Sessile Droplets: Adva... | | Anusuya Pal, Amalesh Gope, Anupam Sengupta | | http://export.arxiv.org/pdf/2212.09610v1 |
| 3 | 2212.00023v2 | 2022-11-30 | 2022-12-08 | Random Copolymer inverse design system orienti... | | Tianyu Wu, Yang Tang | | http://export.arxiv.org/pdf/2212.00023v2 |
| 4 | 2212.03911v1 | 2022-12-07 | 2022-12-07 | Analysis of Drug repurposing Knowledge graphs ... | | Ajay Kumar Gogineni | | http://export.arxiv.org/pdf/2212.03911v1 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 299 | 2003.13665v1 | 2020-03-30 | 2020-03-30 | Genomics-guided molecular maps of coronavirus ... | | Gennadi Glinsky | | http://export.arxiv.org/pdf/2003.13665v1 |
| 300 | 2003.14258v1 | 2020-03-30 | 2020-03-30 | Nanomechanical sonification of the 2019-nCoV c... | | Markus J. Buehler | | http://export.arxiv.org/pdf/2003.14258v1 |
| 301 | 2003.12454v1 | 2020-03-26 | 2020-03-26 | A Machine Learning alternative to placebo-cont... | | Ezequiel Alvarez, Federico Lamagna, Manuel Szewc | | http://export.arxiv.org/pdf/2003.12454v1 |
| 302 | 2003.04524v1 | 2020-03-10 | 2020-03-10 | Old Drugs for Newly Emerging Viral Disease, CO... | | Mohammad Reza Dayer | | http://export.arxiv.org/pdf/2003.04524v1 |
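
A plain string replacement assumes every URL starts with `http://` and points at arXiv. As an alternative sketch under that caveat, the host can be rewritten with the standard library's URL parser, which leaves the scheme and path alone and ignores URLs on other hosts. The function name `to_export_url` is hypothetical and introduced here only for illustration.

```python
from urllib.parse import urlsplit, urlunsplit


def to_export_url(url: str) -> str:
    """Point an arxiv.org URL at export.arxiv.org; return other URLs unchanged."""
    parts = urlsplit(url)
    if parts.netloc in ("arxiv.org", "www.arxiv.org"):
        # SplitResult is a named tuple, so _replace returns a modified copy
        parts = parts._replace(netloc="export.arxiv.org")
    return urlunsplit(parts)


print(to_export_url("http://arxiv.org/pdf/2109.06377v4"))
print(to_export_url("https://example.org/paper.pdf"))
```

In the notebook this could be applied to the column with `article_results_for_dl['pdf_url'].map(to_export_url)`.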


In [None]:
with open('2023-01-06_arxiv_results_for_dl.pickle', "wb") as f:
    pickle.dump(article_results_for_dl, f)

## 5. Check for missing values

Print a concise summary of the DataFrame to check whether any columns contain missing values.

In [None]:
article_results_for_dl.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 304 entries, 0 to 303
Data columns (total 8 columns):
 #   Column     Non-Null Count  Dtype 
---  ------     --------------  ----- 
 0   arxiv-id   304 non-null    object
 1   published  304 non-null    object
 2   revised    304 non-null    object
 3   title      304 non-null    object
 4   journal    39 non-null     object
 5   authors    304 non-null    object
 6   doi        63 non-null     object
 7   pdf_url    304 non-null    object
dtypes: object(8)
memory usage: 19.1+ KB


Most of the articles are missing a journal reference (265 of 304) and a DOI (241 of 304). In most cases the submission has not yet been published, so the author will add this information at a later date.
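
The missing counts can also be computed directly instead of being read off the `info()` output. The sketch below uses a small toy DataFrame with made-up placeholder rows (not taken from the arXiv results) to show the pattern; the names `toy`, `missing`, `missing_pct`, and `summary` are assumptions for this example.

```python
import pandas as pd

# Toy data for illustration only; these rows are placeholders, not arXiv records
toy = pd.DataFrame({
    'arxiv-id': ['0000.00001v1', '0000.00002v1', '0000.00003v1', '0000.00004v1'],
    'journal': [None, 'Example Journal 1 (2020)', None, None],
    'doi': [None, '10.0000/example-1', None, '10.0000/example-2'],
})

# Number of missing values in each column
missing = toy.isna().sum()
# Share of missing values in each column, as a percentage
missing_pct = (toy.isna().mean() * 100).round(1)

summary = pd.DataFrame({'missing': missing, 'missing_pct': missing_pct})
print(summary)
```

Running the same two lines on `article_results_for_dl` would give the per-column counts for the collected metadata.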

### References

* arXiv free distribution service and open-access archive https://arxiv.org/

* arXiv API Access https://info.arxiv.org/help/api/index.html

* arXiv API User's Manual https://arxiv.org/help/api/user-manual

* Python wrapper for arXiv API https://github.com/lukasschwab/arxiv.py