# Scraping bioRxiv using the API

## Overview
I want to scrape the available information for every paper in bioRxiv using their API.

## API
The API is described here: https://api.biorxiv.org/

bioRxiv was founded in November 2013, so that will be the start date. The initial goal is to collect everything through November 2024.

As a first step, I'll collect only the papers from bioRxiv's first calendar year (November to December 2013).
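
Covering the full span from November 2013 through November 2024 will probably mean issuing many smaller date-range requests rather than one large one. The sketch below shows one way to split that span into calendar-month windows. The helper name `month_windows` is hypothetical, and monthly granularity is an assumption rather than something the API requires. The URL pattern is the same `details/biorxiv/{start}/{end}/{cursor}` form used in the request code in this notebook.

```python
from datetime import date, timedelta


def month_windows(start, end):
    """Yield (first_day, last_day) ISO strings for each calendar month in [start, end]."""
    current = date(start.year, start.month, 1)
    while current <= end:
        # Work out the first day of the following month
        if current.month == 12:
            next_month = date(current.year + 1, 1, 1)
        else:
            next_month = date(current.year, current.month + 1, 1)
        # Clip the window so it never extends outside [start, end]
        first_day = max(current, start)
        last_day = min(next_month - timedelta(days=1), end)
        yield first_day.isoformat(), last_day.isoformat()
        current = next_month


windows = list(month_windows(date(2013, 11, 1), date(2024, 11, 30)))
first_start, first_end = windows[0]

# URL for the first page (cursor 0) of the first window
first_url = f"https://api.biorxiv.org/details/biorxiv/{first_start}/{first_end}/0"
print(len(windows), first_url)
```

Each `(start, end)` pair can then be passed to the fetch function as its `start_date` and `end_date` arguments.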

In [8]:
import requests
import pandas as pd
from datetime import datetime

def fetch_biorxiv_papers(start_date, end_date, cursor=0):
    """
    Fetch papers from bioRxiv API for a given date range.
    
    Args:
        start_date (str): Start date in format YYYY-MM-DD
        end_date (str): End date in format YYYY-MM-DD
        cursor (int): Pagination cursor
        
    Returns:
        dict: API response
    """
    base_url = "https://api.biorxiv.org/details/biorxiv"
    url = f"{base_url}/{start_date}/{end_date}/{cursor}"
    
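    # A timeout keeps a stalled connection from hanging the notebook
    response = requests.get(url, timeout=30)
    # Raise on HTTP errors instead of trying to parse an error page as JSON
    response.raise_for_status()
    return response.json()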

# Example usage: bioRxiv's first calendar year (November-December 2013)
start_date = "2013-11-01"
end_date = "2013-12-31"

# First request to get initial data and total count
initial_response = fetch_biorxiv_papers(start_date, end_date)
total_papers = int(initial_response['messages'][0]['total'])
print(f"Total papers for this period: {total_papers}")

# Create a list to store all papers
all_papers = []

# Fetch all papers using pagination (100 results per page)
cursor = 0
while cursor < total_papers:
    response = fetch_biorxiv_papers(start_date, end_date, cursor)
    papers = response['collection']
    all_papers.extend(papers)
    cursor += 100
    print(f"Fetched {len(all_papers)} papers so far...")

# Convert to DataFrame
df = pd.DataFrame(all_papers)
print(f"\nFinal dataset shape: {df.shape}")

Total papers for this period: 143
Fetched 100 papers so far...
Fetched 143 papers so far...

Final dataset shape: (143, 14)
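
When scaling this loop up to longer date ranges, it may help to wrap the pagination in a reusable function. The following is a minimal sketch, not a definitive implementation. `fetch_all_papers` and `fake_fetch_page` are hypothetical names, and the optional delay between requests is an assumption about polite usage rather than a documented API requirement. The page fetcher is passed in as an argument so the logic can be exercised offline. It advances the cursor by the number of records actually returned and stops on an empty page.

```python
import time


def fetch_all_papers(fetch_page, start_date, end_date, delay_seconds=0.0):
    """Collect every record in a date range by following the cursor.

    fetch_page is any callable of the form
    fetch_page(start_date, end_date, cursor) -> dict
    that returns the same response structure as the API call above.
    """
    all_papers = []
    cursor = 0
    while True:
        response = fetch_page(start_date, end_date, cursor)
        papers = response.get("collection", [])
        if not papers:
            break  # Empty page: nothing more to fetch
        all_papers.extend(papers)
        cursor += len(papers)  # Advance by what was actually returned
        total = int(response["messages"][0]["total"])
        if cursor >= total:
            break
        time.sleep(delay_seconds)  # Optional pause between requests
    return all_papers


# Offline stand-in for the API (hypothetical) so the sketch runs without network access
def fake_fetch_page(start_date, end_date, cursor, total=143, page_size=100):
    records = [{"doi": f"10.1101/{i:06d}"} for i in range(total)]
    return {
        "messages": [{"total": total}],
        "collection": records[cursor:cursor + page_size],
    }


papers = fetch_all_papers(fake_fetch_page, "2013-11-01", "2013-12-31")
print(len(papers))
```

To run it against the live API, pass the real fetch function in place of `fake_fetch_page`.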


In [9]:
df.head()

| | doi | title | authors | author_corresponding | author_corresponding_institution | date | version | type | license | category | jatsxml | abstract | published | server |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 10.1101/000109 | Speciation and introgression between Mimulus n... | Yaniv Brandvain;Amanda M Kenney;Lex Fagel;Grah... | Yaniv Brandvain | Department of Evolution and Ecology & Center f... | 2013-11-07 | 1 | New Results | cc_by | Evolutionary Biology | https://www.biorxiv.org/content/early/2013/11/... | Mimulus guttatus and M. nasutus are an evoluti... | 10.1371/journal.pgen.1004410 | biorxiv |
| 1 | 10.1101/000075 | A Scalable Formulation for Engineering Combina... | Vanessa Jonsson;Anders Rantzer;Richard M Murray; | Vanessa Jonsson | Caltech | 2013-11-07 | 1 | New Results | cc_by_nc | Evolutionary Biology | https://www.biorxiv.org/content/early/2013/11/... | It has been shown that optimal controller synt... | 10.1109/ACC.2014.6859452 | biorxiv |
| 2 | 10.1101/000240 | Genome-wide targets of selection: female respo... | Paolo Innocenti;Ilona Flis;Edward H Morrow; | Edward H Morrow | University of Sussex | 2013-11-12 | 1 | New Results | cc_by | Evolutionary Biology | https://www.biorxiv.org/content/early/2013/11/... | Despite the common assumption that promiscuity... | | biorxiv |
| 3 | 10.1101/000208 | Population genomics of parallel hybrid zones i... | Nicola Nadeau;Mayte Ruiz;Patricio Salazar;Bria... | Chri Jiggins | Cambridge | 2013-11-12 | 1 | New Results | cc_by_nc_nd | Evolutionary Biology | https://www.biorxiv.org/content/early/2013/11/... | Hybrid zones can be valuable tools for studyin... | 10.1101/gr.169292.113 | biorxiv |
| 4 | 10.1101/000398 | The Origin of Human-infecting Avian Influenza ... | Liangsheng Zhang;Zhenguo Zhang; | Zhenguo Zhang | Department of Biology, The Pennsylvania State ... | 2013-11-14 | 1 | New Results | cc_by_nc_nd | Evolutionary Biology | https://www.biorxiv.org/content/early/2013/11/... | In this study, we retraced the origin of the r... | | biorxiv |
