# Scraping bioRxiv using the API

## Overview
I want to scrape the available information for every paper in bioRxiv using their API.

## API
The API is described here: https://api.biorxiv.org/

bioRxiv was founded in November 2013, so that will be the start date. Eventually I want everything through November 2024.

To start, I'll just collect the papers from bioRxiv's first calendar year, 2013 (November and December).
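The code in this notebook requests URLs of the form `https://api.biorxiv.org/details/biorxiv/{start}/{end}/{cursor}`. As a minimal sketch, the helper below checks the dates locally before building that URL, so a malformed range fails before any request is sent. The name `build_details_url` and the validation rules are illustrative assumptions, not part of the bioRxiv API.

```python
from datetime import datetime

# Same endpoint the notebook uses for its requests
BASE_URL = "https://api.biorxiv.org/details/biorxiv"


def build_details_url(start_date, end_date, cursor=0):
    """Return the details URL for a date range (hypothetical helper)."""
    # strptime raises ValueError for malformed or impossible dates such as 2013-02-30
    start = datetime.strptime(start_date, "%Y-%m-%d")
    end = datetime.strptime(end_date, "%Y-%m-%d")
    if start > end:
        raise ValueError(f"start_date {start_date} is after end_date {end_date}")
    if cursor < 0:
        raise ValueError("cursor must be non-negative")
    return f"{BASE_URL}/{start_date}/{end_date}/{cursor}"


print(build_details_url("2013-11-01", "2013-12-31"))
```

Catching a bad date here is cheaper than discovering it from an unexpected API response partway through a long scrape.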

In [10]:
import requests
import pandas as pd
from datetime import datetime, timedelta
import time
from pathlib import Path
import logging

In [11]:
def fetch_biorxiv_papers(start_date, end_date, cursor=0):
    """
    Fetch papers from bioRxiv API for a given date range.
    
    Args:
        start_date (str): Start date in format YYYY-MM-DD
        end_date (str): End date in format YYYY-MM-DD
        cursor (int): Pagination cursor
        
    Returns:
        dict: API response
    """
    base_url = "https://api.biorxiv.org/details/biorxiv"
    url = f"{base_url}/{start_date}/{end_date}/{cursor}"
    
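    response = requests.get(url, timeout=30)  # Avoid hanging forever on a stalled connection
    response.raise_for_status()  # Fail on HTTP errors instead of trying to parse an error page
    return response.json()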

# Example usage: bioRxiv's first calendar year (November and December 2013)
start_date = "2013-11-01"
end_date = "2013-12-31"

# First request to get initial data and total count
initial_response = fetch_biorxiv_papers(start_date, end_date)
total_papers = int(initial_response['messages'][0]['total'])
print(f"Total papers for this period: {total_papers}")

# Create a list to store all papers
all_papers = []

# Fetch all papers using pagination (100 results per page)
cursor = 0
while cursor < total_papers:
    response = fetch_biorxiv_papers(start_date, end_date, cursor)
    papers = response['collection']
    all_papers.extend(papers)
    cursor += 100
    print(f"Fetched {len(all_papers)} papers so far...")

# Convert to DataFrame
df = pd.DataFrame(all_papers)
print(f"\nFinal dataset shape: {df.shape}")

Total papers for this period: 143
Fetched 100 papers so far...
Fetched 143 papers so far...

Final dataset shape: (143, 14)
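The pagination loop above can also be written as a generator that takes the fetch function as an argument, which makes it possible to exercise the paging logic without touching the network. This is a sketch under assumptions: `iter_papers` and `fake_fetch` are hypothetical names, and the fake response only mimics the two fields the notebook reads (`messages[0]['total']` and `collection`) with placeholder DOIs.

```python
def iter_papers(fetch_page, start_date, end_date):
    """Yield every record in a date range, requesting one page at a time."""
    cursor = 0
    while True:
        response = fetch_page(start_date, end_date, cursor)
        papers = response.get("collection", [])
        if not papers:
            return
        yield from papers
        # Advance by the number of records actually returned
        cursor += len(papers)
        total = int(response["messages"][0]["total"])
        if cursor >= total:
            return


def fake_fetch(start_date, end_date, cursor):
    """Stand-in for the real API: 143 placeholder records, 100 per page."""
    records = [{"doi": f"10.1101/{i:06d}"} for i in range(143)]
    return {
        "messages": [{"total": len(records)}],
        "collection": records[cursor:cursor + 100],
    }


papers = list(iter_papers(fake_fetch, "2013-11-01", "2013-12-31"))
print(len(papers))  # 143
```

Passing `fetch_biorxiv_papers` in place of `fake_fetch` would run the same loop against the live API.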


In [12]:
df.head()

Unnamed: 0,doi,title,authors,author_corresponding,author_corresponding_institution,date,version,type,license,category,jatsxml,abstract,published,server
0,10.1101/000109,Speciation and introgression between Mimulus n...,Yaniv Brandvain;Amanda M Kenney;Lex Fagel;Grah...,Yaniv Brandvain,Department of Evolution and Ecology & Center f...,2013-11-07,1,New Results,cc_by,Evolutionary Biology,https://www.biorxiv.org/content/early/2013/11/...,Mimulus guttatus and M. nasutus are an evoluti...,10.1371/journal.pgen.1004410,biorxiv
1,10.1101/000075,A Scalable Formulation for Engineering Combina...,Vanessa Jonsson;Anders Rantzer;Richard M Murray;,Vanessa Jonsson,Caltech,2013-11-07,1,New Results,cc_by_nc,Evolutionary Biology,https://www.biorxiv.org/content/early/2013/11/...,It has been shown that optimal controller synt...,10.1109/ACC.2014.6859452,biorxiv
2,10.1101/000240,Genome-wide targets of selection: female respo...,Paolo Innocenti;Ilona Flis;Edward H Morrow;,Edward H Morrow,University of Sussex,2013-11-12,1,New Results,cc_by,Evolutionary Biology,https://www.biorxiv.org/content/early/2013/11/...,Despite the common assumption that promiscuity...,,biorxiv
3,10.1101/000208,Population genomics of parallel hybrid zones i...,Nicola Nadeau;Mayte Ruiz;Patricio Salazar;Bria...,Chri Jiggins,Cambridge,2013-11-12,1,New Results,cc_by_nc_nd,Evolutionary Biology,https://www.biorxiv.org/content/early/2013/11/...,Hybrid zones can be valuable tools for studyin...,10.1101/gr.169292.113,biorxiv
4,10.1101/000398,The Origin of Human-infecting Avian Influenza ...,Liangsheng Zhang;Zhenguo Zhang;,Zhenguo Zhang,"Department of Biology, The Pennsylvania State ...",2013-11-14,1,New Results,cc_by_nc_nd,Evolutionary Biology,https://www.biorxiv.org/content/early/2013/11/...,"In this study, we retraced the origin of the r...",,biorxiv
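In the rows above, the `authors` field is a single semicolon-delimited string, sometimes with a trailing semicolon. A minimal sketch of turning it into a list of names follows; `split_authors` is a hypothetical helper, and it assumes semicolons never appear inside a name.

```python
def split_authors(authors):
    """Split a semicolon-delimited author string into a list of clean names."""
    # Missing values (None, or NaN coming from pandas) are not strings
    if not isinstance(authors, str):
        return []
    # Skip empty pieces, such as the one left by a trailing semicolon
    return [name.strip() for name in authors.split(";") if name.strip()]


print(split_authors("Vanessa Jonsson;Anders Rantzer;Richard M Murray;"))
# ['Vanessa Jonsson', 'Anders Rantzer', 'Richard M Murray']
```

With the DataFrame from this notebook, `df["authors"].apply(split_authors)` would give one list per paper.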


Great! Now I'll try to scrape everything.

In [13]:
# Set up logging
logging.basicConfig(level=logging.INFO, format='%(asctime)s - %(message)s')

def fetch_biorxiv_papers(start_date, end_date, cursor=0, max_retries=3):
    """
    Fetch papers from bioRxiv API with retry logic.
    """
    base_url = "https://api.biorxiv.org/details/biorxiv"
    url = f"{base_url}/{start_date}/{end_date}/{cursor}"
    
    for attempt in range(max_retries):
        try:
            response = requests.get(url, timeout=30)  # Avoid hanging forever on a stalled connection
            response.raise_for_status()
            return response.json()
        except Exception as e:
            if attempt == max_retries - 1:
                raise
            logging.warning(f"Attempt {attempt + 1} failed: {e}. Retrying...")
            time.sleep(2 ** attempt)  # Exponential backoff

def get_papers_for_period(start_date, end_date, output_dir):
    """
    Get all papers for a specific date range with periodic saves.
    """
    output_dir = Path(output_dir)
    output_dir.mkdir(exist_ok=True)
    
    all_papers = []
    cursor = 0
    
    # The first request supplies both the total count and the first page
    response = fetch_biorxiv_papers(start_date, end_date, cursor)
    total_papers = int(response['messages'][0]['total'])
    logging.info(f"Found {total_papers} papers for period {start_date} to {end_date}")
    
    while True:
        papers = response['collection']
        
        if not papers:
            break
            
        all_papers.extend(papers)
        cursor += len(papers)  # Advance by what was actually returned (up to 100 per page)
        
        # Log progress
        logging.info(f"Fetched {len(all_papers)}/{total_papers} papers ({(len(all_papers)/total_papers)*100:.1f}%)")
        
        # Save a checkpoint whenever the running count hits a multiple of 1000
        if len(all_papers) % 1000 == 0:
            temp_df = pd.DataFrame(all_papers)
            temp_df.to_csv(output_dir / f"temp_{start_date}_{end_date}_{len(all_papers)}.csv", index=False)
        
        # Stop once everything is collected, without requesting an extra empty page
        if cursor >= total_papers:
            break
        
        time.sleep(1)  # Rate limiting
        response = fetch_biorxiv_papers(start_date, end_date, cursor)
    
    return all_papers

def get_monthly_chunks(start_date, end_date):
    """Generate monthly date ranges between start and end dates."""
    start = datetime.strptime(start_date, "%Y-%m-%d")
    end = datetime.strptime(end_date, "%Y-%m-%d")
    
    current = start
    while current <= end:  # <= so a range ending on the 1st of a month still gets its final chunk
        month_end = min(
            (current + timedelta(days=32)).replace(day=1) - timedelta(days=1),
            end
        )
        yield (
            current.strftime("%Y-%m-%d"),
            month_end.strftime("%Y-%m-%d")
        )
        current = (current + timedelta(days=32)).replace(day=1)

# Main execution
output_dir = Path("biorxiv_data")
start_date = "2013-11-01"
end_date = "2024-11-30"

all_papers = []
failed_chunks = []

# Process month by month
for chunk_start, chunk_end in get_monthly_chunks(start_date, end_date):
    logging.info(f"\nProcessing period: {chunk_start} to {chunk_end}")
    try:
        papers = get_papers_for_period(chunk_start, chunk_end, output_dir)
        all_papers.extend(papers)
        
        # Save after each month
        df = pd.DataFrame(papers)
        df.to_csv(output_dir / f"biorxiv_{chunk_start}_{chunk_end}.csv", index=False)
        
    except Exception as e:
        logging.error(f"Failed to process {chunk_start} to {chunk_end}: {e}")
        failed_chunks.append((chunk_start, chunk_end))

# Save complete dataset
final_df = pd.DataFrame(all_papers)
final_df.to_csv(output_dir / "biorxiv_complete_dataset.csv", index=False)

# Report results
logging.info(f"\nComplete! Collected {len(all_papers)} papers total")
if failed_chunks:
    logging.warning(f"Failed chunks: {failed_chunks}")

# Display some basic statistics
print("\nDataset Overview:")
print(f"Total papers: {len(final_df)}")
print("\nPapers per year:")
final_df['year'] = pd.to_datetime(final_df['date']).dt.year
print(final_df['year'].value_counts().sort_index())

2024-12-09 17:57:53,753 - 
Processing period: 2013-11-01 to 2013-11-30
2024-12-09 17:57:59,953 - Found 73 papers for period 2013-11-01 to 2013-11-30
2024-12-09 17:58:05,066 - Fetched 73/73 papers (100.0%)
2024-12-09 17:58:08,896 - 
Processing period: 2013-12-01 to 2013-12-31
2024-12-09 17:58:13,938 - Found 70 papers for period 2013-12-01 to 2013-12-31
2024-12-09 17:58:19,286 - Fetched 70/70 papers (100.0%)
2024-12-09 17:58:23,039 - 
Processing period: 2014-01-01 to 2014-01-31
2024-12-09 17:58:29,471 - Found 74 papers for period 2014-01-01 to 2014-01-31
2024-12-09 17:58:34,239 - Fetched 74/74 papers (100.0%)
2024-12-09 17:58:38,604 - 
Processing period: 2014-02-01 to 2014-02-28
2024-12-09 17:58:44,987 - Found 87 papers for period 2014-02-01 to 2014-02-28
2024-12-09 17:58:50,601 - Fetched 87/87 papers (100.0%)
2024-12-09 17:58:55,090 - 
Processing period: 2014-03-01 to 2014-03-31
2024-12-09 17:59:00,941 - Found 73 papers for period 2014-03-01 to 2014-03-31
2024-12-09 17:59:06,916 - Fetch
[log output truncated]


Dataset Overview:
Total papers: 357046

Papers per year:
year
2013      143
2014     1258
2015     2549
2016     6622
2017    15964
2018    28564
2019    39879
2020    53046
2021    51244
2022    48580
2023    53949
2024    55248
Name: count, dtype: int64
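The records include a `version` field, which suggests a preprint can appear once per posted version. This notebook does not check that, but if it holds, the totals above count records rather than unique preprints. The sketch below keeps only the highest version for each DOI. `latest_versions` is a hypothetical helper, and the second version in the sample rows is invented purely for illustration.

```python
def latest_versions(records):
    """Keep only the highest-numbered version of each DOI."""
    latest = {}
    for record in records:
        doi = record["doi"]
        # Cast to int so that version "10" sorts after version "2"
        if doi not in latest or int(record["version"]) > int(latest[doi]["version"]):
            latest[doi] = record
    return list(latest.values())


# Illustrative rows only; version "2" here is made up for the example
sample = [
    {"doi": "10.1101/000109", "version": "1"},
    {"doi": "10.1101/000109", "version": "2"},
    {"doi": "10.1101/000075", "version": "1"},
]
unique = latest_versions(sample)
print(len(unique))  # 2
```

On the DataFrame, a comparable result should come from converting `version` to an integer, sorting by it, and calling `drop_duplicates("doi", keep="last")`.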

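Because the main loop writes `biorxiv_{chunk_start}_{chunk_end}.csv` only after a month has been fetched completely, the presence of that file can serve as a marker when re-running after a failure. A minimal sketch follows; `pending_chunks` is a hypothetical helper, and it assumes the monthly files are never written partially.

```python
from pathlib import Path
import tempfile


def pending_chunks(chunks, output_dir):
    """Return the (start, end) chunks that do not have a monthly CSV yet."""
    output_dir = Path(output_dir)
    return [
        (start, end)
        for start, end in chunks
        if not (output_dir / f"biorxiv_{start}_{end}.csv").exists()
    ]


# Demonstration in a temporary directory: November is done, December is not
with tempfile.TemporaryDirectory() as tmp:
    (Path(tmp) / "biorxiv_2013-11-01_2013-11-30.csv").write_text("doi\n")
    chunks = [("2013-11-01", "2013-11-30"), ("2013-12-01", "2013-12-31")]
    remaining = pending_chunks(chunks, tmp)

print(remaining)  # [('2013-12-01', '2013-12-31')]
```

A resumed run would then need to rebuild the complete dataset by reading the monthly files back in, since `all_papers` only holds what was fetched in the current session.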