# Biorxiv Loader

- Author: [frimer](https://github.com/brian604)
- Design:
- Peer Review: 
- This is a part of [LangChain Open Tutorial](https://github.com/LangChain-OpenTutorial/LangChain-OpenTutorial)

[![Open in Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/LangChain-OpenTutorial/LangChain-OpenTutorial/blob/main/06-DocumentLoader/14-medrxivLoader.ipynb) [![Open in GitHub](https://img.shields.io/badge/Open%20in%20GitHub-181717?style=flat-square&logo=github&logoColor=white)](https://github.com/LangChain-OpenTutorial/LangChain-OpenTutorial/blob/main/06-DocumentLoader/14-medrxivLoader.ipynb)

## Overview

This tutorial introduces two more preprint archives for health-related and biological research: **medRxiv** and **bioRxiv**, both of which
are operated by the Cold Spring Harbor Laboratory.

### Table of Contents

- [Overview](#overview)
- [Environment Setup](#environment-setup)
- [Example Queries](#example-queries)

### References

- [medrxivr](https://github.com/ropensci/medrxivr)
    - Access and search medRxiv and bioRxiv
- [Arxiv Langchain](https://python.langchain.com/docs/integrations/providers/arxiv/)
- [medrxiv-langchain](https://github.com/brian604/medrxiv-langchain)
----

## Environment Setup

Set up the environment. You may refer to [Environment Setup](https://wikidocs.net/257836) for more details.

**[Note]**
- `langchain-opentutorial` is a package that provides easy-to-use environment setup tools, helper functions, and utilities for the tutorials.
- You can check out the [`langchain-opentutorial`](https://github.com/LangChain-OpenTutorial/langchain-opentutorial-pypi) repository for more details.

In [1]:
%%capture --no-stderr
%pip install  --upgrade langchain-opentutorial

In [2]:
# Install required packages
from langchain_opentutorial import package

package.install(
    [
        "langsmith",
        "langchain",
        "langchain_core",
        "langchain-anthropic",
        "langchain_community",
        "langchain_text_splitters",
        "langchain_openai",
    ],
    verbose=False,
    upgrade=False,
)


[1m[[0m[34;49mnotice[0m[1;39;49m][0m[39;49m A new release of pip is available: [0m[31;49m24.3.1[0m[39;49m -> [0m[32;49m25.0.1[0m
[1m[[0m[34;49mnotice[0m[1;39;49m][0m[39;49m To update, run: [0m[32;49mpip install --upgrade pip[0m


In [3]:
# Set environment variables
from langchain_opentutorial import set_env

set_env(
    {
        "OPENAI_API_KEY": "",
        "LANGCHAIN_API_KEY": "",
        "LANGCHAIN_TRACING_V2": "true",
        "LANGCHAIN_ENDPOINT": "https://api.smith.langchain.com",
        "LANGCHAIN_PROJECT": "BiorxivLoader",  # Please set it the same as title
    }
)

Environment variables have been set successfully.


You can alternatively set API keys such as `OPENAI_API_KEY` in a `.env` file and load them.

**[Note]** This is not necessary if you've already set the required API keys in previous steps.

In [4]:
# Load API keys from .env file
from dotenv import load_dotenv

load_dotenv(override=True)

True
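
As a quick sanity check after loading the keys, the sketch below reports which required variables are still unset or empty. The helper name `missing_keys` is hypothetical and is not part of `langchain-opentutorial` or `python-dotenv`; it only reads a mapping such as `os.environ`.

```python
import os
from typing import Iterable, List, Mapping


def missing_keys(required: Iterable[str], env: Mapping[str, str] = os.environ) -> List[str]:
    """Return the names in `required` that are unset or empty in `env`."""
    return [name for name in required if not env.get(name)]


# Example with an explicit mapping, so the real environment is not touched
example_env = {"OPENAI_API_KEY": "sk-example"}
print(missing_keys(["OPENAI_API_KEY", "LANGCHAIN_API_KEY"], example_env))
# → ['LANGCHAIN_API_KEY']
```

Calling `missing_keys(["OPENAI_API_KEY"])` without a second argument checks the current process environment instead.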

## Example Queries

In this step, we run a few example queries to check that the bioRxiv loader works as expected, which would make it a candidate for contribution to `langchain_community`.
- One of the tests uses a fixed date range, from 2024-01-01 to 2024-02-17, against the "biorxiv" server (and also "medrxiv").
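
Before the full example, here is a minimal sketch of the two ideas involved: building a date-range URL for the `details` endpoint and filtering the returned papers by keyword. It rests on two assumptions that are not stated in this tutorial: that the last path segment is a cursor (an offset into the result set), and that each call returns only one page of results. The names `build_details_url`, `matches_query`, `iter_matches`, and the injected `fetch_page` callable are hypothetical helpers for illustration. The demonstration uses a fake fetcher, so it runs without network access.

```python
from typing import Callable, Dict, Iterator, List

BASE_URL = "https://api.biorxiv.org/details"  # base URL used in this tutorial


def build_details_url(server: str, start_date: str, end_date: str, cursor: int = 0) -> str:
    """Build the URL for one page of results, starting at offset `cursor` (assumed)."""
    return f"{BASE_URL}/{server}/{start_date}/{end_date}/{cursor}"


def matches_query(paper: Dict, query: str) -> bool:
    """Return True if every ' AND '-separated term occurs in the title or abstract."""
    terms = [t.strip().lower() for t in query.replace('"', "").split(" AND ") if t.strip()]
    text = f"{paper.get('title', '')} {paper.get('abstract', '')}".lower()
    return all(term in text for term in terms)


def iter_matches(
    fetch_page: Callable[[str], Dict],
    server: str,
    start_date: str,
    end_date: str,
    query: str,
    max_pages: int = 3,
) -> Iterator[Dict]:
    """Walk up to `max_pages` pages and yield the papers that match `query`."""
    cursor = 0
    for _ in range(max_pages):
        data = fetch_page(build_details_url(server, start_date, end_date, cursor))
        collection = data.get("collection", [])
        if not collection:
            break  # no more results
        for paper in collection:
            if matches_query(paper, query):
                yield paper
        cursor += len(collection)  # assumption: the cursor is a result offset


# Offline demonstration: fake pages keyed by cursor value
fake_pages: Dict[int, List[Dict]] = {
    0: [
        {"title": "CRISPR screens in yeast", "abstract": "Genome-wide knockout library."},
        {"title": "Protein folding", "abstract": "A machine learning approach."},
    ],
    2: [{"title": "Base editing", "abstract": "CRISPR base editors in vivo."}],
}


def fake_fetch(url: str) -> Dict:
    """Stand-in for an HTTP call; reads the cursor from the end of the URL."""
    cursor = int(url.rsplit("/", 1)[1])
    return {"collection": fake_pages.get(cursor, [])}


hits = list(iter_matches(fake_fetch, "biorxiv", "2024-01-01", "2024-02-17", "CRISPR"))
print([p["title"] for p in hits])
# → ['CRISPR screens in yeast', 'Base editing']
```

For a live query, `fetch_page` could be something like `lambda url: requests.get(url, timeout=30).json()`. Passing the fetcher in as an argument keeps the paging and filtering logic testable without hitting the API.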

In [6]:
import requests
from datetime import datetime, timedelta
from typing import List, Dict, Optional

class SimpleBioRxivSearch:
    def __init__(self):
        self.base_url = "https://api.biorxiv.org/details"
    
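    def search(
        self,
        query: str,
        server: Optional[List[str]] = None,
        start_date: Optional[str] = None,
        end_date: Optional[str] = None,
        max_results: int = 5,
    ) -> List[Dict]:
        """
        Search bioRxiv and/or medRxiv papers.

        Terms joined by " AND " must all appear in the title or abstract.
        Dates use the YYYY-MM-DD format.
        """
        # Default to bioRxiv; using None avoids a mutable default argument
        server = server or ["biorxiv"]
        results = []

        for srv in server:
            # Construct the API URL; the trailing 0 is the cursor, so only the
            # first page of results is requested
            if start_date and end_date:
                url = f"{self.base_url}/{srv}/{start_date}/{end_date}/0"
            else:
                url = f"{self.base_url}/{srv}/2000-01-01/{datetime.now().strftime('%Y-%m-%d')}/0"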
            
            try:
                response = requests.get(url, timeout=30)  # Avoid hanging on a slow response
                response.raise_for_status()  # Raise an exception for bad status codes
                data = response.json()
                
                # Print API response for debugging
                print(f"\nSearching {srv} with URL: {url}")
                print(f"Total results from API: {data.get('messages', [{}])[0].get('total', 0)}")
                
                # Filter results based on query terms
                query_terms = [term.lower() for term in query.replace('"', '').split(' AND ')]
                
                filtered_results = []
                for paper in data.get('collection', []):
                    text_to_search = (paper.get('title', '') + ' ' + paper.get('abstract', '')).lower()
                    if all(term in text_to_search for term in query_terms):
                        filtered_results.append({
                            'title': paper.get('title', ''),
                            'abstract': paper.get('abstract', ''),
                            'doi': paper.get('doi', ''),
                            'category': paper.get('category', 'Unknown'),
                            'server': srv,
                            'date': paper.get('date', '')
                        })
                
                print(f"Filtered results for query '{query}': {len(filtered_results)}")
                results.extend(filtered_results[:max_results])
                
            except requests.exceptions.RequestException as e:
                print(f"Error accessing {srv} API: {str(e)}")
                continue
        
        return results[:max_results]

# Usage example:
searcher = SimpleBioRxivSearch()

# 1. Simple keyword search
print("\nTesting simple keyword search...")
docs1 = searcher.search(
    query="machine learning",  # Simplified query
    server=["biorxiv"],
    max_results=5
)

# 2. Keyword search with date range
print("\nTesting keyword search with date range...")
docs2 = searcher.search(
    query="CRISPR",  # Simplified query
    server=["biorxiv", "medrxiv"],
    start_date="2024-01-01",
    end_date="2024-02-17",
    max_results=5
)

# 3. Last 30 days search
print("\nTesting last 30 days search...")
end_date = datetime.now()
start_date = end_date - timedelta(days=30)
docs3 = searcher.search(
    query="sequencing",  # Simplified query
    server=["medrxiv"],
    start_date=start_date.strftime("%Y-%m-%d"),
    end_date=end_date.strftime("%Y-%m-%d"),
    max_results=5
)

# Print summary statistics
all_docs = docs1 + docs2 + docs3
categories = {}
servers = {"biorxiv": 0, "medrxiv": 0}

for doc in all_docs:
    # Count by category
    cat = doc['category']
    categories[cat] = categories.get(cat, 0) + 1
    
    # Count by server
    server = doc['server']
    servers[server] += 1

print("\nSummary Statistics")
print("-----------------")
print(f"Total unique papers: {len(all_docs)}")
print("\nPapers by Category:")
for cat, count in sorted(categories.items(), key=lambda x: x[1], reverse=True):
    print(f"{cat}: {count}")

print("\nPapers by Server:")
for server, count in servers.items():
    print(f"{server}: {count}")

# Print sample of results
print("\nSample of Retrieved Papers:")
for i, doc in enumerate(all_docs[:3], 1):
    print(f"\n{i}. {doc['title']}")
    print(f"Server: {doc['server']}")
    print(f"Date: {doc['date']}")
    print(f"DOI: {doc['doi']}")


Testing simple keyword search...

Searching biorxiv with URL: https://api.biorxiv.org/details/biorxiv/2000-01-01/2025-02-19/0
Total results from API: 370698
Filtered results for query 'machine learning': 0

Testing keyword search with date range...

Searching biorxiv with URL: https://api.biorxiv.org/details/biorxiv/2024-01-01/2024-02-17/0
Total results from API: 7235
Filtered results for query 'CRISPR': 2

Searching medrxiv with URL: https://api.biorxiv.org/details/medrxiv/2024-01-01/2024-02-17/0
Total results from API: 1792
Filtered results for query 'CRISPR': 1

Testing last 30 days search...

Searching medrxiv with URL: https://api.biorxiv.org/details/medrxiv/2025-01-20/2025-02-19/0
Total results from API: 1306
Filtered results for query 'sequencing': 8

Summary Statistics
-----------------
Total unique papers: 8

Papers by Category:
genetic and genomic medicine: 4
developmental biology: 1
cell biology: 1
infectious diseases: 1
rheumatology: 1

Papers by Server:
biorxiv: 2
medrxiv