# Biorxiv Loader

- Author: [frimer](https://github.com/brian604)
- Design:
- Peer Review: 
- This is a part of [LangChain Open Tutorial](https://github.com/LangChain-OpenTutorial/LangChain-OpenTutorial)

[![Open in Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/LangChain-OpenTutorial/LangChain-OpenTutorial/blob/main/06-DocumentLoader/14-medrxivLoader.ipynb) [![Open in GitHub](https://img.shields.io/badge/Open%20in%20GitHub-181717?style=flat-square&logo=github&logoColor=white)](https://github.com/LangChain-OpenTutorial/LangChain-OpenTutorial/blob/main/06-DocumentLoader/14-medrxivLoader.ipynb)

## Overview

This tutorial introduces two more archives of biological and health-related research: **bioRxiv** and **medRxiv**, both of which
are operated by the Cold Spring Harbor Laboratory.

### Table of Contents

- [Overview](#overview)
- [Environment Setup](#environment-setup)
- [Example Queries](#example-queries)

### References

- [medrxivr](https://github.com/ropensci/medrxivr)
    - Access and search medRxiv and bioRxiv
- [Arxiv Langchain](https://python.langchain.com/docs/integrations/providers/arxiv/)
- [medrxiv-langchain](https://github.com/brian604/medrxiv-langchain)
----

## Environment Setup

Set up the environment. You may refer to [Environment Setup](https://wikidocs.net/257836) for more details.

**[Note]**
- `langchain-opentutorial` is a package that provides easy-to-use environment setup, helper functions, and utilities for the tutorials.
- See the [`langchain-opentutorial`](https://github.com/LangChain-OpenTutorial/langchain-opentutorial-pypi) repository for more details.

In [1]:
%%capture --no-stderr
%pip install  --upgrade langchain-opentutorial

In [2]:
# Install required packages
from langchain_opentutorial import package

package.install(
    [
        "langsmith",
        "langchain",
        "langchain_core",
        "langchain-anthropic",
        "langchain_community",
        "langchain_text_splitters",
        "langchain_openai",
    ],
    verbose=False,
    upgrade=False,
)


[1m[[0m[34;49mnotice[0m[1;39;49m][0m[39;49m A new release of pip is available: [0m[31;49m24.3.1[0m[39;49m -> [0m[32;49m25.0.1[0m
[1m[[0m[34;49mnotice[0m[1;39;49m][0m[39;49m To update, run: [0m[32;49mpip install --upgrade pip[0m


In [3]:
# Set environment variables
from langchain_opentutorial import set_env

set_env(
    {
        "OPENAI_API_KEY": "",
        "LANGCHAIN_API_KEY": "",
        "LANGCHAIN_TRACING_V2": "true",
        "LANGCHAIN_ENDPOINT": "https://api.smith.langchain.com",
        "LANGCHAIN_PROJECT": "BiorxivLoader",  # Please set it the same as title
    }
)

Environment variables have been set successfully.


You can alternatively set API keys such as `OPENAI_API_KEY` in a `.env` file and load them.

**[Note]** This is not necessary if you've already set the required API keys in previous steps.

In [4]:
# Load API keys from .env file
from dotenv import load_dotenv

load_dotenv(override=True)

True
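
Whichever way the keys are provided, it can help to confirm they are actually present before running the later cells. The following is a minimal sketch using only the standard library; the helper name `missing_keys` is hypothetical, and the list of required names is an assumption based on the keys set in the earlier cell.

```python
import os

# Keys set in the earlier cell of this tutorial; adjust the list to your setup.
REQUIRED_KEYS = ["OPENAI_API_KEY", "LANGCHAIN_API_KEY"]


def missing_keys(required, environ=None):
    """Return the names in `required` that are unset or empty in `environ`."""
    env = os.environ if environ is None else environ
    return [name for name in required if not env.get(name)]


missing = missing_keys(REQUIRED_KEYS)
if missing:
    print(f"Missing or empty: {', '.join(missing)}")
else:
    print("All required keys are set.")
```

Passing a plain dictionary as `environ` makes the check easy to try without touching the real environment.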

## Example Queries

In this step, we run a few example queries to check that the bioRxiv loader works as expected and is therefore a potential contribution to `langchain_community`.
- The first example queries the `biorxiv` server for the date range 2024-01-01 to 2024-02-17.

In [5]:
from medrxiv_langchain import QueryBuilder, BioRxivLoader

# Test date range query
print("Testing date range query...")
query_builder = (QueryBuilder()
                .date_range("2024-01-01", "2024-02-17")
                .from_servers(["biorxiv"]))

loader = BioRxivLoader(query_builder=query_builder, max_results=5)
docs = loader.load()

print(f"\nFound {len(docs)} documents")
for doc in docs[:3]:  # Show first 3 papers
    print(f"\nTitle: {doc.metadata['title']}")
    print(f"Date: {doc.metadata['date']}")
    print(f"Category: {doc.metadata['category']}")
    print("-" * 80)

Testing date range query...

Found 5 documents

Title: Convergent mutations and single nucleotide variants in mitochondrial genomes of modern humans and Neanderthals
Date: 2024-02-07
Category: genomics
--------------------------------------------------------------------------------

Title: IDENTIFICATION OF AN EARLY SUBSET OF CEREBELLAR NUCLEI NEURONS IN MICE
Date: 2024-01-25
Category: developmental biology
--------------------------------------------------------------------------------

Title: Coherent olfactory bulb gamma oscillations arise from coupling independent columnar oscillators
Date: 2024-01-13
Category: neuroscience
--------------------------------------------------------------------------------
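
The date range in the query above is hard-coded. As a rough sketch, the start and end strings for a rolling window could be computed with the standard library instead. The helper `date_window` is hypothetical and not part of `medrxiv_langchain`; it assumes that `date_range` accepts `YYYY-MM-DD` strings in the same form as the call above.

```python
from datetime import date, timedelta


def date_window(days, end=None):
    """Return (start, end) as ISO 'YYYY-MM-DD' strings spanning `days` days back from `end`."""
    if days < 0:
        raise ValueError("days must be non-negative")
    end_date = date.today() if end is None else end
    start_date = end_date - timedelta(days=days)
    return start_date.isoformat(), end_date.isoformat()


# Reproduce the fixed range used in this tutorial: 47 days back from 2024-02-17.
start, end = date_window(47, end=date(2024, 2, 17))
print(start, end)  # 2024-01-01 2024-02-17
```

The resulting pair could then be passed along as `QueryBuilder().date_range(start, end)`, under the same assumption about the accepted string format.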


In [6]:
# 1. Simple keyword search
print("\nTesting simple keyword search...")
query_builder1 = (QueryBuilder()
                 .from_servers(["biorxiv"])
                 .build())
loader1 = BioRxivLoader(
    query_builder=query_builder1,
    query="machine learning AND genomics",
    max_results=5
)
docs1 = loader1.load()

# 2. Keyword search with date range
print("\nTesting keyword search with date range...")
query_builder2 = (QueryBuilder()
                 .date_range("2024-01-01", "2024-02-17")
                 .from_servers(["biorxiv", "medrxiv"])
                 .build())
loader2 = BioRxivLoader(
    query_builder=query_builder2,
    query="CRISPR AND cancer NOT screening",
    max_results=5
)
docs2 = loader2.load()

# 3. Exact phrase search
print("\nTesting exact phrase search...")
query_builder3 = (QueryBuilder()
                 .last_days(30)
                 .from_servers(["medrxiv"])
                 .build())
loader3 = BioRxivLoader(
    query_builder=query_builder3,
    query='"single cell sequencing"',
    max_results=5
)
docs3 = loader3.load()

# Print summary statistics
all_docs = docs1 + docs2 + docs3
categories = {}
servers = {"biorxiv": 0, "medrxiv": 0}

for doc in all_docs:
    # Count by category
    cat = doc.metadata['category']
    categories[cat] = categories.get(cat, 0) + 1
    
    # Count by server (tolerates server names not pre-seeded above)
    server = doc.metadata['server']
    servers[server] = servers.get(server, 0) + 1

print("\nSummary Statistics")
print("-----------------")
print(f"Total unique papers: {len(all_docs)}")
print("\nPapers by Category:")
for cat, count in sorted(categories.items(), key=lambda x: x[1], reverse=True):
    print(f"{cat}: {count}")

print("\nPapers by Server:")
for server, count in servers.items():
    print(f"{server}: {count}")


Testing simple keyword search...

Testing keyword search with date range...

Testing exact phrase search...

Summary Statistics
-----------------
Total unique papers: 10

Papers by Category:
neuroscience: 1
plant biology: 1
biochemistry: 1
bioinformatics: 1
cancer biology: 1
health informatics: 1
epidemiology: 1
genomics: 1
health economics: 1
physiology: 1

Papers by Server:
biorxiv: 7
medrxiv: 3
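
Note that the summary cell concatenates the three result lists, so a paper returned by more than one query would be counted once per query despite the "Total unique papers" label. The sketch below shows one possible way to deduplicate before counting. It keys on the normalized `title` field, which is an assumption made for illustration; a stable identifier such as a DOI would be a better key if the loader's metadata exposes one, which this tutorial does not show. The `SimpleNamespace` objects are stand-ins with the same `.metadata` shape as the loaded documents, and `dedupe_by_title` is a hypothetical helper.

```python
from collections import Counter
from types import SimpleNamespace


def dedupe_by_title(docs):
    """Keep the first document for each case-insensitive, whitespace-normalized title."""
    seen = set()
    unique = []
    for doc in docs:
        key = " ".join(doc.metadata["title"].lower().split())
        if key not in seen:
            seen.add(key)
            unique.append(doc)
    return unique


# Stand-in data: the first two entries differ only in case and spacing.
sample = [
    SimpleNamespace(metadata={"title": "Paper A", "category": "genomics", "server": "biorxiv"}),
    SimpleNamespace(metadata={"title": "paper  a", "category": "genomics", "server": "biorxiv"}),
    SimpleNamespace(metadata={"title": "Paper B", "category": "epidemiology", "server": "medrxiv"}),
]

unique_docs = dedupe_by_title(sample)
by_server = Counter(doc.metadata["server"] for doc in unique_docs)
by_category = Counter(doc.metadata["category"] for doc in unique_docs)

print(f"Total unique papers: {len(unique_docs)}")
for name, count in by_category.most_common():
    print(f"{name}: {count}")
for name, count in by_server.most_common():
    print(f"{name}: {count}")
```

`Counter` also removes the need to pre-seed the dictionary of servers, since missing keys start at zero.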
