# Building a Research Assistant with Retrieval Augmented Generation (RAG)

This notebook demonstrates how to build a research assistant that can:
1. Search and retrieve papers from PubMed
2. Extract and process their content
3. Use modern language models to answer questions based on the retrieved papers

## Key Components:
- **Paper Retrieval**: Using PubMed scraper
- **Text Embedding**: Using BGE embeddings (a strong general-purpose English embedding model for retrieval)
- **Vector Storage**: Using FAISS for efficient similarity search
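- **Language Model**: Using Mistral-7B-Instruct with 4-bit quantization (fits on a single GPU; `device_map="auto"` spreads it across more GPUs when they are available)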
- **RAG Pipeline**: Combining all components for intelligent question answering

## Setup Requirements
First, let's install the necessary packages:

In [1]:
# Install PyTorch built against CUDA 12.1 (a large download because of the bundled GPU libraries)
!pip install torch --index-url https://download.pytorch.org/whl/cu121

# Install the ML libraries used below (large packages with many dependencies):
# - transformers: Hugging Face model library
# - accelerate: device placement for device_map="auto"
# - bitsandbytes: 4-bit model quantization
# - faiss-gpu: GPU-accelerated similarity search
# - sentence-transformers: text embeddings
# - einops: tensor operations
!pip install transformers accelerate bitsandbytes faiss-gpu sentence-transformers einops
# HTML/XML parsing and PDF text extraction
!pip install beautifulsoup4 pdfplumber lxml

## 1. Import Dependencies

In [2]:
import torch  # Deep learning framework for GPU-accelerated tensor operations
from transformers import AutoTokenizer, AutoModelForCausalLM, BitsAndBytesConfig  # Hugging Face tools for loading and configuring language models
from sentence_transformers import SentenceTransformer  # For generating text embeddings/vectors
import faiss  # Fast similarity search and clustering of dense vectors
import numpy as np  # Numerical computing library for array operations
from typing import List, Dict  # Type hints for better code documentation
from scrape import PaperScraper  # Our existing scraper for retrieving research papers


# Check GPU availability
print(f"GPU Available: {torch.cuda.is_available()}")
print(f"Number of GPUs: {torch.cuda.device_count()}")

GPU Available: True
Number of GPUs: 1


## 2. Initialize Models

We'll use:
- BGE-Large for embeddings (a general-purpose English embedding model that performs well on retrieval tasks)
- Mistral-7B as our base LLM (excellent performance/resource ratio)

In [3]:
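# Log in to the Hugging Face Hub to access models.
# Never hard-code an access token in a notebook; read it from the environment instead.
import os
from huggingface_hub import login

login(token=os.environ["HF_TOKEN"])  # Set HF_TOKEN in your environment before starting the notebook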


In [4]:
# Initialize embedding model for generating text embeddings
embedding_model = SentenceTransformer('BAAI/bge-large-en-v1.5')  # Load the BGE large model for high quality embeddings
embedding_model.to('cuda')  # Move embedding model to GPU for faster inference

# Configure quantization for efficient GPU usage
bnb_config = BitsAndBytesConfig(  # Configure 4-bit quantization settings
    load_in_4bit=True,  # Enable 4-bit quantization for reduced memory usage
    bnb_4bit_quant_type="nf4",  # Use normalized float4 quantization for better accuracy
    bnb_4bit_compute_dtype=torch.float16,  # Use float16 for compute to balance speed and precision
    bnb_4bit_use_double_quant=True  # Enable double quantization for additional memory savings
)

# Initialize LLM and tokenizer
model_name = "mistralai/Mistral-7B-Instruct-v0.1"  # Specify the Mistral model to use
tokenizer = AutoTokenizer.from_pretrained(model_name)  # Load tokenizer for converting text to tokens
model = AutoModelForCausalLM.from_pretrained(  # Load the language model
    model_name,  # Use the specified Mistral model
    quantization_config=bnb_config,  # Apply the quantization settings
    torch_dtype=torch.float16,  # Use float16 for the layers that are not quantized
    device_map="auto",  # Automatically distribute model across available GPUs
)

INFO:sentence_transformers.SentenceTransformer:Use pytorch device_name: cuda
INFO:sentence_transformers.SentenceTransformer:Load pretrained SentenceTransformer: BAAI/bge-large-en-v1.5


Downloading shards: 100%|██████████| 2/2 [01:09<00:00, 34.73s/it]
INFO:accelerate.utils.modeling:We will use 90% of the memory on device 0 for storing the model, and 10% for the buffer to avoid OOM. You can set `max_memory` in to a higher value to use more memory (at your own risk).
Loading checkpoint shards: 100%|██████████| 2/2 [00:08<00:00,  4.27s/it]
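
A rough back-of-the-envelope estimate shows why 4-bit quantization matters for a single-GPU setup. The sketch below assumes about 7.2 billion parameters for Mistral-7B (an approximation) and counts weights only; activations, the KV cache, and quantization metadata are ignored, so the real footprint will be somewhat higher. The helper name `weight_memory_gib` is hypothetical.

```python
def weight_memory_gib(num_params: float, bits_per_param: float) -> float:
    """Approximate memory needed for the model weights alone, in GiB."""
    return num_params * bits_per_param / 8 / 1024**3

# Assumption: roughly 7.2e9 parameters; treat this as an approximation.
NUM_PARAMS = 7.2e9

for label, bits in [("float16", 16), ("4-bit (NF4)", 4)]:
    print(f"{label:>12}: ~{weight_memory_gib(NUM_PARAMS, bits):.1f} GiB")
```

Under these assumptions the 4-bit weights need about a quarter of the float16 footprint, which is what leaves headroom for the embedding model and the generation cache on the same device.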


## 3. Research Assistant Class

This class combines paper retrieval, embedding, and question answering capabilities:

In [9]:
class ResearchAssistant:
    def __init__(self):
        # Initialize components for paper processing and analysis
        self.scraper = PaperScraper()  # For retrieving papers from PubMed
        self.embedding_model = embedding_model  # For generating text embeddings
        self.tokenizer = tokenizer  # For tokenizing text for the LLM
        self.model = model  # The language model for answering questions
        self.paper_texts = []  # Store the full text of processed papers
        self.paper_metadata = []  # Store metadata (title, authors, etc) for papers
        self.index = None  # Will hold the FAISS similarity search index
        
    def search_papers(self, query: str, max_results: int = 10):
        """Search and download papers for the given query"""
        print(f"Searching for papers about: {query}")
        
        # Get paper IDs from PubMed search
        pmids = self.scraper.search_pubmed(query, max_results)
        # Fetch detailed information for each paper
        papers = self.scraper.fetch_pubmed_details(pmids)
        
        # Process each paper found
        for paper in papers:
            if pdf_url := paper.get('full_text_link'):  # Check if full text PDF is available
                try:
                    # Download PDF to temporary file
                    pdf_path = self.scraper.download_pdf(
                        pdf_url, 
                        f"temp_{paper['pubmed_id']}.pdf"
                    )
                    # Extract plain text from PDF
                    text = self.scraper.extract_text_from_pdf(pdf_path)
                    
                    # Save paper content and metadata
                    self.paper_texts.append(text)
                    self.paper_metadata.append(paper)
                    
                except Exception as e:
                    print(f"Error processing paper {paper['pubmed_id']}: {e}")
                    
        self._build_index()  # Create search index from processed papers
        print(f"Successfully processed {len(self.paper_texts)} papers")
    
    def _build_index(self):
        """Create FAISS index from paper embeddings"""
        # Initialize lists for storing chunks and their metadata
        self.chunks = []
        self.chunk_metadata = []
        
        # Process each paper into chunks
        for text, metadata in zip(self.paper_texts, self.paper_metadata):
            # Split text into paragraphs
            paragraphs = text.split('\n\n')
            # Group paragraphs into non-overlapping chunks of 2
            for i in range(0, len(paragraphs), 2):
                chunk = ' '.join(paragraphs[i:i+2])
                if len(chunk.split()) > 30:  # Only keep chunks with sufficient content
                    self.chunks.append(chunk)
                    self.chunk_metadata.append(metadata)
        
        # Nothing to index if no paper yielded a usable chunk
        if not self.chunks:
            self.index = None
            return

        # Generate embeddings for all chunks
        embeddings = self.embedding_model.encode(
            self.chunks,
            batch_size=4,  # Process 4 chunks at a time
            show_progress_bar=True,
            convert_to_numpy=True  # Convert to numpy for FAISS compatibility
        )
        
        # Initialize and populate FAISS index
        dimension = embeddings.shape[1]  # Get embedding dimension
        self.index = faiss.IndexFlatL2(dimension)  # Exact (brute-force) L2 distance index
        self.index.add(embeddings.astype(np.float32))  # FAISS expects float32 vectors
    
    def answer_question(self, question: str, k: int = 5):
        """Answer a question using RAG"""
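        # Bail out early if no papers have been indexed
        if self.index is None:
            return "No papers have been indexed yet. Run search_papers() first."
        
        # Convert question to a (1, dimension) embedding array
        q_embedding = self.embedding_model.encode([question], convert_to_numpy=True)
        
        # Find the k most similar chunks (never ask for more than the index holds)
        k = min(k, self.index.ntotal)
        distances, indices = self.index.search(q_embedding.astype(np.float32), k)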
        
        # Build context from relevant chunks - LIMIT TOTAL LENGTH
        context = ""
        used_papers = set()
        total_tokens = 0
        max_tokens = 1536  # Context budget; leaves room for the instructions and question within the 2048-token input limit
        
        for idx in indices[0]:
            chunk = self.chunks[idx]
            metadata = self.chunk_metadata[idx]
            paper_id = metadata['pubmed_id']
            
            # Only include first chunk from each paper and check token length
            if paper_id not in used_papers:
                chunk_tokens = len(self.tokenizer.encode(chunk))
                if total_tokens + chunk_tokens > max_tokens:
                    break
                    
                used_papers.add(paper_id)
                context += f"\nFrom paper '{metadata['title']}':\n{chunk}\n"
                total_tokens += chunk_tokens
        
        # Construct shorter prompt
        prompt = f"""Answer based on these excerpts. Include citations.

        Excerpts: {context}

        Question: {question}

        Answer: """
            
        # Generate answer using LLM with controlled length
        inputs = self.tokenizer(
            prompt, 
            return_tensors="pt",
            truncation=True,
            max_length=2048  # Hard limit on input length
        ).to(self.model.device)  # Follow the device chosen by device_map="auto"
        
        outputs = self.model.generate(
            **inputs,
            max_new_tokens=256,  # Limit response length
            temperature=0.7,  # Add some randomness to generation
            num_return_sequences=1,  # Generate one response
            do_sample=True,  # Use sampling instead of greedy decoding
        )
        
        # Decode only the newly generated tokens, skipping the prompt
        prompt_length = inputs["input_ids"].shape[1]
        answer = self.tokenizer.decode(outputs[0][prompt_length:], skip_special_tokens=True)
        return answer.strip()
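
The chunking step inside `_build_index` can be isolated into a small helper, which makes it easier to test and tune. The sketch below is a standalone sliding-window variant, not the notebook's code: `chunk_paragraphs`, `window`, `stride`, and `min_words` are hypothetical names introduced for illustration. When `stride` is smaller than `window`, consecutive chunks share paragraphs.

```python
from typing import List

def chunk_paragraphs(text: str, window: int = 3, stride: int = 2,
                     min_words: int = 30) -> List[str]:
    """Split text on blank lines and join `window` paragraphs per chunk.

    Consecutive chunks start `stride` paragraphs apart, so they share
    `window - stride` paragraphs whenever stride < window.
    """
    if window < 1 or stride < 1:
        raise ValueError("window and stride must be positive")
    paragraphs = [p.strip() for p in text.split("\n\n") if p.strip()]
    chunks = []
    for start in range(0, len(paragraphs), stride):
        chunk = " ".join(paragraphs[start:start + window])
        if len(chunk.split()) > min_words:  # drop chunks with too little content
            chunks.append(chunk)
    return chunks

# Five short paragraphs; min_words=0 keeps every chunk for illustration.
sample = "\n\n".join(f"Paragraph {i} text." for i in range(5))
for c in chunk_paragraphs(sample, window=3, stride=2, min_words=0):
    print(c)
```

Calling it with `window=2, stride=2` roughly mirrors the pairing used in `_build_index`, except that this helper also discards empty paragraphs before grouping.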

## 4. Example Usage

Let's demonstrate how to use the Research Assistant:

In [10]:
# Initialize the assistant
assistant = ResearchAssistant()

# Search for papers on a topic
assistant.search_papers(
    query="latest developments in CRISPR gene editing cancer therapy",
    max_results=20
)

# Ask questions
questions = [
    "What are the main challenges in using CRISPR for cancer therapy?",
]

for question in questions:
    print(f"\nQ: {question}")
    print(f"\nA: {assistant.answer_question(question)}")

INFO:scrape:Searching PubMed for : latest developments in CRISPR gene editing cancer therapy


Searching for papers about: latest developments in CRISPR gene editing cancer therapy


INFO:scrape:Found 27 results
INFO:scrape:Fetching details for 27 papers...
INFO:scrape:Successfully processed PMID 37356052
INFO:scrape:Successfully processed PMID 36610813
ERROR:scrape:Error processing PMID 36272261: 429 Client Error: Too Many Requests for url: https://eutils.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi?db=pubmed&id=36272261&retmode=xml&rettype=full
INFO:scrape:Successfully processed PMID 35337340
ERROR:scrape:Error processing PMID 39708520: 429 Client Error: Too Many Requests for url: https://eutils.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi?db=pubmed&id=39708520&retmode=xml&rettype=full
INFO:scrape:Successfully processed PMID 38050977
INFO:scrape:Successfully processed PMID 34411650
INFO:scrape:Successfully processed PMID 31739699
INFO:scrape:Successfully processed PMID 36560658
INFO:scrape:Successfully processed PMID 33003295
INFO:scrape:Successfully processed PMID 39317648
INFO:scrape:Successfully processed PMID 35547744
INFO:scrape:Successfully processed PMID 39292

Successfully processed 12 papers

Q: What are the main challenges in using CRISPR for cancer therapy?


Batches: 100%|██████████| 1/1 [00:00<00:00, 55.03it/s]
Setting `pad_token_id` to `eos_token_id`:2 for open-end generation.



A: CRISPR-Cas9 is a powerful tool for precise genome editing, but its application in cancer therapy faces several challenges. One of the main challenges is the difficulty in targeting cancer cells without affecting healthy cells. This is because cancer cells often have mutations that are also present in healthy cells. The specificity of CRISPR-Cas9 depends on the guide RNA (gRNA) used, which must be designed to target the specific mutation present in the cancer cells. However, designing such gRNAs can be challenging, especially when the mutation is present in a highly conserved region of the genome.

Another challenge is the potential for off-target effects, where the gRNA binds to a different DNA sequence than intended, leading to unintended consequences. Off-target effects can occur when the gRNA binds to a non-target sequence with similarity to the intended target. This can lead to the deletion or insertion of nucleotides in the genome, which can have deleterious effects on the cel
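
The `429 Too Many Requests` errors in the log output above mean the PubMed E-utilities rate limit was hit (NCBI allows up to 3 requests per second without an API key and up to 10 with one), so some papers were skipped. One common remedy is to retry with exponential backoff. The sketch below is an assumption about how that could look: the internals of `PaperScraper` are not shown in this notebook, so `fetch_with_backoff`, `RateLimitError`, and the `fetch` callable are hypothetical stand-ins rather than part of the scraper.

```python
import time
from typing import Callable, TypeVar

T = TypeVar("T")

class RateLimitError(Exception):
    """Hypothetical stand-in for an HTTP 429 response."""

def fetch_with_backoff(fetch: Callable[[], T], max_retries: int = 4,
                       base_delay: float = 0.5,
                       sleep: Callable[[float], None] = time.sleep) -> T:
    """Call `fetch`, retrying on RateLimitError with exponential backoff."""
    attempt = 0
    while True:
        try:
            return fetch()
        except RateLimitError:
            if attempt >= max_retries:
                raise  # give up after the final retry
            sleep(base_delay * 2 ** attempt)  # wait 0.5s, 1s, 2s, ...
            attempt += 1

# Demo with a fake fetch that fails twice, then succeeds.
calls = {"n": 0}

def flaky_fetch() -> str:
    calls["n"] += 1
    if calls["n"] <= 2:
        raise RateLimitError("429 Too Many Requests")
    return "<xml>record</xml>"

delays = []
result = fetch_with_backoff(flaky_fetch, sleep=delays.append)  # record delays instead of sleeping
print(result, delays)
```

Passing `sleep` as a parameter keeps the helper testable without real waiting; in the notebook the default `time.sleep` would apply.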

## Understanding the Components

1. **Paper Retrieval**:
   - Uses PubMed API to search for relevant papers
   - Downloads PDFs and extracts text

2. **Text Processing**:
   - Splits papers into manageable chunks
   - Maintains metadata for citations

3. **Embedding & Indexing**:
   - Uses BGE-Large embeddings (a general-purpose English embedding model)
   - FAISS for efficient similarity search

4. **Language Model**:
   - Mistral-7B with 4-bit quantization
   - Placed on the available GPU(s) automatically via `device_map="auto"`

5. **RAG Process**:
   - Embeds user question
   - Retrieves relevant context
   - Generates contextualized answer
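
The retrieval step can be illustrated without FAISS. `IndexFlatL2` performs an exhaustive search and reports squared L2 distances; for unit-length vectors the squared distance equals `2 - 2 * cosine`, so ranking by L2 distance and ranking by cosine similarity give the same order. The sketch below is a pure-Python stand-in under that assumption (unit-normalized embeddings, which `encode` can produce with `normalize_embeddings=True`). The helper names and the toy 3-dimensional vectors are made up for illustration.

```python
import math
from typing import List, Sequence, Tuple

def normalize(v: Sequence[float]) -> List[float]:
    """Scale a vector to unit length."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]

def dot(a: Sequence[float], b: Sequence[float]) -> float:
    return sum(x * y for x, y in zip(a, b))

def l2_search(index: List[List[float]], query: Sequence[float],
              k: int) -> List[Tuple[int, float]]:
    """Exhaustive nearest-neighbour search by squared L2 distance."""
    scored = [
        (i, sum((a - b) ** 2 for a, b in zip(vec, query)))
        for i, vec in enumerate(index)
    ]
    scored.sort(key=lambda pair: pair[1])  # smallest distance first
    return scored[:k]

# Toy 3-dimensional "embeddings" standing in for real chunk vectors.
chunk_vectors = [normalize(v) for v in
                 ([1.0, 0.0, 0.0], [0.8, 0.6, 0.0], [0.0, 1.0, 0.0])]
query = normalize([0.9, 0.1, 0.0])

for idx, dist in l2_search(chunk_vectors, query, k=2):
    cosine = dot(chunk_vectors[idx], query)
    print(f"chunk {idx}: squared L2 = {dist:.4f}, 2 - 2*cos = {2 - 2 * cosine:.4f}")
```

The two printed columns agree for each hit, which is why a plain L2 index is a reasonable choice when the embeddings are normalized.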
