# **Building a Retrieval-Augmented Generation (RAG) System for Biotechnology Applications**

This notebook presents a comprehensive guide to building a Retrieval-Augmented Generation (RAG) system tailored for biotechnology applications. Leveraging advanced information retrieval and generative modeling techniques, the system efficiently navigates and synthesizes vast amounts of scientific data to provide insightful and accurate responses to domain-specific queries.

---

<hr style="border: 3px solid white; width: 100%;">
<hr style="border: 3px solid white; width: 100%;">


---
## **Motivation & Background** <a id="motivatio-&-background"></a>


Large language models (LLMs) have achieved remarkable success, though they still face significant limitations, especially in domain-specific or knowledge-intensive tasks. These limitations often manifest as **hallucinations**: answers that sound plausible but are factually incorrect. They are most likely when a query falls outside the model's training data or requires current information.

To address these challenges, **Retrieval-Augmented Generation (RAG)** incorporates external knowledge sources. By retrieving relevant document chunks through semantic similarity, RAG mitigates factual inaccuracies and keeps responses up to date. Supplying this context at query time helps LLMs remain both accurate and current, which has driven widespread adoption of RAG in real-world applications.

### Why Use RAG?

1. **Access to Fresh Information**  
   LLMs rely on static training corpora, so their answers can be outdated. RAG draws on external databases that can be updated independently of the model, giving it access to more recent facts and data.  

2. **Factual Grounding**  
   LLMs excel at generating fluent text but can falter on factual correctness. By feeding retrieved text directly into the prompt, RAG reduces hallucinations and grounds answers in evidence.

3. **Scalability & Efficiency**  
   Even with long context windows, LLMs have token limits. RAG’s retrieval stage selectively brings in only the most relevant chunks, which saves costs and tokens while boosting relevance.

4. **Semantic Search & Re-Rankers**  
   Modern RAG systems often combine vector databases (for semantic search), an optional keyword-based fallback, and re-rankers that push the most on-topic results to the top.

5. **Quality & Reliability**  
   By anchoring generated content to curated knowledge, RAG helps maintain consistency and accuracy. This is essential for domains like biotechnology, finance, or medicine, where mistakes can have serious consequences.

---



<hr style="border: 3px solid white; width: 100%;">
<hr style="border: 3px solid white; width: 100%;">


# Introduction to Retrieval-Augmented Generation (RAG) <a id="introduction-to-rag"></a>

## What is RAG? <a id="what-is-rag"></a>

**Retrieval-Augmented Generation (RAG)** is an advanced AI framework that integrates **information retrieval** with **generative models** to produce more accurate and contextually relevant responses. Unlike traditional generative models that rely solely on learned parameters, RAG actively retrieves relevant information from external datasets, **enhancing responses with up-to-date and domain-specific knowledge**.

## Core Components <a id="core-components"></a>

### **Indexing**
- Load and preprocess documents  
- Split text into **manageable chunks**  
- Generate **vector embeddings**  
- Store vectors efficiently in **databases**  

### **Retriever**
- Transform queries into **embeddings**  
- Perform **semantic similarity search**  
- Retrieve **relevant documents**  
- Apply **optional ranking and filtering**  

### **Generator**
- **Prompt engineering** for query handling  
- **Context integration** with retrieved documents  
- Generate responses using **LLMs**  
- **Refine and format** outputs  
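
The three stages above can be sketched end to end in plain Python. This is a toy illustration only, not the implementation used later in this notebook: shared-word counting stands in for real vector embeddings, the three example chunks are made up, and the names `tokenize`, `build_index`, `retrieve`, and `build_prompt` are hypothetical.

```python
import re

def tokenize(text):
    """Lowercase a string and return its set of alphanumeric word tokens."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))

def build_index(chunks):
    """Indexing: pair each chunk with its token set (a stand-in for an embedding)."""
    return [(chunk, tokenize(chunk)) for chunk in chunks]

def retrieve(index, query, k=2):
    """Retriever: rank chunks by how many query tokens they share, keep the top k."""
    query_tokens = tokenize(query)
    ranked = sorted(index, key=lambda item: len(item[1] & query_tokens), reverse=True)
    return [chunk for chunk, _ in ranked[:k]]

def build_prompt(context_chunks, question):
    """Generator input: place the retrieved chunks ahead of the question."""
    context = "\n\n".join(context_chunks)
    return f"Context:\n{context}\n\nQuestion: {question}\n\nAnswer:"

# Made-up example chunks for illustration.
chunks = [
    "CRISPR/Cas9 enables targeted gene editing in plants.",
    "TIM-3 is an immune checkpoint molecule on T cells.",
    "EGFR mutations are common in lung adenocarcinoma.",
]
index = build_index(chunks)
question = "What is CRISPR gene editing used for?"
top_chunks = retrieve(index, question, k=1)
prompt = build_prompt(top_chunks, question)
print(prompt)
```

In a real system the prompt would then be sent to an LLM; the sketch stops at prompt construction so that it runs without any API access.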

## Use Cases and Advantages <a id="use-cases-and-advantages"></a>

- **Improved Accuracy**: RAG reduces hallucinations by incorporating real-world data.  
- **Domain-Specific Applications**: Ideal for fields requiring precise and current knowledge, such as **biotechnology**.  
- **Enhanced Interactive Systems**: Strengthens chatbots and virtual assistants with **context-aware responses**.  

---



<hr style="border: 3px solid white; width: 100%;">
<hr style="border: 3px solid white; width: 100%;">

### **RAG Architectural Diagram** <a id="architectural-diagram"></a>


![RAG Paradigm](rag_paradigm.png)

*Figure 1: Overview of the Retrieval-Augmented Generation System Architecture*

---

<hr style="border: 3px solid white; width: 100%;">
<hr style="border: 3px solid white; width: 100%;">

##  **Practical Implementation** <a id="practical-implementation"></a>




### This notebook demonstrates how to build a **Retrieval-Augmented Generation (RAG) system** tailored for biotechnology applications.

### Key Steps:
- Load and process a small subset of **PubMed** data.
- Convert data into **LangChain Document** objects.
- Split documents into **manageable chunks** for semantic retrieval.
- Build a **vector index** using **Chroma** with OpenAI embeddings (domain-specific biomedical embeddings are a possible alternative).
- Construct a **Retrieval-Augmented Generation** (RAG) chain powered by **ChatGPT**.
- Run example queries and display results.

**Note:** For production use, ensure secure management of **environment variables** and **API keys**.


In [1]:
import os

# ------------------------------
# Environment Setup: API Key and Model Name
# ------------------------------
# In this example, the OpenAI API key for ChatGPT-based generation is read from the environment.
# For production, load the key securely (e.g., from environment variables or a config file).
# setdefault keeps a key that is already configured; "_" is only a placeholder.
os.environ.setdefault("OPENAI_API_KEY", "_")  # Replace "_" with your actual OpenAI API key if needed.
# Warn if the key is still the placeholder.
if os.environ["OPENAI_API_KEY"] == "_":
    print("Warning: OPENAI_API_KEY is not set securely. Please configure it properly for production use.")

# Set the model name used for chat generation.
# You can switch models based on your requirements.
model_name = "gpt-4o-mini"  
# Alternative model example (commented out):
# model_name = "o3-mini-2025-01-31"

## **Data Preprocessing and Indexing** <a id="data-preprocessing-and-indexing"></a>


![Indexing Paradigm](indexing.jpeg)

*Figure 2: Indexing Process*

---

In [2]:
# Import LangChain components for handling documents, text splitting, and chain construction.
from langchain.schema import Document
import json
from langchain.text_splitter import RecursiveCharacterTextSplitter
#
# This section converts raw data (PubMed abstracts) into LangChain Document objects,
# which are easier to process for embedding and retrieval.
#
# The load_dataset function fetches a subset of PubMed articles from the "scientific_papers" dataset.
# from datasets import load_dataset 
# pubmed_data = load_dataset('scientific_papers', 'pubmed', split='train[:1%]', trust_remote_code=True)
#
# To save time during demonstrations, the abstracts from this subset were pre-saved; they are read in below.
with open("pubmed_abstracts.json", "r", encoding="utf-8") as f:
    pubmed_data = json.load(f)


# ------------------------------
# Converting the dataset to LangChain Document objects.
# Each entry in the JSON file is one abstract string, which becomes the page_content of a Document.
# Document objects can also carry optional metadata; none is attached here.
# ------------------------------
documents = []
for item in pubmed_data:
    documents.append(Document(page_content=item))
print(f"Loaded {len(documents)} documents from PubMed.")


Loaded 1199 documents from PubMed.


 ------------------------------
## **Splitting Documents into Chunks**
 ------------------------------

In [3]:
# For better retrieval and to respect model token limits, we split longer texts into smaller chunks.
#
# RecursiveCharacterTextSplitter is used here to chunk documents.
#   • chunk_size: The maximum size of each chunk (e.g., 1000 characters).
#   • chunk_overlap: The number of characters of overlap between consecutive chunks for better context retention.
#
# Chunking improves retrieval granularity and helps the underlying LLM to process context effectively.
splitter = RecursiveCharacterTextSplitter(
    chunk_size=1000,   # Set chunk size according to your data/model limits
    chunk_overlap=200  # Small overlap to ensure context continuity in retrieval
)
docs_split = splitter.split_documents(documents)
print(f"Number of chunks after splitting: {len(docs_split)}")

Number of chunks after splitting: 2189
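
To make the roles of `chunk_size` and `chunk_overlap` concrete, here is a minimal sliding-window chunker in plain Python. It is a simplified sketch, not how `RecursiveCharacterTextSplitter` works internally: the real splitter also tries to break on separators such as paragraph and line breaks, whereas this hypothetical `chunk_text` helper cuts at fixed character offsets.

```python
def chunk_text(text, chunk_size=1000, chunk_overlap=200):
    """Split text into fixed-size character windows that overlap their neighbours."""
    if chunk_overlap >= chunk_size:
        raise ValueError("chunk_overlap must be smaller than chunk_size")
    step = chunk_size - chunk_overlap  # how far each new window advances
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # the current window already reaches the end of the text
    return chunks

sample = "abcdefghijklmnopqrstuvwxyz"
pieces = chunk_text(sample, chunk_size=10, chunk_overlap=4)
print(pieces)
```

Each piece repeats the last four characters of the previous one, which is the property that lets a sentence straddling a chunk boundary still appear intact in at least one chunk.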


------------------------------
## **Building the Embedding and Vector Store (Indexing)**
------------------------------

In [4]:
from langchain_openai import OpenAIEmbeddings
from langchain_community.vectorstores import Chroma  # current import path; langchain.vectorstores is deprecated

# We use OpenAI for general embeddings, but domain-specific embeddings (for biomedical text) may perform better.
embedding_model = OpenAIEmbeddings()

# Create the Chroma vector store using the split document chunks. 
# Chroma handles embedding computation and efficient storage/retrieval.
# The collection_name "pubmed_biotech" organizes the stored vectors.
vectorstore = Chroma.from_documents(
    docs_split,
    embedding_model,
    collection_name="pubmed_biotech"
)

# Create a retriever from the vector store.
# search_kwargs such as k=5 ensure that we retrieve the top 5 most relevant chunks for a given query.
retriever = vectorstore.as_retriever(search_kwargs={"k": 5})
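
Conceptually, a top-`k` retriever scores every stored vector against the query vector and keeps the best `k`. The sketch below shows that idea with brute-force cosine similarity over toy three-dimensional vectors. It is illustrative only: Chroma relies on an approximate nearest-neighbour index and its own configured distance metric rather than this exhaustive scan, and the vectors and the names `cosine_similarity` and `top_k` are made up for the example.

```python
import heapq
import math

def cosine_similarity(a, b):
    """Return the cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # treat a zero vector as having no similarity
    return dot / (norm_a * norm_b)

def top_k(query_vec, doc_vecs, k=5):
    """Return (doc_id, score) pairs for the k vectors most similar to the query."""
    scored = ((doc_id, cosine_similarity(query_vec, vec)) for doc_id, vec in doc_vecs.items())
    return heapq.nlargest(k, scored, key=lambda pair: pair[1])

# Toy "embeddings"; real ones have hundreds or thousands of dimensions.
toy_vectors = {
    "chunk_a": [1.0, 0.0, 0.0],
    "chunk_b": [0.7, 0.7, 0.0],
    "chunk_c": [0.0, 0.0, 1.0],
}
results = top_k([1.0, 0.1, 0.0], toy_vectors, k=2)
print(results)
```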




![Retrieval and Generation Paradigm](prompt-and-generation.jpeg)

*Figure 3: Retrieval and Generation Paradigm*

------------------------------
### **Building the Retrieval-Augmented Generation (RAG) Chain with ChatOpenAI**
------------------------------

In [5]:
# Now, we switch to the generative component that uses ChatGPT for answer generation.
#
# The ChatOpenAI component creates an instance of the chat-model.
# Here, we use a model (gpt-4o-mini or another specified) with temperature=0 for focused, more repeatable responses.
from langchain_openai import ChatOpenAI
from langchain.chains import RetrievalQA
from langchain.prompts import PromptTemplate

llm_chatgpt = ChatOpenAI(
    model_name=model_name,                 # Model name as defined earlier
    temperature=0,                         # Temperature 0 for precise answers in domain-specific questions
    openai_api_key=os.getenv("OPENAI_API_KEY")  # Use the API key from the environment
)




---------------
## **Define a prompt template for the RAG chain.**
### The template instructs the LLM to use the provided context and answer the question.
### It also instructs the model to state if the context is insufficient.
---------------

In [6]:
template = """Use the following context to answer the question. 
If the context is insufficient, just say so.

Context:
{context}

Question: {question}

Answer:
"""
prompt = PromptTemplate(
    template=template,
    input_variables=["context", "question"]
)



----------------------
## **Build the RetrievalQA Chain**

### The `chain_type="stuff"` combines all retrieved document chunks into the prompt.  
### Other chain types (e.g., `map_reduce`, `refine`) can also be explored.

### This chain utilizes the previously defined LLM and retriever.
### It first retrieves relevant document chunks from the vector store and then passes them, along with the user query, to the chat-based LLM to generate an answer.

----------------------


In [7]:
rag_chain = RetrievalQA.from_chain_type(
    llm=llm_chatgpt,
    chain_type="stuff",   # This chain type simply concatenates the retrieved documents for the LLM prompt.
    retriever=retriever,
    chain_type_kwargs={"prompt": prompt}
)


# Define a convenience function to run queries on the RAG system.
#
# This function takes a text query, passes it to the RAG chain, and returns the answer.
def query_rag_system(query: str) -> str:
    """Retrieve context for the query and return the ChatGPT-generated answer."""
    # RetrievalQA takes the question under "query" and returns the answer under "result".
    # invoke() replaces the deprecated run() method.
    response = rag_chain.invoke({"query": query})
    return response["result"]
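
Because the "stuff" strategy places every retrieved chunk into a single prompt, the prompt grows with `k` and `chunk_size` (here at most about 5 × 1000 characters plus separators). The sketch below shows one way to keep such a prompt within a budget by adding ranked chunks until a character limit is reached. This is an illustrative assumption, not something `RetrievalQA` does for you: the `stuff_context` helper and the `max_chars` limit are hypothetical, and a production system would more likely count tokens than characters.

```python
def stuff_context(chunks, max_chars=6000, separator="\n\n"):
    """Concatenate ranked chunks in order, stopping before the budget is exceeded."""
    selected = []
    used = 0
    for chunk in chunks:
        # The separator is only paid for when a chunk follows another one.
        extra = len(chunk) + (len(separator) if selected else 0)
        if used + extra > max_chars:
            break
        selected.append(chunk)
        used += extra
    return separator.join(selected)

# Three placeholder chunks of 1000 characters each, best match first.
ranked_chunks = ["a" * 1000, "b" * 1000, "c" * 1000]
context = stuff_context(ranked_chunks, max_chars=2500)
print(len(context))
```

Dropping from the end of the ranked list means the least relevant chunks are the first to go when space runs out.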


In [8]:
# Example biotech questions, answered through the convenience function defined above.
question_1 = "Summarize the latest CRISPR gene editing findings from these abstracts."
answer_1 = query_rag_system(question_1)
print(f"Q: {question_1}\nA: {answer_1}\n")

question_2 = "Discuss current advancements in immunotherapy for lung cancer."
answer_2 = query_rag_system(question_2)
print(f"Q: {question_2}\nA: {answer_2}")




Q: Summarize the latest CRISPR gene editing findings from these abstracts.
A: The latest findings on CRISPR gene editing highlight several key advancements and challenges. The CRISPR/Cas9 system is recognized for its precision, efficiency, versatility, and ease of use, facilitating both basic research and applied crop improvement. Recent applications have successfully extended CRISPR/Cas9 technology to biallelic mutations in woody perennials, such as Populus. However, challenges remain, particularly in outcrossing species where sequence polymorphisms can hinder the effectiveness of CRISPR/Cas9. Research has demonstrated the system's sensitivity to allelic heterozygosity and has proposed tools and strategies to address these issues. Additionally, the study of CRISPR-Cas systems in bacteria and archaea has enhanced our understanding of virus defense mechanisms, although the ecological role of these systems in nature is still complex and underexplored. Overall, the advancements in CRISPR 

### **Display**
### Function to display Q&A results in an easy-to-read format.
### This is especially useful in a notebook setting to provide visual emphasis using increased font sizes.

In [9]:

from IPython.display import display, HTML

def display_qna(question, answer, q_font_size=18, a_font_size=16):
    """
    Displays the question and answer with increased font sizes.
    """
    html_content = f"""
    <div style="margin-bottom: 20px;">
        <p style="font-size:{q_font_size}px; font-weight: bold; color: #2E86C1;">Question:</p>
        <p style="font-size:{q_font_size}px; font-weight: bold; color: #2E86C1;">{question}</p>
        <p style="font-size:{a_font_size}px; margin-top: 10px;"><strong>Answer:</strong> {answer}</p>
    </div>
    """
    display(HTML(html_content))


#### **Example Queries**

In [10]:
query1 = "Summarize recent findings on CRISPR gene editing."
answer1 = query_rag_system(query1)
display_qna(query1, answer1)

query2 = "What are the latest advancements in immunotherapy for cancer?"
answer2 = query_rag_system(query2)
display_qna(query2, answer2)

query3 = "What are the latest EGFR mutation lung cancer therapies?"
answer3 = query_rag_system(query3)
display_qna(query3, answer3)

### **Dynamic Index Updating**

The RAG system retrieves from publicly available scientific abstracts on [PubMed](https://pubmed.ncbi.nlm.nih.gov/). Additional sources (e.g. [bioRxiv](https://www.biorxiv.org/)) can be added with the same steps: loading, chunking, embedding, and indexing.

Below, two new abstracts are split into chunks and added to the existing Chroma vector store, which embeds them on insertion.


In [12]:
new_documents = [
    "New Abstract 1: Adenosine-to-inosine (A-to-I) editing of double-stranded RNA (dsRNA) by ADAR1 is an essential modifier of the immunogenicity of cellular dsRNA. The role of MDA5 in sensing unedited cellular dsRNA and the downstream activation of type I interferon (IFN) signaling are well established. However, we have an incomplete understanding of pathways that modify the response to unedited dsRNA. We performed a genome-wide CRISPR screen and showed that GGNBP2, CNOT10, and CNOT11 interact and regulate sensing of unedited cellular dsRNA. We found that GGNBP2 acts between dsRNA transcription and its cytoplasmic sensing by MDA5. GGNBP2 loss prevented induction of type I IFN and autoinflammation after the loss of ADAR1 editing activity by modifying the subcellular distribution of endogenous A-to-I editing substrates and reducing cytoplasmic dsRNA load. These findings reveal previously undescribed pathways to modify diseases associated with ADAR mutations and may be determinants of response or resistance to small-molecule ADAR1 inhibitors.",
    "New Abstract 2: T cell immunoglobulin and mucin domain-containing protein 3 (TIM-3) is an important immune checkpoint molecule initially identified as a marker of IFN-γ–producing CD4+ and CD8+ T cells. Since then, our understanding of its role in immune responses has significantly expanded. Here, we review emerging evidence demonstrating unexpected roles for TIM-3 as a key regulator of myeloid cell function, in addition to recent work establishing TIM-3 as a delineator of terminal T cell exhaustion, thereby positioning TIM-3 at the interface between fatigued immune responses and reinvigoration. We share our perspective on the antagonism between TIM-3 and T cell stemness, discussing both cell-intrinsic and cell-extrinsic mechanisms underlying this relationship. Looking forward, we discuss approaches to decipher the underlying mechanisms by which TIM-3 regulates stemness, which has remarkable potential for the treatment of cancer, autoimmunity, and autoinflammation."
]


# 1) Convert raw strings into LangChain Document objects
new_docs = [Document(page_content=text) for text in new_documents]

# 2) Split the new documents for chunking (helps with retrieval accuracy)
split_new_docs = splitter.split_documents(new_docs)

# 3) Add these chunks to your vectorstore
#    (This automatically handles embeddings and indexing behind the scenes.)
vectorstore.add_documents(split_new_docs)

# 4) Confirm the addition. A store-specific check, such as counting the stored
#    documents, could replace this simple message.
print("Successfully added new documents to the vectorstore!")


Successfully added new documents to the vectorstore!
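
One practical concern the cell above does not address is what happens when it is re-run: the same chunks would presumably be indexed a second time. A common way to guard against this is to derive a stable ID from each chunk's text and skip IDs that have already been seen. The sketch below shows only that bookkeeping in plain Python. The helper names are hypothetical, and passing such IDs on to the vector store is an assumption to check against the documentation of the store you use.

```python
import hashlib

def content_id(text):
    """Return a stable identifier derived from the chunk text."""
    return hashlib.sha256(text.encode("utf-8")).hexdigest()

def select_new_chunks(chunk_texts, seen_ids):
    """Return (id, text) pairs for chunks not yet indexed and record their IDs."""
    fresh = []
    for text in chunk_texts:
        cid = content_id(text)
        if cid not in seen_ids:
            seen_ids.add(cid)
            fresh.append((cid, text))
    return fresh

seen = set()
first_pass = select_new_chunks(["abstract A", "abstract B"], seen)
second_pass = select_new_chunks(["abstract B", "abstract C"], seen)
print(len(first_pass), len(second_pass))
```

In the second pass only "abstract C" is returned, because "abstract B" was already recorded during the first pass.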


In [13]:
new_query1 = "What are the roles of GGNBP2, CNOT10, and CNOT11 in regulating the sensing of unedited cellular dsRNA?"
new_answer1 = query_rag_system(new_query1)
display_qna(new_query1, new_answer1)

new_query2 = "How does TIM-3 regulate myeloid cell activity and T cell exhaustion in immune responses?"
new_answer2 = query_rag_system(new_query2)
display_qna(new_query2, new_answer2)


<hr style="border: 3px solid white; width: 100%;">
<hr style="border: 3px solid white; width: 100%;">

## **Expert-Level Insights** <a id="expert-level-insights"></a>

###  **Challenges and Mitigation Strategies** <a id="challenges-and-mitigation-strategies"></a>

Building a robust RAG system involves navigating several challenges. Below are common issues and strategies to mitigate them:

- **Retrieval Accuracy**:
    - *Challenge*: Irrelevant documents may be retrieved.
    - *Mitigation*: Use stronger embedding models and improve the quality of your dataset. Fine-tune the embedding model on domain-specific data to enhance relevance.
    
- **Hallucination in Generation**:
    - *Challenge*: The generative model may produce incorrect or nonsensical answers.
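    - *Mitigation*: Fine-tune the generative model on domain-specific datasets and constrain generation, for example with length limits, specific tokens, or a prompt that tells the model to say so when the retrieved context is insufficient (as the template in this notebook does).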
    
- **Scalability**:
    - *Challenge*: Handling large datasets efficiently.
    - *Mitigation*: Use approximate nearest-neighbor indexes such as hierarchical navigable small world (HNSW) graphs, which libraries like FAISS provide, for faster searches. Distribute the index across multiple machines if necessary.
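
As a small illustration of the retrieval-accuracy point, one simple filter is to drop retrieved chunks whose similarity score falls below a threshold before they reach the prompt. The sketch below assumes that higher scores mean greater similarity. The scores, the 0.5 cut-off, and the `filter_by_score` helper are all made up for illustration, and a suitable threshold would need to be tuned on real queries.

```python
def filter_by_score(scored_chunks, min_score=0.5):
    """Keep (chunk, score) pairs at or above min_score, best first."""
    kept = [pair for pair in scored_chunks if pair[1] >= min_score]
    return sorted(kept, key=lambda pair: pair[1], reverse=True)

# Hypothetical retrieval results as (chunk text, similarity score) pairs.
candidates = [
    ("chunk about CRISPR screens", 0.82),
    ("chunk about crop yields", 0.31),
    ("chunk about Cas9 specificity", 0.67),
]
relevant = filter_by_score(candidates, min_score=0.5)
print(relevant)
```

If the filter leaves nothing, the prompt's instruction to report insufficient context becomes the fallback.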


<hr style="border: 3px solid white; width: 100%;">
<hr style="border: 3px solid white; width: 100%;">

## **Conclusion** <a id="conclusion"></a>

In this tutorial, we've built a **Retrieval-Augmented Generation (RAG)** system tailored for the **biotechnology** and **biomedical research** industries. By integrating advanced retrieval mechanisms with powerful generative models, we've created a tool capable of navigating complex scientific data and providing insightful, evidence-based answers to domain-specific queries. This system showcases the potential of RAG in transforming how professionals interact with vast datasets, ultimately accelerating research and innovation in biotechnology and biomedical fields.

### **Key Takeaways:**

- #### **Chroma Vector Store:**
  - Generated vector embeddings with OpenAI's embedding model and stored them in a Chroma vector store.
  - Used **Chroma** for similarity search over the indexed chunks.
  - Added new abstracts to the existing store without rebuilding the index.

- #### **Generative Models:**
  - Used an OpenAI chat model (`gpt-4o-mini`) to generate coherent and contextually relevant responses based on retrieved documents.
  - Integrated the RAG chain architecture to mitigate hallucinations, so that large language model (LLM) outputs stay grounded in the retrieved evidence.

- #### **Optimization:**
  - Implemented chunking (1,000-character chunks with a 200-character overlap) to stay within token limits while preserving context across chunk boundaries.

- #### **Persistence:**
  - The vector store in this notebook lives in memory for the session; saving and reloading it is not implemented here and is listed under the next steps below.

### **Potential Next Steps:**

- #### **Fine-Tuning Models:**
  - Further enhance the system by fine-tuning the generative model on specialized biotechnology and biomedical datasets to improve domain-specific performance.

- #### **Scalability:**
  - Explore more advanced Chroma indexing techniques and alternative chain types to handle larger datasets and complex retrieval parameters.

- #### **Deployment:**
  - Consider deploying the RAG system as an API or integrating it into applications for real-time information retrieval and generation.
  - Implement persistence strategies, such as saving/loading the vector store, to support scaling up the system for broader use cases.

By following these next steps, you can continue to refine and expand the capabilities of your RAG system, ensuring it remains a valuable tool for advancing research and innovation in the biotechnology and biomedical research sectors.
