# RAG Retrieval Analysis

This notebook tests the retrieval component of the RAG pipeline without using the LLM.

In [1]:
import pandas as pd
from pathlib import Path
from biology_rag import BiologyTextbookRAG
import json
from datetime import datetime



In [2]:
# Load and prepare the dataset
df = pd.read_csv('bio_data.csv')
print(f"Loaded {len(df)} questions")
df.head()

Loaded 30 questions


| Unnamed: 0 | ID | Question Text | Choice A | Choice B | Choice C | Choice D | Correct Choice (0-3) | Text Sections | Figures | Explanation |
|---|---|---|---|---|---|---|---|---|---|---|
| | BIO_CH4_Q1 | Based on Figure 4.14's electron micrograph, ho... | They decrease surface area | They increase surface area | They block molecular movement | They have no functional impact | 1 | 4.3 | 4.14 | The electron micrograph clearly shows that cri... |
| | BIO_CH4_Q2 | In Figure 4.1, what structural feature demonst... | Cell wall composition | Internal compartmentalization | Membrane thickness | Cell size variation | 1 | 4.1 | 4.1 | The micrographs reveal that nasal sinus cells ... |
| | BIO_CH4_Q3 | Looking at Figure 4.7, why does the larger cub... | Decreased internal pressure | Reduced membrane flexibility | Insufficient surface area ratio | Increased cellular rigidity | 2 | 4.2 | 4.7 | The diagram visually demonstrates that as the ... |
| | BIO_CH4_Q4 | Based on Figure 4.18's protein trafficking pat... | Normal protein processing | Reversed protein modification | Incomplete protein folding | Enhanced protein sorting | 2 | 4.4 | 4.18 | The diagram shows crucial initial protein fold... |
| | BIO_CH4_Q5 | In Figure 4.5, how does the nucleoid region's ... | More organized but less efficient | Less compartmentalized but functional | More condensed but less accessible | Less protected but more active | 1 | [4.2,4.3] | [4.5,4.8] | The visual comparison shows the prokaryotic nu... |


In [3]:
# Initialize RAG system (without LLM)
rag_system = BiologyTextbookRAG(
    pdf_path="biology2e_textbook.pdf",
    project_dir="biology_rag"
)

# Extract sections if needed
if not rag_system.sections:
    print("Extracting sections from PDF...")
    rag_system.extract_sections()

# Create or load vector store
print("Setting up vector store...")
vectorstore = rag_system.create_vectorstore(force_refresh=True)
print(f"Vector store has {vectorstore._collection.count()} documents")

Project directory: biology_rag
Models directory: biology_rag/models
Vector database directory: biology_rag/vector_db
Using cached model from biology_rag/models/llama-2-7b-chat.Q4_K_M.gguf
Extracting sections from PDF...
Reading PDF and extracting sections...


100%|██████████| 1447/1447 [00:13<00:00, 105.84it/s]


Extracted 107 sections from the textbook
Setting up vector store...
Vector store directory: biology_rag/vector_db
Creating new vector store...
Processing 4934 chunks from 107 sections
Created new vector store with 4934 documents
Vector store has 4934 documents


In [4]:
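def get_retrieval_results(question: str, k: int = 5) -> dict:
    """Return the top-k retrieved chunks for a question, without calling the LLM."""
    # similarity_search() returns bare documents, so reading a score from
    # doc.metadata would always fall back to 0.0. Request (document, score)
    # pairs instead (assumes a LangChain vector store such as Chroma).
    retrieved = vectorstore.similarity_search_with_score(question, k=k)

    # Format results
    chunks = []
    for doc, score in retrieved:
        chunks.append({
            'text': doc.page_content,
            'section': doc.metadata.get('section', 'Unknown'),
            # For Chroma the score is a distance: lower means more similar
            'score': float(score)
        })

    return {
        'question': question,
        'retrieved_chunks': chunks
    }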

In [5]:
# Process all questions and collect retrieval results
all_results = []

# Number questions by position rather than by the DataFrame index label,
# which is not guaranteed to be a 0-based integer
for position, (_, row) in enumerate(df.iterrows(), start=1):
    print(f"\nProcessing question {position}/{len(df)}")
    print(f"Question: {row['Question Text']}")

    # Get retrieval results
    results = get_retrieval_results(row['Question Text'])

    # Add metadata
    results['question_id'] = row['ID']
    results['correct_sections'] = row['Text Sections']
    results['figures'] = row['Figures']

    # Print retrieved chunks
    print("\nRetrieved chunks:")
    for i, chunk in enumerate(results['retrieved_chunks'], 1):
        print(f"\nChunk {i} from section {chunk['section']}:")
        print("-" * 80)
        print(chunk['text'])
        print("-" * 80)

    all_results.append(results)


Processing question 1/30
Question: Based on Figure 4.14's electron micrograph, how do cristae contribute to mitochondrial function?

Retrieved chunks:

Chunk 1 from section 4.10:
--------------------------------------------------------------------------------
your cells don’t get enough oxygen, they do not make much ATP. Instead, producing lactic acid accompanies the small amount of
ATP they make in the absence of oxygen.
Mitochondria are oval-shaped, double membrane organelles (Figure 4.14) that have their own ribosomes and DNA. Each
membrane is a phospholipid bilayer embedded with proteins. The inner layer has folds called cristae. We call the area
surrounded by the folds the mitochondrial matrix. The cristae and the matrix have different roles in cellular respiration.
Figure 4.14 This electron micrograph shows a mitochondrion through an electron microscope. This organelle has an outer membrane and
an inner membrane. The inner memb

In [7]:
# Save results
timestamp = datetime.now().strftime("%Y%m%d_%H%M%S")
output_file = f"retrieval_results_{timestamp}.json"

with open(output_file, 'w') as f:
    json.dump(all_results, f, indent=2)

print(f"\nResults saved to {output_file}")


Results saved to retrieval_results_20241119_173336.json
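
Each saved record pairs the retrieved chunks' `section` labels with the question's expected `Text Sections` value, so a simple hit rate can be computed without involving the LLM. The following is a minimal sketch and not part of the original notebook: `parse_sections` and `hit_rate` are hypothetical helpers, and the sketch assumes that the `section` metadata uses the same numbering as the `Text Sections` column. In the sample output, a question keyed to section 4.3 retrieves a chunk labelled 4.10, so that assumption may need checking before the number is trusted.

```python
def parse_sections(value) -> set:
    """Turn a 'Text Sections' cell such as '4.3' or '[4.2,4.3]' into a set of labels."""
    text = str(value).strip().strip('[]')
    return {part.strip() for part in text.split(',') if part.strip()}


def hit_rate(results: list) -> float:
    """Fraction of questions with at least one chunk from an expected section."""
    if not results:
        return 0.0
    hits = 0
    for result in results:
        expected = parse_sections(result['correct_sections'])
        retrieved = {str(chunk['section']) for chunk in result['retrieved_chunks']}
        if expected & retrieved:
            hits += 1
    return hits / len(results)


# Toy records shaped like the entries of all_results (values are illustrative only)
example_results = [
    {'correct_sections': '4.3',
     'retrieved_chunks': [{'section': '4.3'}, {'section': '4.1'}]},
    {'correct_sections': '[4.2,4.3]',
     'retrieved_chunks': [{'section': '4.4'}]},
]
print(f"Hit rate: {hit_rate(example_results):.2f}")
```

In the notebook itself the call would be `hit_rate(all_results)`.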


In [9]:
def format_rag_prompt(question: str, choices: dict, retrieved_chunks: list) -> str:
    """
    Format a prompt that includes the question, answer choices, and retrieved chunks.
    
    Args:
        question: The question text
        choices: Dictionary of answer choices (e.g., {'A': 'choice text', 'B': '...'})
        retrieved_chunks: List of dictionaries containing retrieved chunks with 'text' and 'section' keys
    
    Returns:
        Formatted prompt string
    """
    # Format the question and choices
    prompt_parts = [
        "Answer the following question using ONLY the provided context below.",
        "\nQuestion:",
        question,
        "\nChoices:"
    ]
    
    for letter, text in choices.items():
        prompt_parts.append(f"{letter}) {text}")
    
    # Add retrieved context
    prompt_parts.extend([
        "\nRelevant Context:",
        "Here are the most relevant sections from the textbook:"
    ])
    
    for i, chunk in enumerate(retrieved_chunks, 1):
        prompt_parts.extend([
            f"\nChunk {i} [Section {chunk['section']}]:",
            "-" * 40,
            chunk['text'],
            "-" * 40
        ])
    
    # Add final instruction
    prompt_parts.extend([
        "\nInstructions:",
        "1. Use ONLY the information from the provided context above",
        "2. Choose the most appropriate answer (A, B, C, or D)",
        "3. If the context doesn't contain enough information, indicate that",
        "\nAnswer: "
    ])
    
    return "\n".join(prompt_parts)

In [11]:
# Build the prompt for a single question from its row and retrieved chunks
def process_question(row, retrieved_chunks):
    """Process a single question with its retrieved chunks"""
    # Format choices dictionary
    choices = {
        'A': row['Choice A'],
        'B': row['Choice B'],
        'C': row['Choice C'],
        'D': row['Choice D']
    }
    
    # Generate prompt
    prompt = format_rag_prompt(
        question=row['Question Text'],
        choices=choices,
        retrieved_chunks=retrieved_chunks
    )
    
    return prompt

prompts = []

# Retrieve chunks, build the prompt, and print it for every question
for idx, row in df.iterrows():
    results = get_retrieval_results(row['Question Text'])
    prompt = process_question(row, results['retrieved_chunks'])
    print(prompt)
    prompts.append(prompt)
    print("\n" + "="*80 + "\n")

Answer the following question using ONLY the provided context below.

Question:
Based on Figure 4.14's electron micrograph, how do cristae contribute to mitochondrial function?

Choices:
A) They decrease surface area
B) They increase surface area
C) They block molecular movement
D) They have no functional impact

Relevant Context:
Here are the most relevant sections from the textbook:

Chunk 1 [Section 4.10]:
----------------------------------------
your cells don’t get enough oxygen, they do not make much ATP. Instead, producing lactic acid accompanies the small amount of
ATP they make in the absence of oxygen.
Mitochondria are oval-shaped, double membrane organelles (Figure 4.14) that have their own ribosomes and DNA. Each
membrane is a phospholipid bilayer embedded with proteins. The inner layer has folds called cristae. We call the area
surrounded by the folds the mitochondrial matrix. The cristae and the matrix have different roles in cellular r

In [13]:
print(prompts[0])

Answer the following question using ONLY the provided context below.

Question:
Based on Figure 4.14's electron micrograph, how do cristae contribute to mitochondrial function?

Choices:
A) They decrease surface area
B) They increase surface area
C) They block molecular movement
D) They have no functional impact

Relevant Context:
Here are the most relevant sections from the textbook:

Chunk 1 [Section 4.10]:
----------------------------------------
your cells don’t get enough oxygen, they do not make much ATP. Instead, producing lactic acid accompanies the small amount of
ATP they make in the absence of oxygen.
Mitochondria are oval-shaped, double membrane organelles (Figure 4.14) that have their own ribosomes and DNA. Each
membrane is a phospholipid bilayer embedded with proteins. The inner layer has folds called cristae. We call the area
surrounded by the folds the mitochondrial matrix. The cristae and the matrix have different roles in cellular r

In [18]:
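# Save prompts to a file, labelling each question number and
# separating questions with blank lines
with open("prompts.txt", "w") as f:
    # enumerate() avoids prompts.index(), which rescans the list on every
    # iteration and returns the wrong number for duplicate prompts
    for number, prompt in enumerate(prompts, start=1):
        f.write(f"Question {number}:!!!!!!!!!\n\n")
        f.write(prompt)
        f.write("\n\n\n\n")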

print("Prompts saved to prompts.txt")

Prompts saved to prompts.txt
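
The prompts label the answer choices A to D, while the dataset's `Correct Choice (0-3)` column stores a zero-based index, so the index has to be mapped to a letter before a model's reply can be checked against the answer key. A minimal sketch, assuming the index follows the column order `Choice A` through `Choice D`; `correct_letter` is a hypothetical helper, not part of the original notebook:

```python
CHOICE_LETTERS = "ABCD"


def correct_letter(correct_index) -> str:
    """Map a 'Correct Choice (0-3)' value to the letter used in the prompt."""
    index = int(correct_index)
    if not 0 <= index < len(CHOICE_LETTERS):
        raise ValueError(f"Correct choice must be 0-3, got {correct_index!r}")
    return CHOICE_LETTERS[index]


# For BIO_CH4_Q1 the dataset stores 1, which corresponds to Choice B
print(correct_letter(1))
```

In the notebook this could be applied per row, for example `correct_letter(row['Correct Choice (0-3)'])`.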


In [73]:
print(prompts[18])

Answer the following question using ONLY the provided context below.

Question:
In Figure 7.18, what aspect of GLUT4 trafficking demonstrates its regulation by insulin?

Choices:
A) Permanent membrane location
B) Vesicle-mediated transport
C) Direct diffusion
D) Continuous cycling

Relevant Context:
Here are the most relevant sections from the textbook:

Chunk 1 [Section 7.12]:
----------------------------------------
nucleic acid production. In short, the cell needs to control its metabolism.
Regulatory Mechanisms
A variety of mechanisms is used to control cellular respiration. Some type of control exists at each stage of glucose metabolism.
Access of glucose to the cell can be regulated using the GLUT (glucose transporter) proteins that transport glucose (Figure 7.18).
Different forms of the GLUT protein control passage of glucose into the cells of specific tissues.
Figure 7.18 GLUT4 is a glucose transpo