## **Learning Objectives**

By completing these exercises, you will:

- Understand Retrieval-Augmented Generation (RAG) and its components.
- Load, preprocess, and handle PDF documents effectively.
- Convert textual data into embeddings for efficient retrieval.
- Implement and test document retrieval systems using LangChain and FAISS.
- Integrate retrieval systems with free large language models (LLMs) available through ChatGroq.
- Build an interactive chat-based Q&A system.

---

## **Exercise 1: Setup and Warm-up**

In this exercise, you'll set up your environment and select a suitable language model.

**Steps:**

1. **Load Environment Variables:** Ensure your environment variables (e.g., API keys, tokens) are securely stored and loaded.
2. **Choose LLM:** Select a free LLM from ChatGroq.
3. **Instantiate the Model:** Create an instance of your chosen model.


In [3]:
# Import necessary libraries
from dotenv import load_dotenv
from langchain_groq import ChatGroq

# Step 1: Load environment variables (includes GROQ_API_KEY from .env file)
load_dotenv()

# Step 2: Choose LLM - Using Llama 3.1 8B (replacement for decommissioned llama3-8b-8192)
model_name = "llama-3.1-8b-instant"

# Step 3: Instantiate the model
llm = ChatGroq(model=model_name, temperature=0)

print(f"LLM initialized: {model_name}")



✅ LLM initialized: llama-3.1-8b-instant


---

## **Exercise 2: Data Ingestion**

In this exercise, you'll learn to load PDF data into a Python environment.

**Steps:**

1. **Import PDF Loader:** Use LangChain's `PyPDFLoader`.
2. **Load PDF File:** Create a function to read the PDF file.
3. **Display PDF Content:** Print the number of pages and first page content.
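
A loader fails with a less obvious error when a path is wrong, so a quick path check before loading can save time. This is a minimal sketch using only the standard library; `split_existing` is a hypothetical helper name.

```python
from pathlib import Path

def split_existing(paths):
    """Split a list of paths into (found, missing) PDF files."""
    found, missing = [], []
    for p in paths:
        path = Path(p)
        # Accept only real files with a .pdf extension
        if path.is_file() and path.suffix.lower() == ".pdf":
            found.append(str(path))
        else:
            missing.append(str(path))
    return found, missing

# Usage:
# found, missing = split_existing(["../documents/paracetamol.pdf"])
# if missing:
#     print(f"Skipping missing files: {missing}")
```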

In [11]:
# Import PyPDFLoader
from langchain_community.document_loaders import PyPDFLoader

# Example function to load PDF
def load_pdf(file_path):
    """Load a PDF file and return its pages as documents."""
    loader = PyPDFLoader(file_path)
    pages = loader.load()
    return pages

In [None]:
# Load BOTH PDFs and combine them
pdf_paths = [
    "../documents/paracetamol.pdf",
    "../documents/react_paper.pdf"
]

# Load all documents
all_pages = []
for pdf_path in pdf_paths:
    pages = load_pdf(pdf_path)
    all_pages.extend(pages)
    print(f"Loaded: {pdf_path.split('/')[-1]} ({len(pages)} pages)")

print(f"\nTotal pages loaded: {len(all_pages)}")


✅ Loaded: paracetamol.pdf (3 pages)
✅ Loaded: react_paper.pdf (33 pages)

📄 Total pages loaded: 36


---

## **Exercise 3: Document Chunking**

This exercise introduces splitting large documents into manageable text chunks.

**Steps:**

1. **Import Text Splitter:** Use `RecursiveCharacterTextSplitter`.
2. **Chunk Document:** Write a function that splits loaded documents into chunks.
3. **Test Function:** Verify by displaying the resulting chunks.
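
To build intuition for `chunk_size` and `chunk_overlap`, here is a deliberately simplified fixed-window splitter. It is an illustration only: `RecursiveCharacterTextSplitter` additionally tries to break on separators such as paragraphs and sentences, which this sketch does not do. The name `sliding_window_chunks` is hypothetical.

```python
def sliding_window_chunks(text, chunk_size=500, chunk_overlap=50):
    """Split text into fixed-size windows that share chunk_overlap characters."""
    if chunk_overlap >= chunk_size:
        raise ValueError("chunk_overlap must be smaller than chunk_size")
    step = chunk_size - chunk_overlap  # how far each window advances
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # the last window already reached the end of the text
    return chunks

# Example: 10 characters, windows of 4 with 1 character of overlap
print(sliding_window_chunks("abcdefghij", chunk_size=4, chunk_overlap=1))
```

The overlap means a sentence cut at a window boundary still appears whole in at least one neighbouring chunk, at the cost of some duplicated text in the index.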


In [13]:
# Import RecursiveCharacterTextSplitter
from langchain_text_splitters import RecursiveCharacterTextSplitter

# Chunking function
def chunk_documents(documents, chunk_size=500, chunk_overlap=50):
    """Split documents into smaller chunks for embedding."""
    text_splitter = RecursiveCharacterTextSplitter(
        chunk_size=chunk_size,
        chunk_overlap=chunk_overlap,
        length_function=len
    )
    chunks = text_splitter.split_documents(documents)
    return chunks

In [None]:
# Chunk the documents (using all_pages from Exercise 2)
chunks = chunk_documents(all_pages)

# Display chunk info
print(f"Created {len(chunks)} chunks from {len(all_pages)} pages\n")
print(f"--- Sample Chunk (first one) ---")
print(f"Content: {chunks[0].page_content[:300]}...")
print(f"\nMetadata: {chunks[0].metadata}")

✅ Created 301 chunks from 36 pages

--- Sample Chunk (first one) ---
Content: 202211
178 mm
422 mm
178 mm
422 mm
Front Side Back Side
 Paracetamol 500mg Tablets
178 x 422mm
178 x 30mm
358
202211
NA
Printed Leaﬂet for  Paracetamol 500mg Tablets, Open size: 178 x 422mm, Folding Size : 178x30mm 
Speciﬁcation: 40GSM Bible Paper - Fairmed/Apohilft-Germany 
P4S Complete Solutions
0...

Metadata: {'producer': 'Adobe PDF Library 16.0', 'creator': 'Adobe InDesign 16.4 (Windows)', 'creationdate': '2021-10-12T16:15:55+02:00', 'moddate': '2021-10-12T16:15:57+02:00', 'trapped': '/False', 'source': '/Users/esmahoney/Desktop/AI_PM_course/ds-rag-pipeline/documents/paracetamol.pdf', 'total_pages': 3, 'page': 0, 'page_label': '1'}



---

## **Exercise 4: Embedding and Storage**

In this exercise, you will create embeddings from text chunks and store them efficiently.

**Steps:**

1. **Choose Embedding Model:** Use `sentence-transformers/all-mpnet-base-v2` from Hugging Face.
2. **Generate Embeddings:** Transform document chunks into embeddings.
3. **Store Embeddings:** Save these embeddings using FAISS locally.
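
Conceptually, a flat vector index stores one vector per chunk and answers a query by comparing the query vector against every stored vector. The toy sketch below shows that idea with two-dimensional vectors and squared L2 distance; it is an assumption-laden illustration (the real embeddings here have 768 dimensions, and FAISS does this far more efficiently). `squared_l2` and `nearest` are hypothetical names.

```python
def squared_l2(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def nearest(query_vec, vectors, k=3):
    """Return (index, distance) pairs for the k closest stored vectors."""
    scored = [(i, squared_l2(query_vec, v)) for i, v in enumerate(vectors)]
    scored.sort(key=lambda pair: pair[1])  # smallest distance first
    return scored[:k]

# Toy "index" of three chunk vectors
stored = [[0.0, 0.0], [1.0, 0.0], [3.0, 4.0]]
print(nearest([1.0, 1.0], stored, k=2))
```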


In [None]:
# Import libraries
from langchain_huggingface import HuggingFaceEmbeddings
from langchain_community.vectorstores import FAISS
import os
import time

# Step 1: Choose embedding model
embedding_model_name = "sentence-transformers/all-mpnet-base-v2"

# Step 2 & 3: Generate embeddings and store in FAISS
def embed_and_store(chunks, save_path="faiss_index"):
    """Embed and store document chunks using HuggingFaceEmbeddings."""
    print(f"🔄 Loading embedding model: {embedding_model_name}")
    embeddings = HuggingFaceEmbeddings(
        model_name=embedding_model_name,
        model_kwargs={'device': 'cpu'}
    )
    
    print(f"🔄 Creating embeddings for {len(chunks)} chunks... (this may take a minute)")
    start_time = time.time()
    
    # Create FAISS index from documents
    db = FAISS.from_documents(chunks, embeddings)
    
    elapsed = time.time() - start_time
    
    # Save locally
    db.save_local(save_path)
    
    print(f"Embeddings created in {elapsed:.1f} seconds")
    print(f"FAISS index saved to: {os.path.abspath(save_path)}/")
    print(f"Index contains {db.index.ntotal} vectors (dimension: {db.index.d})")
    
    return db, embeddings



In [19]:
# Generate embeddings and save them locally
db, embeddings = embed_and_store(chunks)

🔄 Loading embedding model: sentence-transformers/all-mpnet-base-v2
🔄 Creating embeddings for 301 chunks... (this may take a minute)
✅ Embeddings created in 17.3 seconds
✅ FAISS index saved to: /Users/esmahoney/Desktop/AI_PM_course/ds-rag-pipeline/notebooks/faiss_index/
📊 Index contains 301 vectors (dimension: 768)


---

## **Exercise 5: Retrieval from FAISS**

Here, you will learn how to retrieve documents from a vector database using embeddings.

**Steps:**

1. **Load Embeddings:** Load stored embeddings from the FAISS database.
2. **Implement Retrieval:** Create logic to retrieve relevant chunks based on queries.
3. **Test Retriever:** Execute retrieval using sample queries.
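
The test cells in this exercise repeat the same score-to-label logic with different thresholds. One way to keep that in a single place is a small helper like the sketch below; `label_relevance` and its parameter names are hypothetical, and the default thresholds are just the ones used later in this exercise.

```python
def label_relevance(score, good_below=0.8, fair_below=1.0):
    """Map a distance score to a label; lower distance means a closer match."""
    if score < good_below:
        return "Good"
    if score < fair_below:
        return "Fair"
    return "Poor"

# Usage with the stricter thresholds from the first test cell:
# label_relevance(0.7558, good_below=0.5, fair_below=0.8)
```

Passing the thresholds as arguments makes it easier to re-tune them when the embedding model changes.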

In [24]:
# Step 1: Load the FAISS database
# allow_dangerous_deserialization=True is safe since YOU created this index
db = FAISS.load_local("faiss_index", embeddings, allow_dangerous_deserialization=True)
print(f"✅ Loaded FAISS index with {db.index.ntotal} vectors")

# Step 2: Create retriever
retriever = db.as_retriever(search_kwargs={"k": 3})  # Return top 3 results

# Step 3: Test the retriever WITH SCORES
query = "What is paracetamol used for?"

# Use similarity_search_with_score to get distance scores
# This index uses FAISS's default L2 distance, so a lower score means a closer match
docs_with_scores = db.similarity_search_with_score(query, k=3)

print(f"\n🔍 Query: '{query}'")
print(f"📄 Found {len(docs_with_scores)} relevant chunks:\n")

for i, (doc, score) in enumerate(docs_with_scores, 1):
    source = doc.metadata.get('source', 'Unknown').split('/')[-1]
    # Score interpretation: lower = better match (L2 distance)
    relevance = "🟢 High" if score < 0.5 else "🟡 Medium" if score < 0.8 else "🔴 Low"
    print(f"--- Result {i} ---")
    print(f"Score: {score:.4f} ({relevance} relevance)")
    print(f"Source: {source}")
    print(f"Content: {doc.page_content[:200]}...")
    print()

✅ Loaded FAISS index with 301 vectors

🔍 Query: 'What is paracetamol used for?'
📄 Found 3 relevant chunks:

--- Result 1 ---
Score: 0.7558 (🟡 Medium relevance)
Source: paracetamol.pdf
Content: 202211
178 mm
422 mm
178 mm
422 mm
Front Side Back Side
 Paracetamol 500mg Tablets
178 x 422mm
178 x 30mm
358
202211
NA
Printed Leaﬂet for  Paracetamol 500mg Tablets, Open size: 178 x 422mm, Folding S...

--- Result 2 ---
Score: 0.7811 (🟡 Medium relevance)
Source: paracetamol.pdf
Content: 202211
178 mm
422 mm
178 mm
422 mm
Front Side Back Side
 Paracetamol 500mg Tablets
178 x 422mm
178 x 30mm
358
202211
NA
Printed Leaﬂet for  Paracetamol 500mg Tablets, Open size: 178 x 422mm, Folding S...

--- Result 3 ---
Score: 0.8049 (🔴 Low relevance)
Source: paracetamol.pdf
Content: von Paracetamol im Körper verlangsamt sein kann.
• Schlafmitteln wie Phenobarbital, Mitteln gegen Epilepsie wie Pheny-
toin und Carbamazepin, Mitteln gegen Tuberkulose (Rifampicin), 
anderen, mögliche...



In [25]:
# Test retrieval with multiple queries across both documents
test_queries = [
    "What are the side effects of paracetamol?",
    "What is ReAct and how does it work?",
    "What is the recommended dosage?",
    "How does reasoning help language models?",
]

def test_retrieval_with_scores(query, k=3):
    """Test a query and display results with similarity scores"""
    docs_with_scores = db.similarity_search_with_score(query, k=k)
    print(f"🔍 Query: '{query}'")
    print(f"{'─' * 60}")
    for i, (doc, score) in enumerate(docs_with_scores, 1):
        source = doc.metadata.get('source', 'Unknown').split('/')[-1]
        relevance = "🟢" if score < 0.5 else "🟡" if score < 0.8 else "🔴"
        content = doc.page_content[:120].replace(chr(10), ' ')
        print(f"  [{i}] {relevance} Score: {score:.3f} | {source}")
        print(f"      {content}...")
    print()

# Run all test queries
for query in test_queries:
    test_retrieval_with_scores(query)

🔍 Query: 'What are the side effects of paracetamol?'
────────────────────────────────────────────────────────────
  [1] 🔴 Score: 0.889 | paracetamol.pdf
      202211 178 mm 422 mm 178 mm 422 mm Front Side Back Side  Paracetamol 500mg Tablets 178 x 422mm 178 x 30mm 358 202211 NA ...
  [2] 🔴 Score: 0.898 | paracetamol.pdf
      202211 178 mm 422 mm 178 mm 422 mm Front Side Back Side  Paracetamol 500mg Tablets 178 x 422mm 178 x 30mm 358 202211 NA ...
  [3] 🔴 Score: 0.910 | paracetamol.pdf
      nahme von Paracetamol hat keinen signifikanten Einfluss auf die  Blutungstendenz. Auswirkungen der Einnahme von Paraceta...

🔍 Query: 'What is ReAct and how does it work?'
────────────────────────────────────────────────────────────
  [1] 🔴 Scor

These results show that the model treats the side-effects query and the usage query as similar: both surface the same leaflet header chunks. None of the queries produced a strong match, though. We need to adjust the retrieval setup and aim for more relevant results before moving on to Exercise 6.
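
The next cell switches the retriever to MMR (Maximal Marginal Relevance). As a rough sketch of the idea, MMR picks results greedily, scoring each candidate by its relevance to the query minus a penalty for being similar to results already chosen. The code below is a simplified illustration over unit-length toy vectors, not LangChain's implementation; `mmr_select` and `dot` are hypothetical names, and `lambda_mult` mirrors the parameter of the same name in the retriever settings.

```python
def dot(a, b):
    """Dot product; equals cosine similarity for unit-length vectors."""
    return sum(x * y for x, y in zip(a, b))

def mmr_select(query_vec, doc_vecs, k=2, lambda_mult=0.7):
    """Greedy MMR: return indices of k docs balancing relevance and diversity."""
    selected = []
    candidates = list(range(len(doc_vecs)))
    while candidates and len(selected) < k:
        def mmr_score(i):
            relevance = dot(query_vec, doc_vecs[i])
            # Similarity to the closest already-selected doc (0 if none yet)
            redundancy = max(
                (dot(doc_vecs[i], doc_vecs[j]) for j in selected), default=0.0
            )
            return lambda_mult * relevance - (1 - lambda_mult) * redundancy
        best = max(candidates, key=mmr_score)
        selected.append(best)
        candidates.remove(best)
    return selected

# Docs 0 and 1 are exact duplicates; doc 2 is related but different
docs = [[1.0, 0.0], [1.0, 0.0], [0.6, 0.8]]
print(mmr_select([1.0, 0.0], docs, k=2, lambda_mult=0.3))
```

With `lambda_mult=1.0` the redundancy penalty disappears and the selection reduces to plain similarity ranking, so duplicates can fill the result list.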

In [27]:
# Improved retrieval with MMR (Maximal Marginal Relevance) for diversity
# and fetching more results to filter from
retriever = db.as_retriever(
    search_type="mmr",  # Avoids duplicate/similar results
    search_kwargs={
        "k": 5,           # Return top 5
        "fetch_k": 10,    # Fetch 10 candidates, pick best 5
        "lambda_mult": 0.7  # Balance relevance (1.0) vs diversity (0.0)
    }
)

# Better test function with corrected L2 distance thresholds
def test_retrieval_improved(query, k=3):
    """Test with L2 distance (lower = better match)"""
    # Note: this queries the index directly, so the MMR retriever above is not exercised here
    docs_with_scores = db.similarity_search_with_score(query, k=k)
    
    print(f"🔍 Query: '{query}'")
    print(f"{'─' * 65}")
    
    for i, (doc, score) in enumerate(docs_with_scores, 1):
        source = doc.metadata.get('source', 'Unknown').split('/')[-1]
        # L2 distance: lower = better. Typical good matches are < 1.0
        if score < 0.8:
            relevance = "🟢 Good"
        elif score < 1.0:
            relevance = "🟡 Fair"
        else:
            relevance = "🔴 Poor"
        
        # Flatten newlines so each result fits on one line
        content = doc.page_content[:150].replace(chr(10), ' ').strip()
        
        print(f"  [{i}] {relevance} (L2: {score:.3f}) | {source}")
        print(f"      {content}...")
    print()

# Test queries
for query in test_queries:
    test_retrieval_improved(query)


🔍 Query: 'What are the side effects of paracetamol?'
─────────────────────────────────────────────────────────────────
  [1] 🟡 Fair (L2: 0.889) | paracetamol.pdf
      202211 178 mm 422 mm 178 mm 422 mm Front Side Back Side  Paracetamol 500mg Tablets 178 x 422mm 178 x 30mm 358 202211 NA Printed Leaﬂet for  Paracetamo...
  [2] 🟡 Fair (L2: 0.898) | paracetamol.pdf
      202211 178 mm 422 mm 178 mm 422 mm Front Side Back Side  Paracetamol 500mg Tablets 178 x 422mm 178 x 30mm 358 202211 NA Printed Leaﬂet for  Paracetamo...
  [3] 🟡 Fair (L2: 0.910) | paracetamol.pdf
      nahme von Paracetamol hat keinen signifikanten Einfluss auf die  Blutungstendenz. Auswirkungen der Einnahme von Paracetamol 500 mg Die Apotheke  hilft...

🔍 Query: 'What is ReAct and how does it work?'
─────────────────────────

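These labels look better, but only because the thresholds were relaxed: the test function still queries the index directly, so the retrieved chunks and distances are the same as before and still not great. The first query also returns what looks like identical text with different scores. That is because the header/footer is repeated on each page, so several chunks start with the same boilerplate (I only know this from some research).

To fix this, let's try cleaning up the text and re-chunking.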
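
The cleaning cell that follows uses hand-written regex patterns for this particular leaflet. A more generic way to spot such boilerplate, sketched here under the assumption that headers and footers appear as identical lines on most pages, is to count how many pages each line occurs on. `repeated_lines` and `strip_lines` are hypothetical helpers that work on plain page strings.

```python
from collections import Counter

def repeated_lines(pages, min_fraction=0.8):
    """Return lines that occur on at least min_fraction of the pages."""
    counts = Counter()
    for page in pages:
        # Count each distinct non-empty line once per page
        counts.update({line.strip() for line in page.splitlines() if line.strip()})
    threshold = min_fraction * len(pages)
    return {line for line, n in counts.items() if n >= threshold}

def strip_lines(page, unwanted):
    """Drop every line of the page whose stripped text is in unwanted."""
    return "\n".join(
        line for line in page.splitlines() if line.strip() not in unwanted
    )

pages = ["HEADER\nalpha\nFOOTER", "HEADER\nbeta\nFOOTER", "HEADER\ngamma\nFOOTER"]
noise = repeated_lines(pages)
print([strip_lines(p, noise) for p in pages])
```

This only helps when the document has enough pages for repetition to be meaningful; for a three-page leaflet the regex approach may still be the more reliable option.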



In [30]:
# STEP 1: Clean the document text to remove PDF noise/metadata
import re
from langchain_core.documents import Document

def clean_document_text(doc):
    """Remove PDF metadata noise from document content"""
    text = doc.page_content
    
    # Remove common paracetamol PDF header/footer patterns
    noise_patterns = [
        r'202211\s*\n*178 mm\s*\n*422 mm.*?Bible Paper.*?Solutions\s*\n*\d*\s*\n*Black.*?30mm',  # Full header block
        r'202211\s+178\s*mm\s+422\s*mm.*?30mm',  # Simplified header
        r'Front Side\s+Back Side',
        r'178 x 422mm\s+178 x 30mm',
        r'Printed Lea.*?Folding Size.*?mm',
        r'Speciﬁcation:.*?Germany',
        r'P4S Complete Solutions',
        r'\d{6}\s+178\s*mm\s+422\s*mm',  # Dimension patterns
    ]
    
    for pattern in noise_patterns:
        text = re.sub(pattern, '', text, flags=re.DOTALL | re.IGNORECASE)
    
    # Clean up extra whitespace
    text = re.sub(r'\n{3,}', '\n\n', text)  # Multiple newlines → double
    text = re.sub(r' {2,}', ' ', text)       # Multiple spaces → single
    text = text.strip()
    
    return Document(page_content=text, metadata=doc.metadata)

# Clean all pages
print("🧹 Cleaning document text (removing PDF metadata noise)...")
cleaned_pages = [clean_document_text(page) for page in all_pages]

# Show before/after for first page
print(f"\n📄 Original first page length: {len(all_pages[0].page_content)} chars")
print(f"📄 Cleaned first page length: {len(cleaned_pages[0].page_content)} chars")
print(f"🗑️  Removed {len(all_pages[0].page_content) - len(cleaned_pages[0].page_content)} chars of noise")

# STEP 2: Re-chunk with larger sizes
print("\n🔄 Re-chunking cleaned documents...")
larger_chunks = chunk_documents(cleaned_pages, chunk_size=1000, chunk_overlap=100)
print(f"✅ Created {len(larger_chunks)} chunks")

# STEP 3: Re-embed and store
print("\n🔄 Re-embedding chunks (this may take a minute)...")
db_improved, embeddings = embed_and_store(larger_chunks, save_path="faiss_index_improved")

# Update the retriever
retriever = db_improved.as_retriever(
    search_type="mmr",
    search_kwargs={"k": 5, "fetch_k": 10, "lambda_mult": 0.7}
)

print("\n✅ Ready! Now let's test with CLEANED data:")


🧹 Cleaning document text (removing PDF metadata noise)...

📄 Original first page length: 9982 chars
📄 Cleaned first page length: 9640 chars
🗑️  Removed 342 chars of noise

🔄 Re-chunking cleaned documents...
✅ Created 157 chunks

🔄 Re-embedding chunks (this may take a minute)...
🔄 Loading embedding model: sentence-transformers/all-mpnet-base-v2
🔄 Creating embeddings for 157 chunks... (this may take a minute)
✅ Embeddings created in 17.2 seconds
✅ FAISS index saved to: /Users/esmahoney/Desktop/AI_PM_course/ds-rag-pipeline/notebooks/faiss_index_improved/
📊 Index contains 157 vectors (dimension: 768)

✅ Ready! Now let's test with CLEANED data:


In [31]:
# Test the improved retrieval with larger chunks
def test_improved_retrieval(query, k=3):
    """Test retrieval with the improved index"""
    docs_with_scores = db_improved.similarity_search_with_score(query, k=k)
    
    print(f"🔍 Query: '{query}'")
    print(f"{'─' * 70}")
    
    for i, (doc, score) in enumerate(docs_with_scores, 1):
        source = doc.metadata.get('source', 'Unknown').split('/')[-1]
        relevance = "🟢 Good" if score < 0.8 else "🟡 Fair" if score < 1.0 else "🔴 Poor"
        
        # Show more content since chunks are bigger now
        content = doc.page_content[:250].replace(chr(10), ' ').strip()
        
        print(f"  [{i}] {relevance} (L2: {score:.3f}) | {source}")
        print(f"      {content}...")
    print()

# Compare results with the same queries
print("=" * 70)
print("IMPROVED RETRIEVAL RESULTS (chunk_size=1000)")
print("=" * 70 + "\n")

for query in test_queries:
    test_improved_retrieval(query)


IMPROVED RETRIEVAL RESULTS (chunk_size=1000)

🔍 Query: 'What are the side effects of paracetamol?'
──────────────────────────────────────────────────────────────────────
  [1] 🟡 Fair (L2: 0.983) | paracetamol.pdf
      (entsprechend 1.000 mg  Paracetamol) 26 kg – 32 kg  (Kinder 8 – 11 J) ½ Tablette (entsprechend 250 mg  Paracetamol) Andere Darreichungsformen  sind für diese Patientengruppe  unter Umständen vorteilhafter,  da sie eine genauere Dosie- rung von maxima...
  [2] 🔴 Poor (L2: 1.017) | paracetamol.pdf
      Tabelle. Paracetamol wird in Abhängigkeit von Körpergewicht und  Alter dosiert, in der Regel mit 10 bis 15 mg/kg KG als Einzeldosis, bis  maximal 60 mg/kg KG als Tagesgesamtdosis. Das jeweilige Dosierungsintervall richtet sich nach der Symptomatik  u...
  [3] 🔴 Poor (L2: 1.041) | paracetamol.pdf
      nahme von Para

These results are promising because the queries now return different chunks, but the scores are still poor. This may be because our queries are in English while paracetamol.pdf is in German. Let's test with some German queries.

In [32]:
# Test with GERMAN queries for the German paracetamol document
german_queries = [
    "Was sind die Nebenwirkungen von Paracetamol?",  # What are the side effects?
    "Wie ist die empfohlene Dosierung?",              # What is the recommended dosage?
    "Wann sollte man Paracetamol nicht einnehmen?",   # When should you not take it?
]

print("=" * 70)
print("GERMAN QUERIES (matching document language)")
print("=" * 70 + "\n")

for query in german_queries:
    test_improved_retrieval(query)

print("\n" + "=" * 70)
print("Tip: The paracetamol PDF is in German!")
print("For better results, query in the same language as your documents.")
print("=" * 70)


GERMAN QUERIES (matching document language)

🔍 Query: 'Was sind die Nebenwirkungen von Paracetamol?'
──────────────────────────────────────────────────────────────────────
  [1] 🟢 Good (L2: 0.527) | paracetamol.pdf
      nahme von Paracetamol 500 mg Die Apotheke hilft Schmerztabletten  zusammen mit anderen Arzneimitteln“). Bei Patienten mit verminderter Glutathionreserve (verursacht durch u. a.  Mangelernährung, Schwangerschaft, Lebererkrankung, Blutvergiftung/ Infek...
  [2] 🟢 Good (L2: 0.538) | paracetamol.pdf
      Tabelle. Paracetamol wird in Abhängigkeit von Körpergewicht und  Alter dosiert, in der Regel mit 10 bis 15 mg/kg KG als Einzeldosis, bis  maximal 60 mg/kg KG als Tagesgesamtdosis. Das jeweilige Dosierungsintervall richtet sich nach der Symptomatik  u...
  [3] 🟢 Good (L2: 0.576) | paracetamol.pdf
      sondere Warfarin

Wow! This is a big improvement. The model can now retrieve relevant German passages. However, the dosage question looks worse. Let's try switching to a multilingual model, which you can find here: https://huggingface.co/sentence-transformers/paraphrase-multilingual-mpnet-base-v2

This should show an improvement, but we'll have to update the rating thresholds to match the model.
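
When choosing new thresholds for normalized embeddings, it helps to relate the distance score to cosine similarity. For unit-length vectors, the squared L2 distance and the cosine are linked by $\|a-b\|^2 = 2 - 2\cos(a, b)$. The sketch below checks this numerically on toy vectors, assuming the reported scores are squared L2 distances (which is what FAISS's flat L2 index returns); the helper names are hypothetical.

```python
import math

def normalize(v):
    """Scale a vector to unit length."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]

def cosine_from_squared_l2(d2):
    """For unit vectors: ||a - b||^2 = 2 - 2*cos, so cos = 1 - d2 / 2."""
    return 1.0 - d2 / 2.0

a = normalize([3.0, 4.0])
b = normalize([4.0, 3.0])
d2 = sum((x - y) ** 2 for x, y in zip(a, b))   # squared L2 distance
cos = sum(x * y for x, y in zip(a, b))         # cosine similarity
print(d2, cos, cosine_from_squared_l2(d2))
```

Under that assumption, a score of 1.0 corresponds to a cosine similarity of 0.5, which gives the thresholds a more intuitive reading.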

In [35]:
# Switch to MULTILINGUAL embedding model with NORMALIZATION
print("🌍 Switching to multilingual embedding model...")
print("   Model: paraphrase-multilingual-mpnet-base-v2")
print("   Supports: 50+ languages including English & German\n")

# Create new embeddings with multilingual model
# IMPORTANT: normalize_embeddings=True ensures consistent L2 distance scores
multilingual_model = "sentence-transformers/paraphrase-multilingual-mpnet-base-v2"

multilingual_embeddings = HuggingFaceEmbeddings(
    model_name=multilingual_model,
    model_kwargs={'device': 'cpu'},
    encode_kwargs={'normalize_embeddings': True}  # ← KEY FIX: Normalize!
)

# Re-embed the cleaned chunks with the multilingual model
print("🔄 Re-embedding chunks with multilingual model (this may take a minute)...")
import time
start = time.time()

db_multilingual = FAISS.from_documents(larger_chunks, multilingual_embeddings)
db_multilingual.save_local("faiss_index_multilingual")

print(f"✅ Done in {time.time() - start:.1f} seconds")
print(f"📊 Index contains {db_multilingual.index.ntotal} vectors")

# Update for rest of notebook
db_improved = db_multilingual
embeddings = multilingual_embeddings

print("\n✅ Multilingual model ready (with normalized embeddings)!")


🌍 Switching to multilingual embedding model...
   Model: paraphrase-multilingual-mpnet-base-v2
   Supports: 50+ languages including English & German

🔄 Re-embedding chunks with multilingual model (this may take a minute)...
✅ Done in 3.6 seconds
📊 Index contains 157 vectors

✅ Multilingual model ready (with normalized embeddings)!


In [36]:
# Test multilingual model with BOTH English and German queries
def test_multilingual(query, k=3):
    """Test retrieval with multilingual model (normalized embeddings)"""
    docs_with_scores = db_multilingual.similarity_search_with_score(query, k=k)
    
    print(f"🔍 Query: '{query}'")
    print(f"{'─' * 70}")
    
    for i, (doc, score) in enumerate(docs_with_scores, 1):
        source = doc.metadata.get('source', 'Unknown').split('/')[-1]
        # FAISS's flat L2 index reports squared L2 distance, so for NORMALIZED
        # embeddings the range is 0-4: 0 = identical, 2 = orthogonal, 4 = opposite
        relevance = "🟢 Good" if score < 1.0 else "🟡 Fair" if score < 1.3 else "🔴 Poor"
        content = doc.page_content[:200].replace(chr(10), ' ').strip()
        print(f"  [{i}] {relevance} (L2: {score:.3f}) | {source}")
        print(f"      {content}...")
    print()

# Test with mixed language queries
all_test_queries = [
    # English queries
    "What are the side effects of paracetamol?",
    "What is the recommended dosage?",
    "What is ReAct and how does it work?",
    # German queries  
    "Was sind die Nebenwirkungen von Paracetamol?",
    "Wie ist die empfohlene Dosierung?",
]

print("=" * 70)
print("🌍 MULTILINGUAL MODEL RESULTS (Normalized Embeddings)")
print("   Thresholds: 🟢 < 1.0  |  🟡 < 1.3  |  🔴 ≥ 1.3")
print("=" * 70 + "\n")

for query in all_test_queries:
    test_multilingual(query)


🌍 MULTILINGUAL MODEL RESULTS (Normalized Embeddings)
   Thresholds: 🟢 < 1.0  |  🟡 < 1.3  |  🔴 ≥ 1.3

🔍 Query: 'What are the side effects of paracetamol?'
──────────────────────────────────────────────────────────────────────
  [1] 🟢 Good (L2: 0.470) | paracetamol.pdf
      sondere Warfarin. Daher sollte die langfristige Einnahme von Parace- tamol bei Patienten, die mit Antikoagulanzien behandelt werden,  nur unter medizinischer Aufsicht erfolgen. Die gelegentliche Ein-...
  [2] 🟢 Good (L2: 0.620) | paracetamol.pdf
      PLPARA500DESKRNOW1.1 Stillzeit Paracetamol geht in die Muttermilch über. Da nachteilige Folgen für  den Säugling bisher nicht bekannt geworden sind, wird eine Unter - brechung des Stillens in der Rege...
  [3] 🟢 Good (L2: 0.624) | paracetamol.pdf
      Tabelle. Paracetamol wird in Abhängigkeit von Körp

Now we're cooking! These results look awesome as a possum. Let's make this the base index and overwrite the old one so we can move on to Exercise 6.

In [None]:
# Once you're happy with the improved results, save to the original location
# This will OVERWRITE the old index with the improved one

# Save improved index to the original faiss_index folder
db_improved.save_local("faiss_index")
print("Improved index saved to 'faiss_index/' (overwrote original)")

# Update db to point to the improved version for the rest of the notebook
db = db_improved
chunks = larger_chunks

print(f"New index contains {db.index.ntotal} vectors")
print(f"Using {len(chunks)} chunks (chunk_size=1000)")
print("\nReady to continue with Exercise 6!")


✅ Improved index saved to 'faiss_index/' (overwrote original)
📊 New index contains 157 vectors
📄 Using 157 chunks (chunk_size=1000)

🎉 Ready to continue with Exercise 6!


---

## **Exercise 6: Connecting Retrieval with LLM**

You'll now connect document retrieval with the Language Model.

**Steps:**

1. **Create Retrieval Chain:** Link your retrieval system to your instantiated LLM.
2. **Test the Chain:** Confirm it works by generating answers from retrieved documents.
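
Conceptually, the "stuff documents" step concatenates the retrieved chunks into the `{context}` slot of the prompt and passes the user question alongside it. The sketch below shows that idea with plain strings; it is a simplified illustration, not what `create_stuff_documents_chain` does internally, and `build_prompt` is a hypothetical name.

```python
def build_prompt(question, chunks, template):
    """Join retrieved chunk texts and fill the {context} and {input} slots."""
    context = "\n\n".join(chunks)
    return template.format(context=context, input=question)

template = (
    "Use ONLY the information from the context to answer.\n\n"
    "Context:\n{context}\n\n"
    "Question: {input}"
)

print(build_prompt(
    "What is the recommended dosage?",
    ["Chunk one text.", "Chunk two text."],
    template,
))
```

Because every retrieved chunk is pasted into one prompt, `k` and `chunk_size` together determine how much of the model's context window each question uses.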

In [None]:
# Quick test: Is the LLM working?
print("Testing LLM connection...")

try:
    # Simple test - no RAG, just LLM
    test_response = llm.invoke("Say 'Hello, RAG!' in exactly 3 words.")
    print(f"✅ LLM is working!")
    print(f"   Response: {test_response.content}")
except Exception as e:
    print(f"❌ LLM Error: {type(e).__name__}")
    print(f"   {str(e)[:500]}")
    print("\n💡 Fix: Re-run Cell 1 to reinitialize the LLM")


🔍 Testing LLM connection...
✅ LLM is working!
   Response: Hello RAG.


In [4]:
# ============================================================
# 🔄 QUICK RELOAD - Run this after kernel restart to skip re-embedding!
# ============================================================

from langchain_huggingface import HuggingFaceEmbeddings
from langchain_community.vectorstores import FAISS
from langchain.chains.combine_documents import create_stuff_documents_chain
from langchain.chains.retrieval import create_retrieval_chain
from langchain_core.prompts import ChatPromptTemplate

# Load the multilingual embeddings model
multilingual_embeddings = HuggingFaceEmbeddings(
    model_name="sentence-transformers/paraphrase-multilingual-mpnet-base-v2",
    model_kwargs={'device': 'cpu'},
    encode_kwargs={'normalize_embeddings': True}
)

# Load saved FAISS index from disk (fast!)
db_multilingual = FAISS.load_local(
    "faiss_index_multilingual", 
    multilingual_embeddings, 
    allow_dangerous_deserialization=True
)

# Recreate the RAG chain
system_prompt = """You are a helpful assistant that answers questions based on the provided context.
Use ONLY the information from the context to answer. If the answer is not in the context, say so.
If the context is in German, you may translate key points to English in your answer.

Context:
{context}
"""

prompt = ChatPromptTemplate.from_messages([
    ("system", system_prompt),
    ("human", "{input}")
])

document_chain = create_stuff_documents_chain(llm, prompt)
rag_retriever = db_multilingual.as_retriever(search_kwargs={"k": 3})
rag_chain = create_retrieval_chain(rag_retriever, document_chain)

print("Quick reload complete!")
print("   - Multilingual embeddings loaded")
print("   - FAISS index loaded from disk")
print("   - RAG chain ready")
print("\nYou can now use the chat in Cell 32!")


✅ Quick reload complete!
   - Multilingual embeddings loaded
   - FAISS index loaded from disk
   - RAG chain ready

You can now use the chat in Cell 32!


In [None]:
# Import chain components
from langchain.chains.combine_documents import create_stuff_documents_chain
from langchain.chains.retrieval import create_retrieval_chain
from langchain_core.prompts import ChatPromptTemplate

# Step 1: Create a prompt template for the LLM
system_prompt = """You are a helpful assistant that answers questions based on the provided context.
Use ONLY the information from the context to answer. If the answer is not in the context, say so.
If the context is in German, you may translate key points to English in your answer.

Context:
{context}
"""

prompt = ChatPromptTemplate.from_messages([
    ("system", system_prompt),
    ("human", "{input}")
])

# Step 2: Create the document chain (passes docs to LLM)
document_chain = create_stuff_documents_chain(llm, prompt)

# Step 3: Create retriever from our multilingual index
rag_retriever = db_multilingual.as_retriever(search_kwargs={"k": 3})

# Step 4: Create the full retrieval chain
rag_chain = create_retrieval_chain(rag_retriever, document_chain)

print("✅ RAG Chain created!")
print("Input: User question")
print("Step 1: Retrieve relevant documents")
print("Step 2: Pass docs + question to LLM")
print("Output: Generated answer")


✅ RAG Chain created!
   📥 Input: User question
   🔍 Step 1: Retrieve relevant documents
   🤖 Step 2: Pass docs + question to LLM
   📤 Output: Generated answer


In [None]:
# Test the RAG chain with sample questions
def ask_question(question):
    """Ask a question and get an answer from the RAG system"""
    print(f"Question: {question}")
    print("─" * 60)
    
    # Invoke the chain
    response = rag_chain.invoke({"input": question})
    
    # Display the answer
    print(f"Answer:\n{response['answer']}")
    
    # Show which documents were used
    print(f"\nSources used ({len(response['context'])} chunks):")
    for i, doc in enumerate(response['context'], 1):
        source = doc.metadata.get('source', 'Unknown').split('/')[-1]
        print(f"   [{i}] {source}")
    print("=" * 60 + "\n")

# Test with different questions
test_questions = [
    "What is paracetamol used for?",
    "What are the side effects of paracetamol?",
    "What is ReAct and how does it improve language models?",
]

for q in test_questions:
    ask_question(q)

❓ Question: What is paracetamol used for?
────────────────────────────────────────────────────────────
💬 Answer:
The context does not explicitly state the main use of paracetamol. However, it is commonly known that paracetamol (also known as acetaminophen) is used as a pain reliever and a fever reducer.

📚 Sources used (3 chunks):
   [1] paracetamol.pdf
   [2] paracetamol.pdf
   [3] paracetamol.pdf

❓ Question: What are the side effects of paracetamol?
────────────────────────────────────────────────────────────
💬 Answer:
The context does not provide a comprehensive list of side effects of paracetamol. However, it mentions the following potential effects:

- Die Apotheke hilft Schmerztabletten kann die Harnsäure- sowie die Blutzuckerbes

---

## **Exercise 7: Interactive Chat System**

In this final exercise, you'll build an interactive chat-based Q&A system on top of the RAG chain from the previous exercise.

**Steps:**

1. **Create Chat Interface:** Develop a simple function for interactive querying.
2. **Run the Chat:** Allow users to ask questions and receive immediate responses.


In [None]:
# Interactive Chat Function
def chat_with_documents():
    """Interactive chat interface for querying documents"""
    print("=" * 60)
    print("RAG Document Chat System")
    print("=" * 60)
    print("Ask questions about:")
    print("Paracetamol (German medical document)")
    print("ReAct paper (English research paper)")
    print("\nType 'quit', 'exit', or 'q' (or press Enter on an empty line) to end the chat.")
    print("Type 'sources' to show/hide source citations.")
    print("=" * 60 + "\n")
    
    show_sources = True
    
    while True:
        # Get user input
        user_input = input("You: ").strip()
        
        # Check for exit commands
        if user_input.lower() in ['quit', 'exit', 'q', '']:
            print("\nGoodbye! Thanks for chatting.")
            break
        
        # Toggle sources display
        if user_input.lower() == 'sources':
            show_sources = not show_sources
            print(f"Source citations: {'ON' if show_sources else 'OFF'}\n")
            continue
        
        # Process the question
        try:
            print("\nSearching documents...")
            response = rag_chain.invoke({"input": user_input})
            
            # Display answer
            print(f"\nAssistant: {response['answer']}")
            
            # Show sources if enabled
            if show_sources and response.get('context'):
                # Sort the de-duplicated file names so the output order is stable
                sources = sorted(set(doc.metadata.get('source', 'Unknown').split('/')[-1]
                                     for doc in response['context']))
                print(f"\n   📚 Sources: {', '.join(sources)}")
            
            print()  # Empty line for readability
            
        except Exception as e:
            print(f"\nError: {str(e)[:200]}")
            print("   Try rephrasing your question.\n")
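
Calling `chat_with_documents()` in a cell starts the loop. Because that function reads from `input()` and uses the global `rag_chain` directly, it is awkward to test without a live model. The sketch below is one possible variant, not part of the original exercise: `run_chat`, `unique_sources`, `FakeChain`, and `FakeDoc` are hypothetical names, and the only assumption about the chain is that `invoke` returns a dict with `answer` and `context` keys, as in the cells above. Unlike the loop above, it skips empty lines instead of treating them as an exit command.

```python
from typing import Callable, Iterable, List


def unique_sources(docs: Iterable) -> List[str]:
    """Return de-duplicated source file names in first-seen order."""
    seen: List[str] = []
    for doc in docs:
        # Normalise Windows separators so the last path component is the file name.
        path = str(doc.metadata.get("source", "Unknown")).replace("\\", "/")
        name = path.split("/")[-1]
        if name not in seen:
            seen.append(name)
    return seen


def run_chat(chain,
             read: Callable[[str], str] = input,
             write: Callable[[str], None] = print) -> int:
    """Run the question loop until an exit command; return questions answered."""
    answered = 0
    while True:
        try:
            user_input = read("You: ").strip()
        except (EOFError, KeyboardInterrupt):
            break  # treat end-of-input or Ctrl+C as a normal exit
        if user_input.lower() in {"quit", "exit", "q"}:
            break
        if not user_input:
            continue  # ignore empty lines instead of ending the chat
        response = chain.invoke({"input": user_input})
        write(f"Assistant: {response['answer']}")
        sources = unique_sources(response.get("context", []))
        if sources:
            write(f"Sources: {', '.join(sources)}")
        answered += 1
    write("Goodbye!")
    return answered


# Hypothetical stand-ins so the loop can be exercised without an API key.
class FakeDoc:
    def __init__(self, source: str):
        self.metadata = {"source": source}


class FakeChain:
    def invoke(self, payload: dict) -> dict:
        doc = FakeDoc("data/react_paper.pdf")
        return {"answer": f"echo: {payload['input']}", "context": [doc, doc]}


# Scripted "user" input: one question, one empty line, then an exit command.
scripted = iter(["what is react?", "", "quit"])
lines: List[str] = []
count = run_chat(FakeChain(), read=lambda prompt: next(scripted), write=lines.append)
```

In the notebook, `run_chat(rag_chain)` would use the real chain with the default `input` and `print`.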

In [6]:
# ============================================================
# SIMPLE CHAT - Just edit the question and re-run this cell!
# ============================================================

# 👇 CHANGE YOUR QUESTION HERE and re-run the cell (Shift+Enter)
question = "what is react?"

# ============================================================
print(f"Your Question: {question}")
print("─" * 60)

# Get the answer
response = rag_chain.invoke({"input": question})

# Display the answer
print(f"\nAnswer:\n{response['answer']}")

# Show sources
# Sort the de-duplicated file names so the output order is stable
sources = sorted(set(doc.metadata.get('source', 'Unknown').split('/')[-1]
                     for doc in response['context']))
print(f"\nSources: {', '.join(sources)}")
print("\n" + "=" * 60)
print("💡 To ask another question: Edit 'question' above and re-run!")

Your Question: what is react?
────────────────────────────────────────────────────────────

Answer:
The text does not explicitly define what "ReAct" is. However, based on the context, it appears to be a system or model that is capable of solving tasks, particularly the HotpotQA task, which involves question answering and reasoning.

In the text, ReAct is mentioned as a system that can be inspected and edited by a human, and it is compared to another system called CoT. ReAct is also described as having a more grounded and trustworthy approach to problem-solving, thanks to its access to an external knowledge base.

It is likely that ReAct is a type of artificial intelligence or machine learning model, but the specific details of its architecture and capabilities are not provided in the text.

Sources: react_paper.pdf

💡 To ask another question: Edit 'question' above and re-run!

---

## **Conclusion & Reflection**

After completing these exercises:

- Summarize key concepts learned.
- Reflect on the effectiveness and limitations of the free LLM and RAG system you've built.
- Consider how you might improve or extend your system in practical applications.
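
One way to make this reflection concrete is to tabulate, for each test question, which sources were retrieved and whether the answer admits that the context was insufficient (as several of the sample answers above do). The following is a minimal sketch under stated assumptions: `summarize_run`, `StubChain`, and the `FALLBACK_PHRASES` list are hypothetical and purely illustrative, and the chain is assumed to return a dict with `answer` and `context` keys as in the earlier cells.

```python
from types import SimpleNamespace

# Hypothetical phrases suggesting the model could not ground its answer
# in the retrieved chunks; tune these to the wording your model uses.
FALLBACK_PHRASES = ("does not explicitly", "does not provide", "not provided in the")


def summarize_run(chain, questions):
    """Return one summary row per question: sources used and a fallback flag."""
    rows = []
    for question in questions:
        response = chain.invoke({"input": question})
        answer = response["answer"].lower()
        sources = sorted({
            doc.metadata.get("source", "Unknown").split("/")[-1]
            for doc in response.get("context", [])
        })
        rows.append({
            "question": question,
            "sources": sources,
            "flagged": any(phrase in answer for phrase in FALLBACK_PHRASES),
        })
    return rows


# Stub chain so the sketch runs without a model or vector store.
class StubChain:
    def invoke(self, payload):
        doc = SimpleNamespace(metadata={"source": "docs/paracetamol.pdf"})
        return {
            "answer": "The context does not explicitly state the main use of paracetamol.",
            "context": [doc, doc, doc],
        }


report = summarize_run(StubChain(), ["What is paracetamol used for?"])
for row in report:
    print(row)
```

Passing `rag_chain` and `test_questions` instead of the stub would give a quick overview of how often the retrieved chunks failed to answer a question, which is a starting point for tuning chunking, retrieval depth, or the prompt.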

---