# RAG Research Agent Demo

This notebook demonstrates the Retrieval-Augmented Generation (RAG) pipeline for healthcare document analysis.

## 1. 🔧 Setup

First, let's import the necessary modules and load environment variables.

In [1]:
import os
import sys
from pathlib import Path
from dotenv import load_dotenv

# Add parent directory to path to import our modules
sys.path.append(str(Path.cwd().parent))

# Import our modules
from tools import load_and_chunk_pdf
from bedrock_wrapper import embed_texts, generate_answer
from retriever import RAGRetriever
from rag_pipeline import generate_answer_with_rag

# Load environment variables
load_dotenv()

# Get S3 bucket name
S3_BUCKET = os.getenv("S3_BUCKET_NAME")
if not S3_BUCKET:
    raise ValueError("Please set S3_BUCKET_NAME environment variable")

### List Available PDF Files

Let's see what PDF files are available in our S3 bucket.

In [2]:
import boto3

# Initialize S3 client
s3 = boto3.client('s3')

# List objects in the bucket (a single call returns at most 1,000 keys)
response = s3.list_objects_v2(Bucket=S3_BUCKET)

print("Available PDF files in bucket:")
for obj in response.get('Contents', []):
    if obj['Key'].lower().endswith('.pdf'):
        print(f" - {obj['Key']}")

Available PDF files in bucket:
 - AdvaMed-AI-White-Paper-Final.pdf
 - isqua-white-paper-on-patient-safety-in-healthcare-organisations.pdf
 - mhs-iv-patient-safety-practices-year-2.pdf
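
A single `list_objects_v2` call returns at most 1,000 keys, so the listing above could miss files in a larger bucket. The following is a minimal sketch of a paginated variant. The helper name `iter_pdf_keys` and the sample pages are hypothetical; the helper accepts any iterable of response pages, so it can be tried without AWS credentials.

```python
from typing import Iterable, Iterator


def iter_pdf_keys(pages: Iterable[dict]) -> Iterator[str]:
    """Yield keys ending in .pdf from an iterable of list_objects_v2 pages."""
    for page in pages:
        # A page for an empty bucket or prefix has no 'Contents' entry
        for obj in page.get("Contents", []):
            if obj["Key"].lower().endswith(".pdf"):
                yield obj["Key"]


# With boto3, the pages would come from a paginator (not executed here):
#   paginator = s3.get_paginator("list_objects_v2")
#   pdf_keys = list(iter_pdf_keys(paginator.paginate(Bucket=S3_BUCKET)))

# Stand-in pages (hypothetical data) so the sketch runs offline
sample_pages = [
    {"Contents": [{"Key": "a.pdf"}, {"Key": "notes.txt"}]},
    {"Contents": [{"Key": "reports/B.PDF"}]},
    {},
]
print(list(iter_pdf_keys(sample_pages)))
```

Keeping the filtering separate from the S3 call means the same function works for both the single-response case and the paginated case.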


## 2. 📥 Load & Embed Documents

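Let's load one of the PDFs from S3 and split it into chunks to demonstrate the document processing pipeline. The index that the later sections query is built separately by `embed_and_store_chunks.py`.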

In [5]:
# Choose a PDF to process
pdf_key = "AdvaMed-AI-White-Paper-Final.pdf"

# Load and chunk the PDF
print(f"Processing {pdf_key}...")
chunks = load_and_chunk_pdf(S3_BUCKET, pdf_key)
print(f"✅ Created {len(chunks)} chunks")

# Display example chunk
print("Example chunk:")
print(f"Source: {chunks[0]['source']}")
print(f"Chunk ID: {chunks[0]['chunk_id']}")
print(f"Text preview: {chunks[0]['text'][:200]}...")

Processing AdvaMed-AI-White-Paper-Final.pdf...
✅ Created 66 chunks
Example chunk:
Source: AdvaMed-AI-White-Paper-Final.pdf
Chunk ID: 0
Text preview:      The Role of Artificial Intelligence (AI) in Healthcare 

Executive Summary 

Artificial intelligence (AI) applied to healthcare, driven by innovative medical technology, has and will 
continue to...
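
The implementation of `load_and_chunk_pdf` is not shown in this notebook. As an illustration only, one common approach is to slide a fixed-size character window over the extracted text with some overlap between neighbours. In the sketch below, `chunk_text`, `chunk_size`, and `overlap` are hypothetical names and defaults; only the dictionary keys `source`, `chunk_id`, and `text` are taken from the output above.

```python
from __future__ import annotations


def chunk_text(text: str, source: str, chunk_size: int = 1000, overlap: int = 200) -> list[dict]:
    """Split text into overlapping character windows (illustrative, not the project's code)."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must satisfy 0 <= overlap < chunk_size")

    step = chunk_size - overlap  # distance between consecutive window starts
    chunks = []
    for chunk_id, start in enumerate(range(0, len(text), step)):
        chunks.append(
            {"source": source, "chunk_id": chunk_id, "text": text[start:start + chunk_size]}
        )
        # Stop once a window reaches the end, to avoid a trailing fragment
        # that is fully contained in the previous chunk
        if start + chunk_size >= len(text):
            break
    return chunks


demo = chunk_text("abcdefghij", source="demo.txt", chunk_size=4, overlap=1)
for chunk in demo:
    print(chunk["chunk_id"], chunk["text"])
```

The overlap gives each chunk some of its neighbour's text, which helps when a relevant sentence straddles a window boundary.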


## 3. 🔍 Run a Query (RAG)

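Now let's use the RAG pipeline to answer a question about healthcare AI. Run `embed_and_store_chunks.py` first: without the index and stored chunks it creates, the call below raises a `ValueError`.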

In [7]:
# Example query
query = "What are the key considerations for AI in medical devices according to the FDA?"

# Get answer using RAG
result = generate_answer_with_rag(query)

print("Question:", query)
print("Answer:", result["answer"])
print("Sources:")
for source in result["sources"]:
    print(f" - {source}")

ValueError: Index and chunks not found. Please run embed_and_store_chunks.py first.
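
The internals of `generate_answer_with_rag` are not shown here. Conceptually, a RAG call retrieves relevant chunks, places them in the prompt, and returns the answer together with its sources. The sketch below illustrates only the prompt-assembly and source-collection steps. The function names `build_prompt` and `collect_sources`, the prompt wording, and the sample chunks are assumptions, not the project's actual code.

```python
from __future__ import annotations


def build_prompt(query: str, chunks: list[dict]) -> str:
    """Combine retrieved chunks and the question into one prompt string."""
    context = "\n\n".join(
        f"[{i}] ({chunk['source']}) {chunk['text']}" for i, chunk in enumerate(chunks, 1)
    )
    return (
        "Answer the question using only the context below.\n\n"
        f"Context:\n{context}\n\n"
        f"Question: {query}\nAnswer:"
    )


def collect_sources(chunks: list[dict]) -> list[str]:
    """Return unique source names in the order they first appear."""
    return list(dict.fromkeys(chunk["source"] for chunk in chunks))


# Hypothetical retrieved chunks, shaped like the chunk dictionaries shown earlier
sample_chunks = [
    {"source": "a.pdf", "chunk_id": 0, "text": "alpha"},
    {"source": "b.pdf", "chunk_id": 3, "text": "beta"},
    {"source": "a.pdf", "chunk_id": 7, "text": "gamma"},
]
prompt = build_prompt("What is covered?", sample_chunks)
print(prompt)
print(collect_sources(sample_chunks))
```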

### Display Retrieved Chunks

Let's look at the actual chunks that were retrieved to provide context for the answer.

In [9]:
# Initialize retriever
retriever = RAGRetriever()

# Retrieve chunks (separate name so the chunks loaded in section 2 are not overwritten)
retrieved_chunks = retriever.retrieve(query)

print("Retrieved chunks:")
for i, chunk in enumerate(retrieved_chunks, 1):
    print(f"Chunk {i}:")
    print(f"Source: {chunk['source']}")
    print(f"Text: {chunk['text'][:300]}...")

ValueError: Index and chunks not found. Please run embed_and_store_chunks.py first.
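
How `RAGRetriever` ranks chunks is not shown in this notebook. Retrievers of this kind typically compare the query embedding against stored chunk embeddings and return the closest matches. The following is a minimal pure-Python sketch of that idea using cosine similarity. `top_k_chunks` and the toy 2-D vectors are hypothetical; a real system would obtain vectors from an embedding model such as the one behind `embed_texts` and would likely use a vector index rather than a full sort.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def top_k_chunks(query_vec, chunk_vecs, chunks, k=3):
    """Return the k chunks whose vectors are most similar to the query vector."""
    scored = sorted(
        zip(chunk_vecs, chunks),
        key=lambda pair: cosine_similarity(query_vec, pair[0]),
        reverse=True,
    )
    return [chunk for _, chunk in scored[:k]]


# Toy data: 2-D vectors stand in for real embeddings
toy_chunks = [{"text": "A"}, {"text": "B"}, {"text": "C"}]
toy_vecs = [[1.0, 0.0], [0.0, 1.0], [0.9, 0.1]]
best = top_k_chunks([1.0, 0.0], toy_vecs, toy_chunks, k=2)
print([chunk["text"] for chunk in best])
```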

## 4. ⚖️ Compare RAG vs. No-RAG (baseline)

Let's compare the RAG answer with a baseline answer that doesn't use retrieved context.

In [10]:
# Get baseline answer (no context)
baseline_answer = generate_answer(query, [])

print("Question:", query)
# `result` comes from the RAG query cell in section 3
print("\n=== RAG Answer ===")
print(result["answer"])
print("\n=== Baseline Answer (No Context) ===")
print(baseline_answer)

## 5. ✅ Wrap Up

Let's try one more example query to demonstrate the system's capabilities.

In [11]:
# Another example query
query2 = "What are the best practices for implementing AI in healthcare organizations?"

# Get answer using RAG
result2 = generate_answer_with_rag(query2)

print("Question:", query2)
print("\nAnswer:", result2["answer"])
print("\nSources:")
for source in result2["sources"]:
    print(f" - {source}")