# Historical Document QA System Using Retrieval-Augmented Generation

**Author:** --, --, --, --  
**Course:** Selective Topics  
**Date:** January 2025

## Abstract

This notebook implements an end-to-end Retrieval-Augmented Generation (RAG) system for answering questions about U.S. Presidential State of the Union addresses. The system combines dense vector retrieval with large language model generation to produce source-grounded answers to historical queries.

### Key Features
- **Corpus**: 244 State of the Union addresses (1790-2018) from public datasets
- **Embeddings**: 384-dimensional semantic vectors using sentence-transformers
- **Vector Store**: FAISS inverted-file (IVF) index for approximate nearest-neighbor search
- **LLM**: Google Gemini 2.5 Flash for answer synthesis
- **Evaluation**: Mean retrieval score reported as a confidence value, plus source attribution

---

## 1. Installation and Setup

In [None]:
!pip install -q google-generativeai sentence-transformers faiss-cpu pandas numpy requests


In [None]:
import os
import re
import json
import requests
import numpy as np
import pandas as pd
from typing import List, Dict, Tuple, Optional
from dataclasses import dataclass, asdict
from io import StringIO

from sentence_transformers import SentenceTransformer
import faiss
import google.generativeai as genai

import warnings
warnings.filterwarnings('ignore')


All support for the `google.generativeai` package has ended. It will no longer be receiving 
updates or bug fixes. Please switch to the `google.genai` package as soon as possible.
See README for more details:

https://github.com/google-gemini/deprecated-generative-ai-python/blob/main/README.md



## 2. Data Acquisition

Load historical State of the Union addresses from public repositories.

In [None]:
def load_corpus() -> pd.DataFrame:
    """
    Load State of the Union addresses from GitHub repositories.

    Returns:
        DataFrame with columns: text, president, year
    """
    print("Loading corpus from web sources...")

    # Primary data sources
    urls = [
        "https://raw.githubusercontent.com/BrianWeinstein/state-of-the-union/master/transcripts.csv",
        "https://raw.githubusercontent.com/fivethirtyeight/data/master/state-of-the-union/sotu_text.csv"
    ]

    for url in urls:
        try:
            response = requests.get(url, timeout=30)
            response.raise_for_status()

            df = pd.read_csv(StringIO(response.text))
            df.columns = df.columns.str.lower().str.strip()

            # Standardize column names
            for col in ['text', 'transcript', 'contents', 'speech']:
                if col in df.columns:
                    df['text'] = df[col]
                    break

            for col in ['president', 'name', 'speaker']:
                if col in df.columns:
                    df['president'] = df[col]
                    break

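            for col in ['year', 'date', 'delivered']:
                if col in df.columns:
                    if col == 'year':
                        # A numeric year column must not be parsed as a timestamp
                        df['year'] = pd.to_numeric(df[col], errors='coerce')
                    else:
                        df['year'] = pd.to_datetime(df[col], errors='coerce').dt.year
                    break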

            # Filter valid documents
            df = df[df['text'].notna() & (df['text'].str.len() > 100)].copy()

            if len(df) >= 10:
                print(f"Successfully loaded {len(df)} documents")
                print(f"Date range: {df['year'].min():.0f} - {df['year'].max():.0f}")
                return df

        except Exception as e:
            print(f"Source failed: {e}")
            continue

    raise RuntimeError("All data sources failed. Check internet connection.")

corpus_df = load_corpus()
print(f"\nCorpus statistics:")
print(f"- Total documents: {len(corpus_df)}")
print(f"- Total characters: {corpus_df['text'].str.len().sum():,}")
print(f"- Average length: {corpus_df['text'].str.len().mean():.0f} chars")

corpus_df.head()

Loading corpus from web sources...
Successfully loaded 244 documents
Date range: 1790 - 2018

Corpus statistics:
- Total documents: 244
- Total characters: 23,851,052
- Average length: 97750 chars


| | date | president | title | url | transcript | text | year |
|---|---|---|---|---|---|---|---|
| 0 | 2018-01-30 | Donald J. Trump | Address Before a Joint Session of the Congress... | https://www.cnn.com/2018/01/30/politics/2018-s... | \nMr. Speaker, Mr. Vice President, Members of ... | \nMr. Speaker, Mr. Vice President, Members of ... | 2018 |
| 1 | 2017-02-28 | Donald J. Trump | Address Before a Joint Session of the Congress | http://www.presidency.ucsb.edu/ws/index.php?pi... | Thank you very much. Mr. Speaker, Mr. Vice Pre... | Thank you very much. Mr. Speaker, Mr. Vice Pre... | 2017 |
| 2 | 2016-01-12 | Barack Obama | Address Before a Joint Session of the Congress... | http://www.presidency.ucsb.edu/ws/index.php?pi... | Thank you. Mr. Speaker, Mr. Vice President, Me... | Thank you. Mr. Speaker, Mr. Vice President, Me... | 2016 |
| 3 | 2015-01-20 | Barack Obama | Address Before a Joint Session of the Congress... | http://www.presidency.ucsb.edu/ws/index.php?pi... | The President. Mr. Speaker, Mr. Vice President... | The President. Mr. Speaker, Mr. Vice President... | 2015 |
| 4 | 2014-01-28 | Barack Obama | Address Before a Joint Session of the Congress... | http://www.presidency.ucsb.edu/ws/index.php?pi... | The President. Mr. Speaker, Mr. Vice President... | The President. Mr. Speaker, Mr. Vice President... | 2014 |


## 3. Document Processing

Split each address into overlapping chunks of about 600 characters (with 150 characters of overlap), breaking at sentence boundaries.

In [None]:
@dataclass
class DocumentChunk:
    """Represents a document chunk with metadata."""
    text: str
    president: str
    year: int
    chunk_id: int
    doc_id: int

class DocumentProcessor:
    """Handles document chunking with sentence-aware splitting."""

    @staticmethod
    def split_sentences(text: str) -> List[str]:
        """Split text into sentences, handling common abbreviations."""
        # Protect abbreviations
        text = re.sub(r'\bMr\.', 'Mr<prd>', text)
        text = re.sub(r'\bMrs\.', 'Mrs<prd>', text)
        text = re.sub(r'\bDr\.', 'Dr<prd>', text)
        # Protect both periods so "U.S." is restored intact rather than as "US."
        text = re.sub(r'\bU\.S\.', 'U<prd>S<prd>', text)

        sentences = re.split(r'(?<=[.!?])\s+(?=[A-Z])', text)
        sentences = [s.replace('<prd>', '.').strip() for s in sentences if s.strip()]

        return sentences

    @staticmethod
    def create_chunks(text: str,
                     president: str,
                     year: int,
                     doc_id: int,
                     chunk_size: int = 600,
                     overlap: int = 150) -> List[DocumentChunk]:
        """
        Create overlapping chunks with sentence boundaries.

        Args:
            text: Document text
            president: President name
            year: Year of address
            doc_id: Document identifier
            chunk_size: Target chunk size in characters
            overlap: Overlap between chunks in characters
        """
        sentences = DocumentProcessor.split_sentences(text)
        chunks = []

        current_chunk = []
        current_length = 0
        chunk_id = 0

        for sentence in sentences:
            sentence_length = len(sentence)

            if current_length + sentence_length > chunk_size and current_chunk:
                chunk_text = ' '.join(current_chunk)
                chunks.append(DocumentChunk(
                    text=chunk_text,
                    president=president,
                    year=year,
                    chunk_id=chunk_id,
                    doc_id=doc_id
                ))

                # Maintain overlap
                overlap_text = chunk_text[-overlap:] if len(chunk_text) > overlap else chunk_text
                overlap_sentences = DocumentProcessor.split_sentences(overlap_text)

                current_chunk = overlap_sentences + [sentence]
                current_length = sum(len(s) for s in current_chunk)
                chunk_id += 1
            else:
                current_chunk.append(sentence)
                current_length += sentence_length

        if current_chunk:
            chunks.append(DocumentChunk(
                text=' '.join(current_chunk),
                president=president,
                year=year,
                chunk_id=chunk_id,
                doc_id=doc_id
            ))

        return chunks

def process_corpus(df: pd.DataFrame, chunk_size: int = 600, overlap: int = 150) -> List[DocumentChunk]:
    """Process entire corpus into chunks."""
    all_chunks = []

    for doc_id, row in df.iterrows():
        chunks = DocumentProcessor.create_chunks(
            text=row['text'],
            president=row.get('president', 'Unknown'),
            year=int(row.get('year', 0)) if pd.notna(row.get('year')) else 0,
            doc_id=doc_id,
            chunk_size=chunk_size,
            overlap=overlap
        )
        all_chunks.extend(chunks)

    print(f"Created {len(all_chunks)} chunks from {len(df)} documents")
    print(f"Average chunk size: {np.mean([len(c.text) for c in all_chunks]):.0f} characters")

    return all_chunks

document_chunks = process_corpus(corpus_df)

Created 62151 chunks from 244 documents
Average chunk size: 533 characters
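
The overlap step in `create_chunks` slices the last 150 characters of the previous chunk, so a new chunk can begin in the middle of a word. One alternative is to carry over whole trailing sentences instead. The following is a minimal sketch of that idea, not the notebook's pipeline: the function name `chunk_with_sentence_overlap` and its plain-string output are hypothetical, and sizes are measured in characters as in `create_chunks`.

```python
from typing import List


def chunk_with_sentence_overlap(sentences: List[str],
                                chunk_size: int = 600,
                                overlap: int = 150) -> List[str]:
    """Group sentences into chunks, carrying whole trailing sentences forward.

    Hypothetical helper for illustration; assumes overlap < chunk_size.
    """
    chunks: List[str] = []
    current: List[str] = []
    current_length = 0

    for sentence in sentences:
        if current and current_length + len(sentence) > chunk_size:
            chunks.append(' '.join(current))

            # Walk backwards, keeping complete sentences until the
            # overlap budget (in characters) would be exceeded.
            carried: List[str] = []
            carried_length = 0
            for previous in reversed(current):
                if carried_length + len(previous) > overlap:
                    break
                carried.insert(0, previous)
                carried_length += len(previous)

            current = carried
            current_length = carried_length

        current.append(sentence)
        current_length += len(sentence)

    if current:
        chunks.append(' '.join(current))
    return chunks


demo = ["Alpha one.", "Beta two.", "Gamma three.", "Delta four."]
demo_chunks = chunk_with_sentence_overlap(demo, chunk_size=25, overlap=12)
print(demo_chunks)
```

With this approach every chunk starts at a sentence boundary, at the cost of a variable overlap length: if the last sentence alone exceeds the overlap budget, nothing is carried over.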


## 4. Embedding Generation

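Generate dense vector representations with the pre-trained `all-MiniLM-L6-v2` sentence-transformer model. Embeddings are L2-normalized so that the inner product of two vectors equals their cosine similarity.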

In [None]:
# Load embedding model
print("Loading sentence transformer model...")
embedding_model = SentenceTransformer('all-MiniLM-L6-v2')
embedding_dim = embedding_model.get_sentence_embedding_dimension()
print(f"Embedding dimension: {embedding_dim}")

def generate_embeddings(chunks: List[DocumentChunk], batch_size: int = 64) -> np.ndarray:
    """
    Generate normalized embeddings for document chunks.

    Args:
        chunks: List of document chunks
        batch_size: Batch size for encoding

    Returns:
        Normalized embedding matrix of shape (n_chunks, embedding_dim)
    """
    texts = [chunk.text for chunk in chunks]

    embeddings = embedding_model.encode(
        texts,
        batch_size=batch_size,
        show_progress_bar=True,
        convert_to_numpy=True,
        normalize_embeddings=True  # L2 normalization for cosine similarity
    )

    print(f"Generated embeddings: {embeddings.shape}")
    return embeddings

chunk_embeddings = generate_embeddings(document_chunks)

Loading sentence transformer model...


Embedding dimension: 384


Generated embeddings: (62151, 384)
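
The reason for passing `normalize_embeddings=True` is that, for unit-length vectors, the inner product and the cosine similarity coincide, so an inner-product index can serve cosine search. The sketch below checks this numerically with NumPy only; the random vectors are a stand-in for real embeddings and are purely illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)
# Stand-in for raw (unnormalized) embeddings; values are random, not real data
vectors = rng.normal(size=(4, 384)).astype('float32')

# L2-normalize each row so every vector has unit length
unit = vectors / np.linalg.norm(vectors, axis=1, keepdims=True)

# Cosine similarity computed from the raw vectors
norms = np.linalg.norm(vectors, axis=1)
cosine = (vectors @ vectors.T) / np.outer(norms, norms)

# Inner product of the normalized vectors
inner = unit @ unit.T

max_gap = float(np.abs(cosine - inner).max())
print(f"Largest difference between cosine and inner product: {max_gap:.2e}")
```

The two matrices agree up to floating-point rounding, which is why the vector store can rank by inner product.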


## 5. Vector Store Implementation

Build a FAISS index for similarity search: an inverted-file (IVF) index when the corpus exceeds 10,000 vectors, and an exact flat index otherwise.

In [None]:
class VectorStore:
    """
    FAISS-based vector store for semantic similarity search.
    """

    def __init__(self, embeddings: np.ndarray, chunks: List[DocumentChunk]):
        """
        Initialize vector store with embeddings.

        Args:
            embeddings: Normalized embedding matrix
            chunks: Corresponding document chunks
        """
        self.chunks = chunks
        self.dimension = embeddings.shape[1]
        self.n_vectors = len(embeddings)

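        # Use IVF index for large datasets
        if self.n_vectors > 10000:
            nlist = min(int(np.sqrt(self.n_vectors)), 100)
            quantizer = faiss.IndexFlatIP(self.dimension)
            # IndexIVFFlat defaults to METRIC_L2; request inner product explicitly so
            # scores are cosine similarities (higher is better) on normalized vectors
            self.index = faiss.IndexIVFFlat(
                quantizer, self.dimension, nlist, faiss.METRIC_INNER_PRODUCT
            )
            vectors = embeddings.astype('float32')
            self.index.train(vectors)
            self.index.add(vectors)
            self.index.nprobe = min(10, nlist)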
        else:
            # Flat index for smaller datasets
            self.index = faiss.IndexFlatIP(self.dimension)
            self.index.add(embeddings.astype('float32'))

        print(f"Vector store initialized: {self.index.ntotal} vectors")

    def search(self, query: str, k: int = 5) -> List[Tuple[DocumentChunk, float]]:
        """
        Search for most similar chunks to query.

        Args:
            query: Query text
            k: Number of results to return

        Returns:
            List of (chunk, similarity_score) tuples
        """
        query_embedding = embedding_model.encode([query], normalize_embeddings=True)
        scores, indices = self.index.search(query_embedding.astype('float32'), k)

        results = []
        for idx, score in zip(indices[0], scores[0]):
            # FAISS pads missing neighbors with -1, which would otherwise index the last chunk
            if 0 <= idx < len(self.chunks):
                results.append((self.chunks[idx], float(score)))

        return results

vector_store = VectorStore(chunk_embeddings, document_chunks)

Vector store initialized: 62151 vectors
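
When reading FAISS scores it helps to know which metric produced them. An inner-product index returns similarities (higher is closer), whereas an L2 index returns squared Euclidean distances (lower is closer), and `faiss.IndexIVFFlat` uses L2 unless a metric is passed explicitly. For unit-length vectors the two are related by `||a - b||^2 = 2 - 2 * cos(a, b)`. The following is a minimal sketch of the conversion; the helper name `squared_l2_to_cosine` is hypothetical.

```python
import numpy as np


def squared_l2_to_cosine(squared_distance: float) -> float:
    """Convert a squared L2 distance between unit vectors to cosine similarity.

    Uses ||a - b||^2 = 2 - 2 * cos(a, b), which holds only for unit-length vectors.
    """
    return 1.0 - squared_distance / 2.0


a = np.array([1.0, 0.0])
b = np.array([0.6, 0.8])  # also unit length

squared_distance = float(np.sum((a - b) ** 2))
cosine_direct = float(a @ b)
cosine_from_distance = squared_l2_to_cosine(squared_distance)

print(squared_distance, cosine_direct, cosine_from_distance)
```

A score list that increases with rank is a sign that the values are distances rather than similarities.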


## 6. RAG System Implementation

Implement retrieval-augmented generation with Google Gemini.

In [None]:
class RAGSystem:
    """
    Retrieval-Augmented Generation system for historical QA.
    """

    def __init__(self, vector_store: VectorStore, api_key: str):
        """
        Initialize RAG system.

        Args:
            vector_store: Initialized vector store
            api_key: Google AI API key
        """
        self.vector_store = vector_store
        genai.configure(api_key=api_key)
        self.model = genai.GenerativeModel('gemini-2.5-flash')

        # Rate limiting state (timestamps are taken in _rate_limit)
        self.last_request_time = 0
        self.min_request_interval = 1.0
        self.request_count = 0

    def _rate_limit(self):
        """Enforce rate limiting between requests."""
        import time
        elapsed = time.time() - self.last_request_time
        if elapsed < self.min_request_interval:
            time.sleep(self.min_request_interval - elapsed)
        self.last_request_time = time.time()

    def _generate_with_retry(self, prompt: str, temperature: float, max_retries: int = 3):
        """Generate response with exponential backoff retry."""
        import time
        import re

        for attempt in range(max_retries):
            try:
                self._rate_limit()
                response = self.model.generate_content(
                    prompt,
                    generation_config=genai.types.GenerationConfig(
                        temperature=temperature,
                        max_output_tokens=2048,
                    )
                )
                self.request_count += 1
                return response.text
            except Exception as e:
                error_str = str(e)
                if '429' in error_str or 'quota' in error_str.lower():
                    wait_time = 60
                    retry_match = re.search(r'retry in ([\d.]+)s', error_str)
                    if retry_match:
                        wait_time = float(retry_match.group(1))

                    if attempt < max_retries - 1:
                        backoff = min(wait_time * (2 ** attempt), 300)
                        time.sleep(backoff)
                    else:
                        return None
                else:
                    return None
        return None

    def _create_fallback_answer(self, question: str, results) -> str:
        """Create answer from sources when generation fails."""
        scores = [score for _, score in results]
        answer_parts = [f"Based on retrieved sources (confidence: {np.mean(scores):.1%}):\n"]

        for i, (chunk, score) in enumerate(results[:3], 1):
            answer_parts.append(f"\n{i}. {chunk.president} ({chunk.year}) - {score:.1%}:")
            excerpt = chunk.text[:200].strip()
            if len(chunk.text) > 200:
                excerpt += "..."
            answer_parts.append(f'   "{excerpt}"\n')

        answer_parts.append("\nNote: LLM generation unavailable. See full sources below.")
        return "\n".join(answer_parts)

    def query(self, question: str, k: int = 5, temperature: float = 0.3) -> Dict:
        """
        Answer question using RAG.

        Args:
            question: User question
            k: Number of chunks to retrieve
            temperature: Generation temperature (0-1)

        Returns:
            Dictionary containing answer, sources, confidence, and status
        """
        # Retrieve relevant chunks
        results = self.vector_store.search(question, k=k)

        if not results:
            return {
                'answer': 'No relevant documents found.',
                'sources': [],
                'confidence': 0.0,
                'status': 'no_results'
            }

        # Format context
        context_blocks = []
        sources = []

        for i, (chunk, score) in enumerate(results, 1):
            context_blocks.append(
                f"""[Source {i}]
President: {chunk.president}
Year: {chunk.year}
Relevance: {score:.1%}
Text: {chunk.text}
"""
            )
            sources.append({
                'president': chunk.president,
                'year': chunk.year,
                'relevance': score,
                'text': chunk.text[:250] + '...' if len(chunk.text) > 250 else chunk.text
            })

        context = "\n\n".join(context_blocks)

        # Generate answer
        prompt = f"""You are a historian analyzing U.S. Presidential State of the Union addresses.

Based on the provided sources, answer the question accurately and concisely.

Instructions:
- Base your answer only on the provided sources
- Cite specific presidents and years
- Quote directly when relevant
- Acknowledge limitations if sources are incomplete

Sources:
{context}

Question: {question}

Answer:"""

        answer = self._generate_with_retry(prompt, temperature)

        # Handle generation failure
        if answer is None:
            answer = self._create_fallback_answer(question, results)
            status = 'fallback'
        else:
            status = 'success'

        return {
            'answer': answer,
            'sources': sources,
            'confidence': float(np.mean([score for _, score in results])),
            'status': status
        }

    def display_result(self, result: Dict):
        """Display query results."""
        print("\n" + "="*90)
        print("ANSWER")
        print("="*90)
        print(result['answer'])

        print("\n" + "="*90)
        print(f"SOURCES (Confidence: {result['confidence']:.1%}, Status: {result['status']})")
        print("="*90)

        for i, source in enumerate(result['sources'], 1):
            print(f"\n[{i}] {source['president']} ({source['year']}) - {source['relevance']:.1%}")
            print(f"    {source['text']}")
        print("="*90)
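
The retry logic in `_generate_with_retry` sleeps for `min(wait_time * 2 ** attempt, 300)` seconds after each failed attempt except the last. The sketch below reproduces only that schedule so the worst-case delay can be inspected without calling the API; `backoff_schedule` is a hypothetical helper, not part of `RAGSystem`.

```python
from typing import List


def backoff_schedule(base_wait: float,
                     max_retries: int = 3,
                     cap: float = 300.0) -> List[float]:
    """Return the waits (in seconds) applied between attempts.

    Mirrors the formula min(base_wait * 2 ** attempt, cap); no wait follows
    the final attempt, so the list has max_retries - 1 entries.
    """
    return [min(base_wait * (2 ** attempt), cap) for attempt in range(max_retries - 1)]


# Default 60 s base, used when the error message carries no retry hint
default_waits = backoff_schedule(60.0)
# Example with a shorter, server-suggested base wait (value is illustrative)
hinted_waits = backoff_schedule(7.5, max_retries=4)

print(default_waits, sum(default_waits))
print(hinted_waits)
```

With the defaults, a query that keeps hitting quota errors can block for about three minutes before the fallback answer is produced, which matters when running the batch cells.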

## 7. System Initialization

Configure API credentials and initialize the RAG system.

In [None]:
# Configure API key
# Option 1: Use Colab Secrets (recommended)
try:
    from google.colab import userdata
    API_KEY = userdata.get('GOOGLE_API_KEY')
except Exception:
    # Option 2: Prompt without echoing the key into the notebook output
    from getpass import getpass
    API_KEY = getpass("Enter Google AI API key: ")

# Initialize RAG system
rag = RAGSystem(vector_store, API_KEY)
print(f"\nRAG System Initialized")
print(f"Corpus: {len(corpus_df)} documents")
print(f"Chunks: {len(document_chunks)}")
print(f"Model: Gemini 2.5 Flash")

Enter Google AI API key: [REDACTED]

RAG System Initialized
Corpus: 244 documents
Chunks: 62151
Model: Gemini 2.5 Flash


## 8. Demonstration and Evaluation

Test the system with sample queries.

In [None]:
# Example query
result = rag.query("What did presidents say about war and peace?", k=5)
rag.display_result(result)


ANSWER
Based on the provided sources, presidents have addressed war and peace in their State of the Union addresses as follows:

*   **Franklin D. Roosevelt** in 1936 emphasized "the gravity of the situation which confronts the people of the world," stating that "Peace is jeopardized by the few and not by the many. Peace is threatened by those who seek selfish power." He also referenced historical eras of conflict, such as "when petty kings and feudal barons were changing the map of Europe every fortnight, or when great emperors and great kings were engaged in a mad scramble for colonial empire." In 1938, Roosevelt noted that despite the nation's "determination for peace," the "acts and policies of nations in other parts of the world have far-reaching effects not only upon their immediate neighbors but also on us," while also stating, "I am thankful that I can tell you that our Nation is at peace."

*   **Dwight D. Eisenhower** in 1960 declared his "long-held resolve overriding all ot

In [None]:
# Additional test queries
test_queries = [
    "How did Roosevelt describe freedom?",
    "What economic policies did presidents advocate during crises?",
    "What did Lincoln say about the Constitution?"
]

for query in test_queries:
    print(f"\nQuery: {query}")
    result = rag.query(query, k=3)
    print(f"Status: {result['status']} | Confidence: {result['confidence']:.1%}")
    print(f"Answer: {result['answer'][:200]}...\n")


Query: How did Roosevelt describe freedom?
Status: success | Confidence: 86.0%
Answer: Based on the provided sources, Franklin Delano Roosevelt is mentioned as having "spoke of a day of infamy and summoned a nation to arms" (Ronald Reagan, 1982). However, the sources do not contain any ...


Query: What economic policies did presidents advocate during crises?
Status: success | Confidence: 78.9%
Answer: During economic crises, presidents advocated for specific measures:

President Herbert Hoover, in 1932, advocated for "unprecedented emergency measures enacted and policies adopted" to address "a seri...


Query: What did Lincoln say about the Constitution?
Status: success | Confidence: 81.4%
Answer: The provided sources do not directly quote Abraham Lincoln speaking about the Constitution.

However, President Woodrow Wilson, in his 1920 address, quoted Abraham Lincoln as saying: "Let us have fait...



## 9. Retrieval Quality Analysis

In [None]:
def analyze_retrieval(query: str, k: int = 10):
    """
    Analyze retrieval quality for a given query.
    """
    results = vector_store.search(query, k=k)

    print(f"\nRetrieval Analysis: '{query}'")
    print("\n" + "="*100)
    print(f"{'Rank':<6} {'President':<30} {'Year':<8} {'Score':<10} {'Preview'}")
    print("="*100)

    for i, (chunk, score) in enumerate(results, 1):
        preview = chunk.text[:50].replace('\n', ' ') + '...'
        print(f"{i:<6} {chunk.president:<30} {chunk.year:<8} {score:>6.2%}    {preview}")

    print("="*100)

    # Statistics
    scores = [score for _, score in results]
    print(f"\nScore Statistics:")
    print(f"  Mean:   {np.mean(scores):.2%}")
    print(f"  Median: {np.median(scores):.2%}")
    print(f"  Std:    {np.std(scores):.2%}")

# Example analysis
analyze_retrieval("democracy and freedom", k=10)


Retrieval Analysis: 'democracy and freedom'

Rank   President                      Year     Score      Preview
1      George W. Bush                 2006     68.54%    eam, the advance of freedom is the great story of ...
2      George W. Bush                 2006     68.54%    eam, the advance of freedom is the great story of ...
3      George Bush                    1990     76.65%    ienable Rights, and that among these are Life, Lib...
4      George W. Bush                 2006     77.19%    ing murder and destruction to our country. Dictato...
5      Ronald Reagan                  1983     77.42%    ediscovered the strength of our common democratic ...
6      Ronald Reagan                  1983     77.42%    ediscovered the strength of our common democratic ...
7      Franklin D. Roosevelt          1939     77.60%    specting his neighbors. Democracy, the practice of...
8      Franklin D. Roosevelt          1941     78.67%    will give you their applause. In the future days, ...
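
The recorded output shows identical previews with identical scores at ranks 1-2 and 5-6, which suggests that some chunks are present in the index more than once (for example, if an address appears twice in the source CSV; this is an assumption and has not been verified here). A minimal sketch of post-retrieval deduplication follows. The helper name `deduplicate_results` is hypothetical, and the texts and scores in the example are made up for illustration.

```python
from typing import List, Tuple


def deduplicate_results(results: List[Tuple[str, float]]) -> List[Tuple[str, float]]:
    """Keep the first occurrence of each retrieved text, preserving rank order."""
    seen = set()
    unique: List[Tuple[str, float]] = []
    for text, score in results:
        # Normalize whitespace and case so trivially different copies match
        key = ' '.join(text.split()).lower()
        if key in seen:
            continue
        seen.add(key)
        unique.append((text, score))
    return unique


# Illustrative (text, score) pairs; not taken from the real index
ranked = [
    ("the advance of freedom", 0.91),
    ("the advance of  freedom", 0.91),  # same text with an extra space
    ("our common democratic values", 0.88),
]
unique_ranked = deduplicate_results(ranked)
print(unique_ranked)
```

To apply this to `VectorStore.search`, one would request more than `k` neighbors, deduplicate on `chunk.text`, and then truncate to `k`.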


## 10. Batch Processing

In [None]:
def batch_process(questions: List[str], k: int = 5) -> pd.DataFrame:
    """
    Process multiple questions and return results as DataFrame.
    """
    results = []

    for question in questions:
        result = rag.query(question, k=k)
        results.append({
            'question': question,
            'confidence': result['confidence'],
            'status': result['status'],
            'answer_preview': result['answer'][:150] + '...'
        })

    return pd.DataFrame(results)

# Example batch processing
batch_questions = [
    "What did FDR say about fear?",
    "How did Reagan describe American values?",
    "What was Washington's foreign policy advice?"
]

batch_results = batch_process(batch_questions)
batch_results



| | question | confidence | status | answer_preview |
|---|---|---|---|---|
| 0 | What did FDR say about fear? | 0.916211 | success | Based on the provided sources, President Frank... |
| 1 | How did Reagan describe American values? | 0.947167 | success | In his 1982 address, Ronald Reagan described A... |
| 2 | What was Washington's foreign policy advice? | 0.840008 | success | Based on the provided sources, there is no inf... |


## Conclusion

This notebook demonstrates a complete RAG system for historical document analysis with:

1. **Data Acquisition**: Automated loading from public repositories
2. **Document Processing**: Intelligent chunking with sentence boundaries
3. **Semantic Search**: FAISS-based vector similarity search
4. **LLM Integration**: Gemini 2.5 Flash for answer generation
5. **Error Handling**: Robust retry logic and graceful degradation
6. **Evaluation**: Confidence scoring and source attribution

### Future Improvements
- Implement hybrid search (dense + sparse retrieval)
- Add re-ranking stage for improved relevance
- Expand corpus to include other presidential documents
- Implement evaluation metrics (BLEU, ROUGE, F1)
- Add query expansion for better retrieval coverage
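
As a starting point for the evaluation item above, the following is a minimal sketch of token-level F1 between a generated answer and a reference answer. It assumes a set of human-written reference answers exists, which this notebook does not yet provide; the function name `token_f1` and the example strings are hypothetical.

```python
import re
from collections import Counter


def token_f1(prediction: str, reference: str) -> float:
    """Token-overlap F1 between a generated answer and a reference answer."""
    pred_tokens = re.findall(r"\w+", prediction.lower())
    ref_tokens = re.findall(r"\w+", reference.lower())

    # Two empty answers count as a match; one empty answer counts as a miss
    if not pred_tokens or not ref_tokens:
        return float(pred_tokens == ref_tokens)

    # Multiset intersection counts each shared token at most as often as it
    # appears in both answers
    overlap = sum((Counter(pred_tokens) & Counter(ref_tokens)).values())
    if overlap == 0:
        return 0.0

    precision = overlap / len(pred_tokens)
    recall = overlap / len(ref_tokens)
    return 2 * precision * recall / (precision + recall)


score = token_f1("our nation is at peace", "the nation is at peace")
print(f"Token F1: {score:.2f}")
```

Token overlap rewards shared wording rather than factual correctness, so it is best read alongside a manual check of whether the cited presidents and years match the retrieved sources.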

---