## Evaluation Framework

We designed a custom evaluation pipeline to assess our generative models on multiple quality dimensions, combining semantic, stylistic, and structural criteria. The evaluation runs on a fixed set of queries (loaded from `bohr_300_questions.txt`) and reports metrics averaged across all responses. Here's how it works:

### 1. Corpus Setup + Embedding Style Vector  
We first load our cleaned Bohr corpus and randomly select up to 200 passages. Using the `all-MiniLM-L6-v2` encoder, we embed each passage and compute the average embedding across the sample. This gives us a reference “style vector” representing the general tone and semantics of Bohr's writing. Later, we use this vector to compute how stylistically aligned a generated answer is.

### 2. Query-Based Evaluation  
We load a fixed set of standard queries from `bohr_300_questions.txt` that probe conceptual and scientific themes in Bohr's thought. For each query, we generate a response using the provided model function and record various linguistic and semantic properties.

### 3. Style & Relevance Metrics  
- **StyleSim**: Cosine similarity between the answer’s embedding and the precomputed Bohr style vector. High values indicate stylistic alignment with the corpus.  
- **QuerySim**: Cosine similarity between the query embedding and the response embedding, measuring how semantically relevant the answer is.  
- **Repetition**: Using TF-IDF cosine similarity, we compare all sentence pairs in the answer. High values mean redundant phrasing, which we penalize.
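
As a rough illustration of how these scores are formed, the sketch below computes a cosine similarity and the mean pairwise similarity that Repetition is built on. It is a simplification under stated assumptions: plain word counts stand in for TF-IDF weights, the sentences are toy data, and the helper names `cosine` and `mean_pairwise_similarity` are hypothetical.

```python
import math
from collections import Counter
from itertools import combinations

def cosine(u, v):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(u[k] * v.get(k, 0.0) for k in u)
    norm_u = math.sqrt(sum(x * x for x in u.values()))
    norm_v = math.sqrt(sum(x * x for x in v.values()))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0

def mean_pairwise_similarity(sentences):
    """Average cosine similarity over all distinct sentence pairs."""
    # Assumption: raw word counts replace the TF-IDF weights used in the notebook.
    vectors = [Counter(s.lower().split()) for s in sentences]
    pairs = list(combinations(vectors, 2))
    if not pairs:
        return 0.0
    return sum(cosine(a, b) for a, b in pairs) / len(pairs)

repeated = ["the atom emits light", "the atom emits light"]
varied = ["the atom emits light", "measurement disturbs every system"]
print(mean_pairwise_similarity(repeated))  # identical sentences -> 1.0
print(mean_pairwise_similarity(varied))    # no shared words -> 0.0
```

Averaging over unordered pairs gives the same value as summing the full similarity matrix, subtracting the `n` diagonal entries (each equal to 1), and dividing by `n * (n - 1)`, because the matrix is symmetric.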

### 4. Fluency & Diversity Metrics  
- **Tokens**: Total number of words in the response  
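- **LexDiv**: Lexical diversity, i.e. unique words over total words  
- **Distinct-1 / Distinct-2**: Ratio of unique unigrams and bigrams to their totals; lower values mean more repeated n-grams. With whitespace tokenization, Distinct-1 coincides with LexDiv  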
- **AvgSentLen**: Average number of words per sentence — a proxy for sentence complexity
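
A small worked example may help to read the diversity scores. The sketch below assumes whitespace tokenization and a toy sentence; `distinct_n` is a hypothetical helper name used only for illustration.

```python
def distinct_n(tokens, n):
    """Ratio of unique n-grams to total n-grams (1.0 means no n-gram repeats)."""
    ngrams = list(zip(*(tokens[i:] for i in range(n))))
    return len(set(ngrams)) / len(ngrams) if ngrams else 0.0

tokens = "the atom is the atom".split()
print(distinct_n(tokens, 1))  # 3 unique words out of 5 -> 0.6
print(distinct_n(tokens, 2))  # 3 unique bigrams out of 4 -> 0.75
```

A response that loops over the same phrase drives both ratios down, while a response with no repeated n-grams scores 1.0.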

### 5. Timing  
- **GenTime(s)**: Time in seconds to generate each response — useful for comparing model speed

The evaluation function collects the per-query metrics in a `pandas` DataFrame, prints the average of each metric, and returns those averages as a dictionary. This lets us compare models not only on relevance, but also on fluency, diversity, and stylistic closeness to the corpus.


In [None]:
from google.colab import drive
drive.mount('/content/drive')

In [None]:
import numpy as np
import pandas as pd
import time
from sentence_transformers import SentenceTransformer
from sklearn.neighbors import NearestNeighbors

# Define evaluation queries
with open('bohr_300_questions.txt', 'r') as f:
    queries = [line.strip().split('. ', 1)[-1] for line in f if line.strip()]

# Load the cleaned Bohr corpus (one passage per non-empty line)
with open('bohr_cleaned_final.txt', 'r', encoding='utf-8') as f:
    lines = [l.strip() for l in f if l.strip()]

# Initialize embedder and compute mean embedding of a random sample of passages
embedder = SentenceTransformer("all-MiniLM-L6-v2")
sample = np.random.choice(lines, min(len(lines), 200), replace=False)
mean_emb = embedder.encode(sample.tolist(), convert_to_numpy=True).mean(axis=0)


In [None]:
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

# Auxiliary function to compute internal repetition score
def repetition_score(text):
    sentences = [s.strip() for s in text.split('.') if len(s.strip().split()) > 3]
    if len(sentences) < 2:
        return 0.0
    tfidf = TfidfVectorizer().fit_transform(sentences)
    sim_matrix = cosine_similarity(tfidf)
    n = len(sentences)
    repetition = (sim_matrix.sum() - n) / (n * (n - 1))
    return repetition

# Main evaluation function
def evaluate_model_metrics(model_fn, model_name="CustomModel", verbose=False):
    """
    Evaluate textual quality and semantic relevance of a generative model.

    Args:
        model_fn (function): A function that takes a query string and returns a generated response.
        model_name (str): Optional name for the model (used in output).
        verbose (bool): If True, prints each query and generated response.

    Returns:
        dict: Dictionary of average evaluation metrics.
    """
    records = []

    for query in queries:
        start = time.time()
        response = model_fn(query)
        if response is None:
            print(f"[ERROR] No valid response for query: {query}")
            continue

        gen_time = time.time() - start

        if verbose:
            print(f"\n[Query] {query}\n[Response] {response}\n")

        tokens = response.split()
        length = len(tokens)
        lex_div = len(set(tokens)) / length if length else 0
        bigrams = list(zip(tokens, tokens[1:]))
        distinct1 = len(set(tokens)) / length if length else 0
        distinct2 = len(set(bigrams)) / len(bigrams) if bigrams else 0
        sentences = [s.strip() for s in response.split('.') if s.strip()]
        sent_lens = [len(s.split()) for s in sentences]
        avg_sent_len = np.mean(sent_lens) if sent_lens else 0

        # Style similarity to the Bohr corpus
        resp_emb = embedder.encode([response], convert_to_numpy=True)[0]
        style_sim = float(np.dot(resp_emb, mean_emb) / (np.linalg.norm(resp_emb) * np.linalg.norm(mean_emb)))

        # Semantic relevance (query-to-response similarity)
        query_emb = embedder.encode([query], convert_to_numpy=True)[0]
        query_sim = float(np.dot(query_emb, resp_emb) / (np.linalg.norm(query_emb) * np.linalg.norm(resp_emb)))

        # Internal repetition
        repetition = repetition_score(response)

        records.append({
            "Query": query,
            "Tokens": length,
            "LexDiv": lex_div,
            "Distinct-1": distinct1,
            "Distinct-2": distinct2,
            "AvgSentLen": avg_sent_len,
            "StyleSim": style_sim,
            "QuerySim": query_sim,
            "Repetition": repetition,
            "GenTime(s)": round(gen_time, 3)
        })

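    df = pd.DataFrame(records)
    if df.empty:
        # Every query failed, so there is nothing to average
        print(f"[ERROR] No responses were recorded for {model_name}.")
        return {}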
    metric_cols = ["Tokens", "LexDiv", "Distinct-1", "Distinct-2", "AvgSentLen",
                   "StyleSim", "QuerySim", "Repetition", "GenTime(s)"]
    summary = df[metric_cols].mean().to_dict()

    print(f"\n=== Metrics for {model_name} ===")
    for metric, value in summary.items():
        print(f"{metric}: {value:.4f}")

    return summary


The baseline is a generator without retrieval: it receives only the question, with none of the corpus passages that the other models retrieve, so it answers purely from what it learned during pretraining.

In [None]:
!pip install faiss-cpu

In [None]:
import pandas as pd
import numpy as np
import time
import random
from IPython.display import display
from sentence_transformers import SentenceTransformer
from transformers import pipeline, AutoTokenizer, AutoModelForCausalLM
import faiss

# Build a FAISS index over all corpus passages
print("[EVAL] Building FAISS index...")
passage_embeddings = embedder.encode(lines, convert_to_numpy=True, show_progress_bar=True)
index = faiss.IndexFlatL2(passage_embeddings.shape[1])
index.add(passage_embeddings)
print("[EVAL] FAISS index ready.")

baseline_tokenizer = AutoTokenizer.from_pretrained("gpt2")
baseline_model     = AutoModelForCausalLM.from_pretrained("gpt2")
baseline_pipe      = pipeline(
    "text-generation",
    model=baseline_model,
    tokenizer=baseline_tokenizer,
    device=-1
)

def baseline_chatbot(prompt: str, max_new_tokens: int = 50):
    full = f"You are a helpful assistant. Answer succinctly:\n\nQuestion: {prompt}\nAnswer:"
    out = baseline_pipe(
        full,
        max_new_tokens=max_new_tokens,
        do_sample=True,
        top_p=0.9,
        temperature=0.8,
        pad_token_id=baseline_tokenizer.eos_token_id
    )[0]["generated_text"]
    return out.split("Answer:")[-1].strip()

# Evaluate the baseline
evaluate_model_metrics(baseline_chatbot, model_name="BASELINE", verbose=True)

# MODEL 1: RAG

We’ve built a simple Retrieval-Augmented Generation (RAG) system in six stages to combine information lookup with language generation:

1) **Loading the Cleaned Corpus**  
We read our preprocessed Bohr paragraphs from disk into a list called `passages`, skipping empty lines.

2) **Embedding with SentenceTransformer**  
We initialize the `all-MiniLM-L6-v2` SentenceTransformer, which maps any sentence or paragraph into a 384-dimensional real vector. We compute embeddings in batches to avoid memory issues, and then stack them into a single NumPy array called `passage_embeddings`. We deliberately chose this specific model because it’s lightweight, fast, and surprisingly accurate even without a GPU—ideal for encoding thousands of passages quickly and comparing them semantically to new questions.

3) **Indexing with FAISS**  
We feed all paragraph embeddings into a FAISS `IndexFlatL2` index. FAISS is a library optimized for fast nearest-neighbor search, and this index type performs an exact, exhaustive search by Euclidean (L2) distance. This lets us retrieve the chunks most semantically similar to a given question in just milliseconds, even on CPU.

4) **Preparing the Generator**  
We use the base GPT-2 model (`gpt2`) as our text generator via HuggingFace’s `pipeline`. Even though GPT-2 isn’t instruction-tuned, it still generates coherent English and can synthesize retrieved content when guided with the right prompt.

5) **Retrieval + Generation (`rag_answer`)**  
Given a question, we embed the query, retrieve the top-k most relevant paragraphs, and concatenate them with an instruction prompt like “You are Niels Bohr. Paraphrase and synthesize the ideas…” followed by the user’s question. This full prompt is sent to GPT-2, which generates up to 200 new tokens as the answer. We also print out the retrieved chunks and their distances for debugging purposes.

6) **Silent Version (`rag_answer_silent`)**  
This alternative version hides all the print statements and returns only the cleaned answer string, extracted after the “Answer:” marker. It’s useful for automatic evaluation or frontend integration. We also added a small postprocessing step that cuts the answer at the first blank line or at any repeated “Question:” or “Answer:” marker.
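
To make stage 3 more concrete, here is a minimal pure-Python sketch of the exhaustive search that a flat L2 index performs: the query is compared with every stored vector and the `k` smallest squared Euclidean distances are kept. The 2-dimensional toy vectors and the name `flat_l2_search` are illustrative assumptions; the notebook itself relies on FAISS for this step.

```python
def flat_l2_search(vectors, query, k):
    """Return (squared_distance, index) for the k stored vectors nearest to query."""
    scored = []
    for idx, vec in enumerate(vectors):
        # Squared Euclidean distance between the stored vector and the query
        dist = sum((a - b) ** 2 for a, b in zip(vec, query))
        scored.append((dist, idx))
    scored.sort()  # smallest distance first
    return scored[:k]

toy_vectors = [(0.0, 0.0), (1.0, 0.0), (0.0, 2.0), (3.0, 3.0)]
for dist, idx in flat_l2_search(toy_vectors, (0.9, 0.1), k=2):
    print(f"index={idx} squared_dist={dist:.3f}")
```

The distances printed by `rag_answer` are of this squared form, so smaller values mean closer passages and the values are not bounded by 1.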


In [None]:
!pip install faiss-cpu

In [None]:
import os
import time
import numpy as np
from sentence_transformers import SentenceTransformer
import faiss
from transformers import pipeline

corpus_path = "bohr_cleaned_final.txt"
print(f"[1/6] Loading corpus from {corpus_path}...")
with open(corpus_path, 'r', encoding='utf-8') as f:
    passages = [line.strip() for line in f if line.strip()]
print(f"[1/6] Done. {len(passages)} passages loaded.\n")

print("[2/6] Initializing SentenceTransformer embedder...")
embedder = SentenceTransformer('all-MiniLM-L6-v2')
print("[2/6] Embedder ready.\n")

batch_size = 64
all_embeddings = []
print("[3/6] Computing embeddings in batches:")
start_time = time.time()
for i in range(0, len(passages), batch_size):
    batch = passages[i:i+batch_size]
    embs = embedder.encode(batch, convert_to_numpy=True, show_progress_bar=False)
    all_embeddings.append(embs)
    print(f"    • Batch {i//batch_size+1}/{(len(passages)-1)//batch_size+1} done", flush=True)
passage_embeddings = np.vstack(all_embeddings)  # np is imported as numpy earlier in the notebook
elapsed = time.time() - start_time
print(f"[3/6] Embeddings computed: shape={passage_embeddings.shape}, time={elapsed:.1f}s\n")

dim = passage_embeddings.shape[1]
print("[4/6] Building FAISS index (IndexFlatL2)...", flush=True)
index = faiss.IndexFlatL2(dim)
start_time = time.time()
index.add(passage_embeddings)
elapsed = time.time() - start_time
print(f"[4/6] FAISS index built: {index.ntotal} vectors indexed in {elapsed:.2f}s\n", flush=True)

print("[5/6] Initializing text-generation pipeline (GPT-2)...", flush=True)
generator = pipeline('text-generation', model='gpt2', tokenizer='gpt2')
print("[5/6] Generator ready.\n", flush=True)

def rag_answer(query, k=3):
    print(f"[6] Retrieving top {k} passages for query: {query!r}", flush=True)
    q_emb = embedder.encode([query], convert_to_numpy=True)
    D, I = index.search(q_emb, k)
    ctx = [passages[i] for i in I[0]]
    for rank, (dist, text) in enumerate(zip(D[0], ctx), 1):
        print(f"    {rank}. (dist={dist:.3f}) {text[:60]}…", flush=True)

    prompt = (
        "You are Niels Bohr. Paraphrase and synthesize the ideas below in your own words, "
        "avoiding verbatim quotes. Then answer:\n\n"
        + "\n---\n".join(ctx)
        + f"\n\nQuestion: {query}\nAnswer:"
    )
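    # early_stopping only affects beam search, so it is omitted when sampling
    out = generator(prompt, max_new_tokens=200, do_sample=True, top_p=0.9,
                    pad_token_id=generator.tokenizer.eos_token_id)[0]['generated_text']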
    print("[6] Generation complete.\n", flush=True)
    return out

def rag_answer_silent(query, k=3, max_new_tokens=200, top_p=0.8):
    q_emb = embedder.encode([query], convert_to_numpy=True)
    _, I = index.search(q_emb, k)
    context = [passages[i] for i in I[0]]

    prompt = (
        "You are Niels Bohr. Based on the following notes, answer the question simply and clearly:\n\n"
        + "\n---\n".join(context)
        + f"\n\nQuestion: {query}\nAnswer:"
    )

    generated_output = generator(
        prompt,
        max_new_tokens=max_new_tokens,
        do_sample=True,
        temperature=0.7,
        top_p=top_p,
        repetition_penalty=1.2,
        pad_token_id=generator.tokenizer.eos_token_id
    )[0]['generated_text']

    # Drop the echoed prompt (falling back to the 'Answer:' marker), then stop at
    # the first blank line or repeated marker
    if generated_output.startswith(prompt):
        answer = generated_output[len(prompt):].strip()
    else:
        answer = generated_output.split("Answer:")[-1].strip()
    for stop_token in ["\n\n", "Question:", "Answer:"]:
        if stop_token in answer:
            answer = answer.split(stop_token)[0].strip()
    return answer


# Example Usage
'''for q in queries:
    print("=== QUERY ===")
    print(q)
    answer_verbose = rag_answer(q, k=3)
    answer_silent = rag_answer_silent(q, k=3)
    print("=== ANSWER (verbose) ===")
    print(answer_verbose)
    print("=== ANSWER (silent) ===")
    print(answer_silent)
    print("\n")
'''


In [None]:
evaluate_model_metrics(rag_answer_silent, model_name="RAG", verbose=True)

# MODEL 2: GPT-2 Large RAG Pipeline

This model builds upon our first RAG system, but introduces some important upgrades—most notably a stronger language model (GPT-2 Large) and a more refined prompting strategy for better, more focused responses.

1) **Loading the Cleaned Corpus**  
Just like before, we load our cleaned Bohr corpus into a list of passages. Each passage is a self-contained snippet of text.

2) **Embedding with SentenceTransformer**  
We continue using `all-MiniLM-L6-v2` to embed the corpus into 384-dimensional vectors. This SentenceTransformer offers a good balance between speed and semantic quality, especially on CPU.

3) **Indexing with FAISS**  
Again, we rely on FAISS’s `IndexFlatL2` to efficiently retrieve the top-k most semantically similar passages to any given query, based on L2 distance.

4) **Upgrading the Generator: GPT-2 → GPT-2 Large**  
Here’s the first major change: we switch from regular `gpt2` to the much more powerful `gpt2-large`. GPT-2 Large has significantly more parameters (774M vs. 124M), which allows it to generate much more coherent and nuanced responses. To make use of the model effectively, we also run it on GPU (`device=0`).

5) **Refined Prompting Strategy**  
Instead of a general “paraphrase and synthesize” prompt, we now use a much more precise instruction: Bohr is imagined answering a curious student in a clear, concise, and complete way, limited to 2-3 sentences. The prompt discourages vague or meta-style outputs and explicitly asks for a self-contained, final answer. This helps constrain GPT-2 Large to generate focused replies.

6) **Trimming the Output for Clarity**  
We extract the generated text and keep at most the first three sentences, which avoids overly long or rambling outputs. We also do light postprocessing to flatten newlines and make sure the answer ends with a period.

**Key Differences from Model 1:**
- **Stronger Generator:** GPT-2 Large replaces GPT-2 for better fluency and reasoning.
- **Sharper Prompting:** More detailed, instructional prompt tailored to concise and educational answers.
- **Postprocessing:** Automatic trimming to 2-3 sentences for consistency across outputs.
- **GPU Usage:** The pipeline is explicitly set to run on GPU for faster generation.

Overall, this model delivers noticeably sharper and more polished responses, making it a strong baseline for generating grounded, Bohr-style answers.
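
The trimming in step 6 can be sketched in isolation. The version below is an illustrative variant, not the notebook's exact code: it splits on `.`, `!`, and `?` with a regular expression, and the helper name `first_sentences` is hypothetical.

```python
import re

def first_sentences(text, max_sentences=3):
    """Keep at most max_sentences sentences and end with terminal punctuation."""
    flat = " ".join(text.split())  # collapse newlines and repeated spaces
    # Split after '.', '!' or '?' when followed by whitespace
    parts = [p for p in re.split(r"(?<=[.!?])\s+", flat) if p]
    trimmed = " ".join(parts[:max_sentences]).rstrip(":;, ")
    if trimmed and trimmed[-1] not in ".!?":
        trimmed += "."
    return trimmed

raw = "Light is quantized. Energy comes in packets.\nThis explains spectra. And more;"
print(first_sentences(raw))
# Light is quantized. Energy comes in packets. This explains spectra.
```

A punctuation-based split like this also breaks on abbreviations such as "e.g.", so it should be read as a rough heuristic rather than a sentence tokenizer.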


In [None]:
import os
import time
from sentence_transformers import SentenceTransformer
import faiss
from transformers import pipeline
import numpy as np

# Step 1: Load the cleaned corpus
corpus_path = 'bohr_cleaned_final.txt'
print(f"[1/6] Loading corpus from {corpus_path}...")
with open(corpus_path, 'r', encoding='utf-8') as f:
    passages = [line.strip() for line in f if line.strip()]
print(f"[1/6] Done. {len(passages)} passages loaded.\n")

# Step 2: Initialize the embedding model
print("[2/6] Initializing SentenceTransformer embedder...")
embedder = SentenceTransformer('all-MiniLM-L6-v2')
print("[2/6] Embedder ready.\n")

# Step 3: Compute and stack passage embeddings
batch_size = 64
all_embeddings = []
print("[3/6] Computing embeddings in batches:")
start_time = time.time()
for i in range(0, len(passages), batch_size):
    batch = passages[i:i+batch_size]
    embs = embedder.encode(batch, convert_to_numpy=True, show_progress_bar=False)
    all_embeddings.append(embs)
    print(f"    • Batch {i//batch_size+1}/{(len(passages)-1)//batch_size+1} done", flush=True)

passage_embeddings = np.vstack(all_embeddings)
elapsed = time.time() - start_time
print(f"[3/6] Embeddings computed: shape={passage_embeddings.shape}, time={elapsed:.1f}s\n")

# Step 4: Build FAISS index for similarity search
dim = passage_embeddings.shape[1]
print("[4/6] Building FAISS index (IndexFlatL2)...", flush=True)
index = faiss.IndexFlatL2(dim)
start_time = time.time()
index.add(passage_embeddings)
elapsed = time.time() - start_time
print(f"[4/6] FAISS index built: {index.ntotal} vectors indexed in {elapsed:.2f}s\n", flush=True)

# Step 5: Load GPT-2 Large as the text generator
print("[5/6] Initializing text-generation pipeline (GPT-2 Large)...", flush=True)
generator = pipeline(
    "text-generation",
    model="gpt2-large",
    tokenizer="gpt2-large",
    device=0
)
print("[5/6] Generator ready.\n", flush=True)

# Step 6: Define the RAG-style answer generation function
def gpt_2_answer(query, k=3):
    print(f"[6] Retrieving top {k} passages for query: {query!r}", flush=True)
    q_emb = embedder.encode([query], convert_to_numpy=True)
    D, I = index.search(q_emb, k)
    context = [passages[i][:300] for i in I[0]]

    for rank, (dist, text) in enumerate(zip(D[0], context), 1):
        print(f"    {rank}. (dist={dist:.3f}) {text[:60]}…", flush=True)

    prompt = (
        "Niels Bohr is asked a scientific question by a curious student. "
        "Using only the notes below, he gives a clear, concise, and complete answer in 2 to 3 sentences. "
        "He avoids lists, vague generalities, and meta-comments. The answer is final and self-contained.\n\n"
        + "\n---\n".join(context)
        + f"\n\nQuestion: {query}\nAnswer:"
    )

    output = generator(
        prompt,
        max_new_tokens=120,
        temperature=0.45,
        top_p=0.72,
        repetition_penalty=1.5,
        do_sample=True,
        pad_token_id=generator.tokenizer.eos_token_id,
        eos_token_id=generator.tokenizer.eos_token_id,
        return_full_text=False
    )[0]['generated_text']

    # Keep at most the first three sentences of the output
    answer = output.strip()
    sentence_split = answer.replace('\n', ' ').split('. ')
    answer = '. '.join(sentence_split[:3]).strip()

    # Drop a dangling ':' or ';' first, then make sure the answer ends with a period
    answer = answer.rstrip(':; ')
    if not answer.endswith('.'):
        answer += '.'

    print("[6] Generation complete.\n")
    return answer

# Example usage:

'''for q in queries:
    print("=== QUERY ===")
    print(q)
    answer = gpt_2_answer(q, k=3)
    print("=== ANSWER ===")
    print(answer)
    print("\n")'''



In [None]:
evaluate_model_metrics(gpt_2_answer, model_name="GPT2", verbose=True)

# MODEL 3: GPT-2 Large with Advanced Prompting

In this third model, we build on the previous GPT-2 Large RAG system by introducing a more polished prompting strategy and better output control, aiming for answers that are not just relevant, but stylistically clear, final, and self-contained.

1) **Loading the Cleaned Corpus**  
As with previous models, we load our cleaned Bohr corpus into memory, storing each paragraph as a separate entry in a list called `passages`.

2) **Embedding with SentenceTransformer**  
We continue to use `all-MiniLM-L6-v2` for computing 384-dimensional semantic embeddings of each passage. The model is lightweight yet accurate enough to retrieve relevant chunks from our corpus efficiently, even without a GPU.

3) **Indexing with FAISS**  
We index all passage embeddings using FAISS (`IndexFlatL2`) to enable fast similarity search. This step remains unchanged from previous models.

4) **Generator: GPT-2 Large**  
We again use `gpt2-large` as the generator, running it on GPU for speed. The model's large size allows it to generate more coherent and meaningful responses than the base GPT-2.

5) **Prompt Formatting Upgrade**  
This is the core improvement of this model. We define a dedicated `format_prompt()` function to construct consistent, high-quality prompts. The new prompt emphasizes natural, self-contained language and instructs the model to avoid vague generalities or references to external information. The idea is to simulate how Bohr would respond thoughtfully and clearly to a curious question using only the retrieved notes.

6) **Cleaner Answer Extraction**  
We improve post-processing by cleaning the generated output to extract just the answer portion (after “Answer:”), trimming any trailing artifacts or prompt echoes. We also truncate the text after the last full sentence to ensure brevity and clarity.

**Key Differences from Model 2:**
- **Dedicated prompt formatter** to ensure consistent tone and structure across queries.
- **Better answer cleanup**: more robust text extraction and trimming logic.
- **Adjusted sampling controls**: slightly lower temperature (0.4 vs. 0.45) and higher repetition penalty (1.7 vs. 1.5) for more focused responses, with `top_p` raised from 0.72 to 0.85.

This model represents our most refined setup so far, blending strong retrieval, a powerful generator, and a well-crafted prompt to produce high-quality Bohr-style answers.
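
The answer cleanup in step 6 can be illustrated on its own. The sketch below is a simplified variant under stated assumptions: it strips the echoed prompt by prefix, cuts at the earliest stop marker, and drops a trailing sentence fragment. The helper name `extract_answer` and the toy strings are hypothetical.

```python
def extract_answer(full_text, prompt,
                   stop_markers=("\n\n", "\nQuestion:", "\nAnswer:", "---")):
    """Strip the prompt, cut at the earliest stop marker, drop a trailing fragment."""
    # Remove the echoed prompt when the generation starts with it
    body = full_text[len(prompt):] if full_text.startswith(prompt) else full_text
    body = body.strip()
    # Cut at whichever stop marker appears first
    cut = min((body.find(m) for m in stop_markers if m in body), default=len(body))
    body = body[:cut].strip()
    # Keep text up to the last full stop, if there is one
    last_period = body.rfind(".")
    return body[: last_period + 1] if last_period != -1 else body

prompt = "Notes...\n\nQuestion: What is complementarity?\nAnswer:"
generated = (prompt + " Wave and particle pictures are both needed."
             " They exclude each other in a single\nQuestion: next?")
print(extract_answer(generated, prompt))
# Wave and particle pictures are both needed.
```

Stripping by prompt prefix avoids picking up text after a second "Answer:" that the model may generate on its own.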


In [None]:
import os
import time
import numpy as np
from sentence_transformers import SentenceTransformer
import faiss
from transformers import pipeline

# Step 1: Load the preprocessed Bohr corpus
corpus_path = 'bohr_cleaned_final.txt'
print(f"[1/6] Loading corpus from {corpus_path}...")
with open(corpus_path, 'r', encoding='utf-8') as f:
    passages = [line.strip() for line in f if line.strip()]
print(f"[1/6] Done. {len(passages)} passages loaded.\n")

# Step 2: Initialize the sentence embedder
print("[2/6] Initializing SentenceTransformer embedder...")
embedder = SentenceTransformer('all-MiniLM-L6-v2')
print("[2/6] Embedder ready.\n")

# Step 3: Compute and stack embeddings for all passages
batch_size = 64
all_embeddings = []
print("[3/6] Computing embeddings in batches:")
start_time = time.time()
for i in range(0, len(passages), batch_size):
    batch = passages[i:i+batch_size]
    embs = embedder.encode(batch, convert_to_numpy=True, show_progress_bar=False)
    all_embeddings.append(embs)
    print(f"    • Batch {i//batch_size+1}/{(len(passages)-1)//batch_size+1} done", flush=True)
passage_embeddings = np.vstack(all_embeddings)
elapsed = time.time() - start_time
print(f"[3/6] Embeddings ready (shape={passage_embeddings.shape}) in {elapsed:.1f}s\n")

# Step 4: Build a FAISS L2 similarity index for the passage embeddings
dim = passage_embeddings.shape[1]
print("[4/6] Building FAISS index...")
index = faiss.IndexFlatL2(dim)
index.add(passage_embeddings)
print(f"[4/6] Indexed {index.ntotal} vectors.\n")

# Step 5: Initialize the text generation model (GPT-2 Large)
print("[5/6] Loading generation pipeline...")
generator = pipeline(
    "text-generation",
    model="gpt2-large",
    tokenizer="gpt2-large",
    device=0  # CUDA:0
)
print("[5/6] Generator ready.\n")

# Format the generation prompt with retrieved context and structured instructions
def format_prompt(query, context):
    return (
        "You are Niels Bohr. Based on the following knowledge, answer the question thoughtfully and directly. "
        "Use clear, natural language. Write 2 or 3 complete sentences that stand alone. Do not refer to sources or background information.\n\n"
        + "\n---\n".join(context)
        + f"\n\nQuestion: {query}\nAnswer:"
    )

# Step 6: Define the RAG-based answer function using GPT-2 Large
def rag_advanced_answer(query, k=3):
    print(f"[6] Retrieving top {k} passages for query: {query!r}")
    q_emb = embedder.encode([query], convert_to_numpy=True)
    D, I = index.search(q_emb, k)
    ctx = [passages[i][:300] for i in I[0]]

    for rank, (dist, text) in enumerate(zip(D[0], ctx), 1):
        print(f"    {rank}. (dist={dist:.3f}) {text[:60]}…")

    prompt = format_prompt(query, ctx)
    output = generator(
        prompt,
        max_new_tokens=120,
        temperature=0.4,
        top_p=0.85,
        repetition_penalty=1.7,
        do_sample=True,
        pad_token_id=generator.tokenizer.eos_token_id,
        eos_token_id=generator.tokenizer.eos_token_id
    )[0]['generated_text']

    # Drop the echoed prompt (falling back to the "Answer:" marker) and clean termination
    if output.startswith(prompt):
        answer = output[len(prompt):].strip()
    else:
        answer = output.split("Answer:")[-1].strip()
    for stop_token in ["\n\n", "\nQuestion:", "\nAnswer:", "---"]:
        if stop_token in answer:
            answer = answer.split(stop_token)[0].strip()
    if '.' in answer:
        answer = '.'.join(answer.split('.')[:-1]) + '.'
    print("[6] Generation complete.\n")
    return answer

# Example usage 
'''
for q in queries:
    print("=== QUERY ===")
    print(q)
    answer = rag_advanced_answer(q, k=3)
    print("=== ANSWER ===")
    print(answer)
    print("\n")'''


In [None]:
evaluate_model_metrics(rag_advanced_answer, model_name="RAG advanced", verbose=True)

# MODEL 4: GPT-Neo 2.7B with RAG

In this fourth model, we scale up significantly by replacing GPT-2 with GPT-Neo 2.7B—a powerful open-source language model developed by EleutherAI that brings a big jump in fluency and reasoning capacity. We also increase the number of retrieved passages and optimize the prompt-to-token flow for better long-form generation.

1) **Loading the Corpus**  
As before, we load our pre-cleaned corpus of Bohr passages from disk and store them as a list of paragraphs.

2) **Embedding with SentenceTransformer**  
We keep using `all-MiniLM-L6-v2` to embed each paragraph into 384-dimensional vectors. It's fast and effective for semantic similarity and works well with our retrieval setup.

3) **Indexing with FAISS**  
All embeddings are indexed using FAISS (`IndexFlatL2`), enabling fast L2-based nearest-neighbor search. Nothing changes here compared to earlier models.

4) **Loading GPT-Neo (2.7B)**  
Here’s the major upgrade: we move from GPT-2 Large (774M) to GPT-Neo 2.7B. This model has over 3 times more parameters and was trained on The Pile, making it significantly more capable for general knowledge reasoning and natural language generation. We load both the tokenizer and model using HuggingFace, and run them on GPU if available.

5) **Creating a Custom Generation Pipeline**  
We manually create the generation pipeline by loading the model and tokenizer, and ensure that both are sent to the correct device (GPU if present).

6) **RAG Answer Function (`rag_gptneo`)**  
We increase `k` to 7, retrieving more relevant context chunks per query. These are concatenated with section markers (`---`) and followed by a minimalistic prompt:  
`Question: ... Answer:`  
Unlike earlier models, there is no "Niels Bohr persona" or stylistic guidance—GPT-Neo is expected to synthesize the context into a coherent answer on its own.

7) **Token Management and Truncation**  
We calculate the token length of the input prompt and truncate it if it exceeds the model’s token limit. We manually send the `input_ids` tensor to GPU to ensure the generation runs efficiently.

8) **Answer Extraction**  
We split the final output at the “Answer:” tag (if present), and return the cleaned answer as plain text.

**Key Differences from Model 3:**
- **Much larger model:** GPT-Neo 2.7B instead of GPT-2 Large → higher fluency and reasoning power.
- **Manual token flow:** Explicit GPU handling and input truncation using tokenizer limits.
- **No persona prompt:** More neutral style, letting the model generate based solely on retrieved knowledge.
- **Higher context depth:** `k=7` passages retrieved instead of 3 for a richer information base.

This model pushes the limits of our RAG setup in terms of scale, leveraging a high-capacity generator and feeding it more supporting context to produce deeper, more informative answers.
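
The token budget behind step 7 can be sketched with plain lists of token ids. This is an illustrative variant rather than the notebook's code: it trims only the context so that the question tail always survives, and it assumes a 2048-token window. The name `fit_to_window` and the stand-in ids are hypothetical.

```python
def fit_to_window(context_ids, question_ids, window=2048, max_new_tokens=300):
    """Trim context tokens so that prompt plus generated tokens fit in the window."""
    budget = window - max_new_tokens - len(question_ids)
    if budget < 0:
        raise ValueError("question alone leaves no room for generation")
    # Keep the start of the context (top-ranked passages) and the full question
    return context_ids[:budget] + question_ids

context_ids = list(range(5000))     # stand-in for the tokenized retrieved passages
question_ids = list(range(-20, 0))  # stand-in for the tokenized "Question: ... Answer:" tail
prompt_ids = fit_to_window(context_ids, question_ids)
print(len(prompt_ids))  # 2048 - 300 = 1748
```

Reserving `max_new_tokens` up front matters because the prompt and the generated continuation share the same window.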


In [None]:
import os
import time
from sentence_transformers import SentenceTransformer
import faiss
from transformers import AutoTokenizer, AutoModelForCausalLM, pipeline
import torch
import numpy as np

# Step 1: Load the preprocessed Bohr corpus
corpus_path = 'bohr_cleaned_final.txt'
print(f"[1/8] Loading corpus from {corpus_path}...")
with open(corpus_path, 'r', encoding='utf-8') as f:
    passages = [line.strip() for line in f if line.strip()]
print(f"[1/8] Corpus loaded: {len(passages)} passages\n")

# Step 2: Initialize the SentenceTransformer embedder
print("[2/8] Initializing SentenceTransformer...")
embedder = SentenceTransformer('all-MiniLM-L6-v2')
print("[2/8] Embedder ready\n")

# Step 3: Compute and index passage embeddings
print("[3/8] Computing embeddings and building FAISS index...")
batch_size = 64
emb_chunks = []
start = time.time()
for i in range(0, len(passages), batch_size):
    batch = passages[i:i+batch_size]
    embs = embedder.encode(batch, convert_to_numpy=True, show_progress_bar=False)
    emb_chunks.append(embs)
    print(f"    [3/8] Batch {i//batch_size+1}/{(len(passages)-1)//batch_size+1} completed")
passage_embeddings = np.vstack(emb_chunks)
dim = passage_embeddings.shape[1]
index = faiss.IndexFlatL2(dim)
index.add(passage_embeddings)
print(f"[3/8] Embeddings indexed (shape={passage_embeddings.shape}) in {time.time()-start:.1f}s\n")

# Step 4: Load the GPT-Neo 2.7B model and tokenizer
print("[4/8] Initializing tokenizer and GPT-Neo model...")
tokenizer = AutoTokenizer.from_pretrained("EleutherAI/gpt-neo-2.7B")
generator_model = AutoModelForCausalLM.from_pretrained("EleutherAI/gpt-neo-2.7B")
device = 0 if torch.cuda.is_available() else -1
print(f"[4/8] Model and tokenizer loaded (device={'cuda' if device==0 else 'cpu'})\n")

# Step 5: Create the text-generation pipeline
print("[5/8] Building generation pipeline...")
generator = pipeline("text-generation", model=generator_model, tokenizer=tokenizer, device=device)
print("[5/8] Pipeline ready\n")

# Step 6–8: Define the full RAG inference process with GPT-Neo
def rag_gptneo(query, k=7, max_new_tokens=300, top_p=0.9):
    print(f"\n[6/8] Received query: {query!r}")

    # Step 6: Retrieve top-k semantically similar passages
    q_emb = embedder.encode([query], convert_to_numpy=True)
    D, I = index.search(q_emb, k)
    context = [passages[i] for i in I[0]]

    print(f"[6/8] Retrieved top-{k} passages:")
    for rank, (dist, text) in enumerate(zip(D[0], context), 1):
        print(f"    {rank}. (dist={dist:.3f}) {text[:60]}…")

    # Step 7: Prepare the prompt for generation
    context_text = "\n---\n".join(context)
    prompt = f"{context_text}\n\nQuestion: {query}\nAnswer:"

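    # Ensure the prompt plus the new tokens fit within the model's context window.
    # Note: truncation removes tokens from the end by default, so an over-long
    # context can cut off the question.
    max_input_tokens = tokenizer.model_max_length - max_new_tokens
    input_ids = tokenizer.encode(prompt, return_tensors="pt", truncation=True, max_length=max_input_tokens)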
    input_ids = input_ids.to(generator_model.device)
    print(f"[7/8] Prompt ready (tokens={input_ids.shape[1]}/{max_input_tokens}, device={input_ids.device})")

    # Step 8: Generate the response
    output_ids = generator_model.generate(
        input_ids,
        max_new_tokens=max_new_tokens,
        do_sample=True,
        top_p=top_p,
        pad_token_id=tokenizer.eos_token_id
    )
    # Decode only the newly generated tokens so the prompt (and any "Answer:" text
    # inside the retrieved passages) is not part of the returned answer
    new_tokens = output_ids[0][input_ids.shape[1]:]
    answer = tokenizer.decode(new_tokens, skip_special_tokens=True).strip()

    print("[8/8] Answer generated")
    return answer

# Example usage
'''
for q in queries:
    print(f"\n=============================\nQuery: {q}")
    answer = rag_gptneo(q)
    print(f"\nAnswer:\n{answer}")
    print("=============================")'''


In [None]:
evaluate_model_metrics(rag_gptneo, model_name="GPT-NEO", verbose=True)

# MODEL 5: GPT-Neo 1.3B with Dynamic Context Truncation

In this fifth model, we introduce a refined version of the GPT-Neo-based RAG pipeline. Instead of simply scaling the generator up, this version prioritizes *efficient use of the input token window* by dynamically adapting how much context to include for each query. We also adopt a more detailed, structured prompt to guide answer style and length.

1) **Corpus and Embedding (unchanged)**  
We reuse the same cleaned Bohr corpus and embed each paragraph using the `all-MiniLM-L6-v2` SentenceTransformer model. All embeddings are indexed using FAISS for fast nearest-neighbor search via L2 distance.

2) **Generator Change: GPT-Neo 1.3B**  
We switch from GPT-Neo 2.7B to the lighter `gpt-neo-1.3B` model. While smaller, this version still provides high-quality outputs and is more manageable in environments with limited GPU memory (e.g., Colab).

3) **Smart Context Assembly**  
Instead of retrieving `k=7` passages and concatenating all of them unconditionally, we build the input **dynamically**, one passage at a time. Each passage is capped at 1,500 characters, and after each addition we check whether the resulting prompt (instructions, excerpts, and question) still fits within the model’s token limit minus the tokens reserved for generation. We stop at the first passage that does not fit, so the prompt uses *as much of the top-ranked context as the budget allows* and is never cut off by tokenizer-level truncation.

4) **Prompting Strategy**  
The prompt tells the model to act as Bohr and provide a clear, thoughtful, and concise answer based on the retrieved "excerpts." It explicitly instructs the model *not to mention the excerpts*, which helps avoid meta comments. The target answer length is 3–5 sentences.

5) **Generation Details**  
We sample up to 400 new tokens using nucleus sampling (`top_p=0.8`) with `temperature=0.7`, and set `no_repeat_ngram_size=3` so that no trigram is generated twice.

6) **Post-processing**  
We take the text after the final “Answer:” marker, cut it at the first known stop marker (“---”, a blank line, or a repeated “Answer:”), and drop any trailing sentence fragment so that the answer ends with a full stop. This step does not check grammar; it only removes run-on continuations and incomplete endings.
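
The post-processing step above can be sketched in isolation. This is a minimal illustration, not the exact code used in the cell below: the function name `clean_answer` and the sample string are hypothetical, and the stop markers are taken from the description above.

```python
def clean_answer(text, stop_markers=("---", "\n\n", "\nAnswer:")):
    """Trim a raw generation to its first segment and drop a trailing fragment."""
    answer = text.strip()
    # Cut at each stop marker in turn, keeping only the text before it
    for marker in stop_markers:
        if marker in answer:
            answer = answer.split(marker)[0].strip()
    # Keep everything up to the last full stop, if there is one
    last_period = answer.rfind(".")
    if last_period != -1:
        answer = answer[: last_period + 1]
    return answer


# Hypothetical raw output: one complete sentence, one fragment, then a run-on
raw = "Complementarity is essential. It limits what we can\n\nQuestion: next"
print(clean_answer(raw))
```

Text without any full stop is returned unchanged, so a short answer such as a single phrase is not discarded.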

**Key Differences from Model 4:**
- **Smaller model (1.3B vs 2.7B)**, better for memory-constrained environments.
- **Dynamic context selection** fills the token budget with whole passages and avoids tokenizer-level truncation of the prompt.
- **Stronger prompt structure** guides both tone and content.
- **Improved repetition control** using `no_repeat_ngram_size`.

This model balances resource efficiency and generation quality. Thanks to tighter context management and focused prompting, it often produces answers that are as good as, if not better than, those of its larger predecessor.
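
The dynamic context selection described above amounts to greedy packing under a token budget. The sketch below shows the idea with stand-ins: `pack_context`, `build_prompt`, and `count_tokens` are hypothetical names, and whitespace splitting replaces the real tokenizer so the example runs without downloading a model.

```python
def pack_context(passages, build_prompt, count_tokens, max_tokens, reserve):
    """Greedily add passages while the full prompt stays within the token budget."""
    budget = max_tokens - reserve  # leave room for the generated answer
    chosen = []
    for passage in passages:
        candidate = chosen + [passage]
        if count_tokens(build_prompt(candidate)) < budget:
            chosen = candidate
        else:
            break  # stop at the first passage that does not fit
    return chosen


def build_prompt(chunks):
    # Hypothetical prompt template, shortened for the example
    return (
        "Excerpts:\n" + "\n---\n".join(chunks)
        + "\n\nQuestion: What is complementarity?\nAnswer:"
    )


def count_tokens(text):
    # Assumption: whitespace tokens stand in for real tokenizer counts
    return len(text.split())


passages = ["a " * 10, "b " * 10, "c " * 10]  # already ranked by relevance
selected = pack_context(passages, build_prompt, count_tokens, max_tokens=40, reserve=10)
print(len(selected), "passages fit the budget")
```

Because passages arrive ranked by retrieval distance, stopping at the first one that does not fit keeps the most relevant context. A variant could skip an oversized passage and try the next one, at the cost of mixing ranks.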


In [None]:
import gc
import torch
gc.collect()
torch.cuda.empty_cache()

In [None]:
import os
import time
import numpy as np
from sentence_transformers import SentenceTransformer
import faiss
from transformers import AutoTokenizer, AutoModelForCausalLM
import torch

# Step 1: Load the cleaned corpus
print("[1/8] Loading corpus...")
corpus_path = 'bohr_cleaned_final.txt'
with open(corpus_path, 'r', encoding='utf-8') as f:
    passages = [line.strip() for line in f if line.strip()]
print(f"[1/8] Corpus loaded: {len(passages)} passages\n")

# Step 2: Initialize the SentenceTransformer embedder
print("[2/8] Initializing SentenceTransformer...")
embedder = SentenceTransformer('all-MiniLM-L6-v2')
print("[2/8] Embedder ready\n")

# Step 3: Compute passage embeddings and build FAISS index
print("[3/8] Computing embeddings and indexing...")
batch_size = 64
emb_chunks = []
start = time.time()
for i in range(0, len(passages), batch_size):
    batch = passages[i:i+batch_size]
    embs = embedder.encode(batch, convert_to_numpy=True, show_progress_bar=False)
    emb_chunks.append(embs)
    print(f"    [3/8] Batch {i//batch_size+1}/{(len(passages)-1)//batch_size+1} completed")
passage_embeddings = np.vstack(emb_chunks)
dim = passage_embeddings.shape[1]
index = faiss.IndexFlatL2(dim)
index.add(passage_embeddings)
print(f"[3/8] Indexed {passage_embeddings.shape[0]} vectors in {time.time()-start:.1f}s\n")

# Step 4: Load the GPT-Neo 1.3B model and tokenizer
print("[4/8] Initializing tokenizer and GPT-Neo (1.3B)...")
tokenizer = AutoTokenizer.from_pretrained("EleutherAI/gpt-neo-1.3B")
generator_model = AutoModelForCausalLM.from_pretrained("EleutherAI/gpt-neo-1.3B")
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
generator_model.to(device)
print(f"[4/8] Model and tokenizer ready (device={device})\n")

# Step 5–7: Define the RAG pipeline using GPT-Neo 1.3B
def rag_gptneo_2(query, k=7, max_new_tokens=400, top_p=0.8, temperature=0.7):
    print(f"\n[5/8] Processing query: {query!r}")

    q_emb = embedder.encode([query], convert_to_numpy=True)
    D, I = index.search(q_emb, k)

    total_tokens = 0
    context_chunks = []

    # Dynamically accumulate context passages until input length limit
    for idx in I[0]:
        para_text = passages[idx][:1500]
        test_prompt = (
            f"You are Niels Bohr. Based on the following excerpts, answer the question in your own words. "
            "Provide a clear, thoughtful, and concise answer in 3–5 complete sentences. "
            "Do not refer to the excerpts or sources. Just answer directly.\n\n"
            "Excerpts:\n" + "\n---\n".join(context_chunks + [para_text]) + f"\n\nQuestion: {query}\nAnswer:"
        )
        token_count = len(tokenizer.encode(test_prompt))

        if token_count < tokenizer.model_max_length - max_new_tokens:
            context_chunks.append(para_text)
            total_tokens = token_count
        else:
            break

    print(f"[5/8] Collected {len(context_chunks)} context passages (approx. {total_tokens} tokens)\n")

    context_text = "\n---\n".join(context_chunks)
    prompt = (
        f"You are Niels Bohr. Based on the following excerpts, answer the question in your own words. "
        "Provide a clear, thoughtful, and concise answer in 3–5 complete sentences. "
        "Do not refer to the excerpts or sources. Just answer directly.\n\n"
        "Excerpts:\n" + context_text + f"\n\nQuestion: {query}\nAnswer:"
    )

    max_input_tokens = tokenizer.model_max_length
    input_ids = tokenizer.encode(prompt, return_tensors="pt", truncation=True, max_length=max_input_tokens).to(device)
    print(f"[6/8] Prompt ready (tokens={input_ids.shape[1]}/{max_input_tokens})")

    output_ids = generator_model.generate(
        input_ids,
        max_new_tokens=max_new_tokens,
        do_sample=True,
        top_p=top_p,
        temperature=temperature,
        pad_token_id=tokenizer.eos_token_id,
        no_repeat_ngram_size=3
    )

    generated_text = tokenizer.decode(output_ids[0], skip_special_tokens=True)

    # Extract only the answer portion
    if "Answer:" in generated_text:
        answer = generated_text.split("Answer:")[-1].strip()
    else:
        answer = generated_text.strip()

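    # Cut the answer at the first stop marker to drop run-on continuations
    for stop_token in ["---", "\n\n", "\nAnswer:"]:
        if stop_token in answer:
            answer = answer.split(stop_token)[0].strip()
    # Drop a trailing sentence fragment so the answer ends on a full stop
    if '.' in answer:
        answer = '.'.join(answer.split('.')[:-1]) + '.'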

    print("[7/8] Answer generated")
    return answer

# Example usage

'''for q in queries:
    print(f"\n=============================\nQuery: {q}")
    answer = rag_gptneo_2(q)
    print(f"\nAnswer:\n{answer}")
    print("=============================")'''


In [None]:
evaluate_model_metrics(rag_gptneo_2, model_name="GPT-NEO 2", verbose=True)

# MODEL 6: Flan-T5 Small with Cosine Nearest Neighbors

In this sixth model, we shift from decoder-only generators like GPT-2 and GPT-Neo to a **sequence-to-sequence (encoder-decoder) model**, specifically Google’s `Flan-T5 Small`. This model is instruction-tuned out of the box, which makes it well suited to prompt-based tasks like question answering and paraphrasing. We also replace FAISS with a simpler cosine-distance nearest-neighbor search from `scikit-learn`.

1) **Loading the Corpus**  
We load the cleaned corpus of Bohr passages into a list, as in all previous models.

2) **Embedding the Passages**  
We use the same lightweight SentenceTransformer (`all-MiniLM-L6-v2`) to embed each passage into a 384-dimensional vector. These embeddings serve as the basis for semantic retrieval.

3) **Retrieval with Cosine Similarity**  
Instead of using FAISS, we use `scikit-learn`’s `NearestNeighbors` with `metric='cosine'`. This is easier to set up and to interpret (a cosine distance of 0 means the vectors point in the same direction), and it works well for a few thousand vectors. We retrieve the `k=5` passages closest to the query embedding.

4) **Switching to Flan-T5**  
We use `google/flan-t5-small`, a small, instruction-finetuned T5 model that can follow natural language prompts. Unlike the decoder-only GPT models, Flan-T5 reads the input string with an encoder and generates the output with a separate decoder, so the generated text does not repeat the prompt.

5) **Prompt Design**  
We use a new prompt format tailored to T5-style models:  
“You are Niels Bohr. Based on the following excerpts, answer the question in first person, using your own words.”  
The retrieved context is appended under “Excerpts:”, followed by the question. This style plays to Flan-T5’s strengths—concise, factual, instruction-following answers.

6) **Controlled Generation**  
We use nucleus sampling (`top_p=0.9`) with moderate temperature (`0.7`) to allow some variety while maintaining relevance. The `min_length=150` and `max_length=500` settings (in tokens) keep the answer reasonably detailed, prevent it from terminating too early, and cap its length.

7) **Answer Extraction**  
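Because an encoder-decoder model returns only the generated text, the output normally contains no prompt. We still split on “Answer:” as a safeguard in case the model echoes the marker, then print and return the result.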
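
To make the retrieval in step 3 concrete, the sketch below computes cosine distance (1 minus cosine similarity) and a top-`k` lookup in plain Python. It is an illustration of the metric only, assuming toy 2-dimensional vectors in place of the 384-dimensional embeddings; `cosine_distance` and `top_k` are hypothetical helper names, not part of `scikit-learn`.

```python
import math


def cosine_distance(u, v):
    """Return 1 - cosine similarity between two non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return 1.0 - dot / (norm_u * norm_v)


def top_k(query, vectors, k):
    """Brute-force search: score every vector and keep the k smallest distances."""
    scored = sorted((cosine_distance(query, v), i) for i, v in enumerate(vectors))
    return scored[:k]


# Toy "embeddings"; real ones would come from the sentence encoder
vectors = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [-1.0, 0.0]]
query = [2.0, 0.1]

for dist, idx in top_k(query, vectors, k=2):
    print(f"passage {idx}: cosine distance {dist:.3f}")
```

Cosine distance ignores vector length, so the query `[2.0, 0.1]` is closest to `[1.0, 0.0]` even though their magnitudes differ. This is the main practical difference from the L2 distance used with FAISS in the earlier models, unless the embeddings are normalized first.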

**Key Differences from Previous Models:**
- **Different model architecture:** Flan-T5 is encoder-decoder rather than decoder-only.
- **Instruction tuning:** Flan-T5 was fine-tuned on a large collection of instruction-formatted tasks and follows prompts better out of the box.
- **Simplified retriever:** NearestNeighbors with cosine distance replaces FAISS.
- **Different output style:** More focused, structured, and generally shorter answers, often with a clearer logical flow.

This model trades raw generative power for interpretability and prompt-following ability, making it well suited to concise, structured Bohr-like answers grounded in retrieved content.
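
The nucleus sampling used in step 6 can be illustrated on a toy distribution. The sketch below is a simplified, pure-Python version of the idea and not the Hugging Face implementation: `top_p_sample` is a hypothetical name, and the logits are made up for the example.

```python
import math
import random


def top_p_sample(logits, top_p=0.9, temperature=0.7, rng=random):
    """Sample a token index from the smallest set whose probability mass reaches top_p."""
    # Temperature-scaled softmax (lower temperature sharpens the distribution)
    scaled = [x / temperature for x in logits]
    peak = max(scaled)
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    probs = [e / total for e in exps]

    # Walk tokens from most to least likely until the cumulative mass reaches top_p
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    nucleus, mass = [], 0.0
    for i in order:
        nucleus.append(i)
        mass += probs[i]
        if mass >= top_p:
            break

    # Sample within the nucleus, weighted by the original probabilities
    weights = [probs[i] for i in nucleus]
    token = rng.choices(nucleus, weights=weights, k=1)[0]
    return token, nucleus


token, nucleus = top_p_sample([5.0, 1.0, 0.0, -1.0])
print("nucleus:", nucleus, "sampled token:", token)
```

When one token dominates, the nucleus shrinks to that single token and sampling becomes deterministic. With a flat distribution the nucleus grows, which is where `top_p` and `temperature` introduce variety.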


In [None]:
import os
import time
import numpy as np
from sentence_transformers import SentenceTransformer
from sklearn.neighbors import NearestNeighbors
from transformers import AutoTokenizer, AutoModelForSeq2SeqLM, pipeline

# Step 1: Load the cleaned corpus
CLEANED_PATH = 'bohr_cleaned_final.txt'
EMBED_MODEL = 'all-MiniLM-L6-v2'
GEN_MODEL = 'google/flan-t5-small'
K = 5

with open(CLEANED_PATH, 'r', encoding='utf-8') as f:
    passages = [line.strip() for line in f if line.strip()]
print(f"[1] Loaded {len(passages)} passages.")

# Step 2: Embed the passages using a SentenceTransformer
embedder = SentenceTransformer(EMBED_MODEL)
print("[2] Computing embeddings...")
embeddings = embedder.encode(passages, convert_to_numpy=True, show_progress_bar=True)

# Step 3: Build a Nearest Neighbors index using cosine similarity
nn = NearestNeighbors(n_neighbors=K, metric='cosine')
nn.fit(embeddings)
print("[3] NearestNeighbors index ready.")

# Step 4: Load Flan-T5 model and tokenizer
tokenizer = AutoTokenizer.from_pretrained(GEN_MODEL)
model = AutoModelForSeq2SeqLM.from_pretrained(GEN_MODEL)
generator = pipeline(
    "text2text-generation",
    model=model,
    tokenizer=tokenizer,
    device=-1,  # CPU (set to 0 for CUDA if available)
)

# Step 5: Define the RAG answer generation function
def rag_flan_t5(query, k=K, max_length=500, min_length=150, top_p=0.9, temperature=0.7):
    t0 = time.time()
    print(f"\n[RAG-FLAN] Query: {query}")

    # Retrieve top-k relevant passages based on cosine similarity
    q_emb = embedder.encode([query], convert_to_numpy=True)
    dists, idxs = nn.kneighbors(q_emb, n_neighbors=k)
    retrieved = [passages[i] for i in idxs[0]]
    print(f"[RAG-FLAN] Retrieved passages (cosine distances): {dists[0].round(3).tolist()}")

    # Compose the prompt for conditional generation
    context_text = "\n---\n".join(retrieved)
    prompt = (
        "You are Niels Bohr. Based on the following excerpts, answer the question in first person, using your own words.\n\n"
        "Excerpts:\n" + context_text +
        f"\n\nQuestion: {query}\nAnswer:"
    )

    # Generate the answer using the T5 model
    out = generator(
        prompt,
        max_length=max_length,
        min_length=min_length,
        do_sample=True,
        top_p=top_p,
        temperature=temperature,
        pad_token_id=tokenizer.eos_token_id
    )[0]["generated_text"]

    # Extract and clean the answer
    answer = out.split("Answer:")[-1].strip()

    print("\n=== NIELS BOHR'S ANSWER ===")
    print(answer)
    print(f"[RAG-FLAN] Done in {time.time() - t0:.2f}s\n")

    return answer

# Step 6: Run the model if executed as main
'''if __name__ == "__main__":
    query = "What is the nature of time?"
    rag_flan_t5(query)
'''

In [None]:
evaluate_model_metrics(rag_flan_t5, model_name="FLAN 5", verbose=True)