# Simple Retrieval-Augmented Generation (RAG) System
###  Kyle Kitching

---

0. Setup and Imports
1. Knowledge Base (KB) Creation and Chunking
2. Embedding and Indexing
3. Retrieval (Similarity Search)
4. Generation (Augmented LLM Answering)
5. Test Cases (Factual, Foil, Synthesis)
6. Results Summary and Notes


In [3]:
# Install required libraries + tf-keras shim so transformers stops complaining
!pip install -q "transformers[torch]" sentence-transformers accelerate tf-keras


In [4]:
# 0. Setup and Imports

import os
os.environ["USE_TF"] = "0"  # tell transformers not to use TensorFlow at all

import numpy as np
from sentence_transformers import SentenceTransformer
from transformers import AutoTokenizer, AutoModelForSeq2SeqLM

import textwrap
import torch
import random






In [5]:
# Choose embedding and LLM models

EMBEDDING_MODEL_NAME = "sentence-transformers/all-MiniLM-L6-v2"
LLM_MODEL_NAME = "google/flan-t5-small"

# Basic reproducibility
SEED = 42
np.random.seed(SEED)
random.seed(SEED)
torch.manual_seed(SEED)

device = "cuda" if torch.cuda.is_available() else "cpu"
device


'cpu'

In [6]:
# Load embedding model
embedder = SentenceTransformer(EMBEDDING_MODEL_NAME, device=device)

# Load LLM + tokenizer
tokenizer = AutoTokenizer.from_pretrained(LLM_MODEL_NAME)
llm_model = AutoModelForSeq2SeqLM.from_pretrained(LLM_MODEL_NAME).to(device)



## 1. Knowledge Base (KB) Creation and Chunking
### 1.1 KB Topic

For this assignment, my custom knowledge base describes the internal **concussion return-to-play protocol** of the fictional **Washington Sentinels** NFL team.

The KB is short (three paragraphs) and is treated as "private" policy text that the base LLM would not know. The RAG system retrieves chunks of this KB to answer questions about how players progress through the concussion protocol.


In [7]:
# 1.2 Create the Knowledge Base text and save to a file

kb_text = """
The Washington Sentinels use a five-stage concussion return-to-play protocol designed to prioritize player safety over game availability. 
Stage 0 is complete physical and cognitive rest until the player is symptom-free at baseline. 
Stage 1 allows light, symptom-limited activity such as walking or stationary cycling for no more than 20 minutes. 
Stage 2 adds light aerobic conditioning and position-specific footwork drills with no contact or risk of collisions. 
Stage 3 permits non-contact football activities, including route running, coverage drills, and walkthroughs at full speed. 
Stage 4 is a full-contact practice, followed by a final medical clearance evaluation. 
Players must remain at each stage for at least 24 hours; if symptoms return at any point, they are moved back one full stage and must be symptom-free for another 24 hours before progressing.

The Sentinels have several club-specific rules that are stricter than the standard league guidelines. 
A player can never be fully cleared on the same calendar day the concussion is diagnosed, even if symptoms quickly resolve. 
Any player who sustains a second diagnosed concussion in the same season must complete a minimum of seven symptom-free days at Stage 0 before re-entering the on-field progression. 
Only the Head Team Physician, or a designated backup physician listed on the weekly medical coverage sheet, may sign the final return-to-play clearance. 
Position coaches, the head coach, and the general manager are not allowed to overrule or pressure medical staff about a player’s clearance status.

Communication and documentation are also tightly controlled. 
All suspected concussions must be logged into the SentinelMed system within two hours of either the in-game event or post-game evaluation. 
Daily practice reports categorize each player as Out, Limited, or Full based on their current stage of the protocol, and these labels must match the medical record in SentinelMed. 
For nationally televised prime-time games, an independent neurological consultant must review the player’s file and co-sign the clearance before the athlete is allowed to participate. 
Failure to follow documentation rules can trigger an internal compliance review and potential fines for staff members who ignored or altered medical information.
""".strip()

# Save to a text file in the repo (for the assignment requirement)
kb_filename = "kb_sentinels_concussion_protocol.txt"
with open(kb_filename, "w", encoding="utf-8") as f:
    f.write(kb_text)

print("KB saved to:", kb_filename)
print("\nRaw KB text:\n")
print(kb_text)


KB saved to: kb_sentinels_concussion_protocol.txt

Raw KB text:

The Washington Sentinels use a five-stage concussion return-to-play protocol designed to prioritize player safety over game availability. 
Stage 0 is complete physical and cognitive rest until the player is symptom-free at baseline. 
Stage 1 allows light, symptom-limited activity such as walking or stationary cycling for no more than 20 minutes. 
Stage 2 adds light aerobic conditioning and position-specific footwork drills with no contact or risk of collisions. 
Stage 3 permits non-contact football activities, including route running, coverage drills, and walkthroughs at full speed. 
Stage 4 is a full-contact practice, followed by a final medical clearance evaluation. 
Players must remain at each stage for at least 24 hours; if symptoms return at any point, they are moved back one full stage and must be symptom-free for another 24 hours before progressing.

The Sentinels have several club-specific rules that are stricter 

In [8]:
# 1.3 Simple chunking: split KB into paragraph-level chunks

# Split on double newlines and strip whitespace
kb_chunks = [para.strip() for para in kb_text.split("\n\n") if para.strip()]

kb_docs = [
    {"id": i, "text": chunk}
    for i, chunk in enumerate(kb_chunks)
]

print(f"Number of chunks: {len(kb_docs)}\n")
for doc in kb_docs:
    print(f"--- Chunk {doc['id']} ---")
    print(textwrap.fill(doc["text"], width=100))
    print()


Number of chunks: 3

--- Chunk 0 ---
The Washington Sentinels use a five-stage concussion return-to-play protocol designed to prioritize
player safety over game availability.  Stage 0 is complete physical and cognitive rest until the
player is symptom-free at baseline.  Stage 1 allows light, symptom-limited activity such as walking
or stationary cycling for no more than 20 minutes.  Stage 2 adds light aerobic conditioning and
position-specific footwork drills with no contact or risk of collisions.  Stage 3 permits non-
contact football activities, including route running, coverage drills, and walkthroughs at full
speed.  Stage 4 is a full-contact practice, followed by a final medical clearance evaluation.
Players must remain at each stage for at least 24 hours; if symptoms return at any point, they are
moved back one full stage and must be symptom-free for another 24 hours before progressing.

--- Chunk 1 ---
The Sentinels have several club-specific rules that are stricter than the sta

## 2. Embedding and Indexing
### 2.1 Embedding the KB Chunks

I use a SentenceTransformer model to convert each KB chunk into a dense vector (embedding). 
These vectors are stored in memory as a simple "vector store", which I query later using cosine similarity.
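
The reason normalized embeddings are convenient can be shown with a small sketch in plain Python. The toy vectors and the helper names (`normalize`, `dot`, `cosine`) are assumptions made for illustration and are not part of the notebook; real embeddings from the model have 384 dimensions.

```python
import math

def normalize(vec):
    """Scale a vector to unit length (L2 norm of 1)."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec]

def dot(a, b):
    """Plain dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))

def cosine(a, b):
    """Textbook cosine similarity: dot product divided by both norms."""
    return dot(a, b) / (math.sqrt(dot(a, a)) * math.sqrt(dot(b, b)))

# Toy 3-dimensional "embeddings" (made up for illustration)
chunk_vec = [3.0, 4.0, 0.0]
query_vec = [1.0, 2.0, 2.0]

full_cosine = cosine(chunk_vec, query_vec)
unit_dot = dot(normalize(chunk_vec), normalize(query_vec))

# Both values agree, so a dot product is enough once vectors are unit length
print(round(full_cosine, 6), round(unit_dot, 6))
```

Because the two numbers match, storing unit-length vectors lets the later search step use a single matrix-vector product instead of dividing by norms on every query.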


In [9]:
# 2.2 Compute embeddings for each KB chunk

kb_texts = [doc["text"] for doc in kb_docs]

# Use the SentenceTransformer model to get embeddings
kb_embeddings = embedder.encode(
    kb_texts,
    convert_to_numpy=True,
    normalize_embeddings=True  # normalizes so dot product == cosine similarity
)

print("kb_embeddings shape:", kb_embeddings.shape)


kb_embeddings shape: (3, 384)


## 3. Retrieval (Similarity Search)
### 3.1 Cosine Similarity Retrieval

Given a user query, I:
1. Embed the query using the same SentenceTransformer model.
2. Compute cosine similarity between the query embedding and all KB chunk embeddings.
3. Sort the chunks by similarity score and return the top-k most relevant chunks.
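
The three steps above can be sketched without any model, assuming the query and chunks have already been embedded. The 2-dimensional vectors and the `top_k_chunks` helper are hypothetical stand-ins for the real embeddings and for the notebook's own retrieval function.

```python
import math

def normalize(vec):
    """Scale a vector to unit length so dot product equals cosine similarity."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec]

def top_k_chunks(query_vec, chunk_vecs, k=2):
    """Return (chunk_index, score) pairs for the k highest cosine scores."""
    q = normalize(query_vec)
    scored = []
    for idx, vec in enumerate(chunk_vecs):
        score = sum(a * b for a, b in zip(q, normalize(vec)))
        scored.append((idx, score))
    # Highest similarity first
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:k]

# Toy chunk embeddings and query embedding (made up for illustration)
toy_chunks = [
    [1.0, 0.0],
    [0.0, 1.0],
    [1.0, 1.0],
]
toy_query = [2.0, 1.0]

ranked = top_k_chunks(toy_query, toy_chunks, k=2)
for idx, score in ranked:
    print(f"chunk {idx}: score={score:.3f}")
```

With only three chunks a full sort is cheap; a larger knowledge base would usually rely on an approximate nearest-neighbor index instead.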


In [10]:
# 3.2 Helper: retrieve top-k most similar chunks for a query

def retrieve_relevant_chunks(query, top_k=2):
    """
    Given a natural language query, return the top_k KB chunks with highest cosine similarity.
    """
    # Embed the query
    query_emb = embedder.encode(
        [query],
        convert_to_numpy=True,
        normalize_embeddings=True
    )[0]  # shape: (embedding_dim,)

    # Cosine similarity = dot product because both sides normalized
    scores = kb_embeddings @ query_emb  # shape: (num_chunks,)

    # Get indices of top_k scores, sorted descending
    top_indices = np.argsort(scores)[::-1][:top_k]

    results = []
    for idx in top_indices:
        results.append({
            "id": kb_docs[idx]["id"],
            "text": kb_docs[idx]["text"],
            "score": float(scores[idx]),
        })
    return results


# 3.3 Quick demo: test retrieval with a sample query
demo_query = "What happens if a player gets a second concussion in the same season?"
retrieved = retrieve_relevant_chunks(demo_query, top_k=2)

print("Query:", demo_query)
print()
for r in retrieved:
    print(f"Chunk {r['id']} | score={r['score']:.3f}")
    print(textwrap.fill(r["text"], width=100))
    print("-" * 100)


Query: What happens if a player gets a second concussion in the same season?

Chunk 1 | score=0.694
The Sentinels have several club-specific rules that are stricter than the standard league
guidelines.  A player can never be fully cleared on the same calendar day the concussion is
diagnosed, even if symptoms quickly resolve.  Any player who sustains a second diagnosed concussion
in the same season must complete a minimum of seven symptom-free days at Stage 0 before re-entering
the on-field progression.  Only the Head Team Physician, or a designated backup physician listed on
the weekly medical coverage sheet, may sign the final return-to-play clearance.  Position coaches,
the head coach, and the general manager are not allowed to overrule or pressure medical staff about
a player’s clearance status.
----------------------------------------------------------------------------------------------------
Chunk 2 | score=0.529
Communication and documentation are also tightly controlled.  All s

## 4. Generation (Augmented LLM Answering)
### 4.1 RAG-style Generation

To answer questions, I use a small T5-style model (`google/flan-t5-small`).  
The pipeline is:

1. Retrieve the top-k relevant KB chunks for the user query.
2. Build a prompt that includes both the retrieved context and the original question.
3. Ask the LLM to answer **using only the provided context**, and to say "I don't know" if the answer is not in the KB.
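
As a rough sketch, the same three steps can be written with retrieval and generation passed in as functions, which lets the flow be checked without loading a model. The names `rag_answer`, `fake_retrieve`, and `fake_generate` are hypothetical, and the stub generator returns a fixed placeholder rather than real model output.

```python
REFUSAL = "I don't know based on the provided policy."

def build_prompt(question, context_chunks):
    """Join retrieved chunks and the question into one grounded prompt."""
    context = "\n\n".join(context_chunks)
    return (
        "Use ONLY the information in the CONTEXT to answer the QUESTION.\n"
        f'If the answer is not in the context, say "{REFUSAL}"\n\n'
        f"CONTEXT:\n{context}\n\nQUESTION:\n{question}\n\nANSWER:"
    )

def rag_answer(question, retrieve_fn, generate_fn, top_k=2):
    """Step 1: retrieve. Step 2: build the prompt. Step 3: generate."""
    chunks = retrieve_fn(question, top_k)
    prompt = build_prompt(question, chunks)
    return generate_fn(prompt), prompt

# Stand-in retriever: returns fixed toy chunks instead of searching embeddings
def fake_retrieve(question, top_k):
    docs = [
        "Players must remain at each stage for at least 24 hours.",
        "All suspected concussions must be logged within two hours.",
    ]
    return docs[:top_k]

# Stand-in generator: a real run would call the LLM here
def fake_generate(prompt):
    return "(model output would appear here)"

answer, prompt = rag_answer("How long is each stage?", fake_retrieve, fake_generate)
print(prompt)
```

Keeping the refusal sentence in one constant also makes it easy to check later whether the model actually used it.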


In [11]:
# 4.2 Helper: run the LLM on a prompt

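def run_llm(prompt, max_new_tokens=160):
    """Generate a greedy (deterministic) answer for the given prompt."""
    # Truncate to the tokenizer's maximum input length so long prompts cannot overflow the model
    inputs = tokenizer(prompt, return_tensors="pt", truncation=True).to(device)
    output_ids = llm_model.generate(
        **inputs,
        max_new_tokens=max_new_tokens,
        do_sample=False  # deterministic / greedy for this assignment
    )
    # Decode the first (only) sequence and drop special tokens such as </s>
    answer = tokenizer.decode(output_ids[0], skip_special_tokens=True)
    return answer.strip()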


In [12]:
# 4.3 Build a RAG-style prompt that injects retrieved context

def build_rag_prompt(query, retrieved_chunks):
    """
    Build a prompt that:
    - Includes the KB context (1-2 chunks).
    - Asks the model to answer based only on that context.
    """
    context_texts = [c["text"] for c in retrieved_chunks]
    joined_context = "\n\n".join(context_texts)

    prompt = f"""
You are an assistant for answering questions about the Washington Sentinels' internal concussion protocol.
Use ONLY the information in the CONTEXT to answer the QUESTION.
If the answer is not in the context, say "I don't know based on the provided policy."

CONTEXT:
{joined_context}

QUESTION:
{query}

ANSWER:
""".strip()
    return prompt


def answer_with_rag(query, top_k=2, verbose=True):
    """
    Full RAG pipeline:
    1. Retrieve top_k KB chunks.
    2. Build context-augmented prompt.
    3. Generate answer from the LLM.
    """
    retrieved_chunks = retrieve_relevant_chunks(query, top_k=top_k)
    prompt = build_rag_prompt(query, retrieved_chunks)
    answer = run_llm(prompt)

    if verbose:
        print("=== QUERY ===")
        print(query)
        print("\n=== RETRIEVED CHUNKS ===")
        for r in retrieved_chunks:
            print(f"Chunk {r['id']} | score={r['score']:.3f}")
            print(textwrap.fill(r["text"], width=100))
            print()
        print("=== MODEL ANSWER (RAG) ===")
        print(answer)

    return answer, retrieved_chunks


In [13]:
# 4.4 Baseline: plain LLM answer with NO retrieved context
def answer_without_rag(query):
    prompt = f"Answer the following question as best you can:\n\nQuestion: {query}\n\nAnswer:"
    return run_llm(prompt)


In [14]:
# 4.5 Quick test of the RAG pipeline

test_query = "Who is allowed to sign the final concussion return-to-play clearance?"
rag_answer, rag_chunks = answer_with_rag(test_query, top_k=2)

print("\n--- Plain LLM (no RAG) for comparison ---")
plain_answer = answer_without_rag(test_query)
print(plain_answer)


=== QUERY ===
Who is allowed to sign the final concussion return-to-play clearance?

=== RETRIEVED CHUNKS ===
Chunk 1 | score=0.689
The Sentinels have several club-specific rules that are stricter than the standard league
guidelines.  A player can never be fully cleared on the same calendar day the concussion is
diagnosed, even if symptoms quickly resolve.  Any player who sustains a second diagnosed concussion
in the same season must complete a minimum of seven symptom-free days at Stage 0 before re-entering
the on-field progression.  Only the Head Team Physician, or a designated backup physician listed on
the weekly medical coverage sheet, may sign the final return-to-play clearance.  Position coaches,
the head coach, and the general manager are not allowed to overrule or pressure medical staff about
a player’s clearance status.

Chunk 2 | score=0.614
Communication and documentation are also tightly controlled.  All suspected concussions must be
logged into the SentinelMed system with

## 5. Test Cases (Factual, Foil, Synthesis)
### 5.1 Test Case Types

I run three types of test questions:

1. **Factual** – Directly answerable from a single KB chunk.
2. **Foil / General** – Not answerable from the KB; the model should rely on general knowledge or say it cannot answer from the policy.
3. **Synthesis** – Requires combining information from multiple KB chunks or reasoning across details.
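
One possible way to screen answers for these three types is a crude keyword check, sketched below. The `grade_answer` helper, the refusal markers, and the sample answers are all assumptions for illustration; the sample answers are hand-written and are not output from the model.

```python
REFUSAL_MARKERS = ("i don't know", "cannot answer", "not in the context")

def grade_answer(case_type, answer, expected_keywords=()):
    """Crude pass/fail check.

    Foil cases pass when the answer contains a refusal marker.
    Factual and Synthesis cases pass when every expected keyword appears.
    """
    text = answer.lower()
    if case_type == "Foil":
        return any(marker in text for marker in REFUSAL_MARKERS)
    return all(keyword.lower() in text for keyword in expected_keywords)

# Hand-written sample answers (not real model output)
factual_ok = grade_answer("Factual", "At least 24 hours at each stage.", ["24 hours"])
foil_ok = grade_answer("Foil", "I don't know based on the provided policy.")
synthesis_ok = grade_answer(
    "Synthesis",
    "Seven days at Stage 0.",
    ["seven", "neurological consultant"],
)

print(factual_ok, foil_ok, synthesis_ok)
```

The synthesis sample fails on purpose: it mentions the seven days but omits the consultant co-signature, which is the kind of partial answer a synthesis question is meant to expose. Keyword matching is only a first screen and does not replace reading the answers.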


In [15]:
# 5.2 Define three test cases

test_cases = [
    {
        "name": "Factual",
        "query": "How long must a player remain at each stage of the Sentinels concussion protocol before progressing?",
    },
    {
        "name": "Foil",
        "query": "What year did the Washington Sentinels win their first Super Bowl?",
    },
    {
        "name": "Synthesis",
        "query": (
            "If a player sustains a second concussion in the same season, "
            "describe the minimum time they must spend at Stage 0 and any extra approvals "
            "needed before they can play in a nationally televised game."
        ),
    },
]

results = []

for case in test_cases:
    print("=" * 80)
    print(f"TEST CASE: {case['name']}")
    print("=" * 80)

    query = case["query"]

    # RAG answer
    rag_answer, retrieved_chunks = answer_with_rag(query, top_k=2, verbose=True)

    # Plain LLM answer (no KB context) for comparison
    print("\n--- Plain LLM (no RAG) ---")
    plain_answer = answer_without_rag(query)
    print(plain_answer)
    print("\n\n")

    # Store summary info for README later
    results.append({
        "name": case["name"],
        "query": query,
        "rag_answer": rag_answer,
        "plain_answer": plain_answer,
        "retrieved_ids": [c["id"] for c in retrieved_chunks],
    })

print("Finished running all test cases.")


TEST CASE: Factual
=== QUERY ===
How long must a player remain at each stage of the Sentinels concussion protocol before progressing?

=== RETRIEVED CHUNKS ===
Chunk 1 | score=0.803
The Sentinels have several club-specific rules that are stricter than the standard league
guidelines.  A player can never be fully cleared on the same calendar day the concussion is
diagnosed, even if symptoms quickly resolve.  Any player who sustains a second diagnosed concussion
in the same season must complete a minimum of seven symptom-free days at Stage 0 before re-entering
the on-field progression.  Only the Head Team Physician, or a designated backup physician listed on
the weekly medical coverage sheet, may sign the final return-to-play clearance.  Position coaches,
the head coach, and the general manager are not allowed to overrule or pressure medical staff about
a player’s clearance status.

Chunk 0 | score=0.754
The Washington Sentinels use a five-stage concussion return-to-play protocol designed

## 6. Results Summary and Notes

### 6.1 High-Level Observations

- **Factual case**:  
  - RAG:  
    - Retrieved chunk IDs: 1, 0 (chunk 1 scored 0.803; chunk 0, which contains the 24-hour rule, scored 0.754).  
    - Answer was (accurate / partially accurate / inaccurate): (your comment).  
  - Plain LLM:  
    - Behavior compared to RAG: (hallucinated details? got it right anyway?).

- **Foil / General case**:  
  - RAG:  
    - Did the model correctly say it did not know based on the policy, or did it hallucinate?  
  - Plain LLM:  
    - Any obvious hallucinations (e.g., making up a Super Bowl year)?

- **Synthesis case**:  
  - RAG:  
    - Did the answer combine multiple rules (e.g., second concussion Stage 0 days + prime-time independent neurologist approval)?  
  - Plain LLM:  
    - How did it compare in terms of detail and alignment with the KB?
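
A small helper could turn the `results` list from the test loop into the raw material for these notes. This is a minimal sketch: `summarize_results` is a hypothetical name, and the example entry uses placeholder strings with the same keys as `results`, not real model output.

```python
def summarize_results(results):
    """Render one Markdown bullet block per test case from the results list."""
    lines = []
    for r in results:
        ids = ", ".join(str(i) for i in r["retrieved_ids"])
        lines.append(f"- **{r['name']} case**:")
        lines.append(f"  - Retrieved chunk IDs: {ids}")
        lines.append(f"  - RAG answer: {r['rag_answer']}")
        lines.append(f"  - Plain LLM answer: {r['plain_answer']}")
    return "\n".join(lines)

# Placeholder entry with the same keys the test loop stores (values are dummies)
example_results = [{
    "name": "Example",
    "query": "<query text>",
    "rag_answer": "<RAG answer>",
    "plain_answer": "<plain answer>",
    "retrieved_ids": [1, 0],
}]

print(summarize_results(example_results))
```

In the notebook, calling `summarize_results(results)` after the test loop would print the actual answers, which still need a manual judgment of accuracy.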
