# Lab 2: Retrieval-Augmented Generation (RAG)

Objective: Build a minimal RAG pipeline using `transformers` with an encoder-only embedder (BERT) and a decoder-only generator (Gemma 1B). The focus is on understanding the internal mechanisms, not on efficiency.

This lab includes:
- A brief RAG overview
- Indexing Padua documents with BERT embeddings (token-based chunking)
- Retrieving top-n similar chunks by cosine similarity
- Generating short answers with Gemma using retrieved context
- Exercises matching course goals

<style>
img[src="images/image.png"] {
    width: 30%;
    height: 20%;
}
</style>

![RAG pipeline overview](image.png)

_Image from https://drjulija.github.io/posts/basic-rag/_

## Setup and Imports
We reuse patterns from Lab 1: a local `cache_dir`, simple pooling for embeddings, and an optional Gemma generation flag.

In [9]:
import os, json
import numpy as np
import torch
from transformers import AutoTokenizer, AutoModel, AutoModelForCausalLM

print('Transformers version:', __import__('transformers').__version__)
print('Torch version:', torch.__version__)

BASE_DIR = os.path.join('../../lab2')
DATA_DIR = os.path.join(BASE_DIR, 'data')
CACHE_DIR = os.path.join(BASE_DIR, 'models_cache')
os.makedirs(CACHE_DIR, exist_ok=True)

RUN_gemma = False  # toggle to True to actually generate answers

# Model IDs (adjust if needed)
BERT_ID = 'bert-base-uncased'
GEMMA_ID = 'unsloth/gemma-3-1B-it'

Transformers version: 4.52.3
Torch version: 2.7.0+cu126


## Load Embedder (BERT) and Tokenizer
We use BERT as an encoder-only embedder. Embeddings are computed via mean pooling of the last hidden states across tokens.
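
The cells in this lab take a plain mean over the token axis, which is fine when one text is encoded at a time because no padding is present. When several texts are batched with padding, a common refinement is to weight the mean by the attention mask. A minimal sketch, assuming a hypothetical helper `mean_pool` that is not part of the lab code:

```python
import torch

def mean_pool(last_hidden_state, attention_mask=None):
    """Average token vectors into one vector per sequence.

    last_hidden_state: tensor of shape (batch, seq_len, hidden)
    attention_mask: optional tensor of shape (batch, seq_len), 1 = real token, 0 = padding
    """
    if attention_mask is None:
        # Plain mean over the token axis, as used in this lab
        return last_hidden_state.mean(dim=1)
    mask = attention_mask.unsqueeze(-1).to(last_hidden_state.dtype)
    summed = (last_hidden_state * mask).sum(dim=1)  # sum of real-token vectors
    counts = mask.sum(dim=1).clamp(min=1e-9)        # number of real tokens
    return summed / counts

# Toy input: 1 sequence, 3 positions, hidden size 2; the last position is padding
hidden = torch.tensor([[[1.0, 2.0], [3.0, 4.0], [100.0, 100.0]]])
mask = torch.tensor([[1, 1, 0]])
plain = mean_pool(hidden)         # the padding position pulls the mean upward
masked = mean_pool(hidden, mask)  # averages only the first two positions: [[2., 3.]]
print(plain, masked)
```

With a real batch, `attention_mask` would come from the tokenizer output and `last_hidden_state` from the model output.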

In [10]:
tokenizer_bert = AutoTokenizer.from_pretrained(BERT_ID, cache_dir=CACHE_DIR)
model_bert = AutoModel.from_pretrained(BERT_ID, cache_dir=CACHE_DIR)
print('BERT special tokens:', tokenizer_bert.special_tokens_map)
print('BERT dtype:', next(model_bert.parameters()).dtype)
print('BERT device:', next(model_bert.parameters()).device)

BERT special tokens: {'unk_token': '[UNK]', 'sep_token': '[SEP]', 'pad_token': '[PAD]', 'cls_token': '[CLS]', 'mask_token': '[MASK]'}
BERT dtype: torch.float32
BERT device: cpu


## Load Generator (Gemma)
We load a 1B instruction-tuned chat model (`GEMMA_ID`). The cell below does not implement a fallback: if the model is unavailable, set `GEMMA_ID` to another chat model that provides a chat template. Generation is optional and resource-intensive (a GPU is recommended).

In [11]:
def load_model(model_id):
    tok = AutoTokenizer.from_pretrained(model_id, cache_dir=CACHE_DIR)
    mdl = AutoModelForCausalLM.from_pretrained(
        model_id, cache_dir=CACHE_DIR, torch_dtype=torch.float16, device_map='auto'
    )
    return tok, mdl

tokenizer_gemma, model_gemma = load_model(GEMMA_ID)

print('gemma model:', GEMMA_ID)
print('Has chat template?', bool(getattr(tokenizer_gemma, 'chat_template', None)))

gemma model: unsloth/gemma-3-1B-it
Has chat template? True


## Load Documents, Embed Them, and Save to the Knowledge Base
We index 25 Padua documents.

In [12]:
docs_path = os.path.join(DATA_DIR, 'kb_docs.json')
with open(docs_path, 'r', encoding='utf-8') as f:
    DOCS = json.load(f)
print('there are', len(DOCS), 'docs')
DOCS[0]['title']

there are 25 docs


'Prato della Valle'

In [13]:
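# Build KB chunks (one chunk per document; each document is embedded whole)
KB_CHUNKS = []
for d in DOCS:
    text_to_encode = d['text']
    # No truncation here, so each document must fit within BERT's 512-token limit
    encoding_input = tokenizer_bert(text_to_encode, return_tensors='pt', truncation=False, add_special_tokens=False)
    with torch.no_grad():  # inference only, no gradients needed
        hidden_states = model_bert(**encoding_input).last_hidden_state
    # Mean-pool over the token dimension to get one vector per document
    embedding = hidden_states.mean(dim=1).squeeze().cpu().numpy()
    KB_CHUNKS.append({
        'doc_id': d['id'],
        'title': d['title'],
        'text': text_to_encode,
        'token_count': len(text_to_encode.split()),  # whitespace word count, not BERT tokens
        'embedding': embedding.tolist()
    })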

index_path = os.path.join(DATA_DIR, 'kb_index.json')
with open(index_path, 'w', encoding='utf-8') as f:
    json.dump(KB_CHUNKS, f, ensure_ascii=False, indent=2)
len(KB_CHUNKS), KB_CHUNKS[0]['title']

(25, 'Prato della Valle')
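
The overview lists token-based chunking, while the indexing cell above embeds each document as a single chunk. For longer documents, one possible approach is a sliding window over token IDs. The sketch below is an assumption, not part of the lab code: `chunk_token_ids`, `chunk_size`, and `overlap` are hypothetical names and values.

```python
def chunk_token_ids(token_ids, chunk_size=128, overlap=32):
    """Split a list of token IDs into overlapping windows of at most chunk_size tokens."""
    if chunk_size <= 0 or not 0 <= overlap < chunk_size:
        raise ValueError("need chunk_size > 0 and 0 <= overlap < chunk_size")
    step = chunk_size - overlap  # how far the window advances each time
    chunks = []
    for start in range(0, len(token_ids), step):
        chunks.append(token_ids[start:start + chunk_size])
        if start + chunk_size >= len(token_ids):
            break  # this window already reaches the end of the document
    return chunks

# Toy example with integers standing in for token IDs
ids = list(range(10))
chunks = chunk_token_ids(ids, chunk_size=4, overlap=1)
print(chunks)  # [[0, 1, 2, 3], [3, 4, 5, 6], [6, 7, 8, 9]]
```

With the lab's tokenizer, the IDs could come from `tokenizer_bert(text, add_special_tokens=False)['input_ids']`, and each chunk could be turned back into text with `tokenizer_bert.decode(chunk)` before embedding.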

## Retrieval: Top-n Similar Chunks
We compute cosine similarity between the query embedding and KB chunk embeddings and return the top-n chunks.
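
To make the ranking step concrete, here is a small worked example on 2-dimensional toy vectors (the vectors and the helper name `cosine_scores` are illustrative assumptions, not lab data). It shows that cosine similarity depends on direction only, not on vector length, and how `np.argsort` picks the top-n indices.

```python
import numpy as np

def cosine_scores(query_vec, doc_vecs):
    """Cosine similarity between one query vector and each row of doc_vecs."""
    q = np.asarray(query_vec, dtype=float)
    d = np.asarray(doc_vecs, dtype=float)
    q = q / max(np.linalg.norm(q), 1e-12)  # unit-length query
    d = d / np.clip(np.linalg.norm(d, axis=1, keepdims=True), 1e-12, None)  # unit-length rows
    return d @ q

query_vec = [1.0, 0.0]
doc_vecs = [[2.0, 0.0],   # same direction as the query, different length
            [0.0, 3.0],   # orthogonal to the query
            [1.0, 1.0]]   # 45 degrees from the query
scores = cosine_scores(query_vec, doc_vecs)  # about 1.0, 0.0, 0.707
top_idx = np.argsort(-scores)[:2]            # indices 0 and 2, best first
print(scores, top_idx)
```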

In [14]:
def embed_query(query):
    enc = tokenizer_bert([query], return_tensors='pt', padding=True, truncation=True)
    with torch.no_grad():
        out = model_bert(**enc)
        emb = out.last_hidden_state.mean(dim=1).cpu().numpy()
    return emb.tolist()[0]

def cosine_similarity_matrix(query, kb_emb):
    query = np.array(query).reshape(1, -1)
    kb_emb = np.array(kb_emb)
    query = query / np.clip(np.linalg.norm(query, axis=1, keepdims=True), 1e-12, None)
    kb_emb = kb_emb / np.clip(np.linalg.norm(kb_emb, axis=1, keepdims=True), 1e-12, None)
    return query @ kb_emb.T

def retrieve_top_n(query, kb_chunks, n=3):
    # kb_chunks is the list of KB entries (dicts with 'text' and 'embedding')
    q = embed_query(query)
    document_embeddings = [chunk['embedding'] for chunk in kb_chunks]
    sims = cosine_similarity_matrix(q, document_embeddings)[0]
    top_idx = np.argsort(-sims)[:n]  # indices of the n highest similarities
    return [
        {'text': kb_chunks[i]['text'], 'similarity': float(sims[i])}
        for i in top_idx
    ]

In [15]:
# Load queries
queries_path = os.path.join(DATA_DIR, 'queries.json')
with open(queries_path, 'r', encoding='utf-8') as f:
    QUERIES = json.load(f)


# Demo retrieval on first query
q0 = QUERIES[0]['query']
print('Query:', q0)
top_chunks = retrieve_top_n(q0, KB_CHUNKS, n=3)
for retrieved in top_chunks:
    print(f'Similarity={retrieved["similarity"]:.3f}, text={retrieved["text"][:180]}')

Query: Where can you see Giotto’s frescoes in Padua?
Similarity=0.621, text=Founded in 1222, the University of Padua is renowned for research excellence, diverse faculties, and a vibrant international community.
Similarity=0.618, text=The Scrovegni Chapel contains Giotto’s famous frescoes. Entry is by ticket and timed slots, located near the Church of the Eremitani.
Similarity=0.598, text=Padua Cathedral (Duomo) stands alongside a Romanesque Baptistery with frescoes by Giusto de’ Menabuoi. Regular liturgical celebrations take place.


## Generation: Answer from Retrieved Context
We instruct Gemma to answer concisely using only the provided context. If the answer is not in the context, the model should say it does not know.

In [None]:
system_prompt = """You are a helpful assistant that answers questions about the city of Padova, Italy.
Use the retrieved documents to answer the question as best as you can.
If you don't know the answer, say you don't know."""


RESULTS = []
for query in QUERIES:
    top_chunks = retrieve_top_n(query['query'], KB_CHUNKS, n=3)
    for retrieved in top_chunks:
        print(f'Similarity={retrieved["similarity"]:.3f}, text={retrieved["text"][:180]}')

    # Put the retrieved chunks into the prompt; otherwise the model never sees them
    context = "\n\n".join(f"Document {i + 1}: {c['text']}" for i, c in enumerate(top_chunks))
    user_prompt = f"Retrieved documents:\n{context}\n\nQuestion: {query['query']}"

    messages = [
        {"role": "system", "content": system_prompt},
        {"role": "user", "content": user_prompt},
    ]

    inputs = tokenizer_gemma.apply_chat_template(
        messages,
        add_generation_prompt=True,
        tokenize=True,
        return_dict=True,
        return_tensors="pt",
    ).to(model_gemma.device)

    outputs = model_gemma.generate(**inputs, max_new_tokens=40)
    # Decode only the newly generated tokens and drop special tokens
    new_tokens = outputs[0][inputs["input_ids"].shape[-1]:]
    answer = tokenizer_gemma.decode(new_tokens, skip_special_tokens=True)
    RESULTS.append({
        'query': query['query'],
        'retrieved': top_chunks,
        'answer': answer
    })

## Exercises

1) You have generated an answer for each of the queries in `lab2/data/queries.json`. Each query has an `expected_answer`. Find a way to use Gemma to determine whether the generated answer is correct or not by prompting the LLM itself.

2) Create a RAG system that answers the queries in `lab2/data/excercise/queries.json` using the documents in `lab2/data/excercise/docs.txt`.
