
## 🧩 QA (Extractive, Generative) and RAG

In this notebook, we’ll explore **Question Answering (QA)** and **Retrieval-Augmented Generation (RAG)** — two key paradigms for connecting large language models with factual information.

We’ll begin with **Extractive QA**, where a **BERT-style encoder** identifies an answer span directly in a passage by predicting the span’s start and end tokens.  
Then, we’ll move to **Generative QA**, where models like **BART** or **GPT** produce free-form answers in natural language, demonstrating greater flexibility but also higher risk of **hallucination**.

Next, we’ll discuss the **limitations** of both approaches — extractive QA can be rigid and context-limited, while generative QA may generate fluent but incorrect answers.  
To address these issues, we’ll introduce **Retrieval-Augmented Generation (RAG)**, which enriches the model’s context with relevant external knowledge to improve **factuality** and **reduce hallucinations**.


The goal of this notebook is **not just to run QA models**, but to **understand their design trade-offs** and how retrieval-based methods can make generation more trustworthy and grounded in evidence.  

By the end of this notebook, you’ll have a clear understanding of:
- How extractive and generative QA differ in architecture and behavior,  
- Why hallucination occurs in generative systems, and  
- How RAG mitigates these issues by integrating retrieval with generation

## Dataset
The Stanford Question Answering Dataset (**SQuAD**) is a reading comprehension dataset of questions posed by crowdworkers on a set of Wikipedia articles. In SQuAD 1.1, the version loaded below as `squad`, the answer to every question is a segment of text (a span) from the corresponding reading passage. SQuAD 2.0 additionally contains questions that cannot be answered from the passage.

SQuAD 1.1 contains 100,000+ question-answer pairs on 500+ articles.
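To make the "answer is a span" idea concrete, here is a minimal sketch of how a SQuAD-style record ties an answer to its passage. The record below is hypothetical (made-up `id`, hand-written context), but it follows the field layout of the Hugging Face `squad` dataset: `answers["text"]` holds the answer strings and `answers["answer_start"]` holds their character offsets in `context`.

```python
# Hypothetical record that mimics the field layout of a SQuAD v1.1 example.
demo_context = "Marie Curie was awarded the Nobel Prize in Physics in 1903."
demo_record = {
    "id": "demo-0001",  # made-up id, not a real SQuAD id
    "title": "Marie_Curie",
    "context": demo_context,
    "question": "In which year was Marie Curie awarded the Nobel Prize in Physics?",
    "answers": {
        "text": ["1903"],
        # Character offset of the answer inside the context.
        "answer_start": [demo_context.index("1903")],
    },
}


def extract_span(record, i=0):
    """Recover the i-th answer by slicing the context at its character offset."""
    start = record["answers"]["answer_start"][i]
    length = len(record["answers"]["text"][i])
    return record["context"][start:start + length]


# The slice of the context must match the stored answer text exactly.
recovered = extract_span(demo_record)
print(recovered == demo_record["answers"]["text"][0])
```

This exact-slice property is what lets an extractive model be trained to point at a start and an end position instead of generating text.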



In [None]:
from datasets import load_dataset

dataset = load_dataset("squad")

squad_sample = dataset["train"].select(range(10))

#print first 3 examples
for i in range(3):
    print(f"\nExample {i+1}")
    print(f"Context: {squad_sample[i]['context'][:100]}...")
    print(f"Question: {squad_sample[i]['question']}")
    print(f"Answer: {squad_sample[i]['answers']}")

### Extractive QA Model

The class **`AutoModelForQuestionAnswering`** is part of Hugging Face’s `transformers` library and is specifically designed for **extractive question answering** tasks — where the model identifies an answer **span** within a given context.

In [None]:
from datasets import load_dataset
from transformers import AutoTokenizer, AutoModelForQuestionAnswering
import torch

# Load the validation split and pick one example.
# Increase the range to explore more samples if you have enough resources.
dataset = load_dataset("squad")
squad_sample = dataset["validation"].select(range(1000))
example = squad_sample[100]
context = example["context"]
question = example["question"]
true_answer = example["answers"]["text"][0]


print(f"Question: {question}")
print(f"Ground Truth Answer: {true_answer}\n")

`deepset/bert-base-cased-squad2` is a BERT base (cased) model fine-tuned on SQuAD v2 for extractive QA. You can also try other models.

For more details about the model, see https://huggingface.co/deepset/bert-base-cased-squad2

In [None]:
# Load tokenizer and BERT-style QA model
model_name = "deepset/bert-base-cased-squad2"  
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForQuestionAnswering.from_pretrained(model_name)
inputs = tokenizer(question, context, return_tensors="pt")


----
**`TODO:`**

1. Feed the inputs into the model to obtain logits that represent how likely each token is the start or end of the answer span.
`torch.no_grad()` is used to disable gradient computation during inference.

2. Select the tokens with the highest start and end probabilities using `argmax`.
3. Extract the predicted answer span from the input IDs and convert it back into natural language using the tokenizer. You can use `tokenizer.convert_tokens_to_string(tokenizer.convert_ids_to_tokens(...))`
4. Display the model’s predicted answer next to the true answer.
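The span-selection logic in steps 2 and 3 can be illustrated without loading a model. The sketch below uses hand-made tokens and logits (all values are invented for illustration) and plain Python lists; in the real exercise the logits are tensors returned by the model (`outputs.start_logits` and `outputs.end_logits`), and the tokens come from the tokenizer.

```python
def argmax(values):
    """Index of the largest value in a list."""
    return max(range(len(values)), key=lambda i: values[i])


# Toy tokenized input: question tokens, separator, then context tokens.
toy_tokens = ["[CLS]", "when", "?", "[SEP]", "curie", "won", "in", "1903", ".", "[SEP]"]

# Invented scores: one start score and one end score per token.
toy_start_logits = [0.1, -1.0, -2.0, -1.5, 0.3, -0.5, 0.2, 4.1, -0.7, -1.2]
toy_end_logits = [0.0, -1.2, -1.8, -1.1, -0.2, -0.4, 0.1, 3.8, 0.5, -0.9]

start_idx = argmax(toy_start_logits)
end_idx = argmax(toy_end_logits)

# Picking start and end independently can yield end < start;
# this sketch simply returns an empty answer in that case.
if end_idx >= start_idx:
    predicted_answer = " ".join(toy_tokens[start_idx:end_idx + 1])
else:
    predicted_answer = ""

print(predicted_answer)
```

Note that the end index is inclusive, which is why the slice goes to `end_idx + 1`.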

----
**`TODO:`**

Consider the following example:

- **Context:** Marie Curie was awarded the Nobel Prize in Physics in 1903.

- **Question:** Did Marie Curie win a Nobel Prize?

- **Answer:** Yes



**Discuss**: Can an extractive QA model answer this question correctly? Why or why not?

----

[Your Answer]

### Generative QA Model

The class **`AutoModelForCausalLM`** (Causal Language Modeling) is part of Hugging Face’s `transformers` library and is designed for **autoregressive text generation** — where each new token is generated **based on all previously generated tokens**.  
It is typically used with **decoder-only architectures** such as **GPT-2**, **GPT-Neo**, or **LLaMA**.
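The autoregressive loop itself is simple, and a toy version may help before using a real model. The sketch below is a deliberate simplification: the hypothetical `TOY_NEXT` table conditions only on the last token (a bigram model), whereas a model such as GPT-2 conditions on the whole prefix. The probabilities are invented.

```python
# Hypothetical next-token probabilities, keyed by the previous token.
TOY_NEXT = {
    "<bos>": {"the": 0.9, "a": 0.1},
    "the": {"cat": 0.6, "dog": 0.4},
    "cat": {"sat": 0.7, "<eos>": 0.3},
    "sat": {"<eos>": 1.0},
}


def greedy_generate(start="<bos>", max_new_tokens=10):
    """Repeatedly append the most likely next token until <eos> or the length limit."""
    tokens = [start]
    for _ in range(max_new_tokens):
        distribution = TOY_NEXT.get(tokens[-1])
        if not distribution:  # no known continuation for this token
            break
        next_token = max(distribution, key=distribution.get)
        if next_token == "<eos>":  # end-of-sequence: stop generating
            break
        tokens.append(next_token)
    return tokens[1:]  # drop the start marker


print(greedy_generate())
```

Each step feeds the model's own previous output back in as input, which is also why an early mistake can propagate through the rest of the generated text.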


Generative Pre-trained Transformer 2 (GPT-2) is a large language model by OpenAI and the second in their foundational series of GPT models. GPT-2 was pre-trained on a dataset of 8 million web pages.

In [None]:
from transformers import AutoTokenizer, AutoModelForCausalLM

model_name = "gpt2"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)

example = squad_sample[100]
context = example["context"]
question = example["question"]
true_answer = example["answers"]["text"][0]

You can build a prompt like this:

In [None]:
prompt = (
    "Answer the question based only on the given context.\n\n"
    f"Context: {context}\n"
    f"Question: {question}\n"
    "Answer:"
)

inputs = tokenizer(prompt, return_tensors="pt")



----
**`TODO:`**

1. **Use the model’s `generate()` method**  
   - Call `model.generate()` to produce new tokens based on your input prompt.  
   - Experiment with parameters such as:  
     - `max_new_tokens` → controls the maximum length of the generated output.  
     - `do_sample`, `top_p`, and `temperature` → adjust randomness and creativity in generation.  
     - `eos_token_id` → defines where the model should stop generating (end-of-sequence token).  


2. **Convert the model’s token IDs back into readable text**  
   - Use `tokenizer.decode(outputs[0], skip_special_tokens=True)` to transform the generated token IDs into plain text.  
   - This step turns the model’s internal numerical predictions into human-readable language.

3. **Extract only the model’s answer**  
   - Remove the original prompt from the decoded text so that you keep just the generated response.  
   - This helps you focus on what the model actually “answered,” rather than the repeated input.

4. **Print and compare results**  


For more details on how to use `generate()` and `decode()`, refer back to **Exercise 08 – Generation**.
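To build intuition for the `temperature` and `top_p` parameters listed above, here is a minimal pure-Python sketch of what they do to a next-token distribution. It is an illustration under simplifying assumptions, not the `transformers` implementation: the helper names and the logits are made up.

```python
import math


def softmax_with_temperature(logits, temperature=1.0):
    """Turn logits into probabilities; lower temperature sharpens the distribution."""
    scaled = [x / temperature for x in logits]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def top_p_filter(probs, top_p=0.9):
    """Keep the smallest set of most likely tokens whose cumulative probability
    reaches top_p, then renormalize over that set."""
    order = sorted(range(len(probs)), key=lambda i: -probs[i])
    kept, cumulative = [], 0.0
    for i in order:
        kept.append(i)
        cumulative += probs[i]
        if cumulative >= top_p:
            break
    total = sum(probs[i] for i in kept)
    return {i: probs[i] / total for i in kept}


toy_logits = [2.0, 1.0, 0.0]
sharp = softmax_with_temperature(toy_logits, temperature=0.5)
flat = softmax_with_temperature(toy_logits, temperature=2.0)
print(sharp[0] > flat[0])  # low temperature concentrates mass on the top token

nucleus = top_p_filter([0.5, 0.3, 0.2], top_p=0.7)
print(sorted(nucleus))  # indices of the tokens that survive the filter
```

Sampling then draws from the filtered distribution, so a low `temperature` or a small `top_p` makes outputs more repeatable, while larger values make them more varied.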



----

**`TODO`**: **Discussion** Run the code multiple times. Does the model’s generated answer contain any hallucination, that is, information that was not stated or implied in the given context?

[Your Answer]

----

## RAG

RAG improves large language models (LLMs) by incorporating information retrieval before generating responses.

RAG helps reduce AI hallucinations by blending the LLM process with a web search or other document look-up process to help LLMs stick to the facts.

RAG also reduces the need to retrain LLMs with new data, saving on computational and financial costs.

Please refer to the original RAG paper (Lewis et al., 2020) for more details: https://arxiv.org/abs/2005.11401

**Retrieval-Augmented Generation (RAG)** combines two key components — an **information retriever** and a **text generator** — to produce more factual and grounded answers.  

Here’s the standard workflow:

1. **Indexing / Document Preparation**  
   - Build or load a *knowledge corpus* (e.g., Wikipedia articles, research papers, company documents).  
   - Preprocess and store it in a searchable format (using BM25, dense embeddings, or a vector database).

2. **Retrieval**  
   - When a user asks a question, the retriever finds the top-k most relevant documents from the corpus.  
   - Methods can be lexical (**BM25**) or semantic (**embedding-based models** like Sentence-BERT).

3. **Context Construction**  
   - Combine the retrieved passages into a single context block.  
   - Optionally truncate or rank passages based on relevance or confidence scores.

4. **Generation**  
   - Pass the question + retrieved context to a **generative model**.  
   - The model generates an answer *conditioned on both the query and the evidence*.
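The four steps above can be sketched end to end in a few lines. This is a toy illustration rather than the pipeline used later in this notebook: the corpus, the word-overlap scorer, and the `toy_generate` stub are all made up here. A real system would use BM25 or embeddings for retrieval and an LLM call for generation.

```python
import re

# 1. Indexing: a tiny hypothetical corpus, stored with its token sets.
toy_corpus = [
    "The Eiffel Tower is located in Paris.",
    "BM25 is a ranking function used by search engines.",
    "Marie Curie was awarded the Nobel Prize in Physics in 1903.",
]


def toy_tokens(text):
    """Lowercase the text and return its set of alphanumeric words."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))


toy_index = [(doc, toy_tokens(doc)) for doc in toy_corpus]


# 2. Retrieval: rank documents by how many words they share with the question.
def toy_retrieve(question, k=1):
    q = toy_tokens(question)
    ranked = sorted(toy_index, key=lambda item: len(q & item[1]), reverse=True)
    return [doc for doc, _ in ranked[:k]]


# 3. Context construction: join the retrieved passages into one block.
def toy_build_context(passages):
    return "\n".join(f"- {p}" for p in passages)


# 4. Generation: a stub standing in for the LLM call.
def toy_generate(question, context):
    return f"(stub) Answer to {question!r} based on:\n{context}"


toy_question = "When did Marie Curie win the Nobel Prize?"
toy_passages = toy_retrieve(toy_question, k=1)
print(toy_generate(toy_question, toy_build_context(toy_passages)))
```

Keeping the steps as separate functions makes it easy to swap one component, for example replacing the overlap scorer with BM25, without touching the others.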

We will continue to use Groq for this task. If you’re not familiar with it, please refer to **Exercise 07 – Post-training**.

Let us try without RAG first!

In [None]:
from groq import Groq
import os

GROQ_API_KEY = os.getenv("GROQ_API_KEY")

client = Groq(
    api_key=GROQ_API_KEY,
)

model_name = "llama-3.3-70b-versatile"

chat_completion = client.chat.completions.create(
    messages=[
        {
            "role": "user",
            "content": "The most recent score between Barcelona and Real Madrid.",
        }
    ],
    model=model_name
)

print(chat_completion.choices[0].message.content)

The answer is likely to be outdated or incorrect, because the large language model **does not have access to the most up-to-date knowledge**: it only knows what was in its training data.

### 1. Build the Index

Building an index usually involves preprocessing a corpus (cleaning, tokenizing, or embedding the documents) and then storing them in a searchable structure such as a BM25 index or a vector database for fast retrieval during queries. This is a very small demo dataset — the data comes from daily news articles collected from Google.


In [None]:
docs = [
    "Real Madrid 2-1 Barcelona (Oct 26, 2025) Game Analysis",
    "Bill Gates calls for climate fight to shift focus",
    "Fawlty Towers episode to air on BBC One in tribute to the late Prunella Scales",
    "Climate Change Falls Over 20% Behind Top Global Concern in 2025",
    "53.5 of EU services exports by large enterprises"
]

### 2. Retrieval

Here, we retrieve with `BM25` (BM stands for *best matching*), a ranking function used by search engines to estimate the relevance of documents to a given search query. BM25 is a bag-of-words retrieval function: it ranks a set of documents based on the query terms appearing in each document, regardless of their proximity within the document.

For more information, refer to https://en.wikipedia.org/wiki/Okapi_BM25
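To see what the ranking function computes, here is a from-scratch sketch of Okapi BM25 scoring. It is meant for intuition only. It uses the IDF variant with `+ 1` inside the logarithm (which keeps IDF non-negative), so its numbers may differ from those produced by the `rank_bm25` package used below. The helper names, the demo documents, and the parameter values `k1=1.5` and `b=0.75` are choices made for this example.

```python
import math
import re
from collections import Counter


def bm25_tokens(text):
    """Lowercase the text and split it into alphabetic words."""
    return re.findall(r"[a-z]+", text.lower())


def bm25_scores(query, documents, k1=1.5, b=0.75):
    """Score every document against the query with Okapi BM25."""
    tokenized = [bm25_tokens(d) for d in documents]
    n_docs = len(tokenized)
    avgdl = sum(len(t) for t in tokenized) / n_docs  # average document length

    # Document frequency: in how many documents does each term occur?
    df = Counter()
    for toks in tokenized:
        df.update(set(toks))

    scores = []
    for toks in tokenized:
        tf = Counter(toks)  # term frequency within this document
        score = 0.0
        for term in bm25_tokens(query):
            if term not in tf:
                continue
            idf = math.log((n_docs - df[term] + 0.5) / (df[term] + 0.5) + 1.0)
            # Length normalization: long documents are penalized via b.
            norm = tf[term] + k1 * (1 - b + b * len(toks) / avgdl)
            score += idf * tf[term] * (k1 + 1) / norm
        scores.append(score)
    return scores


demo_docs = ["real madrid beat barcelona", "climate change report", "bbc one tribute"]
demo_scores = bm25_scores("madrid barcelona", demo_docs)
print(demo_scores)
```

Only the first demo document shares terms with the query, so it is the only one with a non-zero score. Word order plays no role, which is the bag-of-words property described above.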




Install the `rank_bm25` package first.

In [None]:
%pip install rank_bm25

Here is a retrieval demo:

In [None]:
from rank_bm25 import BM25Okapi
import re


def tokenize(text):
    return re.findall(r"[a-zA-Z]+", text.lower())


tokenized_corpus = [tokenize(d) for d in docs]

# build the index
bm25 = BM25Okapi(tokenized_corpus)

# retrieval function: return top-k documents and their BM25 scores
def retrieve_bm25(query, k=1):
    q_tokens = tokenize(query)
    scores = bm25.get_scores(q_tokens)             
    topk_idx = sorted(range(len(scores)), key=lambda i: -scores[i])[:k]
    return [(docs[i], float(scores[i])) for i in topk_idx]


query = "What is the score of the latest football match between Real Madrid and Barcelona?"
results = retrieve_bm25(query, k=1)
for doc, score in results:
    print(f"Document: {doc}\nBM25 Score: {score}\n")

### 3. Context Construction

----

**`TODO:`**

You should write a prompt that includes not only the question but also the context information. 

This helps the model ground its answer in the provided evidence instead of relying on memorized knowledge.

A well-structured prompt makes it clear what information the model should use and what task it should perform.

In RAG, including both the context and the question helps the model produce factually accurate, context-aware answers.

If you’re not familiar with writing prompts, please refer to **Exercise 07 – Post-training**.
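As a starting point, here is one possible shape for such a prompt, written as a small helper. This is only a sketch: the function name `build_rag_prompt`, the wording of the instruction, and the demo inputs are all hypothetical, and you are encouraged to design your own template.

```python
def build_rag_prompt(question, passages):
    """Combine numbered context passages and a question into a single prompt."""
    context_block = "\n".join(
        f"[{i}] {passage}" for i, passage in enumerate(passages, start=1)
    )
    return (
        "Answer the question using only the context below. "
        "If the context does not contain the answer, say you do not know.\n\n"
        f"Context:\n{context_block}\n\n"
        f"Question: {question}\n"
        "Answer:"
    )


# Made-up inputs, only to show the resulting layout.
demo_prompt = build_rag_prompt(
    "Who won the demo match?",
    ["Team A 3-0 Team B (demo data)"],
)
print(demo_prompt)
```

Numbering the passages and telling the model what to do when the context is insufficient are two common ways to make the grounding explicit.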

----

In [None]:
prompt = ""  # TODO: combine the retrieved context and the question here



### 4. Generation
   
Pass the question + retrieved context to a **generative model**.  

----
**`TODO:`** Construct a new request using our custom prompt.

**Goal**: Combine the retrieved context from BM25 and the query into a single prompt,
then send it to the model for generation.

You can follow the code at the beginning of the RAG section: `client.chat.completions.create(...)`

----