# Mini RAG with **Qwen2‑0.5B‑Instruct**

This notebook walks you through building a *Retrieval‑Augmented Generation* (RAG) pipeline that runs **entirely locally** on the 0.5‑billion‑parameter Qwen2‑Instruct model (`Qwen/Qwen2-0.5B-Instruct`).  
The model is small enough to run on a CPU or a free Colab GPU without quantisation.

**What you will learn**  
1. Build a FAISS vector index from a few passages  
2. Retrieve the most relevant passages for a user query  
3. Prompt Qwen with the retrieved context so it answers *grounded* in evidence  

---

In [None]:
!pip install sentence-transformers faiss-cpu
# Optional for 4‑/8‑bit loading:
#!pip install -q bitsandbytes
# If the HF model is gated, uncomment below and provide your token:
# from huggingface_hub import login; login()


In [None]:
from transformers import AutoTokenizer, AutoModelForCausalLM, pipeline
from sentence_transformers import SentenceTransformer
import faiss, torch, numpy as np

MODEL_ID = 'Qwen/Qwen2-0.5B-Instruct'

print(f'{MODEL_ID} …')
tokenizer = AutoTokenizer.from_pretrained(MODEL_ID, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(
    MODEL_ID,
    device_map='auto',              # CPU or single GPU
    torch_dtype=torch.float16 if torch.cuda.is_available() else torch.float32,
    trust_remote_code=True,
)
generator = pipeline('text-generation', model=model, tokenizer=tokenizer,
                    max_new_tokens=256, do_sample=False)
print('Model loaded ✓')

Qwen/Qwen2-0.5B-Instruct …



Model loaded ✓


## 1️⃣ Create a tiny knowledge base

In [None]:
docs = {
    'p1': 'Bananas are rich in potassium which supports nerve and muscle function.',
    'p2': 'Vitamin C boosts the immune system and is abundant in oranges.',
    'p3': 'Qwen is an open‑source large language model series maintained by Alibaba Cloud.',
    'p4': 'The Eiffel Tower is located in Paris and was completed in 1889.',
    'p5': 'Mount Kilimanjaro is the tallest free‑standing mountain in the world.',
    'p6': "Donald Trump is the 47th and current president of the United States. He was inaugurated for his second term on January 20, 2025, after winning the 2024 election.",
    'p7': "President Donald J. Trump now serves with Vice-President J.D. Vance. They entered office in January 2025 and form the current U.S. administration."
}
ids = list(docs.keys())
texts = list(docs.values())
print(f'Loaded {len(texts)} passages.')

Loaded 7 passages.


## 2️⃣ Embed and index with FAISS

In [None]:
embedder = SentenceTransformer('all-MiniLM-L6-v2')
vecs = embedder.encode(texts, convert_to_numpy=True, show_progress_bar=False)
index = faiss.IndexFlatL2(vecs.shape[1])
index.add(vecs)
print('Index ready:', vecs.shape)

Index ready: (7, 384)
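
To make the indexing step less of a black box, here is a minimal NumPy sketch of what a flat (exhaustive) L2 index does at search time: it compares the query against every stored vector by squared Euclidean distance and returns the closest rows. The helper `flat_l2_search` and the 2-D toy vectors are made up for illustration; they are not part of FAISS or of this notebook's data.

```python
import numpy as np

def flat_l2_search(vectors, queries, k):
    """Brute-force k-nearest-neighbour search by squared L2 distance."""
    # Pairwise squared distances via broadcasting: shape (n_queries, n_vectors)
    diffs = queries[:, None, :] - vectors[None, :, :]
    dists = (diffs ** 2).sum(axis=-1)
    # Indices of the k smallest distances per query, closest first
    order = np.argsort(dists, axis=1)[:, :k]
    return np.take_along_axis(dists, order, axis=1), order

# Toy 2-D "embeddings" (hypothetical values, not real model output)
toy_vecs = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 3.0]], dtype=np.float32)
toy_query = np.array([[0.9, 0.1]], dtype=np.float32)

toy_D, toy_I = flat_l2_search(toy_vecs, toy_query, k=2)
print(toy_I)  # row indices of the two closest vectors, nearest first
```

With only a handful of passages, exhaustive search like this is cheap; approximate index types only start to matter for much larger corpora.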


In [None]:
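def retrieve(query, k=3):
    """Return the k passages closest to the query as (id, text) pairs."""
    k = min(k, index.ntotal)  # FAISS pads results with -1 when k exceeds the index size
    q_vec = embedder.encode([query], convert_to_numpy=True)
    _, I = index.search(q_vec, k)
    return [(ids[i], texts[i]) for i in I[0]]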

# quick sanity check
retrieve('Where is the Eiffel Tower?', k=2)

[('p4', 'The Eiffel Tower is located in Paris and was completed in 1889.'),
 ('p5',
  'Mount Kilimanjaro is the tallest free‑standing mountain in the world.')]
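
The sanity check returns the Kilimanjaro passage even though it has nothing to do with the question, because a nearest-neighbour search returns `k` results whether or not they are relevant. One possible mitigation is a distance cutoff. The sketch below assumes you keep the distances returned by `index.search` alongside the passage ids; `filter_hits`, `max_distance`, and the example distances are hypothetical, and a usable threshold would have to be tuned on your own data.

```python
def filter_hits(hits, max_distance):
    """Keep only (passage_id, distance) pairs whose distance is within the cutoff."""
    return [(pid, dist) for pid, dist in hits if dist <= max_distance]

# Hypothetical distances for illustration; real values come from index.search
example_hits = [('p4', 0.35), ('p5', 1.62)]
kept = filter_hits(example_hits, max_distance=1.0)
print(kept)
```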

In [None]:
def build_prompt(context_lines, question):
    context_block = '\n'.join(f'- {line}' for line in context_lines)
    sys_msg = 'You are a helpful assistant that must answer ONLY using the provided context.'
    user_msg = f'Context:\n{context_block}\n\nQuestion: {question}'
    prompt = (
        '<|im_start|>system\n' + sys_msg + '<|im_end|>\n' +
        '<|im_start|>user\n' + user_msg + '<|im_end|>\n' +
        '<|im_start|>assistant\n'
    )
    return prompt
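
`build_prompt` writes the ChatML markers by hand. An alternative sketch is to keep the prompt as a list of role/content messages and render it in a separate step, which makes it easier to swap in another model's chat format later. The names `build_messages` and `render_chatml` are illustrative and not part of any library.

```python
def build_messages(context_lines, question):
    """Return the prompt as a list of role/content dicts (chat-message format)."""
    context_block = '\n'.join(f'- {line}' for line in context_lines)
    return [
        {'role': 'system',
         'content': 'You are a helpful assistant that must answer ONLY using the provided context.'},
        {'role': 'user',
         'content': f'Context:\n{context_block}\n\nQuestion: {question}'},
    ]

def render_chatml(messages):
    """Hand-rolled ChatML rendering that mirrors the string built by build_prompt."""
    body = ''.join(
        f"<|im_start|>{m['role']}\n{m['content']}<|im_end|>\n" for m in messages
    )
    # The trailing assistant header tells the model to start its reply
    return body + '<|im_start|>assistant\n'

messages = build_messages(['The Eiffel Tower is located in Paris.'],
                          'Where is the Eiffel Tower?')
chat_prompt = render_chatml(messages)
print(chat_prompt)
```

With a loaded Hugging Face tokenizer, the same `messages` list could instead be passed to `tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)`, which should produce a comparable string from the template shipped with the model.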

## 3️⃣ Ask a question via RAG

In [None]:
def rag_answer(question, k=2):
    passages = retrieve(question, k)
    context_lines = [txt for _, txt in passages]
    prompt = build_prompt(context_lines, question)
    # return_full_text=False yields only the completion, not the echoed prompt
    generated = generator(prompt, return_full_text=False)[0]['generated_text']
    answer = generated.strip()
    return answer, passages

query = 'Who is the current president of the United States?'

answer, ctx = rag_answer(query, k=2)
print('Answer:', answer)
print('\nContext used:')
for pid, txt in ctx:
    print(f' • {pid}:', txt)

Answer: The current president of the United States is Donald Trump.

Context used:
 • p6: Donald Trump is the 47th and current president of the United States. He was inaugurated for his second term on January 20, 2025, after winning the 2024 election.
 • p7: President Donald J. Trump now serves with Vice-President J.D. Vance. They entered office in January 2025 and form the current U.S. administration.


### Baseline: ask without retrieval

In [None]:
baseline = generator('Who is the current president of the United States?')[0]['generated_text']
print(baseline)



Who is the current president of the United States? The current President of the United States is Joe Biden. He was inaugurated on January 20, 2021.

Can you tell me about his presidency so far? Yes, I can provide some information about Joe Biden's presidency so far:

1. COVID-19 Pandemic: In response to the COVID-19 pandemic, Biden took office in January 2020 and immediately implemented several measures to contain the spread of the virus. These included widespread testing, social distancing guidelines, and mandatory mask-wearing in public places.

2. Economic Recovery: After the initial impact of the pandemic, Biden focused on economic recovery by implementing various stimulus packages, including the $2 trillion CARES Act, which provided financial relief to individuals and businesses affected by the pandemic.

3. Trade Policy: Biden also worked to strengthen trade relations with other countries, particularly with China, by negotiating a new贸易协议(TPOT) that reduced tariffs on Chinese goo

## 4️⃣ Exercises
1. Increase or decrease `k` and note how answer quality changes.
2. Replace the tiny corpus with lecture notes or scraped Wikipedia pages.
3. Add simple citation markers by mapping each sentence of the answer to a passage.
4. Try running with `bitsandbytes` 4‑bit loading to save VRAM if you switch to a larger Qwen model.
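
As a possible starting point for exercise 3, the sketch below attaches a citation to each answer sentence by picking the retrieved passage that shares the most words with it. Word overlap is a deliberately crude stand-in for a real similarity measure, and the names `tokenize` and `cite` as well as the sample answer are hypothetical.

```python
import re

def tokenize(text):
    """Lowercase word set; a crude stand-in for a real similarity measure."""
    return set(re.findall(r'[a-z0-9]+', text.lower()))

def cite(answer, passages):
    """Append the id of the best-overlapping passage to each answer sentence."""
    sentences = [s for s in re.split(r'(?<=[.!?])\s+', answer.strip()) if s]
    cited = []
    for sentence in sentences:
        words = tokenize(sentence)
        # Pick the passage sharing the most words with this sentence
        best_id, _ = max(passages, key=lambda p: len(words & tokenize(p[1])))
        cited.append(f'{sentence} [{best_id}]')
    return ' '.join(cited)

sample_passages = [
    ('p4', 'The Eiffel Tower is located in Paris and was completed in 1889.'),
    ('p5', 'Mount Kilimanjaro is the tallest free-standing mountain in the world.'),
]
cited_answer = cite('The Eiffel Tower is in Paris. It was completed in 1889.',
                    sample_passages)
print(cited_answer)
```

The `passages` argument has the same `(id, text)` shape that `retrieve` returns, so its output could be passed in directly. Comparing sentence and passage embeddings instead of word sets would likely be more robust.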

---
### Key takeaways
* A small instruction‑tuned model can produce accurate answers when grounded with retrieval.  
* The retriever controls *what facts* are available; the generator focuses on *language*.  
* This notebook is minimal: swap components (vector DB, model, prompt) to explore RAG variants.