### Dataset: Demo Corpus

We will use a tiny mixed-domain corpus (AI, Climate, Biomedical, Materials) stored in `data/demo_corpus.jsonl`.


In [1]:
from pathlib import Path
import pandas as pd

# The corpus is JSON Lines (one JSON object per line), hence lines=True
DATA_PATH = Path('data/demo_corpus.jsonl')
df = pd.read_json(DATA_PATH, lines=True)
docs = df.to_dict('records')  # list of dicts, one per document
print(f'Loaded {len(docs)} docs from {DATA_PATH}')
display(df[['id', 'title', 'year', 'topics']].head())


Loaded 18 docs from data/demo_corpus.jsonl


| | id | title | year | topics |
|---|---|---|---|---|
| 0 | 2508.05366 | Can Language Models Critique Themselves? Inves... | 2025 | [NLP, Retrieval, Language Model, Biomedical] |
| 1 | 2508.07326 | Nonparametric Reaction Coordinate Optimization... | 2025 | [ML, Climate] |
| 2 | 2508.07654 | MLego: Interactive and Scalable Topic Explorat... | 2025 | [Databases, IR] |
| 3 | 2508.07798 | Generative Inversion for Property-Targeted Mat... | 2025 | [Materials, ML] |
| 4 | 2508.0814 | Data-Efficient Biomedical In-Context Learning:... | 2025 | [NLP, Retrieval, Language Model, Biomedical] |
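
One detail worth checking in the preview above: the last row's `id` prints as `2508.0814`, one digit shorter than the others. That is what an identifier such as `2508.08140` would look like after being parsed as a float. Whether this happened here is an assumption, since the raw file is not shown. The sketch below uses a made-up JSONL line to illustrate the effect with the standard library only.

```python
import json

# Hypothetical JSONL line; the real corpus file is not shown in this notebook.
line = '{"id": "2508.08140", "title": "Example paper"}'
record = json.loads(line)

as_string = record["id"]        # identifier kept as text
as_float = float(record["id"])  # what numeric type inference would produce

print(as_string)      # 2508.08140
print(str(as_float))  # 2508.0814 (trailing zero is lost)
```

If your corpus is affected, passing `dtype={'id': str}` to `pd.read_json` is one option to try so that the column stays as text.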


# Module 3: Building a Simple RAG Pipeline

*Part of the RCD Workshops series: Retrieval-Augmented Generation (RAG) for Advanced Research Applications*

---

In this module, we'll connect retrieval and generation to build a working RAG pipeline end-to-end.
We'll use our small example corpus (from Module 2), a retrieval component, and a 7B-parameter LLM to show how RAG works in practice.


![RAG pipeline](rag_pipeline_graphviz.png)


## 3.1 Setting up the LLM
For RAG, we need a language model that can read our prompt and generate an answer using the retrieved context. We'll use Qwen-7B, an openly available model hosted on Hugging Face, for this pipeline.

> **Note:** You need a GPU (ideally A100 or similar) to load a 7B model at usable speed.

We'll use the `transformers` library. Loading may take a while, since the model weights alone are about 14 GB at 16-bit precision.
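
The 14 GB estimate above is simple arithmetic: number of parameters times bytes per parameter. The sketch below reproduces it and shows how the figure would change at other precisions. It assumes a nominal 7 billion parameters and counts weights only; activations and the KV cache need additional memory. The helper name `weight_memory_gb` is made up for this illustration.

```python
def weight_memory_gb(n_params: float, bits_per_param: int) -> float:
    """Approximate size of the model weights in decimal gigabytes."""
    return n_params * bits_per_param / 8 / 1e9

N_PARAMS = 7e9  # nominal "7B"; the exact parameter count differs slightly

for bits in (32, 16, 8, 4):
    print(f"{bits:>2}-bit: ~{weight_memory_gb(N_PARAMS, bits):.1f} GB")
```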


In [1]:
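# Install dependencies (uncomment if needed); PyTorch is required to load the model.
# The Qwen-7B model card lists further requirements (e.g. tiktoken); check it if loading fails.
# !pip install torch transformers accelerate
from transformers import AutoTokenizer, AutoModelForCausalLM

model_name = 'Qwen/Qwen-7B'   # Or Qwen/Qwen-7B-Chat for the instruction-tuned variant

# Qwen-7B ships its own tokenizer and model classes on the Hub, so
# trust_remote_code=True is needed. Without it, loading fails with
# "Tokenizer class QWenTokenizer does not exist or is not currently imported."
tokenizer = AutoTokenizer.from_pretrained(model_name, use_fast=False, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(
    model_name,
    device_map='auto',       # place weights on the available GPU(s); needs accelerate
    torch_dtype='auto',      # use the precision stored in the checkpoint
    trust_remote_code=True,
)
model.eval()  # inference only
# Model and tokenizer are now ready for inference.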





In [None]:
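import numpy as np
# Assume: encoder, docs, doc embeddings, index from Module 2
# Example question for climate economics
query = "According to recent studies, how much could global GDP decline at 3°C of warming, and which regions are hit hardest?"
# Shape (1, dim); float32 is what a FAISS index expects (assuming index is FAISS)
query_vec = np.asarray(encoder.encode([query]), dtype='float32')
query_vec /= np.linalg.norm(query_vec, axis=1, keepdims=True)  # normalize for cosine

top_k = 2
D, I = index.search(query_vec, k=top_k)  # D: scores (or distances), I: doc indices
retrieved_indices = I[0]
print("Retrieved doc indices:", retrieved_indices)

def doc_text(doc):
    # docs may hold plain strings or dict records; the 'text' key is an
    # assumption about the corpus schema, with 'title' as a fallback
    return doc if isinstance(doc, str) else (doc.get('text') or doc['title'])

retrieved_texts = [doc_text(docs[i]) for i in retrieved_indices]
print("Top-1 Retrieved text snippet:", retrieved_texts[0][:60], "...")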
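
The "normalize for cosine" step in the cell above relies on a simple identity: once two vectors are scaled to unit length, their inner product equals their cosine similarity. The sketch below checks this on toy vectors using only the standard library. The vectors and helper names are made up for illustration; the workshop's actual encoder and index are not used here.

```python
import math

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def normalize(vec):
    """Scale a vector to unit length (L2 norm of 1)."""
    norm = math.sqrt(dot(vec, vec))
    return [x / norm for x in vec]

def cosine(a, b):
    """Cosine similarity computed directly from the definition."""
    return dot(a, b) / (math.sqrt(dot(a, a)) * math.sqrt(dot(b, b)))

def top_k(query, doc_vecs, k):
    """Rank documents by inner product of unit vectors, highest first."""
    q = normalize(query)
    scores = [dot(q, normalize(d)) for d in doc_vecs]
    order = sorted(range(len(doc_vecs)), key=lambda i: scores[i], reverse=True)
    return [(i, scores[i]) for i in order[:k]]

# Toy 3-dimensional "embeddings" (illustrative values only)
doc_vecs = [[1.0, 0.0, 0.0], [0.6, 0.8, 0.0], [0.0, 0.0, 2.0]]
query = [3.0, 1.0, 0.0]

results = top_k(query, doc_vecs, k=2)
for i, score in results:
    # The two printed numbers agree: inner product of unit vectors == cosine
    print(i, round(score, 4), round(cosine(query, doc_vecs[i]), 4))
```

A vector index that scores by inner product returns the same ranking, which is why both the stored document embeddings and the query need to be normalized.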


### Building the Prompt
To maximize answer quality, prompt your LLM with clear instructions and insert the most relevant docs just before the user's question.
A simple format is to list docs like [Document 1], [Document 2], then give the question.


In [None]:
prompt_intro = "You are a research assistant. Use the following documents to answer the question.\n"
docs_section = ""
for idx, text in enumerate(retrieved_texts, start=1):
    docs_section += f"[Document {idx}]\n{text}\n\n"
question_section = f"Question: {query}\nAnswer:"

prompt = prompt_intro + docs_section + question_section
print("Prompt sent to LLM:\n")
print(prompt)


### LLM: Answering with Retrieved Information
Now, send the composed prompt to your language model.
> **Note:** Generation can be slow without a GPU, but running it shows the full RAG loop.
> If you are working on CPU or want to skip this step, use a smaller LLM (ask your facilitator for alternatives).


In [None]:
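# Tokenize the prompt and move the tensors to the model's device
inputs = tokenizer(prompt, return_tensors='pt').to(model.device)
# max_new_tokens limits only the generated text (max_length would also count the prompt).
# Greedy decoding (do_sample=False) ignores temperature, so it is not set.
outputs = model.generate(**inputs, max_new_tokens=256, do_sample=False)
# Decode only the tokens generated after the prompt
new_tokens = outputs[0][inputs['input_ids'].shape[1]:]
answer = tokenizer.decode(new_tokens, skip_special_tokens=True)
print("\nGenerated Answer:\n", answer)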


### Try it yourself!
Modify the `query` above (in the RAG pipeline code cell) to something the documents can answer, or to something *none* of the docs cover.
What happens? How does the retrieval affect the model's output?

> *Reflection: What are the main components of a simple RAG pipeline? (List at least two)*


In [None]:
from utils import create_answer_box
create_answer_box('📝 **Your Answer:** The RAG pipeline consists of ...', question_id='mod3_pipeline_components')


---

**Note on Prompt Lengths & Context:**
Models like Qwen-7B support long context windows (up to 8K tokens or more), but the whole prompt (instructions, retrieved docs, and question) has to fit, so you often need to truncate or select among your retrieved docs.
With too much text, the model may overlook key information; with too little, you may leave out relevant context.

That's why retrieval *quality* is just as important as the LLM itself!
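
One simple way to keep retrieved docs within a budget is to walk them in rank order and cut off once the budget is spent. The sketch below does this with a whitespace word count, which is only a rough stand-in for real token counts; the function name `fit_to_budget` and the example texts are made up for illustration.

```python
def fit_to_budget(texts, max_words):
    """Keep documents in rank order until max_words is used up.

    The document that crosses the limit is truncated; later ones are dropped.
    """
    kept, remaining = [], max_words
    for text in texts:
        if remaining <= 0:
            break
        words = text.split()
        kept.append(" ".join(words[:remaining]))
        remaining -= min(len(words), remaining)
    return kept

# Documents ordered from most to least relevant (illustrative strings)
ranked = ["alpha beta gamma delta", "one two three", "never reached"]
print(fit_to_budget(ranked, max_words=6))
# ['alpha beta gamma delta', 'one two']
```

For a budget in actual tokens, the words could be replaced with counts from the model's tokenizer, for example `len(tokenizer(text).input_ids)`.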



Congratulations! You now have a basic, working RAG pipeline.
In the next module, we'll explore how to improve retrieval quality and tackle more advanced scenarios.
