# Module 3: Building a Simple RAG Pipeline

*Part of the RCD Workshops series: Retrieval-Augmented Generation (RAG) for Advanced Research Applications*

---

In this module, we'll connect retrieval and generation to build a working RAG pipeline end-to-end.
We'll use our small example corpus (from Module 2), a retrieval component, and an 8B LLM, to show how RAG works in practice.


![RAG pipeline](rag_pipeline_graphviz.png)


### Dataset: Demo Corpus

We will use a tiny mixed-domain corpus (AI, Climate, Biomedical, Materials) stored in `data/demo_corpus.jsonl`.


In [11]:
from pathlib import Path
import pandas as pd

DATA_PATH = 'data/demo_corpus.jsonl'
df = pd.read_json(DATA_PATH, lines=True)
docs = df.to_dict('records')
print(f'Loaded {len(docs)} docs from {DATA_PATH}')
display(df[['id','title','year','topics']].head())


Loaded 18 docs from data/demo_corpus.jsonl


Unnamed: 0,id,title,year,topics
0,2508.05366,Can Language Models Critique Themselves? Inves...,2025,"[NLP, Retrieval, Language Model, Biomedical]"
1,2508.07326,Nonparametric Reaction Coordinate Optimization...,2025,"[ML, Climate]"
2,2508.07654,MLego: Interactive and Scalable Topic Explorat...,2025,"[Databases, IR]"
3,2508.07798,Generative Inversion for Property-Targeted Mat...,2025,"[Materials, ML]"
4,2508.0814,Data-Efficient Biomedical In-Context Learning:...,2025,"[NLP, Retrieval, Language Model, Biomedical]"


## 3.1 Setting up the LLM
For RAG, we need a language model that can read our prompt and generate an answer using retrieved context. We'll use Qwen-7B (open-source, Hugging Face) for this pipeline.

> **Note:** You need a GPU (ideally A100 or similar) to load a 7B model at usable speed.

We'll use the `transformers` library. Loading may take a while (model is ~14GB in 16-bit mode).


In [12]:
# Install dependencies (uncomment if needed)
# !pip install transformers accelerate
from transformers import AutoTokenizer, AutoModelForCausalLM
model_name = 'Qwen/Qwen3-0.6B'  # Or Qwen/Qwen3-8B
tokenizer = AutoTokenizer.from_pretrained(model_name, use_fast=False, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(
    model_name,
    device_map='auto',
    torch_dtype='auto'
)
MODEL_READY = True
print(f'Loaded LLM: {model_name}')


Loaded LLM: Qwen/Qwen3-0.6B


![RAG pipeline](rag_pipeline_graphviz.png)


In [13]:
from sentence_transformers import SentenceTransformer
import numpy as np
import faiss

# Build chunked passage index from abstracts
def chunk_text(text, max_chars=400):
    text = (text or '').strip()
    if not text:
        return []
    return [text[i:i+max_chars].strip() for i in range(0, len(text), max_chars)]

encoder = SentenceTransformer('sentence-transformers/all-MiniLM-L6-v2')

chunk_texts = []
chunk_meta = []
for d in docs:
    abs_text = d.get('abstract', '')
    pieces = chunk_text(abs_text, max_chars=400)
    for j, t in enumerate(pieces):
        if not t:
            continue
        chunk_texts.append(t)
        chunk_meta.append({'doc_id': d.get('id'), 'title': d.get('title'), 'chunk_id': j})

embs = encoder.encode(chunk_texts)
embs = np.array([v/np.linalg.norm(v) for v in embs], dtype='float32')
index = faiss.IndexFlatIP(embs.shape[1])
index.add(embs)

In [14]:
# Example question for climate economics
query = "According to recent studies, how exactly does replanting trees to replenish forests help to fight against climate change?"
q = encoder.encode([query])[0]
q = (q/np.linalg.norm(q)).astype('float32')
D, I = index.search(np.array([q]), k=2)
retrieved_indices = I[0]
print('Retrieved chunk indices:', retrieved_indices)
retrieved_texts = [chunk_texts[i] for i in retrieved_indices]
retrieved_meta = [chunk_meta[i] for i in retrieved_indices]
print('Top-1 Retrieved text snippet:', retrieved_texts[0][:160].replace('\n',' '), '...')

Retrieved chunk indices: [39 40]
Top-1 Retrieved text snippet: Afforestation and reforestation are popular strategies for mitigating climate change by enhancing carbon sequestration. However, the effectiveness of these effo ...


### Building the Prompt
To maximize answer quality, prompt your LLM with clear instructions and insert the most relevant docs just before the user's question.
A simple format is to list docs like [Document 1], [Document 2], then give the question.


In [20]:
# Build messages for chat template aware models (e.g., Qwen3)
system_msg = (
    "You are a research assistant. Ground your answer in the provided documents. "
    "Cite document numbers inline when useful. If unsure, say you don't know."
)

docs_lines = []
for i, text in enumerate(retrieved_texts, start=1):
    docs_lines.append(f'[Document {i}]\n{text}\n')
context_block = "".join(docs_lines)
user_msg = f"Context:\n{context_block}\nQuestion: {query}"

messages = [
    { 'role': 'system', 'content': system_msg },
    { 'role': 'user',   'content': user_msg },
]

# Render with chat template
rendered_prompt = None

rendered_prompt = tokenizer.apply_chat_template(
    messages, tokenize=False, add_generation_prompt=True)

print('Prompt (rendered):\n')
print(rendered_prompt)


Prompt (rendered):

<|im_start|>system
You are a research assistant. Ground your answer in the provided documents. Cite document numbers inline when useful. If unsure, say you don't know.<|im_end|>
<|im_start|>user
Context:
[Document 1]
Afforestation and reforestation are popular strategies for mitigating climate change by enhancing carbon sequestration. However, the effectiveness of these efforts is often self-reported by project developers, or certified through processes with limited external validation. This leads to concerns about data reliability and project integrity. In response to increasing scrutiny of voluntary carbon m
[Document 2]
arkets, this study presents a dataset on global afforestation and reforestation efforts compiled from primary (meta-)information and augmented with time-series satellite imagery and other secondary data. Our dataset covers 1,289,068 planting sites from 45,628 projects spanning 33 years. Since any remote sensing-based validation effort relies on th

### LLM: Answering with Retrieved Information
Now, send the composed prompt to your language model.
> This step may be slow unless you're on a GPU-ready machine, but shows the full RAG loop!
If working on CPU or want to skip, use a smaller LLM.


In [21]:
if 'MODEL_READY' in globals() and MODEL_READY:
    input_ids = tokenizer(rendered_prompt, return_tensors='pt').input_ids.to(model.device)
    outputs = model.generate(input_ids, max_new_tokens=512, temperature=0.2, do_sample=False)
    answer = tokenizer.decode(outputs[0], skip_special_tokens=True)
    # Try to strip the prompt if possible (best-effort)
    try:
        start = answer.find(rendered_prompt)
        if start != -1:
            answer = answer[start + len(rendered_prompt):]
    except Exception:
        pass
    print('\nGenerated Answer:', answer.strip())
else:
    print('\n[Skipped LLM generation] Model not loaded. Review the previous cell output for setup instructions.')



Generated Answer: system
You are a research assistant. Ground your answer in the provided documents. Cite document numbers inline when useful. If unsure, say you don't know.
user
Context:
[Document 1]
Afforestation and reforestation are popular strategies for mitigating climate change by enhancing carbon sequestration. However, the effectiveness of these efforts is often self-reported by project developers, or certified through processes with limited external validation. This leads to concerns about data reliability and project integrity. In response to increasing scrutiny of voluntary carbon m
[Document 2]
arkets, this study presents a dataset on global afforestation and reforestation efforts compiled from primary (meta-)information and augmented with time-series satellite imagery and other secondary data. Our dataset covers 1,289,068 planting sites from 45,628 projects spanning 33 years. Since any remote sensing-based validation effort relies on the integrity of a planting site's ge

### Try it yourself!
Modify the `query` above (in the RAG pipeline code cell) to something your document can answer -- or to something *none* of the docs cover.
What happens? How does the retrieval affect the model's output?


In [None]:
from utils import create_answer_box
create_answer_box('In the above code in this notebook, what does the line `q = encoder.encode([query])[0]` do?', question_id='encoder_question')

create_answer_box('In the above code in this notebook, what does the line `D, I = index.search(np.array([q]), k=2)` do?', question_id='index_question')


---

**Note on Prompt Lengths & Context:**
Models like Qwen3-8B support long context windows (up to 32K tokens or more), but you often need to truncate or focus your retrieved docs.
Too much, and the model may ignore key info; too little, and you could miss relevant context.

That's why retrieval *quality* is just as important as the LLM itself!



Congratulations—You now have a basic, working RAG pipeline!
In the next module, we'll explore how to improve retrieval quality and tackle more advanced scenarios.


---

## Streamlined RAG (Library-Based)

The above walkthrough showed a ground-up RAG pipeline. Below is a concise version using a popular orchestration library to wire up embeddings, a vector store, a retriever, and an LLM chain.

This mirrors what many teams do in practice.


In [15]:
# Optional: install helpers if missing
# Recommended: install compatible LangChain packages
# %pip install -U "langchain>=0.2.16" "langchain-core>=0.2.38" "langchain-community>=0.2.16" "langchain-huggingface>=0.0.6" "langchain-text-splitters>=0.2.2" jsonpatch

from langchain_huggingface import HuggingFaceEmbeddings
from langchain_community.vectorstores import FAISS as LCFAISS
from langchain_community.llms import HuggingFacePipeline
from langchain_text_splitters import RecursiveCharacterTextSplitter
from langchain_core.documents import Document
from langchain.prompts import ChatPromptTemplate
from langchain.chains.combine_documents import create_stuff_documents_chain
from langchain.chains import create_retrieval_chain
from transformers import pipeline as hf_pipeline

print('Building vector store with LangChain (auto-chunk + FAISS) ...')
emb = HuggingFaceEmbeddings(model_name='sentence-transformers/all-MiniLM-L6-v2')

# Wrap raw records as LangChain Documents, carrying metadata
raw_docs = []
for d in docs:
    text = (d.get('abstract') or '').strip()
    if not text:
        continue
    md = { 'doc_id': d.get('id'), 'title': d.get('title'), 'year': d.get('year') }
    raw_docs.append(Document(page_content=text, metadata=md))
print(f'- Loaded {len(raw_docs)} source documents')

# Split automatically with a standard text splitter
splitter = RecursiveCharacterTextSplitter(chunk_size=400, chunk_overlap=40, separators=['\n\n','\n',' ', ''])
docs_split = splitter.split_documents(raw_docs)
print(f'- Created {len(docs_split)} chunks (chunk_size=400, overlap=40)')

# Build FAISS vector store directly from Documents
vs = LCFAISS.from_documents(docs_split, embedding=emb)
retriever = vs.as_retriever(search_type='similarity', search_kwargs={'k': 2})

print('Wrapping Transformers model as an LLM pipeline...')
gen = hf_pipeline(
    'text-generation',
    model=model,
    tokenizer=tokenizer,
    max_new_tokens=512,
    temperature=0.2,
    do_sample=False,
)
llm = HuggingFacePipeline(pipeline=gen)

prompt = ChatPromptTemplate.from_messages([
    ('system', 'You are a research assistant. Use the provided context to answer. Cite titles when helpful. If unsure, say you don\'t know.'),
    ('human', 'Context:\n{context}\n\nQuestion: {input}')
])

doc_chain = create_stuff_documents_chain(llm, prompt)
rag_chain = create_retrieval_chain(retriever, doc_chain)

print('Querying streamlined RAG chain...')
# create_retrieval_chain expects the user query under key "input"
res = rag_chain.invoke({'input': query})
answer = res.get('answer') or res.get('output_text', '')
print('\nAnswer:', answer.strip())

print('\nTop sources:')
for i, d in enumerate(res.get('context', []), 1):
    md = getattr(d, 'metadata', {}) or {}
    title = md.get('title', '')
    doc_id = md.get('doc_id', '')
    print(f'- Source {i}: {title[:80]} (id={doc_id})')


Building vector store with LangChain (auto-chunk + FAISS) ...
- Loaded 18 source documents
- Created 74 chunks (chunk_size=400, overlap=40)


Device set to use cpu
`generation_config` default values have been modified to match model-specific defaults: {'do_sample': True}. If this is not desired, please set these values explicitly.


Wrapping Transformers model as an LLM pipeline...
Querying streamlined RAG chain...

Answer: System: You are a research assistant. Use the provided context to answer. Cite titles when helpful. If unsure, say you don't know.
Human: Context:
Afforestation and reforestation are popular strategies for mitigating climate change by enhancing carbon sequestration. However, the effectiveness of these efforts is often self-reported by project developers, or certified through processes with limited external validation. This leads to concerns about data reliability and project integrity. In response to increasing scrutiny of voluntary carbon

increasing scrutiny of voluntary carbon markets, this study presents a dataset on global afforestation and reforestation efforts compiled from primary (meta-)information and augmented with time-series satellite imagery and other secondary data. Our dataset covers 1,289,068 planting sites from 45,628 projects spanning 33 years. Since any remote sensing-based 