# Overview

This hands‑on tutorial begins with a quick baseline RAG prototype for Roman‑Empire question‑answering and uses it as a **stepping stone toward the real objective: running a fully local Retrieval‑Augmented Generation system powered by a fine‑tuned small language model (SLM) that you train with the distil labs platform and then serve on your own hardware.**

* **Chapter 1 – Build a local RAG system**

  * Launch a lightweight chat model locally with **vLLM**.
  * Chunk a Wikipedia article, embed the chunks with **HuggingFace sentence‑transformers**, and store them in an **in‑memory vector store**.
  * Glue retrieval and generation together in a minimal **RAG class**, then test the loop end‑to‑end.

* **Chapter 2 – Specialize an SLM for local RAG**

  * Collect a *tiny* labelled QA set plus the unstructured passages you just indexed.
  * Upload them to **distil labs**, evaluate a powerful teacher LLM, and fine‑tune a 135 M‑parameter student model.

* **Chapter 3 – Rerun the RAG loop with the fine‑tuned model**

  * Serve the freshly trained weights locally with vLLM.
  * Drop the tuned model into the same RAG class and observe the quality gains.

By the end you will have a fully working RAG system that you can run on a laptop, plus a recipe for squeezing extra accuracy out of modest hardware by distilling knowledge from a large model into a small one.

# Notebook Setup

In [None]:
%%script bash

pip install vllm==0.7.3
pip install langchain-core langchain_community langchain-openai langchain-huggingface
pip install wikipedia pandas numpy requests rich pyyaml rouge_score

In [None]:
%env TOKENIZERS_PARALLELISM=false

# Chapter 1: Build a local RAG system

In this first chapter **we will build** a baseline RAG prototype that runs entirely on your machine: we will serve a small chat model with vLLM, index the Roman Empire Wikipedia article, and glue retrieval to generation with a concise helper class. This will give us a reference point whose quality we’ll aim to surpass after fine‑tuning in Chapter 2.

### Load our local model

Start the inference server with the chat model by running the `vllm serve` cell below. If that command does not work for you, you can instead launch the server module directly:

```bash
python -m vllm.entrypoints.openai.api_server \
    --model HuggingFaceTB/SmolLM2-135M-Instruct \
    --device cpu \
    --port 8000 2>&1 | tee chat-model.log
```

In [None]:
%%script bash --bg

vllm serve HuggingFaceTB/SmolLM2-135M-Instruct --device cpu --port 8000 2>&1 | tee chat-model.log
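
Because the server starts in the background, requests sent while the model is still loading will fail. A minimal readiness check can be sketched as follows. The helper names `server_is_up` and `wait_for_server` are hypothetical, and the sketch assumes the OpenAI-compatible server answers `GET /v1/models` once it is ready.

```python
import time
import urllib.error
import urllib.request


def server_is_up(url: str) -> bool:
    """Return True if a GET request to `url` answers with HTTP 200."""
    try:
        with urllib.request.urlopen(url, timeout=2) as resp:
            return resp.status == 200
    except (urllib.error.URLError, OSError):
        # Connection refused, timeout, etc.: the server is not ready yet
        return False


def wait_for_server(url: str, timeout_s: float = 300.0, interval_s: float = 2.0,
                    probe=server_is_up) -> bool:
    """Poll `url` until `probe` succeeds or `timeout_s` seconds have passed."""
    deadline = time.monotonic() + timeout_s
    while time.monotonic() < deadline:
        if probe(url):
            return True
        time.sleep(interval_s)
    return False


# Usage (assumes the server from the previous cell listens on port 8000):
# wait_for_server("http://localhost:8000/v1/models")
```

The `probe` argument is injectable so the waiting logic can be exercised without a running server.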

Make requests against our server using the standard LangChain chat interface:

In [None]:
from langchain_openai import ChatOpenAI

llm = ChatOpenAI(
    base_url="http://localhost:8000/v1",
    api_key="EMPTY",
    model="HuggingFaceTB/SmolLM2-135M-Instruct"
)

response = llm.invoke([
    {"role": "user", "content": "What is today's date?"}
])
print(response.content)

To kill the local server, run the following snippet in a new cell:

```bash
! pgrep -al -f "vllm serve"
! pkill -f "vllm serve"
```


### Index our target dataset

This section walks through loading the **Wikipedia article on the Roman Empire** into an in‑memory vector store (adapted from [https://python.langchain.com/docs/tutorials/rag/](https://python.langchain.com/docs/tutorials/rag/)):

In [None]:
from langchain_community.document_loaders import WikipediaLoader
from langchain_text_splitters import RecursiveCharacterTextSplitter
from langchain_huggingface import HuggingFaceEmbeddings
from langchain_core.vectorstores import InMemoryVectorStore

# Load and chunk the Wikipedia page
loader = WikipediaLoader(
    query="Roman Empire",
    lang="en",
    load_max_docs=1,
    load_all_available_meta=False,
    doc_content_chars_max=int(1e7)
)
docs = loader.load()
docs[0].metadata.pop("summary", None)  # Remove the long summary (no error if it is absent)

# Split the document into chunks
text_splitter = RecursiveCharacterTextSplitter(chunk_size=1000, chunk_overlap=200)
all_splits = text_splitter.split_documents(docs)

# Embed and index the chunks
embeddings = HuggingFaceEmbeddings(model_name="sentence-transformers/all-MiniLM-L12-v2")
vector_store = InMemoryVectorStore(embeddings)
indexed = vector_store.add_documents(documents=all_splits)

In [None]:
all_splits[0]
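
To build intuition for `chunk_size` and `chunk_overlap`, here is a deliberately simplified fixed-width sliding-window sketch. It is not the algorithm used by `RecursiveCharacterTextSplitter`, which prefers to break on separators such as paragraph and line breaks; the function name `sliding_window_chunks` is hypothetical.

```python
def sliding_window_chunks(text: str, chunk_size: int, chunk_overlap: int) -> list[str]:
    """Cut `text` into windows of `chunk_size` characters that share
    `chunk_overlap` characters with the previous window."""
    if chunk_overlap >= chunk_size:
        raise ValueError("chunk_overlap must be smaller than chunk_size")
    step = chunk_size - chunk_overlap  # how far each window advances
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # the last window already reaches the end of the text
    return chunks


print(sliding_window_chunks("abcdefghij", chunk_size=4, chunk_overlap=2))
```

The overlap means a sentence that straddles a chunk boundary still appears intact in at least one chunk, at the cost of indexing some text twice.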

### Define the RAG logic

Now that our dataset is indexed and the chat model is live, we can **wire retrieval and generation together**. In this section we implement a bite‑sized `RAG` helper class that

1. fetches the top‑k passages most similar to the user’s question,
2. feeds those passages and the question into the language model via a structured prompt, and
3. returns a concise answer.

With this plumbing in place, answering a question becomes a single‑function call.
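
Step 1 can be illustrated with a toy top-k search over embedding vectors. This is a sketch that assumes passages are ranked by cosine similarity to the question embedding; the helper names `cosine_similarity` and `top_k` are hypothetical and stand in for what `similarity_search` does for us.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # treat zero vectors as dissimilar to everything
    return dot / (norm_a * norm_b)


def top_k(query_vec, doc_vecs, k=3):
    """Return the indices of the k document vectors most similar to the query."""
    scored = [(cosine_similarity(query_vec, vec), idx) for idx, vec in enumerate(doc_vecs)]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [idx for _, idx in scored[:k]]


# Toy 2-dimensional "embeddings"; real sentence embeddings have hundreds of dimensions
docs = [[0.0, 1.0], [1.0, 0.0], [1.0, 1.0], [-1.0, 0.0]]
print(top_k([1.0, 0.0], docs, k=2))
```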

In [None]:
from langchain_core.documents import Document
from langchain_openai import ChatOpenAI
from langchain_core.vectorstores import InMemoryVectorStore


class RAG:
    def __init__(self, vector_store: InMemoryVectorStore, llm: ChatOpenAI):
        self.vector_store = vector_store
        self.llm = llm

        self.SYSTEM_PROMPT = """
You are a problem solving model working on task_description XML block:

<task_description>
Answer the question using information from one of the context paragrpahs.
The answer should should contain all important information from the context paragraph but stay short, one sentence maximum.
</task_description>

You will be given a single task with context in the context XML block and the task in the question XML block
Solve the task in question block based on the context in context block.
Generate only the answer, do not generate anything else
"""

        self.PROMPT_TEMPLATE = """
Now for the real task, solve the task in question block based on the context in context block.
Generate only the solution, do not generate anything else.

<context>
{context}
</context>

<question>
{question}
</question>
"""

    def retrieve(self, question: str, k: int = 3):
        return self.vector_store.similarity_search(question, k=k)

    def generate(self, question: str, context_docs):
        context = "\n\n".join(doc.page_content for doc in context_docs)
        messages = [
            {"role": "system", "content": self.SYSTEM_PROMPT},
            {"role": "user", "content": self.PROMPT_TEMPLATE.format(context=context, question=question)},
        ]
        return self.llm.invoke(messages).content

    def answer(self, question: str):
        return self.generate(question, self.retrieve(question))

In [None]:
rag = RAG(vector_store=vector_store, llm=llm)

In [None]:
print(rag.answer("When did the Roman Empire collapse?"))
# print(rag.answer("What were the main metals mined in the Iberian Peninsula during the Roman Empire?"))
# print(rag.answer("How did the seating arrangement in Roman amphitheatres reflect social hierarchy?"))

### Test our RAG system

In [None]:
import pandas as pd
import numpy as np
from rouge_score import rouge_scorer


rouge = rouge_scorer.RougeScorer(["rougeL"], use_stemmer=True)
test_dataset = pd.read_csv("data/test.csv")

results = []
for idx, example in test_dataset.iterrows():
    reference = example["answer"]
    prediction = rag.answer(example["question"])
    metric = rouge.score(
        target=reference,
        prediction=prediction,
    )
    results.append(metric["rougeL"].fmeasure)

print(f"Aggregate metric: {np.mean(results)} +/- {np.std(results)}")
# Aggregate metric: 0.1618807343679535 +/- 0.10965436802120945
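
The score reported above is the ROUGE-L F-measure, which is based on the longest common subsequence (LCS) of tokens shared by the reference and the prediction. The following is a simplified sketch of that computation: it lowercases and splits on whitespace, whereas `rouge_score` applies its own tokenizer and (here) stemming, so the numbers will not match exactly. The function names are hypothetical.

```python
def lcs_length(a, b):
    """Length of the longest common subsequence of two token lists."""
    prev = [0] * (len(b) + 1)
    for x in a:
        curr = [0]
        for j, y in enumerate(b, start=1):
            if x == y:
                curr.append(prev[j - 1] + 1)
            else:
                curr.append(max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]


def rouge_l_f1(target: str, prediction: str) -> float:
    """Simplified ROUGE-L F-measure on whitespace tokens."""
    ref = target.lower().split()
    hyp = prediction.lower().split()
    if not ref or not hyp:
        return 0.0
    lcs = lcs_length(ref, hyp)
    if lcs == 0:
        return 0.0
    precision = lcs / len(hyp)  # share of predicted tokens that are in the LCS
    recall = lcs / len(ref)     # share of reference tokens that are in the LCS
    return 2 * precision * recall / (precision + recall)


print(rouge_l_f1("the empire fell in 476", "the empire fell"))
```

A short prediction that is fully contained in the reference gets perfect precision but reduced recall, which is why overly terse answers are still penalized.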

# Chapter 2: Specialize an SLM for local RAG

In this chapter we will fine‑tune a small language model with **distil labs** in three steps:

1. **Data upload** – upload the job description, train/test splits, unstructured contexts, and a YAML config.
2. **Teacher evaluation** – verify that a large model can solve the task; its accuracy becomes the benchmark.
3. **SLM training** – distil the teacher’s behaviour into a 135 M‑parameter student.

### Authentication

In [None]:
import json
import os
import requests


def distil_bearer_token(DL_USERNAME: str, DL_PASSWORD: str) -> str:
    response = requests.post(
        "https://cognito-idp.eu-central-1.amazonaws.com",
        headers={
            "X-Amz-Target": "AWSCognitoIdentityProviderService.InitiateAuth",
            "Content-Type": "application/x-amz-json-1.1",
        },
        data=json.dumps({
            "AuthParameters": {
                "USERNAME": DL_USERNAME,
                "PASSWORD": DL_PASSWORD,
            },
            "AuthFlow": "USER_PASSWORD_AUTH",
            "ClientId" : "4569nvlkn8dm0iedo54nbta6fd",
        })
    )
    response.raise_for_status()
    return response.json()["AuthenticationResult"]["AccessToken"]


DL_USERNAME="DL_USERNAME"
DL_PASSWORD="DL_PASSWORD"

AUTH_HEADER = {"Authorization": distil_bearer_token(DL_USERNAME, DL_PASSWORD)}
print("Success")
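
Rather than typing credentials into the notebook, they can be read from the environment. The following is a small sketch; the environment variable names `DL_USERNAME` and `DL_PASSWORD` are assumptions chosen to mirror the variables above, and `read_credentials` is a hypothetical helper.

```python
import os


def read_credentials(env=os.environ):
    """Return (username, password) from environment variables.

    The variable names are an assumption made for this sketch.
    """
    username = env.get("DL_USERNAME")
    password = env.get("DL_PASSWORD")
    if not username or not password:
        raise RuntimeError("Set DL_USERNAME and DL_PASSWORD before running this cell")
    return username, password


# Usage:
# DL_USERNAME, DL_PASSWORD = read_credentials()
```

This keeps secrets out of the saved notebook and fails with a clear message when they are missing.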

### Data Upload

The data for this example should be stored in the `data` directory. Let's first look at the current directory to make sure all files are available. It should look like this:
```
├── README.md
├── rag-tutorial.ipynb
└── data
    ├── job_description.json
    ├── test.csv
    ├── train.csv
    └── unstructured.csv
```

#### Job Description
A job description explains the question-answering task in plain English and follows the general structure below:

In [None]:
import json
from pathlib import Path
import rich.json

with open(Path("data").joinpath("job_description.json")) as fin:
    rich.print(rich.json.JSON(fin.read()))

#### Train/test set

We need a small training set to start distil labs training and a test set to evaluate the performance of the fine-tuned model. Here, we use the train and test datasets from the `data` directory; each is a CSV file with fewer than 100 (question, answer) pairs.

In [None]:
from pathlib import Path
from IPython.display import display

import pandas as pd

print("# --- Train set")
train = pd.read_csv(Path("data").joinpath("train.csv"))
display(train)

print("# --- Test set")
test = pd.read_csv(Path("data").joinpath("test.csv"))
display(test)
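
Before uploading, it can be worth checking that each file really contains non-empty (question, answer) pairs. The sketch below assumes the columns are named `question` and `answer`, as used by the evaluation loop in Chapter 1; `check_qa_csv` is a hypothetical helper.

```python
import csv
import io


def check_qa_csv(csv_text: str, required=("question", "answer")) -> int:
    """Validate that every row has non-empty required columns; return the row count."""
    reader = csv.DictReader(io.StringIO(csv_text))
    missing = [col for col in required if col not in (reader.fieldnames or [])]
    if missing:
        raise ValueError(f"missing columns: {missing}")
    n_rows = 0
    for row_no, row in enumerate(reader, start=1):
        for col in required:
            if not (row[col] or "").strip():
                raise ValueError(f"empty '{col}' in data row {row_no}")
        n_rows += 1
    return n_rows


# Usage:
# check_qa_csv(Path("data/train.csv").read_text(encoding="utf-8"))
```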

#### Unstructured dataset
The unstructured dataset is used to guide the teacher model in generating diverse, domain-specific data. Here, we will use the indexed contexts to create the unstructured dataset.

In [None]:
import pandas as pd

contexts_dataframe = pd.DataFrame([split.model_dump() for split in all_splits])
contexts_dataframe = contexts_dataframe.rename(columns={"page_content": "context"})[["context"]]
# Save under the file name that the upload step reads (data/unstructured.csv)
contexts_dataframe.to_csv("data/unstructured.csv", index=False)
contexts_dataframe

#### Data upload

In [None]:
import json

import requests
import yaml
from pathlib import Path

# Specify the config
config = {
    "base": {"task": "question-answering-open-book"},
    "setup": {"student_model_name": "HuggingFaceTB/SmolLM2-135M-Instruct"},
    "synthgen": {
        "data_generation_strategy": "qa-open-book",
    },
    "tuning": {
        "num_train_epochs": 8,
    },
}

# Package your data (read_text opens and closes each file for us)
data_dir = Path("data")
data = {
    "job_description.json": (data_dir / "job_description.json").read_text(encoding="utf-8"),
    "train.csv": (data_dir / "train.csv").read_text(encoding="utf-8"),
    "test.csv": (data_dir / "test.csv").read_text(encoding="utf-8"),
    "unstructured.csv": (data_dir / "unstructured.csv").read_text(encoding="utf-8"),
    "config.yaml": yaml.dump(config),
}

# Upload data
response = requests.post(
    "https://api.distillabs.ai/uploads",
    data=json.dumps(data),
    headers={"content-type": "application/json", **AUTH_HEADER},
)
response.raise_for_status()  # Fail early if the upload was rejected
upload_id = response.json().get("id")
print(f"Upload successful. ID: {upload_id}")


### Teacher Evaluation
Before training an SLM, distil labs validates whether a large language model can solve your task:

In [None]:
from pprint import pprint

# Start teacher evaluation
response = requests.post(
    f"https://api.distillabs.ai/teacher-evaluations/{upload_id}",
    headers=AUTH_HEADER
)

teacher_evaluation_id = response.json().get("id")
pprint(response.json())

Poll the status endpoint until it completes, then inspect the accuracy. If accuracy is unsatisfactory, refine the job description.


In [None]:
response = requests.get(
    f"https://api.distillabs.ai/teacher-evaluations/{teacher_evaluation_id}/status",
    headers=AUTH_HEADER
)
pprint(response.json())
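
The polling step can be automated with a small loop. This is a sketch under stated assumptions: that the status response is a JSON object with a `status` field, and that `complete` and `failed` are terminal values (`failed` in particular is a guess). `poll_until_done` is a hypothetical helper, not part of the distil labs API.

```python
import time


def poll_until_done(fetch_status, terminal=("complete", "failed"),
                    interval_s: float = 30.0, max_polls: int = 120):
    """Call `fetch_status()` repeatedly until its "status" field is terminal.

    `fetch_status` is any zero-argument callable returning the decoded JSON
    payload; the "status" key and the terminal values are assumptions.
    """
    for _ in range(max_polls):
        payload = fetch_status()
        if payload.get("status") in terminal:
            return payload
        time.sleep(interval_s)
    raise TimeoutError("job did not reach a terminal status in time")


# Usage (reuses names defined earlier in this notebook):
# status_url = f"https://api.distillabs.ai/teacher-evaluations/{teacher_evaluation_id}/status"
# poll_until_done(lambda: requests.get(status_url, headers=AUTH_HEADER).json())
```

The same helper can be pointed at the training status URL later in this chapter.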

### SLM Training
Once the teacher evaluation completes successfully, start the SLM training:

In [None]:
import time
from pprint import pprint

# Start SLM training
response = requests.post(
    f"https://api.distillabs.ai/trainings/{upload_id}",
    headers=AUTH_HEADER,
)

pprint(response.json())
slm_training_job_id = response.json().get("id")

We can check the status of the training job with the status endpoint. The following snippet displays the current status of the job we just started:

In [None]:
import json
from pprint import pprint

response = requests.get(
    f"https://api.distillabs.ai/trainings/{slm_training_job_id}/status",
    headers=AUTH_HEADER,
)
pprint(response.json())

When the job is finished (`status=complete`), we can fetch the benchmarking results, that is, the accuracy of the teacher LLM and the accuracy of the fine-tuned SLM:

In [None]:
from pprint import pprint

response = requests.get(
    f"https://api.distillabs.ai/trainings/{slm_training_job_id}/evaluation-results",
    headers=AUTH_HEADER,
)

pprint(response.json())


### Download Your Model
Once training is complete, download your model for deployment.

In [None]:
# Get model download URL
response = requests.get(
    f"https://api.distillabs.ai/trainings/{slm_training_job_id}/model",
    headers=AUTH_HEADER
)

pprint(response.json())
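
Once you have located the download URL in the response printed above, the file can be streamed to disk. The sketch below takes the URL as a parameter because the response field name is not shown here; `download_file` is a hypothetical helper, and what the downloaded file contains (for example an archive that needs unpacking) is left open.

```python
import shutil
import urllib.request
from pathlib import Path


def download_file(url: str, destination: str) -> Path:
    """Stream the content at `url` into `destination` and return its path."""
    dest = Path(destination)
    dest.parent.mkdir(parents=True, exist_ok=True)
    with urllib.request.urlopen(url) as resp, open(dest, "wb") as fout:
        shutil.copyfileobj(resp, fout)  # copies in chunks, so large files fit in memory
    return dest


# Usage (the URL must be copied from the response above; the target path is arbitrary):
# download_file(model_url, "models/tuned-model.bin")
```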


# Chapter 3: Build a local RAG system with the fine‑tuned model

Now that we have a small language model fine‑tuned specifically for Roman‑Empire question‑answering, we can rebuild our RAG pipeline around it. This domain‑specialized model should provide more accurate, context‑aware answers than our baseline model while still running entirely on local hardware. In this chapter we’ll serve the tuned weights with vLLM, swap the model into our RAG helper class, and measure the change in ROUGE‑L against the Chapter 1 baseline.

### Serve a local fine-tuned LLM
Stop any vLLM server that is still running and start a new one. The vLLM server must be pointed at the location of the new model, which depends on your setup.

In [None]:
! pgrep -al -f "vllm serve"
! pkill -f "vllm serve"

In [None]:
%%script bash --bg

vllm serve <YOUR_LOCATION> --device cpu --port 8000 2>&1 | tee chat-tuned-model.log

In [None]:
from langchain_openai import ChatOpenAI
import pprint

tuned_llm = ChatOpenAI(
    base_url="http://localhost:8000/v1",
    api_key="EMPTY",
    model="<YOUR_LOCATION>")

response = tuned_llm.invoke([{
    "role": "user",
    "content": "When did the Roman Empire fall?"
}])

response.content

### Plug the new model into RAG
With the fine‑tuned weights now running locally, the last step is to swap the specialized model into our existing RAG helper class. The retrieval component remains identical and still fetches the most relevant passages about the Roman Empire, while the generation step now uses a model that has been trained on our domain‑specific data. This drop‑in replacement should yield sharper and more accurate answers.

In [None]:
tuned_rag = RAG(vector_store=vector_store, llm=tuned_llm)
print(tuned_rag.answer("When did the Roman Empire collapse?"))

### Test our RAG system

In [None]:
import pandas as pd
import numpy as np
from rouge_score import rouge_scorer


rouge = rouge_scorer.RougeScorer(["rougeL"], use_stemmer=True)
test_dataset = pd.read_csv("data/test.csv")

results = []
for idx, example in test_dataset.iterrows():
    reference = example["answer"]
    prediction = tuned_rag.answer(example["question"])
    metric = rouge.score(
        target=reference,
        prediction=prediction,
    )
    results.append(metric["rougeL"].fmeasure)

print(f"Aggregate metric: {np.mean(results)} +/- {np.std(results)}")
# Aggregate metric: 0.253503946795671 +/- 0.16689695917279943
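
To put the two recorded runs side by side, the short calculation below compares the baseline score from Chapter 1 with the tuned score above. The numbers are copied from the comments in the two evaluation cells; your own runs will differ.

```python
# Mean and standard deviation of ROUGE-L F-measure, copied from the recorded runs
baseline_mean, baseline_std = 0.1618807343679535, 0.10965436802120945
tuned_mean, tuned_std = 0.253503946795671, 0.16689695917279943

absolute_gain = tuned_mean - baseline_mean
relative_gain = absolute_gain / baseline_mean

print(f"Absolute gain: {absolute_gain:.3f}")
print(f"Relative gain: {relative_gain:.1%}")
```

This works out to roughly 0.09 ROUGE-L points, or a bit over half again the baseline score. Since the per-example standard deviations are of the same order as the gain and the test set is small, the improvement is best read as indicative.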