# Overview

This hands‑on tutorial begins with a quick baseline RAG prototype for Roman‑Empire question‑answering and uses it as a **stepping stone toward the real objective: running a fully local Retrieval‑Augmented Generation system powered by a fine‑tuned small language model (SLM) that you train with the distil labs platform and then serve on your own hardware.**

* **Chapter 1 – Build a local RAG system**

  * Launch a lightweight chat model locally with **vLLM**.
  * Chunk a Wikipedia article, embed the chunks with **HuggingFace sentence‑transformers**, and store them in an **in‑memory vector store**.
  * Glue retrieval and generation together in a minimal **RAG class**, then test the loop end‑to‑end.

* **Chapter 2 – Specialize an SLM for local RAG**

  * Collect a *tiny* labelled QA set plus the unstructured passages you just indexed.
  * Upload them to **distil labs**, evaluate a powerful teacher LLM, and fine‑tune a 135 M‑parameter student model.

* **Chapter 3 – Rerun the RAG loop with the fine‑tuned model**

  * Serve the freshly trained weights locally with vLLM.
  * Drop the tuned model into the same RAG class and observe the quality gains.

By the end you will have a fully working RAG system that you can run on a laptop, plus a recipe for squeezing extra accuracy out of modest hardware by distilling knowledge from a large model into a small one.

# Notebook Setup

In [None]:
%%script bash

pip install vllm==0.7.3
pip install langchain-core langchain_community langchain-openai langchain-huggingface
pip install wikipedia pandas requests rich pyyaml

In [14]:
%env TOKENIZERS_PARALLELISM=false

env: TOKENIZERS_PARALLELISM=false


# Chapter 1: Build a local RAG system

In this first chapter we will build a baseline RAG prototype that runs entirely on your machine: we serve a small chat model with vLLM, index the Roman Empire Wikipedia article, and glue retrieval to generation with a concise helper class. This gives us a reference point whose quality we’ll aim to surpass after fine‑tuning in Chapter 2.

### Load our local model

Start the inference server with the chat model by running the cell below. If the `vllm serve` command does not work for you, you can launch the same server through the Python module entrypoint instead:

```bash
python -m vllm.entrypoints.openai.api_server \
    --model HuggingFaceTB/SmolLM2-135M-Instruct \
    --device cpu \
    --port 8000 2>&1 | tee chat-model.log
```

In [16]:
%%script bash --bg

vllm serve HuggingFaceTB/SmolLM2-135M-Instruct --device cpu --port 8000 2>&1 | tee chat-model.log
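
Because the server starts in the background, the next cell can run before the model has finished loading. The helper below is a minimal sketch of a readiness check, not part of the original notebook: it polls a URL until it answers with HTTP 200. The function name `wait_for_server` and its `opener` parameter are hypothetical, and the sketch assumes the server exposes the OpenAI-compatible `/v1/models` route on the port used above.

```python
import time
import urllib.error
import urllib.request


def wait_for_server(url, timeout=300.0, interval=2.0, opener=urllib.request.urlopen):
    """Poll `url` until it answers with HTTP 200 or `timeout` seconds have passed.

    `opener` is injectable so the loop can be exercised without a real server.
    """
    deadline = time.monotonic() + timeout
    while True:
        try:
            with opener(url) as response:
                if response.status == 200:
                    return True
        except (urllib.error.URLError, ConnectionError):
            pass  # the server is not accepting connections yet
        if time.monotonic() >= deadline:
            return False
        time.sleep(interval)
```

Calling `wait_for_server("http://localhost:8000/v1/models")` in its own cell before the first request should block until the model is loaded, and return `False` if that takes longer than the timeout.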

Make requests against our server using the standard LangChain interface:

In [17]:
from langchain_openai import ChatOpenAI

llm = ChatOpenAI(
    base_url="http://localhost:8000/v1",
    api_key="EMPTY",
    model="HuggingFaceTB/SmolLM2-135M-Instruct"
)

response = llm.invoke([
    {"role": "user", "content": "What is today's date?"}
])
print(response.content)

The date today will take place at 04:02UTC (Coordinated Universal Time).


To kill the local server, run the following snippet in a new cell:

```bash
! pgrep -al -f "vllm serve"
! pkill -f "vllm serve"
```


### Index our target dataset

This section walks through loading the **Wikipedia article on the Roman Empire** into an in‑memory vector store (adapted from [https://python.langchain.com/docs/tutorials/rag/](https://python.langchain.com/docs/tutorials/rag/)):

In [5]:
from langchain_community.document_loaders import WikipediaLoader
from langchain_text_splitters import RecursiveCharacterTextSplitter
from langchain_huggingface import HuggingFaceEmbeddings
from langchain_core.vectorstores import InMemoryVectorStore

# Load and chunk the Wikipedia page
loader = WikipediaLoader(
    query="Roman Empire",
    lang="en",
    load_max_docs=1,
    load_all_available_meta=False,
    doc_content_chars_max=int(1e7)
)
docs = loader.load()
docs[0].metadata.pop("summary")  # Remove the long summary

# Split the document into chunks
text_splitter = RecursiveCharacterTextSplitter(chunk_size=1000, chunk_overlap=200)
all_splits = text_splitter.split_documents(docs)

# Embed and index the chunks
embeddings = HuggingFaceEmbeddings(model_name="sentence-transformers/all-MiniLM-L12-v2", show_progress=True)
vector_store = InMemoryVectorStore(embeddings)
indexed = vector_store.add_documents(documents=all_splits)

In [6]:
all_splits[0]

Document(metadata={'title': 'Roman Empire', 'source': 'https://en.wikipedia.org/wiki/Roman_Empire'}, page_content="The Roman Empire ruled the Mediterranean and much of Europe, Western Asia and North Africa. The Romans conquered most of this during the Republic, and it was ruled by emperors following Octavian's  assumption of effective sole rule in 27 BC. The western empire collapsed in 476 AD, but the eastern empire lasted until the fall of Constantinople in 1453.")
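
To make the retrieval step less of a black box, here is a minimal sketch of what a similarity search over the indexed chunks does conceptually: score every stored vector against the query vector and keep the best `k`. It assumes cosine similarity as the ranking metric and uses toy two-dimensional vectors. The helper names `cosine_similarity` and `top_k` are illustrative and are not LangChain's implementation.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors (0.0 for a zero vector)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def top_k(query_vector, indexed_vectors, k=3):
    """Return the indices of the k stored vectors most similar to the query."""
    scored = [
        (cosine_similarity(query_vector, vector), index)
        for index, vector in enumerate(indexed_vectors)
    ]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [index for _, index in scored[:k]]


# Toy 2-D "embeddings"; real sentence-transformer vectors have hundreds of dimensions.
toy_index = [[1.0, 0.0], [0.0, 1.0], [0.7, 0.7], [-1.0, 0.0]]
print(top_k([1.0, 0.1], toy_index, k=2))  # [0, 2]
```

An in-memory store can afford this brute-force scan because the article yields only a few hundred chunks; larger corpora usually call for an approximate nearest-neighbour index.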

### Define the RAG logic

Now that our dataset is indexed and the chat model is live, we can **wire retrieval and generation together**. In this section we implement a bite‑sized `RAG` helper class that

1. fetches the top‑k passages most similar to the user’s question,
2. feeds those passages and the question into the language model via a structured prompt, and
3. returns a concise answer.

With this plumbing in place, answering a question becomes a single‑function call.

In [7]:
from langchain_core.documents import Document
from langchain_openai import ChatOpenAI
from langchain_core.vectorstores import InMemoryVectorStore


class RAG:
    def __init__(self, vector_store: InMemoryVectorStore, llm: ChatOpenAI):
        self.vector_store = vector_store
        self.llm = llm

        self.SYSTEM_PROMPT = (
            "You are a question‑answering model working on the task below.\n"
            "<task_description>Answer the question using information in the context.</task_description>"
        )

        self.PROMPT_TEMPLATE = (
            "Now answer based on the context.\n"
            "<context>\n{context}\n</context>\n\n<question>\n{question}\n</question>"
        )

    def retrieve(self, question: str, k: int = 3):
        return self.vector_store.similarity_search(question, k=k)

    def generate(self, question: str, context_docs):
        context = "\n\n".join(doc.page_content for doc in context_docs)
        messages = [
            {"role": "system", "content": self.SYSTEM_PROMPT},
            {"role": "user", "content": self.PROMPT_TEMPLATE.format(context=context, question=question)},
        ]
        return self.llm.invoke(messages).content

    def answer(self, question: str):
        return self.generate(question, self.retrieve(question))

In [8]:
rag = RAG(vector_store=vector_store, llm=llm)
print(rag.answer("When did the Roman Empire collapse?"))


The Roman Empire collapsed after the death of Odoacer, who had once been a ruler. He was one of the last emperors of the Western Roman Empire, and his brief rule was followed by the death of Romulus Augustulus in 476 AD. The empire's fragmentation was the continuing effects of the collapse of the Western Roman Empire, which occurred due to changes in empire-building theories between Augustus and Odoacer's reign.


### Test our RAG system

In [9]:
import pandas as pd
test_dataset = pd.read_csv("data/test.csv")

In [18]:
example = test_dataset.sample(n=1).iloc[0]
print("QUESTION:", example["question"].strip())
print("ANSWER:", example["answer"].strip())
print("PREDICTED:", rag.answer(example["question"]))

QUESTION: What was the linguistic diversity of the Roman Empire and how did Latin interact with local languages?
ANSWER: The Roman Empire was deliberately multilingual with Latin and Greek as dominant languages among the elite, while local languages like Coptic, Punic, Gaulish, and Aramaic continued to be used in various regions, with Latin gradually replacing some languages while incorporating elements of others through natural linguistic evolution.

PREDICTED: There is no information about linguistic diversity of the Roman Empire and how it interacted with local languages. The Romans did not follow linguistic imperialism to achieve power or political superiority. However, there is evidence that local languages, particularly in Egypt, Fulham, and Syria, were favored by Roman administrators for use of internal administrative documents, official communications, private letters, and policy matters. Palmyrene interpreters served the military and had their own language for quotations and proclamations. In Africa, Palmyrene soldiers used their dialect of Aramaic for inscriptions, an attempt by the Roman Empire to inform the right knowledge. Linguistic diversity would have made Rome's various officials and officials' efforts to use local languages common throughout the empire.

As for the main language and writing system of the Empire, Latin has had far greater worldwide influence and standardization than any other language. In the Warrin
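
Spot-checking one random question gives only an impression of quality. As a rough, hedged sketch of how the whole test set could be scored, the snippet below computes a token-overlap F1 between each prediction and its reference answer. This metric is an assumption chosen for illustration and is not necessarily the one distil labs reports. The helpers `tokenize`, `token_f1` and `mean_token_f1` are hypothetical names.

```python
import re
from collections import Counter


def tokenize(text):
    """Lower-case the text and split it into alphanumeric tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def token_f1(prediction, reference):
    """Harmonic mean of token precision and recall between two strings."""
    pred_tokens = tokenize(prediction)
    ref_tokens = tokenize(reference)
    if not pred_tokens or not ref_tokens:
        # Both empty counts as a match; exactly one empty counts as a miss.
        return float(pred_tokens == ref_tokens)
    overlap = sum((Counter(pred_tokens) & Counter(ref_tokens)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred_tokens)
    recall = overlap / len(ref_tokens)
    return 2 * precision * recall / (precision + recall)


def mean_token_f1(answer_fn, examples):
    """Average token F1 of `answer_fn(question)` over (question, answer) pairs."""
    scores = [token_f1(answer_fn(question), answer) for question, answer in examples]
    return sum(scores) / len(scores) if scores else 0.0
```

With the objects defined above, `mean_token_f1(rag.answer, zip(test_dataset["question"], test_dataset["answer"]))` would give one baseline number to compare against after fine‑tuning. Token overlap rewards shared wording rather than factual correctness, so treat it as a coarse signal.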

# Chapter 2: Specialize an SLM for local RAG

In this chapter we will fine‑tune a small language model with **distil labs** in three steps:

1. **Data upload** – upload the job description, train/test splits, unstructured contexts, and a YAML config.
2. **Teacher evaluation** – verify that a large model can solve the task; its accuracy becomes the benchmark.
3. **SLM training** – distil the teacher’s behaviour into a 135 M‑parameter student.

### Authentication

In [19]:
import json
import os
import requests


def distil_bearer_token(DL_USERNAME: str, DL_PASSWORD: str) -> str:
    response = requests.post(
        "https://cognito-idp.eu-central-1.amazonaws.com",
        headers={
            "X-Amz-Target": "AWSCognitoIdentityProviderService.InitiateAuth",
            "Content-Type": "application/x-amz-json-1.1",
        },
        data=json.dumps({
            "AuthParameters": {
                "USERNAME": DL_USERNAME,
                "PASSWORD": DL_PASSWORD,
            },
            "AuthFlow": "USER_PASSWORD_AUTH",
            "ClientId" : "4569nvlkn8dm0iedo54nbta6fd",
        })
    )
    response.raise_for_status()
    return response.json()["AuthenticationResult"]["AccessToken"]


# Read the credentials from the environment rather than hard-coding them in the notebook.
DL_USERNAME = os.environ["DL_USERNAME"]
DL_PASSWORD = os.environ["DL_PASSWORD"]

AUTH_HEADER = {"Authorization": distil_bearer_token(DL_USERNAME, DL_PASSWORD)}
print("Success")

Success


### Data Upload

The data for this example should be stored in the `data` directory. Let's first take a look at the current directory to make sure all files are available. It should look like this:
```
├── README.md
├── rag-tutorial.ipynb
└── data
    ├── job_description.json
    ├── test.csv
    ├── train.csv
    └── unstructured.csv
```

#### Job Description
A job description explains the question‑answering task in plain English and follows the general structure below:

In [20]:
import json
from pathlib import Path
import rich.json

with open(Path("data").joinpath("job_description.json")) as fin:
    rich.print(rich.json.JSON(fin.read()))

#### Train/test set

We need a small training dataset to start distil labs training and a test dataset to evaluate the performance of the fine-tuned model. Here we use the train and test datasets from the `data` directory; each is a CSV file with fewer than 100 (question, answer) pairs.

In [21]:
from pathlib import Path
from IPython.display import display

import pandas

print("# --- Train set")
train = pandas.read_csv(Path("data").joinpath("train.csv"))
display(train)

print("# --- Test set")
test = pandas.read_csv(Path("data").joinpath("test.csv"))
display(test)

# --- Train set


| Unnamed: 0.2 | Unnamed: 0.1 | Unnamed: 0 | question | answer | context |
|---|---|---|---|---|---|
| 0 | 0 | 71 | \nWhat was the social status of artists in Rom... | Despite the value placed on art, artists had l... | == Arts ==\n\nGreek art had a profound influen... |
| 1 | 1 | 1 | \nWhat was the fundamental distinction in Roma... | According to Gaius, the fundamental distinctio... | === Legal status ===\n\nAccording to the juris... |
| 2 | 2 | 15 | \nWhat changes occurred in the composition of ... | The Roman Senate gradually evolved from being ... | In the time of Nero, senators were still prima... |
| 3 | 3 | 32 | \nHow did Romans prefer to transport their com... | Romans preferred to transport commodities by w... | The Empire completely encircled the Mediterran... |
| 4 | 4 | 33 | \nAccording to Cassius Dio, which Roman empero... | According to Cassius Dio, Commodus's accession... | In the view of contemporary Greek historian Ca... |
| 5 | 5 | 52 | \nHow did social status affect justice and pun... | In Roman society, honestiores (upper classes) ... | As the republican principle of citizens' equal... |
| 6 | 6 | 83 | \nWhat was the main source of an emperor's pow... | The military was the practical source of an em... | The Imperial cult of ancient Rome identified e... |
| 7 | 7 | 2 | \nHow did ancient Romans handle large financia... | Ancient Romans used a system of banks througho... | in private lending, both as creditors and borr... |
| 8 | 8 | 30 | \nWhat was the fundamental principle of Roman ... | Roman religion was based on the principle of d... | Roman religion was practical and contractual, ... |
| 9 | 9 | 53 | \nWhat were the three main components of the R... | The three main components of the Roman militar... | == Government and military ==\n\nThe three maj... |


# --- Test set


| Unnamed: 0.2 | Unnamed: 0.1 | Unnamed: 0 | question | answer | context |
|---|---|---|---|---|---|
| 0 | 0 | 95 | \nHow did the seating arrangement in Roman amp... | The amphitheatre seating reflected Roman socia... | Circuses were the largest structure regularly ... |
| 1 | 1 | 78 | \nWhat was meant by Juvenal's "bread and circu... | Juvenal's "bread and circuses" reference meant... | When Juvenal complained that the Roman people ... |
| 2 | 2 | 57 | \nWhat was the Mediterranean Sea called by the... | The Mediterranean Sea was called "Mare Nostrum... | The Empire completely encircled the Mediterran... |
| 3 | 3 | 49 | \nWhat were the main centers of glass producti... | Egypt and the Rhineland had become noted for f... | === Decorative arts ===\n\nDecorative arts for... |
| 4 | 4 | 45 | \nWhat was the primary currency used in the ea... | The sestertius (HS) was the basic unit of curr... | The early Empire was monetized to a near-unive... |
| 5 | 5 | 69 | \nWhat role did music play in ancient Roman so... | Music was integral to Roman society, being pre... | Although sometimes regarded as foreign, music ... |
| 6 | 6 | 22 | \nHow was primary education structured in anci... | Primary education in ancient Rome, available o... | Traditional Roman education was moral and prac... |
| 7 | 7 | 76 | \nWhat were the main metals mined in the Iberi... | The main metals mined in the Iberian Peninsula... | The main mining regions of the Empire were the... |
| 8 | 8 | 77 | \nWhat were the main resources mined in the Ib... | The main resources mined in the Iberian Penins... | The main mining regions of the Empire were the... |
| 9 | 9 | 44 | \nWhat were the main trade commodities in the ... | The main trade commodities in the Roman Empire... | === Trade and commodities ===\n\nRoman provinc... |
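
Before uploading, it can be worth confirming that each CSV actually has the columns displayed above and no empty cells in them. The check below is a small sketch using only the standard library. The required column set (`question`, `answer`, `context`) is an assumption inferred from the displayed tables, not a documented requirement of the platform, and `check_qa_csv` is a hypothetical helper name.

```python
import csv
import io

# Assumption: inferred from the train/test tables, not from platform documentation.
REQUIRED_COLUMNS = {"question", "answer", "context"}


def check_qa_csv(csv_text, required=REQUIRED_COLUMNS):
    """Return a list of human-readable problems; an empty list means the file looks fine."""
    reader = csv.DictReader(io.StringIO(csv_text))
    header = set(reader.fieldnames or [])
    missing = sorted(required - header)
    if missing:
        return [f"missing columns: {', '.join(missing)}"]
    problems = []
    for row_number, row in enumerate(reader, start=1):
        for column in sorted(required):
            # Short rows yield None for absent cells, so normalise before stripping.
            if not (row.get(column) or "").strip():
                problems.append(f"row {row_number}: empty '{column}'")
    return problems
```

For example, `check_qa_csv(Path("data/train.csv").read_text())` should return an empty list for a well-formed training file.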


#### Unstructured dataset
The unstructured dataset is used to guide the teacher model in generating diverse, domain-specific data. Here, we will use the indexed contexts to create the unstructured dataset.

In [22]:
contexts_dataframe = pd.DataFrame([split.model_dump() for split in all_splits])
contexts_dataframe = contexts_dataframe.rename(columns={"page_content": "context"})[["context"]]
# contexts_dataframe.to_csv("data/unstructured.csv", index=False)  # uncomment to regenerate the file uploaded below
contexts_dataframe

| Unnamed: 0 | context |
|---|---|
| 0 | The Roman Empire ruled the Mediterranean and m... |
| 1 | By 100 BC, the city of Rome had expanded its r... |
| 2 | The first two centuries of the Empire saw a pe... |
| 3 | moved the imperial seat from Rome to Byzantium... |
| 4 | Due to the Empire's extent and endurance, its ... |
| ... | ... |
| 167 | == Legacy == |
| 168 | Several states claimed to be the Roman Empire'... |
| 169 | on the throne of the Roman Empire. He even lau... |
| 170 | The Roman Empire's control of the Italian Peni... |


#### Data upload

In [23]:
import json

import requests
import yaml
from pathlib import Path

# Specify the config
config = {
    "base": {"task": "question-answering-open-book"},
    "setup": {"student_model_name": "HuggingFaceTB/SmolLM2-135M-Instruct"},
    "synthgen": {
        "data_generation_strategy": "qa-open-book",
        "generation_target": 128,
        "generation_iteration_size": 16,
    },
    "tuning": {
        "num_train_epochs": 3,
        "use_lora": False,
    },
}

# Package your data (read_text opens and closes each file for us)
data_dir = Path("data")
data = {
    "job_description.json": (data_dir / "job_description.json").read_text(),
    "train.csv": (data_dir / "train.csv").read_text(),
    "test.csv": (data_dir / "test.csv").read_text(),
    "unstructured.csv": (data_dir / "unstructured.csv").read_text(),
    "config.yaml": yaml.dump(config),
}

# Upload data
response = requests.post(
    "https://api.distillabs.ai/uploads",
    data=json.dumps(data),
    headers={"content-type": "application/json", **AUTH_HEADER},
)
response.raise_for_status()  # fail loudly instead of printing "ID: None"
upload_id = response.json().get("id")
print(f"Upload successful. ID: {upload_id}")


Upload successful. ID: bce40ea7-800b-4029-9e7a-4bc2be891dca


### Teacher Evaluation
Before training an SLM, distil labs validates whether a large language model can solve your task:

In [24]:
from pprint import pprint

# Start teacher evaluation
response = requests.post(
    f"https://api.distillabs.ai/teacher-evaluations/{upload_id}",
    headers=AUTH_HEADER
)

teacher_evaluation_id = response.json().get("id")
pprint(response.json())

{'id': 'b18808b5-364b-491a-9488-3b70e7a88c12',
 'upload_id': 'bce40ea7-800b-4029-9e7a-4bc2be891dca'}


Poll the status endpoint until it completes, then inspect the accuracy. If accuracy is unsatisfactory, refine the job description.


In [31]:
# Check evaluation results
# teacher_evaluation_id = "b18808b5-364b-491a-9488-3b70e7a88c12"
response = requests.get(
    f"https://api.distillabs.ai/teacher-evaluations/{teacher_evaluation_id}/status",
    headers=AUTH_HEADER
)
pprint(response.json())

{'evaluation_predictions_download_url': '',
 'logs': '2025-05-14 08:59:55,839\tINFO job_manager.py:530 -- Runtime env is '
         'setting up.\n'
         '(evaluate_teacher pid=128479, ip=10.2.4.86) '
         'INFO:distillib.control.qa.entrypoint:Starting teacher evaluation '
         '...\n'
         '(evaluate_teacher pid=128479, ip=10.2.4.86) 2025/05/14 09:00:13 INFO '
         'mlflow.tracking.fluent: Experiment with name '
         "'85ffc3fe-fdfd-4dfd-b0eb-b08492325f98' does not exist. Creating a "
         'new experiment.\n'
         '(evaluate_teacher pid=128479, ip=10.2.4.86) INFO:absl:Using default '
         'tokenizer.\n'
         '(evaluate_teacher pid=128479, ip=10.2.4.86) [nltk_data] Downloading '
         'package wordnet to /home/ray/nltk_data...\n'
         '(evaluate_teacher pid=128479, ip=10.2.4.86) [nltk_data]   Package '
         'wordnet is already up-to-date!\n'
         '(evaluate_teacher pid=128479, ip=10.2.4.86) [nltk_data] Downloading '
         'packag

### SLM Training
Once the teacher evaluation completes successfully, start the SLM training:

In [26]:
import time
from pprint import pprint

# Start SLM training
response = requests.post(
    f"https://api.distillabs.ai/trainings/{upload_id}",
    headers=AUTH_HEADER,
)

pprint(response.json())
slm_training_job_id = response.json().get("id")

{'id': '7f1e7be5-58f0-434e-81a6-c6ac5322346c',
 'upload_id': 'bce40ea7-800b-4029-9e7a-4bc2be891dca'}


We can check the status of the training job with the `/trainings/{id}/status` endpoint. The following snippet displays the current status of the job we started above, with the verbose logs removed:

In [34]:
import json
from pprint import pprint

# slm_training_job_id = "7f1e7be5-58f0-434e-81a6-c6ac5322346c"
response = requests.get(
    f"https://api.distillabs.ai/trainings/{slm_training_job_id}/status",
    headers=AUTH_HEADER,
)
response = response.json()
response["status"].pop("logs")
response

{'training_id': '7f1e7be5-58f0-434e-81a6-c6ac5322346c',
 'message': '',
 'status': {'dag': 'JOB_RUNNING',
  'start_time': '2025-05-14T16:00:14.643000Z',
  'end_time': None,
  'tasks': [{'evaluate_teacher': {'status': 'JOB_SUCCESS', 'reason': ''}},
   {'generate_synthetic_data': {'status': 'JOB_SUCCESS', 'reason': ''}},
   {'finetune_student': {'status': 'JOB_PENDING', 'reason': ''}}]}}
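
Training takes a while, so rather than re-running the status cell by hand you may prefer a small polling loop. The sketch below is one way to do that under stated assumptions: it treats `JOB_PENDING` and `JOB_RUNNING` (the only in-progress values visible in the output above) as "keep waiting" and any other value as terminal. The helper name `wait_for_job` and its parameters are hypothetical, and the default interval and timeout are arbitrary.

```python
import time

# Assumption: these are the only non-terminal states, based on the sample output.
IN_PROGRESS_STATES = {"JOB_PENDING", "JOB_RUNNING"}


def wait_for_job(fetch_status, interval=60.0, timeout=6 * 3600.0, sleep=time.sleep):
    """Call `fetch_status()` until the reported state is no longer in progress.

    `fetch_status` is any zero-argument callable returning the state string,
    which keeps the loop independent of how the HTTP request is made.
    """
    deadline = time.monotonic() + timeout
    while True:
        state = fetch_status()
        if state not in IN_PROGRESS_STATES:
            return state
        if time.monotonic() >= deadline:
            raise TimeoutError(f"job still {state} after {timeout} seconds")
        sleep(interval)


# Example wiring for the training job (commented out so nothing runs on import):
# final_state = wait_for_job(
#     lambda: requests.get(
#         f"https://api.distillabs.ai/trainings/{slm_training_job_id}/status",
#         headers=AUTH_HEADER,
#     ).json()["status"]["dag"]
# )
```

Passing the status lookup in as a callable means the same loop could be pointed at another status endpoint, provided you know which field of its response carries the state.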

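Once the job has finished, the `/trainings/{id}/evaluation-results` endpoint returns the benchmarking result: the accuracy of the teacher LLM and the accuracy of the fine-tuned SLM. If training has not completed, the response carries an explanatory message instead of results, as in the sample output below.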

In [28]:
from pprint import pprint
response = requests.get(
    f"https://api.distillabs.ai/trainings/{slm_training_job_id}/evaluation-results",
    headers=AUTH_HEADER,
)

pprint(response.json())


{'evaluation_results': None,
 'finetuned_student_evaluation_predictions_download_url': '',
 'message': 'Training not completed successfully: '
            'a8a6f99f-e4ba-4e52-a57c-69aca0c5ef03. Use the '
            '/trainings/<training-id>/status endpoint for more details.',
 'results': {},
 'training_id': 'a8a6f99f-e4ba-4e52-a57c-69aca0c5ef03'}


### Download Your Model
Once training is complete, download your model for deployment.

In [33]:
# Get model download URL
response = requests.get(
    f"https://api.distillabs.ai/trainings/{slm_training_job_id}/model",
    headers=AUTH_HEADER
)

pprint(response.json())


{'message': 'Training not completed successfully: '
            '7f1e7be5-58f0-434e-81a6-c6ac5322346c. Use the '
            '/trainings/<training-id>/status endpoint for more details.',
 's3_url': None,
 'training_id': '7f1e7be5-58f0-434e-81a6-c6ac5322346c'}


# Chapter 3: Build a local RAG system with the fine‑tuned model

Now that we have a small language model fine‑tuned specifically for Roman‑Empire question‑answering, we can rebuild our RAG pipeline around it. This domain‑specialized model should provide more accurate, context‑aware answers than our baseline while still running entirely on local hardware. In this chapter we’ll serve the tuned weights with vLLM, swap the model into our RAG helper class, and check whether quality improves.

### Serve a local fine-tuned LLM
Stop any vLLM server that is still running, then start a new one that serves the fine‑tuned model. Point `vllm serve` at the directory containing the downloaded weights; the path used below (`~/Downloads/model`) is only an example and will differ depending on your setup.

In [None]:
! pgrep -al -f "vllm serve"
! pkill -f "vllm serve"

In [None]:
%%script bash --bg

vllm serve ~/Downloads/model --device cpu --port 8000 2>&1 | tee chat-model.log

In [None]:
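import os

from langchain_openai import ChatOpenAI

# vLLM registers the model under the path it was launched with. The shell expands
# "~" in the serve command above, so expand it here as well to get a matching name.
tuned_llm = ChatOpenAI(
    base_url="http://localhost:8000/v1",
    api_key="EMPTY",
    model=os.path.expanduser("~/Downloads/model"),
)

response = tuned_llm.invoke([{
    "role": "user",
    "content": "When did the roman empire fall?"
}])

response.content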

### Plug the new model into RAG
With the fine‑tuned weights now running locally, the last step is to swap the specialized model into our existing RAG helper class. The retrieval component remains identical and still fetches the most relevant passages about the Roman Empire, while the generation step now uses a model trained on our domain‑specific data. This drop‑in replacement should yield noticeably sharper and more accurate answers.

In [None]:
tuned_rag = RAG(vector_store=vector_store, llm=tuned_llm)

print(tuned_rag.answer("When did the roman empire collapse?"))

### Test our RAG system

In [None]:
import pandas as pd
test_dataset = pd.read_csv("data/test.csv")

In [None]:
test_case = test_dataset.sample(n=1).to_dict(orient="records").pop()

print("QUESTION:", test_case["question"])
print("ANSWER:", test_case["answer"])
print("PREDICTED:", tuned_rag.answer(test_case["question"]))