# RAG with LLaMA.cpp Tutorial

# Introduction: Why RAG with LLMs Matters


In the age of large language models (LLMs), one critical limitation remains: their knowledge is frozen at the time of training. **Retrieval-Augmented Generation (RAG)** solves this problem by allowing LLMs to dynamically fetch up-to-date, domain-specific, or private information from external sources like documents or databases.

This approach can improve accuracy, reduce hallucinations, and support applications in business, education, research, and more. By combining fast local inference via `llama-cpp-python` with document retrieval, we can build private AI systems that go beyond a standalone chatbot.

## Step 1: Install Required Libraries

In [1]:
!pip install llama-cpp-python chromadb tiktoken huggingface_hub sentence_transformers

Collecting llama-cpp-python
  Downloading llama_cpp_python-0.3.9.tar.gz (67.9 MB)
Collecting chromadb
  Downloading chromadb-1.0.12-cp39-abi3-manylinux_2_17_x86_64.manylinux2014_x86_64.whl.metadata (6.9 kB)

## Step 2: Import Required Modules

In [None]:
from llama_cpp import Llama
import chromadb
from chromadb.config import Settings
from sentence_transformers import SentenceTransformer
from chromadb.utils.embedding_functions import EmbeddingFunction
import os
import time


- `llama_cpp.Llama`: This is the Python binding to the LLaMA C++ inference engine, used for running a local LLM.
- `chromadb`: A simple and fast embedding-based vector store used here for document retrieval.
- `SentenceTransformer` and `EmbeddingFunction`: Used to create embeddings for the documents locally; can be replaced with other providers.

## Step 3: Set Up Chroma Vector Store

In [None]:
# Optional: paste your own Hugging Face token here. Never commit a real token to a shared notebook.
token = "" #@param {type:"string"}
if token:
    os.environ["HUGGINGFACEHUB_API_TOKEN"] = token

# Recommended way to initialize a persistent Chroma client
chroma_client = chromadb.PersistentClient(path=".chroma")

# Load a small, fast, local model
local_model = SentenceTransformer("all-MiniLM-L6-v2")

# Wrap it in ChromaDB's expected interface
class LocalEmbeddingFunction(EmbeddingFunction):
    # Recent Chroma releases validate that __call__ takes a parameter named `input`
    def __call__(self, input):
        return local_model.encode(input).tolist()

embedding_func = LocalEmbeddingFunction()

# Create or get the collection
collection = chroma_client.get_or_create_collection(
    name="rag-tutorial",
    embedding_function=embedding_func
)


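- The vector store is initialized as a persistent client that saves its data on disk in the `.chroma` directory.
- A collection is created (or retrieved if it already exists) where document embeddings will be stored.
- `all-MiniLM-L6-v2` is a small, fast sentence-transformers model that runs locally; `LocalEmbeddingFunction` wraps it in the interface Chroma expects.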
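
To make "embedding-based retrieval" concrete, here is a minimal sketch of how two embedding vectors can be compared. It uses cosine similarity as a familiar illustration; the metric Chroma actually applies is configurable and may differ. The three-dimensional vectors and the names `doc_a` and `doc_b` are made up for the example (real `all-MiniLM-L6-v2` embeddings have 384 dimensions).

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


# Hypothetical toy "embeddings", not output from a real model
query_vec = [1.0, 0.0, 1.0]
doc_vecs = {
    "doc_a": [1.0, 0.0, 1.0],  # points in the same direction as the query
    "doc_b": [0.0, 1.0, 0.0],  # orthogonal to the query
}

# Score every document against the query and pick the closest one
scores = {name: cosine_similarity(query_vec, vec) for name, vec in doc_vecs.items()}
best = max(scores, key=scores.get)
print(best, scores)
```

A vector store does essentially this comparison at scale, using an index so that it does not have to score every stored vector.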

## Step 4: Ingest Documents into the Vector Store

In [None]:
documents = [
    "The capital of France is Paris.",
    "The Moon is Earth's only natural satellite.",
    "Python is a widely used programming language for machine learning."
]
collection.add(documents=documents, ids=["doc1", "doc2", "doc3"])


Here we add 3 simple documents into the vector store with IDs for later retrieval.
In a real application, you would preprocess and chunk PDFs, HTML, etc.
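
As a rough idea of what chunking could look like, the sketch below splits a long string into overlapping character windows. The function name `chunk_text`, the window sizes, and the ID scheme are assumptions chosen for illustration; real pipelines often split on sentences or tokens instead.

```python
def chunk_text(text, chunk_size=200, overlap=50):
    """Split text into windows of chunk_size characters that overlap by `overlap`."""
    if overlap >= chunk_size:
        raise ValueError("overlap must be smaller than chunk_size")
    step = chunk_size - overlap
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        # Stop once the current window has reached the end of the text
        if start + chunk_size >= len(text):
            break
    return chunks


sample = "abcdefghij" * 3  # 30 characters standing in for a long document
chunks = chunk_text(sample, chunk_size=12, overlap=4)
ids = [f"chunk{i}" for i in range(len(chunks))]
print(list(zip(ids, chunks)))
```

The resulting `chunks` and `ids` lists have the same shape as the `documents` and `ids` arguments passed to `collection.add(...)` above.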

## Step 5: Define a Function to Retrieve Relevant Context from Chroma

In [None]:
def get_context(query, top_k=2):
    results = collection.query(query_texts=[query], n_results=top_k)
    return "\n".join(results["documents"][0])


This function uses ChromaDB to retrieve the `top_k` documents most relevant to the user's query.
It returns them joined with newlines as a single context string.
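
The `[0]` in `results["documents"][0]` can be puzzling at first. The sketch below uses a hand-written dictionary that mimics the nested-list shape of a query result for a single query text; the distance values are invented, not real output.

```python
# Hand-written stand-in for the result of collection.query(query_texts=[query], n_results=2).
# Each field holds one inner list per query text; the distances are illustrative only.
results = {
    "ids": [["doc1", "doc3"]],
    "documents": [[
        "The capital of France is Paris.",
        "Python is a widely used programming language for machine learning.",
    ]],
    "distances": [[0.25, 1.5]],
}

# Index 0 selects the results for the first (and only) query text
top_docs = results["documents"][0]
context = "\n".join(top_docs)
print(context)
```

Because `query_texts` is a list, the outer list has one entry per query, which is why `get_context()` indexes with `[0]` before joining.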

## Step 6: Load LLaMA Model Locally with llama-cpp-python

In [None]:
from huggingface_hub import hf_hub_download

# TinyLlama requires no authentication or approval
# Replace with desired repo and filename
model_path = hf_hub_download(
    repo_id="TheBloke/TinyLlama-1.1B-Chat-v0.3-GGUF",
    filename="tinyllama-1.1b-chat-v0.3.Q4_K_M.gguf",
    local_dir="./models",
    local_dir_use_symlinks=False
)
print("Downloaded model path:", model_path)

# RAM requirements: 8GB, overkill
# llm = Llama(model_path="./models/llama-2-7b-chat.ggmlv3.q4_0.bin", n_ctx=2048)

# smaller model, quantized version 2-3GB, powerful desktop
#llm = Llama(model_path="./models/llama-2-7b-chat.Q3_K_S.gguf", n_ctx=1024)

# Phi-2 model, 1.7B parameter, teaching code logic, 2.3GB
#llm = Llama(model_path="./models/phi-2.Q4_K_M.gguf", n_ctx=512)

# TinyLlama, 1.1B parameter, classroom demo, 1.3GB
llm = Llama(model_path=model_path, n_ctx=512)



Loads the quantized model file (in GGUF format). Make sure it's compatible with your llama.cpp version.
`n_ctx` is the maximum context size in tokens; the prompt and the generated answer both have to fit within it.
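
Since retrieved context can make prompts long, it may help to sanity-check the token budget before calling the model. The sketch below is only an approximation: `rough_token_count`, `fits_context`, and the 4-characters-per-token ratio are assumptions for illustration, not part of llama-cpp-python.

```python
def rough_token_count(text, chars_per_token=4):
    """Very rough token estimate; the characters-per-token ratio is an assumption."""
    return -(-len(text) // chars_per_token)  # ceiling division


def fits_context(prompt_tokens, max_tokens, n_ctx):
    """True if the prompt plus the requested completion fits in the context window."""
    return prompt_tokens + max_tokens <= n_ctx


prompt = (
    "Context:\nThe capital of France is Paris.\n\n"
    "Question: What is the capital of France?\nAnswer:"
)
estimated = rough_token_count(prompt)
print(estimated, fits_context(estimated, max_tokens=128, n_ctx=512))
```

For an exact count, the loaded model's own tokenizer can be used instead of the heuristic, for example `len(llm.tokenize(prompt.encode("utf-8")))`.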

## Step 7: Define a Prompt Template and Generate Answer

In [None]:
def rag_prompt(query, context):
    return f"""
    Answer the question using only the context below.
    Context:
    {context}

    Question: {query}
    Answer:"""


- `rag_prompt()` creates a structured prompt where the LLM is instructed to use only the given context.
- `generate_answer()`, defined in Step 8, performs the full RAG pipeline: retrieval, prompt construction, and generation. It builds its prompt inline with a similar template.
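
Note that the triple-quoted template in `rag_prompt()` keeps its leading newline and four-space indentation in the text sent to the model. One possible variant, sketched here under the hypothetical name `rag_prompt_clean`, builds the same structure from explicit lines so that no stray whitespace ends up in the prompt.

```python
def rag_prompt_clean(query, context):
    """Hypothetical variant of rag_prompt() without leading indentation."""
    return (
        "Answer the question using only the context below.\n"
        "Context:\n"
        f"{context}\n\n"
        f"Question: {query}\n"
        "Answer:"
    )


prompt = rag_prompt_clean(
    "What is the capital of France?",
    "The capital of France is Paris.",
)
print(prompt)
```

Small models can be sensitive to prompt formatting, so printing the final prompt like this is a cheap way to check exactly what the model will see.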

## Step 8: Try an Example Query

Here's a classroom-robust version of `generate_answer()`, with logging and safe stops, designed to be:

- Readable for students
- Transparent (shows what's happening under the hood)
- Safe from weird LLM behaviors (like answering more than one question)
- Easy to debug and extend

In [None]:
def generate_answer(query, top_k=2, max_tokens=128):
    """
    Generate an answer using the RAG pipeline with logging and safety.

    Parameters:
    - query: the question to answer
    - top_k: how many documents to retrieve from ChromaDB
    - max_tokens: how many tokens to generate in the response

    Returns:
    - A string containing the model's answer
    """

    # Retrieve context
    context = get_context(query, top_k=top_k)
    print("\n--- Retrieved Context ---")
    print(context)

    # Construct prompt
    prompt = f"""You are a helpful assistant. Use the context below to answer the question concisely.

Context:
{context}

Question: {query}
Answer:"""
    print("\n--- Prompt Sent to Model ---")
    print(prompt)

    # Generate model output
    output = llm(prompt, max_tokens=max_tokens, stop=["\nQ:", "\nQuestion:", "\n###"])

    # Log full output structure
    print("\n--- Raw Model Output ---")
    print(output)

    # Extract text and clean it
    answer = output["choices"][0]["text"].strip()

    print("\n--- Final Answer ---")
    print(answer)

    return answer
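
To see why the `stop` list matters, the sketch below imitates its effect in plain Python. llama-cpp-python applies stop sequences itself during generation; `truncate_at_stop` is a hypothetical helper, and the `raw` string is an invented example of a model that keeps going after its first answer.

```python
def truncate_at_stop(text, stop_sequences):
    """Cut text at the earliest occurrence of any stop sequence."""
    cut = len(text)
    for stop in stop_sequences:
        idx = text.find(stop)
        if idx != -1:
            cut = min(cut, idx)
    return text[:cut]


# Invented example: the model answers, then starts inventing a new question
raw = " Paris.\nQuestion: What is the capital of Spain?\nAnswer: Madrid."
answer = truncate_at_stop(raw, ["\nQ:", "\nQuestion:", "\n###"]).strip()
print(answer)
```

Without a stop sequence, small models often continue the question-and-answer pattern until `max_tokens` is reached.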


In [None]:

query = "What is the capital of France?"
response = generate_answer(query)
print("Response:", response)

# Conclusion

With just a few components (an embedding function, a vector store, and a local LLM), you now have a working RAG pipeline.
This pattern is highly extensible: you can swap in your own PDFs, try other embedding models, refine the prompt, or even stream results in real time.

In future applications, RAG will power:
- Personal assistants with access to your private notes and documents
- Customer support bots with up-to-date knowledge bases
- Academic or enterprise search engines
- Local-first and privacy-preserving AI applications

**Students who master RAG will be equipped to build the next generation of intelligent, context-aware systems.**
