# RAG Starter - Building a Question-Answering System

This notebook demonstrates how to build a **Retrieval-Augmented Generation (RAG)** system using:
- **Hugging Face** for the language model (LLM)
- **Minsearch** for document retrieval
- **LangChain** for the chat model wrapper and message handling

## What is RAG?

RAG combines two key components:
1. **Retrieval**: Finding relevant documents from a knowledge base
2. **Generation**: Using an LLM to generate answers based on the retrieved context

This approach allows the LLM to provide accurate, context-aware answers without having all knowledge built into its parameters.
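
To make the two stages concrete, here is a toy sketch of the retrieve-then-generate flow. It is not the pipeline built later in this notebook: `KNOWLEDGE_BASE`, `retrieve`, and `generate` are hypothetical names, the scorer is a plain word-overlap count, and the "LLM" is a stub that only echoes its context.

```python
import re

# Hypothetical two-document knowledge base, for illustration only
KNOWLEDGE_BASE = [
    "Kafka is covered in the streaming module.",
    "Homework can be submitted without registering.",
]


def tokenize(text):
    """Lowercase the text and split it into word tokens."""
    return re.findall(r"\w+", text.lower())


def retrieve(query, documents, top_k=1):
    """Retrieval stage: rank documents by the number of words shared with the query."""
    query_words = set(tokenize(query))
    ranked = sorted(
        documents,
        key=lambda doc: len(query_words & set(tokenize(doc))),
        reverse=True,
    )
    return ranked[:top_k]


def generate(query, context):
    """Generation stage: a stand-in for an LLM call that only echoes its context."""
    return f"Based on the context: {' '.join(context)}"


question = "Which module covers Kafka?"
context = retrieve(question, KNOWLEDGE_BASE)
print(generate(question, context))
```

The rest of the notebook replaces the word-overlap scorer with a Minsearch index and the stub with a hosted model.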

## Step 1: Load Environment Variables

First, we'll load our Hugging Face API token from the `.env` file. This token allows us to access Hugging Face's inference endpoints.

In [None]:
from dotenv import load_dotenv
import os

# Load environment variables from .env file
load_dotenv()

# Verify the token is loaded (shows only first few characters for security)
token = os.getenv('HUGGINGFACEHUB_API_TOKEN')
print(f"Token loaded: {token[:10]}..." if token else "Token not found!")
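
If you would rather stop immediately when the token is missing instead of printing a message, a small helper along these lines may be useful. This is a sketch: `require_env`, `mask`, and the `EXAMPLE_TOKEN` variable are hypothetical names introduced for illustration, not part of `python-dotenv`.

```python
import os


def require_env(name):
    """Return the value of an environment variable, or raise a clear error if it is unset."""
    value = os.getenv(name)
    if not value:
        raise RuntimeError(f"{name} is not set; add it to your .env file or export it in the shell.")
    return value


def mask(secret, visible=4):
    """Show only the first few characters of a secret."""
    return secret[:visible] + "..."


# Demonstration with a made-up variable so the example runs anywhere
os.environ["EXAMPLE_TOKEN"] = "hf_example_not_a_real_token"
print(mask(require_env("EXAMPLE_TOKEN")))
```

In this notebook the call would be `require_env('HUGGINGFACEHUB_API_TOKEN')`, placed after `load_dotenv()`.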

## Step 2: Load and Prepare Document Dataset

We'll download a dataset of FAQ documents from various courses. Each document contains:
- `question`: The FAQ question
- `text`: The answer to the question
- `section`: The section of the course
- `course`: The course name

In [2]:
import requests

docs_url = 'https://github.com/alexeygrigorev/llm-rag-workshop/raw/main/notebooks/documents.json'
docs_response = requests.get(docs_url)
docs_response.raise_for_status()  # fail early if the download did not succeed
documents_raw = docs_response.json()

documents = []

for course in documents_raw:
    course_name = course['course']

    for doc in course['documents']:
        doc['course'] = course_name
        documents.append(doc)
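
The loop above turns a nested "course, then documents" structure into one flat list. The sketch below shows the same transformation on a tiny made-up sample that mirrors that shape. The sample data and the `flatten` helper are assumptions for illustration; unlike the loop above, this variant copies each document instead of modifying it in place.

```python
# Hypothetical sample with the same nesting as the downloaded JSON
sample_raw = [
    {"course": "course-a", "documents": [
        {"question": "Q1", "text": "A1", "section": "S1"},
    ]},
    {"course": "course-b", "documents": [
        {"question": "Q2", "text": "A2", "section": "S2"},
        {"question": "Q3", "text": "A3", "section": "S3"},
    ]},
]


def flatten(courses):
    """Return one flat list of documents, each tagged with its course name."""
    return [
        {**doc, "course": course["course"]}  # copy the document and add the course
        for course in courses
        for doc in course["documents"]
    ]


flat = flatten(sample_raw)
print(len(flat))
print(flat[1])
```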

Let's examine a sample document to understand the data structure:

In [7]:
documents[2]

{'text': "Yes, even if you don't register, you're still eligible to submit the homeworks.\nBe aware, however, that there will be deadlines for turning in the final projects. So don't leave everything for the last minute.",
 'section': 'General course-related questions',
 'question': 'Course - Can I still join the course after the start date?',
 'course': 'data-engineering-zoomcamp'}

## Step 3: Create a Search Index with Minsearch

**Minsearch** is a lightweight, in-memory search library that ranks text fields with TF-IDF and cosine similarity and filters keyword fields by exact match.

We'll create an index that:
- Searches across the `question`, `text`, and `section` fields
- Filters by the `course` keyword
- Supports boost weights at query time, which we'll use to prioritize question matches

In [9]:
import minsearch

index = minsearch.Index(
    text_fields=["question", "text", "section"],
    keyword_fields=["course"]
)

index.fit(documents)

<minsearch.minsearch.Index at 0x7c43acd9fc20>
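
To build intuition for how text fields, keyword filters, and boosts interact, here is a deliberately simplified sketch. It is not Minsearch's implementation: `toy_search` and the sample documents are hypothetical, and the score is a plain count of shared words per field multiplied by that field's boost.

```python
import re


def tokenize(text):
    """Lowercase the text and split it into word tokens."""
    return re.findall(r"\w+", text.lower())


def toy_search(docs, query, text_fields, filter_dict=None, boost_dict=None, num_results=5):
    """Rank docs by boosted word overlap, after exact-match keyword filtering."""
    filter_dict = filter_dict or {}
    boost_dict = boost_dict or {}
    query_tokens = set(tokenize(query))
    scored = []
    for doc in docs:
        # Keyword fields: drop documents that do not match every filter exactly
        if any(doc.get(key) != value for key, value in filter_dict.items()):
            continue
        # Text fields: sum the overlap per field, weighted by its boost (default 1.0)
        score = 0.0
        for field in text_fields:
            overlap = len(query_tokens & set(tokenize(doc.get(field, ""))))
            score += boost_dict.get(field, 1.0) * overlap
        if score > 0:
            scored.append((score, doc))
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [doc for _, doc in scored[:num_results]]


# Made-up documents for illustration
toy_docs = [
    {"question": "How do I install Kafka?", "text": "Use Docker.", "section": "Streaming", "course": "de"},
    {"question": "Where are the videos?", "text": "Kafka videos are on YouTube.", "section": "General", "course": "de"},
    {"question": "How do I install Kafka?", "text": "Not covered.", "section": "Misc", "course": "ml"},
]

results = toy_search(
    toy_docs,
    "kafka install",
    text_fields=["question", "text", "section"],
    filter_dict={"course": "de"},
    boost_dict={"question": 3.0, "section": 0.5},
)
for doc in results:
    print(doc["question"])
```

The first document wins because both query words match its boosted `question` field, while the third is excluded by the `course` filter even though its question is identical.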

### Define Search Function

This function retrieves the top 5 most relevant documents for a given query:
- **boost**: Weights individual fields. Matches in `question` count 3x and matches in `section` count 0.5x, relative to the default weight of 1.0 that applies to `text`
- **filter_dict**: Limits the search to a single course (`data-engineering-zoomcamp`)
- **num_results**: Returns the top 5 matches

In [11]:
def search(query):
    # Fields not listed here keep the default boost of 1.0
    boost = {'question': 3.0, 'section': 0.5}

    results = index.search(
        query=query,
        filter_dict={'course': 'data-engineering-zoomcamp'},
        boost_dict=boost,
        num_results=5
    )

    return results

### Build Prompt for LLM

This function creates a structured prompt for the LLM by:
1. Taking the user's question
2. Formatting the retrieved documents as context
3. Instructing the LLM to answer based only on the provided context

This keeps the LLM grounded in the retrieved facts and reduces, but does not eliminate, the risk of hallucination.

In [12]:
def build_prompt(query, search_results):
    prompt_template = """
You're a course teaching assistant. Answer the QUESTION based on the CONTEXT from the FAQ database.
Use only the facts from the CONTEXT when answering the QUESTION.

QUESTION: {question}

CONTEXT:
{context}
""".strip()

    context = ""

    # Render each retrieved document as a small section/question/answer block
    for doc in search_results:
        context = context + f"section: {doc['section']}\nquestion: {doc['question']}\nanswer: {doc['text']}\n\n"

    prompt = prompt_template.format(question=query, context=context).strip()
    return prompt

## Step 4: Initialize Hugging Face LLM

We'll use **Meta's Llama 3.2 3B Instruct** model through Hugging Face's Inference API. This allows us to:
- Use powerful models without local GPU resources
- Access models via simple API calls
- Leverage Hugging Face's serverless infrastructure

The model is wrapped with **LangChain** for easier message handling and integration.

In [None]:
from langchain_huggingface import HuggingFaceEndpoint, ChatHuggingFace

# Hugging Face model setup
repo_id = "meta-llama/Llama-3.2-3B-Instruct"

# Named `endpoint` so it is not shadowed by the `llm()` helper defined in the next cell
endpoint = HuggingFaceEndpoint(
    repo_id=repo_id,
    task="text-generation",
    max_new_tokens=512,
    do_sample=False,  # greedy decoding for repeatable answers
    repetition_penalty=1.03,
)
client = ChatHuggingFace(llm=endpoint, verbose=False)

### LLM Wrapper Function

This function sends our prompt to the Hugging Face LLM using LangChain's message format:
- **SystemMessage**: Sets the role/behavior of the AI assistant
- **HumanMessage**: Contains the actual user prompt
- Returns only the text content of the response

In [None]:
from langchain_core.messages import HumanMessage, SystemMessage

def llm(prompt):
    response = client.invoke(
        [
            SystemMessage(content="You are a helpful assistant."),
            HumanMessage(content=prompt),
        ]
    )
    return response.content
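
Because the model runs on a remote endpoint, a call can fail for transient reasons such as timeouts or rate limits. One possible way to handle this is a retry helper with exponential backoff. This is a sketch: `call_with_retry` and `flaky_llm` are hypothetical names, and catching every `Exception` is an assumption, since this notebook does not show which errors the endpoint raises.

```python
import time


def call_with_retry(fn, prompt, max_attempts=3, base_delay=1.0, sleep=time.sleep):
    """Call fn(prompt), retrying with exponential backoff when it raises."""
    for attempt in range(1, max_attempts + 1):
        try:
            return fn(prompt)
        except Exception:
            if attempt == max_attempts:
                raise  # out of attempts: surface the last error
            sleep(base_delay * 2 ** (attempt - 1))  # 1x, 2x, 4x, ... base_delay


# Stand-in for a remote call that fails twice before succeeding
calls = {"count": 0}


def flaky_llm(prompt):
    calls["count"] += 1
    if calls["count"] < 3:
        raise ConnectionError("temporary failure")
    return f"answer to: {prompt}"


print(call_with_retry(flaky_llm, "hello", base_delay=0.0))
```

In the notebook, the wrapper defined here could be passed in as `call_with_retry(llm, prompt)`.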

## Step 5: Complete RAG Pipeline

Now we combine everything into a single RAG function that:

1. **Retrieves** relevant documents using the search function
2. **Builds** a context-rich prompt from the retrieved documents
3. **Generates** an answer using the LLM

This is the complete RAG workflow!

In [15]:
def rag(query):
    search_results = search(query)
    prompt = build_prompt(query, search_results)
    answer = llm(prompt)
    return answer
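
When an answer looks wrong, it helps to see what was retrieved and what prompt the model actually received. The sketch below is a hypothetical debugging variant of the pipeline: `rag_debug` takes the three steps as parameters and returns the intermediate values, and the `fake_*` stubs are made up so the example runs without an index or an API token.

```python
def rag_debug(query, search_fn, prompt_fn, llm_fn):
    """Run retrieve -> build prompt -> generate, keeping the intermediate values."""
    search_results = search_fn(query)
    prompt = prompt_fn(query, search_results)
    answer = llm_fn(prompt)
    return {"answer": answer, "prompt": prompt, "sources": search_results}


# Hypothetical stubs standing in for the real search, prompt builder, and LLM
def fake_search(query):
    return [{"section": "General", "question": "Can I join late?", "text": "Yes."}]


def fake_prompt(query, docs):
    context = "\n".join(doc["text"] for doc in docs)
    return f"QUESTION: {query}\nCONTEXT: {context}"


def fake_llm(prompt):
    return f"(stub answer, prompt had {len(prompt)} characters)"


result = rag_debug("can I still enroll?", fake_search, fake_prompt, fake_llm)
print(result["prompt"])
print(result["sources"][0]["question"])
```

With the notebook's own functions, the call would be `rag_debug(query, search, build_prompt, llm)`.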

## Step 6: Test the RAG System

Let's test our RAG system with some questions about the Data Engineering Zoomcamp course:

In [17]:
rag('how do I run kafka?')

'To run Kafka with Java, you need to execute the following command in your project directory:\n\n```sh\njava -cp build/libs/<jar_name>-1.0-SNAPSHOT.jar:out src/main/java/org/example/JsonProducer.java\n```\n\nIf you\'re using Python and encounter the "Module \'kafka\' not found" error when trying to run `producer.py`, you should create a virtual environment, activate it, and install the required packages as per `requirements.txt`:\n\n1. Create a virtual environment (run only once):\n   ```sh\n   python -m venv env\n   ```\n\n2. Activate the virtual environment:\n   - On MacOS/Linux:\n     ```sh\n     source env/bin/activate\n     ```\n   - On Windows:\n     ```sh\n     env\\Scripts\\activate\n     ```\n\n3. Install the necessary packages:\n   ```sh\n   pip install -r ../requirements.txt\n   ```\n\nEnsure your Docker images are running if required. To deactivate the virtual environment when done, use:\n```sh\ndeactivate\n```'

In [18]:
rag('the course has already started, can I still enroll?')

'Yes, you can still enroll in the course even if it has already started. You are also eligible to submit the homework assignments. However, make sure to pay attention to the deadlines for turning in the final projects to avoid leaving everything until the last minute.'

## Summary

### What We Built
- A complete RAG (Retrieval-Augmented Generation) system
- Text-based document search using Minsearch
- LLM-powered question answering with Hugging Face

### Why Hugging Face Endpoints?
Instead of running models locally (which requires expensive GPUs), we use Hugging Face's **Inference API**:
- ✓ **No local GPU needed** - runs on Hugging Face's infrastructure
- ✓ **Fast deployment** - no model downloads or setup
- ✓ **Cost-effective** - pay per request or use the free tier
- ✓ **Scalable** - handles multiple requests automatically
- ✓ **Easy switching** - change models by just updating `repo_id`

### Next Steps
- Try `03_rag.ipynb` to see how **vector search** improves retrieval
- Experiment with different questions
- Try other Hugging Face models (see available models at https://huggingface.co/models)

### Key Takeaways
1. RAG combines retrieval + generation for better answers
2. The prompt tells the LLM to answer only from the retrieved documents, which reduces (but does not eliminate) hallucination
3. Hugging Face endpoints make using powerful models accessible without local hardware