# RAG (Retrieval-Augmented Generation) - Complete Workflow

## What is RAG?
RAG combines **retrieval** (searching for relevant information) with **generation** (using an LLM to create answers). Instead of the LLM answering from its training data alone, it first retrieves relevant context from a knowledge base.

## RAG Workflow (3 Main Steps):
1. **📚 INDEXING**: Store and organize documents in a searchable format
2. **🔍 RETRIEVAL**: Search the index to find relevant context for the user's question
3. **🤖 GENERATION**: Send the question + retrieved context to the LLM to generate an answer
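
The three steps above can be sketched end to end without any real search engine or LLM. The snippet below is only a toy illustration: the documents are made up, retrieval is plain word overlap, and `generate` is a stand-in that assembles a prompt instead of calling a model. All function names are hypothetical and not part of this notebook's pipeline.

```python
# Toy illustration of the three RAG steps; no real search engine or LLM is involved.

def build_index(docs):
    """INDEXING: map each document position to the set of lowercase words it contains."""
    return {i: set(doc.lower().split()) for i, doc in enumerate(docs)}

def retrieve(toy_index, docs, question, k=1):
    """RETRIEVAL: rank documents by how many words they share with the question."""
    q_words = set(question.lower().split())
    ranked = sorted(toy_index, key=lambda i: len(toy_index[i] & q_words), reverse=True)
    return [docs[i] for i in ranked[:k]]

def generate(question, context):
    """GENERATION: stand-in for an LLM call; it only assembles the prompt text."""
    return f"QUESTION: {question}\nCONTEXT: {' '.join(context)}"

# Made-up documents, for illustration only
toy_docs = [
    "You can enroll after the course has started.",
    "Kafka is covered in a later module.",
]
toy_index = build_index(toy_docs)
toy_question = "can I still enroll in the course"
print(generate(toy_question, retrieve(toy_index, toy_docs, toy_question)))
```

The rest of the notebook replaces each toy piece with a real component: MinSearch or Elasticsearch for indexing and retrieval, and a Hugging Face model for generation.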

## This Notebook:
- Uses **Hugging Face Llama-3.2-3B-Instruct** as the LLM
- Demonstrates indexing with **MinSearch** (simple) and **Elasticsearch** (advanced)
- Shows the complete RAG pipeline step by step

**Prerequisites:** Run Hugging Face login from `setup.ipynb` first.

---
## 🔧 Setup & Installation
Install the packages required for the RAG pipeline.

In [36]:
!pip install minsearch langchain-huggingface elasticsearch tqdm


In [37]:
import minsearch
import json

---
## 📚 STEP 1: LOAD & PREPARE DATA
Before indexing, we need to load our knowledge base (FAQ documents).

In [38]:
# Load raw documents from JSON file
with open('documents.json', 'rt') as f_in:
    docs_raw = json.load(f_in)

In [39]:
# Flatten the structure: each document gets its course label
documents = []

for course_dict in docs_raw:
    for doc in course_dict['documents']:
        doc['course'] = course_dict['course']
        documents.append(doc)

In [40]:
# Inspect a sample document
documents[0]

{'text': "The purpose of this document is to capture frequently asked technical questions\nThe exact day and hour of the course will be 15th Jan 2024 at 17h00. The course will start with the first  “Office Hours'' live.1\nSubscribe to course public Google Calendar (it works from Desktop only).\nRegister before the course starts using this link.\nJoin the course Telegram channel with announcements.\nDon’t forget to register in DataTalks.Club's Slack and join the channel.",
 'section': 'General course-related questions',
 'question': 'Course - When will the course start?',
 'course': 'data-engineering-zoomcamp'}
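
Before indexing, it can help to confirm that every flattened document carries the four fields used later (`text`, `section`, `question`, `course`). The helper below is a minimal sketch; `summarize` and `REQUIRED_FIELDS` are hypothetical names, and the two sample records are invented so the block runs on its own.

```python
from collections import Counter

# Fields the index and the prompt builder rely on later in the notebook
REQUIRED_FIELDS = {"text", "section", "question", "course"}

def summarize(documents):
    """Return (documents per course, positions of documents missing a required field)."""
    per_course = Counter(doc.get("course") for doc in documents)
    incomplete = [
        i for i, doc in enumerate(documents)
        if not REQUIRED_FIELDS.issubset(doc)
    ]
    return per_course, incomplete

# Invented sample records; the second one lacks "question"
sample = [
    {"text": "a", "section": "s", "question": "q", "course": "data-engineering-zoomcamp"},
    {"text": "b", "section": "s", "course": "data-engineering-zoomcamp"},
]
per_course, incomplete = summarize(sample)
print(per_course)
print(incomplete)
```

In the notebook, the same call would be `summarize(documents)`.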

---
## 🤖 STEP 2: SETUP LLM (Large Language Model)
Configure the Hugging Face LLM that will generate answers from the retrieved context.

In [41]:
# Import Hugging Face libraries
from langchain_huggingface import HuggingFaceEndpoint, ChatHuggingFace
from langchain_core.messages import HumanMessage, SystemMessage, AIMessage

In [42]:
# Initialize Llama-3.2-3B model via Hugging Face API
llm_model = HuggingFaceEndpoint(
    repo_id="meta-llama/Llama-3.2-3B-Instruct",
    task="text-generation",
    max_new_tokens=512,
    temperature=0.7,
    do_sample=True,
    repetition_penalty=1.03,
)

# Create chat client
client = ChatHuggingFace(llm=llm_model, verbose=False)

---
## 📇 STEP 3: INDEXING - Create Searchable Knowledge Base
Index documents using MinSearch (simple in-memory search engine).

In [43]:
# Create index with searchable fields
index = minsearch.Index(
    text_fields=["question", "text", "section"],  # Fields to search in
    keyword_fields=["course"]  # Fields for exact filtering
)

In [44]:
# Fit the index with our documents (this creates the searchable index)
index.fit(documents)

<minsearch.minsearch.Index at 0x726bbf78e300>

In [45]:
# Sample question to test
q = 'the course has already started, can I still enroll?'

### 🧪 Test: LLM Without Context (No RAG)
Let's see what the LLM answers without any context - it will likely be generic or incorrect.

In [46]:
# Ask LLM directly without retrieved context
messages = [("human", q)]
response = client.invoke(messages)
response.content

"I'm not aware of any specific course you're referring to. Could you please provide more context or information about the course, such as the name, provider, or institution? That way, I can try to help you with your enrollment inquiry."

Notice: The answer above is generic because the LLM doesn't have specific course information. **This is why we need RAG!**

---
## 🔍 STEP 4: RETRIEVAL - Search for Relevant Context
Define a search function to find relevant documents from our indexed knowledge base.

In [47]:
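def search(query):
    # Matches in 'question' count 3x and matches in 'section' count 0.5x;
    # 'text' is not listed, so it keeps the default weight
    boost = {'question': 3.0, 'section': 0.5}

    # Restrict results to one course and return the five best matches
    results = index.search(
        query=query,
        filter_dict={'course': 'data-engineering-zoomcamp'},
        boost_dict=boost,
        num_results=5
    )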

    return results
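
To make the roles of `filter_dict` and `boost_dict` concrete, here is a toy scorer. It is not how MinSearch is implemented: a simple word-match count stands in for its text similarity, and unlisted fields are assumed to default to a weight of 1.0. The function names and sample documents are hypothetical.

```python
def field_score(query, text):
    """Count distinct query words found in the field (crude stand-in for text similarity)."""
    words = set(text.lower().split())
    return sum(1 for w in set(query.lower().split()) if w in words)

def toy_search(docs, query, text_fields, filter_dict, boost_dict, num_results=5):
    scored = []
    for doc in docs:
        # Keyword filter: documents that do not match exactly are dropped
        if any(doc.get(key) != value for key, value in filter_dict.items()):
            continue
        # Weighted sum over text fields; fields missing from boost_dict get 1.0
        score = sum(
            boost_dict.get(field, 1.0) * field_score(query, doc[field])
            for field in text_fields
        )
        scored.append((score, doc))
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [doc for _, doc in scored[:num_results]]

# Invented documents for illustration
toy_docs = [
    {"question": "Where are the notes", "text": "Run kafka with docker compose",
     "section": "Streaming", "course": "de"},
    {"question": "How do I run Kafka", "text": "See the module notes",
     "section": "Streaming", "course": "de"},
    {"question": "How do I run Kafka", "text": "See the module notes",
     "section": "Streaming", "course": "ml"},
]
hits = toy_search(
    toy_docs, "run kafka",
    text_fields=["question", "text", "section"],
    filter_dict={"course": "de"},
    boost_dict={"question": 3.0, "section": 0.5},
)
print([h["question"] for h in hits])
```

The document whose question matches ranks first because of the 3.0 boost, and the `ml` document never appears because the filter removes it before scoring.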

In [48]:
def build_prompt(query, search_results):
    prompt_template = """
You're a course teaching assistant. Answer the QUESTION based on the CONTEXT from the FAQ database.
Use only the facts from the CONTEXT when answering the QUESTION.

QUESTION: {question}

CONTEXT: 
{context}
""".strip()

    context = ""
    
    for doc in search_results:
        context = context + f"section: {doc['section']}\nquestion: {doc['question']}\nanswer: {doc['text']}\n\n"
    
    prompt = prompt_template.format(question=query, context=context).strip()
    return prompt
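
To see what the LLM actually receives, the sketch below rebuilds the same prompt with `str.join` and prints it for one retrieved document. `build_prompt_joined` is a hypothetical name chosen to avoid shadowing `build_prompt`, and the example document is invented.

```python
PROMPT_TEMPLATE = """
You're a course teaching assistant. Answer the QUESTION based on the CONTEXT from the FAQ database.
Use only the facts from the CONTEXT when answering the QUESTION.

QUESTION: {question}

CONTEXT:
{context}
""".strip()

def build_prompt_joined(query, search_results):
    # One "section / question / answer" entry per retrieved document,
    # separated by a blank line
    context = "\n\n".join(
        f"section: {doc['section']}\nquestion: {doc['question']}\nanswer: {doc['text']}"
        for doc in search_results
    )
    return PROMPT_TEMPLATE.format(question=query, context=context).strip()

# Hypothetical retrieved document, invented for this example
example_results = [{
    "section": "General course-related questions",
    "question": "Can I join late?",
    "text": "Yes, late enrollment is allowed.",
    "course": "data-engineering-zoomcamp",
}]
example_prompt = build_prompt_joined("can I still enroll?", example_results)
print(example_prompt)
```

If the search returns no documents, the CONTEXT section is empty, so it may be worth checking for that case before calling the LLM.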

In [49]:
def llm(prompt):
    messages = [
        ("human", prompt)
    ]
    response = client.invoke(messages)
    return response.content

In [50]:
query = 'how do I run kafka?'

def rag(query):
    search_results = search(query)
    prompt = build_prompt(query, search_results)
    answer = llm(prompt)
    return answer

In [51]:
rag(query)

'Based on the provided contexts, I can help you with the following questions:\n\n1. How do I run Kafka?\n\n   Context: \n   - For Java Kafka: In the project directory, run: `java -cp build/libs/<jar_name>-1.0-SNAPSHOT.jar:out src/main/java/org/example/JsonProducer.java`\n   - For Python Kafka: Not explicitly mentioned in the provided contexts. However, it\'s implied that running the Python files in a virtual environment (as described in the context for `producer.py`) would work.\n\n2. How do I run producer.py?\n\n   Context: \n   Solution from Alexey: Create a virtual environment and run `requirements.txt` and the Python files in that environment.\n   To create a virtual env and install packages (run only once):\n   `python -m venv env`\n   `source env/bin/activate`\n   `pip install -r ../requirements.txt`\n   To activate it (you\'ll need to run it every time you need the virtual env):\n   `source env/bin/activate`\n   To deactivate it:\n   `deactivate`\n\n3. How do I install the neces

In [52]:
rag('the course has already started, can I still enroll?')

'You can still enroll in the course after it has started.'
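
Because `rag` is just three function calls in sequence, the retrieval and generation parts can be passed in as arguments. The sketch below assumes nothing about MinSearch or Hugging Face; `make_rag` and the stub functions are hypothetical and exist only to show the composition and to make the flow testable without a model.

```python
def make_rag(search_fn, prompt_fn, llm_fn):
    """Compose search -> prompt -> LLM into a single rag(query) function."""
    def rag_pipeline(query):
        search_results = search_fn(query)
        prompt = prompt_fn(query, search_results)
        return llm_fn(prompt)
    return rag_pipeline

# Stubs standing in for the real components
def stub_search(query):
    return [{"text": "Yes, you can still enroll."}]

def stub_prompt(query, search_results):
    context = " ".join(doc["text"] for doc in search_results)
    return f"QUESTION: {query}\nCONTEXT: {context}"

def stub_llm(prompt):
    # Echo the prompt instead of calling a model
    return f"[stub answer based on]\n{prompt}"

stub_rag = make_rag(stub_search, stub_prompt, stub_llm)
print(stub_rag("can I still enroll?"))
```

With the functions defined in this notebook, `make_rag(search, build_prompt, llm)` would behave like the `rag` function above, and swapping in a different search function changes only the first argument.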

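---
## 🔎 STEP 5: ELASTICSEARCH - Index and Search with Elasticsearch
The same pipeline works with Elasticsearch in place of MinSearch. Start a single-node Elasticsearch container before running the cells below (run this in a terminal, not in a notebook cell):

docker run -it \
    --rm \
    --name elasticsearch \
    -m 4GB \
    -p 9200:9200 \
    -p 9300:9300 \
    -e "discovery.type=single-node" \
    -e "xpack.security.enabled=false" \
    docker.elastic.co/elasticsearch/elasticsearch:8.4.3

In [53]:
from elasticsearch import Elasticsearch
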

In [57]:
es_client = Elasticsearch('http://localhost:9200') 

In [58]:
index_settings = {
    "settings": {
        "number_of_shards": 1,
        "number_of_replicas": 0
    },
    "mappings": {
        "properties": {
            "text": {"type": "text"},
            "section": {"type": "text"},
            "question": {"type": "text"},
            "course": {"type": "keyword"} 
        }
    }
}

index_name = "course-questions"

es_client.indices.create(index=index_name, body=index_settings)

ObjectApiResponse({'acknowledged': True, 'shards_acknowledged': True, 'index': 'course-questions'})
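
Re-running the cell above fails once the index exists. One way to make it repeatable is to check `indices.exists` first. This is a minimal sketch: `ensure_index` is a hypothetical helper, and `FakeClient` / `FakeIndices` are in-memory stand-ins used only so the example runs without a cluster.

```python
def ensure_index(es_client, index_name, index_settings):
    """Create the index only if it is missing; return True when it was created."""
    if es_client.indices.exists(index=index_name):
        return False
    es_client.indices.create(index=index_name, body=index_settings)
    return True

class FakeIndices:
    """In-memory stand-in for es_client.indices (illustration only)."""
    def __init__(self):
        self.created = {}

    def exists(self, index):
        return index in self.created

    def create(self, index, body):
        self.created[index] = body

class FakeClient:
    """Stand-in for the Elasticsearch client (illustration only)."""
    def __init__(self):
        self.indices = FakeIndices()

fake_client = FakeClient()
first = ensure_index(fake_client, "course-questions", {"settings": {}})
second = ensure_index(fake_client, "course-questions", {"settings": {}})
print(first, second)
```

With the real client, the call would be `ensure_index(es_client, index_name, index_settings)`.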

In [59]:
from tqdm.auto import tqdm

In [60]:
for doc in tqdm(documents):
    es_client.index(index=index_name, document=doc)

  0%|          | 0/948 [00:00<?, ?it/s]
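
Indexing one document per request works but makes a round trip for each document. The `elasticsearch.helpers.bulk` helper accepts an iterable of actions and sends them in batches. The sketch below only builds the actions so it runs without a cluster; `to_bulk_actions` is a hypothetical name, and the commented lines show how it could be passed to `bulk`.

```python
def to_bulk_actions(documents, index_name):
    """Yield one index action per document in the shape the bulk helper expects."""
    for doc in documents:
        yield {"_index": index_name, "_source": doc}

# Invented sample documents
sample_docs = [
    {"text": "a", "section": "s", "question": "q1", "course": "data-engineering-zoomcamp"},
    {"text": "b", "section": "s", "question": "q2", "course": "data-engineering-zoomcamp"},
]
actions = list(to_bulk_actions(sample_docs, "course-questions"))
print(actions[0])

# Against a running cluster (not executed here):
# from elasticsearch.helpers import bulk
# bulk(es_client, to_bulk_actions(documents, index_name))
```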

In [61]:
query = 'I just discovered the course. Can I still join it?'

In [62]:
def elastic_search(query):
    search_query = {
        "size": 5,
        "query": {
            "bool": {
                "must": {
                    "multi_match": {
                        "query": query,
                        "fields": ["question^3", "text", "section"],
                        "type": "best_fields"
                    }
                },
                "filter": {
                    "term": {
                        "course": "data-engineering-zoomcamp"
                    }
                }
            }
        }
    }

    response = es_client.search(index=index_name, body=search_query)
    
    result_docs = []
    
    for hit in response['hits']['hits']:
        result_docs.append(hit['_source'])
    
    return result_docs
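
The query body in `elastic_search` hard-codes the course and the result count. Separating the body into its own function makes those values parameters and lets the structure be inspected without a running cluster. `build_search_query` is a hypothetical helper; the query shape is the same as above.

```python
def build_search_query(query, course, size=5):
    """Return the Elasticsearch request body for a boosted, course-filtered search."""
    return {
        "size": size,
        "query": {
            "bool": {
                # Full-text match; "question^3" weights question matches 3x
                "must": {
                    "multi_match": {
                        "query": query,
                        "fields": ["question^3", "text", "section"],
                        "type": "best_fields",
                    }
                },
                # Exact match on the keyword field; does not affect scoring
                "filter": {"term": {"course": course}},
            }
        },
    }

body = build_search_query("Can I still join?", "data-engineering-zoomcamp", size=3)
print(body["query"]["bool"]["filter"])
```

Inside `elastic_search`, the result could be passed as `es_client.search(index=index_name, body=body)`.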

In [63]:
def rag(query):
    search_results = elastic_search(query)
    prompt = build_prompt(query, search_results)
    answer = llm(prompt)
    return answer

In [64]:
rag(query)

"You can join the course after the start date, even if you don't register."