# RAG - Retrieval & Search


Install packages


In [1]:
!uv pip install -q \
    python-dotenv==1.2.1 \
    pandas==2.3.2 \
    pandas-stubs==2.3.2.250827 \
    numpy==2.3.2 \
    matplotlib==3.10.6 \
    seaborn==0.13.2 \
    scikit-learn==1.7.1 \
    tqdm==4.67.1 \
    requests==2.32.5 \
    litellm==1.78.5 \
    elasticsearch==8.19.3

Download the sample search engine


In [2]:
!test -f minsearch.py || wget https://raw.githubusercontent.com/alexeygrigorev/minsearch/refs/heads/main/minsearch.py

Import packages


In [None]:
import json

import litellm
import minsearch
import requests
from dotenv import load_dotenv
from elasticsearch import Elasticsearch
from tqdm import tqdm

load_dotenv()

True

Download documents


In [None]:
docs_url = "https://github.com/alexeygrigorev/llm-rag-workshop/raw/main/notebooks/documents.json"
docs_response = requests.get(docs_url, timeout=30)
docs_response.raise_for_status()  # fail early on a bad HTTP status
documents_raw = docs_response.json()

documents = []

for course in documents_raw:
    course_name = course["course"]
    for doc in course["documents"]:
        doc["course"] = course_name
        documents.append(doc)

documents[0]

{'text': "The purpose of this document is to capture frequently asked technical questions\nThe exact day and hour of the course will be 15th Jan 2024 at 17h00. The course will start with the first  “Office Hours'' live.1\nSubscribe to course public Google Calendar (it works from Desktop only).\nRegister before the course starts using this link.\nJoin the course Telegram channel with announcements.\nDon’t forget to register in DataTalks.Club's Slack and join the channel.",
 'section': 'General course-related questions',
 'question': 'Course - When will the course start?',
 'course': 'data-engineering-zoomcamp'}
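
The flattening step can be tried on a tiny inline sample without downloading anything. The sketch below assumes the same nested shape as `documents.json` (a list of courses, each holding a `documents` list). The sample records and the `flatten_documents` name are made up for illustration. Unlike the cell above, it copies each record instead of mutating it.

```python
def flatten_documents(documents_raw):
    """Return one flat list of FAQ records, each tagged with its course name."""
    flat = []
    for course in documents_raw:
        for doc in course["documents"]:
            # Copy the record so the raw input stays untouched.
            flat.append({**doc, "course": course["course"]})
    return flat


# Hypothetical sample in the same nested shape as documents.json.
sample_raw = [
    {
        "course": "course-a",
        "documents": [
            {"text": "Answer 1", "section": "General", "question": "Q1?"},
            {"text": "Answer 2", "section": "General", "question": "Q2?"},
        ],
    },
    {
        "course": "course-b",
        "documents": [
            {"text": "Answer 3", "section": "Setup", "question": "Q3?"},
        ],
    },
]

sample_docs = flatten_documents(sample_raw)
print(len(sample_docs))
print(sample_docs[0])
```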

## Search Engine


Instantiate the search engine


In [None]:
index = minsearch.Index(
    text_fields=["section", "question", "text"], keyword_fields=["course"]
)

Define the question


In [None]:
q = "The course has already started, can I still enroll?"

Fit the index (vectorize the documents)


In [None]:
index.fit(documents)

<minsearch.Index at 0x7fcb212b6550>

Search


In [None]:
boost = {"question": 3.0, "section": 0.5}

results = index.search(
    query=q,
    boost_dict=boost,
    filter_dict={"course": "data-engineering-zoomcamp"},
)
results

[{'text': "Yes, even if you don't register, you're still eligible to submit the homeworks.\nBe aware, however, that there will be deadlines for turning in the final projects. So don't leave everything for the last minute.",
  'section': 'General course-related questions',
  'question': 'Course - Can I still join the course after the start date?',
  'course': 'data-engineering-zoomcamp'},
 {'text': 'Yes, we will keep all the materials after the course finishes, so you can follow the course at your own pace after it finishes.\nYou can also continue looking at the homeworks and continue preparing for the next cohort. I guess you can also start working on your final capstone project.',
  'section': 'General course-related questions',
  'question': 'Course - Can I follow the course after it finishes?',
  'course': 'data-engineering-zoomcamp'},
 {'text': "The purpose of this document is to capture frequently asked technical questions\nThe exact day and hour of the course will be 15th Jan 202
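
To make `boost_dict` and `filter_dict` less of a black box, here is a simplified pure-Python sketch of the general idea: score every text field by TF-IDF cosine similarity, multiply each field's score by its boost, add the results, and keep only documents whose keyword fields match the filter exactly. This is an illustration under those assumptions, not minsearch's actual code. `toy_search`, the helper names, and the sample documents are all hypothetical.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and split it into alphanumeric tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def tfidf_vector(tokens, idf):
    """Sparse TF-IDF vector; tokens outside the vocabulary are dropped."""
    counts = Counter(tokens)
    return {t: c * idf[t] for t, c in counts.items() if t in idf}


def cosine(a, b):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(w * b.get(t, 0.0) for t, w in a.items())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def toy_search(docs, query, text_fields, boost=None, filters=None, num_results=5):
    boost = boost or {}
    filters = filters or {}
    scores = [0.0] * len(docs)
    for field in text_fields:
        tokenized = [tokenize(d[field]) for d in docs]
        doc_freq = Counter(t for toks in tokenized for t in set(toks))
        # Smoothed IDF, so a term present in every document keeps some weight.
        idf = {t: math.log((1 + len(docs)) / (1 + df)) + 1 for t, df in doc_freq.items()}
        query_vec = tfidf_vector(tokenize(query), idf)
        for i, toks in enumerate(tokenized):
            # Each field contributes its similarity scaled by its boost (default 1.0).
            scores[i] += boost.get(field, 1.0) * cosine(query_vec, tfidf_vector(toks, idf))
    ranked = sorted(range(len(docs)), key=lambda i: scores[i], reverse=True)
    # Keep only scored documents whose keyword fields match the filter exactly.
    hits = [
        i for i in ranked
        if scores[i] > 0 and all(docs[i][k] == v for k, v in filters.items())
    ]
    return [docs[i] for i in hits[:num_results]]


# Hypothetical documents, for illustration only.
toy_docs = [
    {"question": "Can I still join the course?", "text": "Yes, you can join late.", "course": "de"},
    {"question": "How do I install Docker?", "text": "Follow the official guide.", "course": "de"},
    {"question": "Can I still join the course?", "text": "Yes, enrollment is open.", "course": "ml"},
]

hits = toy_search(
    toy_docs,
    "can I join the course",
    text_fields=["question", "text"],
    boost={"question": 3.0},
    filters={"course": "de"},
)
print([h["question"] for h in hits])
```

Raising the boost on `question` makes a match in the question text count for more than the same match in the answer body, which is the effect the `boost` dictionary in the cell above aims for.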

## Generation


Ask the LLM without retrieved context (baseline)


In [None]:
messages = [
    {
        "role": "user",
        "content": q,
    },
]


completion = litellm.completion(
    model="gemini/gemini-2.5-flash",
    messages=messages,
)

response = completion.choices[0].message.content
print(response)

It's definitely worth checking! While it depends heavily on the specific course, institution, and how long it's been running, late enrollment is often possible.

Here's what you should consider and do:

1.  **How long ago did it start?**
    *   **A few days to a week:** Very likely possible. You might just have a bit of catching up to do.
    *   **1-3 weeks:** Still often possible, especially if the course isn't heavily cumulative or has a modular structure. You'll need to commit to catching up quickly.
    *   **More than 3-4 weeks (or a significant portion of the course):** Less likely, but not impossible. It depends on the course's intensity, missed assignments, and the institution's policies.

2.  **Type of Course/Institution:**
    *   **Universities/Colleges:** Often have strict "add/drop" deadlines, but sometimes exceptions are made with instructor permission and/or late fees. You'd contact the admissions office, registrar, or the specific department.
    *   **Online Platform

Prompt template


In [None]:
prompt_template = """
You're a course teaching assistant. Answer the question based on the CONTEXT.
Use only the facts from the CONTEXT when answering the QUESTION.
If the context doesn't contain the answer, output NONE

QUESTION: {question}

CONTEXT:
{context}
"""

Build the context from the search results


In [None]:
context = ""

for doc in results:
    context = (
        context
        + f"section: {doc['section']}\nquestion: {doc['question']}\nanswer: {doc['text']}\n\n"
    )

In [None]:
prompt = prompt_template.format(question=q, context=context).strip()

In [None]:
messages = [
    {
        "role": "user",
        "content": prompt,
    },
]


completion = litellm.completion(
    model="gemini/gemini-2.5-flash",
    messages=messages,
)

response = completion.choices[0].message.content
print(response)

Yes, even if you don't register, you're still eligible to submit the homeworks. However, be aware that there will be deadlines for turning in the final projects. You don't need a confirmation email, as you are accepted, and you can just start learning and submitting homework without registering. Registration is just to gauge interest before the start date.
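
The prompt template asks the model to output `NONE` when the context does not contain the answer. One way a caller might act on that sentinel is sketched below. `parse_answer` is a hypothetical helper, and it assumes the model returns the bare word, possibly with surrounding whitespace, different casing, or a trailing period.

```python
from typing import Optional


def parse_answer(response: str) -> Optional[str]:
    """Map the NONE sentinel to Python's None; pass other answers through."""
    cleaned = response.strip()
    # Tolerate minor formatting drift such as "none" or "NONE."
    if cleaned.rstrip(".").upper() == "NONE":
        return None
    return cleaned


print(parse_answer(" NONE\n"))
print(parse_answer("Yes, you can still join."))
```

Returning `None` lets the calling code decide what to do next, for example show a fallback message or retry with a broader search.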


Define search function


In [None]:
def search(query):
    boost = {"question": 3.0, "section": 0.5}

    results = index.search(
        query=query,
        filter_dict={"course": "data-engineering-zoomcamp"},
        boost_dict=boost,
        num_results=5,
    )

    return results

Sample search


In [None]:
results = search("How do I run Kafka?")
results

[{'text': 'In the project directory, run:\njava -cp build/libs/<jar_name>-1.0-SNAPSHOT.jar:out src/main/java/org/example/JsonProducer.java',
  'section': 'Module 6: streaming with kafka',
  'question': 'Java Kafka: How to run producer/consumer/kstreams/etc in terminal',
  'course': 'data-engineering-zoomcamp'},
 {'text': "Solution from Alexey: create a virtual environment and run requirements.txt and the python files in that environment.\nTo create a virtual env and install packages (run only once)\npython -m venv env\nsource env/bin/activate\npip install -r ../requirements.txt\nTo activate it (you'll need to run it every time you need the virtual env):\nsource env/bin/activate\nTo deactivate it:\ndeactivate\nThis works on MacOS, Linux and Windows - but for Windows the path is slightly different (it's env/Scripts/activate)\nAlso the virtual environment should be created only to run the python file. Docker images should first all be up and running.",
  'section': 'Module 6: streaming wi

In [None]:
def build_prompt(query, search_results):
    # Template lines are flush-left so the prompt carries no stray indentation.
    prompt_template = """
You're a course teaching assistant. Answer the question based on the CONTEXT.
Use only the facts from the CONTEXT when answering the QUESTION.

QUESTION: {question}

CONTEXT:
{context}
"""
    context = ""

    # One block per retrieved document: section, question, and answer text.
    for doc in search_results:
        context = (
            context
            + f"section: {doc['section']}\nquestion: {doc['question']}\nanswer: {doc['text']}\n\n"
        )

    prompt = prompt_template.format(question=query, context=context).strip()

    return prompt

In [None]:
def llm(prompt):
    messages = [
        {
            "role": "user",
            "content": prompt,
        },
    ]

    completion = litellm.completion(
        model="gemini/gemini-2.5-flash",
        messages=messages,
    )

    return completion.choices[0].message.content

In [None]:
query = "How do I run kafka?"


def rag(query):
    search_results = search(query)
    prompt = build_prompt(query, search_results)
    answer = llm(prompt)
    return answer


print(rag(query))

I'm sorry, but the provided context does not contain information on how to run Kafka itself. It only provides instructions for running Java producers/consumers/kstreams, setting up Python virtual environments for Kafka clients, or troubleshooting errors related to client applications.
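
The pipeline has three stages: retrieve, build the prompt, generate. Passing the retriever and the model in as arguments is one way to make that structure explicit and to exercise it without a network call. In the sketch below, `rag_pipeline`, `format_context`, `fake_search`, and `fake_llm` are hypothetical stand-ins, not part of the notebook, and the prompt is shortened for brevity.

```python
from typing import Callable, Dict, List

Doc = Dict[str, str]


def format_context(docs: List[Doc]) -> str:
    """Render retrieved documents in the section/question/answer layout."""
    return "\n\n".join(
        f"section: {d['section']}\nquestion: {d['question']}\nanswer: {d['text']}"
        for d in docs
    )


def rag_pipeline(
    query: str,
    search_fn: Callable[[str], List[Doc]],
    llm_fn: Callable[[str], str],
) -> str:
    docs = search_fn(query)  # 1. retrieve
    prompt = f"QUESTION: {query}\n\nCONTEXT:\n{format_context(docs)}"  # 2. build the prompt
    return llm_fn(prompt)  # 3. generate


def fake_search(query: str) -> List[Doc]:
    # Hypothetical retriever that always returns one canned document.
    return [
        {
            "section": "Module 6",
            "question": "How to run the producer?",
            "text": "Run producer.py.",
        }
    ]


def fake_llm(prompt: str) -> str:
    # Hypothetical model that echoes the prompt, so the input it would see is visible.
    return prompt


answer = rag_pipeline("How do I run kafka?", fake_search, fake_llm)
print(answer)
```

With this shape, switching search backends only changes the `search_fn` argument; the prompt and generation steps stay the same.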


## Elasticsearch

```sh
docker run --rm -it \
  --name elasticsearch \
  -p 9200:9200 \
  -p 9300:9300 \
  -e "discovery.type=single-node" \
  -e "xpack.security.enabled=false" \
  -e "xpack.security.http.ssl.enabled=false" \
  -e "xpack.security.transport.ssl.enabled=false" \
  -e "ES_JAVA_OPTS=-Xms512m -Xmx512m" \
  docker.elastic.co/elasticsearch/elasticsearch:8.5.1
```


In [None]:
es_client = Elasticsearch(
    "http://localhost:9200",
)

In [None]:
es_client.info()

ObjectApiResponse({'name': 'dccb7778e1c5', 'cluster_name': 'docker-cluster', 'cluster_uuid': 'S0MEfvWuQYCo5YItsMnDsg', 'version': {'number': '8.5.1', 'build_flavor': 'default', 'build_type': 'docker', 'build_hash': 'c1310c45fc534583afe2c1c03046491efba2bba2', 'build_date': '2022-11-09T21:02:20.169855900Z', 'build_snapshot': False, 'lucene_version': '9.4.1', 'minimum_wire_compatibility_version': '7.17.0', 'minimum_index_compatibility_version': '7.0.0'}, 'tagline': 'You Know, for Search'})

In [None]:
index_settings = {
    "settings": {"number_of_shards": 1, "number_of_replicas": 0},
    "mappings": {
        "properties": {
            "text": {"type": "text"},
            "section": {"type": "text"},
            "question": {"type": "text"},
            "course": {"type": "keyword"},
        }
    },
}

index_name = "course-questions"

es_client.indices.create(index=index_name, body=index_settings)

ObjectApiResponse({'acknowledged': True, 'shards_acknowledged': True, 'index': 'course-questions'})

In [None]:
for doc in tqdm(documents):
    es_client.index(index=index_name, document=doc)

100%|██████████| 948/948 [00:36<00:00, 25.67it/s]
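
Indexing one document per request took about 36 seconds for 948 documents in the run above. The Python client also ships `elasticsearch.helpers.bulk`, which batches documents into fewer requests. A minimal sketch, assuming the same `es_client`, `documents`, and `index_name` as above; `make_bulk_actions` and `bulk_index` are hypothetical helper names, and the import is deferred so the action generator can be inspected without the client installed.

```python
def make_bulk_actions(docs, index_name):
    """Yield one bulk action per document, targeting the given index."""
    for doc in docs:
        yield {"_index": index_name, "_source": doc}


def bulk_index(es_client, docs, index_name):
    # Imported here so make_bulk_actions stays usable without the client.
    from elasticsearch import helpers

    # helpers.bulk returns a tuple: (number of successes, list of errors).
    return helpers.bulk(es_client, make_bulk_actions(docs, index_name))


# Inspect the generated actions without touching a cluster.
preview = list(
    make_bulk_actions(
        [{"question": "Q1?", "course": "course-a"}],  # hypothetical document
        "course-questions",
    )
)
print(preview)
```

In the notebook this would be called as `bulk_index(es_client, documents, index_name)` in place of the loop.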


In [None]:
query = "I just discovered the course. Can I still join in?"


def elasticsearch_query(query):
    search_query = {
        "size": 5,
        "query": {
            "bool": {
                "must": {
                    "multi_match": {
                        "query": query,
                        "fields": ["question^3", "text", "section"],
                        "type": "best_fields",
                    }
                },
                "filter": {"term": {"course": "data-engineering-zoomcamp"}},
            }
        },
    }
    response = es_client.search(index=index_name, body=search_query)

    result_docs = []
    for hit in response["hits"]["hits"]:
        result_docs.append(hit["_source"])

    return result_docs

In [None]:
elasticsearch_query(query)

[{'text': "Yes, even if you don't register, you're still eligible to submit the homeworks.\nBe aware, however, that there will be deadlines for turning in the final projects. So don't leave everything for the last minute.",
  'section': 'General course-related questions',
  'question': 'Course - Can I still join the course after the start date?',
  'course': 'data-engineering-zoomcamp'},
 {'text': 'Yes, the slack channel remains open and you can ask questions there. But always sDocker containers exit code w search the channel first and second, check the FAQ (this document), most likely all your questions are already answered here.\nYou can also tag the bot @ZoomcampQABot to help you conduct the search, but don’t rely on its answers 100%, it is pretty good though.',
  'section': 'General course-related questions',
  'question': 'Course - Can I get support if I take the course in the self-paced mode?',
  'course': 'data-engineering-zoomcamp'},
 {'text': 'You can start by installing and s
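
The course name and result size are hard-coded in `elasticsearch_query`. A sketch of a parameterized query body follows. `build_search_body` is a hypothetical helper that only constructs the dictionary, so it can be checked without a running cluster.

```python
def build_search_body(query, course, size=5):
    """Build the same bool query as above, with the course and size as parameters."""
    return {
        "size": size,
        "query": {
            "bool": {
                "must": {
                    "multi_match": {
                        "query": query,
                        # "question^3" weights matches in the question field three times.
                        "fields": ["question^3", "text", "section"],
                        "type": "best_fields",
                    }
                },
                # The filter clause narrows results without contributing to the score.
                "filter": {"term": {"course": course}},
            }
        },
    }


body = build_search_body(
    "I just discovered the course. Can I still join in?",
    course="data-engineering-zoomcamp",
    size=3,
)
print(body["query"]["bool"]["filter"])
```

The resulting dictionary can be passed to `es_client.search(index=index_name, body=body)` exactly as in the function above.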

In [None]:
query = "How do I run kafka?"


def rag(query):
    search_results = elasticsearch_query(query)
    prompt = build_prompt(query, search_results)
    answer = llm(prompt)
    return answer


print(rag(query))

To run Kafka components:

For **Java Kafka** applications (like producers, consumers, or KStreams), navigate to your project directory and execute a command similar to this example:
```bash
java -cp build/libs/<jar_name>-1.0-SNAPSHOT.jar:out src/main/java/org/example/JsonProducer.java
```

For **Python Kafka** files (e.g., `producer.py`), you need to set up a virtual environment first:
1.  **Create a virtual environment** (run only once):
    ```bash
    python -m venv env
    ```
2.  **Activate the virtual environment**:
    *   On MacOS/Linux:
        ```bash
        source env/bin/activate
        ```
    *   On Windows:
        ```bash
        env/Scripts/activate
        ```
3.  **Install necessary packages** (run only once):
    ```bash
    pip install -r ../requirements.txt
    ```
    After these steps, you can run your Python files within this activated environment.
    To deactivate the environment when finished, run:
    ```bash
    deactivate
    ```
