# RAG - Retrieval & Search


Install Packages


In [27]:
!uv pip install -q \
    python-dotenv==1.2.1 \
    pandas==2.3.2 \
    pandas-stubs==2.3.2.250827 \
    numpy==2.3.2 \
    matplotlib==3.10.6 \
    seaborn==0.13.2 \
    scikit-learn==1.7.1 \
    tqdm==4.67.1 \
    requests==2.32.5 \
    litellm==1.78.5

Download sample search engine


In [2]:
!wget https://raw.githubusercontent.com/alexeygrigorev/minsearch/refs/heads/main/minsearch.py

--2026-01-24 15:23:20--  https://raw.githubusercontent.com/alexeygrigorev/minsearch/refs/heads/main/minsearch.py
Resolving raw.githubusercontent.com (raw.githubusercontent.com)... 2606:50c0:8000::154, 2606:50c0:8003::154, 2606:50c0:8001::154, ...
Connecting to raw.githubusercontent.com (raw.githubusercontent.com)|2606:50c0:8000::154|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 4350 (4.2K) [text/plain]
Saving to: ‘minsearch.py’


2026-01-24 15:23:20 (23.5 MB/s) - ‘minsearch.py’ saved [4350/4350]



Import packages


In [None]:
import json

import litellm
import minsearch
import requests
from dotenv import load_dotenv

load_dotenv()

True

Download documents


In [None]:
docs_url = "https://github.com/alexeygrigorev/llm-rag-workshop/raw/main/notebooks/documents.json"
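docs_response = requests.get(docs_url, timeout=30)
docs_response.raise_for_status()  # fail early on an HTTP error instead of parsing an error page
documents_raw = docs_response.json()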

documents = []

for course in documents_raw:
    course_name = course["course"]
    for doc in course["documents"]:
        doc["course"] = course_name
        documents.append(doc)

documents[0]

{'text': "The purpose of this document is to capture frequently asked technical questions\nThe exact day and hour of the course will be 15th Jan 2024 at 17h00. The course will start with the first  “Office Hours'' live.1\nSubscribe to course public Google Calendar (it works from Desktop only).\nRegister before the course starts using this link.\nJoin the course Telegram channel with announcements.\nDon’t forget to register in DataTalks.Club's Slack and join the channel.",
 'section': 'General course-related questions',
 'question': 'Course - When will the course start?',
 'course': 'data-engineering-zoomcamp'}

## Search Engine


Instantiate the search engine


In [None]:
index = minsearch.Index(
    text_fields=["section", "question", "text"], keyword_fields=["course"]
)

Define the question


In [None]:
q = "The course has already started, can I still enroll?"

Vectorize the documents


In [None]:
index.fit(documents)

<minsearch.Index at 0x7fb5c6bed5d0>

Search


In [None]:
boost = {"question": 3.0, "section": 0.5}

results = index.search(
    query=q,
    boost_dict=boost,
    filter_dict={"course": "data-engineering-zoomcamp"},
)
results

[{'text': "Yes, even if you don't register, you're still eligible to submit the homeworks.\nBe aware, however, that there will be deadlines for turning in the final projects. So don't leave everything for the last minute.",
  'section': 'General course-related questions',
  'question': 'Course - Can I still join the course after the start date?',
  'course': 'data-engineering-zoomcamp'},
 {'text': 'Yes, we will keep all the materials after the course finishes, so you can follow the course at your own pace after it finishes.\nYou can also continue looking at the homeworks and continue preparing for the next cohort. I guess you can also start working on your final capstone project.',
  'section': 'General course-related questions',
  'question': 'Course - Can I follow the course after it finishes?',
  'course': 'data-engineering-zoomcamp'},
 {'text': "The purpose of this document is to capture frequently asked technical questions\nThe exact day and hour of the course will be 15th Jan 202
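
To build intuition for what `boost_dict` and `filter_dict` do, here is a minimal pure-Python sketch of boosted TF-IDF search. It is an illustration under assumptions, not minsearch's actual implementation: the names `tokenize`, `tfidf_vectors`, `cosine`, `toy_search`, and `toy_docs` are hypothetical, the tokenizer is a naive whitespace split, and the real library may differ in weighting and tie-breaking.

```python
import math
from collections import Counter


def tokenize(text):
    """Lowercase and split on whitespace (a deliberately naive tokenizer)."""
    return text.lower().split()


def tfidf_vectors(texts):
    """Return L2-normalized sparse TF-IDF vectors plus a vectorizer for new text."""
    tokenized = [tokenize(t) for t in texts]
    n_docs = len(tokenized)
    doc_freq = Counter(term for tokens in tokenized for term in set(tokens))
    # Smoothed IDF: rare terms weigh more, common terms still keep some weight.
    idf = {term: math.log((1 + n_docs) / (1 + df)) + 1 for term, df in doc_freq.items()}

    def vectorize(tokens):
        counts = Counter(t for t in tokens if t in idf)
        vec = {t: c * idf[t] for t, c in counts.items()}
        norm = math.sqrt(sum(w * w for w in vec.values()))
        return {t: w / norm for t, w in vec.items()} if norm else {}

    return [vectorize(tokens) for tokens in tokenized], vectorize


def cosine(a, b):
    """Dot product of two normalized sparse vectors, i.e. cosine similarity."""
    return sum(w * b.get(t, 0.0) for t, w in a.items())


def toy_search(docs, query, text_fields, boost=None, filters=None, num_results=5):
    """Score each text field separately, weight it by its boost, and sum."""
    boost = boost or {}
    filters = filters or {}
    scores = [0.0] * len(docs)
    for field in text_fields:
        doc_vecs, vectorize = tfidf_vectors([doc[field] for doc in docs])
        query_vec = vectorize(tokenize(query))
        weight = boost.get(field, 1.0)  # fields without a boost count once
        for i, doc_vec in enumerate(doc_vecs):
            scores[i] += weight * cosine(query_vec, doc_vec)
    # Keyword filters are exact matches; non-matching documents are dropped.
    candidates = [
        i for i, doc in enumerate(docs)
        if all(doc.get(key) == value for key, value in filters.items())
    ]
    candidates.sort(key=lambda i: scores[i], reverse=True)
    return [docs[i] for i in candidates[:num_results]]


toy_docs = [
    {"question": "Can I still join after the start date?",
     "text": "Yes, you can submit homework.", "course": "de"},
    {"question": "How do I run Kafka?",
     "text": "Run the producer in the terminal.", "course": "de"},
    {"question": "Can I still join after the start date?",
     "text": "Yes, late joining is fine.", "course": "ml"},
]

hits = toy_search(
    toy_docs,
    "can I still join",
    text_fields=["question", "text"],
    boost={"question": 3.0},
    filters={"course": "de"},
)
print(hits[0]["question"])
```

In this sketch a match in `question` counts three times as much as a match in `text`, and the `ml` document is excluded by the filter even though its question is identical to the top hit.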

## Generation


In [None]:
messages = [
    {
        "role": "user",
        "content": q,
    },
]


completion = litellm.completion(
    model="gemini/gemini-2.5-flash",
    messages=messages,
)

response = completion.choices[0].message.content
print(response)

It's definitely worth checking! While some courses have strict enrollment deadlines, many institutions and platforms allow for late enrollment, especially within the first week or two of a course starting.

Here's what you should do and what factors will likely influence the decision:

**1. What You Should Do Immediately:**

*   **Contact the Admissions/Registrar's Office (for academic institutions):** They handle all enrollment procedures and will know the official late registration policy.
*   **Contact the Course Coordinator or Instructor (if you know who it is):** They can tell you if it's feasible to catch up and if they are willing to accept late enrollees.
*   **Check the Course Platform/Website:** Often, late enrollment policies and deadlines are posted there.

**2. Factors That Influence Late Enrollment:**

*   **Institutional Policy:** This is the biggest factor. Some places have a hard "no" after a certain date, while others have a grace period.
*   **Type of Course:**
    *

Prompt template


In [None]:
prompt_template = """
You're a course teaching assistant. Answer the question based on the CONTEXT.
Use only the facts from the CONTEXT when answering the QUESTION.
If the context doesn't contain the answer, output NONE

QUESTION: {question}

CONTEXT:
{context}
"""

Context


In [None]:
context = ""

for doc in results:
    context = (
        context
        + f"section: {doc['section']}\nquestion: {doc['question']}\nanswer: {doc['text']}\n\n"
    )

In [None]:
prompt = prompt_template.format(question=q, context=context).strip()
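
The context loop above can also be written with `str.join`, which avoids the trailing blank lines that `.strip()` later removes. This is a small sketch only; `format_context` and `sample_docs` are hypothetical names introduced for illustration.

```python
def format_context(docs):
    """Render each retrieved document as a section/question/answer block."""
    blocks = [
        f"section: {doc['section']}\nquestion: {doc['question']}\nanswer: {doc['text']}"
        for doc in docs
    ]
    # Blank line between blocks, nothing after the last one.
    return "\n\n".join(blocks)


sample_docs = [
    {"section": "General", "question": "Can I join late?", "text": "Yes."},
    {"section": "General", "question": "Are materials kept?", "text": "Yes, they are."},
]
sample_context = format_context(sample_docs)
print(sample_context)
```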

In [None]:
messages = [
    {
        "role": "user",
        "content": prompt,
    },
]


completion = litellm.completion(
    model="gemini/gemini-2.5-flash",
    messages=messages,
)

response = completion.choices[0].message.content
print(response)

Yes, even if you don't register, you're still eligible to submit the homeworks. However, be aware that there will be deadlines for turning in the final projects.
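
Because the prompt template asks the model to output NONE when the context lacks the answer, calling code may want to turn that sentinel into a Python `None`. The following is a sketch under assumptions: `parse_answer` is a hypothetical helper, and a model is not guaranteed to emit the sentinel in exactly this form.

```python
def parse_answer(raw_response, sentinel="NONE"):
    """Return None when the model signals that the context had no answer."""
    cleaned = raw_response.strip()
    # Tolerate surrounding whitespace, a trailing period, and case differences.
    if cleaned.strip(".").upper() == sentinel:
        return None
    return cleaned


print(parse_answer("NONE\n"))
print(parse_answer("Yes, you can still submit the homeworks."))
```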


Define search function


In [None]:
def search(query):
    boost = {"question": 3.0, "section": 0.5}

    results = index.search(
        query=query,
        filter_dict={"course": "data-engineering-zoomcamp"},
        boost_dict=boost,
        num_results=5,
    )

    return results

Search sample


In [None]:
results = search("How do I run Kafka?")
results

[{'text': 'In the project directory, run:\njava -cp build/libs/<jar_name>-1.0-SNAPSHOT.jar:out src/main/java/org/example/JsonProducer.java',
  'section': 'Module 6: streaming with kafka',
  'question': 'Java Kafka: How to run producer/consumer/kstreams/etc in terminal',
  'course': 'data-engineering-zoomcamp'},
 {'text': "Solution from Alexey: create a virtual environment and run requirements.txt and the python files in that environment.\nTo create a virtual env and install packages (run only once)\npython -m venv env\nsource env/bin/activate\npip install -r ../requirements.txt\nTo activate it (you'll need to run it every time you need the virtual env):\nsource env/bin/activate\nTo deactivate it:\ndeactivate\nThis works on MacOS, Linux and Windows - but for Windows the path is slightly different (it's env/Scripts/activate)\nAlso the virtual environment should be created only to run the python file. Docker images should first all be up and running.",
  'section': 'Module 6: streaming wi

In [None]:
def build_prompt(query, search_results):
    # The template is flush-left so the prompt carries no stray indentation.
    prompt_template = """
You're a course teaching assistant. Answer the question based on the CONTEXT.
Use only the facts from the CONTEXT when answering the QUESTION.

QUESTION: {question}

CONTEXT:
{context}
"""
    context = ""

    # Iterate over the documents passed in, not the global `results`.
    for doc in search_results:
        context = (
            context
            + f"section: {doc['section']}\nquestion: {doc['question']}\nanswer: {doc['text']}\n\n"
        )

    prompt = prompt_template.format(question=query, context=context).strip()

    return prompt

In [None]:
def llm(prompt):
    messages = [
        {
            "role": "user",
            "content": prompt,
        },
    ]

    completion = litellm.completion(
        model="gemini/gemini-2.5-flash",
        messages=messages,
    )

    return completion.choices[0].message.content

In [66]:
query = "How do I run kafka?"


def rag(query):
    search_results = search(query)
    prompt = build_prompt(query, search_results)
    answer = llm(prompt)
    return answer


print(rag(query))

Based on the context provided:

To run Java Kafka applications (like producers, consumers, or KStreams), navigate to the project directory and execute a command similar to this:
`java -cp build/libs/<jar_name>-1.0-SNAPSHOT.jar:out src/main/java/org/example/JsonProducer.java`

To run Python Kafka applications (like `producer.py`), you should:
1. Create and activate a virtual environment:
   `python -m venv env`
   `source env/bin/activate` (or `env/Scripts/activate` on Windows)
2. Install necessary packages:
   `pip install -r ../requirements.txt`
3. Ensure that Docker images are first all up and running before running your Python files.

To deactivate the virtual environment, use `deactivate`.
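
The search, prompt, and LLM steps can be checked without network access or an API key by passing each step in as a function. This is a sketch of that design choice, not part of the original notebook: `rag_pipeline`, `fake_search`, `fake_prompt`, and `fake_llm` are hypothetical names, and the stubs stand in for `search`, `build_prompt`, and `llm`.

```python
def rag_pipeline(query, search_fn, prompt_fn, llm_fn):
    """Run retrieval, prompt construction, and generation in sequence."""
    search_results = search_fn(query)
    prompt = prompt_fn(query, search_results)
    return llm_fn(prompt)


def fake_search(query):
    """Stub retriever that always returns a single canned document."""
    return [
        {
            "section": "Module 6",
            "question": "How do I run Kafka?",
            "text": "Start the Docker containers first.",
        }
    ]


def fake_prompt(query, search_results):
    """Stub prompt builder that keeps only the question and the answers."""
    context = "\n\n".join(doc["text"] for doc in search_results)
    return f"QUESTION: {query}\n\nCONTEXT:\n{context}"


def fake_llm(prompt):
    """Stub model that echoes the context back instead of calling an API."""
    return prompt.split("CONTEXT:\n", 1)[1]


answer = rag_pipeline("How do I run Kafka?", fake_search, fake_prompt, fake_llm)
print(answer)
```

With the real functions, the equivalent call would be `rag_pipeline(query, search, build_prompt, llm)`.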
