Extension to the rag-intro.ipynb notebook.

In [1]:
import os

In [2]:
OPENAI_API_KEY = os.environ["OPENAI_API_KEY"]

In [3]:
from openai import OpenAI

client = OpenAI(api_key = OPENAI_API_KEY)

response = client.responses.create(
    model="gpt-4.1",
    temperature=0,
    instructions="Give your best answer for the following question:",
    input="Is it too late to join the course?",
)

In [4]:
response

Response(id='resp_684f428fca1481a3a14d9eb16e4862ac02c84587df2f02cc', created_at=1750024847.0, error=None, incomplete_details=None, instructions='Give your best answer for the following question:', metadata={}, model='gpt-4.1-2025-04-14', object='response', output=[ResponseOutputMessage(id='msg_684f4290431081a3a75ad7aece3dc7c702c84587df2f02cc', content=[ResponseOutputText(annotations=[], text='Whether it’s too late to join a course depends on several factors:\n\n1. **Course Start Date:** If the course has just started or is about to start, you may still be able to join.\n2. **Enrollment Policy:** Some courses allow late enrollment within a certain window, while others do not.\n3. **Type of Course:** Online and self-paced courses often have more flexible start dates, while in-person or cohort-based courses may have stricter deadlines.\n4. **Instructor/Institution Policy:** Sometimes, instructors or institutions make exceptions for latecomers, especially if you contact them directly.\n\n*

/Users/lalo/Projects/DataTalks/llmzc-2025/01-intro/minsearch.py:10: UserWarning: Now minsearch is installable via pip: 'pip install minsearch'. Remove the downloaded file and re-install it with pip.

In [5]:
import minsearch

In [6]:
import json

In [7]:
with open('documents.json', 'rt') as f_in:
    docs_raw = json.load(f_in)

In [8]:
documents = []

for course_dict in docs_raw:
    for doc in course_dict['documents']:
        doc['course'] = course_dict['course']
        documents.append(doc)
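
The loop above writes the `course` key into the dictionaries loaded from `documents.json`, so `docs_raw` is modified as a side effect. A minimal sketch of a non-mutating variant, assuming the same `course` / `documents` layout (the helper name `flatten_documents` and the sample data are made up for illustration):

```python
def flatten_documents(docs_raw):
    """Return one flat list of FAQ entries, each tagged with its course name."""
    return [
        {**doc, "course": course_dict["course"]}  # copy, so the input stays untouched
        for course_dict in docs_raw
        for doc in course_dict["documents"]
    ]


# Made-up sample data with the same shape as documents.json
sample_raw = [
    {
        "course": "course-a",
        "documents": [
            {"question": "Q1", "text": "A1"},
            {"question": "Q2", "text": "A2"},
        ],
    },
    {"course": "course-b", "documents": [{"question": "Q3", "text": "A3"}]},
]

flat = flatten_documents(sample_raw)
print(len(flat), flat[0])
```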

In [9]:
documents[0]

{'text': "The purpose of this document is to capture frequently asked technical questions\nThe exact day and hour of the course will be 15th Jan 2024 at 17h00. The course will start with the first  “Office Hours'' live.1\nSubscribe to course public Google Calendar (it works from Desktop only).\nRegister before the course starts using this link.\nJoin the course Telegram channel with announcements.\nDon’t forget to register in DataTalks.Club's Slack and join the channel.",
 'section': 'General course-related questions',
 'question': 'Course - When will the course start?',
 'course': 'data-engineering-zoomcamp'}

In [10]:
index = minsearch.Index(
    text_fields=["question", "text", "section"],
    keyword_fields=["course"]
)

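Keyword fields are used for exact-match filtering, similar to a WHERE clause in SQL:

SELECT * FROM documents WHERE course = 'data-engineering-zoomcamp';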
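
To make the role of `keyword_fields` concrete, here is a rough pure-Python sketch of exact-match filtering. It illustrates the idea only and is not how minsearch implements it; `filter_by_keywords` and the sample documents are hypothetical.

```python
def filter_by_keywords(docs, filter_dict):
    """Keep only documents whose keyword fields exactly equal the given values."""
    return [
        doc
        for doc in docs
        if all(doc.get(field) == value for field, value in filter_dict.items())
    ]


# Made-up sample documents
sample_docs = [
    {"question": "When will the course start?", "course": "data-engineering-zoomcamp"},
    {"question": "How do I install Docker?", "course": "machine-learning-zoomcamp"},
]

matches = filter_by_keywords(sample_docs, {"course": "data-engineering-zoomcamp"})
print(matches)
```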

In [11]:
q = 'the course has already started, can I still enroll?'

In [12]:
index.fit(documents)

<minsearch.minsearch.Index at 0x11e420440>

In [13]:
from openai import OpenAI

In [14]:
client = OpenAI()

In [15]:
response = client.chat.completions.create(
    model='gpt-4o',
    messages=[{"role": "user", "content": q}]
)

response.choices[0].message.content

"That depends on the specific course and the institution offering it. Many institutions have late enrollment policies or add/drop periods where students can still register for courses after they have started. You should check the course's enrollment policies, contact the admissions office, or reach out to the course instructor to see if late enrollment is possible. Keep in mind that enrolling late might require catching up on missed material, so ask about available resources to help you do so."

In [16]:
def search(query):
    boost = {'question': 3.0, 'section': 0.5}

    results = index.search(
        query=query,
        filter_dict={'course': 'data-engineering-zoomcamp'},
        boost_dict=boost,
        num_results=5
    )

    return results
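
The `boost_dict` makes matches in `question` count three times as much as matches in `text`, and matches in `section` half as much. The toy scorer below shows the weighting idea with plain word overlap. It is an assumption-laden illustration, not minsearch's actual scoring, and `toy_score` is a hypothetical name.

```python
def toy_score(query, doc, boost):
    """Sum per-field word overlap with the query, weighting each field by its boost."""
    query_words = set(query.lower().split())
    score = 0.0
    for field in ("question", "text", "section"):
        field_words = set(doc.get(field, "").lower().split())
        overlap = len(query_words & field_words)
        score += boost.get(field, 1.0) * overlap  # fields without a boost weigh 1.0
    return score


boost = {"question": 3.0, "section": 0.5}
doc = {
    "question": "can i still enroll",
    "text": "yes you can still enroll",
    "section": "general questions",
}

print(toy_score("can I enroll", doc, boost))
```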

In [17]:
def build_prompt(query, search_results):
    prompt_template = """
You're a course teaching assistant. Answer the QUESTION based on the CONTEXT from the FAQ database.
Use only the facts from the CONTEXT when answering the QUESTION.

QUESTION: {question}

CONTEXT: 
{context}
""".strip()

    context = ""
    
    for doc in search_results:
        context = context + f"section: {doc['section']}\nquestion: {doc['question']}\nanswer: {doc['text']}\n\n"
    
    prompt = prompt_template.format(question=query, context=context).strip()
    return prompt
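
The context string is built by repeated concatenation, which leaves a trailing blank line that the final `.strip()` removes. One possible alternative, sketched here with a hypothetical `build_context` helper and made-up results, collects the entries and joins them:

```python
def build_context(search_results):
    """Render each result as a section/question/answer block, separated by blank lines."""
    entries = [
        f"section: {doc['section']}\nquestion: {doc['question']}\nanswer: {doc['text']}"
        for doc in search_results
    ]
    return "\n\n".join(entries)


# Made-up search results with the same keys as the FAQ documents
sample_results = [
    {"section": "General", "question": "Can I join late?", "text": "Yes."},
    {"section": "General", "question": "Is there a deadline?", "text": "Only for projects."},
]

context = build_context(sample_results)
print(context)
```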

In [18]:
def llm(prompt):
    response = client.chat.completions.create(
        model='gpt-4o',
        messages=[{"role": "user", "content": prompt}]
    )
    
    return response.choices[0].message.content

In [19]:
query = 'how do I run kafka?'

def rag(query):
    search_results = search(query)
    prompt = build_prompt(query, search_results)
    answer = llm(prompt)
    return answer
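
`rag` chains three steps: retrieve, build the prompt, call the model. Because the steps only communicate through their arguments and return values, the flow can be exercised without network access by passing the steps in as functions. This is a sketch under that assumption; `make_rag` and the three stubs are hypothetical and stand in for the real `search`, `build_prompt` and `llm`.

```python
def make_rag(search_fn, build_prompt_fn, llm_fn):
    """Return a rag(query) function wired to the given retrieval, prompt and LLM steps."""
    def rag(query):
        search_results = search_fn(query)
        prompt = build_prompt_fn(query, search_results)
        return llm_fn(prompt)
    return rag


# Offline stand-ins, for illustration only
def fake_search(query):
    return [{"section": "General", "question": "Can I join late?", "text": "Yes, you can."}]


def fake_build_prompt(query, search_results):
    return f"QUESTION: {query}\nCONTEXT: {search_results[0]['text']}"


def fake_llm(prompt):
    return prompt.upper()  # echoes the prompt instead of calling a model


offline_rag = make_rag(fake_search, fake_build_prompt, fake_llm)
answer = offline_rag("can I join?")
print(answer)
```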

In [20]:
rag(query)

'To run Kafka in a Java environment, navigate to your project directory and execute the following command in the terminal:\n\n```bash\njava -cp build/libs/<jar_name>-1.0-SNAPSHOT.jar:out src/main/java/org/example/JsonProducer.java\n```\n\nReplace `<jar_name>` with the actual name of your JAR file.'

In [21]:
rag('the course has already started, can I still enroll?')

"Yes, you can still enroll in the course after it has started. Even if you don't register, you are eligible to submit the homework. However, keep in mind that there will be deadlines for submitting the final projects, so it's important not to delay until the last minute."

In [25]:
! ../elastic-start-local/start.sh

[+] Running 2/3
 ⠙ Container es-local-dev      Waiting   0.2s
 ✔ Container kibana_settings   Created   0.0s
 ✔ Container kibana-local-dev  Created   0.0s

In [26]:
from elasticsearch import Elasticsearch

In [27]:
es_client = Elasticsearch('http://localhost:9200', basic_auth=("elastic", "IkqvXZGr")) 
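
The password above is hard-coded in the notebook. A possible alternative is to read the credentials from the environment, as is already done for `OPENAI_API_KEY`. The variable names `ES_LOCAL_USER` and `ES_LOCAL_PASSWORD` below are assumptions, as is the helper `get_es_credentials`; adjust them to your setup.

```python
import os


def get_es_credentials(env=None):
    """Return (user, password) for basic_auth, read from assumed environment variables."""
    if env is None:
        env = os.environ
    user = env.get("ES_LOCAL_USER", "elastic")  # assumed variable name
    password = env.get("ES_LOCAL_PASSWORD")     # assumed variable name
    if not password:
        raise RuntimeError("Set ES_LOCAL_PASSWORD before connecting to Elasticsearch")
    return user, password


# Demonstration with an explicit mapping instead of the real environment
credentials = get_es_credentials({"ES_LOCAL_PASSWORD": "example-password"})
print(credentials[0])
```

The returned tuple can then be passed as `basic_auth=get_es_credentials()` when creating the client.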

In [28]:
es_client

<Elasticsearch(['http://localhost:9200'])>

In [29]:
index_settings = {
    "settings": {
        "number_of_shards": 1,
        "number_of_replicas": 0
    },
    "mappings": {
        "properties": {
            "text": {"type": "text"},
            "section": {"type": "text"},
            "question": {"type": "text"},
            "course": {"type": "keyword"} 
        }
    }
}

index_name = "course-questions"

es_client.indices.create(index=index_name, body=index_settings)

BadRequestError: BadRequestError(400, 'resource_already_exists_exception', 'index [course-questions/my5FFPA1SsCUcUzpruNVjg] already exists')
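
The error appears because the cell was run a second time and the index already exists. One way to make the cell safe to re-run is to check for the index first. This is a minimal sketch: `ensure_index` is a hypothetical helper, and it uses the same `body=` argument as the call above.

```python
def ensure_index(client, name, settings):
    """Create the index only if it is missing; return True when it was created."""
    if client.indices.exists(index=name):
        return False
    client.indices.create(index=name, body=settings)
    return True


# Usage with the objects defined in this notebook:
# ensure_index(es_client, index_name, index_settings)
```

Reusing an existing index and then re-running the indexing loop adds the documents a second time, because `es_client.index` without an `id` creates a new document on every call.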

In [31]:
from tqdm.auto import tqdm


In [32]:
for doc in tqdm(documents):
    es_client.index(index=index_name, document=doc)

100%|████████████████████████████████████████| 948/948 [00:02<00:00, 329.63it/s]
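
The loop sends one request per document. For larger collections, the `elasticsearch.helpers.bulk` helper can send documents in batches. The sketch below only builds the action dictionaries that helper expects; `make_bulk_actions` is a hypothetical name, and the bulk call itself is left as a comment because it needs a running cluster.

```python
def make_bulk_actions(documents, index_name):
    """Yield one bulk action per document, targeting the given index."""
    for doc in documents:
        yield {"_index": index_name, "_source": doc}


# Made-up sample documents
sample_documents = [
    {"question": "Q1", "text": "A1", "section": "General", "course": "course-a"},
    {"question": "Q2", "text": "A2", "section": "General", "course": "course-a"},
]

actions = list(make_bulk_actions(sample_documents, "course-questions"))
print(actions[0])

# With a live client, assuming the names used in this notebook:
# from elasticsearch.helpers import bulk
# bulk(es_client, make_bulk_actions(documents, index_name))
```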


In [33]:
query = 'I just discovered the course. Can I still join it?'

In [34]:
def elastic_search(query):
    search_query = {
        "size": 5,
        "query": {
            "bool": {
                "must": {
                    "multi_match": {
                        "query": query,
                        "fields": ["question^3", "text", "section"],
                        "type": "best_fields"
                    }
                },
                "filter": {
                    "term": {
                        "course": "data-engineering-zoomcamp"
                    }
                }
            }
        }
    }

    response = es_client.search(index=index_name, body=search_query)
    
    result_docs = []
    
    for hit in response['hits']['hits']:
        result_docs.append(hit['_source'])
    
    return result_docs
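
In `elastic_search` the course name and result count are fixed inside the query body. A small variation, sketched here with a hypothetical `build_search_query` helper, turns them into parameters while keeping the same structure: `multi_match` over the three text fields with `question^3` boosting the question field, plus a `term` filter on the `course` keyword.

```python
def build_search_query(query, course, size=5):
    """Build the Elasticsearch request body used for FAQ retrieval."""
    return {
        "size": size,
        "query": {
            "bool": {
                # Full-text match; "question^3" weights the question field 3x
                "must": {
                    "multi_match": {
                        "query": query,
                        "fields": ["question^3", "text", "section"],
                        "type": "best_fields",
                    }
                },
                # Exact match on the keyword field, does not affect scoring
                "filter": {"term": {"course": course}},
            }
        },
    }


body = build_search_query("Can I still join?", "data-engineering-zoomcamp")
print(body["query"]["bool"]["filter"])
```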

In [35]:
def rag(query):
    search_results = elastic_search(query)
    prompt = build_prompt(query, search_results)
    answer = llm(prompt)
    return answer

In [36]:
rag(query)

"Yes, you can still join the course even if you discovered it after the start date. You are eligible to submit the homeworks without registration. However, keep in mind that there are deadlines for the final projects, so it's important not to leave everything for the last minute."