In [1]:
!wget https://raw.githubusercontent.com/alexeygrigorev/minsearch/main/minsearch.py

--2024-06-28 23:06:04--  https://raw.githubusercontent.com/alexeygrigorev/minsearch/main/minsearch.py
Connecting to 127.0.0.1:3128... connected.
Proxy request sent, awaiting response... 200 OK
Length: 3832 (3.7K) [text/plain]
Saving to: ‘minsearch.py.1’


2024-06-28 23:06:05 (79.4 KB/s) - ‘minsearch.py.1’ saved [3832/3832]



In [2]:
!wget https://raw.githubusercontent.com/DataTalksClub/llm-zoomcamp/main/01-intro/documents.json

--2024-06-28 23:06:05--  https://raw.githubusercontent.com/DataTalksClub/llm-zoomcamp/main/01-intro/documents.json
Connecting to 127.0.0.1:3128... connected.
Proxy request sent, awaiting response... 200 OK
Length: 658332 (643K) [text/plain]
Saving to: ‘documents.json.1’


2024-06-28 23:06:06 (1.96 MB/s) - ‘documents.json.1’ saved [658332/658332]



In [3]:
import minsearch

In [4]:
import json

In [5]:
with open('documents.json', 'rt') as f_in:
    docs_raw = json.load(f_in)

In [6]:
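documents = []

# Flatten the per-course FAQ lists into one list,
# tagging each entry with the course it belongs to
for course_dict in docs_raw:
    for doc in course_dict['documents']:
        doc['course'] = course_dict['course']
        documents.append(doc)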

In [7]:
documents[0]

{'text': "The purpose of this document is to capture frequently asked technical questions\nThe exact day and hour of the course will be 15th Jan 2024 at 17h00. The course will start with the first  “Office Hours'' live.1\nSubscribe to course public Google Calendar (it works from Desktop only).\nRegister before the course starts using this link.\nJoin the course Telegram channel with announcements.\nDon’t forget to register in DataTalks.Club's Slack and join the channel.",
 'section': 'General course-related questions',
 'question': 'Course - When will the course start?',
 'course': 'data-engineering-zoomcamp'}
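
Note that the loop in In [6] adds the 'course' key to the dicts inside docs_raw in place. If the raw structure should stay untouched, one option is a comprehension that builds new dicts. This is a minimal sketch on made-up toy data; the two toy courses and their entries are assumptions, only the nesting mirrors documents.json.

```python
# Toy stand-in for docs_raw; the real file has the same nesting.
docs_raw = [
    {
        "course": "course-a",
        "documents": [
            {"text": "Yes.", "section": "General", "question": "Can I join late?"},
        ],
    },
    {
        "course": "course-b",
        "documents": [
            {"text": "Use pip.", "section": "Setup", "question": "How do I install it?"},
        ],
    },
]

# {**doc, ...} builds a new dict, so the entries inside docs_raw are not modified.
documents = [
    {**doc, "course": course_dict["course"]}
    for course_dict in docs_raw
    for doc in course_dict["documents"]
]

print(documents[0]["course"])
```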

In [8]:
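index = minsearch.Index(
    text_fields=["question", "text", "section"],  # searched as free text
    keyword_fields=["course"]                     # matched exactly, used for filtering
)

A keyword filter is an exact match, comparable to a SQL WHERE clause (table name omitted):

SELECT * WHERE course = 'data-engineering-zoomcamp';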

In [9]:
q = 'the course has already started, can I still enroll?'

In [10]:
index.fit(documents)

<minsearch.Index at 0x7f933fb4cbc0>

In [11]:
# from openai import OpenAI
from groq import Groq

from dotenv import load_dotenv

load_dotenv()

True

In [12]:
# client = OpenAI()
client = Groq()

In [27]:
response = client.chat.completions.create(
    # model='gpt-4o',
    model="llama3-8b-8192",
    messages=[{"role": "user", "content": q}]
)

# print(response.choices[0].message.content)

In [28]:
print(response.choices[0].message.content)

A common concern!

Unfortunately, once a course has already started, it's usually not possible to enroll in it. Here's why:

1. **Lecture schedule**: The course has already begun, which means that lectures, assignments, and evaluations are underway. Catching up on all the material and missed classes can be challenging.
2. **Classroom availability**: The course is already being taught in a physical classroom, and seats are likely already taken. There may not be any available space for a new student to join the class.
3. **Instructor's schedule**: The instructor has already committed to teaching the course, and their schedule is likely already set. Accommodating a new student would require adjusting the instructor's schedule, which can be cumbersome.

However, it never hurts to ask!

You can still reach out to the course instructor, department, or program coordinator to inquire about enrollment possibilities. They may be able to offer alternative solutions, such as:

1. **Late enrollment

In [29]:
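def search(query):
    # Weight matches in 'question' 3x and in 'section' 0.5x;
    # fields not listed here ('text') keep their default weight
    boost = {'question': 3.0, 'section': 0.5}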

    results = index.search(
        query=query,
        filter_dict={'course': 'data-engineering-zoomcamp'},
        boost_dict=boost,
        num_results=5
    )

    return results
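
To make the roles of filter_dict and boost_dict concrete, here is a deliberately simplified stand-in written in plain Python. It scores documents by counting query tokens per field and is not how minsearch computes relevance; toy_search and toy_docs are hypothetical names, and the three documents are made up.

```python
from collections import Counter


def toy_search(docs, query, text_fields, filter_dict=None, boost_dict=None, num_results=5):
    """Rank docs by boosted token overlap (illustration only, not minsearch's scoring)."""
    filter_dict = filter_dict or {}
    boost_dict = boost_dict or {}
    query_tokens = set(query.lower().split())

    scored = []
    for doc in docs:
        # Keyword filter: every filtered field must match exactly.
        if any(doc.get(key) != value for key, value in filter_dict.items()):
            continue
        score = 0.0
        for field in text_fields:
            tokens = Counter(doc.get(field, "").lower().split())
            overlap = sum(tokens[token] for token in query_tokens)
            # Fields without an explicit boost count with weight 1.0.
            score += boost_dict.get(field, 1.0) * overlap
        if score > 0:
            scored.append((score, doc))

    # Sort on the score only, highest first.
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [doc for _, doc in scored[:num_results]]


toy_docs = [
    {"question": "Can I enroll late?", "text": "Yes, you can.",
     "section": "General", "course": "course-a"},
    {"question": "How do I install Docker?", "text": "You can enroll in the Docker workshop.",
     "section": "Setup", "course": "course-a"},
    {"question": "Can I enroll late?", "text": "Ask the organizers.",
     "section": "General", "course": "course-b"},
]

hits = toy_search(
    toy_docs,
    "can i enroll",
    text_fields=["question", "text", "section"],
    filter_dict={"course": "course-a"},
    boost_dict={"question": 3.0, "section": 0.5},
)
for hit in hits:
    print(hit["question"])
```

The course-b entry never appears because the filter removes it before scoring, and the first toy document ranks highest because all of its matches fall in the boosted question field.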

In [30]:
def build_prompt(query, search_results):
    prompt_template = """
You're a course teaching assistant. Answer the QUESTION based on the CONTEXT from the FAQ database.
Use only the facts from the CONTEXT when answering the QUESTION.

QUESTION: {question}

CONTEXT: 
{context}
""".strip()

    context = ""
    
    for doc in search_results:
        context = context + f"section: {doc['section']}\nquestion: {doc['question']}\nanswer: {doc['text']}\n\n"
    
    prompt = prompt_template.format(question=query, context=context).strip()
    return prompt
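
It can help to look at the rendered prompt before sending it to the model. The sketch below assembles the same kind of prompt with "\n\n".join instead of repeated string concatenation; render_prompt and PROMPT_TEMPLATE are hypothetical names, and the single search result is made up.

```python
PROMPT_TEMPLATE = """
You're a course teaching assistant. Answer the QUESTION based on the CONTEXT from the FAQ database.
Use only the facts from the CONTEXT when answering the QUESTION.

QUESTION: {question}

CONTEXT:
{context}
""".strip()


def render_prompt(query, search_results):
    # One block per retrieved FAQ entry, separated by a blank line.
    context = "\n\n".join(
        f"section: {doc['section']}\nquestion: {doc['question']}\nanswer: {doc['text']}"
        for doc in search_results
    )
    return PROMPT_TEMPLATE.format(question=query, context=context)


toy_results = [
    {"section": "Kafka", "question": "How do I start the broker?", "text": "Use docker compose."},
]
prompt = render_prompt("how do I run kafka?", toy_results)
print(prompt)
```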

In [31]:
def llm(prompt):
    response = client.chat.completions.create(
        # model='gpt-4o',
        model="llama3-8b-8192",
        messages=[{"role": "user", "content": prompt}]
    )
    
    return response.choices[0].message.content

In [32]:
query = 'how do I run kafka?'

def rag(query):
    search_results = search(query)
    prompt = build_prompt(query, search_results)
    answer = llm(prompt)
    return answer
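
Since rag() only chains three calls, the flow can be exercised without an API key by passing the three steps in as arguments. This is a sketch under that assumption; make_rag and the fake_* functions are hypothetical and only stand in for search, build_prompt and llm.

```python
def make_rag(search_fn, prompt_fn, llm_fn):
    """Return a rag(query) function wired to the given retrieval, prompt and LLM steps."""
    def rag(query):
        search_results = search_fn(query)
        prompt = prompt_fn(query, search_results)
        return llm_fn(prompt)
    return rag


# Offline stand-ins so the pipeline runs without a search index or an LLM.
def fake_search(query):
    return [{"section": "Kafka", "question": "How do I run it?", "text": "Use docker compose."}]


def fake_prompt(query, search_results):
    answers = " ".join(doc["text"] for doc in search_results)
    return f"QUESTION: {query}\nCONTEXT: {answers}"


def fake_llm(prompt):
    # Echoes the prompt in upper case so it is visible what reached the "model".
    return prompt.upper()


offline_rag = make_rag(fake_search, fake_prompt, fake_llm)
result = offline_rag("how do I run kafka?")
print(result)
```

With the notebook's own functions, the equivalent wiring would be make_rag(search, build_prompt, llm).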

In [33]:
rag(query)

"Based on the CONTEXT, here's how to run Kafka-related commands in the terminal:\n\n**For Java:**\nTo run the producer/consumer/kstreams/etc, navigate to the project directory and run:\n```\njava -cp build/libs/<jar_name>-1.0-SNAPSHOT.jar:out src/main/java/org/example/JsonProducer.java\n```\n**For Python:**\nTo run Python code that uses Kafka, ensure that you have created a virtual environment and installed the necessary dependencies. The steps to do this are:\n```\npython -m venv env\nsource env/bin/activate\npip install -r ../requirements.txt\n```\nTo activate the virtual environment, run `source env/bin/activate` (or `env/Scripts/activate` on Windows). To deactivate, run `deactivate`.\n\nNote that the error `ModuleNotFoundError: No module named 'kafka.vendor.six.moves'` can be fixed by using `pip install kafka-python-ng` instead of `pip install kafka-python`."

In [34]:
print(rag('the course has already started, can I still enroll?'))

Based on the context, the QUESTION "the course has already started, can I still enroll?" does not match any of the existing answers. The closest answer is:

"yes, even if you don't register, you're still eligible to submit the homeworks"

However, this answer is not directly applicable since the course has already started. There is no answer that specifically addresses the scenario where the course has already started and someone wants to enroll.

Therefore, I would respond:

"I'm sorry, but according to the FAQ database, I couldn't find a direct answer to this question. Since the course has already started, I would recommend reaching out to the course organizers or support team directly to inquire about the possibility of enrolling at this late stage."


In [38]:
from elasticsearch import Elasticsearch

In [39]:
es_client = Elasticsearch('http://localhost:9200') 

In [40]:
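index_settings = {
    "settings": {
        # Single local node: one shard, no replicas
        "number_of_shards": 1,
        "number_of_replicas": 0
    },
    "mappings": {
        "properties": {
            # "text" fields are analyzed for full-text search
            "text": {"type": "text"},
            "section": {"type": "text"},
            "question": {"type": "text"},
            # "keyword" fields are stored as-is and matched exactly (used for filtering)
            "course": {"type": "keyword"}
        }
    }
}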

index_name = "course-questions"

es_client.indices.create(index=index_name, body=index_settings)

ObjectApiResponse({'acknowledged': True, 'shards_acknowledged': True, 'index': 'course-questions'})
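
Running the cell above a second time fails because the index already exists. One way around that is to check first. The sketch below assumes the 8.x Python client, where indices.exists, indices.create and indices.delete accept these keyword arguments; ensure_index is a hypothetical helper, and FakeClient is an in-memory stand-in so the example runs without a cluster.

```python
def ensure_index(client, name, settings, mappings, recreate=False):
    """Create the index if it is missing; return True if it was created."""
    exists = bool(client.indices.exists(index=name))
    if exists and recreate:
        # Drops the index together with all documents in it.
        client.indices.delete(index=name)
        exists = False
    if not exists:
        client.indices.create(index=name, settings=settings, mappings=mappings)
        return True
    return False


class FakeIndices:
    """In-memory stand-in for es_client.indices, for demonstration only."""

    def __init__(self):
        self.created = {}

    def exists(self, index):
        return index in self.created

    def create(self, index, settings=None, mappings=None):
        self.created[index] = {"settings": settings, "mappings": mappings}

    def delete(self, index):
        del self.created[index]


class FakeClient:
    def __init__(self):
        self.indices = FakeIndices()


client = FakeClient()
settings = {"number_of_shards": 1, "number_of_replicas": 0}
mappings = {"properties": {"course": {"type": "keyword"}}}

first = ensure_index(client, "course-questions", settings, mappings)
second = ensure_index(client, "course-questions", settings, mappings)
print(first, second)
```

Against the real client, the call would look like ensure_index(es_client, index_name, index_settings["settings"], index_settings["mappings"]).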

In [42]:
from tqdm.auto import tqdm


In [43]:
for doc in tqdm(documents):
    es_client.index(index=index_name, document=doc)

100%|██████████| 948/948 [00:06<00:00, 143.41it/s]
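
The loop above sends one request per document. For larger collections, the client's bulk helper batches them into fewer requests. This is a sketch assuming the elasticsearch package and its elasticsearch.helpers.bulk function; generate_actions and bulk_index are hypothetical names, and the toy documents are made up.

```python
def generate_actions(documents, index_name):
    """Yield one bulk action per document, in the format the bulk helper expects."""
    for doc in documents:
        yield {"_index": index_name, "_source": doc}


def bulk_index(es_client, documents, index_name):
    # Imported here so the rest of this example runs without the package installed.
    from elasticsearch.helpers import bulk

    # bulk() returns a tuple: (number of successful actions, list of errors).
    return bulk(es_client, generate_actions(documents, index_name))


toy_documents = [
    {"text": "Yes.", "section": "General", "question": "Can I join late?", "course": "course-a"},
    {"text": "Use pip.", "section": "Setup", "question": "How do I install it?", "course": "course-a"},
]
actions = list(generate_actions(toy_documents, "course-questions"))
print(len(actions), actions[0]["_index"])
```

In the notebook this would be called as bulk_index(es_client, documents, index_name).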


In [44]:
query = 'I just discovered the course. Can I still join it?'

In [45]:
def elastic_search(query):
    search_query = {
        "size": 5,
        "query": {
            "bool": {
                "must": {
                    "multi_match": {
                        "query": query,
                        "fields": ["question^3", "text", "section"],
                        "type": "best_fields"
                    }
                },
                "filter": {
                    "term": {
                        "course": "data-engineering-zoomcamp"
                    }
                }
            }
        }
    }

    response = es_client.search(index=index_name, body=search_query)

    # Keep only the stored documents, dropping Elasticsearch metadata such as _id and _score
    result_docs = [hit['_source'] for hit in response['hits']['hits']]
    
    return result_docs
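
The course name is hard-coded in the filter above. A small builder makes it a parameter while keeping the same query shape: "question^3" boosts matches in the question field, and the term filter restricts results to one course without affecting the score. build_search_query is a hypothetical helper name.

```python
def build_search_query(query, course, size=5):
    """Build the Elasticsearch query body used above, with the course as a parameter."""
    return {
        "size": size,
        "query": {
            "bool": {
                # Scored part: full-text match across the three text fields.
                "must": {
                    "multi_match": {
                        "query": query,
                        "fields": ["question^3", "text", "section"],
                        "type": "best_fields",
                    }
                },
                # Unscored part: exact match on the keyword field.
                "filter": {"term": {"course": course}},
            }
        },
    }


search_query = build_search_query("Can I still join?", "data-engineering-zoomcamp")
print(search_query["query"]["bool"]["filter"])
```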

In [46]:
def rag(query):
    search_results = elastic_search(query)
    prompt = build_prompt(query, search_results)
    answer = llm(prompt)
    return answer

In [48]:
print(rag(query))

Thank you for your interest in the course! According to our FAQ, yes, you can still join the course even after the start date. You're eligible to submit homeworks, and while there will be deadlines for final projects, you don't need to rush.
