## Q1. Running Elastic 

Run Elastic Search 8.4.3, and get the cluster information. If you run it on localhost, this is how you do it:

```bash
curl localhost:9200
```

What's the `version.build_hash` value?

In [1]:
!curl localhost:9200

{
  "name" : "30642c208c9f",
  "cluster_name" : "docker-cluster",
  "cluster_uuid" : "fX8snEpWSFuhBxClm2d_ug",
  "version" : {
    "number" : "8.4.3",
    "build_flavor" : "default",
    "build_type" : "docker",
    "build_hash" : "42f05b9372a9a4a470db3b52817899b99a76ee73",
    "build_date" : "2022-10-04T07:17:24.662462378Z",
    "build_snapshot" : false,
    "lucene_version" : "9.3.0",
    "minimum_wire_compatibility_version" : "7.17.0",
    "minimum_index_compatibility_version" : "7.0.0"
  },
  "tagline" : "You Know, for Search"
}


##### Answer Q1: 
+ 42f05b9372a9a4a470db3b52817899b99a76ee73

## Getting the data

Now let's get the FAQ data. You can run this snippet:

```python
import requests 

docs_url = 'https://github.com/DataTalksClub/llm-zoomcamp/blob/main/01-intro/documents.json?raw=1'
docs_response = requests.get(docs_url)
documents_raw = docs_response.json()

documents = []

for course in documents_raw:
    course_name = course['course']

    for doc in course['documents']:
        doc['course'] = course_name
        documents.append(doc)
```

Note that you need to have the `requests` library:

```bash
pip install requests
```

In [2]:
# pip install requests

In [2]:
import requests 

docs_url = 'https://github.com/DataTalksClub/llm-zoomcamp/blob/main/01-intro/documents.json?raw=1'
docs_response = requests.get(docs_url)
documents_raw = docs_response.json()

documents = []

for course in documents_raw:
    course_name = course['course']

    for doc in course['documents']:
        doc['course'] = course_name
        documents.append(doc)

## Q2. Indexing the data

Index the data in the same way as was shown in the course videos. Make the `course` field a keyword and the rest should be text. 

Don't forget to install the ElasticSearch client for Python:

```bash
pip install elasticsearch
```

Which function do you use for adding your data to elastic?

* `insert`
* `index`
* `put`
* `add`

In [4]:
# pip install elasticsearch

In [3]:
from elasticsearch import Elasticsearch

In [4]:
es_client = Elasticsearch('http://localhost:9200')

In [5]:
index_settings = {
    "settings": {
        "number_of_shards": 1,
        "number_of_replicas": 0
    },
    "mappings": {
        "properties": {
            "text": {"type": "text"},
            "section": {"type": "text"},
            "question": {"type": "text"},
            "course": {"type": "keyword"} 
        }
    }
}

index_name = "course-questions"

# Check if the index exists and create it if it doesn't
if not es_client.indices.exists(index=index_name):
    es_client.indices.create(index=index_name, body=index_settings)


##### Answer Q2:
+ index

## Q3. Searching

Now let's search in our index. 

We will execute a query "How do I execute a command in a running docker container?". 

Use only `question` and `text` fields and give `question` a boost of 4, and use `"type": "best_fields"`.

What's the score for the top ranking result?

* 94.05
* 84.05
* 74.05
* 64.05

Look at the `_score` field.

In [6]:
from tqdm.auto import tqdm

  from .autonotebook import tqdm as notebook_tqdm


In [7]:
for doc in tqdm(documents):
    es_client.index(index=index_name, document=doc)

100%|███████████████████████████████████████████████████████████████████████████████████| 948/948 [00:30<00:00, 31.22it/s]


In [10]:
query = 'How do I execute a command in a running docker container?'

In [11]:
def elastic_search(query):

    search_query = {
        # "size": 5,
        "query": {
            "bool": {
                "must": {
                    "multi_match": {
                        "query": query,
                        "fields": ["question^4", "text"],
                        "type": "best_fields"
                    }
                },
                    }
                }
            }
    try:
        response = es_client.search(index=index_name, body=search_query)
        
        top_result = response['hits']['hits'][0] if response['hits']['hits'] else None
    
        if top_result:
            print(f"Top document score: {top_result['_score']}")
            print(f"Document details: {top_result['_source']}")
        else:
            print("No results found.")
            
    except Exception as e:
        print(f"An error occurred: {str(e)}")
        return []

In [12]:
elastic_search(query)

Top document score: 84.050095
Document details: {'text': 'Launch the container image in interactive mode and overriding the entrypoint, so that it starts a bash command.\ndocker run -it --entrypoint bash <image>\nIf the container is already running, execute a command in the specific container:\ndocker ps (find the container-id)\ndocker exec -it <container-id> bash\n(Marcos MJD)', 'section': '5. Deploying Machine Learning Models', 'question': 'How do I debug a docker container?', 'course': 'machine-learning-zoomcamp'}


## Q4. Filtering

Now let's only limit the questions to `machine-learning-zoomcamp`.

Return 3 results. What's the 3rd question returned by the search engine?

* How do I debug a docker container?
* How do I copy files from a different folder into docker container’s working directory?
* How do Lambda container images work?
* How can I annotate a graph?


In [13]:
def elastic_search2(query):
    search_query = {
        "query": {
            "bool": {
                "must": {
                    "multi_match": {
                        "query": query,
                        "fields": ["question^4", "text"],
                        "type": "best_fields"
                    }
                },
                "filter": {
                    "term": {
                        "course": "machine-learning-zoomcamp"
                    }
                }
            }
        }
    }

    try:
        response = es_client.search(index=index_name, body=search_query)
    
        result = response['hits']['hits'][2] if response['hits']['hits'] else None
    
        if result:
            print(f"Top document score: {result['_score']}")
            print(f"Question: {result['_source']['question']}")
        else:
            print("No results found.")
        
    except Exception as e:
        print(f"An error occurred: {str(e)}")
        return []

Answer Q4:¶

In [14]:
elastic_search2(query)

Top document score: 49.938507
Question: How do I copy files from a different folder into docker container’s working directory?


In [15]:
def elastic_search3(query):
    search_query = {
        "query": {
            "bool": {
                "must": {
                    "multi_match": {
                        "query": query,
                        "fields": ["question^4", "text"],
                        "type": "best_fields"
                    }
                },
                "filter": {
                    "term": {
                        "course": "machine-learning-zoomcamp"
                    }
                }
            }
        }
    }

    try:
        response = es_client.search(index=index_name, body=search_query)
        result_docs = []
        for hit in response['hits']['hits']:
            result_docs.append(hit['_source'])
    except Exception as e:
        print(f"An error occurred: {str(e)}")
    
    return result_docs 

In [16]:
elastic_search3(query)

[{'text': 'Launch the container image in interactive mode and overriding the entrypoint, so that it starts a bash command.\ndocker run -it --entrypoint bash <image>\nIf the container is already running, execute a command in the specific container:\ndocker ps (find the container-id)\ndocker exec -it <container-id> bash\n(Marcos MJD)',
  'section': '5. Deploying Machine Learning Models',
  'question': 'How do I debug a docker container?',
  'course': 'machine-learning-zoomcamp'},
 {'text': "You can copy files from your local machine into a Docker container using the docker cp command. Here's how to do it:\nTo copy a file or directory from your local machine into a running Docker container, you can use the `docker cp command`. The basic syntax is as follows:\ndocker cp /path/to/local/file_or_directory container_id:/path/in/container\nHrithik Kumar Advani",
  'section': '5. Deploying Machine Learning Models',
  'question': 'How do I copy files from my local machine to docker container?',
 

## Q5. Building a prompt

Now we're ready to build a prompt to send to an LLM. 

Take the records returned from Elasticsearch in Q4 and use this template to build the context. Separate context entries by two linebreaks (`\n\n`)
```python
context_template = """
Q: {question}
A: {text}
""".strip()
```

Now use the context you just created along with the "How do I execute a command in a running docker container?" question 
to construct a prompt using the template below:

```
prompt_template = """
You're a course teaching assistant. Answer the QUESTION based on the CONTEXT from the FAQ database.
Use only the facts from the CONTEXT when answering the QUESTION.

QUESTION: {question}

CONTEXT:
{context}
""".strip()
```

What's the length of the resulting prompt? (use the `len` function)

* 962
* 1462
* 1962
* 2462

In [17]:
def build_prompt(query, elastic_search):
    prompt_template = """
You're a course teaching assistant. Answer the QUESTION based on the CONTEXT from the FAQ database.
Use only the facts from the CONTEXT when answering the QUESTION.

QUESTION: {question}

CONTEXT:
{context}
""".strip()

    # context_template = """
    # Q: {question}
    # A: {text}
    # """.strip()

    context = ""

    for doc in elastic_search:
        context = context + f"question: {doc['question']}\nanswer: {doc['text']}\n\n"
    
    prompt = prompt_template.format(question=query, context=context).strip()
    # prompt_length = len(prompt)

    # print("The length of the prompt is:", prompt_length)
    return prompt
    

In [18]:
build_prompt(query, elastic_search3)

TypeError: 'function' object is not iterable

In [None]:
prompt = "What's the length of the resulting prompt?"
prompt_length = len(prompt)

print("The length of the prompt is:", prompt_length)

## Q6. Tokens

When we use the OpenAI Platform, we're charged by the number of 
tokens we send in our prompt and receive in the response.

The OpenAI python package uses `tiktoken` for tokenization:

```bash
pip install tiktoken
```

Let's calculate the number of tokens in our query: 

```python
encoding = tiktoken.encoding_for_model("gpt-4o")
```

Use the `encode` function. How many tokens does our prompt have?

* 122
* 222
* 322
* 422

Note: to decode back a token into a word, you can use the `decode_single_token_bytes` function:

```python
encoding.decode_single_token_bytes(63842)
```

In [None]:
RAG - Retrieval Augmented Generation // Generowanie Rozszerzonego Pobierania

#### LLM Zoomcamp 1.3 - Retrieval and Search

In [None]:
!wget https://raw.githubusercontent.com/alexeygrigorev/minsearch/main/minsearch.py

In [27]:
!/llm-zoomcamp-1.png

/bin/bash: /llm-zoomcamp-1.png: No such file or directory


In [25]:
# file/llm-zoomcamp-1.png
!workspaces/llm-zoomcamp/file/llm-zoomcamp-1.png

/bin/bash: workspaces/llm-zoomcamp/file/llm-zoomcamp-1.png: No such file or directory


In [1]:
import minsearch

In [None]:
!wget https://raw.githubusercontent.com/DataTalksClub/llm-zoomcamp/main/01-intro/documents.json

In [2]:
import json

In [3]:
with open('documents.json', 'rt') as f_in:
          docs_raw = json.load(f_in)

In [4]:
documents = []

for course_dict in docs_raw:
    for doc in course_dict['documents']:
        doc['course'] = course_dict['course']
        documents.append(doc)

In [5]:
documents[0]

{'text': "The purpose of this document is to capture frequently asked technical questions\nThe exact day and hour of the course will be 15th Jan 2024 at 17h00. The course will start with the first  “Office Hours'' live.1\nSubscribe to course public Google Calendar (it works from Desktop only).\nRegister before the course starts using this link.\nJoin the course Telegram channel with announcements.\nDon’t forget to register in DataTalks.Club's Slack and join the channel.",
 'section': 'General course-related questions',
 'question': 'Course - When will the course start?',
 'course': 'data-engineering-zoomcamp'}

In [6]:
index = minsearch.Index(
    text_fields=["question", "text", "section"],
    keyword_fields=["course"]
)

In [None]:
SELECT * WHERE course = 'data-engineering-zoomcamp';

In [9]:
q = 'the course has already started, can I still enroll?'

In [7]:
index.fit(documents)

<minsearch.Index at 0x73518bcaad10>

In [10]:
boost = {'questions': 3.0, 'section': 0.5}

results = index.search(
    query=q,
    filter_dict={'course' : 'data-engineering-zoomcamp'},
    boost_dict=boost,
    num_results=5
)

In [17]:
results

[{'text': "Yes, even if you don't register, you're still eligible to submit the homeworks.\nBe aware, however, that there will be deadlines for turning in the final projects. So don't leave everything for the last minute.",
  'section': 'General course-related questions',
  'question': 'Course - Can I still join the course after the start date?',
  'course': 'data-engineering-zoomcamp'},
 {'text': "The purpose of this document is to capture frequently asked technical questions\nThe exact day and hour of the course will be 15th Jan 2024 at 17h00. The course will start with the first  “Office Hours'' live.1\nSubscribe to course public Google Calendar (it works from Desktop only).\nRegister before the course starts using this link.\nJoin the course Telegram channel with announcements.\nDon’t forget to register in DataTalks.Club's Slack and join the channel.",
  'section': 'General course-related questions',
  'question': 'Course - When will the course start?',
  'course': 'data-engineerin

#### LLM Zoomcamp 1.4 - Generating Answers with OpenAI GPT

In [11]:
from openai import OpenAI

In [12]:
client = OpenAI()q

In [20]:
q

'the course has already started, can I still enroll?'

In [29]:
response = client.chat.completions.create(
    model='gpt-3.5-turbo-16k',
    # messages=[{"role": "user", "content": "is it too late to join the course?"}]
    messages=[{"role": "user", "content": q}]
)

In [30]:
response

ChatCompletion(id='chatcmpl-9ctSOcG5NolTgIvvXPoMWWch3aieA', choices=[Choice(finish_reason='stop', index=0, logprobs=None, message=ChatCompletionMessage(content="I'm sorry, but I am an AI language model and do not have the specific information about courses and enrollments. You would need to contact the institution or organization offering the course to inquire about enrolling after the course has already started. They will be able to provide you with the most accurate information regarding enrollment.", role='assistant', function_call=None, tool_calls=None))], created=1719056668, model='gpt-3.5-turbo-16k-0613', object='chat.completion', system_fingerprint=None, usage=CompletionUsage(completion_tokens=63, prompt_tokens=18, total_tokens=81))

In [31]:
response.choices[0].message.content

"I'm sorry, but I am an AI language model and do not have the specific information about courses and enrollments. You would need to contact the institution or organization offering the course to inquire about enrolling after the course has already started. They will be able to provide you with the most accurate information regarding enrollment."

In [13]:
prompt_template = """
You're a course teaching assistant. Answer the QUESTION based on the CONTEXT from the FAQ database
Use only the facts from the CONTEXT when answering the QUESTION.
If the CONTEXT  doesn't contain the answer, output NONE

QUESTION: {question}

CONTEXT:
{context}
""".strip()

In [14]:
context = ""

for doc in results:
    context = context + f"section: {doc['section']}\nquestion: {doc['question']}\nanswer: {doc['text']}\n\n"

In [32]:
prompt = prompt_template.format(question=q, context=context).strip()

In [33]:
print(prompt)

You're a course teaching assistant. Answer the QUESTION based on the CONTEXT from the FAQ database
Use only the facts from the CONTEXT when answering the QUESTION.
If the CONTEXT  doesn't contain the answer, output NONE

QUESTION: the course has already started, can I still enroll?

CONTEXT:
section: General course-related questions
question: Course - Can I still join the course after the start date?
answer: Yes, even if you don't register, you're still eligible to submit the homeworks.
Be aware, however, that there will be deadlines for turning in the final projects. So don't leave everything for the last minute.

section: General course-related questions
question: Course - When will the course start?
answer: The purpose of this document is to capture frequently asked technical questions
The exact day and hour of the course will be 15th Jan 2024 at 17h00. The course will start with the first  “Office Hours'' live.1
Subscribe to course public Google Calendar (it works from Desktop only

In [34]:
response = client.chat.completions.create(
    model='gpt-3.5-turbo',
    # messages=[{"role": "user", "content": "is it too late to join the course?"}]
    messages=[{"role": "user", "content": prompt}]
)

response.choices[0].message.content

"Yes, you can still join the course after the start date. Even if you don't register, you're still eligible to submit the homeworks. Just be aware of deadlines for turning in the final projects."

#### LLM Zoomcamp 1.4.2 - Exploring Alternatives to OpenAI

https://mistral.ai/

#### LLM Zoomcamp 1.5 - The RAG Flow Cleaning and Modularizing Code

In [36]:
def search(query):
    boost = {'questions': 3.0, 'section': 0.5}

    results = index.search(
        query=query,
        filter_dict={'course' : 'data-engineering-zoomcamp'},
        boost_dict=boost,
        num_results=5
    )

    return results
    

In [37]:
search('how do I run kafaka')

[{'text': "Answer: To run the provided code, ensure that the 'dlt[duckdb]' package is installed. You can do this by executing the provided installation command: !pip install dlt[duckdb]. If you’re doing it locally, be sure to also have duckdb pip installed (even before the duckdb package is loaded).",
  'section': 'Workshop 1 - dlthub',
  'question': 'How do I install the necessary dependencies to run the code?',
  'course': 'data-engineering-zoomcamp'},
 {'text': 'After you create a GitHub account, you should clone the course repo to your local machine using the process outlined in this video: Git for Everybody: How to Clone a Repository from GitHub\nHaving this local repository on your computer will make it easy for you to access the instructors’ code and make pull requests (if you want to add your own notes or make changes to the course content).\nYou will probably also create your own repositories that host your notes, versions of your file, to do this. Here is a great tutorial tha

In [38]:
query = ('how do I run kafaka')
search_results = search(query)

In [43]:
def build_prompt(query, search_results):
    prompt_template = """
You're a course teaching assistant. Answer the QUESTION based on the CONTEXT from the FAQ database
Use only the facts from the CONTEXT when answering the QUESTION.
If the CONTEXT  doesn't contain the answer, output NONE

QUESTION: {question}

CONTEXT:
{context}
""".strip()

    context = ""

    for doc in search_results:
        context = context + f"section: {doc['section']}\nquestion: {doc['question']}\nanswer: {doc['text']}\n\n"

    prompt = prompt_template.format(question=query, context=context).strip()
    return prompt

In [44]:
build_prompt(query, search_results)
# prompt = build_prompt(query, search_results)

"You're a course teaching assistant. Answer the QUESTION based on the CONTEXT from the FAQ database\nUse only the facts from the CONTEXT when answering the QUESTION.\nIf the CONTEXT  doesn't contain the answer, output NONE\n\nQUESTION: how do I run kafaka\n\nCONTEXT:\nsection: Workshop 1 - dlthub\nquestion: How do I install the necessary dependencies to run the code?\nanswer: Answer: To run the provided code, ensure that the 'dlt[duckdb]' package is installed. You can do this by executing the provided installation command: !pip install dlt[duckdb]. If you’re doing it locally, be sure to also have duckdb pip installed (even before the duckdb package is loaded).\n\nsection: General course-related questions\nquestion: How do I use Git / GitHub for this course?\nanswer: After you create a GitHub account, you should clone the course repo to your local machine using the process outlined in this video: Git for Everybody: How to Clone a Repository from GitHub\nHaving this local repository on y

In [49]:
def llm(prompt):
    response = client.chat.completions.create(
        model='gpt-3.5-turbo',
        messages=[{"role": "user", "content": prompt}]
    )

    return response.choices[0].message.content

In [50]:
query = 'how do I run kafka?'
search_result = search(query)
prompt = build_prompt(query, search_result)
answer = llm(prompt)

In [51]:
answer

'To run Kafka, you need to create a virtual environment and run the requirements.txt and python files in that environment by following the steps provided in the answer.'

In [52]:
query = 'how do I run kafka?'

def rag(query):
    search_result = search(query)
    prompt = build_prompt(query, search_result)
    answer = llm(prompt)
    return answer

In [53]:
rag(query)

'To run Kafka, you need to create a virtual environment and install the necessary requirements using the provided steps.'

In [54]:
rag('the course has already started, can I still enroll?')

'Yes, even if the course has already started, you can still enroll. Remember that there will be deadlines for turning in the final projects, so make sure not to leave everything for the last minute.'

#### LLM Zoomcamp 1.6 - Search with Elasticsearch

In [58]:
!wget -P /workspaces/llm-zoomcamp/01-intro https://github.com/DataTalksClub/llm-zoomcamp/blob/main/01-intro/README.md

--2024-06-22 12:46:36--  https://github.com/DataTalksClub/llm-zoomcamp/blob/main/01-intro/README.md
Resolving github.com (github.com)... 140.82.121.4
Connecting to github.com (github.com)|140.82.121.4|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: unspecified [text/html]
Saving to: ‘/workspaces/llm-zoomcamp/01-intro/README.md’

README.md               [ <=>                ] 325.63K  --.-KB/s    in 0.04s   

2024-06-22 12:46:37 (8.41 MB/s) - ‘/workspaces/llm-zoomcamp/01-intro/README.md’ saved [333447]



In [59]:
from elasticsearch import Elasticsearch

In [60]:
es_client = Elasticsearch('http://localhost:9200') 

In [61]:
es_client.info()

ObjectApiResponse({'name': '6edccedcc834', 'cluster_name': 'docker-cluster', 'cluster_uuid': 'TyVwD6MCRiOpca5A6nLsyg', 'version': {'number': '8.4.3', 'build_flavor': 'default', 'build_type': 'docker', 'build_hash': '42f05b9372a9a4a470db3b52817899b99a76ee73', 'build_date': '2022-10-04T07:17:24.662462378Z', 'build_snapshot': False, 'lucene_version': '9.3.0', 'minimum_wire_compatibility_version': '7.17.0', 'minimum_index_compatibility_version': '7.0.0'}, 'tagline': 'You Know, for Search'})

In [62]:
index_settings = {
    "settings": {
        "number_of_shards": 1,
        "number_of_replicas": 0
    },
    "mappings": {
        "properties": {
            "text": {"type": "text"},
            "section": {"type": "text"},
            "question": {"type": "text"},
            "course": {"type": "keyword"} 
        }
    }
}

index_name = "course-questions"

es_client.indices.create(index=index_name, body=index_settings)


ObjectApiResponse({'acknowledged': True, 'shards_acknowledged': True, 'index': 'course-questions'})

In [63]:
documents[0]

{'text': "The purpose of this document is to capture frequently asked technical questions\nThe exact day and hour of the course will be 15th Jan 2024 at 17h00. The course will start with the first  “Office Hours'' live.1\nSubscribe to course public Google Calendar (it works from Desktop only).\nRegister before the course starts using this link.\nJoin the course Telegram channel with announcements.\nDon’t forget to register in DataTalks.Club's Slack and join the channel.",
 'section': 'General course-related questions',
 'question': 'Course - When will the course start?',
 'course': 'data-engineering-zoomcamp'}

In [64]:
from tqdm.auto import tqdm

  from .autonotebook import tqdm as notebook_tqdm


In [65]:
for doc in tqdm(documents):
    es_client.index(index=index_name, document=doc)

100%|███████████████████████████████████████████████████████████████████████████████████| 948/948 [00:29<00:00, 32.13it/s]


In [67]:
query = 'I just discovered the course. Can I still join it?'

In [68]:
search_query = {
    "size": 5,
    "query": {
        "bool": {
            "must": {
                "multi_match": {
                    "query": query,
                    "fields": ["question^3", "text", "section"],
                    "type": "best_fields"
                }
            },
            "filter": {
                "term": {
                    "course": "data-engineering-zoomcamp"
                }
            }
        }
    }
}

In [69]:
es_client.search(index=index_name, body=search_query)

ObjectApiResponse({'took': 853, 'timed_out': False, '_shards': {'total': 1, 'successful': 1, 'skipped': 0, 'failed': 0}, 'hits': {'total': {'value': 405, 'relation': 'eq'}, 'max_score': 72.849266, 'hits': [{'_index': 'course-questions', '_id': 'twogQJABf9OOEOx7P3TF', '_score': 72.849266, '_source': {'text': "Yes, even if you don't register, you're still eligible to submit the homeworks.\nBe aware, however, that there will be deadlines for turning in the final projects. So don't leave everything for the last minute.", 'section': 'General course-related questions', 'question': 'Course - Can I still join the course after the start date?', 'course': 'data-engineering-zoomcamp'}}, {'_index': 'course-questions', '_id': 'vAogQJABf9OOEOx7QHRy', '_score': 54.057133, '_source': {'text': 'Yes, we will keep all the materials after the course finishes, so you can follow the course at your own pace after it finishes.\nYou can also continue looking at the homeworks and continue preparing for the next

In [70]:
response = es_client.search(index=index_name, body=search_query)

In [71]:
response

ObjectApiResponse({'took': 27, 'timed_out': False, '_shards': {'total': 1, 'successful': 1, 'skipped': 0, 'failed': 0}, 'hits': {'total': {'value': 405, 'relation': 'eq'}, 'max_score': 72.849266, 'hits': [{'_index': 'course-questions', '_id': 'twogQJABf9OOEOx7P3TF', '_score': 72.849266, '_source': {'text': "Yes, even if you don't register, you're still eligible to submit the homeworks.\nBe aware, however, that there will be deadlines for turning in the final projects. So don't leave everything for the last minute.", 'section': 'General course-related questions', 'question': 'Course - Can I still join the course after the start date?', 'course': 'data-engineering-zoomcamp'}}, {'_index': 'course-questions', '_id': 'vAogQJABf9OOEOx7QHRy', '_score': 54.057133, '_source': {'text': 'Yes, we will keep all the materials after the course finishes, so you can follow the course at your own pace after it finishes.\nYou can also continue looking at the homeworks and continue preparing for the next 

In [73]:
# response['hits']
response['hits']['hits']


[{'_index': 'course-questions',
  '_id': 'twogQJABf9OOEOx7P3TF',
  '_score': 72.849266,
  '_source': {'text': "Yes, even if you don't register, you're still eligible to submit the homeworks.\nBe aware, however, that there will be deadlines for turning in the final projects. So don't leave everything for the last minute.",
   'section': 'General course-related questions',
   'question': 'Course - Can I still join the course after the start date?',
   'course': 'data-engineering-zoomcamp'}},
 {'_index': 'course-questions',
  '_id': 'vAogQJABf9OOEOx7QHRy',
  '_score': 54.057133,
  '_source': {'text': 'Yes, we will keep all the materials after the course finishes, so you can follow the course at your own pace after it finishes.\nYou can also continue looking at the homeworks and continue preparing for the next cohort. I guess you can also start working on your final capstone project.',
   'section': 'General course-related questions',
   'question': 'Course - Can I follow the course after 

In [74]:
result_docs = []

for hit in response['hits']['hits']:
    result_docs.append(hit['_source'])

In [75]:
result_docs

[{'text': "Yes, even if you don't register, you're still eligible to submit the homeworks.\nBe aware, however, that there will be deadlines for turning in the final projects. So don't leave everything for the last minute.",
  'section': 'General course-related questions',
  'question': 'Course - Can I still join the course after the start date?',
  'course': 'data-engineering-zoomcamp'},
 {'text': 'Yes, we will keep all the materials after the course finishes, so you can follow the course at your own pace after it finishes.\nYou can also continue looking at the homeworks and continue preparing for the next cohort. I guess you can also start working on your final capstone project.',
  'section': 'General course-related questions',
  'question': 'Course - Can I follow the course after it finishes?',
  'course': 'data-engineering-zoomcamp'},
 {'text': 'You can start by installing and setting up all the dependencies and requirements:\nGoogle cloud account\nGoogle Cloud SDK\nPython 3 (insta

In [77]:
def elastic_search(query):

    search_query = {
        "size": 5,
        "query": {
            "bool": {
                "must": {
                    "multi_match": {
                        "query": query,
                        "fields": ["question^3", "text", "section"],
                        "type": "best_fields"
                    }
                },
                "filter": {
                    "term": {
                        "course": "data-engineering-zoomcamp"
                    }
                }
            }
        }
    }

    response = es_client.search(index=index_name, body=search_query)
    
    result_docs = []
    
    for hit in response['hits']['hits']:
        result_docs.append(hit['_source'])
    
    return result_docs

In [78]:
elastic_search(query)

[{'text': "Yes, even if you don't register, you're still eligible to submit the homeworks.\nBe aware, however, that there will be deadlines for turning in the final projects. So don't leave everything for the last minute.",
  'section': 'General course-related questions',
  'question': 'Course - Can I still join the course after the start date?',
  'course': 'data-engineering-zoomcamp'},
 {'text': 'Yes, we will keep all the materials after the course finishes, so you can follow the course at your own pace after it finishes.\nYou can also continue looking at the homeworks and continue preparing for the next cohort. I guess you can also start working on your final capstone project.',
  'section': 'General course-related questions',
  'question': 'Course - Can I follow the course after it finishes?',
  'course': 'data-engineering-zoomcamp'},
 {'text': 'You can start by installing and setting up all the dependencies and requirements:\nGoogle cloud account\nGoogle Cloud SDK\nPython 3 (insta

In [80]:
# query = 'how do I run kafka?'

def rag(query):
    search_result = elastic_search(query)
    prompt = build_prompt(query, search_result)
    answer = llm(prompt)
    return answer

In [82]:
rag(query)

"Yes, even if you don't register, you're still eligible to submit the homeworks."