## Q1. Running Elastic
Run Elasticsearch 8.17.6 and get the cluster information. What is the version.build_hash value?

In [1]:
response = !curl 'localhost:9200'

In [2]:
import subprocess
import json

In [3]:
result = subprocess.run(
    ["curl", "-s", "http://localhost:9200"],
    stdout=subprocess.PIPE,
    text=True
)

data = json.loads(result.stdout)

build_hash = data["version"]["build_hash"]

print("Build Hash:", build_hash)

Build Hash: dbcbbbd0bc4924cfeb28929dc05d82d662c527b7
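The parsing step can also be isolated from the `curl` call so it is easy to check on its own. This is a minimal sketch: `extract_build_hash` is a hypothetical helper name, and the sample body below is a made-up placeholder, not real cluster output.

```python
import json


def extract_build_hash(body: str) -> str:
    """Return version.build_hash from the JSON body returned by GET / on the cluster."""
    info = json.loads(body)
    return info["version"]["build_hash"]


# Placeholder payload for illustration only; the number and hash are not real values.
sample_body = '{"name": "node-1", "version": {"number": "0.0.0", "build_hash": "abc123"}}'
print(extract_build_hash(sample_body))
```

Passing `result.stdout` from the cell above to this helper should give the same value as indexing the parsed dictionary directly.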


In [4]:
import requests 

docs_url = 'https://github.com/DataTalksClub/llm-zoomcamp/blob/main/01-intro/documents.json?raw=1'
docs_response = requests.get(docs_url)
documents_raw = docs_response.json()

documents = []

for course in documents_raw:
    course_name = course['course']

    for doc in course['documents']:
        doc['course'] = course_name
        documents.append(doc)
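The flattening loop above can be sketched as a small function and tried on toy data. `flatten_documents` and the sample records are hypothetical; unlike the cell above, this version copies each entry instead of modifying it in place.

```python
def flatten_documents(documents_raw):
    """Return one flat list of FAQ entries, each tagged with its course name."""
    documents = []
    for course in documents_raw:
        for doc in course["documents"]:
            # Build a new dict so the raw input is left untouched.
            documents.append({**doc, "course": course["course"]})
    return documents


# Toy input with the same nesting as documents.json (values are made up).
sample_raw = [
    {"course": "course-a", "documents": [{"question": "q1", "text": "t1", "section": "s1"}]},
    {"course": "course-b", "documents": [{"question": "q2", "text": "t2", "section": "s2"}]},
]
flat = flatten_documents(sample_raw)
print(len(flat), flat[0]["course"])
```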

## Q2. Indexing the data

In [5]:
!pip install elasticsearch




In [6]:
from elasticsearch import Elasticsearch

In [7]:
es_client = Elasticsearch(
    hosts="http://localhost:9200",
    # Request the version-8-compatible REST API format from the server.
    headers={
        "Accept": "application/vnd.elasticsearch+json; compatible-with=8",
        "Content-Type": "application/vnd.elasticsearch+json; compatible-with=8",
    },
)

In [37]:
index_settings = {
    "settings": {
        "number_of_shards": 1,
        "number_of_replicas": 0
    },
    "mappings": {
        "properties": {
            "text": {"type": "text"},
            #"section": {"type": "text"},
            "question": {"type": "text"},
            "course": {"type": "keyword"} 
        }
    }
}

index_name = "questions"
es_client.indices.create(index=index_name,body=index_settings)

ObjectApiResponse({'acknowledged': True, 'shards_acknowledged': True, 'index': 'questions'})

In [38]:
from tqdm.auto import tqdm

for doc in tqdm(documents):
    # Index the documents one at a time; each call is a separate request.
    es_client.index(index=index_name, document=doc)

100%|███████████████████████████████████████████████████████████████████████████████████| 948/948 [00:03<00:00, 303.70it/s]
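Indexing one document per request works for 948 documents, but the client also ships a bulk helper that sends many documents per request. The following is a rough sketch, not what the notebook ran: `bulk_actions` and `index_in_bulk` are hypothetical names, and only `elasticsearch.helpers.bulk` comes from the client library.

```python
def bulk_actions(docs, index_name):
    """Yield one bulk action per document, targeting the given index."""
    for doc in docs:
        yield {"_index": index_name, "_source": doc}


def index_in_bulk(es_client, docs, index_name):
    """Send all documents through the bulk helper and return its result."""
    # Imported here so that bulk_actions can be tried without the client installed.
    from elasticsearch.helpers import bulk

    return bulk(es_client, bulk_actions(docs, index_name))


# The generator can be inspected without a running cluster.
actions = list(bulk_actions([{"question": "q1", "text": "t1"}], "questions"))
print(actions)
```

With a live client this would be called as `index_in_bulk(es_client, documents, index_name)`.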


## Q3. Searching
We will execute a query "How do execute a command on a Kubernetes pod?".

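Use only question and text fields and give question a boost of 4, and use "type": "best_fields". The query in this notebook also filters on the data-engineering-zoomcamp course, so only hits from that course are returned.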

In [39]:
query= "How do execute a command on a Kubernetes pod?"

In [49]:
search_query = {
    "size": 10,
    "query": {
        "bool": {
            "must": {
                "multi_match": {
                    "query": query,
                    "fields": ["question^4", "text"],
                    "type": "best_fields"
                }
            },
            "filter": {
                "term": {
                    "course": "data-engineering-zoomcamp"
                }
            }
        }
    }
}
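Because Q3 and Q4 use the same query shape with a different course filter and size, the request body can be produced by one function. This is a sketch under that assumption: `build_search_query` is a hypothetical helper, and the course filter is optional so the query can also be run without it.

```python
def build_search_query(query, course=None, size=10):
    """Build a multi_match request body over question (boosted by 4) and text."""
    bool_query = {
        "must": {
            "multi_match": {
                "query": query,
                "fields": ["question^4", "text"],
                "type": "best_fields",
            }
        }
    }
    if course is not None:
        # A term filter restricts the hits to one course without contributing to the score.
        bool_query["filter"] = {"term": {"course": course}}
    return {"size": size, "query": {"bool": bool_query}}


body = build_search_query(
    "How do execute a command on a Kubernetes pod?",
    course="data-engineering-zoomcamp",
)
print(body["query"]["bool"]["filter"])
```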

In [50]:
response=es_client.search(index=index_name,body=search_query)

In [51]:
response['hits']

{'total': {'value': 334, 'relation': 'eq'},
 'max_score': 31.973522,
 'hits': [{'_index': 'questions',
   '_id': 'GM1XbpcBCHcM2Mh4e3uS',
   '_score': 31.973522,
   '_source': {'text': 'Install the astronomer-cosmos package as a dependency. (see Terraform example).\nMake a new folder, dbt/, inside the dags/ folder of your Composer GCP bucket and copy paste your dbt-core project there. (see example)\nEnsure your profiles.yml is configured to authenticate with a service account key. (see BigQuery example)\nCreate a new DAG using the DbtTaskGroup class and a ProfileConfig specifying a profiles_yml_filepath that points to the location of your JSON key file. (see example)\nYour dbt lineage graph should now appear as tasks inside a task group like this:',
    'section': 'Course Management Form for Homeworks',
    'question': 'How to run a dbt-core project as an Airflow Task Group on Google Cloud Composer using a service account JSON key',
    'course': 'data-engineering-zoomcamp'}},
  {'_inde

## Q4. Filtering
Now ask a different question: "How do copy a file to a Docker container?".

In [53]:
query="How do copy a file to a Docker container?"

In [78]:
def elastic_search(query):
    """Return the _source of the top 5 machine-learning-zoomcamp hits for query."""
    search_query = {
        "size": 5,
        "query": {
            "bool": {
                "must": {
                    "multi_match": {
                        "query": query,
                        "fields": ["question^4", "text"],
                        "type": "best_fields"
                    }
                },
                "filter": {
                    "term": {
                        "course": "machine-learning-zoomcamp"
                    }
                }
            }
        }
    }
    response = es_client.search(index=index_name, body=search_query)
    # Keep only the stored documents, dropping scores and index metadata.
    return [hit["_source"] for hit in response["hits"]["hits"]]
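To see which questions were returned and how they scored, the hits can be reduced to (score, question) pairs. A minimal sketch: `summarize_hits` is a hypothetical helper, and the sample response is made up but follows the same `hits.hits` layout as the search output shown earlier.

```python
def summarize_hits(response):
    """Return (score, question) pairs in the order Elasticsearch ranked them."""
    return [
        (hit["_score"], hit["_source"]["question"])
        for hit in response["hits"]["hits"]
    ]


# Made-up response for illustration; real scores and questions will differ.
sample_response = {
    "hits": {
        "hits": [
            {"_score": 2.0, "_source": {"question": "first question", "text": "a"}},
            {"_score": 1.5, "_source": {"question": "second question", "text": "b"}},
        ]
    }
}
for score, question in summarize_hits(sample_response):
    print(f"{score:.2f}  {question}")
```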


## Q5. Building a prompt

In [80]:
def build_prompt(query, search_results):
    context_template =  """
    You're a course teaching assistant. Answer the QUESTION based on the CONTEXT from the FAQ database.
    Use only the facts from the CONTEXT when answering the QUESTION.
    
    QUESTION: {question}
    
    CONTEXT:
    {context}
    """.strip()
    
    context = ""

    for doc in search_results:
        context += f"question: {doc['question']}\nanswer: {doc['text']}\n\n"

    prompt = context_template.format(question=query, context=context).strip()
    return prompt
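Since the template in `build_prompt` is written inside the function body, its indentation becomes part of the prompt text. The sketch below is one possible variant that keeps the template at module level and joins the entries with blank lines; `PROMPT_TEMPLATE`, `ENTRY_TEMPLATE` and `build_prompt_flat` are hypothetical names, and this variant will produce a different prompt length than the one reported in this notebook.

```python
PROMPT_TEMPLATE = """
You're a course teaching assistant. Answer the QUESTION based on the CONTEXT from the FAQ database.
Use only the facts from the CONTEXT when answering the QUESTION.

QUESTION: {question}

CONTEXT:
{context}
""".strip()

ENTRY_TEMPLATE = "question: {question}\nanswer: {text}"


def build_prompt_flat(query, search_results):
    """Format the search results as context entries and fill in the prompt template."""
    context = "\n\n".join(
        ENTRY_TEMPLATE.format(question=doc["question"], text=doc["text"])
        for doc in search_results
    )
    return PROMPT_TEMPLATE.format(question=query, context=context)


# Toy search results for illustration only.
sample_docs = [
    {"question": "a", "text": "b"},
    {"question": "c", "text": "d"},
]
sample_prompt = build_prompt_flat("q?", sample_docs)
print(sample_prompt)
```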



In [82]:
query = "How do copy a file to a Docker container?"
result_docs = elastic_search(query)
prompt = build_prompt(query,result_docs)

In [86]:
len(prompt.strip())

2235

In [87]:
!pip install tiktoken


In [89]:
import tiktoken
encoding = tiktoken.encoding_for_model("gpt-4o")

In [93]:
len(encoding.encode(prompt))

494
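The two numbers above (2235 characters, 494 tokens) give a rough characters-per-token ratio for this prompt. The sketch below works that out and wraps the counting step in a function that accepts any encoder; `count_tokens` and `chars_per_token` are hypothetical helpers, and `str.split` is only a stand-in so the example runs without downloading a tokenizer.

```python
def count_tokens(text, encode):
    """Count tokens using any callable that maps text to a sequence of tokens."""
    return len(encode(text))


def chars_per_token(n_chars, n_tokens):
    """Average number of characters covered by one token."""
    return n_chars / n_tokens


# Figures reported in this notebook for the Q5 prompt.
print(round(chars_per_token(2235, 494), 2))  # 4.52

# Whitespace splitting is a crude stand-in for a real tokenizer.
print(count_tokens("How do copy a file to a Docker container?", str.split))
```

With tiktoken available, the same function can be called as `count_tokens(prompt, encoding.encode)`.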