
| Tool        | Description                                           | Best Use Case             | Size (initial)        | Download Link                                                                 |
|-------------|-------------------------------------------------------|---------------------------|------------------------|--------------------------------------------------------------------------------|
| Anaconda    | Full Python data distro                               | Beginner-friendly setup   | 🚛 Huge (~3–5 GB)      | [Archive 📦](https://repo.anaconda.com/archive/)                               |
| Miniconda   | Minimal Conda installer (Python + conda)              | Custom, lightweight envs  | 📦 Small (~70 MB)      | [Miniconda Releases 🌐](https://repo.anaconda.com/miniconda/)                 |
| Micromamba  | Fast, lightweight Conda-compatible CLI (C++ binary)   | CI/CD, containers         | 🪶 Tiny (~2 MB binary) | [Micromamba GitHub 🚀](https://github.com/mamba-org/micromamba-releases/tags) |



In [1]:
# !conda init

## OpenAI key and Hugging Face token

- [platform.openai.com](https://platform.openai.com/docs/overview)
- [huggingface.co](https://huggingface.co/)

In [2]:
# !pip install tqdm notebook jupyter ipywidgets

In [3]:
# !pip install openai transformers tensorflow tf-keras elasticsearch

In [4]:
# !pip install huggingface_hub python-dotenv

In [5]:
# import getpass

# Securely prompt user for Hugging Face token (input will be hidden)
# token = getpass.getpass("🔐 Enter your Hugging Face token: ").strip()

# Validate token is not empty
# if not token:
#     raise ValueError("⚠️ Token input was empty. Please enter a valid token.")

In [6]:
import os
from dotenv import load_dotenv

# Load variables from a .env file into the process environment
load_dotenv()

# Read the token; os.getenv returns None if HUGGINGFACE_TOKEN is not set
token = os.getenv("HUGGINGFACE_TOKEN")

In [7]:
from huggingface_hub import login

# Log in; this saves the token to the local Hugging Face cache for later calls
login(token=token)

# print("✅ Successfully logged in to Hugging Face Hub.")
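`os.getenv` returns `None` when `HUGGINGFACE_TOKEN` is missing, in which case `login` is called without a token and may fall back to an interactive prompt. A minimal sketch of a guard that fails early with a clear message instead; the helper name `require_env` is hypothetical and not part of `python-dotenv` or `huggingface_hub`:

```python
import os


def require_env(name, environ=None):
    """Return the value of environment variable `name`, or raise if it is missing or blank."""
    environ = os.environ if environ is None else environ
    value = (environ.get(name) or "").strip()
    if not value:
        raise RuntimeError(f"{name} is not set; add it to your .env file or export it.")
    return value


# Usage in the notebook (after load_dotenv()):
# token = require_env("HUGGINGFACE_TOKEN")
```

Passing `environ` explicitly is only there to make the helper easy to try without touching the real environment.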

In [8]:
import requests 

docs_url = 'https://github.com/DataTalksClub/llm-zoomcamp/blob/main/01-intro/documents.json?raw=1'
docs_response = requests.get(docs_url)
documents_raw = docs_response.json()

documents = []

for course in documents_raw:
    course_name = course['course']

    for doc in course['documents']:
        doc['course'] = course_name
        documents.append(doc)

documents[0]

{'text': "The purpose of this document is to capture frequently asked technical questions\nThe exact day and hour of the course will be 15th Jan 2024 at 17h00. The course will start with the first  “Office Hours'' live.1\nSubscribe to course public Google Calendar (it works from Desktop only).\nRegister before the course starts using this link.\nJoin the course Telegram channel with announcements.\nDon’t forget to register in DataTalks.Club's Slack and join the channel.",
 'section': 'General course-related questions',
 'question': 'Course - When will the course start?',
 'course': 'data-engineering-zoomcamp'}
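The flattening loop above adds the `course` key to the parsed JSON in place. A small sketch of the same step as a function that copies each document instead; `flatten_documents` is a hypothetical name, and the sample data below is made up to mirror the shape of `documents.json`:

```python
def flatten_documents(documents_raw):
    """Flatten [{'course': ..., 'documents': [...]}] into one list of docs tagged with their course."""
    flat = []
    for course in documents_raw:
        for doc in course["documents"]:
            # Copy instead of mutating the parsed JSON in place.
            flat.append({**doc, "course": course["course"]})
    return flat


# Made-up sample with the same structure as the downloaded file.
sample = [
    {"course": "course-a", "documents": [{"text": "t1", "section": "s1", "question": "q1"}]},
    {"course": "course-b", "documents": [{"text": "t2", "section": "s2", "question": "q2"}]},
]
flat = flatten_documents(sample)
print(flat[1]["course"])  # course-b
```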

In [9]:
from transformers import pipeline

qa = pipeline("question-answering", model="deepset/roberta-base-squad2")

result = qa(
    question=documents[0]['question'],
    context=documents[0]['text'],
)

print(result["answer"])



15th Jan 2024


## Elasticsearch in Docker (inside Codespaces)

- https://hub.docker.com/_/elasticsearch/tags

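🔎 Explanation:
- xpack.security.enabled=false: disables authentication and TLS so the node answers on plain http://localhost:9200 (local testing only)
- discovery.type=single-node: runs a one-node cluster and bypasses the production bootstrap checks
- elasticsearch:9.0.2: the image published on Docker Hub; docker.elastic.co/elasticsearch/elasticsearch:9.0.2 pulls the same version from Elastic's own registry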

```sh
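# List running containers, then pull the image from Docker Hub
docker ps
docker pull "elasticsearch:9.0.2"

# Option A: quick start from the Docker Hub image.
# Security is on by default in 8.x and later, so expect HTTPS and a password here.
docker run -d -p 9200:9200 -e "discovery.type=single-node" "elasticsearch:9.0.2"

# Option B: named container from Elastic's registry with security disabled.
# Run only one of A or B: both publish host port 9200.
docker run -d --name es-dev \
  -p 9200:9200 \
  -p 9300:9300 \
  -e "discovery.type=single-node" \
  -e "xpack.security.enabled=false" \
  docker.elastic.co/elasticsearch/elasticsearch:9.0.2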
```

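Once Kibana is running (it is started by the Compose file below, not by the `docker run` commands above), open it in your browser: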
- http://localhost:5601

In [10]:
%%writefile docker-compose.yml

version: "3.8"

services:

  elasticsearch:
    image: docker.elastic.co/elasticsearch/elasticsearch:9.0.2
    container_name: es-dev
    environment:
      - discovery.type=single-node
      - xpack.security.enabled=false         # disable auth
      - ES_JAVA_OPTS=-Xms512m -Xmx512m       # lower heap for dev use
    ports:
      - "9200:9200"
    volumes:
      - esdata:/usr/share/elasticsearch/data
    healthcheck:
      test: ["CMD-SHELL", "curl -s http://localhost:9200 >/dev/null || exit 1"]
      interval: 10s
      timeout: 5s
      retries: 5

  kibana:
    image: docker.elastic.co/kibana/kibana:9.0.2
    container_name: kibana
    environment:
      - ELASTICSEARCH_HOSTS=http://elasticsearch:9200
      - xpack.security.enabled=false
    ports:
      - "5601:5601"
    depends_on:
      elasticsearch:
        condition: service_healthy

volumes:
  esdata:


Overwriting docker-compose.yml
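The `healthcheck` in the Compose file makes Kibana wait for Elasticsearch, and the notebook cells that follow depend on the same thing. A minimal sketch of polling until the node responds, assuming the container has already been started; `wait_until` is a hypothetical helper, and the `sleep` and `clock` parameters exist only so it can be exercised without real delays:

```python
import time


def wait_until(probe, timeout=60.0, interval=2.0, sleep=time.sleep, clock=time.monotonic):
    """Call `probe()` until it returns True or `timeout` seconds pass. Returns True on success."""
    deadline = clock() + timeout
    while True:
        try:
            if probe():
                return True
        except Exception:
            pass  # treat connection errors as "not ready yet"
        if clock() >= deadline:
            return False
        sleep(interval)


# Usage with the client from this notebook (requires a running container):
# from elasticsearch import Elasticsearch
# es_client = Elasticsearch("http://localhost:9200")
# wait_until(es_client.ping)
```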


In [11]:
from elasticsearch import Elasticsearch

In [12]:
es_client = Elasticsearch('http://localhost:9200')

In [13]:
es_client.info()

ObjectApiResponse({'name': '2ccf31400ab2', 'cluster_name': 'docker-cluster', 'cluster_uuid': 'ijKUPCs6S4m7q9woeGo4dA', 'version': {'number': '9.0.2', 'build_flavor': 'default', 'build_type': 'docker', 'build_hash': '0a58bc1dc7a4ae5412db66624aab968370bd44ce', 'build_date': '2025-05-28T10:06:37.834829258Z', 'build_snapshot': False, 'lucene_version': '10.1.0', 'minimum_wire_compatibility_version': '8.18.0', 'minimum_index_compatibility_version': '8.0.0'}, 'tagline': 'You Know, for Search'})

In [14]:
!curl http://localhost:9200

{
  "name" : "2ccf31400ab2",
  "cluster_name" : "docker-cluster",
  "cluster_uuid" : "ijKUPCs6S4m7q9woeGo4dA",
  "version" : {
    "number" : "9.0.2",
    "build_flavor" : "default",
    "build_type" : "docker",
    "build_hash" : "0a58bc1dc7a4ae5412db66624aab968370bd44ce",
    "build_date" : "2025-05-28T10:06:37.834829258Z",
    "build_snapshot" : false,
    "lucene_version" : "10.1.0",
    "minimum_wire_compatibility_version" : "8.18.0",
    "minimum_index_compatibility_version" : "8.0.0"
  },
  "tagline" : "You Know, for Search"
}


In [15]:
index_settings = {
    "settings": {
        "number_of_shards": 1,
        "number_of_replicas": 0
    },
    "mappings": {
        "properties": {
            "text": {"type": "text"},
            "section": {"type": "text"},
            "question": {"type": "text"},
            "course": {"type": "keyword"} 
        }
    }
}

index_name = "course-questions"

In [16]:
if not es_client.indices.exists(index=index_name):
    es_client.indices.create(index=index_name, body=index_settings)
    print(f"Index '{index_name}' created.")
else:
    print(f"Index '{index_name}' already exists, skipping creation.")

Index 'course-questions' already exists, skipping creation.


In [17]:
from tqdm.auto import tqdm 

In [18]:
for doc in tqdm(documents):
    es_client.index(index=index_name, document=doc)

  0%|          | 0/948 [00:00<?, ?it/s]
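The loop above lets Elasticsearch assign a random `_id` to every document, so re-running the cell against an index that already exists adds a second copy of each one (the repeated hits in the search output further down are consistent with that). A sketch of one way to make indexing repeatable by deriving the ID from the document's content; `doc_id` is a hypothetical helper, and MD5 is used here only as a fingerprint, not for security:

```python
import hashlib


def doc_id(doc):
    """Derive a stable ID from the fields that identify a FAQ entry."""
    key = "\x1f".join([doc["course"], doc["question"], doc["text"]])
    return hashlib.md5(key.encode("utf-8")).hexdigest()


# Usage in the notebook (requires es_client, index_name and documents from above):
# for doc in tqdm(documents):
#     es_client.index(index=index_name, id=doc_id(doc), document=doc)
```

With an explicit `id`, indexing the same document again overwrites the existing entry instead of creating a new one.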

In [19]:
query = "How do execute a command on a Kubernetes pod?"

search_query = {
    "size": 5,
    "query": {
        "bool": {
            "must": {
                "multi_match": {
                    "query": query,
                    "fields": ["question^4", "text"],
                    "type": "best_fields"
                }
            },
        }
    }
}

search_results = es_client.search(index=index_name, body=search_query)

In [20]:
search_results['hits']['hits'][0]['_score']

44.56891

In [21]:
query = "How do copy a file to a Docker container?"

search_query = {
    "size": 3,
    "query": {
        "bool": {
            "must": {
                "multi_match": {
                    "query": query,
                    "fields": ["question^4", "text"],
                    "type": "best_fields"
                }
            },
            "filter": {
                "term": {
                    "course": "machine-learning-zoomcamp"
                }
            }
        }
    }
}

search_results = es_client.search(index=index_name, body=search_query)
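The two queries above differ only in `size` and in the optional `filter` on `course`. A sketch of a builder that produces either form; `build_search_query` is a hypothetical name, and the field boosts are copied from the cells above:

```python
def build_search_query(query, size=5, course=None):
    """Build the bool/multi_match query used above, optionally filtered by course."""
    bool_query = {
        "must": {
            "multi_match": {
                "query": query,
                "fields": ["question^4", "text"],  # matches in `question` weigh 4x
                "type": "best_fields",
            }
        }
    }
    if course is not None:
        # `course` is mapped as keyword, so `term` does an exact match.
        bool_query["filter"] = {"term": {"course": course}}
    return {"size": size, "query": {"bool": bool_query}}


search_query = build_search_query(
    "How do copy a file to a Docker container?",
    size=3,
    course="machine-learning-zoomcamp",
)
```

The result can be passed to `es_client.search(index=index_name, body=search_query)` exactly as in the cells above.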

In [22]:
search_results['hits']['hits']

[{'_index': 'course-questions',
  '_id': 'ORK0xpcB5pkhxiHVtT0Y',
  '_score': 73.5441,
  '_source': {'text': 'Launch the container image in interactive mode and overriding the entrypoint, so that it starts a bash command.\ndocker run -it --entrypoint bash <image>\nIf the container is already running, execute a command in the specific container:\ndocker ps (find the container-id)\ndocker exec -it <container-id> bash\n(Marcos MJD)',
   'section': '5. Deploying Machine Learning Models',
   'question': 'How do I debug a docker container?',
   'course': 'machine-learning-zoomcamp'}},
 {'_index': 'course-questions',
  '_id': '7RK4xpcB5pkhxiHVeUAr',
  '_score': 73.5441,
  '_source': {'text': 'Launch the container image in interactive mode and overriding the entrypoint, so that it starts a bash command.\ndocker run -it --entrypoint bash <image>\nIf the container is already running, execute a command in the specific container:\ndocker ps (find the container-id)\ndocker exec -it <container-id> ba

In [23]:
context_template = """
Q: {question}
A: {text}
""".strip()

prompt_template = """
You're a course teaching assistant. Answer the QUESTION based on the CONTEXT from the FAQ database.
Use only the facts from the CONTEXT when answering the QUESTION.

QUESTION: {question}

CONTEXT:
{context}
""".strip()

In [24]:
context_pieces = []

for hit in search_results['hits']['hits']:
    doc = hit['_source']
    context_piece = context_template.format(**doc)
    context_pieces.append(context_piece)

context = '\n\n'.join(context_pieces)

In [25]:
prompt = prompt_template.format(question=query, context=context)

In [26]:
len(prompt)

1306

In [None]:
# !pip install tiktoken



In [28]:
import tiktoken

In [29]:
print(prompt[:100])

You're a course teaching assistant. Answer the QUESTION based on the CONTEXT from the FAQ database.



In [30]:
tiktoken.encoding_for_model??

Signature: tiktoken.encoding_for_model(model_name: 'str') -> 'Encoding'
Source:   
def encoding_for_model(model_name: str) -> Encoding:
    """Returns the encoding used by a model.

    Raises a KeyError if the model name is not recognised.
    """
    return get_encoding(encoding_name_for_model(model_name))
File:      /usr/local/python/3.12.1/lib/python3.12/site-packages/tiktoken/model.py
Type:      function

In [31]:
# encoding = tiktoken.encoding_for_model("gpt-4o")  # openai

In [32]:
from transformers import AutoTokenizer, AutoModelForQuestionAnswering

encoding = AutoTokenizer.from_pretrained("deepset/roberta-base-squad2")  # hf
# model = AutoModelForQuestionAnswering.from_pretrained("deepset/roberta-base-squad2")

In [33]:
len(encoding.encode(prompt))

333
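The prompt is 1306 characters but 333 tokens, and model limits are expressed in tokens, so the token count is the number to budget against. A sketch of trimming context pieces to a token budget; `fit_to_budget` is a hypothetical helper, and the whitespace counter below is a stand-in for illustration only, not a real tokenizer:

```python
def fit_to_budget(pieces, max_tokens, count_tokens):
    """Keep leading pieces (assumed ordered by relevance) whose total token count fits `max_tokens`."""
    kept, used = [], 0
    for piece in pieces:
        cost = count_tokens(piece)
        if used + cost > max_tokens:
            break
        kept.append(piece)
        used += cost
    return kept


# Stand-in counter for illustration only; it is not a real tokenizer.
def whitespace_count(s):
    return len(s.split())


pieces = ["one two three", "four five", "six seven eight nine"]
print(fit_to_budget(pieces, max_tokens=5, count_tokens=whitespace_count))
# ['one two three', 'four five']
```

In the notebook, `count_tokens` could be `lambda s: len(encoding.encode(s))` with the tokenizer loaded above. The separators used when joining pieces and the fixed part of the prompt also consume tokens, so leave some headroom.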

In [34]:
tokens = encoding.encode(prompt)[:10]
tokens

[0, 1185, 214, 10, 768, 5307, 3167, 4, 31652, 5]

In [35]:
# encoding.decode_single_token_bytes(tokens[5])  # openai
encoding.decode(tokens[5])  # hf

' teaching'