In [3]:
import dlt
from dlt.destinations import qdrant

import requests


In this homework, we will load the data from our FAQ into Qdrant.

Let's install dlt with Qdrant support, along with the Qdrant client:

`uv add "dlt[qdrant]" "qdrant-client[fastembed]"`

To read the FAQ data, we use the helper function below.
It is annotated with `@dlt.resource` so that dlt can use it as a data source in a pipeline.

In [4]:
@dlt.resource(write_disposition="replace", name="zoomcamp_data")
def zoomcamp_data():
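    docs_url = 'https://github.com/alexeygrigorev/llm-rag-workshop/raw/main/notebooks/documents.json'
    docs_response = requests.get(docs_url)
    docs_response.raise_for_status()  # fail early on a bad HTTP response
    documents_raw = docs_response.json()

    # The JSON is a list of courses, each holding its own list of documents
    for course in documents_raw:
        course_name = course['course']

        for doc in course['documents']:
            # Tag every document with the course it belongs to
            doc['course'] = course_name
            yield doc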
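
To see what the resource yields without touching the network, here is a minimal sketch of the same flattening step. The sample data, its field names, and the `flatten_documents` helper are hypothetical and only mirror the nested course/documents layout that the function above iterates over.

```python
# Hypothetical sample mirroring the nested layout read by zoomcamp_data()
sample_raw = [
    {
        "course": "course-a",
        "documents": [
            {"text": "Answer 1", "section": "General", "question": "Question 1"},
            {"text": "Answer 2", "section": "General", "question": "Question 2"},
        ],
    },
    {
        "course": "course-b",
        "documents": [
            {"text": "Answer 3", "section": "Setup", "question": "Question 3"},
        ],
    },
]


def flatten_documents(documents_raw):
    """Yield one flat record per document, tagged with its course name."""
    for course in documents_raw:
        for doc in course["documents"]:
            # Build a new dict rather than mutating the input document
            yield {**doc, "course": course["course"]}


flat_docs = list(flatten_documents(sample_raw))
print(len(flat_docs))
print(flat_docs[0])
```

Each yielded record becomes one row in the destination, so the number of rows dlt reports should match the total number of documents across all courses.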

In [5]:
qdrant_destination = qdrant(
    qd_path="db.qdrant",
)

## Question 2. dlt pipeline

Now let's create a pipeline.

A pipeline needs a destination, and we use the qdrant one defined above.
Passing `qd_path="db.qdrant"` tells dlt (and Qdrant) to store our data locally in a folder named db.qdrant.

How many rows were inserted into the `zoomcamp_data` collection?

Look for "Normalized data for the following tables:" in the trace output.

In [6]:
pipeline = dlt.pipeline(
    pipeline_name="zoomcamp_pipeline",
    destination=qdrant_destination,
    dataset_name="zoomcamp_tagged_data",
)
load_info = pipeline.run(zoomcamp_data())
print(pipeline.last_trace)
# zoomcamp_data: 948 row(s)

Fetching 5 files: 100%|██████████| 5/5 [00:05<00:00,  1.20s/it]


Run started at 2025-07-09 18:47:15.380894+00:00 and COMPLETED in 14.03 seconds with 4 steps.
Step extract COMPLETED in 1.18 seconds.

Load package 1752086845.0042272 is EXTRACTED and NOT YET LOADED to the destination and contains no failed jobs

Step normalize COMPLETED in 0.05 seconds.
Normalized data for the following tables:
- _dlt_pipeline_state: 1 row(s)
- zoomcamp_data: 948 row(s)

Load package 1752086845.0042272 is NORMALIZED and NOT YET LOADED to the destination and contains no failed jobs

Step load COMPLETED in 3.18 seconds.
Pipeline zoomcamp_pipeline load step completed in 3.17 seconds
1 load package(s) were loaded to destination qdrant and into dataset zoomcamp_tagged_data
The qdrant destination used /Users/vasiliy/projects/llm-zoomcamp/dlt/db.qdrant location to store data
Load package 1752086845.0042272 is LOADED and contains no failed jobs

Step run COMPLETED in 14.03 seconds.
Pipeline zoomcamp_pipeline load step completed in 3.17 seconds
1 load package(s) were loaded to 
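
If you would rather not read the count off the trace by eye, the "Normalized data" lines can be parsed with a regular expression. This is only a sketch based on the text format shown in the output above; `parse_row_counts` is a hypothetical helper, not part of dlt.

```python
import re

# Excerpt copied from the trace output above
trace_text = """\
Step normalize COMPLETED in 0.05 seconds.
Normalized data for the following tables:
- _dlt_pipeline_state: 1 row(s)
- zoomcamp_data: 948 row(s)
"""


def parse_row_counts(text):
    """Return {table_name: row_count} for lines like '- name: N row(s)'."""
    pattern = re.compile(r"^- (\w+): (\d+) row\(s\)$", re.MULTILINE)
    return {name: int(count) for name, count in pattern.findall(text)}


row_counts = parse_row_counts(trace_text)
print(row_counts["zoomcamp_data"])
```

In a live session you could pass `str(pipeline.last_trace)` instead of the copied excerpt, assuming the printed format stays the same.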

## Question 3. Embeddings

When inserting the data, an embedding model was used. Which one?

You can find this out by inspecting the meta.json file in the target folder: during data insertion, a folder named db.qdrant is created, and meta.json is located inside it.

`fast-bge-small-en`
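
The lookup can also be done in code. The sketch below walks a parsed meta.json and collects the names found under any `vectors` key. The layout of `sample_meta` (including the collection name and vector parameters) is an assumption made for illustration, and `vector_names` is a hypothetical helper; check your own file, since the real structure may differ.

```python
import json
from pathlib import Path


def vector_names(node):
    """Recursively collect the keys of every dict stored under a "vectors" key."""
    found = set()
    if isinstance(node, dict):
        for key, value in node.items():
            if key == "vectors" and isinstance(value, dict):
                found.update(value.keys())
            else:
                found |= vector_names(value)
    elif isinstance(node, list):
        for item in node:
            found |= vector_names(item)
    return found


# Hypothetical meta.json content; the real file may be laid out differently
sample_meta = {
    "collections": {
        "zoomcamp_tagged_data_zoomcamp_data": {
            "vectors": {"fast-bge-small-en": {"size": 384, "distance": "Cosine"}}
        }
    }
}
print(vector_names(sample_meta))

# Only read the real file if the pipeline has already created it
meta_path = Path("db.qdrant") / "meta.json"
if meta_path.exists():
    print(vector_names(json.loads(meta_path.read_text())))
```

Searching recursively instead of indexing a fixed path keeps the sketch usable even if the file nests the vector configuration differently than assumed here.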