# dlt intro
* https://www.youtube.com/watch?v=qUNyfR_X2Mo
* https://github.com/DataTalksClub/llm-zoomcamp/blob/main/cohorts/2024/workshops/dlt.md
* https://colab.research.google.com/drive/1nNOybHdWQiwUUuJFZu__xvJxL_ADU3xl?usp=sharing#scrollTo=zpqhOpmrS45-

In [2]:
!pip install "dlt[lancedb]==0.5.1a0"
!pip install sentence-transformers



## Load the data

In [4]:
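import requests
import dlt

# Download the course FAQ: a JSON list of courses, each with a "documents" list.
response = requests.get("https://github.com/DataTalksClub/llm-zoomcamp/blob/main/01-intro/documents.json?raw=1")
response.raise_for_status()  # fail early on an HTTP error instead of parsing an error page
qa_dataset = response.json()

@dlt.resource
def qa_documents():
    # Yield each course's documents as one batch; dlt loads every document as a row.
    for course in qa_dataset:
        yield course["documents"]

pipeline = dlt.pipeline(pipeline_name="from_json", destination="lancedb", dataset_name="qanda")

# dlt appends by default, so re-running this cell loads the same rows again.
load_info = pipeline.run(qa_documents, table_name="documents")

print(load_info)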

Pipeline from_json load step completed in 0.20 seconds
1 load package(s) were loaded to destination LanceDB and into dataset qanda
The LanceDB destination used <dlt.destinations.impl.lancedb.configuration.LanceDBCredentials object at 0x7a7fb27b57e0> location to store data
Load package 1721688122.5950298 is LOADED and contains no failed jobs


In [8]:
import lancedb

db = lancedb.connect("./.lancedb")
print(db.table_names())

['qanda____dlt_loads', 'qanda____dlt_pipeline_state', 'qanda____dlt_version', 'qanda___dltSentinelTable', 'qanda___documents']
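
The table names printed above appear to follow a simple pattern: the dataset name, a `___` separator, then the dlt table name (so the internal `_dlt_loads` table ends up with four underscores in a row). The sketch below only reproduces that pattern as inferred from this output; `lancedb_table_name` and `DATASET_SEPARATOR` are hypothetical names, not part of the dlt or LanceDB API.

```python
# Separator inferred from the table names printed above (an assumption,
# not read from dlt's configuration).
DATASET_SEPARATOR = "___"


def lancedb_table_name(dataset_name: str, table_name: str) -> str:
    """Build the LanceDB table name for a dlt table in a given dataset."""
    return f"{dataset_name}{DATASET_SEPARATOR}{table_name}"


print(lancedb_table_name("qanda", "documents"))
print(lancedb_table_name("qanda", "_dlt_loads"))
```

This is why the next cell opens `qanda___documents` rather than `documents`.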


In [9]:
db_table = db.open_table("qanda___documents")

db_table.to_pandas()

| | id__ | text | section | question | _dlt_load_id | _dlt_id |
|---|---|---|---|---|---|---|
| 0 | c5ac6328-cbef-5a1a-bc6e-5ca60fcc9bc1 | The purpose of this document is to capture fre... | General course-related questions | Course - When will the course start? | 1721688099.8029122 | SyPl51ibXJQ0xA |
| 1 | f448f149-dbd5-563d-b4df-c28bdfa00b5a | GitHub - DataTalksClub data-engineering-zoomca... | General course-related questions | Course - What are the prerequisites for this c... | 1721688099.8029122 | pKLtkrg3kGLLcQ |
| 2 | af4f0c30-3e7d-5d53-9b73-8c1e6bdf2db9 | Yes, even if you don't register, you're still ... | General course-related questions | Course - Can I still join the course after the... | 1721688099.8029122 | SHuTmrP6CWGweQ |
| 3 | 66cdb898-4be1-579c-b734-927de4c7ccd7 | You don't need it. You're accepted. You can al... | General course-related questions | Course - I have registered for the Data Engine... | 1721688099.8029122 | TT5vTEvEWTSYaA |
| 4 | 76c79b57-185b-535a-b991-98cfdd4c3825 | You can start by installing and setting up all... | General course-related questions | Course - What can I do before the course starts? | 1721688099.8029122 | 2O1UdqLyV5RA9w |
| ... | ... | ... | ... | ... | ... | ... |
| 1891 | c34e8121-17db-5559-9dac-d7b707bab76e | Problem description\nThis is the step in the c... | Module 6: Best practices | Github actions: Permission denied error when e... | 1721688122.5950298 | YzqgScGlGbaVKQ |
| 1892 | b3e4e993-a174-5354-968a-a4026399249f | Problem description\nWhen a docker-compose fil... | Module 6: Best practices | Managing Multiple Docker Containers with docke... | 1721688122.5950298 | yez11sIWlexC+A |
| 1893 | 21533d64-c3cd-5471-807c-ae3629d0f7d8 | Problem description\nIf you are having problem... | Module 6: Best practices | AWS regions need to match docker-compose | 1721688122.5950298 | TO1g25i7I787NA |
| 1894 | 38c94f86-5595-59ff-9965-27aa63d832f2 | Problem description\nPre-commit command was fa... | Module 6: Best practices | Isort Pre-commit | 1721688122.5950298 | 28xUs7iSUl1eRQ |
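
The first and last rows of the table above carry two different `_dlt_load_id` values, which suggests the pipeline ran twice and the second run appended to the first. A minimal sketch of how to check this is to count rows per load ID. The version below uses only the rows visible in the table (the elided middle rows are not included), and `rows_per_load` is a hypothetical helper name.

```python
from collections import Counter

# (row index, _dlt_load_id) pairs copied from the visible rows of the table;
# the rows hidden behind "..." are deliberately left out.
visible_rows = [
    (0, "1721688099.8029122"),
    (1, "1721688099.8029122"),
    (2, "1721688099.8029122"),
    (3, "1721688099.8029122"),
    (4, "1721688099.8029122"),
    (1891, "1721688122.5950298"),
    (1892, "1721688122.5950298"),
    (1893, "1721688122.5950298"),
    (1894, "1721688122.5950298"),
]


def rows_per_load(rows):
    """Count how many rows each load package contributed."""
    return Counter(load_id for _, load_id in rows)


counts = rows_per_load(visible_rows)
for load_id, n in sorted(counts.items()):
    print(load_id, n)
```

On the full table, the same check can be done with pandas via `db_table.to_pandas()["_dlt_load_id"].value_counts()`. If duplicates are not wanted, passing `write_disposition="replace"` to `pipeline.run` is one option to try, assuming the destination supports it in the installed dlt version.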


## Load and embed the data

In [10]:
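import os
from dlt.destinations.adapters import lancedb_adapter

# Configure the LanceDB destination to embed with a local sentence-transformers model.
os.environ["DESTINATION__LANCEDB__EMBEDDING_MODEL_PROVIDER"] = "sentence-transformers"
os.environ["DESTINATION__LANCEDB__EMBEDDING_MODEL"] = "all-MiniLM-L6-v2"

# Use a separate pipeline and dataset so the embedded table does not mix with the plain one.
pipeline = dlt.pipeline(pipeline_name="from_json_embedded", destination="lancedb", dataset_name="qanda_embedded")

# lancedb_adapter marks the "text" and "question" columns for embedding at load time.
load_info = pipeline.run(lancedb_adapter(qa_documents, embed=["text", "question"]), table_name="documents")
print(load_info)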

_dlt_version
[{'name': 'version', 'data_type': 'bigint', 'nullable': False}, {'name': 'engine_version', 'data_type': 'bigint', 'nullable': False}, {'name': 'inserted_at', 'data_type': 'timestamp', 'nullable': False}, {'name': 'schema_name', 'data_type': 'text', 'nullable': False}, {'name': 'version_hash', 'data_type': 'text', 'nullable': False}, {'name': 'schema', 'data_type': 'text', 'nullable': False}]
_dlt_pipeline_state
[{'name': 'version', 'data_type': 'bigint', 'nullable': False}, {'name': 'engine_version', 'data_type': 'bigint', 'nullable': False}, {'name': 'pipeline_name', 'data_type': 'text', 'nullable': False}, {'name': 'state', 'data_type': 'text', 'nullable': False}, {'name': 'created_at', 'data_type': 'timestamp', 'nullable': False}, {'name': 'version_hash', 'data_type': 'text', 'nullable': True}, {'name': '_dlt_load_id', 'data_type': 'text', 'nullable': False}, {'name': '_dlt_id', 'data_type': 'text', 'nullable': False, 'unique': True}]
documents
[{'name': 'text', 'x-lanc
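
The two `os.environ` keys set in the cell above follow dlt's convention for configuration through environment variables: the config sections are upper-cased and joined with double underscores. The sketch below rebuilds those keys from their parts; `dlt_env_key` is a hypothetical helper written for illustration, not a dlt function.

```python
def dlt_env_key(*sections: str) -> str:
    """Join config sections into a dlt-style environment variable name."""
    return "__".join(part.upper() for part in sections)


# Rebuild the two keys used in the embedding cell.
provider_key = dlt_env_key("destination", "lancedb", "embedding_model_provider")
model_key = dlt_env_key("destination", "lancedb", "embedding_model")

print(provider_key)
print(model_key)
```

Single underscores inside a name (as in `embedding_model`) stay as they are; only the boundaries between sections use the double underscore.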

In [11]:
db = lancedb.connect("./.lancedb")
print(db.table_names())

['qanda____dlt_loads', 'qanda____dlt_pipeline_state', 'qanda____dlt_version', 'qanda___dltSentinelTable', 'qanda___documents', 'qanda_embedded____dlt_loads', 'qanda_embedded____dlt_pipeline_state', 'qanda_embedded____dlt_version', 'qanda_embedded___dltSentinelTable', 'qanda_embedded___documents']


In [12]:
db_table = db.open_table("qanda_embedded___documents")

db_table.to_pandas()

| | id__ | vector__ | text | section | question | _dlt_load_id | _dlt_id |
|---|---|---|---|---|---|---|---|
| 0 | 90198cdb-5e95-5339-b74b-dfd95343a57a | [-0.00035094196, -0.062014297, -0.037999876, 0... | The purpose of this document is to capture fre... | General course-related questions | Course - When will the course start? | 1721688305.6704516 | iHWVYitCyZixkw |
| 1 | 8872b4e6-47d3-5de8-bcae-bb509a7aa3d6 | [0.020011412, -0.011535538, 0.013017209, -0.00... | GitHub - DataTalksClub data-engineering-zoomca... | General course-related questions | Course - What are the prerequisites for this c... | 1721688305.6704516 | YtCjWP1N+Ivvqw |
| 2 | 9a31d951-ce7c-5afe-b0ac-5bb7b2896a7a | [0.014857555, -0.06664993, -0.013571247, 0.023... | Yes, even if you don't register, you're still ... | General course-related questions | Course - Can I still join the course after the... | 1721688305.6704516 | bZPvrFH9YbNfrA |
| 3 | c4aeb12c-ca89-55a0-a140-93640eece786 | [-0.023312032, -0.09461493, 0.056361612, -0.00... | You don't need it. You're accepted. You can al... | General course-related questions | Course - I have registered for the Data Engine... | 1721688305.6704516 | dGhGtURk56shyw |
| 4 | f4952fb8-cbf3-5a31-83f7-9f503a1a83f8 | [0.026537651, -0.01779666, 0.0021155947, 0.006... | You can start by installing and setting up all... | General course-related questions | Course - What can I do before the course starts? | 1721688305.6704516 | 3hKa4rmgtHL1DQ |
| ... | ... | ... | ... | ... | ... | ... | ... |
| 943 | 8b9b7cca-b73f-5d62-9339-68d5c86fe527 | [0.016619362, -0.033603165, -0.093347155, -0.0... | Problem description\nThis is the step in the c... | Module 6: Best practices | Github actions: Permission denied error when e... | 1721688305.6704516 | Ua4+6iX8o26zXg |
| 944 | a11b64e1-d6eb-565d-869e-e651a642c688 | [0.026872855, -0.0019949432, 0.008369081, -0.0... | Problem description\nWhen a docker-compose fil... | Module 6: Best practices | Managing Multiple Docker Containers with docke... | 1721688305.6704516 | BIb9tu4DYMxE/A |
| 945 | 2b83dd74-4a3c-5d48-8c4f-09f90573c9df | [0.03513756, 0.056265566, 0.024428478, -0.0651... | Problem description\nIf you are having problem... | Module 6: Best practices | AWS regions need to match docker-compose | 1721688305.6704516 | VuWw3XwJk5k+tw |
| 946 | 1c902ca1-43d2-57da-82ac-df9d93c6ae8d | [0.03380979, -0.0031218985, 0.0017484669, 0.01... | Problem description\nPre-commit command was fa... | Module 6: Best practices | Isort Pre-commit | 1721688305.6704516 | uN5suEZSlIXgjg |
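
The `vector__` column holds one embedding per row, and the usual reason to store embeddings is to rank rows by similarity to a query vector. The sketch below shows one common measure, cosine similarity, in plain Python. The three-dimensional vectors and the `doc_a`/`doc_b`/`doc_c` names are made up for illustration (the real vectors are much longer), and this is not necessarily the metric LanceDB uses by default.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two non-zero vectors of equal length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


# Toy embeddings, invented for this example.
docs = {
    "doc_a": [1.0, 0.0, 0.0],
    "doc_b": [0.0, 1.0, 0.0],
    "doc_c": [0.7, 0.7, 0.0],
}
query = [1.0, 0.1, 0.0]

# Rank documents from most to least similar to the query.
ranked = sorted(docs, key=lambda name: cosine_similarity(query, docs[name]), reverse=True)
print(ranked)
```

In practice the query text would be embedded with the same model used at load time, so that query and document vectors live in the same space.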
