# dlt intro
* https://www.youtube.com/watch?v=qUNyfR_X2Mo
* https://github.com/DataTalksClub/llm-zoomcamp/blob/main/cohorts/2024/workshops/dlt.md
* https://colab.research.google.com/drive/1nNOybHdWQiwUUuJFZu__xvJxL_ADU3xl?usp=sharing#scrollTo=zpqhOpmrS45-

In [2]:
!pip install "dlt[lancedb]==0.5.1a0"
!pip install sentence-transformers



## Load the data

In [1]:
# https://youtu.be/qUNyfR_X2Mo?t=1650
import requests
import dlt

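# Download the FAQ dataset: a list of courses, each with a "documents" list
qa_dataset = requests.get("https://github.com/DataTalksClub/llm-zoomcamp/blob/main/01-intro/documents.json?raw=1").json()

@dlt.resource
def qa_documents():
    # Yield each course's list of documents; dlt loads every document as one row
    for course in qa_dataset:
        yield course["documents"]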

pipeline = dlt.pipeline(pipeline_name="from_json", destination="lancedb", dataset_name="qanda")

load_info = pipeline.run(qa_documents, table_name="documents")

print(load_info)



Pipeline from_json load step completed in 0.24 seconds
1 load package(s) were loaded to destination LanceDB and into dataset qanda
The LanceDB destination used <dlt.destinations.impl.lancedb.configuration.LanceDBCredentials object at 0x76212ddc3bb0> location to store data
Load package 1721723824.1178534 is LOADED and contains no failed jobs
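
The resource above yields one list of documents per course, so the course name is not carried into the table. Below is a minimal sketch of that data shape and of one way to keep the course on each row. It assumes every entry has a `course` key next to `documents`; the sample records and the `with_course` helper are hypothetical and not part of the workshop code.

```python
# Hypothetical sample mirroring the assumed shape of documents.json
sample_dataset = [
    {
        "course": "course-a",
        "documents": [
            {"text": "Answer 1", "section": "General", "question": "Question 1"},
            {"text": "Answer 2", "section": "General", "question": "Question 2"},
        ],
    },
    {
        "course": "course-b",
        "documents": [
            {"text": "Answer 3", "section": "Module 1", "question": "Question 3"},
        ],
    },
]


def with_course(dataset):
    """Yield one flat dict per document, tagged with its course name."""
    for course in dataset:
        for doc in course["documents"]:
            # Build a new dict so the source data is not mutated
            yield {**doc, "course": course["course"]}


rows = list(with_course(sample_dataset))
print(len(rows))          # number of documents across all courses
print(rows[0]["course"])  # course name carried onto the row
```

A generator of this shape could be decorated with `@dlt.resource` in the same way as `qa_documents`.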


In [2]:
# https://youtu.be/qUNyfR_X2Mo?t=1859
import lancedb

db = lancedb.connect("./.lancedb")
print(db.table_names())

['qanda____dlt_loads', 'qanda____dlt_pipeline_state', 'qanda____dlt_version', 'qanda___dltSentinelTable', 'qanda___documents', 'qanda_embedded____dlt_loads', 'qanda_embedded____dlt_pipeline_state', 'qanda_embedded____dlt_version', 'qanda_embedded___dltSentinelTable', 'qanda_embedded___documents']


dlt creates its bookkeeping tables automatically: `qanda____dlt_loads`, `qanda____dlt_pipeline_state`, and `qanda____dlt_version`.\
The loaded data itself is in `qanda___documents`.
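
The listing suggests that each table name is the dataset name, a `___` separator, and then the table name. A small sketch that uses this pattern to pick out the data tables follows. The separator and the rule for spotting internal tables are assumptions inferred from the output above, and `user_tables` is a hypothetical helper, not a dlt or LanceDB function.

```python
from collections import defaultdict

# Table names copied from the db.table_names() output above
table_names = [
    "qanda____dlt_loads",
    "qanda____dlt_pipeline_state",
    "qanda____dlt_version",
    "qanda___dltSentinelTable",
    "qanda___documents",
    "qanda_embedded____dlt_loads",
    "qanda_embedded____dlt_pipeline_state",
    "qanda_embedded____dlt_version",
    "qanda_embedded___dltSentinelTable",
    "qanda_embedded___documents",
]

SEPARATOR = "___"  # assumption inferred from the listing, not a documented constant


def user_tables(names):
    """Group the non-internal tables by dataset name."""
    grouped = defaultdict(list)
    for name in names:
        dataset, _, table = name.partition(SEPARATOR)
        # Skip dlt's own tables: "_dlt_*" and the sentinel table
        if table.startswith("_dlt") or table == "dltSentinelTable":
            continue
        grouped[dataset].append(table)
    return dict(grouped)


print(user_tables(table_names))
# {'qanda': ['documents'], 'qanda_embedded': ['documents']}
```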

In [3]:
db_table = db.open_table("qanda___documents")

db_table.to_pandas()

|  | id__ | text | section | question | _dlt_load_id | _dlt_id |
|---|---|---|---|---|---|---|
| 0 | c5ac6328-cbef-5a1a-bc6e-5ca60fcc9bc1 | The purpose of this document is to capture fre... | General course-related questions | Course - When will the course start? | 1721688099.8029122 | SyPl51ibXJQ0xA |
| 1 | f448f149-dbd5-563d-b4df-c28bdfa00b5a | GitHub - DataTalksClub data-engineering-zoomca... | General course-related questions | Course - What are the prerequisites for this c... | 1721688099.8029122 | pKLtkrg3kGLLcQ |
| 2 | af4f0c30-3e7d-5d53-9b73-8c1e6bdf2db9 | Yes, even if you don't register, you're still ... | General course-related questions | Course - Can I still join the course after the... | 1721688099.8029122 | SHuTmrP6CWGweQ |
| 3 | 66cdb898-4be1-579c-b734-927de4c7ccd7 | You don't need it. You're accepted. You can al... | General course-related questions | Course - I have registered for the Data Engine... | 1721688099.8029122 | TT5vTEvEWTSYaA |
| 4 | 76c79b57-185b-535a-b991-98cfdd4c3825 | You can start by installing and setting up all... | General course-related questions | Course - What can I do before the course starts? | 1721688099.8029122 | 2O1UdqLyV5RA9w |
| ... | ... | ... | ... | ... | ... | ... |
| 2839 | e40d62a0-b458-51ac-a98c-a4c4099f6b5d | Problem description\nThis is the step in the c... | Module 6: Best practices | Github actions: Permission denied error when e... | 1721723824.1178534 | R9ArQwxA2Q8pcg |
| 2840 | 8a38b0e7-3bb2-5a9b-8a55-97ddd0d4f15e | Problem description\nWhen a docker-compose fil... | Module 6: Best practices | Managing Multiple Docker Containers with docke... | 1721723824.1178534 | R2lNBLegZCdHYw |
| 2841 | 68b62750-7e66-5c5b-b9d3-163a364006da | Problem description\nIf you are having problem... | Module 6: Best practices | AWS regions need to match docker-compose | 1721723824.1178534 | GnaXQP+Gz5znWw |
| 2842 | c1e1610f-8af2-5d80-8554-0a5aecfaa4c3 | Problem description\nPre-commit command was fa... | Module 6: Best practices | Isort Pre-commit | 1721723824.1178534 | NqLdsKENS9U1mw |
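
The `_dlt_load_id` column in the output above takes at least two values (`1721688099.8029122` in the first rows and `1721723824.1178534` in the last). This suggests the table holds rows from more than one pipeline run, since dlt appends by default. Here is a small sketch of counting rows per load. The three rows are hypothetical stand-ins that reuse those two ids and some questions from the output.

```python
from collections import Counter

# Hypothetical rows; only the load ids and questions are taken from the output above
rows = [
    {"question": "Course - When will the course start?", "_dlt_load_id": "1721688099.8029122"},
    {"question": "Course - What can I do before the course starts?", "_dlt_load_id": "1721688099.8029122"},
    {"question": "Isort Pre-commit", "_dlt_load_id": "1721723824.1178534"},
]

# Count how many rows each load contributed
rows_per_load = Counter(row["_dlt_load_id"] for row in rows)
for load_id, count in sorted(rows_per_load.items()):
    print(load_id, count)

# More than one load id means the table mixes several runs
has_multiple_loads = len(rows_per_load) > 1
print(has_multiple_loads)
```

On the real table, the same check can be written as `db_table.to_pandas()["_dlt_load_id"].value_counts()`. If rows from repeated runs are unwanted, passing `write_disposition="replace"` to `pipeline.run` is one option worth checking against the dlt documentation.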


In [13]:
db_table

LanceTable(connection=LanceDBConnection(/workspaces/LLM-zoomcamp/dlt/.lancedb), name="qanda_embedded___documents")

## Load and embed the data
* https://huggingface.co/sentence-transformers/all-MiniLM-L6-v2
* Available sentence transformers: https://lancedb.github.io/lancedb/embeddings/default_embedding_functions/#sentence-transformers

In [17]:
print(qa_documents)

DltResource [qa_documents] in section [__main__]:
If you want to see the data items in the resource you must iterate it or convert to list ie. list(resource). Note that, like any iterator, you can iterate the resource only once.
Instance: info: (data pipe id:134688883000992) at 134688883001712


In [4]:
# https://youtu.be/qUNyfR_X2Mo?t=1924
import os
from dlt.destinations.adapters import lancedb_adapter

# Configure the embedding provider and model for the LanceDB destination
os.environ["DESTINATION__LANCEDB__EMBEDDING_MODEL_PROVIDER"] = "sentence-transformers"
os.environ["DESTINATION__LANCEDB__EMBEDDING_MODEL"] = "all-MiniLM-L6-v2"

pipeline = dlt.pipeline(pipeline_name="from_json_embedded", destination="lancedb", dataset_name="qanda_embedded")

# lancedb_adapter marks the "text" and "question" columns for embedding at load time
load_info = pipeline.run(lancedb_adapter(qa_documents, embed=["text", "question"]), table_name="documents")
print(load_info)

Pipeline from_json_embedded load step completed in 22.53 seconds
1 load package(s) were loaded to destination LanceDB and into dataset qanda_embedded
The LanceDB destination used <dlt.destinations.impl.lancedb.configuration.LanceDBCredentials object at 0x76211628ddb0> location to store data
Load package 1721724000.4030104 is LOADED and contains no failed jobs
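
The two environment variables set in the cell above follow dlt's naming scheme for configuration: upper-case keys with a double underscore between sections. A minimal sketch of how such keys map onto nested settings is shown here. The `env_key_to_config_path` and `to_nested` helpers are hypothetical and only illustrate the naming; they are not dlt's own resolver.

```python
def env_key_to_config_path(env_key):
    """Split an upper-case, double-underscore-separated key into lower-case parts."""
    return [part.lower() for part in env_key.split("__")]


def to_nested(settings):
    """Turn flat environment-style keys into a nested dict of sections."""
    nested = {}
    for key, value in settings.items():
        *sections, name = env_key_to_config_path(key)
        node = nested
        for section in sections:
            # Walk down, creating each section on first use
            node = node.setdefault(section, {})
        node[name] = value
    return nested


# The same keys and values as in the notebook cell
settings = {
    "DESTINATION__LANCEDB__EMBEDDING_MODEL_PROVIDER": "sentence-transformers",
    "DESTINATION__LANCEDB__EMBEDDING_MODEL": "all-MiniLM-L6-v2",
}

print(env_key_to_config_path("DESTINATION__LANCEDB__EMBEDDING_MODEL"))
# ['destination', 'lancedb', 'embedding_model']
print(to_nested(settings))
```

The nested form corresponds to a `[destination.lancedb]` section with `embedding_model_provider` and `embedding_model` keys in a dlt TOML config file.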


In [5]:
db = lancedb.connect("./.lancedb")
print(db.table_names())

['qanda____dlt_loads', 'qanda____dlt_pipeline_state', 'qanda____dlt_version', 'qanda___dltSentinelTable', 'qanda___documents', 'qanda_embedded____dlt_loads', 'qanda_embedded____dlt_pipeline_state', 'qanda_embedded____dlt_version', 'qanda_embedded___dltSentinelTable', 'qanda_embedded___documents']


In [6]:
db_table = db.open_table("qanda_embedded___documents")

db_table.to_pandas()

|  | id__ | vector__ | text | section | question | _dlt_load_id | _dlt_id |
|---|---|---|---|---|---|---|---|
| 0 | 90198cdb-5e95-5339-b74b-dfd95343a57a | [-0.00035094196, -0.062014297, -0.037999876, 0... | The purpose of this document is to capture fre... | General course-related questions | Course - When will the course start? | 1721688305.6704516 | iHWVYitCyZixkw |
| 1 | 8872b4e6-47d3-5de8-bcae-bb509a7aa3d6 | [0.020011412, -0.011535538, 0.013017209, -0.00... | GitHub - DataTalksClub data-engineering-zoomca... | General course-related questions | Course - What are the prerequisites for this c... | 1721688305.6704516 | YtCjWP1N+Ivvqw |
| 2 | 9a31d951-ce7c-5afe-b0ac-5bb7b2896a7a | [0.014857555, -0.06664993, -0.013571247, 0.023... | Yes, even if you don't register, you're still ... | General course-related questions | Course - Can I still join the course after the... | 1721688305.6704516 | bZPvrFH9YbNfrA |
| 3 | c4aeb12c-ca89-55a0-a140-93640eece786 | [-0.023312032, -0.09461493, 0.056361612, -0.00... | You don't need it. You're accepted. You can al... | General course-related questions | Course - I have registered for the Data Engine... | 1721688305.6704516 | dGhGtURk56shyw |
| 4 | f4952fb8-cbf3-5a31-83f7-9f503a1a83f8 | [0.026537651, -0.01779666, 0.0021155947, 0.006... | You can start by installing and setting up all... | General course-related questions | Course - What can I do before the course starts? | 1721688305.6704516 | 3hKa4rmgtHL1DQ |
| ... | ... | ... | ... | ... | ... | ... | ... |
| 1891 | 811e1b76-4360-53a0-ab40-b2d4227e5dbb | [0.016619362, -0.033603165, -0.093347155, -0.0... | Problem description\nThis is the step in the c... | Module 6: Best practices | Github actions: Permission denied error when e... | 1721724000.4030104 | /KG+ThKqu5tGdA |
| 1892 | 791063b7-cc79-5ef1-91e8-f7e6316efda1 | [0.026872855, -0.0019949432, 0.008369081, -0.0... | Problem description\nWhen a docker-compose fil... | Module 6: Best practices | Managing Multiple Docker Containers with docke... | 1721724000.4030104 | TvDn+MVHPFUDUA |
| 1893 | 2d6a8f6d-1bc6-57a6-90b2-c15027a26ac3 | [0.03513756, 0.056265566, 0.024428478, -0.0651... | Problem description\nIf you are having problem... | Module 6: Best practices | AWS regions need to match docker-compose | 1721724000.4030104 | A6egWiRCu9/KgQ |
| 1894 | ae26ad22-e6d5-565d-bdb4-1f9d843ad72b | [0.03380979, -0.0031218985, 0.0017484669, 0.01... | Problem description\nPre-commit command was fa... | Module 6: Best practices | Isort Pre-commit | 1721724000.4030104 | u3wEyy44R666GA |
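
The `vector__` column holds the embedding of each row, which is what makes similarity search possible. As a rough illustration of how such vectors can be compared, here is a sketch that ranks two documents against a query by cosine similarity. The 3-dimensional vectors are hypothetical (real `all-MiniLM-L6-v2` embeddings have 384 dimensions), and cosine similarity is only one common choice; LanceDB may use a different distance metric by default.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


# Hypothetical toy embeddings keyed by question text
docs = {
    "Course - When will the course start?": [0.9, 0.1, 0.0],
    "Isort Pre-commit": [0.0, 0.2, 0.9],
}
query_vector = [0.8, 0.2, 0.1]  # hypothetical embedding of a user query

# Rank questions from most to least similar to the query
ranked = sorted(
    docs,
    key=lambda question: cosine_similarity(query_vector, docs[question]),
    reverse=True,
)
print(ranked[0])
# Course - When will the course start?
```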


# Create an up-to-date RAG with dlt and LanceDB