Add LanceDB custom destination example code #1323

Merged
merged 52 commits into from
Jun 24, 2024
8e174c7
Add LanceDB custom destination example code
Pipboyguy May 5, 2024
d5ec7fa
Format
Pipboyguy May 5, 2024
74e09c7
Remove Postgres credentials from example.secrets.toml
Pipboyguy May 5, 2024
9050001
Format
Pipboyguy May 5, 2024
c9e2028
Add typing
Pipboyguy May 6, 2024
242497b
Refactor code documentation and add type ignore comments
Pipboyguy May 6, 2024
78c7476
Ignore checks
Pipboyguy May 6, 2024
585b931
wrap in main if statement
Pipboyguy May 6, 2024
81d933a
Add lancedb to install dependencies in test_doc_snippets workflow
Pipboyguy May 6, 2024
42f6aaf
poetry
Pipboyguy May 7, 2024
a26ef11
Update deps
Pipboyguy May 7, 2024
4beb3f7
Update LanceDB version and replace Sentence-Transformers with OpenAIE…
Pipboyguy May 7, 2024
c66bf9a
Merge branch 'refs/heads/devel' into 1322-lancedb-usage-example-docs
Pipboyguy May 7, 2024
8cc3437
Poetry lock
Pipboyguy May 7, 2024
948659d
Format
Pipboyguy May 7, 2024
ad34376
Merge branch 'refs/heads/devel' into 1322-lancedb-usage-example-docs
Pipboyguy May 12, 2024
3ea0d2b
Update versions
Pipboyguy May 12, 2024
bc0567d
Replace OpenAI with Cohere in LanceDB custom destination example
Pipboyguy May 13, 2024
023d440
Format
Pipboyguy May 13, 2024
0ca8c7a
Merge branch 'refs/heads/devel' into 1322-lancedb-usage-example-docs
Pipboyguy May 13, 2024
c165619
Add error handling to custom destination lanceDB example
Pipboyguy May 13, 2024
d3ea196
Lift config to secrets/config
Pipboyguy May 13, 2024
312033e
Ignore example lancedb local dir
Pipboyguy May 13, 2024
20e5b31
Why was this uncommented
Pipboyguy May 13, 2024
fc9c00b
Remove unnecessary lock
Pipboyguy May 13, 2024
ed88147
Cleanup
Pipboyguy May 13, 2024
03b686f
Remove print statements from custom_destination_lancedb.py
Pipboyguy May 13, 2024
814edf4
Merge branch 'refs/heads/devel' into 1322-lancedb-usage-example-docs
Pipboyguy May 14, 2024
3fa33e2
Merge branch 'refs/heads/devel' into 1322-lancedb-usage-example-docs
Pipboyguy May 14, 2024
038ec6b
Print info
Pipboyguy May 14, 2024
e9f2588
Print info
Pipboyguy May 14, 2024
b8ce73c
Use rest_client
Pipboyguy May 14, 2024
4803a12
noqa
Pipboyguy May 14, 2024
a86cd8d
Merge remote-tracking branch 'origin/devel' into 1322-lancedb-usage-e…
Pipboyguy May 16, 2024
dafd24f
Remove `cohere` dependency and add `embeddings` extra to `lancedb`
Pipboyguy May 16, 2024
7588da5
Merge branch 'devel' into 1322-lancedb-usage-example-docs
rudolfix Jun 4, 2024
5a54170
changing secrets path for cohere to pass docs tests
rahuljo Jun 7, 2024
1f748c2
Merge branch 'devel' into 1322-lancedb-usage-example-docs
rahuljo Jun 10, 2024
5616f3b
Merge branch 'devel' into 1322-lancedb-usage-example-docs
rudolfix Jun 10, 2024
132b314
fixes lock file
rudolfix Jun 10, 2024
3549ac9
moves get lancedb path to run within the test
rudolfix Jun 10, 2024
71a7cc4
Merge branch 'devel' into 1322-lancedb-usage-example-docs
sh-rp Jun 19, 2024
051c1df
fix dependencies
sh-rp Jun 19, 2024
493b20f
fix linting
sh-rp Jun 19, 2024
76cdfcd
fix lancedb deps
AstrakhantsevaAA Jun 20, 2024
8382a72
update lock file
AstrakhantsevaAA Jun 20, 2024
fcc75cd
change source name
AstrakhantsevaAA Jun 20, 2024
8056c2b
moved client_id to secrets
AstrakhantsevaAA Jun 20, 2024
8bd2ce0
switch lancedb example to openai and small fixes
sh-rp Jun 24, 2024
2dbd5f9
small fixes
sh-rp Jun 24, 2024
ce41d10
add openai to docs deps
sh-rp Jun 24, 2024
6655f56
fix grammar gpt typing
sh-rp Jun 24, 2024
2 changes: 2 additions & 0 deletions docs/examples/custom_destination_lancedb/.dlt/config.toml
@@ -0,0 +1,2 @@
[lancedb]
db_path = "spotify.db"
@@ -0,0 +1,7 @@
[spotify]
client_id = ""
client_secret = ""

# provide the openai api key here
[destination.lancedb.credentials]
embedding_model_provider_api_key = ""
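
For context, dlt can also resolve the same secret from an environment variable instead of the TOML file: the sections of the dotted key are joined with double underscores and upper-cased. The sketch below only illustrates that naming rule; `dlt_env_var_name` is a hypothetical helper written for this note and is not part of dlt or of this PR.

```python
def dlt_env_var_name(dotted_key: str) -> str:
    """Map a dotted dlt config/secrets key to its environment variable name.

    Hypothetical helper for illustration only: it mirrors dlt's documented
    convention of joining sections with "__" and upper-casing the result.
    """
    return dotted_key.replace(".", "__").upper()


if __name__ == "__main__":
    key = "destination.lancedb.credentials.embedding_model_provider_api_key"
    print(dlt_env_var_name(key))
```

Exporting a variable with that name should let the example run without writing the OpenAI key to a secrets file, which can be convenient in CI.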
1 change: 1 addition & 0 deletions docs/examples/custom_destination_lancedb/.gitignore
@@ -0,0 +1 @@
spotify.db
Empty file.
155 changes: 155 additions & 0 deletions docs/examples/custom_destination_lancedb/custom_destination_lancedb.py
@@ -0,0 +1,155 @@
"""
---
title: Custom Destination with LanceDB
description: Learn how to use the custom destination to load data into LanceDB.
keywords: [destination, credentials, example, lancedb, custom destination, vectorstore, AI, LLM]
---

This example shows how to use LanceDB, an open-source vector database, as a custom destination in dlt.
The script implements the custom destination and populates a LanceDB vector store with podcast episode
data fetched from the Spotify API.

You can get a Spotify client ID and secret from https://developer.spotify.com/.

We'll learn how to:
- Use the [custom destination](../dlt-ecosystem/destinations/destination.md)
- Delegate the embeddings to LanceDB using OpenAI Embeddings
"""

__source_name__ = "spotify"

import datetime # noqa: I251
import os
from dataclasses import dataclass, fields
from pathlib import Path
from typing import Any

import lancedb # type: ignore
from lancedb.embeddings import get_registry # type: ignore
from lancedb.pydantic import LanceModel, Vector # type: ignore

import dlt
from dlt.common.configuration import configspec
from dlt.common.schema import TTableSchema
from dlt.common.typing import TDataItems, TSecretStrValue
from dlt.sources.helpers import requests
from dlt.sources.helpers.rest_client import RESTClient, AuthConfigBase

# access secrets to get openai key and instantiate embedding function
openai_api_key: str = dlt.secrets.get("destination.lancedb.credentials.embedding_model_provider_api_key")
func = get_registry().get("openai").create(name="text-embedding-3-small", api_key=openai_api_key)


class EpisodeSchema(LanceModel):
    id: str  # noqa: A003
    name: str
    description: str = func.SourceField()
    vector: Vector(func.ndims()) = func.VectorField()  # type: ignore[valid-type]
    release_date: datetime.date
    href: str


@dataclass(frozen=True)
class Shows:
    monday_morning_data_chat: str = "3Km3lBNzJpc1nOTJUtbtMh"
    latest_space_podcast: str = "2p7zZVwVF6Yk0Zsb4QmT7t"
    superdatascience_podcast: str = "1n8P7ZSgfVLVJ3GegxPat1"
    lex_fridman: str = "2MAi0BvDc6GTFvKFPXnkCL"


@configspec
class SpotifyAuth(AuthConfigBase):
    client_id: str = None
    client_secret: TSecretStrValue = None

    def __call__(self, request) -> Any:
        if not hasattr(self, "access_token"):
            self.access_token = self._get_access_token()
        request.headers["Authorization"] = f"Bearer {self.access_token}"
        return request

    def _get_access_token(self) -> Any:
        auth_url = "https://accounts.spotify.com/api/token"
        auth_response = requests.post(
            auth_url,
            {
                "grant_type": "client_credentials",
                "client_id": self.client_id,
                "client_secret": self.client_secret,
            },
        )
        return auth_response.json()["access_token"]


@dlt.source
def spotify_shows(
    client_id: str = dlt.secrets.value,
    client_secret: str = dlt.secrets.value,
):
    spotify_base_api_url = "https://api.spotify.com/v1"
    client = RESTClient(
        base_url=spotify_base_api_url,
        auth=SpotifyAuth(client_id=client_id, client_secret=client_secret),  # type: ignore[arg-type]
    )

    for show in fields(Shows):
        show_name = show.name
        show_id = show.default
        url = f"/shows/{show_id}/episodes"
        yield dlt.resource(
            client.paginate(url, params={"limit": 50}),
            name=show_name,
            write_disposition="merge",
            primary_key="id",
            parallelized=True,
            max_table_nesting=0,
        )


@dlt.destination(batch_size=250, name="lancedb")
def lancedb_destination(items: TDataItems, table: TTableSchema) -> None:
    db_path = Path(dlt.config.get("lancedb.db_path"))
    db = lancedb.connect(db_path)

    # The description field is embedded, so it needs extra cleaning for OpenAI, which
    # rejects empty strings and inputs longer than 8191 tokens. Truncating to 8000
    # characters is a rough guard against that limit.
    for item in items:
        item["description"] = item.get("description") or "No Description"
        item["description"] = item["description"][0:8000]
    try:
        tbl = db.open_table(table["name"])
    except FileNotFoundError:
        tbl = db.create_table(table["name"], schema=EpisodeSchema)
    tbl.add(items)


if __name__ == "__main__":
    db_path = Path(dlt.config.get("lancedb.db_path"))
    db = lancedb.connect(db_path)

    for show in fields(Shows):
        db.drop_table(show.name, ignore_missing=True)

    pipeline = dlt.pipeline(
        pipeline_name="spotify",
        destination=lancedb_destination,
        dataset_name="spotify_podcast_data",
        progress="log",
    )

    load_info = pipeline.run(spotify_shows())
    load_info.raise_on_failed_jobs()
    print(load_info)

    row_counts = pipeline.last_trace.last_normalize_info
    print(row_counts)

    query = "French AI scientist with Lex, talking about AGI and Meta and Llama"
    table_to_query = "lex_fridman"

    tbl = db.open_table(table_to_query)

    results = tbl.search(query=query).to_list()
    assert results
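
The description clean-up inside `lancedb_destination` can be read in isolation. The sketch below restates it as a pure function, assuming the same "No Description" placeholder and 8000-character cut-off as the example. `clean_descriptions` and `MAX_DESCRIPTION_CHARS` are hypothetical names that do not appear in the PR, and unlike the example this version returns copies instead of mutating the batch in place.

```python
from typing import Any, Dict, List

# Same cut-off as the example; note it counts characters, not tokens.
MAX_DESCRIPTION_CHARS = 8000


def clean_descriptions(items: List[Dict[str, Any]]) -> List[Dict[str, Any]]:
    """Return copies of items with a non-empty, truncated description."""
    cleaned = []
    for item in items:
        # Missing, None, and empty descriptions all fall back to the placeholder.
        description = item.get("description") or "No Description"
        cleaned.append({**item, "description": description[:MAX_DESCRIPTION_CHARS]})
    return cleaned


if __name__ == "__main__":
    batch = [{"id": "1", "description": ""}, {"id": "2", "description": "x" * 9000}]
    for row in clean_descriptions(batch):
        print(row["id"], len(row["description"]))
```

Because the cut-off is measured in characters, it approximates the OpenAI token limit rather than enforcing it; text that tokenizes to more than one token per character could still exceed the limit.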
2 changes: 1 addition & 1 deletion docs/tools/fix_grammar_gpt.py
@@ -120,7 +120,7 @@ def get_chunk_length(chunk: List[str]) -> int:
temperature=0,
)

- fixed_chunks.append(response.choices[0].message.content)
+ fixed_chunks.append(response.choices[0].message.content) # type: ignore

with open(file_path, "w", encoding="utf-8") as f:
for c in fixed_chunks:
152 changes: 151 additions & 1 deletion poetry.lock
