Merged
7 changes: 7 additions & 0 deletions examples/text_embedding_lancedb/.env
@@ -4,3 +4,10 @@ COCOINDEX_DATABASE_URL=postgres://cocoindex:cocoindex@localhost/cocoindex
# Fallback to CPU for operations not supported by MPS on Mac.
# It's no-op for other platforms.
PYTORCH_ENABLE_MPS_FALLBACK=1

+# By default, the vector index is disabled, because LanceDB needs at least
+# 256 rows in the table before it can build the index (see
+# https://github.com/lance-format/lance/issues/4034 for more details).
+#
+# Once your table has enough data, change the following value to `true` to enable the index:
+ENABLE_LANCEDB_VECTOR_INDEX=false
7 changes: 7 additions & 0 deletions examples/text_embedding_lancedb/README.md
@@ -46,6 +46,13 @@ You can also run the command with `-L`, which will watch for file changes and up
cocoindex update -L main
```

+By default, the vector index is disabled, because LanceDB needs at least 256 rows in the table before it can build the index (see [this issue](https://github.com/lance-format/lance/issues/4034) for more details).
+After your LanceDB target table has enough data, set the following environment variable in the `.env` file to enable the vector index from then on:
+
+```sh
+ENABLE_LANCEDB_VECTOR_INDEX=true
+```
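
One way to decide when to flip the flag is to compare the target table's row count against the 256-row minimum quoted above. The helper below is a minimal sketch and not part of this PR: `MIN_ROWS_FOR_VECTOR_INDEX`, `has_enough_rows`, and `suggested_env_line` are hypothetical names, and the row count is passed in as a plain integer so the sketch does not assume any particular LanceDB client call.

```python
# Minimum row count quoted in this PR's README and .env comments.
MIN_ROWS_FOR_VECTOR_INDEX = 256


def has_enough_rows(row_count: int, min_rows: int = MIN_ROWS_FOR_VECTOR_INDEX) -> bool:
    """Return True once the table holds at least `min_rows` rows."""
    if row_count < 0:
        raise ValueError("row_count must be non-negative")
    return row_count >= min_rows


def suggested_env_line(row_count: int) -> str:
    """Build the `.env` line that matches the current row count."""
    value = "true" if has_enough_rows(row_count) else "false"
    return f"ENABLE_LANCEDB_VECTOR_INDEX={value}"
```

With these definitions, a table holding 300 rows would suggest `ENABLE_LANCEDB_VECTOR_INDEX=true`, while one holding 10 rows keeps the flag at `false`.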

## CocoInsight

I used CocoInsight (Free beta now) to troubleshoot the index generation and understand the data lineage of the pipeline.
25 changes: 16 additions & 9 deletions examples/text_embedding_lancedb/main.py
@@ -1,4 +1,4 @@
-from dotenv import load_dotenv
+import os
import datetime
import cocoindex
import math
@@ -31,6 +31,10 @@ def text_embedding_flow(
     """
     Define an example flow that embeds text into a vector database.
     """
+    ENABLE_LANCEDB_VECTOR_INDEX = os.environ.get(
+        "ENABLE_LANCEDB_VECTOR_INDEX", "0"
+    ).lower() in ("true", "1")
+
     data_scope["documents"] = flow_builder.add_source(
         cocoindex.sources.LocalFile(path="markdown_files"),
         refresh_interval=datetime.timedelta(seconds=5),
@@ -57,18 +61,21 @@ def text_embedding_flow(
text_embedding=chunk["embedding"],
)

+    # We cannot enable the index while the table has no data yet, as LanceDB requires data to train the index.
+    # See: https://github.com/lancedb/lance/issues/4034
+    # Guard it with the ENABLE_LANCEDB_VECTOR_INDEX environment variable.
+    vector_indexes = []
+    if ENABLE_LANCEDB_VECTOR_INDEX:
+        vector_indexes.append(
+            cocoindex.VectorIndexDef(
+                "text_embedding", cocoindex.VectorSimilarityMetric.L2_DISTANCE
+            )
+        )
     doc_embeddings.export(
         "doc_embeddings",
         coco_lancedb.LanceDB(db_uri=LANCEDB_URI, table_name=LANCEDB_TABLE),
         primary_key_fields=["id"],
-        # We cannot enable it when the table has no data yet, as LanceDB requires data to train the index.
-        # See: https://github.com/lancedb/lance/issues/4034
-        #
-        # vector_indexes=[
-        #     cocoindex.VectorIndexDef(
-        #         "text_embedding", cocoindex.VectorSimilarityMetric.L2_DISTANCE
-        #     ),
-        # ],
+        vector_indexes=vector_indexes,
     )
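
The gating logic in this file can be tried in isolation. The following is a minimal sketch under stated assumptions: `env_flag` and `build_vector_indexes` are hypothetical helpers, and the index definition is represented by a plain tuple rather than `cocoindex.VectorIndexDef`, so the snippet runs without cocoindex installed. It mirrors the parsing in the diff: only `true` and `1` (case-insensitive) turn the index on.

```python
import os


def env_flag(name: str, default: str = "0") -> bool:
    """Read a boolean flag from the environment, using the same rule as the diff."""
    return os.environ.get(name, default).lower() in ("true", "1")


def build_vector_indexes(enabled: bool) -> list[tuple[str, str]]:
    """Return the index definitions to export; the list is empty when the flag is off."""
    vector_indexes: list[tuple[str, str]] = []
    if enabled:
        # Stand-in for cocoindex.VectorIndexDef("text_embedding", ...L2_DISTANCE).
        vector_indexes.append(("text_embedding", "L2_DISTANCE"))
    return vector_indexes


if __name__ == "__main__":
    enabled = env_flag("ENABLE_LANCEDB_VECTOR_INDEX")
    print(build_vector_indexes(enabled))
```

Under this parsing, values such as `yes` or `on` are treated as false, so the `.env` entry should be exactly `true` or `1` to enable the index.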

