Merged
28 commits
14 changes: 7 additions & 7 deletions examples/text_embedding/README.md
@@ -4,20 +4,20 @@

In this example, we will build an index flow for text embedding from local markdown files, and query the index.

We appreicate a star ⭐ at [CocoIndex Github](https://github.com/cocoindex-io/cocoindex) if this is helpful.
We appreciate a star ⭐ at [CocoIndex Github](https://github.com/cocoindex-io/cocoindex) if this is helpful.

## Steps:
## Steps
🌱 A detailed step by step tutorial can be found here: [Get Started Documentation](https://cocoindex.io/docs/getting_started/quickstart)

### Indexing Flow:
### Indexing Flow
<img width="461" alt="Screenshot 2025-05-19 at 5 48 28 PM" src="https://github.com/user-attachments/assets/b6825302-a0c7-4b86-9a2d-52da8286b4bd" />

1. We will ingest from a list of local files.
2. For each file, perform chunking (Recursive Split) and then embeddings.
1. We will ingest a list of local files.
2. For each file, perform chunking (recursively split) and then embedding.
3. We will save the embeddings and the metadata in Postgres with PGVector.
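
To make the chunking step concrete, here is a minimal sketch of a recursive split in plain Python. It is not CocoIndex's implementation: the function name `recursive_split`, the separator list, and the `max_len` parameter are all assumptions chosen for illustration, and unlike a production splitter it does not merge small pieces back together or add overlap between chunks.

```python
def recursive_split(
    text: str,
    max_len: int = 200,
    separators: tuple[str, ...] = ("\n\n", "\n", ". ", " "),
) -> list[str]:
    """Split text into chunks of at most max_len characters.

    Coarse separators (paragraphs) are tried first; finer ones (lines,
    sentences, words) are used only for pieces that are still too long.
    """
    text = text.strip()
    if len(text) <= max_len:
        return [text] if text else []

    for index, sep in enumerate(separators):
        if sep in text:
            chunks: list[str] = []
            for part in text.split(sep):
                # Recurse with the remaining, finer separators only.
                chunks.extend(recursive_split(part, max_len, separators[index + 1:]))
            return chunks

    # No separator applies: fall back to fixed-size slices.
    return [text[start:start + max_len] for start in range(0, len(text), max_len)]
```

Each resulting chunk would then be passed to the embedding step, so the chunk size bounds how much text a single vector has to represent.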

### Query:
We will match against user-provided text by a SQL query, reusing the embedding operation in the indexing flow.
### Query
We will match against user-provided text by a SQL query, and reuse the embedding operation in the indexing flow.


## Prerequisite
102 changes: 60 additions & 42 deletions examples/text_embedding_qdrant/README.md
@@ -1,69 +1,87 @@
## Description
# Build text embedding and semantic search 🔍 with Qdrant

[![GitHub](https://img.shields.io/github/stars/cocoindex-io/cocoindex?color=5B5BD6)](https://github.com/cocoindex-io/cocoindex)

CocoIndex supports Qdrant natively; see the [documentation](https://cocoindex.io/docs/ops/storages#qdrant). In this example, we will build an index flow for text embedding from local markdown files, and query the index. We will use **Qdrant** as the vector database.

We appreciate a star ⭐ at [CocoIndex Github](https://github.com/cocoindex-io/cocoindex) if this is helpful.

<img width="860" alt="CocoIndex supports Qdrant" src="https://github.com/user-attachments/assets/a9deecfa-dd94-4b97-a1b1-90488d8178df" />

## Steps
### Indexing Flow
<img width="480" alt="Index flow for text embedding" src="https://github.com/user-attachments/assets/44d47b5e-b49b-4f05-9a00-dcb8027602a1" />

1. We will ingest a list of local files.
2. For each file, perform chunking (recursively split) and then embedding.
3. We will save the embeddings and the metadata in Qdrant.

### Query
We use the Qdrant client to query the index, and reuse the embedding operation from the indexing flow.
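
Conceptually, a query embeds the user's text with the same model used for indexing and then ranks stored vectors by cosine similarity. The sketch below shows that ranking in plain Python on toy two-dimensional vectors. The helper names `cosine_similarity` and `top_k` and the in-memory `docs` mapping are hypothetical; Qdrant performs the equivalent search server-side over its own index, so this is only an illustration of the metric, not of how Qdrant is implemented.

```python
import math


def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine of the angle between two vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def top_k(query_vec: list[float], docs: dict[str, list[float]], k: int = 10):
    """Return up to k (score, name) pairs, highest similarity first."""
    scored = [(cosine_similarity(query_vec, vec), name) for name, vec in docs.items()]
    return sorted(scored, reverse=True)[:k]
```

Because query and documents must live in the same vector space, sharing one embedding function between indexing and querying avoids a mismatch between the two.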

Example to build a vector index in Qdrant based on local files.

## Pre-requisites

- [Install Postgres](https://cocoindex.io/docs/getting_started/installation#-install-postgres) if you don't have one.
- [Install Postgres](https://cocoindex.io/docs/getting_started/installation#-install-postgres) if you don't have one. Although the target store is Qdrant, CocoIndex uses Postgres to track data lineage for incremental processing.

- Run Qdrant.

```bash
docker run -d -p 6334:6334 -p 6333:6333 qdrant/qdrant
```
```bash
docker run -d -p 6334:6334 -p 6333:6333 qdrant/qdrant
```

- [Create a collection](https://qdrant.tech/documentation/concepts/vectors/#named-vectors) to export the embeddings to.

```bash
curl -X PUT \
'http://localhost:6333/collections/cocoindex' \
--header 'Content-Type: application/json' \
--data-raw '{
"vectors": {
"text_embedding": {
"size": 384,
"distance": "Cosine"
}
}
}'
```

You can view the collections and data with the Qdrant dashboard at <http://localhost:6333/dashboard>.
```bash
curl -X PUT \
'http://localhost:6333/collections/cocoindex' \
--header 'Content-Type: application/json' \
--data-raw '{
"vectors": {
"text_embedding": {
"size": 384,
"distance": "Cosine"
}
}
}'
```

You can view the collections and data with the Qdrant dashboard at <http://localhost:6333/dashboard>.
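
The same collection could also be created from Python. The following is a sketch using only the standard library; the helper name `build_create_collection_request` is hypothetical, while the URL and JSON body mirror the `curl` command above. The vector size of 384 matches the output dimension of `sentence-transformers/all-MiniLM-L6-v2`, the model used in `main.py`.

```python
import json
import urllib.request


def build_create_collection_request(
    base_url: str = "http://localhost:6333",
    collection: str = "cocoindex",
) -> urllib.request.Request:
    """Build (but do not send) the PUT request that creates the collection."""
    payload = {
        "vectors": {
            # Named vector; the name must match the field exported by the flow.
            "text_embedding": {"size": 384, "distance": "Cosine"}
        }
    }
    return urllib.request.Request(
        url=f"{base_url}/collections/{collection}",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="PUT",
    )
```

Passing the returned request to `urllib.request.urlopen` would send it, which requires a Qdrant instance listening on the given host and port.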

## Run

Install dependencies:
- Install dependencies:

```bash
pip install -e .
```
```bash
pip install -e .
```

Setup:
- Setup:

```bash
python main.py cocoindex setup
```
```bash
python main.py cocoindex setup
```

Update index:
- Update index:

```bash
python main.py cocoindex update
```
```bash
python main.py cocoindex update
```

Run:
- Run:

```bash
python main.py
```
```bash
python main.py
```

## CocoInsight

CocoInsight is in Early Access now (Free) 😊 You found us! A quick 3-minute video tutorial about CocoInsight: [Watch on YouTube](https://youtu.be/ZnmyoHslBSc?si=pPLXWALztkA710r9).

Run CocoInsight to understand your RAG data pipeline:
I used CocoInsight (Free beta now) to troubleshoot the index generation and understand the data lineage of the pipeline.
It just connects to your local CocoIndex server, with zero pipeline data retention. Run the following command to start CocoInsight:

```bash
python main.py cocoindex server -ci
```

Then open the CocoInsight UI at [https://cocoindex.io/cocoinsight](https://cocoindex.io/cocoinsight).
Open the CocoInsight UI at [https://cocoindex.io/cocoinsight](https://cocoindex.io/cocoinsight).


47 changes: 28 additions & 19 deletions examples/text_embedding_qdrant/main.py
@@ -1,21 +1,26 @@
 from dotenv import load_dotenv
+from qdrant_client import QdrantClient
+from qdrant_client.http.models import Filter, FieldCondition, MatchValue
 
 import cocoindex
 
+# Define Qdrant connection constants
+QDRANT_GRPC_URL = "http://localhost:6334"
+QDRANT_COLLECTION = "cocoindex"
 
-def text_to_embedding(text: cocoindex.DataSlice) -> cocoindex.DataSlice:
+
+@cocoindex.transform_flow()
+def text_to_embedding(text: cocoindex.DataSlice[str]) -> cocoindex.DataSlice[list[float]]:
     """
     Embed the text using a SentenceTransformer model.
     This is a shared logic between indexing and querying, so extract it as a function.
     """
     return text.transform(
         cocoindex.functions.SentenceTransformerEmbed(
-            model="sentence-transformers/all-MiniLM-L6-v2"
-        )
-    )
+            model="sentence-transformers/all-MiniLM-L6-v2"))
 
 
-@cocoindex.flow_def(name="TextEmbedding")
+@cocoindex.flow_def(name="TextEmbeddingWithQdrant")
 def text_embedding_flow(
     flow_builder: cocoindex.FlowBuilder, data_scope: cocoindex.DataScope
 ):
@@ -50,35 +55,39 @@ def text_embedding_flow(
     doc_embeddings.export(
         "doc_embeddings",
         cocoindex.storages.Qdrant(
-            collection_name="cocoindex", grpc_url="http://localhost:6334/"
+            collection_name=QDRANT_COLLECTION, grpc_url=QDRANT_GRPC_URL
         ),
         primary_key_fields=["id"],
         setup_by_user=True,
     )
 
 
-query_handler = cocoindex.query.SimpleSemanticsQueryHandler(
-    name="SemanticsSearch",
-    flow=text_embedding_flow,
-    target_name="doc_embeddings",
-    query_transform_flow=text_to_embedding,
-    default_similarity_metric=cocoindex.VectorSimilarityMetric.COSINE_SIMILARITY,
-)
-
-
 @cocoindex.main_fn()
 def _run():
+    # Initialize Qdrant client
+    client = QdrantClient(url=QDRANT_GRPC_URL, prefer_grpc=True)
+
     # Run queries in a loop to demonstrate the query capabilities.
     while True:
         try:
             query = input("Enter search query (or Enter to quit): ")
             if query == "":
                 break
-            results, _ = query_handler.search(query, 10, "text_embedding")
+
+            # Get the embedding for the query
+            query_embedding = text_to_embedding.eval(query)
+
+            search_results = client.search(
+                collection_name=QDRANT_COLLECTION,
+                query_vector=("text_embedding", query_embedding),
+                limit=10
+            )
             print("\nSearch results:")
-            for result in results:
-                print(f"[{result.score:.3f}] {result.data['filename']}")
-                print(f" {result.data['text']}")
+            for result in search_results:
+                score = result.score
+                payload = result.payload
+                print(f"[{score:.3f}] {payload['filename']}")
+                print(f" {payload['text']}")
                 print("---")
             print()
         except KeyboardInterrupt:
2 changes: 1 addition & 1 deletion examples/text_embedding_qdrant/pyproject.toml
@@ -3,7 +3,7 @@ name = "text-embedding-qdrant"
version = "0.1.0"
description = "Simple example for cocoindex: build embedding index based on local text files."
requires-python = ">=3.10"
dependencies = ["cocoindex>=0.1.39", "python-dotenv>=1.0.1"]
dependencies = ["cocoindex>=0.1.39", "python-dotenv>=1.0.1", "qdrant-client>=1.6.0"]

[tool.setuptools]
packages = []