diff --git a/examples/text_embedding/README.md b/examples/text_embedding/README.md index 1e999882..2dd1dbb8 100644 --- a/examples/text_embedding/README.md +++ b/examples/text_embedding/README.md @@ -4,20 +4,20 @@ In this example, we will build index flow from text embedding from local markdown files, and query the index. -We appreicate a star ⭐ at [CocoIndex Github](https://github.com/cocoindex-io/cocoindex) if this is helpful. +We appreciate a star ⭐ at [CocoIndex Github](https://github.com/cocoindex-io/cocoindex) if this is helpful. -## Steps: +## Steps 🌱 A detailed step by step tutorial can be found here: [Get Started Documentation](https://cocoindex.io/docs/getting_started/quickstart) -### Indexing Flow: +### Indexing Flow Screenshot 2025-05-19 at 5 48 28 PM -1. We will ingest from a list of local files. -2. For each file, perform chunking (Recursive Split) and then embeddings. +1. We will ingest a list of local files. +2. For each file, perform chunking (recursively split) and then embedding. 3. We will save the embeddings and the metadata in Postgres with PGVector. -### Query: -We will match against user-provided text by a SQL query, reusing the embedding operation in the indexing flow. +### Query +We will match against user-provided text by a SQL query, and reuse the embedding operation in the indexing flow. ## Prerequisite diff --git a/examples/text_embedding_qdrant/README.md b/examples/text_embedding_qdrant/README.md index 5e2ea059..3f91dc95 100644 --- a/examples/text_embedding_qdrant/README.md +++ b/examples/text_embedding_qdrant/README.md @@ -1,69 +1,87 @@ -## Description +# Build text embedding and semantic search 🔍 with Qdrant + +[![GitHub](https://img.shields.io/github/stars/cocoindex-io/cocoindex?color=5B5BD6)](https://github.com/cocoindex-io/cocoindex) + +CocoIndex supports Qdrant natively - [documentation](https://cocoindex.io/docs/ops/storages#qdrant). 
In this example, we will build index flow from text embedding from local markdown files, and query the index. We will use **Qdrant** as the vector database. + +We appreciate a star ⭐ at [CocoIndex Github](https://github.com/cocoindex-io/cocoindex) if this is helpful. + +CocoIndex supports Qdrant + +## Steps +### Indexing Flow +Index flow for text embedding + +1. We will ingest a list of local files. +2. For each file, perform chunking (recursively split) and then embedding. +3. We will save the embeddings and the metadata in Qdrant. + +### Query +We use the Qdrant client to query the index, and reuse the embedding operation in the indexing flow. -Example to build a vector index in Qdrant based on local files. ## Pre-requisites -- [Install Postgres](https://cocoindex.io/docs/getting_started/installation#-install-postgres) if you don't have one. +- [Install Postgres](https://cocoindex.io/docs/getting_started/installation#-install-postgres) if you don't have one. Although the target store is Qdrant, CocoIndex uses Postgres to track the data lineage for incremental processing. - Run Qdrant. -```bash -docker run -d -p 6334:6334 -p 6333:6333 qdrant/qdrant -``` + ```bash + docker run -d -p 6334:6334 -p 6333:6333 qdrant/qdrant + ``` - [Create a collection](https://qdrant.tech/documentation/concepts/vectors/#named-vectors) to export the embeddings to. -```bash -curl -X PUT \ - 'http://localhost:6333/collections/cocoindex' \ - --header 'Content-Type: application/json' \ - --data-raw '{ - "vectors": { - "text_embedding": { - "size": 384, - "distance": "Cosine" - } - } -}' -``` - -You can view the collections and data with the Qdrant dashboard at . + ```bash + curl -X PUT \ + 'http://localhost:6333/collections/cocoindex' \ + --header 'Content-Type: application/json' \ + --data-raw '{ + "vectors": { + "text_embedding": { + "size": 384, + "distance": "Cosine" + } + } + }' + ``` + + You can view the collections and data with the Qdrant dashboard at . 
## Run -Install dependencies: +- Install dependencies: -```bash -pip install -e . -``` + ```bash + pip install -e . + ``` -Setup: +- Setup: -```bash -python main.py cocoindex setup -``` + ```bash + python main.py cocoindex setup + ``` -Update index: +- Update index: -```bash -python main.py cocoindex update -``` + ```bash + python main.py cocoindex update + ``` -Run: +- Run: -```bash -python main.py -``` + ```bash + python main.py + ``` ## CocoInsight - -CocoInsight is in Early Access now (Free) 😊 You found us! A quick 3 minute video tutorial about CocoInsight: [Watch on YouTube](https://youtu.be/ZnmyoHslBSc?si=pPLXWALztkA710r9). - -Run CocoInsight to understand your RAG data pipeline: +I used CocoInsight (Free beta now) to troubleshoot the index generation and understand the data lineage of the pipeline. +It just connects to your local CocoIndex server, with zero pipeline data retention. Run the following command to start CocoInsight: ```bash python main.py cocoindex server -ci ``` -Then open the CocoInsight UI at [https://cocoindex.io/cocoinsight](https://cocoindex.io/cocoinsight). +Open the CocoInsight UI at [https://cocoindex.io/cocoinsight](https://cocoindex.io/cocoinsight). + + diff --git a/examples/text_embedding_qdrant/main.py b/examples/text_embedding_qdrant/main.py index 57f27a45..b2892c43 100644 --- a/examples/text_embedding_qdrant/main.py +++ b/examples/text_embedding_qdrant/main.py @@ -1,21 +1,26 @@ from dotenv import load_dotenv +from qdrant_client import QdrantClient +from qdrant_client.http.models import Filter, FieldCondition, MatchValue import cocoindex +# Define Qdrant connection constants +QDRANT_GRPC_URL = "http://localhost:6334" +QDRANT_COLLECTION = "cocoindex" -def text_to_embedding(text: cocoindex.DataSlice) -> cocoindex.DataSlice: + +@cocoindex.transform_flow() +def text_to_embedding(text: cocoindex.DataSlice[str]) -> cocoindex.DataSlice[list[float]]: """ Embed the text using a SentenceTransformer model. 
This is a shared logic between indexing and querying, so extract it as a function. """ return text.transform( cocoindex.functions.SentenceTransformerEmbed( - model="sentence-transformers/all-MiniLM-L6-v2" - ) - ) + model="sentence-transformers/all-MiniLM-L6-v2")) -@cocoindex.flow_def(name="TextEmbedding") +@cocoindex.flow_def(name="TextEmbeddingWithQdrant") def text_embedding_flow( flow_builder: cocoindex.FlowBuilder, data_scope: cocoindex.DataScope ): @@ -50,35 +55,39 @@ def text_embedding_flow( doc_embeddings.export( "doc_embeddings", cocoindex.storages.Qdrant( - collection_name="cocoindex", grpc_url="http://localhost:6334/" + collection_name=QDRANT_COLLECTION, grpc_url=QDRANT_GRPC_URL ), primary_key_fields=["id"], setup_by_user=True, ) -query_handler = cocoindex.query.SimpleSemanticsQueryHandler( - name="SemanticsSearch", - flow=text_embedding_flow, - target_name="doc_embeddings", - query_transform_flow=text_to_embedding, - default_similarity_metric=cocoindex.VectorSimilarityMetric.COSINE_SIMILARITY, -) - - @cocoindex.main_fn() def _run(): + # Initialize Qdrant client + client = QdrantClient(url=QDRANT_GRPC_URL, prefer_grpc=True) + # Run queries in a loop to demonstrate the query capabilities. 
while True: try: query = input("Enter search query (or Enter to quit): ") if query == "": break - results, _ = query_handler.search(query, 10, "text_embedding") + + # Get the embedding for the query + query_embedding = text_to_embedding.eval(query) + + search_results = client.search( + collection_name=QDRANT_COLLECTION, + query_vector=("text_embedding", query_embedding), + limit=10 + ) print("\nSearch results:") - for result in results: - print(f"[{result.score:.3f}] {result.data['filename']}") - print(f" {result.data['text']}") + for result in search_results: + score = result.score + payload = result.payload + print(f"[{score:.3f}] {payload['filename']}") + print(f" {payload['text']}") print("---") print() except KeyboardInterrupt: diff --git a/examples/text_embedding_qdrant/pyproject.toml b/examples/text_embedding_qdrant/pyproject.toml index 25b2663c..70454200 100644 --- a/examples/text_embedding_qdrant/pyproject.toml +++ b/examples/text_embedding_qdrant/pyproject.toml @@ -3,7 +3,7 @@ name = "text-embedding-qdrant" version = "0.1.0" description = "Simple example for cocoindex: build embedding index based on local text files." requires-python = ">=3.10" -dependencies = ["cocoindex>=0.1.39", "python-dotenv>=1.0.1"] +dependencies = ["cocoindex>=0.1.39", "python-dotenv>=1.0.1", "qdrant-client>=1.6.0"] [tool.setuptools] packages = []
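
Review note: the new query loop in `main.py` embeds the query, asks Qdrant for the 10 nearest points under the `text_embedding` named vector (the collection is created with `"distance": "Cosine"`), and prints each hit's score and payload. The sketch below illustrates that ranking and output format in plain Python so it can be run without a Qdrant server. It is not Qdrant's implementation and does not use `qdrant_client`; `ScoredHit`, `cosine_similarity`, `brute_force_search`, and `format_hit` are hypothetical names introduced only for this illustration, and the brute-force scan stands in for Qdrant's index.

```python
import math
from dataclasses import dataclass


@dataclass
class ScoredHit:
    """Hypothetical stand-in for one search hit: a similarity score plus a payload dict."""
    score: float
    payload: dict


def cosine_similarity(a, b):
    """Cosine similarity of two equal-length vectors; 0.0 if either has zero norm."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def brute_force_search(points, query_vector, limit=10):
    """Score every (vector, payload) pair against the query and keep the top `limit`.

    Assumption: a linear scan replaces the vector index a real store would use.
    """
    hits = [
        ScoredHit(cosine_similarity(vector, query_vector), payload)
        for vector, payload in points
    ]
    hits.sort(key=lambda hit: hit.score, reverse=True)
    return hits[:limit]


def format_hit(hit):
    """Mirror the print format of the query loop: score, filename, then the chunk text."""
    return f"[{hit.score:.3f}] {hit.payload['filename']}\n    {hit.payload['text']}"


if __name__ == "__main__":
    # Toy 2-dimensional vectors; the example itself uses 384-dimensional embeddings.
    demo_points = [
        ([1.0, 0.0], {"filename": "a.md", "text": "alpha"}),
        ([0.0, 1.0], {"filename": "b.md", "text": "beta"}),
        ([1.0, 1.0], {"filename": "c.md", "text": "gamma"}),
    ]
    for hit in brute_force_search(demo_points, [1.0, 0.0], limit=2):
        print(format_hit(hit))
        print("---")
```

The payload keys `filename` and `text` match the fields the query loop reads from each result, so the same formatting helper could be pointed at real search results if they expose `score` and `payload` in the same way.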