From 462c3b63e394a31115bdc5d51314abda216ee0e6 Mon Sep 17 00:00:00 2001
From: Linghua Jin
Date: Mon, 19 May 2025 21:33:06 -0700
Subject: [PATCH 01/13] Update README.md
---
 examples/code_embedding/README.md | 23 ++++++++++++++++++-----
 1 file changed, 18 insertions(+), 5 deletions(-)
diff --git a/examples/code_embedding/README.md b/examples/code_embedding/README.md
index 5d716fa6..f4ec8415 100644
--- a/examples/code_embedding/README.md
+++ b/examples/code_embedding/README.md
@@ -1,15 +1,28 @@
-# Build embedding index for codebase
+# Build real-time index for codebase
+[![GitHub](https://img.shields.io/github/stars/cocoindex-io/cocoindex?color=5B5BD6)](https://github.com/cocoindex-io/cocoindex)
+
+CocoIndex provides built-in support for code base chunking, with native Tree-sitter support. In this example, we will build real-time index for codebase using CocoIndex.
+
+We appreciate a star ⭐ at [CocoIndex Github](https://github.com/cocoindex-io/cocoindex) if this is helpful.
 ![Build embedding index for codebase](https://cocoindex.io/blogs/assets/images/cover-9bf0a7cff69b66a40918ab2fc1cea0c7.png)
-In this example, we will build an embedding index for a codebase using CocoIndex. CocoIndex provides built-in support for code base chunking, with native Tree-sitter support. [Tree-sitter](https://en.wikipedia.org/wiki/Tree-sitter_%28parser_generator%29) is a parser generator tool and an incremental parsing library, it is available in Rust 🦀 - [GitHub](https://github.com/tree-sitter/tree-sitter). CocoIndex has built-in Rust integration with Tree-sitter to efficiently parse code and extract syntax trees for various programming languages.
+[Tree-sitter](https://en.wikipedia.org/wiki/Tree-sitter_%28parser_generator%29) is a parser generator tool and an incremental parsing library, it is available in Rust 🦀 - [GitHub](https://github.com/tree-sitter/tree-sitter). CocoIndex has built-in Rust integration with Tree-sitter to efficiently parse code and extract syntax trees for various programming languages.
+Checkout the list of supported languages [here](https://cocoindex.io/docs/ops/functions#splitrecursively) - in the `language` section.
-Please give [Cocoindex on Github](https://github.com/cocoindex-io/cocoindex) a star to support us if you like our work. Thank you so much with a warm coconut hug 🥥🤗. [![GitHub](https://img.shields.io/github/stars/cocoindex-io/cocoindex?color=5B5BD6)](https://github.com/cocoindex-io/cocoindex)
 ## Tutorials
-- Blog with step by step tutorial [here](https://cocoindex.io/blogs/index-code-base-for-rag).
-- Video walkthrough [here](https://youtu.be/G3WstvhHO24?si=Bnxu67Ax5Lv8b-J2)
+### Step by step tutorial
+Checkout the blog [here](https://cocoindex.io/blogs/index-code-base-for-rag).
+
+### Video Tutorial
+
+ + Code Embedding with CocoIndex Tutorial + +

Click the image above to watch the video tutorial on YouTube

+
 ## Prerequisite
From 4fdd56603d03309633d0fac44a37b46df8f2f66d Mon Sep 17 00:00:00 2001
From: LJ
Date: Mon, 19 May 2025 21:42:11 -0700
Subject: [PATCH 02/13] Update README.md
---
 examples/code_embedding/README.md | 56 ++++++++++++++-----------------
 1 file changed, 26 insertions(+), 30 deletions(-)
diff --git a/examples/code_embedding/README.md b/examples/code_embedding/README.md
index f4ec8415..d7f5825e 100644
--- a/examples/code_embedding/README.md
+++ b/examples/code_embedding/README.md
@@ -13,16 +13,13 @@ Checkout the list of supported languages [here](https://cocoindex.io/docs/ops/fu
 ## Tutorials
-### Step by step tutorial
-Checkout the blog [here](https://cocoindex.io/blogs/index-code-base-for-rag).
-
-### Video Tutorial
-
- - Code Embedding with CocoIndex Tutorial - -

Click the image above to watch the video tutorial on YouTube

-
+- Step by step tutorial - Checkout the [blog](https://cocoindex.io/blogs/index-code-base-for-rag).
+- Video Tutorial
+
+ + Code Embedding with CocoIndex Tutorial + +
 ## Prerequisite
@@ -30,33 +27,32 @@ Checkout the blog [here](https://cocoindex.io/blogs/index-code-base-for-rag).
 ## Run
-Install dependencies:
-```bash
-pip install -e .
-```
+- Install dependencies:
+  ```bash
+  pip install -e .
+  ```
-Setup:
+- Setup:
-```bash
-python main.py cocoindex setup
-```
+  ```bash
+  python main.py cocoindex setup
+  ```
-Update index:
+- Update index:
+
+  ```bash
+  python main.py cocoindex update
+  ```
-```bash
-python main.py cocoindex update
-```
+- Run:
-Run:
-
-```bash
-python main.py
-```
+  ```bash
+  python main.py
+  ```
 ## CocoInsight
-CocoInsight is in Early Access now (Free) 😊 You found us! A quick 3 minute video tutorial about CocoInsight: [Watch on YouTube](https://youtu.be/ZnmyoHslBSc?si=pPLXWALztkA710r9).
-
-Run CocoInsight to understand your RAG data pipeline:
+I used CocoInsight (Free beta now) to troubleshoot the index generation and understand the data lineage of the pipeline.
+It just connects to your local CocoIndex server, with Zero pipeline data retention. Run following command to start CocoInsight:
 ```
 python main.py cocoindex server -ci
From bd3a8a4b3d66f1ce4429aa0b082d2504d7afea26 Mon Sep 17 00:00:00 2001
From: Linghua Jin
Date: Mon, 19 May 2025 22:13:37 -0700
Subject: [PATCH 03/13] code embedding
---
 examples/code_embedding/main.py | 39 ++++++++++++++++++--------
 examples/text_embedding_qdrant/main.py | 7 -----
 2 files changed, 27 insertions(+), 19 deletions(-)
diff --git a/examples/code_embedding/main.py b/examples/code_embedding/main.py
index abd6d7b0..a48961a9 100644
--- a/examples/code_embedding/main.py
+++ b/examples/code_embedding/main.py
@@ -1,5 +1,5 @@
 from dotenv import load_dotenv
-
+from psycopg_pool import ConnectionPool
 import cocoindex
 import os
@@ -8,7 +8,8 @@ def extract_extension(filename: str) -> str:
     """Extract the extension of a filename."""
     return os.path.splitext(filename)[1]
-def code_to_embedding(text: cocoindex.DataSlice) -> cocoindex.DataSlice:
+@cocoindex.transform_flow()
+def code_to_embedding(text: cocoindex.DataSlice[str]) -> cocoindex.DataSlice[list[float]]:
     """
     Embed the text using a SentenceTransformer model.
     """
@@ -23,7 +24,7 @@ def code_embedding_flow(flow_builder: cocoindex.FlowBuilder, data_scope: cocoind
     """
     data_scope["files"] = flow_builder.add_source(
         cocoindex.sources.LocalFile(path="../..",
-                                    included_patterns=["*.py", "*.rs", "*.toml", "*.md", "*.mdx"],
+                                    included_patterns=["*.toml"],
                                     excluded_patterns=[".*", "target", "**/node_modules"]))
     code_embeddings = data_scope.add_collector()
@@ -47,26 +48,40 @@ def code_embedding_flow(flow_builder: cocoindex.FlowBuilder, data_scope: cocoind
             metric=cocoindex.VectorSimilarityMetric.COSINE_SIMILARITY)])
-query_handler = cocoindex.query.SimpleSemanticsQueryHandler(
-    name="SemanticsSearch",
-    flow=code_embedding_flow,
-    target_name="code_embeddings",
-    query_transform_flow=code_to_embedding,
-    default_similarity_metric=cocoindex.VectorSimilarityMetric.COSINE_SIMILARITY)
+
+def search(pool: ConnectionPool, query: str, top_k: int = 5):
+    # Get the table name, for the export target in the code_embedding_flow above.
+    table_name = cocoindex.utils.get_target_storage_default_name(code_embedding_flow, "code_embeddings")
+    # Evaluate the transform flow defined above with the input query, to get the embedding.
+    query_vector = code_to_embedding.eval(query)
+    # Run the query and get the results.
+    with pool.connection() as conn:
+        with conn.cursor() as cur:
+            cur.execute(f"""
+                SELECT filename, code, embedding <=> %s::vector AS distance
+                FROM {table_name} ORDER BY distance LIMIT %s
+            """, (query_vector, top_k))
+            return [
+                {"filename": row[0], "code": row[1], "score": 1.0 - row[2]}
+                for row in cur.fetchall()
+            ]
 @cocoindex.main_fn()
 def _run():
+    # Initialize the database connection pool.
+    pool = ConnectionPool(os.getenv("COCOINDEX_DATABASE_URL"))
     # Run queries in a loop to demonstrate the query capabilities.
     while True:
         try:
             query = input("Enter search query (or Enter to quit): ")
             if query == '':
                 break
-            results, _ = query_handler.search(query, 10)
+            # Run the query function with the database connection pool and the query.
+            results = search(pool, query)
             print("\nSearch results:")
             for result in results:
-                print(f"[{result.score:.3f}] {result.data['filename']}")
-                print(f" {result.data['code']}")
+                print(f"[{result['score']:.3f}] {result['filename']}")
+                print(f" {result['code']}")
                 print("---")
             print()
         except KeyboardInterrupt:
diff --git a/examples/text_embedding_qdrant/main.py b/examples/text_embedding_qdrant/main.py
index 57f27a45..febe05e5 100644
--- a/examples/text_embedding_qdrant/main.py
+++ b/examples/text_embedding_qdrant/main.py
@@ -57,13 +57,6 @@ def text_embedding_flow(
     )
-query_handler = cocoindex.query.SimpleSemanticsQueryHandler(
-    name="SemanticsSearch",
-    flow=text_embedding_flow,
-    target_name="doc_embeddings",
-    query_transform_flow=text_to_embedding,
-    default_similarity_metric=cocoindex.VectorSimilarityMetric.COSINE_SIMILARITY,
-)
 @cocoindex.main_fn()
From 7bbbb4606099f621269dacf83f6fa3a614317f59 Mon Sep 17 00:00:00 2001
From: Linghua Jin
Date: Mon, 19 May 2025 22:17:02 -0700
Subject: [PATCH 04/13] Update main.py
---
 examples/code_embedding/main.py | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)
diff --git a/examples/code_embedding/main.py b/examples/code_embedding/main.py
index a48961a9..f45921e5 100644
--- a/examples/code_embedding/main.py
+++ b/examples/code_embedding/main.py
@@ -24,7 +24,7 @@ def code_embedding_flow(flow_builder: cocoindex.FlowBuilder, data_scope: cocoind
     """
     data_scope["files"] = flow_builder.add_source(
         cocoindex.sources.LocalFile(path="../..",
-                                    included_patterns=["*.toml"],
+                                    included_patterns=["*.py", "*.rs", "*.toml", "*.md", "*.mdx"],
                                     excluded_patterns=[".*", "target", "**/node_modules"]))
     code_embeddings = data_scope.add_collector()
From 57be4ab5aaee89954fdcb2b0f10d1d75e2d9bf82 Mon Sep 17 00:00:00 2001
From: LJ
Date: Mon, 19 May 2025 22:18:58 -0700
Subject: [PATCH 05/13] Update README.md
---
 examples/code_embedding/README.md | 13 +++++++++++++
 1 file changed, 13 insertions(+)
diff --git a/examples/code_embedding/README.md b/examples/code_embedding/README.md
index d7f5825e..8a02c4c3 100644
--- a/examples/code_embedding/README.md
+++ b/examples/code_embedding/README.md
@@ -21,6 +21,19 @@ Checkout the list of supported languages [here](https://cocoindex.io/docs/ops/fu
+## Steps
+
+### Indexing Flow
+Screenshot 2025-05-19 at 10 14 36 PM
+
+
+1. We will ingest CocoIndex codebase
+2. For each file, perform chunking (Tree-sitter) and then embeddings.
+3. We will save the embeddings and the metadata in Postgres with PGVector.
+
+### Query:
+We will match against user-provided text by a SQL query, reusing the embedding operation in the indexing flow.
+
 ## Prerequisite
 [Install Postgres](https://cocoindex.io/docs/getting_started/installation#-install-postgres) if you don't have one.
From 55b34369ed7a42fc32d8ff159f2f4e4b4c0eb369 Mon Sep 17 00:00:00 2001
From: Linghua Jin
Date: Mon, 19 May 2025 22:34:22 -0700
Subject: [PATCH 06/13] Update main.py
---
 examples/code_embedding/main.py | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)
diff --git a/examples/code_embedding/main.py b/examples/code_embedding/main.py
index f45921e5..98e551a5 100644
--- a/examples/code_embedding/main.py
+++ b/examples/code_embedding/main.py
@@ -25,7 +25,7 @@ def code_embedding_flow(flow_builder: cocoindex.FlowBuilder, data_scope: cocoind
     data_scope["files"] = flow_builder.add_source(
         cocoindex.sources.LocalFile(path="../..",
                                     included_patterns=["*.py", "*.rs", "*.toml", "*.md", "*.mdx"],
-                                    excluded_patterns=[".*", "target", "**/node_modules"]))
+                                    excluded_patterns=["**/.*", "target", "**/node_modules"]))
     code_embeddings = data_scope.add_collector()
     with data_scope["files"].row() as file:
From e62e146fc7ce6b49fa0e2808c3477294b4215bfa Mon Sep 17 00:00:00 2001
From: Linghua Jin
Date: Mon, 19 May 2025 22:35:57 -0700
Subject: [PATCH 07/13] Update main.py
---
 examples/text_embedding_qdrant/main.py | 7 +++++++
 1 file changed, 7 insertions(+)
diff --git a/examples/text_embedding_qdrant/main.py b/examples/text_embedding_qdrant/main.py
index febe05e5..57f27a45 100644
--- a/examples/text_embedding_qdrant/main.py
+++ b/examples/text_embedding_qdrant/main.py
@@ -57,6 +57,13 @@ def text_embedding_flow(
     )
+query_handler = cocoindex.query.SimpleSemanticsQueryHandler(
+    name="SemanticsSearch",
+    flow=text_embedding_flow,
+    target_name="doc_embeddings",
+    query_transform_flow=text_to_embedding,
+    default_similarity_metric=cocoindex.VectorSimilarityMetric.COSINE_SIMILARITY,
+)
 @cocoindex.main_fn()
From 9d0269045707c2342b872c69d1946cb996bbf019 Mon Sep 17 00:00:00 2001
From: LJ
Date: Mon, 19 May 2025 22:37:08 -0700
Subject: [PATCH 08/13] Update README.md
---
 examples/code_embedding/README.md | 18 ++++++------------
 1 file changed, 6 insertions(+), 12 deletions(-)
diff --git a/examples/code_embedding/README.md b/examples/code_embedding/README.md
index 8a02c4c3..d3bc5a9b 100644
--- a/examples/code_embedding/README.md
+++ b/examples/code_embedding/README.md
@@ -5,27 +5,21 @@ CocoIndex provides built-in support for code base chunking, with native Tree-sit
 We appreciate a star ⭐ at [CocoIndex Github](https://github.com/cocoindex-io/cocoindex) if this is helpful.
-![Build embedding index for codebase](https://cocoindex.io/blogs/assets/images/cover-9bf0a7cff69b66a40918ab2fc1cea0c7.png)
+![Build embedding index for codebase](https://github.com/user-attachments/assets/6dc5ce89-c949-41d4-852f-ad95af163dbd)
-[Tree-sitter](https://en.wikipedia.org/wiki/Tree-sitter_%28parser_generator%29) is a parser generator tool and an incremental parsing library, it is available in Rust 🦀 - [GitHub](https://github.com/tree-sitter/tree-sitter). CocoIndex has built-in Rust integration with Tree-sitter to efficiently parse code and extract syntax trees for various programming languages.
-
-Checkout the list of supported languages [here](https://cocoindex.io/docs/ops/functions#splitrecursively) - in the `language` section.
+[Tree-sitter](https://en.wikipedia.org/wiki/Tree-sitter_%28parser_generator%29) is a parser generator tool and an incremental parsing library, it is available in Rust 🦀 - [GitHub](https://github.com/tree-sitter/tree-sitter). CocoIndex has built-in Rust integration with Tree-sitter to efficiently parse code and extract syntax trees for various programming languages. Checkout the list of supported languages [here](https://cocoindex.io/docs/ops/functions#splitrecursively) - in the `language` section.
 ## Tutorials
 - Step by step tutorial - Checkout the [blog](https://cocoindex.io/blogs/index-code-base-for-rag).
-- Video Tutorial
-
- - Code Embedding with CocoIndex Tutorial - -
+- Video Tutorial - [Youtube](https://youtu.be/G3WstvhHO24?si=Bnxu67Ax5Lv8b-J2)
 ## Steps
 ### Indexing Flow
-Screenshot 2025-05-19 at 10 14 36 PM
-
+

+ Screenshot 2025-05-19 at 10 14 36 PM +

 1. We will ingest CocoIndex codebase
 2. For each file, perform chunking (Tree-sitter) and then embeddings.
From 8ad8a8eedb0b95a3568d13d181fb6926c0c8a56a Mon Sep 17 00:00:00 2001
From: LJ
Date: Mon, 19 May 2025 22:40:35 -0700
Subject: [PATCH 09/13] Update README.md
---
 examples/code_embedding/README.md | 3 +++
 1 file changed, 3 insertions(+)
diff --git a/examples/code_embedding/README.md b/examples/code_embedding/README.md
index d3bc5a9b..97fb9d47 100644
--- a/examples/code_embedding/README.md
+++ b/examples/code_embedding/README.md
@@ -66,3 +66,6 @@ python main.py cocoindex server -ci
 ```
 Then open the CocoInsight UI at [https://cocoindex.io/cocoinsight](https://cocoindex.io/cocoinsight).
+
+Chunking Visualization
+
From 4c8d263b9416438b69f78c2488781658381a1ce5 Mon Sep 17 00:00:00 2001
From: LJ
Date: Tue, 20 May 2025 08:59:28 -0700
Subject: [PATCH 10/13] Update README.md
---
 examples/code_embedding/README.md | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)
diff --git a/examples/code_embedding/README.md b/examples/code_embedding/README.md
index 97fb9d47..24d8b4ce 100644
--- a/examples/code_embedding/README.md
+++ b/examples/code_embedding/README.md
@@ -12,7 +12,7 @@ We appreciate a star ⭐ at [CocoIndex Github](https://github.com/cocoindex-io/c
 ## Tutorials
 - Step by step tutorial - Checkout the [blog](https://cocoindex.io/blogs/index-code-base-for-rag).
-- Video Tutorial - [Youtube](https://youtu.be/G3WstvhHO24?si=Bnxu67Ax5Lv8b-J2)
+- Video tutorial - [Youtube](https://youtu.be/G3WstvhHO24?si=Bnxu67Ax5Lv8b-J2)
 ## Steps
From c704e03421315a895461fe10f4d56707916c4ac5 Mon Sep 17 00:00:00 2001
From: LJ
Date: Tue, 20 May 2025 10:22:57 -0700
Subject: [PATCH 11/13] Update README.md
---
 examples/code_embedding/README.md | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)
diff --git a/examples/code_embedding/README.md b/examples/code_embedding/README.md
index 24d8b4ce..d8afb4a0 100644
--- a/examples/code_embedding/README.md
+++ b/examples/code_embedding/README.md
@@ -21,8 +21,8 @@ We appreciate a star ⭐ at [CocoIndex Github](https://github.com/cocoindex-io/c
 Screenshot 2025-05-19 at 10 14 36 PM

-1. We will ingest CocoIndex codebase
-2. For each file, perform chunking (Tree-sitter) and then embeddings.
+1. We will ingest CocoIndex codebase.
+2. For each file, perform chunking (Tree-sitter) and then embedding.
 3. We will save the embeddings and the metadata in Postgres with PGVector.
 ### Query:
From 879ef6e69019fdd897f343461e4ab0c8182d1128 Mon Sep 17 00:00:00 2001
From: LJ
Date: Tue, 20 May 2025 10:23:40 -0700
Subject: [PATCH 12/13] Update README.md
---
 examples/code_embedding/README.md | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)
diff --git a/examples/code_embedding/README.md b/examples/code_embedding/README.md
index d8afb4a0..70780558 100644
--- a/examples/code_embedding/README.md
+++ b/examples/code_embedding/README.md
@@ -1,7 +1,7 @@
 # Build real-time index for codebase
 [![GitHub](https://img.shields.io/github/stars/cocoindex-io/cocoindex?color=5B5BD6)](https://github.com/cocoindex-io/cocoindex)
-CocoIndex provides built-in support for code base chunking, with native Tree-sitter support. In this example, we will build real-time index for codebase using CocoIndex.
+CocoIndex provides built-in support for code base chunking, using Tree-sitter to keep syntax boundary. In this example, we will build real-time index for codebase using CocoIndex.
 We appreciate a star ⭐ at [CocoIndex Github](https://github.com/cocoindex-io/cocoindex) if this is helpful.
From 71ef401f5b82cfe1c23687b6c2064dda2890bea5 Mon Sep 17 00:00:00 2001
From: LJ
Date: Tue, 20 May 2025 10:34:02 -0700
Subject: [PATCH 13/13] Update README.md
---
 examples/code_embedding/README.md | 8 ++++----
 1 file changed, 4 insertions(+), 4 deletions(-)
diff --git a/examples/code_embedding/README.md b/examples/code_embedding/README.md
index 70780558..dd7b9ee0 100644
--- a/examples/code_embedding/README.md
+++ b/examples/code_embedding/README.md
@@ -7,12 +7,12 @@ We appreciate a star ⭐ at [CocoIndex Github](https://github.com/cocoindex-io/c
 ![Build embedding index for codebase](https://github.com/user-attachments/assets/6dc5ce89-c949-41d4-852f-ad95af163dbd)
-[Tree-sitter](https://en.wikipedia.org/wiki/Tree-sitter_%28parser_generator%29) is a parser generator tool and an incremental parsing library, it is available in Rust 🦀 - [GitHub](https://github.com/tree-sitter/tree-sitter). CocoIndex has built-in Rust integration with Tree-sitter to efficiently parse code and extract syntax trees for various programming languages. Checkout the list of supported languages [here](https://cocoindex.io/docs/ops/functions#splitrecursively) - in the `language` section.
+[Tree-sitter](https://en.wikipedia.org/wiki/Tree-sitter_%28parser_generator%29) is a parser generator tool and an incremental parsing library. It is available in Rust 🦀 - [GitHub](https://github.com/tree-sitter/tree-sitter). CocoIndex has built-in Rust integration with Tree-sitter to efficiently parse code and extract syntax trees for various programming languages. Check out the list of supported languages [here](https://cocoindex.io/docs/ops/functions#splitrecursively) - in the `language` section.
 ## Tutorials
-- Step by step tutorial - Checkout the [blog](https://cocoindex.io/blogs/index-code-base-for-rag).
-- Video tutorial - [Youtube](https://youtu.be/G3WstvhHO24?si=Bnxu67Ax5Lv8b-J2)
+- Step by step tutorial - Check out the [blog](https://cocoindex.io/blogs/index-code-base-for-rag).
+- Video tutorial - [Youtube](https://youtu.be/G3WstvhHO24?si=Bnxu67Ax5Lv8b-J2).
 ## Steps
@@ -59,7 +59,7 @@ We will match against user-provided text by a SQL query, reusing the embedding o
 ## CocoInsight
 I used CocoInsight (Free beta now) to troubleshoot the index generation and understand the data lineage of the pipeline.
-It just connects to your local CocoIndex server, with Zero pipeline data retention. Run following command to start CocoInsight:
+It just connects to your local CocoIndex server, with Zero pipeline data retention. Run the following command to start CocoInsight:
 ```
 python main.py cocoindex server -ci