From 462c3b63e394a31115bdc5d51314abda216ee0e6 Mon Sep 17 00:00:00 2001
From: Linghua Jin
Date: Mon, 19 May 2025 21:33:06 -0700
Subject: [PATCH 01/13] Update README.md
---
examples/code_embedding/README.md | 23 ++++++++++++++++++-----
1 file changed, 18 insertions(+), 5 deletions(-)
diff --git a/examples/code_embedding/README.md b/examples/code_embedding/README.md
index 5d716fa6..f4ec8415 100644
--- a/examples/code_embedding/README.md
+++ b/examples/code_embedding/README.md
@@ -1,15 +1,28 @@
-# Build embedding index for codebase
+# Build real-time index for codebase
+[](https://github.com/cocoindex-io/cocoindex)
+
+CocoIndex provides built-in support for code base chunking, with native Tree-sitter support. In this example, we will build real-time index for codebase using CocoIndex.
+
+We appreciate a star ⭐ at [CocoIndex Github](https://github.com/cocoindex-io/cocoindex) if this is helpful.

-In this example, we will build an embedding index for a codebase using CocoIndex. CocoIndex provides built-in support for code base chunking, with native Tree-sitter support. [Tree-sitter](https://en.wikipedia.org/wiki/Tree-sitter_%28parser_generator%29) is a parser generator tool and an incremental parsing library, it is available in Rust 🦀 - [GitHub](https://github.com/tree-sitter/tree-sitter). CocoIndex has built-in Rust integration with Tree-sitter to efficiently parse code and extract syntax trees for various programming languages.
+[Tree-sitter](https://en.wikipedia.org/wiki/Tree-sitter_%28parser_generator%29) is a parser generator tool and an incremental parsing library, it is available in Rust 🦀 - [GitHub](https://github.com/tree-sitter/tree-sitter). CocoIndex has built-in Rust integration with Tree-sitter to efficiently parse code and extract syntax trees for various programming languages.
+Checkout the list of supported languages [here](https://cocoindex.io/docs/ops/functions#splitrecursively) - in the `language` section.
-Please give [Cocoindex on Github](https://github.com/cocoindex-io/cocoindex) a star to support us if you like our work. Thank you so much with a warm coconut hug 🥥🤗. [](https://github.com/cocoindex-io/cocoindex)
## Tutorials
-- Blog with step by step tutorial [here](https://cocoindex.io/blogs/index-code-base-for-rag).
-- Video walkthrough [here](https://youtu.be/G3WstvhHO24?si=Bnxu67Ax5Lv8b-J2)
+### Step by step tutorial
+Checkout the blog [here](https://cocoindex.io/blogs/index-code-base-for-rag).
+
+### Video Tutorial
+
+
+
+
+
Click the image above to watch the video tutorial on YouTube
+
## Prerequisite
From 4fdd56603d03309633d0fac44a37b46df8f2f66d Mon Sep 17 00:00:00 2001
From: LJ
Date: Mon, 19 May 2025 21:42:11 -0700
Subject: [PATCH 02/13] Update README.md
---
examples/code_embedding/README.md | 56 ++++++++++++++-----------------
1 file changed, 26 insertions(+), 30 deletions(-)
diff --git a/examples/code_embedding/README.md b/examples/code_embedding/README.md
index f4ec8415..d7f5825e 100644
--- a/examples/code_embedding/README.md
+++ b/examples/code_embedding/README.md
@@ -13,16 +13,13 @@ Checkout the list of supported languages [here](https://cocoindex.io/docs/ops/fu
## Tutorials
-### Step by step tutorial
-Checkout the blog [here](https://cocoindex.io/blogs/index-code-base-for-rag).
-
-### Video Tutorial
-
-
-
-
-
Click the image above to watch the video tutorial on YouTube
-
+- Step by step tutorial - Checkout the [blog](https://cocoindex.io/blogs/index-code-base-for-rag).
+- Video Tutorial
+
## Prerequisite
@@ -30,33 +27,32 @@ Checkout the blog [here](https://cocoindex.io/blogs/index-code-base-for-rag).
## Run
-Install dependencies:
-```bash
-pip install -e .
-```
+- Install dependencies:
+ ```bash
+ pip install -e .
+ ```
-Setup:
+- Setup:
-```bash
-python main.py cocoindex setup
-```
+ ```bash
+ python main.py cocoindex setup
+ ```
-Update index:
+- Update index:
+
+ ```bash
+ python main.py cocoindex update
+ ```
-```bash
-python main.py cocoindex update
-```
+- Run:
-Run:
-
-```bash
-python main.py
-```
+ ```bash
+ python main.py
+ ```
## CocoInsight
-CocoInsight is in Early Access now (Free) 😊 You found us! A quick 3 minute video tutorial about CocoInsight: [Watch on YouTube](https://youtu.be/ZnmyoHslBSc?si=pPLXWALztkA710r9).
-
-Run CocoInsight to understand your RAG data pipeline:
+I used CocoInsight (Free beta now) to troubleshoot the index generation and understand the data lineage of the pipeline.
+It just connects to your local CocoIndex server, with Zero pipeline data retention. Run following command to start CocoInsight:
```
python main.py cocoindex server -ci
From bd3a8a4b3d66f1ce4429aa0b082d2504d7afea26 Mon Sep 17 00:00:00 2001
From: Linghua Jin
Date: Mon, 19 May 2025 22:13:37 -0700
Subject: [PATCH 03/13] code embedding
---
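Note on the scoring in the new `search()` helper: the SQL below orders rows by pgvector's `<=>` operator, which returns cosine distance, and reports `score` as `1.0 - distance`, i.e. cosine similarity. The following is only a minimal in-memory sketch of that arithmetic, not the code in this patch; `cosine_distance`, `rank`, and the toy vectors are hypothetical names and data introduced for illustration, and non-zero vectors are assumed.

```python
import math


def cosine_distance(a: list[float], b: list[float]) -> float:
    """Cosine distance, 1 - cos(a, b), the quantity pgvector's `<=>` returns.

    Assumes both vectors are non-zero and have the same length.
    """
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / (norm_a * norm_b)


def rank(query: list[float], rows: dict[str, list[float]], top_k: int = 5) -> list[dict]:
    """In-memory stand-in for `ORDER BY distance LIMIT top_k`.

    The score mirrors the patch: score = 1.0 - distance.
    """
    by_distance = sorted(
        (cosine_distance(query, vec), name) for name, vec in rows.items()
    )
    return [
        {"filename": name, "score": 1.0 - dist}
        for dist, name in by_distance[:top_k]
    ]


# Hypothetical toy embeddings; real ones would come from code_to_embedding.
rows = {
    "a.py": [1.0, 0.0],
    "b.rs": [0.0, 1.0],
    "c.md": [1.0, 1.0],
}
results = rank([1.0, 0.0], rows, top_k=2)
for result in results:
    print(f"[{result['score']:.3f}] {result['filename']}")
```

Sorting by ascending distance is the same as sorting by descending score, which is why the query can use a plain `ORDER BY distance` and convert to a score afterwards.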
examples/code_embedding/main.py | 39 ++++++++++++++++++--------
examples/text_embedding_qdrant/main.py | 7 -----
2 files changed, 27 insertions(+), 19 deletions(-)
diff --git a/examples/code_embedding/main.py b/examples/code_embedding/main.py
index abd6d7b0..a48961a9 100644
--- a/examples/code_embedding/main.py
+++ b/examples/code_embedding/main.py
@@ -1,5 +1,5 @@
from dotenv import load_dotenv
-
+from psycopg_pool import ConnectionPool
import cocoindex
import os
@@ -8,7 +8,8 @@ def extract_extension(filename: str) -> str:
"""Extract the extension of a filename."""
return os.path.splitext(filename)[1]
-def code_to_embedding(text: cocoindex.DataSlice) -> cocoindex.DataSlice:
+@cocoindex.transform_flow()
+def code_to_embedding(text: cocoindex.DataSlice[str]) -> cocoindex.DataSlice[list[float]]:
"""
Embed the text using a SentenceTransformer model.
"""
@@ -23,7 +24,7 @@ def code_embedding_flow(flow_builder: cocoindex.FlowBuilder, data_scope: cocoind
"""
data_scope["files"] = flow_builder.add_source(
cocoindex.sources.LocalFile(path="../..",
- included_patterns=["*.py", "*.rs", "*.toml", "*.md", "*.mdx"],
+ included_patterns=["*.toml"],
excluded_patterns=[".*", "target", "**/node_modules"]))
code_embeddings = data_scope.add_collector()
@@ -47,26 +48,40 @@ def code_embedding_flow(flow_builder: cocoindex.FlowBuilder, data_scope: cocoind
metric=cocoindex.VectorSimilarityMetric.COSINE_SIMILARITY)])
-query_handler = cocoindex.query.SimpleSemanticsQueryHandler(
- name="SemanticsSearch",
- flow=code_embedding_flow,
- target_name="code_embeddings",
- query_transform_flow=code_to_embedding,
- default_similarity_metric=cocoindex.VectorSimilarityMetric.COSINE_SIMILARITY)
+
+def search(pool: ConnectionPool, query: str, top_k: int = 5):
+    # Get the table name, for the export target in the code_embedding_flow above.
+    table_name = cocoindex.utils.get_target_storage_default_name(code_embedding_flow, "code_embeddings")
+    # Evaluate the transform flow defined above with the input query, to get the embedding.
+    query_vector = code_to_embedding.eval(query)
+    # Run the query and get the results.
+    with pool.connection() as conn:
+        with conn.cursor() as cur:
+            cur.execute(f"""
+                SELECT filename, code, embedding <=> %s::vector AS distance
+                FROM {table_name} ORDER BY distance LIMIT %s
+            """, (query_vector, top_k))
+            return [
+                {"filename": row[0], "code": row[1], "score": 1.0 - row[2]}
+                for row in cur.fetchall()
+            ]
@cocoindex.main_fn()
def _run():
+    # Initialize the database connection pool.
+    pool = ConnectionPool(os.getenv("COCOINDEX_DATABASE_URL"))
     # Run queries in a loop to demonstrate the query capabilities.
     while True:
         try:
             query = input("Enter search query (or Enter to quit): ")
             if query == '':
                 break
-            results, _ = query_handler.search(query, 10)
+            # Run the query function with the database connection pool and the query.
+            results = search(pool, query)
             print("\nSearch results:")
             for result in results:
-                print(f"[{result.score:.3f}] {result.data['filename']}")
-                print(f" {result.data['code']}")
+                print(f"[{result['score']:.3f}] {result['filename']}")
+                print(f" {result['code']}")
                 print("---")
             print()
         except KeyboardInterrupt:
diff --git a/examples/text_embedding_qdrant/main.py b/examples/text_embedding_qdrant/main.py
index 57f27a45..febe05e5 100644
--- a/examples/text_embedding_qdrant/main.py
+++ b/examples/text_embedding_qdrant/main.py
@@ -57,13 +57,6 @@ def text_embedding_flow(
)
-query_handler = cocoindex.query.SimpleSemanticsQueryHandler(
- name="SemanticsSearch",
- flow=text_embedding_flow,
- target_name="doc_embeddings",
- query_transform_flow=text_to_embedding,
- default_similarity_metric=cocoindex.VectorSimilarityMetric.COSINE_SIMILARITY,
-)
@cocoindex.main_fn()
From 7bbbb4606099f621269dacf83f6fa3a614317f59 Mon Sep 17 00:00:00 2001
From: Linghua Jin
Date: Mon, 19 May 2025 22:17:02 -0700
Subject: [PATCH 04/13] Update main.py
---
examples/code_embedding/main.py | 2 +-
1 file changed, 1 insertion(+), 1 deletion(-)
diff --git a/examples/code_embedding/main.py b/examples/code_embedding/main.py
index a48961a9..f45921e5 100644
--- a/examples/code_embedding/main.py
+++ b/examples/code_embedding/main.py
@@ -24,7 +24,7 @@ def code_embedding_flow(flow_builder: cocoindex.FlowBuilder, data_scope: cocoind
"""
data_scope["files"] = flow_builder.add_source(
cocoindex.sources.LocalFile(path="../..",
- included_patterns=["*.toml"],
+ included_patterns=["*.py", "*.rs", "*.toml", "*.md", "*.mdx"],
excluded_patterns=[".*", "target", "**/node_modules"]))
code_embeddings = data_scope.add_collector()
From 57be4ab5aaee89954fdcb2b0f10d1d75e2d9bf82 Mon Sep 17 00:00:00 2001
From: LJ
Date: Mon, 19 May 2025 22:18:58 -0700
Subject: [PATCH 05/13] Update README.md
---
examples/code_embedding/README.md | 13 +++++++++++++
1 file changed, 13 insertions(+)
diff --git a/examples/code_embedding/README.md b/examples/code_embedding/README.md
index d7f5825e..8a02c4c3 100644
--- a/examples/code_embedding/README.md
+++ b/examples/code_embedding/README.md
@@ -21,6 +21,19 @@ Checkout the list of supported languages [here](https://cocoindex.io/docs/ops/fu
+## Steps
+
+### Indexing Flow
+
+
+
+1. We will ingest CocoIndex codebase
+2. For each file, perform chunking (Tree-sitter) and then embeddings.
+3. We will save the embeddings and the metadata in Postgres with PGVector.
+
+### Query:
+We will match against user-provided text by a SQL query, reusing the embedding operation in the indexing flow.
+
## Prerequisite
[Install Postgres](https://cocoindex.io/docs/getting_started/installation#-install-postgres) if you don't have one.
From 55b34369ed7a42fc32d8ff159f2f4e4b4c0eb369 Mon Sep 17 00:00:00 2001
From: Linghua Jin
Date: Mon, 19 May 2025 22:34:22 -0700
Subject: [PATCH 06/13] Update main.py
---
examples/code_embedding/main.py | 2 +-
1 file changed, 1 insertion(+), 1 deletion(-)
diff --git a/examples/code_embedding/main.py b/examples/code_embedding/main.py
index f45921e5..98e551a5 100644
--- a/examples/code_embedding/main.py
+++ b/examples/code_embedding/main.py
@@ -25,7 +25,7 @@ def code_embedding_flow(flow_builder: cocoindex.FlowBuilder, data_scope: cocoind
data_scope["files"] = flow_builder.add_source(
cocoindex.sources.LocalFile(path="../..",
included_patterns=["*.py", "*.rs", "*.toml", "*.md", "*.mdx"],
- excluded_patterns=[".*", "target", "**/node_modules"]))
+ excluded_patterns=["**/.*", "target", "**/node_modules"]))
code_embeddings = data_scope.add_collector()
with data_scope["files"].row() as file:
From e62e146fc7ce6b49fa0e2808c3477294b4215bfa Mon Sep 17 00:00:00 2001
From: Linghua Jin
Date: Mon, 19 May 2025 22:35:57 -0700
Subject: [PATCH 07/13] Update main.py
---
examples/text_embedding_qdrant/main.py | 7 +++++++
1 file changed, 7 insertions(+)
diff --git a/examples/text_embedding_qdrant/main.py b/examples/text_embedding_qdrant/main.py
index febe05e5..57f27a45 100644
--- a/examples/text_embedding_qdrant/main.py
+++ b/examples/text_embedding_qdrant/main.py
@@ -57,6 +57,13 @@ def text_embedding_flow(
)
+query_handler = cocoindex.query.SimpleSemanticsQueryHandler(
+ name="SemanticsSearch",
+ flow=text_embedding_flow,
+ target_name="doc_embeddings",
+ query_transform_flow=text_to_embedding,
+ default_similarity_metric=cocoindex.VectorSimilarityMetric.COSINE_SIMILARITY,
+)
@cocoindex.main_fn()
From 9d0269045707c2342b872c69d1946cb996bbf019 Mon Sep 17 00:00:00 2001
From: LJ
Date: Mon, 19 May 2025 22:37:08 -0700
Subject: [PATCH 08/13] Update README.md
---
examples/code_embedding/README.md | 18 ++++++------------
1 file changed, 6 insertions(+), 12 deletions(-)
diff --git a/examples/code_embedding/README.md b/examples/code_embedding/README.md
index 8a02c4c3..d3bc5a9b 100644
--- a/examples/code_embedding/README.md
+++ b/examples/code_embedding/README.md
@@ -5,27 +5,21 @@ CocoIndex provides built-in support for code base chunking, with native Tree-sit
We appreciate a star ⭐ at [CocoIndex Github](https://github.com/cocoindex-io/cocoindex) if this is helpful.
-
+
-[Tree-sitter](https://en.wikipedia.org/wiki/Tree-sitter_%28parser_generator%29) is a parser generator tool and an incremental parsing library, it is available in Rust 🦀 - [GitHub](https://github.com/tree-sitter/tree-sitter). CocoIndex has built-in Rust integration with Tree-sitter to efficiently parse code and extract syntax trees for various programming languages.
-
-Checkout the list of supported languages [here](https://cocoindex.io/docs/ops/functions#splitrecursively) - in the `language` section.
+[Tree-sitter](https://en.wikipedia.org/wiki/Tree-sitter_%28parser_generator%29) is a parser generator tool and an incremental parsing library, it is available in Rust 🦀 - [GitHub](https://github.com/tree-sitter/tree-sitter). CocoIndex has built-in Rust integration with Tree-sitter to efficiently parse code and extract syntax trees for various programming languages. Checkout the list of supported languages [here](https://cocoindex.io/docs/ops/functions#splitrecursively) - in the `language` section.
## Tutorials
- Step by step tutorial - Checkout the [blog](https://cocoindex.io/blogs/index-code-base-for-rag).
-- Video Tutorial
-
+- Video Tutorial - [Youtube](https://youtu.be/G3WstvhHO24?si=Bnxu67Ax5Lv8b-J2)
## Steps
### Indexing Flow
-
-
+
+
+
1. We will ingest CocoIndex codebase
2. For each file, perform chunking (Tree-sitter) and then embeddings.
From 8ad8a8eedb0b95a3568d13d181fb6926c0c8a56a Mon Sep 17 00:00:00 2001
From: LJ
Date: Mon, 19 May 2025 22:40:35 -0700
Subject: [PATCH 09/13] Update README.md
---
examples/code_embedding/README.md | 3 +++
1 file changed, 3 insertions(+)
diff --git a/examples/code_embedding/README.md b/examples/code_embedding/README.md
index d3bc5a9b..97fb9d47 100644
--- a/examples/code_embedding/README.md
+++ b/examples/code_embedding/README.md
@@ -66,3 +66,6 @@ python main.py cocoindex server -ci
```
Then open the CocoInsight UI at [https://cocoindex.io/cocoinsight](https://cocoindex.io/cocoinsight).
+
+
+
From 4c8d263b9416438b69f78c2488781658381a1ce5 Mon Sep 17 00:00:00 2001
From: LJ
Date: Tue, 20 May 2025 08:59:28 -0700
Subject: [PATCH 10/13] Update README.md
---
examples/code_embedding/README.md | 2 +-
1 file changed, 1 insertion(+), 1 deletion(-)
diff --git a/examples/code_embedding/README.md b/examples/code_embedding/README.md
index 97fb9d47..24d8b4ce 100644
--- a/examples/code_embedding/README.md
+++ b/examples/code_embedding/README.md
@@ -12,7 +12,7 @@ We appreciate a star ⭐ at [CocoIndex Github](https://github.com/cocoindex-io/c
## Tutorials
- Step by step tutorial - Checkout the [blog](https://cocoindex.io/blogs/index-code-base-for-rag).
-- Video Tutorial - [Youtube](https://youtu.be/G3WstvhHO24?si=Bnxu67Ax5Lv8b-J2)
+- Video tutorial - [Youtube](https://youtu.be/G3WstvhHO24?si=Bnxu67Ax5Lv8b-J2)
## Steps
From c704e03421315a895461fe10f4d56707916c4ac5 Mon Sep 17 00:00:00 2001
From: LJ
Date: Tue, 20 May 2025 10:22:57 -0700
Subject: [PATCH 11/13] Update README.md
---
examples/code_embedding/README.md | 4 ++--
1 file changed, 2 insertions(+), 2 deletions(-)
diff --git a/examples/code_embedding/README.md b/examples/code_embedding/README.md
index 24d8b4ce..d8afb4a0 100644
--- a/examples/code_embedding/README.md
+++ b/examples/code_embedding/README.md
@@ -21,8 +21,8 @@ We appreciate a star ⭐ at [CocoIndex Github](https://github.com/cocoindex-io/c
-1. We will ingest CocoIndex codebase
-2. For each file, perform chunking (Tree-sitter) and then embeddings.
+1. We will ingest CocoIndex codebase.
+2. For each file, perform chunking (Tree-sitter) and then embedding.
3. We will save the embeddings and the metadata in Postgres with PGVector.
### Query:
From 879ef6e69019fdd897f343461e4ab0c8182d1128 Mon Sep 17 00:00:00 2001
From: LJ
Date: Tue, 20 May 2025 10:23:40 -0700
Subject: [PATCH 12/13] Update README.md
---
examples/code_embedding/README.md | 2 +-
1 file changed, 1 insertion(+), 1 deletion(-)
diff --git a/examples/code_embedding/README.md b/examples/code_embedding/README.md
index d8afb4a0..70780558 100644
--- a/examples/code_embedding/README.md
+++ b/examples/code_embedding/README.md
@@ -1,7 +1,7 @@
# Build real-time index for codebase
[](https://github.com/cocoindex-io/cocoindex)
-CocoIndex provides built-in support for code base chunking, with native Tree-sitter support. In this example, we will build real-time index for codebase using CocoIndex.
+CocoIndex provides built-in support for code base chunking, using Tree-sitter to keep syntax boundary. In this example, we will build real-time index for codebase using CocoIndex.
We appreciate a star ⭐ at [CocoIndex Github](https://github.com/cocoindex-io/cocoindex) if this is helpful.
From 71ef401f5b82cfe1c23687b6c2064dda2890bea5 Mon Sep 17 00:00:00 2001
From: LJ
Date: Tue, 20 May 2025 10:34:02 -0700
Subject: [PATCH 13/13] Update README.md
---
examples/code_embedding/README.md | 8 ++++----
1 file changed, 4 insertions(+), 4 deletions(-)
diff --git a/examples/code_embedding/README.md b/examples/code_embedding/README.md
index 70780558..dd7b9ee0 100644
--- a/examples/code_embedding/README.md
+++ b/examples/code_embedding/README.md
@@ -7,12 +7,12 @@ We appreciate a star ⭐ at [CocoIndex Github](https://github.com/cocoindex-io/c

-[Tree-sitter](https://en.wikipedia.org/wiki/Tree-sitter_%28parser_generator%29) is a parser generator tool and an incremental parsing library, it is available in Rust 🦀 - [GitHub](https://github.com/tree-sitter/tree-sitter). CocoIndex has built-in Rust integration with Tree-sitter to efficiently parse code and extract syntax trees for various programming languages. Checkout the list of supported languages [here](https://cocoindex.io/docs/ops/functions#splitrecursively) - in the `language` section.
+[Tree-sitter](https://en.wikipedia.org/wiki/Tree-sitter_%28parser_generator%29) is a parser generator tool and an incremental parsing library. It is available in Rust 🦀 - [GitHub](https://github.com/tree-sitter/tree-sitter). CocoIndex has built-in Rust integration with Tree-sitter to efficiently parse code and extract syntax trees for various programming languages. Check out the list of supported languages [here](https://cocoindex.io/docs/ops/functions#splitrecursively) - in the `language` section.
## Tutorials
-- Step by step tutorial - Checkout the [blog](https://cocoindex.io/blogs/index-code-base-for-rag).
-- Video tutorial - [Youtube](https://youtu.be/G3WstvhHO24?si=Bnxu67Ax5Lv8b-J2)
+- Step by step tutorial - Check out the [blog](https://cocoindex.io/blogs/index-code-base-for-rag).
+- Video tutorial - [Youtube](https://youtu.be/G3WstvhHO24?si=Bnxu67Ax5Lv8b-J2).
## Steps
@@ -59,7 +59,7 @@ We will match against user-provided text by a SQL query, reusing the embedding o
## CocoInsight
I used CocoInsight (Free beta now) to troubleshoot the index generation and understand the data lineage of the pipeline.
-It just connects to your local CocoIndex server, with Zero pipeline data retention. Run following command to start CocoInsight:
+It just connects to your local CocoIndex server, with Zero pipeline data retention. Run the following command to start CocoInsight:
```
python main.py cocoindex server -ci