Merged
41 commits
3cd5106 update product recommendation example description (badmonster0, May 18, 2025)
e32cfea rename folder (badmonster0, May 18, 2025)
dddee93 Update README.md (badmonster0, May 18, 2025)
e43a28e Update README.md (badmonster0, May 18, 2025)
ade9afb Update README.md (badmonster0, May 19, 2025)
2bead87 Update README.md (badmonster0, May 19, 2025)
a374650 update text_embedding with new query handler (badmonster0, May 20, 2025)
3df050d Update main.py (badmonster0, May 20, 2025)
0471abe Update README.md (badmonster0, May 20, 2025)
c3c09b0 Update README.md (badmonster0, May 20, 2025)
bab25a4 Update README.md (badmonster0, May 20, 2025)
4f0b607 Update README.md (badmonster0, May 20, 2025)
9ffe7ed Update README.md (badmonster0, May 20, 2025)
631cad7 Update README.md (badmonster0, May 20, 2025)
8008d85 Update README.md (badmonster0, May 20, 2025)
2aa1bc8 qdrant (badmonster0, May 20, 2025)
c76db2d Update README.md (badmonster0, May 20, 2025)
f5e965d Update README.md (badmonster0, May 20, 2025)
eed7a25 Update README.md (badmonster0, May 20, 2025)
12fc9be Update README.md (badmonster0, May 20, 2025)
407eda2 Update README.md (badmonster0, May 20, 2025)
44d93dd Update README.md (badmonster0, May 20, 2025)
689d359 Update README.md (badmonster0, May 20, 2025)
a72270c Merge branch 'jelly-ex2' of https://github.com/cocoindex-io/cocoindex… (badmonster0, May 20, 2025)
eb56fe2 Update main.py (badmonster0, May 20, 2025)
e69c212 Update main.py (badmonster0, May 20, 2025)
9272a8a Update main.py (badmonster0, May 20, 2025)
7967914 Update main.py (badmonster0, May 20, 2025)
35b19ba Merge branch 'main' into jelly-ex2 (badmonster0, May 20, 2025)
3577694 upgrade query handling for pdf embedding (badmonster0, May 20, 2025)
f127c54 Merge branch 'main' into jelly-ex2 (badmonster0, May 20, 2025)
9741fa8 Update README.md (badmonster0, May 20, 2025)
818b806 Update README.md (badmonster0, May 20, 2025)
b5fbae6 Update README.md (badmonster0, May 20, 2025)
3ab5ddc Update README.md (badmonster0, May 20, 2025)
58d7cba Merge branch 'main' into jelly-ex2 (badmonster0, May 20, 2025)
331e0e2 Merge branch 'main' into jelly-ex2 (badmonster0, May 20, 2025)
ff43f46 Update main.py (badmonster0, May 20, 2025)
55ebc4a google drive example update query handler (badmonster0, May 20, 2025)
a79d279 Merge branch 'main' into jelly-ex2 (badmonster0, May 20, 2025)
6e72fb2 Update README.md (badmonster0, May 21, 2025)
58 changes: 38 additions & 20 deletions examples/gdrive_text_embedding/README.md
@@ -1,6 +1,23 @@
-This example builds embedding index based on Google Drive files.
-It continuously updates the index as files are added / updated / deleted in the source folders:
-it keeps the index in sync with the source folders effortlessly.
+# Build Google Drive text embedding and semantic search 🔍
+[![GitHub](https://img.shields.io/github/stars/cocoindex-io/cocoindex?color=5B5BD6)](https://github.com/cocoindex-io/cocoindex)
+
+In this example, we will build an embedding index based on Google Drive files and perform semantic search.
+
+It continuously updates the index as files are added / updated / deleted in the source folders. It keeps the index in sync with the source folders in real-time.
+
+We appreciate a star ⭐ at [CocoIndex Github](https://github.com/cocoindex-io/cocoindex) if this is helpful.
+
+## Steps
+
+### Indexing Flow
+<img width="801" alt="Google Drive File Ingestion" src="https://github.com/user-attachments/assets/bc772e1e-d7a0-46de-b57c-290a78c128ac" />
+
+1. We will ingest files from Google Drive folders.
+2. For each file, perform chunking (recursively split) and then embedding.
+3. We will save the embeddings and the metadata in Postgres with PGVector.
+
+### Query
+We will match against user-provided text by a SQL query, and reuse the embedding operation in the indexing flow.
 
 ## Prerequisite

@@ -25,32 +42,31 @@ Before running the example, you need to:
 
 ## Run
 
-Install dependencies:
-
-```sh
-pip install -e .
-```
+- Install dependencies:
 
-Setup:
+    ```sh
+    pip install -e .
+    ```
 
-```sh
-cocoindex setup main.py
-```
+- Setup:
 
-Run:
+    ```sh
+    cocoindex setup main.py
+    ```
 
-```sh
-python main.py
-```
+- Run:
 
+    ```sh
+    python main.py
+    ```
+
 During running, it will keep observing changes in the source folders and update the index automatically.
 At the same time, it accepts queries from the terminal, and performs search on top of the up-to-date index.
 
 
 ## CocoInsight
-CocoInsight is in Early Access now (Free) 😊 You found us! A quick 3 minute video tutorial about CocoInsight: [Watch on YouTube](https://youtu.be/ZnmyoHslBSc?si=pPLXWALztkA710r9).
-
-Run CocoInsight to understand your RAG data pipeline:
+I used CocoInsight (Free beta now) to troubleshoot the index generation and understand the data lineage of the pipeline.
+It just connects to your local CocoIndex server, with Zero pipeline data retention. Run following command to start CocoInsight:
 
 ```sh
 cocoindex server -ci main.py
@@ -62,4 +78,6 @@ You can also add a `-L` flag to make the server keep updating the index to refle
 cocoindex server -ci -L main.py
 ```
 
-Then open the CocoInsight UI at [https://cocoindex.io/cocoinsight](https://cocoindex.io/cocoinsight).
+Then open the CocoInsight UI at [https://cocoindex.io/cocoinsight](https://cocoindex.io/cocoinsight).
+
+<img width="1316" alt="Screenshot 2025-05-20 at 5 06 31 PM" src="https://github.com/user-attachments/assets/0ed848db-3cc3-43d3-8cb8-35069f503288" />
Review comment (Member, Author):
This alt text looks a little bit temporary.
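
Reviewer note (illustrative only, not part of this PR): step 2 of the README's indexing flow chunks each file before embedding, and `main.py` passes `chunk_size=2000, chunk_overlap=500` to the splitter. The sketch below shows what overlapping chunks look like using a plain fixed-size sliding window. It is a deliberately simpler stand-in for the recursive splitting the README mentions; `sliding_chunks` is a hypothetical helper, and measuring sizes in characters is an assumption.

```python
def sliding_chunks(text: str, chunk_size: int = 2000, chunk_overlap: int = 500) -> list[str]:
    """Split text into fixed-size windows; consecutive windows share `chunk_overlap` characters.

    Simplified stand-in for recursive splitting: it cuts at fixed offsets and
    ignores document structure such as headings or paragraphs.
    """
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= chunk_overlap < chunk_size:
        raise ValueError("chunk_overlap must be in [0, chunk_size)")

    step = chunk_size - chunk_overlap  # how far each window advances
    chunks: list[str] = []
    start = 0
    while start < len(text):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # this window already reached the end of the text
        start += step
    return chunks


if __name__ == "__main__":
    sample = "".join(str(i % 10) for i in range(5000))
    parts = sliding_chunks(sample)
    # Each chunk repeats the last 500 characters of the previous one.
    print([len(p) for p in parts])
```

The overlap means a sentence that straddles a chunk boundary still appears whole in at least one chunk, at the cost of embedding some text twice.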

73 changes: 45 additions & 28 deletions examples/gdrive_text_embedding/main.py
@@ -1,9 +1,19 @@
 from dotenv import load_dotenv
-
+from psycopg_pool import ConnectionPool
 import cocoindex
 import datetime
+import os
 
+@cocoindex.transform_flow()
+def text_to_embedding(text: cocoindex.DataSlice[str]) -> cocoindex.DataSlice[list[float]]:
+    """
+    Embed the text using a SentenceTransformer model.
+    This is a shared logic between indexing and querying, so extract it as a function.
+    """
+    return text.transform(
+        cocoindex.functions.SentenceTransformerEmbed(
+            model="sentence-transformers/all-MiniLM-L6-v2"))
 
 @cocoindex.flow_def(name="GoogleDriveTextEmbedding")
 def gdrive_text_embedding_flow(flow_builder: cocoindex.FlowBuilder, data_scope: cocoindex.DataScope):
     """
@@ -27,9 +37,7 @@ def gdrive_text_embedding_flow(flow_builder: cocoindex.FlowBuilder, data_scope:
             language="markdown", chunk_size=2000, chunk_overlap=500)
 
         with doc["chunks"].row() as chunk:
-            chunk["embedding"] = chunk["text"].transform(
-                cocoindex.functions.SentenceTransformerEmbed(
-                    model="sentence-transformers/all-MiniLM-L6-v2"))
+            chunk["embedding"] = text_to_embedding(chunk["text"])
             doc_embeddings.collect(filename=doc["filename"], location=chunk["location"],
                                    text=chunk["text"], embedding=chunk["embedding"])

@@ -42,33 +50,42 @@ def gdrive_text_embedding_flow(flow_builder: cocoindex.FlowBuilder, data_scope:
                 field_name="embedding",
                 metric=cocoindex.VectorSimilarityMetric.COSINE_SIMILARITY)])
 
-query_handler = cocoindex.query.SimpleSemanticsQueryHandler(
-    name="SemanticsSearch",
-    flow=gdrive_text_embedding_flow,
-    target_name="doc_embeddings",
-    query_transform_flow=lambda text: text.transform(
-        cocoindex.functions.SentenceTransformerEmbed(
-            model="sentence-transformers/all-MiniLM-L6-v2")),
-    default_similarity_metric=cocoindex.VectorSimilarityMetric.COSINE_SIMILARITY)
+def search(pool: ConnectionPool, query: str, top_k: int = 5):
+    # Get the table name, for the export target in the gdrive_text_embedding_flow above.
+    table_name = cocoindex.utils.get_target_storage_default_name(gdrive_text_embedding_flow, "doc_embeddings")
+    # Evaluate the transform flow defined above with the input query, to get the embedding.
+    query_vector = text_to_embedding.eval(query)
+    # Run the query and get the results.
+    with pool.connection() as conn:
+        with conn.cursor() as cur:
+            cur.execute(f"""
+                SELECT filename, text, embedding <=> %s::vector AS distance
+                FROM {table_name} ORDER BY distance LIMIT %s
+            """, (query_vector, top_k))
+            return [
+                {"filename": row[0], "text": row[1], "score": 1.0 - row[2]}
+                for row in cur.fetchall()
+            ]
 
 def _main():
-    # Use a `FlowLiveUpdater` to keep the flow data updated.
-    with cocoindex.FlowLiveUpdater(gdrive_text_embedding_flow):
-        # Run queries in a loop to demonstrate the query capabilities.
-        while True:
-            try:
-                query = input("Enter search query (or Enter to quit): ")
-                if query == '':
-                    break
-                results, _ = query_handler.search(query, 10)
-                print("\nSearch results:")
-                for result in results:
-                    print(f"[{result.score:.3f}] {result.data['filename']}")
-                    print(f"    {result.data['text']}")
-                    print("---")
-                print()
-            except KeyboardInterrupt:
+    # Initialize the database connection pool.
+    pool = ConnectionPool(os.getenv("COCOINDEX_DATABASE_URL"))
+    # Run queries in a loop to demonstrate the query capabilities.
+    while True:
+        try:
+            query = input("Enter search query (or Enter to quit): ")
+            if query == '':
                 break
+            # Run the query function with the database connection pool and the query.
+            results = search(pool, query)
+            print("\nSearch results:")
+            for result in results:
+                print(f"[{result['score']:.3f}] {result['filename']}")
+                print(f"    {result['text']}")
+                print("---")
+            print()
+        except KeyboardInterrupt:
+            break
 
 if __name__ == "__main__":
     load_dotenv()
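
Reviewer note (illustrative only, not part of this PR): the new `search` function orders rows by `embedding <=> query` (pgvector's cosine distance operator) and reports `score = 1.0 - distance`, which is the cosine similarity. It also embeds the query with the same `text_to_embedding` used during indexing. The in-memory sketch below mirrors that shape without Postgres or a model. `toy_embed`, `build_index`, and `search_in_memory` are hypothetical names, and the letter-count embedding is only a stand-in so the example runs on the standard library.

```python
import math


def toy_embed(text: str) -> list[float]:
    """Stand-in embedding: counts of the letters a-z. NOT the model used in the PR."""
    vec = [0.0] * 26
    for ch in text.lower():
        if "a" <= ch <= "z":
            vec[ord(ch) - ord("a")] += 1.0
    return vec


def cosine_distance(a: list[float], b: list[float]) -> float:
    """Cosine distance = 1 - cosine similarity."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        # Sketch-only convention: treat an all-zero vector as unrelated.
        return 1.0
    return 1.0 - dot / (norm_a * norm_b)


def build_index(docs: dict[str, str]) -> list[dict]:
    """Indexing side: embed every document with the shared embedding function."""
    return [
        {"filename": name, "text": text, "embedding": toy_embed(text)}
        for name, text in docs.items()
    ]


def search_in_memory(index: list[dict], query: str, top_k: int = 5) -> list[dict]:
    """Query side: embed the query with the SAME function, then rank by distance."""
    query_vector = toy_embed(query)
    ranked = sorted(
        ((cosine_distance(row["embedding"], query_vector), row) for row in index),
        key=lambda pair: pair[0],  # smallest distance first, like ORDER BY distance
    )
    return [
        {"filename": row["filename"], "text": row["text"], "score": 1.0 - distance}
        for distance, row in ranked[:top_k]  # like LIMIT top_k
    ]


if __name__ == "__main__":
    index = build_index({"a.md": "apple banana", "b.md": "zzz"})
    for hit in search_in_memory(index, "banana apple", top_k=2):
        print(f"[{hit['score']:.3f}] {hit['filename']}")
```

Sharing one embedding function between the two sides is what keeps query vectors comparable with stored vectors; if the two paths used different models, the distances would be meaningless.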