From 3cd5106e8fe2faf0f21d372d96bed1674f2812a4 Mon Sep 17 00:00:00 2001 From: Linghua Jin Date: Sun, 18 May 2025 16:12:57 -0700 Subject: [PATCH 01/47] update product recommendation example description --- README.md | 2 +- examples/product_recommendation/README.md | 6 ++---- 2 files changed, 3 insertions(+), 5 deletions(-) diff --git a/README.md b/README.md index 8b3cb3a23..3c7a9a932 100644 --- a/README.md +++ b/README.md @@ -137,7 +137,7 @@ It defines an index flow like this: | [Docs to Knowledge Graph](examples/docs_to_knowledge_graph) | Extract relationships from Markdown documents and build a knowledge graph | | [Embeddings to Qdrant](examples/text_embedding_qdrant) | Index documents in a Qdrant collection for semantic search | | [FastAPI Server with Docker](examples/fastapi_server_docker) | Run the semantic search server in a Dockerized FastAPI setup | -| [Product Recommendation](examples/product_recommendation) | Build real-time product recommendations with LLM and graph database| +| [Recommendation Engine with Knowledge Graph](examples/product_taxonomy_knowledge_graph) | Build real-time product recommendations with LLM and knowledge graph | | [Image Search with Vision API](examples/image_search_example) | Generates detailed captions for images using a vision model, embeds them, enables live-updating semantic search via FastAPI and served on a React frontend| More coming and stay tuned 👀! diff --git a/examples/product_recommendation/README.md b/examples/product_recommendation/README.md index 96565782f..24da1f069 100644 --- a/examples/product_recommendation/README.md +++ b/examples/product_recommendation/README.md @@ -1,8 +1,6 @@ -# Build Real-Time Recommendation Engine with LLM and Graph Database +# Build Real-Time Recommendation Engine with LLM and Knowledge Graph -We will build a real-time product recommendation engine with LLM and graph database. In particular, we will use LLM to understand the category (taxonomy) of a product. In addition, we will use LLM to enumerate the complementary products - users are likely to buy together with the current product (pencil and notebook). - -We will use Graph to explore the relationships between products that can be further used for product recommendations or labeling. +We will process a list of products and use LLM to extract the taxonomy and complimentary taxonomy for each product and find connections between products. Please drop [CocoIndex on Github](https://github.com/cocoindex-io/cocoindex) a star to support us and stay tuned for more updates. Thank you so much 🥥🤗. [![GitHub](https://img.shields.io/github/stars/cocoindex-io/cocoindex?color=5B5BD6)](https://github.com/cocoindex-io/cocoindex) From e32cfea69db9d46698dbcb663198f6040292a862 Mon Sep 17 00:00:00 2001 From: Linghua Jin Date: Sun, 18 May 2025 16:54:12 -0700 Subject: [PATCH 02/47] rename folder --- examples/product_recommendation/README.md | 4 +++- 1 file changed, 3 insertions(+), 1 deletion(-) diff --git a/examples/product_recommendation/README.md b/examples/product_recommendation/README.md index 24da1f069..c554655bb 100644 --- a/examples/product_recommendation/README.md +++ b/examples/product_recommendation/README.md @@ -1,6 +1,8 @@ # Build Real-Time Recommendation Engine with LLM and Knowledge Graph -We will process a list of products and use LLM to extract the taxonomy and complimentary taxonomy for each product and find connections between products. +We will build a real-time product recommendation engine with LLM and knowledge graph. 
In particular, we will use LLM to understand the category (taxonomy) of a product. In addition, we will use LLM to enumerate the complementary products - users are likely to buy together with the current product (pencil and notebook). + +We will use Knowledge Graph to explore the relationships between products that can be further used for product recommendations or labeling. Please drop [CocoIndex on Github](https://github.com/cocoindex-io/cocoindex) a star to support us and stay tuned for more updates. Thank you so much 🥥🤗. [![GitHub](https://img.shields.io/github/stars/cocoindex-io/cocoindex?color=5B5BD6)](https://github.com/cocoindex-io/cocoindex) From dddee93334a7ad8121f84d35e6f31adc01e68adb Mon Sep 17 00:00:00 2001 From: Linghua Jin Date: Sun, 18 May 2025 16:54:40 -0700 Subject: [PATCH 03/47] Update README.md --- README.md | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/README.md b/README.md index 3c7a9a932..b09e76211 100644 --- a/README.md +++ b/README.md @@ -137,7 +137,7 @@ It defines an index flow like this: | [Docs to Knowledge Graph](examples/docs_to_knowledge_graph) | Extract relationships from Markdown documents and build a knowledge graph | | [Embeddings to Qdrant](examples/text_embedding_qdrant) | Index documents in a Qdrant collection for semantic search | | [FastAPI Server with Docker](examples/fastapi_server_docker) | Run the semantic search server in a Dockerized FastAPI setup | -| [Recommendation Engine with Knowledge Graph](examples/product_taxonomy_knowledge_graph) | Build real-time product recommendations with LLM and knowledge graph | +| [Product Recommendation with Knowledge Graph](examples/product_recommendation) | Build real-time product recommendations with LLM and knowledge graph | | [Image Search with Vision API](examples/image_search_example) | Generates detailed captions for images using a vision model, embeds them, enables live-updating semantic search via FastAPI and served on a React frontend| More coming and stay tuned 👀! From e43a28ebe676ee1584819513d576b4e1b8e6f341 Mon Sep 17 00:00:00 2001 From: Linghua Jin Date: Sun, 18 May 2025 16:56:18 -0700 Subject: [PATCH 04/47] Update README.md --- README.md | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/README.md b/README.md index b09e76211..8b3cb3a23 100644 --- a/README.md +++ b/README.md @@ -137,7 +137,7 @@ It defines an index flow like this: | [Docs to Knowledge Graph](examples/docs_to_knowledge_graph) | Extract relationships from Markdown documents and build a knowledge graph | | [Embeddings to Qdrant](examples/text_embedding_qdrant) | Index documents in a Qdrant collection for semantic search | | [FastAPI Server with Docker](examples/fastapi_server_docker) | Run the semantic search server in a Dockerized FastAPI setup | -| [Product Recommendation with Knowledge Graph](examples/product_recommendation) | Build real-time product recommendations with LLM and knowledge graph | +| [Product Recommendation](examples/product_recommendation) | Build real-time product recommendations with LLM and graph database| | [Image Search with Vision API](examples/image_search_example) | Generates detailed captions for images using a vision model, embeds them, enables live-updating semantic search via FastAPI and served on a React frontend| More coming and stay tuned 👀! 
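The taxonomy extraction described above boils down to asking the LLM for a small typed record per product. As a rough illustration only (these class and field names are assumptions for this sketch, not code from the patches), the shape of that record could look like:

```python
import dataclasses

@dataclasses.dataclass
class ProductTaxonomy:
    """One category (taxonomy node) the LLM assigns to a product."""
    name: str

@dataclasses.dataclass
class ProductTaxonomyInfo:
    """LLM output per product: its own taxonomies, plus complementary taxonomies
    for items users are likely to buy together (e.g. pencil -> notebook)."""
    taxonomies: list[ProductTaxonomy]
    complementary_taxonomies: list[ProductTaxonomy]
```

Records like these would then be collected into knowledge-graph nodes and relationships, which is what the "explore the relationships between products" sentence refers to.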
From ade9afb527f9d7dc8afc14c07419764865ff4031 Mon Sep 17 00:00:00 2001 From: Linghua Jin Date: Sun, 18 May 2025 17:07:15 -0700 Subject: [PATCH 05/47] Update README.md --- examples/product_recommendation/README.md | 4 ++-- 1 file changed, 2 insertions(+), 2 deletions(-) diff --git a/examples/product_recommendation/README.md b/examples/product_recommendation/README.md index c554655bb..51552a6b4 100644 --- a/examples/product_recommendation/README.md +++ b/examples/product_recommendation/README.md @@ -1,6 +1,6 @@ -# Build Real-Time Recommendation Engine with LLM and Knowledge Graph +# Build Real-Time Recommendation Engine with LLM and Graph Database -We will build a real-time product recommendation engine with LLM and knowledge graph. In particular, we will use LLM to understand the category (taxonomy) of a product. In addition, we will use LLM to enumerate the complementary products - users are likely to buy together with the current product (pencil and notebook). +We will build a real-time product recommendation engine with LLM and graph database. In particular, we will use LLM to understand the category (taxonomy) of a product. In addition, we will use LLM to enumerate the complementary products - users are likely to buy together with the current product (pencil and notebook). We will use Knowledge Graph to explore the relationships between products that can be further used for product recommendations or labeling. From 2bead8794e35eb0eebe2628927f86b6b94a1d0a3 Mon Sep 17 00:00:00 2001 From: Linghua Jin Date: Sun, 18 May 2025 17:07:46 -0700 Subject: [PATCH 06/47] Update README.md --- examples/product_recommendation/README.md | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/examples/product_recommendation/README.md b/examples/product_recommendation/README.md index 51552a6b4..96565782f 100644 --- a/examples/product_recommendation/README.md +++ b/examples/product_recommendation/README.md @@ -2,7 +2,7 @@ We will build a real-time product recommendation engine with LLM and graph database. In particular, we will use LLM to understand the category (taxonomy) of a product. In addition, we will use LLM to enumerate the complementary products - users are likely to buy together with the current product (pencil and notebook). -We will use Knowledge Graph to explore the relationships between products that can be further used for product recommendations or labeling. +We will use Graph to explore the relationships between products that can be further used for product recommendations or labeling. Please drop [CocoIndex on Github](https://github.com/cocoindex-io/cocoindex) a star to support us and stay tuned for more updates. Thank you so much 🥥🤗. [![GitHub](https://img.shields.io/github/stars/cocoindex-io/cocoindex?color=5B5BD6)](https://github.com/cocoindex-io/cocoindex) From a37465081314654d65508856fd1378dcdfc5bb1b Mon Sep 17 00:00:00 2001 From: Linghua Jin Date: Mon, 19 May 2025 17:49:08 -0700 Subject: [PATCH 07/47] update text_embedding with new query handler --- examples/text_embedding/README.md | 5 ++++- examples/text_embedding/main.py | 1 - 2 files changed, 4 insertions(+), 2 deletions(-) diff --git a/examples/text_embedding/README.md b/examples/text_embedding/README.md index 1e9998827..fc821c6ce 100644 --- a/examples/text_embedding/README.md +++ b/examples/text_embedding/README.md @@ -1,4 +1,7 @@ -# Build text embedding and semantic search 🔍 +Build text embedding and semantic search based on local files. 
+ +In this example, we will build a text embedding index and a semantic search flow based on local markdown files. + [![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/cocoindex-io/cocoindex/blob/main/examples/text_embedding/Text_Embedding.ipynb) [![GitHub](https://img.shields.io/github/stars/cocoindex-io/cocoindex?color=5B5BD6)](https://github.com/cocoindex-io/cocoindex) diff --git a/examples/text_embedding/main.py b/examples/text_embedding/main.py index e69e1e7c5..581f62c65 100644 --- a/examples/text_embedding/main.py +++ b/examples/text_embedding/main.py @@ -42,7 +42,6 @@ def text_embedding_flow(flow_builder: cocoindex.FlowBuilder, data_scope: cocoind field_name="embedding", metric=cocoindex.VectorSimilarityMetric.COSINE_SIMILARITY)]) - def search(pool: ConnectionPool, query: str, top_k: int = 5): # Get the table name, for the export target in the text_embedding_flow above. table_name = cocoindex.utils.get_target_storage_default_name(text_embedding_flow, "doc_embeddings") From 3df050d01251413fe920c22cfa04a59bd90bac31 Mon Sep 17 00:00:00 2001 From: Linghua Jin Date: Mon, 19 May 2025 18:14:44 -0700 Subject: [PATCH 08/47] Update main.py --- examples/text_embedding/main.py | 1 + 1 file changed, 1 insertion(+) diff --git a/examples/text_embedding/main.py b/examples/text_embedding/main.py index 581f62c65..e69e1e7c5 100644 --- a/examples/text_embedding/main.py +++ b/examples/text_embedding/main.py @@ -42,6 +42,7 @@ def text_embedding_flow(flow_builder: cocoindex.FlowBuilder, data_scope: cocoind field_name="embedding", metric=cocoindex.VectorSimilarityMetric.COSINE_SIMILARITY)]) + def search(pool: ConnectionPool, query: str, top_k: int = 5): # Get the table name, for the export target in the text_embedding_flow above. table_name = cocoindex.utils.get_target_storage_default_name(text_embedding_flow, "doc_embeddings") From 0471abec519d46af2902862d75fc0dd402d99e32 Mon Sep 17 00:00:00 2001 From: LJ Date: Mon, 19 May 2025 18:01:29 -0700 Subject: [PATCH 09/47] Update README.md --- examples/text_embedding/README.md | 22 +++++----------------- 1 file changed, 5 insertions(+), 17 deletions(-) diff --git a/examples/text_embedding/README.md b/examples/text_embedding/README.md index fc821c6ce..9ddc9d7b5 100644 --- a/examples/text_embedding/README.md +++ b/examples/text_embedding/README.md @@ -1,27 +1,15 @@ -Build text embedding and semantic search based on local files. - -In this example, we will build a text embedding index and a semantic search flow based on local markdown files. - +# Build text embedding and semantic search [![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/cocoindex-io/cocoindex/blob/main/examples/text_embedding/Text_Embedding.ipynb) -[![GitHub](https://img.shields.io/github/stars/cocoindex-io/cocoindex?color=5B5BD6)](https://github.com/cocoindex-io/cocoindex) -In this example, we will build index flow from text embedding from local markdown files, and query the index. -We appreicate a star ⭐ at [CocoIndex Github](https://github.com/cocoindex-io/cocoindex) if this is helpful. +In this example, we will build a text embedding index and a semantic search flow based on local markdown files. -## Steps: -🌱 A detailed step by step tutorial can be found here: [Get Started Documentation](https://cocoindex.io/docs/getting_started/quickstart) -### Indexing Flow: Screenshot 2025-05-19 at 5 48 28 PM -1. We will ingest from a list of local files. -2. 
For each file, perform chunking (Recursive Split) and then embeddings. -3. We will save the embeddings and the metadata in Postgres with PGVector. - -### Query: -We will match against user-provided text by a SQL query, reusing the embedding operation in the indexing flow. - +We will ingest from a list of local files. For each file, perform chunking (Recursive Split) and then embeddings. +We will save the embeddings and the metadata in Postgres with PGVector. +And then add a simpler query handler for semantic search. ## Prerequisite From c3c09b0d1cab00c4971365e4b2d1b99ac53d4b81 Mon Sep 17 00:00:00 2001 From: LJ Date: Mon, 19 May 2025 18:02:01 -0700 Subject: [PATCH 10/47] Update README.md --- examples/text_embedding/README.md | 7 ++++--- 1 file changed, 4 insertions(+), 3 deletions(-) diff --git a/examples/text_embedding/README.md b/examples/text_embedding/README.md index 9ddc9d7b5..01db59535 100644 --- a/examples/text_embedding/README.md +++ b/examples/text_embedding/README.md @@ -7,9 +7,10 @@ In this example, we will build a text embedding index and a semantic search flow Screenshot 2025-05-19 at 5 48 28 PM -We will ingest from a list of local files. For each file, perform chunking (Recursive Split) and then embeddings. -We will save the embeddings and the metadata in Postgres with PGVector. -And then add a simpler query handler for semantic search. +- We will ingest from a list of local files. +- For each file, perform chunking (Recursive Split) and then embeddings. +- We will save the embeddings and the metadata in Postgres with PGVector. +- And then add a simpler query handler for semantic search. ## Prerequisite From bab25a4db17b2d79bbe0d39b422eeab312a79d71 Mon Sep 17 00:00:00 2001 From: LJ Date: Mon, 19 May 2025 18:11:22 -0700 Subject: [PATCH 11/47] Update README.md --- examples/text_embedding/README.md | 19 ++++++++++++------- 1 file changed, 12 insertions(+), 7 deletions(-) diff --git a/examples/text_embedding/README.md b/examples/text_embedding/README.md index 01db59535..618f854c9 100644 --- a/examples/text_embedding/README.md +++ b/examples/text_embedding/README.md @@ -1,16 +1,21 @@ -# Build text embedding and semantic search +# Build text embedding and semantic search 🔍 [![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/cocoindex-io/cocoindex/blob/main/examples/text_embedding/Text_Embedding.ipynb) +[![GitHub](https://img.shields.io/github/stars/cocoindex-io/cocoindex?color=5B5BD6)](https://github.com/cocoindex-io/cocoindex) +In this example, we will build index flow from text embedding from local markdown files. And build semantic search with simple query handler. -In this example, we will build a text embedding index and a semantic search flow based on local markdown files. - +We appreicate a star ⭐ at [CocoIndex Github](https://github.com/cocoindex-io/cocoindex) if this is helpful. Screenshot 2025-05-19 at 5 48 28 PM -- We will ingest from a list of local files. -- For each file, perform chunking (Recursive Split) and then embeddings. -- We will save the embeddings and the metadata in Postgres with PGVector. -- And then add a simpler query handler for semantic search. +Steps: +1. We will ingest from a list of local files. +2. For each file, perform chunking (Recursive Split) and then embeddings. +3. We will save the embeddings and the metadata in Postgres with PGVector. +4. And then add a simpler query handler for semantic search. 
+ +🌱 A detailed step by step tutorial can be found here: [Get Started Documentation](https://cocoindex.io/docs/getting_started/quickstart) + ## Prerequisite From 4f0b607ffb95a8eb805757f8c2fcfe7868927228 Mon Sep 17 00:00:00 2001 From: Linghua Jin Date: Mon, 19 May 2025 18:20:44 -0700 Subject: [PATCH 12/47] Update README.md --- examples/text_embedding/README.md | 13 ++++++++----- 1 file changed, 8 insertions(+), 5 deletions(-) diff --git a/examples/text_embedding/README.md b/examples/text_embedding/README.md index 618f854c9..aba3e16d3 100644 --- a/examples/text_embedding/README.md +++ b/examples/text_embedding/README.md @@ -2,17 +2,20 @@ [![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/cocoindex-io/cocoindex/blob/main/examples/text_embedding/Text_Embedding.ipynb) [![GitHub](https://img.shields.io/github/stars/cocoindex-io/cocoindex?color=5B5BD6)](https://github.com/cocoindex-io/cocoindex) -In this example, we will build index flow from text embedding from local markdown files. And build semantic search with simple query handler. +In this example, we will build index flow from text embedding from local markdown files. And provide an simple example to query the index. We appreicate a star ⭐ at [CocoIndex Github](https://github.com/cocoindex-io/cocoindex) if this is helpful. Screenshot 2025-05-19 at 5 48 28 PM Steps: -1. We will ingest from a list of local files. -2. For each file, perform chunking (Recursive Split) and then embeddings. -3. We will save the embeddings and the metadata in Postgres with PGVector. -4. And then add a simpler query handler for semantic search. +- Indexing Flow: + 1. We will ingest from a list of local files. + 2. For each file, perform chunking (Recursive Split) and then embeddings. + 3. We will save the embeddings and the metadata in Postgres with PGVector. + +- Query: +1. We will match against user-provided text by a SQL query, reusing the embedding operation in the indexing flow. 🌱 A detailed step by step tutorial can be found here: [Get Started Documentation](https://cocoindex.io/docs/getting_started/quickstart) From 9ffe7ed4a73363abfdebaa5bd349249c94c83c1c Mon Sep 17 00:00:00 2001 From: Linghua Jin Date: Mon, 19 May 2025 18:21:39 -0700 Subject: [PATCH 13/47] Update README.md --- examples/text_embedding/README.md | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/examples/text_embedding/README.md b/examples/text_embedding/README.md index aba3e16d3..809b2de7e 100644 --- a/examples/text_embedding/README.md +++ b/examples/text_embedding/README.md @@ -2,7 +2,7 @@ [![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/cocoindex-io/cocoindex/blob/main/examples/text_embedding/Text_Embedding.ipynb) [![GitHub](https://img.shields.io/github/stars/cocoindex-io/cocoindex?color=5B5BD6)](https://github.com/cocoindex-io/cocoindex) -In this example, we will build index flow from text embedding from local markdown files. And provide an simple example to query the index. +In this example, we will build index flow from text embedding from local markdown files, and query the index. We appreicate a star ⭐ at [CocoIndex Github](https://github.com/cocoindex-io/cocoindex) if this is helpful. 
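The "Query" step described in this README is a pgvector similarity query that reuses the flow's embedding transform rather than a separate query handler. Roughly, condensed from the `search` function in this example's `main.py` (visible elsewhere in this patch series); `text_embedding_flow` and `text_to_embedding` are the flow and transform defined in that file:

```python
from psycopg_pool import ConnectionPool
import cocoindex

def search(pool: ConnectionPool, query: str, top_k: int = 5):
    # Table created by the flow's Postgres export target ("doc_embeddings").
    table_name = cocoindex.utils.get_target_storage_default_name(
        text_embedding_flow, "doc_embeddings")
    # Reuse the same embedding transform used at indexing time.
    query_vector = text_to_embedding.eval(query)
    with pool.connection() as conn:
        with conn.cursor() as cur:
            cur.execute(
                f"""SELECT filename, text, embedding <=> %s::vector AS distance
                    FROM {table_name} ORDER BY distance LIMIT %s""",
                (query_vector, top_k))
            return [
                {"filename": row[0], "text": row[1], "score": 1.0 - row[2]}
                for row in cur.fetchall()
            ]
```

Cosine distance (`<=>`) is what the export target's `COSINE_SIMILARITY` vector index is built for, so `1.0 - distance` can be reported as a similarity score.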
From 631cad75c4625b8fc6f68736d6dbb0235575e18f Mon Sep 17 00:00:00 2001 From: Linghua Jin Date: Mon, 19 May 2025 18:29:22 -0700 Subject: [PATCH 14/47] Update README.md --- examples/text_embedding/README.md | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/examples/text_embedding/README.md b/examples/text_embedding/README.md index 809b2de7e..32c91d1ba 100644 --- a/examples/text_embedding/README.md +++ b/examples/text_embedding/README.md @@ -15,7 +15,7 @@ Steps: 3. We will save the embeddings and the metadata in Postgres with PGVector. - Query: -1. We will match against user-provided text by a SQL query, reusing the embedding operation in the indexing flow. +We will match against user-provided text by a SQL query, reusing the embedding operation in the indexing flow. 🌱 A detailed step by step tutorial can be found here: [Get Started Documentation](https://cocoindex.io/docs/getting_started/quickstart) From 8008d85d91b6c02c5691fe6ecc01dfebe3f1d3d3 Mon Sep 17 00:00:00 2001 From: LJ Date: Mon, 19 May 2025 18:23:39 -0700 Subject: [PATCH 15/47] Update README.md --- examples/text_embedding/README.md | 20 ++++++++++---------- 1 file changed, 10 insertions(+), 10 deletions(-) diff --git a/examples/text_embedding/README.md b/examples/text_embedding/README.md index 32c91d1ba..f5b39f979 100644 --- a/examples/text_embedding/README.md +++ b/examples/text_embedding/README.md @@ -6,18 +6,18 @@ In this example, we will build index flow from text embedding from local markdow We appreicate a star ⭐ at [CocoIndex Github](https://github.com/cocoindex-io/cocoindex) if this is helpful. -Screenshot 2025-05-19 at 5 48 28 PM - -Steps: -- Indexing Flow: - 1. We will ingest from a list of local files. - 2. For each file, perform chunking (Recursive Split) and then embeddings. - 3. We will save the embeddings and the metadata in Postgres with PGVector. +## Steps: +🌱 A detailed step by step tutorial can be found here: [Get Started Documentation](https://cocoindex.io/docs/getting_started/quickstart) -- Query: -We will match against user-provided text by a SQL query, reusing the embedding operation in the indexing flow. +### Indexing Flow: +Screenshot 2025-05-19 at 5 48 28 PM -🌱 A detailed step by step tutorial can be found here: [Get Started Documentation](https://cocoindex.io/docs/getting_started/quickstart) +1. We will ingest from a list of local files. +2. For each file, perform chunking (Recursive Split) and then embeddings. +3. We will save the embeddings and the metadata in Postgres with PGVector. + +### Query: +1. We will match against user-provided text by a SQL query, reusing the embedding operation in the indexing flow. ## Prerequisite From 2aa1bc8c3048d4ae1e9367aeb67cea5ba3e80c41 Mon Sep 17 00:00:00 2001 From: Linghua Jin Date: Mon, 19 May 2025 19:13:27 -0700 Subject: [PATCH 16/47] qdrant --- examples/text_embedding_qdrant/README.md | 16 ++++--- examples/text_embedding_qdrant/main.py | 43 +++++++++++-------- examples/text_embedding_qdrant/pyproject.toml | 2 +- 3 files changed, 36 insertions(+), 25 deletions(-) diff --git a/examples/text_embedding_qdrant/README.md b/examples/text_embedding_qdrant/README.md index 5e2ea059c..55091fafa 100644 --- a/examples/text_embedding_qdrant/README.md +++ b/examples/text_embedding_qdrant/README.md @@ -1,6 +1,10 @@ ## Description +# Build text embedding and semantic search 🔍 with Qdrant -Example to build a vector index in Qdrant based on local files. 
+[![GitHub](https://img.shields.io/github/stars/cocoindex-io/cocoindex?color=5B5BD6)](https://github.com/cocoindex-io/cocoindex) + +In this example, we will build index flow from text embedding from local markdown files, and query the index. +We will use **Qdrant** as the vector database. ## Pre-requisites @@ -57,13 +61,13 @@ python main.py ``` ## CocoInsight - -CocoInsight is in Early Access now (Free) 😊 You found us! A quick 3 minute video tutorial about CocoInsight: [Watch on YouTube](https://youtu.be/ZnmyoHslBSc?si=pPLXWALztkA710r9). - -Run CocoInsight to understand your RAG data pipeline: +I used CocoInsight (Free beta now) to troubleshoot the index generation and understand the data lineage of the pipeline. +It just connects to your local CocoIndex server, with Zero pipeline data retention. Run following command to start CocoInsight: ```bash python main.py cocoindex server -ci ``` -Then open the CocoInsight UI at [https://cocoindex.io/cocoinsight](https://cocoindex.io/cocoinsight). +Open the CocoInsight UI at [https://cocoindex.io/cocoinsight](https://cocoindex.io/cocoinsight). + + diff --git a/examples/text_embedding_qdrant/main.py b/examples/text_embedding_qdrant/main.py index 57f27a45e..fd15ba73c 100644 --- a/examples/text_embedding_qdrant/main.py +++ b/examples/text_embedding_qdrant/main.py @@ -1,21 +1,22 @@ from dotenv import load_dotenv +from qdrant_client import QdrantClient +from qdrant_client.http.models import Filter, FieldCondition, MatchValue import cocoindex -def text_to_embedding(text: cocoindex.DataSlice) -> cocoindex.DataSlice: +@cocoindex.transform_flow() +def text_to_embedding(text: cocoindex.DataSlice[str]) -> cocoindex.DataSlice[list[float]]: """ Embed the text using a SentenceTransformer model. This is a shared logic between indexing and querying, so extract it as a function. """ return text.transform( cocoindex.functions.SentenceTransformerEmbed( - model="sentence-transformers/all-MiniLM-L6-v2" - ) - ) + model="sentence-transformers/all-MiniLM-L6-v2")) -@cocoindex.flow_def(name="TextEmbedding") +@cocoindex.flow_def(name="TextEmbeddingWithQdrant") def text_embedding_flow( flow_builder: cocoindex.FlowBuilder, data_scope: cocoindex.DataScope ): @@ -57,28 +58,34 @@ def text_embedding_flow( ) -query_handler = cocoindex.query.SimpleSemanticsQueryHandler( - name="SemanticsSearch", - flow=text_embedding_flow, - target_name="doc_embeddings", - query_transform_flow=text_to_embedding, - default_similarity_metric=cocoindex.VectorSimilarityMetric.COSINE_SIMILARITY, -) - - @cocoindex.main_fn() def _run(): + # Initialize Qdrant client + client = QdrantClient(host="localhost", port=6333) + # Run queries in a loop to demonstrate the query capabilities. 
while True: try: query = input("Enter search query (or Enter to quit): ") if query == "": break - results, _ = query_handler.search(query, 10, "text_embedding") + + # Get the embedding for the query + query_embedding = text_to_embedding.eval(query) + + # Search in Qdrant + search_results = client.search( + collection_name="cocoindex", + query_vector=("text_embedding", query_embedding), + limit=10 + ) + print("\nSearch results:") - for result in results: - print(f"[{result.score:.3f}] {result.data['filename']}") - print(f" {result.data['text']}") + for result in search_results: + score = result.score + payload = result.payload + print(f"[{score:.3f}] {payload['filename']}") + print(f" {payload['text']}") print("---") print() except KeyboardInterrupt: diff --git a/examples/text_embedding_qdrant/pyproject.toml b/examples/text_embedding_qdrant/pyproject.toml index 25b2663cc..704542007 100644 --- a/examples/text_embedding_qdrant/pyproject.toml +++ b/examples/text_embedding_qdrant/pyproject.toml @@ -3,7 +3,7 @@ name = "text-embedding-qdrant" version = "0.1.0" description = "Simple example for cocoindex: build embedding index based on local text files." requires-python = ">=3.10" -dependencies = ["cocoindex>=0.1.39", "python-dotenv>=1.0.1"] +dependencies = ["cocoindex>=0.1.39", "python-dotenv>=1.0.1", "qdrant-client>=1.6.0"] [tool.setuptools] packages = [] From c76db2dd9f4ed5d39f35d52801a3497abecb2c6c Mon Sep 17 00:00:00 2001 From: Linghua Jin Date: Mon, 19 May 2025 19:15:44 -0700 Subject: [PATCH 17/47] Update README.md --- examples/text_embedding/README.md | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/examples/text_embedding/README.md b/examples/text_embedding/README.md index f5b39f979..1e9998827 100644 --- a/examples/text_embedding/README.md +++ b/examples/text_embedding/README.md @@ -17,7 +17,7 @@ We appreicate a star ⭐ at [CocoIndex Github](https://github.com/cocoindex-io/c 3. We will save the embeddings and the metadata in Postgres with PGVector. ### Query: -1. We will match against user-provided text by a SQL query, reusing the embedding operation in the indexing flow. +We will match against user-provided text by a SQL query, reusing the embedding operation in the indexing flow. ## Prerequisite From f5e965df731d3ee0f667d388d56a0cfaceeda167 Mon Sep 17 00:00:00 2001 From: LJ Date: Mon, 19 May 2025 19:22:36 -0700 Subject: [PATCH 18/47] Update README.md --- examples/text_embedding_qdrant/README.md | 22 ++++++++++++++++++---- 1 file changed, 18 insertions(+), 4 deletions(-) diff --git a/examples/text_embedding_qdrant/README.md b/examples/text_embedding_qdrant/README.md index 55091fafa..17fb118b0 100644 --- a/examples/text_embedding_qdrant/README.md +++ b/examples/text_embedding_qdrant/README.md @@ -1,14 +1,28 @@ -## Description # Build text embedding and semantic search 🔍 with Qdrant [![GitHub](https://img.shields.io/github/stars/cocoindex-io/cocoindex?color=5B5BD6)](https://github.com/cocoindex-io/cocoindex) -In this example, we will build index flow from text embedding from local markdown files, and query the index. -We will use **Qdrant** as the vector database. +CocoIndex supports Qdrant natively - [documentation](https://cocoindex.io/docs/ops/storages#qdrant). In this example, we will build index flow from text embedding from local markdown files, and query the index. We will use **Qdrant** as the vector database. + +We appreicate a star ⭐ at [CocoIndex Github](https://github.com/cocoindex-io/cocoindex) if this is helpful. 
+ +coco feat qdrant + +## Steps: +### Indexing Flow: +Screenshot 2025-05-19 at 7 19 50 PM + +1. We will ingest from a list of local files. +2. For each file, perform chunking (Recursive Split) and then embeddings. +3. We will save the embeddings and the metadata in Postgres with PGVector. + +### Query: +We will be use Qdrant client to query the index, reusing the embedding operation in the indexing flow. + ## Pre-requisites -- [Install Postgres](https://cocoindex.io/docs/getting_started/installation#-install-postgres) if you don't have one. +- [Install Postgres](https://cocoindex.io/docs/getting_started/installation#-install-postgres) if you don't have one. Even the target store is Qdrant, CocoIndex uses Postgress to track the data lineage for incremental processing. - Run Qdrant. From eed7a258a510ba7b5be28ca46eed51b92d96214b Mon Sep 17 00:00:00 2001 From: LJ Date: Mon, 19 May 2025 19:25:08 -0700 Subject: [PATCH 19/47] Update README.md --- examples/text_embedding_qdrant/README.md | 70 ++++++++++++------------ 1 file changed, 35 insertions(+), 35 deletions(-) diff --git a/examples/text_embedding_qdrant/README.md b/examples/text_embedding_qdrant/README.md index 17fb118b0..bb497f6a8 100644 --- a/examples/text_embedding_qdrant/README.md +++ b/examples/text_embedding_qdrant/README.md @@ -6,7 +6,7 @@ CocoIndex supports Qdrant natively - [documentation](https://cocoindex.io/docs/o We appreicate a star ⭐ at [CocoIndex Github](https://github.com/cocoindex-io/cocoindex) if this is helpful. -coco feat qdrant +Screenshot 2025-05-19 at 7 24 13 PM ## Steps: ### Indexing Flow: @@ -26,53 +26,53 @@ We will be use Qdrant client to query the index, reusing the embedding operation - Run Qdrant. -```bash -docker run -d -p 6334:6334 -p 6333:6333 qdrant/qdrant -``` + ```bash + docker run -d -p 6334:6334 -p 6333:6333 qdrant/qdrant + ``` - [Create a collection](https://qdrant.tech/documentation/concepts/vectors/#named-vectors) to export the embeddings to. -```bash -curl -X PUT \ - 'http://localhost:6333/collections/cocoindex' \ - --header 'Content-Type: application/json' \ - --data-raw '{ - "vectors": { - "text_embedding": { - "size": 384, - "distance": "Cosine" - } - } -}' -``` - -You can view the collections and data with the Qdrant dashboard at . + ```bash + curl -X PUT \ + 'http://localhost:6333/collections/cocoindex' \ + --header 'Content-Type: application/json' \ + --data-raw '{ + "vectors": { + "text_embedding": { + "size": 384, + "distance": "Cosine" + } + } + }' + ``` + + You can view the collections and data with the Qdrant dashboard at . ## Run -Install dependencies: +- Install dependencies: -```bash -pip install -e . -``` + ```bash + pip install -e . + ``` -Setup: +- Setup: -```bash -python main.py cocoindex setup -``` + ```bash + python main.py cocoindex setup + ``` -Update index: +- Update index: -```bash -python main.py cocoindex update -``` + ```bash + python main.py cocoindex update + ``` -Run: +- Run: -```bash -python main.py -``` + ```bash + python main.py + ``` ## CocoInsight I used CocoInsight (Free beta now) to troubleshoot the index generation and understand the data lineage of the pipeline. 
From 12fc9be73573cc85d3298d078e6e13e2c8ceef04 Mon Sep 17 00:00:00 2001 From: Linghua Jin Date: Mon, 19 May 2025 20:57:29 -0700 Subject: [PATCH 20/47] Update README.md --- examples/text_embedding_qdrant/README.md | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/examples/text_embedding_qdrant/README.md b/examples/text_embedding_qdrant/README.md index bb497f6a8..55a7c20fc 100644 --- a/examples/text_embedding_qdrant/README.md +++ b/examples/text_embedding_qdrant/README.md @@ -4,7 +4,7 @@ CocoIndex supports Qdrant natively - [documentation](https://cocoindex.io/docs/ops/storages#qdrant). In this example, we will build index flow from text embedding from local markdown files, and query the index. We will use **Qdrant** as the vector database. -We appreicate a star ⭐ at [CocoIndex Github](https://github.com/cocoindex-io/cocoindex) if this is helpful. +We appreciate a star ⭐ at [CocoIndex Github](https://github.com/cocoindex-io/cocoindex) if this is helpful. Screenshot 2025-05-19 at 7 24 13 PM From 407eda24d51664cf871170ef605934bb92a7bab7 Mon Sep 17 00:00:00 2001 From: Linghua Jin Date: Mon, 19 May 2025 21:00:34 -0700 Subject: [PATCH 21/47] Update README.md --- examples/text_embedding/README.md | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/examples/text_embedding/README.md b/examples/text_embedding/README.md index 1e9998827..63291a15c 100644 --- a/examples/text_embedding/README.md +++ b/examples/text_embedding/README.md @@ -4,7 +4,7 @@ In this example, we will build index flow from text embedding from local markdown files, and query the index. -We appreicate a star ⭐ at [CocoIndex Github](https://github.com/cocoindex-io/cocoindex) if this is helpful. +We appreciate a star ⭐ at [CocoIndex Github](https://github.com/cocoindex-io/cocoindex) if this is helpful. ## Steps: 🌱 A detailed step by step tutorial can be found here: [Get Started Documentation](https://cocoindex.io/docs/getting_started/quickstart) From 44d93dde6aa9e5ac8ba6bf2cf017a1f28c74e8bc Mon Sep 17 00:00:00 2001 From: LJ Date: Mon, 19 May 2025 21:06:21 -0700 Subject: [PATCH 22/47] Update README.md --- examples/text_embedding_qdrant/README.md | 18 +++++++++--------- 1 file changed, 9 insertions(+), 9 deletions(-) diff --git a/examples/text_embedding_qdrant/README.md b/examples/text_embedding_qdrant/README.md index 55a7c20fc..3f91dc950 100644 --- a/examples/text_embedding_qdrant/README.md +++ b/examples/text_embedding_qdrant/README.md @@ -6,23 +6,23 @@ CocoIndex supports Qdrant natively - [documentation](https://cocoindex.io/docs/o We appreciate a star ⭐ at [CocoIndex Github](https://github.com/cocoindex-io/cocoindex) if this is helpful. -Screenshot 2025-05-19 at 7 24 13 PM +CocoIndex supports Qdrant -## Steps: -### Indexing Flow: -Screenshot 2025-05-19 at 7 19 50 PM +## Steps +### Indexing Flow +Index flow for text embedding -1. We will ingest from a list of local files. -2. For each file, perform chunking (Recursive Split) and then embeddings. +1. We will ingest a list of local files. +2. For each file, perform chunking (recursively split) and then embedding. 3. We will save the embeddings and the metadata in Postgres with PGVector. -### Query: -We will be use Qdrant client to query the index, reusing the embedding operation in the indexing flow. +### Query +We use Qdrant client to query the index, and reuse the embedding operation in the indexing flow. ## Pre-requisites -- [Install Postgres](https://cocoindex.io/docs/getting_started/installation#-install-postgres) if you don't have one. 
Even the target store is Qdrant, CocoIndex uses Postgress to track the data lineage for incremental processing. +- [Install Postgres](https://cocoindex.io/docs/getting_started/installation#-install-postgres) if you don't have one. Although the target store is Qdrant, CocoIndex uses Postgress to track the data lineage for incremental processing. - Run Qdrant. From 689d359d495eb72bad88ae686e750699e8936824 Mon Sep 17 00:00:00 2001 From: Linghua Jin Date: Mon, 19 May 2025 21:07:00 -0700 Subject: [PATCH 23/47] Update README.md --- examples/text_embedding/README.md | 12 ++++++------ 1 file changed, 6 insertions(+), 6 deletions(-) diff --git a/examples/text_embedding/README.md b/examples/text_embedding/README.md index 63291a15c..2dd1dbb81 100644 --- a/examples/text_embedding/README.md +++ b/examples/text_embedding/README.md @@ -6,18 +6,18 @@ In this example, we will build index flow from text embedding from local markdow We appreciate a star ⭐ at [CocoIndex Github](https://github.com/cocoindex-io/cocoindex) if this is helpful. -## Steps: +## Steps 🌱 A detailed step by step tutorial can be found here: [Get Started Documentation](https://cocoindex.io/docs/getting_started/quickstart) -### Indexing Flow: +### Indexing Flow Screenshot 2025-05-19 at 5 48 28 PM -1. We will ingest from a list of local files. -2. For each file, perform chunking (Recursive Split) and then embeddings. +1. We will ingest a list of local files. +2. For each file, perform chunking (recursively split) and then embedding. 3. We will save the embeddings and the metadata in Postgres with PGVector. -### Query: -We will match against user-provided text by a SQL query, reusing the embedding operation in the indexing flow. +### Query +We will match against user-provided text by a SQL query, and reuse the embedding operation in the indexing flow. ## Prerequisite From eb56fe2e12880c3e48e250c175508c219abfb138 Mon Sep 17 00:00:00 2001 From: Linghua Jin Date: Mon, 19 May 2025 21:21:25 -0700 Subject: [PATCH 24/47] Update main.py --- examples/text_embedding_qdrant/main.py | 10 +++++++--- 1 file changed, 7 insertions(+), 3 deletions(-) diff --git a/examples/text_embedding_qdrant/main.py b/examples/text_embedding_qdrant/main.py index fd15ba73c..7bc81edbe 100644 --- a/examples/text_embedding_qdrant/main.py +++ b/examples/text_embedding_qdrant/main.py @@ -4,6 +4,10 @@ import cocoindex +# Define Qdrant connection constants +QDRANT_URL = "http://localhost:6333" +QDRANT_COLLECTION = "cocoindex" + @cocoindex.transform_flow() def text_to_embedding(text: cocoindex.DataSlice[str]) -> cocoindex.DataSlice[list[float]]: @@ -51,7 +55,7 @@ def text_embedding_flow( doc_embeddings.export( "doc_embeddings", cocoindex.storages.Qdrant( - collection_name="cocoindex", grpc_url="http://localhost:6334/" + collection_name=QDRANT_COLLECTION, grpc_url=QDRANT_URL ), primary_key_fields=["id"], setup_by_user=True, @@ -61,7 +65,7 @@ def text_embedding_flow( @cocoindex.main_fn() def _run(): # Initialize Qdrant client - client = QdrantClient(host="localhost", port=6333) + client = QdrantClient(url=QDRANT_URL) # Run queries in a loop to demonstrate the query capabilities. 
while True: @@ -75,7 +79,7 @@ def _run(): # Search in Qdrant search_results = client.search( - collection_name="cocoindex", + collection_name=QDRANT_COLLECTION, query_vector=("text_embedding", query_embedding), limit=10 ) From e69c212426719be5cc73d6f6f4c5fe15d948b25e Mon Sep 17 00:00:00 2001 From: Linghua Jin Date: Mon, 19 May 2025 21:44:49 -0700 Subject: [PATCH 25/47] Update main.py --- examples/text_embedding_qdrant/main.py | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/examples/text_embedding_qdrant/main.py b/examples/text_embedding_qdrant/main.py index 7bc81edbe..20341e25c 100644 --- a/examples/text_embedding_qdrant/main.py +++ b/examples/text_embedding_qdrant/main.py @@ -5,7 +5,7 @@ import cocoindex # Define Qdrant connection constants -QDRANT_URL = "http://localhost:6333" +QDRANT_URL = "http://localhost:6334" QDRANT_COLLECTION = "cocoindex" From 9272a8a1438dd4edc2700614f348733eb4cb5b96 Mon Sep 17 00:00:00 2001 From: Linghua Jin Date: Mon, 19 May 2025 22:59:28 -0700 Subject: [PATCH 26/47] Update main.py --- examples/text_embedding_qdrant/main.py | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/examples/text_embedding_qdrant/main.py b/examples/text_embedding_qdrant/main.py index 20341e25c..7bc81edbe 100644 --- a/examples/text_embedding_qdrant/main.py +++ b/examples/text_embedding_qdrant/main.py @@ -5,7 +5,7 @@ import cocoindex # Define Qdrant connection constants -QDRANT_URL = "http://localhost:6334" +QDRANT_URL = "http://localhost:6333" QDRANT_COLLECTION = "cocoindex" From 79679140ad8a8929111bdbb732b6602aa65362e4 Mon Sep 17 00:00:00 2001 From: Linghua Jin Date: Mon, 19 May 2025 23:29:59 -0700 Subject: [PATCH 27/47] Update main.py --- examples/text_embedding_qdrant/main.py | 8 +++----- 1 file changed, 3 insertions(+), 5 deletions(-) diff --git a/examples/text_embedding_qdrant/main.py b/examples/text_embedding_qdrant/main.py index 7bc81edbe..b2892c433 100644 --- a/examples/text_embedding_qdrant/main.py +++ b/examples/text_embedding_qdrant/main.py @@ -5,7 +5,7 @@ import cocoindex # Define Qdrant connection constants -QDRANT_URL = "http://localhost:6333" +QDRANT_GRPC_URL = "http://localhost:6334" QDRANT_COLLECTION = "cocoindex" @@ -55,7 +55,7 @@ def text_embedding_flow( doc_embeddings.export( "doc_embeddings", cocoindex.storages.Qdrant( - collection_name=QDRANT_COLLECTION, grpc_url=QDRANT_URL + collection_name=QDRANT_COLLECTION, grpc_url=QDRANT_GRPC_URL ), primary_key_fields=["id"], setup_by_user=True, @@ -65,7 +65,7 @@ def text_embedding_flow( @cocoindex.main_fn() def _run(): # Initialize Qdrant client - client = QdrantClient(url=QDRANT_URL) + client = QdrantClient(url=QDRANT_GRPC_URL, prefer_grpc=True) # Run queries in a loop to demonstrate the query capabilities. 
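    # The Qdrant collection stores the embedding under the named vector
    # "text_embedding", so the search call below passes a (vector_name, vector) tuple.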
while True: @@ -77,13 +77,11 @@ def _run(): # Get the embedding for the query query_embedding = text_to_embedding.eval(query) - # Search in Qdrant search_results = client.search( collection_name=QDRANT_COLLECTION, query_vector=("text_embedding", query_embedding), limit=10 ) - print("\nSearch results:") for result in search_results: score = result.score From 3577694bf5efadd8cb5bcfad244da3a5bf45a5e6 Mon Sep 17 00:00:00 2001 From: Linghua Jin Date: Tue, 20 May 2025 14:25:53 -0700 Subject: [PATCH 28/47] upgrade query handling for pdf embedding --- examples/pdf_embedding/README.md | 3 +- examples/pdf_embedding/main.py | 70 +++++++++++++++++++++++--------- 2 files changed, 52 insertions(+), 21 deletions(-) diff --git a/examples/pdf_embedding/README.md b/examples/pdf_embedding/README.md index 3dde765d5..dbee1145c 100644 --- a/examples/pdf_embedding/README.md +++ b/examples/pdf_embedding/README.md @@ -1,4 +1,5 @@ -Simple example for cocoindex: build embedding index based on local files. +# Build embedding index from PDF files and query with natural language +[![GitHub](https://img.shields.io/github/stars/cocoindex-io/cocoindex?color=5B5BD6)](https://github.com/cocoindex-io/cocoindex) ## Prerequisite [Install Postgres](https://cocoindex.io/docs/getting_started/installation#-install-postgres) if you don't have one. diff --git a/examples/pdf_embedding/main.py b/examples/pdf_embedding/main.py index 00b1ae519..7928a16f3 100644 --- a/examples/pdf_embedding/main.py +++ b/examples/pdf_embedding/main.py @@ -1,16 +1,21 @@ +import cocoindex +import os import tempfile +from typing import List, Dict, Any from dotenv import load_dotenv +from marker.config.parser import ConfigParser from marker.converters.pdf import PdfConverter from marker.models import create_model_dict from marker.output import text_from_rendered -from marker.config.parser import ConfigParser +from psycopg_pool import ConnectionPool +from jinja2 import Template -import cocoindex class PdfToMarkdown(cocoindex.op.FunctionSpec): """Convert a PDF to markdown.""" + @cocoindex.op.executor_class(gpu=True, cache=True, behavior_version=1) class PdfToMarkdownExecutor: """Executor for PdfToMarkdown.""" @@ -30,14 +35,17 @@ def __call__(self, content: bytes) -> str: return text -def text_to_embedding(text: cocoindex.DataSlice) -> cocoindex.DataSlice: +@cocoindex.transform_flow() +def text_to_embedding(text: cocoindex.DataSlice[str]) -> cocoindex.DataSlice[list[float]]: """ Embed the text using a SentenceTransformer model. + This is a shared logic between indexing and querying, so extract it as a function. 
""" return text.transform( cocoindex.functions.SentenceTransformerEmbed( model="sentence-transformers/all-MiniLM-L6-v2")) + @cocoindex.flow_def(name="PdfEmbedding") def pdf_embedding_flow(flow_builder: cocoindex.FlowBuilder, data_scope: cocoindex.DataScope): """ @@ -45,7 +53,7 @@ def pdf_embedding_flow(flow_builder: cocoindex.FlowBuilder, data_scope: cocoinde """ data_scope["documents"] = flow_builder.add_source(cocoindex.sources.LocalFile(path="pdf_files", binary=True)) - doc_embeddings = data_scope.add_collector() + pdf_embeddings = data_scope.add_collector() with data_scope["documents"].row() as doc: doc["markdown"] = doc["content"].transform(PdfToMarkdown()) @@ -55,12 +63,12 @@ def pdf_embedding_flow(flow_builder: cocoindex.FlowBuilder, data_scope: cocoinde with doc["chunks"].row() as chunk: chunk["embedding"] = chunk["text"].call(text_to_embedding) - doc_embeddings.collect(id=cocoindex.GeneratedField.UUID, + pdf_embeddings.collect(id=cocoindex.GeneratedField.UUID, filename=doc["filename"], location=chunk["location"], text=chunk["text"], embedding=chunk["embedding"]) - doc_embeddings.export( - "doc_embeddings", + pdf_embeddings.export( + "pdf_embeddings", cocoindex.storages.Postgres(), primary_key_fields=["id"], vector_indexes=[ @@ -68,31 +76,53 @@ def pdf_embedding_flow(flow_builder: cocoindex.FlowBuilder, data_scope: cocoinde field_name="embedding", metric=cocoindex.VectorSimilarityMetric.COSINE_SIMILARITY)]) -query_handler = cocoindex.query.SimpleSemanticsQueryHandler( - name="SemanticsSearch", - flow=pdf_embedding_flow, - target_name="doc_embeddings", - query_transform_flow=text_to_embedding, - default_similarity_metric=cocoindex.VectorSimilarityMetric.COSINE_SIMILARITY) + +def search(pool: ConnectionPool, query: str, top_k: int = 5): + # Get the table name, for the export target in the pdf_embedding_flow above. + table_name = cocoindex.utils.get_target_storage_default_name(pdf_embedding_flow, "pdf_embeddings") + # Evaluate the transform flow defined above with the input query, to get the embedding. + query_vector = text_to_embedding.eval(query) + # Run the query and get the results. + with pool.connection() as conn: + with conn.cursor() as cur: + cur.execute(f""" + SELECT filename, text, embedding <=> %s::vector AS distance + FROM {table_name} ORDER BY distance LIMIT %s + """, (query_vector, top_k)) + return [ + {"filename": row[0], "text": row[1], "score": 1.0 - row[2]} + for row in cur.fetchall() + ] + + +# Define the search results template using Jinja2 +SEARCH_RESULTS_TEMPLATE = Template(""" +Search results: +{% for result in results %} +[{{ "%.3f"|format(result.score) }}] {{ result.filename }} + {{ result.text }} +--- +{% endfor %} +""") + @cocoindex.main_fn() def _run(): + # Initialize the database connection pool. + pool = ConnectionPool(os.getenv("COCOINDEX_DATABASE_URL")) # Run queries in a loop to demonstrate the query capabilities. while True: try: query = input("Enter search query (or Enter to quit): ") if query == '': break - results, _ = query_handler.search(query, 10) - print("\nSearch results:") - for result in results: - print(f"[{result.score:.3f}] {result.data['filename']}") - print(f" {result.data['text']}") - print("---") - print() + # Run the query function with the database connection pool and the query. 
+ results = search(pool, query) + print(SEARCH_RESULTS_TEMPLATE.render(results=results)) except KeyboardInterrupt: break + if __name__ == "__main__": load_dotenv(override=True) _run() From 9741fa8a6b055ab4da6ca09be8cf05b3fdcf311d Mon Sep 17 00:00:00 2001 From: Linghua Jin Date: Tue, 20 May 2025 14:30:07 -0700 Subject: [PATCH 29/47] Update README.md --- examples/pdf_embedding/README.md | 10 ++++++++++ 1 file changed, 10 insertions(+) diff --git a/examples/pdf_embedding/README.md b/examples/pdf_embedding/README.md index dbee1145c..bea0684ee 100644 --- a/examples/pdf_embedding/README.md +++ b/examples/pdf_embedding/README.md @@ -1,6 +1,16 @@ # Build embedding index from PDF files and query with natural language [![GitHub](https://img.shields.io/github/stars/cocoindex-io/cocoindex?color=5B5BD6)](https://github.com/cocoindex-io/cocoindex) + +In this example, we will build index flow from text embedding from local PDF files, and query the index. + +We appreciate a star ⭐ at [CocoIndex Github](https://github.com/cocoindex-io/cocoindex) if this is helpful. + +## Steps +### Indexing Flow + + + ## Prerequisite [Install Postgres](https://cocoindex.io/docs/getting_started/installation#-install-postgres) if you don't have one. From 818b8060903d25ecdba83e27d05f08b95a2ce4c9 Mon Sep 17 00:00:00 2001 From: LJ Date: Tue, 20 May 2025 14:33:34 -0700 Subject: [PATCH 30/47] Update README.md --- examples/pdf_embedding/README.md | 50 ++++++++++++++++++-------------- 1 file changed, 29 insertions(+), 21 deletions(-) diff --git a/examples/pdf_embedding/README.md b/examples/pdf_embedding/README.md index bea0684ee..32b6e1786 100644 --- a/examples/pdf_embedding/README.md +++ b/examples/pdf_embedding/README.md @@ -9,6 +9,16 @@ We appreciate a star ⭐ at [CocoIndex Github](https://github.com/cocoindex-io/c ## Steps ### Indexing Flow +PDF indexing flow +1. We will ingest a list of PDF files +2. For each file + - convert it to markdown + - perform chunking (recursively split) and then embed each chunk. +3. We will save the embeddings and the metadata in Postgres with PGVector. + +### Query +We will match against user-provided text by a SQL query, and reuse the embedding operation in the indexing flow. + ## Prerequisite @@ -16,37 +26,35 @@ We appreciate a star ⭐ at [CocoIndex Github](https://github.com/cocoindex-io/c ## Run -Install dependencies: - -```bash -pip install -e . -``` +- Install dependencies: + + ```bash + pip install -e . + ``` -Setup: +- Setup: -```bash -python main.py cocoindex setup -``` + ```bash + python main.py cocoindex setup + ``` -Update index: +- Update index: -```bash -python main.py cocoindex update -``` + ```bash + python main.py cocoindex update + ``` -Run: +- Run: -```bash -python main.py -``` + ```bash + python main.py + ``` ## CocoInsight -CocoInsight is in Early Access now (Free) 😊 You found us! A quick 3 minute video tutorial about CocoInsight: [Watch on YouTube](https://youtu.be/ZnmyoHslBSc?si=pPLXWALztkA710r9). - -Run CocoInsight to understand your RAG data pipeline: +I used CocoInsight (Free beta now) to troubleshoot the index generation and understand the data lineage of the pipeline. It just connects to your local CocoIndex server, with Zero pipeline data retention. Run following command to start CocoInsight: ``` python main.py cocoindex server -ci ``` -Then open the CocoInsight UI at [https://cocoindex.io/cocoinsight](https://cocoindex.io/cocoinsight). \ No newline at end of file +Then open the CocoInsight UI at [https://cocoindex.io/cocoinsight](https://cocoindex.io/cocoinsight). 
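The pattern these example updates migrate to is the same across the series: a `@cocoindex.transform_flow()` function holds the embedding logic once, the indexing flow applies it to each chunk, and the query path evaluates it directly on the input string instead of going through `cocoindex.query.SimpleSemanticsQueryHandler`. A condensed sketch of the two call sites, using the same transform shown in the diffs above:

```python
import cocoindex

@cocoindex.transform_flow()
def text_to_embedding(text: cocoindex.DataSlice[str]) -> cocoindex.DataSlice[list[float]]:
    # Shared embedding logic for both indexing and querying.
    return text.transform(
        cocoindex.functions.SentenceTransformerEmbed(
            model="sentence-transformers/all-MiniLM-L6-v2"))

# Inside a flow definition, applied to each chunk of a document:
#     chunk["embedding"] = text_to_embedding(chunk["text"])
# At query time, evaluated directly on a plain string:
#     query_vector = text_to_embedding.eval("enter search query here")
```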
From b5fbae66375543552af98de0689e21100cbed213 Mon Sep 17 00:00:00 2001 From: LJ Date: Tue, 20 May 2025 14:34:24 -0700 Subject: [PATCH 31/47] Update README.md --- examples/pdf_embedding/README.md | 1 + 1 file changed, 1 insertion(+) diff --git a/examples/pdf_embedding/README.md b/examples/pdf_embedding/README.md index 32b6e1786..e041ed844 100644 --- a/examples/pdf_embedding/README.md +++ b/examples/pdf_embedding/README.md @@ -10,6 +10,7 @@ We appreciate a star ⭐ at [CocoIndex Github](https://github.com/cocoindex-io/c ### Indexing Flow PDF indexing flow + 1. We will ingest a list of PDF files 2. For each file - convert it to markdown From 3ab5ddce0e761f14ea374e36e4752efb9e775f1f Mon Sep 17 00:00:00 2001 From: Linghua Jin Date: Tue, 20 May 2025 14:47:10 -0700 Subject: [PATCH 32/47] Update README.md --- examples/pdf_embedding/README.md | 8 ++++---- 1 file changed, 4 insertions(+), 4 deletions(-) diff --git a/examples/pdf_embedding/README.md b/examples/pdf_embedding/README.md index e041ed844..d738f7dbe 100644 --- a/examples/pdf_embedding/README.md +++ b/examples/pdf_embedding/README.md @@ -2,7 +2,7 @@ [![GitHub](https://img.shields.io/github/stars/cocoindex-io/cocoindex?color=5B5BD6)](https://github.com/cocoindex-io/cocoindex) -In this example, we will build index flow from text embedding from local PDF files, and query the index. +In this example, we will build index flow for text embedding from local PDF files, and query the index. We appreciate a star ⭐ at [CocoIndex Github](https://github.com/cocoindex-io/cocoindex) if this is helpful. @@ -11,9 +11,9 @@ We appreciate a star ⭐ at [CocoIndex Github](https://github.com/cocoindex-io/c PDF indexing flow -1. We will ingest a list of PDF files -2. For each file - - convert it to markdown +1. We will ingest a list of PDF files. +2. For each file: + - convert it to markdown, and then - perform chunking (recursively split) and then embed each chunk. 3. We will save the embeddings and the metadata in Postgres with PGVector. From ff43f46f241bbbba06ca364a3d0396839ffe539d Mon Sep 17 00:00:00 2001 From: Linghua Jin Date: Tue, 20 May 2025 16:49:16 -0700 Subject: [PATCH 33/47] Update main.py --- examples/pdf_embedding/main.py | 4 +++- 1 file changed, 3 insertions(+), 1 deletion(-) diff --git a/examples/pdf_embedding/main.py b/examples/pdf_embedding/main.py index 928a66ff7..a623c1526 100644 --- a/examples/pdf_embedding/main.py +++ b/examples/pdf_embedding/main.py @@ -62,7 +62,7 @@ def pdf_embedding_flow(flow_builder: cocoindex.FlowBuilder, data_scope: cocoinde language="markdown", chunk_size=2000, chunk_overlap=500) with doc["chunks"].row() as chunk: - chunk["embedding"] = chunk["text"].call(text_to_embedding) + chunk["embedding"] = text_to_embedding(chunk["text"]) pdf_embeddings.collect(id=cocoindex.GeneratedField.UUID, filename=doc["filename"], location=chunk["location"], text=chunk["text"], embedding=chunk["embedding"]) @@ -107,6 +107,8 @@ def search(pool: ConnectionPool, query: str, top_k: int = 5): def _main(): + # Initialize the database connection pool. + pool = ConnectionPool(os.getenv("COCOINDEX_DATABASE_URL")) # Run queries in a loop to demonstrate the query capabilities. 
while True: try: From 55ebc4af2d44119e90b2d54ccaab93470b24b26c Mon Sep 17 00:00:00 2001 From: Linghua Jin Date: Tue, 20 May 2025 16:58:43 -0700 Subject: [PATCH 34/47] google drive example update query handler --- examples/gdrive_text_embedding/README.md | 54 +++++++++++------- examples/gdrive_text_embedding/main.py | 73 +++++++++++++++--------- 2 files changed, 79 insertions(+), 48 deletions(-) diff --git a/examples/gdrive_text_embedding/README.md b/examples/gdrive_text_embedding/README.md index 73167ab8f..3c41f1ab3 100644 --- a/examples/gdrive_text_embedding/README.md +++ b/examples/gdrive_text_embedding/README.md @@ -1,6 +1,21 @@ -This example builds embedding index based on Google Drive files. -It continuously updates the index as files are added / updated / deleted in the source folders: -it keeps the index in sync with the source folders effortlessly. +# Build Google Drive text embedding and semantic search 🔍 +[![GitHub](https://img.shields.io/github/stars/cocoindex-io/cocoindex?color=5B5BD6)](https://github.com/cocoindex-io/cocoindex) + +In this example, we will build an embedding index based on Google Drive files and perform semantic search. + +It continuously updates the index as files are added / updated / deleted in the source folders. It keeps the index in sync with the source folders in real-time. + +We appreciate a star ⭐ at [CocoIndex Github](https://github.com/cocoindex-io/cocoindex) if this is helpful. + +## Steps + +### Indexing Flow +1. We will ingest files from Google Drive folders. +2. For each file, perform chunking (recursively split) and then embedding. +3. We will save the embeddings and the metadata in Postgres with PGVector. + +### Query +We will match against user-provided text by a SQL query, and reuse the embedding operation in the indexing flow. ## Prerequisite @@ -25,32 +40,31 @@ Before running the example, you need to: ## Run -Install dependencies: - -```sh -pip install -e . -``` +- Install dependencies: -Setup: + ```sh + pip install -e . + ``` -```sh -cocoindex setup main.py -``` +- Setup: -Run: + ```sh + cocoindex setup main.py + ``` -```sh -python main.py -``` +- Run: + + ```sh + python main.py + ``` During running, it will keep observing changes in the source folders and update the index automatically. At the same time, it accepts queries from the terminal, and performs search on top of the up-to-date index. ## CocoInsight -CocoInsight is in Early Access now (Free) 😊 You found us! A quick 3 minute video tutorial about CocoInsight: [Watch on YouTube](https://youtu.be/ZnmyoHslBSc?si=pPLXWALztkA710r9). - -Run CocoInsight to understand your RAG data pipeline: +I used CocoInsight (Free beta now) to troubleshoot the index generation and understand the data lineage of the pipeline. +It just connects to your local CocoIndex server, with Zero pipeline data retention. Run following command to start CocoInsight: ```sh cocoindex server -ci main.py @@ -62,4 +76,4 @@ You can also add a `-L` flag to make the server keep updating the index to refle cocoindex server -ci -L main.py ``` -Then open the CocoInsight UI at [https://cocoindex.io/cocoinsight](https://cocoindex.io/cocoinsight). \ No newline at end of file +Then open the CocoInsight UI at [https://cocoindex.io/cocoinsight](https://cocoindex.io/cocoinsight). 
diff --git a/examples/gdrive_text_embedding/main.py b/examples/gdrive_text_embedding/main.py index 7e37ca7ef..b612e2d51 100644 --- a/examples/gdrive_text_embedding/main.py +++ b/examples/gdrive_text_embedding/main.py @@ -1,9 +1,19 @@ from dotenv import load_dotenv - +from psycopg_pool import ConnectionPool import cocoindex import datetime import os +@cocoindex.transform_flow() +def text_to_embedding(text: cocoindex.DataSlice[str]) -> cocoindex.DataSlice[list[float]]: + """ + Embed the text using a SentenceTransformer model. + This is a shared logic between indexing and querying, so extract it as a function. + """ + return text.transform( + cocoindex.functions.SentenceTransformerEmbed( + model="sentence-transformers/all-MiniLM-L6-v2")) + @cocoindex.flow_def(name="GoogleDriveTextEmbedding") def gdrive_text_embedding_flow(flow_builder: cocoindex.FlowBuilder, data_scope: cocoindex.DataScope): """ @@ -27,9 +37,7 @@ def gdrive_text_embedding_flow(flow_builder: cocoindex.FlowBuilder, data_scope: language="markdown", chunk_size=2000, chunk_overlap=500) with doc["chunks"].row() as chunk: - chunk["embedding"] = chunk["text"].transform( - cocoindex.functions.SentenceTransformerEmbed( - model="sentence-transformers/all-MiniLM-L6-v2")) + chunk["embedding"] = text_to_embedding(chunk["text"]) doc_embeddings.collect(filename=doc["filename"], location=chunk["location"], text=chunk["text"], embedding=chunk["embedding"]) @@ -42,33 +50,42 @@ def gdrive_text_embedding_flow(flow_builder: cocoindex.FlowBuilder, data_scope: field_name="embedding", metric=cocoindex.VectorSimilarityMetric.COSINE_SIMILARITY)]) -query_handler = cocoindex.query.SimpleSemanticsQueryHandler( - name="SemanticsSearch", - flow=gdrive_text_embedding_flow, - target_name="doc_embeddings", - query_transform_flow=lambda text: text.transform( - cocoindex.functions.SentenceTransformerEmbed( - model="sentence-transformers/all-MiniLM-L6-v2")), - default_similarity_metric=cocoindex.VectorSimilarityMetric.COSINE_SIMILARITY) +def search(pool: ConnectionPool, query: str, top_k: int = 5): + # Get the table name, for the export target in the gdrive_text_embedding_flow above. + table_name = cocoindex.utils.get_target_storage_default_name(gdrive_text_embedding_flow, "doc_embeddings") + # Evaluate the transform flow defined above with the input query, to get the embedding. + query_vector = text_to_embedding.eval(query) + # Run the query and get the results. + with pool.connection() as conn: + with conn.cursor() as cur: + cur.execute(f""" + SELECT filename, text, embedding <=> %s::vector AS distance + FROM {table_name} ORDER BY distance LIMIT %s + """, (query_vector, top_k)) + return [ + {"filename": row[0], "text": row[1], "score": 1.0 - row[2]} + for row in cur.fetchall() + ] def _main(): - # Use a `FlowLiveUpdater` to keep the flow data updated. - with cocoindex.FlowLiveUpdater(gdrive_text_embedding_flow): - # Run queries in a loop to demonstrate the query capabilities. - while True: - try: - query = input("Enter search query (or Enter to quit): ") - if query == '': - break - results, _ = query_handler.search(query, 10) - print("\nSearch results:") - for result in results: - print(f"[{result.score:.3f}] {result.data['filename']}") - print(f" {result.data['text']}") - print("---") - print() - except KeyboardInterrupt: + # Initialize the database connection pool. + pool = ConnectionPool(os.getenv("COCOINDEX_DATABASE_URL")) + # Run queries in a loop to demonstrate the query capabilities. 
+ while True: + try: + query = input("Enter search query (or Enter to quit): ") + if query == '': break + # Run the query function with the database connection pool and the query. + results = search(pool, query) + print("\nSearch results:") + for result in results: + print(f"[{result['score']:.3f}] {result['filename']}") + print(f" {result['text']}") + print("---") + print() + except KeyboardInterrupt: + break if __name__ == "__main__": load_dotenv() From 6e72fb268762c0816723bef3c0109629abd8681b Mon Sep 17 00:00:00 2001 From: LJ Date: Tue, 20 May 2025 17:07:48 -0700 Subject: [PATCH 35/47] Update README.md --- examples/gdrive_text_embedding/README.md | 4 ++++ 1 file changed, 4 insertions(+) diff --git a/examples/gdrive_text_embedding/README.md b/examples/gdrive_text_embedding/README.md index 3c41f1ab3..3d79dcb92 100644 --- a/examples/gdrive_text_embedding/README.md +++ b/examples/gdrive_text_embedding/README.md @@ -10,6 +10,8 @@ We appreciate a star ⭐ at [CocoIndex Github](https://github.com/cocoindex-io/c ## Steps ### Indexing Flow +Google Drive File Ingestion + 1. We will ingest files from Google Drive folders. 2. For each file, perform chunking (recursively split) and then embedding. 3. We will save the embeddings and the metadata in Postgres with PGVector. @@ -77,3 +79,5 @@ cocoindex server -ci -L main.py ``` Then open the CocoInsight UI at [https://cocoindex.io/cocoinsight](https://cocoindex.io/cocoinsight). + +Screenshot 2025-05-20 at 5 06 31 PM From b46cf771562ae075c76859f20c8dfecbe2aad8e0 Mon Sep 17 00:00:00 2001 From: Linghua Jin Date: Tue, 20 May 2025 17:17:13 -0700 Subject: [PATCH 36/47] Update README.md --- examples/gdrive_text_embedding/README.md | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/examples/gdrive_text_embedding/README.md b/examples/gdrive_text_embedding/README.md index 3d79dcb92..6f7bbb790 100644 --- a/examples/gdrive_text_embedding/README.md +++ b/examples/gdrive_text_embedding/README.md @@ -80,4 +80,4 @@ cocoindex server -ci -L main.py Then open the CocoInsight UI at [https://cocoindex.io/cocoinsight](https://cocoindex.io/cocoinsight). 
-Screenshot 2025-05-20 at 5 06 31 PM +Use CocoInsight to understand the data of the pipeline From b8745127ddce2dd319980f8ab54da58e71893f57 Mon Sep 17 00:00:00 2001 From: Linghua Jin Date: Tue, 20 May 2025 17:27:41 -0700 Subject: [PATCH 37/47] rename to image_search (to be more consistent with other examples) --- README.md | 2 +- .../{image_search_example => image_search}/.env | 0 .../README.md | 0 .../frontend/.gitignore | 0 .../frontend/index.html | 0 .../frontend/package-lock.json | 0 .../frontend/package.json | 0 .../frontend/src/App.jsx | 0 .../frontend/src/main.jsx | 0 .../frontend/src/style.css | 0 .../frontend/vite.config.js | 0 .../img/cat1.jpeg | Bin .../img/dog1.jpeg | Bin .../img/elephant1.jpg | Bin .../img/giraffe.jpg | Bin .../{image_search_example => image_search}/main.py | 0 .../requirements.txt | 0 17 files changed, 1 insertion(+), 1 deletion(-) rename examples/{image_search_example => image_search}/.env (100%) rename examples/{image_search_example => image_search}/README.md (100%) rename examples/{image_search_example => image_search}/frontend/.gitignore (100%) rename examples/{image_search_example => image_search}/frontend/index.html (100%) rename examples/{image_search_example => image_search}/frontend/package-lock.json (100%) rename examples/{image_search_example => image_search}/frontend/package.json (100%) rename examples/{image_search_example => image_search}/frontend/src/App.jsx (100%) rename examples/{image_search_example => image_search}/frontend/src/main.jsx (100%) rename examples/{image_search_example => image_search}/frontend/src/style.css (100%) rename examples/{image_search_example => image_search}/frontend/vite.config.js (100%) rename examples/{image_search_example => image_search}/img/cat1.jpeg (100%) rename examples/{image_search_example => image_search}/img/dog1.jpeg (100%) rename examples/{image_search_example => image_search}/img/elephant1.jpg (100%) rename examples/{image_search_example => image_search}/img/giraffe.jpg (100%) rename examples/{image_search_example => image_search}/main.py (100%) rename examples/{image_search_example => image_search}/requirements.txt (100%) diff --git a/README.md b/README.md index 8b3cb3a23..007586e76 100644 --- a/README.md +++ b/README.md @@ -138,7 +138,7 @@ It defines an index flow like this: | [Embeddings to Qdrant](examples/text_embedding_qdrant) | Index documents in a Qdrant collection for semantic search | | [FastAPI Server with Docker](examples/fastapi_server_docker) | Run the semantic search server in a Dockerized FastAPI setup | | [Product Recommendation](examples/product_recommendation) | Build real-time product recommendations with LLM and graph database| -| [Image Search with Vision API](examples/image_search_example) | Generates detailed captions for images using a vision model, embeds them, enables live-updating semantic search via FastAPI and served on a React frontend| +| [Image Search with Vision API](examples/image_search) | Generates detailed captions for images using a vision model, embeds them, enables live-updating semantic search via FastAPI and served on a React frontend| More coming and stay tuned 👀! 
diff --git a/examples/image_search_example/.env b/examples/image_search/.env similarity index 100% rename from examples/image_search_example/.env rename to examples/image_search/.env diff --git a/examples/image_search_example/README.md b/examples/image_search/README.md similarity index 100% rename from examples/image_search_example/README.md rename to examples/image_search/README.md diff --git a/examples/image_search_example/frontend/.gitignore b/examples/image_search/frontend/.gitignore similarity index 100% rename from examples/image_search_example/frontend/.gitignore rename to examples/image_search/frontend/.gitignore diff --git a/examples/image_search_example/frontend/index.html b/examples/image_search/frontend/index.html similarity index 100% rename from examples/image_search_example/frontend/index.html rename to examples/image_search/frontend/index.html diff --git a/examples/image_search_example/frontend/package-lock.json b/examples/image_search/frontend/package-lock.json similarity index 100% rename from examples/image_search_example/frontend/package-lock.json rename to examples/image_search/frontend/package-lock.json diff --git a/examples/image_search_example/frontend/package.json b/examples/image_search/frontend/package.json similarity index 100% rename from examples/image_search_example/frontend/package.json rename to examples/image_search/frontend/package.json diff --git a/examples/image_search_example/frontend/src/App.jsx b/examples/image_search/frontend/src/App.jsx similarity index 100% rename from examples/image_search_example/frontend/src/App.jsx rename to examples/image_search/frontend/src/App.jsx diff --git a/examples/image_search_example/frontend/src/main.jsx b/examples/image_search/frontend/src/main.jsx similarity index 100% rename from examples/image_search_example/frontend/src/main.jsx rename to examples/image_search/frontend/src/main.jsx diff --git a/examples/image_search_example/frontend/src/style.css b/examples/image_search/frontend/src/style.css similarity index 100% rename from examples/image_search_example/frontend/src/style.css rename to examples/image_search/frontend/src/style.css diff --git a/examples/image_search_example/frontend/vite.config.js b/examples/image_search/frontend/vite.config.js similarity index 100% rename from examples/image_search_example/frontend/vite.config.js rename to examples/image_search/frontend/vite.config.js diff --git a/examples/image_search_example/img/cat1.jpeg b/examples/image_search/img/cat1.jpeg similarity index 100% rename from examples/image_search_example/img/cat1.jpeg rename to examples/image_search/img/cat1.jpeg diff --git a/examples/image_search_example/img/dog1.jpeg b/examples/image_search/img/dog1.jpeg similarity index 100% rename from examples/image_search_example/img/dog1.jpeg rename to examples/image_search/img/dog1.jpeg diff --git a/examples/image_search_example/img/elephant1.jpg b/examples/image_search/img/elephant1.jpg similarity index 100% rename from examples/image_search_example/img/elephant1.jpg rename to examples/image_search/img/elephant1.jpg diff --git a/examples/image_search_example/img/giraffe.jpg b/examples/image_search/img/giraffe.jpg similarity index 100% rename from examples/image_search_example/img/giraffe.jpg rename to examples/image_search/img/giraffe.jpg diff --git a/examples/image_search_example/main.py b/examples/image_search/main.py similarity index 100% rename from examples/image_search_example/main.py rename to examples/image_search/main.py diff --git 
a/examples/image_search_example/requirements.txt b/examples/image_search/requirements.txt similarity index 100% rename from examples/image_search_example/requirements.txt rename to examples/image_search/requirements.txt From 2402c78afc875678c0b85ad3379062bc80e11b56 Mon Sep 17 00:00:00 2001 From: Linghua Jin Date: Tue, 20 May 2025 18:53:53 -0700 Subject: [PATCH 38/47] update query handling for image search --- examples/image_search/README.md | 12 ++------- examples/image_search/main.py | 40 ++++++++++++++++++---------- examples/image_search/pyproject.toml | 9 +++++++ 3 files changed, 37 insertions(+), 24 deletions(-) create mode 100644 examples/image_search/pyproject.toml diff --git a/examples/image_search/README.md b/examples/image_search/README.md index 3299ab457..51ce32c66 100644 --- a/examples/image_search/README.md +++ b/examples/image_search/README.md @@ -2,7 +2,7 @@ ![image](https://github.com/user-attachments/assets/3a696344-c9b4-46e8-9413-6229dbb8672a) -- QDrant for Vector Storage +- Qdrant for Vector Storage - Ollama Gemma3 (Image to Text) - CLIP ViT-L/14 - Embeddings Model - Live Update @@ -13,7 +13,7 @@ docker run -d --name qdrant -p 6334:6334 qdrant/qdrant:latest export COCOINDEX_DATABASE_URL="postgres://cocoindex:cocoindex@localhost/cocoindex" ``` -## Create QDrant Collection +## Create Qdrant Collection ``` curl -X PUT 'http://localhost:6333/collections/image_search' \ @@ -26,7 +26,6 @@ curl -X PUT } } }' - ``` ## Run Ollama @@ -35,13 +34,6 @@ ollama pull gemma3 ollama serve ``` -## Create virtual environment and install dependencies -``` -python -m venv .venv -source .venv/bin/activate -pip install -r requirements.txt -``` - ### Place your images in the `img` directory. - No need to update manually. CocoIndex will automatically update the index as new images are added to the directory. diff --git a/examples/image_search/main.py b/examples/image_search/main.py index 7ea2e9eb1..abbc74439 100644 --- a/examples/image_search/main.py +++ b/examples/image_search/main.py @@ -7,9 +7,11 @@ from fastapi import FastAPI, Query from fastapi.middleware.cors import CORSMiddleware from fastapi.staticfiles import StaticFiles +from qdrant_client import QdrantClient OLLAMA_URL = "http://localhost:11434/api/generate" OLLAMA_MODEL = "gemma3" +QDRANT_GRPC_URL = os.getenv("QDRANT_GRPC_URL", "http://localhost:6334/") # 1. Extract caption from image using Ollama vision model @cocoindex.op.function(cache=True, behavior_version=1) @@ -42,7 +44,12 @@ def get_image_caption(img_bytes: bytes) -> str: # 2. Embed the caption string -def caption_to_embedding(caption: cocoindex.DataSlice) -> cocoindex.DataSlice: +@cocoindex.transform_flow() +def caption_to_embedding(caption: cocoindex.DataSlice[str]) -> cocoindex.DataSlice[list[float]]: + """ + Embed the caption using a CLIP model. + This is shared logic between indexing and querying. 
+ """ return caption.transform( cocoindex.functions.SentenceTransformerEmbed( model="clip-ViT-L-14", @@ -70,7 +77,7 @@ def image_object_embedding_flow(flow_builder: cocoindex.FlowBuilder, data_scope: "img_embeddings", cocoindex.storages.Qdrant( collection_name="image_search", - grpc_url=os.getenv("QDRANT_GRPC_URL", "http://localhost:6334/"), + grpc_url=QDRANT_GRPC_URL, ), primary_key_fields=["id"], setup_by_user=True, @@ -93,26 +100,31 @@ def image_object_embedding_flow(flow_builder: cocoindex.FlowBuilder, data_scope: def startup_event(): load_dotenv() cocoindex.init() - app.state.query_handler = cocoindex.query.SimpleSemanticsQueryHandler( - name="ImageObjectSearch", - flow=image_object_embedding_flow, - target_name="img_embeddings", - query_transform_flow=caption_to_embedding, - default_similarity_metric=cocoindex.VectorSimilarityMetric.COSINE_SIMILARITY, + # Initialize Qdrant client + app.state.qdrant_client = QdrantClient( + url=QDRANT_GRPC_URL, + prefer_grpc=True ) app.state.live_updater = cocoindex.FlowLiveUpdater(image_object_embedding_flow) app.state.live_updater.start() @app.get("/search") def search(q: str = Query(..., description="Search query"), limit: int = Query(5, description="Number of results")): - query_handler = app.state.query_handler - results, _ = query_handler.search(q, limit, "embedding") + # Get the embedding for the query + query_embedding = caption_to_embedding.eval(q) + + # Search in Qdrant + search_results = app.state.qdrant_client.search( + collection_name="image_search", + query_vector=("embedding", query_embedding), + limit=limit + ) + + # Format results out = [] - for result in results: - row = dict(result.data) - # Only include filename and score + for result in search_results: out.append({ - "filename": row["filename"], + "filename": result.payload["filename"], "score": result.score }) return {"results": out} diff --git a/examples/image_search/pyproject.toml b/examples/image_search/pyproject.toml new file mode 100644 index 000000000..ac010d53b --- /dev/null +++ b/examples/image_search/pyproject.toml @@ -0,0 +1,9 @@ +[project] +name = "image-search" +version = "0.1.0" +description = "Simple example for cocoindex: build embedding index based on images." 
+requires-python = ">=3.11" +dependencies = ["cocoindex>=0.1.42", "python-dotenv>=1.0.1", "fastapi>=0.100.0"] + +[tool.setuptools] +packages = [] From ac43f786930c229754a6d6e633dc2e489018b5a2 Mon Sep 17 00:00:00 2001 From: Linghua Jin Date: Tue, 20 May 2025 19:44:54 -0700 Subject: [PATCH 39/47] Update README.md --- examples/image_search/README.md | 9 ++++----- 1 file changed, 4 insertions(+), 5 deletions(-) diff --git a/examples/image_search/README.md b/examples/image_search/README.md index 51ce32c66..16c912056 100644 --- a/examples/image_search/README.md +++ b/examples/image_search/README.md @@ -9,16 +9,15 @@ ## Make sure Postgres and Qdrant are running ``` -docker run -d --name qdrant -p 6334:6334 qdrant/qdrant:latest +docker run -d -p 6334:6334 -p 6333:6333 qdrant/qdrant export COCOINDEX_DATABASE_URL="postgres://cocoindex:cocoindex@localhost/cocoindex" ``` ## Create Qdrant Collection ``` -curl -X PUT - 'http://localhost:6333/collections/image_search' \ - --header 'Content-Type: application/json' \ - --data-raw '{ +curl -X PUT 'http://localhost:6333/collections/image_search' \ + -H 'Content-Type: application/json' \ + -d '{ "vectors": { "embedding": { "size": 768, From b645bb14b6d0478877b52e707b6491a60a3d927a Mon Sep 17 00:00:00 2001 From: LJ Date: Tue, 20 May 2025 21:56:46 -0700 Subject: [PATCH 40/47] Update README.md --- examples/image_search/README.md | 28 +++++++++++++++++----------- 1 file changed, 17 insertions(+), 11 deletions(-) diff --git a/examples/image_search/README.md b/examples/image_search/README.md index 16c912056..0aa2a7c91 100644 --- a/examples/image_search/README.md +++ b/examples/image_search/README.md @@ -38,17 +38,23 @@ ollama serve ## Run Backend -``` -cocoindex setup main.py -uvicorn main:app --reload --host 0.0.0.0 --port 8000 -``` - -## Run Frontend -``` -cd frontend -npm install -npm run dev -``` +- Install dependencies: + ``` + pip install -e . + ``` + +- Run Backend + ``` + cocoindex setup main.py + uvicorn main:app --reload --host 0.0.0.0 --port 8000 + ``` + +- Run Frontend + ``` + cd frontend + npm install + npm run dev + ``` Go to `http://localhost:5174` to search. From 32fcd4d285fbb6fcafe3f8042603898cbbff5df5 Mon Sep 17 00:00:00 2001 From: Linghua Jin Date: Wed, 21 May 2025 14:02:54 -0700 Subject: [PATCH 41/47] simplify the fast api example --- examples/fastapi_server_docker/main.py | 44 ++++++- .../fastapi_server_docker/sample_code/main.py | 113 ------------------ .../src/cocoindex_funs.py | 45 ------- 3 files changed, 43 insertions(+), 159 deletions(-) delete mode 100644 examples/fastapi_server_docker/sample_code/main.py delete mode 100644 examples/fastapi_server_docker/src/cocoindex_funs.py diff --git a/examples/fastapi_server_docker/main.py b/examples/fastapi_server_docker/main.py index 7ff9d943c..5d7bdd18b 100644 --- a/examples/fastapi_server_docker/main.py +++ b/examples/fastapi_server_docker/main.py @@ -1,10 +1,52 @@ import cocoindex import uvicorn +import os from fastapi import FastAPI from dotenv import load_dotenv -from src.cocoindex_funs import code_embedding_flow, code_to_embedding +@cocoindex.op.function() +def extract_extension(filename: str) -> str: + """Extract the extension of a filename.""" + return os.path.splitext(filename)[1] + +def code_to_embedding(text: cocoindex.DataSlice) -> cocoindex.DataSlice: + """ + Embed the text using a SentenceTransformer model. 
+ """ + return text.transform( + cocoindex.functions.SentenceTransformerEmbed( + model="sentence-transformers/all-MiniLM-L6-v2")) + +@cocoindex.flow_def(name="CodeEmbeddingFastApiExample") +def code_embedding_flow(flow_builder: cocoindex.FlowBuilder, data_scope: cocoindex.DataScope): + """ + Define an example flow that embeds files into a vector database. + """ + data_scope["files"] = flow_builder.add_source( + cocoindex.sources.LocalFile(path="./", + included_patterns=["*.py", "*.rs", "*.toml", "*.md", "*.mdx", "*.ts", "*.tsx"], + excluded_patterns=[".*", "target", "**/node_modules"])) + code_embeddings = data_scope.add_collector() + + with data_scope["files"].row() as file: + file["extension"] = file["filename"].transform(extract_extension) + file["chunks"] = file["content"].transform( + cocoindex.functions.SplitRecursively(), + language=file["extension"], chunk_size=1000, chunk_overlap=300) + with file["chunks"].row() as chunk: + chunk["embedding"] = chunk["text"].call(code_to_embedding) + code_embeddings.collect(filename=file["filename"], location=chunk["location"], + code=chunk["text"], embedding=chunk["embedding"]) + + code_embeddings.export( + "code_embeddings", + cocoindex.storages.Postgres(), + primary_key_fields=["filename", "location"], + vector_indexes=[ + cocoindex.VectorIndexDef( + field_name="embedding", + metric=cocoindex.VectorSimilarityMetric.COSINE_SIMILARITY)]) fastapi_app = FastAPI() diff --git a/examples/fastapi_server_docker/sample_code/main.py b/examples/fastapi_server_docker/sample_code/main.py deleted file mode 100644 index 0b4876f24..000000000 --- a/examples/fastapi_server_docker/sample_code/main.py +++ /dev/null @@ -1,113 +0,0 @@ -class Button: - def __init__(self, label, on_click): - self.label = label - self.on_click = on_click - - def click(self): - print(f"Button '{self.label}' clicked.") - if callable(self.on_click): - self.on_click() - - -class ProgressBar: - def __init__(self, max_value): - self.max_value = max_value - self.current_value = 0 - - def update(self, value): - self.current_value = min(value, self.max_value) - self.display() - - def display(self): - percent = (self.current_value / self.max_value) * 100 - print(f"Progress: {percent:.2f}%") - - -class Slider: - def __init__(self, min_value, max_value, initial_value=None): - self.min_value = min_value - self.max_value = max_value - self.value = initial_value if initial_value is not None else min_value - - def set_value(self, new_value): - if self.min_value <= new_value <= self.max_value: - self.value = new_value - print(f"Slider set to {self.value}") - else: - print("Value out of range.") - - -class Dropdown: - def __init__(self, options): - self.options = options - self.selected = None - - def select(self, option): - if option in self.options: - self.selected = option - print(f"Dropdown selected: {option}") - else: - print("Option not available.") - - -class TextField: - def __init__(self, placeholder=''): - self.placeholder = placeholder - self.text = '' - - def input(self, new_text): - self.text = new_text - print(f"Text field updated: {self.text}") - - -class Checkbox: - def __init__(self, label): - self.label = label - self.checked = False - - def toggle(self): - self.checked = not self.checked - print(f"{self.label}: {'Checked' if self.checked else 'Unchecked'}") - - -class RadioButton: - def __init__(self, group, label): - self.group = group - self.label = label - self.selected = False - - def select(self): - self.selected = True - print(f"Radio button '{self.label}' selected in group 
'{self.group}'.") - - -class ToggleSwitch: - def __init__(self, state=False): - self.state = state - - def toggle(self): - self.state = not self.state - print(f"ToggleSwitch is now {'On' if self.state else 'Off'}") - - -class Tooltip: - def __init__(self, text): - self.text = text - - def show(self): - print(f"Tooltip: {self.text}") - - -class Modal: - def __init__(self, content): - self.content = content - self.visible = False - - def open(self): - self.visible = True - print("Modal opened.") - print(f"Content: {self.content}") - - def close(self): - self.visible = False - print("Modal closed.") diff --git a/examples/fastapi_server_docker/src/cocoindex_funs.py b/examples/fastapi_server_docker/src/cocoindex_funs.py deleted file mode 100644 index ff4bd7dbd..000000000 --- a/examples/fastapi_server_docker/src/cocoindex_funs.py +++ /dev/null @@ -1,45 +0,0 @@ -import cocoindex -import os - -@cocoindex.op.function() -def extract_extension(filename: str) -> str: - """Extract the extension of a filename.""" - return os.path.splitext(filename)[1] - -def code_to_embedding(text: cocoindex.DataSlice) -> cocoindex.DataSlice: - """ - Embed the text using a SentenceTransformer model. - """ - return text.transform( - cocoindex.functions.SentenceTransformerEmbed( - model="sentence-transformers/all-MiniLM-L6-v2")) - -@cocoindex.flow_def(name="CodeEmbedding") -def code_embedding_flow(flow_builder: cocoindex.FlowBuilder, data_scope: cocoindex.DataScope): - """ - Define an example flow that embeds files into a vector database. - """ - data_scope["files"] = flow_builder.add_source( - cocoindex.sources.LocalFile(path="sample_code", - included_patterns=["*.py", "*.rs", "*.toml", "*.md", "*.mdx", "*.ts", "*.tsx"], - excluded_patterns=[".*", "target", "**/node_modules"])) - code_embeddings = data_scope.add_collector() - - with data_scope["files"].row() as file: - file["extension"] = file["filename"].transform(extract_extension) - file["chunks"] = file["content"].transform( - cocoindex.functions.SplitRecursively(), - language=file["extension"], chunk_size=1000, chunk_overlap=300) - with file["chunks"].row() as chunk: - chunk["embedding"] = chunk["text"].call(code_to_embedding) - code_embeddings.collect(filename=file["filename"], location=chunk["location"], - code=chunk["text"], embedding=chunk["embedding"]) - - code_embeddings.export( - "code_embeddings", - cocoindex.storages.Postgres(), - primary_key_fields=["filename", "location"], - vector_indexes=[ - cocoindex.VectorIndexDef( - field_name="embedding", - metric=cocoindex.VectorSimilarityMetric.COSINE_SIMILARITY)]) From 5a01dc8d484cc88bd79b24cde869e8be3c62a577 Mon Sep 17 00:00:00 2001 From: Linghua Jin Date: Wed, 21 May 2025 14:21:13 -0700 Subject: [PATCH 42/47] upgrade query handling --- examples/fastapi_server_docker/.env | 6 +- .../files/1810.04805v2.md | 530 ++++++++++++++++++ examples/fastapi_server_docker/main.py | 93 +-- 3 files changed, 589 insertions(+), 40 deletions(-) create mode 100644 examples/fastapi_server_docker/files/1810.04805v2.md diff --git a/examples/fastapi_server_docker/.env b/examples/fastapi_server_docker/.env index 8a1c89b4b..f322f4e2d 100644 --- a/examples/fastapi_server_docker/.env +++ b/examples/fastapi_server_docker/.env @@ -1 +1,5 @@ -COCOINDEX_DATABASE_URL=postgres://cocoindex:cocoindex@coco_db:5432/cocoindex \ No newline at end of file +# for docker +# COCOINDEX_DATABASE_URL=postgres://cocoindex:cocoindex@coco_db:5432/cocoindex + +# For local testing +COCOINDEX_DATABASE_URL=postgres://cocoindex:cocoindex@localhost/cocoindex diff 
--git a/examples/fastapi_server_docker/files/1810.04805v2.md b/examples/fastapi_server_docker/files/1810.04805v2.md new file mode 100644 index 000000000..21ac07f46 --- /dev/null +++ b/examples/fastapi_server_docker/files/1810.04805v2.md @@ -0,0 +1,530 @@ +# BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding + +Jacob Devlin Ming-Wei Chang Kenton Lee Kristina Toutanova + +Google AI Language + +{jacobdevlin,mingweichang,kentonl,kristout}@google.com + +### Abstract + +We introduce a new language representation model called BERT, which stands for Bidirectional Encoder Representations from Transformers. Unlike recent language representation models [(Peters et al.,](#page-10-0) [2018a;](#page-10-0) [Rad](#page-10-1)[ford et al.,](#page-10-1) [2018)](#page-10-1), BERT is designed to pretrain deep bidirectional representations from unlabeled text by jointly conditioning on both left and right context in all layers. As a result, the pre-trained BERT model can be finetuned with just one additional output layer to create state-of-the-art models for a wide range of tasks, such as question answering and language inference, without substantial taskspecific architecture modifications. + +BERT is conceptually simple and empirically powerful. It obtains new state-of-the-art results on eleven natural language processing tasks, including pushing the GLUE score to 80.5% (7.7% point absolute improvement), MultiNLI accuracy to 86.7% (4.6% absolute improvement), SQuAD v1.1 question answering Test F1 to 93.2 (1.5 point absolute improvement) and SQuAD v2.0 Test F1 to 83.1 (5.1 point absolute improvement). + +### 1 Introduction + +Language model pre-training has been shown to be effective for improving many natural language processing tasks [(Dai and Le,](#page-9-0) [2015;](#page-9-0) [Peters et al.,](#page-10-0) [2018a;](#page-10-0) [Radford et al.,](#page-10-1) [2018;](#page-10-1) [Howard and Ruder,](#page-9-1) [2018)](#page-9-1). These include sentence-level tasks such as natural language inference [(Bowman et al.,](#page-9-2) [2015;](#page-9-2) [Williams et al.,](#page-11-0) [2018)](#page-11-0) and paraphrasing [(Dolan](#page-9-3) [and Brockett,](#page-9-3) [2005)](#page-9-3), which aim to predict the relationships between sentences by analyzing them holistically, as well as token-level tasks such as named entity recognition and question answering, where models are required to produce fine-grained output at the token level [(Tjong Kim Sang and](#page-10-2) [De Meulder,](#page-10-2) [2003;](#page-10-2) [Rajpurkar et al.,](#page-10-3) [2016)](#page-10-3). + +There are two existing strategies for applying pre-trained language representations to downstream tasks: *feature-based* and *fine-tuning*. The feature-based approach, such as ELMo [(Peters](#page-10-0) [et al.,](#page-10-0) [2018a)](#page-10-0), uses task-specific architectures that include the pre-trained representations as additional features. The fine-tuning approach, such as the Generative Pre-trained Transformer (OpenAI GPT) [(Radford et al.,](#page-10-1) [2018)](#page-10-1), introduces minimal task-specific parameters, and is trained on the downstream tasks by simply fine-tuning *all* pretrained parameters. The two approaches share the same objective function during pre-training, where they use unidirectional language models to learn general language representations. + +We argue that current techniques restrict the power of the pre-trained representations, especially for the fine-tuning approaches. 
The major limitation is that standard language models are unidirectional, and this limits the choice of architectures that can be used during pre-training. For example, in OpenAI GPT, the authors use a left-toright architecture, where every token can only attend to previous tokens in the self-attention layers of the Transformer [(Vaswani et al.,](#page-10-4) [2017)](#page-10-4). Such restrictions are sub-optimal for sentence-level tasks, and could be very harmful when applying finetuning based approaches to token-level tasks such as question answering, where it is crucial to incorporate context from both directions. + +In this paper, we improve the fine-tuning based approaches by proposing BERT: Bidirectional Encoder Representations from Transformers. BERT alleviates the previously mentioned unidirectionality constraint by using a "masked language model" (MLM) pre-training objective, inspired by the Cloze task [(Taylor,](#page-10-5) [1953)](#page-10-5). The masked language model randomly masks some of the tokens from the input, and the objective is to predict the original vocabulary id of the masked word based only on its context. Unlike left-toright language model pre-training, the MLM objective enables the representation to fuse the left and the right context, which allows us to pretrain a deep bidirectional Transformer. In addition to the masked language model, we also use a "next sentence prediction" task that jointly pretrains text-pair representations. The contributions of our paper are as follows: + +- We demonstrate the importance of bidirectional pre-training for language representations. Unlike [Radford et al.](#page-10-1) [(2018)](#page-10-1), which uses unidirectional language models for pre-training, BERT uses masked language models to enable pretrained deep bidirectional representations. This is also in contrast to [Peters et al.](#page-10-0) [(2018a)](#page-10-0), which uses a shallow concatenation of independently trained left-to-right and right-to-left LMs. +- We show that pre-trained representations reduce the need for many heavily-engineered taskspecific architectures. BERT is the first finetuning based representation model that achieves state-of-the-art performance on a large suite of sentence-level *and* token-level tasks, outperforming many task-specific architectures. +- BERT advances the state of the art for eleven NLP tasks. The code and pre-trained models are available at [https://github.com/](https://github.com/google-research/bert) [google-research/bert](https://github.com/google-research/bert). + +### 2 Related Work + +There is a long history of pre-training general language representations, and we briefly review the most widely-used approaches in this section. + +#### 2.1 Unsupervised Feature-based Approaches + +Learning widely applicable representations of words has been an active area of research for decades, including non-neural [(Brown et al.,](#page-9-4) [1992;](#page-9-4) [Ando and Zhang,](#page-9-5) [2005;](#page-9-5) [Blitzer et al.,](#page-9-6) [2006)](#page-9-6) and neural [(Mikolov et al.,](#page-10-6) [2013;](#page-10-6) [Pennington et al.,](#page-10-7) [2014)](#page-10-7) methods. Pre-trained word embeddings are an integral part of modern NLP systems, offering significant improvements over embeddings learned from scratch [(Turian et al.,](#page-10-8) [2010)](#page-10-8). 
To pretrain word embedding vectors, left-to-right language modeling objectives have been used [(Mnih](#page-10-9) [and Hinton,](#page-10-9) [2009)](#page-10-9), as well as objectives to discriminate correct from incorrect words in left and right context [(Mikolov et al.,](#page-10-6) [2013)](#page-10-6). + +These approaches have been generalized to coarser granularities, such as sentence embeddings [(Kiros et al.,](#page-10-10) [2015;](#page-10-10) [Logeswaran and Lee,](#page-10-11) [2018)](#page-10-11) or paragraph embeddings [(Le and Mikolov,](#page-10-12) [2014)](#page-10-12). To train sentence representations, prior work has used objectives to rank candidate next sentences [(Jernite et al.,](#page-9-7) [2017;](#page-9-7) [Logeswaran and](#page-10-11) [Lee,](#page-10-11) [2018)](#page-10-11), left-to-right generation of next sentence words given a representation of the previous sentence [(Kiros et al.,](#page-10-10) [2015)](#page-10-10), or denoising autoencoder derived objectives [(Hill et al.,](#page-9-8) [2016)](#page-9-8). + +ELMo and its predecessor [(Peters et al.,](#page-10-13) [2017,](#page-10-13) [2018a)](#page-10-0) generalize traditional word embedding research along a different dimension. They extract *context-sensitive* features from a left-to-right and a right-to-left language model. The contextual representation of each token is the concatenation of the left-to-right and right-to-left representations. When integrating contextual word embeddings with existing task-specific architectures, ELMo advances the state of the art for several major NLP benchmarks [(Peters et al.,](#page-10-0) [2018a)](#page-10-0) including question answering [(Rajpurkar et al.,](#page-10-3) [2016)](#page-10-3), sentiment analysis [(Socher et al.,](#page-10-14) [2013)](#page-10-14), and named entity recognition [(Tjong Kim Sang and De Meulder,](#page-10-2) [2003)](#page-10-2). [Melamud et al.](#page-10-15) [(2016)](#page-10-15) proposed learning contextual representations through a task to predict a single word from both left and right context using LSTMs. Similar to ELMo, their model is feature-based and not deeply bidirectional. [Fedus](#page-9-9) [et al.](#page-9-9) [(2018)](#page-9-9) shows that the cloze task can be used to improve the robustness of text generation models. + +#### 2.2 Unsupervised Fine-tuning Approaches + +As with the feature-based approaches, the first works in this direction only pre-trained word embedding parameters from unlabeled text [(Col](#page-9-10)[lobert and Weston,](#page-9-10) [2008)](#page-9-10). + +More recently, sentence or document encoders which produce contextual token representations have been pre-trained from unlabeled text and fine-tuned for a supervised downstream task [(Dai](#page-9-0) [and Le,](#page-9-0) [2015;](#page-9-0) [Howard and Ruder,](#page-9-1) [2018;](#page-9-1) [Radford](#page-10-1) [et al.,](#page-10-1) [2018)](#page-10-1). The advantage of these approaches is that few parameters need to be learned from scratch. At least partly due to this advantage, OpenAI GPT [(Radford et al.,](#page-10-1) [2018)](#page-10-1) achieved previously state-of-the-art results on many sentencelevel tasks from the GLUE benchmark [(Wang](#page-10-16) [et al.,](#page-10-16) [2018a)](#page-10-16). Left-to-right language model- + +Figure 1: Overall pre-training and fine-tuning procedures for BERT. Apart from output layers, the same architectures are used in both pre-training and fine-tuning. 
The same pre-trained model parameters are used to initialize models for different down-stream tasks. During fine-tuning, all parameters are fine-tuned. [CLS] is a special symbol added in front of every input example, and [SEP] is a special separator token (e.g. separating questions/answers). + +ing and auto-encoder objectives have been used for pre-training such models [(Howard and Ruder,](#page-9-1) [2018;](#page-9-1) [Radford et al.,](#page-10-1) [2018;](#page-10-1) [Dai and Le,](#page-9-0) [2015)](#page-9-0). + +#### 2.3 Transfer Learning from Supervised Data + +There has also been work showing effective transfer from supervised tasks with large datasets, such as natural language inference [(Conneau et al.,](#page-9-11) [2017)](#page-9-11) and machine translation [(McCann et al.,](#page-10-17) [2017)](#page-10-17). Computer vision research has also demonstrated the importance of transfer learning from large pre-trained models, where an effective recipe is to fine-tune models pre-trained with ImageNet [(Deng et al.,](#page-9-12) [2009;](#page-9-12) [Yosinski et al.,](#page-11-1) [2014)](#page-11-1). + +### 3 BERT + +We introduce BERT and its detailed implementation in this section. There are two steps in our framework: *pre-training* and *fine-tuning*. During pre-training, the model is trained on unlabeled data over different pre-training tasks. For finetuning, the BERT model is first initialized with the pre-trained parameters, and all of the parameters are fine-tuned using labeled data from the downstream tasks. Each downstream task has separate fine-tuned models, even though they are initialized with the same pre-trained parameters. The question-answering example in Figure [1](#page-2-0) will serve as a running example for this section. + +A distinctive feature of BERT is its unified architecture across different tasks. There is minimal difference between the pre-trained architecture and the final downstream architecture. + +Model Architecture BERT's model architecture is a multi-layer bidirectional Transformer encoder based on the original implementation described in [Vaswani et al.](#page-10-4) [(2017)](#page-10-4) and released in the tensor2tensor library.[1](#page-2-1) Because the use of Transformers has become common and our implementation is almost identical to the original, we will omit an exhaustive background description of the model architecture and refer readers to [Vaswani et al.](#page-10-4) [(2017)](#page-10-4) as well as excellent guides such as "The Annotated Transformer."[2](#page-2-2) + +In this work, we denote the number of layers (i.e., Transformer blocks) as L, the hidden size as H, and the number of self-attention heads as A. [3](#page-2-3) We primarily report results on two model sizes: BERTBASE (L=12, H=768, A=12, Total Parameters=110M) and BERTLARGE (L=24, H=1024, A=16, Total Parameters=340M). + +BERTBASE was chosen to have the same model size as OpenAI GPT for comparison purposes. Critically, however, the BERT Transformer uses bidirectional self-attention, while the GPT Transformer uses constrained self-attention where every token can only attend to context to its left.[4](#page-2-4) + +1 https://github.com/tensorflow/tensor2tensor 2 http://nlp.seas.harvard.edu/2018/04/03/attention.html 3 In all cases we set the feed-forward/filter size to be 4H, + +i.e., 3072 for the H = 768 and 4096 for the H = 1024. 
4We note that in the literature the bidirectional Trans- + +Input/Output Representations To make BERT handle a variety of down-stream tasks, our input representation is able to unambiguously represent both a single sentence and a pair of sentences (e.g., h Question, Answeri) in one token sequence. Throughout this work, a "sentence" can be an arbitrary span of contiguous text, rather than an actual linguistic sentence. A "sequence" refers to the input token sequence to BERT, which may be a single sentence or two sentences packed together. + +We use WordPiece embeddings [(Wu et al.,](#page-11-2) [2016)](#page-11-2) with a 30,000 token vocabulary. The first token of every sequence is always a special classification token ([CLS]). The final hidden state corresponding to this token is used as the aggregate sequence representation for classification tasks. Sentence pairs are packed together into a single sequence. We differentiate the sentences in two ways. First, we separate them with a special token ([SEP]). Second, we add a learned embedding to every token indicating whether it belongs to sentence A or sentence B. As shown in Figure [1,](#page-2-0) we denote input embedding as E, the final hidden vector of the special [CLS] token as C ∈ R H, and the final hidden vector for the i th input token as Ti ∈ R H. + +For a given token, its input representation is constructed by summing the corresponding token, segment, and position embeddings. A visualization of this construction can be seen in Figure [2.](#page-4-0) + +### 3.1 Pre-training BERT + +Unlike [Peters et al.](#page-10-0) [(2018a)](#page-10-0) and [Radford et al.](#page-10-1) [(2018)](#page-10-1), we do not use traditional left-to-right or right-to-left language models to pre-train BERT. Instead, we pre-train BERT using two unsupervised tasks, described in this section. This step is presented in the left part of Figure [1.](#page-2-0) + +Task #1: Masked LM Intuitively, it is reasonable to believe that a deep bidirectional model is strictly more powerful than either a left-to-right model or the shallow concatenation of a left-toright and a right-to-left model. Unfortunately, standard conditional language models can only be trained left-to-right *or* right-to-left, since bidirectional conditioning would allow each word to indirectly "see itself", and the model could trivially predict the target word in a multi-layered context. + +In order to train a deep bidirectional representation, we simply mask some percentage of the input tokens at random, and then predict those masked tokens. We refer to this procedure as a "masked LM" (MLM), although it is often referred to as a *Cloze* task in the literature [(Taylor,](#page-10-5) [1953)](#page-10-5). In this case, the final hidden vectors corresponding to the mask tokens are fed into an output softmax over the vocabulary, as in a standard LM. In all of our experiments, we mask 15% of all WordPiece tokens in each sequence at random. In contrast to denoising auto-encoders [(Vincent et al.,](#page-10-18) [2008)](#page-10-18), we only predict the masked words rather than reconstructing the entire input. + +Although this allows us to obtain a bidirectional pre-trained model, a downside is that we are creating a mismatch between pre-training and fine-tuning, since the [MASK] token does not appear during fine-tuning. To mitigate this, we do not always replace "masked" words with the actual [MASK] token. The training data generator chooses 15% of the token positions at random for prediction. 
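To make this data-generation step concrete, the following toy sketch (illustrative Python, not the paper's implementation) selects 15% of the positions in a token list and applies the replacement rule described next:

```python
# Toy sketch of masked-LM example creation (illustrative only, not the paper's code).
# It picks 15% of the token positions at random; the 80%/10%/10% replacement rule that
# the text describes next decides what each chosen position is replaced with.
import random


def create_mlm_example(tokens: list[str], vocab: list[str], mask_rate: float = 0.15):
    tokens = list(tokens)
    labels: list[str | None] = [None] * len(tokens)   # only chosen positions get a target
    num_to_predict = max(1, round(len(tokens) * mask_rate))
    for i in random.sample(range(len(tokens)), num_to_predict):
        labels[i] = tokens[i]                          # original token is the prediction target
        r = random.random()
        if r < 0.8:
            tokens[i] = "[MASK]"                       # 80%: replace with [MASK]
        elif r < 0.9:
            tokens[i] = random.choice(vocab)           # 10%: replace with a random token
        # remaining 10%: keep the token unchanged
    return tokens, labels


masked_tokens, targets = create_mlm_example(
    ["my", "dog", "is", "hairy"], vocab=["my", "dog", "is", "hairy", "apple"])
```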
If the i-th token is chosen, we replace the i-th token with (1) the [MASK] token 80% of the time (2) a random token 10% of the time (3) the unchanged i-th token 10% of the time. Then, Ti will be used to predict the original token with cross entropy loss. We compare variations of this procedure in Appendix [C.2.](#page-15-0) + +Task #2: Next Sentence Prediction (NSP) Many important downstream tasks such as Question Answering (QA) and Natural Language Inference (NLI) are based on understanding the *relationship* between two sentences, which is not directly captured by language modeling. In order to train a model that understands sentence relationships, we pre-train for a binarized *next sentence prediction* task that can be trivially generated from any monolingual corpus. Specifically, when choosing the sentences A and B for each pretraining example, 50% of the time B is the actual next sentence that follows A (labeled as IsNext), and 50% of the time it is a random sentence from the corpus (labeled as NotNext). As we show in Figure [1,](#page-2-0) C is used for next sentence prediction (NSP).[5](#page-3-0) Despite its simplicity, we demonstrate in Section [5.1](#page-7-0) that pre-training towards this task is very beneficial to both QA and NLI. [6](#page-3-1) + +former is often referred to as a "Transformer encoder" while the left-context-only version is referred to as a "Transformer decoder" since it can be used for text generation. + +5The final model achieves 97%-98% accuracy on NSP. + +6The vector C is not a meaningful sentence representation without fine-tuning, since it was trained with NSP. + +Figure 2: BERT input representation. The input embeddings are the sum of the token embeddings, the segmentation embeddings and the position embeddings. + +The NSP task is closely related to representationlearning objectives used in [Jernite et al.](#page-9-7) [(2017)](#page-9-7) and [Logeswaran and Lee](#page-10-11) [(2018)](#page-10-11). However, in prior work, only sentence embeddings are transferred to down-stream tasks, where BERT transfers all parameters to initialize end-task model parameters. + +Pre-training data The pre-training procedure largely follows the existing literature on language model pre-training. For the pre-training corpus we use the BooksCorpus (800M words) [(Zhu et al.,](#page-11-3) [2015)](#page-11-3) and English Wikipedia (2,500M words). For Wikipedia we extract only the text passages and ignore lists, tables, and headers. It is critical to use a document-level corpus rather than a shuffled sentence-level corpus such as the Billion Word Benchmark [(Chelba et al.,](#page-9-13) [2013)](#page-9-13) in order to extract long contiguous sequences. + +#### 3.2 Fine-tuning BERT + +Fine-tuning is straightforward since the selfattention mechanism in the Transformer allows BERT to model many downstream tasks whether they involve single text or text pairs—by swapping out the appropriate inputs and outputs. For applications involving text pairs, a common pattern is to independently encode text pairs before applying bidirectional cross attention, such as [Parikh et al.](#page-10-19) [(2016)](#page-10-19); [Seo et al.](#page-10-20) [(2017)](#page-10-20). BERT instead uses the self-attention mechanism to unify these two stages, as encoding a concatenated text pair with self-attention effectively includes *bidirectional* cross attention between two sentences. + +For each task, we simply plug in the taskspecific inputs and outputs into BERT and finetune all the parameters end-to-end. 
At the input, sentence A and sentence B from pre-training are analogous to (1) sentence pairs in paraphrasing, (2) hypothesis-premise pairs in entailment, (3) question-passage pairs in question answering, and (4) a degenerate text-∅ pair in text classification or sequence tagging. At the output, the token representations are fed into an output layer for tokenlevel tasks, such as sequence tagging or question answering, and the [CLS] representation is fed into an output layer for classification, such as entailment or sentiment analysis. + +Compared to pre-training, fine-tuning is relatively inexpensive. All of the results in the paper can be replicated in at most 1 hour on a single Cloud TPU, or a few hours on a GPU, starting from the exact same pre-trained model.[7](#page-4-1) We describe the task-specific details in the corresponding subsections of Section [4.](#page-4-2) More details can be found in Appendix [A.5.](#page-13-0) + +### 4 Experiments + +In this section, we present BERT fine-tuning results on 11 NLP tasks. + +#### 4.1 GLUE + +The General Language Understanding Evaluation (GLUE) benchmark [(Wang et al.,](#page-10-16) [2018a)](#page-10-16) is a collection of diverse natural language understanding tasks. Detailed descriptions of GLUE datasets are included in Appendix [B.1.](#page-13-1) + +To fine-tune on GLUE, we represent the input sequence (for single sentence or sentence pairs) as described in Section [3,](#page-2-5) and use the final hidden vector C ∈ R H corresponding to the first input token ([CLS]) as the aggregate representation. The only new parameters introduced during fine-tuning are classification layer weights W ∈ R K×H, where K is the number of labels. We compute a standard classification loss with C and W, i.e., log(softmax(CWT )). + +- 8 See (10) in . +7 For example, the BERT SQuAD model can be trained in around 30 minutes on a single Cloud TPU to achieve a Dev F1 score of 91.0%. + + + +| System | MNLI-(m/mm) | QQP | QNLI | SST-2 | CoLA | STS-B | MRPC | RTE | Average | +|------------------|-------------|------|------|-------|------|-------|------|------|---------| +| | 392k | 363k | 108k | 67k | 8.5k | 5.7k | 3.5k | 2.5k | - | +| Pre-OpenAI SOTA | 80.6/80.1 | 66.1 | 82.3 | 93.2 | 35.0 | 81.0 | 86.0 | 61.7 | 74.0 | +| BiLSTM+ELMo+Attn | 76.4/76.1 | 64.8 | 79.8 | 90.4 | 36.0 | 73.3 | 84.9 | 56.8 | 71.0 | +| OpenAI GPT | 82.1/81.4 | 70.3 | 87.4 | 91.3 | 45.4 | 80.0 | 82.3 | 56.0 | 75.1 | +| BERTBASE | 84.6/83.4 | 71.2 | 90.5 | 93.5 | 52.1 | 85.8 | 88.9 | 66.4 | 79.6 | +| BERTLARGE | 86.7/85.9 | 72.1 | 92.7 | 94.9 | 60.5 | 86.5 | 89.3 | 70.1 | 82.1 | + +Table 1: GLUE Test results, scored by the evaluation server (). The number below each task denotes the number of training examples. The "Average" column is slightly different than the official GLUE score, since we exclude the problematic WNLI set.[8](#page-4-3) BERT and OpenAI GPT are singlemodel, single task. F1 scores are reported for QQP and MRPC, Spearman correlations are reported for STS-B, and accuracy scores are reported for the other tasks. We exclude entries that use BERT as one of their components. + +We use a batch size of 32 and fine-tune for 3 epochs over the data for all GLUE tasks. For each task, we selected the best fine-tuning learning rate (among 5e-5, 4e-5, 3e-5, and 2e-5) on the Dev set. Additionally, for BERTLARGE we found that finetuning was sometimes unstable on small datasets, so we ran several random restarts and selected the best model on the Dev set. 
With random restarts, we use the same pre-trained checkpoint but perform different fine-tuning data shuffling and classifier layer initialization.[9](#page-5-0)

Results are presented in Table [1.](#page-5-1) Both BERTBASE and BERTLARGE outperform all systems on all tasks by a substantial margin, obtaining 4.5% and 7.0% respective average accuracy improvement over the prior state of the art. Note that BERTBASE and OpenAI GPT are nearly identical in terms of model architecture apart from the attention masking. For the largest and most widely reported GLUE task, MNLI, BERT obtains a 4.6% absolute accuracy improvement. On the official GLUE leaderboard[10](#page-5-2), BERTLARGE obtains a score of 80.5, compared to OpenAI GPT, which obtains 72.8 as of the date of writing.

We find that BERTLARGE significantly outperforms BERTBASE across all tasks, especially those with very little training data. The effect of model size is explored more thoroughly in Section [5.2.](#page-7-1)

#### 4.2 SQuAD v1.1

The Stanford Question Answering Dataset (SQuAD v1.1) is a collection of 100k crowdsourced question/answer pairs [(Rajpurkar et al.,](#page-10-3) [2016)](#page-10-3). Given a question and a passage from Wikipedia containing the answer, the task is to predict the answer text span in the passage.

As shown in Figure [1,](#page-2-0) in the question answering task, we represent the input question and passage as a single packed sequence, with the question using the A embedding and the passage using the B embedding. We only introduce a start vector S ∈ R^H and an end vector E ∈ R^H during fine-tuning. The probability of word i being the start of the answer span is computed as a dot product between T_i and S followed by a softmax over all of the words in the paragraph: P_i = e^{S·T_i} / Σ_j e^{S·T_j}. The analogous formula is used for the end of the answer span. The score of a candidate span from position i to position j is defined as S·T_i + E·T_j, and the maximum scoring span where j ≥ i is used as a prediction. The training objective is the sum of the log-likelihoods of the correct start and end positions. We fine-tune for 3 epochs with a learning rate of 5e-5 and a batch size of 32.

Table [2](#page-6-0) shows top leaderboard entries as well as results from top published systems [(Seo et al.,](#page-10-20) [2017;](#page-10-20) [Clark and Gardner,](#page-9-14) [2018;](#page-9-14) [Peters et al.,](#page-10-0) [2018a;](#page-10-0) [Hu et al.,](#page-9-15) [2018)](#page-9-15). The top results from the SQuAD leaderboard do not have up-to-date public system descriptions available,[11](#page-5-3) and are allowed to use any public data when training their systems. We therefore use modest data augmentation in our system by first fine-tuning on TriviaQA [(Joshi](#page-10-21) [et al.,](#page-10-21) [2017)](#page-10-21) before fine-tuning on SQuAD.

Our best performing system outperforms the top leaderboard system by +1.5 F1 in ensembling and +1.3 F1 as a single system. In fact, our single BERT model outperforms the top ensemble system in terms of F1 score. Without TriviaQA fine-

9The GLUE data set distribution does not include the Test labels, and we only made a single GLUE evaluation server submission for each of BERTBASE and BERTLARGE.

10https://gluebenchmark.com/leaderboard

11QANet is described in [Yu et al.](#page-11-4) [(2018)](#page-11-4), but the system has improved substantially after publication.
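To make the span scoring from Section 4.2 concrete, here is a small illustrative sketch (not from the paper and not an official implementation): given the start vector S, the end vector E, and the per-token final hidden vectors T as plain Python lists, it computes the start/end distributions and picks the highest-scoring span with j ≥ i.

```python
# Toy sketch of the SQuAD v1.1 span scoring (illustrative only, not the paper's code).
# S and E are the start and end vectors; T is the list of final hidden vectors T_i.
import math


def dot(u: list[float], v: list[float]) -> float:
    return sum(a * b for a, b in zip(u, v))


def softmax(scores: list[float]) -> list[float]:
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def score_spans(S: list[float], E: list[float], T: list[list[float]]):
    # P_i = e^{S.T_i} / sum_j e^{S.T_j}: the start distribution used in the training
    # objective; the analogous distribution over E.T_i is used for the end position.
    start_probs = softmax([dot(S, t) for t in T])
    end_probs = softmax([dot(E, t) for t in T])
    # At prediction time, the span (i, j) with the highest score S.T_i + E.T_j, j >= i, wins.
    best_span, best_score = (0, 0), float("-inf")
    for i in range(len(T)):
        for j in range(i, len(T)):
            score = dot(S, T[i]) + dot(E, T[j])
            if score > best_score:
                best_span, best_score = (i, j), score
    return start_probs, end_probs, best_span
```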
+ + + +| System | Dev | | Test | | | | | | +|------------------------------------------|------|------|------|------|--|--|--|--| +| | EM | F1 | EM | F1 | | | | | +| Top Leaderboard Systems (Dec 10th, 2018) | | | | | | | | | +| Human | - | - | 82.3 | 91.2 | | | | | +| #1 Ensemble - nlnet | - | - | 86.0 | 91.7 | | | | | +| #2 Ensemble - QANet | - | - | 84.5 | 90.5 | | | | | +| Published | | | | | | | | | +| BiDAF+ELMo (Single) | - | 85.6 | - | 85.8 | | | | | +| R.M. Reader (Ensemble) | 81.2 | 87.9 | 82.3 | 88.5 | | | | | +| Ours | | | | | | | | | +| BERTBASE (Single) | 80.8 | 88.5 | - | - | | | | | +| BERTLARGE (Single) | 84.1 | 90.9 | - | - | | | | | +| BERTLARGE (Ensemble) | 85.8 | 91.8 | - | - | | | | | +| BERTLARGE (Sgl.+TriviaQA) | 84.2 | 91.1 | 85.1 | 91.8 | | | | | +| BERTLARGE (Ens.+TriviaQA) | 86.2 | 92.2 | 87.4 | 93.2 | | | | | + +Table 2: SQuAD 1.1 results. The BERT ensemble is 7x systems which use different pre-training checkpoints and fine-tuning seeds. + + + +| System | Dev | | Test | | | | | | +|------------------------------------------|------|------|------|------|--|--|--|--| +| | EM | F1 | EM | F1 | | | | | +| Top Leaderboard Systems (Dec 10th, 2018) | | | | | | | | | +| Human | 86.3 | 89.0 | 86.9 | 89.5 | | | | | +| #1 Single - MIR-MRC (F-Net) | - | - | 74.8 | 78.0 | | | | | +| #2 Single - nlnet | - | - | 74.2 | 77.1 | | | | | +| Published | | | | | | | | | +| unet (Ensemble) | - | - | 71.4 | 74.9 | | | | | +| SLQA+ (Single) | - | | 71.4 | 74.4 | | | | | +| Ours | | | | | | | | | +| BERTLARGE (Single) | 78.7 | 81.9 | 80.0 | 83.1 | | | | | + +Table 3: SQuAD 2.0 results. We exclude entries that use BERT as one of their components. + +tuning data, we only lose 0.1-0.4 F1, still outperforming all existing systems by a wide margin.[12](#page-6-1) + +#### 4.3 SQuAD v2.0 + +The SQuAD 2.0 task extends the SQuAD 1.1 problem definition by allowing for the possibility that no short answer exists in the provided paragraph, making the problem more realistic. + +We use a simple approach to extend the SQuAD v1.1 BERT model for this task. We treat questions that do not have an answer as having an answer span with start and end at the [CLS] token. The probability space for the start and end answer span positions is extended to include the position of the [CLS] token. For prediction, we compare the score of the no-answer span: snull = S·C + E·C to the score of the best non-null span + + + +| System | Dev | Test | +|------------------------|------|------| +| ESIM+GloVe | 51.9 | 52.7 | +| ESIM+ELMo | 59.1 | 59.2 | +| OpenAI GPT | - | 78.0 | +| BERTBASE | 81.6 | - | +| BERTLARGE | 86.6 | 86.3 | +| Human (expert)† | - | 85.0 | +| Human (5 annotations)† | - | 88.0 | + +Table 4: SWAG Dev and Test accuracies. †Human performance is measured with 100 samples, as reported in the SWAG paper. + +sˆi,j = maxj≥iS·Ti + E·Tj . We predict a non-null answer when sˆi,j > snull + τ , where the threshold τ is selected on the dev set to maximize F1. We did not use TriviaQA data for this model. We fine-tuned for 2 epochs with a learning rate of 5e-5 and a batch size of 48. + +The results compared to prior leaderboard entries and top published work [(Sun et al.,](#page-10-22) [2018;](#page-10-22) [Wang et al.,](#page-11-5) [2018b)](#page-11-5) are shown in Table [3,](#page-6-2) excluding systems that use BERT as one of their components. We observe a +5.1 F1 improvement over the previous best system. 
+ +#### 4.4 SWAG + +The Situations With Adversarial Generations (SWAG) dataset contains 113k sentence-pair completion examples that evaluate grounded commonsense inference [(Zellers et al.,](#page-11-6) [2018)](#page-11-6). Given a sentence, the task is to choose the most plausible continuation among four choices. + +When fine-tuning on the SWAG dataset, we construct four input sequences, each containing the concatenation of the given sentence (sentence A) and a possible continuation (sentence B). The only task-specific parameters introduced is a vector whose dot product with the [CLS] token representation C denotes a score for each choice which is normalized with a softmax layer. + +We fine-tune the model for 3 epochs with a learning rate of 2e-5 and a batch size of 16. Results are presented in Table [4.](#page-6-3) BERTLARGE outperforms the authors' baseline ESIM+ELMo system by +27.1% and OpenAI GPT by 8.3%. + +### 5 Ablation Studies + +In this section, we perform ablation experiments over a number of facets of BERT in order to better understand their relative importance. Additional + +12The TriviaQA data we used consists of paragraphs from TriviaQA-Wiki formed of the first 400 tokens in documents, that contain at least one of the provided possible answers. + + + +| | Dev Set | | | | | | | +|--------------|---------|-------|-------|-------|-------|--|--| +| Tasks | MNLI-m | QNLI | MRPC | SST-2 | SQuAD | | | +| | (Acc) | (Acc) | (Acc) | (Acc) | (F1) | | | +| BERTBASE | 84.4 | 88.4 | 86.7 | 92.7 | 88.5 | | | +| No NSP | 83.9 | 84.9 | 86.5 | 92.6 | 87.9 | | | +| LTR & No NSP | 82.1 | 84.3 | 77.5 | 92.1 | 77.8 | | | +| + BiLSTM | 82.1 | 84.1 | 75.7 | 91.6 | 84.9 | | | + +Table 5: Ablation over the pre-training tasks using the BERTBASE architecture. "No NSP" is trained without the next sentence prediction task. "LTR & No NSP" is trained as a left-to-right LM without the next sentence prediction, like OpenAI GPT. "+ BiLSTM" adds a randomly initialized BiLSTM on top of the "LTR + No NSP" model during fine-tuning. + +ablation studies can be found in Appendix [C.](#page-15-1) + +### 5.1 Effect of Pre-training Tasks + +We demonstrate the importance of the deep bidirectionality of BERT by evaluating two pretraining objectives using exactly the same pretraining data, fine-tuning scheme, and hyperparameters as BERTBASE: + +No NSP: A bidirectional model which is trained using the "masked LM" (MLM) but without the "next sentence prediction" (NSP) task. + +LTR & No NSP: A left-context-only model which is trained using a standard Left-to-Right (LTR) LM, rather than an MLM. The left-only constraint was also applied at fine-tuning, because removing it introduced a pre-train/fine-tune mismatch that degraded downstream performance. Additionally, this model was pre-trained without the NSP task. This is directly comparable to OpenAI GPT, but using our larger training dataset, our input representation, and our fine-tuning scheme. + +We first examine the impact brought by the NSP task. In Table [5,](#page-7-2) we show that removing NSP hurts performance significantly on QNLI, MNLI, and SQuAD 1.1. Next, we evaluate the impact of training bidirectional representations by comparing "No NSP" to "LTR & No NSP". The LTR model performs worse than the MLM model on all tasks, with large drops on MRPC and SQuAD. + +For SQuAD it is intuitively clear that a LTR model will perform poorly at token predictions, since the token-level hidden states have no rightside context. 
In order to make a good faith attempt at strengthening the LTR system, we added a randomly initialized BiLSTM on top. This does significantly improve results on SQuAD, but the results are still far worse than those of the pretrained bidirectional models. The BiLSTM hurts performance on the GLUE tasks. + +We recognize that it would also be possible to train separate LTR and RTL models and represent each token as the concatenation of the two models, as ELMo does. However: (a) this is twice as expensive as a single bidirectional model; (b) this is non-intuitive for tasks like QA, since the RTL model would not be able to condition the answer on the question; (c) this it is strictly less powerful than a deep bidirectional model, since it can use both left and right context at every layer. + +### 5.2 Effect of Model Size + +In this section, we explore the effect of model size on fine-tuning task accuracy. We trained a number of BERT models with a differing number of layers, hidden units, and attention heads, while otherwise using the same hyperparameters and training procedure as described previously. + +Results on selected GLUE tasks are shown in Table [6.](#page-8-0) In this table, we report the average Dev Set accuracy from 5 random restarts of fine-tuning. We can see that larger models lead to a strict accuracy improvement across all four datasets, even for MRPC which only has 3,600 labeled training examples, and is substantially different from the pre-training tasks. It is also perhaps surprising that we are able to achieve such significant improvements on top of models which are already quite large relative to the existing literature. For example, the largest Transformer explored in [Vaswani et al.](#page-10-4) [(2017)](#page-10-4) is (L=6, H=1024, A=16) with 100M parameters for the encoder, and the largest Transformer we have found in the literature is (L=64, H=512, A=2) with 235M parameters [(Al-Rfou et al.,](#page-9-16) [2018)](#page-9-16). By contrast, BERTBASE contains 110M parameters and BERTLARGE contains 340M parameters. + +It has long been known that increasing the model size will lead to continual improvements on large-scale tasks such as machine translation and language modeling, which is demonstrated by the LM perplexity of held-out training data shown in Table [6.](#page-8-0) However, we believe that this is the first work to demonstrate convincingly that scaling to extreme model sizes also leads to large improvements on very small scale tasks, provided that the model has been sufficiently pre-trained. [Peters et al.](#page-10-23) [(2018b)](#page-10-23) presented mixed results on the downstream task impact of increasing the pre-trained bi-LM size from two to four layers and [Melamud et al.](#page-10-15) [(2016)](#page-10-15) mentioned in passing that increasing hidden dimension size from 200 to 600 helped, but increasing further to 1,000 did not bring further improvements. Both of these prior works used a featurebased approach — we hypothesize that when the model is fine-tuned directly on the downstream tasks and uses only a very small number of randomly initialized additional parameters, the taskspecific models can benefit from the larger, more expressive pre-trained representations even when downstream task data is very small. 
+ +#### 5.3 Feature-based Approach with BERT + +All of the BERT results presented so far have used the fine-tuning approach, where a simple classification layer is added to the pre-trained model, and all parameters are jointly fine-tuned on a downstream task. However, the feature-based approach, where fixed features are extracted from the pretrained model, has certain advantages. First, not all tasks can be easily represented by a Transformer encoder architecture, and therefore require a task-specific model architecture to be added. Second, there are major computational benefits to pre-compute an expensive representation of the training data once and then run many experiments with cheaper models on top of this representation. + +In this section, we compare the two approaches by applying BERT to the CoNLL-2003 Named Entity Recognition (NER) task [(Tjong Kim Sang](#page-10-2) [and De Meulder,](#page-10-2) [2003)](#page-10-2). In the input to BERT, we use a case-preserving WordPiece model, and we include the maximal document context provided by the data. Following standard practice, we formulate this as a tagging task but do not use a CRF + + + +| | Hyperparams | | | Dev Set Accuracy | | | | | +|-------------------------|----------------------------------|---------------------------|--------------------------------------|--------------------------------------|--------------------------------------|--------------------------------------|--|--| +| #L | #H | #A | LM (ppl) | MNLI-m | MRPC | SST-2 | | | +| 3
| 768 | 12 | 5.84 | 77.9 | 79.8 | 88.4 | | |
| 6 | 768 | 3 | 5.24 | 80.6 | 82.2 | 90.7 | | |
| 6 | 768 | 12 | 4.68 | 81.9 | 84.8 | 91.3 | | |
| 12 | 768 | 12 | 3.99 | 84.4 | 86.7 | 92.9 | | |
| 12 | 1024 | 16 | 3.54 | 85.7 | 86.9 |
93.3 | | | +| 24 | 1024 | 16 | 3.23 | 86.6 | 87.8 | 93.7 | | | + +Table 6: Ablation over BERT model size. #L = the number of layers; #H = hidden size; #A = number of attention heads. "LM (ppl)" is the masked LM perplexity of held-out training data. + + + +| System | Dev F1 | Test F1 | +|-----------------------------------|--------|---------| +| ELMo (Peters et al., 2018a) | 95.7 | 92.2 | +| CVT (Clark et al., 2018) | - | 92.6 | +| CSE (Akbik et al., 2018) | - | 93.1 | +| Fine-tuning approach | | | +| BERTLARGE | 96.6 | 92.8 | +| BERTBASE | 96.4 | 92.4 | +| Feature-based approach (BERTBASE) | | | +| Embeddings | 91.0 | - | +| Second-to-Last Hidden | 95.6 | - | +| Last Hidden | 94.9 | - | +| Weighted Sum Last Four Hidden | 95.9 | - | +| Concat Last Four Hidden | 96.1 | - | +| Weighted Sum All 12 Layers | 95.5 | - | + +Table 7: CoNLL-2003 Named Entity Recognition results. Hyperparameters were selected using the Dev set. The reported Dev and Test scores are averaged over 5 random restarts using those hyperparameters. + +layer in the output. We use the representation of the first sub-token as the input to the token-level classifier over the NER label set. + +To ablate the fine-tuning approach, we apply the feature-based approach by extracting the activations from one or more layers *without* fine-tuning any parameters of BERT. These contextual embeddings are used as input to a randomly initialized two-layer 768-dimensional BiLSTM before the classification layer. + +Results are presented in Table [7.](#page-8-1) BERTLARGE performs competitively with state-of-the-art methods. The best performing method concatenates the token representations from the top four hidden layers of the pre-trained Transformer, which is only 0.3 F1 behind fine-tuning the entire model. This demonstrates that BERT is effective for both finetuning and feature-based approaches. + +### 6 Conclusion + +Recent empirical improvements due to transfer learning with language models have demonstrated that rich, unsupervised pre-training is an integral part of many language understanding systems. In particular, these results enable even low-resource tasks to benefit from deep unidirectional architectures. Our major contribution is further generalizing these findings to deep *bidirectional* architectures, allowing the same pre-trained model to successfully tackle a broad set of NLP tasks. + +### References + +- Alan Akbik, Duncan Blythe, and Roland Vollgraf. 2018. Contextual string embeddings for sequence labeling. In *Proceedings of the 27th International Conference on Computational Linguistics*, pages 1638–1649. +- Rami Al-Rfou, Dokook Choe, Noah Constant, Mandy Guo, and Llion Jones. 2018. Character-level language modeling with deeper self-attention. *arXiv preprint arXiv:1808.04444*. +- Rie Kubota Ando and Tong Zhang. 2005. A framework for learning predictive structures from multiple tasks and unlabeled data. *Journal of Machine Learning Research*, 6(Nov):1817–1853. +- Luisa Bentivogli, Bernardo Magnini, Ido Dagan, Hoa Trang Dang, and Danilo Giampiccolo. 2009. The fifth PASCAL recognizing textual entailment challenge. In *TAC*. NIST. +- John Blitzer, Ryan McDonald, and Fernando Pereira. 2006. Domain adaptation with structural correspondence learning. In *Proceedings of the 2006 conference on empirical methods in natural language processing*, pages 120–128. Association for Computational Linguistics. +- Samuel R. Bowman, Gabor Angeli, Christopher Potts, and Christopher D. Manning. 2015. 
A large annotated corpus for learning natural language inference. In *EMNLP*. Association for Computational Linguistics. +- Peter F Brown, Peter V Desouza, Robert L Mercer, Vincent J Della Pietra, and Jenifer C Lai. 1992. Class-based n-gram models of natural language. *Computational linguistics*, 18(4):467–479. +- Daniel Cer, Mona Diab, Eneko Agirre, Inigo Lopez-Gazpio, and Lucia Specia. 2017. [Semeval-2017](https://doi.org/10.18653/v1/S17-2001) [task 1: Semantic textual similarity multilingual and](https://doi.org/10.18653/v1/S17-2001) [crosslingual focused evaluation.](https://doi.org/10.18653/v1/S17-2001) In *Proceedings of the 11th International Workshop on Semantic Evaluation (SemEval-2017)*, pages 1–14, Vancouver, Canada. Association for Computational Linguistics. +- Ciprian Chelba, Tomas Mikolov, Mike Schuster, Qi Ge, Thorsten Brants, Phillipp Koehn, and Tony Robinson. 2013. One billion word benchmark for measuring progress in statistical language modeling. *arXiv preprint arXiv:1312.3005*. +- Z. Chen, H. Zhang, X. Zhang, and L. Zhao. 2018. [Quora question pairs.](https://data.quora.com/First-Quora-Dataset-Release-Question-Pairs) +- Christopher Clark and Matt Gardner. 2018. Simple and effective multi-paragraph reading comprehension. In *ACL*. +- Kevin Clark, Minh-Thang Luong, Christopher D Manning, and Quoc Le. 2018. Semi-supervised sequence modeling with cross-view training. In *Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing*, pages 1914– 1925. +- Ronan Collobert and Jason Weston. 2008. A unified architecture for natural language processing: Deep neural networks with multitask learning. In *Proceedings of the 25th international conference on Machine learning*, pages 160–167. ACM. +- Alexis Conneau, Douwe Kiela, Holger Schwenk, Lo¨ıc Barrault, and Antoine Bordes. 2017. [Supervised](https://www.aclweb.org/anthology/D17-1070) [learning of universal sentence representations from](https://www.aclweb.org/anthology/D17-1070) [natural language inference data.](https://www.aclweb.org/anthology/D17-1070) In *Proceedings of the 2017 Conference on Empirical Methods in Natural Language Processing*, pages 670–680, Copenhagen, Denmark. Association for Computational Linguistics. +- Andrew M Dai and Quoc V Le. 2015. Semi-supervised sequence learning. In *Advances in neural information processing systems*, pages 3079–3087. +- J. Deng, W. Dong, R. Socher, L.-J. Li, K. Li, and L. Fei-Fei. 2009. ImageNet: A Large-Scale Hierarchical Image Database. In *CVPR09*. +- William B Dolan and Chris Brockett. 2005. Automatically constructing a corpus of sentential paraphrases. In *Proceedings of the Third International Workshop on Paraphrasing (IWP2005)*. +- William Fedus, Ian Goodfellow, and Andrew M Dai. 2018. Maskgan: Better text generation via filling in the . *arXiv preprint arXiv:1801.07736*. +- Dan Hendrycks and Kevin Gimpel. 2016. [Bridging](http://arxiv.org/abs/1606.08415) [nonlinearities and stochastic regularizers with gaus](http://arxiv.org/abs/1606.08415)[sian error linear units.](http://arxiv.org/abs/1606.08415) *CoRR*, abs/1606.08415. +- Felix Hill, Kyunghyun Cho, and Anna Korhonen. 2016. Learning distributed representations of sentences from unlabelled data. In *Proceedings of the 2016 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies*. Association for Computational Linguistics. +- Jeremy Howard and Sebastian Ruder. 2018. 
[Universal](http://arxiv.org/abs/1801.06146) [language model fine-tuning for text classification.](http://arxiv.org/abs/1801.06146) In *ACL*. Association for Computational Linguistics. +- Minghao Hu, Yuxing Peng, Zhen Huang, Xipeng Qiu, Furu Wei, and Ming Zhou. 2018. Reinforced mnemonic reader for machine reading comprehension. In *IJCAI*. +- Yacine Jernite, Samuel R. Bowman, and David Sontag. 2017. [Discourse-based objectives for fast un](http://arxiv.org/abs/1705.00557)[supervised sentence representation learning.](http://arxiv.org/abs/1705.00557) *CoRR*, abs/1705.00557. +- Mandar Joshi, Eunsol Choi, Daniel S Weld, and Luke Zettlemoyer. 2017. Triviaqa: A large scale distantly supervised challenge dataset for reading comprehension. In *ACL*. +- Ryan Kiros, Yukun Zhu, Ruslan R Salakhutdinov, Richard Zemel, Raquel Urtasun, Antonio Torralba, and Sanja Fidler. 2015. Skip-thought vectors. In *Advances in neural information processing systems*, pages 3294–3302. +- Quoc Le and Tomas Mikolov. 2014. Distributed representations of sentences and documents. In *International Conference on Machine Learning*, pages 1188–1196. +- Hector J Levesque, Ernest Davis, and Leora Morgenstern. 2011. The winograd schema challenge. In *Aaai spring symposium: Logical formalizations of commonsense reasoning*, volume 46, page 47. +- Lajanugen Logeswaran and Honglak Lee. 2018. [An](https://openreview.net/forum?id=rJvJXZb0W) [efficient framework for learning sentence represen](https://openreview.net/forum?id=rJvJXZb0W)[tations.](https://openreview.net/forum?id=rJvJXZb0W) In *International Conference on Learning Representations*. +- Bryan McCann, James Bradbury, Caiming Xiong, and Richard Socher. 2017. Learned in translation: Contextualized word vectors. In *NIPS*. +- Oren Melamud, Jacob Goldberger, and Ido Dagan. 2016. context2vec: Learning generic context embedding with bidirectional LSTM. In *CoNLL*. +- Tomas Mikolov, Ilya Sutskever, Kai Chen, Greg S Corrado, and Jeff Dean. 2013. Distributed representations of words and phrases and their compositionality. In *Advances in Neural Information Processing Systems 26*, pages 3111–3119. Curran Associates, Inc. +- Andriy Mnih and Geoffrey E Hinton. 2009. [A scal](http://papers.nips.cc/paper/3583-a-scalable-hierarchical-distributed-language-model.pdf)[able hierarchical distributed language model.](http://papers.nips.cc/paper/3583-a-scalable-hierarchical-distributed-language-model.pdf) In D. Koller, D. Schuurmans, Y. Bengio, and L. Bottou, editors, *Advances in Neural Information Processing Systems 21*, pages 1081–1088. Curran Associates, Inc. +- Ankur P Parikh, Oscar Tackstr ¨ om, Dipanjan Das, and ¨ Jakob Uszkoreit. 2016. A decomposable attention model for natural language inference. In *EMNLP*. +- Jeffrey Pennington, Richard Socher, and Christopher D. Manning. 2014. [Glove: Global vectors for](http://www.aclweb.org/anthology/D14-1162) [word representation.](http://www.aclweb.org/anthology/D14-1162) In *Empirical Methods in Natural Language Processing (EMNLP)*, pages 1532– 1543. +- Matthew Peters, Waleed Ammar, Chandra Bhagavatula, and Russell Power. 2017. Semi-supervised sequence tagging with bidirectional language models. In *ACL*. +- Matthew Peters, Mark Neumann, Mohit Iyyer, Matt Gardner, Christopher Clark, Kenton Lee, and Luke Zettlemoyer. 2018a. Deep contextualized word representations. In *NAACL*. +- Matthew Peters, Mark Neumann, Luke Zettlemoyer, and Wen-tau Yih. 2018b. Dissecting contextual word embeddings: Architecture and representation. 
In *Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing*, pages 1499–1509. +- Alec Radford, Karthik Narasimhan, Tim Salimans, and Ilya Sutskever. 2018. Improving language understanding with unsupervised learning. Technical report, OpenAI. +- Pranav Rajpurkar, Jian Zhang, Konstantin Lopyrev, and Percy Liang. 2016. Squad: 100,000+ questions for machine comprehension of text. In *Proceedings of the 2016 Conference on Empirical Methods in Natural Language Processing*, pages 2383–2392. +- Minjoon Seo, Aniruddha Kembhavi, Ali Farhadi, and Hannaneh Hajishirzi. 2017. Bidirectional attention flow for machine comprehension. In *ICLR*. +- Richard Socher, Alex Perelygin, Jean Wu, Jason Chuang, Christopher D Manning, Andrew Ng, and Christopher Potts. 2013. Recursive deep models for semantic compositionality over a sentiment treebank. In *Proceedings of the 2013 conference on empirical methods in natural language processing*, pages 1631–1642. +- Fu Sun, Linyang Li, Xipeng Qiu, and Yang Liu. 2018. U-net: Machine reading comprehension with unanswerable questions. *arXiv preprint arXiv:1810.06638*. +- Wilson L Taylor. 1953. Cloze procedure: A new tool for measuring readability. *Journalism Bulletin*, 30(4):415–433. +- Erik F Tjong Kim Sang and Fien De Meulder. 2003. Introduction to the conll-2003 shared task: Language-independent named entity recognition. In *CoNLL*. +- Joseph Turian, Lev Ratinov, and Yoshua Bengio. 2010. Word representations: A simple and general method for semi-supervised learning. In *Proceedings of the 48th Annual Meeting of the Association for Computational Linguistics*, ACL '10, pages 384–394. +- Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N Gomez, Lukasz Kaiser, and Illia Polosukhin. 2017. Attention is all you need. In *Advances in Neural Information Processing Systems*, pages 6000–6010. +- Pascal Vincent, Hugo Larochelle, Yoshua Bengio, and Pierre-Antoine Manzagol. 2008. Extracting and composing robust features with denoising autoencoders. In *Proceedings of the 25th international conference on Machine learning*, pages 1096–1103. ACM. +- Alex Wang, Amanpreet Singh, Julian Michael, Felix Hill, Omer Levy, and Samuel Bowman. 2018a. Glue: A multi-task benchmark and analysis platform + +for natural language understanding. In *Proceedings of the 2018 EMNLP Workshop BlackboxNLP: Analyzing and Interpreting Neural Networks for NLP*, pages 353–355. + +- Wei Wang, Ming Yan, and Chen Wu. 2018b. Multigranularity hierarchical attention fusion networks for reading comprehension and question answering. In *Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)*. Association for Computational Linguistics. +- Alex Warstadt, Amanpreet Singh, and Samuel R Bowman. 2018. Neural network acceptability judgments. *arXiv preprint arXiv:1805.12471*. +- Adina Williams, Nikita Nangia, and Samuel R Bowman. 2018. A broad-coverage challenge corpus for sentence understanding through inference. In *NAACL*. +- Yonghui Wu, Mike Schuster, Zhifeng Chen, Quoc V Le, Mohammad Norouzi, Wolfgang Macherey, Maxim Krikun, Yuan Cao, Qin Gao, Klaus Macherey, et al. 2016. Google's neural machine translation system: Bridging the gap between human and machine translation. *arXiv preprint arXiv:1609.08144*. +- Jason Yosinski, Jeff Clune, Yoshua Bengio, and Hod Lipson. 2014. How transferable are features in deep neural networks? In *Advances in neural information processing systems*, pages 3320–3328. 
+- Adams Wei Yu, David Dohan, Minh-Thang Luong, Rui Zhao, Kai Chen, Mohammad Norouzi, and Quoc V Le. 2018. QANet: Combining local convolution with global self-attention for reading comprehension. In *ICLR*. +- Rowan Zellers, Yonatan Bisk, Roy Schwartz, and Yejin Choi. 2018. Swag: A large-scale adversarial dataset for grounded commonsense inference. In *Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing (EMNLP)*. +- Yukun Zhu, Ryan Kiros, Rich Zemel, Ruslan Salakhutdinov, Raquel Urtasun, Antonio Torralba, and Sanja Fidler. 2015. Aligning books and movies: Towards story-like visual explanations by watching movies and reading books. In *Proceedings of the IEEE international conference on computer vision*, pages 19–27. + +# Appendix for "BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding" + +We organize the appendix into three sections: + +- Additional implementation details for BERT are presented in Appendix [A;](#page-11-7) +- Additional details for our experiments are presented in Appendix [B;](#page-13-2) and +- Additional ablation studies are presented in Appendix [C.](#page-15-1) + +We present additional ablation studies for BERT including: + +- Effect of Number of Training Steps; and +- Ablation for Different Masking Procedures. + +### A Additional Details for BERT + +### A.1 Illustration of the Pre-training Tasks + +We provide examples of the pre-training tasks in the following. + +Masked LM and the Masking Procedure Assuming the unlabeled sentence is my dog is hairy, and during the random masking procedure we chose the 4-th token (which corresponds to hairy), our masking procedure can be further illustrated by + +- 80% of the time: Replace the word with the [MASK] token, e.g., my dog is hairy → my dog is [MASK] +- 10% of the time: Replace the word with a random word, e.g., my dog is hairy → my dog is apple +- 10% of the time: Keep the word unchanged, e.g., my dog is hairy → my dog is hairy. The purpose of this is to bias the representation towards the actual observed word. + +The advantage of this procedure is that the Transformer encoder does not know which words it will be asked to predict or which have been replaced by random words, so it is forced to keep a distributional contextual representation of *every* input token. Additionally, because random replacement only occurs for 1.5% of all tokens (i.e., 10% of 15%), this does not seem to harm the model's language understanding capability. In Section [C.2,](#page-15-0) we evaluate the impact of this procedure. + +Compared to standard language model training, the masked LM only makes predictions on 15% of tokens in each batch, which suggests that more pre-training steps may be required for the model to converge. In Section [C.1](#page-15-2) we demonstrate that MLM does converge marginally slower than a left-to-right model (which predicts every token), but the empirical improvements of the MLM model far outweigh the increased training cost. + +Figure 3: Differences in pre-training model architectures. BERT uses a bidirectional Transformer. OpenAI GPT uses a left-to-right Transformer. ELMo uses the concatenation of independently trained left-to-right and right-to-left LSTMs to generate features for downstream tasks. Among the three, only BERT representations are jointly conditioned on both left and right context in all layers. In addition to the architecture differences, BERT and OpenAI GPT are fine-tuning approaches, while ELMo is a feature-based approach.
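As a concrete companion to the masking procedure illustrated above, the sketch below applies the 80%/10%/10% rule to token positions that have already been selected for prediction. This is a minimal illustration rather than the authors' pre-training code; the function name, the toy vocabulary, and the assumption that WordPiece tokenization and the 15% selection have already happened are all ours.

```python
import random

MASK_TOKEN = "[MASK]"

def apply_mlm_masking(tokens, chosen_positions, vocab, rng=random):
    """Apply the 80/10/10 corruption rule to positions chosen for prediction (15% of tokens)."""
    corrupted = list(tokens)
    targets = []
    for pos in chosen_positions:
        targets.append((pos, tokens[pos]))      # the model must always predict the original token
        r = rng.random()
        if r < 0.8:
            corrupted[pos] = MASK_TOKEN         # 80%: replace with [MASK]
        elif r < 0.9:
            corrupted[pos] = rng.choice(vocab)  # 10%: replace with a random word
        # remaining 10%: keep the word unchanged
    return corrupted, targets

# The appendix example: "my dog is hairy" with the 4th token chosen for prediction
print(apply_mlm_masking(["my", "dog", "is", "hairy"], [3], ["apple", "book", "car"]))
```

Whichever branch is taken, the prediction target is the original token, which is why the encoder has to maintain a contextual representation of every input position.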
+ +Next Sentence Prediction The next sentence prediction task can be illustrated in the following examples. + +Input = [CLS] the man went to [MASK] store [SEP] he bought a gallon [MASK] milk [SEP] Label = IsNext + +Input = [CLS] the man [MASK] to the store [SEP] penguin [MASK] are flight ##less birds [SEP] Label = NotNext + +#### A.2 Pre-training Procedure + +To generate each training input sequence, we sample two spans of text from the corpus, which we refer to as "sentences" even though they are typically much longer than single sentences (but can be shorter also). The first sentence receives the A embedding and the second receives the B embedding. 50% of the time B is the actual next sentence that follows A and 50% of the time it is a random sentence, which is done for the "next sentence prediction" task. They are sampled such that the combined length is ≤ 512 tokens. The LM masking is applied after WordPiece tokenization with a uniform masking rate of 15%, and no special consideration given to partial word pieces. + +We train with batch size of 256 sequences (256 sequences * 512 tokens = 128,000 tokens/batch) for 1,000,000 steps, which is approximately 40 epochs over the 3.3 billion word corpus. We use Adam with learning rate of 1e-4, β1 = 0.9, β2 = 0.999, L2 weight decay of 0.01, learning rate warmup over the first 10,000 steps, and linear decay of the learning rate. We use a dropout probability of 0.1 on all layers. We use a gelu activation [(Hendrycks and Gimpel,](#page-9-19) [2016)](#page-9-19) rather than the standard relu, following OpenAI GPT. The training loss is the sum of the mean masked LM likelihood and the mean next sentence prediction likelihood. + +Training of BERTBASE was performed on 4 Cloud TPUs in Pod configuration (16 TPU chips total).[13](#page-12-0) Training of BERTLARGE was performed on 16 Cloud TPUs (64 TPU chips total). Each pretraining took 4 days to complete. + +Longer sequences are disproportionately expensive because attention is quadratic to the sequence length. To speed up pre-training in our experiments, we pre-train the model with a sequence length of 128 for 90% of the steps. Then, we train the remaining 10% of the steps with a sequence length of 512 to learn the positional embeddings. + +#### A.3 Fine-tuning Procedure + +For fine-tuning, most model hyperparameters are the same as in pre-training, with the exception of the batch size, learning rate, and number of training epochs. The dropout probability was always kept at 0.1. The optimal hyperparameter values are task-specific, but we found the following range of possible values to work well across all tasks: + +• Batch size: 16, 32 + +• Learning rate (Adam): 5e-5, 3e-5, 2e-5 + +• Number of epochs: 2, 3, 4 + +13https://cloudplatform.googleblog.com/2018/06/Cloud-TPU-now-offers-preemptible-pricing-and-globalavailability.html + +We also observed that large data sets (e.g., 100k+ labeled training examples) were far less sensitive to hyperparameter choice than small data sets. Fine-tuning is typically very fast, so it is reasonable to simply run an exhaustive search over the above parameters and choose the model that performs best on the development set. + +# A.4 Comparison of BERT, ELMo, and OpenAI GPT + +Here we study the differences in recent popular representation learning models, including ELMo, OpenAI GPT, and BERT.
The comparisons between the model architectures are shown visually in Figure [3.](#page-12-1) Note that in addition to the architecture differences, BERT and OpenAI GPT are finetuning approaches, while ELMo is a feature-based approach. + +The most comparable existing pre-training method to BERT is OpenAI GPT, which trains a left-to-right Transformer LM on a large text corpus. In fact, many of the design decisions in BERT were intentionally made to make it as close to GPT as possible so that the two methods could be minimally compared. The core argument of this work is that the bi-directionality and the two pretraining tasks presented in Section [3.1](#page-3-2) account for the majority of the empirical improvements, but we do note that there are several other differences between how BERT and GPT were trained: + +- GPT is trained on the BooksCorpus (800M words); BERT is trained on the BooksCorpus (800M words) and Wikipedia (2,500M words). +- GPT uses a sentence separator ([SEP]) and classifier token ([CLS]) which are only introduced at fine-tuning time; BERT learns [SEP], [CLS] and sentence A/B embeddings during pre-training. +- GPT was trained for 1M steps with a batch size of 32,000 words; BERT was trained for 1M steps with a batch size of 128,000 words. +- GPT used the same learning rate of 5e-5 for all fine-tuning experiments; BERT chooses a task-specific fine-tuning learning rate which performs the best on the development set. + +To isolate the effect of these differences, we perform ablation experiments in Section [5.1](#page-7-0) which demonstrate that the majority of the improvements are in fact coming from the two pre-training tasks and the bidirectionality they enable. + +# A.5 Illustrations of Fine-tuning on Different Tasks + +The illustration of fine-tuning BERT on different tasks can be seen in Figure [4.](#page-14-0) Our task-specific models are formed by incorporating BERT with one additional output layer, so a minimal number of parameters need to be learned from scratch. Among the tasks, (a) and (b) are sequence-level tasks while (c) and (d) are token-level tasks. In the figure, E represents the input embedding, Ti represents the contextual representation of token i, [CLS] is the special symbol for classification output, and [SEP] is the special symbol to separate non-consecutive token sequences. + +# B Detailed Experimental Setup + +# B.1 Detailed Descriptions for the GLUE Benchmark Experiments. + +Our GLUE results in Tabl[e1](#page-5-1) are obtained from [https://gluebenchmark.com/](https://gluebenchmark.com/leaderboard) [leaderboard](https://gluebenchmark.com/leaderboard) and [https://blog.](https://blog.openai.com/language-unsupervised) [openai.com/language-unsupervised](https://blog.openai.com/language-unsupervised). The GLUE benchmark includes the following datasets, the descriptions of which were originally summarized in [Wang et al.](#page-10-16) [(2018a)](#page-10-16): + +MNLI Multi-Genre Natural Language Inference is a large-scale, crowdsourced entailment classification task [(Williams et al.,](#page-11-0) [2018)](#page-11-0). Given a pair of sentences, the goal is to predict whether the second sentence is an *entailment*, *contradiction*, or *neutral* with respect to the first one. + +QQP Quora Question Pairs is a binary classification task where the goal is to determine if two questions asked on Quora are semantically equivalent [(Chen et al.,](#page-9-20) [2018)](#page-9-20). 
+ +QNLI Question Natural Language Inference is a version of the Stanford Question Answering Dataset [(Rajpurkar et al.,](#page-10-3) [2016)](#page-10-3) which has been converted to a binary classification task [(Wang](#page-10-16) [et al.,](#page-10-16) [2018a)](#page-10-16). The positive examples are (question, sentence) pairs which do contain the correct answer, and the negative examples are (question, sentence) from the same paragraph which do not contain the answer. + +Figure 4: Illustrations of Fine-tuning BERT on Different Tasks. + +SST-2 The Stanford Sentiment Treebank is a binary single-sentence classification task consisting of sentences extracted from movie reviews with human annotations of their sentiment [(Socher](#page-10-14) [et al.,](#page-10-14) [2013)](#page-10-14). + +CoLA The Corpus of Linguistic Acceptability is a binary single-sentence classification task, where the goal is to predict whether an English sentence is linguistically "acceptable" or not [(Warstadt](#page-11-8) [et al.,](#page-11-8) [2018)](#page-11-8). + +STS-B The Semantic Textual Similarity Benchmark is a collection of sentence pairs drawn from news headlines and other sources [(Cer et al.,](#page-9-21) [2017)](#page-9-21). They were annotated with a score from 1 to 5 denoting how similar the two sentences are in terms of semantic meaning. + +MRPC Microsoft Research Paraphrase Corpus consists of sentence pairs automatically extracted from online news sources, with human annotations for whether the sentences in the pair are semantically equivalent [(Dolan and Brockett,](#page-9-3) [2005)](#page-9-3). + +RTE Recognizing Textual Entailment is a binary entailment task similar to MNLI, but with much less training data [(Bentivogli et al.,](#page-9-22) [2009)](#page-9-22).[14](#page-14-1) + +WNLI Winograd NLI is a small natural language inference dataset [(Levesque et al.,](#page-10-24) [2011)](#page-10-24). The GLUE webpage notes that there are issues with the construction of this dataset, [15](#page-14-2) and every trained system that's been submitted to GLUE has performed worse than the 65.1 baseline accuracy of predicting the majority class. We therefore exclude this set to be fair to OpenAI GPT. For our GLUE submission, we always predicted the ma- + +14Note that we only report single-task fine-tuning results in this paper. A multitask fine-tuning approach could potentially push the performance even further. For example, we did observe substantial improvements on RTE from multitask training with MNLI. + +15 + +jority class. + +### C Additional Ablation Studies + +#### C.1 Effect of Number of Training Steps + +Figure [5](#page-15-3) presents MNLI Dev accuracy after finetuning from a checkpoint that has been pre-trained for k steps. This allows us to answer the following questions: + +- 1. Question: Does BERT really need such a large amount of pre-training (128,000 words/batch * 1,000,000 steps) to achieve high fine-tuning accuracy? +Answer: Yes, BERTBASE achieves almost 1.0% additional accuracy on MNLI when trained on 1M steps compared to 500k steps. + +- 2. Question: Does MLM pre-training converge slower than LTR pre-training, since only 15% of words are predicted in each batch rather than every word? +Answer: The MLM model does converge slightly slower than the LTR model. However, in terms of absolute accuracy the MLM model begins to outperform the LTR model almost immediately. 
+ +### C.2 Ablation for Different Masking Procedures + +In Section [3.1,](#page-3-2) we mention that BERT uses a mixed strategy for masking the target tokens when pre-training with the masked language model (MLM) objective. The following is an ablation study to evaluate the effect of different masking strategies. + +Figure 5: Ablation over number of training steps. This shows the MNLI accuracy after fine-tuning, starting from model parameters that have been pre-trained for k steps. The x-axis is the value of k. + +Note that the purpose of the masking strategies is to reduce the mismatch between pre-training and fine-tuning, as the [MASK] symbol never appears during the fine-tuning stage. We report the Dev results for both MNLI and NER. For NER, we report both fine-tuning and feature-based approaches, as we expect the mismatch will be amplified for the feature-based approach as the model will not have the chance to adjust the representations. + + + +| Masking Rates | | | Dev Set Results | | | | +|---------------|------|------|-------------------|-----------------------------------|------|--| +| MASK | SAME | RND | MNLI
Fine-tune | NER Fine-tune | NER
Feature-based | | | +| 80% | 10% | 10% | 84.2 | 95.4 | 94.9 | | +| 100% | 0% | 0% | 84.3 | 94.9 | 94.0 | | +| 80% | 0% | 20% | 84.1 | 95.2 | 94.6 | | +| 80% | 20% | 0% | 84.4 | 95.2 | 94.7 | | +| 0% | 20% | 80% | 83.7 | 94.8 | 94.6 | | +| 0% | 0% | 100% | 83.6 | 94.9 | 94.6 | | + +Table 8: Ablation over different masking strategies. + +The results are presented in Table [8.](#page-15-4) In the table, MASK means that we replace the target token with the [MASK] symbol for MLM; SAME means that we keep the target token as is; RND means that we replace the target token with another random token. + +The numbers in the left part of the table represent the probabilities of the specific strategies used during MLM pre-training (BERT uses 80%, 10%, 10%). The right part of the paper represents the Dev set results. For the feature-based approach, we concatenate the last 4 layers of BERT as the features, which was shown to be the best approach in Section [5.3.](#page-8-2) + +From the table it can be seen that fine-tuning is surprisingly robust to different masking strategies. However, as expected, using only the MASK strategy was problematic when applying the featurebased approach to NER. Interestingly, using only the RND strategy performs much worse than our strategy as well. \ No newline at end of file diff --git a/examples/fastapi_server_docker/main.py b/examples/fastapi_server_docker/main.py index 5d7bdd18b..11ba61d06 100644 --- a/examples/fastapi_server_docker/main.py +++ b/examples/fastapi_server_docker/main.py @@ -1,46 +1,45 @@ import cocoindex import uvicorn -import os - -from fastapi import FastAPI from dotenv import load_dotenv +from fastapi import FastAPI, Query +from psycopg_pool import ConnectionPool +import os -@cocoindex.op.function() -def extract_extension(filename: str) -> str: - """Extract the extension of a filename.""" - return os.path.splitext(filename)[1] - -def code_to_embedding(text: cocoindex.DataSlice) -> cocoindex.DataSlice: +@cocoindex.transform_flow() +def text_to_embedding(text: cocoindex.DataSlice[str]) -> cocoindex.DataSlice[list[float]]: """ Embed the text using a SentenceTransformer model. + This is a shared logic between indexing and querying. """ return text.transform( cocoindex.functions.SentenceTransformerEmbed( model="sentence-transformers/all-MiniLM-L6-v2")) -@cocoindex.flow_def(name="CodeEmbeddingFastApiExample") -def code_embedding_flow(flow_builder: cocoindex.FlowBuilder, data_scope: cocoindex.DataScope): +@cocoindex.flow_def(name="MarkdownEmbeddingFastApiExample") +def markdown_embedding_flow(flow_builder: cocoindex.FlowBuilder, data_scope: cocoindex.DataScope): """ - Define an example flow that embeds files into a vector database. + Define an example flow that embeds markdown files into a vector database. 
""" - data_scope["files"] = flow_builder.add_source( - cocoindex.sources.LocalFile(path="./", - included_patterns=["*.py", "*.rs", "*.toml", "*.md", "*.mdx", "*.ts", "*.tsx"], - excluded_patterns=[".*", "target", "**/node_modules"])) - code_embeddings = data_scope.add_collector() + data_scope["documents"] = flow_builder.add_source( + cocoindex.sources.LocalFile(path="files")) + doc_embeddings = data_scope.add_collector() - with data_scope["files"].row() as file: - file["extension"] = file["filename"].transform(extract_extension) - file["chunks"] = file["content"].transform( + with data_scope["documents"].row() as doc: + doc["chunks"] = doc["content"].transform( cocoindex.functions.SplitRecursively(), - language=file["extension"], chunk_size=1000, chunk_overlap=300) - with file["chunks"].row() as chunk: - chunk["embedding"] = chunk["text"].call(code_to_embedding) - code_embeddings.collect(filename=file["filename"], location=chunk["location"], - code=chunk["text"], embedding=chunk["embedding"]) + language="markdown", chunk_size=2000, chunk_overlap=500) + + with doc["chunks"].row() as chunk: + chunk["embedding"] = text_to_embedding(chunk["text"]) + doc_embeddings.collect( + filename=doc["filename"], + location=chunk["location"], + text=chunk["text"], + embedding=chunk["embedding"] + ) - code_embeddings.export( - "code_embeddings", + doc_embeddings.export( + "doc_embeddings", cocoindex.storages.Postgres(), primary_key_fields=["filename", "location"], vector_indexes=[ @@ -48,20 +47,36 @@ def code_embedding_flow(flow_builder: cocoindex.FlowBuilder, data_scope: cocoind field_name="embedding", metric=cocoindex.VectorSimilarityMetric.COSINE_SIMILARITY)]) +def search(pool: ConnectionPool, query: str, top_k: int = 5): + # Get the table name, for the export target in the text_embedding_flow above. + table_name = cocoindex.utils.get_target_storage_default_name(markdown_embedding_flow, "doc_embeddings") + # Evaluate the transform flow defined above with the input query, to get the embedding. + query_vector = text_to_embedding.eval(query) + # Run the query and get the results. 
+ with pool.connection() as conn: + with conn.cursor() as cur: + cur.execute(f""" + SELECT filename, text, embedding <=> %s::vector AS distance + FROM {table_name} ORDER BY distance LIMIT %s + """, (query_vector, top_k)) + return [ + {"filename": row[0], "text": row[1], "score": 1.0 - row[2]} + for row in cur.fetchall() + ] + fastapi_app = FastAPI() - -query_handler = cocoindex.query.SimpleSemanticsQueryHandler( - name="SemanticsSearch", - flow=code_embedding_flow, - target_name="code_embeddings", - query_transform_flow=code_to_embedding, - default_similarity_metric=cocoindex.VectorSimilarityMetric.COSINE_SIMILARITY -) -@fastapi_app.get("/query") -def query_endpoint(string: str): - results, _ = query_handler.search(string, 10) - return results +@fastapi_app.on_event("startup") +def startup_event(): + load_dotenv() + cocoindex.init() + # Initialize database connection pool + fastapi_app.state.pool = ConnectionPool(os.getenv("COCOINDEX_DATABASE_URL")) + +@fastapi_app.get("/search") +def search_endpoint(q: str = Query(..., description="Search query"), limit: int = Query(5, description="Number of results")): + results = search(fastapi_app.state.pool, q, limit) + return {"results": results} if __name__ == "__main__": load_dotenv() From 99a1e382f470d5198aa43d644c0ff1d0f21e26e0 Mon Sep 17 00:00:00 2001 From: Linghua Jin Date: Wed, 21 May 2025 14:28:41 -0700 Subject: [PATCH 43/47] Update README.md --- examples/fastapi_server_docker/README.md | 43 ++++++++++++++++++++---- 1 file changed, 36 insertions(+), 7 deletions(-) diff --git a/examples/fastapi_server_docker/README.md b/examples/fastapi_server_docker/README.md index d65d323ab..b053a7a9e 100644 --- a/examples/fastapi_server_docker/README.md +++ b/examples/fastapi_server_docker/README.md @@ -1,10 +1,39 @@ -## Run cocoindex docker container with a simple query endpoint via fastapi -In this example, we provide a simple docker container using docker compose to build pgvector17 along with a simple python fastapi script than runs a simple query endpoint. This example uses the code from the code embedding example. +## Run docker container with a simple query endpoint via fastapi -## How to run -Edit the sample code directory to include the code you want to query over in -```sample_code/``` +In this example, we will build index for text embedding from local markdown files, and provide a simple query endpoint via fastapi. +We provide a simple docker container using docker compose to build pgvector17 along with a simple python fastapi script -Edit the configuration code from the file ```src/cocoindex_funs.py``` line 23 to 25. +## Run locally without docker +- Install dependencies: -Finally build the docker container via: ```docker compose up``` while inside the directory of the example. + ```bash + pip install -e . 
+ ``` + +- Setup: + + ```bash + cocoindex setup main.py + ``` + +- Update index: + + ```bash + cocoindex update main.py + ``` + +- Run: + + ```bash + uvicorn main:fastapi_app --reload --host 0.0.0.0 --port 8000 + + ## Query the endpoint + + ```bash + curl "http://localhost:8000/search?q=model&limit=3" + ``` + + +## Run Docker +Build the docker container via: +```docker compose up``` From 57ac81d3441973f0b3f943b96bfab14dc38b10e7 Mon Sep 17 00:00:00 2001 From: Linghua Jin Date: Wed, 21 May 2025 14:29:03 -0700 Subject: [PATCH 44/47] Update README.md --- examples/fastapi_server_docker/README.md | 3 +++ 1 file changed, 3 insertions(+) diff --git a/examples/fastapi_server_docker/README.md b/examples/fastapi_server_docker/README.md index b053a7a9e..d3764b2a3 100644 --- a/examples/fastapi_server_docker/README.md +++ b/examples/fastapi_server_docker/README.md @@ -3,6 +3,9 @@ In this example, we will build index for text embedding from local markdown files, and provide a simple query endpoint via fastapi. We provide a simple docker container using docker compose to build pgvector17 along with a simple python fastapi script +We appreciate a star ⭐ at [CocoIndex Github](https://github.com/cocoindex-io/cocoindex) if this is helpful. + + ## Run locally without docker - Install dependencies: From d8b364f02b3f346d0684199bd0f40ff314f111f8 Mon Sep 17 00:00:00 2001 From: Linghua Jin Date: Wed, 21 May 2025 15:01:09 -0700 Subject: [PATCH 45/47] fix docker --- examples/fastapi_server_docker/.env | 4 ++-- examples/fastapi_server_docker/README.md | 16 +++++++++++++++- examples/fastapi_server_docker/compose.yaml | 4 +++- examples/fastapi_server_docker/dockerfile | 6 ++++++ examples/fastapi_server_docker/requirements.txt | 4 +++- 5 files changed, 29 insertions(+), 5 deletions(-) diff --git a/examples/fastapi_server_docker/.env b/examples/fastapi_server_docker/.env index f322f4e2d..f685c601b 100644 --- a/examples/fastapi_server_docker/.env +++ b/examples/fastapi_server_docker/.env @@ -1,5 +1,5 @@ # for docker -# COCOINDEX_DATABASE_URL=postgres://cocoindex:cocoindex@coco_db:5432/cocoindex +COCOINDEX_DATABASE_URL=postgres://cocoindex:cocoindex@coco_db:5436/cocoindex # For local testing -COCOINDEX_DATABASE_URL=postgres://cocoindex:cocoindex@localhost/cocoindex +# COCOINDEX_DATABASE_URL=postgres://cocoindex:cocoindex@localhost/cocoindex diff --git a/examples/fastapi_server_docker/README.md b/examples/fastapi_server_docker/README.md index d3764b2a3..a73834e42 100644 --- a/examples/fastapi_server_docker/README.md +++ b/examples/fastapi_server_docker/README.md @@ -7,6 +7,12 @@ We appreciate a star ⭐ at [CocoIndex Github](https://github.com/cocoindex-io/c ## Run locally without docker +In the .env file, use local postgres url +``` +# For local testing +COCOINDEX_DATABASE_URL=postgres://cocoindex:cocoindex@localhost/cocoindex +``` + - Install dependencies: ```bash @@ -38,5 +44,13 @@ We appreciate a star ⭐ at [CocoIndex Github](https://github.com/cocoindex-io/c ## Run Docker + +In the .env file, use docker postgres url +``` +COCOINDEX_DATABASE_URL=postgres://cocoindex:cocoindex@coco_db:5436/cocoindex +``` + Build the docker container via: -```docker compose up``` +```bash +docker compose up +``` diff --git a/examples/fastapi_server_docker/compose.yaml b/examples/fastapi_server_docker/compose.yaml index 33efbe8d8..729a4505b 100644 --- a/examples/fastapi_server_docker/compose.yaml +++ b/examples/fastapi_server_docker/compose.yaml @@ -6,8 +6,10 @@ services: POSTGRES_USER: cocoindex POSTGRES_PASSWORD: cocoindex POSTGRES_DB: 
cocoindex + POSTGRES_PORT: 5436 ports: - - "5432:5432" + - "5436:5436" + command: postgres -p 5436 coco_api: build: diff --git a/examples/fastapi_server_docker/dockerfile b/examples/fastapi_server_docker/dockerfile index 70a041cb1..619d3af36 100644 --- a/examples/fastapi_server_docker/dockerfile +++ b/examples/fastapi_server_docker/dockerfile @@ -2,6 +2,12 @@ FROM python:3.11-slim WORKDIR /app +# Install PostgreSQL client libraries +RUN apt-get update && apt-get install -y \ + libpq-dev \ + gcc \ + && rm -rf /var/lib/apt/lists/* + COPY requirements.txt . RUN pip install -r requirements.txt diff --git a/examples/fastapi_server_docker/requirements.txt b/examples/fastapi_server_docker/requirements.txt index 18d7c2703..284cf2d65 100644 --- a/examples/fastapi_server_docker/requirements.txt +++ b/examples/fastapi_server_docker/requirements.txt @@ -2,4 +2,6 @@ cocoindex>=0.1.42 python-dotenv>=1.0.1 fastapi==0.115.12 fastapi-cli==0.0.7 -uvicorn==0.34.2 \ No newline at end of file +uvicorn==0.34.2 +psycopg==3.2.6 +psycopg_pool==3.2.6 \ No newline at end of file From a8e94ba404be751d2a6217b13821aadd78bbb8f3 Mon Sep 17 00:00:00 2001 From: Linghua Jin Date: Wed, 21 May 2025 15:03:45 -0700 Subject: [PATCH 46/47] Update README.md --- examples/fastapi_server_docker/README.md | 7 ++++++- 1 file changed, 6 insertions(+), 1 deletion(-) diff --git a/examples/fastapi_server_docker/README.md b/examples/fastapi_server_docker/README.md index a73834e42..8f0971aa1 100644 --- a/examples/fastapi_server_docker/README.md +++ b/examples/fastapi_server_docker/README.md @@ -52,5 +52,10 @@ COCOINDEX_DATABASE_URL=postgres://cocoindex:cocoindex@coco_db:5436/cocoindex Build the docker container via: ```bash -docker compose up +docker compose up --build +``` + +Test the endpoint: +```bash +curl "http://0.0.0.0:8080/search?q=model&limit=3" ``` From a208c8e38b7d0ecf494702ffbbc54fcb50c0aedd Mon Sep 17 00:00:00 2001 From: Linghua Jin Date: Wed, 21 May 2025 15:15:47 -0700 Subject: [PATCH 47/47] Update README.md --- examples/fastapi_server_docker/README.md | 8 ++++++-- 1 file changed, 6 insertions(+), 2 deletions(-) diff --git a/examples/fastapi_server_docker/README.md b/examples/fastapi_server_docker/README.md index 8f0971aa1..6fe2d634f 100644 --- a/examples/fastapi_server_docker/README.md +++ b/examples/fastapi_server_docker/README.md @@ -7,7 +7,9 @@ We appreciate a star ⭐ at [CocoIndex Github](https://github.com/cocoindex-io/c ## Run locally without docker -In the .env file, use local postgres url + +In the `.env` file, use local Postgres URL + ``` # For local testing COCOINDEX_DATABASE_URL=postgres://cocoindex:cocoindex@localhost/cocoindex @@ -35,6 +37,7 @@ COCOINDEX_DATABASE_URL=postgres://cocoindex:cocoindex@localhost/cocoindex ```bash uvicorn main:fastapi_app --reload --host 0.0.0.0 --port 8000 + ``` ## Query the endpoint @@ -45,7 +48,8 @@ COCOINDEX_DATABASE_URL=postgres://cocoindex:cocoindex@localhost/cocoindex ## Run Docker -In the .env file, use docker postgres url +In the `.env` file, use Docker Postgres URL + ``` COCOINDEX_DATABASE_URL=postgres://cocoindex:cocoindex@coco_db:5436/cocoindex ```