2 changes: 1 addition & 1 deletion docs/docs/contributing/guide.md
@@ -5,7 +5,7 @@ description: How to contribute to CocoIndex

[CocoIndex](https://github.com/cocoindex-io/cocoindex) is an open source project. We are respectful, open and friendly. This guide explains how to get involved and contribute to [CocoIndex](https://github.com/cocoindex-io/cocoindex).

Our [Discord server](https://discord.com/invite/zpA9S2DR7s) is constantly open.
If you are unsure about anything, it is a good place to discuss! We'd love to collaborate and will always be friendly.

## Good First Issues
2 changes: 1 addition & 1 deletion docs/docs/contributing/setup_dev_environment.md
@@ -44,4 +44,4 @@ Follow the steps below to get CocoIndex built on the latest codebase locally - i
- Before running a specific example, set extra environment variables to expose extra traces, enable the dev UI, etc.
```sh
. ./.env.lib_debug
```
22 changes: 11 additions & 11 deletions docs/docs/examples/examples/academic_papers_index.md
@@ -21,10 +21,10 @@ import { GitHubButton, YouTubeButton, DocumentationButton } from '../../../src/c

1. Extract the paper metadata, including file name, title, author information, abstract, and number of pages.

2. Build vector embeddings for the metadata, such as the title and abstract, for semantic search.
This enables better metadata-driven semantic search results. For example, you can match text queries against titles and abstracts.

3. Build an index of authors and all the file names associated with each author
to answer questions like "Give me all the papers by Jeff Dean."

4. If you want to perform full PDF embedding for the paper, you can extend the flow.
@@ -108,7 +108,7 @@ After this step, we should have the basic info of each paper.

We will convert the first page to Markdown using Marker. Alternatively, you can easily plug in any PDF parser, such as Docling, using CocoIndex's [custom function](https://cocoindex.io/docs/custom_ops/custom_functions).

Define a marker converter function and cache it, since its initialization is resource-intensive.
This ensures that the same converter instance is reused for different input files.

@@ -137,7 +137,7 @@ def pdf_to_markdown(content: bytes) -> str:
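A minimal sketch of this function (assuming the `marker-pdf` package; its API may differ across versions):

```python
import functools
import tempfile

from marker.converters.pdf import PdfConverter
from marker.models import create_model_dict
from marker.output import text_from_rendered

import cocoindex

@functools.cache
def get_marker_converter() -> PdfConverter:
    # Cached so the expensive model load happens once per process.
    return PdfConverter(artifact_dict=create_model_dict())

@cocoindex.op.function(cache=True, behavior_version=1)
def pdf_to_markdown(content: bytes) -> str:
    # Marker reads from a file path, so stage the bytes in a temp file.
    with tempfile.NamedTemporaryFile(delete=True, suffix=".pdf") as temp_file:
        temp_file.write(content)
        temp_file.flush()
        text, _, _ = text_from_rendered(get_marker_converter()(temp_file.name))
        return text
```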
Pass it to your transform

```python
with data_scope["documents"].row() as doc:
# ... process
doc["first_page_md"] = doc["basic_info"]["first_page"].transform(
pdf_to_markdown
@@ -200,7 +200,7 @@ paper_metadata.collect(
Just collect anything you need :)

### Collect `author` to `filename` information
We’ve already extracted the author list. Here we want to collect Author → Papers in a separate table to build a lookup functionality.
Simply collect by author.

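A minimal sketch of that collection (the `authors` field layout is an assumption based on the metadata above):

```python
with doc["metadata"]["authors"].row() as author:
    author_papers.collect(
        author_name=author["name"],
        filename=doc["filename"],
    )
```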
@@ -229,8 +229,8 @@ doc["title_embedding"] = doc["metadata"]["title"].transform(
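For reference, a minimal sketch of completing this transform (the embedding model choice is an assumption):

```python
doc["title_embedding"] = doc["metadata"]["title"].transform(
    cocoindex.functions.SentenceTransformerEmbed(
        model="sentence-transformers/all-MiniLM-L6-v2"))
```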

### Abstract

Split the abstract into chunks, embed each chunk, and collect their embeddings.
Sometimes the abstract can be very long.

```python
doc["abstract_chunks"] = doc["metadata"]["abstract"].transform(
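    # A sketch of the remaining arguments (values are assumptions; tune as needed):
    cocoindex.functions.SplitRecursively(),
    language="markdown", chunk_size=500, chunk_overlap=150)
```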
@@ -308,7 +308,7 @@ author_papers.export(
"author_papers",
cocoindex.targets.Postgres(),
primary_key_fields=["author_name", "filename"],
)
metadata_embeddings.export(
"metadata_embeddings",
cocoindex.targets.Postgres(),
@@ -328,9 +328,9 @@ In this example we use PGVector as embedding store. With CocoIndex, you can do o

## Query the index

You can refer to this section of [Text Embeddings](https://cocoindex.io/blogs/text-embeddings-101#3-query-the-index) about
how to build queries against embeddings.
For now, CocoIndex doesn't provide an additional query interface. You can write SQL or rely on the query engine of the target storage.

- Many databases already have optimized query implementations with their own best practices
- The query space has excellent solutions for querying, reranking, and other search-related functionality.
16 changes: 8 additions & 8 deletions docs/docs/examples/examples/codebase_index.md
@@ -19,7 +19,7 @@ import { GitHubButton, YouTubeButton, DocumentationButton } from '../../../src/c
![Codebase Index](/img/examples/codebase_index/cover.png)

## Overview
In this tutorial, we will build a codebase index. [CocoIndex](https://github.com/cocoindex-io/cocoindex) provides built-in support for codebase chunking, with native Tree-sitter support. It works with large codebases, and can be updated in near real-time with incremental processing - only reprocessing what's changed.

## Use Cases
A wide range of applications can be built with an effective codebase index that is always up-to-date.
@@ -44,14 +44,14 @@ The flow is composed of the following steps:
- Generate embeddings for each chunk
- Store in a vector database for retrieval

## Setup
- Install Postgres, follow [installation guide](https://cocoindex.io/docs/getting_started/installation#-install-postgres).
- Install CocoIndex
```bash
pip install -U cocoindex
```

## Add the codebase as a source
We will index the CocoIndex codebase. Here we use the `LocalFile` source to ingest files from the CocoIndex codebase root directory.

@@ -67,7 +67,7 @@ def code_embedding_flow(flow_builder: cocoindex.FlowBuilder, data_scope: cocoind
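A minimal sketch of this flow's source setup (the path and patterns here are assumptions matching the description below):

```python
@cocoindex.flow_def(name="CodeEmbedding")
def code_embedding_flow(flow_builder: cocoindex.FlowBuilder, data_scope: cocoindex.DataScope):
    data_scope["files"] = flow_builder.add_source(
        cocoindex.sources.LocalFile(
            path="../..",  # assumed: the CocoIndex codebase root
            included_patterns=["*.py", "*.rs", "*.toml", "*.md", "*.mdx"],
            excluded_patterns=[".*", "target", "**/node_modules"]))
```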
- Include files with the extensions of `.py`, `.rs`, `.toml`, `.md`, `.mdx`
- Exclude files and directories starting with `.`, `target` in the root, and `node_modules` under any directory.

`flow_builder.add_source` will create a table with sub fields (`filename`, `content`).
<DocumentationButton url="https://cocoindex.io/docs/ops/sources" text="Sources" />


@@ -96,14 +96,14 @@ with data_scope["files"].row() as file:
file["extension"] = file["filename"].transform(extract_extension)
file["chunks"] = file["content"].transform(
cocoindex.functions.SplitRecursively(),
language=file["extension"], chunk_size=1000, chunk_overlap=300)
```
<DocumentationButton url="https://cocoindex.io/docs/ops/functions#splitrecursively" text="SplitRecursively" margin="0 0 16px 0" />

![SplitRecursively](/img/examples/codebase_index/chunk.png)

### Embed the chunks
We use `SentenceTransformerEmbed` to embed the chunks.

```python
@cocoindex.transform_flow()
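# A sketch completing the transform flow (the model name is an assumption):
def code_to_embedding(
    text: cocoindex.DataSlice[str],
) -> cocoindex.DataSlice[list[float]]:
    return text.transform(
        cocoindex.functions.SentenceTransformerEmbed(
            model="sentence-transformers/all-MiniLM-L6-v2"))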
@@ -141,7 +141,7 @@ code_embeddings.export(
vector_indexes=[cocoindex.VectorIndex("embedding", cocoindex.VectorSimilarityMetric.COSINE_SIMILARITY)])
```

We use [Cosine Similarity](https://en.wikipedia.org/wiki/Cosine_similarity) to measure the similarity between the query and the indexed data.

## Query the index
We match user-provided text via a SQL query, reusing the embedding operation from the indexing flow.
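A minimal sketch of such a query (assuming `psycopg` with `pgvector`; the DSN and table name are illustrative, and `code_to_embedding` is the transform flow defined above):

```python
import numpy as np
import psycopg
from pgvector.psycopg import register_vector

def search(query: str, top_k: int = 5):
    # Reuse the indexing-time embedding for the query text.
    query_vector = np.array(code_to_embedding.eval(query), dtype=np.float32)
    with psycopg.connect("postgresql://localhost/cocoindex") as conn:  # illustrative DSN
        register_vector(conn)
        with conn.cursor() as cur:
            # "<=>" is pgvector's cosine-distance operator.
            cur.execute(
                "SELECT filename, code, embedding <=> %s AS distance "
                "FROM codeembedding__code_embeddings "  # assumed default table name
                "ORDER BY distance LIMIT %s",
                (query_vector, top_k))
            return cur.fetchall()
```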
@@ -230,4 +230,4 @@ Follow the url from the terminal - `https://cocoindex.io/cocoinsight` to access

SplitRecursively has native support for all major programming languages.

<DocumentationButton url="https://cocoindex.io/docs/ops/functions#supported-languages" text="Supported Languages" margin="0 0 16px 0" />
20 changes: 9 additions & 11 deletions docs/docs/examples/examples/custom_targets.md
@@ -35,7 +35,7 @@ flow_builder: cocoindex.FlowBuilder, data_scope: cocoindex.DataScope
refresh_interval=timedelta(seconds=5),
)
```
This ingestion creates a table with `filename` and `content` fields.
<DocumentationButton url="https://cocoindex.io/docs/ops/sources" text="Sources" />

## Process each file and collect
@@ -92,7 +92,7 @@ class LocalFileTargetConnector:

```

The `describe()` method returns a human-readable string that describes the target, which is displayed in the CLI logs.
For example, it prints:

`Target: Local directory ./data/output`
@@ -104,10 +104,10 @@ def describe(key: str) -> str:
return f"Local directory {key}"
```

`apply_setup_change()` applies setup changes to the backend. The previous and current specs are passed as arguments,
and the method is expected to update the backend setup to match the current state.

A `None` spec indicates non-existence, so when `previous` is `None`, we need to create it,
and when `current` is `None`, we need to delete it.
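
A minimal sketch of that contract (the directory handling is illustrative; assumes `os` and `contextlib` are imported):

```python
@staticmethod
def apply_setup_change(
    key: str,
    previous: LocalFileTargetSpec | None,
    current: LocalFileTargetSpec | None,
) -> None:
    if previous is None and current is not None:
        # Newly added target: create its directory (idempotent).
        os.makedirs(current.directory, exist_ok=True)
    elif previous is not None and current is None:
        # Removed target: clean up, tolerating a missing directory (idempotent).
        with contextlib.suppress(FileNotFoundError):
            os.rmdir(previous.directory)
```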


@@ -135,8 +135,8 @@ def apply_setup_change(
os.rmdir(previous.directory)
```

The `mutate()` method is called by CocoIndex to apply data changes to the target,
batching mutations to potentially multiple targets of the same type.
This allows the target connector flexibility in implementation (e.g., atomic commits, or processing items with dependencies in a specific order).

Each element in the batch corresponds to a specific target and is represented by a tuple containing:
@@ -151,8 +151,8 @@ class LocalFileTargetValues:
html: str
```

The value type of the `dict` is `LocalFileTargetValues | None`,
where a non-`None` value means an upsert and a `None` value means a delete. As with `apply_setup_change()`,
idempotency is expected here.

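A minimal sketch of `mutate()` following these rules (the `.html` naming is an assumption; assumes `os` and `contextlib` are imported):

```python
@staticmethod
def mutate(
    *all_mutations: tuple[LocalFileTargetSpec, dict[str, LocalFileTargetValues | None]],
) -> None:
    for spec, mutations in all_mutations:
        for filename, mutation in mutations.items():
            path = os.path.join(spec.directory, filename + ".html")  # assumed naming
            if mutation is None:
                # Delete: a missing file is fine, for idempotency.
                with contextlib.suppress(FileNotFoundError):
                    os.remove(path)
            else:
                # Upsert: (re)write the rendered content.
                with open(path, "w", encoding="utf-8") as f:
                    f.write(mutation.html)
```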
@@ -217,7 +217,5 @@ This keeps your knowledge graph continuously synchronized with your document sou
Sometimes there may be an internal/homegrown tool or API (e.g. within a company) that's not publicly available.
These can only be connected through custom targets.

### Faster adoption of new export logic
When a new tool, database, or API joins your stack, simply define a Target Spec and Target Connector — start exporting right away, with no pipeline refactoring required.


21 changes: 10 additions & 11 deletions docs/docs/examples/examples/docs_to_knowledge_graph.md
@@ -36,7 +36,7 @@ and then build a knowledge graph.
- CocoIndex can directly map the collected data to Neo4j nodes and relationships.

## Setup
* [Install PostgreSQL](https://cocoindex.io/docs/getting_started/installation#-install-postgres). CocoIndex uses PostgreSQL internally for incremental processing.
* [Install Neo4j](https://cocoindex.io/docs/ops/targets#neo4j-dev-instance), a graph database.
* [Configure your OpenAI API key](https://cocoindex.io/docs/ai/llm#openai). Alternatively, we have native support for Gemini, Ollama, LiteLLM. You can choose your favorite LLM provider and work completely on-premises.

@@ -51,7 +51,7 @@ and then build a knowledge graph.

### Add documents as source

We will process CocoIndex documentation markdown files (`.md`, `.mdx`) from the `docs/core` directory ([markdown files](https://github.com/cocoindex-io/cocoindex/tree/main/docs/docs/core), [deployed docs](https://cocoindex.io/docs/core/basics)).

```python
@cocoindex.flow_def(name="DocsToKG")
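# A sketch of the flow body (the path and patterns are assumptions based on this section):
def docs_to_kg_flow(flow_builder: cocoindex.FlowBuilder,
                    data_scope: cocoindex.DataScope):
    data_scope["documents"] = flow_builder.add_source(
        cocoindex.sources.LocalFile(path="../../docs/docs/core",
                                    included_patterns=["*.md", "*.mdx"]))
```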
@@ -141,7 +141,7 @@ Next, we will use `cocoindex.functions.ExtractByLlm` to extract the relationship
doc["relationships"] = doc["content"].transform(
cocoindex.functions.ExtractByLlm(
llm_spec=cocoindex.LlmSpec(
api_type=cocoindex.LlmApiType.OPENAI,
model="gpt-4o"
),
output_type=list[Relationship],
@@ -187,7 +187,7 @@ with doc["relationships"].row() as relationship:


### Build knowledge graph

#### Basic concepts
All nodes for Neo4j need two things:
1. Label: The type of the node. E.g., `Document`, `Entity`.
@@ -236,10 +236,10 @@ This exports Neo4j nodes with label `Document` from the `document_node` collecto

#### Export `RELATIONSHIP` and `Entity` nodes to Neo4j

We don't have an explicit collector for `Entity` nodes.
They are part of the `entity_relationship` collector, and their fields are collected during relationship extraction.

To export them as Neo4j nodes, we need to first declare `Entity` nodes.

```python
flow_builder.declare(
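    # A sketch of the declaration (assumes `conn_spec` is a Neo4j connection
    # auth entry defined earlier):
    cocoindex.targets.Neo4jDeclaration(
        connection=conn_spec,
        nodes_label="Entity",
        primary_key_fields=["value"],
    )
)
```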
@@ -289,7 +289,7 @@ In a relationship, there's:
2. A relationship connecting the source and target.
Note that different relationships may share the same source and target nodes.

`NodeFromFields` takes the fields from the `entity_relationship` collector and creates `Entity` nodes.

#### Export the `entity_mention` to Neo4j.

@@ -334,7 +334,7 @@ It creates relationships by:
```sh
cocoindex update --setup main.py
```

You'll see the index update status in the terminal. For example,

```
@@ -343,7 +343,7 @@

## CocoInsight

I used CocoInsight to troubleshoot the index generation and understand the data lineage of the pipeline. It is in free beta now; you can give it a try.

```sh
cocoindex server -ci main
@@ -369,7 +369,7 @@ MATCH p=()-->() RETURN p
## Kuzu
CocoIndex natively supports Kuzu - a high-performance, embedded, open-source graph database.

<DocumentationButton url="https://cocoindex.io/docs/ops/targets#kuzu" text="Kuzu" margin="0 0 16px 0" />

The GraphDB interface in CocoIndex is standardized; you just need to **switch the configuration**, without any additional code changes. CocoIndex supports exporting to Kuzu through its API server. You can bring up a Kuzu API server locally by running:

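One way to do that is via Docker (a sketch assuming the `kuzudb/api-server` image; the port and volume mapping are assumptions):

```sh
docker run -d --name kuzu \
  -p 8123:8000 \
  -v "$HOME/.kuzudb:/database" \
  kuzudb/api-server:latest
```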
@@ -391,4 +391,3 @@ kuzu_conn_spec = cocoindex.add_auth_entry(
```

<GitHubButton url="https://github.com/cocoindex-io/cocoindex/blob/30761f8ab674903d742c8ab2e18d4c588df6d46f/examples/docs_to_knowledge_graph/main.py#L33-L37" margin="0 0 16px 0" />

2 changes: 1 addition & 1 deletion docs/docs/examples/examples/document_ai.md
@@ -21,7 +21,7 @@ CocoIndex is a flexible ETL framework with incremental processing. We don’t b

## Set up
- [Install Postgres](https://cocoindex.io/docs/getting_started/installation#-install-postgres) if you don't have one.
- Configure Project and Processor ID for Document AI API
- [Official Google Document AI API](https://cloud.google.com/document-ai/docs/try-docai) with a free live demo.
- Sign in to [Google Cloud Console](https://console.cloud.google.com/), create or open a project, and enable Document AI API.
- ![image.png](/img/examples/document_ai/document_ai.png)
2 changes: 1 addition & 1 deletion docs/docs/examples/examples/image_search.md
@@ -21,7 +21,7 @@ import { GitHubButton, YouTubeButton, DocumentationButton } from '../../../src/c
CocoIndex supports native integration with ColPali - with just a few lines of code, you can embed and index images with ColPali’s late-interaction architecture. We also build a lightweight image search application with FastAPI.


## ColPali

**ColPali (Contextual Late-interaction over Patches)** is a powerful model for multimodal retrieval.

3 changes: 1 addition & 2 deletions docs/docs/examples/examples/manual_extraction.md
@@ -188,7 +188,7 @@ def summarize_module(module_info: ModuleInfo) -> ModuleSummary:
num_classes=len(module_info.classes),
num_methods=len(module_info.methods),
)
```

### Plug the function into the flow
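A minimal sketch of wiring it in (field names are assumptions based on the surrounding example):

```python
with data_scope["documents"].row() as doc:
    # ... module_info extraction above ...
    doc["module_summary"] = doc["module_info"].transform(summarize_module)
```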
@@ -249,4 +249,3 @@ SELECT filename, module_info->'title' AS title, module_summary FROM modules_info
cocoindex server -ci main
```
The CocoInsight dashboard is at `https://cocoindex.io/cocoinsight`. It connects to your local CocoIndex server with zero data retention.
