2 changes: 1 addition & 1 deletion docs/docs/contributing/guide.md
@@ -5,7 +5,7 @@ description: How to contribute to CocoIndex

[CocoIndex](https://github.com/cocoindex-io/cocoindex) is an open source project. We are respectful, open and friendly. This guide explains how to get involved and contribute to [CocoIndex](https://github.com/cocoindex-io/cocoindex).

Our [Discord server](https://discord.com/invite/zpA9S2DR7s) is constantly open.
If you are unsure about anything, it is a good place to discuss! We'd love to collaborate and will always be friendly.

## Good First Issues
2 changes: 1 addition & 1 deletion docs/docs/contributing/setup_dev_environment.md
@@ -44,4 +44,4 @@ Follow the steps below to get CocoIndex built on the latest codebase locally - i
- Before running a specific example, set extra environment variables to expose extra traces, enable the dev UI, etc.
```sh
. ./.env.lib_debug
```
22 changes: 11 additions & 11 deletions docs/docs/examples/examples/academic_papers_index.md
@@ -21,10 +21,10 @@ import { GitHubButton, YouTubeButton, DocumentationButton } from '../../../src/c

1. Extract the paper metadata, including file name, title, author information, abstract, and number of pages.

2. Build vector embeddings for the metadata, such as the title and abstract, for semantic search.
This enables better metadata-driven semantic search results. For example, you can match text queries against titles and abstracts.

3. Build an index of authors and all the file names associated with each author
to answer questions like "Give me all the papers by Jeff Dean."

4. If you want to perform full PDF embedding for the paper, you can extend the flow.
@@ -108,7 +108,7 @@ After this step, we should have the basic info of each paper.

We will convert the first page to Markdown using Marker. Alternatively, you can easily plug in any PDF parser, such as Docling, using CocoIndex's [custom function](https://cocoindex.io/docs/custom_ops/custom_functions).

Define a marker converter function and cache it, since its initialization is resource-intensive.
This ensures that the same converter instance is reused for different input files.

@@ -137,7 +137,7 @@ def pdf_to_markdown(content: bytes) -> str:
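A minimal sketch of this function (assuming the `marker-pdf` package; its API may differ across versions):

```python
import functools
import tempfile

from marker.converters.pdf import PdfConverter
from marker.models import create_model_dict
from marker.output import text_from_rendered

import cocoindex

@functools.cache
def get_marker_converter() -> PdfConverter:
    # Cached so the expensive model load happens once per process.
    return PdfConverter(artifact_dict=create_model_dict())

@cocoindex.op.function(cache=True, behavior_version=1)
def pdf_to_markdown(content: bytes) -> str:
    # Marker reads from a file path, so stage the bytes in a temp file.
    with tempfile.NamedTemporaryFile(delete=True, suffix=".pdf") as temp_file:
        temp_file.write(content)
        temp_file.flush()
        text, _, _ = text_from_rendered(get_marker_converter()(temp_file.name))
        return text
```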
Pass it to your transform

```python
with data_scope["documents"].row() as doc:
# ... process
doc["first_page_md"] = doc["basic_info"]["first_page"].transform(
pdf_to_markdown
@@ -200,7 +200,7 @@ paper_metadata.collect(
Just collect anything you need :)

### Collect `author` to `filename` information
We’ve already extracted the author list. Here we want to collect Author → Papers in a separate table to build a lookup functionality.
Simply collect by author.

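A minimal sketch of that collection (the `authors` field layout is an assumption based on the metadata above):

```python
with doc["metadata"]["authors"].row() as author:
    author_papers.collect(
        author_name=author["name"],
        filename=doc["filename"],
    )
```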
@@ -229,8 +229,8 @@ doc["title_embedding"] = doc["metadata"]["title"].transform(
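For reference, a minimal sketch of completing this transform (the embedding model choice is an assumption):

```python
doc["title_embedding"] = doc["metadata"]["title"].transform(
    cocoindex.functions.SentenceTransformerEmbed(
        model="sentence-transformers/all-MiniLM-L6-v2"))
```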

### Abstract

Split the abstract into chunks, embed each chunk, and collect their embeddings.
Sometimes the abstract can be very long.

```python
doc["abstract_chunks"] = doc["metadata"]["abstract"].transform(
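    # A sketch of the remaining arguments (values are assumptions; tune as needed):
    cocoindex.functions.SplitRecursively(),
    language="markdown", chunk_size=500, chunk_overlap=150)
```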
@@ -308,7 +308,7 @@ author_papers.export(
"author_papers",
cocoindex.targets.Postgres(),
primary_key_fields=["author_name", "filename"],
)
metadata_embeddings.export(
"metadata_embeddings",
cocoindex.targets.Postgres(),
@@ -328,9 +328,9 @@ In this example we use PGVector as embedding store. With CocoIndex, you can do o

## Query the index

You can refer to this section of [Text Embeddings](https://cocoindex.io/blogs/text-embeddings-101#3-query-the-index) about
how to build queries against embeddings.
For now, CocoIndex doesn't provide an additional query interface. You can write SQL or rely on the query engine of the target storage.

- Many databases already have optimized query implementations with their own best practices
- The query space has excellent solutions for querying, reranking, and other search-related functionality.
16 changes: 8 additions & 8 deletions docs/docs/examples/examples/codebase_index.md
@@ -19,7 +19,7 @@ import { GitHubButton, YouTubeButton, DocumentationButton } from '../../../src/c
![Codebase Index](/img/examples/codebase_index/cover.png)

## Overview
In this tutorial, we will build a codebase index. [CocoIndex](https://github.com/cocoindex-io/cocoindex) provides built-in support for codebase chunking, with native Tree-sitter support. It works with large codebases, and can be updated in near real-time with incremental processing - only reprocessing what's changed.

## Use Cases
A wide range of applications can be built with an effective codebase index that is always up-to-date.
@@ -44,14 +44,14 @@ The flow is composed of the following steps:
- Generate embeddings for each chunk
- Store in a vector database for retrieval

## Setup
- Install Postgres, follow [installation guide](https://cocoindex.io/docs/getting_started/installation#-install-postgres).
- Install CocoIndex
```bash
pip install -U cocoindex
```

## Add the codebase as a source
We will index the CocoIndex codebase. Here we use the `LocalFile` source to ingest files from the CocoIndex codebase root directory.

@@ -67,7 +67,7 @@ def code_embedding_flow(flow_builder: cocoindex.FlowBuilder, data_scope: cocoind
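A minimal sketch of this flow's source setup (the path and patterns here are assumptions matching the description below):

```python
@cocoindex.flow_def(name="CodeEmbedding")
def code_embedding_flow(flow_builder: cocoindex.FlowBuilder, data_scope: cocoindex.DataScope):
    data_scope["files"] = flow_builder.add_source(
        cocoindex.sources.LocalFile(
            path="../..",  # assumed: the CocoIndex codebase root
            included_patterns=["*.py", "*.rs", "*.toml", "*.md", "*.mdx"],
            excluded_patterns=[".*", "target", "**/node_modules"]))
```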
- Include files with the extensions of `.py`, `.rs`, `.toml`, `.md`, `.mdx`
- Exclude files and directories starting with `.`, `target` in the root, and `node_modules` under any directory.

`flow_builder.add_source` will create a table with sub fields (`filename`, `content`).
<DocumentationButton url="https://cocoindex.io/docs/ops/sources" text="Sources" />


@@ -96,14 +96,14 @@ with data_scope["files"].row() as file:
file["extension"] = file["filename"].transform(extract_extension)
file["chunks"] = file["content"].transform(
cocoindex.functions.SplitRecursively(),
language=file["extension"], chunk_size=1000, chunk_overlap=300)
```
<DocumentationButton url="https://cocoindex.io/docs/ops/functions#splitrecursively" text="SplitRecursively" margin="0 0 16px 0" />

![SplitRecursively](/img/examples/codebase_index/chunk.png)

### Embed the chunks
We use `SentenceTransformerEmbed` to embed the chunks.

```python
@cocoindex.transform_flow()
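# A sketch completing the transform flow (the model name is an assumption):
def code_to_embedding(
    text: cocoindex.DataSlice[str],
) -> cocoindex.DataSlice[list[float]]:
    return text.transform(
        cocoindex.functions.SentenceTransformerEmbed(
            model="sentence-transformers/all-MiniLM-L6-v2"))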
@@ -141,7 +141,7 @@ code_embeddings.export(
vector_indexes=[cocoindex.VectorIndex("embedding", cocoindex.VectorSimilarityMetric.COSINE_SIMILARITY)])
```

We use [Cosine Similarity](https://en.wikipedia.org/wiki/Cosine_similarity) to measure the similarity between the query and the indexed data.

## Query the index
We match user-provided text via a SQL query, reusing the embedding operation from the indexing flow.
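A minimal sketch of such a query (assuming `psycopg` with `pgvector`; the DSN and table name are illustrative, and `code_to_embedding` is the transform flow defined above):

```python
import numpy as np
import psycopg
from pgvector.psycopg import register_vector

def search(query: str, top_k: int = 5):
    # Reuse the indexing-time embedding for the query text.
    query_vector = np.array(code_to_embedding.eval(query), dtype=np.float32)
    with psycopg.connect("postgresql://localhost/cocoindex") as conn:  # illustrative DSN
        register_vector(conn)
        with conn.cursor() as cur:
            # "<=>" is pgvector's cosine-distance operator.
            cur.execute(
                "SELECT filename, code, embedding <=> %s AS distance "
                "FROM codeembedding__code_embeddings "  # assumed default table name
                "ORDER BY distance LIMIT %s",
                (query_vector, top_k))
            return cur.fetchall()
```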
@@ -230,4 +230,4 @@ Follow the url from the terminal - `https://cocoindex.io/cocoinsight` to access

SplitRecursively has native support for all major programming languages.

<DocumentationButton url="https://cocoindex.io/docs/ops/functions#supported-languages" text="Supported Languages" margin="0 0 16px 0" />
20 changes: 9 additions & 11 deletions docs/docs/examples/examples/custom_targets.md
@@ -35,7 +35,7 @@ flow_builder: cocoindex.FlowBuilder, data_scope: cocoindex.DataScope
refresh_interval=timedelta(seconds=5),
)
```
This ingestion creates a table with `filename` and `content` fields.
<DocumentationButton url="https://cocoindex.io/docs/ops/sources" text="Sources" />

## Process each file and collect
@@ -92,7 +92,7 @@ class LocalFileTargetConnector:

```

The `describe()` method returns a human-readable string that describes the target, which is displayed in the CLI logs.
For example, it prints:

`Target: Local directory ./data/output`
@@ -104,10 +104,10 @@ def describe(key: str) -> str:
return f"Local directory {key}"
```

`apply_setup_change()` applies setup changes to the backend. The previous and current specs are passed as arguments,
and the method is expected to update the backend setup to match the current state.

A `None` spec indicates non-existence, so when `previous` is `None`, we need to create it,
and when `current` is `None`, we need to delete it.
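
A minimal sketch of that contract (the directory handling is illustrative; assumes `os` and `contextlib` are imported):

```python
@staticmethod
def apply_setup_change(
    key: str,
    previous: LocalFileTargetSpec | None,
    current: LocalFileTargetSpec | None,
) -> None:
    if previous is None and current is not None:
        # Newly added target: create its directory (idempotent).
        os.makedirs(current.directory, exist_ok=True)
    elif previous is not None and current is None:
        # Removed target: clean up, tolerating a missing directory (idempotent).
        with contextlib.suppress(FileNotFoundError):
            os.rmdir(previous.directory)
```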


@@ -135,8 +135,8 @@ def apply_setup_change(
os.rmdir(previous.directory)
```

The `mutate()` method is called by CocoIndex to apply data changes to the target,
batching mutations to potentially multiple targets of the same type.
This allows the target connector flexibility in implementation (e.g., atomic commits, or processing items with dependencies in a specific order).

Each element in the batch corresponds to a specific target and is represented by a tuple containing:
@@ -151,8 +151,8 @@ class LocalFileTargetValues:
html: str
```

The value type of the `dict` is `LocalFileTargetValues | None`,
where a non-`None` value means an upsert and a `None` value means a delete. As with `apply_setup_change()`,
idempotency is expected here.

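A minimal sketch of `mutate()` following these rules (the `.html` naming is an assumption; assumes `os` and `contextlib` are imported):

```python
@staticmethod
def mutate(
    *all_mutations: tuple[LocalFileTargetSpec, dict[str, LocalFileTargetValues | None]],
) -> None:
    for spec, mutations in all_mutations:
        for filename, mutation in mutations.items():
            path = os.path.join(spec.directory, filename + ".html")  # assumed naming
            if mutation is None:
                # Delete: a missing file is fine, for idempotency.
                with contextlib.suppress(FileNotFoundError):
                    os.remove(path)
            else:
                # Upsert: (re)write the rendered content.
                with open(path, "w", encoding="utf-8") as f:
                    f.write(mutation.html)
```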
@@ -217,7 +217,5 @@ This keeps your knowledge graph continuously synchronized with your document sou
Sometimes there may be an internal/homegrown tool or API (e.g. within a company) that's not publicly available.
These can only be connected through custom targets.

### Faster adoption of new export logic
When a new tool, database, or API joins your stack, simply define a Target Spec and Target Connector — start exporting right away, with no pipeline refactoring required.


21 changes: 10 additions & 11 deletions docs/docs/examples/examples/docs_to_knowledge_graph.md
@@ -36,7 +36,7 @@ and then build a knowledge graph.
- CocoIndex can directly map the collected data to Neo4j nodes and relationships.

## Setup
* [Install PostgreSQL](https://cocoindex.io/docs/getting_started/installation#-install-postgres). CocoIndex uses PostgreSQL internally for incremental processing.
* [Install Neo4j](https://cocoindex.io/docs/ops/targets#neo4j-dev-instance), a graph database.
* [Configure your OpenAI API key](https://cocoindex.io/docs/ai/llm#openai). Alternatively, we have native support for Gemini, Ollama, LiteLLM. You can choose your favorite LLM provider and work completely on-premises.

@@ -51,7 +51,7 @@ and then build a knowledge graph.

### Add documents as source

We will process CocoIndex documentation markdown files (`.md`, `.mdx`) from the `docs/core` directory ([markdown files](https://github.com/cocoindex-io/cocoindex/tree/main/docs/docs/core), [deployed docs](https://cocoindex.io/docs/core/basics)).

```python
@cocoindex.flow_def(name="DocsToKG")
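# A sketch of the flow body (the path and patterns are assumptions based on this section):
def docs_to_kg_flow(flow_builder: cocoindex.FlowBuilder,
                    data_scope: cocoindex.DataScope):
    data_scope["documents"] = flow_builder.add_source(
        cocoindex.sources.LocalFile(path="../../docs/docs/core",
                                    included_patterns=["*.md", "*.mdx"]))
```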
@@ -141,7 +141,7 @@ Next, we will use `cocoindex.functions.ExtractByLlm` to extract the relationship
doc["relationships"] = doc["content"].transform(
cocoindex.functions.ExtractByLlm(
llm_spec=cocoindex.LlmSpec(
api_type=cocoindex.LlmApiType.OPENAI,
model="gpt-4o"
),
output_type=list[Relationship],
@@ -187,7 +187,7 @@ with doc["relationships"].row() as relationship:


### Build knowledge graph

#### Basic concepts
All nodes for Neo4j need two things:
1. Label: The type of the node. E.g., `Document`, `Entity`.
@@ -236,10 +236,10 @@ This exports Neo4j nodes with label `Document` from the `document_node` collecto

#### Export `RELATIONSHIP` and `Entity` nodes to Neo4j

We don't have an explicit collector for `Entity` nodes.
They are part of the `entity_relationship` collector, and their fields are collected during relationship extraction.

To export them as Neo4j nodes, we need to first declare `Entity` nodes.

```python
flow_builder.declare(
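    # A sketch of the declaration (assumes `conn_spec` is a Neo4j connection
    # auth entry defined earlier):
    cocoindex.targets.Neo4jDeclaration(
        connection=conn_spec,
        nodes_label="Entity",
        primary_key_fields=["value"],
    )
)
```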
@@ -289,7 +289,7 @@ In a relationship, there's:
2. A relationship connecting the source and target.
Note that different relationships may share the same source and target nodes.

`NodeFromFields` takes the fields from the `entity_relationship` collector and creates `Entity` nodes.

#### Export the `entity_mention` to Neo4j.

@@ -334,7 +334,7 @@ It creates relationships by:
```sh
cocoindex update --setup main.py
```

You'll see the index update status in the terminal. For example,

```
@@ -343,7 +343,7 @@

## CocoInsight

I used CocoInsight to troubleshoot the index generation and understand the data lineage of the pipeline. It is in free beta now; you can give it a try.

```sh
cocoindex server -ci main
@@ -369,7 +369,7 @@ MATCH p=()-->() RETURN p
## Kuzu
CocoIndex natively supports Kuzu - a high-performance, embedded, open-source graph database.

<DocumentationButton url="https://cocoindex.io/docs/ops/targets#kuzu" text="Kuzu" margin="0 0 16px 0" />

The GraphDB interface in CocoIndex is standardized; you just need to **switch the configuration**, without any additional code changes. CocoIndex supports exporting to Kuzu through its API server. You can bring up a Kuzu API server locally by running:

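One way to do that is via Docker (a sketch assuming the `kuzudb/api-server` image; the port and volume mapping are assumptions):

```sh
docker run -d --name kuzu \
  -p 8123:8000 \
  -v "$HOME/.kuzudb:/database" \
  kuzudb/api-server:latest
```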
@@ -391,4 +391,3 @@ kuzu_conn_spec = cocoindex.add_auth_entry(
```

<GitHubButton url="https://github.com/cocoindex-io/cocoindex/blob/30761f8ab674903d742c8ab2e18d4c588df6d46f/examples/docs_to_knowledge_graph/main.py#L33-L37" margin="0 0 16px 0" />

2 changes: 1 addition & 1 deletion docs/docs/examples/examples/document_ai.md
@@ -21,7 +21,7 @@ CocoIndex is a flexible ETL framework with incremental processing. We don’t b

## Set up
- [Install Postgres](https://cocoindex.io/docs/getting_started/installation#-install-postgres) if you don't have one.
- Configure Project and Processor ID for Document AI API
- [Official Google Document AI API](https://cloud.google.com/document-ai/docs/try-docai) with a free live demo.
- Sign in to [Google Cloud Console](https://console.cloud.google.com/), create or open a project, and enable Document AI API.
- ![image.png](/img/examples/document_ai/document_ai.png)
2 changes: 1 addition & 1 deletion docs/docs/examples/examples/image_search.md
@@ -21,7 +21,7 @@ import { GitHubButton, YouTubeButton, DocumentationButton } from '../../../src/c
CocoIndex supports native integration with ColPali - with just a few lines of code, you can embed and index images with ColPali’s late-interaction architecture. We also build a lightweight image search application with FastAPI.


## ColPali

**ColPali (Contextual Late-interaction over Patches)** is a powerful model for multimodal retrieval.

3 changes: 1 addition & 2 deletions docs/docs/examples/examples/manual_extraction.md
@@ -188,7 +188,7 @@ def summarize_module(module_info: ModuleInfo) -> ModuleSummary:
num_classes=len(module_info.classes),
num_methods=len(module_info.methods),
)
```

### Plug the function into the flow
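A minimal sketch of wiring it in (field names are assumptions based on the surrounding example):

```python
with data_scope["documents"].row() as doc:
    # ... module_info extraction above ...
    doc["module_summary"] = doc["module_info"].transform(summarize_module)
```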
@@ -249,4 +249,3 @@ SELECT filename, module_info->'title' AS title, module_summary FROM modules_info
cocoindex server -ci main
```
The CocoInsight dashboard is at `https://cocoindex.io/cocoinsight`. It connects to your local CocoIndex server with zero data retention.
