Merged
1 change: 1 addition & 0 deletions README.md
Original file line number Diff line number Diff line change
@@ -185,6 +185,7 @@ It defines an index flow like this:
| [Image Search with Vision API](examples/image_search) | Generates detailed captions for images using a vision model, embeds them, enables live-updating semantic search via FastAPI and served on a React frontend|
| [Face Recognition](examples/face_recognition) | Recognize faces in images and build embedding index |
| [Paper Metadata](examples/paper_metadata) | Index papers in PDF files, and build metadata tables for each paper |
| [Custom Output Files](examples/custom_output_files) | Convert markdown files to HTML files and save them to a local directory, using *CocoIndex Custom Targets* |

More coming and stay tuned 👀!

2 changes: 2 additions & 0 deletions examples/custom_output_files/.env
@@ -0,0 +1,2 @@
# Postgres database address for cocoindex
COCOINDEX_DATABASE_URL=postgres://cocoindex:cocoindex@localhost/cocoindex
1 change: 1 addition & 0 deletions examples/custom_output_files/.gitignore
@@ -0,0 +1 @@
output_html/
53 changes: 53 additions & 0 deletions examples/custom_output_files/README.md
@@ -0,0 +1,53 @@
# Convert markdown files to HTML with CocoIndex Custom Targets
[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/cocoindex-io/cocoindex/blob/main/examples/text_embedding/Text_Embedding.ipynb)
[![GitHub](https://img.shields.io/github/stars/cocoindex-io/cocoindex?color=5B5BD6)](https://github.com/cocoindex-io/cocoindex)

In this example, we build an index flow that loads markdown files from a local directory, converts them to HTML, and saves the results to another local directory, powered by [CocoIndex Custom Targets](https://cocoindex.io/docs/custom_ops/custom_targets).

We appreciate a star ⭐ at [CocoIndex GitHub](https://github.com/cocoindex-io/cocoindex) if this is helpful.

## Steps

### Indexing Flow

1. Ingest the local markdown files from the `data/` directory.
2. Convert each file to HTML using [markdown-it-py](https://markdown-it-py.readthedocs.io/).
3. Save the HTML files to the local directory `output_html/`.
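
Stripped of CocoIndex, the three steps above amount roughly to the following minimal sketch. It assumes `markdown-it-py` (with the `linkify` extra) is installed and uses the same `gfm-like` preset as `main.py`. The names `default_render` and `convert_directory` are hypothetical and not part of this example, and unlike the real flow the sketch does no change tracking.

```python
from pathlib import Path
from typing import Callable


def default_render(text: str) -> str:
    # Assumption: markdown-it-py is installed (it is a dependency of this example).
    from markdown_it import MarkdownIt

    return MarkdownIt("gfm-like").render(text)


def convert_directory(
    src: str, dst: str, render: Callable[[str], str] = default_render
) -> list[str]:
    """Render every *.md file in `src` to `<filename>.html` in `dst`."""
    out_dir = Path(dst)
    out_dir.mkdir(parents=True, exist_ok=True)
    written = []
    for md_path in sorted(Path(src).glob("*.md")):
        html = render(md_path.read_text(encoding="utf-8"))
        # Mirror main.py: keep the source filename and append ".html".
        out_path = out_dir / (md_path.name + ".html")
        out_path.write_text(html, encoding="utf-8")
        written.append(out_path.name)
    return written
```

Calling `convert_directory("data", "output_html")` would rewrite every file on each run, which is the cost the incremental flow avoids.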

## Prerequisite

[Install Postgres](https://cocoindex.io/docs/getting_started/installation#-install-postgres) if you don't have it installed.

## Run

Install dependencies:

```bash
pip install -e .
```

Update the target:

```bash
cocoindex update --setup main.py
```

You can add new files to the `data/` directory, or delete or update existing ones.
Each time you run the `update` command, CocoIndex re-processes only the files that have changed and keeps the target in sync with the source.
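
To make "only the files that have changed" concrete, here is a minimal sketch of one way such a change plan could be computed, by comparing content fingerprints from the previous run with the current source files. This only illustrates the idea and is not CocoIndex's actual change-tracking implementation; `fingerprint` and `plan_mutations` are hypothetical names. The result uses the same convention as the `mutate` method in `main.py`: a `None` value means the file should be removed.

```python
import hashlib
from typing import Optional


def fingerprint(content: str) -> str:
    """Return a stable hash of the file content."""
    return hashlib.sha256(content.encode("utf-8")).hexdigest()


def plan_mutations(
    previous: dict[str, str], current: dict[str, str]
) -> dict[str, Optional[str]]:
    """Compare stored fingerprints with the current source contents.

    `previous` maps filename -> fingerprint recorded on the last run.
    `current` maps filename -> file content as it is now.
    Returns filename -> new content, or None when the file was deleted.
    """
    mutations: dict[str, Optional[str]] = {}
    for name, content in current.items():
        # New files have no stored fingerprint; changed files have a different one.
        if previous.get(name) != fingerprint(content):
            mutations[name] = content
    for name in previous:
        if name not in current:
            mutations[name] = None
    return mutations
```

Unchanged files never appear in the plan, so they are neither re-rendered nor rewritten.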

You can also run the `update` command in live mode, which keeps the target continuously in sync with the source:

```bash
cocoindex update --setup -L main.py
```

## CocoInsight

You can use CocoInsight (free beta now) to troubleshoot the index generation and understand the data lineage of the pipeline.
It connects to your local CocoIndex server, with zero pipeline data retention. Run the following command to start CocoInsight:

```bash
cocoindex server -ci main.py
```

Then open the CocoInsight UI at [https://cocoindex.io/cocoinsight](https://cocoindex.io/cocoinsight).
21 changes: 21 additions & 0 deletions examples/custom_output_files/data/bizarre_animals.md
@@ -0,0 +1,21 @@
In the spirit of Project Zeta’s innovative chaos, here’s a collection of absurdly true facts about the weirdest animals you’ve never heard of:

1. **Tardigrade (Water Bear)**: This microscopic beast can survive outer space, radiation, and being boiled alive. It once crashed a team meeting by stowing away in Bob’s coffee mug and demanding admin access to the server.

2. **Aye-Aye**: A Madagascar primate with a creepy long finger it uses to tap trees for grubs. It tried to “debug” our codebase by tapping the keyboard, resulting in 47 nested for-loops.

3. **Saiga Antelope**: This goofy-nosed critter looks like it’s auditioning for a sci-fi flick. Its sneezes are so powerful they once blew out the office Wi-Fi during a sprint review.

4. **Glaucus Atlanticus (Blue Dragon Sea Slug)**: This tiny ocean dragon steals venom from jellyfish and uses it like a borrowed superpower. It infiltrated our water cooler and left behind a sparkly, toxic trail.

5. **Pink Fairy Armadillo**: A palm-sized digger that looks like a cotton candy tank. It burrowed into the office carpet, mistaking it for a desert, and now we have a “no armadillos” policy.

6. **Dumbo Octopus**: A deep-sea octopus with ear-like fins, flapping around like it’s late for a Zoom call. It once rewired our projector to display memes of itself across the office.

7. **Jerboa**: A hopping desert rodent with kangaroo vibes. It stole the team’s snacks and leaped over three cubicles before anyone noticed, earning the codename "Snack Bandit."

8. **Mantis Shrimp**: This crustacean sees more colors than our graphic designer and punches harder than a failing CI pipeline. It shattered a monitor when we tried to pair-program with it.

9. **Okapi**: A zebra-giraffe hybrid that looks like a Photoshop error. It wandered into our sprint planning and suggested we pivot to a “forest-themed” microservices architecture.

10. **Blobfish**: The ocean’s saddest-looking blob, voted “Most Likely to Crash a Stand-Up” by the team. Its mere presence caused our morale bot to send 200 crying emojis.
19 changes: 19 additions & 0 deletions examples/custom_output_files/data/chunk_norris.md
@@ -0,0 +1,19 @@
# Chuck Norris Project Facts
Date: 2025-07-20
Author: Anonymous (because Chuck Norris knows who you are)

Here are some totally true facts about Chuck Norris's involvement in Project Omega:

1. Chuck Norris doesn't write code; he stares at the computer until it writes itself out of fear.
2. The project deadline was yesterday, but time rescheduled itself to accommodate Chuck Norris.
3. Chuck Norris's code never has bugs—just "features" that are too scared to misbehave.
4. When the database crashed, Chuck Norris roundhouse-kicked the server, and it apologized.
5. The team tried to use Agile, but Chuck Norris declared, "I am the only methodology you need."
6. Version control? Chuck Norris is the only version that matters.
7. The project scope expanded because Chuck Norris added "world domination" as a deliverable.
8. When the CI/CD pipeline failed, Chuck Norris rebuilt it with a single grunt.
9. The codebase is 100% documented because no one dares ask Chuck Norris, "What does this do?"
10. Chuck Norris doesn't deploy to production; production deploys to Chuck Norris.

Last updated: 2025-07-20 06:36 AM MST
Note: If you modify this file, Chuck Norris will know... and he’ll find you.
123 changes: 123 additions & 0 deletions examples/custom_output_files/main.py
@@ -0,0 +1,123 @@
from datetime import timedelta
import os
import dataclasses

import cocoindex
from markdown_it import MarkdownIt

_markdown_it = MarkdownIt("gfm-like")


class LocalFileTarget(cocoindex.op.TargetSpec):
    """Represents the custom target spec."""

    # The directory to save the HTML files.
    directory: str


@dataclasses.dataclass
class LocalFileTargetValues:
    """Represents value fields of exported data. Used in `mutate` method below."""

    html: str


@cocoindex.op.target_connector(spec_cls=LocalFileTarget)
class LocalFileTargetConnector:
    @staticmethod
    def get_persistent_key(spec: LocalFileTarget, target_name: str) -> str:
        """Use the directory path as the persistent key for this target."""
        return spec.directory

    @staticmethod
    def describe(key: str) -> str:
        """(Optional) Return a human-readable description of the target."""
        return f"Local directory {key}"

    @staticmethod
    def apply_setup_change(
        key: str, previous: LocalFileTarget | None, current: LocalFileTarget | None
    ) -> None:
        """
        Apply setup changes to the target.

        Best practice: keep all actions idempotent.
        """

        # Create the directory if the target is newly added.
        if previous is None and current is not None:
            os.makedirs(current.directory, exist_ok=True)

        # Remove the generated HTML files, then the directory, when the target is dropped.
        # Note: `os.rmdir` only succeeds if no other files remain in the directory.
        if previous is not None and current is None:
            if os.path.isdir(previous.directory):
                for filename in os.listdir(previous.directory):
                    if filename.endswith(".html"):
                        os.remove(os.path.join(previous.directory, filename))
                os.rmdir(previous.directory)

    @staticmethod
    def prepare(spec: LocalFileTarget) -> LocalFileTarget:
        """
        (Optional) Prepare for execution. To run common operations before applying any mutations.
        The returned value will be passed as the first element of tuples in `mutate` method.

        If not provided, will directly pass the spec to `mutate` method.
        """
        return spec

    @staticmethod
    def mutate(
        *all_mutations: tuple[LocalFileTarget, dict[str, LocalFileTargetValues | None]],
    ) -> None:
        """
        Mutate the target.

        The first element of the tuple is the target spec.
        The second element is a dictionary of mutations:
        - The key is the filename, and the value is the mutation.
        - If the value is `None`, the file will be removed.
          Otherwise, the file will be written with the content.

        Best practice: keep all actions idempotent.
        """
        for spec, mutations in all_mutations:
            for filename, mutation in mutations.items():
                full_path = os.path.join(spec.directory, filename) + ".html"
                if mutation is None:
                    # Deleting a file that is already gone is not an error.
                    try:
                        os.remove(full_path)
                    except FileNotFoundError:
                        pass
                else:
                    # Write with an explicit encoding so output does not depend on the locale.
                    with open(full_path, "w", encoding="utf-8") as f:
                        f.write(mutation.html)


@cocoindex.op.function()
def markdown_to_html(text: str) -> str:
    return _markdown_it.render(text)


@cocoindex.flow_def(name="CustomOutputFiles")
def custom_output_files(
    flow_builder: cocoindex.FlowBuilder, data_scope: cocoindex.DataScope
) -> None:
    """
    Define an example flow that exports markdown files to HTML files.
    """
    data_scope["documents"] = flow_builder.add_source(
        cocoindex.sources.LocalFile(path="data", included_patterns=["*.md"]),
        refresh_interval=timedelta(seconds=5),
    )

    output_html = data_scope.add_collector()
    with data_scope["documents"].row() as doc:
        doc["html"] = doc["content"].transform(markdown_to_html)
        output_html.collect(filename=doc["filename"], html=doc["html"])

    output_html.export(
        "OutputHtml",
        LocalFileTarget(directory="output_html"),
        primary_key_fields=["filename"],
    )
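
The `mutate` contract described in the docstring of `main.py` (a `None` value deletes the file, any other value overwrites it) can be exercised without CocoIndex. The following is a minimal standalone sketch under that assumption; `apply_mutations` is a hypothetical helper that takes plain HTML strings instead of `LocalFileTargetValues`, and it is not part of the example itself.

```python
import os
from typing import Optional


def apply_mutations(directory: str, mutations: dict[str, Optional[str]]) -> None:
    """Write or delete `<filename>.html` files; safe to call repeatedly."""
    os.makedirs(directory, exist_ok=True)
    for filename, html in mutations.items():
        full_path = os.path.join(directory, filename) + ".html"
        if html is None:
            try:
                os.remove(full_path)
            except FileNotFoundError:
                pass  # Already gone, so a retry is harmless.
        else:
            with open(full_path, "w", encoding="utf-8") as f:
                f.write(html)
```

Because deleting a missing file is ignored and writing replaces the whole file, applying the same batch twice leaves the directory in the same state, which is what "keep all actions idempotent" asks for.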
9 changes: 9 additions & 0 deletions examples/custom_output_files/pyproject.toml
@@ -0,0 +1,9 @@
[project]
name = "custom-output-files"
version = "0.1.0"
description = "Simple example for cocoindex: convert markdown files to HTML files and save them to a local directory."
requires-python = ">=3.11"
dependencies = ["cocoindex>=0.1.74", "markdown-it-py[linkify,plugins]"]

[tool.setuptools]
packages = []