diff --git a/README.md b/README.md
index 69451a05..bdb18464 100644
--- a/README.md
+++ b/README.md
@@ -185,6 +185,7 @@ It defines an index flow like this:
 | [Image Search with Vision API](examples/image_search) | Generates detailed captions for images using a vision model, embeds them, enables live-updating semantic search via FastAPI and served on a React frontend|
 | [Face Recognition](examples/face_recognition) | Recognize faces in images and build embedding index |
 | [Paper Metadata](examples/paper_metadata) | Index papers in PDF files, and build metadata tables for each paper |
+| [Custom Output Files](examples/custom_output_files) | Convert markdown files to HTML files and save them to a local directory, using *CocoIndex Custom Targets* |
 
 More coming and stay tuned 👀!
 
diff --git a/examples/custom_output_files/.env b/examples/custom_output_files/.env
new file mode 100644
index 00000000..335f3060
--- /dev/null
+++ b/examples/custom_output_files/.env
@@ -0,0 +1,2 @@
+# Postgres database address for cocoindex
+COCOINDEX_DATABASE_URL=postgres://cocoindex:cocoindex@localhost/cocoindex
diff --git a/examples/custom_output_files/.gitignore b/examples/custom_output_files/.gitignore
new file mode 100644
index 00000000..61e0e829
--- /dev/null
+++ b/examples/custom_output_files/.gitignore
@@ -0,0 +1 @@
+output_html/
diff --git a/examples/custom_output_files/README.md b/examples/custom_output_files/README.md
new file mode 100644
index 00000000..7d1df94f
--- /dev/null
+++ b/examples/custom_output_files/README.md
@@ -0,0 +1,53 @@
+# Export markdown files to local HTML with Custom Targets
+[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/cocoindex-io/cocoindex/blob/main/examples/text_embedding/Text_Embedding.ipynb)
+[![GitHub](https://img.shields.io/github/stars/cocoindex-io/cocoindex?color=5B5BD6)](https://github.com/cocoindex-io/cocoindex)
+
+In this example, we will build an index flow that loads markdown files from a local directory, converts them to HTML, and saves the result to another local directory, powered by [CocoIndex Custom Targets](https://cocoindex.io/docs/custom_ops/custom_targets).
+
+We appreciate a star ⭐ at [CocoIndex GitHub](https://github.com/cocoindex-io/cocoindex) if this is helpful.
+
+## Steps
+
+### Indexing Flow
+
+1. We ingest a list of local markdown files from the `data/` directory.
+2. For each file, we convert it to HTML using [markdown-it-py](https://markdown-it-py.readthedocs.io/).
+3. We save the HTML files to a local directory `output_html/`.
+
+## Prerequisite
+
+[Install Postgres](https://cocoindex.io/docs/getting_started/installation#-install-postgres) if you don't have one.
+
+## Run
+
+Install dependencies:
+
+```bash
+pip install -e .
+```
+
+Update the target:
+
+```bash
+cocoindex update --setup main.py
+```
+
+You can add new files to the `data/` directory, or delete or update existing files.
+Each time you run the `update` command, CocoIndex re-processes only the files that have changed and keeps the target in sync with the source.
+
+You can also run the `update` command in live mode, which keeps the target in sync with the source in real time:
+
+```bash
+cocoindex update --setup -L main.py
+```
+
+## CocoInsight
+
+I used CocoInsight (Free beta now) to troubleshoot the index generation and understand the data lineage of the pipeline.
+It just connects to your local CocoIndex server, with zero pipeline data retention. Run the following command to start CocoInsight:
+
+```
+cocoindex server -ci main.py
+```
+
+Then open the CocoInsight UI at [https://cocoindex.io/cocoinsight](https://cocoindex.io/cocoinsight).
diff --git a/examples/custom_output_files/data/bizarre_animals.md b/examples/custom_output_files/data/bizarre_animals.md
new file mode 100644
index 00000000..013e7a73
--- /dev/null
+++ b/examples/custom_output_files/data/bizarre_animals.md
@@ -0,0 +1,21 @@
+In the spirit of Project Zeta’s innovative chaos, here’s a collection of absurdly true facts about the weirdest animals you’ve never heard of:
+
+1. **Tardigrade (Water Bear)**: This microscopic beast can survive outer space, radiation, and being boiled alive. It once crashed a team meeting by stowing away in Bob’s coffee mug and demanding admin access to the server.
+
+2. **Aye-Aye**: A Madagascar primate with a creepy long finger it uses to tap trees for grubs. It tried to “debug” our codebase by tapping the keyboard, resulting in 47 nested for-loops.
+
+3. **Saiga Antelope**: This goofy-nosed critter looks like it’s auditioning for a sci-fi flick. Its sneezes are so powerful they once blew out the office Wi-Fi during a sprint review.
+
+4. **Glaucus Atlanticus (Blue Dragon Sea Slug)**: This tiny ocean dragon steals venom from jellyfish and uses it like a borrowed superpower. It infiltrated our water cooler and left behind a sparkly, toxic trail.
+
+5. **Pink Fairy Armadillo**: A palm-sized digger that looks like a cotton candy tank. It burrowed into the office carpet, mistaking it for a desert, and now we have a “no armadillos” policy.
+
+6. **Dumbo Octopus**: A deep-sea octopus with ear-like fins, flapping around like it’s late for a Zoom call. It once rewired our projector to display memes of itself across the office.
+
+7. **Jerboa**: A hopping desert rodent with kangaroo vibes. It stole the team’s snacks and leaped over three cubicles before anyone noticed, earning the codename "Snack Bandit."
+
+8. **Mantis Shrimp**: This crustacean sees more colors than our graphic designer and punches harder than a failing CI pipeline. It shattered a monitor when we tried to pair-program with it.
+
+9. **Okapi**: A zebra-giraffe hybrid that looks like a Photoshop error. It wandered into our sprint planning and suggested we pivot to a “forest-themed” microservices architecture.
+
+10. **Blobfish**: The ocean’s saddest-looking blob, voted “Most Likely to Crash a Stand-Up” by the team. Its mere presence caused our morale bot to send 200 crying emojis.
diff --git a/examples/custom_output_files/data/chunk_norris.md b/examples/custom_output_files/data/chunk_norris.md
new file mode 100644
index 00000000..89952641
--- /dev/null
+++ b/examples/custom_output_files/data/chunk_norris.md
@@ -0,0 +1,19 @@
+# Chuck Norris Project Facts
+Date: 2025-07-20
+Author: Anonymous (because Chuck Norris knows who you are)
+
+Here are some totally true facts about Chuck Norris's involvement in Project Omega:
+
+1. Chuck Norris doesn't write code; he stares at the computer until it writes itself out of fear.
+2. The project deadline was yesterday, but time rescheduled itself to accommodate Chuck Norris.
+3. Chuck Norris's code never has bugs—just "features" that are too scared to misbehave.
+4. When the database crashed, Chuck Norris roundhouse-kicked the server, and it apologized.
+5. The team tried to use Agile, but Chuck Norris declared, "I am the only methodology you need."
+6. Version control? Chuck Norris is the only version that matters.
+7. The project scope expanded because Chuck Norris added "world domination" as a deliverable.
+8. When the CI/CD pipeline failed, Chuck Norris rebuilt it with a single grunt.
+9. The codebase is 100% documented because no one dares ask Chuck Norris, "What does this do?"
+10. Chuck Norris doesn't deploy to production; production deploys to Chuck Norris.
+
+Last updated: 2025-07-20 06:36 AM MST
+Note: If you modify this file, Chuck Norris will know... and he’ll find you.
diff --git a/examples/custom_output_files/main.py b/examples/custom_output_files/main.py
new file mode 100644
index 00000000..5bbfa83d
--- /dev/null
+++ b/examples/custom_output_files/main.py
@@ -0,0 +1,123 @@
+from datetime import timedelta
+import os
+import dataclasses
+
+import cocoindex
+from markdown_it import MarkdownIt
+
+_markdown_it = MarkdownIt("gfm-like")
+
+
+class LocalFileTarget(cocoindex.op.TargetSpec):
+    """Represents the custom target spec."""
+
+    # The directory to save the HTML files.
+    directory: str
+
+
+@dataclasses.dataclass
+class LocalFileTargetValues:
+    """Represents value fields of exported data. Used in `mutate` method below."""
+
+    html: str
+
+
+@cocoindex.op.target_connector(spec_cls=LocalFileTarget)
+class LocalFileTargetConnector:
+    @staticmethod
+    def get_persistent_key(spec: LocalFileTarget, target_name: str) -> str:
+        """Use the directory path as the persistent key for this target."""
+        return spec.directory
+
+    @staticmethod
+    def describe(key: str) -> str:
+        """(Optional) Return a human-readable description of the target."""
+        return f"Local directory {key}"
+
+    @staticmethod
+    def apply_setup_change(
+        key: str, previous: LocalFileTarget | None, current: LocalFileTarget | None
+    ) -> None:
+        """
+        Apply setup changes to the target.
+
+        Best practice: keep all actions idempotent.
+        """
+
+        # Create the directory if it didn't exist.
+        if previous is None and current is not None:
+            os.makedirs(current.directory, exist_ok=True)
+
+        # Remove the generated HTML files, then the directory, once the target is dropped.
+        if previous is not None and current is None:
+            if os.path.isdir(previous.directory):
+                for filename in os.listdir(previous.directory):
+                    if filename.endswith(".html"):
+                        os.remove(os.path.join(previous.directory, filename))
+                os.rmdir(previous.directory)
+
+    @staticmethod
+    def prepare(spec: LocalFileTarget) -> LocalFileTarget:
+        """
+        (Optional) Prepare for execution. To run common operations before applying any mutations.
+        The returned value will be passed as the first element of tuples in `mutate` method.
+
+        If not provided, will directly pass the spec to `mutate` method.
+        """
+        return spec
+
+    @staticmethod
+    def mutate(
+        *all_mutations: tuple[LocalFileTarget, dict[str, LocalFileTargetValues | None]],
+    ) -> None:
+        """
+        Mutate the target.
+
+        The first element of the tuple is the target spec.
+        The second element is a dictionary of mutations:
+        - The key is the filename, and the value is the mutation.
+        - If the value is `None`, the file will be removed.
+          Otherwise, the file will be written with the content.
+
+        Best practice: keep all actions idempotent.
+        """
+        for spec, mutations in all_mutations:
+            for filename, mutation in mutations.items():
+                full_path = os.path.join(spec.directory, filename) + ".html"
+                if mutation is None:
+                    try:
+                        os.remove(full_path)
+                    except FileNotFoundError:
+                        pass
+                else:
+                    with open(full_path, "w", encoding="utf-8") as f:
+                        f.write(mutation.html)
+
+
+@cocoindex.op.function()
+def markdown_to_html(text: str) -> str:
+    return _markdown_it.render(text)
+
+
+@cocoindex.flow_def(name="CustomOutputFiles")
+def custom_output_files(
+    flow_builder: cocoindex.FlowBuilder, data_scope: cocoindex.DataScope
+) -> None:
+    """
+    Define an example flow that exports markdown files to HTML files.
+    """
+    data_scope["documents"] = flow_builder.add_source(
+        cocoindex.sources.LocalFile(path="data", included_patterns=["*.md"]),
+        refresh_interval=timedelta(seconds=5),
+    )
+
+    output_html = data_scope.add_collector()
+    with data_scope["documents"].row() as doc:
+        doc["html"] = doc["content"].transform(markdown_to_html)
+        output_html.collect(filename=doc["filename"], html=doc["html"])
+
+    output_html.export(
+        "OutputHtml",
+        LocalFileTarget(directory="output_html"),
+        primary_key_fields=["filename"],
+    )
diff --git a/examples/custom_output_files/pyproject.toml b/examples/custom_output_files/pyproject.toml
new file mode 100644
index 00000000..939389f4
--- /dev/null
+++ b/examples/custom_output_files/pyproject.toml
@@ -0,0 +1,9 @@
+[project]
+name = "custom-output-files"
+version = "0.1.0"
+description = "Simple example for cocoindex: convert markdown files to HTML files and save them to a local directory."
+requires-python = ">=3.11"
+dependencies = ["cocoindex>=0.1.74", "markdown-it-py[linkify,plugins]"]
+
+[tool.setuptools]
+packages = []
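
Reviewer note (not part of the patch): the idempotent write/delete pattern that `LocalFileTargetConnector.mutate` follows in `main.py` can be exercised without CocoIndex or Postgres. The sketch below is only an illustration under stated assumptions: `apply_file_mutations` and `demo` are hypothetical names that do not exist in the PR, plain HTML strings stand in for `LocalFileTargetValues`, and the directory is created inline rather than in `apply_setup_change`.

```python
from __future__ import annotations

import os
import tempfile


def apply_file_mutations(directory: str, mutations: dict[str, str | None]) -> None:
    """Hypothetical stand-in for `mutate`: write or remove `<name>.html` files.

    A value of None removes the file; any other value is written as its content.
    Replaying the same mutations leaves the directory in the same state.
    """
    os.makedirs(directory, exist_ok=True)
    for name, html in mutations.items():
        path = os.path.join(directory, name) + ".html"
        if html is None:
            try:
                os.remove(path)
            except FileNotFoundError:
                # Already gone, so a replayed delete is a no-op.
                pass
        else:
            with open(path, "w", encoding="utf-8") as f:
                f.write(html)


def demo() -> list[str]:
    """Apply upserts and deletes twice each and return the surviving file names."""
    with tempfile.TemporaryDirectory() as tmp:
        out = os.path.join(tmp, "output_html")
        upserts = {"a.md": "<p>A</p>", "b.md": "<p>B</p>"}
        apply_file_mutations(out, upserts)
        apply_file_mutations(out, upserts)  # replayed upsert overwrites in place
        apply_file_mutations(out, {"a.md": None})
        apply_file_mutations(out, {"a.md": None})  # replayed delete does not raise
        return sorted(os.listdir(out))


if __name__ == "__main__":
    print(demo())
```

Replaying either kind of mutation is harmless here because writes overwrite the whole file and deletes tolerate a missing file, which is the property the "keep all actions idempotent" docstrings in `main.py` ask for.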