ragimg

Your RAG chatbot cannot answer questions about diagrams because it never indexed them.

ragimg is a small Go CLI that reads Markdown and HTML documentation, finds useful images, describes them once with a vision model, and writes ordinary text chunks for your existing embedding pipeline.

Index diagrams, screenshots, charts, and tables from your docs. Skip junk. Avoid duplicate API calls. Export JSONL.

[Image: ragimg terminal demo showing dry-run, JSONL output, stats, and report generation]

The demo above is generated from the checked-in fixture in testdata/demo-docs. Regenerate it by running VHS on scripts/demo.tape, or use the fallback renderer when VHS is not installed:

go run scripts/render-demo-gif.go

Why ragimg exists

Most RAG setups treat documentation as text and quietly lose whatever is stored in images: architecture diagrams, database screenshots, charts, product states, sequence flows, or deployment dashboards. Query-time multimodal retrieval can help, but it usually means higher latency, higher cost, and a different serving stack.

ragimg moves that work to indexing time. Each useful image becomes a text record with source metadata, cache state, model details, and enough context to load into Pinecone, Qdrant, Chroma, LangChain, LlamaIndex, or anything else that already accepts documents.

What it does

| Step | What happens |
| --- | --- |
| Scan | Reads local .md, .mdx, .html, and .htm files under --docs. |
| Filter | Drops tiny files, missing images, unsupported formats, obvious logos, badges, icons, trackers, and unsafe symlinks. |
| Deduplicate | Hashes image bytes or canonical remote URLs plus nearby context, provider, model, and detail. |
| Caption | Sends only new work to OpenAI or a local Ollama server. |
| Export | Writes stable JSONL by default, with JSON, CSV, and Markdown exports for review workflows. |
| Audit | Builds a static HTML report with image previews and generated captions. |

Install

Download a release binary from GitHub Releases, or install from source with Go 1.25 or newer:

go install github.com/balyakin/ragimg@latest

Docker works well in CI or on machines where you do not want to install Go:

docker run --rm \
  -v "$PWD:/work" \
  ghcr.io/balyakin/ragimg:latest \
  index --docs /work/docs --output /work/chunks.jsonl

Quickstart

Start with a dry run. It scans, filters, checks the cache, estimates the remaining work, and never asks for an API key:

ragimg index --docs ./docs --output chunks.jsonl --dry-run

Then run the actual caption job:

export OPENAI_API_KEY=...
ragimg index --docs ./docs --output chunks.jsonl

Review what came out:

ragimg stats chunks.jsonl
ragimg report --input chunks.jsonl --output ragimg-report.html

Or try the fixture without touching your own repository:

ragimg index --docs testdata/demo-docs --output chunks.jsonl --dry-run --verbose

Output

JSONL is the default because it is easy to stream, diff, upload, and inspect. Each line is one image caption chunk. The example below is expanded for readability:

{
  "id": "img_demo_architecture",
  "text": "OAuth2 architecture diagram showing a Browser PKCE client sending authorize requests through an API Gateway to an Auth Service. The Auth Service issues tokens, stores encrypted refresh tokens in Token Store, and sends login and consent events to Audit Log.",
  "metadata": {
    "chunk_type": "image_caption",
    "source_file": "README.md",
    "source_type": "markdown",
    "image_path": "images/architecture.svg",
    "original_path": "images/architecture.svg",
    "is_remote": false,
    "alt_text": "OAuth2 architecture",
    "title": "OAuth2 architecture",
    "section_heading": "OAuth2 Flow",
    "heading_path": ["OAuth2 Flow"],
    "provider": "openai",
    "model": "gpt-5.4-mini",
    "detail": "low",
    "cached": false,
    "indexed_at": "2026-06-05T08:00:00Z"
  }
}
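
A consumer can sanity-check each line before embedding it. The sketch below is a Python illustration based only on the fields visible in the example record above. The helper name `parse_chunk` and the decision about which keys to treat as required are assumptions, not part of ragimg.

```python
import json

# Assumption: these three keys are treated as required because they
# appear at the top level of the example record.
REQUIRED_KEYS = ("id", "text", "metadata")


def parse_chunk(line: str) -> dict:
    """Parse one JSONL line and check the fields a loader relies on."""
    chunk = json.loads(line)
    missing = [key for key in REQUIRED_KEYS if key not in chunk]
    if missing:
        raise ValueError(f"chunk is missing keys: {missing}")
    if chunk["metadata"].get("chunk_type") != "image_caption":
        raise ValueError("unexpected chunk_type")
    return chunk


# One compact line, shaped like the expanded example above.
sample_line = json.dumps({
    "id": "img_demo_architecture",
    "text": "OAuth2 architecture diagram ...",
    "metadata": {"chunk_type": "image_caption", "source_file": "README.md"},
})
chunk = parse_chunk(sample_line)
```

Failing early on a malformed line is usually cheaper than discovering a missing id after the records are already in a vector store.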

The same scan can be exported in other formats:

ragimg index --docs ./docs --format json --output chunks.json
ragimg index --docs ./docs --format csv --output chunks.csv
ragimg index --docs ./docs --format md --output chunks.md

HTML report

The report is a single static HTML file. It has no external CSS, no external JavaScript, and it does not fetch remote images. Small local non-SVG previews are embedded by default; large files and SVGs are referenced from disk.

[Image: ragimg HTML report sample]

ragimg report --input chunks.jsonl --output ragimg-report.html

Use --docs-root when the report is generated away from the original documentation tree:

ragimg report --input chunks.jsonl --docs-root ./docs --output ragimg-report.html

Cache and resume

ragimg is designed for reruns. The default cache file is .ragimg-cache.json; the default progress file is .ragimg-progress.json.

The cache key includes:

  • image bytes for local files, or a canonical URL for remote images
  • surrounding documentation context
  • provider, model, and detail
  • prompt-affecting metadata such as alt text and headings

That means unchanged image/context pairs are not sent to the provider again. If a run is interrupted, the progress file lets the next run reuse completed work before rebuilding the output.
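
The cache key described above can be sketched roughly as follows. This is a Python illustration, not ragimg's actual Go implementation: the function name `cache_key`, the field order, the length-prefix framing, and the choice of SHA-256 are all assumptions.

```python
import hashlib


def cache_key(image_ref: bytes, context: str, provider: str, model: str,
              detail: str, alt_text: str = "", heading: str = "") -> str:
    """Hypothetical key: hash every input that can change the caption.

    image_ref is the raw bytes of a local file, or the canonical URL
    (encoded to bytes) for a remote image.
    """
    digest = hashlib.sha256()
    parts = (image_ref, context.encode(), provider.encode(), model.encode(),
             detail.encode(), alt_text.encode(), heading.encode())
    for part in parts:
        # Length-prefix each part so ("ab", "c") and ("a", "bc") differ.
        digest.update(len(part).to_bytes(8, "big"))
        digest.update(part)
    return digest.hexdigest()


first = cache_key(b"\x89PNG...", "OAuth2 Flow", "openai", "gpt-5.4-mini", "low")
again = cache_key(b"\x89PNG...", "OAuth2 Flow", "openai", "gpt-5.4-mini", "low")
other = cache_key(b"\x89PNG...", "OAuth2 Flow", "ollama", "llava", "low")
```

Here `first` and `again` are identical, so the second run would be a cache hit, while `other` differs because the provider and model changed.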

Useful controls:

ragimg index --docs ./docs --max-images 25
ragimg index --docs ./docs --include "**/*.md" --exclude "**/assets/logo*"
ragimg index --docs ./docs --no-cache
ragimg index --docs ./docs --no-resume

Providers

OpenAI is the default provider:

export OPENAI_API_KEY=...
ragimg index --docs ./docs --provider openai --model gpt-5.4-mini

To caption images locally instead, run Ollama and pick a vision-capable model:

ragimg index --docs ./docs --provider ollama --model llava

OpenAI can receive local images and remote image URLs. Ollama in v0.1 supports local images only. In both cases, --dry-run is the safe way to see what would be processed before sending anything to a model.

GitHub Action

- uses: balyakin/ragimg@v0.1.0
  with:
    docs: ./docs
    output: chunks.jsonl
    provider: openai
  env:
    OPENAI_API_KEY: ${{ secrets.OPENAI_API_KEY }}

A common CI pattern is to run a dry run on pull requests and reserve paid captioning for a scheduled job or a protected branch:

- uses: balyakin/ragimg@v0.1.0
  with:
    docs: ./docs
    output: chunks.jsonl
    dry-run: "true"

Loading chunks

Full examples live in the examples directory. The important part is simple: embed the text field, keep the metadata, and use id as the document key.

Pinecone:

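import json
from pinecone import Pinecone

pc = Pinecone()  # reads PINECONE_API_KEY from the environment
index = pc.Index("docs")  # upsert_records needs an index with integrated embedding

with open("chunks.jsonl", encoding="utf-8") as f:
    for line in f:
        chunk = json.loads(line)
        # Flatten metadata next to the text so Pinecone stores it with the record.
        index.upsert_records(
            "default",
            [{"_id": chunk["id"], "text": chunk["text"], **chunk["metadata"]}],
        )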

LangChain:

import json
from langchain_core.documents import Document

docs = []
with open("chunks.jsonl", encoding="utf-8") as f:
    for line in f:
        chunk = json.loads(line)
        docs.append(Document(page_content=chunk["text"], metadata=chunk["metadata"]))

LlamaIndex:

import json
from llama_index.core import Document

documents = []
with open("chunks.jsonl", encoding="utf-8") as f:
    for line in f:
        chunk = json.loads(line)
        documents.append(Document(text=chunk["text"], metadata=chunk["metadata"]))
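
For a store without a dedicated example, the same pattern can be sketched without any client library. The helper `load_chunks` below is hypothetical. It only illustrates keying records by id so that reloading the same file replaces entries instead of duplicating them.

```python
import json
from typing import Dict, Iterable


def load_chunks(lines: Iterable[str]) -> Dict[str, dict]:
    """Hypothetical helper: map chunk id -> {"text", "metadata"}."""
    records: Dict[str, dict] = {}
    for line in lines:
        if not line.strip():
            continue  # tolerate blank lines
        chunk = json.loads(line)
        # A later line with the same id replaces the earlier one.
        records[chunk["id"]] = {
            "text": chunk["text"],
            "metadata": chunk["metadata"],
        }
    return records


sample_lines = [
    '{"id": "img_a", "text": "first caption", "metadata": {"source_file": "a.md"}}',
    "",
    '{"id": "img_a", "text": "updated caption", "metadata": {"source_file": "a.md"}}',
    '{"id": "img_b", "text": "other caption", "metadata": {"source_file": "b.md"}}',
]
records = load_chunks(sample_lines)
```

In practice the input would be an open file handle, for example `load_chunks(open("chunks.jsonl", encoding="utf-8"))`, and each record would then be passed to whatever upsert call the target store provides.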

Configuration

ragimg reads .ragimg.yaml when present. RAGIMG_CONFIG or --config can point to another file. Paths inside the config file are resolved relative to that file, which makes checked-in configs easier to move between machines.

docs: ./docs
output: chunks.jsonl
format: jsonl
provider: openai
model: gpt-5.4-mini
detail: low
workers: 4
timeout: 30s
cache: .ragimg-cache.json
resume: .ragimg-progress.json
dedup: true
max_images: 0
include:
  - "**/*.md"
  - "**/*.mdx"
  - "**/*.html"
  - "**/*.htm"
exclude:
  - "**/node_modules/**"

Provider secrets are never read from config files. Use OPENAI_API_KEY or --api-key. RAGIMG_CACHE can override the cache path, and NO_COLOR, RAGIMG_NO_COLOR=1, or --no-color disable colored output.
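
The rule that paths in the config file resolve relative to that file can be illustrated with a short Python sketch. The function name `resolve_config_path` is hypothetical, and leaving absolute paths untouched is an assumption about the behavior rather than something this README states.

```python
from pathlib import PurePosixPath


def resolve_config_path(config_file: str, value: str) -> PurePosixPath:
    """Resolve a config value against the directory holding the config file."""
    candidate = PurePosixPath(value)
    if candidate.is_absolute():
        # Assumption: absolute paths are used as written.
        return candidate
    return PurePosixPath(config_file).parent / candidate


docs_dir = resolve_config_path("/repo/.ragimg.yaml", "./docs")
cache_file = resolve_config_path("/repo/.ragimg.yaml", ".ragimg-cache.json")
```

With this rule, running the tool from a subdirectory or from CI does not change which docs folder or cache file the checked-in config points to.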

Commands

| Command | Purpose |
| --- | --- |
| ragimg index | Scan docs and write caption chunks. |
| ragimg preview --image path/to/image.png --dry-run | Show the resolved prompt and image metadata for one image. |
| ragimg report --input chunks.jsonl | Build the static review report. |
| ragimg stats chunks.jsonl | Print totals, model usage, date range, and top source files. |
| ragimg completion bash | Generate shell completion. |
| ragimg version | Print build version, commit, and date. |

What gets skipped

The default filters are intentionally conservative. ragimg skips missing local files, unsupported formats, Git LFS pointer files, very small or very large files, tiny dimensions, extreme aspect ratios, symlinks that leave the docs root, and filenames that look like logos, icons, badges, avatars, spacers, tracking pixels, or social sharing assets.

Supported image extensions are .png, .jpg, .jpeg, .gif, .webp, and .svg.

Supported document extensions are .md, .mdx, .html, and .htm.
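
Two of the checks above, the extension allowlist and the filename heuristic, can be sketched in Python as follows. The extension set comes from this README. The regular expression and the function name `skip_reason` are assumptions, and ragimg's real patterns and its size, dimension, and symlink checks are not reproduced here.

```python
import re
from pathlib import PurePosixPath
from typing import Optional

SUPPORTED_IMAGE_EXTENSIONS = {".png", ".jpg", ".jpeg", ".gif", ".webp", ".svg"}

# Hypothetical name heuristic; the tool's actual patterns may differ.
JUNK_NAME = re.compile(r"(logo|icon|badge|avatar|spacer|pixel)", re.IGNORECASE)


def skip_reason(image_path: str) -> Optional[str]:
    """Return why an image would be skipped, or None to keep it."""
    path = PurePosixPath(image_path)
    if path.suffix.lower() not in SUPPORTED_IMAGE_EXTENSIONS:
        return "unsupported format"
    if JUNK_NAME.search(path.stem):
        return "filename looks like a logo, icon, or badge"
    return None


kept = skip_reason("images/architecture.svg")
junk = skip_reason("assets/logo-dark.png")
unsupported = skip_reason("docs/flow.bmp")
```

A name-based heuristic like this one can misfire on legitimate files, which is one reason to inspect a --dry-run before paying for captions.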

Benchmarking

This README does not invent benchmark numbers. Generate current numbers from the checked-in fixture:

scripts/benchmark.sh

When publishing benchmark claims, include the command, date, commit, fixture or repository snapshot, and output summary.

Current limits

ragimg v0.1 is a local indexing tool. It does not crawl websites, parse PDFs, run OCR, upload directly to vector databases, or host a review UI. Those are good future features, but the first release keeps the contract narrow: scan local docs, caption useful images, write portable chunks, and make reruns cheap.

Development

go test ./...
go vet ./...

The repository includes CI for Go 1.25.x and 1.26.x, release binaries for Linux, macOS, and Windows, a Docker image, and a root action.yml for GitHub Actions.

License

Apache-2.0. See LICENSE.
