gipplab/ipfs-archive-tracker

ipfs-archive-tracker

  • Tracks announcements of published archives.
  • Serves archive list to requesting clients.
  • CIDs are derived from archives.
  • Indexes CIDs on IPFS by converting their payload to text with pdftotext (poppler-utils) and sending the text to an LLM, which returns a title, field, topic, niche, and 10 keywords.

Build

go build -o ipfs-archive-tracker .

Docker / Compose

A pre-built image is published to GHCR on every push to main and on version tags:

docker pull ghcr.io/gipplab/ipfs-archive-tracker:main

Or build locally:

docker build -t ipfs-archive-tracker .

Run container (persist data and expose only public API):

docker run -d --name ipfs-archive-tracker \
  -p 8385:8385 \
  -v "$(pwd)/tracker-data:/data" \
  -v "$(pwd)/.api_key:/data/.api_key:ro" \
  ghcr.io/gipplab/ipfs-archive-tracker:main \
  -o /data -public-port 8385 -port 8384

See docker-compose.yml for a full Compose example. If Kubo runs in another container/network, set -kubo to that service URL instead.

API key

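Indexing requires an API key. Provide it either in a file named .api_key in the output directory (-o) or the current working directory, or in the SAIA_API_KEY environment variable.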

Usage

Two servers by default: internal (Web UI: 127.0.0.1:8384) and public (Archives API: 0.0.0.0:8385). Expose only the public port externally.

# Start both servers; CIDs are taken from archives.json (from announce / IPNS refresh):
./ipfs-archive-tracker

# CLI: index pending CIDs from archives and exit (no web UI):
./ipfs-archive-tracker -cli

# Custom ports:
./ipfs-archive-tracker -o ./index-data -port 9000 -public-port 9001

Configuration

All settings can be provided as CLI flags or environment variables. Flags take precedence.

| Flag            | Env var                | Default                             | Description                                            |
|-----------------|------------------------|-------------------------------------|--------------------------------------------------------|
| -o              | TRACKER_DATA_DIR       | .                                   | Output directory for index files                       |
| -gateway        | TRACKER_GATEWAY        | https://ipfs.io                     | IPFS gateway base URL                                  |
| -kubo           | KUBO_API               | http://localhost:5001               | Kubo API URL for IPNS resolution                       |
| -workers        | TRACKER_WORKERS        | 4                                   | Number of concurrent processing workers                |
| -model          | TRACKER_MODEL          | meta-llama-3.1-8b-instruct          | LLM model for keyword extraction                       |
| -fallback-model | TRACKER_FALLBACK_MODEL | llama-3.3-70b-instruct              | Model to try if primary returns 429                    |
| -api-base       | TRACKER_API_BASE       | https://chat-ai.academiccloud.de/v1 | OpenAI-compatible API base URL                         |
| -spacing        | TRACKER_SPACING        | 100ms                               | Minimum delay between dispatching CIDs                 |
| -cli            | TRACKER_CLI            | false                               | Index pending CIDs from archives and exit (no web UI)  |
| -port           | TRACKER_PORT           | 8384                                | Web UI port (localhost only)                           |
| -public-port    | TRACKER_PUBLIC_PORT    | 8385                                | Public API port (archives only, bind all interfaces)   |
| -refresh        | TRACKER_REFRESH        | 10m                                 | Interval to refresh IPNS for all archives (0 to disable) |

With the default -api-base (Chat AI / Academic Cloud), rate limits from the API are 1000 req/min, 10000/hour, 50002/day. Current models and exact API IDs: see docs/chat-ai-api.md or GET https://chat-ai.academiccloud.de/v1/models (with your API key).

Memory

PDF→text runs in a subprocess (pdftotext from poppler-utils); only one conversion runs at a time so peak RAM stays bounded. Optionally set GOMEMLIMIT=8GiB.

Archives API

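| Method | Path                   | Description                                                                                              |
|--------|------------------------|----------------------------------------------------------------------------------------------------------|
| POST   | /api/archives/announce | Announce a new archive. Response: { "status": "ok" }. Use GET /api/archives to fetch the list.           |
| GET    | /api/archives          | Get the full list of archive IDs and CID counts. Response: the archives array.                           |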

The announce body takes one of three forms: { "archive_id": "k51..." } (the tracker resolves the ID via IPNS and reads the cids from the resolved document), { "archive_id": "...", "cids": ["Qm...", ...] }, or the same with a rich cids array of objects. The gateway used is the one set with -gateway. Example (public port):

curl -s -X POST http://localhost:8385/api/archives/announce \
  -H "Content-Type: application/json" \
  -d '{"archive_id":"k51qzi5uqu5dkq7ek83z2tb3muanwx7y59e5ixuk0mhume92aq98dnystqo5ih"}'

Expose only port 8385 (not 8384).

Output files

| File               | Description                                               |
|--------------------|-----------------------------------------------------------|
| keyword_index.json | Indexed metadata keyed by CID (only persisted file)       |

Example entry in keyword_index.json:

{
  "bafyrei...": {
    "cid": "bafyrei...",
    "title": "Attention Is All You Need",
    "broad_field": "Computer Science",
    "sub_topic": "Machine Learning",
    "research_niche": "Transformer Architectures for Sequence Modeling",
    "keywords": ["transformer", "attention mechanism", "..."],
    "indexed_at": "2026-03-04T14:30:00Z"
  }
}

About

Tracker to maintain a list of known archives and their contents.
