farchive

farchive (far archive) — a local, history-preserving archive for opaque bytes observed at named locators.

Farchive stores raw bytes once by SHA-256 digest, preserves each locator's observation history as contiguous spans, and optimizes physical storage with zstd compression, corpus-trained dictionaries, locator-local delta encoding, and content-defined chunking. One SQLite file, queryable with SQL — efficient corpus packing while keeping the archive directly queryable.

Why

Most tools make you choose between a cache, a blob store, a version-control system, and a web archive. Farchive is the boring local thing in the middle: you record what bytes you observed at a locator and when, read them back exactly, resolve the current state or the state at a past time, and keep repetitive corpora compact.

Preserve what was observed. If a locator goes A -> B -> A, that is three spans, not one collapsed record.
Store bytes once. Identical payloads deduplicate by digest.
Query it simply. Latest, as-of, history, freshness.
Keep it small. Repetitive corpora benefit from trained zstd dictionaries, delta encoding for similar versions, and chunk-level dedup for large blobs.
Keep it local and boring. One SQLite file, no server, no daemon.

I wanted a local tool that combined content-addressed dedup, locator history, and corpus-adaptive compression in one queryable file.

Use cases

Web scraping with change detection. Archive pages as you crawl. Query what changed between observations. Detect when a page reverted. Freshness checks avoid redundant fetches.

with Farchive("scrape.farchive") as fa:
    for url in urls:
        if not fa.has(url, max_age_hours=24):
            resp = httpx.get(url)
            fa.store(url, resp.content, storage_class="html")
    # Later: what changed?
    for span in fa.history("https://example.com/pricing"):
        print(f"{span.observed_from}  {span.digest[:12]}")

API response archival. Store every response from a REST or SOAP API. Dedup means identical responses cost nothing. Point-in-time queries let you reconstruct what you knew at any moment.

Legal/regulatory corpus management. Archive legislation, regulations, court decisions. Track amendments over time. Corpus-trained zstd dictionaries compress thousands of structurally similar XML documents at 5-10x ratios. Delta encoding captures small amendments efficiently. (This is the use case farchive was extracted from.)

ML dataset versioning. Store training data snapshots at locators like dataset://v3/train.jsonl. Content-addressed storage means identical data across versions is stored once. History shows the full lineage. Large datasets benefit from chunk-level dedup.

Configuration/infrastructure snapshots. Periodically archive config files, terraform state, DNS records. Spans show exactly when each change was first observed.

Install

Farchive is published on PyPI: pypi.org/project/farchive.

pip install farchive

Requires Python 3.11+ and zstandard>=0.21.

For content-defined chunking (large-blob dedup):

pip install farchive[chunking]

This adds pyfastcdc for FastCDC-based content-defined chunking. The archive works without it — chunking is an optional optimization.

Quick start

from farchive import Farchive

with Farchive("my_archive.farchive") as fa:
    # Store content at a locator
    fa.store("https://example.com/page", page_bytes, storage_class="html")

    # Retrieve latest content
    data = fa.get("https://example.com/page")

    # Track changes over time
    fa.store("https://example.com/page", new_page_bytes, storage_class="html")
    for span in fa.history("https://example.com/page"):
        print(f"{span.digest[:12]}  {span.observed_from}..{span.observed_until}")

Status

Near-term priorities were:

importer-facing drift detection via compare_current()
machine-readable history and provenance ergonomics
better event filtering and locator metadata workflows
richer batch import ergonomics without changing core archive semantics

Current status: these near-term items are now implemented.

Core concepts

Blob: Immutable raw bytes identified by SHA-256 digest. Stored once, deduped by content.
Locator: Opaque string naming where content was observed (URL, path, any string).
State span: A contiguous run where one locator resolved to one blob. If the same content returns after an interruption, that's a new span — history is preserved.
Event (optional): Append-only audit log of archive operations, including observations and maintenance.
Storage class: A freeform string label (e.g. "html", "xml", "pdf", "bin", "json", whatever you want) that guides compression strategy. There is no fixed set — any string is valid. Blobs in the same class share dictionaries and local candidate strategy. Common convention is to use MIME-like names, but the archive does not enforce or validate them.
Series key: An optional opaque lineage hint used only to widen delta candidate selection across locators in the same version family. It is optional, advisory, and has no read-time semantic meaning. For an open same-digest span, the current behavior is latest-non-null wins if a later confirmation provides a new non-null series_key.

API

Write

fa.put_blob(data, storage_class="xml")                      # store blob, return digest
fa.observe(locator, digest)                                 # record observation
fa.observe(locator, digest, observed_at=ts, metadata={"k": "v"}, series_key="series/doc-123")  # with time, metadata, and lineage hint
fa.store(locator, data)                                     # put_blob + observe (atomic)
fa.store(locator, data, observed_at=ts, storage_class="html", series_key="series/doc-123", metadata={"k": "v"})       # with time, class, lineage hint, and metadata
fa.store_batch([(loc, data), ...], progress=callback, series_key="series/doc-123")       # shared defaults for batch
fa.store_batch(
    [BatchItem(locator=loc, data=data, observed_at=ts, storage_class="html", series_key="series/doc-123", metadata={"k": "v"})],
    progress=callback,
)                                                           # per-item metadata/timestamps

observe(), store(), and store_batch() may use prior blobs from the same locator as delta candidates when beneficial. If series_key is provided, delta candidates can also come from other locators in the same lineage key. put_blob() has no locator context and skips delta encoding. store_batch() accepts both legacy tuples and BatchItem. Shared observed_at / storage_class / series_key defaults are used only when an item does not set its own value.

Series key contract:

optional, typed hint; advisory and additive only
one value per span; never changes archive semantics
additive and explicit: no profile object, no profile switching
does not affect reads, span identity, resolver results, or history semantics
implemented to widen delta candidate lookup only

Machine-readable span outputs:

in the API, StateSpan includes series_key when present
machine-readable CLI outputs (resolve --json, history --json, and ls spans --json) now include series_key when present

Read

fa.read(digest)                    # exact bytes by digest
fa.compare_current(locator, data=bytes)   # locator state vs candidate bytes (status: absent/same/changed)
fa.compare_current(locator, digest=digest)   # locator state vs candidate digest
fa.resolve(locator)                # current StateSpan
fa.resolve(locator, at=timestamp)  # point-in-time span
fa.get(locator)                    # convenience: resolve + read
fa.get(locator, at=timestamp)      # bytes at a point in time
fa.history(locator)                # all spans, newest first
fa.has(locator, max_age_hours=24)  # freshness check
fa.locators(pattern="https://%")   # list locators (LIKE pattern)
fa.events(locator)                 # audit log (if event history exists)
fa.events(locator, since=ts)       # events since timestamp

fa.compare_current() requires exactly one of data or digest.

Maintenance

fa.train_dict(storage_class="xml")          # train zstd dictionary, returns dict_id
fa.repack(storage_class="xml")                                  # recompress with trained dict, returns RepackStats
fa.repack(storage_class="xml", series_key="series/doc-123")       # recompress one lineage cohort
fa.rechunk(storage_class="bin")                                 # convert large blobs to chunked form, returns RechunkStats
fa.rechunk(storage_class="bin", series_key="series/doc-123")      # target one lineage cohort for chunking maintenance
fa.purge(["loc/a", "loc/b"])                # remove locators and unreachable blobs, returns PurgeStats
fa.stats()                                  # archive statistics, returns ArchiveStats
fa.close()                                  # close connection (automatic with context manager)

Data types

All types are importable from farchive:

StateSpan — one contiguous run of a locator resolving to one blob, including optional series_key
Event — one audit record (event_id, occurred_at, locator, digest, kind, metadata)
CompressionPolicy — configurable storage optimization knobs
ImportStats — results from store_batch()
BatchItem — input envelope for richer batch ingestion (series_key optional lineage hint)
LocatorHeadComparison — result of compare_current()
RepackStats — results from repack() (blobs_repacked, bytes_saved)
RechunkStats — results from rechunk() (blobs_rewritten, chunks_added, bytes_saved)
PurgeStats — results from purge() (locators_requested, locators_purged, spans_deleted, blobs_deleted, chunks_deleted, dry_run)
ArchiveStats — snapshot of archive state (locator_count, blob_count, span_count, dict_count, total_raw_bytes, total_stored_bytes, compression_ratio, codec_distribution, db_path, schema_version, chunk_count, db_file_bytes)

Constructor

Farchive(path, compression=CompressionPolicy(), enable_events=False)

Parameter	Type	Default	Notes
`path`	`str \| Path`	required	SQLite file path
`compression`	`CompressionPolicy`	defaults below	Policy knobs
`enable_events`	`bool`	`False`	Creates event table on first use

Compression

Farchive uses layered storage optimization. Phase 1 and Phase 2 are automatic write-path strategies. Phase 3 is an explicit maintenance transform.

Phase 1 — Inline compression (write path)

Raw — blobs under the raw threshold (default 64 bytes) are stored uncompressed
Vanilla zstd — standard compression
Dictionary zstd — corpus-trained dictionaries for configured storage classes

Storage classes are freeform strings — any value is valid ("html", "xml", "bin", "my-app/v2", whatever). The archive does not validate or enforce any convention. They are optimization buckets: dictionaries are trained per-class.

Phase 2 — Delta encoding (write path)

When storing a blob at a locator that has prior versions, farchive may encode it as a zstd_delta against a similar prior blob. This captures small changes (edits, patches, amendments) very efficiently.

Delta is depth-1 (delta bases are never themselves deltas), and only used when it beats the best inline frame by a configurable margin. Candidate selection includes locator-local history and the optional same-series_key lane for related streams. Delta candidates remain inline-only (raw, zstd, zstd_dict) — chunked blobs are excluded to maintain a clean separation between the delta path (small changes between similar inline blobs) and the chunking path (large-blob dedup via maintenance). Disabled by setting delta_enabled=False.

Phase 3 — Content-defined chunking (maintenance only)

Large blobs (default ≥ 1 MiB) can be split into content-defined chunks via FastCDC. Chunks are deduplicated archive-wide by their own SHA-256 digest. This is most effective when many large blobs share substantial regions — different versions of a dataset, VM images, etc.

Chunking is not applied automatically on write. Use rechunk() to rewrite eligible inline blobs into chunked form when beneficial. Requires the chunking extra (pip install farchive[chunking]).

All compression is transparent: read() and get() always return exact raw bytes regardless of physical representation.

Dictionary training is policy-driven. Defaults auto-train for xml (at 1000 blobs), html (at 500), and pdf (at 16). Other classes can use dictionaries trained manually via train_dict(). After training, new blobs use the dictionary immediately. Run repack() to recompress older blobs.

CompressionPolicy

All knobs are configurable at construction time:

from farchive import CompressionPolicy

policy = CompressionPolicy(
    # Phase 1: inline
    raw_threshold=64,
    compression_level=3,
    auto_train_thresholds={"xml": 1000, "html": 500, "pdf": 16},
    dict_target_sizes={"xml": 112*1024, "html": 112*1024, "pdf": 64*1024},

    # Phase 2: delta
    delta_enabled=True,
    delta_min_size=4*1024,
    delta_max_size=8*1024*1024,
    delta_candidate_count=4,
    delta_size_ratio_min=0.5,
    delta_size_ratio_max=2.0,
    delta_min_gain_ratio=0.95,
    delta_min_gain_bytes=128,

    # Phase 3: chunking
    chunk_enabled=True,
    chunk_min_blob_size=1*1024*1024,
    chunk_avg_size=256*1024,
    chunk_min_size=64*1024,
    chunk_max_size=1*1024*1024,
    chunk_min_gain_ratio=0.95,
    chunk_min_gain_bytes=4096,
)

rechunk()

Explicit maintenance operation that converts eligible inline blobs into chunked representation for cross-blob dedup. Not applied automatically on write.

stats = fa.rechunk()                                    # all eligible blobs
stats = fa.rechunk(storage_class="bin")                 # only one class
stats = fa.rechunk(series_key="series/doc-123")         # one lineage cohort only
stats = fa.rechunk(batch_size=50)                       # cap rewrites
stats = fa.rechunk(min_blob_size=2*1024*1024)           # override threshold

Parameter	Type	Default	Notes
`storage_class`	`str \| None`	`None`	Restrict candidates
`series_key`	`str \| None`	`None`	Restrict to one lineage cohort
`batch_size`	`int`	`100`	Max blobs rewritten per call
`min_blob_size`	`int \| None`	from policy	Minimum raw size

Returns RechunkStats(blobs_rewritten, chunks_added, bytes_saved). Preserves digests, raw bytes, spans, and query results.

CLI

farchive stats [db_path]
farchive history [db_path] <locator> [--json]
farchive locators [db_path] [--pattern PAT]
farchive find [db_path] <query> [--prefix]
farchive events [db_path] [--locator LOC] [--locator-prefix LOC] [--kind KIND] [--digest DIGEST] [--since TS] [--limit N]
farchive resolve [db_path] <locator> [--at TS] [--json]
farchive meta [db_path] <locator> [--at TS] [--json]
farchive inspect [db_path] <digest>
farchive train-dict [db_path] [--storage-class xml]
farchive repack [db_path] [--storage-class xml] [--series-key key] [--batch-size 1000]
farchive rechunk [db_path] [--storage-class bin] [--series-key key] [--batch-size 100] [--min-blob-size N]
farchive purge [db_path] <locator> [<locator> ...] [--dry-run] [--confirm] [--json]

inspect shows blob metadata including chunk references and unique stored size for chunked blobs. history --json, resolve --json, and ls spans --json return machine-readable span records including series_key when present. ls spans --series-key filters a relationship cohort to the same lineage for operator inspection. events --kind, events --digest, and events --locator-prefix provide locator-oriented event filters for API-like event queries. find searches locators by substring, or by prefix with --prefix. meta is a thin CLI alias for resolve.

Design

Single SQLite file, WAL mode
SHA-256 content identity
Positive-observation model (records what was seen, not what was absent)
Span-based history (A->B->A creates 3 spans, not 1 collapsed record)
Monotone observation time enforced per locator
Optional event audit log with public read API
Layered compression: inline zstd, trained dictionaries, locator-local deltas, chunk dedup
rechunk() for explicit large-blob chunking maintenance
Configurable CompressionPolicy (training is automatic, repack is explicit)
File-based write lock for multi-process safety (POSIX fcntl; no-lock fallback on Windows)
Not thread-safe (one instance per thread, enforced by SQLite)
No HTTP, no domain-specific logic — the caller brings bytes

Name		Name	Last commit message	Last commit date
Latest commit History 47 Commits
.github/workflows		.github/workflows
scripts		scripts
src/farchive		src/farchive
tests		tests
.gitignore		.gitignore
CHANGELOG.md		CHANGELOG.md
LICENSE		LICENSE
README.md		README.md
SPEC.md		SPEC.md
pyproject.toml		pyproject.toml
uv.lock		uv.lock

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Repository files navigation

farchive

Why

Use cases

Install

Quick start

Status

Core concepts

API

Write

Read

Maintenance

Data types

Constructor

Compression

Phase 1 — Inline compression (write path)

Phase 2 — Delta encoding (write path)

Phase 3 — Content-defined chunking (maintenance only)

CompressionPolicy

rechunk()

CLI

Design

About

Uh oh!

Releases

Packages

Uh oh!

Uh oh!

Contributors

Uh oh!

Languages

Folders and files

Latest commit

History

Repository files navigation

farchive

Why

Use cases

Install

Quick start

Status

Core concepts

API

Write

Read

Maintenance

Data types

Constructor

Compression

Phase 1 — Inline compression (write path)

Phase 2 — Delta encoding (write path)

Phase 3 — Content-defined chunking (maintenance only)

CompressionPolicy

rechunk()

CLI

Design

About

Resources

License

Uh oh!

Stars

Watchers

Forks

Releases

Packages 0

Uh oh!

Uh oh!

Contributors

Uh oh!

Languages

Packages