ZenAlexa/toki-bitemporal-memory

TOKI: A Bitemporal Operator Algebra for Contradiction Resolution in LLM-Agent Persistent Memory

Reference implementation and reproducibility artifact for the TOKI paper.

Status: preprint, not peer-reviewed | Target: PVLDB Vol. 20 / VLDB 2027 | License: MIT | Python | Reproducible

Status

This repository accompanies a preprint being prepared for submission to PVLDB Vol. 20 (VLDB 2027). The work has not yet been peer-reviewed or accepted, so the theorems, numbers, and claims here may change before publication. An archival DOI (Zenodo) will be minted on acceptance.

TOKI overview: a contradicting write on a key (s,p) passes through an isolation-precondition gate that routes it to one of four typed bitemporal operators (LWW/RC, Evidence/SI, Await/RC+cb, Per-Rule/SR); the operators commit a dual-row live+audit record, which a three-axis soundness contract (isolation, schema, provenance) certifies to exclude the write-time anomalies N1 (replay inconsistency), N2 (belief-drift skew), and N3 (audit erasure).


Paper

The compiled paper and its extended appendix ship with this artifact under paper/.

The supplement is where the load-bearing detail lives: every theorem cited in the main text is proved in full there, alongside the experiment protocols, soundness contracts, and the keyed-logging tightness argument.


TL;DR

Persistent memory for an LLM agent is a write-heavy substrate: every belief update is a versioned write, and the system must decide what to trust when a new claim contradicts a stored one. Production systems answer with four resolution heuristics, yet none declares the isolation level it assumes or the write-time anomalies it admits.

TOKI's thesis: contradiction resolution is write-time concurrency control. TOKI types the four production heuristics as one family of bitemporal operators over a dual-row schema, each carrying an isolation precondition and a provenance annotation that preserves the losing fact in an audit row. Four soundness theorems close the contract across three axes (isolation, schema, provenance) and lift to operator pipelines; a tightness companion proves that keyed logging of the adjudicating judge is necessary for replay consistency, a discipline every audited baseline omits.
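As a rough illustration of the dual-row idea, the sketch below keeps a live row per key (s, p) and moves every losing fact into an audit row instead of overwriting it. It is not the repository's API: `Row`, `AuditRow`, and `DualRowStore` are hypothetical names, only last-writer-wins is modelled, and only transaction time is tracked (valid time is left out for brevity).

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Row:
    subject: str
    predicate: str
    value: str
    tx_from: int  # transaction time at which this row became live


@dataclass(frozen=True)
class AuditRow:
    row: Row            # the losing fact, kept verbatim
    superseded_at: int  # transaction time at which it lost
    operator: str       # which operator made the decision


class DualRowStore:
    """Toy live + audit store with last-writer-wins resolution (hypothetical)."""

    def __init__(self) -> None:
        self._clock = 0
        self.live: dict[tuple[str, str], Row] = {}
        self.audit: list[AuditRow] = []

    def write(self, subject: str, predicate: str, value: str) -> Row:
        key = (subject, predicate)
        incumbent = self.live.get(key)
        if incumbent is not None and incumbent.value == value:
            return incumbent  # same value: not a contradiction
        self._clock += 1
        if incumbent is not None:
            # The loser moves to the audit table instead of being overwritten.
            self.audit.append(AuditRow(incumbent, self._clock, "LWW"))
        row = Row(subject, predicate, value, tx_from=self._clock)
        self.live[key] = row
        return row

    def as_of(self, subject: str, predicate: str, tx: int) -> Optional[Row]:
        """Return the row that was live at transaction time `tx`, if any."""
        for entry in self.audit:
            r = entry.row
            same_key = (r.subject, r.predicate) == (subject, predicate)
            if same_key and r.tx_from <= tx < entry.superseded_at:
                return r
        current = self.live.get((subject, predicate))
        if current is not None and current.tx_from <= tx:
            return current
        return None
```

After two contradicting writes on the same key, the live table holds the winner, the audit list holds the loser, and `as_of` can still recover the earlier belief, which is the property an overwrite-in-place design gives up.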

This repository is the reference implementation and experiment harness behind every theorem, table, and figure in the paper.


Key contributions

  • A typed operator algebra. The four production contradiction-resolution heuristics become one isolation-indexed family of bitemporal operators over a dual-row (live + audit) schema, each carrying an explicit isolation precondition and a K-semiring provenance annotation.
  • A necessity theorem and soundness on three axes. Keyed logging of the adjudicating judge is proven necessary for replay consistency; four soundness theorems close the contract across isolation, schema, and provenance, and lift to operator pipelines.
  • An eight-system verdict matrix with running-code corroboration. Every audited baseline that keeps a language-model judge on the write path admits at least one of three write-time anomalies; TOKI is the only design that excludes all three while keeping the judge on the write path.
  • A reproducible harness. Pre-computed evidence in results/, re-runnable end to end, with a machine-checkable artefact/manifest.json that links each paper claim to its evidence file.

The four production heuristics as isolation-typed operators

This correspondence is the load-bearing claim (paper §1, Table tab:correspondence). Each deployed strategy is the operational mirror of one classical Berenson–Adya multiversion anomaly.

| Production strategy | Operator | Isolation precondition | Admitted anomaly |
| --- | --- | --- | --- |
| Last-writer-wins | LWW | Read-committed | P4 lost update |
| Evidence-weighted merge | Evidence | Snapshot isolation | A5B write skew |
| Await-confirmation | AwaitConfirm | RC + callback | callback boundary |
| Per-rule policy | PerRule | Serializable on policy table | P3 phantom |
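One way to read the table is as a precondition check that runs before an operator is allowed to commit. The sketch below encodes the four rows and refuses an operator whose precondition the session does not meet. The names `Isolation`, `PRECONDITIONS`, `PreconditionError`, and `gate` are hypothetical, and the sketch assumes the levels form a total order (read-committed < snapshot < serializable) in which a stronger level satisfies a weaker precondition; the paper's actual gate may differ.

```python
from enum import IntEnum


class Isolation(IntEnum):
    READ_COMMITTED = 1
    SNAPSHOT = 2
    SERIALIZABLE = 3


# operator name -> (minimum isolation level, needs a confirmation callback)
PRECONDITIONS = {
    "LWW": (Isolation.READ_COMMITTED, False),
    "Evidence": (Isolation.SNAPSHOT, False),
    "AwaitConfirm": (Isolation.READ_COMMITTED, True),
    "PerRule": (Isolation.SERIALIZABLE, False),
}


class PreconditionError(RuntimeError):
    """Raised when a session cannot satisfy an operator's precondition."""


def gate(operator: str, session: Isolation, has_callback: bool = False) -> str:
    """Return the operator name if the session meets its precondition."""
    required, needs_callback = PRECONDITIONS[operator]
    if session < required:
        raise PreconditionError(
            f"{operator} requires {required.name}, session runs {session.name}"
        )
    if needs_callback and not has_callback:
        raise PreconditionError(f"{operator} requires a confirmation callback")
    return operator
```

For example, `gate("Evidence", Isolation.READ_COMMITTED)` raises, because an evidence-weighted merge under read-committed would be exposed to the write-skew anomaly listed in its row.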

Headline result: the verdict matrix over eight systems

The paper audits eight systems (six production agent-memory baselines, one engine-layer comparator, and TOKI) against three write-time anomalies:

  • N1 — replay inconsistency: re-invoking the language-model judge on a committed verdict can flip it.
  • N2 — belief-drift skew: concurrent per-partition updates drift the belief state under snapshot isolation.
  • N3 — audit erasure: the losing fact is overwritten with no recoverable audit trail.

The table below shows the design verdict (the verdict each system's published contradiction logic implies). The paper's full matrix (tab:anomaly-bench, §5) additionally reports the running-code verdict observed from each system's shipped code, transcribed directly from results/anomaly_bench/. Legend: A = admits the anomaly, X = excludes it, an empty cell = predicate not applicable.

System N1 replay-inconsistency N2 belief-drift skew N3 audit erasure LM judge on write path?
mem0 (v2) A yes
mem0 (v3) A A A yes
Graphiti A A yes
Letta A A yes
Zep A A yes
MIRIX A A yes
WorldDB (engine-layer comparator) X X X no (judge removed)
TOKI (this work) X X X yes

Every baseline that keeps a language-model judge on the write path admits at least one anomaly. The content-addressed engine-layer comparator avoids all three only by removing the judge. TOKI alone excludes all three while keeping the judge on the write path. The MIRIX row is transcribed from the independent MMA-Bench evaluation, which probes N1/N2 but not the audit path, so its N3 cell abstains.

What the experiments measure

Contradiction-resolution latency (p50/p99) stays flat as the store grows from 0 to 10^5 facts. Three structural anchors: every predicted 0/1 anomaly boundary matches the measured outcome.

  • Left: contradiction-resolution latency stays flat as the memory store grows to 10^5 facts (DuckDB backend, evidence-weighted path).
  • Right: three structural anchors (isolation lattice, composition length 1–5, Welch-t equivalence) where every predicted 0/1 boundary matches the measured result.

The audit-row defence recovers a constructed mechanism-stress slice by 0.86 (LoCoMo). End-to-end retrieval shows no significant difference from the baselines: the paper states a write-time correctness contract and makes no utility-superiority claim.


Repository layout

| Path | Contents |
| --- | --- |
| implementation/bitemporal/ | Core package (import bitemporal): the four operators, dual-row schema, K-semiring provenance lattice, as_of time-travel, audit log. |
| implementation/adapters/ | Adapter shims for the baseline agent-memory systems. |
| experiments/anomaly_bench/ | AnomalyClaim: the N1/N2/N3 structural verdict matrix. |
| experiments/anomaly_wire/ | AnomalyWire: live-adapter, cross-layer corroboration of the verdicts. |
| experiments/g2_utility/ | Paired end-to-end accuracy (LoCoMo, LongMemEval-S, MultiTQ). |
| experiments/g3_systems_perf/ | DuckDB backend scaling and tail-latency benchmarks. |
| experiments/g4_ablation/ | Operator-family and K-semiring ablations. |
| experiments/benchmark_{a,b,f}/ | Third-party benchmarks (cross-system QA, MMA-Bench, GroupMemBench / STALE-400). |
| experiments/n1_empirical/, n2_partition/ | Empirical lower-bound and partition-isolation sweeps. |
| results/ | Pre-computed CSVs and manifests, one subdirectory per experiment. These are the exact evidence behind the paper's tables and figures. |
| figures/ | The paper's compiled figures (PNG), embedded above; the plot scripts regenerate the underlying plots here from results/. |
| scripts/ | Figure generation (plot_*.py), data aggregation (aggregate_*.py, reconcile_*.py, backfill_*.py), and statistics helpers (derive_g3_5axis_stats.py, build_holm_family.py, build_benchmark_status.py, refresh_manifest_sha.py). |
| tests/ | pytest suites mirroring implementation/ and each experiment. |
| artefact/ | Submission package: REPRODUCE.md reviewer runbook, machine-checkable manifest.json, Dockerfile, dataset manifests + checksums. |

Requirements

  • Python 3.11+
  • uv (recommended) or pip
  • Docker 24+ (only for the containerized smoke run)
  • An LLM API key for the live experiments (OPENROUTER_API_KEY, or OPENAI_API_KEY / VLDB2027_JUDGE_API_KEY); see artefact/REPRODUCE.md. Calls default to a public OpenAI-compatible endpoint (OpenRouter); override with the VLDB2027_LLM_BASE_URL env var. The structural verdicts, ablations, and systems-performance benchmarks run without any API key.

Installation

git clone https://github.com/ZenAlexa/toki-bitemporal-memory && cd toki-bitemporal-memory

# Recommended: uv
uv sync --extra test --extra experiments
uv pip install -e .

# Or: pip
pip install -e ".[test,experiments]"

The editable install (-e .) is required before running the test suite or any python -m experiments.* runner, because it registers both the bitemporal package and the experiments package on the path. Verify it:

python -c "from bitemporal import Schema, LWW, Evidence, AwaitConfirm, PerRule; print('OK')"

Datasets

The three benchmark datasets are not redistributed here. Each artefact/datasets/<name>/ directory carries a README.md with download instructions, a LICENSE, and a SHA256SUMS file for integrity verification after you download the raw data.

| Dataset | Used by | Download instructions |
| --- | --- | --- |
| LoCoMo | g2_utility (mechanism-stress + cross-system ledger) | artefact/datasets/locomo/README.md |
| LongMemEval-S | g2_utility | artefact/datasets/longmemeval_s/README.md |
| MultiTQ | g2_utility | artefact/datasets/multitq/README.md |

The structural verdict matrix, systems-performance, and ablation results do not require these datasets; they are needed only for the end-to-end utility experiments.
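The integrity check against a SHA256SUMS file can also be done from Python. The sketch below assumes the file uses the conventional `sha256sum` layout (one `<hex digest>  <relative path>` entry per line) and that paths are relative to the directory holding the file; both are assumptions, so check the dataset README if the layout differs. The helper names `sha256_of` and `verify_sha256sums` are illustrative, not part of the repository.

```python
import hashlib
from pathlib import Path


def sha256_of(path: Path, chunk_size: int = 1 << 20) -> str:
    """Hash a file in chunks so large dataset files need not fit in memory."""
    digest = hashlib.sha256()
    with path.open("rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify_sha256sums(sums_file: Path) -> list[str]:
    """Return the listed paths that are missing or whose digest does not match."""
    root = sums_file.parent
    failures = []
    for line in sums_file.read_text(encoding="utf-8").splitlines():
        if not line.strip():
            continue
        expected, _, name = line.partition(" ")
        name = name.lstrip(" *")  # drop the text/binary mode marker
        target = root / name
        if not target.is_file() or sha256_of(target) != expected.lower():
            failures.append(name)
    return failures
```

For instance, `verify_sha256sums(Path("artefact/datasets/locomo/SHA256SUMS"))` returns an empty list when every downloaded file matches. Under the same layout assumption, running `sha256sum -c SHA256SUMS` inside the dataset directory performs the equivalent check.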


Reproducing the experiments

This artifact reproduces the experiments behind the paper's theorems, the data behind its tables, and its figures, not the paper document itself. The pre-computed CSVs in results/ already back every table and figure. To regenerate from scratch:

Quick smoke check (no API key)

make smoke          # core operator + AnomalyClaim + AnomalyWire tests, < 2 min
# or, fully containerized:
docker build -t toki-artefact:latest -f artefact/Dockerfile .
docker run --rm toki-artefact:latest make smoke

Claim → command → output

| Claim / table / figure | Command | Output |
| --- | --- | --- |
| Verdict matrix, N1/N2/N3 | python -m experiments.anomaly_bench.k_sweep_runner --output results/anomaly_bench/ | results/anomaly_bench/ (committed verdicts n[1,2,3]_*.csv; K-sweep under k_sweep/) |
| AnomalyWire cross-layer corroboration | python -m experiments.anomaly_wire.iso_level_sweep_runner | results/anomaly_wire/ |
| G2 utility (LoCoMo / LongMemEval-S / MultiTQ) | python -m experiments.g2_utility.runner --output-dir results/g2_utility/ | results/g2_utility/ |
| G3 systems performance (scaling / latency) | python -m experiments.g3_systems_perf.runner --output-dir results/g3_systems_perf/ | results/g3_systems_perf/ |
| G4 operator + K-semiring ablation | python -m experiments.g4_ablation.k_semiring_counterfactual_runner --output results/g4_ablation/ | results/g4_ablation/ |
| N1 empirical lower bound | python -m experiments.n1_empirical.runner --output results/n1_empirical/run_v1/ | results/n1_empirical/run_v1/ |
| Benchmark-A cross-system QA | python -m experiments.benchmark_a.cross_system_qa | results/benchmark_a/ |
| Benchmark-B MMA-Bench | python -m experiments.anomaly_bench.mma_bench_adapter | results/benchmark_b/ |
| Benchmark-F GroupMemBench / STALE-400 | python -m experiments.benchmark_f.stale_runner | results/benchmark_f/ |
| All infra-free tests | pytest tests/ -m "not docker and not live" | terminal report (the docker / live tests additionally need the pinned upstream clones and an LLM relay) |

Figure generation

Each script reads the corresponding CSVs from results/ and writes its plot into figures/:

python scripts/plot_anomaly_bench.py     # -> figures/fig-anomaly-bench.pdf
python scripts/plot_systems_perf.py      # -> figures/fig-systems-perf.pdf
python scripts/plot_iso_pareto.py        # -> figures/fig7-iso-pareto.pdf

The repository carries additional plot_*.py renderers (operator family, forest plot, anchors, trajectory replay, scaling composites); each reads its CSVs from results/ and writes a PDF into figures/.

For the full reviewer runbook (imported-wire rows, optional Zep Cloud setup, honest-abstain protocol), see artefact/REPRODUCE.md.


Strengthening modules

Five modules back the paper's strengthened claims; artefact/REPRODUCE.md § 9 carries the full verification recipe for each.

  • n-ary conflict-set algebra: implementation/bitemporal/operators.py::resolve_conflict_set resolves a set of n mutually-contradicting incumbents in one fold (not just the pairwise case), and implementation/bitemporal/audit.py::merge_provenance_all accumulates every loser's K-semiring provenance into the survivor. Verified by tests/bitemporal/test_conflict_set.py.
  • Real multi-writer PostgreSQL isolation: experiments/g3_systems_perf/isolation_concurrency.py runs concurrent writers against a real PostgreSQL backend across the writers × isolation-level grid; the committed results/g3_systems_perf/isolation_concurrency.csv records 16 cells under PostgreSQL 17.10, so the isolation claim is auditable as genuine multi-process concurrency rather than a single-process simulation.
  • Persistent judge log + crash replay: a persistent judge_log table records the judge verdict before the operator commit, so a verdict survives a crash and replays deterministically. Verified by tests/bitemporal/test_judge_log_persistence.py, including a negative control proving the log is load-bearing.
  • JSON audit-witness codec + provenance-accumulating dedupe: an audit witness round-trips through storage, and a duplicate write folds its provenance into the existing row rather than discarding the duplicate's lineage (tests/bitemporal).
  • Contradiction-density-stratified powered utility (honest null): experiments/g2_utility/{stratify,power,powered}.py run a pre-registered, density-stratified, statistically-powered paired utility test. This is an underpowered / null result, not a utility win: the committed results/g2_utility/cross_system/powered_summary.csv records status=n/a on both strata (achieved power 0.0347 against a 0.80 target, required n ≈ 1570 versus n = 5 measured). On LoCoMo at feasible sample sizes the powered test cannot establish a utility advantage; the artifact reports this honestly rather than claiming a measured-slice win.
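The n-ary fold in the first module can be pictured with a small stand-in. The sketch below picks one survivor from a set of contradicting claims and unions every loser's provenance into it, so no lineage is dropped. It is only an illustration: `Claim` and `fold_conflict_set` are hypothetical names, the survivor is chosen by a made-up `score` field, and a set of source identifiers under union stands in for the K-semiring annotation used by the actual resolve_conflict_set and merge_provenance_all.

```python
from dataclasses import dataclass, replace
from functools import reduce


@dataclass(frozen=True)
class Claim:
    value: str
    score: float           # hypothetical ranking signal
    provenance: frozenset  # source identifiers backing this claim


def fold_conflict_set(claims: list[Claim]) -> tuple[Claim, list[Claim]]:
    """Resolve n contradicting claims in one pass.

    Returns the survivor, carrying the union of all provenance, together
    with the losers, which a dual-row store would write to its audit table.
    """
    if not claims:
        raise ValueError("conflict set must not be empty")
    survivor = max(claims, key=lambda c: c.score)
    losers = [c for c in claims if c is not survivor]
    merged = reduce(lambda acc, c: acc | c.provenance, losers, survivor.provenance)
    return replace(survivor, provenance=merged), losers
```

Because the merge is a fold over the whole set rather than a chain of pairwise resolutions, the survivor's provenance does not depend on the order in which the losers are visited.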

Expected hardware and runtime

| Experiment | Hardware | Wall-clock |
| --- | --- | --- |
| Unit tests (tests/bitemporal/) | any laptop | < 30 s |
| AnomalyClaim smoke (simulated) | any laptop | ~2 min |
| G3 systems performance (DuckDB, local) | any laptop | ~10 min |
| G4 ablation (local, no LLM) | any laptop | ~5 min |
| G2 utility (requires LLM relay) | any laptop + LLM relay | ~60 min |
| Full N1/N2/N3 wire runs (LLM relay + optional Zep Cloud) | any laptop + LLM relay | ~90 min |

License

MIT License. See LICENSE.

Citation

@inproceedings{wang2026toki,
  title     = {{TOKI}: A Bitemporal Operator Algebra for Contradiction
               Resolution in {LLM}-Agent Persistent Memory},
  author    = {Wang, Ziming},
  booktitle = {Proc. VLDB Endow.},
  volume    = {20},
  year      = {2026},
  note      = {Preprint; under preparation for submission to PVLDB Vol.~20 (VLDB 2027), not yet peer-reviewed},
}
