Reference implementation and reproducibility artifact for the TOKI paper.
This repository accompanies a preprint being prepared for submission to PVLDB Vol. 20 (VLDB 2027). The work has not yet been peer-reviewed or accepted, so the theorems, numbers, and claims here may change before publication. An archival DOI (Zenodo) will be minted on acceptance.
The compiled paper and its extended appendix ship with this artifact under paper/:
- paper/TOKI.pdf: the main paper.
- paper/TOKI-supplement.pdf: the supplement, an extended appendix carrying the full proofs and protocols referenced from the paper body.
The supplement is where the load-bearing detail lives: every theorem cited in the main text is proved in full there, alongside the experiment protocols, soundness contracts, and the keyed-logging tightness argument.
Persistent memory for an LLM agent is a write-heavy substrate: every belief update is a versioned write, and the system must decide what to trust when a new claim contradicts a stored one. Production systems answer with four resolution heuristics, yet none declares the isolation level it assumes or the write-time anomalies it admits.
TOKI's thesis: contradiction resolution is write-time concurrency control. TOKI types the four production heuristics as one family of bitemporal operators over a dual-row schema, each carrying an isolation precondition and a provenance annotation that preserves the losing fact in an audit row. Four soundness theorems close the contract across three axes (isolation, schema, provenance) and lift to operator pipelines; a tightness companion proves that keyed logging of the adjudicating judge is necessary for replay consistency, a discipline every audited baseline omits.
This repository is the reference implementation and experiment harness behind every theorem, table, and figure in the paper.
- A typed operator algebra. The four production contradiction-resolution heuristics become one isolation-indexed family of bitemporal operators over a dual-row (live + audit) schema, each carrying an explicit isolation precondition and a K-semiring provenance annotation.
- A necessity theorem and soundness on three axes. Keyed logging of the adjudicating judge is proven necessary for replay consistency; four soundness theorems close the contract across isolation, schema, and provenance, and lift to operator pipelines.
- An eight-system verdict matrix with running-code corroboration. Every audited baseline that keeps a language-model judge on the write path admits at least one of three write-time anomalies; TOKI is the only design that excludes all three while keeping the judge on the write path.
- A reproducible harness. Pre-computed evidence in results/, re-runnable end to end, with a machine-checkable artefact/manifest.json that links each paper claim to its evidence file.
This correspondence is the load-bearing claim (paper §1, Table tab:correspondence). Each deployed strategy is the operational mirror of one classical Berenson–Adya multiversion anomaly.
| Production strategy | Operator | Isolation precondition | Admitted anomaly |
|---|---|---|---|
| Last-writer-wins | LWW | Read-committed | P4 lost update |
| Evidence-weighted merge | Evidence | Snapshot isolation | A5B write skew |
| Await-confirmation | AwaitConfirm | RC + callback | callback boundary |
| Per-rule policy | PerRule | Serializable on policy table | P3 phantom |
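To make the first row concrete, here is a minimal sketch of a last-writer-wins write over a dual-row (live + audit) layout. It is an illustration under stated assumptions, not the repository's `bitemporal` API: `Fact`, `DualRowStore`, and `lww_write` are hypothetical names, and a plain integer stands in for transaction time. The point it shows is that the losing fact moves into an audit row instead of being overwritten.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class Fact:
    key: str
    value: str
    tx_time: int  # logical transaction time of the write (assumed integer clock)


@dataclass
class DualRowStore:
    """Hypothetical dual-row store: one live row per key plus an append-only audit log."""

    live: dict[str, Fact] = field(default_factory=dict)
    audit: list[tuple[Fact, str]] = field(default_factory=list)

    def lww_write(self, challenger: Fact) -> Fact:
        """Last-writer-wins: the later tx_time survives; the loser is kept in the audit log."""
        incumbent = self.live.get(challenger.key)
        if incumbent is None:
            self.live[challenger.key] = challenger
            return challenger
        if challenger.tx_time >= incumbent.tx_time:
            winner, loser = challenger, incumbent
        else:
            winner, loser = incumbent, challenger
        self.live[challenger.key] = winner
        self.audit.append((loser, "LWW"))  # record which operator displaced the fact
        return winner


store = DualRowStore()
store.lww_write(Fact("user.city", "Oslo", tx_time=1))
store.lww_write(Fact("user.city", "Bergen", tx_time=2))
print(store.live["user.city"].value)                # Bergen
print([(f.value, op) for f, op in store.audit])     # [('Oslo', 'LWW')]
```

Without the audit list, the second write would silently erase "Oslo", which is the lost-update shape the table associates with this strategy.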
The paper audits eight systems (six production agent-memory baselines, one engine-layer comparator, and TOKI) against three write-time anomalies:
- N1 — replay inconsistency: re-invoking the language-model judge on a committed verdict can flip it.
- N2 — belief-drift skew: concurrent per-partition updates drift the belief state under snapshot isolation.
- N3 — audit erasure: the losing fact is overwritten with no recoverable audit trail.
The table below shows the design verdict (the verdict each system's published contradiction logic implies). The paper's full matrix (tab:anomaly-bench, §5) additionally reports the running-code verdict observed from each system's shipped code, transcribed directly from results/anomaly_bench/. Legend: A = admits the anomaly, X = excludes it, – = predicate not applicable.
| System | N1 replay-inconsistency | N2 belief-drift skew | N3 audit erasure | LM judge on write path? |
|---|---|---|---|---|
| mem0 (v2) | – | – | A | yes |
| mem0 (v3) | A | A | A | yes |
| Graphiti | A | – | A | yes |
| Letta | – | A | A | yes |
| Zep | – | A | A | yes |
| MIRIX | A | A | – | yes |
| WorldDB (engine-layer comparator) | X | X | X | no (judge removed) |
| TOKI (this work) | X | X | X | yes |
Every baseline that keeps a language-model judge on the write path admits at least one anomaly. The content-addressed engine-layer comparator avoids all three only by removing the judge. TOKI alone excludes all three while keeping the judge on the write path. The MIRIX row is transcribed from the independent MMA-Bench evaluation, which probes N1/N2 but not the audit path, so its N3 cell abstains.
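The N1 column can be illustrated with a toy model. The sketch below is not TOKI's judge log: `flaky_judge` is a stand-in for a nondeterministic language-model judge, and `KeyedJudgeLog` is a hypothetical in-memory log keyed by a hash of the adjudicated pair. It only shows why recording the first verdict under a key makes replay return the same answer.

```python
import hashlib
import random


def flaky_judge(incumbent: str, challenger: str, rng: random.Random) -> str:
    """Stand-in for an LM judge: the verdict may differ between invocations."""
    return rng.choice(["keep_incumbent", "accept_challenger"])


class KeyedJudgeLog:
    """Hypothetical log: the first verdict for a pair is recorded and reused on replay."""

    def __init__(self) -> None:
        self._verdicts: dict[str, str] = {}

    @staticmethod
    def key(incumbent: str, challenger: str) -> str:
        # Unit-separator byte keeps ("ab", "c") and ("a", "bc") from colliding.
        payload = f"{incumbent}\x1f{challenger}".encode("utf-8")
        return hashlib.sha256(payload).hexdigest()

    def adjudicate(self, incumbent: str, challenger: str, rng: random.Random) -> str:
        k = self.key(incumbent, challenger)
        if k not in self._verdicts:
            # Only the first call reaches the judge; later calls replay the log.
            self._verdicts[k] = flaky_judge(incumbent, challenger, rng)
        return self._verdicts[k]


rng = random.Random(0)
log = KeyedJudgeLog()
pair = ("lives in Oslo", "lives in Bergen")

unlogged = {flaky_judge(*pair, rng) for _ in range(50)}
logged = {log.adjudicate(*pair, rng) for _ in range(50)}
print(len(unlogged), len(logged))
```

Re-invoking the judge directly can yield more than one distinct verdict for the same pair, while the logged path always yields exactly one.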
- Left: contradiction-resolution latency stays flat as the memory store grows to 10^5 facts (DuckDB backend, evidence-weighted path).
- Right: three structural anchors (isolation lattice, composition length 1–5, Welch-t equivalence) where every predicted 0/1 boundary matches the measured result.
The audit-row defence recovers a constructed mechanism-stress slice by 0.86 (LoCoMo). End-to-end retrieval shows no significant difference from the baselines: the paper states a write-time correctness contract and makes no utility-superiority claim.
| Path | Contents |
|---|---|
| implementation/bitemporal/ | Core package (import bitemporal): the four operators, dual-row schema, K-semiring provenance lattice, as_of time-travel, audit log. |
| implementation/adapters/ | Adapter shims for the baseline agent-memory systems. |
| experiments/anomaly_bench/ | AnomalyClaim: the N1/N2/N3 structural verdict matrix. |
| experiments/anomaly_wire/ | AnomalyWire: live-adapter, cross-layer corroboration of the verdicts. |
| experiments/g2_utility/ | Paired end-to-end accuracy (LoCoMo, LongMemEval-S, MultiTQ). |
| experiments/g3_systems_perf/ | DuckDB backend scaling and tail-latency benchmarks. |
| experiments/g4_ablation/ | Operator-family and K-semiring ablations. |
| experiments/benchmark_{a,b,f}/ | Third-party benchmarks (cross-system QA, MMA-Bench, GroupMemBench / STALE-400). |
| experiments/n1_empirical/, n2_partition/ | Empirical lower-bound and partition-isolation sweeps. |
| results/ | Pre-computed CSVs and manifests, one subdirectory per experiment. These are the exact evidence behind the paper's tables and figures. |
| figures/ | The paper's compiled figures (PNG), embedded above; the plot scripts regenerate the underlying plots here from results/. |
| scripts/ | Figure generation (plot_*.py), data aggregation (aggregate_*.py, reconcile_*.py, backfill_*.py), and statistics helpers (derive_g3_5axis_stats.py, build_holm_family.py, build_benchmark_status.py, refresh_manifest_sha.py). |
| tests/ | pytest suites mirroring implementation/ and each experiment. |
| artefact/ | Submission package: REPRODUCE.md reviewer runbook, machine-checkable manifest.json, Dockerfile, dataset manifests + checksums. |
- Python 3.11+
- uv (recommended) or pip
- Docker 24+ (only for the containerized smoke run)
- An LLM API key for the live experiments (OPENROUTER_API_KEY, or OPENAI_API_KEY / VLDB2027_JUDGE_API_KEY); see artefact/REPRODUCE.md. Calls default to a public OpenAI-compatible endpoint (OpenRouter); override with the VLDB2027_LLM_BASE_URL env var. The structural verdicts, ablations, and systems-performance benchmarks run without any API key.
git clone https://github.com/ZenAlexa/toki-bitemporal-memory && cd toki-bitemporal-memory
# Recommended: uv
uv sync --extra test --extra experiments
uv pip install -e .
# Or: pip
pip install -e ".[test,experiments]"
The editable install (-e .) is required before running the test suite or any python -m experiments.* runner, because it registers both the bitemporal package and the experiments package on the path. Verify it:
python -c "from bitemporal import Schema, LWW, Evidence, AwaitConfirm, PerRule; print('OK')"
The three benchmark datasets are not redistributed here. Each artefact/datasets/<name>/ directory carries a README.md with download instructions, a LICENSE, and a SHA256SUMS file for integrity verification after you download the raw data.
| Dataset | Used by | Download instructions |
|---|---|---|
| LoCoMo | g2_utility (mechanism-stress + cross-system ledger) | artefact/datasets/locomo/README.md |
| LongMemEval-S | g2_utility | artefact/datasets/longmemeval_s/README.md |
| MultiTQ | g2_utility | artefact/datasets/multitq/README.md |
The structural verdict matrix, systems-performance, and ablation results do not require these datasets; they are needed only for the end-to-end utility experiments.
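The integrity check against a SHA256SUMS file can be sketched in a few lines of standard-library Python. This is an illustrative helper, not a script shipped in the repository: `sha256_of` and `verify_sums` are hypothetical names, and the sketch assumes the file uses the conventional `sha256sum` layout of one `<hex digest>  <file name>` entry per line.

```python
import hashlib
from pathlib import Path


def sha256_of(path: Path, chunk_size: int = 1 << 20) -> str:
    """Stream the file in chunks so large dataset archives need not fit in memory."""
    digest = hashlib.sha256()
    with path.open("rb") as fh:
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify_sums(sums_file: Path, data_dir: Path) -> dict[str, bool]:
    """Return {file name: matches?} for every entry listed in a SHA256SUMS file."""
    results: dict[str, bool] = {}
    for line in sums_file.read_text().splitlines():
        if not line.strip():
            continue
        expected, name = line.split(maxsplit=1)
        name = name.lstrip("*")  # sha256sum marks binary-mode entries with '*'
        target = data_dir / name
        # A missing file counts as a failed check rather than raising.
        results[name] = target.is_file() and sha256_of(target) == expected.lower()
    return results
```

A call such as `verify_sums(Path("artefact/datasets/locomo/SHA256SUMS"), Path("<download dir>"))` would then report which downloaded files match; the download directory is whatever location the dataset README instructs you to use.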
This artifact reproduces the experiments behind the paper's theorems, the data behind its tables, and its figures, not the paper document itself. The pre-computed CSVs in results/ already back every table and figure. To regenerate from scratch:
make smoke # core operator + AnomalyClaim + AnomalyWire tests, < 2 min
# or, fully containerized:
docker build -t toki-artefact:latest -f artefact/Dockerfile .
docker run --rm toki-artefact:latest make smoke
| Claim / table / figure | Command | Output |
|---|---|---|
| Verdict matrix, N1/N2/N3 | python -m experiments.anomaly_bench.k_sweep_runner --output results/anomaly_bench/ | results/anomaly_bench/ (committed verdicts n[1,2,3]_*.csv; K-sweep under k_sweep/) |
| AnomalyWire cross-layer corroboration | python -m experiments.anomaly_wire.iso_level_sweep_runner | results/anomaly_wire/ |
| G2 utility (LoCoMo / LongMemEval-S / MultiTQ) | python -m experiments.g2_utility.runner --output-dir results/g2_utility/ | results/g2_utility/ |
| G3 systems performance (scaling / latency) | python -m experiments.g3_systems_perf.runner --output-dir results/g3_systems_perf/ | results/g3_systems_perf/ |
| G4 operator + K-semiring ablation | python -m experiments.g4_ablation.k_semiring_counterfactual_runner --output results/g4_ablation/ | results/g4_ablation/ |
| N1 empirical lower bound | python -m experiments.n1_empirical.runner --output results/n1_empirical/run_v1/ | results/n1_empirical/run_v1/ |
| Benchmark-A cross-system QA | python -m experiments.benchmark_a.cross_system_qa | results/benchmark_a/ |
| Benchmark-B MMA-Bench | python -m experiments.anomaly_bench.mma_bench_adapter | results/benchmark_b/ |
| Benchmark-F GroupMemBench / STALE-400 | python -m experiments.benchmark_f.stale_runner | results/benchmark_f/ |
| All infra-free tests | pytest tests/ -m "not docker and not live" | terminal report (the docker / live tests additionally need the pinned upstream clones and an LLM relay) |
Each script reads the corresponding CSVs from results/ and writes its plot into figures/:
python scripts/plot_anomaly_bench.py # -> figures/fig-anomaly-bench.pdf
python scripts/plot_systems_perf.py # -> figures/fig-systems-perf.pdf
python scripts/plot_iso_pareto.py # -> figures/fig7-iso-pareto.pdf
The repository carries additional plot_*.py renderers (operator family, forest plot, anchors, trajectory replay, scaling composites); each reads its CSVs from results/ and writes a PDF into figures/.
For the full reviewer runbook (imported-wire rows, optional Zep Cloud setup, honest-abstain protocol), see artefact/REPRODUCE.md.
Five modules back the paper's strengthened claims; artefact/REPRODUCE.md § 9 carries the full verification recipe for each.
- n-ary conflict-set algebra: implementation/bitemporal/operators.py::resolve_conflict_set resolves a set of n mutually-contradicting incumbents in one fold (not just the pairwise case), and implementation/bitemporal/audit.py::merge_provenance_all accumulates every loser's K-semiring provenance into the survivor. Verified by tests/bitemporal/test_conflict_set.py.
- Real multi-writer PostgreSQL isolation: experiments/g3_systems_perf/isolation_concurrency.py runs concurrent writers against a real PostgreSQL backend across the writers × isolation-level grid; the committed results/g3_systems_perf/isolation_concurrency.csv records 16 cells under PostgreSQL 17.10, so the isolation claim is auditable as genuine multi-process concurrency rather than a single-process simulation.
- Persistent judge log + crash replay: a persistent judge_log table records the judge verdict before the operator commit, so a verdict survives a crash and replays deterministically. Verified by tests/bitemporal/test_judge_log_persistence.py, including a negative control proving the log is load-bearing.
- JSON audit-witness codec + provenance-accumulating dedupe: an audit witness round-trips through storage, and a duplicate write folds its provenance into the existing row rather than discarding the duplicate's lineage (tests/bitemporal).
- Contradiction-density-stratified powered utility (honest null): experiments/g2_utility/{stratify,power,powered}.py run a pre-registered, density-stratified, statistically-powered paired utility test. This is an underpowered / null result, not a utility win: the committed results/g2_utility/cross_system/powered_summary.csv records status=n/a on both strata (achieved power 0.0347 against a 0.80 target, required n ≈ 1570 versus n = 5 measured). On LoCoMo at feasible sample sizes the powered test cannot establish a utility advantage; the artifact reports this honestly rather than claiming a measured-slice win.
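The n-ary fold in the first item above can be pictured with a small standalone sketch. It does not reproduce `resolve_conflict_set` or `merge_provenance_all`; the names `Claim`, `fold_pair`, and `resolve_conflict_set_sketch` are hypothetical, the survivor is chosen by a made-up confidence score, and plain set union stands in for the semiring addition, since the repository's actual K-semiring is not specified here.

```python
from dataclasses import dataclass
from functools import reduce


@dataclass(frozen=True)
class Claim:
    value: str
    confidence: float            # hypothetical score used to pick the survivor
    provenance: frozenset[str]   # source ids; set union stands in for semiring "+"


def fold_pair(survivor: Claim, other: Claim) -> Claim:
    """Keep the higher-confidence value, but accumulate both provenance sets."""
    winner = other if other.confidence > survivor.confidence else survivor
    return Claim(winner.value, winner.confidence, survivor.provenance | other.provenance)


def resolve_conflict_set_sketch(claims: list[Claim]) -> Claim:
    """Resolve n mutually contradicting claims in a single left fold."""
    if not claims:
        raise ValueError("conflict set must be non-empty")
    return reduce(fold_pair, claims)


claims = [
    Claim("Oslo", 0.4, frozenset({"s1"})),
    Claim("Bergen", 0.9, frozenset({"s2"})),
    Claim("Tromso", 0.6, frozenset({"s3"})),
]
survivor = resolve_conflict_set_sketch(claims)
print(survivor.value, sorted(survivor.provenance))  # Bergen ['s1', 's2', 's3']
```

Because the provenance sets are merged at every step, the survivor carries the lineage of every loser, not only the last one it displaced.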
| Experiment | Hardware | Wall-clock |
|---|---|---|
| Unit tests (tests/bitemporal/) | any laptop | < 30 s |
| AnomalyClaim smoke (simulated) | any laptop | ~2 min |
| G3 systems performance (DuckDB, local) | any laptop | ~10 min |
| G4 ablation (local, no LLM) | any laptop | ~5 min |
| G2 utility (requires LLM relay) | any laptop + LLM relay | ~60 min |
| Full N1/N2/N3 wire runs (LLM relay + optional Zep Cloud) | any laptop + LLM relay | ~90 min |
MIT License. See LICENSE.
@inproceedings{wang2026toki,
title = {{TOKI}: A Bitemporal Operator Algebra for Contradiction
Resolution in {LLM}-Agent Persistent Memory},
author = {Wang, Ziming},
booktitle = {Proc. VLDB Endow.},
volume = {20},
year = {2026},
note = {Preprint; under preparation for submission to PVLDB Vol.~20 (VLDB 2027), not yet peer-reviewed},
}

