TOKI: A Bitemporal Operator Algebra for Contradiction Resolution in LLM-Agent Persistent Memory

Reference implementation and reproducibility artifact for the TOKI paper.

Status

This repository accompanies a preprint being prepared for submission to PVLDB Vol. 20 (VLDB 2027). The work has not yet been peer-reviewed or accepted, so the theorems, numbers, and claims here may change before publication. An archival DOI (Zenodo) will be minted on acceptance.

Paper

The compiled paper and its extended appendix ship with this artifact under paper/:

paper/TOKI.pdf — the main paper.
paper/TOKI-supplement.pdf — the supplement, an extended appendix carrying the full proofs and protocols referenced from the paper body.

The supplement is where the load-bearing detail lives: every theorem cited in the main text is proved in full there, alongside the experiment protocols, soundness contracts, and the keyed-logging tightness argument.

TL;DR

Persistent memory for an LLM agent is a write-heavy substrate: every belief update is a versioned write, and the system must decide what to trust when a new claim contradicts a stored one. Production systems answer with four resolution heuristics, yet none declares the isolation level it assumes or the write-time anomalies it admits.

TOKI's thesis: contradiction resolution is write-time concurrency control. TOKI types the four production heuristics as one family of bitemporal operators over a dual-row schema, each carrying an isolation precondition and a provenance annotation that preserves the losing fact in an audit row. Four soundness theorems close the contract across three axes (isolation, schema, provenance) and lift to operator pipelines; a tightness companion proves that keyed logging of the adjudicating judge is necessary for replay consistency, a discipline every audited baseline omits.

This repository is the reference implementation and experiment harness behind every theorem, table, and figure in the paper.
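
To make the dual-row idea concrete, here is a minimal sketch of a last-writer-wins write that moves the losing fact into an audit row instead of overwriting it. It is an illustration only, not the bitemporal package's API: Fact, AuditRow, DualRowStore, and write_lww are hypothetical names, integer timestamps are assumed, and the isolation precondition and provenance annotation are left out.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass(frozen=True)
class Fact:
    key: str         # subject of the belief, e.g. "user.city"
    value: str
    valid_from: int  # valid time: when the fact became true in the world
    tx_time: int     # transaction time: when the store learned it


@dataclass(frozen=True)
class AuditRow:
    loser: Fact      # the fact that lost the contradiction
    winner: Fact     # the fact that stayed (or became) live
    operator: str    # which resolution operator decided
    tx_time: int     # transaction time of the decision


@dataclass
class DualRowStore:
    live: Dict[str, Fact] = field(default_factory=dict)   # one live row per key
    audit: List[AuditRow] = field(default_factory=list)   # append-only audit rows

    def write_lww(self, challenger: Fact) -> Fact:
        """Last-writer-wins by transaction time; the loser is kept in the audit rows."""
        incumbent = self.live.get(challenger.key)
        if incumbent is None:
            self.live[challenger.key] = challenger
            return challenger
        if incumbent.value == challenger.value:
            return incumbent  # same value: no contradiction to resolve
        if challenger.tx_time >= incumbent.tx_time:
            winner, loser = challenger, incumbent
        else:
            winner, loser = incumbent, challenger
        self.audit.append(AuditRow(loser, winner, "LWW", winner.tx_time))
        self.live[challenger.key] = winner
        return winner


store = DualRowStore()
store.write_lww(Fact("user.city", "Berlin", valid_from=1, tx_time=10))
store.write_lww(Fact("user.city", "Tokyo", valid_from=5, tx_time=20))
print(store.live["user.city"].value)  # Tokyo
print(store.audit[0].loser.value)     # Berlin
```

The point of the sketch is the last three lines of write_lww: the losing fact is appended to the audit list before the live row changes, so it stays recoverable after the write.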

Key contributions

A typed operator algebra. The four production contradiction-resolution heuristics become one isolation-indexed family of bitemporal operators over a dual-row (live + audit) schema, each carrying an explicit isolation precondition and a K-semiring provenance annotation.
A necessity theorem and soundness on three axes. Keyed logging of the adjudicating judge is proven necessary for replay consistency; four soundness theorems close the contract across isolation, schema, and provenance, and lift to operator pipelines.
An eight-system verdict matrix with running-code corroboration. Every audited baseline that keeps a language-model judge on the write path admits at least one of three write-time anomalies; TOKI is the only design that excludes all three while keeping the judge on the write path.
A reproducible harness. Pre-computed evidence in results/, re-runnable end to end, with a machine-checkable artefact/manifest.json that links each paper claim to its evidence file.

The four production heuristics as isolation-typed operators

This correspondence is the load-bearing claim (paper §1, Table tab:correspondence). Each deployed strategy is the operational mirror of one classical Berenson–Adya multiversion anomaly.

Production strategy	Operator	Isolation precondition	Admitted anomaly
Last-writer-wins	`LWW`	Read-committed	P4 lost update
Evidence-weighted merge	`Evidence`	Snapshot isolation	A5B write skew
Await-confirmation	`AwaitConfirm`	RC + callback	callback boundary
Per-rule policy	`PerRule`	Serializable on policy table	P3 phantom
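
The isolation preconditions in the table above can be read as a guard that runs before an operator is applied. The sketch below shows one way such a guard could look. Two assumptions are mine, not the paper's: each precondition is treated as a minimum level, and the three levels are placed in a simple total order. Isolation, PRECONDITION, IsolationError, and check_precondition are hypothetical names, and the callback and policy-table qualifiers are not modelled.

```python
from enum import IntEnum


class Isolation(IntEnum):
    # Simplified total order, assumed for this sketch only.
    READ_COMMITTED = 1
    SNAPSHOT = 2
    SERIALIZABLE = 3


# Operator name -> isolation precondition, transcribed from the table above.
PRECONDITION = {
    "LWW": Isolation.READ_COMMITTED,
    "Evidence": Isolation.SNAPSHOT,
    "AwaitConfirm": Isolation.READ_COMMITTED,  # "+ callback" is not modelled
    "PerRule": Isolation.SERIALIZABLE,         # "on policy table" is not modelled
}


class IsolationError(RuntimeError):
    """Raised when a session runs below an operator's precondition."""


def check_precondition(operator: str, session_level: Isolation) -> None:
    required = PRECONDITION[operator]
    if session_level < required:
        raise IsolationError(
            f"{operator} requires {required.name}, "
            f"but the session runs at {session_level.name}"
        )


check_precondition("LWW", Isolation.SNAPSHOT)  # passes silently
try:
    check_precondition("PerRule", Isolation.SNAPSHOT)
except IsolationError as err:
    print(err)  # PerRule requires SERIALIZABLE, but the session runs at SNAPSHOT
```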

Headline result: the verdict matrix over eight systems

The paper audits eight systems (six production agent-memory baselines, one engine-layer comparator, and TOKI) against three write-time anomalies:

N1 — replay inconsistency: re-invoking the language-model judge on a committed verdict can flip it.
N2 — belief-drift skew: concurrent per-partition updates drift the belief state under snapshot isolation.
N3 — audit erasure: the losing fact is overwritten with no recoverable audit trail.

The table below shows the design verdict (the verdict each system's published contradiction logic implies). The paper's full matrix (tab:anomaly-bench, §5) additionally reports the running-code verdict observed from each system's shipped code, transcribed directly from results/anomaly_bench/. Legend: A = admits the anomaly, X = excludes it, – = predicate not applicable.

System	N1 replay-inconsistency	N2 belief-drift skew	N3 audit erasure	LM judge on write path?
mem0 (v2)	–	–	A	yes
mem0 (v3)	A	A	A	yes
Graphiti	A	–	A	yes
Letta	–	A	A	yes
Zep	–	A	A	yes
MIRIX	A	A	–	yes
WorldDB (engine-layer comparator)	X	X	X	no (judge removed)
TOKI (this work)	X	X	X	yes

Every baseline that keeps a language-model judge on the write path admits at least one anomaly. The content-addressed engine-layer comparator avoids all three only by removing the judge. TOKI alone excludes all three while keeping the judge on the write path. The MIRIX row is transcribed from the independent MMA-Bench evaluation, which probes N1/N2 but not the audit path, so its N3 cell abstains.
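
The N1 anomaly and the keyed-logging remedy can be illustrated with a small sketch: a nondeterministic judge is wrapped in a log keyed by a hash of its inputs, so a replay reads the committed verdict instead of re-invoking the judge. This is a toy model under my own assumptions, not the repository's judge log: KeyedJudgeLog, verdict_key, and flaky_judge are hypothetical names, the log lives in memory, and the random judge merely stands in for a language model.

```python
import hashlib
import json
import random
from typing import Callable, Dict


def verdict_key(incumbent: str, challenger: str) -> str:
    """Stable key for one adjudication: SHA-256 over the ordered input pair."""
    payload = json.dumps([incumbent, challenger], ensure_ascii=False)
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()


class KeyedJudgeLog:
    def __init__(self, judge: Callable[[str, str], str]) -> None:
        self._judge = judge
        self._log: Dict[str, str] = {}

    def adjudicate(self, incumbent: str, challenger: str) -> str:
        key = verdict_key(incumbent, challenger)
        if key not in self._log:
            # First sight of this pair: invoke the judge once and record the verdict.
            self._log[key] = self._judge(incumbent, challenger)
        # Every later call, including a replay, reads the logged verdict.
        return self._log[key]


def flaky_judge(incumbent: str, challenger: str) -> str:
    # Stand-in for a language-model judge: nondeterministic on purpose.
    return random.choice(["keep_incumbent", "accept_challenger"])


judge = KeyedJudgeLog(flaky_judge)
first = judge.adjudicate("user lives in Berlin", "user lives in Tokyo")
replays = {
    judge.adjudicate("user lives in Berlin", "user lives in Tokyo")
    for _ in range(100)
}
print(replays == {first})  # True
```

Calling flaky_judge directly 100 times would almost certainly return both verdicts, which is the replay inconsistency N1 describes; routing every call through the log removes that freedom.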

What the experiments measure

Summary figure (two panels):
Left: contradiction-resolution latency stays flat as the memory store grows to 10^5 facts (DuckDB backend, evidence-weighted path).
Right: three structural anchors (isolation lattice, composition length 1–5, Welch-t equivalence) where every predicted 0/1 boundary matches the measured result.

The audit-row defence recovers a constructed mechanism-stress slice by 0.86 (LoCoMo). End-to-end retrieval shows no significant difference from the baselines: the paper states a write-time correctness contract and makes no utility-superiority claim.
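
For readers unfamiliar with the Welch test named among the structural anchors, the sketch below computes the standard Welch t statistic and its Welch-Satterthwaite degrees of freedom using only the standard library. It shows the textbook formula, not the paper's equivalence protocol (which this README does not spell out), and the sample values are made-up toy numbers rather than results. welch_t is a hypothetical helper name.

```python
import math
from statistics import mean, variance
from typing import Sequence, Tuple


def welch_t(a: Sequence[float], b: Sequence[float]) -> Tuple[float, float]:
    """Return (t statistic, Welch-Satterthwaite degrees of freedom)."""
    na, nb = len(a), len(b)
    # Squared standard error of each sample mean (sample variance / n).
    va, vb = variance(a) / na, variance(b) / nb
    t = (mean(a) - mean(b)) / math.sqrt(va + vb)
    df = (va + vb) ** 2 / (va ** 2 / (na - 1) + vb ** 2 / (nb - 1))
    return t, df


# Toy data only: two samples with equal variance and means one unit apart.
t_stat, dof = welch_t([1.0, 2.0, 3.0, 4.0, 5.0], [2.0, 3.0, 4.0, 5.0, 6.0])
print(t_stat, dof)  # -1.0 8.0
```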

Repository layout

Path	Contents
`implementation/bitemporal/`	Core package (`import bitemporal`): the four operators, dual-row schema, K-semiring provenance lattice, `as_of` time-travel, audit log.
`implementation/adapters/`	Adapter shims for the baseline agent-memory systems.
`experiments/anomaly_bench/`	AnomalyClaim: the N1/N2/N3 structural verdict matrix.
`experiments/anomaly_wire/`	AnomalyWire: live-adapter, cross-layer corroboration of the verdicts.
`experiments/g2_utility/`	Paired end-to-end accuracy (LoCoMo, LongMemEval-S, MultiTQ).
`experiments/g3_systems_perf/`	DuckDB backend scaling and tail-latency benchmarks.
`experiments/g4_ablation/`	Operator-family and K-semiring ablations.
`experiments/benchmark_{a,b,f}/`	Third-party benchmarks (cross-system QA, MMA-Bench, GroupMemBench / STALE-400).
`experiments/n1_empirical/`, `n2_partition/`	Empirical lower-bound and partition-isolation sweeps.
`results/`	Pre-computed CSVs and manifests, one subdirectory per experiment. These are the exact evidence behind the paper's tables and figures.
`figures/`	The paper's compiled figures (PNG), embedded above; the plot scripts regenerate the underlying plots here from `results/`.
`scripts/`	Figure generation (`plot_*.py`), data aggregation (`aggregate_*.py`, `reconcile_*.py`, `backfill_*.py`), and statistics helpers (`derive_g3_5axis_stats.py`, `build_holm_family.py`, `build_benchmark_status.py`, `refresh_manifest_sha.py`).
`tests/`	`pytest` suites mirroring `implementation/` and each experiment.
`artefact/`	Submission package: `REPRODUCE.md` reviewer runbook, machine-checkable `manifest.json`, `Dockerfile`, dataset manifests + checksums.
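
The `as_of` time-travel listed for the core package is a bitemporal read: it asks what the store believed about a given valid time, as known at a given transaction time. The sketch below shows the usual two-interval filter behind such a query. It does not reproduce the package's actual `as_of` signature, which this README does not document; Version, FOREVER, and the argument names are hypothetical, and half-open integer intervals are assumed.

```python
from dataclasses import dataclass
from typing import List, Optional

FOREVER = 2**63 - 1  # sentinel for an open-ended interval


@dataclass(frozen=True)
class Version:
    key: str
    value: str
    valid_from: int  # valid-time interval [valid_from, valid_to)
    valid_to: int
    tx_from: int     # transaction-time interval [tx_from, tx_to)
    tx_to: int       # FOREVER while the row is still current


def as_of(rows: List[Version], key: str, valid_at: int, known_at: int) -> Optional[Version]:
    """Return the version of `key` valid at `valid_at`, as the store knew it at `known_at`."""
    for row in rows:
        if (row.key == key
                and row.valid_from <= valid_at < row.valid_to
                and row.tx_from <= known_at < row.tx_to):
            return row
    return None


rows = [
    # Learned at tx 10: Berlin, open-ended. Superseded at tx 20.
    Version("user.city", "Berlin", 0, FOREVER, 10, 20),
    # Learned at tx 20: Berlin only held until valid time 5, then Tokyo.
    Version("user.city", "Berlin", 0, 5, 20, FOREVER),
    Version("user.city", "Tokyo", 5, FOREVER, 20, FOREVER),
]

print(as_of(rows, "user.city", valid_at=7, known_at=15).value)  # Berlin
print(as_of(rows, "user.city", valid_at=7, known_at=25).value)  # Tokyo
print(as_of(rows, "user.city", valid_at=7, known_at=5))         # None
```

The first two queries ask about the same valid time and differ only in transaction time, which is what lets an auditor see both the current belief and the belief the store held before the correction.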

Requirements

Python 3.11+
uv (recommended) or pip
Docker 24+ (only for the containerized smoke run)
An LLM API key for the live experiments (OPENROUTER_API_KEY, or OPENAI_API_KEY / VLDB2027_JUDGE_API_KEY); see artefact/REPRODUCE.md. Calls default to a public OpenAI-compatible endpoint (OpenRouter); override with the VLDB2027_LLM_BASE_URL env var. The structural verdicts, ablations, and systems-performance benchmarks run without any API key.

Installation

git clone https://github.com/ZenAlexa/toki-bitemporal-memory && cd toki-bitemporal-memory

# Recommended: uv
uv sync --extra test --extra experiments
uv pip install -e .

# Or: pip
pip install -e ".[test,experiments]"

The editable install (-e .) is required before running the test suite or any python -m experiments.* runner, because it registers both the bitemporal package and the experiments package on the path. Verify it:

python -c "from bitemporal import Schema, LWW, Evidence, AwaitConfirm, PerRule; print('OK')"

Datasets

The three benchmark datasets are not redistributed here. Each artefact/datasets/<name>/ directory carries a README.md with download instructions, a LICENSE, and a SHA256SUMS file for integrity verification after you download the raw data.

Dataset	Used by	Download instructions
LoCoMo	`g2_utility` (mechanism-stress + cross-system ledger)	`artefact/datasets/locomo/README.md`
LongMemEval-S	`g2_utility`	`artefact/datasets/longmemeval_s/README.md`
MultiTQ	`g2_utility`	`artefact/datasets/multitq/README.md`

The structural verdict matrix, systems-performance, and ablation results do not require these datasets; they are needed only for the end-to-end utility experiments.
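
After downloading a dataset, its SHA256SUMS file can be checked with a few lines of standard-library Python. The sketch below assumes the file follows the common sha256sum layout (one "<hex digest>  <relative path>" pair per line) and that paths resolve relative to the directory holding SHA256SUMS; neither is confirmed here, so consult the dataset's README.md first. sha256_of and verify_sums are hypothetical helper names.

```python
import hashlib
from pathlib import Path
from typing import Dict


def sha256_of(path: Path, chunk_size: int = 1 << 20) -> str:
    """Hash a file in chunks so large datasets do not need to fit in memory."""
    digest = hashlib.sha256()
    with path.open("rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify_sums(sums_file: Path) -> Dict[str, bool]:
    """Map each listed path to True if the file exists and its digest matches."""
    results: Dict[str, bool] = {}
    for line in sums_file.read_text(encoding="utf-8").splitlines():
        if not line.strip():
            continue
        expected, name = line.split(maxsplit=1)
        name = name.strip().lstrip("*")  # sha256sum marks binary mode with '*'
        target = sums_file.parent / name
        results[name] = target.is_file() and sha256_of(target) == expected.lower()
    return results
```

A call such as verify_sums(Path("artefact/datasets/locomo/SHA256SUMS")) would then return a dictionary of per-file results; any False entry indicates a missing or corrupted download under the assumptions above.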

Reproducing the experiments

This artifact reproduces the experiments behind the paper's theorems, the data behind its tables, and its figures, not the paper document itself. The pre-computed CSVs in results/ already back every table and figure. To regenerate from scratch:

Quick smoke check (no API key)

make smoke          # core operator + AnomalyClaim + AnomalyWire tests, < 2 min
# or, fully containerized:
docker build -t toki-artefact:latest -f artefact/Dockerfile .
docker run --rm toki-artefact:latest make smoke

Claim → command → output

Claim / table / figure	Command	Output
Verdict matrix, N1/N2/N3	`python -m experiments.anomaly_bench.k_sweep_runner --output results/anomaly_bench/`	`results/anomaly_bench/` (committed verdicts `n[1,2,3]_*.csv`; K-sweep under `k_sweep/`)
AnomalyWire cross-layer corroboration	`python -m experiments.anomaly_wire.iso_level_sweep_runner`	`results/anomaly_wire/`
G2 utility (LoCoMo / LongMemEval-S / MultiTQ)	`python -m experiments.g2_utility.runner --output-dir results/g2_utility/`	`results/g2_utility/`
G3 systems performance (scaling / latency)	`python -m experiments.g3_systems_perf.runner --output-dir results/g3_systems_perf/`	`results/g3_systems_perf/`
G4 operator + K-semiring ablation	`python -m experiments.g4_ablation.k_semiring_counterfactual_runner --output results/g4_ablation/`	`results/g4_ablation/`
N1 empirical lower bound	`python -m experiments.n1_empirical.runner --output results/n1_empirical/run_v1/`	`results/n1_empirical/run_v1/`
Benchmark-A cross-system QA	`python -m experiments.benchmark_a.cross_system_qa`	`results/benchmark_a/`
Benchmark-B MMA-Bench	`python -m experiments.anomaly_bench.mma_bench_adapter`	`results/benchmark_b/`
Benchmark-F GroupMemBench / STALE-400	`python -m experiments.benchmark_f.stale_runner`	`results/benchmark_f/`
All infra-free tests	`pytest tests/ -m "not docker and not live"`	terminal report (the `docker` / `live` tests additionally need the pinned upstream clones and an LLM relay)

Figure generation

Each script reads the corresponding CSVs from results/ and writes its plot into figures/:

python scripts/plot_anomaly_bench.py     # -> figures/fig-anomaly-bench.pdf
python scripts/plot_systems_perf.py      # -> figures/fig-systems-perf.pdf
python scripts/plot_iso_pareto.py        # -> figures/fig7-iso-pareto.pdf

The repository carries additional plot_*.py renderers (operator family, forest plot, anchors, trajectory replay, scaling composites); each reads its CSVs from results/ and writes a PDF into figures/.

For the full reviewer runbook (imported-wire rows, optional Zep Cloud setup, honest-abstain protocol), see artefact/REPRODUCE.md.

Strengthening modules

Five modules back the paper's strengthened claims; artefact/REPRODUCE.md § 9 carries the full verification recipe for each.

n-ary conflict-set algebra — implementation/bitemporal/operators.py::resolve_conflict_set resolves a set of n mutually-contradicting incumbents in one fold (not just the pairwise case), and implementation/bitemporal/audit.py::merge_provenance_all accumulates every loser's K-semiring provenance into the survivor. Verified by tests/bitemporal/test_conflict_set.py.
Real multi-writer PostgreSQL isolation — experiments/g3_systems_perf/isolation_concurrency.py runs concurrent writers against a real PostgreSQL backend across the writers × isolation-level grid; the committed results/g3_systems_perf/isolation_concurrency.csv records 16 cells under PostgreSQL 17.10, so the isolation claim is auditable as genuine multi-process concurrency rather than a single-process simulation.
Persistent judge log + crash replay — a persistent judge_log table records the judge verdict before the operator commit, so a verdict survives a crash and replays deterministically. Verified by tests/bitemporal/test_judge_log_persistence.py, including a negative control proving the log is load-bearing.
JSON audit-witness codec + provenance-accumulating dedupe — an audit witness round-trips through storage, and a duplicate write folds its provenance into the existing row rather than discarding the duplicate's lineage (tests/bitemporal).
Contradiction-density-stratified powered utility (honest null) — experiments/g2_utility/{stratify,power,powered}.py run a pre-registered, density-stratified, statistically-powered paired utility test. This is an underpowered / null result, not a utility win: the committed results/g2_utility/cross_system/powered_summary.csv records status=n/a on both strata (achieved power 0.0347 against a 0.80 target, required n ≈ 1570 versus n = 5 measured). On LoCoMo at feasible sample sizes the powered test cannot establish a utility advantage; the artifact reports this honestly rather than claiming a measured-slice win.
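
The n-ary conflict-set module above can be pictured as a single fold over the conflicting claims, followed by a merge of every loser's provenance into the survivor. The sketch below illustrates that shape under my own simplifications and is not the code in operators.py or audit.py: Claim, pick_heavier, and resolve_set are hypothetical names, the survivor is chosen by a made-up evidence weight, and set union over source ids stands in for the K-semiring operation.

```python
from dataclasses import dataclass, replace
from functools import reduce
from typing import Callable, FrozenSet, Iterable, List, Tuple


@dataclass(frozen=True)
class Claim:
    value: str
    weight: float               # evidence weight (hypothetical field)
    provenance: FrozenSet[str]  # source ids that support this claim


def pick_heavier(a: Claim, b: Claim) -> Claim:
    """Pairwise resolver used by the fold; ties keep the earlier claim."""
    return a if a.weight >= b.weight else b


def resolve_set(
    claims: Iterable[Claim],
    pick: Callable[[Claim, Claim], Claim] = pick_heavier,
) -> Tuple[Claim, List[Claim]]:
    """Resolve n mutually contradicting claims in one fold.

    Returns the survivor, annotated with the union of all provenance sets,
    together with the list of losers in their original order.
    """
    claims = list(claims)
    if not claims:
        raise ValueError("empty conflict set")
    survivor = reduce(pick, claims)
    losers = [c for c in claims if c is not survivor]
    merged = frozenset().union(*(c.provenance for c in claims))
    return replace(survivor, provenance=merged), losers


conflict = [
    Claim("Berlin", 0.4, frozenset({"s1"})),
    Claim("Tokyo", 0.9, frozenset({"s2"})),
    Claim("Osaka", 0.2, frozenset({"s3"})),
]
winner, losers = resolve_set(conflict)
print(winner.value)                # Tokyo
print(sorted(winner.provenance))   # ['s1', 's2', 's3']
print([c.value for c in losers])   # ['Berlin', 'Osaka']
```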

Expected hardware and runtime

Experiment	Hardware	Wall-clock
Unit tests (`tests/bitemporal/`)	any laptop	< 30 s
AnomalyClaim smoke (simulated)	any laptop	~2 min
G3 systems performance (DuckDB, local)	any laptop	~10 min
G4 ablation (local, no LLM)	any laptop	~5 min
G2 utility (requires LLM relay)	any laptop + LLM relay	~60 min
Full N1/N2/N3 wire runs (LLM relay + optional Zep Cloud)	any laptop + LLM relay	~90 min

License

MIT License. See LICENSE.

Citation

@inproceedings{wang2026toki,
  title     = {{TOKI}: A Bitemporal Operator Algebra for Contradiction
               Resolution in {LLM}-Agent Persistent Memory},
  author    = {Wang, Ziming},
  booktitle = {Proc. VLDB Endow.},
  volume    = {20},
  year      = {2026},
  note      = {Preprint; under preparation for submission to PVLDB Vol.~20 (VLDB 2027), not yet peer-reviewed},
}
