feat: temporal repeated-event emphasis + AUDN session timestamps + extraction fallback #47

Merged
ethanj merged 10 commits into main from benchmark/temporal-repeated-event-emphasis
Apr 25, 2026
@ethanj ethanj commented Apr 25, 2026

Summary

Four feature concerns shipped as one branch, each with regression tests:

1. Repeated-event temporal endpoint formatting (`feat(retrieval)`)

Adds a query-aware temporal endpoint block to tiered injection. Recognizes "first ... second" temporal questions (e.g. "How many months between the first and second appointment?") and emits a compact two-endpoint block with elapsed duration when the retrieved memories contain two distinct dates matching the queried event terms.
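As a rough illustration of the endpoint selection described above, the sketch below picks the earliest two distinct dates from already-matched memories and computes the elapsed whole months. The `DatedMemory` shape, both function names, and the whole-month rounding rule are assumptions made for this example; the branch's actual `formatDuration` may format durations differently.

```typescript
// Hypothetical shape: a matched memory with an ISO calendar date (yyyy-mm-dd).
interface DatedMemory {
  text: string;
  date: string;
}

// ISO dates sort chronologically as plain strings, so a lexical sort is enough.
function earliestTwoDistinctDates(memories: DatedMemory[]): [string, string] | null {
  const unique = Array.from(new Set(memories.map((m) => m.date))).sort();
  return unique.length >= 2 ? [unique[0], unique[1]] : null;
}

// Assumed rounding rule: count a month only once the day-of-month is reached.
function wholeMonthsBetween(startIso: string, endIso: string): number {
  const [sy, sm, sd] = startIso.split("-").map(Number);
  const [ey, em, ed] = endIso.split("-").map(Number);
  const months = (ey - sy) * 12 + (em - sm);
  return ed < sd ? months - 1 : months;
}
```

Deduplicating dates before picking endpoints matters here: two memories about the same appointment should not be reported as the "first" and "second" event.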

  • Concept-group matching: a candidate must hit a synonym in EVERY query concept group (doctor AND appointment), not just one — partial matches like "only doctor" + "only appointment" no longer falsely become endpoints.
  • Endpoint block tokens are subtracted from the tier-assignment budget up front and counted in `estimatedContextTokens`, so the block never silently exceeds the caller's budget.
  • Plural↔canonical synonym resolution via reverse index (`appointments` → `appointment`).
  • Extracted `query-term-visibility.ts` and `temporal-format.ts` modules to keep `retrieval-format.ts` focused.
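The concept-group rule and the plural reverse index from the bullets above can be sketched as follows. This is a minimal illustration, not the code in `temporal-endpoint-evidence.ts`: the synonym table contents, the tokenizer, and the function names `conceptGroups` and `matchesAllGroups` are all hypothetical.

```typescript
// Hypothetical synonym table: canonical concept -> accepted surface forms.
const SYNONYMS: Record<string, string[]> = {
  doctor: ["doctor", "doctors", "physician"],
  appointment: ["appointment", "appointments", "visit", "check-up"],
};

// Reverse index: every surface form (plurals included) -> canonical concept.
const REVERSE_INDEX = new Map<string, string>();
for (const [canonical, forms] of Object.entries(SYNONYMS)) {
  for (const form of forms) REVERSE_INDEX.set(form, canonical);
}

// Lowercase word tokens; hyphenated words such as "check-up" stay whole.
function tokenize(text: string): string[] {
  return text.toLowerCase().match(/[a-z]+(?:-[a-z]+)*/g) ?? [];
}

// One synonym group per distinct concept named in the query.
function conceptGroups(query: string): string[][] {
  const canonicals = new Set<string>();
  for (const token of tokenize(query)) {
    const canonical = REVERSE_INDEX.get(token);
    if (canonical !== undefined) canonicals.add(canonical);
  }
  return Array.from(canonicals).map((c) => SYNONYMS[c]);
}

// A memory qualifies only if it hits at least one synonym in EVERY group.
function matchesAllGroups(memory: string, groups: string[][]): boolean {
  const tokens = new Set(tokenize(memory));
  return groups.length > 0 && groups.every((g) => g.some((s) => tokens.has(s)));
}
```

With a query naming both "doctor" and "appointments", a memory that mentions only the doctor fails the `every` check, which is the partial-match case the bullet describes.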

2. AUDN session timestamp threading (`feat(audn)`)

Adds an `observed_at` companion to `created_at` on stored memory rows. Logical session timestamps from ingest now flow through canonical fact storage, projection storage, supersede, and clarification writes. Without this, mutations recorded during transcript replay all stamped wall-clock `created_at`, breaking temporal ordering on those slices.

  • `StoreMemoryInput` accepts `observedAt` (default = `createdAt`).
  • `storeProjection` groups its trailing args into `StoreProjectionOptions` (`cmoId`, `logicalTimestamp`, `workspace`) — avoids a fifth positional argument.
  • AUDN clarify, opinion-confidence-collapse, and supersede write paths thread the timestamp.
  • Composite generation in `ingest-post-write` picks up `observedAt`.
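A minimal sketch of the `observedAt` default and the options-bag signature follows. Only the names `StoreMemoryInput`, `StoreProjectionOptions`, `observedAt`, `cmoId`, `logicalTimestamp`, and `workspace` come from this PR; the field types, the `MemoryRow` shape, and the helpers `toRow` and `storeProjectionSketch` are assumptions made for illustration.

```typescript
// Assumed field types; the real interfaces carry more fields.
interface StoreMemoryInput {
  content: string;
  createdAt: Date;
  observedAt?: Date; // logical session time; defaults to createdAt
}

interface StoreProjectionOptions {
  cmoId?: string;
  logicalTimestamp?: Date;
  workspace?: string;
}

// Hypothetical row shape mirroring the created_at / observed_at columns.
interface MemoryRow {
  content: string;
  created_at: Date;
  observed_at: Date;
}

// Repository-layer default: observed_at falls back to created_at.
function toRow(input: StoreMemoryInput): MemoryRow {
  return {
    content: input.content,
    created_at: input.createdAt,
    observed_at: input.observedAt ?? input.createdAt,
  };
}

// Hypothetical stand-in for storeProjection: optional fields travel in one
// options object instead of a growing list of positional arguments.
function storeProjectionSketch(
  content: string,
  wallClock: Date,
  options: StoreProjectionOptions = {},
): MemoryRow {
  return toRow({ content, createdAt: wallClock, observedAt: options.logicalTimestamp });
}
```

During transcript replay a caller would pass the session's logical time as `logicalTimestamp`, so ordering by `observed_at` follows the conversation rather than the wall clock.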

3. Chunked extraction fallback (`feat(extraction)`)

Adds a default-off `CHUNKED_EXTRACTION_FALLBACK_ENABLED` flag. When normal extraction returns zero facts on a conversation longer than the configured chunk size, the consensus path retries with chunked extraction.
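The retry condition above could be expressed as a small predicate. This is a sketch only: the config field names, the character-based chunk size, and the naive fixed-width splitter are assumptions, not the behavior of `chunkedExtractFacts`.

```typescript
// Assumed config shape; chunkSize is measured in characters for this sketch.
interface FallbackConfig {
  chunkedExtractionFallbackEnabled: boolean; // mirrors CHUNKED_EXTRACTION_FALLBACK_ENABLED
  chunkSize: number;
}

// Retry only when the flag is on, extraction found nothing, and the input
// is longer than one chunk (otherwise chunking would repeat the same call).
function shouldRetryChunked(
  factCount: number,
  conversation: string,
  config: FallbackConfig,
): boolean {
  return (
    config.chunkedExtractionFallbackEnabled &&
    factCount === 0 &&
    conversation.length > config.chunkSize
  );
}

// Naive fixed-width splitter, used here only to make the sketch complete.
function splitIntoChunks(conversation: string, chunkSize: number): string[] {
  if (chunkSize <= 0) throw new RangeError("chunkSize must be positive");
  const chunks: string[] = [];
  for (let i = 0; i < conversation.length; i += chunkSize) {
    chunks.push(conversation.slice(i, i + chunkSize));
  }
  return chunks;
}
```

Passing the config in explicitly, rather than reading a module-level singleton, is what lets a per-request override change the outcome.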

Also fixes a runtime-config bug: `extractOnce` now branches directly on `config.extractionCacheEnabled` rather than always routing through `cachedExtractFacts` (which reads the singleton and silently ignored `config_override.extractionCacheEnabled=false`).

4. Conflict policy fixes (`fix(conflict-policy)`)

  • Stop matching medical "check-up" wording as uncertainty (the bare "check" marker fired on routine medical phrases). Replaced with regexes that only match real uncertainty: `need/needs/needed/will/should to check`, `check later/tomorrow/again/back`.
  • CLARIFY + explicit replacement signal ("replacing X", "no longer Y", "correction: ...") now upgrades to SUPERSEDE only when the target ID is present in the candidate set; otherwise stays CLARIFY. The previous fall-through to ADD silently kept the stale memory active alongside the new one. Stale/invalid target IDs that AUDN may return now keep CLARIFY rather than producing a SUPERSEDE that downstream rejects.
  • Refactored `applyClarificationOverrides` from a 14-cyclomatic if/else chain into a POLICIES list of small named transformers; dispatcher is a 6-line loop.
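The policy-list shape and the guarded SUPERSEDE upgrade described above might look roughly like this. The type shapes, the `REPLACEMENT_SIGNAL` regex, and the names `explicitReplacement` and `applyPolicies` are hypothetical; only the null-to-defer contract, the `POLICIES` list, and the `candidates.some` check are taken from this PR.

```typescript
type Action = "ADD" | "CLARIFY" | "SUPERSEDE";

// Assumed minimal shapes; the real AUDNDecision has more fields.
interface AUDNDecision {
  action: Action;
  targetMemoryId: string | null;
}
interface Candidate {
  id: string;
}
interface PolicyContext {
  fact: string;
  candidates: Candidate[];
}

// A policy returns null to defer, or a decision to commit.
type Policy = (decision: AUDNDecision, ctx: PolicyContext) => AUDNDecision | null;

// Hypothetical detector for explicit replacement wording.
const REPLACEMENT_SIGNAL = /\b(?:replacing|no longer|instead of)\b|\bcorrection:/i;

const explicitReplacement: Policy = (decision, ctx) => {
  if (decision.action !== "CLARIFY" || !REPLACEMENT_SIGNAL.test(ctx.fact)) return null;
  const targetKnown =
    decision.targetMemoryId !== null &&
    ctx.candidates.some((c) => c.id === decision.targetMemoryId);
  if (!targetKnown) return decision; // keep CLARIFY; never fall through to ADD
  const upgraded: AUDNDecision = { action: "SUPERSEDE", targetMemoryId: decision.targetMemoryId };
  return upgraded;
};

const POLICIES: Policy[] = [explicitReplacement];

// Dispatcher: the first policy that commits wins.
function applyPolicies(decision: AUDNDecision, ctx: PolicyContext): AUDNDecision {
  for (const policy of POLICIES) {
    const result = policy(decision, ctx);
    if (result !== null) return result;
  }
  return decision;
}
```

Returning the unchanged CLARIFY decision, rather than null, is deliberate in this sketch: it commits the outcome so that no later policy can promote the decision to ADD.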

Codex review trail

Codex reviewed in 5 passes. All findings addressed:

  • Pass 1 (HIGH): `extractionCacheEnabled` runtime override bypass — fixed in `extractOnce`.
  • Pass 1 (MEDIUM): endpoint block tokens unbudgeted — counted in `estimatedContextTokens`, subtracted from assignment budget up front.
  • Pass 1 (MEDIUM): plural-asymmetric synonym expansion — reverse index added.
  • Pass 2 (MEDIUM): partial-match false endpoints — concept-group requirement.
  • Pass 2 (MEDIUM): CLARIFY + explicit replacement → ADD silently kept stale memory — upgrades to SUPERSEDE on valid target.
  • Pass 2 (LOW): new fallow complexity_moderate baseline entry — refactored function below threshold, baseline cleaned.
  • Pass 3 (MEDIUM): SUPERSEDE on stale targetMemoryId not in candidates — added `candidates.some` check.

Verification

  • `npx tsc --noEmit` clean
  • `npx fallow audit` clean ("No issues in 23 changed files")
  • `npm run build` succeeds
  • Focused tests: 83/83 passing across conflict-policy, temporal-endpoint-evidence, retrieval-format, consensus-extraction-runtime-config, audn-workspace-scope-fence, ingest-post-write, memory-ingest-runtime-config, memory-storage-runtime-config, memory-route-config-override

🤖 Generated with Claude Code

ethanj and others added 10 commits April 24, 2026 23:10
The UNCERTAIN_MARKERS list contained the bare token "check", which fired
on routine medical phrases like "check-up with the doctor" and routed
those facts to clarification instead of ADD. Replace it with a pair of
regexes that only match real uncertainty wording:

- "need/needs/needed/will/should to check"
- "check later/tomorrow/again/back"

Adds a regression test covering "Sam had a check-up with Sam's doctor".

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
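For illustration, the two uncertainty patterns named in the commit above could be written as follows. The exact regexes in `conflict-policy.ts` are not shown on this page, so these are assumptions; in particular, treating "to" as optional (so that "will check" matches) is a guess, and `looksUncertain` is a hypothetical helper name.

```typescript
// Assumed reconstruction of the two patterns; not copied from the branch.
const UNCERTAIN_PATTERNS: RegExp[] = [
  /\b(?:need|needs|needed|will|should)(?:\s+to)?\s+check\b/i,
  /\bcheck\s+(?:later|tomorrow|again|back)\b/i,
];

// True when the text contains real uncertainty wording.
function looksUncertain(text: string): boolean {
  return UNCERTAIN_PATTERNS.some((pattern) => pattern.test(text));
}
```

Because the second pattern requires whitespace after "check", the hyphenated "check-up" does not match it, and the first pattern needs a leading modal or "need" form, so routine medical phrasing passes through.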
…tion

Adds a default-off CHUNKED_EXTRACTION_FALLBACK_ENABLED flag. When normal
extraction returns zero facts on a conversation longer than the configured
chunk size, the consensus path now retries with chunked extraction. This
recovers from extraction failures on long inputs without enabling chunked
extraction unconditionally.

Refactors chunkedExtractFacts to take its config as an explicit argument
instead of reading the module-level singleton, so per-request runtime
overrides flow through. extractOnce now also branches directly on
runtime extractionCacheEnabled rather than always routing through
cachedExtractFacts (which reads the singleton internally) — this lets
config_override.extractionCacheEnabled actually take effect during
benchmark sweeps.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Adds an observed_at companion to created_at on stored memory rows and
threads logical session timestamps from the ingest path through to
canonical fact storage, projection storage, supersede, and clarification
writes. Without this, mutations recorded during transcript replay (and
benchmark sweeps with explicit session timestamps) all stamped wall-clock
created_at, breaking temporal ordering and ranking on those slices.

- StoreMemoryInput accepts observedAt; defaults to createdAt at the
  repository layer.
- storeProjection groups its trailing arguments into a
  StoreProjectionOptions object (cmoId, logicalTimestamp, workspace) —
  now that the call sites need three optional fields, the bag avoids
  adding a fifth positional argument.
- AUDN clarify, opinion-confidence-collapse, and supersede write paths
  pass through the logical timestamp.
- Composite generation in ingest-post-write picks up observedAt to
  match created_at.
- Unit tests cover the timestamp threading at storage, ingest, AUDN,
  composite, and integration (temporal-mutation-regression) layers.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Adds a query-aware temporal endpoint block to tiered injection. Recognizes
repeated-event temporal questions (e.g. "How many months between the first
and second appointment?") and emits a compact two-endpoint summary plus
the elapsed duration when retrieved memories contain two distinct dates
matching the queried event terms.

- temporal-endpoint-evidence.ts: identifies the question shape, scores
  candidate memories by event-term overlap, picks the earliest two
  distinct dates, emits the block. Synonym table covers the common
  appointment/doctor surface; reverse index resolves plurals back to
  the canonical singular so "first and second appointments" expands
  via "appointment".
- retrieval-format.ts: builds the endpoint block before tier
  assignment so its tokens are subtracted from the assignment budget
  and counted in estimatedContextTokens. Otherwise the appended block
  would silently exceed the caller's budget and underreport packaged
  tokens.
- query-term-visibility.ts: extracted from retrieval-format. Same
  upgrade-tier-when-query-terms-hidden logic, just split out so
  retrieval-format stays under the file-size guideline and the helper
  is independently testable.
- temporal-format.ts: shared formatDateLabel + formatDuration so
  retrieval-format and temporal-endpoint-evidence don't each carry
  their own copy.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
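The budgeting order described in the commit above can be sketched as below. This is a simplified stand-in: the four-characters-per-token estimate, the greedy inclusion loop (in place of real tier assignment), the choice to drop a block that does not fit, and the name `packWithEndpointBlock` are all assumptions for the example.

```typescript
// Crude placeholder estimate; the real tokenizer is not shown in this PR.
function estimateTokens(text: string): number {
  return Math.ceil(text.length / 4);
}

interface PackedContext {
  included: string[];
  endpointBlock: string | null;
  estimatedContextTokens: number;
}

// Reserve the endpoint block's tokens first, then fit memories into what is
// left, so the reported total always includes the block.
function packWithEndpointBlock(
  memories: string[],
  endpointBlock: string | null,
  budget: number,
): PackedContext {
  const blockTokens = endpointBlock === null ? 0 : estimateTokens(endpointBlock);
  // Assumption: a block that cannot fit at all is dropped.
  const block = blockTokens <= budget ? endpointBlock : null;
  let used = block === null ? 0 : blockTokens;
  const included: string[] = [];
  for (const memory of memories) {
    const cost = estimateTokens(memory);
    if (used + cost > budget) continue; // skip anything that would overflow
    included.push(memory);
    used += cost;
  }
  return { included, endpointBlock: block, estimatedContextTokens: used };
}
```

If the block were appended after the memories had already consumed the budget, the packaged context would overflow while `estimatedContextTokens` still reported the smaller pre-block figure.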
PR #46's paydown landed StoreMemoryInput centralization, response-schema
namespace imports, and the postJson test helper into main while this
branch was open. Rebasing onto the new main shifted line numbers in
the accepted-debt entries; regenerate the baselines so fallow audit
compares against the post-paydown layout.

The original baseline-refresh commit from this branch was dropped during
rebase since #46's baselines superseded it.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…ndidates

Previously buildRepeatedEventEndpointBlock flattened every query event
term and its synonyms into one list, then accepted any memory with at
least one match. For "first and second doctor appointment", a memory
mentioning only "doctor" and another mentioning only "appointment"
would falsely become the two endpoints — neither proves the combined
event happened.

Group synonyms by canonical concept (doctor synonyms vs appointment
synonyms), and require a candidate to hit at least one synonym in EVERY
group. Adds a regression test covering the partial-match false-positive
case codex flagged.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…t, refactor to policy chain

Two related changes:

1. CLARIFY decisions with an explicit replacement signal ("replacing X",
   "no longer Y", "instead of Z", "correction: ...") were being promoted
   to ADD. promoteToAdd clears targetMemoryId, leaving the stale memory
   active alongside the new one — which silently fails the user's
   explicit replacement intent for current-state facts like "replacing
   Alice Morgan with Bob Chen". When AUDN identified a target, upgrade
   to SUPERSEDE so the stale memory is expired. When no target was
   identified, keep CLARIFY rather than fall through to ADD; the user
   asked for a replacement we can't pin down, so defer to them.

2. applyClarificationOverrides was at 14 cyclomatic / 19 cognitive,
   above fallow's moderate threshold and growing every time a new
   policy was added. Refactor into a POLICIES list of small, named
   transformers; the dispatcher becomes a 6-line loop. Each policy
   returns null to defer or a transformed AUDNDecision to commit.

Adds two regression tests: CLARIFY+target+replacement → SUPERSEDE,
and CLARIFY+no-target+replacement → keep CLARIFY.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…aseline

The applyClarificationOverrides refactor in the previous commit cleared
the moderate-complexity finding for conflict-policy.ts (the dispatcher
went from 14 cyclomatic to ~3, with each policy now a separately-named
small function). Re-save the baseline to remove the stale entry — per
the workspace rule, fix fallow findings rather than baseline them.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…seding on CLARIFY

The previous fix upgraded CLARIFY+explicit-replacement to SUPERSEDE
whenever decision.targetMemoryId was set, but didn't verify the target
was actually in the candidate set. If AUDN returned a stale or invalid
target ID, memory-audn rejects the SUPERSEDE (target not found) and
falls back to canonical storage, which silently leaves the old memory
active — the same bug the SUPERSEDE upgrade was meant to fix.

Check candidates.some(c => c.id === targetMemoryId) before superseding;
keep CLARIFY otherwise.

Adds a regression test using a target ID that's not in the candidate set.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…seline ratchet

CI's shrink-only baseline ratchet flagged a +1 dupes entry on this branch.
The cause: PR #46 unified StoreMemoryInput in repository-types and
re-exported it from stores.ts and repository-write.ts, but missed
memory-repository.ts which still carried its own local copy. With this
branch's observedAt addition, the local copy fell back into clone-group
overlap with the centralized type.

Drop the local StoreMemoryInput from memory-repository.ts and import
the centralized one from repository-types instead. Regenerate baselines
to reflect the current line numbers.

Net: dupes baseline 30 -> 29 entries (matches origin/main); ratchet
script reports "dupes: 29 -> 29 (unchanged)".

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@ethanj ethanj merged commit eb43e1b into main Apr 25, 2026
1 check passed
@ethanj ethanj deleted the benchmark/temporal-repeated-event-emphasis branch April 25, 2026 06:38