Skip to content

feat: LoCoMo accuracy improvements — auto-read + multi-word OR#67

Merged
kaghni merged 4 commits intomainfrom
feat/locomo-accuracy-v3
Apr 21, 2026
Merged

feat: LoCoMo accuracy improvements — auto-read + multi-word OR#67
kaghni merged 4 commits intomainfrom
feat/locomo-accuracy-v3

Conversation

@kaghni
Copy link
Copy Markdown
Collaborator

@kaghni kaghni commented Apr 21, 2026

Summary

Two focused improvements to memory retrieval accuracy, benchmarked against the LoCoMo conversational memory benchmark on a balanced 60-question subset (15 per category × 4 non-adversarial categories, drawn across all 10 conversations).

Benchmark results

Model: claude -p --model haiku. Judge: openai/gpt-5-mini. See deeplake-blitz PR #1 for methodology and raw results.

Config Overall single-hop temporal multi-hop open-domain
origin/main (this branch's base) 46.7%
+ auto-read (64f2246) 58.3%
+ multi-word OR (04bf1d5, this PR) 59.2%

Net: +12.5% absolute over origin/main on a category-balanced 60-question sample.

All 918 existing tests pass.

What's in the PR

1. feat(pre-tool-use): auto-rewrite unsupported interpreter reads to cat (64f2246)

When the agent invokes an unsupported interpreter (python/python3/node/deno/bun/ruby/perl) on a single memory path with no shell metacharacters, rewrite the call to cat '<path>' instead of returning the generic [RETRY REQUIRED] guidance. Avoids burning a turn when the intent is clearly a plain file read.

Narrow safety conditions:

  • Command starts with a recognized read-like interpreter.
  • No shell metacharacters ($, `, ;, |, &, <, >, (), \).
  • Memory path matches conservative [\w./_-]+ regex (no trailing quotes/parens can glue onto it).
  • Final cat command wraps the path in single quotes with '\''-escape.

Composite commands (curl, wget, chains, pipes, substitutions) still fall through to the RETRY guidance.

2. feat(grep-core): multi-word OR split for non-regex patterns (04bf1d5)

When a non-regex grep pattern contains multiple words, split into per-word OR filters instead of requiring the full literal phrase to appear verbatim. Natural-language queries like Melanie kids dinosaurs now match sessions that mention any of the three tokens, not only those that contain all three in that exact order.

Single-word and regex patterns are unchanged.

Adds an optional multiWordPatterns field on SearchOptions. buildGrepSearchOptions populates it when the pattern splits into >1 word-long tokens. searchDeeplakeTables consumes it in the non-contentScanOnly branch, falling back to escapedPattern. All downstream OR joining is handled by the existing buildContentFilter helper.

What was considered and dropped

During development, a third change — prepending extracted date_time to the first grep match per file — was committed and tested. It regressed from 59.2% → 50.8%, so it was removed from this PR. Session date extraction may still be a valid idea in a different form (e.g. as a separate structured field instead of inline prefix), but the naive prefix hurts more than it helps on this benchmark.

A previous branch (closed PR #55) had 9 commits based on the pre-refactor codebase. Only these two improvements survived the port to main's new grep-core.ts/pre-tool-use.ts architecture and actually improved accuracy under proper benchmarking conditions. The rest were either obsolete (already done by grep-core.ts UNION ALL, PR #64's index fix) or didn't carry over cleanly.

Methodology note

Early benchmark runs (on PR #55) had two bugs that inflated apparent gains:

  1. Benchmark repo checked out to wrong branch, so --subset flag was silently ignored and benchmark ran all 1540 questions with many timeouts → 0-5% false "results".
  2. Benchmark script set DEEPLAKE_CAPTURE=false but main's new code reads HIVEMIND_CAPTURE, so session writes leaked into locomo_sessions mid-run, polluting search results.

Both fixes live in the deeplake-blitz PR. All numbers above are post-fix.

Test plan

  • npm test passes (918/918 tests, including 32 pre-tool-use.test.ts tests that specifically cover the RETRY-vs-cat rewrite boundary for curl, command substitution, backticks, interpreter commands, etc.)
  • npm run build produces matching bundles
  • Balanced 60q LoCoMo benchmark: +12.5% over main

kaghni added 2 commits April 21, 2026 18:35
When the agent invokes an unsupported interpreter (python/python3/node/
deno/bun/ruby/perl) on a single memory path with no shell metacharacters,
rewrite the call to cat <path> instead of returning the generic
RETRY guidance. This avoids burning a turn when the intent is clearly a
plain file read.

Narrow safety conditions:
  - Command starts with a recognized read-like interpreter.
  - No shell metacharacters (dollar-paren, backtick, semicolon, pipe,
    ampersand, angle brackets, parens, backslash).
  - Memory path matches a conservative [word/./_/-] regex (no trailing
    quotes or parens can glue onto the path).
  - Final cat command wraps the path in single quotes and escapes embedded
    single quotes, preventing injection.

Composite or non-interpreter commands (curl, wget, etc.) still get the
RETRY guidance.

Tests: 918/918 passing.
When a non-regex grep pattern contains multiple words, split into
per-word OR filters instead of requiring the full literal phrase to
appear verbatim in the summary. Natural-language queries like
Melanie kids dinosaurs now match sessions that mention any of the
three tokens, not only those that contain all three in that exact order.

Single-word and regex patterns are unchanged.

Implementation: buildGrepSearchOptions now populates a new optional
multiWordPatterns field on SearchOptions when the pattern splits into
>1 word-long tokens. searchDeeplakeTables consumes that field in the
non-contentScanOnly branch, falling back to the single escapedPattern
otherwise. All downstream OR joining is handled by the existing
buildContentFilter helper.

Tests: 918/918 passing.
@claude
Copy link
Copy Markdown

claude Bot commented Apr 21, 2026

Claude finished @kaghni's task in 4m 3s —— View job


Left 2 inline comments:

  • src/hooks/pre-tool-use.ts:238 — path regex misses absolute-path invocations (/home/…/.deeplake/memory/…) that touchesMemory() would catch, silently falling back to RETRY guidance even though the rewrite would be safe.
  • src/hooks/pre-tool-use.ts:244 — no positive-path test: the existing tests only confirm that commands with shell metacharacters still return RETRY guidance; there's no assertion that a clean python3 ~/.deeplake/memory/data.json actually produces cat '/data.json'.

@github-actions
Copy link
Copy Markdown
Contributor

github-actions Bot commented Apr 21, 2026

Coverage Report

Scope: files changed in this PR. Enforced threshold: 90% per metric (per file via vitest.config.ts).

Status Category Percentage Covered / Total
🟢 Lines 99.50% (🎯 90%) 399 / 401
🟢 Statements 97.75% (🎯 90%) 478 / 489
🟢 Functions 96.15% (🎯 90%) 50 / 52
🟢 Branches 90.73% (🎯 90%) 470 / 518
File Coverage — 2 files changed
File Stmts Branches Functions Lines
src/hooks/pre-tool-use.ts 🟢 98.7% 🟢 90.4% 🟢 93.8% 🟢 99.0%
src/shell/grep-core.ts 🟢 97.0% 🟢 91.0% 🟢 97.2% 🟢 100.0%

Generated for commit f6852f7.

Comment thread src/hooks/pre-tool-use.ts Outdated
const hasShellMeta = /[$`;|&<>()\\]/.test(cmd);
if (isReadLike && !hasShellMeta) {
const pathMatch = cmd.match(/~\/\.deeplake\/memory\/[\w./_-]+/)
|| toolPath.match(/~\/\.deeplake\/memory\/[\w./_-]+/);
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The path regex only matches ~/ prefix, so python3 /home/runner/.deeplake/memory/data.json (an absolute path) passes touchesMemory() and isReadLike, but silently falls through to RETRY guidance because neither cmd.match(...) nor toolPath.match(...) fires. Extend the regex to also match the MEMORY_PATH absolute prefix, or use rewritePaths on the whole cmd/toolPath and inspect the result.

Suggested change
|| toolPath.match(/~\/\.deeplake\/memory\/[\w./_-]+/);
const pathMatch = cmd.match(/(?:~\/\.deeplake\/memory|\/[^\s]*\.deeplake\/memory)\/[\w./_-]+/)
|| toolPath.match(/(?:~\/\.deeplake\/memory|\/[^\s]*\.deeplake\/memory)\/[\w./_-]+/);

Comment thread src/hooks/pre-tool-use.ts
if (cleanPath && !cleanPath.endsWith("/")) {
logFn(`unsupported command on file, converting to cat: ${cleanPath}`);
return buildAllowDecision(
`cat '${cleanPath.replace(/'/g, "'\\''")}'`,
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

No test exercises the new happy path — the existing tests only verify that commands with shell metacharacters still produce RETRY guidance (the negative path). Add a test that python3 ~/.deeplake/memory/data.jsonupdatedCommand contains cat '/data.json' to prevent silent regressions in the rewrite logic.

kaghni added 2 commits April 21, 2026 21:10
Addresses two PR review comments on the auto-read conversion:

1. Path regex only matched tilde-prefixed memory paths. When haiku
   generated an absolute path (e.g. an expanded home path pointing into
   the memory tree), touchesMemory() returned true and isReadLike
   passed, but pathMatch never fired, so cleanPath was empty and
   auto-read silently fell through to the RETRY guidance — dead code
   for the most common case.

   Fix: run rewritePaths() on both cmd and toolPath first. rewritePaths
   already normalizes all three variants (tilde, dollar-HOME, absolute
   home dir). Then extract the normalized /-leading path from the result.

2. No happy-path test for the rewrite itself. The existing 32
   pre-tool-use tests all covered the negative path — commands with
   shell metacharacters still returning RETRY. Nothing asserted that
   python3 on a clean memory file actually becomes cat. The regex bug
   above could have silently regressed with no test failure.

   Added:
   - 9 pre-tool-use.test.ts cases covering python3/node/bun/deno/ruby
     happy path, absolute-path variant, trailing-slash rejection,
     shell-meta fall-through, dollar-HOME variable fall-through (must
     NOT rewrite because dollar is a shell metacharacter), single-quote
     escape check.
   - 5 grep-core.test.ts cases covering buildGrepSearchOptions
     multi-word splitting: basic, single-word, short-token filter,
     regex fallback, 4-word cap.

Tests: 934 passing total (up from 918). Branch coverage for grep-core
89.88% and pre-tool-use 87.25% — improved but still below the 90%
thresholds enforced by the coverage CI job.
Coverage CI was failing at 87.25% (pre-tool-use) and 89.88% (grep-core),
both below the 90% threshold. The subprocess-based e2e tests in
pre-tool-use.test.ts exercise the rewrite-to-cat path but vitest c8
does not instrument subprocesses, so those branches looked uncovered.

Added two targeted test batches that call the exported functions directly:

- pre-tool-use-branches.test.ts: 7 new cases for processPreToolUse
  covering the auto-read happy path (python3/node/ruby on tilde and
  absolute memory paths, description assertion) and its guards
  (trailing-slash rejection, shell-meta fall-through, non-interpreter
  command fall-through).
- grep-core.test.ts: 6 new cases for extractRegexAlternationPrefilters
  covering the null-return branches (char class, anchor, parens, quantifier,
  empty branch, leading/trailing pipe, trailing escape) plus dedupe and
  no-pipe fast-return.

Results: 947 tests (up from 934). Branch coverage
pre-tool-use 87.25% -> 90.43%, grep-core 89.88% -> 91.01%.
@efenocchi
Copy link
Copy Markdown
Collaborator

LGTM

@kaghni kaghni merged commit cada222 into main Apr 21, 2026
3 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants