Skip to content

KS67: Schema-driven fact extraction + Greptile AI review#1

Merged
Liorrr merged 10 commits into
masterfrom
feat/schema-extraction
Apr 5, 2026
Merged

KS67: Schema-driven fact extraction + Greptile AI review#1
Liorrr merged 10 commits into
masterfrom
feat/schema-extraction

Conversation

@Liorrr
Copy link
Copy Markdown
Contributor

@Liorrr Liorrr commented Apr 5, 2026

Summary

  • Greptile AI code review: Added greptile.json config — strictness 2 for core crates, custom rules (no unwrap outside tests, no lock-across-await, semver enforcement), directory-level review scoping across all 11 crates
  • KS67 Phase 1 — schema-driven fact extraction: Open-domain extraction pipeline replacing the hardcoded 5-fact cap and "The user [verb]" whitelist. New ExtractedFact/FactType types, dynamic max_facts based on content length, v2 LLM prompt with structured JSON output, near-duplicate child dedup (cosine >0.95), embedding-based supersession detection
  • Supersession fix: Parent-content supersession (Pass 3) detects contradictions against parent memory text directly, not just enrichment children. Expanded relationship regex (working at, started at, living in, etc.), lowered embedding threshold to 0.70, enabled supersedes_demotion default (was 0.0)
  • CI fix: Exclude platform-specific crates (shrimpk-tray, shrimpk-ros2) from CI matrix, resolve all clippy/fmt/doc errors

Micro-benchmark recall: 55% baseline → 80% (target was 75%)

Files changed

Area Files What
Types traits.rs, lib.rs, config.rs FactType enum, ExtractedFact struct, supersedes_demotion default
Extraction consolidator.rs v2 open-domain prompt, structured JSON parser, token budget 512→768
Consolidation consolidation.rs dynamic_max_facts, parent-content supersession, regex expansion, dedup guard, subject fallback
Echo pipeline echo.rs build_subject_map, enforce_subject_diversity cap
CI/CD ci.yml, hebbian.rs Platform exclusions, clippy fixes
Greptile greptile.json AI review config with per-crate rules
Tests echo_micro_benchmark.rs + 9 test files 20-memory micro-benchmark, updated test configs

Test plan

  • cargo test --workspace — 333+ tests pass (shrimpk-memory), 73 (shrimpk-core)
  • cargo test --release --test echo_micro_benchmark — 16/20 (80%) with Ollama consolidation
  • Supersession unit tests: parent-content, parent-location, v2 verb forms
  • CI pipeline passes (clippy, fmt, test on ubuntu)

🤖 Generated with Claude Code

Liorrr added 4 commits April 5, 2026 09:06
  - Add ExtractedFact struct + FactType enum (7 variants) to shrimpk-core
  - Replace hardcoded 5-fact cap with dynamic_max_facts (1-12 based on content length)
  - v2 open-domain extraction prompt: removes verb whitelist, supports multi-entity facts
  - Structured JSON parser: text/subject/type/confidence per fact
  - Consolidation migrated to extract_facts_and_labels with structured_facts
  - Near-duplicate child dedup (cosine > 0.95 skip)
  - Embedding-based supersession detection (0.80 supersede, 0.95 identity-skip)
  - Subject diversity cap in echo results (max 3 per subject entity)

  Target: 55% → 75% recall (benchmark validation pending)
  - Strictness 2 globally, 3 for stubs/tests
  - Logic + syntax only (style deferred to clippy)
  - Per-crate directoryRules: core (semver), memory (math), daemon (async), mcp, python (PyO3)
  - Custom rules: no unwrap outside tests, no lock-across-await, serde consistency
  - Ignores target/, lockfiles, dist/, profiling artifacts
  - Add parent-content supersession (Pass 3): detect contradictions
    against parent memory text, not just enrichment children. Fixes
    cases where LLM extracts 0 matching child facts but parent content
    contains the relationship.
  - Set supersedes_demotion default to 0.15 (was 0.0 — supersession
    edges had zero scoring effect)
  - Expand relationship regex: working at, started at, living in,
    moving to, using, switching to
  - Restrict regex supersession to single-valued relations (WorksAt,
    LivesIn) — prevents false positives from PrefersTool
  - Lower embedding supersession threshold from 0.80 to 0.70
  - Add verb pattern hints to v2 extraction prompt
  - Micro-benchmark: 14/20 (70%) → 16/20 (80%), target was 75%
… errors

  - ci.yml: exclude shrimpk-tray and shrimpk-viz from Linux/macOS builds
    (glib-sys dependency only available on Windows)
  - Fix ~60 clippy errors masked behind glib-sys build failure (5 sprints)
    - Collapsible if → let chains, clamp, is_none_or, div_ceil,
      from_ref, char arrays, dead code annotations, type complexity allows
  - Fix rustfmt diffs in persistence.rs and store.rs
  - Fix 2 rustdoc errors: broken intra-doc link, unescaped generics
@greptile-apps
Copy link
Copy Markdown

greptile-apps Bot commented Apr 5, 2026

Greptile Summary

This PR delivers KS67's schema-driven fact extraction pipeline on top of a supersession correctness overhaul and a CI fix for platform-specific crates.

Key changes:

  • New FactType / ExtractedFact types in shrimpk-core::traits replace the flat Vec<String> fact representation. The v2 Ollama prompt emits structured JSON objects; the parser handles both v2 objects and legacy strings in the same array with graceful fallback.
  • dynamic_max_facts scales the LLM fact budget by content word-count (1–12) instead of the fixed cap of 5; the operator default max_facts_per_memory is raised from 5→12.
  • Three-pass supersession detection: Pass 1 (regex, unchanged), Pass 2 (new — embedding cosine ≥ 0.70 + subject overlap), Pass 3 (new — parent-content scan for memories with zero extracted children). Passes 1 and 3 correctly restrict to works_at/lives_in; Pass 2 does not, which can generate false-positive supersession for multi-valued facts (see inline P1 comment).
  • supersedes_demotion default changed 0.0 → 0.15, activating the demotion penalty globally for all existing deployments on upgrade.
  • ConsolidationOutput gains a new public field structured_facts: Vec<ExtractedFact>. Without #[non_exhaustive], this is a semver-breaking change for external Consolidator implementors that construct the struct by name.
  • Subject diversity cap (enforce_subject_diversity, hardcoded at 3 per subject) and build_subject_map added to the echo pipeline to prevent any single entity from monopolising results.
  • CI fixed by excluding shrimpk-tray/shrimpk-viz from clippy, test, and doc steps across all OS targets.
  • The two bugs flagged in earlier review threads — the Supersedes self-loop and Pass 3's missing subjects_overlap guard — are both correctly addressed in this PR.

Confidence Score: 3/5

Mostly ready; Pass 2 embedding supersession can silently corrupt rankings for multi-valued facts now that supersedes_demotion is active by default — needs the same category guard as Passes 1 and 3 before merge.

The two previously-flagged P1 bugs (Supersedes self-loop, Pass 3 missing subjects_overlap) are both correctly resolved. A new P1 logic gap remains: Pass 2 fires on any semantically-similar enrichment pair with subject overlap and no category restriction, meaning normal user activity facts (two concurrent hobbies, two concurrent preferences) will be wrongly demoted. With supersedes_demotion now defaulting to 0.15 instead of 0.0, each false-positive supersession causes a silent ranking penalty that degrades recall for valid memories in production. The fix is a two-line guard identical to what Passes 1 and 3 already apply, making this a targeted, low-risk change.

crates/shrimpk-memory/src/consolidation.rs (Pass 2 category guard, stale doc-comment); crates/shrimpk-core/src/traits.rs (ConsolidationOutput #[non_exhaustive])

Important Files Changed

Filename Overview
crates/shrimpk-memory/src/consolidation.rs KS67 supersession overhaul: Passes 1 & 3 correctly restrict to single-valued relationships, but Pass 2 (embedding path, line 874) is missing that guard and can falsely demote multi-valued enrichments; stale doc-comment cites 0.80 but code uses 0.70
crates/shrimpk-core/src/traits.rs New FactType enum and ExtractedFact struct added correctly; ConsolidationOutput gains structured_facts field — additive inside the workspace but breaks exhaustive struct construction for external Consolidator implementors without #[non_exhaustive]
crates/shrimpk-memory/src/consolidator.rs v2 open-domain enrichment prompt removes The-user prefix and 13-verb whitelist; structured JSON parser handles both v1 string and v2 object fact formats with graceful fallback to plain-text; token budget bumped 512→768
crates/shrimpk-core/src/config.rs default_max_facts_per_memory raised 5→12 (dynamic_max_facts will cap short memories); supersedes_demotion default changed 0.0→0.15 — activates demotion penalty for all existing deployments on upgrade
crates/shrimpk-memory/src/echo.rs Adds build_subject_map + enforce_subject_diversity (hardcoded cap of 3 per subject) to prevent identity gravity well in results; remaining changes are formatting and clippy fixes
.github/workflows/ci.yml Platform-specific crates (shrimpk-tray, shrimpk-viz) correctly excluded from clippy, test, and doc steps; Windows intentionally keeps empty exclude string to include all crates
crates/shrimpk-memory/src/hebbian.rs Formatting and clippy fixes only — no logic changes
crates/shrimpk-memory/src/store.rs Formatting and clippy fixes only — no logic changes

Flowchart

%%{init: {'theme': 'neutral'}}%%
flowchart TD
    A[consolidate loop — idx] --> B[extract_facts_and_labels\nv2 structured prompt]
    B --> C{structured_facts\nnon-empty?}
    C -- yes --> D[fact_entries = structured_facts]
    C -- no --> E[fact_entries = legacy strings]
    D --> F[embed each fact]
    E --> F
    F --> G{near-dup cosine >0.95\nvs existing children?}
    G -- yes --> H[skip child creation]
    G -- no --> I[add enrichment child]
    I --> J[detect_relationship\n→ Hebbian edge]
    F --> K[fact_embeddings collected]
    K --> L[detect_supersedes_pairs]
    L --> M[Pass 1: regex\nworks_at/lives_in only ✓]
    L --> N[Pass 2: embedding cosine >0.70\n+ subject overlap\nNO category guard ⚠️]
    L --> O[Pass 3: parent-content\nworks_at/lives_in only ✓]
    M --> P[Supersedes Hebbian edge\ndemotion=0.15 default]
    N --> P
    O --> P
Loading

Reviews (7): Last reviewed commit: "Fix fmt + clippy in consolidation.rs (si..." | Re-trigger Greptile

Comment thread crates/shrimpk-memory/src/consolidation.rs Outdated
Liorrr added 4 commits April 6, 2026 00:10
- persistence.rs:334: merge parent/dir if-let chain (KS67 code)
  - echo_multimodal_bench.rs:595: triple-nested if → let chain
  - Only visible on Linux CI (cfg-gated block, invisible on Windows)
  - Add subjects_overlap check to Pass 3 parent-content supersession                                                     to prevent cross-persona false positives (@greptile P1 review)                                                     - Cap dynamic_max_facts with config.max_facts_per_memory so operators
    can override the dynamic scaling
  - Bump max_facts_per_memory default from 5 to 12 (matches dynamic ceiling)
@Liorrr Liorrr marked this pull request as draft April 5, 2026 21:47
@Liorrr Liorrr marked this pull request as ready for review April 5, 2026 21:47
Comment thread crates/shrimpk-memory/src/consolidation.rs
  - Prevent Supersedes(idx, idx) self-edge when old child's parent_id
    resolves to current parent being enriched — with demotion=0.15
    this silently demoted the memory in every ranking pass
  - Skip same-parent siblings in detect_supersedes_pairs Pass 1 so
    temporal facts from a single extraction ("worked at Google, now
    works at Meta") don't trigger mutual supersession
@Liorrr Liorrr marked this pull request as draft April 5, 2026 22:36
@Liorrr Liorrr marked this pull request as ready for review April 5, 2026 22:37
  - Collapse nested if-let for parent_id check
  - Fix line wrapping on current_parent_id binding
@Liorrr Liorrr merged commit 9b8e20b into master Apr 5, 2026
6 checks passed
Comment on lines +834 to +880
// Pass 2: Embedding-based supersession (KS67)
// For facts that have pre-computed embeddings, check cosine similarity
// against existing enrichment children. This catches semantic supersession
// that regex patterns miss (e.g., paraphrased contradictions).
for (fact_idx, fact) in new_facts.iter().enumerate() {
let new_emb = match new_embeddings.get(fact_idx) {
Some(emb) if !emb.is_empty() => emb,
_ => continue,
};

for i in 0..store.len() {
// Skip if already matched by regex pass
if matched_old_indices.contains(&i) {
continue;
}

let entry = match store.entry_at(i) {
Some(e) => e,
None => continue,
};

// Only compare against enrichment-sourced entries (child facts)
if entry.source != "enrichment" {
continue;
}

let existing_emb = match store.embedding_at(i) {
Some(emb) if !emb.is_empty() => emb,
_ => continue,
};

let cosine = crate::similarity::cosine_similarity(new_emb, existing_emb);

// >0.95: near-identity repeat — skip (not a supersession, just a duplicate)
if cosine > 0.95 {
continue;
}

// >0.70 + subject overlap: semantic supersession (lowered from 0.80
// to catch v2 open-domain facts with different verb forms)
if cosine > 0.70 && subjects_overlap(fact, &entry.content) {
matched_old_indices.insert(i);
pairs.push((i, fact.clone()));
break; // One supersession per fact is enough
}
}
}
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P1 Pass 2 embedding supersession missing single-valued-relationship guard

Pass 1 (line 789) and Pass 3 (line 929) both restrict supersession to works_at / lives_in — the only categories where a person can hold exactly one current value. Pass 2 applies no such restriction:

// Line 874 — no category check before firing a supersession
if cosine > 0.70 && subjects_overlap(fact, &entry.content) {
    matched_old_indices.insert(i);
    pairs.push((i, fact.clone()));
    break;
}

Concrete false-positive with supersedes_demotion = 0.15 now enabled by default:

  • Enrichment child A: "Alice enjoys hiking on weekends"
  • New fact B: "Alice enjoys rock-climbing on weekends"

Both are about Alice, both likely cosine ≥ 0.70 (shared activity frame), subjects_overlap fires → child A is incorrectly demoted. Activities and preferences are multi-valued: Alice can do both.

Add the same guard already used by Passes 1 and 3:

// Pass 2: Embedding-based supersession (KS67)
for (fact_idx, fact) in new_facts.iter().enumerate() {
+   // Limit to single-valued categories (same rationale as Pass 1 and Pass 3)
+   let new_rel_opt = detect_relationship(fact);
+   let category = new_rel_opt.as_ref().map(|r| categorize_relationship(r).0);
+   if !matches!(category.as_deref(), Some("works_at") | Some("lives_in")) {
+       continue;
+   }
    let new_emb = match new_embeddings.get(fact_idx) {
        Some(emb) if !emb.is_empty() => emb,
        _ => continue,
    };
    // ... rest of loop unchanged

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant