feat(6.24): Clio FTS title boosting (Cerefox parity) #32
Merged
fstamatelopoulos merged 4 commits into main on May 3, 2026
Cerefox indexes a chunks-FTS column with `setweight(to_tsvector(doc_title),
'A') || setweight(to_tsvector(chunk_title), 'A') || setweight(...content,
'B')` so query terms in titles/headings outrank body-only matches by
ts_rank_cd's default A:B ratio of 1.0:0.4 (= 2.5×). cfcf-Clio's existing
2-column FTS5 (chunk_title + content) lacked the doc_title column entirely
and called `bm25(clio_chunks_fts)` with no per-column weights.
Migration 0002_fts_title_boost.sql:
- Drops the 2-column clio_chunks_fts + its triggers from migration 0001
- Recreates as a 3-column FTS5 (doc_title, chunk_title, content)
- Backfills via INSERT...SELECT joining clio_chunks → clio_documents
  so existing corpora pick up the title column without any user action
- Reinstalls the per-chunk INSERT/UPDATE/DELETE triggers, now joining
  clio_documents so doc_title is populated from the live row
- Adds a BEFORE DELETE trigger on clio_documents that pre-clears FTS
  for all current chunks before the cascade-delete on clio_chunks
  fires (otherwise the chunk-AD trigger can't reconstruct doc_title
  via JOIN once the doc is gone)
- Adds an AFTER UPDATE OF title trigger on clio_documents that
  refreshes FTS doc_title for all current chunks of the renamed doc
  (mirrors Cerefox's cerefox_update_chunk_fts RPC) -- this is the
  `cfcf clio docs edit --title` path
LocalClio.searchFts (and the chunk-level branch in searchHybrid):
bm25(clio_chunks_fts) → bm25(clio_chunks_fts, 4.0, 4.0, 1.0)
with the per-column constants extracted to FTS_BM25_WEIGHT_DOC_TITLE /
CHUNK_TITLE / CONTENT for one-spot tuning. 4× is slightly stronger than
Cerefox's effective 2.5× -- gives a clearer signal in the dense corpora
the agent roles ingest; tunable in one place if the ratio needs adjusting.
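As a rough illustration of the `bm25(...)` change described above, the ranking expression could be assembled from the extracted constants along these lines. The three constant names and the 4.0 / 4.0 / 1.0 values come from this commit message; `bm25Expr` and the surrounding query string are hypothetical, and the actual code in local-clio.ts may be shaped differently.

```typescript
// Per-column weights in FTS5 column order: doc_title, chunk_title, content.
// Names and values as stated in this commit; everything else is illustrative.
const FTS_BM25_WEIGHT_DOC_TITLE = 4.0;
const FTS_BM25_WEIGHT_CHUNK_TITLE = 4.0;
const FTS_BM25_WEIGHT_CONTENT = 1.0;

// Hypothetical helper: renders a bm25() call with one weight per column.
function bm25Expr(table: string, weights: readonly number[]): string {
  // The table name is interpolated into SQL, so accept plain identifiers only.
  if (!/^[A-Za-z_][A-Za-z0-9_]*$/.test(table)) {
    throw new Error(`unsafe table identifier: ${table}`);
  }
  if (weights.some((w) => !Number.isFinite(w) || w < 0)) {
    throw new Error("weights must be finite and non-negative");
  }
  // Keep a trailing ".0" on whole numbers so the SQL reads like the commit text.
  const args = weights.map((w) => (Number.isInteger(w) ? w.toFixed(1) : String(w)));
  return `bm25(${table}, ${args.join(", ")})`;
}

const rankExpr = bm25Expr("clio_chunks_fts", [
  FTS_BM25_WEIGHT_DOC_TITLE,
  FTS_BM25_WEIGHT_CHUNK_TITLE,
  FTS_BM25_WEIGHT_CONTENT,
]);

// FTS5's bm25() scores better matches with numerically lower values,
// so an ascending ORDER BY puts the best hit first.
const searchSql = [
  "SELECT rowid FROM clio_chunks_fts",
  "WHERE clio_chunks_fts MATCH ?",
  `ORDER BY ${rankExpr}`,
  "LIMIT ?",
].join(" ");
```

With this shape, retuning the title boost would mean touching only the three constants.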
6 new unit tests (90 total, was 84):
- title-only matches outrank body-only matches
- chunk-heading matches outrank body-only matches
- renaming a doc refreshes FTS doc_title for all current chunks
- rename trigger doesn't blow away unrelated docs
- soft-delete still removes from search (existing JOIN-level filter;
  belt-and-suspenders check that the migration didn't break it)
- end-to-end backfill verification (search by doc_title only term)
Drive-by: stale assertion in update.test.ts that asserted on the
removed-for-security `releaseNotesUrl` field is now updated to assert
its absence.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Surfaced while running the full server test suite during 6.24 work: the
v0.18.0 review pass (commit 'fix(web): swap Clio Project ...') removed
releaseNotesUrl from the flag file shape for security reasons (the file
lives in ~/.cfcf/, user-writable; an attacker-controlled link rendered as
<a target="_blank"> would be a phishing surface). The corresponding test
still asserted on the field's presence and was quietly failing.
Replace the contains-URL check with a toBeUndefined check that locks in
the security property: the response must not carry any clickable URL.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
User feedback: Clio is the local-memory migration of Cerefox; we want
weight ratios to match unless there's a specific reason to deviate.
Cerefox's effective A:B ratio is 2.5× (Postgres `ts_rank_cd` defaults
{D:0.1, C:0.2, B:0.4, A:1.0} → A/B = 1.0/0.4). Drop our SQLite FTS5
bm25() title weights from 4.0 → 2.5 to match.
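The ratio arithmetic, spelled out as a small sketch. The weight table is the Postgres default quoted above; the variable names are made up for illustration and are not taken from the codebase.

```typescript
// Postgres ts_rank_cd default weights per label, as quoted above.
const TS_RANK_DEFAULT_WEIGHTS = { D: 0.1, C: 0.2, B: 0.4, A: 1.0 } as const;

// Cerefox puts titles under label A and body content under label B.
const titleToBodyRatio = TS_RANK_DEFAULT_WEIGHTS.A / TS_RANK_DEFAULT_WEIGHTS.B;

// Hypothetical: the matching FTS5 bm25() weights in column order
// (doc_title, chunk_title, content), with content as the 1.0 baseline.
const clioBm25Weights = [titleToBodyRatio, titleToBodyRatio, 1.0];

console.log(titleToBodyRatio.toFixed(1)); // prints 2.5
```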
The underlying ranker (BM25 vs ts_rank_cd) is necessarily different
across the two stacks -- that's a "we picked SQLite" consequence, not
a tuning choice -- but the title-boost ratio is the user-visible
knob that decides "how much do title matches outrank body matches?".
With both at 2.5× the ordering of top hits should be very close
across the two systems for any given query.
Migration header comment updated to reflect parity. Tests still pass
unchanged (the title-only-outranks-body-only assertion holds at any
ratio ≥ ~1.5×, so the lower number doesn't weaken the test).
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Plan housekeeping that ships with this PR rather than as a separate
post-merge patch (the work itself isn't on main yet):
- Row 6.24 ❌ → ✅ with the implemented design captured (3-column FTS
  with doc_title + chunk_title + content, 5 triggers covering chunk + doc
  lifecycle, bm25(2.5, 2.5, 1.0) per-column weights matching Cerefox's
  effective A:B = 2.5× ratio).
- Iter-6 active-set + headline summary lines updated: 6.24 marked as
  shipped-on-branch awaiting v0.19.0; remaining set is now 6.9, 6.11,
  6.13, 6.18 (+ 6.19 partial).
- New F.19 in the Backlog: Clio FTS search optimisations (stop-word
  handling + Google-style query operators). Captures both follow-ups
  surfaced during the 6.24 Cerefox-parity review under one entry, with
  mitigation options ranked by effort. Trigger: agent searches surface
  irrelevant common-word docs during dogfooding (stop-word case), or a
  power user asks for `-foo` / `"phrase"` syntax (parser case).
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
fstamatelopoulos added a commit that referenced this pull request on May 3, 2026
PR #32 (Clio FTS title boosting) merged. Open the v0.19.0 release entry
with the 6.24 section + the drive-by stale-test fix carried in the same
PR. Leave [Unreleased] empty above it for the next batch (6.18 web Clio
tab + further iter-6 work will land under the same minor before this
gets cut to npm).
Minor bump because the migration changes the on-disk FTS schema, even
though reads + writes remain backward compatible (auto-runs on first
Clio touch after upgrade; transactional rollback on failure).
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Summary
Brings Clio's FTS ranking up to Cerefox parity for title-boosted search:
query terms that match a document title or a chunk heading now outrank
body-only matches at the same ratio Cerefox uses (effective A:B = 2.5×).
Migrates the existing 2-column `clio_chunks_fts` (chunk_title + content)
to a 3-column shape (doc_title + chunk_title + content), populated
automatically via SQLite triggers and weighted at query time through
SQLite FTS5's per-column `bm25()` knob.

Cerefox → Clio mapping

| Cerefox (Postgres) | Clio (SQLite FTS5) |
| --- | --- |
| `setweight(...,'A') \|\| setweight(...,'B')` baked into `tsvector` at insert time | `bm25(table, w0, w1, w2)` weights at query time |
| `ts_rank_cd` defaults | `bm25(2.5, 2.5, 1.0)` |
| `cerefox_update_chunk_fts(doc_id, new_title)` RPC, called from app code | `clio_documents_fts_title_au` trigger, fires automatically on `UPDATE OF title` |
| `scripts/reindex_all.py` after migration | `INSERT...SELECT` inside the migration |

The underlying ranker is necessarily different (Postgres `ts_rank_cd`
cover-density vs SQLite `bm25` Okapi BM25); that's a stack consequence,
not a tuning choice. Within each stack we apply the same conceptual
title-boost ratio. For typical agent retrieval (top-3 hits, small-to-mid
corpora) the ranker difference is barely visible; the mid-list shuffling
that BM25's lack of proximity sensitivity introduces is hidden by hybrid
mode (semantic + FTS) once an embedder is installed.
Files
Added:
- `packages/core/src/clio/migrations/0002_fts_title_boost.sql`: drops
  the old 2-column FTS table + its triggers, recreates as 3-column,
  backfills via JOIN, reinstalls 5 triggers covering every path:
  - `clio_chunks_fts_ai` / `_ad` / `_au`: chunk lifecycle, JOIN to docs
    for `doc_title`
  - `clio_documents_fts_bd` (BEFORE DELETE): pre-clears FTS for
    current chunks before the cascade-delete fires; required because
    the chunk_ad trigger can't reconstruct `doc_title` via JOIN once
    the doc row is gone
  - `clio_documents_fts_title_au` (AFTER UPDATE OF title): refreshes
    FTS `doc_title` for all current chunks of the renamed doc;
    mirrors Cerefox's `cerefox_update_chunk_fts` RPC
Changed:
- `packages/core/src/clio/db.ts`: registers migration 0002.
- `packages/core/src/clio/backend/local-clio.ts`: both `bm25(...)`
  call sites (chunk-level path in `searchFts` + chunk-level path in the
  hybrid SQL) take per-column weights. Constants extracted as
  `FTS_BM25_WEIGHT_DOC_TITLE` / `CHUNK_TITLE` / `CONTENT` for one-spot
  tuning if the ratio ever needs adjusting.
- `packages/core/src/clio/backend/local-clio.test.ts`: 6 new tests
  (90 total, was 84): title-only outranks body-only, chunk-heading
  outranks body-only, rename refreshes FTS, rename doesn't blow away
  unrelated docs, soft-delete still removes from search, end-to-end
  backfill verification.
Drive-by fix:
- `packages/server/src/routes/update.test.ts`: stale assertion from
  6.20 (the URL-removal-for-security pass) updated to assert the
  field's absence; was quietly failing the server suite and got
  caught running tests for this work.
Plan housekeeping (in this PR):
- New Backlog entry F.19 (Clio FTS search optimisations) for the
  follow-ups surfaced during this work: stop-word handling +
  Google-style query operators (both noisier/dumber than
  Cerefox-on-Postgres; both pure-TS query-time fixes). Trigger criteria
  documented; not in scope here.
Migration behaviour
- Auto-runs on the first `getClioBackend()` call after server boot, i.e.
  the next time anything touches Clio (Memory page stats, search,
  ingest, …). Sub-second on small corpora; sub-minute on 10k chunks.
- Runs inside `BEGIN IMMEDIATE; ...; COMMIT` per the migration runner:
  partial failures roll back cleanly, leaving the old 2-column table
  in place.
- Existing corpora pick up `doc_title` automatically via the
  `INSERT...SELECT` backfill; no separate `cfcf clio reindex --refts`
  command needed (the original 6.24 scope mentioned one; the in-place
  backfill made it unnecessary).
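The all-or-nothing behaviour of the `BEGIN IMMEDIATE; ...; COMMIT` wrapper can be sketched as below. This is only a control-flow illustration under stated assumptions: `SqlExecutor`, `runMigration` and `makeRecordingDb` are hypothetical names, the statements are placeholders rather than the real migration SQL, and the project's actual migration runner is not shown in this PR.

```typescript
// Hypothetical minimal database handle; the real runner's API is not shown in this PR.
interface SqlExecutor {
  exec(sql: string): void;
}

// Runs the statements inside BEGIN IMMEDIATE ... COMMIT; rolls back if any step throws.
function runMigration(db: SqlExecutor, statements: readonly string[]): boolean {
  db.exec("BEGIN IMMEDIATE");
  try {
    for (const sql of statements) {
      db.exec(sql);
    }
    db.exec("COMMIT");
    return true;
  } catch {
    db.exec("ROLLBACK");
    return false;
  }
}

// In-memory stand-in that records statements and can simulate a failing step.
function makeRecordingDb(failOn?: string): SqlExecutor & { log: string[] } {
  const log: string[] = [];
  return {
    log,
    exec(sql: string): void {
      log.push(sql);
      if (sql === failOn) {
        throw new Error(`simulated failure at: ${sql}`);
      }
    },
  };
}

// Happy path: every step runs, then COMMIT.
const okDb = makeRecordingDb();
const okResult = runMigration(okDb, ["step 1", "step 2"]);

// Failure path: "step 2" throws, so "step 3" never runs and ROLLBACK is issued.
const failingDb = makeRecordingDb("step 2");
const failedResult = runMigration(failingDb, ["step 1", "step 2", "step 3"]);
```

In the failure path the database is left as it was before `BEGIN IMMEDIATE`, which is what keeps the old 2-column table usable after a partial failure.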
Test plan
- `bun run typecheck` clean
- `bun run test` (585 / 100 / 66 / 9, all green; +6 new in
  `local-clio.test.ts`; server-suite glob is silent under bun-test as
  noted in prior PRs, individual files verified)
- … only, one in the body only: title-doc ranks first
- … title, re-search: doc now appears
- … `~/.cfcf/clio.db` (sub-second expected) and post-migration searches
  show titled docs ranked above body-only hits