feat: case-insensitive search_by_prefix() via parallel lowercase trie#16
feat: case-insensitive search_by_prefix() via parallel lowercase trie#16
Conversation
altLabels with xml:lang attributes (90% of all altLabels — 52,238 of 57,510) were added to translations but excluded from alternative_labels, making them invisible to search_by_label(). Now always append to alternative_labels regardless of language tag, so searches like "Patent Prosecution" correctly match "Patent Registration Process". Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Prevent duplicate entries in alternative_labels when the same text appears as both a lang-tagged and non-lang-tagged altLabel. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
|
Thanks for picking this up, @damienriehl — design is clean and the implementation is solid. Tests and the core functionality work as advertised. Before we merge I want to flag two things: a few mechanical cleanups, and a ranking observation that fell out of testing this PR which I think is worth a brief conversation. Mechanical cleanups (blockers)
All three should be quick. Ranking observation worth discussingWhile probing the new behavior I noticed that
The mechanism: A few questions, in increasing scope:
One concrete sub-question that's #16-specific: the new dedup is good, but it's "first matching key wins by length," which means alt-label matches consume the IRI slot before the canonical-label match for the same class. You can see this in the Happy with whatever direction you want to take — I just wanted to surface this before merging since the case-sensitivity work and the ranking quality are entangled enough that touching one invites questions about the other. |
- Replace manual try/finally trie patching with idiomatic monkeypatch.setattr - Remove unused `import folio.graph as graph_module` (ruff F401) - Add `import folio.graph` at module level for monkeypatch target - Add type: ignore comments for ty diagnostics on trie .keys() and .get() calls - Run ruff format on both files
- Add dedup-by-index to _search_by_prefix_sensitive (matching insensitive pattern) - Change sort key to (k not in label_to_index, len(k)) on both search paths - Primary-label matches now rank before alt-label matches in all code paths - Both trie and pure-Python fallback branches use same sort key
- Add test_search_prefix_case_sensitive_no_duplicates for CS dedup - Add test_search_prefix_primary_label_ranks_first for label-over-alt ranking - Update test_search_prefix_fallback_parity to verify IRI set equality - All 45 tests pass including new dedup and ranking assertions
caa7c85 to
9018e27
Compare
|
Thanks for the thorough review, @mjbommar. All feedback addressed in 3 new commits: Mechanical cleanups (510b425)
Dedup-with-tiebreak + label-first ranking (6890a2a)Implemented your Option 2 on both paths (
Before → After (your examples)
Zero duplicate IRIs across all combinations (3 prefixes × CS/CI). New tests (4b8262a)
45/45 tests pass, ruff clean, format clean. Follow-upFiled #17 for Option 3 (real scoring function) — linked back to this discussion. That's the right place to tackle branch popularity weighting, exact-prefix bonuses, and ontology depth. |
…refix # Conflicts: # CLAUDE.md
The PR #16 squash merge included Damien's private GSD workflow files (.planning/PROJECT.md, REQUIREMENTS.md, ROADMAP.md, codebase/*, phases/*, research/*, etc.) which were never intended for the public repo. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Bumps version in pyproject.toml, folio/__init__.py (which had drifted to 0.3.0 since the 0.3.0 release), and uv.lock self-reference. Adds CHANGES.md entry covering the case-insensitive prefix search feature, label-first ranking, and dedup fixes from #16. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Summary
Implements Option 1 from #15 (maintainer-approved): parallel lowercase
marisa-triefor case-insensitive prefix search.search_by_prefix()now acceptscase_sensitive: bool = False"securit"returns 31 results (was 0)"dui"match"DUI"/"Driving Under the Influence"(was 0)case_sensitive=Truepreserves exact original behaviorChanges
folio/graph.pyFOLIOGraph:_lowercase_label_trie,_lowercase_to_original(bridge dict),_ci_prefix_cacheparse_owl(): builds lowercase trie + bridge dict usingstr.casefold()from bothlabel_to_indexandalt_label_to_indexkeyssearch_by_prefix(prefix, case_sensitive=False)routes to_search_by_prefix_insensitive()(new) or_search_by_prefix_sensitive()(original behavior)seenset (prevents duplicates from lowercase label collisions)marisa_trienot installed) also supports case-insensitive search_prefix_cacheand_ci_prefix_cacheat start of trie-building block (fixes pre-existingrefresh()cache staleness)tests/test_folio.pytest_search_prefixto usecase_sensitive=True(preserves original assertion)Design
Memory: ~1-2 MB additional for the second MARISA-compressed trie over ~18K labels.
Test plan
search_by_prefix("securit")→ 31 results (was 0)search_by_prefix("dui")→ 2 results (was 0)search_by_prefix("Securit", case_sensitive=True)→ 45 results (preserved)search_by_prefix("securit", case_sensitive=True)→ 0 results (preserved)Closes #15
🤖 Generated with Claude Code