feat: case-insensitive search_by_prefix() via parallel lowercase trie by damienriehl · Pull Request #16 · alea-institute/folio-python

damienriehl · 2026-04-07T23:36:14Z

Summary

Implements Option 1 from #15 (maintainer-approved): parallel lowercase marisa-trie for case-insensitive prefix search.

search_by_prefix() now accepts case_sensitive: bool = False
Lowercase input like "securit" returns 31 results (was 0)
Acronyms like "dui" match "DUI" / "Driving Under the Influence" (was 0)
case_sensitive=True preserves exact original behavior

Changes

`folio/graph.py`

3 new attributes on FOLIOGraph: _lowercase_label_trie, _lowercase_to_original (bridge dict), _ci_prefix_cache
Index building in parse_owl(): builds lowercase trie + bridge dict using str.casefold() from both label_to_index and alt_label_to_index keys
Search API: search_by_prefix(prefix, case_sensitive=False) routes to _search_by_prefix_insensitive() (new) or _search_by_prefix_sensitive() (original behavior)
Deduplication: results deduplicated by IRI index via seen set (prevents duplicates from lowercase label collisions)
Fallback parity: pure-Python fallback (when marisa_trie not installed) also supports case-insensitive search
Bug fix: clears _prefix_cache and _ci_prefix_cache at start of trie-building block (fixes pre-existing refresh() cache staleness)
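
The fallback-parity item above relies on an optional import. A minimal sketch of that pattern follows; it is an illustration, not the code in folio/graph.py. `build_index` and `prefix_keys` are hypothetical helper names, while `marisa_trie.Trie` and its `keys(prefix)` method are the real library calls.

```python
from typing import Iterable, List

try:
    import marisa_trie  # optional dependency
except ImportError:
    marisa_trie = None


def build_index(labels: Iterable[str]):
    """Return a MARISA trie when available, else a sorted list of folded keys."""
    folded = sorted({label.casefold() for label in labels})
    if marisa_trie is not None:
        return marisa_trie.Trie(folded)
    return folded


def prefix_keys(index, prefix: str) -> List[str]:
    """Return every folded key that starts with the folded prefix, sorted."""
    folded_prefix = prefix.casefold()
    if marisa_trie is not None and isinstance(index, marisa_trie.Trie):
        return sorted(index.keys(folded_prefix))
    # Pure-Python fallback: linear scan over the folded keys.
    return [k for k in index if k.startswith(folded_prefix)]
```

Both branches return the same sorted keys, which is the property a parity test would assert.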

`tests/test_folio.py`

Updated test_search_prefix to use case_sensitive=True (preserves original assertion)
5 new tests: case-insensitive search, acronym handling, backward compat, no-duplicate check, fallback parity

Design

case_sensitive=False:
  prefix.casefold() → _lowercase_label_trie.keys()
    → _lowercase_to_original[folded_key] → [original_labels]
      → label_to_index / alt_label_to_index → [indices]
        → deduplicate by index → [OWLClass]

Memory: ~1-2 MB additional for the second MARISA-compressed trie over ~18K labels.
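
The lookup chain in this section can be sketched with plain dicts. This is a simplified illustration under stated assumptions: the toy data is hypothetical, index values are assumed to be lists of ints, a `startswith` scan stands in for `_lowercase_label_trie.keys()`, and the function returns indices rather than OWLClass objects.

```python
from collections import defaultdict
from typing import Dict, List

# Hypothetical toy data standing in for the real FOLIOGraph indices.
label_to_index: Dict[str, List[int]] = {
    "Driving Under the Influence": [0],
    "Securities Law": [1],
}
alt_label_to_index: Dict[str, List[int]] = {
    "DUI": [0],
    "Dui": [0],  # collides with "DUI" once folded
    "SECURITIES": [1],
}

# Bridge dict: folded key -> every original label that folds to it.
lowercase_to_original: Dict[str, List[str]] = defaultdict(list)
for original in list(label_to_index) + list(alt_label_to_index):
    lowercase_to_original[original.casefold()].append(original)


def search_insensitive(prefix: str) -> List[int]:
    folded_prefix = prefix.casefold()
    # Stand-in for the trie lookup on the folded prefix.
    folded_keys = sorted(
        (k for k in lowercase_to_original if k.startswith(folded_prefix)),
        key=len,
    )
    seen = set()
    results: List[int] = []
    for folded_key in folded_keys:
        for original in lowercase_to_original[folded_key]:
            indices = label_to_index.get(original, []) + alt_label_to_index.get(original, [])
            for idx in indices:
                if idx not in seen:  # deduplicate by index
                    seen.add(idx)
                    results.append(idx)
    return results
```

Here `search_insensitive("dui")` reaches class 0 through two colliding alt labels but reports it once.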

Test plan

search_by_prefix("securit") → 31 results (was 0)
search_by_prefix("dui") → 2 results (was 0)
search_by_prefix("Securit", case_sensitive=True) → 45 results (preserved)
search_by_prefix("securit", case_sensitive=True) → 0 results (preserved)
No duplicate IRIs in case-insensitive results
Pure-Python fallback matches trie results
32 tests pass (27 existing + 5 new)

Closes #15

🤖 Generated with Claude Code

altLabels with xml:lang attributes (90% of all altLabels — 52,238 of 57,510) were added to translations but excluded from alternative_labels, making them invisible to search_by_label(). Now always append to alternative_labels regardless of language tag, so searches like "Patent Prosecution" correctly match "Patent Registration Process". Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

Prevent duplicate entries in alternative_labels when the same text appears as both a lang-tagged and non-lang-tagged altLabel. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
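
The behavior described in the two commit messages above could look roughly like this. `add_alt_label` is a hypothetical helper and the `translations` mapping (language tag to text) is an assumed shape; the real parsing code may differ.

```python
from typing import Dict, List, Optional


def add_alt_label(
    alternative_labels: List[str],
    translations: Dict[str, str],
    text: str,
    lang: Optional[str] = None,
) -> None:
    """Record an altLabel, keeping lang-tagged ones searchable too."""
    if lang is not None:
        translations[lang] = text
    # Always add to alternative_labels, but never twice.
    if text not in alternative_labels:
        alternative_labels.append(text)
```

Adding the same text once with `lang="en"` and once without a tag leaves a single entry in `alternative_labels`.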


mjbommar · 2026-04-08T06:46:00Z

Thanks for picking this up, @damienriehl — design is clean and the implementation is solid. Tests and the core functionality work as advertised. Before we merge I want to flag two things: a few mechanical cleanups, and a ranking observation that fell out of testing this PR which I think is worth a brief conversation.

Mechanical cleanups (blockers)

ruff check fails on tests/test_folio.py:303 — unused import folio.graph as graph_module. The monkeypatch fixture is also injected but never used; consider replacing the manual attribute swap in test_search_prefix_fallback_parity with monkeypatch.setattr(folio.graph, "marisa_trie", None), which is what the fixture is for.
ruff format --check wants to reformat both folio/graph.py and tests/test_folio.py.
ty check folio/ goes from 48 → 53 diagnostics. The 5 new ones are all in _search_by_prefix_insensitive and mirror pre-existing patterns in _search_by_prefix_sensitive (dict overload mismatches on the cache assignment + return type). CLAUDE.md asks us not to introduce new diagnostics even when they match existing ones — happy to either accept matching # type: ignore lines on both methods or to actually annotate _ci_prefix_cache correctly.

All three should be quick.
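
For reference, the `monkeypatch.setattr` idiom suggested in the first item might look like the sketch below. `fake_graph`, `LABELS`, and `search` are hypothetical stand-ins for `folio.graph` and the real search method; only `monkeypatch.setattr(target, name, value)` is the actual pytest fixture API.

```python
import types

# Hypothetical stand-in for the folio.graph module: marisa_trie is a
# module-level name that is None when the optional package is missing.
fake_graph = types.SimpleNamespace(marisa_trie=object())

LABELS = ["Michigan", "Michoacan de Ocampo", "Tax Law"]


def search(prefix: str):
    """Toy search that also reports which code path served the request."""
    path = "fallback" if fake_graph.marisa_trie is None else "trie"
    hits = sorted(label for label in LABELS if label.startswith(prefix))
    return path, hits


def test_fallback_parity(monkeypatch):
    _, trie_hits = search("Mich")
    # Under pytest, the fixture restores the attribute after the test ends.
    monkeypatch.setattr(fake_graph, "marisa_trie", None)
    path, fallback_hits = search("Mich")
    assert path == "fallback"
    assert fallback_hits == trie_hits
```

Compared with a manual swap inside try/finally, the fixture handles restoration even when an assertion fails midway.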

Ranking observation worth discussing

While probing the new behavior I noticed that search_by_prefix("Cal") is in rough shape on both branches:

Cal (main, case-sensitive)              cal (PR default, case-insensitive)
 0. Caldas               [CO+CAL]        0. California Supreme Court  [CAL]
 1. Caldas               [CO+CAL] (dup)  1. California Attorney General Reports [CALAG]
 2. Calista                              2. Caldas                    [CO+CAL]
 3. Military Draft       [Call-Up]       3. Calista
 4. Calabria             [IT+78]         4. Military Draft            [Call-Up]
 5. Calabria             [IT+78] (dup)   5. Calabria                  [IT+78]
...                                      ...
15. California           [US+CA]  ←     12. California                [US+CA]  ←

California (the U.S. state) is at position 15 on main and position 12 on PR — buried under a Colombian department, an Alaskan village, a Romanian county, an Italian region, and Military Draft (matched via the alt label Call-Up, which is 7 chars and therefore sorts ahead of the 10-char California). The same shape shows up for Mich and Tax:

'Mich'
  main:  ['Michigan', 'Michigan', 'Michoacan de Ocampo']
  PR:    ['Michigan Supreme Court', 'U.S. District Court - D. Michigan', 'Michigan']

'Tax'
  main:  ['Tax Law', '"Taxes" Definition', 'Tax Law']
  PR:    ['U.S. Tax Court', 'Tax Law', 'Tax and Revenue Law']

The mechanism: _search_by_prefix_* sorts matching keys purely by len(), then resolves to OWL classes. So a short alt label on a tangential class — Caldas (6), Call-Up (7), CAL (3), MICH (4), TAX (3) — always outranks the longer canonical label on the class users actually want. The PR doesn't introduce this (it's in the pre-existing case-sensitive path), but case-insensitivity surfaces it more visibly because all-caps reporter codes (CAL, MICH, TAX, MICHD, CALAG, CALCTAPP, ARB) now match mixed-case user input. The new dedup-by-IRI then "consumes" the canonical-label slot in favor of the shorter alt-label match.
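
A toy reproduction of that mechanism, assuming a hypothetical `key_to_class` mapping in place of the real label and alt-label indices:

```python
from typing import Dict, List

# Hypothetical toy index: matched key -> label of the class it resolves to.
key_to_class: Dict[str, str] = {
    "Caldas": "Caldas",
    "Call-Up": "Military Draft",  # alt label
    "California": "California",
    "CAL": "California Supreme Court",  # alt label (reporter code)
}


def rank_by_length(prefix: str, case_sensitive: bool) -> List[str]:
    """Sort matching keys purely by len(), then resolve to class labels."""
    if case_sensitive:
        keys = [k for k in key_to_class if k.startswith(prefix)]
    else:
        folded = prefix.casefold()
        keys = [k for k in key_to_class if k.casefold().startswith(folded)]
    keys.sort(key=len)
    return [key_to_class[k] for k in keys]
```

With this data the case-sensitive call puts California third of three, and the case-insensitive call pushes it to fourth because the 3-char `CAL` key now matches as well.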

A few questions, in increasing scope:

1. Land #16 as-is and open a separate issue for the ranking heuristic? This is my default recommendation: your PR does what it advertises and the ranking problem predates it. We can file a follow-up ticket and link back to this discussion.
2. Or: extend #16 with a small ranking tweak, e.g. sort by (is_alt_label, len(matched_key), len(class.label)) instead of just len(matched_key), so primary-label matches always beat alt-label matches when they resolve to different classes. That would unbury California/Michigan/Tax Law without changing the API.
3. Or: a real scoring function covering label vs alt, branch popularity (Areas of Law / Jurisdictions get a boost), and exact-prefix vs interior-prefix. Bigger lift, definitely a separate PR.

One concrete sub-question that's #16-specific: the new dedup is good, but it's "first matching key wins by length," which means alt-label matches consume the IRI slot before the canonical-label match for the same class. You can see this in the Mich example above — Michigan Supreme Court is matched first via MICH (4 chars), and the later 22-char Michigan Supreme Court key is deduped away. If we lift dedup into the case-sensitive path too (which I think we should — main currently returns duplicate Michigan/Caldas entries), it'd be worth deciding the dedup tiebreak rule explicitly: prefer label-key over alt-key when both resolve to the same IRI?

Happy with whatever direction you want to take — I just wanted to surface this before merging since the case-sensitivity work and the ranking quality are entangled enough that touching one invites questions about the other.

- Replace manual try/finally trie patching with idiomatic monkeypatch.setattr
- Remove unused `import folio.graph as graph_module` (ruff F401)
- Add `import folio.graph` at module level for monkeypatch target
- Add type: ignore comments for ty diagnostics on trie .keys() and .get() calls
- Run ruff format on both files

- Add dedup-by-index to _search_by_prefix_sensitive (matching insensitive pattern)
- Change sort key to (k not in label_to_index, len(k)) on both search paths
- Primary-label matches now rank before alt-label matches in all code paths
- Both trie and pure-Python fallback branches use same sort key

- Add test_search_prefix_case_sensitive_no_duplicates for CS dedup
- Add test_search_prefix_primary_label_ranks_first for label-over-alt ranking
- Update test_search_prefix_fallback_parity to verify IRI set equality
- All 45 tests pass including new dedup and ranking assertions

damienriehl · 2026-04-08T12:30:39Z

Thanks for the thorough review, @mjbommar. All feedback addressed in 3 new commits:

Mechanical cleanups (`510b425`)

ruff lint: Removed unused import folio.graph as graph_module; rewrote test_search_prefix_fallback_parity to use idiomatic monkeypatch.setattr(folio.graph, "marisa_trie", None) instead of the manual attribute swap + try/finally
ruff format: Both files now pass ruff format --check
ty diagnostics: Added # type: ignore comments on the 5 new diagnostics in _search_by_prefix_insensitive, matching the pre-existing pattern in _search_by_prefix_sensitive

Dedup-with-tiebreak + label-first ranking (`6890a2a`)

Implemented your Option 2 on both paths (_search_by_prefix_sensitive and _search_by_prefix_insensitive), including pure-Python fallbacks:

Sort key: Changed from key=len to key=lambda k: (k not in self.label_to_index, len(k)) — primary-label matches sort before alt-label matches (False < True), then by length within each group
Dedup: Added dedup-by-IRI-index to the case-sensitive path (it had none — this fixes the duplicate Michigan/Caldas entries you flagged). The case-insensitive path already had dedup; the new sort order naturally gives the tiebreak you suggested (label wins over alt-label for same IRI)
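
A small sketch of how that sort key interacts with dedup, using the Mich example. The toy indices are hypothetical and map each key to a single class index for brevity; the `label_first` flag exists only to compare against the old `key=len` ordering.

```python
from typing import Dict, List, Tuple

# Hypothetical toy indices; the real ones hold many more entries.
label_to_index: Dict[str, int] = {"Michigan": 0, "Michigan Supreme Court": 1}
alt_label_to_index: Dict[str, int] = {"MICH": 1}


def search(prefix: str, label_first: bool = True) -> List[Tuple[int, str]]:
    """Return (class index, matched key) pairs, deduplicated by index."""
    folded = prefix.casefold()
    keys = [
        k
        for k in list(label_to_index) + list(alt_label_to_index)
        if k.casefold().startswith(folded)
    ]
    if label_first:
        # False sorts before True, so primary labels come first.
        keys.sort(key=lambda k: (k not in label_to_index, len(k)))
    else:
        keys.sort(key=len)
    seen = set()
    results: List[Tuple[int, str]] = []
    for k in keys:
        idx = label_to_index[k] if k in label_to_index else alt_label_to_index[k]
        if idx not in seen:
            seen.add(idx)
            results.append((idx, k))
    return results
```

With `label_first=False` the `MICH` key claims class 1 before `Michigan` is seen; with the new key, `Michigan` ranks first and class 1 is reached through its primary label.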

Before → After (your examples)

Cal:

| | main (before) | PR v1 | PR v2 (now) |
|---|---|---|---|
| California position | #15 | #12 | #6 |
| Military Draft (Call-Up) | Top 5 | Top 5 | Gone from top 20 |
| Duplicate Caldas | Yes | No | No |
| Duplicate Calabria | Yes | No | No |

Mich:

| | main | PR v1 | PR v2 |
|---|---|---|---|
| Michigan position | #0 (but duplicated) | #2 (behind MICH→Mich Supreme Court) | #0, no duplicates |

Tax:

| | main | PR v1 | PR v2 |
|---|---|---|---|
| Tax Law position | #0 (but duplicated) | #1 (behind TAX→U.S. Tax Court) | #0, no duplicates |

Zero duplicate IRIs across all combinations (3 prefixes × CS/CI).

New tests (`4b8262a`)

test_search_prefix_case_sensitive_no_duplicates — asserts no duplicate IRIs in CS results
test_search_prefix_primary_label_ranks_first — asserts Michigan is first result for "Mich" (primary label beats alt-label matches)
Updated fallback parity test with monkeypatch idiom

45/45 tests pass, ruff clean, format clean.

Follow-up

Filed #17 for Option 3 (real scoring function) — linked back to this discussion. That's the right place to tackle branch popularity weighting, exact-prefix bonuses, and ontology depth.


The PR #16 squash merge included Damien's private GSD workflow files (.planning/PROJECT.md, REQUIREMENTS.md, ROADMAP.md, codebase/*, phases/*, research/*, etc.) which were never intended for the public repo. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

Bumps version in pyproject.toml, folio/__init__.py (which had drifted to 0.3.0 since the 0.3.0 release), and uv.lock self-reference. Adds CHANGES.md entry covering the case-insensitive prefix search feature, label-first ranking, and dedup fixes from #16. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

Damien Riehl and others added 13 commits March 16, 2026 21:30

fix: Deduplicate alternative_labels when adding lang-tagged altLabels

303eb54

Prevent duplicate entries in alternative_labels when the same text appears as both a lang-tagged and non-lang-tagged altLabel. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

docs: map existing codebase

aee2211

docs: initialize project

b7b5038

chore: add project config

14dd597

docs: complete domain research

e23e038

docs: define v1 requirements

e001b82

docs: create roadmap (4 phases)

3cecf4b

docs(01): auto-generated context (discuss skipped)

2cb05a7

feat(01): declare lowercase trie, bridge dict, and CI cache attributes

a92bd60

feat(02): build lowercase trie and bridge dict in parse_owl()

ad695ec

feat(03): case-insensitive search_by_prefix with case_sensitive param…

b63fc97

…eter

test(04): add case-insensitive prefix search tests

ea1b520

damienriehl requested a review from mjbommar April 7, 2026 23:36

Damien Riehl added 4 commits April 8, 2026 07:17

docs(quick-260408-9yz): Address PR #16 review feedback

9018e27

damienriehl force-pushed the pr/case-insensitive-prefix branch from caa7c85 to 9018e27 Compare April 8, 2026 12:29

damienriehl mentioned this pull request Apr 8, 2026

search_by_prefix: implement relevance scoring function #17

Open

Merge remote-tracking branch 'origin/main' into pr/case-insensitive-p…

50f3e31

…refix # Conflicts: # CLAUDE.md

mjbommar merged commit fe05fd4 into main Apr 8, 2026
6 checks passed

mjbommar merged 18 commits into main from pr/case-insensitive-prefix
