
feat: case-insensitive search_by_prefix() via parallel lowercase trie#16

Merged
mjbommar merged 18 commits into main from pr/case-insensitive-prefix
Apr 8, 2026


@damienriehl
Contributor

Summary

Implements Option 1 from #15 (maintainer-approved): parallel lowercase marisa-trie for case-insensitive prefix search.

  • search_by_prefix() now accepts case_sensitive: bool = False
  • Lowercase input like "securit" returns 31 results (was 0)
  • Acronyms like "dui" match "DUI" / "Driving Under the Influence" (was 0)
  • case_sensitive=True preserves exact original behavior

Changes

folio/graph.py

  • 3 new attributes on FOLIOGraph: _lowercase_label_trie, _lowercase_to_original (bridge dict), _ci_prefix_cache
  • Index building in parse_owl(): builds lowercase trie + bridge dict using str.casefold() from both label_to_index and alt_label_to_index keys
  • Search API: search_by_prefix(prefix, case_sensitive=False) routes to _search_by_prefix_insensitive() (new) or _search_by_prefix_sensitive() (original behavior)
  • Deduplication: results deduplicated by IRI index via seen set (prevents duplicates from lowercase label collisions)
  • Fallback parity: pure-Python fallback (when marisa_trie not installed) also supports case-insensitive search
  • Bug fix: clears _prefix_cache and _ci_prefix_cache at start of trie-building block (fixes pre-existing refresh() cache staleness)
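
The index-building step described above can be sketched roughly as follows. This is a minimal illustration, not the code in folio/graph.py: the helper name `build_lowercase_bridge`, the sample labels, and the assumption that each label maps to a list of class indices are all hypothetical.

```python
from collections import defaultdict

# Hypothetical stand-ins for FOLIOGraph.label_to_index / alt_label_to_index.
# Assumption: each maps an original-case label to a list of class indices.
label_to_index = {"Securities Law": [0], "Driving Under the Influence": [1]}
alt_label_to_index = {"DUI": [1], "Dui": [2]}


def build_lowercase_bridge(*label_maps):
    """Map each casefolded label to the original-case labels that produce it."""
    bridge = defaultdict(list)
    for label_map in label_maps:
        for original in label_map:
            folded = original.casefold()
            # Several originals can collapse to one folded key ("DUI", "Dui").
            if original not in bridge[folded]:
                bridge[folded].append(original)
    return dict(bridge)


lowercase_to_original = build_lowercase_bridge(label_to_index, alt_label_to_index)

# These folded keys are what a second (lowercase) trie would be built from.
lowercase_keys = sorted(lowercase_to_original)
```

`str.casefold()` is more aggressive than `str.lower()` (for example, `"Straße".casefold()` gives `"strasse"`), which is presumably why the PR uses it for both the index keys and the query prefix.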

tests/test_folio.py

  • Updated test_search_prefix to use case_sensitive=True (preserves original assertion)
  • 5 new tests: case-insensitive search, acronym handling, backward compat, no-duplicate check, fallback parity

Design

case_sensitive=False:
  prefix.casefold() → _lowercase_label_trie.keys()
    → _lowercase_to_original[folded_key] → [original_labels]
      → label_to_index / alt_label_to_index → [indices]
        → deduplicate by index → [OWLClass]

Memory: ~1-2 MB additional for the second MARISA-compressed trie over ~18K labels.
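
As a rough sketch of the lookup flow in the diagram above, the following uses a plain `startswith` scan in place of the MARISA trie and returns class indices rather than OWLClass objects. The function name, the sample data, and the list-of-indices shape are assumptions made for illustration; the length-only sort reflects the ordering this version of the PR is described as using.

```python
def search_by_prefix_insensitive(
    prefix, lowercase_to_original, label_to_index, alt_label_to_index
):
    """Return class indices whose label or alt label matches prefix, ignoring case."""
    folded_prefix = prefix.casefold()
    # Pure-Python stand-in for asking the lowercase trie for keys under a prefix.
    folded_keys = [k for k in lowercase_to_original if k.startswith(folded_prefix)]

    seen = set()
    results = []
    for folded in sorted(folded_keys, key=len):
        # Bridge from the folded key back to every original-case spelling.
        for original in lowercase_to_original[folded]:
            for source in (label_to_index, alt_label_to_index):
                for index in source.get(original, []):
                    if index not in seen:  # deduplicate by class index
                        seen.add(index)
                        results.append(index)
    return results


# Hypothetical sample data: class 0 is reachable via its label and via "DUI".
lowercase_to_original = {
    "dui": ["DUI"],
    "driving under the influence": ["Driving Under the Influence"],
    "duty of care": ["Duty of Care"],
}
label_to_index = {"Driving Under the Influence": [0], "Duty of Care": [1]}
alt_label_to_index = {"DUI": [0]}

hits = search_by_prefix_insensitive(
    "d", lowercase_to_original, label_to_index, alt_label_to_index
)
```

With this sample data, the prefix "d" matches three folded keys, but class 0 appears only once in `hits` because the second route to it is dropped by the `seen` set.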

Test plan

  • search_by_prefix("securit") → 31 results (was 0)
  • search_by_prefix("dui") → 2 results (was 0)
  • search_by_prefix("Securit", case_sensitive=True) → 45 results (preserved)
  • search_by_prefix("securit", case_sensitive=True) → 0 results (preserved)
  • No duplicate IRIs in case-insensitive results
  • Pure-Python fallback matches trie results
  • 32 tests pass (27 existing + 5 new)

Closes #15

🤖 Generated with Claude Code

Damien Riehl and others added 13 commits March 16, 2026 21:30
altLabels with xml:lang attributes (90% of all altLabels — 52,238 of
57,510) were added to translations but excluded from alternative_labels,
making them invisible to search_by_label(). Now always append to
alternative_labels regardless of language tag, so searches like "Patent
Prosecution" correctly match "Patent Registration Process".

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Prevent duplicate entries in alternative_labels when the same text
appears as both a lang-tagged and non-lang-tagged altLabel.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
@damienriehl damienriehl requested a review from mjbommar April 7, 2026 23:36
@mjbommar
Contributor

mjbommar commented Apr 8, 2026

Thanks for picking this up, @damienriehl — design is clean and the implementation is solid. Tests and the core functionality work as advertised. Before we merge I want to flag two things: a few mechanical cleanups, and a ranking observation that fell out of testing this PR which I think is worth a brief conversation.

Mechanical cleanups (blockers)

  • ruff check fails on tests/test_folio.py:303 — unused import folio.graph as graph_module. The monkeypatch fixture is also injected but never used; consider replacing the manual attribute swap in test_search_prefix_fallback_parity with monkeypatch.setattr(folio.graph, "marisa_trie", None), which is what the fixture is for.
  • ruff format --check wants to reformat both folio/graph.py and tests/test_folio.py.
  • ty check folio/ goes from 48 → 53 diagnostics. The 5 new ones are all in _search_by_prefix_insensitive and mirror pre-existing patterns in _search_by_prefix_sensitive (dict overload mismatches on the cache assignment + return type). CLAUDE.md asks us not to introduce new diagnostics even when they match existing ones — happy to either accept matching # type: ignore lines on both methods or to actually annotate _ci_prefix_cache correctly.

All three should be quick.
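
To illustrate why a scoped patch is preferable to a manual attribute swap, here is a small standard-library sketch. It uses `unittest.mock.patch.object` as a stand-in for pytest's `monkeypatch.setattr` (both restore the original attribute automatically), and `fake_graph` and `backend_name` are hypothetical substitutes for the real `folio.graph` module and search code.

```python
import types
from unittest import mock

# Hypothetical stand-in for a module with an optional dependency attribute.
fake_graph = types.ModuleType("fake_graph")
fake_graph.marisa_trie = object()  # pretend the optional import succeeded


def backend_name():
    """Report which search backend would be used right now."""
    return "pure-python" if fake_graph.marisa_trie is None else "marisa-trie"


def check_fallback_path():
    # The attribute is None only inside the with-block; it is restored on exit,
    # even if the body raises, so no try/finally is needed.
    with mock.patch.object(fake_graph, "marisa_trie", None):
        return backend_name()
```

In a pytest test, the equivalent would be the one-line `monkeypatch.setattr(folio.graph, "marisa_trie", None)` suggested above, with teardown handled by the fixture.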

Ranking observation worth discussing

While probing the new behavior I noticed that search_by_prefix("Cal") is in rough shape on both branches:

Cal (main, case-sensitive)              cal (PR default, case-insensitive)
 0. Caldas               [CO+CAL]        0. California Supreme Court  [CAL]
 1. Caldas               [CO+CAL] (dup)  1. California Attorney General Reports [CALAG]
 2. Calista                              2. Caldas                    [CO+CAL]
 3. Military Draft       [Call-Up]       3. Calista
 4. Calabria             [IT+78]         4. Military Draft            [Call-Up]
 5. Calabria             [IT+78] (dup)   5. Calabria                  [IT+78]
...                                      ...
15. California           [US+CA]  ←     12. California                [US+CA]  ←

California (the U.S. state) is at position 15 on main and position 12 on PR — buried under a Colombian department, an Alaskan village, a Romanian county, an Italian region, and Military Draft (matched via the alt label Call-Up, which is 7 chars and therefore sorts ahead of the 10-char California). The same shape shows up for Mich and Tax:

'Mich'
  main:  ['Michigan', 'Michigan', 'Michoacan de Ocampo']
  PR:    ['Michigan Supreme Court', 'U.S. District Court - D. Michigan', 'Michigan']

'Tax'
  main:  ['Tax Law', '"Taxes" Definition', 'Tax Law']
  PR:    ['U.S. Tax Court', 'Tax Law', 'Tax and Revenue Law']

The mechanism: _search_by_prefix_* sorts matching keys purely by len(), then resolves to OWL classes. So a short alt label on a tangential class — Caldas (6), Call-Up (7), CAL (3), MICH (4), TAX (3) — always outranks the longer canonical label on the class users actually want. The PR doesn't introduce this (it's in the pre-existing case-sensitive path), but case-insensitivity surfaces it more visibly because all-caps reporter codes (CAL, MICH, TAX, MICHD, CALAG, CALCTAPP, ARB) now match mixed-case user input. The new dedup-by-IRI then "consumes" the canonical-label slot in favor of the shorter alt-label match.
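
The length-only ordering can be reproduced with a trimmed sample of the matched keys listed above. The dictionary below is illustrative only: it pairs a handful of matched keys with the class labels they resolve to, and it omits most of the real matches, so positions differ from the full ontology.

```python
# Matched key -> label of the class it resolves to (trimmed, illustrative sample).
matched_key_to_class = {
    "California": "California",
    "Calabria": "Calabria",
    "Calista": "Calista",
    "Call-Up": "Military Draft",
    "Caldas": "Caldas",
    "CALAG": "California Attorney General Reports",
    "CAL": "California Supreme Court",
}

# Sorting purely by key length puts short codes and short alt labels first.
ranked = [matched_key_to_class[k] for k in sorted(matched_key_to_class, key=len)]

for position, label in enumerate(ranked):
    print(position, label)
```

Even in this small sample, the 10-character key "California" lands last, behind the 3-character code CAL and the 7-character alt label Call-Up.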

A few questions, in increasing scope:

  1. Land #16 as-is and open a separate issue for the ranking heuristic? This is my default recommendation: your PR does what it advertises and the ranking problem predates it. We can file a follow-up ticket and link back to this discussion.
  2. Or: extend #16 with a small ranking tweak, e.g. sort by (is_alt_label, len(matched_key), len(class.label)) instead of just len(matched_key), so primary-label matches always beat alt-label matches when they resolve to different classes. That would unbury California/Michigan/Tax Law without changing the API.
  3. Or: a real scoring function — label vs alt, branch popularity (Areas of Law / Jurisdictions get a boost), exact-prefix vs interior-prefix. Bigger lift, definitely a separate PR.

One concrete sub-question that's #16-specific: the new dedup is good, but it's "first matching key wins by length," which means alt-label matches consume the IRI slot before the canonical-label match for the same class. You can see this in the Mich example above — Michigan Supreme Court is matched first via MICH (4 chars), and the later 22-char Michigan Supreme Court key is deduped away. If we lift dedup into the case-sensitive path too (which I think we should — main currently returns duplicate Michigan/Caldas entries), it'd be worth deciding the dedup tiebreak rule explicitly: prefer label-key over alt-key when both resolve to the same IRI?

Happy with whatever direction you want to take — I just wanted to surface this before merging since the case-sensitivity work and the ranking quality are entangled enough that touching one invites questions about the other.

Damien Riehl added 4 commits April 8, 2026 07:17
- Replace manual try/finally trie patching with idiomatic monkeypatch.setattr
- Remove unused `import folio.graph as graph_module` (ruff F401)
- Add `import folio.graph` at module level for monkeypatch target
- Add type: ignore comments for ty diagnostics on trie .keys() and .get() calls
- Run ruff format on both files
- Add dedup-by-index to _search_by_prefix_sensitive (matching insensitive pattern)
- Change sort key to (k not in label_to_index, len(k)) on both search paths
- Primary-label matches now rank before alt-label matches in all code paths
- Both trie and pure-Python fallback branches use same sort key
- Add test_search_prefix_case_sensitive_no_duplicates for CS dedup
- Add test_search_prefix_primary_label_ranks_first for label-over-alt ranking
- Update test_search_prefix_fallback_parity to verify IRI set equality
- All 45 tests pass including new dedup and ranking assertions
@damienriehl
Contributor Author

Thanks for the thorough review, @mjbommar. All feedback addressed in 3 new commits:

Mechanical cleanups (510b425)

  • ruff lint: Removed unused import folio.graph as graph_module; rewrote test_search_prefix_fallback_parity to use idiomatic monkeypatch.setattr(folio.graph, "marisa_trie", None) instead of the manual attribute swap + try/finally
  • ruff format: Both files now pass ruff format --check
  • ty diagnostics: Added # type: ignore comments on the 5 new diagnostics in _search_by_prefix_insensitive, matching the pre-existing pattern in _search_by_prefix_sensitive

Dedup-with-tiebreak + label-first ranking (6890a2a)

Implemented your Option 2 on both paths (_search_by_prefix_sensitive and _search_by_prefix_insensitive), including pure-Python fallbacks:

  • Sort key: Changed from key=len to key=lambda k: (k not in self.label_to_index, len(k)) — primary-label matches sort before alt-label matches (False < True), then by length within each group
  • Dedup: Added dedup-by-IRI-index to the case-sensitive path (it had none — this fixes the duplicate Michigan/Caldas entries you flagged). The case-insensitive path already had dedup; the new sort order naturally gives the tiebreak you suggested (label wins over alt-label for same IRI)
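
A simplified sketch of how the new sort key interacts with dedup-by-index is shown below. It is not the code from folio/graph.py: the function name `ranked_matches`, the case-insensitive `startswith` scan, the sample indices, and the pairing of MICHD with the district court class are assumptions made for illustration. Only the sort key itself, `(k not in label_to_index, len(k))`, is taken from the description above.

```python
def ranked_matches(prefix, label_to_index, alt_label_to_index, label_first=True):
    """Return (class index, winning key) pairs, deduplicated by class index."""
    folded = prefix.casefold()
    all_keys = list(label_to_index) + [
        k for k in alt_label_to_index if k not in label_to_index
    ]
    keys = [k for k in all_keys if k.casefold().startswith(folded)]

    if label_first:
        # False sorts before True, so primary labels come first, then by length.
        def sort_key(k):
            return (k not in label_to_index, len(k))
    else:
        sort_key = len  # the earlier, length-only behavior

    seen = set()
    results = []
    for key in sorted(keys, key=sort_key):
        indices = label_to_index.get(key, []) + alt_label_to_index.get(key, [])
        for index in indices:
            if index not in seen:  # the first key to reach a class wins its slot
                seen.add(index)
                results.append((index, key))
    return results


# Hypothetical sample: class 1 is reachable via its label and via the code MICH.
label_to_index = {"Michigan": [0], "Michigan Supreme Court": [1]}
alt_label_to_index = {"MICH": [1], "MICHD": [2]}

before = ranked_matches("Mich", label_to_index, alt_label_to_index, label_first=False)
after = ranked_matches("Mich", label_to_index, alt_label_to_index, label_first=True)
```

With the length-only key, MICH claims class 1 before the Michigan label is reached; with the label-first key, Michigan is ranked first and class 1 is claimed by its own primary label.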

Before → After (your examples)

Cal:

| | main (before) | PR v1 | PR v2 (now) |
|---|---|---|---|
| California position | #15 | #12 | #6 |
| Military Draft (Call-Up) | Top 5 | Top 5 | Gone from top 20 |
| Duplicate Caldas | Yes | No | No |
| Duplicate Calabria | Yes | No | No |
Duplicate Calabria Yes No No

Mich:

| | main | PR v1 | PR v2 |
|---|---|---|---|
| Michigan position | #0 (but duplicated) | #2 (behind MICH→Mich Supreme Court) | #0, no duplicates |

Tax:

| | main | PR v1 | PR v2 |
|---|---|---|---|
| Tax Law position | #0 (but duplicated) | #1 (behind TAX→U.S. Tax Court) | #0, no duplicates |

Zero duplicate IRIs across all combinations (3 prefixes × CS/CI).

New tests (4b8262a)

  • test_search_prefix_case_sensitive_no_duplicates — asserts no duplicate IRIs in CS results
  • test_search_prefix_primary_label_ranks_first — asserts Michigan is first result for "Mich" (primary label beats alt-label matches)
  • Updated fallback parity test with monkeypatch idiom

45/45 tests pass, ruff clean, format clean.

Follow-up

Filed #17 for Option 3 (real scoring function) — linked back to this discussion. That's the right place to tackle branch popularity weighting, exact-prefix bonuses, and ontology depth.

@mjbommar mjbommar merged commit fe05fd4 into main Apr 8, 2026
6 checks passed
mjbommar added a commit that referenced this pull request Apr 8, 2026
The PR #16 squash merge included Damien's private GSD workflow files
(.planning/PROJECT.md, REQUIREMENTS.md, ROADMAP.md, codebase/*, phases/*,
research/*, etc.) which were never intended for the public repo.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
mjbommar added a commit that referenced this pull request Apr 8, 2026
Bumps version in pyproject.toml, folio/__init__.py (which had drifted to
0.3.0 since the 0.3.0 release), and uv.lock self-reference. Adds CHANGES.md
entry covering the case-insensitive prefix search feature, label-first
ranking, and dedup fixes from #16.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>