Context
Identified during review of #16 (case-insensitive prefix search). The current search_by_prefix() sorts results by (is_alt_label, len(matched_key)), which is a significant improvement over the prior len-only sort (see #16 discussion), but still produces suboptimal ranking for common queries.
Problem
The sort heuristic doesn't account for:
- Branch popularity — Areas of Law and Jurisdictions are far more commonly queried than reporter codes or geographic subdivisions, but rank equally
- Exact-prefix vs interior-prefix: a label that equals the query exactly should outrank one that merely starts with it
- Ontology depth — top-level concepts (California, Tax Law) are more likely targets than deep leaves (California Superior Court - Kern Cty.)
Examples (after #16 merges)
search_by_prefix("Cal"):
0. Caldas (Colombian department)
6. California (U.S. state — most users want this)
search_by_prefix("Tax"):
0. Tax Law ✓ (correct — primary label, short)
3. tax_type (property name, unlikely search target)
Proposed approach
Implement a scoring function that considers multiple signals:
score = w1 * is_primary_label
+ w2 * (1 / label_length)
+ w3 * branch_boost(class)
+ w4 * exact_prefix_bonus
+ w5 * (1 / ontology_depth)
Sort by score descending instead of the current tuple sort.
Branch boost / penalty model
Rather than boosting every branch, apply a penalty to low-utility branches — branches that rarely represent what a user is actually searching for. All other branches receive the default (boosted) treatment.
Penalized branches (less commonly the search target):
- Language
- Location
- Standards Compatibility
- System Identifiers
- Matter Narrative
- Currency
- Data Format
Default (boosted) branches (everything else — the ones users typically want):
- Area of Law
- Jurisdiction / Forum Venue
- Legal Authority
- Legal Entity
- Actor / Player
- Service
- Document / Artifact
- Industry
- Event
- Engagement Terms
- Objective
- Asset Type
- Communication Modality
- Governmental Body
- Matter Narrative Format
- (and any other branches not in the penalty list)
This "penalty" framing is simpler to maintain — new branches get the boost by default, and only demonstrably low-utility branches are explicitly penalized. Implementation could be as simple as a Set of penalized FOLIOTypes values checked during scoring.
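One way the penalty set could be sketched. The `Branch` enum below is a hypothetical local stand-in for FOLIOTypes: its member names are invented for this example and may not match the real enum. The boost and penalty values are placeholders too.

```python
from enum import Enum


class Branch(Enum):
    """Hypothetical stand-in for FOLIOTypes; member names are illustrative."""
    AREA_OF_LAW = "Area of Law"
    LEGAL_ENTITY = "Legal Entity"
    LANGUAGE = "Language"
    LOCATION = "Location"
    STANDARDS_COMPATIBILITY = "Standards Compatibility"
    SYSTEM_IDENTIFIERS = "System Identifiers"
    MATTER_NARRATIVE = "Matter Narrative"
    CURRENCY = "Currency"
    DATA_FORMAT = "Data Format"


# Only the low-utility branches are listed; anything else is boosted by default.
PENALIZED = frozenset({
    Branch.LANGUAGE,
    Branch.LOCATION,
    Branch.STANDARDS_COMPATIBILITY,
    Branch.SYSTEM_IDENTIFIERS,
    Branch.MATTER_NARRATIVE,
    Branch.CURRENCY,
    Branch.DATA_FORMAT,
})


def branch_boost(branch: Branch, boost: float = 1.0, penalty: float = 0.0) -> float:
    """Return `penalty` for penalized branches and `boost` for all others."""
    return penalty if branch in PENALIZED else boost


print(branch_boost(Branch.AREA_OF_LAW), branch_boost(Branch.LOCATION))
```

Because membership is checked against a single set, a newly added branch gets the default boost with no code change.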
Scope
- New scoring function in folio/graph.py
- Applied to both _search_by_prefix_sensitive and _search_by_prefix_insensitive
- Backward compatible — no API changes
- Tests for ranking quality on known queries (Cal, Mich, Tax, etc.)
References