search_by_prefix: implement relevance scoring function #17

@damienriehl

Context

Identified during review of #16 (case-insensitive prefix search). The current search_by_prefix() sorts results by (is_alt_label, len(matched_key)), which is a significant improvement over the prior len-only sort (see #16 discussion), but still produces suboptimal ranking for common queries.

Problem

The sort heuristic doesn't account for:

  • Branch popularity: Areas of Law and Jurisdictions are queried far more often than reporter codes or geographic subdivisions, but the current sort ranks them equally
  • Exact-prefix vs interior-prefix: a label identical to the query (exact match) should outrank one that merely starts with it
  • Ontology depth: top-level concepts (California, Tax Law) are more likely targets than deep leaves (California Superior Court - Kern Cty.)

Examples (after #16 merges)

search_by_prefix("Cal"):
  0. Caldas          (Colombian department)
  6. California      (U.S. state — most users want this)

search_by_prefix("Tax"):
  0. Tax Law         ✓ (correct — primary label, short)
  3. tax_type        (property name, unlikely search target)

Proposed approach

Implement a scoring function that considers multiple signals:

score = w1 * is_primary_label
      + w2 * (1 / label_length)
      + w3 * branch_boost(class)
      + w4 * exact_prefix_bonus
      + w5 * (1 / ontology_depth)

Sort by score descending instead of the current tuple sort.
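
A minimal sketch of how the weighted sum above might look in Python. The Candidate container, the WEIGHTS values, and the rank helper are all hypothetical names introduced for illustration; none of them exist in folio/graph.py today, and the weights are untuned placeholders. The depth and branch values in the usage note are made up, not read from the ontology.

```python
from dataclasses import dataclass
from typing import Iterable, List


@dataclass(frozen=True)
class Candidate:
    """Hypothetical container for one prefix match (not an existing API)."""
    label: str
    is_primary_label: bool
    branch_penalized: bool  # True if the class sits in a penalized branch
    depth: int              # assumed convention: 1 = top-level concept


# Placeholder weights (w1..w5); real values would need tuning on known queries.
WEIGHTS = {"primary": 2.0, "length": 1.0, "branch": 1.5, "exact": 3.0, "depth": 1.0}


def score(query: str, c: Candidate) -> float:
    """Weighted sum of the five signals from the proposal."""
    exact = c.label.casefold() == query.casefold()
    return (
        WEIGHTS["primary"] * float(c.is_primary_label)
        + WEIGHTS["length"] * (1.0 / max(len(c.label), 1))   # guard empty labels
        + WEIGHTS["branch"] * (0.0 if c.branch_penalized else 1.0)
        + WEIGHTS["exact"] * float(exact)
        + WEIGHTS["depth"] * (1.0 / max(c.depth, 1))         # guard depth 0
    )


def rank(query: str, candidates: Iterable[Candidate]) -> List[Candidate]:
    """Sort by score, highest first; ties keep their input order."""
    return sorted(candidates, key=lambda c: score(query, c), reverse=True)
```

With illustrative inputs such as Candidate("Caldas", True, False, 2) and Candidate("California", True, False, 1), rank("Cal", ...) puts California first because of the depth term alone. Whether that holds on real data depends on the weights and on how depth is actually computed.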

Branch boost / penalty model

Rather than boosting every branch, apply a penalty to low-utility branches — branches that rarely represent what a user is actually searching for. All other branches receive the default (boosted) treatment.

Penalized branches (less commonly the search target):

  • Language
  • Location
  • Standards Compatibility
  • System Identifiers
  • Matter Narrative
  • Currency
  • Data Format

Default (boosted) branches (everything else — the ones users typically want):

  • Area of Law
  • Jurisdiction / Forum Venue
  • Legal Authority
  • Legal Entity
  • Actor / Player
  • Service
  • Document / Artifact
  • Industry
  • Event
  • Engagement Terms
  • Objective
  • Asset Type
  • Communication Modality
  • Governmental Body
  • Matter Narrative Format
  • (and any other branches not in the penalty list)

This "penalty" framing is simpler to maintain: new branches get the boost by default, and only demonstrably low-utility branches are explicitly penalized. Implementation could be as simple as a set of penalized FOLIOTypes values checked during scoring.
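
One possible shape for the penalty check, as a sketch only. The Branch enum below is a stand-in for FOLIOTypes with illustrative member names (the real enum's members may differ), and branch_boost returning 0.0 or 1.0 is an assumption; the actual magnitudes would be folded into w3.

```python
from enum import Enum


class Branch(Enum):
    """Stand-in for FOLIOTypes; member names here are illustrative only."""
    AREA_OF_LAW = "Area of Law"
    LEGAL_ENTITY = "Legal Entity"
    LANGUAGE = "Language"
    LOCATION = "Location"
    STANDARDS_COMPATIBILITY = "Standards Compatibility"
    SYSTEM_IDENTIFIERS = "System Identifiers"
    MATTER_NARRATIVE = "Matter Narrative"
    CURRENCY = "Currency"
    DATA_FORMAT = "Data Format"


# Only the low-utility branches are listed; anything absent gets the boost.
PENALIZED_BRANCHES = frozenset({
    Branch.LANGUAGE,
    Branch.LOCATION,
    Branch.STANDARDS_COMPATIBILITY,
    Branch.SYSTEM_IDENTIFIERS,
    Branch.MATTER_NARRATIVE,
    Branch.CURRENCY,
    Branch.DATA_FORMAT,
})


def branch_boost(branch: Branch) -> float:
    """Return 0.0 for penalized branches and 1.0 for everything else."""
    return 0.0 if branch in PENALIZED_BRANCHES else 1.0
```

A frozenset keeps the membership check O(1) and makes the list immutable at runtime, so a newly added branch needs no change here unless it should be penalized.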

Scope

  • New scoring function in folio/graph.py
  • Applied to both _search_by_prefix_sensitive and _search_by_prefix_insensitive
  • Backward compatible — no API changes
  • Tests for ranking quality on known queries (Cal, Mich, Tax, etc.)
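
For the ranking-quality tests, a small helper along these lines might be enough. It works on a plain list of labels, so it makes no assumption about what search_by_prefix returns; rank_of and assert_ranks_before are hypothetical names, and the caller would extract labels from the real results first.

```python
from typing import Sequence


def rank_of(labels: Sequence[str], target: str) -> int:
    """Return the 0-based position of target, failing loudly if it is absent."""
    try:
        return list(labels).index(target)
    except ValueError:
        raise AssertionError(f"{target!r} not in results: {list(labels)!r}")


def assert_ranks_before(labels: Sequence[str], better: str, worse: str) -> None:
    """Assert that `better` appears earlier in the results than `worse`."""
    b, w = rank_of(labels, better), rank_of(labels, worse)
    assert b < w, f"expected {better!r} (rank {b}) before {worse!r} (rank {w})"
```

A test for the Cal query could then call assert_ranks_before(labels, "California", "Caldas"), which checks relative order rather than pinning exact positions, so it should survive small weight changes.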
