# Sanctions Screening Evaluation

- **Purpose:** Evaluate sanctions screening accuracy and validate precision/recall targets
- **Author:** Devbrew LLC  
- **Last Updated:** November 18, 2025  
- **Status:** In progress  
- **License:** Apache 2.0

## Overview

This notebook implements the evaluation protocol for the sanctions screening module. The evaluation measures matching accuracy through a labeled test set and validates that the system meets production accuracy targets.

**Evaluation Metrics:**
- Precision@1: Percentage of queries where top candidate is the correct match (target: ≥95%)
- Recall@top3: Percentage of queries where ground truth match appears in top 3 (target: ≥98%)
- False Positive Rate: Percentage of non-matches incorrectly flagged as matches
- Decision Accuracy: Alignment between predicted and expected decision categories

Together, these metrics check that the screening system identifies sanctioned entities while keeping false positives low enough for production use.
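
To make the two ranking metrics concrete, the sketch below computes them on a handful of made-up results. The uids, the `toy_results` list, and both function names are hypothetical and exist only for illustration; the notebook's own evaluation function appears later.

```python
from typing import List, Tuple

# Hypothetical toy data: (ground-truth uid, ranked candidate uids returned by screening).
toy_results: List[Tuple[str, List[str]]] = [
    ("A", ["A", "B", "C"]),  # correct at rank 1
    ("B", ["C", "B", "A"]),  # correct at rank 2
    ("C", ["A", "B", "D"]),  # ground truth missing from the results
    ("D", ["D"]),            # correct at rank 1, fewer than 3 candidates returned
]


def precision_at_1(results: List[Tuple[str, List[str]]]) -> float:
    """Fraction of queries whose top-ranked candidate is the ground truth."""
    if not results:
        return 0.0
    hits = sum(1 for truth, ranked in results if ranked and ranked[0] == truth)
    return hits / len(results)


def recall_at_k(results: List[Tuple[str, List[str]]], k: int = 3) -> float:
    """Fraction of queries whose ground truth appears in the top k candidates."""
    if not results:
        return 0.0
    hits = sum(1 for truth, ranked in results if truth in ranked[:k])
    return hits / len(results)


print(precision_at_1(toy_results))    # 0.5
print(recall_at_k(toy_results, k=3))  # 0.75
```

Both metrics are computed only over queries that have a ground-truth match; non-match queries feed the false positive rate instead.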

## Setup: Artifacts and Functions

The evaluation loads artifacts generated by the implementation pipeline:

- **Sanctions Index**: Canonicalized names and metadata (`sanctions_index.parquet`)
- **Blocking Indices**: Inverted indices for candidate retrieval (`blocking_indices.json`)
- **Metadata**: Version tracking and dataset statistics

Helper functions for text normalization, tokenization, and screening are loaded to enable independent evaluation runs without re-executing the full implementation pipeline.

### Environment Configuration

We configure the Python environment with standardized settings, import required libraries, and set a fixed random seed for reproducibility. This ensures consistent evaluation results across runs.

In [42]:
import sys
import warnings
from pathlib import Path
import json
import random
from typing import Dict, Any, List, Optional, Tuple
import time

import pandas as pd
import numpy as np
import rapidfuzz as rf
from rapidfuzz import fuzz, process

# Configuration
warnings.filterwarnings("ignore")
pd.set_option("display.max_columns", 100)
pd.set_option("display.max_rows", 100)
pd.set_option("display.float_format", '{:.2f}'.format)

# Reproducibility
RANDOM_STATE = 42
random.seed(RANDOM_STATE)
np.random.seed(RANDOM_STATE)

print("Environment configured successfully")
print(f" pandas: {pd.__version__}")
print(f" numpy: {np.__version__}")
print(f" rapidfuzz: {rf.__version__}")

Environment configured successfully
 pandas: 2.3.3
 numpy: 2.3.3
 rapidfuzz: 3.14.1


### Load Artifacts

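The evaluation loads pre-computed artifacts from the implementation pipeline. The sanctions index contains 39,350 canonicalized name records with metadata. The blocking indices are inverted indices: each blocking key maps to the list of row positions that share that key, so retrieving the candidates for a key is a single dictionary lookup (constant time on average, independent of the index size).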
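
As an illustration of the inverted-index idea, here is a minimal sketch that builds a first-token index over three made-up names. The names and the `build_first_token_index` helper are assumptions for this example; the real indices are produced by `04_sanctions_screening.ipynb` and loaded from `blocking_indices.json`.

```python
from collections import defaultdict
from typing import Dict, List

# Hypothetical normalized names; the list position plays the role of the row position.
toy_names = ["banco nacional cuba", "banco internacional", "john doe"]


def build_first_token_index(names: List[str]) -> Dict[str, List[int]]:
    """Map each first token to the positions of the names that start with it."""
    index: Dict[str, List[int]] = defaultdict(list)
    for position, name in enumerate(names):
        tokens = name.split()
        if tokens:
            index[tokens[0]].append(position)
    return dict(index)


toy_index = build_first_token_index(toy_names)
print(toy_index.get("banco", []))  # [0, 1]
print(toy_index.get("maria", []))  # []
```

The lookup itself is cheap, but the returned list can be long for common keys, which is why the screening function later caps how many candidates it scores.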

In [19]:
# Path configuration
PROJECT_ROOT = Path.cwd().parent if Path.cwd().name == "notebooks" else Path.cwd()
sys.path.insert(0, str(PROJECT_ROOT))
MODELS_DIR = PROJECT_ROOT / "packages" / "models"
DATA_DIR = PROJECT_ROOT / "data_catalog" / "processed"


print("Loading artifacts...\n")

# Load sanctions index
sanctions_index_path = MODELS_DIR / "sanctions_index.parquet"
if not sanctions_index_path.exists():
    raise FileNotFoundError(f"Sanctions index not found: {sanctions_index_path}\n"
                          f"Please run notebooks/04_sanctions_screening.ipynb first to generate artifacts.")

sanctions_index = pd.read_parquet(sanctions_index_path)
print(f"Loaded sanctions index: {len(sanctions_index):,} records")

# Load blocking indices
blocking_indices_path = MODELS_DIR / "blocking_indices.json"
if not blocking_indices_path.exists():
    raise FileNotFoundError(f"Blocking indices not found: {blocking_indices_path}\n"
                          f"Please run notebooks/04_sanctions_screening.ipynb first to generate artifacts.")

with open(blocking_indices_path, 'r') as f:
    blocking_indices = json.load(f)

first_token_index = {k: v for k, v in blocking_indices['first_token'].items()}
bucket_index = {k: v for k, v in blocking_indices['bucket'].items()}
initials_index = {k: v for k, v in blocking_indices['initials'].items()}

print(f"Loaded blocking indices:")
print(f" - First token index: {len(first_token_index):,} keys")
print(f" - Bucket index: {len(bucket_index):,} keys")
print(f" - Initials index: {len(initials_index):,} keys")

# Load metadata (optional, for version tracking)
metadata_path = MODELS_DIR / "sanctions_index_metadata.json"
if metadata_path.exists():
    with open(metadata_path, 'r') as f:
        sanctions_index_metadata = json.load(f)
    print(f"\nLoaded metadata: version {sanctions_index_metadata.get('created_at', 'unknown')}")
else:
    sanctions_index_metadata = {}
    print("[Warning] Metadata not found (optional)")

print(f"\nAll artifacts loaded successfully")

Loading artifacts...

Loaded sanctions index: 39,350 records
Loaded blocking indices:
 - First token index: 15,597 keys
 - Bucket index: 4 keys
 - Initials index: 15,986 keys

Loaded metadata: version 2025-11-17T06:00:56.218723

All artifacts loaded successfully


### Helper Functions

Text normalization and tokenization functions are imported from the shared `packages.compliance.sanctions` module. This module provides standardized functions used by both `04_sanctions_screening.ipynb` and this evaluation notebook, ensuring consistency across the screening pipeline.

The shared functions include:
- `normalize_text()`: Text normalization for robust fuzzy matching
- `tokenize()`: Tokenization with stopword filtering

In [20]:
from packages.compliance.sanctions import (
    normalize_text,
    tokenize
)

# Verify imports work
print("Helper functions imported successfully")
print(f"  - normalize_text: {normalize_text.__name__}")
print(f"  - tokenize: {tokenize.__name__}")

Helper functions imported successfully
  - normalize_text: normalize_text
  - tokenize: tokenize


## Create Labeled Test Set

To evaluate the screening system's accuracy, we need a labeled test set with known ground truth matches. This test set will include:

- **Positive examples**: Query variations of names in the sanctions index (exact matches, normalized versions, case variations, typos)
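- **Negative examples**: Queries labeled with no ground-truth match, used to estimate the false positive rate. Note that the current implementation draws these names from other records in the sanctions index, so each one still has an exact counterpart in the index.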

We'll sample diverse names from the sanctions index and create query variations to test different matching scenarios. This approach allows us to measure:
- **Precision@1**: How often the top candidate is the correct match
- **Recall@top3**: How often the ground truth appears in the top 3 results
- **False Positive Rate**: How often non-matches are incorrectly flagged
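
Negative examples are only informative when the query has no counterpart in the index. One possible way to build such queries is to generate synthetic names and reject any that already appear in the index. The sketch below assumes a plain lowercase comparison as a stand-in for `normalize_text`, and the name parts and the `sketch_true_negatives` helper are hypothetical.

```python
import random
from typing import Iterable, List


def sketch_true_negatives(indexed_names: Iterable[str], n: int, seed: int = 42) -> List[str]:
    """Generate up to n distinct synthetic names that are absent from indexed_names."""
    rng = random.Random(seed)
    # Hypothetical name parts, for illustration only.
    first_parts = ["amara", "bjorn", "chidi", "dalia", "emeka", "freya"]
    last_parts = ["okafor", "lindqvist", "mwangi", "haddad", "sorensen", "kimani"]
    indexed = {name.lower() for name in indexed_names}

    negatives: List[str] = []
    attempts = 0
    # Bound the number of attempts so the loop always terminates.
    while len(negatives) < n and attempts < 100 * n:
        attempts += 1
        candidate = f"{rng.choice(first_parts)} {rng.choice(last_parts)}"
        if candidate not in indexed and candidate not in negatives:
            negatives.append(candidate)
    return negatives


print(sketch_true_negatives({"john doe"}, 3))
```

Absence from the index does not guarantee a low fuzzy score, and that is the point: the false positive rate measures how often such unrelated names are still flagged.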

In [21]:
# Function to introduce typos
def _introduce_typo(name: str, n_typos: int = 1) -> str:
    """
    Introduce minor typos for testing robustness.
    
    Randomly replaces or removes characters to simulate real-world
    data entry errors.
    """
    chars = list(name)
    for _ in range(n_typos):
        if len(chars) > 0:
            idx = random.randint(0, len(chars) - 1)
            # Replace with random character or remove
            if random.random() < 0.5:
                chars[idx] = random.choice('abcdefghijklmnopqrstuvwxyz')
            else:
                chars.pop(idx)
    return ''.join(chars)

# Function to create labeled test set
def create_labeled_test_set(
    sanctions_index: pd.DataFrame,
    n_samples: int = 80,
    random_state: int = 42
) -> pd.DataFrame:
    """
    Create a labeled test set with ground truth matches.
    
    Samples names from sanctions index and creates query variations
    with known ground truth matches.
    
    Args:
        sanctions_index: DataFrame with sanctions records
        n_samples: Number of base names to sample
        random_state: Random seed for reproducibility
        
    Returns:
        DataFrame with test queries and ground truth labels
    """
    np.random.seed(random_state)
    random.seed(random_state)
    
    # Sample diverse names from sanctions index
    # Note: For a production system, you might want stratified sampling
    # by country/program/entity_type, but for this case study we keep it simple
    sampled = sanctions_index.sample(
        n=min(n_samples, len(sanctions_index)),
        random_state=random_state
    )
    
    # Validate UIDs exist
    valid_uids = set(sanctions_index['uid'].values)
    
    test_queries = []
    
    for _, row in sampled.iterrows():
        original_name = row['name']
        uid = row['uid']
        
        # Skip if UID is invalid
        if uid not in valid_uids:
            continue
        
        # Create query variations to test different matching scenarios
        variations = [
            # Exact match (should score very high)
            {
                'query': original_name,
                'ground_truth_uid': uid,
                'expected_score_min': 0.95,
                'variation_type': 'exact'
            },
            # Normalized version (tests normalization pipeline)
            {
                'query': row['name_norm'],
                'ground_truth_uid': uid,
                'expected_score_min': 0.90,
                'variation_type': 'normalized'
            },
            # Case variation (tests case-insensitive matching)
            {
                'query': original_name.upper(),
                'ground_truth_uid': uid,
                'expected_score_min': 0.90,
                'variation_type': 'case'
            },
            # Minor typo (tests robustness to errors)
            {
                'query': _introduce_typo(original_name, n_typos=1),
                'ground_truth_uid': uid,
                'expected_score_min': 0.85,
                'variation_type': 'typo'
            }
        ]
        
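        # Add a "negative" example to test false positive rate.
        # Caveat: the name is drawn from the sanctions index itself, so it is only
        # a non-match relative to `uid`; it still matches its own record exactly.
        # Reusing the same random_state also means sample(1) typically picks the
        # same position on every iteration, so the negatives have little variety.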
        non_match_candidates = sanctions_index[sanctions_index['uid'] != uid]
        if len(non_match_candidates) > 0:
            non_match_name = non_match_candidates.sample(1, random_state=random_state).iloc[0]['name']
            
            variations.append({
                'query': non_match_name,
                'ground_truth_uid': None,  # No match expected
                'expected_score_min': None,
                'variation_type': 'non_match'
            })
        
        test_queries.extend(variations)
    
    # Create DataFrame
    test_df = pd.DataFrame(test_queries)
    
    # Validate all ground truth UIDs
    if len(test_df) > 0:
        invalid_uids = test_df[
            test_df['ground_truth_uid'].notna() & 
            ~test_df['ground_truth_uid'].isin(valid_uids)
        ]
        if len(invalid_uids) > 0:
            raise ValueError(f"Found {len(invalid_uids)} invalid ground truth UIDs")
    
    # Add metadata
    test_df['query_id'] = range(len(test_df))
    test_df['created_at'] = pd.Timestamp.now()
    
    return test_df

# Create test set
print("Creating labeled test set...")
labeled_test_set = create_labeled_test_set(
    sanctions_index,
    n_samples=50,  # 50 names × 5 variations (4 positive + 1 negative) = 250 queries
    random_state=RANDOM_STATE
)

print(f"\nCreated {len(labeled_test_set):,} test queries")
print(f" - With ground truth: {labeled_test_set['ground_truth_uid'].notna().sum():,}")
print(f" - Non-matches: {labeled_test_set['ground_truth_uid'].isna().sum():,}")

# Show distribution of variation types
if 'variation_type' in labeled_test_set.columns:
    print(f"\nVariation type distribution:")
    for var_type, count in labeled_test_set['variation_type'].value_counts().items():
        print(f" - {var_type}: {count:,}")

# Save test set for reproducibility
test_set_path = DATA_DIR / "sanctions_eval_labels.csv"
test_set_path.parent.mkdir(parents=True, exist_ok=True)
labeled_test_set.to_csv(test_set_path, index=False)
print(f"\nSaved test set to: {test_set_path}")

Creating labeled test set...

Created 250 test queries
 - With ground truth: 200
 - Non-matches: 50

Variation type distribution:
 - exact: 50
 - normalized: 50
 - case: 50
 - typo: 50
 - non_match: 50

Saved test set to: /Users/joekariuki/Documents/Devbrew/research/devbrew-payments-fraud-sanctions/data_catalog/processed/sanctions_eval_labels.csv


## Screening Function for Evaluation

To evaluate the screening system, we need to implement the same screening logic used in the production pipeline. This ensures our evaluation results accurately reflect how the system will perform in production.

**Implementation Approach:**

For this case study, we keep the screening functions inline in the notebook rather than extracting them to a shared module. This makes the evaluation notebook self-contained and easier to follow, demonstrating the complete evaluation flow in one place.

The screening pipeline consists of:
1. **Blocking**: Retrieve candidate records using blocking indices (first token, token bucket, initials)
2. **Scoring**: Compute similarity scores for each candidate using RapidFuzz
3. **Ranking**: Sort candidates by composite score and return top-K results

This matches the implementation in `04_sanctions_screening.ipynb` to ensure consistent evaluation results.
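
The three steps can be sketched end to end on toy data. This example uses only the standard library, with `difflib.SequenceMatcher` as a stand-in for the RapidFuzz scorers and a single first-token block instead of the three blocking strategies, so it illustrates the flow rather than reproducing the notebook's scores. All names prefixed with `toy_` or `sketch_` are hypothetical.

```python
from difflib import SequenceMatcher
from typing import Dict, List, Tuple

# Hypothetical normalized records.
toy_records = ["banco nacional cuba", "banco internacional desarrollo", "john doe"]

# Blocking index: first token -> positions of records starting with that token.
toy_block_index: Dict[str, List[int]] = {}
for position, name in enumerate(toy_records):
    toy_block_index.setdefault(name.split()[0], []).append(position)


def sketch_screen(query: str, top_k: int = 3) -> List[Tuple[str, float]]:
    """Block on the first token, score each candidate, and return the top_k by score."""
    tokens = query.lower().split()
    if not tokens:
        return []
    query_norm = " ".join(tokens)

    # 1. Blocking: only records sharing the first token are considered.
    candidates = toy_block_index.get(tokens[0], [])

    # 2. Scoring: similarity in [0, 1] between the query and each candidate.
    scored = [
        (toy_records[pos], SequenceMatcher(None, query_norm, toy_records[pos]).ratio())
        for pos in candidates
    ]

    # 3. Ranking: highest score first, truncated to top_k.
    scored.sort(key=lambda item: item[1], reverse=True)
    return scored[:top_k]


for name, score in sketch_screen("Banco Nacional Cuba"):
    print(f"{name}: {score:.3f}")
```

Blocking trades a small risk of missing a match (when the blocking key itself is corrupted) for a large reduction in the number of pairs that need scoring.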

In [55]:
# Blocking helper functions (matching implementation from sanctions screening)

def get_first_token(tokens: List[str]) -> str:
    """Extract first token for prefix blocking."""
    return tokens[0] if tokens else ""

def get_token_count_bucket(tokens: List[str]) -> str:
    """
    Bucket names by token count for length-based blocking.
    
    Groups:
    - "tiny": 0-1 tokens
    - "small": 2 tokens  
    - "medium": 3-4 tokens
    - "large": 5+ tokens
    """
    count = len(tokens)
    if count <= 1:
        return "tiny"
    elif count == 2:
        return "small"
    elif count <= 4:
        return "medium"
    else:
        return "large"

def get_initials_signature(tokens: List[str]) -> str:
    """
    Create initials signature from first letter of each token.
    
    Examples:
        ['john', 'doe'] → 'j-d'
        ['al', 'qaida'] → 'a-q'
        ['banco', 'nacional', 'cuba'] → 'b-n-c'
    """
    if not tokens:
        return ""
    return "-".join(t[0] for t in tokens if t)

def get_candidates_eval(
    query_name: str,
    first_token_index: Dict[str, List[int]],
    bucket_index: Dict[str, List[int]],
    initials_index: Dict[str, List[int]],
    max_candidates: Optional[int] = None
) -> Tuple[List[int], Dict[int, int]]:
    """
    Get candidate indices using blocking keys with prioritization.
    
    Returns:
        Tuple of (candidate_list, priority_scores)
        - candidate_list: Candidate indices sorted by priority (descending), then by index
        - priority_scores: Dict mapping candidate index to a weighted priority
          (first token = 3, initials = 2, bucket = 1; higher = stronger blocking evidence)
    """
    # Normalize and tokenize query
    query_norm = normalize_text(query_name)
    query_tokens = tokenize(query_norm)
    
    if not query_tokens:
        return [], {}
    
    # Extract blocking keys using helper functions
    query_first = get_first_token(query_tokens)
    query_bucket = get_token_count_bucket(query_tokens)
    query_initials = get_initials_signature(query_tokens)
    
    # Collect candidates from each strategy separately
    first_token_candidates = set()
    bucket_candidates = set()
    initials_candidates = set()
    
    # Strategy 1: First token match (most specific)
    if query_first in first_token_index:
        first_token_candidates.update(first_token_index[query_first])
    
    # Strategy 2: Token bucket match (less specific - can be huge)
    if query_bucket in bucket_index:
        bucket_candidates.update(bucket_index[query_bucket])
    
    # Strategy 3: Initials match (moderately specific)
    if query_initials in initials_index:
        initials_candidates.update(initials_index[query_initials])
    
    # Compute a weighted priority from the strategies each candidate appears in
    candidate_counts = {}
    all_candidates = first_token_candidates | bucket_candidates | initials_candidates
    
    for idx in all_candidates:
        count = 0
        if idx in first_token_candidates:
            count += 3  # First token is most specific - weight higher
        if idx in initials_candidates:
            count += 2  # Initials are moderately specific
        if idx in bucket_candidates:
            count += 1  # Bucket is least specific
        candidate_counts[idx] = count
    
    # Sort by priority (higher priority first), then by index
    candidate_list = sorted(all_candidates, key=lambda x: (-candidate_counts[x], x))
    
    # For very large candidate sets, prioritize high-priority candidates
    # Candidates appearing in multiple strategies are more likely to be matches
    if max_candidates is not None and len(candidate_list) > max_candidates:
        # Take top priority candidates first
        candidate_list = candidate_list[:max_candidates]
    
    return candidate_list, candidate_counts


def compute_similarity_batch(
    query_norm: str,
    candidate_norm_list: List[str]
) -> Tuple[np.ndarray, np.ndarray, np.ndarray]:
    """
    Compute similarity scores for multiple candidates in batch.
    
    Uses rapidfuzz.process.cdist for vectorized scoring, which is much faster
    than looping through candidates individually.

    This mirrors the batch similarity computation in 04_sanctions_screening.ipynb
    to keep evaluation results consistent with the implementation.
    
    Args:
        query_norm: Normalized query name (full string)
        candidate_norm_list: List of candidate normalized strings
        
    Returns:
        Tuple of three numpy arrays (set, sort, partial scores) [0-100]
    """
    # Batch compute token_set_ratio
    set_scores = process.cdist(
        [query_norm],
        candidate_norm_list,
        scorer=fuzz.token_set_ratio,
        workers=4  # Parallel scoring on multi-core CPU
    )[0]

    # Batch compute token_sort_ratio
    sort_scores = process.cdist(
        [query_norm],
        candidate_norm_list,
        scorer=fuzz.token_sort_ratio,
        workers=4
    )[0]

    # Batch compute partial_ratio
    partial_scores = process.cdist(
        [query_norm],
        candidate_norm_list,
        scorer=fuzz.partial_ratio,
        workers=4
    )[0]

    return set_scores, sort_scores, partial_scores


def composite_score_batch(
    set_scores: np.ndarray,
    sort_scores: np.ndarray,
    partial_scores: np.ndarray
) -> np.ndarray:
    """
    Compute composite scores for batch of candidates.
    
    Uses vectorized numpy operations for efficiency.

    This mirrors the composite scoring in 04_sanctions_screening.ipynb
    to keep evaluation results consistent with the implementation.

    Returns:
        Array of composite scores [0-1]
    """
    # Weighted average: 0.45 * set + 0.35 * sort + 0.20 * partial
    raw_scores = 0.45 * set_scores + 0.35 * sort_scores + 0.20 * partial_scores

    # Rescale to [0, 1]
    composite_scores = np.clip(raw_scores / 100.0, 0.0, 1.0)

    return composite_scores

def screen_query_eval(
    query_name: str,
    sanctions_index: pd.DataFrame,
    first_token_index: Dict[str, List[int]],
    bucket_index: Dict[str, List[int]],
    initials_index: Dict[str, List[int]],
    top_k: int = 3,
    initial_candidates: int = 2000,  # Score this many first
    expand_threshold: float = 0.85,  # If top score < this, expand
    max_candidates: int = 3000  # Max to score if expanding
) -> List[Dict[str, Any]]:
    """
    Screen a query name and return top-K candidates with scores.
    
    Uses two-stage adaptive scoring:
    1. Score top 2000 priority candidates
    2. If top score is low (< 0.85), expand to 3000 candidates
    This balances latency and recall.
    """
    # Normalize and tokenize query
    query_norm = normalize_text(query_name)
    query_tokens = tokenize(query_norm)
    
    if not query_tokens:
        return []
    
    # Get candidates with prioritization
    candidate_indices, priority_scores = get_candidates_eval(
        query_name,
        first_token_index,
        bucket_index,
        initials_index,
        max_candidates=None
    )
    
    if not candidate_indices:
        return []
    
    # Stage 1: Score top priority candidates
    # Always include all high-priority candidates (priority >= 3)
    high_priority_candidates = [
        idx for idx in candidate_indices 
        if priority_scores.get(idx, 0) >= 3
    ]
    
    # Build initial candidate set
    if len(high_priority_candidates) > 0:
        remaining_slots = initial_candidates - len(high_priority_candidates)
        if remaining_slots > 0:
            other_candidates = [
                idx for idx in candidate_indices 
                if idx not in high_priority_candidates
            ][:remaining_slots]
            candidates_to_score = high_priority_candidates + other_candidates
        else:
            candidates_to_score = high_priority_candidates[:initial_candidates]
    else:
        candidates_to_score = candidate_indices[:initial_candidates]
    
    # Pre-extract candidate strings
    candidate_norm_list = []
    candidate_metadata = []
    
    for idx in candidates_to_score:
        try:
            candidate = sanctions_index.iloc[idx]
            candidate_norm_list.append(candidate['name_norm'])
            candidate_metadata.append({
                'uid': candidate['uid'],
                'name': candidate['name'],
                'country': candidate.get('country'),
                'program': candidate.get('program'),
                'source': candidate.get('source')
            })
        except (IndexError, KeyError):
            continue
    
    if not candidate_norm_list:
        return []
    
    # Stage 1: Batch score initial candidates
    set_scores, sort_scores, partial_scores = compute_similarity_batch(
        query_norm,
        candidate_norm_list
    )
    
    composite_scores = composite_score_batch(set_scores, sort_scores, partial_scores)
    
    # Check if we need to expand (two-stage approach)
    top_score = float(np.max(composite_scores)) if len(composite_scores) > 0 else 0.0
    
    # Stage 2: If top score is low, expand candidate set
    if top_score < expand_threshold and len(candidate_indices) > initial_candidates:
        # Expand to include more candidates
        additional_needed = max_candidates - len(candidates_to_score)
        if additional_needed > 0:
            additional_candidates = [
                idx for idx in candidate_indices 
                if idx not in candidates_to_score
            ][:additional_needed]
            
            # Add additional candidates
            for idx in additional_candidates:
                try:
                    candidate = sanctions_index.iloc[idx]
                    candidate_norm_list.append(candidate['name_norm'])
                    candidate_metadata.append({
                        'uid': candidate['uid'],
                        'name': candidate['name'],
                        'country': candidate.get('country'),
                        'program': candidate.get('program'),
                        'source': candidate.get('source')
                    })
                except (IndexError, KeyError):
                    continue
            
            # Re-score expanded set
            set_scores, sort_scores, partial_scores = compute_similarity_batch(
                query_norm,
                candidate_norm_list
            )
            composite_scores = composite_score_batch(set_scores, sort_scores, partial_scores)
    
    # Combine scores with metadata
    scored_candidates = []
    for i, metadata in enumerate(candidate_metadata):
        scored_candidates.append({
            **metadata,
            'score': float(composite_scores[i]),
            'sim_set': float(set_scores[i]) / 100.0,
            'sim_sort': float(sort_scores[i]) / 100.0,
            'sim_partial': float(partial_scores[i]) / 100.0
        })
    
    # Sort by score (descending) and return top-K
    scored_candidates.sort(key=lambda x: x['score'], reverse=True)
    return scored_candidates[:top_k]


# Test the screening function
print("Testing screening function...")
test_query = "BANCO NACIONAL DE CUBA"
results = screen_query_eval(
    test_query,
    sanctions_index,
    first_token_index,
    bucket_index,
    initials_index,
    top_k=3
)

print(f"\nQuery: '{test_query}'")
print(f"Found {len(results)} candidates:")
for i, result in enumerate(results, 1):
    print(f"  {i}. {result['name']} (score: {result['score']:.3f}, uid: {result['uid']})")

Testing screening function...

Query: 'BANCO NACIONAL DE CUBA'
Found 3 candidates:
  1. BANCO NACIONAL DE CUBA (score: 1.000, uid: SDN_306)
  2. INSTITUTO NACIONAL DE TURISMO DE CUBA (score: 0.705, uid: SDN_1042)
  3. BANCO INTERNACIONAL DE DESARROLLO, C.A. (score: 0.685, uid: SDN_25646)


## Compute Evaluation Metrics

Now that we have a labeled test set and screening functions, we can evaluate the system's performance. We'll run each query through the screening pipeline and compute key metrics:

**Metrics to Compute:**
- **Precision@1**: Percentage of queries where the top candidate is the correct match (target: ≥95%)
- **Recall@top3**: Percentage of queries where the ground truth match appears in the top 3 results (target: ≥98%)
- **False Positive Rate**: Percentage of non-match queries whose top candidate is flagged at each decision threshold (score ≥ 0.90 for a match, score ≥ 0.80 for a match or review)
- **Latency Statistics**: p50, p95, and p99 per-query latencies (target: p95 < 50 ms)

The evaluation function processes all queries, measures latency, and compares results against ground truth labels to compute these metrics.
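
The decision thresholds and the two false positive rates can be illustrated on a few made-up scores. The thresholds (0.90 and 0.80) are the ones used in the evaluation code; the `decide` helper and the `negative_scores` list are hypothetical.

```python
from typing import List


def decide(score: float) -> str:
    """Map a top-1 composite score to a decision category."""
    if score >= 0.90:
        return "is_match"
    if score >= 0.80:
        return "review"
    return "no_match"


# Hypothetical top-1 scores for queries that have no ground-truth match.
negative_scores: List[float] = [0.95, 0.85, 0.70, 0.60]
decisions = [decide(score) for score in negative_scores]

# FPR at 0.90 counts only outright matches; FPR at 0.80 also counts reviews.
fpr_at_90 = sum(d == "is_match" for d in decisions) / len(decisions)
fpr_at_80 = sum(d in ("is_match", "review") for d in decisions) / len(decisions)

print(decisions)  # ['is_match', 'review', 'no_match', 'no_match']
print(fpr_at_90)  # 0.25
print(fpr_at_80)  # 0.5
```

By construction the FPR at 0.80 can never be lower than the FPR at 0.90, since every query flagged as a match is also counted at the lower threshold.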

In [56]:
# Compute evaluation metrics
def evaluate_screening_system(
    labeled_test_set: pd.DataFrame,
    sanctions_index: pd.DataFrame,
    first_token_index: Dict[str, List[int]],
    bucket_index: Dict[str, List[int]],
    initials_index: Dict[str, List[int]],
    top_k: int = 3
) -> Dict[str, Any]:
    """
    Evaluate screening system on labeled test set.
    
    Processes each query in the test set, screens it against the sanctions index,
    and compares results against ground truth labels to compute accuracy metrics.
    
    Returns:
        Dictionary with precision@1, recall@top3, FPR, latency stats, and detailed results
    """
    results = []
    latencies = []
    
    print("Evaluating screening system...")
    print(f"Processing {len(labeled_test_set):,} queries...\n")
    
    for idx, row in labeled_test_set.iterrows():
        query = row['query']
        ground_truth_uid = row['ground_truth_uid']
        
        # Screen query and measure latency with a monotonic, high-resolution clock
        start_time = time.perf_counter()
        candidates = screen_query_eval(
            query,
            sanctions_index,
            first_token_index,
            bucket_index,
            initials_index,
            top_k=top_k
        )
        latency_ms = (time.perf_counter() - start_time) * 1000
        latencies.append(latency_ms)
        
        # Check whether the top-1 candidate is the ground truth
        top1_match = candidates[0] if candidates else None
        top1_correct = (
            top1_match is not None and
            top1_match['uid'] == ground_truth_uid
        ) if pd.notna(ground_truth_uid) else None
        
        # Check if ground truth in top-K
        topk_uids = [c['uid'] for c in candidates]
        topk_match = (
            ground_truth_uid in topk_uids
        ) if pd.notna(ground_truth_uid) else None
        
        # Decision categories based on score thresholds
        if top1_match:
            score = top1_match['score']
            if score >= 0.90:
                decision = 'is_match'
            elif score >= 0.80:
                decision = 'review'
            else:
                decision = 'no_match'
        else:
            decision = 'no_match'
            score = 0.0
        
        results.append({
            'query_id': row['query_id'],
            'query': query,
            'ground_truth_uid': ground_truth_uid,
            'top1_uid': top1_match['uid'] if top1_match else None,
            'top1_score': score,
            'top1_correct': top1_correct,
            'topk_match': topk_match,
            'decision': decision,
            'latency_ms': latency_ms,
            'num_candidates': len(candidates)
        })
        
        # Progress indicator
        if (idx + 1) % 50 == 0:
            print(f"  Processed {idx + 1:,} queries...")
    
    results_df = pd.DataFrame(results)
    
    # Compute metrics
    # Precision@1: Of queries with ground truth, how many have correct top-1?
    queries_with_truth = results_df[results_df['ground_truth_uid'].notna()]
    precision_at_1 = queries_with_truth['top1_correct'].mean() if len(queries_with_truth) > 0 else 0.0
    
    # Recall@top3: Of queries with ground truth, how many have match in top-3?
    recall_at_top3 = queries_with_truth['topk_match'].mean() if len(queries_with_truth) > 0 else 0.0
    
    # False Positive Rate: Of non-matches, how many are flagged as matches?
    non_matches = results_df[results_df['ground_truth_uid'].isna()]
    fpr_at_90 = (
        (non_matches['decision'] == 'is_match').mean()
        if len(non_matches) > 0 else 0.0
    )
    fpr_at_80 = (
        ((non_matches['decision'] == 'is_match') | 
         (non_matches['decision'] == 'review')).mean()
        if len(non_matches) > 0 else 0.0
    )
    
    # Latency statistics
    latency_p50 = np.percentile(latencies, 50)
    latency_p95 = np.percentile(latencies, 95)
    latency_p99 = np.percentile(latencies, 99)
    
    metrics = {
        'precision_at_1': precision_at_1,
        'recall_at_top3': recall_at_top3,
        'fpr_at_threshold_90': fpr_at_90,
        'fpr_at_threshold_80': fpr_at_80,
        'latency_p50_ms': latency_p50,
        'latency_p95_ms': latency_p95,
        'latency_p99_ms': latency_p99,
        'num_queries': len(results_df),
        'num_with_ground_truth': len(queries_with_truth),
        'num_non_matches': len(non_matches)
    }
    
    return {
        'metrics': metrics,
        'results': results_df,
        'latencies': latencies
    }

# Run evaluation
print("Starting evaluation...")
evaluation_results = evaluate_screening_system(
    labeled_test_set,
    sanctions_index,
    first_token_index,
    bucket_index,
    initials_index,
    top_k=3
)

metrics = evaluation_results['metrics']
results_df = evaluation_results['results']

# Display results
print("\nEvaluation Results")
print("-"*60)
print(f"\nPrecision@1:        {metrics['precision_at_1']:.1%}")
print(f"Recall@top3:         {metrics['recall_at_top3']:.1%}")
print(f"FPR @ threshold 0.90: {metrics['fpr_at_threshold_90']:.1%}")
print(f"FPR @ threshold 0.80: {metrics['fpr_at_threshold_80']:.1%}")
print(f"\nLatency Statistics:")
print(f"  p50: {metrics['latency_p50_ms']:.2f} ms")
print(f"  p95: {metrics['latency_p95_ms']:.2f} ms")
print(f"  p99: {metrics['latency_p99_ms']:.2f} ms")
print(f"\nTotal queries:      {metrics['num_queries']:,}")
print(f"  With ground truth: {metrics['num_with_ground_truth']:,}")
print(f"  Non-matches:       {metrics['num_non_matches']:,}")

# Check if targets are met
print("\nTarget Validation")
print("-"*60)
targets_met = {
    'Precision@1 ≥ 95%': metrics['precision_at_1'] >= 0.95,
    'Recall@top3 ≥ 98%': metrics['recall_at_top3'] >= 0.98,
    'Latency p95 < 50ms': metrics['latency_p95_ms'] < 50.0
}

for target, met in targets_met.items():
    status = "PASS" if met else "FAIL"
    print(f"{status} - {target}")

all_targets_met = all(targets_met.values())
print(f"\nOverall: {'All targets met' if all_targets_met else 'Some targets not met'}")

Starting evaluation...
Evaluating screening system...
Processing 250 queries...

  Processed 50 queries...
  Processed 100 queries...
  Processed 150 queries...
  Processed 200 queries...
  Processed 250 queries...

Evaluation Results
------------------------------------------------------------

Precision@1:        97.5%
Recall@top3:         98.0%
FPR @ threshold 0.90: 100.0%
FPR @ threshold 0.80: 100.0%

Latency Statistics:
  p50: 23.56 ms
  p95: 49.63 ms
  p99: 130.92 ms

Total queries:      250
  With ground truth: 200
  Non-matches:       50

Target Validation
------------------------------------------------------------
PASS - Precision@1 ≥ 95%
PASS - Recall@top3 ≥ 98%
PASS - Latency p95 < 50ms

Overall: All targets met

**Note on the false positive rate:** The 100% FPR at both thresholds follows from how the negative queries are built rather than from the matcher's behavior on unrelated names. Each `non_match` query is itself a name taken from the sanctions index, so screening finds a high-scoring candidate for it, and the top score clears both the 0.80 and 0.90 thresholds. Measuring a meaningful FPR requires negative queries that are absent from the index.
