# Sanctions Screening Evaluation

- **Purpose:** Evaluate sanctions screening accuracy and validate precision/recall targets
- **Author:** Devbrew LLC  
- **Last Updated:** November 27, 2025  
- **Status:** In progress  
- **License:** Apache 2.0

## Overview

This notebook implements the evaluation protocol for the sanctions screening module. The evaluation measures matching accuracy through a labeled test set and validates that the system meets production accuracy targets.

**Evaluation Metrics:**
- Precision@1: Percentage of queries where the top candidate is the correct match (target: ≥95%)
- Recall@top3: Percentage of queries where the ground-truth match appears in the top 3 candidates (target: ≥98%)
- False Positive Rate: Percentage of non-matches incorrectly flagged as matches
- Decision Accuracy: Alignment between predicted and expected decision categories

The evaluation checks whether the screening system identifies sanctioned entities correctly while keeping false positives low enough for production use.
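To make the two ranking metrics concrete, here is a minimal sketch of how Precision@1 and Recall@top3 can be computed from ranked candidate lists. The UIDs and rankings are made-up toy data, and the helper names `precision_at_1` and `recall_at_k` are illustrative, not part of the screening module.

```python
from typing import List, Sequence


def precision_at_1(ranked_uids: Sequence[List[str]], truth: Sequence[str]) -> float:
    """Fraction of queries whose top-ranked UID equals the ground-truth UID."""
    hits = sum(1 for ranked, uid in zip(ranked_uids, truth) if ranked and ranked[0] == uid)
    return hits / len(truth)


def recall_at_k(ranked_uids: Sequence[List[str]], truth: Sequence[str], k: int = 3) -> float:
    """Fraction of queries whose ground-truth UID appears in the top-k candidates."""
    hits = sum(1 for ranked, uid in zip(ranked_uids, truth) if uid in ranked[:k])
    return hits / len(truth)


# Toy data (hypothetical UIDs): one ranked candidate list per query
ranked = [["A", "B", "C"], ["X", "B", "Y"], ["Q", "R", "S"], []]
truth = ["A", "B", "Z", "D"]

p_at_1 = precision_at_1(ranked, truth)   # only the first query has the right top-1
r_at_3 = recall_at_k(ranked, truth, 3)   # the first two queries contain the truth in top 3
print(p_at_1, r_at_3)
```

Recall@top3 can never be lower than Precision@1 on the same queries, because a correct top-1 candidate is also in the top 3.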

## Setup: Artifacts and Functions

The evaluation loads artifacts generated by the implementation pipeline:

- **Sanctions Index**: Canonicalized names and metadata (`sanctions_index.parquet`)
- **Blocking Indices**: Inverted indices for candidate retrieval (`blocking_indices.json`)
- **Metadata**: Version tracking and dataset statistics

Helper functions for text normalization, tokenization, and screening are loaded to enable independent evaluation runs without re-executing the full implementation pipeline.

### Environment Configuration

We configure the Python environment with standardized settings, import required libraries, and set a fixed random seed for reproducibility. This ensures consistent evaluation results across runs.
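As a quick illustration of why the fixed seed matters, the sketch below reseeds Python's `random` module twice with the same value and checks that both runs produce the same draws. The variable names are illustrative; only the seed value 42 comes from this notebook.

```python
import random

SEED = 42  # same value as RANDOM_STATE in this notebook

random.seed(SEED)
first_draw = [random.randint(0, 9) for _ in range(5)]

random.seed(SEED)  # reseeding restarts the generator from the same state
second_draw = [random.randint(0, 9) for _ in range(5)]

same_sequence = first_draw == second_draw
print(same_sequence)
```

Any code that calls `random.seed` again later resets this state, so the order of calls also affects what is reproduced.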

In [27]:
import sys
import warnings
from pathlib import Path
import json
import random
from typing import Dict, Any, List, Optional, Tuple
import time
import pickle
from datetime import datetime


import pandas as pd
import numpy as np
import rapidfuzz as rf
from rapidfuzz import fuzz, process



# Configuration
warnings.filterwarnings("ignore")
pd.set_option("display.max_columns", 100)
pd.set_option("display.max_rows", 100)
pd.set_option("display.float_format", '{:.2f}'.format)

# Reproducibility
RANDOM_STATE = 42
random.seed(RANDOM_STATE)
np.random.seed(RANDOM_STATE)

print("Environment configured successfully")
print(f" pandas: {pd.__version__}")
print(f" numpy: {np.__version__}")
print(f" rapidfuzz: {rf.__version__}")

Environment configured successfully
 pandas: 2.3.3
 numpy: 2.3.3
 rapidfuzz: 3.14.1


### Load Artifacts

The evaluation loads pre-computed artifacts from the implementation pipeline. The sanctions index contains 39,350 canonicalized name records with metadata. The blocking indices are inverted indices that map a blocking key to a list of row positions, so retrieving the candidates for a key is a single dictionary lookup (O(1) on average).
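The idea behind an inverted blocking index can be sketched on a toy list of names. This is only an illustration under assumed data: the three names and the helper `build_first_token_index` are hypothetical, and the real indices are built in `04_sanctions_screening.ipynb`.

```python
from collections import defaultdict
from typing import Dict, List

# Hypothetical, already-normalized names; list position plays the role of the row position
toy_names = ["banco nacional cuba", "banco internacional", "john doe"]


def build_first_token_index(names: List[str]) -> Dict[str, List[int]]:
    """Map each first token to the positions of all names that start with it."""
    index: Dict[str, List[int]] = defaultdict(list)
    for position, name in enumerate(names):
        tokens = name.split()
        if tokens:
            index[tokens[0]].append(position)
    return dict(index)


toy_index = build_first_token_index(toy_names)

# Candidate retrieval is one dictionary lookup, independent of the number of names
banco_candidates = toy_index.get("banco", [])
unknown_candidates = toy_index.get("zzz", [])
print(banco_candidates, unknown_candidates)
```

Only the names that share the key are returned, which is what keeps the later fuzzy-scoring step from touching every record.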

In [2]:
# Path configuration
PROJECT_ROOT = Path.cwd().parent if Path.cwd().name == "notebooks" else Path.cwd()
sys.path.insert(0, str(PROJECT_ROOT))
MODELS_DIR = PROJECT_ROOT / "packages" / "models"
DATA_DIR = PROJECT_ROOT / "data_catalog" / "processed"


print("Loading artifacts...\n")

# Load sanctions index
sanctions_index_path = MODELS_DIR / "sanctions_index.parquet"
if not sanctions_index_path.exists():
    raise FileNotFoundError(f"Sanctions index not found: {sanctions_index_path}\n"
                          f"Please run notebooks/04_sanctions_screening.ipynb first to generate artifacts.")

sanctions_index = pd.read_parquet(sanctions_index_path)
print(f"Loaded sanctions index: {len(sanctions_index):,} records")

# Load blocking indices
blocking_indices_path = MODELS_DIR / "blocking_indices.json"
if not blocking_indices_path.exists():
    raise FileNotFoundError(f"Blocking indices not found: {blocking_indices_path}\n"
                          f"Please run notebooks/04_sanctions_screening.ipynb first to generate artifacts.")

with open(blocking_indices_path, 'r') as f:
    blocking_indices = json.load(f)

first_token_index = {k: v for k, v in blocking_indices['first_token'].items()}
bucket_index = {k: v for k, v in blocking_indices['bucket'].items()}
initials_index = {k: v for k, v in blocking_indices['initials'].items()}

print(f"Loaded blocking indices:")
print(f" - First token index: {len(first_token_index):,} keys")
print(f" - Bucket index: {len(bucket_index):,} keys")
print(f" - Initials index: {len(initials_index):,} keys")

# Load metadata (optional, for version tracking)
metadata_path = MODELS_DIR / "sanctions_index_metadata.json"
if metadata_path.exists():
    with open(metadata_path, 'r') as f:
        sanctions_index_metadata = json.load(f)
    print(f"\nLoaded metadata: version {sanctions_index_metadata.get('created_at', 'unknown')}")
else:
    sanctions_index_metadata = {}
    print("[Warning] Metadata not found (optional)")

print(f"\nAll artifacts loaded successfully")

Loading artifacts...

Loaded sanctions index: 39,350 records
Loaded blocking indices:
 - First token index: 15,597 keys
 - Bucket index: 4 keys
 - Initials index: 15,986 keys

Loaded metadata: version 2025-11-17T06:00:56.218723

All artifacts loaded successfully


### Helper Functions

Text normalization and tokenization functions are imported from the shared `packages.compliance.sanctions` module. This module provides standardized functions used by both `04_sanctions_screening.ipynb` and this evaluation notebook, ensuring consistency across the screening pipeline.

The shared functions include:
- `normalize_text()`: Text normalization for robust fuzzy matching
- `tokenize()`: Tokenization with stopword filtering
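For readers without access to the shared module, the following is a rough sketch of what helpers of this kind typically do. It is an assumption, not the actual code in `packages.compliance.sanctions`: the function names carry a `_sketch` suffix, and the stopword list and the exact normalization steps are hypothetical.

```python
import re
import unicodedata
from typing import List

# Hypothetical stopword list; the real module may use a different one
STOPWORDS_SKETCH = {"de", "la", "the", "of"}


def normalize_text_sketch(text: str) -> str:
    """Strip accents, lowercase, replace punctuation with spaces, collapse whitespace."""
    decomposed = unicodedata.normalize("NFKD", text)
    no_accents = "".join(ch for ch in decomposed if not unicodedata.combining(ch))
    lowered = no_accents.lower()
    alnum_only = re.sub(r"[^a-z0-9\s]", " ", lowered)
    return re.sub(r"\s+", " ", alnum_only).strip()


def tokenize_sketch(text: str) -> List[str]:
    """Split normalized text on whitespace and drop stopwords."""
    return [token for token in text.split() if token not in STOPWORDS_SKETCH]


normalized = normalize_text_sketch("Banco Nacional de Cuba")
tokens = tokenize_sketch(normalized)
print(normalized, tokens)
```

The evaluation below always uses the imported `normalize_text` and `tokenize`, so queries are processed exactly as the index was built.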

In [3]:
from packages.compliance.sanctions import (
    normalize_text,
    tokenize
)

# Verify imports work
print("Helper functions imported successfully")
print(f"  - normalize_text: {normalize_text.__name__}")
print(f"  - tokenize: {tokenize.__name__}")

Helper functions imported successfully
  - normalize_text: normalize_text
  - tokenize: tokenize


## Create Labeled Test Set

To evaluate the screening system's accuracy, we need a labeled test set with known ground truth matches. This test set will include:

- **Positive examples**: Query variations of names in the sanctions index (exact matches, normalized versions, case variations, typos)
- **Negative examples**: Names that should NOT match any sanctions record (to test false positive rate)

We sample names at random from the sanctions index and generate four query variations per name plus one synthetic non-match. This approach allows us to measure:
- **Precision@1**: How often the top candidate is the correct match
- **Recall@top3**: How often the ground truth appears in the top 3 results
- **False Positive Rate**: How often non-matches are incorrectly flagged

In [4]:
# Function to introduce typos
def _introduce_typo(name: str, n_typos: int = 1) -> str:
    """
    Introduce minor typos for testing robustness.
    
    Randomly replaces or removes characters to simulate real-world
    data entry errors.
    """
    chars = list(name)
    for _ in range(n_typos):
        if len(chars) > 0:
            idx = random.randint(0, len(chars) - 1)
            # Replace with random character or remove
            if random.random() < 0.5:
                chars[idx] = random.choice('abcdefghijklmnopqrstuvwxyz')
            else:
                chars.pop(idx)
    return ''.join(chars)


def _generate_non_match_name(counter: int, random_state: Optional[int] = None) -> str:
    """
    Generate a synthetic name that is very unlikely to match any sanctions entry.

    Uses clearly synthetic names (e.g. "TEST USER 123") so that negative examples
    are true non-matches when measuring the false positive rate.

    Note: when random_state is given, this reseeds the global `random` module
    with random_state + counter.

    Args:
        counter: Offset added to the seed so each call can draw a different name
        random_state: Random seed for reproducibility

    Returns:
        Synthetic name intended not to match any sanctions entry
    """
    if random_state is not None:
        random.seed(random_state + counter)
    
    # Use clearly synthetic names that won't match
    # This ensures true non-matches for accurate false positive rate measurement
    prefixes = ['TEST', 'EVAL', 'NONMATCH', 'SAMPLE']
    suffixes = ['USER', 'NAME', 'PERSON', 'ENTITY']
    numbers = random.randint(100, 999)
    
    return f"{random.choice(prefixes)} {random.choice(suffixes)} {numbers}"

# Function to create labeled test set
def create_labeled_test_set(
    sanctions_index: pd.DataFrame,
    n_samples: int = 80,
    random_state: int = 42
) -> pd.DataFrame:
    """
    Create a labeled test set with ground truth matches.
    
    Samples names from sanctions index and creates query variations
    with known ground truth matches.
    
    Args:
        sanctions_index: DataFrame with sanctions records
        n_samples: Number of base names to sample
        random_state: Random seed for reproducibility
        
    Returns:
        DataFrame with test queries and ground truth labels
    """
    np.random.seed(random_state)
    random.seed(random_state)
    
    # Sample diverse names from sanctions index
    # Note: For a production system, you might want stratified sampling
    # by country/program/entity_type, but for this case study we keep it simple
    sampled = sanctions_index.sample(
        n=min(n_samples, len(sanctions_index)),
        random_state=random_state
    )
    
    # Validate UIDs exist
    valid_uids = set(sanctions_index['uid'].values)
    
    # Get all normalized names for quick lookup
    all_normalized_names = set(sanctions_index['name_norm'].str.lower().values)
    
    test_queries = []
    non_match_counter = 0  # Counter for generating unique non-matches
    
    for _, row in sampled.iterrows():
        original_name = row['name']
        uid = row['uid']
        
        # Skip if UID is invalid
        if uid not in valid_uids:
            continue
        
        # Create query variations to test different matching scenarios
        variations = [
            # Exact match (should score very high)
            {
                'query': original_name,
                'ground_truth_uid': uid,
                'expected_score_min': 0.95,
                'variation_type': 'exact'
            },
            # Normalized version (tests normalization pipeline)
            {
                'query': row['name_norm'],
                'ground_truth_uid': uid,
                'expected_score_min': 0.90,
                'variation_type': 'normalized'
            },
            # Case variation (tests case-insensitive matching)
            {
                'query': original_name.upper(),
                'ground_truth_uid': uid,
                'expected_score_min': 0.90,
                'variation_type': 'case'
            },
            # Minor typo (tests robustness to errors)
            {
                'query': _introduce_typo(original_name, n_typos=1),
                'ground_truth_uid': uid,
                'expected_score_min': 0.85,
                'variation_type': 'typo'
            }
        ]
        
        # Add negative examples (non-matches) to test false positive rate
        # Generate a truly non-matching name
        non_match_name = _generate_non_match_name(
            counter=non_match_counter,
            random_state=random_state
        )
        non_match_counter += 1
        
        
        variations.append({
            'query': non_match_name,
            'ground_truth_uid': None, # No match expected
            'expected_score_min': None,
            'variation_type': 'non_match'
        })
        
        test_queries.extend(variations)
    
    # Create DataFrame
    test_df = pd.DataFrame(test_queries)
    
    # Validate all ground truth UIDs
    if len(test_df) > 0:
        invalid_uids = test_df[
            test_df['ground_truth_uid'].notna() & 
            ~test_df['ground_truth_uid'].isin(valid_uids)
        ]
        if len(invalid_uids) > 0:
            raise ValueError(f"Found {len(invalid_uids)} invalid ground truth UIDs")
    
    # Add metadata
    test_df['query_id'] = range(len(test_df))
    test_df['created_at'] = pd.Timestamp.now()
    
    return test_df


# Create test set
print("Creating labeled test set...")
labeled_test_set = create_labeled_test_set(
    sanctions_index,
    n_samples=50,  # 50 names × 5 queries (4 variations + 1 non-match) = 250 queries
    random_state=RANDOM_STATE
)

print(f"\nCreated {len(labeled_test_set):,} test queries")
print(f" - With ground truth: {labeled_test_set['ground_truth_uid'].notna().sum():,}")
print(f" - Non-matches: {labeled_test_set['ground_truth_uid'].isna().sum():,}")

# Show distribution of variation types
if 'variation_type' in labeled_test_set.columns:
    print(f"\nVariation type distribution:")
    for var_type, count in labeled_test_set['variation_type'].value_counts().items():
        print(f" - {var_type}: {count:,}")

# Save test set for reproducibility
test_set_path = DATA_DIR / "sanctions_eval_labels.csv"
test_set_path.parent.mkdir(parents=True, exist_ok=True)
labeled_test_set.to_csv(test_set_path, index=False)
print(f"\nSaved test set to: {test_set_path}")

Creating labeled test set...

Created 250 test queries
 - With ground truth: 200
 - Non-matches: 50

Variation type distribution:
 - exact: 50
 - normalized: 50
 - case: 50
 - typo: 50
 - non_match: 50

Saved test set to: /Users/joekariuki/Documents/Devbrew/research/devbrew-payments-fraud-sanctions/data_catalog/processed/sanctions_eval_labels.csv


## Screening Function for Evaluation

To evaluate the screening system, we need to implement the same screening logic used in the production pipeline. This ensures our evaluation results accurately reflect how the system will perform in production.

**Implementation Approach:**

For this case study, we keep the blocking, scoring, and ranking functions inline in the notebook rather than extracting them to a shared module; only `normalize_text()` and `tokenize()` come from the shared package. This makes the evaluation notebook largely self-contained and easier to follow, demonstrating the complete evaluation flow in one place.

The screening pipeline consists of:
1. **Blocking**: Retrieve candidate records using blocking indices (first token, token bucket, initials)
2. **Scoring**: Compute similarity scores for each candidate using RapidFuzz
3. **Ranking**: Sort candidates by composite score and return top-K results

This matches the implementation in `04_sanctions_screening.ipynb` to ensure consistent evaluation results.
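Before the full implementation, the three stages can be sketched end to end on toy data. This is a simplified illustration, not the notebook's method: the records are hypothetical, only first-token blocking is shown, and `difflib.SequenceMatcher` stands in for the RapidFuzz scorers so the snippet needs no third-party packages.

```python
from difflib import SequenceMatcher
from typing import Dict, List, Tuple

# Hypothetical normalized records
toy_records = [
    "banco nacional cuba",
    "banco internacional desarrollo",
    "instituto nacional turismo cuba",
    "john doe",
]

# Toy first-token blocking index: first token -> record positions
toy_first_token_index: Dict[str, List[int]] = {}
for position, record in enumerate(toy_records):
    toy_first_token_index.setdefault(record.split()[0], []).append(position)


def screen_toy(query: str, top_k: int = 3) -> List[Tuple[str, float]]:
    """Block on the first token, score each candidate, and return the top-k by score."""
    tokens = query.split()
    if not tokens:
        return []
    # 1. Blocking: only records sharing the first token are considered
    candidate_positions = toy_first_token_index.get(tokens[0], [])
    # 2. Scoring: similarity in [0, 1] (stand-in for the RapidFuzz composite score)
    scored = [
        (toy_records[pos], SequenceMatcher(None, query, toy_records[pos]).ratio())
        for pos in candidate_positions
    ]
    # 3. Ranking: highest score first, truncated to top_k
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:top_k]


toy_results = screen_toy("banco nacional cuba")
for name, score in toy_results:
    print(f"{name}: {score:.3f}")
```

Note that "instituto nacional turismo cuba" is never scored here because it fails the single blocking key. The real pipeline reduces this risk by combining three blocking strategies.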

In [5]:
# Blocking helper functions (matching implementation from sanctions screening)

def get_first_token(tokens: List[str]) -> str:
    """Extract first token for prefix blocking."""
    return tokens[0] if tokens else ""

def get_token_count_bucket(tokens: List[str]) -> str:
    """
    Bucket names by token count for length-based blocking.
    
    Groups:
    - "tiny": 0-1 tokens
    - "small": 2 tokens  
    - "medium": 3-4 tokens
    - "large": 5+ tokens
    """
    count = len(tokens)
    if count <= 1:
        return "tiny"
    elif count == 2:
        return "small"
    elif count <= 4:
        return "medium"
    else:
        return "large"

def get_initials_signature(tokens: List[str]) -> str:
    """
    Create initials signature from first letter of each token.
    
    Examples:
        ['john', 'doe'] → 'j-d'
        ['al', 'qaida'] → 'a-q'
        ['banco', 'nacional', 'cuba'] → 'b-n-c'
    """
    if not tokens:
        return ""
    return "-".join(t[0] for t in tokens if t)

def get_candidates_eval(
    query_name: str,
    first_token_index: Dict[str, List[int]],
    bucket_index: Dict[str, List[int]],
    initials_index: Dict[str, List[int]],
    max_candidates: Optional[int] = None
) -> Tuple[List[int], Dict[int, int]]:
    """
    Get candidate indices using blocking keys with prioritization.
    
    Returns:
        Tuple of (candidate_list, priority_scores)
        - candidate_list: Sorted list of candidate indices
        - priority_scores: Dict mapping candidate index to priority (higher = appears in more strategies)
    """
    # Normalize and tokenize query
    query_norm = normalize_text(query_name)
    query_tokens = tokenize(query_norm)
    
    if not query_tokens:
        return [], {}
    
    # Extract blocking keys using helper functions
    query_first = get_first_token(query_tokens)
    query_bucket = get_token_count_bucket(query_tokens)
    query_initials = get_initials_signature(query_tokens)
    
    # Collect candidates from each strategy separately
    first_token_candidates = set()
    bucket_candidates = set()
    initials_candidates = set()
    
    # Strategy 1: First token match (most specific)
    if query_first in first_token_index:
        first_token_candidates.update(first_token_index[query_first])
    
    # Strategy 2: Token bucket match (less specific - can be huge)
    if query_bucket in bucket_index:
        bucket_candidates.update(bucket_index[query_bucket])
    
    # Strategy 3: Initials match (moderately specific)
    if query_initials in initials_index:
        initials_candidates.update(initials_index[query_initials])
    
    # Count how many strategies each candidate appears in (priority)
    candidate_counts = {}
    all_candidates = first_token_candidates | bucket_candidates | initials_candidates
    
    for idx in all_candidates:
        count = 0
        if idx in first_token_candidates:
            count += 3  # First token is most specific - weight higher
        if idx in initials_candidates:
            count += 2  # Initials are moderately specific
        if idx in bucket_candidates:
            count += 1  # Bucket is least specific
        candidate_counts[idx] = count
    
    # Sort by priority (higher priority first), then by index
    candidate_list = sorted(all_candidates, key=lambda x: (-candidate_counts[x], x))
    
    # For very large candidate sets, prioritize high-priority candidates
    # Candidates appearing in multiple strategies are more likely to be matches
    if max_candidates is not None and len(candidate_list) > max_candidates:
        # Take top priority candidates first
        candidate_list = candidate_list[:max_candidates]
    
    return candidate_list, candidate_counts


def compute_similarity_batch(
    query_norm: str,
    candidate_norm_list: List[str]
) -> Tuple[np.ndarray, np.ndarray, np.ndarray]:
    """
    Compute similarity scores for multiple candidates in batch.
    
    Uses rapidfuzz.process.cdist for vectorized scoring, which is much faster
    than looping through candidates individually.

    Mirrors the batch similarity computation in 04_sanctions_screening.ipynb
    so that evaluation results stay consistent with the implementation.
    
    Args:
        query_norm: Normalized query name (full string)
        candidate_norm_list: List of candidate normalized strings
        
    Returns:
        Tuple of three numpy arrays (set, sort, partial scores) [0-100]
    """
    # Batch compute token_set_ratio
    set_scores = process.cdist(
        [query_norm],
        candidate_norm_list,
        scorer=fuzz.token_set_ratio,
        workers=4  # Parallel scoring on multi-core CPU
    )[0]

    # Batch compute token_sort_ratio
    sort_scores = process.cdist(
        [query_norm],
        candidate_norm_list,
        scorer=fuzz.token_sort_ratio,
        workers=4
    )[0]

    # Batch compute partial_ratio
    partial_scores = process.cdist(
        [query_norm],
        candidate_norm_list,
        scorer=fuzz.partial_ratio,
        workers=4
    )[0]

    return set_scores, sort_scores, partial_scores


def composite_score_batch(
    set_scores: np.ndarray,
    sort_scores: np.ndarray,
    partial_scores: np.ndarray
) -> np.ndarray:
    """
    Compute composite scores for batch of candidates.
    
    Uses vectorized numpy operations for efficiency.

    Mirrors the composite scoring in 04_sanctions_screening.ipynb
    so that evaluation results stay consistent with the implementation.

    Returns:
        Array of composite scores [0-1]
    """
    # Weighted average: 0.45 * set + 0.35 * sort + 0.20 * partial
    raw_scores = 0.45 * set_scores + 0.35 * sort_scores + 0.20 * partial_scores

    # Rescale to [0, 1]
    composite_scores = np.clip(raw_scores / 100.0, 0.0, 1.0)

    return composite_scores

def screen_query_eval(
    query_name: str,
    sanctions_index: pd.DataFrame,
    first_token_index: Dict[str, List[int]],
    bucket_index: Dict[str, List[int]],
    initials_index: Dict[str, List[int]],
    top_k: int = 3,
    initial_candidates: int = 2000,  # Score this many first
    expand_threshold: float = 0.85,  # If top score < this, expand
    max_candidates: int = 3000,  # Max to score if expanding
    early_exit_threshold: float = 0.60  # If top score < this, don't expand (clear non-match)
) -> List[Dict[str, Any]]:
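    """
    Screen a query name and return top-K candidates with scores.

    Uses two-stage adaptive scoring:
    1. Score the top `initial_candidates` (default 2000) priority candidates
    2. If the top score is at least `early_exit_threshold` (default 0.60) but below
       `expand_threshold` (default 0.85), expand to at most `max_candidates`
       (default 3000) candidates and re-score
    This balances latency and recall.

    Early exit: If top score < `early_exit_threshold` (default 0.60), don't expand (clear non-match).
    """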
    # Normalize and tokenize query
    query_norm = normalize_text(query_name)
    query_tokens = tokenize(query_norm)
    
    if not query_tokens:
        return []
    
    # Get candidates with prioritization
    candidate_indices, priority_scores = get_candidates_eval(
        query_name,
        first_token_index,
        bucket_index,
        initials_index,
        max_candidates=None
    )
    
    if not candidate_indices:
        return []
    
    # Stage 1: Score top priority candidates
    # Always include all high-priority candidates (priority >= 3)
    high_priority_candidates = [
        idx for idx in candidate_indices 
        if priority_scores.get(idx, 0) >= 3
    ]
    
    # Build initial candidate set
    if len(high_priority_candidates) > 0:
        remaining_slots = initial_candidates - len(high_priority_candidates)
        if remaining_slots > 0:
            other_candidates = [
                idx for idx in candidate_indices 
                if idx not in high_priority_candidates
            ][:remaining_slots]
            candidates_to_score = high_priority_candidates + other_candidates
        else:
            candidates_to_score = high_priority_candidates[:initial_candidates]
    else:
        candidates_to_score = candidate_indices[:initial_candidates]
    
    # Pre-extract candidate strings
    candidate_norm_list = []
    candidate_metadata = []
    
    for idx in candidates_to_score:
        try:
            candidate = sanctions_index.iloc[idx]
            candidate_norm_list.append(candidate['name_norm'])
            candidate_metadata.append({
                'uid': candidate['uid'],
                'name': candidate['name'],
                'country': candidate.get('country'),
                'program': candidate.get('program'),
                'source': candidate.get('source')
            })
        except (IndexError, KeyError):
            continue
    
    if not candidate_norm_list:
        return []
    
    # Stage 1: Batch score initial candidates
    set_scores, sort_scores, partial_scores = compute_similarity_batch(
        query_norm,
        candidate_norm_list
    )
    
    composite_scores = composite_score_batch(set_scores, sort_scores, partial_scores)
    
    # Check if we need to expand (two-stage approach)
    top_score = float(np.max(composite_scores)) if len(composite_scores) > 0 else 0.0

    # Early exit: If top score is very low, it's clearly a non-match - don't expand
    if top_score < early_exit_threshold:
        # This is clearly a non-match, no need to expand
        pass  # Use Stage 1 results as-is
    
    # Stage 2: If top score is low but not too low, expand candidate set
    elif top_score < expand_threshold and len(candidate_indices) > initial_candidates:
        # Expand to include more candidates
        additional_needed = max_candidates - len(candidates_to_score)
        if additional_needed > 0:
            additional_candidates = [
                idx for idx in candidate_indices 
                if idx not in candidates_to_score
            ][:additional_needed]
            
            # Add additional candidates
            for idx in additional_candidates:
                try:
                    candidate = sanctions_index.iloc[idx]
                    candidate_norm_list.append(candidate['name_norm'])
                    candidate_metadata.append({
                        'uid': candidate['uid'],
                        'name': candidate['name'],
                        'country': candidate.get('country'),
                        'program': candidate.get('program'),
                        'source': candidate.get('source')
                    })
                except (IndexError, KeyError):
                    continue
            
            # Re-score expanded set
            set_scores, sort_scores, partial_scores = compute_similarity_batch(
                query_norm,
                candidate_norm_list
            )
            composite_scores = composite_score_batch(set_scores, sort_scores, partial_scores)
    
    # Combine scores with metadata
    scored_candidates = []
    for i, metadata in enumerate(candidate_metadata):
        scored_candidates.append({
            **metadata,
            'score': float(composite_scores[i]),
            'sim_set': float(set_scores[i]) / 100.0,
            'sim_sort': float(sort_scores[i]) / 100.0,
            'sim_partial': float(partial_scores[i]) / 100.0
        })
    
    # Sort by score (descending) and return top-K
    scored_candidates.sort(key=lambda x: x['score'], reverse=True)
    return scored_candidates[:top_k]


# Test the screening function
print("Testing screening function...")
test_query = "BANCO NACIONAL DE CUBA"
results = screen_query_eval(
    test_query,
    sanctions_index,
    first_token_index,
    bucket_index,
    initials_index,
    top_k=3
)

print(f"\nQuery: '{test_query}'")
print(f"Found {len(results)} candidates:")
for i, result in enumerate(results, 1):
    print(f"  {i}. {result['name']} (score: {result['score']:.3f}, uid: {result['uid']})")

Testing screening function...

Query: 'BANCO NACIONAL DE CUBA'
Found 3 candidates:
  1. BANCO NACIONAL DE CUBA (score: 1.000, uid: SDN_306)
  2. INSTITUTO NACIONAL DE TURISMO DE CUBA (score: 0.705, uid: SDN_1042)
  3. BANCO INTERNACIONAL DE DESARROLLO, C.A. (score: 0.685, uid: SDN_25646)
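
The scores printed above are composites: a weighted average of the three RapidFuzz similarities (0.45 for token set, 0.35 for token sort, 0.20 for partial), rescaled from 0-100 to 0-1. The following worked example repeats that arithmetic for a single candidate. The component scores 100, 80, and 90 are made up for illustration, and `composite_score_single` is a hypothetical scalar counterpart of `composite_score_batch`.

```python
def composite_score_single(set_score: float, sort_score: float, partial_score: float) -> float:
    """Weighted average of three 0-100 similarity scores, rescaled and clipped to [0, 1]."""
    raw = 0.45 * set_score + 0.35 * sort_score + 0.20 * partial_score
    return min(max(raw / 100.0, 0.0), 1.0)


# Hypothetical component scores for one candidate:
# 0.45 * 100 + 0.35 * 80 + 0.20 * 90 = 45 + 28 + 18 = 91, i.e. about 0.91 after rescaling
example_score = composite_score_single(100.0, 80.0, 90.0)
print(f"{example_score:.2f}")
```

With the decision thresholds used later in this notebook, a composite of about 0.91 would fall in the `is_match` band (score ≥ 0.90).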


## Compute Evaluation Metrics

Now that we have a labeled test set and screening functions, we can evaluate the system's performance. We'll run each query through the screening pipeline and compute key metrics:

**Metrics to Compute:**
- **Precision@1**: Percentage of queries where the top candidate is the correct match (target: ≥95%)
- **Recall@top3**: Percentage of queries where the ground truth match appears in the top 3 results (target: ≥98%)
- **False Positive Rate**: Percentage of non-matches incorrectly flagged as matches at different thresholds
- **Latency Statistics**: p50, p95, p99 latencies to validate performance targets

The evaluation function processes all queries, measures latency, and compares results against ground truth labels to compute these metrics.
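
The decision categories and the false positive rate can be sketched on a handful of made-up scores. The thresholds 0.90 and 0.80 are the ones used in the evaluation cell; the helper names and the toy scores are hypothetical.

```python
from typing import List


def decide(score: float, match_threshold: float = 0.90, review_threshold: float = 0.80) -> str:
    """Map a top-1 composite score to a decision category."""
    if score >= match_threshold:
        return "is_match"
    if score >= review_threshold:
        return "review"
    return "no_match"


def false_positive_rate(non_match_scores: List[float], threshold: float) -> float:
    """Share of known non-match queries whose top-1 score reaches the threshold."""
    if not non_match_scores:
        return 0.0
    flagged = sum(1 for score in non_match_scores if score >= threshold)
    return flagged / len(non_match_scores)


# Hypothetical top-1 scores for four queries that should NOT match anything
toy_non_match_scores = [0.42, 0.55, 0.83, 0.91]

toy_decisions = [decide(score) for score in toy_non_match_scores]
toy_fpr_90 = false_positive_rate(toy_non_match_scores, 0.90)  # counts is_match only
toy_fpr_80 = false_positive_rate(toy_non_match_scores, 0.80)  # counts is_match and review
print(toy_decisions, toy_fpr_90, toy_fpr_80)
```

Lowering the threshold can only keep the false positive rate the same or raise it, which is why the rate at 0.80 is reported alongside the rate at 0.90.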

In [6]:
## Compute evaluation metrics
def evaluate_screening_system(
    labeled_test_set: pd.DataFrame,
    sanctions_index: pd.DataFrame,
    first_token_index: Dict[str, List[int]],
    bucket_index: Dict[str, List[int]],
    initials_index: Dict[str, List[int]],
    top_k: int = 3
) -> Dict[str, Any]:
    """
    Evaluate screening system on labeled test set.
    
    Processes each query in the test set, screens it against the sanctions index,
    and compares results against ground truth labels to compute accuracy metrics.
    
    Returns:
        Dictionary with precision@1, recall@top3, FPR, latency stats, and detailed results
    """
    results = []
    latencies = []
    
    print("Evaluating screening system...")
    print(f"Processing {len(labeled_test_set):,} queries...\n")
    
    for idx, row in labeled_test_set.iterrows():
        query = row['query']
        ground_truth_uid = row['ground_truth_uid']
        
        # Screen query and measure latency
        # perf_counter is monotonic and high-resolution, so it suits short intervals
        start_time = time.perf_counter()
        candidates = screen_query_eval(
            query,
            sanctions_index,
            first_token_index,
            bucket_index,
            initials_index,
            top_k=top_k
        )
        latency_ms = (time.perf_counter() - start_time) * 1000
        latencies.append(latency_ms)
        
        # Check if ground truth is in results
        top1_match = candidates[0] if candidates else None
        top1_correct = (
            top1_match is not None and
            top1_match['uid'] == ground_truth_uid
        ) if pd.notna(ground_truth_uid) else None
        
        # Check if ground truth in top-K
        topk_uids = [c['uid'] for c in candidates]
        topk_match = (
            ground_truth_uid in topk_uids
        ) if pd.notna(ground_truth_uid) else None
        
        # Decision categories based on score thresholds
        if top1_match:
            score = top1_match['score']
            if score >= 0.90:
                decision = 'is_match'
            elif score >= 0.80:
                decision = 'review'
            else:
                decision = 'no_match'
        else:
            decision = 'no_match'
            score = 0.0
        
        results.append({
            'query_id': row['query_id'],
            'query': query,
            'ground_truth_uid': ground_truth_uid,
            'top1_uid': top1_match['uid'] if top1_match else None,
            'top1_score': score,
            'top1_correct': top1_correct,
            'topk_match': topk_match,
            'decision': decision,
            'latency_ms': latency_ms,
            'num_candidates': len(candidates)
        })
        
        # Progress indicator
        if (idx + 1) % 50 == 0:
            print(f"  Processed {idx + 1:,} queries...")
    
    results_df = pd.DataFrame(results)
    
    # Compute metrics
    # Precision@1: Of queries with ground truth, how many have correct top-1?
    queries_with_truth = results_df[results_df['ground_truth_uid'].notna()]
    precision_at_1 = queries_with_truth['top1_correct'].mean() if len(queries_with_truth) > 0 else 0.0
    
    # Recall@top3: Of queries with ground truth, how many have match in top-3?
    recall_at_top3 = queries_with_truth['topk_match'].mean() if len(queries_with_truth) > 0 else 0.0
    
    # False Positive Rate: Of non-matches, how many are flagged as matches?
    non_matches = results_df[results_df['ground_truth_uid'].isna()]
    fpr_at_90 = (
        (non_matches['decision'] == 'is_match').mean()
        if len(non_matches) > 0 else 0.0
    )
    fpr_at_80 = (
        ((non_matches['decision'] == 'is_match') | 
         (non_matches['decision'] == 'review')).mean()
        if len(non_matches) > 0 else 0.0
    )
    
    # Latency statistics
    latency_p50 = np.percentile(latencies, 50)
    latency_p95 = np.percentile(latencies, 95)
    latency_p99 = np.percentile(latencies, 99)
    
    metrics = {
        'precision_at_1': precision_at_1,
        'recall_at_top3': recall_at_top3,
        'fpr_at_threshold_90': fpr_at_90,
        'fpr_at_threshold_80': fpr_at_80,
        'latency_p50_ms': latency_p50,
        'latency_p95_ms': latency_p95,
        'latency_p99_ms': latency_p99,
        'num_queries': len(results_df),
        'num_with_ground_truth': len(queries_with_truth),
        'num_non_matches': len(non_matches)
    }
    
    return {
        'metrics': metrics,
        'results': results_df,
        'latencies': latencies
    }

# Run evaluation
print("Starting evaluation...")
evaluation_results = evaluate_screening_system(
    labeled_test_set,
    sanctions_index,
    first_token_index,
    bucket_index,
    initials_index,
    top_k=3
)

metrics = evaluation_results['metrics']
results_df = evaluation_results['results']

# Display results
print("\nEvaluation Results")
print("-"*60)
print(f"\nPrecision@1:        {metrics['precision_at_1']:.1%}")
print(f"Recall@top3:         {metrics['recall_at_top3']:.1%}")
print(f"FPR @ threshold 0.90: {metrics['fpr_at_threshold_90']:.1%}")
print(f"FPR @ threshold 0.80: {metrics['fpr_at_threshold_80']:.1%}")
print(f"\nLatency Statistics:")
print(f"  p50: {metrics['latency_p50_ms']:.2f} ms")
print(f"  p95: {metrics['latency_p95_ms']:.2f} ms")
print(f"  p99: {metrics['latency_p99_ms']:.2f} ms")
print(f"\nTotal queries:      {metrics['num_queries']:,}")
print(f"  With ground truth: {metrics['num_with_ground_truth']:,}")
print(f"  Non-matches:       {metrics['num_non_matches']:,}")

# Check if targets are met
print("\nTarget Validation")
print("-"*60)
targets_met = {
    'Precision@1 ≥ 95%': metrics['precision_at_1'] >= 0.95,
    'Recall@top3 ≥ 98%': metrics['recall_at_top3'] >= 0.98,
    'Latency p95 < 50ms': metrics['latency_p95_ms'] < 50.0
}

for target, met in targets_met.items():
    status = "PASS" if met else "FAIL"
    print(f"{status} - {target}")

all_targets_met = all(targets_met.values())
print(f"\nOverall: {'All targets met' if all_targets_met else 'Some targets not met'}")

Starting evaluation...
Evaluating screening system...
Processing 250 queries...

  Processed 50 queries...
  Processed 100 queries...
  Processed 150 queries...
  Processed 200 queries...
  Processed 250 queries...

Evaluation Results
------------------------------------------------------------

Precision@1:        97.5%
Recall@top3:         98.0%
FPR @ threshold 0.90: 0.0%
FPR @ threshold 0.80: 0.0%

Latency Statistics:
  p50: 23.58 ms
  p95: 46.39 ms
  p99: 96.92 ms

Total queries:      250
  With ground truth: 200
  Non-matches:       50

Target Validation
------------------------------------------------------------
PASS - Precision@1 ≥ 95%
PASS - Recall@top3 ≥ 98%
PASS - Latency p95 < 50ms

Overall: All targets met


## Error Analysis

Now that we have evaluation metrics, we need to understand *why* the system fails in some cases. Error analysis helps identify:

- **False Negatives**: Queries where the ground truth match should have been found but wasn't in the top-3 results
- **False Positives**: Non-matches that were incorrectly flagged as high-confidence matches
- **Failure Patterns**: Common characteristics of errors (e.g., specific name types, token counts, variation types)

This analysis guides improvements to blocking strategies, similarity scoring, or decision thresholds.
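
The error categories below depend on the two decision thresholds reported throughout this notebook (0.90 for `is_match`, 0.80 for `review`). The following sketch shows one way scores could be banded into decisions. It is an assumption, not the module's actual code: the function name `decide`, the `no_match` label, and the inclusive `>=` boundaries are illustrative.

```python
# Hypothetical banding of a similarity score into decision categories.
IS_MATCH_THRESHOLD = 0.90  # threshold reported in this notebook
REVIEW_THRESHOLD = 0.80    # threshold reported in this notebook


def decide(score: float) -> str:
    """Map a similarity score in [0, 1] to a decision label (assumed semantics)."""
    if score >= IS_MATCH_THRESHOLD:
        return "is_match"
    if score >= REVIEW_THRESHOLD:
        return "review"
    return "no_match"  # label name is an assumption


for s in (1.000, 0.846, 0.794):
    print(f"{s:.3f} -> {decide(s)}")
```

Under this banding, a non-match query only counts as a false positive at the 0.90 level if its top score reaches the `is_match` band; scores in the review band are tracked separately.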

In [7]:
# Function to analyze errors
def analyze_errors(
    results_df: pd.DataFrame, 
    sanctions_index: pd.DataFrame,
    labeled_test_set: pd.DataFrame
) -> Dict[str, Any]:
    """
    Analyze false positives and false negatives to identify failure modes.
    
    Args:
        results_df: DataFrame with evaluation results from evaluate_screening_system()
        sanctions_index: Full sanctions index for looking up names
        labeled_test_set: Original test set with variation types
        
    Returns:
        Dictionary with false negatives, false positives, and analysis
    """
    # Merge with original test set to get variation types
    results_with_variations = results_df.merge(
        labeled_test_set[['query_id', 'variation_type']],
        on='query_id',
        how='left'
    )
    
    # Ensure topk_match is boolean (it might be object type with True/False/None)
    if 'topk_match' in results_with_variations.columns:
        results_with_variations['topk_match'] = results_with_variations['topk_match'].fillna(False).astype(bool)
    
    queries_with_truth = results_with_variations[results_with_variations['ground_truth_uid'].notna()].copy()
    
    # False negatives: Ground truth not in top-3
    # Use .loc with boolean mask to avoid index issues
    false_negatives_mask = ~queries_with_truth['topk_match']
    false_negatives = queries_with_truth.loc[false_negatives_mask].copy()
    
    # False positives: Non-matches flagged as matches (is_match decision)
    non_matches = results_with_variations[results_with_variations['ground_truth_uid'].isna()].copy()
    false_positives = non_matches[non_matches['decision'] == 'is_match'].copy()
    
    # Also track non-matches flagged for review
    false_positives_review = non_matches[non_matches['decision'] == 'review'].copy()
    
    print("Error Analysis")
    print("-"*60)
    
    # False Negatives Analysis
    print(f"\nFalse Negatives (ground truth not in top-3): {len(false_negatives):,}")
    if len(false_negatives) > 0:
        print(f"  Percentage of queries with ground truth: {len(false_negatives) / len(queries_with_truth) * 100:.1f}%")
        
        # Analyze by variation type
        if 'variation_type' in false_negatives.columns:
            print("\n  False Negatives by Variation Type:")
            fn_by_type = false_negatives['variation_type'].value_counts()
            for var_type, count in fn_by_type.items():
                total_of_type = len(queries_with_truth[queries_with_truth['variation_type'] == var_type])
                pct = (count / total_of_type * 100) if total_of_type > 0 else 0
                print(f"    {var_type}: {count} ({pct:.1f}% of {var_type} queries)")
        
        # Show top false negatives
        print("\n  Top False Negatives:")
        for idx, row in false_negatives.head(10).iterrows():
            gt_uid = row['ground_truth_uid']
            gt_record = sanctions_index[sanctions_index['uid'] == gt_uid]
            gt_name = gt_record['name'].values[0] if len(gt_record) > 0 else "Unknown"
            
            print(f"\n    Query: '{row['query']}'")
            print(f"      Variation Type: {row.get('variation_type', 'unknown')}")
            print(f"      Expected: {gt_name} (uid: {gt_uid})")
            if pd.notna(row['top1_uid']):  # handles both None and NaN for queries with no candidates
                top1_record = sanctions_index[sanctions_index['uid'] == row['top1_uid']]
                top1_name = top1_record['name'].values[0] if len(top1_record) > 0 else "Unknown"
                print(f"      Top-1: {top1_name} (uid: {row['top1_uid']}, score: {row['top1_score']:.3f})")
            else:
                print(f"      Top-1: None (no candidates found)")
            print(f"      Latency: {row['latency_ms']:.2f} ms")
    else:
        print("   No false negatives! All ground truth matches found in top-3.")
    
    # False Positives Analysis
    print(f"\nFalse Positives (non-matches flagged as matches): {len(false_positives):,}")
    if len(non_matches) > 0:
        print(f"  Percentage of non-match queries: {len(false_positives) / len(non_matches) * 100:.1f}%")
    
    if len(false_positives) > 0:
        print("\n  Top False Positives:")
        for idx, row in false_positives.head(10).iterrows():
            print(f"\n    Query: '{row['query']}'")
            if pd.notna(row['top1_uid']):  # handles both None and NaN for queries with no candidates
                top1_record = sanctions_index[sanctions_index['uid'] == row['top1_uid']]
                top1_name = top1_record['name'].values[0] if len(top1_record) > 0 else "Unknown"
                print(f"      Flagged as: {top1_name} (uid: {row['top1_uid']}, score: {row['top1_score']:.3f})")
            print(f"      Decision: {row['decision']}")
    else:
        print("  No false positives at is_match threshold (0.90)!")
    
    # False Positives at Review Threshold
    print(f"\nNon-matches flagged for review (threshold 0.80): {len(false_positives_review):,}")
    if len(non_matches) > 0:
        print(f"  Percentage of non-match queries: {len(false_positives_review) / len(non_matches) * 100:.1f}%")
    
    # Summary Statistics
    print()
    print("Error Summary")
    print("-"*60)
    print(f"\nTotal Queries: {len(results_with_variations):,}")
    print(f"  With Ground Truth: {len(queries_with_truth):,}")
    print(f"  Non-Matches: {len(non_matches):,}")
    print(f"\nErrors:")
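    # Guard the percentages so an empty subset does not raise ZeroDivisionError
    fn_pct = len(false_negatives) / len(queries_with_truth) * 100 if len(queries_with_truth) > 0 else 0.0
    fp_pct = len(false_positives) / len(non_matches) * 100 if len(non_matches) > 0 else 0.0
    fp_review_pct = len(false_positives_review) / len(non_matches) * 100 if len(non_matches) > 0 else 0.0
    print(f"  False Negatives: {len(false_negatives):,} ({fn_pct:.1f}% of queries with ground truth)")
    print(f"  False Positives (is_match): {len(false_positives):,} ({fp_pct:.1f}% of non-matches)")
    print(f"  False Positives (review): {len(false_positives_review):,} ({fp_review_pct:.1f}% of non-matches)")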
    
    return {
        'false_negatives': false_negatives,
        'false_positives': false_positives,
        'false_positives_review': false_positives_review,
        'summary': {
            'total_queries': len(results_with_variations),
            'queries_with_truth': len(queries_with_truth),
            'non_matches': len(non_matches),
            'fn_count': len(false_negatives),
            'fn_rate': len(false_negatives) / len(queries_with_truth) if len(queries_with_truth) > 0 else 0,
            'fp_count': len(false_positives),
            'fp_rate': len(false_positives) / len(non_matches) if len(non_matches) > 0 else 0,
            'fp_review_count': len(false_positives_review),
            'fp_review_rate': len(false_positives_review) / len(non_matches) if len(non_matches) > 0 else 0
        }
    }

# Run error analysis
print("Running error analysis...\n")
error_analysis = analyze_errors(
    results_df,
    sanctions_index,
    labeled_test_set
)

# Store for later use
false_negatives = error_analysis['false_negatives']
false_positives = error_analysis['false_positives']
error_summary = error_analysis['summary']

Running error analysis...

Error Analysis
------------------------------------------------------------

False Negatives (ground truth not in top-3): 4
  Percentage of queries with ground truth: 2.0%

  False Negatives by Variation Type:
    typo: 4 (8.0% of typo queries)

  Top False Negatives:

    Query: 'jAWI ANSARI LTD'
      Variation Type: typo
      Expected: NAWI ANSARI LTD (uid: SDN_12581_alt_13506)
      Top-1: NEW ANSARI LTD (uid: SDN_12582, score: 0.794)
      Latency: 97.11 ms

    Query: 'HMAD, Dida'
      Variation Type: typo
      Expected: AHMAD, Dida (uid: SDN_42159_alt_65533)
      Top-1: JEDID, Milad (uid: SDN_29925, score: 0.660)
      Latency: 96.72 ms

    Query: 'OOOAGRO-REGION'
      Variation Type: typo
      Expected: OOO AGRO-REGION (uid: SDN_36658)
      Top-1: OOO REINOLDS (uid: SDN_50269_alt_78511, score: 0.547)
      Latency: 21.86 ms

    Query: 'CERESSHIPPING LIMITED'
      Variation Type: typo
      Expected: CERES SHIPPING LIMITED (uid: SDN_51354)
  

In [8]:
## Additional error analysis: Pattern Identification
# Analyze false negatives by score distribution
if len(false_negatives) > 0:
    print("False Negative Pattern Analysis")
    print("-"*60)
    
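    # Heuristic: a low top-1 score suggests the ground truth never reached the candidate set (blocking);
    # a high top-1 score suggests it was retrieved but outranked (ranking).
    # Rows with a missing top1_score (no candidates) fall into neither group.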
    low_score_fns = false_negatives[false_negatives['top1_score'] < 0.80]
    high_score_fns = false_negatives[false_negatives['top1_score'] >= 0.80]
    
    print(f"\nFalse Negatives with Top-1 Score < 0.80: {len(low_score_fns)}")
    print("  - Indicates blocking issues: ground truth not in candidate set")
    print("  - Action: Improve blocking strategies (add keys, handle edge cases)")
    
    print(f"\nFalse Negatives with Top-1 Score ≥ 0.80: {len(high_score_fns)}")
    print("  - Indicates ranking/scoring issues: ground truth in candidates but not top-3")
    print("  - Action: Adjust similarity weights or increase top_k")
    
    if len(low_score_fns) > len(high_score_fns):
        print(f"\nPrimary Issue: Blocking ({(len(low_score_fns) / len(false_negatives) * 100):.0f}% of FNs)")
    elif len(high_score_fns) > len(low_score_fns):
        print(f"\nPrimary Issue: Ranking/Scoring ({(len(high_score_fns) / len(false_negatives) * 100):.0f}% of FNs)")

# Analyze false positives by score distribution
if len(false_positives) > 0:
    print()
    print("False Positive Pattern Analysis")
    print("-"*60)
    
    # Key insight: Near-threshold FPs might benefit from threshold adjustment
    near_threshold_fps = false_positives[
        (false_positives['top1_score'] >= 0.90) & 
        (false_positives['top1_score'] < 0.95)
    ]
    high_confidence_fps = false_positives[false_positives['top1_score'] >= 0.95]
    
    print(f"\nFalse Positives with Score 0.90-0.95: {len(near_threshold_fps)}")
    print("  - Edge cases near decision boundary")
    print("  - Action: Consider threshold adjustment or expand review band")
    
    print(f"\nFalse Positives with Score ≥ 0.95: {len(high_confidence_fps)}")
    print("  - High-confidence false matches (very similar names)")
    print("  - Action: Investigate data quality or add disambiguation logic")

print()  
print("Error Analysis Complete")

False Negative Pattern Analysis
------------------------------------------------------------

False Negatives with Top-1 Score < 0.80: 4
  - Indicates blocking issues: ground truth not in candidate set
  - Action: Improve blocking strategies (add keys, handle edge cases)

False Negatives with Top-1 Score ≥ 0.80: 0
  - Indicates ranking/scoring issues: ground truth in candidates but not top-3
  - Action: Adjust similarity weights or increase top_k

Primary Issue: Blocking (100% of FNs)

Error Analysis Complete
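
One way to see why all four false negatives point to blocking: each typo alters or merges the first token, so a key derived from the first token or from token initials no longer agrees with the expected name. The sketch below assumes simplified key definitions (`first_token_key` and `initials_key` are hypothetical stand-ins, not the pipeline's actual blocking functions) and checks the four query/expected pairs from the error analysis output.

```python
from typing import List, Tuple


def tokens(name: str) -> List[str]:
    # Simplified normalization (assumption): uppercase, commas to spaces, whitespace split.
    return name.upper().replace(",", " ").split()


def first_token_key(name: str) -> str:
    toks = tokens(name)
    return toks[0] if toks else ""


def initials_key(name: str) -> str:
    return "".join(t[0] for t in tokens(name))


# (query, expected name) pairs taken from the false negatives listed in the output
pairs: List[Tuple[str, str]] = [
    ("jAWI ANSARI LTD", "NAWI ANSARI LTD"),
    ("HMAD, Dida", "AHMAD, Dida"),
    ("OOOAGRO-REGION", "OOO AGRO-REGION"),
    ("CERESSHIPPING LIMITED", "CERES SHIPPING LIMITED"),
]

shared_key_flags = []
for query, expected in pairs:
    same_first = first_token_key(query) == first_token_key(expected)
    same_initials = initials_key(query) == initials_key(expected)
    shared_key_flags.append(same_first or same_initials)
    print(f"{query!r:26} first_token={same_first} initials={same_initials}")
```

Under these assumed keys, none of the four pairs shares a blocking key, which is consistent with the ground truth never entering the candidate set. Whether the real indices behave the same way depends on their actual key definitions.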


### Note on Test Set Design

The test set uses **synthetic non-matches** (e.g., "TEST USER 123") instead of sampling from the sanctions index. Because each synthetic query is a non-match by construction, any `is_match` decision on one counts as a genuine false positive, which makes the measured rate (0.0% FPR on the 50 non-match queries) interpretable.

**Why this matters:**
- The initial approach sampled non-matches from the sanctions index itself, which produced a 100% FPR because those names could match other index entries
- Synthetic names are non-matches by construction, enabling accurate evaluation
- In production, false positives come from truly non-sanctioned names in transaction data

See `docs/findings/sanctions-screening-notes.md` for detailed analysis.
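
A minimal sketch of how synthetic non-matches could be generated and sanity-checked. The helper name `make_synthetic_non_matches`, the name pattern, and the tiny `index_names` set are hypothetical; the notebook's actual test set construction may differ.

```python
import random
from typing import List, Set


def make_synthetic_non_matches(n: int, index_names: Set[str], seed: int = 42) -> List[str]:
    """Generate n unique placeholder names that do not appear verbatim in the index."""
    rng = random.Random(seed)  # local RNG keeps the output reproducible
    known = {name.upper() for name in index_names}
    seen: Set[str] = set()
    queries: List[str] = []
    while len(queries) < n:
        candidate = f"TEST USER {rng.randint(100, 99999)}"
        if candidate not in known and candidate not in seen:
            seen.add(candidate)
            queries.append(candidate)
    return queries


index_names = {"BANCO NACIONAL DE CUBA", "BANK SEPAH"}  # stand-in for the real index
non_matches = make_synthetic_non_matches(5, index_names)
print(non_matches)
```

An exact-string check like this only rules out verbatim collisions. Whether a synthetic name scores high against some index entry is what the screening run itself measures.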

## Save Evaluation Results

Now that we've completed the evaluation and error analysis, we save all results to files for documentation and reproducibility. This ensures that:

- **Metrics are preserved** for future reference and comparison
- **Detailed results** can be analyzed later or shared with stakeholders
- **Summary reports** provide quick overview of system performance
- **Reproducibility** is maintained with versioned artifacts

The saved files include:
- Evaluation metrics (JSON): Precision, recall, FPR, latency statistics
- Detailed results (CSV): Per-query results with scores and decisions
- Summary report (JSON): Complete evaluation summary with targets and validation status
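
Once saved, the summary can be reloaded to re-check targets without rerunning the evaluation. The sketch below writes a small stand-in summary to a temporary directory and reads it back. The helper `targets_met_from_summary` is hypothetical, and the dictionary only mirrors the `metrics` and `targets` keys used in the save cell.

```python
import json
import tempfile
from pathlib import Path
from typing import Dict


def targets_met_from_summary(summary: Dict) -> Dict[str, bool]:
    """Recompute pass/fail per target from a saved summary dictionary."""
    m, t = summary["metrics"], summary["targets"]
    return {
        "precision_at_1": m["precision_at_1"] >= t["precision_at_1"],
        "recall_at_top3": m["recall_at_top3"] >= t["recall_at_top3"],
        "latency_p95_ms": m["latency_p95_ms"] < t["latency_p95_ms"],
    }


# Stand-in values copied from the evaluation output in this notebook
example_summary = {
    "metrics": {"precision_at_1": 0.975, "recall_at_top3": 0.98, "latency_p95_ms": 46.39},
    "targets": {"precision_at_1": 0.95, "recall_at_top3": 0.98, "latency_p95_ms": 50.0},
}

with tempfile.TemporaryDirectory() as tmp:
    path = Path(tmp) / "sanctions_evaluation_summary.json"
    path.write_text(json.dumps(example_summary, indent=2))
    reloaded = json.loads(path.read_text())

checks = targets_met_from_summary(reloaded)
print(checks)
```

Keeping the targets next to the metrics in the same file means a later reader can verify the pass/fail status without knowing which thresholds were in force at evaluation time.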

In [18]:
## Save Evaluation Results

# Save metrics
eval_metrics_path = MODELS_DIR / "sanctions_evaluation_metrics.json"
with open(eval_metrics_path, 'w') as f:
    json.dump(metrics, f, indent=2, default=str)

print(f"Saved evaluation metrics to: {eval_metrics_path}")

# Save detailed results
eval_results_path = DATA_DIR / "sanctions_evaluation_results.csv"
eval_results_path.parent.mkdir(parents=True, exist_ok=True)
results_df.to_csv(eval_results_path, index=False)
print(f"Saved evaluation results to: {eval_results_path}")

# Create summary report
# Use relative path from project root instead of absolute path
test_set_relative_path = None
if 'test_set_path' in locals():
    try:
        # Convert to relative path from project root
        test_set_relative_path = str(Path(test_set_path).relative_to(PROJECT_ROOT))
    except ValueError:
        # If path is not under project root, just use filename
        test_set_relative_path = Path(test_set_path).name

summary = {
    'evaluation_date': pd.Timestamp.now().isoformat(),
    'test_set_size': len(labeled_test_set),
    'test_set_path': test_set_relative_path,  # Relative path, not absolute
    'metrics': metrics,
    'targets': {
        'precision_at_1': 0.95,
        'recall_at_top3': 0.98,
        'latency_p95_ms': 50.0
    },
    'targets_met': {
        'precision_at_1': metrics['precision_at_1'] >= 0.95,
        'recall_at_top3': metrics['recall_at_top3'] >= 0.98,
        'latency_p95_ms': metrics['latency_p95_ms'] < 50.0
    },
    'error_analysis': {
        'false_negatives': error_summary['fn_count'] if 'error_summary' in locals() else None,
        'false_negative_rate': error_summary['fn_rate'] if 'error_summary' in locals() else None,
        'false_positives': error_summary['fp_count'] if 'error_summary' in locals() else None,
        'false_positive_rate': error_summary['fp_rate'] if 'error_summary' in locals() else None,
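        # 'Blocking' is hardcoded from the pattern analysis above (all false negatives had top-1 score < 0.80); update if that changes
        'primary_issue': 'Blocking' if 'error_summary' in locals() and error_summary.get('fn_count', 0) > 0 else None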
    }
}

summary_path = MODELS_DIR / "sanctions_evaluation_summary.json"
with open(summary_path, 'w') as f:
    json.dump(summary, f, indent=2, default=str)

print(f"Saved evaluation summary to: {summary_path}")

# Display summary
print("\nEvaluation Complete")
print("-"*60)
print(f"\nAll targets met: {all(summary['targets_met'].values())}")
print(f"\nSummary:")
print(f"  - Test set size: {summary['test_set_size']:,} queries")
print(f"  - Precision@1: {metrics['precision_at_1']:.1%}")
print(f"  - Recall@top3: {metrics['recall_at_top3']:.1%}")
print(f"  - Latency p95: {metrics['latency_p95_ms']:.2f} ms")
if 'error_summary' in locals():
    print(f"  - False Negatives: {error_summary['fn_count']} ({error_summary['fn_rate']:.1%})")
    print(f"  - False Positives: {error_summary['fp_count']} ({error_summary['fp_rate']:.1%})")

print(f"\nArtifacts saved:")
print(f"  - Metrics: {eval_metrics_path.name}")
print(f"  - Results: {eval_results_path.name}")
print(f"  - Summary: {summary_path.name}")

Saved evaluation metrics to: /Users/joekariuki/Documents/Devbrew/research/devbrew-payments-fraud-sanctions/packages/models/sanctions_evaluation_metrics.json
Saved evaluation results to: /Users/joekariuki/Documents/Devbrew/research/devbrew-payments-fraud-sanctions/data_catalog/processed/sanctions_evaluation_results.csv
Saved evaluation summary to: /Users/joekariuki/Documents/Devbrew/research/devbrew-payments-fraud-sanctions/packages/models/sanctions_evaluation_summary.json

Evaluation Complete
------------------------------------------------------------

All targets met: True

Summary:
  - Test set size: 250 queries
  - Precision@1: 97.5%
  - Recall@top3: 98.0%
  - Latency p95: 46.39 ms
  - False Negatives: 4 (2.0%)
  - False Positives: 0 (0.0%)

Artifacts saved:
  - Metrics: sanctions_evaluation_metrics.json
  - Results: sanctions_evaluation_results.csv
  - Summary: sanctions_evaluation_summary.json


## Production API Demonstration

The sanctions screening module is now available as a production-ready API through `packages.compliance.sanctions_api`. This demonstrates how the module can be integrated into payment processing systems.

**Key Features:**
- Clean, type-safe API with dataclasses
- Two-stage adaptive scoring for optimal latency/recall balance
- Configurable filters (country, program)
- JSON-serializable responses for API integration
- Sub-50ms p95 latency with ≥98% recall

This API is ready for integration with the FastAPI service.
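
To illustrate what "type-safe API with dataclasses" and "JSON-serializable responses" can look like, here is a reduced sketch. The classes `MatchSketch` and `ResponseSketch` are hypothetical stand-ins that mirror a subset of the field names visible in the responses printed in this notebook; they are not the definitions in `packages.compliance.sanctions_api`.

```python
import json
from dataclasses import asdict, dataclass, field
from typing import Dict, List, Optional


@dataclass
class MatchSketch:
    match_name: str
    score: float
    decision: str
    uid: str
    country: Optional[str] = None
    program: Optional[str] = None


@dataclass
class ResponseSketch:
    query: str
    top_matches: List[MatchSketch] = field(default_factory=list)
    applied_filters: Dict[str, Optional[str]] = field(default_factory=dict)
    latency_ms: float = 0.0

    def to_dict(self) -> dict:
        # asdict() recurses into nested dataclasses, so the result is JSON-ready
        return asdict(self)


response = ResponseSketch(
    query="BANCO NACIONAL DE CUBA",
    top_matches=[MatchSketch("BANCO NACIONAL DE CUBA", 1.0, "is_match", "SDN_306", program="CUBA")],
    applied_filters={"country": None, "program": None},
    latency_ms=29.84,
)
payload = json.dumps(response.to_dict(), indent=2)
print(payload)
```

A plain-dictionary export like this is what lets a web framework return the response without custom encoders.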

In [10]:
# Import the API
from packages.compliance.sanctions_api import (
    SanctionsQuery,
    SanctionsScreener
)

# Check if artifacts are already loaded (from earlier cells)
# If not, load them now
if 'sanctions_index' not in globals():
    print("Loading artifacts...")
    
    # Verify paths exist
    print(f"Project root: {PROJECT_ROOT}")
    print(f"Models dir: {MODELS_DIR}")
    
    # Load sanctions index
    sanctions_index_path = MODELS_DIR / "sanctions_index.parquet"
    if not sanctions_index_path.exists():
        raise FileNotFoundError(f"Sanctions index not found: {sanctions_index_path}")
    
    sanctions_index = pd.read_parquet(sanctions_index_path)
    
    # Load blocking indices
    blocking_indices_path = MODELS_DIR / "blocking_indices.json"
    if not blocking_indices_path.exists():
        raise FileNotFoundError(f"Blocking indices not found: {blocking_indices_path}")
    
    with open(blocking_indices_path, 'r') as f:
        blocking_indices = json.load(f)
    
    first_token_index = blocking_indices['first_token']
    bucket_index = blocking_indices['bucket']
    initials_index = blocking_indices['initials']
    
    # Load metadata for version tracking
    metadata_path = MODELS_DIR / "sanctions_index_metadata.json"
    if metadata_path.exists():
        with open(metadata_path, 'r') as f:
            sanctions_index_metadata = json.load(f)
    else:
        sanctions_index_metadata = {'created_at': 'unknown'}
    
    print("Artifacts loaded successfully")
else:
    print("Artifacts already loaded (from earlier cells)")

# Verify artifacts
print(f"\nSanctions index: {len(sanctions_index):,} records")
print(f"First token index: {len(first_token_index):,} keys")
print(f"Bucket index: {len(bucket_index):,} keys")
print(f"Initials index: {len(initials_index):,} keys")
print(f"Index version: {sanctions_index_metadata.get('created_at', 'unknown')}")

Artifacts already loaded (from earlier cells)

Sanctions index: 39,350 records
First token index: 15,597 keys
Bucket index: 4 keys
Initials index: 15,986 keys
Index version: 2025-11-17T06:00:56.218723


In [20]:
# Initialize the production screener
screener = SanctionsScreener(
    sanctions_index=sanctions_index,
    first_token_index=first_token_index,
    bucket_index=bucket_index,
    initials_index=initials_index,
    version=sanctions_index_metadata.get('created_at', '1.0.0')
)

print("SanctionsScreener initialized")
print(f" - Version: {screener.version}")
print(f" - Index size: {len(screener.sanctions_index):,} records")

SanctionsScreener initialized
 - Version: 2025-11-17T06:00:56.218723
 - Index size: 39,350 records


In [26]:
# Test 1: Simple query - High confidence match
print("Test 1: High Confidence Match")
print("-" * 70)

query1 = SanctionsQuery(name="BANCO NACIONAL DE CUBA")
response1 = screener.screen(query1)

print(f"Query: {response1.query}")
print(f"Latency: {response1.latency_ms:.2f} ms")
print(f"Matches: {len(response1.top_matches)}")
print(f"Version: {response1.version}")
print(f"Timestamp: {response1.timestamp}")

if response1.top_matches:
    top_match = response1.top_matches[0]
    print(f"\nTop Match:")
    print(f"  Name: {top_match.match_name}")
    print(f"  Score: {top_match.score:.3f}")
    print(f"  Decision: {top_match.decision}")
    print(f"  Is Match: {top_match.is_match}")
    print(f"  UID: {top_match.uid}")
    print(f"  Country: {top_match.country}")
    print(f"  Program: {top_match.program}")
    print(f"  Source: {top_match.source}")
    print(f"  Similarity scores:")
    print(f"   - Token Set: {top_match.sim_set:.3f}")
    print(f"   - Token Sort: {top_match.sim_sort:.3f}")
    print(f"   - Partial: {top_match.sim_partial:.3f}")

print()

Test 1: High Confidence Match
----------------------------------------------------------------------
Query: BANCO NACIONAL DE CUBA
Latency: 29.84 ms
Matches: 3
Version: 2025-11-17T06:00:56.218723
Timestamp: 2025-11-27T13:21:38.042165

Top Match:
  Name: BANCO NACIONAL DE CUBA
  Score: 1.000
  Decision: is_match
  Is Match: True
  UID: SDN_306
  Country: Switzerland
  Program: CUBA
  Source: SDN
  Similarity scores:
   - Token Set: 1.000
   - Token Sort: 1.000
   - Partial: 1.000



In [25]:
# Test 2: Query with country filter


print("Test 2: Query with Country Filter")
print("-" * 70)

query2 = SanctionsQuery(
    name="BANK",  # Generic query that should match many entities
    country="Russia",  # Actual country from data
    top_k=5
)
response2 = screener.screen(query2)

print(f"Query: {response2.query}")
print(f"Applied Filters: {response2.applied_filters}")
print(f"Latency: {response2.latency_ms:.2f} ms")
print(f"Matches: {len(response2.top_matches)}")

if response2.top_matches:
    print(f"\nTop Matches (filtered by country='Russia'):")
    for i, match in enumerate(response2.top_matches, 1):
        print(f"  {i}. {match.match_name} (score: {match.score:.3f}, country: {match.country}, program: {match.program})")
else:
    print("\nNo matches found with country filter")

print()

Test 2: Query with Country Filter
----------------------------------------------------------------------
Query: BANK
Applied Filters: {'country': 'Russia', 'program': None}
Latency: 3.20 ms
Matches: 1

Top Matches (filtered by country='Russia'):
  1. T-BANK (score: 0.840, country: Russia, program: UKRAINE-EO13662] [RUSSIA-EO14024)



In [24]:
# Test 3: Query with program filter (Finds working query automatically)
print("Test 3: Query with Program Filter")
print("-" * 70)

# Strategy: First find entities with IRAN program, then use one for testing
print("Step 1: Finding entities with IRAN program...")

# Query without filter to see what's available
test_query = SanctionsQuery(name="BANK", top_k=3)
test_response = screener.screen(test_query)

# Look for IRAN program entities in results
iran_entities = []
for match in test_response.top_matches:
    if match.program and "IRAN" in str(match.program).upper():
        iran_entities.append(match)

if iran_entities:
    print(f"Found {len(iran_entities)} entities with IRAN program in sample")
    print("Sample IRAN entities:")
    for i, entity in enumerate(iran_entities[:3], 1):
        print(f"  {i}. {entity.match_name} (Program: {entity.program}, Country: {entity.country})")
    
    # Use the first IRAN entity's name for the filtered query
    iran_query_name = iran_entities[0].match_name.split()[0] if iran_entities[0].match_name else "BANK"
    print(f"\nUsing query: '{iran_query_name}' with IRAN program filter")
else:
    # Fallback: Try common IRAN-related queries
    print("No IRAN entities found in sample, trying common IRAN queries...")
    iran_query_name = "CENTRAL BANK"

print("\n" + "-" * 70)
print("Step 2: Testing program filter with IRAN:")

# Now test with IRAN program filter
query3 = SanctionsQuery(
    name=iran_query_name,  # always set in Step 1 (either from a match or the fallback)
    program="IRAN",
    top_k=5
)
response3 = screener.screen(query3)

print(f"Query: {response3.query}")
print(f"Applied Filters: {response3.applied_filters}")
print(f"Latency: {response3.latency_ms:.2f} ms")
print(f"Matches: {len(response3.top_matches)}")

if response3.top_matches:
    print(f"\nTop Matches (filtered by program='IRAN'):")
    for i, match in enumerate(response3.top_matches, 1):
        print(f"  {i}. {match.match_name}")
        print(f"     Score: {match.score:.3f}, Program: {match.program}, Country: {match.country}, Decision: {match.decision}")
else:
    print("\n[WARNING] No matches found with IRAN program filter")
    print("  Trying alternative: Query without filter to show available programs...")
    
    # Show what programs are available for similar queries
    alt_query = SanctionsQuery(name="BANK", top_k=10)
    alt_response = screener.screen(alt_query)
    
    if alt_response.top_matches:
        print(f"\n  Available programs for 'BANK' query (without filter):")
        programs = {}
        for match in alt_response.top_matches:
            prog = match.program if match.program else "None"
            if prog not in programs:
                programs[prog] = []
            programs[prog].append(match.match_name)
        
        for prog, names in list(programs.items())[:5]:
            print(f"    - {prog}: {len(names)} matches (e.g., {names[0]})")

print()

Test 3: Query with Program Filter
----------------------------------------------------------------------
Step 1: Finding entities with IRAN program...
Found 2 entities with IRAN program in sample
Sample IRAN entities:
  1. BANK SINA (Program: IRAN] [SDGT] [IFSR] [IRAN-EO13876, Country: Iran)
  2. BANK SEPAH (Program: IRAN] [NPWMD] [IFSR, Country: Iran)

Using query: 'BANK' with IRAN program filter

----------------------------------------------------------------------
Step 2: Testing program filter with IRAN:
Query: BANK
Applied Filters: {'country': None, 'program': 'IRAN'}
Latency: 2.99 ms
Matches: 4

Top Matches (filtered by program='IRAN'):
  1. BANK SINA
     Score: 0.846, Program: IRAN] [SDGT] [IFSR] [IRAN-EO13876, Country: Iran, Decision: review
  2. BANK SEPAH
     Score: 0.829, Program: IRAN] [NPWMD] [IFSR, Country: Iran, Decision: review
  3. BANK ANSAR
     Score: 0.829, Program: IRAN] [SDGT] [NPWMD] [IRGC] [IFSR, Country: Iran, Decision: review
  4. BANK REFAH
     Score: 0.

In [15]:
# Test 4: Non-match query (clear non-match)
print("Test 4: Non-Match Query (Clear Non-Match)")
print("-" * 70)

query4 = SanctionsQuery(name="XYZ ABC DEF GHI JKL")
response4 = screener.screen(query4)

print(f"Query: {response4.query}")
print(f"Latency: {response4.latency_ms:.2f} ms")
print(f"Matches: {len(response4.top_matches)}")

if response4.top_matches:
    top_match = response4.top_matches[0]
    print(f"\nTop Match (should be low score):")
    print(f"  Name: {top_match.match_name}")
    print(f"  Score: {top_match.score:.3f}")
    print(f"  Decision: {top_match.decision}")
    print(f"  Is Match: {top_match.is_match}")
    print(f"  (Early exit threshold prevents unnecessary expansion)")
else:
    print("\nNo matches found (query too dissimilar)")

print()

Test 4: Non-Match Query (Clear Non-Match)
----------------------------------------------------------------------
Query: XYZ ABC DEF GHI JKL
Latency: 0.02 ms
Matches: 0

No matches found (query too dissimilar)



In [16]:
# Test 5: JSON Serialization (for API integration)
print("Test 5: JSON Serialization (API Integration)")
print("-" * 70)

# Convert response to dictionary
response_dict = response1.to_dict()

print(f"Response serialized to dictionary:")
print(f" - Keys: {list(response_dict.keys())}")
print(f" - Query: {response_dict['query']}")
print(f" - Matches: {len(response_dict['top_matches'])}")
print(f" - Latency: {response_dict['latency_ms']:.2f} ms")
print(f" - Version: {response_dict['version']}")
print(f" - Timestamp: {response_dict['timestamp']}")

# Convert to JSON string (for API responses)
response_json = json.dumps(response_dict, indent=2)
print(f"\nJSON string length: {len(response_json)} characters")
print(f"\nFirst 500 characters of JSON:")
print(response_json[:500] + "...")

# Verify we can deserialize
response_from_json = json.loads(response_json)
print(f"\nJSON serialization/deserialization successful")
print(f"   - Deserialized query: {response_from_json['query']}")
print(f"   - Deserialized matches: {len(response_from_json['top_matches'])}")

Test 5: JSON Serialization (API Integration)
----------------------------------------------------------------------
Response serialized to dictionary:
 - Keys: ['query', 'top_matches', 'applied_filters', 'latency_ms', 'version', 'timestamp']
 - Query: BANCO NACIONAL DE CUBA
 - Matches: 3
 - Latency: 26.06 ms
 - Version: 2025-11-17T06:00:56.218723
 - Timestamp: 2025-11-27T08:31:55.930350

JSON string length: 1332 characters

First 500 characters of JSON:
{
  "query": "BANCO NACIONAL DE CUBA",
  "top_matches": [
    {
      "match_name": "BANCO NACIONAL DE CUBA",
      "score": 1.0,
      "is_match": true,
      "decision": "is_match",
      "country": "Switzerland",
      "program": "CUBA",
      "source": "SDN",
      "uid": "SDN_306",
      "sim_set": 1.0,
      "sim_sort": 1.0,
      "sim_partial": 1.0
    },
    {
      "match_name": "INSTITUTO NACIONAL DE TURISMO DE CUBA",
      "score": 0.6901548941691672,
      "is_match": false,
      "d...

JSON serialization/deserialization su

In [17]:
# Test 6: Performance validation
print("Test 6: Performance Validation")
print("-" * 70)

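# Run a handful of queries as a latency smoke test
# (5 samples are too few for a reliable p95; the 250-query evaluation above is the reference)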
test_queries = [
    "BANCO NACIONAL DE CUBA",
    "AHMAD, Mohammad",
    "XYZ ABC DEF",  # Non-match
    "CENTRAL BANK OF IRAN",
    "TEST USER 123"  # Synthetic non-match
]

latencies = []
for query_name in test_queries:
    query = SanctionsQuery(name=query_name)
    response = screener.screen(query)
    latencies.append(response.latency_ms)
    print(f"Query: {query_name:30s} | Latency: {response.latency_ms:6.2f} ms | Matches: {len(response.top_matches)}")

print(f"\nLatency Statistics:")
print(f"  - Mean: {np.mean(latencies):.2f} ms")
print(f"  - Median (p50): {np.median(latencies):.2f} ms")
print(f"  - p95: {np.percentile(latencies, 95):.2f} ms")
print(f"  - p99: {np.percentile(latencies, 99):.2f} ms")
print(f"  - Max: {np.max(latencies):.2f} ms")

# Validate against targets
p95_latency = np.percentile(latencies, 95)
if p95_latency < 50:
    print(f"\nPASS: p95 latency ({p95_latency:.2f} ms) < 50 ms target")
else:
    print(f"\n[WARNING] p95 latency ({p95_latency:.2f} ms) exceeds 50 ms target")

Test 6: Performance Validation
----------------------------------------------------------------------
Query: BANCO NACIONAL DE CUBA         | Latency:  22.41 ms | Matches: 3
Query: AHMAD, Mohammad                | Latency:   1.92 ms | Matches: 3
Query: XYZ ABC DEF                    | Latency:  20.13 ms | Matches: 3
Query: CENTRAL BANK OF IRAN           | Latency:  22.07 ms | Matches: 3
Query: TEST USER 123                  | Latency:  19.77 ms | Matches: 3

Latency Statistics:
  - Mean: 17.26 ms
  - Median (p50): 20.13 ms
  - p95: 22.34 ms
  - p99: 22.40 ms
  - Max: 22.41 ms

PASS: p95 latency (22.34 ms) < 50 ms target
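Five queries are enough for a smoke test, but tail percentiles are more informative over a larger sample with warm-up passes excluded. The following is a minimal sketch of such a benchmark, not part of the notebook's pipeline. `benchmark_latency` and `percentile` are hypothetical helper names, and the sketch times the call with `time.perf_counter` instead of reading `response.latency_ms`, so its numbers would include any overhead outside the screener's own timer. It uses the standard library's `statistics.quantiles` with `method="inclusive"`, which interpolates linearly between order statistics.

```python
import statistics
import time
from typing import Callable, Dict, List


def percentile(samples: List[float], pct: int) -> float:
    """Linearly interpolated percentile for an integer pct in 1..99."""
    # quantiles(n=100) returns 99 cut points; index pct - 1 is the pct-th percentile.
    return statistics.quantiles(samples, n=100, method="inclusive")[pct - 1]


def benchmark_latency(
    screen_fn: Callable[[str], object],
    queries: List[str],
    repeats: int = 20,
    warmup: int = 2,
) -> Dict[str, float]:
    """Time screen_fn over every query `repeats` times and summarize in ms."""
    # Warm-up passes are not recorded, so one-off costs (lazy initialization,
    # cold caches) do not skew the statistics.
    for _ in range(warmup):
        for q in queries:
            screen_fn(q)

    samples_ms: List[float] = []
    for _ in range(repeats):
        for q in queries:
            start = time.perf_counter()
            screen_fn(q)
            samples_ms.append((time.perf_counter() - start) * 1000.0)

    return {
        "n": float(len(samples_ms)),
        "mean_ms": statistics.fmean(samples_ms),
        "p50_ms": statistics.median(samples_ms),
        "p95_ms": percentile(samples_ms, 95),
        "p99_ms": percentile(samples_ms, 99),
        "max_ms": max(samples_ms),
    }
```

In this notebook the call would look like `benchmark_latency(lambda name: screener.screen(SanctionsQuery(name=name)), test_queries)`. Applied to the five latencies printed above, `percentile(..., 95)` reproduces the reported 22.34 ms.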


## Saving Production Artifacts

With the evaluation complete and targets met, we now serialize the final artifacts required for the production API service.

Loading raw data files and rebuilding indices on every service startup is slow and error-prone, so we instead serialize the fully initialized `SanctionsScreener` object. This allows the production API to "hydrate" the entire screening engine in a single line of code and keeps behavior consistent between this research environment and the production service, provided the service imports the same `SanctionsScreener` class definition.

We also generate a comprehensive metadata file that serves as a "nutrition label" for the model, combining build timestamps, dataset statistics, and the performance metrics validated in this notebook. This audit trail is critical for compliance and MLOps governance.

In [29]:
# Serialize the Screening Engine
# We pickle the fully initialized screener object for fast production loading.
# This encapsulates the index, blocking maps, and configuration in one binary.
screener_path = MODELS_DIR / "sanctions_screener.pkl"
with open(screener_path, 'wb') as f:
    pickle.dump(screener, f)

print(f"Saved initialized screener to: {screener_path}")

# Generate Model Registry Metadata
eval_metrics_path = MODELS_DIR / "sanctions_evaluation_metrics.json"
eval_metrics = {}
if eval_metrics_path.exists():
    with open(eval_metrics_path, 'r') as f:
        eval_metrics = json.load(f)

# Construct the model registry entry
screener_metadata = {
    'version_id': sanctions_index_metadata.get('created_at', datetime.now().isoformat()),
    'build_time': datetime.now().isoformat(),
    'status': 'production_ready',
    'dataset': {
        'source': 'OFAC SDN + Consolidated Lists',
        'total_records': len(sanctions_index),
        'blocking_keys': {
            'first_token': len(first_token_index),
            'bucket': len(bucket_index),
            'initials': len(initials_index)
        }
    },
    'configuration': {
        'thresholds': {'is_match': 0.90, 'review': 0.80},
        'filters': ['country', 'program']
    },
    'performance_validation': {
        'precision_at_1': eval_metrics.get('precision_at_1'),
        'recall_at_top3': eval_metrics.get('recall_at_top3'),
        'latency_p95_ms': eval_metrics.get('latency_p95_ms'),
        'validated_at': datetime.now().isoformat()
    },
    'governance': {
        'license': 'Apache 2.0',
        'data_license': 'Public Domain (OFAC)',
        'disclaimer': 'Research demonstration. Not for production financial use without compliance review.'
    }
}

metadata_path = MODELS_DIR / "sanctions_screener_metadata.json"
with open(metadata_path, 'w') as f:
    json.dump(screener_metadata, f, indent=2, default=str)

print(f"Saved model registry metadata to: {metadata_path}")

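# Final Verification
# Derive the status from the loaded metrics rather than hard-coding it.
# Targets from this notebook: Precision@1 >= 95%, Recall@top3 >= 98%, p95 latency < 50 ms.
if not eval_metrics:
    validation_status = "NOT VALIDATED (no evaluation metrics found)"
elif (
    (eval_metrics.get('precision_at_1') or 0) >= 0.95
    and (eval_metrics.get('recall_at_top3') or 0) >= 0.98
    and (eval_metrics.get('latency_p95_ms') or float('inf')) < 50
):
    validation_status = "PASS"
else:
    validation_status = "FAIL"

print("\nArtifacts Finalized")
print(f" Screener Version:    {screener_metadata['version_id']}")
print(f" Build Timestamp:     {screener_metadata['build_time']}")
print(f" Validation Status:   {validation_status}")
if eval_metrics:
    print(f"  - Precision@1:     {eval_metrics.get('precision_at_1', 0):.1%}")
    print(f"  - Recall@top3:     {eval_metrics.get('recall_at_top3', 0):.1%}")
    print(f"  - Latency (p95):   {eval_metrics.get('latency_p95_ms', 0):.2f} ms")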

print("\nArtifacts are ready for API integration.")

Saved initialized screener to: /Users/joekariuki/Documents/Devbrew/research/devbrew-payments-fraud-sanctions/packages/models/sanctions_screener.pkl
Saved model registry metadata to: /Users/joekariuki/Documents/Devbrew/research/devbrew-payments-fraud-sanctions/packages/models/sanctions_screener_metadata.json

Artifacts Finalized
 Screener Version:    2025-11-17T06:00:56.218723
 Build Timestamp:     2025-11-27T16:30:13.524760
 Validation Status:   PASS
  - Precision@1:     97.5%
  - Recall@top3:     98.0%
  - Latency (p95):   46.39 ms

Artifacts are ready for API integration.
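On the consuming side, a service could read the metadata file first and refuse to load the screener if the recorded metrics fall short of the targets. The following is a minimal sketch under stated assumptions, not the production loader: `load_screener` and `validation_failures` are hypothetical names, the thresholds are the targets stated in this notebook, and the metric values are assumed to be stored as fractions (0.975 for 97.5%) and milliseconds, as the formatting in the cell above suggests.

```python
import json
import pickle
from pathlib import Path
from typing import Any, Dict, List, Tuple

# Targets as stated in this notebook (assumed storage units: fractions and ms).
MIN_PRECISION_AT_1 = 0.95
MIN_RECALL_AT_TOP3 = 0.98
MAX_LATENCY_P95_MS = 50.0


def validation_failures(metadata: Dict[str, Any]) -> List[str]:
    """Return a description of every target that is missing or not met."""
    perf = metadata.get("performance_validation", {})
    precision = perf.get("precision_at_1")
    recall = perf.get("recall_at_top3")
    latency = perf.get("latency_p95_ms")

    failures: List[str] = []
    if precision is None or precision < MIN_PRECISION_AT_1:
        failures.append(f"precision_at_1={precision}")
    if recall is None or recall < MIN_RECALL_AT_TOP3:
        failures.append(f"recall_at_top3={recall}")
    if latency is None or latency >= MAX_LATENCY_P95_MS:
        failures.append(f"latency_p95_ms={latency}")
    return failures


def load_screener(model_path: Path, metadata_path: Path) -> Tuple[Any, Dict[str, Any]]:
    """Load the pickled screener only if its metadata meets the targets."""
    metadata = json.loads(Path(metadata_path).read_text(encoding="utf-8"))
    failures = validation_failures(metadata)
    if failures:
        raise ValueError(
            "Screener metadata fails validation targets: " + ", ".join(failures)
        )

    # pickle.load can execute arbitrary code, so only load files produced by a
    # trusted build. The pickled object's class must be importable here.
    with open(model_path, "rb") as f:
        screener = pickle.load(f)
    return screener, metadata
```

With the paths used in this notebook, the call would be `load_screener(MODELS_DIR / "sanctions_screener.pkl", MODELS_DIR / "sanctions_screener_metadata.json")`. Checking the metadata before unpickling keeps a build with missing or failing metrics from being served silently.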
