# Sanctions Screening

- **Purpose:** OFAC sanctions screening with fuzzy name matching for a fraud detection pipeline  
- **Author:** Devbrew LLC  
- **Last Updated:** November 4, 2025  
- **Status:** In progress  
- **License:** Apache 2.0 (Code) | Public Domain (OFAC Data)

---

## Dataset License Notice

This notebook uses **OFAC Sanctions Lists** (SDN and Consolidated) from the U.S. Department of the Treasury.

**Dataset License:** Public Domain  
- OFAC sanctions data is publicly available from [OFAC Sanctions List Search](https://sanctionslist.ofac.treas.gov/Home)  
- Data can be freely used, redistributed, and incorporated into commercial systems  
- Updates are published regularly; production systems should refresh data periodically  

**Setup Instructions:** See [`../data_catalog/README.md`](../data_catalog/README.md) for download instructions.

**Code License:** This notebook's code is licensed under Apache 2.0 (open source).

**Disclaimer:** This is a research demonstration. Production sanctions screening requires broader list coverage (EU, UN, UK HMT), legal review, and compliance with local regulations.

---

## Notebook Configuration

### Environment Setup

We configure the Python environment with standard display and plotting settings, import the libraries used for text processing and fuzzy matching, and fix the NumPy random seed so that the random samples drawn later (for example, in the validation checks) are the same on every run.

These settings establish the foundation for all sanctions screening operations, including name normalization, tokenization, and similarity scoring.

In [1]:
import warnings
from pathlib import Path
import json
import hashlib
import unicodedata
import re
from typing import Dict, Any, Optional, List, Tuple
import time

import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt
import rapidfuzz as rf
from rapidfuzz import fuzz, process

# Configuration
warnings.filterwarnings("ignore")
pd.set_option("display.max_columns", 100)
pd.set_option("display.max_rows", 100)
pd.set_option("display.float_format", '{:.2f}'.format)

# Plotting configuration
sns.set_style("whitegrid")
plt.rcParams["figure.figsize"] = (12, 6)
plt.rcParams["font.size"] = 10

# Reproducibility
RANDOM_STATE = 42
np.random.seed(RANDOM_STATE)

print("Environment configured successfully")
print(f"pandas: {pd.__version__}")
print(f"numpy: {np.__version__}")
print(f"rapidfuzz: {rf.__version__}")

Environment configured successfully
pandas: 2.3.3
numpy: 2.3.3
rapidfuzz: 3.14.1


### Path Configuration

We define the project directory structure and validate that OFAC data files exist before proceeding. The validation ensures we have the necessary sanctions lists for screening operations.

This configuration pattern ensures we can locate all required data artifacts and provides clear feedback if prerequisites are missing.

In [2]:
# Project paths
PROJECT_ROOT = Path.cwd().parent
DATA_DIR = PROJECT_ROOT / "data_catalog"
OFAC_DIR = DATA_DIR / "ofac"
PROCESSED_DIR = DATA_DIR / "processed"
MODELS_DIR = PROJECT_ROOT / "packages" / "models"

# Ensure output directories exist
PROCESSED_DIR.mkdir(parents=True, exist_ok=True)
MODELS_DIR.mkdir(parents=True, exist_ok=True)

# Expected OFAC data files
OFAC_FILES = {
    'SDN Primary': OFAC_DIR / 'sdn' / 'sdn.csv',
    'SDN Alternate': OFAC_DIR / 'sdn' / 'alt.csv',
    'SDN Address': OFAC_DIR / 'sdn' / 'add.csv',
    'Consolidated Primary': OFAC_DIR / 'consolidated' / 'cons_prim.csv',
    'Consolidated Alternate': OFAC_DIR / 'consolidated' / 'cons_alt.csv',
    'Consolidated Address': OFAC_DIR / 'consolidated' / 'cons_add.csv',
}

def validate_required_data():
    """Validate that OFAC sanctions data files exist."""
    print("OFAC Data Availability Check:")
    
    all_exist = True
    for name, path in OFAC_FILES.items():
        exists = path.exists()
        status = "Found" if exists else "Missing"
        print(f" - {name:25s}: {status}")
        if not exists:
            all_exist = False
    
    if not all_exist:
        print("\n[WARNING] Some OFAC files are missing; see data_catalog/README.md for instructions")
    else:
        print("\nAll required OFAC data files are available")
    
    return all_exist

data_available = validate_required_data()

OFAC Data Availability Check:
 - SDN Primary              : Found
 - SDN Alternate            : Found
 - SDN Address              : Found
 - Consolidated Primary     : Found
 - Consolidated Alternate   : Found
 - Consolidated Address     : Found

All required OFAC data files are available


## Load & Normalize OFAC Datasets

We load OFAC sanctions lists (SDN and Consolidated) and apply comprehensive text normalization to enable robust fuzzy matching. This step is critical for handling variations in how names appear across different systems and languages.

Our normalization strategy addresses several common challenges in sanctions screening:
- **Unicode variations**: Apply NFKC (compatibility composition) so that equivalent characters, such as full-width letters and ligatures, share one representation
- **Accent marks**: Strip diacritics to match "José" with "Jose"
- **Case sensitivity**: Lowercase everything for case-insensitive matching
- **Punctuation**: Remove quotes, keep ASCII hyphens, and replace all other punctuation with spaces
- **Whitespace**: Collapse multiple spaces to a single space

This preprocessing ensures we can match names reliably even when they're formatted differently in transaction data versus sanctions lists.

In [3]:
def normalize_text(text: str) -> str:
    """
    Normalize text for robust fuzzy matching.
    
    Applies NFKC normalization, lowercasing, accent stripping,
    punctuation canonicalization, and whitespace collapse.
    
    Note: Non-Latin scripts (Chinese, Arabic, Cyrillic) are stripped
    because OFAC sanctions lists use romanized names. For example:
    - "中国工商银行" → "" (empty)
    - "INDUSTRIAL AND COMMERCIAL BANK OF CHINA" → "industrial and commercial bank of china"
    
    Args:
        text: Raw text string to normalize
        
    Returns:
        Normalized text string suitable for fuzzy matching.
        Returns empty string if input contains only non-Latin characters.
        
    Examples:
        >>> normalize_text("José María O'Brien")
        'jose maria obrien'
        
        >>> normalize_text("AL-QAIDA")
        'al-qaida'
        
        >>> normalize_text("中国工商银行")
        ''
    """
    if not text or pd.isna(text):
        return ""
    
    # Convert to string if not already
    text = str(text)
    
    # Unicode normalization (NFKC: compatibility decomposition, then canonical composition)
    text = unicodedata.normalize("NFKC", text)
    
    # Lowercase
    text = text.lower()
    
    # Strip accent marks (diacritics)
    # Decompose characters, then filter out combining marks
    text = ''.join(
        char for char in unicodedata.normalize("NFD", text)
        if unicodedata.category(char) != 'Mn'
    )
    
    # Remove quotes (single and double)
    text = re.sub(r"['\"]", "", text)
    
    # Replace non-alphanumeric (except space and hyphen) with space
    # Note: This strips non-Latin scripts (Chinese, Arabic, Cyrillic, etc.)
    # OFAC lists use romanized names, so this is intentional behavior
    text = re.sub(r"[^a-z0-9\s-]", " ", text)
    
    # Collapse multiple spaces to single space
    text = re.sub(r"\s+", " ", text)
    
    # Strip leading/trailing whitespace
    text = text.strip()
    
    return text

# Test normalization function
print("Testing text normalization:\n")
test_cases = [
    "José María O'Brien",
    "AL-QAIDA",
    "Société Générale",
    "中国工商银行",  # Chinese - will be stripped (OFAC uses romanized names)
    "  Multiple   Spaces  ",
    "UPPER-case-MiXeD",
]

for test in test_cases:
    normalized = normalize_text(test)
    # Show empty string explicitly for clarity
    display_normalized = f"'{normalized}'" if normalized else "''" 
    print(f"  '{test}' → {display_normalized}")

Testing text normalization:

  'José María O'Brien' → 'jose maria obrien'
  'AL-QAIDA' → 'al-qaida'
  'Société Générale' → 'societe generale'
  '中国工商银行' → ''
  '  Multiple   Spaces  ' → 'multiple spaces'
  'UPPER-case-MiXeD' → 'upper-case-mixed'
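
To make the two Unicode steps concrete, the following standalone sketch (standard library only, not part of the pipeline) shows what NFKC folding and the NFD-based accent stripping each contribute. The sample strings and the helper name `strip_accents` are illustrative assumptions.

```python
import unicodedata

def strip_accents(text: str) -> str:
    """Decompose characters (NFD) and drop combining marks (category Mn)."""
    decomposed = unicodedata.normalize("NFD", text)
    return "".join(ch for ch in decomposed if unicodedata.category(ch) != "Mn")

# NFKC folds compatibility characters into their plain equivalents
fullwidth = "\uff21\uff23\uff2d\uff25"   # "ACME" written with full-width Latin letters
ligature = "\ufb01nance"                 # begins with the one-character "fi" ligature
folded_fullwidth = unicodedata.normalize("NFKC", fullwidth)
folded_ligature = unicodedata.normalize("NFKC", ligature)

# NFKC composes "e" + combining acute into a single accented character;
# it does not remove the accent. The NFD pass is what strips it.
composed = unicodedata.normalize("NFKC", "Jose\u0301")
unaccented = strip_accents(composed)

print(folded_fullwidth, folded_ligature, composed, unaccented)
```

This is why `normalize_text` needs both passes: NFKC unifies look-alike encodings, while the decomposition and filter step removes the diacritics themselves.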


### Load OFAC Data Files

We load all OFAC sanctions lists with explicit column mappings since OFAC CSV files don't include headers. We're loading six files total:
- **SDN List**: Primary names, alternate names, addresses
- **Consolidated List**: Primary names, alternate names, addresses

Each sanctions entry can have multiple alternate names (aliases, former names, etc.) and multiple addresses with country information. We'll merge these together to create a comprehensive screening database.

In [5]:
# Define column mappings for OFAC CSV files (they have no headers)
PRIMARY_COLS = [
    'ent_num', 'SDN_Name', 'SDN_Type', 'Program', 'Title',
    'Call_Sign', 'Vess_type', 'Tonnage', 'GRT', 'Vess_flag',
    'Vess_owner', 'Remarks'
]

ALT_COLS = ['ent_num', 'alt_num', 'alt_type', 'alt_name', 'alt_remarks']

ADD_COLS = [
    'ent_num', 'Add_num', 'Address', 'City_State_Province',
    'Country', 'Add_Remarks'
]

print("Loading OFAC Sanctions Lists...\n")

# Load SDN (Specially Designated Nationals) List
print("Loading SDN List...")
sdn_primary = pd.read_csv(
    OFAC_DIR / 'sdn' / 'sdn.csv',
    header=None,
    names=PRIMARY_COLS,
    dtype={'ent_num': str},
    encoding='utf-8'
)

sdn_alt = pd.read_csv(
    OFAC_DIR / 'sdn' / 'alt.csv',
    header=None,
    names=ALT_COLS,
    dtype={'ent_num': str, 'alt_num': str},
    encoding='utf-8'
)

sdn_add = pd.read_csv(
    OFAC_DIR / 'sdn' / 'add.csv',
    header=None,
    names=ADD_COLS,
    dtype={'ent_num': str, 'Add_num': str},
    encoding='utf-8'
)

print(f" - Primary entities: {len(sdn_primary):,}")
print(f" - Alternate names:  {len(sdn_alt):,}")
print(f" - Addresses:        {len(sdn_add):,}")

# Load Consolidated List
print("\nLoading Consolidated List...")
cons_primary = pd.read_csv(
    OFAC_DIR / 'consolidated' / 'cons_prim.csv',
    header=None,
    names=PRIMARY_COLS,
    dtype={'ent_num': str},
    encoding='utf-8'
)

cons_alt = pd.read_csv(
    OFAC_DIR / 'consolidated' / 'cons_alt.csv',
    header=None,
    names=ALT_COLS,
    dtype={'ent_num': str, 'alt_num': str},
    encoding='utf-8'
)

cons_add = pd.read_csv(
    OFAC_DIR / 'consolidated' / 'cons_add.csv',
    header=None,
    names=ADD_COLS,
    dtype={'ent_num': str, 'Add_num': str},
    encoding='utf-8'
)

print(f" - Primary entities: {len(cons_primary):,}")
print(f" - Alternate names:  {len(cons_alt):,}")
print(f" - Addresses:        {len(cons_add):,}")

print("\nAll OFAC files loaded successfully")

Loading OFAC Sanctions Lists...

Loading SDN List...
 - Primary entities: 17,945
 - Alternate names:  19,898
 - Addresses:        23,628

Loading Consolidated List...
 - Primary entities: 444
 - Alternate names:  1,067
 - Addresses:        573

All OFAC files loaded successfully
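
OFAC's legacy CSV exports use the literal string `-0-` as a placeholder for empty fields, so pandas reads those cells as ordinary text rather than as missing values. The sketch below shows one possible way to convert the placeholder after loading. The helper name `mark_ofac_nulls` and the two sample rows are hypothetical, and the whitespace stripping is a precaution in case the placeholder is padded.

```python
import io
import pandas as pd

OFAC_NULL = "-0-"  # placeholder OFAC uses for empty fields in its CSV exports

def mark_ofac_nulls(df: pd.DataFrame) -> pd.DataFrame:
    """Return a copy where string cells are stripped and '-0-' becomes missing."""
    def clean(value):
        if isinstance(value, str):
            value = value.strip()
            return None if value == OFAC_NULL else value
        return value
    return df.apply(lambda col: col.map(clean))

# Two made-up rows in a headerless layout (illustrative only, not real OFAC data)
raw = io.StringIO(
    '1,"EXAMPLE TRADING LTD","-0- ","CUBA"\n'
    '2,"DOE, John","individual","-0- "\n'
)
demo = pd.read_csv(
    raw,
    header=None,
    names=["ent_num", "SDN_Name", "SDN_Type", "Program"],
    dtype={"ent_num": str},
)
cleaned = mark_ofac_nulls(demo)
print(cleaned.isna().sum())
```

With the placeholder converted, `notna()`-based coverage figures count only genuinely populated fields, and `groupby(...).first()` skips placeholder values when picking a country. Applying this step would change the counts printed elsewhere in this notebook.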


### Consolidate Names and Normalize

We merge primary names with their alternate names (aliases, former names) and create a unified sanctions database. Each row will represent a distinct name associated with a sanctioned entity, including both the official name and all known aliases.

We also extract country information from address records to enable geographic filtering during screening. This is important because many sanctions programs are country-specific.

In [7]:
def build_sanctions_index(
    primary_df: pd.DataFrame,
    alt_df: pd.DataFrame,
    add_df: pd.DataFrame,
    source_name: str
) -> pd.DataFrame:
    """
    Build unified sanctions index from primary, alternate, and address files.
    
    Args:
        primary_df: Primary sanctions entities
        alt_df: Alternate names (aliases)
        add_df: Address records with country info
        source_name: Source identifier ('SDN' or 'Consolidated')
        
    Returns:
        DataFrame with columns: uid, name, name_norm, name_type, entity_type, 
                                program, country, remarks, source
    """
    print(f"\nBuilding {source_name} sanctions index...")
    
    # Process primary names
    primary_records = []
    for _, row in primary_df.iterrows():
        primary_records.append({
            'uid': f"{source_name}_{row['ent_num']}",
            'ent_num': row['ent_num'],
            'name': row['SDN_Name'],
            'name_type': 'primary',
            'entity_type': row['SDN_Type'],
            'program': row['Program'],
            'remarks': row['Remarks'],
            'source': source_name
        })
    
    # Process alternate names
    alt_records = []
    for _, row in alt_df.iterrows():
        alt_records.append({
            'uid': f"{source_name}_{row['ent_num']}_alt_{row['alt_num']}",
            'ent_num': row['ent_num'],
            'name': row['alt_name'],
            'name_type': row['alt_type'],  # aka, fka, nka
            'entity_type': None,  # Will be filled from primary
            'program': None,      # Will be filled from primary
            'remarks': row['alt_remarks'],
            'source': source_name
        })
    
    # Combine primary and alternate names
    all_names = pd.DataFrame(primary_records + alt_records)
    
    # Fill entity_type and program from primary records for alternates
    entity_info = primary_df[['ent_num', 'SDN_Type', 'Program']].copy()
    entity_info.columns = ['ent_num', 'entity_type_fill', 'program_fill']
    
    all_names = all_names.merge(entity_info, on='ent_num', how='left')
    all_names['entity_type'] = all_names['entity_type'].fillna(all_names['entity_type_fill'])
    all_names['program'] = all_names['program'].fillna(all_names['program_fill'])
    all_names.drop(columns=['entity_type_fill', 'program_fill'], inplace=True)
    
    # Extract country information from addresses (take first country per entity)
    if len(add_df) > 0:
        country_map = add_df.groupby('ent_num')['Country'].first().to_dict()
        all_names['country'] = all_names['ent_num'].map(country_map)
    else:
        all_names['country'] = None
    
    # Apply text normalization
    print("  Normalizing names...")
    all_names['name_norm'] = all_names['name'].apply(normalize_text)
    
    # Remove records with empty normalized names
    before_count = len(all_names)
    all_names = all_names[all_names['name_norm'].str.len() > 0].copy()
    after_count = len(all_names)
    
    if before_count > after_count:
        print(f"  Removed {before_count - after_count} records with empty normalized names")
    
    # Reorder columns
    columns = [
        'uid', 'ent_num', 'name', 'name_norm', 'name_type', 
        'entity_type', 'program', 'country', 'remarks', 'source'
    ]
    all_names = all_names[columns]
    
    print(f"Created {len(all_names):,} name records")
    
    return all_names

# Build indices for both lists
sdn_index = build_sanctions_index(sdn_primary, sdn_alt, sdn_add, 'SDN')
cons_index = build_sanctions_index(cons_primary, cons_alt, cons_add, 'Consolidated')

# Combine into single index
sanctions_index = pd.concat([sdn_index, cons_index], ignore_index=True)

print(f"\nCombined Sanctions Index Summary:")
print(f" - Total name records: {len(sanctions_index):,}")
print(f" - From SDN:           {len(sdn_index):,}")
print(f" - From Consolidated:  {len(cons_index):,}")
print(f" - Unique entities:    {sanctions_index['ent_num'].nunique():,}")


Building SDN sanctions index...
  Normalizing names...
  Removed 2 records with empty normalized names
Created 37,841 name records

Building Consolidated sanctions index...
  Normalizing names...
  Removed 2 records with empty normalized names
Created 1,509 name records

Combined Sanctions Index Summary:
 - Total name records: 39,350
 - From SDN:           37,841
 - From Consolidated:  1,509
 - Unique entities:    18,310
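
As a rough illustration of the geographic filtering mentioned above, the sketch below narrows an index to one country while keeping records that have no country on file, since dropping those could hide a true match. The function `filter_by_country` and the toy rows are hypothetical and not part of the notebook's pipeline.

```python
import pandas as pd

def filter_by_country(index: pd.DataFrame, country: str) -> pd.DataFrame:
    """Keep rows whose country matches (case-insensitive) or is missing."""
    wanted = country.strip().lower()
    countries = index["country"].str.strip().str.lower()
    keep = countries.eq(wanted) | index["country"].isna()
    return index[keep]

# Toy index with the same column names as the real one (values are made up)
toy_index = pd.DataFrame({
    "uid": ["SDN_1", "SDN_2", "SDN_3"],
    "name_norm": ["example trading", "sample shipping", "demo holdings"],
    "country": ["Cuba", "Iran", None],
})

subset = filter_by_country(toy_index, "cuba")
print(subset["uid"].tolist())
```

Because the index stores only the first country found per entity, a filter like this is best treated as a way to prioritize candidates, not as a hard exclusion rule.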


### Validation Checks

We perform data quality validation to ensure our sanctions index is ready for fuzzy matching:
1. **Non-empty canonical names**: Every record must have valid normalized text
2. **Unique UIDs**: Each name record has a globally unique identifier
3. **Field completeness**: Key fields (entity_type, program) are populated
4. **Normalization quality**: Check sample names to verify normalization worked correctly

These checks catch data quality issues before they cause problems in production screening.

In [10]:
# Validation Check 1: Non-empty canonical names
empty_names = sanctions_index[sanctions_index['name_norm'].str.len() == 0]
print(f"Validation Check 1: Non-empty canonical names")
print(f" - Empty normalized names: {len(empty_names)}")
assert len(empty_names) == 0, "Found records with empty normalized names!"
print(f"PASS - All records have valid normalized names\n")

# Validation Check 2: Unique UIDs
print(f"Validation Check 2: Unique UIDs")
duplicate_uids = sanctions_index['uid'].duplicated().sum()
print(f" - Duplicate UIDs: {duplicate_uids}")
assert duplicate_uids == 0, "Found duplicate UIDs!"
print(f"PASS - All UIDs are unique\n")

# Validation Check 3: Field completeness
print(f"Validation Check 3: Field completeness")
print(f" - Records with entity_type: {sanctions_index['entity_type'].notna().sum():,} / {len(sanctions_index):,}")
print(f" - Records with program:     {sanctions_index['program'].notna().sum():,} / {len(sanctions_index):,}")
print(f" - Records with country:     {sanctions_index['country'].notna().sum():,} / {len(sanctions_index):,}")

# Country is optional (not all entities have addresses)
entity_type_coverage = sanctions_index['entity_type'].notna().mean()
program_coverage = sanctions_index['program'].notna().mean()

# Only report PASS when neither coverage warning fired
coverage_ok = True
if entity_type_coverage < 0.95:
    print(f"[WARNING] Entity type coverage is low: {entity_type_coverage*100:.1f}%")
    coverage_ok = False
if program_coverage < 0.95:
    print(f"[WARNING] Program coverage is low: {program_coverage*100:.1f}%")
    coverage_ok = False

if coverage_ok:
    print(f"PASS - Key fields adequately populated\n")

# Validation Check 4: Sample normalization quality
print(f"Validation Check 4: Sample normalization quality")
print(f"Checking 10 random samples...")

sample_indices = np.random.choice(len(sanctions_index), size=10, replace=False)
for idx in sample_indices:
    row = sanctions_index.iloc[idx]
    original = row['name']
    normalized = row['name_norm']
    print(f" - '{original}' → '{normalized}'")

Validation Check 1: Non-empty canonical names
 - Empty normalized names: 0
PASS - All records have valid normalized names

Validation Check 2: Unique UIDs
 - Duplicate UIDs: 0
PASS - All UIDs are unique

Validation Check 3: Field completeness
 - Records with entity_type: 39,350 / 39,350
 - Records with program:     39,350 / 39,350
 - Records with country:     39,350 / 39,350
PASS - Key fields adequately populated

Validation Check 4: Sample normalization quality
Checking 10 random samples...
 - 'AL-HARAMAYN HUMANITARIAN FOUNDATION' → 'al-haramayn humanitarian foundation'
 - 'AKTSIONERNOE OBSHCHESTVO RT-STROITELNYE TEKHNOLOGII' → 'aktsionernoe obshchestvo rt-stroitelnye tekhnologii'
 - 'VALLE VALLE, Luis Alonso' → 'valle valle luis alonso'
 - 'GLOBAL SEA LINE CO LTD' → 'global sea line co ltd'
 - 'TED TEKNOLOJI' → 'ted teknoloji'
 - 'CLOSED JOINT STOCK COMPANY 'IFD KAPITAL'' → 'closed joint stock company ifd kapital'
 - 'JOINT STOCK COMPANY SEVERGAZBANK' → 'joint stock company severgazb

### Analyze Sanctions Index

We examine the distribution of entity types, programs, and countries in our sanctions database. This helps us understand what we're screening against and can inform filtering strategies during production deployment.

In [18]:
# Distribution analysis
print("Entity Type Distribution:")
entity_type_dist = sanctions_index['entity_type'].value_counts()
for entity_type, count in entity_type_dist.head(10).items():
    pct = (count / len(sanctions_index)) * 100
    print(f"{str(entity_type)[:30]:30s}: {count:>6,} ({pct:>5.1f}%)")

print("\nSanctions Program Distribution (Top 15):")
program_dist = sanctions_index['program'].value_counts()
for program, count in program_dist.head(15).items():
    pct = (count / len(sanctions_index)) * 100
    program_str = str(program)[:40] if pd.notna(program) else "Unknown"  # truncate to the 40-char column width
    print(f"{program_str:40s}: {count:>6,} ({pct:>5.1f}%)")

print("\nCountry Distribution (Top 15):")
country_dist = sanctions_index['country'].value_counts()
for country, count in country_dist.head(15).items():
    pct = (count / len(sanctions_index)) * 100
    country_str = str(country)[:30] if pd.notna(country) else "Unknown"
    print(f"{country_str:30s}: {count:>6,} ({pct:>5.1f}%)")

# Name type distribution
print("\nName Type Distribution:")
name_type_dist = sanctions_index['name_type'].value_counts()
for name_type, count in name_type_dist.items():
    pct = (count / len(sanctions_index)) * 100
    print(f"{str(name_type):30s}: {count:>6,} ({pct:>5.1f}%)")

Entity Type Distribution:
-0-                           : 21,308 ( 54.1%)
individual                    : 16,149 ( 41.0%)
vessel                        :  1,555 (  4.0%)
aircraft                      :    338 (  0.9%)

Sanctions Program Distribution (Top 15):
RUSSIA-EO14024                          : 10,339 ( 26.3%)
SDGT                                    :  7,037 ( 17.9%)
SDNTK                                   :  2,395 (  6.1%)
UKRAINE-EO13662] [RUSSIA-EO14024        :  1,415 (  3.6%)
GLOMAG                                  :  1,218 (  3.1%)
NPWMD] [IFSR                            :  1,122 (  2.9%)
IRAN                                    :    837 (  2.1%)
UKRAINE-EO13662                         :    785 (  2.0%)
BELARUS-EO14038                         :    642 (  1.6%)
SDGT] [IFSR                             :    622 (  1.6%)
IRAN-EO13902                            :    572 (  1.5%)
IRAN-EO13846                            :    553 (  1.4%)
PAARSSR-EO13894                         :   

### Save Normalized Sanctions Index

We save the normalized sanctions index as the foundation for our fuzzy matching pipeline. This database contains all sanctioned entity names with proper text normalization, metadata enrichment, and quality validation applied.

The artifacts enable fast loading and consistent screening across the fraud detection system.

In [19]:
# Save normalized sanctions index
sanctions_index_path = MODELS_DIR / "sanctions_index.parquet"
sanctions_index.to_parquet(sanctions_index_path, index=False)

print(f"Saved sanctions index: {sanctions_index_path}")
print(f" - Shape: {sanctions_index.shape}")
print(f" - Size: {sanctions_index_path.stat().st_size / 1024:.1f} KB")

# Save metadata for pipeline tracking
metadata = {
    "created_at": pd.Timestamp.now().isoformat(),
    "total_records": len(sanctions_index),
    "unique_entities": sanctions_index['ent_num'].nunique(),
    "sources": sanctions_index['source'].value_counts().to_dict(),
    "entity_types": entity_type_dist.head(10).to_dict(),
    "top_programs": program_dist.head(10).to_dict(),
    "top_countries": country_dist.head(10).to_dict(),
    "name_types": name_type_dist.to_dict(),
    "country_coverage_pct": float(sanctions_index['country'].notna().mean() * 100),
    "validation": {
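        # Record the values measured in the validation cell instead of hard-coding zeros
        "empty_normalized_names": int(len(empty_names)),
        "duplicate_uids": int(duplicate_uids),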
        "entity_type_coverage_pct": float(entity_type_coverage * 100),
        "program_coverage_pct": float(program_coverage * 100)
    }
}

metadata_path = MODELS_DIR / "sanctions_index_metadata.json"
with open(metadata_path, 'w') as f:
    json.dump(metadata, f, indent=2)

print(f"Saved metadata: {metadata_path}")
print(f"Sanctions Index Ready: {len(sanctions_index):,} normalized name records")

Saved sanctions index: /Users/joekariuki/Documents/Devbrew/research/devbrew-payments-fraud-sanctions/packages/models/sanctions_index.parquet
 - Shape: (39350, 10)
 - Size: 2874.1 KB
Saved metadata: /Users/joekariuki/Documents/Devbrew/research/devbrew-payments-fraud-sanctions/packages/models/sanctions_index_metadata.json
Sanctions Index Ready: 39,350 normalized name records
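
For audit purposes it can help to record a content hash next to the artifact, so downstream steps can confirm they loaded the exact file that was validated here. The sketch below is an assumption about how that might look (the notebook imports `hashlib` but does not use it in this section). `file_sha256` is a hypothetical helper, demonstrated on a temporary file rather than the real parquet artifact.

```python
import hashlib
import tempfile
from pathlib import Path

def file_sha256(path: Path, chunk_size: int = 1 << 20) -> str:
    """Return the hex SHA-256 digest of a file, reading it in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()

# Demonstrate on a throwaway file; the contents are placeholder bytes
payload = b"sanctions index placeholder"
with tempfile.TemporaryDirectory() as tmp:
    demo_path = Path(tmp) / "demo_artifact.bin"
    demo_path.write_bytes(payload)
    artifact_hash = file_sha256(demo_path)

expected_hash = hashlib.sha256(payload).hexdigest()
print(artifact_hash == expected_hash)
```

In the cell above, the digest could be stored under a key such as `metadata["sha256"]` before the JSON file is written.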


## Tokenization & Canonical Forms

To enable efficient fuzzy matching, we tokenize normalized names and create canonical representations optimized for different similarity algorithms. This approach improves matching accuracy by:
- **Removing noise**: Filtering out common business suffixes (Ltd, Inc, LLC) and honorifics (Mr, Mrs)
- **Token-based matching**: Breaking names into words for flexible comparison
- **Sorted tokens**: Enabling order-independent matching (e.g., "John Doe" matches "Doe John")
- **Token sets**: Creating unique word bags for set-based similarity

These canonical forms serve as inputs to RapidFuzz's token_sort_ratio and token_set_ratio algorithms, which are robust to word order variations and common name formatting differences.

In [22]:
# Define stopwords for name tokenization
# These are common business/legal terms and honorifics that add noise to matching
STOPWORDS = {
    # Business suffixes
    "ltd", "inc", "llc", "co", "corp", "corporation", "company",
    "sa", "gmbh", "ag", "nv", "bv", "plc", "limited",
    # Honorifics
    "mr", "mrs", "ms", "dr", "prof",
    # Common words
    "the", "of", "and", "for", "de", "la", "el"
}

def tokenize(name: str) -> List[str]:
    """
    Tokenize a normalized name into words, filtering stopwords and short tokens.
    
    Splits on whitespace and hyphens, removes tokens shorter than 2 characters,
    and filters out common business/legal terms that don't aid matching.
    
    Args:
        name: Normalized name string (already lowercased and cleaned)
        
    Returns:
        List of filtered tokens
        
    Examples:
        >>> tokenize("john doe")
        ['john', 'doe']
        
        >>> tokenize("acme corporation ltd")
        ['acme']
        
        >>> tokenize("al-qaida")
        ['al', 'qaida']
    """
    if not name:
        return []
    
    # Split on whitespace and hyphens
    tokens = [t for t in re.split(r'[\s-]+', name) if t]
    
    # Filter: length >= 2 and not in stopwords
    filtered = [t for t in tokens if len(t) >= 2 and t not in STOPWORDS]
    
    return filtered

# Test tokenization function
print("Testing tokenization:\n")
test_names = [
    "john doe",
    "acme corporation ltd",
    "al-qaida",
    "banco nacional de cuba",
    "mr jose maria obrien",
    "china telecom co ltd"
]

for name in test_names:
    tokens = tokenize(name)
    print(f"  '{name}' → {tokens}")

Testing tokenization:

  'john doe' → ['john', 'doe']
  'acme corporation ltd' → ['acme']
  'al-qaida' → ['al', 'qaida']
  'banco nacional de cuba' → ['banco', 'nacional', 'cuba']
  'mr jose maria obrien' → ['jose', 'maria', 'obrien']
  'china telecom co ltd' → ['china', 'telecom']


### Create Canonical Name Forms

We apply tokenization to all normalized names and create three canonical representations for fuzzy matching:

1. **name_tokens**: List of filtered tokens, kept in their original order for analysis
2. **name_sorted**: Tokens sorted alphabetically and joined with spaces (for token_sort_ratio matching)
3. **name_set**: Unique tokens, sorted alphabetically and joined with spaces (for token_set_ratio matching)

These forms enable RapidFuzz to perform robust similarity scoring that handles word order variations, duplicates, and partial matches effectively.

In [24]:
# Apply tokenization to all normalized names
print("Tokenizing sanctions index...")
sanctions_index['name_tokens'] = sanctions_index['name_norm'].apply(tokenize)

# Create sorted token string (for token_sort_ratio)
sanctions_index['name_sorted'] = sanctions_index['name_tokens'].apply(
    lambda tokens: ' '.join(sorted(tokens))
)

# Create unique token set string (for token_set_ratio)
sanctions_index['name_set'] = sanctions_index['name_tokens'].apply(
    lambda tokens: ' '.join(sorted(set(tokens)))
)

print(f"Tokenization complete")
print(f"\nSample canonical forms:\n")

# Show examples of canonical forms
sample_indices = [0, 100, 1000, 5000, 10000]
for idx in sample_indices:
    if idx < len(sanctions_index):
        row = sanctions_index.iloc[idx]
        print(f"Original:    '{row['name']}'")
        print(f"Normalized:  '{row['name_norm']}'")
        print(f"Tokens:      {row['name_tokens']}")
        print(f"Sorted:      '{row['name_sorted']}'")
        print(f"Set:         '{row['name_set']}'")
        print()

Tokenizing sanctions index...
Tokenization complete

Sample canonical forms:

Original:    'AEROCARIBBEAN AIRLINES'
Normalized:  'aerocaribbean airlines'
Tokens:      ['aerocaribbean', 'airlines']
Sorted:      'aerocaribbean airlines'
Set:         'aerocaribbean airlines'

Original:    'SHINING PATH'
Normalized:  'shining path'
Tokens:      ['shining', 'path']
Sorted:      'path shining'
Set:         'path shining'

Original:    'HATKAEW COMPANY LTD.'
Normalized:  'hatkaew company ltd'
Tokens:      ['hatkaew']
Sorted:      'hatkaew'
Set:         'hatkaew'

Original:    'SHAMALOV, Kirill Nikolaevich'
Normalized:  'shamalov kirill nikolaevich'
Tokens:      ['shamalov', 'kirill', 'nikolaevich']
Sorted:      'kirill nikolaevich shamalov'
Set:         'kirill nikolaevich shamalov'

Original:    'JOINT STOCK COMPANY RESEARCH INSTITUTE OF ELECTRONIC AND MECHANICAL DEVICES'
Normalized:  'joint stock company research institute of electronic and mechanical devices'
Tokens:      ['joint', 'stock'

### Tokenization Validation

We validate the tokenization quality to ensure our canonical forms are suitable for fuzzy matching. Key checks include:
- **Empty token handling**: Identify names that produce no tokens after filtering
- **Stopword effectiveness**: Verify that stopword removal reduces noise without losing critical information
- **Token distribution**: Analyze token counts to understand name complexity

Names with empty tokens after filtering may require special handling or indicate data quality issues.

In [42]:
# Validation Check 1: Empty tokens after filtering
empty_tokens = sanctions_index[sanctions_index['name_tokens'].apply(len) == 0]
print(f"Validation Check 1: Empty Tokens")
print(f" Records with empty tokens: {len(empty_tokens)}")

if len(empty_tokens) > 0:
    print(f"\nSample records with empty tokens:")
    for idx in empty_tokens.head(5).index:
        row = sanctions_index.loc[idx]
        print(f" Original: '{row['name']}' | Normalized: '{row['name_norm']}'")
    print(f"\n[INFO] These names contain only stopwords or short tokens")
else:
    print(f"PASS - All names have at least one token\n")

# Validation Check 2: Token count distribution
print(f"\nValidation Check 2: Token Count Distribution")
token_counts = sanctions_index['name_tokens'].apply(len)
print(f" Mean tokens per name: {token_counts.mean():.2f}")
print(f" Median tokens per name: {token_counts.median():.0f}")
print(f" Max tokens per name: {token_counts.max()}")
print(f"\nDistribution:")
for count, freq in token_counts.value_counts().sort_index().head(10).items():
    pct = (freq / len(sanctions_index)) * 100
    print(f" {count} tokens: {freq:>6,} names ({pct:>5.1f}%)")

# Validation Check 3: Stopword removal effectiveness
print(f"\nValidation Check 3: Stopword Removal Effectiveness")
# Count how many names had stopwords removed (first 1,000 records, not a random sample)
sample = sanctions_index.head(1000)
names_with_stopwords = 0
total_stopwords_removed = 0

for idx, row in sample.iterrows():
    # Re-tokenize without stopword filter to compare
    raw_tokens = [t for t in re.split(r'[\s-]+', row['name_norm']) if t and len(t) >= 2]
    filtered_tokens = row['name_tokens']
    
    removed = len(raw_tokens) - len(filtered_tokens)
    if removed > 0:
        names_with_stopwords += 1
        total_stopwords_removed += removed

# Derive the percentage from the actual sample size rather than assuming 1,000 rows
print(f" Sample of {len(sample):,} names:")
print(f"  Names with stopwords: {names_with_stopwords} ({names_with_stopwords / len(sample) * 100:.1f}%)")
print(f"  Total stopwords removed: {total_stopwords_removed}")
print(f"  Avg stopwords per affected name: {total_stopwords_removed/names_with_stopwords if names_with_stopwords > 0 else 0:.2f}")
print(f"  Stopword filtering is active and reducing noise")

Validation Check 1: Empty Tokens
 Records with empty tokens: 10

Sample records with empty tokens:
 Original: 'T.E.G. LIMITED' | Normalized: 't e g limited'
 Original: 'J & E S. DE R.L. DE C.V.' | Normalized: 'j e s de r l de c v'
 Original: 'K M A' | Normalized: 'k m a'
 Original: 'S.A.S. E.U.' | Normalized: 's a s e u'
 Original: 'T.D.G.' | Normalized: 't d g'

[INFO] These names contain only stopwords or short tokens

Validation Check 2: Token Count Distribution
 Mean tokens per name: 3.21
 Median tokens per name: 3
 Max tokens per name: 21

Distribution:
 0 tokens:     10 names (  0.0%)
 1 tokens:  2,369 names (  6.0%)
 2 tokens: 11,228 names ( 28.5%)
 3 tokens: 13,807 names ( 35.1%)
 4 tokens:  6,227 names ( 15.8%)
 5 tokens:  2,748 names (  7.0%)
 6 tokens:  1,404 names (  3.6%)
 7 tokens:    753 names (  1.9%)
 8 tokens:    361 names (  0.9%)
 9 tokens:    206 names (  0.5%)

Validation Check 3: Stopword Removal Effectiveness
 Sample of 1,000 names:
  Names with stopwords: 194 (

### Save Enhanced Sanctions Index

We update the sanctions index artifact to include the tokenized canonical forms. This enriched index serves as the foundation for all subsequent fuzzy matching operations, including candidate generation (blocking) and similarity scoring.

Saving the tokenized forms ensures:
- **Performance**: Tokenization is computed once, not repeated for every screening request
- **Reproducibility**: Exact token transformations are preserved for audit and debugging
- **Pipeline efficiency**: Downstream steps (blocking, scoring) can load pre-processed data directly

In [29]:
# Update sanctions index with tokenized columns
sanctions_index_path = MODELS_DIR / "sanctions_index.parquet"
sanctions_index.to_parquet(sanctions_index_path, index=False)

print(f"Updated sanctions index: {sanctions_index_path}")
print(f" - Shape: {sanctions_index.shape}")
print(f" - Columns: {list(sanctions_index.columns)}")
print(f" - Size: {sanctions_index_path.stat().st_size / 1024:.1f} KB")

# Update metadata to reflect tokenization
metadata = {
    "created_at": pd.Timestamp.now().isoformat(),
    "total_records": len(sanctions_index),
    "unique_entities": sanctions_index['ent_num'].nunique(),
    "sources": sanctions_index['source'].value_counts().to_dict(),
    "tokenization": {
        "stopwords_count": len(STOPWORDS),
        "stopwords": sorted(list(STOPWORDS)),
        "empty_token_records": len(sanctions_index[sanctions_index['name_tokens'].apply(len) == 0]),
        "mean_tokens_per_name": float(sanctions_index['name_tokens'].apply(len).mean()),
        "median_tokens_per_name": float(sanctions_index['name_tokens'].apply(len).median()),
        "max_tokens_per_name": int(sanctions_index['name_tokens'].apply(len).max())
    },
    "columns": list(sanctions_index.columns),
    "validation": {
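        # Computed from the data rather than hardcoded, so the metadata cannot drift from the index
        "empty_normalized_names": int((sanctions_index['name_norm'].fillna('').str.strip() == '').sum()),
        "duplicate_uids": int(sanctions_index['uid'].duplicated().sum()),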
        "entity_type_coverage_pct": float(sanctions_index['entity_type'].notna().mean() * 100),
        "program_coverage_pct": float(sanctions_index['program'].notna().mean() * 100)
    }
}

metadata_path = MODELS_DIR / "sanctions_index_metadata.json"
with open(metadata_path, 'w') as f:
    json.dump(metadata, f, indent=2)

print(f"\nUpdated metadata: {metadata_path}")
print(f"Enhanced Sanctions Index Ready")
print(f" - {len(sanctions_index):,} records with tokenized canonical forms")
print(f" - Avg {metadata['tokenization']['mean_tokens_per_name']:.2f} tokens per name")

Updated sanctions index: /Users/joekariuki/Documents/Devbrew/research/devbrew-payments-fraud-sanctions/packages/models/sanctions_index.parquet
 - Shape: (39350, 13)
 - Columns: ['uid', 'ent_num', 'name', 'name_norm', 'name_type', 'entity_type', 'program', 'country', 'remarks', 'source', 'name_tokens', 'name_sorted', 'name_set']
 - Size: 4692.3 KB

Updated metadata: /Users/joekariuki/Documents/Devbrew/research/devbrew-payments-fraud-sanctions/packages/models/sanctions_index_metadata.json
Enhanced Sanctions Index Ready
 - 39,350 records with tokenized canonical forms
 - Avg 3.21 tokens per name


## Candidate Generation (Blocking)

Screening a query name against all 39K+ sanctions records would be computationally expensive. Blocking reduces the search space by creating efficient indices that quickly identify likely candidates based on shared characteristics.

Our blocking strategy uses three complementary approaches:
- **First token blocking**: Names starting with the same word (e.g., "John" → all "John X" entries)
- **Token count blocking**: Group by name complexity (0-1 tokens, 2 tokens, 3-4 tokens, 5+ tokens)
- **Initial signature blocking**: Match by initials pattern (e.g., "j-d" for "John Doe")

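This multi-index approach targets high recall (≥99.5%) while reducing the candidate set that needs fuzzy scoring. How much the set shrinks depends on which keys are combined: a selective key such as the first token or the initials signature usually returns a small fraction of the 39K records, whereas a token-count bucket on its own can cover up to about half of the index.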
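
To make the search-space reduction concrete, the sketch below computes a reduction ratio (the fraction of the index that blocking lets us skip) on a toy index. The `reduction_ratio` helper, the toy names, and the use of an exact token count instead of the notebook's buckets are illustrative assumptions, not part of the pipeline.

```python
from typing import List, Set


def reduction_ratio(num_candidates: int, index_size: int) -> float:
    """Fraction of the index that blocking lets us skip (1.0 = everything skipped)."""
    if index_size == 0:
        return 0.0
    return 1.0 - num_candidates / index_size


# Toy index of already-tokenized names (hypothetical data).
toy_index: List[List[str]] = [
    ["john", "doe"],
    ["john", "smith"],
    ["bank", "maskan"],
    ["banco", "nacional", "cuba"],
    ["acme"],
]
query = ["john", "dough"]

# First-token blocking: keep records whose first token equals the query's.
first_token_hits: Set[int] = {
    i for i, toks in enumerate(toy_index) if toks and toks[0] == query[0]
}

# Token-count blocking (simplified to an exact count): a much coarser key.
same_length_hits: Set[int] = {
    i for i, toks in enumerate(toy_index) if len(toks) == len(query)
}

rr_first = reduction_ratio(len(first_token_hits), len(toy_index))
rr_length = reduction_ratio(len(same_length_hits), len(toy_index))
print(f"first-token: {len(first_token_hits)} candidates, reduction {rr_first:.0%}")
print(f"token-count: {len(same_length_hits)} candidates, reduction {rr_length:.0%}")
```

On this toy data the first-token key skips 60% of the records and the token-count key skips 40%, which illustrates why coarse keys contribute most of the candidates when several keys are combined.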

In [37]:
def get_first_token(tokens: List[str]) -> str:
    """Extract first token for prefix blocking."""
    return tokens[0] if tokens else ""

def get_token_count_bucket(tokens: List[str]) -> str:
    """
    Bucket names by token count for length-based blocking.
    
    Groups:
    - "tiny": 0-1 tokens
    - "small": 2 tokens  
    - "medium": 3-4 tokens
    - "large": 5+ tokens
    """
    count = len(tokens)
    if count <= 1:
        return "tiny"
    elif count == 2:
        return "small"
    elif count <= 4:
        return "medium"
    else:
        return "large"

def get_initials_signature(tokens: List[str]) -> str:
    """
    Create initials signature from first letter of each token.
    
    Examples:
        ['john', 'doe'] → 'j-d'
        ['al', 'qaida'] → 'a-q'
        ['banco', 'nacional', 'cuba'] → 'b-n-c'
    """
    if not tokens:
        return ""
    return "-".join(t[0] for t in tokens if t)

# Test blocking functions
print("Testing blocking functions:\n")
test_cases = [
    ['john', 'doe'],
    ['al', 'qaida'],
    ['banco', 'nacional', 'cuba'],
    ['acme'],
    []
]

for tokens in test_cases:
    first = get_first_token(tokens)
    bucket = get_token_count_bucket(tokens)
    initials = get_initials_signature(tokens)
    print(f"Tokens: {tokens}")
    print(f" First token: '{first}'")
    print(f" Bucket: {bucket}")
    print(f" Initials: '{initials}'")
    print()

Testing blocking functions:

Tokens: ['john', 'doe']
 First token: 'john'
 Bucket: small
 Initials: 'j-d'

Tokens: ['al', 'qaida']
 First token: 'al'
 Bucket: small
 Initials: 'a-q'

Tokens: ['banco', 'nacional', 'cuba']
 First token: 'banco'
 Bucket: medium
 Initials: 'b-n-c'

Tokens: ['acme']
 First token: 'acme'
 Bucket: tiny
 Initials: 'a'

Tokens: []
 First token: ''
 Bucket: tiny
 Initials: ''



### Apply Blocking Keys

We compute blocking keys for all sanctions records and store them as new columns on the sanctions index. These keys enable fast candidate retrieval during screening operations.

Each blocking key creates a different "view" of the data:
- **first_token**: Groups names by their starting word
- **token_bucket**: Groups by name complexity/length
- **initials**: Groups by letter pattern (useful for abbreviated names)

Multiple blocking strategies increase recall by capturing different matching scenarios.
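
As a rough illustration of how the keys complement each other, the sketch below compares a listed name against three query variants and reports which blocking keys they still share. `blocking_keys` condenses the three helper functions defined earlier into one function, and `shared_keys` is a hypothetical helper introduced only for this example.

```python
from typing import Dict, List


def blocking_keys(tokens: List[str]) -> Dict[str, str]:
    """Compute the three blocking keys described above for one token list."""
    n = len(tokens)
    bucket = "tiny" if n <= 1 else "small" if n == 2 else "medium" if n <= 4 else "large"
    return {
        "first_token": tokens[0] if tokens else "",
        "token_bucket": bucket,
        "initials": "-".join(t[0] for t in tokens if t),
    }


def shared_keys(a: List[str], b: List[str]) -> List[str]:
    """Names of the blocking keys on which two token lists agree."""
    keys_a, keys_b = blocking_keys(a), blocking_keys(b)
    return [name for name in keys_a if keys_a[name] == keys_b[name]]


listed = ["john", "doe"]
print(shared_keys(listed, ["jon", "doe"]))              # typo in the first token
# -> ['token_bucket', 'initials']
print(shared_keys(listed, ["doe", "john"]))             # swapped word order
# -> ['token_bucket']
print(shared_keys(listed, ["john", "michael", "doe"]))  # extra middle name
# -> ['first_token']
```

Each variant breaks at least one key but keeps another, so a record stays retrievable as long as the surviving key is queried. The word-order swap survives only through the coarse token bucket.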

In [47]:
# Apply blocking keys to all sanctions records
print("Computing blocking keys for sanctions index...")

sanctions_index['first_token'] = sanctions_index['name_tokens'].apply(get_first_token)
sanctions_index['token_bucket'] = sanctions_index['name_tokens'].apply(get_token_count_bucket)
sanctions_index['initials'] = sanctions_index['name_tokens'].apply(get_initials_signature)

print(f"Blocking keys computed")
print(f"\nBlocking Key Distributions:\n")

# Show distribution of blocking keys
print("First Token Distribution (Top 15):")
first_token_dist = sanctions_index['first_token'].value_counts()
for token, count in first_token_dist.head(15).items():
    pct = (count / len(sanctions_index)) * 100
    token_str = f"'{token}'" if token else "'(empty)'"
    print(f" {token_str:20s}: {count:>5,} ({pct:>4.1f}%)")

print("\nToken Bucket Distribution:")
bucket_dist = sanctions_index['token_bucket'].value_counts()
for bucket, count in bucket_dist.items():
    pct = (count / len(sanctions_index)) * 100
    print(f"  {bucket:10s}: {count:>6,} ({pct:>5.1f}%)")

print("\nInitials Signature Distribution (Top 15):")
initials_dist = sanctions_index['initials'].value_counts()
for initials, count in initials_dist.head(15).items():
    pct = (count / len(sanctions_index)) * 100
    initials_str = f"'{initials}'" if initials else "'(empty)'"
    print(f" {initials_str:20s}: {count:>5,} ({pct:>4.1f}%)")

# Show sample blocking keys
print("\nSample Blocking Keys:")
for idx in [0, 100, 1000, 5000]:
    if idx < len(sanctions_index):
        row = sanctions_index.iloc[idx]
        print(f"\nName: '{row['name']}'")
        print(f" Tokens: {row['name_tokens']}")
        print(f" First token: '{row['first_token']}'")
        print(f" Bucket: {row['token_bucket']}")
        print(f" Initials: '{row['initials']}'")

Computing blocking keys for sanctions index...
Blocking keys computed

Blocking Key Distributions:

First Token Distribution (Top 15):
 'al'                : 2,098 ( 5.3%)
 'liability'         : 1,101 ( 2.8%)
 'joint'             :   781 ( 2.0%)
 'jsc'               :   403 ( 1.0%)
 'obshchestvo'       :   372 ( 0.9%)
 'aktsionernoe'      :   368 ( 0.9%)
 'ao'                :   318 ( 0.8%)
 'ooo'               :   304 ( 0.8%)
 'ep'                :   139 ( 0.4%)
 'open'              :   134 ( 0.3%)
 'bank'              :   134 ( 0.3%)
 'islamic'           :   132 ( 0.3%)
 'fu'                :   123 ( 0.3%)
 'public'            :   112 ( 0.3%)
 'kim'               :   111 ( 0.3%)

Token Bucket Distribution:
  medium    : 20,034 ( 50.9%)
  small     : 11,228 ( 28.5%)
  large     :  5,709 ( 14.5%)
  tiny      :  2,379 (  6.0%)

Initials Signature Distribution (Top 15):
 's'                 :   256 ( 0.7%)
 'a'                 :   242 ( 0.6%)
 's-a'               :   164 ( 0.4%)
 't'    

### Build Blocking Indices

We create inverted indices that map blocking keys to lists of candidate record positions. Because each index is a Python dictionary, looking up the candidates for a key takes O(1) time on average during screening.

For example:
- `first_token_index['john']` → [123, 456, 789, ...] (all records starting with "john")
- `bucket_index['small']` → [1, 5, 12, ...] (all 2-token names)
- `initials_index['j-d']` → [123, 456] (all names with pattern "j-d")

During screening, we query multiple indices and take the union of candidates to maximize recall while keeping the candidate set manageable.
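
The lookup-and-union step can be sketched on five toy records as follows. The `invert` helper and all `toy_*` data are assumptions made for illustration; they follow the same pattern as the indices built in this section but are not the notebook's objects.

```python
from collections import defaultdict
from typing import Dict, Iterable, List, Set


def invert(keys: Iterable[str]) -> Dict[str, List[int]]:
    """Map each non-empty key to the row positions where it occurs."""
    index: Dict[str, List[int]] = defaultdict(list)
    for pos, key in enumerate(keys):
        if key:  # empty keys carry no blocking signal
            index[key].append(pos)
    return dict(index)


# Hypothetical blocking keys for five records, listed in row order.
toy_first_tokens = ["john", "john", "bank", "banco", "acme"]
toy_buckets = ["small", "small", "small", "medium", "tiny"]
toy_initials = ["j-d", "j-s", "b-m", "b-n-c", "a"]

toy_indices = {
    "first_token": invert(toy_first_tokens),
    "token_bucket": invert(toy_buckets),
    "initials": invert(toy_initials),
}

# Keys that a query such as "john doe" would produce.
query_keys = {"first_token": "john", "token_bucket": "small", "initials": "j-d"}

union: Set[int] = set()
for name, key in query_keys.items():
    hits = toy_indices[name].get(key, [])
    print(f"{name:12s} -> {hits}")
    union.update(hits)
print(f"{'union':12s} -> {sorted(union)}")
```

The union here is `[0, 1, 2]`, identical to the token-bucket hits alone: a union can never be smaller than its largest contributing list, so the coarsest key sets the floor on candidate-set size.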

In [57]:
from collections import defaultdict

def build_blocking_index(df: pd.DataFrame, key_column: str) -> Dict[str, List[int]]:
    """
    Build inverted index mapping blocking keys to record indices.
    
    Args:
        df: DataFrame with blocking keys
        key_column: Name of column containing blocking keys
        
    Returns:
        Dictionary mapping key values to lists of row indices
    """
    index = defaultdict(list)
    for idx, key in enumerate(df[key_column]):
        if key:  # Skip empty keys
            index[key].append(idx)
    return dict(index)

print("Building blocking indices...")

# Build indices for each blocking strategy
first_token_index = build_blocking_index(sanctions_index, 'first_token')
bucket_index = build_blocking_index(sanctions_index, 'token_bucket')
initials_index = build_blocking_index(sanctions_index, 'initials')

print(f"Blocking indices built")
print(f"\nIndex Statistics:\n")

print(f"First Token Index:")
print(f" Unique keys: {len(first_token_index):,}")
print(f" Avg candidates per key: {np.mean([len(v) for v in first_token_index.values()]):.1f}")
print(f" Max candidates per key: {max(len(v) for v in first_token_index.values()):,}")

print(f"\nToken Bucket Index:")
print(f" Unique keys: {len(bucket_index):,}")
print(f" Avg candidates per key: {np.mean([len(v) for v in bucket_index.values()]):.1f}")
print(f" Max candidates per key: {max(len(v) for v in bucket_index.values()):,}")

print(f"\nInitials Index:")
print(f" Unique keys: {len(initials_index):,}")
print(f" Avg candidates per key: {np.mean([len(v) for v in initials_index.values()]):.1f}")
print(f" Max candidates per key: {max(len(v) for v in initials_index.values()):,}")

# Show example lookups
print("\nExample Index Lookups:")

example_keys = [
    ('first_token', 'bank', first_token_index),
    ('first_token', 'john', first_token_index),
    ('bucket', 'medium', bucket_index),
    ('initials', 'j-d', initials_index)
]

for index_type, key, index in example_keys:
    candidates = index.get(key, [])
    print(f"\n{index_type}['{key}']:")
    print(f" Candidates: {len(candidates):,}")
    if candidates:
        # Show first 3 candidate names
        print(f" Sample names:")
        for idx in candidates[:3]:
            name = sanctions_index.iloc[idx]['name']
            print(f"  - {name}")

Building blocking indices...
Blocking indices built

Index Statistics:

First Token Index:
 Unique keys: 15,597
 Avg candidates per key: 2.5
 Max candidates per key: 2,098

Token Bucket Index:
 Unique keys: 4
 Avg candidates per key: 9837.5
 Max candidates per key: 20,034

Initials Index:
 Unique keys: 15,986
 Avg candidates per key: 2.5
 Max candidates per key: 256

Example Index Lookups:

first_token['bank']:
 Candidates: 134
 Sample names:
  - BANK MARKAZI JOMHOURI ISLAMI IRAN
  - BANK MASKAN
  - BANK REFAH KARGARAN

first_token['john']:
 Candidates: 1
 Sample names:
  - JOHN, Damion Patrick

bucket['medium']:
 Candidates: 20,034
 Sample names:
  - BANCO NACIONAL DE CUBA
  - COMERCIAL DE RODAJES Y MAQUINARIA, S.A.
  - COMERCIALIZACION DE PRODUCTOS VARIOS

initials['j-d']:
 Candidates: 5
 Sample names:
  - JOKIC, Dragan
  - JSC DRAGA
  - JAMA'AT-I-DAWAT


### Candidate Retrieval Function

We implement the candidate retrieval logic that queries multiple blocking indices and returns the union of candidates. This multi-strategy approach maximizes recall by capturing different matching scenarios.

The retrieval strategy:
1. Extract blocking keys from query name (first token, bucket, initials)
2. Query each index to get candidate lists
3. Take union of all candidates (deduplicate)
4. Return candidate indices for fuzzy scoring

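Because candidates are combined by union, a record only needs to share one blocking key with the query to be retrieved, which lowers the risk of missing matches that differ in formatting, spelling, or word order. The trade-off is that the union is at least as large as the largest contributing list, so the token bucket tends to dominate the candidate count.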
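
One way to check whether a retrieval strategy meets a recall target is to run it over labeled query variants and count how often the true record is among the candidates. The sketch below assumes such a labeled set exists (the notebook does not provide one); `blocking_recall`, `toy_retrieve`, and the toy index are hypothetical stand-ins, not the notebook's `get_candidates`.

```python
from typing import Callable, Dict, List, Sequence, Tuple


def blocking_recall(
    labeled_pairs: Sequence[Tuple[str, int]],
    retrieve: Callable[[str], List[int]],
) -> float:
    """Share of (query, true_record_position) pairs whose true record
    appears in the candidate list returned by `retrieve`."""
    if not labeled_pairs:
        return 0.0
    hits = sum(
        1 for query, true_pos in labeled_pairs if true_pos in set(retrieve(query))
    )
    return hits / len(labeled_pairs)


# Hypothetical stand-in index: first token -> record positions.
toy_first_token_index: Dict[str, List[int]] = {"john": [0, 1], "bank": [2]}


def toy_retrieve(query: str) -> List[int]:
    """First-token-only retrieval over whitespace tokens (illustrative)."""
    tokens = query.split()
    return toy_first_token_index.get(tokens[0], []) if tokens else []


pairs = [
    ("john doe", 0),     # exact first token: found
    ("bank maskan", 2),  # exact first token: found
    ("jon doe", 0),      # typo in the first token: missed
    ("doe john", 0),     # swapped word order: missed
]
recall = blocking_recall(pairs, toy_retrieve)
print(f"Blocking recall: {recall:.0%}")
```

With first-token blocking alone, the toy recall is 50%, since the typo and the word-order swap both change the first token. Passing a multi-index retrieval function as `retrieve` instead would show how much recall the additional keys recover and at what cost in candidate volume.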

In [62]:
def get_candidates(
    query_name: str,
    first_token_idx: Dict[str, List[int]],
    bucket_idx: Dict[str, List[int]],
    initials_idx: Dict[str, List[int]]
) -> List[int]:
    """
    Retrieve candidate indices using multi-strategy blocking.
    
    Args:
        query_name: Normalized query name to screen
        first_token_idx: First token blocking index
        bucket_idx: Token bucket blocking index
        initials_idx: Initials signature blocking index
        
    Returns:
        List of candidate record indices (deduplicated)
    """
    # Tokenize query
    query_tokens = tokenize(query_name)
    
    if not query_tokens:
        return []
    
    # Extract blocking keys from query
    query_first = get_first_token(query_tokens)
    query_bucket = get_token_count_bucket(query_tokens)
    query_initials = get_initials_signature(query_tokens)
    
    # Collect candidates from all indices
    candidates = set()
    
    # Strategy 1: First token match
    if query_first in first_token_idx:
        candidates.update(first_token_idx[query_first])
    
    # Strategy 2: Token bucket match (same complexity)
    if query_bucket in bucket_idx:
        candidates.update(bucket_idx[query_bucket])
    
    # Strategy 3: Initials match
    if query_initials in initials_idx:
        candidates.update(initials_idx[query_initials])
    
    return sorted(candidates)

# Test candidate retrieval
print("Testing candidate retrieval:\n")

test_queries = [
    "john doe",
    "bank of china",
    "al qaida",
    "acme corporation"
]

for query in test_queries:
    # Normalize query
    query_norm = normalize_text(query)
    
    # Get candidates
    candidates = get_candidates(
        query_norm,
        first_token_index,
        bucket_index,
        initials_index
    )
    
    print(f"Query: '{query}'")
    print(f"  Normalized: '{query_norm}'")
    print(f"  Candidates: {len(candidates):,}")
    
    # Show sample candidate names
    if candidates:
        print(f"  Sample matches:")
        for idx in candidates[:5]:
            name = sanctions_index.iloc[idx]['name']
            print(f"    - {name}")
    print()

Testing candidate retrieval:

Query: 'john doe'
  Normalized: 'john doe'
  Candidates: 11,229
  Sample matches:
    - AEROCARIBBEAN AIRLINES
    - ANGLO-CARIBBEAN CO., LTD.
    - BOUTIQUE LA MAISON
    - CASA DE CUBA
    - CIMEX IBERICA

Query: 'bank of china'
  Normalized: 'bank of china'
  Candidates: 11,329
  Sample matches:
    - AEROCARIBBEAN AIRLINES
    - ANGLO-CARIBBEAN CO., LTD.
    - BOUTIQUE LA MAISON
    - CASA DE CUBA
    - CIMEX IBERICA

Query: 'al qaida'
  Normalized: 'al qaida'
  Candidates: 13,257
  Sample matches:
    - AEROCARIBBEAN AIRLINES
    - ANGLO-CARIBBEAN CO., LTD.
    - BOUTIQUE LA MAISON
    - CASA DE CUBA
    - CIMEX IBERICA

Query: 'acme corporation'
  Normalized: 'acme corporation'
  Candidates: 2,379
  Sample matches:
    - CECOEX, S.A.
    - CIMEX
    - CIMEX, S.A.
    - COTEI
    - CUBAEXPORT

