# Sanctions Screening

- **Purpose:** OFAC sanctions screening with fuzzy name matching for a fraud detection pipeline  
- **Author:** Devbrew LLC  
- **Last Updated:** November 3, 2025  
- **Status:** In progress  
- **License:** Apache 2.0 (Code) | Public Domain (OFAC Data)

---

## Dataset License Notice

This notebook uses **OFAC Sanctions Lists** (SDN and Consolidated) from the U.S. Department of the Treasury.

**Dataset License:** Public Domain  
- OFAC sanctions data is publicly available from [OFAC Sanctions List Search](https://sanctionslist.ofac.treas.gov/Home)  
- Data can be freely used, redistributed, and incorporated into commercial systems  
- Updates are published regularly; production systems should refresh data periodically  

**Setup Instructions:** See [`../data_catalog/README.md`](../data_catalog/README.md) for download instructions.

**Code License:** This notebook's code is licensed under Apache 2.0 (open source).

**Disclaimer:** This is a research demonstration. Production sanctions screening requires broader list coverage (EU, UN, UK HMT), legal review, and compliance with local regulations.

---

## Notebook Configuration

### Environment Setup

We import the libraries used for text processing and fuzzy matching, set pandas display and plotting defaults, and fix NumPy's random seed so that any step relying on NumPy's global random state gives the same result on every run.

The name normalization, tokenization, and similarity scoring steps later in the notebook all rely on these imports and settings.

In [3]:
import warnings
from pathlib import Path
import json
import hashlib
import unicodedata
import re
from typing import Dict, Any, Optional, List, Tuple
import time

import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt
import rapidfuzz as rf
from rapidfuzz import fuzz, process

# Configuration
warnings.filterwarnings("ignore")
pd.set_option("display.max_columns", 100)
pd.set_option("display.max_rows", 100)
pd.set_option("display.float_format", '{:.2f}'.format)

# Plotting configuration
sns.set_style("whitegrid")
plt.rcParams["figure.figsize"] = (12, 6)
plt.rcParams["font.size"] = 10

# Reproducibility
RANDOM_STATE = 42
np.random.seed(RANDOM_STATE)

print("Environment configured successfully")
print(f"pandas: {pd.__version__}")
print(f"numpy: {np.__version__}")
print(f"rapidfuzz: {rf.__version__}")

Environment configured successfully
pandas: 2.3.3
numpy: 2.3.3
rapidfuzz: 3.14.1


### Path Configuration

We define the project directory structure and check that the six expected OFAC files exist before proceeding. Paths are resolved from the parent of the current working directory, so the notebook is expected to run from a direct subdirectory of the project root.

The check prints a per-file status, warns if anything is missing, and returns a boolean that is stored in `data_available`.

In [5]:
# Project paths
PROJECT_ROOT = Path.cwd().parent
DATA_DIR = PROJECT_ROOT / "data_catalog"
OFAC_DIR = DATA_DIR / "ofac"
PROCESSED_DIR = DATA_DIR / "processed"
MODELS_DIR = PROJECT_ROOT / "packages" / "models"

# Ensure output directories exist
PROCESSED_DIR.mkdir(parents=True, exist_ok=True)
MODELS_DIR.mkdir(parents=True, exist_ok=True)

# Expected OFAC data files
OFAC_FILES = {
    'SDN Primary': OFAC_DIR / 'sdn' / 'sdn.csv',
    'SDN Alternate': OFAC_DIR / 'sdn' / 'alt.csv',
    'SDN Address': OFAC_DIR / 'sdn' / 'add.csv',
    'Consolidated Primary': OFAC_DIR / 'consolidated' / 'cons_prim.csv',
    'Consolidated Alternate': OFAC_DIR / 'consolidated' / 'cons_alt.csv',
    'Consolidated Address': OFAC_DIR / 'consolidated' / 'cons_add.csv',
}

def validate_required_data():
    """Validate that OFAC sanctions data files exist."""
    print("OFAC Data Availability Check:")
    
    all_exist = True
    for name, path in OFAC_FILES.items():
        exists = path.exists()
        status = "Found" if exists else "Missing"
        print(f" - {name:25s}: {status}")
        if not exists:
            all_exist = False
    
    if not all_exist:
        print("\n[WARNING] Some OFAC files are missing; see data_catalog/README.md for instructions")
    else:
        print("\nAll required OFAC data files are available")
    
    return all_exist

data_available = validate_required_data()

OFAC Data Availability Check:
 - SDN Primary              : Found
 - SDN Alternate            : Found
 - SDN Address              : Found
 - Consolidated Primary     : Found
 - Consolidated Alternate   : Found
 - Consolidated Address     : Found

All required OFAC data files are available
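
The check above prints a warning and returns a boolean, so execution continues even when files are missing. For scheduled or batch runs, a fail-fast variant may be preferable. The following is a minimal sketch under that assumption; `find_missing_files` and `require_files` are hypothetical helper names that are not part of this notebook, and the temporary directory only stands in for `OFAC_DIR`.

```python
import tempfile
from pathlib import Path
from typing import Dict, List


def find_missing_files(files: Dict[str, Path]) -> List[str]:
    """Return the labels whose path is not an existing regular file."""
    return [label for label, path in files.items() if not path.is_file()]


def require_files(files: Dict[str, Path]) -> None:
    """Raise FileNotFoundError naming every missing file at once."""
    missing = find_missing_files(files)
    if missing:
        raise FileNotFoundError("Missing OFAC files: " + ", ".join(missing))


# Demo with a throwaway directory: one file present, one absent.
with tempfile.TemporaryDirectory() as tmp:
    present = Path(tmp) / "sdn.csv"
    present.write_text("placeholder")
    demo_files = {
        "SDN Primary": present,
        "SDN Alternate": Path(tmp) / "alt.csv",
    }
    missing = find_missing_files(demo_files)

print(missing)  # ['SDN Alternate']
```

In the notebook, the same idea would amount to calling `require_files(OFAC_FILES)` in place of the boolean check, which stops later cells from running against partial data.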


## Load & Normalize OFAC Datasets

We load OFAC sanctions lists (SDN and Consolidated) and apply comprehensive text normalization to enable robust fuzzy matching. This step is critical for handling variations in how names appear across different systems and languages.

Our normalization strategy addresses several common challenges in sanctions screening:
- **Unicode variations**: Apply NFKC (compatibility composition) so that equivalent forms such as fullwidth letters and ligatures collapse to a single representation
- **Accent marks**: Strip diacritics to match "José" with "Jose"
- **Case sensitivity**: Lowercase everything for case-insensitive matching
- **Punctuation**: Remove ASCII quotes, keep ASCII hyphens, and replace all other punctuation with a space
- **Whitespace**: Collapse multiple spaces to single space

This preprocessing ensures we can match names reliably even when they're formatted differently in transaction data versus sanctions lists.
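
To make the Unicode and accent steps concrete, here is a minimal sketch that uses only the standard library. `fold` is a hypothetical helper name introduced for illustration; it covers only NFKC normalization, lowercasing, and diacritic stripping, not the punctuation or whitespace handling.

```python
import unicodedata


def fold(raw: str) -> str:
    """NFKC-normalize, lowercase, then drop combining marks (category Mn)."""
    text = unicodedata.normalize("NFKC", raw).lower()
    decomposed = unicodedata.normalize("NFD", text)
    return "".join(ch for ch in decomposed if unicodedata.category(ch) != "Mn")


samples = [
    "\uff2a\uff4f\uff53\u00e9",  # fullwidth "Jos" followed by "é"
    "\ufb01nance",               # starts with the single-character ligature "ﬁ"
    "Soci\u00e9t\u00e9 G\u00e9n\u00e9rale",
]
for raw in samples:
    print(f"{raw!r} -> {fold(raw)!r}")
```

NFKC is what maps the fullwidth letters and the ligature to plain ASCII; the NFD pass only separates base letters from their accents so that the accents can be filtered out.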

In [None]:
def normalize_text(text: str) -> str:
    """
    Normalize text for robust fuzzy matching.
    
    Applies NFKC normalization, lowercasing, accent stripping,
    punctuation canonicalization, and whitespace collapse.
    
    Note: Non-Latin scripts (Chinese, Arabic, Cyrillic) are stripped
    because OFAC sanctions lists use romanized names. For example:
    - "中国工商银行" → "" (empty)
    - "INDUSTRIAL AND COMMERCIAL BANK OF CHINA" → "industrial and commercial bank of china"
    
    Args:
        text: Raw text string to normalize
        
    Returns:
        Normalized text string suitable for fuzzy matching.
        Returns empty string if input contains only non-Latin characters.
        
    Examples:
        >>> normalize_text("José María O'Brien")
        'jose maria obrien'
        
        >>> normalize_text("AL-QAIDA")
        'al-qaida'
        
        >>> normalize_text("中国工商银行")
        ''
    """
    if not text or pd.isna(text):
        return ""
    
    # Convert to string if not already
    text = str(text)
    
    # Unicode normalization (NFKC: compatibility decomposition, then canonical composition)
    text = unicodedata.normalize("NFKC", text)
    
    # Lowercase
    text = text.lower()
    
    # Strip accent marks (diacritics)
    # Decompose characters, then filter out combining marks
    text = ''.join(
        char for char in unicodedata.normalize("NFD", text)
        if unicodedata.category(char) != 'Mn'
    )
    
    # Remove quotes (single and double)
    text = re.sub(r"['\"]", "", text)
    
    # Replace everything except a-z, 0-9, whitespace, and hyphen with a space
    # Note: This strips non-Latin scripts (Chinese, Arabic, Cyrillic, etc.)
    # as well as Latin letters that NFD cannot decompose (e.g. "ø", "ß", "ł").
    # OFAC lists use romanized names, so stripping other scripts is intentional
    text = re.sub(r"[^a-z0-9\s-]", " ", text)
    
    # Collapse multiple spaces to single space
    text = re.sub(r"\s+", " ", text)
    
    # Strip leading/trailing whitespace
    text = text.strip()
    
    return text

# Test normalization function
print("Testing text normalization:\n")
test_cases = [
    "José María O'Brien",
    "AL-QAIDA",
    "Société Générale",
    "中国工商银行",  # Chinese - will be stripped (OFAC uses romanized names)
    "  Multiple   Spaces  ",
    "UPPER-case-MiXeD",
]

for test in test_cases:
    normalized = normalize_text(test)
    # Quote the result so an empty string is visible as ''
    print(f"  '{test}' → '{normalized}'")

Testing text normalization:

  'José María O'Brien' → 'jose maria obrien'
  'AL-QAIDA' → 'al-qaida'
  'Société Générale' → 'societe generale'
  '中国工商银行' → ''
  '  Multiple   Spaces  ' → 'multiple spaces'
  'UPPER-case-MiXeD' → 'upper-case-mixed'
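
The output above shows ASCII hyphens surviving normalization and ASCII apostrophes being removed. Unicode dash variants (en dash, em dash) and typographic quotes are not covered by those two rules, so in `normalize_text` they fall through to the catch-all replacement and become spaces. For example, a name written with an en dash would normalize to `al qaida` rather than `al-qaida`. One possible way to handle this, sketched below, is a small pre-pass that runs before `normalize_text`. The helper name `canonicalize_punctuation` and the character sets are assumptions chosen for illustration and are not exhaustive.

```python
# Dash-like characters to map to ASCII hyphen-minus (illustrative, not exhaustive):
# hyphen, non-breaking hyphen, figure dash, en dash, em dash, minus sign.
_DASHES = dict.fromkeys(map(ord, "\u2010\u2011\u2012\u2013\u2014\u2212"), "-")

# Typographic quotes and apostrophes; mapping to None makes str.translate delete them.
_QUOTES = dict.fromkeys(map(ord, "\u2018\u2019\u201c\u201d\u02bc"), None)

_PUNCT_TABLE = {**_DASHES, **_QUOTES}


def canonicalize_punctuation(text: str) -> str:
    """Map Unicode dashes to '-' and delete typographic quotes."""
    return text.translate(_PUNCT_TABLE)


print(canonicalize_punctuation("AL\u2013QAIDA"))  # AL-QAIDA
print(canonicalize_punctuation("O\u2019Brien"))   # OBrien
```

Whether dashes should be kept as hyphens or turned into spaces is a separate design choice; the point of the pre-pass is only that both spellings of the same name end up on the same path.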
