# Identity Resolution: Matching Phase

This notebook implements the matching phase of identity resolution, using candidate pairs generated by the hybrid blocking strategy from `Identity_resolution_workflow.ipynb`.

## Workflow Overview

1. **Setup**: Import libraries, configure logging, define paths
2. **Load Data**: Load source tables, candidate pairs, and ground truth splits
3. **Name Normalization**: Apply consistent name normalization (reusable)
4. **Common Matching Infrastructure**: Define reusable matching framework
5. **Matching Methods**: Implement different matching strategies
   - Rule-Based Matching
   - Optimized Matching (name variants + birth year constraint)
   - ML-Based Matching (RandomForestClassifier)
   - [Future: LLM-based matching, etc.]
6. **Evaluation**: Assess matching quality (reusable)
7. **Export Results**: Save final matches and metrics (reusable)


## 0. Setup


### 0.1 Import Libraries and Setup Logging


In [135]:
import logging
import os
from pathlib import Path
import pandas as pd
import sys

# Setup logging
os.makedirs('logs', exist_ok=True)
logging.basicConfig(
    level=logging.INFO,
    format='[%(levelname)-5s] %(name)s - %(message)s',
    handlers=[
        logging.FileHandler('logs/matching.log'),
        logging.StreamHandler()
    ],
    force=True
)
logging.getLogger().info('Matching phase logging enabled')

print("✓ Logging setup complete")


[INFO ] root - Matching phase logging enabled


✓ Logging setup complete


### 0.2 Define Paths


In [136]:
# Project base directory
BASE_DIR = Path('/Users/zhangzihan/Desktop/WBI_project/Schema_Mapped_Datasets')

# Input: Candidate pairs from blocking phase
CANDIDATES_DIR = BASE_DIR / 'data' / 'output' / 'workflow'
CANDIDATES_LR = CANDIDATES_DIR / 'candidates_hybrid_LR.csv'
CANDIDATES_LS = CANDIDATES_DIR / 'candidates_hybrid_LS.csv'

# Input: Source data tables
CLEAN_DIR = BASE_DIR / 'data' / 'output' / 'clean'
LAHMAN_PATH = (CLEAN_DIR / 'Lahman_Mapped_dedup.xml'
               if (CLEAN_DIR / 'Lahman_Mapped_dedup.xml').exists()
               else BASE_DIR / 'Lahman_Mapped.xml')
REFERENCE_PATH = (CLEAN_DIR / 'Reference_Mapped_dedup.xml'
                  if (CLEAN_DIR / 'Reference_Mapped_dedup.xml').exists()
                  else BASE_DIR / 'Reference_Mapped.xml')
SAVANT_PATH = (CLEAN_DIR / 'Savant_Mapped_dedup.xml'
               if (CLEAN_DIR / 'Savant_Mapped_dedup.xml').exists()
               else BASE_DIR / 'Savant_Mapped.xml')

# Input: Ground truth splits for evaluation
SPLITS_DIR = BASE_DIR / 'data' / 'output' / 'gt' / 'splits'
LR_TRAIN = SPLITS_DIR / 'gt_LR_train.csv'
LR_VAL = SPLITS_DIR / 'gt_LR_val.csv'
LR_TEST = SPLITS_DIR / 'gt_LR_test.csv'
LS_TRAIN = SPLITS_DIR / 'gt_LS_train.csv'
LS_VAL = SPLITS_DIR / 'gt_LS_val.csv'
LS_TEST = SPLITS_DIR / 'gt_LS_test.csv'

# Output: Matching results
OUTPUT_DIR = BASE_DIR / 'data' / 'output' / 'matching'
OUTPUT_DIR.mkdir(parents=True, exist_ok=True)

print("Paths configured:")
print(f"  Candidates LR: {CANDIDATES_LR.exists()}")
print(f"  Candidates LS: {CANDIDATES_LS.exists()}")
print(f"  Output dir: {OUTPUT_DIR}")


Paths configured:
  Candidates LR: True
  Candidates LS: True
  Output dir: /Users/zhangzihan/Desktop/WBI_project/Schema_Mapped_Datasets/data/output/matching
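
The three source paths above repeat the same "use the deduplicated file if it exists, otherwise the original" expression. A minimal sketch of how that fallback could be factored out is shown below; `first_existing` is a hypothetical helper, not part of the notebook or of PyDI.

```python
from pathlib import Path

def first_existing(*paths: Path) -> Path:
    """Return the first path that exists; fall back to the last one if none do."""
    for path in paths:
        if path.exists():
            return path
    return paths[-1]

# Hypothetical usage mirroring the Lahman fallback in the cell above:
# LAHMAN_PATH = first_existing(CLEAN_DIR / 'Lahman_Mapped_dedup.xml',
#                              BASE_DIR / 'Lahman_Mapped.xml')
```

With two arguments this resolves to the same path as the conditional expressions in the cell, since the last candidate is returned whether or not it exists.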


### 0.3 Import PyDI


In [137]:
import sys
import subprocess
import importlib
import PyDI  # noqa: F401

from PyDI.io import load_xml
from PyDI.entitymatching import RuleBasedMatcher, GreedyOneToOneMatchingAlgorithm, MLBasedMatcher, FeatureExtractor
from PyDI.entitymatching.comparators import StringComparator, DateComparator
from PyDI.entitymatching.evaluation import EntityMatchingEvaluator

# ML libraries
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier, GradientBoostingClassifier
from xgboost import XGBClassifier

print("✓ PyDI imported successfully")
print("✓ ML libraries imported successfully")


✓ PyDI imported successfully
✓ ML libraries imported successfully


## 1. Load Data

### 1.1 Load Source Data Tables


In [138]:
# Load source data tables
print("Loading source data tables...")
L_full = load_xml(LAHMAN_PATH).convert_dtypes().reset_index(drop=True)
R_full = load_xml(REFERENCE_PATH).convert_dtypes().reset_index(drop=True)
S_full = load_xml(SAVANT_PATH).convert_dtypes().reset_index(drop=True)

# Create _rid column for matching (same format as blocking phase)
for df, tag in [(L_full, 'L'), (R_full, 'R'), (S_full, 'S')]:
    if {'player_id', 'season_year'} <= set(df.columns):
        pid = df['player_id'].astype('string').fillna('NA')
        season = df['season_year'].astype('Int64').astype('string').fillna('NA')
        df['_rid'] = pid + '|' + season + f'|{tag}'
    else:
        df['_rid'] = df.index.map(lambda i: f"{tag}{i:06d}")

print(f"  L_full: {len(L_full):,} records")
print(f"  R_full: {len(R_full):,} records")
print(f"  S_full: {len(S_full):,} records")
print("✓ Source tables loaded")


Loading source data tables...
  L_full: 106,553 records
  R_full: 15,215 records
  S_full: 6,743 records
✓ Source tables loaded
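
To make the `_rid` format concrete, here is a plain-Python sketch of the same `player_id|season_year|tag` construction, followed by a duplicate check. The later merges on `_rid` implicitly assume each `_rid` is unique within a table, so such a check may be worth running. The function name `make_rid` and the sample records are hypothetical; the notebook itself builds the column with vectorized pandas operations.

```python
from collections import Counter

def make_rid(player_id, season_year, tag):
    """Build 'player_id|season_year|tag', using 'NA' for missing parts."""
    pid = 'NA' if player_id is None else str(player_id)
    season = 'NA' if season_year is None else str(int(season_year))
    return f"{pid}|{season}|{tag}"

# Hypothetical records: (player_id, season_year)
records = [('playerA01', 2019), ('playerA01', 2020),
           ('playerB02', None), ('playerA01', 2019)]
rids = [make_rid(pid, season, 'L') for pid, season in records]

# Any _rid that occurs more than once would fan out rows when merged on '_rid'
duplicates = {rid: n for rid, n in Counter(rids).items() if n > 1}
print(rids)
print(duplicates)
```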


### 1.2 Load Candidate Pairs from Hybrid Blocking


In [139]:
# Load candidate pairs generated by hybrid blocking strategy
print("Loading candidate pairs from hybrid blocking...")
candidates_lr = pd.read_csv(CANDIDATES_LR)
candidates_ls = pd.read_csv(CANDIDATES_LS)

print(f"  LR candidates: {len(candidates_lr):,} pairs")
print(f"  LS candidates: {len(candidates_ls):,} pairs")
print(f"  Total candidates: {len(candidates_lr) + len(candidates_ls):,} pairs")
print("✓ Candidate pairs loaded")


Loading candidate pairs from hybrid blocking...
  LR candidates: 5,733,797 pairs
  LS candidates: 215,708 pairs
  Total candidates: 5,949,505 pairs
✓ Candidate pairs loaded
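
As a quick sanity check on the blocking output, the reduction ratio can be computed from the counts printed above. The sketch assumes the usual definition, one minus the share of the full cross product that survives blocking. The function name `reduction_ratio` is an illustration, not a PyDI call.

```python
def reduction_ratio(n_candidates: int, n_left: int, n_right: int) -> float:
    """1 - (candidate pairs / all possible pairs in the cross product)."""
    return 1.0 - n_candidates / (n_left * n_right)

# Counts taken from the load outputs above
lr = reduction_ratio(5_733_797, 106_553, 15_215)
ls = reduction_ratio(215_708, 106_553, 6_743)
print(f"LR reduction ratio: {lr:.4f}")
print(f"LS reduction ratio: {ls:.4f}")
```

The same formula reproduces the reduction ratio that PyDI logs later in this notebook for the season-filtered LR pairs (130,895 pairs, ratio ≈ 0.99991926).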


### 1.3 Load Ground Truth Splits for Evaluation


In [141]:
# Load ground truth splits
print("Loading ground truth splits...")
lr_train_df = pd.read_csv(LR_TRAIN)
lr_val_df = pd.read_csv(LR_VAL)
lr_test_df = pd.read_csv(LR_TEST)

ls_train_df = pd.read_csv(LS_TRAIN)
ls_val_df = pd.read_csv(LS_VAL)
ls_test_df = pd.read_csv(LS_TEST)

# Organize splits by edge
splits = {
    'LR': {'train': lr_train_df, 'val': lr_val_df, 'test': lr_test_df},
    'LS': {'train': ls_train_df, 'val': ls_val_df, 'test': ls_test_df}
}

# Organize source tables by edge
source_tables = {
    'LR': (L_full, R_full),
    'LS': (L_full, S_full)
}

# Organize candidate pairs by edge
candidates = {
    'LR': candidates_lr,
    'LS': candidates_ls
}

print("✓ Ground truth splits loaded")
print(f"  LR: train={len(lr_train_df)}, val={len(lr_val_df)}, test={len(lr_test_df)}")
print(f"  LS: train={len(ls_train_df)}, val={len(ls_val_df)}, test={len(ls_test_df)}")


Loading ground truth splits...
✓ Ground truth splits loaded
  LR: train=302, val=99, test=99
  LS: train=299, val=96, test=105


## 2. Name Normalization (Reusable)

Apply consistent name normalization across all matching methods.


In [142]:
# Check if normalized names exist, if not, apply normalization
import re
import unicodedata

def normalize_name_for_blocking(text: str) -> str:
    r"""Normalize name for consistent matching (same as workflow notebook)"""
    if not isinstance(text, str):
        return ''
    
    # Decode literal backslash-x-hex patterns
    def decode_literal_hex_sequence(match):
        hex_bytes = []
        for i in range(1, len(match.groups()) + 1):
            hex_str = match.group(i)
            try:
                hex_bytes.append(int(hex_str, 16))
            except ValueError:
                return match.group(0)
        try:
            decoded = bytes(hex_bytes).decode('utf-8')
            return decoded
        except (UnicodeDecodeError, ValueError):
            return match.group(0)
    
    text = re.sub(r'\\x([0-9a-fA-F]{2})\\x([0-9a-fA-F]{2})', decode_literal_hex_sequence, text)
    text = re.sub(r'\\x([0-9a-fA-F]{2})\\x([0-9a-fA-F]{2})\\x([0-9a-fA-F]{2})', decode_literal_hex_sequence, text)
    
    def decode_single_hex(match):
        hex_str = match.group(1)
        try:
            return chr(int(hex_str, 16))
        except (ValueError, OverflowError):
            return match.group(0)
    text = re.sub(r'\\x([0-9a-fA-F]{2})', decode_single_hex, text)
    
    # Unicode normalization
    text = unicodedata.normalize('NFD', text)
    text = ''.join(c for c in text if unicodedata.category(c) != 'Mn')
    
    # Lowercase and strip
    text = text.lower().strip()
    
    # Handle backslash escapes
    text = text.replace('\\ ', ' ').replace('\\', ' ')
    
    # Standardize punctuation
    text = text.replace('.', '').replace(',', '').replace('-', ' ').replace("'", '')
    
    # Normalize spaces
    text = re.sub(r'\s+', ' ', text).strip()
    
    # Remove common suffixes
    for suffix in [' jr', ' sr', ' ii', ' iii', ' iv', ' v']:
        text = text.replace(suffix, '')
    text = text.strip()
    
    return text

# Apply normalization if needed
for name, df in [('L_full', L_full), ('R_full', R_full), ('S_full', S_full)]:
    if 'full_name_normalized' not in df.columns and 'full_name' in df.columns:
        df['full_name_normalized'] = df['full_name'].astype('string').map(normalize_name_for_blocking)
        print(f"  {name}: Created 'full_name_normalized' column")
    elif 'full_name_normalized' in df.columns:
        print(f"  {name}: 'full_name_normalized' column already exists")

print("✓ Name normalization complete")


  L_full: Created 'full_name_normalized' column
  R_full: Created 'full_name_normalized' column
  S_full: Created 'full_name_normalized' column
✓ Name normalization complete
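
One caveat about the suffix-removal loop in `normalize_name_for_blocking`: `str.replace` removes a suffix string wherever it occurs, not only at the end of the name, and `' ii'` is processed before `' iii'`. The sketch below reproduces that loop in isolation and compares it with an end-anchored regular expression. The anchored variant is an assumed alternative, not what the notebook runs; adopting it would change the normalization, so it would need to be applied in the blocking notebook as well. The sample names are made up.

```python
import re

SUFFIXES = [' jr', ' sr', ' ii', ' iii', ' iv', ' v']

def strip_suffix_replace(text: str) -> str:
    """Mirror of the loop above: removes each suffix wherever it occurs."""
    for suffix in SUFFIXES:
        text = text.replace(suffix, '')
    return text.strip()

# Assumed alternative: only strip a suffix that is the final token
_SUFFIX_AT_END = re.compile(r'\s+(?:jr|sr|ii|iii|iv|v)$')

def strip_suffix_anchored(text: str) -> str:
    """Remove a generational suffix only when it ends the name."""
    return _SUFFIX_AT_END.sub('', text).strip()

for name in ['sam doe jr', 'john smith iii', 'john vance']:
    print(f"{name!r}: replace={strip_suffix_replace(name)!r}, "
          f"anchored={strip_suffix_anchored(name)!r}")
```

Because both sides of a pair go through the same function, the replace-based behavior distorts names consistently, but it can still merge distinct names or leave stray characters behind.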


## 3. Common Matching Infrastructure (Reusable)

This section defines reusable functions and frameworks for all matching methods.


### 3.1 Matching Function Interface

Define a standard interface for matching functions. All matching methods should follow this pattern.


In [143]:
# Matching function interface
# All matching methods should return a DataFrame with columns: ['id1', 'id2', 'sim']
# where 'sim' is the similarity score (0.0 to 1.0)

def apply_matching_method(
    edge_name: str,
    matching_func,
    *args,
    **kwargs
) -> pd.DataFrame:
    """
    Apply a matching method to candidate pairs for a given edge.
    
    Args:
        edge_name: Edge name ('LR' or 'LS')
        matching_func: Function that takes (left_df, right_df, candidates_df, *args, **kwargs)
                       and returns DataFrame with ['id1', 'id2', 'sim']
        *args, **kwargs: Additional arguments to pass to matching_func
    
    Returns:
        DataFrame with columns ['id1', 'id2', 'sim']
    """
    left_df, right_df = source_tables[edge_name]
    cand_df = candidates[edge_name]
    
    print(f"\n=== {edge_name}: Applying matching method ===")
    print(f"  Candidate pairs: {len(cand_df):,}")
    
    result = matching_func(left_df, right_df, cand_df, *args, **kwargs)
    
    # Validate result format
    required_cols = ['id1', 'id2', 'sim']
    if not all(col in result.columns for col in required_cols):
        raise ValueError(f"Matching function must return DataFrame with columns: {required_cols}")
    
    print(f"  Generated {len(result):,} scored pairs")
    print(f"  Similarity range: [{result['sim'].min():.3f}, {result['sim'].max():.3f}]")
    print(f"  Mean similarity: {result['sim'].mean():.3f}")
    
    return result

print("✓ Matching function interface defined")


✓ Matching function interface defined


### 3.2 Evaluation Functions (Reusable)


In [144]:
# Reusable evaluation functions
from PyDI.entitymatching.evaluation import EntityMatchingEvaluator

def evaluate_matching_thresholds(
    scored_pairs: pd.DataFrame,
    val_df: pd.DataFrame,
    thresholds: tuple = (0.5, 0.6, 0.7, 0.8, 0.9),  # immutable default avoids the shared mutable-default pitfall
    output_dir: Path = None
) -> dict:
    """
    Evaluate matching performance across different similarity thresholds.
    
    Args:
        scored_pairs: DataFrame with columns ['id1', 'id2', 'sim']
        val_df: Validation ground truth with columns ['id1', 'id2', 'label']
        thresholds: List of similarity thresholds to evaluate
        output_dir: Directory for evaluation outputs
    
    Returns:
        Dictionary with best_threshold, best_f1, and detailed metrics
    """
    best_threshold = None
    best_f1 = 0.0
    threshold_metrics = {}
    
    print("\nThreshold Analysis:")
    print(f"{'Threshold':<12} {'Precision':<12} {'Recall':<12} {'F1-Score':<12} {'TP':<8} {'FP':<8} {'FN':<8}")
    print("-" * 80)
    
    for threshold in thresholds:
        # Filter by threshold
        matched_pairs = scored_pairs[scored_pairs['sim'] >= threshold][['id1', 'id2']].copy()
        
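        # Evaluate (uses the same `correspondences` keyword and 'f1' metric key as the
        # evaluate_matching call in this notebook's validation cell)
        metrics = EntityMatchingEvaluator.evaluate_matching(
            correspondences=matched_pairs,
            test_pairs=val_df,
            out_dir=output_dir
        )
        
        precision = metrics.get('precision', 0.0)
        recall = metrics.get('recall', 0.0)
        f1 = metrics.get('f1', 0.0)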
        tp = metrics.get('true_positives', 0)
        fp = metrics.get('false_positives', 0)
        fn = metrics.get('false_negatives', 0)
        
        threshold_metrics[threshold] = metrics
        print(f"{threshold:<12.1f} {precision:<12.3f} {recall:<12.3f} {f1:<12.3f} {tp:<8} {fp:<8} {fn:<8}")
        
        if f1 > best_f1:
            best_f1 = f1
            best_threshold = threshold
    
    return {
        'best_threshold': best_threshold,
        'best_f1': best_f1,
        'thresholds': thresholds,
        'threshold_metrics': threshold_metrics
    }

def apply_season_year_constraint(
    scored_pairs: pd.DataFrame,
    left_df: pd.DataFrame,
    right_df: pd.DataFrame
) -> pd.DataFrame:
    """
    Filter scored pairs to ensure season_year matches exactly.
    
    Args:
        scored_pairs: DataFrame with columns ['id1', 'id2', 'sim']
        left_df: Left source table with '_rid' and 'season_year' columns
        right_df: Right source table with '_rid' and 'season_year' columns
    
    Returns:
        Filtered DataFrame with same columns
    """
    # Merge to get season_year
    matches = scored_pairs.merge(
        left_df[['_rid', 'season_year']],
        left_on='id1',
        right_on='_rid',
        suffixes=('', '_left')
    ).merge(
        right_df[['_rid', 'season_year']],
        left_on='id2',
        right_on='_rid',
        suffixes=('', '_right')
    )
    
    # Filter: season_year must match exactly
    matches = matches[matches['season_year'] == matches['season_year_right']]
    
    # Return original columns
    return matches[['id1', 'id2', 'sim']].copy()

def apply_global_matching(
    scored_pairs: pd.DataFrame
) -> pd.DataFrame:
    """
    Apply greedy one-to-one matching to resolve conflicts.
    
    Args:
        scored_pairs: DataFrame with columns ['id1', 'id2', 'sim']
    
    Returns:
        DataFrame with one-to-one matches (same columns)
    """
    # Convert 'sim' to 'score' if needed (PyDI expects 'score' column)
    correspondences = scored_pairs.copy()
    if 'sim' in correspondences.columns and 'score' not in correspondences.columns:
        correspondences = correspondences.rename(columns={'sim': 'score'})
    
    # Apply greedy one-to-one matching using .cluster() method
    global_matcher = GreedyOneToOneMatchingAlgorithm()
    global_matched = global_matcher.cluster(correspondences)
    
    # Convert 'score' back to 'sim' if original had 'sim'
    if 'sim' in scored_pairs.columns and 'score' in global_matched.columns:
        global_matched = global_matched.rename(columns={'score': 'sim'})
    
    return global_matched

print("✓ Evaluation functions defined")


✓ Evaluation functions defined
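
The threshold sweep in `evaluate_matching_thresholds` can be illustrated without PyDI. The sketch below assumes that evaluation is restricted to labeled pairs: a labeled match above the threshold is a true positive, a labeled non-match above it is a false positive, and a labeled match below it is a false negative. The helper names `prf` and `best_threshold` and the toy data are hypothetical.

```python
def prf(scored, labels, threshold):
    """Precision, recall, F1 over labeled pairs at a given similarity threshold."""
    predicted = {pair for pair, sim in scored.items() if sim >= threshold}
    tp = sum(1 for pair, is_match in labels.items() if is_match and pair in predicted)
    fp = sum(1 for pair, is_match in labels.items() if not is_match and pair in predicted)
    fn = sum(1 for pair, is_match in labels.items() if is_match and pair not in predicted)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1

def best_threshold(scored, labels, thresholds=(0.5, 0.7, 0.9)):
    """Return (threshold, f1) for the threshold with the highest F1."""
    best_t, best_f1 = None, 0.0
    for t in thresholds:
        _, _, f1 = prf(scored, labels, t)
        if f1 > best_f1:
            best_t, best_f1 = t, f1
    return best_t, best_f1

# Toy scored pairs and labels (made up for illustration)
scored = {('a1', 'b1'): 0.95, ('a2', 'b2'): 0.75, ('a3', 'b9'): 0.65,
          ('a4', 'b4'): 0.55, ('a5', 'b7'): 0.60}
labels = {('a1', 'b1'): True, ('a2', 'b2'): True, ('a3', 'b9'): False,
          ('a4', 'b4'): True, ('a5', 'b7'): False}

print(best_threshold(scored, labels))
```

On this toy data the lowest threshold admits both non-matches and the highest one drops two true matches, so the middle threshold gives the best F1.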


## 4. Matching Methods

### 4.1 Rule-Based Matching

Use similarity comparators to compute matching scores:
- **Name similarity**: Weighted combination of Levenshtein similarity (weight: 0.7) and word-level Jaccard similarity (weight: 0.3)
- **Season year**: Must match exactly (hard constraint; pairs whose season years differ or are missing get similarity 0.0)


In [145]:
# Configure comparators for rule-based matching
from PyDI.entitymatching.comparators import StringComparator, DateComparator

# Name comparators (using normalized names)
name_comparators = [
    StringComparator(
        column="full_name_normalized" if 'full_name_normalized' in L_full.columns else "full_name",
        similarity_function="levenshtein",
        preprocess=str.lower
    ),
    StringComparator(
        column="full_name_normalized" if 'full_name_normalized' in L_full.columns else "full_name",
        similarity_function="jaccard",
        tokenization="word",
        preprocess=str.lower
    )
]
name_weights = [0.7, 0.3]  # Emphasize Levenshtein for name matching

# Matching strategy: season_year is a hard constraint (must match exactly) and is
# checked before scoring, so only the name comparators contribute to the similarity.
rule_based_comparators = name_comparators
rule_based_weights = name_weights  # Levenshtein (0.7), Jaccard (0.3)

print("Rule-based comparators configured:")
print(f"  Name comparators: Levenshtein (0.7), Jaccard (0.3)")
print(f"  Hard constraint: season_year must match exactly")
print(f"  Total comparators: {len(rule_based_comparators)}")


Rule-based comparators configured:
  Name comparators: Levenshtein (0.7), Jaccard (0.3)
  Hard constraint: season_year must match exactly
  Total comparators: 2
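
A worked example may help show how the two name comparators combine into one score. The sketch below uses a plain dynamic-programming edit distance, normalized by the longer string, as a stand-in for the Levenshtein similarity. This is an assumption for illustration and is not the same formula as `Levenshtein.ratio` or PyDI's internal implementation, so the absolute numbers will differ. The names are made up.

```python
def levenshtein_distance(a: str, b: str) -> int:
    """Classic edit distance (insert, delete, substitute), computed row by row."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        curr = [i]
        for j, cb in enumerate(b, 1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1, curr[j - 1] + 1, prev[j - 1] + cost))
        prev = curr
    return prev[-1]

def levenshtein_similarity(a: str, b: str) -> float:
    """Stand-in similarity: 1 - distance / length of the longer string."""
    longest = max(len(a), len(b))
    return 1.0 if longest == 0 else 1.0 - levenshtein_distance(a, b) / longest

def jaccard_words(a: str, b: str) -> float:
    """Word-level Jaccard: shared tokens over all distinct tokens."""
    ta, tb = set(a.split()), set(b.split())
    union = ta | tb
    return 1.0 if not union else len(ta & tb) / len(union)

def weighted_name_score(a: str, b: str, w_lev: float = 0.7, w_jac: float = 0.3) -> float:
    """Weighted average of the two similarities, normalized by total weight."""
    total = w_lev * levenshtein_similarity(a, b) + w_jac * jaccard_words(a, b)
    return total / (w_lev + w_jac)

score = weighted_name_score('jon smith', 'jonathan smith')
print(f"{score:.3f}")
```

Here the edit distance is 5 over a longest length of 14, and one of three distinct tokens is shared, so the score is 0.7 × 9/14 + 0.3 × 1/3. Under this stand-in, a nickname-versus-full-name pair falls well short of a 0.7 cut-off.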


### 4.1.1 Define Rule-Based Matching Function


In [146]:
def rule_based_matching(
    left_df: pd.DataFrame,
    right_df: pd.DataFrame,
    cand_df: pd.DataFrame,
    comparators: list = None,
    weights: list = None
) -> pd.DataFrame:
    """
    Rule-based matching using similarity comparators.
    
    Args:
        left_df: Left source table
        right_df: Right source table
        cand_df: Candidate pairs DataFrame with columns ['id1', 'id2']
        comparators: List of comparator objects (default: rule_based_comparators)
        weights: List of weights for each comparator (default: rule_based_weights)
    
    Returns:
        DataFrame with columns ['id1', 'id2', 'sim']
    """
    if comparators is None:
        comparators = rule_based_comparators
    if weights is None:
        weights = rule_based_weights
    
    # Merge candidate pairs with source data
    cand_with_data = cand_df.merge(
        left_df,
        left_on='id1',
        right_on='_rid',
        how='left',
        suffixes=('', '_left')
    ).merge(
        right_df,
        left_on='id2',
        right_on='_rid',
        how='left',
        suffixes=('', '_right')
    )
    
    # Compute similarity scores
    scores_list = []
    
    for idx, row in cand_with_data.iterrows():
        # Get left and right records
        # Left record: columns from left_df (no suffix added if no conflict)
        left_record = {col: row[col] for col in left_df.columns if col in row}
        
        # Right record: handle merge suffixes correctly
        # When merging with suffixes=('', '_right'), right_df columns get '_right' suffix only if there's a conflict
        # We need to check both the original column name and the suffixed version
        right_record = {}
        for col in right_df.columns:
            # First check if column with _right suffix exists (conflict case)
            if f"{col}_right" in row:
                right_record[col] = row[f"{col}_right"]
            # Then check if original column name exists (no conflict case)
            elif col in row:
                right_record[col] = row[col]
            else:
                # Column not found, use None
                right_record[col] = None
        
        # Hard constraint: season_year must match exactly; otherwise similarity is 0.0.
        # pd.isna covers None, NaN, and pd.NA (convert_dtypes yields nullable dtypes,
        # and `pd.NA != x` evaluates to pd.NA, which raises when used in an `if`).
        left_season = left_record.get('season_year')
        right_season = right_record.get('season_year')
        if pd.isna(left_season) or pd.isna(right_season) or left_season != right_season:
            scores_list.append(0.0)
            continue
        
        # Compute weighted similarity score (only name similarity, since season_year is already checked)
        total_score = 0.0
        total_weight = 0.0
        
        for comparator, weight in zip(comparators, weights):
            try:
                col_name = comparator.column
                # Get values, handling None and missing columns
                left_val = left_record.get(col_name) or ''
                right_val = right_record.get(col_name) or ''
                # Convert to string and handle None
                left_val = str(left_val) if left_val is not None else ''
                right_val = str(right_val) if right_val is not None else ''
                
                # Compute similarity using comparator
                if isinstance(comparator, StringComparator):
                    # Apply preprocessing
                    if comparator.preprocess:
                        left_val = comparator.preprocess(str(left_val))
                        right_val = comparator.preprocess(str(right_val))
                    
                    # Compute similarity
                    if comparator.similarity_function == 'levenshtein':
                        try:
                            from Levenshtein import ratio
                            sim = ratio(left_val, right_val)
                        except ImportError:
                            # Fallback to simple string comparison
                            sim = 1.0 if left_val == right_val else 0.0
                    elif comparator.similarity_function == 'jaccard':
                        if comparator.tokenization == 'word':
                            left_tokens = set(left_val.split())
                            right_tokens = set(right_val.split())
                            if len(left_tokens | right_tokens) == 0:
                                sim = 1.0
                            else:
                                sim = len(left_tokens & right_tokens) / len(left_tokens | right_tokens)
                        else:
                            sim = 0.0
                    else:
                        sim = 0.0
                elif isinstance(comparator, DateComparator):
                    # Date comparison with tolerance
                    # Note: DateComparator uses max_days_difference (in days), convert to years for year comparison
                    try:
                        left_year = int(float(left_val)) if left_val else None
                        right_year = int(float(right_val)) if right_val else None
                        if left_year is not None and right_year is not None:
                            diff = abs(left_year - right_year)
                            # Convert max_days_difference to years (365 days = 1 year)
                            year_tolerance = getattr(comparator, 'max_days_difference', 365) / 365.0
                            if diff <= year_tolerance:
                                sim = 1.0 - (diff / (year_tolerance + 1))
                            else:
                                sim = 0.0
                        else:
                            sim = 0.0
                    except (ValueError, TypeError):
                        sim = 0.0
                else:
                    sim = 0.0
                
                total_score += sim * weight
                total_weight += weight
            except Exception:
                # If comparator fails, skip it
                continue
        
        # Normalize by total weight
        final_score = total_score / total_weight if total_weight > 0 else 0.0
        scores_list.append(final_score)
    
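    # Create result DataFrame from the merged frame so ids stay aligned with the scores
    # even if a non-unique '_rid' duplicated rows during the merges
    result = cand_with_data[['id1', 'id2']].reset_index(drop=True)
    result['sim'] = scores_list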
    
    return result

print("✓ Rule-based matching function defined")


✓ Rule-based matching function defined


### 4.1.2 Apply Rule-Based Matching


In [147]:
# Apply rule-based matching to all edges
matching_results_rule_based = {}

for edge_name in ['LR', 'LS']:
    matching_results_rule_based[edge_name] = apply_matching_method(
        edge_name,
        rule_based_matching,
        comparators=rule_based_comparators,
        weights=rule_based_weights
    )

print("\n✓ Rule-based matching complete for all edges")



=== LR: Applying matching method ===
  Candidate pairs: 5,733,797
  Generated 5,733,797 scored pairs
  Similarity range: [0.000, 1.000]
  Mean similarity: 0.013

=== LS: Applying matching method ===
  Candidate pairs: 215,708
  Generated 215,708 scored pairs
  Similarity range: [0.000, 1.000]
  Mean similarity: 0.037

✓ Rule-based matching complete for all edges


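### 4.1.3 Match Candidate Pairs with PyDI RuleBasedMatcher

Filter the candidate pairs by the season_year hard constraint, then apply PyDI's `RuleBasedMatcher` to the remaining pairs and keep those whose weighted name similarity reaches the threshold (0.7).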


In [148]:
# Match candidate pairs using PyDI's RuleBasedMatcher (following exercise file pattern)

# Define matching threshold
threshold = 0.7

matching_results = {}
matchers = {}

for edge_name in ['LR', 'LS']:
    print(f"\n=== {edge_name}: Rule-Based Matching ===")
    
    left_df, right_df = source_tables[edge_name]
    cand_df = candidates[edge_name].copy()
    
    # Apply season_year hard constraint: filter candidate pairs where season_year matches
    print(f"  Filtering candidate pairs by season_year constraint...")
    cand_with_seasons = cand_df.merge(
        left_df[['_rid', 'season_year']],
        left_on='id1',
        right_on='_rid',
        how='left'
    ).merge(
        right_df[['_rid', 'season_year']],
        left_on='id2',
        right_on='_rid',
        how='left',
        suffixes=('', '_right')
    )
    
    # Filter: season_year must match exactly
    cand_df_filtered = cand_with_seasons[
        (cand_with_seasons['season_year'].notna()) & 
        (cand_with_seasons['season_year_right'].notna()) &
        (cand_with_seasons['season_year'] == cand_with_seasons['season_year_right'])
    ][['id1', 'id2']].copy()
    
    print(f"  After season_year filter: {len(cand_df_filtered):,} candidate pairs (from {len(cand_df):,})")
    
    # Initialize matcher (following exercise file pattern)
    matcher = RuleBasedMatcher()
    
    # Match with a single threshold (following exercise file pattern)
    print(f"  Computing similarity scores using PyDI RuleBasedMatcher (threshold={threshold})...")
    correspondences = matcher.match(
        df_left=left_df,
        df_right=right_df,
        candidates=cand_df_filtered,
        id_column='_rid',
        comparators=rule_based_comparators,
        weights=rule_based_weights,
        threshold=threshold,
        debug=False
    )
    
    # Store results
    matching_results[edge_name] = correspondences
    matchers[edge_name] = matcher
    
    print(f"  Generated {len(correspondences):,} matched pairs (above threshold {threshold})")
    if 'score' in correspondences.columns:
        print(f"  Similarity score range: [{correspondences['score'].min():.3f}, {correspondences['score'].max():.3f}]")
        print(f"  Mean similarity: {correspondences['score'].mean():.3f}")



=== LR: Rule-Based Matching ===
  Filtering candidate pairs by season_year constraint...


[INFO ] PyDI.entitymatching.rule_based.RuleBasedMatcher - Starting Entity Matching
[INFO ] PyDI.entitymatching.rule_based.RuleBasedMatcher - Blocking 106553 x 15215 elements
[INFO ] PyDI.entitymatching.rule_based.RuleBasedMatcher - Matching 106553 x 15215 elements after 0:00:0.096; 130895 blocked pairs (reduction ratio: 0.9999192606183567)


  After season_year filter: 130,895 candidate pairs (from 5,733,797)
  Computing similarity scores using PyDI RuleBasedMatcher (threshold=0.7)...


[INFO ] PyDI.entitymatching.rule_based.RuleBasedMatcher - Entity Matching finished after 0:00:19.960; found 15309 correspondences.


  Generated 15,309 matched pairs (above threshold 0.7)
  Similarity score range: [0.700, 1.000]
  Mean similarity: 0.998

=== LS: Rule-Based Matching ===
  Filtering candidate pairs by season_year constraint...


[INFO ] PyDI.entitymatching.rule_based.RuleBasedMatcher - Starting Entity Matching
[INFO ] PyDI.entitymatching.rule_based.RuleBasedMatcher - Blocking 106553 x 6743 elements
[INFO ] PyDI.entitymatching.rule_based.RuleBasedMatcher - Matching 106553 x 6743 elements after 0:00:0.078; 9526 blocked pairs (reduction ratio: 0.9999867415811222)


  After season_year filter: 9,526 candidate pairs (from 215,708)
  Computing similarity scores using PyDI RuleBasedMatcher (threshold=0.7)...


[INFO ] PyDI.entitymatching.rule_based.RuleBasedMatcher - Entity Matching finished after 0:00:1.331; found 6725 correspondences.


  Generated 6,725 matched pairs (above threshold 0.7)
  Similarity score range: [0.700, 1.000]
  Mean similarity: 0.997
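
The LR run above returns 15,309 correspondences while `R_full` holds 15,215 records, so, assuming the returned pairs are distinct, at least one right-hand record appears in more than one correspondence. Resolving such conflicts is the purpose of `apply_global_matching`, which wraps PyDI's `GreedyOneToOneMatchingAlgorithm`. The sketch below shows the general greedy idea on toy data. It is an illustration under that assumption, not PyDI's implementation, and details such as tie-breaking may differ.

```python
def greedy_one_to_one(scored_pairs):
    """Keep the highest-scoring pairs such that each id is used at most once."""
    used_left, used_right, kept = set(), set(), []
    # Visit pairs from highest to lowest score
    for id1, id2, score in sorted(scored_pairs, key=lambda p: p[2], reverse=True):
        if id1 in used_left or id2 in used_right:
            continue  # one side is already matched to a better-scoring partner
        used_left.add(id1)
        used_right.add(id2)
        kept.append((id1, id2, score))
    return kept

# Toy correspondences: (left id, right id, score); ids are made up
pairs = [('l1', 'r1', 0.90), ('l1', 'r2', 0.95),
         ('l2', 'r2', 0.80), ('l2', 'r1', 0.70)]
print(greedy_one_to_one(pairs))
```

Greedy selection is simple and fast but not guaranteed to maximize the total score. In the toy data it keeps `(l1, r2)` and `(l2, r1)` with a combined score of 1.65, although `(l1, r1)` and `(l2, r2)` would total 1.70.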


### 4.1.4 Evaluate Matching Quality on the Validation Set

Assess matching performance on the validation split at the single similarity threshold (0.7) applied in the matching step.


In [149]:
# Evaluate matching on validation set (following exercise file pattern)
# Simple evaluation with a single threshold

from PyDI.entitymatching.evaluation import EntityMatchingEvaluator

matching_metrics_val = {}

for edge_name in ['LR', 'LS']:
    print(f"\n=== {edge_name}: Matching Evaluation (Validation Set) ===")
    
    correspondences = matching_results[edge_name]
    val_df = splits[edge_name]['val'][['id1', 'id2', 'label']].copy()
    
    # Evaluate matching (following exercise file pattern)
    eval_results = EntityMatchingEvaluator.evaluate_matching(
        correspondences=correspondences,
        test_pairs=val_df,
        out_dir=OUTPUT_DIR / 'matching-evaluation',
        matcher_instance=matchers[edge_name]
    )
    
    matching_metrics_val[edge_name] = eval_results
    
    print(f"\n  Precision: {eval_results.get('precision', 0.0):.3f}")
    print(f"  Recall:    {eval_results.get('recall', 0.0):.3f}")
    print(f"  F1-Score:  {eval_results.get('f1', 0.0):.3f}")
    print(f"  TP: {eval_results.get('true_positives', 0)}")
    print(f"  FP: {eval_results.get('false_positives', 0)}")
    print(f"  FN: {eval_results.get('false_negatives', 0)}")



=== LR: Matching Evaluation (Validation Set) ===


[INFO ] root - Confusion Matrix:
[INFO ] root -   True Positives:  70
[INFO ] root -   True Negatives:  27
[INFO ] root -   False Positives: 0
[INFO ] root -   False Negatives: 2
[INFO ] root - Performance Metrics:
[INFO ] root -   Accuracy:  0.980
[INFO ] root -   Precision: 1.000
[INFO ] root -   Recall:    0.972
[INFO ] root -   F1-Score:  0.986



  Precision: 1.000
  Recall:    0.972
  F1-Score:  0.986
  TP: 70
  FP: 0
  FN: 2

=== LS: Matching Evaluation (Validation Set) ===


[INFO ] root - Confusion Matrix:
[INFO ] root -   True Positives:  52
[INFO ] root -   True Negatives:  39
[INFO ] root -   False Positives: 2
[INFO ] root -   False Negatives: 3
[INFO ] root - Performance Metrics:
[INFO ] root -   Accuracy:  0.948
[INFO ] root -   Precision: 0.963
[INFO ] root -   Recall:    0.945
[INFO ] root -   F1-Score:  0.954



  Precision: 0.963
  Recall:    0.945
  F1-Score:  0.954
  TP: 52
  FP: 2
  FN: 3
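
The reported metrics follow directly from the confusion-matrix counts logged above, which makes them easy to double-check. The sketch below recomputes them with the standard definitions; the helper name `metrics_from_counts` is hypothetical.

```python
def metrics_from_counts(tp: int, tn: int, fp: int, fn: int) -> dict:
    """Standard accuracy, precision, recall, and F1 from confusion-matrix counts."""
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * tp / (2 * tp + fp + fn) if 2 * tp + fp + fn else 0.0
    accuracy = (tp + tn) / (tp + tn + fp + fn)
    return {'accuracy': accuracy, 'precision': precision, 'recall': recall, 'f1': f1}

# Counts (TP, TN, FP, FN) copied from the validation logs above
counts = {'LR': (70, 27, 0, 2), 'LS': (52, 39, 2, 3)}
results = {edge: metrics_from_counts(*c) for edge, c in counts.items()}
for edge, m in results.items():
    print(edge, {name: round(value, 3) for name, value in m.items()})
```

The four counts sum to 99 for LR and 96 for LS, the sizes of the validation splits, which indicates the evaluation is restricted to labeled pairs.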


### 4.1.5 Analyze Error Cases

In [150]:
# Analyze error cases (False Positives and False Negatives) for tuning

for edge_name in ['LR', 'LS']:
    print(f"\n{'='*80}")
    print(f"{edge_name}: Error Cases Analysis")
    print(f"{'='*80}")
    
    left_df, right_df = source_tables[edge_name]
    val_df = splits[edge_name]['val'][['id1', 'id2', 'label']].copy()
    correspondences = matching_results[edge_name]
    
    # Get true matches and false matches in validation set
    true_matches = val_df[val_df['label'].astype(str).str.strip().str.upper() == 'TRUE']
    false_matches = val_df[val_df['label'].astype(str).str.strip().str.upper() == 'FALSE']
    true_set = set(zip(true_matches['id1'], true_matches['id2']))
    false_set = set(zip(false_matches['id1'], false_matches['id2']))
    
    # Get predicted matches (only those in validation set)
    pred_set = set(zip(correspondences['id1'], correspondences['id2']))
    val_set = set(zip(val_df['id1'], val_df['id2']))  # All pairs in validation set
    
    # False Negatives: True matches in validation set that were not predicted
    fn_pairs = true_set - pred_set
    print(f"\n[FALSE NEGATIVES] ({len(fn_pairs)} cases):")
    if len(fn_pairs) > 0:
        for id1, id2 in fn_pairs:
            left_rec = left_df[left_df['_rid'] == id1].iloc[0] if len(left_df[left_df['_rid'] == id1]) > 0 else None
            right_rec = right_df[right_df['_rid'] == id2].iloc[0] if len(right_df[right_df['_rid'] == id2]) > 0 else None
            
            if left_rec is not None and right_rec is not None:
                print(f"\n  {id1} <-> {id2}")
                print(f"    Left:  '{left_rec.get('full_name', 'N/A')}' | Season: {left_rec.get('season_year', 'N/A')} | Birth: {left_rec.get('birth_year', 'N/A')}")
                print(f"    Right: '{right_rec.get('full_name', 'N/A')}' | Season: {right_rec.get('season_year', 'N/A')} | Birth: {right_rec.get('birth_year', 'N/A')}")
                
                # Check if in candidates
                in_candidates = len(candidates[edge_name][
                    (candidates[edge_name]['id1'] == id1) & 
                    (candidates[edge_name]['id2'] == id2)
                ]) > 0
                print(f"    In candidates: {in_candidates}")
    
    # False Positives: Predicted matches that are in validation set but labeled as FALSE
    # Only analyze pairs that are in the validation set
    fp_pairs = (pred_set & val_set) & false_set
    print(f"\n[FALSE POSITIVES] ({len(fp_pairs)} cases):")
    if len(fp_pairs) > 0:
        for id1, id2 in fp_pairs:
            left_rec = left_df[left_df['_rid'] == id1].iloc[0] if len(left_df[left_df['_rid'] == id1]) > 0 else None
            right_rec = right_df[right_df['_rid'] == id2].iloc[0] if len(right_df[right_df['_rid'] == id2]) > 0 else None
            score_row = correspondences[(correspondences['id1'] == id1) & (correspondences['id2'] == id2)]
            score = score_row['score'].iloc[0] if len(score_row) > 0 else None
            
            if left_rec is not None and right_rec is not None:
                score_str = f"{score:.3f}" if score is not None else "N/A"
                print(f"\n  {id1} <-> {id2} (score: {score_str})")
                print(f"    Left:  '{left_rec.get('full_name', 'N/A')}' | Season: {left_rec.get('season_year', 'N/A')} | Birth: {left_rec.get('birth_year', 'N/A')}")
                print(f"    Right: '{right_rec.get('full_name', 'N/A')}' | Season: {right_rec.get('season_year', 'N/A')} | Birth: {right_rec.get('birth_year', 'N/A')}")



LR: Error Cases Analysis

[FALSE NEGATIVES] (2 cases):

  5961292022|2022|L <-> 5961292022|2022|R
    Left:  'dan vogelbach' | Season: 2022 | Birth: 1992
    Right: 'daniel vogelbach' | Season: 2022 | Birth: 1993
    In candidates: False

  5715102017|2017|L <-> 5715102017|2017|R
    Left:  'matt boyd' | Season: 2017 | Birth: 1991
    Right: 'matthew boyd' | Season: 2017 | Birth: 1991
    In candidates: True

[FALSE POSITIVES] (0 cases):

LS: Error Cases Analysis

[FALSE NEGATIVES] (3 cases):

  6404472019|2019|L <-> 6404472019|2019|S
    Left:  'phil ervin' | Season: 2019 | Birth: 1992
    Right: 'phillip ervin' | Season: 2019 | Birth: 1993
    In candidates: True

  4931142016|2016|L <-> 4931142016|2016|S
    Left:  'nori aoki' | Season: 2016 | Birth: 1982
    Right: 'norichika aoki' | Season: 2016 | Birth: 1982
    In candidates: True

  5470072016|2016|L <-> 5470072016|2016|S
    Left:  'robert whalen' | Season: 2016 | Birth: 1994
    Right: 'rob whalen' | Season: 2016 | Birth: 19

### 3.3 Optimized Matching with Name Variants and Birth Year Constraint

Based on error analysis, we implement three optimization strategies:

1. **Name Variant Handling**: Handle common name variants (dan/daniel, matt/matthew, etc.)
2. **Adjusted Threshold**: Lower threshold from 0.7 to 0.65 to capture more true matches
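3. **Birth Year Soft Constraint**: Apply a multiplicative penalty when birth years differ: half the penalty for a 1-year difference, the full penalty (0.2) for a difference of more than 1 year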
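
To make the interaction of the three strategies concrete, here is a minimal sketch that applies them to a single score. The boost (0.15), penalty (0.2) and threshold (0.65) are the values used in the cells of this section; the helper name `adjust_score` and the base score of 0.60 are hypothetical and chosen only for illustration.

```python
# Illustrative only: adjust_score and the base score 0.60 are hypothetical.
BOOST = 0.15       # added when the names are dictionary variants
PENALTY = 0.2      # full penalty for a birth year gap above 1
THRESHOLD = 0.65   # lowered from 0.7


def adjust_score(base_score, is_variant, year_diff):
    """Apply the variant boost first, then the birth year penalty."""
    score = min(1.0, base_score + BOOST) if is_variant else base_score
    if year_diff is None:
        return score                      # missing birth year: no penalty
    if year_diff > 1:
        return score * (1 - PENALTY)      # full penalty
    if year_diff == 1:
        return score * (1 - PENALTY / 2)  # half penalty
    return score


# A pair like 'phil ervin' / 'phillip ervin' with birth years 1992 and 1993
final = adjust_score(0.60, is_variant=True, year_diff=1)
print(f"{final:.3f}", final >= THRESHOLD)
```

With these assumed inputs the boost lifts the score to 0.75 and the half penalty scales it by 0.9, so the pair ends up just above the lowered threshold. Without the boost, the same pair would fall below it.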


In [152]:
# Optimized matching with name variants and birth year constraint

# Strategy 1: Name variant dictionary
NAME_VARIANTS = {
    'dan': 'daniel',
    'daniel': 'dan',
    'matt': 'matthew',
    'matthew': 'matt',
    'jon': 'jonathon',
    'jonathon': 'jon',
    'phil': 'phillip',
    'phillip': 'phil',
    'jim': 'james',
    'james': 'jim',
    'bob': 'robert',
    'robert': 'bob',
    'bill': 'william',
    'william': 'bill',
    'mike': 'michael',
    'michael': 'mike',
    'dave': 'david',
    'david': 'dave',
    'chris': 'christopher',
    'christopher': 'chris',
    'tom': 'thomas',
    'thomas': 'tom',
    'ed': 'edward',
    'edward': 'ed',
    'rick': 'richard',
    'richard': 'rick',
}

def check_name_variant_match(name1, name2):
    """
    Check if two names are variants of each other.
    Returns True if they are variants, False otherwise.
    """
    name1_words = name1.lower().split()
    name2_words = name2.lower().split()
    
    # Must have same number of words
    if len(name1_words) != len(name2_words):
        return False
    
    # Check if all words match or are variants
    for w1, w2 in zip(name1_words, name2_words):
        if w1 != w2:
            # Check if they are variants
            if not (NAME_VARIANTS.get(w1) == w2 or NAME_VARIANTS.get(w2) == w1):
                return False
    
    return True

def apply_birth_year_constraint(similarity_score, birth1, birth2, penalty=0.2):
    """
    Apply birth year soft constraint: reduce similarity when birth years differ
    (half penalty for a 1-year difference, full penalty for more than 1 year).
    
    Args:
        similarity_score: Base similarity score
        birth1: Birth year from left record
        birth2: Birth year from right record
        penalty: Penalty factor (default: 0.2)
    
    Returns:
        Adjusted similarity score
    """
    # pd.isna covers None as well as the NaN values produced by pd.to_numeric(..., errors='coerce')
    if pd.isna(birth1) or pd.isna(birth2):
        return similarity_score  # If missing, don't penalize
    
    year_diff = abs(birth1 - birth2)
    
    if year_diff > 1:
        # Difference > 1 year: apply full penalty
        return similarity_score * (1 - penalty)
    elif year_diff == 1:
        # Difference = 1 year: apply half penalty
        return similarity_score * (1 - penalty * 0.5)
    else:
        # Same birth year: no penalty
        return similarity_score

def compute_enhanced_similarity(left_record, right_record, comparators, weights):
    """
    Compute enhanced similarity with name variant handling.
    
    Args:
        left_record: Left record dictionary
        right_record: Right record dictionary
        comparators: List of comparator objects
        weights: List of weights for comparators
    
    Returns:
        Enhanced similarity score
    """
    # Get names
    name1 = str(left_record.get('full_name_normalized', left_record.get('full_name', ''))).lower()
    name2 = str(right_record.get('full_name_normalized', right_record.get('full_name', ''))).lower()
    
    # Compute base similarity using comparators
    base_scores = []
    for comparator in comparators:
        col = comparator.column
        left_val = str(left_record.get(col, ''))
        right_val = str(right_record.get(col, ''))
        
        # Compute similarity using comparator
        if hasattr(comparator, 'similarity_function'):
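            if comparator.similarity_function == 'levenshtein':
                # Note: difflib's SequenceMatcher ratio is used as a stand-in here;
                # it is not a true Levenshtein similarity and may differ from PyDI's score.
                from difflib import SequenceMatcher
                sim = SequenceMatcher(None, left_val.lower(), right_val.lower()).ratio()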
            elif comparator.similarity_function == 'jaccard':
                left_tokens = set(left_val.lower().split())
                right_tokens = set(right_val.lower().split())
                if len(left_tokens | right_tokens) == 0:
                    sim = 1.0
                else:
                    sim = len(left_tokens & right_tokens) / len(left_tokens | right_tokens)
            else:
                sim = 0.0
        else:
            sim = 0.0
        
        base_scores.append(sim)
    
    # Weighted sum (equals a weighted average only if the weights sum to 1)
    base_score = sum(s * w for s, w in zip(base_scores, weights))
    
    # Strategy 1: Check for name variant match
    if check_name_variant_match(name1, name2):
        # Variant match: boost score (but cap at 1.0)
        base_score = min(1.0, base_score + 0.15)
    
    return base_score

# Strategy 2: Adjusted threshold (lower from 0.7 to 0.65)
optimized_threshold = 0.65

print("Optimized matching configuration:")
print(f"  Name variant dictionary: {len(NAME_VARIANTS)} variants")
print(f"  Adjusted threshold: {optimized_threshold} (from 0.7)")
print(f"  Birth year constraint: penalty={0.2} for year_diff > 1")


Optimized matching configuration:
  Name variant dictionary: 26 variants
  Adjusted threshold: 0.65 (from 0.7)
  Birth year constraint: penalty=0.2 for year_diff > 1
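
`NAME_VARIANTS` maps each name to exactly one counterpart, so `robert` resolves only to `bob`, and a pair such as 'robert whalen' / 'rob whalen' from the error analysis in 3.2 is not recognized as a variant. One possible alternative, sketched here and not part of the notebook's pipeline, groups variants into sets and compares canonical forms. The groups, the added nickname `rob`, and the names `VARIANT_GROUPS`, `CANONICAL` and `is_variant_match` are all hypothetical.

```python
# Illustrative alternative: group-based variant lookup (all names hypothetical).
VARIANT_GROUPS = [
    {'robert', 'rob', 'bob'},
    {'daniel', 'dan'},
    {'phillip', 'phil'},
]

# Map every name in a group to one canonical representative
# (the alphabetically smallest member of the group).
CANONICAL = {
    name: min(group)
    for group in VARIANT_GROUPS
    for name in group
}


def is_variant_match(name1, name2):
    """True if both names have the same tokens up to known variants."""
    words1 = name1.lower().split()
    words2 = name2.lower().split()
    if len(words1) != len(words2):
        return False
    return all(
        CANONICAL.get(w1, w1) == CANONICAL.get(w2, w2)
        for w1, w2 in zip(words1, words2)
    )


print(is_variant_match('robert whalen', 'rob whalen'))   # True
print(is_variant_match('bob whalen', 'rob whalen'))      # True
print(is_variant_match('robert whalen', 'dan whalen'))   # False
```

Because every member of a group shares one canonical form, adding a nickname only requires extending a set, and transitive variants (`rob` and `bob`) match without extra entries.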


In [153]:
# Apply optimized matching with name variants and birth year constraint

optimized_matching_results = {}
optimized_matchers = {}

for edge_name in ['LR', 'LS']:
    print(f"\n=== {edge_name}: Optimized Rule-Based Matching ===")
    
    left_df, right_df = source_tables[edge_name]
    cand_df = candidates[edge_name].copy()
    
    # Apply season_year hard constraint: filter candidate pairs where season_year matches
    print(f"  Filtering candidate pairs by season_year constraint...")
    cand_with_seasons = cand_df.merge(
        left_df[['_rid', 'season_year', 'birth_year']],
        left_on='id1',
        right_on='_rid',
        how='left'
    ).merge(
        right_df[['_rid', 'season_year', 'birth_year']],
        left_on='id2',
        right_on='_rid',
        how='left',
        suffixes=('', '_right')
    )
    
    # Filter: season_year must match exactly
    cand_df_filtered = cand_with_seasons[
        (cand_with_seasons['season_year'].notna()) & 
        (cand_with_seasons['season_year_right'].notna()) &
        (cand_with_seasons['season_year'] == cand_with_seasons['season_year_right'])
    ][['id1', 'id2', 'birth_year', 'birth_year_right']].copy()
    
    print(f"  After season_year filter: {len(cand_df_filtered):,} candidate pairs (from {len(cand_df):,})")
    
    # Initialize matcher
    matcher = RuleBasedMatcher()
    
    # First, compute base similarity using PyDI (threshold=0.0 so every candidate pair gets a score)
    print(f"  Computing base similarity scores using PyDI RuleBasedMatcher...")
    base_correspondences = matcher.match(
        df_left=left_df,
        df_right=right_df,
        candidates=cand_df_filtered[['id1', 'id2']],
        id_column='_rid',
        comparators=rule_based_comparators,
        weights=rule_based_weights,
        threshold=0.0,  # Get all scores, we'll filter later
        debug=False
    )
    
    print(f"  Base correspondences: {len(base_correspondences):,}")
    
    # Apply enhanced similarity computation with name variants and birth year constraint
    print(f"  Applying name variant handling and birth year constraint...")
    
    # Merge with source data to get names and birth years
    enhanced_correspondences = base_correspondences.merge(
        left_df[['_rid', 'full_name_normalized', 'full_name', 'birth_year']],
        left_on='id1',
        right_on='_rid',
        how='left'
    ).merge(
        right_df[['_rid', 'full_name_normalized', 'full_name', 'birth_year']],
        left_on='id2',
        right_on='_rid',
        how='left',
        suffixes=('', '_right')
    )
    
    # Apply name variant enhancement (Strategy 1)
    def apply_name_variant_boost(row):
        base_score = row['score']
        name1 = str(row.get('full_name_normalized', row.get('full_name', ''))).lower()
        name2 = str(row.get('full_name_normalized_right', row.get('full_name_right', ''))).lower()
        
        if check_name_variant_match(name1, name2):
            # Variant match: boost score (but cap at 1.0)
            return min(1.0, base_score + 0.15)
        return base_score
    
    enhanced_correspondences['enhanced_score'] = enhanced_correspondences.apply(apply_name_variant_boost, axis=1)
    
    # Apply birth year constraint (Strategy 3)
    # Convert birth_year columns to numeric (handle string types)
    enhanced_correspondences['birth_year'] = pd.to_numeric(enhanced_correspondences['birth_year'], errors='coerce')
    enhanced_correspondences['birth_year_right'] = pd.to_numeric(enhanced_correspondences['birth_year_right'], errors='coerce')
    
    enhanced_correspondences['final_score'] = enhanced_correspondences.apply(
        lambda row: apply_birth_year_constraint(
            row['enhanced_score'],
            row.get('birth_year'),
            row.get('birth_year_right'),
            penalty=0.2
        ),
        axis=1
    )
    
    # Keep only necessary columns
    enhanced_correspondences = enhanced_correspondences[['id1', 'id2', 'final_score']].rename(columns={'final_score': 'score'})
    
    # Apply optimized threshold (Strategy 2: 0.65 instead of 0.7)
    optimized_correspondences = enhanced_correspondences[enhanced_correspondences['score'] >= optimized_threshold].copy()
    
    # Store results
    optimized_matching_results[edge_name] = optimized_correspondences
    optimized_matchers[edge_name] = matcher
    
    print(f"  Generated {len(optimized_correspondences):,} matched pairs (above threshold {optimized_threshold})")
    if len(optimized_correspondences) > 0:
        print(f"  Similarity score range: [{optimized_correspondences['score'].min():.3f}, {optimized_correspondences['score'].max():.3f}]")
        print(f"  Mean similarity: {optimized_correspondences['score'].mean():.3f}")
    
print("\n✓ Optimized matching complete for all edges")



=== LR: Optimized Rule-Based Matching ===
  Filtering candidate pairs by season_year constraint...


[INFO ] PyDI.entitymatching.rule_based.RuleBasedMatcher - Starting Entity Matching
[INFO ] PyDI.entitymatching.rule_based.RuleBasedMatcher - Blocking 106553 x 15215 elements
[INFO ] PyDI.entitymatching.rule_based.RuleBasedMatcher - Matching 106553 x 15215 elements after 0:00:0.137; 130895 blocked pairs (reduction ratio: 0.9999192606183567)


  After season_year filter: 130,895 candidate pairs (from 5,733,797)
  Computing base similarity scores using PyDI RuleBasedMatcher...


[INFO ] PyDI.entitymatching.rule_based.RuleBasedMatcher - Entity Matching finished after 0:00:20.921; found 130895 correspondences.


  Base correspondences: 130,895
  Applying name variant handling and birth year constraint...
  Generated 15,296 matched pairs (above threshold 0.65)
  Similarity score range: [0.655, 1.000]
  Mean similarity: 0.946

=== LS: Optimized Rule-Based Matching ===
  Filtering candidate pairs by season_year constraint...


[INFO ] PyDI.entitymatching.rule_based.RuleBasedMatcher - Starting Entity Matching
[INFO ] PyDI.entitymatching.rule_based.RuleBasedMatcher - Blocking 106553 x 6743 elements
[INFO ] PyDI.entitymatching.rule_based.RuleBasedMatcher - Matching 106553 x 6743 elements after 0:00:0.075; 9526 blocked pairs (reduction ratio: 0.9999867415811222)


  After season_year filter: 9,526 candidate pairs (from 215,708)
  Computing base similarity scores using PyDI RuleBasedMatcher...


[INFO ] PyDI.entitymatching.rule_based.RuleBasedMatcher - Entity Matching finished after 0:00:1.318; found 9526 correspondences.


  Base correspondences: 9,526
  Applying name variant handling and birth year constraint...
  Generated 6,729 matched pairs (above threshold 0.65)
  Similarity score range: [0.655, 1.000]
  Mean similarity: 0.944

✓ Optimized matching complete for all edges


### 3.4 Evaluate Optimized Matching

Evaluate the performance of optimized matching on the validation set.
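
The confusion matrices logged for this evaluation include true negatives, which suggests that only labeled validation pairs are scored. The following sketch shows the standard precision, recall and F1 definitions under that assumption. It is not PyDI's implementation; the function name `evaluate_on_validation` and the toy pairs are hypothetical.

```python
def evaluate_on_validation(predicted, labels):
    """Score predictions against labeled pairs only.

    predicted: set of (id1, id2) pairs returned by a matcher
    labels:    dict mapping (id1, id2) -> True (match) / False (non-match)
    Predicted pairs without a label are ignored.
    """
    tp = sum(1 for pair, is_match in labels.items() if is_match and pair in predicted)
    fp = sum(1 for pair, is_match in labels.items() if not is_match and pair in predicted)
    fn = sum(1 for pair, is_match in labels.items() if is_match and pair not in predicted)

    precision = tp / (tp + fp) if (tp + fp) > 0 else 0.0
    recall = tp / (tp + fn) if (tp + fn) > 0 else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if (precision + recall) > 0 else 0.0)
    return {'tp': tp, 'fp': fp, 'fn': fn,
            'precision': precision, 'recall': recall, 'f1': f1}


labels = {('a1', 'b1'): True, ('a2', 'b2'): True, ('a3', 'b9'): False}
# ('a7', 'b7') has no label, so it counts neither as TP nor as FP
predicted = {('a1', 'b1'), ('a3', 'b9'), ('a7', 'b7')}
print(evaluate_on_validation(predicted, labels))
```

The unlabeled pair matters: counting it as a false positive would depress precision whenever the matcher is run on the full candidate set while only a small validation sample carries labels.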


In [103]:
# Evaluate optimized matching on validation set

from PyDI.entitymatching.evaluation import EntityMatchingEvaluator

optimized_matching_metrics_val = {}

for edge_name in ['LR', 'LS']:
    print(f"\n=== {edge_name}: Optimized Matching Evaluation (Validation Set) ===")
    
    correspondences = optimized_matching_results[edge_name]
    val_df = splits[edge_name]['val'][['id1', 'id2', 'label']].copy()
    
    # Ensure a 'score' column exists for the PyDI evaluator (rename 'sim' if needed)
    correspondences_for_eval = correspondences.copy()
    if 'score' not in correspondences_for_eval.columns and 'sim' in correspondences_for_eval.columns:
        correspondences_for_eval = correspondences_for_eval.rename(columns={'sim': 'score'})
    
    # Evaluate matching
    try:
        eval_results = EntityMatchingEvaluator.evaluate_matching(
            correspondences=correspondences_for_eval,
            test_pairs=val_df,
            out_dir=OUTPUT_DIR / 'matching-evaluation-optimized',
            matcher_instance=optimized_matchers[edge_name]
        )
        
        optimized_matching_metrics_val[edge_name] = eval_results
        
        print(f"\n  Precision: {eval_results.get('precision', 0.0):.3f}")
        print(f"  Recall:    {eval_results.get('recall', 0.0):.3f}")
        print(f"  F1-Score:  {eval_results.get('f1', 0.0):.3f}")
        print(f"  TP: {eval_results.get('true_positives', 0)}")
        print(f"  FP: {eval_results.get('false_positives', 0)}")
        print(f"  FN: {eval_results.get('false_negatives', 0)}")
    except Exception as e:
        print(f"  PyDI evaluator failed: {e}")
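        # Manual evaluation fallback: score only predictions that fall inside the validation set,
        # since unlabeled predictions cannot be counted as false positives
        true_matches = val_df[val_df['label'].astype(str).str.strip().str.upper() == 'TRUE']
        true_set = set(zip(true_matches['id1'], true_matches['id2']))
        val_set = set(zip(val_df['id1'], val_df['id2']))
        pred_set = set(zip(correspondences['id1'], correspondences['id2'])) & val_set
        
        tp = len(true_set & pred_set)
        fp = len(pred_set - true_set)
        fn = len(true_set - pred_set)
        
        precision = tp / (tp + fp) if (tp + fp) > 0 else 0.0
        recall = tp / (tp + fn) if (tp + fn) > 0 else 0.0
        f1 = 2 * precision * recall / (precision + recall) if (precision + recall) > 0 else 0.0
        
        # Store the fallback metrics so later comparison cells can read them
        optimized_matching_metrics_val[edge_name] = {
            'precision': precision, 'recall': recall, 'f1': f1,
            'true_positives': tp, 'false_positives': fp, 'false_negatives': fn,
        }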
        
        print(f"\n  Precision: {precision:.3f}")
        print(f"  Recall:    {recall:.3f}")
        print(f"  F1-Score:  {f1:.3f}")
        print(f"  TP: {tp}")
        print(f"  FP: {fp}")
        print(f"  FN: {fn}")

print("\n✓ Optimized matching evaluation complete")



=== LR: Optimized Matching Evaluation (Validation Set) ===


[INFO ] root - Confusion Matrix:
[INFO ] root -   True Positives:  75
[INFO ] root -   True Negatives:  23
[INFO ] root -   False Positives: 0
[INFO ] root -   False Negatives: 3
[INFO ] root - Performance Metrics:
[INFO ] root -   Accuracy:  0.970
[INFO ] root -   Precision: 1.000
[INFO ] root -   Recall:    0.962
[INFO ] root -   F1-Score:  0.980



  Precision: 1.000
  Recall:    0.962
  F1-Score:  0.980
  TP: 75
  FP: 0
  FN: 3

=== LS: Optimized Matching Evaluation (Validation Set) ===


[INFO ] root - Confusion Matrix:
[INFO ] root -   True Positives:  55
[INFO ] root -   True Negatives:  40
[INFO ] root -   False Positives: 2
[INFO ] root -   False Negatives: 3
[INFO ] root - Performance Metrics:
[INFO ] root -   Accuracy:  0.950
[INFO ] root -   Precision: 0.965
[INFO ] root -   Recall:    0.948
[INFO ] root -   F1-Score:  0.957



  Precision: 0.965
  Recall:    0.948
  F1-Score:  0.957
  TP: 55
  FP: 2
  FN: 3

✓ Optimized matching evaluation complete


### 3.5 Comparison: Original vs Optimized Matching

Compare the performance of original matching (threshold=0.7) with optimized matching (name variants + birth year constraint + threshold=0.65).


In [154]:
# Compare original vs optimized matching results

print("="*80)
print("Matching Performance Comparison: Original vs Optimized")
print("="*80)

comparison_data = []

for edge_name in ['LR', 'LS']:
    print(f"\n{edge_name} Edge:")
    print("-" * 80)
    
    # Original results
    orig_metrics = matching_metrics_val.get(edge_name, {})
    orig_precision = orig_metrics.get('precision', 0.0)
    orig_recall = orig_metrics.get('recall', 0.0)
    orig_f1 = orig_metrics.get('f1', 0.0)
    orig_tp = orig_metrics.get('true_positives', 0)
    orig_fp = orig_metrics.get('false_positives', 0)
    orig_fn = orig_metrics.get('false_negatives', 0)
    
    # Optimized results
    opt_metrics = optimized_matching_metrics_val.get(edge_name, {})
    opt_precision = opt_metrics.get('precision', 0.0)
    opt_recall = opt_metrics.get('recall', 0.0)
    opt_f1 = opt_metrics.get('f1', 0.0)
    opt_tp = opt_metrics.get('true_positives', 0)
    opt_fp = opt_metrics.get('false_positives', 0)
    opt_fn = opt_metrics.get('false_negatives', 0)
    
    # Calculate deltas (optimized minus original); for FP and FN a positive delta means more errors
    precision_improvement = opt_precision - orig_precision
    recall_improvement = opt_recall - orig_recall
    f1_improvement = opt_f1 - orig_f1
    tp_improvement = opt_tp - orig_tp
    fp_improvement = opt_fp - orig_fp
    fn_improvement = opt_fn - orig_fn
    
    print(f"  Metric              | Original  | Optimized | Improvement")
    print(f"  --------------------|-----------|-----------|------------")
    print(f"  Precision           | {orig_precision:7.3f}  | {opt_precision:7.3f}  | {precision_improvement:+7.3f}")
    print(f"  Recall              | {orig_recall:7.3f}  | {opt_recall:7.3f}  | {recall_improvement:+7.3f}")
    print(f"  F1-Score            | {orig_f1:7.3f}  | {opt_f1:7.3f}  | {f1_improvement:+7.3f}")
    print(f"  True Positives      | {orig_tp:7d}  | {opt_tp:7d}  | {tp_improvement:+7d}")
    print(f"  False Positives     | {orig_fp:7d}  | {opt_fp:7d}  | {fp_improvement:+7d}")
    print(f"  False Negatives     | {orig_fn:7d}  | {opt_fn:7d}  | {fn_improvement:+7d}")
    
    comparison_data.append({
        'edge': edge_name,
        'original_precision': orig_precision,
        'optimized_precision': opt_precision,
        'original_recall': orig_recall,
        'optimized_recall': opt_recall,
        'original_f1': orig_f1,
        'optimized_f1': opt_f1,
        'precision_improvement': precision_improvement,
        'recall_improvement': recall_improvement,
        'f1_improvement': f1_improvement,
    })

print("\n" + "="*80)
print("Summary:")
print("="*80)
print(f"  Optimizations applied:")
print(f"    1. Name variant handling (dan/daniel, matt/matthew, etc.)")
print(f"    2. Lowered threshold from 0.7 to 0.65")
print(f"    3. Birth year soft constraint (penalty for year_diff > 1)")

# Save comparison results
comparison_df = pd.DataFrame(comparison_data)
comparison_df.to_csv(OUTPUT_DIR / 'matching-comparison.csv', index=False)
print(f"\n  Comparison results saved to: {OUTPUT_DIR / 'matching-comparison.csv'}")


Matching Performance Comparison: Original vs Optimized

LR Edge:
--------------------------------------------------------------------------------
  Metric              | Original  | Optimized | Improvement
  --------------------|-----------|-----------|------------
  Precision           |   1.000  |   1.000  |  +0.000
  Recall              |   0.972  |   0.962  |  -0.011
  F1-Score            |   0.986  |   0.980  |  -0.006
  True Positives      |      70  |      75  |      +5
  False Positives     |       0  |       0  |      +0
  False Negatives     |       2  |       3  |      +1

LS Edge:
--------------------------------------------------------------------------------
  Metric              | Original  | Optimized | Improvement
  --------------------|-----------|-----------|------------
  Precision           |   0.963  |   0.965  |  +0.002
  Recall              |   0.945  |   0.948  |  +0.003
  F1-Score            |   0.954  |   0.957  |  +0.002
  True Positives      |      52  |   

### 3.6 Cluster Consistency Analysis

Analyze the cluster structure to identify any inconsistencies that our evaluation set may miss. The `EntityMatchingEvaluator` offers the `create_cluster_size_distribution` method for this purpose.

**Important Notes:**
- An evaluation is only as good as its evaluation set, and this holds for the matching step as well.
- An F1 of 95% is only meaningful if the evaluation set accurately represents your data; the cluster size distribution is one way to spot an unrepresentative evaluation set.
- If you are sure the source datasets are deduplicated, many clusters with a size larger than 2 should make you question the evaluation set and manually check what is going on.
- At this stage of the project, you iteratively refine the evaluation sets based not only on the metrics but also on manual inspection of debug logs and the cluster size distribution.
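
As a rough illustration of what a cluster size distribution measures, the sketch below treats each matched pair as an edge and counts the sizes of the resulting connected components with a small union-find. That clusters are connected components is an assumption made here for illustration; how `create_cluster_size_distribution` derives them internally is not shown in this notebook. The function name and the toy pairs are hypothetical.

```python
from collections import Counter


def cluster_size_distribution(pairs):
    """Count connected-component sizes of the graph formed by matched pairs."""
    parent = {}

    def find(x):
        parent.setdefault(x, x)
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving keeps trees shallow
            x = parent[x]
        return x

    # Union the two records of every matched pair
    for a, b in pairs:
        root_a, root_b = find(a), find(b)
        if root_a != root_b:
            parent[root_a] = root_b

    # Size of each component, then how often each size occurs
    component_sizes = Counter(find(node) for node in list(parent))
    return Counter(component_sizes.values())


# 'L3' is matched to two right-hand records, which produces a cluster of size 3
pairs = [('L1', 'R1'), ('L2', 'R2'), ('L3', 'R3'), ('L3', 'R4')]
print(sorted(cluster_size_distribution(pairs).items()))  # [(2, 2), (3, 1)]
```

With deduplicated sources, every record should match at most one record on the other side, so any component larger than 2 points at a pair worth inspecting.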


In [155]:
# Cluster Consistency Analysis for Original Matching

print("="*80)
print("Cluster Consistency Analysis: Original Matching")
print("="*80)

original_cluster_distributions = {}

for edge_name in ['LR', 'LS']:
    print(f"\n{edge_name} Edge:")
    print("-" * 80)
    
    correspondences = matching_results[edge_name]
    
    # Create cluster size distribution
    cluster_distribution = EntityMatchingEvaluator.create_cluster_size_distribution(
        correspondences=correspondences,
        out_dir=OUTPUT_DIR / "cluster_analysis" / "original"
    )
    
    original_cluster_distributions[edge_name] = cluster_distribution
    
    print(f"\nCluster Size Distribution:")
    if cluster_distribution is not None and len(cluster_distribution) > 0:
        print(cluster_distribution.to_string(index=False))
    else:
        print("  No cluster distribution data available")

print("\n✓ Original matching cluster analysis complete")


Cluster Consistency Analysis: Original Matching

LR Edge:
--------------------------------------------------------------------------------


[INFO ] PyDI.entitymatching.evaluation - Cluster Size Distribution of 15102 clusters:
[INFO ] PyDI.entitymatching.evaluation - 	Cluster Size	| Frequency	| Percentage
[INFO ] PyDI.entitymatching.evaluation - 	──────────────────────────────────────────────────
[INFO ] PyDI.entitymatching.evaluation - 		2	|	15000	|	99.32%
[INFO ] PyDI.entitymatching.evaluation - 		3	|	52	|	0.34%
[INFO ] PyDI.entitymatching.evaluation - 		4	|	47	|	0.31%
[INFO ] PyDI.entitymatching.evaluation - 		5	|	1	|	0.01%
[INFO ] PyDI.entitymatching.evaluation - 		6	|	1	|	0.01%
[INFO ] PyDI.entitymatching.evaluation - 		7	|	1	|	0.01%
[INFO ] root - Cluster size distribution written to /Users/zhangzihan/Desktop/WBI_project/Schema_Mapped_Datasets/data/output/matching/cluster_analysis/original/cluster_size_distribution.csv
[INFO ] PyDI.entitymatching.evaluation - Cluster Size Distribution of 6677 clusters:
[INFO ] PyDI.entitymatching.evaluation - 	Cluster Size	| Frequency	| Percentage
[INFO ] PyDI.entitymatching.evaluatio


Cluster Size Distribution:
 cluster_size  frequency  percentage
            2      15000   99.324593
            3         52    0.344325
            4         47    0.311217
            5          1    0.006622
            6          1    0.006622
            7          1    0.006622

LS Edge:
--------------------------------------------------------------------------------

Cluster Size Distribution:
 cluster_size  frequency  percentage
            2       6645   99.520743
            3         22    0.329489
            4         10    0.149768

✓ Original matching cluster analysis complete


In [156]:
# Cluster Consistency Analysis for Optimized Matching

print("="*80)
print("Cluster Consistency Analysis: Optimized Matching")
print("="*80)

optimized_cluster_distributions = {}

for edge_name in ['LR', 'LS']:
    print(f"\n{edge_name} Edge:")
    print("-" * 80)
    
    correspondences = optimized_matching_results[edge_name]
    
    # Create cluster size distribution
    cluster_distribution = EntityMatchingEvaluator.create_cluster_size_distribution(
        correspondences=correspondences,
        out_dir=OUTPUT_DIR / "cluster_analysis" / "optimized"
    )
    
    optimized_cluster_distributions[edge_name] = cluster_distribution
    
    print(f"\nCluster Size Distribution:")
    if cluster_distribution is not None and len(cluster_distribution) > 0:
        print(cluster_distribution.to_string(index=False))
        
        # Calculate summary statistics
        # Note: PyDI returns column names in lowercase: 'cluster_size', 'frequency', 'percentage'
        total_clusters = cluster_distribution['frequency'].sum()
        # sum() over an empty selection is 0, so no explicit length guard is needed
        clusters_size_2 = cluster_distribution.loc[cluster_distribution['cluster_size'] == 2, 'frequency'].sum()
        clusters_size_gt_2 = total_clusters - clusters_size_2
        
        print(f"\nSummary:")
        print(f"  Total clusters: {total_clusters}")
        print(f"  Clusters with size 2: {clusters_size_2} ({clusters_size_2/total_clusters*100:.2f}%)")
        print(f"  Clusters with size > 2: {clusters_size_gt_2} ({clusters_size_gt_2/total_clusters*100:.2f}%)")
        
        if clusters_size_gt_2 > 0:
            print(f"\n   Warning: Found {clusters_size_gt_2} clusters with size > 2.")
            print(f"     This may indicate issues with the evaluation set or data quality.")
            print(f"     Consider manually inspecting these clusters.")
    else:
        print("  No cluster distribution data available")

print("\n✓ Optimized matching cluster analysis complete")


Cluster Consistency Analysis: Optimized Matching

LR Edge:
--------------------------------------------------------------------------------


[INFO ] PyDI.entitymatching.evaluation - Cluster Size Distribution of 15091 clusters:
[INFO ] PyDI.entitymatching.evaluation - 	Cluster Size	| Frequency	| Percentage
[INFO ] PyDI.entitymatching.evaluation - 	──────────────────────────────────────────────────
[INFO ] PyDI.entitymatching.evaluation - 		2	|	14977	|	99.24%
[INFO ] PyDI.entitymatching.evaluation - 		3	|	50	|	0.33%
[INFO ] PyDI.entitymatching.evaluation - 		4	|	62	|	0.41%
[INFO ] PyDI.entitymatching.evaluation - 		5	|	2	|	0.01%
[INFO ] root - Cluster size distribution written to /Users/zhangzihan/Desktop/WBI_project/Schema_Mapped_Datasets/data/output/matching/cluster_analysis/optimized/cluster_size_distribution.csv



Cluster Size Distribution:
 cluster_size  frequency  percentage
            2      14977   99.244583
            3         50    0.331323
            4         62    0.410841
            5          2    0.013253

Summary:
  Total clusters: 15091
  Clusters with size 2: 14977 (99.24%)
  Clusters with size > 2: 114 (0.76%)

   Warning: Found 114 clusters with size > 2.
     This may indicate issues with the evaluation set or data quality.
     Consider manually inspecting these clusters.

LS Edge:
--------------------------------------------------------------------------------


[INFO ] PyDI.entitymatching.evaluation - Cluster Size Distribution of 6674 clusters:
[INFO ] PyDI.entitymatching.evaluation - 	Cluster Size	| Frequency	| Percentage
[INFO ] PyDI.entitymatching.evaluation - 	──────────────────────────────────────────────────
[INFO ] PyDI.entitymatching.evaluation - 		2	|	6635	|	99.42%
[INFO ] PyDI.entitymatching.evaluation - 		3	|	28	|	0.42%
[INFO ] PyDI.entitymatching.evaluation - 		4	|	10	|	0.15%
[INFO ] PyDI.entitymatching.evaluation - 		5	|	1	|	0.01%
[INFO ] root - Cluster size distribution written to /Users/zhangzihan/Desktop/WBI_project/Schema_Mapped_Datasets/data/output/matching/cluster_analysis/optimized/cluster_size_distribution.csv



Cluster Size Distribution:
 cluster_size  frequency  percentage
            2       6635   99.415643
            3         28    0.419539
            4         10    0.149835
            5          1    0.014984

Summary:
  Total clusters: 6674
  Clusters with size 2: 6635 (99.42%)
  Clusters with size > 2: 39 (0.58%)

   Warning: Found 39 clusters with size > 2.
     This may indicate issues with the evaluation set or data quality.
     Consider manually inspecting these clusters.

✓ Optimized matching cluster analysis complete


### 3.7 Cluster Distribution Comparison

Compare cluster size distributions between original and optimized matching results.
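
A per-size comparison can be sketched with plain dictionaries. The frequencies are copied from the LR distributions logged in 3.6; the helper name `compare_distributions` is hypothetical, and sizes missing from one distribution are assumed to have frequency 0.

```python
# Frequencies copied from the LR cluster size distributions logged in 3.6.
original = {2: 15000, 3: 52, 4: 47, 5: 1, 6: 1, 7: 1}
optimized = {2: 14977, 3: 50, 4: 62, 5: 2}


def compare_distributions(orig, opt):
    """Return (size, orig_freq, opt_freq, change) rows; missing sizes count as 0."""
    rows = []
    for size in sorted(set(orig) | set(opt)):
        orig_freq = orig.get(size, 0)
        opt_freq = opt.get(size, 0)
        rows.append((size, orig_freq, opt_freq, opt_freq - orig_freq))
    return rows


for size, orig_freq, opt_freq, change in compare_distributions(original, optimized):
    print(f"size {size}: {orig_freq:6d} -> {opt_freq:6d} ({change:+d})")
```

Aligning on the union of sizes makes shifts visible that totals hide: here the clusters of size 6 and 7 disappear while clusters of size 4 become more frequent.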


In [107]:
# Compare cluster distributions between original and optimized matching

print("="*80)
print("Cluster Distribution Comparison: Original vs Optimized")
print("="*80)

for edge_name in ['LR', 'LS']:
    print(f"\n{edge_name} Edge:")
    print("-" * 80)
    
    orig_dist = original_cluster_distributions.get(edge_name)
    opt_dist = optimized_cluster_distributions.get(edge_name)
    
    if orig_dist is not None and opt_dist is not None:
        # Calculate total clusters
        # Note: PyDI returns column names in lowercase: 'cluster_size', 'frequency', 'percentage'
        orig_total = orig_dist['frequency'].sum()
        opt_total = opt_dist['frequency'].sum()
        
        # Calculate clusters with size 2
        # .sum() over an empty selection is 0, so no explicit length check is needed
        orig_size_2 = orig_dist.loc[orig_dist['cluster_size'] == 2, 'frequency'].sum()
        opt_size_2 = opt_dist.loc[opt_dist['cluster_size'] == 2, 'frequency'].sum()
        
        # Calculate clusters with size > 2
        orig_size_gt_2 = orig_total - orig_size_2
        opt_size_gt_2 = opt_total - opt_size_2
        
        print(f"\n  Metric                    | Original  | Optimized | Change")
        print(f"  --------------------------|-----------|-----------|--------")
        print(f"  Total Clusters            | {orig_total:9d} | {opt_total:9d} | {opt_total - orig_total:+6d}")
        print(f"  Clusters with Size 2       | {orig_size_2:9d} | {opt_size_2:9d} | {opt_size_2 - orig_size_2:+6d}")
        print(f"  Clusters with Size > 2     | {orig_size_gt_2:9d} | {opt_size_gt_2:9d} | {opt_size_gt_2 - orig_size_gt_2:+6d}")
        print(f"  % Clusters with Size 2     | {orig_size_2/orig_total*100:8.2f}% | {opt_size_2/opt_total*100:8.2f}% | {(opt_size_2/opt_total - orig_size_2/orig_total)*100:+6.2f}%")
        print(f"  % Clusters with Size > 2   | {orig_size_gt_2/orig_total*100:8.2f}% | {opt_size_gt_2/opt_total*100:8.2f}% | {(opt_size_gt_2/opt_total - orig_size_gt_2/orig_total)*100:+6.2f}%")
        
        # Show detailed distribution comparison if there are differences
        if orig_size_gt_2 != opt_size_gt_2:
            print(f"\n  Detailed Cluster Size Distribution:")
            print(f"\n  Original Matching:")
            print(orig_dist.to_string(index=False))
            print(f"\n  Optimized Matching:")
            print(opt_dist.to_string(index=False))
    else:
        print("  Cluster distribution data not available for comparison")

print("\n✓ Cluster distribution comparison complete")


Cluster Distribution Comparison: Original vs Optimized

LR Edge:
--------------------------------------------------------------------------------

  Metric                    | Original  | Optimized | Change
  --------------------------|-----------|-----------|--------
  Total Clusters            |     15102 |     15091 |    -11
  Clusters with Size 2       |     15000 |     14977 |    -23
  Clusters with Size > 2     |       102 |       114 |    +12
  % Clusters with Size 2     |    99.32% |    99.24% |  -0.08%
  % Clusters with Size > 2   |     0.68% |     0.76% |  +0.08%

  Detailed Cluster Size Distribution:

  Original Matching:
 cluster_size  frequency  percentage
            2      15000   99.324593
            3         52    0.344325
            4         47    0.311217
            5          1    0.006622
            6          1    0.006622
            7          1    0.006622

  Optimized Matching:
 cluster_size  frequency  percentage
            2      14977   99.244583
  

## 4. ML-Based Matching

This section trains scikit-learn classifiers on labeled pairs. Each comparator produces one feature per record pair, and the classifier learns from the training labels how to weight those features.

**Feature Extraction**: Feature extraction converts record pairs into feature vectors using the set of comparators. The `FeatureExtractor` class handles this transformation.

**Workflow**:
1. Define feature extractors (comparators)
2. Extract features from training pairs
3. Train a classifier (e.g., RandomForestClassifier)
4. Apply ML-based matcher to candidate pairs
5. Evaluate performance
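
Steps 1 and 2 of the workflow can be pictured with a minimal sketch: each comparator maps a record pair to one number, and the numbers together form the feature vector. The comparator functions and the `extract_features` helper are hypothetical stand-ins, not PyDI's `FeatureExtractor`; the two records reuse values that appear in the error analysis later in this notebook.

```python
def same_first_token(left, right):
    """1.0 if both names start with the same word, else 0.0."""
    return 1.0 if left.split()[:1] == right.split()[:1] else 0.0


def same_value(left, right):
    """1.0 if both values are equal, else 0.0."""
    return 1.0 if left == right else 0.0


# Hypothetical comparator set: (feature name, record field, comparison function)
comparators = [
    ("first_token_equal", "full_name", same_first_token),
    ("birth_year_equal", "birth_year", same_value),
]


def extract_features(left_record, right_record):
    """Turn one record pair into a feature vector, one entry per comparator."""
    return [fn(left_record[field], right_record[field]) for _, field, fn in comparators]


left = {"full_name": "alfonso rivas", "birth_year": 1996}
right = {"full_name": "alfonso rivas iii", "birth_year": 1997}
features = extract_features(left, right)
print(features)
```

A classifier trained in step 3 only ever sees vectors like this one, so the choice of comparators bounds what it can learn.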


### 4.0 Export Candidate Error Cases

Persist currently detected validation errors to `data/output/gt/manual_cases/` so the ground-truth notebook can ingest them (section 5.8). Run this after completing the rule-based/optimized evaluations and before the ML feature extractor setup.
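
The error sets exported here reduce to plain set operations. The sketch below uses made-up IDs to show the idea: false negatives are labeled matches the matcher missed, and false positives are predicted pairs that the validation split labels as non-matches. Predicted pairs with no label at all are neither.

```python
# Hypothetical validation labels and predictions (IDs are illustrative only)
validation = [
    ("a1", "b1", "TRUE"),
    ("a2", "b2", "TRUE"),
    ("a3", "b3", "FALSE"),
    ("a4", "b4", "FALSE"),
]
predicted = {("a1", "b1"), ("a3", "b3"), ("a9", "b9")}  # a9-b9 is unlabeled

true_set = {(id1, id2) for id1, id2, label in validation if label == "TRUE"}
false_set = {(id1, id2) for id1, id2, label in validation if label == "FALSE"}

false_negatives = true_set - predicted   # labeled matches that were missed
false_positives = predicted & false_set  # predictions labeled as non-matches
unlabeled = predicted - true_set - false_set  # cannot be judged either way

print(false_negatives, false_positives, unlabeled)
```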

In [124]:
# Export candidate error cases for GT augmentation (section 5.8 in GT notebook)
import pandas as pd
from pathlib import Path

MANUAL_CASES_DIR = BASE_DIR / 'data' / 'output' / 'gt' / 'manual_cases'
MANUAL_CASES_DIR.mkdir(parents=True, exist_ok=True)

error_records = {'LR': [], 'LS': []}

def _collect_errors(result_dict, label: str):
    if result_dict is None:
        return
    for edge_name in ['LR', 'LS']:
        if edge_name not in result_dict:
            continue
        correspondences = result_dict[edge_name]
        val_df = splits[edge_name]['val'][['id1', 'id2', 'label']].copy()
        val_df['label'] = val_df['label'].astype(str).str.strip().str.upper()
        true_set = set(zip(val_df[val_df['label'] == 'TRUE']['id1'], val_df[val_df['label'] == 'TRUE']['id2']))
        false_set = set(zip(val_df[val_df['label'] == 'FALSE']['id1'], val_df[val_df['label'] == 'FALSE']['id2']))
        pred_set = set(zip(correspondences['id1'], correspondences['id2']))
        val_pairs = set(zip(val_df['id1'], val_df['id2']))

        fn_pairs = true_set - pred_set
        for id1, id2 in fn_pairs:
            error_records[edge_name].append({
                'id1': id1,
                'id2': id2,
                'label': 'TRUE',
                'edge': edge_name,
                'source': label,
                'error_type': 'FN'
            })

        fp_pairs = (pred_set & val_pairs) & false_set
        for id1, id2 in fp_pairs:
            error_records[edge_name].append({
                'id1': id1,
                'id2': id2,
                'label': 'FALSE',
                'edge': edge_name,
                'source': label,
                'error_type': 'FP'
            })

_collect_errors(globals().get('matching_results'), 'rule_based')
_collect_errors(globals().get('optimized_matching_results'), 'optimized')
_collect_errors(globals().get('logreg_matching_results'), 'logreg')
_collect_errors(globals().get('ml_matching_results'), 'random_forest')
_collect_errors(globals().get('gb_matching_results'), 'gradient_boosting')
_collect_errors(globals().get('xgb_matching_results'), 'xgboost')

for edge_name, records in error_records.items():
    if not records:
        continue
    df_edge = pd.DataFrame(records)
    manual_path = MANUAL_CASES_DIR / f'manual_cases_{edge_name}.csv'
    if manual_path.exists():
        existing = pd.read_csv(manual_path)
        # Align columns
        for col in df_edge.columns:
            if col not in existing.columns:
                existing[col] = ''
        for col in existing.columns:
            if col not in df_edge.columns:
                df_edge[col] = ''
        df_edge = pd.concat([existing, df_edge], ignore_index=True)
    df_edge = df_edge.drop_duplicates(subset=['id1', 'id2'])
    df_edge.to_csv(manual_path, index=False)
    print(f"[{edge_name}] Exported {len(df_edge)} cases to {manual_path}")

if not any(error_records.values()):
    print("No error cases collected (ensure evaluation cells were executed).")


[LR] Exported 7 cases to /Users/zhangzihan/Desktop/WBI_project/Schema_Mapped_Datasets/data/output/gt/manual_cases/manual_cases_LR.csv
[LS] Exported 10 cases to /Users/zhangzihan/Desktop/WBI_project/Schema_Mapped_Datasets/data/output/gt/manual_cases/manual_cases_LS.csv


### 4.1 Define Feature Extractors (Comparators)

Define the comparators whose outputs serve as ML features. A birth-year comparator is added to the two name comparators to improve matching accuracy:
- **Name similarity**: Levenshtein (character-level) and Jaccard (word-level) similarity on normalized names
- **Birth year**: `DateComparator` on `birth_year` with `max_days_difference=365`, which penalizes pairs whose birth years differ
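
To make the three features concrete, here is a small standard-library sketch of each similarity. The normalizations are assumptions chosen for illustration (edit distance divided by the longer string, a linear decay for the year gap) and may differ from what PyDI's `StringComparator` and `DateComparator` compute; all function names are hypothetical.

```python
def levenshtein_distance(a, b):
    """Classic dynamic-programming edit distance (insert, delete, substitute)."""
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            cost = 0 if char_a == char_b else 1
            current.append(min(previous[j] + 1,          # deletion
                               current[j - 1] + 1,       # insertion
                               previous[j - 1] + cost))  # substitution
        previous = current
    return previous[-1]


def levenshtein_similarity(a, b):
    """Assumed normalization: 1 - distance / length of the longer string."""
    longest = max(len(a), len(b))
    return 1.0 if longest == 0 else 1.0 - levenshtein_distance(a, b) / longest


def jaccard_word_similarity(a, b):
    """Shared words divided by all distinct words."""
    tokens_a, tokens_b = set(a.split()), set(b.split())
    union = tokens_a | tokens_b
    return len(tokens_a & tokens_b) / len(union) if union else 1.0


def year_similarity(year_a, year_b, max_difference=1):
    """Assumed linear decay: 1.0 for equal years, 0.0 at max_difference or more."""
    return max(0.0, 1.0 - abs(year_a - year_b) / max_difference)


name_left, name_right = "philip gosselin", "phil gosselin"
print(levenshtein_similarity(name_left, name_right))
print(jaccard_word_similarity(name_left, name_right))
print(year_similarity(1988, 1989))
```

Under these assumptions, a nickname such as "phil" for "philip" keeps the character-level similarity high but lowers the word-level Jaccard score sharply, which is one reason to feed both to the classifier.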


In [157]:
# Define feature extractors (comparators) for ML-based matching
# Enhanced with birth year comparator to improve matching accuracy

from PyDI.entitymatching import MLBasedMatcher, FeatureExtractor, StringComparator
from PyDI.entitymatching.comparators import DateComparator
from PyDI.io import load_csv
from sklearn.ensemble import RandomForestClassifier
from PyDI.entitymatching import VectorFeatureExtractor

ml_comparators = [
    StringComparator(
        column="full_name_normalized" if 'full_name_normalized' in L_full.columns else "full_name",
        similarity_function="levenshtein",
        preprocess=str.lower
    ),
    StringComparator(
        column="full_name_normalized" if 'full_name_normalized' in L_full.columns else "full_name",
        similarity_function="jaccard",
        tokenization="word",
        preprocess=str.lower
    ),
    # Enhanced: Add birth year comparator to penalize pairs with different birth years
    DateComparator(
        column="birth_year",
        max_days_difference=365  # Allow 1 year difference (365 days)
    )
]

# Initialize feature extractor
ml_feature_extractor = FeatureExtractor(ml_comparators)

print("ML-based matching feature extractors configured:")
print(f"  Comparators: {len(ml_comparators)}")
print(f"    - Levenshtein distance on normalized name")
print(f"    - Jaccard similarity on normalized name (word tokenization)")
print(f"    - Birth year comparator (max_days_difference=365)")
print("✓ Feature extractor initialized")


ML-based matching feature extractors configured:
  Comparators: 3
    - Levenshtein distance on normalized name
    - Jaccard similarity on normalized name (word tokenization)
    - Birth year comparator (max_days_difference=365)
✓ Feature extractor initialized


### 4.2 Train LogisticRegression Matcher

Train a baseline LogisticRegression model using the shared feature extractor. This provides a lightweight reference point before the tree-based ensembles.


In [183]:
# Train LogisticRegression-based matchers for each edge
from sklearn.linear_model import LogisticRegression  # not imported in the 4.1 cell

logreg_classifiers = {}
logreg_matchers = {}

for edge_name in ['LR', 'LS']:
    print(f"\n=== {edge_name}: Training LogisticRegression Matcher ===")
    
    left_df, right_df = source_tables[edge_name]
    train_df = splits[edge_name]['train'][['id1', 'id2', 'label']].copy()
    
    # Normalize labels
    train_df['label'] = train_df['label'].astype(str).str.strip().str.upper()
    train_df['label_binary'] = (train_df['label'] == 'TRUE').astype(int)
    
    print(f"  Training pairs: {len(train_df)}")
    print(f"    True matches: {train_df['label_binary'].sum()}")
    print(f"    False matches: {len(train_df) - train_df['label_binary'].sum()}")
    
    # Extract shared features
    train_features = ml_feature_extractor.create_features(
        df_left=left_df,
        df_right=right_df,
        pairs=train_df[['id1', 'id2']],
        labels=train_df['label_binary'],
        id_column='_rid'
    )
    
    feature_columns = [col for col in train_features.columns if col not in ['id1', 'id2', 'label']]
    X_train = train_features[feature_columns]
    y_train = train_features['label']
    
    print(f"  Training features shape: {X_train.shape}")
    print(f"  Feature columns: {feature_columns}")
    
    # Configure Logistic Regression
    clf = LogisticRegression(
        max_iter=1000,
        class_weight='balanced',
        solver='liblinear'
    )
    clf.fit(X_train, y_train)
    
    logreg_classifiers[edge_name] = clf
    logreg_matchers[edge_name] = MLBasedMatcher(ml_feature_extractor)
    
    # Inspect coefficients for interpretability
    coef_df = pd.DataFrame({
        'feature': feature_columns,
        'coefficient': clf.coef_[0]
    }).sort_values('coefficient', key=lambda s: s.abs(), ascending=False)
    print("\n  Top coefficients (magnitude):")
    print(coef_df.to_string(index=False))
    print(f"  Intercept: {clf.intercept_[0]:.4f}")
    print(f"  ✓ LogisticRegression trained for {edge_name}")

print("\n✓ LogisticRegression matchers trained for all edges")



=== LR: Training LogisticRegression Matcher ===
  Training pairs: 302
    True matches: 217
    False matches: 85


[INFO ] root - Label distribution: 217 positive, 85 negative


  Training features shape: (302, 3)
  Feature columns: ['StringComparator(full_name_normalized, levenshtein, tokenization=char, list_strategy=None)', 'StringComparator(full_name_normalized, jaccard, tokenization=word, list_strategy=None)', 'DateComparator(birth_year, list_strategy=None)']

  Top coefficients (magnitude):
                                                                                   feature  coefficient
    StringComparator(full_name_normalized, jaccard, tokenization=word, list_strategy=None)     3.518803
StringComparator(full_name_normalized, levenshtein, tokenization=char, list_strategy=None)     2.701644
                                            DateComparator(birth_year, list_strategy=None)     1.661940
  Intercept: -4.6522
  ✓ LogisticRegression trained for LR

=== LS: Training LogisticRegression Matcher ===
  Training pairs: 299
    True matches: 164
    False matches: 135


[INFO ] root - Label distribution: 164 positive, 135 negative


  Training features shape: (299, 3)
  Feature columns: ['StringComparator(full_name_normalized, levenshtein, tokenization=char, list_strategy=None)', 'StringComparator(full_name_normalized, jaccard, tokenization=word, list_strategy=None)', 'DateComparator(birth_year, list_strategy=None)']

  Top coefficients (magnitude):
                                                                                   feature  coefficient
StringComparator(full_name_normalized, levenshtein, tokenization=char, list_strategy=None)     4.357054
    StringComparator(full_name_normalized, jaccard, tokenization=word, list_strategy=None)     2.490422
                                            DateComparator(birth_year, list_strategy=None)     1.440008
  Intercept: -4.9249
  ✓ LogisticRegression trained for LS

✓ LogisticRegression matchers trained for all edges
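
The training cell sets `class_weight='balanced'`. scikit-learn documents this heuristic as `n_samples / (n_classes * count_per_class)`. The sketch below applies that formula to the label counts reported in the output above; `balanced_class_weights` is a hypothetical helper, not a scikit-learn function.

```python
def balanced_class_weights(class_counts):
    """weight_c = n_samples / (n_classes * count_c) for each class c."""
    n_samples = sum(class_counts.values())
    n_classes = len(class_counts)
    return {label: n_samples / (n_classes * count)
            for label, count in class_counts.items()}


# Label counts taken from the training output (1 = match, 0 = non-match)
lr_weights = balanced_class_weights({1: 217, 0: 85})
ls_weights = balanced_class_weights({1: 164, 0: 135})

print({label: round(weight, 3) for label, weight in lr_weights.items()})
print({label: round(weight, 3) for label, weight in ls_weights.items()})
```

The LR edge is the more imbalanced of the two, so its minority class (non-matches) receives a noticeably larger weight than on the LS edge.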


### 4.3 Apply LogisticRegression Matcher to Candidate Pairs

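Score candidate pairs with the trained LogisticRegression classifier through the shared matching pipeline. Candidates are first restricted to pairs whose `season_year` is present on both sides and identical, and matches are then selected using predicted probabilities with a threshold of 0.4.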
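
The `season_year` hard constraint can be sketched without pandas. The lookup tables and IDs below are hypothetical; the rule is the one the cell applies: keep a pair only when both seasons are known and equal.

```python
# Hypothetical lookup tables: record ID -> season_year (None means missing)
left_seasons = {"l1": 2019, "l2": 2015, "l3": None}
right_seasons = {"r1": 2019, "r2": 2016, "r3": 2020}

candidate_pairs = [("l1", "r1"), ("l2", "r2"), ("l3", "r3"), ("l1", "r9")]


def same_season(pair):
    """True only if both records have a season and the seasons are identical."""
    left_year = left_seasons.get(pair[0])
    right_year = right_seasons.get(pair[1])
    return left_year is not None and right_year is not None and left_year == right_year


filtered_pairs = [pair for pair in candidate_pairs if same_season(pair)]
print(filtered_pairs)
```

Pairs with a missing season on either side are dropped rather than kept, so records without `season_year` can never be matched under this constraint.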


In [190]:
# Apply LogisticRegression-based matcher to candidate pairs
logreg_matching_results = {}

for edge_name in ['LR', 'LS']:
    print(f"\n=== {edge_name}: LogisticRegression Matching ===")
    
    left_df, right_df = source_tables[edge_name]
    cand_df = candidates[edge_name].copy()
    matcher = logreg_matchers[edge_name]
    clf = logreg_classifiers[edge_name]
    
    # Apply season_year hard constraint
    print("  Filtering candidate pairs by season_year constraint...")
    cand_with_seasons = cand_df.merge(
        left_df[['_rid', 'season_year']],
        left_on='id1',
        right_on='_rid',
        how='left'
    ).merge(
        right_df[['_rid', 'season_year']],
        left_on='id2',
        right_on='_rid',
        how='left',
        suffixes=('', '_right')
    )
    
    cand_df_filtered = cand_with_seasons[
        (cand_with_seasons['season_year'].notna()) &
        (cand_with_seasons['season_year_right'].notna()) &
        (cand_with_seasons['season_year'] == cand_with_seasons['season_year_right'])
    ][['id1', 'id2']].copy()
    
    print(f"  After season_year filter: {len(cand_df_filtered):,} candidate pairs (from {len(cand_df):,})")
    
    correspondences = matcher.match(
        df_left=left_df,
        df_right=right_df,
        candidates=cand_df_filtered,
        id_column='_rid',
        use_probabilities=True,
        trained_classifier=clf,
        threshold=0.4
    )
    
    logreg_matching_results[edge_name] = correspondences
    
    print(f"  Generated {len(correspondences):,} matched pairs")
    if 'score' in correspondences.columns:
        print(f"  Score range: [{correspondences['score'].min():.3f}, {correspondences['score'].max():.3f}]")
        print(f"  Mean score: {correspondences['score'].mean():.3f}")

print("\n✓ LogisticRegression matching complete for all edges")



=== LR: LogisticRegression Matching ===
  Filtering candidate pairs by season_year constraint...


[INFO ] PyDI.entitymatching.ml_based.MLBasedMatcher - Starting Entity Matching
[INFO ] PyDI.entitymatching.ml_based.MLBasedMatcher - Blocking 106553 x 15215 elements
[INFO ] PyDI.entitymatching.ml_based.MLBasedMatcher - Matching 106553 x 15215 elements after 0:00:0.000; 130895 blocked pairs (reduction ratio: 0.9999192606183567)


  After season_year filter: 130,895 candidate pairs (from 5,733,797)


[INFO ] PyDI.entitymatching.ml_based.MLBasedMatcher - Entity Matching finished after 0:00:49.062; found 19673 correspondences.


  Generated 19,673 matched pairs
  Score range: [0.401, 0.962]
  Mean score: 0.792

=== LS: LogisticRegression Matching ===
  Filtering candidate pairs by season_year constraint...


[INFO ] PyDI.entitymatching.ml_based.MLBasedMatcher - Starting Entity Matching
[INFO ] PyDI.entitymatching.ml_based.MLBasedMatcher - Blocking 106553 x 6743 elements
[INFO ] PyDI.entitymatching.ml_based.MLBasedMatcher - Matching 106553 x 6743 elements after 0:00:0.000; 9526 blocked pairs (reduction ratio: 0.9999867415811222)


  After season_year filter: 9,526 candidate pairs (from 215,708)


[INFO ] PyDI.entitymatching.ml_based.MLBasedMatcher - Entity Matching finished after 0:00:3.768; found 6792 correspondences.


  Generated 6,792 matched pairs
  Score range: [0.408, 0.967]
  Mean score: 0.910

✓ LogisticRegression matching complete for all edges
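
The score ceilings in this output can be reproduced from the coefficients printed in section 4.2. A logistic regression score is the sigmoid of the intercept plus the weighted features. Assuming every comparator outputs at most 1.0 (an assumption about the feature range, not something the logs state), a pair with perfect agreement on all three features gets the following probabilities. `match_probability` is a hypothetical helper.

```python
import math


def match_probability(features, coefficients, intercept):
    """Sigmoid of the linear score: 1 / (1 + exp(-(b + w . x)))."""
    z = intercept + sum(w * x for w, x in zip(coefficients, features))
    return 1.0 / (1.0 + math.exp(-z))


# Coefficients from the 4.2 output, ordered: Levenshtein, Jaccard, birth year
lr_edge = {"coefficients": [2.701644, 3.518803, 1.661940], "intercept": -4.6522}
ls_edge = {"coefficients": [4.357054, 2.490422, 1.440008], "intercept": -4.9249}

perfect_agreement = [1.0, 1.0, 1.0]  # assumed maximum value of each feature
p_lr = match_probability(perfect_agreement, **lr_edge)
p_ls = match_probability(perfect_agreement, **ls_edge)

print(round(p_lr, 3), round(p_ls, 3))
```

The two values agree with the maximum scores reported above (0.962 for LR, 0.967 for LS), which is consistent with the assumed feature range.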


### 4.4 Evaluate LogisticRegression Matching

Assess LogisticRegression performance on the validation set using the PyDI evaluator.


In [191]:
# Evaluate LogisticRegression matching on validation set
logreg_matching_metrics_val = {}

for edge_name in ['LR', 'LS']:
    print(f"\n=== {edge_name}: LogisticRegression Evaluation (Validation Set) ===")
    
    correspondences = logreg_matching_results[edge_name]
    val_df = splits[edge_name]['val'][['id1', 'id2', 'label']].copy()
    
    correspondences_for_eval = correspondences.copy()
    if 'score' not in correspondences_for_eval.columns and 'sim' in correspondences_for_eval.columns:
        correspondences_for_eval = correspondences_for_eval.rename(columns={'sim': 'score'})
    
    try:
        eval_results = EntityMatchingEvaluator.evaluate_matching(
            correspondences=correspondences_for_eval,
            test_pairs=val_df,
            out_dir=OUTPUT_DIR / 'matching-evaluation-logreg',
            matcher_instance=logreg_matchers[edge_name]
        )
        logreg_matching_metrics_val[edge_name] = eval_results
        
        print(f"\n  Precision: {eval_results.get('precision', 0.0):.3f}")
        print(f"  Recall:    {eval_results.get('recall', 0.0):.3f}")
        print(f"  F1-Score:  {eval_results.get('f1', 0.0):.3f}")
        print(f"  TP: {eval_results.get('true_positives', 0)}")
        print(f"  FP: {eval_results.get('false_positives', 0)}")
        print(f"  FN: {eval_results.get('false_negatives', 0)}")
    except Exception as exc:
        print(f"  PyDI evaluator failed: {exc}")
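        # Fallback: score only labeled validation pairs. Predicted pairs that are
        # absent from the validation set are unlabeled, so they must not count as FPs.
        labels = val_df['label'].astype(str).str.strip().str.upper()
        true_set = set(zip(val_df.loc[labels == 'TRUE', 'id1'], val_df.loc[labels == 'TRUE', 'id2']))
        false_set = set(zip(val_df.loc[labels == 'FALSE', 'id1'], val_df.loc[labels == 'FALSE', 'id2']))
        pred_set = set(zip(correspondences['id1'], correspondences['id2']))
        
        tp = len(true_set & pred_set)
        fp = len(false_set & pred_set)
        fn = len(true_set - pred_set)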
        
        precision = tp / (tp + fp) if (tp + fp) > 0 else 0.0
        recall = tp / (tp + fn) if (tp + fn) > 0 else 0.0
        f1 = 2 * precision * recall / (precision + recall) if (precision + recall) > 0 else 0.0
        
        logreg_matching_metrics_val[edge_name] = {
            'precision': precision,
            'recall': recall,
            'f1': f1,
            'true_positives': tp,
            'false_positives': fp,
            'false_negatives': fn
        }
        
        print(f"\n  Precision: {precision:.3f}")
        print(f"  Recall:    {recall:.3f}")
        print(f"  F1-Score:  {f1:.3f}")
        print(f"  TP: {tp}")
        print(f"  FP: {fp}")
        print(f"  FN: {fn}")

print("\n✓ LogisticRegression evaluation complete")



=== LR: LogisticRegression Evaluation (Validation Set) ===


[INFO ] root - Confusion Matrix:
[INFO ] root -   True Positives:  65
[INFO ] root -   True Negatives:  27
[INFO ] root -   False Positives: 0
[INFO ] root -   False Negatives: 7
[INFO ] root - Performance Metrics:
[INFO ] root -   Accuracy:  0.929
[INFO ] root -   Precision: 1.000
[INFO ] root -   Recall:    0.903
[INFO ] root -   F1-Score:  0.949



  Precision: 1.000
  Recall:    0.903
  F1-Score:  0.949
  TP: 65
  FP: 0
  FN: 7

=== LS: LogisticRegression Evaluation (Validation Set) ===


[INFO ] root - Confusion Matrix:
[INFO ] root -   True Positives:  54
[INFO ] root -   True Negatives:  37
[INFO ] root -   False Positives: 4
[INFO ] root -   False Negatives: 1
[INFO ] root - Performance Metrics:
[INFO ] root -   Accuracy:  0.948
[INFO ] root -   Precision: 0.931
[INFO ] root -   Recall:    0.982
[INFO ] root -   F1-Score:  0.956



  Precision: 0.931
  Recall:    0.982
  F1-Score:  0.956
  TP: 54
  FP: 4
  FN: 1

✓ LogisticRegression evaluation complete
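
The reported metrics follow directly from the confusion-matrix counts in the log. The sketch below recomputes them with the standard definitions; `classification_metrics` is a hypothetical helper, and the counts are copied from the output above.

```python
def classification_metrics(tp, tn, fp, fn):
    """Standard precision, recall, F1 and accuracy from confusion counts."""
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    total = tp + tn + fp + fn
    accuracy = (tp + tn) / total if total else 0.0
    return {"precision": precision, "recall": recall, "f1": f1, "accuracy": accuracy}


# Counts from the validation logs for each edge
lr_metrics = classification_metrics(tp=65, tn=27, fp=0, fn=7)
ls_metrics = classification_metrics(tp=54, tn=37, fp=4, fn=1)

for name, metrics in [("LR", lr_metrics), ("LS", ls_metrics)]:
    print(name, {key: round(value, 3) for key, value in metrics.items()})
```

Note that the four counts sum to the size of the validation split (99 pairs for LR, 96 for LS), so only labeled pairs enter the metrics.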


### 4.4.1 LogisticRegression Error Cases Analysis

Investigate False Positives/Negatives for the LogisticRegression matcher using the same diagnostic template.


In [192]:
# Analyze error cases for LogisticRegression
for edge_name in ['LR', 'LS']:
    print(f"\n{'='*80}")
    print(f"{edge_name}: LogisticRegression Error Cases Analysis")
    print(f"{'='*80}")
    
    left_df, right_df = source_tables[edge_name]
    val_df = splits[edge_name]['val'][['id1', 'id2', 'label']].copy()
    correspondences = logreg_matching_results[edge_name]
    
    true_matches = val_df[val_df['label'].astype(str).str.strip().str.upper() == 'TRUE']
    false_matches = val_df[val_df['label'].astype(str).str.strip().str.upper() == 'FALSE']
    true_set = set(zip(true_matches['id1'], true_matches['id2']))
    false_set = set(zip(false_matches['id1'], false_matches['id2']))
    pred_set = set(zip(correspondences['id1'], correspondences['id2']))
    val_set = set(zip(val_df['id1'], val_df['id2']))
    
    # False negatives
    fn_pairs = true_set - pred_set
    print(f"\n[FALSE NEGATIVES] ({len(fn_pairs)} cases):")
    if fn_pairs:
        for id1, id2 in fn_pairs:
            left_rec = left_df[left_df['_rid'] == id1].iloc[0] if len(left_df[left_df['_rid'] == id1]) > 0 else None
            right_rec = right_df[right_df['_rid'] == id2].iloc[0] if len(right_df[right_df['_rid'] == id2]) > 0 else None
            if left_rec is not None and right_rec is not None:
                print(f"\n  {id1} <-> {id2}")
                print(f"    Left:  '{left_rec.get('full_name', 'N/A')}' | Season: {left_rec.get('season_year', 'N/A')} | Birth: {left_rec.get('birth_year', 'N/A')}")
                print(f"    Right: '{right_rec.get('full_name', 'N/A')}' | Season: {right_rec.get('season_year', 'N/A')} | Birth: {right_rec.get('birth_year', 'N/A')}")
                in_candidates = len(candidates[edge_name][
                    (candidates[edge_name]['id1'] == id1) &
                    (candidates[edge_name]['id2'] == id2)
                ]) > 0
                print(f"    In candidates: {in_candidates}")
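                # Note: FN pairs are by construction absent from `correspondences`,
                # so this lookup finds no row and no score is printed for them.
                score_row = correspondences[(correspondences['id1'] == id1) & (correspondences['id2'] == id2)]
                if len(score_row) > 0 and 'score' in score_row.columns:
                    print(f"    LogReg Score: {score_row['score'].iloc[0]:.3f}")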
    else:
        print("  No false negatives found!")
    
    # False positives
    fp_pairs = (pred_set & val_set) & false_set
    print(f"\n[FALSE POSITIVES] ({len(fp_pairs)} cases):")
    if fp_pairs:
        for id1, id2 in fp_pairs:
            left_rec = left_df[left_df['_rid'] == id1].iloc[0] if len(left_df[left_df['_rid'] == id1]) > 0 else None
            right_rec = right_df[right_df['_rid'] == id2].iloc[0] if len(right_df[right_df['_rid'] == id2]) > 0 else None
            score_row = correspondences[(correspondences['id1'] == id1) & (correspondences['id2'] == id2)]
            score = score_row['score'].iloc[0] if len(score_row) > 0 and 'score' in score_row.columns else None
            if left_rec is not None and right_rec is not None:
                score_str = f"{score:.3f}" if score is not None else "N/A"
                print(f"\n  {id1} <-> {id2} (score: {score_str})")
                print(f"    Left:  '{left_rec.get('full_name', 'N/A')}' | Season: {left_rec.get('season_year', 'N/A')} | Birth: {left_rec.get('birth_year', 'N/A')}")
                print(f"    Right: '{right_rec.get('full_name', 'N/A')}' | Season: {right_rec.get('season_year', 'N/A')} | Birth: {right_rec.get('birth_year', 'N/A')}")
    else:
        print("  No false positives found!")

print("\n✓ LogisticRegression error analysis complete")



LR: LogisticRegression Error Cases Analysis

[FALSE NEGATIVES] (7 cases):

  5948382019|2019|L <-> 5948382019|2019|R
    Left:  'philip gosselin' | Season: 2019 | Birth: 1988
    Right: 'phil gosselin' | Season: 2019 | Birth: 1989
    In candidates: True

  5948382015|2015|L <-> 5948382015|2015|R
    Left:  'philip gosselin' | Season: 2015 | Birth: 1988
    Right: 'phil gosselin' | Season: 2015 | Birth: 1989
    In candidates: True

  6638452022|2022|L <-> 6638452022|2022|R
    Left:  'alfonso rivas' | Season: 2022 | Birth: 1996
    Right: 'alfonso rivas iii' | Season: 2022 | Birth: 1997
    In candidates: True

  5961292022|2022|L <-> 5961292022|2022|R
    Left:  'dan vogelbach' | Season: 2022 | Birth: 1992
    Right: 'daniel vogelbach' | Season: 2022 | Birth: 1993
    In candidates: False

  5471702019|2019|L <-> 5471702019|2019|R
    Left:  'nick delmonico' | Season: 2019 | Birth: 1992
    Right: 'nicky delmonico' | Season: 2019 | Birth: 1993
    In candidates: True

  5948382021|2

### 4.5 Extract Features and Train RandomForestClassifier

Extract features from training pairs and train a RandomForestClassifier.


In [162]:
# Train ML-based matchers for each edge
ml_classifiers = {}
ml_matchers = {}

for edge_name in ['LR', 'LS']:
    print(f"\n=== {edge_name}: Training ML-Based Matcher ===")
    
    left_df, right_df = source_tables[edge_name]
    train_df = splits[edge_name]['train'][['id1', 'id2', 'label']].copy()
    
    # Prepare labels: convert 'TRUE'/'FALSE' to 1/0
    train_df['label'] = train_df['label'].astype(str).str.strip().str.upper()
    train_df['label_binary'] = (train_df['label'] == 'TRUE').astype(int)
    
    print(f"  Training pairs: {len(train_df)}")
    print(f"    True matches: {train_df['label_binary'].sum()}")
    print(f"    False matches: {len(train_df) - train_df['label_binary'].sum()}")
    
    # Extract features for training pairs
    print(f"  Extracting features from training pairs...")
    train_features = ml_feature_extractor.create_features(
        df_left=left_df,
        df_right=right_df,
        pairs=train_df[['id1', 'id2']],
        labels=train_df['label_binary'],
        id_column='_rid'
    )
    
    # Prepare training data (feature columns exclude the pair IDs and the label)
    feature_columns = [col for col in train_features.columns if col not in ['id1', 'id2', 'label']]
    print(f"  Extracted features: {len(train_features)} pairs")
    print(f"  Feature columns: {len(feature_columns)}")
    
    X_train = train_features[feature_columns]
    y_train = train_features['label']
    
    print(f"  Training features shape: {X_train.shape}")
    print(f"  Training labels distribution: {y_train.value_counts().to_dict()}")
    
    # Train classifier
    print(f"  Training RandomForestClassifier...")
    clf = RandomForestClassifier(
        n_estimators=100,
        max_depth=10,
        min_samples_split=5,
        min_samples_leaf=2,
        random_state=42,
        n_jobs=-1
    )
    clf.fit(X_train, y_train)
    
    # Store classifier and create matcher
    ml_classifiers[edge_name] = clf
    ml_matchers[edge_name] = MLBasedMatcher(ml_feature_extractor)
    
    # Log feature importance if available
    if hasattr(clf, 'feature_importances_'):
        feature_importance = pd.DataFrame({
            'feature': feature_columns,
            'importance': clf.feature_importances_
        }).sort_values('importance', ascending=False)
        print(f"\n  Top 5 Feature Importances:")
        for idx, row in feature_importance.head(5).iterrows():
            print(f"    {row['feature']}: {row['importance']:.4f}")
    
    print(f"  ✓ Classifier trained for {edge_name}")

print("\n✓ ML-based matchers trained for all edges")



=== LR: Training ML-Based Matcher ===
  Training pairs: 302
    True matches: 217
    False matches: 85
  Extracting features from training pairs...


[INFO ] root - Label distribution: 217 positive, 85 negative


  Extracted features: 302 pairs
  Feature columns: 3
  Training features shape: (302, 3)
  Training labels distribution: {1: 217, 0: 85}
  Training RandomForestClassifier...

  Top 5 Feature Importances:
    StringComparator(full_name_normalized, levenshtein, tokenization=char, list_strategy=None): 0.5042
    StringComparator(full_name_normalized, jaccard, tokenization=word, list_strategy=None): 0.4058
    DateComparator(birth_year, list_strategy=None): 0.0900
  ✓ Classifier trained for LR

=== LS: Training ML-Based Matcher ===
  Training pairs: 299
    True matches: 164
    False matches: 135
  Extracting features from training pairs...


[INFO ] root - Label distribution: 164 positive, 135 negative


  Extracted features: 299 pairs
  Feature columns: 3
  Training features shape: (299, 3)
  Training labels distribution: {1: 164, 0: 135}
  Training RandomForestClassifier...

  Top 5 Feature Importances:
    StringComparator(full_name_normalized, levenshtein, tokenization=char, list_strategy=None): 0.6069
    StringComparator(full_name_normalized, jaccard, tokenization=word, list_strategy=None): 0.3208
    DateComparator(birth_year, list_strategy=None): 0.0723
  ✓ Classifier trained for LS

✓ ML-based matchers trained for all edges


### 4.6 Apply RandomForest Matcher to Candidate Pairs

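Apply the trained RandomForest-based matcher to candidate pairs from the blocking phase, after the same `season_year` filter used for LogisticRegression. Unlike the LogisticRegression run, this call passes neither `use_probabilities` nor `threshold`, and the output reports a score of 1.000 for every match, which suggests the scores here are hard class predictions rather than probabilities.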


In [163]:
# Apply ML-based matching to candidate pairs
ml_matching_results = {}

for edge_name in ['LR', 'LS']:
    print(f"\n=== {edge_name}: ML-Based Matching ===")
    
    left_df, right_df = source_tables[edge_name]
    cand_df = candidates[edge_name].copy()
    matcher = ml_matchers[edge_name]
    clf = ml_classifiers[edge_name]
    
    # Apply season_year hard constraint: filter candidate pairs where season_year matches
    print(f"  Filtering candidate pairs by season_year constraint...")
    cand_with_seasons = cand_df.merge(
        left_df[['_rid', 'season_year']],
        left_on='id1',
        right_on='_rid',
        how='left'
    ).merge(
        right_df[['_rid', 'season_year']],
        left_on='id2',
        right_on='_rid',
        how='left',
        suffixes=('', '_right')
    )
    
    # Filter: season_year must match exactly
    cand_df_filtered = cand_with_seasons[
        (cand_with_seasons['season_year'].notna()) & 
        (cand_with_seasons['season_year_right'].notna()) &
        (cand_with_seasons['season_year'] == cand_with_seasons['season_year_right'])
    ][['id1', 'id2']].copy()
    
    print(f"  After season_year filter: {len(cand_df_filtered):,} candidate pairs (from {len(cand_df):,})")
    
    # Apply ML-based matcher
    print(f"  Applying ML-based matcher...")
    correspondences = matcher.match(
        df_left=left_df,
        df_right=right_df,
        candidates=cand_df_filtered,
        id_column='_rid',
        trained_classifier=clf
    )
    
    # Store results
    ml_matching_results[edge_name] = correspondences
    
    print(f"  Generated {len(correspondences):,} matched pairs")
    if 'score' in correspondences.columns:
        print(f"  Score range: [{correspondences['score'].min():.3f}, {correspondences['score'].max():.3f}]")
        print(f"  Mean score: {correspondences['score'].mean():.3f}")

print("\n✓ ML-based matching complete for all edges")



=== LR: ML-Based Matching ===
  Filtering candidate pairs by season_year constraint...


[INFO ] PyDI.entitymatching.ml_based.MLBasedMatcher - Starting Entity Matching
[INFO ] PyDI.entitymatching.ml_based.MLBasedMatcher - Blocking 106553 x 15215 elements
[INFO ] PyDI.entitymatching.ml_based.MLBasedMatcher - Matching 106553 x 15215 elements after 0:00:0.000; 130895 blocked pairs (reduction ratio: 0.9999192606183567)


  After season_year filter: 130,895 candidate pairs (from 5,733,797)
  Applying ML-based matcher...


[INFO ] PyDI.entitymatching.ml_based.MLBasedMatcher - Entity Matching finished after 0:00:51.452; found 15389 correspondences.


  Generated 15,389 matched pairs
  Score range: [1.000, 1.000]
  Mean score: 1.000

=== LS: ML-Based Matching ===
  Filtering candidate pairs by season_year constraint...


[INFO ] PyDI.entitymatching.ml_based.MLBasedMatcher - Starting Entity Matching
[INFO ] PyDI.entitymatching.ml_based.MLBasedMatcher - Blocking 106553 x 6743 elements
[INFO ] PyDI.entitymatching.ml_based.MLBasedMatcher - Matching 106553 x 6743 elements after 0:00:0.000; 9526 blocked pairs (reduction ratio: 0.9999867415811222)


  After season_year filter: 9,526 candidate pairs (from 215,708)
  Applying ML-based matcher...


[INFO ] PyDI.entitymatching.ml_based.MLBasedMatcher - Entity Matching finished after 0:00:3.497; found 6785 correspondences.


  Generated 6,785 matched pairs
  Score range: [1.000, 1.000]
  Mean score: 1.000

✓ ML-based matching complete for all edges
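
The season_year filter in the cell above reappears unchanged in the later classifier cells. A minimal sketch of factoring it into one helper follows, assuming the `id1`/`id2`, `_rid`, and `season_year` column names used in this notebook. `filter_by_season_year` is a hypothetical name, not a PyDI function, and the toy frames only stand in for the real tables.

```python
import pandas as pd


def filter_by_season_year(cand_df, left_df, right_df, id_column="_rid"):
    """Keep candidate pairs whose two records share the same non-null season_year.

    Hypothetical helper (not part of PyDI); column names follow this notebook.
    """
    left_seasons = left_df[[id_column, "season_year"]].rename(
        columns={id_column: "id1", "season_year": "season_left"}
    )
    right_seasons = right_df[[id_column, "season_year"]].rename(
        columns={id_column: "id2", "season_year": "season_right"}
    )
    merged = cand_df.merge(left_seasons, on="id1", how="left").merge(
        right_seasons, on="id2", how="left"
    )
    keep = (
        merged["season_left"].notna()
        & merged["season_right"].notna()
        & (merged["season_left"] == merged["season_right"])
    )
    return merged.loc[keep, ["id1", "id2"]].reset_index(drop=True)


# Toy data for illustration only
left = pd.DataFrame({"_rid": ["a|L", "b|L", "c|L"], "season_year": [2016, 2017, None]})
right = pd.DataFrame({"_rid": ["a|R", "b|R", "c|R"], "season_year": [2016, 2018, 2019]})
cands = pd.DataFrame({"id1": ["a|L", "b|L", "c|L"], "id2": ["a|R", "b|R", "c|R"]})

filtered = filter_by_season_year(cands, left, right)
print(filtered)  # only the a|L / a|R pair survives
```

Each matching cell could then call the helper once per edge instead of repeating the two merges and the three-part mask.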


### 4.7 Evaluate RandomForest Matching

Evaluate the performance of the RandomForest-based matcher on the validation set.


In [164]:
# Evaluate ML-based matching on validation set

ml_matching_metrics_val = {}

for edge_name in ['LR', 'LS']:
    print(f"\n=== {edge_name}: ML-Based Matching Evaluation (Validation Set) ===")
    
    correspondences = ml_matching_results[edge_name]
    val_df = splits[edge_name]['val'][['id1', 'id2', 'label']].copy()
    
    # Rename 'sim' to 'score' to match PyDI evaluator expectations (if needed)
    correspondences_for_eval = correspondences.copy()
    if 'score' not in correspondences_for_eval.columns and 'sim' in correspondences_for_eval.columns:
        correspondences_for_eval = correspondences_for_eval.rename(columns={'sim': 'score'})
    
    # Evaluate matching
    try:
        eval_results = EntityMatchingEvaluator.evaluate_matching(
            correspondences=correspondences_for_eval,
            test_pairs=val_df,
            out_dir=OUTPUT_DIR / 'matching-evaluation-ml',
            matcher_instance=ml_matchers[edge_name]
        )
        
        ml_matching_metrics_val[edge_name] = eval_results
        
        print(f"\n  Precision: {eval_results.get('precision', 0.0):.3f}")
        print(f"  Recall:    {eval_results.get('recall', 0.0):.3f}")
        print(f"  F1-Score:  {eval_results.get('f1', 0.0):.3f}")
        print(f"  TP: {eval_results.get('true_positives', 0)}")
        print(f"  FP: {eval_results.get('false_positives', 0)}")
        print(f"  FN: {eval_results.get('false_negatives', 0)}")
    except Exception as e:
        print(f"  PyDI evaluator failed: {e}")
        # Manual evaluation fallback
        true_matches = val_df[val_df['label'].astype(str).str.strip().str.upper() == 'TRUE']
        true_set = set(zip(true_matches['id1'], true_matches['id2']))
        pred_set = set(zip(correspondences['id1'], correspondences['id2']))
        
        tp = len(true_set & pred_set)
        fp = len(pred_set - true_set)
        fn = len(true_set - pred_set)
        
        precision = tp / (tp + fp) if (tp + fp) > 0 else 0.0
        recall = tp / (tp + fn) if (tp + fn) > 0 else 0.0
        f1 = 2 * precision * recall / (precision + recall) if (precision + recall) > 0 else 0.0
        
        print(f"\n  Precision: {precision:.3f}")
        print(f"  Recall:    {recall:.3f}")
        print(f"  F1-Score:  {f1:.3f}")
        print(f"  TP: {tp}")
        print(f"  FP: {fp}")
        print(f"  FN: {fn}")

print("\n✓ ML-based matching evaluation complete")



=== LR: ML-Based Matching Evaluation (Validation Set) ===


[INFO ] root - Confusion Matrix:
[INFO ] root -   True Positives:  70
[INFO ] root -   True Negatives:  27
[INFO ] root -   False Positives: 0
[INFO ] root -   False Negatives: 2
[INFO ] root - Performance Metrics:
[INFO ] root -   Accuracy:  0.980
[INFO ] root -   Precision: 1.000
[INFO ] root -   Recall:    0.972
[INFO ] root -   F1-Score:  0.986



  Precision: 1.000
  Recall:    0.972
  F1-Score:  0.986
  TP: 70
  FP: 0
  FN: 2

=== LS: ML-Based Matching Evaluation (Validation Set) ===


[INFO ] root - Confusion Matrix:
[INFO ] root -   True Positives:  53
[INFO ] root -   True Negatives:  38
[INFO ] root -   False Positives: 3
[INFO ] root -   False Negatives: 2
[INFO ] root - Performance Metrics:
[INFO ] root -   Accuracy:  0.948
[INFO ] root -   Precision: 0.946
[INFO ] root -   Recall:    0.964
[INFO ] root -   F1-Score:  0.955



  Precision: 0.946
  Recall:    0.964
  F1-Score:  0.955
  TP: 53
  FP: 3
  FN: 2

✓ ML-based matching evaluation complete
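
The reported metrics follow directly from the logged confusion-matrix counts. The sketch below recomputes them with the same formulas as the manual fallback in the cell above; `prf_from_counts` is a hypothetical helper name, and the counts are copied from the log output.

```python
def prf_from_counts(tp, fp, fn, tn=None):
    """Precision, recall, F1 (and accuracy when tn is given) from raw counts."""
    precision = tp / (tp + fp) if (tp + fp) > 0 else 0.0
    recall = tp / (tp + fn) if (tp + fn) > 0 else 0.0
    f1 = (
        2 * precision * recall / (precision + recall)
        if (precision + recall) > 0
        else 0.0
    )
    metrics = {"precision": precision, "recall": recall, "f1": f1}
    if tn is not None:
        total = tp + fp + fn + tn
        metrics["accuracy"] = (tp + tn) / total if total > 0 else 0.0
    return metrics


# Counts taken from the validation-set logs above
lr = prf_from_counts(tp=70, fp=0, fn=2, tn=27)
ls = prf_from_counts(tp=53, fp=3, fn=2, tn=38)

for name, m in [("LR", lr), ("LS", ls)]:
    print(name, {k: round(v, 3) for k, v in m.items()})
# LR -> precision 1.0, recall 0.972, f1 0.986, accuracy 0.98
# LS -> precision 0.946, recall 0.964, f1 0.955, accuracy 0.948
```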


### 4.7.1 RandomForestClassifier Error Cases Analysis

Analyze False Positives and False Negatives for RandomForestClassifier to identify patterns for further improvement.
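
As a compact alternative to per-pair lookups, the false-negative and false-positive sets can be derived with a single merge. This is a sketch on toy data, assuming the `id1`, `id2`, `label`, and `score` columns used in this notebook; it is not the PyDI evaluator's implementation.

```python
import pandas as pd

# Toy validation split and predicted correspondences (illustrative values only)
val_df = pd.DataFrame({
    "id1": ["a|L", "b|L", "c|L", "d|L"],
    "id2": ["a|R", "b|R", "x|R", "y|R"],
    "label": ["TRUE", "TRUE", "FALSE", "FALSE"],
})
correspondences = pd.DataFrame({
    "id1": ["a|L", "c|L"],
    "id2": ["a|R", "x|R"],
    "score": [1.0, 1.0],
})

# Normalize labels the same way the notebook does
label_norm = val_df["label"].astype(str).str.strip().str.upper()

# Left-merge predictions onto the validation pairs; `_merge` marks predicted pairs
merged = val_df.assign(label_norm=label_norm).merge(
    correspondences[["id1", "id2", "score"]].drop_duplicates(["id1", "id2"]),
    on=["id1", "id2"],
    how="left",
    indicator=True,
)
predicted = merged["_merge"] == "both"

fn_df = merged[(merged["label_norm"] == "TRUE") & ~predicted]   # missed true matches
fp_df = merged[(merged["label_norm"] == "FALSE") & predicted]   # predicted non-matches

print(fn_df[["id1", "id2"]])
print(fp_df[["id1", "id2", "score"]])
```

The resulting frames can be joined to the source tables on `_rid` to print names, seasons, and birth years for each error case.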


In [165]:
# Analyze error cases for RandomForestClassifier (reusing code structure from 3.2)

for edge_name in ['LR', 'LS']:
    print(f"\n{'='*80}")
    print(f"{edge_name}: RandomForestClassifier Error Cases Analysis")
    print(f"{'='*80}")
    
    left_df, right_df = source_tables[edge_name]
    val_df = splits[edge_name]['val'][['id1', 'id2', 'label']].copy()
    correspondences = ml_matching_results[edge_name]
    
    # Get true matches and false matches in validation set
    true_matches = val_df[val_df['label'].astype(str).str.strip().str.upper() == 'TRUE']
    false_matches = val_df[val_df['label'].astype(str).str.strip().str.upper() == 'FALSE']
    true_set = set(zip(true_matches['id1'], true_matches['id2']))
    false_set = set(zip(false_matches['id1'], false_matches['id2']))
    
    # Get all predicted matches; they are intersected with the validation set below
    pred_set = set(zip(correspondences['id1'], correspondences['id2']))
    val_set = set(zip(val_df['id1'], val_df['id2']))  # All pairs in validation set
    
    # False Negatives: True matches in validation set that were not predicted
    fn_pairs = true_set - pred_set
    print(f"\n[FALSE NEGATIVES] ({len(fn_pairs)} cases):")
    if len(fn_pairs) > 0:
        for id1, id2 in fn_pairs:
            left_rec = left_df[left_df['_rid'] == id1].iloc[0] if len(left_df[left_df['_rid'] == id1]) > 0 else None
            right_rec = right_df[right_df['_rid'] == id2].iloc[0] if len(right_df[right_df['_rid'] == id2]) > 0 else None
            
            if left_rec is not None and right_rec is not None:
                print(f"\n  {id1} <-> {id2}")
                print(f"    Left:  '{left_rec.get('full_name', 'N/A')}' | Season: {left_rec.get('season_year', 'N/A')} | Birth: {left_rec.get('birth_year', 'N/A')}")
                print(f"    Right: '{right_rec.get('full_name', 'N/A')}' | Season: {right_rec.get('season_year', 'N/A')} | Birth: {right_rec.get('birth_year', 'N/A')}")
                
                # Check if in candidates
                in_candidates = len(candidates[edge_name][
                    (candidates[edge_name]['id1'] == id1) & 
                    (candidates[edge_name]['id2'] == id2)
                ]) > 0
                print(f"    In candidates: {in_candidates}")
                
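                # NOTE: fn_pairs excludes every pair present in `correspondences`, so this lookup never
                # finds a row; an RF score for a missed pair would need raw scores for all candidates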
                score_row = correspondences[(correspondences['id1'] == id1) & (correspondences['id2'] == id2)]
                if len(score_row) > 0:
                    score = score_row['score'].iloc[0] if 'score' in score_row.columns else None
                    score_str = f"{score:.3f}" if score is not None else "N/A"
                    print(f"    RF Score: {score_str}")
    else:
        print("  No false negatives found!")
    
    # False Positives: Predicted matches that are in validation set but labeled as FALSE
    # Only analyze pairs that are in the validation set
    fp_pairs = (pred_set & val_set) & false_set
    print(f"\n[FALSE POSITIVES] ({len(fp_pairs)} cases):")
    if len(fp_pairs) > 0:
        for id1, id2 in fp_pairs:
            left_rec = left_df[left_df['_rid'] == id1].iloc[0] if len(left_df[left_df['_rid'] == id1]) > 0 else None
            right_rec = right_df[right_df['_rid'] == id2].iloc[0] if len(right_df[right_df['_rid'] == id2]) > 0 else None
            score_row = correspondences[(correspondences['id1'] == id1) & (correspondences['id2'] == id2)]
            score = score_row['score'].iloc[0] if len(score_row) > 0 and 'score' in score_row.columns else None
            
            if left_rec is not None and right_rec is not None:
                score_str = f"{score:.3f}" if score is not None else "N/A"
                print(f"\n  {id1} <-> {id2} (score: {score_str})")
                print(f"    Left:  '{left_rec.get('full_name', 'N/A')}' | Season: {left_rec.get('season_year', 'N/A')} | Birth: {left_rec.get('birth_year', 'N/A')}")
                print(f"    Right: '{right_rec.get('full_name', 'N/A')}' | Season: {right_rec.get('season_year', 'N/A')} | Birth: {right_rec.get('birth_year', 'N/A')}")
    else:
        print("  No false positives found!")

print("\n✓ RandomForestClassifier error analysis complete")



LR: RandomForestClassifier Error Cases Analysis

[FALSE NEGATIVES] (2 cases):

  5961292022|2022|L <-> 5961292022|2022|R
    Left:  'dan vogelbach' | Season: 2022 | Birth: 1992
    Right: 'daniel vogelbach' | Season: 2022 | Birth: 1993
    In candidates: False

  5715102017|2017|L <-> 5715102017|2017|R
    Left:  'matt boyd' | Season: 2017 | Birth: 1991
    Right: 'matthew boyd' | Season: 2017 | Birth: 1991
    In candidates: True

[FALSE POSITIVES] (0 cases):
  No false positives found!

LS: RandomForestClassifier Error Cases Analysis

[FALSE NEGATIVES] (2 cases):

  4931142016|2016|L <-> 4931142016|2016|S
    Left:  'nori aoki' | Season: 2016 | Birth: 1982
    Right: 'norichika aoki' | Season: 2016 | Birth: 1982
    In candidates: True

  5470072016|2016|L <-> 5470072016|2016|S
    Left:  'robert whalen' | Season: 2016 | Birth: 1994
    Right: 'rob whalen' | Season: 2016 | Birth: 1994
    In candidates: True

[FALSE POSITIVES] (3 cases):

  6703512022|2022|L <-> 6689422022|2022|S (s

### 4.5 Comparison: All Matching Methods

Compare the performance of Rule-Based, Optimized, RandomForest, and GradientBoosting matching approaches.


In [166]:
# Compare all matching methods: Rule-Based, Optimized, RandomForest, and GradientBoosting
# Note: Run GradientBoosting cells (4.6) before this comparison for complete results

print("="*80)
print("Matching Performance Comparison: All Methods")
print("="*80)

# Check if GradientBoosting results are available
gb_available = 'gb_matching_metrics_val' in globals()
if not gb_available:
    print("Note: GradientBoosting results not yet computed. Run cells 4.6 first for complete comparison.\n")

comparison_data_all = []

for edge_name in ['LR', 'LS']:
    print(f"\n{edge_name} Edge:")
    print("-" * 80)
    
    # Get metrics for all methods (with safe defaults if not yet computed)
    orig_metrics = matching_metrics_val.get(edge_name, {})
    opt_metrics = optimized_matching_metrics_val.get(edge_name, {})
    ml_metrics = ml_matching_metrics_val.get(edge_name, {})
    gb_metrics = gb_matching_metrics_val.get(edge_name, {}) if gb_available else {}
    
    orig_p, orig_r, orig_f1 = orig_metrics.get('precision', 0.0), orig_metrics.get('recall', 0.0), orig_metrics.get('f1', 0.0)
    opt_p, opt_r, opt_f1 = opt_metrics.get('precision', 0.0), opt_metrics.get('recall', 0.0), opt_metrics.get('f1', 0.0)
    ml_p, ml_r, ml_f1 = ml_metrics.get('precision', 0.0), ml_metrics.get('recall', 0.0), ml_metrics.get('f1', 0.0)
    gb_p, gb_r, gb_f1 = gb_metrics.get('precision', 0.0), gb_metrics.get('recall', 0.0), gb_metrics.get('f1', 0.0)
    
    # Find best method for each metric
    def find_best(values, names):
        max_idx = values.index(max(values))
        return names[max_idx]
    
    print(f"  Metric       | Rule  | Optimized | RF      | GB      | Best")
    print(f"  -------------|-------|-----------|---------|---------|------")
    
    precisions = [orig_p, opt_p, ml_p, gb_p]
    recalls = [orig_r, opt_r, ml_r, gb_r]
    f1s = [orig_f1, opt_f1, ml_f1, gb_f1]
    names = ['Rule', 'Opt', 'RF', 'GB']
    
    print(f"  Precision    | {orig_p:5.3f} | {opt_p:9.3f} | {ml_p:7.3f} | {gb_p:7.3f} | {find_best(precisions, names)}")
    print(f"  Recall       | {orig_r:5.3f} | {opt_r:9.3f} | {ml_r:7.3f} | {gb_r:7.3f} | {find_best(recalls, names)}")
    print(f"  F1-Score     | {orig_f1:5.3f} | {opt_f1:9.3f} | {ml_f1:7.3f} | {gb_f1:7.3f} | {find_best(f1s, names)}")
    
    comparison_data_all.append({
        'edge': edge_name,
        'rule_based_precision': orig_p, 'rule_based_recall': orig_r, 'rule_based_f1': orig_f1,
        'optimized_precision': opt_p, 'optimized_recall': opt_r, 'optimized_f1': opt_f1,
        'rf_precision': ml_p, 'rf_recall': ml_r, 'rf_f1': ml_f1,
        'gb_precision': gb_p, 'gb_recall': gb_r, 'gb_f1': gb_f1,
    })

print("\n" + "="*80)
print("Summary:")
print("="*80)
print("  Methods: Rule-Based, Optimized, RandomForest (RF), GradientBoosting (GB)")

comparison_df_all = pd.DataFrame(comparison_data_all)
comparison_df_all.to_csv(OUTPUT_DIR / 'matching-comparison-all-methods.csv', index=False)
print(f"  Results saved to: {OUTPUT_DIR / 'matching-comparison-all-methods.csv'}")


Matching Performance Comparison: All Methods

LR Edge:
--------------------------------------------------------------------------------
  Metric       | Rule  | Optimized | RF      | GB      | Best
  -------------|-------|-----------|---------|---------|------
  Precision    | 1.000 |     1.000 |   1.000 |   1.000 | Rule
  Recall       | 0.972 |     0.962 |   0.972 |   0.962 | Rule
  F1-Score     | 0.986 |     0.980 |   0.986 |   0.980 | Rule

LS Edge:
--------------------------------------------------------------------------------
  Metric       | Rule  | Optimized | RF      | GB      | Best
  -------------|-------|-----------|---------|---------|------
  Precision    | 0.963 |     0.965 |   0.946 |   0.950 | Opt
  Recall       | 0.945 |     0.948 |   0.964 |   0.983 | GB
  F1-Score     | 0.954 |     0.957 |   0.955 |   0.966 | GB

Summary:
  Methods: Rule-Based, Optimized, RandomForest (RF), GradientBoosting (GB)
  Results saved to: /Users/zhangzihan/Desktop/WBI_project/Schema_Mapped
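
In the table above, the "Best" column names Rule whenever several methods tie, because `values.index(max(values))` returns the first maximum. A tie-aware variant could look like the sketch below; `find_best_with_ties` and its `tol` parameter are hypothetical, and the input rows are copied from the printed table.

```python
def find_best_with_ties(values, names, tol=1e-9):
    """Return every method whose value is within tol of the maximum, joined by '/'."""
    best = max(values)
    return "/".join(n for v, n in zip(values, names) if abs(v - best) <= tol)


names = ["Rule", "Opt", "RF", "GB"]

# LR precision row: every method scores 1.000
lr_precision_best = find_best_with_ties([1.000, 1.000, 1.000, 1.000], names)
# LR recall row: Rule and RF tie at 0.972
lr_recall_best = find_best_with_ties([0.972, 0.962, 0.972, 0.962], names)
# LS F1 row: a single winner
ls_f1_best = find_best_with_ties([0.954, 0.957, 0.955, 0.966], names)

print(lr_precision_best)  # Rule/Opt/RF/GB
print(lr_recall_best)     # Rule/RF
print(ls_f1_best)         # GB
```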

### 4.6 GradientBoostingClassifier Matching

Train and apply GradientBoostingClassifier using the same feature extractor for comparison with RandomForest.


In [167]:
# Train and apply GradientBoostingClassifier (reusing feature extractor from 4.2)

# Ensure GradientBoostingClassifier is imported
from sklearn.ensemble import GradientBoostingClassifier

gb_classifiers = {}
gb_matchers = {}
gb_matching_results = {}

for edge_name in ['LR', 'LS']:
    print(f"\n=== {edge_name}: GradientBoostingClassifier ===")
    
    left_df, right_df = source_tables[edge_name]
    train_df = splits[edge_name]['train'][['id1', 'id2', 'label']].copy()
    train_df['label'] = train_df['label'].astype(str).str.strip().str.upper()
    train_df['label_binary'] = (train_df['label'] == 'TRUE').astype(int)
    
    # Reuse feature extraction from 4.2
    train_features = ml_feature_extractor.create_features(
        df_left=left_df,
        df_right=right_df,
        pairs=train_df[['id1', 'id2']],
        labels=train_df['label_binary'],
        id_column='_rid'
    )
    
    feature_columns = [col for col in train_features.columns if col not in ['id1', 'id2', 'label']]
    X_train = train_features[feature_columns]
    y_train = train_features['label']
    
    # Train GradientBoostingClassifier
    print(f"  Training GradientBoostingClassifier...")
    gb_clf = GradientBoostingClassifier(
        n_estimators=100,
        learning_rate=0.1,
        max_depth=5,
        min_samples_split=5,
        min_samples_leaf=2,
        random_state=42
    )
    gb_clf.fit(X_train, y_train)
    
    gb_classifiers[edge_name] = gb_clf
    gb_matchers[edge_name] = MLBasedMatcher(ml_feature_extractor)
    
    # Apply to candidate pairs (with season_year constraint)
    cand_df = candidates[edge_name].copy()
    cand_with_seasons = cand_df.merge(
        left_df[['_rid', 'season_year']],
        left_on='id1', right_on='_rid', how='left'
    ).merge(
        right_df[['_rid', 'season_year']],
        left_on='id2', right_on='_rid', how='left', suffixes=('', '_right')
    )
    cand_df_filtered = cand_with_seasons[
        (cand_with_seasons['season_year'].notna()) & 
        (cand_with_seasons['season_year_right'].notna()) &
        (cand_with_seasons['season_year'] == cand_with_seasons['season_year_right'])
    ][['id1', 'id2']].copy()
    
    # Apply matcher
    correspondences = gb_matchers[edge_name].match(
        df_left=left_df,
        df_right=right_df,
        candidates=cand_df_filtered,
        id_column='_rid',
        trained_classifier=gb_clf
    )
    gb_matching_results[edge_name] = correspondences
    
    print(f"  Generated {len(correspondences):,} matched pairs")

print("\n✓ GradientBoostingClassifier matching complete")



=== LR: GradientBoostingClassifier ===


[INFO ] root - Label distribution: 217 positive, 85 negative


  Training GradientBoostingClassifier...


[INFO ] PyDI.entitymatching.ml_based.MLBasedMatcher - Starting Entity Matching
[INFO ] PyDI.entitymatching.ml_based.MLBasedMatcher - Blocking 106553 x 15215 elements
[INFO ] PyDI.entitymatching.ml_based.MLBasedMatcher - Matching 106553 x 15215 elements after 0:00:0.001; 130895 blocked pairs (reduction ratio: 0.9999192606183567)
[INFO ] PyDI.entitymatching.ml_based.MLBasedMatcher - Entity Matching finished after 0:00:47.734; found 15336 correspondences.


  Generated 15,336 matched pairs

=== LS: GradientBoostingClassifier ===


[INFO ] root - Label distribution: 164 positive, 135 negative


  Training GradientBoostingClassifier...


[INFO ] PyDI.entitymatching.ml_based.MLBasedMatcher - Starting Entity Matching
[INFO ] PyDI.entitymatching.ml_based.MLBasedMatcher - Blocking 106553 x 6743 elements
[INFO ] PyDI.entitymatching.ml_based.MLBasedMatcher - Matching 106553 x 6743 elements after 0:00:0.000; 9526 blocked pairs (reduction ratio: 0.9999867415811222)
[INFO ] PyDI.entitymatching.ml_based.MLBasedMatcher - Entity Matching finished after 0:00:3.241; found 6786 correspondences.


  Generated 6,786 matched pairs

✓ GradientBoostingClassifier matching complete


In [168]:
# Evaluate GradientBoostingClassifier matching

gb_matching_metrics_val = {}

for edge_name in ['LR', 'LS']:
    print(f"\n=== {edge_name}: GradientBoostingClassifier Evaluation ===")
    
    correspondences = gb_matching_results[edge_name]
    val_df = splits[edge_name]['val'][['id1', 'id2', 'label']].copy()
    
    correspondences_for_eval = correspondences.copy()
    if 'score' not in correspondences_for_eval.columns and 'sim' in correspondences_for_eval.columns:
        correspondences_for_eval = correspondences_for_eval.rename(columns={'sim': 'score'})
    
    try:
        eval_results = EntityMatchingEvaluator.evaluate_matching(
            correspondences=correspondences_for_eval,
            test_pairs=val_df,
            out_dir=OUTPUT_DIR / 'matching-evaluation-gb',
            matcher_instance=gb_matchers[edge_name]
        )
        gb_matching_metrics_val[edge_name] = eval_results
        print(f"  Precision: {eval_results.get('precision', 0.0):.3f}")
        print(f"  Recall:    {eval_results.get('recall', 0.0):.3f}")
        print(f"  F1-Score:  {eval_results.get('f1', 0.0):.3f}")
    except Exception as e:
        print(f"  Evaluation failed: {e}")

print("\n✓ GradientBoostingClassifier evaluation complete")



=== LR: GradientBoostingClassifier Evaluation ===


[INFO ] root - Confusion Matrix:
[INFO ] root -   True Positives:  70
[INFO ] root -   True Negatives:  27
[INFO ] root -   False Positives: 0
[INFO ] root -   False Negatives: 2
[INFO ] root - Performance Metrics:
[INFO ] root -   Accuracy:  0.980
[INFO ] root -   Precision: 1.000
[INFO ] root -   Recall:    0.972
[INFO ] root -   F1-Score:  0.986


  Precision: 1.000
  Recall:    0.972
  F1-Score:  0.986

=== LS: GradientBoostingClassifier Evaluation ===


[INFO ] root - Confusion Matrix:
[INFO ] root -   True Positives:  54
[INFO ] root -   True Negatives:  40
[INFO ] root -   False Positives: 1
[INFO ] root -   False Negatives: 1
[INFO ] root - Performance Metrics:
[INFO ] root -   Accuracy:  0.979
[INFO ] root -   Precision: 0.982
[INFO ] root -   Recall:    0.982
[INFO ] root -   F1-Score:  0.982


  Precision: 0.982
  Recall:    0.982
  F1-Score:  0.982

✓ GradientBoostingClassifier evaluation complete


### 4.7 GradientBoostingClassifier Error Cases Analysis

Analyze False Positives and False Negatives for GradientBoostingClassifier to identify patterns for further improvement.


In [169]:
# Analyze error cases for GradientBoostingClassifier (reusing code structure from 3.2)

for edge_name in ['LR', 'LS']:
    print(f"\n{'='*80}")
    print(f"{edge_name}: GradientBoostingClassifier Error Cases Analysis")
    print(f"{'='*80}")
    
    left_df, right_df = source_tables[edge_name]
    val_df = splits[edge_name]['val'][['id1', 'id2', 'label']].copy()
    correspondences = gb_matching_results[edge_name]
    
    # Get true matches and false matches in validation set
    true_matches = val_df[val_df['label'].astype(str).str.strip().str.upper() == 'TRUE']
    false_matches = val_df[val_df['label'].astype(str).str.strip().str.upper() == 'FALSE']
    true_set = set(zip(true_matches['id1'], true_matches['id2']))
    false_set = set(zip(false_matches['id1'], false_matches['id2']))
    
    # Get all predicted matches; they are intersected with the validation set below
    pred_set = set(zip(correspondences['id1'], correspondences['id2']))
    val_set = set(zip(val_df['id1'], val_df['id2']))  # All pairs in validation set
    
    # False Negatives: True matches in validation set that were not predicted
    fn_pairs = true_set - pred_set
    print(f"\n[FALSE NEGATIVES] ({len(fn_pairs)} cases):")
    if len(fn_pairs) > 0:
        for id1, id2 in fn_pairs:
            left_rec = left_df[left_df['_rid'] == id1].iloc[0] if len(left_df[left_df['_rid'] == id1]) > 0 else None
            right_rec = right_df[right_df['_rid'] == id2].iloc[0] if len(right_df[right_df['_rid'] == id2]) > 0 else None
            
            if left_rec is not None and right_rec is not None:
                print(f"\n  {id1} <-> {id2}")
                print(f"    Left:  '{left_rec.get('full_name', 'N/A')}' | Season: {left_rec.get('season_year', 'N/A')} | Birth: {left_rec.get('birth_year', 'N/A')}")
                print(f"    Right: '{right_rec.get('full_name', 'N/A')}' | Season: {right_rec.get('season_year', 'N/A')} | Birth: {right_rec.get('birth_year', 'N/A')}")
                
                # Check if in candidates
                in_candidates = len(candidates[edge_name][
                    (candidates[edge_name]['id1'] == id1) & 
                    (candidates[edge_name]['id2'] == id2)
                ]) > 0
                print(f"    In candidates: {in_candidates}")
                
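                # NOTE: fn_pairs excludes every pair present in `correspondences`, so this lookup never
                # finds a row; a GB score for a missed pair would need raw scores for all candidates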
                score_row = correspondences[(correspondences['id1'] == id1) & (correspondences['id2'] == id2)]
                if len(score_row) > 0:
                    score = score_row['score'].iloc[0] if 'score' in score_row.columns else None
                    score_str = f"{score:.3f}" if score is not None else "N/A"
                    print(f"    GB Score: {score_str}")
    else:
        print("  No false negatives found!")
    
    # False Positives: Predicted matches that are in validation set but labeled as FALSE
    # Only analyze pairs that are in the validation set
    fp_pairs = (pred_set & val_set) & false_set
    print(f"\n[FALSE POSITIVES] ({len(fp_pairs)} cases):")
    if len(fp_pairs) > 0:
        for id1, id2 in fp_pairs:
            left_rec = left_df[left_df['_rid'] == id1].iloc[0] if len(left_df[left_df['_rid'] == id1]) > 0 else None
            right_rec = right_df[right_df['_rid'] == id2].iloc[0] if len(right_df[right_df['_rid'] == id2]) > 0 else None
            score_row = correspondences[(correspondences['id1'] == id1) & (correspondences['id2'] == id2)]
            score = score_row['score'].iloc[0] if len(score_row) > 0 and 'score' in score_row.columns else None
            
            if left_rec is not None and right_rec is not None:
                score_str = f"{score:.3f}" if score is not None else "N/A"
                print(f"\n  {id1} <-> {id2} (score: {score_str})")
                print(f"    Left:  '{left_rec.get('full_name', 'N/A')}' | Season: {left_rec.get('season_year', 'N/A')} | Birth: {left_rec.get('birth_year', 'N/A')}")
                print(f"    Right: '{right_rec.get('full_name', 'N/A')}' | Season: {right_rec.get('season_year', 'N/A')} | Birth: {right_rec.get('birth_year', 'N/A')}")
    else:
        print("  No false positives found!")

print("\n✓ GradientBoostingClassifier error analysis complete")



LR: GradientBoostingClassifier Error Cases Analysis

[FALSE NEGATIVES] (2 cases):

  5961292022|2022|L <-> 5961292022|2022|R
    Left:  'dan vogelbach' | Season: 2022 | Birth: 1992
    Right: 'daniel vogelbach' | Season: 2022 | Birth: 1993
    In candidates: False

  5715102017|2017|L <-> 5715102017|2017|R
    Left:  'matt boyd' | Season: 2017 | Birth: 1991
    Right: 'matthew boyd' | Season: 2017 | Birth: 1991
    In candidates: True

[FALSE POSITIVES] (0 cases):
  No false positives found!

LS: GradientBoostingClassifier Error Cases Analysis

[FALSE NEGATIVES] (1 cases):

  5470072016|2016|L <-> 5470072016|2016|S
    Left:  'robert whalen' | Season: 2016 | Birth: 1994
    Right: 'rob whalen' | Season: 2016 | Birth: 1994
    In candidates: True

[FALSE POSITIVES] (1 cases):

  5167142015|2015|L <-> 6458482015|2015|S (score: 1.000)
    Left:  'dario alvarez' | Season: 2015 | Birth: 1989
    Right: 'dariel alvarez' | Season: 2015 | Birth: 1989

✓ GradientBoostingClassifier error analys

### 4.8 Error Analysis Summary and Improvement Plan

Based on the error analysis, we identify the following patterns and propose targeted improvements:

#### **LR Edge Error Patterns:**

1. **Blocking Issues (1 case):**
   - `dan vogelbach` vs `daniel vogelbach` (season 2022; birth years: 1992 vs 1993) - Not in candidate pairs
   - **Root Cause:** The Enhanced TokenBlocker misses this name variant pair during the blocking phase
   - **Impact:** This true match never reaches the matching stage

2. **Matching Issues (1 case):**
   - `matt boyd` vs `matthew boyd` (season 2017) - In candidates but missed by both RandomForestClassifier and GradientBoostingClassifier
   - **Root Cause:** ML model not recognizing common name variants despite high name similarity
   - **Impact:** True match filtered out by classifier

3. **False Positives:** 0 cases (Perfect Precision)

#### **LS Edge Error Patterns:**

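1. **Matching Issues - False Negatives (1 case for GradientBoosting, 2 for RandomForest):**
   - `robert whalen` vs `rob whalen` - In candidates but missed by both classifiers
   - `nori aoki` vs `norichika aoki` - In candidates but missed by RandomForestClassifier only
   - **Root Cause:** Same as LR - name variant not recognized by ML model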

2. **Matching Issues - False Positives (3 cases, all score=1.000):**
   - `josh rojas` vs `jose rojas` (birth years: 1994 vs 1993)
   - `kevan smith` vs `kevin smith` (birth years: 1988 vs 1997)
   - `matt duffy` vs `matt duffy` (birth years: 1989 vs 1991)
   - **Root Cause:** Model overconfident on name similarity, insufficiently penalizing birth year differences
   - **Impact:** High-confidence false matches reduce precision

#### **Proposed Improvement Strategies:**

**Strategy 1: Enhance Feature Engineering for ML Models**
- **Add Birth Year Difference Feature:** Calculate `abs(birth_year_left - birth_year_right)` as an explicit numeric feature
- **Add Name Variant Indicator:** Create a binary feature indicating whether the names are known variants (using the `NAME_VARIANTS` dictionary)
- **Action:** The feature set already contains `DateComparator(birth_year)` (see the feature importances above), so add these as custom features alongside the existing `ml_comparators`
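
The two proposed features can be sketched in plain Python as follows. This is only an illustration: `TOY_NAME_VARIANTS` is a small stand-in whose contents are assumed (the notebook's real `NAME_VARIANTS` dictionary may be structured differently), and both function names are hypothetical. How such features would be registered with the PyDI feature extractor is not shown.

```python
import math

# Toy stand-in for the notebook's NAME_VARIANTS dictionary (contents assumed)
TOY_NAME_VARIANTS = {
    "dan": {"daniel"},
    "matt": {"matthew"},
    "rob": {"robert"},
}


def _is_missing(value):
    """True for None or a float NaN."""
    return value is None or (isinstance(value, float) and math.isnan(value))


def birth_year_diff(left_year, right_year):
    """Absolute birth-year gap, or NaN when either side is missing."""
    if _is_missing(left_year) or _is_missing(right_year):
        return math.nan
    return abs(int(left_year) - int(right_year))


def first_names_are_variants(left_name, right_name, variants=TOY_NAME_VARIANTS):
    """1 if the first tokens are equal or listed as variants of each other, else 0."""
    left_tokens, right_tokens = left_name.split(), right_name.split()
    if not left_tokens or not right_tokens:
        return 0
    a, b = left_tokens[0], right_tokens[0]
    if a == b:
        return 1
    return int(b in variants.get(a, set()) or a in variants.get(b, set()))


# Pairs taken from the error listings in this notebook
print(birth_year_diff(1988, 1997))                             # 9
print(birth_year_diff(1992, 1993))                             # 1
print(first_names_are_variants("matt boyd", "matthew boyd"))   # 1
print(first_names_are_variants("kevan smith", "kevin smith"))  # 0
```

An explicit year gap lets a tree split on "differs by more than N years", while the variant flag separates `matt`/`matthew` from look-alikes such as `kevan`/`kevin` that plain string similarity scores almost identically.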

**Strategy 2: Improve Blocking for Name Variants (LR Edge)**
- **Enhance TokenBlocker:** Investigate why `dan`/`daniel` variants are missed in blocking phase
- **Action:** Review `normalize_name_for_blocking` function or add name variant expansion to blocking keys
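
One way to picture name variant expansion in blocking keys is sketched below. `blocking_tokens` and `TOY_CANONICAL_FIRST_NAMES` are hypothetical; the actual `normalize_name_for_blocking` function and the blocker's key format are not shown in this section, so this only illustrates the expansion step and does not diagnose why the pair was missed.

```python
# Toy short-form -> canonical first-name map (contents assumed for illustration)
TOY_CANONICAL_FIRST_NAMES = {"dan": "daniel", "matt": "matthew", "rob": "robert"}


def blocking_tokens(full_name, canonical=TOY_CANONICAL_FIRST_NAMES):
    """Return the name's tokens plus the canonical form of its first token, if known."""
    tokens = full_name.lower().split()
    keys = set(tokens)
    if tokens and tokens[0] in canonical:
        keys.add(canonical[tokens[0]])
    return keys


left_keys = blocking_tokens("dan vogelbach")
right_keys = blocking_tokens("daniel vogelbach")
shared = left_keys & right_keys

print(sorted(left_keys))   # ['dan', 'daniel', 'vogelbach']
print(sorted(right_keys))  # ['daniel', 'vogelbach']
print(sorted(shared))      # ['daniel', 'vogelbach']
```

Note that the two names already share the surname token without any expansion, so the missing candidate pair may have a different cause that the proposed investigation would need to confirm.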

**Strategy 3: Hyperparameter Tuning**
- **Adjust GradientBoostingClassifier:** Fine-tune parameters to better balance name similarity vs. birth year constraints
- **Action:** Grid search or Bayesian optimization for optimal parameters
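
A grid search over GradientBoostingClassifier parameters could be set up with scikit-learn as in the sketch below. The synthetic data only stands in for the three similarity features, and the grid values are illustrative assumptions rather than tuned recommendations.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import GridSearchCV, StratifiedKFold

# Synthetic stand-in for the pairwise feature matrix (3 features, ~300 labeled pairs)
X, y = make_classification(
    n_samples=300, n_features=3, n_informative=3, n_redundant=0, random_state=42
)

# Illustrative grid; the real search space is a design choice
param_grid = {
    "n_estimators": [50, 100],
    "max_depth": [3, 5],
    "learning_rate": [0.05, 0.1],
}

search = GridSearchCV(
    estimator=GradientBoostingClassifier(random_state=42),
    param_grid=param_grid,
    scoring="f1",  # optimize the same metric reported in the evaluation cells
    cv=StratifiedKFold(n_splits=5, shuffle=True, random_state=42),
    n_jobs=1,
)
search.fit(X, y)

print(search.best_params_)
print(round(search.best_score_, 3))
```

With only about 300 training pairs per edge, each cross-validation fold is small, so differences between parameter settings should be read with caution.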

**Strategy 4: Ensemble Methods**
- **Combine Models:** Use voting or stacking to combine RandomForest and GradientBoosting predictions
- **Action:** Implement ensemble matcher that leverages strengths of both models
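
A soft-voting combination of the two models can be sketched with scikit-learn's `VotingClassifier`. The data here is synthetic, the 0.5 threshold is an assumption, and whether the fitted ensemble can be passed to `MLBasedMatcher.match` as `trained_classifier` has not been verified.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import (
    GradientBoostingClassifier,
    RandomForestClassifier,
    VotingClassifier,
)
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the pairwise feature matrix
X, y = make_classification(
    n_samples=300, n_features=3, n_informative=3, n_redundant=0, random_state=42
)
X_train, X_val, y_train, y_val = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=42
)

# Soft voting averages the predicted class probabilities of both models
ensemble = VotingClassifier(
    estimators=[
        ("rf", RandomForestClassifier(n_estimators=100, random_state=42)),
        ("gb", GradientBoostingClassifier(n_estimators=100, random_state=42)),
    ],
    voting="soft",
)
ensemble.fit(X_train, y_train)

match_proba = ensemble.predict_proba(X_val)[:, 1]  # probability of the "match" class
predicted = (match_proba >= 0.5).astype(int)       # assumed decision threshold

print(match_proba[:5].round(3))
print(int(predicted.sum()), "of", len(predicted), "pairs predicted as matches")
```

Averaged probabilities also give a graded score per pair, which could make high-confidence false positives easier to spot than a constant score of 1.000.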


### 4.11 Train XGBoostClassifier Matcher

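Train a gradient-boosted tree model (XGBoost) on the same feature set as a third tree ensemble for comparison with RandomForest and GradientBoosting. Unlike the scikit-learn GradientBoostingClassifier configured above, this model also uses row and column subsampling (`subsample`, `colsample_bytree`) and explicit L1/L2 regularization (`reg_alpha`, `reg_lambda`).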

In [170]:
# Train XGBoost-based matchers for each edge
from xgboost import XGBClassifier

xgb_classifiers = {}
xgb_matchers = {}

for edge_name in ['LR', 'LS']:
    print(f"=== {edge_name}: Training XGBoostClassifier ===")
    
    left_df, right_df = source_tables[edge_name]
    train_df = splits[edge_name]['train'][['id1', 'id2', 'label']].copy()
    train_df['label'] = train_df['label'].astype(str).str.strip().str.upper()
    train_df['label_binary'] = (train_df['label'] == 'TRUE').astype(int)
    
    print(f"  Training pairs: {len(train_df)}")
    print(f"  True matches: {train_df['label_binary'].sum()}")
    print(f"  False matches: {len(train_df) - train_df['label_binary'].sum()}")
    
    train_features = ml_feature_extractor.create_features(
        df_left=left_df,
        df_right=right_df,
        pairs=train_df[['id1', 'id2']],
        labels=train_df['label_binary'],
        id_column='_rid'
    )
    
    feature_columns = [col for col in train_features.columns if col not in ['id1', 'id2', 'label']]
    X_train = train_features[feature_columns]
    y_train = train_features['label']
    
    clf = XGBClassifier(
        n_estimators=400,
        max_depth=6,
        learning_rate=0.1,
        subsample=0.9,
        colsample_bytree=0.9,
        reg_lambda=1.0,
        reg_alpha=0.0,
        objective='binary:logistic',
        eval_metric='logloss',
        n_jobs=-1,
        random_state=42
    )
    clf.fit(X_train, y_train)
    
    xgb_classifiers[edge_name] = clf
    xgb_matchers[edge_name] = MLBasedMatcher(ml_feature_extractor)
    
    importance = sorted(
        zip(feature_columns, clf.feature_importances_),
        key=lambda x: x[1],
        reverse=True
    )
    print("Top feature importances:")
    for feat, val in importance:
        print(f" {feat}: {val:.4f}")
    print(f" XGBoostClassifier trained for {edge_name}")

print("XGBoostClassifier training complete for all edges")

=== LR: Training XGBoostClassifier ===
  Training pairs: 302
  True matches: 217
  False matches: 85


[INFO ] root - Label distribution: 217 positive, 85 negative
[INFO ] root - Label distribution: 164 positive, 135 negative


Top feature importances:
 StringComparator(full_name_normalized, jaccard, tokenization=word, list_strategy=None): 0.4100
 StringComparator(full_name_normalized, levenshtein, tokenization=char, list_strategy=None): 0.3639
 DateComparator(birth_year, list_strategy=None): 0.2260
 XGBoostClassifier trained for LR
=== LS: Training XGBoostClassifier ===
  Training pairs: 299
  True matches: 164
  False matches: 135
Top feature importances:
 StringComparator(full_name_normalized, jaccard, tokenization=word, list_strategy=None): 0.6271
 StringComparator(full_name_normalized, levenshtein, tokenization=char, list_strategy=None): 0.3093
 DateComparator(birth_year, list_strategy=None): 0.0636
 XGBoostClassifier trained for LS
XGBoostClassifier training complete for all edges
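
The importances above rank word-level Jaccard on `full_name_normalized` highest for both edges, with character-level Levenshtein second. The sketch below assumes the textbook definitions of both measures (PyDI's comparators may tokenize or normalize differently) and applies them to two name pairs taken from the validation error cases reported later in this notebook. The helper names `word_jaccard`, `levenshtein`, and `levenshtein_similarity` are illustrative and not part of PyDI.

```python
def word_jaccard(a: str, b: str) -> float:
    """Jaccard similarity over whitespace-separated tokens."""
    tokens_a, tokens_b = set(a.split()), set(b.split())
    if not tokens_a and not tokens_b:
        return 1.0
    return len(tokens_a & tokens_b) / len(tokens_a | tokens_b)


def levenshtein(a: str, b: str) -> int:
    """Edit distance (insert, delete, substitute) via row-by-row dynamic programming."""
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            cost = 0 if char_a == char_b else 1
            current.append(min(
                previous[j] + 1,         # delete char_a
                current[j - 1] + 1,      # insert char_b
                previous[j - 1] + cost,  # substitute (or keep if equal)
            ))
        previous = current
    return previous[-1]


def levenshtein_similarity(a: str, b: str) -> float:
    """Edit distance scaled to [0, 1] by the longer string's length (assumed normalization)."""
    longest = max(len(a), len(b))
    return 1.0 if longest == 0 else 1.0 - levenshtein(a, b) / longest


for left, right in [("matt boyd", "matthew boyd"), ("alfonso rivas", "alfonso rivas iii")]:
    print(f"{left!r} vs {right!r}: "
          f"jaccard={word_jaccard(left, right):.3f}, "
          f"levenshtein_sim={levenshtein_similarity(left, right):.3f}")
```

Under these definitions a shortened first name costs the word-level measure a whole token, while the character-level measure changes much less. That is one plausible reason nickname pairs are hard for a model that leans on Jaccard.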


### 4.12 Apply XGBoostClassifier Matcher to Candidate Pairs

Keep only candidate pairs whose two records share the same `season_year` (the constraint applied by the earlier matchers), then score the remaining pairs with the trained XGBoost models.
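
As a minimal, pandas-free sketch of the filter logic: a candidate pair is kept only when both records have a known season and the two seasons are equal. The record IDs and the `keep_same_season` helper are made up for illustration; the notebook cell itself achieves the same effect with two `merge` calls and a boolean mask.

```python
from typing import Dict, Iterable, List, Optional, Tuple

Pair = Tuple[str, str]


def keep_same_season(
    pairs: Iterable[Pair],
    left_season: Dict[str, Optional[int]],
    right_season: Dict[str, Optional[int]],
) -> List[Pair]:
    """Keep pairs whose left and right records share a known season_year."""
    kept = []
    for id1, id2 in pairs:
        season_left = left_season.get(id1)    # None if the ID or its value is missing
        season_right = right_season.get(id2)
        if season_left is None or season_right is None:
            continue  # plays the role of the notna() checks
        if season_left == season_right:
            kept.append((id1, id2))
    return kept


# Toy data with hypothetical IDs
left_season = {"L1": 2017, "L2": 2022, "L3": None}
right_season = {"R1": 2017, "R2": 2021, "R3": 2022}
candidate_pairs = [("L1", "R1"), ("L2", "R2"), ("L3", "R3"), ("L2", "R3")]

filtered = keep_same_season(candidate_pairs, left_season, right_season)
print(filtered)  # [('L1', 'R1'), ('L2', 'R3')]
```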

In [171]:
# Apply XGBoostClassifier-based matcher to candidate pairs
xgb_matching_results = {}

for edge_name in ['LR', 'LS']:
    print(f"=== {edge_name}: XGBoostClassifier Matching ===")
    
    left_df, right_df = source_tables[edge_name]
    cand_df = candidates[edge_name].copy()
    matcher = xgb_matchers[edge_name]
    clf = xgb_classifiers[edge_name]
    
    print("  Filtering candidate pairs by season_year constraint...")
    cand_with_seasons = cand_df.merge(
        left_df[['_rid', 'season_year']],
        left_on='id1',
        right_on='_rid',
        how='left'
    ).merge(
        right_df[['_rid', 'season_year']],
        left_on='id2',
        right_on='_rid',
        how='left',
        suffixes=('', '_right')
    )
    
    cand_df_filtered = cand_with_seasons[
        (cand_with_seasons['season_year'].notna()) &
        (cand_with_seasons['season_year_right'].notna()) &
        (cand_with_seasons['season_year'] == cand_with_seasons['season_year_right'])
    ][['id1', 'id2']].copy()
    
    print(f"  After season_year filter: {len(cand_df_filtered):,} candidate pairs (from {len(cand_df):,})")
    correspondences = matcher.match(
        df_left=left_df,
        df_right=right_df,
        candidates=cand_df_filtered,
        id_column='_rid',
        trained_classifier=clf
    )
    
    xgb_matching_results[edge_name] = correspondences
    
    print(f"  Generated {len(correspondences):,} matched pairs")
    if 'score' in correspondences.columns:
        print(f"  Score range: [{correspondences['score'].min():.3f}, {correspondences['score'].max():.3f}]")
        print(f"  Mean score: {correspondences['score'].mean():.3f}")

print("XGBoostClassifier matching complete for all edges")

=== LR: XGBoostClassifier Matching ===
  Filtering candidate pairs by season_year constraint...


[INFO ] PyDI.entitymatching.ml_based.MLBasedMatcher - Starting Entity Matching
[INFO ] PyDI.entitymatching.ml_based.MLBasedMatcher - Blocking 106553 x 15215 elements
[INFO ] PyDI.entitymatching.ml_based.MLBasedMatcher - Matching 106553 x 15215 elements after 0:00:0.000; 130895 blocked pairs (reduction ratio: 0.9999192606183567)


  After season_year filter: 130,895 candidate pairs (from 5,733,797)


[INFO ] PyDI.entitymatching.ml_based.MLBasedMatcher - Entity Matching finished after 0:00:49.858; found 15334 correspondences.


  Generated 15,334 matched pairs
  Score range: [1.000, 1.000]
  Mean score: 1.000
=== LS: XGBoostClassifier Matching ===
  Filtering candidate pairs by season_year constraint...


[INFO ] PyDI.entitymatching.ml_based.MLBasedMatcher - Starting Entity Matching
[INFO ] PyDI.entitymatching.ml_based.MLBasedMatcher - Blocking 106553 x 6743 elements
[INFO ] PyDI.entitymatching.ml_based.MLBasedMatcher - Matching 106553 x 6743 elements after 0:00:0.000; 9526 blocked pairs (reduction ratio: 0.9999867415811222)


  After season_year filter: 9,526 candidate pairs (from 215,708)


[INFO ] PyDI.entitymatching.ml_based.MLBasedMatcher - Entity Matching finished after 0:00:3.287; found 6789 correspondences.


  Generated 6,789 matched pairs
  Score range: [1.000, 1.000]
  Mean score: 1.000
XGBoostClassifier matching complete for all edges
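
Both edges report a score range of [1.000, 1.000]. This may mean the matcher records the predicted class rather than a probability, although that is an inference from the output and not something the logs confirm. If graded scores are wanted for thresholding or ranking, one option is to read the positive-class column of `predict_proba`, which `XGBClassifier` provides through its scikit-learn interface. The sketch below uses a hypothetical `ToyClassifier` stand-in so it runs without a trained model; `positive_class_scores` and `threshold` are illustrative names.

```python
from typing import List, Sequence


class ToyClassifier:
    """Hypothetical stand-in exposing a scikit-learn style predict_proba."""

    def predict_proba(self, rows: Sequence[Sequence[float]]) -> List[List[float]]:
        # Toy rule: P(match) is the mean of the similarity features in the row.
        probabilities = []
        for row in rows:
            p_match = sum(row) / len(row)
            probabilities.append([1.0 - p_match, p_match])
        return probabilities


def positive_class_scores(classifier, rows) -> List[float]:
    """Return P(match) per row; column 1 is the positive class for 0/1 labels."""
    return [float(p[1]) for p in classifier.predict_proba(rows)]


feature_rows = [[1.0, 1.0], [0.5, 0.75], [0.0, 0.25]]
scores = positive_class_scores(ToyClassifier(), feature_rows)

threshold = 0.5  # assumed cut-off; it would need tuning on the validation split
matches = [score >= threshold for score in scores]

print(scores)   # [1.0, 0.625, 0.125]
print(matches)  # [True, True, False]
```

With a fitted model, `classifier` would be `xgb_classifiers[edge_name]` and `rows` the feature matrix built for the filtered candidates, with columns in the same order as `feature_columns` at training time.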


### 4.13 Evaluate XGBoostClassifier Matching

Evaluate on the validation split with PyDI's `EntityMatchingEvaluator`; if the evaluator raises an exception, fall back to manually computed precision, recall, and F1.

In [172]:
# Evaluate XGBoostClassifier matching on validation set
xgb_matching_metrics_val = {}

for edge_name in ['LR', 'LS']:
    print(f"{edge_name}: XGBoostClassifier Evaluation (Validation Set) ===")
    
    correspondences = xgb_matching_results[edge_name]
    val_df = splits[edge_name]['val'][['id1', 'id2', 'label']].copy()
    
    correspondences_for_eval = correspondences.copy()
    if 'score' not in correspondences_for_eval.columns and 'sim' in correspondences_for_eval.columns:
        correspondences_for_eval = correspondences_for_eval.rename(columns={'sim': 'score'})
    
    try:
        eval_results = EntityMatchingEvaluator.evaluate_matching(
            correspondences=correspondences_for_eval,
            test_pairs=val_df,
            out_dir=OUTPUT_DIR / 'matching-evaluation-xgb',
            matcher_instance=xgb_matchers[edge_name]
        )
        xgb_matching_metrics_val[edge_name] = eval_results
        
        print(f"  Precision: {eval_results.get('precision', 0.0):.3f}")
        print(f"  Recall:    {eval_results.get('recall', 0.0):.3f}")
        print(f"  F1-Score:  {eval_results.get('f1', 0.0):.3f}")
        print(f"  TP: {eval_results.get('true_positives', 0)}")
        print(f"  FP: {eval_results.get('false_positives', 0)}")
        print(f"  FN: {eval_results.get('false_negatives', 0)}")
    except Exception as exc:
        print(f"  PyDI evaluator failed: {exc}")
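        # Manual fallback: count errors only among labelled validation pairs.
        # Predictions outside the validation set are unlabelled and must not count as FP.
        labels_upper = val_df['label'].astype(str).str.strip().str.upper()
        true_matches = val_df[labels_upper == 'TRUE']
        false_matches = val_df[labels_upper == 'FALSE']
        true_set = set(zip(true_matches['id1'], true_matches['id2']))
        false_set = set(zip(false_matches['id1'], false_matches['id2']))
        pred_set = set(zip(correspondences['id1'], correspondences['id2']))
        tp = len(true_set & pred_set)
        fp = len(pred_set & false_set)
        fn = len(true_set - pred_set)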
        precision = tp / (tp + fp) if (tp + fp) > 0 else 0.0
        recall = tp / (tp + fn) if (tp + fn) > 0 else 0.0
        f1 = 2 * precision * recall / (precision + recall) if (precision + recall) > 0 else 0.0
        xgb_matching_metrics_val[edge_name] = {
            'precision': precision,
            'recall': recall,
            'f1': f1,
            'true_positives': tp,
            'false_positives': fp,
            'false_negatives': fn
        }
        print(f"  Precision: {precision:.3f}")
        print(f"  Recall:    {recall:.3f}")
        print(f"  F1-Score:  {f1:.3f}")
        print(f"  TP: {tp}")
        print(f"  FP: {fp}")
        print(f"  FN: {fn}")

print("XGBoostClassifier evaluation complete")

LR: XGBoostClassifier Evaluation (Validation Set) ===


[INFO ] root - Confusion Matrix:
[INFO ] root -   True Positives:  69
[INFO ] root -   True Negatives:  27
[INFO ] root -   False Positives: 0
[INFO ] root -   False Negatives: 3
[INFO ] root - Performance Metrics:
[INFO ] root -   Accuracy:  0.970
[INFO ] root -   Precision: 1.000
[INFO ] root -   Recall:    0.958
[INFO ] root -   F1-Score:  0.979


  Precision: 1.000
  Recall:    0.958
  F1-Score:  0.979
  TP: 69
  FP: 0
  FN: 3
LS: XGBoostClassifier Evaluation (Validation Set) ===


[INFO ] root - Confusion Matrix:
[INFO ] root -   True Positives:  54
[INFO ] root -   True Negatives:  39
[INFO ] root -   False Positives: 2
[INFO ] root -   False Negatives: 1
[INFO ] root - Performance Metrics:
[INFO ] root -   Accuracy:  0.969
[INFO ] root -   Precision: 0.964
[INFO ] root -   Recall:    0.982
[INFO ] root -   F1-Score:  0.973


  Precision: 0.964
  Recall:    0.982
  F1-Score:  0.973
  TP: 54
  FP: 2
  FN: 1
XGBoostClassifier evaluation complete
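
The precision, recall, and F1 values above follow directly from the logged confusion counts. As a quick check, the sketch below recomputes them from the TP, FP, and FN values printed for each edge; `prf1` is an illustrative helper, not a PyDI function.

```python
from typing import Tuple


def prf1(tp: int, fp: int, fn: int) -> Tuple[float, float, float]:
    """Precision, recall, and F1 from confusion counts, with zero-division guards."""
    precision = tp / (tp + fp) if (tp + fp) > 0 else 0.0
    recall = tp / (tp + fn) if (tp + fn) > 0 else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if (precision + recall) > 0 else 0.0)
    return precision, recall, f1


# (TP, FP, FN) as printed in the validation output for each edge
reported_counts = {"LR": (69, 0, 3), "LS": (54, 2, 1)}

for edge_name, (tp, fp, fn) in reported_counts.items():
    precision, recall, f1 = prf1(tp, fp, fn)
    print(f"{edge_name}: precision={precision:.3f} recall={recall:.3f} f1={f1:.3f}")
# LR: precision=1.000 recall=0.958 f1=0.979
# LS: precision=0.964 recall=0.982 f1=0.973
```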


### 4.13.1 XGBoostClassifier Error Cases Analysis

Reuse the earlier diagnostic template to inspect false negatives and false positives for the XGBoost model.
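
Because the matcher scores every filtered candidate while only a few hundred pairs are labelled, errors have to be counted against the labelled pairs only. The sketch below shows that partition with plain sets and made-up pair IDs; `partition_errors` is a hypothetical helper.

```python
from typing import Iterable, Set, Tuple

Pair = Tuple[str, str]


def partition_errors(
    predicted: Iterable[Pair],
    labelled_true: Iterable[Pair],
    labelled_false: Iterable[Pair],
) -> Tuple[Set[Pair], Set[Pair], Set[Pair]]:
    """Split predictions into false negatives, false positives, and unlabelled pairs."""
    predicted_set = set(predicted)
    true_set = set(labelled_true)
    false_set = set(labelled_false)

    false_negatives = true_set - predicted_set        # labelled TRUE, not predicted
    false_positives = predicted_set & false_set       # predicted, labelled FALSE
    unlabelled = predicted_set - true_set - false_set  # predicted, no ground truth
    return false_negatives, false_positives, unlabelled


# Toy data with hypothetical IDs
predicted = [("a1", "b1"), ("a2", "b9"), ("a7", "b7")]
labelled_true = [("a1", "b1"), ("a3", "b3")]
labelled_false = [("a2", "b9")]

fn_pairs, fp_pairs, unlabelled_pairs = partition_errors(predicted, labelled_true, labelled_false)
print("FN:", fn_pairs)                  # {('a3', 'b3')}
print("FP:", fp_pairs)                  # {('a2', 'b9')}
print("Unlabelled:", unlabelled_pairs)  # {('a7', 'b7')}
```

Treating the unlabelled predictions as false positives would inflate the FP count, so they are kept in a separate bucket here.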

In [174]:
# Analyze error cases for XGBoostClassifier
for edge_name in ['LR', 'LS']:
    print(f"{'='*80}")
    print(f"{edge_name}: XGBoostClassifier Error Cases Analysis")
    print(f"{'='*80}")
    
    left_df, right_df = source_tables[edge_name]
    val_df = splits[edge_name]['val'][['id1', 'id2', 'label']].copy()
    correspondences = xgb_matching_results[edge_name]
    
    true_matches = val_df[val_df['label'].astype(str).str.strip().str.upper() == 'TRUE']
    false_matches = val_df[val_df['label'].astype(str).str.strip().str.upper() == 'FALSE']
    true_set = set(zip(true_matches['id1'], true_matches['id2']))
    false_set = set(zip(false_matches['id1'], false_matches['id2']))
    pred_set = set(zip(correspondences['id1'], correspondences['id2']))
    val_set = set(zip(val_df['id1'], val_df['id2']))
    
    fn_pairs = true_set - pred_set
    print(f"[FALSE NEGATIVES] ({len(fn_pairs)} cases):")
    if fn_pairs:
        for id1, id2 in fn_pairs:
            left_rows = left_df[left_df['_rid'] == id1]
            right_rows = right_df[right_df['_rid'] == id2]
            left_rec = left_rows.iloc[0] if len(left_rows) > 0 else None
            right_rec = right_rows.iloc[0] if len(right_rows) > 0 else None
            if left_rec is not None and right_rec is not None:
                print(f"{id1} <-> {id2}")
                print(f"Left:  '{left_rec.get('full_name', 'N/A')}' | Season: {left_rec.get('season_year', 'N/A')} | Birth: {left_rec.get('birth_year', 'N/A')}")
                print(f"Right: '{right_rec.get('full_name', 'N/A')}' | Season: {right_rec.get('season_year', 'N/A')} | Birth: {right_rec.get('birth_year', 'N/A')}")
                in_candidates = len(candidates[edge_name][(candidates[edge_name]['id1'] == id1) & (candidates[edge_name]['id2'] == id2)]) > 0
                print(f"In candidates: {in_candidates}")
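                # FN pairs are absent from `correspondences` by construction, so this
                # lookup is normally empty and no score is printed for them.
                score_row = correspondences[(correspondences['id1'] == id1) & (correspondences['id2'] == id2)]
                if len(score_row) > 0 and 'score' in score_row.columns:
                    print(f"    XGBoost Score: {score_row['score'].iloc[0]:.3f}")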
    else:
        print("  No false negatives found!")
    
    fp_pairs = (pred_set & val_set) & false_set
    print(f"[FALSE POSITIVES] ({len(fp_pairs)} cases):")
    if fp_pairs:
        for id1, id2 in fp_pairs:
            left_rows = left_df[left_df['_rid'] == id1]
            right_rows = right_df[right_df['_rid'] == id2]
            left_rec = left_rows.iloc[0] if len(left_rows) > 0 else None
            right_rec = right_rows.iloc[0] if len(right_rows) > 0 else None
            score_row = correspondences[(correspondences['id1'] == id1) & (correspondences['id2'] == id2)]
            score = score_row['score'].iloc[0] if len(score_row) > 0 and 'score' in score_row.columns else None
            if left_rec is not None and right_rec is not None:
                score_str = f"{score:.3f}" if score is not None else 'N/A'
                print(f" {id1} <-> {id2} (score: {score_str})")
                print(f" Left:  '{left_rec.get('full_name', 'N/A')}' | Season: {left_rec.get('season_year', 'N/A')} | Birth: {left_rec.get('birth_year', 'N/A')}")
                print(f" Right: '{right_rec.get('full_name', 'N/A')}' | Season: {right_rec.get('season_year', 'N/A')} | Birth: {right_rec.get('birth_year', 'N/A')}")
    else:
        print("  No false positives found!")

print("XGBoostClassifier error analysis complete")

LR: XGBoostClassifier Error Cases Analysis
[FALSE NEGATIVES] (3 cases):
5961292022|2022|L <-> 5961292022|2022|R
Left:  'dan vogelbach' | Season: 2022 | Birth: 1992
Right: 'daniel vogelbach' | Season: 2022 | Birth: 1993
In candidates: False
5715102017|2017|L <-> 5715102017|2017|R
Left:  'matt boyd' | Season: 2017 | Birth: 1991
Right: 'matthew boyd' | Season: 2017 | Birth: 1991
In candidates: True
6638452022|2022|L <-> 6638452022|2022|R
Left:  'alfonso rivas' | Season: 2022 | Birth: 1996
Right: 'alfonso rivas iii' | Season: 2022 | Birth: 1997
In candidates: True
[FALSE POSITIVES] (0 cases):
  No false positives found!
LS: XGBoostClassifier Error Cases Analysis
[FALSE NEGATIVES] (1 cases):
4931142016|2016|L <-> 4931142016|2016|S
Left:  'nori aoki' | Season: 2016 | Birth: 1982
Right: 'norichika aoki' | Season: 2016 | Birth: 1982
In candidates: True
[FALSE POSITIVES] (2 cases):
 6703512022|2022|L <-> 6689422022|2022|S (score: 1.000)
 Left:  'jose rojas' | Season: 2022 | Birth: 1993
 Right: 