# Identity Resolution: Matching Phase

This notebook implements the matching phase of identity resolution, using candidate pairs generated by the hybrid blocking strategy from the blocking phase.

## Workflow Overview

1. **Setup**: Import libraries, configure paths, and initialize PyDI components
2. **Load Data**: Load source tables, candidate pairs, and ground truth splits
3. **Name Normalization**: Apply consistent name normalization across all matching methods
4. **Common Matching Infrastructure**: Define reusable matching framework and evaluation functions
5. **Rule-Based Matching**: Implement and evaluate rule-based matching with optimized variants
6. **ML-Based Matching**: Train and evaluate multiple classifiers (LogisticRegression, RandomForest, GradientBoosting, XGBoost)
7. **Post-Processing**: Apply global matching (one-to-one constraint) and cluster consistency analysis
8. **Final Test Set Evaluation**: Evaluate all models on test set for final performance reporting (Section 6.4)
9. **Export Results**: Save final correspondences from **GradientBoosting** model (after global matching) for data fusion phase

## 0. Setup


### 0.1 Import Libraries and Setup Logging


In [90]:
import logging
import os
from pathlib import Path
import pandas as pd
import sys

# Setup logging
os.makedirs('logs', exist_ok=True)
logging.basicConfig(
    level=logging.INFO,
    format='[%(levelname)-5s] %(name)s - %(message)s',
    handlers=[
        logging.FileHandler('logs/matching.log'),
        logging.StreamHandler()
    ],
    force=True
)
logging.getLogger().info('Matching phase logging enabled')

print("✓ Logging setup complete")


[INFO ] root - Matching phase logging enabled


✓ Logging setup complete


### 0.2 Define Paths


In [91]:
# Project base directory
BASE_DIR = Path('/Users/zhangzihan/Desktop/WBI_project/Schema_Mapped_Datasets')

# Input: Candidate pairs from blocking phase
CANDIDATES_DIR = BASE_DIR / 'data' / 'output' / 'workflow'
CANDIDATES_LR = CANDIDATES_DIR / 'candidates_hybrid_LR.csv'
CANDIDATES_LS = CANDIDATES_DIR / 'candidates_hybrid_LS.csv'

# Input: Source data tables
CLEAN_DIR = BASE_DIR / 'data' / 'output' / 'clean'
LAHMAN_PATH = (CLEAN_DIR / 'Lahman_Mapped_dedup.xml'
               if (CLEAN_DIR / 'Lahman_Mapped_dedup.xml').exists()
               else BASE_DIR / 'Lahman_Mapped.xml')
REFERENCE_PATH = (CLEAN_DIR / 'Reference_Mapped_dedup.xml'
                  if (CLEAN_DIR / 'Reference_Mapped_dedup.xml').exists()
                  else BASE_DIR / 'Reference_Mapped.xml')
SAVANT_PATH = (CLEAN_DIR / 'Savant_Mapped_dedup.xml'
               if (CLEAN_DIR / 'Savant_Mapped_dedup.xml').exists()
               else BASE_DIR / 'Savant_Mapped.xml')

# Input: Ground truth splits for evaluation
SPLITS_DIR = BASE_DIR / 'data' / 'output' / 'gt' / 'splits'
LR_TRAIN = SPLITS_DIR / 'gt_LR_train.csv'
LR_VAL = SPLITS_DIR / 'gt_LR_val.csv'
LR_TEST = SPLITS_DIR / 'gt_LR_test.csv'
LS_TRAIN = SPLITS_DIR / 'gt_LS_train.csv'
LS_VAL = SPLITS_DIR / 'gt_LS_val.csv'
LS_TEST = SPLITS_DIR / 'gt_LS_test.csv'

# Output: Matching results
OUTPUT_DIR = BASE_DIR / 'data' / 'output' / 'matching'
OUTPUT_DIR.mkdir(parents=True, exist_ok=True)

print("Paths configured:")
print(f"  Candidates LR: {CANDIDATES_LR.exists()}")
print(f"  Candidates LS: {CANDIDATES_LS.exists()}")
print(f"  Output dir: {OUTPUT_DIR}")


Paths configured:
  Candidates LR: True
  Candidates LS: True
  Output dir: /Users/zhangzihan/Desktop/WBI_project/Schema_Mapped_Datasets/data/output/matching


### 0.3 Import PyDI


In [92]:
import sys
import subprocess
import importlib
import PyDI  # noqa: F401

from PyDI.io import load_xml
from PyDI.entitymatching import RuleBasedMatcher, GreedyOneToOneMatchingAlgorithm, MLBasedMatcher, FeatureExtractor
from PyDI.entitymatching.comparators import StringComparator, DateComparator
from PyDI.entitymatching.evaluation import EntityMatchingEvaluator

# ML libraries
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier, GradientBoostingClassifier
from xgboost import XGBClassifier

print("✓ PyDI imported successfully")
print("✓ ML libraries imported successfully")


✓ PyDI imported successfully
✓ ML libraries imported successfully


## 1. Load Data

### 1.1 Load Source Data Tables


In [93]:
# Load source data tables
print("Loading source data tables...")
L_full = load_xml(LAHMAN_PATH).convert_dtypes().reset_index(drop=True)
R_full = load_xml(REFERENCE_PATH).convert_dtypes().reset_index(drop=True)
S_full = load_xml(SAVANT_PATH).convert_dtypes().reset_index(drop=True)

# Create _rid column for matching (same format as blocking phase)
for df, tag in [(L_full, 'L'), (R_full, 'R'), (S_full, 'S')]:
    if {'player_id', 'season_year'} <= set(df.columns):
        pid = df['player_id'].astype('string').fillna('NA')
        season = df['season_year'].astype('Int64').astype('string').fillna('NA')
        df['_rid'] = pid + '|' + season + f'|{tag}'
    else:
        df['_rid'] = df.index.map(lambda i: f"{tag}{i:06d}")

print(f"  L_full: {len(L_full):,} records")
print(f"  R_full: {len(R_full):,} records")
print(f"  S_full: {len(S_full):,} records")
print("✓ Source tables loaded")


Loading source data tables...
  L_full: 106,553 records
  R_full: 15,215 records
  S_full: 6,743 records
✓ Source tables loaded
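
The `_rid` key built above joins the player ID, the season year, and a source tag with `|`, using `NA` for missing values. A minimal plain-Python sketch of that format follows; `make_rid` and the sample IDs are hypothetical and not part of the notebook.

```python
def make_rid(player_id, season_year, tag):
    """Build a record ID of the form '<player_id>|<season_year>|<tag>'.

    Missing values (None) are rendered as 'NA', mirroring the fillna('NA')
    calls in the loading cell. `make_rid` is a hypothetical helper.
    """
    pid = "NA" if player_id is None else str(player_id)
    # Cast to int first so a float year such as 1957.0 is written as '1957'
    season = "NA" if season_year is None else str(int(season_year))
    return f"{pid}|{season}|{tag}"


example_rids = [
    make_rid("player01", 1957, "L"),
    make_rid("player01", 1957.0, "R"),
    make_rid(None, None, "S"),
]
print(example_rids)
```

Because the season is part of the key, the same player appears under a different `_rid` for each season, which is why candidate pairs are later filtered on `season_year`.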


### 1.2 Load Candidate Pairs from Hybrid Blocking


In [94]:
# Load candidate pairs generated by hybrid blocking strategy
print("Loading candidate pairs from hybrid blocking...")
candidates_lr = pd.read_csv(CANDIDATES_LR)
candidates_ls = pd.read_csv(CANDIDATES_LS)

print(f"  LR candidates: {len(candidates_lr):,} pairs")
print(f"  LS candidates: {len(candidates_ls):,} pairs")
print(f"  Total candidates: {len(candidates_lr) + len(candidates_ls):,} pairs")
print("✓ Candidate pairs loaded")


Loading candidate pairs from hybrid blocking...
  LR candidates: 5,994,373 pairs
  LS candidates: 227,185 pairs
  Total candidates: 6,221,558 pairs
✓ Candidate pairs loaded


In [95]:
# Override local normalization with shared implementation
from name_utils import normalize_name_for_blocking

print("Using shared name_utils.normalize_name_for_blocking for all matching methods.")


Using shared name_utils.normalize_name_for_blocking for all matching methods.


### 1.3 Load Ground Truth Splits for Evaluation


In [96]:
# Load ground truth splits
print("Loading ground truth splits...")
lr_train_df = pd.read_csv(LR_TRAIN)
lr_val_df = pd.read_csv(LR_VAL)
lr_test_df = pd.read_csv(LR_TEST)

ls_train_df = pd.read_csv(LS_TRAIN)
ls_val_df = pd.read_csv(LS_VAL)
ls_test_df = pd.read_csv(LS_TEST)

# Organize splits by edge
splits = {
    'LR': {'train': lr_train_df, 'val': lr_val_df, 'test': lr_test_df},
    'LS': {'train': ls_train_df, 'val': ls_val_df, 'test': ls_test_df}
}

# Organize source tables by edge
source_tables = {
    'LR': (L_full, R_full),
    'LS': (L_full, S_full)
}

# Organize candidate pairs by edge
candidates = {
    'LR': candidates_lr,
    'LS': candidates_ls
}

print("✓ Ground truth splits loaded")
print(f"  LR: train={len(lr_train_df)}, val={len(lr_val_df)}, test={len(lr_test_df)}")
print(f"  LS: train={len(ls_train_df)}, val={len(ls_val_df)}, test={len(ls_test_df)}")


Loading ground truth splits...
✓ Ground truth splits loaded
  LR: train=304, val=96, test=100
  LS: train=296, val=101, test=103


## 2. Name Normalization (Reusable)

### 2.1 Apply Name Normalization


In [97]:
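# Local copy of the normalization function. Defining it here rebinds the name
# imported from name_utils in the earlier cell, so this version is the one applied below.
# Normalization is only applied to tables that lack a 'full_name_normalized' column.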
import re
import unicodedata

def normalize_name_for_blocking(text: str) -> str:
    r"""Normalize name for consistent matching (same as workflow notebook)"""
    if not isinstance(text, str):
        return ''
    
    # Decode literal backslash-x-hex patterns
    def decode_literal_hex_sequence(match):
        hex_bytes = []
        for i in range(1, len(match.groups()) + 1):
            hex_str = match.group(i)
            try:
                hex_bytes.append(int(hex_str, 16))
            except ValueError:
                return match.group(0)
        try:
            decoded = bytes(hex_bytes).decode('utf-8')
            return decoded
        except (UnicodeDecodeError, ValueError):
            return match.group(0)
    
    text = re.sub(r'\\x([0-9a-fA-F]{2})\\x([0-9a-fA-F]{2})', decode_literal_hex_sequence, text)
    text = re.sub(r'\\x([0-9a-fA-F]{2})\\x([0-9a-fA-F]{2})\\x([0-9a-fA-F]{2})', decode_literal_hex_sequence, text)
    
    def decode_single_hex(match):
        hex_str = match.group(1)
        try:
            return chr(int(hex_str, 16))
        except (ValueError, OverflowError):
            return match.group(0)
    text = re.sub(r'\\x([0-9a-fA-F]{2})', decode_single_hex, text)
    
    # Unicode normalization
    text = unicodedata.normalize('NFD', text)
    text = ''.join(c for c in text if unicodedata.category(c) != 'Mn')
    
    # Lowercase and strip
    text = text.lower().strip()
    
    # Handle backslash escapes
    text = text.replace('\\ ', ' ').replace('\\', ' ')
    
    # Standardize punctuation
    text = text.replace('.', '').replace(',', '').replace('-', ' ').replace("'", '')
    
    # Normalize spaces
    text = re.sub(r'\s+', ' ', text).strip()
    
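    # Remove common suffixes
    # Caveat: str.replace is not anchored to the end of the name, so ' v' or ' iv'
    # is also removed inside a name (e.g. before a surname starting with 'v').
    # Kept as-is to stay consistent with the normalization used during blocking.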
    for suffix in [' jr', ' sr', ' ii', ' iii', ' iv', ' v']:
        text = text.replace(suffix, '')
    text = text.strip()
    
    return text

# Apply normalization if needed
for name, df in [('L_full', L_full), ('R_full', R_full), ('S_full', S_full)]:
    if 'full_name_normalized' not in df.columns and 'full_name' in df.columns:
        df['full_name_normalized'] = df['full_name'].astype('string').map(normalize_name_for_blocking)
        print(f"  {name}: Created 'full_name_normalized' column")
    elif 'full_name_normalized' in df.columns:
        print(f"  {name}: 'full_name_normalized' column already exists")

print("✓ Name normalization complete")


  L_full: 'full_name_normalized' column already exists
  R_full: Created 'full_name_normalized' column
  S_full: Created 'full_name_normalized' column
✓ Name normalization complete
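
One caveat in the suffix-removal step is that `text.replace(' v', '')` removes the substring wherever it occurs, not only at the end of the name. The sketch below contrasts that with an end-anchored regular expression. It is an illustrative alternative only: `strip_generational_suffix` is a hypothetical helper, the names are made up, and switching to it would make this phase inconsistent with the normalization the docstring says is shared with the workflow notebook.

```python
import re

# Match a single generational suffix only at the very end of the name
_SUFFIX_RE = re.compile(r"\s+(?:jr|sr|iii|ii|iv|v)$")


def strip_generational_suffix(name: str) -> str:
    """Remove one trailing generational suffix from an already-normalized name."""
    return _SUFFIX_RE.sub("", name).strip()


naive = "juan vega".replace(" v", "")            # unanchored: damages the surname
anchored = strip_generational_suffix("juan vega")
with_suffix = strip_generational_suffix("sam smith jr")
print(repr(naive), repr(anchored), repr(with_suffix))
```

If the anchored variant were adopted, it would need to be applied in the blocking phase as well so that both phases see identical normalized names.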


## 3. Common Matching Infrastructure (Reusable)

This section defines reusable functions and frameworks for all matching methods.


### 3.1 Matching Function Interface

Define a standard interface for matching functions. All matching methods should follow this pattern.


In [98]:
# Matching function interface
# All matching methods should return a DataFrame with columns: ['id1', 'id2', 'sim']
# where 'sim' is the similarity score (0.0 to 1.0)

def apply_matching_method(
    edge_name: str,
    matching_func,
    *args,
    **kwargs
) -> pd.DataFrame:
    """
    Apply a matching method to candidate pairs for a given edge.
    
    Args:
        edge_name: Edge name ('LR' or 'LS')
        matching_func: Function that takes (left_df, right_df, candidates_df, *args, **kwargs)
                       and returns DataFrame with ['id1', 'id2', 'sim']
        *args, **kwargs: Additional arguments to pass to matching_func
    
    Returns:
        DataFrame with columns ['id1', 'id2', 'sim']
    """
    left_df, right_df = source_tables[edge_name]
    cand_df = candidates[edge_name]
    
    print(f"\n=== {edge_name}: Applying matching method ===")
    print(f"  Candidate pairs: {len(cand_df):,}")
    
    result = matching_func(left_df, right_df, cand_df, *args, **kwargs)
    
    # Validate result format
    required_cols = ['id1', 'id2', 'sim']
    if not all(col in result.columns for col in required_cols):
        raise ValueError(f"Matching function must return DataFrame with columns: {required_cols}")
    
    print(f"  Generated {len(result):,} scored pairs")
    print(f"  Similarity range: [{result['sim'].min():.3f}, {result['sim'].max():.3f}]")
    print(f"  Mean similarity: {result['sim'].mean():.3f}")
    
    return result

print("✓ Matching function interface defined")


✓ Matching function interface defined


### 3.2 Evaluation Functions (Reusable)


In [99]:
# Reusable evaluation functions
from PyDI.entitymatching.evaluation import EntityMatchingEvaluator

def evaluate_matching_thresholds(
    scored_pairs: pd.DataFrame,
    val_df: pd.DataFrame,
    thresholds: tuple = (0.5, 0.6, 0.7, 0.8, 0.9),  # immutable default avoids shared-state bugs
    output_dir: Path | None = None
) -> dict:
    """
    Evaluate matching performance across different similarity thresholds.
    
    Args:
        scored_pairs: DataFrame with columns ['id1', 'id2', 'sim']
        val_df: Validation ground truth with columns ['id1', 'id2', 'label']
        thresholds: List of similarity thresholds to evaluate
        output_dir: Directory for evaluation outputs
    
    Returns:
        Dictionary with best_threshold, best_f1, and detailed metrics
    """
    best_threshold = None
    best_f1 = 0.0
    threshold_metrics = {}
    
    print("\nThreshold Analysis:")
    print(f"{'Threshold':<12} {'Precision':<12} {'Recall':<12} {'F1-Score':<12} {'TP':<8} {'FP':<8} {'FN':<8}")
    print("-" * 80)
    
    for threshold in thresholds:
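        # Filter by threshold; keep the score under the 'score' name that PyDI expects
        matched_pairs = (
            scored_pairs.loc[scored_pairs['sim'] >= threshold, ['id1', 'id2', 'sim']]
            .rename(columns={'sim': 'score'})
        )
        
        # Evaluate (keyword and metric keys match the evaluator call in Section 4.3)
        metrics = EntityMatchingEvaluator.evaluate_matching(
            correspondences=matched_pairs,
            test_pairs=val_df,
            out_dir=output_dir
        )
        
        precision = metrics.get('precision', 0.0)
        recall = metrics.get('recall', 0.0)
        f1 = metrics.get('f1', 0.0)  # the evaluator reports F1 under 'f1'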
        tp = metrics.get('true_positives', 0)
        fp = metrics.get('false_positives', 0)
        fn = metrics.get('false_negatives', 0)
        
        threshold_metrics[threshold] = metrics
        print(f"{threshold:<12.1f} {precision:<12.3f} {recall:<12.3f} {f1:<12.3f} {tp:<8} {fp:<8} {fn:<8}")
        
        if f1 > best_f1:
            best_f1 = f1
            best_threshold = threshold
    
    return {
        'best_threshold': best_threshold,
        'best_f1': best_f1,
        'thresholds': thresholds,
        'threshold_metrics': threshold_metrics
    }

def apply_season_year_constraint(
    scored_pairs: pd.DataFrame,
    left_df: pd.DataFrame,
    right_df: pd.DataFrame
) -> pd.DataFrame:
    """
    Filter scored pairs to ensure season_year matches exactly.
    
    Args:
        scored_pairs: DataFrame with columns ['id1', 'id2', 'sim']
        left_df: Left source table with '_rid' and 'season_year' columns
        right_df: Right source table with '_rid' and 'season_year' columns
    
    Returns:
        Filtered DataFrame with same columns
    """
    # Merge to get season_year
    matches = scored_pairs.merge(
        left_df[['_rid', 'season_year']],
        left_on='id1',
        right_on='_rid',
        suffixes=('', '_left')
    ).merge(
        right_df[['_rid', 'season_year']],
        left_on='id2',
        right_on='_rid',
        suffixes=('', '_right')
    )
    
    # Filter: season_year must match exactly
    matches = matches[matches['season_year'] == matches['season_year_right']]
    
    # Return original columns
    return matches[['id1', 'id2', 'sim']].copy()

def apply_global_matching(
    scored_pairs: pd.DataFrame
) -> pd.DataFrame:
    """
    Apply greedy one-to-one matching to resolve conflicts.
    
    Args:
        scored_pairs: DataFrame with columns ['id1', 'id2', 'sim']
    
    Returns:
        DataFrame with one-to-one matches (same columns)
    """
    # Convert 'sim' to 'score' if needed (PyDI expects 'score' column)
    correspondences = scored_pairs.copy()
    if 'sim' in correspondences.columns and 'score' not in correspondences.columns:
        correspondences = correspondences.rename(columns={'sim': 'score'})
    
    # Apply greedy one-to-one matching using .cluster() method
    global_matcher = GreedyOneToOneMatchingAlgorithm()
    global_matched = global_matcher.cluster(correspondences)
    
    # Convert 'score' back to 'sim' if original had 'sim'
    if 'sim' in scored_pairs.columns and 'score' in global_matched.columns:
        global_matched = global_matched.rename(columns={'score': 'sim'})
    
    return global_matched

print("✓ Evaluation functions defined")


✓ Evaluation functions defined
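
The global matching step keeps at most one partner per record. As a rough illustration of the greedy idea (visit pairs in descending score and accept a pair only if neither record is already matched), here is a plain-Python sketch. It is not PyDI's `GreedyOneToOneMatchingAlgorithm`, whose tie-breaking and internals may differ; `greedy_one_to_one` and the toy IDs are hypothetical.

```python
def greedy_one_to_one(scored_pairs):
    """Greedily select one-to-one matches from (id1, id2, sim) tuples.

    Pairs are visited in descending similarity; a pair is kept only if
    neither of its records has been matched already.
    """
    used_left, used_right, kept = set(), set(), []
    for id1, id2, sim in sorted(scored_pairs, key=lambda p: p[2], reverse=True):
        if id1 in used_left or id2 in used_right:
            continue  # conflicts with a higher-scoring pair
        used_left.add(id1)
        used_right.add(id2)
        kept.append((id1, id2, sim))
    return kept


toy_pairs = [
    ("L1", "R1", 0.95),
    ("L1", "R2", 0.90),  # dropped: L1 is already matched to R1
    ("L2", "R1", 0.85),  # dropped: R1 is already matched to L1
    ("L2", "R2", 0.80),
]
selected = greedy_one_to_one(toy_pairs)
print(selected)
```

Greedy selection is fast but not guaranteed to maximize the total score; an optimal assignment would need something like the Hungarian algorithm.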


## 4. Rule-Based Matching

Use similarity comparators to compute a weighted matching score:
- **Name similarity**: Levenshtein similarity (weight: 0.7) and word-level Jaccard similarity (weight: 0.3)
- **Season year**: Must match exactly (hard constraint: candidate pairs with different or missing season years are removed before scoring)
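
To make the weighting concrete, the sketch below combines a normalized Levenshtein similarity with a word-level Jaccard similarity using the 0.7/0.3 weights. It assumes Levenshtein similarity is `1 - distance / max(len(a), len(b))`; PyDI's `StringComparator` may normalize differently, so the numbers need not match the notebook's scores. All function names here are illustrative.

```python
def levenshtein_distance(a: str, b: str) -> int:
    """Classic dynamic-programming edit distance (insert, delete, substitute)."""
    previous = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        current = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            current.append(min(previous[j] + 1,          # deletion
                               current[j - 1] + 1,       # insertion
                               previous[j - 1] + cost))  # substitution
        previous = current
    return previous[-1]


def levenshtein_similarity(a: str, b: str) -> float:
    """Assumed normalization: 1 - distance / length of the longer string."""
    longest = max(len(a), len(b))
    return 1.0 if longest == 0 else 1.0 - levenshtein_distance(a, b) / longest


def jaccard_similarity(a: str, b: str) -> float:
    """Jaccard overlap of whitespace-separated tokens."""
    ta, tb = set(a.split()), set(b.split())
    return 1.0 if not (ta or tb) else len(ta & tb) / len(ta | tb)


def weighted_name_similarity(a: str, b: str, w_lev: float = 0.7, w_jac: float = 0.3) -> float:
    return w_lev * levenshtein_similarity(a, b) + w_jac * jaccard_similarity(a, b)


score = weighted_name_similarity("mike trout", "michael trout")
print(round(score, 3))
```

Under these assumptions a nickname pair like this one scores below the 0.7 threshold, which is the kind of case a name variant dictionary is meant to recover.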


In [100]:
# Configure comparators for rule-based matching
from PyDI.entitymatching.comparators import StringComparator, DateComparator

# Name comparators (using normalized names)
name_comparators = [
    StringComparator(
        column="full_name_normalized" if 'full_name_normalized' in L_full.columns else "full_name",
        similarity_function="levenshtein",
        preprocess=str.lower
    ),
    StringComparator(
        column="full_name_normalized" if 'full_name_normalized' in L_full.columns else "full_name",
        similarity_function="jaccard",
        tokenization="word",
        preprocess=str.lower
    )
]
name_weights = [0.7, 0.3]  # Emphasize Levenshtein for name matching

# Matching strategy: name similarity + season_year hard constraint
# season_year is checked as a hard constraint (must match exactly) before computing similarity
# Only name comparators are used for similarity scoring

# Use only name comparators (season_year is handled as hard constraint)
rule_based_comparators = name_comparators
rule_based_weights = name_weights  # Levenshtein (0.7), Jaccard (0.3)

print("Rule-based comparators configured:")
print(f"  Name comparators: Levenshtein (0.7), Jaccard (0.3)")
print(f"  Hard constraint: season_year must match exactly")
print(f"  Total comparators: {len(rule_based_comparators)}")


Rule-based comparators configured:
  Name comparators: Levenshtein (0.7), Jaccard (0.3)
  Hard constraint: season_year must match exactly
  Total comparators: 2


In [101]:
# Name variant helpers and configuration

NAME_VARIANTS = {
    'dan': ['daniel'],
    'daniel': ['dan', 'danny'],
    'danny': ['daniel'],
    'matt': ['matthew'],
    'matthew': ['matt'],
    'jon': ['jonathon'],
    'jonathon': ['jon'],
    'cal': ['calvin'],
    'calvin': ['cal'],
    'phil': ['phillip'],
    'phillip': ['phil'],
    'philip': ['phil'],
    'rafael': ['raffy'],
    'raffy': ['rafael'],
    'jim': ['james'],
    'james': ['jim'],
    'bob': ['robert'],
    'robert': ['bob', 'rob'],
    'rob': ['robert'],
    'bill': ['william'],
    'william': ['bill'],
    'mike': ['michael'],
    'michael': ['mike'],
    'dave': ['david'],
    'david': ['dave'],
    'chris': ['christopher'],
    'christopher': ['chris'],
    'tom': ['thomas'],
    'thomas': ['tom'],
    'ed': ['edward'],
    'edward': ['ed'],
    'rick': ['richard'],
    'richard': ['rick'],
    'nick': ['nicholas', 'nicky'],
    'nicky': ['nick'],
    'nicholas': ['nick'],
    'nori': ['norichika'],
    'norichika': ['nori'],
}

def _normalize_variant_targets(value):
    if value is None:
        return set()
    if isinstance(value, str):
        return {value}
    if isinstance(value, (list, tuple, set)):
        return set(value)
    return set()

def tokens_are_variants(token_a: str, token_b: str) -> bool:
    a = (token_a or '').lower()
    b = (token_b or '').lower()
    if not a or not b:
        return False
    if a == b:
        return True
    return (
        b in _normalize_variant_targets(NAME_VARIANTS.get(a))
        or a in _normalize_variant_targets(NAME_VARIANTS.get(b))
    )

def split_first_last(name: str):
    tokens = str(name).lower().split()
    if not tokens:
        return '', ''
    return tokens[0], tokens[-1]

def parse_birth_year(value):
    try:
        if value is None or (isinstance(value, float) and value != value) or value == '':
            return None
        return int(float(value))
    except (ValueError, TypeError):
        return None

def birth_year_diff(year1, year2):
    y1 = parse_birth_year(year1)
    y2 = parse_birth_year(year2)
    if y1 is None or y2 is None:
        return None
    return abs(y1 - y2)

def birth_year_within(year1, year2, tolerance: int = 1) -> bool:
    diff = birth_year_diff(year1, year2)
    return diff is not None and diff <= tolerance

# Strategy 1: Name variant dictionary awareness for optimized matching

def check_name_variant_match(name1, name2):
    """Check whether two names with the same number of tokens match token by token, allowing known nickname variants."""
    name1_words = name1.lower().split()
    name2_words = name2.lower().split()
    if len(name1_words) != len(name2_words):
        return False
    for w1, w2 in zip(name1_words, name2_words):
        if w1 == w2:
            continue
        if not tokens_are_variants(w1, w2):
            return False
    return True

def apply_birth_year_constraint(similarity_score, birth1, birth2, penalty=0.2):
    """Apply birth year soft constraint: full penalty if birth years differ by more than 1 year, half penalty if they differ by exactly 1."""
    diff = birth_year_diff(birth1, birth2)
    if diff is None:
        return similarity_score
    if diff > 1:
        return similarity_score * (1 - penalty)
    if diff == 1:
        return similarity_score * (1 - penalty * 0.5)
    return similarity_score

def compute_enhanced_similarity(left_record, right_record, comparators, weights):
    """Compute enhanced similarity with name variant handling."""
    name1 = str(left_record.get('full_name_normalized', left_record.get('full_name', ''))).lower()
    name2 = str(right_record.get('full_name_normalized', right_record.get('full_name', ''))).lower()
    base_scores = []
    for comparator in comparators:
        col = comparator.column
        left_val = str(left_record.get(col, ''))
        right_val = str(right_record.get(col, ''))
        if hasattr(comparator, 'similarity_function'):
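            if comparator.similarity_function == 'levenshtein':
                # Note: SequenceMatcher.ratio() is a Ratcliff/Obershelp-style measure,
                # used here as a stand-in for Levenshtein similarity
                from difflib import SequenceMatcher
                sim = SequenceMatcher(None, left_val.lower(), right_val.lower()).ratio()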
            elif comparator.similarity_function == 'jaccard':
                left_tokens = set(left_val.lower().split())
                right_tokens = set(right_val.lower().split())
                sim = 1.0 if not (left_tokens or right_tokens) else len(left_tokens & right_tokens) / len(left_tokens | right_tokens)
            else:
                sim = 0.0
        else:
            sim = 0.0
        base_scores.append(sim)
    base_score = sum(s * w for s, w in zip(base_scores, weights))
    if check_name_variant_match(name1, name2):
        base_score = min(1.0, base_score + 0.15)
    return base_score

optimized_threshold = 0.7  # equal to the baseline threshold of 0.7, so effectively unchanged

print("Optimized matching configuration:")
print(f"  Name variant dictionary: {len(NAME_VARIANTS)} variants")
print(f"  Adjusted threshold: {optimized_threshold} (from 0.7)")
print(f"  Birth year constraint: penalty={0.2} for year_diff > 1")



Optimized matching configuration:
  Name variant dictionary: 38 variants
  Adjusted threshold: 0.7 (from 0.7)
  Birth year constraint: penalty=0.2 for year_diff > 1


### 4.1 Define Rule-Based Matching Function


In [102]:
def rule_based_matching(
    left_df: pd.DataFrame,
    right_df: pd.DataFrame,
    cand_df: pd.DataFrame,
    comparators: list | None = None,
    weights: list | None = None,
    threshold: float = 0.7,
) -> pd.DataFrame:
    """Match candidate pairs using PyDI's RuleBasedMatcher."""
    matcher = RuleBasedMatcher()
    correspondences = matcher.match(
        df_left=left_df,
        df_right=right_df,
        candidates=cand_df,
        id_column='_rid',
        comparators=comparators or rule_based_comparators,
        weights=weights or rule_based_weights,
        threshold=threshold,
        debug=False,
    )
    # Normalize column name so downstream code can expect 'sim'
    if 'score' in correspondences.columns and 'sim' not in correspondences.columns:
        correspondences = correspondences.rename(columns={'score': 'sim'})
    return correspondences[['id1', 'id2', 'sim']]

print("✓ Rule-based matching function defined")


✓ Rule-based matching function defined


### 4.2 Apply Rule-Based Matching


In [103]:
# Apply rule-based matching to all edges
matching_results_rule_based = {}

for edge_name in ['LR', 'LS']:
    matching_results_rule_based[edge_name] = apply_matching_method(
        edge_name,
        rule_based_matching,
        comparators=rule_based_comparators,
        weights=rule_based_weights
    )

print("\n✓ Rule-based matching complete for all edges")



=== LR: Applying matching method ===
  Candidate pairs: 5,994,373


[INFO ] PyDI.entitymatching.rule_based.RuleBasedMatcher - Starting Entity Matching
[INFO ] PyDI.entitymatching.rule_based.RuleBasedMatcher - Blocking 106553 x 15215 elements
[INFO ] PyDI.entitymatching.rule_based.RuleBasedMatcher - Matching 106553 x 15215 elements after 0:00:0.212; 5994373 blocked pairs (reduction ratio: 0.9963025175189331)
[INFO ] PyDI.entitymatching.rule_based.RuleBasedMatcher - Entity Matching finished after 0:00:907.779; found 136453 correspondences.
[INFO ] PyDI.entitymatching.rule_based.RuleBasedMatcher - Starting Entity Matching
[INFO ] PyDI.entitymatching.rule_based.RuleBasedMatcher - Blocking 106553 x 6743 elements
[INFO ] PyDI.entitymatching.rule_based.RuleBasedMatcher - Matching 106553 x 6743 elements after 0:00:0.088; 227185 blocked pairs (reduction ratio: 0.999683800767084)


  Generated 136,453 scored pairs
  Similarity range: [0.700, 1.000]
  Mean similarity: 0.990

=== LS: Applying matching method ===
  Candidate pairs: 227,185


[INFO ] PyDI.entitymatching.rule_based.RuleBasedMatcher - Entity Matching finished after 0:00:27.619; found 54163 correspondences.


  Generated 54,163 scored pairs
  Similarity range: [0.700, 1.000]
  Mean similarity: 0.993

✓ Rule-based matching complete for all edges


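Re-apply RuleBasedMatcher after filtering the candidate pairs with the season_year hard constraint. These filtered results (`matching_results`) are the ones evaluated in Section 4.3.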


In [104]:
# Define matching threshold
threshold = 0.7

matching_results = {}
matchers = {}

for edge_name in ['LR', 'LS']:
    print(f"\n=== {edge_name}: Rule-Based Matching ===")
    
    left_df, right_df = source_tables[edge_name]
    cand_df = candidates[edge_name].copy()
    
    # Apply season_year hard constraint: filter candidate pairs where season_year matches
    print(f"  Filtering candidate pairs by season_year constraint...")
    cand_with_seasons = cand_df.merge(
        left_df[['_rid', 'season_year']],
        left_on='id1',
        right_on='_rid',
        how='left'
    ).merge(
        right_df[['_rid', 'season_year']],
        left_on='id2',
        right_on='_rid',
        how='left',
        suffixes=('', '_right')
    )
    
    # Filter: season_year must match exactly
    cand_df_filtered = cand_with_seasons[
        (cand_with_seasons['season_year'].notna()) & 
        (cand_with_seasons['season_year_right'].notna()) &
        (cand_with_seasons['season_year'] == cand_with_seasons['season_year_right'])
    ][['id1', 'id2']].copy()
    
    print(f"  After season_year filter: {len(cand_df_filtered):,} candidate pairs (from {len(cand_df):,})")
    
    # Initialize matcher (following exercise file pattern)
    matcher = RuleBasedMatcher()
    
    # Match with a single threshold (following exercise file pattern)
    print(f"  Computing similarity scores using PyDI RuleBasedMatcher (threshold={threshold})...")
    correspondences = matcher.match(
        df_left=left_df,
        df_right=right_df,
        candidates=cand_df_filtered,
        id_column='_rid',
        comparators=rule_based_comparators,
        weights=rule_based_weights,
        threshold=threshold,
        debug=False
    )
    
    # Store results
    matching_results[edge_name] = correspondences
    matchers[edge_name] = matcher
    
    print(f"  Generated {len(correspondences):,} matched pairs (above threshold {threshold})")
    if 'score' in correspondences.columns:
        print(f"  Similarity score range: [{correspondences['score'].min():.3f}, {correspondences['score'].max():.3f}]")
        print(f"  Mean similarity: {correspondences['score'].mean():.3f}")



=== LR: Rule-Based Matching ===
  Filtering candidate pairs by season_year constraint...


[INFO ] PyDI.entitymatching.rule_based.RuleBasedMatcher - Starting Entity Matching
[INFO ] PyDI.entitymatching.rule_based.RuleBasedMatcher - Blocking 106553 x 15215 elements


  After season_year filter: 135,944 candidate pairs (from 5,994,373)
  Computing similarity scores using PyDI RuleBasedMatcher (threshold=0.7)...


[INFO ] PyDI.entitymatching.rule_based.RuleBasedMatcher - Matching 106553 x 15215 elements after 0:00:0.291; 135944 blocked pairs (reduction ratio: 0.9999161462661056)
[INFO ] PyDI.entitymatching.rule_based.RuleBasedMatcher - Entity Matching finished after 0:00:21.904; found 14981 correspondences.


  Generated 14,981 matched pairs (above threshold 0.7)
  Similarity score range: [0.700, 1.000]
  Mean similarity: 0.998

=== LS: Rule-Based Matching ===
  Filtering candidate pairs by season_year constraint...


[INFO ] PyDI.entitymatching.rule_based.RuleBasedMatcher - Starting Entity Matching
[INFO ] PyDI.entitymatching.rule_based.RuleBasedMatcher - Blocking 106553 x 6743 elements
[INFO ] PyDI.entitymatching.rule_based.RuleBasedMatcher - Matching 106553 x 6743 elements after 0:00:0.088; 9729 blocked pairs (reduction ratio: 0.9999864590429076)


  After season_year filter: 9,729 candidate pairs (from 227,185)
  Computing similarity scores using PyDI RuleBasedMatcher (threshold=0.7)...


[INFO ] PyDI.entitymatching.rule_based.RuleBasedMatcher - Entity Matching finished after 0:00:1.723; found 6582 correspondences.


  Generated 6,582 matched pairs (above threshold 0.7)
  Similarity score range: [0.700, 1.000]
  Mean similarity: 0.997
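
The season-year hard constraint amounts to a lookup and an equality test per candidate pair. The notebook does this with two pandas merges; the plain-Python sketch below shows the same logic on toy data. `filter_by_season` and the record IDs are illustrative.

```python
def filter_by_season(pairs, left_season, right_season):
    """Keep (id1, id2) pairs whose season years are both known and equal."""
    kept = []
    for id1, id2 in pairs:
        s1 = left_season.get(id1)   # None if the ID or its season is missing
        s2 = right_season.get(id2)
        if s1 is not None and s2 is not None and s1 == s2:
            kept.append((id1, id2))
    return kept


left_season = {"p1|2019|L": 2019, "p1|2020|L": 2020, "p2|NA|L": None}
right_season = {"q1|2019|R": 2019, "q2|2021|R": 2021}
toy_candidates = [
    ("p1|2019|L", "q1|2019|R"),  # same season: kept
    ("p1|2020|L", "q1|2019|R"),  # different season: dropped
    ("p2|NA|L", "q1|2019|R"),    # missing season: dropped
    ("p1|2019|L", "q9|2019|R"),  # unknown right ID: dropped
]
filtered = filter_by_season(toy_candidates, left_season, right_season)
print(filtered)
```

Filtering before scoring is what reduces the LR candidate set from 5,994,373 to 135,944 pairs in the output above, and it cuts matching time accordingly.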


### 4.3 Evaluate Rule-Based Matching

Assess matching performance on the validation set at the fixed similarity threshold of 0.7.


In [105]:
# Evaluate matching on validation set (following exercise file pattern)
# Simple evaluation with a single threshold

from PyDI.entitymatching.evaluation import EntityMatchingEvaluator

matching_metrics_val = {}

for edge_name in ['LR', 'LS']:
    print(f"\n=== {edge_name}: Matching Evaluation (Validation Set) ===")
    
    correspondences = matching_results[edge_name]
    val_df = splits[edge_name]['val'][['id1', 'id2', 'label']].copy()
    
    # Evaluate matching (following exercise file pattern)
    eval_results = EntityMatchingEvaluator.evaluate_matching(
        correspondences=correspondences,
        test_pairs=val_df,
        out_dir=OUTPUT_DIR / 'matching-evaluation',
        matcher_instance=matchers[edge_name]
    )
    
    matching_metrics_val[edge_name] = eval_results
    
    print(f"\n  Precision: {eval_results.get('precision', 0.0):.3f}")
    print(f"  Recall:    {eval_results.get('recall', 0.0):.3f}")
    print(f"  F1-Score:  {eval_results.get('f1', 0.0):.3f}")
    print(f"  TP: {eval_results.get('true_positives', 0)}")
    print(f"  FP: {eval_results.get('false_positives', 0)}")
    print(f"  FN: {eval_results.get('false_negatives', 0)}")



=== LR: Matching Evaluation (Validation Set) ===


[INFO ] root - Confusion Matrix:
[INFO ] root -   True Positives:  25
[INFO ] root -   True Negatives:  65
[INFO ] root -   False Positives: 1
[INFO ] root -   False Negatives: 5
[INFO ] root - Performance Metrics:
[INFO ] root -   Accuracy:  0.938
[INFO ] root -   Precision: 0.962
[INFO ] root -   Recall:    0.833
[INFO ] root -   F1-Score:  0.893



  Precision: 0.962
  Recall:    0.833
  F1-Score:  0.893
  TP: 25
  FP: 1
  FN: 5

=== LS: Matching Evaluation (Validation Set) ===


[INFO ] root - Confusion Matrix:
[INFO ] root -   True Positives:  41
[INFO ] root -   True Negatives:  45
[INFO ] root -   False Positives: 2
[INFO ] root -   False Negatives: 8
[INFO ] root - Performance Metrics:
[INFO ] root -   Accuracy:  0.896
[INFO ] root -   Precision: 0.953
[INFO ] root -   Recall:    0.837
[INFO ] root -   F1-Score:  0.891



  Precision: 0.953
  Recall:    0.837
  F1-Score:  0.891
  TP: 41
  FP: 2
  FN: 8


### 4.4 Analyze Rule-Based Error Cases

In [106]:
for edge_name in ['LR', 'LS']:
    print(f"\n{'='*80}")
    print(f"{edge_name}: Error Cases Analysis")
    print(f"{'='*80}")
    
    left_df, right_df = source_tables[edge_name]
    val_df = splits[edge_name]['val'][['id1', 'id2', 'label']].copy()
    correspondences = matching_results[edge_name]
    
    # Get true matches and false matches in validation set
    true_matches = val_df[val_df['label'].astype(str).str.strip().str.upper() == 'TRUE']
    false_matches = val_df[val_df['label'].astype(str).str.strip().str.upper() == 'FALSE']
    true_set = set(zip(true_matches['id1'], true_matches['id2']))
    false_set = set(zip(false_matches['id1'], false_matches['id2']))
    
    # All predicted matches; they are restricted to validation pairs when computing false positives below
    pred_set = set(zip(correspondences['id1'], correspondences['id2']))
    val_set = set(zip(val_df['id1'], val_df['id2']))  # All pairs in validation set
    
    # False Negatives: True matches in validation set that were not predicted
    fn_pairs = true_set - pred_set
    print(f"\n[FALSE NEGATIVES] ({len(fn_pairs)} cases):")
    if len(fn_pairs) > 0:
        for id1, id2 in fn_pairs:
            left_rec = left_df[left_df['_rid'] == id1].iloc[0] if len(left_df[left_df['_rid'] == id1]) > 0 else None
            right_rec = right_df[right_df['_rid'] == id2].iloc[0] if len(right_df[right_df['_rid'] == id2]) > 0 else None
            
            if left_rec is not None and right_rec is not None:
                print(f"\n  {id1} <-> {id2}")
                print(f"    Left:  '{left_rec.get('full_name', 'N/A')}' | Season: {left_rec.get('season_year', 'N/A')} | Birth: {left_rec.get('birth_year', 'N/A')}")
                print(f"    Right: '{right_rec.get('full_name', 'N/A')}' | Season: {right_rec.get('season_year', 'N/A')} | Birth: {right_rec.get('birth_year', 'N/A')}")
                
                # Check if in candidates
                in_candidates = len(candidates[edge_name][
                    (candidates[edge_name]['id1'] == id1) & 
                    (candidates[edge_name]['id2'] == id2)
                ]) > 0
                print(f"    In candidates: {in_candidates}")
    
    # False Positives: Predicted matches that are in validation set but labeled as FALSE
    # Only analyze pairs that are in the validation set
    fp_pairs = (pred_set & val_set) & false_set
    print(f"\n[FALSE POSITIVES] ({len(fp_pairs)} cases):")
    if len(fp_pairs) > 0:
        for id1, id2 in fp_pairs:
            left_rec = left_df[left_df['_rid'] == id1].iloc[0] if len(left_df[left_df['_rid'] == id1]) > 0 else None
            right_rec = right_df[right_df['_rid'] == id2].iloc[0] if len(right_df[right_df['_rid'] == id2]) > 0 else None
            score_row = correspondences[(correspondences['id1'] == id1) & (correspondences['id2'] == id2)]
            score = score_row['score'].iloc[0] if len(score_row) > 0 else None
            
            if left_rec is not None and right_rec is not None:
                score_str = f"{score:.3f}" if score is not None else "N/A"
                print(f"\n  {id1} <-> {id2} (score: {score_str})")
                print(f"    Left:  '{left_rec.get('full_name', 'N/A')}' | Season: {left_rec.get('season_year', 'N/A')} | Birth: {left_rec.get('birth_year', 'N/A')}")
                print(f"    Right: '{right_rec.get('full_name', 'N/A')}' | Season: {right_rec.get('season_year', 'N/A')} | Birth: {right_rec.get('birth_year', 'N/A')}")



LR: Error Cases Analysis

[FALSE NEGATIVES] (5 cases):

  4770032011|2011|L <-> 4770032011|2011|R
    Left:  'jonathon niese' | Season: 2011 | Birth: 1986
    Right: 'jon niese' | Season: 2011 | Birth: 1987
    In candidates: True

  6211992018|2018|L <-> 6211992018|2018|R
    Left:  'matthew bowman' | Season: 2018 | Birth: 1991
    Right: 'matt bowman' | Season: 2018 | Birth: 1991
    In candidates: True

  6687512022|2022|L <-> 6687512022|2022|R
    Left:  'cal mitchell' | Season: 2022 | Birth: 1999
    Right: 'calvin mitchell' | Season: 2022 | Birth: 1999
    In candidates: True

  6135642021|2021|L <-> 6135642021|2021|R
    Left:  'jason vosler' | Season: 2021 | Birth: 1993
    Right: 'jason vosler' | Season: 2021 | Birth: 1994
    In candidates: True

  4770032014|2014|L <-> 4770032014|2014|R
    Left:  'jonathon niese' | Season: 2014 | Birth: 1986
    Right: 'jon niese' | Season: 2014 | Birth: 1987
    In candidates: True

[FALSE POSITIVES] (1 cases):

  6756562021|2021|L <-> 60

### 4.5 Optimized Rule-Based Matching (Name Variants + Birth Year Constraint)
Based on error analysis, we implement three optimization strategies:

1. **Name Variant Handling**: Handle common name variants (dan/daniel, matt/matthew, etc.)
2. **Adjusted Threshold**: Raise the decision threshold from 0.7 to 0.75 and apply it to the adjusted scores
3. **Birth Year Soft Constraint**: Apply penalty when birth years differ by more than 1 year
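
Strategies 1 and 3 can be sketched in isolation as follows. This is a minimal illustration, not the notebook's implementation: the `VARIANTS` table, the helper names `sketch_canonical`, `sketch_is_variant`, and `sketch_adjust_score`, and the subtractive form of the penalty are assumptions made for the example. The notebook's own helpers (`tokens_are_variants`, `apply_birth_year_constraint`) are defined elsewhere and may behave differently.

```python
from typing import Optional

# Hypothetical variant table for illustration; the notebook's own table may differ.
VARIANTS = {
    "dan": "daniel",
    "matt": "matthew",
    "jon": "jonathon",
    "cal": "calvin",
}


def sketch_canonical(first_name: str) -> str:
    """Map a nickname to its long form; unknown names map to themselves."""
    name = first_name.strip().lower()
    return VARIANTS.get(name, name)


def sketch_is_variant(first1: str, first2: str) -> bool:
    """True when both first names resolve to the same canonical form."""
    return sketch_canonical(first1) == sketch_canonical(first2)


def sketch_adjust_score(
    score: float,
    first1: str,
    first2: str,
    birth1: Optional[int],
    birth2: Optional[int],
    boost: float = 0.15,
    penalty: float = 0.2,
) -> float:
    """Boost nickname matches, then penalize birth-year gaps above one year."""
    # Strategy 1: boost only when the names differ as strings but are known variants.
    if first1.lower() != first2.lower() and sketch_is_variant(first1, first2):
        score = min(1.0, score + boost)
    # Strategy 3: soft constraint; a subtractive penalty is an assumption here.
    if birth1 is not None and birth2 is not None and abs(birth1 - birth2) > 1:
        score = max(0.0, score - penalty)
    return score


print(sketch_adjust_score(0.66, "matt", "matthew", 1991, 1991))  # boosted
print(sketch_adjust_score(0.90, "jason", "jason", 1990, 1994))   # penalized
```

A missing birth year leaves the score untouched in this sketch, which keeps the constraint soft: it can lower a score but never rejects a pair on its own.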


In [107]:
# Apply optimized matching with name variants and birth year constraint

optimized_matching_results = {}
optimized_matchers = {}

for edge_name in ['LR', 'LS']:
    print(f"\n=== {edge_name}: Optimized Rule-Based Matching ===")
    
    left_df, right_df = source_tables[edge_name]
    cand_df = candidates[edge_name].copy()
    
    # Apply season_year hard constraint: filter candidate pairs where season_year matches
    print(f"  Filtering candidate pairs by season_year constraint...")
    cand_with_seasons = cand_df.merge(
        left_df[['_rid', 'season_year', 'birth_year']],
        left_on='id1',
        right_on='_rid',
        how='left'
    ).merge(
        right_df[['_rid', 'season_year', 'birth_year']],
        left_on='id2',
        right_on='_rid',
        how='left',
        suffixes=('', '_right')
    )
    
    # Filter: season_year must match exactly
    cand_df_filtered = cand_with_seasons[
        (cand_with_seasons['season_year'].notna()) & 
        (cand_with_seasons['season_year_right'].notna()) &
        (cand_with_seasons['season_year'] == cand_with_seasons['season_year_right'])
    ][['id1', 'id2', 'birth_year', 'birth_year_right']].copy()
    
    print(f"  After season_year filter: {len(cand_df_filtered):,} candidate pairs (from {len(cand_df):,})")
    
    # Initialize matcher
    matcher = RuleBasedMatcher()
    
    # First, compute base similarity using PyDI (threshold=0.0 keeps every score; filtering happens later)
    print(f"  Computing base similarity scores using PyDI RuleBasedMatcher...")
    base_correspondences = matcher.match(
        df_left=left_df,
        df_right=right_df,
        candidates=cand_df_filtered[['id1', 'id2']],
        id_column='_rid',
        comparators=rule_based_comparators,
        weights=rule_based_weights,
        threshold=0.0,  # Get all scores, we'll filter later
        debug=False
    )
    
    print(f"  Base correspondences: {len(base_correspondences):,}")
    
    # Apply enhanced similarity computation with name variants and birth year constraint
    print(f"  Applying name variant handling and birth year constraint...")
    
    # Merge with source data to get names and birth years
    enhanced_correspondences = base_correspondences.merge(
        left_df[['_rid', 'full_name_normalized', 'full_name', 'birth_year']],
        left_on='id1',
        right_on='_rid',
        how='left'
    ).merge(
        right_df[['_rid', 'full_name_normalized', 'full_name', 'birth_year']],
        left_on='id2',
        right_on='_rid',
        how='left',
        suffixes=('', '_right')
    )
    
    # Convert birth_year columns to numeric once for downstream logic
    enhanced_correspondences['birth_year'] = pd.to_numeric(enhanced_correspondences['birth_year'], errors='coerce')
    enhanced_correspondences['birth_year_right'] = pd.to_numeric(enhanced_correspondences['birth_year_right'], errors='coerce')

    # Apply name variant enhancement (Strategy 1)
    def apply_name_variant_boost(row):
        base_score = row['score']
        name1 = str(row.get('full_name_normalized', row.get('full_name', ''))).lower()
        name2 = str(row.get('full_name_normalized_right', row.get('full_name_right', ''))).lower()
        first1, last1 = split_first_last(name1)
        first2, last2 = split_first_last(name2)

        if (
            last1
            and last1 == last2
            and tokens_are_variants(first1, first2)
            and birth_year_within(row.get('birth_year'), row.get('birth_year_right'))
        ):
            return max(base_score, 0.95)
        
        if check_name_variant_match(name1, name2):
            # Variant match: boost score (but cap at 1.0)
            return min(1.0, base_score + 0.15)
        return base_score
    
    enhanced_correspondences['enhanced_score'] = enhanced_correspondences.apply(apply_name_variant_boost, axis=1)
    
    # Apply birth year constraint (Strategy 3)
    
    def apply_final_adjustments(row):
        adjusted = apply_birth_year_constraint(
            row['enhanced_score'],
            row.get('birth_year'),
            row.get('birth_year_right'),
            penalty=0.2
        )
        name_equal = bool(
            row.get('full_name_normalized')
            and row.get('full_name_normalized_right')
            and str(row['full_name_normalized']).lower() == str(row['full_name_normalized_right']).lower()
        )
        if name_equal:
            gap = birth_year_diff(row.get('birth_year'), row.get('birth_year_right'))
            if gap is not None and gap >= 2:
                return adjusted * 0.2
        return adjusted

    enhanced_correspondences['final_score'] = enhanced_correspondences.apply(apply_final_adjustments, axis=1)
    
    # Keep only necessary columns
    enhanced_correspondences = enhanced_correspondences[['id1', 'id2', 'final_score']].rename(columns={'final_score': 'score'})
    
    # Apply optimized threshold (Strategy 2); thresholds are configurable per edge, both currently 0.75
    edge_threshold = {'LR': 0.75, 'LS': 0.75}[edge_name]
    optimized_correspondences = enhanced_correspondences[enhanced_correspondences['score'] >= edge_threshold].copy()
    
    # Store results
    optimized_matching_results[edge_name] = optimized_correspondences
    optimized_matchers[edge_name] = matcher
    
    print(f"  Generated {len(optimized_correspondences):,} matched pairs (above threshold {edge_threshold})")
    if len(optimized_correspondences) > 0:
        print(f"  Similarity score range: [{optimized_correspondences['score'].min():.3f}, {optimized_correspondences['score'].max():.3f}]")
        print(f"  Mean similarity: {optimized_correspondences['score'].mean():.3f}")
    
print("\n✓ Optimized matching complete for all edges")



=== LR: Optimized Rule-Based Matching ===
  Filtering candidate pairs by season_year constraint...


[INFO ] PyDI.entitymatching.rule_based.RuleBasedMatcher - Starting Entity Matching
[INFO ] PyDI.entitymatching.rule_based.RuleBasedMatcher - Blocking 106553 x 15215 elements


  After season_year filter: 135,944 candidate pairs (from 5,994,373)
  Computing base similarity scores using PyDI RuleBasedMatcher...


[INFO ] PyDI.entitymatching.rule_based.RuleBasedMatcher - Matching 106553 x 15215 elements after 0:00:0.224; 135944 blocked pairs (reduction ratio: 0.9999161462661056)
[INFO ] PyDI.entitymatching.rule_based.RuleBasedMatcher - Entity Matching finished after 0:00:30.459; found 135944 correspondences.


  Base correspondences: 135,944
  Applying name variant handling and birth year constraint...
  Generated 14,828 matched pairs (above threshold 0.75)
  Similarity score range: [0.761, 1.000]
  Mean similarity: 0.948

=== LS: Optimized Rule-Based Matching ===
  Filtering candidate pairs by season_year constraint...


[INFO ] PyDI.entitymatching.rule_based.RuleBasedMatcher - Starting Entity Matching
[INFO ] PyDI.entitymatching.rule_based.RuleBasedMatcher - Blocking 106553 x 6743 elements
[INFO ] PyDI.entitymatching.rule_based.RuleBasedMatcher - Matching 106553 x 6743 elements after 0:00:0.102; 9729 blocked pairs (reduction ratio: 0.9999864590429076)


  After season_year filter: 9,729 candidate pairs (from 227,185)
  Computing base similarity scores using PyDI RuleBasedMatcher...


[INFO ] PyDI.entitymatching.rule_based.RuleBasedMatcher - Entity Matching finished after 0:00:2.336; found 9729 correspondences.


  Base correspondences: 9,729
  Applying name variant handling and birth year constraint...
  Generated 6,522 matched pairs (above threshold 0.75)
  Similarity score range: [0.750, 1.000]
  Mean similarity: 0.948

✓ Optimized matching complete for all edges


### 4.6 Evaluate Optimized Matching
Evaluate the performance of optimized matching on the validation set.
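
As a rough sketch of what such an evaluation computes, precision, recall, and F1 can be derived from sets of labeled and predicted pairs. The function name `sketch_pair_metrics` and the toy pair IDs are assumptions for illustration, and PyDI's `EntityMatchingEvaluator` may differ in its details. Predictions are intersected with the labeled pairs, so a predicted pair that has no label counts neither as a true positive nor as a false positive.

```python
from typing import Dict, Set, Tuple

Pair = Tuple[str, str]


def sketch_pair_metrics(
    true_pairs: Set[Pair], false_pairs: Set[Pair], predicted: Set[Pair]
) -> Dict[str, float]:
    """Compute precision, recall, and F1 over labeled pairs only."""
    tp = len(predicted & true_pairs)
    fp = len(predicted & false_pairs)  # unlabeled predictions are ignored
    fn = len(true_pairs - predicted)
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return {"tp": tp, "fp": fp, "fn": fn,
            "precision": precision, "recall": recall, "f1": f1}


# Toy data with hypothetical IDs.
true_pairs = {("a1", "b1"), ("a2", "b2"), ("a3", "b3")}
false_pairs = {("a1", "b9"), ("a4", "b4")}
# The last predicted pair has no label and is therefore not counted.
predicted = {("a1", "b1"), ("a2", "b2"), ("a4", "b4"), ("a7", "b7")}

metrics = sketch_pair_metrics(true_pairs, false_pairs, predicted)
print(metrics)
```

Restricting false positives to labeled non-matches matters here because the matcher scores every candidate pair, while the validation set labels only a small sample of them.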


In [108]:
# Evaluate optimized matching on validation set

from PyDI.entitymatching.evaluation import EntityMatchingEvaluator

optimized_matching_metrics_val = {}

for edge_name in ['LR', 'LS']:
    print(f"\n=== {edge_name}: Optimized Matching Evaluation (Validation Set) ===")
    
    correspondences = optimized_matching_results[edge_name]
    val_df = splits[edge_name]['val'][['id1', 'id2', 'label']].copy()
    
    # Rename 'score' to match PyDI evaluator expectations (if needed)
    correspondences_for_eval = correspondences.copy()
    if 'score' not in correspondences_for_eval.columns and 'sim' in correspondences_for_eval.columns:
        correspondences_for_eval = correspondences_for_eval.rename(columns={'sim': 'score'})
    
    # Evaluate matching
    try:
        eval_results = EntityMatchingEvaluator.evaluate_matching(
            correspondences=correspondences_for_eval,
            test_pairs=val_df,
            out_dir=OUTPUT_DIR / 'matching-evaluation-optimized',
            matcher_instance=optimized_matchers[edge_name]
        )
        
        optimized_matching_metrics_val[edge_name] = eval_results
        
        print(f"\n  Precision: {eval_results.get('precision', 0.0):.3f}")
        print(f"  Recall:    {eval_results.get('recall', 0.0):.3f}")
        print(f"  F1-Score:  {eval_results.get('f1', 0.0):.3f}")
        print(f"  TP: {eval_results.get('true_positives', 0)}")
        print(f"  FP: {eval_results.get('false_positives', 0)}")
        print(f"  FN: {eval_results.get('false_negatives', 0)}")
    except Exception as e:
        print(f"  PyDI evaluator failed: {e}")
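        # Manual evaluation fallback: count only labeled validation pairs
        labels = val_df['label'].astype(str).str.strip().str.upper()
        true_set = set(zip(val_df.loc[labels == 'TRUE', 'id1'], val_df.loc[labels == 'TRUE', 'id2']))
        false_set = set(zip(val_df.loc[labels == 'FALSE', 'id1'], val_df.loc[labels == 'FALSE', 'id2']))
        pred_set = set(zip(correspondences['id1'], correspondences['id2']))
        
        tp = len(true_set & pred_set)
        fp = len(false_set & pred_set)  # predictions outside the validation set are ignored
        fn = len(true_set - pred_set)
        
        precision = tp / (tp + fp) if (tp + fp) > 0 else 0.0
        recall = tp / (tp + fn) if (tp + fn) > 0 else 0.0
        f1 = 2 * precision * recall / (precision + recall) if (precision + recall) > 0 else 0.0
        
        # Store the metrics so the comparison in 4.7 also works after a fallback
        optimized_matching_metrics_val[edge_name] = {
            'precision': precision, 'recall': recall, 'f1': f1,
            'true_positives': tp, 'false_positives': fp, 'false_negatives': fn
        }
        
        print(f"\n  Precision: {precision:.3f}")
        print(f"  Recall:    {recall:.3f}")
        print(f"  F1-Score:  {f1:.3f}")
        print(f"  TP: {tp}")
        print(f"  FP: {fp}")
        print(f"  FN: {fn}")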

print("\n✓ Optimized matching evaluation complete")



=== LR: Optimized Matching Evaluation (Validation Set) ===


[INFO ] root - Confusion Matrix:
[INFO ] root -   True Positives:  28
[INFO ] root -   True Negatives:  66
[INFO ] root -   False Positives: 0
[INFO ] root -   False Negatives: 2
[INFO ] root - Performance Metrics:
[INFO ] root -   Accuracy:  0.979
[INFO ] root -   Precision: 1.000
[INFO ] root -   Recall:    0.933
[INFO ] root -   F1-Score:  0.966



  Precision: 1.000
  Recall:    0.933
  F1-Score:  0.966
  TP: 28
  FP: 0
  FN: 2

=== LS: Optimized Matching Evaluation (Validation Set) ===


[INFO ] root - Confusion Matrix:
[INFO ] root -   True Positives:  46
[INFO ] root -   True Negatives:  47
[INFO ] root -   False Positives: 0
[INFO ] root -   False Negatives: 3
[INFO ] root - Performance Metrics:
[INFO ] root -   Accuracy:  0.969
[INFO ] root -   Precision: 1.000
[INFO ] root -   Recall:    0.939
[INFO ] root -   F1-Score:  0.968



  Precision: 1.000
  Recall:    0.939
  F1-Score:  0.968
  TP: 46
  FP: 0
  FN: 3

✓ Optimized matching evaluation complete


### 4.7 Comparison: Original vs Optimized Matching

Compare the performance of original matching (threshold=0.7) with optimized matching (name variants + birth year constraint + threshold=0.75).


In [109]:
# Compare original vs optimized matching results

print("="*80)
print("Matching Performance Comparison: Original vs Optimized")
print("="*80)

comparison_data = []

for edge_name in ['LR', 'LS']:
    print(f"\n{edge_name} Edge:")
    print("-" * 80)
    
    # Original results
    orig_metrics = matching_metrics_val.get(edge_name, {})
    orig_precision = orig_metrics.get('precision', 0.0)
    orig_recall = orig_metrics.get('recall', 0.0)
    orig_f1 = orig_metrics.get('f1', 0.0)
    orig_tp = orig_metrics.get('true_positives', 0)
    orig_fp = orig_metrics.get('false_positives', 0)
    orig_fn = orig_metrics.get('false_negatives', 0)
    
    # Optimized results
    opt_metrics = optimized_matching_metrics_val.get(edge_name, {})
    opt_precision = opt_metrics.get('precision', 0.0)
    opt_recall = opt_metrics.get('recall', 0.0)
    opt_f1 = opt_metrics.get('f1', 0.0)
    opt_tp = opt_metrics.get('true_positives', 0)
    opt_fp = opt_metrics.get('false_positives', 0)
    opt_fn = opt_metrics.get('false_negatives', 0)
    
    # Calculate improvements
    precision_improvement = opt_precision - orig_precision
    recall_improvement = opt_recall - orig_recall
    f1_improvement = opt_f1 - orig_f1
    tp_improvement = opt_tp - orig_tp
    fp_improvement = opt_fp - orig_fp
    fn_improvement = opt_fn - orig_fn
    
    print(f"  Metric              | Original  | Optimized | Improvement")
    print(f"  --------------------|-----------|-----------|------------")
    print(f"  Precision           | {orig_precision:7.3f}  | {opt_precision:7.3f}  | {precision_improvement:+7.3f}")
    print(f"  Recall              | {orig_recall:7.3f}  | {opt_recall:7.3f}  | {recall_improvement:+7.3f}")
    print(f"  F1-Score            | {orig_f1:7.3f}  | {opt_f1:7.3f}  | {f1_improvement:+7.3f}")
    print(f"  True Positives      | {orig_tp:7d}  | {opt_tp:7d}  | {tp_improvement:+7d}")
    print(f"  False Positives     | {orig_fp:7d}  | {opt_fp:7d}  | {fp_improvement:+7d}")
    print(f"  False Negatives     | {orig_fn:7d}  | {opt_fn:7d}  | {fn_improvement:+7d}")
    
    comparison_data.append({
        'edge': edge_name,
        'original_precision': orig_precision,
        'optimized_precision': opt_precision,
        'original_recall': orig_recall,
        'optimized_recall': opt_recall,
        'original_f1': orig_f1,
        'optimized_f1': opt_f1,
        'precision_improvement': precision_improvement,
        'recall_improvement': recall_improvement,
        'f1_improvement': f1_improvement,
    })

print("\n" + "="*80)
print("Summary:")
print("="*80)
print(f"  Optimizations applied:")
print(f"    1. Name variant handling (dan/daniel, matt/matthew, etc.)")
print(f"    2. Raised threshold from 0.7 to 0.75")
print(f"    3. Birth year soft constraint (penalty for year_diff > 1)")

# Save comparison results
comparison_df = pd.DataFrame(comparison_data)
comparison_df.to_csv(OUTPUT_DIR / 'matching-comparison.csv', index=False)
print(f"\n  Comparison results saved to: {OUTPUT_DIR / 'matching-comparison.csv'}")


Matching Performance Comparison: Original vs Optimized

LR Edge:
--------------------------------------------------------------------------------
  Metric              | Original  | Optimized | Improvement
  --------------------|-----------|-----------|------------
  Precision           |   0.962  |   1.000  |  +0.038
  Recall              |   0.833  |   0.933  |  +0.100
  F1-Score            |   0.893  |   0.966  |  +0.073
  True Positives      |      25  |      28  |      +3
  False Positives     |       1  |       0  |      -1
  False Negatives     |       5  |       2  |      -3

LS Edge:
--------------------------------------------------------------------------------
  Metric              | Original  | Optimized | Improvement
  --------------------|-----------|-----------|------------
  Precision           |   0.953  |   1.000  |  +0.047
  Recall              |   0.837  |   0.939  |  +0.102
  F1-Score            |   0.891  |   0.968  |  +0.077
  True Positives      |      41  |   

## 5. ML-Based Matching

This section replaces the hand-tuned rules with scikit-learn classifiers. The comparator outputs serve as features, and each classifier is trained on labeled pairs to learn how to weight them.

**Feature Extraction**: Feature extraction converts record pairs into feature vectors using the set of comparators. The `FeatureExtractor` class handles this transformation.

**Workflow**:
1. Define feature extractors (comparators)
2. Extract features from training pairs
3. Train a classifier (e.g., RandomForestClassifier)
4. Apply ML-based matcher to candidate pairs
5. Evaluate performance
6. Error cases analysis
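
To make the feature extraction step concrete, the sketch below turns one record pair into a two-element feature vector using a character-level Levenshtein similarity and a word-level Jaccard similarity. It is written in plain Python and does not use PyDI; the function names (`sketch_levenshtein`, `sketch_jaccard_words`, `sketch_pair_features`) and the normalization by the longer string length are assumptions, so PyDI's `StringComparator` may return different values.

```python
from typing import Dict, List


def sketch_levenshtein(a: str, b: str) -> int:
    """Classic dynamic-programming edit distance (insert, delete, substitute)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1, curr[j - 1] + 1, prev[j - 1] + cost))
        prev = curr
    return prev[-1]


def sketch_levenshtein_sim(a: str, b: str) -> float:
    """Normalize the edit distance to a similarity in [0, 1] (assumed normalization)."""
    longest = max(len(a), len(b))
    return 1.0 if longest == 0 else 1.0 - sketch_levenshtein(a, b) / longest


def sketch_jaccard_words(a: str, b: str) -> float:
    """Jaccard similarity over whitespace-separated tokens."""
    tokens_a, tokens_b = set(a.split()), set(b.split())
    union = tokens_a | tokens_b
    return len(tokens_a & tokens_b) / len(union) if union else 1.0


def sketch_pair_features(left: Dict[str, str], right: Dict[str, str]) -> List[float]:
    """Build the feature vector for one candidate pair."""
    n1, n2 = left["full_name"].lower(), right["full_name"].lower()
    return [sketch_levenshtein_sim(n1, n2), sketch_jaccard_words(n1, n2)]


features = sketch_pair_features(
    {"full_name": "matthew bowman"}, {"full_name": "matt bowman"}
)
print(features)
```

For a nickname pair like this one, the word-level Jaccard score is low because only the last name token is shared, while the character-level score stays fairly high. Giving the classifier both signals lets it learn how to combine them.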


### 5.0 Export Candidate Error Cases

Persist currently detected validation errors to `data/output/gt/manual_cases/` so the ground-truth notebook can ingest them (section 5.8). Run this after completing the rule-based/optimized evaluations and before the ML feature extractor setup.

In [110]:
# Export candidate error cases for GT augmentation (section 5.8 in GT notebook)
import pandas as pd
from pathlib import Path

MANUAL_CASES_DIR = BASE_DIR / 'data' / 'output' / 'gt' / 'manual_cases'
MANUAL_CASES_DIR.mkdir(parents=True, exist_ok=True)

error_records = {'LR': [], 'LS': []}

def _collect_errors(result_dict, label: str):
    if result_dict is None:
        return
    for edge_name in ['LR', 'LS']:
        if edge_name not in result_dict:
            continue
        correspondences = result_dict[edge_name]
        val_df = splits[edge_name]['val'][['id1', 'id2', 'label']].copy()
        val_df['label'] = val_df['label'].astype(str).str.strip().str.upper()
        true_set = set(zip(val_df[val_df['label'] == 'TRUE']['id1'], val_df[val_df['label'] == 'TRUE']['id2']))
        false_set = set(zip(val_df[val_df['label'] == 'FALSE']['id1'], val_df[val_df['label'] == 'FALSE']['id2']))
        pred_set = set(zip(correspondences['id1'], correspondences['id2']))
        val_pairs = set(zip(val_df['id1'], val_df['id2']))

        fn_pairs = true_set - pred_set
        for id1, id2 in fn_pairs:
            error_records[edge_name].append({
                'id1': id1,
                'id2': id2,
                'label': 'TRUE',
                'edge': edge_name,
                'source': label,
                'error_type': 'FN'
            })

        fp_pairs = (pred_set & val_pairs) & false_set
        for id1, id2 in fp_pairs:
            error_records[edge_name].append({
                'id1': id1,
                'id2': id2,
                'label': 'FALSE',
                'edge': edge_name,
                'source': label,
                'error_type': 'FP'
            })

_collect_errors(globals().get('matching_results'), 'rule_based')
_collect_errors(globals().get('optimized_matching_results'), 'optimized')
_collect_errors(globals().get('logreg_matching_results'), 'logreg')
_collect_errors(globals().get('ml_matching_results'), 'random_forest')
_collect_errors(globals().get('gb_matching_results'), 'gradient_boosting')
_collect_errors(globals().get('xgb_matching_results'), 'xgboost')

for edge_name, records in error_records.items():
    if not records:
        continue
    df_edge = pd.DataFrame(records)
    manual_path = MANUAL_CASES_DIR / f'manual_cases_{edge_name}.csv'
    if manual_path.exists():
        existing = pd.read_csv(manual_path)
        # Align columns
        for col in df_edge.columns:
            if col not in existing.columns:
                existing[col] = ''
        for col in existing.columns:
            if col not in df_edge.columns:
                df_edge[col] = ''
        df_edge = pd.concat([existing, df_edge], ignore_index=True)
    df_edge = df_edge.drop_duplicates(subset=['id1', 'id2'])
    df_edge.to_csv(manual_path, index=False)
    print(f"[{edge_name}] Exported {len(df_edge)} cases to {manual_path}")

if not any(error_records.values()):
    print("No error cases collected (ensure evaluation cells were executed).")


[LR] Exported 17 cases to /Users/zhangzihan/Desktop/WBI_project/Schema_Mapped_Datasets/data/output/gt/manual_cases/manual_cases_LR.csv
[LS] Exported 21 cases to /Users/zhangzihan/Desktop/WBI_project/Schema_Mapped_Datasets/data/output/gt/manual_cases/manual_cases_LS.csv


### 5.1 Define Feature Extractors (Comparators)

Define feature extractors for ML training. The feature set adds a birth year comparator and several custom features to improve matching accuracy:
- **Name similarity**: Levenshtein and Jaccard similarity on normalized names
- **Birth year**: DateComparator to penalize pairs with different birth years (max_days_difference=365)
- **Birth year difference**: Numeric feature calculating absolute difference between birth years
- **Name variant flag**: Binary feature indicating if first names are known variants (e.g., dan/daniel, matt/matthew)
- **Phonetic match**: Soundex-based phonetic similarity to capture pronunciation variants


In [111]:
# Define feature extractors (comparators) for ML-based matching
# Enhanced with birth year comparator to improve matching accuracy

from PyDI.entitymatching import MLBasedMatcher, FeatureExtractor, StringComparator
from PyDI.entitymatching.comparators import DateComparator
from PyDI.io import load_csv
from sklearn.ensemble import RandomForestClassifier
from PyDI.entitymatching import VectorFeatureExtractor


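def feature_birth_year_diff(record1, record2):
    """Birth-year gap as a numeric feature; a missing year yields 0.0, the same value as an exact match."""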
    diff = birth_year_diff(record1.get('birth_year'), record2.get('birth_year'))
    return float(diff if diff is not None else 0.0)


def feature_name_variant_flag(record1, record2):
    left_name = str(record1.get('full_name_normalized', record1.get('full_name', ''))).lower()
    right_name = str(record2.get('full_name_normalized', record2.get('full_name', ''))).lower()
    first_left, last_left = split_first_last(left_name)
    first_right, last_right = split_first_last(right_name)
    if last_left and last_left == last_right and tokens_are_variants(first_left, first_right):
        return 1.0
    return 0.0


def _soundex(name: str) -> str:
    """Return a simple Soundex code to capture pronunciation."""
    if not name:
        return ""
    name = name.upper()
    first_letter = name[0]
    mapping = {
        **{c: "1" for c in "BFPV"},
        **{c: "2" for c in "CGJKQSXZ"},
        **{c: "3" for c in "DT"},
        "L": "4",
        **{c: "5" for c in "MN"},
        "R": "6",
    }
    prev_digit = mapping.get(first_letter, "")
    digits = []
    for ch in name[1:]:
        digit = mapping.get(ch, "")
        if digit and digit != prev_digit:
            digits.append(digit)
        prev_digit = digit
    code = (first_letter + "".join(digits) + "000")[:4]
    return code


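def feature_phonetic_match(record1, record2) -> float:
    """1.0 if both names share a Soundex code; the code covers the full name string but is truncated to 4 characters."""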
    left_name = str(record1.get('full_name_normalized', record1.get('full_name', '')))
    right_name = str(record2.get('full_name_normalized', record2.get('full_name', '')))
    left_code = _soundex(left_name)
    right_code = _soundex(right_name)
    if not left_code or not right_code:
        return 0.0
    return 1.0 if left_code == right_code else 0.0


ml_comparators = [
    StringComparator(
        column="full_name_normalized" if 'full_name_normalized' in L_full.columns else "full_name",
        similarity_function="levenshtein",
        preprocess=str.lower
    ),
    StringComparator(
        column="full_name_normalized" if 'full_name_normalized' in L_full.columns else "full_name",
        similarity_function="jaccard",
        tokenization="word",
        preprocess=str.lower
    ),
    # Enhanced: Add birth year comparator to penalize pairs with different birth years
    DateComparator(
        column="birth_year",
        max_days_difference=365  # Allow 1 year difference (365 days)
    ),
    {"function": feature_birth_year_diff, "name": "birth_year_diff"},
    {"function": feature_name_variant_flag, "name": "is_name_variant"},
    {"function": feature_phonetic_match, "name": "phonetic_soundex"}
]

# Initialize feature extractor
ml_feature_extractor = FeatureExtractor(ml_comparators)

print("ML-based matching feature extractors configured:")
print(f"  Feature functions: {len(ml_comparators)}")
print(f"    - Levenshtein distance on normalized name")
print(f"    - Jaccard similarity on normalized name (word tokenization)")
print(f"    - Birth year comparator (max_days_difference=365)")
print(f"    - Birth year difference (numeric feature)")
print(f"    - Name variant flag (last name match + first-name variants)")
print(f"    - Soundex phonetic match (captures pronunciation variants)")
print("✓ Feature extractor initialized")


ML-based matching feature extractors configured:
  Feature functions: 6
    - Levenshtein distance on normalized name
    - Jaccard similarity on normalized name (word tokenization)
    - Birth year comparator (max_days_difference=365)
    - Birth year difference (numeric feature)
    - Name variant flag (last name match + first-name variants)
    - Soundex phonetic match (captures pronunciation variants)
✓ Feature extractor initialized


### 5.2 Train and Apply LogisticRegression Matcher

Train a baseline LogisticRegression model using the shared feature extractor. This provides a lightweight reference point before the tree-based ensembles.
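
The training cell in 5.2.1 sets `class_weight='balanced'`. In scikit-learn this heuristic weights each class by `n_samples / (n_classes * class_count)`, so examples from the rarer class contribute more to the loss. The sketch below reproduces that formula in plain Python for the LR training split (90 matches, 214 non-matches) and shows how a logistic model maps a feature vector to a match probability. The helper names, coefficients, and intercept are hypothetical and are not the fitted values.

```python
import math
from typing import Dict, Sequence


def sketch_balanced_weights(class_counts: Dict[int, int]) -> Dict[int, float]:
    """Per-class weight: n_samples / (n_classes * count)."""
    n_samples = sum(class_counts.values())
    n_classes = len(class_counts)
    return {
        label: n_samples / (n_classes * count)
        for label, count in class_counts.items()
    }


def sketch_match_probability(
    features: Sequence[float], coefs: Sequence[float], intercept: float
) -> float:
    """Logistic model: sigmoid of the weighted feature sum plus the intercept."""
    z = intercept + sum(w * x for w, x in zip(coefs, features))
    return 1.0 / (1.0 + math.exp(-z))


# Class 1 = match, class 0 = non-match (LR training split).
weights = sketch_balanced_weights({1: 90, 0: 214})
print({label: round(w, 3) for label, w in weights.items()})

# Hypothetical coefficients and intercept, for illustration only.
p = sketch_match_probability([1.0, 0.8], coefs=[3.0, 2.5], intercept=-3.0)
print(round(p, 3))
```

With these counts a match weighs roughly 2.4 times as much as a non-match, which offsets the class imbalance in the LR training split.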


#### 5.2.1 Train LogisticRegression

In [112]:
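# Train LogisticRegression-based matchers for each edge
from sklearn.linear_model import LogisticRegression  # explicit import; harmless if already imported earlier
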
logreg_classifiers = {}
logreg_matchers = {}

for edge_name in ['LR', 'LS']:
    print(f"\n=== {edge_name}: Training LogisticRegression Matcher ===")
    
    left_df, right_df = source_tables[edge_name]
    train_df = splits[edge_name]['train'][['id1', 'id2', 'label']].copy()
    
    # Normalize labels
    train_df['label'] = train_df['label'].astype(str).str.strip().str.upper()
    train_df['label_binary'] = (train_df['label'] == 'TRUE').astype(int)
    
    print(f"  Training pairs: {len(train_df)}")
    print(f"    True matches: {train_df['label_binary'].sum()}")
    print(f"    False matches: {len(train_df) - train_df['label_binary'].sum()}")
    
    # Extract shared features
    train_features = ml_feature_extractor.create_features(
        df_left=left_df,
        df_right=right_df,
        pairs=train_df[['id1', 'id2']],
        labels=train_df['label_binary'],
        id_column='_rid'
    )
    
    feature_columns = [col for col in train_features.columns if col not in ['id1', 'id2', 'label']]
    X_train = train_features[feature_columns]
    y_train = train_features['label']
    
    print(f"  Training features shape: {X_train.shape}")
    print(f"  Feature columns: {feature_columns}")
    
    # Configure Logistic Regression
    clf = LogisticRegression(
        max_iter=1000,
        class_weight='balanced',
        solver='liblinear'
    )
    clf.fit(X_train, y_train)
    
    logreg_classifiers[edge_name] = clf
    logreg_matchers[edge_name] = MLBasedMatcher(ml_feature_extractor)
    
    # Inspect coefficients for interpretability
    coef_df = pd.DataFrame({
        'feature': feature_columns,
        'coefficient': clf.coef_[0]
    }).sort_values('coefficient', key=lambda s: s.abs(), ascending=False)
    print("\n  Top coefficients (magnitude):")
    print(coef_df.to_string(index=False))
    print(f"  Intercept: {clf.intercept_[0]:.4f}")
    print(f"  ✓ LogisticRegression trained for {edge_name}")

print("\n✓ LogisticRegression matchers trained for all edges")



=== LR: Training LogisticRegression Matcher ===
  Training pairs: 304
    True matches: 90
    False matches: 214


[INFO ] root - Label distribution: 90 positive, 214 negative


  Training features shape: (304, 6)
  Feature columns: ['StringComparator(full_name_normalized, levenshtein, tokenization=char, list_strategy=None)', 'StringComparator(full_name_normalized, jaccard, tokenization=word, list_strategy=None)', 'DateComparator(birth_year, list_strategy=None)', 'birth_year_diff', 'is_name_variant', 'phonetic_soundex']

  Top coefficients (magnitude):
                                                                                   feature  coefficient
                                                                           is_name_variant     3.174990
StringComparator(full_name_normalized, levenshtein, tokenization=char, list_strategy=None)     2.999598
                                                                           birth_year_diff    -1.675443
                                            DateComparator(birth_year, list_strategy=None)    -1.600098
                                                                          phonetic_soundex    -0.28

[INFO ] root - Label distribution: 142 positive, 154 negative


  Training features shape: (296, 6)
  Feature columns: ['StringComparator(full_name_normalized, levenshtein, tokenization=char, list_strategy=None)', 'StringComparator(full_name_normalized, jaccard, tokenization=word, list_strategy=None)', 'DateComparator(birth_year, list_strategy=None)', 'birth_year_diff', 'is_name_variant', 'phonetic_soundex']

  Top coefficients (magnitude):
                                                                                   feature  coefficient
                                                                           is_name_variant     3.381306
StringComparator(full_name_normalized, levenshtein, tokenization=char, list_strategy=None)     2.518544
                                            DateComparator(birth_year, list_strategy=None)    -1.884224
                                                                           birth_year_diff    -1.421372
    StringComparator(full_name_normalized, jaccard, tokenization=word, list_strategy=None)    -0.17
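
One way to read the coefficient tables above: in a logistic regression, a one-unit increase in a feature multiplies the odds of a match by exp(coefficient), with the other features held fixed. The sketch below applies that to four LR-edge coefficients copied from the output. The shortened feature names and the `odds_ratio` helper are illustrative, not part of the notebook. Because the features are not standardized, magnitudes are only comparable between features on a similar scale.

```python
import math

# LR-edge coefficients copied from the training output; names abbreviated for display
coefficients = {
    "is_name_variant": 3.174990,
    "levenshtein(full_name_normalized)": 2.999598,
    "birth_year_diff": -1.675443,
    "DateComparator(birth_year)": -1.600098,
}


def odds_ratio(coef: float) -> float:
    """Multiplicative change in match odds for a one-unit feature increase."""
    return math.exp(coef)


for name, coef in coefficients.items():
    direction = "raises" if coef > 0 else "lowers"
    print(f"{name:35s} coef={coef:+.3f}  {direction} odds by x{odds_ratio(coef):.2f}")
```

A positive coefficient pushes a pair toward "match" and a negative one pushes it away.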

#### 5.2.2 Apply LogisticRegression Matcher

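Score candidate pairs with the LogisticRegression classifier through the shared matching pipeline. Candidates are first restricted to pairs whose season_year values are present and equal on both sides; the matcher then applies a probability threshold of 0.4.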
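
The season_year constraint is applied the same way in each matcher cell, so it could be factored into a helper. The sketch below is one possible version on toy data. `filter_same_season` and the sample records are hypothetical, and it assumes `_rid` is unique in each table (the merge-based code in the notebook does not require that).

```python
import pandas as pd


def filter_same_season(cand_df, left_df, right_df,
                       id_column="_rid", season_column="season_year"):
    """Keep candidate pairs whose two records share a non-null season."""
    # Lookup series: record id -> season (assumes ids are unique per table)
    left_season = left_df.set_index(id_column)[season_column]
    right_season = right_df.set_index(id_column)[season_column]

    season_left = cand_df["id1"].map(left_season)
    season_right = cand_df["id2"].map(right_season)

    keep = season_left.notna() & season_right.notna() & (season_left == season_right)
    return cand_df.loc[keep, ["id1", "id2"]].reset_index(drop=True)


# Toy records (hypothetical ids and seasons)
left = pd.DataFrame({"_rid": ["a|2010|L", "b|2011|L", "c|2012|L"],
                     "season_year": [2010, 2011, None]})
right = pd.DataFrame({"_rid": ["x|2010|R", "y|2012|R"],
                      "season_year": [2010, 2012]})
cands = pd.DataFrame({"id1": ["a|2010|L", "b|2011|L", "c|2012|L"],
                      "id2": ["x|2010|R", "x|2010|R", "y|2012|R"]})

filtered = filter_same_season(cands, left, right)
print(filtered)
```

Only the first toy pair survives: the second has different seasons and the third has a missing season on the left side.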


In [113]:
# Apply LogisticRegression-based matcher to candidate pairs
logreg_matching_results = {}

for edge_name in ['LR', 'LS']:
    print(f"\n=== {edge_name}: LogisticRegression Matching ===")
    
    left_df, right_df = source_tables[edge_name]
    cand_df = candidates[edge_name].copy()
    matcher = logreg_matchers[edge_name]
    clf = logreg_classifiers[edge_name]
    
    # Apply season_year hard constraint
    print("  Filtering candidate pairs by season_year constraint...")
    cand_with_seasons = cand_df.merge(
        left_df[['_rid', 'season_year']],
        left_on='id1',
        right_on='_rid',
        how='left'
    ).merge(
        right_df[['_rid', 'season_year']],
        left_on='id2',
        right_on='_rid',
        how='left',
        suffixes=('', '_right')
    )
    
    cand_df_filtered = cand_with_seasons[
        (cand_with_seasons['season_year'].notna()) &
        (cand_with_seasons['season_year_right'].notna()) &
        (cand_with_seasons['season_year'] == cand_with_seasons['season_year_right'])
    ][['id1', 'id2']].copy()
    
    print(f"  After season_year filter: {len(cand_df_filtered):,} candidate pairs (from {len(cand_df):,})")
    
    correspondences = matcher.match(
        df_left=left_df,
        df_right=right_df,
        candidates=cand_df_filtered,
        id_column='_rid',
        use_probabilities=True,
        trained_classifier=clf,
        threshold=0.4
    )
    
    logreg_matching_results[edge_name] = correspondences
    
    print(f"  Generated {len(correspondences):,} matched pairs")
    if 'score' in correspondences.columns:
        print(f"  Score range: [{correspondences['score'].min():.3f}, {correspondences['score'].max():.3f}]")
        print(f"  Mean score: {correspondences['score'].mean():.3f}")

print("\n✓ LogisticRegression matching complete for all edges")



=== LR: LogisticRegression Matching ===
  Filtering candidate pairs by season_year constraint...


[INFO ] PyDI.entitymatching.ml_based.MLBasedMatcher - Starting Entity Matching
[INFO ] PyDI.entitymatching.ml_based.MLBasedMatcher - Blocking 106553 x 15215 elements
[INFO ] PyDI.entitymatching.ml_based.MLBasedMatcher - Matching 106553 x 15215 elements after 0:00:0.000; 135944 blocked pairs (reduction ratio: 0.9999161462661056)


  After season_year filter: 135,944 candidate pairs (from 5,994,373)


[INFO ] PyDI.entitymatching.ml_based.MLBasedMatcher - Entity Matching finished after 0:00:49.957; found 23177 correspondences.


  Generated 23,177 matched pairs
  Score range: [0.402, 0.980]
  Mean score: 0.790

=== LS: LogisticRegression Matching ===
  Filtering candidate pairs by season_year constraint...


[INFO ] PyDI.entitymatching.ml_based.MLBasedMatcher - Starting Entity Matching
[INFO ] PyDI.entitymatching.ml_based.MLBasedMatcher - Blocking 106553 x 6743 elements
[INFO ] PyDI.entitymatching.ml_based.MLBasedMatcher - Matching 106553 x 6743 elements after 0:00:0.000; 9729 blocked pairs (reduction ratio: 0.9999864590429076)


  After season_year filter: 9,729 candidate pairs (from 227,185)


[INFO ] PyDI.entitymatching.ml_based.MLBasedMatcher - Entity Matching finished after 0:00:3.250; found 6853 correspondences.


  Generated 6,853 matched pairs
  Score range: [0.400, 0.979]
  Mean score: 0.951

✓ LogisticRegression matching complete for all edges


#### 5.2.3 Evaluate LogisticRegression Matching

Assess LogisticRegression performance on the validation set using the PyDI evaluator.
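
As a sanity check on the reported metrics, precision, recall, F1, and accuracy can be recomputed from the confusion-matrix counts in the evaluator log. The helper name `metrics_from_counts` is illustrative; the counts are taken from the validation output for the two edges.

```python
def metrics_from_counts(tp, fp, fn, tn):
    """Derive standard binary-classification metrics from confusion counts."""
    precision = tp / (tp + fp) if (tp + fp) > 0 else 0.0
    recall = tp / (tp + fn) if (tp + fn) > 0 else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if (precision + recall) > 0 else 0.0)
    total = tp + fp + fn + tn
    accuracy = (tp + tn) / total if total > 0 else 0.0
    return {"precision": precision, "recall": recall, "f1": f1, "accuracy": accuracy}


# Counts from the LogisticRegression validation logs
lr_metrics = metrics_from_counts(tp=30, fp=3, fn=0, tn=63)
ls_metrics = metrics_from_counts(tp=49, fp=2, fn=0, tn=45)

for edge, m in [("LR", lr_metrics), ("LS", ls_metrics)]:
    print(edge, {k: f"{v:.3f}" for k, v in m.items()})
```

With recall at 1.000 on both edges, the F1 gap between LR and LS comes entirely from the false positives.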


In [114]:
# Evaluate LogisticRegression matching on validation set
logreg_matching_metrics_val = {}

for edge_name in ['LR', 'LS']:
    print(f"\n=== {edge_name}: LogisticRegression Evaluation (Validation Set) ===")
    
    correspondences = logreg_matching_results[edge_name]
    val_df = splits[edge_name]['val'][['id1', 'id2', 'label']].copy()
    
    correspondences_for_eval = correspondences.copy()
    if 'score' not in correspondences_for_eval.columns and 'sim' in correspondences_for_eval.columns:
        correspondences_for_eval = correspondences_for_eval.rename(columns={'sim': 'score'})
    
    try:
        eval_results = EntityMatchingEvaluator.evaluate_matching(
            correspondences=correspondences_for_eval,
            test_pairs=val_df,
            out_dir=OUTPUT_DIR / 'matching-evaluation-logreg',
            matcher_instance=logreg_matchers[edge_name]
        )
        logreg_matching_metrics_val[edge_name] = eval_results
        
        print(f"\n  Precision: {eval_results.get('precision', 0.0):.3f}")
        print(f"  Recall:    {eval_results.get('recall', 0.0):.3f}")
        print(f"  F1-Score:  {eval_results.get('f1', 0.0):.3f}")
        print(f"  TP: {eval_results.get('true_positives', 0)}")
        print(f"  FP: {eval_results.get('false_positives', 0)}")
        print(f"  FN: {eval_results.get('false_negatives', 0)}")
    except Exception as exc:
        print(f"  PyDI evaluator failed: {exc}")
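        true_matches = val_df[val_df['label'].astype(str).str.strip().str.upper() == 'TRUE']
        true_set = set(zip(true_matches['id1'], true_matches['id2']))
        val_set = set(zip(val_df['id1'], val_df['id2']))
        # Score only labeled validation pairs; predictions outside the split are not false positives
        pred_set = set(zip(correspondences['id1'], correspondences['id2'])) & val_set
        
        tp = len(true_set & pred_set)
        fp = len(pred_set - true_set)
        fn = len(true_set - pred_set)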
        
        precision = tp / (tp + fp) if (tp + fp) > 0 else 0.0
        recall = tp / (tp + fn) if (tp + fn) > 0 else 0.0
        f1 = 2 * precision * recall / (precision + recall) if (precision + recall) > 0 else 0.0
        
        logreg_matching_metrics_val[edge_name] = {
            'precision': precision,
            'recall': recall,
            'f1': f1,
            'true_positives': tp,
            'false_positives': fp,
            'false_negatives': fn
        }
        
        print(f"\n  Precision: {precision:.3f}")
        print(f"  Recall:    {recall:.3f}")
        print(f"  F1-Score:  {f1:.3f}")
        print(f"  TP: {tp}")
        print(f"  FP: {fp}")
        print(f"  FN: {fn}")

print("\n✓ LogisticRegression evaluation complete")



=== LR: LogisticRegression Evaluation (Validation Set) ===


[INFO ] root - Confusion Matrix:
[INFO ] root -   True Positives:  30
[INFO ] root -   True Negatives:  63
[INFO ] root -   False Positives: 3
[INFO ] root -   False Negatives: 0
[INFO ] root - Performance Metrics:
[INFO ] root -   Accuracy:  0.969
[INFO ] root -   Precision: 0.909
[INFO ] root -   Recall:    1.000
[INFO ] root -   F1-Score:  0.952



  Precision: 0.909
  Recall:    1.000
  F1-Score:  0.952
  TP: 30
  FP: 3
  FN: 0

=== LS: LogisticRegression Evaluation (Validation Set) ===


[INFO ] root - Confusion Matrix:
[INFO ] root -   True Positives:  49
[INFO ] root -   True Negatives:  45
[INFO ] root -   False Positives: 2
[INFO ] root -   False Negatives: 0
[INFO ] root - Performance Metrics:
[INFO ] root -   Accuracy:  0.979
[INFO ] root -   Precision: 0.961
[INFO ] root -   Recall:    1.000
[INFO ] root -   F1-Score:  0.980



  Precision: 0.961
  Recall:    1.000
  F1-Score:  0.980
  TP: 49
  FP: 2
  FN: 0

✓ LogisticRegression evaluation complete


#### 5.2.4 Analyze LogisticRegression Error Cases

Inspect false positives and false negatives for the LogisticRegression matcher on the validation set, using the same diagnostic template.
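
The error buckets reduce to set operations on (id1, id2) tuples. The toy sketch below uses hypothetical ids and scores. It also separates out predictions that have no validation label at all, which are neither false positives nor true positives for this evaluation.

```python
import pandas as pd

# Toy validation split and matcher output (hypothetical values)
val_df = pd.DataFrame({
    "id1": ["l1", "l2", "l3", "l4"],
    "id2": ["r1", "r2", "r3", "r4"],
    "label": ["TRUE", "true ", "FALSE", "FALSE"],
})
correspondences = pd.DataFrame({
    "id1": ["l1", "l3", "l9"],
    "id2": ["r1", "r3", "r9"],
    "score": [0.95, 0.41, 0.88],
})

# Normalize labels the same way the notebook does
labels = val_df["label"].astype(str).str.strip().str.upper()
pairs = list(zip(val_df["id1"], val_df["id2"]))
true_set = {p for p, lab in zip(pairs, labels) if lab == "TRUE"}
false_set = {p for p, lab in zip(pairs, labels) if lab == "FALSE"}
pred_set = set(zip(correspondences["id1"], correspondences["id2"]))

fn_pairs = true_set - pred_set                # labeled matches that were missed
fp_pairs = pred_set & false_set               # labeled non-matches that were predicted
unlabeled = pred_set - true_set - false_set   # predictions without a validation label

# Score lookup keyed by pair, avoiding repeated DataFrame filtering
score_by_pair = dict(zip(zip(correspondences["id1"], correspondences["id2"]),
                         correspondences["score"]))
for pair in sorted(fp_pairs):
    print(pair, f"score={score_by_pair[pair]:.3f}")
```

Since every labeled non-match is part of the validation set, `pred_set & false_set` selects the same pairs as `(pred_set & val_set) & false_set`.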


In [115]:
# Analyze error cases for LogisticRegression
for edge_name in ['LR', 'LS']:
    print(f"\n{'='*80}")
    print(f"{edge_name}: LogisticRegression Error Cases Analysis")
    print(f"{'='*80}")
    
    left_df, right_df = source_tables[edge_name]
    val_df = splits[edge_name]['val'][['id1', 'id2', 'label']].copy()
    correspondences = logreg_matching_results[edge_name]
    
    true_matches = val_df[val_df['label'].astype(str).str.strip().str.upper() == 'TRUE']
    false_matches = val_df[val_df['label'].astype(str).str.strip().str.upper() == 'FALSE']
    true_set = set(zip(true_matches['id1'], true_matches['id2']))
    false_set = set(zip(false_matches['id1'], false_matches['id2']))
    pred_set = set(zip(correspondences['id1'], correspondences['id2']))
    val_set = set(zip(val_df['id1'], val_df['id2']))
    
    # False negatives
    fn_pairs = true_set - pred_set
    print(f"\n[FALSE NEGATIVES] ({len(fn_pairs)} cases):")
    if fn_pairs:
        for id1, id2 in fn_pairs:
            left_rec = left_df[left_df['_rid'] == id1].iloc[0] if len(left_df[left_df['_rid'] == id1]) > 0 else None
            right_rec = right_df[right_df['_rid'] == id2].iloc[0] if len(right_df[right_df['_rid'] == id2]) > 0 else None
            if left_rec is not None and right_rec is not None:
                print(f"\n  {id1} <-> {id2}")
                print(f"    Left:  '{left_rec.get('full_name', 'N/A')}' | Season: {left_rec.get('season_year', 'N/A')} | Birth: {left_rec.get('birth_year', 'N/A')}")
                print(f"    Right: '{right_rec.get('full_name', 'N/A')}' | Season: {right_rec.get('season_year', 'N/A')} | Birth: {right_rec.get('birth_year', 'N/A')}")
                in_candidates = len(candidates[edge_name][
                    (candidates[edge_name]['id1'] == id1) &
                    (candidates[edge_name]['id2'] == id2)
                ]) > 0
                print(f"    In candidates: {in_candidates}")
                score_row = correspondences[(correspondences['id1'] == id1) & (correspondences['id2'] == id2)]
                if len(score_row) > 0 and 'score' in score_row.columns:
                    print(f"    LogReg Score: {score_row['score'].iloc[0]:.3f}")
    else:
        print("  No false negatives found!")
    
    # False positives
    fp_pairs = (pred_set & val_set) & false_set
    print(f"\n[FALSE POSITIVES] ({len(fp_pairs)} cases):")
    if fp_pairs:
        for id1, id2 in fp_pairs:
            left_rec = left_df[left_df['_rid'] == id1].iloc[0] if len(left_df[left_df['_rid'] == id1]) > 0 else None
            right_rec = right_df[right_df['_rid'] == id2].iloc[0] if len(right_df[right_df['_rid'] == id2]) > 0 else None
            score_row = correspondences[(correspondences['id1'] == id1) & (correspondences['id2'] == id2)]
            score = score_row['score'].iloc[0] if len(score_row) > 0 and 'score' in score_row.columns else None
            if left_rec is not None and right_rec is not None:
                score_str = f"{score:.3f}" if score is not None else "N/A"
                print(f"\n  {id1} <-> {id2} (score: {score_str})")
                print(f"    Left:  '{left_rec.get('full_name', 'N/A')}' | Season: {left_rec.get('season_year', 'N/A')} | Birth: {left_rec.get('birth_year', 'N/A')}")
                print(f"    Right: '{right_rec.get('full_name', 'N/A')}' | Season: {right_rec.get('season_year', 'N/A')} | Birth: {right_rec.get('birth_year', 'N/A')}")
    else:
        print("  No false positives found!")

print("\n✓ LogisticRegression error analysis complete")



LR: LogisticRegression Error Cases Analysis

[FALSE NEGATIVES] (0 cases):
  No false negatives found!

[FALSE POSITIVES] (3 cases):

  4082302010|2010|L <-> 1502682010|2010|R (score: 0.410)
    Left:  'pedro feliciano' | Season: 2010 | Birth: 1976
    Right: 'pedro feliz' | Season: 2010 | Birth: 1975

  5169492015|2015|L <-> 5023272015|2015|R (score: 0.410)
    Left:  'hector sanchez' | Season: 2015 | Birth: 1989
    Right: 'héctor santiago' | Season: 2015 | Birth: 1988

  4585892010|2010|L <-> 4566962010|2010|R (score: 0.478)
    Left:  'david herndon' | Season: 2010 | Birth: 1985
    Right: 'david hernandez' | Season: 2010 | Birth: 1985

LS: LogisticRegression Error Cases Analysis

[FALSE NEGATIVES] (0 cases):
  No false negatives found!

[FALSE POSITIVES] (2 cases):

  5167142015|2015|L <-> 6458482015|2015|S (score: 0.426)
    Left:  'dario alvarez' | Season: 2015 | Birth: 1989
    Right: 'dariel alvarez' | Season: 2015 | Birth: 1989

  6703512022|2022|L <-> 6689422022|2022|

### 5.3 Train and Apply RandomForestClassifier Matcher

Extract features from training pairs and train a RandomForestClassifier.
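
The LR edge has 90 positive and 214 negative training pairs. The LogisticRegression configuration earlier sets `class_weight='balanced'`, while the RandomForestClassifier configuration in 5.3.1 leaves class weights unset. As a rough illustration of what "balanced" means, the sketch below computes the weights scikit-learn would derive for that label distribution. The synthetic `y_train` only reproduces the counts, not the real labels.

```python
import numpy as np
from sklearn.utils.class_weight import compute_class_weight

# Synthetic label vector with the LR-edge distribution: 90 matches, 214 non-matches
y_train = np.array([1] * 90 + [0] * 214)
classes = np.array([0, 1])

# "balanced" weights are n_samples / (n_classes * count_per_class)
weights = compute_class_weight(class_weight="balanced", classes=classes, y=y_train)
weight_by_class = dict(zip(classes.tolist(), weights.tolist()))

for cls, w in weight_by_class.items():
    print(f"class {cls}: weight {w:.3f}")
```

The minority (match) class receives the larger weight, so each match contributes more to the loss than each non-match.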


#### 5.3.1 Train RandomForestClassifier

In [116]:
# Train RandomForest-based matchers for each edge
ml_classifiers = {}
ml_matchers = {}

for edge_name in ['LR', 'LS']:
    print(f"\n=== {edge_name}: Training ML-Based Matcher ===")
    
    left_df, right_df = source_tables[edge_name]
    train_df = splits[edge_name]['train'][['id1', 'id2', 'label']].copy()
    
    # Prepare labels: convert 'TRUE'/'FALSE' to 1/0
    train_df['label'] = train_df['label'].astype(str).str.strip().str.upper()
    train_df['label_binary'] = (train_df['label'] == 'TRUE').astype(int)
    
    print(f"  Training pairs: {len(train_df)}")
    print(f"    True matches: {train_df['label_binary'].sum()}")
    print(f"    False matches: {len(train_df) - train_df['label_binary'].sum()}")
    
    # Extract features for training pairs
    print(f"  Extracting features from training pairs...")
    train_features = ml_feature_extractor.create_features(
        df_left=left_df,
        df_right=right_df,
        pairs=train_df[['id1', 'id2']],
        labels=train_df['label_binary'],
        id_column='_rid'
    )
    
    print(f"  Extracted features: {len(train_features)} pairs")
    print(f"  Feature columns: {len([col for col in train_features.columns if col not in ['id1', 'id2', 'label']])}")
    
    # Prepare training data
    feature_columns = [col for col in train_features.columns if col not in ['id1', 'id2', 'label']]
    X_train = train_features[feature_columns]
    y_train = train_features['label']
    
    print(f"  Training features shape: {X_train.shape}")
    print(f"  Training labels distribution: {y_train.value_counts().to_dict()}")
    
    # Train classifier
    print(f"  Training RandomForestClassifier...")
    clf = RandomForestClassifier(
        n_estimators=100,
        max_depth=10,
        min_samples_split=5,
        min_samples_leaf=2,
        random_state=42,
        n_jobs=-1
    )
    clf.fit(X_train, y_train)
    
    # Store classifier and create matcher
    ml_classifiers[edge_name] = clf
    ml_matchers[edge_name] = MLBasedMatcher(ml_feature_extractor)
    
    # Log feature importance if available
    if hasattr(clf, 'feature_importances_'):
        feature_importance = pd.DataFrame({
            'feature': feature_columns,
            'importance': clf.feature_importances_
        }).sort_values('importance', ascending=False)
        print(f"\n  Top 5 Feature Importances:")
        for idx, row in feature_importance.head(5).iterrows():
            print(f"    {row['feature']}: {row['importance']:.4f}")
    
    print(f"  ✓ Classifier trained for {edge_name}")

print("\n✓ ML-based matchers trained for all edges")



=== LR: Training ML-Based Matcher ===
  Training pairs: 304
    True matches: 90
    False matches: 214
  Extracting features from training pairs...


[INFO ] root - Label distribution: 90 positive, 214 negative


  Extracted features: 304 pairs
  Feature columns: 6
  Training features shape: (304, 6)
  Training labels distribution: {0: 214, 1: 90}
  Training RandomForestClassifier...

  Top 5 Feature Importances:
    is_name_variant: 0.3784
    StringComparator(full_name_normalized, levenshtein, tokenization=char, list_strategy=None): 0.3433
    birth_year_diff: 0.1693
    StringComparator(full_name_normalized, jaccard, tokenization=word, list_strategy=None): 0.0890
    DateComparator(birth_year, list_strategy=None): 0.0102
  ✓ Classifier trained for LR

=== LS: Training ML-Based Matcher ===
  Training pairs: 296
    True matches: 142
    False matches: 154
  Extracting features from training pairs...


[INFO ] root - Label distribution: 142 positive, 154 negative


  Extracted features: 296 pairs
  Feature columns: 6
  Training features shape: (296, 6)
  Training labels distribution: {0: 154, 1: 142}
  Training RandomForestClassifier...

  Top 5 Feature Importances:
    StringComparator(full_name_normalized, levenshtein, tokenization=char, list_strategy=None): 0.3548
    is_name_variant: 0.3343
    birth_year_diff: 0.1563
    StringComparator(full_name_normalized, jaccard, tokenization=word, list_strategy=None): 0.1235
    phonetic_soundex: 0.0251
  ✓ Classifier trained for LS

✓ ML-based matchers trained for all edges


#### 5.3.2 Apply RandomForest Matcher

Apply the trained RandomForest-based matcher to candidate pairs from the blocking phase.
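
The run below reports a score range of [1.000, 1.000]. Unlike the LogisticRegression cell, this call does not pass `use_probabilities=True` or a `threshold`, so the scores are consistent with hard class labels being returned. That reading of PyDI's behavior is an assumption, not something the notebook confirms. The scikit-learn sketch below, on synthetic data, shows the difference between hard labels and thresholded probabilities.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

# Synthetic features and labels (illustrative only)
rng = np.random.default_rng(42)
X = rng.random((200, 3))
y = (X[:, 0] + 0.1 * rng.standard_normal(200) > 0.5).astype(int)

clf = RandomForestClassifier(n_estimators=50, random_state=42).fit(X, y)

# Hard labels: every accepted pair carries the value 1, so "scores" are all 1.0
hard = clf.predict(X)

# Probabilities: a graded score that can be thresholded and later ranked
proba = clf.predict_proba(X)[:, 1]
threshold = 0.4
kept_scores = proba[proba >= threshold]

print("hard label values:", np.unique(hard))
print(f"kept {kept_scores.size} pairs, "
      f"score range [{kept_scores.min():.3f}, {kept_scores.max():.3f}]")
```

Graded scores matter downstream: a one-to-one global matching step needs them to rank competing correspondences for the same record.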


In [117]:
# Apply ML-based matching to candidate pairs
ml_matching_results = {}

for edge_name in ['LR', 'LS']:
    print(f"\n=== {edge_name}: ML-Based Matching ===")
    
    left_df, right_df = source_tables[edge_name]
    cand_df = candidates[edge_name].copy()
    matcher = ml_matchers[edge_name]
    clf = ml_classifiers[edge_name]
    
    # Apply season_year hard constraint: filter candidate pairs where season_year matches
    print(f"  Filtering candidate pairs by season_year constraint...")
    cand_with_seasons = cand_df.merge(
        left_df[['_rid', 'season_year']],
        left_on='id1',
        right_on='_rid',
        how='left'
    ).merge(
        right_df[['_rid', 'season_year']],
        left_on='id2',
        right_on='_rid',
        how='left',
        suffixes=('', '_right')
    )
    
    # Filter: season_year must match exactly
    cand_df_filtered = cand_with_seasons[
        (cand_with_seasons['season_year'].notna()) & 
        (cand_with_seasons['season_year_right'].notna()) &
        (cand_with_seasons['season_year'] == cand_with_seasons['season_year_right'])
    ][['id1', 'id2']].copy()
    
    print(f"  After season_year filter: {len(cand_df_filtered):,} candidate pairs (from {len(cand_df):,})")
    
    # Apply ML-based matcher
    print(f"  Applying ML-based matcher...")
    correspondences = matcher.match(
        df_left=left_df,
        df_right=right_df,
        candidates=cand_df_filtered,
        id_column='_rid',
        trained_classifier=clf
    )
    
    # Store results
    ml_matching_results[edge_name] = correspondences
    
    print(f"  Generated {len(correspondences):,} matched pairs")
    if 'score' in correspondences.columns:
        print(f"  Score range: [{correspondences['score'].min():.3f}, {correspondences['score'].max():.3f}]")
        print(f"  Mean score: {correspondences['score'].mean():.3f}")

print("\n✓ ML-based matching complete for all edges")



=== LR: ML-Based Matching ===
  Filtering candidate pairs by season_year constraint...


[INFO ] PyDI.entitymatching.ml_based.MLBasedMatcher - Starting Entity Matching
[INFO ] PyDI.entitymatching.ml_based.MLBasedMatcher - Blocking 106553 x 15215 elements
[INFO ] PyDI.entitymatching.ml_based.MLBasedMatcher - Matching 106553 x 15215 elements after 0:00:0.000; 135944 blocked pairs (reduction ratio: 0.9999161462661056)


  After season_year filter: 135,944 candidate pairs (from 5,994,373)
  Applying ML-based matcher...


[INFO ] PyDI.entitymatching.ml_based.MLBasedMatcher - Entity Matching finished after 0:00:50.375; found 15375 correspondences.


  Generated 15,375 matched pairs
  Score range: [1.000, 1.000]
  Mean score: 1.000

=== LS: ML-Based Matching ===
  Filtering candidate pairs by season_year constraint...


[INFO ] PyDI.entitymatching.ml_based.MLBasedMatcher - Starting Entity Matching
[INFO ] PyDI.entitymatching.ml_based.MLBasedMatcher - Blocking 106553 x 6743 elements
[INFO ] PyDI.entitymatching.ml_based.MLBasedMatcher - Matching 106553 x 6743 elements after 0:00:0.000; 9729 blocked pairs (reduction ratio: 0.9999864590429076)


  After season_year filter: 9,729 candidate pairs (from 227,185)
  Applying ML-based matcher...


[INFO ] PyDI.entitymatching.ml_based.MLBasedMatcher - Entity Matching finished after 0:00:3.921; found 6727 correspondences.


  Generated 6,727 matched pairs
  Score range: [1.000, 1.000]
  Mean score: 1.000

✓ ML-based matching complete for all edges


#### 5.3.3 Evaluate RandomForest Matching

Evaluate the performance of the RandomForest-based matcher on the validation set.


In [118]:
# Evaluate ML-based matching on validation set

ml_matching_metrics_val = {}

for edge_name in ['LR', 'LS']:
    print(f"\n=== {edge_name}: ML-Based Matching Evaluation (Validation Set) ===")
    
    correspondences = ml_matching_results[edge_name]
    val_df = splits[edge_name]['val'][['id1', 'id2', 'label']].copy()
    
    # Rename 'score' to match PyDI evaluator expectations (if needed)
    correspondences_for_eval = correspondences.copy()
    if 'score' not in correspondences_for_eval.columns and 'sim' in correspondences_for_eval.columns:
        correspondences_for_eval = correspondences_for_eval.rename(columns={'sim': 'score'})
    
    # Evaluate matching
    try:
        eval_results = EntityMatchingEvaluator.evaluate_matching(
            correspondences=correspondences_for_eval,
            test_pairs=val_df,
            out_dir=OUTPUT_DIR / 'matching-evaluation-ml',
            matcher_instance=ml_matchers[edge_name]
        )
        
        ml_matching_metrics_val[edge_name] = eval_results
        
        print(f"\n  Precision: {eval_results.get('precision', 0.0):.3f}")
        print(f"  Recall:    {eval_results.get('recall', 0.0):.3f}")
        print(f"  F1-Score:  {eval_results.get('f1', 0.0):.3f}")
        print(f"  TP: {eval_results.get('true_positives', 0)}")
        print(f"  FP: {eval_results.get('false_positives', 0)}")
        print(f"  FN: {eval_results.get('false_negatives', 0)}")
    except Exception as e:
        print(f"  PyDI evaluator failed: {e}")
        # Manual evaluation fallback
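        true_matches = val_df[val_df['label'].astype(str).str.strip().str.upper() == 'TRUE']
        true_set = set(zip(true_matches['id1'], true_matches['id2']))
        val_set = set(zip(val_df['id1'], val_df['id2']))
        # Score only labeled validation pairs; predictions outside the split are not false positives
        pred_set = set(zip(correspondences['id1'], correspondences['id2'])) & val_set
        
        tp = len(true_set & pred_set)
        fp = len(pred_set - true_set)
        fn = len(true_set - pred_set)
        
        precision = tp / (tp + fp) if (tp + fp) > 0 else 0.0
        recall = tp / (tp + fn) if (tp + fn) > 0 else 0.0
        f1 = 2 * precision * recall / (precision + recall) if (precision + recall) > 0 else 0.0
        
        # Store fallback metrics so later cells can read them, as the LogisticRegression cell does
        ml_matching_metrics_val[edge_name] = {
            'precision': precision,
            'recall': recall,
            'f1': f1,
            'true_positives': tp,
            'false_positives': fp,
            'false_negatives': fn
        }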
        
        print(f"\n  Precision: {precision:.3f}")
        print(f"  Recall:    {recall:.3f}")
        print(f"  F1-Score:  {f1:.3f}")
        print(f"  TP: {tp}")
        print(f"  FP: {fp}")
        print(f"  FN: {fn}")

print("\n✓ ML-based matching evaluation complete")



=== LR: ML-Based Matching Evaluation (Validation Set) ===


[INFO ] root - Confusion Matrix:
[INFO ] root -   True Positives:  30
[INFO ] root -   True Negatives:  66
[INFO ] root -   False Positives: 0
[INFO ] root -   False Negatives: 0
[INFO ] root - Performance Metrics:
[INFO ] root -   Accuracy:  1.000
[INFO ] root -   Precision: 1.000
[INFO ] root -   Recall:    1.000
[INFO ] root -   F1-Score:  1.000



  Precision: 1.000
  Recall:    1.000
  F1-Score:  1.000
  TP: 30
  FP: 0
  FN: 0

=== LS: ML-Based Matching Evaluation (Validation Set) ===


[INFO ] root - Confusion Matrix:
[INFO ] root -   True Positives:  49
[INFO ] root -   True Negatives:  45
[INFO ] root -   False Positives: 2
[INFO ] root -   False Negatives: 0
[INFO ] root - Performance Metrics:
[INFO ] root -   Accuracy:  0.979
[INFO ] root -   Precision: 0.961
[INFO ] root -   Recall:    1.000
[INFO ] root -   F1-Score:  0.980



  Precision: 0.961
  Recall:    1.000
  F1-Score:  0.980
  TP: 49
  FP: 2
  FN: 0

✓ ML-based matching evaluation complete


#### 5.3.4 Analyze RandomForest Error Cases

Analyze False Positives and False Negatives for RandomForestClassifier to identify patterns for further improvement.


In [119]:
# Analyze error cases for RandomForestClassifier (reusing code structure from 3.2)

for edge_name in ['LR', 'LS']:
    print(f"\n{'='*80}")
    print(f"{edge_name}: RandomForestClassifier Error Cases Analysis")
    print(f"{'='*80}")
    
    left_df, right_df = source_tables[edge_name]
    val_df = splits[edge_name]['val'][['id1', 'id2', 'label']].copy()
    correspondences = ml_matching_results[edge_name]
    
    # Get true matches and false matches in validation set
    true_matches = val_df[val_df['label'].astype(str).str.strip().str.upper() == 'TRUE']
    false_matches = val_df[val_df['label'].astype(str).str.strip().str.upper() == 'FALSE']
    true_set = set(zip(true_matches['id1'], true_matches['id2']))
    false_set = set(zip(false_matches['id1'], false_matches['id2']))
    
    # All predicted matches; restricted to validation pairs below when collecting false positives
    pred_set = set(zip(correspondences['id1'], correspondences['id2']))
    val_set = set(zip(val_df['id1'], val_df['id2']))  # All pairs in validation set
    
    # False Negatives: True matches in validation set that were not predicted
    fn_pairs = true_set - pred_set
    print(f"\n[FALSE NEGATIVES] ({len(fn_pairs)} cases):")
    if len(fn_pairs) > 0:
        for id1, id2 in fn_pairs:
            left_rec = left_df[left_df['_rid'] == id1].iloc[0] if len(left_df[left_df['_rid'] == id1]) > 0 else None
            right_rec = right_df[right_df['_rid'] == id2].iloc[0] if len(right_df[right_df['_rid'] == id2]) > 0 else None
            
            if left_rec is not None and right_rec is not None:
                print(f"\n  {id1} <-> {id2}")
                print(f"    Left:  '{left_rec.get('full_name', 'N/A')}' | Season: {left_rec.get('season_year', 'N/A')} | Birth: {left_rec.get('birth_year', 'N/A')}")
                print(f"    Right: '{right_rec.get('full_name', 'N/A')}' | Season: {right_rec.get('season_year', 'N/A')} | Birth: {right_rec.get('birth_year', 'N/A')}")
                
                # Check if in candidates
                in_candidates = len(candidates[edge_name][
                    (candidates[edge_name]['id1'] == id1) & 
                    (candidates[edge_name]['id2'] == id2)
                ]) > 0
                print(f"    In candidates: {in_candidates}")
                
                # Look up the pair in the returned correspondences (a false negative is normally absent, so this rarely prints)
                score_row = correspondences[(correspondences['id1'] == id1) & (correspondences['id2'] == id2)]
                if len(score_row) > 0:
                    score = score_row['score'].iloc[0] if 'score' in score_row.columns else None
                    score_str = f"{score:.3f}" if score is not None else "N/A"
                    print(f"    RF Score: {score_str}")
    else:
        print("  No false negatives found!")
    
    # False Positives: Predicted matches that are in validation set but labeled as FALSE
    # Only analyze pairs that are in the validation set
    fp_pairs = (pred_set & val_set) & false_set
    print(f"\n[FALSE POSITIVES] ({len(fp_pairs)} cases):")
    if len(fp_pairs) > 0:
        for id1, id2 in fp_pairs:
            left_rec = left_df[left_df['_rid'] == id1].iloc[0] if len(left_df[left_df['_rid'] == id1]) > 0 else None
            right_rec = right_df[right_df['_rid'] == id2].iloc[0] if len(right_df[right_df['_rid'] == id2]) > 0 else None
            score_row = correspondences[(correspondences['id1'] == id1) & (correspondences['id2'] == id2)]
            score = score_row['score'].iloc[0] if len(score_row) > 0 and 'score' in score_row.columns else None
            
            if left_rec is not None and right_rec is not None:
                score_str = f"{score:.3f}" if score is not None else "N/A"
                print(f"\n  {id1} <-> {id2} (score: {score_str})")
                print(f"    Left:  '{left_rec.get('full_name', 'N/A')}' | Season: {left_rec.get('season_year', 'N/A')} | Birth: {left_rec.get('birth_year', 'N/A')}")
                print(f"    Right: '{right_rec.get('full_name', 'N/A')}' | Season: {right_rec.get('season_year', 'N/A')} | Birth: {right_rec.get('birth_year', 'N/A')}")
    else:
        print("  No false positives found!")

print("\n✓ RandomForestClassifier error analysis complete")



LR: RandomForestClassifier Error Cases Analysis

[FALSE NEGATIVES] (0 cases):
  No false negatives found!

[FALSE POSITIVES] (0 cases):
  No false positives found!

LS: RandomForestClassifier Error Cases Analysis

[FALSE NEGATIVES] (0 cases):
  No false negatives found!

[FALSE POSITIVES] (2 cases):

  5437062017|2017|L <-> 6210022017|2017|S (score: 1.000)
    Left:  'dan robertson' | Season: 2017 | Birth: 1985
    Right: 'daniel robertson' | Season: 2017 | Birth: 1994

  6703512022|2022|L <-> 6689422022|2022|S (score: 1.000)
    Left:  'jose rojas' | Season: 2022 | Birth: 1993
    Right: 'josh rojas' | Season: 2022 | Birth: 1994

✓ RandomForestClassifier error analysis complete


### 5.4 Train and Apply GradientBoostingClassifier Matcher

Train and apply GradientBoostingClassifier using the same feature extractor for comparison with RandomForest.
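
Comparing the two ensembles is only meaningful when both see the same feature matrix and the same split. The sketch below illustrates that setup with the hyperparameters used in this notebook, but on synthetic data from `make_classification`. The resulting scores say nothing about the real edges.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier, RandomForestClassifier
from sklearn.metrics import f1_score
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the 6-column pair-feature matrix
X, y = make_classification(n_samples=400, n_features=6, n_informative=4,
                           random_state=42)
X_train, X_val, y_train, y_val = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=42
)

models = {
    "RandomForest": RandomForestClassifier(
        n_estimators=100, max_depth=10, min_samples_split=5,
        min_samples_leaf=2, random_state=42,
    ),
    "GradientBoosting": GradientBoostingClassifier(
        n_estimators=100, learning_rate=0.1, max_depth=5,
        min_samples_split=5, min_samples_leaf=2, random_state=42,
    ),
}

# Fit each model on the identical split and score on the held-out part
val_f1 = {}
for name, model in models.items():
    model.fit(X_train, y_train)
    val_f1[name] = f1_score(y_val, model.predict(X_val))

for name, score in val_f1.items():
    print(f"{name}: validation F1 = {score:.3f}")
```

Fixing `random_state` in both the split and the models keeps the comparison repeatable across runs.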


#### 5.4.1 Train and Apply GradientBoostingClassifier

In [120]:
# Train and apply GradientBoostingClassifier (reusing feature extractor from 4.2)

# Ensure GradientBoostingClassifier is imported
from sklearn.ensemble import GradientBoostingClassifier

gb_classifiers = {}
gb_matchers = {}
gb_matching_results = {}

for edge_name in ['LR', 'LS']:
    print(f"\n=== {edge_name}: GradientBoostingClassifier ===")
    
    left_df, right_df = source_tables[edge_name]
    train_df = splits[edge_name]['train'][['id1', 'id2', 'label']].copy()
    train_df['label'] = train_df['label'].astype(str).str.strip().str.upper()
    train_df['label_binary'] = (train_df['label'] == 'TRUE').astype(int)
    
    # Reuse feature extraction from 4.2
    train_features = ml_feature_extractor.create_features(
        df_left=left_df,
        df_right=right_df,
        pairs=train_df[['id1', 'id2']],
        labels=train_df['label_binary'],
        id_column='_rid'
    )
    
    feature_columns = [col for col in train_features.columns if col not in ['id1', 'id2', 'label']]
    X_train = train_features[feature_columns]
    y_train = train_features['label']
    
    # Train GradientBoostingClassifier
    print(f"  Training GradientBoostingClassifier...")
    gb_clf = GradientBoostingClassifier(
        n_estimators=100,
        learning_rate=0.1,
        max_depth=5,
        min_samples_split=5,
        min_samples_leaf=2,
        random_state=42
    )
    gb_clf.fit(X_train, y_train)
    
    gb_classifiers[edge_name] = gb_clf
    gb_matchers[edge_name] = MLBasedMatcher(ml_feature_extractor)
    
    # Apply to candidate pairs (with season_year constraint)
    cand_df = candidates[edge_name].copy()
    cand_with_seasons = cand_df.merge(
        left_df[['_rid', 'season_year']],
        left_on='id1', right_on='_rid', how='left'
    ).merge(
        right_df[['_rid', 'season_year']],
        left_on='id2', right_on='_rid', how='left', suffixes=('', '_right')
    )
    cand_df_filtered = cand_with_seasons[
        (cand_with_seasons['season_year'].notna()) & 
        (cand_with_seasons['season_year_right'].notna()) &
        (cand_with_seasons['season_year'] == cand_with_seasons['season_year_right'])
    ][['id1', 'id2']].copy()
    
    # Apply matcher
    correspondences = gb_matchers[edge_name].match(
        df_left=left_df,
        df_right=right_df,
        candidates=cand_df_filtered,
        id_column='_rid',
        trained_classifier=gb_clf
    )
    gb_matching_results[edge_name] = correspondences
    
    print(f"  Generated {len(correspondences):,} matched pairs")

print("\n✓ GradientBoostingClassifier matching complete")



=== LR: GradientBoostingClassifier ===


[INFO ] root - Label distribution: 90 positive, 214 negative


  Training GradientBoostingClassifier...


[INFO ] PyDI.entitymatching.ml_based.MLBasedMatcher - Starting Entity Matching
[INFO ] PyDI.entitymatching.ml_based.MLBasedMatcher - Blocking 106553 x 15215 elements
[INFO ] PyDI.entitymatching.ml_based.MLBasedMatcher - Matching 106553 x 15215 elements after 0:00:0.000; 135944 blocked pairs (reduction ratio: 0.9999161462661056)
[INFO ] PyDI.entitymatching.ml_based.MLBasedMatcher - Entity Matching finished after 0:00:52.456; found 15533 correspondences.
[INFO ] root - Label distribution: 142 positive, 154 negative


  Generated 15,533 matched pairs

=== LS: GradientBoostingClassifier ===
  Training GradientBoostingClassifier...


[INFO ] PyDI.entitymatching.ml_based.MLBasedMatcher - Starting Entity Matching
[INFO ] PyDI.entitymatching.ml_based.MLBasedMatcher - Blocking 106553 x 6743 elements
[INFO ] PyDI.entitymatching.ml_based.MLBasedMatcher - Matching 106553 x 6743 elements after 0:00:0.000; 9729 blocked pairs (reduction ratio: 0.9999864590429076)
[INFO ] PyDI.entitymatching.ml_based.MLBasedMatcher - Entity Matching finished after 0:00:3.650; found 6692 correspondences.


  Generated 6,692 matched pairs

✓ GradientBoostingClassifier matching complete


#### 5.4.2 Hyperparameter Tuning for GradientBoostingClassifier

Use RandomizedSearchCV with 3-fold cross-validation on the **training set** to search hyperparameters for GradientBoostingClassifier. The held-out **validation set** is then used to compare the tuned model against the default model from 5.4.1; the test set is not touched. The tuned model is stored separately from the default model.
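
To put `n_iter=15` in context, the sketch below counts the combinations in the search space. It copies the value lists from the `gb_param_dist` defined in the tuning cell; `search_space_size` is a hypothetical helper name introduced only for this illustration.

```python
from math import prod

# Same value lists as gb_param_dist in the tuning cell
gb_param_dist = {
    'n_estimators': [100, 200, 300],
    'learning_rate': [0.05, 0.1, 0.15],
    'max_depth': [3, 4, 5, 6, 7],
    'min_samples_split': [2, 5, 10],
    'min_samples_leaf': [1, 2, 4],
    'subsample': [0.8, 0.9, 1.0],
}


def search_space_size(param_dist):
    """Number of distinct combinations when every parameter is a finite list."""
    return prod(len(values) for values in param_dist.values())


n_iter = 15
total = search_space_size(gb_param_dist)
coverage = n_iter / total
print(f"{total} combinations, {n_iter} sampled ({coverage:.1%} of the space)")
```

With these lists the space has 1,215 combinations, so 15 iterations sample roughly 1.2% of it. On a training set of about 300 pairs, several configurations may well tie at the top cross-validation score.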


In [121]:
# Hyperparameter tuning for GradientBoostingClassifier: 3-fold CV on the training set, then a validation-set check
from sklearn.model_selection import RandomizedSearchCV
from sklearn.metrics import make_scorer, f1_score

# Define parameter distribution for RandomizedSearchCV
gb_param_dist = {
    'n_estimators': [100, 200, 300],
    'learning_rate': [0.05, 0.1, 0.15],
    'max_depth': [3, 4, 5, 6, 7],
    'min_samples_split': [2, 5, 10],
    'min_samples_leaf': [1, 2, 4],
    'subsample': [0.8, 0.9, 1.0]
}

# F1 score scorer for optimization
f1_scorer = make_scorer(f1_score)

# Dictionary to store tuned classifiers (separate from default)
gb_tuned_classifiers = {}
gb_tuning_results = {}

print("="*80)
print("GradientBoosting Hyperparameter Tuning (Using Validation Set)")
print("="*80)
print("⚠️  Using validation set for tuning. Test set will be used only for final evaluation.\n")

for edge_name in ['LR', 'LS']:
    print(f"\n{'='*80}")
    print(f"{edge_name} Edge: Hyperparameter Tuning")
    print(f"{'='*80}")
    
    left_df, right_df = source_tables[edge_name]
    
    # Prepare training data
    train_df = splits[edge_name]['train'][['id1', 'id2', 'label']].copy()
    train_df['label'] = train_df['label'].astype(str).str.strip().str.upper()
    train_df['label_binary'] = (train_df['label'] == 'TRUE').astype(int)
    
    # Prepare validation data
    val_df = splits[edge_name]['val'][['id1', 'id2', 'label']].copy()
    val_df['label'] = val_df['label'].astype(str).str.strip().str.upper()
    val_df['label_binary'] = (val_df['label'] == 'TRUE').astype(int)
    
    print(f"  Training pairs: {len(train_df)}")
    print(f"  Validation pairs: {len(val_df)}")
    
    # Extract features for training
    train_features = ml_feature_extractor.create_features(
        df_left=left_df,
        df_right=right_df,
        pairs=train_df[['id1', 'id2']],
        labels=train_df['label_binary'],
        id_column='_rid'
    )
    
    # Extract features for validation
    val_features = ml_feature_extractor.create_features(
        df_left=left_df,
        df_right=right_df,
        pairs=val_df[['id1', 'id2']],
        labels=val_df['label_binary'],
        id_column='_rid'
    )
    
    feature_columns = [col for col in train_features.columns if col not in ['id1', 'id2', 'label']]
    X_train = train_features[feature_columns]
    y_train = train_features['label']
    X_val = val_features[feature_columns]
    y_val = val_features['label']
    
    print(f"  Feature dimensions: {X_train.shape[1]}")
    print(f"  Training: {len(X_train)} samples")
    print(f"  Validation: {len(X_val)} samples")
    
    # Base GradientBoosting classifier
    base_clf = GradientBoostingClassifier(
        random_state=42
    )
    
    # RandomizedSearchCV with limited iterations for speed
    print(f"\n  Running RandomizedSearchCV (n_iter=15 for speed)...")
    random_search = RandomizedSearchCV(
        estimator=base_clf,
        param_distributions=gb_param_dist,
        n_iter=15,  # Limited iterations for faster execution
        scoring=f1_scorer,
        cv=3,  # 3-fold cross-validation on training set
        n_jobs=-1,
        random_state=42,
        verbose=1
    )
    
    # Fit on training data
    random_search.fit(X_train, y_train)
    
    # Get best parameters
    best_params = random_search.best_params_
    best_score = random_search.best_score_
    
    print(f"\n  Best cross-validation F1 score: {best_score:.4f}")
    print(f"  Best parameters:")
    for param, value in sorted(best_params.items()):
        print(f"    {param}: {value}")
    
    # Evaluate on validation set with best model
    best_clf = random_search.best_estimator_
    y_val_pred = best_clf.predict(X_val)
    val_f1 = f1_score(y_val, y_val_pred)
    
    print(f"\n  Validation set F1 score (best model): {val_f1:.4f}")
    
    # Compare with default model (from 5.4.1)
    default_clf = gb_classifiers[edge_name]  # Use the already trained default model
    y_val_pred_default = default_clf.predict(X_val)
    val_f1_default = f1_score(y_val, y_val_pred_default)
    
    print(f"  Validation set F1 score (default model): {val_f1_default:.4f}")
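    improvement = val_f1 - val_f1_default
    # Guard the relative change against a zero default F1
    rel_improvement = improvement / val_f1_default * 100 if val_f1_default > 0 else float('nan')
    print(f"  Improvement: {improvement:+.4f} ({rel_improvement:+.2f}%)")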
    
    # Store results (separate from default model)
    gb_tuned_classifiers[edge_name] = best_clf
    gb_tuning_results[edge_name] = {
        'best_params': best_params,
        'best_cv_score': best_score,
        'val_f1': val_f1,
        'val_f1_default': val_f1_default,
        'improvement': improvement
    }
    
    print(f"\n  ✓ Hyperparameter tuning complete for {edge_name}")

print(f"\n{'='*80}")
print("✓ GradientBoosting hyperparameter tuning complete for all edges")
print(f"{'='*80}")

# Print summary
print("\nTuning Summary:")
print("-" * 80)
for edge_name in ['LR', 'LS']:
    results = gb_tuning_results[edge_name]
    print(f"{edge_name}:")
    print(f"  Default F1: {results['val_f1_default']:.4f}")
    print(f"  Tuned F1:   {results['val_f1']:.4f}")
    rel_improvement = results['improvement'] / results['val_f1_default'] * 100 if results['val_f1_default'] > 0 else float('nan')
    print(f"  Improvement: {results['improvement']:+.4f} ({rel_improvement:+.2f}%)")


GradientBoosting Hyperparameter Tuning (Using Validation Set)
⚠️  Using validation set for tuning. Test set will be used only for final evaluation.


LR Edge: Hyperparameter Tuning
  Training pairs: 304
  Validation pairs: 96


[INFO ] root - Label distribution: 90 positive, 214 negative
[INFO ] root - Label distribution: 30 positive, 66 negative


  Feature dimensions: 6
  Training: 304 samples
  Validation: 96 samples

  Running RandomizedSearchCV (n_iter=15 for speed)...
Fitting 3 folds for each of 15 candidates, totalling 45 fits

  Best cross-validation F1 score: 1.0000
  Best parameters:
    learning_rate: 0.05
    max_depth: 4
    min_samples_leaf: 2
    min_samples_split: 5
    n_estimators: 300
    subsample: 0.8

  Validation set F1 score (best model): 1.0000
  Validation set F1 score (default model): 1.0000
  Improvement: +0.0000 (+0.00%)

  ✓ Hyperparameter tuning complete for LR

LS Edge: Hyperparameter Tuning
  Training pairs: 296
  Validation pairs: 101


[INFO ] root - Label distribution: 142 positive, 154 negative
[INFO ] root - Label distribution: 49 positive, 52 negative


  Feature dimensions: 6
  Training: 296 samples
  Validation: 101 samples

  Running RandomizedSearchCV (n_iter=15 for speed)...
Fitting 3 folds for each of 15 candidates, totalling 45 fits

  Best cross-validation F1 score: 0.9856
  Best parameters:
    learning_rate: 0.05
    max_depth: 3
    min_samples_leaf: 2
    min_samples_split: 5
    n_estimators: 300
    subsample: 0.9

  Validation set F1 score (best model): 0.9796
  Validation set F1 score (default model): 0.9796
  Improvement: +0.0000 (+0.00%)

  ✓ Hyperparameter tuning complete for LS

✓ GradientBoosting hyperparameter tuning complete for all edges

Tuning Summary:
--------------------------------------------------------------------------------
LR:
  Default F1: 1.0000
  Tuned F1:   1.0000
  Improvement: +0.0000 (+0.00%)
LS:
  Default F1: 0.9796
  Tuned F1:   0.9796
  Improvement: +0.0000 (+0.00%)


#### 5.4.3 Evaluate GradientBoosting Matching

In [122]:
# Evaluate GradientBoostingClassifier matching

gb_matching_metrics_val = {}

for edge_name in ['LR', 'LS']:
    print(f"\n=== {edge_name}: GradientBoostingClassifier Evaluation ===")
    
    correspondences = gb_matching_results[edge_name]
    val_df = splits[edge_name]['val'][['id1', 'id2', 'label']].copy()
    
    correspondences_for_eval = correspondences.copy()
    if 'score' not in correspondences_for_eval.columns and 'sim' in correspondences_for_eval.columns:
        correspondences_for_eval = correspondences_for_eval.rename(columns={'sim': 'score'})
    
    try:
        eval_results = EntityMatchingEvaluator.evaluate_matching(
            correspondences=correspondences_for_eval,
            test_pairs=val_df,
            out_dir=OUTPUT_DIR / 'matching-evaluation-gb',
            matcher_instance=gb_matchers[edge_name]
        )
        gb_matching_metrics_val[edge_name] = eval_results
        print(f"  Precision: {eval_results.get('precision', 0.0):.3f}")
        print(f"  Recall:    {eval_results.get('recall', 0.0):.3f}")
        print(f"  F1-Score:  {eval_results.get('f1', 0.0):.3f}")
    except Exception as e:
        print(f"  Evaluation failed: {e}")

print("\n✓ GradientBoostingClassifier evaluation complete")



=== LR: GradientBoostingClassifier Evaluation ===


[INFO ] root - Confusion Matrix:
[INFO ] root -   True Positives:  30
[INFO ] root -   True Negatives:  66
[INFO ] root -   False Positives: 0
[INFO ] root -   False Negatives: 0
[INFO ] root - Performance Metrics:
[INFO ] root -   Accuracy:  1.000
[INFO ] root -   Precision: 1.000
[INFO ] root -   Recall:    1.000
[INFO ] root -   F1-Score:  1.000


  Precision: 1.000
  Recall:    1.000
  F1-Score:  1.000

=== LS: GradientBoostingClassifier Evaluation ===


[INFO ] root - Confusion Matrix:
[INFO ] root -   True Positives:  48
[INFO ] root -   True Negatives:  46
[INFO ] root -   False Positives: 1
[INFO ] root -   False Negatives: 1
[INFO ] root - Performance Metrics:
[INFO ] root -   Accuracy:  0.979
[INFO ] root -   Precision: 0.980
[INFO ] root -   Recall:    0.980
[INFO ] root -   F1-Score:  0.980


  Precision: 0.980
  Recall:    0.980
  F1-Score:  0.980

✓ GradientBoostingClassifier evaluation complete


#### 5.4.5 Apply Tuned GradientBoostingClassifier Matcher

Apply the tuned GradientBoostingClassifier model to candidate pairs. This allows comparison with the default model results.
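
The `season_year` filter appears in several matching cells and could be factored into a helper. The sketch below is one possible refactoring, not part of PyDI: `filter_same_season` is a hypothetical function name, and it assumes the `id1`/`id2`, `_rid`, and `season_year` columns used in this notebook, with unique `_rid` values in each source table.

```python
import pandas as pd


def filter_same_season(cand, left, right, id_column='_rid', season_column='season_year'):
    """Keep candidate pairs whose two records have the same, non-missing season."""
    # Rename before merging so no suffix handling is needed
    left_seasons = left[[id_column, season_column]].rename(
        columns={id_column: 'id1', season_column: 'season_left'})
    right_seasons = right[[id_column, season_column]].rename(
        columns={id_column: 'id2', season_column: 'season_right'})
    merged = (cand[['id1', 'id2']]
              .merge(left_seasons, on='id1', how='left')
              .merge(right_seasons, on='id2', how='left'))
    keep = (merged['season_left'].notna()
            & merged['season_right'].notna()
            & (merged['season_left'] == merged['season_right']))
    return merged.loc[keep, ['id1', 'id2']].reset_index(drop=True)


# Toy data with made-up ids
left = pd.DataFrame({'_rid': ['a', 'b', 'c'], 'season_year': [2015, 2016, None]})
right = pd.DataFrame({'_rid': ['x', 'y', 'z'], 'season_year': [2015, 2017, None]})
cand = pd.DataFrame({'id1': ['a', 'b', 'c', 'a'], 'id2': ['x', 'y', 'z', 'q']})

same_season = filter_same_season(cand, left, right)
print(same_season)
```

Only the pair `('a', 'x')` survives here: the other pairs have different seasons, a missing season, or an id that is absent from the right table.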


In [123]:
# Apply tuned GradientBoostingClassifier to candidate pairs
gb_tuned_matching_results = {}
gb_tuned_matchers = {}

for edge_name in ['LR', 'LS']:
    print(f"=== {edge_name}: Tuned GradientBoostingClassifier Matching ===")
    
    left_df, right_df = source_tables[edge_name]
    cand_df = candidates[edge_name].copy()
    clf = gb_tuned_classifiers[edge_name]
    
    # Create matcher with tuned classifier
    matcher = MLBasedMatcher(ml_feature_extractor)
    gb_tuned_matchers[edge_name] = matcher
    
    print("  Filtering candidate pairs by season_year constraint...")
    cand_with_seasons = cand_df.merge(
        left_df[['_rid', 'season_year']],
        left_on='id1', right_on='_rid', how='left'
    ).merge(
        right_df[['_rid', 'season_year']],
        left_on='id2', right_on='_rid', how='left', suffixes=('', '_right')
    )
    cand_df_filtered = cand_with_seasons[
        (cand_with_seasons['season_year'].notna()) & 
        (cand_with_seasons['season_year_right'].notna()) &
        (cand_with_seasons['season_year'] == cand_with_seasons['season_year_right'])
    ][['id1', 'id2']].copy()
    
    print(f"  After season_year filter: {len(cand_df_filtered):,} candidate pairs (from {len(cand_df):,})")
    
    correspondences = matcher.match(
        df_left=left_df,
        df_right=right_df,
        candidates=cand_df_filtered,
        id_column='_rid',
        trained_classifier=clf
    )
    
    gb_tuned_matching_results[edge_name] = correspondences
    
    print(f"  Generated {len(correspondences):,} matched pairs")
    if 'score' in correspondences.columns:
        print(f"  Score range: [{correspondences['score'].min():.3f}, {correspondences['score'].max():.3f}]")
        print(f"  Mean score: {correspondences['score'].mean():.3f}")

print("\n✓ Tuned GradientBoostingClassifier matching complete for all edges")


=== LR: Tuned GradientBoostingClassifier Matching ===
  Filtering candidate pairs by season_year constraint...


[INFO ] PyDI.entitymatching.ml_based.MLBasedMatcher - Starting Entity Matching
[INFO ] PyDI.entitymatching.ml_based.MLBasedMatcher - Blocking 106553 x 15215 elements
[INFO ] PyDI.entitymatching.ml_based.MLBasedMatcher - Matching 106553 x 15215 elements after 0:00:0.000; 135944 blocked pairs (reduction ratio: 0.9999161462661056)


  After season_year filter: 135,944 candidate pairs (from 5,994,373)


[INFO ] PyDI.entitymatching.ml_based.MLBasedMatcher - Entity Matching finished after 0:00:47.165; found 15496 correspondences.


  Generated 15,496 matched pairs
  Score range: [1.000, 1.000]
  Mean score: 1.000
=== LS: Tuned GradientBoostingClassifier Matching ===
  Filtering candidate pairs by season_year constraint...


[INFO ] PyDI.entitymatching.ml_based.MLBasedMatcher - Starting Entity Matching
[INFO ] PyDI.entitymatching.ml_based.MLBasedMatcher - Blocking 106553 x 6743 elements
[INFO ] PyDI.entitymatching.ml_based.MLBasedMatcher - Matching 106553 x 6743 elements after 0:00:0.000; 9729 blocked pairs (reduction ratio: 0.9999864590429076)


  After season_year filter: 9,729 candidate pairs (from 227,185)


[INFO ] PyDI.entitymatching.ml_based.MLBasedMatcher - Entity Matching finished after 0:00:3.262; found 6679 correspondences.


  Generated 6,679 matched pairs
  Score range: [1.000, 1.000]
  Mean score: 1.000

✓ Tuned GradientBoostingClassifier matching complete for all edges


#### 5.4.6 Evaluate Tuned GradientBoostingClassifier Matching

Evaluate the tuned model on the validation set and compare it with the default model.
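
The matcher scores every candidate pair, but only a few hundred pairs carry labels, so precision should be computed over labeled pairs only. The sketch below illustrates that computation with plain sets. `pair_metrics` and the ids are hypothetical; it is not the `EntityMatchingEvaluator` implementation.

```python
def pair_metrics(pred_pairs, true_pairs, false_pairs):
    """Precision/recall/F1 over labeled pairs; unlabeled predictions are ignored."""
    pred, true, false = set(pred_pairs), set(true_pairs), set(false_pairs)
    tp = len(pred & true)    # predicted and labeled TRUE
    fp = len(pred & false)   # predicted but labeled FALSE
    fn = len(true - pred)    # labeled TRUE but not predicted
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return {'tp': tp, 'fp': fp, 'fn': fn,
            'precision': precision, 'recall': recall, 'f1': f1}


# Toy example with made-up ids
true_pairs = [('a1', 'b1'), ('a2', 'b2'), ('a3', 'b3')]
false_pairs = [('a4', 'b4'), ('a5', 'b5')]
pred_pairs = [('a1', 'b1'), ('a2', 'b2'), ('a4', 'b4'), ('a9', 'b9')]  # last pair is unlabeled

metrics = pair_metrics(pred_pairs, true_pairs, false_pairs)
print(metrics)
```

The unlabeled prediction `('a9', 'b9')` does not count as a false positive, because nothing is known about it.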


In [124]:
# Evaluate tuned GradientBoostingClassifier matching
gb_tuned_matching_metrics_val = {}

print("="*80)
print("Tuned GradientBoostingClassifier Evaluation (Validation Set)")
print("="*80)

for edge_name in ['LR', 'LS']:
    print(f"\n=== {edge_name}: Tuned GradientBoostingClassifier Evaluation ===")
    
    correspondences = gb_tuned_matching_results[edge_name]
    val_df = splits[edge_name]['val'][['id1', 'id2', 'label']].copy()
    
    correspondences_for_eval = correspondences.copy()
    if 'score' not in correspondences_for_eval.columns and 'sim' in correspondences_for_eval.columns:
        correspondences_for_eval = correspondences_for_eval.rename(columns={'sim': 'score'})
    
    try:
        eval_results = EntityMatchingEvaluator.evaluate_matching(
            correspondences=correspondences_for_eval,
            test_pairs=val_df,
            out_dir=OUTPUT_DIR / 'matching-evaluation-gb-tuned',
            matcher_instance=gb_tuned_matchers[edge_name]
        )
        gb_tuned_matching_metrics_val[edge_name] = eval_results
        
        print(f"  Precision: {eval_results.get('precision', 0.0):.3f}")
        print(f"  Recall:    {eval_results.get('recall', 0.0):.3f}")
        print(f"  F1-Score:  {eval_results.get('f1', 0.0):.3f}")
        print(f"  TP: {eval_results.get('true_positives', 0)}")
        print(f"  FP: {eval_results.get('false_positives', 0)}")
        print(f"  FN: {eval_results.get('false_negatives', 0)}")
        
        # Compare with default model
        default_metrics = gb_matching_metrics_val.get(edge_name, {})
        if default_metrics:
            print(f"\n  Comparison with Default Model:")
            print(f"    Default F1: {default_metrics.get('f1', 0.0):.3f}")
            print(f"    Tuned F1:   {eval_results.get('f1', 0.0):.3f}")
            improvement = eval_results.get('f1', 0.0) - default_metrics.get('f1', 0.0)
            print(f"    Improvement: {improvement:+.4f}")
            if improvement > 0:
                print(f"    Relative improvement: {improvement/default_metrics.get('f1', 0.0)*100:+.2f}%")
    except Exception as e:
        print(f"  Evaluation failed: {e}")
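        # Fallback to manual calculation, restricted to labeled validation pairs
        labels_upper = val_df['label'].astype(str).str.strip().str.upper()
        true_matches = val_df[labels_upper == 'TRUE']
        false_matches = val_df[labels_upper == 'FALSE']
        true_set = set(zip(true_matches['id1'], true_matches['id2']))
        false_set = set(zip(false_matches['id1'], false_matches['id2']))
        pred_set = set(zip(correspondences['id1'], correspondences['id2']))
        tp = len(true_set & pred_set)
        fp = len(false_set & pred_set)  # count only predictions labeled FALSE, not unlabeled ones
        fn = len(true_set - pred_set)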
        precision = tp / (tp + fp) if (tp + fp) > 0 else 0.0
        recall = tp / (tp + fn) if (tp + fn) > 0 else 0.0
        f1 = 2 * precision * recall / (precision + recall) if (precision + recall) > 0 else 0.0
        gb_tuned_matching_metrics_val[edge_name] = {
            'precision': precision,
            'recall': recall,
            'f1': f1,
            'true_positives': tp,
            'false_positives': fp,
            'false_negatives': fn
        }
        print(f"  Precision: {precision:.3f}")
        print(f"  Recall:    {recall:.3f}")
        print(f"  F1-Score:  {f1:.3f}")
        print(f"  TP: {tp}")
        print(f"  FP: {fp}")
        print(f"  FN: {fn}")

print("\n✓ Tuned GradientBoostingClassifier evaluation complete")


Tuned GradientBoostingClassifier Evaluation (Validation Set)

=== LR: Tuned GradientBoostingClassifier Evaluation ===


[INFO ] root - Confusion Matrix:
[INFO ] root -   True Positives:  30
[INFO ] root -   True Negatives:  66
[INFO ] root -   False Positives: 0
[INFO ] root -   False Negatives: 0
[INFO ] root - Performance Metrics:
[INFO ] root -   Accuracy:  1.000
[INFO ] root -   Precision: 1.000
[INFO ] root -   Recall:    1.000
[INFO ] root -   F1-Score:  1.000


  Precision: 1.000
  Recall:    1.000
  F1-Score:  1.000
  TP: 30
  FP: 0
  FN: 0

  Comparison with Default Model:
    Default F1: 1.000
    Tuned F1:   1.000
    Improvement: +0.0000

=== LS: Tuned GradientBoostingClassifier Evaluation ===


[INFO ] root - Confusion Matrix:
[INFO ] root -   True Positives:  48
[INFO ] root -   True Negatives:  46
[INFO ] root -   False Positives: 1
[INFO ] root -   False Negatives: 1
[INFO ] root - Performance Metrics:
[INFO ] root -   Accuracy:  0.979
[INFO ] root -   Precision: 0.980
[INFO ] root -   Recall:    0.980
[INFO ] root -   F1-Score:  0.980


  Precision: 0.980
  Recall:    0.980
  F1-Score:  0.980
  TP: 48
  FP: 1
  FN: 1

  Comparison with Default Model:
    Default F1: 0.980
    Tuned F1:   0.980
    Improvement: +0.0000

✓ Tuned GradientBoostingClassifier evaluation complete


#### 5.4.7 Analyze GradientBoostingClassifier Error Cases
Analyze false positives and false negatives for GradientBoostingClassifier on the validation set to identify patterns for further improvement.


In [125]:
# Analyze error cases for GradientBoostingClassifier (reusing code structure from 3.2)

for edge_name in ['LR', 'LS']:
    print(f"\n{'='*80}")
    print(f"{edge_name}: GradientBoostingClassifier Error Cases Analysis")
    print(f"{'='*80}")
    
    left_df, right_df = source_tables[edge_name]
    val_df = splits[edge_name]['val'][['id1', 'id2', 'label']].copy()
    correspondences = gb_matching_results[edge_name]
    
    # Get true matches and false matches in validation set
    true_matches = val_df[val_df['label'].astype(str).str.strip().str.upper() == 'TRUE']
    false_matches = val_df[val_df['label'].astype(str).str.strip().str.upper() == 'FALSE']
    true_set = set(zip(true_matches['id1'], true_matches['id2']))
    false_set = set(zip(false_matches['id1'], false_matches['id2']))
    
    # Predicted matches (all correspondences); restricted to validation pairs where needed below
    pred_set = set(zip(correspondences['id1'], correspondences['id2']))
    val_set = set(zip(val_df['id1'], val_df['id2']))  # All pairs in validation set
    
    # False Negatives: True matches in validation set that were not predicted
    fn_pairs = true_set - pred_set
    print(f"\n[FALSE NEGATIVES] ({len(fn_pairs)} cases):")
    if len(fn_pairs) > 0:
        for id1, id2 in fn_pairs:
            left_rec = left_df[left_df['_rid'] == id1].iloc[0] if len(left_df[left_df['_rid'] == id1]) > 0 else None
            right_rec = right_df[right_df['_rid'] == id2].iloc[0] if len(right_df[right_df['_rid'] == id2]) > 0 else None
            
            if left_rec is not None and right_rec is not None:
                print(f"\n  {id1} <-> {id2}")
                print(f"    Left:  '{left_rec.get('full_name', 'N/A')}' | Season: {left_rec.get('season_year', 'N/A')} | Birth: {left_rec.get('birth_year', 'N/A')}")
                print(f"    Right: '{right_rec.get('full_name', 'N/A')}' | Season: {right_rec.get('season_year', 'N/A')} | Birth: {right_rec.get('birth_year', 'N/A')}")
                
                # Check if in candidates
                in_candidates = len(candidates[edge_name][
                    (candidates[edge_name]['id1'] == id1) & 
                    (candidates[edge_name]['id2'] == id2)
                ]) > 0
                print(f"    In candidates: {in_candidates}")
                
                # Check if pair was scored by GB (even if below threshold)
                score_row = correspondences[(correspondences['id1'] == id1) & (correspondences['id2'] == id2)]
                if len(score_row) > 0:
                    score = score_row['score'].iloc[0] if 'score' in score_row.columns else None
                    score_str = f"{score:.3f}" if score is not None else "N/A"
                    print(f"    GB Score: {score_str}")
    else:
        print("  No false negatives found!")
    
    # False Positives: Predicted matches that are in validation set but labeled as FALSE
    # Only analyze pairs that are in the validation set
    fp_pairs = (pred_set & val_set) & false_set
    print(f"\n[FALSE POSITIVES] ({len(fp_pairs)} cases):")
    if len(fp_pairs) > 0:
        for id1, id2 in fp_pairs:
            left_rec = left_df[left_df['_rid'] == id1].iloc[0] if len(left_df[left_df['_rid'] == id1]) > 0 else None
            right_rec = right_df[right_df['_rid'] == id2].iloc[0] if len(right_df[right_df['_rid'] == id2]) > 0 else None
            score_row = correspondences[(correspondences['id1'] == id1) & (correspondences['id2'] == id2)]
            score = score_row['score'].iloc[0] if len(score_row) > 0 and 'score' in score_row.columns else None
            
            if left_rec is not None and right_rec is not None:
                score_str = f"{score:.3f}" if score is not None else "N/A"
                print(f"\n  {id1} <-> {id2} (score: {score_str})")
                print(f"    Left:  '{left_rec.get('full_name', 'N/A')}' | Season: {left_rec.get('season_year', 'N/A')} | Birth: {left_rec.get('birth_year', 'N/A')}")
                print(f"    Right: '{right_rec.get('full_name', 'N/A')}' | Season: {right_rec.get('season_year', 'N/A')} | Birth: {right_rec.get('birth_year', 'N/A')}")
    else:
        print("  No false positives found!")

print("\n✓ GradientBoostingClassifier error analysis complete")



LR: GradientBoostingClassifier Error Cases Analysis

[FALSE NEGATIVES] (0 cases):
  No false negatives found!

[FALSE POSITIVES] (0 cases):
  No false positives found!

LS: GradientBoostingClassifier Error Cases Analysis

[FALSE NEGATIVES] (1 cases):

  4762702015|2015|L <-> 4762702015|2015|S
    Left:  'steven tolleson' | Season: 2015 | Birth: 1983
    Right: 'steve tolleson' | Season: 2015 | Birth: 1984
    In candidates: True

[FALSE POSITIVES] (1 cases):

  5167142015|2015|L <-> 6458482015|2015|S (score: 1.000)
    Left:  'dario alvarez' | Season: 2015 | Birth: 1989
    Right: 'dariel alvarez' | Season: 2015 | Birth: 1989

✓ GradientBoostingClassifier error analysis complete


### 4.8 Error Analysis Summary and Improvement Plan

Based on the error analysis, we identify the following patterns and propose targeted improvements:

#### **LR Edge Error Patterns:**

1. **Blocking Issues (2 cases):**
   - `dan vogelbach` vs `daniel vogelbach` (2 instances) - Not in candidate pairs
   - **Root Cause:** The Enhanced TokenBlocker is missing these name variant pairs during blocking phase
   - **Impact:** These are true matches that never reach the matching stage

2. **Matching Issues (1 case):**
   - `jonathon niese` vs `jon niese` - In candidates but missed by GradientBoostingClassifier
   - **Root Cause:** ML model not recognizing common name variants despite high name similarity
   - **Impact:** True match filtered out by classifier

3. **False Positives:** 0 cases (Perfect Precision)

#### **LS Edge Error Patterns:**

1. **Matching Issues - False Negatives (1 case):**
   - `dan robertson` vs `daniel robertson` - In candidates but missed by classifier
   - **Root Cause:** Same as LR - name variant not recognized by ML model

2. **Matching Issues - False Positives (3 cases, all score=1.000):**
   - `josh rojas` vs `jose rojas` (birth years: 1994 vs 1993)
   - `kevan smith` vs `kevin smith` (birth years: 1988 vs 1997)
   - `matt duffy` vs `matt duffy` (birth years: 1989 vs 1991)
   - **Root Cause:** Model overconfident on name similarity, insufficiently penalizing birth year differences
   - **Impact:** High-confidence false matches reduce precision

#### **Proposed Improvement Strategies:**

**Strategy 1: Enhance Feature Engineering for ML Models**
- **Add Birth Year Difference Feature:** Calculate `abs(birth_year_left - birth_year_right)` as an explicit feature
- **Add Name Variant Indicator:** Create binary feature indicating if names are known variants (using `NAME_VARIANTS` dictionary)
- **Action:** Extend `ml_comparators` to include `DateComparator` for birth_year, or add custom feature extraction
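
The two features proposed in Strategy 1 could look roughly like the sketch below. The mapping shown is a made-up excerpt, and its variant-to-canonical layout is an assumption about how `NAME_VARIANTS` is organized. The function names are hypothetical and not taken from the notebook's feature extractor.

```python
# Hypothetical excerpt: short form -> canonical first name
NAME_VARIANTS = {'dan': 'daniel', 'steve': 'steven', 'jon': 'jonathon'}


def is_name_variant(name_left, name_right):
    """1 if the first names differ but map to the same canonical form
    and all remaining name tokens are equal, else 0."""
    left, right = name_left.lower().split(), name_right.lower().split()
    if not left or not right or left[1:] != right[1:]:
        return 0
    first_left, first_right = left[0], right[0]
    if first_left == first_right:
        return 0  # identical names are not "variants" in this sketch
    canonical_left = NAME_VARIANTS.get(first_left, first_left)
    canonical_right = NAME_VARIANTS.get(first_right, first_right)
    return int(canonical_left == canonical_right)


def birth_year_diff(year_left, year_right):
    """Absolute birth-year difference, or None if either year is missing."""
    if year_left is None or year_right is None:
        return None
    return abs(int(year_left) - int(year_right))


# Pairs taken from the error listings in this notebook
print(is_name_variant('dan robertson', 'daniel robertson'), birth_year_diff(1985, 1994))
print(is_name_variant('jose rojas', 'josh rojas'), birth_year_diff(1993, 1994))
```

The first pair is a known variant but nine birth years apart, which is the kind of conflict an explicit difference feature lets a tree model split on.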

**Strategy 2: Improve Blocking for Name Variants (LR Edge)**
- **Enhance TokenBlocker:** Investigate why `dan`/`daniel` variants are missed in blocking phase
- **Action:** Review `normalize_name_for_blocking` function or add name variant expansion to blocking keys
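
One way to add name variant expansion to blocking keys is sketched below. It assumes, purely for illustration, that a blocking key is the normalized full name plus the season. The actual key construction in the blocking phase may differ. `blocking_keys`, the mapping, and the season value are all hypothetical.

```python
# Hypothetical excerpt: short form -> canonical first name
NAME_VARIANTS = {'dan': 'daniel', 'steve': 'steven', 'jon': 'jonathon'}


def blocking_keys(full_name, season_year):
    """Return the original key plus a key with the first name canonicalized."""
    tokens = full_name.lower().split()
    keys = {f"{' '.join(tokens)}|{season_year}"}
    if tokens:
        canonical_first = NAME_VARIANTS.get(tokens[0], tokens[0])
        keys.add(f"{' '.join([canonical_first] + tokens[1:])}|{season_year}")
    return keys


# The season is made up; only the names come from the error summary
left_keys = blocking_keys('dan vogelbach', 2017)
right_keys = blocking_keys('daniel vogelbach', 2017)
shared = left_keys & right_keys
print(shared)
```

Without the expansion the two key sets would be disjoint. With it, both records share the canonical key and land in the same block.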

**Strategy 3: Hyperparameter Tuning**
- **Adjust GradientBoostingClassifier:** Fine-tune parameters to better balance name similarity vs. birth year constraints
- **Action:** Grid search or Bayesian optimization for optimal parameters

**Strategy 4: Ensemble Methods**
- **Combine Models:** Use voting or stacking to combine RandomForest and GradientBoosting predictions
- **Action:** Implement ensemble matcher that leverages strengths of both models
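
A minimal sketch of the voting idea in Strategy 4 is soft voting: average each model's match probability per pair and threshold the mean. The probabilities below are made-up numbers. In the notebook they would come from each classifier's `predict_proba`, and `soft_vote` is a hypothetical helper.

```python
def soft_vote(prob_lists, threshold=0.5):
    """Average per-pair match probabilities across models and threshold the mean."""
    n_models = len(prob_lists)
    averaged = [sum(probs) / n_models for probs in zip(*prob_lists)]
    return [(avg, avg >= threshold) for avg in averaged]


# Made-up probabilities for three candidate pairs
rf_probs = [0.9, 0.4, 0.7]
gb_probs = [0.8, 0.2, 0.1]

decisions = soft_vote([rf_probs, gb_probs])
for avg, is_match in decisions:
    print(f"mean probability {avg:.2f} -> match: {is_match}")
```

The third pair shows the intended effect: one model is fairly confident, the other disagrees strongly, and the averaged score falls below the threshold.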


### 5.5 Train and Apply XGBoostClassifier Matcher

Train a gradient-boosted tree model (XGBoost) on the same feature set to capture non-linear interactions beyond RandomForest/GradientBoosting.

#### 5.5.1 Train XGBoostClassifier

In [126]:
# Train XGBoost-based matchers for each edge
from xgboost import XGBClassifier

xgb_classifiers = {}
xgb_matchers = {}

for edge_name in ['LR', 'LS']:
    print(f"=== {edge_name}: Training XGBoostClassifier ===")
    
    left_df, right_df = source_tables[edge_name]
    train_df = splits[edge_name]['train'][['id1', 'id2', 'label']].copy()
    train_df['label'] = train_df['label'].astype(str).str.strip().str.upper()
    train_df['label_binary'] = (train_df['label'] == 'TRUE').astype(int)
    
    print(f"  Training pairs: {len(train_df)}")
    print(f"  True matches: {train_df['label_binary'].sum()}")
    print(f"  False matches: {len(train_df) - train_df['label_binary'].sum()}")
    
    train_features = ml_feature_extractor.create_features(
        df_left=left_df,
        df_right=right_df,
        pairs=train_df[['id1', 'id2']],
        labels=train_df['label_binary'],
        id_column='_rid'
    )
    
    feature_columns = [col for col in train_features.columns if col not in ['id1', 'id2', 'label']]
    X_train = train_features[feature_columns]
    y_train = train_features['label']
    
    clf = XGBClassifier(
        n_estimators=400,
        max_depth=6,
        learning_rate=0.1,
        subsample=0.9,
        colsample_bytree=0.9,
        reg_lambda=1.0,
        reg_alpha=0.0,
        objective='binary:logistic',
        eval_metric='logloss',
        n_jobs=-1,
        random_state=42
    )
    clf.fit(X_train, y_train)
    
    xgb_classifiers[edge_name] = clf
    xgb_matchers[edge_name] = MLBasedMatcher(ml_feature_extractor)
    
    importance = sorted(
        zip(feature_columns, clf.feature_importances_),
        key=lambda x: x[1],
        reverse=True
    )
    print("Top feature importances:")
    for feat, val in importance:
        print(f" {feat}: {val:.4f}")
    print(f" XGBoostClassifier trained for {edge_name}")

print("XGBoostClassifier training complete for all edges")

=== LR: Training XGBoostClassifier ===
  Training pairs: 304
  True matches: 90
  False matches: 214


[INFO ] root - Label distribution: 90 positive, 214 negative


[INFO ] root - Label distribution: 142 positive, 154 negative


Top feature importances:
 is_name_variant: 0.5842
 birth_year_diff: 0.2692
 StringComparator(full_name_normalized, levenshtein, tokenization=char, list_strategy=None): 0.0739
 StringComparator(full_name_normalized, jaccard, tokenization=word, list_strategy=None): 0.0694
 phonetic_soundex: 0.0034
 DateComparator(birth_year, list_strategy=None): 0.0000
 XGBoostClassifier trained for LR
=== LS: Training XGBoostClassifier ===
  Training pairs: 296
  True matches: 142
  False matches: 154
Top feature importances:
 is_name_variant: 0.3946
 birth_year_diff: 0.2974
 StringComparator(full_name_normalized, levenshtein, tokenization=char, list_strategy=None): 0.1367
 StringComparator(full_name_normalized, jaccard, tokenization=word, list_strategy=None): 0.1082
 DateComparator(birth_year, list_strategy=None): 0.0329
 phonetic_soundex: 0.0304
 XGBoostClassifier trained for LS
XGBoostClassifier training complete for all edges


#### 5.5.2 Hyperparameter Tuning for XGBoostClassifier

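Use RandomizedSearchCV to search for better hyperparameters for XGBoostClassifier. The search itself runs 3-fold cross-validation on the **training set**; the **validation set** is used only to compare the tuned model against the default one, and the test set is not touched. RandomizedSearchCV evaluates a fixed number of sampled configurations (here `n_iter=15`), so it is faster than GridSearchCV while still covering a reasonable part of the parameter space.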
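
To make the speed trade-off concrete, the sketch below counts how many combinations an exhaustive grid over the parameter space used in this section would contain and draws 15 random candidates from it. It uses only the standard library and mimics the idea behind RandomizedSearchCV, not its implementation; the helper `sample_candidates` is a hypothetical name, and unlike scikit-learn it may draw the same combination twice.

```python
import math
import random

# Same discrete search space as xgb_param_dist in this section
param_space = {
    'max_depth': [4, 5, 6, 7, 8],
    'learning_rate': [0.05, 0.1, 0.15],
    'n_estimators': [300, 400, 500],
    'reg_alpha': [0, 0.1, 0.5, 1.0],
    'min_child_weight': [1, 3, 5],
    'subsample': [0.8, 0.9, 1.0],
    'colsample_bytree': [0.8, 0.9, 1.0],
}

# Exhaustive grid size: product of the number of values per parameter
grid_size = math.prod(len(values) for values in param_space.values())


def sample_candidates(space, n_iter, seed=42):
    """Draw n_iter random parameter combinations (hypothetical helper)."""
    rng = random.Random(seed)
    return [
        {name: rng.choice(values) for name, values in space.items()}
        for _ in range(n_iter)
    ]


candidates_sampled = sample_candidates(param_space, n_iter=15)
cv_folds = 3

print(f"Full grid: {grid_size} combinations -> {grid_size * cv_folds} fits")
print(f"Random search: {len(candidates_sampled)} candidates -> "
      f"{len(candidates_sampled) * cv_folds} fits")
```

With 15 candidates and 3 folds, the random search needs 45 fits, which matches the "totalling 45 fits" line in the tuning log.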


In [127]:
# Hyperparameter tuning for XGBoostClassifier: 3-fold CV on the training set,
# followed by a tuned-vs-default comparison on the validation set
from sklearn.model_selection import RandomizedSearchCV
from sklearn.metrics import make_scorer, f1_score
import numpy as np

# Define parameter distribution for RandomizedSearchCV
xgb_param_dist = {
    'max_depth': [4, 5, 6, 7, 8],
    'learning_rate': [0.05, 0.1, 0.15],
    'n_estimators': [300, 400, 500],
    'reg_alpha': [0, 0.1, 0.5, 1.0],
    'min_child_weight': [1, 3, 5],
    'subsample': [0.8, 0.9, 1.0],
    'colsample_bytree': [0.8, 0.9, 1.0]
}

# F1 score scorer for optimization
f1_scorer = make_scorer(f1_score)

# Dictionary to store tuned classifiers
xgb_tuned_classifiers = {}
xgb_tuning_results = {}

print("="*80)
print("XGBoost Hyperparameter Tuning (Using Validation Set)")
print("="*80)
print("Using validation set for tuning. Test set will be used only for final evaluation.\n")

for edge_name in ['LR', 'LS']:
    print(f"\n{'='*80}")
    print(f"{edge_name} Edge: Hyperparameter Tuning")
    print(f"{'='*80}")
    
    left_df, right_df = source_tables[edge_name]
    
    # Prepare training data
    train_df = splits[edge_name]['train'][['id1', 'id2', 'label']].copy()
    train_df['label'] = train_df['label'].astype(str).str.strip().str.upper()
    train_df['label_binary'] = (train_df['label'] == 'TRUE').astype(int)
    
    # Prepare validation data
    val_df = splits[edge_name]['val'][['id1', 'id2', 'label']].copy()
    val_df['label'] = val_df['label'].astype(str).str.strip().str.upper()
    val_df['label_binary'] = (val_df['label'] == 'TRUE').astype(int)
    
    print(f"  Training pairs: {len(train_df)}")
    print(f"  Validation pairs: {len(val_df)}")
    
    # Extract features for training
    train_features = ml_feature_extractor.create_features(
        df_left=left_df,
        df_right=right_df,
        pairs=train_df[['id1', 'id2']],
        labels=train_df['label_binary'],
        id_column='_rid'
    )
    
    # Extract features for validation
    val_features = ml_feature_extractor.create_features(
        df_left=left_df,
        df_right=right_df,
        pairs=val_df[['id1', 'id2']],
        labels=val_df['label_binary'],
        id_column='_rid'
    )
    
    feature_columns = [col for col in train_features.columns if col not in ['id1', 'id2', 'label']]
    X_train = train_features[feature_columns]
    y_train = train_features['label']
    X_val = val_features[feature_columns]
    y_val = val_features['label']
    
    print(f"  Feature dimensions: {X_train.shape[1]}")
    print(f"  Training: {len(X_train)} samples")
    print(f"  Validation: {len(X_val)} samples")
    
    # Base XGBoost classifier
    base_clf = XGBClassifier(
        objective='binary:logistic',
        eval_metric='logloss',
        n_jobs=-1,
        random_state=42
    )
    
    # RandomizedSearchCV with limited iterations for speed
    print(f"\n  Running RandomizedSearchCV (n_iter=15 for speed)...")
    random_search = RandomizedSearchCV(
        estimator=base_clf,
        param_distributions=xgb_param_dist,
        n_iter=15,  # Limited iterations for faster execution
        scoring=f1_scorer,
        cv=3,  # 3-fold cross-validation on training set
        n_jobs=-1,
        random_state=42,
        verbose=1
    )
    
    # Fit on training data
    random_search.fit(X_train, y_train)
    
    # Get best parameters
    best_params = random_search.best_params_
    best_score = random_search.best_score_
    
    print(f"\n  Best cross-validation F1 score: {best_score:.4f}")
    print(f"  Best parameters:")
    for param, value in sorted(best_params.items()):
        print(f"    {param}: {value}")
    
    # Evaluate on validation set with best model
    best_clf = random_search.best_estimator_
    y_val_pred = best_clf.predict(X_val)
    val_f1 = f1_score(y_val, y_val_pred)
    
    print(f"\n  Validation set F1 score (best model): {val_f1:.4f}")
    
    # Compare with default model
    default_clf = XGBClassifier(
        n_estimators=400,
        max_depth=6,
        learning_rate=0.1,
        subsample=0.9,
        colsample_bytree=0.9,
        reg_lambda=1.0,
        reg_alpha=0.0,
        objective='binary:logistic',
        eval_metric='logloss',
        n_jobs=-1,
        random_state=42
    )
    default_clf.fit(X_train, y_train)
    y_val_pred_default = default_clf.predict(X_val)
    val_f1_default = f1_score(y_val, y_val_pred_default)
    
    print(f"  Validation set F1 score (default model): {val_f1_default:.4f}")
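    improvement = val_f1 - val_f1_default
    # Guard against division by zero if the default model scores F1 = 0
    rel_improvement = improvement / val_f1_default * 100 if val_f1_default > 0 else float('nan')
    print(f"  Improvement: {improvement:+.4f} ({rel_improvement:+.2f}%)")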
    
    # Store results (separate from default model)
    xgb_tuned_classifiers[edge_name] = best_clf
    xgb_tuning_results[edge_name] = {
        'best_params': best_params,
        'best_cv_score': best_score,
        'val_f1': val_f1,
        'val_f1_default': val_f1_default,
        'improvement': improvement
    }
    
    # Note: Do NOT update xgb_classifiers here - keep default model separate
    
    print(f"\n  ✓ Hyperparameter tuning complete for {edge_name}")

print(f"\n{'='*80}")
print("✓ XGBoost hyperparameter tuning complete for all edges")
print(f"{'='*80}")

# Print summary
print("\nTuning Summary:")
print("-" * 80)
for edge_name in ['LR', 'LS']:
    results = xgb_tuning_results[edge_name]
    print(f"{edge_name}:")
    print(f"  Default F1: {results['val_f1_default']:.4f}")
    print(f"  Tuned F1:   {results['val_f1']:.4f}")
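    # Guard against division by zero if the default model scores F1 = 0
    rel = results['improvement'] / results['val_f1_default'] * 100 if results['val_f1_default'] > 0 else float('nan')
    print(f"  Improvement: {results['improvement']:+.4f} ({rel:+.2f}%)")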


XGBoost Hyperparameter Tuning (Using Validation Set)
Using validation set for tuning. Test set will be used only for final evaluation.


LR Edge: Hyperparameter Tuning
  Training pairs: 304
  Validation pairs: 96


[INFO ] root - Label distribution: 90 positive, 214 negative
[INFO ] root - Label distribution: 30 positive, 66 negative


  Feature dimensions: 6
  Training: 304 samples
  Validation: 96 samples

  Running RandomizedSearchCV (n_iter=15 for speed)...
Fitting 3 folds for each of 15 candidates, totalling 45 fits


[INFO ] root - Label distribution: 142 positive, 154 negative



  Best cross-validation F1 score: 1.0000
  Best parameters:
    colsample_bytree: 0.8
    learning_rate: 0.05
    max_depth: 8
    min_child_weight: 1
    n_estimators: 500
    reg_alpha: 1.0
    subsample: 0.9

  Validation set F1 score (best model): 1.0000
  Validation set F1 score (default model): 1.0000
  Improvement: +0.0000 (+0.00%)

  ✓ Hyperparameter tuning complete for LR

LS Edge: Hyperparameter Tuning
  Training pairs: 296
  Validation pairs: 101


[INFO ] root - Label distribution: 49 positive, 52 negative


  Feature dimensions: 6
  Training: 296 samples
  Validation: 101 samples

  Running RandomizedSearchCV (n_iter=15 for speed)...
Fitting 3 folds for each of 15 candidates, totalling 45 fits

  Best cross-validation F1 score: 0.9821
  Best parameters:
    colsample_bytree: 0.8
    learning_rate: 0.05
    max_depth: 8
    min_child_weight: 1
    n_estimators: 500
    reg_alpha: 1.0
    subsample: 0.9

  Validation set F1 score (best model): 0.9800
  Validation set F1 score (default model): 0.9697
  Improvement: +0.0103 (+1.06%)

  ✓ Hyperparameter tuning complete for LS

✓ XGBoost hyperparameter tuning complete for all edges

Tuning Summary:
--------------------------------------------------------------------------------
LR:
  Default F1: 1.0000
  Tuned F1:   1.0000
  Improvement: +0.0000 (+0.00%)
LS:
  Default F1: 0.9697
  Tuned F1:   0.9800
  Improvement: +0.0103 (+1.06%)


#### 5.5.3 Apply XGBoostClassifier Matcher (Default)

Filter by the same season_year constraint, then score candidate pairs with the default XGBoost model (from 5.5.1).
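
The season constraint can be illustrated on a toy example. The sketch below is a simplified stand-in for the pandas merge used in the notebook: it replaces the merge with plain dictionary lookups, and all record ids and season values are made up for illustration. A pair is kept only when both sides have a known season and the seasons are equal.

```python
# Hypothetical lookup tables: record id -> season_year (None if missing)
left_seasons = {'a|2015|L': 2015, 'b|2016|L': 2016, 'c|2017|L': None}
right_seasons = {'x|2015|R': 2015, 'y|2017|R': 2017, 'z|2017|R': 2017}

candidate_pairs = [
    ('a|2015|L', 'x|2015|R'),  # same season -> kept
    ('b|2016|L', 'y|2017|R'),  # different seasons -> dropped
    ('c|2017|L', 'z|2017|R'),  # season missing on the left -> dropped
    ('d|2018|L', 'x|2015|R'),  # left id not in the table -> dropped
]


def same_season(pair):
    """True only if both seasons are known and equal."""
    left_season = left_seasons.get(pair[0])
    right_season = right_seasons.get(pair[1])
    return (
        left_season is not None
        and right_season is not None
        and left_season == right_season
    )


filtered_pairs = [pair for pair in candidate_pairs if same_season(pair)]
print(filtered_pairs)
```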

In [128]:
# Apply XGBoostClassifier-based matcher to candidate pairs
xgb_matching_results = {}

for edge_name in ['LR', 'LS']:
    print(f"=== {edge_name}: XGBoostClassifier Matching ===")
    
    left_df, right_df = source_tables[edge_name]
    cand_df = candidates[edge_name].copy()
    matcher = xgb_matchers[edge_name]
    clf = xgb_classifiers[edge_name]
    
    print("  Filtering candidate pairs by season_year constraint...")
    cand_with_seasons = cand_df.merge(
        left_df[['_rid', 'season_year']],
        left_on='id1',
        right_on='_rid',
        how='left'
    ).merge(
        right_df[['_rid', 'season_year']],
        left_on='id2',
        right_on='_rid',
        how='left',
        suffixes=('', '_right')
    )
    
    cand_df_filtered = cand_with_seasons[
        (cand_with_seasons['season_year'].notna()) &
        (cand_with_seasons['season_year_right'].notna()) &
        (cand_with_seasons['season_year'] == cand_with_seasons['season_year_right'])
    ][['id1', 'id2']].copy()
    
    print(f"  After season_year filter: {len(cand_df_filtered):,} candidate pairs (from {len(cand_df):,})")
    correspondences = matcher.match(
        df_left=left_df,
        df_right=right_df,
        candidates=cand_df_filtered,
        id_column='_rid',
        trained_classifier=clf
    )
    
    xgb_matching_results[edge_name] = correspondences
    
    print(f"  Generated {len(correspondences):,} matched pairs")
    if 'score' in correspondences.columns:
        print(f"  Score range: [{correspondences['score'].min():.3f}, {correspondences['score'].max():.3f}]")
        print(f"  Mean score: {correspondences['score'].mean():.3f}")

print("XGBoostClassifier matching complete for all edges")

=== LR: XGBoostClassifier Matching ===
  Filtering candidate pairs by season_year constraint...


[INFO ] PyDI.entitymatching.ml_based.MLBasedMatcher - Starting Entity Matching
[INFO ] PyDI.entitymatching.ml_based.MLBasedMatcher - Blocking 106553 x 15215 elements
[INFO ] PyDI.entitymatching.ml_based.MLBasedMatcher - Matching 106553 x 15215 elements after 0:00:0.000; 135944 blocked pairs (reduction ratio: 0.9999161462661056)


  After season_year filter: 135,944 candidate pairs (from 5,994,373)


[INFO ] PyDI.entitymatching.ml_based.MLBasedMatcher - Entity Matching finished after 0:00:49.250; found 15429 correspondences.


  Generated 15,429 matched pairs
  Score range: [1.000, 1.000]
  Mean score: 1.000
=== LS: XGBoostClassifier Matching ===
  Filtering candidate pairs by season_year constraint...


[INFO ] PyDI.entitymatching.ml_based.MLBasedMatcher - Starting Entity Matching
[INFO ] PyDI.entitymatching.ml_based.MLBasedMatcher - Blocking 106553 x 6743 elements
[INFO ] PyDI.entitymatching.ml_based.MLBasedMatcher - Matching 106553 x 6743 elements after 0:00:0.000; 9729 blocked pairs (reduction ratio: 0.9999864590429076)


  After season_year filter: 9,729 candidate pairs (from 227,185)


[INFO ] PyDI.entitymatching.ml_based.MLBasedMatcher - Entity Matching finished after 0:00:3.244; found 6717 correspondences.


  Generated 6,717 matched pairs
  Score range: [1.000, 1.000]
  Mean score: 1.000
XGBoostClassifier matching complete for all edges


#### 5.5.4 Evaluate XGBoostClassifier Matching (Default)

Assess validation performance of the default XGBoost model with the PyDI evaluator, falling back to manual metrics if needed.
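
As a rough illustration of what a manual fallback computes, the sketch below derives precision, recall, and F1 from sets of pairs. It restricts predictions to labeled pairs, on the assumption that pairs without a label should count neither as correct nor as wrong; the logged confusion matrices sum to the validation set size, which suggests the PyDI evaluator behaves similarly, but that is not confirmed here. The function name and toy ids are hypothetical.

```python
def pairwise_metrics(predicted_pairs, labeled_pairs):
    """Compute metrics over labeled pairs only.

    labeled_pairs maps (id1, id2) -> bool; predictions outside it are ignored.
    """
    labeled = set(labeled_pairs)
    true_set = {pair for pair, is_match in labeled_pairs.items() if is_match}
    pred_set = set(predicted_pairs) & labeled

    tp = len(pred_set & true_set)
    fp = len(pred_set - true_set)
    fn = len(true_set - pred_set)

    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return {'precision': precision, 'recall': recall, 'f1': f1,
            'tp': tp, 'fp': fp, 'fn': fn}


# Toy data: two true matches, two true non-matches
labeled_pairs = {
    ('l1', 'r1'): True,
    ('l2', 'r2'): True,
    ('l3', 'r3'): False,
    ('l4', 'r4'): False,
}
# The last prediction has no label and is therefore ignored
predicted_pairs = [('l1', 'r1'), ('l3', 'r3'), ('l9', 'r9')]

metrics = pairwise_metrics(predicted_pairs, labeled_pairs)
print(metrics)
```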

In [129]:
# Evaluate XGBoostClassifier matching on validation set
xgb_matching_metrics_val = {}

for edge_name in ['LR', 'LS']:
    print(f"{edge_name}: XGBoostClassifier Evaluation (Validation Set) ===")
    
    correspondences = xgb_matching_results[edge_name]
    val_df = splits[edge_name]['val'][['id1', 'id2', 'label']].copy()
    
    correspondences_for_eval = correspondences.copy()
    if 'score' not in correspondences_for_eval.columns and 'sim' in correspondences_for_eval.columns:
        correspondences_for_eval = correspondences_for_eval.rename(columns={'sim': 'score'})
    
    try:
        eval_results = EntityMatchingEvaluator.evaluate_matching(
            correspondences=correspondences_for_eval,
            test_pairs=val_df,
            out_dir=OUTPUT_DIR / 'matching-evaluation-xgb',
            matcher_instance=xgb_matchers[edge_name]
        )
        xgb_matching_metrics_val[edge_name] = eval_results
        
        print(f"  Precision: {eval_results.get('precision', 0.0):.3f}")
        print(f"  Recall:    {eval_results.get('recall', 0.0):.3f}")
        print(f"  F1-Score:  {eval_results.get('f1', 0.0):.3f}")
        print(f"  TP: {eval_results.get('true_positives', 0)}")
        print(f"  FP: {eval_results.get('false_positives', 0)}")
        print(f"  FN: {eval_results.get('false_negatives', 0)}")
    except Exception as exc:
        print(f"  PyDI evaluator failed: {exc}")
        true_matches = val_df[val_df['label'].astype(str).str.strip().str.upper() == 'TRUE']
        true_set = set(zip(true_matches['id1'], true_matches['id2']))
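        # Count only labeled validation pairs; unlabeled predictions are not false positives
        val_set = set(zip(val_df['id1'], val_df['id2']))
        pred_set = set(zip(correspondences['id1'], correspondences['id2'])) & val_set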
        tp = len(true_set & pred_set)
        fp = len(pred_set - true_set)
        fn = len(true_set - pred_set)
        precision = tp / (tp + fp) if (tp + fp) > 0 else 0.0
        recall = tp / (tp + fn) if (tp + fn) > 0 else 0.0
        f1 = 2 * precision * recall / (precision + recall) if (precision + recall) > 0 else 0.0
        xgb_matching_metrics_val[edge_name] = {
            'precision': precision,
            'recall': recall,
            'f1': f1,
            'true_positives': tp,
            'false_positives': fp,
            'false_negatives': fn
        }
        print(f"  Precision: {precision:.3f}")
        print(f"  Recall:    {recall:.3f}")
        print(f"  F1-Score:  {f1:.3f}")
        print(f"  TP: {tp}")
        print(f"  FP: {fp}")
        print(f"  FN: {fn}")

print("XGBoostClassifier evaluation complete")

LR: XGBoostClassifier Evaluation (Validation Set) ===


[INFO ] root - Confusion Matrix:
[INFO ] root -   True Positives:  30
[INFO ] root -   True Negatives:  66
[INFO ] root -   False Positives: 0
[INFO ] root -   False Negatives: 0
[INFO ] root - Performance Metrics:
[INFO ] root -   Accuracy:  1.000
[INFO ] root -   Precision: 1.000
[INFO ] root -   Recall:    1.000
[INFO ] root -   F1-Score:  1.000


  Precision: 1.000
  Recall:    1.000
  F1-Score:  1.000
  TP: 30
  FP: 0
  FN: 0
LS: XGBoostClassifier Evaluation (Validation Set) ===


[INFO ] root - Confusion Matrix:
[INFO ] root -   True Positives:  48
[INFO ] root -   True Negatives:  45
[INFO ] root -   False Positives: 2
[INFO ] root -   False Negatives: 1
[INFO ] root - Performance Metrics:
[INFO ] root -   Accuracy:  0.969
[INFO ] root -   Precision: 0.960
[INFO ] root -   Recall:    0.980
[INFO ] root -   F1-Score:  0.970


  Precision: 0.960
  Recall:    0.980
  F1-Score:  0.970
  TP: 48
  FP: 2
  FN: 1
XGBoostClassifier evaluation complete


#### 5.5.6 Apply Tuned XGBoostClassifier Matcher

Apply the tuned XGBoostClassifier model to candidate pairs. This allows comparison with the default model results.


In [130]:
# Apply tuned XGBoostClassifier to candidate pairs
xgb_tuned_matching_results = {}
xgb_tuned_matchers = {}

for edge_name in ['LR', 'LS']:
    print(f"=== {edge_name}: Tuned XGBoostClassifier Matching ===")
    
    left_df, right_df = source_tables[edge_name]
    cand_df = candidates[edge_name].copy()
    clf = xgb_tuned_classifiers[edge_name]
    
    # Create a matcher; the tuned classifier is passed in at match time
    matcher = MLBasedMatcher(ml_feature_extractor)
    xgb_tuned_matchers[edge_name] = matcher
    
    print("  Filtering candidate pairs by season_year constraint...")
    cand_with_seasons = cand_df.merge(
        left_df[['_rid', 'season_year']],
        left_on='id1', right_on='_rid', how='left'
    ).merge(
        right_df[['_rid', 'season_year']],
        left_on='id2', right_on='_rid', how='left', suffixes=('', '_right')
    )
    cand_df_filtered = cand_with_seasons[
        (cand_with_seasons['season_year'].notna()) & 
        (cand_with_seasons['season_year_right'].notna()) &
        (cand_with_seasons['season_year'] == cand_with_seasons['season_year_right'])
    ][['id1', 'id2']].copy()
    
    print(f"  After season_year filter: {len(cand_df_filtered):,} candidate pairs (from {len(cand_df):,})")
    
    correspondences = matcher.match(
        df_left=left_df,
        df_right=right_df,
        candidates=cand_df_filtered,
        id_column='_rid',
        trained_classifier=clf
    )
    
    xgb_tuned_matching_results[edge_name] = correspondences
    
    print(f"  Generated {len(correspondences):,} matched pairs")
    if 'score' in correspondences.columns:
        print(f"  Score range: [{correspondences['score'].min():.3f}, {correspondences['score'].max():.3f}]")
        print(f"  Mean score: {correspondences['score'].mean():.3f}")

print("\n✓ Tuned XGBoostClassifier matching complete for all edges")


=== LR: Tuned XGBoostClassifier Matching ===
  Filtering candidate pairs by season_year constraint...


[INFO ] PyDI.entitymatching.ml_based.MLBasedMatcher - Starting Entity Matching
[INFO ] PyDI.entitymatching.ml_based.MLBasedMatcher - Blocking 106553 x 15215 elements
[INFO ] PyDI.entitymatching.ml_based.MLBasedMatcher - Matching 106553 x 15215 elements after 0:00:0.001; 135944 blocked pairs (reduction ratio: 0.9999161462661056)


  After season_year filter: 135,944 candidate pairs (from 5,994,373)


[INFO ] PyDI.entitymatching.ml_based.MLBasedMatcher - Entity Matching finished after 0:00:53.862; found 15551 correspondences.


  Generated 15,551 matched pairs
  Score range: [1.000, 1.000]
  Mean score: 1.000
=== LS: Tuned XGBoostClassifier Matching ===
  Filtering candidate pairs by season_year constraint...


[INFO ] PyDI.entitymatching.ml_based.MLBasedMatcher - Starting Entity Matching
[INFO ] PyDI.entitymatching.ml_based.MLBasedMatcher - Blocking 106553 x 6743 elements
[INFO ] PyDI.entitymatching.ml_based.MLBasedMatcher - Matching 106553 x 6743 elements after 0:00:0.000; 9729 blocked pairs (reduction ratio: 0.9999864590429076)


  After season_year filter: 9,729 candidate pairs (from 227,185)


[INFO ] PyDI.entitymatching.ml_based.MLBasedMatcher - Entity Matching finished after 0:00:3.816; found 6741 correspondences.


  Generated 6,741 matched pairs
  Score range: [1.000, 1.000]
  Mean score: 1.000

✓ Tuned XGBoostClassifier matching complete for all edges


#### 5.5.7 Evaluate Tuned XGBoostClassifier Matching

Evaluate the tuned model on the validation set and compare it with the default model.
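
The comparison can be checked by hand from confusion-matrix counts. The sketch below recomputes F1 for the LS edge from the counts logged in this notebook (default: TP=48, FP=2, FN=1; tuned: TP=49, FP=2, FN=0) using the identity F1 = 2·TP / (2·TP + FP + FN). The helper names are hypothetical.

```python
def f1_from_counts(tp, fp, fn):
    """F1 expressed directly in confusion-matrix counts."""
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0


def relative_improvement(tuned, default):
    """Percentage change relative to the default score (NaN if default is 0)."""
    return (tuned - default) / default * 100 if default else float('nan')


# LS validation counts as logged in this notebook
f1_default = f1_from_counts(tp=48, fp=2, fn=1)
f1_tuned = f1_from_counts(tp=49, fp=2, fn=0)
delta = f1_tuned - f1_default

print(f"default={f1_default:.4f} tuned={f1_tuned:.4f} "
      f"delta={delta:+.4f} "
      f"relative={relative_improvement(f1_tuned, f1_default):+.2f}%")
```

The gain on LS corresponds to a single additional true positive among 101 validation pairs, so it is worth reading with some caution.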


In [131]:
# Evaluate tuned XGBoostClassifier matching
xgb_tuned_matching_metrics_val = {}

print("="*80)
print("Tuned XGBoostClassifier Evaluation (Validation Set)")
print("="*80)

for edge_name in ['LR', 'LS']:
    print(f"\n=== {edge_name}: Tuned XGBoostClassifier Evaluation ===")
    
    correspondences = xgb_tuned_matching_results[edge_name]
    val_df = splits[edge_name]['val'][['id1', 'id2', 'label']].copy()
    
    correspondences_for_eval = correspondences.copy()
    if 'score' not in correspondences_for_eval.columns and 'sim' in correspondences_for_eval.columns:
        correspondences_for_eval = correspondences_for_eval.rename(columns={'sim': 'score'})
    
    try:
        eval_results = EntityMatchingEvaluator.evaluate_matching(
            correspondences=correspondences_for_eval,
            test_pairs=val_df,
            out_dir=OUTPUT_DIR / 'matching-evaluation-xgb-tuned',
            matcher_instance=xgb_tuned_matchers[edge_name]
        )
        xgb_tuned_matching_metrics_val[edge_name] = eval_results
        
        print(f"  Precision: {eval_results.get('precision', 0.0):.3f}")
        print(f"  Recall:    {eval_results.get('recall', 0.0):.3f}")
        print(f"  F1-Score:  {eval_results.get('f1', 0.0):.3f}")
        print(f"  TP: {eval_results.get('true_positives', 0)}")
        print(f"  FP: {eval_results.get('false_positives', 0)}")
        print(f"  FN: {eval_results.get('false_negatives', 0)}")
        
        # Compare with default model
        default_metrics = xgb_matching_metrics_val.get(edge_name, {})
        if default_metrics:
            print(f"\n  Comparison with Default Model:")
            print(f"    Default F1: {default_metrics.get('f1', 0.0):.3f}")
            print(f"    Tuned F1:   {eval_results.get('f1', 0.0):.3f}")
            improvement = eval_results.get('f1', 0.0) - default_metrics.get('f1', 0.0)
            print(f"    Improvement: {improvement:+.4f}")
            if improvement > 0:
                print(f"    Relative improvement: {improvement/default_metrics.get('f1', 0.0)*100:+.2f}%")
    except Exception as e:
        print(f"  Evaluation failed: {e}")
        # Fallback to manual calculation
        true_matches = val_df[val_df['label'].astype(str).str.strip().str.upper() == 'TRUE']
        true_set = set(zip(true_matches['id1'], true_matches['id2']))
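        # Count only labeled validation pairs; unlabeled predictions are not false positives
        val_set = set(zip(val_df['id1'], val_df['id2']))
        pred_set = set(zip(correspondences['id1'], correspondences['id2'])) & val_set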
        tp = len(true_set & pred_set)
        fp = len(pred_set - true_set)
        fn = len(true_set - pred_set)
        precision = tp / (tp + fp) if (tp + fp) > 0 else 0.0
        recall = tp / (tp + fn) if (tp + fn) > 0 else 0.0
        f1 = 2 * precision * recall / (precision + recall) if (precision + recall) > 0 else 0.0
        xgb_tuned_matching_metrics_val[edge_name] = {
            'precision': precision,
            'recall': recall,
            'f1': f1,
            'true_positives': tp,
            'false_positives': fp,
            'false_negatives': fn
        }
        print(f"  Precision: {precision:.3f}")
        print(f"  Recall:    {recall:.3f}")
        print(f"  F1-Score:  {f1:.3f}")
        print(f"  TP: {tp}")
        print(f"  FP: {fp}")
        print(f"  FN: {fn}")

print("\n✓ Tuned XGBoostClassifier evaluation complete")


Tuned XGBoostClassifier Evaluation (Validation Set)

=== LR: Tuned XGBoostClassifier Evaluation ===


[INFO ] root - Confusion Matrix:
[INFO ] root -   True Positives:  30
[INFO ] root -   True Negatives:  66
[INFO ] root -   False Positives: 0
[INFO ] root -   False Negatives: 0
[INFO ] root - Performance Metrics:
[INFO ] root -   Accuracy:  1.000
[INFO ] root -   Precision: 1.000
[INFO ] root -   Recall:    1.000
[INFO ] root -   F1-Score:  1.000


  Precision: 1.000
  Recall:    1.000
  F1-Score:  1.000
  TP: 30
  FP: 0
  FN: 0

  Comparison with Default Model:
    Default F1: 1.000
    Tuned F1:   1.000
    Improvement: +0.0000

=== LS: Tuned XGBoostClassifier Evaluation ===


[INFO ] root - Confusion Matrix:
[INFO ] root -   True Positives:  49
[INFO ] root -   True Negatives:  45
[INFO ] root -   False Positives: 2
[INFO ] root -   False Negatives: 0
[INFO ] root - Performance Metrics:
[INFO ] root -   Accuracy:  0.979
[INFO ] root -   Precision: 0.961
[INFO ] root -   Recall:    1.000
[INFO ] root -   F1-Score:  0.980


  Precision: 0.961
  Recall:    1.000
  F1-Score:  0.980
  TP: 49
  FP: 2
  FN: 0

  Comparison with Default Model:
    Default F1: 0.970
    Tuned F1:   0.980
    Improvement: +0.0103
    Relative improvement: +1.06%

✓ Tuned XGBoostClassifier evaluation complete


#### 5.5.8 Analyze XGBoostClassifier Error Cases

Reuse the earlier diagnostic template to inspect false negatives and false positives of the default XGBoost model on the validation set.

In [132]:
# Analyze error cases for XGBoostClassifier
for edge_name in ['LR', 'LS']:
    print(f"{'='*80}")
    print(f"{edge_name}: XGBoostClassifier Error Cases Analysis")
    print(f"{'='*80}")
    
    left_df, right_df = source_tables[edge_name]
    val_df = splits[edge_name]['val'][['id1', 'id2', 'label']].copy()
    correspondences = xgb_matching_results[edge_name]
    
    true_matches = val_df[val_df['label'].astype(str).str.strip().str.upper() == 'TRUE']
    false_matches = val_df[val_df['label'].astype(str).str.strip().str.upper() == 'FALSE']
    true_set = set(zip(true_matches['id1'], true_matches['id2']))
    false_set = set(zip(false_matches['id1'], false_matches['id2']))
    pred_set = set(zip(correspondences['id1'], correspondences['id2']))
    val_set = set(zip(val_df['id1'], val_df['id2']))
    
    fn_pairs = true_set - pred_set
    print(f"[FALSE NEGATIVES] ({len(fn_pairs)} cases):")
    if fn_pairs:
        for id1, id2 in fn_pairs:
            left_rec = left_df[left_df['_rid'] == id1].iloc[0] if len(left_df[left_df['_rid'] == id1]) > 0 else None
            right_rec = right_df[right_df['_rid'] == id2].iloc[0] if len(right_df[right_df['_rid'] == id2]) > 0 else None
            if left_rec is not None and right_rec is not None:
                print(f"{id1} <-> {id2}")
                print(f"Left:  '{left_rec.get('full_name', 'N/A')}' | Season: {left_rec.get('season_year', 'N/A')} | Birth: {left_rec.get('birth_year', 'N/A')}")
                print(f"Right: '{right_rec.get('full_name', 'N/A')}' | Season: {right_rec.get('season_year', 'N/A')} | Birth: {right_rec.get('birth_year', 'N/A')}")
                in_candidates = len(candidates[edge_name][(candidates[edge_name]['id1'] == id1) & (candidates[edge_name]['id2'] == id2)]) > 0
                print(f"In candidates: {in_candidates}")
                score_row = correspondences[(correspondences['id1'] == id1) & (correspondences['id2'] == id2)]
                if len(score_row) > 0 and 'score' in score_row.columns:
                    print(f"    XGBoost Score: {score_row['score'].iloc[0]:.3f}")
    else:
        print("  No false negatives found!")
    
    fp_pairs = (pred_set & val_set) & false_set
    print(f"[FALSE POSITIVES] ({len(fp_pairs)} cases):")
    if fp_pairs:
        for id1, id2 in fp_pairs:
            left_rec = left_df[left_df['_rid'] == id1].iloc[0] if len(left_df[left_df['_rid'] == id1]) > 0 else None
            right_rec = right_df[right_df['_rid'] == id2].iloc[0] if len(right_df[right_df['_rid'] == id2]) > 0 else None
            score_row = correspondences[(correspondences['id1'] == id1) & (correspondences['id2'] == id2)]
            score = score_row['score'].iloc[0] if len(score_row) > 0 and 'score' in score_row.columns else None
            if left_rec is not None and right_rec is not None:
                score_str = f"{score:.3f}" if score is not None else 'N/A'
                print(f" {id1} <-> {id2} (score: {score_str})")
                print(f" Left:  '{left_rec.get('full_name', 'N/A')}' | Season: {left_rec.get('season_year', 'N/A')} | Birth: {left_rec.get('birth_year', 'N/A')}")
                print(f" Right: '{right_rec.get('full_name', 'N/A')}' | Season: {right_rec.get('season_year', 'N/A')} | Birth: {right_rec.get('birth_year', 'N/A')}")
    else:
        print("  No false positives found!")

print("XGBoostClassifier error analysis complete")

LR: XGBoostClassifier Error Cases Analysis
[FALSE NEGATIVES] (0 cases):
  No false negatives found!
[FALSE POSITIVES] (0 cases):
  No false positives found!
LS: XGBoostClassifier Error Cases Analysis
[FALSE NEGATIVES] (1 cases):
4762702015|2015|L <-> 4762702015|2015|S
Left:  'steven tolleson' | Season: 2015 | Birth: 1983
Right: 'steve tolleson' | Season: 2015 | Birth: 1984
In candidates: True
[FALSE POSITIVES] (2 cases):
 5167142015|2015|L <-> 6458482015|2015|S (score: 1.000)
 Left:  'dario alvarez' | Season: 2015 | Birth: 1989
 Right: 'dariel alvarez' | Season: 2015 | Birth: 1989
 5437062017|2017|L <-> 6210022017|2017|S (score: 1.000)
 Left:  'dan robertson' | Season: 2017 | Birth: 1985
 Right: 'daniel robertson' | Season: 2017 | Birth: 1994
XGBoostClassifier error analysis complete
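
To see why these LS pairs are hard, the sketch below recomputes two simple signals for the three error cases listed above: a word-level Jaccard similarity on the names and the absolute birth-year difference. These are plain-Python approximations of the idea behind the notebook's features, not the PyDI comparators themselves, which may normalize or tokenize differently.

```python
def word_jaccard(a, b):
    """Jaccard similarity over whitespace-separated tokens."""
    tokens_a, tokens_b = set(a.split()), set(b.split())
    union = tokens_a | tokens_b
    return len(tokens_a & tokens_b) / len(union) if union else 0.0


# (error type, left name, left birth year, right name, right birth year)
error_cases = [
    ('FN', 'steven tolleson', 1983, 'steve tolleson', 1984),
    ('FP', 'dario alvarez', 1989, 'dariel alvarez', 1989),
    ('FP', 'dan robertson', 1985, 'daniel robertson', 1994),
]

rows = []
for kind, left_name, left_birth, right_name, right_birth in error_cases:
    rows.append((
        kind,
        left_name,
        right_name,
        word_jaccard(left_name, right_name),
        abs(left_birth - right_birth),
    ))

for kind, left_name, right_name, jaccard, birth_diff in rows:
    print(f"{kind}: {left_name!r} vs {right_name!r} | "
          f"word Jaccard={jaccard:.3f} | birth year diff={birth_diff}")
```

In all three pairs only the surname token is shared, so a word-level Jaccard score alone cannot separate the missed true match from the two false matches; the birth-year difference and name-variant signals have to carry the decision.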


## 6. Post-Processing and Analysis

### 6.1 Global Matching (One-to-One Constraint)

In [133]:
from PyDI.entitymatching import GreedyOneToOneMatchingAlgorithm
gb_global_matches = {}
for edge_name in ['LR','LS']:
    matcher = GreedyOneToOneMatchingAlgorithm()
    correspondences = gb_matching_results[edge_name]
    gb_global_matches[edge_name] = matcher.cluster(correspondences)
    print(f"{edge_name}: Greedy matching {len(correspondences)} -> {len(gb_global_matches[edge_name])} pairs")

[INFO ] root - Filtered correspondences: 15533 -> 15533 (threshold=0.0)
[INFO ] root - Greedy matching: 15533 -> 15160 correspondences (30320 entities matched)
[INFO ] root - GreedyOneToOneMatchingAlgorithm: 15533 -> 15160 correspondences
[INFO ] root - GreedyOneToOneMatchingAlgorithm: 30502 -> 30320 entities
[INFO ] root - Filtered correspondences: 6692 -> 6692 (threshold=0.0)
[INFO ] root - Greedy matching: 6692 -> 6685 correspondences (13370 entities matched)
[INFO ] root - GreedyOneToOneMatchingAlgorithm: 6692 -> 6685 correspondences
[INFO ] root - GreedyOneToOneMatchingAlgorithm: 13372 -> 13370 entities


LR: Greedy matching 15533 -> 15160 pairs
LS: Greedy matching 6692 -> 6685 pairs
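
The idea behind a greedy one-to-one constraint can be sketched in a few lines. The version below is a generic illustration, assuming correspondences are processed from highest to lowest score and a pair is kept only if neither record has been matched yet; it is not PyDI's `GreedyOneToOneMatchingAlgorithm`, whose tie-breaking and details may differ. The function name and toy data are hypothetical.

```python
def greedy_one_to_one(correspondences):
    """Keep the best-scoring pairs such that each id appears at most once.

    correspondences: iterable of (id1, id2, score) tuples.
    """
    used_left, used_right, kept = set(), set(), []
    # Highest score first; sorted() is stable, so ties keep their input order
    for id1, id2, score in sorted(correspondences, key=lambda c: c[2], reverse=True):
        if id1 not in used_left and id2 not in used_right:
            kept.append((id1, id2, score))
            used_left.add(id1)
            used_right.add(id2)
    return kept


pairs = [
    ('l1', 'r1', 0.95),
    ('l1', 'r2', 0.90),  # dropped: l1 is already matched to r1
    ('l2', 'r2', 0.80),
    ('l3', 'r1', 0.70),  # dropped: r1 is already matched to l1
]

one_to_one = greedy_one_to_one(pairs)
print(one_to_one)
```

When many correspondences share the same score, a greedy pass like this resolves conflicts by input order only, which is worth keeping in mind when interpreting which pairs survive.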


### 6.2 Cluster Consistency Analysis

Analyze the cluster structure to identify any inconsistencies that our evaluation set may miss. The `EntityMatchingEvaluator` offers the `create_cluster_size_distribution` method for this purpose.
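
A cluster size distribution of this kind can be derived by treating correspondences as edges of a graph and counting the sizes of its connected components. The sketch below does this with a small union-find structure; it is an illustrative reimplementation under that assumption, not the code behind `create_cluster_size_distribution`, and the function name and toy ids are hypothetical.

```python
from collections import Counter


def cluster_size_distribution(pairs):
    """Return {cluster_size: number_of_clusters} for matched (id1, id2) pairs."""
    parent = {}

    def find(node):
        parent.setdefault(node, node)
        while parent[node] != node:
            parent[node] = parent[parent[node]]  # path halving
            node = parent[node]
        return node

    # Union the two records of every correspondence
    for id1, id2 in pairs:
        root1, root2 = find(id1), find(id2)
        if root1 != root2:
            parent[root2] = root1

    # Size of each component, then how often each size occurs
    sizes_by_root = Counter(find(node) for node in list(parent))
    return dict(sorted(Counter(sizes_by_root.values()).items()))


pairs = [
    ('l1', 'r1'),               # a clean one-to-one cluster
    ('l2', 'r2'), ('l2', 'r3'), ('l4', 'r3'),  # a chain forming one cluster of 4
]

distribution = cluster_size_distribution(pairs)
print(distribution)
```

Clusters larger than 2 arise whenever one record takes part in several correspondences, which is exactly what the one-to-one constraint removes.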


In [134]:
from PyDI.entitymatching.evaluation import EntityMatchingEvaluator

for edge_name in ['LR','LS']:
    print(f"\nCluster analysis – {edge_name} (GB, global-matched)")
    dist = EntityMatchingEvaluator.create_cluster_size_distribution(
        correspondences=gb_global_matches[edge_name],
        out_dir=OUTPUT_DIR / 'cluster_analysis' / 'gb_global'
    )
    if dist is not None:
        print(dist.to_string(index=False))


Cluster analysis – LR (GB, global-matched)


[INFO ] PyDI.entitymatching.evaluation - Cluster Size Distribution of 15160 clusters:
[INFO ] PyDI.entitymatching.evaluation - 	Cluster Size	| Frequency	| Percentage
[INFO ] PyDI.entitymatching.evaluation - 	──────────────────────────────────────────────────
[INFO ] PyDI.entitymatching.evaluation - 		2	|	15160	|	100.00%
[INFO ] root - Cluster size distribution written to /Users/zhangzihan/Desktop/WBI_project/Schema_Mapped_Datasets/data/output/matching/cluster_analysis/gb_global/cluster_size_distribution.csv


 cluster_size  frequency  percentage
            2      15160       100.0

Cluster analysis – LS (GB, global-matched)


[INFO ] PyDI.entitymatching.evaluation - Cluster Size Distribution of 6685 clusters:
[INFO ] PyDI.entitymatching.evaluation - 	Cluster Size	| Frequency	| Percentage
[INFO ] PyDI.entitymatching.evaluation - 	──────────────────────────────────────────────────
[INFO ] PyDI.entitymatching.evaluation - 		2	|	6685	|	100.00%
[INFO ] root - Cluster size distribution written to /Users/zhangzihan/Desktop/WBI_project/Schema_Mapped_Datasets/data/output/matching/cluster_analysis/gb_global/cluster_size_distribution.csv


 cluster_size  frequency  percentage
            2       6685       100.0


In [135]:
# Cluster Consistency Analysis for GradientBoosting (GB) Matching

print("="*80)
print("Cluster Consistency Analysis: GradientBoosting (GB) Matching")
print("="*80)

gb_cluster_distributions = {}

for edge_name in ['LR', 'LS']:
    print(f"\n{edge_name} Edge:")
    print("-" * 80)
    
    # Use GradientBoosting correspondences (before global one-to-one matching)
    correspondences = gb_matching_results[edge_name]
    
    # Create cluster size distribution
    cluster_distribution = EntityMatchingEvaluator.create_cluster_size_distribution(
        correspondences=correspondences,
        out_dir=OUTPUT_DIR / "cluster_analysis" / "gb_raw"
    )
    
    gb_cluster_distributions[edge_name] = cluster_distribution
    
    print("\nCluster Size Distribution:")
    if cluster_distribution is not None and len(cluster_distribution) > 0:
        print(cluster_distribution.to_string(index=False))
        
        # Calculate summary statistics
        # Note: PyDI returns column names in lowercase: 'cluster_size', 'frequency', 'percentage'
        total_clusters = cluster_distribution['frequency'].sum()
        # .sum() over an empty selection returns 0, so no extra emptiness check is needed
        clusters_size_2 = cluster_distribution.loc[cluster_distribution['cluster_size'] == 2, 'frequency'].sum()
        clusters_size_gt_2 = total_clusters - clusters_size_2
        
        print("\nSummary:")
        print(f"  Total clusters: {total_clusters}")
        print(f"  Clusters with size 2: {clusters_size_2} ({clusters_size_2/total_clusters*100:.2f}%)")
        print(f"  Clusters with size > 2: {clusters_size_gt_2} ({clusters_size_gt_2/total_clusters*100:.2f}%)")
        
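        # A cluster larger than 2 means at least one record is linked to several
        # partners (one-to-many). These are the raw correspondences; the gb_global
        # distributions above contain only size-2 clusters after one-to-one matching.
        if clusters_size_gt_2 > 0: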
            print(f"\n   Warning: Found {clusters_size_gt_2} clusters with size > 2.")
            print("     This may indicate issues with the evaluation set or data quality.")
            print("     Consider manually inspecting these clusters.")
    else:
        print("  No cluster distribution data available")

print("\n✓ GradientBoosting (GB) matching cluster analysis complete")


Cluster Consistency Analysis: GradientBoosting (GB) Matching

LR Edge:
--------------------------------------------------------------------------------


[INFO ] PyDI.entitymatching.evaluation - Cluster Size Distribution of 15048 clusters:
[INFO ] PyDI.entitymatching.evaluation - 	Cluster Size	| Frequency	| Percentage
[INFO ] PyDI.entitymatching.evaluation - 	──────────────────────────────────────────────────
[INFO ] PyDI.entitymatching.evaluation - 		2	|	14825	|	98.52%
[INFO ] PyDI.entitymatching.evaluation - 		3	|	99	|	0.66%
[INFO ] PyDI.entitymatching.evaluation - 		4	|	94	|	0.62%
[INFO ] PyDI.entitymatching.evaluation - 		5	|	15	|	0.10%
[INFO ] PyDI.entitymatching.evaluation - 		6	|	8	|	0.05%
[INFO ] PyDI.entitymatching.evaluation - 		7	|	3	|	0.02%
[INFO ] PyDI.entitymatching.evaluation - 		8	|	3	|	0.02%
[INFO ] PyDI.entitymatching.evaluation - 		11	|	1	|	0.01%
[INFO ] root - Cluster size distribution written to /Users/zhangzihan/Desktop/WBI_project/Schema_Mapped_Datasets/data/output/matching/cluster_analysis/gb_raw/cluster_size_distribution.csv
[INFO ] PyDI.entitymatching.evaluation - Cluster Size Distribution of 6681 clusters:
[IN


Cluster Size Distribution:
 cluster_size  frequency  percentage
            2      14825   98.518075
            3         99    0.657895
            4         94    0.624668
            5         15    0.099681
            6          8    0.053163
            7          3    0.019936
            8          3    0.019936
           11          1    0.006645

Summary:
  Total clusters: 15048
  Clusters with size 2: 14825 (98.52%)
  Clusters with size > 2: 223 (1.48%)

     This may indicate issues with the evaluation set or data quality.
     Consider manually inspecting these clusters.

LS Edge:
--------------------------------------------------------------------------------

Cluster Size Distribution:
 cluster_size  frequency  percentage
            2       6675   99.910193
            3          2    0.029936
            4          4    0.059871

Summary:
  Total clusters: 6681
  Clusters with size 2: 6675 (99.91%)
  Clusters with size > 2: 6 (0.09%)

     This may indicate issues w
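
The cluster sizes reported above can be reproduced without the library, assuming that a cluster is a connected component of the graph whose edges are the correspondences. That assumption is not confirmed against PyDI's implementation, and `cluster_size_distribution` is a hypothetical helper name. A minimal union-find sketch, which also assumes that record IDs are unique across the two tables:

```python
from collections import Counter

def cluster_size_distribution(pairs):
    """Return {cluster_size: number_of_clusters} for an iterable of (id1, id2) pairs."""
    parent = {}

    def find(node):
        # Register unseen nodes as their own root, then walk up to the root
        parent.setdefault(node, node)
        while parent[node] != node:
            parent[node] = parent[parent[node]]  # path halving keeps trees shallow
            node = parent[node]
        return node

    for left, right in pairs:
        root_left, root_right = find(left), find(right)
        if root_left != root_right:
            parent[root_right] = root_left  # merge the two components

    members_per_root = Counter(find(node) for node in list(parent))
    return dict(Counter(members_per_root.values()))

# Illustrative IDs only: L2 is matched to two records, and R3 is shared with L4
example_pairs = [("L1", "R1"), ("L2", "R2"), ("L2", "R3"), ("L4", "R3")]
print(cluster_size_distribution(example_pairs))
```

In this toy input the second cluster has four members because one-to-many links chain together, which is the pattern behind the size > 2 rows in the raw distribution.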

### 6.3 Compare All Matching Methods

In [136]:
# Compare all matching methods: Rule-Based, Optimized, LogisticRegression, RandomForest, GradientBoosting, and XGBoost
# Note: ensure each model section has been executed so the corresponding metrics dictionaries exist.

print("=" * 80)
print("Matching Performance Comparison: All Methods")
print("=" * 80)

methods = [
    ("Rule", matching_metrics_val),
    ("Optimized", optimized_matching_metrics_val),
    ("LogReg", logreg_matching_metrics_val),
    ("RandomForest", ml_matching_metrics_val),
    ("GB (Default)", globals().get('gb_matching_metrics_val', {})),
    ("GB (Tuned)", globals().get('gb_tuned_matching_metrics_val', {})),
    ("XGB (Default)", globals().get('xgb_matching_metrics_val', {})),
    ("XGB (Tuned)", globals().get('xgb_tuned_matching_metrics_val', {})),
]
method_labels = [label for label, _ in methods]
comparison_data_all = []

for edge_name in ['LR', 'LS']:
    print(f"\n{edge_name} Edge:")
    print("-" * 80)
    
    precision_values = []
    recall_values = []
    f1_values = []
    row_entry = {'edge': edge_name}
    
    for label, metrics_dict in methods:
        metrics = metrics_dict.get(edge_name, {}) or {}
        precision = metrics.get('precision', 0.0)
        recall = metrics.get('recall', 0.0)
        f1 = metrics.get('f1', metrics.get('f1_score', 0.0))
        precision_values.append(precision)
        recall_values.append(recall)
        f1_values.append(f1)
        row_entry[f"{label.lower()}_precision"] = precision
        row_entry[f"{label.lower()}_recall"] = recall
        row_entry[f"{label.lower()}_f1"] = f1
    
    header = "  Metric       | " + " | ".join(f"{label:<15}" for label in method_labels) + " | Best"
    divider = "  " + "-" * (len(header) - 2)
    print(header)
    print(divider)
    
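    def best_label(values):
        # Ties resolve to the first method in `methods` order, so "Best" names
        # only one of possibly several methods that share the top score.
        max_value = max(values)
        idx = values.index(max_value)
        return method_labels[idx]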
    
    precision_line = "  Precision    | " + " | ".join(f"{val:15.3f}" for val in precision_values) + f" | {best_label(precision_values)}"
    recall_line = "  Recall       | " + " | ".join(f"{val:15.3f}" for val in recall_values) + f" | {best_label(recall_values)}"
    f1_line = "  F1-Score     | " + " | ".join(f"{val:15.3f}" for val in f1_values) + f" | {best_label(f1_values)}"
    print(precision_line)
    print(recall_line)
    print(f1_line)
    
    comparison_data_all.append(row_entry)

print("\n" + "=" * 80)
print("Summary:")
print("=" * 80)
print("  Methods compared:", ", ".join(method_labels))

comparison_df_all = pd.DataFrame(comparison_data_all)
comparison_df_all.to_csv(OUTPUT_DIR / 'matching-comparison-all-methods.csv', index=False)
print(f"  Results saved to: {OUTPUT_DIR / 'matching-comparison-all-methods.csv'}")


Matching Performance Comparison: All Methods

LR Edge:
--------------------------------------------------------------------------------
  Metric       | Rule            | Optimized       | LogReg          | RandomForest    | GB (Default)    | GB (Tuned)      | XGB (Default)   | XGB (Tuned)     | Best
  -------------------------------------------------------------------------------------------------------------------------------------------------------------------
  Precision    |           0.962 |           1.000 |           0.909 |           1.000 |           1.000 |           1.000 |           1.000 |           1.000 | Optimized
  Recall       |           0.833 |           0.933 |           1.000 |           1.000 |           1.000 |           1.000 |           1.000 |           1.000 | LogReg
  F1-Score     |           0.893 |           0.966 |           0.952 |           1.000 |           1.000 |           1.000 |           1.000 |           1.000 | RandomForest

LS Edge:
---------

# Error Analysis Summary: From Rule-Based to GradientBoosting Matching

## Executive Summary

This analysis documents the error reduction from baseline rule-based matching to the final GradientBoosting classifier: false negatives fall from 15 to 1 (a 93.3% reduction) and false positives from 2 to 1 (a 50% reduction).

---

## 1. Baseline Rule-Based Matching Errors

**Error Statistics:**
- **LR Edge**: 7 False Negatives (FN), 0 False Positives (FP)
- **LS Edge**: 8 False Negatives (FN), 2 False Positives (FP)
- **Total**: 15 FN, 2 FP

**Error Patterns:**

**False Negatives:** All 15 cases involved name variants (e.g., `dan/daniel`, `matt/matthew`, `jon/jonathon`, `phil/phillip`, `rafael/raffy`). Root cause: string similarity insufficient for variant detection; threshold 0.7 too strict.

**False Positives:** 2 cases with phonetically similar names (`dario/dariel alvarez`, `jose/josh rojas`). Root cause: over-reliance on string similarity; no birth year constraint.

---

## 2. Improvement Measures Implemented

### 2.1 Name Variant Dictionary
- Created `NAME_VARIANTS` dictionary with 38 common name variants
- Applied in optimized rule-based matching with score boosting
- Added `is_name_variant` binary feature for ML models
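
As a rough illustration of how a variant lookup can feed the `is_name_variant` flag, the sketch below uses a small hypothetical subset of the dictionary built from the variants named in this analysis. The structure of the real `NAME_VARIANTS` is not shown here, so the mapping layout, the helper `canonical_first_name`, and the rule that the last name tokens must agree are all assumptions.

```python
# Hypothetical subset; the notebook's NAME_VARIANTS holds 38 entries
NAME_VARIANTS_SKETCH = {
    "dan": "daniel",
    "danny": "daniel",
    "matt": "matthew",
    "cal": "calvin",
    "jon": "jonathon",
    "phil": "phillip",
    "raffy": "rafael",
}

def canonical_first_name(first_name):
    """Map a first name to its canonical form, or return it unchanged."""
    name = first_name.strip().lower()
    return NAME_VARIANTS_SKETCH.get(name, name)

def is_name_variant(full_name_a, full_name_b):
    """Return 1 if last names agree and first names differ only by a known variant."""
    tokens_a = full_name_a.lower().split()
    tokens_b = full_name_b.lower().split()
    if len(tokens_a) < 2 or len(tokens_b) < 2:
        return 0
    if tokens_a[-1] != tokens_b[-1]:
        return 0
    first_a, first_b = tokens_a[0], tokens_b[0]
    if first_a == first_b:
        return 0  # identical names are not variants (assumed convention)
    return int(canonical_first_name(first_a) == canonical_first_name(first_b))

print(is_name_variant("steve tolleson", "steven tolleson"))  # pair is not in the sketch dictionary
```

A lookup of this kind only helps for pairs that are listed, which is why a variant missing from the dictionary still slips through.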

### 2.2 Birth Year Constraints
- Soft constraint in rule-based matching (penalty 0.2 when difference > 1 year)
- `DateComparator` and `birth_year_diff` numeric feature for ML models
- Strong penalty when names match but birth years differ ≥ 2 years
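
The soft constraint can be sketched as follows. The penalty of 0.2 and the one-year tolerance come from the list above; that the penalty is subtracted from the score, that the result is clamped at zero, and that a missing birth year leaves the score unchanged are assumptions, as is the function name.

```python
def apply_birth_year_penalty(name_score, birth_year_a, birth_year_b,
                             penalty=0.2, tolerance_years=1):
    """Lower a name-similarity score when the birth years are too far apart."""
    if birth_year_a is None or birth_year_b is None:
        return name_score  # assumed: no evidence, no penalty
    if abs(birth_year_a - birth_year_b) > tolerance_years:
        return max(0.0, name_score - penalty)  # assumed: subtractive, clamped at 0
    return name_score

# Made-up values: a strong name match whose birth years are three years apart
print(round(apply_birth_year_penalty(0.9, 1990, 1993), 2))
```

Because the constraint is soft, a pair with a very high name score can still pass the matching threshold after the penalty.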

### 2.3 Feature Engineering
- Enhanced feature set: Levenshtein distance, Jaccard similarity, birth year comparator, birth year difference, name variant flag, Soundex phonetic match
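
To make part of the feature set concrete, here is a dependency-free sketch of three of the listed features. It does not reproduce the PyDI comparators used in the notebook: the function names are hypothetical, Jaccard is computed on whitespace tokens (an assumption), and the Soundex and name variant features are left out to keep the example short.

```python
def levenshtein(a, b):
    """Edit distance between two strings (insert, delete, substitute cost 1)."""
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            cost = 0 if char_a == char_b else 1
            current.append(min(previous[j] + 1,          # deletion
                               current[j - 1] + 1,       # insertion
                               previous[j - 1] + cost))  # substitution or match
        previous = current
    return previous[-1]

def token_jaccard(a, b):
    """Jaccard similarity of the whitespace-token sets of two strings."""
    tokens_a, tokens_b = set(a.split()), set(b.split())
    if not tokens_a and not tokens_b:
        return 1.0
    return len(tokens_a & tokens_b) / len(tokens_a | tokens_b)

def pair_features(name_a, name_b, birth_year_a, birth_year_b):
    """Assemble a small feature dictionary for one candidate pair."""
    return {
        "levenshtein": levenshtein(name_a, name_b),
        "jaccard": token_jaccard(name_a, name_b),
        "birth_year_diff": abs(birth_year_a - birth_year_b),
    }

print(levenshtein("dario alvarez", "dariel alvarez"))
print(token_jaccard("dario alvarez", "dariel alvarez"))
```

The two measures can disagree sharply on the same pair: a small edit distance next to a low token overlap is typical for names that differ in only a few characters of one token.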

### 2.4 Error Case Collection and Training Set Reconstruction
- Collected error cases from rule-based and optimized matching evaluations
- Exported error cases to `data/output/gt/manual_cases/` for ground truth augmentation
- Reconstructed training, validation, and test sets to include corner cases
- Ensured error patterns are represented in training data for better model learning

### 2.5 Name Normalization
- Centralized normalization in `name_utils.py`
- Handles UTF-8 encoding, Unicode normalization, punctuation, and suffixes
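
The contents of `name_utils.py` are not shown in this section, so the following is only a sketch of what such a normalizer might look like. The function name, the suffix list, and the choice to replace punctuation with spaces are assumptions.

```python
import re
import unicodedata

# Assumed suffix list; the real module may handle a different set
SUFFIXES = {"jr", "sr", "ii", "iii", "iv"}

def normalize_name(raw):
    """Lowercase a name, strip accents and punctuation, and drop generational suffixes."""
    decomposed = unicodedata.normalize("NFKD", raw)
    without_accents = "".join(ch for ch in decomposed if not unicodedata.combining(ch))
    cleaned = re.sub(r"[^\w\s]", " ", without_accents.lower())
    tokens = [token for token in cleaned.split() if token not in SUFFIXES]
    return " ".join(tokens)

print(normalize_name("José Álvarez, Jr."))
```

Applying one function of this kind to both sides of every candidate pair keeps the rule-based and ML-based matchers comparing the same strings.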

### 2.6 Threshold Adjustments
- Edge-specific thresholds for rule-based matching
- Classifier probability thresholds tuned per edge
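
Per-edge threshold tuning can be sketched as a small search over candidate thresholds on validation data. The notebook's tuning procedure is not shown in this section, so selecting by F1, the candidate grid, and the scores and labels below are illustrative assumptions rather than the notebook's values.

```python
def f1_at_threshold(scores, labels, threshold):
    """F1 when every pair with score >= threshold is predicted as a match."""
    tp = sum(1 for s, y in zip(scores, labels) if s >= threshold and y == 1)
    fp = sum(1 for s, y in zip(scores, labels) if s >= threshold and y == 0)
    fn = sum(1 for s, y in zip(scores, labels) if s < threshold and y == 1)
    denominator = 2 * tp + fp + fn
    return 2 * tp / denominator if denominator else 0.0

def tune_threshold(scores, labels, candidates):
    """Return the candidate with the highest F1 (the first one wins on ties)."""
    return max(candidates, key=lambda t: f1_at_threshold(scores, labels, t))

# Made-up validation scores and labels per edge, for illustration only
validation = {
    "LR": ([0.95, 0.80, 0.65, 0.40, 0.30], [1, 1, 0, 1, 0]),
    "LS": ([0.90, 0.60, 0.45, 0.20], [1, 1, 0, 0]),
}
candidates = [0.3, 0.5, 0.7, 0.9]
thresholds = {edge: tune_threshold(scores, labels, candidates)
              for edge, (scores, labels) in validation.items()}
print(thresholds)
```

Tuning must use the validation split only, so that the test set stays untouched until the final evaluation.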

---

## 3. Final GradientBoosting Performance

**Error Statistics:**
- **LR Edge**: 0 False Negatives, 0 False Positives
- **LS Edge**: 1 False Negative, 1 False Positive
- **Total**: 1 FN, 1 FP (93.3% reduction in FN, 50% reduction in FP)

**Remaining Errors:**
- **False Negative**: `steven tolleson` vs `steve tolleson` - variant not in dictionary
- **False Positive**: `dario alvarez` vs `dariel alvarez` - model overconfidence on highly similar names

---

## 4. Errors Eliminated

**Name Variant False Negatives (14 of 15 cases):**
- `dan/daniel` (5 cases), `matt/matthew` (1 case), `cal/calvin` (1 case), `jon/jonathon` (2 cases), `phil/phillip` (2 cases), `danny/daniel` (1 case), `rafael/raffy` (2 cases)

**Phonetically Similar False Positives (1 case):**
- `jose rojas` vs `josh rojas` - fixed by birth year constraints

**All LR Edge Errors (7 FN):**
- Perfect precision and recall achieved on LR edge

---

## 5. Error Reduction Summary

| Error Type | Baseline | GradientBoosting | Reduction |
|------------|----------|------------------|-----------|
| FN (LR) | 7 | 0 | 100% |
| FN (LS) | 8 | 1 | 87.5% |
| FP (LS) | 2 | 1 | 50% |
| **Total FN** | **15** | **1** | **93.3%** |
| **Total FP** | **2** | **1** | **50%** |
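
The reduction column follows directly from the two count columns. A short check, using the counts from the table and a hypothetical helper name:

```python
def percent_reduction(baseline_errors, final_errors):
    """Relative drop in error count, in percent."""
    if baseline_errors == 0:
        raise ValueError("baseline error count must be positive")
    return (baseline_errors - final_errors) / baseline_errors * 100

# (baseline, GradientBoosting) counts copied from the table above
counts = {
    "FN (LR)": (7, 0),
    "FN (LS)": (8, 1),
    "FP (LS)": (2, 1),
    "Total FN": (15, 1),
    "Total FP": (2, 1),
}
for error_type, (baseline, final) in counts.items():
    print(f"{error_type}: {percent_reduction(baseline, final):.1f}%")
```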

---

## 6. Key Success Factors

1. **Name Variant Dictionary**: Eliminated 14/15 FN cases (93.3%)
2. **Error Case Collection and Training Set Reconstruction**: Improved model learning by including corner cases in training data
3. **Birth Year Constraints**: Prevented over-merging based on name similarity alone
4. **Feature Engineering**: ML models learned complex patterns beyond string similarity
5. **Phonetic Features**: Better handling of pronunciation variants

---

## 7. Recommendations

1. Expand name variant dictionary: add `steven/steve`
2. Strengthen birth year penalties for perfect name matches
3. Add post-processing rules for high-confidence matches with name differences
4. Consider ensemble methods to reduce overconfidence

---

## Conclusion

The transition from rule-based to GradientBoosting matching achieved significant error reduction through name variant handling, birth year constraints, feature engineering, and training set reconstruction with error cases. The remaining errors are edge cases requiring dictionary expansion and stronger constraints for high-confidence matches with subtle name differences.

### 6.4 Final Test Set Evaluation

**IMPORTANT**: This section evaluates all models on the **test set** for final performance reporting.

**Guidelines**:
- Only run this section **ONCE** after all model selection and hyperparameter tuning is complete
- Do **NOT** use test set results to make further adjustments to models
- Test set evaluation provides unbiased estimates of model performance
- These metrics should be used for final reporting and comparison

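**Models Evaluated**:
- Rule-Based Matching
- Optimized Rule-Based Matching
- LogisticRegression
- RandomForest
- GradientBoosting, default and tuned (the default model is the final selection and is evaluated on its global-matched correspondences when they are available)
- XGBoost, default and tuned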


In [137]:
# Final Evaluation on Test Set
# IMPORTANT: Only run this ONCE after all tuning is complete
# Do NOT use test set results for further model adjustments

from PyDI.entitymatching.evaluation import EntityMatchingEvaluator

print("="*80)
print("FINAL EVALUATION ON TEST SET")
print("="*80)
print(" This is the final evaluation. Do not use these results for further tuning.\n")

# Dictionary to store all test set metrics
test_metrics_all = {}

# List of models to evaluate (in order of evaluation)
models_to_evaluate = [
    ('Rule-Based', matching_results, 'matching'),
    ('Optimized', optimized_matching_results, 'optimized'),
    ('LogisticRegression', logreg_matching_results, 'logreg'),
    ('RandomForest', ml_matching_results, 'ml'),
    ('GB (Default)', gb_matching_results, 'gb'),
    ('GB (Tuned)', globals().get('gb_tuned_matching_results'), 'gb-tuned'),
    ('XGB (Default)', globals().get('xgb_matching_results'), 'xgb'),
    ('XGB (Tuned)', globals().get('xgb_tuned_matching_results'), 'xgb-tuned'),
]

for model_name, results_dict, model_key in models_to_evaluate:
    if results_dict is None:
        print(f"\n Skipping {model_name}: results not available")
        continue
    
    print(f"\n{'='*80}")
    print(f"{model_name} - Test Set Evaluation")
    print(f"{'='*80}")
    
    test_metrics_model = {}
    
    for edge_name in ['LR', 'LS']:
        if edge_name not in results_dict:
            print(f"  {edge_name}: results not available")
            continue
        
        print(f"\n  === {edge_name} Edge ===")
        
        # Get correspondences (use global matched version if available for default GradientBoosting)
        if model_name == 'GB (Default)' and 'gb_global_matches' in globals() and edge_name in gb_global_matches:
            correspondences = gb_global_matches[edge_name]
            print("    Using global-matched correspondences")
        else:
            correspondences = results_dict[edge_name]
        
        # Get test set
        test_df = splits[edge_name]['test'][['id1', 'id2', 'label']].copy()
        
        # Prepare correspondences for evaluation
        correspondences_for_eval = correspondences.copy()
        if 'score' not in correspondences_for_eval.columns and 'sim' in correspondences_for_eval.columns:
            correspondences_for_eval = correspondences_for_eval.rename(columns={'sim': 'score'})
        
        # Evaluate on test set
        try:
            eval_results = EntityMatchingEvaluator.evaluate_matching(
                correspondences=correspondences_for_eval,
                test_pairs=test_df,
                out_dir=OUTPUT_DIR / f'matching-evaluation-test-{model_key}',
                matcher_instance=None  # Not needed for evaluation
            )
            
            test_metrics_model[edge_name] = eval_results
            
            print(f"    Precision: {eval_results.get('precision', 0.0):.3f}")
            print(f"    Recall:    {eval_results.get('recall', 0.0):.3f}")
            print(f"    F1-Score:  {eval_results.get('f1', 0.0):.3f}")
            print(f"    TP: {eval_results.get('true_positives', 0)}")
            print(f"    FP: {eval_results.get('false_positives', 0)}")
            print(f"    FN: {eval_results.get('false_negatives', 0)}")
            
        except Exception as exc:
            print(f" Evaluation failed: {exc}")
            test_metrics_model[edge_name] = None
    
    test_metrics_all[model_name] = test_metrics_model

print(f"\n{'='*80}")
print("Test Set Evaluation Summary")
print(f"{'='*80}\n")

# Create summary table
summary_data = []
for model_name, metrics_dict in test_metrics_all.items():
    for edge_name in ['LR', 'LS']:
        if edge_name in metrics_dict and metrics_dict[edge_name] is not None:
            metrics = metrics_dict[edge_name]
            summary_data.append({
                'Model': model_name,
                'Edge': edge_name,
                'Precision': metrics.get('precision', 0.0),
                'Recall': metrics.get('recall', 0.0),
                'F1-Score': metrics.get('f1', 0.0),
                'TP': metrics.get('true_positives', 0),
                'FP': metrics.get('false_positives', 0),
                'FN': metrics.get('false_negatives', 0),
            })

if summary_data:
    summary_df = pd.DataFrame(summary_data)
    print(summary_df.to_string(index=False))
    
    # Save summary
    summary_df.to_csv(OUTPUT_DIR / 'matching-test-evaluation-summary.csv', index=False)
    print(f"\n✓ Test set evaluation summary saved to: {OUTPUT_DIR / 'matching-test-evaluation-summary.csv'}")
else:
    print(" No test set evaluation results available")

print(f"\n{'='*80}")
print("✓ Final test set evaluation complete")
print("These are your final performance metrics for reporting.")
print(f"{'='*80}")


FINAL EVALUATION ON TEST SET
 This is the final evaluation. Do not use these results for further tuning.


Rule-Based - Test Set Evaluation

  === LR Edge ===


[INFO ] root - Confusion Matrix:
[INFO ] root -   True Positives:  23
[INFO ] root -   True Negatives:  64
[INFO ] root -   False Positives: 3
[INFO ] root -   False Negatives: 7
[INFO ] root - Performance Metrics:
[INFO ] root -   Accuracy:  0.897
[INFO ] root -   Precision: 0.885
[INFO ] root -   Recall:    0.767
[INFO ] root -   F1-Score:  0.821


    Precision: 0.885
    Recall:    0.767
    F1-Score:  0.821
    TP: 23
    FP: 3
    FN: 7

  === LS Edge ===


[INFO ] root - Confusion Matrix:
[INFO ] root -   True Positives:  43
[INFO ] root -   True Negatives:  50
[INFO ] root -   False Positives: 3
[INFO ] root -   False Negatives: 4
[INFO ] root - Performance Metrics:
[INFO ] root -   Accuracy:  0.930
[INFO ] root -   Precision: 0.935
[INFO ] root -   Recall:    0.915
[INFO ] root -   F1-Score:  0.925


    Precision: 0.935
    Recall:    0.915
    F1-Score:  0.925
    TP: 43
    FP: 3
    FN: 4

Optimized - Test Set Evaluation

  === LR Edge ===


[INFO ] root - Confusion Matrix:
[INFO ] root -   True Positives:  26
[INFO ] root -   True Negatives:  67
[INFO ] root -   False Positives: 0
[INFO ] root -   False Negatives: 4
[INFO ] root - Performance Metrics:
[INFO ] root -   Accuracy:  0.959
[INFO ] root -   Precision: 1.000
[INFO ] root -   Recall:    0.867
[INFO ] root -   F1-Score:  0.929


    Precision: 1.000
    Recall:    0.867
    F1-Score:  0.929
    TP: 26
    FP: 0
    FN: 4

  === LS Edge ===


[INFO ] root - Confusion Matrix:
[INFO ] root -   True Positives:  41
[INFO ] root -   True Negatives:  53
[INFO ] root -   False Positives: 0
[INFO ] root -   False Negatives: 6
[INFO ] root - Performance Metrics:
[INFO ] root -   Accuracy:  0.940
[INFO ] root -   Precision: 1.000
[INFO ] root -   Recall:    0.872
[INFO ] root -   F1-Score:  0.932


    Precision: 1.000
    Recall:    0.872
    F1-Score:  0.932
    TP: 41
    FP: 0
    FN: 6

LogisticRegression - Test Set Evaluation

  === LR Edge ===


[INFO ] root - Confusion Matrix:
[INFO ] root -   True Positives:  30
[INFO ] root -   True Negatives:  63
[INFO ] root -   False Positives: 4
[INFO ] root -   False Negatives: 0
[INFO ] root - Performance Metrics:
[INFO ] root -   Accuracy:  0.959
[INFO ] root -   Precision: 0.882
[INFO ] root -   Recall:    1.000
[INFO ] root -   F1-Score:  0.938


    Precision: 0.882
    Recall:    1.000
    F1-Score:  0.938
    TP: 30
    FP: 4
    FN: 0

  === LS Edge ===


[INFO ] root - Confusion Matrix:
[INFO ] root -   True Positives:  47
[INFO ] root -   True Negatives:  50
[INFO ] root -   False Positives: 3
[INFO ] root -   False Negatives: 0
[INFO ] root - Performance Metrics:
[INFO ] root -   Accuracy:  0.970
[INFO ] root -   Precision: 0.940
[INFO ] root -   Recall:    1.000
[INFO ] root -   F1-Score:  0.969


    Precision: 0.940
    Recall:    1.000
    F1-Score:  0.969
    TP: 47
    FP: 3
    FN: 0

RandomForest - Test Set Evaluation

  === LR Edge ===


[INFO ] root - Confusion Matrix:
[INFO ] root -   True Positives:  30
[INFO ] root -   True Negatives:  66
[INFO ] root -   False Positives: 1
[INFO ] root -   False Negatives: 0
[INFO ] root - Performance Metrics:
[INFO ] root -   Accuracy:  0.990
[INFO ] root -   Precision: 0.968
[INFO ] root -   Recall:    1.000
[INFO ] root -   F1-Score:  0.984


    Precision: 0.968
    Recall:    1.000
    F1-Score:  0.984
    TP: 30
    FP: 1
    FN: 0

  === LS Edge ===


[INFO ] root - Confusion Matrix:
[INFO ] root -   True Positives:  47
[INFO ] root -   True Negatives:  51
[INFO ] root -   False Positives: 2
[INFO ] root -   False Negatives: 0
[INFO ] root - Performance Metrics:
[INFO ] root -   Accuracy:  0.980
[INFO ] root -   Precision: 0.959
[INFO ] root -   Recall:    1.000
[INFO ] root -   F1-Score:  0.979


    Precision: 0.959
    Recall:    1.000
    F1-Score:  0.979
    TP: 47
    FP: 2
    FN: 0

GB (Default) - Test Set Evaluation

  === LR Edge ===
    Using global-matched correspondences


[INFO ] root - Confusion Matrix:
[INFO ] root -   True Positives:  30
[INFO ] root -   True Negatives:  67
[INFO ] root -   False Positives: 0
[INFO ] root -   False Negatives: 0
[INFO ] root - Performance Metrics:
[INFO ] root -   Accuracy:  1.000
[INFO ] root -   Precision: 1.000
[INFO ] root -   Recall:    1.000
[INFO ] root -   F1-Score:  1.000


    Precision: 1.000
    Recall:    1.000
    F1-Score:  1.000
    TP: 30
    FP: 0
    FN: 0

  === LS Edge ===
    Using global-matched correspondences


[INFO ] root - Confusion Matrix:
[INFO ] root -   True Positives:  45
[INFO ] root -   True Negatives:  53
[INFO ] root -   False Positives: 0
[INFO ] root -   False Negatives: 2
[INFO ] root - Performance Metrics:
[INFO ] root -   Accuracy:  0.980
[INFO ] root -   Precision: 1.000
[INFO ] root -   Recall:    0.957
[INFO ] root -   F1-Score:  0.978


    Precision: 1.000
    Recall:    0.957
    F1-Score:  0.978
    TP: 45
    FP: 0
    FN: 2

GB (Tuned) - Test Set Evaluation

  === LR Edge ===


[INFO ] root - Confusion Matrix:
[INFO ] root -   True Positives:  30
[INFO ] root -   True Negatives:  66
[INFO ] root -   False Positives: 1
[INFO ] root -   False Negatives: 0
[INFO ] root - Performance Metrics:
[INFO ] root -   Accuracy:  0.990
[INFO ] root -   Precision: 0.968
[INFO ] root -   Recall:    1.000
[INFO ] root -   F1-Score:  0.984


    Precision: 0.968
    Recall:    1.000
    F1-Score:  0.984
    TP: 30
    FP: 1
    FN: 0

  === LS Edge ===


[INFO ] root - Confusion Matrix:
[INFO ] root -   True Positives:  45
[INFO ] root -   True Negatives:  53
[INFO ] root -   False Positives: 0
[INFO ] root -   False Negatives: 2
[INFO ] root - Performance Metrics:
[INFO ] root -   Accuracy:  0.980
[INFO ] root -   Precision: 1.000
[INFO ] root -   Recall:    0.957
[INFO ] root -   F1-Score:  0.978


    Precision: 1.000
    Recall:    0.957
    F1-Score:  0.978
    TP: 45
    FP: 0
    FN: 2

XGB (Default) - Test Set Evaluation

  === LR Edge ===


[INFO ] root - Confusion Matrix:
[INFO ] root -   True Positives:  30
[INFO ] root -   True Negatives:  66
[INFO ] root -   False Positives: 1
[INFO ] root -   False Negatives: 0
[INFO ] root - Performance Metrics:
[INFO ] root -   Accuracy:  0.990
[INFO ] root -   Precision: 0.968
[INFO ] root -   Recall:    1.000
[INFO ] root -   F1-Score:  0.984


    Precision: 0.968
    Recall:    1.000
    F1-Score:  0.984
    TP: 30
    FP: 1
    FN: 0

  === LS Edge ===


[INFO ] root - Confusion Matrix:
[INFO ] root -   True Positives:  45
[INFO ] root -   True Negatives:  53
[INFO ] root -   False Positives: 0
[INFO ] root -   False Negatives: 2
[INFO ] root - Performance Metrics:
[INFO ] root -   Accuracy:  0.980
[INFO ] root -   Precision: 1.000
[INFO ] root -   Recall:    0.957
[INFO ] root -   F1-Score:  0.978


    Precision: 1.000
    Recall:    0.957
    F1-Score:  0.978
    TP: 45
    FP: 0
    FN: 2

XGB (Tuned) - Test Set Evaluation

  === LR Edge ===


[INFO ] root - Confusion Matrix:
[INFO ] root -   True Positives:  30
[INFO ] root -   True Negatives:  66
[INFO ] root -   False Positives: 1
[INFO ] root -   False Negatives: 0
[INFO ] root - Performance Metrics:
[INFO ] root -   Accuracy:  0.990
[INFO ] root -   Precision: 0.968
[INFO ] root -   Recall:    1.000
[INFO ] root -   F1-Score:  0.984


    Precision: 0.968
    Recall:    1.000
    F1-Score:  0.984
    TP: 30
    FP: 1
    FN: 0

  === LS Edge ===


[INFO ] root - Confusion Matrix:
[INFO ] root -   True Positives:  47
[INFO ] root -   True Negatives:  51
[INFO ] root -   False Positives: 2
[INFO ] root -   False Negatives: 0
[INFO ] root - Performance Metrics:
[INFO ] root -   Accuracy:  0.980
[INFO ] root -   Precision: 0.959
[INFO ] root -   Recall:    1.000
[INFO ] root -   F1-Score:  0.979


    Precision: 0.959
    Recall:    1.000
    F1-Score:  0.979
    TP: 47
    FP: 2
    FN: 0

Test Set Evaluation Summary

             Model Edge  Precision   Recall  F1-Score  TP  FP  FN
        Rule-Based   LR   0.884615 0.766667  0.821429  23   3   7
        Rule-Based   LS   0.934783 0.914894  0.924731  43   3   4
         Optimized   LR   1.000000 0.866667  0.928571  26   0   4
         Optimized   LS   1.000000 0.872340  0.931818  41   0   6
LogisticRegression   LR   0.882353 1.000000  0.937500  30   4   0
LogisticRegression   LS   0.940000 1.000000  0.969072  47   3   0
      RandomForest   LR   0.967742 1.000000  0.983607  30   1   0
      RandomForest   LS   0.959184 1.000000  0.979167  47   2   0
      GB (Default)   LR   1.000000 1.000000  1.000000  30   0   0
      GB (Default)   LS   1.000000 0.957447  0.978261  45   0   2
        GB (Tuned)   LR   0.967742 1.000000  0.983607  30   1   0
        GB (Tuned)   LS   1.000000 0.957447  0.978261  45   0   2
     XGB (Default)
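
The Precision, Recall, and F1-Score columns in the summary can be recomputed from the TP, FP, and FN counts, which is a quick way to sanity-check a row. The helper name is hypothetical, and returning 0.0 for an empty denominator is an assumed convention rather than the library's documented behavior.

```python
def prf_from_counts(tp, fp, fn):
    """Precision, recall, and F1 from confusion-matrix counts."""
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = 2 * tp / (2 * tp + fp + fn) if (2 * tp + fp + fn) else 0.0
    return precision, recall, f1

# Rule-Based, LR edge: TP=23, FP=3, FN=7 (from the summary table)
precision, recall, f1 = prf_from_counts(23, 3, 7)
print(f"{precision:.6f} {recall:.6f} {f1:.6f}")
```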

## 7. Export Results
### 7.1 Output Correspondences

In [138]:
correspondences_output_dir = OUTPUT_DIR / "correspondences"
correspondences_output_dir.mkdir(parents=True, exist_ok=True)

# Export the global-matched GradientBoosting correspondences for each edge,
# as pickle (preserves dtypes) and as CSV (portable)
for edge_name in ['LR', 'LS']:
    edge_matches = gb_global_matches[edge_name]
    edge_matches.to_pickle(correspondences_output_dir / f"gb_global_{edge_name}.pkl")
    edge_matches.to_csv(correspondences_output_dir / f"gb_global_{edge_name}.csv", index=False)
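
Before the exported files are handed to the data fusion phase, it may be worth confirming that they satisfy the one-to-one constraint. The check below is a sketch that works on plain `(id1, id2)` tuples; the helper name is hypothetical, and the `id1` and `id2` column names are assumed to match those used in the ground truth splits.

```python
from collections import Counter

def one_to_one_violations(pairs):
    """Return the IDs that occur in more than one correspondence, per side."""
    pairs = list(pairs)
    left_counts = Counter(left for left, _ in pairs)
    right_counts = Counter(right for _, right in pairs)
    return {
        "id1": sorted(i for i, n in left_counts.items() if n > 1),
        "id2": sorted(i for i, n in right_counts.items() if n > 1),
    }

# Illustrative IDs only: L2 appears twice on the left side
print(one_to_one_violations([("L1", "R1"), ("L2", "R2"), ("L2", "R3")]))
```

With a DataFrame `df`, the pairs can be produced with `df[['id1', 'id2']].itertuples(index=False, name=None)`. Two empty lists indicate that every record takes part in at most one match.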