# Healthcare Provider Fraud Detection: Machine Learning Modeling

## Theoretical Foundation

### 1. Fraud Detection as a Classification Problem

Healthcare fraud detection is fundamentally a **binary classification problem** where we predict whether a provider is fraudulent (`1`) or legitimate (`0`). This presents unique challenges that distinguish it from standard classification tasks:

#### 1.1 Problem Characteristics
- **Severe Class Imbalance**: Fraudulent providers make up only about 9% of the data (9.4% in this dataset), which biases an unweighted model toward the majority class
- **High-Stakes Decisions**: False positives damage provider reputation; false negatives allow fraud to continue
- **Interpretability Requirements**: Regulators need explainable predictions for investigations
- **Evolving Fraud Patterns**: Fraudsters adapt, requiring robust and generalizable models

#### 1.2 Business Context
- **Cost of Investigation**: Limited resources require prioritizing high-confidence cases
- **Regulatory Compliance**: Models must align with healthcare regulations and audit requirements
- **Temporal Dynamics**: Fraud patterns evolve, necessitating regular model updates

### 2. Class Imbalance Handling Strategies

#### 2.1 Class Weighting (Chosen Approach)
**Theory**: Assigns higher penalties to misclassifying minority class samples during training.

**Mathematical Foundation**:
```
Class Weight = n_samples / (n_classes × n_samples_class)
```
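
As a worked illustration of the formula above, the following minimal sketch computes the weights by hand. The 900/100 label split is a hypothetical toy example, not the project data, and `balanced_class_weights` is an illustrative helper name. The formula is the same heuristic that scikit-learn applies when `class_weight='balanced'` is requested.

```python
from collections import Counter

def balanced_class_weights(labels):
    """Return {class: n_samples / (n_classes * n_samples_class)}."""
    counts = Counter(labels)
    n_samples = len(labels)
    n_classes = len(counts)
    return {cls: n_samples / (n_classes * n) for cls, n in counts.items()}

# Hypothetical toy labels: 900 legitimate providers (0) and 100 fraudulent (1)
toy_labels = [0] * 900 + [1] * 100
weights = balanced_class_weights(toy_labels)
print(weights)
```

With a 90/10 split, an error on a fraud sample costs nine times as much as an error on a legitimate one, so both classes contribute equally to the total loss.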

**Advantages**:
- Preserves original data distribution
- Computationally efficient
- No synthetic data generation risks
- Works well with ensemble methods

**Limitations**:
- May increase false positive rate
- Requires careful threshold tuning
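
Threshold tuning can be sketched as a search over candidate cut-offs for the one that maximizes F1 on held-out validation data. The labels, scores, and helper names below are hypothetical and only illustrate the idea. In practice the threshold should be chosen on a validation split or cross-validated predictions, not on the test set.

```python
def f1_at_threshold(y_true, scores, threshold):
    """F1 for the positive class when scores >= threshold are flagged as fraud."""
    preds = [1 if s >= threshold else 0 for s in scores]
    tp = sum(1 for t, p in zip(y_true, preds) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, preds) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, preds) if t == 1 and p == 0)
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)

def best_threshold(y_true, scores, candidates):
    """Return the candidate threshold with the highest F1."""
    return max(candidates, key=lambda t: f1_at_threshold(y_true, scores, t))

# Hypothetical validation labels and predicted fraud probabilities
y_val = [0, 0, 0, 0, 0, 0, 1, 0, 1, 1]
scores_val = [0.05, 0.10, 0.20, 0.30, 0.45, 0.55, 0.60, 0.65, 0.80, 0.90]
candidates = [0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9]

chosen = best_threshold(y_val, scores_val, candidates)
print(chosen, round(f1_at_threshold(y_val, scores_val, chosen), 3))
```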

#### 2.2 Alternative Approaches (Considered but not implemented)
- **SMOTE**: Risk of generating unrealistic synthetic fraud patterns
- **Undersampling**: Loss of valuable information from majority class
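- **Explicit Cost Matrices**: Fully cost-sensitive learning requires domain-specific misclassification costs; class weighting is a simpler special case that needs no cost estimates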

### 3. Algorithm Selection Framework

#### 3.1 Evaluation Criteria
1. **Performance on Imbalanced Data**: Emphasis on F1-score and PR-AUC
2. **Interpretability**: Essential for regulatory compliance
3. **Robustness**: Stability across different data subsets
4. **Computational Efficiency**: Scalability to large datasets

#### 3.2 Selected Algorithms

**Logistic Regression**
- **Theory**: Linear decision boundary with probabilistic output
- **Advantages**: Highly interpretable, fast training, good baseline
- **Use Case**: Benchmark model and regulatory explanations
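
To make the "probabilistic output" concrete, here is a minimal sketch of how a fitted logistic regression turns a linear score into a fraud probability. The coefficients, intercept, and feature values are hypothetical and do not come from the trained model.

```python
import math

def fraud_probability(features, coefficients, intercept):
    """Sigmoid of the linear score: p = 1 / (1 + exp(-(b + w . x)))."""
    z = intercept + sum(w * x for w, x in zip(coefficients, features))
    return 1.0 / (1.0 + math.exp(-z))

# Hypothetical standardized feature values and model parameters
coefficients = [0.8, -0.3]
intercept = -2.0
p = fraud_probability([1.5, 0.5], coefficients, intercept)
print(f"Predicted fraud probability: {p:.3f}")
```

Each coefficient is the change in log-odds of fraud per unit increase in its (standardized) feature, which is why the model is easy to explain to regulators.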

**Random Forest**
- **Theory**: Ensemble of decision trees with voting mechanism
- **Advantages**: Handles mixed data types, built-in feature importance, robust to outliers
- **Use Case**: Primary candidate for production deployment

**Decision Tree**
- **Theory**: Hierarchical splitting rules for classification
- **Advantages**: Maximum interpretability, handles non-linear patterns
- **Use Case**: Explanation model for specific cases

**Support Vector Machine (SVM)**
- **Theory**: Maximum margin classification with kernel trick
- **Advantages**: Effective in high-dimensional spaces, memory efficient
- **Use Case**: Complex pattern detection with RBF kernel

**XGBoost (Optional)**
- **Theory**: Gradient boosting with advanced regularization
- **Advantages**: Strong performance on tabular data, built-in class weighting via `scale_pos_weight`
- **Use Case**: Performance benchmark if computational resources allow

### 4. Evaluation Methodology

#### 4.1 Metrics Hierarchy
1. **Primary**: F1-Score (harmonic mean of precision/recall)
2. **Secondary**: PR-AUC (area under precision-recall curve)
3. **Supplementary**: ROC-AUC, Precision, Recall
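
The PR-AUC column in this notebook is computed with `average_precision_score`, which summarizes the precision-recall curve as the sum of precision values weighted by the increase in recall. The sketch below reimplements that idea in plain Python for untied scores. The labels and scores are hypothetical, and `average_precision` is an illustrative helper, not the library function.

```python
def average_precision(y_true, scores):
    """Average precision: mean of precision at the rank of each positive.

    Assumes all scores are distinct (no tie handling).
    """
    ranked = sorted(zip(scores, y_true), key=lambda pair: pair[0], reverse=True)
    n_pos = sum(y_true)
    tp = 0
    ap = 0.0
    for rank, (_, label) in enumerate(ranked, start=1):
        if label == 1:
            tp += 1
            # Recall rises by 1 / n_pos at each positive; weight precision by that step
            ap += (tp / rank) / n_pos
    return ap

# Hypothetical labels and predicted fraud probabilities
y_toy = [1, 0, 1, 0, 0]
scores_toy = [0.9, 0.8, 0.7, 0.6, 0.5]
ap = average_precision(y_toy, scores_toy)
print(round(ap, 4))
```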

#### 4.2 Why F1-Score Priority?
- Balances precision (false positive control) and recall (fraud detection)
- More meaningful than accuracy for imbalanced datasets
- Aligns with business objectives of effective fraud detection
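
The harmonic mean can be checked by hand against the results table in this notebook. The sketch below plugs in the precision and recall reported there for Random Forest and Logistic Regression; the function name is illustrative.

```python
def f1_from_precision_recall(precision, recall):
    """Harmonic mean of precision and recall."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)

# Precision and recall as reported in the notebook's results table
rf_f1 = f1_from_precision_recall(0.835616, 0.603960)
lr_f1 = f1_from_precision_recall(0.484043, 0.900990)
print(f"Random Forest F1: {rf_f1:.4f}")
print(f"Logistic Regression F1: {lr_f1:.4f}")
```

The harmonic mean is pulled toward the smaller of the two values, so Logistic Regression's high recall does not compensate for its low precision.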

#### 4.3 Cross-Validation Strategy
- **Stratified K-Fold**: Preserves class distribution across folds
- **K=5**: A common compromise between the variance of the performance estimate and computational cost
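
What stratification does can be illustrated with a simplified stand-in for `StratifiedKFold`: indices are dealt into folds class by class, so every fold keeps roughly the overall fraud rate. The 90/10 toy labels are hypothetical, and unlike the notebook's configuration (`shuffle=True`) this sketch does not shuffle.

```python
from collections import defaultdict

def stratified_folds(labels, k):
    """Assign each sample index to one of k folds, class by class, round-robin."""
    by_class = defaultdict(list)
    for idx, label in enumerate(labels):
        by_class[label].append(idx)
    folds = [[] for _ in range(k)]
    for indices in by_class.values():
        for position, idx in enumerate(indices):
            folds[position % k].append(idx)
    return folds

# Hypothetical toy labels: 90 legitimate (0) and 10 fraudulent (1)
toy_labels = [0] * 90 + [1] * 10
folds = stratified_folds(toy_labels, k=5)
fraud_rates = [sum(toy_labels[i] for i in fold) / len(fold) for fold in folds]
print(fraud_rates)
```

Without stratification, a random fold of 20 samples could easily contain no fraud cases at all, which would make precision, recall, and F1 undefined or unstable for that fold.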

### 5. Model Interpretability Requirements

#### 5.1 Regulatory Compliance
- Feature importance rankings
- Decision path explanations
- Confidence score interpretation

#### 5.2 Business Stakeholder Needs
- Clear feature contribution analysis
- Fraud pattern identification
- Risk scoring methodology

---

## 1. Setup and Data Loading

In [1]:
# Import essential libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import warnings
warnings.filterwarnings('ignore')

# Set visualization style
plt.style.use('seaborn-v0_8')
sns.set_palette("husl")
plt.rcParams['figure.figsize'] = (12, 8)

print("Libraries loaded successfully!")

Libraries loaded successfully!


In [8]:
# Load the prepared dataset from feature engineering notebook
try:
    # Load the final dataset created in the previous notebook
    final_dataset = pd.read_csv('../data/provider_level.csv', index_col='Provider')
    
    print("✅ Dataset loaded successfully!")
    print(f"Dataset shape: {final_dataset.shape}")
    
    # Prepare features and target
    X = final_dataset.drop(['PotentialFraud', 'PotentialFraud_numeric'], axis=1)
    y = final_dataset['PotentialFraud_numeric']

    X.to_csv('../data/X.csv', index=False)
    y.to_csv('../data/y.csv', index=False)

    print(f"Features shape: {X.shape}")
    print("Target distribution:")
    print(f"  Non-fraud: {(y == 0).sum()} ({(y == 0).mean():.1%})")
    print(f"  Fraud: {(y == 1).sum()} ({(y == 1).mean():.1%})")
    
except FileNotFoundError:
    print("❌ Dataset not found. Please run the data exploration and feature engineering notebook first.")
    print("Expected file: ../data/provider_level.csv")

✅ Dataset loaded successfully!
Dataset shape: (5410, 35)
Features shape: (5410, 33)
Target distribution:
  Non-fraud: 4904 (90.6%)
  Fraud: 506 (9.4%)


## 2. Machine Learning Setup

In [5]:
# Import ML libraries
from sklearn.model_selection import train_test_split, cross_val_predict, StratifiedKFold
from sklearn.preprocessing import StandardScaler
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.svm import SVC
from sklearn.metrics import (
    classification_report, confusion_matrix, precision_recall_curve,
    roc_curve, auc, precision_score, recall_score, f1_score, 
    accuracy_score, roc_auc_score, average_precision_score
)
from sklearn.pipeline import Pipeline
from sklearn.utils.class_weight import compute_class_weight

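# XGBoost is optional: fall back to None so later cells can skip it
try:
    import xgboost as xgb
    print("✅ XGBoost available")
except ImportError:
    xgb = None
    print("⚠️ XGBoost not installed; the XGBoost model will be skipped")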

print("ML libraries loaded successfully!")

✅ XGBoost available
ML libraries loaded successfully!


In [9]:
# Data preparation for modeling
print("=== Data Preparation for Modeling ===")

# Split the data
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)

X_train.to_csv('../data/X_train.csv', index=False)
X_test.to_csv('../data/X_test.csv', index=False)
y_train.to_csv('../data/y_train.csv', index=False)
y_test.to_csv('../data/y_test.csv', index=False)

# Calculate class weights and scale_pos_weight
class_weights = compute_class_weight('balanced', classes=np.unique(y_train), y=y_train)
class_weight_dict = dict(zip(np.unique(y_train), class_weights))
scale_pos_weight = (y_train == 0).sum() / (y_train == 1).sum()

print(f"Training set: {X_train.shape}, Test set: {X_test.shape}")
print(f"Class weights: {class_weight_dict}")
print(f"Scale pos weight: {scale_pos_weight:.2f}")

# Cross-validation setup
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)

print("✅ Data preparation complete!")

=== Data Preparation for Modeling ===
Training set: (4328, 33), Test set: (1082, 33)
Class weights: {np.int64(0): np.float64(0.5516186591893959), np.int64(1): np.float64(5.3432098765432094)}
Scale pos weight: 9.69
✅ Data preparation complete!


## 3. Model Definition and Training

In [10]:
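# Model evaluation function
def evaluate_model(y_true, y_pred, y_pred_proba):
    """Return accuracy, precision, recall, F1, ROC-AUC, and PR-AUC for the fraud class.

    y_pred holds hard 0/1 labels; y_pred_proba holds predicted probabilities of class 1.
    PR-AUC is reported as average precision.
    """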
    return {
        'Accuracy': accuracy_score(y_true, y_pred),
        'Precision': precision_score(y_true, y_pred, average='binary'),
        'Recall': recall_score(y_true, y_pred, average='binary'),
        'F1': f1_score(y_true, y_pred, average='binary'),
        'ROC_AUC': roc_auc_score(y_true, y_pred_proba),
        'PR_AUC': average_precision_score(y_true, y_pred_proba)
    }

# Define models with class weighting
models = {
    'Logistic_Regression': Pipeline([
        ('imputer', SimpleImputer(strategy='median')),
        ('scaler', StandardScaler()),
        ('classifier', LogisticRegression(class_weight='balanced', random_state=42, max_iter=1000))
    ]),
    'Random_Forest': Pipeline([
        ('imputer', SimpleImputer(strategy='median')),
        ('classifier', RandomForestClassifier(class_weight='balanced', random_state=42, n_estimators=100))
    ]),
    'Decision_Tree': Pipeline([
        ('imputer', SimpleImputer(strategy='median')),
        ('classifier', DecisionTreeClassifier(class_weight='balanced', random_state=42, max_depth=10))
    ]),
    'SVM': Pipeline([
        ('imputer', SimpleImputer(strategy='median')),
        ('scaler', StandardScaler()),
        ('classifier', SVC(class_weight='balanced', random_state=42, probability=True))
    ])
}

if xgb is not None:
    models['XGBoost'] = Pipeline([
        ('imputer', SimpleImputer(strategy='median')),
        ('classifier', xgb.XGBClassifier(scale_pos_weight=scale_pos_weight, random_state=42, eval_metric='logloss'))
    ])

print(f"✅ {len(models)} models defined: {list(models.keys())}")

✅ 5 models defined: ['Logistic_Regression', 'Random_Forest', 'Decision_Tree', 'SVM', 'XGBoost']


In [15]:
# Train and evaluate all models
import joblib
import os

os.makedirs('../models', exist_ok=True)
print("=== Model Training and Evaluation ===")

results = []
trained_models = {}

for name, pipeline in models.items():
    print(f"\nTraining {name}...")
    
    try:
        # Train the model
        pipeline.fit(X_train, y_train)

        # Save the trained model
        joblib.dump(pipeline, f'../models/{name}.pkl')
        
        # Predictions
        y_pred_test = pipeline.predict(X_test)
        y_pred_proba_test = pipeline.predict_proba(X_test)[:, 1]
        
        # Evaluate
        metrics = evaluate_model(y_test, y_pred_test, y_pred_proba_test)
        metrics['Model'] = name
        results.append(metrics)
        
        # Store trained model
        trained_models[name] = pipeline
        
        print(f"  F1: {metrics['F1']:.4f}, PR-AUC: {metrics['PR_AUC']:.4f}")
        
    except Exception as e:
        print(f"  ❌ Error training {name}: {e}")

# Create results DataFrame
results_df = pd.DataFrame(results)
print(results_df)
results_df.to_csv('../data/model_results.csv', index=False)
print("\n=== Model Comparison Results ===")
print(results_df.round(4))

=== Model Training and Evaluation ===

Training Logistic_Regression...
  F1: 0.6298, PR-AUC: 0.7872

Training Random_Forest...
  F1: 0.7011, PR-AUC: 0.7847

Training Decision_Tree...
  F1: 0.6154, PR-AUC: 0.5187

Training SVM...
  F1: 0.6186, PR-AUC: 0.5786

Training XGBoost...
  F1: 0.7129, PR-AUC: 0.7841
   Accuracy  Precision    Recall        F1   ROC_AUC    PR_AUC  \
0  0.901109   0.484043  0.900990  0.629758  0.969722  0.787230   
1  0.951941   0.835616  0.603960  0.701149  0.969015  0.784721   
2  0.907579   0.503145  0.792079  0.615385  0.866004  0.518690   
3  0.897412   0.473684  0.891089  0.618557  0.952408  0.578576   
4  0.946396   0.712871  0.712871  0.712871  0.962808  0.784057   

                 Model  
0  Logistic_Regression  
1        Random_Forest  
2        Decision_Tree  
3                  SVM  
4              XGBoost  

=== Model Comparison Results ===
   Accuracy  Precision  Recall      F1  ROC_AUC  PR_AUC                Model
0    0.9011     0.4840  0.9010  0.