# 🏆 Model Comparison and Selection

Learn how to compare multiple models and select the best one.

## Topics Covered
1. Using ModelFactory to create models
2. Comparing multiple algorithms
3. Cross-validation
4. Statistical significance testing
5. Generating leaderboards

**Time Required**: ~25 minutes

In [None]:
import sys
sys.path.insert(0, '../../')

from data_science_master_system import (
    DataLoader, Pipeline, ModelFactory, AutoModelSelector,
    ClassificationModel, Evaluator, Plotter,
)
from data_science_master_system.evaluation import ModelComparison, calculate_metrics
from sklearn.model_selection import train_test_split
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import warnings
warnings.filterwarnings('ignore')

print("✅ Ready!")

In [None]:
# Load and prepare data
loader = DataLoader()
df = loader.read('../data/csv/customer_churn.csv')

# Prepare features
df_ml = df.drop(columns=['customer_id'])

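# Split features and target first so the target column is never one-hot encoded
# (get_dummies would rename a string-valued 'churn' column, e.g. to 'churn_Yes')
y = df_ml['churn']

# Encode categorical feature columns
X = pd.get_dummies(df_ml.drop(columns=['churn']), drop_first=True)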

# Train-test split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

print(f"Training set: {X_train.shape}")
print(f"Test set: {X_test.shape}")
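
Before comparing models, it can help to check how the target classes are distributed in each split, since a skewed split makes accuracy and F1 harder to interpret. The following is a minimal sketch using only the standard library; `class_proportions` and the toy labels are illustrative names and values, not part of `data_science_master_system`.

```python
from collections import Counter


def class_proportions(labels):
    """Return {class: fraction of samples} for any iterable of labels."""
    counts = Counter(labels)
    total = sum(counts.values())
    return {cls: n / total for cls, n in counts.items()}


# Toy labels standing in for y_train and y_test (illustrative values only)
toy_train = [0, 0, 0, 1, 0, 0, 1, 0]
toy_test = [0, 1]

print("train:", class_proportions(toy_train))
print("test: ", class_proportions(toy_test))
```

In the notebook the same helper could be called as `class_proportions(y_train)` and `class_proportions(y_test)`. `train_test_split` accepts `stratify=y` to keep the class proportions aligned across the two splits.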

## 1. Available Models

In [None]:
# List all available models
classification_models = ModelFactory.list_available('classification')
regression_models = ModelFactory.list_available('regression')

print("📋 Available Classification Models:")
for m in classification_models:
    print(f"  • {m}")

print("\n📋 Available Regression Models:")
for m in regression_models:
    print(f"  • {m}")

## 2. Create and Train Multiple Models

In [None]:
# Define models to compare
model_configs = [
    ('Random Forest', 'random_forest', {'n_estimators': 100}),
    ('Gradient Boosting', 'gradient_boosting', {'n_estimators': 100}),
    ('Logistic Regression', 'logistic_regression', {'max_iter': 1000}),
    ('Decision Tree', 'decision_tree', {}),
    ('KNN', 'knn', {'n_neighbors': 5}),
]

# Train each model
trained_models = {}

for name, model_type, params in model_configs:
    print(f"Training {name}...")
    model = ClassificationModel(model_type, **params)
    model.fit(X_train, y_train)
    trained_models[name] = model

print("\n✅ All models trained!")
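
The loop above builds each model from a string key plus keyword parameters. One common way to support that kind of call is a name-to-class registry. The sketch below shows the idea with two toy classifiers; `MODEL_REGISTRY`, `register`, `create_model`, and both classes are hypothetical and are not how `ModelFactory` or `ClassificationModel` is actually implemented.

```python
# Hypothetical registry: maps a string key to a model class.
MODEL_REGISTRY = {}


def register(name):
    """Class decorator that stores the decorated class under `name`."""
    def decorator(cls):
        MODEL_REGISTRY[name] = cls
        return cls
    return decorator


@register("majority")
class MajorityClassifier:
    """Baseline that always predicts the most frequent training label."""

    def fit(self, X, y):
        labels = list(y)
        self.label_ = max(set(labels), key=labels.count)
        return self

    def predict(self, X):
        return [self.label_ for _ in X]


@register("threshold")
class ThresholdClassifier:
    """Toy model: predicts 1 when the first feature exceeds `threshold`."""

    def __init__(self, threshold=0.5):
        self.threshold = threshold

    def fit(self, X, y):
        return self  # nothing to learn in this toy model

    def predict(self, X):
        return [1 if row[0] > self.threshold else 0 for row in X]


def create_model(name, **params):
    """Look up `name` in the registry and build the model with `params`."""
    if name not in MODEL_REGISTRY:
        raise ValueError(f"Unknown model '{name}'. Available: {sorted(MODEL_REGISTRY)}")
    return MODEL_REGISTRY[name](**params)


# Toy data (illustrative values only)
X_toy = [[0.2], [0.9], [0.4], [0.7]]
y_toy = [0, 1, 0, 0]

for key, params in [("majority", {}), ("threshold", {"threshold": 0.6})]:
    toy_model = create_model(key, **params).fit(X_toy, y_toy)
    print(key, toy_model.predict(X_toy))
```

A majority-class baseline like the one here is also a useful extra row in any comparison: a model that cannot beat it on F1 is not learning much from the features.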

## 3. Model Comparison

In [None]:
# Initialize comparison
comparison = ModelComparison(problem_type='classification')

# Add models
for name, model in trained_models.items():
    comparison.add_model(name, model.underlying_model)

# Compare on test set
results = comparison.compare(X_test, y_test)

print("📊 Model Comparison Results:")
display(results)

In [None]:
# Get leaderboard sorted by F1 score
leaderboard = comparison.get_leaderboard(metric='f1')

print("🏆 Leaderboard (by F1 Score):")
display(leaderboard)
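
A leaderboard is a ranking of per-model metric values, and the ranking can change with the metric chosen. The sketch below illustrates this with plain dictionaries; `build_leaderboard` and the numbers are made up for illustration and do not come from the churn dataset or from `ModelComparison`.

```python
def build_leaderboard(results, metric="f1"):
    """Return (name, metrics) pairs sorted best-first by `metric`."""
    return sorted(results.items(), key=lambda item: item[1][metric], reverse=True)


# Illustrative numbers only, not results from the churn dataset
toy_results = {
    "model_a": {"accuracy": 0.90, "f1": 0.55},
    "model_b": {"accuracy": 0.86, "f1": 0.62},
    "model_c": {"accuracy": 0.88, "f1": 0.48},
}

for metric in ("f1", "accuracy"):
    print(f"Ranked by {metric}:")
    for rank, (name, scores) in enumerate(build_leaderboard(toy_results, metric), start=1):
        print(f"  {rank}. {name}: {scores[metric]:.2f}")
```

On imbalanced targets such as churn, accuracy can favor a model that rarely predicts the minority class, which is one reason to rank by F1 here.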

In [None]:
# Visualize comparison
fig, axes = plt.subplots(1, 2, figsize=(14, 5))

# Accuracy comparison
leaderboard.plot(x='model', y='accuracy', kind='bar', ax=axes[0], color='steelblue', legend=False)
axes[0].set_title('Accuracy Comparison')
axes[0].set_xlabel('')
axes[0].set_ylabel('Accuracy')
axes[0].tick_params(axis='x', rotation=45)

# F1 Score comparison
leaderboard.plot(x='model', y='f1', kind='bar', ax=axes[1], color='coral', legend=False)
axes[1].set_title('F1 Score Comparison')
axes[1].set_xlabel('')
axes[1].set_ylabel('F1 Score')
axes[1].tick_params(axis='x', rotation=45)

plt.tight_layout()
plt.show()

## 4. Cross-Validation Comparison

In [None]:
# Cross-validate models
cv_results = comparison.cross_validate(X, y, cv=5)

print("📊 Cross-Validation Results:")
display(cv_results)
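
Per-fold scores also make it possible to ask whether the gap between two models is larger than fold-to-fold noise. A simple option is a paired t-test on the per-fold differences. The sketch below computes the t statistic with the standard library; `paired_t_statistic` and the fold scores are illustrative, and it assumes both models were scored on the same 5 folds.

```python
import math
import statistics


def paired_t_statistic(scores_a, scores_b):
    """t statistic for the mean per-fold score difference between two models."""
    if len(scores_a) != len(scores_b):
        raise ValueError("Both models need one score per fold")
    diffs = [a - b for a, b in zip(scores_a, scores_b)]
    mean_diff = statistics.mean(diffs)
    std_diff = statistics.stdev(diffs)  # sample standard deviation (n - 1)
    if std_diff == 0:
        raise ValueError("Differences have zero variance; t is undefined")
    return mean_diff / (std_diff / math.sqrt(len(diffs)))


# Illustrative per-fold F1 scores for two models (not real results)
fold_scores_a = [0.62, 0.60, 0.65, 0.61, 0.63]
fold_scores_b = [0.58, 0.59, 0.61, 0.60, 0.60]

t_stat = paired_t_statistic(fold_scores_a, fold_scores_b)

# Two-sided critical value for alpha = 0.05 with 5 - 1 = 4 degrees of freedom
T_CRITICAL_DF4 = 2.776
print(f"t = {t_stat:.3f}, significant at the 5% level: {abs(t_stat) > T_CRITICAL_DF4}")
```

If SciPy is installed, `scipy.stats.ttest_rel(fold_scores_a, fold_scores_b)` returns the same statistic together with a p-value. Because cross-validation folds share training data, the scores are not independent and this test tends to be optimistic, so treat the result as a rough guide rather than a strict decision rule.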

## 5. Auto Model Selection

In [None]:
# Use AutoModelSelector to find the best model automatically
auto_selector = AutoModelSelector(
    problem_type='classification',
    cv=5,
    models_to_try=['random_forest', 'gradient_boosting', 'logistic_regression']
)

print("🔍 Running Auto Model Selection...")
best_model = auto_selector.select(X_train, y_train)

print("\n📊 Auto Selection Results:")
display(auto_selector.get_leaderboard())
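
Conceptually, automatic selection with `cv=5` amounts to scoring each candidate with k-fold cross-validation and keeping the one with the best mean score. The sketch below shows that loop on toy data using plain lists. All names here (`k_fold_indices`, `mean_cv_accuracy`, the two toy classes) are hypothetical, and `AutoModelSelector` may work differently internally.

```python
def k_fold_indices(n_samples, k):
    """Yield (train_idx, test_idx) index lists for k contiguous folds."""
    fold_sizes = [n_samples // k + (1 if i < n_samples % k else 0) for i in range(k)]
    start = 0
    for size in fold_sizes:
        stop = start + size
        test_idx = list(range(start, stop))
        train_idx = [i for i in range(n_samples) if i < start or i >= stop]
        yield train_idx, test_idx
        start = stop


def mean_cv_accuracy(make_model, X, y, k=5):
    """Average held-out accuracy of a freshly built model over k folds."""
    fold_scores = []
    for train_idx, test_idx in k_fold_indices(len(X), k):
        model = make_model()  # new, unfitted model for every fold
        model.fit([X[i] for i in train_idx], [y[i] for i in train_idx])
        preds = model.predict([X[i] for i in test_idx])
        correct = sum(p == y[i] for p, i in zip(preds, test_idx))
        fold_scores.append(correct / len(test_idx))
    return sum(fold_scores) / len(fold_scores)


class AlwaysZero:
    """Toy baseline that predicts class 0 for every row."""

    def fit(self, X, y):
        return self

    def predict(self, X):
        return [0 for _ in X]


class FirstFeatureRule:
    """Toy model: predicts 1 when the first feature is above 0.5."""

    def fit(self, X, y):
        return self

    def predict(self, X):
        return [1 if row[0] > 0.5 else 0 for row in X]


# Toy data (illustrative values only)
X_toy = [[0.1], [0.8], [0.3], [0.9], [0.2], [0.7], [0.4], [0.6], [0.05], [0.95]]
y_toy = [0, 1, 0, 1, 0, 1, 0, 1, 0, 1]

candidates = {"always_zero": AlwaysZero, "first_feature_rule": FirstFeatureRule}
cv_scores = {name: mean_cv_accuracy(cls, X_toy, y_toy, k=5) for name, cls in candidates.items()}
best_name = max(cv_scores, key=cv_scores.get)

print(cv_scores)
print("Selected:", best_name)
```

The folds here are contiguous and unshuffled to keep the sketch short; for real data, shuffled or stratified folds are usually a better choice.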

In [None]:
# Train and evaluate the best model
best_model.fit(X_train, y_train)
y_pred = best_model.predict(X_test)

metrics = calculate_metrics(y_test, y_pred, 'classification')

print("🏆 Best Model Performance:")
for metric, value in metrics.items():
    print(f"  • {metric}: {value:.4f}")

## 6. Detailed Evaluation

In [None]:
from data_science_master_system.evaluation.metrics import ClassificationMetrics

# Get confusion matrix
cm = ClassificationMetrics.confusion_matrix(y_test, y_pred)

# Plot confusion matrix
plotter = Plotter()
fig = plotter.confusion_matrix(cm, labels=['No Churn', 'Churn'], title='Confusion Matrix')
plt.show()

In [None]:
# Classification report
print("\n📋 Classification Report:")
print(ClassificationMetrics.classification_report(y_test, y_pred, target_names=['No Churn', 'Churn']))
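
The figures in a classification report follow directly from the four cells of a binary confusion matrix. As a rough illustration, the sketch below counts those cells and derives precision, recall, and F1 for the positive class. The helper names and toy labels are assumptions made for this example and are not part of `ClassificationMetrics`.

```python
def confusion_counts(y_true, y_pred, positive=1):
    """Return (tp, fp, fn, tn) for a binary problem."""
    tp = fp = fn = tn = 0
    for truth, pred in zip(y_true, y_pred):
        if pred == positive and truth == positive:
            tp += 1
        elif pred == positive:
            fp += 1
        elif truth == positive:
            fn += 1
        else:
            tn += 1
    return tp, fp, fn, tn


def precision_recall_f1(tp, fp, fn):
    """Derive precision, recall, and F1, returning 0.0 where undefined."""
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    return precision, recall, 2 * precision * recall / (precision + recall)


# Toy labels: 1 = churn, 0 = no churn (illustrative values only)
toy_true = [1, 0, 1, 1, 0, 0, 1, 0]
toy_pred = [1, 0, 0, 1, 0, 1, 1, 0]

tp, fp, fn, tn = confusion_counts(toy_true, toy_pred)
precision, recall, f1 = precision_recall_f1(tp, fp, fn)

print(f"TP={tp} FP={fp} FN={fn} TN={tn}")
print(f"precision={precision:.2f} recall={recall:.2f} f1={f1:.2f}")
```

For churn, recall on the positive class shows how many actual churners the model catches, while precision shows how many flagged customers really churn; F1 balances the two.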

## 🎯 Key Takeaways

1. **ModelFactory** - Create any model with a unified API
2. **ModelComparison** - Compare multiple models easily
3. **AutoModelSelector** - Automatically find the best model
4. **Cross-validation** - Get robust performance estimates
5. **Visualization** - Understand model performance visually

### Ready for Intermediate Level! →