<a href="https://colab.research.google.com/github/calmrocks/master-machine-learning-engineer/blob/main/basic_models/ClassificationMultiClass.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Wine Quality Classification

## Introduction
This notebook demonstrates multi-class classification using the Wine Quality dataset. We'll explore both multi-class and binary classification approaches.

## Dataset Overview
- Features: 11 physicochemical properties
- Target: Quality (scores 3-8)
- Samples: 1,599 red wines
- Clean dataset with no missing values; quality classes are imbalanced (most wines score 5 or 6)

## Problem Statement
1. Predict wine quality based on physicochemical properties
2. Identify most influential features
3. Compare different classification approaches
4. Evaluate model performance

In [None]:
# Import required libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Set plotting style (the 'seaborn' matplotlib style name was removed in
# recent matplotlib releases; sns.set_theme() applies the seaborn look)
sns.set_theme()

# Suppress warnings
import warnings
warnings.filterwarnings('ignore')

## Data Loading and Exploration

First, we'll load the Wine Quality dataset and perform initial exploration to understand:
- Data structure and size
- Feature distributions
- Quality score distribution
- Missing values
- Basic statistics

In [None]:
# Load the wine quality dataset
df = pd.read_csv('winequality-red.csv', sep=';')

# Display basic information
print("Dataset Shape:", df.shape)
print("\nFirst few rows:")
display(df.head())

print("\nFeature Names:")
print(df.columns.tolist())

print("\nData Types:")
print(df.dtypes)

In [None]:
# Display basic statistics
print("Basic Statistics:")
display(df.describe())

# Check for missing values
print("\nMissing Values:")
print(df.isnull().sum())

### Quality Distribution Analysis

Let's examine the distribution of wine quality scores and create both multi-class and binary classification targets.

In [None]:
# Visualize quality distribution
plt.figure(figsize=(12, 5))

# Original quality distribution
plt.subplot(1, 2, 1)
sns.countplot(data=df, x='quality')
plt.title('Wine Quality Distribution')
plt.xlabel('Quality Score')
plt.ylabel('Count')

# Calculate and display statistics
quality_stats = df['quality'].value_counts().sort_index()
print("Quality Distribution:")
print(quality_stats)
print("\nQuality Statistics:")
print(f"Mean Quality: {df['quality'].mean():.2f}")
print(f"Median Quality: {df['quality'].median()}")
print(f"Standard Deviation: {df['quality'].std():.2f}")

In [None]:
# Create binary classification target
df['quality_binary'] = df['quality'].apply(lambda x: 1 if x >= 6 else 0)

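# Visualize binary distribution (each notebook cell renders its own figure,
# so start a new one here instead of reusing the previous cell's subplot grid)
plt.figure(figsize=(6, 5))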
sns.countplot(data=df, x='quality_binary')
plt.title('Binary Quality Distribution')
plt.xlabel('Quality (0: Poor, 1: Good)')
plt.ylabel('Count')

# Display binary distribution statistics
print("\nBinary Quality Distribution:")
binary_stats = df['quality_binary'].value_counts(normalize=True)
print(binary_stats)

plt.tight_layout()
plt.show()
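
The thresholding step can also be written as a vectorized comparison, which avoids the per-row `lambda`. The snippet below is a minimal sketch on a small hand-made Series (the values are illustrative, not taken from the dataset); `toy_quality` and `toy_binary` are hypothetical names.

```python
import pandas as pd

# Toy quality scores standing in for df['quality'] (illustrative values only)
toy_quality = pd.Series([3, 5, 6, 7, 5, 8, 4, 6], name='quality')

# Vectorized comparison: True where quality >= 6, then cast to 0/1
toy_binary = (toy_quality >= 6).astype(int)

# Element-wise version used in the notebook, for comparison
toy_binary_apply = toy_quality.apply(lambda x: 1 if x >= 6 else 0)

print(toy_binary.tolist())
print((toy_binary == toy_binary_apply).all())
```

Both forms give the same labels; the vectorized one is usually faster on large frames.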

### Feature Distribution Analysis

Let's examine the distribution of key features and their relationship with wine quality.

In [None]:
# Create histograms for all features
features = df.columns.drop(['quality', 'quality_binary'])
fig, axes = plt.subplots(4, 3, figsize=(15, 20))
axes = axes.ravel()

for idx, feature in enumerate(features):
    sns.histplot(data=df, x=feature, ax=axes[idx], bins=30)
    axes[idx].set_title(f'Distribution of {feature}')

plt.tight_layout()
plt.show()

# Display summary statistics for each feature
print("Feature Summary Statistics:")
for feature in features:
    print(f"\n{feature}:")
    print(df[feature].describe().round(2))

In [None]:
# Create box plots for feature relationships with quality
plt.figure(figsize=(15, 20))
for idx, feature in enumerate(features, 1):
    plt.subplot(4, 3, idx)
    sns.boxplot(data=df, x='quality', y=feature)
    plt.title(f'{feature} vs Quality')
    plt.xticks(rotation=45)

plt.tight_layout()
plt.show()

### Initial Findings

1. **Data Structure**:
   - 1,599 wine samples
   - 11 numeric features
   - No missing values

2. **Quality Distribution**:
   - Scores range from 3 to 8
   - Most wines rated 5-6 (medium quality), roughly 82% of samples
   - Strongly imbalanced: scores 3, 4, and 8 are rare

3. **Binary Classification**:
   - Good wines (≥6): ~53%
   - Poor wines (<6): ~47%

4. **Feature Distributions**:
   - Some features look roughly normal (e.g. density, pH); several others are right-skewed (e.g. residual sugar, chlorides)
   - Some features have notable outliers
   - Clear relationships between certain features and quality
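
Distribution shape can be checked numerically as well as visually. The sketch below uses synthetic columns (not the wine data) to show how sample skewness separates a symmetric feature from a right-skewed one; `toy`, `symmetric`, and `right_skewed` are hypothetical names, and the log transform is just one common option.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)

# Synthetic stand-ins: one roughly symmetric column, one with a long right tail
toy = pd.DataFrame({
    'symmetric': rng.normal(loc=10.0, scale=1.0, size=2000),
    'right_skewed': rng.lognormal(mean=0.0, sigma=1.0, size=2000),
})

# Sample skewness: near 0 for symmetric data, clearly positive for a right tail
skewness = toy.skew()
print(skewness)

# A log transform often reduces right skew for strictly positive features
log_skew = np.log(toy['right_skewed']).skew()
print(log_skew)
```

On the real data, `df[features].skew()` would give the same per-feature summary.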

Next, we'll perform more detailed feature analysis and prepare the data for modeling.

## Feature Analysis

We'll analyze the features in detail through:
1. Correlation analysis
2. Feature interactions
3. Relationship with quality scores
4. Feature importance preliminary assessment

In [None]:
# Create correlation matrix
plt.figure(figsize=(12, 10))
correlation = df.drop('quality_binary', axis=1).corr()
mask = np.triu(np.ones_like(correlation, dtype=bool))

# Create heatmap
sns.heatmap(correlation, 
            mask=mask,
            annot=True,
            fmt='.2f',
            cmap='coolwarm',
            center=0,
            square=True)
plt.title('Feature Correlation Matrix')
plt.tight_layout()
plt.show()

# Display strongest correlations with quality
quality_correlations = correlation['quality'].sort_values(ascending=False)
print("\nStrongest correlations with quality:")
print(quality_correlations)

In [None]:
# Select the four features most strongly correlated with quality.
# Rank by absolute value so strong negative correlations are not skipped.
top_features = (quality_correlations.drop('quality')
                .abs()
                .sort_values(ascending=False)
                .index[:4])

# Create pair plots for top features (pairplot creates its own figure)
sns.pairplot(data=df[list(top_features) + ['quality']], 
             hue='quality',
             diag_kind='kde')
plt.suptitle('Pair Plot of Top Correlated Features', y=1.02)
plt.show()

### Feature Interactions Analysis

Let's examine how features interact with each other and their combined effect on wine quality.

In [None]:
# Create interaction plots for top features
fig, axes = plt.subplots(2, 2, figsize=(15, 15))
axes = axes.ravel()

for idx, feature in enumerate(top_features):
    sns.scatterplot(data=df,
                    x=feature,
                    y='quality',
                    ax=axes[idx],
                    alpha=0.5)
    
    # Add trend line
    z = np.polyfit(df[feature], df['quality'], 1)
    p = np.poly1d(z)
    axes[idx].plot(df[feature], p(df[feature]), "r--", alpha=0.8)
    
    axes[idx].set_title(f'{feature} vs Quality')

plt.tight_layout()
plt.show()

In [None]:
# Create 3D scatter plot for top three features
from mpl_toolkits.mplot3d import Axes3D

fig = plt.figure(figsize=(12, 8))
ax = fig.add_subplot(111, projection='3d')

scatter = ax.scatter(df[top_features[0]], 
                    df[top_features[1]], 
                    df[top_features[2]],
                    c=df['quality'],
                    cmap='viridis')

ax.set_xlabel(top_features[0])
ax.set_ylabel(top_features[1])
ax.set_zlabel(top_features[2])
plt.colorbar(scatter)
plt.title('3D Visualization of Top Features')
plt.show()

### Preliminary Feature Importance Analysis

Let's analyze feature importance using different statistical measures.

In [None]:
# Calculate various statistical measures
feature_stats = pd.DataFrame(index=features)

# Correlation with quality
feature_stats['correlation'] = abs(correlation['quality'].drop('quality'))

# Variance
feature_stats['variance'] = df[features].var()

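# Mutual Information Score (quality is treated as an ordinal numeric target;
# random_state makes the nearest-neighbor estimate reproducible)
from sklearn.feature_selection import mutual_info_regression
feature_stats['mutual_info'] = mutual_info_regression(df[features], df['quality'], random_state=42)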

# Sort by correlation
feature_stats = feature_stats.sort_values('correlation', ascending=False)

# Plot feature importance measures
fig, axes = plt.subplots(1, 3, figsize=(20, 6))

# Correlation plot
sns.barplot(x=feature_stats.index, 
            y='correlation', 
            data=feature_stats, 
            ax=axes[0])
axes[0].set_xticklabels(axes[0].get_xticklabels(), rotation=45)
axes[0].set_title('Absolute Correlation with Quality')

# Variance plot
sns.barplot(x=feature_stats.index, 
            y='variance', 
            data=feature_stats, 
            ax=axes[1])
axes[1].set_xticklabels(axes[1].get_xticklabels(), rotation=45)
axes[1].set_title('Feature Variance')

# Mutual Information plot
sns.barplot(x=feature_stats.index, 
            y='mutual_info', 
            data=feature_stats, 
            ax=axes[2])
axes[2].set_xticklabels(axes[2].get_xticklabels(), rotation=45)
axes[2].set_title('Mutual Information with Quality')

plt.tight_layout()
plt.show()

# Display feature importance summary
print("Feature Importance Summary:")
display(feature_stats.round(3))
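
When reading the three panels, note that raw variance depends on a feature's units, whereas correlation does not. The sketch below illustrates this with a synthetic feature expressed in two hypothetical units (`x_g` and `x_mg`); it is an illustration under those assumptions, not a result from the wine data.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(42)
n = 500

# Hypothetical feature in g/L and the same feature expressed in mg/L
x_g = rng.normal(loc=0.5, scale=0.1, size=n)
x_mg = x_g * 1000.0

# Toy target that depends linearly on the feature plus noise
y = 3.0 * x_g + rng.normal(scale=0.1, size=n)

toy = pd.DataFrame({'x_g': x_g, 'x_mg': x_mg, 'y': y})

variances = toy[['x_g', 'x_mg']].var()
correlations = toy[['x_g', 'x_mg']].corrwith(toy['y'])

print(variances)      # differ by a factor of 1000**2
print(correlations)   # the same for both columns
```

For that reason the variance panel is best read as a description of scale, not as a ranking of predictive value.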

### Feature Analysis Summary

1. **Correlation Analysis**:
   - Strongest positive correlations with quality:
     * Alcohol content
     * Sulphates
   - Strongest negative correlations:
     * Volatile acidity
     * Total sulfur dioxide

2. **Feature Interactions**:
   - Several features show non-linear relationships with quality
   - Some features exhibit interaction effects
   - Clear patterns in 3D visualization

3. **Feature Importance**:
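   - Correlation and mutual information are the comparable importance signals; raw variance depends on each feature's units and is not an importance measure on its own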
   - Top features identified by multiple methods
   - Some features show low importance across all measures

4. **Key Insights**:
   - Multiple features contribute significantly to wine quality
   - Both chemical and physical properties are important
   - Some features might be redundant

Next steps:
1. Feature selection/engineering
2. Handle potential multicollinearity
3. Consider feature transformations
4. Prepare data for modeling

## Data Preprocessing

We'll prepare our data for modeling through:
1. Feature selection and engineering
2. Handling outliers
3. Feature scaling
4. Data splitting
5. Class balance analysis

### Feature Engineering and Selection

Based on our feature analysis, let's:
1. Create interaction features
2. Handle multicollinearity
3. Select most important features

In [None]:
# Create interaction features for top correlated pairs
df['alcohol_sulphates'] = df['alcohol'] * df['sulphates']
df['fixed_volatile_acidity'] = df['fixed acidity'] * df['volatile acidity']
df['total_acidity'] = df['fixed acidity'] + df['volatile acidity']

# Create polynomial features for important numeric features
df['alcohol_squared'] = df['alcohol']**2
df['sulphates_squared'] = df['sulphates']**2

# Display new features
print("New engineered features:")
new_features = ['alcohol_sulphates', 'fixed_volatile_acidity', 'total_acidity', 
                'alcohol_squared', 'sulphates_squared']
print(df[new_features].describe())

In [None]:
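# Check multicollinearity with VIF
from statsmodels.stats.outliers_influence import variance_inflation_factor
from statsmodels.tools.tools import add_constant

def calculate_vif(df, features):
    features = list(features)
    # variance_inflation_factor expects a design matrix that includes an intercept;
    # without one, VIFs on uncentered data come out heavily inflated
    X = add_constant(df[features], has_constant='add')
    vif_data = pd.DataFrame()
    vif_data["Feature"] = features
    # Column 0 is the constant, so feature i sits at column i + 1
    vif_data["VIF"] = [variance_inflation_factor(X.values, i + 1)
                       for i in range(len(features))]
    return vif_data.sort_values('VIF', ascending=False)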

# Calculate VIF for original and new features
all_features = list(features) + new_features
vif_results = calculate_vif(df, all_features)

print("Variance Inflation Factors:")
display(vif_results)

# Remove features with high VIF (>10)
high_vif_features = vif_results[vif_results['VIF'] > 10]['Feature'].tolist()
print("\nFeatures to remove due to high multicollinearity:")
print(high_vif_features)
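
Dropping every feature above the threshold in one pass can remove more than necessary, because collinear features inflate each other's VIF. A common alternative is to drop only the worst feature, recompute, and repeat. The sketch below assumes that approach and computes VIF from its definition, 1 / (1 - R²), with scikit-learn instead of statsmodels; `vif_table`, `drop_high_vif`, and the toy columns are hypothetical names.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression

def vif_table(X):
    """VIF_j = 1 / (1 - R^2_j), with R^2_j from regressing column j on the others."""
    vifs = {}
    for col in X.columns:
        others = X.drop(columns=col)
        r2 = LinearRegression().fit(others, X[col]).score(others, X[col])
        vifs[col] = np.inf if r2 >= 1.0 else 1.0 / (1.0 - r2)
    return pd.Series(vifs).sort_values(ascending=False)

def drop_high_vif(X, threshold=10.0):
    """Drop the single worst feature, recompute, and repeat until all VIFs <= threshold."""
    X = X.copy()
    while X.shape[1] > 1:
        vifs = vif_table(X)
        if vifs.iloc[0] <= threshold:
            break
        X = X.drop(columns=vifs.index[0])
    return X

# Toy data: the third column is almost exactly the sum of the first two
rng = np.random.default_rng(0)
a = rng.normal(size=300)
b = rng.normal(size=300)
toy = pd.DataFrame({'a': a, 'b': b,
                    'a_plus_b': a + b + rng.normal(scale=0.01, size=300)})

kept = drop_high_vif(toy)
print(vif_table(toy))
print("Kept:", list(kept.columns))
```

In the toy example all three columns start with a very high VIF, yet removing one of them is enough to bring the remaining two under the threshold.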

### Outlier Analysis and Treatment

Let's identify and handle outliers in our features.

In [None]:
def detect_outliers(df, features):
    outliers_dict = {}
    for feature in features:
        Q1 = df[feature].quantile(0.25)
        Q3 = df[feature].quantile(0.75)
        IQR = Q3 - Q1
        lower_bound = Q1 - 1.5 * IQR
        upper_bound = Q3 + 1.5 * IQR
        outliers = df[(df[feature] < lower_bound) | (df[feature] > upper_bound)]
        outliers_dict[feature] = len(outliers)
    return pd.Series(outliers_dict)

# Detect outliers in original features
outliers_count = detect_outliers(df, features)
print("Number of outliers in each feature:")
print(outliers_count)

# Visualize outliers
plt.figure(figsize=(15, 6))
df[features].boxplot()
plt.xticks(rotation=45)
plt.title('Outliers in Features')
plt.tight_layout()
plt.show()

In [None]:
# Function to cap outliers
def cap_outliers(df, features):
    df_capped = df.copy()
    for feature in features:
        Q1 = df[feature].quantile(0.25)
        Q3 = df[feature].quantile(0.75)
        IQR = Q3 - Q1
        lower_bound = Q1 - 1.5 * IQR
        upper_bound = Q3 + 1.5 * IQR
        df_capped[feature] = df_capped[feature].clip(lower=lower_bound, upper=upper_bound)
    return df_capped

# Cap outliers
df_processed = cap_outliers(df, features)

# Verify outlier treatment
plt.figure(figsize=(15, 6))
df_processed[features].boxplot()
plt.xticks(rotation=45)
plt.title('Features After Outlier Treatment')
plt.tight_layout()
plt.show()
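
The capping above derives its bounds from the full dataset. If the goal is an unbiased test estimate, one option is to compute the IQR bounds on the training rows only and reuse them for the test rows. The following is a minimal sketch of that idea on synthetic data; `fit_iqr_bounds` and `apply_bounds` are hypothetical helpers, not part of the notebook.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

def fit_iqr_bounds(train_df, k=1.5):
    """Compute per-column IQR bounds from the training data only."""
    q1 = train_df.quantile(0.25)
    q3 = train_df.quantile(0.75)
    iqr = q3 - q1
    return q1 - k * iqr, q3 + k * iqr

def apply_bounds(df, lower, upper):
    """Clip each column to bounds that were learned elsewhere."""
    capped = df.copy()
    for col in df.columns:
        capped[col] = df[col].clip(lower=lower[col], upper=upper[col])
    return capped

# Synthetic data: one skewed column with outliers, one well-behaved column
rng = np.random.default_rng(1)
toy = pd.DataFrame({'x': rng.lognormal(size=400), 'z': rng.normal(size=400)})
train, test = train_test_split(toy, test_size=0.25, random_state=0)

lower, upper = fit_iqr_bounds(train)
train_capped = apply_bounds(train, lower, upper)
test_capped = apply_bounds(test, lower, upper)

print(upper)
print(test_capped.max())
```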

### Feature Scaling and Final Preparation

In [None]:
from sklearn.preprocessing import StandardScaler, RobustScaler

# Select final features (excluding high VIF features)
final_features = [f for f in all_features if f not in high_vif_features]

# Prepare features and targets
X = df_processed[final_features]
y_multi = df_processed['quality']  # Multi-class target
y_binary = df_processed['quality_binary']  # Binary target

# Scale features
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)
X_scaled = pd.DataFrame(X_scaled, columns=X.columns)

# Display scaled features statistics
print("Scaled features summary:")
display(X_scaled.describe())

In [None]:
# Split data for both classification tasks
# Multi-class split
X_train_multi, X_test_multi, y_train_multi, y_test_multi = train_test_split(
    X_scaled, y_multi, test_size=0.2, random_state=42, stratify=y_multi
)

# Binary split
X_train_binary, X_test_binary, y_train_binary, y_test_binary = train_test_split(
    X_scaled, y_binary, test_size=0.2, random_state=42, stratify=y_binary
)

# Display split sizes
print("Multi-class split sizes:")
print(f"Training set: {X_train_multi.shape}")
print(f"Testing set: {X_test_multi.shape}")

print("\nBinary split sizes:")
print(f"Training set: {X_train_binary.shape}")
print(f"Testing set: {X_test_binary.shape}")
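
Note that the scaler above is fitted on all rows before the split, so test-set statistics influence the transformation. A common way to avoid this is to put the scaler inside a scikit-learn `Pipeline`, which fits it on the training data only. The sketch below shows the pattern on synthetic data from `make_classification`; the model choice and variable names are assumptions for illustration.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Synthetic stand-in for the wine features and binary target
X_toy, y_toy = make_classification(n_samples=400, n_features=8,
                                   n_informative=4, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_toy, y_toy, test_size=0.2, random_state=42, stratify=y_toy
)

# The pipeline fits the scaler on X_tr only, then applies it to X_te at predict time
model = make_pipeline(StandardScaler(), SVC(kernel='rbf', C=1.0))
model.fit(X_tr, y_tr)

fitted_scaler = model.named_steps['standardscaler']
print(np.allclose(fitted_scaler.mean_, X_tr.mean(axis=0)))
test_accuracy = model.score(X_te, y_te)
print(test_accuracy)
```

The same pipeline object can be passed to `GridSearchCV`, in which case the scaler is refitted inside every cross-validation fold.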

### Class Balance Analysis

In [None]:
# Analyze class balance
fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(15, 5))

# Multi-class distribution
sns.countplot(data=pd.DataFrame(y_train_multi), x='quality', ax=ax1)
ax1.set_title('Multi-class Distribution (Training Set)')

# Binary class distribution
sns.countplot(data=pd.DataFrame(y_train_binary), x='quality_binary', ax=ax2)
ax2.set_title('Binary Class Distribution (Training Set)')

plt.tight_layout()
plt.show()

# Calculate class weights for both tasks
from sklearn.utils.class_weight import compute_class_weight

# Multi-class weights
multi_class_weights = compute_class_weight(
    'balanced',
    classes=np.unique(y_train_multi),
    y=y_train_multi
)

# Binary class weights
binary_class_weights = compute_class_weight(
    'balanced',
    classes=np.unique(y_train_binary),
    y=y_train_binary
)

print("Multi-class weights:")
for cls, weight in zip(np.unique(y_train_multi), multi_class_weights):
    print(f"Class {cls}: {weight:.2f}")

print("\nBinary class weights:")
for cls, weight in zip(np.unique(y_train_binary), binary_class_weights):
    print(f"Class {cls}: {weight:.2f}")
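
For reference, the `'balanced'` option computes each weight as `n_samples / (n_classes * class_count)`. The sketch below checks this on toy labels (the counts are illustrative) and stores the weights in a dictionary keyed by class label, since the returned array is ordered by `classes` and is not indexed by the label value itself.

```python
import numpy as np
from sklearn.utils.class_weight import compute_class_weight

# Toy labels with an imbalance (illustrative counts, not the real data)
y_toy = np.array([5] * 60 + [6] * 30 + [7] * 10)

classes = np.unique(y_toy)
weights = compute_class_weight('balanced', classes=classes, y=y_toy)

# Manual computation: n_samples / (n_classes * count_per_class)
counts = np.array([(y_toy == c).sum() for c in classes])
manual = len(y_toy) / (len(classes) * counts)

# Look weights up by label, not by array position
weight_by_class = dict(zip(classes.tolist(), weights.tolist()))

print(weight_by_class)
print(np.allclose(weights, manual))
```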

### Preprocessing Summary

1. **Feature Engineering**:
   - Created interaction features
   - Added polynomial terms
   - Removed highly collinear features

2. **Outlier Treatment**:
   - Identified outliers using IQR method
   - Capped extreme values
   - Verified treatment effectiveness

3. **Data Preparation**:
   - Scaled features using StandardScaler
   - Split data for both classification tasks
   - Calculated class weights

4. **Class Balance**:
   - Multi-class imbalance present
   - Binary classes roughly balanced
   - Computed weights for model training

Next steps:
1. Implement classification models
2. Use class weights in model training
3. Consider additional balancing techniques if needed

## Multi-class Classification Implementation

We'll implement several models for multi-class wine quality prediction:
1. Random Forest Classifier
2. Support Vector Machine (SVM)
3. XGBoost Classifier
4. Neural Network

For each model, we'll:
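- Train with class weights where the estimator supports them (`MLPClassifier` has no `class_weight` parameter, so the neural network is trained unweighted)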
- Evaluate performance
- Analyze confusion matrix
- Generate classification report

In [None]:
from sklearn.ensemble import RandomForestClassifier
from sklearn.svm import SVC
from xgboost import XGBClassifier
from sklearn.neural_network import MLPClassifier
from sklearn.metrics import confusion_matrix, classification_report
import itertools

# Function to plot confusion matrix
def plot_confusion_matrix(cm, classes, title='Confusion Matrix', cmap=plt.cm.Blues):
    plt.imshow(cm, interpolation='nearest', cmap=cmap)
    plt.title(title)
    plt.colorbar()
    tick_marks = np.arange(len(classes))
    plt.xticks(tick_marks, classes, rotation=45)
    plt.yticks(tick_marks, classes)
    
    # Add text annotations
    thresh = cm.max() / 2.
    for i, j in itertools.product(range(cm.shape[0]), range(cm.shape[1])):
        plt.text(j, i, format(cm[i, j], 'd'),
                 horizontalalignment="center",
                 color="white" if cm[i, j] > thresh else "black")
    
    plt.ylabel('True label')
    plt.xlabel('Predicted label')
    plt.tight_layout()

### Random Forest Classifier

First, let's implement a Random Forest model with class weights.

In [None]:
# Create and train Random Forest
rf_multi = RandomForestClassifier(
    n_estimators=200,
    max_depth=10,
    min_samples_split=5,
    class_weight='balanced',
    random_state=42
)

rf_multi.fit(X_train_multi, y_train_multi)

# Make predictions
rf_pred = rf_multi.predict(X_test_multi)

# Calculate confusion matrix
rf_cm = confusion_matrix(y_test_multi, rf_pred)

# Plot confusion matrix
plt.figure(figsize=(10, 8))
plot_confusion_matrix(rf_cm, classes=np.unique(y_multi),
                     title='Random Forest Confusion Matrix')
plt.show()

# Print classification report
print("Random Forest Classification Report:")
print(classification_report(y_test_multi, rf_pred))

# Feature importance
feature_imp = pd.DataFrame({
    'feature': X_scaled.columns,
    'importance': rf_multi.feature_importances_
}).sort_values('importance', ascending=False)

plt.figure(figsize=(10, 6))
sns.barplot(x='importance', y='feature', data=feature_imp.head(10))
plt.title('Top 10 Most Important Features (Random Forest)')
plt.show()

### Support Vector Machine

Next, let's implement an SVM classifier.

In [None]:
# Create and train SVM
svm_multi = SVC(
    kernel='rbf',
    C=1.0,
    class_weight='balanced',
    random_state=42
)

svm_multi.fit(X_train_multi, y_train_multi)

# Make predictions
svm_pred = svm_multi.predict(X_test_multi)

# Calculate confusion matrix
svm_cm = confusion_matrix(y_test_multi, svm_pred)

# Plot confusion matrix
plt.figure(figsize=(10, 8))
plot_confusion_matrix(svm_cm, classes=np.unique(y_multi),
                     title='SVM Confusion Matrix')
plt.show()

# Print classification report
print("SVM Classification Report:")
print(classification_report(y_test_multi, svm_pred))

### XGBoost Classifier

Now, let's implement an XGBoost model.

In [None]:
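# Create and train XGBoost
from sklearn.preprocessing import LabelEncoder
from sklearn.utils.class_weight import compute_sample_weight

xgb_multi = XGBClassifier(
    n_estimators=200,
    max_depth=6,
    learning_rate=0.1,
    random_state=42
)

# Recent XGBoost versions require class labels 0..n_classes-1,
# so encode the quality scores (3-8) before fitting
label_encoder = LabelEncoder()
y_train_encoded = label_encoder.fit_transform(y_train_multi)

# Convert balanced class weights to per-sample weights. Indexing
# multi_class_weights by the raw score would read the wrong entry or go out of bounds.
sample_weights = compute_sample_weight('balanced', y_train_multi)
xgb_multi.fit(X_train_multi, y_train_encoded, sample_weight=sample_weights)

# Make predictions and map them back to the original quality scores
xgb_pred = label_encoder.inverse_transform(xgb_multi.predict(X_test_multi))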

# Calculate confusion matrix
xgb_cm = confusion_matrix(y_test_multi, xgb_pred)

# Plot confusion matrix
plt.figure(figsize=(10, 8))
plot_confusion_matrix(xgb_cm, classes=np.unique(y_multi),
                     title='XGBoost Confusion Matrix')
plt.show()

# Print classification report
print("XGBoost Classification Report:")
print(classification_report(y_test_multi, xgb_pred))

# Feature importance
feature_imp_xgb = pd.DataFrame({
    'feature': X_scaled.columns,
    'importance': xgb_multi.feature_importances_
}).sort_values('importance', ascending=False)

plt.figure(figsize=(10, 6))
sns.barplot(x='importance', y='feature', data=feature_imp_xgb.head(10))
plt.title('Top 10 Most Important Features (XGBoost)')
plt.show()

### Neural Network

Finally, let's implement a neural network classifier.

In [None]:
# Create and train Neural Network
nn_multi = MLPClassifier(
    hidden_layer_sizes=(100, 50),
    activation='relu',
    solver='adam',
    max_iter=1000,
    random_state=42
)

nn_multi.fit(X_train_multi, y_train_multi)

# Make predictions
nn_pred = nn_multi.predict(X_test_multi)

# Calculate confusion matrix
nn_cm = confusion_matrix(y_test_multi, nn_pred)

# Plot confusion matrix
plt.figure(figsize=(10, 8))
plot_confusion_matrix(nn_cm, classes=np.unique(y_multi),
                     title='Neural Network Confusion Matrix')
plt.show()

# Print classification report
print("Neural Network Classification Report:")
print(classification_report(y_test_multi, nn_pred))

### Model Comparison

Let's compare the performance of all models.

In [None]:
# Collect all predictions
predictions = {
    'Random Forest': rf_pred,
    'SVM': svm_pred,
    'XGBoost': xgb_pred,
    'Neural Network': nn_pred
}

# Calculate accuracy, macro F1, and weighted F1 for each model
results = []
for name, pred in predictions.items():
    report = classification_report(y_test_multi, pred, output_dict=True)
    results.append({
        'Model': name,
        'Accuracy': report['accuracy'],
        'Macro F1': report['macro avg']['f1-score'],
        'Weighted F1': report['weighted avg']['f1-score']
    })

results_df = pd.DataFrame(results)

# Plot comparison
plt.figure(figsize=(12, 6))
metrics = ['Accuracy', 'Macro F1', 'Weighted F1']
x = np.arange(len(metrics))
width = 0.2

for i, model in enumerate(results_df['Model']):
    plt.bar(x + i*width, 
            results_df.loc[results_df['Model'] == model, metrics].values[0],
            width,
            label=model)

plt.xlabel('Metrics')
plt.ylabel('Score')
plt.title('Model Performance Comparison')
plt.xticks(x + width*1.5, metrics)
plt.legend()
plt.show()

# Display results table
print("\nDetailed Model Comparison:")
display(results_df.round(3))
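
Macro F1 and weighted F1 can diverge sharply when classes are imbalanced, which matters when interpreting the table above. The toy example below is a constructed case, not a notebook result: a classifier that always predicts the majority class gets a high weighted F1 but a low macro F1.

```python
import numpy as np
from sklearn.metrics import f1_score

# 90 majority-class samples, 10 minority-class samples
y_true_toy = np.array([0] * 90 + [1] * 10)
# A degenerate classifier that always predicts the majority class
y_pred_toy = np.zeros(100, dtype=int)

# Macro F1: unweighted mean of per-class F1 (the minority class counts equally)
macro = f1_score(y_true_toy, y_pred_toy, average='macro', zero_division=0)
# Weighted F1: per-class F1 averaged by class frequency
weighted = f1_score(y_true_toy, y_pred_toy, average='weighted', zero_division=0)

print(f"Macro F1: {macro:.3f}")        # 0.474
print(f"Weighted F1: {weighted:.3f}")  # 0.853
```

For the rare quality scores in this dataset, macro F1 is therefore the more sensitive of the two metrics.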

### Multi-class Classification Summary

1. **Model Performance**:
   - Best performing model: [based on results]
   - Challenges with intermediate classes
   - Trade-offs between different metrics

2. **Feature Importance**:
   - Consistent important features across models
   - Top features align with domain knowledge
   - Some engineered features proved valuable

3. **Class Prediction**:
   - Better prediction of extreme qualities
   - More confusion in middle quality ranges
   - Class imbalance effects visible

4. **Next Steps**:
   - Fine-tune best performing model
   - Consider ensemble methods
   - Investigate misclassified cases
   - Explore additional feature engineering

## Binary Classification Implementation

Now we'll implement binary classification to predict wine quality as 'Good' (1) or 'Poor' (0).

We'll use the same models with appropriate modifications for binary classification:
1. Random Forest
2. SVM
3. XGBoost
4. Neural Network

For each model, we'll focus on:
- ROC curves and AUC scores
- Precision-Recall curves
- Binary classification metrics
- Probability calibration
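
As a minimal sketch of what a probability calibration check could look like, the snippet below computes a reliability curve and the Brier score with scikit-learn. It uses synthetic data from `make_classification` and an untuned Random Forest, both of which are assumptions for illustration only.

```python
import numpy as np
from sklearn.calibration import calibration_curve
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import brier_score_loss
from sklearn.model_selection import train_test_split

# Synthetic binary problem standing in for the wine data
X_toy, y_toy = make_classification(n_samples=1000, n_features=10, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_toy, y_toy, test_size=0.3, random_state=42, stratify=y_toy
)

clf = RandomForestClassifier(n_estimators=100, random_state=42).fit(X_tr, y_tr)
prob = clf.predict_proba(X_te)[:, 1]

# Observed fraction of positives vs. mean predicted probability per bin;
# a well-calibrated model has the two roughly equal in every bin
frac_pos, mean_pred = calibration_curve(y_te, prob, n_bins=5)

# Brier score: mean squared error of the predicted probabilities (lower is better)
brier = brier_score_loss(y_te, prob)

print(np.column_stack([mean_pred, frac_pos]))
print(f"Brier score: {brier:.3f}")
```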

In [None]:
from sklearn.metrics import roc_curve, auc, precision_recall_curve, average_precision_score

def plot_binary_metrics(y_true, y_pred, y_prob, model_name):
    fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(15, 5))
    
    # ROC Curve
    fpr, tpr, _ = roc_curve(y_true, y_prob)
    roc_auc = auc(fpr, tpr)
    
    ax1.plot(fpr, tpr, label=f'ROC curve (AUC = {roc_auc:.2f})')
    ax1.plot([0, 1], [0, 1], 'k--')
    ax1.set_xlabel('False Positive Rate')
    ax1.set_ylabel('True Positive Rate')
    ax1.set_title(f'{model_name} - ROC Curve')
    ax1.legend()
    
    # Precision-Recall Curve
    precision, recall, _ = precision_recall_curve(y_true, y_prob)
    avg_precision = average_precision_score(y_true, y_prob)
    
    ax2.plot(recall, precision, 
             label=f'Precision-Recall curve (AP = {avg_precision:.2f})')
    ax2.set_xlabel('Recall')
    ax2.set_ylabel('Precision')
    ax2.set_title(f'{model_name} - Precision-Recall Curve')
    ax2.legend()
    
    plt.tight_layout()
    plt.show()
    
    # Print classification report
    print(f"\n{model_name} Classification Report:")
    print(classification_report(y_true, y_pred))
    
    return roc_auc, avg_precision

### Random Forest Binary Classification

In [None]:
# Create and train Random Forest
rf_binary = RandomForestClassifier(
    n_estimators=200,
    max_depth=10,
    min_samples_split=5,
    class_weight='balanced',
    random_state=42
)

rf_binary.fit(X_train_binary, y_train_binary)

# Make predictions
rf_binary_pred = rf_binary.predict(X_test_binary)
rf_binary_prob = rf_binary.predict_proba(X_test_binary)[:, 1]

# Plot metrics
rf_binary_roc_auc, rf_binary_ap = plot_binary_metrics(
    y_test_binary, rf_binary_pred, rf_binary_prob, 'Random Forest'
)

# Feature importance for binary classification
feature_imp_binary = pd.DataFrame({
    'feature': X_scaled.columns,
    'importance': rf_binary.feature_importances_
}).sort_values('importance', ascending=False)

plt.figure(figsize=(10, 6))
sns.barplot(x='importance', y='feature', data=feature_imp_binary.head(10))
plt.title('Top 10 Most Important Features for Binary Classification')
plt.show()

### SVM Binary Classification

In [None]:
# Create and train SVM
svm_binary = SVC(
    kernel='rbf',
    C=1.0,
    class_weight='balanced',
    probability=True,
    random_state=42
)

svm_binary.fit(X_train_binary, y_train_binary)

# Make predictions
svm_binary_pred = svm_binary.predict(X_test_binary)
svm_binary_prob = svm_binary.predict_proba(X_test_binary)[:, 1]

# Plot metrics
svm_binary_roc_auc, svm_binary_ap = plot_binary_metrics(
    y_test_binary, svm_binary_pred, svm_binary_prob, 'SVM'
)

### XGBoost Binary Classification

In [None]:
# Create and train XGBoost
xgb_binary = XGBClassifier(
    n_estimators=200,
    max_depth=6,
    learning_rate=0.1,
    scale_pos_weight=len(y_train_binary[y_train_binary==0]) / 
                     len(y_train_binary[y_train_binary==1]),
    random_state=42
)

xgb_binary.fit(X_train_binary, y_train_binary)

# Make predictions
xgb_binary_pred = xgb_binary.predict(X_test_binary)
xgb_binary_prob = xgb_binary.predict_proba(X_test_binary)[:, 1]

# Plot metrics
xgb_binary_roc_auc, xgb_binary_ap = plot_binary_metrics(
    y_test_binary, xgb_binary_pred, xgb_binary_prob, 'XGBoost'
)

### Neural Network Binary Classification

In [None]:
# Create and train Neural Network
nn_binary = MLPClassifier(
    hidden_layer_sizes=(100, 50),
    activation='relu',
    solver='adam',
    max_iter=1000,
    random_state=42
)

nn_binary.fit(X_train_binary, y_train_binary)

# Make predictions
nn_binary_pred = nn_binary.predict(X_test_binary)
nn_binary_prob = nn_binary.predict_proba(X_test_binary)[:, 1]

# Plot metrics
nn_binary_roc_auc, nn_binary_ap = plot_binary_metrics(
    y_test_binary, nn_binary_pred, nn_binary_prob, 'Neural Network'
)

### Binary Classification Model Comparison

In [None]:
# Collect results
binary_models = {
    'Random Forest': (rf_binary_pred, rf_binary_prob),
    'SVM': (svm_binary_pred, svm_binary_prob),
    'XGBoost': (xgb_binary_pred, xgb_binary_prob),
    'Neural Network': (nn_binary_pred, nn_binary_prob)
}

# Compare ROC curves
plt.figure(figsize=(10, 8))
for name, (_, y_prob) in binary_models.items():
    fpr, tpr, _ = roc_curve(y_test_binary, y_prob)
    roc_auc = auc(fpr, tpr)
    plt.plot(fpr, tpr, label=f'{name} (AUC = {roc_auc:.2f})')

plt.plot([0, 1], [0, 1], 'k--')
plt.xlabel('False Positive Rate')
plt.ylabel('True Positive Rate')
plt.title('ROC Curves Comparison')
plt.legend()
plt.show()

# Compare metrics
from sklearn.metrics import roc_auc_score

binary_results = []
for name, (y_pred, y_prob) in binary_models.items():
    report = classification_report(y_test_binary, y_pred, output_dict=True)
    binary_results.append({
        'Model': name,
        'Accuracy': report['accuracy'],
        'Precision': report['1']['precision'],
        'Recall': report['1']['recall'],
        'F1-Score': report['1']['f1-score'],
        # Area under the ROC curve, computed once from the predicted probabilities
        'ROC-AUC': roc_auc_score(y_test_binary, y_prob)
    })

binary_results_df = pd.DataFrame(binary_results)
print("\nBinary Classification Results:")
display(binary_results_df.round(3))

### Binary Classification Summary

1. **Model Performance**:
   - Best performing model: [based on results]
   - ROC-AUC scores comparison
   - Precision-Recall trade-offs

2. **Class Imbalance Handling**:
   - Effect of class weights
   - Balance between classes
   - Model robustness

3. **Practical Implications**:
   - False positive vs false negative trade-offs
   - Model selection criteria
   - Deployment considerations

4. **Next Steps**:
   - Optimize probability thresholds
   - Ensemble methods
   - Feature selection refinement
   - Model calibration

## Model Tuning and Optimization

We'll optimize our models through:
1. Hyperparameter tuning
2. Cross-validation strategies
3. Ensemble methods
4. Threshold optimization

We'll focus on both multi-class and binary classification models.

### Hyperparameter Tuning

We'll use:
- GridSearchCV for discrete parameters
- RandomizedSearchCV for continuous parameters
- Custom scoring metrics
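
A custom scoring metric can be built with `make_scorer` and combined with an explicit stratified cross-validation splitter. The sketch below assumes macro F1 as the metric and uses synthetic imbalanced data; the estimator settings are illustrative only.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import f1_score, make_scorer
from sklearn.model_selection import StratifiedKFold, cross_val_score

# Synthetic three-class problem with imbalanced classes
X_toy, y_toy = make_classification(
    n_samples=300, n_features=8, n_informative=5,
    n_classes=3, weights=[0.6, 0.3, 0.1], random_state=0
)

# Custom scorer: macro-averaged F1, so rare classes count as much as common ones
macro_f1 = make_scorer(f1_score, average='macro')

# Stratified folds keep the class proportions similar in every split
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)

scores = cross_val_score(
    RandomForestClassifier(n_estimators=50, random_state=42),
    X_toy, y_toy, cv=cv, scoring=macro_f1
)

print(scores)
print(scores.mean())
```

The same `macro_f1` and `cv` objects can be passed to `GridSearchCV` or `RandomizedSearchCV` through their `scoring` and `cv` arguments.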

In [None]:
from sklearn.model_selection import GridSearchCV, RandomizedSearchCV
from sklearn.metrics import make_scorer, accuracy_score, f1_score
from scipy.stats import randint, uniform
import time

def print_tuning_results(grid_result, model_name):
    print(f"\n{model_name} Best Parameters:")
    print(grid_result.best_params_)
    print(f"\nBest Score: {grid_result.best_score_:.4f}")
    
    # Get timing information
    mean_time = grid_result.cv_results_['mean_fit_time'].mean()
    print(f"Mean Fit Time: {mean_time:.2f} seconds")

### Random Forest Tuning

In [None]:
# Define parameter grid for Random Forest
rf_param_grid = {
    'n_estimators': [100, 200, 300],
    'max_depth': [10, 20, 30, None],
    'min_samples_split': [2, 5, 10],
    'min_samples_leaf': [1, 2, 4],
    'max_features': ['sqrt', 'log2']  # 'auto' is no longer accepted by recent scikit-learn releases
}

# Multi-class tuning
rf_grid_multi = GridSearchCV(
    RandomForestClassifier(class_weight='balanced', random_state=42),
    rf_param_grid,
    cv=5,
    scoring='f1_weighted',
    n_jobs=-1
)

print("Tuning Random Forest for Multi-class Classification...")
rf_grid_multi.fit(X_train_multi, y_train_multi)
print_tuning_results(rf_grid_multi, "Random Forest (Multi-class)")

# Binary tuning
rf_grid_binary = GridSearchCV(
    RandomForestClassifier(class_weight='balanced', random_state=42),
    rf_param_grid,
    cv=5,
    scoring='f1',
    n_jobs=-1
)

print("\nTuning Random Forest for Binary Classification...")
rf_grid_binary.fit(X_train_binary, y_train_binary)
print_tuning_results(rf_grid_binary, "Random Forest (Binary)")

### XGBoost Tuning

We'll use RandomizedSearchCV for XGBoost due to continuous parameters.

In [None]:
# Define parameter distributions for XGBoost
xgb_param_dist = {
    'n_estimators': randint(100, 500),
    'max_depth': randint(3, 10),
    'learning_rate': uniform(0.01, 0.3),
    'subsample': uniform(0.6, 0.4),
    'colsample_bytree': uniform(0.6, 0.4),
    'min_child_weight': randint(1, 7)
}

# Multi-class tuning
xgb_random_multi = RandomizedSearchCV(
    XGBClassifier(random_state=42),
    xgb_param_dist,
    n_iter=50,
    cv=5,
    scoring='f1_weighted',
    n_jobs=-1,
    random_state=42
)

print("Tuning XGBoost for Multi-class Classification...")
xgb_random_multi.fit(X_train_multi, y_train_multi)
print_tuning_results(xgb_random_multi, "XGBoost (Multi-class)")

# Binary tuning
xgb_random_binary = RandomizedSearchCV(
    XGBClassifier(random_state=42),
    xgb_param_dist,
    n_iter=50,
    cv=5,
    scoring='f1',
    n_jobs=-1,
    random_state=42
)

print("\nTuning XGBoost for Binary Classification...")
xgb_random_binary.fit(X_train_binary, y_train_binary)
print_tuning_results(xgb_random_binary, "XGBoost (Binary)")
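
The search space above relies on scipy's `(loc, scale)` parameterization, which is easy to misread as `(low, high)`. The short check below, a sketch with hypothetical variable names, samples from the same kinds of distributions to confirm the ranges they actually cover.

```python
from scipy.stats import randint, uniform

# uniform(loc, scale) covers [loc, loc + scale], not [loc, scale]
lr_dist = uniform(0.01, 0.3)
subsample_dist = uniform(0.6, 0.4)
# randint(low, high) draws integers from low up to high - 1
depth_dist = randint(3, 10)

# The 0% and 100% quantiles give the bounds of each continuous distribution
lr_low, lr_high = lr_dist.ppf([0.0, 1.0])
sub_low, sub_high = subsample_dist.ppf([0.0, 1.0])

# Draw reproducible samples to inspect the integer range empirically
depth_samples = depth_dist.rvs(size=1000, random_state=0)

print(f"learning_rate range: {lr_low:.2f} to {lr_high:.2f}")
print(f"subsample range: {sub_low:.2f} to {sub_high:.2f}")
print(f"max_depth samples: min={depth_samples.min()}, max={depth_samples.max()}")
```

If an upper bound of exactly 0.3 on the learning rate were intended, `uniform(0.01, 0.29)` would express it.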

### Probability Threshold Optimization

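For binary classification, `predict` applies a default probability cutoff of 0.5. Here we sweep the decision threshold and track F1, precision, and recall at each value. Note that the sweep is scored on the test set, so the selected threshold and its F1 are optimistic estimates.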

In [None]:
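from sklearn.metrics import precision_score, recall_score

def optimize_threshold(y_true, y_prob):
    """Return F1, precision, and recall for each candidate decision threshold."""
    thresholds = np.arange(0.1, 0.9, 0.05)
    scores = []
    
    for threshold in thresholds:
        # Predict the positive class when its probability reaches the threshold
        y_pred = (y_prob >= threshold).astype(int)
        scores.append({
            'threshold': threshold,
            'f1': f1_score(y_true, y_pred, zero_division=0),
            # zero_division=0 scores precision as 0 when no positives are predicted
            'precision': precision_score(y_true, y_pred, zero_division=0),
            'recall': recall_score(y_true, y_pred)
        })
    
    return pd.DataFrame(scores)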

# Optimize thresholds for best models
best_rf = rf_grid_binary.best_estimator_
best_xgb = xgb_random_binary.best_estimator_

# Get probabilities
rf_probs = best_rf.predict_proba(X_test_binary)[:, 1]
xgb_probs = best_xgb.predict_proba(X_test_binary)[:, 1]

# Calculate optimal thresholds
rf_thresh_results = optimize_threshold(y_test_binary, rf_probs)
xgb_thresh_results = optimize_threshold(y_test_binary, xgb_probs)

# Plot threshold optimization results
fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(15, 5))

# Random Forest
rf_thresh_results.plot(x='threshold', y=['f1', 'precision', 'recall'], 
                      ax=ax1)
ax1.set_title('Random Forest Threshold Optimization')
ax1.grid(True)

# XGBoost
xgb_thresh_results.plot(x='threshold', y=['f1', 'precision', 'recall'], 
                       ax=ax2)
ax2.set_title('XGBoost Threshold Optimization')
ax2.grid(True)

plt.tight_layout()
plt.show()

# Print optimal thresholds
print("Optimal Thresholds (based on F1 score):")
print(f"Random Forest: {rf_thresh_results.loc[rf_thresh_results['f1'].idxmax(), 'threshold']:.2f}")
print(f"XGBoost: {xgb_thresh_results.loc[xgb_thresh_results['f1'].idxmax(), 'threshold']:.2f}")
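
Choosing the threshold on the same test set used for final reporting tends to overstate performance. One common alternative, sketched below under stated assumptions, picks the threshold from out-of-fold probabilities on the training data and touches the test set only once. The synthetic data and the names `X_tr`, `oof_prob`, and `best_threshold` are hypothetical, and the Random Forest settings are placeholders rather than the tuned values.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import f1_score
from sklearn.model_selection import cross_val_predict, train_test_split

# Synthetic imbalanced data standing in for the binary wine target (assumption)
X_demo, y_demo = make_classification(
    n_samples=600, n_features=10, weights=[0.8, 0.2], random_state=42
)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.25, stratify=y_demo, random_state=42
)

clf = RandomForestClassifier(
    n_estimators=50, class_weight='balanced', random_state=42
)

# Out-of-fold probabilities: each training row is scored by a model
# that did not see it during fitting
oof_prob = cross_val_predict(
    clf, X_tr, y_tr, cv=5, method='predict_proba'
)[:, 1]

# Sweep thresholds on the training-side predictions only
thresholds = np.arange(0.1, 0.9, 0.05)
oof_f1 = [
    f1_score(y_tr, (oof_prob >= t).astype(int), zero_division=0)
    for t in thresholds
]
best_threshold = float(thresholds[int(np.argmax(oof_f1))])

# Refit on all training data, then apply the frozen threshold to the test set
clf.fit(X_tr, y_tr)
test_pred = (clf.predict_proba(X_te)[:, 1] >= best_threshold).astype(int)
test_f1 = f1_score(y_te, test_pred)

print(f"Threshold chosen on out-of-fold data: {best_threshold:.2f}")
print(f"Test F1 at that threshold: {test_f1:.3f}")
```

The test F1 obtained this way is usually a little lower than the maximum of a test-set sweep, but it is a fairer estimate of how the thresholded model would behave on new data.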

### Ensemble Methods

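Let's combine the tuned Random Forest and XGBoost models in soft-voting classifiers, which average the predicted class probabilities of their members. Note that `VotingClassifier.fit` refits clones of the supplied estimators on the training data.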

In [None]:
from sklearn.ensemble import VotingClassifier

# Create voting classifiers
# Multi-class voting classifier
estimators_multi = [
    ('rf', rf_grid_multi.best_estimator_),
    ('xgb', xgb_random_multi.best_estimator_)
]

voting_multi = VotingClassifier(
    estimators=estimators_multi,
    voting='soft'
)

# Binary voting classifier
estimators_binary = [
    ('rf', rf_grid_binary.best_estimator_),
    ('xgb', xgb_random_binary.best_estimator_)
]

voting_binary = VotingClassifier(
    estimators=estimators_binary,
    voting='soft'
)

# Train and evaluate
print("Training Voting Classifiers...")

# Multi-class
voting_multi.fit(X_train_multi, y_train_multi)
voting_multi_pred = voting_multi.predict(X_test_multi)
print("\nMulti-class Voting Classifier Results:")
print(classification_report(y_test_multi, voting_multi_pred))

# Binary
voting_binary.fit(X_train_binary, y_train_binary)
voting_binary_pred = voting_binary.predict(X_test_binary)
voting_binary_prob = voting_binary.predict_proba(X_test_binary)[:, 1]

print("\nBinary Voting Classifier Results:")
print(classification_report(y_test_binary, voting_binary_pred))

# Plot ROC curve for binary voting classifier
fpr, tpr, _ = roc_curve(y_test_binary, voting_binary_prob)
roc_auc = auc(fpr, tpr)

plt.figure(figsize=(8, 6))
plt.plot(fpr, tpr, label=f'Voting Classifier (AUC = {roc_auc:.2f})')
plt.plot([0, 1], [0, 1], 'k--')
plt.xlabel('False Positive Rate')
plt.ylabel('True Positive Rate')
plt.title('ROC Curve - Voting Classifier')
plt.legend()
plt.show()
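
To make the soft-voting rule concrete, the sketch below reproduces a `VotingClassifier` prediction by hand: average the members' `predict_proba` outputs, then take the class with the highest mean probability. The data is synthetic, and `LogisticRegression` stands in for XGBoost purely to keep the example dependency-free; both choices are assumptions.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier, VotingClassifier
from sklearn.linear_model import LogisticRegression

# Synthetic binary data (assumption: illustrative only)
X_demo, y_demo = make_classification(n_samples=300, n_features=8, random_state=42)

voting_demo = VotingClassifier(
    estimators=[
        ('rf', RandomForestClassifier(n_estimators=25, random_state=42)),
        ('lr', LogisticRegression(max_iter=1000)),  # stand-in for XGBoost
    ],
    voting='soft',
)
voting_demo.fit(X_demo, y_demo)

# estimators_ holds the fitted clones used internally by the ensemble
member_probs = np.stack(
    [est.predict_proba(X_demo) for est in voting_demo.estimators_]
)

# Soft voting with no weights: unweighted mean over members, then argmax
manual_avg = member_probs.mean(axis=0)
manual_pred = voting_demo.classes_[manual_avg.argmax(axis=1)]

print("Probabilities match:",
      np.allclose(manual_avg, voting_demo.predict_proba(X_demo)))
print("Predictions match:",
      np.array_equal(manual_pred, voting_demo.predict(X_demo)))
```

Because the average weights each member equally by default, a poorly calibrated member can pull the ensemble's probabilities; the `weights` parameter of `VotingClassifier` is one way to adjust for that.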

### Final Model Comparison

Let's compare the tuned binary models and the voting ensemble on the test set using accuracy, F1, and ROC AUC.

In [None]:
# Collect all optimized models
optimized_models = {
    'Random Forest': rf_grid_binary.best_estimator_,
    'XGBoost': xgb_random_binary.best_estimator_,
    'Voting Classifier': voting_binary
}

# Compare performance metrics
final_results = []

for name, model in optimized_models.items():
    y_pred = model.predict(X_test_binary)
    y_prob = model.predict_proba(X_test_binary)[:, 1]
    
    final_results.append({
        'Model': name,
        'Accuracy': accuracy_score(y_test_binary, y_pred),
        'F1 Score': f1_score(y_test_binary, y_pred),
        'ROC AUC': roc_auc_score(y_test_binary, y_prob)
    })

final_results_df = pd.DataFrame(final_results)
print("Final Model Comparison:")
display(final_results_df.round(3))

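# Visualize final comparison
# DataFrame.plot creates its own figure, so pass figsize here rather than
# calling plt.figure first (which would leave an empty figure behind)
ax = final_results_df.set_index('Model').plot(kind='bar', figsize=(10, 6))
ax.set_title('Final Model Comparison')
ax.set_ylabel('Score')
plt.xticks(rotation=45)
plt.tight_layout()
plt.show()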
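
Scores from a single test split carry sampling noise, so small gaps between models in the table above may not be meaningful. As a rough check, the sketch below computes a percentile bootstrap interval for F1. The function `bootstrap_f1_interval` and the toy labels are hypothetical; in the notebook one would pass `y_test_binary` and a model's predictions instead.

```python
import numpy as np
from sklearn.metrics import f1_score

def bootstrap_f1_interval(y_true, y_pred, n_boot=500, alpha=0.05, seed=42):
    """Percentile bootstrap interval for F1 (hypothetical helper)."""
    y_true = np.asarray(y_true)
    y_pred = np.asarray(y_pred)
    rng = np.random.default_rng(seed)
    n = len(y_true)
    scores = []
    for _ in range(n_boot):
        # Resample test rows with replacement, keeping label/prediction pairs aligned
        idx = rng.integers(0, n, size=n)
        scores.append(f1_score(y_true[idx], y_pred[idx], zero_division=0))
    lower, upper = np.quantile(scores, [alpha / 2, 1 - alpha / 2])
    return float(lower), float(upper)

# Toy labels: predictions agree with the truth except for roughly 20% of rows
rng = np.random.default_rng(0)
y_true_demo = rng.integers(0, 2, size=300)
flip = rng.random(300) < 0.2
y_pred_demo = np.where(flip, 1 - y_true_demo, y_true_demo)

low, high = bootstrap_f1_interval(y_true_demo, y_pred_demo)
point = f1_score(y_true_demo, y_pred_demo)
print(f"F1 = {point:.3f}, 95% bootstrap interval = [{low:.3f}, {high:.3f}]")
```

If the intervals of two models overlap heavily, preferring the simpler or faster model is often a reasonable tie-breaker.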

### Model Tuning Summary

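1. **Hyperparameter Optimization**:
   - Best parameters for each model are reported in `best_params_`, with the mean cross-validated score in `best_score_`
   - Performance improvements should be judged against the untuned models on the same test split
   - Computational trade-offs: the Random Forest grid has 216 candidates (1,080 fits with 5 folds), while the XGBoost randomized search is capped at 50 candidates (250 fits)

2. **Threshold Optimization**:
   - Optimal thresholds for binary classification are the values that maximize F1 in the sweep
   - Precision-recall trade-offs: raising the threshold lowers recall and generally raises precision
   - Model-specific considerations: the thresholds were selected on the test set, so they should be re-validated on held-out data before use

3. **Ensemble Methods**:
   - Voting classifier performance relative to its individual members
   - Stability improvements from averaging the probabilities of two different model families
   - Complexity vs. performance trade-offs: the ensemble must train and serve both models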

4. **Final Recommendations**:
   - Best model selection
   - Implementation considerations
   - Performance expectations

Next steps:
1. Model deployment strategy
2. Monitoring plan
3. Periodic retraining schedule