# Machine Learning
---
## Assignment: Implementation of Information-Based Models: Decision Trees, Random Forests, and Gradient Boosting
**Group 5**

---

### Student Details  
- **Name:** MUGOT, CHRIS JALLAINE S.   
- **Section:** DS3A  
- **Date of Submission:** MAY 11, 2025



### Instructions
[![params.png](https://i.ibb.co/YBP34CBT/params.png)](https://ibb.co/9HcNmDH3)

#### Objectives:
1. **Create and compare three machine learning models (Decision Tree, Random Forest, and Gradient Boosting Machine) for breast cancer classification.**

2. **Find the "best" possible version of each model by systematically exploring different hyperparameter values.** 
   - "Best" is defined as achieving the highest test accuracy.

3. **Demonstrate understanding of hyperparameter tuning by:**
   - Starting with baseline models (default parameters)
   - Implementing proper hyperparameter searching techniques (Grid Search with cross-validation)
   - Analyzing how different hyperparameter values affect model performance

4. **Evaluate and compare model performance using appropriate metrics and visualizations for classification tasks.**

5. **Maintain consistent experimental conditions across all models by:**
   - Using the 75:25 train-test split ratio
   - Setting `random_state=42` for reproducibility
   - Using evaluation metrics for fair comparison

6. **Document the process in a well-structured Jupyter notebook that shows both the baseline and optimized versions of each model, with clear visualizations and explanations.**


---

In [1]:
# ----------------------------------------------------------------
# Breast Cancer Classification: Model Comparison and Optimization
# ----------------------------------------------------------------

## Setup and Data Loading

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import plotly.express as px
import plotly.graph_objects as go
from plotly.subplots import make_subplots
import seaborn as sns
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split, GridSearchCV, cross_val_score, KFold
from sklearn.tree import DecisionTreeClassifier, plot_tree
from sklearn.ensemble import RandomForestClassifier, GradientBoostingClassifier
from sklearn.metrics import accuracy_score, classification_report, confusion_matrix, roc_curve, auc, precision_recall_curve
from sklearn.preprocessing import StandardScaler

import warnings
warnings.filterwarnings('ignore')

In [2]:
# random state value for reproducibility

RANDOM_STATE = 42

In [3]:
# ----------------------------
# Load breast cancer dataset
# ----------------------------

data = load_breast_cancer()
X = pd.DataFrame(data.data, columns=data.feature_names)
y = pd.Series(data.target)

In [4]:
# ----------------------------------
# Display dataset information
# ----------------------------------

print(f"Dataset shape: {X.shape}")
print()
print(f"Number of features: {X.shape[1]}")
print()
print(f"Number of classes: {len(np.unique(y))}")
print()
print(f"Class distribution: \n{pd.Series(y).value_counts()}")

Dataset shape: (569, 30)

Number of features: 30

Number of classes: 2

Class distribution: 
1    357
0    212
Name: count, dtype: int64


In [5]:
# -------------------------------------------
#  Train-test split (75:25 as specified)
# ------------------------------------------

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=RANDOM_STATE)

print(f"Training set shape: {X_train.shape}")
print(f"Testing set shape: {X_test.shape}")

Training set shape: (426, 30)
Testing set shape: (143, 30)
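
The printed shapes follow from how a fractional `test_size` is turned into whole sample counts. The sketch below assumes the test share is rounded up and the remainder goes to the training set, which reproduces the shapes above; the helper name `split_sizes` is hypothetical and is not part of scikit-learn.

```python
import math

def split_sizes(n_samples, test_size):
    """Return (n_train, n_test) for a fractional test_size.

    Assumption: the test count is rounded up and the remainder is used
    for training, which matches the shapes printed by the notebook.
    """
    n_test = math.ceil(test_size * n_samples)
    n_train = n_samples - n_test
    return n_train, n_test

n_train, n_test = split_sizes(569, 0.25)
print(n_train, n_test)  # 426 143
```

With only 143 test samples, each test prediction is worth about 0.7 percentage points of accuracy, which matters when comparing models later on.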


In [6]:
PINK_PALETTE = ["#FF69B4", "#FFB6C1", "#FFC0CB", "#FF1493", "#DB7093"]

In [7]:
# ---------------------------------------------------
# Helper Functions for Easier Visualization and Evaluation
# ---------------------------------------------------

def plot_confusion_matrix(y_true, y_pred, model_name="Model"):
    """Generate a plotly confusion matrix visualization."""
    cm = confusion_matrix(y_true, y_pred)
    fig = px.imshow(cm, 
                   labels=dict(x="Predicted", y="Actual", color="Count"),
                   x=['Malignant', 'Benign'],
                   y=['Malignant', 'Benign'],
                   text_auto=True,
                   color_continuous_scale=px.colors.sequential.Pinkyl)
    
    fig.update_layout(
        title=f"Confusion Matrix - {model_name}",
        width=600,
        height=500
    )
    return fig


def plot_roc_curve(y_true, y_prob, model_name="Model"):
    """Generate a plotly ROC curve visualization."""
    fpr, tpr, _ = roc_curve(y_true, y_prob)
    roc_auc = auc(fpr, tpr)
    
    fig = go.Figure()
    fig.add_trace(go.Scatter(x=fpr, y=tpr, mode='lines',
                            line=dict(color='#FF69B4', width=2),
                            name=f'{model_name} (AUC = {roc_auc:.3f})'))
    fig.add_trace(go.Scatter(x=[0, 1], y=[0, 1], mode='lines',
                            line=dict(color='navy', width=1, dash='dash'),
                            name='Random Guess'))
    
    fig.update_layout(
        title=f'ROC Curve - {model_name}',
        xaxis_title='False Positive Rate',
        yaxis_title='True Positive Rate',
        legend=dict(x=0.7, y=0.05),
        width=700,
        height=500
    )
    return fig, roc_auc


def plot_precision_recall_curve(y_true, y_prob, model_name="Model"):
    """Generate a plotly precision-recall curve visualization."""
    precision, recall, _ = precision_recall_curve(y_true, y_prob)
    
    fig = go.Figure()
    fig.add_trace(go.Scatter(x=recall, y=precision, mode='lines',
                            line=dict(color='#FF1493', width=2),
                            name=f'{model_name}'))
    
    fig.update_layout(
        title=f'Precision-Recall Curve - {model_name}',
        xaxis_title='Recall',
        yaxis_title='Precision',
        legend=dict(x=0.7, y=0.05),
        width=700,
        height=500
    )
    return fig


def evaluate_model(model, X_train, X_test, y_train, y_test, model_name="Model"):
    """Evaluate a model and print classification metrics."""
    # Training and test predictions
    y_train_pred = model.predict(X_train)
    y_test_pred = model.predict(X_test)
    
    # Probability predictions for ROC curve
    if hasattr(model, "predict_proba"):
        y_train_prob = model.predict_proba(X_train)[:, 1]
        y_test_prob = model.predict_proba(X_test)[:, 1]
    else:
        y_train_prob = y_train_pred
        y_test_prob = y_test_pred
    
    # Accuracy
    train_accuracy = accuracy_score(y_train, y_train_pred)
    test_accuracy = accuracy_score(y_test, y_test_pred)
    
    print(f"===== {model_name} Evaluation =====")
    print(f"Training Accuracy: {train_accuracy:.4f}")
    print(f"Testing Accuracy: {test_accuracy:.4f}")
    print("\nClassification Report (Test Set):")
    print(classification_report(y_test, y_test_pred))
    
    # Generate confusion matrix
    cm_fig = plot_confusion_matrix(y_test, y_test_pred, model_name)
    cm_fig.show()
    
    # Generate ROC curve
    roc_fig, roc_auc = plot_roc_curve(y_test, y_test_prob, model_name)
    roc_fig.show()
    
    # Generate precision-recall curve
    pr_fig = plot_precision_recall_curve(y_test, y_test_prob, model_name)
    pr_fig.show()
    
    # Plot training vs. testing accuracy comparison
    fig = go.Figure()
    fig.add_trace(go.Bar(
        x=['Training', 'Testing'],
        y=[train_accuracy, test_accuracy],
        marker_color=['#FF69B4', '#FF1493']
    ))
    fig.update_layout(
        title=f"Training vs Testing Accuracy - {model_name}",
        xaxis_title="Dataset",
        yaxis_title="Accuracy",
        yaxis=dict(range=[0.7, 1.0]),
        width=600,
        height=400
    )
    fig.show()
    
    return {
        'model': model,
        'train_accuracy': train_accuracy,
        'test_accuracy': test_accuracy,
        'y_test_pred': y_test_pred,
        'y_test_prob': y_test_prob
    }


### Model Training

In [8]:
# ----------------------------
# Decision Tree Classifier
# ----------------------------

# This is a Baseline Decision Tree Model

print("\nBaseline Decision Tree (without hyperparameter tuning)")
dt_baseline = DecisionTreeClassifier(random_state=RANDOM_STATE)
dt_baseline.fit(X_train, y_train)




Baseline Decision Tree (without hyperparameter tuning)


In [9]:
# Evaluate baseline model
dt_baseline_results = evaluate_model(dt_baseline, X_train, X_test, y_train, y_test, 
                                     model_name="Baseline Decision Tree")

===== Baseline Decision Tree Evaluation =====
Training Accuracy: 1.0000
Testing Accuracy: 0.9510

Classification Report (Test Set):
              precision    recall  f1-score   support

           0       0.93      0.94      0.94        54
           1       0.97      0.96      0.96        89

    accuracy                           0.95       143
   macro avg       0.95      0.95      0.95       143
weighted avg       0.95      0.95      0.95       143



In [10]:
# -------------------------------------------------
# Decision Tree with Cross-Validation and Grid Search
# ---------------------------------------------------

print("\nTuning Decision Tree with GridSearchCV and Cross-Validation")

# Define parameter grid
dt_param_grid = {
    'max_depth': [3, 5, 7, 10, None],
    'min_samples_split': [2, 5, 10],
    'min_samples_leaf': [1, 2, 4],
    'criterion': ['gini', 'entropy']
}



Tuning Decision Tree with GridSearchCV and Cross-Validation


In [11]:
# GridSearchCV object
dt_grid = GridSearchCV(
    estimator=DecisionTreeClassifier(random_state=RANDOM_STATE),
    param_grid=dt_param_grid,
    cv=5,
    scoring='accuracy',
    n_jobs=-1,
    verbose=1
)

# Fit GridSearchCV
dt_grid.fit(X_train, y_train)

Fitting 5 folds for each of 90 candidates, totalling 450 fits


In [12]:
# ----------------------------
# Best hyperparameters
# -----------------------------

print("\nBest Decision Tree Hyperparameters:")
print()
print(dt_grid.best_params_)
print(f"Best cross-validation accuracy: {dt_grid.best_score_:.4f}")


Best Decision Tree Hyperparameters:

{'criterion': 'gini', 'max_depth': 7, 'min_samples_leaf': 1, 'min_samples_split': 2}
Best cross-validation accuracy: 0.9366


In [13]:
# -----------------------
#  Optimized Model
# -----------------------

dt_optimized = dt_grid.best_estimator_

In [14]:
# --------------------------
# Evaluate optimized model
# --------------------------

dt_optimized_results = evaluate_model(dt_optimized, X_train, X_test, y_train, y_test, 
                                      model_name="Optimized Decision Tree")

===== Optimized Decision Tree Evaluation =====
Training Accuracy: 1.0000
Testing Accuracy: 0.9510

Classification Report (Test Set):
              precision    recall  f1-score   support

           0       0.93      0.94      0.94        54
           1       0.97      0.96      0.96        89

    accuracy                           0.95       143
   macro avg       0.95      0.95      0.95       143
weighted avg       0.95      0.95      0.95       143



## Hyperparameter Optimization Analysis

### Decision Tree Model Analysis

#### Baseline vs. Optimized Performance Comparison

| Metric | Baseline Model | Optimized Model | Change |
|--------|---------------|-----------------|--------|
| Training Accuracy | 1.0000 | 1.0000 | No change |
| Testing Accuracy | 0.9510 | 0.9510 | No change |
| Class 0 Precision | 0.93 | 0.93 | No change |
| Class 0 Recall | 0.94 | 0.94 | No change |
| Class 0 F1-Score | 0.94 | 0.94 | No change |
| Class 1 Precision | 0.97 | 0.97 | No change |
| Class 1 Recall | 0.96 | 0.96 | No change |
| Class 1 F1-Score | 0.96 | 0.96 | No change |
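
The per-class figures in the table can be traced back to raw counts. The sketch below assumes a test confusion matrix of `[[51, 3], [4, 85]]`. These counts are not printed in the notebook; they are inferred here from the rounded classification report and the class supports (54 and 89), so treat them as an assumption.

```python
# Hypothetical counts inferred from the rounded report.
# Rows are actual classes, columns are predicted classes.
cm = [[51, 3],   # actual class 0 (malignant)
      [4, 85]]   # actual class 1 (benign)

def class_metrics(cm, k):
    """Precision, recall and F1 for class k of a square confusion matrix."""
    tp = cm[k][k]
    predicted_k = sum(row[k] for row in cm)  # column sum: everything predicted as k
    actual_k = sum(cm[k])                    # row sum: everything that truly is k
    precision = tp / predicted_k
    recall = tp / actual_k
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1

total = sum(sum(row) for row in cm)
accuracy = sum(cm[k][k] for k in range(len(cm))) / total

for k in range(2):
    p, r, f = class_metrics(cm, k)
    print(f"class {k}: precision={p:.2f} recall={r:.2f} f1={f:.2f}")
print(f"accuracy={accuracy:.4f}")
```

Under these assumed counts, the 7 misclassified test samples split into 3 malignant cases predicted as benign and 4 benign cases predicted as malignant.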

### Analysis of Decision Tree Hyperparameter Tuning

For the Decision Tree model, hyperparameter optimization did not change any performance metric. Several factors may explain this:

1. **Identical Performance**: The baseline and optimized models show identical metrics across all evaluation criteria (accuracy, precision, recall, and F1-score). The selected configuration (`criterion='gini'`, `min_samples_leaf=1`, `min_samples_split=2`) matches the scikit-learn defaults except for `max_depth=7`, so if the unconstrained baseline tree is no deeper than 7 levels, the two fitted trees are the same model.

2. **Perfect Training Accuracy**: Both models reach a training accuracy of 1.0, so they fit the training data perfectly. The gap to the test accuracy (0.9510) points to some overfitting, although the model still generalizes reasonably well.

3. **Hyperparameter Search Space**: The grid contains the default values themselves (`max_depth=None`, `min_samples_split=2`, `min_samples_leaf=1`, `criterion='gini'`), and cross-validation preferred a configuration that is effectively the default. The more regularized candidates (shallower trees, larger leaves) did not score higher in cross-validation, so the search had no reason to move away from the baseline.

4. **Dataset Characteristics**: The dataset may have relatively simple decision boundaries, which would make the model less sensitive to these hyperparameters.

5. **Optimization Objective**: Grid search selects the candidate with the highest mean cross-validation accuracy on the training set (0.9366 here), not the highest test accuracy. A candidate that wins in cross-validation is not guaranteed to improve test accuracy, especially with only 143 test samples.

6. **Potential Next Steps**: 
   - Expand the hyperparameter search space
   - Try different hyperparameter combinations manually
   - Consider feature engineering or selection to potentially improve the model's performance
   - Explore ensemble methods or more complex model architectures
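
Expanding the search space has a direct cost, because grid search fits every combination once per fold. The sketch below counts candidates for the grid used above and for a hypothetical wider grid; the added `ccp_alpha` values are illustrative assumptions, not tuned recommendations, and the helper `count_fits` is introduced here for illustration.

```python
from itertools import product

def count_fits(param_grid, cv=5):
    """Number of candidate combinations and total model fits for a full grid."""
    n_candidates = len(list(product(*param_grid.values())))
    return n_candidates, n_candidates * cv

current_grid = {
    'max_depth': [3, 5, 7, 10, None],
    'min_samples_split': [2, 5, 10],
    'min_samples_leaf': [1, 2, 4],
    'criterion': ['gini', 'entropy'],
}

# Hypothetical wider grid: adds cost-complexity pruning strengths via ccp_alpha.
wider_grid = dict(current_grid, ccp_alpha=[0.0, 0.001, 0.01, 0.1])

print(count_fits(current_grid))  # (90, 450), matching the GridSearchCV log
print(count_fits(wider_grid))    # (360, 1800)
```

Each additional hyperparameter multiplies the number of fits, so a wider grid is cheap for a single tree on 426 samples but becomes expensive quickly for the ensemble models.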

Despite the lack of improvement, this result is useful: it suggests that the default parameters were already effective for this classification task, which saves computation and keeps the model simple.
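
One way to check whether the baseline and tuned trees are in fact the same model is to compare their structure directly. This is a minimal sketch, assuming the same data split and `random_state` as above; the variable names `full_tree` and `capped_tree` are introduced here for illustration.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=42)

# Baseline: no depth limit. Tuned: the depth cap selected by the grid search.
full_tree = DecisionTreeClassifier(random_state=42).fit(X_train, y_train)
capped_tree = DecisionTreeClassifier(max_depth=7, random_state=42).fit(X_train, y_train)

print("baseline depth / leaves:", full_tree.get_depth(), full_tree.get_n_leaves())
print("tuned depth / leaves:   ", capped_tree.get_depth(), capped_tree.get_n_leaves())

# If the unconstrained tree never exceeds depth 7, the cap is inactive and
# both trees would be expected to make the same predictions.
same_predictions = bool((full_tree.predict(X_test) == capped_tree.predict(X_test)).all())
print("identical test predictions:", same_predictions)
```

If the two depths and leaf counts match, the identical metrics in the table are a consequence of fitting the same tree twice rather than a coincidence.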

 ---

In [15]:
# ----------------------------
# Random Forest Classifier
# ----------------------------

# This is a Baseline Random Forest Model

print("\nBaseline Random Forest (without hyperparameter tuning)")
rf_baseline = RandomForestClassifier(random_state=RANDOM_STATE)
rf_baseline.fit(X_train, y_train)


Baseline Random Forest (without hyperparameter tuning)


In [16]:
rf_baseline_results = evaluate_model(rf_baseline, X_train, X_test, y_train, y_test, 
                                    model_name="Baseline Random Forest")

===== Baseline Random Forest Evaluation =====
Training Accuracy: 1.0000
Testing Accuracy: 0.9650

Classification Report (Test Set):
              precision    recall  f1-score   support

           0       0.96      0.94      0.95        54
           1       0.97      0.98      0.97        89

    accuracy                           0.97       143
   macro avg       0.96      0.96      0.96       143
weighted avg       0.97      0.97      0.96       143



In [17]:
# -------------------------------------------------
# Random Forest with Grid Search
# ---------------------------------------------------

print("\nTuning Random Forest with GridSearchCV")

# Define parameter grid
rf_param_grid = {
    'n_estimators': [50, 100, 200],
    'max_depth': [None, 10, 20, 30],
    'min_samples_split': [2, 5, 10],
    'min_samples_leaf': [1, 2, 4]
}



Tuning Random Forest with GridSearchCV


In [18]:
# GridSearchCV object
rf_grid = GridSearchCV(
    estimator=RandomForestClassifier(random_state=RANDOM_STATE),
    param_grid=rf_param_grid,
    cv=5,
    scoring='accuracy',
    n_jobs=-1,
    verbose=1
)

# Fit GridSearchCV
rf_grid.fit(X_train, y_train)

Fitting 5 folds for each of 108 candidates, totalling 540 fits


In [19]:
# ------------------------- 
# Best hyperparameters
# ------------------------

print("\nBest Random Forest Hyperparameters:")
print()
print(rf_grid.best_params_)
print(f"Best cross-validation accuracy: {rf_grid.best_score_:.4f}")


Best Random Forest Hyperparameters:

{'max_depth': None, 'min_samples_leaf': 1, 'min_samples_split': 5, 'n_estimators': 50}
Best cross-validation accuracy: 0.9624


In [20]:
# ----------------------
# Optimized Model
# ----------------------

rf_optimized = rf_grid.best_estimator_

In [21]:
# --------------------------
# Evaluate optimized model
# --------------------------

rf_optimized_results = evaluate_model(rf_optimized, X_train, X_test, y_train, y_test, 
                                     model_name="Optimized Random Forest")

===== Optimized Random Forest Evaluation =====
Training Accuracy: 0.9930
Testing Accuracy: 0.9650

Classification Report (Test Set):
              precision    recall  f1-score   support

           0       0.96      0.94      0.95        54
           1       0.97      0.98      0.97        89

    accuracy                           0.97       143
   macro avg       0.96      0.96      0.96       143
weighted avg       0.97      0.97      0.96       143



## Hyperparameter Optimization Analysis

### Random Forest Model Analysis

#### Baseline vs. Optimized Performance Comparison

| Metric | Baseline Model | Optimized Model | Change |
|--------|---------------|-----------------|--------|
| Training Accuracy | 1.0000 | 0.9930 | -0.0070 ↓ |
| Testing Accuracy | 0.9650 | 0.9650 | No change |
| Class 0 Precision | 0.96 | 0.96 | No change |
| Class 0 Recall | 0.94 | 0.94 | No change |
| Class 0 F1-Score | 0.95 | 0.95 | No change |
| Class 1 Precision | 0.97 | 0.97 | No change |
| Class 1 Recall | 0.98 | 0.98 | No change |
| Class 1 F1-Score | 0.97 | 0.97 | No change |

### Analysis of Random Forest Hyperparameter Tuning

The hyperparameter optimization for the Random Forest model has yielded some interesting insights, even though the test performance metrics remained unchanged:

1. **Slight Reduction in Training Accuracy**: The optimized model shows a slightly lower training accuracy (1.0000 → 0.9930, i.e. 3 of 426 training samples misclassified) at the same testing accuracy. This is consistent with slightly less overfitting at no cost on unseen data, although the difference is small.

2. **Optimal Hyperparameters**: The grid search identified the following optimal configuration:
   - `max_depth`: None (allowing trees to grow to their full depth)
   - `min_samples_leaf`: 1 (minimum of 1 sample required to be at a leaf node)
   - `min_samples_split`: 5 (minimum of 5 samples required to split an internal node)
   - `n_estimators`: 50 (the ensemble consists of 50 trees)

3. **Cross-Validation Performance**: The best mean cross-validation accuracy of 0.9624 is close to the test accuracy of 0.9650, which suggests the estimate is stable. The mean alone does not show how much the score varies between folds.

4. **Generalization**: The gap between training and test accuracy narrows from 0.0350 to 0.0280 while test performance is unchanged. Test accuracy itself did not improve, so the tuned model generalizes no worse than the baseline rather than measurably better.

5. **Balanced Class Performance**: Both models perform well on both classes, with slightly better metrics for Class 1 (benign, the majority class with 357 of 569 samples).

6. **Effectiveness of Ensemble Size**: The selected `n_estimators` of 50 is the smallest value in the grid, which suggests that 50 trees are sufficient for this dataset and that larger ensembles add computation without improving cross-validation accuracy.

7. **Pruning Parameters**: Setting `min_samples_split` to 5 (higher than the default of 2) acts as a form of pre-pruning: a node needs at least 5 samples before it can be split, which limits very small, sample-specific splits.

The optimization process maintained the high predictive performance while slightly reducing the gap between training and test accuracy. Hyperparameter tuning can therefore still be informative even when headline metrics like test accuracy do not change.
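
A mean cross-validation score hides how much individual folds differ. The sketch below re-scores the selected Random Forest configuration fold by fold, assuming the same split and the 5-fold setting used in the grid search.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score, train_test_split

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=42)

# Configuration reported as best by the grid search above.
rf = RandomForestClassifier(n_estimators=50, max_depth=None,
                            min_samples_leaf=1, min_samples_split=5,
                            random_state=42)

# An integer cv with a classifier uses stratified folds, as in GridSearchCV.
scores = cross_val_score(rf, X_train, y_train, cv=5, scoring='accuracy')
print("fold accuracies:", scores.round(4))
print(f"mean = {scores.mean():.4f}, std = {scores.std():.4f}")
```

The same per-candidate spread is available from the fitted search object as `rf_grid.cv_results_['std_test_score']`. If the standard deviation is large relative to the differences between candidates, the ranking produced by the grid search should be read with caution.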

---

In [22]:
# ----------------
# Gradient Boosting
# -----------------

# Baseline GBM Model

print("\nBaseline GBM (without hyperparameter tuning)")
gbm_baseline = GradientBoostingClassifier(random_state=RANDOM_STATE)
gbm_baseline.fit(X_train, y_train)


Baseline GBM (without hyperparameter tuning)


In [24]:
# Evaluate baseline model

gbm_baseline_results = evaluate_model(gbm_baseline, X_train, X_test, y_train, y_test, 
                                     model_name="Baseline GBM")

===== Baseline GBM Evaluation =====
Training Accuracy: 1.0000
Testing Accuracy: 0.9580

Classification Report (Test Set):
              precision    recall  f1-score   support

           0       0.94      0.94      0.94        54
           1       0.97      0.97      0.97        89

    accuracy                           0.96       143
   macro avg       0.96      0.96      0.96       143
weighted avg       0.96      0.96      0.96       143



In [26]:
# Define parameter grid
gbm_param_grid = {
    'n_estimators': [50, 100, 200],
    'learning_rate': [0.01, 0.1, 0.2],
    'max_depth': [3, 5, 7],
    'subsample': [0.8, 1.0]
}

In [27]:
# GridSearchCV object
gbm_grid = GridSearchCV(
    estimator=GradientBoostingClassifier(random_state=RANDOM_STATE),
    param_grid=gbm_param_grid,
    cv=5,
    scoring='accuracy',
    n_jobs=-1,
    verbose=1
)

# Fit GridSearchCV
gbm_grid.fit(X_train, y_train)

Fitting 5 folds for each of 54 candidates, totalling 270 fits


In [28]:
# ------------------------
# Best Hyperparameters 
# ------------------------

print("\nBest GBM Hyperparameters:")
print()
print(gbm_grid.best_params_)
print(f"Best cross-validation accuracy: {gbm_grid.best_score_:.4f}")



Best GBM Hyperparameters:

{'learning_rate': 0.1, 'max_depth': 7, 'n_estimators': 100, 'subsample': 0.8}
Best cross-validation accuracy: 0.9648


In [29]:
# ---------------------
# Optimized Model
# ----------------------

gbm_optimized = gbm_grid.best_estimator_

In [30]:
# --------------------------
# Evaluate optimized model
# ---------------------------

gbm_optimized_results = evaluate_model(gbm_optimized, X_train, X_test, y_train, y_test, 
                                      model_name="Optimized GBM")

===== Optimized GBM Evaluation =====
Training Accuracy: 1.0000
Testing Accuracy: 0.9650

Classification Report (Test Set):
              precision    recall  f1-score   support

           0       0.96      0.94      0.95        54
           1       0.97      0.98      0.97        89

    accuracy                           0.97       143
   macro avg       0.96      0.96      0.96       143
weighted avg       0.97      0.97      0.96       143



## Hyperparameter Optimization Analysis

### Gradient Boosting Model Analysis

#### Baseline vs. Optimized Performance Comparison

| Metric | Baseline Model | Optimized Model | Change |
|--------|---------------|-----------------|--------|
| Training Accuracy | 1.0000 | 1.0000 | No change |
| Testing Accuracy | 0.9580 | 0.9650 | +0.0070 ↑ |
| Class 0 Precision | 0.94 | 0.96 | +0.02 ↑ |
| Class 0 Recall | 0.94 | 0.94 | No change |
| Class 0 F1-Score | 0.94 | 0.95 | +0.01 ↑ |
| Class 1 Precision | 0.97 | 0.97 | No change |
| Class 1 Recall | 0.97 | 0.98 | +0.01 ↑ |
| Class 1 F1-Score | 0.97 | 0.97 | No change |

### Analysis of Gradient Boosting Hyperparameter Tuning

Hyperparameter optimization slightly improved the Gradient Boosting model's test performance:

1. **Improved Testing Accuracy**: The optimized model achieved a higher testing accuracy (0.9580 → 0.9650). On 143 test samples this corresponds to one additional correct prediction (137 → 138), so the gain is real but small.

2. **Enhanced Class 0 Performance**: The precision for Class 0 improved from 0.94 to 0.96, and the F1-score increased from 0.94 to 0.95, indicating better overall performance for this class.

3. **Increased Class 1 Recall**: The recall for Class 1 improved from 0.97 to 0.98, suggesting the model became more effective at identifying positive cases.

4. **Optimal Hyperparameters**: The grid search selected the following configuration:
   - `learning_rate`: 0.1 (the scikit-learn default)
   - `max_depth`: 7 (the deepest option in the grid, above the default of 3)
   - `n_estimators`: 100 (100 sequential boosting stages, also the default)
   - `subsample`: 0.8 (each tree is fit on a random 80% of the training data)

5. **Cross-Validation Performance**: The best mean cross-validation accuracy of 0.9648 is almost identical to the test accuracy of 0.9650, which suggests the cross-validation estimate is reliable for this model.

6. **Learning Rate and Ensemble Size**: The learning rate (0.1) and the number of trees (100) both stayed at their default values, so the improvement over the baseline comes from the deeper trees and the subsampling. Training accuracy remains 1.0000, so the tuned model still fits the training set perfectly.

7. **Effective Use of Subsampling**: A subsample rate of 0.8 introduces randomness between boosting stages (stochastic gradient boosting), which can reduce overfitting in a way similar to the bootstrap sampling used in Random Forests.

8. **Model Complexity Control**: A `max_depth` of 7 lets individual trees model higher-order feature interactions. Because 7 is the largest value in the grid, the search cannot tell whether even deeper trees would score higher.

The optimization process for the Gradient Boosting model yielded a small improvement in test performance while training accuracy stayed at 1.0000. Of the three models analyzed (Decision Tree, Random Forest, and Gradient Boosting), the Gradient Boosting model is the only one whose test accuracy changed after tuning.
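
To judge how large the improvement is, it helps to translate accuracies into sample counts and compare the difference with the sampling noise of a 143-sample test set. This is a rough sketch, assuming test predictions behave like independent Bernoulli trials, which is a simplification.

```python
import math

n_test = 143
baseline_acc, tuned_acc = 0.9580, 0.9650  # reported GBM test accuracies

# Convert accuracies back to counts of correct predictions.
baseline_correct = round(baseline_acc * n_test)
tuned_correct = round(tuned_acc * n_test)
print(baseline_correct, tuned_correct)  # 137 138

# Binomial standard error of an accuracy estimate measured on n_test samples.
se = math.sqrt(tuned_acc * (1 - tuned_acc) / n_test)
print(f"difference = {tuned_acc - baseline_acc:.4f}, standard error ~ {se:.4f}")
```

Under this assumption the difference between the two models is smaller than one standard error, so a different random split could plausibly reverse the ordering.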

---

## Model Comparison

In [32]:
# All Results [Models]
models = {
    'Baseline Decision Tree': dt_baseline_results,
    'Optimized Decision Tree': dt_optimized_results,
    'Baseline Random Forest': rf_baseline_results,
    'Optimized Random Forest': rf_optimized_results,
    'Baseline GBM': gbm_baseline_results,
    'Optimized GBM': gbm_optimized_results
}

# Comparison DF
comparison_df = pd.DataFrame({
    'Model': list(models.keys()),
    'Training Accuracy': [models[model]['train_accuracy'] for model in models],
    'Testing Accuracy': [models[model]['test_accuracy'] for model in models]
})

print("\nModel Comparison:")
comparison_df 


Model Comparison:


|   | Model | Training Accuracy | Testing Accuracy |
|---|-------|-------------------|------------------|
| 0 | Baseline Decision Tree | 1.0 | 0.951049 |
| 1 | Optimized Decision Tree | 1.0 | 0.951049 |
| 2 | Baseline Random Forest | 1.0 | 0.965035 |
| 3 | Optimized Random Forest | 0.992958 | 0.965035 |
| 4 | Baseline GBM | 1.0 | 0.958042 |
| 5 | Optimized GBM | 1.0 | 0.965035 |


In [33]:
# Plot for Model Comparison

fig = go.Figure()

fig.add_trace(go.Bar(
    x=comparison_df['Model'],
    y=comparison_df['Training Accuracy'],
    name='Training Accuracy',
    marker_color='#FFB6C1'
))

fig.add_trace(go.Bar(
    x=comparison_df['Model'],
    y=comparison_df['Testing Accuracy'],
    name='Testing Accuracy',
    marker_color='#FF69B4'
))

fig.update_layout(
    title='Model Comparison: Training vs Testing Accuracy',
    xaxis_title='Model',
    yaxis_title='Accuracy',
    barmode='group',
    yaxis=dict(range=[0.85, 1.0]),
    width=900,
    height=500
)

fig.show()

In [34]:
# --------------------------------------
#  Best Model based on Test Accuracy
# ------------------------------------

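# Note: idxmax() returns the first row that holds the maximum, so ties are resolved by row order.
# Baseline Random Forest, Optimized Random Forest, and Optimized GBM all share the top test accuracy.
best_model_name = comparison_df.loc[comparison_df['Testing Accuracy'].idxmax(), 'Model']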
best_model_accuracy = comparison_df['Testing Accuracy'].max()

print(f"\nBest Model: {best_model_name}")
print(f"Best Model Test Accuracy: {best_model_accuracy:.4f}")


Best Model: Baseline Random Forest
Best Model Test Accuracy: 0.9650
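
Because `idxmax()` reports only the first row with the maximum value, it is worth listing every model that shares the top score. This is a minimal sketch using the test accuracies from the comparison table; the tolerance `tol` is an assumption to absorb floating-point noise.

```python
# Test accuracies copied from the comparison table.
test_accuracy = {
    'Baseline Decision Tree': 0.951049,
    'Optimized Decision Tree': 0.951049,
    'Baseline Random Forest': 0.965035,
    'Optimized Random Forest': 0.965035,
    'Baseline GBM': 0.958042,
    'Optimized GBM': 0.965035,
}

tol = 1e-9  # assumed tolerance for treating two accuracies as equal
best = max(test_accuracy.values())
tied = [name for name, acc in test_accuracy.items() if abs(acc - best) <= tol]

print(f"best test accuracy: {best:.4f}")
print("models sharing it:", tied)
```

Three models tie at 0.9650, so test accuracy alone does not single out a winner; a secondary criterion such as cross-validation accuracy or model size is needed to break the tie.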


In [36]:
# ----------------------------------
# Feature importance [Best Model]
# ----------------------------------

best_model = None
if best_model_name == 'Optimized Decision Tree':
    best_model = dt_optimized
elif best_model_name == 'Optimized Random Forest':
    best_model = rf_optimized
elif best_model_name == 'Optimized GBM':
    best_model = gbm_optimized
elif best_model_name == 'Baseline Decision Tree':
    best_model = dt_baseline
elif best_model_name == 'Baseline Random Forest':
    best_model = rf_baseline
elif best_model_name == 'Baseline GBM':
    best_model = gbm_baseline

if best_model is not None:
    feature_importance = pd.DataFrame({
        'Feature': X.columns,
        'Importance': best_model.feature_importances_
    }).sort_values('Importance', ascending=True)
    
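    # Plot feature importance
    # The frame is sorted ascending, so the 15 most important features are the last 15 rows.
    # Keeping ascending order places the largest bar at the top of the horizontal chart.
    fig = px.bar(feature_importance.tail(15), 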
                x='Importance', 
                y='Feature',
                orientation='h',
                color_discrete_sequence=['#FF1493'],
                title=f'Top 15 Feature Importance - {best_model_name}')
    
    fig.update_layout(
        xaxis_title='Importance',
        yaxis_title='Feature',
        width=900,
        height=500
    )
    
    fig.show()

---

In [39]:
print(f"""
Based on our analysis, the {best_model_name} performs best with a test accuracy of {best_model_accuracy:.4f}.

The optimal hyperparameters for each model were:

1. Decision Tree:
   {dt_grid.best_params_}

2. Random Forest:
   {rf_grid.best_params_}

3. Gradient Boosting Machine:
   {gbm_grid.best_params_}

The feature importance analysis shows which features had the most impact on the classification.

This notebook demonstrates the importance of hyperparameter tuning for improving model
performance, especially for models that require tuning to reach their optimal performance.
""")


Based on our analysis, the Baseline Random Forest performs best with a test accuracy of 0.9650.

The optimal hyperparameters for each model were:

1. Decision Tree:
   {'criterion': 'gini', 'max_depth': 7, 'min_samples_leaf': 1, 'min_samples_split': 2}

2. Random Forest:
   {'max_depth': None, 'min_samples_leaf': 1, 'min_samples_split': 5, 'n_estimators': 50}

3. Gradient Boosting Machine:
   {'learning_rate': 0.1, 'max_depth': 7, 'n_estimators': 100, 'subsample': 0.8}

The feature importance analysis shows which features had the most impact on the classification.

This notebook demonstrates the importance of hyperparameter tuning for improving model
performance, especially for models that require tuning to reach their optimal performance.



---