# Model Building: Classification Models Evaluation
In this phase, I evaluate four classification models (Logistic Regression, Random Forest, XGBoost, LightGBM) with 5-fold cross-validation on the training set, tune hyperparameters with a grid search, and save the selected model.

In [4]:
# Import required libraries
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split, cross_val_score, GridSearchCV
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier
from xgboost import XGBClassifier
from lightgbm import LGBMClassifier
from sklearn.metrics import accuracy_score, roc_auc_score, classification_report
import joblib
import os

## Load Transformed Data
In this section, I will load the preprocessed training and test data from the `data/transformed` folder.

In [5]:
# Load transformed data
X_train = joblib.load('../data/transformed/X_train.pkl')
X_test = joblib.load('../data/transformed/X_test.pkl')
y_train = joblib.load('../data/transformed/y_train.pkl')
y_test = joblib.load('../data/transformed/y_test.pkl')

print(f"X_train shape: {X_train.shape}")
print(f"X_test shape: {X_test.shape}")
print(f"y_train shape: {y_train.shape}")
print(f"y_test shape: {y_test.shape}")

X_train shape: (614, 8)
X_test shape: (154, 8)
y_train shape: (614,)
y_test shape: (154,)


## Define Models
In this section, I will define the four classification models to evaluate.

In [6]:
# Define models
# Note: recent XGBoost releases ignore `use_label_encoder` and report it as an unused parameter
models = {
    'Logistic Regression': LogisticRegression(max_iter=1000, random_state=42),
    'Random Forest': RandomForestClassifier(random_state=42),
    'XGBoost': XGBClassifier(use_label_encoder=False, eval_metric='logloss', random_state=42),
    'LightGBM': LGBMClassifier(random_state=42)
}

## Cross-Validation and Model Evaluation
Evaluate each model with 5-fold cross-validation on the training set, scoring by ROC-AUC, and compare the mean and standard deviation of the fold scores.

In [7]:
# Cross-validation and evaluation
cv_results = {}
for name, model in models.items():
    scores = cross_val_score(model, X_train, y_train, cv=5, scoring='roc_auc')
    cv_results[name] = scores
    print(f"{name} ROC-AUC: {scores.mean():.4f} (+/- {scores.std():.4f})")

Logistic Regression ROC-AUC: 0.8372 (+/- 0.0286)
Random Forest ROC-AUC: 0.8206 (+/- 0.0216)


Parameters: { "use_label_encoder" } are not used.

  bst.update(dtrain, iteration=i, fobj=obj)


XGBoost ROC-AUC: 0.7808 (+/- 0.0189)
[LightGBM] [Info] Number of positive: 171, number of negative: 320
[LightGBM] [Info] Auto-choosing row-wise multi-threading, the overhead of testing was 0.000877 seconds.
You can set `force_row_wise=true` to remove the overhead.
And if memory is not enough, you can set `force_col_wise=true`.
[LightGBM] [Info] Total Bins 580
[LightGBM] [Info] Number of data points in the train set: 491, number of used features: 8
[LightGBM] [Info] [binary:BoostFromScore]: pavg=0.348269 -> initscore=-0.626657
[LightGBM] [Info] Start training from score -0.626657
[LightGBM] output truncated; the logs for the remaining folds and the LightGBM ROC-AUC line are not shown.
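The per-model scores printed by the loop can also be collected into a single sorted table, which makes the comparison easier to read than interleaved log output. The following is a minimal sketch, not the notebook's own code: it assumes a synthetic dataset from `make_classification` in place of the transformed data, uses only the two scikit-learn models so it runs without XGBoost or LightGBM installed, and introduces a hypothetical helper named `summarize_cv`.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_val_score

# Assumption: synthetic stand-in with the same shape as X_train (614 rows, 8 features)
X_demo, y_demo = make_classification(
    n_samples=614, n_features=8, weights=[0.65, 0.35], random_state=42
)


def summarize_cv(models, X, y, n_splits=5, scoring="roc_auc"):
    """Hypothetical helper: cross-validate each model and return a sorted summary table."""
    # Shuffled, stratified folds keep the class ratio similar in every fold
    cv = StratifiedKFold(n_splits=n_splits, shuffle=True, random_state=42)
    rows = []
    for name, model in models.items():
        scores = cross_val_score(model, X, y, cv=cv, scoring=scoring)
        rows.append({"model": name, "mean": scores.mean(), "std": scores.std()})
    # Highest mean score first
    return pd.DataFrame(rows).sort_values("mean", ascending=False).reset_index(drop=True)


demo_models = {
    "Logistic Regression": LogisticRegression(max_iter=1000),
    "Random Forest": RandomForestClassifier(n_estimators=50, random_state=42),
}
summary = summarize_cv(demo_models, X_demo, y_demo)
print(summary)
```

With the real data, passing the notebook's `models` dictionary together with `X_train` and `y_train` should give the same kind of table for all four models.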

## Hyperparameter Tuning
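Tune hyperparameters with GridSearchCV, using Random Forest as the worked example. Note that this search uses 3-fold cross-validation, so its ROC-AUC is not directly comparable to the 5-fold scores above.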

In [8]:
# Example: Hyperparameter tuning for Random Forest
param_grid_rf = {
    'n_estimators': [100, 200],
    'max_depth': [None, 5, 10],
    'min_samples_split': [2, 5]
}
gs_rf = GridSearchCV(RandomForestClassifier(random_state=42), param_grid_rf, cv=3, scoring='roc_auc', n_jobs=-1)
gs_rf.fit(X_train, y_train)
print(f"Best Random Forest params: {gs_rf.best_params_}")
print(f"Best Random Forest ROC-AUC: {gs_rf.best_score_:.4f}")

# You can repeat similar tuning for other models as needed.

Best Random Forest params: {'max_depth': 5, 'min_samples_split': 5, 'n_estimators': 200}
Best Random Forest ROC-AUC: 0.8306
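Logistic Regression had the highest mean cross-validated ROC-AUC among the scores shown above (0.8372), so it is a natural candidate for the same kind of tuning. The sketch below is an assumption-laden illustration rather than part of the original run: it uses synthetic data from `make_classification`, and the grid over the regularization strength `C` is an arbitrary choice for demonstration.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV

# Assumption: synthetic stand-in for X_train / y_train
X_demo, y_demo = make_classification(n_samples=614, n_features=8, random_state=42)

# Assumption: illustrative grid; smaller C means stronger L2 regularization
param_grid_lr = {"C": [0.01, 0.1, 1, 10]}

gs_lr = GridSearchCV(
    LogisticRegression(max_iter=1000),
    param_grid_lr,
    cv=5,               # same fold count as the initial model comparison
    scoring="roc_auc",
)
gs_lr.fit(X_demo, y_demo)

print(f"Best Logistic Regression params: {gs_lr.best_params_}")
print(f"Best Logistic Regression ROC-AUC: {gs_lr.best_score_:.4f}")
```

Using `cv=5` here keeps the tuned score on the same footing as the initial comparison.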


## Final Model Training and Evaluation
Take the tuned Random Forest from the grid search, which is already fitted on the full training set, and evaluate it on the held-out test set.

In [9]:
# Use the tuned Random Forest. GridSearchCV (refit=True by default) has already
# refit best_estimator_ on the full training set, so no further fit is needed.
best_model = gs_rf.best_estimator_
y_pred = best_model.predict(X_test)
y_proba = best_model.predict_proba(X_test)[:, 1]

print("Test Accuracy:", accuracy_score(y_test, y_pred))
print("Test ROC-AUC:", roc_auc_score(y_test, y_proba))
print(classification_report(y_test, y_pred))

Test Accuracy: 0.7272727272727273
Test ROC-AUC: 0.812037037037037
              precision    recall  f1-score   support

           0       0.77      0.83      0.80       100
           1       0.63      0.54      0.58        54

    accuracy                           0.73       154
   macro avg       0.70      0.68      0.69       154
weighted avg       0.72      0.73      0.72       154
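The figures in the report can be cross-checked by hand. The sketch below assumes a confusion matrix reconstructed from the rounded recall and support values above (83 of 100 negatives and 29 of 54 positives classified correctly); these counts are inferred from the report, not printed by the notebook.

```python
import numpy as np

# Assumption: counts reconstructed from the rounded report
# rows = true class (0, 1), columns = predicted class (0, 1)
cm = np.array([[83, 17],
               [25, 29]])
tn, fp, fn, tp = cm.ravel()

accuracy = (tp + tn) / cm.sum()        # share of all predictions that are correct
precision_1 = tp / (tp + fp)           # of predicted positives, how many are real
recall_1 = tp / (tp + fn)              # of real positives, how many are found
f1_1 = 2 * precision_1 * recall_1 / (precision_1 + recall_1)

print(f"accuracy={accuracy:.4f}  precision(1)={precision_1:.2f}  "
      f"recall(1)={recall_1:.2f}  f1(1)={f1_1:.2f}")
```

Under these counts, 25 of the 54 positive cases are missed, which is why recall for class 1 is the weakest number in the report.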



## Save the Best Model
Save the trained best model to the `models` directory for future use.

In [10]:
# Save the best model
os.makedirs('../models', exist_ok=True)
joblib.dump(best_model, '../models/best_model.pkl')
print("Best model saved as '../models/best_model.pkl'")

Best model saved as '../models/best_model.pkl'
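A saved model is only useful if it can be reloaded and gives the same predictions. The following is a minimal round-trip sketch under stated assumptions: it trains a small stand-in Random Forest on synthetic data and writes to a temporary directory instead of `../models`, so it can run anywhere.

```python
import os
import tempfile

import joblib
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Assumption: synthetic data and a small stand-in model instead of the tuned one
X_demo, y_demo = make_classification(n_samples=200, n_features=8, random_state=42)
model = RandomForestClassifier(n_estimators=50, max_depth=5, random_state=42)
model.fit(X_demo, y_demo)

# Save to a temporary directory and load the model back
with tempfile.TemporaryDirectory() as tmp_dir:
    model_path = os.path.join(tmp_dir, "best_model.pkl")
    joblib.dump(model, model_path)
    loaded_model = joblib.load(model_path)

# The reloaded model should reproduce the original probabilities exactly
proba_original = model.predict_proba(X_demo[:5])[:, 1]
proba_loaded = loaded_model.predict_proba(X_demo[:5])[:, 1]
print(proba_loaded)
```

In the notebook's own setup, the equivalent call would be `joblib.load('../models/best_model.pkl')`, ideally with the same library versions that were used when saving.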
