# 3. Modeling & Results

This notebook covers the modeling, results, and comparison of the different algorithms. It follows this outline:
1. Classification Models
    1. Logistic Regression
    2. Random Forest
    3. Tuning with GridSearchCV
    4. Oversampling technique (if needed)
2. Model Performance Comparison
    1. F1 Score & Confusion Matrix / Recall
    2. Camilo don’t remember

In [1]:
import pandas as pd

## Load the 5 datasets

In [2]:
scaled_features = pd.read_csv('../data/raw/scaled_features.csv', index_col=0)
vif_features = pd.read_csv('../data/raw/vif_features.csv', index_col=0)
polynomial_features = pd.read_csv('../data/raw/polynomial_features.csv', index_col=0)
rfe_features = pd.read_csv('../data/raw/rfe_features.csv', index_col=0)
pca_features = pd.read_csv('../data/raw/pca_features.csv', index_col=0)
y = pd.read_csv('../data/raw/target_variable.csv').squeeze()  # Use .squeeze() to convert to a Series if needed


## Splitting Dataset (Train/Validation/Test)

In [3]:
from sklearn.model_selection import train_test_split
# Split data into training, validation, and test sets using two stratified train_test_split calls
def split_train_val_test(X, y, val_size=0.2, test_size=0.2, random_state=33):
    """
    Splits the data into training, validation, and test sets with stratification.

    Parameters:
    - X: Features
    - y: Target variable
    - val_size: Proportion of the remaining (non-test) data used for validation (default 0.2, i.e. 16% of the full data with the default test_size)
    - test_size: Proportion of the test set (default 0.2)
    - random_state: Seed for reproducibility

    Returns:
    - X_train, X_val, X_test, y_train, y_val, y_test
    """
    # First split to get the test set with stratification
    X_train_val, X_test, y_train_val, y_test = train_test_split(X, y, test_size=test_size, stratify=y, random_state=random_state)
    # Second split to separate validation from training with stratification
    X_train, X_val, y_train, y_val = train_test_split(X_train_val, y_train_val, test_size=val_size, stratify=y_train_val, random_state=random_state)
    return X_train, X_val, X_test, y_train, y_val, y_test
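
Because the second split is applied to what remains after the test set is removed, the default arguments give roughly a 64/16/20 train/validation/test partition rather than 60/20/20. The following is a minimal sketch that checks this on synthetic data; `X_demo` and `y_demo` are hypothetical stand-ins, not the bankruptcy dataset, and the two calls simply mirror the body of `split_train_val_test`.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Synthetic, imbalanced toy data (hypothetical; 5% positive class)
rng = np.random.default_rng(0)
X_demo = pd.DataFrame(rng.normal(size=(1000, 3)), columns=["a", "b", "c"])
y_demo = pd.Series([1] * 50 + [0] * 950)

# Same two-step stratified split as in split_train_val_test
X_tv, X_te, y_tv, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, stratify=y_demo, random_state=33
)
X_tr, X_va, y_tr, y_va = train_test_split(
    X_tv, y_tv, test_size=0.2, stratify=y_tv, random_state=33
)

# Row counts per split and share of the positive class in each
sizes = {"train": len(X_tr), "val": len(X_va), "test": len(X_te)}
positive_rate = {"train": y_tr.mean(), "val": y_va.mean(), "test": y_te.mean()}
print(sizes)
print(positive_rate)
```

Stratification keeps the positive-class share approximately equal across the three splits, which matters when the positive class is rare.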


In [4]:
# For scaled_features dataset
X_train_scaled, X_val_scaled, X_test_scaled, y_train_scaled, y_val_scaled, y_test_scaled = split_train_val_test(scaled_features, y)

# For vif_features dataset
X_train_vif, X_val_vif, X_test_vif, y_train_vif, y_val_vif, y_test_vif = split_train_val_test(vif_features, y)

# For polynomial_features dataset
X_train_poly, X_val_poly, X_test_poly, y_train_poly, y_val_poly, y_test_poly = split_train_val_test(polynomial_features, y)

# For rfe_features dataset
X_train_rfe, X_val_rfe, X_test_rfe, y_train_rfe, y_val_rfe, y_test_rfe = split_train_val_test(rfe_features, y)

# For pca_features dataset
X_train_pca, X_val_pca, X_test_pca, y_train_pca, y_val_pca, y_test_pca = split_train_val_test(pca_features, y)

## Logistic Regression 

We import LogisticRegression from sklearn and initialize one model per dataset. Each model is then trained on the corresponding split.



In [5]:
from sklearn.linear_model import LogisticRegression

# Factory for a Logistic Regression model with balanced class weights
def initialize_logistic_regression():
    """
    Initializes Logistic Regression with class weight balanced to handle class imbalance.
    
    Returns:
    - LogisticRegression object
    """
    # Logistic Regression model with class_weight='balanced' to handle class imbalance
    model = LogisticRegression(solver='liblinear', max_iter=1000, class_weight='balanced')
    return model

# Initialize the Logistic Regression model for all datasets
logistic_model_scaled = initialize_logistic_regression()
logistic_model_vif = initialize_logistic_regression()
logistic_model_poly = initialize_logistic_regression()
logistic_model_rfe = initialize_logistic_regression()
logistic_model_pca = initialize_logistic_regression()

# Now you have Logistic Regression initialized for each dataset
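
To make `class_weight='balanced'` concrete: scikit-learn weights each class by `n_samples / (n_classes * count_of_class)`, so the rare class gets a proportionally larger weight. The sketch below reproduces those weights with `compute_class_weight` on a hypothetical 95/5 label vector (`y_demo` is illustrative, not the notebook's target).

```python
import numpy as np
from sklearn.utils.class_weight import compute_class_weight

# Hypothetical imbalanced labels: 95 negatives, 5 positives
y_demo = np.array([0] * 95 + [1] * 5)
classes = np.unique(y_demo)

# 'balanced' -> n_samples / (n_classes * np.bincount(y))
weights = compute_class_weight(class_weight="balanced", classes=classes, y=y_demo)
class_weights = dict(zip(classes.tolist(), weights.tolist()))
print(class_weights)
```

With these counts, an error on a positive sample costs about 19 times as much as an error on a negative one, which pushes the model toward higher recall at the expense of precision.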


### Model performance metrics

When predicting bankruptcy, it is essential to minimize false negatives (predicting that a company will not go bankrupt when it actually will), so the key metric for model performance and selection is recall. We also use the F1 score, which balances precision and recall and so gives a measure suited to imbalanced data (which is our case).
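
As a small worked example of how these metrics relate to the confusion matrix, the sketch below uses hypothetical toy labels and assumes that 1 marks a bankrupt company (the default positive label for `recall_score` and `f1_score`).

```python
from sklearn.metrics import confusion_matrix, f1_score, recall_score

# Toy labels (hypothetical): 1 = bankrupt, 0 = not bankrupt
y_true = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
y_pred = [1, 1, 1, 0, 1, 1, 0, 0, 0, 0]

# For binary labels, ravel() returns the counts in the order tn, fp, fn, tp
tn, fp, fn, tp = (int(v) for v in confusion_matrix(y_true, y_pred).ravel())

recall = tp / (tp + fn)        # share of actual bankruptcies that were caught
precision = tp / (tp + fp)     # share of predicted bankruptcies that were real
f1 = 2 * precision * recall / (precision + recall)

print(f"TN={tn} FP={fp} FN={fn} TP={tp}")
print(f"Recall={recall:.4f} Precision={precision:.4f} F1={f1:.4f}")
```

A model can reach high recall with a low F1 when it raises many false alarms, so the two metrics are best read together.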

In [6]:
from sklearn.metrics import recall_score, f1_score

### Hyperparameter tuning and cross-validation with GridSearchCV

In [7]:
from sklearn.model_selection import GridSearchCV
# Define the parameter grid for Logistic Regression with different class weights
param_grid_lr = {
    'C': [0.01, 0.1, 1, 10],  # Regularization strength
    'penalty': ['l1', 'l2'],   # Regularization type
    'class_weight': ['balanced'],  # Adjust class weights inversely proportional to class frequencies
}

### Scaled_Features (no data engineering or feature selection)

#### Fit the model

In [42]:
from sklearn.metrics import make_scorer

# Define scoring metrics for recall and F1
scoring = {
    'recall': make_scorer(recall_score),
    'f1': make_scorer(f1_score)
}

grid_search_lr = GridSearchCV(
    estimator=logistic_model_scaled,
    param_grid=param_grid_lr,
    scoring=scoring,  # Use both recall and F1
    refit='f1',  # Refit the model using the best F1 score
    cv=5,  # 5-fold cross-validation
    n_jobs=-1,
    verbose=1,
    return_train_score=True  # Include training scores in results
)

# Fit the model
grid_search_lr.fit(X_train_scaled, y_train_scaled)


Fitting 5 folds for each of 4 candidates, totalling 20 fits


#### Evaluate and analyze

In [35]:

# Get the best model and evaluate
best_model_scaled = grid_search_lr.best_estimator_

# Print best hyperparameters
best_params_scaled = grid_search_lr.best_params_
print(f"Best hyperparameters for scaled_features: {best_params_scaled}")


Best hyperparameters for scaled_features: {'C': 0.1, 'class_weight': 'balanced', 'penalty': 'l1'}


In [36]:
# --- Evaluate the best model on the training set ---
y_train_pred_scaled = best_model_scaled.predict(X_train_scaled)
train_recall_scaled = recall_score(y_train_scaled, y_train_pred_scaled)
train_f1_scaled = f1_score(y_train_scaled, y_train_pred_scaled)

print(f"Training Recall for scaled_features: {train_recall_scaled:.4f}")
print(f"Training F1 Score for scaled_features: {train_f1_scaled:.4f}")

Training Recall for scaled_features: 0.8936
Training F1 Score for scaled_features: 0.2979


In [37]:
# Evaluate on validation set
y_val_pred_scaled = best_model_scaled.predict(X_val_scaled)
recall_scaled = recall_score(y_val_scaled, y_val_pred_scaled)
f1_scaled = f1_score(y_val_scaled, y_val_pred_scaled)

# Print evaluation metrics
print(f"Recall for scaled_features: {recall_scaled:.4f}")
print(f"F1 Score for scaled_features: {f1_scaled:.4f}")

Recall for scaled_features: 0.8571
F1 Score for scaled_features: 0.3046


In [46]:
def get_detailed_results(grid_search):
    """
    Returns a DataFrame with recall and F1 scores for training and validation sets from a fitted GridSearchCV object.
    
    Parameters:
    - grid_search: Fitted GridSearchCV object.
    
    Returns:
    - DataFrame containing hyperparameters, recall and F1 scores for training and validation sets.
    """
    # Convert cv_results_ to a DataFrame
    results_df = pd.DataFrame(grid_search.cv_results_)
    
    # Extract and rename the relevant columns
    results_df = results_df[['params', 
                             'mean_train_recall', 'mean_train_f1', 
                             'mean_test_recall', 'mean_test_f1']]
    
    # Rename columns for clarity
    results_df.columns = ['params', 'train_recall', 'train_f1', 'val_recall', 'val_f1']
    
    return results_df.sort_values(by='val_f1', ascending=False)

# Now you can run the function as follows:
results_df_scaled = get_detailed_results(grid_search_lr)

# Display the top results
results_df_scaled

| | params | train_recall | train_f1 | val_recall | val_f1 |
|---|---|---|---|---|---|
| 0 | {'C': 0.1, 'class_weight': 'balanced', 'penalt... | 0.897187 | 0.299733 | 0.851724 | 0.289672 |
| 2 | {'C': 1, 'class_weight': 'balanced', 'penalty'... | 0.890107 | 0.308217 | 0.809113 | 0.279349 |
| 1 | {'C': 0.1, 'class_weight': 'balanced', 'penalt... | 0.900727 | 0.297472 | 0.851724 | 0.277584 |
| 3 | {'C': 1, 'class_weight': 'balanced', 'penalty'... | 0.895433 | 0.310018 | 0.809113 | 0.276087 |


In [47]:
results_df_scaled.to_csv('../data/results/results_df_scaled.csv')

Let's evaluate the impact of feature engineering and interaction terms.

### Polynomial Features Dataset (Including feature engineering interactions)

#### Fit the model

Let's evaluate the model with a reduced grid search that does not cover all the parameters.

In [50]:
param_grid_lr_poly = {
    'C': [0.1, 1],  # Regularization strength
    'penalty': ['l1'],   # Regularization type
    'class_weight': ['balanced'],  # Adjust class weights inversely proportional to class frequencies
}
grid_search_lr_poly = GridSearchCV(
    estimator=logistic_model_poly,
    param_grid=param_grid_lr_poly,
    scoring=scoring,  # Use both recall and F1
    refit='f1',  # Refit the model using the best F1 score
    cv=5,  # 5-fold cross-validation
    n_jobs=-1,
    verbose=1,
    return_train_score=True  # Needed by get_detailed_results (mean_train_* columns)
)

# Fit the model
grid_search_lr_poly.fit(X_train_poly, y_train_poly)

Fitting 5 folds for each of 2 candidates, totalling 10 fits




In [None]:
results_df_poly = get_detailed_results(grid_search_lr_poly)
results_df_poly.to_csv('../data/results/results_df_poly.csv')
results_df_poly 

### VIF Features (Selected using VIF technique)

#### Fit the model

In [None]:
# Fit and evaluate on vif_features dataset
grid_search_lr_vif = GridSearchCV(
    estimator=logistic_model_vif,
    param_grid=param_grid_lr,
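    scoring=scoring,  # Use both recall and F1
    refit='f1',  # Required with multi-metric scoring; refit using the best F1 score
    cv=5,  # 5-fold cross-validation
    n_jobs=-1,
    verbose=1,
    return_train_score=True  # Needed by get_detailed_results (mean_train_* columns)
)

# Fit the model on vif_features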
grid_search_lr_vif.fit(X_train_vif, y_train_vif)

In [None]:
results_df_vif = get_detailed_results(grid_search_lr_vif)
results_df_vif.to_csv('../data/results/results_df_vif.csv')
results_df_vif

### PCA Features (Selected using PCA technique)

In [None]:
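# Fit and evaluate on pca_features dataset
grid_search_lr_pca = GridSearchCV(
    estimator=logistic_model_pca,
    param_grid=param_grid_lr,
    scoring=scoring,  # Use both recall and F1
    refit='f1',  # Required with multi-metric scoring; refit using the best F1 score
    cv=5,  # 5-fold cross-validation
    n_jobs=-1,
    verbose=1,
    return_train_score=True  # Needed by get_detailed_results (mean_train_* columns)
)

# Fit the model on pca_features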
grid_search_lr_pca.fit(X_train_pca, y_train_pca)

In [None]:
results_df_pca = get_detailed_results(grid_search_lr_pca)
results_df_pca.to_csv('../data/results/results_df_pca.csv')
results_df_pca

### RFE Features (Selected features using RFE)

In [None]:
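# Fit and evaluate on rfe_features dataset
grid_search_lr_rfe = GridSearchCV(
    estimator=logistic_model_rfe,
    param_grid=param_grid_lr,
    scoring=scoring,  # Use both recall and F1
    refit='f1',  # Required with multi-metric scoring; refit using the best F1 score
    cv=5,  # 5-fold cross-validation
    n_jobs=-1,
    verbose=1,
    return_train_score=True  # Needed by get_detailed_results (mean_train_* columns)
)

# Fit the model on rfe_features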
grid_search_lr_rfe.fit(X_train_rfe, y_train_rfe)

In [None]:
results_df_rfe = get_detailed_results(grid_search_lr_rfe)
results_df_rfe.to_csv('../data/results/results_df_rfe.csv')
results_df_rfe

## Random Forest

Based on the previous section, we will apply this model only to the dataset that gave the best Logistic Regression results, then compare the final results to see which model predicts bankruptcy better.
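
One way this step could look is sketched below, reusing the same multi-metric `GridSearchCV` setup as the Logistic Regression section. This is only an illustration under stated assumptions: the data comes from `make_classification` rather than the bankruptcy features, and `param_grid_rf`, `scoring_rf`, and `grid_search_rf` are hypothetical names and values, not settings chosen in this notebook.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import f1_score, make_scorer, recall_score
from sklearn.model_selection import GridSearchCV, train_test_split

# Synthetic imbalanced data (hypothetical stand-in for the selected feature set)
X_demo, y_demo = make_classification(
    n_samples=600, n_features=10, weights=[0.9, 0.1], random_state=33
)
X_tr, X_va, y_tr, y_va = train_test_split(
    X_demo, y_demo, test_size=0.2, stratify=y_demo, random_state=33
)

# Hypothetical, deliberately small grid to keep the example fast
param_grid_rf = {"n_estimators": [50, 100], "max_depth": [3, 5]}
scoring_rf = {"recall": make_scorer(recall_score), "f1": make_scorer(f1_score)}

grid_search_rf = GridSearchCV(
    estimator=RandomForestClassifier(class_weight="balanced", random_state=33),
    param_grid=param_grid_rf,
    scoring=scoring_rf,
    refit="f1",               # must name a scorer key when scoring is a dict
    cv=5,
    return_train_score=True,  # keeps mean_train_* columns in cv_results_
)
grid_search_rf.fit(X_tr, y_tr)

# Evaluate the refit model on the held-out validation split
y_va_pred = grid_search_rf.predict(X_va)
val_recall = recall_score(y_va, y_va_pred)
val_f1 = f1_score(y_va, y_va_pred)
print(grid_search_rf.best_params_)
print(f"Validation recall: {val_recall:.4f}, F1: {val_f1:.4f}")
```

Keeping the scorers, `refit` key, and fold count identical to the Logistic Regression runs makes the two models' `cv_results_` directly comparable.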

## Model Comparison

## Oversampling Technique 