---


<img width=25% src="https://raw.githubusercontent.com/gabrielcapela/credit_risk/main/images/myself.png" align=right>

# **Credit Risk Assessment Project**

*by Gabriel Capela*

[<img src="https://img.shields.io/badge/LinkedIn-0077B5?style=for-the-badge&logo=linkedin&logoColor=white"/>](https://www.linkedin.com/in/gabrielcapela)
[<img src="https://img.shields.io/badge/Medium-12100E?style=for-the-badge&logo=medium&logoColor=white" />](https://medium.com/@gabrielcapela)

---

This project aims to develop a **Machine Learning model capable of predicting the probability of customer default** at the time of a credit card application, even before any payment history is available.

Default prediction is critical to minimize financial losses, preserve institutional credibility, and provide fair and efficient access to credit. However, the task is challenging due to limited data at the application stage, potential classification errors (false positives/negatives), and the need for representative historical data.

The ultimate goal is to provide financial institutions with a **data-driven decision-support tool** that improves the accuracy and fairness of credit approval processes.



<p align="center">
<img width=90% src="https://raw.githubusercontent.com/gabrielcapela/credit_risk/main/images/crisp-dm.jpeg">
</p>



This notebook will cover the **last three phases** of the project: Modeling, Evaluation and Deployment.

#  Modeling

This step will follow the following sequence: First, we will **create a pipeline** for training multiple models, after which **two models will be chosen** to move forward. These two models will be **hyperparameterized** and, finally, they will perharps be combined into a **ensemble model**.

In [4]:
# Importing the data
import pandas as pd
X_train= pd.read_csv("../data/X_train.csv")
X_test= pd.read_csv("../data/X_test.csv")
y_train= pd.read_csv("../data/y_train.csv")
y_test= pd.read_csv("../data/y_test.csv")

print(f'The X_train shape is {X_train.shape[0]} x {X_train.shape[1]}')
print(f'The X_test shape is {X_test.shape[0]} x {X_test.shape[1]}')
print(f'The y_train shape is {y_train.shape[0]} x {y_train.shape[1]}')
print(f'The y_test shape is {y_test.shape[0]} x {y_test.shape[1]}')


The X_train shape is 26754 x 82
The X_test shape is 11466 x 82
The y_train shape is 26754 x 1
The y_test shape is 11466 x 1


##Creating a Pipeline

A Pipeline will be created to chain the cross validation and data standardization process, in order to organize this flow and allow the simultaneous testing of several different models, avoiding code repetition.

In [None]:
#Importing the necessary packages and models
from sklearn.pipeline import Pipeline
from imblearn.pipeline import Pipeline as ImbPipeline
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import cross_val_predict, StratifiedKFold
from sklearn.metrics import recall_score, precision_score, accuracy_score, confusion_matrix, classification_report
from imblearn.over_sampling import SMOTE

from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier
from sklearn.neighbors import KNeighborsClassifier

#Highlighting quantitative variables so that only they are standardized in pre-processing
num_features = ['account_length', 'number_vmail_messages', 'total_day_minutes',
                'total_day_calls', 'total_day_charge', 'total_eve_minutes',
                'total_eve_calls', 'total_eve_charge', 'total_night_minutes',
                'total_night_calls', 'total_night_charge', 'total_intl_minutes',
                'total_intl_calls', 'total_intl_charge', 'customer_service_calls']

cat_features = ['international_plan', 'voice_mail_plan', '408', '415', '510']

preprocessor = ColumnTransformer([
    ('num_scaler', StandardScaler(), num_features),
    ('passthrough', 'passthrough', cat_features)
])

#List of models
models = {
    'Random Forest': RandomForestClassifier(random_state=42),
    'Logistic Regression': LogisticRegression(),
    'SVM': SVC(probability=True),
    'Decision Tree': DecisionTreeClassifier(),
    'KNN': KNeighborsClassifier()
}

#Defining the number of folds for Cross-Validation
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)

#List to store the results
results = []

#Testing all models
for name, model in models.items():
    print(f"\nTesting {name}...")

    #Creating the pipeline
    pipeline = ImbPipeline([
        ('preprocessing', preprocessor),
        ('smote', SMOTE(sampling_strategy='auto', random_state=42)),
        ('classifier', model)
    ])

    #Getting Predictions with Cross-Validation
    y_pred = cross_val_predict(pipeline, X_train, y_train, cv=cv)

    #Calculating metrics
    recall = recall_score(y_train, y_pred)
    precision = precision_score(y_train, y_pred)
    accuracy = accuracy_score(y_train, y_pred)

    results.append([name, recall, precision, accuracy])

    #Creating classification Report and confusion matrix
    print("Classification Report:\n", classification_report(y_train, y_pred, digits=4))
    cm = confusion_matrix(y_train, y_pred)

    #Plotting the confusion matrix
    plt.figure(figsize=(6, 4))
    sns.heatmap(cm, annot=True, fmt='d', cmap='Blues', xticklabels=['Negative', 'Positive'], yticklabels=['Negative', 'Positive'])
    plt.title(f'Confusion Matrix - {name}')
    plt.ylabel('Actual')
    plt.xlabel('Predicted')
    plt.show()
    print('\n\n')

#Creating a dataframe with the results
df_results = pd.DataFrame(results, columns=['Model', 'Recall', 'Precision', 'Accuracy'])



## Choosing the best Models

In churn analysis, our main goal is to **correctly identify customers who will cancel**, as this allows the company to take preventive actions to retain them. For this reason, **Recall will be our main metric**, as it measures the proportion of customers who actually canceled and who were correctly classified by the model.

However, **we will also evaluate Precision** in the background, as it helps us ensure that, when predicting churn, **we are minimizing false positives, avoiding unnecessary alarms**.

In [None]:
#Displaying the table with the main metrics
print("\n📊 Model Performance Summary:")
df_results

The goal is to **choose 2 models**. Random Forest, Logistic regression and Decision tree had the best recall rate, but the Logistic regression model had a lower precision rate. Therefore, **Random Forest and Decision tree will be chosen for the next stage**.

##Hyperparameters Tuning

Hyperparameter tuning is **select the best combination of hyperparameters** (e.g., number of estimators, learning rate, or tree depth) that allow the model to generalize well to unseen data, **maximizing its predictive accuracy while minimizing overfitting**.


To this **RandomizedSearchCV will be used**. It randomly samples from the specified hyperparameter space for a given number of iterations and evaluates each combination using cross-validation. RandomizedSearch helps to identify the best performing hyperparameters for the model, balancing accuracy and computational efficiency.

In [None]:
from sklearn.model_selection import RandomizedSearchCV

# Model dictionary (Random Forest and Decision Tree)
models = {
    'Random Forest': RandomForestClassifier(random_state=42),
    'Decision Tree': DecisionTreeClassifier(random_state=42)
}

# Hyperparameter grids for RandomizedSearchCV
param_grid_rf = {
    'classifier__n_estimators': [100, 200, 300],
    'classifier__max_depth': [10, 20, 30],
    'classifier__min_samples_split': [2, 5, 10],
    'classifier__min_samples_leaf': [1, 2, 4]
}

param_grid_dt = {
    'classifier__max_depth': [5, 10, 20],
    'classifier__min_samples_split': [2, 5, 10],
    'classifier__min_samples_leaf': [1, 2, 4]
}

# Define cross-validation strategy
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)

# Initialize an empty list to store metrics
metrics = []

# Loop through models and perform training/evaluation with and without hyperparameter tuning
for name, model in models.items():
    # Create pipeline without hyperparameter tuning
    pipeline = ImbPipeline([
        ('preprocessing', preprocessor),
        ('smote', SMOTE(sampling_strategy='auto', random_state=42)),  # Oversampling within CV
        ('classifier', model)
    ])

    # Evaluate without hyperparameter tuning
    y_pred = cross_val_predict(pipeline, X_train, y_train, cv=cv)
    recall = recall_score(y_train, y_pred)
    precision = precision_score(y_train, y_pred)
    accuracy = accuracy_score(y_train, y_pred)

    metrics.append([name + ' (No Hyperparameter Tuning)', recall, precision, accuracy])

    # Hyperparameter tuning with RandomizedSearchCV for Random Forest and Decision Tree
    if name == 'Random Forest':
        random_search_rf = RandomizedSearchCV(pipeline, param_distributions=param_grid_rf, n_iter=10, cv=cv, random_state=42, n_jobs=-1)
        random_search_rf.fit(X_train, y_train)
        best_model = random_search_rf.best_estimator_
    elif name == 'Decision Tree':
        random_search_dt = RandomizedSearchCV(pipeline, param_distributions=param_grid_dt, n_iter=10, cv=cv, random_state=42, n_jobs=-1)
        random_search_dt.fit(X_train, y_train)
        best_model = random_search_dt.best_estimator_

    # Evaluate with hyperparameter tuning
    y_pred_tuned = cross_val_predict(best_model, X_train, y_train, cv=cv)
    recall_tuned = recall_score(y_train, y_pred_tuned)
    precision_tuned = precision_score(y_train, y_pred_tuned)
    accuracy_tuned = accuracy_score(y_train, y_pred_tuned)

    metrics.append([name + ' (With Hyperparameter Tuning)', recall_tuned, precision_tuned, accuracy_tuned])

# Create a DataFrame with the results
metrics_df = pd.DataFrame(metrics, columns=['Model', 'Recall', 'Precision', 'Accuracy'])

# Display the results as a table
print("\n📊 Performance summary of models with hyperparameterization:")
metrics_df


##Ensemble

In this section, **the two models will be combined to improve the performance** of the final model.

The **voting classifier** is an effective way to combine models, where each model contributes its prediction and the final decision is based on the majority or average of the predictions.

In [None]:
from sklearn.ensemble import VotingClassifier

#Create a list of the tuned models
models_tuned = {
    'Random Forest (With Hyperparameter Tuning)': random_search_rf.best_estimator_,
    'Decision Tree (With Hyperparameter Tuning)': random_search_dt.best_estimator_
}

#Create an ensemble model using VotingClassifier (soft voting)
ensemble_model = VotingClassifier(estimators=[
    ('random_forest', models_tuned['Random Forest (With Hyperparameter Tuning)']),
    ('decision_tree', models_tuned['Decision Tree (With Hyperparameter Tuning)'])
], voting='soft')

#Evaluating the ensemble model
ensemble_model.fit(X_train, y_train)
y_pred_ensemble = cross_val_predict(ensemble_model, X_train, y_train, cv=cv)
recall_ensemble = recall_score(y_train, y_pred_ensemble)
precision_ensemble = precision_score(y_train, y_pred_ensemble)
accuracy_ensemble = accuracy_score(y_train, y_pred_ensemble)

# Add the ensemble results to the metrics list
metrics.append(['Ensemble (Random Forest + Decision Tree)', recall_ensemble, precision_ensemble, accuracy_ensemble])

# Convert the metrics list to a DataFrame
metrics_df = pd.DataFrame(metrics, columns=['Model', 'Recall', 'Precision', 'Accuracy'])

# Display the results as a table
print("\n📊 Performance summary of models with hyperparameterization and with Ensemble:")
metrics_df



The **recall rate of the ensemble model was very close** to that of the Decision tree model with hyperparameters, which had the highest recall rate. In addition, the **ensemble model had higher precision and accuracy rates** than the Decision tree model with hyperparameters.

#Evaluation

Now we will perform the **final evaluation of the ensemble model** using the **test data,** which were separated before modeling.

In [None]:
#Evaluating the ensemble model on the test data
y_pred_test = ensemble_model.predict(X_test)

#Calculating metrics
recall_test = recall_score(y_test, y_pred_test)
precision_test = precision_score(y_test, y_pred_test)
accuracy_test = accuracy_score(y_test, y_pred_test)

#Generating a confusion matrix
cm = confusion_matrix(y_test, y_pred_test)

plt.figure(figsize=(6, 4))
sns.heatmap(cm, annot=True, fmt='d', cmap='Blues', xticklabels=['Negative', 'Positive'], yticklabels=['Negative', 'Positive'])
plt.title('Confusion Matrix - Ensemble Model')
plt.ylabel('Actual')
plt.xlabel('Predicted')
plt.show()

# Adding the new row with the ensemble model metrics with the teste data
metrics.append(['Ensemble (TEST DATA)', recall_test, precision_test, accuracy_test])
# Convert the metrics list to a DataFrame
metrics_df = pd.DataFrame(metrics, columns=['Model', 'Recall', 'Precision', 'Accuracy'])

# Display the results as a table
print("\n📊 Performance summary of models with hyperparameterization and with Ensemble:")
metrics_df = metrics_df.drop([0, 2]).reset_index(drop=True)
metrics_df


Model evaluation shows **promising results**. After fine-tuning the hyperparameters of the Random Forest and Decision Tree models, their performance was further improved when combined into an ensemble. The final ensemble model demonstrated **strong recall (0.778)**, precision (0.717), and accuracy (0.922) on the test data, indicating its robustness in predicting churn. This performance suggests that the ensemble approach, leveraging the strengths of both models, **is effective for the task at hand and can be trusted for future predictions**.

With these results, **the model is now ready for deployment.** By integrating it into an application, users will be able to input customer data and receive a churn probability score in real-time. This will enable businesses to proactively identify at-risk customers and implement targeted retention strategies. **The model's strong performance ensures that it can be a valuable tool for decision-making,** helping to minimize churn and improve customer satisfaction.