# Financial Data Modeling: Credit Balance Prediction
By Benjamin Cabrera & Alexander Ohye

____________________________________________________________________________________________________________________________________________________________________

# Introduction

This project explores how demographic and financial data can be used to **predict credit card balances** using machine learning regression techniques.  

The dataset contains variables such as income, credit limit, credit rating, education, and marital/student status.  

The goal is to apply linear, ridge, and lasso regression to model the relationship between the predictors and credit card balance, evaluate each model's performance, and identify the most influential factors.  
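
For reference, the three approaches fit the same linear model and differ only in the penalty added to the least-squares loss. In the form documented by scikit-learn (the symbols $X$, $y$, $w$, $n$, and $\alpha$ are notation assumed here, not taken from the notebook), the objectives are:

$$
\begin{aligned}
\text{Linear:}\quad & \min_{w}\ \lVert y - Xw \rVert_2^2 \\
\text{Ridge:}\quad & \min_{w}\ \lVert y - Xw \rVert_2^2 + \alpha \lVert w \rVert_2^2 \\
\text{Lasso:}\quad & \min_{w}\ \frac{1}{2n}\lVert y - Xw \rVert_2^2 + \alpha \lVert w \rVert_1
\end{aligned}
$$

Because the lasso loss is scaled by $1/(2n)$ while the ridge loss is not, the same numeric alpha does not imply the same amount of shrinkage in the two models. The intercept is not penalized in either case.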

The dataset is a data frame with 400 observations on the following variables:

- ID: Identification

- Income: Income in $1,000

- Limit: Credit limit

- Rating: Credit rating

- Cards: Number of credit cards

- Age: Age in years

- Education: Number of years of education

- Gender: A factor with levels Male and Female

- Student: A factor with levels No and Yes indicating whether the individual was a student

- Married: A factor with levels No and Yes indicating whether the individual was married

- Balance: Average credit card balance in dollars

**The outcome variable to predict is *Balance*.**


In [None]:
# Import libraries
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split, cross_val_score, KFold, GridSearchCV
from sklearn.metrics import mean_squared_error, root_mean_squared_error, r2_score
from sklearn.linear_model import Ridge, Lasso
from sklearn.preprocessing import StandardScaler, PolynomialFeatures
from sklearn.pipeline import Pipeline

In [None]:
# Connect to Google Drive
from google.colab import drive
drive.mount('/content/drive/')

Mounted at /content/drive/


In [None]:
# Import data set
# Change the file path as needed!!!
credit_df = pd.read_csv('/content/drive/MyDrive/Colab Notebooks/Credit_ISLR.csv')

In [None]:
# Verify all of the data set has been imported
credit_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 400 entries, 0 to 399
Data columns (total 12 columns):
 #   Column      Non-Null Count  Dtype  
---  ------      --------------  -----  
 0   Unnamed: 0  400 non-null    int64  
 1   ID          400 non-null    int64  
 2   Income      400 non-null    float64
 3   Limit       400 non-null    int64  
 4   Rating      400 non-null    int64  
 5   Cards       400 non-null    int64  
 6   Age         400 non-null    int64  
 7   Education   400 non-null    int64  
 8   Gender      400 non-null    object 
 9   Student     400 non-null    object 
 10  Married     400 non-null    object 
 11  Balance     400 non-null    int64  
dtypes: float64(1), int64(8), object(3)
memory usage: 37.6+ KB


**Run the following lines of code before you start answering the questions**

In [None]:
# Remove unnecessary columns
credit_df.drop(['Unnamed: 0', 'ID'], axis=1, inplace=True)

In [None]:
# Convert categorical columns into dummies
credit_df_dummies = pd.get_dummies(credit_df, columns=['Gender', 'Student', 'Married'], drop_first=True)

In [None]:
# Verify changes have been made
credit_df_dummies.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 400 entries, 0 to 399
Data columns (total 10 columns):
 #   Column         Non-Null Count  Dtype  
---  ------         --------------  -----  
 0   Income         400 non-null    float64
 1   Limit          400 non-null    int64  
 2   Rating         400 non-null    int64  
 3   Cards          400 non-null    int64  
 4   Age            400 non-null    int64  
 5   Education      400 non-null    int64  
 6   Balance        400 non-null    int64  
 7   Gender_Female  400 non-null    bool   
 8   Student_Yes    400 non-null    bool   
 9   Married_Yes    400 non-null    bool   
dtypes: bool(3), float64(1), int64(6)
memory usage: 23.2 KB
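
To illustrate what `drop_first=True` does in the dummy-encoding step, here is a minimal sketch on a made-up frame. The values and the name `toy_df` are hypothetical and are not rows from `Credit_ISLR.csv`. For each categorical column, the first level (alphabetically) is dropped and becomes the baseline, so a two-level factor such as `Student` is reduced to a single `Student_Yes` column.

```python
import pandas as pd

# Toy frame with made-up values (not taken from Credit_ISLR.csv)
toy_df = pd.DataFrame({
    "Income": [14.9, 106.0, 104.6],
    "Student": ["No", "Yes", "No"],
    "Married": ["Yes", "Yes", "No"],
})

# drop_first=True removes the first level of each factor ("No"),
# leaving one indicator column per two-level factor
toy_dummies = pd.get_dummies(toy_df, columns=["Student", "Married"], drop_first=True)

print(toy_dummies.columns.tolist())
```

Dropping the baseline level avoids a redundant column: `Student_No` would always equal `1 - Student_Yes`.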


____________________________________________________________________________________________________________________________________________________________________

# Applying the ridge method using the ML approach

**a)** Use the optimal alpha value selected by the ridge grid search to obtain a multiple linear regression equation fitted to the training data.

In [None]:
# Separate the data into predictors and outcome variable

X_all_credit_df_dummies = credit_df_dummies.drop('Balance', axis=1)

y_credit_df_dummies = credit_df_dummies['Balance']

In [None]:
# Split to train and test data
X_train_credit, X_test_credit, y_train_credit, y_test_credit = train_test_split(X_all_credit_df_dummies, y_credit_df_dummies, test_size=0.2, random_state=1)

In [None]:
# Scale the X training data
scaler = StandardScaler()

X_train_credit_scaled = scaler.fit_transform(X_train_credit)

In [None]:
# set up the alphas
alpha_penalty = [0.001, 0.01, 0.05, 0.1, 1, 10, 50, 100]

In [None]:
# Set up 10-fold cross-validation with shuffling
cv_set_up = KFold(n_splits=10, shuffle=True, random_state=1)

In [None]:
# Set up the grid search
ridge_grid_search = GridSearchCV(estimator=Ridge(), param_grid={'alpha': alpha_penalty}, scoring='neg_mean_squared_error', cv=cv_set_up)

In [None]:
# Apply the grid search to the training data
ridge_grid_search.fit(X_train_credit_scaled, y_train_credit)

In [None]:
# Retrieve the alpha with the lowest cross-validated MSE
# An alpha of 1 suggests a moderate level of regularization
best_alpha_ridge = ridge_grid_search.best_params_['alpha']

print("Best alpha:", best_alpha_ridge)

Best alpha: 1


In [None]:
# Get the CV MSE for the best alpha
lowest_mean_cv_mse = - ridge_grid_search.best_score_

print("CV MSE with the best alpha:", lowest_mean_cv_mse)

CV MSE with the best alpha: 9494.626511125423
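
In the cells above, the scaler is fitted on the whole training set before cross-validation, so every validation fold contributes to the scaling statistics. One possible alternative, sketched below on synthetic data, places the scaler and `Ridge` in a `Pipeline` so that scaling is refitted inside each fold. The names `X_demo`, `y_demo`, `ridge_pipeline`, and `pipeline_grid_search` are illustrative and do not appear in the notebook; the alpha grid and `KFold` settings mirror the ones used above.

```python
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.model_selection import GridSearchCV, KFold
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the training data (assumption: not the credit data)
rng = np.random.default_rng(1)
X_demo = rng.normal(size=(200, 5))
y_demo = X_demo @ np.array([3.0, -2.0, 0.0, 1.0, 0.5]) + rng.normal(scale=0.5, size=200)

# Scaling happens inside the pipeline, so it is refitted on each CV training fold
ridge_pipeline = Pipeline([("scaler", StandardScaler()), ("ridge", Ridge())])

alpha_penalty = [0.001, 0.01, 0.05, 0.1, 1, 10, 50, 100]
cv_set_up = KFold(n_splits=10, shuffle=True, random_state=1)

# Pipeline parameters are addressed as "<step name>__<parameter>"
pipeline_grid_search = GridSearchCV(
    estimator=ridge_pipeline,
    param_grid={"ridge__alpha": alpha_penalty},
    scoring="neg_mean_squared_error",
    cv=cv_set_up,
)
pipeline_grid_search.fit(X_demo, y_demo)

print("Best alpha:", pipeline_grid_search.best_params_["ridge__alpha"])
print("CV MSE with the best alpha:", -pipeline_grid_search.best_score_)
```

With this layout the raw, unscaled predictors are passed to `fit` and `predict`, and the pipeline applies the training-fold scaling automatically.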


In [None]:
# Obtain regression coefficients for the best solution
ridge_grid_search.best_estimator_.coef_

array([-2.58278797e+02,  3.10886468e+02,  2.78171979e+02,  1.86440459e+01,
       -1.37334741e+01, -1.49221585e+00,  1.35054826e-01,  1.20482937e+02,
        6.99074633e-01])

In [None]:
# Get the intercept for the best solution
ridge_grid_search.best_estimator_.intercept_

np.float64(508.39375000000007)

In [None]:
# Tabulate the intercept and coefficients of the best model alongside their predictor names
coef_values = np.concatenate(([ridge_grid_search.best_estimator_.intercept_], ridge_grid_search.best_estimator_.coef_))

column_names = np.concatenate((['Intercept'], X_all_credit_df_dummies.columns))

coefficients_ridge_df = pd.DataFrame({'Coefficient Name': column_names, 'Coefficient Value': coef_values, 'Rounded coeff values': np.round(coef_values,2)})

print(coefficients_ridge_df)

  Coefficient Name  Coefficient Value  Rounded coeff values
0        Intercept         508.393750                508.39
1           Income        -258.278797               -258.28
2            Limit         310.886468                310.89
3           Rating         278.171979                278.17
4            Cards          18.644046                 18.64
5              Age         -13.733474                -13.73
6        Education          -1.492216                 -1.49
7    Gender_Female           0.135055                  0.14
8      Student_Yes         120.482937                120.48
9      Married_Yes           0.699075                  0.70


____________________________________________________________________________________________________________________________________________________________________

## Equation

**Predicted Balance** = 508.39 - 258.28*(standardized income) + 310.89*(standardized limit) + 278.17*(standardized rating) + 18.64*(standardized cards) - 13.73*(standardized age) - 1.49*(standardized education) + 0.14*(standardized gender) + 120.48*(standardized student) + 0.70*(standardized married)
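
Each term in the equation uses a standardized predictor, that is, the raw value minus the training mean, divided by the training standard deviation. The sketch below shows this arithmetic on synthetic data and checks it against the model's own prediction. All names (`X_demo`, `demo_scaler`, `demo_ridge`, `new_row`) and values are illustrative assumptions, not part of the notebook.

```python
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.preprocessing import StandardScaler

# Synthetic data standing in for three numeric predictors (assumption)
rng = np.random.default_rng(0)
X_demo = rng.normal(loc=50, scale=10, size=(100, 3))
y_demo = 2.0 * X_demo[:, 0] - X_demo[:, 1] + rng.normal(size=100)

# Fit the scaler and the ridge model the same way as in the notebook
demo_scaler = StandardScaler()
X_demo_scaled = demo_scaler.fit_transform(X_demo)
demo_ridge = Ridge(alpha=1).fit(X_demo_scaled, y_demo)

# Apply the equation by hand to one new observation:
# standardize with the training mean and standard deviation, then
# add the intercept to the coefficient-weighted sum
new_row = np.array([55.0, 40.0, 60.0])
z = (new_row - demo_scaler.mean_) / demo_scaler.scale_
manual_prediction = demo_ridge.intercept_ + z @ demo_ridge.coef_

# The same prediction obtained through the fitted objects
model_prediction = demo_ridge.predict(demo_scaler.transform(new_row.reshape(1, -1)))[0]

print(manual_prediction, model_prediction)
```

Because the standardized predictors have mean zero on the training data, the fitted intercept equals the mean of the training outcome, which is why the intercept in the equation can be read as the average training balance.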


**b)** Use the equation from **a)** to obtain the **test RMSE**. Justify whether the test RMSE is good (low) enough.

In [None]:
# Scale the test data with the scaler fitted on the training data to prevent data leakage
X_test_credit_scaled = scaler.transform(X_test_credit)

In [None]:
# Predict balances for the scaled test data
y_pred_ridge_test = ridge_grid_search.best_estimator_.predict(X_test_credit_scaled)

In [None]:
# Test MSE
mean_squared_error(y_test_credit, y_pred_ridge_test)

12813.118512252802

In [None]:
# Get RMSE
root_mean_squared_error(y_test_credit, y_pred_ridge_test)

113.19504632382463

In [None]:
# Express the test RMSE as a percentage of the mean balance to assess how good it is
print(np.round(root_mean_squared_error(y_test_credit, y_pred_ridge_test) / np.mean(y_credit_df_dummies) * 100, 2))

21.77


## Conclusion
The test RMSE is 113.20, which is 21.77% of the mean balance, so the typical prediction error is about 22% of the average balance. This is above our goal of 20%, so the test RMSE is not low enough and the model does not meet our performance goal.
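
The percentage above is the test RMSE divided by the mean balance. The sketch below reproduces that arithmetic on a small made-up example (the arrays `y_true` and `y_pred` are hypothetical) and also checks that the MSE and RMSE printed in the cells above are consistent with each other.

```python
import numpy as np

# Made-up observed and predicted balances (assumption: illustrative values only)
y_true = np.array([300.0, 500.0, 700.0, 900.0])
y_pred = np.array([350.0, 450.0, 760.0, 880.0])

# RMSE is the square root of the mean squared error
rmse = np.sqrt(np.mean((y_true - y_pred) ** 2))

# Dividing by the mean outcome puts the error on a relative (percentage) scale
rmse_pct_of_mean = rmse / y_true.mean() * 100

print(round(rmse, 2), round(rmse_pct_of_mean, 2))

# Consistency check on the values printed in the notebook: sqrt(MSE) should equal RMSE
reported_test_mse = 12813.118512252802
reported_test_rmse = 113.19504632382463
print(np.isclose(np.sqrt(reported_test_mse), reported_test_rmse))
```

A relative measure like this makes the error easier to judge than the raw RMSE, because it does not depend on the units of the outcome.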

_____________________________________________________________

**a)** Apply lasso regression with an alpha of 10 to fit a multiple linear regression model on the training data. Using this model, compute the **R squared statistic** on the **test data** and **report its value**.

In [None]:
# The data was already split into training and test sets and scaled in the ridge section above
# Create and fit the model with an alpha of 10
lasso_model_ten = Lasso(alpha=10)
lasso_model_ten.fit(X_train_credit_scaled, y_train_credit)

In [None]:
# Make prediction on the test set
y_pred_lasso_ten = lasso_model_ten.predict(X_test_credit_scaled)

In [None]:
# Calculate the R squared statistic with an alpha of 10
r2_lasso_ten = r2_score(y_test_credit, y_pred_lasso_ten)
print("R squared statistic with an alpha of 10:", r2_lasso_ten)

R squared statistic with an alpha of 10: 0.9390820853765356


____________________________________________________________________________________________________________________________________________________________________

**b)** Determine whether the lasso regression solution from **a)** excludes any predictors from the model and, if so, which ones.



In [None]:
# List the coefficient values to see whether any are set to 0
coef_values = np.concatenate(([lasso_model_ten.intercept_], lasso_model_ten.coef_))

column_names = np.concatenate((['Intercept'], X_all_credit_df_dummies.columns))

coefficients_lasso_df = pd.DataFrame({'Coefficient Name': column_names, 'Coefficient Value': coef_values})

print(coefficients_lasso_df)

  Coefficient Name  Coefficient Value
0        Intercept         508.393750
1           Income        -217.097905
2            Limit         221.081705
3           Rating         326.830267
4            Cards           6.877540
5              Age          -6.087546
6        Education          -0.000000
7    Gender_Female           0.000000
8      Student_Yes         108.210548
9      Married_Yes          -0.000000


## Justification
Yes. The lasso solution sets the coefficients of 3 predictors to exactly zero, which removes them from the model: Education, Gender_Female, and Married_Yes. At an alpha of 10, lasso kept the predictors that matter most and dropped the ones with little predictive power.
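
The excluded predictors can also be listed programmatically instead of being read off the table. The sketch below does this on synthetic data with two informative and two pure-noise features; the feature names, the data, and the variables `demo_lasso`, `excluded`, and `kept` are assumptions made for illustration.

```python
import numpy as np
from sklearn.linear_model import Lasso
from sklearn.preprocessing import StandardScaler

# Synthetic data: only the first two features influence the outcome (assumption)
rng = np.random.default_rng(1)
feature_names = ["signal_a", "signal_b", "noise_a", "noise_b"]
X_demo = rng.normal(size=(200, 4))
y_demo = 5.0 * X_demo[:, 0] - 3.0 * X_demo[:, 1] + rng.normal(scale=0.1, size=200)

# Standardize, then fit lasso as in the notebook
X_demo_scaled = StandardScaler().fit_transform(X_demo)
demo_lasso = Lasso(alpha=1.0).fit(X_demo_scaled, y_demo)

# Lasso sets excluded coefficients to exactly zero, so an equality test is enough
excluded = [name for name, coef in zip(feature_names, demo_lasso.coef_) if coef == 0]
kept = [name for name, coef in zip(feature_names, demo_lasso.coef_) if coef != 0]

print("Excluded:", excluded)
print("Kept:", kept)
```

Applied to the notebook's model, the same list comprehension over `X_all_credit_df_dummies.columns` and `lasso_model_ten.coef_` would return the predictors shown with a zero coefficient in the table above.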

____________________________________________________________________________________________________________________________________________________________________

**c)** Apply lasso regression with an alpha of 1 to fit a multiple linear regression model on the training data. Using this model, compute the **R squared statistic** on the **test data**. Use this value to compare the equations from (a) and (c), and justify which one is better.

In [None]:
# Create and fit the model with an alpha of 1
lasso_model_one = Lasso(alpha=1)
lasso_model_one.fit(X_train_credit_scaled, y_train_credit)

In [None]:
y_pred_lasso_one = lasso_model_one.predict(X_test_credit_scaled)

In [None]:
r2_lasso_one = r2_score(y_test_credit, y_pred_lasso_one)
print("R squared statistic with an alpha of 1:", r2_lasso_one)

R squared statistic with an alpha of 1: 0.9497866143308823


## Justification

The test R squared for an alpha of 10 is 0.939, and for an alpha of 1 it is 0.950.

The lasso model with an alpha of 1 is the better of the two: its test R squared is higher and therefore closer to the maximum value of 1.
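
For context on what is being compared, R squared is one minus the ratio of the residual sum of squares to the total sum of squares, so a higher value means the predictions leave less of the variation in the outcome unexplained. The sketch below computes it by hand on made-up values (the arrays are hypothetical) and checks the result against `r2_score`.

```python
import numpy as np
from sklearn.metrics import r2_score

# Made-up observed and predicted balances (assumption: illustrative values only)
y_true = np.array([300.0, 500.0, 700.0, 900.0])
y_pred = np.array([350.0, 450.0, 760.0, 880.0])

# Residual sum of squares: variation the model fails to explain
ss_res = np.sum((y_true - y_pred) ** 2)

# Total sum of squares: variation around the mean of the observed values
ss_tot = np.sum((y_true - y_true.mean()) ** 2)

r2_manual = 1 - ss_res / ss_tot

print(r2_manual, r2_score(y_true, y_pred))
```

Both lasso models are scored on the same test set, so the total sum of squares is identical and the difference in R squared reflects only the difference in residual error.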

____________________________________________________________________________________________________________________________________________________________________

# Conclusion
This project applied ridge and lasso regression to predict credit card balance. Ridge regression with the cross-validated alpha of 1 produced stable coefficients, but its test RMSE (113.20, about 22% of the mean balance) remained above the performance goal. Of the two lasso models, α=1 achieved the higher test R² (≈0.95, versus ≈0.94 for α=10). With α=10, lasso set the coefficients of the weaker predictors (education, gender, marital status) to zero, indicating that balance is driven mainly by credit limit, credit rating, income, and student status. Together, these results show how regularized regression can support both prediction and the identification of the variables that matter most in financial datasets.