# Predicting Customer Creditworthiness with Logistic Regression

## Project Overview

This project explores a classification model for predicting customer creditworthiness using a real-world-style dataset. The target variable indicates whether a customer's credit is deemed "good" or "bad," and I build a machine learning pipeline to automate feature preprocessing and classification using logistic regression.

Through this baseline model, I aim to:

- Prepare and clean the dataset using standard preprocessing tools.
- Create a modeling pipeline with imputation, encoding, and scaling steps.
- Evaluate the model’s performance using classification metrics such as accuracy, precision, recall, and F1-score.
- Lay the groundwork for further optimization and deep learning enhancements in follow-up analyses.


In [1]:
import pandas as pd
import numpy as np
from sklearn import set_config

set_config(transform_output="pandas")

### Feature Selection

To simplify the modeling process and focus on the most informative features, I retained a subset of variables based on their financial relevance, behavioral significance, and potential predictive power.

These selected features include:

- **Financial indicators**: such as `INCOME`, `SAVINGS`, and `DEBT`
- **Transaction-based features**: spending totals and ratios (the `T_*` and `R_*` columns) for categories like `CLOTHING`, `EDUCATION`, `HEALTH`, `GROCERIES`, and `HOUSING`
- **Derived financial ratios**: each category's spending relative to income, savings, and debt, such as `R_CLOTHING_INCOME`, `R_CLOTHING_SAVINGS`, and `R_CLOTHING_DEBT`
- **Categorical variables**: such as `CAT_GAMBLING`, `CAT_DEBT`, `CAT_MORTGAGE`, and others indicating customer characteristics

I also retained both `DEFAULT` (the binary classification target) and `CREDIT_SCORE` (a continuous variable used later for regression analysis).

In [None]:
# Load the local dataset
df = pd.read_csv("data/credit_score.csv")

# Define selected features for modeling
selected_columns = [
    'INCOME', 'SAVINGS', 'DEBT',
    'T_CLOTHING_12', 'T_CLOTHING_6', 'R_CLOTHING', 'R_CLOTHING_INCOME', 'R_CLOTHING_SAVINGS', 'R_CLOTHING_DEBT',
    'T_EDUCATION_12', 'T_EDUCATION_6', 'R_EDUCATION', 'R_EDUCATION_INCOME', 'R_EDUCATION_SAVINGS', 'R_EDUCATION_DEBT',
    'T_GROCERIES_12', 'T_GROCERIES_6', 'R_GROCERIES', 'R_GROCERIES_INCOME', 'R_GROCERIES_SAVINGS', 'R_GROCERIES_DEBT',
    'T_HEALTH_12', 'T_HEALTH_6', 'R_HEALTH', 'R_HEALTH_INCOME', 'R_HEALTH_SAVINGS', 'R_HEALTH_DEBT',
    'T_HOUSING_12', 'T_HOUSING_6', 'R_HOUSING', 'R_HOUSING_INCOME', 'R_HOUSING_SAVINGS', 'R_HOUSING_DEBT',
    'CAT_GAMBLING', 'CAT_DEBT', 'CAT_CREDIT_CARD', 'CAT_MORTGAGE', 'CAT_SAVINGS_ACCOUNT', 'CAT_DEPENDENTS',
    'CREDIT_SCORE', 'DEFAULT'
]

# Subset the DataFrame
df = df[selected_columns]

# Preview the selected columns
df.head()


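| | INCOME | SAVINGS | DEBT | T_CLOTHING_12 | T_CLOTHING_6 | R_CLOTHING | R_CLOTHING_INCOME | R_CLOTHING_SAVINGS | R_CLOTHING_DEBT | T_EDUCATION_12 | ... | R_HOUSING_SAVINGS | R_HOUSING_DEBT | CAT_GAMBLING | CAT_DEBT | CAT_CREDIT_CARD | CAT_MORTGAGE | CAT_SAVINGS_ACCOUNT | CAT_DEPENDENTS | CREDIT_SCORE | DEFAULT |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 0 | 33269 | 0.0 | 532304 | 1889 | 945 | 0.5003 | 0.0568 | 0.0 | 0.0035 | 0 | ... | 0.0 | 0.0056 | High | 1 | 0 | 0 | 0.0 | 0 | 444 | 1 |
| 1 | 77158 | 91187.0 | 315648 | 5818 | 111 | 0.0191 | 0.0754 | 0.0638 | 0.0184 | 0 | ... | 0.1792 | 0.0518 | No | 1 | 0 | 0 | 1.0 | 0 | 625 | 0 |
| 2 | 30917 | 21642.0 | 534864 | 1157 | 860 | 0.7433 | 0.0374 | 0.0535 | 0.0022 | 0 | ... | 0.1275 | 0.0052 | High | 1 | 0 | 0 | 1.0 | 0 | 469 | 1 |
| 3 | 80657 | 64526.0 | 629125 | 6857 | 3686 | 0.5376 | 0.085 | 0.1063 | 0.0109 | 4402 | ... | 0.1956 | 0.0201 | High | 1 | 0 | 0 | 1.0 | 0 | 559 | 0 |
| 4 | 149971 | 1172498.0 | 2399531 | 1978 | 322 | 0.1628 | 0.0132 | 0.0017 | 0.0008 | 0 | ... | 0.0364 | 0.0178 | High | 1 | 1 | 1 | 1.0 | 1 | 473 | 0 |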
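
The preview also hints at how the ratio columns relate to the raw ones. The sketch below recomputes the clothing ratios for the preview row with index 1, assuming that `R_CLOTHING` is the 6-month total divided by the 12-month total and that the `R_CLOTHING_*` columns divide the 12-month total by income, savings, and debt. These definitions are inferred from the printed values, not taken from any dataset documentation.

```python
# Values copied from preview row 1; the ratio definitions are assumptions.
row = {
    "INCOME": 77158,
    "SAVINGS": 91187.0,
    "DEBT": 315648,
    "T_CLOTHING_12": 5818,
    "T_CLOTHING_6": 111,
}

derived = {
    "R_CLOTHING": row["T_CLOTHING_6"] / row["T_CLOTHING_12"],
    "R_CLOTHING_INCOME": row["T_CLOTHING_12"] / row["INCOME"],
    "R_CLOTHING_SAVINGS": row["T_CLOTHING_12"] / row["SAVINGS"],
    "R_CLOTHING_DEBT": row["T_CLOTHING_12"] / row["DEBT"],
}

for name, value in derived.items():
    print(f"{name}: {value:.4f}")
```

Rounded to four decimals, the results (0.0191, 0.0754, 0.0638, 0.0184) match the values shown for that row, which supports the assumed definitions at least for this example.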


### Define Target Variable and Split the Data

I defined `DEFAULT` as the target variable for this binary classification task, and used the remaining columns as features. To prepare the data for modeling, I performed a train-test split, allocating 80% of the data for training and 20% for testing, while stratifying the target to preserve class distribution.

Although the dataset includes a `CREDIT_SCORE` column, I excluded it from the feature set (`X`) for this classification task because:

- It is a **proxy for creditworthiness**, so including it would introduce **target leakage**: the model would look artificially strong because one feature carries information too closely tied to the outcome.
- I use `CREDIT_SCORE` later in the notebook as the **target variable** for a separate regression task (see "Regression Task: Predicting Customer Credit Score").

This ensures a clean, independent feature matrix (`X`) and avoids contaminating the classification model.

In [3]:
from sklearn.model_selection import train_test_split

# Define features and target
X = df.drop(columns=["DEFAULT", "CREDIT_SCORE"])
y = df["DEFAULT"]

# Split into train and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42, stratify=y)

print(f"Training set: {X_train.shape}")
print(f"Test set: {X_test.shape}")

Training set: (800, 39)
Test set: (200, 39)
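
To make the effect of `stratify=y` concrete, here is a minimal pure-Python sketch of a stratified split. It is not scikit-learn's implementation: `stratified_split` is a hypothetical helper, and the label vector (285 defaults among 1,000 customers) is made up for illustration rather than read from the dataset.

```python
import random
from collections import defaultdict


def stratified_split(labels, test_size=0.2, seed=42):
    """Return (train_idx, test_idx), splitting each class in the same proportion."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for idx, label in enumerate(labels):
        by_class[label].append(idx)

    train_idx, test_idx = [], []
    for indices in by_class.values():
        rng.shuffle(indices)
        n_test = round(len(indices) * test_size)
        test_idx.extend(indices[:n_test])
        train_idx.extend(indices[n_test:])
    return sorted(train_idx), sorted(test_idx)


# Hypothetical labels: 1 = default, 0 = no default.
labels = [1] * 285 + [0] * 715
train_idx, test_idx = stratified_split(labels)


def positive_rate(indices):
    """Share of defaults among the given row indices."""
    return sum(labels[i] for i in indices) / len(indices)


print(len(train_idx), len(test_idx))
print(f"{positive_rate(train_idx):.3f} {positive_rate(test_idx):.3f}")
```

Both subsets end up with the same default rate as the full label vector, which is the property the stratified split is meant to guarantee.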


### Encoding the `CAT_GAMBLING` Feature

As part of feature preprocessing, I encoded the `CAT_GAMBLING` variable using one-hot encoding. This categorical feature captures the customer's gambling behavior and may provide useful signals for credit risk modeling.

To avoid multicollinearity issues, I dropped the first category using the `drop="first"` parameter. The rest of the dataset was left unchanged by setting `remainder="passthrough"` in the transformer. This approach ensures that all other features remain intact and available for modeling.

In [4]:

from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder

# Define the transformer to one-hot encode 'CAT_GAMBLING'.
# The parameter drop='first' avoids multicollinearity by removing the first category.
transformer = ColumnTransformer(
    [('ohe', OneHotEncoder(sparse_output=False, drop='first'), ['CAT_GAMBLING'])],
    remainder='passthrough'
)
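
As a rough illustration of why `drop='first'` helps, the sketch below encodes a toy column by hand. `one_hot` is a hypothetical helper, and the `"Low"` level is an assumption (the preview only shows `High` and `No`). The category order mirrors the sorted order `OneHotEncoder` uses, so the dropped level is `High`.

```python
def one_hot(values, categories, drop_first=False):
    """One-hot encode `values`; optionally omit the first category's column."""
    kept = categories[1:] if drop_first else categories
    return [[1 if value == category else 0 for category in kept] for value in values]


categories = ["High", "Low", "No"]  # sorted order; "Low" is an assumed level
values = ["High", "No", "Low", "High"]

full = one_hot(values, categories)
reduced = one_hot(values, categories, drop_first=True)

# Each full row sums to 1, so the three columns are redundant next to an intercept.
print([sum(row) for row in full])  # [1, 1, 1, 1]

# With the first column dropped, "High" becomes the all-zero baseline.
print(reduced)  # [[0, 0], [0, 1], [1, 0], [0, 0]]
```

The remaining columns are then read relative to the dropped baseline level.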

### Initial Modeling Pipeline with Simplified Preprocessing

As a starting point, I created a basic machine learning pipeline to demonstrate the use of sequential preprocessing steps with `Pipeline`.

This pipeline includes:
- **SimpleImputer** with the `most_frequent` strategy to handle missing values
- A previously defined `ColumnTransformer` to one-hot encode `CAT_GAMBLING` while leaving other features unchanged
- **StandardScaler** to standardize the features (zero mean, unit variance)
- A **Logistic Regression** model as the final estimator

Although simplified, this structure lays the groundwork for a more robust pipeline in the next step.

In [5]:
from sklearn.pipeline import Pipeline
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression

# Define a simplified pipeline using previous transformer (only CAT_GAMBLING)
pipe = Pipeline(steps=[
    ("imputer", SimpleImputer(strategy="most_frequent")),
    ("transformer", transformer),
    ("scaler", StandardScaler()),
    ("model", LogisticRegression())
])

# Fit the pipeline to the training set
pipe.fit(X_train, y_train)

# Quick check on test performance
pipe.score(X_test, y_test)

0.725
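
Because the scaler sits inside the pipeline, its statistics are learned from the training rows only and then reused on the test rows. A minimal sketch of that standardization step, using made-up income values and the population standard deviation (the convention `StandardScaler` follows):

```python
from statistics import mean, pstdev

train_income = [30000, 50000, 70000]  # hypothetical training values
test_income = [90000]                 # hypothetical unseen value

# Statistics come from the training data only.
mu = mean(train_income)
sigma = pstdev(train_income)  # population standard deviation (ddof=0)

train_scaled = [(x - mu) / sigma for x in train_income]
test_scaled = [(x - mu) / sigma for x in test_income]

print([round(z, 3) for z in train_scaled])
print([round(z, 3) for z in test_scaled])
```

The training values are centered on zero, while the unseen value lands about 2.45 training standard deviations above the training mean.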

### Model Evaluation Metrics

After training the model, I evaluated its performance on the test set using common classification metrics:

- **Accuracy**: Overall correctness of predictions
- **Precision**: Proportion of predicted positives that were correct
- **Recall**: Proportion of actual positives that were correctly identified
- **F1 Score**: Harmonic mean of precision and recall

These metrics provide a more complete view of model performance, especially in the presence of class imbalance.

In [6]:
from sklearn.metrics import accuracy_score, recall_score, precision_score, f1_score

# Predict on test set
y_pred = pipe.predict(X_test)

# Calculate evaluation metrics
accuracy = accuracy_score(y_test, y_pred)
recall = recall_score(y_test, y_pred)
precision = precision_score(y_test, y_pred)
f1 = f1_score(y_test, y_pred)

# Display results
print(f"Accuracy:  {accuracy:.4f}")
print(f"Recall:    {recall:.4f}")
print(f"Precision: {precision:.4f}")
print(f"F1 Score:  {f1:.4f}")

Accuracy:  0.7250
Recall:    0.1930
Precision: 0.5500
F1 Score:  0.2857
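
The four metrics can be tied back to confusion-matrix counts. The counts below are inferred from the printed values under the assumption of 200 test rows; they were not read from the fitted model, so treat them as a consistency check rather than a reported result.

```python
# Counts inferred from the printed metrics, assuming 200 test rows.
tp, fp, fn, tn = 11, 9, 46, 134
total = tp + fp + fn + tn

accuracy = (tp + tn) / total
precision = tp / (tp + fp)
recall = tp / (tp + fn)
f1 = 2 * precision * recall / (precision + recall)

# Accuracy of a trivial model that always predicts "no default".
baseline_accuracy = (tn + fp) / total

print(f"Accuracy:  {accuracy:.4f}")            # 0.7250
print(f"Recall:    {recall:.4f}")              # 0.1930
print(f"Precision: {precision:.4f}")           # 0.5500
print(f"F1 Score:  {f1:.4f}")                  # 0.2857
print(f"Baseline:  {baseline_accuracy:.4f}")   # 0.7150
```

Under these inferred counts, the accuracy of 0.725 sits only slightly above the 0.715 that always predicting "no default" would achieve, which is why recall and F1 are the more informative numbers here.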


### Feature Importance from Logistic Regression Coefficients

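Logistic regression is a linear model: each input feature has a coefficient that shifts the log-odds of the positive class. Because `DEFAULT = 1` marks a default, a **positive coefficient** increases the predicted likelihood of default, while a **negative coefficient** reduces it. The features are standardized before fitting, so coefficient magnitudes are comparable across features.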

To visualize feature importance, I extracted the learned weights and plotted them. This helps interpret which features have the strongest influence on the model’s decisions.

In [7]:
import plotly.express as px

# Extract model coefficients and feature names
coefficients = pd.Series(
    pipe[-1].coef_.ravel(),
    index=pipe[-1].feature_names_in_
).sort_values()

# Plot feature importance
fig = px.bar(
    x=coefficients.index,
    y=coefficients.values,
    labels={"x": "Feature", "y": "Coefficient"},
    title="Feature Importance from Logistic Regression"
)
fig.show()
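
One way to read a logistic regression coefficient is as an odds ratio: exponentiating it gives the factor by which the odds of default change for a one-standard-deviation increase in a standardized feature. The coefficient values in this sketch are hypothetical and are not the ones fitted by `pipe`.

```python
import math

# Hypothetical standardized coefficients, not the fitted values from `pipe`.
example_coefficients = {"DEBT": 0.40, "INCOME": -0.25}

odds_ratios = {name: math.exp(coef) for name, coef in example_coefficients.items()}

for name, ratio in odds_ratios.items():
    print(f"+1 std in {name} multiplies the odds of default by {ratio:.2f}")
```

A ratio above 1 means higher odds of default and a ratio below 1 means lower odds, which matches the sign convention described above.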

### Regression Task: Predicting Customer Credit Score

To complement the classification analysis, I trained a regression model to predict the customer's `CREDIT_SCORE` — a numerical indicator of creditworthiness. I reused the same set of features used previously, but adjusted the preprocessing and evaluation steps to suit a regression task.

For this model, I used **Lasso regression**, which adds L1 regularization and is useful for automatically selecting the most relevant features by shrinking the coefficients of less informative ones to zero.

The pipeline includes:
- **Imputation**: Mean strategy for numerical features, most frequent for categorical
- **Encoding**: One-hot encoding for categorical variables
- **Scaling**: Standardization prior to model training
- **Model**: Lasso regression with the default regularization strength (`alpha=1.0`)

Model performance was evaluated using:
- **Mean Absolute Error (MAE)**
- **Mean Squared Error (MSE)**
- **R² Score**

I also plotted feature importances based on the regression coefficients.

In [9]:
from sklearn.linear_model import Lasso
from sklearn.metrics import mean_squared_error, mean_absolute_error, r2_score

# Define the new regression target
y_reg = df["CREDIT_SCORE"]

# Train-test split for regression
X_train_reg, X_test_reg, y_train_reg, y_test_reg = train_test_split(
    X, y_reg, test_size=0.25, random_state=20
)

# Identify column types
categorical_features = X.select_dtypes(include=["object"]).columns.tolist()
numerical_features = X.select_dtypes(include=["int64", "float64"]).columns.tolist()

# Pipelines for each type
cat_pipe = Pipeline([
    ("imputer", SimpleImputer(strategy="most_frequent")),
    ("encoder", OneHotEncoder(drop="first", handle_unknown="ignore", sparse_output=False))
])

num_pipe = Pipeline([
    ("imputer", SimpleImputer(strategy="mean"))
])

# Combine with ColumnTransformer
transformer = ColumnTransformer([
    ("cat", cat_pipe, categorical_features),
    ("num", num_pipe, numerical_features)
], remainder="passthrough")

# Final pipeline with Lasso regression
reg_pipeline = Pipeline([
    ("preprocessing", transformer),
    ("scaler", StandardScaler()),
    ("lasso", Lasso(random_state=20))
])

# Fit model
reg_pipeline.fit(X_train_reg, y_train_reg)

# Predictions
y_pred_reg = reg_pipeline.predict(X_test_reg)

# Evaluate performance
mse = mean_squared_error(y_test_reg, y_pred_reg)
mae = mean_absolute_error(y_test_reg, y_pred_reg)
r2 = r2_score(y_test_reg, y_pred_reg)

print(f"MAE: {mae:.2f}")
print(f"MSE: {mse:.2f}")
print(f"R² Score: {r2:.4f}")

MAE: 30.50
MSE: 1928.38
R² Score: 0.5354
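
MSE is expressed in squared score points, so it can help to convert it back to the score scale. The sketch below uses only the printed metrics. The implied spread of the test scores follows from the definition `R² = 1 - MSE / Var(y_test)` and is an inference, not a value read from the data.

```python
import math

# Metrics copied from the printed output above.
mae, mse, r2 = 30.50, 1928.38, 0.5354

rmse = math.sqrt(mse)

# Rearranging R² = 1 - MSE / Var(y_test) gives the variance of the test targets.
implied_variance = mse / (1 - r2)
implied_std = math.sqrt(implied_variance)

print(f"RMSE: {rmse:.2f}")
print(f"Implied std of test credit scores: {implied_std:.2f}")
```

On this reading, a typical error is roughly 44 score points (RMSE) against a spread of roughly 64 points in the test scores.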


### Feature Importance in Lasso Regression

After training the regression model, I extracted and visualized the feature coefficients from the Lasso regression. These coefficients reflect the contribution of each input feature to the predicted credit score.

Due to Lasso's regularization, many coefficients are reduced to zero, effectively eliminating less relevant features and helping improve interpretability. Features with larger absolute values (positive or negative) are more influential in predicting the credit score.


In [11]:
# Visualize Lasso feature importances
import plotly.express as px

lasso_coef = pd.Series(
    reg_pipeline[-1].coef_.ravel(),
    index=reg_pipeline[-1].feature_names_in_
).sort_values()

fig = px.bar(
    x=lasso_coef.index,
    y=lasso_coef.values,
    labels={"x": "Feature", "y": "Coefficient"},
    title="Feature Importance: Lasso Regression"
)
fig.show()
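
The zeroing of coefficients can be illustrated with the soft-thresholding operator, which gives the Lasso solution in the idealized case of standardized, mutually uncorrelated features. This is a simplified sketch rather than how `Lasso` is fitted here (scikit-learn runs coordinate descent on the full feature matrix), and the unpenalized coefficients are hypothetical.

```python
def soft_threshold(value, alpha):
    """Shrink `value` toward zero by `alpha`; anything within [-alpha, alpha] becomes 0."""
    if value > alpha:
        return value - alpha
    if value < -alpha:
        return value + alpha
    return 0.0


alpha = 1.0  # default regularization strength of scikit-learn's Lasso

# Hypothetical unpenalized coefficients for four standardized features.
unpenalized = {"INCOME": 12.0, "R_HEALTH": -0.6, "T_HOUSING_6": 0.3, "DEBT": -4.5}

lasso_like = {name: soft_threshold(coef, alpha) for name, coef in unpenalized.items()}
print(lasso_like)
# {'INCOME': 11.0, 'R_HEALTH': 0.0, 'T_HOUSING_6': 0.0, 'DEBT': -3.5}
```

Weak coefficients are set exactly to zero and strong ones are shrunk by `alpha`, which is the behavior that makes the Lasso coefficient plot sparse.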

In [10]:
# Re-plot the logistic regression coefficients for comparison with the Lasso plot
coef = pd.Series(pipe[-1].coef_.ravel(), index=pipe[-1].feature_names_in_).sort_values()
fig = px.bar(
    x=coef.index,
    y=coef.values,
    labels={"x": "Feature", "y": "Coefficient"},
    title="Feature Importance: Logistic Regression",
)
fig.show()

### Model Comparison and Interpretability

To compare the classification and regression models, I examined the learned coefficients from both `LogisticRegression` and `Lasso`. These coefficients reflect the relative importance of each feature in determining either the likelihood of default (classification) or the credit score (regression).

---

### Key Observations

- **Shared influential features**: Variables such as `R_HEALTH`, the encoded `CAT_GAMBLING` levels, and `DEBT` had large-magnitude weights in both models, suggesting they are strong indicators of credit behavior overall.
  
- **Classification vs. regression signals**: The classification model placed more weight on **binary risk signals** (e.g., `CAT_DEBT`, `CAT_MORTGAGE`), whereas the regression model focused on **continuous financial indicators** like `INCOME`, `SAVINGS`, and derived ratios.

- **Regularization effect in Lasso**: Many variables received zero or near-zero weights in the regression model, demonstrating Lasso’s effectiveness in eliminating weak predictors and enhancing model interpretability.

- **Direction of impact**:
  - Positive weights in the classification model increase the likelihood of default.
  - Positive coefficients in the regression model indicate association with **higher** credit scores, while negative ones signal a **lower** score.

---

### Performance Summary

- **Classification model (`LogisticRegression`)**  
  - **Accuracy**: 0.725  
  - **Recall**: 0.193  
  - **Precision**: 0.550  
  - **F1 Score**: 0.286  
  
  This model identifies most non-default cases but has low recall: it catches only about 19% of actual defaulters and misses the rest. Its accuracy should therefore be read alongside the class balance rather than on its own. The features it prioritizes can still be useful for **flagging potential risks**.

- **Regression model (`Lasso`)**  
  - **MAE**: 30.50  
  - **MSE**: 1928.38  
  - **R² Score**: 0.535  
  
  This model explains about 54% of the variance in test-set credit scores, providing a **quantitative** risk profile rather than a binary one.

---

### Final Thoughts

While both models share preprocessing steps like imputation, scaling, and encoding, they serve **distinct purposes**:

- The **classification model** (`LogisticRegression`) predicts whether an individual will default, which is a binary outcome.
- The **regression model** (`Lasso`) aims to **quantify** creditworthiness via credit score prediction and uses L1 regularization to limit overfitting and drop weak features.

Despite their differences, both offer **complementary insights**: classification models help identify risk categories, while regression models add precision by estimating creditworthiness on a continuous scale.

These results serve as a baseline and highlight key areas to explore further, such as tree-based models or more advanced feature selection techniques.