# Introduction
Credit risk assessment is all about predicting whether someone will default on a loan, which helps banks and lenders make smarter decisions. In practice, the data used for these predictions is often messy—missing values are common, and ignoring them can lead to unreliable results. Imputation techniques fill in these gaps, making sure models have the complete information they need.

This project explores how different ways of handling missing data—using the median, linear regression, and non-linear regression—affect the accuracy of credit default predictions. By testing these methods on the UCI Credit Card Default Clients dataset, the goal is to see which approach gives the best results and why thoughtful data preparation matters in real-world credit risk modeling.



To kick off the analysis, the credit card default dataset is loaded and a small portion of values in key numerical columns are randomly set to missing. This simulates the kind of incomplete data often found in real-world credit risk problems and sets up a realistic challenge for testing different imputation methods. The target for prediction is whether a client defaulted on their payment next month.



In [22]:
import pandas as pd
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.neighbors import KNeighborsRegressor
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import classification_report
from sklearn.model_selection import GridSearchCV

In [4]:
# Load the dataset
df = pd.read_csv('UCI_credit_card.csv')

# Define numerical columns for missing value injection
columns_to_inject = ['AGE', 'BILL_AMT1', 'BILL_AMT2'] 

# Set random seed for reproducibility
np.random.seed(42)

# Introduce 5% missing (MAR) values per selected column
missing_frac = 0.05
for col in columns_to_inject:
    n_missing = int(missing_frac * len(df))
    missing_indices = np.random.choice(df.index, n_missing, replace=False)
    df.loc[missing_indices, col] = np.nan

# Quick check: Display percent missing per column
print(df[columns_to_inject].isnull().mean())  

# Preview dataset
print(df.head())


AGE          0.05
BILL_AMT1    0.05
BILL_AMT2    0.05
dtype: float64
   ID  LIMIT_BAL  SEX  EDUCATION  MARRIAGE   AGE  PAY_0  PAY_2  PAY_3  PAY_4  \
0   1    20000.0    2          2         1  24.0      2      2     -1     -1   
1   2   120000.0    2          2         2  26.0     -1      2      0      0   
2   3    90000.0    2          2         2  34.0      0      0      0      0   
3   4    50000.0    2          2         1  37.0      0      0      0      0   
4   5    50000.0    1          2         1  57.0     -1      0     -1      0   

   ...  BILL_AMT4  BILL_AMT5  BILL_AMT6  PAY_AMT1  PAY_AMT2  PAY_AMT3  \
0  ...        0.0        0.0        0.0       0.0     689.0       0.0   
1  ...     3272.0     3455.0     3261.0       0.0    1000.0    1000.0   
2  ...    14331.0    14948.0    15549.0    1518.0    1500.0    1000.0   
3  ...    28314.0    28959.0    29547.0    2000.0    2019.0    1200.0   
4  ...    20940.0    19146.0    19131.0    2000.0   36681.0   10000.0   

   PAY_AMT4

As a starting point, let's fill the missing values with the median of each column.
# Why median is preferred over mean for imputation?

The median is often preferred over the mean for imputation because it is less sensitive to outliers and better represents the center of skewed data distributions. This helps maintain the integrity of the data, especially when some values are unusually high or low

In [11]:
# Create a copy for baseline imputation
dataset_A = df.copy()

# Fill missing values in each column with the median
for col in dataset_A.columns:
    if dataset_A[col].isnull().any():
        median_value = dataset_A[col].median()
        dataset_A[col].fillna(median_value, inplace=True)

# Quick check: confirm no missing values remain
print(dataset_A.isnull().sum())


ID                            0
LIMIT_BAL                     0
SEX                           0
EDUCATION                     0
MARRIAGE                      0
AGE                           0
PAY_0                         0
PAY_2                         0
PAY_3                         0
PAY_4                         0
PAY_5                         0
PAY_6                         0
BILL_AMT1                     0
BILL_AMT2                     0
BILL_AMT3                     0
BILL_AMT4                     0
BILL_AMT5                     0
BILL_AMT6                     0
PAY_AMT1                      0
PAY_AMT2                      0
PAY_AMT3                      0
PAY_AMT4                      0
PAY_AMT5                      0
PAY_AMT6                      0
default.payment.next.month    0
dtype: int64


The behavior will change in pandas 3.0. This inplace method will never work because the intermediate object on which we are setting values always behaves as a copy.

For example, when doing 'df[col].method(value, inplace=True)', try using 'df.method({col: value}, inplace=True)' or df[col] = df[col].method(value) instead, to perform the operation inplace on the original object.


  dataset_A[col].fillna(median_value, inplace=True)
The behavior will change in pandas 3.0. This inplace method will never work because the intermediate object on which we are setting values always behaves as a copy.

For example, when doing 'df[col].method(value, inplace=True)', try using 'df.method({col: value}, inplace=True)' or df[col] = df[col].method(value) instead, to perform the operation inplace on the original object.


  dataset_A[col].fillna(median_value, inplace=True)
The behavior will change in pandas 3.0. This inplace method will never work because the intermediate object on which we are setting 

Next, missing values in one column are predicted using a linear regression model built from the other features. 

This method assumes the missing values are "Missing At Random"—that is, the fact that a value is missing depends only on the observed data, not the missing value itself. This way, we can reasonably estimate what the missing value might have been using the other available features

In [12]:
# Create Dataset B as a copy
dataset_B = df.copy()

# Choose the column to impute
target_col = 'AGE'

# Use only rows where predictors and target are not missing for training
predictors = [col for col in dataset_B.columns if col not in [target_col, 'default payment next month']]
train_mask = dataset_B[target_col].notnull() & dataset_B[predictors].notnull().all(axis=1)
train_X = dataset_B.loc[train_mask, predictors]
train_y = dataset_B.loc[train_mask, target_col]

# Fit regression model
linreg = LinearRegression()
linreg.fit(train_X, train_y)

# For prediction, use rows where target is missing but predictors are present
pred_mask = dataset_B[target_col].isnull() & dataset_B[predictors].notnull().all(axis=1)
pred_X = dataset_B.loc[pred_mask, predictors]

# Predict and fill missing values
dataset_B.loc[pred_mask, target_col] = linreg.predict(pred_X)

# If any missing values remain (because predictors were missing), you can fill those with the median as a fallback
dataset_B[target_col].fillna(dataset_B[target_col].median(), inplace=True)

# Confirm no missing in the imputed column
print(dataset_B[target_col].isnull().sum())


0


The behavior will change in pandas 3.0. This inplace method will never work because the intermediate object on which we are setting values always behaves as a copy.

For example, when doing 'df[col].method(value, inplace=True)', try using 'df.method({col: value}, inplace=True)' or df[col] = df[col].method(value) instead, to perform the operation inplace on the original object.


  dataset_B[target_col].fillna(dataset_B[target_col].median(), inplace=True)


Now, let's go to a non-linear regression based imputation

In [14]:

# Copy original data for non-linear imputation
dataset_C = df.copy()

target_col = 'AGE'
predictors = [col for col in dataset_C.columns if col not in [target_col, 'default payment next month']]

# Only use rows where predictors and target are not missing for training
train_mask = dataset_C[target_col].notnull() & dataset_C[predictors].notnull().all(axis=1)
train_X = dataset_C.loc[train_mask, predictors]
train_y = dataset_C.loc[train_mask, target_col]

# Initialize and fit non-linear regression model
knn = KNeighborsRegressor(n_neighbors=5)
knn.fit(train_X, train_y)


# Rows where the target is missing and predictors are present
pred_mask = dataset_C[target_col].isnull() & dataset_C[predictors].notnull().all(axis=1)
pred_X = dataset_C.loc[pred_mask, predictors]

# Predict and fill missing values
dataset_C.loc[pred_mask, target_col] = knn.predict(pred_X)
# dataset_C.loc[pred_mask, target_col] = dtree.predict(pred_X)  # Alternative

# Fill any remaining missing values (where predictors were also missing) with the median as fallback
dataset_C[target_col].fillna(dataset_C[target_col].median(), inplace=True)

# Check missing values are handled
print(dataset_C[target_col].isnull().sum())


0


The behavior will change in pandas 3.0. This inplace method will never work because the intermediate object on which we are setting values always behaves as a copy.

For example, when doing 'df[col].method(value, inplace=True)', try using 'df.method({col: value}, inplace=True)' or df[col] = df[col].method(value) instead, to perform the operation inplace on the original object.


  dataset_C[target_col].fillna(dataset_C[target_col].median(), inplace=True)



With the missing values handled in different ways, it’s time to split each dataset into training and testing sets. This step sets the stage for fair model evaluation, and also introduces a fourth approach—listwise deletion—where only fully complete rows are kept for analysis.


In [18]:
# Ensure column names are stripped of extra spaces
df.columns = df.columns.str.strip()
dataset_A.columns = dataset_A.columns.str.strip()
dataset_B.columns = dataset_B.columns.str.strip()
dataset_C.columns = dataset_C.columns.str.strip()

target = 'default.payment.next.month'

# Split imputed datasets (A, B, C)
splits = {}
for name, data in zip(['A', 'B', 'C'], [dataset_A, dataset_B, dataset_C]):
    X = data.drop(columns=[target])
    y = data[target]
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=0.2, random_state=42, stratify=y
    )
    splits[name] = (X_train, X_test, y_train, y_test)

# Dataset D: listwise deletion (drop all rows with missing values)
dataset_D = df.dropna()
X_D = dataset_D.drop(columns=[target])
y_D = dataset_D[target]
X_D_train, X_D_test, y_D_train, y_D_test = train_test_split(
    X_D, y_D, test_size=0.2, random_state=42, stratify=y_D
)
splits['D'] = (X_D_train, X_D_test, y_D_train, y_D_test)

# Now splits['A'], splits['B'], splits['C'], splits['D'] contain train/test sets for all four datasets


Before training the classifier, each dataset’s features are standardized so they share the same scale. This step ensures that all variables contribute equally to the model and helps improve the reliability of the results

In [20]:
# Initialize scaler
scaler = StandardScaler()

# Standardize features for each dataset split
for name in ['A', 'B', 'C', 'D']:
    X_train, X_test, y_train, y_test = splits[name]
    
    # Fit scaler on training features and transform both train and test
    X_train_scaled = scaler.fit_transform(X_train)
    X_test_scaled = scaler.transform(X_test)
    
    # Replace original splits with scaled versions
    splits[name] = (X_train_scaled, X_test_scaled, y_train, y_test)

# Now splits contain standardized feature arrays ready for model training


With the data ready, it’s time to put each imputation strategy to the test. By training a logistic regression classifier and comparing accuracy, precision, recall, and F1-score across all four datasets, we’ll see how each approach impacts real-world prediction performance

In [24]:
# Define parameter grid for tuning
param_grid = {
    'C': [0.01, 0.1, 1, 10, 100],
    'penalty': ['l2'],
    'solver': ['lbfgs', 'saga'],
    'max_iter': [1000]
}

for name in ['A', 'B', 'C', 'D']:
    X_train, X_test, y_train, y_test = splits[name]
    
    # Remove any rows with NaNs in features for both train and test sets
    train_mask = ~np.isnan(X_train).any(axis=1)
    test_mask = ~np.isnan(X_test).any(axis=1)
    X_train_clean = X_train[train_mask]
    y_train_clean = y_train.iloc[train_mask]
    X_test_clean = X_test[test_mask]
    y_test_clean = y_test.iloc[test_mask]
    
    grid = GridSearchCV(
        LogisticRegression(random_state=42),
        param_grid,
        cv=5,
        scoring='f1',
        n_jobs=-1
    )
    grid.fit(X_train_clean, y_train_clean)
    best_model = grid.best_estimator_
    y_pred = best_model.predict(X_test_clean)
    print(f"\nBest parameters for Dataset {name}: {grid.best_params_}")
    print(f"Classification Report for Dataset {name}:\n")
    print(classification_report(y_test_clean, y_pred, digits=4))



Best parameters for Dataset A: {'C': 1, 'max_iter': 1000, 'penalty': 'l2', 'solver': 'saga'}
Classification Report for Dataset A:

              precision    recall  f1-score   support

           0     0.8175    0.9692    0.8869      4673
           1     0.6870    0.2381    0.3537      1327

    accuracy                         0.8075      6000
   macro avg     0.7522    0.6037    0.6203      6000
weighted avg     0.7886    0.8075    0.7690      6000


Best parameters for Dataset B: {'C': 100, 'max_iter': 1000, 'penalty': 'l2', 'solver': 'saga'}
Classification Report for Dataset B:

              precision    recall  f1-score   support

           0     0.8184    0.9695    0.8875      4192
           1     0.6960    0.2452    0.3626      1195

    accuracy                         0.8088      5387
   macro avg     0.7572    0.6073    0.6251      5387
weighted avg     0.7912    0.8088    0.7711      5387


Best parameters for Dataset C: {'C': 10, 'max_iter': 1000, 'penalty': 'l2', 'so

# Interpretation
The results show that all four imputation strategies yield similar overall accuracy (around 81%), but there are important differences in how well each model identifies defaulters (class 1) versus non-defaulters (class 0).

### Key Observations

- **Accuracy** is high for all models (about 81%), but this is mostly driven by the large number of correctly predicted non-defaulters (class 0), which dominate the dataset.
- **Precision for class 1 (defaulters)** is moderate (about 0.69–0.73), meaning when the model predicts a default, it is correct about 69–74% of the time.
- **Recall for class 1** is low (about 0.24), indicating the model misses most actual defaulters. This is typical in imbalanced datasets, where the minority class is harder to detect.
- **F1-score for class 1** is also low (about 0.35–0.37), reflecting the trade-off between precision and recall for defaulters.
- **Macro and weighted averages** are higher for precision than recall, again showing the model is better at avoiding false positives than at catching all true positives.

### Imputation Strategy Comparison

- **Median Imputation (A)**, **Linear Regression (B)**, and **Non-Linear Regression (C)** all perform similarly, with only slight differences in precision and recall for defaulters.
- **Listwise Deletion (D)** has the highest precision for defaulters (0.73) and slightly better overall accuracy (0.814), but recall remains low, meaning many defaulters are still missed.
- The best hyperparameters vary slightly, but all models favor strong regularization (C=1, 10, or 100) and the 'saga' solver.


- All models are good at identifying non-defaulters but struggle to catch actual defaulters, which is a common challenge in credit risk modeling with imbalanced data.
- Imputation method does not dramatically change the outcome, but listwise deletion slightly improves precision for defaulters, possibly by removing ambiguous cases.
- For practical use, improving recall for defaulters (class 1) would be important—potentially by using resampling techniques, adjusting class weights, or exploring alternative models.

### Comment

These results highlight the importance of looking beyond accuracy in imbalanced classification problems. Precision, recall, and F1-score for the minority class (defaulters) provide a clearer picture of model effectiveness, and suggest that further work is needed to improve detection of risky clients.


# Comparitive analysis

| Model      | Accuracy | Precision (1) | Recall (1) | F1-score (1) | Macro F1 | Weighted F1 |
|------------|----------|---------------|------------|-------------|----------|-------------|
| Median     | 0.8075   | 0.6870        | 0.2381     | 0.3537      | 0.6203   | 0.7690      |
| Linear     | 0.8088   | 0.6960        | 0.2452     | 0.3626      | 0.6251   | 0.7711      |
| Non-Linear | 0.8086   | 0.6952        | 0.2444     | 0.3616      | 0.6245   | 0.7708      |
| Listwise   | 0.8140   | 0.7337        | 0.2474     | 0.3700      | 0.6304   | 0.7759      |

- **Listwise Deletion (D)** has the highest precision and F1-score for defaulters, but all models struggle with recall for class 1.
- **Median, Linear, and Non-Linear Imputation** perform similarly, with only minor differences in metrics.
- The overall accuracy is high, but the F1-score for defaulters is low, highlighting the challenge of imbalanced data in credit risk modeling.



Listwise Deletion (Model D) and Imputation (Models A, B, C) represent two fundamentally different approaches to handling missing data, each with its own trade-offs.

### Listwise Deletion (Model D)
- **Pros:**  
  - Simple and universally applicable; ensures all data used for modeling is complete.
  - If data is Missing Completely at Random (MCAR), estimates are unbiased.
- **Cons:**  
  - Discards any row with missing values, which can lead to a substantial reduction in sample size and loss of information.
  - If missingness is not MCAR (e.g., MAR or MNAR), this can introduce bias and reduce the diversity of the data, making the model less generalizable.
  - Larger standard errors and less statistical power due to smaller sample size.

### Imputation (Models A, B, C)
- **Pros:**  
  - Preserves sample size by filling in missing values, allowing the model to leverage all available data.
  - Can reduce bias and improve model robustness, especially when missingness is MAR and imputation is done appropriately.
- **Cons:**  
  - Imputed values are estimates, not true observations, which can introduce their own bias if the imputation model is misspecified.
  - Simple imputation methods (mean/median) may not capture the true variability or relationships in the data.
  - More complex imputation (regression, non-linear) can improve accuracy but may still struggle if missingness is MNAR.

### Why Might Model D Perform Poorly?
- **Loss of Data:** By removing all incomplete cases, Model D may lose valuable information, especially if missingness is related to important features or outcomes.
- **Bias:** If the missing data is not MCAR, listwise deletion can bias the model by excluding non-random subsets of the data.
- **Reduced Power:** Smaller sample size means less power to detect true effects, leading to less reliable predictions.
- **Imputation Models May Perform Worse:** If imputation is poorly specified or the missingness mechanism is complex, imputed models can introduce their own errors. However, they generally retain more information and diversity, which can help the model generalize better, especially in real-world scenarios where missingness is rarely MCAR.

### Summary
- **Listwise Deletion** is simple but risky unless missingness is truly random; it can lead to bias and reduced model performance due to loss of data.
- **Imputation** methods, while imperfect, usually offer better use of available data and can improve model robustness, especially when missingness is related to observed variables (MAR).[3][6]
- The best approach depends on the nature of the missing data and the modeling context, but in most practical cases, imputation is preferred over listwise deletion for maintaining data integrity and predictive power.


# Which regression method performed better?
Linear and non-linear regression imputation performed almost identically in your results. This suggests the relationship between the imputed feature and its predictors is mostly linear, so the extra flexibility of non-linear methods did not provide an advantage. If the relationship were more complex, non-linear methods could outperform linear imputation. In your case, linear regression was sufficient because the predictors and the imputed feature were linearly related.

# Conclusion and Recommendation

The best strategy for handling missing data in this scenario is **imputation using regression-based methods (linear or non-linear)** rather than listwise deletion.

### Justification

- **Classification Performance:**  
  All imputation models (A, B, C) achieved similar accuracy and F1-scores, but listwise deletion (Model D) only slightly outperformed them in precision and F1-score for defaulters. However, this came at the cost of discarding a significant portion of the data, which can reduce model generalizability and statistical power.

- **Conceptual Implications:**  
  Imputation preserves the full dataset, allowing the model to learn from more examples and maintain diversity. Listwise deletion risks bias and loss of information, especially if missingness is not completely random. Regression imputation is robust when the relationship between features is linear or moderately complex, as in your results, and avoids the pitfalls of both simple imputation and data loss.

### Recommendation

**Use regression-based imputation (linear or non-linear) for missing data.**  
This approach balances strong classification performance with conceptual soundness, retaining more data and reducing bias compared to listwise deletion. It is especially effective when the relationship between features is not highly non-linear, as shown by the similar results for both regression methods in your analysis.
