# Supervised Classification and Unsupervised Anomaly Detection Models

_Author: Maria Laura Borra - Date: 6/27/2025_

## 1. Objective


The goal of this notebook is to develop and evaluate various models to identify individuals who are at high risk of defaulting on credit payments. This involves comparing both unsupervised anomaly detection methods, which detect unusual behavior patterns without labeled examples of defaults, and supervised classification models, which leverage labeled data to predict the likelihood of default.

By systematically assessing the performance of these models using metrics such as ROC AUC, precision, and recall, we aim to select the most effective approach for accurately flagging high-risk customers and thereby improve credit risk management.


## 2. Dataset Description


The dataset underwent several preprocessing steps to ensure quality and compatibility with machine learning models:

- Loading and Cleaning: Raw text data was loaded and its columns were renamed using a provided mapping. A custom cleaning function addressed missing values, normalized text, dropped irrelevant columns, and engineered new features.

- Train/Validation/Test Split: Data was split into training, validation, and test sets to ensure unbiased model evaluation.

- Preprocessing Pipeline:

    1. Imputation: Median strategy for numeric features, most frequent value for categorical features.

    2. Feature Encoding: Categorical variables were encoded, with a cap on cardinality.

    3. Scaling: MinMaxScaler was applied.

    4. Class Imbalance: The majority class accounts for more than 70% of the samples, but SMOTE was not applied because it produced low performance during model training.
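
The pipeline steps listed above can be sketched with scikit-learn. This is a minimal illustration on a toy DataFrame, not the project's `simple_preprocess` implementation; the column names, the toy values, and the choice of one-hot encoding are assumptions made for the example.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import MinMaxScaler, OneHotEncoder

# Toy data with missing values (hypothetical columns, not the PAKDD2010 schema)
toy = pd.DataFrame({
    "income": [1000.0, np.nan, 3000.0, 2000.0],
    "age": [25.0, 40.0, np.nan, 35.0],
    "sex": ["F", "M", np.nan, "F"],
})
numeric_cols = ["income", "age"]
categorical_cols = ["sex"]

# Numeric branch: median imputation, then scaling to [0, 1]
numeric_steps = Pipeline([
    ("impute", SimpleImputer(strategy="median")),
    ("scale", MinMaxScaler()),
])
# Categorical branch: most-frequent imputation, then one-hot encoding (assumed)
categorical_steps = Pipeline([
    ("impute", SimpleImputer(strategy="most_frequent")),
    ("encode", OneHotEncoder(handle_unknown="ignore")),
])

preprocessor = ColumnTransformer(
    [("num", numeric_steps, numeric_cols),
     ("cat", categorical_steps, categorical_cols)],
    sparse_threshold=0,  # always return a dense array
)

# In practice: fit on the training split only, then transform val/test
transformed = preprocessor.fit_transform(toy)
print(transformed.shape)
```

Fitting the transformer on the training split and reusing it for the other splits keeps imputation values and scaling ranges from being influenced by held-out rows.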

### 2.1 Importing the necessary libraries

In [None]:
# Mount Google Drive
from google.colab import drive
drive.mount('/content/drive')
# Change to the project directory
import os
os.chdir("/content/drive/My Drive/Learning/AnyoneAI/Final_Project/Models")  # change to your actual project root folder
print("Current dir:", os.getcwd())  # confirm
# Install dependencies from requirements.txt
!pip install -r "/content/drive/My Drive/Learning/AnyoneAI/Final_Project/Models/requirements.txt"


In [1]:
# Standard library imports
import glob
import os
import sys

# Third-party library imports
import matplotlib.pyplot as plt
import matplotlib.image as mpimg
import pandas as pd
from ydata_profiling import ProfileReport

# Project-specific path setup
project_root = os.path.abspath(os.path.join(os.getcwd(), '..'))
sys.path.append(project_root)

# Project imports - Configs
from src.config import RAW_DATA_DIR, EXTERNAL_DATA_DIR

# Project imports - Data utilities
from src.data.data_utils import (
    get_feature_target,
    get_train_val_sets,
    df_to_csv,
    summarize_column_counts,
    load_txt_with_mapped_columns,
)
from src.data.cleaning_dataset import clean_dataset

# Project imports - Preprocessing
from src.preprocessing.preprocessing import simple_preprocess

# Project imports - Modeling
from src.modeling.models import (
    train_classification_pipeline,
    run_isolation_forest,
    run_one_class_svm,
)
# Third-party library imports - Metrics
from sklearn.metrics import classification_report, roc_auc_score, precision_score, recall_score




2025-06-28 19:07:08.868 | INFO     | src.config:<module>:11 - PROJ_ROOT path is: C:\Users\Administrator\Desktop\CreditRiskAnalysisProject


### 2.2 Load, Convert, and Clean the Data

In [2]:
# Define the path to the raw data text file
txt_file_path = str(RAW_DATA_DIR / "PAKDD2010_Modeling_Data.txt")

# Define the path to the Excel file that contains the variable (column) name mappings
variables_path = str(EXTERNAL_DATA_DIR / 'PAKDD2010_VariablesList.xls')

# Load the raw text data and rename its columns using the mapping from the Excel file.
# The function 'load_txt_with_mapped_columns' reads the data and replaces column headers
# based on the mapping provided in the Excel file.
df = load_txt_with_mapped_columns(txt_file_path, variables_path)

# Clean the dataset using a custom function
# This includes normalizing text, handling missing values, dropping irrelevant columns, 
# creating new features (counts, sums, ratios), and renaming target column.
df = clean_dataset(df)


Starting dataset cleaning...
Normalized string columns (stripped and uppercased): ['CLERK_TYPE', 'APPLICATION_SUBMISSION_TYPE', 'SEX', 'STATE_OF_BIRTH', 'CITY_OF_BIRTH', 'RESIDENCIAL_STATE', 'RESIDENCIAL_CITY', 'RESIDENCIAL_BOROUGH', 'FLAG_RESIDENCIAL_PHONE', 'RESIDENCIAL_PHONE_AREA_CODE', 'FLAG_MOBILE_PHONE', 'COMPANY', 'PROFESSIONAL_STATE', 'PROFESSIONAL_CITY', 'PROFESSIONAL_BOROUGH', 'FLAG_PROFESSIONAL_PHONE', 'PROFESSIONAL_PHONE_AREA_CODE', 'FLAG_ACSP_RECORD', 'RESIDENCIAL_ZIP_3', 'PROFESSIONAL_ZIP_3']
Numeric columns changed to Category: ['PAYMENT_DAY', 'POSTAL_ADDRESS_TYPE', 'MARITAL_STATUS', 'EDUCATION_LEVEL', 'NACIONALITY', 'FLAG_VISA', 'FLAG_MASTERCARD', 'FLAG_DINERS', 'FLAG_AMERICAN_EXPRESS', 'FLAG_OTHER_CARDS', 'RESIDENCE_TYPE', 'PROFESSION_CODE', 'OCCUPATION_TYPE', 'MATE_PROFESSION_CODE', 'PRODUCT', 'RESIDENCIAL_ZIP_3', 'PROFESSIONAL_ZIP_3', 'FLAG_INCOME_PROOF', 'FLAG_CPF', 'FLAG_RG', 'FLAG_HOME_ADDRESS_DOCUMENT', 'FLAG_EMAIL', 'FLAG_RESIDENCIAL_PHONE', 'FLAG_MOBILE_PHONE',

  temp = df[cols_to_count].replace({'Y': 1, 'N': 0})


Created column 'NO_CREDIT_HISTORY' by counting 0s in: ['FLAG_VISA', 'FLAG_MASTERCARD', 'FLAG_DINERS', 'FLAG_AMERICAN_EXPRESS', 'FLAG_OTHER_CARDS', 'QUANT_BANKING_ACCOUNTS', 'QUANT_SPECIAL_BANKING_ACCOUNTS']
Created column 'TOTAL_INCOME' by summing: ['OTHER_INCOMES', 'PERSONAL_MONTHLY_INCOME']
Created column 'TOTAL_BANK_ACCOUNTS' by summing: ['QUANT_BANKING_ACCOUNTS', 'QUANT_SPECIAL_BANKING_ACCOUNTS']
Created column 'STABILITY_INDEX' by summing: ['MONTHS_IN_RESIDENCE', 'MONTHS_IN_THE_JOB']
Created ratio column 'MONTHS_IN_THE_JOB_DIV_AGE' as MONTHS_IN_THE_JOB / AGE
Created ratio column 'TOTAL_INCOME_DIV_QUANT_DEPENDANTS' as TOTAL_INCOME / QUANT_DEPENDANTS
Created ratio column 'PERSONAL_MONTHLY_INCOME_DIV_OTHER_INCOMES' as PERSONAL_MONTHLY_INCOME / OTHER_INCOMES
Created ratio column 'TOTAL_INCOME_DIV_AGE' as TOTAL_INCOME / AGE
Created ratio column 'TOTAL_INCOME_DIV_TOTAL_BANK_ACCOUNTS' as TOTAL_INCOME / TOTAL_BANK_ACCOUNTS
Created ratio column 'TOTAL_INCOME_DIV_TOTAL_CREDIT_CARDS' as TOTA



Standardized missing values to NaN.
Dropped columns: ['QUANT_ADDITIONAL_CARDS', 'FLAG_RG', 'EDUCATION_LEVEL', 'FLAG_MOBILE_PHONE', 'FLAG_ACSP_RECORD', 'CLERK_TYPE', 'FLAG_CPF', 'FLAG_HOME_ADDRESS_DOCUMENT', 'FLAG_INCOME_PROOF']
Renamed column 'TARGET_LABEL_BAD=1' to 'TARGET'
Dropped 0 duplicate rows.
Dropped likely ID columns: ['ID_CLIENT']
Dataset cleaning completed.
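
The cleaning log above reports ratio columns such as `TOTAL_INCOME / QUANT_DEPENDANTS`, where the denominator can be zero. A minimal sketch of one way to build such a ratio safely follows; the helper name `add_ratio_column` and the toy values are hypothetical, and the project's `clean_dataset` may handle this case differently.

```python
import numpy as np
import pandas as pd


def add_ratio_column(df: pd.DataFrame, numerator: str, denominator: str) -> pd.DataFrame:
    """Return a copy of df with a '<numerator>_DIV_<denominator>' column.

    Division by zero yields +/-inf (or NaN for 0/0); infinities are
    converted to NaN so that a later imputation step can handle them.
    """
    out = df.copy()
    name = f"{numerator}_DIV_{denominator}"
    ratio = out[numerator].astype(float) / out[denominator].astype(float)
    out[name] = ratio.replace([np.inf, -np.inf], np.nan)
    return out


# Toy rows: a regular case, a division by zero, and 0/0
toy = pd.DataFrame({
    "TOTAL_INCOME": [1200.0, 800.0, 0.0],
    "QUANT_DEPENDANTS": [2, 0, 0],
})
toy = add_ratio_column(toy, "TOTAL_INCOME", "QUANT_DEPENDANTS")
print(toy)
```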


### 2.3 Preprocessing

In [3]:
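# NOTE: the same DataFrame is passed as both the train and the test source,
# so x_test covers the same 50,000 rows as x_train (see the shapes below).
x_train, y_train, x_test, y_test = get_feature_target(df, df, target_column="TARGET")
print(" Feature/target separation completed successfully.")
print(f"x_train shape: {x_train.shape}")
print(f"y_train shape: {y_train.shape}")
print(f"x_test shape: {x_test.shape}")
print(f"y_test shape: {y_test.shape}")

 Feature/target separation completed successfully.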
x_train shape: (50000, 66)
y_train shape: (50000,)
x_test shape: (50000, 66)
y_test shape: (50000,)


In [4]:
x_train, x_val, y_train, y_val = get_train_val_sets(x_train, y_train)
print(" Train/val split completed successfully.")
print(f"x_train shape: {x_train.shape}")
print(f"y_train shape: {y_train.shape}")
print(f"x_val shape:   {x_val.shape}")
print(f"y_val shape:   {y_val.shape}")

 Train/val split completed successfully.
x_train shape: (40000, 66)
y_train shape: (40000,)
x_val shape:   (10000, 66)
y_val shape:   (10000,)
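
For reference, a three-way split in which the test rows are held out before any training can be sketched with scikit-learn's `train_test_split`. This is an illustration on synthetic data, not the implementation of `get_feature_target` or `get_train_val_sets`; the 60/20/20 proportions, the stratification, and the random seeds are assumptions.

```python
import numpy as np
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 5))
y = (rng.random(1000) < 0.26).astype(int)  # roughly 26% positives (synthetic)

# Hold out the test set first so it never overlaps with the training data
X_rest, X_test, y_rest, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42
)
# Split the remainder into training and validation sets
X_train, X_val, y_train, y_val = train_test_split(
    X_rest, y_rest, test_size=0.25, stratify=y_rest, random_state=42
)
print(X_train.shape, X_val.shape, X_test.shape)
```

Stratifying on the target keeps the share of defaults similar across the three splits, which matters when one class is much larger than the other.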


In [5]:
(train_x, train_y), (val_x, val_y), (test_x, test_y) = simple_preprocess(
    x_train, x_val, x_test,
    y_train, y_val, y_test,
    numeric_imputer_strategy='median',  # median imputation is appropriate when the data distribution is skewed
    categorical_imputer_strategy='most_frequent',
    apply_smote=False,
    cardinality_amount=17,
    threshold_imbalanced=0.99,
    # sampling_strategy="auto",
)
print("Train:", train_x.shape, train_y.shape)
print("Validation:", val_x.shape, val_y.shape)
print("Test:", test_x.shape, test_y.shape) 



Dropped columns: ['POSTAL_ADDRESS_TYPE', 'FLAG_DINERS', 'FLAG_AMERICAN_EXPRESS', 'FLAG_OTHER_CARDS', 'MONTHS_IN_THE_JOB', 'HAS_PREMIUM_CARD', 'MONTHS_IN_THE_JOB_DIV_AGE']
[Step 1] Numeric columns imputed: ['QUANT_DEPENDANTS', 'MONTHS_IN_RESIDENCE', 'PERSONAL_MONTHLY_INCOME', 'OTHER_INCOMES', 'QUANT_BANKING_ACCOUNTS', 'QUANT_SPECIAL_BANKING_ACCOUNTS', 'PERSONAL_ASSETS_VALUE', 'QUANT_CARS', 'AGE', 'TOTAL_CREDIT_CARDS', 'DOC_CONFIRMATION', 'CONTACTABILITY', 'IS_NOT_FOREIGNER', 'HAS_ASSETS&HAS_PREMIUM_CARD', 'NO_CREDIT_HISTORY', 'TOTAL_INCOME', 'TOTAL_BANK_ACCOUNTS', 'STABILITY_INDEX', 'TOTAL_INCOME_DIV_QUANT_DEPENDANTS', 'PERSONAL_MONTHLY_INCOME_DIV_OTHER_INCOMES', 'TOTAL_INCOME_DIV_AGE', 'TOTAL_INCOME_DIV_TOTAL_BANK_ACCOUNTS', 'TOTAL_INCOME_DIV_TOTAL_CREDIT_CARDS', 'MONTHS_IN_THE_JOB_DIV_MONTHS_IN_RESIDENCE', 'MONTHS_IN_RESIDENCE_DIV_AGE', 'TOTAL_CREDIT_CARDS_DIV_TOTAL_INCOME', 'PERSONAL_ASSETS_VALUE_DIV_TOTAL_CREDIT_CARDS']
[Custom Feature] INCOME_BELOW_AVG added using avg from training
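
The `INCOME_BELOW_AVG` log line indicates that the average comes from the training set. A minimal sketch of that idea: compute the statistic on the training rows only and reuse it for the other splits, so no information leaks from held-out data. The source column (`TOTAL_INCOME`), the helper name, and the toy values are assumptions; the actual logic lives in `simple_preprocess`.

```python
import pandas as pd


def add_income_below_avg(train: pd.DataFrame, *others: pd.DataFrame,
                         income_col: str = "TOTAL_INCOME"):
    """Flag rows whose income is below the training-set average."""
    train_avg = train[income_col].mean()  # statistic learned from train only
    flagged = []
    for frame in (train, *others):
        out = frame.copy()
        out["INCOME_BELOW_AVG"] = (out[income_col] < train_avg).astype(int)
        flagged.append(out)
    return flagged


train_df = pd.DataFrame({"TOTAL_INCOME": [1000.0, 2000.0, 3000.0]})
val_df = pd.DataFrame({"TOTAL_INCOME": [1500.0, 2500.0]})
train_df, val_df = add_income_below_avg(train_df, val_df)
print(val_df)
```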

## 3. Models Evaluated

The models were developed using a pipeline that included feature selection and hyperparameter tuning. Feature selection was performed with SelectKBest. Hyperparameter optimization was carried out using GridSearchCV to systematically explore parameter combinations and identify the best configuration for each model.

Three models were developed and evaluated to detect high-risk credit customers:

- Logistic Regression (Supervised): A classical supervised model used to predict the probability of default.

- Isolation Forest (Unsupervised): Detects anomalies without needing labeled data.

- One-Class SVM (Unsupervised): Learns the boundary of normal data; anything outside is flagged as an anomaly.
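
The combination of SelectKBest and GridSearchCV described above can be sketched as follows for the supervised model. This is an illustration on synthetic data, not the body of `train_classification_pipeline`; the scoring function `f_classif`, the parameter grid, `class_weight="balanced"`, and the `roc_auc` scoring choice are all assumptions.

```python
from sklearn.datasets import make_classification
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline

# Synthetic, imbalanced binary problem (about 74% / 26%)
X, y = make_classification(
    n_samples=600, n_features=20, n_informative=5,
    weights=[0.74, 0.26], random_state=0,
)

# Feature selection and the classifier live in one pipeline, so the
# selection step is re-fitted inside every cross-validation fold
pipeline = Pipeline([
    ("select", SelectKBest(score_func=f_classif)),
    ("clf", LogisticRegression(class_weight="balanced", max_iter=1000)),
])

# Hypothetical grid: number of selected features and regularization strength
param_grid = {
    "select__k": [5, 10],
    "clf__C": [0.1, 1.0],
}

search = GridSearchCV(pipeline, param_grid, scoring="roc_auc", cv=3)
search.fit(X, y)
print(search.best_params_, round(search.best_score_, 3))
```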


### 3.1 Isolation Forest

In [6]:
model, y_pred, auc_if, prec_if, rec_if = run_isolation_forest(train_x, train_y, val_x, val_y, contamination=0.1, max_features=10)


Isolation Forest Classification Report:
              precision    recall  f1-score   support

           0       0.74      0.89      0.81      7392
           1       0.23      0.09      0.13      2608

    accuracy                           0.68     10000
   macro avg       0.48      0.49      0.47     10000
weighted avg       0.60      0.68      0.63     10000

ROC AUC: 0.5000
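
How `run_isolation_forest` turns model output into labels and a ROC AUC is not shown in this notebook. The sketch below illustrates one common approach with scikit-learn's `IsolationForest` on synthetic data: hard predictions are mapped to 0/1 for precision and recall, while a continuous anomaly score is used for ROC AUC. The `contamination` and `max_features` values mirror the call above; passing them straight to `IsolationForest` is an assumption.

```python
import numpy as np
from sklearn.ensemble import IsolationForest
from sklearn.metrics import roc_auc_score

rng = np.random.default_rng(0)
# Synthetic data: 450 "normal" points and 50 shifted outliers, 12 features
X_normal = rng.normal(0.0, 1.0, size=(450, 12))
X_outlier = rng.normal(4.0, 1.0, size=(50, 12))
X = np.vstack([X_normal, X_outlier])
y = np.array([0] * 450 + [1] * 50)  # labels are used only for evaluation

model = IsolationForest(contamination=0.1, max_features=10, random_state=0)
model.fit(X)  # unsupervised: no labels are passed

# predict() returns -1 for anomalies and 1 for inliers; map to 1/0
y_pred = (model.predict(X) == -1).astype(int)

# score_samples() is lower for more anomalous points, so negate it
anomaly_score = -model.score_samples(X)
auc = roc_auc_score(y, anomaly_score)
print(int(y_pred.sum()), round(auc, 3))
```

Computing ROC AUC from a continuous score rather than from the 0/1 predictions uses the full ranking of the samples and generally gives a more informative value.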


### 3.2 One-Class SVM

In [7]:
model_, y_pred, auc_svm, prec_svm, rec_svm = run_one_class_svm(train_x, train_y, val_x, val_y)


One-Class SVM Classification Report:
              precision    recall  f1-score   support

           0       0.74      0.90      0.81      7392
           1       0.28      0.11      0.15      2608

    accuracy                           0.69     10000
   macro avg       0.51      0.50      0.48     10000
weighted avg       0.62      0.69      0.64     10000

ROC AUC: 0.5071
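
One common way to use a One-Class SVM for this kind of task is to fit it on "normal" rows only and flag everything outside the learned boundary. The sketch below shows that setup on synthetic data; whether `run_one_class_svm` trains on all rows or only on non-default rows is not shown here, and the kernel, `gamma`, and `nu` values are assumptions.

```python
import numpy as np
from sklearn.metrics import precision_score, recall_score
from sklearn.svm import OneClassSVM

rng = np.random.default_rng(1)
# Training data: "normal" rows only (synthetic stand-in for non-default customers)
X_train_normal = rng.normal(0.0, 1.0, size=(300, 4))

# Validation data: 90 normal rows and 10 clearly shifted anomalies
X_val = np.vstack([
    rng.normal(0.0, 1.0, size=(90, 4)),
    rng.normal(5.0, 1.0, size=(10, 4)),
])
y_val = np.array([0] * 90 + [1] * 10)

# nu bounds the fraction of training points treated as outliers
model = OneClassSVM(kernel="rbf", gamma="scale", nu=0.1)
model.fit(X_train_normal)

# predict() returns -1 outside the learned boundary and 1 inside; map to 1/0
y_pred = (model.predict(X_val) == -1).astype(int)
print(precision_score(y_val, y_pred), recall_score(y_val, y_pred))
```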


### 3.3 Logistic Regression

In [8]:
model, y_pred, auc_lr, prec_lr, rec_lr = train_classification_pipeline(train_x, train_y, val_x, val_y)


Logistic Regression Classification Report:
              precision    recall  f1-score   support

           0       0.80      0.58      0.67      7392
           1       0.33      0.58      0.42      2608

    accuracy                           0.58     10000
   macro avg       0.56      0.58      0.54     10000
weighted avg       0.67      0.58      0.60     10000

ROC AUC: 0.6127
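
The class-1 row of the report can be reproduced from confusion-matrix counts. The counts below are approximations back-calculated from the rounded metrics (recall 0.5824 on 2,608 positives, precision 0.3272); they are not values printed by the notebook.

```python
def precision_recall_f1(tp: int, fp: int, fn: int):
    """Compute precision, recall, and F1 for the positive class."""
    precision = tp / (tp + fp)  # share of flagged customers who defaulted
    recall = tp / (tp + fn)     # share of defaulters who were flagged
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


# Approximate counts reconstructed from the rounded report values
tp, fp, fn = 1519, 3123, 1089
precision, recall, f1 = precision_recall_f1(tp, fp, fn)
print(f"precision={precision:.2f} recall={recall:.2f} f1={f1:.2f}")
# prints: precision=0.33 recall=0.58 f1=0.42
```

Read this way, the model flags a little over half of the defaulters, and about one in three flagged customers is an actual default.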


## 4. Comparison Summary

The models were compared using ROC AUC, Precision, and Recall metrics on the validation set:

In [9]:
model_results = []

# Isolation Forest: reuse the metrics computed above (uncomment to re-run)
# _, _, auc_if, prec_if, rec_if = run_isolation_forest(train_x, train_y, val_x, val_y)
model_results.append({
    'Model': 'Isolation Forest',
    'ROC AUC': round(auc_if, 4),
    'Precision': round(prec_if, 4),
    'Recall': round(rec_if, 4),
    'Notes': 'Good unsupervised baseline'
})

# One-Class SVM: reuse the metrics computed above (uncomment to re-run)
# _, _, auc_svm, prec_svm, rec_svm = run_one_class_svm(train_x, train_y, val_x, val_y)
model_results.append({
    'Model': 'One-Class SVM',
    'ROC AUC': round(auc_svm, 4),
    'Precision': round(prec_svm, 4),
    'Recall': round(rec_svm, 4),
    'Notes': 'More sensitive to scaling'
})

# Logistic Regression: reuse the metrics computed above (uncomment to re-run)
# _, _, auc_lr, prec_lr, rec_lr = train_classification_pipeline(train_x, train_y, val_x, val_y)
model_results.append({
    'Model': 'Logistic Regression',
    'ROC AUC': round(auc_lr, 4),
    'Precision': round(prec_lr, 4),
    'Recall': round(rec_lr, 4),
    'Notes': 'Supervised, best performance'
})

results_df = pd.DataFrame(model_results)
results_df


|   | Model | ROC AUC | Precision | Recall | Notes |
|---|-------|---------|-----------|--------|-------|
| 0 | Isolation Forest | 0.5000 | 0.2274 | 0.0897 | Good unsupervised baseline |
| 1 | One-Class SVM | 0.5071 | 0.2764 | 0.1074 | More sensitive to scaling |
| 2 | Logistic Regression | 0.6127 | 0.3272 | 0.5824 | Supervised, best performance |
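
One way to read the precision column is against the positive-class prevalence in the validation set (2,608 of 10,000 rows), which is the precision expected from flagging customers at random. The sketch below computes that ratio, often called lift, from the values reported above; the variable names are illustrative.

```python
# Share of defaults in the validation set (support of class 1 / total rows)
prevalence = 2608 / 10000

# Precision values taken from the comparison results
precision_by_model = {
    "Isolation Forest": 0.2274,
    "One-Class SVM": 0.2764,
    "Logistic Regression": 0.3272,
}

# Lift > 1 means flagged customers default more often than the average customer
lift = {name: p / prevalence for name, p in precision_by_model.items()}
for name, value in lift.items():
    print(f"{name}: lift = {value:.2f}")
# prints:
# Isolation Forest: lift = 0.87
# One-Class SVM: lift = 1.06
# Logistic Regression: lift = 1.25
```

On this reading, the two unsupervised models sit close to random flagging, which is consistent with their ROC AUC values near 0.5.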


## 5. Final Model Selection and Test Set Evaluation

After evaluating all models, Logistic Regression was selected as the final model: it achieved the highest ROC AUC (0.6127), precision (0.3272), and recall (0.5824) on the validation set, and it is easy to interpret.

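The selected model was then evaluated on the test set. Because the test features come from the same DataFrame that supplied the training and validation rows (`get_feature_target(df, df, ...)` in Section 2.3), the test set overlaps with the training data, so these metrics should not be read as an independent estimate of generalization.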

In [10]:
# Re-train the Logistic Regression pipeline (this prints the validation report again)
final_model, y_pred, auc_lr, prec_lr, rec_lr = train_classification_pipeline(train_x, train_y, val_x, val_y)

# Evaluate the final model on the test set
y_test_pred = final_model.predict(test_x)
y_test_prob = final_model.predict_proba(test_x)[:, 1]

print("Final Evaluation on Test Set:")
print(classification_report(test_y, y_test_pred))
print(f"Test ROC AUC: {roc_auc_score(test_y, y_test_prob):.4f}")


Logistic Regression Classification Report:
              precision    recall  f1-score   support

           0       0.80      0.58      0.67      7392
           1       0.33      0.58      0.42      2608

    accuracy                           0.58     10000
   macro avg       0.56      0.58      0.54     10000
weighted avg       0.67      0.58      0.60     10000

ROC AUC: 0.6127
Final Evaluation on Test Set:
              precision    recall  f1-score   support

           0       0.80      0.58      0.67     36959
           1       0.33      0.59      0.42     13041

    accuracy                           0.58     50000
   macro avg       0.57      0.59      0.55     50000
weighted avg       0.68      0.58      0.61     50000

Test ROC AUC: 0.6187


## 6. Next Steps

- Additional machine learning models will be tested to further evaluate performance across different metrics.

- Once the best-performing model is selected, it will be fine-tuned.

- The selected model will then be prepared for deployment in a production environment.