# Credit Risk Analysis and Prediction

## Overview

This project focuses on Credit Risk Analysis and Prediction using customer transaction and demographic data. The dataset contains information on customer card payment history and demographic details, which are used to classify customers as either high risk or low risk for specific banking products. This project is ideal for budding data scientists and data analysts to experiment with machine learning and statistical modeling concepts.

## Source

The dataset is available on Kaggle at the following link:

> https://www.kaggle.com/datasets/praveengovi/credit-risk-classification-dataset

## Data Dictionary

The project utilizes two main datasets:

### 1. `payment_data.csv`
This dataset contains customer card payment history. The features are:

- **id**: This is the unique ID for a customer. Numerical Data.
- **OVD_t1**: Number of times overdue type 1 for a customer.  Numerical Data.
- **OVD_t2**: Number of times overdue type 2 for a customer.  Numerical Data.
- **OVD_t3**: Number of times overdue type 3 for a customer.  Numerical Data.
- **OVD_sum**: Total overdue days for a customer.  Numerical Data.
- **pay_normal**: Number of times normal payment was made by the customer. Numerical Data.
- **prod_code**: Credit product code. Numerical Data.
- **prod_limit**: Credit limit of the product. Numerical Data.
- **update_date**: Account update date of a customer. Date Data.
- **new_balance**: Current balance of the product of a customer. Numerical Data.
- **highest_balance**: Highest balance in history of a customer. Numerical Data.
- **report_date**: Date of the most recent payment of a customer. Date Data.

### 2. `customer_data.csv`
This dataset contains customer demographic data and categorical attributes, which have been encoded. The features include:

- **Categorical features**: `fea_1`, `fea_3`, `fea_5`, `fea_6`, `fea_7`, `fea_9`
- **Numerical features**: `fea_2`, `fea_4`, `fea_8`, `fea_10`, `fea_11`
- **label**: The output feature indicating the credit risk of a customer. It is a **binary** class with the following values:
  - `1`: High credit risk customer
  - `0`: Low credit risk customer
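Both files carry an `id` column, so they can be joined into a single modeling table. The sketch below uses tiny hand-made frames in place of the real CSVs. The left join on `id`, the sample values, and the assumption that one customer may have several payment rows are illustrative only, not the project's documented preprocessing.

```python
import pandas as pd

# Hypothetical stand-ins for payment_data.csv and customer_data.csv
payments = pd.DataFrame({
    "id": [1, 1, 2],
    "OVD_t1": [0, 2, 0],
    "pay_normal": [4, 1, 6],
})
customers = pd.DataFrame({
    "id": [1, 2],
    "fea_2": [1200.0, 1350.5],
    "label": [1, 0],
})

# Assumption: payments may repeat an id (one row per product), while
# customers has exactly one row per id. validate= raises if that is violated.
merged = payments.merge(customers, on="id", how="left", validate="many_to_one")
print(merged)
```

A left join keeps every payment row and attaches the customer attributes and `label` to it.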

## Problem Statements

1. **Model Training**: Train a model on the data to predict the credit risk of a customer.
2. **Model Evaluation**: Evaluate the trained model using metrics such as accuracy, precision, recall, and F1 score.
3. **Model Optimization**: Find the optimal model through hyperparameter tuning to get the best credit risk prediction performance.
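To make the four evaluation metrics concrete, here is a small worked example on made-up labels (the `y_true` and `y_pred` lists are hypothetical, not project data). It has 2 true positives, 1 false negative, 1 false positive, and 4 true negatives.

```python
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score

# Hypothetical labels: 1 = high risk, 0 = low risk
y_true = [1, 1, 1, 0, 0, 0, 0, 0]
y_pred = [1, 1, 0, 0, 0, 0, 0, 1]

accuracy = accuracy_score(y_true, y_pred)    # (TP + TN) / all = 6 / 8
precision = precision_score(y_true, y_pred)  # TP / (TP + FP) = 2 / 3
recall = recall_score(y_true, y_pred)        # TP / (TP + FN) = 2 / 3
f1 = f1_score(y_true, y_pred)                # harmonic mean of precision and recall

print(f"Accuracy: {accuracy:.2f}")    # 0.75
print(f"Precision: {precision:.2f}")  # 0.67
print(f"Recall: {recall:.2f}")        # 0.67
print(f"F1: {f1:.2f}")                # 0.67
```

For credit risk, recall on the high-risk class is usually the metric to watch, since a false negative is a risky customer classified as safe.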

### Load the libraries

In [25]:
# General
import pandas as pd
import numpy as np
import os
import warnings
import pickle

# Preprocessing
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Models and evaluation metrics
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score


### Settings

In [26]:
# Warnings
warnings.filterwarnings("ignore")

# Path
data_path = "../data"
model_path = "../models"
# csv_path = os.path.join(data_path, "credit_imputed.csv")
csv_path = os.path.join(data_path, "credit_nodup.csv")

### Load Data

In [18]:
df = pd.read_csv(csv_path)

In [19]:
# Check data
df.head()

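| Unnamed: 0 | id | OVD_t1 | OVD_t2 | OVD_t3 | OVD_sum | pay_normal | prod_code | prod_limit | new_balance | highest_balance | ... | fea_8 | fea_9 | fea_10 | fea_11 | update_year | update_month | update_day | report_year | report_month | report_day |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 58987402 | 0 | 0 | 0 | 0 | 1 | 10 | 16500.0 | 0.0 | 219202.725928 | ... | 95 | 4 | 60023 | 1.0 | 2016 | 12 | 4 | 2013 | 6 | 17 |
| 1 | 58995151 | 0 | 0 | 0 | 0 | 1 | 5 | 85789.70122 | 588720.0 | 491100.0 | ... | 115 | 4 | 450028 | 224.267697 | 2016 | 12 | 4 | 2013 | 6 | 17 |
| 2 | 58997200 | 0 | 0 | 0 | 0 | 2 | 5 | 85789.70122 | 840000.0 | 700500.0 | ... | 110 | 4 | 60000 | 219.248717 | 2016 | 12 | 4 | 2016 | 4 | 22 |
| 3 | 54988608 | 0 | 0 | 0 | 0 | 3 | 10 | 37400.0 | 8425.2 | 7520.0 | ... | 108 | 4 | 151300 | 158.113883 | 2016 | 12 | 3 | 2016 | 4 | 25 |
| 4 | 54987763 | 0 | 0 | 0 | 0 | 2 | 10 | 85789.70122 | 15147.6 | 219202.725928 | ... | 88 | 5 | 151300 | 233.520877 | 2016 | 12 | 3 | 2016 | 4 | 26 |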
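The loaded file has `update_year`, `update_month`, `update_day` and the matching `report_*` columns instead of the raw `update_date` and `report_date` fields. How the project derived them is not shown here. One plausible way, sketched on hypothetical ISO-formatted dates (the real date format is an assumption), is:

```python
import pandas as pd

# Hypothetical raw dates; the format of the real columns may differ
raw = pd.DataFrame({
    "update_date": ["2016-12-04", "2016-12-03"],
    "report_date": ["2013-06-17", "2016-04-25"],
})

for col in ["update_date", "report_date"]:
    parsed = pd.to_datetime(raw[col], format="%Y-%m-%d")
    prefix = col.split("_")[0]  # "update" or "report"
    raw[f"{prefix}_year"] = parsed.dt.year
    raw[f"{prefix}_month"] = parsed.dt.month
    raw[f"{prefix}_day"] = parsed.dt.day

# Keep only the numeric date parts, which a scikit-learn model can consume
features = raw.drop(columns=["update_date", "report_date"])
print(features)
```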


### Preprocessing

1. Separate Input and output features
2. Split train and test
3. Scale the train and test sets

In [20]:
# Separate Input and output features
X = df.drop(columns=["label"])
y = df["label"]
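Note that `X` as built above still contains `Unnamed: 0` (a saved row index) and `id`, both visible in the `df.head()` output. These are identifiers rather than customer attributes, and a tree model can memorize them. A possible variant that drops them is sketched below on a small hypothetical frame. Whether this changes the reported scores has not been checked here.

```python
import pandas as pd

# Hypothetical miniature of the loaded frame (labels are made up)
df_demo = pd.DataFrame({
    "Unnamed: 0": [0, 1, 2],
    "id": [58987402, 58995151, 58997200],
    "pay_normal": [1, 1, 2],
    "label": [0, 1, 0],
})

# Identifier-like columns that carry no predictive meaning
non_features = ["Unnamed: 0", "id"]

# errors="ignore" keeps the call safe if a column is already absent
X_demo = df_demo.drop(columns=["label"] + non_features, errors="ignore")
y_demo = df_demo["label"]
print(list(X_demo.columns))
```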

In [21]:
# Split Train and test data
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
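If the high-risk class is a minority, a plain random split can leave the test set with a different class ratio than the training set. Passing `stratify` to `train_test_split` keeps the ratio equal in both parts. The sketch below uses synthetic data with an assumed 20% positive rate, which is not taken from the project's dataset.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic data: 100 rows, 20 high-risk (1) and 80 low-risk (0)
rng = np.random.default_rng(0)
X_syn = rng.normal(size=(100, 3))
y_syn = np.array([1] * 20 + [0] * 80)

X_tr, X_te, y_tr, y_te = train_test_split(
    X_syn, y_syn, test_size=0.2, random_state=42, stratify=y_syn
)

# Both parts keep the 20% positive rate
print(y_tr.mean(), y_te.mean())
```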

In [22]:
# Scale the data to standardize it
scaler = StandardScaler()
X_train_s = scaler.fit_transform(X_train)
X_test_s = scaler.transform(X_test)
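The scaler is fitted on the training set only and then reused on the test set, so no test-set statistics leak into training. A minimal illustration on synthetic data (all names and values here are hypothetical):

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
train_demo = rng.normal(loc=50.0, scale=10.0, size=(200, 2))
test_demo = rng.normal(loc=50.0, scale=10.0, size=(50, 2))

scaler_demo = StandardScaler()
train_scaled = scaler_demo.fit_transform(train_demo)  # learns mean and std from train only
test_scaled = scaler_demo.transform(test_demo)        # reuses the train statistics

# Train columns end up with mean 0 and std 1; test columns are only close to that
print(train_scaled.mean(axis=0).round(6), train_scaled.std(axis=0).round(6))
print(test_scaled.mean(axis=0).round(2), test_scaled.std(axis=0).round(2))
```

Tree-based models such as Random Forest split on thresholds and are largely insensitive to feature scaling, so this step matters mainly if other model families are tried later.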

### Model Training and Evaluation

In [23]:
# Train the model with train dataset and evaluate metrics
def train_evaluate(model):
    # Train the model
    model.fit(X_train_s, y_train)

    # Predict train and test
    y_train_pred = model.predict(X_train_s)
    y_test_pred = model.predict(X_test_s)

    # Evaluate metrics
    print("=" * 60)
    print("Training Scores")
    print("=" * 60)
    print(f"Accuracy: {accuracy_score(y_train, y_train_pred) : .2f}")
    print(f"Precision: {precision_score(y_train, y_train_pred) : .2f}")
    print(f"Recall: {recall_score(y_train, y_train_pred) : .2f}")
    print(f"F1: {f1_score(y_train, y_train_pred) : .2f}")

    print("=" * 60)
    print("Testing Scores")
    print("=" * 60)
    print(f"Accuracy: {accuracy_score(y_test, y_test_pred) : .2f}")
    print(f"Precision: {precision_score(y_test, y_test_pred) : .2f}")
    print(f"Recall: {recall_score(y_test, y_test_pred) : .2f}")
    print(f"F1: {f1_score(y_test, y_test_pred) : .2f}")

In [24]:
# Train with Random Forest
rf = RandomForestClassifier()
train_evaluate(rf)

Training Scores
Accuracy:  1.00
Precision:  1.00
Recall:  1.00
F1:  1.00
Testing Scores
Accuracy:  0.95
Precision:  1.00
Recall:  0.68
F1:  0.81
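The problem statement lists hyperparameter tuning as a goal. One common way to do it is a cross-validated grid search, sketched below on synthetic data from `make_classification`. The parameter grid, the `f1` scoring choice, and the data are all assumptions for illustration and are not the settings used to produce the scores above.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import f1_score
from sklearn.model_selection import GridSearchCV, train_test_split

# Synthetic, imbalanced stand-in for the credit data
X_syn, y_syn = make_classification(
    n_samples=300, n_features=8, weights=[0.8, 0.2], random_state=42
)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_syn, y_syn, test_size=0.2, random_state=42, stratify=y_syn
)

# Hypothetical grid; limiting depth and leaf size counters overfitting
param_grid = {
    "n_estimators": [25, 50],
    "max_depth": [None, 5],
    "min_samples_leaf": [1, 5],
}

search = GridSearchCV(
    RandomForestClassifier(random_state=42),
    param_grid,
    scoring="f1",  # balances precision and recall on the positive class
    cv=3,
)
search.fit(X_tr, y_tr)

print(search.best_params_)
print(f"CV F1: {search.best_score_:.2f}")
print(f"Test F1: {f1_score(y_te, search.predict(X_te)):.2f}")
```

After fitting, `search` behaves like the best estimator refitted on the full training split, so it could be passed to an evaluation helper in the same way as `rf`.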


### Conclusion

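With default parameters, the Random Forest reaches 0.95 accuracy, 1.00 precision, 0.68 recall, and 0.81 F1 on the test set; this is the best score obtained, and it is the model saved below. The perfect training scores next to a test recall of 0.68 point to overfitting: about a third of the high-risk customers in the test set go undetected.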

In [28]:
# Save the model
trained_model_path = os.path.join(model_path, "credit_risk_rf.pkl")
with open(trained_model_path, "wb") as model_file:
    pickle.dump(rf, model_file)
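The pickled `rf` was trained on scaled inputs, so anyone loading it also needs the fitted `scaler`. One option, shown here as a sketch on synthetic data, is to wrap both steps in a scikit-learn pipeline and pickle that single object. The file name `credit_risk_pipeline.pkl` and the temporary directory are hypothetical.

```python
import os
import pickle
import tempfile

from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the credit data
X_syn, y_syn = make_classification(n_samples=200, n_features=6, random_state=42)

# Scaler and model travel together, so raw features can be passed at predict time
pipe = make_pipeline(
    StandardScaler(),
    RandomForestClassifier(n_estimators=50, random_state=42),
)
pipe.fit(X_syn, y_syn)

with tempfile.TemporaryDirectory() as tmp_dir:
    pipeline_path = os.path.join(tmp_dir, "credit_risk_pipeline.pkl")
    with open(pipeline_path, "wb") as model_file:
        pickle.dump(pipe, model_file)
    with open(pipeline_path, "rb") as model_file:
        restored = pickle.load(model_file)

# The restored pipeline reproduces the original predictions
same = bool((restored.predict(X_syn) == pipe.predict(X_syn)).all())
print(same)
```

Pickle files should only be loaded from trusted sources, and ideally with the same scikit-learn version that wrote them.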