## Credit risk assessment using machine learning and LightningChart Python

### 1. Introduction

#### 1.1 What is credit risk assessment?
Credit risk assessment is the evaluation of the likelihood that a borrower will default on their loan obligations. It is a crucial part of financial risk management, helping institutions minimize losses due to bad loans.

#### 1.2 How have financial institutions benefited from machine learning applications for credit risk assessment?
Traditional methods often rely on manual evaluation and simple statistical models, which can be time-consuming and less accurate. Machine learning models, such as logistic regression and random forests, automate the process, providing faster and more accurate predictions by analyzing large datasets and identifying complex patterns.

#### 1.3 CatBoost Model for Credit Risk Assessment
We train several machine learning models: CatBoost Classifier, LGBM Classifier, Random Forest Classifier, XGB Classifier, and Stacking Classifier, chosen for their simplicity, interpretability, and ability to handle large datasets and complex relationships. The CatBoost Classifier performed best, so it is the primary focus of this project.

CatBoost Classifier is a machine learning (ML) model based on the gradient boosting decision tree (GBDT) framework. This framework is a popular ML technique used for both classification and regression tasks. The CatBoost Classifier stands out due to its powerful, versatile, and efficient performance in classification tasks, especially where the dataset includes categorical data. One of its key advantages is the ability to process categorical data directly without extensive preprocessing, making it particularly useful for credit risk modeling where such data is prevalent.

### 2. LightningChart Python 

#### 2.1 Overview of LightningChart Python
LightningChart is a high-performance charting library designed for real-time data visualization. Its Python wrapper allows for seamless integration with data analysis and machine learning workflows.

#### 2.2 Features and Chart Types to be Used in the Project
LightningChart Python offers a variety of chart types, each designed to handle specific types of data visualization needs. In this project, we use the following chart types to visualize the credit risk data:

- **Bar Chart**: Used for visualizing categorical data as bars, making it easy to compare different categories side by side.
- **Stacked Bar Chart**: Allows for visualizing the composition of each category, showing how individual parts contribute to the whole.
- **Grouped Bar Chart**: Similar to the bar chart, but groups bars together based on additional categories, facilitating comparison within groups.
- **Pie Chart**: Divides a circle into segments proportional to each category's share, giving a clear view of category distribution.
- **Box Plot**: Summarizes the distribution of a data group through its quartiles, median, and outliers, providing insight into spread and variability.

![LightningChart](./images/charts.png)
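
As a rough illustration of what a box plot encodes, the sketch below computes the quartiles, the median, and the outliers for a small list of values using only the standard library. The 1.5 × IQR outlier rule and the sample ages are assumptions made for the example; a charting library may use a different convention.

```python
import statistics

def box_plot_summary(values):
    """Return the statistics a box plot draws, using the 1.5 * IQR outlier rule."""
    data = sorted(values)
    q1, median, q3 = statistics.quantiles(data, n=4, method="inclusive")
    iqr = q3 - q1
    lower_fence = q1 - 1.5 * iqr
    upper_fence = q3 + 1.5 * iqr
    outliers = [v for v in data if v < lower_fence or v > upper_fence]
    return {"q1": q1, "median": median, "q3": q3, "iqr": iqr, "outliers": outliers}

# Hypothetical borrower ages, made up for illustration
ages = [21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31, 32, 70]
summary = box_plot_summary(ages)
print(summary)
```

For this sample the box spans 24 to 30 with a median of 27, and the value 70 falls outside the upper fence, so it would be drawn as an outlier.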

#### 2.3 Performance Characteristics
One of the standout aspects of LightningChart Python is its performance. The library is optimized for large volumes of data with minimal latency, handling millions of data points while keeping user interactions smooth. This matters in financial applications, where data often needs to be processed and visualized in real time to inform decisions.

### 3. Setting Up Python Environment

#### 3.1 Installing Python and Necessary Libraries
Install Python from the [official website](https://www.python.org/downloads/) and use pip to install necessary libraries including LightningChart Python from PyPI. To get the [documentation](https://lightningchart.com/python-charts/docs/) and the [license](https://lightningchart.com/python-charts/), please visit [LightningChart Website](https://lightningchart.com/).

In [1]:
# pip install lightningchart numpy pandas scikit-learn statsmodels scipy
# pip install feature-engine xgboost catboost lightgbm imbalanced-learn yellowbrick

In [2]:
# Importing the libraries and setting the LightningChart license
import lightningchart as lc
import random
lc.set_license('my-license-key')

import numpy as np
import pandas as pd
import statsmodels.api as sm
from sklearn.preprocessing import LabelEncoder, StandardScaler
from sklearn.model_selection import train_test_split  # used later when splitting the data
from sklearn.metrics import classification_report, accuracy_score, precision_score, recall_score, f1_score, roc_auc_score
from sklearn.ensemble import RandomForestClassifier
from sklearn.ensemble import ExtraTreesClassifier, StackingClassifier
from scipy.stats import probplot
from feature_engine.outliers import Winsorizer
from feature_engine.selection import DropConstantFeatures, DropDuplicateFeatures
from sklearn.pipeline import Pipeline
from xgboost import XGBClassifier, XGBRFClassifier
from catboost import CatBoostClassifier
from lightgbm import LGBMClassifier
from imblearn.over_sampling import BorderlineSMOTE
from collections import Counter
from yellowbrick.classifier import ClassPredictionError
from feature_engine.selection import DropCorrelatedFeatures

#### 3.2 Overview of Libraries Used
- **LightningChart**: Advanced data visualization.
- **NumPy**: Numerical computation.
- **Pandas**: Data manipulation and analysis.
- **Scikit-learn**: Data mining and data analysis.

#### 3.3 Setting Up Your Development Environment
Recommended IDEs include Jupyter Notebook, PyCharm, or Visual Studio Code.

### 4. Loading and Processing Data

#### 4.1 How to Load the Data Files
Data can be sourced from well-known platforms like Kaggle, the world's largest data science and machine learning community. The code below expects the dataset as a local CSV file named `credit_risk_dataset.csv`.

In [3]:
# Loading the dataset into the DataFrame used by the following cells
df = pd.read_csv("./credit_risk_dataset.csv")
df.head()

#### 4.2 Handling and preprocessing the data
Preprocessing involves cleaning the data and handling missing values to make it suitable for machine learning models. 

In [4]:
# Removing duplicates and handling missing data
df = df.drop_duplicates()
df = df.dropna()
df.isnull().sum()

In [None]:
# Initial classification based on data type
def grab_col_names(dataframe, cat_th=10, car_th=20):

    cat_cols = [col for col in dataframe.columns if dataframe[col].dtypes == "O"]

    num_but_cat = [col for col in dataframe.columns if dataframe[col].nunique() < cat_th and
                dataframe[col].dtypes != "O"]

    cat_but_car = [col for col in dataframe.columns if dataframe[col].nunique() > car_th and
                dataframe[col].dtypes == "O"]

    # Updating categorical columns list
    cat_cols = cat_cols + num_but_cat
    cat_cols = [col for col in cat_cols if col not in cat_but_car]

    # Defining numerical columns excluding numeric but categorical
    num_cols = [col for col in dataframe.columns if dataframe[col].dtypes != "O"]
    num_cols = [col for col in num_cols if col not in num_but_cat]

    print(f"Observations: {dataframe.shape[0]}")
    print(f"Variables: {dataframe.shape[1]}")
    print(f'cat_cols: {len(cat_cols)}')
    print(f'num_cols: {len(num_cols)}')
    print(f'cat_but_car: {len(cat_but_car)}')
    print(f'num_but_cat: {len(num_but_cat)}')
    return cat_cols, cat_but_car, num_cols

cat_cols, cat_but_car, num_cols = grab_col_names(df)

In [None]:
def high_correlated_cols(dataframe, display_table=False, corr_th=0.70):
    # Selecting only the numeric columns from the DataFrame
    numeric_dataframe = dataframe.select_dtypes(include=['number'])
    
    # Calculating the absolute correlation matrix
    corr = numeric_dataframe.corr().abs()
    
    # Create an upper triangle matrix to identify high correlations
    upper_triangle_matrix = corr.where(np.triu(np.ones(corr.shape), k=1).astype(bool))
    drop_list = [col for col in upper_triangle_matrix.columns if any(upper_triangle_matrix[col] > corr_th)]
    
    if display_table:
        # Displaying the correlation matrix
        print("Correlation Matrix:")
        display(corr)  # This uses IPython.display.display to show the DataFrame in Jupyter
        
    return drop_list

drop_list = high_correlated_cols(df, display_table=True)
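
The upper-triangle mask is the key step in the function above: it keeps each pair of columns only once, so only the second column of a highly correlated pair is flagged. A small sketch on a made-up three-column frame (the names `toy`, `upper`, and `to_drop` are illustrative) shows the effect:

```python
import numpy as np
import pandas as pd

# Toy frame, invented for illustration: b is an exact multiple of a, c is unrelated
toy = pd.DataFrame({
    "a": [1, 2, 3, 4, 5, 6],
    "b": [2, 4, 6, 8, 10, 12],
    "c": [5, 1, 4, 2, 6, 3],
})

corr = toy.corr().abs()

# Keep only the entries above the diagonal; everything else becomes NaN
upper = corr.where(np.triu(np.ones(corr.shape), k=1).astype(bool))

# A column is flagged if it correlates strongly with any column to its left
to_drop = [col for col in upper.columns if (upper[col] > 0.70).any()]
print(upper)
print(to_drop)
```

Here `a` and `b` are perfectly correlated, but only `b` ends up in the list, so dropping the flagged columns keeps one representative of each correlated pair.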

#### 4.3 Validation of the Study
In this project, we initially used five different models to assess their validity and performance in predicting credit risk. After evaluating the results, we selected the CatBoost Classifier as the best model for making predictions. Below are the performance metrics for each model tested:

![validation](./images/validation.png)
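
For readers who want to reproduce this kind of comparison, the following is a minimal sketch of how such metrics can be computed with the scikit-learn functions imported earlier. It uses synthetic data from `make_classification` and a single Random Forest as a stand-in; the dataset, the model settings, and the variable names are assumptions for illustration and do not reproduce the figures in the table above.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import (accuracy_score, precision_score, recall_score,
                             f1_score, roc_auc_score)
from sklearn.model_selection import train_test_split

# Synthetic, imbalanced binary data standing in for a credit dataset
X_demo, y_demo = make_classification(
    n_samples=500, n_features=8, weights=[0.8, 0.2], random_state=42
)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, stratify=y_demo, random_state=42
)

demo_model = RandomForestClassifier(n_estimators=50, random_state=42)
demo_model.fit(X_tr, y_tr)

y_pred = demo_model.predict(X_te)                # hard class labels
y_score = demo_model.predict_proba(X_te)[:, 1]   # probability of the positive class

metrics = {
    "accuracy": accuracy_score(y_te, y_pred),
    "precision": precision_score(y_te, y_pred, zero_division=0),
    "recall": recall_score(y_te, y_pred, zero_division=0),
    "f1": f1_score(y_te, y_pred, zero_division=0),
    "roc_auc": roc_auc_score(y_te, y_score),
}
for name, value in metrics.items():
    print(f"{name}: {value:.3f}")
```

Running the same block once per candidate model and collecting the dictionaries into a DataFrame would give a side-by-side table. With imbalanced classes, recall and ROC AUC are usually more informative than accuracy alone.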

### 5. Visualizing Data with LightningChart

#### 5.1 Introduction to LightningChart for Python
LightningChart Python allows for the creation of highly interactive and customizable charts.

#### 5.2 Creating the charts
To visualize the data, you can create various charts using LightningChart Python.

#### 5.3 Customizing visualizations
LightningChart offers extensive customization options. You can change the theme and colors, add markers, hide or sort features, and integrate real-time data updates to enhance the visualization.

In [None]:
import lightningchart as lc
import random

# Initialize LightningChart and set the license key
lc.set_license('my-license-key')

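# Create a BarChart with stacked data using LightningChart.
# `categories` (age-bin labels) and `hist_good`, `hist_bad`, `hist_overall`
# (per-bin counts) are assumed to have been computed beforehand from the data.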
chart = lc.BarChart(
    vertical=True, 
    theme=lc.Themes.White, 
    title='Age Distribution by Loan Status'
)
chart.set_data_stacked(
    categories,
    [
        {'subCategory': 'Good Loan Status', 'values': hist_good},
        {'subCategory': 'Bad Loan Status', 'values': hist_bad},
        {'subCategory': 'Overall Loan Status', 'values': hist_overall}
    ]
)
chart.set_value_label_display_mode('hidden')
chart.add_legend().add(chart)
chart.open()
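
The chart above relies on `categories`, `hist_good`, `hist_bad`, and `hist_overall`, which are not defined in the cell. One plausible way to build them is sketched below with `numpy.histogram`. The bin edges, the toy records, and the coding of `loan_status` (0 = good, 1 = bad) are assumptions for illustration; the original notebook may have used different bins.

```python
import numpy as np
import pandas as pd

# Toy records, invented for illustration; in the project these would come from df
toy = pd.DataFrame({
    "person_age": [22, 25, 31, 38, 44, 52, 27, 35, 61, 29],
    "loan_status": [0, 1, 0, 0, 1, 0, 0, 1, 0, 0],  # assumed: 0 = good, 1 = bad
})

# Ten-year age bins: 20-30, 30-40, ..., 60-70 (the last bin includes its upper edge)
bin_edges = np.arange(20, 71, 10)
categories = [f"{lo}-{hi}" for lo, hi in zip(bin_edges[:-1], bin_edges[1:])]

def age_histogram(ages):
    """Count ages per bin and return plain Python ints for JSON serialization."""
    counts, _ = np.histogram(ages, bins=bin_edges)
    return counts.tolist()

hist_good = age_histogram(toy.loc[toy["loan_status"] == 0, "person_age"])
hist_bad = age_histogram(toy.loc[toy["loan_status"] == 1, "person_age"])
hist_overall = age_histogram(toy["person_age"])

print(categories)
print(hist_good, hist_bad, hist_overall)
```

Converting the counts with `tolist()` matters because the chart data is serialized, and NumPy integer types are not always accepted where native Python integers are.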

### Sample result images

![Stacked Bar Chart](./images/stacked%20bar%20chart.png)
![Pie Chart](./images/pie%20chart.png)
![Bar Chart](./images/bar%20chart%201.png)
![Bar Chart](./images/bar%20chart%202.png)
![Bar Chart](./images/bar%20chart%203.png)
![Grouped Bar Chart](./images/grouped%20bar%20chart.png)
![Box Plot](./images/Box%20plot%201.png)
![Box Plot](./images/Box%20plot%202.png)

### 6. Conclusion

#### 6.1 Recap of creating the application and its usefulness
This project demonstrated how to build a credit risk model using Python and visualize the results with LightningChart. The use of machine learning models improved prediction accuracy, while LightningChart provided high-quality data visualizations.

#### 6.2 Benefits of using LightningChart Python for visualizing data
LightningChart's advanced features and performance make it an excellent choice for financial data visualization, offering clear insights and aiding in decision-making processes.

### The Predictions

In [None]:
# Predictions and statistics
# Loading the data
data = pd.read_csv("./credit_risk_dataset.csv")
data.drop_duplicates(inplace=True)
data.dropna(inplace=True)

# Mapping loan status directly in the original data for clarity
data['loan_status'] = data['loan_status'].map({0: 'Non-Default', 1: 'Default'})

# Original loan status count
original_status_counts = data['loan_status'].value_counts()
print("Original Dataset Counts:")
print(original_status_counts.to_string())
print("Sum --------->", original_status_counts.sum())

# Defining categorical and numerical columns
base_cat_cols = ['person_home_ownership', 'loan_intent', 'cb_person_default_on_file']
num_cols = ['person_age', 'person_income', 'person_emp_length', 'loan_amnt', 'loan_int_rate', 'loan_percent_income']

# Adding 'loan_grade' if it exists in the dataset
if 'loan_grade' in data.columns:
    base_cat_cols.append('loan_grade')

# Creating 'income_group'
data['income_group'] = pd.cut(data['person_income'],
                            bins=[0, 25000, 50000, 75000, 100000, float('inf')],
                            labels=['low', 'low-middle', 'middle', 'high-middle', 'high'])
base_cat_cols.append('income_group')

data_encoded = pd.get_dummies(data, columns=base_cat_cols, drop_first=True)

# Scaling numerical features
scaler = StandardScaler()
data_encoded[num_cols] = scaler.fit_transform(data_encoded[num_cols])

# Splitting data into features and target
X = data_encoded.drop('loan_status', axis=1)
y = data_encoded['loan_status']

# Splitting data for training and testing
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.20, random_state=42)

# Defining and training the CatBoost model
best_model = CatBoostClassifier(silent=True)
best_model.fit(X_train, y_train) 

# Predicting using the trained model on both datasets
original_predictions = best_model.predict(X_test)
new_data = pd.read_csv("./dataset_for_prediction.csv")
new_data.drop_duplicates(inplace=True)
new_data.dropna(inplace=True)

# Processing new_data as done with the original data
new_data['income_group'] = pd.cut(new_data['person_income'],
                                bins=[0, 25000, 50000, 75000, 100000, float('inf')],
                                labels=['low', 'low-middle', 'middle', 'high-middle', 'high'])
new_cat_cols = [col for col in base_cat_cols if col in new_data.columns]

new_data_encoded = pd.get_dummies(new_data, columns=new_cat_cols, drop_first=True)
new_data_encoded[num_cols] = scaler.transform(new_data_encoded[num_cols])  

# Aligning new data columns with the training features
missing_cols = set(X.columns) - set(new_data_encoded.columns)
for c in missing_cols:
    new_data_encoded[c] = 0
new_data_encoded = new_data_encoded[X.columns]  

new_predictions = best_model.predict(new_data_encoded)

# Counting predictions (np.ravel flattens the output in case predict returns a column vector)
original_prediction_counts = pd.Series(np.ravel(original_predictions)).value_counts()
new_prediction_counts = pd.Series(np.ravel(new_predictions)).value_counts()

print("\n------------------------------------------------")
print("\nOriginal Dataset Prediction:")
print(original_prediction_counts.to_string())
print("Sum --------->", original_prediction_counts.sum())
print("\n------------------------------------------------")
print("\nNew Dataset Prediction:")
print(new_prediction_counts.to_string())
print("Sum --------->", new_prediction_counts.sum())
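
One caveat with encoding the training file and the new file separately: with `drop_first=True`, `pd.get_dummies` drops whichever category happens to sort first in each file, so a new file that contains fewer categories can end up encoded differently from the training data. A possible safeguard, sketched here on made-up frames with hypothetical names (`train_raw`, `new_raw`, `encode`), is to fix the category levels from the training data before encoding:

```python
import pandas as pd

# Made-up frames for illustration; the new data contains only one category
train_raw = pd.DataFrame({
    "loan_intent": ["EDUCATION", "MEDICAL", "VENTURE"],
    "loan_amnt": [5000, 12000, 8000],
})
new_raw = pd.DataFrame({
    "loan_intent": ["MEDICAL", "MEDICAL"],
    "loan_amnt": [7000, 3000],
})

# Category levels are taken from the training data only
intent_levels = sorted(train_raw["loan_intent"].unique())

def encode(frame):
    """One-hot encode loan_intent using the fixed training levels."""
    frame = frame.copy()
    frame["loan_intent"] = pd.Categorical(frame["loan_intent"], categories=intent_levels)
    return pd.get_dummies(frame, columns=["loan_intent"], drop_first=True)

train_encoded = encode(train_raw)
new_encoded = encode(new_raw)

# reindex guarantees the same columns in the same order as the training features
aligned = new_encoded.reindex(columns=train_encoded.columns, fill_value=0)
print(aligned)
```

With fixed levels, both frames drop the same reference category (`EDUCATION` here), and the `MEDICAL` rows in the new data keep their indicator set. Categories that never appeared in training become all-zero rows, which is the same fallback as the zero-filling loop above.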

In [None]:
# Preparing data for the Bar Chart
# Converting prediction counts to int for JSON serialization
original_non_default_status_count = int(original_status_counts.get('Non-Default', 0))
original_default_status_count = int(original_status_counts.get('Default', 0))
original_non_default_count = int(original_prediction_counts.get('Non-Default', 0))
original_default_count = int(original_prediction_counts.get('Default', 0))
new_non_default_count = int(new_prediction_counts.get('Non-Default', 0))
new_default_count = int(new_prediction_counts.get('Default', 0))

# Initializing the chart
chart = lc.BarChart(vertical=True, theme=lc.Themes.White, title='Credit Risk Predictions Comparison')

# Configuring the data for the chart, ensuring values are native Python integers
chart.set_data_grouped(
    ['1. Original Dataset Count', '2. Original Dataset Prediction', '3. New Dataset Prediction'],
    [
        {'subCategory': 'Non-Default', 'values': [ original_non_default_status_count, original_non_default_count, new_non_default_count]},
        {'subCategory': 'Default', 'values': [original_default_status_count, original_default_count, new_default_count]}
    ]
)

# Sorting the chart
chart.set_sorting('alphabetical')

# Adding a legend to the chart
legend = chart.add_legend().add(chart)

# Opening the chart
chart.open() 