# **Data Analysis**

## Objectives

*   Answer **business requirement 1**: 
    * The client wants to uncover hidden relationship patterns among the variables in the loan default dataset, so that the client can learn which variables are most relevant to a default event.

## Inputs

* inputs/datasets/loan_data.csv.

## Outputs

* Generate code that answers business requirement 1 and can be used to build the Streamlit App.
* Whenever applicable, evaluate the validity of the aforementioned hypotheses.


---

## **Change working directory**

* Change the working directory from its current folder to its parent folder.

In [None]:
import os
current_dir = os.getcwd()
current_dir

* Make the parent of the current directory the new current directory.

In [None]:
os.chdir(os.path.dirname(current_dir))
print("You set a new current directory")

* Confirm the new current directory.

In [None]:
current_dir = os.getcwd()
current_dir

## **Dataset Loading**

In [None]:
import pandas as pd
df = (pd.read_csv("outputs/datasets/collection/row/LoanDefaultDataset.csv"))
df.head(3)

## **Data Exploration**

* Get the YData Profiling Report.
* Observe the variable types and whether any transformations are needed.
* Observe any missing cells in the variables of the dataset.

In [None]:
from ydata_profiling import ProfileReport
pandas_report = ProfileReport(df=df, minimal=True)
pandas_report.to_notebook_iframe()

In [None]:
df.info()

> Result:

- Variable Types:
  - 9 Numeric.
  - 5 Text (Categorical - Object).
- No Missing Cells.
- The target variable **loan_status** is imbalanced.
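
The imbalance noted above can be quantified with a class-share count. A minimal sketch on a toy frame; the eight values are made up for illustration and are not taken from the loan dataset:

```python
import pandas as pd

# Toy target column; values are hypothetical, not from the loan dataset
toy = pd.DataFrame({"loan_status": [0, 0, 0, 0, 0, 0, 1, 1]})

# Share of each class; a large gap between the shares indicates imbalance
class_share = toy["loan_status"].value_counts(normalize=True)
print(class_share)
```

Running the same call on the real target column would report the actual class shares.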

## **Correlation Study**

* Use the OneHotEncoder class to one-hot encode the categorical (object) features, so they can be included in the correlation study.

In [None]:
from feature_engine.encoding import OneHotEncoder
encoder = OneHotEncoder(variables=df.columns[df.dtypes=='object'].to_list(), drop_last=False)
df_ohe = encoder.fit_transform(df)
print(df_ohe.shape)
df_ohe.head(3)
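
To illustrate what one-hot encoding does to a categorical column, here is a pandas-only sketch on toy data. It uses `pd.get_dummies` rather than the feature_engine encoder used in the cell above, and the four rows are hypothetical:

```python
import pandas as pd

# Toy rows; the values are hypothetical and only illustrate the transformation
toy = pd.DataFrame({
    "person_home_ownership": ["RENT", "OWN", "MORTGAGE", "RENT"],
    "person_income": [30000, 52000, 75000, 41000],
})

# Each category becomes its own 0/1 column named <variable>_<category>
toy_ohe = pd.get_dummies(toy, columns=["person_home_ownership"], dtype=int)
print(toy_ohe.columns.to_list())
```

Every row has exactly one of the new indicator columns set to 1, which is what allows a categorical variable to take part in a numeric correlation matrix.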

In [8]:
%matplotlib inline


import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import ppscore as pps


def heatmap_corr(df, threshold, figsize=(20, 12), font_annot=8):
    if len(df.columns) > 1:
        mask = np.zeros_like(df, dtype=bool)
        mask[np.triu_indices_from(mask)] = True
        mask[abs(df) < threshold] = True

        fig, axes = plt.subplots(figsize=figsize)
        sns.heatmap(df, annot=True, xticklabels=True, yticklabels=True,
                    mask=mask, cmap='viridis', annot_kws={"size": font_annot}, ax=axes,
                    linewidth=0.5
                    )
        axes.set_yticklabels(df.columns, rotation=0)
        plt.ylim(len(df.columns), 0)
        plt.show()


def heatmap_pps(df, threshold, figsize=(20, 12), font_annot=8):
    if len(df.columns) > 1:
        mask = np.zeros_like(df, dtype=bool)
        mask[abs(df) < threshold] = True
        fig, ax = plt.subplots(figsize=figsize)
        ax = sns.heatmap(df, annot=True, xticklabels=True, yticklabels=True,
                         mask=mask, cmap='rocket_r', annot_kws={"size": font_annot},
                         linewidth=0.05, linecolor='grey')
        plt.ylim(len(df.columns), 0)
        plt.show()


def CalculateCorrAndPPS(df):
    df_corr_spearman = df.corr(method="spearman", numeric_only=True)
    df_corr_pearson = df.corr(method="pearson", numeric_only=True)

    pps_matrix_raw = pps.matrix(df)
    pps_matrix = pps_matrix_raw.filter(['x', 'y', 'ppscore']).pivot(columns='x', index='y', values='ppscore')

    pps_score_stats = pps_matrix_raw.query("ppscore < 1").filter(['ppscore']).describe().T
    print("PPS threshold - check PPS score IQR to decide threshold for heatmap \n")
    print(pps_score_stats.round(3))

    return df_corr_pearson, df_corr_spearman, pps_matrix


def DisplayCorrAndPPS(df_corr_pearson, df_corr_spearman, pps_matrix, CorrThreshold, PPS_Threshold,
                      figsize=(20, 12), font_annot=8):

    print("\n")
    print("* Analyse how the target variable for your ML models is correlated with other variables (features and target)")
    print("* Analyse multicollinearity, that is, how the features are correlated among themselves")

    print("\n")
    print("*** Heatmap: Spearman Correlation ***")
    print("It evaluates monotonic relationship \n")
    heatmap_corr(df=df_corr_spearman, threshold=CorrThreshold, figsize=figsize, font_annot=font_annot)

    print("\n")
    print("*** Heatmap: Pearson Correlation ***")
    print("It evaluates the linear relationship between two continuous variables \n")
    heatmap_corr(df=df_corr_pearson, threshold=CorrThreshold, figsize=figsize, font_annot=font_annot)

    print("\n")
    print("*** Heatmap: Predictive Power Score (PPS) ***")
    print(f"PPS detects linear or non-linear relationships between two columns.\n"
          f"The score ranges from 0 (no predictive power) to 1 (perfect predictive power) \n")
    heatmap_pps(df=pps_matrix, threshold=PPS_Threshold, figsize=figsize, font_annot=font_annot)
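
The masking logic in heatmap_corr hides the upper triangle (including the diagonal) and any cell whose absolute value falls below the threshold. A small NumPy sketch of that logic, using a made-up 3x3 correlation matrix:

```python
import numpy as np

# Hypothetical symmetric correlation matrix, for illustration only
corr = np.array([
    [1.0, 0.5, 0.1],
    [0.5, 1.0, -0.3],
    [0.1, -0.3, 1.0],
])
threshold = 0.2

mask = np.zeros_like(corr, dtype=bool)
mask[np.triu_indices_from(mask)] = True   # hide diagonal and upper triangle
mask[np.abs(corr) < threshold] = True     # hide weak correlations

visible = ~mask  # cells the heatmap would actually draw
print(visible)
```

Only the lower-triangle cells with an absolute value of at least 0.2 remain visible, which keeps the plot readable when many one-hot columns are present.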

In [None]:
df_corr_pearson, df_corr_spearman, pps_matrix = CalculateCorrAndPPS(df_ohe)

In [None]:
DisplayCorrAndPPS(df_corr_pearson = df_corr_pearson,
                  df_corr_spearman = df_corr_spearman, 
                  pps_matrix = pps_matrix,
                  CorrThreshold = 0.2, PPS_Threshold =0.1,
                  figsize=(12,8), font_annot=6)

* Sort the Pearson and Spearman correlations with loan_status by absolute value, drop the target itself, and keep the ten strongest from each method.

In [11]:
corr_pearson = df_corr_pearson['loan_status'].sort_values(key=abs, ascending=False)[1:].head(10)
corr_spearman = df_corr_spearman['loan_status'].sort_values(key=abs, ascending=False)[1:].head(10)
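
Both methods are used because they measure different things: Pearson captures linear association, while Spearman captures any monotonic association. A toy sketch with a hypothetical cubic relationship shows the difference:

```python
import pandas as pd

# Hypothetical data: y grows monotonically with x, but not linearly
toy = pd.DataFrame({"x": [1, 2, 3, 4, 5]})
toy["y"] = toy["x"] ** 3

pearson = toy.corr(method="pearson").loc["x", "y"]
spearman = toy.corr(method="spearman").loc["x", "y"]
print(f"Pearson: {pearson:.3f}, Spearman: {spearman:.3f}")
```

Spearman reaches 1 because the ranks of x and y agree perfectly, whereas Pearson stays below 1 because the relationship is curved.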

* Take the five strongest correlations from each method and combine them into a single set of variables.

In [None]:
top_n = 5
set(corr_pearson[:top_n].index.to_list() + corr_spearman[:top_n].index.to_list())

* Create a list of the names of the selected features for further investigation.

In [None]:
vars_to_study = ['loan_int_rate','loan_percent_income', 'person_home_ownership','person_income', 'previous_loan_defaults_on_file']
vars_to_study
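
The list above is written by hand. As a rough sketch, one-hot encoded names could be mapped back to their source variables programmatically. The helper `source_variable` and the encoded names below (for example the `_Yes` and `_RENT` suffixes) are hypothetical and assume the encoder names its columns as variable, underscore, category:

```python
def source_variable(encoded_name, original_columns):
    """Return the original column that an encoded column name derives from."""
    matches = [
        col for col in original_columns
        if encoded_name == col or encoded_name.startswith(col + "_")
    ]
    # Prefer the longest match in case one column name prefixes another
    return max(matches, key=len) if matches else encoded_name


original_columns = [
    "loan_int_rate", "loan_percent_income", "person_home_ownership",
    "person_income", "previous_loan_defaults_on_file",
]
# Hypothetical encoded names, assuming a <variable>_<category> naming scheme
selected = [
    "previous_loan_defaults_on_file_Yes", "loan_percent_income",
    "loan_int_rate", "person_home_ownership_RENT",
    "person_home_ownership_OWN", "person_income",
]

derived_vars = sorted({source_variable(name, original_columns) for name in selected})
print(derived_vars)
```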

## **Implementation of EDA on selected variables**

* Create an EDA DataFrame containing the selected features and the target variable.

In [None]:
df_eda = df.filter(vars_to_study + ['loan_status'])
df_eda.head(3)

## **Variables Distribution against Loan Status**

* Plot the distribution of each selected feature against loan status.

In [None]:
%matplotlib inline

import matplotlib.pyplot as plt
import seaborn as sns
sns.set_style('whitegrid')


def plot_categorical(df, col, target_var):

    plt.figure(figsize=(12, 5))
    sns.countplot(data=df, x=col, hue=target_var, order=df[col].value_counts().index)
    plt.xticks(rotation=90)
    plt.title(f"{col}", fontsize=20, y=1.05)
    plt.show()


def plot_numerical(df, col, target_var):
    plt.figure(figsize=(8, 5))
    sns.histplot(data=df, x=col, hue=target_var, kde=True, element="step")
    plt.title(f"{col}", fontsize=20, y=1.05)
    plt.show()


target_var = 'loan_status'
for col in vars_to_study:
    if df_eda[col].dtype == 'object':
        plot_categorical(df_eda, col, target_var)
        print("\n\n")
    else:
        plot_numerical(df_eda, col, target_var)
        print("\n\n")

## **Parallel Plot**

* Extract the maximum and minimum loan interest rates to decide the bin edges of the rate map.

In [None]:
df_eda.info()
print("-----")
max_interest_rate = df_eda['loan_int_rate'].max()
min_interest_rate = df_eda['loan_int_rate'].min()
print("The maximum interest rate:", max_interest_rate)
print("The minimum interest rate:", min_interest_rate)

In [None]:
from feature_engine.discretisation import ArbitraryDiscretiser
import numpy as np
rate_map = [-np.inf, 5, 7.5, 10, 12.5, 15, 17.5, 20, np.inf]  # np.Inf was removed in NumPy 2.0
disc = ArbitraryDiscretiser(binning_dict={'loan_int_rate': rate_map})
df_parallel = disc.fit_transform(df_eda)
df_parallel.head()

In [None]:
disc.binner_dict_['loan_int_rate']

In [None]:
n_classes = len(rate_map) - 1
classes_ranges = disc.binner_dict_['loan_int_rate'][1:-1]

labels_map = {}
for n in range(0, n_classes):
    if n == 0:
        labels_map[n] = f"<{classes_ranges[0]}"
    elif n == n_classes-1:
        labels_map[n] = f"+{classes_ranges[-1]}"
    else:
        labels_map[n] = f"{classes_ranges[n-1]} to {classes_ranges[n]}"

labels_map

In [None]:
df_parallel['loan_int_rate'] = df_parallel['loan_int_rate'].replace(labels_map)
df_parallel.head()
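
For comparison, the binning and labelling steps above can be approximated in one call with `pd.cut`. This is a pandas-only sketch on hypothetical rates, not the feature_engine discretiser itself, and it relies on the default right-closed intervals of `pd.cut`:

```python
import numpy as np
import pandas as pd

rate_map = [-np.inf, 5, 7.5, 10, 12.5, 15, 17.5, 20, np.inf]
edges = rate_map[1:-1]  # finite edges only

# Build the same style of labels: "<5", "5 to 7.5", ..., "+20"
labels = (
    [f"<{edges[0]}"]
    + [f"{low} to {high}" for low, high in zip(edges, edges[1:])]
    + [f"+{edges[-1]}"]
)

# Hypothetical interest rates, for illustration only
rates = pd.Series([4.0, 7.5, 11.0, 21.0])
binned = pd.cut(rates, bins=rate_map, labels=labels)
print(binned.astype(str).to_list())
```

A value that sits exactly on an edge, such as 7.5, falls into the lower bin under this convention.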

In [None]:
import plotly.express as px
fig = px.parallel_categories(df_parallel, color="loan_status")
fig.show(renderer='jupyterlab')

## Conclusions

The correlation study leads to the following conclusions:

* Previous loan default has a positive, moderate correlation with loan approval.
* Loan-to-income ratio has a positive, weak correlation with loan approval.
* Interest rate has a positive, weak correlation with loan approval.
* Home ownership has a weak correlation with loan approval. The direction of this correlation differs with home ownership status (i.e. rent, own and mortgage).
* Income has a negative, very weak correlation with loan approval.