# Exploratory Data Analysis for the Taiwan Default Credit data set 

## Imports 

In [1]:
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

## Reading the data 

In [2]:
default_credit_df = pd.read_csv('../data/raw/default_credit_card_clients.csv')

default_credit_df = default_credit_df.set_index('ID').rename(
    columns = {'default payment next month': 'default_payment_next_month'})

-----

## Summary of the data set

## Partition the data set into training and test sets

Before proceeding further, we split the data set into training and test sets: $80\%$ of the observations go to the training set and $20\%$ to the test set. Overall, `default_of_credit_card_clients` has $30,000$ observations, so the training set will have $24,000$ observations and the test set $6,000$, which should be enough examples to give a reliable estimate of the model's performance.

Throughout the analysis, `random_state=123` will be used so that the results are reproducible.
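The effect of fixing `random_state` can be checked with a minimal sketch. It uses a hypothetical stand-in frame (`toy_df`, not the real data) with the same number of rows, and only illustrates that two calls with the same seed produce the same partition.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Hypothetical stand-in for the real data: 30,000 rows, as stated in the text
toy_df = pd.DataFrame({"x": np.arange(30_000)})

# Two independent splits that use the same seed
train_a, test_a = train_test_split(toy_df, test_size=0.2, random_state=123)
train_b, test_b = train_test_split(toy_df, test_size=0.2, random_state=123)

# Sizes of the two partitions
print(len(train_a), len(test_a))

# True when both calls selected the same rows in the same order
print(train_a.index.equals(train_b.index))
```

Changing the seed in one of the two calls would, in general, give a different partition of the same sizes.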

Note that no EDA will be performed on the test set, because it is reserved for estimating how well the model generalizes to unseen data.

In [3]:
# splitting the dataset into train and test sets
train_df, test_df = train_test_split(default_credit_df, test_size=0.2, random_state=123)

In [4]:
# printing the number of observations for train and test sets
print('The number of observations for train set: ', train_df['default_payment_next_month'].shape[0])
print('The number of observations for test set: ', test_df['default_payment_next_month'].shape[0])

The number of observations for train set:  24000
The number of observations for test set:  6000


In [5]:
# percentage of zeros and ones in the default column of the train set
train_percent_defaults = train_df['default_payment_next_month'].value_counts(normalize=True) * 100
train_percent_defaults.name = 'Default Count Percent'

# number of observations where default is one or zero in the train set
train_yes_default = len(train_df[train_df['default_payment_next_month'] == 1])
train_no_default = len(train_df[train_df['default_payment_next_month'] == 0])

In [6]:
# convert to a dataframe and make column names readable
default_percent_df = pd.DataFrame(train_percent_defaults)
default_percent_df = default_percent_df.rename(index = {0: 'No (0)', 1: 'Yes (1)'})

# build a dataframe of class counts, indexed by the readable class labels
default_count = pd.DataFrame(
    {'Count': [train_no_default, train_yes_default]},
    index = ['No (0)', 'Yes (1)'])

# join two dataframes
df_default = default_percent_df.join(default_count)
df_default

| | Default Count Percent | Count |
|---|---|---|
| No (0) | 77.783333 | 18668 |
| Yes (1) | 22.216667 | 5332 |


Both the counts and the percentages show an imbalance between the `No (0)` and `Yes (1)` classes: roughly $78\%$ versus $22\%$ of the training observations. We will need to take this into account later in the analysis; however, the imbalance does not seem severe enough to justify starting with over- or under-sampling. We will therefore begin without any resampling, and if the confusion matrix or other indicators during tuning show that the model makes many more mistakes on class `1` (the minority class), we will adjust the model accordingly.
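To make the concern concrete, the sketch below does two things. First, it computes the accuracy of a trivial model that always predicts `No (0)`, using the training-set counts reported above. Second, it shows how per-class recall can be read off a confusion matrix. The arrays `y_true` and `y_pred` are made up purely for illustration and are not results from this analysis.

```python
import numpy as np
from sklearn.metrics import confusion_matrix

# Class counts taken from the training-set summary above
n_no, n_yes = 18668, 5332

# Accuracy of a trivial model that always predicts "No (0)"
baseline_accuracy = n_no / (n_no + n_yes)
print(f"Majority-class baseline accuracy: {baseline_accuracy:.4f}")

# Hypothetical labels and predictions, invented for illustration only
y_true = np.array([0, 0, 0, 0, 0, 0, 1, 1, 1, 1])
y_pred = np.array([0, 0, 0, 0, 0, 1, 1, 0, 0, 0])

# Rows are true classes, columns are predicted classes
cm = confusion_matrix(y_true, y_pred, labels=[0, 1])
tn, fp, fn, tp = cm.ravel()

# Recall per class: the share of each true class that was predicted correctly
recall_no = tn / (tn + fp)
recall_yes = tp / (tp + fn)
print(f"Recall for class 0: {recall_no:.2f}")
print(f"Recall for class 1: {recall_yes:.2f}")
```

A model can score well on accuracy while its recall on class `1` stays low, which is why the confusion matrix is a more informative check here than accuracy alone.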

## Exploratory analysis on the training data set