## Exploratory Data Analysis

### Import

In [None]:
import pandas as pd
import altair as alt
from sklearn.model_selection import train_test_split
import warnings

# Suppress all warnings to keep the notebook output readable
warnings.filterwarnings('ignore')

# Save a vega-lite spec and a PNG blob for each plot in the notebook
alt.renderers.enable('mimetype')
# Handle large data sets without embedding them in the notebook
alt.data_transformers.enable('data_server')

### Read raw data set

In [None]:
# Skip the first row and use the ID column as the index
credit_df = pd.read_excel("../data/raw/credit_default_data.xlsx", index_col=0, skiprows=1)

# change a column name
credit_df = credit_df.rename(columns={'default payment next month': 'default_payment_next_month'})

# change target data type
credit_df["default_payment_next_month"] = credit_df["default_payment_next_month"].astype("category")

### Summary of the data set

The goal of the project is to predict whether a person will default on their credit card payment, based on the features provided to us. The data set contains 30,000 observations and 24 columns in total: 23 features and one target. There are no missing values. The target indicates whether the client defaults on their payment in the next month.

### The data

In [None]:
credit_df

### Table 1 information table

There are no missing values in our data set: each column has 30,000 non-null observations.

In [None]:
credit_df.info()
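
The missing-value claim can also be checked directly rather than read off the `info()` output. The following is a minimal sketch on a toy frame (`toy_df` and its values are made up for illustration and stand in for `credit_df`):

```python
import numpy as np
import pandas as pd

# Toy frame standing in for credit_df; the values are invented for illustration.
toy_df = pd.DataFrame(
    {
        "LIMIT_BAL": [20000, 120000, np.nan, 50000],
        "AGE": [24, 26, 34, 37],
    }
)

# Count missing values per column; a complete column reports 0.
missing_per_column = toy_df.isna().sum()
print(missing_per_column)

# True only if no cell in the whole frame is missing.
has_no_missing = not toy_df.isna().any().any()
print(has_no_missing)
```

Running the same two expressions on `credit_df` should report zero for every column if the data set is complete.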

### Table 2 describe table

The scales of our numeric features vary widely. For example, 'LIMIT_BAL' has a mean of 167484 and a standard deviation of 129747, whereas 'AGE' has a mean of 35.49 and a standard deviation of 9.22. Therefore, we may need to rescale these features before training the model.

In [None]:
credit_df.describe()
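
One common way to put features on a comparable scale is standardization. The sketch below assumes scikit-learn's `StandardScaler` is the chosen method (the notebook does not commit to one) and uses a toy frame with made-up values in place of the real data:

```python
import pandas as pd
from sklearn.preprocessing import StandardScaler

# Toy frame with two features on very different scales (invented values).
toy_df = pd.DataFrame(
    {
        "LIMIT_BAL": [20000.0, 120000.0, 90000.0, 50000.0],
        "AGE": [24.0, 26.0, 34.0, 37.0],
    }
)

# Standardize each column to zero mean and unit (population) variance.
scaler = StandardScaler()
scaled = pd.DataFrame(
    scaler.fit_transform(toy_df),
    columns=toy_df.columns,
    index=toy_df.index,
)

print(scaled.mean())
print(scaled.std(ddof=0))
```

In a modelling workflow the scaler would be fitted on the training split only and then applied to the test split, so that no test-set information influences the transformation.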

### Splitting data

To carry out the EDA, we split the data into an 80% training set and a 20% test set. We use random_state=522 to keep the results consistent across runs.

In [None]:
train_df, test_df = train_test_split(credit_df, test_size=0.2, random_state=522)

In [None]:
train_df

In [None]:
test_df
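
To illustrate what fixing `random_state` buys us, here is a small sketch on a toy frame (`toy_df` is made up for illustration). Two calls with the same seed should produce identical splits, and `test_size=0.2` should put 20% of the rows in the test set:

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Toy frame with 10 rows; values are invented for illustration.
toy_df = pd.DataFrame({"x": range(10), "y": [0, 1] * 5})

# Split twice with the same seed.
train_a, test_a = train_test_split(toy_df, test_size=0.2, random_state=522)
train_b, test_b = train_test_split(toy_df, test_size=0.2, random_state=522)

# Sizes follow the 80/20 ratio.
print(len(train_a), len(test_a))

# The same seed selects the same rows in the same order.
print(train_a.index.equals(train_b.index))
print(test_a.equals(test_b))
```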

### Table 3 Target count table

As shown in the table below, the target (default payment next month) is imbalanced: there are more non-default cases than default cases. We may need to apply class weights or another method to address this.

In [None]:
Target_df = pd.DataFrame(credit_df['default_payment_next_month'].value_counts())
Target_df
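
As a rough illustration of the class-weight idea, the sketch below computes class proportions and scikit-learn's "balanced" weights on a toy target. The toy labels are made up and do not reflect the real class ratio; choosing "balanced" weights is an assumption, since the notebook does not say which method will be used:

```python
import numpy as np
import pandas as pd
from sklearn.utils.class_weight import compute_class_weight

# Toy imbalanced target: six non-defaults (0) and two defaults (1).
toy_target = pd.Series([0, 0, 0, 0, 0, 0, 1, 1], name="default_payment_next_month")

# Share of each class, which is easier to read than raw counts.
proportions = toy_target.value_counts(normalize=True)
print(proportions)

# "Balanced" weights are n_samples / (n_classes * class_count),
# so the rarer class receives the larger weight.
classes = np.array([0, 1])
weights = compute_class_weight(
    class_weight="balanced", classes=classes, y=toy_target.to_numpy()
)
class_weight = dict(zip(classes.tolist(), weights.tolist()))
print(class_weight)
```

Many scikit-learn classifiers accept `class_weight="balanced"` directly, which applies the same formula internally.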

### Comparing the numeric features in the two classes

As shown below, for each feature there is no overlap between the two target classes (this may be due to the imbalanced class distribution). However, we can clearly see that both classes have approximately the same shape. For example, the distribution of 'AGE' is right-skewed for both classes. In fact, most of our numeric features are right-skewed. We may need to take this into account when fitting the model.

In [None]:
num_cols = ["LIMIT_BAL", "AGE", "BILL_AMT1", "BILL_AMT2", "BILL_AMT3", "BILL_AMT4",
            "BILL_AMT5", "BILL_AMT6", "PAY_AMT1", "PAY_AMT2", "PAY_AMT3",
            "PAY_AMT4", "PAY_AMT5", "PAY_AMT6"]

alt.Chart(train_df).mark_bar().encode(
     alt.X(alt.repeat(), type='quantitative', bin=alt.Bin(maxbins=30)),
     y='count()',
     color='default_payment_next_month'
).properties(
    width=200,
    height=150
).repeat(
    num_cols,
    columns=3
)
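
Right skew can also be quantified and, if needed, reduced. The sketch below measures skewness with `Series.skew()` and applies a `log1p` transform to a toy series of non-negative amounts. The values are made up, and the log transform is only one possible remedy that the notebook does not commit to. Note that `log1p` is only defined for values greater than -1, so any column that can take negative values would need a different treatment:

```python
import numpy as np
import pandas as pd

# Toy right-skewed, non-negative amounts (invented values).
toy_amounts = pd.Series([0, 100, 200, 300, 500, 1000, 50000], name="PAY_AMT1")

# Positive skewness indicates a long right tail.
skew_before = toy_amounts.skew()

# log1p(x) = log(1 + x) compresses large values and keeps zero at zero.
log_amounts = np.log1p(toy_amounts)
skew_after = log_amounts.skew()

print(skew_before, skew_after)
```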

### Comparing the categorical and ordinal features in the two classes

In [None]:
cat_col = ["EDUCATION", "MARRIAGE", "SEX", "PAY_0", "PAY_2", "PAY_3", "PAY_4", "PAY_5", "PAY_6"]

alt.Chart(train_df).mark_bar().encode(
     alt.X(alt.repeat(), type='quantitative', bin=alt.Bin(maxbins=10)),
     y='count()',
     color='default_payment_next_month'
).properties(
    width=200,
    height=150
).repeat(
    cat_col,
    columns=3
)

### Correlation matrix

The correlation matrix shows that 'PAY_0', 'PAY_2', 'PAY_3', 'PAY_4', 'PAY_5', and 'PAY_6' are quite highly correlated with each other. The same holds for 'BILL_AMT1', 'BILL_AMT2', 'BILL_AMT3', 'BILL_AMT4', 'BILL_AMT5', and 'BILL_AMT6'. For example, 'BILL_AMT1' and 'BILL_AMT2' have a correlation of 0.95, which is quite high. We may need to take this into account when training the model, for instance by dropping one feature from each strongly correlated pair.

In [None]:
train_df.corr().style.background_gradient(cmap='coolwarm')
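
Reading strongly correlated pairs off a coloured matrix can be error-prone, so it may help to list them programmatically. The sketch below does this on synthetic data. The toy columns only borrow names from the data set, and the 0.9 cutoff is a hypothetical threshold chosen for illustration:

```python
import numpy as np
import pandas as pd

# Synthetic data: BILL_AMT2 is BILL_AMT1 plus a little noise, AGE is independent.
rng = np.random.default_rng(522)
base = rng.normal(size=200)
toy_df = pd.DataFrame(
    {
        "BILL_AMT1": base,
        "BILL_AMT2": base + rng.normal(scale=0.1, size=200),
        "AGE": rng.normal(size=200),
    }
)

corr = toy_df.corr()

# Keep only the upper triangle so each pair appears once and the diagonal is dropped.
upper = np.triu(np.ones(corr.shape, dtype=bool), k=1)
pairs = corr.where(upper).stack()

# Hypothetical cutoff for "strongly correlated".
threshold = 0.9
high_pairs = pairs[pairs.abs() > threshold]
print(high_pairs)
```

Applied to `train_df.corr()`, the same steps would list every feature pair above the chosen cutoff, which gives a concrete starting point for deciding which features to drop.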