# 🧾 Table of Contents

- [3. Exploring the Data](#3.-Exploring-the-Data)
  - [Preparing our toolbox](#Preparing-our-toolbox)
  - [3.1 Application_train.csv](#3.1-Application_train.csv)
    - [Exploring: Data Types, Missing Values, Noisiness, and Distribution](#Exploring:-Data-Types,-Missing-Values,-Noisiness,-and-Distribution)
      - [Highlights](#Highlights)
    - [Investigating](#Investigating)
      - [Target](#Target)
      - [Categorical Features for Transformation](#Categorical-Features-for-Transformation)
      - [Highlights](#Highlights)
    - [Decisions](#Decisions)

# 3. Exploring the Data

This exploratory data analysis aims to gain insights for a **first iteration** of data preparation.

## Preparing our toolbox

In [85]:
%load_ext autoreload
%autoreload 2

from src.data.explore_data import (
        list_datasets, 
        describe_feature, 
        overview_data, 
        create_dataframe, 
        describe_features, 
        create_exploratory_dataset,
        create_decision_dataset
)
import pandas as pd
import numpy as np


## 3.1 Application_train.csv

In [3]:
create_exploratory_dataset()
df_exploratory = create_dataframe('interim', 'application_exploratory.csv')

### Exploring: Data Types, Missing Values, Noisiness, and Distribution

In [4]:
with pd.option_context('display.max_rows', len(df_exploratory)):
    display(df_exploratory)

| Unnamed: 0 | Column | Description | NanPercentage | DataType | FeatureType | LowerOutliers | UpperOutliers | Z3Outliers | NormalDistribution | Skewness | Kurtosis | CorrWithTarget |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | SK_ID_CURR | ID of loan in our sample | 0.0 | int64 | id | | | | | | | -0.0 |
| 1 | TARGET | Target variable (1 - client with payment diffi... | 0.0 | int64 | target | | | | | | | 1.0 |
| 2 | NAME_CONTRACT_TYPE | Identification if loan is cash or revolving | 0.0 | object | categorical | | | | | | | |
| 3 | CODE_GENDER | Gender of the client | 0.0 | object | categorical | | | | | | | |
| 4 | FLAG_OWN_CAR | Flag if the client owns a car | 0.0 | object | categorical | | | | | | | |
| 5 | FLAG_OWN_REALTY | Flag if client owns a house or flat | 0.0 | object | categorical | | | | | | | |
| 6 | CNT_CHILDREN | Number of children the client has | 0.0 | int64 | numeric | | | | | | | 0.02 |
| 7 | AMT_INCOME_TOTAL | Income of the client | 0.0 | float64 | numeric | 0.0 | 12626.0 | 209.0 | no | 164881.5 | 369.58 | -0.0 |
| 8 | AMT_CREDIT | Credit amount of the loan | 0.0 | float64 | numeric | 0.0 | 5278.0 | 2630.0 | no | 1.96 | 1.24 | -0.03 |
| 9 | AMT_ANNUITY | Loan annuity | 0.0 | float64 | numeric | 0.0 | 6012.0 | 0.0 | yes | 7.76 | 1.58 | -0.01 |


#### Highlights
- The correlation (Pearson's r) of the features with the target worried me: I could not find a column with a clear correlation, and I was expecting to select a few for data visualization. Also, I guess the ```FLAG_DOCUMENT__``` columns might be grouped into a single feature, given their small correlation.
- Many columns have a high percentage of NaN records. I believe all columns with a ```NanPercentage``` higher than 30% might be removed. In the remaining columns, missing values will be replaced by the **mean** or the **most_frequent** value.
- Every **object** and **categorical** column will be transformed with **OneHotEncoder**, except for ```NAME_EDUCATION_TYPE``` and ```WEEKDAY_APPR_PROCESS_START```, which will use **OrdinalEncoder**. I am also going to check the ```value_counts()``` of these columns.
- Every record with a feature whose absolute z-score is greater than 3 (an outlier) will be removed.
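To make the missing-value and outlier rules above concrete, here is a minimal sketch on a toy DataFrame. The data is made up, ```MOSTLY_MISSING``` is a hypothetical column, and plain pandas stands in for whatever the project pipeline ends up using; only the 30% threshold, the mean / most-frequent imputation, and the |z| > 3 rule come from the notes above.

```python
import numpy as np
import pandas as pd

# Toy stand-in for application_train.csv; all values are made up.
toy = pd.DataFrame({
    "AMT_CREDIT": [100.0 + i for i in range(19)] + [5000.0],   # last value is an outlier
    "AMT_ANNUITY": [10.0, 20.0] * 8 + [np.nan] * 4,            # 20% NaN, will be imputed
    "MOSTLY_MISSING": [1.0] * 5 + [np.nan] * 15,               # hypothetical, 75% NaN
    "NAME_TYPE_SUITE": ["Unaccompanied"] * 15 + ["Family"] * 3 + [None] * 2,
})

# 1. Drop every column whose NaN percentage is higher than 30%.
nan_pct = toy.isna().mean() * 100
kept = toy.loc[:, nan_pct <= 30].copy()

# 2. Impute what is left: mean for numeric columns, most frequent value otherwise.
numeric_cols = kept.select_dtypes(include="number").columns
other_cols = kept.columns.difference(numeric_cols)
kept[numeric_cols] = kept[numeric_cols].fillna(kept[numeric_cols].mean())
for col in other_cols:
    kept[col] = kept[col].fillna(kept[col].mode().iloc[0])

# 3. Remove every record with an absolute z-score above 3 in any numeric column.
z_scores = (kept[numeric_cols] - kept[numeric_cols].mean()) / kept[numeric_cols].std()
cleaned = kept[~(z_scores.abs() > 3).any(axis=1)]
```

Note that the z-score filter drops a whole record as soon as one of its features is extreme, so on a wide table it can remove many more rows than any single column's outlier count suggests.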

### Investigating

Based on the Highlights, it seems like a good idea to check for imbalance in the target (which is almost always the case), the values of the categorical features, and possible new features.

In [15]:
df_exploratory['Column'][(df_exploratory['DataType']=='object') 
                         & (df_exploratory['FeatureType']=='categorical')
                         & (df_exploratory['NanPercentage']<30)]

2             NAME_CONTRACT_TYPE
3                    CODE_GENDER
4                   FLAG_OWN_CAR
5                FLAG_OWN_REALTY
11               NAME_TYPE_SUITE
12              NAME_INCOME_TYPE
13           NAME_EDUCATION_TYPE
14            NAME_FAMILY_STATUS
15             NAME_HOUSING_TYPE
32    WEEKDAY_APPR_PROCESS_START
40             ORGANIZATION_TYPE
Name: Column, dtype: object
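The columns listed above are the candidates for encoding. As a rough illustration of the difference between the two planned encodings, the sketch below uses plain pandas (```pd.get_dummies``` and an ordered ```pd.Categorical```) as a stand-in for scikit-learn's OneHotEncoder and OrdinalEncoder. The toy rows and the ranking of the education levels are my assumptions, not something taken from the data description.

```python
import pandas as pd

# Toy rows; the category labels appear in the dataset, the rows are made up.
toy = pd.DataFrame({
    "NAME_CONTRACT_TYPE": ["Cash loans", "Revolving loans", "Cash loans"],
    "NAME_EDUCATION_TYPE": [
        "Higher education",
        "Secondary / secondary special",
        "Incomplete higher",
    ],
})

# Assumed ranking (lowest to highest); only three of the levels are shown here.
education_order = [
    "Secondary / secondary special",
    "Incomplete higher",
    "Higher education",
]

# Ordinal encoding: each level becomes its position in the assumed ranking.
toy["NAME_EDUCATION_TYPE"] = pd.Categorical(
    toy["NAME_EDUCATION_TYPE"], categories=education_order, ordered=True
).codes

# One-hot encoding: one 0/1 column per category, with no implied order.
encoded = pd.get_dummies(toy, columns=["NAME_CONTRACT_TYPE"], dtype=int)
```

Any value missing from ```education_order``` is encoded as -1, so the full list of levels should be checked with ```value_counts()``` before relying on this.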

In [6]:
df = create_dataframe('raw', 'application_train.csv')

#### Target

In [7]:
df['TARGET'].value_counts(normalize=True, dropna=False)

0    0.918824
1    0.081176
Name: TARGET, dtype: float64
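With roughly 92% of the records in class 0, some form of resampling is one way to handle the imbalance. The sketch below shows random undersampling of the majority class on toy data. Which sampling strategy the project actually uses is not decided here, so treat this as one possible option rather than the chosen method.

```python
import pandas as pd

# Toy data with roughly the same class ratio as TARGET (92% vs 8%).
toy = pd.DataFrame({
    "TARGET": [0] * 92 + [1] * 8,
    "AMT_CREDIT": range(100),   # made-up feature values
})

# Size of the smallest class.
minority_size = int(toy["TARGET"].value_counts().min())

# Draw the same number of rows from each class (random undersampling).
balanced = toy.groupby("TARGET").sample(n=minority_size, random_state=42)
```

Undersampling throws away most of the majority class, so it trades data volume for balance. Oversampling or class weights are alternatives worth comparing.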

#### Categorical Features for Transformation

In [18]:
categorical_object_features = df_exploratory['Column'][(df_exploratory['DataType']=='object') 
                                                       & (df_exploratory['FeatureType']=='categorical')
                                                       & (df_exploratory['NanPercentage']<30)].to_list()

for feature in categorical_object_features:
    print(f'{df[feature].value_counts(normalize=True, dropna=False)}\n')

Cash loans         0.904938
Revolving loans    0.095062
Name: NAME_CONTRACT_TYPE, dtype: float64

F      0.657975
M      0.342013
XNA    0.000012
Name: CODE_GENDER, dtype: float64

N    0.659958
Y    0.340042
Name: FLAG_OWN_CAR, dtype: float64

Y    0.693209
N    0.306791
Name: FLAG_OWN_REALTY, dtype: float64

Unaccompanied      0.807803
Family             0.130666
Spouse, partner    0.037076
Children           0.010626
Other_B            0.005886
NaN                0.004260
Other_A            0.002801
Group of people    0.000882
Name: NAME_TYPE_SUITE, dtype: float64

Working                 0.515914
Commercial associate    0.233005
Pensioner               0.180437
State servant           0.070473
Unemployed              0.000069
Student                 0.000049
Businessman             0.000041
Maternity leave         0.000012
Name: NAME_INCOME_TYPE, dtype: float64

Secondary / secondary special    0.710672
Higher education                 0.243212
Incomplete higher                0.03

In [1]:
# Feature Creation

#### Highlights
- Target is imbalanced.
- Some hidden NaNs in the categorical data:
  - ```CODE_GENDER```: XNA.
  - ```NAME_TYPE_SUITE```: NaN.
  - ```NAME_FAMILY_STATUS```: Unknown.
  - ```ORGANIZATION_TYPE```: XNA.
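One way to surface the hidden NaNs listed above is to map each placeholder to a real NaN before imputing or encoding. This is a minimal sketch on toy rows. The ```placeholders``` mapping is a hypothetical helper built from the list above, and the other cell values are made up.

```python
import pandas as pd

# Toy rows; only the placeholder tokens (XNA, Unknown) come from the findings above.
toy = pd.DataFrame({
    "CODE_GENDER": ["F", "M", "XNA"],
    "NAME_FAMILY_STATUS": ["Married", "Unknown", "Widow"],
    "ORGANIZATION_TYPE": ["XNA", "School", "Bank"],
})

# Hypothetical mapping: column name -> tokens that really mean "missing".
placeholders = {
    "CODE_GENDER": ["XNA"],
    "NAME_FAMILY_STATUS": ["Unknown"],
    "ORGANIZATION_TYPE": ["XNA"],
}

cleaned = toy.copy()
for col, tokens in placeholders.items():
    # mask() sets the cell to NaN wherever the condition is True.
    cleaned[col] = cleaned[col].mask(cleaned[col].isin(tokens))
```

Restricting the replacement to specific columns avoids wiping out a legitimate "Unknown" category in some other feature.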

### Decisions
How the DataFrame will be manipulated in the first iteration.

In [86]:
create_decision_dataset()
df_decision = create_dataframe('interim', 'application_decision.csv')

In [87]:
with pd.option_context('display.max_rows', len(df_decision)):
    display(df_decision)

| Unnamed: 0 | Column | NanDecision | TypeDecision | OutliersDecision | CorrDecision |
|---|---|---|---|---|---|
| 0 | SK_ID_CURR | | | | |
| 1 | TARGET | | sample | | |
| 2 | NAME_CONTRACT_TYPE | | OneHotEncoder | | |
| 3 | CODE_GENDER | | clean and then OneHotEncoder | | |
| 4 | FLAG_OWN_CAR | | OneHotEncoder | | |
| 5 | FLAG_OWN_REALTY | | OneHotEncoder | | |
| 6 | CNT_CHILDREN | | | | |
| 7 | AMT_INCOME_TOTAL | | | remove | |
| 8 | AMT_CREDIT | | | remove | |
| 9 | AMT_ANNUITY | | | | |


*Next notebook: [4.0-ejk-](3.0-ejk-eda-applications.ipynb).*