# This notebook covers exploratory data analysis on the application_train.csv dataset.

## EDA steps:
- Investigating the shape of the dataframe
- Investigating the default indicator feature ("TARGET")
- Investigating columns with high missingness
- Investigating columns with sentinel values ('365243')
- Investigating summary statistics for key features (income, age, credit amount)

In [2]:
import pandas as pd
from pathlib import Path

In [3]:
root = Path.cwd().parent

path = root / "data" / "interim" / "application_train.csv"

df = pd.read_csv(path)

In [4]:
# Viewing the general shape of the dataframe (307511 rows, 122 columns)
df.shape

(307511, 122)

In [5]:
# Viewing the general distribution of the default indicating feature ("TARGET")
df['TARGET'].value_counts()

TARGET
0    282686
1     24825
Name: count, dtype: int64

In [6]:
# Viewing the normalized distribution of the default indicating feature ("TARGET")
df['TARGET'].value_counts(normalize = True)

TARGET
0    0.919271
1    0.080729
Name: proportion, dtype: float64

In [7]:
# Computing the share of missing values per column, sorted from most to least missing
missing = df.isna().mean().sort_values(ascending = False)

In [8]:
# Displaying the 60 columns with the highest missingness
missing.head(60)

COMMONAREA_AVG                  0.698723
COMMONAREA_MODE                 0.698723
COMMONAREA_MEDI                 0.698723
NONLIVINGAPARTMENTS_MEDI        0.694330
NONLIVINGAPARTMENTS_MODE        0.694330
NONLIVINGAPARTMENTS_AVG         0.694330
FONDKAPREMONT_MODE              0.683862
LIVINGAPARTMENTS_AVG            0.683550
LIVINGAPARTMENTS_MEDI           0.683550
LIVINGAPARTMENTS_MODE           0.683550
FLOORSMIN_MODE                  0.678486
FLOORSMIN_AVG                   0.678486
FLOORSMIN_MEDI                  0.678486
YEARS_BUILD_AVG                 0.664978
YEARS_BUILD_MODE                0.664978
YEARS_BUILD_MEDI                0.664978
OWN_CAR_AGE                     0.659908
LANDAREA_MEDI                   0.593767
LANDAREA_AVG                    0.593767
LANDAREA_MODE                   0.593767
BASEMENTAREA_MODE               0.585160
BASEMENTAREA_MEDI               0.585160
BASEMENTAREA_AVG                0.585160
EXT_SOURCE_1                    0.563811
NONLIVINGAREA_MO

In [9]:
# Counting the columns with more than 50% missing values, which will likely be dropped in a later processing step
(missing > .5).sum()

np.int64(41)

After some very preliminary EDA on the application_train.csv dataset, a few insights stand out. Most importantly, about 8.1% (0.080729) of applicants default ('TARGET' is the default indicator: 1 for default, 0 for no default). Additionally, quite a few columns have high missingness: 41 columns are more than 50% missing. This is to be expected, since not all of the columns are mandatory entries in the application_train dataset. For example, cols 47-93 represent housing features like "is there an elevator present", and not all applicants will have an elevator in their home.

This means I will have to dig deeper to discover which features matter for the default rate when building early logistic regression models. If many of these high-missingness features turn out not to be statistically significant (high p-values) across multiple regression models, they will also be dropped. This is both for model optimization and to remove statistical noise before I begin tweaking the cutoff points for approval.
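The threshold-based drop described above could be sketched roughly as follows. This is only an illustration on a made-up toy frame: the helper name `high_missing_columns` is hypothetical, and the 0.5 cutoff is the assumption taken from the cell above, not a settled choice.

```python
import numpy as np
import pandas as pd


def high_missing_columns(frame: pd.DataFrame, threshold: float = 0.5) -> list[str]:
    """Return the names of columns whose share of NaN values exceeds `threshold`."""
    missing_share = frame.isna().mean()
    return missing_share[missing_share > threshold].index.tolist()


# Toy data standing in for application_train.csv (values are made up).
toy = pd.DataFrame(
    {
        "TARGET": [0, 1, 0, 0],
        "COMMONAREA_AVG": [np.nan, np.nan, np.nan, 0.02],      # 75% missing
        "AMT_CREDIT": [270000.0, 513531.0, np.nan, 808650.0],  # 25% missing
    }
)

to_drop = high_missing_columns(toy)      # columns above the 50% cutoff
reduced = toy.drop(columns=to_drop)      # frame without those columns
print(to_drop)
print(reduced.columns.tolist())
```

Keeping the threshold as a parameter makes it easy to revisit the cutoff later if some high-missingness features turn out to be predictive.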

In [10]:
# Counting, per column, how many entries equal the sentinel value 365243
(df == 365243).sum().sort_values(ascending=False).head(30)

DAYS_EMPLOYED                 55374
SK_ID_CURR                        1
NAME_CONTRACT_TYPE                0
CODE_GENDER                       0
FLAG_OWN_CAR                      0
FLAG_OWN_REALTY                   0
CNT_CHILDREN                      0
AMT_INCOME_TOTAL                  0
AMT_CREDIT                        0
TARGET                            0
AMT_ANNUITY                       0
AMT_GOODS_PRICE                   0
NAME_INCOME_TYPE                  0
NAME_TYPE_SUITE                   0
NAME_FAMILY_STATUS                0
NAME_HOUSING_TYPE                 0
REGION_POPULATION_RELATIVE        0
NAME_EDUCATION_TYPE               0
DAYS_BIRTH                        0
DAYS_REGISTRATION                 0
DAYS_ID_PUBLISH                   0
OWN_CAR_AGE                       0
FLAG_MOBIL                        0
FLAG_EMP_PHONE                    0
FLAG_WORK_PHONE                   0
FLAG_CONT_MOBILE                  0
FLAG_PHONE                        0
FLAG_EMAIL                  

In [11]:
df['DAYS_EMPLOYED'].head(30)

0       -637
1      -1188
2       -225
3      -3039
4      -3038
5      -1588
6      -3130
7       -449
8     365243
9      -2019
10      -679
11    365243
12     -2717
13     -3028
14      -203
15     -1157
16     -1317
17      -191
18     -7804
19     -2038
20     -4286
21     -1652
22     -4306
23    365243
24      -746
25     -3494
26     -2628
27     -1234
28     -1796
29     -1010
Name: DAYS_EMPLOYED, dtype: int64

Checking for the placeholder value Home Credit uses (365243), one column, "DAYS_EMPLOYED", contains 55374 instances of the sentinel. The value also appears once in SK_ID_CURR, which is most likely a coincidence: the generated ID key of one row happens to equal 365243. During data processing, the sentinel values in DAYS_EMPLOYED will have to be converted to NaN.
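One possible way to do that conversion is sketched below on a made-up toy frame. The replacement is restricted to DAYS_EMPLOYED so that an ID that happens to equal 365243 is left alone. The extra `DAYS_EMPLOYED_MISSING` flag column is an assumption added for illustration, not something the dataset or the later pipeline requires.

```python
import pandas as pd

SENTINEL = 365243  # placeholder value observed in DAYS_EMPLOYED

# Toy rows (made up); the second ID deliberately equals the sentinel.
toy = pd.DataFrame(
    {
        "SK_ID_CURR": [100002, 365243, 100004],
        "DAYS_EMPLOYED": [-637, 365243, 365243],
    }
)

is_sentinel = toy["DAYS_EMPLOYED"].eq(SENTINEL)

# Optional (hypothetical) indicator so the information is not lost after masking.
toy["DAYS_EMPLOYED_MISSING"] = is_sentinel

# Series.mask replaces entries where the condition is True with NaN;
# the column is upcast from int64 to float64 as a result.
toy["DAYS_EMPLOYED"] = toy["DAYS_EMPLOYED"].mask(is_sentinel)

print(toy)
```

Applying the mask to a single column, rather than to the whole frame, avoids touching SK_ID_CURR.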

In [12]:
# Summary statistics for income
df["AMT_INCOME_TOTAL"].describe().apply(lambda x: format(x, 'f'))

count       307511.000000
mean        168797.919297
std         237123.146279
min          25650.000000
25%         112500.000000
50%         147150.000000
75%         202500.000000
max      117000000.000000
Name: AMT_INCOME_TOTAL, dtype: object

It looks like there may be some large outliers in this column (the maximum of 117,000,000 is far above the 75th percentile of 202,500), but they aren't necessarily bad data. Some association between lower income and default odds may be revealed, but I will have to remove or otherwise handle the outliers to get useful summary statistics from this column.
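Two common ways of handling such outliers are capping at an upper quantile and applying a log transform. The sketch below shows both on a small made-up series; the 99th-percentile cap is an arbitrary assumption for illustration, and neither approach has been chosen for this project yet.

```python
import numpy as np
import pandas as pd

# Toy incomes (made up) with one extreme value, mimicking AMT_INCOME_TOTAL.
income = pd.Series(
    [25650.0, 112500.0, 147150.0, 202500.0, 117000000.0],
    name="AMT_INCOME_TOTAL",
)

# Option 1: cap values above the 99th percentile (assumed cutoff).
upper = income.quantile(0.99)
income_capped = income.clip(upper=upper)

# Option 2: compress the scale with log(1 + x) instead of altering values.
income_log = np.log1p(income)

print(income_capped.describe())
print(income_log.describe())
```

Note that the median is unaffected by capping the upper tail, which is one reason the median comparisons later in the notebook are more robust than the mean.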

In [13]:
# Summary statistics for credit (loan) amount
df["AMT_CREDIT"].describe().apply(lambda x: format(x, 'f'))

count     307511.000000
mean      599025.999706
std       402490.776996
min        45000.000000
25%       270000.000000
50%       513531.000000
75%       808650.000000
max      4050000.000000
Name: AMT_CREDIT, dtype: object

Similar to income in general: the summary statistics are currently less helpful because of outliers.

In [14]:
# Summary statistics for age
df["DAYS_BIRTH"].describe().apply(lambda x: format(x, 'f'))

count    307511.000000
mean     -16036.995067
std        4363.988632
min      -25229.000000
25%      -19682.000000
50%      -15750.000000
75%      -12413.000000
max       -7489.000000
Name: DAYS_BIRTH, dtype: object

Note that because ages are stored as negative day counts, "max" is the youngest individual (20.5 years) and "min" is the oldest (69.1 years).

The summary statistics here are pretty useful. Given the nature of aging, the distribution is likely fine, and there don't appear to be any outliers.
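The conversion from negative day counts to ages in years can be sketched as below, using the min, median, and max values from the output above. The divisor of 365.25 days per year and the `AGE_YEARS` name are assumptions for illustration.

```python
import pandas as pd

DAYS_PER_YEAR = 365.25  # assumption: average year length including leap years

# min, median, and max of DAYS_BIRTH from the describe() output above
days_birth = pd.Series([-25229, -15750, -7489], name="DAYS_BIRTH")

# Flip the sign, then convert days to years.
age_years = (-days_birth / DAYS_PER_YEAR).rename("AGE_YEARS")

print(age_years.round(1).tolist())
```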

In [15]:
# Median income values for rows that default or not
df.groupby("TARGET")["AMT_INCOME_TOTAL"].median()

TARGET
0    148500.0
1    135000.0
Name: AMT_INCOME_TOTAL, dtype: float64

In general, this suggests there may be some association between lower income and defaulting: the median income of defaulters is lower.

In [16]:
# Median credit (loan amount) values for accounts that default or not
df.groupby("TARGET")["AMT_CREDIT"].median()

TARGET
0    517788.0
1    497520.0
Name: AMT_CREDIT, dtype: float64

This is related insofar as lower-income individuals are likely applying for (and being approved for) smaller loans in general.

In [17]:
# Median age values for rows that default or not
df.groupby("TARGET")["DAYS_BIRTH"].median()

TARGET
0   -15877.0
1   -14282.0
Name: DAYS_BIRTH, dtype: float64

This tends to suggest that younger individuals are more likely to default on loans: the median DAYS_BIRTH for defaulters is less negative (roughly 39 versus 43 years).

Based on these three comparisons, we are beginning to see some expected patterns: younger, lower-income individuals default on loans more often. This was the base assumption, though. While these three features will likely carry some of the most weight during logistic regression, there may be other features that affect default probability. These "less-significant" features may end up being the ones that can be tweaked to change approval odds, increase loan profits, and decrease the default rate.
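As a possible next step before modelling, the median comparisons could be complemented by the default rate per age band. The sketch below uses a made-up toy frame, so its numbers say nothing about the real data; the band edges and the `AGE_YEARS` and `AGE_BAND` column names are assumptions for illustration.

```python
import pandas as pd

# Toy applicants (made up); TARGET is 1 for default, 0 otherwise.
toy = pd.DataFrame(
    {
        "AGE_YEARS": [22, 27, 35, 41, 48, 55, 63, 68],
        "TARGET": [1, 1, 1, 0, 0, 0, 0, 0],
    }
)

# Bucket ages into right-closed bands: (20, 30], (30, 45], (45, 70].
toy["AGE_BAND"] = pd.cut(toy["AGE_YEARS"], bins=[20, 30, 45, 70])

# The mean of a 0/1 indicator is the share of defaults within each band.
default_rate = toy.groupby("AGE_BAND", observed=True)["TARGET"].mean()

print(default_rate)
```

The same pattern works for income by swapping `pd.cut` for `pd.qcut`, which builds bands from quantiles rather than fixed edges.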