## Problem One: Confirmation of competition contents

* What to learn and what to predict?

* What kind of file to create and submit to Kaggle?

* What kind of index value will be used to evaluate the submissions?





What to learn and what to predict?
• Learn from the loan application data and predict how capable each applicant is of repaying a loan.
What kind of file to create and submit to Kaggle?
• A CSV file containing the predicted probability of the TARGET variable for each applicant in the test set. The file must have a header row with the columns SK_ID_CURR and TARGET.
What kind of index value will be used to evaluate the submissions?
• Submissions are evaluated on the area under the ROC curve between the predicted probability and the observed target. The ROC curve is created by plotting the true positive rate (TPR) against the false positive rate (FPR) at various threshold settings. The true positive rate is also known as sensitivity, recall, or probability of detection.
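
As a rough illustration of the metric and the file layout, the sketch below computes the ROC curve and its area for four hand-made labels and scores with scikit-learn, then builds a frame with the two required columns. The labels, scores, and IDs are made-up values for illustration only, not competition data.

```python
import numpy as np
import pandas as pd
from sklearn.metrics import auc, roc_auc_score, roc_curve

# Toy labels and predicted probabilities (illustrative values only)
y_true = np.array([0, 0, 1, 1])
y_score = np.array([0.1, 0.4, 0.35, 0.8])

# TPR and FPR at every threshold implied by the scores
fpr, tpr, thresholds = roc_curve(y_true, y_score)

# Area under the curve, computed directly and from the curve points
auc_direct = roc_auc_score(y_true, y_score)
auc_trapezoid = auc(fpr, tpr)
print(auc_direct, auc_trapezoid)

# A submission frame has one row per applicant and the two required columns
submission = pd.DataFrame({"SK_ID_CURR": [1, 2, 3, 4], "TARGET": y_score})
print(submission)
```

Three of the four positive/negative pairs are ranked correctly here, which is why the area comes out at 0.75. Only the ordering of the scores matters for this metric, not their absolute values.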

## Problem Two: Learning and verification

In [144]:
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
import matplotlib.pyplot as plt
import seaborn as sns
from scipy.stats import chi2_contingency

In [145]:
# Loading the dataset
df_train = pd.read_csv("application_train.csv")
df_test = pd.read_csv("application_test.csv")

df_train.head()

Unnamed: 0,SK_ID_CURR,TARGET,NAME_CONTRACT_TYPE,CODE_GENDER,FLAG_OWN_CAR,FLAG_OWN_REALTY,CNT_CHILDREN,AMT_INCOME_TOTAL,AMT_CREDIT,AMT_ANNUITY,...,FLAG_DOCUMENT_18,FLAG_DOCUMENT_19,FLAG_DOCUMENT_20,FLAG_DOCUMENT_21,AMT_REQ_CREDIT_BUREAU_HOUR,AMT_REQ_CREDIT_BUREAU_DAY,AMT_REQ_CREDIT_BUREAU_WEEK,AMT_REQ_CREDIT_BUREAU_MON,AMT_REQ_CREDIT_BUREAU_QRT,AMT_REQ_CREDIT_BUREAU_YEAR
0,100002,1,Cash loans,M,N,Y,0,202500.0,406597.5,24700.5,...,0,0,0,0,0.0,0.0,0.0,0.0,0.0,1.0
1,100003,0,Cash loans,F,N,N,0,270000.0,1293502.5,35698.5,...,0,0,0,0,0.0,0.0,0.0,0.0,0.0,0.0
2,100004,0,Revolving loans,M,Y,Y,0,67500.0,135000.0,6750.0,...,0,0,0,0,0.0,0.0,0.0,0.0,0.0,0.0
3,100006,0,Cash loans,F,N,Y,0,135000.0,312682.5,29686.5,...,0,0,0,0,,,,,,
4,100007,0,Cash loans,M,N,Y,0,121500.0,513000.0,21865.5,...,0,0,0,0,0.0,0.0,0.0,0.0,0.0,0.0


In [146]:
# info() prints its summary and returns None, so it does not need to be wrapped in display()
df_train.info()
df_test.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 307511 entries, 0 to 307510
Columns: 122 entries, SK_ID_CURR to AMT_REQ_CREDIT_BUREAU_YEAR
dtypes: float64(65), int64(41), object(16)
memory usage: 286.2+ MB

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 48744 entries, 0 to 48743
Columns: 121 entries, SK_ID_CURR to AMT_REQ_CREDIT_BUREAU_YEAR
dtypes: float64(65), int64(40), object(16)
memory usage: 45.0+ MB

In [147]:
display(df_train.shape)
display(df_test.shape)

(307511, 122)

(48744, 121)

* The test dataset is missing the target column because target is the variable we're trying to predict.
* We have 16 columns of data type object
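
The column difference between the two files can also be checked programmatically instead of by comparing shapes. The sketch below uses two tiny stand-in frames with hypothetical values; with the real data the same two lines would be applied to `df_train` and `df_test`.

```python
import pandas as pd

# Tiny stand-ins for df_train and df_test (hypothetical values)
train = pd.DataFrame({"SK_ID_CURR": [1, 2], "TARGET": [0, 1], "CNT_CHILDREN": [0, 2]})
test = pd.DataFrame({"SK_ID_CURR": [3, 4], "CNT_CHILDREN": [1, 0]})

# Columns present in one frame but not in the other
only_in_train = train.columns.difference(test.columns).tolist()
only_in_test = test.columns.difference(train.columns).tolist()
print(only_in_train, only_in_test)
```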

In [148]:
obj_cols = df_train.columns[df_train.dtypes == "object"]
obj_cols

Index(['NAME_CONTRACT_TYPE', 'CODE_GENDER', 'FLAG_OWN_CAR', 'FLAG_OWN_REALTY',
       'NAME_TYPE_SUITE', 'NAME_INCOME_TYPE', 'NAME_EDUCATION_TYPE',
       'NAME_FAMILY_STATUS', 'NAME_HOUSING_TYPE', 'OCCUPATION_TYPE',
       'WEEKDAY_APPR_PROCESS_START', 'ORGANIZATION_TYPE', 'FONDKAPREMONT_MODE',
       'HOUSETYPE_MODE', 'WALLSMATERIAL_MODE', 'EMERGENCYSTATE_MODE'],
      dtype='object')

In [149]:
len(df_test.columns[df_test.dtypes == "object"])

16

In [150]:
(df_train.isna().sum() > 0).value_counts()

True     67
False    55
dtype: int64

In [151]:
(df_train.isnull().sum(axis=1) > 20).value_counts()

True     184207
False    123304
dtype: int64

In [152]:
# Get the columns without missing values in the train and test sets
no_missing_train = df_train.columns[df_train.isna().sum() == 0]
no_missing_test = df_test.columns[df_test.isna().sum() == 0]

len(no_missing_train)

55

In [153]:
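# Function to check the correlation between quantitative variables and the target
def compare_corr(features):
    # Build a new list instead of appending, so the caller's list is not modified
    cols = list(features) + ["TARGET"]
    # numeric_only=True skips object columns, which recent pandas versions no longer drop silently
    return df_train[cols].corr(numeric_only=True)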


def compute_chi(cat_variable):
    # Create a contingency table
    contingency_table = pd.crosstab(df_train[cat_variable], df_train["TARGET"])

    # Calculate the chi-square statistic and p-value
    chi2, p, dof, expected = chi2_contingency(contingency_table)

    # Print the results
    print("Chi-square statistic:", chi2)
    print("p-value:", p)

In [154]:
# Getting the percentage of missing values in the train dataset

percent_missing_train = (df_train.isna().sum() / len(df_train) * 100).round(1)

col_missing_train = df_train.columns[(percent_missing_train != 0)]

percent_missing_train[percent_missing_train > 0].sort_values(ascending=False).head(60)

COMMONAREA_AVG                  69.9
COMMONAREA_MODE                 69.9
COMMONAREA_MEDI                 69.9
NONLIVINGAPARTMENTS_MODE        69.4
NONLIVINGAPARTMENTS_AVG         69.4
NONLIVINGAPARTMENTS_MEDI        69.4
LIVINGAPARTMENTS_MODE           68.4
LIVINGAPARTMENTS_MEDI           68.4
LIVINGAPARTMENTS_AVG            68.4
FONDKAPREMONT_MODE              68.4
FLOORSMIN_MEDI                  67.8
FLOORSMIN_AVG                   67.8
FLOORSMIN_MODE                  67.8
YEARS_BUILD_MODE                66.5
YEARS_BUILD_MEDI                66.5
YEARS_BUILD_AVG                 66.5
OWN_CAR_AGE                     66.0
LANDAREA_MEDI                   59.4
LANDAREA_AVG                    59.4
LANDAREA_MODE                   59.4
BASEMENTAREA_AVG                58.5
BASEMENTAREA_MODE               58.5
BASEMENTAREA_MEDI               58.5
EXT_SOURCE_1                    56.4
NONLIVINGAREA_MEDI              55.2
NONLIVINGAREA_AVG               55.2
NONLIVINGAREA_MODE              55.2
...

Most of the features with missing values describe the client's building or home; we will investigate their individual correlation with the target later on. For now, my hypothesis is that the OCCUPATION_TYPE feature is related to the rate of default, so we test this first in order to decide whether imputing its missing values is worthwhile.
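
One way to look at such a hypothesis is to compute the repayment rate per occupation while keeping the missing occupations as a group of their own. The sketch below does this on a hypothetical six-row frame; the `dropna=False` argument of `groupby` is what keeps the rows with a missing category.

```python
import numpy as np
import pandas as pd

# Hypothetical miniature of the OCCUPATION_TYPE and TARGET columns
toy = pd.DataFrame({
    "OCCUPATION_TYPE": ["Drivers", "Drivers", "Accountants", "Accountants", np.nan, np.nan],
    "TARGET": [1, 0, 0, 0, 0, 1],
})

# dropna=False keeps the rows with a missing occupation as a separate group
default_rate = toy.groupby("OCCUPATION_TYPE", dropna=False)["TARGET"].mean()
repayment_rate = (1 - default_rate).sort_values(ascending=False)
print(repayment_rate)
```

Seeing the missing group next to the named occupations makes it easier to judge which category, if any, the missing rows resemble before choosing an imputation value.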

In [155]:
# Checking if there's any relationship between OCCUPATION_TYPE and rate of repayment
# (TARGET is selected before mean() so that non-numeric columns are never averaged)
(1 - df_train.groupby("OCCUPATION_TYPE")["TARGET"].mean().round(2)).sort_values(ascending=False)

OCCUPATION_TYPE
Accountants              0.95
Core staff               0.94
HR staff                 0.94
High skill tech staff    0.94
IT staff                 0.94
Managers                 0.94
Secretaries              0.93
Medicine staff           0.93
Private service staff    0.93
Realty agents            0.92
Sales staff              0.90
Cooking staff            0.90
Cleaning staff           0.90
Security staff           0.89
Waiters/barmen staff     0.89
Laborers                 0.89
Drivers                  0.89
Low-skill Laborers       0.83
Name: TARGET, dtype: float64

The repayment rate falls as we move from corporate jobs (Accountants, 0.95) to manual jobs (Low-skill Laborers, 0.83). This information might be useful for our model, so we shall fill in the missing values of OCCUPATION_TYPE and investigate further whether it has any impact on the model.

In [156]:
# Checking the rate of repayment for clients with missing occupation types
1 - df_train[df_train.OCCUPATION_TYPE.isna()][["TARGET"]].mean()

TARGET    0.934869
dtype: float64

In [157]:
compute_chi("OCCUPATION_TYPE")

Chi-square statistic: 1402.8467961927515
p-value: 3.7844998567642684e-288


In [158]:
# Replacing the missing values in OCCUPATION_TYPE with the most frequent observation
most_frequent_occupation = df_train["OCCUPATION_TYPE"].mode().iloc[0]

# Assign the result back: fillna(..., inplace=True) on a selected column
# may not update df_train in recent pandas versions
df_train["OCCUPATION_TYPE"] = df_train["OCCUPATION_TYPE"].fillna(most_frequent_occupation)

In [159]:
# Checking if the imputation drastically affects the association between OCCUPATION_TYPE and the target
compute_chi("OCCUPATION_TYPE")

Chi-square statistic: 1193.3947394399675
p-value: 3.4203518801583733e-243


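The chi-square statistic decreases only moderately after mode imputation (from about 1403 to about 1193) and the p-value remains effectively zero, so the association between OCCUPATION_TYPE and the target is preserved. We shall therefore accept this imputation technique and keep OCCUPATION_TYPE.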
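
Because the chi-square statistic grows with the number of rows, comparing it before and after imputation mixes effect size with sample size. A possible complement, which is not part of the original analysis, is Cramér's V, which rescales the statistic to the range 0 to 1. The helper name `cramers_v` and the toy series below are assumptions made for this sketch.

```python
import numpy as np
import pandas as pd
from scipy.stats import chi2_contingency


def cramers_v(x, y):
    """Cramér's V effect size for two categorical series (0 = none, 1 = perfect)."""
    table = pd.crosstab(x, y)
    # correction=False disables the Yates correction that SciPy applies to 2x2 tables
    chi2, p, dof, expected = chi2_contingency(table, correction=False)
    n = table.to_numpy().sum()
    k = min(table.shape) - 1
    return float(np.sqrt(chi2 / (n * k)))


# Hypothetical toy data: the first pair is perfectly associated, the second independent
job = pd.Series(["a", "a", "b", "b"])
v_perfect = cramers_v(job, pd.Series([0, 0, 1, 1]))
v_independent = cramers_v(job, pd.Series([0, 1, 0, 1]))
print(v_perfect, v_independent)
```

On the real data the call would be `cramers_v(df_train["OCCUPATION_TYPE"], df_train["TARGET"])`, run once before and once after the imputation.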

In [160]:
# Getting and displaying the correlation between every feature with missing values and the target
missing_val_corr = compare_corr(percent_missing_train[percent_missing_train != 0].index.to_list())

# We take the absolute values of the correlation coefficients to account for negative correlation
missing_val_corr.TARGET.abs().sort_values(ascending=False)

TARGET                          1.000000
EXT_SOURCE_3                    0.178919
EXT_SOURCE_2                    0.160472
EXT_SOURCE_1                    0.155317
FLOORSMAX_AVG                   0.044003
FLOORSMAX_MEDI                  0.043768
FLOORSMAX_MODE                  0.043226
AMT_GOODS_PRICE                 0.039645
OWN_CAR_AGE                     0.037612
ELEVATORS_AVG                   0.034199
ELEVATORS_MEDI                  0.033863
FLOORSMIN_AVG                   0.033614
FLOORSMIN_MEDI                  0.033394
LIVINGAREA_AVG                  0.032997
LIVINGAREA_MEDI                 0.032739
FLOORSMIN_MODE                  0.032698
TOTALAREA_MODE                  0.032596
DEF_30_CNT_SOCIAL_CIRCLE        0.032248
ELEVATORS_MODE                  0.032131
DEF_60_CNT_SOCIAL_CIRCLE        0.031276
LIVINGAREA_MODE                 0.030685
APARTMENTS_AVG                  0.029498
APARTMENTS_MEDI                 0.029184
APARTMENTS_MODE                 0.027284
LIVINGAPARTMENTS

Most of the features correlate very weakly with the target. We shall only consider EXT_SOURCE_3, EXT_SOURCE_2, and EXT_SOURCE_1: although their correlation with the target is still weak, their coefficients exceed those of all other features, and domain knowledge suggests that external sources play a large role in credit allocation.

**Feature Engineering Approach:** We shall take the mean of the values of all three variables and see how that affects the predictive power of our model.

In [161]:
# Row-wise mean of the three external scores; mean(axis=1) skips NaN, so a row
# with one or two missing scores is averaged over the scores that are present
ext_cols = ["EXT_SOURCE_1", "EXT_SOURCE_2", "EXT_SOURCE_3"]
df_train["EXT_SOURCE_MEAN"] = df_train[ext_cols].mean(axis=1)
df_train[["EXT_SOURCE_MEAN", "TARGET"]].corr()

Unnamed: 0,EXT_SOURCE_MEAN,TARGET
EXT_SOURCE_MEAN,1.0,-0.222052
TARGET,-0.222052,1.0


The correlation between TARGET and EXT_SOURCE_MEAN (the mean of EXT_SOURCE_1, EXT_SOURCE_2, and EXT_SOURCE_3) is -0.222, which is stronger in magnitude than the correlation between TARGET and each individual feature (at most 0.179 in absolute value). The negative sign indicates that higher external scores go with a lower TARGET value.
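
The row-wise mean behind EXT_SOURCE_MEAN also acts as a simple form of missing-value handling, because pandas skips NaN when averaging across columns. The sketch below shows this on three hypothetical rows, including the case where all three scores are missing.

```python
import numpy as np
import pandas as pd

# Hypothetical scores: the second row misses one source, the third misses all
scores = pd.DataFrame({
    "EXT_SOURCE_1": [0.2, np.nan, np.nan],
    "EXT_SOURCE_2": [0.4, 0.6, np.nan],
    "EXT_SOURCE_3": [0.6, 0.2, np.nan],
})

# mean(axis=1) skips NaN by default, so each row averages whatever is available
row_mean = scores.mean(axis=1)
print(row_mean)

# Rows with no source at all stay NaN and still need separate handling
n_still_missing = int(row_mean.isna().sum())
print(n_still_missing)
```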

In [169]:
features_to_drop = [col for col in col_missing_train if col != "OCCUPATION_TYPE"]
len(features_to_drop)

63

In [19]:
# dropping features with extremely high missingness and weak correlation with the target
df_train.drop(features_to_drop, axis=1, inplace=True)
df_test.drop(features_to_drop, axis=1, inplace=True)
percent_missing_train.drop(features_to_drop, inplace=True)
percent_missing_train[percent_missing_train != 0]

EXT_SOURCE_1                  56.4
AMT_REQ_CREDIT_BUREAU_WEEK    13.5
AMT_REQ_CREDIT_BUREAU_MON     13.5
AMT_REQ_CREDIT_BUREAU_QRT     13.5
AMT_REQ_CREDIT_BUREAU_YEAR    13.5
NAME_TYPE_SUITE                0.4
OBS_30_CNT_SOCIAL_CIRCLE       0.3
DEF_30_CNT_SOCIAL_CIRCLE       0.3
OBS_60_CNT_SOCIAL_CIRCLE       0.3
DEF_60_CNT_SOCIAL_CIRCLE       0.3
EXT_SOURCE_2                   0.2
AMT_GOODS_PRICE                0.1
dtype: float64
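
Dropping features by missingness can also be written against an explicit cut-off, which makes the rule easier to revisit later. The sketch below is only an illustration on a hypothetical frame; the column names `a`, `b`, `c` and the 50% threshold are assumptions, not values taken from this notebook.

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame; column names a, b, c are placeholders
toy = pd.DataFrame({
    "a": [1.0, np.nan, np.nan, np.nan],  # 75% missing
    "b": [0.1, 0.9, 0.2, 0.8],           # complete
    "c": [1.0, 2.0, np.nan, 4.0],        # 25% missing
    "TARGET": [0, 1, 0, 1],
})

max_missing_pct = 50.0  # assumed cut-off, not taken from the notebook

# Percentage of missing values per feature, excluding the target
missing_pct = toy.drop(columns="TARGET").isna().mean() * 100
to_drop = missing_pct[missing_pct > max_missing_pct].index.tolist()
print(to_drop)

reduced = toy.drop(columns=to_drop)
print(reduced.columns.tolist())
```

A second condition on the absolute correlation with TARGET could be combined with the missingness mask in the same way.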

In [215]:
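# Columns with more than 0% and less than 5% missing values; only the second and third are imputed
low_missing_cols = percent_missing_train[(percent_missing_train > 0) & (percent_missing_train < 5)].index[1:3]

for col in low_missing_cols:
    # Note: df_clean is re-created from df_train on every iteration, so only the
    # last column's mean imputation is retained after the loop
    df_clean = df_train.copy()
    df_clean[col] = df_train[col].fillna(df_train[col].mean())
    print("missing values: ", df_clean[col].isna().sum())

# Get the columns without missing values in the cleaned train set and in the test set
no_missing_train = df_clean.columns[df_clean.isna().sum() == 0]
no_missing_test = df_test.columns[df_test.isna().sum() == 0]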

print(len(no_missing_train))

missing values:  0
missing values:  0
56


In [195]:
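# Splitting and training for feature selection
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import roc_auc_score, accuracy_score
from sklearn.preprocessing import LabelEncoder

# Encode each categorical column with one mapping shared by train and test;
# fitting a separate encoder on each frame can give the same category different codes
for col in obj_cols:
    if col not in df_train.columns or col not in df_test.columns:
        continue  # column was dropped earlier
    le = LabelEncoder()
    # astype(str) turns missing values into the string "nan" so they get a code of their own
    le.fit(pd.concat([df_train[col], df_test[col]]).astype(str))
    df_train[col] = le.transform(df_train[col].astype(str))
    df_test[col] = le.transform(df_test[col].astype(str))


def feat_train(feat, target):
    X_train, X_val, y_train, y_val = train_test_split(feat, target, random_state=42)

    # Fit the scaler on the training split only, then reuse it on the validation split
    scaler = StandardScaler()
    X_train = scaler.fit_transform(X_train)
    X_val = scaler.transform(X_val)

    dt = DecisionTreeClassifier()
    dt.fit(X_train, y_train)

    # ROC AUC is computed from predicted probabilities, accuracy from hard labels
    print(roc_auc_score(y_val, dt.predict_proba(X_val)[:, 1]))
    print(f"Accuracy: {accuracy_score(y_val, dt.predict(X_val))}")


feat_train(df_train[no_missing_train.drop("TARGET").to_numpy()], df_train["TARGET"])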

0.5192856655500524
Accuracy: 0.8470693826582377
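
An accuracy of about 0.85 next to a ROC AUC of about 0.52 can look contradictory. One possible explanation, sketched below on made-up labels, is class imbalance: when most applicants repay, a model that never predicts a default is right most of the time but cannot rank applicants at all. The 92/8 split is an assumption chosen for illustration.

```python
import numpy as np
from sklearn.metrics import accuracy_score, roc_auc_score

# Hypothetical imbalanced labels: 92 repaid (0) and 8 defaulted (1)
y_true = np.array([0] * 92 + [1] * 8)

# A "model" that always predicts repayment
y_pred = np.zeros_like(y_true)

accuracy = accuracy_score(y_true, y_pred)
auc_score = roc_auc_score(y_true, y_pred)
print(accuracy, auc_score)
```

This is why accuracy alone says little on this kind of data, and why the competition metric is the better guide when comparing feature sets.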


In [6]:
missing_corr = df_train.isnull().corr()
missing_corr

Unnamed: 0,SK_ID_CURR,TARGET,NAME_CONTRACT_TYPE,CODE_GENDER,FLAG_OWN_CAR,FLAG_OWN_REALTY,CNT_CHILDREN,AMT_INCOME_TOTAL,AMT_CREDIT,AMT_ANNUITY,...,FLAG_DOCUMENT_18,FLAG_DOCUMENT_19,FLAG_DOCUMENT_20,FLAG_DOCUMENT_21,AMT_REQ_CREDIT_BUREAU_HOUR,AMT_REQ_CREDIT_BUREAU_DAY,AMT_REQ_CREDIT_BUREAU_WEEK,AMT_REQ_CREDIT_BUREAU_MON,AMT_REQ_CREDIT_BUREAU_QRT,AMT_REQ_CREDIT_BUREAU_YEAR
SK_ID_CURR,,,,,,,,,,,...,,,,,,,,,,
TARGET,,,,,,,,,,,...,,,,,,,,,,
NAME_CONTRACT_TYPE,,,,,,,,,,,...,,,,,,,,,,
CODE_GENDER,,,,,,,,,,,...,,,,,,,,,,
FLAG_OWN_CAR,,,,,,,,,,,...,,,,,,,,,,
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
AMT_REQ_CREDIT_BUREAU_DAY,,,,,,,,,,-0.000945,...,,,,,1.0,1.0,1.0,1.0,1.0,1.0
AMT_REQ_CREDIT_BUREAU_WEEK,,,,,,,,,,-0.000945,...,,,,,1.0,1.0,1.0,1.0,1.0,1.0
AMT_REQ_CREDIT_BUREAU_MON,,,,,,,,,,-0.000945,...,,,,,1.0,1.0,1.0,1.0,1.0,1.0
AMT_REQ_CREDIT_BUREAU_QRT,,,,,,,,,,-0.000945,...,,,,,1.0,1.0,1.0,1.0,1.0,1.0
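
Most cells in the table above are NaN because a column that is never missing has a constant missingness indicator, and a correlation with a constant is undefined. A possible refinement, shown here on a hypothetical three-column frame, is to keep only the columns that have at least one missing value before correlating.

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame: x and y are missing in the same rows, z is never missing
toy = pd.DataFrame({
    "x": [1.0, np.nan, 3.0, np.nan],
    "y": [5.0, np.nan, 7.0, np.nan],
    "z": [1, 2, 3, 4],
})

# Keep only columns with at least one missing value; a column that is never
# missing has a constant indicator and would only contribute NaN entries
is_missing = toy.isna()
is_missing = is_missing.loc[:, is_missing.any()]
missing_corr_toy = is_missing.astype(int).corr()
print(missing_corr_toy)
```

A correlation of 1.0 between two indicators, as for the AMT_REQ_CREDIT_BUREAU columns above, means the values are always missing in the same rows.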


## Problem Three: Estimation on test data

## Problem Four: Feature engineering