## Problem One: Confirmation of competition contents

* What to learn and what to predict?

* What kind of file to create and submit to Kaggle?

* What kind of index value will be used to evaluate the submissions?





What to learn and what to predict?
• Learn from each applicant's application data and predict how likely the applicant is to have difficulty repaying a loan (the TARGET variable).
What kind of file to create and submit to Kaggle?
• A CSV file with a header row containing SK_ID_CURR and TARGET, holding the predicted probability of TARGET for each SK_ID_CURR in the test set.
What kind of index value will be used to evaluate the submissions?
• Submissions are evaluated on the area under the ROC curve (ROC AUC) between the predicted probability and the observed target. The ROC curve is created by plotting the true positive rate (TPR) against the false positive rate (FPR) at various threshold settings. The true positive rate is also known as sensitivity, recall, or probability of detection.
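
To make the metric concrete, here is a minimal sketch that uses scikit-learn's `roc_curve` and `roc_auc_score` on four made-up labels and scores. The toy values are assumptions for illustration only and are not taken from the competition data.

```python
import numpy as np
from sklearn.metrics import roc_auc_score, roc_curve

# Toy data (hypothetical): 1 = the client had payment difficulties
y_true = np.array([0, 0, 1, 1])
y_score = np.array([0.1, 0.4, 0.35, 0.8])  # predicted probability of TARGET = 1

# Each threshold yields one (FPR, TPR) point on the ROC curve
fpr, tpr, thresholds = roc_curve(y_true, y_score)
for f, t, thr in zip(fpr, tpr, thresholds):
    print(f"threshold={thr:.2f}  FPR={f:.2f}  TPR={t:.2f}")

# The metric is the area under that curve
auc = roc_auc_score(y_true, y_score)
print(f"ROC AUC = {auc:.2f}")
```

The area is 0.75 here because three of the four positive/negative pairs are ranked correctly: only the positive scored 0.35 falls below the negative scored 0.4.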

## Problem Two: Learning and verification

In [105]:
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
import matplotlib.pyplot as plt
import seaborn as sns
from scipy.stats import chi2_contingency
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import roc_auc_score, accuracy_score
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import LabelEncoder

In [106]:
# Loading the dataset
df_train = pd.read_csv("application_train.csv")
df_test = pd.read_csv("application_test.csv")

df_train.head()

Unnamed: 0,SK_ID_CURR,TARGET,NAME_CONTRACT_TYPE,CODE_GENDER,FLAG_OWN_CAR,FLAG_OWN_REALTY,CNT_CHILDREN,AMT_INCOME_TOTAL,AMT_CREDIT,AMT_ANNUITY,...,FLAG_DOCUMENT_18,FLAG_DOCUMENT_19,FLAG_DOCUMENT_20,FLAG_DOCUMENT_21,AMT_REQ_CREDIT_BUREAU_HOUR,AMT_REQ_CREDIT_BUREAU_DAY,AMT_REQ_CREDIT_BUREAU_WEEK,AMT_REQ_CREDIT_BUREAU_MON,AMT_REQ_CREDIT_BUREAU_QRT,AMT_REQ_CREDIT_BUREAU_YEAR
0,100002,1,Cash loans,M,N,Y,0,202500.0,406597.5,24700.5,...,0,0,0,0,0.0,0.0,0.0,0.0,0.0,1.0
1,100003,0,Cash loans,F,N,N,0,270000.0,1293502.5,35698.5,...,0,0,0,0,0.0,0.0,0.0,0.0,0.0,0.0
2,100004,0,Revolving loans,M,Y,Y,0,67500.0,135000.0,6750.0,...,0,0,0,0,0.0,0.0,0.0,0.0,0.0,0.0
3,100006,0,Cash loans,F,N,Y,0,135000.0,312682.5,29686.5,...,0,0,0,0,,,,,,
4,100007,0,Cash loans,M,N,Y,0,121500.0,513000.0,21865.5,...,0,0,0,0,0.0,0.0,0.0,0.0,0.0,0.0


In [107]:
# info() prints its summary directly and returns None, so display() is not needed
df_train.info()
df_test.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 307511 entries, 0 to 307510
Columns: 122 entries, SK_ID_CURR to AMT_REQ_CREDIT_BUREAU_YEAR
dtypes: float64(65), int64(41), object(16)
memory usage: 286.2+ MB

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 48744 entries, 0 to 48743
Columns: 121 entries, SK_ID_CURR to AMT_REQ_CREDIT_BUREAU_YEAR
dtypes: float64(65), int64(40), object(16)
memory usage: 45.0+ MB

In [108]:
display(df_train.shape)
display(df_test.shape)

(307511, 122)

(48744, 121)

* The test dataset is missing the target column because target is the variable we're trying to predict.
* We have 16 columns of data type object

In [109]:
(df_train.isna().sum() > 0).value_counts()

True     67
False    55
dtype: int64

In [110]:
df_test.columns[(df_test.isna().sum() > 0)]

Index(['AMT_ANNUITY', 'NAME_TYPE_SUITE', 'OWN_CAR_AGE', 'OCCUPATION_TYPE',
       'EXT_SOURCE_1', 'EXT_SOURCE_2', 'EXT_SOURCE_3', 'APARTMENTS_AVG',
       'BASEMENTAREA_AVG', 'YEARS_BEGINEXPLUATATION_AVG', 'YEARS_BUILD_AVG',
       'COMMONAREA_AVG', 'ELEVATORS_AVG', 'ENTRANCES_AVG', 'FLOORSMAX_AVG',
       'FLOORSMIN_AVG', 'LANDAREA_AVG', 'LIVINGAPARTMENTS_AVG',
       'LIVINGAREA_AVG', 'NONLIVINGAPARTMENTS_AVG', 'NONLIVINGAREA_AVG',
       'APARTMENTS_MODE', 'BASEMENTAREA_MODE', 'YEARS_BEGINEXPLUATATION_MODE',
       'YEARS_BUILD_MODE', 'COMMONAREA_MODE', 'ELEVATORS_MODE',
       'ENTRANCES_MODE', 'FLOORSMAX_MODE', 'FLOORSMIN_MODE', 'LANDAREA_MODE',
       'LIVINGAPARTMENTS_MODE', 'LIVINGAREA_MODE', 'NONLIVINGAPARTMENTS_MODE',
       'NONLIVINGAREA_MODE', 'APARTMENTS_MEDI', 'BASEMENTAREA_MEDI',
       'YEARS_BEGINEXPLUATATION_MEDI', 'YEARS_BUILD_MEDI', 'COMMONAREA_MEDI',
       'ELEVATORS_MEDI', 'ENTRANCES_MEDI', 'FLOORSMAX_MEDI', 'FLOORSMIN_MEDI',
       'LANDAREA_MEDI', 'LIVINGAPART

In [111]:
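# Function to check the correlation between quantitative variables and the target
def compare_corr(features):
    # Build a new list so the caller's list is not mutated
    cols = [col for col in features if col != "TARGET"] + ["TARGET"]
    # numeric_only=True skips object columns, which have no Pearson correlation
    return df_train[cols].corr(numeric_only=True)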


def compute_chi(cat_variable):
    """
    Run a chi-square test of independence between a categorical variable and the target
    """
    # Create a contingency table
    contingency_table = pd.crosstab(df_train[cat_variable], df_train["TARGET"])

    # Calculate the chi-square statistic and p-value
    chi2, p, dof, expected = chi2_contingency(contingency_table)

    # Print the results
    print("Chi-square statistic:", chi2)
    print("p-value:", p)

In [112]:
# Getting the percentage of missing values in the train dataset

percent_missing_train = df_train.isna().sum() / len(df_train) * 100

col_missing_train = df_train.columns[(percent_missing_train != 0)]

percent_missing_train[percent_missing_train > 0].sort_values(ascending=False).tail(60)

LIVINGAPARTMENTS_MODE           68.354953
LIVINGAPARTMENTS_MEDI           68.354953
LIVINGAPARTMENTS_AVG            68.354953
FLOORSMIN_MODE                  67.848630
FLOORSMIN_MEDI                  67.848630
FLOORSMIN_AVG                   67.848630
YEARS_BUILD_MODE                66.497784
YEARS_BUILD_MEDI                66.497784
YEARS_BUILD_AVG                 66.497784
OWN_CAR_AGE                     65.990810
LANDAREA_AVG                    59.376738
LANDAREA_MEDI                   59.376738
LANDAREA_MODE                   59.376738
BASEMENTAREA_MEDI               58.515956
BASEMENTAREA_AVG                58.515956
BASEMENTAREA_MODE               58.515956
EXT_SOURCE_1                    56.381073
NONLIVINGAREA_MEDI              55.179164
NONLIVINGAREA_MODE              55.179164
NONLIVINGAREA_AVG               55.179164
ELEVATORS_MEDI                  53.295980
ELEVATORS_MODE                  53.295980
ELEVATORS_AVG                   53.295980
WALLSMATERIAL_MODE              50

Most of the features with missing values relate to the client's building or home. We will investigate their individual correlations with the target later on. For now, my hypothesis is that OCCUPATION_TYPE is related to the rate of default, so we shall test this before deciding whether imputing its missing values is worthwhile.

In [113]:
# Checking if there's any relationship between OCCUPATION_TYPE and rate of repayment
(1 - df_train.groupby("OCCUPATION_TYPE")["TARGET"].mean().round(2)).sort_values(ascending=False)

OCCUPATION_TYPE
Accountants              0.95
Core staff               0.94
HR staff                 0.94
High skill tech staff    0.94
IT staff                 0.94
Managers                 0.94
Secretaries              0.93
Medicine staff           0.93
Private service staff    0.93
Realty agents            0.92
Sales staff              0.90
Cooking staff            0.90
Cleaning staff           0.90
Security staff           0.89
Waiters/barmen staff     0.89
Laborers                 0.89
Drivers                  0.89
Low-skill Laborers       0.83
Name: TARGET, dtype: float64

The repayment rate falls as we move from corporate jobs (about 0.94 to 0.95) to manual jobs (0.83 for low-skill laborers). This information might be useful for our model, so we shall fill in the missing values of OCCUPATION_TYPE and investigate whether the feature has any impact on the model.

In [114]:
# Checking the rate of repayment for clients with missing occupation types
1 - df_train[df_train.OCCUPATION_TYPE.isna()][["TARGET"]].mean()

TARGET    0.934869
dtype: float64

In [115]:
compute_chi("OCCUPATION_TYPE")

Chi-square statistic: 1402.8467961927515
p-value: 3.7844998567642684e-288


In [116]:
# Replacing the missing values in OCCUPATION_TYPE with the most frequent observation
most_frequent_occupation = df_train["OCCUPATION_TYPE"].mode().iloc[0]

# Assign the result back instead of calling fillna(inplace=True) on a column selection,
# which pandas may apply to a temporary copy
df_train["OCCUPATION_TYPE"] = df_train["OCCUPATION_TYPE"].fillna(most_frequent_occupation)
df_test["OCCUPATION_TYPE"] = df_test["OCCUPATION_TYPE"].fillna(df_test["OCCUPATION_TYPE"].mode().iloc[0])

In [117]:
# Checking if the imputation drastically affects the correlation between OCCUPATION_TYPE and the target 
compute_chi("OCCUPATION_TYPE")

Chi-square statistic: 1193.3947394399675
p-value: 3.4203518801583733e-243


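The chi-square statistic drops only modestly after mode imputation (from about 1403 to about 1193) and the p-value stays effectively zero, so OCCUPATION_TYPE remains strongly associated with the target. We shall therefore accept this imputation technique and keep OCCUPATION_TYPE.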
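
The chi-square statistic grows with sample size, so on its own it says little about how strong an association is. One common complement, which this notebook does not use, is Cramér's V, which rescales the statistic to the range 0 to 1. The sketch below runs on made-up data; the helper name `cramers_v` and the toy frame are assumptions for illustration.

```python
import numpy as np
import pandas as pd
from scipy.stats import chi2_contingency


def cramers_v(x, y):
    """Cramér's V between two categorical series (0 = none, 1 = perfect association)."""
    table = pd.crosstab(x, y)
    # correction=False disables Yates' continuity correction for 2x2 tables
    chi2 = chi2_contingency(table, correction=False)[0]
    n = table.to_numpy().sum()
    min_dim = min(table.shape) - 1
    return float(np.sqrt(chi2 / (n * min_dim)))


# Toy data (hypothetical): one target fully determined by occupation, one unrelated
toy = pd.DataFrame({
    "occupation": ["A"] * 10 + ["B"] * 10,
    "target_perfect": [0] * 10 + [1] * 10,
    "target_unrelated": [0, 1] * 10,
})

v_perfect = cramers_v(toy["occupation"], toy["target_perfect"])
v_unrelated = cramers_v(toy["occupation"], toy["target_unrelated"])
print(v_perfect, v_unrelated)
```

Applied before and after imputation, a measure like this would show whether the strength of the association changed, not just whether it remains statistically significant.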

In [118]:
# Getting and displaying the correlation of all missing variables and the target
missing_val_corr = compare_corr(percent_missing_train[percent_missing_train != 0].index.to_list())

# We take the absolute values of the correlation coefficients to account for negative correlation
missing_val_corr["TARGET"].abs().sort_values(ascending=False).head(60)

TARGET                          1.000000
EXT_SOURCE_3                    0.178919
EXT_SOURCE_2                    0.160472
EXT_SOURCE_1                    0.155317
DAYS_LAST_PHONE_CHANGE          0.055218
FLOORSMAX_AVG                   0.044003
FLOORSMAX_MEDI                  0.043768
FLOORSMAX_MODE                  0.043226
AMT_GOODS_PRICE                 0.039645
OWN_CAR_AGE                     0.037612
ELEVATORS_AVG                   0.034199
ELEVATORS_MEDI                  0.033863
FLOORSMIN_AVG                   0.033614
FLOORSMIN_MEDI                  0.033394
LIVINGAREA_AVG                  0.032997
LIVINGAREA_MEDI                 0.032739
FLOORSMIN_MODE                  0.032698
TOTALAREA_MODE                  0.032596
DEF_30_CNT_SOCIAL_CIRCLE        0.032248
ELEVATORS_MODE                  0.032131
DEF_60_CNT_SOCIAL_CIRCLE        0.031276
LIVINGAREA_MODE                 0.030685
APARTMENTS_AVG                  0.029498
APARTMENTS_MEDI                 0.029184
APARTMENTS_MODE 

Most of these features correlate very weakly with the target. We shall only keep EXT_SOURCE_3, EXT_SOURCE_2, and EXT_SOURCE_1: although their correlations with the target are also weak (absolute values of about 0.16 to 0.18), they are clearly the largest in the list, and domain knowledge suggests that external sources play a huge role in credit allocation.

**Feature Engineering Approach:** We shall take the row-wise mean of all three variables and see how that affects the predictive power of our model.

In [119]:
# Handling missing data in EXT_SOURCE_3, EXT_SOURCE_2, and EXT_SOURCE_1

df_train["EXT_SOURCE_MEAN"] = df_train[["EXT_SOURCE_1", "EXT_SOURCE_2", "EXT_SOURCE_3"]].mean(axis=1)
df_test["EXT_SOURCE_MEAN"] = df_test[["EXT_SOURCE_1", "EXT_SOURCE_2", "EXT_SOURCE_3"]].mean(axis=1)

df_train[["EXT_SOURCE_MEAN", "TARGET"]].corr()

Unnamed: 0,EXT_SOURCE_MEAN,TARGET
EXT_SOURCE_MEAN,1.0,-0.222052
TARGET,-0.222052,1.0


EXT_SOURCE_MEAN, the mean of EXT_SOURCE_1, EXT_SOURCE_2, and EXT_SOURCE_3, has a correlation of -0.222 with TARGET. This is a stronger (negative) relationship than any of the three individual features shows, as their absolute correlations range from about 0.155 to 0.179.
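
Row-wise averaging also absorbs part of the missing-data problem, because `DataFrame.mean(axis=1)` skips NaN by default: a row only ends up missing when all three scores are absent. A small sketch with made-up values:

```python
import numpy as np
import pandas as pd

# Toy values (hypothetical), mimicking rows with zero, one, and three missing scores
toy = pd.DataFrame({
    "EXT_SOURCE_1": [0.2, np.nan, np.nan],
    "EXT_SOURCE_2": [0.4, 0.3, np.nan],
    "EXT_SOURCE_3": [0.6, 0.5, np.nan],
})

# skipna=True is the default, so each row averages only the scores it has
toy["EXT_SOURCE_MEAN"] = toy[["EXT_SOURCE_1", "EXT_SOURCE_2", "EXT_SOURCE_3"]].mean(axis=1)
print(toy)
```

Only the last row, where every score is missing, keeps a NaN in the new column.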

In [120]:
features_to_drop = [col for col in col_missing_train if col != "OCCUPATION_TYPE"]
len(features_to_drop)

66

In [121]:
len(df_train.columns)

123

In [122]:
# dropping features
df_train.drop(features_to_drop, axis=1, inplace=True)
df_test.drop(features_to_drop, axis=1, inplace=True)
percent_missing_train.drop(features_to_drop, inplace=True)
df_train.columns

Index(['SK_ID_CURR', 'TARGET', 'NAME_CONTRACT_TYPE', 'CODE_GENDER',
       'FLAG_OWN_CAR', 'FLAG_OWN_REALTY', 'CNT_CHILDREN', 'AMT_INCOME_TOTAL',
       'AMT_CREDIT', 'NAME_INCOME_TYPE', 'NAME_EDUCATION_TYPE',
       'NAME_FAMILY_STATUS', 'NAME_HOUSING_TYPE', 'REGION_POPULATION_RELATIVE',
       'DAYS_BIRTH', 'DAYS_EMPLOYED', 'DAYS_REGISTRATION', 'DAYS_ID_PUBLISH',
       'FLAG_MOBIL', 'FLAG_EMP_PHONE', 'FLAG_WORK_PHONE', 'FLAG_CONT_MOBILE',
       'FLAG_PHONE', 'FLAG_EMAIL', 'OCCUPATION_TYPE', 'REGION_RATING_CLIENT',
       'REGION_RATING_CLIENT_W_CITY', 'WEEKDAY_APPR_PROCESS_START',
       'HOUR_APPR_PROCESS_START', 'REG_REGION_NOT_LIVE_REGION',
       'REG_REGION_NOT_WORK_REGION', 'LIVE_REGION_NOT_WORK_REGION',
       'REG_CITY_NOT_LIVE_CITY', 'REG_CITY_NOT_WORK_CITY',
       'LIVE_CITY_NOT_WORK_CITY', 'ORGANIZATION_TYPE', 'FLAG_DOCUMENT_2',
       'FLAG_DOCUMENT_3', 'FLAG_DOCUMENT_4', 'FLAG_DOCUMENT_5',
       'FLAG_DOCUMENT_6', 'FLAG_DOCUMENT_7', 'FLAG_DOCUMENT_8',
       'FLAG_DO

In [123]:
print(len(df_train.columns), len(df_test.columns))

57 56


The train dataset has one more column than the test dataset because it still contains the target column; we shall extract that in a bit.

In [124]:
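# Encoding all features of object type
obj_cols = df_train.columns[df_train.dtypes == "object"]

for col in obj_cols:
    # Fit one encoder per column on the categories of both datasets so that
    # the same category maps to the same integer in train and test
    le = LabelEncoder()
    le.fit(pd.concat([df_train[col], df_test[col]]))
    df_train[col] = le.transform(df_train[col])
    df_test[col] = le.transform(df_test[col])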
    
# Checking that all object data type have been encoded
len(df_train.select_dtypes("object").columns)

0

We've succeeded in encoding the columns of data type object. Next, we check which columns still contain missing values.
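
One caveat with label encoding is that the integer codes depend on which categories the encoder saw when it was fitted. The sketch below, on made-up values, shows how fitting separately on two datasets can give the same category different codes, and how fitting once on the combined values avoids that.

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

# Toy columns (hypothetical): the test set happens to lack the "Cash loans" category
train_col = pd.Series(["Cash loans", "Revolving loans", "Cash loans"])
test_col = pd.Series(["Revolving loans", "Revolving loans"])

# Fitting separately: codes are assigned per dataset, in sorted order of its categories
separate_train = LabelEncoder().fit_transform(train_col)
separate_test = LabelEncoder().fit_transform(test_col)
print(separate_train, separate_test)  # "Revolving loans" gets a different code in each

# Fitting once on the union keeps the mapping identical for both datasets
shared = LabelEncoder().fit(pd.concat([train_col, test_col]))
shared_train = shared.transform(train_col)
shared_test = shared.transform(test_col)
print(shared_train, shared_test)
```

A model trained on one mapping and applied to another would silently treat different categories as the same value, so a shared mapping is the safer choice.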

In [125]:
display(df_train.columns[df_train.isna().sum() > 0])
display(df_test.columns[df_test.isna().sum() > 0])

Index(['EXT_SOURCE_MEAN'], dtype='object')

Index(['EXT_SOURCE_MEAN'], dtype='object')

In [126]:
# Replacing the few missing values in EXT_SOURCE_MEAN with the entire feature's mean
df_train["EXT_SOURCE_MEAN"] = df_train["EXT_SOURCE_MEAN"].fillna(df_train["EXT_SOURCE_MEAN"].mean())
df_test["EXT_SOURCE_MEAN"] = df_test["EXT_SOURCE_MEAN"].fillna(df_test["EXT_SOURCE_MEAN"].mean())

In [127]:
# Confirming that there are no missing values in the dataset
display(df_train.columns[df_train.isna().sum() > 0])
display(df_test.columns[df_test.isna().sum() > 0])

Index([], dtype='object')

Index([], dtype='object')

### Training and Evaluation of Test Data

In [128]:
X = df_train.drop(["SK_ID_CURR", "TARGET"], axis=1)
y = df_train["TARGET"]
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

scaler = StandardScaler()
X_train_transformed = scaler.fit_transform(X_train)
X_test_transformed = scaler.transform(X_test)

SK_ID_CURR was dropped because it identifies the application rather than describing the applicant. Leaving it in would let a model, especially a tree-based one, split on ID values and learn patterns that are unlikely to carry over to unseen applicants. This can inflate validation scores and give a misleading picture of generalization performance.

In [129]:
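# Training the baseline models
# Note: predict() returns hard 0/1 labels, so the ROC AUC values below are computed
# from class labels rather than from predicted probabilities (predict_proba)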
dtree = DecisionTreeClassifier()
dtree.fit(X_train_transformed, y_train)
y_pred_dtree = dtree.predict(X_test_transformed)
print(f"ROC for Decision Tree is {roc_auc_score(y_test, y_pred_dtree)}")
print(f"Accuracy for Decision Tree is {accuracy_score(y_test, y_pred_dtree)}\n\n")


rand_forest = RandomForestClassifier()
rand_forest.fit(X_train_transformed, y_train)
y_pred_rf = rand_forest.predict(X_test_transformed)
print("-" * 100)
print(f"ROC for Random Forest is {roc_auc_score(y_test, y_pred_rf)}")
print(f"Accuracy for Random Forest is {accuracy_score(y_test, y_pred_rf)}\n\n")

log_reg = LogisticRegression()
log_reg.fit(X_train_transformed, y_train)
y_pred_log_reg = log_reg.predict(X_test_transformed)
print("-" * 100)
print(f"ROC for Logistic Regression is {roc_auc_score(y_test, y_pred_log_reg)}")
print(f"Accuracy for Logistic Regression is {accuracy_score(y_test, y_pred_log_reg)}")

ROC for Decision Tree is 0.5369579454921836
Accuracy for Decision Tree is 0.848942480293452


----------------------------------------------------------------------------------------------------
ROC for Random Forest is 0.5003472244938048
Accuracy for Random Forest is 0.9194307864408543


----------------------------------------------------------------------------------------------------
ROC for Logistic Regression is 0.5019359690416059
Accuracy for Logistic Regression is 0.9192356721038528


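All models have ROC AUC values close to 0.5 alongside high accuracy values. Two things explain this contrast. First, the classes are heavily imbalanced: roughly 92% of applicants repay, so a model that predicts 0 for almost everyone still reaches about 0.92 accuracy. Second, the ROC AUC here is computed from the hard class labels returned by predict() rather than from predicted probabilities, so it reflects only a single threshold. The Decision Tree has the highest ROC AUC value, but its lower accuracy (about 0.85) suggests that it may be prone to overfitting.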
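
The gap between accuracy and ROC AUC can be reproduced on a tiny made-up example. The sketch below assumes ten hypothetical applicants and hand-written probabilities. For a fitted scikit-learn classifier, such probabilities would typically come from `predict_proba(X)[:, 1]`.

```python
import numpy as np
from sklearn.metrics import accuracy_score, roc_auc_score

# Toy data (hypothetical): 8 repaid loans (0) and 2 defaults (1)
y_true = np.array([0, 0, 0, 0, 0, 0, 0, 0, 1, 1])
# Predicted probabilities of default: the two defaults rank highest, yet stay below 0.5
y_prob = np.array([0.05, 0.10, 0.10, 0.15, 0.20, 0.20, 0.25, 0.30, 0.35, 0.40])

# Hard labels at the default 0.5 threshold: every applicant is predicted as 0
y_label = (y_prob >= 0.5).astype(int)

acc = accuracy_score(y_true, y_label)
auc_from_labels = roc_auc_score(y_true, y_label)
auc_from_probs = roc_auc_score(y_true, y_prob)
print(acc, auc_from_labels, auc_from_probs)
```

Accuracy is 0.8 and the label-based AUC is 0.5, even though the probabilities rank both defaults above every repaid loan, which gives a probability-based AUC of 1.0. Since ROC AUC measures ranking quality, it is usually computed on probabilities or scores.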

## Problem Three: Estimation on test data

In [134]:
scaled_test_data = scaler.transform(df_test.drop("SK_ID_CURR", axis=1))

In [135]:
pred_test = rand_forest.predict(scaled_test_data)
pred_test

array([0, 0, 0, ..., 0, 0, 0])

In [136]:
df_test["TARGET"] = pred_test

In [137]:
submit_01 = df_test[["SK_ID_CURR", "TARGET"]]
submit_01

Unnamed: 0,SK_ID_CURR,TARGET
0,100001,0
1,100005,0
2,100013,0
3,100028,0
4,100038,0
...,...,...
48739,456221,0
48740,456222,0
48741,456223,0
48742,456224,0


In [30]:
submit_01.to_csv("submit_01.csv", index=False)
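
Because the competition scores predicted probabilities, a submission would normally carry the output of `predict_proba` rather than hard 0/1 labels. The sketch below shows the idea on synthetic stand-in data. The model, the feature matrix, the ID values, and the file name are all assumptions for illustration and do not reproduce the notebook's pipeline.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LogisticRegression

# Synthetic stand-in data (hypothetical), not the competition files
rng = np.random.default_rng(0)
X_fit = rng.normal(size=(200, 3))
y_fit = (X_fit[:, 0] + rng.normal(scale=0.5, size=200) > 1.0).astype(int)
X_new = rng.normal(size=(5, 3))
new_ids = np.arange(100001, 100006)  # placeholder SK_ID_CURR values

model = LogisticRegression().fit(X_fit, y_fit)

# Column 1 of predict_proba is the probability of the positive class (TARGET = 1)
proba = model.predict_proba(X_new)[:, 1]

submission = pd.DataFrame({"SK_ID_CURR": new_ids, "TARGET": proba})
print(submission)
# submission.to_csv("submission_proba.csv", index=False)  # hypothetical file name
```

The same pattern should apply to any scikit-learn classifier that exposes `predict_proba`, including the Random Forest used above.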

## Problem Four: Feature engineering

In [140]:
selected_features = pd.DataFrame(dtree.feature_importances_, index=X.columns).sort_values(0, ascending=False)[:20].index
selected_features

Index(['EXT_SOURCE_MEAN', 'DAYS_BIRTH', 'DAYS_ID_PUBLISH', 'DAYS_REGISTRATION',
       'DAYS_EMPLOYED', 'AMT_CREDIT', 'REGION_POPULATION_RELATIVE',
       'AMT_INCOME_TOTAL', 'HOUR_APPR_PROCESS_START', 'ORGANIZATION_TYPE',
       'WEEKDAY_APPR_PROCESS_START', 'OCCUPATION_TYPE', 'NAME_FAMILY_STATUS',
       'CNT_CHILDREN', 'NAME_HOUSING_TYPE', 'NAME_INCOME_TYPE', 'FLAG_PHONE',
       'FLAG_OWN_REALTY', 'FLAG_WORK_PHONE', 'REGION_RATING_CLIENT'],
      dtype='object')

In [141]:
x1 = df_train[selected_features[:2]]
x2 = df_train[selected_features[:4]]
x3 = df_train[selected_features[:8]]
x4 = df_train[selected_features[:10]]
x5 = df_train[selected_features[:16]]
x6 = df_train[selected_features[:20]]
x7 = df_train[selected_features]  # identical to x6: selected_features holds only 20 features

In [159]:
def feat_train(X, var_name):
    X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)
    
    
    scaler = StandardScaler()
    X_train_transformed = scaler.fit_transform(X_train)
    X_test_transformed = scaler.transform(X_test)


    rand_forest = RandomForestClassifier()
    rand_forest.fit(X_train_transformed, y_train)
    y_pred_rf = rand_forest.predict(X_test_transformed)
    print("-" * 100)
    print(f"ROC for {var_name} is {roc_auc_score(y_test, y_pred_rf)}")

feature_dict = {"x1": x1, "x2": x2, "x3": x3, "x4": x4, "x5": x5, "x6": x6, "x7": x7}

for key in feature_dict:
    feat_train(feature_dict[key], key)

----------------------------------------------------------------------------------------------------
ROC for x1 is 0.5106027864147802
----------------------------------------------------------------------------------------------------
ROC for x2 is 0.5056399350037227
----------------------------------------------------------------------------------------------------
ROC for x3 is 0.5022761200989937
----------------------------------------------------------------------------------------------------
ROC for x4 is 0.5017532239382994
----------------------------------------------------------------------------------------------------
ROC for x5 is 0.5010628937906647
----------------------------------------------------------------------------------------------------
ROC for x6 is 0.5008247254395555
----------------------------------------------------------------------------------------------------
ROC for x7 is 0.5006944489876094


I made several nested subsets of the train dataset from the features that the Decision Tree model ranked as most important, then trained and evaluated the baseline Random Forest on each subset. This gives us clues about which features wield more predictive power. The ROC AUC of the model decreases with each larger subset, and the smallest subset, which contains our engineered feature EXT_SOURCE_MEAN, scores highest, so that feature shows a lot of promise.
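
The subset comparison can also be written as a single loop over the ranked feature list. The sketch below is an assumption-laden illustration rather than the notebook's method: it uses synthetic data, swaps in logistic regression, and uses 3-fold cross-validation with `scoring="roc_auc"`, which evaluates the model's continuous outputs instead of hard labels.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in data (hypothetical): 8 features, about 10% positives
X_arr, y_toy = make_classification(
    n_samples=1000, n_features=8, n_informative=4, weights=[0.9], random_state=0
)
X_toy = pd.DataFrame(X_arr, columns=[f"f{i}" for i in range(8)])
ranked_features = list(X_toy.columns)  # assumed to be sorted by importance already

results = {}
for k in (2, 4, 8):
    subset = ranked_features[:k]
    model = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))
    # scoring="roc_auc" scores the model's continuous outputs, not hard labels
    scores = cross_val_score(model, X_toy[subset], y_toy, cv=3, scoring="roc_auc")
    results[k] = scores.mean()

for k, auc in results.items():
    print(f"top {k} features: mean ROC AUC = {auc:.3f}")
```

Putting the scaler inside the pipeline means it is refitted on each training fold, so no information from the held-out fold reaches the scaling step.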

In [143]:
def feat_train(X, var_name):
    X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)
    
    
    scaler = StandardScaler()
    X_train_transformed = scaler.fit_transform(X_train)
    X_test_transformed = scaler.transform(X_test)


    dtree = DecisionTreeClassifier()
    dtree.fit(X_train_transformed, y_train)
    y_pred_dtree = dtree.predict(X_test_transformed)
    print("-" * 100)
    print(f"ROC for {var_name} is {roc_auc_score(y_test, y_pred_dtree)}")
    
    
for key in feature_dict:
    feat_train(feature_dict[key], key)

In [57]:
# Confirming that EXT_SOURCE_MEAN has a positive impact on the model
feat_train(df_train[selected_features[:1]], "EXT_SOURCE_MEAN")

----------------------------------------------------------------------------------------------------
ROC for EXT_SOURCE_MEAN is 0.5290996524071528


In [61]:
### Feature Engineering
X.columns

Index(['SK_ID_CURR', 'NAME_CONTRACT_TYPE', 'CODE_GENDER', 'FLAG_OWN_CAR',
       'FLAG_OWN_REALTY', 'CNT_CHILDREN', 'AMT_INCOME_TOTAL', 'AMT_CREDIT',
       'NAME_INCOME_TYPE', 'NAME_EDUCATION_TYPE', 'NAME_FAMILY_STATUS',
       'NAME_HOUSING_TYPE', 'REGION_POPULATION_RELATIVE', 'DAYS_BIRTH',
       'DAYS_EMPLOYED', 'DAYS_REGISTRATION', 'DAYS_ID_PUBLISH', 'FLAG_MOBIL',
       'FLAG_EMP_PHONE', 'FLAG_WORK_PHONE', 'FLAG_CONT_MOBILE', 'FLAG_PHONE',
       'FLAG_EMAIL', 'OCCUPATION_TYPE', 'REGION_RATING_CLIENT',
       'REGION_RATING_CLIENT_W_CITY', 'WEEKDAY_APPR_PROCESS_START',
       'HOUR_APPR_PROCESS_START', 'REG_REGION_NOT_LIVE_REGION',
       'REG_REGION_NOT_WORK_REGION', 'LIVE_REGION_NOT_WORK_REGION',
       'REG_CITY_NOT_LIVE_CITY', 'REG_CITY_NOT_WORK_CITY',
       'LIVE_CITY_NOT_WORK_CITY', 'ORGANIZATION_TYPE', 'FLAG_DOCUMENT_2',
       'FLAG_DOCUMENT_3', 'FLAG_DOCUMENT_4', 'FLAG_DOCUMENT_5',
       'FLAG_DOCUMENT_6', 'FLAG_DOCUMENT_7', 'FLAG_DOCUMENT_8',
       'FLAG_DOCUMENT_9',

In [90]:
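flag_doc_cols = X.columns[X.columns.str.startswith("FLAG_DOCUMENT_")]
X["FLAG_DOCUMENT"] = X[flag_doc_cols].sum(axis=1)  # number of flagged documents per applicant
X = X.drop(columns=flag_doc_cols)
flag_doc_cols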

Index(['FLAG_DOCUMENT_2', 'FLAG_DOCUMENT_3', 'FLAG_DOCUMENT_4',
       'FLAG_DOCUMENT_5', 'FLAG_DOCUMENT_6', 'FLAG_DOCUMENT_7',
       'FLAG_DOCUMENT_8', 'FLAG_DOCUMENT_9', 'FLAG_DOCUMENT_10',
       'FLAG_DOCUMENT_11', 'FLAG_DOCUMENT_12', 'FLAG_DOCUMENT_13',
       'FLAG_DOCUMENT_14', 'FLAG_DOCUMENT_15', 'FLAG_DOCUMENT_16',
       'FLAG_DOCUMENT_17', 'FLAG_DOCUMENT_18', 'FLAG_DOCUMENT_19',
       'FLAG_DOCUMENT_20', 'FLAG_DOCUMENT_21'],
      dtype='object')
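
Selecting columns by position is fragile when two frames do not share the same layout. The train frame carries a TARGET column that the test frame lacks, so the same slice can point at different columns in each. A small sketch on made-up frames (all values hypothetical) shows the effect and a name-based alternative:

```python
import pandas as pd

# Toy frames (hypothetical): train has an extra TARGET column, so positions shift by one
train = pd.DataFrame({
    "SK_ID_CURR": [1, 2],
    "TARGET": [0, 1],
    "ORGANIZATION_TYPE": [5, 7],
    "FLAG_DOCUMENT_2": [1, 0],
    "FLAG_DOCUMENT_3": [1, 1],
})
test = train.drop(columns="TARGET")

# Positional slice: the same indices pick different columns in each frame
print(list(train.columns[2:4]))  # includes ORGANIZATION_TYPE, misses FLAG_DOCUMENT_3
print(list(test.columns[2:4]))   # the two document flags

# Name-based selection picks the same columns in both frames
doc_cols = [c for c in train.columns if c.startswith("FLAG_DOCUMENT_")]
train["FLAG_DOCUMENT"] = train[doc_cols].sum(axis=1)
test["FLAG_DOCUMENT"] = test[doc_cols].sum(axis=1)
print(train["FLAG_DOCUMENT"].tolist(), test["FLAG_DOCUMENT"].tolist())
```

With name-based selection, both frames get the same definition of the engineered feature regardless of column order.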

In [158]:
def feat_train(X, var_name):
    X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)
    
    
    scaler = StandardScaler()
    X_train_transformed = scaler.fit_transform(X_train)
    X_test_transformed = scaler.transform(X_test)


    log_reg = LogisticRegression()
    log_reg.fit(X_train_transformed, y_train)
    y_pred_log_reg = log_reg.predict(X_test_transformed)
    print("-" * 100)
    print(f"ROC for Logistic Regression is {roc_auc_score(y_test, y_pred_log_reg)}")
    print(f"Accuracy for Logistic Regression is {accuracy_score(y_test, y_pred_log_reg)}")
    print(f"ROC for {var_name} is {roc_auc_score(y_test, y_pred_log_reg)}")

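# Select the document flags by name: positional slices point at different columns in df_train and df_test
doc_cols = [c for c in df_train.columns if c.startswith("FLAG_DOCUMENT_")]
df_train["FLAG_DOCUMENT"] = df_train[doc_cols].sum(axis=1)
df_test["FLAG_DOCUMENT"] = df_test[doc_cols].sum(axis=1)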

feature_dict = {"x1": x1, "x2": x2, "x3": x3, "x4": x4, "x5": x5, "x6": x6, "x7": x7}


for key in feature_dict:
    feat_train(feature_dict[key], key)

----------------------------------------------------------------------------------------------------
ROC for Logistic Regression is 0.5
Accuracy for Logistic Regression is 0.9194698093082546
ROC for x1 is 0.5
----------------------------------------------------------------------------------------------------
ROC for Logistic Regression is 0.5000595420877774
Accuracy for Logistic Regression is 0.9194437940633211
ROC for x2 is 0.5000595420877774
----------------------------------------------------------------------------------------------------
ROC for Logistic Regression is 0.5
Accuracy for Logistic Regression is 0.9194698093082546
ROC for x3 is 0.5
----------------------------------------------------------------------------------------------------
ROC for Logistic Regression is 0.5
Accuracy for Logistic Regression is 0.9194698093082546
ROC for x4 is 0.5
----------------------------------------------------------------------------------------------------
ROC for Logistic Regression is 0.