<a href="https://colab.research.google.com/github/Yuji-ONUKI/GCI2020_Winter/blob/main/July021900_micro_model_174_features_0_8_auc_on_home_credit.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

## a Micro Model Study on Home Credit

The Home Credit Default Risk dataset on the Kaggle is subjected as a final project of my DS/ML bootcamp, and I have spent a period of three weeks on this project. I developed various models and quite a large number of them having AUC scores better than 0.8 ( highest one +0.804). Unfortunately, I could not run any full version of my models on Kaggle because of insufficient RAM issue even though datasets are zipped to almost 4 times by integer/float dtype conversion on my datasets. In addition, I made a bleend boosting study to acheive highest AUC score (0.81128, much highers possible) on Kaggle (https://www.kaggle.com/hikmetsezen/blend-boosting-for-home-credit-default-risk).

Here I would like to share my micro model study with you. This micro model has only 174 features and is able to reach better than 0.8 AUC score. Micro model is developed on my base model via successive feature elimination and addition procedure, which is developed by myself. My ambition is that tremendously increasing number of feature is not always necessary to improve performance of model! 

Mostly I use Colab Pro to compute LigthGBM calculations with 5-fold CV on GPUs. My models have 900-1800 features. 

I have a limited knowledge about the credit finance, therefore, I combined many Kaggle notebooks for expending number of features as much as I desire and/or acceptance of my LigthGBM models harvesting further enhance scores. I would like to thank these contributors. Some of them are listed here:
* https://www.kaggle.com/jsaguiar/lightgbm-with-simple-features <=-- my models are based on this study
* https://www.kaggle.com/jsaguiar/lightgbm-7th-place-solution
* https://www.kaggle.com/sangseoseo/oof-all-home-credit-default-risk <=-- in most cases these hyperparameters are used
* https://www.kaggle.com/ashishpatel26/different-basic-blends-possible <=-- thank for blending idea
* https://www.kaggle.com/mathchi/home-credit-risk-with-detailed-feature-engineering
* https://www.kaggle.com/windofdl/kernelf68f763785
* https://www.kaggle.com/meraxes10/lgbm-credit-default-prediction
* https://www.kaggle.com/luudactam/hc-v500
* https://www.kaggle.com/aantonova/aggregating-all-tables-in-one-dataset
* https://www.kaggle.com/wanakon/kernel24647bb75c

In [60]:
# !pip install lightgbm==2.3.1
# import lightgbm
# lightgbm.__version__

In [61]:
# load libraries
import gc
import re
import time
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from lightgbm import LGBMClassifier
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import KFold

In [62]:
# run functions and pre_settings
def one_hot_encoder(df, nan_as_category=True):
    original_columns = list(df.columns)
    categorical_columns = [col for col in df.columns if df[col].dtype == 'object']
    df = pd.get_dummies(df, columns=categorical_columns, dummy_na=nan_as_category)
    new_columns = [c for c in df.columns if c not in original_columns]
    return df, new_columns

def group(df_to_agg, prefix, aggregations, aggregate_by= 'SK_ID_CURR'):
    agg_df = df_to_agg.groupby(aggregate_by).agg(aggregations)
    agg_df.columns = pd.Index(['{}{}_{}'.format(prefix, e[0], e[1].upper())
                               for e in agg_df.columns.tolist()])
    return agg_df.reset_index()

def group_and_merge(df_to_agg, df_to_merge, prefix, aggregations, aggregate_by= 'SK_ID_CURR'):
    agg_df = group(df_to_agg, prefix, aggregations, aggregate_by= aggregate_by)
    return df_to_merge.merge(agg_df, how='left', on= aggregate_by)

def do_sum(dataframe, group_cols, counted, agg_name):
    gp = dataframe[group_cols + [counted]].groupby(group_cols)[counted].sum().reset_index().rename(columns={counted: agg_name})
    dataframe = dataframe.merge(gp, on=group_cols, how='left')
    return dataframe

def reduce_mem_usage(dataframe):
    m_start = dataframe.memory_usage().sum() / 1024 ** 2
    for col in dataframe.columns:
        col_type = dataframe[col].dtype
        if col_type != object:
            c_min = dataframe[col].min()
            c_max = dataframe[col].max()
            if str(col_type)[:3] == 'int':
                if c_min > np.iinfo(np.int8).min and c_max < np.iinfo(np.int8).max:
                    dataframe[col] = dataframe[col].astype(np.int8)
                elif c_min > np.iinfo(np.int16).min and c_max < np.iinfo(np.int16).max:
                    dataframe[col] = dataframe[col].astype(np.int16)
                elif c_min > np.iinfo(np.int32).min and c_max < np.iinfo(np.int32).max:
                    dataframe[col] = dataframe[col].astype(np.int32)
                elif c_min > np.iinfo(np.int64).min and c_max < np.iinfo(np.int64).max:
                    dataframe[col] = dataframe[col].astype(np.int64)
            elif str(col_type)[:5] == 'float':
                if c_min > np.finfo(np.float16).min and c_max < np.finfo(np.float16).max:
                    dataframe[col] = dataframe[col].astype(np.float16)
                elif c_min > np.finfo(np.float32).min and c_max < np.finfo(np.float32).max:
                    dataframe[col] = dataframe[col].astype(np.float32)
                else:
                    dataframe[col] = dataframe[col].astype(np.float64)

    m_end = dataframe.memory_usage().sum() / 1024 ** 2
    return dataframe

nan_as_category = True

In [63]:
def application():
    from google.colab import drive
    drive.mount('/content/drive')

    df = pd.read_csv("/content/drive/MyDrive/GCI/02.（公開）コンペ2-20220621T094535Z-001.zip (Unzipped Files)/02.（公開）コンペ2/input/train.csv")
    test_df = pd.read_csv("/content/drive/MyDrive/GCI/02.（公開）コンペ2-20220621T094535Z-001.zip (Unzipped Files)/02.（公開）コンペ2/input/test.csv")

    df = df.append(test_df).reset_index()

    # general cleaning procedures
    df = df[df['CODE_GENDER'] != 'XNA']
    df = df[df['AMT_INCOME_TOTAL'] < 20000000] # remove a outlier 117M
    # NaN values for DAYS_EMPLOYED: 365.243 -> nan
    df['DAYS_EMPLOYED'].replace(365243, np.nan, inplace=True) # set null value
    df['DAYS_LAST_PHONE_CHANGE'].replace(0, np.nan, inplace=True) # set null value
    #
    df['REGION_POPULATION_RELATIVE_0.04622']=0
    df['REGION_POPULATION_RELATIVE'==0.4622,'REGION_POPULATION_RELATIVE_0.04622']=1
    df['REGION_POPULATION_RELATIVE'==0.4622,'REGION_POPULATION_RELATIVE']=np.nan

    df['REGION_POPULATION_RELATIVE_0.072508']=0
    df['REGION_POPULATION_RELATIVE'==0.072508,'REGION_POPULATION_RELATIVE_0.072508']=1
    df['REGION_POPULATION_RELATIVE'==0.072508,'REGION_POPULATION_RELATIVE']=np.nan

    df['OWN_CAR_AGE_64']=0
    df['OWN_CAR_AGE'==64,'OWN_CAR_AGE_64']=1
    df['OWN_CAR_AGE'==64,'OWN_CAR_AGE']=np.nan

    df['OWN_CAR_AGE_65']=0
    df['OWN_CAR_AGE'==65,'OWN_CAR_AGE_65']=1
    df['OWN_CAR_AGE'==65,'OWN_CAR_AGE']=np.nan

    # Categorical features with Binary encode (0 or 1; two categories)
    for bin_feature in ['CODE_GENDER', 'FLAG_OWN_CAR', 'FLAG_OWN_REALTY']:
        df[bin_feature], uniques = pd.factorize(df[bin_feature])
    
    # Categorical features with One-Hot encode
    df, cat_cols = one_hot_encoder(df, nan_as_category)

    # Flag_document features - count and kurtosis
    docs = [f for f in df.columns if 'FLAG_DOC' in f]
    df['DOCUMENT_COUNT'] = df[docs].sum(axis=1)
    df['NEW_DOC_KURT'] = df[docs].kurtosis(axis=1)

    def get_age_label(days_birth):
        """ Return the age group label (int). """
        age_years = -days_birth / 365
        if age_years < 27: return 1
        elif age_years < 40: return 2
        elif age_years < 50: return 3
        elif age_years < 65: return 4
        elif age_years < 99: return 5
        else: return 0
    # Categorical age - based on target=1 plot
    df['AGE_RANGE'] = df['DAYS_BIRTH'].apply(lambda x: get_age_label(x))

    # New features based on External sources
    df['EXT_SOURCES_PROD'] = df['EXT_SOURCE_1'] * df['EXT_SOURCE_2'] * df['EXT_SOURCE_3']
    df['EXT_SOURCES_WEIGHTED'] = df.EXT_SOURCE_1 * 2 + df.EXT_SOURCE_2 * 1 + df.EXT_SOURCE_3 * 3
    np.warnings.filterwarnings('ignore', r'All-NaN (slice|axis) encountered')
    for function_name in ['min', 'max', 'mean', 'nanmedian', 'var']:
        feature_name = 'EXT_SOURCES_{}'.format(function_name.upper())
        df[feature_name] = eval('np.{}'.format(function_name))(
            df[['EXT_SOURCE_1', 'EXT_SOURCE_2', 'EXT_SOURCE_3']], axis=1)

    # Some simple new features (percentages)
    df['DAYS_EMPLOYED_PERC'] = df['DAYS_EMPLOYED'] / df['DAYS_BIRTH']
    df['INCOME_CREDIT_PERC'] = df['AMT_INCOME_TOTAL'] / df['AMT_CREDIT']
    df['INCOME_PER_PERSON'] = df['AMT_INCOME_TOTAL'] / df['CNT_FAM_MEMBERS']
    df['ANNUITY_INCOME_PERC'] = df['AMT_ANNUITY'] / df['AMT_INCOME_TOTAL']
    df['PAYMENT_RATE'] = df['AMT_ANNUITY'] / df['AMT_CREDIT']

    # Credit ratios
    df['CREDIT_TO_GOODS_RATIO'] = df['AMT_CREDIT'] / df['AMT_GOODS_PRICE']
    
    # Income ratios
    df['INCOME_TO_EMPLOYED_RATIO'] = df['AMT_INCOME_TOTAL'] / df['DAYS_EMPLOYED']
    df['INCOME_TO_BIRTH_RATIO'] = df['AMT_INCOME_TOTAL'] / df['DAYS_BIRTH']
    
    # Time ratios
    df['ID_TO_BIRTH_RATIO'] = df['DAYS_ID_PUBLISH'] / df['DAYS_BIRTH']
    df['CAR_TO_BIRTH_RATIO'] = df['OWN_CAR_AGE'] / df['DAYS_BIRTH']
    df['CAR_TO_EMPLOYED_RATIO'] = df['OWN_CAR_AGE'] / df['DAYS_EMPLOYED']
    df['PHONE_TO_BIRTH_RATIO'] = df['DAYS_LAST_PHONE_CHANGE'] / df['DAYS_BIRTH']

    # EXT_SOURCE_X FEATURE
    df['APPS_EXT_SOURCE_MEAN'] = df[['EXT_SOURCE_1', 'EXT_SOURCE_2', 'EXT_SOURCE_3']].mean(axis=1)
    df['APPS_EXT_SOURCE_STD'] = df[['EXT_SOURCE_1', 'EXT_SOURCE_2', 'EXT_SOURCE_3']].std(axis=1)
    df['APPS_EXT_SOURCE_STD'] = df['APPS_EXT_SOURCE_STD'].fillna(df['APPS_EXT_SOURCE_STD'].mean())
    df['APP_SCORE1_TO_BIRTH_RATIO'] = df['EXT_SOURCE_1'] / (df['DAYS_BIRTH'] / 365.25)
    df['APP_SCORE2_TO_BIRTH_RATIO'] = df['EXT_SOURCE_2'] / (df['DAYS_BIRTH'] / 365.25)
    df['APP_SCORE3_TO_BIRTH_RATIO'] = df['EXT_SOURCE_3'] / (df['DAYS_BIRTH'] / 365.25)
    df['APP_SCORE1_TO_EMPLOY_RATIO'] = df['EXT_SOURCE_1'] / (df['DAYS_EMPLOYED'] / 365.25)
    df['APP_EXT_SOURCE_2*EXT_SOURCE_3*DAYS_BIRTH'] = df['EXT_SOURCE_1'] * df['EXT_SOURCE_2'] * df['DAYS_BIRTH']
    df['APP_SCORE1_TO_FAM_CNT_RATIO'] = df['EXT_SOURCE_1'] / df['CNT_FAM_MEMBERS']
    df['APP_SCORE1_TO_GOODS_RATIO'] = df['EXT_SOURCE_1'] / df['AMT_GOODS_PRICE']
    df['APP_SCORE1_TO_CREDIT_RATIO'] = df['EXT_SOURCE_1'] / df['AMT_CREDIT']
    df['APP_SCORE1_TO_SCORE2_RATIO'] = df['EXT_SOURCE_1'] / df['EXT_SOURCE_2']
    df['APP_SCORE1_TO_SCORE3_RATIO'] = df['EXT_SOURCE_1'] / df['EXT_SOURCE_3']
    df['APP_SCORE2_TO_CREDIT_RATIO'] = df['EXT_SOURCE_2'] / df['AMT_CREDIT']
    df['APP_SCORE2_TO_REGION_RATING_RATIO'] = df['EXT_SOURCE_2'] / df['REGION_RATING_CLIENT']
    df['APP_SCORE2_TO_CITY_RATING_RATIO'] = df['EXT_SOURCE_2'] / df['REGION_RATING_CLIENT_W_CITY']
    df['APP_SCORE2_TO_POP_RATIO'] = df['EXT_SOURCE_2'] / df['REGION_POPULATION_RELATIVE']
    df['APP_SCORE2_TO_PHONE_CHANGE_RATIO'] = df['EXT_SOURCE_2'] / df['DAYS_LAST_PHONE_CHANGE']
    df['APP_EXT_SOURCE_1*EXT_SOURCE_2'] = df['EXT_SOURCE_1'] * df['EXT_SOURCE_2']
    df['APP_EXT_SOURCE_1*EXT_SOURCE_3'] = df['EXT_SOURCE_1'] * df['EXT_SOURCE_3']
    df['APP_EXT_SOURCE_2*EXT_SOURCE_3'] = df['EXT_SOURCE_2'] * df['EXT_SOURCE_3']
    df['APP_EXT_SOURCE_1*DAYS_EMPLOYED'] = df['EXT_SOURCE_1'] * df['DAYS_EMPLOYED']
    df['APP_EXT_SOURCE_2*DAYS_EMPLOYED'] = df['EXT_SOURCE_2'] * df['DAYS_EMPLOYED']
    df['APP_EXT_SOURCE_3*DAYS_EMPLOYED'] = df['EXT_SOURCE_3'] * df['DAYS_EMPLOYED']

    # AMT_INCOME_TOTAL : income
    # CNT_FAM_MEMBERS  : the number of family members
    df['APPS_GOODS_INCOME_RATIO'] = df['AMT_GOODS_PRICE'] / df['AMT_INCOME_TOTAL']
    df['APPS_CNT_FAM_INCOME_RATIO'] = df['AMT_INCOME_TOTAL'] / df['CNT_FAM_MEMBERS']
    
    # DAYS_BIRTH : Client's age in days at the time of application
    # DAYS_EMPLOYED : How many days before the application the person started current employment
    df['APPS_INCOME_EMPLOYED_RATIO'] = df['AMT_INCOME_TOTAL'] / df['DAYS_EMPLOYED']

    # other feature from better than 0.8
    df['CREDIT_TO_GOODS_RATIO_2'] = df['AMT_CREDIT'] / df['AMT_GOODS_PRICE']
    df['APP_AMT_INCOME_TOTAL_12_AMT_ANNUITY_ratio'] = df['AMT_INCOME_TOTAL'] / 12. - df['AMT_ANNUITY']
    df['APP_INCOME_TO_EMPLOYED_RATIO'] = df['AMT_INCOME_TOTAL'] / df['DAYS_EMPLOYED']
    df['APP_DAYS_LAST_PHONE_CHANGE_DAYS_EMPLOYED_ratio'] = df['DAYS_LAST_PHONE_CHANGE'] / df['DAYS_EMPLOYED']
    df['APP_DAYS_EMPLOYED_DAYS_BIRTH_diff'] = df['DAYS_EMPLOYED'] - df['DAYS_BIRTH']

    print('"Application_Train_Test" final shape:', df.shape)
    return df

In [65]:
df = application()
df = reduce_mem_usage(df)
print('data types are converted for a reduced memory usage')
df = df.rename(columns=lambda x: re.sub('[^A-Za-z0-9_]+', '_', x))
print('names of feature are renamed')

Drive already mounted at /content/drive; to attempt to forcibly remount, call drive.mount("/content/drive", force_remount=True).
"Application_Train_Test" final shape: (232697, 215)
data types are converted for a reduced memory usage
names of feature are renamed


In [66]:
print('===============================================', '\n', '##### the ML in processing...')

    # loading predicted result 
df_sub = df.loc[df['TARGET'].isnull(),['SK_ID_CURR', 'TARGET']]


    # split train, and test datasets
train_df = df[df['TARGET'].notnull()]
test_df = df[df['TARGET'].isnull()]
del df

    # Expand train dataset with two times of test dataset including predicted results
test_df.TARGET = np.where(df_sub.TARGET > 0.75, 1, 0)
train_df = pd.concat([train_df, test_df], axis=0)
train_df = pd.concat([train_df, test_df], axis=0)
print(f'Train shape: {train_df.shape}, test shape: {test_df.shape} are loaded.')


    # Cross validation model
folds = KFold(n_splits=5, shuffle=True, random_state=2020)

    # Create arrays and dataframes to store results
oof_preds = np.zeros(train_df.shape[0])
sub_preds = np.zeros(test_df.shape[0])


    # limit number of feature to only 174!!!
feats = ['index', 'ORGANIZATION_TYPE_Industry_type_5', 'NAME_EDUCATION_TYPE_Higher_education','REGION_RATING_CLIENT_W_CITY', 'NAME_HOUSING_TYPE_House_apartment', 'ANNUITY_INCOME_PERC', 'ORGANIZATION_TYPE_Services', 'ORGANIZATION_TYPE_Cleaning', 'ORGANIZATION_TYPE_Military',  'ORGANIZATION_TYPE_School',    'DAYS_BIRTH',  'OCCUPATION_TYPE_High_skill_tech_staff',  'OCCUPATION_TYPE_Private_service_staff',  'OCCUPATION_TYPE_HR_staff',  'CODE_GENDER','ORGANIZATION_TYPE_Advertising', 'EXT_SOURCE_3', 'OCCUPATION_TYPE_Managers', 'FLAG_OWN_REALTY',  'AMT_CREDIT', 'INCOME_PER_PERSON', 'ORGANIZATION_TYPE_Police', 'FLAG_WORK_PHONE', 'ORGANIZATION_TYPE_University', 'ORGANIZATION_TYPE_Medicine', 'ORGANIZATION_TYPE_Telecom', 'ORGANIZATION_TYPE_Housing', 'FLAG_CONT_MOBILE', 'FLAG_EMAIL',  'REGION_POPULATION_RELATIVE', 'ORGANIZATION_TYPE_Electricity', 'REGION_RATING_CLIENT',  'DAYS_ID_PUBLISH', 'EXT_SOURCE_1', 'ORGANIZATION_TYPE_Realtor', 'OCCUPATION_TYPE_Laborers', 'ORGANIZATION_TYPE_Security', 'AMT_INCOME_TOTAL',  'PAYMENT_RATE', 'FLAG_OWN_CAR',  'ORGANIZATION_TYPE_Mobile', 'DAYS_EMPLOYED_PERC', 'INCOME_CREDIT_PERC',  'ORGANIZATION_TYPE_Postal', 'ORGANIZATION_TYPE_Insurance', 'OCCUPATION_TYPE_Accountants',  'ORGANIZATION_TYPE_Agriculture', 'EXT_SOURCE_2',  'AMT_ANNUITY', 'AMT_GOODS_PRICE', 'ORGANIZATION_TYPE_Construction','REGION_POPULATION_RELATIVE_0.04622','REGION_POPULATION_RELATIVE_0.072508','OWN_CAR_AGE_64','OWN_CAR_AGE_65']

    # print final shape of dataset to evaluate by LightGBM
print(f'only {len(feats)} features from a total {train_df.shape[1]} features are used for ML analysis')

for n_fold, (train_idx, valid_idx) in enumerate(folds.split(train_df[feats], train_df['TARGET'])):
        train_x, train_y = train_df[feats].iloc[train_idx], train_df['TARGET'].iloc[train_idx]
        valid_x, valid_y = train_df[feats].iloc[valid_idx], train_df['TARGET'].iloc[valid_idx]
        clf = LGBMClassifier(nthread=-1,
                            n_estimators=5000,
                            learning_rate=0.01,
                            max_depth=11,
                            num_leaves=58,
                            colsample_bytree=0.613,
                            subsample=0.708,
                            max_bin=407,
                            reg_alpha=3.564,
                            reg_lambda=4.930,
                            min_child_weight=6,
                            min_child_samples=165,
                            silent=-1,
                            verbose=-1,)

        clf.fit(train_x, train_y, eval_set=[(train_x, train_y), (valid_x, valid_y)], eval_metric='auc', verbose=500, early_stopping_rounds=500)

        oof_preds[valid_idx] = clf.predict_proba(valid_x, num_iteration=clf.best_iteration_)[:, 1]
        sub_preds += clf.predict_proba(test_df[feats], num_iteration=clf.best_iteration_)[:, 1] / folds.n_splits

        print('Fold %2d AUC : %.6f' % (n_fold + 1, roc_auc_score(valid_y, oof_preds[valid_idx])))
        del clf, train_x, train_y, valid_x, valid_y

print('Full AUC score %.6f' % roc_auc_score(train_df['TARGET'], oof_preds))

    # create submission file
test_df['TARGET'] = sub_preds
test_df[['SK_ID_CURR', 'TARGET']].to_csv('submission.csv', index=False)
print('a submission file is created')

 ##### the ML in processing...


A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  self[name] = value


Train shape: (294196, 215), test shape: (61499, 215) are loaded.
only 51 features from a total 215 features are used for ML analysis
Training until validation scores don't improve for 500 rounds.
[500]	training's auc: 0.812665	training's binary_logloss: 0.157258	valid_1's auc: 0.768862	valid_1's binary_logloss: 0.164571
[1000]	training's auc: 0.839508	training's binary_logloss: 0.150563	valid_1's auc: 0.772146	valid_1's binary_logloss: 0.163966
[1500]	training's auc: 0.85918	training's binary_logloss: 0.145512	valid_1's auc: 0.772376	valid_1's binary_logloss: 0.164009
Early stopping, best iteration is:
[1127]	training's auc: 0.844955	training's binary_logloss: 0.149204	valid_1's auc: 0.772235	valid_1's binary_logloss: 0.163951
Fold  1 AUC : 0.772235
Training until validation scores don't improve for 500 rounds.
[500]	training's auc: 0.81082	training's binary_logloss: 0.157292	valid_1's auc: 0.782701	valid_1's binary_logloss: 0.164659
[1000]	training's auc: 0.838183	training's binary_lo