# Home Credit Default Risk
In this notebook I explore the datasets provided for the Home Credit Default Risk Kaggle challenge.  It covers the following learning objectives:
- Working with structured data
- Encoding of categorical variables 
- Handling missing values 

Some of this notebook follows the helpful Kaggle kernel created by Will Koehrsen, hosted [here](https://www.kaggle.com/willkoehrsen/start-here-a-gentle-introduction).  The task is to build a model that predicts an applicant's risk of default from the provided datasets.

In [1]:
import matplotlib.pyplot as plt
import numpy as np 
import pandas as pd 

%matplotlib inline

In [2]:
import glob 
path_to_data = '../data/raw/'

for datafile in glob.glob(path_to_data + '*.csv'):
    print(datafile)

../data/raw/application_test.csv
../data/raw/HomeCredit_columns_description.csv
../data/raw/POS_CASH_balance.csv
../data/raw/credit_card_balance.csv
../data/raw/installments_payments.csv
../data/raw/application_train.csv
../data/raw/bureau.csv
../data/raw/previous_application.csv
../data/raw/bureau_balance.csv
../data/raw/sample_submission.csv


### Load Data 
The first thing I want to do is take a look at the structure of each dataset provided.  I will check the size of each dataset, the number of missing values, and the number of categorical features to be encoded or otherwise dealt with.  

In [3]:
def check_structure(data):
    
    # Check shape
    print('Dataset has shape (samples, features): ', data.shape)
    
    # Check missing values (percentage of total)
    missing_values = data.isnull().sum() / len(data) * 100.0
    missing_values.sort_values(ascending=False, inplace=True)
    
    print('\nDataset missing values: ')
    print(missing_values.head(12))
    
    # Check type of variables 
    print('\nDataset value types: ')
    print(data.dtypes.value_counts())
    
    # List categorical features and their number of unique values
    print('\nDataset categorical summary: ')
    print(data.select_dtypes('object').apply(pd.Series.nunique, axis=0))

In [4]:
app_train = pd.read_csv(path_to_data + 'application_train.csv')
installment_df = pd.read_csv(path_to_data + 'installments_payments.csv')
credit_card_df = pd.read_csv(path_to_data + 'credit_card_balance.csv')
bureau_df = pd.read_csv(path_to_data + 'bureau.csv')
cash_df = pd.read_csv(path_to_data + 'POS_CASH_balance.csv')
bureau_balance_df = pd.read_csv(path_to_data + 'bureau_balance.csv')
prev_app_df = pd.read_csv(path_to_data + 'previous_application.csv')

In [5]:
description = pd.read_csv(path_to_data + 'HomeCredit_columns_description.csv')

### Application Data 
This is the main dataset, with one row per loan application.  It contains 122 columns, one of which is the target.  Many columns have a large share of missing values (up to about 70%), and 16 columns are categorical.

In [6]:
check_structure(app_train)

('Dataset has shape (samples, features): ', (307511, 122))

Dataset missing values: 
COMMONAREA_MEDI             69.872297
COMMONAREA_AVG              69.872297
COMMONAREA_MODE             69.872297
NONLIVINGAPARTMENTS_MODE    69.432963
NONLIVINGAPARTMENTS_MEDI    69.432963
NONLIVINGAPARTMENTS_AVG     69.432963
FONDKAPREMONT_MODE          68.386172
LIVINGAPARTMENTS_MEDI       68.354953
LIVINGAPARTMENTS_MODE       68.354953
LIVINGAPARTMENTS_AVG        68.354953
FLOORSMIN_MEDI              67.848630
FLOORSMIN_MODE              67.848630
dtype: float64

Dataset value types: 
float64    65
int64      41
object     16
dtype: int64

Dataset categorical summary: 
NAME_CONTRACT_TYPE             2
CODE_GENDER                    3
FLAG_OWN_CAR                   2
FLAG_OWN_REALTY                2
NAME_TYPE_SUITE                7
NAME_INCOME_TYPE               8
NAME_EDUCATION_TYPE            5
NAME_FAMILY_STATUS             6
NAME_HOUSING_TYPE              6
OCCUPATION_TYPE               18
WEEKD

In [7]:
app_train.head()

Unnamed: 0,SK_ID_CURR,TARGET,NAME_CONTRACT_TYPE,CODE_GENDER,FLAG_OWN_CAR,FLAG_OWN_REALTY,CNT_CHILDREN,AMT_INCOME_TOTAL,AMT_CREDIT,AMT_ANNUITY,...,FLAG_DOCUMENT_18,FLAG_DOCUMENT_19,FLAG_DOCUMENT_20,FLAG_DOCUMENT_21,AMT_REQ_CREDIT_BUREAU_HOUR,AMT_REQ_CREDIT_BUREAU_DAY,AMT_REQ_CREDIT_BUREAU_WEEK,AMT_REQ_CREDIT_BUREAU_MON,AMT_REQ_CREDIT_BUREAU_QRT,AMT_REQ_CREDIT_BUREAU_YEAR
0,100002,1,Cash loans,M,N,Y,0,202500.0,406597.5,24700.5,...,0,0,0,0,0.0,0.0,0.0,0.0,0.0,1.0
1,100003,0,Cash loans,F,N,N,0,270000.0,1293502.5,35698.5,...,0,0,0,0,0.0,0.0,0.0,0.0,0.0,0.0
2,100004,0,Revolving loans,M,Y,Y,0,67500.0,135000.0,6750.0,...,0,0,0,0,0.0,0.0,0.0,0.0,0.0,0.0
3,100006,0,Cash loans,F,N,Y,0,135000.0,312682.5,29686.5,...,0,0,0,0,,,,,,
4,100007,0,Cash loans,M,N,Y,0,121500.0,513000.0,21865.5,...,0,0,0,0,0.0,0.0,0.0,0.0,0.0,0.0
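
The application table combines both problems named in the learning objectives: columns that are mostly empty and object columns that a model cannot consume directly. Below is a minimal sketch of one way to handle them, run on a toy frame rather than on `app_train`. The `prepare` helper, the 0.6 drop threshold, and the choice of median imputation are my own illustrative assumptions, not steps this notebook performs.

```python
import numpy as np
import pandas as pd

# Toy stand-in for app_train; all values are made up for illustration.
toy = pd.DataFrame({
    'SK_ID_CURR': [1, 2, 3, 4, 5],
    'AMT_ANNUITY': [100.0, np.nan, 300.0, 400.0, 200.0],
    'COMMONAREA_AVG': [np.nan, np.nan, np.nan, 0.2, np.nan],
    'FLAG_OWN_CAR': ['N', 'Y', 'N', 'N', 'Y'],
    'NAME_HOUSING_TYPE': ['House', 'Rented', None, 'House', 'Rented'],
})

def prepare(df, max_missing=0.6):
    """Hypothetical helper: drop sparse columns, impute numerics, one-hot encode objects."""
    out = df.copy()
    # 1. Drop columns whose missing fraction exceeds the (assumed) threshold.
    frac_missing = out.isnull().mean()
    out = out.drop(columns=frac_missing[frac_missing > max_missing].index)
    # 2. Fill the remaining numeric gaps with each column's median.
    num_cols = out.select_dtypes('number').columns
    out[num_cols] = out[num_cols].fillna(out[num_cols].median())
    # 3. One-hot encode object columns; dummy_na adds an explicit "missing" indicator.
    return pd.get_dummies(out, dummy_na=True)

prepared = prepare(toy)
print(prepared.columns.tolist())
```

On real data the imputation statistics should be computed on the training split only and then reused for the test split, so that both end up with the same columns and fill values.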


### Installment Data 
This dataset provides information on previous Home Credit payments.  Each payment made or missed is a separate row.  The structure is simple: 8 columns, no categorical variables, and very few missing values (about 0.02% in two columns).

In [8]:
check_structure(installment_df)

('Dataset has shape (samples, features): ', (13605401, 8))

Dataset missing values: 
AMT_PAYMENT               0.021352
DAYS_ENTRY_PAYMENT        0.021352
AMT_INSTALMENT            0.000000
DAYS_INSTALMENT           0.000000
NUM_INSTALMENT_NUMBER     0.000000
NUM_INSTALMENT_VERSION    0.000000
SK_ID_CURR                0.000000
SK_ID_PREV                0.000000
dtype: float64

Dataset value types: 
float64    5
int64      3
dtype: int64

Dataset categorical summary: 
Series([], dtype: float64)


### Credit Card Data
Information on credit cards previously held with Home Credit.  The dataset contains 23 columns, one of which is categorical; several columns are missing roughly 8% to 20% of their values.

In [9]:
check_structure(credit_card_df)

('Dataset has shape (samples, features): ', (3840312, 23))

Dataset missing values: 
AMT_PAYMENT_CURRENT           19.998063
AMT_DRAWINGS_OTHER_CURRENT    19.524872
CNT_DRAWINGS_POS_CURRENT      19.524872
CNT_DRAWINGS_OTHER_CURRENT    19.524872
CNT_DRAWINGS_ATM_CURRENT      19.524872
AMT_DRAWINGS_ATM_CURRENT      19.524872
AMT_DRAWINGS_POS_CURRENT      19.524872
CNT_INSTALMENT_MATURE_CUM      7.948208
AMT_INST_MIN_REGULARITY        7.948208
SK_DPD_DEF                     0.000000
SK_ID_CURR                     0.000000
MONTHS_BALANCE                 0.000000
dtype: float64

Dataset value types: 
float64    15
int64       7
object      1
dtype: int64

Dataset categorical summary: 
NAME_CONTRACT_STATUS    7
dtype: int64


### Bureau Data 
Information about the client's credit standing with other financial institutions.  This dataset contains 17 columns, three of which are categorical.  Five columns are missing 15% or more of their values, led by `AMT_ANNUITY` at about 71%.

In [10]:
check_structure(bureau_df)

('Dataset has shape (samples, features): ', (1716428, 17))

Dataset missing values: 
AMT_ANNUITY               71.473490
AMT_CREDIT_MAX_OVERDUE    65.513264
DAYS_ENDDATE_FACT         36.916958
AMT_CREDIT_SUM_LIMIT      34.477415
AMT_CREDIT_SUM_DEBT       15.011932
DAYS_CREDIT_ENDDATE        6.149573
AMT_CREDIT_SUM             0.000757
CREDIT_TYPE                0.000000
AMT_CREDIT_SUM_OVERDUE     0.000000
CNT_CREDIT_PROLONG         0.000000
DAYS_CREDIT_UPDATE         0.000000
CREDIT_DAY_OVERDUE         0.000000
dtype: float64

Dataset value types: 
float64    8
int64      6
object     3
dtype: int64

Dataset categorical summary: 
CREDIT_ACTIVE       4
CREDIT_CURRENCY     4
CREDIT_TYPE        15
dtype: int64


### Cash Dataset 
Point-of-sale (POS) and cash loan balance information.  This dataset contains eight columns, one of which is categorical, and almost no missing entries.

In [11]:
check_structure(cash_df)

('Dataset has shape (samples, features): ', (10001358, 8))

Dataset missing values: 
CNT_INSTALMENT_FUTURE    0.260835
CNT_INSTALMENT           0.260675
SK_DPD_DEF               0.000000
SK_DPD                   0.000000
NAME_CONTRACT_STATUS     0.000000
MONTHS_BALANCE           0.000000
SK_ID_CURR               0.000000
SK_ID_PREV               0.000000
dtype: float64

Dataset value types: 
int64      5
float64    2
object     1
dtype: int64

Dataset categorical summary: 
NAME_CONTRACT_STATUS    9
dtype: int64


### Bureau Balance Data 
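This dataset details the monthly payment status of each credit reported by the bureau.  It has three columns, one of which (`STATUS`) is categorical, and no missing values.  Note that it is keyed by `SK_ID_BUREAU` rather than `SK_ID_CURR`.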

In [12]:
check_structure(bureau_balance_df)

('Dataset has shape (samples, features): ', (27299925, 3))

Dataset missing values: 
STATUS            0.0
MONTHS_BALANCE    0.0
SK_ID_BUREAU      0.0
dtype: float64

Dataset value types: 
int64     2
object    1
dtype: int64

Dataset categorical summary: 
STATUS    8
dtype: int64


### Previous Application Data 
This dataset contains the applicant's previous applications for Home Credit loans.  With 37 columns it is the second widest table after the application data, and 16 of those columns are categorical.  Two interest-rate columns are more than 99% missing.

In [13]:
check_structure(prev_app_df)

('Dataset has shape (samples, features): ', (1670214, 37))

Dataset missing values: 
RATE_INTEREST_PRIVILEGED     99.643698
RATE_INTEREST_PRIMARY        99.643698
RATE_DOWN_PAYMENT            53.636480
AMT_DOWN_PAYMENT             53.636480
NAME_TYPE_SUITE              49.119754
DAYS_TERMINATION             40.298129
NFLAG_INSURED_ON_APPROVAL    40.298129
DAYS_FIRST_DRAWING           40.298129
DAYS_FIRST_DUE               40.298129
DAYS_LAST_DUE_1ST_VERSION    40.298129
DAYS_LAST_DUE                40.298129
AMT_GOODS_PRICE              23.081773
dtype: float64

Dataset value types: 
object     16
float64    15
int64       6
dtype: int64

Dataset categorical summary: 
NAME_CONTRACT_TYPE              4
WEEKDAY_APPR_PROCESS_START      7
FLAG_LAST_APPL_PER_CONTRACT     2
NAME_CASH_LOAN_PURPOSE         25
NAME_CONTRACT_STATUS            4
NAME_PAYMENT_TYPE               4
CODE_REJECT_REASON              9
NAME_TYPE_SUITE                 7
NAME_CLIENT_TYPE                4
NAME_GOODS_CATEGO

### Aggregate Data 
In the following section, the data are aggregated by the applicant identification number `SK_ID_CURR`.

In [14]:
installment_df.columns

Index([u'SK_ID_PREV', u'SK_ID_CURR', u'NUM_INSTALMENT_VERSION',
       u'NUM_INSTALMENT_NUMBER', u'DAYS_INSTALMENT', u'DAYS_ENTRY_PAYMENT',
       u'AMT_INSTALMENT', u'AMT_PAYMENT'],
      dtype='object')

In [15]:
for c in installment_df.columns:
    print(c, description[description.Row == c].Description.values)

('SK_ID_PREV', array([], dtype=object))
('SK_ID_CURR', array(['ID of loan in our sample',
       'ID of loan in our sample - one loan in our sample can have 0,1,2 or more related previous credits in credit bureau ',
       'ID of loan in our sample', 'ID of loan in our sample',
       'ID of loan in our sample', 'ID of loan in our sample'],
      dtype=object))
('NUM_INSTALMENT_VERSION', array(['Version of installment calendar (0 is for credit card) of previous credit. Change of installment version from month to month signifies that some parameter of payment calendar has changed'],
      dtype=object))
('NUM_INSTALMENT_NUMBER', array(['On which installment we observe payment'], dtype=object))
('DAYS_INSTALMENT', array(['When the installment of previous credit was supposed to be paid (relative to application date of current loan)'],
      dtype=object))
('DAYS_ENTRY_PAYMENT', array(['When was the installments of previous credit paid actually (relative to application date of current loan

In [16]:
# Add features to the installment dataset.
installment_df['PAYMENT_DIFF'] = installment_df.AMT_PAYMENT - installment_df.AMT_INSTALMENT
installment_df['PAYMENT_PERC'] = installment_df.AMT_PAYMENT / installment_df.AMT_INSTALMENT
# Scheduled day minus actual payment day: positive values mean the payment was
# entered before the due date, negative values mean it was late.
installment_df['DPD'] = installment_df.DAYS_INSTALMENT - installment_df.DAYS_ENTRY_PAYMENT

# Define aggregation methods for float/int cols
aggregations = {
    'NUM_INSTALMENT_VERSION': ['mean', 'min', 'max'],
    'NUM_INSTALMENT_NUMBER': ['mean', 'min', 'max'],
    'DAYS_INSTALMENT': ['mean', 'min', 'max'],
    'DAYS_ENTRY_PAYMENT': ['mean', 'min', 'max'],
    'AMT_INSTALMENT': ['mean', 'min', 'max'],
    'AMT_PAYMENT': ['mean', 'min', 'max'],
    'PAYMENT_DIFF': ['min', 'max', 'mean'],
    'PAYMENT_PERC': ['min', 'max', 'mean'],
    'DPD': ['min', 'max', 'mean']
               }

install_agg = installment_df.groupby('SK_ID_CURR').aggregate(aggregations)

colname = lambda x, y: 'INSTALL_' + x + '_' + y.upper() 
new_cols = [colname(c[0],c[1]) for c in list(install_agg.columns)]
install_agg.columns = new_cols

In [17]:
install_agg.head()

SK_ID_CURR,INSTALL_DAYS_ENTRY_PAYMENT_MEAN,INSTALL_DAYS_ENTRY_PAYMENT_MIN,INSTALL_DAYS_ENTRY_PAYMENT_MAX,INSTALL_AMT_PAYMENT_MEAN,INSTALL_AMT_PAYMENT_MIN,INSTALL_AMT_PAYMENT_MAX,INSTALL_DPD_MIN,INSTALL_DPD_MAX,INSTALL_DPD_MEAN,INSTALL_AMT_INSTALMENT_MEAN,...,INSTALL_PAYMENT_DIFF_MEAN,INSTALL_NUM_INSTALMENT_VERSION_MEAN,INSTALL_NUM_INSTALMENT_VERSION_MIN,INSTALL_NUM_INSTALMENT_VERSION_MAX,INSTALL_DAYS_INSTALMENT_MEAN,INSTALL_DAYS_INSTALMENT_MIN,INSTALL_DAYS_INSTALMENT_MAX,INSTALL_PAYMENT_PERC_MIN,INSTALL_PAYMENT_PERC_MAX,INSTALL_PAYMENT_PERC_MEAN
100001,-2195.0,-2916.0,-1628.0,5885.132143,3951.0,17397.9,-11.0,36.0,7.285714,5885.132143,...,0.0,1.142857,1.0,2.0,-2187.714286,-2916.0,-1619.0,1.0,1.0,1.0
100002,-315.421053,-587.0,-49.0,11559.247105,9251.775,53093.745,12.0,31.0,20.421053,11559.247105,...,0.0,1.052632,1.0,2.0,-295.0,-565.0,-25.0,1.0,1.0,1.0
100003,-1385.32,-2324.0,-544.0,64754.586,6662.97,560835.36,1.0,14.0,7.16,64754.586,...,0.0,1.04,1.0,2.0,-1378.16,-2310.0,-536.0,1.0,1.0,1.0
100004,-761.666667,-795.0,-727.0,7096.155,5357.25,10573.965,3.0,11.0,7.666667,7096.155,...,0.0,1.333333,1.0,2.0,-754.0,-784.0,-724.0,1.0,1.0,1.0
100005,-609.555556,-736.0,-470.0,6240.205,4813.2,17656.245,-1.0,37.0,23.555556,6240.205,...,0.0,1.111111,1.0,2.0,-586.0,-706.0,-466.0,1.0,1.0,1.0
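
The `PAYMENT_PERC` feature divides by `AMT_INSTALMENT`. If any installment amount is zero, pandas returns `inf` (or `NaN` for 0/0) rather than raising, and a single `inf` then dominates the `max` and `mean` aggregates for that applicant. The sketch below shows one possible guard on made-up values; treating such ratios as missing is my assumption about what is sensible here, not something the notebook does.

```python
import numpy as np
import pandas as pd

# Toy installments; the second row has a zero installment amount.
toy = pd.DataFrame({
    'SK_ID_CURR': [1, 1, 2, 2],
    'AMT_INSTALMENT': [100.0, 0.0, 50.0, 200.0],
    'AMT_PAYMENT': [100.0, 10.0, 25.0, 200.0],
})

# Plain division yields inf where the denominator is zero.
toy['PAYMENT_PERC'] = toy['AMT_PAYMENT'] / toy['AMT_INSTALMENT']

# Treat non-finite ratios as missing so that min/max/mean skip them.
toy['PAYMENT_PERC'] = toy['PAYMENT_PERC'].replace([np.inf, -np.inf], np.nan)

agg = toy.groupby('SK_ID_CURR')['PAYMENT_PERC'].agg(['min', 'max', 'mean'])
print(agg)
```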


In [18]:
for c in credit_card_df.columns:
    print(c, description[description.Row == c].Description.values)

('SK_ID_PREV', array([], dtype=object))
('SK_ID_CURR', array(['ID of loan in our sample',
       'ID of loan in our sample - one loan in our sample can have 0,1,2 or more related previous credits in credit bureau ',
       'ID of loan in our sample', 'ID of loan in our sample',
       'ID of loan in our sample', 'ID of loan in our sample'],
      dtype=object))
('MONTHS_BALANCE', array(['Month of balance relative to application date (-1 means the freshest balance date)',
       'Month of balance relative to application date (-1 means the information to the freshest monthly snapshot, 0 means the information at application - often it will be the same as -1 as many banks are not updating the information to Credit Bureau regularly )',
       'Month of balance relative to application date (-1 means the freshest balance date)'],
      dtype=object))
('AMT_BALANCE', array(['Balance during the month of previous credit'], dtype=object))
('AMT_CREDIT_LIMIT_ACTUAL', array(['Credit card limit duri

In [19]:
from sklearn.preprocessing import LabelEncoder

In [20]:
cat_cols = [credit_card_df.columns[i] for i, c in enumerate(credit_card_df.dtypes) if c == 'object']
print('Categoricals: ', cat_cols)

credit_card_df['BALANCE_PERC'] =  credit_card_df.AMT_BALANCE / credit_card_df.AMT_CREDIT_LIMIT_ACTUAL

# Map the unique strings to unique integers.
for cat in cat_cols: 
    encoder = LabelEncoder()
    credit_card_df[cat] = encoder.fit_transform(credit_card_df[cat])
    
# Add features that make sense here.    
#aggregations = {
#    'AMT_BALANCE': ['min', 'mean', 'max'],
#    'BALANCE_PERC': ['min', 'max', 'mean']
#}

# Do more custom modification.
#credit_card_agg = credit_card_df.groupby('SK_ID_CURR').aggregate(aggregations)

# Do general aggregations.
credit_card_agg = credit_card_df.groupby('SK_ID_CURR').aggregate(['min', 'max', 'sum', 'var', 'mean'])

colname = lambda x, y: 'CC_' + x + '_' + y.upper() 
new_cols = [colname(c[0],c[1]) for c in list(credit_card_agg.columns)]
credit_card_agg.columns = new_cols

('Categoricals: ', ['NAME_CONTRACT_STATUS'])
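
One caveat with the cell above: `LabelEncoder` assigns integer codes in sorted order of the labels, so the `mean` or `var` of an encoded column such as `NAME_CONTRACT_STATUS` treats the categories as if they were ordered numbers. An alternative is to one-hot encode first and then average, which gives the share of monthly records an applicant spent in each status. The sketch below uses made-up rows, and the `CC_STATUS` prefix is a hypothetical name of my choosing.

```python
import pandas as pd

# Toy monthly credit card records; values are illustrative only.
toy = pd.DataFrame({
    'SK_ID_CURR': [1, 1, 1, 2, 2],
    'NAME_CONTRACT_STATUS': ['Active', 'Active', 'Completed', 'Active', 'Signed'],
})

# One indicator column per category (float so that the mean is a fraction).
dummies = pd.get_dummies(toy['NAME_CONTRACT_STATUS'], prefix='CC_STATUS', dtype=float)

# Mean of an indicator per applicant = share of records in that status.
status_share = dummies.groupby(toy['SK_ID_CURR']).mean()
print(status_share)
```

Each row of `status_share` sums to 1, which makes the features easy to sanity-check.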


In [21]:
credit_card_agg.head()

SK_ID_CURR,CC_SK_ID_PREV_MIN,CC_SK_ID_PREV_MAX,CC_SK_ID_PREV_SUM,CC_SK_ID_PREV_VAR,CC_SK_ID_PREV_MEAN,CC_MONTHS_BALANCE_MIN,CC_MONTHS_BALANCE_MAX,CC_MONTHS_BALANCE_SUM,CC_MONTHS_BALANCE_VAR,CC_MONTHS_BALANCE_MEAN,...,CC_SK_DPD_DEF_MIN,CC_SK_DPD_DEF_MAX,CC_SK_DPD_DEF_SUM,CC_SK_DPD_DEF_VAR,CC_SK_DPD_DEF_MEAN,CC_BALANCE_PERC_MIN,CC_BALANCE_PERC_MAX,CC_BALANCE_PERC_SUM,CC_BALANCE_PERC_VAR,CC_BALANCE_PERC_MEAN
100006,1489396,1489396,8936376,0.0,1489396.0,-6,-1,-21,3.5,-3.5,...,0,0,0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
100011,1843384,1843384,136410416,0.0,1843384.0,-75,-2,-2849,462.5,-38.5,...,0,0,0,0.0,0.0,0.0,1.05,22.398201,0.143251,0.302678
100013,2038692,2038692,195714432,0.0,2038692.0,-96,-1,-4656,776.0,-48.5,...,0,1,1,0.010417,0.010417,0.0,1.02489,11.068903,0.075363,0.115301
100021,2594025,2594025,44098425,0.0,2594025.0,-18,-2,-170,25.5,-10.0,...,0,0,0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
100023,1499902,1499902,11999216,0.0,1499902.0,-11,-4,-60,6.0,-7.5,...,0,0,0,0.0,0.0,0.0,0.0,0.0,0.0,0.0


In [22]:
for c in bureau_df.columns:
    print(c, description[description.Row == c].Description.values)

('SK_ID_CURR', array(['ID of loan in our sample',
       'ID of loan in our sample - one loan in our sample can have 0,1,2 or more related previous credits in credit bureau ',
       'ID of loan in our sample', 'ID of loan in our sample',
       'ID of loan in our sample', 'ID of loan in our sample'],
      dtype=object))
('SK_ID_BUREAU', array([], dtype=object))
('CREDIT_ACTIVE', array(['Status of the Credit Bureau (CB) reported credits'], dtype=object))
('CREDIT_CURRENCY', array(['Recoded currency of the Credit Bureau credit'], dtype=object))
('DAYS_CREDIT', array(['How many days before current application did client apply for Credit Bureau credit'],
      dtype=object))
('CREDIT_DAY_OVERDUE', array(['Number of days past due on CB credit at the time of application for related loan in our sample'],
      dtype=object))
('DAYS_CREDIT_ENDDATE', array(['Remaining duration of CB credit (in days) at the time of application in Home Credit'],
      dtype=object))
('DAYS_ENDDATE_FACT', array(

In [23]:
cat_cols = [bureau_df.columns[i] for i, c in enumerate(bureau_df.dtypes) if c == 'object']
print('Categoricals: ', cat_cols)

# Change the strings to integers.
for cat in cat_cols: 
    encoder = LabelEncoder()
    bureau_df[cat] = encoder.fit_transform(bureau_df[cat])
    
# Add features that make sense here.    
#aggregations = {
#    'DAYS_CREDIT': ['min', 'mean', 'max']
#}

bureau_agg = bureau_df.groupby('SK_ID_CURR').aggregate(['min', 'max', 'sum', 'mean', 'var'])
colname = lambda x, y: 'BUREAU_' + x + '_' + y.upper()  # distinct prefix; 'CC_' is already used for the credit card features
new_cols = [colname(c[0],c[1]) for c in list(bureau_agg.columns)]
bureau_agg.columns = new_cols

print('Bureau shape: ', bureau_df.shape)
print('Bureau (agg) shape: ', bureau_agg.shape)

('Categoricals: ', ['CREDIT_ACTIVE', 'CREDIT_CURRENCY', 'CREDIT_TYPE'])
('Bureau shape: ', (1716428, 17))
('Bureau (agg) shape: ', (305811, 80))
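
The group, aggregate, and rename pattern is repeated for every table, and copying it by hand makes it easy to reuse the wrong prefix or the wrong frame. A small helper can keep the steps in one place. This is only a sketch: `aggregate_by_key` and its parameters are hypothetical names, and it is shown on a toy frame rather than on `bureau_df`.

```python
import pandas as pd

def aggregate_by_key(df, key, prefix, aggs=('min', 'max', 'mean')):
    """Hypothetical helper: aggregate numeric columns by `key` and flatten the column names."""
    # Keep numeric columns only and exclude the grouping key itself.
    numeric = df.select_dtypes('number').drop(columns=[key], errors='ignore')
    agg = numeric.groupby(df[key]).agg(list(aggs))
    # Columns come back as (column, statistic) pairs; join them into single names.
    agg.columns = ['{}_{}_{}'.format(prefix, col, stat.upper()) for col, stat in agg.columns]
    return agg

# Toy stand-in for a table keyed by SK_ID_CURR; values are made up.
toy = pd.DataFrame({
    'SK_ID_CURR': [1, 1, 2],
    'DAYS_CREDIT': [-10, -30, -5],
    'AMT_CREDIT_SUM': [100.0, 300.0, 50.0],
})
bureau_toy_agg = aggregate_by_key(toy, 'SK_ID_CURR', 'BUREAU')
print(bureau_toy_agg.columns.tolist())
```

Because the new column names are always built from the frame that was just aggregated, the length of the name list cannot drift out of sync with the number of columns.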


In [24]:
for c in cash_df.columns:
    print(c, description[description.Row == c].Description.values)

('SK_ID_PREV', array([], dtype=object))
('SK_ID_CURR', array(['ID of loan in our sample',
       'ID of loan in our sample - one loan in our sample can have 0,1,2 or more related previous credits in credit bureau ',
       'ID of loan in our sample', 'ID of loan in our sample',
       'ID of loan in our sample', 'ID of loan in our sample'],
      dtype=object))
('MONTHS_BALANCE', array(['Month of balance relative to application date (-1 means the freshest balance date)',
       'Month of balance relative to application date (-1 means the information to the freshest monthly snapshot, 0 means the information at application - often it will be the same as -1 as many banks are not updating the information to Credit Bureau regularly )',
       'Month of balance relative to application date (-1 means the freshest balance date)'],
      dtype=object))
('CNT_INSTALMENT', array(['Term of previous credit (can change over time)'], dtype=object))
('CNT_INSTALMENT_FUTURE', array(['Installments left 

In [25]:
cat_cols = [cash_df.columns[i] for i, c in enumerate(cash_df.dtypes) if c == 'object']
print('Categoricals: ', cat_cols)

# Change the strings to integers.
for cat in cat_cols: 
    encoder = LabelEncoder()
    cash_df[cat] = encoder.fit_transform(cash_df[cat])
    
# Add features that make sense here.    
#aggregations = {
#    'SK_DPD': ['min', 'mean', 'max']
#}

cash_agg = cash_df.groupby('SK_ID_CURR').aggregate(['min', 'max', 'sum', 'var', 'mean'])
colname = lambda x, y: 'CASH_' + x + '_' + y.upper()  # distinct prefix; 'CC_' is already used for the credit card features
new_cols = [colname(c[0],c[1]) for c in list(cash_agg.columns)]
cash_agg.columns = new_cols

print('Cash shape: ', cash_df.shape)
print('Cash (agg) shape: ', cash_agg.shape)

('Categoricals: ', ['NAME_CONTRACT_STATUS'])
('Cash shape: ', (10001358, 8))
('Cash (agg) shape: ', (337252, 35))


In [26]:
for c in bureau_balance_df.columns:
    print(c, description[description.Row == c].Description.values)

('SK_ID_BUREAU', array([], dtype=object))
('MONTHS_BALANCE', array(['Month of balance relative to application date (-1 means the freshest balance date)',
       'Month of balance relative to application date (-1 means the information to the freshest monthly snapshot, 0 means the information at application - often it will be the same as -1 as many banks are not updating the information to Credit Bureau regularly )',
       'Month of balance relative to application date (-1 means the freshest balance date)'],
      dtype=object))
('STATUS', array(['Status of Credit Bureau loan during the month (active, closed, DPD0-30,\x85 [C means closed, X means status unknown, 0 means no DPD, 1 means maximal did during month between 1-30, 2 means DPD 31-60,\x85 5 means DPD 120+ or sold or written off ] )'],
      dtype=object))
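
As the descriptions show, this table is keyed by `SK_ID_BUREAU` and has no `SK_ID_CURR` column, so reaching applicant level takes two steps: summarise per bureau credit, then use the bureau table to map each credit to its applicant and summarise again. The following is a rough sketch on made-up frames; the `_toy` names and the chosen statistics are assumptions for illustration only.

```python
import pandas as pd

# Toy stand-ins for bureau_df and bureau_balance_df; values are made up.
bureau_toy = pd.DataFrame({
    'SK_ID_CURR': [1, 1, 2],
    'SK_ID_BUREAU': [10, 11, 20],
})
bb_toy = pd.DataFrame({
    'SK_ID_BUREAU': [10, 10, 10, 11, 20, 20],
    'MONTHS_BALANCE': [0, -1, -2, 0, 0, -1],
})

# Step 1: one row per bureau credit.
bb_per_credit = bb_toy.groupby('SK_ID_BUREAU')['MONTHS_BALANCE'].agg(['min', 'count'])
bb_per_credit.columns = ['BB_MONTHS_BALANCE_MIN', 'BB_MONTHS_BALANCE_COUNT']

# Step 2: attach SK_ID_CURR through the bureau table, then one row per applicant.
joined = bureau_toy.merge(bb_per_credit, left_on='SK_ID_BUREAU', right_index=True, how='left')
bb_per_applicant = joined.groupby('SK_ID_CURR')[list(bb_per_credit.columns)].mean()
print(bb_per_applicant)
```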


In [36]:
test_id = np.random.choice(bureau_balance_df['SK_ID_BUREAU'])
print('Testing with SK_ID_BUREAU = %d' % test_id)

test_df = bureau_balance_df.loc[bureau_balance_df['SK_ID_BUREAU'] == test_id]
print('Test df:', test_df)

Testing with SK_ID_BUREAU = 5011374
('Test df:',         SK_ID_BUREAU  MONTHS_BALANCE  STATUS
226562       5011374               0       6
226563       5011374              -1       6
226564       5011374              -2       6
226565       5011374              -3       6
226566       5011374              -4       6
226567       5011374              -5       6
226568       5011374              -6       6
226569       5011374              -7       6
226570       5011374              -8       6
226571       5011374              -9       6
226572       5011374             -10       6
226573       5011374             -11       6
226574       5011374             -12       6
226575       5011374             -13       6
226576       5011374             -14       6
226577       5011374             -15       6
226578       5011374             -16       6
226579       5011374             -17       6
226580       5011374             -18       6
226581       5011374             -19       6
226582

In [28]:
cat_cols = [bureau_balance_df.columns[i] for i, c in enumerate(bureau_balance_df.dtypes) if c == 'object']
print('Categoricals: ', cat_cols)

# Change the strings to integers.
for cat in cat_cols: 
    encoder = LabelEncoder()
    bureau_balance_df[cat] = encoder.fit_transform(bureau_balance_df[cat])
    
# Add features that make sense here.    
# bureau_balance has no SK_ID_CURR (or SK_DPD) column, so aggregate per
# bureau credit (SK_ID_BUREAU) using the columns this table actually has.
aggregations = {
    'MONTHS_BALANCE': ['min', 'max', 'count'],
    'STATUS': ['min', 'max', 'mean']
}

bureau_balance_agg = bureau_balance_df.groupby('SK_ID_BUREAU').aggregate(aggregations)
colname = lambda x, y: 'BB_' + x + '_' + y.upper() 
new_cols = [colname(c[0],c[1]) for c in list(bureau_balance_agg.columns)]
bureau_balance_agg.columns = new_cols

print('Bureau balance shape: ', bureau_balance_df.shape)
print('Bureau balance (agg) shape: ', bureau_balance_agg.shape)

('Categoricals: ', [])

In [29]:
for c in prev_app_df.columns:
    print(c, description[description.Row == c].Description.values)

('SK_ID_PREV', array([], dtype=object))
('SK_ID_CURR', array(['ID of loan in our sample',
       'ID of loan in our sample - one loan in our sample can have 0,1,2 or more related previous credits in credit bureau ',
       'ID of loan in our sample', 'ID of loan in our sample',
       'ID of loan in our sample', 'ID of loan in our sample'],
      dtype=object))
('NAME_CONTRACT_TYPE', array(['Identification if loan is cash or revolving',
       'Contract product type (Cash loan, consumer loan [POS] ,...) of the previous application'],
      dtype=object))
('AMT_ANNUITY', array(['Loan annuity', 'Annuity of the Credit Bureau credit',
       'Annuity of previous application'], dtype=object))
('AMT_APPLICATION', array(['For how much credit did client ask on the previous application'],
      dtype=object))
('AMT_CREDIT', array(['Credit amount of the loan',
       'Final credit amount on the previous application. This differs from AMT_APPLICATION in a way that the AMT_APPLICATION is the amoun

In [30]:
cat_cols = [prev_app_df.columns[i] for i, c in enumerate(prev_app_df.dtypes) if c == 'object']
print('Categoricals: ', cat_cols)

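# Change the strings to integers.  Cast to str first so that missing values
# (e.g. in NAME_TYPE_SUITE) become the label 'nan' rather than mixing float NaN with strings.
for cat in cat_cols: 
    encoder = LabelEncoder()
    prev_app_df[cat] = encoder.fit_transform(prev_app_df[cat].astype(str))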
    
# Add features that make sense here.    
aggregations = {
    'CNT_PAYMENT': ['min', 'mean', 'max']
}

prev_app_agg = prev_app_df.groupby('SK_ID_CURR').aggregate(aggregations)
colname = lambda x, y: 'PREV_' + x + '_' + y.upper()  # distinct prefix; 'BB_' is used for the bureau balance features
new_cols = [colname(c[0],c[1]) for c in list(prev_app_agg.columns)]
prev_app_agg.columns = new_cols

print('Previous application shape: ', prev_app_df.shape)
print('Previous application (agg) shape: ', prev_app_agg.shape)

('Categoricals: ', ['NAME_CONTRACT_TYPE', 'WEEKDAY_APPR_PROCESS_START', 'FLAG_LAST_APPL_PER_CONTRACT', 'NAME_CASH_LOAN_PURPOSE', 'NAME_CONTRACT_STATUS', 'NAME_PAYMENT_TYPE', 'CODE_REJECT_REASON', 'NAME_TYPE_SUITE', 'NAME_CLIENT_TYPE', 'NAME_GOODS_CATEGORY', 'NAME_PORTFOLIO', 'NAME_PRODUCT_TYPE', 'CHANNEL_TYPE', 'NAME_SELLER_INDUSTRY', 'NAME_YIELD_GROUP', 'PRODUCT_COMBINATION'])
('Previous application shape: ', (1670214, 37))
('Previous application (agg) shape: ', (338857, 3))
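
Each aggregated table is indexed by `SK_ID_CURR`, so it can be attached to the application data with a left join. The row counts above differ from the 307,511 rows in `app_train`, which suggests that not every applicant appears in every auxiliary table; a left join keeps all applicants and leaves `NaN` where no history exists. Here is a minimal sketch on toy frames, where every frame and column name is a hypothetical stand-in.

```python
import pandas as pd

# Toy stand-ins for app_train and two aggregated tables; values are made up.
app_toy = pd.DataFrame({'SK_ID_CURR': [1, 2, 3], 'TARGET': [0, 1, 0]})
install_toy_agg = pd.DataFrame(
    {'INSTALL_AMT_PAYMENT_MEAN': [100.0, 250.0]},
    index=pd.Index([1, 3], name='SK_ID_CURR'),
)
prev_toy_agg = pd.DataFrame(
    {'PREV_CNT_PAYMENT_MEAN': [12.0]},
    index=pd.Index([2], name='SK_ID_CURR'),
)

# Left-join each aggregate onto the applications, matching on the applicant id.
features = app_toy
for agg in (install_toy_agg, prev_toy_agg):
    features = features.merge(agg, how='left', left_on='SK_ID_CURR', right_index=True)
print(features)
```

Giving each table its own column prefix matters at this step, since identical column names from different tables would otherwise collide in the merged frame.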
