# Bureau and Bureau Balance data

<blockquote>*bureau.csv* holds each client's earlier credits from other financial institutions. Some of these credits are still active and some are closed. Each previous (or ongoing) credit has its own row (only <u>one</u> row per credit) in the *bureau* dataset. Because a single client may have taken several loans from other financial institutions, each row in the *application_train* data (i.e. *application_train.csv*) can match multiple rows in this table. The features of this dataset are explained below.</blockquote>

## Feature explanations
### Bureau table
<blockquote><p style="font-size:13px">
SK_ID_CURR: 	ID of loan in our sample - one loan in our sample can have 0,1,2 or more related previous credits in credit bureau <br>
SK_ID_BUREAU: 	Recoded ID of previous Credit Bureau credit related to our loan (unique coding for each loan application)<br>
CREDIT_ACTIVE: 	Status of the Credit Bureau (CB) reported credits<br>
CREDIT_CURRENCY: 	Recoded currency of the Credit Bureau credit<br>
DAYS_CREDIT: 	How many days before current application did client apply for Credit Bureau credit<br>
CREDIT_DAY_OVERDUE: 	Number of days past due on CB credit at the time of application for related loan in our sample<br>
DAYS_CREDIT_ENDDATE: 	Remaining duration of CB credit (in days) at the time of application in Home Credit<br>
DAYS_ENDDATE_FACT: 	Days since CB credit ended at the time of application in Home Credit (only for closed credit)<br>
AMT_CREDIT_MAX_OVERDUE: 	Maximal amount overdue on the Credit Bureau credit so far (at application date of loan in our sample)<br>
CNT_CREDIT_PROLONG: 	How many times was the Credit Bureau credit prolonged<br>
AMT_CREDIT_SUM: 	Current credit amount for the Credit Bureau credit<br>
AMT_CREDIT_SUM_DEBT: 	Current debt on Credit Bureau credit<br>
AMT_CREDIT_SUM_LIMIT: 	Current credit limit of credit card reported in Credit Bureau<br>
AMT_CREDIT_SUM_OVERDUE: 	Current amount overdue on Credit Bureau credit<br>
CREDIT_TYPE: 	Type of Credit Bureau credit (Car, cash,...)<br>
DAYS_CREDIT_UPDATE: 	How many days before loan application did last information about the Credit Bureau credit come<br>
AMT_ANNUITY: 	Annuity of the Credit Bureau credit<br>
    </p></blockquote>
    
### Bureau Balance table
<blockquote>SK_ID_BUREAU:	Recoded ID of Credit Bureau credit (unique coding for each application) - use this to join to the bureau table<br>
MONTHS_BALANCE:	Month of balance relative to application date (-1 means the freshest balance date); time is only relative to the application<br>
STATUS:	Status of Credit Bureau loan during the month<br> 	
</blockquote>

In [1]:
import numpy as np
import pandas as pd
import gc
import warnings
from sklearn.preprocessing import LabelEncoder

warnings.filterwarnings('ignore')

In [2]:
# 2.0 One-hot encoding function. Uses pd.get_dummies()
#     i) To transform 'object' columns to dummies. 
#    ii) Treat NaN as one of the categories
#   iii) Returns transformed-data and new-columns created

def one_hot_encoder(df, nan_as_category = True):
    original_columns = list(df.columns)
    categorical_columns = [col for col in df.columns if df[col].dtype == 'object']
    df = pd.get_dummies(df,
                        columns= categorical_columns,
                        dummy_na= nan_as_category       # Treat NaNs as category
                       )
    new_columns = [c for c in df.columns if c not in original_columns]
    return df, new_columns
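
The encoder can be tried on a toy frame to see how a missing value turns into its own dummy column. This is a minimal sketch: the function is repeated so the block runs on its own, and the frame `toy` with its values is made up for illustration.

```python
import numpy as np
import pandas as pd

# Copy of the encoder defined above, repeated so this block is self-contained
def one_hot_encoder(df, nan_as_category=True):
    original_columns = list(df.columns)
    categorical_columns = [col for col in df.columns if df[col].dtype == 'object']
    df = pd.get_dummies(df, columns=categorical_columns, dummy_na=nan_as_category)
    new_columns = [c for c in df.columns if c not in original_columns]
    return df, new_columns

# Hypothetical toy data: three credits, one with a missing status
toy = pd.DataFrame({
    'SK_ID_BUREAU': [1, 2, 3],
    'STATUS': pd.Series(['C', 'X', np.nan], dtype=object),
})

encoded, new_cols = one_hot_encoder(toy)
print(new_cols)                      # names of the dummy columns that were created
print(encoded['STATUS_nan'].sum())   # number of rows whose STATUS was missing
```

The `STATUS_nan` column is created even when a real column has no missing values, in which case it is all zeros.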


In [3]:
# 3.2 Read bureau data first
df_bureau = pd.read_csv('homecredit/bureau.csv')
df_bureau

Unnamed: 0,SK_ID_CURR,SK_ID_BUREAU,CREDIT_ACTIVE,CREDIT_CURRENCY,DAYS_CREDIT,CREDIT_DAY_OVERDUE,DAYS_CREDIT_ENDDATE,DAYS_ENDDATE_FACT,AMT_CREDIT_MAX_OVERDUE,CNT_CREDIT_PROLONG,AMT_CREDIT_SUM,AMT_CREDIT_SUM_DEBT,AMT_CREDIT_SUM_LIMIT,AMT_CREDIT_SUM_OVERDUE,CREDIT_TYPE,DAYS_CREDIT_UPDATE,AMT_ANNUITY
0,215354,5714462,Closed,currency 1,-497,0,-153.0,-153.0,,0,91323.00,0.0,,0.0,Consumer credit,-131,
1,215354,5714463,Active,currency 1,-208,0,1075.0,,,0,225000.00,171342.0,,0.0,Credit card,-20,
2,215354,5714464,Active,currency 1,-203,0,528.0,,,0,464323.50,,,0.0,Consumer credit,-16,
3,215354,5714465,Active,currency 1,-203,0,,,,0,90000.00,,,0.0,Credit card,-16,
4,215354,5714466,Active,currency 1,-629,0,1197.0,,77674.5,0,2700000.00,,,0.0,Consumer credit,-21,
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
1716423,259355,5057750,Active,currency 1,-44,0,-30.0,,0.0,0,11250.00,11250.0,0.0,0.0,Microloan,-19,
1716424,100044,5057754,Closed,currency 1,-2648,0,-2433.0,-2493.0,5476.5,0,38130.84,0.0,0.0,0.0,Consumer credit,-2493,
1716425,100044,5057762,Closed,currency 1,-1809,0,-1628.0,-970.0,,0,15570.00,,,0.0,Consumer credit,-967,
1716426,246829,5057770,Closed,currency 1,-1878,0,-1513.0,-1513.0,,0,36000.00,0.0,0.0,0.0,Consumer credit,-1508,


In [4]:
# 3.2.3 In all, how many are categoricals?
df_bureau.dtypes.value_counts()

float64    8
int64      6
object     3
dtype: int64

In [5]:
# 3.4 Summary of active/closed cases from bureau
# We aggregate on these also
df_bureau['CREDIT_ACTIVE'].value_counts()

Closed      1079273
Active       630607
Sold           6527
Bad debt         21
Name: CREDIT_ACTIVE, dtype: int64

## Aggregation
<blockquote><i>bureau_balance</i> will be aggregated and merged with <i>bureau</i>. <i>bureau</i> will then be aggregated by <i>SK_ID_CURR</i>, in three different ways, and the aggregated result, called <i>bureau_agg</i>, will be merged with the <i>'application_train'</i> data on <i>SK_ID_CURR</i>.<br>
Aggregation over time is one way to extract the behaviour of a client. All categorical data is first one-hot encoded (OHE). What is unusual about this OHE is that NaN values are treated as a category of their own.</blockquote>
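
Only one of the three aggregations is carried out in this section. The sketch below shows what the other two might look like, under the assumption (not confirmed by this notebook) that they restrict the rows to active credits and to closed credits before grouping. The helper `aggregate_subset`, the `ACTIVE`/`CLOSED` prefixes and the toy values are hypothetical.

```python
import pandas as pd

# Hypothetical toy version of the one-hot encoded bureau table
toy = pd.DataFrame({
    'SK_ID_CURR': [1, 1, 1, 2],
    'CREDIT_ACTIVE_Active': [1, 0, 1, 0],
    'CREDIT_ACTIVE_Closed': [0, 1, 0, 1],
    'AMT_CREDIT_SUM': [100.0, 50.0, 300.0, 80.0],
})

def aggregate_subset(df, flag_col, prefix):
    """Aggregate only the rows where flag_col is set, and prefix the new columns."""
    subset = df[df[flag_col] == 1]
    agg = subset.groupby('SK_ID_CURR').agg({'AMT_CREDIT_SUM': ['mean', 'sum']})
    agg.columns = [f'{prefix}_{col}_{stat.upper()}' for col, stat in agg.columns]
    return agg

active_agg = aggregate_subset(toy, 'CREDIT_ACTIVE_Active', 'ACTIVE')
closed_agg = aggregate_subset(toy, 'CREDIT_ACTIVE_Closed', 'CLOSED')

# Clients with no credit of a given kind get NaN in that kind's columns
combined = active_agg.join(closed_agg, how='outer')
print(combined)
```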


In [6]:
# 4.0 OneHotEncode 'object' types in bureau
nan_as_category = True
df_bureau, df_bureau_cat = one_hot_encoder(df_bureau, nan_as_category)

In [7]:
# 4.1
df_bureau

Unnamed: 0,SK_ID_CURR,SK_ID_BUREAU,DAYS_CREDIT,CREDIT_DAY_OVERDUE,DAYS_CREDIT_ENDDATE,DAYS_ENDDATE_FACT,AMT_CREDIT_MAX_OVERDUE,CNT_CREDIT_PROLONG,AMT_CREDIT_SUM,AMT_CREDIT_SUM_DEBT,...,CREDIT_TYPE_Loan for business development,CREDIT_TYPE_Loan for purchase of shares (margin lending),CREDIT_TYPE_Loan for the purchase of equipment,CREDIT_TYPE_Loan for working capital replenishment,CREDIT_TYPE_Microloan,CREDIT_TYPE_Mobile operator loan,CREDIT_TYPE_Mortgage,CREDIT_TYPE_Real estate loan,CREDIT_TYPE_Unknown type of loan,CREDIT_TYPE_nan
0,215354,5714462,-497,0,-153.0,-153.0,,0,91323.00,0.0,...,0,0,0,0,0,0,0,0,0,0
1,215354,5714463,-208,0,1075.0,,,0,225000.00,171342.0,...,0,0,0,0,0,0,0,0,0,0
2,215354,5714464,-203,0,528.0,,,0,464323.50,,...,0,0,0,0,0,0,0,0,0,0
3,215354,5714465,-203,0,,,,0,90000.00,,...,0,0,0,0,0,0,0,0,0,0
4,215354,5714466,-629,0,1197.0,,77674.5,0,2700000.00,,...,0,0,0,0,0,0,0,0,0,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
1716423,259355,5057750,-44,0,-30.0,,0.0,0,11250.00,11250.0,...,0,0,0,0,1,0,0,0,0,0
1716424,100044,5057754,-2648,0,-2433.0,-2493.0,5476.5,0,38130.84,0.0,...,0,0,0,0,0,0,0,0,0,0
1716425,100044,5057762,-1809,0,-1628.0,-970.0,,0,15570.00,,...,0,0,0,0,0,0,0,0,0,0
1716426,246829,5057770,-1878,0,-1513.0,-1513.0,,0,36000.00,0.0,...,0,0,0,0,0,0,0,0,0,0


## bureau_balance
<blockquote>This is monthly data about the remaining balance of each of the clients' previous credits that exist in the <i>bureau</i> dataset. Each previous credit is identified by a unique ID, <i>SK_ID_BUREAU</i>, in <i>bureau</i>. Each row in <i>bureau_balance</i> is one month of credit due (from a previous credit), and a single previous credit can have multiple rows, one for each month of the credit's length.<br> In my personal view, the balance should be in decreasing order. That is, for every credit identified by <i>SK_ID_BUREAU</i>, the amount due should decrease with each passing month.</blockquote>
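
The "one row per month" structure can be checked per credit: if no month is missing, the number of rows equals the span between the oldest and the newest month plus one. A minimal sketch on made-up data follows; `toy_bb`, `span` and the `expected`/`complete` columns are names introduced here for illustration.

```python
import pandas as pd

# Hypothetical toy data: credit 10 has every month, credit 11 has a gap
toy_bb = pd.DataFrame({
    'SK_ID_BUREAU': [10, 10, 10, 11, 11],
    'MONTHS_BALANCE': [0, -1, -2, 0, -3],
    'STATUS': ['C', 'C', '0', 'X', 'X'],
})

span = toy_bb.groupby('SK_ID_BUREAU')['MONTHS_BALANCE'].agg(['min', 'max', 'size'])
span['expected'] = span['max'] - span['min'] + 1    # rows needed for full coverage
span['complete'] = span['size'] == span['expected']
print(span)
```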

In [8]:
# 5.0 Read bureau_balance data
#     (data-types are left at the pandas defaults here)

df_bb = pd.read_csv('homecredit/bureau_balance.csv')
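
The read above keeps the default 64-bit integer types. If memory is tight with a table of this size, the integer columns can be downcast. The following is a sketch on a toy frame rather than something this notebook runs; `toy_bb`, `before` and `after` are illustrative names. The `STATUS` column is deliberately left alone, because `one_hot_encoder` picks its columns by the `object` dtype.

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame with the same integer columns as bureau_balance
toy_bb = pd.DataFrame({
    'SK_ID_BUREAU': np.array([5715448, 5715448, 5041336], dtype='int64'),
    'MONTHS_BALANCE': np.array([0, -1, -50], dtype='int64'),
})

before = toy_bb.memory_usage(deep=True).sum()
for col in ['SK_ID_BUREAU', 'MONTHS_BALANCE']:
    # Pick the smallest signed integer type that can hold the values
    toy_bb[col] = pd.to_numeric(toy_bb[col], downcast='integer')
after = toy_bb.memory_usage(deep=True).sum()

print(toy_bb.dtypes)
print(before, after)
```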

In [9]:
# 5.0.1 Display few rows 
df_bb

Unnamed: 0,SK_ID_BUREAU,MONTHS_BALANCE,STATUS
0,5715448,0,C
1,5715448,-1,C
2,5715448,-2,C
3,5715448,-3,C
4,5715448,-4,C
...,...,...,...
27299920,5041336,-47,X
27299921,5041336,-48,X
27299922,5041336,-49,X
27299923,5041336,-50,X


In [10]:
# 5.1 There is just one 'object' column
df_bb.dtypes.value_counts()

int64     2
object    1
dtype: int64

In [11]:
# 5.5 OK. So let us OneHotEncode bb
df_bb, df_bb_cat = one_hot_encoder(df_bb, nan_as_category)

In [12]:
# 5.6 Examine the results
df_bb

Unnamed: 0,SK_ID_BUREAU,MONTHS_BALANCE,STATUS_0,STATUS_1,STATUS_2,STATUS_3,STATUS_4,STATUS_5,STATUS_C,STATUS_X,STATUS_nan
0,5715448,0,0,0,0,0,0,0,1,0,0
1,5715448,-1,0,0,0,0,0,0,1,0,0
2,5715448,-2,0,0,0,0,0,0,1,0,0
3,5715448,-3,0,0,0,0,0,0,1,0,0
4,5715448,-4,0,0,0,0,0,0,1,0,0
...,...,...,...,...,...,...,...,...,...,...,...
27299920,5041336,-47,0,0,0,0,0,0,0,1,0
27299921,5041336,-48,0,0,0,0,0,0,0,1,0
27299922,5041336,-49,0,0,0,0,0,0,0,1,0
27299923,5041336,-50,0,0,0,0,0,0,0,1,0


## Performing aggregations in bb
<blockquote>There is one numeric feature: <i>'MONTHS_BALANCE'</i>. On this feature we will perform ['min', 'max', 'size']. On the rest of the features, which are dummy features, we will perform ['mean']. Aggregation is by unique bureau ID, <i>SK_ID_BUREAU</i>. The resulting dataset is called <i>bb_agg</i>.</blockquote>
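
The same pattern, a dictionary of per-column operations followed by flattening the two-level column index, can be seen end to end on a small made-up frame. The names `toy_bb`, `aggregations` and `toy_agg` are illustrative only.

```python
import pandas as pd

# Hypothetical toy data: two credits with already one-hot encoded status
toy_bb = pd.DataFrame({
    'SK_ID_BUREAU': [10, 10, 10, 10, 11, 11],
    'MONTHS_BALANCE': [0, -1, -2, -3, 0, -1],
    'STATUS_0': [1, 1, 1, 0, 0, 0],
    'STATUS_C': [0, 0, 0, 1, 1, 1],
})

aggregations = {
    'MONTHS_BALANCE': ['min', 'max', 'size'],
    'STATUS_0': ['mean'],
    'STATUS_C': ['mean'],
}

toy_agg = toy_bb.groupby('SK_ID_BUREAU').agg(aggregations)

# agg() with a dict of lists returns (column, statistic) pairs; join them into one name
toy_agg.columns = [f'{col}_{stat.upper()}' for col, stat in toy_agg.columns]
print(toy_agg)
```

The mean of a 0/1 dummy column is the fraction of months the credit spent in that status.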


In [13]:
# 6.0 Bureau balance: Perform aggregations and merge with bureau.csv
#     First prepare a dictionary listing operations to be performed
#     on various features:

df_bb_aggregations = {'MONTHS_BALANCE': ['min', 'max', 'size']}
for col in df_bb_cat:
    df_bb_aggregations[col] = ['mean']

# 6.0.1    
len(df_bb_aggregations)     # 10  

10

In [14]:
# 6.1 So what all aggregations to perform column-wise

df_bb_aggregations

{'MONTHS_BALANCE': ['min', 'max', 'size'],
 'STATUS_0': ['mean'],
 'STATUS_1': ['mean'],
 'STATUS_2': ['mean'],
 'STATUS_3': ['mean'],
 'STATUS_4': ['mean'],
 'STATUS_5': ['mean'],
 'STATUS_C': ['mean'],
 'STATUS_X': ['mean'],
 'STATUS_nan': ['mean']}

In [15]:
# 6.2 Perform aggregations now in bb:

grouped = df_bb.groupby('SK_ID_BUREAU')
bb_agg = grouped.agg(df_bb_aggregations)

In [16]:
bb_agg

SK_ID_BUREAU,MONTHS_BALANCE,MONTHS_BALANCE,MONTHS_BALANCE,STATUS_0,STATUS_1,STATUS_2,STATUS_3,STATUS_4,STATUS_5,STATUS_C,STATUS_X,STATUS_nan
,min,max,size,mean,mean,mean,mean,mean,mean,mean,mean,mean
5001709,-96,0,97,0.000000,0.000000,0.0,0.0,0.0,0.0,0.886598,0.113402,0.0
5001710,-82,0,83,0.060241,0.000000,0.0,0.0,0.0,0.0,0.578313,0.361446,0.0
5001711,-3,0,4,0.750000,0.000000,0.0,0.0,0.0,0.0,0.000000,0.250000,0.0
5001712,-18,0,19,0.526316,0.000000,0.0,0.0,0.0,0.0,0.473684,0.000000,0.0
5001713,-21,0,22,0.000000,0.000000,0.0,0.0,0.0,0.0,0.000000,1.000000,0.0
...,...,...,...,...,...,...,...,...,...,...,...,...
6842884,-47,0,48,0.187500,0.000000,0.0,0.0,0.0,0.0,0.416667,0.395833,0.0
6842885,-23,0,24,0.500000,0.000000,0.0,0.0,0.0,0.5,0.000000,0.000000,0.0
6842886,-32,0,33,0.242424,0.000000,0.0,0.0,0.0,0.0,0.757576,0.000000,0.0
6842887,-36,0,37,0.162162,0.000000,0.0,0.0,0.0,0.0,0.837838,0.000000,0.0


In [17]:
# 6.4 Rename bb_agg columns
bb_agg.columns = pd.Index([e[0] + "_" + e[1].upper() for e in bb_agg.columns.tolist()])

In [18]:
bb_agg

SK_ID_BUREAU,MONTHS_BALANCE_MIN,MONTHS_BALANCE_MAX,MONTHS_BALANCE_SIZE,STATUS_0_MEAN,STATUS_1_MEAN,STATUS_2_MEAN,STATUS_3_MEAN,STATUS_4_MEAN,STATUS_5_MEAN,STATUS_C_MEAN,STATUS_X_MEAN,STATUS_nan_MEAN
5001709,-96,0,97,0.000000,0.000000,0.0,0.0,0.0,0.0,0.886598,0.113402,0.0
5001710,-82,0,83,0.060241,0.000000,0.0,0.0,0.0,0.0,0.578313,0.361446,0.0
5001711,-3,0,4,0.750000,0.000000,0.0,0.0,0.0,0.0,0.000000,0.250000,0.0
5001712,-18,0,19,0.526316,0.000000,0.0,0.0,0.0,0.0,0.473684,0.000000,0.0
5001713,-21,0,22,0.000000,0.000000,0.0,0.0,0.0,0.0,0.000000,1.000000,0.0
...,...,...,...,...,...,...,...,...,...,...,...,...
6842884,-47,0,48,0.187500,0.000000,0.0,0.0,0.0,0.0,0.416667,0.395833,0.0
6842885,-23,0,24,0.500000,0.000000,0.0,0.0,0.0,0.5,0.000000,0.000000,0.0
6842886,-32,0,33,0.242424,0.000000,0.0,0.0,0.0,0.0,0.757576,0.000000,0.0
6842887,-36,0,37,0.162162,0.000000,0.0,0.0,0.0,0.0,0.837838,0.000000,0.0


In [19]:
# 6.4.1
bb_agg.columns.tolist()
bb_agg.head()

SK_ID_BUREAU,MONTHS_BALANCE_MIN,MONTHS_BALANCE_MAX,MONTHS_BALANCE_SIZE,STATUS_0_MEAN,STATUS_1_MEAN,STATUS_2_MEAN,STATUS_3_MEAN,STATUS_4_MEAN,STATUS_5_MEAN,STATUS_C_MEAN,STATUS_X_MEAN,STATUS_nan_MEAN
5001709,-96,0,97,0.0,0.0,0.0,0.0,0.0,0.0,0.886598,0.113402,0.0
5001710,-82,0,83,0.060241,0.0,0.0,0.0,0.0,0.0,0.578313,0.361446,0.0
5001711,-3,0,4,0.75,0.0,0.0,0.0,0.0,0.0,0.0,0.25,0.0
5001712,-18,0,19,0.526316,0.0,0.0,0.0,0.0,0.0,0.473684,0.0,0.0
5001713,-21,0,22,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0


In [20]:
# 6.5 Merge aggregated bb with bureau

bureau = df_bureau.join(bb_agg, how='left', on='SK_ID_BUREAU')
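
A left join keeps every bureau row and fills the bureau_balance columns with NaN wherever a credit has no monthly history, which is what the rows displayed next show. A small sketch of measuring how many rows found a match, using made-up frames (`toy_bureau`, `toy_bb_agg` and `matched_share` are illustrative names):

```python
import pandas as pd

# Hypothetical toy data: credit 11 has no rows in the aggregated balance table
toy_bureau = pd.DataFrame({
    'SK_ID_CURR': [1, 1, 2],
    'SK_ID_BUREAU': [10, 11, 12],
})
toy_bb_agg = pd.DataFrame(
    {'MONTHS_BALANCE_SIZE': [4, 2]},
    index=pd.Index([10, 12], name='SK_ID_BUREAU'),
)

# join() matches the 'on' column of the left frame against the index of the right one
joined = toy_bureau.join(toy_bb_agg, how='left', on='SK_ID_BUREAU')

matched_share = joined['MONTHS_BALANCE_SIZE'].notna().mean()
print(joined)
print(matched_share)   # fraction of bureau rows that have monthly history
```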

In [21]:
bureau

Unnamed: 0,SK_ID_CURR,SK_ID_BUREAU,DAYS_CREDIT,CREDIT_DAY_OVERDUE,DAYS_CREDIT_ENDDATE,DAYS_ENDDATE_FACT,AMT_CREDIT_MAX_OVERDUE,CNT_CREDIT_PROLONG,AMT_CREDIT_SUM,AMT_CREDIT_SUM_DEBT,...,MONTHS_BALANCE_SIZE,STATUS_0_MEAN,STATUS_1_MEAN,STATUS_2_MEAN,STATUS_3_MEAN,STATUS_4_MEAN,STATUS_5_MEAN,STATUS_C_MEAN,STATUS_X_MEAN,STATUS_nan_MEAN
0,215354,5714462,-497,0,-153.0,-153.0,,0,91323.00,0.0,...,,,,,,,,,,
1,215354,5714463,-208,0,1075.0,,,0,225000.00,171342.0,...,,,,,,,,,,
2,215354,5714464,-203,0,528.0,,,0,464323.50,,...,,,,,,,,,,
3,215354,5714465,-203,0,,,,0,90000.00,,...,,,,,,,,,,
4,215354,5714466,-629,0,1197.0,,77674.5,0,2700000.00,,...,,,,,,,,,,
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
1716423,259355,5057750,-44,0,-30.0,,0.0,0,11250.00,11250.0,...,,,,,,,,,,
1716424,100044,5057754,-2648,0,-2433.0,-2493.0,5476.5,0,38130.84,0.0,...,,,,,,,,,,
1716425,100044,5057762,-1809,0,-1628.0,-970.0,,0,15570.00,,...,,,,,,,,,,
1716426,246829,5057770,-1878,0,-1513.0,-1513.0,,0,36000.00,0.0,...,,,,,,,,,,


In [22]:
# 6.6 Drop SK_ID_BUREAU now that bb has been merged.

bureau.drop(['SK_ID_BUREAU'], axis=1, inplace= True)

In [23]:
# We have three types of columns
# Categorical columns generated from bureau
# Categorical columns generated from bb
# Numerical columns

## Performing aggregations in bureau
<blockquote>Aggregate the 14 numeric columns (11 from <i>bureau</i>, 3 from <i>bureau_balance</i>), each with its own subset of ['min', 'max', 'mean', 'var', 'sum'].<br>
Aggregate the rest of the columns, that is the dummy columns, as ['mean']. <br>
This constitutes one of the three aggregations. Aggregation is by <i>SK_ID_CURR</i>. The resulting dataset is called <i>bureau_agg</i>.</blockquote>
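
Because the status dummies were already averaged per credit in <i>bb_agg</i>, averaging them again per client gives a mean of means: every credit counts equally, however many months of history it has. The sketch below contrasts this with a pooled mean over all months. The data and the names `per_credit`, `mean_of_means` and `pooled` are made up for illustration.

```python
import pandas as pd

# Hypothetical toy data: client 1 has a 4-month credit and a 1-month credit
toy_bb = pd.DataFrame({
    'SK_ID_BUREAU': [10, 10, 10, 10, 11],
    'STATUS_0': [1, 1, 1, 1, 0],
})
toy_bureau = pd.DataFrame({'SK_ID_CURR': [1, 1], 'SK_ID_BUREAU': [10, 11]})

# Step 1: fraction of months in status 0, per credit
per_credit = toy_bb.groupby('SK_ID_BUREAU')['STATUS_0'].mean().rename('STATUS_0_MEAN')

# Step 2: average those fractions per client (the two-stage approach used here)
merged = toy_bureau.join(per_credit, on='SK_ID_BUREAU')
mean_of_means = merged.groupby('SK_ID_CURR')['STATUS_0_MEAN'].mean()

# Alternative: pool all months of a client, so longer credits weigh more
pooled = (toy_bureau.merge(toy_bb, on='SK_ID_BUREAU')
          .groupby('SK_ID_CURR')['STATUS_0'].mean())

print(mean_of_means.loc[1], pooled.loc[1])
```

Neither variant is wrong; they answer slightly different questions, and the two-stage one is what the column names ending in `_MEAN_MEAN` reflect.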

In [24]:
bureau

Unnamed: 0,SK_ID_CURR,DAYS_CREDIT,CREDIT_DAY_OVERDUE,DAYS_CREDIT_ENDDATE,DAYS_ENDDATE_FACT,AMT_CREDIT_MAX_OVERDUE,CNT_CREDIT_PROLONG,AMT_CREDIT_SUM,AMT_CREDIT_SUM_DEBT,AMT_CREDIT_SUM_LIMIT,...,MONTHS_BALANCE_SIZE,STATUS_0_MEAN,STATUS_1_MEAN,STATUS_2_MEAN,STATUS_3_MEAN,STATUS_4_MEAN,STATUS_5_MEAN,STATUS_C_MEAN,STATUS_X_MEAN,STATUS_nan_MEAN
0,215354,-497,0,-153.0,-153.0,,0,91323.00,0.0,,...,,,,,,,,,,
1,215354,-208,0,1075.0,,,0,225000.00,171342.0,,...,,,,,,,,,,
2,215354,-203,0,528.0,,,0,464323.50,,,...,,,,,,,,,,
3,215354,-203,0,,,,0,90000.00,,,...,,,,,,,,,,
4,215354,-629,0,1197.0,,77674.5,0,2700000.00,,,...,,,,,,,,,,
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
1716423,259355,-44,0,-30.0,,0.0,0,11250.00,11250.0,0.0,...,,,,,,,,,,
1716424,100044,-2648,0,-2433.0,-2493.0,5476.5,0,38130.84,0.0,0.0,...,,,,,,,,,,
1716425,100044,-1809,0,-1628.0,-970.0,,0,15570.00,,,...,,,,,,,,,,
1716426,246829,-1878,0,-1513.0,-1513.0,,0,36000.00,0.0,0.0,...,,,,,,,,,,


In [25]:
## Aggregation strategy
# 7.1 Numeric features
#     Columns: Bureau + bureau_balance numeric features
#              Last three columns are from bureau_balance
#              Total: 11 + 3 = 14

num_aggregations = {
                     'DAYS_CREDIT':             ['min', 'max', 'mean', 'var'],
                     'DAYS_CREDIT_ENDDATE':     ['min', 'max', 'mean'],
                     'DAYS_CREDIT_UPDATE':      ['mean'],
                     'CREDIT_DAY_OVERDUE':      ['max', 'mean'],
                     'AMT_CREDIT_MAX_OVERDUE':  ['mean'],
                     'AMT_CREDIT_SUM':          ['max', 'mean', 'sum'],
                     'AMT_CREDIT_SUM_DEBT':     ['max', 'mean', 'sum'],
                     'AMT_CREDIT_SUM_OVERDUE':  ['mean'],
                     'AMT_CREDIT_SUM_LIMIT':    ['mean', 'sum'],
                     'AMT_ANNUITY':             ['max', 'mean'],
                     'CNT_CREDIT_PROLONG':      ['sum'],
                     'MONTHS_BALANCE_MIN':      ['min'],
                     'MONTHS_BALANCE_MAX':      ['max'],
                     'MONTHS_BALANCE_SIZE':     ['mean', 'sum']
                   }

len(num_aggregations)   # 14

14

In [26]:
# 7.2 Bureau categorical features. Derived from:
#       'CREDIT_ACTIVE', 'CREDIT_CURRENCY', 'CREDIT_TYPE', 
#        Total: 26 dummy columns

cat_aggregations = {}
df_bureau_cat      # bureau_cat are newly created dummy columns
                #  but all are numerical columns

# 7.2.1    
len(df_bureau_cat) # 26    

26

In [27]:
# 7.2.2 For all these new dummy columns in bureau, we will
#       take mean
for cat in df_bureau_cat: cat_aggregations[cat] = ['mean']
cat_aggregations    

len(cat_aggregations)   # 26

26

In [28]:
# 7.3.1 In addition, bureau now holds the dummy columns
#        merged in from 'bb', i.e. df_bb_cat (with a '_MEAN' suffix).
#        So here is our full list
df_bb_cat
len(df_bb_cat)             # 9

# 7.3.2
for cat in df_bb_cat: cat_aggregations[cat + "_MEAN"] = ['mean']
cat_aggregations 

len(cat_aggregations)   # 26 + 9 = 35

35

In [29]:
# 7.4 Have a look at bureau columns again
#      Just to compare above results with what
#       already exists

bureau.columns        # 51
len(bureau.columns)   # 35 (dummy) + 14 (num) + 1 (SK_ID_CURR) + 1 (DAYS_ENDDATE_FACT) = 51

51

In [30]:
# 7.5 Now that we have decided 
#     our aggregation strategy for each column
#      (except 2: SK_ID_CURR, the grouping key, and DAYS_ENDDATE_FACT),
#      let us now aggregate.
#         Note that SK_ID_CURR now becomes the index of the data

grouped = bureau.groupby('SK_ID_CURR')
bureau_agg = grouped.agg({**num_aggregations, **cat_aggregations})
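
One detail of the 'var' aggregation is worth knowing: pandas computes the sample variance (`ddof=1`) by default, so a client with a single bureau credit gets NaN rather than 0. A minimal sketch on made-up data, with `var_default` and `var_population` as illustrative names:

```python
import pandas as pd

# Hypothetical toy data: client 1 has two credits, client 2 has only one
toy = pd.DataFrame({
    'SK_ID_CURR': [1, 1, 2],
    'DAYS_CREDIT': [-100, -300, -1104],
})

var_default = toy.groupby('SK_ID_CURR')['DAYS_CREDIT'].var()            # ddof=1
var_population = toy.groupby('SK_ID_CURR')['DAYS_CREDIT'].var(ddof=0)   # divides by n

print(var_default)
print(var_population)
```

Whether NaN or 0 is the better signal for single-credit clients is a modelling choice; this notebook keeps the default.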

In [31]:
bureau_agg

SK_ID_CURR,DAYS_CREDIT,DAYS_CREDIT,DAYS_CREDIT,DAYS_CREDIT,DAYS_CREDIT_ENDDATE,DAYS_CREDIT_ENDDATE,DAYS_CREDIT_ENDDATE,DAYS_CREDIT_UPDATE,CREDIT_DAY_OVERDUE,CREDIT_DAY_OVERDUE,...,CREDIT_TYPE_nan,STATUS_0_MEAN,STATUS_1_MEAN,STATUS_2_MEAN,STATUS_3_MEAN,STATUS_4_MEAN,STATUS_5_MEAN,STATUS_C_MEAN,STATUS_X_MEAN,STATUS_nan_MEAN
,min,max,mean,var,min,max,mean,mean,max,mean,...,mean,mean,mean,mean,mean,mean,mean,mean,mean,mean
100001,-1572,-49,-735.000000,240043.666667,-1329.0,1778.0,82.428571,-93.142857,0,0.0,...,0.0,0.336651,0.007519,0.0,0.0,0.0,0.0,0.441240,0.214590,0.0
100002,-1437,-103,-874.000000,186150.000000,-1072.0,780.0,-349.000000,-499.875000,0,0.0,...,0.0,0.406960,0.255682,0.0,0.0,0.0,0.0,0.175426,0.161932,0.0
100003,-2586,-606,-1400.750000,827783.583333,-2434.0,1216.0,-544.500000,-816.000000,0,0.0,...,0.0,,,,,,,,,
100004,-1326,-408,-867.000000,421362.000000,-595.0,-382.0,-488.500000,-532.000000,0,0.0,...,0.0,,,,,,,,,
100005,-373,-62,-190.666667,26340.333333,-128.0,1324.0,439.333333,-54.333333,0,0.0,...,0.0,0.735043,0.000000,0.0,0.0,0.0,0.0,0.128205,0.136752,0.0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
456249,-2713,-483,-1667.076923,407302.243590,-2499.0,1363.0,-1232.333333,-1064.538462,0,0.0,...,0.0,,,,,,,,,
456250,-1002,-760,-862.000000,15724.000000,-272.0,2340.0,1288.333333,-60.333333,0,0.0,...,0.0,0.130259,0.000000,0.0,0.0,0.0,0.0,0.252525,0.617216,0.0
456253,-919,-713,-867.500000,10609.000000,-189.0,1113.0,280.500000,-253.250000,0,0.0,...,0.0,0.404906,0.000000,0.0,0.0,0.0,0.0,0.459677,0.135417,0.0
456254,-1104,-1104,-1104.000000,,-859.0,-859.0,-859.000000,-401.000000,0,0.0,...,0.0,0.216216,0.000000,0.0,0.0,0.0,0.0,0.783784,0.000000,0.0


In [32]:
# 7.7 Remove hierarchical index from bureau_agg
bureau_agg.columns       # 62 
bureau_agg.columns = pd.Index(['BURO_' + e[0] + "_" + e[1].upper() for e in bureau_agg.columns.tolist()])

In [33]:
bureau_agg.columns

Index(['BURO_DAYS_CREDIT_MIN', 'BURO_DAYS_CREDIT_MAX', 'BURO_DAYS_CREDIT_MEAN',
       'BURO_DAYS_CREDIT_VAR', 'BURO_DAYS_CREDIT_ENDDATE_MIN',
       'BURO_DAYS_CREDIT_ENDDATE_MAX', 'BURO_DAYS_CREDIT_ENDDATE_MEAN',
       'BURO_DAYS_CREDIT_UPDATE_MEAN', 'BURO_CREDIT_DAY_OVERDUE_MAX',
       'BURO_CREDIT_DAY_OVERDUE_MEAN', 'BURO_AMT_CREDIT_MAX_OVERDUE_MEAN',
       'BURO_AMT_CREDIT_SUM_MAX', 'BURO_AMT_CREDIT_SUM_MEAN',
       'BURO_AMT_CREDIT_SUM_SUM', 'BURO_AMT_CREDIT_SUM_DEBT_MAX',
       'BURO_AMT_CREDIT_SUM_DEBT_MEAN', 'BURO_AMT_CREDIT_SUM_DEBT_SUM',
       'BURO_AMT_CREDIT_SUM_OVERDUE_MEAN', 'BURO_AMT_CREDIT_SUM_LIMIT_MEAN',
       'BURO_AMT_CREDIT_SUM_LIMIT_SUM', 'BURO_AMT_ANNUITY_MAX',
       'BURO_AMT_ANNUITY_MEAN', 'BURO_CNT_CREDIT_PROLONG_SUM',
       'BURO_MONTHS_BALANCE_MIN_MIN', 'BURO_MONTHS_BALANCE_MAX_MAX',
       'BURO_MONTHS_BALANCE_SIZE_MEAN', 'BURO_MONTHS_BALANCE_SIZE_SUM',
       'BURO_CREDIT_ACTIVE_Active_MEAN', 'BURO_CREDIT_ACTIVE_Bad debt_MEAN',
       'BURO_CREDI

In [34]:
# 7.9 No duplicate index
bureau_agg.index.nunique()   # 305811
len(set(bureau_agg.index))   # 305811

305811

In [36]:
bureau_agg = bureau_agg.reset_index()
bureau_agg

Unnamed: 0,SK_ID_CURR,BURO_DAYS_CREDIT_MIN,BURO_DAYS_CREDIT_MAX,BURO_DAYS_CREDIT_MEAN,BURO_DAYS_CREDIT_VAR,BURO_DAYS_CREDIT_ENDDATE_MIN,BURO_DAYS_CREDIT_ENDDATE_MAX,BURO_DAYS_CREDIT_ENDDATE_MEAN,BURO_DAYS_CREDIT_UPDATE_MEAN,BURO_CREDIT_DAY_OVERDUE_MAX,...,BURO_CREDIT_TYPE_nan_MEAN,BURO_STATUS_0_MEAN_MEAN,BURO_STATUS_1_MEAN_MEAN,BURO_STATUS_2_MEAN_MEAN,BURO_STATUS_3_MEAN_MEAN,BURO_STATUS_4_MEAN_MEAN,BURO_STATUS_5_MEAN_MEAN,BURO_STATUS_C_MEAN_MEAN,BURO_STATUS_X_MEAN_MEAN,BURO_STATUS_nan_MEAN_MEAN
0,100001,-1572,-49,-735.000000,240043.666667,-1329.0,1778.0,82.428571,-93.142857,0,...,0.0,0.336651,0.007519,0.0,0.0,0.0,0.0,0.441240,0.214590,0.0
1,100002,-1437,-103,-874.000000,186150.000000,-1072.0,780.0,-349.000000,-499.875000,0,...,0.0,0.406960,0.255682,0.0,0.0,0.0,0.0,0.175426,0.161932,0.0
2,100003,-2586,-606,-1400.750000,827783.583333,-2434.0,1216.0,-544.500000,-816.000000,0,...,0.0,,,,,,,,,
3,100004,-1326,-408,-867.000000,421362.000000,-595.0,-382.0,-488.500000,-532.000000,0,...,0.0,,,,,,,,,
4,100005,-373,-62,-190.666667,26340.333333,-128.0,1324.0,439.333333,-54.333333,0,...,0.0,0.735043,0.000000,0.0,0.0,0.0,0.0,0.128205,0.136752,0.0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
305806,456249,-2713,-483,-1667.076923,407302.243590,-2499.0,1363.0,-1232.333333,-1064.538462,0,...,0.0,,,,,,,,,
305807,456250,-1002,-760,-862.000000,15724.000000,-272.0,2340.0,1288.333333,-60.333333,0,...,0.0,0.130259,0.000000,0.0,0.0,0.0,0.0,0.252525,0.617216,0.0
305808,456253,-919,-713,-867.500000,10609.000000,-189.0,1113.0,280.500000,-253.250000,0,...,0.0,0.404906,0.000000,0.0,0.0,0.0,0.0,0.459677,0.135417,0.0
305809,456254,-1104,-1104,-1104.000000,,-859.0,-859.0,-859.000000,-401.000000,0,...,0.0,0.216216,0.000000,0.0,0.0,0.0,0.0,0.783784,0.000000,0.0


In [37]:
bureau_agg.to_csv('stat_bureau_bb.csv', index=False)
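
The final step described earlier, merging the aggregated bureau features into the application data on `SK_ID_CURR`, is not part of this section. A sketch of how it might look follows, using small made-up frames in place of the real files; `toy_app`, `toy_bureau_agg` and their values are hypothetical.

```python
import pandas as pd

# Hypothetical stand-ins for application_train and the saved bureau_agg
toy_app = pd.DataFrame({
    'SK_ID_CURR': [1, 2, 3],
    'TARGET': [0, 1, 0],
})
toy_bureau_agg = pd.DataFrame({
    'SK_ID_CURR': [1, 2],
    'BURO_DAYS_CREDIT_MIN': [-1572, -1437],
})

# Left merge keeps every application; validate guards against duplicated keys
merged = toy_app.merge(toy_bureau_agg, on='SK_ID_CURR', how='left',
                       validate='one_to_one')
print(merged)
```

Applications without any bureau record (client 3 here) keep NaN in all `BURO_` columns.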