***Objective***

The objective of this competition is to build a machine learning model that predicts which individuals are most likely to have or use a bank account. The resulting models can indicate the state of financial inclusion in Kenya, Rwanda, Tanzania and Uganda, and offer insight into some of the key demographic factors that might drive individuals’ financial outcomes.


In [1]:
#load libraries
import pandas as pd
import numpy as np
%matplotlib inline
from sklearn import preprocessing
from sklearn.preprocessing import OneHotEncoder  
from sklearn.preprocessing import StandardScaler  
from scipy import sparse
from catboost import CatBoostClassifier, Pool
import lightgbm as lgb
import xgboost as xgb
from math import sqrt
from sklearn.metrics import mean_squared_error
from scipy.stats import uniform, randint
import warnings
warnings.filterwarnings("ignore")
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import LabelEncoder, MinMaxScaler
from sklearn.model_selection import TimeSeriesSplit, cross_val_score, GridSearchCV, RandomizedSearchCV
import random


In [14]:
#load data
train = pd.read_csv("train.csv")
test = pd.read_csv("test.csv")
sample_submission = pd.read_csv("sample_submission.csv")
description_data = pd.read_csv("VariableDescription.csv")

In [5]:
train.head()

Unnamed: 0,country,year,uniqueid,bank_account,location_type,cellphone_access,household_size,age_of_respondent,gender_of_respondent,relationship_with_head,marital_status,education_level,job_type
0,Kenya,2018,uniqueid_1,Yes,Rural,Yes,3,24,Female,Spouse,Married/Living together,Secondary education,Self employed
1,Kenya,2018,uniqueid_2,No,Rural,No,5,70,Female,Head of Household,Widowed,No formal education,Government Dependent
2,Kenya,2018,uniqueid_3,Yes,Urban,Yes,5,26,Male,Other relative,Single/Never Married,Vocational/Specialised training,Self employed
3,Kenya,2018,uniqueid_4,No,Rural,Yes,5,34,Female,Head of Household,Married/Living together,Primary education,Formally employed Private
4,Kenya,2018,uniqueid_5,No,Urban,No,8,26,Male,Child,Single/Never Married,Primary education,Informally employed


In [6]:
test.head()

Unnamed: 0,country,year,uniqueid,location_type,cellphone_access,household_size,age_of_respondent,gender_of_respondent,relationship_with_head,marital_status,education_level,job_type
0,Kenya,2018,uniqueid_6056,Urban,Yes,3,30,Male,Head of Household,Married/Living together,Secondary education,Formally employed Government
1,Kenya,2018,uniqueid_6060,Urban,Yes,7,51,Male,Head of Household,Married/Living together,Vocational/Specialised training,Formally employed Private
2,Kenya,2018,uniqueid_6065,Rural,No,3,77,Female,Parent,Married/Living together,No formal education,Remittance Dependent
3,Kenya,2018,uniqueid_6072,Rural,No,6,39,Female,Head of Household,Married/Living together,Primary education,Remittance Dependent
4,Kenya,2018,uniqueid_6073,Urban,No,3,16,Male,Child,Single/Never Married,Secondary education,Remittance Dependent


In [7]:
sample_submission.head() 

Unnamed: 0,uniqueid,bank_account
0,uniqueid_7867 x Kenya,1.0
1,uniqueid_6722 x Kenya,0.0
2,uniqueid_6714 x Kenya,1.0
3,uniqueid_8103 x Kenya,1.0
4,uniqueid_8657 x Kenya,1.0


In [8]:
description_data.head()

Unnamed: 0,Variable Definitions,Unnamed: 1
0,country,Country interviewee is in.
1,year,Year survey was done in.
2,uniqueid,Unique identifier for each interviewee
3,location_type,"Type of location: Rural, Urban"
4,cellphone_access,"If interviewee has access to a cellphone: Yes, No"


In [9]:
train.isnull().sum()

country                   0
year                      0
uniqueid                  0
bank_account              0
location_type             0
cellphone_access          0
household_size            0
age_of_respondent         0
gender_of_respondent      0
relationship_with_head    0
marital_status            0
education_level           0
job_type                  0
dtype: int64

In [15]:
#uniqueid + " x " + country name
train['uniqueid'] = train['uniqueid'] + " x " + train['country']
test['uniqueid'] = test['uniqueid'] + " x " + test['country']

In [16]:
train.head()

Unnamed: 0,country,year,uniqueid,bank_account,location_type,cellphone_access,household_size,age_of_respondent,gender_of_respondent,relationship_with_head,marital_status,education_level,job_type
0,Kenya,2018,uniqueid_1 x Kenya,Yes,Rural,Yes,3,24,Female,Spouse,Married/Living together,Secondary education,Self employed
1,Kenya,2018,uniqueid_2 x Kenya,No,Rural,No,5,70,Female,Head of Household,Widowed,No formal education,Government Dependent
2,Kenya,2018,uniqueid_3 x Kenya,Yes,Urban,Yes,5,26,Male,Other relative,Single/Never Married,Vocational/Specialised training,Self employed
3,Kenya,2018,uniqueid_4 x Kenya,No,Rural,Yes,5,34,Female,Head of Household,Married/Living together,Primary education,Formally employed Private
4,Kenya,2018,uniqueid_5 x Kenya,No,Urban,No,8,26,Male,Child,Single/Never Married,Primary education,Informally employed


In [None]:
#non categorical = age_of_respondent

In [18]:
# keep the ids aside before the uniqueid column is dropped from test
test_uniqueid = test['uniqueid'].reset_index(drop=True)

In [25]:
train = train[['country','year','bank_account','location_type','cellphone_access','household_size',
               'age_of_respondent','gender_of_respondent','relationship_with_head','marital_status',
               'education_level','job_type']]

In [26]:
test = test[['country','year','location_type','cellphone_access','household_size',
               'age_of_respondent','gender_of_respondent','relationship_with_head','marital_status',
               'education_level','job_type']]

In [27]:
train_categorical = ['country','year','location_type','cellphone_access','household_size',
               'gender_of_respondent','relationship_with_head','marital_status',
               'education_level','job_type']
test_categorical = ['country','year','location_type','cellphone_access','household_size',
               'gender_of_respondent','relationship_with_head','marital_status',
               'education_level','job_type']

In [42]:
train["bank_account"] = train["bank_account"].astype('category')
train["bank_account"] = train["bank_account"].cat.codes
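A quick check of what `cat.codes` does in this step, using a toy Series rather than the competition data: pandas orders string categories alphabetically, so `No` should map to 0 and `Yes` to 1. The names `toy`, `as_cat` and `mapping` are illustrative only.

```python
import pandas as pd

# Toy labels standing in for train["bank_account"] (not the real data)
toy = pd.Series(["Yes", "No", "Yes", "No", "No"])

# astype("category") sorts the distinct strings, so categories are ['No', 'Yes']
as_cat = toy.astype("category")
codes = as_cat.cat.codes

# Map each integer code back to its original label
mapping = dict(enumerate(as_cat.cat.categories))
print(mapping)         # {0: 'No', 1: 'Yes'}
print(codes.tolist())  # [1, 0, 1, 0, 0]
```

Keeping a mapping like this around makes it easier to confirm later that a predicted 1 means "has a bank account".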

In [43]:
train.head()

Unnamed: 0,country,year,bank_account,location_type,cellphone_access,household_size,age_of_respondent,gender_of_respondent,relationship_with_head,marital_status,education_level,job_type
0,Kenya,2018,1,Rural,Yes,3,24,Female,Spouse,Married/Living together,Secondary education,Self employed
1,Kenya,2018,0,Rural,No,5,70,Female,Head of Household,Widowed,No formal education,Government Dependent
2,Kenya,2018,1,Urban,Yes,5,26,Male,Other relative,Single/Never Married,Vocational/Specialised training,Self employed
3,Kenya,2018,0,Rural,Yes,5,34,Female,Head of Household,Married/Living together,Primary education,Formally employed Private
4,Kenya,2018,0,Urban,No,8,26,Male,Child,Single/Never Married,Primary education,Informally employed


In [44]:
#Catboost
#split training data set
X_train1 = train.drop(['bank_account'], axis=1)
y_train1 = train.bank_account

#split test dataset (the test set has no labels)
X_test1 = test

def column_index(df, query_cols):
    cols = df.columns.values
    sidx = np.argsort(cols)
    return sidx[np.searchsorted(cols, query_cols, sorter=sidx)]

categorical_features_indices = column_index(X_train1, train_categorical)
categorical_features_indices1 = column_index(X_test1, test_categorical)
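To see what `column_index` returns, here is a small sketch on a hypothetical four-column frame. It also compares the result with pandas' built-in `Index.get_indexer`, which appears to give the same positions with less code. The frame `toy` and the list `toy_categorical` are made up for illustration.

```python
import numpy as np
import pandas as pd

def column_index(df, query_cols):
    # Same helper as in the cell above: positions of query_cols in df.columns
    cols = df.columns.values
    sidx = np.argsort(cols)
    return sidx[np.searchsorted(cols, query_cols, sorter=sidx)]

# Hypothetical frame with one numeric and three categorical columns
toy = pd.DataFrame({
    "country": ["Kenya"],
    "year": [2018],
    "age_of_respondent": [24],
    "job_type": ["Self employed"],
})
toy_categorical = ["country", "year", "job_type"]

positions = column_index(toy, toy_categorical)
builtin = toy.columns.get_indexer(toy_categorical)

print(positions.tolist())  # [0, 1, 3]
print(builtin.tolist())    # [0, 1, 3]
```

Position 2 (`age_of_respondent`) is skipped, which is what CatBoost needs: only the listed positions are treated as categorical.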


In [47]:
from catboost import CatBoostClassifier
from sklearn import metrics
#import cb as catboost

#def auc(m, train): 
#    return (metrics.roc_auc_score(y_train1,m.predict_proba(y_train)[:,1])#,
    #metrics.roc_auc_score(y_test,m.predict_proba(test)[:,1]))

params = {'depth': [4, 7, 10],
          'learning_rate' : [0.03, 0.1, 0.15],
         'l2_leaf_reg': [1,4,9],
         'iterations': [300]}

cb = CatBoostClassifier()
cb_model = GridSearchCV(cb, params, scoring="roc_auc", cv = 3)
#cb_model.fit(X_train1, y_train1)

#Without Categorical features
#clf = cb.CatBoostClassifier(eval_metric="AUC", depth=10, iterations= 500, l2_leaf_reg= 9, learning_rate= 0.15)
#clf.fit(X_train1,y_train1)
#auc(clf, train, test)

#With Categorical features
clf = CatBoostClassifier(eval_metric="AUC",one_hot_max_size=31, depth=10, iterations= 500, l2_leaf_reg= 9, learning_rate= 0.15)
clf.fit(X_train1,y_train1, cat_features= categorical_features_indices)
#auc(clf, X_train1, y_train1)

0:	total: 52.7ms	remaining: 26.3s
1:	total: 297ms	remaining: 1m 14s
2:	total: 527ms	remaining: 1m 27s
3:	total: 617ms	remaining: 1m 16s
4:	total: 661ms	remaining: 1m 5s
5:	total: 904ms	remaining: 1m 14s
6:	total: 1.14s	remaining: 1m 20s
7:	total: 1.37s	remaining: 1m 24s
8:	total: 1.58s	remaining: 1m 26s
9:	total: 1.62s	remaining: 1m 19s
10:	total: 1.76s	remaining: 1m 18s
11:	total: 1.91s	remaining: 1m 17s
12:	total: 2.14s	remaining: 1m 20s
13:	total: 2.28s	remaining: 1m 19s
14:	total: 2.33s	remaining: 1m 15s
15:	total: 2.4s	remaining: 1m 12s
16:	total: 2.46s	remaining: 1m 9s
17:	total: 2.53s	remaining: 1m 7s
18:	total: 2.68s	remaining: 1m 7s
19:	total: 2.76s	remaining: 1m 6s
20:	total: 2.8s	remaining: 1m 3s
21:	total: 2.85s	remaining: 1m 1s
22:	total: 2.9s	remaining: 1m
23:	total: 2.96s	remaining: 58.7s
24:	total: 3s	remaining: 57s
25:	total: 3.22s	remaining: 58.8s
26:	total: 3.26s	remaining: 57.2s
27:	total: 3.35s	remaining: 56.4s
28:	total: 3.38s	remaining: 54.9s
29:	total: 3.61s	rem

240:	total: 33s	remaining: 35.5s
241:	total: 33.4s	remaining: 35.6s
242:	total: 33.6s	remaining: 35.5s
243:	total: 34s	remaining: 35.7s
244:	total: 34.2s	remaining: 35.6s
245:	total: 34.6s	remaining: 35.7s
246:	total: 34.9s	remaining: 35.7s
247:	total: 34.9s	remaining: 35.5s
248:	total: 35.2s	remaining: 35.5s
249:	total: 35.3s	remaining: 35.3s
250:	total: 35.6s	remaining: 35.3s
251:	total: 35.6s	remaining: 35.1s
252:	total: 35.9s	remaining: 35.1s
253:	total: 36.1s	remaining: 35s
254:	total: 36.4s	remaining: 34.9s
255:	total: 36.6s	remaining: 34.9s
256:	total: 36.8s	remaining: 34.8s
257:	total: 37.1s	remaining: 34.8s
258:	total: 37.4s	remaining: 34.8s
259:	total: 37.6s	remaining: 34.7s
260:	total: 37.9s	remaining: 34.7s
261:	total: 38.1s	remaining: 34.7s
262:	total: 38.4s	remaining: 34.6s
263:	total: 38.5s	remaining: 34.4s
264:	total: 38.8s	remaining: 34.4s
265:	total: 39s	remaining: 34.3s
266:	total: 39.1s	remaining: 34.1s
267:	total: 39.3s	remaining: 34.1s
268:	total: 39.6s	remaining:

475:	total: 1m 19s	remaining: 4.03s
476:	total: 1m 20s	remaining: 3.86s
477:	total: 1m 20s	remaining: 3.69s
478:	total: 1m 20s	remaining: 3.53s
479:	total: 1m 20s	remaining: 3.36s
480:	total: 1m 20s	remaining: 3.19s
481:	total: 1m 21s	remaining: 3.03s
482:	total: 1m 21s	remaining: 2.86s
483:	total: 1m 21s	remaining: 2.69s
484:	total: 1m 21s	remaining: 2.52s
485:	total: 1m 21s	remaining: 2.35s
486:	total: 1m 22s	remaining: 2.19s
487:	total: 1m 22s	remaining: 2.02s
488:	total: 1m 22s	remaining: 1.85s
489:	total: 1m 22s	remaining: 1.68s
490:	total: 1m 22s	remaining: 1.51s
491:	total: 1m 22s	remaining: 1.34s
492:	total: 1m 22s	remaining: 1.18s
493:	total: 1m 23s	remaining: 1.01s
494:	total: 1m 23s	remaining: 840ms
495:	total: 1m 23s	remaining: 672ms
496:	total: 1m 23s	remaining: 503ms
497:	total: 1m 23s	remaining: 335ms
498:	total: 1m 23s	remaining: 167ms
499:	total: 1m 23s	remaining: 0us


<catboost.core.CatBoostClassifier at 0x1a1a49a160>

In [56]:
y_pred_cat = clf.predict(X_test1)
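The commented-out `auc` helper in the CatBoost cell is never run, so no validation score is reported for the model. One way to get one is to hold out part of the training data and score the predicted probabilities with `roc_auc_score`. The sketch below uses synthetic data and scikit-learn's `GradientBoostingClassifier` as a stand-in so that it runs without the competition files; the same pattern should carry over to `clf` with `X_train1` and `y_train1`, but that is an assumption rather than something the notebook does.

```python
import numpy as np
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split

# Synthetic binary problem (stand-in for X_train1 / y_train1)
rng = np.random.RandomState(0)
X = rng.normal(size=(500, 4))
y = (X[:, 0] + 0.5 * X[:, 1] + rng.normal(scale=0.5, size=500) > 0).astype(int)

# Hold out 20% of the rows, keeping the class balance in both parts
X_fit, X_val, y_fit, y_val = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=0
)

# Stand-in model; swap in the CatBoost classifier in the real notebook
model = GradientBoostingClassifier(random_state=0)
model.fit(X_fit, y_fit)

# AUC is computed on probabilities of the positive class, not on hard labels
val_proba = model.predict_proba(X_val)[:, 1]
val_auc = roc_auc_score(y_val, val_proba)
print(f"validation AUC: {val_auc:.3f}")
```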

In [52]:
#light gbm
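# Work on copies so the encoding below does not modify train/test in place
df_train1 = train.copy()
df_test1 = test.copy()

In [53]:
# Fit each encoder on the training column and reuse it on the test column,
# so a category gets the same integer code in both sets
# (transform raises a ValueError if test contains a category unseen in train)
for col in train_categorical:
    label_encoder = preprocessing.LabelEncoder()
    df_train1[col] = label_encoder.fit_transform(df_train1[col].astype(str))
    df_test1[col] = label_encoder.transform(df_test1[col].astype(str))

#split training data set
X_train3 = df_train1.drop(['bank_account'], axis=1)
y_train3 = df_train1.bank_account

#split test dataset
X_test3 = df_test1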
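A point worth checking when label-encoding train and test: an encoder fitted on a test column alone assigns codes from the categories present in that column, and these need not match the codes learned from train. A toy sketch (the values are illustrative, not taken from the dataset):

```python
from sklearn.preprocessing import LabelEncoder

# Toy location_type columns; 'Rural' happens to be absent from the test column
train_col = ["Rural", "Urban", "Rural", "Urban"]
test_col = ["Urban", "Urban"]

# Encoder fitted on test alone: 'Urban' is the only class, so it becomes 0
separate = LabelEncoder().fit_transform(test_col)

# Encoder fitted on train and reused on test: 'Urban' keeps its train code, 1
shared = LabelEncoder().fit(train_col)
consistent = shared.transform(test_col)

print(separate.tolist())    # [0, 0]
print(consistent.tolist())  # [1, 1]
```

With the separate fit, a model trained on the train codes would read every test row as `Rural`.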

In [54]:
import lightgbm as lgb
from sklearn import metrics

#def auc2(m, train, test): 
#    return (metrics.roc_auc_score(y_train,m.predict(train)),
#                            metrics.roc_auc_score(y_test,m.predict(test)))

lg = lgb.LGBMClassifier(silent=False)
param_dist = {"max_depth": [25,50, 75],
              "learning_rate" : [0.01,0.05,0.1],
              "num_leaves": [300,900,1200],
              "n_estimators": [200]
             }
grid_search = GridSearchCV(lg, n_jobs=-1, param_grid=param_dist, cv = 3, scoring="roc_auc", verbose=5)
grid_search.fit(X_train3,y_train3)
grid_search.best_estimator_

d_train = lgb.Dataset(X_train3, label=y_train3)
# lgb.train defaults to a regression objective, so set "binary" explicitly
params = {"objective": "binary", "max_depth": 50, "learning_rate" : 0.01, "num_leaves": 900,  "n_estimators": 300}

# Without Categorical Features
#model2 = lgb.train(params, d_train)
#auc2(model2, train, test)

#With Categorical Features
model2 = lgb.train(params, d_train, categorical_feature = train_categorical)
#auc2(model2, train, test)

Fitting 3 folds for each of 27 candidates, totalling 81 fits


[Parallel(n_jobs=-1)]: Using backend LokyBackend with 4 concurrent workers.
[Parallel(n_jobs=-1)]: Done  10 tasks      | elapsed:  1.0min
[Parallel(n_jobs=-1)]: Done  64 tasks      | elapsed:  2.6min
[Parallel(n_jobs=-1)]: Done  81 out of  81 | elapsed:  3.1min finished


In [55]:
y_pred_lgbm = model2.predict(X_test3)

In [57]:
#xgboost
import xgboost as xgb
from sklearn import metrics

In [58]:
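# Work on copies so the encoding below does not modify train/test in place
df_train3 = train.copy()
df_test3 = test.copy()

In [68]:
#train
# Fit each encoder on the training column and reuse it on the test column,
# so a category gets the same integer code in both sets
# (transform raises a ValueError if test contains a category unseen in train)
for col in train_categorical:
    label_encoder = preprocessing.LabelEncoder()
    df_train3[col] = label_encoder.fit_transform(df_train3[col].astype(str))
    df_test3[col] = label_encoder.transform(df_test3[col].astype(str))

#split training data set
X_train3 = df_train3.drop(['bank_account'], axis=1)
y_train3 = df_train3.bank_account

#split test dataset
X_test3 = df_test3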

In [69]:
# Parameter Tuning
model = xgb.XGBClassifier()
param_dist = {"max_depth": [10,30,50],
              "min_child_weight" : [1,3,6],
              "n_estimators": [200],
              "learning_rate": [0.05, 0.1,0.16],}
grid_search = GridSearchCV(model, param_grid=param_dist, cv = 3, 
                                   verbose=10, n_jobs=-1)
grid_search.fit(X_train3, y_train3)

grid_search.best_estimator_

model = xgb.XGBClassifier(max_depth=50, min_child_weight=1,  n_estimators=200,\
                          n_jobs=-1 , verbose=1,learning_rate=0.16)
model.fit(X_train3,y_train3)

Fitting 3 folds for each of 27 candidates, totalling 81 fits


[Parallel(n_jobs=-1)]: Using backend LokyBackend with 4 concurrent workers.
[Parallel(n_jobs=-1)]: Done   5 tasks      | elapsed:  1.0min
[Parallel(n_jobs=-1)]: Done  10 tasks      | elapsed:  1.4min
[Parallel(n_jobs=-1)]: Done  17 tasks      | elapsed:  1.9min
[Parallel(n_jobs=-1)]: Done  24 tasks      | elapsed:  2.5min
[Parallel(n_jobs=-1)]: Done  33 tasks      | elapsed:  3.0min
[Parallel(n_jobs=-1)]: Done  42 tasks      | elapsed:  3.8min
[Parallel(n_jobs=-1)]: Done  53 tasks      | elapsed:  4.8min
[Parallel(n_jobs=-1)]: Done  64 tasks      | elapsed:  5.4min
[Parallel(n_jobs=-1)]: Done  81 out of  81 | elapsed:  6.8min finished


XGBClassifier(base_score=0.5, booster='gbtree', colsample_bylevel=1,
              colsample_bynode=1, colsample_bytree=1, gamma=0,
              learning_rate=0.16, max_delta_step=0, max_depth=50,
              min_child_weight=1, missing=None, n_estimators=200, n_jobs=-1,
              nthread=None, objective='binary:logistic', random_state=0,
              reg_alpha=0, reg_lambda=1, scale_pos_weight=1, seed=None,
              silent=None, subsample=1, verbose=1, verbosity=1)

In [71]:
y_pred_xgb = model.predict(X_test3)
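
The notebook stops at the predictions. Below is a sketch of how they might be written out in the `uniqueid`, `bank_account` layout shown by `sample_submission.head()`, assuming the saved ids line up row for row with the predictions. The toy ids, the toy probabilities and the 0.5 threshold are assumptions for illustration, not values from the notebook.

```python
import numpy as np
import pandas as pd

# Toy stand-ins for test_uniqueid and for a model's predicted probabilities
toy_ids = pd.Series([
    "uniqueid_6056 x Kenya",
    "uniqueid_6060 x Kenya",
    "uniqueid_6065 x Kenya",
])
toy_proba = np.array([0.81, 0.40, 0.07])

# Threshold at 0.5 (an assumption) to get the 1.0 / 0.0 labels
submission = pd.DataFrame({
    "uniqueid": toy_ids,
    "bank_account": (toy_proba >= 0.5).astype(float),
})

# to_csv without a path returns the CSV text; pass a file name to write to disk
csv_text = submission.to_csv(index=False)
print(csv_text)
```

Models that already return hard labels, such as `clf.predict` or `model.predict` on the classifiers above, would skip the thresholding step.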