## Stacked Generalization

In this notebook I will use stacked generalization (stacking) to try to improve on the accuracy of the models that I trained previously.

In [68]:
# loading a few of the classes that I will need
import os
import numpy as np
import pandas as pd
from sklearn.model_selection import GridSearchCV, RandomizedSearchCV, train_test_split
from sklearn.pipeline import Pipeline, make_pipeline
from sklearn.compose import ColumnTransformer
from sklearn.impute import KNNImputer
from sklearn.preprocessing import OneHotEncoder, StandardScaler, OrdinalEncoder

Now I will read in the training data and split it into a training set and a held-out set. The training set is used to fit the base models and the stacking classifier, and the held-out set is used to score the result.

In [69]:
train_path=os.path.join('..','data','processed','train.csv')

# to have all the columns displayed
pd.set_option('display.max_columns', 80)

train_data=pd.read_csv(train_path)
train_data

Unnamed: 0,id,hi_int_prncp_pd,acc_now_delinq,acc_open_past_24mths,addr_state,annual_inc,application_type,avg_cur_bal,bc_util,delinq_2yrs,delinq_amnt,dti,earliest_cr_line,emp_length,emp_title,grade,home_ownership,initial_list_status,int_rate,last_credit_pull_d,loan_amnt,mort_acc,num_sats,num_tl_120dpd_2m,num_tl_90g_dpd_24m,num_tl_30dpd,out_prncp_inv,pub_rec,pub_rec_bankruptcies,purpose,sub_grade,term,tot_coll_amt,tot_cur_bal,total_rec_late_fee,verification_status
0,3819,0,0,2,CT,21120.0,Individual,3662,86.1,0,0,33.69,2010,1.0,Security Guard,C,RENT,w,12.62,2019,7000,0,18,0,0,0,3828.54,0,0,debt_consolidation,C1,36,0,65921,0.00,1
1,3820,0,0,3,CA,32400.0,Individual,2973,30.2,0,0,15.78,1975,3.0,Security Guard,C,OWN,w,11.99,2016,16200,2,8,0,0,0,0.00,3,2,debt_consolidation,C1,36,0,17840,0.00,0
2,3821,0,0,3,NY,30251.0,Individual,1983,46.2,0,0,20.04,2010,2.0,Security Guard,B,RENT,w,11.49,2018,9600,0,10,0,0,0,0.00,0,0,debt_consolidation,B5,36,0,19829,0.00,1
3,3822,0,0,1,NJ,66976.0,Individual,1638,77.4,0,0,8.58,2012,1.0,Security Guard,C,OWN,w,12.62,2019,12000,0,6,0,0,0,9665.27,0,0,debt_consolidation,C1,60,0,9830,0.00,1
4,3823,0,0,2,MO,125000.0,Individual,577,0.0,0,0,2.34,1982,10.0,Administrator,A,MORTGAGE,w,6.49,2019,24000,4,13,0,0,0,11358.24,0,0,debt_consolidation,A2,60,0,6924,0.00,1
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
5773,9592,1,0,10,VA,72000.0,Individual,12881,89.7,0,0,24.50,2008,3.0,Transportation Security Specialist,D,RENT,w,18.25,2019,20000,0,13,0,0,0,6070.29,0,0,debt_consolidation,D3,36,0,167450,253.96,1
5774,9593,1,0,2,CA,125000.0,Individual,104603,90.2,0,0,6.39,2005,10.0,Stationary Engineer,C,OWN,w,16.01,2019,15000,2,4,0,0,0,13250.14,0,0,other,C5,60,0,523017,0.00,0
5775,9594,1,0,3,MD,95000.0,Individual,22815,96.5,0,0,28.45,1993,10.0,Infrastructure Analyst,B,MORTGAGE,w,11.53,2019,32000,2,18,0,0,0,9188.48,0,0,debt_consolidation,B5,60,0,410676,0.00,1
5776,9595,1,0,5,FL,42000.0,Individual,2586,96.3,0,0,18.20,1999,10.0,Quality Supervisor,C,RENT,w,13.67,2019,12000,0,8,0,0,0,5791.18,0,0,credit_card,C3,60,0,20684,0.00,1


In [70]:
# separating the target variable
y=train_data['hi_int_prncp_pd'].copy()

train_data=train_data.drop(columns=['hi_int_prncp_pd', 'addr_state', 'purpose', 'emp_title', 'id'])

train_data['verification_status']=train_data['verification_status'].astype('object')

In [71]:
# creating a training set and a held-out test set
# using the default 25% test split, stratified on the target
x_train, x_test, y_train, y_test=train_test_split(train_data, y, stratify=y, random_state=402)
x_train.head()

Unnamed: 0,acc_now_delinq,acc_open_past_24mths,annual_inc,application_type,avg_cur_bal,bc_util,delinq_2yrs,delinq_amnt,dti,earliest_cr_line,emp_length,grade,home_ownership,initial_list_status,int_rate,last_credit_pull_d,loan_amnt,mort_acc,num_sats,num_tl_120dpd_2m,num_tl_90g_dpd_24m,num_tl_30dpd,out_prncp_inv,pub_rec,pub_rec_bankruptcies,sub_grade,term,tot_coll_amt,tot_cur_bal,total_rec_late_fee,verification_status
3571,0,8,250000.0,Individual,51612,39.5,2,0,9.78,2003,10.0,A,MORTGAGE,w,6.49,2018,21000,8,13,0,0,0,0.0,0,0,A2,60,0,670956,0.0,1
4881,0,6,49000.0,Individual,9842,38.3,1,0,29.24,2003,1.0,E,MORTGAGE,w,25.81,2019,16000,1,23,0,1,0,15305.16,0,0,E4,60,0,226374,0.0,1
181,0,1,51460.0,Individual,3460,61.1,0,0,18.54,1999,10.0,B,RENT,w,10.41,2019,15000,1,6,0,0,0,10508.39,0,0,B3,36,0,17302,0.0,1
4587,0,2,110000.0,Individual,8733,22.0,0,0,17.99,2002,2.0,A,MORTGAGE,w,5.32,2018,20000,3,19,0,0,0,0.0,0,0,A1,36,0,165926,0.0,0
2369,0,1,72400.0,Individual,32761,81.3,0,0,33.27,1995,1.0,D,MORTGAGE,w,15.61,2016,18925,2,16,0,0,0,0.0,0,0,D1,36,0,393132,0.0,0
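The effect of `stratify=y` can be checked on a small example. The sketch below uses a hypothetical toy target (not the loan data) and confirms that the default split holds out 25% of the rows while keeping the share of positive labels the same on both sides. The names `X_toy` and `y_toy` are made up for illustration.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical toy data: 80 negatives and 20 positives (20% positive rate)
X_toy = np.arange(100).reshape(-1, 1)
y_toy = np.array([0] * 80 + [1] * 20)

# With neither test_size nor train_size given, 25% of the rows are held out
X_tr, X_te, y_tr, y_te = train_test_split(
    X_toy, y_toy, stratify=y_toy, random_state=402
)

# Stratification keeps the positive rate equal in both parts
train_pos_rate = y_tr.mean()
test_pos_rate = y_te.mean()
print(len(y_tr), len(y_te), train_pos_rate, test_pos_rate)
```

Without `stratify`, the positive rate in each part would vary from one random split to the next, which matters more when the classes are imbalanced.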


### Making the Base Learners' Pipelines:

In [72]:
x_train.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 4333 entries, 3571 to 3206
Data columns (total 31 columns):
 #   Column                Non-Null Count  Dtype  
---  ------                --------------  -----  
 0   acc_now_delinq        4333 non-null   int64  
 1   acc_open_past_24mths  4333 non-null   int64  
 2   annual_inc            4333 non-null   float64
 3   application_type      4333 non-null   object 
 4   avg_cur_bal           4333 non-null   int64  
 5   bc_util               4333 non-null   float64
 6   delinq_2yrs           4333 non-null   int64  
 7   delinq_amnt           4333 non-null   int64  
 8   dti                   4333 non-null   float64
 9   earliest_cr_line      4333 non-null   int64  
 10  emp_length            4332 non-null   float64
 11  grade                 4333 non-null   object 
 12  home_ownership        4333 non-null   object 
 13  initial_list_status   4333 non-null   object 
 14  int_rate              4333 non-null   float64
 15  last_credit_pull_d

There are 7 float, 18 int, and 6 object columns, for 31 in total.

All the categorical attributes will be one-hot encoded. For the SVM and KNN models, the numeric attributes will also be scaled using the StandardScaler.
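As a minimal sketch of what this preprocessing does, the example below applies a `ColumnTransformer` to a hypothetical four-row frame. The frame and the names `toy`, `toy_transformer`, and `to_dense` are made up for illustration; only the two column names are borrowed from the data above.

```python
import numpy as np
import pandas as pd
from scipy import sparse
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder, StandardScaler


def to_dense(matrix):
    """Return a plain ndarray whether the transformer gave sparse or dense output."""
    return matrix.toarray() if sparse.issparse(matrix) else np.asarray(matrix)


# Hypothetical toy frame with one numeric and one categorical column
toy = pd.DataFrame({
    'loan_amnt': [7000, 16200, 9600, 12000],
    'grade': ['C', 'C', 'B', 'A'],
})

toy_transformer = ColumnTransformer([
    ('num', StandardScaler(), ['loan_amnt']),
    ('cat', OneHotEncoder(handle_unknown='ignore'), ['grade']),
])

# One scaled numeric column plus one indicator column per grade seen in fit
toy_encoded = to_dense(toy_transformer.fit_transform(toy))
print(toy_encoded.shape)

# A grade that was not seen during fit is encoded as all zeros instead of raising
unseen = pd.DataFrame({'loan_amnt': [10000], 'grade': ['G']})
unseen_encoded = to_dense(toy_transformer.transform(unseen))
print(unseen_encoded)
```

The `handle_unknown='ignore'` setting is what lets a fitted transformer handle categories that appear only in later data.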

In [73]:
cat_attribs=list(x_train.select_dtypes(include=['object']))
cat_attribs

['application_type',
 'grade',
 'home_ownership',
 'initial_list_status',
 'sub_grade',
 'verification_status']

In [74]:
num_attribs=list(x_train.select_dtypes(include=['int64', 'float64']))
num_attribs

['acc_now_delinq',
 'acc_open_past_24mths',
 'annual_inc',
 'avg_cur_bal',
 'bc_util',
 'delinq_2yrs',
 'delinq_amnt',
 'dti',
 'earliest_cr_line',
 'emp_length',
 'int_rate',
 'last_credit_pull_d',
 'loan_amnt',
 'mort_acc',
 'num_sats',
 'num_tl_120dpd_2m',
 'num_tl_90g_dpd_24m',
 'num_tl_30dpd',
 'out_prncp_inv',
 'pub_rec',
 'pub_rec_bankruptcies',
 'term',
 'tot_coll_amt',
 'tot_cur_bal',
 'total_rec_late_fee']

Creating the SVM Pipeline

In [75]:
from sklearn.svm import SVC

# numeric pipeline: impute any missing values, then standardize
svm_num_pipeline=Pipeline([
    ('imputer', KNNImputer(n_neighbors=5)),
    ('num', StandardScaler())
])

svm_transformer=ColumnTransformer([
    ('num', svm_num_pipeline, num_attribs),
    ('cat', OneHotEncoder(handle_unknown='ignore'), cat_attribs)
])
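To make the imputation step concrete, here is a small sketch of `KNNImputer` on a hypothetical array. The numbers are invented, and `n_neighbors=2` is chosen only to keep the arithmetic easy to follow. A missing entry is replaced by the mean of that feature over the nearest rows, where distance is computed on the features that both rows have observed.

```python
import numpy as np
from sklearn.impute import KNNImputer

# Hypothetical data: the second row is missing its second feature
X_missing = np.array([
    [1.0, 2.0],
    [2.0, np.nan],
    [3.0, 6.0],
    [10.0, 20.0],
])

imputer = KNNImputer(n_neighbors=2)
X_imputed = imputer.fit_transform(X_missing)

# Rows 0 and 2 are closest to row 1 on the first feature,
# so the gap is filled with the mean of 2.0 and 6.0
print(X_imputed[1, 1])
```

Because the neighbours are found by distance, features on very different scales can dominate the search; that is an assumption worth keeping in mind when imputing before scaling, as the pipelines here do.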

Creating the pipeline for the Extra Trees Classifier.

In [76]:
from sklearn.ensemble import ExtraTreesClassifier

ex_trees_num_pipeline=Pipeline([
    ('imputer', KNNImputer(n_neighbors=5))
])

ex_trees_transformer=ColumnTransformer([
    ('num', ex_trees_num_pipeline, num_attribs),
    ('cat', OneHotEncoder(handle_unknown='ignore'), cat_attribs)
])

Creating the KNN Classifier pipeline

In [77]:
from sklearn.neighbors import KNeighborsClassifier

knn_num_pipeline=Pipeline([
    ('imputer', KNNImputer(n_neighbors=5)),
    ('num', StandardScaler())
])

knn_transformer=ColumnTransformer([
    ('num', knn_num_pipeline, num_attribs),
    ('cat', OneHotEncoder(handle_unknown='ignore'), cat_attribs)
])

Creating the XGBoost pipeline

In [78]:
from xgboost.sklearn import XGBClassifier

xgboost_pipeline=ColumnTransformer([
    ('num', KNNImputer(n_neighbors=5), num_attribs),
    ('cat', OneHotEncoder(handle_unknown='ignore'), cat_attribs)
])

Creating the LightGBM pipeline:

In [79]:
from lightgbm.sklearn import LGBMClassifier

lgbm_pipeline=ColumnTransformer([
    ('num', KNNImputer(n_neighbors=5), num_attribs),
    ('cat', OneHotEncoder(handle_unknown='ignore'), cat_attribs)
])

Here I will create the Stacking Classifier by first creating a list that contains the 5 different classifiers, each wrapped in a pipeline with its own transformer.

In [89]:
estimators=[
    ('svm', make_pipeline(svm_transformer,
                          SVC(kernel='rbf', random_state=402))),
    
    ('extra_trees', make_pipeline(ex_trees_transformer,
                                  ExtraTreesClassifier(n_estimators=100, random_state=402))),
    
    ('knn', make_pipeline(knn_transformer,
                          KNeighborsClassifier(n_neighbors=5, weights='distance'))),
    
    ('xgb', make_pipeline(xgboost_pipeline,
                          XGBClassifier(objective='binary:logistic', n_estimators=100, random_state=402))),
    
    ('lgbm', make_pipeline(lgbm_pipeline,
                           LGBMClassifier(objective='binary', n_estimators=100, random_state=402)))
]

Now I will pass the estimators to the Stacking Classifier class. They are fitted later, when its fit method is called.

I will use Logistic Regression as the final estimator to make the final predictions.

In [90]:
from sklearn.ensemble import StackingClassifier
from sklearn.linear_model import LogisticRegression

stack_clf=StackingClassifier(estimators=estimators,
                            final_estimator=LogisticRegression())
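As a rough sketch of what happens inside `StackingClassifier`, the example below fits a two-learner stack on synthetic data. The data set and the two base learners are stand-ins chosen to run quickly, not the models used in this notebook. With the default settings, the final estimator is trained on 5-fold cross-validated predictions of the base learners, and the base learners are then refit on the full training data.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import StackingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier

# Synthetic binary classification data (hypothetical, for illustration only)
X_demo, y_demo = make_classification(n_samples=200, n_features=8, random_state=402)

demo_stack = StackingClassifier(
    estimators=[
        ('knn', KNeighborsClassifier(n_neighbors=5)),
        ('tree', DecisionTreeClassifier(max_depth=3, random_state=402)),
    ],
    final_estimator=LogisticRegression(),
    cv=5,  # same as the default, written out to make the folds explicit
)
demo_stack.fit(X_demo, y_demo)

# transform() returns the meta-features seen by the final estimator:
# for a binary target, one column per base learner
meta_features = demo_stack.transform(X_demo)
print(meta_features.shape)

# One logistic regression weight per base learner
print(demo_stack.final_estimator_.coef_)
```

Training the final estimator on out-of-fold predictions keeps it from simply trusting whichever base learner overfits the training rows most.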

Finally, I will fit the classifier on the training set and get its accuracy on the held-out set using the score method.

In [91]:
%time stack_clf.fit(x_train, y_train).score(x_test, y_test)

CPU times: user 35.2 s, sys: 385 ms, total: 35.6 s
Wall time: 15.4 s


0.9107266435986159

We got an accuracy of 91.1% on the held-out set.

I will now load the test data into memory.

In [98]:
test_path=os.path.join('..', 'data', 'processed', 'test.csv')

test_data=pd.read_csv(test_path)

id_col=test_data['id'].copy()

test_data.head()

Unnamed: 0,id,acc_now_delinq,acc_open_past_24mths,addr_state,annual_inc,application_type,avg_cur_bal,bc_util,delinq_2yrs,delinq_amnt,dti,earliest_cr_line,emp_length,emp_title,grade,home_ownership,initial_list_status,int_rate,last_credit_pull_d,loan_amnt,mort_acc,num_sats,num_tl_120dpd_2m,num_tl_90g_dpd_24m,num_tl_30dpd,out_prncp_inv,pub_rec,pub_rec_bankruptcies,purpose,sub_grade,term,tot_coll_amt,tot_cur_bal,total_rec_late_fee,verification_status
0,1,0,10,NY,45000.0,Individual,3125,40.6,0,0,27.6,1987,10.0,Security Guard,D,RENT,w,17.57,2018,10000,0,21,0,0,0,0.0,0,0,medical,D4,60,379,65615,0.0,1
1,2,0,1,NY,40000.0,Individual,1534,26.6,0,0,17.54,2010,3.0,Security Guard,B,OWN,w,8.24,2017,14400,0,10,0,0,0,0.0,0,0,credit_card,B1,36,0,13806,0.0,0
2,3,0,1,FL,113000.0,Individual,49748,82.6,0,0,19.67,1999,8.0,Administrator,A,MORTGAGE,w,7.49,2019,30000,4,9,0,0,0,8141.22,0,0,debt_consolidation,A4,36,0,447731,0.0,1
3,4,0,6,CA,105000.0,Individual,42642,81.8,0,0,30.47,1988,10.0,Administrator,B,MORTGAGE,w,8.18,2019,10000,2,12,0,0,0,0.0,0,0,credit_card,B1,36,0,469067,0.0,0
4,5,0,9,CA,80000.0,Individual,4766,52.2,0,0,22.94,1990,10.0,Administrator,A,RENT,w,5.32,2019,6000,0,21,0,0,0,0.0,0,0,home_improvement,A1,36,0,100087,0.0,0


In [99]:
test_data=test_data.drop(columns=['addr_state', 'purpose', 'emp_title', 'id'])

In [100]:
# looking at the data one last time before feeding it into the Stacking Classifier
test_data.head()

Unnamed: 0,acc_now_delinq,acc_open_past_24mths,annual_inc,application_type,avg_cur_bal,bc_util,delinq_2yrs,delinq_amnt,dti,earliest_cr_line,emp_length,grade,home_ownership,initial_list_status,int_rate,last_credit_pull_d,loan_amnt,mort_acc,num_sats,num_tl_120dpd_2m,num_tl_90g_dpd_24m,num_tl_30dpd,out_prncp_inv,pub_rec,pub_rec_bankruptcies,sub_grade,term,tot_coll_amt,tot_cur_bal,total_rec_late_fee,verification_status
0,0,10,45000.0,Individual,3125,40.6,0,0,27.6,1987,10.0,D,RENT,w,17.57,2018,10000,0,21,0,0,0,0.0,0,0,D4,60,379,65615,0.0,1
1,0,1,40000.0,Individual,1534,26.6,0,0,17.54,2010,3.0,B,OWN,w,8.24,2017,14400,0,10,0,0,0,0.0,0,0,B1,36,0,13806,0.0,0
2,0,1,113000.0,Individual,49748,82.6,0,0,19.67,1999,8.0,A,MORTGAGE,w,7.49,2019,30000,4,9,0,0,0,8141.22,0,0,A4,36,0,447731,0.0,1
3,0,6,105000.0,Individual,42642,81.8,0,0,30.47,1988,10.0,B,MORTGAGE,w,8.18,2019,10000,2,12,0,0,0,0.0,0,0,B1,36,0,469067,0.0,0
4,0,9,80000.0,Individual,4766,52.2,0,0,22.94,1990,10.0,A,RENT,w,5.32,2019,6000,0,21,0,0,0,0.0,0,0,A1,36,0,100087,0.0,0


In [101]:
stack_predictions=pd.DataFrame(stack_clf.predict(test_data)).rename(columns={0:'Category'})
stack_predictions.head()

Unnamed: 0,Category
0,0
1,0
2,0
3,0
4,0


In [106]:
stack_submission=pd.concat([id_col, stack_predictions], axis=1)
stack_submission.rename(columns={'id':'Id'}, inplace=True)
stack_submission['Category']=stack_submission['Category'].apply(str)
stack_submission.head()

Unnamed: 0,Id,Category
0,1,0
1,2,0
2,3,0
3,4,0
4,5,0
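One thing worth checking when building a submission this way is that `pd.concat(..., axis=1)` joins on the index, not on row position. The sketch below uses hypothetical ids and predictions to show the step working when both pieces share a default index, and what happens when they do not. The names `ids`, `preds`, and `ids_filtered` are made up for illustration.

```python
import pandas as pd

# Hypothetical ids and predictions, both with the default 0..n-1 index
ids = pd.Series([1, 2, 3], name='id')
preds = pd.DataFrame({'Category': [0, 1, 0]})

submission = pd.concat([ids, preds], axis=1).rename(columns={'id': 'Id'})
submission['Category'] = submission['Category'].astype(str)
print(submission)

# If the ids carried a different index (for example after filtering rows),
# concat would align on the index and produce extra rows filled with NaN
ids_filtered = pd.Series([1, 2, 3], index=[10, 11, 12], name='id')
misaligned = pd.concat([ids_filtered, preds], axis=1)
print(len(misaligned))

# Resetting the index restores the row-by-row pairing
realigned = pd.concat([ids_filtered.reset_index(drop=True), preds], axis=1)
print(len(realigned))
```

In this notebook both pieces come from the same unfiltered test frame, so the indexes line up; the non-null counts in the `info()` output are a quick way to confirm that.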


In [107]:
stack_submission.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 3818 entries, 0 to 3817
Data columns (total 2 columns):
 #   Column    Non-Null Count  Dtype 
---  ------    --------------  ----- 
 0   Id        3818 non-null   int64 
 1   Category  3818 non-null   object
dtypes: int64(1), object(1)
memory usage: 59.8+ KB


In [109]:
# path to write file
prediction_path=os.path.join('..', 'data', 'predictions', 'stacking_preds.csv')

stack_submission.to_csv(prediction_path, index=False)

We will see how the predictions do on the test set from Kaggle.

This model scored 92.563% on the Kaggle test data. This is higher than the 91.1% that we got on the held-out split.