## Voting Classifiers


I am trying to predict loan default outcomes (0 or 1) using an unweighted soft voting ensemble classifier (sklearn's VotingClassifier class with voting='soft'). For a given sample, the ensemble averages the class probabilities predicted by the component classifiers and outputs the class label with the highest average. The component classifiers used here are:


1. Decision tree
2. Gaussian naive Bayes
3. RBF kernel support vector machine
4. K-nearest neighbors
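
To make the soft-voting rule concrete, here is a minimal sketch of the arithmetic for a single sample. The probabilities are made up for illustration and do not come from the loan data; they are chosen so that soft voting and simple majority (hard) voting disagree.

```python
import numpy as np

# Hypothetical class probabilities [P(class 0), P(class 1)] for ONE sample,
# one row per component classifier. These numbers are invented for illustration.
probas = np.array([
    [0.90, 0.10],  # decision tree
    [0.40, 0.60],  # Gaussian naive Bayes
    [0.45, 0.55],  # RBF kernel SVM
    [0.40, 0.60],  # k-nearest neighbors
])

# Unweighted soft voting: average the probabilities per class, then take the argmax.
avg_proba = probas.mean(axis=0)          # [0.5375, 0.4625]
soft_label = int(np.argmax(avg_proba))   # class 0

# Hard (majority) voting for comparison: each classifier casts one vote.
hard_votes = probas.argmax(axis=1)                   # [0, 1, 1, 1]
hard_label = int(np.bincount(hard_votes).argmax())   # class 1

print('soft vote:', soft_label, '| hard vote:', hard_label)
```

One confident classifier (the tree here) can outweigh three lukewarm ones under soft voting, which is why the quality of each component's probability estimates matters.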


Read in the data, split it into training and testing subsets (70/30), and z-score standardize the features using the mean and standard deviation of the training subset only.

In [1]:
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.preprocessing import StandardScaler
from sklearn.ensemble import VotingClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.naive_bayes import GaussianNB
from sklearn.svm import SVC
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import Pipeline
from sklearn.metrics import roc_auc_score, accuracy_score


In [2]:
df = pd.read_csv('/home/alam/Downloads/New DAta/DataNew.csv')

In [3]:
X = df.loc[:, ['Credit_Amount','Duration_in_Months',
           'Age','Current_Address_Yrs','Num_Credits','Num_Dependents']].values

In [4]:
y = df.loc[:, 'Default_On_Payment'].values

In [5]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=1)

In [6]:
sc = StandardScaler()


In [7]:

sc.fit(X_train)




StandardScaler(copy=True, with_mean=True, with_std=True)

In [8]:
X_train_std = sc.transform(X_train)
X_test_std = sc.transform(X_test)
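
As a sanity check on the standardization step, the sketch below repeats the same split-fit-transform sequence on synthetic stand-in features (not the loan data; `X_demo`, `y_demo` and the other `*_demo` names are made up for this example). It confirms that the training features end up with zero mean and unit variance, and that the test features are scaled with the training statistics rather than their own.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in data: 200 samples, 3 features (an assumption for illustration).
rng = np.random.RandomState(1)
X_demo = rng.normal(loc=50.0, scale=10.0, size=(200, 3))
y_demo = rng.randint(0, 2, size=200)

X_tr_demo, X_te_demo, y_tr_demo, y_te_demo = train_test_split(
    X_demo, y_demo, test_size=0.3, random_state=1)

# Fit on the training subset only, then apply the same transform to both subsets.
sc_demo = StandardScaler().fit(X_tr_demo)
X_tr_demo_std = sc_demo.transform(X_tr_demo)
X_te_demo_std = sc_demo.transform(X_te_demo)

# Manual z-score of the test set using TRAINING mean and standard deviation.
manual_te = (X_te_demo - X_tr_demo.mean(axis=0)) / X_tr_demo.std(axis=0)

print(np.allclose(X_tr_demo_std.mean(axis=0), 0.0))  # training means are 0
print(np.allclose(X_tr_demo_std.std(axis=0), 1.0))   # training std devs are 1
print(np.allclose(X_te_demo_std, manual_te))         # test set uses training statistics
```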

In [9]:
import time
# time.clock() was removed in Python 3.8; perf_counter() measures elapsed wall-clock time.
t0 = time.perf_counter()

# Component classifiers. probability=True lets SVC supply the class probabilities soft voting needs.
tree = DecisionTreeClassifier(random_state=1)
svm = SVC(probability=True, kernel='rbf')
knn = KNeighborsClassifier(p=2, metric='minkowski')
nb = GaussianNB()
eclf = VotingClassifier(estimators=[('tree', tree), ('svm', svm), ('knn', knn), ('nb', nb)], voting='soft')

# Hyperparameter grid; '<estimator name>__<parameter>' addresses one component of the ensemble.
param_range10 = [.001, .01, 1, 10, 100]
param_range1 = list(range(3, 8))
param_grid = [{'svm__C': param_range10, 'svm__gamma': param_range10, 'tree__max_depth': param_range1,
               'knn__n_neighbors': param_range1}]

gs = GridSearchCV(estimator=eclf, param_grid=param_grid, scoring='accuracy', cv=5)
gs = gs.fit(X_train_std, y_train)

print('Best accuracy score: %.3f \nBest parameters: %s' % (gs.best_score_, gs.best_params_))

# With the default refit=True, best_estimator_ has already been refit on the full training set.
clf = gs.best_estimator_
t1 = time.perf_counter()
print('Running time: %.3f' % (t1-t0))

Best accuracy score: 0.955 
Best parameters: {'tree__max_depth': 4, 'svm__C': 100, 'knn__n_neighbors': 3, 'svm__gamma': 100}
Running time: 7200.043
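
The long running time follows from the size of the search. The sketch below counts the candidates in the same grid with sklearn's `ParameterGrid`; the variable names `n_candidates`, `cv_folds` and `n_fits` are introduced here for illustration.

```python
from sklearn.model_selection import ParameterGrid

# Same grid as in the search above.
param_range10 = [.001, .01, 1, 10, 100]
param_range1 = list(range(3, 8))
param_grid = [{'svm__C': param_range10, 'svm__gamma': param_range10,
               'tree__max_depth': param_range1, 'knn__n_neighbors': param_range1}]

n_candidates = len(ParameterGrid(param_grid))  # 5 * 5 * 5 * 5 combinations
cv_folds = 5
n_fits = n_candidates * cv_folds               # ensemble fits during cross-validation

print(n_candidates, n_fits)  # 625 3125
```

Each of those fits trains all four component models, and SVC with probability=True additionally runs an internal 5-fold cross-validation to calibrate its probabilities, so the SVM dominates the cost. A coarser grid or `RandomizedSearchCV` would be ways to shorten the search if that becomes necessary.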


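The best voting classifier reaches a mean 5-fold cross-validated accuracy of 0.955 on the training data. This is not a test-set result: the held-out 30% has not been scored yet.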
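
The 0.955 reported above is a cross-validated score on the training data. A held-out check could look like the sketch below, which uses the `accuracy_score` and `roc_auc_score` functions imported at the top of the notebook. Because the loan CSV is not available here, it runs on synthetic data from `make_classification` (an assumption for illustration), so the printed scores say nothing about the loan model. In the notebook itself, the equivalent would be to call `clf.predict` and `clf.predict_proba` on `X_test_std` and compare against `y_test`.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import VotingClassifier
from sklearn.metrics import accuracy_score, roc_auc_score
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB
from sklearn.neighbors import KNeighborsClassifier
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for the loan data (assumption: 6 numeric features, binary target).
X_syn, y_syn = make_classification(n_samples=300, n_features=6, random_state=1)
X_tr, X_te, y_tr, y_te = train_test_split(X_syn, y_syn, test_size=0.3, random_state=1)

# Standardize with training statistics only, as in the notebook.
sc_syn = StandardScaler().fit(X_tr)
X_tr_std, X_te_std = sc_syn.transform(X_tr), sc_syn.transform(X_te)

# Ensemble configured with the best parameters reported by the grid search.
clf_demo = VotingClassifier(
    estimators=[('tree', DecisionTreeClassifier(max_depth=4, random_state=1)),
                ('svm', SVC(probability=True, kernel='rbf', C=100, gamma=100, random_state=1)),
                ('knn', KNeighborsClassifier(n_neighbors=3)),
                ('nb', GaussianNB())],
    voting='soft')
clf_demo.fit(X_tr_std, y_tr)

# Accuracy uses hard labels; ROC AUC uses the averaged probability of class 1.
y_pred = clf_demo.predict(X_te_std)
y_score = clf_demo.predict_proba(X_te_std)[:, 1]
test_acc = accuracy_score(y_te, y_pred)
test_auc = roc_auc_score(y_te, y_score)

print('Test accuracy: %.3f' % test_acc)
print('Test ROC AUC: %.3f' % test_auc)
```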