# Voting classifier

<img src="./nb_images/voting.png" alt="Drawing" style="width: 400px;"/>

+ Voting classifier
    - is the most basic ensemble classifier.
    - selects the class that most of its member classifiers vote for.
    - is also called `Majority voting` or `Vanilla Ensemble`.
    
<img src="./nb_images/voting-classifier.png" alt="Drawing" style="width: 600px;"/>

+ voting parameter
    - Hard : uses predicted class labels for majority rule voting.
    - Soft : predicts the class label based on the argmax of the sums of the predicted probabilities, which is recommended for an ensemble of well-calibrated classifiers.
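
The difference between the two modes can be sketched in a few lines of plain Python. This is only an illustration of the voting rules described above, not scikit-learn's implementation; the probabilities and the helper names `hard_vote` and `soft_vote` are made up for the example.

```python
from collections import Counter

# Hypothetical class probabilities [P(class 0), P(class 1)] from three
# classifiers for a single sample (illustrative values, not from the data).
probas = [
    [0.45, 0.55],  # classifier A narrowly prefers class 1
    [0.40, 0.60],  # classifier B prefers class 1
    [0.95, 0.05],  # classifier C strongly prefers class 0
]


def hard_vote(probas):
    """Each classifier casts one vote for its most probable class."""
    votes = [max(range(len(p)), key=p.__getitem__) for p in probas]
    counts = Counter(votes)
    # Highest count wins; on equal counts the smaller label wins.
    return min(counts, key=lambda label: (-counts[label], label))


def soft_vote(probas):
    """Sum the probabilities per class and take the argmax of the sums."""
    totals = [sum(column) for column in zip(*probas)]
    return max(range(len(totals)), key=totals.__getitem__)


print("hard:", hard_vote(probas))
print("soft:", soft_vote(probas))
```

Here two of the three classifiers lean towards class 1, so hard voting picks class 1, while the one very confident classifier pulls the summed probabilities towards class 0 under soft voting.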

In [1]:
import numpy as np
from sklearn.tree import DecisionTreeClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.naive_bayes import GaussianNB
from sklearn.ensemble import VotingClassifier
import warnings
warnings.filterwarnings('ignore')

Load preprocessed data

In [2]:
X = np.load("./data/titanic/tatanic_X_train.npy")
y = np.load("./data/titanic/tatanic_y_train.npy")

print("X shape : {}".format(X.shape))
print("y shape : {}".format(y.shape))

X shape : (889, 27)
y shape : (889,)


In [4]:
import pandas as pd

In [None]:
df = pd.DataFrame(X)  # wrap the feature matrix in a DataFrame for inspection

Define classifiers

In [10]:
clf1 = LogisticRegression(random_state=1)
clf2 = DecisionTreeClassifier(random_state=1)
clf3 = GaussianNB()
# Name each estimator after its model: clf2 is a decision tree, so use 'dt'
eclf = VotingClassifier(estimators=[('lr', clf1), ('dt', clf2), ('gnb', clf3)], voting='hard')
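
The names passed in `estimators` matter later on, because they become prefixes for the nested parameters of each member. A minimal sketch, assuming a recent scikit-learn version and using a separate throwaway ensemble called `demo` (a name chosen only for this example):

```python
from sklearn.ensemble import VotingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier

# A throwaway ensemble, independent of the notebook's `eclf`.
demo = VotingClassifier(
    estimators=[("lr", LogisticRegression()), ("dt", DecisionTreeClassifier())],
    voting="hard",
)

# Nested parameters are exposed as "<estimator name>__<parameter>".
nested = sorted(key for key in demo.get_params() if "__" in key)
print("lr__C" in nested, "dt__max_depth" in nested)

# set_params accepts the same keys, which is what a grid search relies on.
demo.set_params(lr__C=5.0, dt__max_depth=10)
print(demo.get_params()["lr__C"], demo.get_params()["dt__max_depth"])
```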

Voting Classifier Score

In [11]:
from sklearn.model_selection import cross_val_score
cross_val_score(eclf, X, y, cv=5).mean()

0.8020504030978227

Logistic Regression Score

In [12]:
cross_val_score(clf1, X, y, cv=5).mean()

0.8290420872214816

Decision Tree Score: lower than both logistic regression and the voting ensemble

In [13]:
cross_val_score(clf2, X, y, cv=5).mean()

0.7840411350219006

Gaussian NB Score: very poor (about 46% accuracy), so it is better to leave Gaussian NB out of the ensemble

In [27]:
cross_val_score(clf3, X, y, cv=5).mean()

0.4600139655938551

Without GaussianNB

In [16]:
clf1 = LogisticRegression(random_state=1)
clf2 = DecisionTreeClassifier(random_state=1)
eclf = VotingClassifier(estimators=[('lr', clf1), ('dt', clf2)], voting='hard')

In [17]:
cross_val_score(eclf, X, y, cv=5).mean()

0.8222687742017394
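
One thing to keep in mind with only two voters under hard voting: every disagreement is a tie. The scikit-learn documentation states that ties are resolved by the ascending sort order of the class labels, so for labels 0 and 1 a disagreement resolves to 0. The sketch below mimics that rule in plain Python; `hard_vote_two` is a hypothetical helper written for this illustration, not a library function.

```python
def hard_vote_two(pred_a, pred_b):
    """Majority vote of two binary predictions; ties go to the smaller label."""
    votes = [pred_a, pred_b]
    counts = {label: votes.count(label) for label in (0, 1)}
    # max() returns the first label with the highest count, and the labels
    # are iterated in ascending order, so a 1-1 tie resolves to label 0.
    return max((0, 1), key=counts.get)


for a in (0, 1):
    for b in (0, 1):
        print(a, b, "->", hard_vote_two(a, b))
```

Under this rule the two-member ensemble predicts 1 only when both classifiers predict 1, which is worth remembering when comparing its score with the individual scores above.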

Apply GridSearch

In [28]:
c_params = [0.1, 5.0, 7.0, 10.0, 15.0, 20.0, 100.0]

params = {
    "lr__solver": ["liblinear"],
    "lr__penalty": ["l2"],
    "lr__C": c_params,
    "dt__criterion": ["gini", "entropy"],
    "dt__max_depth": [20, 18, 16, 14, 12, 10],
    "dt__min_samples_leaf": [1, 2, 3, 4, 5, 6, 7, 8, 9],
}
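
Before running the search it helps to know how large this grid is. The following back-of-the-envelope count uses only the standard library; the dictionary `grid_sizes` simply restates the number of values per key in the grid above, and `cv_folds = 5` is assumed to match the `cv=5` used in this notebook.

```python
from math import prod

# Number of candidate values for each key in the parameter grid.
grid_sizes = {
    "lr__solver": 1,
    "lr__penalty": 1,
    "lr__C": 7,
    "dt__criterion": 2,
    "dt__max_depth": 6,
    "dt__min_samples_leaf": 9,
}

cv_folds = 5  # assumed to match cv=5 in the grid search

# A grid search tries every combination, so the sizes multiply.
n_candidates = prod(grid_sizes.values())
# Each candidate is fitted once per fold (not counting any final refit).
n_fits = n_candidates * cv_folds

print(n_candidates, "candidates,", n_fits, "fits")
```

That comes to 756 parameter combinations and 3780 model fits, so the search can take noticeably longer than the single cross-validation runs earlier in the notebook.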

In [29]:
from sklearn.model_selection import GridSearchCV
grid = GridSearchCV(estimator=eclf, param_grid=params, cv=5)
grid = grid.fit(X, y)

In [30]:
grid.best_score_

0.843644544431946

In [31]:
grid.best_params_

{'dt__criterion': 'gini',
 'dt__max_depth': 20,
 'dt__min_samples_leaf': 5,
 'lr__C': 5.0,
 'lr__penalty': 'l2',
 'lr__solver': 'liblinear'}