# Top 10 Algorithms for Binary Classification [Beginner's Guide]

#### How to implement the 10 most important binary classification algorithms in a few lines of Python, and how well they perform


1. Naive Bayes
2. Logistic Regression
3. K-Nearest Neighbours
4. Support Vector Machine
5. Decision Tree 
6. Bagging Decision Tree (Ensemble Learning I)
7. Boosted Decision Tree (Ensemble Learning II)
8. Random Forest (Ensemble Learning III)
9. Voting Classification (Ensemble Learning IV)
10. Deep Learning with a neural network

In [1]:
import numpy as np
import keras.datasets as keras_data

Using TensorFlow backend.


# Data import

The Wisconsin breast cancer dataset is a classic, small binary classification dataset (569 samples, 30 numeric features) on which even simple models score well.
It is included in scikit-learn.

Data Source: https://scikit-learn.org/stable/modules/generated/sklearn.datasets.load_breast_cancer.html#sklearn.datasets.load_breast_cancer

#### a) Load data from scikit-learn

In [4]:
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split


x, y = load_breast_cancer(return_X_y=True)
x_train, x_test, y_train, y_test = train_test_split(x, y, random_state=0)

#### b) Check out the dataset

In [5]:
# check out a sample
print(x_train.shape)
print(x_test.shape)
print(y_train.shape)
print(y_test.shape)

print(x_train[5])

(426, 30)
(143, 30)
(426,)
(143,)
[1.206e+01 1.890e+01 7.666e+01 4.453e+02 8.386e-02 5.794e-02 7.510e-03
 8.488e-03 1.555e-01 6.048e-02 2.430e-01 1.152e+00 1.559e+00 1.802e+01
 7.180e-03 1.096e-02 5.832e-03 5.495e-03 1.982e-02 2.754e-03 1.364e+01
 2.706e+01 8.654e+01 5.626e+02 1.289e-01 1.352e-01 4.506e-02 5.093e-02
 2.880e-01 8.083e-02]
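To put the raw numbers above in context, it can help to look at the feature names and the class balance. The sketch below assumes the `Bunch` object that `load_breast_cancer()` returns when called without `return_X_y=True`; the variable names `data` and `class_counts` are only illustrative.

```python
import numpy as np
from sklearn.datasets import load_breast_cancer

# Load the dataset as a Bunch to get names alongside the arrays
data = load_breast_cancer()

print(data.data.shape)          # samples x features
print(data.feature_names[:5])   # first few feature names
print(data.target_names)        # name of label 0 and label 1, in that order

# Number of samples per class (index = label)
class_counts = np.bincount(data.target)
print(class_counts)
```

The class counts matter when reading accuracy: a model that always predicts the larger class already gets that class's share of the samples right.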


In [6]:
# the features are already a numeric matrix, so no vectorization is needed

# convert train and test labels into float numpy vector
y_train=np.asarray(y_train).astype('float32')
y_test=np.asarray(y_test).astype('float32')


print(x_train.shape)
print(x_test.shape)

(426, 30)
(143, 30)


# 1. Naive Bayes
Sklearn Documentation: 
* Naive Bayes: https://scikit-learn.org/stable/modules/naive_bayes.html
* MultinomialNB: https://scikit-learn.org/stable/modules/generated/sklearn.naive_bayes.MultinomialNB.html#sklearn.naive_bayes.MultinomialNB

In [8]:
%%time

from sklearn.naive_bayes import MultinomialNB

mnb = MultinomialNB().fit(x_train, y_train)

print("train shape: " + str(x_train.shape))
print("score on test: " + str(mnb.score(x_test, y_test)))
print("score on train: "+ str(mnb.score(x_train, y_train)))

train shape: (426, 30)
score on test: 0.9020979020979021
score on train: 0.8943661971830986
CPU times: user 2.97 ms, sys: 1.32 ms, total: 4.29 ms
Wall time: 3.32 ms
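`MultinomialNB` is designed for count-like features, while the columns here are continuous measurements. As an alternative (a swapped-in technique, not what the notebook ran), a Gaussian naive Bayes model can be tried the same way. This is a minimal sketch; the names `gnb` and `gnb_test_score` are illustrative and no scores are claimed here.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB

x, y = load_breast_cancer(return_X_y=True)
x_train, x_test, y_train, y_test = train_test_split(x, y, random_state=0)

# GaussianNB models each feature, per class, as a normal distribution
gnb = GaussianNB().fit(x_train, y_train)
gnb_test_score = gnb.score(x_test, y_test)

print("train shape: " + str(x_train.shape))
print("score on test: " + str(gnb_test_score))
print("score on train: " + str(gnb.score(x_train, y_train)))
```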


# 2. Logistic Regression

Sklearn Documentation: 
* LogisticRegression: https://scikit-learn.org/stable/modules/generated/sklearn.linear_model.LogisticRegression.html
* SGD Classifier: https://scikit-learn.org/stable/modules/generated/sklearn.linear_model.SGDClassifier.html#sklearn.linear_model.SGDClassifier

In [10]:
%%time

from sklearn.linear_model import LogisticRegression

lr=LogisticRegression(max_iter=5000)
lr.fit(x_train, y_train)

print("train shape: " + str(x_train.shape))
print("score on test: " + str(lr.score(x_test, y_test)))
print("score on train: "+ str(lr.score(x_train, y_train)))

train shape: (426, 30)
score on test: 0.951048951048951
score on train: 0.960093896713615
CPU times: user 1.38 s, sys: 38.3 ms, total: 1.42 s
Wall time: 352 ms


In [11]:
%%time
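# linear classifier trained with stochastic gradient descent (SGD)
# note: SGDClassifier defaults to loss='hinge' (a linear SVM); a logistic loss has to be requested explicitly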
from sklearn.linear_model import SGDClassifier

sgd=SGDClassifier()
sgd.fit(x_train, y_train)

print("train shape: " + str(x_train.shape))
print("score on test: " + str(sgd.score(x_test, y_test)))
print("score on train: "+ str(sgd.score(x_train, y_train)))

train shape: (426, 30)
score on test: 0.7412587412587412
score on train: 0.7582159624413145
CPU times: user 4.1 ms, sys: 1.24 ms, total: 5.34 ms
Wall time: 3.98 ms
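SGD-based training is sensitive to feature scale, and the raw features here span several orders of magnitude (see the sample row printed earlier). A common remedy is to standardize the features inside a pipeline. The sketch below is one possible setup, not the notebook's original cell; `random_state=0` and the name `sgd_scaled` are assumptions added for reproducibility. The default hinge loss is kept; depending on the scikit-learn version, a logistic loss is selected with `loss='log_loss'` (older releases use `loss='log'`).

```python
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import SGDClassifier
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

x, y = load_breast_cancer(return_X_y=True)
x_train, x_test, y_train, y_test = train_test_split(x, y, random_state=0)

# Standardize features (zero mean, unit variance), then fit the SGD classifier
sgd_scaled = make_pipeline(StandardScaler(), SGDClassifier(random_state=0))
sgd_scaled.fit(x_train, y_train)
sgd_scaled_test_score = sgd_scaled.score(x_test, y_test)

print("score on test: " + str(sgd_scaled_test_score))
print("score on train: " + str(sgd_scaled.score(x_train, y_train)))
```

Because the scaler is fitted inside the pipeline, its statistics come from the training data only and are reused unchanged on the test data.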


# 3. K-Nearest Neighbours

Sklearn Documentation:
* https://scikit-learn.org/stable/modules/generated/sklearn.neighbors.KNeighborsClassifier.html

In [12]:
%%time

from sklearn.neighbors import KNeighborsClassifier

#knn = KNeighborsClassifier(n_neighbors=5,algorithm = 'ball_tree')
knn = KNeighborsClassifier(algorithm = 'brute', n_jobs=-1)

knn.fit(x_train, y_train)

print("train shape: " + str(x_train.shape))
print("score on test: " + str(knn.score(x_test, y_test)))
print("score on train: "+ str(knn.score(x_train, y_train)))

train shape: (426, 30)
score on test: 0.9370629370629371
score on train: 0.9413145539906104
CPU times: user 1.51 s, sys: 264 ms, total: 1.77 s
Wall time: 298 ms


# 4. Support Vector Machine

Sklearn Documentation:
* SVM Overview: https://scikit-learn.org/stable/modules/svm.html
* LinearSVC: https://scikit-learn.org/stable/modules/generated/sklearn.svm.LinearSVC.html#sklearn.svm.LinearSVC


In [13]:
%%time

from sklearn.svm import LinearSVC

svm=LinearSVC(C=0.0001)
svm.fit(x_train, y_train)

print("train shape: " + str(x_train.shape))
print("score on test: " + str(svm.score(x_test, y_test)))
print("score on train: "+ str(svm.score(x_train, y_train)))

train shape: (426, 30)
score on test: 0.9370629370629371
score on train: 0.9225352112676056
CPU times: user 28.8 ms, sys: 2.53 ms, total: 31.4 ms
Wall time: 30.9 ms




# 5. Decision Tree

Sklearn Documentation:

* https://scikit-learn.org/stable/modules/generated/sklearn.tree.DecisionTreeClassifier.html

In [27]:
%%time

from sklearn.tree import DecisionTreeClassifier

clf = DecisionTreeClassifier(min_samples_split=10,max_depth=3)
clf.fit(x_train, y_train)

print("train shape: " + str(x_train.shape))
print("score on test: "  + str(clf.score(x_test, y_test)))
print("score on train: " + str(clf.score(x_train, y_train)))

train shape: (426, 30)
score on test: 0.9300699300699301
score on train: 0.971830985915493
CPU times: user 7.11 ms, sys: 1.92 ms, total: 9.03 ms
Wall time: 7.43 ms


# 6. Bagging Decision Tree

Sklearn Documentation:

* overview ensemble methods: https://scikit-learn.org/stable/modules/ensemble.html
* bagging classifier: https://scikit-learn.org/stable/modules/generated/sklearn.ensemble.BaggingClassifier.html


In [28]:
%%time

from sklearn.ensemble import BaggingClassifier
from sklearn.tree import DecisionTreeClassifier

bg=BaggingClassifier(DecisionTreeClassifier(min_samples_split=10,max_depth=3),max_samples=0.5,max_features=1.0,n_estimators=10)
bg.fit(x_train, y_train)

print("train shape: " + str(x_train.shape))
print("score on test: " + str(bg.score(x_test, y_test)))
print("score on train: "+ str(bg.score(x_train, y_train)))

train shape: (426, 30)
score on test: 0.9440559440559441
score on train: 0.9671361502347418
CPU times: user 30.5 ms, sys: 2.19 ms, total: 32.7 ms
Wall time: 31.4 ms


# 7. Boosting Decision Tree

Sklearn Documentation:

* AdaBoost Classifier: https://scikit-learn.org/stable/modules/generated/sklearn.ensemble.AdaBoostClassifier.html#sklearn.ensemble.AdaBoostClassifier
* Gradient Boosting Classifier: https://scikit-learn.org/stable/modules/generated/sklearn.ensemble.GradientBoostingClassifier.html#sklearn.ensemble.GradientBoostingClassifier


In [31]:
%%time

from sklearn.ensemble import AdaBoostClassifier
from sklearn.tree import DecisionTreeClassifier

# base learner: shallow decision trees (max_depth=2), boosted over 100 rounds

adb = AdaBoostClassifier(DecisionTreeClassifier(max_depth=2),n_estimators=100,learning_rate=0.5)
adb.fit(x_train, y_train)

print("train shape: " + str(x_train.shape))
print("score on test: " + str(adb.score(x_test, y_test)))
print("score on train: "+ str(adb.score(x_train, y_train)))

train shape: (426, 30)
score on test: 0.958041958041958
score on train: 1.0
CPU times: user 356 ms, sys: 3.42 ms, total: 360 ms
Wall time: 358 ms


In [24]:
%%time

from sklearn.ensemble import GradientBoostingClassifier
from sklearn.tree import DecisionTreeClassifier

# default settings apart from n_estimators (the default tree depth is max_depth=3)

gbc = GradientBoostingClassifier(n_estimators=100)
gbc.fit(x_train, y_train)

print("train shape: " + str(x_train.shape))
print("score on test: " + str(gbc.score(x_test, y_test)))
print("score on train: "+ str(gbc.score(x_train, y_train)))

train shape: (426, 30)
score on test: 0.972027972027972
score on train: 1.0
CPU times: user 398 ms, sys: 4 ms, total: 402 ms
Wall time: 402 ms


# 8. Random Forest

Sklearn Documentation:
* https://scikit-learn.org/stable/modules/generated/sklearn.ensemble.RandomForestClassifier.html#sklearn.ensemble.RandomForestClassifier

In [32]:
%%time

from sklearn.ensemble import RandomForestClassifier

# n_estimators = number of decision trees
rf = RandomForestClassifier(n_estimators=300,max_depth=3)
rf.fit(x_train, y_train)

print("score on test: " + str(rf.score(x_test, y_test)))
print("score on train: "+ str(rf.score(x_train, y_train)))

score on test: 0.958041958041958
score on train: 0.9765258215962441
CPU times: user 384 ms, sys: 3.79 ms, total: 387 ms
Wall time: 386 ms
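A fitted random forest also exposes `feature_importances_`, which gives a rough view of which measurements the trees rely on most. A minimal sketch, with `random_state=0` and the variable names `importances` and `top` added here as assumptions:

```python
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

data = load_breast_cancer()
x_train, x_test, y_train, y_test = train_test_split(
    data.data, data.target, random_state=0
)

rf = RandomForestClassifier(n_estimators=300, max_depth=3, random_state=0)
rf.fit(x_train, y_train)

# Impurity-based importances: one value per feature, summing to 1
importances = rf.feature_importances_

# Indices of the five largest importances, highest first
top = np.argsort(importances)[::-1][:5]
for i in top:
    print("%-25s %.3f" % (data.feature_names[i], importances[i]))
```

Impurity-based importances are computed on the training data and can favor some features over others, so they are best read as a hint rather than a ranking to rely on.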


# 9. Voting Classifier

Sklearn Documentation:
* https://scikit-learn.org/stable/modules/generated/sklearn.ensemble.VotingClassifier.html#sklearn.ensemble.VotingClassifier

In [36]:
%%time

from sklearn.ensemble import VotingClassifier

# 1) naive Bayes = mnb
mnb = MultinomialNB().fit(x_train, y_train)
# 2) logistic regression = lr
lr=LogisticRegression(max_iter=5000)
# 3) random forest = rf
rf = RandomForestClassifier(n_estimators=30,max_depth=3)
# 4) support vector machine = svm
svm=LinearSVC(max_iter=5000)

evc=VotingClassifier(estimators=[('mnb',mnb),('lr',lr),('rf',rf),('svm',svm)])
evc.fit(x_train, y_train)

print("score on test: " + str(evc.score(x_test, y_test)))
print("score on train: "+ str(evc.score(x_train, y_train)))

score on test: 0.965034965034965
score on train: 0.9694835680751174
CPU times: user 1.89 s, sys: 43.1 ms, total: 1.93 s
Wall time: 483 ms




In [37]:
%%time
from sklearn.model_selection import cross_val_score

for clf, label in zip([mnb, lr, rf, svm, evc], ['Naive Bayes', 'Logistic Regression', 'Random Forest', 'Support Vector Machine','Ensemble']):
    scores = cross_val_score(clf, x_train, y_train, scoring='accuracy', cv=5)
    print("Accuracy: %0.2f (+/- %0.2f) [%s]" % (scores.mean(), scores.std(), label))

Accuracy: 0.89 (+/- 0.02) [Naive Bayes]
Accuracy: 0.96 (+/- 0.03) [Logistic Regression]
Accuracy: 0.95 (+/- 0.01) [Random Forest]
Accuracy: 0.82 (+/- 0.13) [Support Vector Machine]
Accuracy: 0.96 (+/- 0.03) [Ensemble]
CPU times: user 16.2 s, sys: 237 ms, total: 16.4 s
Wall time: 4.35 s
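The linear SVM has the lowest and most variable cross-validated accuracy above, which may be due to the unscaled features. One way to check this, sketched below under that assumption, is to cross-validate a pipeline so that the scaler is refitted on each training fold. `svm_scaled` is an illustrative name, and no result is claimed here.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import cross_val_score, train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import LinearSVC

x, y = load_breast_cancer(return_X_y=True)
x_train, x_test, y_train, y_test = train_test_split(x, y, random_state=0)

# The scaler is part of the estimator, so every CV fold is scaled
# with statistics from its own training portion only
svm_scaled = make_pipeline(StandardScaler(), LinearSVC(max_iter=5000))
scores = cross_val_score(svm_scaled, x_train, y_train, scoring='accuracy', cv=5)

print("Accuracy: %0.2f (+/- %0.2f) [%s]" % (scores.mean(), scores.std(), 'Scaled SVM'))
```

Scaling the whole training set once before cross-validation would leak information from the held-out folds, which is why the scaler sits inside the pipeline.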




# 10. Deep Learning 

Keras Documentation:
* Sequential Model: https://keras.io/guides/sequential_model/

In [39]:
%%time

from keras import layers
from keras import models
from keras import optimizers
from keras import losses
from keras import metrics
# split an additional validation dataset
x_validation=x_train[:100]
x_partial_train=x_train[100:]
y_validation=y_train[:100]
y_partial_train=y_train[100:]
model=models.Sequential()
model.add(layers.Dense(16,activation='relu',input_shape=(30,)))
model.add(layers.Dense(16,activation='relu'))
model.add(layers.Dense(1,activation='sigmoid'))
model.compile(optimizer='rmsprop',loss='binary_crossentropy',metrics=['accuracy'])
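# note: batch_size=512 is larger than the 326 training samples, so each of the 4 epochs is a single gradient update
model.fit(x_partial_train,y_partial_train,epochs=4,batch_size=512,validation_data=(x_validation,y_validation))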


print('')
print("train shape: " + str(x_train.shape))
print("score on test: " + str(model.evaluate(x_test,y_test)[1]))
print("score on train: "+ str(model.evaluate(x_train,y_train)[1]))


Train on 326 samples, validate on 100 samples
Epoch 1/4
Epoch 2/4
Epoch 3/4
Epoch 4/4

train shape: (426, 30)
score on test: 0.37062937021255493
score on train: 0.3732394278049469
CPU times: user 519 ms, sys: 21 ms, total: 540 ms
Wall time: 517 ms


In [43]:
%%time

from keras import layers
from keras import models
from keras import optimizers
from keras import losses
from keras import regularizers
from keras import metrics

# add validation dataset
validation_split=100
x_validation=x_train[:validation_split]
x_partial_train=x_train[validation_split:]
y_validation=y_train[:validation_split]
y_partial_train=y_train[validation_split:]

model=models.Sequential()
model.add(layers.Dense(4,kernel_regularizer=regularizers.l2(0.003),activation='relu',input_shape=(30,)))
model.add(layers.Dropout(0.7))
model.add(layers.Dense(4,kernel_regularizer=regularizers.l2(0.003),activation='relu'))
model.add(layers.Dropout(0.7))
model.add(layers.Dense(1,activation='sigmoid'))
model.compile(optimizer='rmsprop',loss='binary_crossentropy',metrics=['accuracy'])

model.fit(x_partial_train,y_partial_train,epochs=4,batch_size=512,validation_data=(x_validation,y_validation))

print('')
print("train shape: " + str(x_train.shape))
print("score on test: " + str(model.evaluate(x_test,y_test)[1]))
print("score on train: "+ str(model.evaluate(x_train,y_train)[1]))


Train on 326 samples, validate on 100 samples
Epoch 1/4
Epoch 2/4
Epoch 3/4
Epoch 4/4

train shape: (426, 30)
score on test: 0.6293706297874451
score on train: 0.6267605423927307
CPU times: user 697 ms, sys: 23.9 ms, total: 721 ms
Wall time: 693 ms
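The two test scores above, about 0.371 and 0.629, add up to 1, which is what one would see if the network predicted a single class for every sample; this is an inference from the numbers, not something the notebook states. A majority-class baseline makes that easy to check, and standardizing the inputs is a common first step before training a neural network. The sketch below uses only scikit-learn; the names `baseline`, `scaler`, and the `*_s` arrays are illustrative.

```python
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.dummy import DummyClassifier
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

x, y = load_breast_cancer(return_X_y=True)
x_train, x_test, y_train, y_test = train_test_split(x, y, random_state=0)

# Baseline: always predict the most frequent training label
baseline = DummyClassifier(strategy="most_frequent").fit(x_train, y_train)
baseline_test_acc = baseline.score(x_test, y_test)
print("baseline score on test: " + str(baseline_test_acc))

# Same validation split as in the cells above
validation_split = 100
x_validation = x_train[:validation_split]
x_partial_train = x_train[validation_split:]

# Fit the scaler on the training part only, then reuse it everywhere else
scaler = StandardScaler().fit(x_partial_train)
x_partial_train_s = scaler.transform(x_partial_train)
x_validation_s = scaler.transform(x_validation)
x_test_s = scaler.transform(x_test)

print(x_partial_train_s.shape)
```

The scaled arrays can then be passed to `model.fit` and `model.evaluate` in place of the raw ones; any accuracy above the baseline indicates the network has learned more than the class frequencies.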
