# Top 10 Algorithms for Binary Classification [Beginner's Guide]

#### How to implement the 10 most important binary classification algorithms with a few lines of Python and how they perform


1. Naive Bayes
2. Logistic Regression
3. K-Nearest Neighbours
4. Support Vector Machine
5. Decision Tree
6. Bagging Decision Tree (Ensemble Learning I)
7. Boosted Decision Tree (Ensemble Learning II)
8. Random Forest (Ensemble Learning III)
9. Voting Classification (Ensemble Learning IV)
10. Deep Learning with a neural network

In [1]:
import numpy as np
import keras.datasets as keras_data

Using TensorFlow backend.


# Data import

IMDB movie reviews, labeled as positive or negative, for binary sentiment analysis with Natural Language Processing.

Data Source: https://keras.io/api/datasets/imdb/

#### a) Load data from Keras

In [2]:
# load the dataset from keras.datasets and directly split the tuples into separate variables

(imdb_train_data,imdb_train_labels),(imdb_test_data,imdb_test_labels)=keras_data.imdb.load_data(num_words=10000)

#### b) Check out the dataset

In [3]:
# check out the shapes and a sample
print(imdb_train_data.shape)
print(imdb_test_data.shape)
print(imdb_train_labels.shape)
print(imdb_test_labels.shape)

print(imdb_train_data[5])

(25000,)
(25000,)
(25000,)
(25000,)
[1, 778, 128, 74, 12, 630, 163, 15, 4, 1766, 7982, 1051, 2, 32, 85, 156, 45, 40, 148, 139, 121, 664, 665, 10, 10, 1361, 173, 4, 749, 2, 16, 3804, 8, 4, 226, 65, 12, 43, 127, 24, 2, 10, 10]


In [3]:
# map tokenized vector back to text

word_index = keras_data.imdb.get_word_index()

reverse_word_index = dict([(value,key) for (key,value) in word_index.items()])

# indices are offset by 3 (0, 1 and 2 are reserved for padding, start of sequence and unknown word)
decoded_word_index = ' '.join([reverse_word_index.get(i-3,'?') for i in imdb_train_data[5]])

In [4]:
# show decoded sample
print(decoded_word_index)
print(' ')

# show word index list
for key in sorted(reverse_word_index.keys()):
    if(key<=10):
        print("%s: %s" % (key, reverse_word_index[key]))

? begins better than it ends funny that the russian submarine crew ? all other actors it's like those scenes where documentary shots br br spoiler part the message ? was contrary to the whole story it just does not ? br br
 
1: the
2: and
3: a
4: of
5: to
6: is
7: br
8: in
9: it
10: i
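
The `i-3` offset in the decoding step can be illustrated on a toy vocabulary. This is a minimal sketch: `toy_word_index` and the encoded review are hypothetical and only mimic the structure of the real word index, assuming the Keras defaults where token 1 marks the start of a sequence and token 2 an out-of-vocabulary word.

```python
# Hypothetical miniature word index; the real one comes from
# keras_data.imdb.get_word_index()
toy_word_index = {"the": 1, "and": 2, "a": 3, "film": 4}
reverse_toy_index = {value: key for key, value in toy_word_index.items()}

# Hypothetical encoded review: 1 = start marker, 2 = unknown word,
# every other token is the word's index plus 3
encoded = [1, 4, 7, 2, 6]

# Tokens that do not map to a word after subtracting 3 become '?'
decoded = " ".join(reverse_toy_index.get(i - 3, "?") for i in encoded)
print(decoded)  # ? the film ? a
```

The two `?` characters correspond to the start marker and the unknown word, which is why the decoded IMDB sample also begins with `?`.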


#### c) Vectorize the dataset

In [5]:
def vectorize_sequence(sequences,dimensions):
    results=np.zeros((len(sequences),dimensions))
    for i, sequence in enumerate(sequences):
        results[i,sequence] = 1.
    return results
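
What `vectorize_sequence` produces is easiest to see on a tiny input. The sketch below repeats the function so that it runs on its own; `toy_sequences` is a made-up example and not part of the IMDB data.

```python
import numpy as np

def vectorize_sequence(sequences, dimensions):
    # One row per sequence, one column per possible token index
    results = np.zeros((len(sequences), dimensions))
    for i, sequence in enumerate(sequences):
        # Set the column of every token that occurs in the sequence to 1
        results[i, sequence] = 1.
    return results

# Two hypothetical "reviews" over a vocabulary of five tokens
toy_sequences = [[0, 2, 2], [1, 4]]
encoded = vectorize_sequence(toy_sequences, 5)
print(encoded)
# [[1. 0. 1. 0. 0.]
#  [0. 1. 0. 0. 1.]]
```

Each row is a multi-hot vector: a repeated token (the `2` in the first sequence) still yields a single 1, so word counts and word order are discarded.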

In [6]:
# vectorize train and test data
x_train=vectorize_sequence(imdb_train_data,10000)
x_test=vectorize_sequence(imdb_test_data,10000)

# convert train and test labels into float numpy vector
y_train=np.asarray(imdb_train_labels).astype('float32')
y_test=np.asarray(imdb_test_labels).astype('float32')


print(x_train.shape)
print(x_test.shape)

(25000, 10000)
(25000, 10000)
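
A dense float64 array of shape (25000, 10000) takes about 2 GB of memory, and most of its entries are zero. As a possible alternative, the same encoding can be stored as a SciPy sparse matrix, which scikit-learn estimators generally accept. The function name `vectorize_sequence_sparse` is an assumption for illustration and is not used elsewhere in this notebook.

```python
import numpy as np
from scipy.sparse import csr_matrix

def vectorize_sequence_sparse(sequences, dimensions):
    rows, cols = [], []
    for i, sequence in enumerate(sequences):
        # Deduplicate tokens, because csr_matrix sums duplicate entries
        unique_tokens = np.unique(sequence)
        rows.extend([i] * len(unique_tokens))
        cols.extend(unique_tokens.tolist())
    data = np.ones(len(rows), dtype=np.float32)
    return csr_matrix((data, (rows, cols)), shape=(len(sequences), dimensions))

# Hypothetical toy input over a vocabulary of five tokens
toy = [[1, 3, 3], [0, 2]]
x_sparse = vectorize_sequence_sparse(toy, 5)
print(x_sparse.toarray())
# [[0. 1. 0. 1. 0.]
#  [1. 0. 1. 0. 0.]]
```

Only the non-zero entries are stored, so memory use grows with the number of distinct tokens per review rather than with the vocabulary size.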


# 1. Naive Bayes
Sklearn Documentation: 
* Naive Bayes: https://scikit-learn.org/stable/modules/naive_bayes.html
* MultinomialNB: https://scikit-learn.org/stable/modules/generated/sklearn.naive_bayes.MultinomialNB.html#sklearn.naive_bayes.MultinomialNB

In [12]:
%%time

from sklearn.naive_bayes import MultinomialNB

mnb = MultinomialNB().fit(x_train, y_train)

print("train shape: " + str(x_train.shape))
print("score on test: " + str(mnb.score(x_test, y_test)))
print("score on train: "+ str(mnb.score(x_train, y_train)))

train shape: (25000, 10000)
score on test: 0.83936
score on train: 0.86884
CPU times: user 2.99 s, sys: 14.7 ms, total: 3.01 s
Wall time: 989 ms


# 2. Logistic Regression

Sklearn Documentation: 
* LogisticRegression: https://scikit-learn.org/stable/modules/generated/sklearn.linear_model.LogisticRegression.html
* SGD Classifier: https://scikit-learn.org/stable/modules/generated/sklearn.linear_model.SGDClassifier.html#sklearn.linear_model.SGDClassifier

In [14]:
%%time

from sklearn.linear_model import LogisticRegression

lr=LogisticRegression(max_iter=1000)
lr.fit(x_train, y_train)

print("train shape: " + str(x_train.shape))
print("score on test: " + str(lr.score(x_test, y_test)))
print("score on train: "+ str(lr.score(x_train, y_train)))

train shape: (25000, 10000)
score on test: 0.86204
score on train: 0.98984
CPU times: user 5min 8s, sys: 19.6 s, total: 5min 28s
Wall time: 59.8 s


In [13]:
%%time
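# linear classifier trained with stochastic gradient descent
# note: SGDClassifier() defaults to loss='hinge' (a linear SVM); logistic regression needs a logistic loss (loss='log_loss' in recent scikit-learn versions)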
from sklearn.linear_model import SGDClassifier

sgd=SGDClassifier()
sgd.fit(x_train, y_train)

print("train shape: " + str(x_train.shape))
print("score on test: " + str(sgd.score(x_test, y_test)))
print("score on train: "+ str(sgd.score(x_train, y_train)))

train shape: (25000, 10000)
score on test: 0.84936
score on train: 0.98904
CPU times: user 22.8 s, sys: 86 ms, total: 22.9 s
Wall time: 21.4 s


# 3. K-Nearest Neighbours

Sklearn Documentation:
* https://scikit-learn.org/stable/modules/generated/sklearn.neighbors.KNeighborsClassifier.html

In [7]:
%%time

from sklearn.neighbors import KNeighborsClassifier

#knn = KNeighborsClassifier(n_neighbors=5,algorithm = 'ball_tree')
knn = KNeighborsClassifier(algorithm = 'brute', n_jobs=-1)

knn.fit(x_train, y_train)

print("train shape: " + str(x_train.shape))
print("score on test: " + str(knn.score(x_test, y_test)))
print("score on train: "+ str(knn.score(x_train, y_train)))

train shape: (25000, 10000)
score on test: 0.62468
score on train: 0.7854
CPU times: user 1h 21min 56s, sys: 33.7 s, total: 1h 22min 30s
Wall time: 11min 58s


# 4. Support Vector Machine

Sklearn Documentation:
* SVM Overview: https://scikit-learn.org/stable/modules/svm.html
* LinearSVC: https://scikit-learn.org/stable/modules/generated/sklearn.svm.LinearSVC.html#sklearn.svm.LinearSVC


In [17]:
%%time

from sklearn.svm import LinearSVC

svm=LinearSVC(C=0.0001)
svm.fit(x_train, y_train)

print("train shape: " + str(x_train.shape))
print("score on test: " + str(svm.score(x_test, y_test)))
print("score on train: "+ str(svm.score(x_train, y_train)))

train shape: (25000, 10000)
score on test: 0.8514
score on train: 0.86272
CPU times: user 2.7 s, sys: 28.9 ms, total: 2.73 s
Wall time: 1.82 s


# 5. Decision Tree

Sklearn Documentation:

* https://scikit-learn.org/stable/modules/generated/sklearn.tree.DecisionTreeClassifier.html

In [45]:
%%time

from sklearn.tree import DecisionTreeClassifier

clf = DecisionTreeClassifier()
clf.fit(x_train, y_train)

print("train shape: " + str(x_train.shape))
print("score on test: "  + str(clf.score(x_test, y_test)))
print("score on train: " + str(clf.score(x_train, y_train)))

train shape: (25000, 10000)
score on test: 0.70496
score on train: 1.0
CPU times: user 3min 54s, sys: 2.48 s, total: 3min 56s
Wall time: 3min 58s
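
A training score of 1.0 next to a test score of about 0.70 suggests that the unrestricted tree memorizes the training data. One common way to limit this is to cap the tree depth. The sketch below compares an unrestricted and a shallow tree on synthetic data; the dataset and the value `max_depth=3` are assumptions chosen for illustration, not tuned settings for the IMDB task.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic data with 10% label noise as a stand-in for a noisy real dataset
X_toy, y_toy = make_classification(n_samples=1000, n_features=50,
                                   n_informative=5, flip_y=0.1, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X_toy, y_toy, test_size=0.3,
                                          random_state=0)

# Unrestricted tree: grows until every training sample is classified correctly
full_tree = DecisionTreeClassifier(random_state=0).fit(X_tr, y_tr)
# Shallow tree: at most three levels of splits
shallow_tree = DecisionTreeClassifier(max_depth=3, random_state=0).fit(X_tr, y_tr)

for name, tree in [("unrestricted", full_tree), ("max_depth=3", shallow_tree)]:
    print(name, "depth:", tree.get_depth(),
          "train:", tree.score(X_tr, y_tr), "test:", tree.score(X_te, y_te))
```

Comparing the gap between train and test score for both trees gives a rough picture of how much the depth limit reduces overfitting on a given dataset.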


# 6. Bagging Decision Tree

Sklearn Documentation:

* overview ensemble methods: https://scikit-learn.org/stable/modules/ensemble.html
* bagging classifier: https://scikit-learn.org/stable/modules/generated/sklearn.ensemble.BaggingClassifier.html


In [46]:
%%time

from sklearn.ensemble import BaggingClassifier
from sklearn.tree import DecisionTreeClassifier

bg=BaggingClassifier(DecisionTreeClassifier(),max_samples=0.5,max_features=1.0,n_estimators=10)
bg.fit(x_train, y_train)

print("train shape: " + str(x_train.shape))
print("score on test: " + str(bg.score(x_test, y_test)))
print("score on train: "+ str(bg.score(x_train, y_train)))

train shape: (25000, 10000)
score on test: 0.77592
score on train: 0.93372
CPU times: user 8min 2s, sys: 43.4 s, total: 8min 45s
Wall time: 8min 50s


# 7. Boosting Decision Tree

Sklearn Documentation:

* AdaBoost Classifier: https://scikit-learn.org/stable/modules/generated/sklearn.ensemble.AdaBoostClassifier.html#sklearn.ensemble.AdaBoostClassifier
* Gradient Boosting Classifier: https://scikit-learn.org/stable/modules/generated/sklearn.ensemble.GradientBoostingClassifier.html#sklearn.ensemble.GradientBoostingClassifier


In [9]:
%%time

from sklearn.ensemble import AdaBoostClassifier
from sklearn.tree import DecisionTreeClassifier

# setting 
# min_samples_split=10
# max_depth=4

adb = AdaBoostClassifier(DecisionTreeClassifier(min_samples_split=10,max_depth=4),n_estimators=10,learning_rate=0.6)
adb.fit(x_train, y_train)

print("train shape: " + str(x_train.shape))
print("score on test: " + str(adb.score(x_test, y_test)))
print("score on train: "+ str(adb.score(x_train, y_train)))

train shape: (25000, 10000)
score on test: 0.78664
score on train: 0.80232
CPU times: user 5min 12s, sys: 14.8 s, total: 5min 27s
Wall time: 5min 17s


In [10]:
%%time

from sklearn.ensemble import GradientBoostingClassifier

# only n_estimators is set here; tree parameters such as
# min_samples_split and max_depth keep their scikit-learn defaults

gbc = GradientBoostingClassifier(n_estimators=10)
gbc.fit(x_train, y_train)

print("train shape: " + str(x_train.shape))
print("score on test: " + str(gbc.score(x_test, y_test)))
print("score on train: "+ str(gbc.score(x_train, y_train)))

train shape: (25000, 10000)
score on test: 0.70112
score on train: 0.69992
CPU times: user 3min 25s, sys: 2.54 s, total: 3min 27s
Wall time: 3min 28s


# 8. Random Forest

Sklearn Documentation:
* https://scikit-learn.org/stable/modules/generated/sklearn.ensemble.RandomForestClassifier.html#sklearn.ensemble.RandomForestClassifier

In [15]:
%%time

from sklearn.ensemble import RandomForestClassifier

# n_estimators = number of decision trees
rf = RandomForestClassifier(n_estimators=30,max_depth=9)
rf.fit(x_train, y_train)

print("score on test: " + str(rf.score(x_test, y_test)))
print("score on train: "+ str(rf.score(x_train, y_train)))

score on test: 0.80336
score on train: 0.83244
CPU times: user 14.6 s, sys: 1.06 s, total: 15.6 s
Wall time: 15.2 s


# 9. Voting Classifier

Sklearn Documentation:
* https://scikit-learn.org/stable/modules/generated/sklearn.ensemble.VotingClassifier.html#sklearn.ensemble.VotingClassifier

In [21]:
%%time

from sklearn.ensemble import VotingClassifier

# 1) naive Bayes = mnb
mnb = MultinomialNB().fit(x_train, y_train)
# 2) logistic regression = lr
lr=LogisticRegression(max_iter=1000)
# 3) random forest = rf
rf = RandomForestClassifier(n_estimators=30,max_depth=9)
# 4) support vector machine = svm
svm=LinearSVC(C=0.0001)

evc=VotingClassifier(estimators=[('mnb',mnb),('lr',lr),('rf',rf),('svm',svm)])
evc.fit(x_train, y_train)

print("score on test: " + str(evc.score(x_test, y_test)))
print("score on train: "+ str(evc.score(x_train, y_train)))

score on test: 0.86316
score on train: 0.91172
CPU times: user 5min 35s, sys: 19.2 s, total: 5min 54s
Wall time: 1min 16s


In [20]:
%%time
from sklearn.model_selection import cross_val_score

for clf, label in zip([mnb, lr, rf, svm, evc], ['Naive Bayes', 'Logistic Regression', 'Random Forest', 'Support Vector Machine','Ensemble']):
    scores = cross_val_score(clf, x_train, y_train, scoring='accuracy', cv=5)
    print("Accuracy: %0.2f (+/- %0.2f) [%s]" % (scores.mean(), scores.std(), label))

Accuracy: 0.85 (+/- 0.01) [Naive Bayes]
Accuracy: 0.87 (+/- 0.00) [Logistic Regression]
Accuracy: 0.80 (+/- 0.01) [Random Forest]
Accuracy: 0.85 (+/- 0.00) [Support Vector Machine]
Accuracy: 0.87 (+/- 0.00) [Ensemble]
CPU times: user 44min 2s, sys: 3min 35s, total: 47min 38s
Wall time: 10min 22s
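
`VotingClassifier` uses hard voting by default: every estimator predicts a label and the most frequent label wins. The idea can be sketched with plain NumPy. The prediction matrix below is hypothetical and only illustrates the majority rule for 0/1 labels.

```python
import numpy as np

# Hypothetical 0/1 predictions of three classifiers for five samples
predictions = np.array([
    [1, 0, 1, 1, 0],  # classifier A
    [1, 1, 0, 1, 0],  # classifier B
    [0, 0, 1, 1, 1],  # classifier C
])

# Count the votes for the positive class per sample (column)
votes_for_positive = predictions.sum(axis=0)

# Predict 1 only if more than half of the classifiers voted for it
majority = (votes_for_positive > predictions.shape[0] / 2).astype(int)
print(majority)  # [1 0 1 1 0]
```

With an even number of estimators, as in the four-model ensemble above, ties are possible. scikit-learn resolves them by picking the class that comes first in ascending label order, which for 0/1 labels matches the strict `>` comparison in this sketch.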


# 10. Deep Learning 

Keras Documentation:
* Sequential Model: https://keras.io/guides/sequential_model/

In [42]:
%%time

from keras import layers
from keras import models

# split off the first 1,000 training samples as a validation set
x_validation=x_train[:1000]
x_partial_train=x_train[1000:]
y_validation=y_train[:1000]
y_partial_train=y_train[1000:]

# two hidden layers with 16 units each, sigmoid output for the binary label
model=models.Sequential()
model.add(layers.Dense(16,activation='relu',input_shape=(10000,)))
model.add(layers.Dense(16,activation='relu'))
model.add(layers.Dense(1,activation='sigmoid'))
model.compile(optimizer='rmsprop',loss='binary_crossentropy',metrics=['accuracy'])

# train for 4 epochs with mini-batches of 512 samples
model.fit(x_partial_train,y_partial_train,epochs=4,batch_size=512,validation_data=(x_validation,y_validation))


print('')
print("train shape: " + str(x_train.shape))
print("score on test: " + str(model.evaluate(x_test,y_test)[1]))
print("score on train: "+ str(model.evaluate(x_train,y_train)[1]))


Train on 24000 samples, validate on 1000 samples
Epoch 1/4
Epoch 2/4
Epoch 3/4
Epoch 4/4

train shape: (25000, 10000)
score on test: 0.8771600127220154
score on train: 0.9545199871063232
CPU times: user 12 s, sys: 2.86 s, total: 14.9 s
Wall time: 7.72 s


In [44]:
%%time

from keras import layers
from keras import models
from keras import regularizers

# same network idea as above, but smaller (8 units per layer) and regularized with L2 penalties and dropout
# hold out the first 1,000 training samples as a validation set
validation_split=1000
x_validation=x_train[:validation_split]
x_partial_train=x_train[validation_split:]
y_validation=y_train[:validation_split]
y_partial_train=y_train[validation_split:]

model=models.Sequential()
model.add(layers.Dense(8,kernel_regularizer=regularizers.l2(0.003),activation='relu',input_shape=(10000,)))
model.add(layers.Dropout(0.5))
model.add(layers.Dense(8,kernel_regularizer=regularizers.l2(0.003),activation='relu'))
model.add(layers.Dropout(0.6))
model.add(layers.Dense(1,activation='sigmoid'))
model.compile(optimizer='rmsprop',loss='binary_crossentropy',metrics=['accuracy'])

model.fit(x_partial_train,y_partial_train,epochs=4,batch_size=512,validation_data=(x_validation,y_validation))

print('')
print("train shape: " + str(x_train.shape))
print("score on test: " + str(model.evaluate(x_test,y_test)[1]))
print("score on train: "+ str(model.evaluate(x_train,y_train)[1]))


Train on 24000 samples, validate on 1000 samples
Epoch 1/4
Epoch 2/4
Epoch 3/4
Epoch 4/4

train shape: (25000, 10000)
score on test: 0.8784400224685669
score on train: 0.9075599908828735
CPU times: user 12.1 s, sys: 1.78 s, total: 13.9 s
Wall time: 7.45 s
