# Aims

To test multiple ML models on a dataset, as a refresher on commonly used Python libraries including sklearn and matplotlib.

Much of the code is re-used from the Data Science: NLP Udemy course by The Lazy Programmer ([code](https://github.com/lazyprogrammer/machine_learning_examples/blob/master/nlp_class/nb.py))

In [78]:
from sklearn.naive_bayes import MultinomialNB
from sklearn.ensemble import AdaBoostClassifier
from sklearn.neural_network import MLPClassifier
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

In [85]:
# spambase.data has no header row, so header=None keeps the first record as data
data = pd.read_csv('spambase.data', header=None).to_numpy()
np.random.shuffle(data)

Split the data into X and Y: X holds the features we want to analyse, and Y holds the label that specifies whether each row corresponds to a spam email or not.

To get X, we specify data[:, :48]. The first colon is for the rows, and a colon on its own means all rows are included. The :48 is equivalent to 0:48, and because the end of a slice is exclusive it selects columns 0-47 (48 columns in total), which are the word frequencies.

To get Y, we specify data[:, -1]. This is all rows again, but the column index -1 selects only the very last column, which is 1 for spam and 0 for not spam.

In [86]:
X = data[:, :48]
Y = data[:, -1]
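
The slicing rules described above can be checked on a small array. This is only an illustrative sketch: `toy`, `features` and `labels` are made-up names, and the 5x4 array stands in for the real dataset.

```python
import numpy as np

# Toy stand-in for the dataset: 5 rows, 4 columns, last column plays the role of the label.
toy = np.arange(20).reshape(5, 4)

features = toy[:, :3]   # all rows, columns 0, 1 and 2 (the end index 3 is excluded)
labels = toy[:, -1]     # all rows, last column only

print(features.shape)   # (5, 3)
print(labels.shape)     # (5,)
print(labels)           # [ 3  7 11 15 19]
```

The same rule is why `:48` on the real data yields 48 columns, indexed 0 to 47.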

We want to take 100 rows for the test set and the remainder of the rows for the train set. Since we have shuffled the array above we can simply take the bottom 100 rows as the test set.

We use the index [:-100,] (equivalent to [0:-100, :] for the 2-D array X) to select all rows from the beginning up to, but not including, the last 100 for train.

We use the index [-100:,] (equivalent to [-100:, :] for X) to select the last 100 rows for test. Y is 1-D, so for Y the same indices simply select elements rather than rows.

In [87]:
Xtrain = X[:-100,]
Ytrain = Y[:-100,]
Xtest = X[-100:,]
Ytest = Y[-100:,]
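
As a quick sanity check of the negative-index split, the sketch below applies the same pattern to a small random array. The names (`toy_X`, `toy_Y`, `n_test`) and the sizes are assumptions chosen for illustration, not part of the original notebook.

```python
import numpy as np

rng = np.random.default_rng(0)
toy_X = rng.random((10, 3))            # 10 rows, 3 feature columns
toy_Y = rng.integers(0, 2, size=10)    # 10 labels, each 0 or 1

n_test = 3
toy_Xtrain, toy_Ytrain = toy_X[:-n_test], toy_Y[:-n_test]   # everything except the last 3 rows
toy_Xtest, toy_Ytest = toy_X[-n_test:], toy_Y[-n_test:]     # the last 3 rows only

print(toy_Xtrain.shape, toy_Ytrain.shape)   # (7, 3) (7,)
print(toy_Xtest.shape, toy_Ytest.shape)     # (3, 3) (3,)
```

Train and test never overlap, and together they cover every row.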

We define a model using MultinomialNB from sklearn, fit it to the training data and score it on the test data.

In [None]:
NBmodel = MultinomialNB()
NBmodel.fit(Xtrain,Ytrain)
print("Classification for NB: "+str(NBmodel.score(Xtest,Ytest)))

This gives scores of roughly 0.85-0.9, i.e. 85-90% accuracy on the test set.

We can try a different model, such as AdaBoost.

In [None]:
ABmodel = AdaBoostClassifier()
ABmodel.fit(Xtrain,Ytrain)
print("Classification for AdaBoost: "+str(ABmodel.score(Xtest,Ytest)))

Running this a few times gives slightly better scores. To visualise the performance, let's run each model multiple times on freshly shuffled data, compute the average score and plot the results using matplotlib.

In [75]:
NBperformance = []
ABperformance = []

for _ in range(50):
    np.random.shuffle(data)
    X = data[:, :48]
    Y = data[:, -1]
    Xtrain = X[:-100,]
    Ytrain = Y[:-100,]
    Xtest = X[-100:,]
    Ytest = Y[-100:,]
    NBmodel.fit(Xtrain,Ytrain)
    ABmodel.fit(Xtrain,Ytrain)
    NBperformance.append(NBmodel.score(Xtest,Ytest))
    ABperformance.append(ABmodel.score(Xtest,Ytest))
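
The shuffle, split, fit and score steps in the loop can also be wrapped in a helper so that any classifier can be evaluated the same way. The following is a minimal sketch under assumptions: `repeated_holdout_scores` is a hypothetical function name, and the data is synthetic (random non-negative features with random labels) so the example runs without `spambase.data`. The scores it prints say nothing about the real dataset.

```python
import numpy as np
from sklearn.naive_bayes import MultinomialNB


def repeated_holdout_scores(model, data, n_features, n_test=100, n_runs=50, seed=0):
    """Shuffle, split, fit and score `model` n_runs times; return the list of scores."""
    rng = np.random.default_rng(seed)
    scores = []
    for _ in range(n_runs):
        shuffled = rng.permutation(data)   # shuffled copy of the rows; `data` is left untouched
        X, Y = shuffled[:, :n_features], shuffled[:, -1]
        model.fit(X[:-n_test], Y[:-n_test])
        scores.append(model.score(X[-n_test:], Y[-n_test:]))
    return scores


# Synthetic stand-in: 300 rows, 5 non-negative feature columns plus a 0/1 label column.
rng = np.random.default_rng(1)
synthetic = np.hstack([rng.random((300, 5)), rng.integers(0, 2, size=(300, 1))])

demo_scores = repeated_holdout_scores(
    MultinomialNB(), synthetic, n_features=5, n_test=50, n_runs=5
)
print(demo_scores)
```

Unlike `np.random.shuffle(data)`, which reorders the array in place, `rng.permutation` returns a copy, so the original row order is preserved between runs.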

In [None]:
print("NB average performance: "+str(sum(NBperformance)/len(NBperformance)))
print("AB average performance: "+str(sum(ABperformance)/len(ABperformance)))
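
Alongside the average, the spread of the scores indicates how much a single run can vary. A small sketch using NumPy, with a made-up `scores` list standing in for `NBperformance` or `ABperformance` (the numbers are hypothetical, not results from the notebook):

```python
import numpy as np

# Hypothetical scores standing in for a list such as NBperformance.
scores = [0.86, 0.90, 0.88, 0.84, 0.92]

mean_score = np.mean(scores)   # same value as sum(scores) / len(scores)
std_score = np.std(scores)     # population standard deviation of the runs

print(f"mean={mean_score:.3f}, std={std_score:.3f}")   # mean=0.880, std=0.028
```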

In [None]:
plt.plot(NBperformance)
plt.plot(ABperformance)
plt.ylabel('Score')
plt.ylim(0.5,1)
plt.legend(['Naive-Bayes','AdaBoost'], loc='upper left')
plt.show()

Another model we can use is a neural network, in this case a multi-layer perceptron (MLP), which is also included in sklearn. Here we use an MLP with two hidden layers of 20 units each.

In [None]:
NNmodel = MLPClassifier(hidden_layer_sizes=(20,20), max_iter=2000)
NNmodel.fit(Xtrain,Ytrain)
print("MLP performance: "+str(NNmodel.score(Xtest,Ytest)))

Because this model needs more computational resources to train, we won't loop it several times. Single runs give scores of 93-96%, better than both Naive-Bayes and AdaBoost.