# BASIC SPAM DETECTION USING MACHINE LEARNING

**Arun Das**    
Research Fellow,    
Open Cloud Institute,    
The University of Texas at San Antonio.    
arun.das@my.utsa.edu

**M**achine Learning (ML) is a field of Computer Science that has grown rapidly in recent years. The buzzwords `machine learning` and `deep learning` are now closely associated with `artificial intelligence`, a technology once confined to science fiction. It is affecting us in ways few imagined and is steadily being integrated into our everyday lives. Take, for example, a personal assistant such as Google Now or Siri. The brains of these assistants are very complex neural networks and other software crafted specifically for the task of answering your queries.
Here, we tackle a basic machine learning problem: classifying emails as Spam or Not Spam. We will use a publicly available, clean and labelled dataset. To be comprehensive, we will train several different classifiers and pick the best of them. In future work, we will mine our own email data and train a spam classifier on it. For now, let's walk through the simple example.

In [4]:
# First, let us import some libraries.

# Import Pandas to manage data
import pandas as pd

# Import Numpy to manage numerical data, matrices etc.
import numpy as np

# Import Matplotlib for visualizing graphs, charts, etc.
import matplotlib.pyplot as plt

# Import OrderedDict to keep the classifier scores in sorted order.
from collections import OrderedDict

# Import the train/test splitter, a scaler, toy dataset generators and the different classifiers from Scikit-Learn
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.datasets import make_moons, make_circles, make_classification
from sklearn.neural_network import MLPClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import SVC
from sklearn.gaussian_process import GaussianProcessClassifier
from sklearn.gaussian_process.kernels import RBF
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier, AdaBoostClassifier
from sklearn.naive_bayes import GaussianNB
from sklearn.naive_bayes import MultinomialNB
from sklearn.discriminant_analysis import QuadraticDiscriminantAnalysis

#### Download the data to your local machine from this URL: http://archive.ics.uci.edu/ml/machine-learning-databases/spambase/    

*A bit about the dataset*: (extracted from http://archive.ics.uci.edu/ml/machine-learning-databases/spambase/spambase.DOCUMENTATION)

**Number of Instances**: 4601 (1813 Spam = 39.4%)    

**Number of Attributes**: 58 (57 continuous, 1 nominal class label)    
 
**Attribute Information**:
The last column of 'spambase.data' denotes whether the e-mail was 
considered spam (1), i.e. unsolicited commercial e-mail, or not (0).      
Most of the attributes indicate whether a particular word or
character occurred frequently in the e-mail.  The run-length
attributes (55-57) measure the length of sequences of consecutive 
capital letters.  For the statistical measures of each attribute, 
see the end of the documentation file linked above.  Here are the definitions of the attributes:    

*48 continuous real [0,100]* attributes of type word_freq_WORD = percentage of words in the e-mail that match WORD,
i.e. 100 * (number of times the WORD appears in the e-mail) / total number of words in e-mail.  A "word" in this case is any string of alphanumeric characters bounded by non-alphanumeric characters or end-of-string.    

*6 continuous real [0,100]* attributes of type char_freq_CHAR = percentage of characters in the e-mail that match CHAR, i.e. 100 * (number of CHAR occurrences) / total characters in e-mail    
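The two frequency definitions above can be sketched in a few lines of Python. This is only an illustration of the formulas, not the code used to build the dataset: the function names `word_freq` and `char_freq`, the sample sentence, and the case-insensitive word matching are all assumptions made here.

```python
import re


def word_freq(text, word):
    """Percentage of words in `text` that equal `word`.

    A "word" is any run of alphanumeric characters. Matching is
    case-insensitive here, which is an assumption of this sketch.
    """
    words = re.findall(r"[A-Za-z0-9]+", text)
    if not words:
        return 0.0
    matches = sum(1 for w in words if w.lower() == word.lower())
    return 100.0 * matches / len(words)


def char_freq(text, char):
    """Percentage of characters in `text` that equal `char`."""
    if not text:
        return 0.0
    return 100.0 * text.count(char) / len(text)


# Hypothetical e-mail body, used only to exercise the two functions.
sample = "Free money! Claim your free prize now!"
print("word_freq_free:", word_freq(sample, "free"))
print("char_freq_!:", char_freq(sample, "!"))
```

In the sample sentence, 2 of the 7 words are "free" and 2 of the 38 characters are "!", so both values fall inside the documented [0,100] range.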

*1 continuous real [1,...]* attribute of type capital_run_length_average = average length of uninterrupted sequences of capital letters    

*1 continuous integer [1,...]* attribute of type capital_run_length_longest = length of longest uninterrupted sequence of capital letters    
    
*1 continuous integer [1,...]* attribute of type capital_run_length_total = sum of length of uninterrupted sequences of capital letters = total number of capital letters in the e-mail      
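The three capital-run attributes can be illustrated in the same spirit. The helper below is a minimal sketch with a made-up name (`capital_run_features`) and a made-up sample string; returning zeros for text without capital letters is an assumption, since the documentation only gives the range as [1,...].

```python
import re


def capital_run_features(text):
    """Return (average, longest, total) length of runs of capital letters."""
    # Each match is one uninterrupted sequence of capital letters.
    runs = [len(run) for run in re.findall(r"[A-Z]+", text)]
    if not runs:
        # Assumption of this sketch: no capitals gives all zeros.
        return 0.0, 0, 0
    return sum(runs) / len(runs), max(runs), sum(runs)


# Hypothetical subject line; its runs are "WIN", "FREE", "P" and "NOW".
avg_run, longest_run, total_caps = capital_run_features("WIN a FREE iPhone NOW")
print(avg_run, longest_run, total_caps)
```

Note that the total is simply the number of capital letters in the text, as the definition of capital_run_length_total states.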

*1 nominal {0,1}* class attribute of type spam = denotes whether the e-mail was considered spam (1),     
i.e. unsolicited commercial e-mail, or not (0).      
    
**Missing Attribute Values**: None    

**Class Distribution**:     
	Spam	  1813  (39.4%)     
	Non-Spam  2788  (60.6%)    
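The class distribution also gives a useful reference point for any classifier trained on this data. The short calculation below reproduces the percentages from the counts above and derives the accuracy of a trivial model that always predicts the majority class; the variable names are chosen here for illustration.

```python
# Class counts taken from the dataset documentation quoted above.
n_spam = 1813
n_non_spam = 2788
n_total = n_spam + n_non_spam

spam_share = 100.0 * n_spam / n_total
non_spam_share = 100.0 * n_non_spam / n_total

# A model that always answers "not spam" is right on every non-spam e-mail.
majority_baseline = max(n_spam, n_non_spam) / n_total

print("Total e-mails: {}".format(n_total))
print("Spam: {:.1f}%, Non-Spam: {:.1f}%".format(spam_share, non_spam_share))
print("Majority-class baseline accuracy: {:.3f}".format(majority_baseline))
```

A score close to this baseline means a classifier has learned little beyond the class proportions.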

A bit more information regarding the different features is available [here](https://github.com/WinVector/zmPDSwR/tree/master/Spambase). The different features are:
   *'word.freq.make', 'word.freq.address', 'word.freq.all',
   'word.freq.3d', 'word.freq.our', 'word.freq.over', 'word.freq.remove',
   'word.freq.internet', 'word.freq.order', 'word.freq.mail',
   'word.freq.receive', 'word.freq.will', 'word.freq.people',
   'word.freq.report', 'word.freq.addresses', 'word.freq.free',
   'word.freq.business', 'word.freq.email', 'word.freq.you',
   'word.freq.credit', 'word.freq.your', 'word.freq.font',
   'word.freq.000', 'word.freq.money', 'word.freq.hp', 'word.freq.hpl',
   'word.freq.george', 'word.freq.650', 'word.freq.lab',
   'word.freq.labs', 'word.freq.telnet', 'word.freq.857',
   'word.freq.data', 'word.freq.415', 'word.freq.85',
   'word.freq.technology', 'word.freq.1999', 'word.freq.parts',
   'word.freq.pm', 'word.freq.direct', 'word.freq.cs',
   'word.freq.meeting', 'word.freq.original', 'word.freq.project',
   'word.freq.re', 'word.freq.edu', 'word.freq.table',
   'word.freq.conference', 'char.freq.semi', 'char.freq.lparen',
   'char.freq.lbrack', 'char.freq.bang', 'char.freq.dollar',
   'char.freq.hash', 'capital.run.length.average',
   'capital.run.length.longest', 'capital.run.length.total',
   'spam'.*

In [7]:
# Read the data using pandas and convert it to a NumPy array.
# Note: spambase.data has no header row, so read_csv treats the first record
# as column names and 4600 of the 4601 rows remain. Pass header=None to keep every row.
data = pd.read_csv('spambase.data').to_numpy()

# Shuffle the rows of the dataset in place
np.random.shuffle(data)

# Data has 58 columns altogether, the last column being the labels.
# Extract the features and target separately
X = data[:,:57]
y = data[:,-1]

# Split the data into train and test. The ratio here is 60/40 (train/test)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=.4, random_state=42)

# Print the shapes of the train and test data
print("X_train.shape: {}".format(X_train.shape))
print("y_train.shape: {}".format(y_train.shape))
print("X_test.shape: {}".format(X_test.shape))
print("y_test.shape: {}".format(y_test.shape))

X_train.shape: (2760, 57)
y_train.shape: (2760,)
X_test.shape: (1840, 57)
y_test.shape: (1840,)
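Two details of the split are worth a closer look. `train_test_split` already shuffles the rows by default, and its `stratify` argument keeps the spam/non-spam ratio the same in both parts. The sketch below shows this on a small synthetic array (the names `X_demo` and `y_demo` and the 40/60 class mix are made up for the illustration), so it runs without the dataset file.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the real data: 100 rows, 57 features, 40% "spam".
rng = np.random.default_rng(0)
X_demo = rng.random((100, 57))
y_demo = np.array([1] * 40 + [0] * 60)

# stratify=y_demo preserves the class ratio in both splits;
# random_state makes the shuffle reproducible.
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.4, stratify=y_demo, random_state=42)

print("Train shape: {}, test shape: {}".format(X_tr.shape, X_te.shape))
print("Spam share in train: {:.2f}, in test: {:.2f}".format(y_tr.mean(), y_te.mean()))
```

On the real data, fixing the seed of every random step (including the earlier shuffle) is what makes the reported scores repeatable from run to run.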


In [8]:
# A dictionary to store the classifiers and respective scores
score_dict = {}

# Different classifiers used in the training procedure and their initializations
classifiers = [
    KNeighborsClassifier(3),
    SVC(kernel="linear", C=0.025),
    SVC(gamma=2, C=1),
    GaussianProcessClassifier(1.0 * RBF(1.0), warm_start=True),
    DecisionTreeClassifier(max_depth=5),
    RandomForestClassifier(max_depth=5, n_estimators=10, max_features=1),
    MLPClassifier(alpha=1),
    AdaBoostClassifier(),
    GaussianNB(),
    QuadraticDiscriminantAnalysis()]

# Variable to store the number of classifiers
num_classifiers = len(classifiers)

# Names of the classifiers (string)
names = ["Nearest Neighbors", "Linear SVM", "RBF SVM", "Gaussian Process",
         "Decision Tree", "Random Forest", "Neural Net", "AdaBoost",
         "Naive Bayes", "QDA"]

# Training loop (iterates over each classifier and trains a model with it)
for name_of_classifier, model in zip(names, classifiers):
    print("Training {} classifier".format(name_of_classifier))
    model.fit(X_train, y_train)  # Fit the current classifier with the training data
    score = model.score(X_test, y_test)  # Mean accuracy of the trained model on the test data
    score_dict[name_of_classifier] = score  # Save the classifier name and its score

# Display the classifier and score information, sorted, starting with the best classifier.
score_dict_sorted_by_value = OrderedDict(sorted(score_dict.items(), key=lambda x: x[1], reverse=True))
print("\nTraining complete\nResults:\n")
for k, v in score_dict_sorted_by_value.items():
    print("Classifier: {}, \tScore: {}".format(k, v))

Training Nearest Neighbors classifier
Training Linear SVM classifier
Training RBF SVM classifier
Training Gaussian Process classifier
Training Decision Tree classifier
Training Random Forest classifier
Training Neural Net classifier
Training AdaBoost classifier
Training Naive Bayes classifier
Training QDA classifier

Training complete
Results:

Classifier: AdaBoost, 	Score: 0.932608695652
Classifier: Linear SVM, 	Score: 0.917391304348
Classifier: Decision Tree, 	Score: 0.909782608696
Classifier: Neural Net, 	Score: 0.870108695652
Classifier: Random Forest, 	Score: 0.858695652174
Classifier: Naive Bayes, 	Score: 0.828260869565
Classifier: QDA, 	Score: 0.825
Classifier: Nearest Neighbors, 	Score: 0.786413043478
Classifier: RBF SVM, 	Score: 0.689130434783
Classifier: Gaussian Process, 	Score: 0.598913043478
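The features in this dataset live on very different scales: the frequency columns are percentages in [0,100], while the capital-run columns have no upper bound. Distance- and kernel-based models such as Nearest Neighbors, the RBF SVM and the Gaussian Process are sensitive to this, which may partly explain their lower scores, although that was not verified here. The sketch below shows one common remedy, a `StandardScaler` placed in front of the classifier, evaluated with 5-fold cross-validation. It uses synthetic data from `make_classification` with one column deliberately inflated, so the numbers it prints say nothing about Spambase itself.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic data standing in for the real features (an assumption of this sketch).
X_demo, y_demo = make_classification(
    n_samples=300, n_features=20, n_informative=5, random_state=0)

# Inflate one column so that it dominates Euclidean distances.
X_demo[:, 0] *= 1000.0

# Same classifier, without and with standardization in front of it.
raw_knn = KNeighborsClassifier(3)
scaled_knn = make_pipeline(StandardScaler(), KNeighborsClassifier(3))

# The scaler is re-fitted inside every fold, so no test-fold data leaks into it.
raw_scores = cross_val_score(raw_knn, X_demo, y_demo, cv=5)
scaled_scores = cross_val_score(scaled_knn, X_demo, y_demo, cv=5)

print("Unscaled k-NN accuracy: {:.3f}".format(raw_scores.mean()))
print("Scaled k-NN accuracy:   {:.3f}".format(scaled_scores.mean()))
```

To try this on the spam data, the pipeline can be passed to the training loop in place of the bare classifier, since it exposes the same `fit` and `score` methods.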


Great, most of our models learned well from the 57 features: AdaBoost scored best at about 93% test accuracy, while the RBF SVM and Gaussian Process scored lowest at about 69% and 60%. We could also reduce the feature set to the word frequencies alone. A "word" in this case is any string of alphanumeric characters bounded by non-alphanumeric characters or end-of-string. To do that, change `X = data[:,:57]` to `X = data[:,:48]` so that variable `X` holds only the 48 columns of word-frequency data.    
There is much more we could add to this. Let us dig deeper in future tutorials.
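The column slicing described above follows directly from the attribute layout: 48 word frequencies, then 6 character frequencies, then 3 capital-run attributes, then the label. The sketch below applies the slices to a placeholder array of zeros (named `data_demo` here, an assumption so the example runs without the dataset file) and prints the resulting shapes.

```python
import numpy as np

# Placeholder with the same 58-column layout as spambase.data.
data_demo = np.zeros((5, 58))

X_all = data_demo[:, :57]       # every feature column
X_words = data_demo[:, :48]     # word_freq_* columns only
X_chars = data_demo[:, 48:54]   # char_freq_* columns only
X_caps = data_demo[:, 54:57]    # capital_run_length_* columns only
y_demo = data_demo[:, -1]       # class label (last column)

for name, block in [("all", X_all), ("words", X_words),
                    ("chars", X_chars), ("caps", X_caps)]:
    print("{:>5}: {}".format(name, block.shape))
print("label: {}".format(y_demo.shape))
```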