# Detecting Spam in E-mail

Spam detection is a classification problem, and we solve it here with supervised learning.

Learning Objectives:
1. A lot of NLP is data preparation.
2. You can use almost any Machine Learning algorithm, as long as the data can be made to fit it.

## The Spam Database

The data can be found here: [link to spam database at University of California Irvine](https://archive.ics.uci.edu/ml/datasets/Spambase)

Note that this data has already been pre-processed; the details are in the UCI documentation reproduced below. Columns 1-48 are word frequency measures: for each e-mail, the number of times a given word appears, divided by the total number of words in that e-mail, times 100. The last column is our target variable, the label that says whether an e-mail is spam (1) or not (0).

Each row of the data holds the word frequency measures for one e-mail.
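As a rough illustration of how one of these word frequency values could be computed from raw text, here is a minimal sketch. The function name `word_freq`, the sample e-mail, and the case-insensitive matching are assumptions made for this example; the UCI documentation does not say how case was handled.

```python
import re

def word_freq(text: str, word: str) -> float:
    """Percentage of words in `text` that equal `word`."""
    # A "word" is any run of alphanumeric characters, as in the UCI definition.
    # Lowercasing both sides is an assumption of this sketch.
    words = re.findall(r"[A-Za-z0-9]+", text.lower())
    if not words:
        return 0.0
    return 100.0 * words.count(word.lower()) / len(words)

email = "Free money! Claim your free prize now."
# 2 of the 7 words are "free", so this prints 28.57
print(round(word_freq(email, "free"), 2))
```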

## UCI Documentation
### Source:

#### Creators:

Mark Hopkins, Erik Reeber, George Forman, Jaap Suermondt
Hewlett-Packard Labs, 1501 Page Mill Rd., Palo Alto, CA 94304

#### Donor:

George Forman (gforman at nospam hpl.hp.com) 650-857-7835

### Data Set Information:

The "spam" concept is diverse: advertisements for products/web sites, make money fast schemes, chain letters, pornography...

Our collection of spam e-mails came from our postmaster and individuals who had filed spam. Our collection of non-spam e-mails came from filed work and personal e-mails, and hence the word 'george' and the area code '650' are indicators of non-spam. These are useful when constructing a personalized spam filter. One would either have to blind such non-spam indicators or get a very wide collection of non-spam to generate a general purpose spam filter.
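"Blinding" the personalized indicators could be as simple as dropping their columns before training. The sketch below assumes the columns are named after the `word_freq_WORD` pattern described in the attribute information; the helper `blind_columns` and the toy matrix are hypothetical.

```python
import numpy as np

def blind_columns(X, names, blocked):
    """Drop the columns of X whose names appear in `blocked`."""
    keep = [i for i, name in enumerate(names) if name not in blocked]
    return X[:, keep], [names[i] for i in keep]

# Toy data: 2 e-mails, 4 feature columns (values are made up).
names = ["word_freq_free", "word_freq_george", "word_freq_650", "word_freq_money"]
X_toy = np.array([[1.2, 0.0, 0.0, 0.8],
                  [0.0, 3.1, 1.5, 0.0]])

X_blind, kept = blind_columns(X_toy, names, {"word_freq_george", "word_freq_650"})
print(kept)           # ['word_freq_free', 'word_freq_money']
print(X_blind.shape)  # (2, 2)
```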

For background on spam:

Cranor, Lorrie F., LaMacchia, Brian A. Spam!
Communications of the ACM, 41(8):74-83, 1998.

(a) Hewlett-Packard Internal-only Technical Report. External forthcoming.
(b) Determine whether a given email is spam or not.
(c) ~7% misclassification error. False positives (marking good mail as spam) are very undesirable. If we insist on zero false positives in the training/testing set, 20-25% of the spam passed through the filter.
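The zero-false-positive trade-off can be sketched with a score threshold: only flag an e-mail if it scores higher than every legitimate e-mail, then measure how much spam slips through. The function `zero_fp_threshold` and the scores below are invented for illustration and are not how the original report was produced.

```python
import numpy as np

def zero_fp_threshold(scores, labels):
    """Return the threshold that flags no legitimate e-mail, and the share of spam missed."""
    scores = np.asarray(scores, dtype=float)
    labels = np.asarray(labels, dtype=int)
    # Flag an e-mail only if it scores above every legitimate (label 0) e-mail.
    threshold = scores[labels == 0].max()
    flagged = scores > threshold
    spam = labels == 1
    missed = np.mean(~flagged[spam])
    return float(threshold), float(missed)

scores = [0.10, 0.40, 0.35, 0.80, 0.95, 0.30]  # made-up spam scores
labels = [0, 0, 1, 1, 1, 1]                     # 0 = not spam, 1 = spam
threshold, missed = zero_fp_threshold(scores, labels)
print(threshold, missed)  # 0.4 0.5
```

In this toy case half of the spam passes the filter, which is the price of never flagging a legitimate e-mail.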

### Attribute Information:

The last column of 'spambase.data' denotes whether the e-mail was considered spam (1) or not (0), i.e. unsolicited commercial e-mail. Most of the attributes indicate whether a particular word or character was frequently occurring in the e-mail. The run-length attributes (55-57) measure the length of sequences of consecutive capital letters. For the statistical measures of each attribute, see the end of this file. Here are the definitions of the attributes:

48 continuous real [0,100] attributes of type word_freq_WORD
= percentage of words in the e-mail that match WORD, i.e. 100 * (number of times the WORD appears in the e-mail) / total number of words in e-mail. A "word" in this case is any string of alphanumeric characters bounded by non-alphanumeric characters or end-of-string.

6 continuous real [0,100] attributes of type char_freq_CHAR
= percentage of characters in the e-mail that match CHAR, i.e. 100 * (number of CHAR occurrences) / total characters in e-mail

1 continuous real [1,...] attribute of type capital_run_length_average
= average length of uninterrupted sequences of capital letters

1 continuous integer [1,...] attribute of type capital_run_length_longest
= length of longest uninterrupted sequence of capital letters

1 continuous integer [1,...] attribute of type capital_run_length_total
= sum of length of uninterrupted sequences of capital letters
= total number of capital letters in the e-mail

1 nominal {0,1} class attribute of type spam
= denotes whether the e-mail was considered spam (1) or not (0), i.e. unsolicited commercial e-mail. 
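
The three capital-run attributes can be illustrated with a short sketch. The function name `capital_run_features` is hypothetical, and returning zeros for text without capital letters is an assumption of this example (the documentation lists the range as starting at 1).

```python
import re

def capital_run_features(text: str):
    """Return (average, longest, total) length of uninterrupted capital-letter runs."""
    runs = [len(run) for run in re.findall(r"[A-Z]+", text)]
    if not runs:
        # Assumption: no capital letters at all yields zeros.
        return 0.0, 0, 0
    return sum(runs) / len(runs), max(runs), sum(runs)

# Runs are "BUY", "NOW", "S", "BIG", so this prints (2.5, 3, 10)
print(capital_run_features("BUY NOW and Save BIG"))
```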

## Resources
* [sklearn naive_bayes MultinomialNB](http://scikit-learn.org/stable/modules/generated/sklearn.naive_bayes.MultinomialNB.html#sklearn.naive_bayes.MultinomialNB)
* [Naive Bayes](https://en.wikipedia.org/wiki/Naive_Bayes_classifier)
* [sklearn AdaBoost Classifier](http://scikit-learn.org/stable/modules/generated/sklearn.ensemble.AdaBoostClassifier.html)
* [AdaBoost](https://en.wikipedia.org/wiki/AdaBoost)

In [4]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.feature_extraction.text import TfidfVectorizer, CountVectorizer
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import MultinomialNB
from sklearn.svm import SVC
from wordcloud import WordCloud

In [7]:
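# Load the data as a NumPy array. The file has no header row, so pass
# header=None to keep the first e-mail as data.
# DataFrame.as_matrix() was removed in pandas 1.0; to_numpy() replaces it.
data = pd.read_csv('data/spambase.data', header=None).to_numpy()

In [8]:
# Shuffle the rows in place so the train/test split below is random
np.random.shuffle(data)

In [10]:
data.shape

(4601, 58)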
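
The shuffle above is unseeded, so every run produces a different row order and therefore a different train/test split and slightly different scores. If reproducible results are wanted, one option is a seeded generator, sketched below on a small stand-in array; the seed value is arbitrary.

```python
import numpy as np

rng = np.random.default_rng(seed=0)   # the seed value is arbitrary

toy = np.arange(12).reshape(6, 2)     # stand-in for the e-mail matrix
rng.shuffle(toy)                      # shuffles rows in place, like np.random.shuffle

# Re-creating the generator with the same seed reproduces the same order.
again = np.arange(12).reshape(6, 2)
np.random.default_rng(seed=0).shuffle(again)
print(np.array_equal(toy, again))     # True
```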

We will only use the first 48 columns because they hold the word frequency data we want to use (see the documentation above).

We also know the last column holds the label (output) so we will use that column for our target variable.

In [22]:
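X = data[:, :48]   # word-frequency features (columns 1-48)
Y = data[:, -1]    # spam label as a 1-D array, the shape scikit-learn expects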

Now we can split our datasets into training and testing groups.

In [27]:
# Hold out the last 100 shuffled e-mails for testing
X_train = X[:-100]
Y_train = Y[:-100]
x_test = X[-100:]
y_test = Y[-100:]
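
The manual slicing above relies on the data having been shuffled first. The notebook already imports `train_test_split`, which shuffles and splits in one step; a minimal sketch on synthetic stand-in data follows. The array names, the seed, and the choice to stratify by label are assumptions of this example.

```python
import numpy as np
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
X_demo = rng.random((500, 48))          # synthetic stand-in for the 48 word frequencies
y_demo = rng.integers(0, 2, size=500)   # synthetic 0/1 labels

# test_size=100 holds out 100 rows; stratify keeps the spam ratio similar in both parts.
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=100, stratify=y_demo, random_state=0
)
print(X_tr.shape, X_te.shape)  # (400, 48) (100, 48)
```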

Now we can create the model. First instantiate the object.

In [35]:
model = MultinomialNB()
model.fit(X_train,Y_train)

MultinomialNB(alpha=1.0, class_prior=None, fit_prior=True)

In [37]:
print("Classification rate for NB: ", model.score(x_test, y_test))

Classification rate for NB:  0.9


We got a decent score of 90% accuracy on the 100 held-out e-mails, and we can now easily try other models.
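
Because the UCI notes single out false positives as the costly error, it can help to break an accuracy number down into its error types. The sketch below uses scikit-learn's `confusion_matrix` on made-up labels and predictions, not on the notebook's actual results.

```python
import numpy as np
from sklearn.metrics import confusion_matrix

# Made-up labels and predictions (0 = not spam, 1 = spam).
y_true = np.array([0, 0, 0, 0, 1, 1, 1, 1, 1, 0])
y_pred = np.array([0, 0, 1, 0, 1, 1, 0, 1, 1, 0])

# With labels=[0, 1] the matrix is [[tn, fp], [fn, tp]].
tn, fp, fn, tp = confusion_matrix(y_true, y_pred, labels=[0, 1]).ravel()
print("false positives:", fp)  # good mail marked as spam
print("false negatives:", fn)  # spam that passed the filter
```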

In [38]:
from sklearn.ensemble import AdaBoostClassifier

Ada_model = AdaBoostClassifier()
Ada_model.fit(X_train,Y_train)

AdaBoostClassifier(algorithm='SAMME.R', base_estimator=None,
          learning_rate=1.0, n_estimators=50, random_state=None)

In [39]:
print("Classification rate for AdaBoost Classifier: ", Ada_model.score(x_test, y_test))

Classification rate for AdaBoost Classifier:  0.92


So we did slightly better with the AdaBoost classifier, although on a test set of 100 e-mails a two-point gap amounts to just two e-mails.
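
How much weight a gap like this can carry on 100 test e-mails can be estimated with a normal-approximation interval around each accuracy. This is a rough sketch; the helper `accuracy_interval` is hypothetical, and the approximation is only a guide at this sample size.

```python
import math

def accuracy_interval(accuracy: float, n: int, z: float = 1.96):
    """Normal-approximation 95% interval for an accuracy measured on n samples."""
    se = math.sqrt(accuracy * (1.0 - accuracy) / n)
    return accuracy - z * se, accuracy + z * se

for name, acc in [("Naive Bayes", 0.90), ("AdaBoost", 0.92)]:
    low, high = accuracy_interval(acc, n=100)
    print(f"{name}: {low:.3f} to {high:.3f}")
# Naive Bayes: 0.841 to 0.959
# AdaBoost: 0.867 to 0.973
```

The two intervals overlap heavily, so this single split does not settle which model is better.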

As this shows, knowing the variety of available models and the distinctions between them is key to successful Natural Language Processing and Machine Learning.

In this example, the important thing to know is that we are solving a supervised learning classification problem. As long as we choose a model that is a classifier and that trains on known label values (i.e., supervised learning), it is not strictly necessary to know what the model is "doing" or how it works, only that it provides reliable results.
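
This interchangeability comes from scikit-learn classifiers sharing the same `fit` and `score` methods, so swapping models is a one-line change. The sketch below loops over three classifiers on synthetic data; `SVC` is imported at the top of the notebook but never used there, so including it here is an assumption, as are the data and the label rule.

```python
import numpy as np
from sklearn.ensemble import AdaBoostClassifier
from sklearn.naive_bayes import MultinomialNB
from sklearn.svm import SVC

rng = np.random.default_rng(0)
X_demo = rng.random((300, 10))                             # synthetic non-negative features
y_demo = (X_demo[:, 0] + X_demo[:, 1] > 1.0).astype(int)   # synthetic labels

# Same style of split as in the notebook: last 100 rows held out.
X_tr, X_te = X_demo[:-100], X_demo[-100:]
y_tr, y_te = y_demo[:-100], y_demo[-100:]

models = {
    "MultinomialNB": MultinomialNB(),
    "AdaBoost": AdaBoostClassifier(),
    "SVC": SVC(),
}
scores = {}
for name, clf in models.items():
    clf.fit(X_tr, y_tr)                     # identical call for every classifier
    scores[name] = clf.score(X_te, y_te)    # mean accuracy on the held-out rows
    print(name, scores[name])
```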