# Initialization

In [17]:
%reload_ext autoreload
%autoreload 2
%matplotlib inline

In [18]:
import os
import re
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

In [39]:
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_extraction.text import TfidfVectorizer

from sklearn.linear_model import LogisticRegression
from sklearn.linear_model import SGDClassifier

from sklearn.pipeline import Pipeline
from sklearn.base import BaseEstimator, TransformerMixin

from sklearn.preprocessing import StandardScaler

from sklearn.dummy import DummyClassifier

from sklearn.model_selection import cross_val_score
from sklearn.model_selection import cross_val_predict

from sklearn.metrics import confusion_matrix

# Class, for use in pipelines, to select certain columns from a DataFrame and convert to a numpy array
# From A. Geron: Hands-On Machine Learning with Scikit-Learn & TensorFlow, O'Reilly, 2017
# Modified by Derek Bridge to allow for casting in the same ways as pandas.DataFrame.astype
class DataFrameSelector(BaseEstimator, TransformerMixin):
    def __init__(self, attribute_names, dtype=None):
        self.attribute_names = attribute_names
        self.dtype = dtype
    def fit(self, X, y=None):
        return self
    def transform(self, X):
        X_selected = X[self.attribute_names]
        if self.dtype:
            return X_selected.astype(self.dtype).values
        return X_selected.values

# Exploring the dataset

Extracting the archives, we can see that there are two kinds of emails (spam and ham), stored in separate folders.
So we can identify the label of an example by looking at the directory it is stored in.

Opening some of the emails, we can see that each one starts with some metadata about the email itself, followed by the content of the email. Some HTML tags are present.

# Reading in the dataset 

We define some functions to read the dataset from the files provided. 

In [20]:
# Reads all files from the directory specified, and their content is returned as
# a pandas Series of strings. 
def read_files_from_dir(directory):
    files_contents = []
    for file_name in os.listdir(directory):
        file_path = os.path.join(directory, file_name)
        with open(file_path) as f:
            files_contents.append(f.read())
    return pd.Series(files_contents)

# Converts a series to a dataframe and adds for each of the elements a 
# constant numeric label, specified in the parameter label. 
def to_pd_DF_with_label(ser, label):
    df = pd.DataFrame()
    df['text'] = ser
    df['label'] = pd.Series(np.ones(len(df), dtype=np.int64) * label, index=df.index)
    return df

Now we will use the functions we defined to actually read in the dataset.
These lines of code assume that the spam and ham archives
have been extracted to directories named `spam` and `ham`.

In [21]:
# read hams with label 0, since they are the negative class
hams = to_pd_DF_with_label(read_files_from_dir('ham'), 0)
# read spams with label 1, since they are the positive class
spams = to_pd_DF_with_label(read_files_from_dir('spam'), 1)

In [22]:
# check if we succeeded in reading in the dataset.
print(hams.shape)
print(spams.shape)

(1650, 2)
(1248, 2)


Now that we have two separate dataframes, we should combine them
to have all the data in a single dataframe.
After combining them, all the hams come before all the spams,
so we should shuffle the dataset to avoid problems with k-fold in the future.

In [23]:
# DataFrame.append was removed in pandas 2.0; pd.concat does the same job
emails = pd.concat([hams, spams], ignore_index=True)
emails = emails.take(np.random.permutation(len(emails)))
emails.reset_index(drop=True, inplace=True)
print(emails.shape)

(2898, 2)
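As an alternative to `take` with a random permutation, the same shuffle can be sketched with `DataFrame.sample`, shown here on a toy frame. The toy data and the `random_state=42` seed are assumptions for illustration only; fixing a seed simply makes the shuffle reproducible between runs.

```python
import pandas as pd

# Toy stand-in for the combined dataframe: hams first, then spams.
toy = pd.DataFrame({
    'text': ['ham 1', 'ham 2', 'ham 3', 'spam 1', 'spam 2'],
    'label': [0, 0, 0, 1, 1],
})

# frac=1 samples every row exactly once, i.e. a full shuffle.
# random_state is a hypothetical seed chosen for reproducibility.
shuffled = toy.sample(frac=1, random_state=42).reset_index(drop=True)

print(shuffled)
```

The row contents are unchanged; only their order differs, and `reset_index(drop=True)` gives the shuffled frame a clean 0..n-1 index.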


# Cleaning the dataset

Opening the email files, we can see that the first lines of every email
are data about the email itself (metadata).
Since we do not want to analyse the metadata,
we can delete it, leaving us with the title and the body of the email.
In order to strip the metadata we have to identify it.
After opening a few files, I noticed a pattern:
the metadata ends at the first empty line in the file.

In [24]:
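# regex=True is required in recent pandas versions, where str.replace defaults to literal matching
emails['stripped_metadata'] = emails['text'].str.replace(r'(.*?)\n\n', '', flags=re.MULTILINE | re.DOTALL, n=1, regex=True)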
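To see what this pattern removes, here is a small sketch using plain `re.sub` on a made-up email (the sample text is hypothetical, not taken from the dataset). Because the match is non-greedy and limited to one replacement, only the text up to the first empty line is dropped; later blank lines in the body survive.

```python
import re

# Hypothetical email: two header lines, an empty line, then a two-paragraph body.
sample_email = (
    "From: alice@example.com\n"
    "Subject: Hello\n"
    "\n"
    "First paragraph of the body.\n"
    "\n"
    "Second paragraph."
)

# Same pattern and flags as above; count=1 plays the role of n=1 in pandas.
body = re.sub(r'(.*?)\n\n', '', sample_email, count=1,
              flags=re.MULTILINE | re.DOTALL)

print(body)
```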

Now that we have got rid of the metadata, the next thing I consider unnecessary
is the HTML markup, that is, everything between `<` and `>`, so we can remove it too
and keep only the plain text of the documents.

In [25]:
emails['just_text'] = emails['stripped_metadata'].str.replace(r"<(.*?)>", '', flags=re.MULTILINE | re.DOTALL, regex=True)
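A quick sketch of what this pattern does, using `re.sub` on a made-up HTML fragment (the fragment is an assumption for illustration). The tags themselves are deleted, including a tag that spans two lines thanks to `re.DOTALL`, while the text between an opening and a closing tag is kept.

```python
import re

# Hypothetical HTML fragment; the <a ...> tag is deliberately split over two lines.
sample_html = (
    "<html><body><p>Buy <b>now</b>!</p>\n"
    "<a\nhref='http://example.com'>Click here</a></body></html>"
)

# Same pattern and flags as above: remove everything from '<' to the next '>'.
plain_text = re.sub(r"<(.*?)>", '', sample_html, flags=re.MULTILINE | re.DOTALL)

print(plain_text)
```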

In order to run some tests later, I will also strip the HTML tags
while leaving the metadata in place, so we can compare these two methods.

In [26]:
emails['stripped_html'] = emails['text'].str.replace(r"<(.*?)>", '', flags=re.MULTILINE | re.DOTALL, regex=True)

All types of preprocessing are stored in this list of tuples,
where the first element of each tuple is the name of the preprocessing type
and the second element is the corresponding column of the dataframe.

In [27]:
preprocessings = [('raw text', 'text'), 
                  ('stripped metadata', 'stripped_metadata'),
                  ('stripped HTML tags', 'stripped_html'),
                  ('stripped metadata and HTML', 'just_text'),
]

# Building the pipelines

In the following section, we will build the pipelines,
which will be responsible for vectorizing the emails and classifying them.
Some of them are very similar, differing only in their parameters or in the steps of the pipeline.
Please note: I do not present here all the possibilities that I tried,
since the results are very similar and there is no need to replicate so much of the work.

In [43]:
# CountVectorizer parameters: ignore terms with document frequency below 0.01 or above 0.5
count_vect_eng_pipeline = Pipeline([
    ('vectorizer', CountVectorizer(stop_words='english', min_df=0.01, max_df=0.5)),
    ('classifier', LogisticRegression()),
])

# TfidfVectorizer parameters: ignore terms with document frequency below 0.01 or above 0.5
tfidf_eng_pipeline = Pipeline([
    ('tfidf', TfidfVectorizer(stop_words='english', min_df=0.01, max_df=0.5)),
    ('classifier', LogisticRegression()),
])

# all default CountVectorizer + LogisticRegression
count_vect_pipeline = Pipeline([
    ('vectorizer', CountVectorizer()),
    ('classifier', LogisticRegression()),
])

# all default TfidfVectorizer + LogisticRegression
tfidf_pipeline = Pipeline([
    ('tfidf', TfidfVectorizer()),
    ('classifier', LogisticRegression()),
])

# For SGD I am using the hinge loss function (a linear SVM);
# if the log loss were used instead, we would get logistic regression

# all default CountVectorizer + SGDClassifier
count_vect_pipeline_sgd = Pipeline([
    ('vectorizer', CountVectorizer()),
    ('classifier', SGDClassifier(max_iter=1000, loss='hinge')),
])

# all default TfidfVectorizer + SGDClassifier, using hinge loss function

tfidf_pipeline_sgd = Pipeline([
    ('tfidf', TfidfVectorizer()),
    ('classifier', SGDClassifier(max_iter=1000, loss='hinge')),
])

# just a dummy pipeline, which will always predict the mode.
dummy_pipeline = Pipeline([
    ('selector', DataFrameSelector(['label'])),
    ('dummy', DummyClassifier(strategy='most_frequent')),
])
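The comment about the SGD loss functions can be made concrete with a small sketch. The two helper functions below are hypothetical names written for illustration, not part of scikit-learn; they compute the per-example hinge loss and logistic (log) loss as a function of the margin `y * f(x)`, with labels encoded as -1 and +1.

```python
import math

def hinge_loss(margin):
    """Hinge loss: zero once the example is classified with margin >= 1."""
    return max(0.0, 1.0 - margin)

def logistic_loss(margin):
    """Logistic (log) loss: always positive, decays smoothly as the margin grows."""
    return math.log(1.0 + math.exp(-margin))

# Compare the two losses for a misclassified, a borderline,
# and two correctly classified examples.
for margin in (-1.0, 0.0, 1.0, 2.0):
    print(f"margin={margin:+.1f}  hinge={hinge_loss(margin):.3f}  "
          f"log={logistic_loss(margin):.3f}")
```

The hinge loss ignores examples that are already classified with a margin of at least 1, whereas the log loss keeps nudging them; this is the main practical difference between the two settings of `loss`.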

The same idea is used as for the preprocessings list:
a list of tuples, where the first element is the name and the second is the pipeline.

In [45]:
pipelines = [('Count vectorizer with English stop words', count_vect_eng_pipeline), 
             ('tf-idf vectorizer with English stop words', tfidf_eng_pipeline), 
             ('Count vectorizer', count_vect_pipeline), 
             ('tf-idf vectorizer', tfidf_pipeline), 
             ('Count vectorizer + SGD', count_vect_pipeline_sgd),
             ('tf-idf vectorizer + SGD', tfidf_pipeline_sgd),
]
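The effect of the `min_df` and `max_df` settings used in the first two pipelines can be sketched on a toy corpus. The four documents and the thresholds of 0.5 below are assumptions chosen so that the effect is visible on so little data; the pipelines above use `min_df=0.01` and `max_df=0.5`. A float value is interpreted as a proportion of documents.

```python
from sklearn.feature_extraction.text import CountVectorizer

# Hypothetical toy corpus: 'cheap' occurs in every document,
# 'pills' and 'meeting' in half of them, the other words in only one.
docs = [
    "cheap pills cheap",
    "cheap meeting today",
    "cheap offer pills",
    "cheap lunch meeting",
]

# Keep only terms that occur in at least 50% and at most 50% of the documents.
vectorizer = CountVectorizer(min_df=0.5, max_df=0.5)
vectorizer.fit(docs)

kept_terms = sorted(vectorizer.vocabulary_)
print(kept_terms)
```

Terms that are too common (`cheap`) or too rare (`today`, `offer`, `lunch`) are dropped from the vocabulary, which shrinks the feature space before the classifier sees it.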

# Accuracy estimation

Before we start estimating the accuracy of the pipelines, 
we should talk about the accuracy measures we use in order to evaluate the classifiers.

## K-fold

I chose stratified k-fold cross-validation, with k=10, to estimate the accuracy,
because we have enough examples for every fold to be reasonably large.
With 2898 examples, each fold holds about 290 of them (2898 / 10).
This is well above the usual rule of thumb of at least 30 examples per fold,
so the accuracy measured on each fold is a reasonably reliable estimate.

Stratified k-fold is better than holdout in this case because
the splitting and testing are repeated k times,
so the chance of getting an 'unlucky' split is lower than with a single holdout.

When using k-fold, it is important to shuffle the dataset:
if the dataset is sorted, a fold may contain mostly one particular type
of example, which will skew the result.

The shuffling of the dataset has been done previously.
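As a rough check of the fold sizes and class balance, the sketch below runs scikit-learn's `StratifiedKFold` on a label vector with the same class counts as our dataset (1650 hams, 1248 spams). The label vector is deliberately left sorted and the feature matrix is a dummy placeholder; both are assumptions for illustration. For a classifier, `cross_val_score` with an integer `cv` uses this splitter without shuffling.

```python
import numpy as np
from sklearn.model_selection import StratifiedKFold

# Labels with the same class counts as the dataset, sorted on purpose.
y_sorted = np.array([0] * 1650 + [1] * 1248)
# Placeholder features; only the labels matter for the split.
X_dummy = np.zeros((len(y_sorted), 1))

skf = StratifiedKFold(n_splits=10)

fold_sizes = []
spam_fractions = []
for _, test_idx in skf.split(X_dummy, y_sorted):
    fold_sizes.append(len(test_idx))
    spam_fractions.append(y_sorted[test_idx].mean())

print("fold sizes:", fold_sizes)
print("spam fraction per fold:", np.round(spam_fractions, 3))
```

Each test fold keeps roughly the overall spam proportion (1248 / 2898), even though the labels were sorted.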

## Manual estimation

Get the labels of the dataset.

In [30]:
# the labels
y = emails['label'].values

Check how the dummy classifier performs, which will predict the most frequent class.

In [31]:
np.mean(cross_val_score(dummy_pipeline, emails, y, scoring='accuracy', cv=10))

0.56935926500417611
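This baseline is simply the share of the majority class. A short arithmetic sketch, using the ham and spam counts printed earlier, reproduces it up to small differences caused by averaging over folds:

```python
# Class counts taken from the shapes printed when reading the dataset.
n_ham, n_spam = 1650, 1248

# A classifier that always predicts the most frequent class (ham)
# is right exactly as often as that class occurs.
baseline_accuracy = max(n_ham, n_spam) / (n_ham + n_spam)

print(baseline_accuracy)
```

Any real classifier has to beat this figure of roughly 0.57 to be worth using.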

Checking the accuracy of a classifier, with 10-fold cross validation, and stripped metadata

In [32]:
np.mean(cross_val_score(count_vect_pipeline, emails['stripped_metadata'], y, scoring='accuracy', cv=10))

0.97618541940102599

The confusion matrix:

In [33]:
y_predicted = cross_val_predict(count_vect_pipeline, emails['stripped_metadata'], y, cv=10)
confusion_matrix(y, y_predicted)

array([[1613,   37],
       [  32, 1216]])

As you can observe, the classifier makes few false positives
(ham classified as spam): 37 out of 1650 hams. This is the type of error we are trying to avoid.
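The usual summary metrics can be derived directly from this matrix. The sketch below hard-codes the four counts printed above and relies on scikit-learn's convention that rows are true classes and columns are predicted classes, with ham (0) first and spam (1) second.

```python
# Confusion matrix from the output above: rows = true class, columns = predicted class.
confusion = [[1613, 37],
             [32, 1216]]

(tn, fp), (fn, tp) = confusion

accuracy = (tp + tn) / (tp + tn + fp + fn)
false_positive_rate = fp / (fp + tn)   # share of hams flagged as spam
precision = tp / (tp + fp)             # share of predicted spams that really are spam
recall = tp / (tp + fn)                # share of spams that were caught

print(f"accuracy:            {accuracy:.4f}")
print(f"false positive rate: {false_positive_rate:.4f}")
print(f"precision:           {precision:.4f}")
print(f"recall:              {recall:.4f}")
```

The false positive rate is the figure most relevant here, since it measures how often a legitimate email would end up in the spam folder.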

Manual testing is nice and all, but we have too many possibilities to check, so we should automate it.

## Automating the estimation for all pipelines

In order to test each of the pipelines presented above with each of the possible preprocessing steps, I wrote two simple nested for loops, which check all possible combinations of the two.

In [46]:
for pipeline_name, pipeline in pipelines:
    for preproc_name, preproc in preprocessings:
        mean = np.mean(cross_val_score(pipeline, emails[preproc], y, cv=10))
        print(pipeline_name, preproc_name, mean)

Count vectorizer with English stop words raw text 0.9799856819
Count vectorizer with English stop words stripped metadata 0.97445889512
Count vectorizer with English stop words stripped HTML tags 0.980329316311
Count vectorizer with English stop words stripped metadata and HTML 0.97445173607
tf-idf vectorizer with English stop words raw text 0.955146163942
tf-idf vectorizer with English stop words stripped metadata 0.959621763513
tf-idf vectorizer with English stop words stripped HTML tags 0.961351867319
tf-idf vectorizer with English stop words stripped metadata and HTML 0.966174680826
Count vectorizer raw text 0.980674143897
Count vectorizer stripped metadata 0.976185419401
Count vectorizer stripped HTML tags 0.982398281828
Count vectorizer stripped metadata and HTML 0.976527860637
tf-idf vectorizer raw text 0.956179453526
tf-idf vectorizer stripped metadata 0.966525474287
tf-idf vectorizer stripped HTML tags 0.96308077795
tf-idf vectorizer stripped metadata and HTML 0.972043908841
C

Interestingly, the tf-idf vectorizer generally performed worse
than the CountVectorizer. The CountVectorizer loses accuracy
when the metadata is stripped, whereas the tf-idf approach gains accuracy with each kind of preprocessing.

Discarding the English stop words also has little effect,
although the pipelines that keep the stop words perform slightly better.

The best method according to accuracy is the CountVectorizer
with stripped HTML tags and without discarding the English stop words.
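Rather than reading the best combination off the printed list by eye, it can be picked programmatically. The sketch below copies the sixteen accuracy values visible in the output above into a dictionary (the dictionary itself is a hypothetical helper structure, not something the loop above produces) and selects the highest one.

```python
# Mean 10-fold accuracies copied from the output above,
# keyed by (pipeline name, preprocessing name).
results = {
    ('Count vectorizer with English stop words', 'raw text'): 0.9799856819,
    ('Count vectorizer with English stop words', 'stripped metadata'): 0.97445889512,
    ('Count vectorizer with English stop words', 'stripped HTML tags'): 0.980329316311,
    ('Count vectorizer with English stop words', 'stripped metadata and HTML'): 0.97445173607,
    ('tf-idf vectorizer with English stop words', 'raw text'): 0.955146163942,
    ('tf-idf vectorizer with English stop words', 'stripped metadata'): 0.959621763513,
    ('tf-idf vectorizer with English stop words', 'stripped HTML tags'): 0.961351867319,
    ('tf-idf vectorizer with English stop words', 'stripped metadata and HTML'): 0.966174680826,
    ('Count vectorizer', 'raw text'): 0.980674143897,
    ('Count vectorizer', 'stripped metadata'): 0.976185419401,
    ('Count vectorizer', 'stripped HTML tags'): 0.982398281828,
    ('Count vectorizer', 'stripped metadata and HTML'): 0.976527860637,
    ('tf-idf vectorizer', 'raw text'): 0.956179453526,
    ('tf-idf vectorizer', 'stripped metadata'): 0.966525474287,
    ('tf-idf vectorizer', 'stripped HTML tags'): 0.96308077795,
    ('tf-idf vectorizer', 'stripped metadata and HTML'): 0.972043908841,
}

# Pick the combination with the highest mean accuracy.
best_combination = max(results, key=results.get)
best_accuracy = results[best_combination]

print(best_combination, best_accuracy)
```

In a full run, the same dictionary could be filled inside the nested loops instead of being typed in by hand.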

In [47]:
for pipeline_name, pipeline in pipelines:
    for preproc_name, preproc in preprocessings:
        y_predicted = cross_val_predict(pipeline, emails[preproc], y, cv=10)
        conf_matrix = confusion_matrix(y, y_predicted)
        print(pipeline_name, preproc_name)
        print(conf_matrix)

Count vectorizer with English stop words raw text
[[1621   29]
 [  29 1219]]
Count vectorizer with English stop words stripped metadata
[[1608   42]
 [  32 1216]]
Count vectorizer with English stop words stripped HTML tags
[[1617   33]
 [  24 1224]]
Count vectorizer with English stop words stripped metadata and HTML
[[1608   42]
 [  32 1216]]
tf-idf vectorizer with English stop words raw text
[[1589   61]
 [  69 1179]]
tf-idf vectorizer with English stop words stripped metadata
[[1585   65]
 [  52 1196]]
tf-idf vectorizer with English stop words stripped HTML tags
[[1605   45]
 [  67 1181]]
tf-idf vectorizer with English stop words stripped metadata and HTML
[[1594   56]
 [  42 1206]]
Count vectorizer raw text
[[1620   30]
 [  26 1222]]
Count vectorizer stripped metadata
[[1613   37]
 [  32 1216]]
Count vectorizer stripped HTML tags
[[1622   28]
 [  23 1225]]
Count vectorizer stripped metadata and HTML
[[1612   38]
 [  30 1218]]
tf-idf vectorizer raw text
[[1593   57]
 [  70 1178]]
tf-

Looking at the confusion matrices, we arrive at the same conclusion:
the CountVectorizer that keeps the stop words, applied to the text with stripped HTML tags, seems to be the most promising choice.