# Problem description

<p> <b> Build spam filters/classifiers and evaluate them. </b> </p>
<p> <b> Input: </b> two archives, spam and ham, containing spam and non-spam emails respectively, stored as text files. </p>
<p> <b> Output: </b> a Jupyter notebook named ai1.ipynb containing:
classifiers and their comparison/evaluation, dataset exploration, dataset preprocessing, 
and, most importantly, markup cells explaining the whats, the hows, and the whys of the problem and the solution. </p>

# Initialization

In [1]:
%reload_ext autoreload
%autoreload 2
%matplotlib inline

In [2]:
import os
import re
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

In [3]:
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_extraction.text import TfidfVectorizer

from sklearn.linear_model import LogisticRegression
from sklearn.linear_model import SGDClassifier

from sklearn.decomposition import TruncatedSVD

from sklearn.pipeline import Pipeline
from sklearn.base import BaseEstimator, TransformerMixin

from sklearn.preprocessing import StandardScaler

from sklearn.dummy import DummyClassifier

from sklearn.model_selection import cross_val_score
from sklearn.model_selection import cross_val_predict

from sklearn.metrics import confusion_matrix

# Class, for use in pipelines, to select certain columns from a DataFrame and convert to a numpy array
# From A. Geron: Hands-On Machine Learning with Scikit-Learn & TensorFlow, O'Reilly, 2017
# Modified by Derek Bridge to allow for casting in the same ways as pandas.DataFrame.astype
class DataFrameSelector(BaseEstimator, TransformerMixin):
    def __init__(self, attribute_names, dtype=None):
        self.attribute_names = attribute_names
        self.dtype = dtype
    def fit(self, X, y=None):
        return self
    def transform(self, X):
        X_selected = X[self.attribute_names]
        if self.dtype:
            return X_selected.astype(self.dtype).values
        return X_selected.values

# Exploring the dataset

<p> Extracting the archives, we can see that there are two kinds of emails (spam and ham, obviously) in different folders. </p>
<p> So we can identify the label of an example by looking at the directory it is stored in. </p>
<p> Opening some of the emails, we can observe that each one starts with some metadata about the email itself, followed by the content of the email. Some HTML tags are present. </p>

# Reading in the dataset 

We define some functions to read the dataset from the files provided. 

In [4]:
# Reads all files from the directory specified, and their content is returned as
# a pandas Series of strings. 
def read_files_from_dir(directory):
    files_contents = []
    for file_name in os.listdir(directory):
        file_path = os.path.join(directory, file_name)
        with open(file_path) as f:
            files_contents.append(f.read())
    return pd.Series(files_contents)

# Converts a series to a dataframe and adds for each of the elements a 
# constant numeric label, specified in the parameter label. 
def to_pd_DF_with_label(ser, label):
    df = pd.DataFrame()
    df['text'] = ser
    df['label'] = pd.Series(np.ones(len(df), dtype=np.int64) * label, index=df.index)
    return df

<p> Now we will use the functions we defined, to actually read in the dataset. </p>
<p> These lines of code assume that the spam and ham archives
have been extracted to directories spam and ham. </p>

In [5]:
# read hams with label 0, since they are the negative class
hams = to_pd_DF_with_label(read_files_from_dir('ham'), 0)
# read spams with label 1, since they are the positive class
spams = to_pd_DF_with_label(read_files_from_dir('spam'), 1)

In [6]:
# check if we succeeded in reading in the dataset.
print(hams.shape)
print(spams.shape)

(1650, 2)
(1248, 2)


<p> Now that we have two separate dataframes, we should append one to the other,
to have all the data in a single dataframe. </p>
<p> After the append, we know that all the hams come before all the spams,
so we should shuffle the dataset to avoid problems with k-fold in the future. </p>

In [7]:
# DataFrame.append was removed in pandas 2.0; pd.concat does the same job
emails = pd.concat([hams, spams], ignore_index=True)
emails = emails.take(np.random.permutation(len(emails)))
emails.reset_index(drop=True, inplace=True)
print(emails.shape)

(2898, 2)
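The shuffle above gives a different order on every run. A minimal sketch of a repeatable variant, assuming a toy dataframe in place of the real one (the data and `random_state=42` are arbitrary choices of mine, not part of the original notebook):

```python
import pandas as pd

# Toy stand-in for the concatenated hams + spams dataframe (hypothetical data).
toy = pd.DataFrame({
    'text': ['ham a', 'ham b', 'ham c', 'spam a', 'spam b'],
    'label': [0, 0, 0, 1, 1],
})

# sample(frac=1) returns every row in random order;
# random_state makes the shuffle repeatable between runs.
shuffled = toy.sample(frac=1, random_state=42).reset_index(drop=True)

print(shuffled)
```

Fixing the seed makes it easier to compare scores between runs, at the cost of always evaluating on the same folds.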


# Cleaning the dataset

<p> Opening the email files, we can see that the first lines of all the emails 
are data about the email itself (metadata).
Since we do not want to conduct metadata analysis of the email,
we can delete this metadata, leaving us with the body of the email. </p>
<p> In order to strip the metadata we have to identify it.
After opening a few files, I noticed a pattern:
the metadata is delimited by an empty line in the files. </p>

In [8]:
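# just a friendly regex to delete everything up to and including the first double newline (\n\n)
# regex=True is required in pandas >= 2.0, where str.replace treats the pattern as a literal by default
emails['stripped_metadata'] = emails['text'].str.replace(r'(.*?)\n\n', '', flags=re.MULTILINE | re.DOTALL, n=1, regex=True)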
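To make the effect of this regex easier to see, here is a small sketch on a single made-up email (the string `raw` is hypothetical). It uses plain `re.sub`, which should behave like the pandas call above for one string:

```python
import re

# Hypothetical raw email: header lines, one empty line, then the body.
raw = ("From: a@example.com\n"
       "Subject: Hello\n"
       "\n"
       "First paragraph.\n"
       "\n"
       "Second paragraph.")

# Non-greedy match up to the first blank line; count=1 leaves the
# blank lines inside the body untouched.
body = re.sub(r'.*?\n\n', '', raw, count=1, flags=re.DOTALL)

print(body)
```

Only the header block is removed; the blank line between the two body paragraphs survives because of `count=1`.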

Now that we got rid of the metadata, the next thing I think to be unnecessary
is the HTML markup, so we could remove the tags too,
in order to be left with only the plain text of the documents.

In [9]:
# another regex to delete everything between '<' and '>' (brackets included)
emails['just_text'] = emails['stripped_metadata'].str.replace(r"<(.*?)>", '', flags=re.MULTILINE | re.DOTALL, regex=True)
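A small sketch of what this tag-stripping regex does, on two made-up HTML fragments (both strings are hypothetical). Note that it removes the tags themselves and keeps the text between them:

```python
import re

# Hypothetical HTML fragments; the second one has a tag split over two lines.
html_one = '<html><body><p>Click <a href="http://example.com">here</a> now!</p></body></html>'
html_two = '<font\n color="red">Sale</font>'

# DOTALL lets '.' match newlines, so multi-line tags are removed as well.
text_one = re.sub(r'<.*?>', '', html_one, flags=re.DOTALL)
text_two = re.sub(r'<.*?>', '', html_two, flags=re.DOTALL)

print(text_one)
print(text_two)
```

This is a rough approach: HTML entities such as `&nbsp;` and the contents of `<script>` or `<style>` elements are left in the text.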

In order to run some tests later, I will also strip the HTML tags
while leaving the metadata in place, so we can compare these two methods.

In [10]:
emails['stripped_html'] = emails['text'].str.replace(r"<(.*?)>", '', flags=re.MULTILINE | re.DOTALL, regex=True)

All types of preprocessing are stored in this list of tuples,
where the first element of each tuple is the name of the preprocessing type
and the second element is the corresponding column of the dataframe.

In [11]:
preprocessings = [('raw text', 'text'), 
                  ('stripped metadata', 'stripped_metadata'),
                  ('stripped HTML tags', 'stripped_html'),
                  ('stripped metadata and HTML', 'just_text'),
]

# Building the pipelines

<p> In the following section, we will build the pipelines,
which will be responsible for doing the vectorization of the emails and their classification. </p>
<p> For each pipeline you can find a short comment describing the decisions I took while building the pipeline. </p>
<p> Please note: I did not present here all the possibilities that I tried,
since the results are really similar, and there is no need to replicate so much of the work. </p>

<p> <b> Truncated SVD </b> </p>
SVD stands for Singular Value Decomposition; here it is used to perform dimensionality reduction. TruncatedSVD is especially useful on the bag-of-words representation, since this implementation supports sparse matrices as input.
http://scikit-learn.org/stable/modules/generated/sklearn.decomposition.TruncatedSVD.html
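As a minimal sketch of this step in isolation, assuming a tiny made-up corpus (the four documents and `random_state=0` are my own choices), the sparse count matrix goes in and a dense matrix with fewer columns comes out:

```python
from sklearn.decomposition import TruncatedSVD
from sklearn.feature_extraction.text import CountVectorizer

# Hypothetical mini-corpus, just large enough to show the shapes.
docs = [
    'cheap pills buy now',
    'buy cheap watches now',
    'meeting agenda for monday',
    'monday meeting notes attached',
]

counts = CountVectorizer().fit_transform(docs)   # sparse bag-of-words matrix
svd = TruncatedSVD(n_components=2, n_iter=7, random_state=0)
reduced = svd.fit_transform(counts)              # dense array, 2 columns

print(counts.shape)
print(reduced.shape)
```

The number of components must be smaller than the vocabulary size, which is why the real pipelines combine 100 or 500 components with much larger vocabularies.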

In [12]:
# some parameters to CountVectorizer: minimum document frequency should be 0.01 and max 0.3
# these parameters are included in order to discard words which appear too often or
# too rarely, this way avoiding words which appear only once or so.

# SVD -- Singular Value Decomposition for dimensionality reduction.
# the number of components (new features) will be 100, and the number of iterations 7

# default LogisticRegression
count_vect_eng_pipeline = Pipeline([
    ('vectorizer', CountVectorizer(stop_words='english', min_df=0.01, max_df=0.3)),
    ('dim_reduc', TruncatedSVD(n_components=100, n_iter=7)),
    ('classifier', LogisticRegression()),
])

# some parameters to tf-idf vectorizer: minimum document frequency should be 0.01 and max 0.5

# default LogisticRegression
tfidf_eng_pipeline = Pipeline([
    ('tfidf', TfidfVectorizer(stop_words='english', min_df=0.01, max_df=0.5)),
    ('classifier', LogisticRegression()),
])

# For SGD I am using the hinge loss function (which gives a linear SVM);
# if the log loss were used (loss='log_loss', named 'log' in older scikit-learn versions),
# we would get Logistic Regression

#  CountVectorizer -- limit the number of features to 10000
# TruncatedSVD -- creates 500 new features in 10 iterations
# SGDClassifier
count_vect_pipeline_sgd = Pipeline([
    ('vectorizer', CountVectorizer(max_features=10000)),
    ('dim_reduc', TruncatedSVD(n_components=500, n_iter=10)),
    ('classifier', SGDClassifier(max_iter=1000, loss='hinge')),
])

# TfidfVectorizer + SGDClassifier, using hinge loss function
# Tfidf limits the number of features to 10000 in order to avoid unnecessary words.
tfidf_pipeline_sgd = Pipeline([
    ('tfidf', TfidfVectorizer(max_features=10000)),
    ('classifier', SGDClassifier(max_iter=1000, loss='hinge')),
])

# just a dummy pipeline, which will always predict the mode.
dummy_pipeline = Pipeline([
    ('selector', DataFrameSelector(['label'])),
    ('dummy', DummyClassifier(strategy='most_frequent')),
])

The same idea is used as when constructing the preprocessings list:
a list of tuples, where the first element is the name and the second is the pipeline.

In [13]:
pipelines = [('Count vectorizer with English stop words', count_vect_eng_pipeline), 
             ('tf-idf vectorizer with English stop words', tfidf_eng_pipeline), 
             ('Count vectorizer + SGD', count_vect_pipeline_sgd),
             ('tf-idf vectorizer + SGD', tfidf_pipeline_sgd),
]

# Performance of classifiers

Before we start estimating the accuracy of the pipelines, 
we should talk about the accuracy measures we use in order to evaluate the classifiers.

### K-fold

<p> I chose stratified k-fold cross-validation with k=10 as my evaluation method,
because we have a lot of examples, so it will perform better than holdout.
With 2898 examples, each fold contains about 290 examples.
Having this many examples in each fold keeps the per-fold estimates
reasonably reliable (as a rule of thumb, we should have at least 30 examples in each fold). </p>

<p> Stratified k-fold is better than holdout (in this case) because
k-fold repeats the stratified split and the testing multiple times,
so the chance of getting 'unlucky' with a single split is smaller than with simple holdout. </p>

<p> When using k-fold, it is important to shuffle the dataset:
if the dataset is sorted, a fold may contain only a particular type
of example, which would skew the results. </p>

<p> The shuffling of the dataset has been done previously. </p>
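To illustrate what the stratification does, here is a small sketch that builds the folds explicitly. When `cv=10` is passed together with a classifier, scikit-learn uses `StratifiedKFold` internally; the placeholder arrays `X_demo` and `y_demo`, and the choice of `shuffle=True` with `random_state=0`, are assumptions of mine for this illustration:

```python
import numpy as np
from sklearn.model_selection import StratifiedKFold

# Labels with the same class counts as the dataset: 1650 ham (0), 1248 spam (1).
y_demo = np.array([0] * 1650 + [1] * 1248)
X_demo = np.zeros((len(y_demo), 1))  # placeholder features; only y matters here

skf = StratifiedKFold(n_splits=10, shuffle=True, random_state=0)

fold_sizes = []
spam_shares = []
for train_idx, test_idx in skf.split(X_demo, y_demo):
    fold_sizes.append(len(test_idx))
    spam_shares.append(y_demo[test_idx].mean())

print(fold_sizes)
print(np.round(spam_shares, 3))
```

Every test fold should hold roughly 290 emails with a spam share close to the overall 1248 / 2898, so no fold is dominated by one class.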

## Manual estimation

Get the labels of the dataset, we will need them later for cross validation.

In [14]:
# the labels
y = emails['label'].values

Check how the dummy classifier performs, which will predict the most frequent class.

In [15]:
np.mean(cross_val_score(dummy_pipeline, emails, y, scoring='accuracy', cv=10))

0.56935926500417611
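As a sanity check, this baseline should equal the share of the majority class. A quick calculation from the dataset shapes printed earlier (1650 hams and 1248 spams):

```python
# Class counts taken from the shapes printed after reading the dataset.
n_ham, n_spam = 1650, 1248

# Always predicting the most frequent class is right exactly this often.
baseline = max(n_ham, n_spam) / (n_ham + n_spam)

print(round(baseline, 4))
```

Any useful classifier has to beat this number clearly.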

Checking the accuracy of a classifier, with 10-fold cross validation, and stripped metadata

In [16]:
np.mean(cross_val_score(count_vect_eng_pipeline, emails['stripped_metadata'], y, scoring='accuracy', cv=10))

0.95065982579644426

The confusion matrix:

In [17]:
y_predicted = cross_val_predict(count_vect_eng_pipeline, emails['stripped_metadata'], y, cv=10)
confusion_matrix(y, y_predicted)

array([[1592,   58],
       [  80, 1168]])

The confusion matrix will be handy to compare which classifiers are generating the most False Positives.
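For reading the matrix, a short sketch using the numbers printed above. In scikit-learn's convention the rows are the true labels and the columns are the predicted labels, in label order (0 = ham, 1 = spam):

```python
import numpy as np

# Confusion matrix reported above for the count vectorizer pipeline.
cm = np.array([[1592, 58],
               [80, 1168]])

# ravel() flattens row by row: true negatives, false positives,
# false negatives, true positives.
tn, fp, fn, tp = cm.ravel()

print('ham flagged as spam (false positives):', fp)
print('spam that got through (false negatives):', fn)
print('false positive rate:', fp / (fp + tn))
```

For a spam filter the false positives are usually the more costly mistake, since a legitimate email ends up in the spam folder.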

Manual testing is nice and all, but we have too many possibilities to check, so we should automate it.

## Automating the estimation for all pipelines

<p> In order to test each of the pipelines presented above with each of the possible preprocessing steps, I wrote two nested for loops, which check all possible combinations of the two. </p>

<p> In order to provide more comprehensive analysis, I will use more than one performance measurement. </p>

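<p> Precision is the fraction of the examples predicted as positive (spam) that really are positive.
Recall is the ability of the classifier to find all positive examples, i.e. the fraction of the actual positives that are predicted as positive. </p>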

<p> These two performance measures are complementary, so there is a third one, called f1,
which is the harmonic mean of precision and recall. </p>

http://scikit-learn.org/stable/modules/model_evaluation.html#precision-recall-f-measure-metrics

https://en.wikipedia.org/wiki/Precision_and_recall#Precision
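A small worked example of the three measures, assuming ten made-up labels and predictions (both lists are hypothetical), with the manual formulas next to the scikit-learn functions:

```python
from sklearn.metrics import f1_score, precision_score, recall_score

# Hypothetical labels: 1 = spam (positive), 0 = ham (negative).
y_true = [1, 1, 1, 0, 0, 0, 0, 0, 0, 1]
y_pred = [1, 1, 1, 1, 1, 0, 0, 0, 0, 0]
# Counting by hand: tp = 3, fp = 2, fn = 1, tn = 4.

precision = precision_score(y_true, y_pred)  # tp / (tp + fp) = 3 / 5
recall = recall_score(y_true, y_pred)        # tp / (tp + fn) = 3 / 4
f1 = f1_score(y_true, y_pred)                # harmonic mean of the two

# The same f1 computed directly from precision and recall.
f1_manual = 2 * precision * recall / (precision + recall)

print(precision, recall, f1, f1_manual)
```

The harmonic mean is pulled towards the smaller of the two values, so a classifier cannot get a high f1 by doing well on only one of them.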

In [18]:
for scor in ['accuracy', 'f1', 'precision']:
    print(scor)
    for pipeline_name, pipeline in pipelines:
        for preproc_name, preproc in preprocessings:
            mean = np.mean(cross_val_score(pipeline, emails[preproc], y, scoring=scor, cv=10))
            print(pipeline_name, preproc_name, mean)
    print()

accuracy
Count vectorizer with English stop words raw text 0.961696694905
Count vectorizer with English stop words stripped metadata 0.951007039733
Count vectorizer with English stop words stripped HTML tags 0.964796563656
Count vectorizer with English stop words stripped metadata and HTML 0.95893449469
tf-idf vectorizer with English stop words raw text 0.955142584417
tf-idf vectorizer with English stop words stripped metadata 0.959983295549
tf-idf vectorizer with English stop words stripped HTML tags 0.961004653383
tf-idf vectorizer with English stop words stripped metadata and HTML 0.968261543969
Count vectorizer + SGD raw text 0.961362605894
Count vectorizer + SGD stripped metadata 0.95446128147
Count vectorizer + SGD stripped HTML tags 0.954103328958
Count vectorizer + SGD stripped metadata and HTML 0.951350674144
tf-idf vectorizer + SGD raw text 0.980683689297
tf-idf vectorizer + SGD stripped metadata 0.981023744183
tf-idf vectorizer + SGD stripped HTML tags 0.987575468321
tf-idf 
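The loop above refits every pipeline once per scoring name. As a possible alternative, `cross_validate` can compute several metrics from the same fits. This is only a sketch on a tiny made-up corpus (`texts`, `labels` and `cv=5` are assumptions so that the example runs on its own):

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_validate
from sklearn.pipeline import Pipeline

# Tiny hypothetical corpus, repeated so that every fold has both classes.
texts = ['win money now', 'cheap pills offer', 'free prize claim now',
         'win a free offer', 'meeting at noon', 'see attached report',
         'lunch on friday', 'project status update'] * 5
labels = np.array([1, 1, 1, 1, 0, 0, 0, 0] * 5)

pipe = Pipeline([
    ('tfidf', TfidfVectorizer()),
    ('classifier', LogisticRegression()),
])

metrics = ['accuracy', 'precision', 'recall', 'f1']
# One round of cross-validation, all four metrics scored on the same folds.
scores = cross_validate(pipe, texts, labels, cv=5, scoring=metrics)

for metric in metrics:
    print(metric, scores['test_' + metric].mean())
```

With the real data, `pipe` and `texts` would be replaced by one of the pipelines and one of the preprocessed columns, which cuts the number of fits by the number of metrics.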

# Interpreting the results

<p> Interestingly, the tf-idf vectorizer generally performed better
than the CountVectorizer when used with LogisticRegression. One possible explanation is that the dimensionality reduction in the CountVectorizer pipeline discards some information, so the classifier cannot perform as well.
With Stochastic Gradient Descent, however, tf-idf really shines. </p>

<p> The CountVectorizer pipelines seem to lose accuracy
as more preprocessing is applied, whereas the tf-idf pipelines gain accuracy with preprocessing. </p>

<p> Despite the fact that the first classifier, the count vectorizer with English stop words, uses only 100 features, it performed quite well: better than 95% in all performance metrics. </p>

It also looks like discarding the English stopwords has little effect,
but the pipelines that keep the stopwords perform slightly better.

<p> According to this analysis, the best classifier is the tf-idf vectorizer with stripped HTML tags, without removing English stopwords, and using SGD, since it has the highest score on all performance metrics. </p>

<p> This test has been run multiple times in order to check whether randomness plays a role, and I found that the results are really similar. </p>

### Confusion matrices for all of the pipelines

In [19]:
for pipeline_name, pipeline in pipelines:
    for preproc_name, preproc in preprocessings:
        y_predicted = cross_val_predict(pipeline, emails[preproc], y, cv=10)
        conf_matrix = confusion_matrix(y, y_predicted)
        print(pipeline_name, preproc_name)
        print(conf_matrix)

Count vectorizer with English stop words raw text
[[1609   41]
 [  69 1179]]
Count vectorizer with English stop words stripped metadata
[[1589   61]
 [  76 1172]]
Count vectorizer with English stop words stripped HTML tags
[[1602   48]
 [  55 1193]]
Count vectorizer with English stop words stripped metadata and HTML
[[1587   63]
 [  54 1194]]
tf-idf vectorizer with English stop words raw text
[[1591   59]
 [  71 1177]]
tf-idf vectorizer with English stop words stripped metadata
[[1584   66]
 [  50 1198]]
tf-idf vectorizer with English stop words stripped HTML tags
[[1606   44]
 [  69 1179]]
tf-idf vectorizer with English stop words stripped metadata and HTML
[[1598   52]
 [  40 1208]]
Count vectorizer + SGD raw text
[[1600   50]
 [  56 1192]]
Count vectorizer + SGD stripped metadata
[[1571   79]
 [  54 1194]]
Count vectorizer + SGD stripped HTML tags
[[1594   56]
 [  66 1182]]
Count vectorizer + SGD stripped metadata and HTML
[[1578   72]
 [  57 1191]]
tf-idf vectorizer + SGD raw text


Looking at the confusion matrices, we arrive at the same conclusion as before: the best classifier is tf-idf with SGD and stripped HTML tags, since it produces the fewest False Positives (ham classified as spam).

# Future improvements

I could take into consideration the length of the emails at each preprocessing stage and feed that into the pipeline as well. 

Build more pipelines with different parameters, estimators etc.
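A rough sketch of how the email length could be fed into a pipeline next to the tf-idf features. This is not something I have evaluated: the helper `text_lengths`, the name `length_aware_pipeline` and the toy data are all hypothetical, and `FeatureUnion` is just one possible way of combining the features:

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import FeatureUnion, Pipeline
from sklearn.preprocessing import FunctionTransformer, StandardScaler


def text_lengths(texts):
    # One column holding the character count of each document.
    return np.array([len(t) for t in texts], dtype=float).reshape(-1, 1)


# tf-idf features and the scaled length are computed side by side
# and stacked into one feature matrix.
features = FeatureUnion([
    ('tfidf', TfidfVectorizer()),
    ('length', Pipeline([
        ('count_chars', FunctionTransformer(text_lengths)),
        ('scale', StandardScaler()),
    ])),
])

length_aware_pipeline = Pipeline([
    ('features', features),
    ('classifier', LogisticRegression()),
])

# Hypothetical toy data, only to show that the pipeline fits and predicts.
texts = ['win money now', 'cheap pills offer for you today', 'free prize',
         'meeting at noon', 'see the attached quarterly report', 'lunch']
labels = [1, 1, 1, 0, 0, 0]

length_aware_pipeline.fit(texts, labels)
predictions = length_aware_pipeline.predict(texts)
feature_matrix = length_aware_pipeline.named_steps['features'].transform(texts)

print(predictions)
print(feature_matrix.shape)  # tf-idf columns plus one length column
```

Such a pipeline could be added to the `pipelines` list and evaluated with the same loops as the others.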

# Conclusion

I think I have implemented nearly everything I thought of. I consider this little project successful, since the performance of the best classifier is satisfactory.

Learning by doing helped me to understand the concepts more clearly.
I really enjoyed this assignment, as it was hands-on and interactive.