# Context

## Understanding the data
This data is quite messy. Because this is spam classification rather than, say, topic classification, we need to adjust our approach accordingly. Odd punctuation, misspelled words, erratic patterns, and odd capitalization are usually features here, not noise.

The data set has only 5572 messages, which is rather small for advanced methods like deep learning, where we **might** end up with an overfit model. Because of this, I will spend my time on more traditional methods and only look into deep learning if I have time.

## Goal
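The [paper](http://www.dt.fee.unicamp.br/~tiago/smsspamcollection/doceng11.pdf) that comes with the data set reports the following results, where SC% is the share of spam caught, BH% is the share of ham blocked (ham misclassified as spam), Acc% is accuracy, and MCC is the Matthews Correlation Coefficient: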

| Classifier          | SC%    | BH%  | Acc%   | MCC   |
|---------------------|--------|------|--------|-------|
| SVM + tok1          | 83.10  | 0.18 | 97.64  | 0.893 |
| Boosted NB + tok2   | 84.48  | 0.53 | 97.50  | 0.887 |
| Boosted C4.5 + tok2 | 82.91  | 0.29 | 97.50  | 0.887 |

## Data Status
By the time the data gets to this notebook, it has been:
* Downloaded and unzipped
* Read into a pandas DataFrame

# Approach
There are a few factors at play here and we want to optimize each step.


In [1]:
%matplotlib inline
import pandas as pd
from pathlib import Path
import sys
import seaborn as sns
import re
from pprint import pprint

project_dir = Path.cwd().parent
sys.path.append(str(project_dir/'src'))

# These are utilities that I created to reduce notebook clutter
from make_dataframe import make_dataframe, master_data_handler
import utilities as ut

First we get our data, and then we encode the labels for use with our classifiers. We should note the mapping for later on: `ham` becomes 0 and `spam` becomes 1.

In [2]:
master_data_handler()
df = make_dataframe()
df.label = df.label.map({'ham': 0, 'spam': 1})
df.head()

Unnamed: 0,label,text
0,0,"Go until jurong point, crazy.. Available only ..."
1,0,Ok lar... Joking wif u oni...
2,1,Free entry in 2 a wkly comp to win FA Cup fina...
3,0,U dun say so early hor... U c already then say...
4,0,"Nah I don't think he goes to usf, he lives aro..."


It's interesting to note that, with no machine learning at all, we can get 86.6% accuracy simply by labeling every message as ham. This is a good warning that accuracy will probably not be a good metric for us. It also tells us that we have an imbalanced data set, and how we handle that depends on the classifier.

SVM has a [class_weight parameter](http://scikit-learn.org/stable/auto_examples/svm/plot_separating_hyperplane_unbalanced.html#sphx-glr-auto-examples-svm-plot-separating-hyperplane-unbalanced-py) that allows us to compensate.

Naive Bayes has a [complement class](http://people.csail.mit.edu/jrennie/papers/icml03-nb.pdf) formulation that allows us to compensate. It has a corresponding class in sklearn: [ComplementNB](http://scikit-learn.org/dev/modules/generated/sklearn.naive_bayes.ComplementNB.html).
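
As a rough illustration of what class-weight compensation means, the sketch below computes the weights that scikit-learn's `class_weight='balanced'` heuristic assigns, `n_samples / (n_classes * count)`. The class counts (4825 ham, 747 spam) are an assumption derived from the data set size and the ham ratio printed in the next cell; `balanced_class_weights` is a hypothetical helper, not part of this project.

```python
from collections import Counter


def balanced_class_weights(labels):
    """Weight each class by n_samples / (n_classes * class_count)."""
    counts = Counter(labels)
    n_samples = len(labels)
    n_classes = len(counts)
    return {cls: n_samples / (n_classes * count) for cls, count in counts.items()}


# Assumed counts: 0 = ham, 1 = spam (derived from 5572 messages and the ham ratio).
labels = [0] * 4825 + [1] * 747
weights = balanced_class_weights(labels)
print(weights)  # the rarer spam class receives the larger weight
```

A dictionary of this shape can be passed as `class_weight` to `SVC`. The SVM grid search later in this notebook takes a different route and searches a few hand-picked weights for class 0 only.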

In [3]:
ratio = sum(df.label == 0)/len(df.label)
print('Ratio of ham to total: ', ratio)

Ratio of ham to total:  0.8659368269921034


As said before, we need to choose a tokenizer that will capture the features we want: 

* Punctuation preservation
* Case preservation
* Full words in tokens

There is research on spam classification ([section 3.2](http://www.siefkes.net/ie/winnow-spam.pdf)) with some good options. Unfortunately these are all written as PCRE regular expressions, which Python's `re` module does not fully support (`\p{Z}` among other atoms). Luckily they aren't too hard to convert, given our data set (minimal control characters).

I chose the simplified CRM114. Look how it handles an ellipsis: we capture an extra feature of `'..'`!

In [4]:
print(df.text[0])
pprint(ut.scrm114_tokenizer(df.text[0]))
pprint(ut.eager_split_tokenizer(df.text[0]))

Go until jurong point, crazy.. Available only in bugis n great world la e buffet... Cine there got amore wat...
['Go',
 'until',
 'jurong',
 'point,',
 'crazy.',
 '.',
 'Available',
 'only',
 'in',
 'bugis',
 'n',
 'great',
 'world',
 'la',
 'e',
 'buffet.',
 '..',
 'Cine',
 'there',
 'got',
 'amore',
 'wat.',
 '..']
['Go',
 'until',
 'jurong',
 'point',
 '',
 'crazy',
 '',
 '',
 'Available',
 'only',
 'in',
 'bugis',
 'n',
 'great',
 'world',
 'la',
 'e',
 'buffet',
 '',
 '',
 '',
 'Cine',
 'there',
 'got',
 'amore',
 'wat',
 '',
 '',
 '']
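
The two tokenizers live in `utilities`, which is not shown in this notebook. The sketch below is a guess at ASCII-friendly regular expressions that reproduce the two token lists printed above for this one message. The function names and both patterns are assumptions for illustration, not the author's actual implementation.

```python
import re

# Assumed pattern: one non-space character, then word characters,
# then at most one more non-space character (keeps trailing punctuation).
SCRM114_LIKE = re.compile(r'\S\w*\S?')

# Assumed pattern: split eagerly on every single non-word character,
# which leaves empty strings between adjacent separators.
EAGER_SPLIT_LIKE = re.compile(r'\W')


def scrm114_like_tokenizer(text):
    return SCRM114_LIKE.findall(text)


def eager_split_like_tokenizer(text):
    return EAGER_SPLIT_LIKE.split(text)


sample = 'point, crazy.. Available'
print(scrm114_like_tokenizer(sample))      # ['point,', 'crazy.', '.', 'Available']
print(eager_split_like_tokenizer(sample))  # ['point', '', 'crazy', '', '', 'Available']
```

The first variant keeps punctuation attached to words and emits leftover punctuation as its own token, which is where the extra `'..'` feature comes from. The second discards punctuation entirely.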


# Modeling

## Classification comparison metric
As suggested in the dataset paper, we compare classifiers using the Matthews Correlation Coefficient (MCC), which has been [researched](https://www.ncbi.nlm.nih.gov/pmc/articles/PMC5456046/) as a metric well suited to unbalanced data.
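
To make the metric concrete, here is a minimal sketch of MCC computed directly from a binary confusion matrix. The helper name `mcc_from_confusion` is hypothetical. The first example uses the SVM confusion matrix reported later in this notebook; the second uses an "always predict ham" baseline on the same test split (3145 ham, 477 spam), with the usual convention that MCC is 0 when the denominator vanishes.

```python
import math


def mcc_from_confusion(tn, fp, fn, tp):
    """Matthews Correlation Coefficient for a 2x2 confusion matrix."""
    numerator = tp * tn - fp * fn
    denominator = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    # Convention: return 0 when any marginal is empty (denominator is 0).
    return numerator / denominator if denominator else 0.0


# SVM confusion matrix from this notebook: [[3136, 9], [74, 403]]
svm_mcc = mcc_from_confusion(tn=3136, fp=9, fn=74, tp=403)

# Baseline that labels every test message as ham.
baseline_mcc = mcc_from_confusion(tn=3145, fp=0, fn=477, tp=0)
baseline_accuracy = 3145 / (3145 + 477)

print(svm_mcc, baseline_mcc, baseline_accuracy)
```

The baseline reaches roughly 87% accuracy on this split while scoring an MCC of 0, which is why MCC is the more informative number here.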

In [5]:
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import SGDClassifier
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC
from sklearn.metrics import classification_report, confusion_matrix, matthews_corrcoef, make_scorer
from sklearn.ensemble import GradientBoostingClassifier, RandomForestClassifier, AdaBoostClassifier
from sklearn.naive_bayes import MultinomialNB


In [6]:
X_train, X_test, y_train, y_test = train_test_split(df.text, df.label, test_size=0.65, random_state=0)
data_args = [X_train, y_train, X_test, y_test]
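
The cells that follow call `ut.grid_search_analysis`, which is defined outside this notebook. Judging from the output it prints, a helper of that kind might look roughly like the sketch below. Everything here is an assumption: the name `grid_search_sketch`, the use of MCC as the cross-validation score, and the explicit 3 folds are guesses, not the author's code.

```python
from sklearn.metrics import (classification_report, confusion_matrix,
                             make_scorer, matthews_corrcoef)
from sklearn.model_selection import GridSearchCV


def grid_search_sketch(pipeline, parameters, X_train, y_train, X_test, y_test, cv=3):
    """Hypothetical stand-in for ut.grid_search_analysis."""
    # Assumption: candidates are ranked by cross-validated MCC.
    search = GridSearchCV(pipeline, parameters,
                          scoring=make_scorer(matthews_corrcoef), cv=cv)
    search.fit(X_train, y_train)

    # GridSearchCV refits the best candidate on all training data by default,
    # so predict() uses that refit model on the held-out test set.
    y_pred = search.predict(X_test)
    print('Best score: ', search.best_score_)
    print('Best params: ', search.best_params_)
    print('Confusion Matrix: ', confusion_matrix(y_test, y_pred))
    print(classification_report(y_test, y_pred))
    print('Matthews Correlation Coefficient: ', matthews_corrcoef(y_test, y_pred))
    return search
```

Under this reading, "Best score" is the mean cross-validated score on the training split, while the final MCC line is measured on the test split, so the two numbers need not agree.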

# SVM Training

In [9]:
pipeline_svm = Pipeline([
    ('vect', CountVectorizer()),
    ('clf', SVC(kernel='linear')),
])

parameters_svm = {
    'clf__C': [.7, .08, .9],
    'clf__class_weight': [{0: w * ratio} for w in [.192, .1925, .193]],
    'clf__gamma': [0.002, 0.003, 0.004],  # note: gamma is ignored by kernel='linear', so this only triples the grid
    'vect__ngram_range': [(1,1), (1,2), (1,3)],
    'vect__max_df': (.03, .1, .3),
    'vect__tokenizer': [ut.scrm114_tokenizer, ut.eager_split_tokenizer],
}

ut.grid_search_analysis(pipeline_svm, parameters_svm, *data_args)

Fitting 3 folds for each of 486 candidates, totalling 1458 fits


[Parallel(n_jobs=1)]: Done 1458 out of 1458 | elapsed: 13.6min finished


Best score:  0.9050908933742281
Best params:  {'clf__C': 0.7, 'clf__class_weight': {0: 0.16625987078248386}, 'clf__gamma': 0.002, 'vect__max_df': 0.1, 'vect__ngram_range': (1, 1), 'vect__tokenizer': <function eager_split_tokenizer at 0x0000025D93A05510>}
Confusion Matrix:  [[3136    9]
 [  74  403]]
             precision    recall  f1-score   support

        ham       0.98      1.00      0.99      3145
       spam       0.98      0.84      0.91       477

avg / total       0.98      0.98      0.98      3622

Matthews Correlation Coefficient:  0.8967709622737551
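
The candidate count in the log above follows directly from the size of the parameter grid: the search tries every combination of values, and each combination is fit once per fold. The sketch below reproduces that arithmetic with placeholder values standing in for the real weights and tokenizers; `count_candidates` is a hypothetical helper.

```python
from itertools import product


def count_candidates(grid):
    """Number of parameter combinations an exhaustive grid search would try."""
    return sum(1 for _ in product(*grid.values()))


# Same shape as parameters_svm; strings stand in for dicts and functions.
svm_grid = {
    'clf__C': [.7, .08, .9],
    'clf__class_weight': ['w1', 'w2', 'w3'],
    'clf__gamma': [0.002, 0.003, 0.004],
    'vect__ngram_range': [(1, 1), (1, 2), (1, 3)],
    'vect__max_df': (.03, .1, .3),
    'vect__tokenizer': ['scrm114', 'eager_split'],
}

n_folds = 3
n_candidates = count_candidates(svm_grid)
print(n_candidates, n_candidates * n_folds)

# Without the gamma axis, which a linear kernel does not use:
without_gamma = {k: v for k, v in svm_grid.items() if k != 'clf__gamma'}
print(count_candidates(without_gamma))
```

Since `gamma` has no effect on a linear-kernel `SVC`, dropping it from the grid would cut the number of fits to a third without changing which models are compared.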


# Multinomial Naive Bayes

In [8]:
prior_0 = sum(y_train==0)/len(y_train)
prior_1 = sum(y_train==1)/len(y_train)
print(prior_0, prior_1)

pipeline_mnb = Pipeline([
    ('vect', CountVectorizer()),
#     ('clf', MultinomialNB(class_prior = [prior_0, prior_1])),
    ('clf', MultinomialNB()),
])

parameters_mnb = {
    'clf__alpha': [.17, .2, .23],
    'vect__tokenizer': [ut.scrm114_tokenizer, ut.eager_split_tokenizer],
    'vect__ngram_range': [(1,1), (1,2), (1,3)],
    'vect__max_df': (.4, .5, .6),}

ut.grid_search_analysis(pipeline_mnb, parameters_mnb, *data_args)

0.8615384615384616 0.13846153846153847
Fitting 3 folds for each of 54 candidates, totalling 162 fits


[Parallel(n_jobs=1)]: Done 162 out of 162 | elapsed:   27.7s finished


Best score:  0.9366999215864544
Best params:  {'clf__alpha': 0.17, 'vect__max_df': 0.4, 'vect__ngram_range': (1, 2), 'vect__tokenizer': <function scrm114_tokenizer at 0x0000025D9388E840>}
Confusion Matrix:  [[3123   22]
 [  30  447]]
             precision    recall  f1-score   support

        ham       0.99      0.99      0.99      3145
       spam       0.95      0.94      0.95       477

avg / total       0.99      0.99      0.99      3622

Matthews Correlation Coefficient:  0.9368201198559103


# Random Forests

In [11]:
pipeline_rfc = Pipeline([
    ('vect', CountVectorizer()),
    ('clf', RandomForestClassifier()),
])

parameters_rfc = {
    'vect__tokenizer': [ut.scrm114_tokenizer, ut.eager_split_tokenizer],
    'vect__ngram_range': [(1,1), (1,2), (1,3)],
    'vect__max_df': (.03, .1, .3),}

ut.grid_search_analysis(pipeline_rfc, parameters_rfc, *data_args)

Fitting 3 folds for each of 18 candidates, totalling 54 fits
Best score:  0.7628836104451964
Best params:  {'vect__max_df': 0.03, 'vect__ngram_range': (1, 1), 'vect__tokenizer': <function scrm114_tokenizer at 0x0000025D9388E840>}
Confusion Matrix:  [[3144    1]
 [ 165  312]]
             precision    recall  f1-score   support

        ham       0.95      1.00      0.97      3145
       spam       1.00      0.65      0.79       477

avg / total       0.96      0.95      0.95      3622

Matthews Correlation Coefficient:  0.7868174926237372


[Parallel(n_jobs=1)]: Done  54 out of  54 | elapsed:   10.8s finished
