# Context

## Understanding the data
This data is quite messy. Because this is spam classification rather than, say, topic classification, we need to adjust our approach accordingly: odd punctuation, misspelled words, erratic patterns, and odd capitalization are usually features here, not noise.

The data set also has only 5,572 messages. This is rather small for advanced methods like deep learning, which **might** overfit. Because of this, I will spend my time on more traditional methods and look into deep learning only if I have time.

## Goal
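The [paper](http://www.dt.fee.unicamp.br/~tiago/smsspamcollection/doceng11.pdf) that comes with the data set reports the following top results, where SC% is spam caught, BH% is blocked hams, Acc% is accuracy, and MCC is the Matthews correlation coefficient: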

| Classifier          | SC%    | BH%  | Acc%   | MCC   |
|---------------------|--------|------|--------|-------|
| SVM + tok1          | 83.10  | 0.18 | 97.64  | 0.893 |
| Boosted NB + tok2   | 84.48  | 0.53 | 97.50  | 0.887 |
| Boosted C4.5 + tok2 | 82.91  | 0.29 | 97.50  | 0.887 |

# Status
By the time the data gets to this notebook, it has:
* Been downloaded and unzipped
* Been read into a pandas DataFrame

# Approach
There are a few factors at play here and we want to optimize each step.
## Tokenization


## Modeling
### Model Choice
### Hyper-parameter optimization

In [1]:
%matplotlib inline
import pandas as pd
from pathlib import Path
import sys
import seaborn as sns
import re

project_dir = Path.cwd().parent
sys.path.append(str(project_dir/'src'))

# These are utilities that I created to reduce notebook clutter
from make_dataframe import make_dataframe, master_data_handler
from utilities import scrm114_tokenizer, eager_split_tokenizer

First we get our data and encode the labels for use with our classifiers. Note the mapping for later: ham is 0 and spam is 1.

In [2]:
master_data_handler()
df = make_dataframe()
df.label = df.label.map({'ham': 0, 'spam': 1})
df.head()

Unnamed: 0,label,text
0,0,"Go until jurong point, crazy.. Available only ..."
1,0,Ok lar... Joking wif u oni...
2,1,Free entry in 2 a wkly comp to win FA Cup fina...
3,0,U dun say so early hor... U c already then say...
4,0,"Nah I don't think he goes to usf, he lives aro..."


It's interesting to note that, with no machine learning at all, we can get 86.6% accuracy just by labeling every message as ham! This is a good warning that accuracy will probably not be a good metric for us. It also tells us that the data set is unbalanced, which we need to handle in a way that suits each classifier.

SVM has a [class_weight parameter](http://scikit-learn.org/stable/auto_examples/svm/plot_separating_hyperplane_unbalanced.html#sphx-glr-auto-examples-svm-plot-separating-hyperplane-unbalanced-py) that allows us to compensate.

Naive Bayes has a [complement class](http://people.csail.mit.edu/jrennie/papers/icml03-nb.pdf) variant that allows us to compensate. It has a corresponding class in sklearn: [ComplementNB](http://scikit-learn.org/dev/modules/generated/sklearn.naive_bayes.ComplementNB.html)
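A minimal sketch of both options follows, assuming scikit-learn 0.20 or later (the first release that ships `ComplementNB`). The four toy messages are invented for illustration and are not from the SMS data set, and `class_weight='balanced'` is just one built-in way to set the weights, not necessarily the weighting used later in this notebook.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import ComplementNB
from sklearn.pipeline import Pipeline
from sklearn.svm import SVC

# Invented toy corpus (0 = ham, 1 = spam), deliberately unbalanced.
texts = [
    "are we still meeting for lunch",
    "call me when you get home",
    "see you at the gym tonight",
    "WIN a FREE prize now!!! text WIN to claim",
]
labels = [0, 0, 0, 1]

# SVM: 'balanced' scales C per class by n_samples / (n_classes * class_count).
svm = Pipeline([
    ("vect", CountVectorizer(lowercase=False)),
    ("clf", SVC(kernel="linear", class_weight="balanced")),
])

# Naive Bayes: ComplementNB estimates each class from the other classes' counts.
cnb = Pipeline([
    ("vect", CountVectorizer(lowercase=False)),
    ("clf", ComplementNB()),
])

svm_pred = svm.fit(texts, labels).predict(texts)
cnb_pred = cnb.fit(texts, labels).predict(texts)
print(svm_pred, cnb_pred)
```

Both pipelines expose the same `fit`/`predict` interface, so either one can be dropped into a grid search without further changes.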

In [3]:
ratio = sum(df.label == 0)/len(df.label)
print('Ratio of ham to total: ', ratio)

Ratio of ham to total:  0.8659368269921034
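To make the warning about accuracy concrete, here is a minimal sketch of the always-ham baseline scored with both accuracy and MCC. The counts (4,825 ham, 747 spam) are derived from the ratio printed above and the 5,572-message total. The helper `mcc` is a hypothetical stand-in written for illustration; it returns 0 when the denominator vanishes, which is the convention scikit-learn's `matthews_corrcoef` also follows.

```python
import math

def mcc(tn, fp, fn, tp):
    """Matthews correlation coefficient from binary confusion counts."""
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    return 0.0 if denom == 0 else (tp * tn - fp * fn) / denom

# Counts implied by the notebook: 5572 messages, ham ratio ~0.8659.
n_ham, n_spam = 4825, 747

# "Always predict ham": every ham is a true negative, every spam a false negative.
tn, fp, fn, tp = n_ham, 0, n_spam, 0

baseline_accuracy = (tn + tp) / (n_ham + n_spam)
baseline_mcc = mcc(tn, fp, fn, tp)

print(f"accuracy = {baseline_accuracy:.4f}")
print(f"MCC      = {baseline_mcc:.4f}")
```

The baseline looks strong on accuracy but scores zero on MCC, which is why MCC is the more informative metric for this data.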


As noted above, we need a tokenizer that captures the features we want:

* Punctuation preservation
* Case preservation
* Full words in tokens

There is research in spam classification ([section 3.2](http://www.siefkes.net/ie/winnow-spam.pdf)) with some good options. Unfortunately, these are all written as PCRE regexes, which Python's `re` module does not fully support (`\p{Z}` among other atoms). Luckily, they aren't too hard to convert, given our data set (minimal control characters).
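As a small illustration of the conversion problem, the sketch below shows Python's built-in `re` module rejecting an illustrative PCRE-style class that uses `\p{Z}`, then falls back to `\S` as a rough stand-in. Treating `\S` as equivalent is an assumption that only holds loosely: it excludes whitespace but still matches non-whitespace control characters, which is tolerable here because the data contains few of them.

```python
import re

# PCRE-style Unicode property classes are not understood by the stdlib `re`.
try:
    re.compile(r"[^\p{Z}\p{C}]+")
    pcre_supported = True
except re.error:
    pcre_supported = False

# Rough stand-in (assumption): "not a separator or control char" ~ "not whitespace".
approx_token = re.compile(r"\S+")

print(pcre_supported)
print(approx_token.findall("Ok lar...\tJoking wif u oni..."))
```

The third-party `regex` package does understand `\p{...}` classes, but sticking to `re` keeps the tokenizers dependency-free.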

I chose the simplified CRM114 tokenizer. Look at how it handles an ellipsis: we capture an extra feature of `'..'`!

In [4]:
print(df.text[0])
scrm114_tokenizer(df.text[0])

Go until jurong point, crazy.. Available only in bugis n great world la e buffet... Cine there got amore wat...


['Go',
 'until',
 'jurong',
 'point,',
 'crazy.',
 '.',
 'Available',
 'only',
 'in',
 'bugis',
 'n',
 'great',
 'world',
 'la',
 'e',
 'buffet.',
 '..',
 'Cine',
 'there',
 'got',
 'amore',
 'wat.',
 '..']
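The `scrm114_tokenizer` helper lives in the author's `utilities` module and is not shown in this notebook. As a hypothetical approximation, the sketch below uses the pattern `\S\w*\S?` (one non-space character, then a run of word characters, then at most one more non-space character). This pattern is an assumption chosen because it reproduces the token list above; the real helper may differ.

```python
import re

# Hypothetical stand-in for the author's scrm114_tokenizer (not the real helper).
_TOKEN = re.compile(r"\S\w*\S?")

def approx_scrm114_tokenizer(text):
    """Split text into tokens that keep case and trailing punctuation."""
    return _TOKEN.findall(text)

sample = ("Go until jurong point, crazy.. Available only in bugis n great "
          "world la e buffet... Cine there got amore wat...")
tokens = approx_scrm114_tokenizer(sample)
print(tokens)
```

Because each word absorbs at most one trailing punctuation mark, the rest of a run such as `..` or `...` becomes its own token, which is how the ellipsis turns into a separate feature.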

In [21]:
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import SGDClassifier
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC
from sklearn.metrics import classification_report, confusion_matrix, matthews_corrcoef, make_scorer
from sklearn.preprocessing import LabelEncoder
from sklearn.ensemble import GradientBoostingClassifier, RandomForestClassifier, AdaBoostClassifier
from sklearn.naive_bayes import MultinomialNB


ImportError: cannot import name 'ComplementNB'

In [6]:
X_train, X_test, y_train, y_test = train_test_split(df.text, df.label, test_size=0.65, random_state=0)

# SVM Training

In [7]:
pipeline_svm = Pipeline([
    ('vect', CountVectorizer(tokenizer=scrm114_tokenizer)),
#     ('clf', MultinomialNB()),
    ('clf', SVC(kernel='linear')),
])

In [8]:
mcc_scorer = make_scorer(matthews_corrcoef)
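parameters_svm = {
    'clf__C': [.7, .08, .9],
    # Only ham (class 0) is re-weighted; spam keeps the default weight of 1.
    'clf__class_weight': [{0: w * ratio} for w in [.192, .1925, .193]],
    # gamma is ignored by the linear kernel, so these values only repeat fits.
    'clf__gamma': [0.002, 0.003, 0.004],
    'vect__ngram_range': [(1,1), (1,2), (1,3)],
    'vect__max_df': (.03, .1, .3),}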

gs_clf_svm = GridSearchCV(pipeline_svm, parameters_svm,
                          verbose=1, scoring=mcc_scorer)
gs_clf_svm = gs_clf_svm.fit(X_train, y_train)

Fitting 3 folds for each of 243 candidates, totalling 729 fits


[Parallel(n_jobs=1)]: Done 729 out of 729 | elapsed:  7.3min finished


In [9]:
print('Best score: ', gs_clf_svm.best_score_)
print('Best params: ', gs_clf_svm.best_params_)

Best score:  0.8962739070169965
Best params:  {'clf__C': 0.08, 'clf__class_weight': {0: 0.16625987078248386}, 'clf__gamma': 0.002, 'vect__max_df': 0.1, 'vect__ngram_range': (1, 1)}


In [10]:
y_true_svm, y_pred_svm = y_test, gs_clf_svm.predict(X_test)
print(confusion_matrix(y_true_svm, y_pred_svm))

[[3126   19]
 [  60  417]]


In [11]:
print(classification_report(y_true_svm, y_pred_svm, target_names=['ham', 'spam']))

             precision    recall  f1-score   support

        ham       0.98      0.99      0.99      3145
       spam       0.96      0.87      0.91       477

avg / total       0.98      0.98      0.98      3622



In [12]:
matthews_corrcoef(y_true_svm, y_pred_svm)

0.9022136837194339
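For reference, the MCC reported above can be recomputed by hand from the confusion matrix as `(TP*TN - FP*FN) / sqrt((TP+FP)(TP+FN)(TN+FP)(TN+FN))`. The sketch below plugs in the four counts printed earlier, with spam as the positive class; the function name `mcc_from_counts` is illustrative.

```python
import math

def mcc_from_counts(tn, fp, fn, tp):
    """Binary Matthews correlation coefficient from confusion-matrix counts."""
    numerator = tp * tn - fp * fn
    denominator = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    return numerator / denominator if denominator else 0.0

# Rows are true labels, columns are predictions: [[TN, FP], [FN, TP]].
svm_confusion = [[3126, 19], [60, 417]]
(tn, fp), (fn, tp) = svm_confusion

svm_mcc = mcc_from_counts(tn, fp, fn, tp)
print(round(svm_mcc, 4))
```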

# Multinomial Naive Bayes

In [38]:
prior_0 = sum(y_train==0)/len(y_train)
prior_1 = sum(y_train==1)/len(y_train)
print(prior_0, prior_1)

pipeline_mnb = Pipeline([
    ('vect', CountVectorizer()),
#     ('clf', MultinomialNB(class_prior = [prior_0, prior_1])),
    ('clf', MultinomialNB()),
])

0.8615384615384616 0.13846153846153847


In [54]:
mcc_scorer = make_scorer(matthews_corrcoef)
parameters_mnb = {
    'clf__alpha': [.17, .2, .23],
    'vect__tokenizer': [scrm114_tokenizer, eager_split_tokenizer],
    'vect__ngram_range': [(1,1), (1,2)],
    'vect__max_df': (.4, .5, .6),}

gs_clf_mnb = GridSearchCV(pipeline_mnb, parameters_mnb,
                          verbose=1, scoring=mcc_scorer)
gs_clf_mnb = gs_clf_mnb.fit(X_train, y_train)

Fitting 3 folds for each of 36 candidates, totalling 108 fits


[Parallel(n_jobs=1)]: Done 108 out of 108 | elapsed:   11.6s finished


In [55]:
print('Best score: ', gs_clf_mnb.best_score_)
print('Best params: ', gs_clf_mnb.best_params_)

Best score:  0.9366999215864544
Best params:  {'clf__alpha': 0.17, 'vect__max_df': 0.4, 'vect__ngram_range': (1, 2), 'vect__tokenizer': <function scrm114_tokenizer at 0x000001AB1175B7B8>}


In [56]:
y_true_mnb, y_pred_mnb = y_test, gs_clf_mnb.predict(X_test)
print(confusion_matrix(y_true_mnb, y_pred_mnb))

[[3123   22]
 [  30  447]]


In [57]:
print(classification_report(y_true_mnb, y_pred_mnb, target_names=['ham', 'spam']))

             precision    recall  f1-score   support

        ham       0.99      0.99      0.99      3145
       spam       0.95      0.94      0.95       477

avg / total       0.99      0.99      0.99      3622



In [58]:
matthews_corrcoef(y_true_mnb, y_pred_mnb)

0.9368201198559103

# Complement Naive Bayes

In [25]:
from bayes.classifiers import ComplementNB

ModuleNotFoundError: No module named 'cnb'

In [13]:
pipeline_mnb = Pipeline([
    ('vect', CountVectorizer(tokenizer=scrm114_tokenizer)),
    ('clf', MultinomialNB()),
])

In [14]:
mcc_scorer = make_scorer(matthews_corrcoef)
parameters_mnb = {
    'vect__ngram_range': [(1,1), (1,2), (1,3)],
    'vect__max_df': (.03, .1, .3),}

gs_clf_mnb = GridSearchCV(pipeline_mnb, parameters_mnb,
                          verbose=1, scoring=mcc_scorer)
gs_clf_mnb = gs_clf_mnb.fit(X_train, y_train)

Fitting 3 folds for each of 9 candidates, totalling 27 fits


[Parallel(n_jobs=1)]: Done  27 out of  27 | elapsed:    4.2s finished


In [15]:
print('Best score: ', gs_clf_mnb.best_score_)
print('Best params: ', gs_clf_mnb.best_params_)

Best score:  0.9191354027879277
Best params:  {'vect__max_df': 0.03, 'vect__ngram_range': (1, 2)}


In [16]:
y_true_mnb, y_pred_mnb = y_test, gs_clf_mnb.predict(X_test)
print(confusion_matrix(y_true_mnb, y_pred_mnb))

[[3135   10]
 [  33  444]]


In [17]:
print(classification_report(y_true_mnb, y_pred_mnb, target_names=['ham', 'spam']))

             precision    recall  f1-score   support

        ham       0.99      1.00      0.99      3145
       spam       0.98      0.93      0.95       477

avg / total       0.99      0.99      0.99      3622



In [18]:
matthews_corrcoef(y_true_mnb, y_pred_mnb)

0.9473872018496066
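To compare these runs with the table from the paper, the confusion matrices printed in this notebook can be converted to the same columns. This is a sketch under the assumption that SC% means spam caught (spam recall) and BH% means blocked hams (ham false-positive rate). The helper name `paper_metrics` is illustrative, and the third entry is keyed by its section heading even though the pipeline in that section fits `MultinomialNB`. The train/test split here may not match the paper's, so the comparison is only indicative.

```python
import math

def paper_metrics(confusion):
    """Return (SC%, BH%, Acc%, MCC) for a [[TN, FP], [FN, TP]] matrix."""
    (tn, fp), (fn, tp) = confusion
    sc = 100 * tp / (tp + fn)   # share of spam caught
    bh = 100 * fp / (fp + tn)   # share of ham wrongly blocked
    acc = 100 * (tp + tn) / (tp + tn + fp + fn)
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    mcc = (tp * tn - fp * fn) / denom if denom else 0.0
    return sc, bh, acc, mcc

# Confusion matrices copied from the outputs in this notebook.
runs = {
    "SVM": [[3126, 19], [60, 417]],
    "Multinomial NB": [[3123, 22], [30, 447]],
    "Complement NB section": [[3135, 10], [33, 444]],
}

results = {name: paper_metrics(cm) for name, cm in runs.items()}
for name, (sc, bh, acc, mcc) in results.items():
    print(f"{name:<22} SC%={sc:6.2f} BH%={bh:5.2f} Acc%={acc:6.2f} MCC={mcc:.3f}")
```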