# NB classifier

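This notebook trains a multinomial Naive Bayes classifier to identify relevant tweets (POS). Several variable names (for example `svc_pl` and `svc_search`) use an `svc` prefix even though the classifier is `MultinomialNB`.

We use scikit-learn's implementation of the classifier and its cross-validation tools. http://scikit-learn.org/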

## Installation

To install all of the Python dependencies for this notebook in a virtual environment:

```bash
# create environment in directory named 'venv'
python -m venv venv
# or:
# virtualenv venv

# activate environment
source venv/bin/activate

# install dependencies
pip3 install -r requirements.txt
```

In [2]:
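# helpers used below, provided by the external class_utils.py file
from class_utils import parse_training_data, normalize_tweet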
import pickle
import numpy as np

from nltk.tokenize.casual import casual_tokenize
from nltk import word_tokenize

from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer
from sklearn.svm import SVC
from sklearn.linear_model import LogisticRegression
from sklearn.naive_bayes import MultinomialNB
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_validate
from sklearn.metrics import make_scorer, confusion_matrix, classification_report
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score, fbeta_score
from sklearn.preprocessing import LabelBinarizer
from sklearn.pipeline import Pipeline
from sklearn.model_selection import GridSearchCV, train_test_split

In [3]:
# globals
iteration = "iter3a"
model_filename = "models/best_svc_{}.pickle".format(iteration)

## Parse data sets

Here we parse data from our training files and then randomly select a portion to be held out for evaluation. The training set is used both to train the classifier and to select parameters using k-fold cross-validation.

The `parse_training_data()` function is provided in the external `class_utils.py` file.
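
As a minimal sketch of the label encoding and hold-out split described above, the following uses a small made-up corpus. The names `toy_docs` and `toy_targets` and their contents are illustrative assumptions, not data from the training files.

```python
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import LabelBinarizer

# Hypothetical stand-in for the output of parse_training_data()
toy_docs = ["tweet number %d" % i for i in range(20)]
toy_targets = ["POS" if i % 4 == 0 else "NEG" for i in range(20)]

# Map string labels to 0/1; classes are sorted, so NEG -> 0 and POS -> 1
toy_lb = LabelBinarizer()
toy_lb.fit(["NEG", "POS"])
toy_bin = toy_lb.transform(toy_targets).ravel()

# Hold out 10% of the samples; a fixed random_state makes the split repeatable
X_tr, X_te, y_tr, y_te = train_test_split(
    toy_docs, toy_bin, test_size=0.10, random_state=0)

print(len(X_tr), len(X_te))
```

Passing `stratify=toy_bin` to `train_test_split` would additionally keep the class ratio similar in both parts, which can matter for small, imbalanced data sets.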

In [4]:
# parse data from files
classes = ['NEG', 'POS']
docs, targets = parse_training_data(['NEG-{}.txt'.format(iteration), 'POS-{}.txt'.format(iteration)], classes)

# convert the targets array of strings to binary labels (0=NEG, 1=POS)
lb = LabelBinarizer(sparse_output=False)
lb.fit(classes)
bin_targets = lb.transform(targets).ravel()

# split data set into training and evaluation sets
# X_test/y_test are held out and not used during the
# k-fold training and parameter search below
#
# The percentage of samples to hold out is determined by the `test_size`
# parameter
# for this iteration, the holdout is only 10%
X_train, X_test, y_train, y_test = train_test_split(
    docs, bin_targets, test_size=0.10, random_state=0)

In [5]:
np.linspace(0.3, 1.0, 10)

array([ 0.3       ,  0.37777778,  0.45555556,  0.53333333,  0.61111111,
        0.68888889,  0.76666667,  0.84444444,  0.92222222,  1.        ])

## Create sklearn pipeline

Here we set up a scikit-learn pipeline that builds count vectors from the training sample vocabulary (`CountVectorizer`), reweights the counts by term frequency and inverse document frequency (`TfidfTransformer`), and trains a multinomial Naive Bayes classifier (`MultinomialNB`). http://scikit-learn.org/stable/modules/generated/sklearn.naive_bayes.MultinomialNB.html
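
The three pipeline stages can be sketched on a toy corpus as follows. The documents, labels, and the `toy_pl` name are made up for illustration, and the default parameters used here are not the ones chosen by the grid search in this notebook.

```python
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import Pipeline

# Made-up documents and labels (1 = relevant, 0 = not relevant)
toy_docs = [
    "flood warning downtown",
    "great pizza tonight",
    "river flood rising",
    "pizza party tonight",
]
toy_labels = [1, 0, 1, 0]

toy_pl = Pipeline([
    ("vect", CountVectorizer(ngram_range=(1, 2))),  # unigram and bigram counts
    ("tfidf", TfidfTransformer()),                  # reweight counts by tf-idf
    ("clf", MultinomialNB()),                       # classify the weighted vectors
])
toy_pl.fit(toy_docs, toy_labels)

# Each fitted step stays accessible by its name
vocab = toy_pl.named_steps["vect"].vocabulary_
print(sorted(vocab))
print(toy_pl.predict(["flood near the river"]))
```

Words and n-grams that never occurred in the training documents (here "near" and "the") are simply ignored at prediction time.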

We select parameters based on `fscore_prec`, a weighted F-score (F-beta). A beta below 1 favors precision and a beta above 1 favors recall. We also calculate accuracy, precision, recall, and F1 scores for each of the k-fold training sessions.
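
To make the effect of beta concrete, here is a minimal sketch of the F-beta formula, F = (1 + beta^2) * P * R / (beta^2 * P + R), in plain Python. The `f_beta` helper and the precision/recall values are illustrative assumptions, not results from this notebook.

```python
def f_beta(precision, recall, beta):
    """Compute the F-beta score from a precision/recall pair."""
    b2 = beta ** 2
    denom = b2 * precision + recall
    if denom == 0:
        return 0.0
    return (1 + b2) * precision * recall / denom

# Hypothetical classifier that is precise but misses many positives
p, r = 0.8, 0.5
for beta in (0.5, 1.0, 2.0):
    print("beta=%.1f  F=%.3f" % (beta, f_beta(p, r, beta)))
```

When precision is higher than recall, as in this example, the score falls as beta grows, because a larger beta gives recall more weight.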

Using a pipeline makes it easy to search a range of hyperparameters using sklearn's `GridSearchCV`. http://scikit-learn.org/stable/modules/grid_search.html
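
Grid parameters for a pipeline are addressed as `<step name>__<parameter name>`, and an exhaustive search fits one model per combination of values per fold. The sketch below counts and enumerates the combinations for a small hypothetical grid in plain Python; `toy_grid` and the fold count are assumptions chosen for illustration.

```python
from itertools import product
from math import prod

# Hypothetical grid using the pipeline's `<step>__<parameter>` naming
toy_grid = {
    "vect__max_df": [0.5, 0.75, 1.0],
    "vect__ngram_range": [(1, 1), (1, 2)],
    "clf__alpha": [0.05, 0.2],
}

# One candidate per combination of values
n_candidates = prod(len(values) for values in toy_grid.values())
n_folds = 3  # assumed number of cross-validation folds
print(n_candidates, "candidates,", n_candidates * n_folds, "fits")

# Enumerate the candidates the way an exhaustive search would
names = sorted(toy_grid)
candidates = [dict(zip(names, combo))
              for combo in product(*(toy_grid[n] for n in names))]
print(candidates[0])
```

Because the count is a product of the list lengths, adding one more value to any parameter multiplies the total training time accordingly.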

In [25]:
svc_pl = Pipeline([
    ('vect', CountVectorizer()),
    ('tfidf', TfidfTransformer()),
    ('clf', MultinomialNB())])

parameters = {
    'vect__preprocessor': [normalize_tweet],  # other options: normalize_simple, None
    'vect__max_df': np.linspace(0.3, 1.0, 10),
    'vect__tokenizer': [word_tokenize],  # other options: casual_tokenize, None
    'vect__stop_words': ['english', None],
    'vect__ngram_range': [(1, 1), (1, 2), (1, 3)],  # largest n-gram
    # NOTE: this is a single candidate whose value is the tuple (True, False),
    # not two candidates; use [True, False] to search both settings
    'tfidf__use_idf': [(True, False)],
    'clf__alpha': np.linspace(0.05, 0.2, 3),
}

# define the scores we want to calculate during each k-fold training
# NOTE: beta=2 weights recall more heavily than precision
fscore_prec = make_scorer(fbeta_score, beta=2)
scoring = {
    'accuracy': 'accuracy',
    'precision': 'precision',
    'recall': 'recall',
    'f1': 'f1',
    'fscore_prec': fscore_prec
}

# create the GridSearchCV object.
# by setting refit='fscore_prec', the model which maximizes that score
# will be selected and retrained on all training data.
svc_search = GridSearchCV(svc_pl, parameters, n_jobs=-1, verbose=1, scoring=scoring, refit='fscore_prec')

In [26]:
# Here we do the actual training
# Can take several minutes depending on the range of parameters given
# in the parameters dict above
svc_search.fit(X_train, y_train)

Fitting 3 folds for each of 180 candidates, totalling 540 fits


[Parallel(n_jobs=-1)]: Done  42 tasks      | elapsed:   40.8s
[Parallel(n_jobs=-1)]: Done 192 tasks      | elapsed:  2.4min
[Parallel(n_jobs=-1)]: Done 442 tasks      | elapsed:  5.7min
[Parallel(n_jobs=-1)]: Done 540 out of 540 | elapsed:  7.0min finished


GridSearchCV(cv=None, error_score='raise',
       estimator=Pipeline(memory=None,
     steps=[('vect', CountVectorizer(analyzer='word', binary=False, decode_error='strict',
        dtype=<class 'numpy.int64'>, encoding='utf-8', input='content',
        lowercase=True, max_df=1.0, max_features=None, min_df=1,
        ngram_range=(1, 1), preprocessor=None, stop_words=None,
        strip...inear_tf=False, use_idf=True)), ('clf', MultinomialNB(alpha=1.0, class_prior=None, fit_prior=True))]),
       fit_params=None, iid=True, n_jobs=-1,
       param_grid={'vect__preprocessor': [<function normalize_tweet at 0x106291488>], 'vect__max_df': array([ 0.3    ,  0.37778,  0.45556,  0.53333,  0.61111,  0.68889,
        0.76667,  0.84444,  0.92222,  1.     ]), 'vect__stop_words': ['english', None], 'tfidf__use_idf': [(True, False)], 'vect__tokenizer': [<function word_tokenize at 0x1167c8840>], 'vect__ngram_range': [(1, 1), (1, 2), (1, 3)], 'clf__alpha': array([ 0.05 ,  0.125,  0.2  ])},
       pre_di

In [19]:
# The parameters selected by the grid search
svc_search.best_params_

{'clf__alpha': 0.050000000000000003,
 'tfidf__use_idf': (True, False),
 'vect__max_df': 0.61111111111111116,
 'vect__ngram_range': (1, 3),
 'vect__preprocessor': <function class_utils.normalize_tweet>,
 'vect__stop_words': 'english',
 'vect__tokenizer': <function nltk.tokenize.word_tokenize>}

In [22]:
# print the average scores over the k training folds
fields = ['accuracy', 'precision', 'recall', 'f1', 'fscore_prec']

for f in fields:
    score = svc_search.cv_results_["mean_test_%s" % f][svc_search.best_index_]
    print("%s: %.3f" % (f, score))

accuracy: 0.815
precision: 0.773
recall: 0.565
f1: 0.649
fscore_prec: 0.613


In [23]:
# Get best model from grid search we ran in previous section
best_model = svc_search.best_estimator_

In [24]:
# use model to predict held out set (X_test) and print score table
# Note that in binary classification, accuracy is the same as the
# micro-averaged recall reported in the table
predictions = best_model.predict(X_test)
print(classification_report(y_test, predictions, target_names=classes))

             precision    recall  f1-score   support

        NEG       0.82      0.89      0.86        47
        POS       0.71      0.57      0.63        21

avg / total       0.79      0.79      0.79        68



beta=0.5

{'clf__alpha': 0.125,
 'tfidf__use_idf': (True, False),
 'vect__max_df': 0.29999999999999999,
 'vect__ngram_range': (1, 3),
 'vect__preprocessor': <function class_utils.normalize_tweet>,
 'vect__stop_words': 'english',
 'vect__tokenizer': <function nltk.tokenize.word_tokenize>}
 
accuracy: 0.818
precision: 0.856
recall: 0.489
f1: 0.619
fscore_prec: 0.741

             precision    recall  f1-score   support

        NEG       0.77      0.94      0.85        47
        POS       0.73      0.38      0.50        21

avg / total       0.76      0.76      0.74        68

beta=1

{'clf__alpha': 0.050000000000000003,
 'tfidf__use_idf': (True, False),
 'vect__max_df': 0.61111111111111116,
 'vect__ngram_range': (1, 3),
 'vect__preprocessor': <function class_utils.normalize_tweet>,
 'vect__stop_words': 'english',
 'vect__tokenizer': <function nltk.tokenize.word_tokenize>}
 
accuracy: 0.815
precision: 0.773
recall: 0.565
f1: 0.649
fscore_prec: 0.649

             precision    recall  f1-score   support

        NEG       0.82      0.89      0.86        47
        POS       0.71      0.57      0.63        21

avg / total       0.79      0.79      0.79        68

beta=1.5


accuracy: 0.815
precision: 0.773
recall: 0.565
f1: 0.649
fscore_prec: 0.613

             precision    recall  f1-score   support

        NEG       0.82      0.89      0.86        47
        POS       0.71      0.57      0.63        21

avg / total       0.79      0.79      0.79        68

## Results

We evaluate the best classifier from the grid search by running it on our held-out set.
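
As a small illustration of how to read a binary confusion matrix, the sketch below builds one by hand in plain Python, using the same layout that scikit-learn's `confusion_matrix` uses (rows are true classes, columns are predicted classes). The labels are hypothetical and are not this notebook's predictions.

```python
# Hypothetical true and predicted labels (0 = NEG, 1 = POS)
y_true = [0, 0, 0, 0, 1, 1, 1, 0, 1, 0]
y_pred = [0, 0, 1, 0, 1, 0, 1, 0, 1, 0]

# Rows are true classes, columns are predicted classes
matrix = [[0, 0], [0, 0]]
for t, p in zip(y_true, y_pred):
    matrix[t][p] += 1

# Unpack the four cells: true/false negatives and positives
(tn, fp), (fn, tp) = matrix
precision = tp / (tp + fp)
recall = tp / (tp + fn)
accuracy = (tp + tn) / len(y_true)

print(matrix)
print("precision=%.2f recall=%.2f accuracy=%.2f" % (precision, recall, accuracy))
```

The off-diagonal cells show which kind of error dominates: the top-right cell counts irrelevant tweets flagged as relevant, and the bottom-left cell counts relevant tweets that were missed.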

In [None]:
# Print confusion matrix
# http://scikit-learn.org/stable/modules/generated/sklearn.metrics.confusion_matrix.html
print(confusion_matrix(y_test, predictions))

## Persist model

Take our best model, retrain it on the entire dataset (including the held-out set used for evaluation above), and persist it to disk.
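
One point worth keeping in mind when loading the model later: `pickle` stores plain functions by reference (module and name), not by their code. Since the grid search sets functions such as `normalize_tweet` as pipeline parameters, the loading environment will most likely need `class_utils` and `nltk` to be importable. The sketch below shows that behavior with a hypothetical stand-in object; `toy_model` and the use of `html.unescape` as a preprocessor are assumptions for illustration only.

```python
import html
import os
import pickle
import tempfile

# Hypothetical stand-in for a fitted pipeline that holds a function reference
toy_model = {"preprocessor": html.unescape, "alpha": 0.05}

# Write the object to a temporary file, as the notebook does for the model
path = os.path.join(tempfile.mkdtemp(), "toy_model.pickle")
with open(path, "wb") as f:
    pickle.dump(toy_model, f)

# Read it back, as a later script or service would
with open(path, "rb") as f:
    restored = pickle.load(f)

# The function comes back as the same object, looked up by module and name
print(restored["preprocessor"] is html.unescape)
print(restored["preprocessor"]("cats &amp; dogs"))
```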

In [None]:
# retrain on all data
best_model.fit(docs, bin_targets)

In [None]:
# save to disk
with open(model_filename, 'wb') as f:
    pickle.dump(best_model, f)