# Naive Bayes classifier

This notebook trains a multinomial naive Bayes classifier to separate relevant tweets (POS) from the rest (NEG).

We use scikit-learn's implementation of naive Bayes and its cross-validation tools: http://scikit-learn.org/

## Installation

To install all of the Python dependencies for this notebook in a virtual environment:

```bash
# create environment in directory named 'venv'
python -m venv venv
# or:
# virtualenv venv

# activate environment
source venv/bin/activate

# install dependencies
pip3 install -r requirements.txt
```

In [1]:
from class_utils import *
import numpy as np
import pickle

from nltk.tokenize.casual import casual_tokenize
from nltk import word_tokenize

from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer
from sklearn.naive_bayes import MultinomialNB
from sklearn.model_selection import cross_validate
from sklearn.metrics import make_scorer, confusion_matrix, classification_report
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score, fbeta_score
from sklearn.preprocessing import LabelBinarizer
from sklearn.pipeline import Pipeline
from sklearn.model_selection import GridSearchCV, train_test_split

In [2]:
# globals
iteration = "iter3a"
model_filename = "models/best_nb{}.pickle".format(iteration)

## Parse data sets

Here we parse the labeled data from our training files and then randomly hold out a portion for evaluation. The remaining training set is used both to train the naive Bayes classifier and to select hyperparameters using k-fold cross-validation.

The `parse_training_data()` function is provided in the external `class_utils.py` file.

In [3]:
# parse data from files
classes = ['NEG', 'POS']
docs, targets = parse_training_data(['NEG-{}.txt'.format(iteration), 'POS-{}.txt'.format(iteration)], classes)

# convert the targets array of strings to binary labels (0=NEG, 1=POS)
lb = LabelBinarizer(sparse_output=False)
lb.fit(classes)
bin_targets = lb.transform(targets).ravel()

# split data set into training and evaluation sets
# X_test/y_test are held out and not used during the
# k-fold training and parameter search below
#
# The fraction of samples to hold out is determined by the `test_size`
# parameter
X_train, X_test, y_train, y_test = train_test_split(
    docs, bin_targets, test_size=0.10, random_state=0)
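
A minimal sketch of the label encoding and hold-out split on a toy dataset. The documents, labels, and `toy_` variable names below are made up for illustration and do not come from the training files. It shows that `LabelBinarizer` returns a column vector when there are two classes, which is why the cell above calls `.ravel()`, and how `test_size` sets the size of the held-out set.

```python
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import LabelBinarizer

# toy data, invented for this example
toy_classes = ['NEG', 'POS']
toy_docs = ['doc %d' % i for i in range(10)]
toy_targets = ['NEG', 'POS'] * 5

toy_lb = LabelBinarizer()
toy_lb.fit(toy_classes)

# for two classes, transform() returns shape (n_samples, 1)
column = toy_lb.transform(toy_targets)
toy_bin = column.ravel()  # flatten to shape (n_samples,)

# classes_ is sorted, so NEG -> 0 and POS -> 1
print(toy_lb.classes_, column.shape, toy_bin.shape)

# hold out 20% of the samples; random_state makes the split repeatable
toy_X_train, toy_X_test, toy_y_train, toy_y_test = train_test_split(
    toy_docs, toy_bin, test_size=0.20, random_state=0)
print(len(toy_X_train), len(toy_X_test))
```

Fixing `random_state` means the same samples are held out on every run, which keeps evaluation numbers comparable between runs.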

## Create sklearn pipeline

Here we set up a scikit-learn pipeline that builds count vectors over the training vocabulary (`CountVectorizer`), reweights those counts by TF-IDF (`TfidfTransformer`), and trains a naive Bayes classifier (`MultinomialNB`). See http://scikit-learn.org/stable/modules/pipeline.html

We select parameters based on `fscore_prec`, an F-beta score that favors precision (beta < 1; here beta = 0.5). We also calculate accuracy, precision, recall, and F1 scores for each of the k cross-validation folds.
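
To make the effect of beta concrete, here is a small sketch that computes the F-beta score directly from a precision and a recall value, using the standard definition F_beta = (1 + beta^2) * P * R / (beta^2 * P + R). The helper name `f_beta` and the example numbers are illustrative and not part of the notebook.

```python
def f_beta(precision, recall, beta):
    """F-beta score from precision and recall (standard definition)."""
    if precision == 0 and recall == 0:
        return 0.0
    b2 = beta ** 2
    return (1 + b2) * precision * recall / (b2 * precision + recall)

# a hypothetical classifier with high precision but low recall
p, r = 0.8, 0.4
for beta in (0.5, 1.0, 2.0):
    print("beta=%.1f: %.3f" % (beta, f_beta(p, r, beta)))
```

When precision is higher than recall, as in this example, the beta = 0.5 score is the largest of the three because it weights precision more heavily than recall.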

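Using a pipeline makes it easy to search a range of hyperparameters with sklearn's `GridSearchCV`: each key in the parameter grid names a pipeline step and one of its parameters, joined by a double underscore (for example `clf__alpha`). See http://scikit-learn.org/stable/modules/grid_search.html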

In [4]:
nb_pl = Pipeline([
    ('vect', CountVectorizer()),
    ('tfidf', TfidfTransformer()),
    ('clf', MultinomialNB()),
])

parameters = {
    'vect__preprocessor': [normalize_tweet],  # [normalize_tweet, normalize_simple, None],
    'vect__max_df': np.linspace(0.4, 1.0, 10),
    'vect__tokenizer': [casual_tokenize, word_tokenize, None],
    # 'vect__min_df': [1, 2],
    'vect__stop_words': ['english'],  # [None, 'english'],
    # 'vect__max_features': (None, 5000, 10000, 50000),
    'vect__ngram_range': [(1, 3)],  # ((1, 1), (1, 2), (1, 3)),  # largest n-gram
    'tfidf__use_idf': [True],  # (True, False),
    # 'tfidf__norm': ('l1', 'l2'),
    'clf__alpha': np.linspace(0.05, 0.2, 10),
}

# define the scores we want to calculate during each k-fold training
fscore_prec = make_scorer(fbeta_score, beta=0.5)
scoring = {
    'accuracy': 'accuracy',
    'precision': 'precision',
    'recall': 'recall',
    'f1': 'f1',
    'fscore_prec': fscore_prec
}

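# create the GridSearchCV object.
# by setting refit='fscore_prec', the model which maximizes that score
# will be selected and retrained on all training data.
# cv is left at its default (3 folds in the run below; newer
# scikit-learn versions default to 5).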
nb_search = GridSearchCV(nb_pl, parameters, n_jobs=-1, verbose=1, scoring=scoring, refit='fscore_prec')

In [5]:
# Here we do the actual training and parameter search
# Depending on the number of parameters defined above,
# this can take several minutes.
nb_search.fit(X_train, y_train)

Fitting 3 folds for each of 300 candidates, totalling 900 fits


[Parallel(n_jobs=-1)]: Done  42 tasks      | elapsed:   18.7s
[Parallel(n_jobs=-1)]: Done 192 tasks      | elapsed:  1.3min
[Parallel(n_jobs=-1)]: Done 442 tasks      | elapsed:  2.8min
[Parallel(n_jobs=-1)]: Done 792 tasks      | elapsed:  5.0min
[Parallel(n_jobs=-1)]: Done 900 out of 900 | elapsed:  5.7min finished


GridSearchCV(cv=None, error_score='raise',
       estimator=Pipeline(memory=None,
     steps=[('vect', CountVectorizer(analyzer='word', binary=False, decode_error='strict',
        dtype=<class 'numpy.int64'>, encoding='utf-8', input='content',
        lowercase=True, max_df=1.0, max_features=None, min_df=1,
        ngram_range=(1, 1), preprocessor=None, stop_words=None,
        strip...inear_tf=False, use_idf=True)), ('clf', MultinomialNB(alpha=1.0, class_prior=None, fit_prior=True))]),
       fit_params=None, iid=True, n_jobs=-1,
       param_grid={'tfidf__use_idf': [True], 'clf__alpha': array([ 0.05   ,  0.06667,  0.08333,  0.1    ,  0.11667,  0.13333,
        0.15   ,  0.16667,  0.18333,  0.2    ]), 'vect__max_df': array([ 0.4    ,  0.46667,  0.53333,  0.6    ,  0.66667,  0.73333,
        0.8    ,  0.86667,  0.93333,  1.     ]), ...enizer': [<function casual_tokenize at 0x1094bb268>, <function word_tokenize at 0x109529048>, None]},
       pre_dispatch='2*n_jobs', refit='fscore_prec
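
The "300 candidates" and "900 fits" in the log above follow from the grid: every combination of the listed values is tried, and each candidate is fitted once per fold. A rough sketch of that arithmetic using only the standard library follows. The placeholder strings and `range` lists stand in for the real functions and values, since only the number of options per parameter matters here.

```python
from itertools import product

# number of options per hyperparameter, mirroring the grid above
grid = {
    'vect__preprocessor': ['normalize_tweet'],
    'vect__max_df': list(range(10)),        # stands in for np.linspace(0.4, 1.0, 10)
    'vect__tokenizer': ['casual_tokenize', 'word_tokenize', None],
    'vect__stop_words': ['english'],
    'vect__ngram_range': [(1, 3)],
    'tfidf__use_idf': [True],
    'clf__alpha': list(range(10)),          # stands in for np.linspace(0.05, 0.2, 10)
}

# one candidate per element of the Cartesian product of all options
candidates = [dict(zip(grid, values)) for values in product(*grid.values())]
n_folds = 3  # as reported in the log above

print(len(candidates), len(candidates) * n_folds)
```

Because the count is a product, adding a second option to any single-valued parameter (for example `tfidf__use_idf`) doubles the number of fits. scikit-learn's `ParameterGrid` performs the same kind of enumeration on a real grid.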

In [6]:
# The parameters selected by the grid search
nb_search.best_params_

{'clf__alpha': 0.15000000000000002,
 'tfidf__use_idf': True,
 'vect__max_df': 0.40000000000000002,
 'vect__ngram_range': (1, 3),
 'vect__preprocessor': <function class_utils.normalize_tweet>,
 'vect__stop_words': 'english',
 'vect__tokenizer': <function nltk.tokenize.word_tokenize>}

In [7]:
# print the average scores over the k training folds
fields = ['accuracy', 'precision', 'recall', 'f1', 'fscore_prec']

for f in fields:
    score = nb_search.cv_results_["mean_test_%s" % f][nb_search.best_index_]
    print("%s: %.3f" % (f, score))

accuracy: 0.813
precision: 0.889
recall: 0.446
f1: 0.590
fscore_prec: 0.737


## Results

We check how well the model generalizes by running the best classifier from the grid search on our held-out set.

In [8]:
# Get best model from grid search we ran in previous section
best_model = nb_search.best_estimator_

In [9]:
# first try predicting two synthetic tweets using our best model
tweet_neg = 'the wise man bowed his head solemnly and spoke: "theres actually zero difference between good & bad things. you imbecile. you fucking moron"'
tweet_pos = "@user something about opting out of testing assessment #SAT #optout"

pred_probs = best_model.predict_proba([tweet_neg, tweet_pos])
pred_class = best_model.predict([tweet_neg, tweet_pos])
pred_labels = lb.inverse_transform(pred_class).tolist()
print("Probabilities: \n", pred_probs, "\n\n", "Classes: ", pred_labels)

Probabilities: 
 [[ 0.77028252  0.22971748]
 [ 0.69710313  0.30289687]] 

 Classes:  ['NEG', 'NEG']
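
Each row of the probability array corresponds to one tweet, and the columns follow the order of the classifier's `classes_` attribute, so with the 0=NEG, 1=POS encoding the second column is P(POS). A minimal sketch on a made-up corpus follows. The toy documents, labels, and `toy_` names are invented for illustration, and the pipeline uses default settings rather than the tuned parameters from the grid search.

```python
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import Pipeline

# toy corpus, invented for this example (0=NEG, 1=POS)
toy_docs = [
    "opt out of the state test",
    "parents opt out of testing",
    "great pizza for lunch today",
    "watching the game tonight",
]
toy_labels = [1, 1, 0, 0]

toy_model = Pipeline([
    ('vect', CountVectorizer()),
    ('tfidf', TfidfTransformer()),
    ('clf', MultinomialNB()),
])
toy_model.fit(toy_docs, toy_labels)

new_doc = ["students opt out of the test"]
toy_probs = toy_model.predict_proba(new_doc)

# columns are ordered like classes_: column 0 is NEG, column 1 is POS
toy_class_order = toy_model.named_steps['clf'].classes_
print(toy_class_order)
print(toy_probs)

# predict() returns the class with the highest probability
toy_pred = toy_model.predict(new_doc)
print(toy_pred)
```

In the notebook output above, both synthetic tweets have a larger value in the first column, which is why both are labeled NEG.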


In [10]:
# use model to predict held out set (X_test) and print score table
# Note that accuracy is the same as the averaged recall in the
# 'avg / total' row: a support-weighted (or micro-averaged) recall
# is the fraction of samples classified correctly
predictions = best_model.predict(X_test)
print(classification_report(y_test, predictions, target_names=classes))

             precision    recall  f1-score   support

        NEG       0.78      0.96      0.86        47
        POS       0.80      0.38      0.52        21

avg / total       0.78      0.78      0.75        68



In [11]:
# Print confusion matrix
# http://scikit-learn.org/stable/modules/generated/sklearn.metrics.confusion_matrix.html
print(confusion_matrix(y_test, predictions))

[[45  2]
 [13  8]]
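
The per-class numbers in the report can be re-derived from this matrix by hand. In scikit-learn's layout, rows are true classes and columns are predicted classes, in label order (NEG, then POS). The sketch below redoes that arithmetic in plain Python; the variable names are chosen for this example.

```python
# confusion matrix printed above: rows = true class, columns = predicted class
cm = [[45, 2],
      [13, 8]]
tn, fp = cm[0]  # true NEG predicted as NEG / as POS
fn, tp = cm[1]  # true POS predicted as NEG / as POS

pos_precision = tp / (tp + fp)   # 8 / 10
pos_recall = tp / (tp + fn)      # 8 / 21
neg_precision = tn / (tn + fn)   # 45 / 58
neg_recall = tn / (tn + fp)      # 45 / 47
accuracy = (tp + tn) / (tp + tn + fp + fn)  # 53 / 68

print("POS precision %.2f, recall %.2f" % (pos_precision, pos_recall))
print("NEG precision %.2f, recall %.2f" % (neg_precision, neg_recall))
print("accuracy %.2f" % accuracy)
```

The model misses 13 of the 21 POS tweets while producing only 2 false positives, which is consistent with selecting parameters on a precision-weighted score.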


## Persist model

Take our best model, retrain it on the entire labeled dataset (including the held-out set used for evaluation above), and persist it to disk.

In [12]:
# retrain on all data
best_model.fit(docs, bin_targets)

Pipeline(memory=None,
     steps=[('vect', CountVectorizer(analyzer='word', binary=False, decode_error='strict',
        dtype=<class 'numpy.int64'>, encoding='utf-8', input='content',
        lowercase=True, max_df=0.40000000000000002, max_features=None,
        min_df=1, ngram_range=(1, 3),
        preprocessor=<function no...use_idf=True)), ('clf', MultinomialNB(alpha=0.15000000000000002, class_prior=None, fit_prior=True))])

In [13]:
# save to disk
with open(model_filename, 'wb') as f:
    pickle.dump(best_model, f)
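
To use the persisted model later, it can be loaded back with `pickle.load`. A minimal sketch follows; `load_model` is a hypothetical helper name and not part of the notebook. Pickle stores functions such as the preprocessor and tokenizer by reference, so `class_utils` and `nltk` need to be importable in the process that loads the file. Pickle files should only be loaded from trusted sources.

```python
import pickle

def load_model(path):
    """Load a pickled object (here, the fitted pipeline) from disk."""
    with open(path, 'rb') as f:
        return pickle.load(f)

# usage, assuming the file written above exists:
# model = load_model("models/best_nbiter3a.pickle")
# model.predict(["some new tweet text"])
```

The loaded pipeline accepts raw tweet strings, since vectorization and TF-IDF weighting are part of the pickled object.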