# The One Goal for Today

Understand how to fit a naive Bayes model using text data.

In [1]:
import numpy as np
import re
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.preprocessing import LabelEncoder
from sklearn.naive_bayes import MultinomialNB, GaussianNB, ComplementNB, BernoulliNB
from sklearn.metrics import confusion_matrix, classification_report
from zipfile import ZipFile

# Naive Bayes for Text, Multi-Class

Text data, *without preprocessing*, is qualitative data. Let's use Naive Bayes to classify some text data! Today's data has more than two classes, so this is multi-class classification rather than binary classification.
* last week Monday: Bayes Rule
* last week Wednesday: Naive Bayes, one independent variable, binary classification
* last week Friday: Naive Bayes, two independent variables, binary classification; Laplace smoothing
* this week Monday: Naive Bayes, many independent variables, binary classification; Laplace smoothing and log space
* today: Naive Bayes, many independent variables, multi-class classification
* Friday: deep dive into evaluation metrics beyond accuracy and confusion matrices

I'm going to be using the news dataset from [here](https://data.world/elenadata/vox-articles). Side note: this data set was released for a workshop in 2017 that I co-organized!

## I. Load and Look at our data

Let's load and __look at our data__. Where is the dependent variable?

This data is big, so I zipped it. Let's look at the first five lines.

In [2]:
with ZipFile('data/dsjVoxArticles.zip') as z:
    with z.open('dsjVoxArticles.tsv', 'r') as tsv:
        # Read only the first five lines; each one comes back as a bytes object
        lines = [next(tsv) for _ in range(5)]
        print(lines)



For efficiency, I'm going to ignore the article bodies and just use the titles. (They would need quite a bit of preprocessing anyway since they contain markup.) So I want the first and third fields.

In [3]:
data = []

with ZipFile('data/dsjVoxArticles.zip') as z:
    with z.open('dsjVoxArticles.tsv', 'r') as tsv:
        for line in tsv:
            cols = line.decode('utf-8').strip().split('\t')[:3]
            data.append([cols[0], cols[2]])
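
To make the parsing step concrete, here is a minimal sketch on a single made-up row. The row contents are hypothetical (I'm only assuming tab-separated fields with the title first and the category third, as described above). Files opened from a zip archive yield bytes, which is why the `decode` call is needed before splitting.

```python
# Hypothetical row: the real file's field contents are an assumption here.
raw_line = b"Some headline\tsecond field\tPolitics & Policy\t<p>body markup</p>\n"

# bytes -> str, drop the trailing newline, split on tabs, keep the first three fields
cols = raw_line.decode('utf-8').strip().split('\t')[:3]

# Field 0 is the title, field 2 is the category (our dependent variable)
title, category = cols[0], cols[2]
print([title, category])
```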

Let's make this into a numpy array and take a look.
* How many data points?
* How many classes?
* What are the classes, anyway?

In [4]:
data = np.array(data)
print(data.shape)
print(np.unique(data[:, 1]))
print(len(np.unique(data[:, 1])))

(23025, 2)
['2014 Midterm Elections' '2016 Golden Globes' '2016 Grammys'
 '2016 Presidential Election' '2016 Rio Olympics' '2016ish' 'Almanac'
 'Ant-Man' 'Apple' 'Avengers: Age of Ultron' 'Bernie Sanders'
 'Best of 2014' 'Best of 2016' 'Black Mirror, Season 3' 'Books'
 'Business & Finance' 'Campaign Finance' 'Carly Fiorina' 'Celebrities'
 'China' 'Climate Change' 'College Football' 'Comic Books' 'Congress'
 'Conversations' 'Criminal Justice' 'Cuba' 'Culture' 'Dear Julia'
 'Debates' 'Donald Trump' 'Ebola' 'Economic Mobility' 'Education'
 'Emmy Awards' 'Energy & Environment' 'Episode of the Week' 'Explainers'
 'Fear the Walking Dead' 'Fear the Walking Dead, Season 1'
 'Fear the Walking Dead, season 1, episode 1'
 'Fear the Walking Dead, season 1, episode 2' 'Features' 'First Person'
 'Game of Thrones' 'Game of Thrones season 6, episode 3'
 'Game of Thrones, season 5, episode 1'
 'Game of Thrones, season 5, episode 10'
 'Game of Thrones, season 5, episode 2'
 'Game of Thrones, season 5, e

Well, that's too many classes, and some of them are super specific. Let's just take five pretty generic classes.

In [8]:
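keep = ['Business & Finance', 'Health Care', 'Science & Health', 'Politics & Policy', 'Criminal Justice']
# Boolean mask: keep only the rows whose category is one of the five classes
reduced_data = data[np.isin(data[:, 1], keep)]
# Unseeded shuffle, so the split below changes from run to run
np.random.shuffle(reduced_data)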
print(reduced_data.shape)
print(np.unique(reduced_data[:, 1], return_counts=True))

(3165, 2)
(array(['Business & Finance', 'Criminal Justice', 'Health Care',
       'Politics & Policy', 'Science & Health'], dtype='<U139'), array([ 425,  408,  326, 1359,  647]))
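
The classes are not balanced, so it helps to know how well a "model" that always predicts the most common class would do. This is a small sketch of that arithmetic, using the counts printed above; the variable names are mine.

```python
import numpy as np

# Class names and counts copied from the printed output above
classes = np.array(['Business & Finance', 'Criminal Justice', 'Health Care',
                    'Politics & Policy', 'Science & Health'])
counts = np.array([425, 408, 326, 1359, 647])

# Always guessing the most frequent class gets this fraction of the rows right
majority_class = classes[np.argmax(counts)]
baseline_accuracy = counts.max() / counts.sum()
print(majority_class, round(baseline_accuracy, 3))
```

Any classifier we fit should beat this number (computed here over the whole reduced data set, before splitting) to be worth using.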


## II. Split the data

Let's split the data into train (80%), dev (10%) and test (10%) sets.

When we check by printing shapes and unique values, does everything look okay?

In [9]:
train_data, dev_data, test_data = np.split(reduced_data, [int(.8 * len(reduced_data)), int(.9 * len(reduced_data))])
print(train_data.shape, dev_data.shape, test_data.shape)
print(np.unique(train_data[:, 1]), np.unique(dev_data[:, 1]), np.unique(test_data[:, 1]))

(2532, 2) (316, 2) (317, 2)
['Business & Finance' 'Criminal Justice' 'Health Care' 'Politics & Policy'
 'Science & Health'] ['Business & Finance' 'Criminal Justice' 'Health Care' 'Politics & Policy'
 'Science & Health'] ['Business & Finance' 'Criminal Justice' 'Health Care' 'Politics & Policy'
 'Science & Health']
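
If the two cut points passed to `np.split` look mysterious, here is a sketch of the same arithmetic on a stand-in array of row indices (not the real data). The only number taken from the notebook is the row count, 3165.

```python
import numpy as np

n = 3165            # number of rows in reduced_data, from the shape printed earlier
toy = np.arange(n)  # stand-in for the shuffled rows

# Two cut points give three pieces: [0, cut_train), [cut_train, cut_dev), [cut_dev, n)
cut_train, cut_dev = int(.8 * n), int(.9 * n)
train, dev, test = np.split(toy, [cut_train, cut_dev])

print(cut_train, cut_dev)
print(len(train), len(dev), len(test))
```

Because `int()` truncates, the leftover row ends up in the last piece, which is why test has one more row than dev.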


## III. Preprocess the data

On Monday we tokenized the data and extracted counts for each token for each class ourselves.

Today I'm going to use two scikit-learn utilities:
* CountVectorizer - will tokenize and count
* LabelEncoder - will map the string labels to ints

As on Monday, I use *only the training data* to extract my token vocabulary.

In [10]:
vectorizer = CountVectorizer(lowercase=True, analyzer='word', max_features=1000)

vectorizer.fit(iter(train_data[:, 0]))
# toarray() returns a plain ndarray: GaussianNB needs dense input, and recent sklearn rejects np.matrix
train_processed = vectorizer.transform(iter(train_data[:, 0])).toarray()
dev_processed = vectorizer.transform(iter(dev_data[:, 0])).toarray()
test_processed = vectorizer.transform(iter(test_data[:, 0])).toarray()

encoder = LabelEncoder()
encoder.fit(train_data[:, 1])
train_labels = encoder.transform(train_data[:, 1])
dev_labels = encoder.transform(dev_data[:, 1])
test_labels = encoder.transform(test_data[:, 1])
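
To see what these two utilities actually produce, here is a toy sketch on two made-up titles (the titles and variable names are hypothetical, not from the Vox data). It also shows why fitting on the training data only matters: words that were not in the fitted vocabulary are silently dropped at transform time.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.preprocessing import LabelEncoder

# Made-up training titles and labels, for illustration only
toy_titles = ["Taxes and the economy", "The economy of health care"]
toy_labels = ["Health Care", "Business & Finance"]

toy_vectorizer = CountVectorizer(lowercase=True, analyzer='word')
toy_counts = toy_vectorizer.fit_transform(toy_titles).toarray()

# Column order of the count matrix: vocabulary words sorted by their column index
vocab = sorted(toy_vectorizer.vocabulary_, key=toy_vectorizer.vocabulary_.get)
print(vocab)
print(toy_counts)

# "brand" and "new" were never seen during fit, so only "economy" is counted
unseen = toy_vectorizer.transform(["Brand new economy"]).toarray()
print(unseen)

# LabelEncoder sorts the distinct labels and maps them to 0, 1, ...
toy_encoder = LabelEncoder().fit(toy_labels)
print(list(toy_encoder.classes_), toy_encoder.transform(toy_labels))
```

Each row of the count matrix is one title, and each column is one vocabulary word. The integer labels index into `classes_`, which is why `encoder.classes_` can be passed as `target_names` later.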

## IV. Fit, Predict and Score

Today I'm going to compare the performance of several scikit-learn Naive Bayes variants on this dataset. If you recall from last week, each variant assumes a different *probability distribution* for the features given the class, rather than using the likelihoods and priors directly.
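
As a rough illustration of what "different distributions" means in practice, here is a sketch on a tiny made-up count matrix (not the Vox data). `BernoulliNB` binarizes its input by default (`binarize=0.0`), so it only sees whether a word is present, while `MultinomialNB` uses the counts themselves and estimates one distribution over the whole vocabulary per class.

```python
import numpy as np
from sklearn.naive_bayes import BernoulliNB, MultinomialNB

# Made-up counts: 4 documents, 3 vocabulary words, 2 classes
X = np.array([[3, 0, 1],
              [2, 0, 0],
              [0, 4, 1],
              [0, 1, 2]])
y = np.array([0, 0, 1, 1])

# BernoulliNB on raw counts vs. on presence/absence: the fitted parameters match
bern_counts = BernoulliNB().fit(X, y)
bern_binary = BernoulliNB().fit((X > 0).astype(int), y)
same_params = np.allclose(bern_counts.feature_log_prob_, bern_binary.feature_log_prob_)
print(same_params)

# MultinomialNB: per class, the word probabilities sum to 1 across the vocabulary
multi = MultinomialNB().fit(X, y)
multi_row_sums = np.exp(multi.feature_log_prob_).sum(axis=1)
print(multi_row_sums)
```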

Although we aren't using our own, hand-written Naive Bayes, you can see that the pattern is the same:
1. Fit
2. Predict
3. Score

For the "score" step, we calculate the following *per class*:
* precision
* recall
* F1

In [11]:
nb = MultinomialNB()
nb.fit(train_processed, train_labels)
pred = nb.predict(dev_processed)
print(classification_report(dev_labels, pred, target_names=encoder.classes_))
print(confusion_matrix(dev_labels, pred))

                    precision    recall  f1-score   support

Business & Finance       0.49      0.42      0.45        40
  Criminal Justice       0.63      0.55      0.59        31
       Health Care       0.71      0.56      0.63        36
 Politics & Policy       0.70      0.76      0.73       142
  Science & Health       0.70      0.75      0.72        67

          accuracy                           0.67       316
         macro avg       0.65      0.61      0.62       316
      weighted avg       0.67      0.67      0.67       316

[[ 17   2   0  14   7]
 [  1  17   1  12   0]
 [  1   0  20  10   5]
 [ 12   8   5 108   9]
 [  4   0   2  11  50]]
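
One way to connect the report to the confusion matrix is to recompute the numbers by hand. This sketch copies the MultinomialNB matrix printed above (rows are true classes, columns are predicted classes) and uses the standard definitions; the variable names are mine.

```python
import numpy as np

# Confusion matrix copied from the MultinomialNB output above
cm = np.array([[17,  2,  0,  14,  7],
               [ 1, 17,  1,  12,  0],
               [ 1,  0, 20,  10,  5],
               [12,  8,  5, 108,  9],
               [ 4,  0,  2,  11, 50]])

true_pos = np.diag(cm)       # correctly classified items per class
support = cm.sum(axis=1)     # row sums: how many items truly belong to each class
predicted = cm.sum(axis=0)   # column sums: how many items were predicted as each class

recall = true_pos / support
precision = true_pos / predicted
f1 = 2 * precision * recall / (precision + recall)

accuracy = true_pos.sum() / cm.sum()
macro_f1 = f1.mean()                            # every class counts equally
weighted_f1 = np.average(f1, weights=support)   # classes weighted by their support

print(np.round(precision, 2), np.round(recall, 2), np.round(f1, 2))
print(round(accuracy, 2), round(macro_f1, 2), round(weighted_f1, 2))
```

The row sums reproduce the "support" column of the report, and the macro and weighted averages differ only in whether big classes like Politics & Policy get more say.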


In [12]:
nb = GaussianNB()
nb.fit(train_processed, train_labels)
pred = nb.predict(dev_processed)
print(classification_report(dev_labels, pred, target_names=encoder.classes_))
print(confusion_matrix(dev_labels, pred))

                    precision    recall  f1-score   support

Business & Finance       0.37      0.35      0.36        40
  Criminal Justice       0.34      0.52      0.41        31
       Health Care       0.27      0.53      0.36        36
 Politics & Policy       0.73      0.56      0.63       142
  Science & Health       0.64      0.51      0.57        67

          accuracy                           0.51       316
         macro avg       0.47      0.49      0.47       316
      weighted avg       0.58      0.51      0.53       316

[[14  3 15  6  2]
 [ 2 16  2  9  2]
 [ 0  1 19 10  6]
 [11 24 19 79  9]
 [11  3 15  4 34]]


In [13]:
nb = ComplementNB()
nb.fit(train_processed, train_labels)
pred = nb.predict(dev_processed)
print(classification_report(dev_labels, pred, target_names=encoder.classes_))
print(confusion_matrix(dev_labels, pred))

                    precision    recall  f1-score   support

Business & Finance       0.44      0.45      0.44        40
  Criminal Justice       0.58      0.68      0.63        31
       Health Care       0.68      0.69      0.68        36
 Politics & Policy       0.78      0.69      0.73       142
  Science & Health       0.68      0.78      0.72        67

          accuracy                           0.68       316
         macro avg       0.63      0.66      0.64       316
      weighted avg       0.69      0.68      0.68       316

[[18  3  2  9  8]
 [ 2 21  0  8  0]
 [ 2  0 25  3  6]
 [15 11  7 98 11]
 [ 4  1  3  7 52]]


In [14]:
nb = BernoulliNB()
nb.fit(train_processed, train_labels)
pred = nb.predict(dev_processed)
print(classification_report(dev_labels, pred, target_names=encoder.classes_))
print(confusion_matrix(dev_labels, pred))

                    precision    recall  f1-score   support

Business & Finance       0.41      0.30      0.35        40
  Criminal Justice       0.73      0.52      0.60        31
       Health Care       0.74      0.47      0.58        36
 Politics & Policy       0.65      0.77      0.71       142
  Science & Health       0.68      0.75      0.71        67

          accuracy                           0.65       316
         macro avg       0.64      0.56      0.59       316
      weighted avg       0.65      0.65      0.64       316

[[ 12   0   0  19   9]
 [  1  16   0  13   1]
 [  1   0  17  13   5]
 [ 12   6   5 110   9]
 [  3   0   1  13  50]]
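
The four cells above repeat the same fit, predict, score pattern, so they could also be written as a loop. Here is a sketch of that idea on synthetic count data (Poisson draws with made-up per-class rates, NOT the Vox titles), summarizing each model with macro-averaged F1 via `f1_score`. The scores it prints say nothing about the real dataset.

```python
import numpy as np
from sklearn.naive_bayes import MultinomialNB, GaussianNB, ComplementNB, BernoulliNB
from sklearn.metrics import f1_score

rng = np.random.default_rng(0)

# Synthetic stand-in data: 3 classes, 6 count features, made-up Poisson rates per class
rates = np.array([[3.0, 1.0, 1.0, 0.2, 0.2, 0.2],
                  [0.2, 0.2, 3.0, 1.0, 1.0, 0.2],
                  [0.2, 1.0, 0.2, 0.2, 3.0, 1.0]])
y_train = rng.integers(0, 3, size=300)
y_dev = rng.integers(0, 3, size=100)
X_train = rng.poisson(rates[y_train])
X_dev = rng.poisson(rates[y_dev])

# Same three steps for every variant: fit, predict, score
scores = {}
for model in (MultinomialNB(), GaussianNB(), ComplementNB(), BernoulliNB()):
    model.fit(X_train, y_train)
    pred = model.predict(X_dev)
    scores[type(model).__name__] = f1_score(y_dev, pred, average='macro')

# Print the variants from best to worst macro F1
for name, score in sorted(scores.items(), key=lambda kv: kv[1], reverse=True):
    print(f"{name:15s} macro F1 = {score:.2f}")
```

A single summary number makes the comparison compact, but it hides the per-class detail that the full reports and confusion matrices show.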


## V. Questions

1. What are the definitions for precision, recall and F1, and how do they relate to the confusion matrix?
2. What do "macro avg" and "weighted avg" mean?
3. Which variant of Naive Bayes works the best on this data?
4. Is there a class that is consistently miscategorized regardless of method?
5. Which metric or way of analyzing the results makes the most sense to you? Why?

## Acknowledgments

This notebook is inspired by https://github.com/sachinbiradar9/News-Classification/blob/master/news.ipynb