# _Joe's Playground: Text Classification w/ Naive Bayes & SVM_

In [7]:
%matplotlib inline
import numpy as np
import scipy as sp
import matplotlib as mpl
import matplotlib.cm as cm
import matplotlib.pyplot as plt
import pandas as pd
import seaborn as sns
from six.moves import range

# Setup Pandas
pd.set_option('display.width', 500)
pd.set_option('display.max_columns', 100)
pd.set_option('display.notebook_repr_html', True)

# Setup Seaborn
sns.set_style("whitegrid")
sns.set_context("poster")

For this playground notebook, we'll use the "Twenty Newsgroups" dataset. Below is the official description, quoted from the website:

"The 20 Newsgroups data set is a collection of approximately 20,000 newsgroup documents, partitioned (nearly) evenly across 20 different newsgroups. To the best of our knowledge, it was originally collected by Ken Lang, probably for his paper “Newsweeder: Learning to filter netnews,” though he does not explicitly mention this collection. The 20 newsgroups collection has become a popular data set for experiments in text applications of machine learning techniques, such as text classification and text clustering."

The version we load through scikit-learn includes around 18,000 newsgroup posts on 20 topics. To get faster execution for this first example, however, we will work on a partial dataset with only 4 of the 20 available categories.

In [10]:
# 4 categories we'll focus on first
categories = ['alt.atheism', 'soc.religion.christian',
             'comp.graphics', 'sci.med']

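# parts to remove (from fast.ai lesson)
# note: this tuple is defined here but not passed to fetch_20newsgroups below,
# so headers, footers and quotes stay in the loaded text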
remove = ('headers', 'footers', 'quotes')

In [17]:
# load files
from sklearn.datasets import fetch_20newsgroups

# gather training data set, with categories set to ones above
twenty_train = fetch_20newsgroups(subset='train', categories=categories, 
                                  shuffle=True, random_state=1)

In [18]:
# check target_names to see category names; should list the same categories as above, sorted alphabetically
twenty_train.target_names

['alt.atheism', 'comp.graphics', 'sci.med', 'soc.religion.christian']

In [19]:
# files are loaded in memory in data attribute
print(len(twenty_train.data))
print(len(twenty_train.filenames))

2257
2257


In [20]:
# print the first lines of the first loaded file
print("\n".join(twenty_train.data[0].split('\n')[:3]))

From: jaeger@buphy.bu.edu (Gregg Jaeger)
Subject: Re: The Inimitable Rushdie (Re: An Anecdote about Islam
Organization: Boston University Physics Department


In [21]:
# print out the category integer id for the first 10 samples stored in the target attribute
twenty_train.target[:10]

array([0, 2, 3, 1, 2, 1, 1, 1, 2, 3])

In [22]:
twenty_train.target_names

['alt.atheism', 'comp.graphics', 'sci.med', 'soc.religion.christian']

The numbers correspond to the index of the `target_names` list.

Another way to look at it...

In [23]:
for t in twenty_train.target[:10]:
    print(twenty_train.target_names[t])

alt.atheism
sci.med
soc.religion.christian
comp.graphics
sci.med
comp.graphics
comp.graphics
comp.graphics
sci.med
soc.religion.christian


# _Extract features from text_ 

### __Bag of Words__
To perform machine learning on text documents, we first need to turn text content into numerical feature vectors. The most intuitive way to do this is with something called __bag of words__, which does the following:

1. Assign a fixed integer id to each word occurring in any document of the training set (i.e. build a dictionary from words to integer indices).
2. For each document `#i`, count the number of occurrences of each word `w` and store it in `X[i, j]` as the value of feature `#j`, where `j` is the index of word `w` in the dictionary.
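
The two steps above can be sketched in plain Python. This is only a minimal illustration on a made-up two-document corpus: the helper names `build_vocabulary` and `count_matrix` are hypothetical, and tokenizing by lowercasing and splitting on whitespace is an assumption that is much cruder than what `scikit-learn` does.

```python
from collections import Counter

def build_vocabulary(docs):
    """Step 1: map each distinct lowercase token to a fixed integer id."""
    tokens = sorted({tok for doc in docs for tok in doc.lower().split()})
    return {tok: idx for idx, tok in enumerate(tokens)}

def count_matrix(docs, vocab):
    """Step 2: build X, where X[i][j] is the count of word j in document i."""
    X = []
    for doc in docs:
        counts = Counter(doc.lower().split())
        # vocab iterates in insertion order, which is also id order here
        X.append([counts.get(tok, 0) for tok in vocab])
    return X

toy_docs = ["the cat sat", "the cat saw the dog"]
vocab = build_vocabulary(toy_docs)
X = count_matrix(toy_docs, vocab)

print(vocab)  # {'cat': 0, 'dog': 1, 'sat': 2, 'saw': 3, 'the': 4}
print(X)      # [[1, 0, 1, 0, 1], [1, 1, 0, 1, 2]]
```

Even in this tiny example each row already contains zeros for the words a document does not use.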

There is one problem with bag of words though: the number of distinct words in the corpus. This number can be extremely large (typically > 100,000) and puts heavy constraints on our computer's RAM. Yet, on closer inspection, most of the values in `X` will be 0 for any given document, because word frequencies tend to be heavily right-skewed (i.e. a small portion of the distinct words accounts for the majority of occurrences), so each document uses only a small subset of the dictionary. This means that bags of words tend to be __high-dimensional sparse datasets__. 

To kill two birds (memory and zeros) with one stone, we can use `scipy.sparse` matrices, which store only the non-zero entries of the feature vectors in memory. 
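
As a rough illustration of what "storing only the non-zero parts" means, here is a small sketch using `scipy.sparse.csr_matrix` on a made-up 3 x 8 count matrix (the numbers are invented for the example, not taken from the newsgroups data).

```python
import numpy as np
from scipy.sparse import csr_matrix

# A toy count matrix: 3 documents, 8 vocabulary words, mostly zeros.
dense = np.array([
    [0, 2, 0, 0, 0, 1, 0, 0],
    [0, 0, 0, 3, 0, 0, 0, 0],
    [1, 0, 0, 0, 0, 0, 0, 1],
])

sparse = csr_matrix(dense)

# CSR keeps three arrays: the non-zero values, their column indices,
# and offsets marking where each row starts.
print(sparse.nnz)      # number of stored entries
print(sparse.data)     # the non-zero values, row by row
print(sparse.indices)  # column index of each stored value
print(sparse.indptr)   # row i occupies data[indptr[i]:indptr[i + 1]]

# Converting back recovers the original dense matrix.
print((sparse.toarray() == dense).all())
```

Only 5 of the 24 cells are stored here; with tens of thousands of columns the saving becomes far larger.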

### __Tokenizing text with `scikit-learn`__

In `scikit-learn`, [`CountVectorizer`](https://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.CountVectorizer.html#sklearn.feature_extraction.text.CountVectorizer) handles this for us: it includes text preprocessing, tokenizing and (optional) filtering of stopwords. This class builds a dictionary of features and transforms documents into feature vectors, returned as a `scipy.sparse` matrix.

In [50]:
# below is a simple example of how CountVectorizer works
from sklearn.feature_extraction.text import CountVectorizer

corpus = ['This is the first document.',
          'This document is the second document.',
          'And this is the third one.',
          'Is this the first document?']

vectorizer = CountVectorizer()
X = vectorizer.fit_transform(corpus)
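# get_feature_names() was removed in scikit-learn 1.2; get_feature_names_out() replaces it
print(vectorizer.get_feature_names_out().tolist())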
print(X.shape)
print(X.toarray())

['and', 'document', 'first', 'is', 'one', 'second', 'the', 'third', 'this']
(4, 9)
[[0 1 1 1 0 0 1 0 1]
 [0 2 0 1 0 1 1 0 1]
 [1 0 0 1 1 0 1 1 1]
 [0 1 1 1 0 0 1 0 1]]


In [51]:
from sklearn.feature_extraction.text import CountVectorizer

count_vect = CountVectorizer()
X_train_counts = count_vect.fit_transform(twenty_train.data)
X_train_counts.shape

(2257, 35788)

In [52]:
len(count_vect.vocabulary_)

35788

### _Occurrences to frequencies_

Word occurrence may be a good starting point but we run into an issue for longer vs. shorter documents: longer ones will have higher average count values than shorter ones, even if they talk about the same topics. 

This is where __Term Frequencies__ come in: they address this discrepancy by dividing the number of occurrences of each word by the total number of words in the document. Additionally, words that occur in many documents in the corpus are less likely to carry useful information, so we can downscale their weights.

This strategy is known as _Term Frequency times Inverse Document Frequency_ (i.e. [__tf-idf__](https://en.wikipedia.org/wiki/Tf%E2%80%93idf)). 
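
To make the weighting concrete, here is a small sketch of the textbook formula on a made-up corpus: tf is the count divided by document length, and idf is `log(N / df)`, where `df` is the number of documents containing the word. The function name `tf_idf` is hypothetical. Note that `TfidfTransformer` uses a smoothed idf and L2-normalizes each row by default, so its numbers will differ from this sketch.

```python
import math
from collections import Counter

def tf_idf(docs):
    """Textbook tf-idf: (count / doc length) * log(N / document frequency)."""
    tokenized = [doc.lower().split() for doc in docs]
    n_docs = len(tokenized)
    # document frequency: in how many documents each word appears
    df = Counter(word for tokens in tokenized for word in set(tokens))
    weights = []
    for tokens in tokenized:
        counts = Counter(tokens)
        weights.append({
            word: (count / len(tokens)) * math.log(n_docs / df[word])
            for word, count in counts.items()
        })
    return weights

toy_docs = ["the cat sat", "the dog sat", "the cat ran"]
toy_weights = tf_idf(toy_docs)
for doc, w in zip(toy_docs, toy_weights):
    print(doc, {word: round(score, 3) for word, score in w.items()})
```

Because "the" appears in every document, its idf is `log(3 / 3) = 0` and it gets zero weight, while "dog" and "ran" (one document each) get the highest weights.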

Let's use this on the `twenty_train` data set.

In [53]:
# import TfidfTransformer
from sklearn.feature_extraction.text import TfidfTransformer

tfidf_transformer = TfidfTransformer()
X_train_tfidf = tfidf_transformer.fit_transform(X_train_counts)
X_train_tfidf.shape

(2257, 35788)

In [54]:
list(X_train_tfidf)[:5]

[<1x35788 sparse matrix of type '<class 'numpy.float64'>'
 	with 221 stored elements in Compressed Sparse Row format>,
 <1x35788 sparse matrix of type '<class 'numpy.float64'>'
 	with 141 stored elements in Compressed Sparse Row format>,
 <1x35788 sparse matrix of type '<class 'numpy.float64'>'
 	with 124 stored elements in Compressed Sparse Row format>,
 <1x35788 sparse matrix of type '<class 'numpy.float64'>'
 	with 61 stored elements in Compressed Sparse Row format>,
 <1x35788 sparse matrix of type '<class 'numpy.float64'>'
 	with 93 stored elements in Compressed Sparse Row format>]

# _Training a classifier_

We now have our features in `X_train_tfidf`, so we can train a classifier to predict the category of a post. Since we're focusing on Naive Bayes, we start with a [naive Bayes classifier](https://scikit-learn.org/stable/modules/naive_bayes.html#naive-bayes). 

More specifically, we'll use a [multinomial Naive Bayes classifier](https://scikit-learn.org/stable/modules/generated/sklearn.naive_bayes.MultinomialNB.html#sklearn.naive_bayes.MultinomialNB), which is suited to classification with discrete features (e.g. word counts for text classification); in practice, fractional counts such as tf-idf values also work. Naive Bayes rests on the assumption of conditional independence between every pair of features given the value of the class variable. 
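
The independence assumption is what makes the model cheap to compute: the log posterior of a class is just the log prior plus a sum of per-word log likelihoods. Below is a minimal pure-Python sketch of that idea on a made-up four-document corpus. The helpers `train_nb` and `predict_nb` are hypothetical, not `scikit-learn`'s implementation, and `alpha=1.0` (Laplace smoothing) is chosen to mirror the `MultinomialNB` default.

```python
import math
from collections import Counter, defaultdict

def train_nb(docs, labels, alpha=1.0):
    """Estimate log priors and smoothed per-class word log likelihoods."""
    vocab = sorted({w for d in docs for w in d.split()})
    word_counts = defaultdict(Counter)
    for d, y in zip(docs, labels):
        word_counts[y].update(d.split())
    class_counts = Counter(labels)
    log_prior = {y: math.log(c / len(labels)) for y, c in class_counts.items()}
    log_lik = {}
    for y in class_counts:
        total = sum(word_counts[y].values()) + alpha * len(vocab)
        log_lik[y] = {w: math.log((word_counts[y][w] + alpha) / total)
                      for w in vocab}
    return log_prior, log_lik

def predict_nb(doc, log_prior, log_lik):
    """Return the class with the highest log posterior; unseen words are skipped."""
    scores = {
        y: log_prior[y] + sum(log_lik[y][w] for w in doc.split() if w in log_lik[y])
        for y in log_prior
    }
    return max(scores, key=scores.get)

toy_docs = ["god love faith", "church faith god",
            "opengl gpu render", "gpu graphics card"]
toy_labels = ["religion", "religion", "graphics", "graphics"]
log_prior, log_lik = train_nb(toy_docs, toy_labels)

print(predict_nb("god faith", log_prior, log_lik))        # religion
print(predict_nb("gpu render fast", log_prior, log_lik))  # graphics
```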

In [55]:
from sklearn.naive_bayes import MultinomialNB

# create classifier object
clf = MultinomialNB()

# fit classifier to data
clf.fit(X_train_tfidf, twenty_train.target)

MultinomialNB(alpha=1.0, class_prior=None, fit_prior=True)

In [57]:
# create new sample docs to predict target label on
docs_new = ['God is love', 'OpenGL on the GPU is fast']

# only need to call transform since the original transformers have already been fit to training set
X_new_counts = count_vect.transform(docs_new)
X_new_tfidf = tfidf_transformer.transform(X_new_counts)

predict_new = clf.predict(X_new_tfidf)

for doc, category in zip(docs_new, predict_new):
    print('%r => %s' % (doc, twenty_train.target_names[category]))

'God is love' => soc.religion.christian
'OpenGL on the GPU is fast' => comp.graphics


# _Build a Pipeline_

To streamline the process, we can use the [`Pipeline`](https://scikit-learn.org/stable/modules/generated/sklearn.pipeline.Pipeline.html#sklearn.pipeline.Pipeline) class to do all these steps for us. 

In [58]:
from sklearn.pipeline import Pipeline

text_clf = Pipeline([
    ('vect', CountVectorizer()),
    ('tfidf', TfidfTransformer()),
    ('clf', MultinomialNB())
])

# fit pipeline to data
text_clf.fit(twenty_train.data, twenty_train.target)

Pipeline(memory=None,
         steps=[('vect',
                 CountVectorizer(analyzer='word', binary=False,
                                 decode_error='strict',
                                 dtype=<class 'numpy.int64'>, encoding='utf-8',
                                 input='content', lowercase=True, max_df=1.0,
                                 max_features=None, min_df=1,
                                 ngram_range=(1, 1), preprocessor=None,
                                 stop_words=None, strip_accents=None,
                                 token_pattern='(?u)\\b\\w\\w+\\b',
                                 tokenizer=None, vocabulary=None)),
                ('tfidf',
                 TfidfTransformer(norm='l2', smooth_idf=True,
                                  sublinear_tf=False, use_idf=True)),
                ('clf',
                 MultinomialNB(alpha=1.0, class_prior=None, fit_prior=True))],
         verbose=False)
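
One convenience of a fitted pipeline is that it accepts raw strings directly and that each fitted step stays reachable by name. The sketch below shows this on a made-up four-document corpus (the `toy_*` names and the documents are invented for illustration), so it runs on its own without the newsgroups data.

```python
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import Pipeline

toy_docs = ["god love faith", "church faith god",
            "opengl gpu render", "gpu graphics card"]
toy_labels = [0, 0, 1, 1]
toy_names = ["religion", "graphics"]

toy_clf = Pipeline([
    ("vect", CountVectorizer()),
    ("tfidf", TfidfTransformer()),
    ("clf", MultinomialNB()),
])
toy_clf.fit(toy_docs, toy_labels)

# Each fitted step can be looked up by the name given in the step list.
vocab_size = len(toy_clf.named_steps["vect"].vocabulary_)
print(vocab_size)

# predict() runs vectorizing, tf-idf weighting and classification in one call.
preds = toy_clf.predict(["god is love", "gpu render"])
print([toy_names[p] for p in preds])
```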

# _Evaluate Performance on Test set_

In [59]:
import numpy as np

# import test set
twenty_test = fetch_20newsgroups(subset='test', categories=categories, 
                                 shuffle=True, random_state=1)

docs_test = twenty_test.data
predicted = text_clf.predict(docs_test)
np.mean(predicted == twenty_test.target)

0.8348868175765646

In [60]:
from sklearn import metrics

print(metrics.classification_report(twenty_test.target, predicted, 
                                    target_names=twenty_test.target_names))

                        precision    recall  f1-score   support

           alt.atheism       0.97      0.60      0.74       319
         comp.graphics       0.96      0.89      0.92       389
               sci.med       0.97      0.81      0.88       396
soc.religion.christian       0.65      0.99      0.78       398

              accuracy                           0.83      1502
             macro avg       0.89      0.82      0.83      1502
          weighted avg       0.88      0.83      0.84      1502
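
For readers unfamiliar with the columns in the report, here is a small sketch of how precision, recall and F1 are computed for one class. The labels are made up for illustration (they are not the notebook's predictions), and the helper `precision_recall_f1` is hypothetical.

```python
def precision_recall_f1(y_true, y_pred, label):
    """Per-class precision, recall and F1 from true and predicted labels."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == label and p == label)
    fp = sum(1 for t, p in pairs if t != label and p == label)
    fn = sum(1 for t, p in pairs if t == label and p != label)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return precision, recall, f1

# Toy case: half of class 0 is misclassified as class 1.
y_true = [0, 0, 0, 0, 1, 1, 1, 1]
y_pred = [0, 0, 1, 1, 1, 1, 1, 1]

print(precision_recall_f1(y_true, y_pred, 0))  # high precision, low recall
print(precision_recall_f1(y_true, y_pred, 1))  # low precision, high recall
```

The toy case shows the same kind of pattern as the report above: when one class absorbs another class's documents, the absorbed class keeps high precision but loses recall, while the absorbing class keeps high recall but loses precision.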



In [62]:
# SGDClassifier with its default loss ('hinge') trains a linear SVM
from sklearn.linear_model import SGDClassifier

text_clf = Pipeline([
    ('vect', CountVectorizer()),
    ('tfidf', TfidfTransformer()),
    ('clf', SGDClassifier(random_state=42))
])

# fit pipeline to data
text_clf.fit(twenty_train.data, twenty_train.target)

predicted = text_clf.predict(docs_test)
np.mean(predicted == twenty_test.target)

0.9307589880159787

In [63]:
print(metrics.classification_report(twenty_test.target, predicted, 
                                    target_names=twenty_test.target_names))

                        precision    recall  f1-score   support

           alt.atheism       0.96      0.85      0.90       319
         comp.graphics       0.92      0.97      0.94       389
               sci.med       0.95      0.93      0.94       396
soc.religion.christian       0.90      0.96      0.93       398

              accuracy                           0.93      1502
             macro avg       0.93      0.93      0.93      1502
          weighted avg       0.93      0.93      0.93      1502

