# Text Classification

- Build a model that classifies a text document into one of several possible classes.
- Perform classification on a text dataset using a sensible preprocessing, tokenization, and feature engineering scheme.
- Use scikit-learn text vectorizers to fit and transform text data into a format that an ML model can use.

## Imports

In [1]:
import nltk
from nltk.corpus import stopwords
import string
from nltk import word_tokenize, FreqDist
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics import accuracy_score
from sklearn.datasets import fetch_20newsgroups
from sklearn.ensemble import RandomForestClassifier
from sklearn.naive_bayes import MultinomialNB
import pandas as pd
import numpy as np
np.random.seed(0)

## Get Categories, Train, Test Data

In [2]:
categories = ['alt.atheism', 'comp.windows.x', 'rec.sport.hockey', 'sci.crypt', 'talk.politics.guns']

# The "remove" parameter strips headers, footers, and quoted replies.
# This metadata is sometimes a dead giveaway of an article's class,
# so removing it keeps the model from overfitting to it.

newsgroups_train = fetch_20newsgroups(
    subset='train',
    categories=categories,
    remove=('headers', 'footers', 'quotes'))

newsgroups_test = fetch_20newsgroups(
    subset='test',
    categories=categories,
    remove=('headers', 'footers', 'quotes'))

## Break Apart Data + Labels

In [3]:
data = newsgroups_train.data
target = newsgroups_train.target
label_names = newsgroups_train.target_names
label_names

['alt.atheism',
 'comp.windows.x',
 'rec.sport.hockey',
 'sci.crypt',
 'talk.politics.guns']

In [4]:
# check the shape of data to see what our data looks like. 
# Our dataset contains 2,814 different articles 
# spread across the five classes we chose.

newsgroups_train.filenames.shape

(2814,)

Before we can begin cleaning and preprocessing our text data, we need to make some decisions about things such as:

* Do we remove stop words or not?
* Do we stem or lemmatize our text data, or leave the words as is?
* Is basic tokenization enough, or do we need to support special edge cases through the use of regex?
* Do we use the entire vocabulary, or just limit the model to a subset of the most frequently used words? If so, how many?
* Do we engineer other features, such as bigrams, or POS tags, or Mutual Information Scores?
* What sort of vectorization should we use in our model? Boolean Vectorization? Count Vectorization? TF-IDF? More advanced vectorization strategies such as Word2Vec?
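To make the last question concrete, here is a minimal sketch comparing Boolean, count, and TF-IDF vectorization on a tiny made-up corpus. The names `toy_docs`, `bool_vec`, `count_vec`, and `tfidf_vec` are illustrative and not part of this notebook. It assumes scikit-learn's default settings apart from `binary=True`.

```python
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer

# Tiny made-up corpus, used only to compare the three schemes.
toy_docs = [
    "the puck hit the goalie",
    "the key encrypts the file",
]

# Boolean vectorization: 1 if the word occurs in the document, else 0.
bool_vec = CountVectorizer(binary=True)
bool_matrix = bool_vec.fit_transform(toy_docs).toarray()

# Count vectorization: raw number of occurrences of each word.
count_vec = CountVectorizer()
count_matrix = count_vec.fit_transform(toy_docs).toarray()

# TF-IDF: counts reweighted so that words appearing in many documents count for less.
tfidf_vec = TfidfVectorizer()
tfidf_matrix = tfidf_vec.fit_transform(toy_docs).toarray()

# Look up the column of each word in each vectorizer's own vocabulary.
bool_the = bool_matrix[0, bool_vec.vocabulary_["the"]]
count_the = count_matrix[0, count_vec.vocabulary_["the"]]
count_puck = count_matrix[0, count_vec.vocabulary_["puck"]]
tfidf_the = tfidf_matrix[0, tfidf_vec.vocabulary_["the"]]
tfidf_puck = tfidf_matrix[0, tfidf_vec.vocabulary_["puck"]]

print("boolean 'the':", bool_the)
print("count   'the':", count_the)
print("count  ratio the/puck:", count_the / count_puck)
print("tf-idf ratio the/puck:", tfidf_the / tfidf_puck)
```

In the first toy document, "the" occurs twice and "puck" once. Because "the" appears in both documents, TF-IDF shrinks its weight relative to "puck", which appears in only one.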

## Stopwords

In [6]:
stopwords_list = stopwords.words('english')
stopwords_list += list(string.punctuation)
stopwords_list += ["''", '""', '...', '``']

## Tokenize

To save time, we define a function that cleans a single article, then use Python's map() function to apply it to every article in the dataset.

In [8]:
def process_article(article):
    # Split the raw text into tokens
    tokens = word_tokenize(article)
    # Lowercase each token and drop stop words and punctuation
    stopwords_removed = [token.lower() for token in tokens if token.lower() not in stopwords_list]
    return stopwords_removed


In [9]:
processed_data = list(map(process_article, data))

In [10]:
processed_data[0]

['note',
 'trial',
 'updates',
 'summarized',
 'reports',
 '_idaho',
 'statesman_',
 'local',
 'nbc',
 'affiliate',
 'television',
 'station',
 'ktvb',
 'channel',
 '7',
 'randy',
 'weaver/kevin',
 'harris',
 'trial',
 'update',
 'day',
 '4',
 'friday',
 'april',
 '16',
 '1993',
 'fourth',
 'day',
 'trial',
 'synopsis',
 'defense',
 'attorney',
 'gerry',
 'spence',
 'cross-examined',
 'agent',
 'cooper',
 'repeated',
 'objections',
 'prosecutor',
 'ronald',
 'howen',
 'spence',
 'moved',
 'mistrial',
 'denied',
 'day',
 'marked',
 'caustic',
 'cross-examination',
 'deputy',
 'marshal',
 'larry',
 'cooper',
 'defense',
 'attorney',
 'gerry',
 'spence',
 'although',
 'spence',
 'explicitly',
 'stated',
 'one',
 'angle',
 'stategy',
 'must',
 'involve',
 'destroying',
 'credibility',
 'agent',
 'cooper',
 'cooper',
 'government',
 "'s",
 'eyewitness',
 'death',
 'agent',
 'degan',
 'spence',
 'attacked',
 'cooper',
 "'s",
 'credibility',
 'pointing',
 'discrepancies',
 'cooper',
 "'s",
 ...]
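The output above still contains tokens such as '_idaho', 'weaver/kevin', and "'s", which is the kind of edge case a regular expression can handle. Below is a minimal sketch of a regex-based alternative using only the standard library. `TOKEN_PATTERN`, `SMALL_STOPWORDS`, and `regex_process_article` are hypothetical names, and the small stop word set is a stand-in for the NLTK list used in this notebook.

```python
import re

# Hypothetical pattern: runs of letters, optionally joined by internal
# hyphens or apostrophes (e.g. "cross-examined", "cooper's").
TOKEN_PATTERN = re.compile(r"[a-z]+(?:[-'][a-z]+)*")

# Small stand-in stop word set; the notebook uses NLTK's English list.
SMALL_STOPWORDS = {"the", "of", "and", "a", "in", "on"}


def regex_process_article(article, stop_words=SMALL_STOPWORDS):
    """Lowercase the text, tokenize it with a regex, and drop stop words."""
    tokens = TOKEN_PATTERN.findall(article.lower())
    return [token for token in tokens if token not in stop_words]


sample = (
    "The _Idaho Statesman_ on the Randy Weaver/Kevin Harris trial: "
    "Spence cross-examined Cooper's account -- day 4."
)
sample_tokens = regex_process_article(sample)
print(sample_tokens)
```

With this pattern, underscores, slashes, digits, and standalone punctuation never become tokens, while hyphenated words and possessives stay in one piece. Whether that is the right trade-off depends on the dataset.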

## Explore Dataset

In [11]:
# total vocabulary size of the training dataset
total_vocab = set()
for article in processed_data:
    total_vocab.update(article)
len(total_vocab)

46990

## Frequency Distributions

In [17]:
articles_concat = []
for article in processed_data:
    articles_concat += article

In [18]:
len(articles_concat)

411966

In [19]:
articles_freqdist = FreqDist(articles_concat)
articles_freqdist.most_common(200)

[('--', 29501),
 ('x', 4840),
 ("'s", 3203),
 ("n't", 2933),
 ('1', 2529),
 ('would', 1985),
 ('0', 1975),
 ('one', 1758),
 ('2', 1664),
 ('people', 1243),
 ('use', 1146),
 ('get', 1068),
 ('like', 1036),
 ('file', 1024),
 ('3', 1005),
 ('also', 875),
 ('key', 869),
 ('4', 864),
 ('could', 853),
 ('know', 814),
 ('think', 814),
 ('time', 781),
 ('may', 729),
 ('even', 711),
 ('new', 706),
 ('first', 678),
 ('*/', 674),
 ('system', 673),
 ('5', 673),
 ('well', 670),
 ('information', 646),
 ('make', 644),
 ('right', 638),
 ('see', 636),
 ('many', 634),
 ('two', 633),
 ('/*', 611),
 ('good', 608),
 ('used', 600),
 ('7', 593),
 ('government', 588),
 ('way', 572),
 ('available', 568),
 ('window', 568),
 ("'m", 562),
 ('db', 553),
 ('much', 540),
 ('encryption', 537),
 ('6', 537),
 ('using', 527),
 ('say', 523),
 ('gun', 520),
 ('number', 518),
 ('program', 515),
 ('us', 510),
 ('team', 498),
 ('must', 483),
 ('law', 476),
 ('since', 449),
 ('need', 444),
 ('game', 439),
 ('chip', 437),
 ...]
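The most common entries above are dominated by tokens such as '--', single letters, digits, and contraction fragments, which say little about topic. One possible cleanup, sketched here on a made-up token list, keeps only alphabetic tokens of at least two characters and reports normalized frequencies. `toy_tokens` and `keep_token` are hypothetical, and `collections.Counter` is used because it offers the same `most_common` interface as `FreqDist`.

```python
from collections import Counter

# Made-up token list standing in for articles_concat.
toy_tokens = ["--", "x", "'s", "n't", "1", "would", "key", "key",
              "gun", "--", "0", "game", "key"]


def keep_token(token, min_length=2):
    """Keep purely alphabetic tokens of at least min_length characters."""
    return token.isalpha() and len(token) >= min_length


filtered_counts = Counter(token for token in toy_tokens if keep_token(token))
total = sum(filtered_counts.values())

# Normalized frequencies are comparable across corpora of different sizes.
for word, count in filtered_counts.most_common(3):
    print(word, count, round(count / total, 3))
```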

In [21]:
from sklearn.feature_extraction.text import TfidfVectorizer
vectorizer = TfidfVectorizer()
tf_idf_data_train = vectorizer.fit_transform(data)
tf_idf_data_test = vectorizer.transform(newsgroups_test.data)

In [23]:
# we have 2814 articles, with 36,622 unique words in the vocabulary
tf_idf_data_train.shape

(2814, 36622)
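Note that the vectorizer above is fit on the raw `data`, so it applies its own default tokenization and does not use `processed_data`. If the custom cleaning should feed the model, one option is to pass a callable as the `analyzer` argument. The sketch below assumes a toy corpus and a hypothetical `toy_analyzer` standing in for `process_article`.

```python
import re
from sklearn.feature_extraction.text import TfidfVectorizer

# Small stand-in stop word set (the notebook builds a longer list with NLTK).
TOY_STOPWORDS = {"the", "a", "is", "of"}


def toy_analyzer(document):
    """Hypothetical stand-in for process_article: lowercase, tokenize, drop stop words."""
    tokens = re.findall(r"[a-z']+", document.lower())
    return [token for token in tokens if token not in TOY_STOPWORDS]


toy_train = ["The goalie stopped the puck", "The cipher is a block cipher"]
toy_test = ["A puck is a disk of rubber"]

# A callable analyzer replaces the vectorizer's built-in preprocessing and tokenization.
custom_vectorizer = TfidfVectorizer(analyzer=toy_analyzer)
train_matrix = custom_vectorizer.fit_transform(toy_train)

# transform() reuses the training vocabulary; unseen words are ignored.
test_matrix = custom_vectorizer.transform(toy_test)

print(sorted(custom_vectorizer.vocabulary_))
print(train_matrix.shape, test_matrix.shape)
```

The test documents are only transformed, never fit, so the test matrix has the same columns as the training matrix.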

Our vectorized data contains 2,814 articles, with 36,622 unique words in the vocabulary. This count differs from the 46,990 found earlier because the vectorizer was fit on the raw `data` and applies its own tokenization rather than using `processed_data`. For any given article, the vast majority of these columns will be zero, since each article contains only a small subset of the total vocabulary. Recall that vectors mostly filled with zeros are referred to as sparse vectors. These are extremely common when working with text data.

In [24]:
# nnz is the total number of stored non-zero entries in the sparse matrix
non_zero_cols = tf_idf_data_train.nnz / tf_idf_data_train.shape[0]
print("Average Number of Non-Zero Elements in Vectorized Articles: {}".format(non_zero_cols))

# Fraction of each row that is zero, averaged over all articles
fraction_sparse = 1 - (non_zero_cols / tf_idf_data_train.shape[1])
print('Fraction of columns containing 0: {}'.format(fraction_sparse))

Average Number of Non-Zero Elements in Vectorized Articles: 107.28038379530916
Fraction of columns containing 0: 0.9970706028126451


As the output above shows, the average vectorized article has about 107 non-zero columns, so on average 99.7% of each vector is zeros. This is one reason to rely on an established package such as scikit-learn rather than writing your own vectorizer: its vectorizers return a SciPy sparse matrix, which stores only the non-zero entries. This way, we aren't spending a large chunk of memory on a vectorized dataset in which only about 0.3% of the entries carry information.
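To give a rough sense of the memory difference, the sketch below builds a synthetic matrix with a fixed number of non-zero entries per row and compares its dense size with its compressed sparse row (CSR) size. The dimensions are made up and smaller than the real dataset. The byte count for the sparse matrix is an approximation that sums its three underlying arrays.

```python
import numpy as np
from scipy import sparse

rng = np.random.default_rng(0)
n_rows, n_cols, nonzeros_per_row = 200, 5000, 100  # made-up dimensions

# Build a dense matrix with a fixed number of non-zero entries per row.
dense = np.zeros((n_rows, n_cols))
for row in range(n_rows):
    cols = rng.choice(n_cols, size=nonzeros_per_row, replace=False)
    dense[row, cols] = rng.random(nonzeros_per_row) + 0.1  # strictly positive values

# CSR stores only the non-zero values plus their column indices and row offsets.
csr = sparse.csr_matrix(dense)

dense_bytes = dense.nbytes
sparse_bytes = csr.data.nbytes + csr.indices.nbytes + csr.indptr.nbytes

print("non-zero fraction:", csr.nnz / (n_rows * n_cols))
print("dense bytes: ", dense_bytes)
print("sparse bytes:", sparse_bytes)
```

The sparser the matrix, the larger the saving, since the CSR size grows with the number of non-zero entries and not with the number of columns.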

In [25]:
nb_classifier = MultinomialNB()
rf_classifier = RandomForestClassifier(n_estimators=100)

In [27]:
nb_classifier.fit(tf_idf_data_train, target)
rf_classifier.fit(tf_idf_data_train, target)

RandomForestClassifier(bootstrap=True, ccp_alpha=0.0, class_weight=None,
                       criterion='gini', max_depth=None, max_features='auto',
                       max_leaf_nodes=None, max_samples=None,
                       min_impurity_decrease=0.0, min_impurity_split=None,
                       min_samples_leaf=1, min_samples_split=2,
                       min_weight_fraction_leaf=0.0, n_estimators=100,
                       n_jobs=None, oob_score=False, random_state=None,
                       verbose=0, warm_start=False)

In [29]:
nb_train_preds = nb_classifier.predict(tf_idf_data_train)
nb_test_preds = nb_classifier.predict(tf_idf_data_test)

rf_train_preds = rf_classifier.predict(tf_idf_data_train)
rf_test_preds = rf_classifier.predict(tf_idf_data_test)

In [30]:
nb_train_score = accuracy_score(target, nb_train_preds)
nb_test_score = accuracy_score(newsgroups_test.target, nb_test_preds)
rf_train_score = accuracy_score(target, rf_train_preds)
rf_test_score = accuracy_score(newsgroups_test.target, rf_test_preds)

print("Multinomial Naive Bayes")
print("Training Accuracy: {:.4} \t\t Testing Accuracy: {:.4}".format(nb_train_score, nb_test_score))
print("")
print('-'*70)
print("")
print('Random Forest')
print("Training Accuracy: {:.4} \t\t Testing Accuracy: {:.4}".format(rf_train_score, rf_test_score))

Multinomial Naive Bayes
Training Accuracy: 0.9531 		 Testing Accuracy: 0.8126

----------------------------------------------------------------------

Random Forest
Training Accuracy: 0.9851 		 Testing Accuracy: 0.7896


The models did well. Since there are five classes, random guessing would be correct about 20% of the time, assuming the classes are roughly balanced.

With test accuracies of about 79 and 81 percent, both models did much better than random guessing.
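The 20% figure depends on how the labels are distributed. As a minimal sketch, the two common baselines can be computed directly from the label counts. `toy_labels` is a made-up array; in this notebook the equivalent input would be `newsgroups_test.target`.

```python
from collections import Counter

# Hypothetical label array; in the notebook this would be newsgroups_test.target.
toy_labels = [0, 0, 0, 1, 1, 2, 2, 3, 3, 4]

label_counts = Counter(toy_labels)
n_classes = len(label_counts)

# Expected accuracy when guessing uniformly at random among the classes.
uniform_baseline = 1 / n_classes

# Accuracy when always predicting the most common class.
majority_baseline = label_counts.most_common(1)[0][1] / len(toy_labels)

print("uniform guessing: ", uniform_baseline)
print("majority class:   ", majority_baseline)
```

When the classes are imbalanced, the majority-class baseline is the harder one to beat and is usually the fairer comparison.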

There is some evidence of overfitting: training accuracy (about 95% for Naive Bayes and 99% for the random forest) is much higher than test accuracy.

This suggests that both models could generalize better with some tuning.
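One way such tuning might look is a cross-validated grid search over a pipeline, sketched below on a small made-up two-class corpus. The parameter grid (`stop_words` and the Naive Bayes smoothing parameter `alpha`) is an illustrative assumption, not a recommendation derived from this dataset.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import GridSearchCV
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import Pipeline

# Made-up two-class corpus standing in for the newsgroups articles.
hockey = ["the goalie stopped the puck", "the team won the game on ice",
          "a late goal won the playoff game", "the puck hit the post",
          "the coach pulled the goalie", "fans cheered the winning goal"]
crypto = ["the key encrypts the message", "the chip uses a secret key",
          "encryption protects the file", "a cipher scrambles the message",
          "the algorithm hashes the password", "the secret key decrypts the file"]
toy_docs = hockey + crypto
toy_labels = [0] * len(hockey) + [1] * len(crypto)

# Putting the vectorizer inside the pipeline means it is refit on each
# training fold, so no vocabulary leaks in from the held-out fold.
pipeline = Pipeline([("tfidf", TfidfVectorizer()), ("nb", MultinomialNB())])

# Illustrative grid: stop word removal and the smoothing strength alpha.
param_grid = {
    "tfidf__stop_words": [None, "english"],
    "nb__alpha": [0.1, 1.0],
}

search = GridSearchCV(pipeline, param_grid, cv=3, scoring="accuracy")
search.fit(toy_docs, toy_labels)

print(search.best_params_)
print(search.best_score_)
```

On the real data, the same pattern would take `newsgroups_train.data` and `target` as inputs, with the test set held back until the final evaluation.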