# Goal of project
- Author: Arun Nemani
- Effectively extract features from N categories of the 20-newsgroups dataset
- Train and fit a classification model to predict text inputs based on the extracted features
- Report the accuracy of each classification model

### Cloned repos as baseline
- https://github.com/stefansavev/demos/blob/master/text-categorization/20ng/20ng.py
- https://nlpforhackers.io/text-classification/


In [2]:
from __future__ import print_function
import numpy as np
import matplotlib.pyplot as plt
from sklearn.model_selection import train_test_split  # sklearn.cross_validation was removed in scikit-learn 0.20
import nltk
import string
from nltk import word_tokenize
from nltk.stem.porter import PorterStemmer
from nltk.corpus import stopwords
from sklearn.datasets import fetch_20newsgroups
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.linear_model import LogisticRegression
from sklearn.linear_model import SGDClassifier
from sklearn import metrics
from sklearn.model_selection import GridSearchCV  # sklearn.grid_search was removed in scikit-learn 0.20

## Initialize and load dataset

First we split the dataset into training, development, and test sets.
The purpose of this three-way split is to avoid biasing or overfitting the model by iteratively refining it against the test set. Instead, we fit on the training set, tune on the development set, and use the test set only once, as a final check of the tuned model.

Thus, the entire dataset is split into a training set (70%), a development set (15%), and a test set (15%).
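
The two-stage split can be checked with a little arithmetic. The sketch below assumes a total of 18846 documents (the sum of the three counts reported in the output further down) and assumes that the held-out count is rounded up, which is how `train_test_split` is understood to treat a fractional `test_size`. The helper name `split_sizes` is made up for this illustration.

```python
import math

def split_sizes(n_samples, test_size):
    """Return (kept, held_out) counts for a fractional test_size.

    Assumption: the held-out count is rounded up and the remainder is kept.
    """
    n_held_out = math.ceil(test_size * n_samples)
    return n_samples - n_held_out, n_held_out

n_total = 18846  # 13192 + 2827 + 2827, taken from the notebook output

# Stage 1: hold out 30% of the data as an intermediate set.
n_train, n_intermediate = split_sizes(n_total, 0.30)

# Stage 2: split the intermediate set in half into dev and test.
n_dev, n_test = split_sizes(n_intermediate, 0.50)

print("train:", n_train, "dev:", n_dev, "test:", n_test)
print("fractions:", round(n_train / n_total, 3),
      round(n_dev / n_total, 3), round(n_test / n_total, 3))
```

Holding out 30% and then halving it gives 15% each for the development and test sets.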

In [6]:
# Initializations
nltk.download('punkt')
nltk.download('stopwords')
categ = ['alt.atheism', 'comp.graphics', 'comp.os.ms-windows.misc', 
              'comp.sys.ibm.pc.hardware', 'comp.sys.mac.hardware', 'comp.windows.x', 
              'misc.forsale', 'rec.autos', 'rec.motorcycles', 'rec.sport.baseball', 
              'rec.sport.hockey', 'sci.crypt', 'sci.electronics', 'sci.med', 
              'sci.space', 'soc.religion.christian', 'talk.politics.guns', 
              'talk.politics.mideast', 'talk.politics.misc', 'talk.religion.misc'] # All 20 categories listed explicitly; categories=None would load the same set
remove = ('headers', 'footers', 'quotes') # Metadata that can act as false features and cause overfitting; note this tuple is not passed to fetch_20newsgroups below
RANDOM_STATE = 35

# Load dataset
print("Loading 20 newsgroups dataset for categories:")
newsdata = fetch_20newsgroups(subset='all', categories=categ)
X_train, X_intermediate, Y_train, Y_intermediate = train_test_split(newsdata.data, newsdata.target, test_size=0.30, random_state=RANDOM_STATE)
X_dev, X_test, Y_dev, Y_test = train_test_split(X_intermediate, Y_intermediate, test_size=0.50, random_state=RANDOM_STATE)
print('Data loaded')
print()
print('Training data documents:', len(X_train))
print('Development data documents:', len(X_dev))
print('Test data documents:', len(X_test))
print()
print('Total Newsgroups :', newsdata.target_names)

[nltk_data] Downloading package punkt to /Users/m1u6026/nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package stopwords to
[nltk_data]     /Users/m1u6026/nltk_data...
[nltk_data]   Unzipping corpora/stopwords.zip.
Loading 20 newsgroups dataset for categories:
Data loaded

Training data documents: 13192
Development data documents: 2827
Test data documents: 2827

Total Newsgroups : ['alt.atheism', 'comp.graphics', 'comp.os.ms-windows.misc', 'comp.sys.ibm.pc.hardware', 'comp.sys.mac.hardware', 'comp.windows.x', 'misc.forsale', 'rec.autos', 'rec.motorcycles', 'rec.sport.baseball', 'rec.sport.hockey', 'sci.crypt', 'sci.electronics', 'sci.med', 'sci.space', 'soc.religion.christian', 'talk.politics.guns', 'talk.politics.mideast', 'talk.politics.misc', 'talk.religion.misc']


In [7]:
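stemmer = PorterStemmer()  # created once and reused for every document

def Stem_tokenize(text):
    """Tokenize text with NLTK and reduce each token to its Porter stem."""
    return [stemmer.stem(w) for w in word_tokenize(text)]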

## Finalized Feature Extractor

Note that the vocabulary of our vectorizer shrinks from 1392137 features to 85540, a reduction of roughly 94%.
This is primarily due to the min_df parameter, since changing max_df does not significantly affect the feature count.
Our previous vectorizers kept many terms that appear in only a handful of documents, and such very sparse features encourage overfitting.
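
To make the effect of document-frequency pruning concrete, here is a minimal pure-Python sketch on a toy corpus. It is an illustration of the idea only, not scikit-learn's implementation. The function name `prune_vocabulary` and the toy documents are invented for the example, and the thresholds are smaller than the `min_df=5`, `max_df=0.75` used in this project.

```python
from collections import Counter

def prune_vocabulary(tokenized_docs, min_df=1, max_df=1.0):
    """Keep terms whose document frequency lies within [min_df, max_df * n_docs].

    min_df is an absolute document count and max_df a fraction of the corpus,
    mirroring how the two parameters are used in this notebook.
    """
    n_docs = len(tokenized_docs)
    # Count each term once per document, however often it repeats inside it.
    doc_freq = Counter(term for doc in tokenized_docs for term in set(doc))
    max_count = max_df * n_docs
    return sorted(t for t, df in doc_freq.items() if min_df <= df <= max_count)

toy_docs = [
    ["the", "honda", "bike"],
    ["the", "hockey", "game"],
    ["the", "honda", "engine"],
    ["the", "hockey", "team"],
]

full_vocab = prune_vocabulary(toy_docs)
pruned_vocab = prune_vocabulary(toy_docs, min_df=2, max_df=0.75)
print(len(full_vocab), "terms before pruning:", full_vocab)
print(len(pruned_vocab), "terms after pruning:", pruned_vocab)
```

Terms seen in a single document are dropped by `min_df`, while the term present in every document is dropped by `max_df`. On real text the rare terms vastly outnumber the ubiquitous ones, which is consistent with `min_df` driving most of the reduction.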

FE will be our finalized feature extractor for this project.
Next we explore classification schemes.

In [9]:
FE = TfidfVectorizer(analyzer= 'word', tokenizer= Stem_tokenize,
                                stop_words=stopwords.words('english') + list(string.punctuation),
                                lowercase=True, strip_accents='ascii', ngram_range=(1,2),
                                min_df=5, max_df= 0.75)
Vocab_train = FE.fit_transform(X_train)
Vocab_dev = FE.transform(X_dev)
print('FE training set vocabulary size is {} in {} documents'.format(Vocab_train.shape[1], Vocab_train.shape[0]))
print('FE dev set vocabulary size is {} in {} documents'.format(Vocab_dev.shape[1], Vocab_dev.shape[0]))

FE training set vocabulary size is 85540 in 13192 documents
FE dev set vocabulary size is 85540 in 2827 documents
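
For readers unfamiliar with the weighting itself, the sketch below computes TF-IDF rows in pure Python. It assumes the smoothed formula `idf = ln((1 + n) / (1 + df)) + 1` followed by L2 row normalization, which corresponds to the documented `TfidfVectorizer` defaults (`smooth_idf=True`, `norm='l2'`). It does not reproduce the stemming, stop-word removal, or bigrams that FE applies, and the name `tfidf_rows` is hypothetical.

```python
import math
from collections import Counter

def tfidf_rows(tokenized_docs):
    """Return one {term: weight} dict per document, L2-normalized."""
    n_docs = len(tokenized_docs)
    doc_freq = Counter(term for doc in tokenized_docs for term in set(doc))
    # Smoothed inverse document frequency: rarer terms get larger values.
    idf = {t: math.log((1 + n_docs) / (1 + df)) + 1.0 for t, df in doc_freq.items()}
    rows = []
    for doc in tokenized_docs:
        weights = {t: count * idf[t] for t, count in Counter(doc).items()}
        norm = math.sqrt(sum(w * w for w in weights.values()))
        rows.append({t: w / norm for t, w in weights.items()} if norm else {})
    return rows

toy_docs = [
    ["honda", "ride", "ride"],
    ["honda", "hockey"],
    ["hockey", "wild"],
]
rows = tfidf_rows(toy_docs)
for row in rows:
    print({term: round(weight, 3) for term, weight in sorted(row.items())})
```

In the last toy document, "wild" appears in one document and "hockey" in two, so "wild" receives the larger weight.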


## Feature Classification

First we need to understand the dataset before selecting a classification model.
The matrix output of the feature extraction step is very sparse, with only a small fraction of non-zero values.

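It is important to note that the dev set is transformed USING the feature extraction model fitted on the training set, so no dev-set vocabulary leaks into the features. In the cells below, GridSearchCV selects the hyperparameters by cross-validation on the training set; predictions on the dev set are computed but not scored there.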

Thus we will try Multinomial Naive Bayes, regularized logistic regression, and a linear classifier trained with stochastic gradient descent, all of which cope well with sparse, high-dimensional inputs.

## Grid-search for hyperparameter tuning

In [10]:
classifier_nb = MultinomialNB()
params = {'alpha':[0.00001, 0.0001, 0.001, 0.01, 0.1, 0.5, 1.0, 2.0, 5.0]}
grid_classifier_nb = GridSearchCV(classifier_nb, params, scoring = 'accuracy')
grid_classifier_nb.fit(Vocab_train, Y_train)
pred = grid_classifier_nb.predict(Vocab_dev)
print("Multinomial NB optimal alpha: {}".format(grid_classifier_nb.best_params_))

classifier_lreg = LogisticRegression(penalty = 'l2', solver='sag', random_state=RANDOM_STATE, n_jobs=-1)
params = {'C':[0.0001, 0.001, 0.01, 0.1, 0.25, 0.5, 1.0, 2.0, 5.0]}
grid_classifier_lreg = GridSearchCV(classifier_lreg, params, scoring = 'accuracy')
grid_classifier_lreg.fit(Vocab_train, Y_train)
pred = grid_classifier_lreg.predict(Vocab_dev)
print("Logistic Regression optimal C: {}".format(grid_classifier_lreg.best_params_))

classifier_SGD = SGDClassifier(tol=0.0001, penalty="l2", random_state=RANDOM_STATE, n_jobs=-1)
params = {'alpha':[0.00001, 0.0001, 0.001, 0.01, 0.1, 0.5, 1.0, 2.0, 5.0]}
grid_classifier_SGD = GridSearchCV(classifier_SGD, params, scoring = 'accuracy')
grid_classifier_SGD.fit(Vocab_train, Y_train)
pred = grid_classifier_SGD.predict(Vocab_dev)
print("Stochastic Gradient Descent optimal alpha: {}".format(grid_classifier_SGD.best_params_))

Multinomial NB optimal alpha: {'alpha': 0.01}
Logistic Regression optimal C: {'C': 5.0}
Stochastic Gradient Descent optimal alpha: {'alpha': 0.0001}
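
The cells above compute dev-set predictions without scoring them. A minimal sketch of how a tuned model could be scored on a held-out dev set follows. It runs on small synthetic count data rather than the newsgroup features, so the numbers it prints say nothing about this project. The helper `make_split` and the Poisson toy data are assumptions made purely for illustration.

```python
import numpy as np
from sklearn.metrics import accuracy_score
from sklearn.model_selection import GridSearchCV
from sklearn.naive_bayes import MultinomialNB

rng = np.random.RandomState(35)

def make_split(n_per_class, n_classes=3, block=5):
    """Synthetic count features: each class favours its own block of terms."""
    features, labels = [], []
    for label in range(n_classes):
        rates = np.full(n_classes * block, 0.5)
        rates[label * block:(label + 1) * block] = 3.0
        features.append(rng.poisson(rates, size=(n_per_class, n_classes * block)))
        labels.extend([label] * n_per_class)
    return np.vstack(features), np.array(labels)

X_tr, y_tr = make_split(60)
X_dv, y_dv = make_split(20)

# Hyperparameters are chosen by cross-validation on the training data only.
alphas = [0.01, 0.1, 1.0]
grid = GridSearchCV(MultinomialNB(), {'alpha': alphas}, scoring='accuracy', cv=3)
grid.fit(X_tr, y_tr)

# The refitted best model is then scored once on the untouched dev split.
dev_accuracy = accuracy_score(y_dv, grid.predict(X_dv))
print("best params:", grid.best_params_)
print("cross-validated accuracy:", round(grid.best_score_, 4))
print("dev accuracy:", round(dev_accuracy, 4))
```

In the notebook, the same pattern would apply to `grid_classifier_nb`, `Vocab_dev`, and `Y_dev`.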


## Optimal classification model

Now that we have identified the optimal parameters for each model, we will calculate the final accuracies on the test set and select the final classification model for this project.

Note that the penalty type (L2) is fixed in advance for the SGD and logistic regression classifiers; only its strength was tuned above.

The idea of regularization is to avoid learning very large weights, which are likely to fit the training data but do not generalize well. L2 regularization adds a penalty proportional to the sum of the squared weights, whereas L1 regularization adds a penalty proportional to the sum of the absolute values of the weights. The result is that L2 regularization makes all the weights relatively small, while L1 regularization drives many of the weights to exactly 0, effectively removing unimportant features.

In this particular application, there are a number of features that are very sparse but unique in identifying newsgroups, and L1 regularization would risk zeroing them out.
Thus, only "L2" regularization has been selected for the logistic regression and SGD classifiers.
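
The difference between the two penalties can be illustrated with a toy weight vector. The sketch below applies one shrinkage step of each kind: uniform rescaling for L2 and soft-thresholding for L1. These are textbook proximal steps chosen for illustration, not what the scikit-learn solvers used in this notebook do internally, and all function names are hypothetical.

```python
def l1_penalty(weights):
    """Sum of absolute weights."""
    return sum(abs(w) for w in weights)

def l2_penalty(weights):
    """Sum of squared weights."""
    return sum(w * w for w in weights)

def l2_shrink(weights, strength):
    """Proximal step for (strength / 2) * sum(w^2): rescale every weight."""
    return [w / (1.0 + strength) for w in weights]

def l1_shrink(weights, strength):
    """Proximal step for strength * sum(|w|): soft-threshold every weight."""
    shrunk = []
    for w in weights:
        magnitude = max(abs(w) - strength, 0.0)
        shrunk.append(magnitude if w >= 0 else -magnitude)
    return shrunk

weights = [2.0, -0.5, 0.05, -0.01]
after_l2 = l2_shrink(weights, 0.1)
after_l1 = l1_shrink(weights, 0.1)

print("L1 penalty:", round(l1_penalty(weights), 4))
print("L2 penalty:", round(l2_penalty(weights), 4))
print("after L2 step:", [round(w, 4) for w in after_l2])
print("after L1 step:", [round(w, 4) for w in after_l1])
```

The L2 step keeps every weight non-zero but smaller, while the L1 step sets the two smallest weights to zero. A rare but discriminative feature with a small weight would survive the first and be removed by the second.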


In [11]:
Vocab_test = FE.transform(X_test)
classifier_NB = MultinomialNB(alpha=0.01)
classifier_NB.fit(Vocab_train, Y_train)
pred = classifier_NB.predict(Vocab_test)
print("NB classifier accuracy: {}".format(round(metrics.accuracy_score(Y_test, pred),4)))

classifier_lreg = LogisticRegression(penalty = 'l2', solver='sag', C=5, random_state=RANDOM_STATE, n_jobs=-1)
classifier_lreg.fit(Vocab_train, Y_train)
pred = classifier_lreg.predict(Vocab_test)
print("Logistic regression classifier accuracy: {}".format(round(metrics.accuracy_score(Y_test, pred),4)))

classifier_SGD = SGDClassifier(tol=0.0001, penalty="l2", alpha=0.0001, random_state=RANDOM_STATE, n_jobs=-1)
classifier_SGD.fit(Vocab_train, Y_train)
pred = classifier_SGD.predict(Vocab_test)
print("Stochastic Gradient Descent classifier accuracy: {}".format(round(metrics.accuracy_score(Y_test, pred),4)))

NB classifier accuracy: 0.9073
Logistic regression classifier accuracy: 0.9059
Stochastic Gradient Descent classifier accuracy: 0.9165


## Finalized predictive model

Multinomial NB classifier accuracy on final test set: 0.9073.

Logistic regression accuracy on final test set: 0.9059.

Stochastic gradient descent accuracy on final test set: 0.9165.
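
To put these differences in perspective, the reported accuracies can be converted back into document counts on the 2827-document test set. The sketch below also computes a rough binomial standard error for a single accuracy figure. Treating test documents as independent trials is a simplifying assumption, and the estimate is only a guide to how much of the gap could be noise.

```python
import math

n_test = 2827  # test set size reported above
accuracies = {"NB": 0.9073, "LogReg": 0.9059, "SGD": 0.9165}

# Approximate number of correctly classified documents per model.
correct = {name: round(acc * n_test) for name, acc in accuracies.items()}
for name, count in correct.items():
    print("{}: about {} of {} documents correct".format(name, count, n_test))

gap_docs = correct["SGD"] - correct["NB"]
print("SGD vs NB gap: about {} documents".format(gap_docs))

# Binomial standard error of one accuracy estimate (independence assumed).
p = accuracies["SGD"]
std_err = math.sqrt(p * (1 - p) / n_test)
print("approx. standard error of SGD accuracy: {:.4f}".format(std_err))
```

Under this approximation the gap between SGD and NB is under two standard errors, which supports describing it as marginal.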

Based on our preprocessing, parameter tuning, and model selection, a stochastic gradient descent classifier can accurately classify input text into one of the 20 newsgroups. Multinomial NB has marginally lower accuracy but runs significantly faster than the other classifiers on the test set.
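
No timings are recorded in this notebook, so the speed comparison can be checked with a small helper like the one sketched here. The name `timed` is hypothetical, and the example times a built-in function only so that the block runs on its own.

```python
import time

def timed(fn, *args, **kwargs):
    """Call fn and return (result, elapsed_seconds) using a monotonic clock."""
    start = time.perf_counter()
    result = fn(*args, **kwargs)
    return result, time.perf_counter() - start

# Stand-in workload; any callable can be timed the same way.
total, seconds = timed(sum, range(1000000))
print("sum = {} computed in {:.4f} s".format(total, seconds))
```

In the notebook this could wrap calls such as `timed(classifier_NB.fit, Vocab_train, Y_train)` or `timed(classifier_SGD.predict, Vocab_test)` to compare the classifiers directly.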

Below we run a few sanity checks to see the predictions working in real time.

In [16]:
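def predictNewsGroup(text, clf):
    """Print the newsgroup that clf predicts for a single text string."""
    features = FE.transform([text])  # reuse the vectorizer fitted on the training set
    targets = newsdata.target_names
    idx = clf.predict(features)[0]  # predict returns an array; take its only element
    print("Predicted newsgroup: {}".format(targets[idx]))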

print("Newsgroup categories: {}".format(newsdata.target_names))
print()
predictNewsGroup("A Honda CBR is a dope ride", classifier_SGD)
predictNewsGroup("He is #1 player with the highest contract signed for the Minnesota Wild", classifier_SGD)
predictNewsGroup("I'll only sell with my Gamecube for $1000", classifier_SGD)
predictNewsGroup("Homs is really unstable right now. Many refugees are actively leaving the region", classifier_SGD)
predictNewsGroup("Interstellar was a really good movie. I'm sure Carl Sagan would've loved it", classifier_SGD)
predictNewsGroup("MLops is an important concept for productionalizing machine learning models", classifier_SGD)
predictNewsGroup("Donald Trump didn't tell the whole truth about the Russia investigation", classifier_SGD)

Newsgroup categories: ['alt.atheism', 'comp.graphics', 'comp.os.ms-windows.misc', 'comp.sys.ibm.pc.hardware', 'comp.sys.mac.hardware', 'comp.windows.x', 'misc.forsale', 'rec.autos', 'rec.motorcycles', 'rec.sport.baseball', 'rec.sport.hockey', 'sci.crypt', 'sci.electronics', 'sci.med', 'sci.space', 'soc.religion.christian', 'talk.politics.guns', 'talk.politics.mideast', 'talk.politics.misc', 'talk.religion.misc']

Predicted newsgroup: rec.motorcycles
Predicted newsgroup: rec.sport.hockey
Predicted newsgroup: misc.forsale
Predicted newsgroup: talk.politics.mideast
Predicted newsgroup: sci.space
Predicted newsgroup: rec.autos
Predicted newsgroup: sci.med
