# Goal of project

- Use all 20 categories of the 20-newsgroups dataset and classify documents based on extracted features
- Report the accuracy of each classification model

# Cloned repos as baseline
- https://github.com/stefansavev/demos/blob/master/text-categorization/20ng/20ng.py
- https://nlpforhackers.io/text-classification/


In [19]:
from __future__ import print_function
import numpy as np
import matplotlib.pyplot as plt
from sklearn.pipeline import Pipeline
from sklearn.model_selection import train_test_split
import nltk
import string
from nltk import word_tokenize
from nltk.stem.porter import PorterStemmer
from nltk.corpus import stopwords
from sklearn.datasets import fetch_20newsgroups
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.feature_extraction.text import HashingVectorizer
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.linear_model import LogisticRegression
from sklearn.svm import LinearSVC
from sklearn.linear_model import SGDClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn import metrics

# Initializations
nltk.download('punkt')
categories = None  # None means all 20 categories are included in the dataset
remove = ('headers', 'footers', 'quotes')  # Strips metadata that can act as false features and cause overfitting (not passed to fetch_20newsgroups below)
RANDOM_STATE = 35

# Load dataset
print("Loading 20 newsgroups dataset for categories:")
newsdata = fetch_20newsgroups(subset='all')
print('Data loaded')

def trainClassifier(clf, X, y):
    # Hold out 25% of the data for testing, fit on the rest, and report test accuracy
    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=RANDOM_STATE)
    clf.fit(X_train, y_train)
    y_pred = clf.predict(X_test)
    print("Accuracy: {}".format(metrics.accuracy_score(y_test, y_pred)))

Model1 = Pipeline([
    ('vectorizer', CountVectorizer()),
    ('classifier', MultinomialNB())])

trainClassifier(Model1, newsdata.data, newsdata.target)

[nltk_data] Downloading package punkt to
[nltk_data]     /Users/arunnemani/nltk_data...
[nltk_data]   Package punkt is already up-to-date!
Loading 20 newsgroups dataset for categories:
Data loaded
Accuracy: 0.8550509337860781


# Model1 

We just implemented a simple Multinomial Naive Bayes classification model that uses CountVectorizer to extract features. Now we will iterate step by step to improve the model's robustness and accuracy.

First, we need to implement three main features for this project:
- Stopwords: Keep common words (articles, etc.) and punctuation out of the training features.
- Stemming: Reduce inflected forms of a word to a common stem so they count as one feature (e.g. "fly" and "flying").
- Tokenizer: Break entire sentences or lines into meaningful words, phrases, or "tokens".
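
The three steps above can be sketched without NLTK. This is a toy illustration only: the regex tokenizer, the tiny `TOY_STOPWORDS` set, and the suffix-stripping `toy_stem` are hypothetical stand-ins for `word_tokenize`, `stopwords.words('english')`, and `PorterStemmer`, and they are far cruder than the real ones.

```python
import re

# Hypothetical, deliberately tiny stopword list (NLTK's English list is much longer).
TOY_STOPWORDS = {"the", "a", "an", "is", "are", "and", "of", "to", "in"}

def toy_tokenize(text):
    """Lowercase the text and split it into alphanumeric tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())

def toy_stem(word):
    """Strip a few common suffixes; a crude stand-in for the Porter stemmer."""
    for suffix in ("ing", "ed", "s"):
        # Keep at least three characters so short words are left alone.
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word

def preprocess(text):
    """Tokenize, drop stopwords, then stem what is left."""
    return [toy_stem(t) for t in toy_tokenize(text) if t not in TOY_STOPWORDS]

tokens = preprocess("The birds are flying and the planes fly")
print(tokens)
```

Here "flying" and "fly" end up as the same token, which is the effect stemming is meant to have on the feature space.
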

In [20]:
def Stem_tokenize(text):
    stemmer = PorterStemmer()
    return [stemmer.stem(w) for w in word_tokenize(text)]

In [21]:
Model2 = Pipeline([
    ('vectorizer', CountVectorizer(tokenizer=Stem_tokenize,
                             stop_words=stopwords.words('english') + list(string.punctuation))),
    ('classifier', MultinomialNB())])

trainClassifier(Model2, newsdata.data, newsdata.target)

Accuracy: 0.8404074702886248


# Model2

Now that we have incorporated tokenization, stopwords (including punctuation), and stemming, we can fine-tune our feature extraction method. Note that the accuracy dropped from 0.855 (Model1) to 0.840 (Model2); however, the model should be more robust in associating key words with newsgroups (sanity checks to follow).

We will add bigrams to our model so that multi-word features are also captured.
For example, "24 bit" or "image processing" is treated as a single feature.
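
To make the effect of `ngram_range=(1,2)` concrete, here is a minimal pure-Python sketch of how unigrams and bigrams can be generated from a token list. The helper `word_ngrams` is a hypothetical name used for illustration, not the scikit-learn internals.

```python
def word_ngrams(tokens, ngram_range=(1, 2)):
    """Return all contiguous n-grams for every n in the inclusive range."""
    min_n, max_n = ngram_range
    features = []
    for n in range(min_n, max_n + 1):
        # Slide a window of width n across the token list.
        for i in range(len(tokens) - n + 1):
            features.append(" ".join(tokens[i : i + n]))
    return features

tokens = ["24", "bit", "image", "processing"]
features = word_ngrams(tokens, ngram_range=(1, 2))
print(features)
```

Each bigram becomes its own column in the feature matrix, so the vocabulary grows considerably compared with unigrams alone.
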

In [24]:
Model3 = Pipeline([
    ('vectorizer', CountVectorizer(tokenizer=Stem_tokenize,
                             stop_words=stopwords.words('english') + list(string.punctuation),
                                  lowercase=True, strip_accents='ascii', ngram_range=(1,2))),
    ('classifier', MultinomialNB())])

trainClassifier(Model3, newsdata.data, newsdata.target)

Accuracy: 0.8711799660441426


# Model3

Up to now, we have used a standard CountVectorizer, which counts how many times each feature occurs in each document. This can be problematic because it ignores how common a feature is across the whole dataset, so frequent but uninformative terms weigh as much as distinctive ones. This presents an overfitting challenge, particularly for linear approaches.

Thus we use another feature extractor (TfidfVectorizer) that down-weights features occurring in many documents. All of the tuning parameters can still be used.
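
As a rough illustration of the reweighting, the sketch below computes TF-IDF by hand. It assumes raw term counts, the smoothed inverse document frequency `ln((1 + n) / (1 + df)) + 1`, and L2 normalization per document, which is how scikit-learn documents the default `TfidfVectorizer` settings. The function name `tfidf` and the three toy documents are made up for this example.

```python
import math
from collections import Counter

def tfidf(docs):
    """Compute L2-normalized TF-IDF vectors for a list of tokenized documents."""
    n_docs = len(docs)
    # Document frequency: in how many documents each term appears.
    df = Counter(term for doc in docs for term in set(doc))
    # Smoothed inverse document frequency.
    idf = {t: math.log((1 + n_docs) / (1 + df_t)) + 1 for t, df_t in df.items()}
    vectors = []
    for doc in docs:
        counts = Counter(doc)
        weights = {t: c * idf[t] for t, c in counts.items()}
        # Guard against empty documents, whose norm would be zero.
        norm = math.sqrt(sum(w * w for w in weights.values())) or 1.0
        vectors.append({t: w / norm for t, w in weights.items()})
    return vectors

docs = [
    ["image", "file", "format"],
    ["image", "render", "card"],
    ["image", "hockey", "game"],
]
vectors = tfidf(docs)
for vec in vectors:
    print({t: round(w, 3) for t, w in vec.items()})
```

The term "image" appears in every document, so it receives a lower weight than terms that are specific to a single document.
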

In [26]:
Model4 = Pipeline([
    ('vectorizer', TfidfVectorizer(tokenizer=Stem_tokenize,
                             stop_words=stopwords.words('english') + list(string.punctuation),
                                  lowercase=True, strip_accents='ascii', ngram_range=(1,2))),
    ('classifier', MultinomialNB())])

trainClassifier(Model4, newsdata.data, newsdata.target)

Accuracy: 0.8739388794567062


# Model4

With the incorporation of TfidfVectorizer, we see a slight increase in accuracy (0.8712 to 0.8739).
Further optimization:
- max_df and min_df: Tuning these parameters drops features that appear in too many or too few documents, i.e. the outliers.
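
A minimal sketch of what document-frequency filtering does, assuming the usual convention that an integer threshold is an absolute document count and a float is a fraction of the corpus. The helper `filter_vocabulary` and the toy documents are hypothetical and only mimic the idea behind `min_df` and `max_df`.

```python
from collections import Counter

def filter_vocabulary(docs, min_df=1, max_df=1.0):
    """Keep terms whose document frequency lies within [min_df, max_df].

    Integers are absolute document counts; floats are fractions of the corpus.
    """
    n_docs = len(docs)
    # Count each term once per document it appears in.
    df = Counter(term for doc in docs for term in set(doc))
    low = min_df if isinstance(min_df, int) else min_df * n_docs
    high = max_df if isinstance(max_df, int) else max_df * n_docs
    return sorted(t for t, count in df.items() if low <= count <= high)

docs = [
    ["the", "engine", "stalls"],
    ["the", "engine", "roars"],
    ["the", "goalie", "saves"],
    ["the", "goalie", "xqzv"],
]
vocab = filter_vocabulary(docs, min_df=2, max_df=0.75)
print(vocab)
```

With these thresholds, "the" is dropped for appearing in every document and the one-off terms are dropped for appearing in only one.
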

Now, we can fine-tune the classification algorithm (MultinomialNB).

In [27]:
Model5 = Pipeline([
    ('vectorizer', TfidfVectorizer(tokenizer=Stem_tokenize,
                             stop_words=stopwords.words('english') + list(string.punctuation),
                                  lowercase=True, strip_accents='ascii', ngram_range=(1,2))),
    ('classifier', MultinomialNB(alpha=.01))])

trainClassifier(Model5, newsdata.data, newsdata.target)

Accuracy: 0.9320882852292021


# Model5

By tuning the alpha parameter (the additive Laplace/Lidstone smoothing coefficient), we increased our model accuracy by about 6 percentage points (0.874 to 0.932)!
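
To see why alpha matters, the sketch below applies additive smoothing to the per-class term likelihoods of a multinomial Naive Bayes model, `(count + alpha) / (total + alpha * V)`, where `V` is the vocabulary size. The function name and the counts are hypothetical and chosen only for illustration.

```python
def smoothed_likelihoods(term_counts, alpha):
    """Additive smoothing: P(term | class) = (count + alpha) / (total + alpha * V)."""
    total = sum(term_counts.values())
    vocab_size = len(term_counts)
    return {
        term: (count + alpha) / (total + alpha * vocab_size)
        for term, count in term_counts.items()
    }

# Hypothetical term counts for one class; "goalie" never occurs in it.
counts = {"engine": 8, "bike": 2, "goalie": 0}

for alpha in (1.0, 0.01):
    probs = smoothed_likelihoods(counts, alpha)
    print(alpha, {t: round(p, 4) for t, p in probs.items()})
```

A smaller alpha hands less probability mass to unseen terms, so the estimates stay closer to the observed counts. With a large, sparse vocabulary such as stemmed bigrams, that can make a noticeable difference.
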
Next, we can try stochastic gradient descent (SGD).

In [28]:
Model6 = Pipeline([
    ('vectorizer', TfidfVectorizer(tokenizer=Stem_tokenize,
                             stop_words=stopwords.words('english') + list(string.punctuation),
                                  lowercase=True, strip_accents='ascii', ngram_range=(1,2))),
    ('classifier', SGDClassifier(alpha=.0001, max_iter=100, tol=None, penalty="elasticnet"))])

trainClassifier(Model6, newsdata.data, newsdata.target)



Accuracy: 0.8909168081494058


# Model6

The SGD approach yielded a lower accuracy score (0.891 versus 0.932 for Model5); however, this may improve with further parameter tuning.

We will select Model5 as our final model and test it on some text inputs.

In [41]:
def predictNewsGroup(text, pipeline):
    targets = newsdata.target_names
    idx = pipeline.predict([text])[0]
    return targets[idx]

targets = newsdata.target_names
print(targets)

# In a notebook only the value of the last expression is displayed;
# wrap each call in print() to see every prediction.
predictNewsGroup("A Honda CBR is a meh bike", Model5)
predictNewsGroup("Why would", Model5)
predictNewsGroup("This NLP script is very godlike", Model5)

['alt.atheism', 'comp.graphics', 'comp.os.ms-windows.misc', 'comp.sys.ibm.pc.hardware', 'comp.sys.mac.hardware', 'comp.windows.x', 'misc.forsale', 'rec.autos', 'rec.motorcycles', 'rec.sport.baseball', 'rec.sport.hockey', 'sci.crypt', 'sci.electronics', 'sci.med', 'sci.space', 'soc.religion.christian', 'talk.politics.guns', 'talk.politics.mideast', 'talk.politics.misc', 'talk.religion.misc']


'sci.med'