## 20 Newsgroups Dataset Analysis

The objective is to train a text classification model on the 20 newsgroups dataset available through scikit-learn.
This project does the following:

1. Loads the dataset and its categories
2. Extracts feature vectors from the text
3. Trains a linear model to perform categorization
4. Uses a grid search strategy to find a good configuration of both the feature extraction components and the classifier

In [1]:
# Download the 20news dataset, restricted to four categories
from sklearn.datasets import fetch_20newsgroups
categories = ['alt.atheism', 'soc.religion.christian', 'comp.graphics', 'sci.med']
twenty_train = fetch_20newsgroups(subset='train', categories=categories, shuffle=True, random_state=42)
# Category names in the order of their integer target labels
twenty_train.target_names

Downloading 20news dataset. This may take a few minutes.
Downloading dataset from https://ndownloader.figshare.com/files/5975967 (14 MB)


['alt.atheism', 'comp.graphics', 'sci.med', 'soc.religion.christian']

### Extract Features from Data

##### Bag of Words Representation

Assign a fixed integer id to each word occurring in any document of the training set (for instance, by building a dictionary from words to integer indices).

Then, for each document **i**, count the number of occurrences of each word **w** and store it in **X\[i, j\]** as the value of feature **j**, where **j** is the index of word **w** in the dictionary.
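The two steps above can be sketched in plain Python. This is only an illustration on a made-up three-document corpus, not what *CountVectorizer* does internally: the real class also lowercases text, drops single-character tokens, orders the vocabulary alphabetically, and returns a sparse matrix.

```python
# Toy corpus; these three documents are made up for illustration.
docs = [
    "the cat sat on the mat",
    "the dog sat",
    "cat and dog",
]

# Step 1: build the dictionary, giving each distinct word a fixed integer index j.
vocabulary = {}
for doc in docs:
    for word in doc.split():
        if word not in vocabulary:
            vocabulary[word] = len(vocabulary)

# Step 2: X[i][j] = number of times the word with index j occurs in document i.
X = [[0] * len(vocabulary) for _ in docs]
for i, doc in enumerate(docs):
    for word in doc.split():
        X[i][vocabulary[word]] += 1

print(vocabulary)
for row in X:
    print(row)
```

Each row has one column per vocabulary word, so most entries are zero once the vocabulary grows to tens of thousands of words, which is why a sparse representation is the practical choice.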

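Use *TfidfTransformer* to convert the raw word counts into tf-idf weights, which scale each word's frequency in a document by how rare that word is across all documents.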
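As a rough sketch of that weighting, the snippet below applies the smoothed idf formula `ln((1 + n) / (1 + df)) + 1` followed by L2 normalization, which corresponds to the *TfidfTransformer* defaults (`use_idf=True`, `smooth_idf=True`, `norm='l2'`). The count matrix is made up for illustration, and the code is a plain-Python approximation rather than the library's implementation.

```python
import math

# Toy count matrix (rows: documents, columns: words); the values are made up.
counts = [
    [2, 1, 0],
    [1, 0, 1],
    [1, 0, 0],
]
n_docs = len(counts)
n_terms = len(counts[0])

# Document frequency: number of documents that contain each word.
df = [sum(1 for row in counts if row[j] > 0) for j in range(n_terms)]

# Smoothed inverse document frequency: ln((1 + n) / (1 + df)) + 1.
idf = [math.log((1 + n_docs) / (1 + d)) + 1 for d in df]

tfidf = []
for row in counts:
    weighted = [tf * w for tf, w in zip(row, idf)]
    norm = math.sqrt(sum(v * v for v in weighted))
    # L2-normalize so that every document vector has unit length.
    tfidf.append([v / norm if norm else 0.0 for v in weighted])

print(idf)
for row in tfidf:
    print(row)
```

The first word appears in every document, so its idf is the minimum value of 1, while the rarer words receive a larger weight.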

In [16]:
from sklearn.feature_extraction.text import CountVectorizer
count_vect = CountVectorizer()
X_train_counts = count_vect.fit_transform(twenty_train.data)
X_train_counts.shape

(2257, 35788)

In [18]:
from sklearn.feature_extraction.text import TfidfTransformer
tfidf_transformer = TfidfTransformer()
X_train_tfidf = tfidf_transformer.fit_transform(X_train_counts)
X_train_tfidf.shape

(2257, 35788)

### Train classifier

Using the feature vectors, train a Naive Bayes classifier to predict the category of a post

In [19]:
from sklearn.naive_bayes import MultinomialNB
clf = MultinomialNB().fit(X_train_tfidf, twenty_train.target)

In [22]:
# Prediction test
docs_new = ['God is love', 'OpenGL on the GPU is fast']
X_new_counts = count_vect.transform(docs_new)
X_new_tfidf = tfidf_transformer.transform(X_new_counts)

predicted = clf.predict(X_new_tfidf)

for doc, category in zip(docs_new, predicted):
    print('%r => %s' % (doc, twenty_train.target_names[category]))

'God is love' => soc.religion.christian
'OpenGL on the GPU is fast' => comp.graphics


### Build a Pipeline

Simplify the Vectorize => Transform => Classify process by chaining the three steps into a single estimator with the Pipeline class

In [23]:
# Build Pipeline
from sklearn.pipeline import Pipeline
text_clf = Pipeline([
    ('vect', CountVectorizer()),
    ('tfidf', TfidfTransformer()),
    ('clf', MultinomialNB()),
])

In [24]:
# Train Classifier with Pipeline
text_clf.fit(twenty_train.data, twenty_train.target)

Pipeline(memory=None,
         steps=[('vect',
                 CountVectorizer(analyzer='word', binary=False,
                                 decode_error='strict',
                                 dtype=<class 'numpy.int64'>, encoding='utf-8',
                                 input='content', lowercase=True, max_df=1.0,
                                 max_features=None, min_df=1,
                                 ngram_range=(1, 1), preprocessor=None,
                                 stop_words=None, strip_accents=None,
                                 token_pattern='(?u)\\b\\w\\w+\\b',
                                 tokenizer=None, vocabulary=None)),
                ('tfidf',
                 TfidfTransformer(norm='l2', smooth_idf=True,
                                  sublinear_tf=False, use_idf=True)),
                ('clf',
                 MultinomialNB(alpha=1.0, class_prior=None, fit_prior=True))],
         verbose=False)

### Evaluate Performance Accuracy 

Evaluate the predictive accuracy using the test set

In [27]:
import numpy as np
twenty_test = fetch_20newsgroups(subset='test', categories=categories, shuffle=True, random_state=42)
docs_test = twenty_test.data
predicted = text_clf.predict(docs_test)
print('Accuracy:', np.mean(predicted == twenty_test.target))

Accuracy: 0.8348868175765646


#### Support Vector Machine (SVM)

Naive Bayes (above) reaches about 83.5% accuracy. Now try plugging a linear SVM into the Pipeline: SGDClassifier with loss='hinge' trains a linear support vector machine using stochastic gradient descent

In [31]:
from sklearn.linear_model import SGDClassifier
text_clf = Pipeline([
    ('vect', CountVectorizer()),
    ('tfidf', TfidfTransformer()),
    ('clf', SGDClassifier(loss='hinge', penalty='l2', alpha=1e-3, random_state=42, max_iter=5, tol=None)),
])

text_clf.fit(twenty_train.data, twenty_train.target)

predicted = text_clf.predict(docs_test)
print('Accuracy:', np.mean(predicted == twenty_test.target))

Accuracy: 0.9101198402130493


In [33]:
from sklearn import metrics
print(metrics.classification_report(twenty_test.target, predicted, target_names=twenty_test.target_names))
print(metrics.confusion_matrix(twenty_test.target, predicted))

                        precision    recall  f1-score   support

           alt.atheism       0.95      0.80      0.87       319
         comp.graphics       0.87      0.98      0.92       389
               sci.med       0.94      0.89      0.91       396
soc.religion.christian       0.90      0.95      0.93       398

              accuracy                           0.91      1502
             macro avg       0.91      0.91      0.91      1502
          weighted avg       0.91      0.91      0.91      1502

[[256  11  16  36]
 [  4 380   3   2]
 [  5  35 353   3]
 [  5  11   4 378]]
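To see how the report and the matrix relate, the sketch below recomputes accuracy, precision, and recall by hand from the confusion matrix printed above (rows are true classes, columns are predicted classes). It uses plain Python and only the numbers already shown.

```python
# Confusion matrix copied from the output above
# (rows: true class, columns: predicted class).
labels = ['alt.atheism', 'comp.graphics', 'sci.med', 'soc.religion.christian']
cm = [
    [256, 11, 16, 36],
    [4, 380, 3, 2],
    [5, 35, 353, 3],
    [5, 11, 4, 378],
]

total = sum(sum(row) for row in cm)
correct = sum(cm[k][k] for k in range(len(cm)))  # diagonal entries
accuracy = correct / total

precision = []
recall = []
for k in range(len(cm)):
    predicted_k = sum(row[k] for row in cm)  # column sum: everything predicted as class k
    actual_k = sum(cm[k])                    # row sum: everything that truly is class k
    precision.append(cm[k][k] / predicted_k)
    recall.append(cm[k][k] / actual_k)

for name, p, r in zip(labels, precision, recall):
    print('%-24s precision=%.2f recall=%.2f' % (name, p, r))
print('accuracy=%.4f' % accuracy)
```

The low recall for alt.atheism comes mostly from the 36 posts in the first row that were predicted as soc.religion.christian.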


### Parameter Tuning using Grid Search

There are several parameters in the feature extraction steps and the classifier. Instead of setting them manually, use Grid Search to run an exhaustive search for the best parameters over a grid of possible values
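To make "exhaustive" concrete, the sketch below enumerates every combination for a grid with the same three parameters this notebook tunes. It is an illustration of how the candidates multiply, written with `itertools.product`, and is not how GridSearchCV is implemented; the names `param_grid`, `candidates`, and `cv_folds` are chosen here for the example.

```python
from itertools import product

# Illustrative grid; keys follow the <pipeline step>__<parameter> naming convention.
param_grid = {
    'vect__ngram_range': [(1, 1), (1, 2)],
    'tfidf__use_idf': [True, False],
    'clf__alpha': [1e-2, 1e-3],
}

# Every combination of one value per parameter is a candidate configuration.
names = sorted(param_grid)
candidates = [
    dict(zip(names, values))
    for values in product(*(param_grid[name] for name in names))
]

cv_folds = 5  # each candidate is fitted once per cross-validation fold
print(len(candidates), 'candidates,', len(candidates) * cv_folds, 'fits')
for candidate in candidates:
    print(candidate)
```

With two values for each of three parameters there are 8 candidates, so 5-fold cross-validation costs 40 model fits, plus one final refit of the best configuration on the whole training set when the default `refit=True` is used.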

In [40]:
from sklearn.model_selection import GridSearchCV
parameters = { 
    'vect__ngram_range': [(1, 1), (1, 2)],
    'tfidf__use_idf': (True, False),
    'clf__alpha': (1e-2, 1e-3)
}

# n_jobs=-1 runs the exhaustive search in parallel on all available CPU cores
# The iid argument is omitted: it was deprecated in scikit-learn 0.22 and removed in 0.24
gs_clf = GridSearchCV(text_clf, parameters, cv=5, n_jobs=-1)

In [41]:
# Perform Grid Search on the full training set
gs_clf = gs_clf.fit(twenty_train.data, twenty_train.target)

In [46]:
print(gs_clf.best_score_)
for param_name in sorted(parameters.keys()):
    print('%s: %r' % (param_name, gs_clf.best_params_[param_name]))

0.9641208585450458
clf__alpha: 0.001
tfidf__use_idf: True
vect__ngram_range: (1, 1)


In [44]:
predicted = gs_clf.predict(docs_test)
print('Accuracy:', np.mean(predicted == twenty_test.target))

Accuracy: 0.9101198402130493
