In [1]:
import numpy as np

from sklearn.datasets import fetch_20newsgroups

from sklearn.pipeline import Pipeline
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.model_selection import GridSearchCV, train_test_split, StratifiedKFold
from sklearn import metrics
from sklearn.naive_bayes import MultinomialNB

# News Category Classification

In this notebook we will work with the 20 newsgroups dataset, which comprises around 18000 newsgroup posts on 20 topics, split into two subsets: one for training (or development) and one for testing (or performance evaluation). The split between the training and test sets is based on whether a message was posted before or after a specific date.

First we load the data. This may take a couple of minutes and requires internet access.

In [2]:
train_bunch = fetch_20newsgroups(subset='train', remove=('footers', ))
test_bunch = fetch_20newsgroups(subset='test', remove=('footers', ))

## Feature Engineering

Most machine learning algorithms expect a tabular, numerical representation of the data. In this example the raw data consists of pieces of text, so we need feature engineering to find a numerical representation that we can feed to our machine learning algorithms.

In this notebook we will use a very simple but effective method called _bag-of-words_, which is based on counting how often words occur. The procedure looks like this:

1. Preprocess raw text, for instance transforming the documents into lower case.
2. Tokenize the preprocessed texts. In this step we split the text into "tokens", the pieces that we want to work with. For text categorization these are typically words, or combinations of consecutive words.
3. Build a vocabulary of tokens by filtering, for instance removing numbers, non-informative words such as "and"/"or"/..., or applying some domain-specific filtering such as keeping only names.
4. Based on the vocabulary, describe each document as a vector of counts of each word in the vocabulary. This collection of vectors is our bag-of-words.
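The four steps above can be sketched in plain Python on a two-document toy corpus. The documents, the tiny stop-word list and the variable names are illustrative assumptions, not part of the dataset or of scikit-learn.

```python
import re
from collections import Counter

# Toy corpus and stop-word list (both made up for illustration).
docs = [
    "The car is fast.",
    "The engine of the car is loud!",
]
stop_words = {"the", "is", "of"}

# 1. Preprocess: convert to lower case.
preprocessed = [doc.lower() for doc in docs]

# 2. Tokenize: keep runs of two or more word characters.
tokenized = [re.findall(r"\b\w\w+\b", doc) for doc in preprocessed]

# 3. Build the vocabulary: all distinct tokens minus the stop words.
vocabulary = sorted({tok for toks in tokenized for tok in toks} - stop_words)

# 4. Describe each document as a vector of counts over the vocabulary.
bag_of_words = [[Counter(toks)[word] for word in vocabulary] for toks in tokenized]

print(vocabulary)    # ['car', 'engine', 'fast', 'loud']
print(bag_of_words)  # [[1, 0, 1, 0], [1, 1, 0, 1]]
```

Each row corresponds to one document and each column to one vocabulary entry.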

First we take a look at what the raw data looks like.

In [3]:
train_bunch.data[0]

"From: lerxst@wam.umd.edu (where's my thing)\nSubject: WHAT car is this!?\nNntp-Posting-Host: rac3.wam.umd.edu\nOrganization: University of Maryland, College Park\nLines: 15\n\n I was wondering if anyone out there could enlighten me on this car I saw\nthe other day. It was a 2-door sports car, looked to be from the late 60s/\nearly 70s. It was called a Bricklin. The doors were really small. In addition,\nthe front bumper was separate from the rest of the body. This is \nall I know. If anyone can tellme a model name, engine specs, years\nof production, where this car is made, history, or whatever info you\nhave on this funky looking car, please e-mail."

We will use scikit-learn's [`CountVectorizer`](https://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.CountVectorizer.html) class to build our bag-of-words. The `CountVectorizer` class implements the whole procedure described above and allows plenty of customization.

Fitting our vectorizer is a one-liner. The `min_df` keyword specifies the minimum share of documents a token must occur in to be included in the vocabulary. We set a 5 \% threshold to keep the vocabulary small enough for illustration.

In [4]:
count_vectorizer = CountVectorizer(min_df=0.05).fit(train_bunch.data)
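To see what a fractional `min_df` does, here is a small sketch on a made-up four-document corpus (the documents and variable names are assumptions for illustration). With `min_df=0.5`, a token must occur in at least half of the documents to be kept.

```python
from sklearn.feature_extraction.text import CountVectorizer

# Made-up corpus: "car" occurs in 3 of 4 documents, every other word in 1.
toy_docs = [
    "car engine",
    "car wheel",
    "car door",
    "space orbit",
]

full = CountVectorizer().fit(toy_docs)
frequent = CountVectorizer(min_df=0.5).fit(toy_docs)

print(sorted(full.vocabulary_))      # all six words
print(sorted(frequent.vocabulary_))  # ['car']
```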

To show how the preprocessing works, we can separate the preprocessing and tokenization steps.

In [5]:
preprocessor = count_vectorizer.build_preprocessor()
tokenizer = count_vectorizer.build_tokenizer()

The default preprocessing simply converts the text to lower case.

In [6]:
preprocessed_text = preprocessor(train_bunch.data[0])
preprocessed_text

"from: lerxst@wam.umd.edu (where's my thing)\nsubject: what car is this!?\nnntp-posting-host: rac3.wam.umd.edu\norganization: university of maryland, college park\nlines: 15\n\n i was wondering if anyone out there could enlighten me on this car i saw\nthe other day. it was a 2-door sports car, looked to be from the late 60s/\nearly 70s. it was called a bricklin. the doors were really small. in addition,\nthe front bumper was separate from the rest of the body. this is \nall i know. if anyone can tellme a model name, engine specs, years\nof production, where this car is made, history, or whatever info you\nhave on this funky looking car, please e-mail."

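The default tokenization splits the text on whitespace and punctuation and keeps only tokens of two or more alphanumeric characters, which is why single characters such as "i" and "a" are missing from the output below. This often works well, but it splits multi-word names such as "New York" into separate tokens.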

In [7]:
tokens = tokenizer(preprocessed_text)
tokens

['from',
 'lerxst',
 'wam',
 'umd',
 'edu',
 'where',
 'my',
 'thing',
 'subject',
 'what',
 'car',
 'is',
 'this',
 'nntp',
 'posting',
 'host',
 'rac3',
 'wam',
 'umd',
 'edu',
 'organization',
 'university',
 'of',
 'maryland',
 'college',
 'park',
 'lines',
 '15',
 'was',
 'wondering',
 'if',
 'anyone',
 'out',
 'there',
 'could',
 'enlighten',
 'me',
 'on',
 'this',
 'car',
 'saw',
 'the',
 'other',
 'day',
 'it',
 'was',
 'door',
 'sports',
 'car',
 'looked',
 'to',
 'be',
 'from',
 'the',
 'late',
 '60s',
 'early',
 '70s',
 'it',
 'was',
 'called',
 'bricklin',
 'the',
 'doors',
 'were',
 'really',
 'small',
 'in',
 'addition',
 'the',
 'front',
 'bumper',
 'was',
 'separate',
 'from',
 'the',
 'rest',
 'of',
 'the',
 'body',
 'this',
 'is',
 'all',
 'know',
 'if',
 'anyone',
 'can',
 'tellme',
 'model',
 'name',
 'engine',
 'specs',
 'years',
 'of',
 'production',
 'where',
 'this',
 'car',
 'is',
 'made',
 'history',
 'or',
 'whatever',
 'info',
 'you',
 'have',
 'on',
 'this'
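The behaviour above comes from the default `token_pattern` of `CountVectorizer`, a regular expression matching runs of two or more word characters. The sketch below applies that pattern directly and shows one possible way to keep multi-word names together, namely adding bigrams via `ngram_range`. The example sentence is made up.

```python
import re
from sklearn.feature_extraction.text import CountVectorizer

text = "i moved to new york in 1995"  # made-up example sentence

# Default token pattern of CountVectorizer: two or more word characters.
default_pattern = r"(?u)\b\w\w+\b"
unigrams = re.findall(default_pattern, text)
print(unigrams)  # the single-character token "i" is dropped

# With ngram_range=(1, 2) the analyzer also emits pairs of consecutive tokens.
bigram_analyzer = CountVectorizer(ngram_range=(1, 2)).build_analyzer()
bigram_tokens = bigram_analyzer(text)
print("new york" in bigram_tokens)
```

Adding bigrams grows the vocabulary considerably, so it is usually combined with a frequency threshold such as `min_df`.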

The fitted vectorizer's `transform` method returns the bag-of-words matrix in sparse format by default; we convert it to a dense array here only to display it.

In [8]:
count_vectorizer.transform(train_bunch.data[:1]).toarray()

array([[0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
        0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 2, 0, 0, 0, 0, 0,
        0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
        0, 0, 1, 1, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0,
        0, 0, 0, 0, 0, 0, 0, 0, 0, 2, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
        0, 0, 0, 0, 0, 0, 3, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
        0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 2, 1, 0, 1, 0, 0,
        0, 0, 3, 0, 2, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
        1, 0, 0, 0, 0, 0, 1, 0, 1, 1, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0,
        0, 0, 1, 1, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 3, 0, 0, 2,
        0, 0, 0, 1, 0, 0, 1, 1, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0,
        1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0,
        0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0,
        0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 
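Most entries of such a matrix are zero, which is why the sparse format is the default. As a minimal sketch on a made-up two-document corpus, the sparse matrix can be inspected without densifying everything, and the fitted `vocabulary_` mapping tells us which column belongs to which word.

```python
from sklearn.feature_extraction.text import CountVectorizer

toy_docs = ["car engine car", "space orbit"]  # made-up corpus
vectorizer = CountVectorizer().fit(toy_docs)
X = vectorizer.transform(toy_docs)

print(X.shape)  # (number of documents, vocabulary size)
print(X.nnz)    # number of explicitly stored (non-zero) entries

# Map the non-zero counts of the first document back to their words.
row = X[0].toarray().ravel()
nonzero = {
    term: int(row[idx])
    for term, idx in vectorizer.vocabulary_.items()
    if row[idx] > 0
}
print(nonzero)
```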

## Classification

This dataset has quite a lot of observations, 11314 in the training set, so we will use a fixed train/validation split. We will use 20 \% of the training set as the validation set and stratify the selection on the class, so that each class is represented in the validation set in the same proportion as in the training set.

Since the data has a time dimension, we could have chosen the latest articles in each class as the validation set, mirroring how the test set was selected by date. For simplicity we instead sample randomly within each class, which will probably suffice.

In [9]:
train_x, val_x, train_y, val_y = train_test_split(train_bunch.data, train_bunch.target, 
                                                  test_size=0.2, 
                                                  random_state=42,
                                                  stratify=train_bunch.target)
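The effect of `stratify` is easiest to see on imbalanced toy labels. The sketch below uses 100 made-up samples with an 80/20 class ratio and checks that both parts of the split keep that ratio.

```python
from collections import Counter
from sklearn.model_selection import train_test_split

# Made-up data: 80 samples of class 0 and 20 samples of class 1.
labels = [0] * 80 + [1] * 20
samples = list(range(100))

train_s, val_s, train_l, val_l = train_test_split(
    samples, labels,
    test_size=0.2,
    random_state=42,
    stratify=labels,
)

print(Counter(train_l))  # 64 of class 0, 16 of class 1
print(Counter(val_l))    # 16 of class 0, 4 of class 1
```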

To make things easy, we will use the `Pipeline` class again. This time we will use a [Naive Bayes classifier](https://scikit-learn.org/stable/modules/naive_bayes.html#multinomial-naive-bayes) variant.

In [10]:
np.random.seed(1)

pipeline = Pipeline([
    ('vectorizer', CountVectorizer()),
    ('model', MultinomialNB())
])

pipeline.fit(train_x, train_y)
val_accuracy = metrics.accuracy_score(val_y, pipeline.predict(val_x))

print(f'Validation set accuracy: {val_accuracy * 100:.3f} %')

Validation set accuracy: 84.401 %


Around 84 \% on the first attempt is not too shabby.

### Optimization

In the example above we used default settings for both preprocessing and model fitting. This is probably not optimal and we would like to optimize the results.

We will use something called grid-search, which is a brute-force algorithm for optimizing parameters. The user specifies a parameter grid, and a model will be fitted for each point in the grid. The point with the best validation-set performance is the model we then choose to continue with. Grid search is implemented in the [`GridSearchCV`](https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.GridSearchCV.html) class and described in the [user guide](https://scikit-learn.org/stable/modules/grid_search.html#grid-search).
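Because every grid point is fitted once per cross-validation fold, the cost grows multiplicatively with the number of values per parameter. A small sketch using scikit-learn's `ParameterGrid` makes this concrete. The parameter values and the fold count here are illustrative assumptions.

```python
from sklearn.model_selection import ParameterGrid

# Illustrative grid: 3 values for one parameter, 4 for the other.
example_grid = {
    "vectorizer__min_df": [1, 0.01, 0.05],
    "model__alpha": [0.001, 0.01, 0.1, 1],
}
n_folds = 3  # assumed number of cross-validation folds

candidates = list(ParameterGrid(example_grid))
n_fits = len(candidates) * n_folds

print(len(candidates))  # 12 parameter combinations
print(n_fits)           # 36 model fits, plus one final refit on all data
print(candidates[0])
```

The `step__parameter` naming is how a grid addresses parameters of individual `Pipeline` steps.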

First we unfortunately have to redo our validation split, since `GridSearchCV` is built around cross-validation. Instead of the fixed split above we define a stratified 3-fold splitter.

In [11]:
# random_state only matters with shuffle=True (newer scikit-learn rejects it otherwise)
folder = StratifiedKFold(n_splits=3)

Now we will specify our parameter grid. We optimize two parameters, namely `min_df` of the `CountVectorizer` and `alpha` of `MultinomialNB`. We can of course optimize more parameters, but the computational demands will increase.

This fitting may already take a few minutes.

In [12]:
param_grid = {
    'vectorizer__min_df': [1, 0.01, 0.05],  # integer 1 = at least one document, i.e. no filtering
    'model__alpha': [0.001, 0.01, 0.1, 1]
}

grid_search = GridSearchCV(pipeline, param_grid, cv=folder)

%time grid_search.fit(train_bunch.data, train_bunch.target)  # We time the fitting using %time jupyter magic.

print(f'Best validation accuracy: {grid_search.best_score_ * 100:3f} %')

Wall time: 2min 18s
Best validation accuracy: 89.110836 %


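We managed to squeeze out roughly another 5 percentage points of validation accuracy. Note that this figure is a mean over three cross-validation folds, so it is not strictly comparable to the single-split accuracy above.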

We decide to go with this optimized model and report the accuracy on the test set.

In [13]:
test_predicted_target = grid_search.best_estimator_.predict(test_bunch.data)

test_accuracy = metrics.accuracy_score(test_bunch.target, test_predicted_target)

print(f'Test-set accuracy: {test_accuracy * 100:3f} %')

Test-set accuracy: 81.213489 %


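Similar to the Boston Housing dataset, the test-set accuracy is lower than the validation accuracy. One likely reason is that the test set consists of posts from a later period than the training data, whereas our validation folds were drawn from the same period as the training data. But this is what we have.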

## Your attempts

Feel free to experiment with the data available.

Suggestions:
* Try another type of model, for instance a [Support Vector Classifier](https://scikit-learn.org/stable/modules/svm.html#svm-classification) or [Logistic Regression](https://scikit-learn.org/stable/modules/generated/sklearn.linear_model.LogisticRegression.html).
* Try to add another type of preprocessing after the `CountVectorizer`, for instance a [`TfidfTransformer`](https://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.TfidfTransformer.html) that re-weights the words.
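As a starting point, the two suggestions can be combined in one pipeline. The sketch below is only one possible setup, shown on a made-up four-document corpus so that it runs in a moment. The step names, the toy data and the `max_iter` value are assumptions, and nothing here is tuned for the newsgroups data.

```python
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline

# Made-up toy data: class 0 is about cars, class 1 about space.
toy_x = [
    "car engine wheel",
    "car door bumper",
    "space orbit rocket",
    "space shuttle launch",
]
toy_y = [0, 0, 1, 1]

tfidf_pipeline = Pipeline([
    ("vectorizer", CountVectorizer()),            # raw token counts
    ("tfidf", TfidfTransformer()),                # re-weight the counts
    ("model", LogisticRegression(max_iter=1000)), # alternative classifier
])

tfidf_pipeline.fit(toy_x, toy_y)
toy_predictions = tfidf_pipeline.predict(["car bumper", "rocket launch"])
print(toy_predictions)
```

To try it on the real data, fit the same pipeline on `train_x`/`train_y` and score it on the validation set as before.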