# Document Classification

This notebook introduces document classification using NLTK and Scikit-Learn.

## Initialize NLTK

Download the resources that NLTK needs.

In [1]:
import nltk
nltk.download('book', quiet=True)

True

## Import the additional modules

The `numpy` module is used for simple numerical operations. Installing a package for such simple tasks may seem odd, but Scikit-Learn depends on numpy anyway.

NLTK also contains implementations of machine learning algorithms, but Scikit-Learn provides general-purpose implementations. Learning to use its modules therefore also means learning tools that apply to data outside NLP.

In [2]:
import numpy as np

from sklearn.datasets import fetch_20newsgroups
from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import (
    CountVectorizer,
    TfidfTransformer,
)
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import (
    classification_report,
    roc_auc_score,
    confusion_matrix,
)

## Document Classification

Given several news articles, a model will be trained to identify the newspaper section each article belongs to. For simplicity and speed, only a subset of the dataset is used.

### Loading the data

Scikit-Learn provides the `fetch_20newsgroups` function, which fetches the news articles belonging to the categories specified in the function call. Here only the `test` subset of the collection is loaded.

The function returns a dictionary-like object. Its `data` entry contains the list of documents, while `target` contains the integer-encoded labels. The object has other entries, but these two are the ones needed for this example.

In [3]:
categories = [
    'alt.atheism',
    'misc.forsale',
    'rec.sport.baseball',
    'sci.space',
    'soc.religion.christian',
    'talk.politics.guns',
]

In [4]:
news = fetch_20newsgroups(subset='test', categories=categories)

In [5]:
documents = news['data']
labels = news['target']

len(documents)

2262
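How the integer labels relate to category names can be sketched as follows. To avoid downloading the dataset, this example builds a small stand-in with `sklearn.utils.Bunch`, the dictionary-like type that `fetch_20newsgroups` returns. The two documents and the name `toy_news` are made up for illustration; `target_names` is one of the other entries of the real object.

```python
import numpy as np
from sklearn.utils import Bunch

# Hypothetical stand-in for the object returned by fetch_20newsgroups.
toy_news = Bunch(
    data=["Selling a used monitor", "The shuttle launch was delayed"],
    target=np.array([0, 1]),
    target_names=["misc.forsale", "sci.space"],
)

# Entries can be read with dictionary syntax or attribute syntax.
toy_documents = toy_news["data"]
toy_labels = toy_news.target

# Each integer label is an index into target_names.
label_names = [toy_news.target_names[i] for i in toy_labels]
print(label_names)
```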

### Balancing the Dataset

The label counts are not perfectly equal, but the differences are small. The number of documents per label is therefore left as it is.

In [6]:
np.unique(labels, return_counts=True)

(array([0, 1, 2, 3, 4, 5]), array([319, 390, 397, 394, 398, 364]))
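As a rough check of how balanced the labels are, the counts printed above can be turned into proportions and a largest-to-smallest ratio. This is only a sketch; the ratio is an informal indicator, not a formal test of imbalance.

```python
import numpy as np

# Label counts copied from the output above.
counts = np.array([319, 390, 397, 394, 398, 364])

# Share of the dataset held by each label.
proportions = counts / counts.sum()

# Ratio between the largest and the smallest class.
imbalance_ratio = counts.max() / counts.min()

print(proportions.round(3))
print(round(float(imbalance_ratio), 2))
```

A ratio of about 1.25 means the largest class has roughly a quarter more documents than the smallest one.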

### Feature Engineering

Each document is converted into word tokens for bag-of-words processing. Case folding is also applied, since the goal of the example is to identify topics.

In [7]:
tokenized_docs = [nltk.word_tokenize(d.lower()) for d in documents]

### Splitting the Dataset

The dataset will be split into 3 different sets, the training set, the validation set, and the testing set.

*   Train (70%): Used for training the machine learning model
*   Validation (10%): Used for evaluation of pipeline changes such as feature engineering and model hyperparameters
*   Test (20%): Used to evaluate the whole pipeline.

The scikit-learn `train_test_split` function only splits into two sets, so it is used twice. The second call uses `test_size=1/8` because one eighth of the remaining 80% is 10% of the full dataset. The `stratify` option makes sure that the split preserves the balance of the labels, while `random_state` sets the seed for reproducibility.

In [8]:
train_docs, test_docs, train_labels, test_labels = \
    train_test_split(tokenized_docs, labels, test_size=0.2, stratify=labels, random_state=0)

train_docs, val_docs, train_labels, val_labels = \
    train_test_split(train_docs, train_labels, test_size=1/8, stratify=train_labels, random_state=0)

len(train_docs), len(test_docs), len(val_docs)

(1582, 453, 227)
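The arithmetic behind the two-step split can be checked by hand. The sketch below uses exact fractions and assumes that the held-out size is rounded up to a whole document, which is consistent with the sizes printed above.

```python
from fractions import Fraction
from math import ceil

n_documents = 2262
test_fraction = Fraction(1, 5)         # test_size=0.2 in the first call
val_fraction_of_rest = Fraction(1, 8)  # test_size=1/8 in the second call

# Fractions expressed relative to the full dataset.
rest_fraction = 1 - test_fraction
val_fraction = rest_fraction * val_fraction_of_rest
train_fraction = rest_fraction - val_fraction

# Assumption: the held-out part of each split is rounded up.
n_test = ceil(n_documents * test_fraction)
n_rest = n_documents - n_test
n_val = ceil(n_rest * val_fraction_of_rest)
n_train = n_rest - n_val

print(train_fraction, val_fraction, test_fraction)
print(n_train, n_test, n_val)
```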

### Tokens to Counts

Scikit-Learn provides a `CountVectorizer` that converts a document into a vector of token counts. `CountVectorizer` has its own text preprocessing and tokenization, but since the data are already tokenized, the `tokenizer` and `preprocessor` arguments are set to a function that returns its input unchanged. `max_features` is an optional setting that keeps only the N most frequent tokens in the training corpus.

Note that `fit` is called on the training set only, so only the training vocabulary is considered when counting. The training, validation, and test sets are then all converted with `transform`.

In [9]:
cv = CountVectorizer(
    tokenizer=lambda x: x,
    preprocessor=lambda x: x,
    max_features=2000,
)

cv.fit(train_docs)



CountVectorizer(max_features=2000,
                preprocessor=<function <lambda> at 0x7f1d143495f0>,
                tokenizer=<function <lambda> at 0x7f1d14349680>)

In [10]:
cv.vocabulary_

{'from': 751,
 ':': 104,
 '@': 112,
 '(': 13,
 'keith': 986,
 ')': 14,
 'subject': 1708,
 'loaded': 1058,
 'apple': 220,
 'system': 1726,
 'for': 736,
 'sale': 1540,
 'summary': 1714,
 'w/': 1898,
 '2': 53,
 '5.25': 90,
 "''": 6,
 'drives': 610,
 ',': 17,
 '1': 27,
 '3.5': 73,
 'drive': 609,
 'etc': 649,
 '.': 20,
 'distribution': 589,
 'usa': 1858,
 'organization': 1274,
 'university': 1847,
 'of': 1245,
 'washington': 1909,
 'seattle': 1567,
 'lines': 1049,
 '32': 77,
 'nntp-posting-host': 1218,
 '--': 19,
 '-': 18,
 'cpu': 512,
 'on': 1258,
 'card': 392,
 ';': 105,
 'memory': 1128,
 'to': 1804,
 'total': 1815,
 'disk': 588,
 ']': 115,
 '[': 113,
 "'s": 11,
 'with': 1949,
 'monitor': 1162,
 'serial': 1582,
 'cards': 393,
 'external': 679,
 'modem': 1158,
 'original': 1276,
 'all': 173,
 'the': 1765,
 'above': 127,
 '&': 4,
 'software': 1630,
 'talk': 1733,
 'is': 943,
 'cheap': 416,
 'communications': 465,
 'sunday': 1716,
 'game': 763,
 'books': 346,
 'systems': 1727,
 'and': 197,
 

In [11]:
train_cvectors = cv.transform(train_docs)
val_cvectors = cv.transform(val_docs)
test_cvectors = cv.transform(test_docs)

In [12]:
train_cvectors.toarray()

array([[ 2,  0,  3, ...,  0, 17,  0],
       [ 0,  0,  0, ...,  0,  0,  0],
       [ 0,  0,  2, ...,  0,  0,  0],
       ...,
       [ 1,  1,  0, ...,  0,  0,  0],
       [ 5,  0,  2, ...,  1,  0,  1],
       [ 0,  0,  0, ...,  0,  0,  0]])
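The effect of fitting on the training set only can be seen on a toy example. The token lists below are made up, and `identity` is a hypothetical helper that plays the role of the `lambda x: x` used above. Passing `token_pattern=None` is an optional choice made here to avoid a warning about the unused default pattern in recent Scikit-Learn versions.

```python
from sklearn.feature_extraction.text import CountVectorizer


def identity(tokens):
    """Return the already tokenized document unchanged."""
    return tokens


toy_train = [["space", "launch", "space"], ["god", "faith"]]
toy_unseen = [["space", "rocket", "god"]]

toy_cv = CountVectorizer(
    tokenizer=identity,
    preprocessor=identity,
    token_pattern=None,
)
toy_cv.fit(toy_train)

# Column order follows the sorted vocabulary.
feature_names = sorted(toy_cv.vocabulary_)
train_counts = toy_cv.transform(toy_train).toarray()

# "rocket" was never seen during fit, so it is silently dropped.
unseen_counts = toy_cv.transform(toy_unseen).toarray()

print(feature_names)
print(train_counts)
print(unseen_counts)
```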

### Counts to TF-IDF

To weight the word counts by inverse document frequency, `TfidfTransformer` transforms a word count matrix into normalized TF-IDF values.

As with the counts, `fit` is called on the training set only, so only the IDF values from the training set are used for weighting. The training, validation, and test sets are then all converted with `transform`.

In [13]:
tfidf = TfidfTransformer()
tfidf.fit(train_cvectors)

TfidfTransformer()

In [14]:
train_vectors = tfidf.transform(train_cvectors)
val_vectors = tfidf.transform(val_cvectors)
test_vectors = tfidf.transform(test_cvectors)

In [15]:
train_vectors.toarray()

array([[0.02751813, 0.        , 0.05194713, ..., 0.        , 0.30754969,
        0.        ],
       [0.        , 0.        , 0.        , ..., 0.        , 0.        ,
        0.        ],
       [0.        , 0.        , 0.19364577, ..., 0.        , 0.        ,
        0.        ],
       ...,
       [0.01048542, 0.02004091, 0.        , ..., 0.        , 0.        ,
        0.        ],
       [0.14654142, 0.        , 0.07376863, ..., 0.07797382, 0.        ,
        0.07729232],
       [0.        , 0.        , 0.        , ..., 0.        , 0.        ,
        0.        ]])
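To make the transformation concrete, the sketch below recomputes TF-IDF by hand on a made-up count matrix and compares it with `TfidfTransformer`. It assumes the transformer's default settings, namely the smoothed IDF `ln((1 + n) / (1 + df)) + 1` followed by L2 normalization of each row.

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfTransformer

# Made-up counts: 2 documents, 3 terms.
counts = np.array([[2, 0, 1],
                   [0, 1, 1]])

n_docs = counts.shape[0]
doc_freq = (counts > 0).sum(axis=0)  # documents containing each term

# Smoothed inverse document frequency.
idf = np.log((1 + n_docs) / (1 + doc_freq)) + 1

# Weight the counts, then scale each row to unit L2 norm.
weighted = counts * idf
manual = weighted / np.linalg.norm(weighted, axis=1, keepdims=True)

from_sklearn = TfidfTransformer().fit_transform(counts).toarray()
print(np.allclose(manual, from_sklearn))
```

The third term appears in every document, so its IDF is the minimum value of 1 and it contributes less to each row than a rarer term with the same count.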

### Random Forest

A `RandomForestClassifier` is trained to predict the label of each document from its TF-IDF vector. `n_estimators` sets the number of decision trees in the ensemble, and `n_jobs=-1` trains them using all available CPU cores.

In [16]:
rf = RandomForestClassifier(n_estimators=100, n_jobs=-1)
rf.fit(train_vectors, train_labels)

RandomForestClassifier(n_jobs=-1)
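A random forest is trained with randomness, and the model above does not fix a seed, so its results can differ slightly from run to run. The sketch below uses synthetic data (not the newsgroups vectors) to show that `n_estimators` controls the number of fitted trees and that passing `random_state` makes two fits identical.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

# Synthetic data: the label depends only on the first feature.
rng = np.random.default_rng(0)
X = rng.random((60, 5))
y = (X[:, 0] > 0.5).astype(int)

rf_a = RandomForestClassifier(n_estimators=10, random_state=0).fit(X, y)
rf_b = RandomForestClassifier(n_estimators=10, random_state=0).fit(X, y)

# One fitted decision tree per estimator.
n_trees = len(rf_a.estimators_)

# The same seed gives the same forest, hence the same importances.
same_importances = np.allclose(
    rf_a.feature_importances_, rf_b.feature_importances_
)
print(n_trees, same_importances)
```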

In [17]:
train_predict = rf.predict(train_vectors)
val_predict = rf.predict(val_vectors)
test_predict = rf.predict(test_vectors)

### Evaluation

Each set is evaluated to show how performance degrades from the training data to out-of-sample data. The following metrics are computed for each set.

*   Precision
*   Recall
*   F1-Score
*   Accuracy
*   Confusion Matrix

#### Training Performance

In [18]:
print(classification_report(train_labels, train_predict))

              precision    recall  f1-score   support

           0       1.00      1.00      1.00       223
           1       1.00      1.00      1.00       273
           2       1.00      1.00      1.00       278
           3       1.00      1.00      1.00       275
           4       1.00      1.00      1.00       278
           5       1.00      1.00      1.00       255

    accuracy                           1.00      1582
   macro avg       1.00      1.00      1.00      1582
weighted avg       1.00      1.00      1.00      1582



In [19]:
confusion_matrix(train_labels, train_predict)

array([[223,   0,   0,   0,   0,   0],
       [  0, 273,   0,   0,   0,   0],
       [  0,   0, 278,   0,   0,   0],
       [  0,   0,   0, 275,   0,   0],
       [  0,   0,   0,   0, 278,   0],
       [  0,   0,   0,   0,   0, 255]])

#### Validation Performance

In [20]:
print(classification_report(val_labels, val_predict))

              precision    recall  f1-score   support

           0       1.00      0.75      0.86        32
           1       0.83      0.97      0.89        39
           2       0.92      0.85      0.88        40
           3       0.88      0.93      0.90        40
           4       0.89      0.97      0.93        40
           5       1.00      0.94      0.97        36

    accuracy                           0.91       227
   macro avg       0.92      0.90      0.91       227
weighted avg       0.91      0.91      0.91       227



In [21]:
confusion_matrix(val_labels, val_predict)

array([[24,  0,  1,  3,  4,  0],
       [ 0, 38,  0,  1,  0,  0],
       [ 0,  5, 34,  1,  0,  0],
       [ 0,  2,  0, 37,  1,  0],
       [ 0,  0,  1,  0, 39,  0],
       [ 0,  1,  1,  0,  0, 34]])
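The numbers in the classification report can be derived from the confusion matrix, where rows are true labels and columns are predicted labels. The sketch below recomputes them from the validation matrix printed above.

```python
import numpy as np

# Validation confusion matrix copied from the output above.
val_cm = np.array([
    [24,  0,  1,  3,  4,  0],
    [ 0, 38,  0,  1,  0,  0],
    [ 0,  5, 34,  1,  0,  0],
    [ 0,  2,  0, 37,  1,  0],
    [ 0,  0,  1,  0, 39,  0],
    [ 0,  1,  1,  0,  0, 34],
])

true_positives = np.diag(val_cm)

# Recall: correct predictions divided by the row total (true count).
recall = true_positives / val_cm.sum(axis=1)

# Precision: correct predictions divided by the column total (predicted count).
precision = true_positives / val_cm.sum(axis=0)

f1 = 2 * precision * recall / (precision + recall)
accuracy = true_positives.sum() / val_cm.sum()

print(precision.round(2))
print(recall.round(2))
print(f1.round(2))
print(round(float(accuracy), 2))
```

For label 0, every document predicted as 0 is correct (precision 1.00), but 8 of the 32 true documents are assigned to other labels (recall 0.75).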

#### Test Performance

In [22]:
print(classification_report(test_labels, test_predict))

              precision    recall  f1-score   support

           0       0.98      0.66      0.79        64
           1       0.85      0.96      0.90        78
           2       0.87      0.90      0.88        79
           3       0.88      0.92      0.90        79
           4       0.89      0.97      0.93        80
           5       0.94      0.89      0.92        73

    accuracy                           0.89       453
   macro avg       0.90      0.88      0.89       453
weighted avg       0.90      0.89      0.89       453



In [23]:
confusion_matrix(test_labels, test_predict)

array([[42,  1,  8,  2,  9,  2],
       [ 0, 75,  1,  2,  0,  0],
       [ 1,  4, 71,  2,  1,  0],
       [ 0,  4,  0, 73,  0,  2],
       [ 0,  1,  0,  1, 78,  0],
       [ 0,  3,  2,  3,  0, 65]])

### Interpretability

Random forests also offer some interpretability. They provide feature importances, which score how much each feature contributes to separating the classes.

`CountVectorizer` assigns vector positions according to the sorted vocabulary, so the feature importances follow the same order.

Some of the most important words are directly related to a category, such as `space`, `sale`, `baseball`, `players`, and `god`. The least important words, on the other hand, are generic ones such as `fully`, `somewhat`, and `allowed`, along with some rare-looking tokens. Because the forest is trained without a fixed `random_state`, the exact rankings vary between runs.

In [24]:
importance_idx = rf.feature_importances_.argsort()

In [25]:
vocabulary = sorted(cv.vocabulary_)

In [26]:
top_importance = [vocabulary[idx] for idx in importance_idx[-10:][::-1]]
top_importance

['space',
 '$',
 'god',
 'athos.rutgers.edu',
 'sale',
 're',
 'that',
 'baseball',
 'waco',
 'players']

In [27]:
low_importance = [vocabulary[idx] for idx in importance_idx[:10]]
low_importance

['fully',
 'speaking',
 'south',
 'allowed',
 'somewhat',
 'et',
 'buyback',
 'burn',
 'americans',
 'jmd']
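The mapping from importances back to words can be illustrated on a tiny made-up problem. In the sketch below, the vocabulary and counts are hypothetical: `and` and `the` have the same count in every document, so they cannot separate the classes, while `space` separates them perfectly.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

# Hypothetical sorted vocabulary, one entry per feature column.
toy_vocabulary = ["and", "space", "the"]

# 10 documents about space (label 1) and 10 about something else (label 0).
X = np.array([[1, 3, 1]] * 10 + [[1, 0, 1]] * 10)
y = np.array([1] * 10 + [0] * 10)

toy_rf = RandomForestClassifier(n_estimators=20, random_state=0).fit(X, y)

# Indices sorted from most to least important.
order = toy_rf.feature_importances_.argsort()[::-1]
ranked = [
    (toy_vocabulary[i], float(toy_rf.feature_importances_[i])) for i in order
]
top_word = ranked[0][0]
print(ranked)
```

Constant features can never be used for a split, so all of the importance goes to `space`. In the real dataset the importance is spread over many correlated words, which is one reason the rankings should be read as a rough guide.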