# Text Classification with Scikit-Learn

The goal of this workshop is to explore text classification with scikit-learn, a widely used library.

In this pre-work, you will see how to do text classification on a collection of text documents on different topics.

We will:
- Read in data
- Extract feature vectors
- Train a linear model to perform categorization
- Find a good configuration of both the feature extraction components and the classifier

These activities are adapted from the "Working with Text Data" scikit-learn tutorial (no longer available, but contained in [v1.4](https://scikit-learn.org/1.4/tutorial/text_analytics/working_with_text_data.html) of the documentation).

**You are welcome to download and run this notebook somewhere else** (e.g., on your local computer). If you do, you will need to have [scikit-learn](https://scikit-learn.org/1.4/install.html#installation-instructions) installed.


## Loading the 20 newsgroups dataset

We will be using a classic dataset called “Twenty Newsgroups”. Here is the official description, quoted from [the website](http://qwone.com/~jason/20Newsgroups/):

"The 20 Newsgroups data set is a collection of approximately 20,000 newsgroup documents, partitioned (nearly) evenly across 20 different newsgroups. To the best of our knowledge, it was originally collected by Ken Lang, probably for his paper “Newsweeder: Learning to filter netnews,” though he does not explicitly mention this collection. The 20 newsgroups collection has become a popular data set for experiments in text applications of machine learning techniques, such as text classification and text clustering."

The code below uses scikit-learn's built-in dataset loader for 20 newsgroups. Alternatively, it is possible to download the dataset manually from the website and use the [sklearn.datasets.load_files](https://scikit-learn.org/stable/modules/generated/sklearn.datasets.load_files.html) function by pointing it to the `20news-bydate-train` sub-folder of the uncompressed archive folder.

In order to get faster execution times for this first example, we will work on a partial dataset with only 4 categories out of the 20 available in the dataset:

In [63]:
categories = ['comp.sys.mac.hardware', 'rec.sport.baseball',
               'comp.graphics', 'sci.med']

We can now load the list of files matching those categories as follows:

In [64]:
from sklearn.datasets import fetch_20newsgroups

twenty_train = fetch_20newsgroups(subset='train',
     categories=categories, shuffle=True, random_state=42)

The returned dataset is a scikit-learn “bunch”: a simple holder object whose fields can be accessed either as Python `dict` keys or as `object` attributes for convenience. For example, the `target_names` field holds the list of the requested category names (only four of the twenty, because these are the ones we asked for above):

In [65]:
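# Get the category names: the human-readable labels for the classes in a classification problem.
# They improve readability by mapping numeric labels (such as 0, 1, 2) to concrete categories (such as "cat", "dog", "bird").
twenty_train.target_names
# The output below lists the newsgroup categories that were loaded.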

['comp.graphics', 'comp.sys.mac.hardware', 'rec.sport.baseball', 'sci.med']
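
As a small illustration of the two access styles, the sketch below builds a bunch by hand and reads the same field both ways. It assumes `Bunch` can be imported from `sklearn.utils`; the contents are made up and merely stand in for `twenty_train`, so nothing is downloaded.

```python
from sklearn.utils import Bunch

# A hand-made Bunch with hypothetical contents, standing in for twenty_train.
toy = Bunch(
    data=["first document", "second document"],
    target=[1, 0],
    target_names=["comp.graphics", "sci.med"],
)

# Dictionary-style and attribute-style access return the same object.
names_by_key = toy["target_names"]
names_by_attr = toy.target_names
print(names_by_key is names_by_attr)
print(sorted(toy.keys()))
```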

The files themselves are loaded in memory in the `data` attribute:

In [66]:
print("Data", len(twenty_train.data))

print(twenty_train.data[0])  # Print the first news article; the text does not include the label

Data 2353


We can also get the filenames:

In [67]:
print("Filenames", len(twenty_train.filenames))

Filenames 2353


Let’s print the first lines of the first loaded file and its label:

In [68]:
print("Label:", twenty_train.target_names[twenty_train.target[0]])
print("\n".join(twenty_train.data[0].split("\n")[:3]))

Label: sci.med
From: robin@ntmtv.com (Robin Coutellier)
Subject: Critique of Pressure Point Massager
Originator: robin@volans


Supervised learning algorithms require a category label for each document in the training set. In this case, the category is the name of the newsgroup, which also happens to be the name of the folder holding the individual documents.

For speed and space efficiency reasons, scikit-learn loads the target attribute as an array of integers that corresponds to the index of the category name in the `target_names` list. The category integer id of each sample is stored in the `target` attribute:



In [69]:
twenty_train.target[:10]

array([3, 0, 1, 0, 2, 0, 0, 1, 3, 2])

It is possible to get back the category names as follows:

In [70]:
for t in twenty_train.target[:10]:
    print(twenty_train.target_names[t])

sci.med
comp.graphics
comp.sys.mac.hardware
comp.graphics
rec.sport.baseball
comp.graphics
comp.graphics
comp.sys.mac.hardware
sci.med
rec.sport.baseball
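
The mapping between names and integer ids can be sketched in plain Python. The labels below are hypothetical, and the sketch assumes the category names are kept in alphabetical order, as the `target_names` output above suggests; it is not scikit-learn's loader code.

```python
# Hypothetical folder names for five documents (one label per document).
labels = ["sci.med", "comp.graphics", "sci.med",
          "rec.sport.baseball", "comp.graphics"]

# The sorted unique names play the role of target_names.
target_names = sorted(set(labels))
name_to_id = {name: i for i, name in enumerate(target_names)}

# The integer ids play the role of the target array.
target = [name_to_id[name] for name in labels]

# Going back from ids to names is a simple list lookup.
recovered = [target_names[t] for t in target]
print(target_names)
print(target)
```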


You might have noticed that the samples were shuffled randomly when we called `fetch_20newsgroups(..., shuffle=True, random_state=42)`: this is useful if you wish to select only a subset of samples to quickly train a model and get a first idea of the results before re-training on the complete dataset later.

## Extracting features from text files

In order to perform machine learning on text documents, we first need to turn the text content into a numerical representation (as discussed in lectures 1 and 2).

### Bag of words

We will start with the approach described in lecture 1, a bag of words:

1. Assign a fixed integer id to each word occurring in any document of the training set (for instance by building a dictionary from words to integer indices).
2. For each document `d` and each word `w`, count how often `w` appears in `d` and store the value in `X[d, w]`. Rather than the string representing the word, we will use its ID from the dictionary created in (1).
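
The two steps above can be sketched in plain Python on a toy corpus. This is only an illustration with whitespace tokenisation and a dense list of lists; it is not how scikit-learn implements it.

```python
from collections import Counter

docs = ["the cat sat on the mat", "the dog sat"]  # toy training set

# Step 1: assign a fixed integer id to each word seen in training.
vocab = {}
for doc in docs:
    for word in doc.split():
        vocab.setdefault(word, len(vocab))

# Step 2: X[d][w] counts how often the word with id w appears in document d.
X = [[0] * len(vocab) for _ in docs]
for d, doc in enumerate(docs):
    for word, count in Counter(doc.split()).items():
        X[d][vocab[word]] = count

print(vocab)
print(X)
```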

We call each word a 'feature'. A typical vocabulary has over 100,000 words, so we will have a lot of features!

If we have 10,000 samples, then storing `X` as a NumPy array of type float32 would require 10,000 x 100,000 x 4 bytes = 4GB of RAM.

As discussed in lecture 1, many of these counts will be zero. In fact, our documents have at most a few thousand words, so the vast majority of the counts will be zero. We can describe the data as a high-dimensional sparse dataset.
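
To make the saving concrete, here is a back-of-the-envelope comparison. The figure of 1,000 distinct words per document and the storage layout (a 4-byte value plus a 4-byte column index per non-zero entry, plus one 4-byte offset per row) are assumptions chosen for illustration only.

```python
n_samples = 10_000
n_features = 100_000
bytes_per_float32 = 4

# Dense storage: every cell is stored, whether it is zero or not.
dense_bytes = n_samples * n_features * bytes_per_float32

# Assumption for illustration: at most 1,000 distinct words per document.
max_nonzero_per_doc = 1_000

# Assumed layout: a 4-byte value and a 4-byte column index per non-zero
# entry, plus one 4-byte row offset per row (and one extra at the end).
sparse_bytes = (n_samples * max_nonzero_per_doc * (4 + 4)
                + (n_samples + 1) * 4)

print(dense_bytes / 1e9, "GB dense")
print(sparse_bytes / 1e6, "MB sparse (upper bound under these assumptions)")
```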

This was the motivation for the sparse storage methods that were discussed in lecture 1 and are the focus of part of assignment 1. `scipy.sparse` matrices are data structures that do exactly what we need, and scikit-learn has built-in support for these structures.

In the assignment, we don't let you use these libraries, so that you can develop an intuition for how they work. In practice, you should not use your own implementation, as other people have spent a lot of time optimising these implementations.

### Tokenising text with scikit-learn


To count tokens we can use [CountVectorizer](https://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.CountVectorizer.html#). As well as counting, it can also perform a range of text processing steps, including tokenising text, filtering stopwords, ignoring very frequent or very rare features, and so on. It produces an object that can then transform a document into a feature vector.

To explain these processing steps a little more:

- _Tokenising_ is splitting a string into a series of individual tokens, e.g., `"Hi there!" -> ["Hi", "there", "!"]`. This is an essential step.
- _Filtering stopwords_ is the removal of very common words, e.g., `["The", "book", "is", "good"] -> ["book", "good"]`.  Note that this is not necessarily a good idea, depending on the application.
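
Both steps can be sketched with the standard library. The regular expression and the tiny stopword list below are assumptions made for this illustration; `CountVectorizer` has its own default tokeniser and stopword handling, which behave differently (see its documentation).

```python
import re

def tokenise(text):
    # Runs of word characters, or single punctuation marks, become tokens.
    return re.findall(r"\w+|[^\w\s]", text)

# A tiny, hypothetical stopword list; real lists are much longer.
STOPWORDS = {"the", "is", "a", "an", "of"}

def remove_stopwords(tokens):
    # Compare in lower case so that "The" matches "the".
    return [t for t in tokens if t.lower() not in STOPWORDS]

print(tokenise("Hi there!"))
print(remove_stopwords(["The", "book", "is", "good"]))
```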

In [71]:
from sklearn.feature_extraction.text import CountVectorizer
count_vect = CountVectorizer()
X_train_counts = count_vect.fit_transform(twenty_train.data)
X_train_counts.shape

(2353, 33868)

[CountVectorizer](https://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.CountVectorizer.html#) supports counts of N-grams of words or characters. An N-gram is a sequence of N consecutive items treated as one item, e.g., the bigrams in the string `"I like ice cream"` are `"I like"`, `"like ice"`, and `"ice cream"`.
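
A word N-gram extractor takes only a few lines. The helper name `word_ngrams` is hypothetical, and the sketch splits on whitespace rather than using scikit-learn's tokeniser.

```python
def word_ngrams(text, n):
    # Split on whitespace, then slide a window of n words across the list.
    words = text.split()
    return [" ".join(words[i:i + n]) for i in range(len(words) - n + 1)]

print(word_ngrams("I like ice cream", 2))
```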

Once fitted, the vectorizer has built a dictionary of feature indices:

In [72]:
count_vect.vocabulary_.get('algorithm')

5071

The index of a word in the vocabulary identifies the column of the count matrix that stores that word's frequency in each document of the training corpus.
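
The link between `vocabulary_` and the columns of the count matrix is easier to see on a toy corpus. The two documents below are made up for illustration and are not part of the newsgroups data.

```python
from sklearn.feature_extraction.text import CountVectorizer

toy_docs = ["the cat sat", "the dog sat"]  # toy corpus
vect = CountVectorizer()
X = vect.fit_transform(toy_docs)

# vocabulary_ maps each word to the column that stores its counts.
col = vect.vocabulary_["dog"]
print(vect.vocabulary_)
print(X.toarray())
print(X[:, col].toarray().ravel())  # counts of "dog" in each document
```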

### TF-IDF

As discussed in lecture 1, raw frequencies are often not quite what we want. We described the TF-IDF equation, which changes the value we store for a word, by (1) squashing the frequency, and (2) rescaling it depending on how many different documents the word appears in.
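
The two ingredients can be sketched by hand. The formula below, `(1 + ln(count)) * ln(N / df)`, is one common textbook variant and is an assumption here; it is not necessarily the exact equation from the lecture, and scikit-learn's `TfidfTransformer` uses somewhat different defaults (see its documentation).

```python
import math
from collections import Counter

docs = [["cat", "cat", "sat"], ["dog", "sat"]]  # toy tokenised documents
n_docs = len(docs)

# Document frequency: the number of documents each word appears in.
df = Counter(word for doc in docs for word in set(doc))

def tfidf(doc):
    weights = {}
    for word, count in Counter(doc).items():
        tf = 1 + math.log(count)           # (1) squash the raw frequency
        idf = math.log(n_docs / df[word])  # (2) rarer across documents => larger
        weights[word] = tf * idf
    return weights

weights = [tfidf(doc) for doc in docs]
print(weights[0])
print(weights[1])
```

In this variant, a word that occurs in every document (here `"sat"`) receives a weight of zero, because it does not help to tell the documents apart.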

TF-IDF can be computed using [TfidfTransformer](https://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.TfidfTransformer.html):


In [73]:
from sklearn.feature_extraction.text import TfidfTransformer
tf_transformer = TfidfTransformer(use_idf=False).fit(X_train_counts)
X_train_tf = tf_transformer.transform(X_train_counts)
X_train_tf.shape

(2353, 33868)

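In the code above, we first use the `fit(..)` method to fit our estimator to the data and then use the `transform(..)` method to transform our count matrix. Because we passed `use_idf=False`, the result contains normalised term frequencies only, without the IDF rescaling.

These two steps can be combined to achieve the same end result faster by using the `fit_transform(..)` method. This time we keep the default `use_idf=True`, which gives the full TF-IDF representation: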

In [74]:
tfidf_transformer = TfidfTransformer()
X_train_tfidf = tfidf_transformer.fit_transform(X_train_counts)
X_train_tfidf.shape

(2353, 33868)

## Training a classifier

Now that we have our features, we can train a classifier to try to predict the category of a post. Let’s start with a [naïve Bayes](https://scikit-learn.org/stable/modules/naive_bayes.html#naive-bayes) classifier, which provides a nice baseline for this task. Note that naïve Bayes was briefly mentioned in lecture 2, but you are not expected to understand the details of the method.

scikit-learn includes several variants of naïve Bayes, and the one most suitable for word counts is the multinomial variant:


In [75]:
from sklearn.naive_bayes import MultinomialNB
clf = MultinomialNB().fit(X_train_tfidf, twenty_train.target)

To predict the category of a new document, we need to extract features using almost the same steps as before. The difference is that we call `transform` instead of `fit_transform` on the transformers: `fit_transform` learns the transformer's parameters (such as the vocabulary and the IDF weights) from the data and then transforms it, whereas `transform` only applies the already fitted transformation to new data:

In [76]:
docs_new = ['God is love', 'OpenGL on the GPU is fast']
X_new_counts = count_vect.transform(docs_new)
X_new_tfidf = tfidf_transformer.transform(X_new_counts)

predicted = clf.predict(X_new_tfidf)

for doc, category in zip(docs_new, predicted):
    print('%r => %s' % (doc, twenty_train.target_names[category]))

'God is love' => rec.sport.baseball
'OpenGL on the GPU is fast' => comp.graphics
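
The difference between `fit_transform` and `transform` shows up clearly when a new document contains a word that was not in the training data. The toy documents below are made up for illustration.

```python
from sklearn.feature_extraction.text import CountVectorizer

train_docs = ["the cat sat", "the dog sat"]  # toy training set
vect = CountVectorizer()
X_train = vect.fit_transform(train_docs)  # learns the vocabulary, then counts

# "bird" was never seen during fit, so it has no column and is ignored.
X_new = vect.transform(["the bird sat"])  # reuses the learned vocabulary

print(X_train.shape, X_new.shape)
print(X_new.toarray())
```

The new matrix has the same number of columns as the training matrix, which is what lets the trained classifier accept it.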


## Building a pipeline

In order to make the vectorizer => transformer => classifier easier to work with, scikit-learn provides a [Pipeline](https://scikit-learn.org/stable/modules/generated/sklearn.pipeline.Pipeline.html) class that behaves like a classifier, doing all of the steps together, passing the output of one as the input to the next:


In [77]:
from sklearn.pipeline import Pipeline
text_clf = Pipeline([
    ('vect', CountVectorizer()),
    ('tfidf', TfidfTransformer()),
    ('clf', MultinomialNB()),
])

The names `vect`, `tfidf` and `clf` (classifier) are arbitrary. We will use them to perform grid search for suitable hyperparameters below. We can now train the model with a single command:



In [78]:
text_clf.fit(twenty_train.data, twenty_train.target)
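
The step names matter because they let us address the parameters of an individual step as `<step name>__<parameter name>`. A minimal sketch on a fresh, unfitted pipeline (the variable name `pipe` is only for this illustration):

```python
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import Pipeline

pipe = Pipeline([
    ('vect', CountVectorizer()),
    ('tfidf', TfidfTransformer()),
    ('clf', MultinomialNB()),
])

# Parameters of a step are addressed as "<step name>__<parameter name>".
pipe.set_params(vect__ngram_range=(1, 2), clf__alpha=0.5)

print(pipe.named_steps['vect'].ngram_range)
print(pipe.named_steps['clf'].alpha)
print('tfidf__use_idf' in pipe.get_params())
```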

## Evaluation of the performance on the test set

Now we will get the test data, which the model has not seen, and use it to measure how accurate the model is:

In [79]:
import numpy as np
twenty_test = fetch_20newsgroups(subset='test',
    categories=categories, shuffle=True, random_state=42)
docs_test = twenty_test.data
predicted = text_clf.predict(docs_test)
np.mean(predicted == twenty_test.target)

np.float64(0.9329929802169751)

We achieved 93.3% accuracy. Let’s see if we can do better with a linear [support vector machine (SVM)](https://scikit-learn.org/stable/modules/svm.html#svm), which is widely regarded as one of the best linear text classification algorithms (although it’s also a bit slower than naïve Bayes). We can change the learner by simply plugging a different classifier object into our pipeline:

In [80]:
from sklearn.linear_model import SGDClassifier
text_clf = Pipeline([
    ('vect', CountVectorizer()),
    ('tfidf', TfidfTransformer()),
    ('clf', SGDClassifier(loss='hinge', penalty='l2',
                          alpha=1e-3, random_state=42,
                          max_iter=5, tol=None)),
])

text_clf.fit(twenty_train.data, twenty_train.target)
predicted = text_clf.predict(docs_test)
np.mean(predicted == twenty_test.target)

np.float64(0.9432035737077218)

We achieved 94.3% accuracy using the SVM. scikit-learn provides further utilities for more detailed performance analysis of the results:

In [81]:
from sklearn import metrics
print(metrics.classification_report(twenty_test.target, predicted,
    target_names=twenty_test.target_names))

                       precision    recall  f1-score   support

        comp.graphics       0.92      0.92      0.92       389
comp.sys.mac.hardware       0.95      0.95      0.95       385
   rec.sport.baseball       0.94      0.99      0.97       397
              sci.med       0.97      0.92      0.94       396

             accuracy                           0.94      1567
            macro avg       0.94      0.94      0.94      1567
         weighted avg       0.94      0.94      0.94      1567



These metrics are as described in lecture 2.

We can see that we are doing slightly better on baseball, which has the highest F1 score (0.97).

Now let's look at the confusion matrix:

In [82]:
metrics.confusion_matrix(twenty_test.target, predicted)

array([[358,  12,  10,   9],
       [ 10, 364,   8,   3],
       [  0,   3, 393,   1],
       [ 22,   5,   6, 363]])

Baseball is very different from medicine and computer-related documents, so it is rarely confused with them (see the third row and column). The two computer-related categories are easy to confuse.
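
The precision, recall and accuracy figures reported earlier can be recomputed from this matrix by hand. The sketch below copies the matrix from the output above and assumes scikit-learn's convention that rows are true classes and columns are predicted classes.

```python
names = ['comp.graphics', 'comp.sys.mac.hardware',
         'rec.sport.baseball', 'sci.med']
# Rows are true classes, columns are predicted classes.
cm = [[358,  12,  10,   9],
      [ 10, 364,   8,   3],
      [  0,   3, 393,   1],
      [ 22,   5,   6, 363]]

n = len(cm)
total = sum(sum(row) for row in cm)

# Accuracy: correct predictions (the diagonal) over all predictions.
accuracy = sum(cm[i][i] for i in range(n)) / total
# Recall: correct predictions over everything that truly is class i (row sum).
recalls = [cm[i][i] / sum(cm[i]) for i in range(n)]
# Precision: correct predictions over everything predicted as class i (column sum).
precisions = [cm[i][i] / sum(cm[r][i] for r in range(n)) for i in range(n)]

for name, p, r in zip(names, precisions, recalls):
    print(f"{name:>21}  precision={p:.2f}  recall={r:.2f}")
print(f"accuracy={accuracy:.3f}")
```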

## Hyperparameter tuning using grid search

We’ve already encountered some hyperparameters such as `use_idf` in the `TfidfTransformer`. Classifiers tend to have many hyperparameters as well; e.g., `MultinomialNB` includes a smoothing hyperparameter `alpha`, and `SGDClassifier` has a penalty hyperparameter `alpha` and configurable loss and penalty terms in the objective function (see the module documentation, or use the Python `help(...)` function to get a description of these).

We could manually try various settings of these hyperparameters to find the configuration that leads to the best results. That would be a lot of effort though. Instead, we can automatically search through a range of combinations of options. The simplest search is a grid search. 

Let's try out using (a) either words or bigrams, (b) with or without idf, and (c) with a penalty parameter of either 0.01 or 0.001 for the linear SVM:



In [83]:
from sklearn.model_selection import GridSearchCV
parameters = {
    'vect__ngram_range': [(1, 1), (1, 2)],
    'tfidf__use_idf': (True, False),
    'clf__alpha': (1e-2, 1e-3),
}

The more options you consider, the more effort this search will involve. In this case we are considering 2x2x2 = 8 combinations.
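
To see where the eight combinations come from, the grid can be enumerated with the standard library. This is only an illustration of the idea, not how `GridSearchCV` is implemented.

```python
from itertools import product

parameters = {
    'vect__ngram_range': [(1, 1), (1, 2)],
    'tfidf__use_idf': (True, False),
    'clf__alpha': (1e-2, 1e-3),
}

# Build every combination of one value per hyperparameter.
names = sorted(parameters)
grid = [dict(zip(names, values))
        for values in product(*(parameters[name] for name in names))]

for combo in grid:
    print(combo)
print(len(grid), "combinations")
```

With k-fold cross-validation, each combination is trained and scored k times, so the total work grows quickly as options are added.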

If we have multiple CPU cores at our disposal, we can tell the grid searcher to try these eight parameter combinations in parallel by using the `n_jobs` parameter. If we give this parameter a value of `-1`, grid search will detect how many cores are installed and use them all.

When running this in Ed, we use just 1 core, but you can use more on your own machine.

In [84]:
gs_clf = GridSearchCV(text_clf, parameters, cv=5, n_jobs=1)

The grid search instance behaves like a normal scikit-learn model. Let’s perform the search on a smaller subset of the training data to speed up the computation:

In [85]:
gs_clf = gs_clf.fit(twenty_train.data[:400], twenty_train.target[:400])

The result of calling `fit` on a `GridSearchCV` object is a classifier that we can use to run `predict` to get a guess at the answer for an input:

In [86]:
twenty_train.target_names[gs_clf.predict(['Computers are cool'])[0]]

'comp.sys.mac.hardware'

The object’s `best_score_` and `best_params_` attributes store the best mean score and the parameter setting corresponding to that score:

In [87]:
gs_clf.best_score_
for param_name in sorted(parameters.keys()):
    print("%s: %r" % (param_name, gs_clf.best_params_[param_name]))

clf__alpha: 0.001
tfidf__use_idf: True
vect__ngram_range: (1, 1)


A more detailed summary of the search is available at `gs_clf.cv_results_`.

The `cv_results_` attribute can be imported into pandas as a DataFrame for further inspection.

You're done with the pre-work part of this workshop!
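
A minimal sketch of that inspection, assuming pandas is installed. It uses a tiny made-up dataset and a two-value grid so that it runs in well under a second; the names `docs`, `labels`, `pipe` and `search` are hypothetical and separate from the objects used above.

```python
import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.model_selection import GridSearchCV
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import Pipeline

# Toy data (hypothetical): 0 = pets, 1 = computers.
docs = ["cat dog", "dog pet", "cat pet", "pet cat dog",
        "cpu gpu", "gpu ram", "cpu ram", "ram cpu gpu"]
labels = [0, 0, 0, 0, 1, 1, 1, 1]

pipe = Pipeline([('vect', CountVectorizer()), ('clf', MultinomialNB())])
search = GridSearchCV(pipe, {'clf__alpha': (0.1, 1.0)}, cv=2)
search.fit(docs, labels)

# One row per parameter combination, one column per recorded statistic.
results = pd.DataFrame(search.cv_results_)
print(results[['params', 'mean_test_score', 'rank_test_score']])
```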