# Basic document classification with scikit-learn

`scikit-learn` has a few useful tools for doing basic document classification, all contained in the `sklearn.feature_extraction.text` module.

We'll use a dataset of Amazon reviews, from [https://nijianmo.github.io/amazon/index.html](https://nijianmo.github.io/amazon/index.html), as our demo.  We'll do something a bit tricky: we'll use reviews from one subject area as our training data, and then mix reviews from other product categories for our testing dataset.  We'll do a simple binary classification task: predict whether the number of stars a reviewer gives a product is greater than 3, or less than 3, based purely on the text of the review.  We'll ignore reviews with exactly 3 stars.  This is meant to capture the difference between generally positive and negative reviews, while excluding neutral ones.

We could very easily do this as a regression task, but we're going to do it as a classification task for these demos.  To make it into a regression task, just swap the models in these notebooks for regression models, and change the scoring metrics to a regression metric like $R^2$ or mean squared error.  We could also treat this as a multi-class classification problem by leaving in all star values--the code in the cells below actually sets everything up to do this, before filtering out 3-star reviews and collapsing the rest into the two categories.

We're going to do a basic _bag-of-words_ analysis, meaning that the features we extract from our text are just word counts.  This has a few advantages:
- It's easy.  You can code up a pretty robust, well-tested, production-ready word-counting program, from scratch, in a few days.  Less if you just need something quick and dirty.
- It's fairly interpretable.  You can train a model and look at the resulting weights, which will directly map back to a particular word.  So you can see what words are most indicative of the thing you're predicting or analyzing--which can be a good way to audit your model for problems, and can lead to very productive cycles of model/preprocessing revision.
- It has a _lot_ of knobs and dials you can tweak to change the behavior.  What you define as a "word," how you filter out words, how you clean up the text, etc.  The effects of a lot of these changes can often be reasoned about before running too much of your model pipeline.
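
To give a sense of how little machinery the "it's easy" point involves, here is a minimal from-scratch sketch of word counting.  The helper names `tokenize` and `bag_of_words` are made up for this illustration, and the regular expression is chosen to mimic the default tokenization in scikit-learn's `CountVectorizer` (lowercase, tokens of two or more word characters); it is not meant as a production implementation.

```python
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and pull out runs of two or more word characters."""
    return re.findall(r"\b\w\w+\b", text.lower())


def bag_of_words(documents):
    """Return a sorted vocabulary and one Counter of word counts per document."""
    counts = [Counter(tokenize(doc)) for doc in documents]
    vocabulary = sorted(set().union(*counts))
    return vocabulary, counts


docs = ["Great camera, great price.", "The camera broke after a week."]
vocabulary, counts = bag_of_words(docs)

print(vocabulary)
# A Counter returns 0 for words a document never uses.
print(counts[0]["great"], counts[1]["great"])
```

Changing what counts as a "word" here is just a matter of editing the regular expression in `tokenize`, which is one of the knobs mentioned above.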

There are a few important caveats about this kind of approach, though:
- Sparsity.  You'll end up with an _enormously_ sparse matrix (rarely more than 1%--on a good day--of the matrix entries will be non-zero).  And you'll rarely have a small subset of features that contain most of your information: it's very common for most features to have a very small contribution, so you end up needing to keep a lot of them in.
- Model runtime can be slow.  The matrix will have an enormous number of columns--regularly tens, or hundreds, of thousands--which just takes a long time to crunch through.
- Bag-of-words models are extremely crude, blunt instruments.  They only measure a rough approximation of _the meaning of the text,_ and ignore _all_ information about structure (both grammatical structures like sentences and clauses, and larger organizational structures like paragraphs and sections and chapters).

In spite of those caveats, bag-of-words models are always good to try, since they're easy to throw together and experiment with.  Their simplicity also makes them a great baseline to compare other models against: you generally want a model to be better than bag-of-words on some major metric.  That could be speed, or generalization to other linguistic domains, or accuracy, or something else.

Let's download the data.  (Note: this code uses Parquet to store the intermediate/locally cached versions of the files; make sure you have `fastparquet` or `pyarrow` installed so pandas can read and write this file type.)

In [1]:
import os
import pandas as pd

def undersample_majority_classes(df):
    """Under-sample classes in the dataset so that there's
    an equal number of all target classes."""
    resample_n = min(5_000, df["overall"].value_counts().min())
    df = (
        df
        .groupby("overall")
        .sample(resample_n, random_state=0)
        .reset_index(drop=True)
    )
    return df

if not all([
    os.path.isfile("electronics.parquet"),
    os.path.isfile("video_games.parquet"),
    os.path.isfile("clothes.parquet"),
]):
    # the "reviewText" field contains the text of the review.
    # The "overall" field contains the number of stars.
    electronics = pd.read_json(
        "http://snap.stanford.edu/data/amazon/productGraph/categoryFiles/reviews_Electronics_5.json.gz",
        lines=True,
    )[["reviewText", "overall"]]
    electronics = undersample_majority_classes(electronics).assign(productCategory="Electronics")
    electronics.to_parquet("electronics.parquet")
    
    video_games = pd.read_json(
        "http://snap.stanford.edu/data/amazon/productGraph/categoryFiles/reviews_Video_Games_5.json.gz",
        lines=True,
    )[["reviewText", "overall"]]
    video_games = undersample_majority_classes(video_games).assign(productCategory="Video Games")
    video_games.to_parquet("video_games.parquet")
    
    clothes = pd.read_json(
        "http://snap.stanford.edu/data/amazon/productGraph/categoryFiles/reviews_Clothing_Shoes_and_Jewelry_5.json.gz",
        lines=True,
    )[["reviewText", "overall"]]
    clothes = undersample_majority_classes(clothes).assign(productCategory="Clothing and Jewelry")
    clothes.to_parquet("clothes.parquet")
else:
    electronics = pd.read_parquet("electronics.parquet")
    video_games = pd.read_parquet("video_games.parquet")
    clothes = pd.read_parquet("clothes.parquet")
    
# Remove 3-star reviews.
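# .copy() avoids pandas' SettingWithCopyWarning when we
# overwrite the "overall" column below.
electronics = electronics[electronics["overall"] != 3].copy()
video_games = video_games[video_games["overall"] != 3].copy()
clothes = clothes[clothes["overall"] != 3].copy()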

# Set the "overall" column to be the binary classes of "positive"
# (for >3) and "negative" (<3).
electronics["overall"] = ["Positive" if i > 3 else "Negative" for i in electronics["overall"]]
video_games["overall"] = ["Positive" if i > 3 else "Negative" for i in video_games["overall"]]
clothes["overall"] = ["Positive" if i > 3 else "Negative" for i in clothes["overall"]]

Let's create our train-test splits...

In [2]:
from sklearn.model_selection import train_test_split

train, test = train_test_split(
    electronics,
    train_size=0.9,
    stratify=electronics["overall"],
    random_state=0,
)
test = pd.concat((test, video_games, clothes))

print(train)

                                              reviewText   overall  \
22268  Outstanding. I have the 16mb version of this. ...  Positive   
20118  I decided to give this product a try after hav...  Positive   
20581  Holds a good amount of weight and works great....  Positive   
23548  I bought this to replace a larger switch from ...  Positive   
23512  Unlike other players I have bought, this unit ...  Positive   
...                                                  ...       ...   
4220   I've used a number of different wireless adapt...  Negative   
20533  I guess I have a hard time with the uber expen...  Positive   
55     I never got it to connect properly. There is a...  Negative   
6169   Thought I had already written a review for thi...  Negative   
19691  Let me preface this review by saying I'm not m...  Positive   

      productCategory  
22268     Electronics  
20118     Electronics  
20581     Electronics  
23548     Electronics  
23512     Electronics  
...            

Now let's use some of the tools from scikit-learn to convert our text columns into numeric vector format.  We'll just explore what this looks like before we build any models.

In [3]:
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer

train_x = CountVectorizer().fit_transform(train["reviewText"])
print(type(train_x))
print(train_x.shape)
print(train_x[0, :])

<class 'scipy.sparse._csr.csr_matrix'>
(18000, 38273)
  (0, 24353)	1
  (0, 16774)	1
  (0, 33699)	1
  (0, 731)	1
  (0, 36232)	1
  (0, 23792)	1
  (0, 33925)	1
  (0, 19089)	1
  (0, 6134)	1
  (0, 23928)	2
  (0, 22725)	3
  (0, 19557)	1
  (0, 28780)	1
  (0, 37402)	2
  (0, 19586)	1
  (0, 18100)	1
  (0, 25759)	2
  (0, 19716)	1
  (0, 4565)	3
  (0, 8143)	1
  (0, 35898)	1
  (0, 13509)	1
  (0, 10611)	1
  (0, 14921)	1
  (0, 35013)	1
  (0, 38008)	1
  (0, 19046)	1
  (0, 20688)	1
  (0, 23150)	1
  (0, 19524)	1
  (0, 34075)	1


The `CountVectorizer()` transformer transforms texts into a sparse matrix format.  We saw this kind of data structure briefly when we covered the `scipy` library a few months back, but here's the quick refresher.

A sparse matrix (or sparse array) is a specialized data structure for storing data with a _lot_ of zeros.  It takes a lot of shortcuts to essentially not store zero-values at all.  This is critical for data like language, where for anything you measure, you'll end up with mostly zeros for each document.
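
To make the "not storing zeros" idea concrete, here is a small sketch using the CSR (compressed sparse row) format, which is the type `CountVectorizer` returned above.  The toy `dense` array is invented for illustration; the `data`, `indices`, and `indptr` attributes are the three arrays that a SciPy CSR matrix keeps internally.

```python
import numpy as np
from scipy import sparse

# A toy 3x6 matrix with only three non-zero entries.
dense = np.array([
    [0, 0, 3, 0, 0, 0],
    [0, 0, 0, 0, 0, 0],
    [1, 0, 0, 0, 0, 2],
])
matrix = sparse.csr_matrix(dense)

print(matrix.data)     # the non-zero values, row by row
print(matrix.indices)  # the column index of each stored value
print(matrix.indptr)   # where each row starts and ends within `data`
print(matrix.nnz, "stored values out of", dense.size, "cells")
```

Only the three non-zero values (plus some bookkeeping) are stored, no matter how many zero cells the matrix has.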

Language is _extremely_ sparse.  Consider: if we're using words as our features, how many words are there in a given corpus of data?  Easily a few tens of thousands of unique words in most non-trivial corpora, but empirically, most documents only use a _very_ small subset of those words.  If most of your documents are a few hundred words, then they can at most use a few hundred words out of those tens of thousands.  In other words: _most documents don't use most words._  And relatedly: _most words only appear in a few documents._

Sparsity like this is a _big_ problem for any machine learning or statistical analysis.  The reasons why are a bit beyond what we're covering today, but just remember: _sparsity is bad._  We want to get rid of it however we can.  Fortunately, the `CountVectorizer()` has a few tools for doing this (though it lacks a few important ones like stemming/lemmatization).  First let's look at our sparsity as-is (which we'll measure as "the percent of entries in the matrix that are zero"):

In [4]:
sparsity = 1 - (train_x.nnz / (train_x.shape[0] * train_x.shape[1]))
print(f"Sparsity: {sparsity:.3%}")

Sparsity: 99.803%


...or, in other words, only 0.2% of our cells contain non-zero values.  Eek.  Let's use a few easy tools to decrease our sparsity: we'll just filter out words by frequency.  We can rely on two important insights:

1. Extremely high-frequency words (e.g. stopwords) tend to contribute very little information.  Often these are "structure" or "function" words in English.  Their counts tend to have very low variance across documents.
2. Extremely low-frequency words also contribute very little information.  If a word only appears in a few documents out of hundreds of thousands, it _might_ be able to tell us something, but probably not a whole lot.

So, we can just remove words that are super common and super rare.

In [5]:
train_x = CountVectorizer(
    # No words that appear in >50% of documents.
    max_df=0.5,
    
    # No words that appear in <10 documents.
    min_df=10,
).fit_transform(train["reviewText"])
print(type(train_x))
print(train_x.shape)
sparsity = 1 - (train_x.nnz / (train_x.shape[0] * train_x.shape[1]))
print(f"Sparsity: {sparsity:.3%}")

<class 'scipy.sparse._csr.csr_matrix'>
(18000, 7055)
Sparsity: 99.118%


Note: the `min_df` and `max_df` parameters behave differently depending on whether they're a float or an int.  If they're a float, they must be between 0.0 and 1.0, and they will be interpreted as a _proportion of the documents in the corpus_.  So `max_df=0.5` means "nothing that appears in more than half of the documents."  If the arguments are integers, they're interpreted as the number of documents.  So `min_df=10` means "nothing that appears in fewer than 10 documents."
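
Here is a small sketch of the float-versus-int behavior on a made-up four-document corpus (`toy_docs` is invented for this example).  The word "good" appears in all four documents; "battery," "screen," and "bad" each appear in two; the rest appear in one.

```python
from sklearn.feature_extraction.text import CountVectorizer

toy_docs = [
    "good battery good screen",
    "good battery bad screen",
    "good cable",
    "good case bad zipper",
]

# Integer min_df: keep words that appear in at least 2 documents.
at_least_two = CountVectorizer(min_df=2).fit(toy_docs)
print(sorted(at_least_two.vocabulary_))

# Float max_df: drop words that appear in more than 75% of documents.
not_too_common = CountVectorizer(max_df=0.75).fit(toy_docs)
print(sorted(not_too_common.vocabulary_))
```

The first vocabulary loses the one-document words ("cable," "case," "zipper"); the second loses only "good."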

Note that we still have a _lot_ of sparsity, but we cut the number of features by more than a factor of 5 (from 38,273 to 7,055), and the share of non-zero entries rose from about 0.2% to about 0.9%.  We can be a bit more aggressive and try to trim the feature set further by using some other options in the `CountVectorizer()` transformer:

In [6]:
train_x = CountVectorizer(
    # No words that appear in >50% of documents.
    max_df=0.5,
    
    # No words that appear in <10 documents.
    min_df=10,
    
    # Keep only the 10,000 most frequent features.
    max_features=10_000,
    
    # Use a pre-defined list of English stopwords.
    # Anything on the list gets removed, regardless of
    # the other filtering options.  We could provide our
    # own stoplist as a list of words here, but passing
    # "english" uses a built-in stoplist from scikit-learn.
    stop_words="english",
).fit_transform(train["reviewText"])
print(type(train_x))
print(train_x.shape)
sparsity = 1 - (train_x.nnz / (train_x.shape[0] * train_x.shape[1]))
print(f"Sparsity: {sparsity:.3%}")

<class 'scipy.sparse._csr.csr_matrix'>
(18000, 6786)
Sparsity: 99.379%


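Notice that removing stopwords shrank the vocabulary a little (to 6,786 features, so the `max_features=10_000` cap never kicks in), but the sparsity actually went back _up_ slightly, to 99.379%.  That's likely because the stopwords that survived the `max_df` filter were still among the most widely used words, so dropping them removes comparatively dense columns.

Before we come back to ways to decrease this sparsity even further, let's throw a quick classifier together.  There isn't much special here compared to other kinds of data, but there are a few considerations that are more important for language data: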

1. When doing this sort of word-counting analysis, you want to make sure you pick a model that deals well with huge numbers of features.
2. You also usually want a model that's fast, because your dataset will often be huge.
3. You want a model that can natively deal with sparse datasets without converting them to a dense format.  (This is only a consideration when doing word count analyses like we're doing now).

In practice, you can throw pretty much any scikit-learn model at the data, but I find the following tend to consistently work pretty well:
- Linear-kernel support vector machines, and `SGDClassifier()`, for binary classification.
- Random forests and other ensembles of trees (they tend to be good at everything, though, so this isn't too surprising).
- Naive Bayes models tend to excel in bag-of-words settings: they are extremely fast, scale well to large numbers of features, and tend to be reasonably accurate (though they can occasionally be _very_ inaccurate).
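
Because the vectorizer and the model are separate pipeline steps, trying a different model from the list above is a one-line swap.  The sketch below uses a tiny invented corpus (`toy_texts`, `toy_labels`) purely to show the mechanics; the predictions on six documents say nothing about how these models would compare on the real review data.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import SGDClassifier
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import Pipeline

toy_texts = [
    "great sound great value",
    "great picture works well",
    "works great love it",
    "terrible sound broke quickly",
    "terrible picture returned it",
    "broke quickly terrible value",
]
toy_labels = ["Positive"] * 3 + ["Negative"] * 3

candidates = {
    "naive bayes": MultinomialNB(),
    "sgd": SGDClassifier(random_state=0),
}

predictions = {}
for name, model in candidates.items():
    # Same feature extraction step each time; only the classifier changes.
    pipeline = Pipeline([
        ("feature extraction", CountVectorizer()),
        ("classifier", model),
    ])
    pipeline.fit(toy_texts, toy_labels)
    predictions[name] = pipeline.predict(["great value", "terrible sound"])
    print(name, predictions[name])
```

Both models accept the sparse matrix from `CountVectorizer` directly, so nothing has to be converted to a dense array.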

We'll use a linear support vector machine (`LinearSVC`), since it's fast and handles sparse input natively.

In [7]:
from sklearn.pipeline import Pipeline
from sklearn.svm import LinearSVC
from sklearn import metrics

clf = Pipeline([
    (
        "feature extraction",
        CountVectorizer(
            min_df=10,
            max_df=0.5,
            stop_words="english",
            # max_features=10_000,
        )
    ),
    ("classifier", LinearSVC(random_state=0, max_iter=2000))
])

# Just to show how long this can take, I'll use a
# %time "magic command" in Jupyter.  LinearSVC models
# are usually very fast, but this is a *lot* of data.
%time clf.fit(train["reviewText"], train["overall"])
test["Predictions"] = clf.predict(test["reviewText"])

# Calculate a few different F1 scores.
overall_f1 = metrics.f1_score(test["overall"], test["Predictions"], average="macro")
print(f"Overall F1 score: {overall_f1}")

for product_category, results in test.groupby("productCategory"):
    f1 = metrics.f1_score(results["overall"], results["Predictions"], average="macro")
    print(f"{product_category}-specific F1 score: {f1}")

CPU times: total: 3.56 s
Wall time: 3.56 s
Overall F1 score: 0.763788898259823
Clothing and Jewelry-specific F1 score: 0.7759023256824613
Electronics-specific F1 score: 0.7919700436862909
Video Games-specific F1 score: 0.7487776953058092


# Improving accuracy with TF-IDF weighting

Often, we can gain a boost in accuracy by applying _term frequency-inverse document frequency_ weighting to our word counts.  This combines two different weighting schemes: the "term frequency" (TF) weighting and the "inverse document frequency" (IDF) weighting.

Term frequency is just an $\ell_1$ normalization applied row-wise to each document.  The word counts in a document are divided by the total number of words, so that all the re-weighted counts sum to 1.  This measures what proportion of the document each word represents--this is often more desirable than the raw count, since raw counts are highly correlated with overall document length.  (Document length might be important for some analyses, but in general, for bag-of-words, we want to remove or isolate the effects of document length).

$$
\operatorname{tf}(word, document) = \frac{\text{Number of times }word\text{ appears in }document}{\text{Number of words in }document}
$$

Sometimes you'll see term frequency calculated with the $\ell_2$ norm; this just means the denominator in the formula above changes to the square root of the sum of the _squares_ of the word counts, so that the _squared_ term frequencies sum to 1.  (This has some other mathematical advantages, mostly that the dot product between two vectors is now equivalent to the cosine similarity, but is much faster to compute; cosine similarity is used a lot in NLP tasks).
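
The difference between the two normalizations, and the cosine-similarity shortcut, can be checked with a few lines of NumPy.  The count vectors `counts_a` and `counts_b` are invented for illustration.

```python
import numpy as np

counts_a = np.array([3.0, 0.0, 4.0])
counts_b = np.array([1.0, 2.0, 2.0])

# l1: divide by the sum of the counts, so the entries sum to 1.
l1_a = counts_a / counts_a.sum()

# l2: divide by the square root of the sum of squared counts,
# so the *squared* entries sum to 1.
l2_a = counts_a / np.sqrt((counts_a ** 2).sum())
l2_b = counts_b / np.sqrt((counts_b ** 2).sum())

# Cosine similarity computed the long way, from the raw counts.
cosine = (counts_a @ counts_b) / (
    np.linalg.norm(counts_a) * np.linalg.norm(counts_b)
)

print(np.isclose(l1_a.sum(), 1.0))
print(np.isclose((l2_a ** 2).sum(), 1.0))
# After l2 normalization, a plain dot product gives the same number.
print(np.isclose(l2_a @ l2_b, cosine))
```

Once every row has been $\ell_2$-normalized up front, the per-pair norm computations drop out, which is where the speed advantage comes from.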

Inverse document frequency is the negative logarithm of the proportion of documents in the corpus that contain the target word.  I.e.:

$$
\begin{align}
\operatorname{idf}(word) &= -\log\left(\frac{\text{Number of documents containing }word}{\text{Total number of documents}}\right) \\
&= \log\left(\frac{\text{Total number of documents}}{\text{Number of documents containing }word}\right)
\end{align}
$$

TF-IDF is just the product of these two measures for each word.  We can apply TF-IDF transforms two ways in scikit-learn: either by running the `sklearn.feature_extraction.text.TfidfTransformer()` on the results of the `CountVectorizer()`, or by using the `sklearn.feature_extraction.text.TfidfVectorizer()` transformer, which combines these two steps for us.  We might want to keep them separate if we're applying custom filtering/feature selection steps to the raw word counts, but usually, you can just use the `TfidfVectorizer()` from the get-go.
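
A quick sanity check that the two routes agree, on an invented three-document corpus (`toy_docs`), with both transformers left at their defaults:

```python
import numpy as np
from sklearn.feature_extraction.text import (
    CountVectorizer,
    TfidfTransformer,
    TfidfVectorizer,
)
from sklearn.pipeline import make_pipeline

toy_docs = [
    "good battery good screen",
    "bad battery",
    "good cable",
]

# Route 1: raw counts first, then TF-IDF weighting as a second step.
two_step = make_pipeline(CountVectorizer(), TfidfTransformer())
x_two_step = two_step.fit_transform(toy_docs)

# Route 2: both steps combined in one transformer.
one_step = TfidfVectorizer()
x_one_step = one_step.fit_transform(toy_docs)

print(np.allclose(x_two_step.toarray(), x_one_step.toarray()))
```

One thing to keep in mind: with default settings, scikit-learn applies $\ell_2$ normalization to each row after the IDF weighting (`norm="l2"`), rather than the $\ell_1$ term frequency described above, so its output won't match a hand calculation from the textbook formulas exactly.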

In [8]:
from sklearn.feature_extraction.text import TfidfVectorizer

clf = Pipeline([
    (
        "feature extraction",
        # TfidfVectorizer supports all the same options as CountVectorizer,
        # plus a few others that we're not setting
        TfidfVectorizer(
            min_df=10,
            max_df=0.5,
            stop_words="english",
            # max_features=10_000,
        )
    ),
    ("classifier", LinearSVC(random_state=0, max_iter=2000))
])

# Just to show how long this can take, I'll use a
# %time "magic command" in Jupyter.  LinearSVC models
# are usually very fast, but this is a *lot* of data.
%time clf.fit(train["reviewText"], train["overall"])
test["Predictions"] = clf.predict(test["reviewText"])

# Calculate a few different F1 scores.
overall_f1 = metrics.f1_score(test["overall"], test["Predictions"], average="macro")
print(f"Overall F1 score: {overall_f1}")

for product_category, results in test.groupby("productCategory"):
    f1 = metrics.f1_score(results["overall"], results["Predictions"], average="macro")
    print(f"{product_category}-specific F1 score: {f1}")

CPU times: total: 1.47 s
Wall time: 1.48 s
Overall F1 score: 0.7939126991347976
Clothing and Jewelry-specific F1 score: 0.807468314110313
Electronics-specific F1 score: 0.8294812003023333
Video Games-specific F1 score: 0.7766162021081828


# Some other tricks

Since `CountVectorizer()`/`TfidfVectorizer()` just convert text into sparse matrices, we can do our own feature selection or dimensionality reduction or whatever we want.  E.g., it's pretty common to use SVD to do dimensionality reduction, which increases the speed of fitting models by _a lot,_ at the cost of only a little accuracy (though the SVD step itself can take a long time on some datasets).  We'll revisit SVD a bit later; bag of words + SVD is actually a common topic modeling algorithm called Latent Semantic Analysis (LSA; sometimes called Latent Semantic Indexing, LSI).
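
As a preview, here is a minimal sketch of what the SVD step looks like as part of a pipeline, using scikit-learn's `TruncatedSVD` (which works directly on sparse input).  The corpus `toy_docs` is invented, and `n_components=2` is chosen only because the toy vocabulary is tiny; on real data you would typically pick a much larger number.

```python
from sklearn.decomposition import TruncatedSVD
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.pipeline import Pipeline

toy_docs = [
    "great sound great value",
    "great picture works well",
    "works great love it",
    "terrible sound broke quickly",
    "terrible picture returned it",
    "broke quickly terrible value",
]

lsa = Pipeline([
    ("feature extraction", TfidfVectorizer()),
    # Project the sparse TF-IDF matrix down to 2 dense dimensions.
    ("svd", TruncatedSVD(n_components=2, random_state=0)),
])
reduced = lsa.fit_transform(toy_docs)

print(type(reduced))
print(reduced.shape)
```

The output is a small dense array with one row per document, which can then be fed to any downstream classifier in place of the original sparse matrix.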