# Document classification with spaCy

spaCy is an awesome library.  It provides a number of large, pre-trained language models that do all sorts of cool stuff:
- Smart tokenization: much smarter than the character-pattern-based splitting that most tools use, such as scikit-learn's `CountVectorizer()`.
- Part of speech annotation, syntactic dependency parsing, named entity recognition, stopword identification, lemmatization, and a few other word-level annotations: it runs these automatically for you.
- (with the right models) Dense word vectors: we'll come back to the details of this later, but each word is mapped to a dense, 300-dimensional vector that encodes something about its meaning.  Documents, words, and spans of words are assigned a single vector.
- Document classification with pretty big, accurate models--we won't be covering this.
- Lots of modularity: you can swap out different parts of the model with your own, e.g., swapping out the tokenizer for something custom, or using a different stopword list, or changing how lemmatization works.  We won't be doing any of this.
- GPU acceleration for some models--we won't be doing this.

spaCy is a very large, deep library, but we're only going to do a few basic things with it.  Namely, we're going to use it to do smarter tokenization, remove stopwords, and lemmatize our text before feeding it back into scikit-learn's `CountVectorizer()`.  (Basically: using spaCy for smarter preprocessing.)  We'll also see a few ways to pick and choose which processing steps we run, which might be desirable for speed.

Install spaCy with:
```bash
conda install spacy
```

Or follow the [installation instructions](https://spacy.io/usage) if you want, e.g., to install with GPU support.  Download the `en_core_web_sm` model to run this notebook yourself.

Then, we'll use spaCy's built-in vectorization and compare that to a bag-of-words approach.

In [1]:
import spacy

# make sure to download the models with: `python -m spacy download [model name]`.
# Then load the models with `model = spacy.load("model name")`.
nlp = spacy.load("en_core_web_sm")

The `nlp` object is callable, like a function.  It takes in a string and returns a spaCy `Doc`, which behaves like a list of `Token` objects, each of which has a bunch of different annotations set.  The example below does not show all of the annotations that get set, only the ones that I use most often.
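Before looking at the annotations, here is a minimal sketch of the container types themselves.  It assumes a blank English pipeline (`spacy.blank("en")`, tokenizer only) so that it runs without any downloaded model; the names `nlp_blank` and `example_doc` are just for illustration.

```python
import spacy

# A blank English pipeline has only the tokenizer, so no model download is
# needed.  The container types are the same ones a trained pipeline returns.
nlp_blank = spacy.blank("en")
example_doc = nlp_blank("spaCy returns a Doc object.")

first_token = example_doc[0]    # indexing a Doc gives a Token
middle_span = example_doc[1:3]  # slicing a Doc gives a Span

print(type(example_doc).__name__, type(first_token).__name__, type(middle_span).__name__)
print([tok.text for tok in example_doc])
print(middle_span.text)
```

Iterating over a `Doc` yields its tokens in order, which is what the annotation loop in the next cells relies on.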

In [2]:
text = (
    "spaCy (/speɪˈsiː/ spay-SEE) is an open-source software library for "
    "advanced natural language processing, written in the programming "
    "languages Python and Cython."
)

doc = nlp(text)
print(doc)

spaCy (/speɪˈsiː/ spay-SEE) is an open-source software library for advanced natural language processing, written in the programming languages Python and Cython.


In [3]:
print(f"{'Token':<15}{'Lemma':<15}{'Coarse POS':<11}{'Fine POS':<10}{'Stopword?':<10}{'Syntactic Role'}".upper())

for tok in doc:
    text = tok.text
    lemma = tok.lemma_
    pos = tok.pos_
    fine_grained_pos = tok.tag_
    stopword = tok.is_stop
    syntactic_role = tok.dep_
    print(f"{text:<15}{lemma:<15}{pos:<11}{fine_grained_pos:<10}{stopword:<10}{syntactic_role}")

TOKEN          LEMMA          COARSE POS FINE POS  STOPWORD? SYNTACTIC ROLE
spaCy          spacy          NUM        CD        0         nsubj
(              (              PUNCT      -LRB-     0         punct
/speɪˈsiː/     /speɪˈsiː/     PUNCT      NFP       0         compound
spay           spay           PROPN      NNP       0         compound
-              -              PUNCT      HYPH      0         punct
SEE            SEE            PROPN      NNP       1         nsubj
)              )              PUNCT      -RRB-     0         punct
is             be             AUX        VBZ       1         ROOT
an             an             DET        DT        1         det
open           open           ADJ        JJ        0         amod
-              -              PUNCT      HYPH      0         punct
source         source         NOUN       NN        0         compound
software       software       NOUN       NN        0         compound
library        library        NOUN       NN  

Let's use this to tokenize our text data from last time.  First, we need to re-load the training data (this is the same loading code from the last notebook).

In [4]:
import pandas as pd

electronics = pd.read_parquet("electronics.parquet")
video_games = pd.read_parquet("video_games.parquet")
clothes = pd.read_parquet("clothes.parquet")
    
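# Remove 3-star reviews.  The .copy() makes each result an independent
# DataFrame, so the column assignments below don't trigger pandas'
# SettingWithCopyWarning.
electronics = electronics[electronics["overall"] != 3].copy()
video_games = video_games[video_games["overall"] != 3].copy()
clothes = clothes[clothes["overall"] != 3].copy()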

# Set the "overall" column to be the binary classes of "positive"
# (for >3) and "negative" (<3).
electronics["overall"] = ["Positive" if i > 3 else "Negative" for i in electronics["overall"]]
video_games["overall"] = ["Positive" if i > 3 else "Negative" for i in video_games["overall"]]
clothes["overall"] = ["Positive" if i > 3 else "Negative" for i in clothes["overall"]]

from sklearn.model_selection import train_test_split

train, test = train_test_split(
    electronics,
    train_size=0.9,
    stratify=electronics["overall"],
    random_state=0,
)
test = pd.concat((test, video_games, clothes))

# Simple use: apply the full processing pipeline

We'll write a quick function that applies a spaCy model to a bunch of texts, and returns the lemmas for all non-stopword, non-punctuation, non-whitespace tokens.  We can use `nlp.pipe()` to pass an iterable of texts, and process them a bit more efficiently than explicitly looping over them.  (This mostly only matters if we're running on the GPU, which we aren't, so we won't see much of a speedup.)

We _can_ tell spaCy to parse our text using multiple worker processes (the `n_process` argument).  I would recommend being _very_ careful about doing this with the `*_lg` models.  spaCy will create a full copy of the model for each worker process, which can quickly eat into your RAM.  It also adds a lot of startup time.  If you're going to be using multiple processes, you should probably use one of the `*_sm` models, unless you have a lot of RAM and you need the extra accuracy from the `*_lg` models.

Since we have a lot of data, I'm going to use the English small model (you'll need to install it to run this cell), run it in 4 parallel processes, and set a moderate batch size so data can be more efficiently shuttled back and forth between worker processes.  Note: the number of processes might not always be a "more is faster" setting--there's a lot of overhead involved in sending data back and forth between the worker processes.

Note: `nlp.pipe` can accept any iterator--not just lists--and it returns an iterator that only processes documents on-demand, as you iterate through the results.  This makes it extremely easy to use spaCy for processing _enormous_ amounts of text.  Just create a generator/lazy iterator that reads through lines in a file, point `nlp.pipe` at that generator, and save the results out to another file as you get them and do whatever processing you need to them.
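The streaming pattern described in that note could look roughly like the sketch below.  It is an illustration under assumptions, not part of the original notebook: the helper names `read_lines` and `stream_clean` and the file names are hypothetical, and a blank English pipeline stands in for a trained model so the example runs without a download.

```python
import tempfile
from pathlib import Path

import spacy


def read_lines(path):
    """Lazily yield non-empty lines, so the whole file is never held in memory."""
    with open(path, encoding="utf-8") as f:
        for line in f:
            line = line.strip()
            if line:
                yield line


def stream_clean(nlp, in_path, out_path, batch_size=1000):
    """Tokenize in_path line by line and write space-joined tokens to out_path."""
    n_docs = 0
    with open(out_path, "w", encoding="utf-8") as out:
        # nlp.pipe pulls from the generator in batches and yields Docs as
        # they are ready, so results can be written out incrementally.
        for doc in nlp.pipe(read_lines(in_path), batch_size=batch_size):
            out.write(" ".join(tok.lower_ for tok in doc if not tok.is_space) + "\n")
            n_docs += 1
    return n_docs


# Hypothetical demo on a small temporary file.
nlp_blank = spacy.blank("en")
with tempfile.TemporaryDirectory() as tmp:
    in_path = Path(tmp) / "reviews.txt"
    out_path = Path(tmp) / "reviews.tokens.txt"
    in_path.write_text("Great product!\n\nStopped working after a week.\n", encoding="utf-8")
    n_written = stream_clean(nlp_blank, in_path, out_path)
    output_lines = out_path.read_text(encoding="utf-8").splitlines()

print(n_written, output_lines)
```

Because both the input generator and the output loop are lazy, memory use is governed by the batch size rather than by the size of the file.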

In [5]:
from tqdm.notebook import tqdm
def spacy_clean(nlp, texts):
    docs = list(tqdm(
        nlp.pipe(texts, n_process=4, batch_size=1000),
        total=len(texts),
        desc="spaCy parsing",
        smoothing=0.01,
    ))
    docs = [
        " ".join(
            tok.lemma_
            for tok in doc
            if not (
                tok.is_stop
                or tok.is_space
                or tok.is_punct
            )
        )
        for doc in tqdm(docs, desc="Filtering tokens (after spaCy parsing)")
    ]
    return docs

In [6]:
# This is going to take a long time.  We're about to speed it up a lot,
# so for now we're only going to run on the training texts.
nlp = spacy.load("en_core_web_sm")
cleaned = spacy_clean(nlp, train["reviewText"])

print(f"Pre-cleaning:\n{train['reviewText'].iloc[0]}")
print()
print(f"Post-cleaning:\n{cleaned[0]}")

spaCy parsing:   0%|          | 0/18000 [00:00<?, ?it/s]

Filtering tokens (after spaCy parsing):   0%|          | 0/18000 [00:00<?, ?it/s]

Pre-cleaning:
Outstanding. I have the 16mb version of this. Its been on my key ring with my keys in my pocket with pocket knife and change and used every day for two years. It looks new and keeps on ticking

Post-cleaning:
outstanding 16 mb version key ring key pocket pocket knife change day year look new keep tick


Eesh.  That took a long time.  There's got to be a way to speed that up.

# Disabling processing steps for speed

Fortunately there is!  When loading a spaCy model, or doing anything with it, we can disable some of the processing steps that we aren't using.  First, here's a list of the processing steps that our small NLP model is using:

In [7]:
for step_name, step in nlp.pipeline:
    print(step_name)

tok2vec
tagger
parser
attribute_ruler
lemmatizer
ner


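The `tagger` step does part of speech tagging, which we're not using directly right now.  The `parser` step does syntactic parsing, which we also aren't using.  The `ner` step does named entity recognition, which we also aren't using.  So let's disable all of those steps, run the above code again, and see how much speed we gain.  One caveat, assuming spaCy v3: the English lemmatizer is rule-based and relies on the part of speech tags that `tagger` and `attribute_ruler` assign, so lemmas from the trimmed-down pipeline may be somewhat less accurate, and spaCy may warn about missing POS annotation.  (The spaCy documentation has more information on available steps, including ones you can add to a model after the fact.)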

In [8]:
nlp = spacy.load("en_core_web_sm", disable=["tagger", "parser", "ner"])
train["Cleaned Text"] = spacy_clean(nlp, train["reviewText"])
test["Cleaned Text"] = spacy_clean(nlp, test["reviewText"])

spaCy parsing:   0%|          | 0/18000 [00:00<?, ?it/s]

Filtering tokens (after spaCy parsing):   0%|          | 0/18000 [00:00<?, ?it/s]

spaCy parsing:   0%|          | 0/42000 [00:00<?, ?it/s]

Filtering tokens (after spaCy parsing):   0%|          | 0/42000 [00:00<?, ?it/s]

Much faster!  But this is still a lot slower than using a `CountVectorizer()`, because the spaCy model is still doing a lot of processing to figure out where the different tokens are and all that.  That's the cost for higher accuracy: the model spends more time figuring stuff out and crunching numbers on our behalf.  In the real world, you'll often have to make decisions about how fast is "fast enough" and how accurate is "accurate enough."
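Components don't have to be disabled at load time.  As a rough sketch, `nlp.select_pipes()` turns steps off temporarily, and `nlp.pipe()` accepts a `disable` argument for a single call.  To keep the example runnable without a downloaded model, it assumes a stand-in pipeline (a blank English tokenizer plus the rule-based `sentencizer`); with `en_core_web_sm` you would pass names like `"parser"` or `"ner"` instead.

```python
import spacy

# Stand-in pipeline (an assumption for this sketch): blank English tokenizer
# plus the rule-based sentencizer, so no model download is needed.
nlp_demo = spacy.blank("en")
nlp_demo.add_pipe("sentencizer")
sample_text = "First sentence. Second sentence."

print(nlp_demo.pipe_names)  # the currently active components

# Temporarily disable a component; it is restored when the block exits.
with nlp_demo.select_pipes(disable=["sentencizer"]):
    names_inside = list(nlp_demo.pipe_names)
    doc_without = nlp_demo(sample_text)

# Or skip components for one nlp.pipe() call only.
docs_from_pipe = list(nlp_demo.pipe([sample_text], disable=["sentencizer"]))

doc_with = nlp_demo(sample_text)

# Sentence boundaries are only present when the sentencizer actually ran.
print(doc_with.has_annotation("SENT_START"))
print(doc_without.has_annotation("SENT_START"))
```

This is handy when the same loaded model serves several tasks that each need a different subset of steps.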

# Using `nlp.make_doc` for even more speed--but even fewer features

Fortunately, we have one last trick up our sleeves.  If you only need tokenization--not lemmatization--you can use `nlp.make_doc`, which disables even more of the processing.  It effectively runs the bare minimum processing: it does tokenization and some extremely basic token tagging like identifying stopwords.  Let's see how much faster this is.

We'll use the same function as before, but we'll replace the `nlp.pipe(texts, ...)` call with some code to run `nlp.make_doc` over all the texts.

In [9]:
def spacy_clean(nlp, texts):
    docs = [
        nlp.make_doc(i)
        for i in tqdm(
            texts,
            desc="spaCy parsing",
            smoothing=0.01,
        )
    ]
    docs = [
        " ".join(
            # get the lowercase version--not lemma--since make_doc()
            # doesn't apply lemmatization.
            tok.lower_
            for tok in doc
            if not (
                tok.is_stop
                or tok.is_space
                or tok.is_punct
            )
        )
        for doc in tqdm(docs, desc="Filtering tokens (after spaCy parsing)")
    ]
    return docs

_ = spacy_clean(nlp, train["reviewText"])
_ = spacy_clean(nlp, test["reviewText"])

spaCy parsing:   0%|          | 0/18000 [00:00<?, ?it/s]

Filtering tokens (after spaCy parsing):   0%|          | 0/18000 [00:00<?, ?it/s]

spaCy parsing:   0%|          | 0/42000 [00:00<?, ?it/s]

Filtering tokens (after spaCy parsing):   0%|          | 0/42000 [00:00<?, ?it/s]

Dang.  That's fast--and we're only using a single processing thread!  You can use this with multiprocessing, but you have to write the logic yourself.  It isn't too complicated, though.
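Here is one way that multiprocessing logic might be sketched with the standard library's `multiprocessing.Pool`.  Everything here is an assumption for illustration: the names `clean_text` and `parallel_clean` are hypothetical, and a blank English pipeline stands in for the notebook's `nlp` object, since `make_doc` only needs the tokenizer.

```python
from multiprocessing import Pool

import spacy

# Loaded at import time so that every worker process ends up with its own
# pipeline.  A blank English pipeline is an assumption of this sketch.
_NLP = spacy.blank("en")


def clean_text(text):
    """Tokenize one text and drop stopwords, whitespace, and punctuation."""
    doc = _NLP.make_doc(text)
    return " ".join(
        tok.lower_
        for tok in doc
        if not (tok.is_stop or tok.is_space or tok.is_punct)
    )


def parallel_clean(texts, n_workers=4, chunksize=500):
    """Send raw strings to the workers and collect cleaned strings, in order."""
    with Pool(n_workers) as pool:
        # Only plain strings cross the process boundary, which keeps the
        # serialization cost lower than shipping Doc objects around.
        return pool.map(clean_text, texts, chunksize=chunksize)


if __name__ == "__main__":
    sample = ["the cat sat on the mat.", "it looks new and keeps on ticking"]
    print(parallel_clean(sample, n_workers=2, chunksize=1))
```

On platforms that start worker processes with `spawn` (Windows and macOS by default), functions defined in a notebook cell generally can't be sent to the workers, so code like this usually has to live in an importable `.py` file.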

Note: In one of the next notebooks, we'll see how Gensim provides tools similar to spaCy for doing this kind of preprocessing, but the tools in Gensim are much faster (though a bit less accurate).  Gensim has a very different approach compared to spaCy.

Now, having done all our own preprocessing, let's throw the results through our classifier from the other notebook.

In [10]:
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.pipeline import Pipeline
from sklearn.svm import LinearSVC
from sklearn import metrics

clf = Pipeline([
    (
        "feature extraction",
        # TfidfVectorizer supports all the same options as CountVectorizer,
        # plus a few others that we're not setting
        TfidfVectorizer(
            min_df=10,
            max_df=0.5,
            stop_words="english",
            # max_features=10_000,
        )
    ),
    ("classifier", LinearSVC(random_state=0, max_iter=2000))
])

# Just to show how long this can take, I'll use a
# %time "magic command" in Jupyter.  LinearSVC models
# are usually very fast, but this is a *lot* of data.
%time clf.fit(train["Cleaned Text"], train["overall"])
test["Predictions"] = clf.predict(test["Cleaned Text"])

# Calculate a few different F1 scores.
overall_f1 = metrics.f1_score(test["overall"], test["Predictions"], average="macro")
print(f"Overall F1 score: {overall_f1}")

for product_category, results in test.groupby("productCategory"):
    f1 = metrics.f1_score(results["overall"], results["Predictions"], average="macro")
    print(f"{product_category}-specific F1 score: {f1}")

CPU times: total: 906 ms
Wall time: 921 ms
Overall F1 score: 0.7896350083888972
Clothing and Jewelry-specific F1 score: 0.8026907232785583
Electronics-specific F1 score: 0.8274530240858073
Video Games-specific F1 score: 0.7726024660828436
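
The preprocessing above ran as a separate step that wrote a "Cleaned Text" column.  An alternative wiring, sketched below under assumptions, is to hand scikit-learn a spaCy-based tokenizer through the vectorizer's `tokenizer` argument.  The function name `spacy_tokenize` is hypothetical, and a blank English pipeline is used, so this mirrors the `make_doc` variant (no lemmatization) rather than the full pipeline.

```python
import spacy
from sklearn.feature_extraction.text import CountVectorizer

# Blank English pipeline as a stand-in (assumption): tokenizer and stopword
# flags only, no lemmatizer.
nlp_tok = spacy.blank("en")


def spacy_tokenize(text):
    """Return lowercase tokens, minus stopwords, whitespace, and punctuation."""
    return [
        tok.lower_
        for tok in nlp_tok.make_doc(text)
        if not (tok.is_stop or tok.is_space or tok.is_punct)
    ]


vectorizer = CountVectorizer(
    tokenizer=spacy_tokenize,  # replaces the built-in regex tokenization
    token_pattern=None,        # unused once a tokenizer is supplied
    lowercase=False,           # the tokenizer already lowercases
)
counts = vectorizer.fit_transform(["Sturdy pocket knife.", "Pocket knife broke!"])

print(sorted(vectorizer.vocabulary_))
print(counts.shape)
```

The trade-off is that tokenization is redone on every `fit` and `transform`, whereas a precomputed column is only cleaned once.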


# Some things we didn't cover

All of the steps that spaCy applies--tokenization, lemmatization, POS tagging, syntactic dependencies, etc.--can be used independently, rather than through the language models you get from `spacy.load()`.  In my experience, the most common reason to do this is either to seriously customize the behavior of these elements (which is much more advanced than what we're covering here anyway), or for speed (in which case, Gensim or other tools might be better anyway).
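
As a small, hedged illustration of using pieces on their own, the sketch below builds a bare `English()` object (tokenizer and language data only, no trained components) and uses spaCy's English stopword set directly.  The variable names are just for illustration.

```python
from spacy.lang.en import English
from spacy.lang.en.stop_words import STOP_WORDS

# English() builds a bare pipeline: tokenizer and language data only,
# with no trained components and no model download.
bare_nlp = English()
tokenizer = bare_nlp.tokenizer

tokens = [tok.text for tok in tokenizer("This is a sturdy, sharp pocket knife!")]
print(tokens)

# The stopword list is a plain Python set that can be used on its own.
content_words = [t for t in tokens if t.isalpha() and t.lower() not in STOP_WORDS]
print(content_words)
```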

spaCy's models can also be run on a GPU for extra speed if your computer has one.  As with all GPU-bound computations, though, it's not necessarily an automatic speedup.  You can't/shouldn't use multiprocessing with GPU models (unless you have multiple discrete GPUs in your system), and you need to balance the length of your texts and the batch sizes going to the GPUs to minimize data serialization overhead.  Running on a GPU is extremely useful if you have a lot of data.

spaCy also has ways to save `Doc` objects to file and reload them later--look into `DocBin` if you're interested in this.
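
A minimal `DocBin` round trip might look like the sketch below.  It assumes a blank English pipeline and a temporary directory so it can run anywhere; the file name `docs.spacy` and the sample texts are made up for the example.

```python
import tempfile
from pathlib import Path

import spacy
from spacy.tokens import DocBin

nlp_blank = spacy.blank("en")
sample_texts = ["Outstanding little drive.", "It keeps on ticking."]

# Collect processed Docs into a DocBin.
doc_bin = DocBin()
for doc in nlp_blank.pipe(sample_texts):
    doc_bin.add(doc)

with tempfile.TemporaryDirectory() as tmp:
    path = Path(tmp) / "docs.spacy"
    doc_bin.to_disk(path)

    # Reloading needs a vocab to rebuild the Docs against.
    restored = list(DocBin().from_disk(path).get_docs(nlp_blank.vocab))

print([d.text for d in restored])
```

Saving the parsed documents this way means an expensive parsing pass only has to run once, and later experiments can start from the stored `Doc` objects.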