# Python demo: Bag-of-Words modeling with `spacy` + `scikit-learn`

`spacy` is a library that provides the next level of linguistic sophistication and control.  Unlike `gensim`, which is primarily built on _string processing_ operations, `spacy` is built to be much more linguistically-savvy.  It knows about things like verb, nouns, and syntax, and it's probably the single best go-to tool for general-purpose linguistic annotations and parsing.  But, this comes at a cost: it's slower than `gensim` (by quite a lot), and it uses a lot more memory (since it uses a number of machine learning models to do all of its annotations).

We'll use `spacy` to recreate most of what we did with `gensim` in notebook 01b.

Pros:
- Much deeper, richer linguistic annotations: you can add features like part of speech, named entity tags, syntactic information, etc., and enrich your data.
- Very fast given how much it's doing; still slower than `gensim` or a pure-`scikit-learn` solution, but much faster than many tools that apply similar kind of annotations.
- Extremely easy to use; the API is super easy to get started with.
- Models can be run on a GPU for extra speed.
- You can train your own `spacy` models to do different annotation tasks.  How to do this goes well beyond the scope of this notebook, but it's very doable.

Cons:
- Slower--in absolute terms--than something like `gensim`.  (because it's just doing more stuff).
- Requires a bit more coding to really make use of its features--it's a surprisingly deep library.
- The API can be unexpectedly deep, which can lead to occasional "footgun" moments.

In [1]:
# requirements
# !conda install --yes tqdm pandas scikit-learn spacy
# !python -m spacy download en_core_web_sm

In [2]:
# tqdm is a magic library that gives you progerss bars when iterating
# through things.
from tqdm.notebook import tqdm

# register tqdm with pandas so we can get .progress_apply() method
# added to dataframes.  This is a version of pd.DataFrame.apply()
# but now it prints a progress bar!
tqdm.pandas(smoothing=0)

In [3]:
import pandas as pd

# load the data
train = pd.read_csv("../../data/train.csv")
test = pd.read_csv("../../data/test.csv")

train.head()

Unnamed: 0,review_id,product_id,reviewer_id,stars,review_body,review_title,language,product_category
0,en_0964290,product_en_0740675,reviewer_en_0342986,1,Arrived broken. Manufacturer defect. Two of th...,I'll spend twice the amount of time boxing up ...,en,furniture
1,en_0690095,product_en_0440378,reviewer_en_0133349,1,the cabinet dot were all detached from backing...,Not use able,en,home_improvement
2,en_0311558,product_en_0399702,reviewer_en_0152034,1,I received my first order of this product and ...,The product is junk.,en,home
3,en_0044972,product_en_0444063,reviewer_en_0656967,1,This product is a piece of shit. Do not buy. D...,Fucking waste of money,en,wireless
4,en_0784379,product_en_0139353,reviewer_en_0757638,1,went through 3 in one day doesn't fit correct ...,bubble,en,pc


In [4]:
# first: a helper function to absract the "fit + predict + score" logic.
from sklearn import metrics

def fit_and_score(clf, train_x, train_y, test_x, test_y):
    """fit the model `clf` to the `train` dataset and evaluate its
    performance on the `test` dataset."""
    clf.fit(train_x, train_y)
    preds = clf.predict(test_x)
    
    # calculate some classification metrics
    accuracy = metrics.accuracy_score(preds, test_y)
    f1 = metrics.f1_score(preds, test_y, average="macro")

    # and some regression metrics (since "predict the number of stars"
    # could reasonably be either kind of task).
    r2 = metrics.r2_score(preds, test_y)
    mae = metrics.mean_absolute_error(preds, test_y)
    
    return pd.Series({"Accuracy": accuracy, "F1": f1, "R2": r2, "MAE": mae})

`spacy` is build arounds its downloadable models, which have been trained by the developers to do a wide range of linguistic annotation tasks like tokenization, part-of-speech tagging, syntactic dependency analysis, and more.  You do have to download the models before you can use them, using the command `python -m spacy download [model]` from the command line.  (and from inside your conda environment/virtual environment/whatever sort of environment you migth be using).  But once it's downloaded, it's super easy:

In [5]:
import spacy

# the small English model--optimized for speed and memory footprint,
# but at the cost of (a little bit of) accuracy.
nlp = spacy.load("en_core_web_sm")

# run the full annotation pipeline on a piece of text.
doc = nlp("This is an example sentence.")

`doc` now stores the _document_ after `spacy` processes it.  `doc` behaves like a list of _annotated tokens_.  Note that token annotations are accessible via attributes, and most attributes have two versions: with with an underscore at the end (e.g.: `token.lemma_`) and one without (`token.lemma`).  You usually want the underscore version.  The non-underscore version returns an integer value, which `spacy` uses internally to track and work with tokens.  (It's actually faster for `spacy` to work with numeric representations rather than strings).  You should basically never need the non-underscore-having attributes.

In [6]:
for token in doc:
    print(
        token,          # original orthographic form of the token
        token.lower_,   # lowercased version of the token
        token.lemma_,   # lemmatized (stemmed) form of the token
        token.pos_,     # coarse-grained part of speech tag
        token.tag_,     # fine-grained part of speech tag
        token.ent_iob_, # named entity type
        token.dep_,     # syntactic depdency role
        token.is_stop,  # True if the token is a stopword, else False
        token.morph,    # miscellaneous morphological information, in
                        # `Feature=Value|Feature=Value|...` format.
    )

This this this PRON DT O nsubj True Number=Sing|PronType=Dem
is is be AUX VBZ O ROOT True Mood=Ind|Number=Sing|Person=3|Tense=Pres|VerbForm=Fin
an an an DET DT O det True Definite=Ind|PronType=Art
example example example NOUN NN O compound False Number=Sing
sentence sentence sentence NOUN NN O attr False Number=Sing
. . . PUNCT . O punct False PunctType=Peri


We can use these annotations to filter tokens.  The function below keeps the lemmatized form of any word that isn't a stopword, isn't a punctuation token, isn't a whitespace token, and isn't a number.

In [7]:
def spacy_preprocess(nlp, texts):
    # nlp.pipe(docs) will run the pipeline over each document and return
    # an iterator over processed documents.  This can be multiprocessed
    # for extra speed.
    docs = nlp.pipe(
        tqdm(texts),
        
        # spaCy processes a few hundred documents per second at its
        # default pipeline configuration.
        # disable some steps we don't need to speed this up;
        # "parser" = the syntactic parser, and "ner" = named entity
        # recognition.
        disable=["parser", "ner"],
        
        # multiprocess it--but be warned, this creates a fully copy
        # of the model in each worker process.
        # this causes a good bit of startup overhead, but it's worth it
        # for this much data.
        n_process=8,
        batch_size=500,
    )

    # we could filter the tokens in a lot of ways, but I'm choosing
    # list comprehension today.
    docs = (
        [
            tok.lemma_.lower()
            for tok in doc
            if not (
                tok.is_stop     # no stopwords
                or tok.is_space # no space tokens
                or tok.is_punct # no punctuation tokens
                or tok.is_digit # no numbers
            )
        ]
        for doc in docs
    )

    # spacy has nothing like Gensim's `Dictionary`--so we'll join the tokens
    # back into one string and feed it through scikit-learn's `CountVectorizer`.
    docs = [" ".join(i) for i in docs]
    
    return docs

bow_train = spacy_preprocess(nlp, train["review_body"])
bow_test = spacy_preprocess(nlp, test["review_body"])

  0%|          | 0/200000 [00:00<?, ?it/s]

  0%|          | 0/5000 [00:00<?, ?it/s]

Note the speed--this is a lot slower than `gensim`, even when we use multiprocessing for a speedup.  But this isn't surprising, since the `spacy` models are doing a lot more work than the `gensim` preprocessing functions.

In [8]:
bow_test[0]

'awful fabric feel like tablecloth fit like child clothing customer service nice regret miss return date donate quality poor'

In [9]:
# fit, predict, and score
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import BernoulliNB
from sklearn.pipeline import Pipeline
clf = Pipeline([
    ("bag of words", CountVectorizer(max_df=0.5, min_df=10)),
    ("clf", BernoulliNB())
])
fit_and_score(
    clf,
    bow_train,
    train["stars"],
    bow_train,
    train["stars"],
)

Accuracy    0.486365
F1          0.468694
R2          0.348594
MAE         0.809565
dtype: float64

There is a _lot_ more you can do with spaCy.  It excels at anything where you need to have linguistically-relevant annotations (e.g.: grammatical and semantic annotations; the "mechanisms of language itself" rather than "thing that language might be connected to"), but it does a _lot_ of processing, so you will often be trading speed for the extra accuracy.