# Document annotations with spaCy

This will be a pretty quick notebook: we'll use spaCy to generate a dense, 300-dimensional vector for each document in our dataset (built from the word vectors of its tokens), and train a classifier directly on those vectors.

In [1]:
# same data loading code as before
import os
import pandas as pd

def undersample_majority_classes(df):
    """Under-sample classes in the dataset so that there's
    an equal number of all target classes."""
    resample_n = min(5_000, df["overall"].value_counts().min())
    df = (
        df
        .groupby("overall")
        .sample(resample_n, random_state=0)
        .reset_index(drop=True)
    )
    return df

if not all([
    os.path.isfile("electronics.parquet"),
    os.path.isfile("video_games.parquet"),
    os.path.isfile("clothes.parquet"),
]):
    # the "reviewText" field contains the text of the review.
    # The "overall" field contains the number of stars.
    electronics = pd.read_json(
        "http://snap.stanford.edu/data/amazon/productGraph/categoryFiles/reviews_Electronics_5.json.gz",
        lines=True,
    )[["reviewText", "overall"]]
    electronics = undersample_majority_classes(electronics).assign(productCategory="Electronics")
    electronics.to_parquet("electronics.parquet")
    
    video_games = pd.read_json(
        "http://snap.stanford.edu/data/amazon/productGraph/categoryFiles/reviews_Video_Games_5.json.gz",
        lines=True,
    )[["reviewText", "overall"]]
    video_games = undersample_majority_classes(video_games).assign(productCategory="Video Games")
    video_games.to_parquet("video_games.parquet")
    
    clothes = pd.read_json(
        "http://snap.stanford.edu/data/amazon/productGraph/categoryFiles/reviews_Clothing_Shoes_and_Jewelry_5.json.gz",
        lines=True,
    )[["reviewText", "overall"]]
    clothes = undersample_majority_classes(clothes).assign(productCategory="Clothing and Jewelry")
    clothes.to_parquet("clothes.parquet")
else:
    electronics = pd.read_parquet("electronics.parquet")
    video_games = pd.read_parquet("video_games.parquet")
    clothes = pd.read_parquet("clothes.parquet")
    
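# Remove 3-star reviews. The .copy() makes each result an independent
# DataFrame, so the column assignments below don't trigger pandas'
# SettingWithCopyWarning.
electronics = electronics[electronics["overall"] != 3].copy()
video_games = video_games[video_games["overall"] != 3].copy()
clothes = clothes[clothes["overall"] != 3].copy()

# Set the "overall" column to the binary classes "Positive"
# (for >3) and "Negative" (for <3).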
electronics["overall"] = ["Positive" if i > 3 else "Negative" for i in electronics["overall"]]
video_games["overall"] = ["Positive" if i > 3 else "Negative" for i in video_games["overall"]]
clothes["overall"] = ["Positive" if i > 3 else "Negative" for i in clothes["overall"]]

from sklearn.model_selection import train_test_split

train, test = train_test_split(
    electronics,
    train_size=0.9,
    stratify=electronics["overall"],
    random_state=0,
)
test = pd.concat((test, video_games, clothes))

The small `en_core_web_sm` model doesn't ship with static word vectors, so we need one of the larger models; we'll load `en_core_web_lg`. Fortunately, we only need the tokenizer here, so we can skip basically all the other pipeline components to make it faster.  _Note:_ I've often found that, for reasons I don't entirely understand yet, using `n_process` with `nlp.pipe()` for vectorization seems to slow things down a lot.  So, we'll stick to a single process here.

In [2]:
import numpy as np
import spacy
from tqdm.notebook import tqdm

nlp = spacy.load("en_core_web_lg")
train_x = np.array([
    # make_doc() only runs the tokenizer; .vector then averages the
    # token vectors looked up in the vocab, so we skip the rest of the pipeline
    nlp.make_doc(i).vector
    for i in tqdm(train["reviewText"])
])
test_x = np.array([
    nlp.make_doc(i).vector
    for i in tqdm(test["reviewText"])
])

  0%|          | 0/18000 [00:00<?, ?it/s]

  0%|          | 0/42000 [00:00<?, ?it/s]
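For reference, spaCy's default `Doc.vector` is the average of the document's token vectors. The sketch below reproduces that averaging in plain Python on a made-up lookup table. The `toy_vectors` table, its 3-dimensional values, and the `document_vector` helper are illustrative assumptions, not spaCy's data or API.

```python
from statistics import fmean

# Hypothetical 3-dimensional "word vectors", made up for illustration.
# The real en_core_web_lg vectors have 300 dimensions.
toy_vectors = {
    "great": [0.9, 0.1, 0.0],
    "battery": [0.1, 0.8, 0.3],
    "life": [0.2, 0.7, 0.4],
}

def document_vector(tokens, vectors, dim=3):
    """Average the vectors of all tokens; unknown tokens count as zero vectors."""
    if not tokens:
        return [0.0] * dim
    rows = [vectors.get(token, [0.0] * dim) for token in tokens]
    # zip(*rows) walks the rows column by column, so each fmean is one dimension.
    return [fmean(column) for column in zip(*rows)]

doc_vec = document_vector(["great", "battery", "life"], toy_vectors)
print(doc_vec)
```

One consequence of averaging is that word order is lost: any permutation of the same tokens yields the same document vector.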

In [5]:
from sklearn.preprocessing import StandardScaler
from sklearn.pipeline import Pipeline
from sklearn.svm import LinearSVC
from sklearn import metrics

clf = Pipeline([
    ("scaler", StandardScaler()),
    ("classifier", LinearSVC(random_state=0, max_iter=2_000)),
])
%time clf.fit(train_x, train["overall"])
test["Predictions"] = clf.predict(test_x)

# Calculate a few different F1 scores.
overall_f1 = metrics.f1_score(test["overall"], test["Predictions"], average="macro")
print(f"Overall F1 score: {overall_f1}")

for product_category, results in test.groupby("productCategory"):
    f1 = metrics.f1_score(results["overall"], results["Predictions"], average="macro")
    print(f"{product_category}-specific F1 score: {f1}")



CPU times: total: 22.2 s
Wall time: 22.2 s
Overall F1 score: 0.787759197608813
Clothing and Jewelry-specific F1 score: 0.7893347345330886
Electronics-specific F1 score: 0.8023999649822723
Video Games-specific F1 score: 0.783445178481015


Even though the F1 score may seem to have gone down a bit compared to the bag-of-words approach, there are actually a _lot_ of benefits to this sort of dense vector representation.  Mostly, it allows much better generalizability.  Bag-of-words has a problem in that every word is orthogonal to every other word: the distance between "cat" and "dog" is the same as the distance between "cat" and "nuclear."  In a dense representation, this is not necessarily true:

In [4]:
cat = nlp("cat")
dog = nlp("dog")
nuclear = nlp("nuclear")
print(cat.similarity(dog))
print(cat.similarity(nuclear))

0.8016854705531046
0.1296658079441922


(This results not from the denseness itself, but rather from the way that these dense representations have been trained.)  So, by using these kinds of representations, our model can better generalize to words that weren't seen during training but that do have vectors in the vector model.  This lets us provide a bit of "prior knowledge" to the model, which can be extremely useful.
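To make the orthogonality point concrete, here is a minimal sketch comparing cosine similarity (which is what spaCy's `similarity()` computes by default) under a one-hot, bag-of-words-style encoding and under a dense encoding. The dense values are hand-picked for illustration and are not taken from `en_core_web_lg`; the `cosine` helper is likewise just for this example.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

vocabulary = ["cat", "dog", "nuclear"]

# One-hot encoding: each word gets its own dimension, as in bag-of-words.
one_hot = {
    word: [1.0 if i == j else 0.0 for j in range(len(vocabulary))]
    for i, word in enumerate(vocabulary)
}

# Hypothetical dense vectors, hand-picked so that "cat" and "dog" point
# in a similar direction while "nuclear" points elsewhere.
dense = {
    "cat": [0.9, 0.8, 0.1],
    "dog": [0.8, 0.9, 0.2],
    "nuclear": [0.1, 0.0, 0.9],
}

for name, table in (("one-hot", one_hot), ("dense", dense)):
    cat_dog = cosine(table["cat"], table["dog"])
    cat_nuclear = cosine(table["cat"], table["nuclear"])
    print(f"{name}: cat~dog={cat_dog:.3f}, cat~nuclear={cat_nuclear:.3f}")
```

With one-hot vectors, every pair of distinct words has similarity 0, so a classifier trained on "cat" learns nothing about "dog." With dense vectors, whatever the classifier learns about one word partially carries over to its neighbors.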