# Text preprocessing with Gensim

Gensim is a nice counterpart to spaCy, with a very different focus.  spaCy aims at general-purpose NLP and ships lots of language annotation tools.  Gensim is focused more directly on analyzing _enormous_ datasets, so its preprocessing steps make a lot of choices that trade accuracy for speed.

Let's repeat what we just did: a bag-of-words analysis with some basic preprocessing (tokenization, stopword removal, and stemming in place of lemmatization).  Then we'll feed the result back through our scikit-learn pipeline.

In [1]:
# same data loading code as before
import pandas as pd

electronics = pd.read_parquet("electronics.parquet")
video_games = pd.read_parquet("video_games.parquet")
clothes = pd.read_parquet("clothes.parquet")
    
# Remove 3-star reviews.
electronics = electronics[electronics["overall"] != 3]
video_games = video_games[video_games["overall"] != 3]
clothes = clothes[clothes["overall"] != 3]

# Set the "overall" column to the binary classes "Positive"
# (for >3) and "Negative" (for <3).
electronics["overall"] = ["Positive" if i > 3 else "Negative" for i in electronics["overall"]]
video_games["overall"] = ["Positive" if i > 3 else "Negative" for i in video_games["overall"]]
clothes["overall"] = ["Positive" if i > 3 else "Negative" for i in clothes["overall"]]

from sklearn.model_selection import train_test_split

train, test = train_test_split(
    electronics,
    train_size=0.9,
    stratify=electronics["overall"],
    random_state=0,
)
test = pd.concat((test, video_games, clothes))

In [2]:
from gensim.parsing import preprocessing
from tqdm.notebook import tqdm
tqdm.pandas()

def preprocess(s):
    """Apply some of gensim's preprocessing tools."""
    # remove HTML tags
    s = preprocessing.strip_tags(s)
    
    # Remove non-alphanumeric characters
    s = preprocessing.strip_punctuation(s)
    s = preprocessing.strip_numeric(s)
    
    # Remove stopwords
    s = preprocessing.remove_stopwords(s)
    
    # Remove short tokens. This turns out to sometimes
    # be useful when you're focusing on meaning; in English
    # and many Indo-European languages, very short words
    # are often structural or otherwise non-meaning-carrying.
    # Or at least, they're *less* meaning-carrying than longer
    # words.
    s = preprocessing.strip_short(s)
    
    # Stem the text using the Porter stemmer, which looks
    # at the last few characters in a word and applies some
    # patterns to remove inflection.
    s = preprocessing.stem_text(s)
    
    return s

train["Cleaned Text"] = train["reviewText"].progress_apply(preprocess)
test["Cleaned Text"] = test["reviewText"].progress_apply(preprocess)

  0%|          | 0/18000 [00:00<?, ?it/s]

  0%|          | 0/42000 [00:00<?, ?it/s]

That's fast!  We're applying most of the same steps as we did with spaCy, but several times faster.  The trade-off is that the results are not as easily human-readable.

(Note: the above processing steps, plus a few others, are available in the `gensim.parsing.preprocessing.preprocess_string()` function; I've broken them out to show how easy it is to mix and match different processing steps.)
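The mix-and-match idea comes down to chaining functions that each take a string and return a string.  Here is a minimal sketch of that pattern using only the standard library.  The `drop_*` helpers and `apply_filters` are hypothetical stand-ins written for illustration; they are not Gensim's implementations and will not match its output exactly.

```python
import re


def drop_tags(s):
    """Hypothetical stand-in: replace anything that looks like an HTML tag with a space."""
    return re.sub(r"<[^>]+>", " ", s)


def drop_non_letters(s):
    """Hypothetical stand-in: replace digits and punctuation with spaces."""
    return re.sub(r"[^A-Za-z\s]", " ", s)


def drop_short(s, minsize=3):
    """Hypothetical stand-in: keep only tokens with at least `minsize` characters."""
    return " ".join(w for w in s.split() if len(w) >= minsize)


def apply_filters(s, filters):
    """Apply each string -> string filter in order."""
    for f in filters:
        s = f(s)
    return s


# Reorder, add, or remove steps just by editing this list.
pipeline = [str.lower, drop_tags, drop_non_letters, drop_short]
cleaned = apply_filters("<b>Great</b> 16mb drive, it is OK!", pipeline)
print(cleaned)  # great drive
```

Because every step has the same shape, swapping one out (say, a different stemmer, or no short-token removal at all) is a one-line change.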

In [3]:
print(train["reviewText"].iloc[0])
print(train["Cleaned Text"].iloc[0])

Outstanding. I have the 16mb version of this. Its been on my key ring with my keys in my pocket with pocket knife and change and used every day for two years. It looks new and keeps on ticking
outstand version it kei ring kei pocket pocket knife chang dai year look new keep tick


Notice a few weird words:
- "Outstanding" --> "outstand"
- "Key" --> "kei"
- "Day" --> "dai"
- "Change" --> "chang"

This is pretty emblematic of the Porter stemmer.  The Porter stemmer is designed with a very narrow purpose in mind: making sure that every inflected form of a word is replaced with the same string.  It does not require that string to be the "dictionary form" of the word.  (This contrasts with spaCy, which does return the "dictionary form" of a word.)  The Porter stemmer works by looking only at the patterns of letters in the word and applying a set of replacement rules.  This allows it to be fast, but at the cost of human readability.  In practice, using a human-readable lemmatizer versus the Porter stemmer (or any of the many other stemmers out there) has had a negligible impact on the accuracy of my downstream models, at least every time I've tested it.
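To make the "replacement rules" idea concrete, here is a toy sketch with three suffix rules loosely modeled on the first steps of the Porter algorithm (plural stripping, "-ing" removal, and final "y" to "i").  This is an illustration only, not the Porter stemmer and not what Gensim runs: the real algorithm has many more rules and conditions, including the one that turns "change" into "chang".  The function names are made up for this example.

```python
VOWELS = set("aeiou")


def has_vowel(stem):
    """True if the stem contains at least one of a, e, i, o, u (a simplification)."""
    return any(ch in VOWELS for ch in stem)


def toy_stem(word):
    """Apply three suffix rules in order. A toy subset, NOT the full Porter stemmer."""
    word = word.lower()

    # Rule 1: plurals. "sses" -> "ss", "ies" -> "i", leave "ss" alone, drop a final "s".
    if word.endswith("sses"):
        word = word[:-2]
    elif word.endswith("ies"):
        word = word[:-2]
    elif word.endswith("ss"):
        pass
    elif word.endswith("s"):
        word = word[:-1]

    # Rule 2: drop "ing" when what remains contains a vowel.
    if word.endswith("ing") and has_vowel(word[:-3]):
        word = word[:-3]

    # Rule 3: final "y" becomes "i" when what precedes it contains a vowel.
    if word.endswith("y") and has_vowel(word[:-1]):
        word = word[:-1] + "i"

    return word


for word in ["key", "keys", "day", "outstanding", "ring", "pony", "ponies"]:
    print(word, "->", toy_stem(word))
# key -> kei
# keys -> kei
# day -> dai
# outstanding -> outstand
# ring -> ring
# pony -> poni
# ponies -> poni
```

Note that "key" and "keys" collapse to the same string, as do "pony" and "ponies".  That shared string is all the stemmer promises; whether it is a real word is beside the point.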

In [4]:
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.pipeline import Pipeline
from sklearn.svm import LinearSVC
from sklearn import metrics

clf = Pipeline([
    (
        "feature extraction",
        # TfidfVectorizer supports all the same options as CountVectorizer,
        # plus a few others that we're not setting
        TfidfVectorizer(
            min_df=10,
            max_df=0.5,
            # stop_words="english",
            # max_features=10_000,
        )
    ),
    ("classifier", LinearSVC(random_state=0, max_iter=2000))
])

# Just to show how long this can take, I'll use a
# %time "magic command" in Jupyter.  LinearSVC models
# are usually very fast, but this is a *lot* of data.
%time clf.fit(train["Cleaned Text"], train["overall"])
test["Predictions"] = clf.predict(test["Cleaned Text"])

# Calculate a few different F1 scores.
overall_f1 = metrics.f1_score(test["overall"], test["Predictions"], average="macro")
print(f"Overall F1 score: {overall_f1}")

for product_category, results in test.groupby("productCategory"):
    f1 = metrics.f1_score(results["overall"], results["Predictions"], average="macro")
    print(f"{product_category}-specific F1 score: {f1}")

CPU times: total: 828 ms
Wall time: 821 ms
Overall F1 score: 0.7892704955134862
Clothing and Jewelry-specific F1 score: 0.8046542231105482
Electronics-specific F1 score: 0.836921069797782
Video Games-specific F1 score: 0.7689399612259036
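
As a reminder of what `average="macro"` means in the scores above: F1 is computed separately for each class and the per-class scores are then averaged with equal weight, so the rarer class counts as much as the common one.  Below is a plain-Python sketch of that calculation.  It is not scikit-learn's implementation, the helper names are made up, and the toy labels are invented purely for illustration.

```python
def f1_for_class(y_true, y_pred, label):
    """F1 score treating `label` as the positive class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(t == label and p == label for t, p in pairs)
    fp = sum(t != label and p == label for t, p in pairs)
    fn = sum(t == label and p != label for t, p in pairs)
    if tp == 0:
        # No correct predictions for this class: score it as 0.
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


def macro_f1(y_true, y_pred):
    """Unweighted mean of the per-class F1 scores."""
    labels = sorted(set(y_true) | set(y_pred))
    return sum(f1_for_class(y_true, y_pred, lab) for lab in labels) / len(labels)


# Invented toy labels: 4 positive and 2 negative reviews.
y_true = ["Positive", "Positive", "Positive", "Positive", "Negative", "Negative"]
y_pred = ["Positive", "Positive", "Positive", "Negative", "Positive", "Negative"]

print(f1_for_class(y_true, y_pred, "Positive"))  # 0.75
print(f1_for_class(y_true, y_pred, "Negative"))  # 0.5
print(macro_f1(y_true, y_pred))                  # 0.625
```

In this toy case the minority class drags the macro score well below the majority-class F1, which is exactly why macro averaging is a reasonable choice for imbalanced review data.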
