# Python demo: Bag-of-Words modeling with `gensim` and `scikit-learn`

`scikit-learn`'s tools for bag-of-words models are great, but we usually need (or want) to exercise a bit more fine-grained control than we can get by just using the `CountVectorizer`/`TfidfVectorizer`/`HashingVectorizer`.  The `gensim` library gives us some nice tools to do this.  `gensim` isn't really designed to be a text preprocessing library--it's designed for things like training word vector models (see notebook 02), topic models, and similar things--but it has a set of very simple, extremely fast, surprisingly robust text cleaning and preprocessing tools that we can borrow.  We'll use these tools to clean up our documents and transform them into bag-of-words models, entirely within Gensim, then use the Bernoulli Naive Bayes classifier to build the predictive model.

Pros:
- Much more granular control over our preprocessing compared to `CountVectorizer`/etc, without adding a lot of complexity.
- Still extremely fast--`gensim` is designed for speed and datasets of enormous size.

Cons:
- `gensim`'s preprocessing is really more _string processing_ than _language processing_.  This is part of why it's so fast, but it also means `gensim` knows nothing about, say, "verbs" or "sentences."

In [1]:
# requirements
# !conda install --yes pandas tqdm gensim scikit-learn ipywidgets

In [2]:
# tqdm is a magic library that gives you progerss bars when iterating
# through things.
from tqdm.notebook import tqdm

# register tqdm with pandas so we can get .progress_apply() method
# added to dataframes.  This is a version of pd.DataFrame.apply()
# but now it prints a progress bar!
tqdm.pandas(smoothing=0)

In [3]:
import pandas as pd

# load the data
train = pd.read_csv("../../data/train.csv")
test = pd.read_csv("../../data/test.csv")

In [4]:
# first: a helper function to absract the "fit + predict + score" logic.
from sklearn import metrics

def fit_and_score(clf, train_x, train_y, test_x, test_y):
    """fit the model `clf` to the `train` dataset and evaluate its
    performance on the `test` dataset."""
    clf.fit(train_x, train_y)
    preds = clf.predict(test_x)
    
    # calculate some classification metrics
    accuracy = metrics.accuracy_score(preds, test_y)
    f1 = metrics.f1_score(preds, test_y, average="macro")

    # and some regression metrics (since "predict the number of stars"
    # could reasonably be either kind of task).
    r2 = metrics.r2_score(preds, test_y)
    mae = metrics.mean_absolute_error(preds, test_y)
    
    return pd.Series({"Accuracy": accuracy, "F1": f1, "R2": r2, "MAE": mae})

Gensim's preprocessing tools are mostly in the `gensim.parsing.preprocessing` module.  There are a lot of smaller functions for specific tasks, but there's also `preprocess_string`, which applies a (very sensible) set of default preprocessing steps.  (We'll break this down into its component pieces in just a minute).  This function takes in a string, and returns a list of strings; i.e., it converts documents (strings) into lists of processed tokens.

In [5]:
from gensim.parsing import preprocessing as pre

# default preprocessing pipeline.  lowercases, removes numbers/punctuation/
# other funny characters, stems, and returns each document as a list of
# string.
preprocessed = train["review_body"].progress_apply(pre.preprocess_string)
print(preprocessed[0])

  0%|          | 0/200000 [00:00<?, ?it/s]

['arriv', 'broken', 'manufactur', 'defect', 'leg', 'base', 'complet', 'form', 'wai', 'insert', 'caster', 'unpackag', 'entir', 'chair', 'hardwar', 'notic', 'spend', 'twice', 'time', 'box', 'useless', 'thing', 'send', 'star', 'review', 'chair', 'got', 'sit', 'far', 'includ', 'pictur', 'inject', 'mold', 'qualiti', 'assur', 'process', 'miss', 'hesit', 'bui', 'make', 'wonder', 'aren', 'miss', 'structur', 'support', 'imped', 'assembl', 'process']


The list of preprocessing steps performed by `preprocess_string()` can be broken out using other functions from this same module:

In [6]:
def preprocess(s):
    """preprocess a document and return a list of processd tokens."""
    # convert the text to lowercase
    s = s.lower()

    # remove HTML/XML tags, which can show up a lot in data pulled from
    # the internet.
    s = pre.strip_tags(s)

    # remove punctuation--rarely useful/needed for the things Gensim is designed
    # to do.
    s = pre.strip_punctuation(s)

    # Replace multiple whitespaces with a single space
    s = pre.strip_multiple_whitespaces(s)
    
    # remove numbers
    s = pre.strip_numeric(s)
    
    # remove stopword
    s = pre.remove_stopwords(s)
    
    # remove any short tokens (2 letters or less)
    s = pre.strip_short(s)
    
    # run the text through the Porter stemmer
    s = pre.stem_text(s)
    
    # split the string at whitespaces to get the list of tokens
    return s.split()
    
preprocessed = train["review_body"].progress_apply(preprocess)
print(preprocessed[0])

  0%|          | 0/200000 [00:00<?, ?it/s]

['arriv', 'broken', 'manufactur', 'defect', 'leg', 'base', 'complet', 'form', 'wai', 'insert', 'caster', 'unpackag', 'entir', 'chair', 'hardwar', 'notic', 'spend', 'twice', 'time', 'box', 'useless', 'thing', 'send', 'star', 'review', 'chair', 'got', 'sit', 'far', 'includ', 'pictur', 'inject', 'mold', 'qualiti', 'assur', 'process', 'miss', 'hesit', 'bui', 'make', 'wonder', 'aren', 'miss', 'structur', 'support', 'imped', 'assembl', 'process']


Now, we have to create a document-term matrix from these lists of lists of tokens.  `gensim` wants us to go about it this way:
- Create a `gensim.corpora.Dictionary` object, which will store tokens, their raw frequencies, and their document frequencies (i.e. how many documents they appear in).
- (optiona, but recommended) remove super rare and super frequent words from the `Dictionary`'s vocabulary list.
- Use the `Dictionary` to transform our list of lists of tokens into a (`gensim`-specific) sparse matrix format.
- Use `gensim.matutils` module to convert into a Scipy sparse matrix format, so we can use it with `scikit-learn` models.

It's not as much code as it sounds.

_Note:_ we very easily could just stop here and go straight to `scikit-learn`.  All we'd need to to is skip the `s.split()` line in the `preprocess()` function we just wrote (so we get back strings, not lists of strings), but then we could pass our `preprocessed` object directly in as our `X` value to a `CountVectorizer` + `BernoulliNB` pipeline.  This is probably a good idea, since the next few cells are basically just replicating the workof the `CountVectorizer` transformer, but these steps are required for using almost any of `gensim`'s own models (which we'll do in notebook 02).

In [7]:
from gensim.corpora import Dictionary

# gensim.corpora.Dictionary objects find all of our unique vocabulary,
# can filter out super rare/common terms, and can then efficiently transform
# processed dodcument into bag-of-words formats.
id2word = Dictionary(preprocessed)

# Remove very rare and very common words--this operation happens in-place.
id2word.filter_extremes(
    # pass a float between 0 and 1 --> remove any token in more/fewer than
    # that *percent* of documents.  Pass an integer --> remove any token
    # that appears in more/fewer than exactly that many documents.
    no_above=0.5,
    no_below=10,
)

# Dictionary.doc2bow(list_of_tokens) will convert list_of_tokens into
# a Gensim-internal sparse matrix format.
bow = [id2word.doc2bow(i) for i in preprocessed]
print(bow[0])

[(0, 1), (1, 1), (2, 1), (3, 1), (4, 1), (5, 1), (6, 1), (7, 1), (8, 1), (9, 2), (10, 1), (11, 1), (12, 1), (13, 1), (14, 1), (15, 1), (16, 1), (17, 1), (18, 1), (19, 1), (20, 1), (21, 1), (22, 1), (23, 1), (24, 1), (25, 2), (26, 1), (27, 1), (28, 1), (29, 2), (30, 1), (31, 1), (32, 1), (33, 1), (34, 1), (35, 1), (36, 1), (37, 1), (38, 1), (39, 1), (40, 1), (41, 1), (42, 1), (43, 1), (44, 1)]


To convert this format (a list of length-2 tuples) into a `scikit-learn` compatible format, we need to grab `gensim.matutils.corpus2csc`--which will convert this into a _compressed sparse column (CSC)_ matrix--and then transpose it via the `.T` attribute.  `corpus2csc` gives us a sparse matrix with _one column per observation, one row per feature_, but `scikit-learn` expects _one row per observation, one column per feature_, hence the transpose operation.

In [8]:
from gensim.matutils import corpus2csc

# csc = compressed sparse column = one column per document.
# Need to transpose it to get one document per row, which is what
# scikit-learn models expect.  This also convert it to a compressed
# sparse row format for us.
bow_train = corpus2csc(bow).T

We can combine these steps (minus training the `Dictionary` object) together into a series of chained `pandas` methods:

In [9]:
bow_test = (
    test["review_body"]
    .progress_apply(preprocess)
    .progress_apply(id2word.doc2bow)
    # need to specify num_terms for corpus2csc, otherwise
    # we might end up with all-zero columns being dropped.
    # that will cause issues later for the naive bayes model.
    .pipe(corpus2csc, num_terms=len(id2word))
).T

  0%|          | 0/5000 [00:00<?, ?it/s]

  0%|          | 0/5000 [00:00<?, ?it/s]

In [10]:
from sklearn.naive_bayes import BernoulliNB

fit_and_score(
    BernoulliNB(),
    bow_train,
    train["stars"],
    bow_test,
    test["stars"],
)

Accuracy    0.450800
F1          0.430264
R2          0.304645
MAE         0.865000
dtype: float64

`gensim` really shines in other areas than its text preprocessing; its preprocessing tools, while convenient and fast, aren't really doing anything fancy.  Under the hood, most of it is just regular expressions and removing word that appear in a pre-made stopword list.  You could probably implement these functions yourself pretty easily (and it's a good exercize to try doing that), but `gensim` provides them in one nice, convenient place.  And, compared to more linguistically-savvy methods (like `spacy`, which we'll see in the notebok 01c), `gensim` is extremely fast.  If you find yourself with a stupidly large corpus, `gensim` might be a good preprocessing join just because of its speed.

In notebook 02 we'll see some things `gensim` is much more specialized in and much better at, namely training our own Word2Vec models.  `gensim`'s real killer features are that kind of model--the (mostly unsupervised) text models like embeddings and topic models.