# Training your own word embeddings with Gensim

So far, we've been focusing on how to convert text to numbers by _counting words_ (but we can very easily generalize to "counting features" even if the features aren't strictly words).  But there are other ways.

We saw that spaCy ships with _word vectors_ in the `en_core_web_lg` model.  Let's talk about these vectors for a second, then we'll see how we can train our own using Gensim.

# Word vectors: Dense representations

We were focused, before, on representing _documents_ as vectors.  But we can also represent _words_ as vectors, using one of two options:

1. A word is the vector of documents it appears in, along with its counts.  (In other words: a word vector is the column in a document-term matrix).
2. A word is a "one-hot" vector that just indicates what the current word is.

The former definition is very rarely used, since the size of that vector will be as big as the number of documents you have, which can often be a lot.  E.g.: in our examples, we had tens of thousands of documents at the low end--having a bunch of 10,000-long vectors (and longer) is not really feasible.
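
To make the first option concrete, here is a minimal sketch in plain Python.  The three documents are made up for illustration, and `word_vector` is a hypothetical helper, not part of any library.  It builds a tiny document-term matrix and reads off one column as the "word vector":

```python
from collections import Counter

# Three tiny, made-up documents (already tokenized).
docs = [
    ["the", "cat", "sat"],
    ["the", "cat", "ate", "the", "fish"],
    ["the", "dog", "sat"],
]

# Fix one column order for the vocabulary.
vocab = sorted({word for doc in docs for word in doc})

# Document-term matrix: one row per document, one column per word.
counts = [Counter(doc) for doc in docs]
doc_term = [[c[word] for word in vocab] for c in counts]


def word_vector(word):
    """Return the column for `word`: its count in each document."""
    col = vocab.index(word)
    return [row[col] for row in doc_term]


print(word_vector("the"))  # [1, 2, 1]
print(word_vector("cat"))  # [1, 1, 0]
```

Each word vector has exactly one entry per document, which is why this representation grows with the size of the corpus.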

The latter definition is more common.  In practice it looks something like this:

$$
\begin{align}
\vec{the} &= [1,0,0,0,0]\\
\vec{cat} &= [0,1,0,0,0]\\
\vec{sat} &= [0,0,1,0,0]\\
\vec{on} &=  [0,0,0,1,0]\\
\vec{mat} &= [0,0,0,0,1]
\end{align}
$$

I.e.: each word is a vector of all zeros, but with a single 1.  The position of the 1 corresponds to which word it is: if the first number in the vector is 1, then the word is "the".  If the second number is 1, the word is "cat".  (This 1 is sometimes called the "hot" value--since every vector has only one hot value, these are "one-hot" vectors.)
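
As a rough sketch of how you might build these vectors yourself (plain Python; `one_hot` is a hypothetical helper name, and the vocabulary order is simply the one used in the equations above):

```python
# The five-word vocabulary, in the same order as the equations above.
vocab = ["the", "cat", "sat", "on", "mat"]
index = {word: i for i, word in enumerate(vocab)}


def one_hot(word):
    """All zeros, except a single 1 at the word's position."""
    vec = [0] * len(vocab)
    vec[index[word]] = 1
    return vec


print(one_hot("the"))  # [1, 0, 0, 0, 0]
print(one_hot("cat"))  # [0, 1, 0, 0, 0]
```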

We can represent a document by just summing the vectors of its words, element-wise:

$$
\begin{align}
\text{The cat sat on the mat} =& \vec{the} + \vec{cat} + \vec{sat} + \vec{on} + \vec{the} + \vec{mat} \\
=&\quad [1,0,0,0,0]\\
&+ [0,1,0,0,0]\\
&+ [0,0,1,0,0]\\
&+ [0,0,0,1,0]\\
&+ [1,0,0,0,0]\\
&+ [0,0,0,0,1]\\
=&\quad [2,1,1,1,1]
\end{align}
$$
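
The same sum can be sketched in a few lines of plain Python.  The helper names are made up for this example; the point is only that adding one-hot vectors element-wise gives you back the word counts:

```python
vocab = ["the", "cat", "sat", "on", "mat"]
index = {word: i for i, word in enumerate(vocab)}


def one_hot(word):
    vec = [0] * len(vocab)
    vec[index[word]] = 1
    return vec


def document_vector(tokens):
    """Element-wise sum of the one-hot vectors of every token."""
    total = [0] * len(vocab)
    for token in tokens:
        total = [a + b for a, b in zip(total, one_hot(token))]
    return total


tokens = "The cat sat on the mat".lower().split()
print(document_vector(tokens))  # [2, 1, 1, 1, 1]
```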

However, there are some issues with this approach.  We've already mentioned the issue of _sparsity_: in the real world, you'll end up with document vectors that are mostly 0s, which can hurt both speed and accuracy.  But there's a deeper issue here.  This sort of representation can't encode anything about the similarity between words.  If every word is a one-hot vector, then the similarity between any two different words is always the exact same number (with cosine similarity, that number is 0).
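
A quick way to convince yourself of this is to compute the cosine similarity for every pair of distinct words.  This is a sketch in plain Python, with cosine similarity chosen as the similarity measure for the example:

```python
import math
from itertools import combinations

vocab = ["the", "cat", "sat", "on", "mat"]


def one_hot(word):
    return [1 if w == word else 0 for w in vocab]


def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


# Similarity for every pair of *different* words.
similarities = {
    (a, b): cosine(one_hot(a), one_hot(b))
    for a, b in combinations(vocab, 2)
}
print(set(similarities.values()))  # {0.0}
```

Every pair gets the same score, so "cat" is exactly as similar to "mat" as it is to "the".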

Even though we're rarely concerned with "is this linguistically sound," this should give us a bit of pause because of how egregiously un-language-like it is.  But there are also other, practical ramifications: if we show our model some new data full of words it's never seen before, it will just ignore them, even if they're actually important.  If we need to do some sort of similarity query between documents--e.g., "give me documents that are semantically similar to this one"--a word-count model will struggle, since it's only going to be able to look for documents that use _the same words_.  Yet, semantic similarity is much richer than "using the same words."  Consider the following two sentences:
- This book is great.
- That novel was enjoyable.

These two sentences could easily be saying the same thing, but they share no words in common.  A word-count approach will treat these two sentences as having nothing in common.  You might get some similarity if you're using a stemmer that maps "this" and "that" to a common form, and maps "is" and "was" to a common form, but "uses the same stopwords" is a bad way to measure similarity.  What we want is some representation that knows "book" and "novel" are similar, and that "great" and "enjoyable" are similar.
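
Here is a small sketch of that failure, using plain Python word counts and cosine similarity.  The tokenization is deliberately crude, and the third sentence is one I've added purely for contrast:

```python
import math
from collections import Counter


def bag_of_words(text):
    """Crude tokenizer: lowercase, drop periods, split on whitespace."""
    return Counter(text.lower().replace(".", "").split())


def cosine(a, b):
    """Cosine similarity between two word-count dictionaries."""
    dot = sum(a[w] * b[w] for w in a.keys() & b.keys())
    norm_a = math.sqrt(sum(c * c for c in a.values()))
    norm_b = math.sqrt(sum(c * c for c in b.values()))
    return dot / (norm_a * norm_b)


first = bag_of_words("This book is great.")
second = bag_of_words("That novel was enjoyable.")
third = bag_of_words("This book is long.")  # added for contrast

print(cosine(first, second))  # 0.0
print(cosine(first, third))   # 0.75
```

The word-count model scores the paraphrase as completely unrelated, while the sentence that merely shares three words looks quite similar.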

Enter _word embeddings._  Word embeddings try to do exactly this.  Rather than representing words as large, sparse, one-hot vectors, they represent words as smaller, dense, floating point vectors with (in practice) no zeros.  The vectors are created by training them on a large corpus of text, and letting the embedding algorithm figure out good vectors.  Each embedding algorithm has a different notion of what a "good vector" is, but all useful embedding algorithms will ultimately give you vectors that encode some meaningful notion of "similarity" between words.
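
To get a feel for what "encoding similarity" means, here is a toy sketch.  The 3-dimensional vectors below are invented by hand for illustration (real embeddings are learned from a corpus and have hundreds of dimensions), but they show the kind of geometry we're hoping for:

```python
import math

# Hypothetical, hand-made embeddings: NOT learned from any data.
embeddings = {
    "book": [0.9, 0.1, 0.3],
    "novel": [0.8, 0.2, 0.4],
    "great": [0.1, 0.9, 0.2],
    "enjoyable": [0.2, 0.8, 0.3],
}


def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


for a, b in [("book", "novel"), ("great", "enjoyable"), ("book", "great")]:
    print(f"cosine({a!r}, {b!r}) = {cosine(embeddings[a], embeddings[b]):.3f}")
```

With dense vectors, related words can sit close together and unrelated words further apart, which one-hot vectors can never do.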

Once we have these embeddings, we can use them--instead of our one-hot encodings--to train our models.  (We can't do this with most topic models, though; the math behind them assumes, and requires, word counts.  But for any supervised learning task, we can absolutely use word embeddings.)

Empirically, word embeddings work extremely well.  There are a few reasons, but the most interesting one--to me--is that they sort of bring "outside knowledge" to bear on your problem.  You can train word embeddings on _unlabeled_ corpora, which are easy to get.  The word embeddings will then encode something about "what words mean" based on that corpus (specifically: based on what words tend to appear around what other words), and then you can apply that knowledge to your data, even if your data is a lot smaller.

Word embeddings still suffer from some of the issues of sparse/one-hot representations.  If you encounter a word that doesn't have an embedding, it gets ignored.  But this is a much rarer problem, since it's easy to train the embeddings on a huge corpus with lots of different words that might not be in your labeled data.

Word embeddings are also very data-hungry.  A good rule of thumb is that it's not worth training your own embeddings until you have about a million words.  Below that very rough threshold, you probably won't see much benefit, and training your own word embeddings might give _worse_ results than using bag-of-words, or using someone else's word embeddings (like spaCy's, which are generally pretty good).

Word embeddings also don't bypass the issue of domain specificity and inappropriate transfer.  If you train word embeddings on a bunch of scientific papers, they will learn what language in scientific papers looks like.  That probably won't transfer very well to something like reflective journaling of high school students.

Anyways, let's train our own with Gensim.  We'll use the same dataset--just the electronics, like with the topic modeling--and train a Word2Vec embedding model.  (There are many alternatives--GloVe and FastText are the biggest--but Word2Vec was the first big breakthrough embedding model, and remains an excellent choice).  We'll apply the same preprocessing and filtering that we used for the topic modeling.  Unlike with the topic modeling, though, we're going to keep the number of stars.  We'll train the Word2Vec model on _all_ the reviews--regardless of how many stars--and then transform only a moderately sized subset of positive and negative reviews.

In [1]:
import os
import numpy as np
import pandas as pd

if not os.path.isfile("electronics_topic_modeling.parquet"):
    reviews = pd.read_json(
        "http://snap.stanford.edu/data/amazon/productGraph/categoryFiles/reviews_Electronics_5.json.gz",
        lines=True,
    )[["reviewText", "overall"]] # we'll use the "overall" column in the next notebook
    reviews.to_parquet("electronics_topic_modeling.parquet")
else:
    reviews = pd.read_parquet("electronics_topic_modeling.parquet")
    
reviews = reviews.dropna()
reviews["overall"] = [
    1 if i > 3 else 0 if i < 3 else np.nan 
    for i in reviews["overall"]
]
print(f"{reviews.shape[0]:,} reviews.")

1,689,188 reviews.


In [2]:
# Preprocessing the text for Word2Vec
from joblib import Parallel, delayed
from gensim.parsing import preprocessing
from tqdm.notebook import tqdm

def preprocess(s):
    """Apply some of gensim's preprocessing tools."""
    s = preprocessing.strip_punctuation(s)
    s = preprocessing.strip_numeric(s)
    s = preprocessing.remove_stopwords(s.lower())
    s = preprocessing.strip_short(s)
    s = preprocessing.stem_text(s)
    return s.split()

parsed = Parallel(-1)(
    delayed(preprocess)(i)
    for i in tqdm(reviews["reviewText"], unit_scale=True, smoothing=0)
)

# We'll re-use this later, so save it back to the dataframe.
reviews["Preprocessed"] = parsed

  0%|          | 0.00/1.69M [00:00<?, ?it/s]

In [3]:
# Set up environment variables, like before
import os

os.environ["OMP_NUM_THREADS"] = "1"
os.environ["MKL_NUM_THREADS"] = "1"
os.environ["OPENBLAS_NUM_THREADS"] = "1"
os.environ["NUMEXPR_NUM_THREADS"] = "1"

Like Gensim's LDA, Word2Vec does have a method for adding callbacks that run at the end of each pass over the data.  But these only run at the end of the epoch, so we'll re-use the `Corpus` class from the last notebook to track progress _within_ each epoch.
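
One detail worth spelling out: training makes several passes over the data, so whatever we hand to Word2Vec has to be something that can be iterated over more than once.  A plain generator can't do that, but an object with an `__iter__` method can.  Here's a minimal sketch of the difference (the names are made up for this example):

```python
def token_generator(docs):
    """A one-shot generator: exhausted after a single pass."""
    for doc in docs:
        yield doc


class Restartable:
    """Each call to __iter__ starts a fresh pass over the data."""

    def __init__(self, docs):
        self.docs = docs

    def __iter__(self):
        for doc in self.docs:
            yield doc


docs = [["good", "camera"], ["bad", "cable"]]

gen = token_generator(docs)
print(len(list(gen)), len(list(gen)))  # 2 0

corpus = Restartable(docs)
print(len(list(corpus)), len(list(corpus)))  # 2 2
```

The `Corpus` class follows the second pattern, and simply wraps each pass in a progress bar.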

In [4]:
class Corpus:
    def __init__(self, it):
        self.it = it
        self.n_passes = 1
        
    def __len__(self):
        return len(self.it)
    
    def __iter__(self):
        for i in tqdm(
            self.it,
            desc=f"Pass {self.n_passes}",
            unit_scale=True,
            smoothing=0,
        ):
            yield i
        self.n_passes += 1

In [5]:
# Train word2vec
from gensim.models.word2vec import Word2Vec

# Remove any very short documents.  I've picked a 20 word
# threshold more or less at random.
parsed = [i for i in parsed if len(i) >= 20]

w2v = Word2Vec(
    sentences=Corpus(parsed), # iterable of list of strings
    vector_size=300, # 300-dimensional vectors--pretty standard size
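    sg=0, # 0 = CBOW (continuous bag of words); 1 would select skip-gram
    hs=0, # 0 = negative sampling; 1 would select hierarchical softmax
    workers=10,
    epochs=10, # 10 training passes, after 1 initial pass to build the vocabulary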
)

Pass 1:   0%|          | 0.00/1.04M [00:00<?, ?it/s]

Pass 2:   0%|          | 0.00/1.04M [00:00<?, ?it/s]

Pass 3:   0%|          | 0.00/1.04M [00:00<?, ?it/s]

Pass 4:   0%|          | 0.00/1.04M [00:00<?, ?it/s]

Pass 5:   0%|          | 0.00/1.04M [00:00<?, ?it/s]

Pass 6:   0%|          | 0.00/1.04M [00:00<?, ?it/s]

Pass 7:   0%|          | 0.00/1.04M [00:00<?, ?it/s]

Pass 8:   0%|          | 0.00/1.04M [00:00<?, ?it/s]

Pass 9:   0%|          | 0.00/1.04M [00:00<?, ?it/s]

Pass 10:   0%|          | 0.00/1.04M [00:00<?, ?it/s]

Pass 11:   0%|          | 0.00/1.04M [00:00<?, ?it/s]

Note that this took some time.  You can tweak some of the settings to make this go faster, but training a Word2Vec model just takes some time.  _However:_ it is very common to train a very large Word2Vec model once, and re-use it for multiple projects later.  E.g., you may train one large Word2Vec model on a huge amount of learner writing data (forum posts, tweets, assignments, you name it), and re-use that same embedding model when you need to vectorize some text in the future.  Training a huge model and re-using it makes this a one-time cost.

Side note: this idea of "train it once, re-use it for a lot of things" is pretty similar to _transfer learning,_ which has taken over NLP with the advent of Transformer neural network models.  Transformers, like Word2Vec, are trained once on a very large dataset, then re-used by researchers and practitioners later.  Unlike Word2Vec, though, it's extremely common to fine-tune the Transformer model on your particular task, allowing it to "adapt" to the specifics of your data.  There are a few ways you can do this with word embeddings--you could put them into an embedding layer of a neural network, essentially using them as "starting values" that the network will update; or, you can sometimes continue training the model by adding more specific data later, but this gets pretty complicated.  In practice, word embeddings are rarely fine-tuned for a particular dataset or task, though.

We can get the vector for a single word by accessing the `.wv` attribute of the trained model, which behaves like a dictionary that stores word-vector mappings.

In [6]:
print(w2v.wv["camera"])

[ 0.42554554  0.62747514 -0.22150028  0.944202   -1.1443033   0.20978229
 -0.65895957 -1.3480129   0.37489632  1.4652174  -0.85104567 -1.097804
 -1.3921007   1.1738586  -0.28471017 -0.2805354  -2.2720065   0.32445905
 -0.19563293 -1.1132339  -0.4860228  -1.0906961  -0.6809474   2.2965057
 -0.19865383  1.1043419   0.8857479  -1.2602749   0.74213815  0.02746249
 -0.85240495  2.7085278  -0.5600649  -0.04280163 -0.92991936  0.41168886
 -0.5439028   0.4707479  -0.5980841   0.28451464  1.1783094  -0.4997206
 -0.45165917  0.8025407  -2.1046035   0.6293314  -1.702422    0.19345719
  0.79553413 -0.3079303   0.5627185  -0.8282605  -0.6337152   0.69203264
 -2.5596092   2.0322492   0.521643   -0.06507634 -1.4132161   1.7050178
 -2.25428    -1.2045457  -1.1985768   0.24758787  1.1664797  -0.40271232
  1.7382778  -0.14926419  2.1421995   2.252167    1.787789    1.7050977
  0.65031564 -1.2845238   0.52689946  0.01537131  0.9466607  -0.7311194
  1.0215     -3.6198084   1.2724478   3.3253195   0.523588

We can also ask the model to give us the words that are most similar to a word we're interested in:

In [7]:
for word, similarity in w2v.wv.most_similar("camera"):
    print(f"similarity('camera', {repr(word)}) = {similarity:.4f}")

similarity('camera', 'dslr') = 0.7348
similarity('camera', 'slr') = 0.6980
similarity('camera', 'nikon') = 0.6917
similarity('camera', 'canon') = 0.6808
similarity('camera', 'camer') = 0.6659
similarity('camera', 'camcord') = 0.6652
similarity('camera', 'shoot') = 0.6399
similarity('camera', 'len') = 0.6348
similarity('camera', 'dlsr') = 0.6328
similarity('camera', 'olympu') = 0.6226
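
The scores reported by `most_similar` are cosine similarities.  Conceptually, the lookup is just "compare the query vector to every other vector and sort."  Here's a plain-Python sketch of that idea, using a handful of hypothetical 3-dimensional vectors rather than our trained model (Gensim's real implementation is vectorized and much faster):

```python
import math

# Hypothetical vectors, invented for illustration.
vectors = {
    "camera": [0.9, 0.8, 0.1],
    "dslr": [0.8, 0.9, 0.2],
    "lens": [0.6, 0.7, 0.5],
    "cable": [0.1, 0.2, 0.9],
}


def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def most_similar(word, topn=2):
    """Rank every other word by cosine similarity to `word`."""
    query = vectors[word]
    scored = [
        (other, cosine(query, vec))
        for other, vec in vectors.items()
        if other != word
    ]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:topn]


for other, score in most_similar("camera", topn=3):
    print(f"similarity('camera', {other!r}) = {score:.4f}")
```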


And, we can vectorize each document by just taking the element-wise sum of the vectors for each word within it.  This is the representation we'll use for building this final version of our score predictor.

Note: there are a few things that might go wrong here.  Word2Vec, by default, filters out words below a certain frequency threshold (`min_count=5`: anything appearing fewer than 5 times in the training corpus).  This can be changed, but I've opted not to change it for this notebook.  So, when we're vectorizing a document, we need to make sure we:
1. Preprocess it the same way we preprocessed the texts for Word2Vec.
2. Do something with words that aren't in the word2vec model.  Since we'll be summing up word vectors, we can just treat these as being all-zero, or skip them.
3. We probably want to filter out any documents that have no words that show up in the word2vec model.
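
Points 2 and 3 can be sketched with a plain dictionary standing in for the trained model.  The embedding table and the `vectorize` helper below are hypothetical, just to show the logic.  Because we're summing, skipping an unknown word gives the same result as adding an all-zero vector for it:

```python
# Hypothetical 2-dimensional embedding table, standing in for a trained model.
embeddings = {
    "camera": [1.0, 0.5],
    "great": [0.25, 2.0],
}


def vectorize(tokens, embeddings):
    """Sum the vectors of known tokens; return None if no token is known."""
    known = [embeddings[t] for t in tokens if t in embeddings]
    if not known:
        return None  # nothing to sum: the caller can drop this document
    return [sum(values) for values in zip(*known)]


print(vectorize(["great", "camera", "zzz"], embeddings))  # [1.25, 2.5]
print(vectorize(["zzz", "qqq"], embeddings))  # None
```

Returning `None` for documents with no known words makes them easy to filter out before training a classifier.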

Since the vectorization step might take some time, we'll do all our filtering before we get there.  This might mean we end up with a slight imbalance in our classes, but it shouldn't be big enough to be a problem.

In [8]:
import numpy as np

def vectorize_document(text, w2v):
    text = np.array([
        # w2v.wv stores the word-vector mapping
        w2v.wv[i]
        if i in w2v.wv
        else np.zeros(w2v.wv.vector_size) # all-zero vector for unknown words
        for i in text
    ])
    text = np.sum(text, axis=0)
    return text

# Filter short documents and 3-star reviews
reviews = reviews[~pd.isnull(reviews["overall"])]
reviews = reviews[[len(i) >= 20 for i in reviews["Preprocessed"]]]

# Resample
reviews = reviews.groupby("overall").sample(10_000, replace=False)

# Vectorize
vectors = np.array([
    vectorize_document(i, w2v) 
    for i in tqdm(reviews["Preprocessed"])
])
targets = reviews["overall"].astype(int)

  0%|          | 0/20000 [00:00<?, ?it/s]

In [9]:
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import LinearSVC
from sklearn import metrics

train_x, test_x, train_y, test_y = train_test_split(
    vectors,
    targets,
    train_size=0.8,
    stratify=targets,
    random_state=0,
)

clf = Pipeline([
    ("scaler", StandardScaler()),
    ("classifier", LinearSVC(random_state=0, max_iter=2000))
])

# Just to show how long this can take, I'll use a
# %time "magic command" in Jupyter.  LinearSVC models
# are usually very fast, but this is a *lot* of data.
%time clf.fit(train_x, train_y)
preds = clf.predict(test_x)
print(metrics.f1_score(test_y, preds))

CPU times: total: 18.6 s
Wall time: 18.6 s
0.858422493157502


