# Word Embeddings

Word embeddings were all the rage from about 2013 (when the `word2vec` algorithm was published) through 2018 (when BERT was published and changed the entire field overnight).  Word vectors tend to have a lot of advantages over bag of words models:

- They encode texts _densely._  This tends to mean less chance of over-fitting a model, and that most models will train far more quickly.
- They can encode _degrees of similarity_ between words.  Bag-of-words treats all word features as orthogonal ("cat" and "dog" are as different as "cat" and "cats").  Word embeddings project these words into a lower-dimensional space where similar words end up close together.
- They are _pretrained lookup tables_ that map words to fixed-length, dense vectors.  This means they can be trained once and re-used many times--a sort of primitive transfer learning.
- They can learn about the _co-occurrence patterns_ of words, which ends up being a _very_ good proxy for the "meaning" of words.  And since this is derived purely from the data you provide, you can easily learn domain-specific or context-specific representations.
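
To make the "degrees of similarity" point concrete, here is a minimal sketch comparing one-hot (bag-of-words style) features with dense vectors.  The three-dimensional dense vectors are made up for illustration and do not come from any trained model.

```python
import numpy as np

def cosine(u, v):
    """Cosine similarity between two 1-d vectors."""
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))

# Bag-of-words view: one dimension per vocabulary item (one-hot).
vocab = ["cat", "cats", "dog"]
one_hot = {w: np.eye(len(vocab))[i] for i, w in enumerate(vocab)}

# Dense view: hypothetical 3-d embeddings, NOT from a trained model.
dense = {
    "cat":  np.array([0.9, 0.1, 0.3]),
    "cats": np.array([0.8, 0.2, 0.3]),
    "dog":  np.array([0.1, 0.9, 0.3]),
}

# every pair of distinct one-hot vectors is orthogonal
print(cosine(one_hot["cat"], one_hot["cats"]))  # 0.0
print(cosine(one_hot["cat"], one_hot["dog"]))   # 0.0

# dense vectors can express "more similar" versus "less similar"
print(cosine(dense["cat"], dense["cats"]) > cosine(dense["cat"], dense["dog"]))  # True
```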

However, this training process can take some time if you have a lot of data. And you usually need a decent amount of data to train these in the first place--a good rule of thumb is "about a million" words is where it starts making sense to train your own vectors.
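
A quick way to check a corpus against that rule of thumb is simply to count tokens.  The sketch below assumes plain whitespace tokenization, which is only a rough estimate; `count_tokens` and the tiny `corpus` are hypothetical names used for illustration.

```python
def count_tokens(docs):
    """Total number of whitespace-separated tokens across an iterable of strings."""
    return sum(len(doc.split()) for doc in docs)

# hypothetical mini-corpus; in practice this would be your full text column
corpus = ["Arrived broken.", "The product is junk.", "bubble"]

n_tokens = count_tokens(corpus)
ROUGH_THRESHOLD = 1_000_000  # the "about a million words" rule of thumb

print(n_tokens)                     # 7
print(n_tokens >= ROUGH_THRESHOLD)  # False
```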

There are several algorithms for word embeddings, but the biggest by far are:
- Latent Semantic Analysis (aka Latent Semantic Indexing): build a document-term matrix, then run Singular Value Decomposition to reduce it down to fewer dimensions.  Fast, generally works well, but doesn't encode as much information as the other options.
- Word2Vec: learns to encode words based on what other words they occur around.  Can be easily run on moderately sized datasets and can learn very high-quality vectors.
- Global Vectors (GloVe): technically different from Word2Vec but most of that is implementation details.  There's rarely any meaningful difference in the quality of the vectors between GloVe and Word2Vec.
- FastText: FastText is designed to capture sub-word information (e.g., morphology).  It can theoretically produce good representations of brand new out-of-vocabulary items.  In practice it tends to perform pretty similarly to Word2Vec/GloVe for English, but FastText can work better for languages with lots of inflectional information (e.g., Turkish).
- Having an embedding layer as part of a neural network that learns good embeddings for your specific task.  (This is basically a fancy way of training your own vectors, but it couples the vector training much more tightly and explicitly to your particular task.)
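
As a rough illustration of the first option in the list (Latent Semantic Analysis), here is a minimal NumPy sketch on a made-up three-document corpus.  It uses raw counts and a plain SVD; real pipelines usually add TF-IDF weighting and a sparse, truncated solver, so treat this as a toy under those assumptions.

```python
import numpy as np

# hypothetical toy corpus of already-tokenized documents
docs = [
    ["cat", "sat", "mat"],
    ["dog", "sat", "mat"],
    ["cat", "chased", "dog"],
]
vocab = sorted({tok for doc in docs for tok in doc})
index = {tok: i for i, tok in enumerate(vocab)}

# document-term count matrix: rows = documents, columns = terms
X = np.zeros((len(docs), len(vocab)))
for row, doc in enumerate(docs):
    for tok in doc:
        X[row, index[tok]] += 1

# singular value decomposition: X = U @ diag(S) @ Vt
U, S, Vt = np.linalg.svd(X, full_matrices=False)

# keep the top-k components; each term gets a k-dimensional vector
k = 2
word_vectors = Vt[:k].T * S[:k]

print(word_vectors.shape)  # (5, 2)
```

Terms that occur in exactly the same documents ("sat" and "mat" here) end up with identical vectors, which is the co-occurrence idea in its simplest form.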

In this notebook we'll use the same data and prediction task as the Bag of Words notebook--predict the number of stars from an Amazon review--but this time using a pretty simple neural network.  (Word embeddings plus neural networks, random forests, XGBoost, or the like is usually a very good combination!)

There are two main ways to use word vectors:

- Convert each document into a 2D array (one row per word, one column per embedding dimension), then use something like a Convolutional Neural Network or LSTM.  This tends to provide a lot of very rich information, since these kinds of models can learn something about the contextual representation of each word.
- Convert each document into a single vector by summing or averaging the individual word vectors.  This is usually a good "first draft" since it's easy and lightens the computational load on your model.
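
The two ways of using word vectors described above can be sketched with a tiny lookup table.  The table, its 4-dimensional random vectors, and the helper names `doc_to_matrix` and `doc_to_vector` are all hypothetical.

```python
import numpy as np

# hypothetical 4-dimensional lookup table with random (untrained) vectors
rng = np.random.default_rng(0)
table = {w: rng.normal(size=4) for w in ["arrived", "broken", "great", "chair"]}

def doc_to_matrix(tokens, table, max_len):
    """One row per token, truncated or zero-padded to max_len rows."""
    dim = len(next(iter(table.values())))
    out = np.zeros((max_len, dim))
    for i, tok in enumerate(tokens[:max_len]):
        out[i] = table[tok]
    return out

def doc_to_vector(tokens, table):
    """Average the token vectors into a single fixed-length vector."""
    return np.mean([table[tok] for tok in tokens], axis=0)

doc = ["arrived", "broken"]
print(doc_to_matrix(doc, table, max_len=5).shape)  # (5, 4)
print(doc_to_vector(doc, table).shape)             # (4,)
```

The padded 2D form keeps word order, which sequence models can exploit.  The averaged form throws order away but gives every document the same small shape, which is what the rest of this notebook uses.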

# Using pre-trained word vectors

Most of the time, you can grab someone else's pre-trained vectors and just use those.  spaCy has such vectors ready to go for us, but the small `*_sm` models don't ship with them, so we need one of the larger models; here we'll use `en_core_web_lg`.

In [1]:
# requirements
# !conda install --yes tqdm pandas scikit-learn spacy
# !python -m spacy download en_core_web_lg

# NOTE: replace this next line with the platform-specific instructions
# for your system from: https://pytorch.org/get-started/locally/
# !conda install pytorch torchvision torchaudio pytorch-cuda=11.7 -c pytorch -c nvidia

In [2]:
# tqdm is a magic library that gives you progress bars when iterating
# through things.
from tqdm.notebook import tqdm

# register tqdm with pandas so we can get .progress_apply() method
# added to dataframes.  This is a version of pd.DataFrame.apply()
# but now it prints a progress bar!
tqdm.pandas(smoothing=0)

In [3]:
import pandas as pd

# load the data
train = pd.read_csv("../../data/train.csv")
test = pd.read_csv("../../data/test.csv")
val = pd.read_csv("../../data/validation.csv")

train.head()

Unnamed: 0,review_id,product_id,reviewer_id,stars,review_body,review_title,language,product_category
0,en_0964290,product_en_0740675,reviewer_en_0342986,1,Arrived broken. Manufacturer defect. Two of th...,I'll spend twice the amount of time boxing up ...,en,furniture
1,en_0690095,product_en_0440378,reviewer_en_0133349,1,the cabinet dot were all detached from backing...,Not use able,en,home_improvement
2,en_0311558,product_en_0399702,reviewer_en_0152034,1,I received my first order of this product and ...,The product is junk.,en,home
3,en_0044972,product_en_0444063,reviewer_en_0656967,1,This product is a piece of shit. Do not buy. D...,Fucking waste of money,en,wireless
4,en_0784379,product_en_0139353,reviewer_en_0757638,1,went through 3 in one day doesn't fit correct ...,bubble,en,pc


In [4]:
import spacy
nlp = spacy.load("en_core_web_lg")

In [5]:
import numpy as np

def text2array(docs, nlp):
    # if all we need is vectors, we can use nlp.make_doc()
    # for a pretty big speedup.  This will basically just do
    # tokenization, case normalization, and vector lookups.
    return np.array([
        i.vector
        for i in map(nlp.make_doc, tqdm(docs, smoothing=0))
    ])

train_docs = text2array(train["review_body"], nlp)
print(train_docs[0])

  0%|          | 0/200000 [00:00<?, ?it/s]

[ 1.02563680e-03  1.53662562e-01 -1.67447686e-01 -7.95046240e-02
  6.34283647e-02 -1.13214003e-02 -1.16581870e-02 -1.14358470e-01
 -4.93131531e-03  2.15048170e+00 -1.38603300e-01  9.16635990e-02
  1.21868968e-01 -4.79591042e-02 -1.14931360e-01 -6.06204830e-02
 -6.53301701e-02  1.24221015e+00 -1.92374930e-01  1.43992379e-02
  3.97625677e-02 -2.04175301e-02 -5.10083996e-02 -2.18137261e-02
 -2.08333917e-02  1.33769875e-02 -1.01538457e-01 -1.31367564e-01
  3.57617773e-02 -8.36472362e-02 -8.32881927e-02  8.36016834e-02
 -3.25288437e-02  9.63581651e-02  6.59466237e-02 -5.85295893e-02
  1.67492833e-02  7.65056163e-02 -4.94824238e-02 -1.13120914e-01
  4.69577452e-03  1.08310066e-01  2.43635625e-02 -6.97399229e-02
  4.36965227e-02  4.03517336e-02 -1.68145776e-01 -1.90215465e-02
 -3.84692363e-02  2.51397491e-02 -1.24465981e-02 -1.72056016e-02
  5.86093497e-03 -2.24877838e-02  3.24918181e-02 -1.33547625e-02
 -2.76379548e-02 -1.10679857e-01 -8.30884837e-03 -4.57795300e-02
 -4.39137295e-02 -2.24953

...and that's it.  Now let's just do the same to the test and validation sets, then throw a small multi-layer perceptron at it.

In [6]:
test_docs = text2array(test["review_body"], nlp)
val_docs = text2array(val["review_body"], nlp)

  0%|          | 0/5000 [00:00<?, ?it/s]

  0%|          | 0/5000 [00:00<?, ?it/s]

# PyTorch model

We'll use PyTorch to build a pretty basic multi-layer perceptron with three hidden layers, of size 128, 64, and 64.  If you're not familiar with PyTorch, don't worry about the code in the next few cells; PyTorch gives you a lot more control over the nitty-gritty details of the training process, but none of the steps are especially complex.
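
To get a feel for how small this network is, here is a back-of-the-envelope parameter count.  It assumes the layout used in the model cell further down (300-dimensional input, hidden layers of 128, 64, and 64, five output classes, and a batch-norm layer on the input with a learnable scale and shift per feature); `linear_params` is a hypothetical helper.

```python
def linear_params(n_in, n_out):
    """Weights plus biases of one fully connected layer."""
    return n_in * n_out + n_out

# input -> hidden -> hidden -> hidden -> output
layer_sizes = [300, 128, 64, 64, 5]

linear_total = sum(
    linear_params(n_in, n_out)
    for n_in, n_out in zip(layer_sizes, layer_sizes[1:])
)
# batch norm on the input: one learnable scale and one shift per feature
batchnorm_total = 2 * layer_sizes[0]

print(linear_total)                    # 51269
print(batchnorm_total)                 # 600
print(linear_total + batchnorm_total)  # 51869
```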

In [7]:
import torch
from torch.utils.data import TensorDataset, DataLoader

# We'll want our data to be in a particular format to ease the
# batching logic.
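def prepare_dataset(x, y):
    x = torch.Tensor(x)
    # one-hot encode the star ratings.  Ask for a float dtype explicitly,
    # since newer pandas versions return boolean dummy columns by default.
    y = torch.Tensor(pd.get_dummies(y["stars"], dtype="float32").values)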
    return DataLoader(
        TensorDataset(x, y),
        # this dataset isn't very sensitive to batch size, so let's
        # pick a big one to let us iterate more quickly.
        batch_size=512,
        shuffle=True,
    )

train_torch = prepare_dataset(train_docs, train)
test_torch = prepare_dataset(test_docs, test)
val_torch = prepare_dataset(val_docs, val)

In [8]:
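# use gpu if available, otherwise fall back to the cpu so that
# DEVICE is always defined for the cells below
if torch.cuda.is_available():
    DEVICE = torch.device("cuda")
else:
    DEVICE = torch.device("cpu")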

In [9]:
from sklearn.metrics import f1_score

# training loop
def training_loop(
    model,
    training_dataset,
    validation_dataset,
):
    loss_fn = torch.nn.CrossEntropyLoss()
    opt = torch.optim.Adam(model.parameters(), lr=1e-3)
    
    # trackers for early stopping--5 rounds with no improvement
    # before stopping.
    best_val_loss = np.inf
    early_stopping_counter = 0
    
    for epoch in range(10):
        for (x, y) in tqdm(training_dataset, desc=f"Epoch {epoch}"):
            # DEVICE is read from outer/global scope; defined in
            # previous cell
            x = x.to(DEVICE)
            y = y.to(DEVICE)
            
            # learn from batch
            opt.zero_grad()
            preds = model(x)
            loss = loss_fn(preds, y)
            loss.backward()
            opt.step()
            
        # validate after each epoch
        model.eval()
        preds = []
        ys = []
        # no gradients are needed for validation
        with torch.no_grad():
            for (x, y) in validation_dataset:
                x = x.to(DEVICE)
                preds.append(model(x).to("cpu"))
                ys.append(y)
        model.train()
        preds = torch.cat(preds)
        ys = torch.cat(ys)
        # mean cross-entropy over the whole validation set
        val_loss = loss_fn(preds, ys).item()

        preds = torch.argmax(preds, axis=1).numpy()
        ys = torch.argmax(ys, axis=1).numpy()
        val_acc = np.mean(preds == ys)
        # scikit-learn's argument order is f1_score(y_true, y_pred)
        val_f1 = f1_score(ys, preds, average="macro")
        print(f"{val_loss=:.6f} - {val_acc=:.3%} - {val_f1=:.6f}")
        
        # check early stopping criteria
        if val_loss < best_val_loss:
            best_val_loss = val_loss
            early_stopping_counter = 0
        else:
            early_stopping_counter += 1
            
        if early_stopping_counter >= 5:
            break
    return model

In [10]:
# model specification
model = torch.nn.Sequential(
    torch.nn.BatchNorm1d(300),
    torch.nn.Linear(300, 128),
    torch.nn.ReLU(),
    torch.nn.Linear(128, 64),
    torch.nn.ReLU(),
    torch.nn.Linear(64, 64),
    torch.nn.ReLU(),
    torch.nn.Linear(64, 5),
)
model.to(DEVICE)
model = training_loop(model, train_torch, val_torch)

Epoch 0:   0%|          | 0/391 [00:00<?, ?it/s]

val_loss=1.182499 - val_acc=48.560% - val_f1=0.477031


Epoch 1:   0%|          | 0/391 [00:00<?, ?it/s]

val_loss=1.174276 - val_acc=48.660% - val_f1=0.481182


Epoch 2:   0%|          | 0/391 [00:00<?, ?it/s]

val_loss=1.163337 - val_acc=49.440% - val_f1=0.483215


Epoch 3:   0%|          | 0/391 [00:00<?, ?it/s]

val_loss=1.158106 - val_acc=49.660% - val_f1=0.485795


Epoch 4:   0%|          | 0/391 [00:00<?, ?it/s]

val_loss=1.154734 - val_acc=49.820% - val_f1=0.482703


Epoch 5:   0%|          | 0/391 [00:00<?, ?it/s]

val_loss=1.153258 - val_acc=49.820% - val_f1=0.483122


Epoch 6:   0%|          | 0/391 [00:00<?, ?it/s]

val_loss=1.156546 - val_acc=49.960% - val_f1=0.495740


Epoch 7:   0%|          | 0/391 [00:00<?, ?it/s]

val_loss=1.156823 - val_acc=50.040% - val_f1=0.496198


Epoch 8:   0%|          | 0/391 [00:00<?, ?it/s]

val_loss=1.156470 - val_acc=49.880% - val_f1=0.495189


Epoch 9:   0%|          | 0/391 [00:00<?, ?it/s]

val_loss=1.158228 - val_acc=50.060% - val_f1=0.495088


In [11]:
# final performance metrics
# NOTE: this loop scores val_torch (the validation set); pass test_torch
# instead to score the held-out test set.
model.eval()
preds = []
ys = []
for (x, y) in val_torch:
    x = x.to(DEVICE)
    y = y.to(DEVICE)
    preds.append(model(x).to("cpu").detach().numpy())
    ys.append(y.to("cpu"))
preds = np.argmax(np.vstack(preds), axis=1)
ys = np.argmax(np.vstack(ys), axis=1)

acc = np.mean(preds == ys)
# scikit-learn's argument order is f1_score(y_true, y_pred)
f1 = f1_score(ys, preds, average="macro")
print(f"Final validation set scores: accuracy={acc:.2%}, f1={f1:.4f}")

Final validation set scores: accuracy=50.06%, f1=0.4951


# Train your own word vectors with Gensim

One of the things Gensim is designed for is fast implementations of Word2Vec, FastText, and a few other algorithms.  Let's train our own Word2Vec vectors.  To do this, we just need to get our data into a list of lists of strings (i.e., a list of tokenized documents).  We'll use spaCy to do our tokenization and string normalization, but we won't lemmatize our texts.

For word vectors, we usually want to keep stopwords and such in--they can actually provide really useful co-occurrence information for the embedding algorithm to learn from.

In [12]:
def tokenize(s, nlp):
    return [tok.lower_ for tok in nlp.make_doc(s)]

# we could parallelize this for even more speed, but...eh.
docs = pd.concat((
    train["review_body"],
    test["review_body"],
    val["review_body"]
)).progress_apply(tokenize, nlp=nlp)
docs = list(docs)
print(docs[0])

  0%|          | 0/210000 [00:00<?, ?it/s]

['arrived', 'broken', '.', 'manufacturer', 'defect', '.', 'two', 'of', 'the', 'legs', 'of', 'the', 'base', 'were', 'not', 'completely', 'formed', ',', 'so', 'there', 'was', 'no', 'way', 'to', 'insert', 'the', 'casters', '.', 'i', 'unpackaged', 'the', 'entire', 'chair', 'and', 'hardware', 'before', 'noticing', 'this', '.', 'so', ',', 'i', "'ll", 'spend', 'twice', 'the', 'amount', 'of', 'time', 'boxing', 'up', 'the', 'whole', 'useless', 'thing', 'and', 'send', 'it', 'back', 'with', 'a', '1', '-', 'star', 'review', 'of', 'part', 'of', 'a', 'chair', 'i', 'never', 'got', 'to', 'sit', 'in', '.', 'i', 'will', 'go', 'so', 'far', 'as', 'to', 'include', 'a', 'picture', 'of', 'what', 'their', 'injection', 'molding', 'and', 'quality', 'assurance', 'process', 'missed', 'though', '.', 'i', 'will', 'be', 'hesitant', 'to', 'buy', 'again', '.', 'it', 'makes', 'me', 'wonder', 'if', 'there', 'are', "n't", 'missing', 'structures', 'and', 'supports', 'that', 'do', "n't", 'impede', 'the', 'assembly', 'proce

A quick digression: your choice of preprocessing matters, just like for bag of words models.  E.g.: sometimes you might know that verb tense, or singular versus plural, are generally important.  For really, really large datasets, doing gentler preprocessing is generally a good idea, but here we'll just keep the code simple.  Just make sure you do the same preprocessing to your text before applying your vectors!
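
Here is a small sketch of why the "same preprocessing" warning matters.  It assumes vectors were trained on lowercased tokens; `normalize` is a hypothetical stand-in for whatever preprocessing was used at training time, and `trained_vocab` stands in for the keys of a trained lookup table.

```python
def normalize(text):
    """Hypothetical training-time preprocessing: lowercase, then split on whitespace."""
    return text.lower().split()

# pretend these are the words a trained model knows about
trained_vocab = {"arrived", "broken", "chair"}

text = "Arrived BROKEN"

# skipping the normalization step: nothing is found in the vocabulary
raw_hits = [tok in trained_vocab for tok in text.split()]
# applying the same normalization as in training: every token is found
normalized_hits = [tok in trained_vocab for tok in normalize(text)]

print(raw_hits)         # [False, False]
print(normalized_hits)  # [True, True]
```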

There's also some advice out there that Word2Vec should be trained on _sentences,_ not _documents,_ but we're just going to ignore that.  It doesn't always matter very much in practice.

Now, we just throw the tokenized text into the Gensim Word2Vec model.
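
To build some intuition for what the skip-gram variant learns from, here is a simplified sketch of how (center, context) training pairs can be generated from one tokenized document.  It is not Gensim's implementation: it leaves out details such as subsampling of frequent words and negative sampling, and `skipgram_pairs` is a hypothetical helper.

```python
def skipgram_pairs(tokens, window=2):
    """Yield (center, context) pairs for every token within `window` positions."""
    for i, center in enumerate(tokens):
        lo = max(0, i - window)
        hi = min(len(tokens), i + window + 1)
        for j in range(lo, hi):
            if j != i:
                yield center, tokens[j]

pairs = list(skipgram_pairs(["arrived", "broken", "."], window=1))
print(pairs)
# [('arrived', 'broken'), ('broken', 'arrived'), ('broken', '.'), ('.', 'broken')]
```

Words that show up in similar contexts produce similar sets of pairs, which is what pushes their vectors toward each other during training.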

In [13]:
from gensim.models.word2vec import Word2Vec

class Corpus:
    """a class that prints progress as you iterate through it.
    like tqdm, but it properly re-initializes the progress bar
    after each run through.  I like to use this in place of 
    once-per-epoch callbacks in Gensim since this provides
    more constant and real-time feedback."""
    def __init__(self, it):
        self.it = it
        self.n = 1
        
    def __iter__(self):
        yield from tqdm(
            self.it,
            unit_scale=True,
            desc=f"Pass {self.n}",
            smoothing=0,
        )
        self.n += 1
        
    def __len__(self):
        return len(self.it)

# there are a LOT of Word2Vec parameters we could tweak, but
# we'll leave most of them at their default values.
w2v = Word2Vec(
    Corpus(docs),
    vector_size=300,
    # number of parallel CPU threads to use--set this lower or higher
    # depending on your system.
    workers=10,
    # sg=1 to use skip-gram, sg=0 to use CBOW
    sg=1,
    # hs=1 to use hierarchical softmax, 0 for negative sampling.
    hs=0,
    # ignore words with fewer than this many total occurrences
    min_count=5,
    # number of passes over the corpus.  More epochs --> better vectors,
    # usually, but with diminishing returns.  25 is probably overkill
    # for this data but we're gonna do it anyways.
    epochs=25,
)

Pass 1:   0%|          | 0.00/210k [00:00<?, ?it/s]

Pass 2:   0%|          | 0.00/210k [00:00<?, ?it/s]

Pass 3:   0%|          | 0.00/210k [00:00<?, ?it/s]

Pass 4:   0%|          | 0.00/210k [00:00<?, ?it/s]

Pass 5:   0%|          | 0.00/210k [00:00<?, ?it/s]

Pass 6:   0%|          | 0.00/210k [00:00<?, ?it/s]

Pass 7:   0%|          | 0.00/210k [00:00<?, ?it/s]

Pass 8:   0%|          | 0.00/210k [00:00<?, ?it/s]

Pass 9:   0%|          | 0.00/210k [00:00<?, ?it/s]

Pass 10:   0%|          | 0.00/210k [00:00<?, ?it/s]

Pass 11:   0%|          | 0.00/210k [00:00<?, ?it/s]

Pass 12:   0%|          | 0.00/210k [00:00<?, ?it/s]

Pass 13:   0%|          | 0.00/210k [00:00<?, ?it/s]

Pass 14:   0%|          | 0.00/210k [00:00<?, ?it/s]

Pass 15:   0%|          | 0.00/210k [00:00<?, ?it/s]

Pass 16:   0%|          | 0.00/210k [00:00<?, ?it/s]

Pass 17:   0%|          | 0.00/210k [00:00<?, ?it/s]

Pass 18:   0%|          | 0.00/210k [00:00<?, ?it/s]

Pass 19:   0%|          | 0.00/210k [00:00<?, ?it/s]

Pass 20:   0%|          | 0.00/210k [00:00<?, ?it/s]

Pass 21:   0%|          | 0.00/210k [00:00<?, ?it/s]

Pass 22:   0%|          | 0.00/210k [00:00<?, ?it/s]

Pass 23:   0%|          | 0.00/210k [00:00<?, ?it/s]

Pass 24:   0%|          | 0.00/210k [00:00<?, ?it/s]

Pass 25:   0%|          | 0.00/210k [00:00<?, ?it/s]

Pass 26:   0%|          | 0.00/210k [00:00<?, ?it/s]

In [14]:
# Now to get word vectors out, just index into w2v.wv!
print(w2v.wv["arrive"])

[ 6.37350380e-01 -2.24434271e-01 -1.74850196e-01 -2.67496705e-01
  3.61776263e-01 -1.64233074e-01 -2.14357302e-01  2.67212898e-01
 -3.52885693e-01  5.56278646e-01  7.74217620e-02 -3.47563364e-02
 -7.70093799e-02 -4.31601435e-01  6.47711754e-02 -1.04609817e-01
 -4.85226996e-02  2.40687534e-01  2.02059209e-01  1.61600158e-01
  4.61019456e-01  3.08313966e-01 -1.41286567e-01  1.62408158e-01
 -1.55187503e-01  6.64419115e-01 -2.24275072e-03  4.54806477e-01
 -1.90380186e-01 -4.09426123e-01 -1.70356572e-01 -3.28126967e-01
  2.58329033e-04 -6.68878406e-02  2.92585306e-02 -1.12499088e-01
  2.53223404e-02  2.41018832e-01  2.34994560e-01  2.32210174e-01
  1.69653460e-01 -1.63170859e-01  1.98978186e-01  2.54277792e-03
  1.79365668e-02  1.37414619e-01 -3.17285091e-01  3.05866361e-01
 -5.12970030e-01 -2.61275142e-01 -9.35102701e-02  1.79125428e-01
 -2.10391685e-01 -1.50883906e-02  4.40478623e-01 -1.41883299e-01
 -1.85580269e-01  2.74730563e-01 -5.63141644e-01 -2.12939680e-02
  3.02858710e-01  1.40549

In [15]:
# preprocess our texts, apply the word-level vectorization,
# and represent each document as just the elementwise mean
# of its word vectors.

def vectorize(df, vectors):
    # we'll be fancy and do this with generators
    texts = df["review_body"].progress_apply(tokenize, nlp=nlp)
    
    # remove any words that didn't make it into the word2vec vocab
    texts = (
        [tok for tok in doc if tok in vectors.wv.key_to_index]
        for doc in texts
    )
    
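    # get vectors and stack them into a single array.  Use the model
    # passed in as `vectors` rather than the global w2v; documents with
    # no in-vocabulary words fall back to an all-zero vector.
    texts = np.array([
        np.mean(vectors.wv[doc], axis=0)
        if len(doc) > 0
        else np.zeros(vectors.wv.vector_size)
        for doc in texts
    ])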
    
    return texts

train_docs = vectorize(train, w2v)
test_docs = vectorize(test, w2v)
val_docs = vectorize(val, w2v)

  0%|          | 0/200000 [00:00<?, ?it/s]

  0%|          | 0/5000 [00:00<?, ?it/s]

  0%|          | 0/5000 [00:00<?, ?it/s]

Note that if a document didn't have any words that made it into the Word2Vec vectors, we're just giving that document an all-zero vector rather than dropping it. 
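
The fallback described above can be seen in isolation with a tiny lookup table.  The 2-dimensional vectors and the `mean_vector` helper are hypothetical; the point is only that every document yields a row, so features stay aligned with their labels.

```python
import numpy as np

def mean_vector(tokens, table, dim):
    """Average the vectors of known tokens; all zeros if none are known."""
    known = [table[tok] for tok in tokens if tok in table]
    if not known:
        return np.zeros(dim)
    return np.mean(known, axis=0)

# hypothetical 2-d lookup table
table = {"good": np.array([1.0, 0.0]), "chair": np.array([0.0, 1.0])}

print(mean_vector(["good", "chair", "zzz"], table, dim=2))  # [0.5 0.5]
print(mean_vector(["zzz"], table, dim=2))                   # [0. 0.]
```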

Now, we can just throw this same data into our neural network from before and see how it goes.  (I'm going to be bad and just copy-paste from before).

In [16]:
train_torch = prepare_dataset(train_docs, train)
test_torch = prepare_dataset(test_docs, test)
val_torch = prepare_dataset(val_docs, val)

# model specification
model = torch.nn.Sequential(
    torch.nn.BatchNorm1d(300),
    torch.nn.Linear(300, 128),
    torch.nn.ReLU(),
    torch.nn.Linear(128, 64),
    torch.nn.ReLU(),
    torch.nn.Linear(64, 64),
    torch.nn.ReLU(),
    torch.nn.Linear(64, 5),
)
model.to(DEVICE)
model = training_loop(model, train_torch, val_torch)

# final performance metrics
# NOTE: this loop scores val_torch (the validation set); pass test_torch
# instead to score the held-out test set.
model.eval()
preds = []
ys = []
for (x, y) in val_torch:
    x = x.to(DEVICE)
    y = y.to(DEVICE)
    preds.append(model(x).to("cpu").detach().numpy())
    ys.append(y.to("cpu"))
preds = np.argmax(np.vstack(preds), axis=1)
ys = np.argmax(np.vstack(ys), axis=1)

acc = np.mean(preds == ys)
# scikit-learn's argument order is f1_score(y_true, y_pred)
f1 = f1_score(ys, preds, average="macro")
print(f"Final validation set scores: accuracy={acc:.2%}, f1={f1:.4f}")

Epoch 0:   0%|          | 0/391 [00:00<?, ?it/s]

val_loss=1.145728 - val_acc=50.700% - val_f1=0.505186


Epoch 1:   0%|          | 0/391 [00:00<?, ?it/s]

val_loss=1.128368 - val_acc=50.640% - val_f1=0.493484


Epoch 2:   0%|          | 0/391 [00:00<?, ?it/s]

val_loss=1.123204 - val_acc=50.800% - val_f1=0.500298


Epoch 3:   0%|          | 0/391 [00:00<?, ?it/s]

val_loss=1.111543 - val_acc=51.540% - val_f1=0.508853


Epoch 4:   0%|          | 0/391 [00:00<?, ?it/s]

val_loss=1.110667 - val_acc=51.820% - val_f1=0.511567


Epoch 5:   0%|          | 0/391 [00:00<?, ?it/s]

val_loss=1.120491 - val_acc=51.680% - val_f1=0.511343


Epoch 6:   0%|          | 0/391 [00:00<?, ?it/s]

val_loss=1.119671 - val_acc=51.680% - val_f1=0.512154


Epoch 7:   0%|          | 0/391 [00:00<?, ?it/s]

val_loss=1.122806 - val_acc=51.160% - val_f1=0.507976


Epoch 8:   0%|          | 0/391 [00:00<?, ?it/s]

val_loss=1.120328 - val_acc=51.340% - val_f1=0.507619


Epoch 9:   0%|          | 0/391 [00:00<?, ?it/s]

val_loss=1.139545 - val_acc=50.480% - val_f1=0.501121
Final validation set scores: accuracy=50.48%, f1=0.5011


The difference here is pretty small compared to using spaCy's pre-trained vectors--accuracy goes from 50.06% to 50.48% and macro F1 from 0.4951 to 0.5011--but we also didn't bother to do any parameter tuning for the Word2Vec algorithm itself (which we would/should do in a serious production use case).  And our language is pretty "normal" English.  If we were working in a really specialized or unusual linguistic domain, the difference might be a lot bigger.  Plus, the spaCy developers train their vectors against a _huge_ corpus of text (specifically Common Crawl), so it's going to be hard to blow them out of the water with just a few hundred thousand short reviews.