In [1]:
import nltk

nltk.download('brown')
nltk.download('stopwords')

[nltk_data] Downloading package brown to /home/alex/nltk_data...
[nltk_data]   Package brown is already up-to-date!
[nltk_data] Downloading package stopwords to /home/alex/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


True

# Word Embeddings

Word embeddings are part of distributional semantics, similar to the topic models we discussed in the previous chapter. Unlike topic models, word embeddings do not work on term-document relationships. Instead, word embeddings work with smaller contexts like sentences or subsequences of tokens in a sentence.

Word embeddings are a rapidly evolving family of techniques. The most popular technique, Word2vec, was developed in 2013 by Tomas Mikolov et al. at Google. Since then, there has been much research (and hype). The idea is that you use a neural network to build a language model. Once this model is learned, you can take some of the intermediate values in the network as representations of the input term.

In this chapter, we will look at the implementation of Word2vec in code. This will help give us a clear understanding of the fundamentals of this family of techniques. We will discuss the more recent approaches at a higher level because they can be quite resource intensive.

## Word2vec

One of the ideas behind deep learning is that the hidden layers are "higher level" representations of the data. This comes from analysis of the visual cortex. As the information travels from the eye through the brain, neurons appear to be associated with more complex shapes. The early layers of neurons recognize only points of light and dark, later neurons recognize lines and curves, and so on. Using this assumption, if we train a language model using a neural network, the hidden layers will be a "higher level" representation of the words.

There are two ways that Word2vec is commonly implemented: continuous bag-of-words (CBOW) and continuous skip-gram (often just skip-gram). In CBOW, we build a model that tries to predict a word based on the nearby words. In the skip-gram approach, a word is used to predict the context.
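
To make the difference between the two setups concrete, here is a minimal sketch of the training examples each one would produce from a toy sentence. The function names and the toy sentence are made up for illustration, and no padding is applied, so windows at the edges are simply shorter.

```python
def context_of(tokens, i, window):
    """Return the tokens within `window` positions of index i, excluding i."""
    left = tokens[max(0, i - window):i]
    right = tokens[i + 1:i + 1 + window]
    return left + right

def cbow_examples(tokens, window=2):
    # CBOW: the context words are the input, the center word is the target
    return [(context_of(tokens, i, window), tokens[i])
            for i in range(len(tokens))]

def skipgram_examples(tokens, window=2):
    # skip-gram: the center word is the input, each context word is a target
    return [(tokens[i], c)
            for i in range(len(tokens))
            for c in context_of(tokens, i, window)]

toy = ['the', 'cat', 'sat', 'on', 'the', 'mat']
print(cbow_examples(toy)[2])       # (['the', 'cat', 'on', 'the'], 'sat')
print(skipgram_examples(toy)[:3])  # [('the', 'cat'), ('the', 'sat'), ('cat', 'the')]
```

CBOW yields one example per position, while skip-gram yields one example per (center, context) pair.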

In either approach, the model is trained using a neural network with one hidden layer. Let's say we want to represent words as K-dimensional vectors, and let's say we have N words in our vocabulary. The first weight matrix then has N rows and K columns, and the weights we learn will become the vectors. The intuition behind this is based on how neural networks function. A neural network learns higher-level representations of input features. These higher-level representations are the intermediate values produced in evaluating a neural-network model. In classic CBOW, the vectors that are fed into the hidden layer are these higher-level features. This means that we can simply take the rows of the first weight matrix as our word vectors. Let's implement CBOW, so we can get a clearer understanding.

First, let's define our imports and load our data.

In [2]:
import sparknlp
from nltk.corpus import brown

spark = sparknlp.start()

In [3]:
def detokenize(sentence):
    text = ''
    for token in sentence:
        if text and any(c.isalnum() for c in token):
            text += ' '
        text += token
    return text

In [4]:
texts = []

for fid in brown.fileids():
    text = [detokenize(s) for s in brown.sents(fid)]
    text = ' '.join(text)
    texts.append((text,))
    
texts = spark.createDataFrame(texts, ['text'])

Now that we have our data, let's process and prepare it for building our model.

In [5]:
from pyspark.ml import Pipeline

from sparknlp import DocumentAssembler, Finisher
from sparknlp.annotator import *

In [6]:
assembler = DocumentAssembler()\
    .setInputCol('text')\
    .setOutputCol('document')
sentence = SentenceDetector() \
    .setInputCols(["document"]) \
    .setOutputCol("sentences") \
    .setExplodeSentences(True)
tokenizer = Tokenizer()\
    .setInputCols(['sentences'])\
    .setOutputCol('token')
normalizer = Normalizer()\
    .setCleanupPatterns([
        '[^a-zA-Z.-]+', 
        '^[^a-zA-Z]+', 
        '[^a-zA-Z]+$',
    ])\
    .setInputCols(['token'])\
    .setOutputCol('normalized')\
    .setLowercase(True)
finisher = Finisher()\
    .setInputCols(['normalized'])\
    .setOutputCols(['normalized'])\
    .setOutputAsArray(True)

pipeline = Pipeline().setStages([
    assembler, sentence, tokenizer, 
    normalizer, finisher
]).fit(texts)

sentences = pipeline.transform(texts)
sentences = sentences.select('normalized').collect()
sentences = [r['normalized'] for r in sentences]

print(len(sentences)) # number of sentences

59091


Now we have performed the text processing, so let's build our encoding. There are tools to do this in most deep learning libraries, but let's do it ourselves.

In [7]:
from collections import Counter
import numpy as np
import pandas as pd

In [8]:
UNK = '???'
PAD = '###'
w2i = {PAD: 0, UNK: 1}
df = Counter()

for s in sentences:
    df.update(s)

df = pd.Series(df)
df = df[df > 10].sort_values(ascending=False)

for word in df.index:
    w2i[word] = len(w2i)
    

i2w = {ix: w for w, ix in w2i.items()}
vocab_size = len(i2w)

In [9]:
print(vocab_size)

7814


We include a marker for padding and a marker for unknown words. We will be tiling over our sentences, creating windows of tokens. The middle token is what we are trying to predict, and the surrounding tokens are our context. We need to pad our sentences, otherwise we will lose words at the beginning and ending of the sentences.

Let's also make some utility functions that will convert a sequence of tokens to a sequence of indices, and one that does the inverse.

In [10]:
def texts_to_sequences(texts):
    return [[w2i.get(w, w2i[UNK]) for w in s] for s in texts]

def sequences_to_texts(seqs):
    return [' '.join([i2w.get(ix, UNK) for ix in s]) for s in seqs]

In [11]:
seqs = texts_to_sequences(sentences)
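
As a quick illustration of what this encoding does, here is a self-contained sketch with a tiny, made-up vocabulary (`toy_w2i`, `encode`, and `decode` are hypothetical names, not part of the notebook). Words outside the vocabulary map to the unknown marker, so the round trip is lossy.

```python
UNK, PAD = '???', '###'
toy_w2i = {PAD: 0, UNK: 1, 'the': 2, 'cat': 3, 'sat': 4}
toy_i2w = {ix: w for w, ix in toy_w2i.items()}

def encode(tokens):
    # unseen words fall back to the index of the unknown marker
    return [toy_w2i.get(t, toy_w2i[UNK]) for t in tokens]

def decode(ids):
    return ' '.join(toy_i2w.get(ix, UNK) for ix in ids)

ids = encode(['the', 'cat', 'sat', 'quietly'])
print(ids)          # [2, 3, 4, 1]
print(decode(ids))  # the cat sat ???
```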

Now let's build our context windows. We will go over each sentence and create a window for each token in the sentence.

In [12]:
w = 4  # number of context tokens on each side of the target
windows = []

for seq in seqs:
    # skip sentences that are too short to provide a useful context
    if len(seq) < 2*w:
        continue
    for i in range(len(seq)):
        # unknown words are not useful prediction targets
        if seq[i] == w2i[UNK]:
            continue
        window = []
        for j in range(-w, w+1):
            if i+j < 0 or i+j >= len(seq):
                window.append(w2i[PAD])
            else:
                window.append(seq[i+j])
        windows.append(window)

windows = np.array(windows)
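
To see what this loop produces, here is a simplified sketch on a toy sequence. It assumes the padding index is 0, uses a smaller window, and leaves out the filters for unknown targets and short sentences; `make_windows` is a hypothetical helper.

```python
PAD_ID = 0  # assumed padding index, matching w2i[PAD] above

def make_windows(seq, w):
    """Build one window of 2*w + 1 ids per position, padded at the edges."""
    windows = []
    for i in range(len(seq)):
        window = []
        for j in range(-w, w + 1):
            if 0 <= i + j < len(seq):
                window.append(seq[i + j])
            else:
                window.append(PAD_ID)
        windows.append(window)
    return windows

toy_seq = [5, 6, 7, 8]
toy_windows = make_windows(toy_seq, w=2)
for window in toy_windows:
    print(window)
# [0, 0, 5, 6, 7]
# [0, 5, 6, 7, 8]
# [5, 6, 7, 8, 0]
# [6, 7, 8, 0, 0]
```

The middle element of each window is the token to predict, and the padding keeps the first and last tokens of the sentence usable as targets.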

We can't just turn all of our data into vectors because that would take up too much memory. So we will need to implement a generator. First, let's write the function that will turn a collection of windows into numpy arrays. This will take the windows and produce a matrix containing the indices of the context words and a matrix containing the one-hot-encoded target words.

In [13]:
def windows_to_batch(batch_windows):
    w = batch_windows.shape[1] // 2
    X = []
    Y = []
    for window in batch_windows:
        X.append(np.concatenate((window[:w], window[w+1:])))
        Y.append(window[w])
        
    X = np.array(X)
    Y = ku.to_categorical(Y, vocab_size)
    return X, Y
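
The following sketch shows the same split on a single window using plain lists. The helper names and the toy vocabulary size are assumptions made for illustration.

```python
def split_window(window):
    """Split a window of ids into (context ids, target id)."""
    mid = len(window) // 2
    return window[:mid] + window[mid + 1:], window[mid]

def one_hot(index, size):
    vec = [0] * size
    vec[index] = 1
    return vec

toy_vocab_size = 10  # hypothetical; the chapter's vocabulary has 7,814 entries
context, target = split_window([0, 5, 6, 7, 8])
print(context)                          # [0, 5, 7, 8]
print(one_hot(target, toy_vocab_size))  # [0, 0, 0, 0, 0, 0, 1, 0, 0, 0]
```

With the real vocabulary, each one-hot target has 7,814 entries, which is why the targets are only materialized one batch at a time.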

Now we write the function that actually produces the generator. The training method takes a Python generator, so we need a utility function that creates a generator of batches.

In [14]:
def generate_batch(windows, batch_size=100):
    while True:
        indices = np.arange(windows.shape[0])
        indices = np.random.choice(indices, batch_size)
        batch_windows = windows[indices, :]
        yield windows_to_batch(batch_windows)
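
Because the generator never terminates, the caller decides how many batches to draw. A minimal standard-library sketch of the same pattern, with a hypothetical `sample_batches` function and a fixed seed for repeatability, looks like this:

```python
import itertools
import random

def sample_batches(items, batch_size, seed=0):
    """Endlessly yield batches sampled with replacement from items."""
    rng = random.Random(seed)
    while True:
        yield [rng.choice(items) for _ in range(batch_size)]

batches = sample_batches(list(range(10)), batch_size=4)
# islice stops after three batches; the generator itself would go on forever
for batch in itertools.islice(batches, 3):
    print(batch)
```

Since sampling is done with replacement, an "epoch" here is a fixed number of random batches rather than one full pass over the windows.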

Now we can define our model. We will be creating 50-dimension word vectors. The number of dimensions should be based on the size of your corpus. However, there is no hard-and-fast rule.

In [15]:
from keras.models import Sequential
from keras.layers import *
import keras.backend as K
import keras.utils as ku
from keras.callbacks import ModelCheckpoint

Using TensorFlow backend.


In [16]:
dim = 50

model = Sequential()
model.add(Embedding(vocab_size, dim, input_length=w*2))
model.add(Lambda(lambda x: K.mean(x, axis=1), (dim,)))
model.add(Dense(vocab_size, activation='softmax'))

The first layer is the actual embeddings we will be learning. The second layer collapses the context into a single vector. The last layer makes the prediction of what the word in the middle of the window should be.
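
To make the three layers concrete, here is a sketch of the forward pass written with plain Python lists. The sizes and the random weights are made up; this only illustrates the computation and is not how Keras implements it.

```python
import math
import random

def cbow_forward(context_ids, E, W, b):
    dim = len(E[0])
    # layer 1: embedding lookup, one row of E per context word
    vectors = [E[ix] for ix in context_ids]
    # layer 2: average the context vectors into a single vector
    hidden = [sum(v[d] for v in vectors) / len(vectors) for d in range(dim)]
    # layer 3: dense layer followed by a softmax over the vocabulary
    logits = [sum(hidden[d] * W[d][j] for d in range(dim)) + b[j]
              for j in range(len(b))]
    top = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(z - top) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

rng = random.Random(0)
toy_vocab, toy_dim = 6, 3
E = [[rng.uniform(-1, 1) for _ in range(toy_dim)] for _ in range(toy_vocab)]
W = [[rng.uniform(-1, 1) for _ in range(toy_vocab)] for _ in range(toy_dim)]
b = [0.0] * toy_vocab

probs = cbow_forward([1, 2, 4, 5], E, W, b)
print(len(probs), round(sum(probs), 6))  # 6 1.0
```

The rows of `E` play the role of the word vectors we extract after training.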

In [17]:
print(model.summary())

Model: "sequential_1"
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
embedding_1 (Embedding)      (None, 8, 50)             390700    
_________________________________________________________________
lambda_1 (Lambda)            (None, 50)                0         
_________________________________________________________________
dense_1 (Dense)              (None, 7814)              398514    
Total params: 789,214
Trainable params: 789,214
Non-trainable params: 0
_________________________________________________________________
None


In [18]:
model.compile(loss='categorical_crossentropy', optimizer='adam')

This is a relatively simple Word2vec model, yet we still need to learn 789,214 parameters. Word embedding models get complicated quickly.
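
The counts in the model summary can be reproduced with a little arithmetic:

```python
vocab_size, dim = 7814, 50

embedding_params = vocab_size * dim           # one dim-sized row per word
dense_params = dim * vocab_size + vocab_size  # weights plus one bias per output
total_params = embedding_params + dense_params

print(embedding_params, dense_params, total_params)  # 390700 398514 789214
```

The averaging layer has no parameters, so both large terms scale with the vocabulary size.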

Let's store the weights every 50 epochs. We make 100 calls to the generator for each epoch.
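
As a rough sanity check of what these settings imply, the sketch below lists the checkpoint files we would expect, using the values from this chapter. It assumes the callback numbers epochs from 1 when formatting the file name, which is consistent with the `weights00500.h5` file loaded later.

```python
epochs, period = 500, 50
batch_size, steps = 1000, 100

# expected checkpoint names, assuming epochs are numbered from 1
saved = ['weights{epoch:05d}.h5'.format(epoch=e)
         for e in range(period, epochs + 1, period)]
print(len(saved), saved[0], saved[-1])  # 10 weights00050.h5 weights00500.h5

# windows sampled (with replacement) in each epoch
print(batch_size * steps)  # 100000
```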

In [19]:
batch_size = 1000
steps = 100
generator = generate_batch(windows, batch_size)

mc = ModelCheckpoint('weights{epoch:05d}.h5', 
                     save_weights_only=True, 
                     period=50)

model.fit_generator(generator, steps_per_epoch=steps, 
                    epochs=500, callbacks=[mc])

  "Converting sparse IndexedSlices to a dense Tensor of unknown shape. "


Epoch 1/500
Epoch 2/500
...
Epoch 499/500
Epoch 500/500


<keras.callbacks.callbacks.History at 0x7f3b14031198>

Now let's look at the data. First, let's implement a class to represent the embeddings data. We will be using cosine similarity to compare vectors.

In [20]:
class Word2VecData(object):
    def __init__(self, word_vectors, w2i, i2w):
        self.word_vectors = word_vectors
        self.w2i = w2i
        self.i2w = i2w
        ## cosine similarity is the dot product of normalized vectors,
        ## so we can precalculate the normalized vectors of our vocabulary
        self.normed_wv = np.divide(
            word_vectors.T, 
            np.linalg.norm(word_vectors, axis=1)
        ).T
        ## all pairwise similarities; np.triu zeroes the lower triangle,
        ## and the > 0 filter then drops those zeros (along with any
        ## negative similarities) while keeping the diagonal
        self.all_sims = np.dot(self.normed_wv, self.normed_wv.T)
        self.all_sims = np.triu(self.all_sims)
        self.all_sims = self.all_sims[self.all_sims > 0]
        
    ## this transforms a word into a vector
    def w2v(self, word):
        return self.word_vectors[self.w2i[word],:]

    ## this calculates cosine similarity of the input word to all words
    def _get_sims(self, word):
        if isinstance(word, str):
            v = self.w2v(word)
        else:
            v = word
        v = np.divide(v, np.linalg.norm(v))
        return np.dot(self.normed_wv, v)

    def nearest_words(self, word, k=10):
        sims = self._get_sims(word)
        nearest = sims.argsort()[-k:][::-1]
        ret = []
        for ix in nearest:
            ret.append((self.i2w[ix], sims[ix]))
        return ret

    def compare_words(self, u, v):
        if isinstance(u, str):
            u = self.w2v(u)
        if isinstance(v, str):
            v = self.w2v(v)
        u = np.divide(u, np.linalg.norm(u))
        v = np.divide(v, np.linalg.norm(v))
        return np.dot(u, v)
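
The class relies on the fact that cosine similarity is just the dot product of normalized vectors. A small standard-library sketch with made-up vectors shows the equivalence:

```python
import math

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def normalize(u):
    norm = math.sqrt(sum(a * a for a in u))
    return [a / norm for a in u]

u, v = [3.0, 4.0], [4.0, 3.0]
direct = cosine(u, v)
via_normed = sum(a * b for a, b in zip(normalize(u), normalize(v)))

print(direct)                           # 0.96
print(abs(direct - via_normed) < 1e-12) # True
```

Normalizing the vocabulary once, as `Word2VecData` does, turns every later similarity lookup into a single matrix-vector product.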

Let's also implement something to output the results. We want to look at a couple of things when evaluating Word2vec. We want to find what words are similar to other words. If the model has learned information about the words, you should see related words.

There are also word analogies. One of the interesting uses of Word2vec was a word "algebra." The common example is `king – man + woman ~ queen`. This means that you subtract the `man` vector from the `king` vector, then add the `woman` vector. The result is approximately the `queen` vector. This generally works well only with a large diverse vocabulary. Our vocabulary is more limited because our data set is small.
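
The arithmetic is easy to see with hand-made vectors. In the sketch below, the two-dimensional vectors are invented so that one axis loosely stands for "royalty" and the other for "gender"; real embeddings are learned and are not this tidy.

```python
import math

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

# made-up vectors: first axis ~ "royalty", second axis ~ "gender"
toy_vectors = {
    'king':  [0.9, 0.8],
    'queen': [0.9, -0.8],
    'man':   [0.1, 0.9],
    'woman': [0.1, -0.9],
    'roof':  [-0.7, 0.1],
}

# king - man + woman, computed component by component
x = [k - m + w for k, m, w in zip(toy_vectors['king'],
                                  toy_vectors['man'],
                                  toy_vectors['woman'])]

ranked = sorted(toy_vectors,
                key=lambda word: cosine(x, toy_vectors[word]),
                reverse=True)
print(ranked[0])  # queen
```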

Let's plot the histogram of all word-to-word similarities.

In [21]:
import matplotlib.pyplot as plt
%matplotlib inline

In [22]:
def display_Word2vec(model, weight_path, words, analogies):
    model.load_weights(weight_path)
    word_vectors = model.layers[0].get_weights()[0]
    W2V = Word2VecData(word_vectors, w2i, i2w)

    for word in words:
        for w, sim in W2V.nearest_words(word):
            print(w, sim)
        print()
        
    for w1, w2, w3, w4 in analogies:
        v1 = W2V.w2v(w1)
        v2 = W2V.w2v(w2)
        v3 = W2V.w2v(w3)
        v4 = W2V.w2v(w4)
        x = v1 - v2 + v3
        for w, sim in W2V.nearest_words(x):
            print(w, sim)
        print()
        print(w4, W2V.compare_words(x, v4))
        print()
        print('{}-{}+{}~{} quantile'.format(w1, w2, w3, w4), 
              (W2V.all_sims < W2V.compare_words(x, v4)).mean())
        print()

    plt.hist(W2V.all_sims)
    plt.title('Word-to-Word similarity histogram')
    
    plt.show()

Let's look at the results from the 50th epoch. First, let's look at the words similar to "space." This is a list of the nearest 10 words to "space" by cosine similarity.

```
space 0.9999999
shear 0.96719706
section 0.9615698
chapter 0.9592927
output 0.958699
phase 0.9580841
corporate 0.95798546
points 0.9575049
density 0.9573466
institute 0.9545017
```

Now let's look at the words similar to "polynomial."

```
polynomial 1.0000001
formula 0.9805055
factor 0.9684353
positive 0.96643007
produces 0.9631797
remarkably 0.96068406
equation 0.9601216
assumption 0.95971704
moral 0.9586859
unique 0.95754766
```

Now let's look at the `king – man + woman ~ queen` analogy. We will print out the words closest to the result vector, `king – man + woman`. Then we can look at the similarity of the result to the `queen` vector. Finally, let's look at what the quantile is for `queen`. The higher it is, the better the analogy works.

```
mountains 0.96987706
emperor 0.96913254
crowds 0.9688335
generals 0.9669207
masters 0.9664976
kings 0.9663711
roof 0.9653381
ceiling 0.96467453
ridge 0.96467185
woods 0.96466273

queen 0.9404894

king-man+woman~queen quantile 0.942
```

That the similarity between the result and the `queen` vector is higher than 94% of all word-to-word similarities is a good sign, but that some of the other top results, like "ceiling," are so close is a sign that our data set may be too small and perhaps too specialized to learn such general relationships.
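
The quantile printed by `display_Word2vec` is the fraction of word-to-word similarities that fall below the analogy's similarity. A tiny sketch with made-up similarity values (`quantile_of` is a hypothetical helper):

```python
def quantile_of(value, population):
    """Fraction of the population that is strictly below value."""
    return sum(1 for x in population if x < value) / len(population)

toy_sims = [0.1, 0.3, 0.5, 0.7, 0.9]
print(quantile_of(0.8, toy_sims))  # 0.8
```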

Finally, let's look at the histogram.

![Histogram of word-to-word similarities at epoch 50](https://i.imgur.com/xDAFpzw.png)  
_Histogram of word-to-word similarities at epoch 50_


Most of the similarities are on the high side. This means that at epoch 50 the words are quite similar. Let's look at the histogram at epoch 100.

![Histogram of word-to-word similarities at epoch 100](https://i.imgur.com/PUFNpcJ.png)
_Histogram of word-to-word similarities at epoch 100_

The weight of the histogram has moved toward the middle. This means that we are seeing more differentiation between our words. Now let's look at 500 epochs.

![Histogram of word-to-word similarities at epoch 500](https://i.imgur.com/7DwSZav.png)
_Histogram of word-to-word similarities at epoch 500_

Note how the mass of the histogram has moved to the left, so most words are dissimilar to each other. This means that the words are more separated in the word-vector space. But remember that this may mean that we are overfitting to this data set.

Spark NLP lets us incorporate externally trained Word2vec models. Let's see how we can use these word embeddings in Spark NLP. First, let's write the embeddings to a file in a format that Spark NLP is familiar with.

In [23]:
model.load_weights('weights00500.h5')
word_vectors = model.layers[0].get_weights()[0]

with open('cbow.csv', 'w') as out:
    for ix in range(vocab_size):
        word = i2w[ix]
        vec = list(word_vectors[ix, :])
        line = ['{}'] + ['{:.18e}'] * dim
        line = ' '.join(line) + '\n'
        line = line.format(word, *vec)
        out.write(line)
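
Each line of the file is a word followed by its space-separated vector components. The sketch below writes and re-reads one such line with a made-up two-dimensional vector; `format_line` and `parse_line` are hypothetical helpers, not part of Spark NLP.

```python
def format_line(word, vec):
    # same layout as above: the word, then each component in scientific notation
    return ' '.join([word] + ['{:.18e}'.format(x) for x in vec])

def parse_line(line):
    parts = line.rstrip('\n').split(' ')
    return parts[0], [float(x) for x in parts[1:]]

line = format_line('space', [0.25, -1.5])
print(line)
word, vec = parse_line(line)
print(word, vec)  # space [0.25, -1.5]
```

This layout only works if no vocabulary entry contains a space, which holds here because the entries are single tokens.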

Now we can create an embeddings annotator.

In [24]:
Word2vec = WordEmbeddings() \
    .setInputCols(['document', 'normalized']) \
    .setOutputCol('embeddings') \
    .setDimension(dim) \
    .setStoragePath('cbow.csv', 'TEXT')

pipeline = Pipeline().setStages([
    assembler, sentence, tokenizer, 
    normalizer, Word2vec
]).fit(texts)

Let's get out the embeddings generated by our model.

In [25]:
pipeline.transform(texts).select('embeddings.embeddings') \
    .first()['embeddings']

[[-0.04116252064704895,
  0.30205145478248596,
  0.15484963357448578,
  -0.44641587138175964,
  0.7547354102134705,
  0.2733045816421509,
  -0.525144100189209,
  0.6141830086708069,
  0.08831439912319183,
  0.7251712083816528,
  -0.011696807108819485,
  0.5672895908355713,
  1.5210978984832764,
  0.5929486751556396,
  -0.31905245780944824,
  -0.41560977697372437,
  1.7453875541687012,
  -0.6388887166976929,
  0.6858738660812378,
  0.41741904616355896,
  -0.7397053241729736,
  0.05437026917934418,
  0.4052659571170807,
  -1.698611855506897,
  -0.6512079834938049,
  0.3355835974216461,
  0.38396722078323364,
  -0.48381197452545166,
  -0.44944122433662415,
  -0.8787884712219238,
  0.832815945148468,
  0.41619688272476196,
  1.553146481513977,
  -0.799826443195343,
  -1.3038822412490845,
  -0.12028307467699051,
  0.04588770121335983,
  0.05268083140254021,
  -1.1570786237716675,
  -0.7518450617790222,
  -0.40213412046432495,
  0.7952238917350769,
  0.32961830496788025,
  1.0909059047698975

Spark has an implementation of the skip-gram approach. Let's look at how to use that.

In [26]:
from pyspark.ml.feature import Word2Vec

In [27]:
Word2vec = Word2Vec() \
    .setInputCol('normalized') \
    .setOutputCol('word_vectors') \
    .setVectorSize(dim) \
    .setMinCount(5)
finisher = Finisher()\
    .setInputCols(['normalized'])\
    .setOutputCols(['normalized'])\
    .setOutputAsArray(True)

pipeline = Pipeline().setStages([
    assembler, sentence, tokenizer, 
    normalizer, finisher, Word2vec
]).fit(texts)

In [28]:
pipeline.transform(texts).select('word_vectors') \
    .first()['word_vectors']

DenseVector([-0.0961, 0.0817, 0.0397, -0.01, 0.0533, -0.038, -0.0766, 0.0437, -0.0191, 0.0328, -0.0455, 0.0108, 0.0665, 0.054, -0.0215, -0.0708, 0.0374, -0.0371, -0.0504, 0.0228, -0.0435, 0.002, 0.0449, -0.0503, 0.0587, 0.0817, 0.0349, -0.0152, -0.0407, 0.0533, 0.0413, 0.0335, 0.0159, -0.0604, -0.0181, 0.0313, -0.0746, -0.0307, -0.029, -0.015, 0.0384, -0.0138, -0.0349, 0.0314, 0.0447, -0.0311, 0.0234, -0.0211, -0.0344, 0.0738])

## GloVe

GloVe (Global Vectors), created by Jeffrey Pennington, Richard Socher, and Christopher Manning from Stanford, is actually more similar to the techniques we covered in the previous chapter, like LSI. Instead of using a neural network to build a language model, GloVe attempts to learn the co-occurrence statistics of words. It outperformed many common Word2vec models on the word analogy task.

One benefit of GloVe is that it is the result of directly modeling relationships, instead of getting them as a side effect of training a language model.
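
The raw material for this kind of model is a table of how often words appear near each other. Here is a minimal sketch of collecting such counts from a toy corpus. It is a simplification: it uses plain counts within a fixed window and does not perform the fitting step that turns the counts into vectors.

```python
from collections import Counter

def cooccurrence_counts(sentences, window=2):
    """Count how often each ordered pair of words appears within `window` tokens."""
    counts = Counter()
    for tokens in sentences:
        for i, word in enumerate(tokens):
            start = max(0, i - window)
            stop = min(len(tokens), i + window + 1)
            for j in range(start, stop):
                if j != i:
                    counts[(word, tokens[j])] += 1
    return counts

toy_corpus = [
    ['ice', 'is', 'cold'],
    ['steam', 'is', 'hot'],
    ['ice', 'is', 'solid'],
]
counts = cooccurrence_counts(toy_corpus)
print(counts[('ice', 'is')], counts[('steam', 'is')], counts[('ice', 'hot')])  # 2 1 0
```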

Let's see how to use GloVe in Spark NLP:

In [29]:
glove = WordEmbeddingsModel.pretrained(name='glove_100d') \
    .setInputCols(['document', 'normalized']) \
    .setOutputCol('embeddings')

pipeline = Pipeline().setStages([
    assembler, sentence, tokenizer, 
    normalizer, glove
]).fit(texts)

glove_100d download started this may take some time.
Approximate size to download 145.3 MB
[OK!]


In [30]:
pipeline.transform(texts).select('embeddings.embeddings') \
    .first()['embeddings']

[[-0.03819400072097778,
  -0.24487000703811646,
  0.7281200289726257,
  -0.3996100127696991,
  0.08317200094461441,
  0.043953001499176025,
  -0.3914099931716919,
  0.3343999981880188,
  -0.5754500031471252,
  0.08745899796485901,
  0.28786998987197876,
  -0.06730999797582626,
  0.3090600073337555,
  -0.263839989900589,
  -0.13231000304222107,
  -0.20757000148296356,
  0.333950012922287,
  -0.33847999572753906,
  -0.3174299895763397,
  -0.4833599925041199,
  0.14640000462532043,
  -0.37303999066352844,
  0.345770001411438,
  0.05204100161790848,
  0.4494599997997284,
  -0.46970999240875244,
  0.026280000805854797,
  -0.5415499806404114,
  -0.15518000721931458,
  -0.14106999337673187,
  -0.03972199931740761,
  0.2827700078487396,
  0.14393000304698944,
  0.2346400022506714,
  -0.3102099895477295,
  0.08617299795150757,
  0.20397000014781952,
  0.5262399911880493,
  0.17163999378681183,
  -0.08237800002098083,
  -0.7178699970245361,
  -0.41530999541282654,
  0.2033499926328659,
  -0.1276

## fastText

In 2016, Facebook AI Research developed an extension to Word2vec called fastText. One common problem with Word2vec is the way it treats words that are not in the vocabulary of the training corpus. For some problems, it may make sense to simply drop these words, under the assumption that they are too rare to have a significant effect on the outcome of downstream processes. In corpora with specialized vocabulary, like a clinical corpus, it is not uncommon to find a word that is important to the document but is not found in the training data. This out-of-vocabulary problem also makes it difficult to do transfer learning. Transfer learning is where you take part or all of a model trained on one data set and task and use it on a different data set and even a different task. In fact, Word2vec itself is transfer learning. You build a model to solve a language-modeling problem, often contrived, and use part of this model in some other NLP-related task.

fastText makes transfer learning with word embeddings easier by learning character-level information. So, instead of learning higher-level representation of tokens, it learns a higher-level representation of character sequences. Once these character sequences are learned, we take the sum of the character-sequence vectors that make up a word for the word's vector.
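
The idea can be sketched in a few lines. The code below splits a word into character n-grams, with `<` and `>` marking the word boundaries, and sums one vector per n-gram. The `toy_vector` function is a made-up stand-in for learned n-gram vectors, and the n-gram lengths are an arbitrary choice for illustration.

```python
def char_ngrams(word, n_min=3, n_max=4):
    """Character n-grams of the word, with < and > as boundary markers."""
    marked = '<' + word + '>'
    return [marked[i:i + n]
            for n in range(n_min, n_max + 1)
            for i in range(len(marked) - n + 1)]

def toy_vector(ngram, dim=4):
    # stand-in for a learned vector: made-up values derived from character codes
    code = sum(ord(c) for c in ngram)
    return [(code % (d + 7)) / 10.0 for d in range(dim)]

def word_vector(word, dim=4):
    # a word's vector is the sum of the vectors of its character n-grams
    vectors = [toy_vector(g, dim) for g in char_ngrams(word)]
    return [sum(v[d] for v in vectors) for d in range(dim)]

print(char_ngrams('where', 3, 3))  # ['<wh', 'whe', 'her', 'ere', 're>']
print(len(word_vector('unseenword')))  # 4
```

Because any string can be split into n-grams, a word that never appeared in training still gets a vector.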

## Transformers

In 2017, researchers at Google created a new approach for modeling attention. Attention is a concept from sequence modeling. A sequence model that does not have a fixed context must learn how long to retain information from earlier in the sequence. Being able to better capture long-distance relationships is very important to automatic machine translation. Most words have multiple senses or meanings, and clarifying requires broader context. In linguistics, the property of having multiple senses is known as polysemy or homonymy.

Polysemy is when the senses are different but related, and homonymy is when the senses are not related in meaning. For example, let's look at the word "rock." When used as a noun, "rock" means a piece of stone. This is completely unrelated to the other meaning, the verb "to rock." The verb "to rock" refers to a back-and-forth motion, and it can also mean to perform or enjoy rock-and-roll music. So "rock" (a stone) is a homonym of "rock" (to move back and forth), which is a polyseme that also has the meaning to perform or enjoy rock-and-roll music.

Cues to disambiguate homonyms and polysemes generally come from other words in the context. The example given in the paper that defines the Transformer (Vaswani et al.) is the word "bank," which has two meanings. The first is a financial institution, and the second is the edge of a river. This homonymous relationship does not carry over in translation. For example, in Spanish, the institution is "banco," and the edge of a river is "orilla." So if you are translating with a neural network, it would be advantageous to represent the two senses differently. To do this, you must encode your words with their context.

The word vectors of previous methods represent aggregations of these different senses. Encoding each word together with its context allows a much richer representation of the text. However, it comes at a severe cost. These models are computationally much more intensive to train and use. In fact, most of the current methods, at the time of writing in 2019, are not feasible without using GPUs or even more specialized hardware.
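
The core operation of the Transformer, scaled dot-product attention, can be sketched with plain lists. This is a simplified illustration: the token vectors are made up, and the learned projections and multiple heads of the full model are left out.

```python
import math

def softmax(xs):
    top = max(xs)  # subtract the max for numerical stability
    exps = [math.exp(x - top) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def attention(queries, keys, values):
    """For each query, return a weighted average of the values."""
    d_k = len(keys[0])
    outputs = []
    for q in queries:
        # similarity of the query to every key, scaled by sqrt(d_k)
        scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(d_k)
                  for k in keys]
        weights = softmax(scores)
        outputs.append([sum(w * v[d] for w, v in zip(weights, values))
                        for d in range(len(values[0]))])
    return outputs

# three made-up token vectors attending to each other (self-attention)
tokens = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
contextual = attention(tokens, tokens, tokens)
for vec in contextual:
    print([round(x, 3) for x in vec])
```

Each output vector mixes information from every position in the sequence, so the same word can receive a different vector in a different context.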

## ELMo, BERT, and XLNet

Newer embedding techniques are based on the idea of representing words in a context-dependent way. This means that a full neural-network model is needed to use the embeddings, unlike in static embeddings where there is simply a lookup.

Embeddings from language models (ELMo) is a model that was developed at the Allen Institute in 2018. The language model that is being learned is bidirectional. This means that the model is learning to predict a word based on the words that come both before it and after it. The model learns at the character level, but the embeddings themselves are actually word based.

Bidirectional Encoder Representations from Transformers (BERT), published in 2018, does something very similar to ELMo, but it uses Google's Transformer, hence the name. The intent is that this model can be fine-tuned. This is done by first building a generic pretrained model on a large data set, which can then be adapted to a specific task. There are other approaches that allow fine-tuning, but the authors of the BERT paper note that those are either unidirectional approaches or more specialized bidirectional approaches. BERT is intended to remove the need to choose between them by training a model that tries to identify randomly masked words.
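
The masking step can be sketched as follows. This is a simplification under stated assumptions: it hides a fixed share of tokens behind a `[MASK]` marker and records the originals as prediction targets, whereas the actual BERT procedure also sometimes substitutes a random token or leaves the chosen token unchanged. The function name and the example sentence are made up.

```python
import random

def mask_tokens(tokens, mask_rate=0.15, seed=0):
    """Replace a random share of tokens with [MASK]; return the targets to predict."""
    rng = random.Random(seed)
    n_mask = max(1, round(len(tokens) * mask_rate))
    positions = sorted(rng.sample(range(len(tokens)), n_mask))
    masked = list(tokens)
    targets = {}
    for pos in positions:
        targets[pos] = tokens[pos]  # what the model must recover
        masked[pos] = '[MASK]'
    return masked, targets

sentence = 'the model predicts the hidden words from both directions'.split()
masked, targets = mask_tokens(sentence)
print(masked)
print(targets)
```

Because the hidden word can be anywhere in the sentence, the model has to use the context on both sides to recover it.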

BERT became very popular by achieving high scores on a number of benchmarks. Roughly a year later, XLNet was published. XLNet was built to learn a model without the masking needed by BERT. The idea is that the masking creates discrepancies between what the BERT model sees at training time and what it sees at time of use. XLNet then went on to achieve even higher benchmark scores.

Let's look at how to use BERT embeddings in Spark NLP.

In [31]:
bert = BertEmbeddings.pretrained() \
      .setInputCols(["sentences", "normalized"]) \
      .setOutputCol("bert")

pipeline = Pipeline().setStages([
    assembler, sentence, tokenizer, 
    normalizer, bert
]).fit(texts)

bert_base_cased download started this may take some time.
Approximate size to download 389.2 MB
[OK!]


In [32]:
pipeline.transform(texts).select('bert.embeddings') \
    .first()['embeddings']

[[-1.1980020999908447,
  0.3962576389312744,
  0.7419608235359192,
  0.7973726391792297,
  -1.0004487037658691,
  0.8969652056694031,
  -0.3867361843585968,
  0.17260757088661194,
  -0.26956284046173096,
  -0.3295387625694275,
  0.6897872686386108,
  -0.39754605293273926,
  0.5433205366134644,
  0.7603238224983215,
  0.27005520462989807,
  -1.9060765504837036,
  -0.9963284730911255,
  0.9765754342079163,
  -0.26099157333374023,
  0.369789719581604,
  -0.43807271122932434,
  0.7488493919372559,
  -0.5378984808921814,
  0.8647488951683044,
  -0.14781543612480164,
  0.23633432388305664,
  -0.6737574934959412,
  -0.8317551016807556,
  0.49321049451828003,
  0.22750777006149292,
  0.886649489402771,
  -2.1890995502471924,
  -0.3851615786552429,
  0.6819204092025757,
  0.2723952531814575,
  -0.274985134601593,
  0.8294179439544678,
  -1.1374274492263794,
  -0.9435877203941345,
  -0.20892567932605743,
  0.9044886827468872,
  -0.5634653568267822,
  -0.4172305464744568,
  0.3451480269432068,
  

A caveat to those interested in these techniques: one must always return to first principles when evaluating such new and complicated approaches. First, always consider what your product actually needs. Is your product similar to one of the tasks for which BERT and XLNet achieved high scores? What is the level of accuracy you need versus the amount you are willing to spend on developer time, training time, and hardware? Just because these techniques are very popular with people in the field of NLP does not mean they are the best for every application.

In fact, these techniques can overfit in difficult-to-detect ways. Researchers at National Cheng Kung University in Taiwan created an adversarial data set for the Argument Reasoning Comprehension Task, a question-and-answer task in which a model must take in a piece of text that makes some arguments and draw a conclusion. BERT had achieved scores close to the human baseline on this task. The researchers modified the data set with examples that were contradictory. When the BERT model was evaluated on this new adversarial data set, it performed worse than humans and little better than models built with older, simpler techniques. A model should be able to draw a conclusion based solely on the input text; that is, it should not be using statements in other examples.

## doc2vec

Doc2vec is the set of techniques that lets us turn a document into a vector. Often, we want to use embeddings as denser features for other tasks, like classification. How can we combine these word-level features into document-level features? One common way is to simply average the word-level vectors. This makes intuitive sense because we are hoping that the vector space represents a vague idea of meaning. So if we average all the vectors in a document, we should get the "average" meaning of the document.
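As a minimal sketch of the averaging idea, the following uses plain Python lists and a tiny, made-up embedding table. The function name `average_vectors` and the three-dimensional vectors are assumptions for illustration only; in practice the vectors would come from a trained model.

```python
def average_vectors(tokens, embeddings):
    """Average the vectors of the tokens that have an embedding.

    Tokens missing from the embedding table are skipped. Returns None
    if no token in the document has a vector.
    """
    vectors = [embeddings[t] for t in tokens if t in embeddings]
    if not vectors:
        return None
    dim = len(vectors[0])
    return [sum(v[i] for v in vectors) / len(vectors) for i in range(dim)]

# Toy 3-dimensional embeddings, invented for this example.
embeddings = {
    "patient": [1.0, 0.0, 0.0],
    "reports": [0.0, 1.0, 0.0],
    "headache": [0.0, 0.0, 1.0],
}

# "today" has no embedding, so it does not contribute to the average.
doc_vector = average_vectors(
    ["patient", "reports", "headache", "today"], embeddings
)
print(doc_vector)
```

Every known word contributes equally here, regardless of how informative it is.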

The problem with this approach comes in when we consider rarity of words. Recalling our conversation on TF.IDF, it is often the case that unimportant words have a high frequency. For example, consider a clinical note. We may find a large number of generic words that are common to all notes. This can present a problem because these words will pull all of our documents toward a small number of places in our vector space. A model could still separate them, but it will converge more slowly. Worse yet, if there are natural clusters within the corpus (for example, clinical notes from different departments), we may have a number of tightly packed clusters of documents. What we want is to be able to characterize a document by the words that are most important to that unique document. There are a few approaches for doing this.

You can perform weighted averaging on the word vectors by using IDF values as weights. This will help reduce the effect that more common words have on the document vector. This approach has the benefit of being simple to implement. Indeed, if you are using a static embedding technique, you can simply scale the vectors by the IDF values. There will be no need to compute this at evaluation time. The downside is that this is a bag-of-words approach and does not take into account the relationships between words.
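The weighting described above might be sketched as follows. The helper names are hypothetical, the corpus and two-dimensional vectors are made up, and the smoothed IDF formula is only one of several common variants, so treat this as an illustration rather than a reference implementation.

```python
import math

def idf_weights(documents):
    """Compute a smoothed IDF for every term in a tokenized corpus."""
    n_docs = len(documents)
    doc_freq = {}
    for doc in documents:
        for term in set(doc):
            doc_freq[term] = doc_freq.get(term, 0) + 1
    # Assumed variant: log((N + 1) / (df + 1)) + 1.
    return {term: math.log((n_docs + 1) / (df + 1)) + 1
            for term, df in doc_freq.items()}

def scale_embeddings(embeddings, idf):
    """Pre-multiply each static word vector by its IDF weight, once, offline."""
    return {t: [idf[t] * x for x in vec]
            for t, vec in embeddings.items() if t in idf}

def weighted_document_vector(tokens, scaled, idf):
    """Sum the pre-scaled vectors and normalize by the total IDF weight."""
    known = [t for t in tokens if t in scaled]
    if not known:
        return None
    total_weight = sum(idf[t] for t in known)
    dim = len(scaled[known[0]])
    return [sum(scaled[t][i] for t in known) / total_weight
            for i in range(dim)]

corpus = [
    ["patient", "reports", "headache"],
    ["patient", "reports", "nausea"],
    ["patient", "denies", "fever"],
]
# Toy 2-dimensional embeddings, invented for this example.
embeddings = {"patient": [1.0, 0.0], "headache": [0.0, 1.0]}

idf = idf_weights(corpus)
scaled = scale_embeddings(embeddings, idf)
doc_vector = weighted_document_vector(["patient", "headache"], scaled, idf)
print(doc_vector)
```

Because "patient" appears in every note, it gets the lowest weight, and the document vector sits closer to "headache" than a plain average would place it.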

Another approach is called distributed memory paragraph vectors. This is essentially CBOW, but with an additional set of weights that needs to be learned. In CBOW, we are predicting a word from its context, so we have vectors representing the context as input. For distributed memory paragraph vectors, we concatenate a one-hot encoding of the document ID to the input. This allows us to represent the document with the same dimension as the words.
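Here is a rough sketch of how the inputs for such a model could be assembled. The function names, the bag-of-words context encoding, and the window size are assumptions made for illustration; only the construction of training examples is shown, not the network itself.

```python
def one_hot(index, size):
    """Return a list of zeros with a single 1 at the given index."""
    vec = [0] * size
    vec[index] = 1
    return vec

def dm_examples(documents, window=2):
    """Build (input_vector, target_word_index) pairs for a DM-style model.

    Each input is a bag-of-words encoding of the context words with a
    one-hot encoding of the document ID concatenated onto the end.
    """
    vocab = sorted({w for doc in documents for w in doc})
    word_index = {w: i for i, w in enumerate(vocab)}
    n_docs = len(documents)
    examples = []
    for doc_id, doc in enumerate(documents):
        for pos, target in enumerate(doc):
            context = doc[max(0, pos - window):pos] + doc[pos + 1:pos + 1 + window]
            # Count each context word, as in a CBOW-style input.
            context_vec = [0] * len(vocab)
            for w in context:
                context_vec[word_index[w]] += 1
            # Concatenate the one-hot document ID onto the context input.
            examples.append(
                (context_vec + one_hot(doc_id, n_docs), word_index[target])
            )
    return vocab, examples

documents = [["a", "b", "c"], ["b", "c", "d"]]
vocab, examples = dm_examples(documents)
print(vocab)
print(examples[0])
```

With an input of this shape, the first layer's weight matrix has one row per vocabulary word plus one row per document, so each document row is a vector of the same size as a word vector.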

A third approach to doc2vec, distributed bag-of-words paragraph vectors, parallels the skip-gram approach. Recall that skip grams predict the context from a word. In distributed bag-of-words paragraph vectors, you learn to predict a context from the document using only its document ID.
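The training pairs for this approach could be sketched as below. The function name `dbow_pairs` and the fixed sliding window are assumptions for illustration; an actual implementation may sample windows and words randomly instead.

```python
def dbow_pairs(documents, window=3):
    """Build (doc_id, context_words) pairs for a DBOW-style model.

    The only input is the document ID; the words in each window are
    what the model is asked to predict.
    """
    pairs = []
    for doc_id, doc in enumerate(documents):
        # Documents shorter than the window still yield one pair.
        n_windows = max(1, len(doc) - window + 1)
        for start in range(n_windows):
            pairs.append((doc_id, doc[start:start + window]))
    return pairs

documents = [["a", "b", "c", "d"], ["e", "f"]]
pairs = dbow_pairs(documents)
for doc_id, context in pairs:
    print(doc_id, context)
```

Note that no word vectors appear on the input side, which mirrors how skip grams use a single word as the input.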

These last two approaches have the benefit of learning the relationships between co-occurring words. They are also relatively straightforward to implement if you can implement Word2vec. Their downside is that they learn vectors only for the documents you have on hand. If you get a new document, there is no learned vector to look up for it; producing one requires further training against that document. So these approaches are best suited to offline processes.

When talking about doc2vec, also sometimes known as paragraph2vec, it's important to keep in mind that it can be applied to different sizes of text, from phrases to whole documents. However, if you are interested in converting phrases to vectors, you may also want to consider incorporating this into your tokenization. You can produce phrases as tokens and then learn one of the word-level embeddings discussed previously.
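Producing phrases as tokens could look something like the following sketch. The function name `merge_phrases`, the underscore-joined token format, and the hand-written phrase list are all assumptions for illustration; in practice the phrase list might come from a collocation or chunking step.

```python
def merge_phrases(tokens, phrases):
    """Greedily join known multiword phrases into single tokens.

    `phrases` is a set of word tuples. At each position the longest
    matching phrase wins, and its words are joined with underscores.
    """
    max_len = max((len(p) for p in phrases), default=1)
    merged = []
    i = 0
    while i < len(tokens):
        for n in range(max_len, 1, -1):
            candidate = tuple(tokens[i:i + n])
            if len(candidate) == n and candidate in phrases:
                merged.append("_".join(candidate))
                i += n
                break
        else:
            # No phrase starts here, so keep the single token.
            merged.append(tokens[i])
            i += 1
    return merged

phrases = {("new", "york"), ("new", "york", "city")}
merged = merge_phrases("i flew to new york city yesterday".split(), phrases)
print(merged)
```

The merged tokens can then be fed to any word-level embedding technique, which will learn one vector per phrase.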

## Exercises

Let's see how these techniques work on our classification problem from the Information Extraction chapter.

This time, writing code is up to you. Try Spark's skip-gram implementation, Spark NLP's pretrained GloVe model, and Spark NLP's BERT model. 