# Using Pre-trained Word Embeddings

This tutorial shows how to use pre-trained word embeddings via `gluonnlp`.

The GloVe and fastText word embeddings used in this tutorial come from the following sources:

* GloVe project website: https://nlp.stanford.edu/projects/glove/
* fastText project website: https://fasttext.cc/

Let us first import the following packages.

In [1]:
from mxnet import gluon
from mxnet import nd

import os
import sys
curr_path = os.getcwd()
sys.path.append(os.path.join(curr_path, '..', '..'))

from gluonnlp import Vocab, embedding, data

## Creating Vocabulary with Word Embeddings

As a common use case, let us index words, attach pre-trained word embeddings to them, and use these embeddings in `gluon`, all in just a few lines of code.

### Creating Vocabulary from Data Sets

To begin with, suppose that we have a simple text data set in the string format. We can count word frequency in the data set.

In [2]:
text = " hello world \n hello nice world \n hi world \n"

import re
def simple_tokenize(source_str, token_delim=' ', seq_delim='\n'):
    return filter(None, re.split(token_delim + '|' + seq_delim, source_str))

counter = data.count_tokens(simple_tokenize(text))

The resulting `counter` holds key-value pairs whose keys are words and whose values are word frequencies. This allows us to filter out infrequent words via `Vocab` arguments such as `max_size` and `min_freq`. Suppose that we want to build indices for all the keys in `counter`. We need a `Vocab` instance with `counter` as its argument.
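
To make the counting and filtering step concrete, here is a minimal standard-library sketch. It does not use `gluonnlp`; the `min_freq` threshold of 2 and the names `tokens`, `counts`, and `frequent` are chosen purely for illustration.

```python
import re
from collections import Counter

text = " hello world \n hello nice world \n hi world \n"

# Split on spaces and newlines, dropping the empty strings that
# re.split leaves around consecutive delimiters.
tokens = [tok for tok in re.split(r" |\n", text) if tok]
counts = Counter(tokens)

# Hypothetical threshold: keep only words seen at least twice.
min_freq = 2
frequent = {tok: n for tok, n in counts.items() if n >= min_freq}

print(counts)    # word -> frequency
print(frequent)  # words that survive the min_freq filter
```

With this threshold only 'hello' and 'world' remain, which is the kind of pruning that a frequency cutoff performs before indices are assigned.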

In [3]:
vocab = Vocab(counter)

To attach word embeddings to the indexed words in `vocab`, let us create a fastText word embedding instance by specifying the embedding name `fasttext` and the source name `wiki.simple.vec`.

In [4]:
fasttext_simple = embedding.create('fasttext', source='wiki.simple.vec')

Now we can attach the word embedding `fasttext_simple` to the indexed words in `vocab`.

In [5]:
vocab.set_embedding(fasttext_simple)

To see other source names available for the fastText word embedding, we can use `embedding.list_sources`.

In [6]:
embedding.list_sources('fasttext')[:5]

['wiki.tw.vec', 'wiki.mg.vec', 'wiki.new.vec', 'wiki.sg.vec', 'wiki.lmo.vec']

The created vocabulary `vocab` includes four different words and four special tokens, one of which is the unknown token. Let us check the size of `vocab`.

In [7]:
len(vocab)

8

By default, the vector of any token that is unknown to `vocab` is a zero vector. Its length is equal to the vector dimension of the fastText word embeddings: 300.

In [8]:
vocab.embedding['beautiful'].shape

(300,)

The first five elements of the vector of any unknown token are zeros.

In [9]:
vocab.embedding['beautiful'][:5]


[0. 0. 0. 0. 0.]
<NDArray 5 @cpu(0)>
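
The zero-vector fallback can be pictured with a plain dictionary lookup. This is only a sketch of the behavior described above, not of how `gluonnlp` stores embeddings; the table `toy_vectors`, its values, and the helper `lookup` are made up for illustration.

```python
# Toy 3-dimensional embedding table; the values are made up.
toy_vectors = {
    "hello": [0.4, 0.2, -0.1],
    "world": [0.1, -0.1, 0.3],
}
dim = 3

def lookup(token):
    """Return the stored vector, or a zero vector for an unknown token."""
    return toy_vectors.get(token, [0.0] * dim)

print(lookup("hello"))      # stored vector
print(lookup("beautiful"))  # not in the table, so all zeros
```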

Let us check the shape of the embedding of words 'hello' and 'world' from `vocab`.

In [10]:
vocab.embedding['hello', 'world'].shape

(2, 300)

We can access the first five elements of the embedding of 'hello' and 'world'.

In [11]:
vocab.embedding['hello', 'world'][:, :5]


[[ 0.39567   0.21454  -0.035389 -0.24299  -0.095645]
 [ 0.10444  -0.10858   0.27212   0.13299  -0.33165 ]]
<NDArray 2x5 @cpu(0)>

### Using Pre-trained Word Embeddings in `gluon.nn.Embedding`

To demonstrate how to use pre-trained word embeddings in the `gluon` package, let us first obtain indices of the words 'hello' and 'world'.

In [12]:
vocab['hello', 'world']

[5, 4]
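
These indices are consistent with a layout in which the special tokens come first and the words follow in order of decreasing frequency. The plain-Python sketch below reproduces that layout under stated assumptions: the special token names (`<unk>`, `<pad>`, `<bos>`, `<eos>`) and the alphabetical tie-breaking are assumptions made for illustration, not a description of how `Vocab` is implemented.

```python
from collections import Counter

counter = Counter({"world": 3, "hello": 2, "hi": 1, "nice": 1})

# Assumed layout: four special tokens first, then words by decreasing
# frequency, with ties broken alphabetically.
special_tokens = ["<unk>", "<pad>", "<bos>", "<eos>"]
ordered_words = sorted(counter, key=lambda tok: (-counter[tok], tok))
idx_to_token = special_tokens + ordered_words
token_to_idx = {tok: i for i, tok in enumerate(idx_to_token)}

print(len(idx_to_token))
print([token_to_idx[t] for t in ("hello", "world")])
```

Under these assumptions 'world' (the most frequent word) receives the first index after the special tokens, and 'hello' receives the next one.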

We can obtain the vectors for the words 'hello' and 'world' by specifying their indices (5 and 4) and the weight matrix `vocab.embedding.idx_to_vec` in `gluon.nn.Embedding`.

In [13]:
input_dim, output_dim = vocab.embedding.idx_to_vec.shape
layer = gluon.nn.Embedding(input_dim, output_dim)
layer.initialize()
layer.weight.set_data(vocab.embedding.idx_to_vec)
layer(nd.array([5, 4]))[:, :5]


[[ 0.39567   0.21454  -0.035389 -0.24299  -0.095645]
 [ 0.10444  -0.10858   0.27212   0.13299  -0.33165 ]]
<NDArray 2x5 @cpu(0)>
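
Conceptually, an embedding layer is a row lookup into its weight matrix: the input index selects the row. The following plain-Python sketch illustrates that idea with a made-up 3-by-2 matrix; `weight` and `embed` are hypothetical names and do not correspond to any `gluon` API.

```python
# Toy weight matrix: row i holds the vector of the token with index i.
# The values are made up; row 0 stands in for the unknown token.
weight = [
    [0.0, 0.0],   # index 0
    [0.5, -0.5],  # index 1
    [0.1, 0.9],   # index 2
]

def embed(indices):
    """Look up one row of the weight matrix per input index."""
    return [weight[i] for i in indices]

print(embed([2, 1]))
```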

### Creating Vocabulary from Pre-trained Word Embeddings

We can also create a vocabulary from the vocabulary of a pre-trained word embedding, such as GloVe. Below are a few of the pre-trained file names available for the GloVe word embedding.

In [14]:
embedding.list_sources('glove')[:5]

['glove.twitter.27B.200d.txt',
 'glove.twitter.27B.50d.txt',
 'glove.6B.300d.txt',
 'glove.42B.300d.txt',
 'glove.6B.100d.txt']

To keep the demonstration simple, we use a small word embedding file: the 50-dimensional `glove.6B.50d.txt`.

In [15]:
glove_6b50d = embedding.create('glove', source='glove.6B.50d.txt')

Now we create a vocabulary from all the tokens in `glove_6b50d`.

In [16]:
vocab = Vocab(data.Counter(glove_6b50d.idx_to_token))
vocab.set_embedding(glove_6b50d)

The size of `vocab`, including its four special tokens (one of which is the unknown token), is shown below.

In [17]:
len(vocab.idx_to_token)

400004

We can map a token to its index and map an index back to its token through `vocab`.

In [18]:
print(vocab['beautiful'])
print(vocab.idx_to_token[71424])

71424
beautiful


## Applications of Word Embeddings

To apply word embeddings, we first define cosine similarity, which measures how similar two vectors are.

In [19]:
from mxnet import nd
def cos_sim(x, y):
    return nd.dot(x, y) / (nd.norm(x) * nd.norm(y))

The cosine similarity between two vectors lies between -1 and 1. The larger the value, the more similar the two vectors are.

In [20]:
x = nd.array([1, 2])
y = nd.array([10, 20])
z = nd.array([-1, -2])

print(cos_sim(x, y))
print(cos_sim(x, z))


[1.]
<NDArray 1 @cpu(0)>

[-1.]
<NDArray 1 @cpu(0)>
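
For reference, the quantity computed by `cos_sim` is $\cos(x, y) = \frac{x \cdot y}{\|x\| \, \|y\|}$. The sketch below recomputes it with the standard library only, so the three boundary cases (same direction, opposite direction, orthogonal) can be checked without `mxnet`. The name `cos_sim_py` is hypothetical and chosen to avoid clashing with `cos_sim` above.

```python
import math

def cos_sim_py(x, y):
    """Cosine similarity of two equal-length sequences of numbers."""
    dot = sum(a * b for a, b in zip(x, y))
    norm_x = math.sqrt(sum(a * a for a in x))
    norm_y = math.sqrt(sum(b * b for b in y))
    return dot / (norm_x * norm_y)

print(cos_sim_py([1, 2], [10, 20]))  # same direction: close to 1
print(cos_sim_py([1, 2], [-1, -2]))  # opposite direction: close to -1
print(cos_sim_py([1, 0], [0, 1]))    # orthogonal: 0
```

Note that the result depends only on direction, not on length: scaling `[1, 2]` to `[10, 20]` leaves the similarity unchanged up to floating-point rounding.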


### Word Similarity

Given an input word, we can find the $k$ nearest words in the vocabulary (400,000 words excluding the special tokens) by similarity. The similarity between any pair of words is measured by the cosine similarity of their vectors.

In [21]:
def norm_vecs_by_row(x):
    return x / nd.sqrt(nd.sum(x * x, axis=1)).reshape((-1,1))

def get_knn(vocab, k, word):
    word_vec = vocab.embedding[word].reshape((-1, 1))
    vocab_vecs = norm_vecs_by_row(vocab.embedding.idx_to_vec)
    dot_prod = nd.dot(vocab_vecs, word_vec)
    indices = nd.topk(dot_prod.reshape((len(vocab), )), k=k+4, ret_typ='indices')
    indices = [int(i.asscalar()) for i in indices]
    # Drop the first four results (the special tokens); the input word itself is kept.
    return vocab.to_tokens(indices[4:])

Let us find the 5 words in the vocabulary (size: 400,000 words) that are most similar to 'baby'. Note that the input word itself appears in the result.

In [22]:
get_knn(vocab, 5, 'baby')

['baby', 'babies', 'boy', 'girl', 'newborn']
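
The nearest-neighbor search can be sketched on a toy vocabulary with the standard library alone. This is an illustration of the technique, not a replacement for `get_knn`: the vectors in `toy_vectors` are made up, `toy_knn` and `cosine` are hypothetical names, and zero vectors are skipped explicitly because their cosine similarity is undefined.

```python
import math

# Toy 2-dimensional vectors, made up for illustration.
toy_vectors = {
    "<unk>": [0.0, 0.0],
    "baby": [1.0, 0.1],
    "babies": [0.9, 0.2],
    "car": [0.0, 1.0],
    "truck": [0.1, 0.9],
}

def cosine(x, y):
    dot = sum(a * b for a, b in zip(x, y))
    norms = math.sqrt(sum(a * a for a in x)) * math.sqrt(sum(b * b for b in y))
    return dot / norms

def toy_knn(vectors, k, word):
    """Return the k tokens whose vectors are closest to `word` by cosine."""
    query = vectors[word]
    scores = {
        tok: cosine(vec, query)
        for tok, vec in vectors.items()
        if any(vec)  # skip zero vectors, whose cosine is undefined
    }
    return sorted(scores, key=scores.get, reverse=True)[:k]

print(toy_knn(toy_vectors, 3, "baby"))
```

As in the output above, the query word ranks first because its similarity with itself is the maximum possible value.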

We can verify the cosine similarity of vectors of 'baby' and 'babies'.

In [23]:
cos_sim(vocab.embedding['baby'], vocab.embedding['babies'])


[0.838713]
<NDArray 1 @cpu(0)>

Let us find the 5 words in the vocabulary that are most similar to 'computers'.

In [24]:
get_knn(vocab, 5, 'computers')

['computers', 'computer', 'phones', 'pcs', 'machines']

Let us find the 5 words in the vocabulary that are most similar to 'run'.

In [25]:
get_knn(vocab, 5, 'run')

['run', 'running', 'runs', 'went', 'start']

Let us find the 5 words in the vocabulary that are most similar to 'beautiful'.

In [26]:
get_knn(vocab, 5, 'beautiful')

['beautiful', 'lovely', 'gorgeous', 'wonderful', 'charming']

### Word Analogy

We can also apply pre-trained word embeddings to the word analogy problem. For instance, "man : woman :: son : daughter" is an analogy. The word analogy completion problem is defined as follows: for the analogy 'a : b :: c : d', given the first three words 'a', 'b', 'c', find 'd'. The idea is to find the word whose vector is most similar to vec('c') + (vec('b') - vec('a')).

In this example, we will find words by analogy from the 400,000 indexed words in `vocab`.

In [27]:
def get_top_k_by_analogy(vocab, k, word1, word2, word3):
    word_vecs = vocab.embedding[word1, word2, word3]
    word_diff = (word_vecs[1] - word_vecs[0] + word_vecs[2]).reshape((-1, 1))
    vocab_vecs = norm_vecs_by_row(vocab.embedding.idx_to_vec)
    dot_prod = nd.dot(vocab_vecs, word_diff)
    indices = nd.topk(dot_prod.reshape((len(vocab), )), k=k+4, ret_typ='indices')
    indices = [int(i.asscalar()) for i in indices]

    # Filter out unknown tokens.
    if vocab.to_tokens(indices[0]) == vocab.unknown_token:
        return vocab.to_tokens(indices[4:])
    else:
        return vocab.to_tokens(indices[:-4])

Complete the word analogy 'man : woman :: son : '.

In [28]:
get_top_k_by_analogy(vocab, 1, 'man', 'woman', 'son')

['daughter']

Let us verify the cosine similarity between vec('son') + vec('woman') - vec('man') and vec('daughter').

In [29]:
def cos_sim_word_analogy(vocab, word1, word2, word3, word4):
    words = [word1, word2, word3, word4]
    vecs = vocab.embedding[words]
    return cos_sim(vecs[1] - vecs[0] + vecs[2], vecs[3])

cos_sim_word_analogy(vocab, 'man', 'woman', 'son', 'daughter')


[0.9658343]
<NDArray 1 @cpu(0)>
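
The vector arithmetic behind analogy completion is easiest to see on a toy example. In the standard-library sketch below, the vectors are made up so that the first coordinate encodes generation and the second encodes gender; `toy_analogy` and `cosine` are hypothetical names. Unlike `get_top_k_by_analogy`, this sketch excludes the three input words from the candidates, which is a design choice of the sketch.

```python
import math

# Toy vectors, made up for illustration.
toy_vectors = {
    "man": [1.0, 1.0],
    "woman": [1.0, -1.0],
    "son": [-1.0, 1.0],
    "daughter": [-1.0, -1.0],
    "king": [3.0, 1.0],
}

def cosine(x, y):
    dot = sum(a * b for a, b in zip(x, y))
    norms = math.sqrt(sum(a * a for a in x)) * math.sqrt(sum(b * b for b in y))
    return dot / norms

def toy_analogy(vectors, a, b, c):
    """Complete 'a : b :: c : ?' by maximizing cosine with vec(c) + vec(b) - vec(a)."""
    target = [vb - va + vc
              for va, vb, vc in zip(vectors[a], vectors[b], vectors[c])]
    # Exclude the input words so the answer is a new word.
    candidates = {tok: vec for tok, vec in vectors.items()
                  if tok not in (a, b, c)}
    return max(candidates, key=lambda tok: cosine(candidates[tok], target))

print(toy_analogy(toy_vectors, "man", "woman", "son"))
```

Here vec('woman') - vec('man') + vec('son') equals the vector of 'daughter' exactly, so the toy analogy resolves to that word.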

Complete the word analogy 'beijing : china :: tokyo : '.

In [30]:
get_top_k_by_analogy(vocab, 1, 'beijing', 'china', 'tokyo')

['japan']

Complete the word analogy 'bad : worst :: big : '.

In [31]:
get_top_k_by_analogy(vocab, 1, 'bad', 'worst', 'big')

['biggest']

Complete the word analogy 'do : did :: go : '.

In [32]:
get_top_k_by_analogy(vocab, 1, 'do', 'did', 'go')

['went']