# Using Pre-trained Word Embeddings

In this notebook, we'll demonstrate how to use pre-trained word embeddings.

To see why word embeddings are useful, it's worth comparing them to the alternative.
Without word embeddings, we might represent each word by a one-hot vector `[0, ..., 0, 1, 0, ..., 0]`
that takes value `1` at the index corresponding to the appropriate vocabulary word, 
and value `0` everywhere else. 
The weight matrices connecting our word-level inputs to the network's hidden layers would each be $v \times h$,
where $v$ is the size of the vocabulary and $h$ is the size of the hidden layer. 
With 100,000 words feeding into an LSTM layer with $1000$ nodes, we would need to learn
$4$ different weight matrices (one for each of the LSTM gates), each with 100M weights, and thus 400 million parameters in total.

Fortunately, it turns out that a number of efficient techniques 
can quickly discover broadly useful word embeddings in an *unsupervised* manner.
These embeddings map each word onto a low-dimensional vector $w \in \mathbb{R}^d$, with $d$ commonly chosen to be roughly $100$.
Intuitively, these embeddings are chosen based on the contexts in which words appear. 
Words that appear in similar contexts (like "tennis" and "racquet") should have similar embeddings
while words that do not (like "rat" and "gourmet") should have dissimilar embeddings.
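
As a rough check on the arithmetic above, the following sketch counts the input-side weights for the one-hot case and compares them with an embedding layer feeding the same LSTM. The embedding size `d = 100` follows the typical value mentioned above. Counting only the four input weight matrices, and ignoring recurrent weights and biases, is a simplifying assumption of this sketch.

```python
# Input-side parameter counts only; recurrent weights and biases are
# ignored (a simplifying assumption made for this comparison).
v = 100_000   # vocabulary size
h = 1_000     # LSTM hidden units
d = 100       # embedding dimension (typical value, assumed here)
gates = 4     # one input weight matrix per LSTM gate

one_hot_params = gates * v * h             # four v x h matrices
embedding_params = v * d + gates * d * h   # lookup table + four d x h matrices

print(one_hot_params)     # 400000000
print(embedding_params)   # 10400000
```

Under these assumptions, nearly all of the remaining parameters sit in the embedding table itself, which is the part that pre-trained vectors can supply.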

Practitioners of deep learning for NLP typically initialize their models
using *pretrained* word embeddings. This brings in outside information and reduces the number of parameters that a neural network needs to learn from scratch.


Two popular word embeddings are GloVe and fastText. 
The following examples use pre-trained word embeddings drawn from these sources:

* GloVe project website: https://nlp.stanford.edu/projects/glove/
* fastText project website: https://fasttext.cc/

To begin, let's first import a few packages that we'll need for this example:

In [1]:
from mxnet import gluon

import warnings
warnings.simplefilter('ignore')

from mxnet import nd
import gluonnlp as nlp
import re

## Pre-trained Word Embeddings

GluonNLP provides a number of pre-trained word embeddings. Below we list the first five GloVe sources.

In [2]:
nlp.embedding.list_sources('glove')[:5]

['glove.42B.300d',
 'glove.6B.100d',
 'glove.6B.200d',
 'glove.6B.300d',
 'glove.6B.50d']

To keep the demonstration light, we use one of the smaller embedding files,
the 50-dimensional `glove.6B.50d`.

In [3]:
glove_6b50d = nlp.embedding.create('glove', source='glove.6B.50d', unknown_token='<unk>')

The output below shows the size of the embedding, including a special unknown token.

In [4]:
print(len(glove_6b50d.idx_to_token))
print(glove_6b50d.idx_to_vec.shape)

400001
(400001, 50)


We can access the mapping between tokens and indices.

In [5]:
print(glove_6b50d.token_to_idx['beautiful'])
print(glove_6b50d.idx_to_token[3367])

3367
beautiful


In [6]:
glove_6b50d['beautiful'][:10]


[ 0.54623   1.2042   -1.1288   -0.1325    0.95529   0.040524 -0.47863
 -0.3397   -0.28056   0.71761 ]
<NDArray 10 @cpu(0)>

### Basic unknown token handling

For some tokens there is no pretrained embedding vector.
This happens if the token was not seen during the pre-training procedure used to learn the embeddings.

In [7]:
print('beautiful' in glove_6b50d)
print('unknownword' in glove_6b50d)

True
False


If an `unknown_token` was specified for the `TokenEmbedding`, such unknown tokens will be replaced by the `unknown_token`:

In [8]:
print(glove_6b50d.idx_to_token[0])
print(glove_6b50d.token_to_idx['unknownword'])

<unk>
0


In [9]:
print(glove_6b50d['unknownword'][:10])


[0. 0. 0. 0. 0. 0. 0. 0. 0. 0.]
<NDArray 10 @cpu(0)>
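
The fallback seen above can be pictured as a dictionary lookup with a default. The sketch below is a toy stand-in, not GluonNLP's implementation: the three-token vocabulary, the made-up vectors, and the helper name `lookup` are all assumptions for illustration.

```python
# Toy stand-in for a token embedding with a reserved unknown token.
# The vectors are made-up values, not real GloVe entries.
idx_to_token = ['<unk>', 'hello', 'world']
token_to_idx = {token: idx for idx, token in enumerate(idx_to_token)}
idx_to_vec = [
    [0.0, 0.0, 0.0],   # '<unk>' is a zero vector here
    [0.1, 0.2, 0.3],   # 'hello' (hypothetical values)
    [0.4, 0.5, 0.6],   # 'world' (hypothetical values)
]

def lookup(token):
    """Return the vector for token, falling back to the '<unk>' row."""
    idx = token_to_idx.get(token, token_to_idx['<unk>'])
    return idx_to_vec[idx]

print(lookup('hello'))        # [0.1, 0.2, 0.3]
print(lookup('unknownword'))  # [0.0, 0.0, 0.0]
```

Every out-of-vocabulary token ends up sharing the same vector, which is the limitation that subword information addresses.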


### Unknown token handling based on subword information

Instead of falling back to a fixed vector for all unknown words,
fastText proposed constructing a vector for unknown words based on
a set of character ngrams.

When loading a pretrained fastText embedding, `load_ngrams=True` can be specified
to load the ngram vectors for improved unknown token handling.

In [10]:
fasttext_emb = nlp.embedding.create('fasttext', source='wiki.simple',
                                            unknown_token=None, load_ngrams=True)
print(fasttext_emb.idx_to_vec.shape)

(111051, 300)


Note that 'unknownword' is still not part of the `TokenEmbedding`'s `idx_to_token` mapping.

In [11]:
'unknownword' in fasttext_emb.idx_to_token

False

However, looking up a vector for `'unknownword'`
now falls back transparently to subword information.

In [12]:
fasttext_emb['unknownword'][:5]


[ 0.0716579  -0.11446684  0.06558052 -0.21048756  0.19394603]
<NDArray 5 @cpu(0)>
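
To give a feel for what "character ngrams" means here, the sketch below lists the ngrams of a word after wrapping it in boundary markers. The function name `char_ngrams`, the `<` and `>` markers, and the default lengths 3 to 6 follow the usual description of fastText but are assumptions of this sketch. It does not show how the library hashes ngrams or how GluonNLP combines the ngram vectors into a word vector.

```python
def char_ngrams(word, min_n=3, max_n=6):
    """List the character ngrams of word, including boundary markers."""
    wrapped = '<' + word + '>'
    ngrams = []
    for n in range(min_n, max_n + 1):
        # Slide a window of width n across the wrapped word.
        for start in range(len(wrapped) - n + 1):
            ngrams.append(wrapped[start:start + n])
    return ngrams

print(char_ngrams('where', min_n=3, max_n=3))
# ['<wh', 'whe', 'her', 'ere', 're>']
```

Because an unseen word such as `'unknownword'` still shares ngrams like `'wor'` and `'ord'` with known words, a vector can be assembled for it from those pieces.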

## Vocabulary

Each pre-trained token embedding is based on a list of tokens for which it provides
vectors (as well as potentially further features for computing vectors for unknown tokens).

However, the tokens included in the pre-trained embedding may not match those required for the task at hand.
GluonNLP provides a vocabulary class, `Vocab`, that holds the information needed to index the words
required for a task and allows embedding vectors to be attached to them.
The vocabulary also handles special tokens, such as sentence separators.


Now we'll demonstrate how to index words.
First, let's assign a unique ID and word vector to each word in the vocabulary
in just a few lines of code.

### Creating Vocabulary from Data Sets

To begin, suppose that we have a simple text data set consisting of newline-separated strings.

In [13]:
text = " hello world \n hello nice world \n hi unknownworld \n"

To start, let's implement a simple tokenizer to separate the words and then count the frequency of each word in the data set.

In [14]:
def simple_tokenize(source_str, token_delim=' ', seq_delim='\n'):
    return filter(None, re.split(token_delim + '|' + seq_delim, source_str))

counter = nlp.data.count_tokens(simple_tokenize(text))

The resulting `counter` behaves like a Python dictionary whose keys are words and whose values are their frequencies.
We can then instantiate a `Vocab` object from it.
Because `counter` tracks word frequencies, we can pass arguments
such as `max_size` and `min_freq` to the `Vocab` constructor to restrict the size of the resulting vocabulary.
If we simply want a `Vocab` that indexes every word in `counter`, we can supply `counter` as the only argument.

In [15]:
vocab = nlp.Vocab(counter)

A `Vocab` object associates each word with an index. We can easily access words by their indices using the `vocab.idx_to_token` attribute.

In [16]:
for word in vocab.idx_to_token:
    print(word)

<unk>
<pad>
<bos>
<eos>
hello
world
hi
nice
unknownworld


In the opposite direction, we can look up the index of a token using `vocab.token_to_idx`.

In [17]:
print(vocab.token_to_idx["<unk>"])
print(vocab.token_to_idx["world"])

0
5
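
The indexing shown above can be mimicked in plain Python, which also makes the effect of a frequency threshold visible. This is only a sketch: the helper `build_index` is hypothetical, and its ordering rule (special tokens first, then descending frequency with alphabetical tie-breaking) is inferred from the output above rather than taken from the `Vocab` implementation.

```python
import re
from collections import Counter

text = " hello world \n hello nice world \n hi unknownworld \n"

# Split on spaces and newlines, dropping the empty strings that result.
tokens = [tok for tok in re.split(' |\n', text) if tok]
counts = Counter(tokens)

def build_index(counts, min_freq=1,
                specials=('<unk>', '<pad>', '<bos>', '<eos>')):
    """Specials first, then words by descending frequency, ties alphabetical."""
    kept = [(tok, freq) for tok, freq in counts.items() if freq >= min_freq]
    kept.sort(key=lambda pair: (-pair[1], pair[0]))
    return list(specials) + [tok for tok, _ in kept]

print(build_index(counts))
# ['<unk>', '<pad>', '<bos>', '<eos>', 'hello', 'world', 'hi', 'nice', 'unknownworld']
print(build_index(counts, min_freq=2))
# ['<unk>', '<pad>', '<bos>', '<eos>', 'hello', 'world']
```

With `min_freq=2`, the three words that occur only once are dropped and would map to the unknown token at lookup time.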


### Attaching word embeddings

To attach the word embeddings `fasttext_emb` to indexed words in `vocab`, we simply call vocab's `set_embedding` method:

In [18]:
vocab.set_embedding(fasttext_emb)

In [19]:
print('Vocabulary has length', len(vocab))
print(vocab.idx_to_token)
print(vocab.embedding.idx_to_vec.shape)

Vocabulary has length 9
['<unk>', '<pad>', '<bos>', '<eos>', 'hello', 'world', 'hi', 'nice', 'unknownworld']
(9, 300)


A `TokenEmbedding` is now attached to the vocabulary, providing pre-computed vectors
for all tokens in the vocabulary.

Note that the word `unknownworld` is not part of the fastText embedding,
but that a vector for it was constructed based on the character ngram vectors described above. 

In [20]:
print('unknownworld' in vocab)
print('unknownworld' in fasttext_emb)
print(fasttext_emb['unknownworld'][:10])
print(vocab.embedding['unknownworld'][:10])

True
False

[ 0.05788881 -0.19273444  0.03354993 -0.10133035  0.00389384  0.39772603
 -0.12440216  0.08243105  0.10507106  0.08101919]
<NDArray 10 @cpu(0)>

[ 0.05788881 -0.19273444  0.03354993 -0.10133035  0.00389384  0.39772603
 -0.12440216  0.08243105  0.10507106  0.08101919]
<NDArray 10 @cpu(0)>


#### Known words

Let us check the shape of the embedding of words 'hello' and 'world' from
`vocab`.

In [21]:
vocab.embedding['hello', 'world'].shape

(2, 300)

We can access the first five elements of the embedding of 'hello' and 'world'.

In [22]:
vocab.embedding['hello', 'world'][:, :5]


[[ 0.39567   0.21454  -0.035389 -0.24299  -0.095645]
 [ 0.10444  -0.10858   0.27212   0.13299  -0.33165 ]]
<NDArray 2x5 @cpu(0)>

### Using Pre-trained Word Embeddings in Gluon

To demonstrate how to use pre-trained word embeddings in Gluon, let us first obtain the indices of the words
'hello' and 'world'.

In [23]:
vocab['hello', 'world']

[4, 5]

We can obtain the vectors for the words 'hello' and 'world' by loading the weight
matrix `vocab.embedding.idx_to_vec` into a `gluon.nn.Embedding` layer and passing
it their indices (5 for 'world' and 4 for 'hello', in that order).

In [24]:
input_dim, output_dim = vocab.embedding.idx_to_vec.shape
layer = gluon.nn.Embedding(input_dim, output_dim)
layer.initialize()
layer.weight.set_data(vocab.embedding.idx_to_vec)

In [25]:
layer(nd.array([5, 4]))[:, :5]


[[ 0.10444  -0.10858   0.27212   0.13299  -0.33165 ]
 [ 0.39567   0.21454  -0.035389 -0.24299  -0.095645]]
<NDArray 2x5 @cpu(0)>
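
Conceptually, the forward pass of an embedding layer is a row lookup into its weight matrix, which is why passing the indices `[5, 4]` returns the 'world' row followed by the 'hello' row. The sketch below illustrates this with a hypothetical 3 x 2 weight matrix and a helper named `embed`; neither is part of Gluon.

```python
# Hypothetical 3 x 2 weight matrix: one row per vocabulary index.
weight = [
    [0.0, 0.0],    # index 0
    [0.5, -0.5],   # index 1
    [1.0, 2.0],    # index 2
]

def embed(indices, weight):
    """Return the rows of weight selected by indices, in the given order."""
    return [weight[i] for i in indices]

print(embed([2, 1], weight))   # [[1.0, 2.0], [0.5, -0.5]]
```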

### Creating Vocabulary from Pre-trained Word Embeddings

We can also build a vocabulary directly from the tokens of a pre-trained word embedding, such as GloVe.

Here we create a vocabulary from all the tokens in `glove_6b50d`.

In [26]:
vocab = nlp.Vocab(nlp.data.Counter(glove_6b50d.idx_to_token))
vocab.set_embedding(glove_6b50d)

The output below shows the size of `vocab`, which now also counts the special tokens (`<unk>`, `<pad>`, `<bos>`, `<eos>`) that `Vocab` adds by default.

In [27]:
len(vocab.idx_to_token)

400004

As before, we can map between tokens and indices in `vocab`.

In [28]:
print(vocab['beautiful'])
print(vocab.idx_to_token[71424])

71424
beautiful


## Applications of Word Embeddings

To apply word embeddings, we first define
cosine similarity, which measures how similar the directions of two vectors are.

In [29]:
from mxnet import nd
def cos_sim(x, y):
    return nd.dot(x, y) / (nd.norm(x) * nd.norm(y))

The cosine similarity between two vectors lies between -1 and 1. The
larger the value, the greater the similarity between the two vectors.

In [30]:
x = nd.array([1, 2])
y = nd.array([10, 20])
z = nd.array([-1, -2])

print(cos_sim(x, y))
print(cos_sim(x, z))


[1.]
<NDArray 1 @cpu(0)>

[-1.]
<NDArray 1 @cpu(0)>
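
For readers who want to check the definition without MXNet, here is a plain-Python sketch of the same formula. The name `cosine_similarity` is chosen for this sketch, and it assumes both vectors are non-zero and of equal length. It also covers the orthogonal case, which sits in the middle of the range.

```python
import math

def cosine_similarity(x, y):
    """Cosine of the angle between two equal-length, non-zero vectors."""
    dot = sum(a * b for a, b in zip(x, y))
    norm_x = math.sqrt(sum(a * a for a in x))
    norm_y = math.sqrt(sum(b * b for b in y))
    return dot / (norm_x * norm_y)

print(cosine_similarity([1, 0], [0, 1]))              # 0.0 (orthogonal)
print(round(cosine_similarity([1, 2], [10, 20]), 6))  # 1.0 (same direction)
print(round(cosine_similarity([1, 2], [-1, -2]), 6))  # -1.0 (opposite direction)
```

Scaling a vector does not change the result, which is why `[1, 2]` and `[10, 20]` are treated as maximally similar.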


### Word Similarity

Given an input word, we can find the $k$ nearest words in
the vocabulary (400,000 words excluding the special tokens) by similarity. The
similarity between any pair of words can be measured by the cosine similarity
of their vectors.

In [31]:
def norm_vecs_by_row(x):
    return x / nd.sqrt(nd.sum(x * x, axis=1) + 1E-10).reshape((-1,1))

In [32]:
def get_knn(vocab, k, word):
    word_vec = vocab.embedding[word].reshape((-1, 1))
    vocab_vecs = norm_vecs_by_row(vocab.embedding.idx_to_vec)
    dot_prod = nd.dot(vocab_vecs, word_vec)
    indices = nd.topk(dot_prod.reshape((len(vocab), )), k=k+1, ret_typ='indices')
    indices = [int(i.asscalar()) for i in indices]
    # Remove input tokens.
    return vocab.to_tokens(indices[1:])

Let us find the 5 words most similar to 'baby' in the vocabulary (size:
400,000 words).

In [33]:
get_knn(vocab, 5, 'baby')

['babies', 'boy', 'girl', 'newborn', 'pregnant']

We can verify the cosine similarity of vectors of 'baby' and 'babies'.

In [34]:
cos_sim(vocab.embedding['baby'], vocab.embedding['babies'])


[0.83871305]
<NDArray 1 @cpu(0)>

Let us find the 5 words most similar to 'computers' in the vocabulary.

In [35]:
get_knn(vocab, 5, 'computers')

['computer', 'phones', 'pcs', 'machines', 'devices']

Let us find the 5 words most similar to 'run' in the vocabulary.

In [36]:
get_knn(vocab, 5, 'run')

['running', 'runs', 'went', 'start', 'ran']

Let us find the 5 words most similar to 'beautiful' in the vocabulary.

In [37]:
get_knn(vocab, 5, 'beautiful')

['lovely', 'gorgeous', 'wonderful', 'charming', 'beauty']
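
The ranking logic behind these queries can be reproduced on a toy scale without MXNet. In the sketch below, the four 2-dimensional vectors are hypothetical values chosen only to illustrate the ranking, and `normalize` and `nearest` are names introduced for this example.

```python
import math

# Hypothetical 2-d vectors chosen only to illustrate the ranking.
toy_vectors = {
    'baby':   [1.0, 0.1],
    'babies': [0.9, 0.2],
    'boy':    [0.7, 0.6],
    'car':    [0.0, 1.0],
}

def normalize(vec):
    """Scale vec to unit length."""
    norm = math.sqrt(sum(a * a for a in vec))
    return [a / norm for a in vec]

def nearest(word, k, vectors):
    """Return the k words most cosine-similar to word, excluding word itself."""
    query = normalize(vectors[word])
    scores = {
        other: sum(a * b for a, b in zip(query, normalize(vec)))
        for other, vec in vectors.items() if other != word
    }
    return sorted(scores, key=scores.get, reverse=True)[:k]

print(nearest('baby', 2, toy_vectors))   # ['babies', 'boy']
```

The notebook's `get_knn` requests `k+1` results and drops the first one on the expectation that it is the query word; this sketch filters the query out explicitly instead.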

### Word Analogy

We can also apply pre-trained word embeddings to the word
analogy problem. 

For instance, "man : woman :: son : daughter" is an analogy.

The word analogy completion problem is defined as: for analogy 'a : b :: c : d',
given the first three words 'a', 'b', 'c', find 'd'. The idea is to find the
most similar word vector for vec('c') + (vec('b')-vec('a')).

In this example, we will find words by analogy from the 400,000 indexed words in `vocab`.

In [38]:
def get_top_k_by_analogy(vocab, k, word1, word2, word3):
    word_vecs = vocab.embedding[word1, word2, word3]
    word_diff = (word_vecs[1] - word_vecs[0] + word_vecs[2]).reshape((-1, 1))
    vocab_vecs = norm_vecs_by_row(vocab.embedding.idx_to_vec)
    dot_prod = nd.dot(vocab_vecs, word_diff)
    indices = nd.topk(dot_prod.reshape((len(vocab), )), k=k, ret_typ='indices')
    indices = [int(i.asscalar()) for i in indices]
    return vocab.to_tokens(indices)

Complete the word analogy 'man : woman :: son : '.

In [39]:
get_top_k_by_analogy(vocab, 1, 'man', 'woman', 'son')

['daughter']

Let us verify the cosine similarity between vec('son') + vec('woman') - vec('man')
and vec('daughter').

In [40]:
def cos_sim_word_analogy(vocab, word1, word2, word3, word4):
    words = [word1, word2, word3, word4]
    vecs = vocab.embedding[words]
    return cos_sim(vecs[1] - vecs[0] + vecs[2], vecs[3])

cos_sim_word_analogy(vocab, 'man', 'woman', 'son', 'daughter')


[0.9658341]
<NDArray 1 @cpu(0)>

Complete the word analogy 'beijing : china :: tokyo : '.

In [41]:
get_top_k_by_analogy(vocab, 1, 'beijing', 'china', 'tokyo')

['japan']

Complete the word analogy 'bad : worst :: big : '.

In [42]:
get_top_k_by_analogy(vocab, 1, 'bad', 'worst', 'big')

['biggest']

Complete the word analogy 'do : did :: go : '.

In [43]:
get_top_k_by_analogy(vocab, 1, 'do', 'did', 'go')

['went']
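
The vector arithmetic behind these completions can be traced by hand on a toy example. The 2-dimensional vectors below are hypothetical and far tidier than real embeddings, and `cosine` and `complete_analogy` are names introduced for this sketch. Unlike `get_top_k_by_analogy` above, the sketch skips the three input words when ranking candidates, which is a design choice of this example.

```python
import math

# Hypothetical 2-d vectors; real embeddings are not this tidy.
toy_vectors = {
    'man':      [1.0, 0.0],
    'woman':    [1.0, 1.0],
    'son':      [2.0, 0.0],
    'daughter': [2.0, 1.0],
    'king':     [3.0, 0.0],
    'queen':    [3.0, 1.0],
}

def cosine(x, y):
    """Cosine similarity of two equal-length, non-zero vectors."""
    dot = sum(a * b for a, b in zip(x, y))
    norm_x = math.sqrt(sum(a * a for a in x))
    norm_y = math.sqrt(sum(b * b for b in y))
    return dot / (norm_x * norm_y)

def complete_analogy(a, b, c, vectors):
    """Return the word closest to vec(c) + vec(b) - vec(a), skipping a, b and c."""
    target = [vc + vb - va
              for va, vb, vc in zip(vectors[a], vectors[b], vectors[c])]
    candidates = [w for w in vectors if w not in (a, b, c)]
    return max(candidates, key=lambda w: cosine(target, vectors[w]))

print(complete_analogy('man', 'woman', 'son', toy_vectors))   # daughter
```

Here the target vector is `[2.0, 1.0]`, which coincides with the toy vector for 'daughter', so it outranks the 'king' and 'queen' distractors.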