# Pre-trained Word Embeddings

This tutorial is based on the GluonNLP example at https://gluon-nlp.mxnet.io/examples/word_embedding/word_embedding.html

In [1]:
import warnings
warnings.filterwarnings('ignore')

from mxnet import gluon
from mxnet import nd
import gluonnlp as nlp
import re

In [27]:
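# Korean sample sentence, roughly: "U.S. President Donald Trump, who enters the
# second half of his term next January, has signaled a cabinet reshuffle."
text = "내년 1월 집권 후반기를 맞는 도널드 트럼프 미국 대통령이 개각을 예고했다."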

In [28]:
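def simple_tokenize(source_str, token_delim=' ', seq_delim='\n'):
    # Split on the token and sequence delimiters, dropping empty strings.
    return filter(None, re.split(token_delim + '|' + seq_delim, source_str))

# Count how often each token occurs in the text.
counter = nlp.data.count_tokens(simple_tokenize(text))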

In [29]:
counter

Counter({'내년': 1,
         '1월': 1,
         '집권': 1,
         '후반기를': 1,
         '맞는': 1,
         '도널드': 1,
         '트럼프': 1,
         '미국': 1,
         '대통령이': 1,
         '개각을': 1,
         '예고했다.': 1})

In [5]:
len(counter)

11
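
For readers without GluonNLP at hand, the counting step can be approximated with the standard library. This is a minimal sketch that assumes `nlp.data.count_tokens` behaves like `collections.Counter` on the token stream for this simple input; `toy_counter` is a name introduced here for illustration. Note that the final period stays attached to `예고했다.` because the tokenizer only splits on spaces and newlines.

```python
import re
from collections import Counter

text = "내년 1월 집권 후반기를 맞는 도널드 트럼프 미국 대통령이 개각을 예고했다."

def simple_tokenize(source_str, token_delim=' ', seq_delim='\n'):
    # Split on either delimiter and drop empty strings.
    return filter(None, re.split(token_delim + '|' + seq_delim, source_str))

# Stand-in for nlp.data.count_tokens (assumption: equivalent for this input).
toy_counter = Counter(simple_tokenize(text))

print(len(toy_counter))             # 11 distinct tokens
print('예고했다' in toy_counter)     # False: only '예고했다.' (with the period) was counted
```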

In [30]:
vocab = nlp.Vocab(counter)

In [31]:
for word in vocab.idx_to_token:
    print(word)

<unk>
<pad>
<bos>
<eos>
1월
개각을
내년
대통령이
도널드
맞는
미국
예고했다.
집권
트럼프
후반기를


In [32]:
print(vocab.token_to_idx["<unk>"])
print(vocab.token_to_idx["world"])

0
0
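
The indexing behaviour seen above (four special tokens first, then the counted tokens, with unknown words mapped to index 0) can be imitated in a few lines of plain Python. This is only a sketch under an assumption: tokens are ordered by descending frequency and then by token string, which is consistent with the listing printed above but is not taken from the GluonNLP source. `build_toy_vocab` and `lookup` are hypothetical helpers.

```python
from collections import Counter

SPECIAL_TOKENS = ['<unk>', '<pad>', '<bos>', '<eos>']

def build_toy_vocab(counter):
    # Assumption: order by descending frequency, then by token string,
    # which reproduces the idx_to_token listing shown above.
    ordered = sorted(counter.items(), key=lambda kv: (-kv[1], kv[0]))
    idx_to_token = SPECIAL_TOKENS + [tok for tok, _ in ordered]
    token_to_idx = {tok: i for i, tok in enumerate(idx_to_token)}
    return idx_to_token, token_to_idx

def lookup(token_to_idx, token):
    # Unknown tokens fall back to the index of <unk>.
    return token_to_idx.get(token, token_to_idx['<unk>'])

sentence = "내년 1월 집권 후반기를 맞는 도널드 트럼프 미국 대통령이 개각을 예고했다."
toy_counter = Counter(sentence.split(' '))
idx_to_token, token_to_idx = build_toy_vocab(toy_counter)

print(len(idx_to_token))                  # 15 = 11 tokens + 4 special tokens
print(lookup(token_to_idx, '개각을'))      # 5
print(lookup(token_to_idx, 'world'))      # 0, the index of <unk>
```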


## Attaching word embeddings

List of pre-trained embedding sources provided by GluonNLP:<br>
https://github.com/dmlc/gluon-nlp/blob/d49a7896ae92307cf3c930f2eb2e3d516a278fe7/src/gluonnlp/_constants.py

In [37]:
fasttext_simple = nlp.embedding.create('fasttext', source='wiki.ko')

To attach the newly loaded word embeddings fasttext_simple to indexed words in vocab, we simply call vocab’s set_embedding method:

In [38]:
vocab.set_embedding(fasttext_simple)

To see other available sources of pre-trained word embeddings trained with the fastText algorithm, we can call nlp.embedding.list_sources.

In [39]:
nlp.embedding.list_sources('fasttext')[:10]

['crawl-300d-2M',
 'crawl-300d-2M-subword',
 'wiki.aa',
 'wiki.ab',
 'wiki.ace',
 'wiki.ady',
 'wiki.af',
 'wiki.ak',
 'wiki.als',
 'wiki.am']

In [40]:
len(vocab)  # len(counter) + <unk>, <pad>, <bos>, <eos>

15

By default, the vector of any token that is unknown to vocab is a zero vector.<br> 
Its length is equal to the vector dimension of the fastText word embeddings: 300.

In [41]:
vocab.embedding['없는단어'].shape

(300,)

In [15]:
vocab.embedding['없는단어'][:10]


[0. 0. 0. 0. 0. 0. 0. 0. 0. 0.]
<NDArray 10 @cpu(0)>

In [42]:
vocab.embedding['개각을'].shape

(300,)

In [51]:
vocab.embedding['내년'][:10]


[ 0.16632  -0.05013  -0.49189  -0.067787  0.16502   0.66457  -0.086991
 -0.26712   0.23955   0.17926 ]
<NDArray 10 @cpu(0)>
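
The zero-vector fallback can be pictured as a dictionary lookup with a default value. The class below is a hypothetical stand-in written for illustration, not GluonNLP's implementation, and the three-dimensional vector is made up.

```python
class ToyEmbedding:
    """Hypothetical token embedding with a zero-vector fallback."""

    def __init__(self, table, dim):
        self.table = table  # maps token -> list of floats
        self.dim = dim      # vector dimension

    def __getitem__(self, token):
        # Known tokens return their stored vector; anything else returns zeros.
        return self.table.get(token, [0.0] * self.dim)

# Illustrative values only; the real fastText vectors have 300 dimensions.
toy = ToyEmbedding({'내년': [0.17, -0.05, -0.49]}, dim=3)

print(toy['내년'])       # the stored vector
print(toy['없는단어'])   # [0.0, 0.0, 0.0]
```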

For reference, see the API documentation of embedding.TokenEmbedding:<br>
https://gluon-nlp.mxnet.io/api/embedding.html?highlight=embedding#gluonnlp.embedding.TokenEmbedding

## Using Pre-trained Word Embeddings in Gluon

TODO: Explain why the vectors are loaded into a gluon.nn.Embedding layer instead of being used directly.

In [52]:
vocab['개각을', '내년']

[5, 6]

In [46]:
vocab.embedding.idx_to_vec


[[ 0.        0.        0.       ...  0.        0.        0.      ]
 [ 0.        0.        0.       ...  0.        0.        0.      ]
 [ 0.        0.        0.       ...  0.        0.        0.      ]
 ...
 [ 0.04047  -0.23973  -0.028091 ...  0.38481  -0.32082  -0.4323  ]
 [-0.51446   0.17012  -0.59943  ...  0.18196   0.015024 -0.11012 ]
 [ 0.049091  0.18873  -0.16464  ... -0.2455    0.18002   0.283   ]]
<NDArray 15x300 @cpu(0)>

We can obtain the vectors for the words ‘개각을’ and ‘내년’ by passing their indices (5 and 6) to a gluon.nn.Embedding layer<br>
whose weight matrix is set to <span style="color:red">vocab.embedding.idx_to_vec</span>.

In [47]:
input_dim, output_dim = vocab.embedding.idx_to_vec.shape
layer = gluon.nn.Embedding(input_dim, output_dim)
layer.initialize()
layer.weight.set_data(vocab.embedding.idx_to_vec)

In [53]:
nd.array([5, 6])


[5. 6.]
<NDArray 2 @cpu(0)>

In [58]:
layer(nd.array([5, 6]))[:, :5]


[[-0.11939   0.14281  -0.4036   -0.09584  -0.24916 ]
 [ 0.16632  -0.05013  -0.49189  -0.067787  0.16502 ]]
<NDArray 2x5 @cpu(0)>
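
Conceptually, the forward pass of an embedding layer is a row lookup in its weight matrix. Wrapping the pre-trained vectors in a layer mainly matters when the lookup should be part of a larger network, for example so the vectors can be fine-tuned. The sketch below uses a made-up weight matrix and a hypothetical `embedding_forward` helper to show the lookup itself.

```python
# Toy weight matrix: one row per vocabulary index (values are made up).
weights = [
    [0.0, 0.0],  # index 0: <unk>
    [0.1, 0.2],  # index 1
    [0.3, 0.4],  # index 2
]

def embedding_forward(weights, indices):
    # Select row i of the weight matrix for every index i in the input.
    return [weights[i] for i in indices]

print(embedding_forward(weights, [2, 1]))  # [[0.3, 0.4], [0.1, 0.2]]
```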

## Creating Vocabulary from Pre-trained Word Embeddings

We can also create a vocabulary from the tokens of a pre-trained word embedding, such as GloVe. Below are a few of the pre-trained sources available for GloVe.

In [59]:
nlp.embedding.list_sources('glove')[:5]

['glove.42B.300d',
 'glove.6B.100d',
 'glove.6B.200d',
 'glove.6B.300d',
 'glove.6B.50d']

In [60]:
glove_6b50d = nlp.embedding.create('glove', source='glove.6B.50d')

Embedding file glove.6B.50d.npz is not found. Downloading from Gluon Repository. This may take some time.
Downloading /home/chatbot/.mxnet/embedding/glove/glove.6B.50d.npz from https://apache-mxnet.s3-accelerate.dualstack.amazonaws.com/gluon/embeddings/glove/glove.6B.50d.npz...


In [62]:
vocab = nlp.Vocab(nlp.data.Counter(glove_6b50d.idx_to_token))
vocab.set_embedding(glove_6b50d)

In [64]:
len(vocab.idx_to_token)

400004

In [65]:
print(vocab['beautiful'])
print(vocab.idx_to_token[71424])

71424
beautiful


## Applications of Word Embeddings

To apply word embeddings, we first define cosine similarity, which measures how similar the directions of two vectors are.
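
As a reminder, the cosine similarity of two non-zero vectors $x$ and $y$ is conventionally defined as

$$
\cos(x, y) = \frac{x \cdot y}{\lVert x \rVert \, \lVert y \rVert}
$$

It ranges from $-1$ (opposite directions) to $1$ (same direction) and does not depend on the lengths of the vectors.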



In [66]:
def cos_sim(x, y):
    return nd.dot(x, y) / (nd.norm(x) * nd.norm(y))

In [67]:
x = nd.array([1, 2])
y = nd.array([10, 20])
z = nd.array([-1, -2])

print(cos_sim(x, y))
print(cos_sim(x, z))


[1.]
<NDArray 1 @cpu(0)>

[-1.]
<NDArray 1 @cpu(0)>


## Word Similarity

Given an input word, we can find its k nearest words in the vocabulary (400,000 words, excluding the special tokens) by similarity. The similarity between any pair of words is measured by the cosine similarity of their vectors.

In [68]:
def norm_vecs_by_row(x):
    return x / nd.sqrt(nd.sum(x * x, axis=1) + 1E-10).reshape((-1,1))

def get_knn(vocab, k, word):
    word_vec = vocab.embedding[word].reshape((-1, 1))
    vocab_vecs = norm_vecs_by_row(vocab.embedding.idx_to_vec)
    dot_prod = nd.dot(vocab_vecs, word_vec)
    indices = nd.topk(dot_prod.reshape((len(vocab), )), k=k+1, ret_typ='indices')
    indices = [int(i.asscalar()) for i in indices]
    # Drop the top hit, which is the input word itself.
    return vocab.to_tokens(indices[1:])

In [69]:
get_knn(vocab, 5, 'baby')

['babies', 'boy', 'girl', 'newborn', 'pregnant']
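
The same nearest-neighbour idea can be tried on a tiny, made-up vocabulary without MXNet. This is a sketch for illustration only: `toy_knn` and the two-dimensional vectors are hypothetical, and the query word is excluded explicitly instead of dropping the top hit.

```python
import math

def normalize(vec):
    # Scale a vector to unit length; the small epsilon avoids division by zero.
    norm = math.sqrt(sum(v * v for v in vec) + 1e-10)
    return [v / norm for v in vec]

def toy_knn(vectors, k, word):
    # Rank every other token by cosine similarity to `word`.
    query = normalize(vectors[word])
    scores = {
        tok: sum(a * b for a, b in zip(normalize(vec), query))
        for tok, vec in vectors.items() if tok != word
    }
    return sorted(scores, key=scores.get, reverse=True)[:k]

# Made-up 2-d vectors, for illustration only.
toy_vectors = {
    'baby': [1.0, 0.9],
    'babies': [1.0, 1.0],
    'car': [-1.0, 0.2],
    'boy': [0.8, 0.3],
}

print(toy_knn(toy_vectors, 2, 'baby'))  # ['babies', 'boy']
```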

In [70]:
cos_sim(vocab.embedding['baby'], vocab.embedding['babies'])


[0.83871305]
<NDArray 1 @cpu(0)>

In [71]:
get_knn(vocab, 5, 'computers')

['computer', 'phones', 'pcs', 'machines', 'devices']

In [72]:
get_knn(vocab, 5, 'run')

['running', 'runs', 'went', 'start', 'ran']

In [73]:
get_knn(vocab, 5, 'beautiful')

['lovely', 'gorgeous', 'wonderful', 'charming', 'beauty']

## Word Analogy

In [74]:
def get_top_k_by_analogy(vocab, k, word1, word2, word3):
    word_vecs = vocab.embedding[word1, word2, word3]
    word_diff = (word_vecs[1] - word_vecs[0] + word_vecs[2]).reshape((-1, 1))
    vocab_vecs = norm_vecs_by_row(vocab.embedding.idx_to_vec)
    dot_prod = nd.dot(vocab_vecs, word_diff)
    indices = nd.topk(dot_prod.reshape((len(vocab), )), k=k, ret_typ='indices')
    indices = [int(i.asscalar()) for i in indices]
    return vocab.to_tokens(indices)

In [76]:
get_top_k_by_analogy(vocab, 1, 'man', 'woman', 'son')

['daughter']
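
The analogy query "word1 is to word2 as word3 is to ?" looks for the token whose vector is closest to vec(word2) - vec(word1) + vec(word3). The sketch below replays that arithmetic on made-up two-dimensional vectors; `toy_analogy` is a hypothetical helper and, like the function above, it does not exclude the input words from the candidates.

```python
import math

def cosine(x, y):
    # Cosine similarity of two non-zero vectors.
    dot = sum(a * b for a, b in zip(x, y))
    norm_x = math.sqrt(sum(a * a for a in x))
    norm_y = math.sqrt(sum(b * b for b in y))
    return dot / (norm_x * norm_y)

def toy_analogy(vectors, word1, word2, word3):
    # Target vector: vec(word2) - vec(word1) + vec(word3).
    target = [b - a + c for a, b, c in
              zip(vectors[word1], vectors[word2], vectors[word3])]
    # Return the token whose vector is closest to the target.
    return max(vectors, key=lambda tok: cosine(vectors[tok], target))

# Made-up vectors: the first axis separates generations, the second gender.
toy_vectors = {
    'man': [1.0, 0.0],
    'woman': [1.0, 1.0],
    'son': [2.0, 0.0],
    'daughter': [2.0, 1.0],
}

print(toy_analogy(toy_vectors, 'man', 'woman', 'son'))  # daughter
```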

In [77]:
def cos_sim_word_analogy(vocab, word1, word2, word3, word4):
    words = [word1, word2, word3, word4]
    vecs = vocab.embedding[words]
    return cos_sim(vecs[1] - vecs[0] + vecs[2], vecs[3])

cos_sim_word_analogy(vocab, 'man', 'woman', 'son', 'daughter')


[0.9658341]
<NDArray 1 @cpu(0)>

In [78]:
get_top_k_by_analogy(vocab, 1, 'bad', 'worst', 'big')

['biggest']

In [79]:
get_top_k_by_analogy(vocab, 1, 'do', 'did', 'go')

['went']