# Embedding with word2vec

In this notebook, I show how to learn vector representations of words with the Skip-Gram model, following TensorFlow's official [tutorial](https://www.tensorflow.org/tutorials/representation/word2vec). The original paper is by [Mikolov et al.](https://arxiv.org/pdf/1301.3781.pdf). The code below is based on the [basic implementation of word2vec](https://github.com/tensorflow/tensorflow/blob/master/tensorflow/examples/tutorials/word2vec/word2vec_basic.py) on GitHub, reduced to the main body of the implementation. More advanced code can be found at [this link](https://github.com/tensorflow/models/blob/master/tutorials/embedding/word2vec.py).

In [1]:
import tensorflow as tf
import numpy as np

## Fetch data

In [2]:
import sys
import os
import urllib.request

In [3]:
url = 'http://mattmahoney.net/dc/'

In [4]:
def fetch_data(filename, expected_bytes=None):
    '''Download filename from url into ./data if it is not there yet, then check its size'''
    data_dir = os.path.join(os.getcwd(), 'data')
    if not os.path.exists(data_dir):
        os.makedirs(data_dir)
    local_filename = os.path.join(data_dir, filename)
    if not os.path.exists(local_filename):
        local_filename, _ = urllib.request.urlretrieve(url+filename, local_filename)
    
    filesize = os.stat(local_filename).st_size
    if expected_bytes and filesize == expected_bytes:
        print('Found and verified', filename)
    else:
        print('Downloaded file', filename, 'with size of', filesize)
        if expected_bytes:
            raise Exception('Failed to verify ' + local_filename)
    return local_filename

In [5]:
filename = fetch_data('text8.zip', 31344016)

Found and verified text8.zip


## Read data

In [6]:
import zipfile

In [7]:
def read_data(filename):
    with zipfile.ZipFile(filename) as f:
        data = tf.compat.as_str(f.read(f.namelist()[0])).split()
    return data

In [8]:
vocabulary = read_data(filename)

In [9]:
vocabulary[:10]

['anarchism',
 'originated',
 'as',
 'a',
 'term',
 'of',
 'abuse',
 'first',
 'used',
 'against']

## Build dictionary and replace rare words with UNK token

In [10]:
import collections

In [11]:
vocabulary_size = 50000

In [12]:
def build_dicts(words, n_words):
    '''Build reference dictionaries'''
    count = [['UNK', -1]] # -1 is only a placeholder; it is overwritten with unk_count below
    count.extend(collections.Counter(words).most_common(n_words-1))
    word2ind = {}
    for word, _ in count:
        word2ind[word] = len(word2ind)
    data = []
    unk_count = 0
    for word in words:
        index = word2ind.get(word, 0)
        if index == 0:
            unk_count += 1
        data.append(index)
    count[0][1] = unk_count
    ind2word = dict(zip(word2ind.values(), word2ind.keys()))
    return data, count, word2ind, ind2word

In [13]:
data, count, word2ind, ind2word = build_dicts(vocabulary, vocabulary_size)

In [14]:
del vocabulary

In [15]:
print('Most common words (+UNK)', count[:5])
print('Sample data', data[:10], [ind2word[_] for _ in data[:10]])

Most common words (+UNK) [['UNK', 418391], ('the', 1061396), ('of', 593677), ('and', 416629), ('one', 411764)]
Sample data [5234, 3081, 12, 6, 195, 2, 3134, 46, 59, 156] ['anarchism', 'originated', 'as', 'a', 'term', 'of', 'abuse', 'first', 'used', 'against']
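
To make the UNK mapping easier to see, here is a minimal sketch of the same idea on a toy corpus. The names `toy_words`, `toy_vocab_size`, `toy_count`, `toy_word2ind` and `toy_data` are hypothetical and exist only for this illustration; the logic mirrors `build_dicts` but is condensed.

```python
import collections

toy_words = 'the cat sat on the mat the cat'.split()
toy_vocab_size = 3  # hypothetical: UNK plus the 2 most frequent words

# Index 0 is reserved for UNK; the remaining indices follow frequency rank
toy_count = [('UNK', 0)] + collections.Counter(toy_words).most_common(toy_vocab_size - 1)
toy_word2ind = {word: i for i, (word, _) in enumerate(toy_count)}

# Any word outside the vocabulary falls back to index 0 (UNK)
toy_data = [toy_word2ind.get(word, 0) for word in toy_words]

print(toy_word2ind)  # {'UNK': 0, 'the': 1, 'cat': 2}
print(toy_data)      # [1, 2, 0, 0, 1, 0, 1, 2]
```

The rare words `sat`, `on` and `mat` all collapse to index 0, which is why the UNK count in the real output above is so large.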


## Generate training batches using skip-gram

For each center word, randomly pick `num_skips` words from its context window (`skip_window` words on either side). Each pick gives one training pair: the center word is the input, and the context word is the label to predict. Collecting these pairs yields a training batch.
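
As a rough illustration of the pairing, the sketch below lists every (center, context) pair for a short token list. `skipgram_pairs` and `toy_tokens` are hypothetical names used only here. Unlike the `generate_batch` function below, this sketch does not sample `num_skips` words per center and it also emits pairs for words at the edges, so treat it as a simplified picture rather than the batching logic itself.

```python
def skipgram_pairs(tokens, skip_window):
    """Return every (center, context) pair within skip_window of each center."""
    pairs = []
    for i, center in enumerate(tokens):
        lo = max(0, i - skip_window)
        hi = min(len(tokens), i + skip_window + 1)
        for j in range(lo, hi):
            if j != i:  # a word is not its own context
                pairs.append((center, tokens[j]))
    return pairs

toy_tokens = ['anarchism', 'originated', 'as', 'a', 'term']
toy_pairs = skipgram_pairs(toy_tokens, skip_window=1)
print(toy_pairs[:3])
# [('anarchism', 'originated'), ('originated', 'anarchism'), ('originated', 'as')]
```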

In [16]:
import random

In [17]:
data_index = 0

In [18]:
def generate_batch(batch_size, num_skips, skip_window):
    global data_index
    assert batch_size%num_skips == 0 # each center word contributes exactly num_skips pairs
    assert num_skips < 2*skip_window+1
    batch = np.ndarray(shape=(batch_size), dtype=np.int32)
    labels = np.ndarray(shape=(batch_size, 1), dtype=np.int32)
    span = 2*skip_window+1 # [skip_window, target, skip_window]
    buffer = collections.deque(maxlen=span) # create a queue to store window
    if data_index+span > len(data):
        data_index = 0 # reinitialize if current window exceeds the length
    buffer.extend(data[data_index:data_index+span]) # append current window to the queue
    data_index += span # data_index now points at the next word to enter the window
    # create a list of indexes for context words
    context_words = [item for item in range(span) if item != skip_window]
    for i in range(batch_size//num_skips):
        words_from_window = random.sample(context_words, num_skips)
        for j, target_word in enumerate(words_from_window):
            # use center word as data
            batch[i*num_skips+j] = buffer[skip_window]
            # use random word from skip window as target
            labels[i*num_skips+j, 0] = buffer[target_word] 
        if data_index == len(data):
            # wrap around to the start once the end of the data is reached
            buffer.extend(data[:span])
            data_index = span
        else:
            buffer.append(data[data_index])
            data_index += 1
    # backtrack a little so the words at the end of this batch are not skipped by the next one
    data_index = (data_index+len(data)-span)%len(data)
    return batch, labels

A glance at a small training batch, shown as (input word, label word) pairs:

In [19]:
batch, labels = generate_batch(batch_size=8, num_skips=2, skip_window=1)
[(ind2word[i], ind2word[j[0]]) for i, j in zip(batch, labels)]

[('originated', 'as'),
 ('originated', 'anarchism'),
 ('as', 'a'),
 ('as', 'originated'),
 ('a', 'term'),
 ('a', 'as'),
 ('term', 'a'),
 ('term', 'of')]

## Skip-gram model

Define some constants

In [20]:
BATCH_SIZE = 128
EMBEDDING_SIZE = 128 # dimension of the embedding vector
SKIP_WINDOW = 1
NUM_SKIPS = 2
NUM_SAMPLED = 64

Some parameters for validation

In [21]:
VALID_SIZE = 16
VALID_WINDOW = 100 # sample validation words from the 100 most frequent word indices
valid_examples = np.random.choice(VALID_WINDOW, VALID_SIZE, replace=False)

Build the model with TensorFlow. For convenience, only the essential parts are built.

In [22]:
# inputs
train_inputs = tf.placeholder(tf.int32, shape=[BATCH_SIZE])
train_labels = tf.placeholder(tf.int32, shape=[BATCH_SIZE, 1])
valid_dataset = tf.constant(valid_examples, dtype=tf.int32)

In [23]:
# initialize the embedding matrix, then look up the vectors of the input words in it
embeddings = tf.Variable(tf.random_uniform([vocabulary_size, EMBEDDING_SIZE], -1.0, 1.0))
embed = tf.nn.embedding_lookup(embeddings, train_inputs)
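
Conceptually, the lookup above selects rows of the embedding matrix by word index. The NumPy sketch below shows the same idea with fancy indexing; `toy_table`, `toy_ids` and `toy_embed` are hypothetical names, and this is an analogy rather than what TensorFlow executes internally.

```python
import numpy as np

# Hypothetical embedding table: 4 words, 3 dimensions each
toy_table = np.arange(12, dtype=np.float32).reshape(4, 3)
toy_ids = np.array([2, 0, 2])  # a batch of word indices (repeats are allowed)

# Row indexing returns one embedding row per index in the batch
toy_embed = toy_table[toy_ids]
print(toy_embed.shape)  # (3, 3)
```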

The [NCE loss](https://papers.nips.cc/paper/5165-learning-word-embeddings-efficiently-with-noise-contrastive-estimation.pdf) (noise-contrastive estimation) reduces the cost of the full softmax by replacing it with logistic regression over a much smaller set of words: the positive (true context) word and a small number of randomly sampled negative words (`NUM_SAMPLED` here).
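
The sketch below illustrates the logistic-regression idea in plain Python for a single training pair. It is a simplification, not the computation inside `tf.nn.nce_loss`: it leaves out the biases and the correction term for the noise distribution that NCE proper includes, so it is closer to negative sampling. All names and the 2-d vectors are hypothetical.

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def sampled_logistic_loss(center_vec, true_vec, noise_vecs):
    """Binary logistic loss: label 1 for the true context word, 0 for each noise word."""
    loss = -math.log(sigmoid(dot(center_vec, true_vec)))
    for noise_vec in noise_vecs:
        loss -= math.log(1.0 - sigmoid(dot(center_vec, noise_vec)))
    return loss

# Hypothetical 2-d vectors, chosen only for illustration
center = [1.0, 0.0]
true_context = [2.0, 0.0]            # aligned with the center word
noise = [[-2.0, 0.0], [-1.0, 1.0]]   # pointing away from it

good = sampled_logistic_loss(center, true_context, noise)
# Swap the roles: treat a noise word as the true one and vice versa
bad = sampled_logistic_loss(center, noise[0], [true_context, noise[1]])
print(good < bad)  # True
```

Each evaluation touches only 1 + len(noise_vecs) output vectors instead of all `vocabulary_size` of them, which is where the savings over a full softmax come from.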

In [33]:
# Build the NCE loss model
nce_weights = tf.Variable(tf.truncated_normal([vocabulary_size, EMBEDDING_SIZE],
                                             stddev=1.0/np.sqrt(EMBEDDING_SIZE)))
nce_biases = tf.Variable(tf.zeros([vocabulary_size]))

loss = tf.reduce_mean(
    tf.nn.nce_loss(
        weights=nce_weights,
        biases=nce_biases,
        labels=train_labels,
        inputs=embed,
        num_sampled=NUM_SAMPLED,
        num_classes=vocabulary_size
    )
)

In [52]:
optimizer = tf.train.GradientDescentOptimizer(1.0).minimize(loss)

In [35]:
# Validation: cosine similarity between each validation word and every word in the vocabulary
norm = tf.sqrt(tf.reduce_sum(tf.square(embeddings), 1, keepdims=True))
normalized_embeddings = embeddings/norm
valid_embeddings = tf.nn.embedding_lookup(normalized_embeddings,
                                        valid_dataset)
similarity = tf.matmul(valid_embeddings, normalized_embeddings, transpose_b=True)
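
The same computation can be sketched in NumPy on a toy vocabulary to show how a similarity matrix like the one above is typically used to find nearest neighbours. The words, the 2-d vectors and the names prefixed with `toy_` are hypothetical; `top_k` is an assumed parameter for how many neighbours to show.

```python
import numpy as np

# Hypothetical 4-word vocabulary with 2-d embeddings
toy_words = ['king', 'queen', 'apple', 'pear']
toy_embeddings = np.array([[1.0, 0.1],
                           [0.9, 0.2],
                           [0.1, 1.0],
                           [0.2, 0.9]])

# Normalize each row to unit length so that a dot product is a cosine similarity
toy_norm = np.sqrt(np.sum(np.square(toy_embeddings), axis=1, keepdims=True))
toy_normalized = toy_embeddings / toy_norm

# Row i holds the cosine similarity between word i and every word
toy_similarity = toy_normalized @ toy_normalized.T

top_k = 1
for i, word in enumerate(toy_words):
    # argsort on the negated row sorts by descending similarity;
    # position 0 is the word itself, so skip it
    nearest = np.argsort(-toy_similarity[i])[1:top_k + 1]
    print(word, '->', [toy_words[j] for j in nearest])
```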

In [55]:
num_steps = 10000

In [56]:
with tf.Session() as sess:
    sess.run(tf.global_variables_initializer())
    for step in range(num_steps):
        batch_inputs, batch_labels = generate_batch(BATCH_SIZE, NUM_SKIPS,
                                                   SKIP_WINDOW)
        feed_dict = {train_inputs: batch_inputs, train_labels: batch_labels}
        _, loss_train = sess.run([optimizer, loss], feed_dict=feed_dict)
        if step%100 == 0:
            print(loss_train)
    final_embeddings = normalized_embeddings.eval()

280.397
195.874
166.827
168.633
165.543
95.4131
124.41
163.022
161.284
142.33
79.6432
53.8843
153.529
115.499
82.0457
73.0976
62.2828
67.9734
86.612
68.8936
73.6611
56.5932
66.9391
98.8964
49.5145
68.8724
39.0783
53.9223
31.3299
62.6922
35.2591
32.3451
47.4218
32.6992
30.9951
63.1502
51.3414
47.8295
36.6661
26.3181
34.6017
35.4397
37.6986
73.3687
51.6947
32.6635
26.1504
39.9885
40.0424
11.9627
27.6031
46.4044
43.1606
7.60296
25.0566
19.238
42.1086
28.7141
52.1673
24.5821
23.0984
35.3962
58.4864
35.8538
24.8007
60.674
46.226
41.0131
29.0721
34.6343
20.4665
21.5675
16.381
39.0353
25.8767
27.4252
17.5337
20.7826
14.6143
33.0125
39.725
31.9468
17.5575
43.666
20.301
7.0637
30.215
18.5779
14.3694
21.5498
8.37587
16.0859
18.2649
15.1885
14.9982
17.8057
13.0541
18.8006
16.3858
14.6411


In [57]:
print(ind2word[data[0]], 'and its corresponding vector:\n', final_embeddings[data[0]])

anarchism and its corresponding vector:
 [ 0.09323777 -0.01108901  0.03200433 -0.10120311  0.13836347 -0.0466122
  0.02027027 -0.0148513  -0.01205099 -0.07231995 -0.06766896 -0.00947833
  0.16027215 -0.0833019   0.09467531 -0.05385627  0.03214695 -0.01431454
 -0.07629175 -0.04613779  0.07980682 -0.0216409  -0.05562487  0.08287641
 -0.0306728  -0.02641387  0.05044418  0.03224257  0.00646495  0.10593925
 -0.15757279 -0.10095137  0.13243385  0.09088251  0.16086781 -0.01930537
  0.114754   -0.09769886  0.10203437 -0.08148076  0.05227184  0.09976345
  0.00686788 -0.14601114  0.05952242 -0.08311615 -0.07713651  0.03563159
 -0.02959167  0.07112483  0.16031735 -0.00664568 -0.11878739  0.11108568
 -0.05220035  0.10424857 -0.13774247 -0.1038047   0.0459159  -0.04485492
  0.10481818 -0.15517806  0.03311496  0.12910315 -0.09689718  0.08632062
 -0.08316438  0.06241824 -0.04387079 -0.07843991  0.0839075  -0.05629348
  0.12754366 -0.06455048 -0.01348033 -0.09118778 -0.06397358  0.08144351
 -0.0762676