# Skip-gram word2vec

In this notebook, I'll lead you through implementing the word2vec algorithm in TensorFlow with the skip-gram architecture. By implementing this, you'll learn about embedding words for use in natural language processing. This will come in handy for tasks like machine translation.

## Readings

Here are the resources I used to build this notebook. I suggest reading these either beforehand or while you're working on this material.

* A really good [conceptual overview](http://mccormickml.com/2016/04/19/word2vec-tutorial-the-skip-gram-model/) of word2vec from Chris McCormick 
* [First word2vec paper](https://arxiv.org/pdf/1301.3781.pdf) from Mikolov et al.
* [NIPS paper](http://papers.nips.cc/paper/5021-distributed-representations-of-words-and-phrases-and-their-compositionality.pdf) with improvements for word2vec also from Mikolov et al.
* An [implementation of word2vec](http://www.thushv.com/natural_language_processing/word2vec-part-1-nlp-with-deep-learning-with-tensorflow-skip-gram/) from Thushan Ganegedara
* TensorFlow [word2vec tutorial](https://www.tensorflow.org/tutorials/word2vec)

## Word embeddings

When you're dealing with language and words, you end up with tens of thousands of classes to predict, one for each word. Trying to one-hot encode these words is massively inefficient: you'll have one element set to 1 and the other 50,000 set to 0. The word2vec algorithm finds much more efficient representations by learning a dense vector for each word. These vectors also contain semantic information about the words. Words that show up in similar contexts, such as "black", "white", and "red", will have vectors near each other. There are two architectures for implementing word2vec: CBOW (Continuous Bag-Of-Words) and Skip-gram.
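
To make the efficiency argument concrete, here is a minimal NumPy sketch. The toy sizes and the name `embedding_table` are assumptions for illustration, not values from this notebook. It shows that multiplying a one-hot vector by a weight matrix simply selects one row, which is why an embedding lookup can skip the multiplication entirely.

```python
import numpy as np

vocab_size, embed_dim = 6, 3          # toy sizes, chosen only for illustration
rng = np.random.default_rng(0)
embedding_table = rng.uniform(-1.0, 1.0, size=(vocab_size, embed_dim))

word_id = 4                           # integer code of some word

# One-hot route: a vector with a single 1, followed by a full matrix multiply.
one_hot = np.zeros(vocab_size)
one_hot[word_id] = 1.0
via_matmul = one_hot @ embedding_table

# Lookup route: index the row directly, no multiplication needed.
via_lookup = embedding_table[word_id]

print(np.allclose(via_matmul, via_lookup))
```

Both routes give the same vector, but the lookup touches `embed_dim` numbers instead of `vocab_size * embed_dim`.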

<img src="assets/word2vec_architectures.png" width="500">

In this implementation, we'll be using the skip-gram architecture because it tends to perform better than CBOW. Here, we pass in a word and try to predict the words surrounding it in the text. In this way, we can train the network to learn representations for words that show up in similar contexts.
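
As a rough illustration of what "predict the words surrounding it" means for the training data, the sketch below turns a sentence into (input word, target word) pairs. The helper name `skipgram_pairs` and the fixed `window_size` are assumptions made for this example; real implementations often vary the window size at random.

```python
def skipgram_pairs(words, window_size=2):
    """Return (center, context) pairs for every word within the window."""
    pairs = []
    for i, center in enumerate(words):
        start = max(0, i - window_size)
        stop = min(len(words), i + window_size + 1)
        for j in range(start, stop):
            if j != i:                      # a word is not its own context
                pairs.append((center, words[j]))
    return pairs

sentence = "the quick brown fox".split()
pairs = skipgram_pairs(sentence, window_size=1)
print(pairs)
```

With `window_size=1`, each word is paired only with its immediate neighbors, so "quick" yields the pairs `("quick", "the")` and `("quick", "brown")`.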

First up, importing packages.

In [1]:
import time
import numpy as np
import tensorflow as tf

import sys
sys.path.insert(0, '../')
# import text_utils methods
import text_utils
import utils

from collections import Counter, namedtuple
import random

import pickle
import math

The model was trained on the [text8 dataset](http://mattmahoney.net/dc/textdata.html), a file of cleaned up Wikipedia articles from Matt Mahoney. This notebook does not download the data again; it only restores the vocabulary and a checkpoint saved during training.

## Restore the vocabulary

We restore the `vocab_to_int` and `int_to_vocab` dictionaries from the pickle file.

In [2]:
with open('dictionary.cpkt', 'rb') as f:
    dictionaries = pickle.load(f)

In [3]:
vocab_to_int = dictionaries['vocab_to_int']
int_to_vocab = dictionaries['int_to_vocab']
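
The pickle file itself is not built in this notebook. The sketch below shows one way a file with the same two keys could be produced. The toy corpus and the frequency-based ordering are assumptions for illustration, and an in-memory buffer stands in for the file on disk.

```python
import io
import pickle
from collections import Counter

words = "the cat sat on the mat the end".split()   # toy corpus

# Assumption: the most frequent words get the smallest integer codes.
counts = Counter(words)
sorted_vocab = sorted(counts, key=counts.get, reverse=True)
int_to_vocab = dict(enumerate(sorted_vocab))
vocab_to_int = {word: i for i, word in int_to_vocab.items()}

# Round-trip through pickle, mirroring the load step above.
buffer = io.BytesIO()
pickle.dump({'vocab_to_int': vocab_to_int, 'int_to_vocab': int_to_vocab}, buffer)
buffer.seek(0)
restored = pickle.load(buffer)

print(restored['vocab_to_int']['the'])
```

The two dictionaries are inverses of each other, so `int_to_vocab[vocab_to_int[word]]` always returns `word`.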

## Building The Model

We build the model as before, but instead of training it we are going to restore a saved checkpoint (the path is set in `model_path` below).

In [4]:
def build_model(graph, vocab_size, embed_dim, num_sampled=100):
    with graph.as_default():
        with tf.name_scope('Inputs'):
            inputs = tf.placeholder(tf.int32, [None], name='inputs')

        with tf.name_scope('Labels'):
            labels = tf.placeholder(tf.int32, [None, None], name='labels')
        
        with tf.device('/cpu:0'):
            with tf.name_scope('Embedding'):
                embeddings = tf.Variable(
                    initial_value = tf.random_uniform([vocab_size, embed_dim], -1.0, 1.0), 
                    name='embeddings')
                embed = tf.nn.embedding_lookup(embeddings, inputs, name='embed')

            with tf.name_scope('NegativeSampling'):
                softmax_w = tf.Variable(
                    tf.truncated_normal([vocab_size, embed_dim], stddev=1.0 / math.sqrt(embed_dim)), name='softmax_w')
                softmax_b = tf.Variable(tf.zeros([vocab_size]), name='softmax_b')

            # Sampled softmax: score the true labels against num_sampled
            # randomly drawn negative classes instead of the full vocabulary.
            with tf.name_scope('Loss'):
                loss = tf.reduce_mean(
                    tf.nn.sampled_softmax_loss(
                        weights=softmax_w,
                        biases=softmax_b,
                        labels=labels,
                        inputs=embed,
                        num_sampled=num_sampled,
                        num_classes=vocab_size),
                    name='loss')
                tf.summary.scalar('loss', loss)
            
            optimizer = tf.train.AdamOptimizer(name='optimizer').minimize(loss)
        
        # merge all the summary in one node
        merged = tf.summary.merge_all()
    
    # Export the nodes
    export_nodes = ['inputs', 'labels', 'embeddings', 'embed', 
                    'softmax_w', 'softmax_b', 'loss', 'optimizer', 'merged']

    
    Model = namedtuple('Model', export_nodes)
    local_dict = locals()
    model = Model(*[local_dict[each] for each in export_nodes])
    
    return model   
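
To give a feel for what the sampled loss in `build_model` buys us, here is a simplified NumPy sketch for a single training example. It is not TensorFlow's implementation: the function name `sampled_softmax_loss_sketch` is hypothetical, negatives are drawn uniformly, and no correction for the sampling distribution is applied, whereas the TensorFlow op samples differently and adjusts the logits accordingly. The point is only that the loss is computed over the true class plus `num_sampled` negatives rather than over the whole vocabulary.

```python
import numpy as np

def sampled_softmax_loss_sketch(embed, label, softmax_w, softmax_b, num_sampled, rng):
    """Cross-entropy over the true class plus `num_sampled` random negatives.

    Simplified on purpose: negatives are drawn uniformly and no
    sampling-probability correction is applied.
    """
    vocab_size = softmax_w.shape[0]
    # Draw negative classes, excluding the true label.
    candidates = np.delete(np.arange(vocab_size), label)
    negatives = rng.choice(candidates, size=num_sampled, replace=False)
    classes = np.concatenate(([label], negatives))

    # Score only the selected rows instead of the whole vocabulary.
    logits = softmax_w[classes] @ embed + softmax_b[classes]
    logits -= logits.max()                       # numerical stability
    log_probs = logits - np.log(np.exp(logits).sum())
    return -log_probs[0]                         # true class sits at index 0

rng = np.random.default_rng(0)
vocab_size, embed_dim = 1000, 50                 # toy sizes for illustration
softmax_w = rng.normal(scale=1.0 / np.sqrt(embed_dim), size=(vocab_size, embed_dim))
softmax_b = np.zeros(vocab_size)
embed = rng.uniform(-1.0, 1.0, size=embed_dim)

loss = sampled_softmax_loss_sketch(embed, label=3, softmax_w=softmax_w,
                                   softmax_b=softmax_b, num_sampled=100, rng=rng)
print(loss)
```

Here only 101 of the 1000 output rows are touched per example, which is where the speed-up over a full softmax comes from.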

## Validation Functions

This code is from Thushan Ganegedara's implementation. Given the integer codes of a few words, it normalizes the embedding vectors and returns the cosine similarity between each of those words and every word in the vocabulary. We can then print out the closest words to each one. It's a nice way to check that our embedding table is grouping together words with similar semantic meanings.

In [5]:
def validate(embeddings, int_codes):
    sample_examples = np.array(int_codes)
    sample_dataset = tf.constant(sample_examples, dtype=tf.int32)
    
    norm = tf.sqrt(tf.reduce_sum(tf.square(embeddings), 1, keep_dims=True))
    normalized_embedding = embeddings / norm
    sample_embedding = tf.nn.embedding_lookup(normalized_embedding, sample_dataset)
    return tf.matmul(sample_embedding, tf.transpose(normalized_embedding))
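
For readers who want to check the math outside a TensorFlow session, the following NumPy sketch mirrors `validate` and combines it with the top-k selection used in the lookup cell further down. The function name `nearest_neighbors` and the toy matrix are assumptions for illustration.

```python
import numpy as np

def nearest_neighbors(embeddings, query_ids, top_k=3):
    """Return, for each query id, the ids of its top_k most similar rows."""
    norm = np.sqrt(np.sum(np.square(embeddings), axis=1, keepdims=True))
    normalized = embeddings / norm
    # Dot products of unit vectors are cosine similarities.
    sim = normalized[query_ids] @ normalized.T
    # Sort descending and drop column 0, which is the query word itself.
    return np.argsort(-sim, axis=1)[:, 1:top_k + 1]

toy = np.array([[1.0, 0.0],
                [0.9, 0.1],
                [0.0, 1.0],
                [0.1, 0.9]])
neighbors = nearest_neighbors(toy, [0, 2], top_k=1)
print(neighbors)
```

Rows 0 and 1 point in nearly the same direction, as do rows 2 and 3, so the nearest neighbor of row 0 is row 1 and the nearest neighbor of row 2 is row 3.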

## Restore the model graph

In [6]:
train_graph = tf.Graph()

model = build_model(
    graph = train_graph,
    vocab_size = len(vocab_to_int),
    embed_dim = 50,
    num_sampled = 100
)

In [7]:
model_path = './checkpoints/word2vec_1.ckpt'

In [8]:
with train_graph.as_default():
    saver = tf.train.Saver()

## Look up the embeddings

In [9]:
with tf.Session(graph=train_graph) as sess:
    sess.run(tf.global_variables_initializer())
    saver.restore(sess,model_path)
    
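    # Codes 0-9 plus 186, which maps to "king" in this vocabulary
    sample_examples = list(range(10))
    sample_examples.append(186)
    sample_size = len(sample_examples)
    print(sample_examples)
    
    sim = validate(model.embeddings, sample_examples).eval()
    for i in range(sample_size):
        sample_word = int_to_vocab[sample_examples[i]]
        top_k = 8 # number of nearest neighbors
        # Sort by descending similarity; index 0 is the word itself, so skip it
        nearest = (-sim[i, :]).argsort()[1:top_k+1]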
        log = 'Nearest to %s:' % sample_word
        for k in range(top_k):
            close_word = int_to_vocab[nearest[k]]
            log = '%s %s,' % (log, close_word)
        print(log)

[0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 186]
Nearest to the: of, in, and, from, to, were, by, at,
Nearest to of: the, in, on, from, and, to, most, a,
Nearest to and: by, the, as, to, in, they, for, also,
Nearest to one: five, nine, two, zero, eight, seven, four, six,
Nearest to in: the, of, and, was, were, to, by, after,
Nearest to a: is, for, as, which, each, the, interest, on,
Nearest to to: the, they, and, that, have, not, as, will,
Nearest to zero: two, one, four, six, three, nine, five, eight,
Nearest to nine: one, seven, four, six, five, eight, american, zero,
Nearest to two: zero, five, one, four, six, three, seven, nine,
Nearest to king: had, asserted, strange, turning, rumored, september, descendant, vigor,
