# Skip-gram word2vec

In this notebook, I'll lead you through using TensorFlow to implement the word2vec algorithm with the skip-gram architecture. By implementing this, you'll learn about embedding words for use in natural language processing. This will come in handy for tasks such as machine translation.

## Readings

Here are the resources I used to build this notebook. I suggest reading these either beforehand or while you're working on this material.

* A really good [conceptual overview](http://mccormickml.com/2016/04/19/word2vec-tutorial-the-skip-gram-model/) of word2vec from Chris McCormick 
* [First word2vec paper](https://arxiv.org/pdf/1301.3781.pdf) from Mikolov et al.
* [NIPS paper](http://papers.nips.cc/paper/5021-distributed-representations-of-words-and-phrases-and-their-compositionality.pdf) with improvements for word2vec also from Mikolov et al.
* An [implementation of word2vec](http://www.thushv.com/natural_language_processing/word2vec-part-1-nlp-with-deep-learning-with-tensorflow-skip-gram/) from Thushan Ganegedara
* TensorFlow [word2vec tutorial](https://www.tensorflow.org/tutorials/word2vec)

## Word embeddings

When you're dealing with language and words, you end up with tens of thousands of classes to predict, one for each word. One-hot encoding these words is massively inefficient: each vector has one element set to 1 and the other 50,000 set to 0. The word2vec algorithm finds much more efficient representations by learning a dense vector for each word. These vectors also carry semantic information about the words. Words that show up in similar contexts, such as "black", "white", and "red", will have vectors near each other. There are two architectures for implementing word2vec: CBOW (Continuous Bag-Of-Words) and skip-gram.
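The inefficiency of one-hot encoding can be seen in a small NumPy sketch: multiplying a one-hot vector by a weight matrix only selects one row, so an embedding layer can index that row directly. The toy sizes and the names `embedding_matrix` and `word_id` are assumptions made for illustration, not values from this notebook.

```python
import numpy as np

vocab_size, embed_dim = 5, 3          # toy sizes, chosen only for illustration
rng = np.random.default_rng(0)
embedding_matrix = rng.uniform(-1.0, 1.0, size=(vocab_size, embed_dim))

word_id = 2                           # integer code of some word
one_hot = np.zeros(vocab_size)
one_hot[word_id] = 1.0

# Multiplying a one-hot vector by the matrix selects a single row...
via_matmul = one_hot @ embedding_matrix
# ...so the multiplication can be skipped by indexing the row directly.
via_lookup = embedding_matrix[word_id]

print(np.allclose(via_matmul, via_lookup))
```

This row indexing is the operation that `tf.nn.embedding_lookup` performs in the model later in the notebook.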

<img src="assets/word2vec_architectures.png" width="500">

In this implementation, we'll be using the skip-gram architecture because it generally performs better than CBOW. Here, we pass in a word and try to predict the words surrounding it in the text. In this way, we can train the network to learn representations for words that show up in similar contexts.
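As a rough illustration of "pass in a word and predict the words surrounding it", the sketch below builds (input word, context word) training pairs from a token list. The helper `skipgram_pairs` and its fixed `window_size` are hypothetical, written for this example only; the original training code may sample its windows differently.

```python
def skipgram_pairs(tokens, window_size=2):
    """Return (center, context) pairs for every token in `tokens`."""
    pairs = []
    for idx, center in enumerate(tokens):
        start = max(0, idx - window_size)
        stop = min(len(tokens), idx + window_size + 1)
        for ctx_idx in range(start, stop):
            if ctx_idx != idx:        # a word is not its own context
                pairs.append((center, tokens[ctx_idx]))
    return pairs

sentence = "the quick brown fox jumps".split()
pairs = skipgram_pairs(sentence, window_size=1)
print(pairs)
```

Each pair becomes one training example: the first element is the network input and the second is the label to predict.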

First up, importing packages.

In [None]:
import time
import numpy as np
import tensorflow as tf

import sys
sys.path.insert(0, '../')
# import text_utils methods
import text_utils
import utils

from collections import Counter, namedtuple
import random

import pickle

The vocabulary and checkpoint restored below come from training on the [text8 dataset](http://mattmahoney.net/dc/textdata.html), a file of cleaned up Wikipedia articles from Matt Mahoney. This notebook does not download or read the raw text; it only needs the saved dictionaries and the model checkpoint.

## Restore the vocabulary

We restore the `vocab_to_int` and `int_to_vocab` dictionaries from the pickle file.

In [None]:
with open('dictionary.cpkt', 'rb') as f:
    dictionaries = pickle.load(f)

In [None]:
vocab_to_int = dictionaries['vocab_to_int']
int_to_vocab = dictionaries['int_to_vocab']
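Assuming the two dictionaries are inverse mappings between words and integer codes, encoding and decoding look like the sketch below. The `toy_` dictionaries and their entries are made up for illustration and are not taken from the pickle file.

```python
# Hypothetical stand-ins for vocab_to_int and int_to_vocab.
toy_vocab_to_int = {'the': 0, 'of': 1, 'and': 2}
toy_int_to_vocab = {code: word for word, code in toy_vocab_to_int.items()}

words = ['of', 'the', 'and']
encoded = [toy_vocab_to_int[word] for word in words]    # words -> integer codes
decoded = [toy_int_to_vocab[code] for code in encoded]  # integer codes -> words

print(encoded)
print(decoded)
```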

## Building The Model

We build the model as before, but instead of training it we restore the saved checkpoint `word2vec_300.ckpt`.

In [None]:
model_path = './checkpoints/word2vec_300.ckpt'

In [None]:
def build_model(vocab_size, embed_dim, num_sampled=100):
    with tf.name_scope('Inputs'):
        inputs = tf.placeholder(tf.int32, [None], name='inputs')

    with tf.name_scope('Labels'):
        labels = tf.placeholder(tf.int32, [None, None], name='labels')

    with tf.name_scope('Embedding'):
        embeddings = tf.Variable(tf.random_uniform((vocab_size, embed_dim), -1.0, 1.0), name='embeddings')
        embed = tf.nn.embedding_lookup(embeddings, inputs, name='embed')

    with tf.name_scope('NegativeSampling'):
        softmax_w = tf.Variable(tf.truncated_normal((vocab_size, embed_dim), stddev=0.1), name='softmax_w')
        softmax_b = tf.Variable(tf.zeros(vocab_size), name='softmax_b')
        tf.summary.histogram('softmax_w', softmax_w)
        tf.summary.histogram('softmax_b', softmax_b)

    # Sampled softmax loss; num_sampled is the number of negative labels to sample
    with tf.name_scope('Cost'):
        loss = tf.nn.sampled_softmax_loss(softmax_w, softmax_b, labels, embed,
                                          num_sampled, vocab_size, name='loss')
        cost = tf.reduce_mean(loss, name='cost')
        tf.summary.scalar('cost', cost)

    with tf.name_scope('Optimizer'):
        optimizer = tf.train.AdamOptimizer(name='optimizer').minimize(cost)
        
    # merge all the summary in one node
    merged = tf.summary.merge_all()
    
    # Export the nodes
    export_nodes = ['inputs', 'labels', 'embeddings', 'embed', 
                    'softmax_w', 'softmax_b', 'cost', 'optimizer', 'merged']

    
    Graph = namedtuple('Graph', export_nodes)
    local_dict = locals()
    graph = Graph(*[local_dict[each] for each in export_nodes])
    
    return graph        

In [None]:
model = build_model(
    vocab_size = len(vocab_to_int),
    embed_dim = 300,
    num_sampled = 100
)
saver = tf.train.Saver()

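with tf.Session() as sess:
    # Restoring assigns every saved variable, so no initializer is needed.
    # This session only checks that the checkpoint loads; it closes with the block.
    saver.restore(sess, model_path)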
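The cost in `build_model` uses a sampled softmax, which scores the true label against a small random set of classes instead of the whole vocabulary. The NumPy sketch below illustrates the idea under simplifying assumptions: negatives are drawn uniformly and no correction for the sampling probabilities is applied, so it is not a reimplementation of `tf.nn.sampled_softmax_loss`. The function name and the toy sizes are hypothetical.

```python
import numpy as np

def sampled_softmax_loss_sketch(weights, biases, hidden, label, num_sampled, rng):
    """Cross-entropy over the true class plus `num_sampled` random negatives."""
    vocab_size = weights.shape[0]
    # Draw negative classes uniformly, excluding the true label.
    candidates = np.array([c for c in range(vocab_size) if c != label])
    negatives = rng.choice(candidates, size=num_sampled, replace=False)
    classes = np.concatenate(([label], negatives))   # true class sits at index 0
    logits = weights[classes] @ hidden + biases[classes]
    logits = logits - logits.max()                   # numerical stability
    log_probs = logits - np.log(np.exp(logits).sum())
    return -log_probs[0]

rng = np.random.default_rng(0)
toy_vocab_size, toy_embed_dim = 10, 4
weights = rng.normal(scale=0.1, size=(toy_vocab_size, toy_embed_dim))
biases = np.zeros(toy_vocab_size)
hidden = rng.uniform(-1.0, 1.0, size=toy_embed_dim)

loss_sampled = sampled_softmax_loss_sketch(
    weights, biases, hidden, label=3, num_sampled=4, rng=rng)
loss_all = sampled_softmax_loss_sketch(
    weights, biases, hidden, label=3, num_sampled=toy_vocab_size - 1, rng=rng)
print(loss_sampled, loss_all)
```

When every other class is used as a negative, the sketch reduces to the ordinary softmax cross-entropy; with fewer negatives each step touches only `num_sampled + 1` rows of the weight matrix.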

## Finding the top-k nearest neighbours


In [None]:
def find_top_k(embeddings, input_codes, top_k=8):
    with tf.Session() as sess:
        # Restore the trained weights; running the initializer here would
        # overwrite the embeddings with random values.
        saver.restore(sess, model_path)
        sample_dataset = tf.constant(np.array(input_codes), dtype=tf.int32)

        # We use the cosine similarity: normalize each row, then take dot products.
        norm = tf.sqrt(tf.reduce_sum(tf.square(embeddings), 1, keep_dims=True))
        normalized_embedding = embeddings / norm
        sample_embedding = tf.nn.embedding_lookup(normalized_embedding, sample_dataset)
        similarity = tf.matmul(sample_embedding, tf.transpose(normalized_embedding))
        sim = similarity.eval()
        for i, c in enumerate(input_codes):
            word = int_to_vocab[c]
            # The first entry of the sorted result is the word itself, so skip it.
            nearest = (-sim[i, :]).argsort()[1:top_k + 1]
            log = 'Nearest to %s:' % word
            for k in range(top_k):
                close_words = int_to_vocab[nearest[k]]
                log = '%s %s,' % (log, close_words)
            print(log)

In [None]:
find_top_k(model.embeddings, [3,4])
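The nearest-neighbour search in `find_top_k` can also be followed on a tiny example without TensorFlow. The NumPy sketch below mirrors the same steps (normalize the rows, take dot products, sort, drop the query itself); the function `top_k_cosine` and the 2-dimensional `toy_embeddings` are made up for illustration.

```python
import numpy as np

def top_k_cosine(embedding_matrix, query_ids, top_k=2):
    """Return, for each query row, the indices of its top_k most similar rows."""
    norms = np.linalg.norm(embedding_matrix, axis=1, keepdims=True)
    normalized = embedding_matrix / norms
    # Dot products of unit vectors are cosine similarities.
    similarity = normalized[query_ids] @ normalized.T
    # Sort descending; position 0 is the query row itself, so skip it.
    return np.argsort(-similarity, axis=1)[:, 1:top_k + 1]

toy_embeddings = np.array([
    [1.0, 0.0],    # row 0: the query
    [0.9, 0.1],    # row 1: points almost the same way as row 0
    [0.0, 1.0],    # row 2: orthogonal to row 0
    [-1.0, 0.0],   # row 3: opposite to row 0
])
neighbours = top_k_cosine(toy_embeddings, [0], top_k=2)
print(neighbours)
```

Skipping position 0 assumes that no other row is identical in direction to the query; with duplicate vectors the query could appear later in the ranking.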