## word2vec without eager

### word embedding

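An embedding maps source data into another space.
A word embedding maps a word from space X to a multi-dimensional vector in space Y, so the vector is, in effect, embedded in space Y.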

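Given a document, which is just a sequence of words such as "A B A C B F G", we want a vector representation (usually low-dimensional) for each distinct word in the document.
For the sequence "A B A C B F G", for example, we might end up with A mapped to [0.1 0.6 -0.5] and B mapped to [-0.2 0.9 0.7] (the numbers are for illustration only).
The reason for turning each word into a vector is to make computation convenient: for example, "find the synonyms of word A" can be done by "finding the vectors most similar to A's vector under cosine similarity".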
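
As a minimal sketch of that nearest-neighbor lookup, the snippet below ranks words by cosine similarity. The vectors for A and B reuse the illustrative values above; the vector for C and the helper names `cosine_similarity` and `most_similar` are assumptions made up for this example.

```python
import numpy as np

# Illustrative vectors only: A and B come from the text, C is hypothetical.
embeddings = {
    "A": np.array([0.1, 0.6, -0.5]),
    "B": np.array([-0.2, 0.9, 0.7]),
    "C": np.array([0.2, 0.5, -0.4]),
}

def cosine_similarity(u, v):
    """Cosine of the angle between u and v (1.0 means same direction)."""
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))

def most_similar(word, table):
    """Return (other_word, similarity) for the closest word other than `word`."""
    query = table[word]
    candidates = [(other, cosine_similarity(query, vec))
                  for other, vec in table.items() if other != word]
    return max(candidates, key=lambda pair: pair[1])

nearest_word, nearest_score = most_similar("A", embeddings)
print(nearest_word, round(nearest_score, 3))
```

With these made-up numbers, C points in nearly the same direction as A, so it is returned as the closest word.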

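A simple approach:
one-hot vectors based on bag-of-words (BOW).
Drawbacks: they carry no information about neighboring words, and the vectors can become very long.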
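
A small sketch of one-hot encoding for the sample sequence "A B A C B F G" may make the drawbacks concrete. The alphabetical vocabulary ordering and the helper name `one_hot` are assumptions for this example.

```python
import numpy as np

document = "A B A C B F G".split()
vocab = sorted(set(document))               # assumed ordering: alphabetical
word_to_index = {word: i for i, word in enumerate(vocab)}

def one_hot(word):
    """Vector of length len(vocab) with a single 1 at the word's index."""
    vec = np.zeros(len(vocab), dtype=np.int64)
    vec[word_to_index[word]] = 1
    return vec

print(one_hot("A"))                          # [1 0 0 0 0]
print(int(one_hot("A") @ one_hot("B")))      # 0
```

The vector length equals the vocabulary size, and the dot product of any two different words is 0, so one-hot vectors say nothing about how related two words are.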

#### Co-occurrence matrix

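A key idea is that the meaning of a word is closely tied to the words near it. We can therefore set a window (typically of size 5 to 10). In the example below the window size is 2, so the words that co-occur with rests inside the window are life, he, in, and peace. We then use these co-occurrence relations to build word vectors.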
<img src="https://img-blog.csdn.net/20170904134137027" style="width:450px;height:100px">
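For example, suppose our corpus consists of the following three documents:
I like deep learning. 
I like NLP. 
I enjoy flying.
For this example we set the window size to 1, that is, we only look at the words immediately adjacent to a given word. This yields a symmetric matrix, the co-occurrence matrix. Because I and like appear together as neighbors inside the window twice in our corpus, the cell where I and like intersect in the table below has the value 2. This also realizes the idea of turning words into vectors: each row (or each column) of the co-occurrence matrix is a vector representation of the corresponding word.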

<img src="https://img-blog.csdn.net/20170904134757679" style="width:400px;height:300px">
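
The counting procedure described above can be sketched as follows. This is a sketch under assumptions: the trailing period is dropped rather than counted as a token (the table in the figure may treat it differently), and the vocabulary is sorted with Python's default ordering.

```python
import numpy as np

corpus = ["I like deep learning.", "I like NLP.", "I enjoy flying."]
WINDOW = 1  # only the immediately adjacent words count as context

# Assumption: strip the final period instead of keeping it as a token.
sentences = [sentence.rstrip(".").split() for sentence in corpus]
vocab = sorted({word for sentence in sentences for word in sentence})
index = {word: i for i, word in enumerate(vocab)}

cooc = np.zeros((len(vocab), len(vocab)), dtype=np.int64)
for sentence in sentences:
    for pos, word in enumerate(sentence):
        lo = max(0, pos - WINDOW)
        hi = min(len(sentence), pos + WINDOW + 1)
        for ctx_pos in range(lo, hi):
            if ctx_pos != pos:
                cooc[index[word], index[sentence[ctx_pos]]] += 1

print(vocab)
print(cooc)
print(cooc[index["I"], index["like"]])  # 2
```

Each row of `cooc` is the count-based vector for the word at that index, and the matrix is symmetric because every neighbor relation is counted from both sides.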

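To shrink the vectors, dimensionality-reduction methods such as SVD or PCA are used. However, SVD is computationally very expensive (not covered in detail here; to be filled in later when time permits).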
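
As a rough illustration of the SVD step, the sketch below reduces a small co-occurrence matrix to 2 dimensions with `numpy.linalg.svd`. The matrix is hand-built from the three example sentences with a window of 1 and punctuation dropped; the word order in `words` and the target dimension `k = 2` are assumptions for this example.

```python
import numpy as np

words = ["I", "like", "enjoy", "deep", "learning", "NLP", "flying"]
# Hand-built counts for the three example sentences (window 1, no punctuation).
cooc = np.array([
    [0, 2, 1, 0, 0, 0, 0],   # I
    [2, 0, 0, 1, 0, 1, 0],   # like
    [1, 0, 0, 0, 0, 0, 1],   # enjoy
    [0, 1, 0, 0, 1, 0, 0],   # deep
    [0, 0, 0, 1, 0, 0, 0],   # learning
    [0, 1, 0, 0, 0, 0, 0],   # NLP
    [0, 0, 1, 0, 0, 0, 0],   # flying
], dtype=float)

U, S, Vt = np.linalg.svd(cooc)

k = 2                                  # assumed target dimension
reduced = U[:, :k] * S[:k]             # one k-dimensional vector per word

for word, vec in zip(words, reduced):
    print(word, vec)
```

Keeping only the top `k` singular directions gives each word a short dense vector. A full SVD of an n x n matrix costs on the order of n^3 operations, which is why this becomes expensive for realistic vocabulary sizes.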

After deep learning became popular, Tomas Mikolov proposed Word2vec, which includes two neural-network-based methods: CBOW and Skip-Gram.

## Skip-Gram 
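
Skip-Gram trains on (center word, context word) pairs, where the context words are those inside a window around the center word. The sketch below generates such pairs with a fixed window; the function name `skip_gram_pairs` is hypothetical, and the notebook's own `word2vec_utils.batch_gen` is not shown here and may build its batches differently.

```python
def skip_gram_pairs(tokens, window):
    """Return (center, context) pairs for every token and each neighbor in the window."""
    pairs = []
    for pos, center in enumerate(tokens):
        lo = max(0, pos - window)
        hi = min(len(tokens), pos + window + 1)
        for ctx_pos in range(lo, hi):
            if ctx_pos != pos:
                pairs.append((center, tokens[ctx_pos]))
    return pairs

pairs = skip_gram_pairs("I like deep learning".split(), window=1)
for center, context in pairs:
    print(center, "->", context)
```

In the model below, the center words play the role of `center_words` and the context words the role of `target_words`, both as integer vocabulary indices rather than strings.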

In [10]:
""" starter code for word2vec skip-gram model with NCE loss
CS 20: "TensorFlow for Deep Learning Research"
cs20.stanford.edu
Chip Huyen (chiphuyen@cs.stanford.edu)
Lecture 04
"""

import os
os.environ['TF_CPP_MIN_LOG_LEVEL']='2'

import numpy as np
from tensorflow.contrib.tensorboard.plugins import projector  # version update: embedding projector tool
import tensorflow as tf

import utils
import word2vec_utils   # these two files are in the example folder; don't forget to add them

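TensorBoard has a built-in visualization tool, the Embedding Projector. It is an interactive visualization for analyzing high-dimensional data such as embeddings.


The Embedding Projector reads the embeddings from your checkpoint file.


By default, the Embedding Projector uses PCA (principal component analysis) to project the high-dimensional data into 3D space. Another available projection method is t-SNE.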
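
To give a feel for what a PCA projection to 3D does, here is a NumPy-only sketch. It is not the Embedding Projector's own code: the random matrix stands in for a trained embedding matrix, and its size (100 x 128) is an assumption for this example.

```python
import numpy as np

rng = np.random.default_rng(0)
# Hypothetical stand-in for trained embeddings: 100 words, 128 dimensions.
embeddings = rng.normal(size=(100, 128))

# PCA via SVD: center the data, then project onto the top 3 principal directions.
centered = embeddings - embeddings.mean(axis=0)
_, _, components = np.linalg.svd(centered, full_matrices=False)
projected = centered @ components[:3].T

print(projected.shape)            # (100, 3)
print(projected.var(axis=0))      # variance along each of the 3 axes, largest first
```

Each row of `projected` is a 3D point that can be plotted, with the first axis capturing the most variance in the original 128-dimensional data.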



In [11]:
# Model hyperparameters
VOCAB_SIZE = 50000
BATCH_SIZE = 128
EMBED_SIZE = 128            # dimension of the word embedding vectors
SKIP_WINDOW = 1             # the context window
NUM_SAMPLED = 64            # number of negative examples to sample
LEARNING_RATE = 1.0
NUM_TRAIN_STEPS = 100000
VISUAL_FLD = 'visualization'
SKIP_STEP = 5000

In [12]:
# Parameters for downloading data
DOWNLOAD_URL = 'http://mattmahoney.net/dc/text8.zip'
EXPECTED_BYTES = 31344016
NUM_VISUALIZE = 3000        # number of tokens to visualize

In [13]:
def word2vec(dataset):
    """ Build the graph for word2vec model and train it """
    # Step 1: get input, output from the dataset
    with tf.name_scope('data'):
        iterator = dataset.make_initializable_iterator()
        center_words, target_words = iterator.get_next()

    """ Step 2 + 3: define weights and embedding lookup.
    In word2vec, it's actually the weights that we care about 
    """
    with tf.name_scope('embed'):
        embed_matrix = tf.get_variable('embed_matrix', 
                                        shape=[VOCAB_SIZE, EMBED_SIZE],
                                        initializer=tf.random_uniform_initializer())
        embed = tf.nn.embedding_lookup(embed_matrix, center_words, name='embedding')

    # Step 4: construct variables for NCE loss and define loss function
    with tf.name_scope('loss'):
        nce_weight = tf.get_variable('nce_weight', shape=[VOCAB_SIZE, EMBED_SIZE],
                        initializer=tf.truncated_normal_initializer(stddev=1.0 / (EMBED_SIZE ** 0.5)))
        nce_bias = tf.get_variable('nce_bias', initializer=tf.zeros([VOCAB_SIZE]))

        # define loss function to be NCE loss function
        loss = tf.reduce_mean(tf.nn.nce_loss(weights=nce_weight, 
                                            biases=nce_bias, 
                                            labels=target_words, 
                                            inputs=embed, 
                                            num_sampled=NUM_SAMPLED, 
                                            num_classes=VOCAB_SIZE), name='loss')

    # Step 5: define optimizer
    with tf.name_scope('optimizer'):
        optimizer = tf.train.GradientDescentOptimizer(LEARNING_RATE).minimize(loss)
    
    utils.safe_mkdir('checkpoints')

    with tf.Session() as sess:
        sess.run(iterator.initializer)
        sess.run(tf.global_variables_initializer())

        total_loss = 0.0 # we use this to calculate the average loss in the last SKIP_STEP steps
        writer = tf.summary.FileWriter('graphs/word2vec_simple', sess.graph)

        for index in range(NUM_TRAIN_STEPS):
            try:
                loss_batch, _ = sess.run([loss, optimizer])
                total_loss += loss_batch
                if (index + 1) % SKIP_STEP == 0:
                    print('Average loss at step {}: {:5.1f}'.format(index, total_loss / SKIP_STEP))
                    total_loss = 0.0
            except tf.errors.OutOfRangeError:
                sess.run(iterator.initializer)
        writer.close()

In [14]:
def gen():
    yield from word2vec_utils.batch_gen(DOWNLOAD_URL, EXPECTED_BYTES, VOCAB_SIZE, 
                                        BATCH_SIZE, SKIP_WINDOW, VISUAL_FLD)

In [15]:
def main():
    dataset = tf.data.Dataset.from_generator(gen, 
                                (tf.int32, tf.int32), 
                                (tf.TensorShape([BATCH_SIZE]), tf.TensorShape([BATCH_SIZE, 1])))
    word2vec(dataset)

if __name__ == '__main__':
    main()

data/text8.zip already exists
Average loss at step 4999:  65.2
Average loss at step 9999:  18.4
Average loss at step 14999:   9.7
Average loss at step 19999:   6.7
Average loss at step 24999:   5.7
Average loss at step 29999:   5.2
Average loss at step 34999:   5.0
Average loss at step 39999:   4.8
Average loss at step 44999:   4.8
Average loss at step 49999:   4.8
Average loss at step 54999:   4.7
Average loss at step 59999:   4.7
Average loss at step 64999:   4.6
Average loss at step 69999:   4.7
Average loss at step 74999:   4.6
Average loss at step 79999:   4.6
Average loss at step 84999:   4.7
Average loss at step 89999:   4.7
Average loss at step 94999:   4.6
Average loss at step 99999:   4.6
