## 3. Building word embeddings

### Approach:

1. Read the data
2. Clean the data
3. Build the dictionary, reverse_dict, and count
4. Generate random samples
5. Build the model

### Environment:

In [1]:
%load_ext watermark
%watermark -a 'Scott Ming' -v -m -d -p numpy,pandas,matplotlib,tensorflow

Scott Ming 2017-04-23 

CPython 3.6.0
IPython 6.0.0

numpy 1.12.1
pandas 0.19.2
matplotlib 2.0.0
tensorflow 1.0.1

compiler   : GCC 4.9.2
system     : Linux
release    : 3.16.0-4-amd64
machine    : x86_64
processor  : 
CPU cores  : 4
interpreter: 64bit


In [2]:
import jieba
import collections
import math
import os
import random
import zipfile
import string
import numpy as np
import urllib.request
import tensorflow as tf
import zhon.hanzi as zh

### 3.1 Reading the data

For convenience, this uses text8, the same dataset as the official TensorFlow word2vec example.

In [3]:
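url = 'http://mattmahoney.net/dc/'  # base URL for text8.zip; assumed from the TensorFlow word2vec example, since the original cell leaves `url` undefined


def maybe_download(filename, expected_bytes):
  """Download a file if not present, and make sure it's the right size."""
  if not os.path.exists(filename):  # download only if the file is missing
    filename, _ = urllib.request.urlretrieve(url + filename, filename)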
  statinfo = os.stat(filename)
  if statinfo.st_size == expected_bytes:
    print('Found and verified', filename)
  else:
    print(statinfo.st_size)
    raise Exception(
        'Failed to verify ' + filename + '. Can you get to it with a browser?')
  return filename


def read_data(filename):
  """Extract the first file enclosed in a zip file as a list of words"""
  with zipfile.ZipFile(filename) as f:
    data = tf.compat.as_str(f.read(f.namelist()[0])).split()
  return data
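
To see what `read_data` returns without downloading anything, the sketch below builds a tiny zip archive in memory and reads it the same way. The archive content and the helper name `read_words_from_zip` are made up for illustration, and it uses `bytes.decode` in place of `tf.compat.as_str` so that it runs without TensorFlow.

```python
import io
import zipfile


def read_words_from_zip(file_obj):
    """Return the first archive member as a list of whitespace-separated words."""
    with zipfile.ZipFile(file_obj) as f:
        first_member = f.namelist()[0]
        return f.read(first_member).decode('utf-8').split()


# Build a throwaway archive in memory (hypothetical content, not text8).
buffer = io.BytesIO()
with zipfile.ZipFile(buffer, 'w') as f:
    f.writestr('toy.txt', 'anarchism originated as a term of abuse')
buffer.seek(0)

toy_words = read_words_from_zip(buffer)
print(toy_words)
```

The result is a flat list of tokens, which is why the notebook can slice it with `[:800000]` right after reading.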

In [4]:
filename = maybe_download('text8.zip', 31344016)
words = read_data(filename)[:800000]  # keep the first 800,000 words as a list
print('Data size', len(words))  

Found and verified text8.zip
Data size 800000


Set the vocabulary size (how many of the most frequent words to keep) and the validation words

In [5]:
vocabulary_size = 10000
window_size = 2  # max distance between the center word and a context word; 1 means one word on each side (two in total)
valid_words = ['were', 'five', 'three', 'which']
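
A small illustration of how `window_size` determines the context of a center word. The sentence and the helper `context_of` are hypothetical and are not part of the notebook.

```python
def context_of(tokens, position, window_size):
    """Return the words within `window_size` steps of tokens[position]."""
    left = tokens[max(position - window_size, 0):position]
    right = tokens[position + 1:position + window_size + 1]
    return left + right


toy_sentence = ['the', 'quick', 'brown', 'fox', 'jumps']
center = 2  # position of 'brown'
narrow = context_of(toy_sentence, center, window_size=1)
wide = context_of(toy_sentence, center, window_size=2)
print(narrow)
print(wide)
```

With `window_size=1` the context is the two neighbours `quick` and `fox`; with `window_size=2` it grows to four words, two on each side.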

### 3.2 Building the dictionary, word indices, and training data

Build a dictionary that assigns an index to every word

In [6]:
def build_dictionary(words, vocabulary_size):
    count = [['RARE', -1]]
    count.extend(collections.Counter(words).most_common(vocabulary_size-1))
    # Build the word-to-index dictionary
    word_dict = {}
    for word, word_count in count:
        word_dict[word] = len(word_dict)
    reverse_dict = dict(zip(word_dict.values(), word_dict.keys()))
    return word_dict, reverse_dict, count
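
The same steps applied to a toy corpus show what the three return values look like. The word list and the name `toy_vocabulary_size` are invented for this sketch.

```python
import collections

toy_words = ['a', 'b', 'a', 'c', 'a', 'b', 'd']
toy_vocabulary_size = 3  # 'RARE' plus the two most frequent words

# Slot 0 is reserved for rare words; the rest are ordered by frequency.
count = [['RARE', -1]]
count.extend(collections.Counter(toy_words).most_common(toy_vocabulary_size - 1))

word_dict = {}
for word, _ in count:
    word_dict[word] = len(word_dict)
reverse_dict = {index: word for word, index in word_dict.items()}

print(word_dict)
print(reverse_dict)
```

Here `c` and `d` fall outside the vocabulary, so they have no entry in `word_dict` and will later be mapped to index 0.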

In [7]:
word_dict, reverse_dict, count = build_dictionary(words, vocabulary_size)
valid_examples = [word_dict[x] for x in valid_words]

Convert every word in the corpus to its index

In [8]:
def words_to_number(words, word_dict):
    words_num = []
    unk_count = 0
    for word in words:
        if word in word_dict:
            index = word_dict[word]
        else:
            index = 0  # index of 'RARE'
            unk_count += 1
        words_num.append(index)
    return words_num
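
A compact sketch of the same mapping on toy data, using `dict.get` with a default instead of the explicit `if`/`else`. The dictionary and word list are hypothetical.

```python
toy_word_dict = {'RARE': 0, 'a': 1, 'b': 2}
toy_words = ['a', 'b', 'a', 'c', 'a', 'b', 'd']

# dict.get falls back to 0, the index reserved for 'RARE'.
toy_ids = [toy_word_dict.get(word, 0) for word in toy_words]
unknown = sum(1 for word in toy_words if word not in toy_word_dict)
print(toy_ids)
print(unknown)
```

The two out-of-vocabulary words (`c` and `d`) both become 0, so the model treats all rare words as a single token.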

In [9]:
words_num = words_to_number(words, word_dict)

Build the input data and targets

In [10]:
def build_data(words_num, window_size):
    window_sequences = [words_num[max((ix-window_size),0):(ix+window_size+1)] for ix, x in enumerate(words_num)]
    # Index of the center word inside each window
    label_indices = [ix if ix<window_size else window_size for ix,x in enumerate(window_sequences)]
    # Concatenate the two list slices to separate the context words from the center word
    input_and_labels = [(x[:y] + x[(y+1):], x[y]) for x,y in zip(window_sequences, label_indices)]
    # CBOW needs a full context of 2*window_size words, so drop shorter windows
    input_and_labels = [(x,y) for x,y in input_and_labels if len(x)==2*window_size]
    input_data, label_data = [list(x) for x in zip(*input_and_labels)]
    # Convert each to an array
    input_data = np.array(input_data)  # shape (n, 2*window_size)
    label_data = np.transpose(np.array([label_data]))  # shape (n, 1)
    
    return input_data, label_data
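
To make the output of `build_data` concrete, here is a plain-Python sketch that produces the same kind of (context, center) pairs for a short id sequence. The helper `cbow_pairs` and the ids are hypothetical; it loops directly over positions that have a full window rather than filtering afterwards.

```python
def cbow_pairs(ids, window_size):
    """Return (context_ids, center_id) for positions that have a full window."""
    pairs = []
    for i in range(window_size, len(ids) - window_size):
        context = ids[i - window_size:i] + ids[i + 1:i + window_size + 1]
        pairs.append((context, ids[i]))
    return pairs


toy_ids = [10, 11, 12, 13, 14, 15]
pairs = cbow_pairs(toy_ids, window_size=2)
for context, center in pairs:
    print(context, '->', center)
```

Only positions 2 and 3 have two neighbours on each side, so six ids yield two training examples. In general the first and last `window_size` positions are dropped.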

In [11]:
input_data, label_data = build_data(words_num, window_size)
datasize = input_data.shape[0]

### 3.3 Building the model

#### 1. Basic model parameters

In [12]:
tf.reset_default_graph()

In [13]:
batch_size = 128        # How many sets of words to train on at once.
embedding_size = 200    # The embedding size of each word to train.
generations = 50000     # How many iterations we will perform the training on.
print_loss_every = 2000  # Print the loss every so many iterations
num_sampled = int(batch_size/2)  # number of negative samples
model_learning_rate = 0.01  # learning rate
print_valid_every = 5000
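
A quick back-of-the-envelope check of how much data these settings cover. The value of `datasize` here is an assumption derived from the length filter in `build_data`, which drops positions without a full window; the notebook itself does not print it.

```python
corpus_length = 800000   # words kept after slicing in In [4]
window_size = 2
batch_size = 128
generations = 50000

# Assumption: the first and last `window_size` positions are dropped.
datasize = corpus_length - 2 * window_size
examples_seen = generations * batch_size
passes = examples_seen / datasize
print(datasize, examples_seen, round(passes, 2))
```

Under that assumption, 50,000 steps of 128 examples amount to roughly eight passes over the training pairs.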

#### 2. Embeddings and related parameters

In [14]:
# Define Embeddings:
embeddings = tf.Variable(tf.random_uniform([vocabulary_size, embedding_size], -1.0, 1.0))

# Parameters for the NCE loss
nce_weights = tf.Variable(tf.truncated_normal([vocabulary_size, embedding_size],
                                               stddev=1.0 / np.sqrt(embedding_size)))
nce_biases = tf.Variable(tf.zeros([vocabulary_size]))

# Create data/target placeholders
x_inputs = tf.placeholder(tf.int32, shape=[batch_size, 2*window_size])
y_target = tf.placeholder(tf.int32, shape=[batch_size, 1])
valid_dataset = tf.constant(valid_examples, dtype=tf.int32)

# Look up the embedding of each context word and sum them
embed = tf.zeros([batch_size, embedding_size])
for element in range(2*window_size):
    embed += tf.nn.embedding_lookup(embeddings, x_inputs[:, element])  # add up the embeddings of the words on both sides
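
The lookup-and-sum step can be sketched without TensorFlow. The tiny embedding table, the input ids, and the helper `sum_context_embeddings` below are all hypothetical; the point is only to show that each training example ends up as one vector, the element-wise sum of its context word vectors.

```python
# Hypothetical 4-word vocabulary with 3-dimensional embeddings.
toy_embeddings = [
    [0.0, 0.0, 0.0],   # id 0
    [1.0, 0.0, 2.0],   # id 1
    [0.5, 1.0, 0.0],   # id 2
    [0.0, 3.0, 1.0],   # id 3
]

# One row per training example, one column per context position.
toy_inputs = [
    [1, 2],
    [3, 3],
]


def sum_context_embeddings(embeddings, inputs):
    """Look up each context id and add the vectors element-wise per example."""
    summed = []
    for context_ids in inputs:
        vectors = [embeddings[i] for i in context_ids]
        summed.append([sum(column) for column in zip(*vectors)])
    return summed


toy_embed = sum_context_embeddings(toy_embeddings, toy_inputs)
print(toy_embed)
```

The result has one row per example and `embedding_size` columns, which matches the `[batch_size, embedding_size]` shape of `embed` in the cell above.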

#### 3. Loss function and similarity computation

In [15]:
# Get loss from prediction
loss = tf.reduce_mean(tf.nn.nce_loss(weights=nce_weights,
                                     biases=nce_biases,
                                     labels=y_target,
                                     inputs=embed,
                                     num_sampled=num_sampled,
                                     num_classes=vocabulary_size), name='loss')

# Create optimizer
optimizer = tf.train.GradientDescentOptimizer(learning_rate=model_learning_rate).minimize(loss)

# Compute cosine similarity
norm = tf.sqrt(tf.reduce_sum(tf.square(embeddings), 1, keep_dims=True))
normalized_embeddings = embeddings / norm
valid_embeddings = tf.nn.embedding_lookup(normalized_embeddings, valid_dataset)
similarity = tf.matmul(valid_embeddings, normalized_embeddings, transpose_b=True)
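
The similarity block normalizes every embedding to unit length so that a matrix product gives cosine similarities. A minimal pure-Python sketch of the same idea follows; the vectors and the helpers `cosine_similarity` and `nearest_ids` are hypothetical. It skips the query word explicitly, whereas the notebook's training loop drops the first entry of the sorted result.

```python
import math


def cosine_similarity(u, v):
    """Dot product of u and v divided by the product of their lengths."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def nearest_ids(query_id, embeddings, top_k):
    """Rank all other ids by cosine similarity to embeddings[query_id]."""
    scores = [(cosine_similarity(embeddings[query_id], vector), other_id)
              for other_id, vector in enumerate(embeddings)
              if other_id != query_id]
    scores.sort(reverse=True)
    return [other_id for _, other_id in scores[:top_k]]


toy_embeddings = [
    [1.0, 0.0],    # id 0
    [2.0, 0.1],    # id 1: almost the same direction as id 0
    [0.0, 1.0],    # id 2: orthogonal to id 0
    [-1.0, 0.0],   # id 3: opposite direction
]
print(nearest_ids(0, toy_embeddings, top_k=2))
```

Vector length does not matter here: id 1 is twice as long as id 0 but still ranks first because it points in nearly the same direction.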

In [16]:
loss_summary = tf.summary.scalar('loss', loss)
embed_summary = tf.summary.histogram('embed', embed)
merged = tf.summary.merge_all()

In [17]:
writer = tf.summary.FileWriter('tf_log/')

In [18]:
with tf.Session() as sess:
    init = tf.global_variables_initializer()
    sess.run(init)

    # Run the cbow gram model.
    loss_vec = []
    loss_x_vec = []
    for i in range(generations):
        # Wrap around once the data is exhausted, skipping the incomplete last batch
        start = (i*batch_size) % (datasize - (datasize % batch_size))
        end = start + batch_size
        feed_dict={x_inputs: input_data[start:end], y_target: label_data[start:end]}
        # Run the train step
        _, merged_summary = sess.run([optimizer, merged], feed_dict=feed_dict)
        writer.add_summary(merged_summary, i)
    
        # Return the loss
        if (i+1) % print_loss_every == 0:
            loss_val = sess.run(loss, feed_dict=feed_dict)
            loss_vec.append(loss_val)
            loss_x_vec.append(i+1)
            print("Loss at step {} : {}".format(i+1, loss_val))
        
        if (i+1) % print_valid_every == 0:
            sim = sess.run(similarity, feed_dict=feed_dict)
            for j in range(len(valid_words)):
                valid_word = reverse_dict[valid_examples[j]]
                top_k = 10 # number of nearest neighbors
                nearest = (-sim[j, :]).argsort()[1:top_k+1]
                log_str = "Nearest to {}:".format(valid_word)
                for k in range(top_k):
                    close_word = reverse_dict[nearest[k]]
                    log_str = "%s %s," % (log_str, close_word)
                print(log_str)
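
The batch indexing in the loop above cycles through the data in order and ignores the incomplete batch at the end. A small sketch with toy sizes makes the wrap-around visible; `batch_bounds` and the toy numbers are hypothetical.

```python
def batch_bounds(step, batch_size, datasize):
    """Return (start, end) for the batch used at a given training step."""
    usable = datasize - (datasize % batch_size)  # drop the ragged tail
    start = (step * batch_size) % usable
    return start, start + batch_size


toy_datasize = 10
toy_batch_size = 4
bounds = [batch_bounds(step, toy_batch_size, toy_datasize) for step in range(4)]
print(bounds)
```

With 10 examples and a batch size of 4, only the first 8 examples are ever used, and step 2 starts again from index 0. Every batch therefore has exactly `batch_size` rows, which the fixed-shape placeholders require.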

Loss at step 2000 : 44.070274353027344
Loss at step 4000 : 29.16555404663086
Nearest to were: to, gurps, pride, tactic, contest, the, in, singer, this, descended,
Nearest to five: RARE, the, to, a, one, that, is, of, and, as,
Nearest to three: is, in, the, and, RARE, of, by, to, one, a,
Nearest to which: the, RARE, by, three, and, of, s, to, in, nine,
Loss at step 6000 : 18.81061553955078
Loss at step 8000 : 20.462358474731445
Loss at step 10000 : 23.814836502075195
Nearest to were: to, in, this, and, the, by, be, of, for, also,
Nearest to five: RARE, to, the, a, one, that, is, as, and, of,
Nearest to three: is, in, by, the, and, of, RARE, his, to, that,
Nearest to which: by, the, RARE, and, s, to, three, of, in, that,
Loss at step 12000 : 13.884153366088867
Loss at step 14000 : 11.480175018310547
Nearest to were: to, this, in, by, be, for, and, also, the, an,
Nearest to five: to, RARE, that, a, one, is, as, the, be, s,
Nearest to three: is, in, by, and, his, the, that, of, to, one,

## References:

### CS224n & CS224d:

* Course index:
    * [Stanford University CS224d: Deep Learning for Natural Language Processing](http://cs224d.stanford.edu/syllabus.html)
    * [CS224n: Natural Language Processing with Deep Learning](http://web.stanford.edu/class/cs224n/syllabus.html)
* Reference material:
    * [Word2Vec Tutorial](http://mccormickml.com/2016/04/19/word2vec-tutorial-the-skip-gram-model/)
* Slides:
    * [cs224n slide lecture2 p13](http://web.stanford.edu/class/cs224n/lectures/cs224n-2017-lecture2.pdf)
    * [cs224d slide lecture2](http://cs224d.stanford.edu/lectures/CS224d-Lecture2.pdf)
* Videos:
    * [CS224d Lecture 2](https://www.youtube.com/watch?v=aRqn8t1hLxs&list=PLlJy-eBtNFt4CSVWYqscHDdP58M3zFHIG)
    * [CS224d Lecture 3](https://www.youtube.com/watch?v=CP9bIt4IPVo&index=3&list=PLlJy-eBtNFt4CSVWYqscHDdP58M3zFHIG)
* Notes:
    * [cs224d lecture2 note](http://www.52nlp.cn/%E6%96%AF%E5%9D%A6%E7%A6%8F%E5%A4%A7%E5%AD%A6%E6%B7%B1%E5%BA%A6%E5%AD%A6%E4%B9%A0%E4%B8%8E%E8%87%AA%E7%84%B6%E8%AF%AD%E8%A8%80%E5%A4%84%E7%90%86%E7%AC%AC%E4%BA%8C%E8%AE%B2%E8%AF%8D%E5%90%91%E9%87%8F)

### Other resources:

* word2vec explained:
    * [Original paper](https://papers.nips.cc/paper/5021-distributed-representations-of-words-and-phrases-and-their-compositionality.pdf)
    * [Approximating the Softmax for Learning Word Embeddings](http://sebastianruder.com/word-embeddings-softmax/)
    * [The mathematics behind word2vec explained in detail (image-heavy) - Machine Learning - 算法组](http://suanfazu.com/t/word2vec-zhong-de-shu-xue-yuan-li-xiang-jie-duo-tu-wifixia-yue-du/178)
    * [Word2Vec principles: the hierarchical softmax algorithm | 一灯@qiancy.com](http://qiancy.com/2016/08/17/word2vec-hierarchical-softmax/)
    * [A walkthrough of TensorFlow's word2vec demo - 阁子 - 博客园](http://www.cnblogs.com/rocketfan/p/4976806.html)
    * [Word2Vec: knowing what it does and why - 作业部落 Cmd Markdown](https://www.zybuluo.com/Dounm/note/591752#322-使用negative-sampling优化)

* Q&A:
    * [TensorFlow's NCE loss implementation and word2vec - 简书](http://www.jianshu.com/p/fab82fa53e16)
    * [word2vec - TensorFlow Embedding Lookup - Stack Overflow](http://stackoverflow.com/questions/37897934/tensorflow-embedding-lookup)

* Books:
    * [TensorFlow实战 (豆瓣)](https://book.douban.com/subject/26974266/), chapter 7, on implementing word2vec
    * [nfmcclure/tensorflow_cookbook: Code for Tensorflow Machine Learning Cookbook](https://github.com/nfmcclure/tensorflow_cookbook), chapter 7