## A Simple Word2Vec Example

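This example shows how to use DSW to build a Word2Vec training workflow. Word2Vec is a fairly basic data-processing technique in NLP training; for the underlying theory, see the paper [Efficient Estimation of Word Representations in Vector Space](https://arxiv.org/abs/1301.3781).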

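We start from a body of text and use it to learn a vector representation for each word it contains, so that distance in the vector space reflects semantic distance: the closer two word vectors are, the more similar the words' meanings.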
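To make "closer vectors mean more similar words" concrete, here is a minimal sketch that compares hand-picked toy vectors. It assumes cosine similarity as the closeness measure, which is a common choice but not one fixed by the text above, and the three embeddings are hypothetical values made up for illustration, not learned ones.

```python
import numpy as np

def cosine_similarity(a, b):
    """Cosine of the angle between a and b; values near 1.0 mean similar direction."""
    a = np.asarray(a, dtype=np.float64)
    b = np.asarray(b, dtype=np.float64)
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

# Hypothetical 3-dimensional embeddings, chosen by hand for illustration only.
toy_embeddings = {
    'king':  [0.9, 0.8, 0.1],
    'queen': [0.8, 0.9, 0.1],
    'apple': [0.1, 0.0, 0.9],
}

sim_king_queen = cosine_similarity(toy_embeddings['king'], toy_embeddings['queen'])
sim_king_apple = cosine_similarity(toy_embeddings['king'], toy_embeddings['apple'])
print('king/queen:', round(sim_king_queen, 3))
print('king/apple:', round(sim_king_apple, 3))
```

With these toy values the king/queen pair scores higher than the king/apple pair, which is the kind of relationship the trained vectors are expected to capture.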

First, we read the text into the words list. As the output shows, the text contains 17,005,207 words in total.

In [3]:
import tensorflow as tf
import zipfile

with zipfile.ZipFile("text.zip") as f:
    words = tf.compat.as_str(f.read(f.namelist()[0])).split()
    
print('words size', len(words))

words size 17005207


Next we build a vocabulary. In this example we cap its size at 50,000: it holds the 49,999 most frequent words in the text, and every other word is treated as 'UNK'.

In [6]:
import collections
import math

vocabulary_size = 50000
count = [['UNK', -1]]
count.extend(collections.Counter(words).most_common(vocabulary_size - 1))
print('5 most common words and their counts:', count[1:6])

5 most common words and their counts: [('the', 1061396), ('of', 593677), ('and', 416629), ('one', 411764), ('in', 372201)]
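The same vocabulary-building step can be tried on a tiny input to see exactly what ends up in the list. This is only a sketch: the names toy_words, toy_vocabulary_size and toy_count are made up for illustration and are not part of the notebook.

```python
import collections

# A tiny hypothetical corpus: 'the' appears 3 times, 'cat' twice, the rest once.
toy_words = 'the cat sat on the mat the cat ran'.split()
toy_vocabulary_size = 3  # 'UNK' plus the 2 most frequent words

# Slot 0 is reserved for 'UNK'; its count is a placeholder to be filled in later.
toy_count = [['UNK', -1]]
toy_count.extend(collections.Counter(toy_words).most_common(toy_vocabulary_size - 1))
print(toy_count)
```

Only 'the' and 'cat' make it into the vocabulary here; every other word would later be mapped to 'UNK'.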


To make the later training easier, we represent each word by its index in the dictionary and encode the original text with these indices.

In [8]:
dictionary = dict()
for word, _ in count:
    dictionary[word] = len(dictionary)
data = list()
unk_count = 0
for word in words:
    if word in dictionary:
        index = dictionary[word]
    else:
        index = 0  # dictionary['UNK']
        unk_count += 1  # count only the words that fall outside the dictionary
    data.append(index)
count[0][1] = unk_count

print('Encoded text:', data[:10], '...')

Encoded text: [5234, 3081, 12, 6, 195, 2, 3134, 46, 59, 156] ...
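The encoding loop can also be written more compactly with dict.get. The sketch below is one possible variant, not the notebook's own code: encode is a hypothetical helper, and the toy dictionary is made up for illustration. It assumes that 'UNK' is present in the dictionary and that the raw text does not itself contain the literal token 'UNK'.

```python
def encode(words, dictionary):
    """Map each word to its index; words missing from the dictionary map to 'UNK'."""
    unk_index = dictionary['UNK']
    data = [dictionary.get(word, unk_index) for word in words]
    # Every index equal to unk_index came from an out-of-vocabulary word.
    unk_count = sum(1 for index in data if index == unk_index)
    return data, unk_count

# Hypothetical toy dictionary and text for illustration.
toy_dictionary = {'UNK': 0, 'the': 1, 'cat': 2}
toy_words = 'the cat sat on the mat'.split()
toy_data, toy_unk_count = encode(toy_words, toy_dictionary)
print(toy_data, toy_unk_count)
```

In this toy input, 'sat', 'on' and 'mat' are out of vocabulary, so they all share index 0 and the unknown count is 3.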


Next we build a reverse lookup table so that, once training is done, an encoded index can be mapped back to the original word.

In [11]:
reverse_dictionary = dict(zip(dictionary.values(), dictionary.keys()))
print([reverse_dictionary[i] for i in data[:10]])

['anarchism', 'originated', 'as', 'a', 'term', 'of', 'abuse', 'first', 'used', 'against']
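One thing worth keeping in mind is that the round trip is lossy for words outside the vocabulary: they are encoded as 0 and decode back to 'UNK', not to the original word. A small sketch with a made-up toy dictionary illustrates this.

```python
# Hypothetical toy dictionary; index 0 is reserved for 'UNK' as in the notebook.
toy_dictionary = {'UNK': 0, 'the': 1, 'cat': 2}
toy_reverse = {index: word for word, index in toy_dictionary.items()}

original = ['the', 'cat', 'sat']  # 'sat' is not in the toy dictionary
encoded = [toy_dictionary.get(word, 0) for word in original]
decoded = [toy_reverse[index] for index in encoded]
print(encoded)
print(decoded)
```

Here 'the' and 'cat' survive the round trip, while 'sat' comes back as 'UNK'.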


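Now for the main part: we construct the training samples for word2vec, following the skip-gram method from the paper. We define this as a function so that it can be reused later during training.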

In [17]:
import numpy as np
import random
data_index = 0
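def generate_batch(batch_size, num_skips, skip_window):
    global data_index
    # Each center word yields num_skips samples, drawn without repetition
    # from the 2 * skip_window words around it.
    assert batch_size % num_skips == 0
    assert num_skips <= 2 * skip_window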
    batch = np.ndarray(shape=(batch_size), dtype=np.int32)
    labels = np.ndarray(shape=(batch_size, 1), dtype=np.int32)
    span = 2 * skip_window + 1 # [ skip_window target skip_window ]
    buffer = collections.deque(maxlen=span)
    for _ in range(span):
        buffer.append(data[data_index])
        data_index = (data_index + 1) % len(data)
    for i in range(batch_size // num_skips):
        target = skip_window  # target label at the center of the buffer
        targets_to_avoid = [ skip_window ]
        for j in range(num_skips):
            while target in targets_to_avoid:
                target = random.randint(0, span - 1)
            targets_to_avoid.append(target)
            batch[i * num_skips + j] = buffer[skip_window]
            labels[i * num_skips + j, 0] = buffer[target]
        buffer.append(data[data_index])
        data_index = (data_index + 1) % len(data)
    return batch, labels

Let's take a look at what these training samples look like.

In [18]:
batch, labels = generate_batch(batch_size=8, num_skips=2, skip_window=1)
for i in range(8):
    print(batch[i], reverse_dictionary[batch[i]],
        '->', labels[i, 0], reverse_dictionary[labels[i, 0]])

3081 originated -> 5234 anarchism
3081 originated -> 12 as
12 as -> 6 a
12 as -> 3081 originated
6 a -> 195 term
6 a -> 12 as
195 term -> 6 a
195 term -> 2 of
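For comparison, here is a deterministic sketch that lists every (center, context) pair within a window, without the random sampling. It is a simplified illustration, not the notebook's function: skipgram_pairs is a hypothetical helper, it also emits pairs for words at the edges of the sequence, and it does not limit each center word to num_skips samples.

```python
def skipgram_pairs(sequence, skip_window):
    """Return every (center, context) pair whose positions differ by at most skip_window."""
    pairs = []
    for center_pos, center in enumerate(sequence):
        start = max(0, center_pos - skip_window)
        stop = min(len(sequence), center_pos + skip_window + 1)
        for context_pos in range(start, stop):
            if context_pos != center_pos:
                pairs.append((center, sequence[context_pos]))
    return pairs

# The first four words of the decoded text shown earlier.
toy_sequence = ['anarchism', 'originated', 'as', 'a']
toy_pairs = skipgram_pairs(toy_sequence, skip_window=1)
for center, context in toy_pairs:
    print(center, '->', context)
```

With skip_window=1 and num_skips=2, generate_batch produces the same pairs for interior words such as 'originated' and 'as', only in a random order within each center word.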
