## A Simple Word2Vec Example

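This example shows how to use DSW to train a Word2Vec model. Word2Vec is a fairly basic data-processing technique in NLP training; for the underlying principles, read the paper "[Efficient Estimation of Word Representations in Vector Space](https://arxiv.org/abs/1301.3781)".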

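We prepare a text corpus and use it to learn a vector representation for each word it contains, such that distance in the vector space reflects semantic distance: the closer two word vectors are, the more similar the words' meanings.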
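As a rough illustration of "closer vectors mean more similar words", the sketch below compares toy vectors with cosine similarity, the same measure used later in this example when looking up nearest neighbors. The three-dimensional vectors and the words attached to them are made up for illustration; they are not learned embeddings.

```python
import numpy as np

def cosine_similarity(a, b):
    """Cosine of the angle between two vectors (1.0 means same direction)."""
    a = np.asarray(a, dtype=float)
    b = np.asarray(b, dtype=float)
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

# Hypothetical 3-dimensional embeddings, invented for this illustration only.
toy_vectors = {
    "king":  [0.9, 0.8, 0.1],
    "queen": [0.8, 0.9, 0.1],
    "apple": [0.1, 0.0, 0.9],
}

sim_king_queen = cosine_similarity(toy_vectors["king"], toy_vectors["queen"])
sim_king_apple = cosine_similarity(toy_vectors["king"], toy_vectors["apple"])

# With these made-up vectors, "king" is closer to "queen" than to "apple".
print(sim_king_queen > sim_king_apple)
```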

First, we read the text into the list `words`. As the output shows, the text contains 17,005,207 words in total.

In [25]:
import tensorflow as tf
import zipfile

with zipfile.ZipFile("text.zip") as f:
    words = tf.compat.as_str(f.read(f.namelist()[0])).split()
    
print('words size', len(words))

words size 17005207


Next, we build a vocabulary. In this example we cap the vocabulary size at 50,000: we keep the 49,999 most frequent words in the text and treat every other word as 'UNK'.

In [27]:
import collections
import math

vocabulary_size = 50000
count = [['UNK', -1]]
count.extend(collections.Counter(words).most_common(vocabulary_size - 1))
print("Top 5 most frequent words and their counts:", count[1:6])

Top 5 most frequent words and their counts: [('the', 1061396), ('of', 593677), ('and', 416629), ('one', 411764), ('in', 372201)]


To make training easier later on, we identify each word by its index in the vocabulary and encode the original text with these indices.

In [28]:
dictionary = dict()
for word, _ in count:
    dictionary[word] = len(dictionary)
data = list()
unk_count = 0
for word in words:
    if word in dictionary:
        index = dictionary[word]
    else:
        index = 0  # dictionary['UNK']
        unk_count += 1  # count only out-of-vocabulary words
    data.append(index)
count[0][1] = unk_count

print('Encoded text:', data[:10], '...')

Encoded text: [5234, 3081, 12, 6, 195, 2, 3134, 46, 59, 156] ...
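To see the vocabulary and encoding steps on something small enough to check by hand, here is a minimal sketch that applies the same logic to a seven-token toy corpus. The helper `build_dataset` and the toy tokens are assumptions made for illustration; they are not part of the notebook above.

```python
import collections

def build_dataset(tokens, vocabulary_size):
    """Return (encoded tokens, count list, word-to-index dict)."""
    count = [['UNK', -1]]
    count.extend(collections.Counter(tokens).most_common(vocabulary_size - 1))
    # Index 0 is reserved for 'UNK'; more frequent words get smaller indices.
    dictionary = {word: index for index, (word, _) in enumerate(count)}
    encoded = [dictionary.get(word, 0) for word in tokens]
    count[0][1] = encoded.count(0)  # number of out-of-vocabulary tokens
    return encoded, count, dictionary

toy_tokens = "a b a c a b d".split()
toy_data, toy_count, toy_dictionary = build_dataset(toy_tokens, vocabulary_size=3)

print(toy_data)
print(toy_count)
```

With a vocabulary size of 3, only the two most frequent words keep their own index; `c` and `d` are both encoded as 0, and the 'UNK' count becomes 2.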


We build a reverse lookup table so that, once training is done, word indices can be mapped back to the original words.

In [29]:
reverse_dictionary = dict(zip(dictionary.values(), dictionary.keys()))
print([reverse_dictionary[i] for i in data[:10]])

['anarchism', 'originated', 'as', 'a', 'term', 'of', 'abuse', 'first', 'used', 'against']


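Now for the main part: we construct the training samples for word2vec, following the skip-gram method from the paper. We define this as a function to be used later during training.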

In [30]:
import numpy as np
import random
data_index = 0
def generate_batch(batch_size, num_skips, skip_window):
    global data_index
    batch = np.ndarray(shape=(batch_size), dtype=np.int32)
    labels = np.ndarray(shape=(batch_size, 1), dtype=np.int32)
    span = 2 * skip_window + 1 # [ skip_window target skip_window ]
    buffer = collections.deque(maxlen=span)
    for _ in range(span):
        buffer.append(data[data_index])
        data_index = (data_index + 1) % len(data)
    for i in range(batch_size // num_skips):
        target = skip_window  # target label at the center of the buffer
        targets_to_avoid = [ skip_window ]
        for j in range(num_skips):
            while target in targets_to_avoid:
                target = random.randint(0, span - 1)
            targets_to_avoid.append(target)
            batch[i * num_skips + j] = buffer[skip_window]
            labels[i * num_skips + j, 0] = buffer[target]
        buffer.append(data[data_index])
        data_index = (data_index + 1) % len(data)
    return batch, labels

Let's take a look at what these training samples look like.

In [31]:
batch, labels = generate_batch(batch_size=8, num_skips=2, skip_window=1)
for i in range(8):
    print(batch[i], reverse_dictionary[batch[i]],
        '->', labels[i, 0], reverse_dictionary[labels[i, 0]])

3081 originated -> 12 as
3081 originated -> 5234 anarchism
12 as -> 3081 originated
12 as -> 6 a
6 a -> 12 as
6 a -> 195 term
195 term -> 2 of
195 term -> 6 a
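The pairing rule behind this output can be sketched without the batching and random sampling. The helper below, `skipgram_pairs`, is a hypothetical simplification and not the notebook's `generate_batch`: it lists every (center, context) pair within `skip_window` positions, including pairs for words at the edges of the sequence, whereas `generate_batch` samples `num_skips` context words per center from a sliding buffer.

```python
def skipgram_pairs(sequence, skip_window):
    """Enumerate every (center, context) pair within skip_window positions."""
    pairs = []
    for center_pos, center in enumerate(sequence):
        lo = max(0, center_pos - skip_window)
        hi = min(len(sequence), center_pos + skip_window + 1)
        for context_pos in range(lo, hi):
            if context_pos != center_pos:  # a word is not its own context
                pairs.append((center, sequence[context_pos]))
    return pairs

toy_sentence = ['anarchism', 'originated', 'as', 'a']
toy_pairs = skipgram_pairs(toy_sentence, skip_window=1)
for center, context in toy_pairs:
    print(center, '->', context)
```

Each center word is paired with its immediate left and right neighbors, which is the relationship the model is trained to predict.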


Now we define the DNN model needed for training.

In [32]:
batch_size = 128
embedding_size = 128  # Dimension of the embedding vector.
skip_window = 1       # How many words to consider left and right.
num_skips = 2         # How many times to reuse an input to generate a label.

# We pick a random validation set to sample nearest neighbors. Here we limit the
# validation samples to the words that have a low numeric ID, which by
# construction are also the most frequent.
valid_size = 16     # Random set of words to evaluate similarity on.
valid_window = 100  # Only pick dev samples in the head of the distribution.
valid_examples = np.random.choice(valid_window, valid_size, replace=False)
num_sampled = 64    # Number of negative examples to sample.

In [33]:
train_inputs = tf.placeholder(tf.int32, shape=[batch_size])
train_labels = tf.placeholder(tf.int32, shape=[batch_size, 1])
valid_dataset = tf.constant(valid_examples, dtype=tf.int32)

embeddings = tf.Variable(
    tf.random_uniform([vocabulary_size, embedding_size], -1.0, 1.0))
embed = tf.nn.embedding_lookup(embeddings, train_inputs)

nce_weights = tf.Variable(
    tf.truncated_normal([vocabulary_size, embedding_size],
                            stddev=1.0 / math.sqrt(embedding_size)))
nce_biases = tf.Variable(tf.zeros([vocabulary_size]))

loss = tf.reduce_mean(
    tf.nn.nce_loss(weights=nce_weights,
                     biases=nce_biases,
                     labels=train_labels,
                     inputs=embed,
                     num_sampled=num_sampled,
                     num_classes=vocabulary_size))

optimizer = tf.train.GradientDescentOptimizer(1.0).minimize(loss)

norm = tf.sqrt(tf.reduce_sum(tf.square(embeddings), 1, keepdims=True))
normalized_embeddings = embeddings / norm
valid_embeddings = tf.nn.embedding_lookup(normalized_embeddings, valid_dataset)
similarity = tf.matmul(valid_embeddings, normalized_embeddings, transpose_b=True)
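
The last four lines of the graph normalize each embedding to unit length and take dot products, which amounts to cosine similarity between the validation words and every word in the vocabulary. A minimal NumPy sketch of the same idea, including a nearest-neighbor lookup, is given below; the function `nearest_neighbors` and the 2-dimensional toy embeddings are assumptions made for illustration.

```python
import numpy as np

def nearest_neighbors(embeddings, query_index, top_k):
    """Indices of the top_k rows most cosine-similar to row query_index."""
    norms = np.sqrt(np.sum(np.square(embeddings), axis=1, keepdims=True))
    normalized = embeddings / norms
    sim = normalized @ normalized[query_index]
    # The query itself has similarity 1 and sorts first, so skip position 0.
    return (-sim).argsort()[1:top_k + 1]

# Hypothetical embeddings: row 1 points almost the same way as row 0,
# row 2 is orthogonal to it, and row 3 points the opposite way.
toy_embeddings = np.array([
    [1.0, 0.0],
    [0.9, 0.1],
    [0.0, 1.0],
    [-1.0, 0.0],
])

toy_neighbors = nearest_neighbors(toy_embeddings, query_index=0, top_k=2)
print(toy_neighbors)
```

Sorting the negated similarities puts the most similar rows first, so for row 0 the lookup returns row 1 followed by row 2.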


In [34]:
sess = tf.InteractiveSession()
tf.global_variables_initializer().run()

average_loss = 0
for step in range(100001):
    batch_inputs, batch_labels = generate_batch(
        batch_size, num_skips, skip_window)
    feed_dict = {train_inputs : batch_inputs, train_labels : batch_labels}

    # We perform one update step by evaluating the optimizer op (including it
    # in the list of returned values for sess.run()).
    _, loss_val = sess.run([optimizer, loss], feed_dict=feed_dict)
    average_loss += loss_val

    if step % 2000 == 0:
        if step > 0:
            average_loss /= 2000
        print("Average loss at step ", step, ": ", average_loss)
        average_loss = 0

    if step % 10000 == 0:
        sim = similarity.eval()
        for i in range(valid_size):
            valid_word = reverse_dictionary[valid_examples[i]]
            top_k = 8 # number of nearest neighbors
            nearest = (-sim[i, :]).argsort()[1:top_k+1]
            log_str = "Nearest to %s:" % valid_word
            for k in range(top_k):
                close_word = reverse_dictionary[nearest[k]]
                log_str = "%s %s," % (log_str, close_word)
            print(log_str)
final_embeddings = normalized_embeddings.eval()




Average loss at step  0 :  293.93377685546875
Nearest to more: ona, conquistador, loa, vichy, escoffier, miss, humber, gulliver,
Nearest to six: degrees, californian, ascertaining, internal, undecidable, lease, meteorologists, terabytes,
Nearest to be: jupiter, surinam, detritus, involves, spongebob, cluniac, mulholland, postgraduate,
Nearest to known: noticing, senators, scripture, hahn, demolished, castilian, modernity, wiser,
Nearest to two: gions, composing, flames, foolish, tempting, ali, silt, captive,
Nearest to use: passport, morgue, uma, messiaen, scrap, studio, excelled, garcia,
Nearest to for: patricio, klerk, puritanical, modernity, arminius, belgica, lenovo, pentagonal,
Nearest to other: marianas, aarseth, legates, justus, confocal, exe, indented, rockwell,
Nearest to first: ingredients, massing, bigelow, disarm, allein, discouraged, cygnus, greenberg,
Nearest to from: batter, clarified, unpaired, neurological, monetarism, shoe, imprisoned, affair,
Nearest to after: righti

Finally, we visualize the learned word vectors by plotting them on a canvas.

In [35]:
from sklearn.manifold import TSNE
import matplotlib.pyplot as plt
tsne = TSNE(perplexity=30, n_components=2, init='pca', n_iter=5000)
plot_only = 200
low_dim_embs = tsne.fit_transform(final_embeddings[:plot_only,:])
labels = [reverse_dictionary[i] for i in range(plot_only)]

plt.figure(figsize=(18, 18))  #in inches
for i, label in enumerate(labels):
    x, y = low_dim_embs[i,:]
    plt.scatter(x, y)
    plt.annotate(label,
                 xy=(x, y),
                 xytext=(5, 2),
                 textcoords='offset points',
                 ha='right',
                 va='bottom')

plt.savefig('result.png')

![result](./result.png)