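### 10.1 Word embedding
* Word vectors
    - A word vector is the feature vector (representation) of a word
    - The technique of mapping words to real-valued vectors is also called word embedding
* Word2Vec
    - Represents each word as a fixed-length vector, such that the vectors capture similarity and analogy relations between words well
    - Comprises two models:
        - skip-gram
        - CBOW (continuous bag of words)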
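
To make the difference between the two models concrete, here is a minimal sketch of how each one frames its training examples. The helper names `skip_gram_examples` and `cbow_examples` and the toy sentence are illustrative assumptions, not part of the notes; skip-gram predicts each context word from the center word, while CBOW predicts the center word from all of its context words.

```python
def skip_gram_examples(tokens, window):
    """Skip-gram: each center word predicts each of its context words."""
    examples = []
    for i, center in enumerate(tokens):
        lo, hi = max(0, i - window), min(len(tokens), i + window + 1)
        for j in range(lo, hi):
            if j != i:
                examples.append((center, tokens[j]))
    return examples


def cbow_examples(tokens, window):
    """CBOW: the context words together predict the center word."""
    examples = []
    for i, center in enumerate(tokens):
        lo, hi = max(0, i - window), min(len(tokens), i + window + 1)
        context = [tokens[j] for j in range(lo, hi) if j != i]
        examples.append((context, center))
    return examples


# hypothetical toy sentence
sentence = ['the', 'man', 'loves', 'his', 'son']
print(skip_gram_examples(sentence, 1)[:3])
print(cbow_examples(sentence, 1)[:2])
```

The code later in these notes follows the skip-gram framing: one center word paired with a list of context words.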

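### 10.2 Approximate training
* Negative sampling
    - Builds the loss function from mutually independent events that involve both the positive example and the sampled negative examples
    - The gradient cost of each training step is linear in the number of sampled noise words
* Hierarchical softmax
    - Uses a binary tree and builds the loss function from the path from the root node to the leaf node
    - The gradient cost of each training step is proportional to the logarithm of the vocabulary size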
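
The two approximations can be sketched in a few lines of plain Python. This is a minimal illustration under assumed names (`negative_sampling_loss`, `hierarchical_softmax_prob`) and hypothetical toy vectors, not the implementation used later in these notes. The first function sums one term for the true context word and one per noise word, so its cost grows with the number of noise words; the second multiplies one sigmoid per inner node on the root-to-leaf path, so its cost grows with the tree depth.

```python
import math


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def negative_sampling_loss(v_c, u_o, noise_vectors):
    """Loss for one (center, context) pair with K sampled noise words."""
    loss = -math.log(sigmoid(dot(u_o, v_c)))       # positive example
    for u_k in noise_vectors:                      # K negative examples
        loss -= math.log(sigmoid(-dot(u_k, v_c)))
    return loss


def hierarchical_softmax_prob(v_c, path):
    """path: list of (inner_node_vector, direction) pairs from the root
    down to the leaf, with direction +1 for 'go left' and -1 for 'go right'."""
    prob = 1.0
    for u_n, direction in path:
        prob *= sigmoid(direction * dot(u_n, v_c))
    return prob


# hypothetical toy vectors for a tree with 4 leaves (depth 2)
v_c = [0.5, -1.0]
root, left, right = [1.0, 0.2], [-0.3, 0.8], [0.4, 0.4]
leaf_paths = [
    [(root, +1), (left, +1)], [(root, +1), (left, -1)],
    [(root, -1), (right, +1)], [(root, -1), (right, -1)],
]
# the leaf probabilities form a valid distribution over the vocabulary
print(sum(hierarchical_softmax_prob(v_c, p) for p in leaf_paths))
```

Because sigmoid(x) + sigmoid(-x) = 1 at every inner node, the leaf probabilities sum to 1 without ever normalizing over the whole vocabulary.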

##### Processing the dataset

In [1]:
import collections
import math
import random
import sys
import time
import os
import numpy as np
import torch
from torch import nn
import torch.utils.data as Data

In [2]:
data_dir = 'data/ptb'
train_data = data_dir + '/ptb.train.txt'
with open(train_data, 'r') as f:
    lines = f.readlines()
    raw_dataset = [st.split() for st in lines]
print('sentence: %d' % len(raw_dataset))

sentence: 42068


In [3]:
for st in raw_dataset[:3]:
    print('tokens:', len(st), st[:5])

tokens: 24 ['aer', 'banknote', 'berlitz', 'calloway', 'centrust']
tokens: 15 ['pierre', '<unk>', 'N', 'years', 'old']
tokens: 11 ['mr.', '<unk>', 'is', 'chairman', 'of']


In [4]:
# Build the token index, keeping only tokens that occur at least 5 times
counter = collections.Counter([tk for st in raw_dataset 
                               for tk in st])
print(len(counter))
counter = dict(filter(lambda x: x[1] >= 5, 
                      counter.items()))
print(len(counter))

9999
9858


In [5]:
idx_to_token = [tk for tk, _ in counter.items()]
token_to_idx = {tk:idx for idx, tk in 
                enumerate(idx_to_token)}
dataset = [[token_to_idx[tk] for tk in st 
            if tk in token_to_idx] 
           for st in raw_dataset]
num_tokens = sum([len(st) for st in dataset])
print('tokens: %d' % num_tokens)

tokens: 887100


In [6]:
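# Subsampling: every indexed word in the dataset is dropped with some probability
# Drop probability: P(w_i) = max(1 - sqrt(t / f(w_i)), 0)
# f(w_i) is the frequency of word w_i: the count of w_i divided by the total number of tokens
# t is a constant hyperparameter (1e-4 here); w_i can be dropped only when f(w_i) > t
# The more frequent a word is, the more likely it is to be dropped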
def discard(idx):
    return random.uniform(0, 1) < 1 - math.sqrt(
            1e-4 / counter[idx_to_token[idx]] * 
            num_tokens)
subsampled_dataset = [[tk for tk in st if not discard(tk)
                      ] for st in dataset]
print('tokens: %d' % sum([len(st) for st in 
                         subsampled_dataset]))

tokens: 376053


In [7]:
def compare_counts(token):
    return '%s: before=%d, after=%d' % (
        token, 
        sum([st.count(token_to_idx[token]) 
            for st in dataset]),
        sum([st.count(token_to_idx[token])
            for st in subsampled_dataset])
    )

In [8]:
compare_counts('the')

'the: before=50770, after=2135'

In [9]:
compare_counts('join')

'join: before=45, after=45'
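
The counts above can be checked against the drop probability formula. The sketch below is a hedged, standalone recomputation; the helper name `discard_probability` is an assumption, and the counts (50770 for 'the', 45 for 'join', 887100 tokens in total) are copied from the outputs above.

```python
import math


def discard_probability(count, num_tokens, t=1e-4):
    """P(w) = max(1 - sqrt(t / f(w)), 0) with f(w) = count / num_tokens."""
    freq = count / num_tokens
    return max(1.0 - math.sqrt(t / freq), 0.0)


num_tokens = 887100  # total token count printed earlier
for token, count in [('the', 50770), ('join', 45)]:
    p = discard_probability(count, num_tokens)
    print('%s: discard prob=%.3f, expected kept=%.0f'
          % (token, p, count * (1 - p)))
```

For 'the' the drop probability is close to 0.96, so only about 2,100 occurrences are expected to survive, which is consistent with the 2135 observed. For 'join' the frequency is below t, so the probability is clipped to 0 and all 45 occurrences are kept.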

##### Extracting center words and context words

In [10]:
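# Extract all center words and their context words
# Words whose distance from the center word does not exceed the window size are its context words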
def get_centers_and_contexts(dataset, max_window_size):
    centers, contexts = [], []
    for st in dataset:
        if len(st) < 2: 
            continue
        centers += st
        for center_i in range(len(st)):
            window_size = random.randint(1, 
                                         max_window_size)
            indices = list(range(
                max(0, center_i - window_size), 
                min(len(st), center_i + 1 + window_size)))
            indices.remove(center_i)
            contexts.append([st[idx] for idx in indices])
    return centers, contexts

In [11]:
# test
tiny_dataset = [list(range(7)), list(range(7, 10))]
print('dataset', tiny_dataset)
for center, context in zip(*get_centers_and_contexts(
        tiny_dataset, 2)):
    print('center', center, 'has contexts', context)

dataset [[0, 1, 2, 3, 4, 5, 6], [7, 8, 9]]
center 0 has contexts [1, 2]
center 1 has contexts [0, 2, 3]
center 2 has contexts [1, 3]
center 3 has contexts [2, 4]
center 4 has contexts [2, 3, 5, 6]
center 5 has contexts [4, 6]
center 6 has contexts [4, 5]
center 7 has contexts [8, 9]
center 8 has contexts [7, 9]
center 9 has contexts [7, 8]


In [12]:
all_centers, all_contexts = get_centers_and_contexts(
        subsampled_dataset, 5)

In [13]:
print(len(all_centers), len(all_contexts))

375123 375123


##### Negative sampling

In [14]:
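# For each (center word, context word) pair, randomly sample K noise words
# The sampling probability P(w) of a noise word is proportional to its count raised to the power 0.75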
def get_negatives(all_contexts, sampling_weights, K):
    all_negatives, neg_candidates, i = [], [], 0
    population = list(range(len(sampling_weights)))
    for contexts in all_contexts:
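        negatives = []
        context_set = set(contexts)  # build the lookup set once per center word
        while len(negatives) < len(contexts) * K:
            if i == len(neg_candidates):
                # refill the candidate pool in bulk instead of sampling one at a time
                i, neg_candidates = 0, random.choices(
                    population, sampling_weights, 
                    k=int(1e5))
            neg, i = neg_candidates[i], i + 1
            # a noise word must not be one of the context words
            if neg not in context_set:
                negatives.append(neg)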
        all_negatives.append(negatives)
    return all_negatives

In [15]:
sampling_weights = [counter[w] ** 0.75 
                    for w in idx_to_token]
all_negatives = get_negatives(all_contexts, 
                              sampling_weights, 5)
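
To see what the 0.75 exponent does, the sketch below compares the normalized sampling probabilities with and without it for two tokens. The helper name `noise_distribution` is an assumption; the two counts are the ones reported earlier for 'the' and 'join'. Note that `random.choices` accepts relative weights, so the cell above does not need to normalize them.

```python
def noise_distribution(counts, power=0.75):
    """Normalize count ** power into a probability distribution."""
    weights = {tk: c ** power for tk, c in counts.items()}
    total = sum(weights.values())
    return {tk: w / total for tk, w in weights.items()}


toy_counts = {'the': 50770, 'join': 45}  # counts reported earlier
raw = noise_distribution(toy_counts, power=1.0)
smoothed = noise_distribution(toy_counts, power=0.75)
for tk in toy_counts:
    print('%s: raw=%.4f, power 0.75=%.4f' % (tk, raw[tk], smoothed[tk]))
```

Raising counts to a power below 1 flattens the distribution: rare words are drawn as noise more often than their raw frequency would suggest, and very frequent words somewhat less often.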

##### Reading the data

In [16]:
class MyDataset(torch.utils.data.Dataset):
    def __init__(self, centers, contexts, negatives):
        assert (len(centers) == len(contexts) 
                            == len(negatives))
        self.centers = centers
        self.contexts = contexts
        self.negatives = negatives
    def __getitem__(self, index):
        return (self.centers[index], self.contexts[index],
                self.negatives[index])
    def __len__(self):
        return len(self.centers)

In [17]:
def batchify(data):
    '''Used as the collate_fn argument of DataLoader.
    The input is a list of length batch_size; each element
    is the result of calling __getitem__ on the Dataset.
    '''
    max_len = max(len(c) + len(n) for _, c, n in data)
    #print(len(data))
    centers, contexts_negatives = [], []
    masks, labels = [], []
    for center, context, negative in data:
        #print(center, len(context), len(negative))
        cur_len = len(context) + len(negative)
        centers += [center]
        contexts_negatives += [context + negative + 
                               [0] * (max_len - cur_len)]
        masks += [[1] * cur_len + 
                  [0] * (max_len - cur_len)]
        labels += [[1] * len(context) + 
                   [0] * (max_len - len(context))]
    return (torch.tensor(centers).view(-1, 1), 
            torch.tensor(contexts_negatives), 
            torch.tensor(masks), 
            torch.tensor(labels))

In [18]:
batch_size = 512
num_workers = 0
dataset = MyDataset(all_centers, all_contexts, 
                    all_negatives)
data_iter = Data.DataLoader(dataset, batch_size, 
                            shuffle=True, 
                            collate_fn=batchify,
                            num_workers=num_workers)
for batch in data_iter:
    for name, data in zip(['centers', 
                           'contexts_negatives', 
                           'masks', 'labels'], batch):
        print(name, 'shape:', data.shape)
    break

centers shape: torch.Size([512, 1])
contexts_negatives shape: torch.Size([512, 60])
masks shape: torch.Size([512, 60])
labels shape: torch.Size([512, 60])
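
The per-example logic inside `batchify` is easier to follow on plain lists. The sketch below is a torch-free restatement with an assumed helper name `pad_example` and hypothetical toy indices: context and noise words are concatenated and zero-padded to a common length, the mask marks the real (non-padding) entries, and the label marks which of them are context words.

```python
def pad_example(context, negative, max_len):
    """Pure-Python version of the per-example logic in batchify."""
    cur_len = len(context) + len(negative)
    pad = max_len - cur_len
    contexts_negatives = context + negative + [0] * pad
    mask = [1] * cur_len + [0] * pad
    label = [1] * len(context) + [0] * (max_len - len(context))
    return contexts_negatives, mask, label


# toy example: 2 context words and 2 noise words, padded to length 6
cn, mask, label = pad_example([2, 2], [3, 3], max_len=6)
print(cn)     # [2, 2, 3, 3, 0, 0]
print(mask)   # [1, 1, 1, 1, 0, 0]
print(label)  # [1, 1, 0, 0, 0, 0]
```

The width of 60 in the shapes above follows from the settings used earlier: a maximum window of 5 gives at most 10 context words, and K = 5 noise words per context word adds at most 50 more.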


##### skip-gram model

In [19]:
# Embedding layer
embed = nn.Embedding(num_embeddings=20, embedding_dim=4)
embed.weight

Parameter containing:
tensor([[-1.0992, -1.8853,  0.6687, -0.9218],
        [ 0.3958,  0.3977, -0.4510, -2.2155],
        [ 1.6398, -1.3680, -0.2264,  0.7904],
        [ 0.3494, -0.9661,  0.3274, -1.7047],
        [ 1.1975, -0.4053, -0.0838, -1.1055],
        [-0.4982, -0.2221,  0.4674, -0.3799],
        [-1.0856, -0.1682, -0.3661, -1.0220],
        [-0.8017, -0.4501, -0.9369,  0.5541],
        [ 0.5104, -0.5680, -0.4287,  0.7494],
        [ 0.1044,  2.2826, -0.4913,  0.3344],
        [ 1.0642, -0.1926, -2.2577, -0.2016],
        [ 1.4299,  1.2044, -0.0434, -0.2899],
        [ 0.6104, -1.8030, -0.1180,  1.4232],
        [-0.9501,  1.7071, -1.8554,  1.2794],
        [ 0.1486, -0.4341, -2.4296,  0.7044],
        [ 0.2056,  0.6782,  0.7915, -0.2760],
        [ 0.5762, -1.9390,  0.5382, -1.0309],
        [ 0.8875, -0.3410, -0.5451,  0.1252],
        [-0.0163, -0.5292, -0.3095,  0.7328],
        [-1.4694, -0.0762, -0.0391,  0.5887]], requires_grad=True)

In [20]:
x = torch.tensor([[1, 2, 3], [4, 5, 6]], dtype=torch.long)
embed(x)

tensor([[[ 0.3958,  0.3977, -0.4510, -2.2155],
         [ 1.6398, -1.3680, -0.2264,  0.7904],
         [ 0.3494, -0.9661,  0.3274, -1.7047]],

        [[ 1.1975, -0.4053, -0.0838, -1.1055],
         [-0.4982, -0.2221,  0.4674, -0.3799],
         [-1.0856, -0.1682, -0.3661, -1.0220]]], grad_fn=<EmbeddingBackward>)

In [21]:
# Batch matrix multiplication
# Multiplies the matrices in two batches pairwise
# (n, a, b) x (n, b, c) = (n, a, c)
X = torch.ones((2, 1, 4))
Y = torch.ones((2, 4, 6))
torch.bmm(X, Y).shape

torch.Size([2, 1, 6])

In [22]:
def skip_gram(center, contexts_and_negatives, 
              embed_v, embed_u):
    #print(center.shape, contexts_and_negatives.shape)
    # (512, 1) => (512, 1, 100)
    v = embed_v(center)
    # (512, 60) => (512, 60, 100)
    u = embed_u(contexts_and_negatives)
    #print(v.shape, u.shape)
    # (512, 1, 100) * (512, 100, 60)
    pred = torch.bmm(v, u.permute(0, 2, 1))
    #print(pred.shape)
    # (512, 1, 60)
    return pred

##### Training the model

In [23]:
# Binary cross-entropy loss
class SigmoidBinaryCrossEntropyLoss(nn.Module):
    def __init__(self):
        super(SigmoidBinaryCrossEntropyLoss, self
             ).__init__()
    def forward(self, inputs, targets, mask=None):
        '''
        input -  Tensor shape: (batch_size, len)
        output - Tensor of the same shape as input
        '''
        inputs = inputs.float()
        targets = targets.float()
        #print(inputs.shape, targets.shape)
        # mask is optional: convert it only when one is given
        if mask is not None:
            mask = mask.float()
        fn = (nn.functional
                .binary_cross_entropy_with_logits)
        res = fn(inputs, targets, reduction='none', 
                   weight=mask)
        return res.mean(dim=1)
        
loss = SigmoidBinaryCrossEntropyLoss()

In [24]:
pred = torch.tensor([[1.5, 0.3, -1, 2], 
                     [1.1, -0.6, 2.2, 0.4]])
label = torch.tensor([[1, 0, 0, 0], [1, 1, 0, 0]])
mask = torch.tensor([[1, 1, 1, 1], [1, 1, 1, 0]])
loss(pred, label, mask) * mask.shape[1] / mask.float(
            ).sum(dim=1)

tensor([0.8740, 1.2100])

In [25]:
def sigmd(x):
    return - math.log(1 / (1 + math.exp(-x)))

In [26]:
v = (sigmd(1.5) + sigmd(-0.3) + sigmd(1) + 
     sigmd(-2)) / 4
print('%.4f' % v)

0.8740
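
The same hand check can be repeated for the second row, where the mask hides the last entry. This is a small sketch that redefines the `sigmd` helper so it runs on its own; the values are the second rows of `pred`, `label`, and `mask` above.

```python
import math


def sigmd(x):
    # -log(sigmoid(x)), the same helper as in the cell above
    return -math.log(1 / (1 + math.exp(-x)))


# Row 2: pred = [1.1, -0.6, 2.2, 0.4], label = [1, 1, 0, 0], mask = [1, 1, 1, 0]
# A label of 1 contributes sigmd(x), a label of 0 contributes sigmd(-x),
# and the masked last entry is skipped, so the sum is divided by 3, not 4
v2 = (sigmd(1.1) + sigmd(-0.6) + sigmd(-2.2)) / 3
print('%.4f' % v2)
```

Dividing by the number of unmasked entries is what the factor `mask.shape[1] / mask.float().sum(dim=1)` achieves in the tensor version, and the result agrees with the second entry of the tensor above.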


In [27]:
# Initialize model parameters
embed_size = 100
net = nn.Sequential(
    nn.Embedding(num_embeddings=len(idx_to_token), 
                 embedding_dim=embed_size),
    nn.Embedding(num_embeddings=len(idx_to_token),
                 embedding_dim=embed_size)
)

In [28]:
def train(net, lr, num_epochs):
    device = torch.device('cuda' 
            if torch.cuda.is_available() else 'cpu')
    print("train on ", device)
    net = net.to(device)
    optimizer = torch.optim.Adam(net.parameters(), lr=lr)
    for epoch in range(num_epochs):
        start, l_sum, n = time.time(), 0.0, 0
        for batch in data_iter:
            center, context_negative, mask, label = [
                d.to(device) for d in batch]
            pred = skip_gram(center, context_negative,
                             net[0], net[1])
            l = (loss(pred.view(label.shape), label, 
                      mask) * mask.shape[1] / 
                 mask.float().sum(dim=1)).mean()
            optimizer.zero_grad()
            l.backward()
            optimizer.step()
            l_sum += l.cpu().item()
            n += 1
        print('epoch %d, loss %.2f, time %.2fs' % 
              (epoch + 1, l_sum / n, time.time() - start))

In [43]:
train(net, 0.01, 20)

train on  cpu
epoch 1, loss 0.32, time 121.81s
epoch 2, loss 0.31, time 120.82s
epoch 3, loss 0.30, time 121.74s
epoch 4, loss 0.30, time 121.09s
epoch 5, loss 0.30, time 120.62s
epoch 6, loss 0.29, time 121.14s
epoch 7, loss 0.29, time 121.47s
epoch 8, loss 0.29, time 121.88s
epoch 9, loss 0.29, time 120.55s
epoch 10, loss 0.28, time 120.61s
epoch 11, loss 0.28, time 121.13s
epoch 12, loss 0.28, time 121.82s
epoch 13, loss 0.28, time 121.14s
epoch 14, loss 0.28, time 121.36s
epoch 15, loss 0.28, time 121.51s
epoch 16, loss 0.28, time 121.10s
epoch 17, loss 0.28, time 120.02s
epoch 18, loss 0.28, time 121.21s
epoch 19, loss 0.28, time 120.18s
epoch 20, loss 0.27, time 119.96s


##### Applying the model

In [44]:
def get_similar_tokens(query_token, k, embed):
    W = embed.weight.data
    x = W[token_to_idx[query_token]]
    cos = (torch.matmul(W, x) / 
        (torch.sum(W * W, dim=1) * 
         torch.sum(x * x) + 
         1e-9).sqrt())
    _, topk = torch.topk(cos, k=k+1)
    topk = topk.cpu().numpy()
    for i in topk[1:]:
        print('cosine sim=%.3f: %s' % 
              (cos[i], (idx_to_token[i])))

In [45]:
get_similar_tokens('chip', 3, net[0])

cosine sim=0.472: intel
cosine sim=0.439: microprocessor
cosine sim=0.433: shipped
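
The lookup in `get_similar_tokens` boils down to cosine similarity followed by a top-k selection. The sketch below restates that logic on plain lists; the helper names `cosine` and `most_similar` and the 2-d toy vectors are hypothetical and are not trained values. The tensor version skips the first top-k hit on the assumption that the query word is always its own nearest neighbour; here the query is excluded explicitly instead.

```python
import math


def cosine(a, b, eps=1e-9):
    """Cosine similarity with the same small epsilon under the square root."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a) * sum(y * y for y in b) + eps)
    return dot / norm


def most_similar(query, vectors, k):
    """Return the k tokens closest to `query`, excluding the query itself."""
    scores = [(cosine(vectors[query], vec), tk)
              for tk, vec in vectors.items() if tk != query]
    return sorted(scores, reverse=True)[:k]


toy_vectors = {  # hypothetical 2-d embeddings, not trained values
    'chip': [1.0, 0.1],
    'intel': [0.9, 0.2],
    'movie': [-0.2, 1.0],
}
for score, token in most_similar('chip', toy_vectors, 2):
    print('cosine sim=%.3f: %s' % (score, token))
```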


In [46]:
get_similar_tokens('chip', 3, net[1])

cosine sim=0.469: microsystems
cosine sim=0.425: distant
cosine sim=0.425: marketplace


In [55]:
word = 'movie'
get_similar_tokens(word, 3, net[0])
print('\n')
get_similar_tokens(word, 3, net[1])

cosine sim=0.448: evening
cosine sim=0.424: dominant
cosine sim=0.421: film


cosine sim=0.449: lucky
cosine sim=0.438: her
cosine sim=0.438: eduard


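### 10.4 Subword embedding (FastText)

* FastText proposes the subword embedding method
* Building on the skip-gram model of word2vec, it represents the center word vector as the sum of the vectors of the word's subwords
* Subword embedding exploits regularities in word formation and usually improves the quality of the vectors of rare words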
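
As an illustration of what a "subword" is, the sketch below extracts character n-grams from a word after wrapping it in the boundary markers `<` and `>`. The helper name `char_ngrams` and the default range of 3 to 6 characters are assumptions made for this example; the notes do not state which n-gram lengths are used.

```python
def char_ngrams(word, n_min=3, n_max=6):
    """Character n-grams of a word wrapped in '<' and '>' boundary markers."""
    wrapped = '<' + word + '>'
    grams = []
    for n in range(n_min, n_max + 1):
        for i in range(len(wrapped) - n + 1):
            grams.append(wrapped[i:i + n])
    return grams


print(char_ngrams('where', n_min=3, n_max=3))
# ['<wh', 'whe', 'her', 'ere', 're>']
```

Under this scheme the center word vector would be the sum of the vectors of its n-grams, so a rare word still shares parameters with more common words that contain the same character sequences.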

### 10.5 Word embedding with global vectors (GloVe)

### 10.6 Finding synonyms and analogies

##### Using pretrained word vectors

In [60]:
import torch
import torchtext.vocab as vocab

In [61]:
vocab.pretrained_aliases.keys()

dict_keys(['charngram.100d', 'fasttext.en.300d', 'fasttext.simple.300d', 'glove.42B.300d', 'glove.840B.300d', 'glove.twitter.27B.25d', 'glove.twitter.27B.50d', 'glove.twitter.27B.100d', 'glove.twitter.27B.200d', 'glove.6B.50d', 'glove.6B.100d', 'glove.6B.200d', 'glove.6B.300d'])

In [63]:
[key for key in vocab.pretrained_aliases.keys() 
    if 'glove' in key]

['glove.42B.300d',
 'glove.840B.300d',
 'glove.twitter.27B.25d',
 'glove.twitter.27B.50d',
 'glove.twitter.27B.100d',
 'glove.twitter.27B.200d',
 'glove.6B.50d',
 'glove.6B.100d',
 'glove.6B.200d',
 'glove.6B.300d']

In [86]:
cache_dir='~/Datasets/Word2Vec/glove'
#glove = vocab.GloVe(name='6B', dim=50, cache=cache_dir)
glove = vocab.pretrained_aliases['glove.6B.50d'](
    cache=cache_dir)

BadZipFile: File is not a zip file
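
One possible cause of this error, stated here as an assumption rather than a diagnosis, is an incomplete or corrupted archive left in the cache directory by an interrupted download. The sketch below uses only the standard library to check whether a cached file is a valid zip archive; the helper name `check_zip` and the archive file name are hypothetical.

```python
import os
import zipfile


def check_zip(path):
    """Return 'missing', 'corrupt', or 'ok' for the archive at `path`."""
    path = os.path.expanduser(path)  # expand a leading '~' explicitly
    if not os.path.exists(path):
        return 'missing'
    return 'ok' if zipfile.is_zipfile(path) else 'corrupt'


# hypothetical archive name inside the cache directory used above
print(check_zip('~/Datasets/Word2Vec/glove/glove.6B.zip'))
```

If the result is 'corrupt', deleting the file and downloading it again is a reasonable next step.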

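##### Applying pretrained word vectors

### 10.7 Text sentiment classification: using recurrent neural networks

### 10.8 Text sentiment classification: using convolutional neural networks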