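# Word Vectors with the Skip-gram Model

This section trains word vectors by implementing the Skip-gram model.

It covers the following points:
- Training word vectors with the Skip-gram model
- Using the PyTorch dataset and dataloader
- Defining a PyTorch model
- Common modules in torch.nn
    - Embedding
- Common PyTorch operations
    - bmm
    - logsigmoid
- Saving and loading PyTorch models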

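Below we reproduce the word-vector training method from the paper [Distributed Representations of Words and Phrases and their Compositionality](http://papers.nips.cc/paper/5021-distributed-representations-of-words-and-phrases-and-their-compositionality.pdf): a Skip-gram model trained with the negative sampling objective (the paper's simplification of noise contrastive estimation). The subsampling method from the paper is not used.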
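
For reference, the negative sampling objective for a single (center word, context word) pair can be written as below. The notation follows the paper rather than the code: $v_{w_I}$ is the input vector of the center word, $v'_{w_O}$ the output vector of the context word, $k$ the number of negative samples (the role played by `K` in this notebook), and $P_n(w)$ the noise distribution, which the paper takes to be the unigram distribution raised to the 3/4 power.

$$
\log \sigma\left({v'_{w_O}}^{\top} v_{w_I}\right) + \sum_{i=1}^{k} \mathbb{E}_{w_i \sim P_n(w)}\left[\log \sigma\left(-{v'_{w_i}}^{\top} v_{w_I}\right)\right]
$$

Training maximizes this quantity, so the loss minimized in practice is its negative, summed over all context words of each center word.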

In [55]:
import torch
import torch.nn as nn
import torch.nn.functional as F
from torch.utils.data import DataLoader, Dataset

from torch.nn.parameter import Parameter # tensor type that a module registers as a learnable parameter

from collections import Counter # counter, used to count word frequencies
import numpy as np
import random
import math

import pandas as pd
import scipy
import sklearn
from sklearn.metrics.pairwise import cosine_similarity # cosine similarity function
USE_MPS = torch.backends.mps.is_available() # check whether the MPS backend is available
print(USE_MPS)

True


In [56]:
random.seed(53113)
np.random.seed(53113)
torch.manual_seed(53113)
torch.cuda.manual_seed(53113)

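# Hyperparameters
K = 100 # number of negative samples drawn per positive (context) word
C = 3 # context window radius: C words on each side of the center word
NUM_EPOCHS = 5 # number of training epochs
MAX_VOCAB_SIZE = 30_000 # maximum vocabulary size
BATCH_SIZE = 32 # batch size
LEARNING_RATE = 0.2 # learning rate
EMBEDDING_DIM = 100 # word-vector dimension

LOG_FILE = "word_embedding.log"

# Tokenizer: split the text into a list of words on whitespace
def tokenize(text):
    return text.split()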

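- Read all the text from the file and build a vocabulary from it.
- Because the number of distinct words may be too large, keep only the MAX_VOCAB_SIZE most frequent ones.
- Add an UNK token to represent all the rare words.
- Record the word-to-index mapping, the index-to-word mapping, each word's count, each word's (normalized) frequency, and the total number of words.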

In [57]:
with open("data/nietzsche.txt", "r") as file:
    text = file.read()

print(f"text : {text[:200]}")

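# Convert the lowercased text into a list of words
text = tokenize(text.lower())

# Count word frequencies and keep the (MAX_VOCAB_SIZE - 1) most common words as a dict;
# the remaining slot is reserved for <unk> (unknown), which stands for all rare words
vocab = dict(Counter(text).most_common(MAX_VOCAB_SIZE - 1))

# Count of <unk> = total number of tokens - number of tokens covered by the kept words
vocab["<unk>"] = len(text) - np.sum(list(vocab.values()))

# All words (the dict keys)
idx_to_word = [word for word in vocab.keys()]

# Map each word to its index; the most common word has index 0
word_to_idx = {word: i for i, word in enumerate(idx_to_word)}

# Count of each word (the dict values)
word_counts = np.array(list(vocab.values()), dtype=np.float32)

# Relative frequency of each word
word_freqs = word_counts / np.sum(word_counts)

# Raise to the 3/4 power, as in the paper
word_freqs = word_freqs ** (3.0/4.0)

# Renormalize so the values sum to 1 again
word_freqs = word_freqs / np.sum(word_freqs)

VOCAB_SIZE = len(idx_to_word) # vocabulary size; at most MAX_VOCAB_SIZE, smaller if the corpus has fewer distinct words
VOCAB_SIZE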
VOCAB_SIZE

text : PREFACE


SUPPOSING that Truth is a woman--what then? Is there not ground
for suspecting that all philosophers, in so far as they have been
dogmatists, have failed to understand women--that the terrib


17683
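
To see what the 3/4 power does to the sampling distribution, here is a small sketch on made-up counts (the three words and their counts are purely illustrative, not taken from the corpus). It suggests that the smoothing shifts probability mass from very frequent words toward rare ones, so rare words are drawn as negatives more often than their raw frequency would allow.

```python
# Toy illustration of the 3/4-power smoothing applied to unigram frequencies.
# The counts below are invented for illustration only.
counts = {"the": 900, "truth": 90, "dogmatists": 10}

total = sum(counts.values())
freqs = {w: c / total for w, c in counts.items()}

# Raise each frequency to the 3/4 power, then renormalize to a distribution.
powered = {w: f ** 0.75 for w, f in freqs.items()}
norm = sum(powered.values())
smoothed = {w: p / norm for w, p in powered.items()}

for w in counts:
    print(f"{w:12s} raw={freqs[w]:.3f} smoothed={smoothed[w]:.3f}")
```

The ranking of the words is unchanged; only the gaps between their probabilities shrink.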

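### Implementing the Dataloader

A dataloader needs to do the following:

- Encode all the text as numbers, then preprocess it with subsampling (this step is skipped in this implementation).
- Store the vocabulary, the word counts, and the normalized word frequencies
- Sample one center word per iteration
- Return the context words for the current center word
- Sample some negative words for the center word
- Return the word counts

There is a good tutorial on how to use the [PyTorch dataloader](https://pytorch.org/tutorials/beginner/data_loading_tutorial.html).
To use a dataloader, we need to define the following two functions:

- ```__len__``` returns the number of items in the whole dataset
- ```__getitem__``` returns one item for a given index

With a dataloader, we can easily shuffle the whole dataset, fetch a batch of data, and so on.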
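
As a minimal sketch of these two methods, the toy dataset below returns the pair (i, i * i) for index i. The class name `SquaresDataset` and its contents are hypothetical and unrelated to the word-vector data; the point is only to show how `DataLoader` turns per-item tensors into batches.

```python
import torch
from torch.utils.data import DataLoader, Dataset


class SquaresDataset(Dataset):
    """Hypothetical toy dataset: item i is the pair (i, i * i)."""

    def __init__(self, n):
        self.n = n

    def __len__(self):
        # Number of items in the whole dataset.
        return self.n

    def __getitem__(self, idx):
        # Return one item for the given index.
        return torch.tensor(idx), torch.tensor(idx * idx)


toy_loader = DataLoader(SquaresDataset(10), batch_size=4, shuffle=True)
batch_shapes = []
for x, y in toy_loader:
    # The default collate function stacks the per-item tensors into batches.
    batch_shapes.append(tuple(x.shape))
    assert torch.equal(y, x * x)
print(batch_shapes)
```

With 10 items and `batch_size=4`, the loop yields batches of 4, 4 and 2 items; the last batch is smaller because `drop_last` defaults to `False`.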

In [58]:
class WordEmbeddingDataset(Dataset):
    def __init__(self, text, word_to_idx, idx_to_word, word_freqs, word_counts) -> None:
        """
        text: a list of words, all text from the data
        word_to_idx: the dictionary from word to idx
        idx_to_word: idx to word mapping
        word_freqs: the frequency of each word
        word_counts: the word counts
        """
        super().__init__()
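        # dict.get returns the value for the given key, or the default (second argument) if the key is missing.
        # Look up the index of every word in text; words outside the vocabulary map to the index of <unk>
        self.text_encoded = [word_to_idx.get(word, word_to_idx["<unk>"]) for word in text]

        self.text_encoded = torch.tensor(self.text_encoded).long() # convert to a tensor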

        self.word_to_idx = word_to_idx
        self.idx_to_word = idx_to_word
        self.word_freqs = torch.tensor(word_freqs)
        self.word_counts = torch.tensor(word_counts)

    def __len__(self):
        return len(self.text_encoded)
    
    def __getitem__(self, idx):
        """
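        Returns three pieces of data:
        - the center word
        - the positive (context) words near the center word
        - K randomly sampled words per positive word, used as negative samples
        """
        center_word = self.text_encoded[idx]

        # Indices of the surrounding words: with C=3, take the three words before and after the center word; for idx=0, pos_indices=[-3, -2, -1, 1, 2, 3]
        pos_indices = list(range(idx - C, idx)) + list(range(idx + 1, idx + C + 1))

        # Indices that fall outside the text are wrapped around to the other end with a modulo
        pos_indices = [i % len(self.text_encoded) for i in pos_indices]

        pos_words = self.text_encoded[pos_indices]

        # Negative sampling: draw K words per positive word according to the smoothed frequencies
        # torch.multinomial draws samples from a multinomial distribution, i.e. it picks indices with probability proportional to the given weights.
        # Argument 1: the weights; argument 2: the number of samples; argument 3: whether to sample with replacement
        # The returned samples are indices into word_freqs
        # Each positive sample gets K negative samples
        neg_words = torch.multinomial(self.word_freqs, num_samples= K * pos_words.shape[0], replacement=True)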

        return center_word, pos_words, neg_words
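
The index arithmetic in `__getitem__` can be checked in isolation. The sketch below uses a hypothetical helper, `context_indices`, that repeats the same two steps (build the raw window, then wrap with a modulo) on a text of length 10.

```python
def context_indices(idx, window, length):
    """Indices of the `window` positions on each side of `idx`, wrapped modulo `length`."""
    raw = list(range(idx - window, idx)) + list(range(idx + 1, idx + window + 1))
    return [i % length for i in raw]


print(context_indices(0, 3, 10))  # start of the text -> [7, 8, 9, 1, 2, 3]
print(context_indices(5, 3, 10))  # middle of the text -> [2, 3, 4, 6, 7, 8]
print(context_indices(9, 3, 10))  # end of the text -> [6, 7, 8, 0, 1, 2]
```

The wrap-around means that the first few words of the text borrow context from the last few words and vice versa. This keeps every item the same length, which makes batching simple, at the cost of a handful of meaningless context pairs at the two ends.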

In [59]:
dataset = WordEmbeddingDataset(text, word_to_idx, idx_to_word, word_freqs, word_counts)

list(dataset[0])

[tensor(5907),
 tensor([17681,     1,  5902,   705,     7,   178]),
 tensor([    1,   750,   276,  2561,  4582,    26, 11822, 11679,  7987,    25,
           824,    41,  7631,  4893,   255,  2434,  6848,    74,  1037, 16634,
           207, 11917,  9472,  6514,  5626, 10230,     5,   570,  1652,   791,
           100,     1,    58,    32,    46,   662,  1640,  5649,  3275,  4782,
         15518,   707, 13781,    94,    21,  4434,   104,  2089,     1,   142,
          1030,  1971, 17135,  9696,  2042,   612,  7202,     8,    93,     8,
         14046, 14720, 13009,   607,  9530,  1390,  7264,  7326,  7230, 13614,
          3710,  5503,   522, 11194,  1596,  1143,   107,    87,  9148,    21,
         14841,  1905,    39,  5270,     5,    78,  1720,    52,  1324,     8,
          1764,     3,   423,  2516,    62,    33,  2609,   173,    80,   552,
          6815,  8455,  1633,    47,    59,   481,   624,   392,  4906, 10137,
             2,   312,  1162,  1253, 17486,   738,    33,    12

In [60]:
dataloader = DataLoader(dataset, batch_size=BATCH_SIZE, shuffle=True, num_workers=0)
dataloader

<torch.utils.data.dataloader.DataLoader at 0x12de3a010>

In [61]:
for i, (input_labels, pos_labels, neg_labels) in enumerate(dataloader):
    print(input_labels, pos_labels, neg_labels)
    break

tensor([    0,   438,    37,  1208,   152,     3,   638,     7,    23,    40,
          471,  1162,  4644,     2,   243,     0,  1809,     1,   421,   263,
            1,  9750,    28, 11882,    25,   441,    45,  6198,  5798,   277,
          184,     5]) tensor([[ 5682,    13,    33,   304,   135,    27],
        [  224,     4,   353,  3129,  1609,     9],
        [  133,    51,   512,     4,    95,    46],
        [ 1078,  8250,    27,  4387,     0,   155],
        [   62,    17,   406,  7968,  7969,   629],
        [   16,   415,  2649,    76,  3310,     6],
        [   34,  4251,    59,  8582,  8583,     8],
        [ 4719,     5,   152,   179,    41,   933],
        [    7,     0,  3191,  3904,    10,    17],
        [    6,   207,  1476,     6,     3,    12],
        [   44,    91,     1,     2,  1539,     7],
        [    1,    44,   361,    32, 11925,  5073],
        [    1,    22,   418,    10,    79,   750],
        [ 2443,     0,   596,    46,   187,   148],
        [    0,

### Defining the PyTorch model

In [62]:
class EmbeddingModel(nn.Module):
    def __init__(self, vocab_size, embed_size) -> None:
        super().__init__()
        self.vocab_size = vocab_size # vocabulary size
        self.embed_size = embed_size # word-vector dimension

        initrange = 0.5 / self.embed_size # range of the uniform initialization

        # Input (center-word) embeddings: nn.Embedding(vocab_size, embed_size)
        self.in_embed = nn.Embedding(self.vocab_size, self.embed_size, sparse=False)
        # Initialize the weights
        self.in_embed.weight.data.uniform_(-initrange, initrange)

        # Output (context-word) embeddings: nn.Embedding(vocab_size, embed_size)
        self.out_embed = nn.Embedding(self.vocab_size, self.embed_size, sparse=False) # sparse=False: the gradient w.r.t. the weight matrix is a dense tensor
        # Initialize the weights
        self.out_embed.weight.data.uniform_(-initrange, initrange)

    def forward(self, input_labels, pos_labels, neg_labels):
        """
        input_labels: center words, [batch_size]
        pos_labels: positive (context) words around each center word, [batch_size, (window_size * 2)]
        neg_labels: randomly sampled negative words for each center word, [batch_size, (window_size * 2 * K)]

        return: loss, [batch_size]
        """

        batch_size = input_labels.size(0)

        input_embedding = self.in_embed(input_labels) # center-word vectors, [batch_size, embed_size]
        pos_embedding = self.out_embed(pos_labels) # positive-sample vectors, [batch_size, (window_size * 2), embed_size]
        neg_embedding = self.out_embed(neg_labels) # negative-sample vectors, [batch_size, (window_size * 2 * K), embed_size]

        # torch.bmm() is batched matrix multiplication: [batch_size, n, m] x [batch_size, m, p] = [batch_size, n, p]
        # squeeze(2) removes only the trailing dimension, so the batch dimension survives even when batch_size is 1
        log_pos = torch.bmm(
            pos_embedding, input_embedding.unsqueeze(2)
        ).squeeze(2) # dot-product scores of the positive samples, [batch_size, (window_size * 2)]

        log_neg = torch.bmm(
            neg_embedding, -input_embedding.unsqueeze(2)
        ).squeeze(2) # negated dot-product scores of the negative samples, [batch_size, (window_size * 2 * K)]

        # The loss below follows the formula in the paper
        log_pos = F.logsigmoid(log_pos).sum(1) # summed log-sigmoid over the positive samples, [batch_size]
        log_neg = F.logsigmoid(log_neg).sum(1) # summed log-sigmoid over the negative samples, [batch_size]
        loss = log_pos + log_neg

        return -loss
    
    def input_embeddings(self): # return the weights of self.in_embed as a NumPy array
        return self.in_embed.weight.data.cpu().numpy()
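
The tensor shapes in `forward` are easier to follow on a small example. The sketch below uses random tensors in place of embedding lookups, with arbitrarily chosen toy sizes, and checks the `bmm` scores against an explicit dot product.

```python
import torch
import torch.nn.functional as F

torch.manual_seed(0)
batch_size, n_pos, n_neg, embed_size = 2, 6, 12, 4  # toy sizes, chosen arbitrarily

center = torch.randn(batch_size, embed_size)      # [B, D]
pos = torch.randn(batch_size, n_pos, embed_size)  # [B, P, D]
neg = torch.randn(batch_size, n_neg, embed_size)  # [B, N, D]

# [B, P, D] x [B, D, 1] -> [B, P, 1]; squeeze(2) drops only the last dimension.
pos_scores = torch.bmm(pos, center.unsqueeze(2)).squeeze(2)   # [B, P]
neg_scores = torch.bmm(neg, -center.unsqueeze(2)).squeeze(2)  # [B, N]

# Negative of the summed log-sigmoid terms, one value per example.
loss = -(F.logsigmoid(pos_scores).sum(1) + F.logsigmoid(neg_scores).sum(1))  # [B]
print(pos_scores.shape, neg_scores.shape, loss.shape)

# The same positive scores via an explicit dot product, as a sanity check.
manual = (pos * center.unsqueeze(1)).sum(-1)
print(torch.allclose(pos_scores, manual, atol=1e-6))
```

Passing the dimension to `squeeze` explicitly matters for the final batch of an epoch: a bare `squeeze()` removes every dimension of size 1, including the batch dimension when a batch holds a single example, after which `sum(1)` fails.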

Create a model and move it to the GPU (here the MPS device, if available)

In [63]:
model = EmbeddingModel(VOCAB_SIZE, EMBEDDING_DIM)
device = torch.device("mps" if USE_MPS else "cpu")
print(f"device: {device}")

model = model.to(device)
model

device: mps


EmbeddingModel(
  (in_embed): Embedding(17683, 100)
  (out_embed): Embedding(17683, 100)
)

In [64]:
optimizer = torch.optim.SGD(model.parameters(), lr=LEARNING_RATE)

for epoch in range(NUM_EPOCHS):
    for i, (input_labels, pos_labels, neg_labels) in enumerate(dataloader):
        # Convert to LongTensor
        input_labels = input_labels.long()
        pos_labels = pos_labels.long()
        neg_labels = neg_labels.long()

        input_labels = input_labels.to(device)
        pos_labels = pos_labels.to(device)
        neg_labels = neg_labels.to(device)

        optimizer.zero_grad()
        loss = model(input_labels, pos_labels, neg_labels).mean()
        loss.backward()
        optimizer.step()

        if i % 100 == 0:
            with open(LOG_FILE, "a") as file:
                file.write(f"epoch: {epoch}, iter: {i}, loss: {loss.item()}\n")
            
            print(f"epoch: {epoch}, iter: {i}, loss: {loss.item()}")

# Save the input embeddings and the model weights
embedding_weights = model.input_embeddings()
np.save(f"embedding-{EMBEDDING_DIM}", embedding_weights)
torch.save(model.state_dict(), f"embedding-{EMBEDDING_DIM}.th")

epoch: 0, iter: 0, loss: 420.0471496582031
epoch: 0, iter: 100, loss: 420.0234375
epoch: 0, iter: 200, loss: 418.30194091796875
epoch: 0, iter: 300, loss: 407.61260986328125
epoch: 0, iter: 400, loss: 394.5010986328125
epoch: 0, iter: 500, loss: 374.0609130859375
epoch: 0, iter: 600, loss: 346.4581298828125
epoch: 0, iter: 700, loss: 328.9640808105469
epoch: 0, iter: 800, loss: 303.1462097167969
epoch: 0, iter: 900, loss: 289.53759765625
epoch: 0, iter: 1000, loss: 274.593505859375
epoch: 0, iter: 1100, loss: 231.27932739257812
epoch: 0, iter: 1200, loss: 231.7281494140625
epoch: 0, iter: 1300, loss: 248.93341064453125
epoch: 0, iter: 1400, loss: 215.7095947265625
epoch: 0, iter: 1500, loss: 223.58456420898438
epoch: 0, iter: 1600, loss: 251.60345458984375
epoch: 0, iter: 1700, loss: 224.01065063476562
epoch: 0, iter: 1800, loss: 229.17547607421875
epoch: 0, iter: 1900, loss: 186.27651977539062
epoch: 0, iter: 2000, loss: 181.14459228515625
epoch: 0, iter: 2100, loss: 226.1372680664062

In [65]:
model.load_state_dict(
    torch.load(f"embedding-{EMBEDDING_DIM}.th")
)

model

EmbeddingModel(
  (in_embed): Embedding(17683, 100)
  (out_embed): Embedding(17683, 100)
)

Below is the code for evaluating the trained model.

In [66]:
import os

def evaluate(filename, embedding_weights):
    if not os.path.exists(filename):
        print(f"{filename} not found")
        return
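    # Note: pd.read_csv treats the first row as a header by default; if the file has no header row,
    # that first word pair is not evaluated (pass header=None to keep it)
    if filename.endswith(".csv"):
        data = pd.read_csv(filename, sep=",")
    else:
        data = pd.read_csv(filename, sep="\t")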

    print(data.head())
    human_similarity = []
    model_similarity = []
    for i in range(len(data)):
        word1, word2 = data.iloc[i, 0], data.iloc[i, 1]
        if word1 not in word_to_idx or word2 not in word_to_idx:
            continue
        else:
            word1_idx, word2_idx = word_to_idx[word1], word_to_idx[word2]
            # embedding_weights holds the model's weight parameters as a matrix
            # Looking up a word's embedding by its index selects the corresponding row of embedding_weights
            # That row is the word's vector
            word1_embed, word2_embed = embedding_weights[[word1_idx]], embedding_weights[[word2_idx]]
            model_similarity.append(
                float(sklearn.metrics.pairwise.cosine_similarity(word1_embed, word2_embed)[0,0]) # cosine_similarity returns a matrix; each side holds a single sample here, so [0,0] is the similarity of this pair
            )
            human_similarity.append(float(data.iloc[i, 2]))

    return scipy.stats.spearmanr(human_similarity, model_similarity) # Spearman rank correlation; returns the coefficient and the p-value

def find_nearest(word):
    if word not in word_to_idx:
        return
    index = word_to_idx[word]
    embedding = embedding_weights[index]
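    cos_dis = np.array([scipy.spatial.distance.cosine(e, embedding) for e in embedding_weights]) # cosine distance (1 - cosine similarity) between this word and every word in the vocabulary
    return [idx_to_word[i] for i in cos_dis.argsort()[:10]] # the 10 words with the smallest distance, i.e. the most similar ones (the query word itself comes first)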
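
The nearest-neighbor search can also be written without a Python loop. The sketch below is one possible vectorized variant on a hypothetical 4-word toy matrix (`toy_words`, `toy_emb` and `nearest` are illustrative names, not part of the notebook); it relies on cosine distance being 1 minus cosine similarity.

```python
import numpy as np

# Hypothetical toy embedding matrix: 4 "words", 3 dimensions.
toy_words = ["a", "b", "c", "d"]
toy_emb = np.array([
    [1.0, 0.0, 0.0],
    [0.9, 0.1, 0.0],
    [0.0, 1.0, 0.0],
    [-1.0, 0.0, 0.0],
])


def nearest(query_idx, emb, k=3):
    """Return the k row indices with the smallest cosine distance to row `query_idx`."""
    unit = emb / np.linalg.norm(emb, axis=1, keepdims=True)  # L2-normalize each row
    similarity = unit @ unit[query_idx]                      # cosine similarity to the query
    distance = 1.0 - similarity                              # cosine distance
    return np.argsort(distance)[:k]


print([toy_words[i] for i in nearest(0, toy_emb)])  # ['a', 'b', 'c']
```

As in `find_nearest`, the query word is always its own nearest neighbor, so the first entry of the result can be skipped when only other words are of interest.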





## Evaluating on the SimLex-999 dataset

In [67]:
embedding_weights = model.input_embeddings()
print("simlex-999", evaluate("data/en-simlex-999.txt", embedding_weights))

     old          new  1.58
0  smart  intelligent  9.20
1   hard    difficult  8.77
2  happy     cheerful  9.55
3   hard         easy  0.95
4   fast        rapid  8.75
simlex-999 SignificanceResult(statistic=np.float64(0.12613471770213477), pvalue=np.float64(0.017906650696305405))


## Finding nearest neighbors

In [68]:
for word in ["good", "green", "like", "america", "chicago", "work", "building", "computer", "language"]:
    print(word, find_nearest(word))

good ['good', 'things', 'other', 'made', 'just', 'well', 'being', 'very', 'life', 'because']
green ['green', 'scholar,', 'treasure', 'impregnated', 'mysteries', 'tastes', 'permeated', 'indicate', 'surrenders', 'granting']
like ['like', 'once', 'about', 'yet', 'long', 'longer', 'then', 'precisely', 'too', 'is,']
america None
chicago None
work ['work', 'two', 'while', 'delight', 'therefore,', 'powerful', 'within', 'metaphysical', 'effect', 'former']
building ['building', 'produced,', 'still;', 'leaves,', 'dupers', 'ten', 'falsification', 'forms,', 'disposing,', 'poetical']
computer None
language ['language', 'words', 'and,', 'profound', 'cases', 'common', 'lower', 'both', 'here,', 'music']


## Relationships between words

In [69]:
man_idx = word_to_idx["man"]
king_idx = word_to_idx["king"]
woman_idx = word_to_idx["woman"]
embedding = embedding_weights[woman_idx] - embedding_weights[man_idx] + embedding_weights[king_idx]
cos_dis = np.array([scipy.spatial.distance.cosine(e, embedding) for e in embedding_weights])
for i in cos_dis.argsort()[:10]:
    print(idx_to_word[i])

clair,
but--dreadful
102.
e
moments-la,
suspects,
realm--what
wealthy
hat
now--better


In [70]:
os.remove(LOG_FILE)
os.remove(f"embedding-{EMBEDDING_DIM}.npy")
os.remove(f"embedding-{EMBEDDING_DIM}.th")