## 1.2 Word Embedding
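Machines cannot directly consume words, phrases, or characters (collectively, tokens), so turning tokens into numbers has long been a research topic. At first, each token was represented by an integer, which is simple but inflexible. One-hot encoding came next; it is convenient, but the vectors are very sparse, the encoding is hard-coded, and it cannot carry any additional information. People then turned to dense numeric vectors, or token embeddings, usually called word embeddings and also known as distributed representations.
Word embeddings became truly popular thanks to Google's word2vec. Next, we briefly look at how word2vec works and how it can be implemented.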
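
The three representations mentioned above can be compared side by side. The following is a minimal sketch, assuming a toy three-word vocabulary and a randomly initialized embedding table; the names `token_to_id`, `one_hot`, and `embedding_table` are illustrative and do not come from the book.

```python
import numpy as np

tokens = ["natural", "language", "processing"]

# 1) Integer encoding: each token gets an arbitrary id.
token_to_id = {tok: i for i, tok in enumerate(tokens)}

# 2) One-hot encoding: a vocabulary-sized vector with a single 1.
one_hot = np.eye(len(tokens), dtype=int)

# 3) Dense embedding: a small real-valued vector per token.
#    The values are random here; in practice they are learned.
rng = np.random.default_rng(0)
embedding_dim = 2  # illustrative choice
embedding_table = rng.normal(size=(len(tokens), embedding_dim))

idx = token_to_id["language"]
print(idx)                          # 1
print(one_hot[idx])                 # [0 1 0]
print(embedding_table[idx].shape)   # (2,)

# A one-hot vector times the table selects the same row as direct indexing.
assert np.allclose(one_hot[idx] @ embedding_table, embedding_table[idx])
```

The last line shows why an embedding layer can be seen as a lookup: multiplying a one-hot vector by the table returns exactly one row of it.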

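### 1.2.1 Before word2vec
The whole path from text to tokens, to one-hot encoding, and finally to vector representations is shown in the figure below (Figure 1-2 in the book).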
![image.png](attachment:image.png)

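1. Learning word embeddings with a framework's Embedding layer  
Word embeddings are learned while the model trains on its task. For example, an Embedding layer is used as the first layer, the word vectors are initialized randomly, and a framework such as PyTorch or TensorFlow then updates them through repeated forward and backward passes until the desired word vectors are obtained. Listing 1-1 is a simple example that generates word embeddings with PyTorch's nn.Embedding layer.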

In [1]:
from torch import nn
import torch
import jieba
import numpy as np

raw_text = """越努力就越幸运"""
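# Segment the text into words with jieba
words = list(jieba.cut(raw_text))
print(words)
# De-duplicate the tokens and build an index -> token dictionary
# (despite its name, word_to_ix maps index to token)
word_to_ix = { i: word for i, word in enumerate(set(words))}
# Set the vocabulary size and the embedding dimension; nn.Embedding initializes
# its weights from a standard normal distribution
# nn.Embedding takes a tensor of token indices and returns the matching embeddings
embeds = nn.Embedding(len(word_to_ix), 3)
print(embeds.weight[0])
# Get the dictionary keys (the token indices)
keys=word_to_ix.keys()
keys_list=list(keys)
# Convert the list of indices to a tensor
tensor_value=torch.LongTensor(keys_list)
# Feed the tensor to the Embedding layer to look up each token's embedding
embeds(tensor_value)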

Building prefix dict from the default dictionary ...
Loading model from cache C:\Users\wumgapp\AppData\Local\Temp\jieba.cache
Loading model cost 0.547 seconds.
Prefix dict has been built successfully.


['越', '努力', '就', '越', '幸运']
tensor([-0.0926,  0.6750, -0.1996], grad_fn=<SelectBackward>)


tensor([[-0.0926,  0.6750, -0.1996],
        [-0.0913, -1.1977, -1.1774],
        [ 1.4456, -0.4522,  0.0423],
        [-0.4129, -1.0080,  1.1987]], grad_fn=<EmbeddingBackward>)

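2. Using pretrained word embeddings  
Word embeddings or models pretrained on a larger corpus are loaded into the current task or model. There are many pretrained models, such as word2vec, ELMo, BERT, XLNet, and ALBERT. We start with word2vec here and cover the other pretrained models later.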
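
As a rough illustration of what "loading" pretrained embeddings means, the sketch below builds an embedding matrix for a task vocabulary from a table of pretrained vectors. The `pretrained` dictionary, its values, and the helper `build_embedding_matrix` are hypothetical; real vectors would be read from a file produced by word2vec or a similar model.

```python
import numpy as np

# Hypothetical "pretrained" vectors, standing in for a real vector file.
pretrained = {
    "natural": np.array([0.1, 0.2, 0.3]),
    "language": np.array([0.4, 0.5, 0.6]),
}

def build_embedding_matrix(vocab, pretrained, dim, seed=0):
    """Return a (len(vocab), dim) matrix: pretrained rows where available,
    small random rows for words missing from the pretrained table."""
    rng = np.random.default_rng(seed)
    matrix = rng.normal(0.0, 0.1, size=(len(vocab), dim))
    for row, word in enumerate(vocab):
        if word in pretrained:
            matrix[row] = pretrained[word]
    return matrix

vocab = ["natural", "language", "processing"]
embedding_matrix = build_embedding_matrix(vocab, pretrained, dim=3)
print(embedding_matrix.shape)  # (3, 3)
```

In PyTorch, a matrix like this can be converted to a float tensor and passed to `nn.Embedding.from_pretrained`, which creates an Embedding layer whose weights start from the given values.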

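## 1.2.3 Skip-Gram Model
The Skip-Gram model also has three layers: an input layer, a projection layer, and an output layer. Figure 1-4 shows the architecture. The input to the model is the word w(t); given w(t), the model predicts its context words w(t-2), w(t-1), w(t+1), and w(t+2). The conditional probability is written $p(Context(w) \mid w)$, and the objective function is:
$$ \mathcal{L}=\sum_{w\in C}\log p(Context(w) \mid w) $$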
![image.png](attachment:image.png)
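
One common way to make this objective concrete is sketched below. It assumes that the context words are conditionally independent given the center word and that probabilities come from a full softmax over the vocabulary, which is also what the NumPy implementation later in this section does. The symbols $c$ (window size), $v_w$ (input vector of word $w$), $v'_w$ (output vector of word $w$), and $V$ (vocabulary size) are notation introduced here for illustration.

```latex
\log p(Context(w_t) \mid w_t) = \sum_{-c \le j \le c,\; j \ne 0} \log p(w_{t+j} \mid w_t),
\qquad
p(w_O \mid w_I) = \frac{\exp\left({v'_{w_O}}^{\top} v_{w_I}\right)}{\sum_{w=1}^{V} \exp\left({v'_{w}}^{\top} v_{w_I}\right)}
```

The denominator sums over the whole vocabulary, which is the main cost when the corpus, and therefore $V$, is large.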

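A simple example illustrates the basic idea of Skip-Gram. Suppose we have the sentence:  
the quick brown fox jumped over the lazy dog

Next, following the basic idea of Skip-Gram, we turn this sentence into a dataset of (input, output) pairs. How is such a dataset built? We first pair each word with its context. "Context" can be defined in any reasonable way; here the context of a target word is the words to its left and right, with a window of size 1 (window_size=1). That is, only the one word before and the one word after the input word are combined with it. This first gives entries of the form (context, target word), and Skip-Gram then turns each entry into (input, output) pairs that predict the context from the target word, as shown in Table 1-1.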

|Input word|Left word (preceding context)|Right word (following context)|(context, target word)|(input, output) pairs: Skip-Gram predicts the context from the target word|
|:-|:-|:-|:-|:-|
|quick|the|brown|([the, brown], quick)|(quick, the) (quick, brown)|
|brown|quick|fox|([quick, fox], brown)|(brown, quick) (brown, fox)|
|fox|brown|jumped|([brown, jumped], fox)|(fox, brown) (fox, jumped)|
|...|...|...|...|...|
|lazy|the|dog|([the, dog], lazy)|(lazy, the) (lazy, dog)|
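
The construction of the last column can be sketched in a few lines of Python. The helper name `skipgram_pairs` is hypothetical. Unlike the table, which lists only words that have both a left and a right neighbor, this sketch also emits pairs for the first and last word of the sentence, each of which has a single neighbor.

```python
def skipgram_pairs(tokens, window_size=1):
    """Return (input, output) pairs: each word paired with every neighbor
    that lies within window_size positions of it."""
    pairs = []
    for i, center in enumerate(tokens):
        lo = max(0, i - window_size)
        hi = min(len(tokens), i + window_size + 1)
        for j in range(lo, hi):
            if j != i:
                pairs.append((center, tokens[j]))
    return pairs

sentence = "the quick brown fox jumped over the lazy dog".split()
pairs = skipgram_pairs(sentence, window_size=1)
print(pairs[:5])
# [('the', 'quick'), ('quick', 'the'), ('quick', 'brown'), ('brown', 'quick'), ('brown', 'fox')]
```

Increasing `window_size` only widens the range of `j`, so the same function covers the window of 2 used in the next section.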

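## 1.2.4 Visualizing the Skip-Gram Implementation
The previous section briefly introduced the principle and architecture of Skip-Gram, but it did not explain how Skip-Gram turns its input into word embeddings, what the key steps are, or which bottlenecks can appear with a large corpus. Understanding the concrete implementation of Skip-Gram helps in understanding word2vec and other pretrained models such as ELMo, BERT, and ALBERT. This section therefore walks through the implementation in detail to deepen the reader's understanding of both the principle and the implementation.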

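The overall workflow is as follows:

(1) Preprocess the corpus: acquire, clean, normalize, and tokenize the data  
(2) Skip-Gram model architecture: the architecture diagram and its matrix form  
(3) Generate the dataset of center words and their contexts  
(4) Generate the training data: build the vocabulary, generate a one-hot encoding for each word, and build the word-to-index and index-to-word lookups  
(5) Skip-Gram forward pass: encode the word in the forward pass and compute the error  
(6) Skip-Gram backward pass: reduce the loss step by step with backpropagation and gradient descent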

### 1. Preprocess the corpus
Corpus:  
natural language processing and machine learning is fun and exciting

In [2]:
import numpy as np
from collections import defaultdict

In [3]:
text = "natural language processing and machine learning is fun and exciting"

# Lightly preprocess the corpus: tokenize and lowercase
corpus = [[word.lower() for word in text.split()]]

In [4]:
corpus

[['natural',
  'language',
  'processing',
  'and',
  'machine',
  'learning',
  'is',
  'fun',
  'and',
  'exciting']]

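### 2. Skip-Gram model architecture

We use a Skip-Gram model with window_size=2, where the target word determines its context: given the target word, the model predicts the 2 words to its left and the 2 words to its right. Figure 1-5 shows the model.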
![image.png](attachment:image.png)

Written in matrix form, Figure 1-5 becomes Figure 1-6.
![image.png](attachment:image.png)
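
The matrix form can be traced with a few lines of NumPy. This is a minimal sketch, assuming the vocabulary size (9) and hidden dimension (10) used in this section and randomly initialized weights; the one-hot index 3 is an arbitrary choice.

```python
import numpy as np

V, N = 9, 10  # vocabulary size and hidden dimension used in this section
rng = np.random.default_rng(0)
w1 = rng.uniform(-0.8, 0.8, size=(V, N))  # input (embedding) matrix
w2 = rng.uniform(-0.8, 0.8, size=(N, V))  # output (context) matrix

x = np.zeros(V)
x[3] = 1.0  # one-hot vector for the word with index 3 (arbitrary choice)

h = w1.T @ x             # hidden layer, shape (N,)
u = w2.T @ h             # one score per vocabulary word, shape (V,)
y = np.exp(u - u.max())
y /= y.sum()             # softmax probabilities, shape (V,)

print(h.shape, u.shape, y.shape)  # (10,) (9,) (9,)

# With a one-hot input, the hidden layer is simply row 3 of w1.
assert np.allclose(h, w1[3])
```

Because the hidden layer is just a row of `w1`, that row is what is later read out as the word vector.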

### 3. Generate the dataset of center words and their contexts

The figure below uses window_size=2.
![image.png](attachment:image.png)

### 4. Generate the training data
Each word is represented with a one-hot encoding.

In [5]:
settings = {
'window_size': 2,      # number of words taken on each side of the target word
'n': 10,               # dimension of the hidden layer
'epochs': 50,          # number of training epochs
'learning_rate': 0.01   # learning rate
}

In [6]:
class word2vec():
  def __init__(self):
    self.n = settings['n']
    self.lr = settings['learning_rate']
    self.epochs = settings['epochs']
    self.window = settings['window_size']

  def generate_training_data(self, settings, corpus):
    # Count how often each unique word occurs, using a dictionary
    word_counts = defaultdict(int)
    for row in corpus:
      for word in row:
        word_counts[word] += 1
    ## The corpus has 9 distinct words
    self.v_count = len(word_counts.keys())
    # Generate Lookup Dictionaries (vocab)
    self.words_list = list(word_counts.keys())
    # Generate word:index
    self.word_index = dict((word, i) for i, word in enumerate(self.words_list))
    # Generate index:word
    self.index_word = dict((i, word) for i, word in enumerate(self.words_list))
    
    training_data = []
    # Cycle through each sentence in corpus
    for sentence in corpus:
      sent_len = len(sentence)
      # Cycle through each word in sentence
      for i, word in enumerate(sentence):
        # Convert target word to one-hot
        w_target = self.word2onehot(sentence[i])
        # Cycle through context window
        w_context = []
        # Note: window_size 2 will have range of 5 values
        for j in range(i - self.window, i + self.window+1):
          # Criteria for a context word
          # 1. The target word cannot be its own context word (j != i)
          # 2. The index must be at least 0 (j >= 0), otherwise it would wrap around to the end of the list
          # 3. The index must be at most sent_len-1 (j <= sent_len-1), otherwise the list index is out of range
          if j != i and j <= sent_len-1 and j >= 0:
            # Append the one-hot representation of word to w_context
            w_context.append(self.word2onehot(sentence[j]))
            # print(sentence[i], sentence[j]) 
        # training_data contains a one-hot representation of the target word and context words
        training_data.append([w_target, w_context])
    # Context lists differ in length, so build an object array (recent NumPy rejects ragged input otherwise)
    return np.array(training_data, dtype=object)

  def word2onehot(self, word):
    # word_vec - initialise a blank vector
    word_vec = [0 for i in range(0, self.v_count)] # Alternative - np.zeros(self.v_count)
    # Get ID of word from word_index
    word_index = self.word_index[word]
    # Change value from 0 to 1 according to ID of the word
    word_vec[word_index] = 1
    return word_vec

In [7]:
# Initialise object
w2v = word2vec()
# Numpy ndarray with one-hot representation for [target_word, context_words]
training_data = w2v.generate_training_data(settings, corpus)

In [8]:
print("Target (natural):{} Context (language, processing):{}".format(training_data[0][0],training_data[0][1]))

Target (natural):[1, 0, 0, 0, 0, 0, 0, 0, 0] Context (language, processing):[[0, 1, 0, 0, 0, 0, 0, 0, 0], [0, 0, 1, 0, 0, 0, 0, 0, 0]]


### 5.Skip-Gram模型的正向传播

(1) Forward pass

In [9]:
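def forward_pass(self, x):
    # Hidden layer: with a one-hot x this selects one row of w1
    h = np.dot(self.w1.T, x)
    # Scores for every word in the vocabulary
    u = np.dot(self.w2.T, h)
    # Softmax turns the scores into probabilities
    y_c = self.softmax(u)
    return y_c, h, u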

(2) Define the softmax function

In [10]:
def softmax(self, x):
    e_x = np.exp(x - np.max(x))
    return e_x / e_x.sum(axis=0)
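
The subtraction of `np.max(x)` in this function is a numerical-stability measure: it leaves the result unchanged but keeps `np.exp` from overflowing on large scores. The standalone sketch below, with made-up inputs, checks both properties.

```python
import numpy as np

def softmax(x):
    # Subtracting the maximum leaves the result unchanged but keeps exp() finite.
    e_x = np.exp(x - np.max(x))
    return e_x / e_x.sum(axis=0)

small = np.array([1.0, 2.0, 3.0])
large = small + 1000.0  # same differences, much larger magnitude

print(softmax(small))
print(np.allclose(softmax(small), softmax(large)))  # True
```

Without the shift, `np.exp(1003.0)` would overflow to infinity and the division would produce NaN.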

Forward pass example
![image.png](attachment:image.png)

### 6.Skip-Gram模型的反向传播

In [11]:
# Backpropagation
def backprop(self, e, h, x):
    # Gradients of the loss with respect to w2 and w1
    dl_dw2 = np.outer(h, e)  
    dl_dw1 = np.outer(x, np.dot(self.w2, e.T))

    # Update the parameters of both matrices
    self.w1 = self.w1 - (self.eta * dl_dw1)
    self.w2 = self.w2 - (self.eta * dl_dw2)
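
The two outer products follow from the chain rule applied to the loss that the `train` method of the complete code accumulates. A sketch of the derivation for one training sample is given below. The notation is introduced here for illustration: $j_c$ is the vocabulary index of the $c$-th context word, $t_c$ its one-hot vector, $C$ the number of context words, $y$ the softmax output, and $\eta$ the learning rate (`self.eta`).

```latex
\begin{aligned}
E &= -\sum_{c=1}^{C} u_{j_c} + C \log \sum_{j'=1}^{V} \exp(u_{j'}) \\
e_j &= \frac{\partial E}{\partial u_j} = \sum_{c=1}^{C} \left(y_j - t_{c,j}\right) \\
\frac{\partial E}{\partial W_2} &= h\, e^{\top}, \qquad
\frac{\partial E}{\partial W_1} = x\, (W_2 e)^{\top} \\
W_1 &\leftarrow W_1 - \eta \frac{\partial E}{\partial W_1}, \qquad
W_2 \leftarrow W_2 - \eta \frac{\partial E}{\partial W_2}
\end{aligned}
```

Here $e$ corresponds to the argument `e` of `backprop`, $h\, e^{\top}$ to `np.outer(h, e)`, and $x\, (W_2 e)^{\top}$ to `np.outer(x, np.dot(self.w2, e.T))`.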

The complete code is as follows:

In [12]:
import numpy as np
import re
from collections import defaultdict

In [13]:
class word2vec():
    def __init__ (self):
        self.n = settings['n']
        self.eta = settings['learning_rate']
        self.epochs = settings['epochs']
        self.window = settings['window_size']
        
    
    
    # GENERATE TRAINING DATA
    def generate_training_data(self, settings, corpus):

        # GENERATE WORD COUNTS
        # defaultdict(int) behaves like an empty dict, except that accessing a missing key does not raise an error and returns 0
        word_counts = defaultdict(int)
        for row in corpus:
            for word in row:
                word_counts[word] += 1

        self.v_count = len(word_counts.keys())

        # GENERATE LOOKUP DICTIONARIES
        self.words_list = sorted(list(word_counts.keys()),reverse=False)
        self.word_index = dict((word, i) for i, word in enumerate(self.words_list))
        self.index_word = dict((i, word) for i, word in enumerate(self.words_list))

        training_data = []
        # CYCLE THROUGH EACH SENTENCE IN CORPUS
        for sentence in corpus:
            sent_len = len(sentence)

            # CYCLE THROUGH EACH WORD IN SENTENCE
            for i, word in enumerate(sentence):
                
                #w_target  = sentence[i]
                w_target = self.word2onehot(sentence[i])

                # CYCLE THROUGH CONTEXT WINDOW
                w_context = []
                for j in range(i-self.window, i+self.window+1):
                    if j!=i and j<=sent_len-1 and j>=0:
                        w_context.append(self.word2onehot(sentence[j]))
                training_data.append([w_target, w_context])
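        # Context lists differ in length, so build an object array (recent NumPy rejects ragged input otherwise)
        return np.array(training_data, dtype=object)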


    # SOFTMAX ACTIVATION FUNCTION
    def softmax(self, x):
        e_x = np.exp(x - np.max(x))
        return e_x / e_x.sum(axis=0)


    # CONVERT WORD TO ONE HOT ENCODING
    def word2onehot(self, word):
        word_vec = [0 for i in range(0, self.v_count)]
        word_index = self.word_index[word]
        word_vec[word_index] = 1
        return word_vec


    # FORWARD PASS
    def forward_pass(self, x):
        h = np.dot(self.w1.T, x)
        u = np.dot(self.w2.T, h)
        y_c = self.softmax(u)
        return y_c, h, u
                

    # BACKPROPAGATION
    def backprop(self, e, h, x):
        dl_dw2 = np.outer(h, e)  
        dl_dw1 = np.outer(x, np.dot(self.w2, e.T))

        # UPDATE WEIGHTS
        self.w1 = self.w1 - (self.eta * dl_dw1)
        self.w2 = self.w2 - (self.eta * dl_dw2)
        


    # TRAIN W2V model
    def train(self, training_data):
        # INITIALIZE WEIGHT MATRICES
        self.w1 = np.random.uniform(-0.8, 0.8, (self.v_count, self.n))     # embedding matrix
        self.w2 = np.random.uniform(-0.8, 0.8, (self.n, self.v_count))     # context matrix
        
        # CYCLE THROUGH EACH EPOCH
        for i in range(0, self.epochs):

            self.loss = 0

            # CYCLE THROUGH EACH TRAINING SAMPLE
            for w_t, w_c in training_data:

                # FORWARD PASS
                y_pred, h, u = self.forward_pass(w_t)
                               
                # CALCULATE ERROR
                EI = np.sum([np.subtract(y_pred, word) for word in w_c], axis=0)
                

                # BACKPROPAGATION
                self.backprop(EI, h, w_t)

                # CALCULATE LOSS
                self.loss += -np.sum([u[word.index(1)] for word in w_c]) + len(w_c) * np.log(np.sum(np.exp(u)))
                
                
            if i %500 ==0:
                #print("w_t:",w_t)                
                #print("h:",h)
                #print("y_pred:",y_pred)
                #print("w_c",w_c)
                #print("EI:",EI)
                print('EPOCH:{},LOSS:{}'.format(i,self.loss))
                


    # input a word, returns a vector (if available)
    def word_vec(self, word):
        w_index = self.word_index[word]
        v_w = self.w1[w_index]
        return v_w


    # input a vector, returns nearest word(s)
    def vec_sim(self, vec, top_n):

        # CYCLE THROUGH VOCAB
        word_sim = {}
        for i in range(self.v_count):
            v_w2 = self.w1[i]
            theta_num = np.dot(vec, v_w2)
            theta_den = np.linalg.norm(vec) * np.linalg.norm(v_w2)
            theta = theta_num / theta_den

            word = self.index_word[i]
            word_sim[word] = theta

        # Sort by similarity (the second element of each item), not by word
        words_sorted = sorted(word_sim.items(), key=lambda x:x[1], reverse=True)

        for word, sim in words_sorted[:top_n]:
            print(word, sim)
            
       

    # input word, returns top [n] most similar words
    def word_sim(self, word, top_n):
        
        w1_index = self.word_index[word]
        v_w1 = self.w1[w1_index]

        # CYCLE THROUGH VOCAB
        word_sim = {}
        for i in range(self.v_count):
            v_w2 = self.w1[i]
            theta_num = np.dot(v_w1, v_w2)
            theta_den = np.linalg.norm(v_w1) * np.linalg.norm(v_w2)
            theta = theta_num / theta_den

            word = self.index_word[i]
            word_sim[word] = theta

        words_sorted = sorted(word_sim.items(), key=lambda x:x[1], reverse=True)

        for word, sim in words_sorted[:top_n]:
            print(word, sim)          
            

In [14]:
text = "natural language processing and machine learning is fun and exciting"

# Lightly preprocess the corpus: tokenize and lowercase
corpus = [[word.lower() for word in text.split()]]

In [15]:
settings = {}
settings['n'] = 10                   # dimension of word embeddings
settings['window_size'] = 2         # context window +/- center word
settings['min_count'] = 0           # minimum word count
settings['epochs'] = 10000           # number of training epochs
settings['neg_samp'] = 10           # number of negative words to use during training
settings['learning_rate'] = 0.01    # learning rate
np.random.seed(0)                   # set the seed for reproducibility


# INITIALIZE W2V MODEL
w2v = word2vec()

# generate training data
training_data = w2v.generate_training_data(settings, corpus)

# train word2vec model
w2v.train(training_data)

EPOCH:0,LOSS:75.70350574932301
EPOCH:500,LOSS:47.85955989797406
EPOCH:1000,LOSS:47.687676382532665
EPOCH:1500,LOSS:47.64030996928298
EPOCH:2000,LOSS:47.61530096408818
EPOCH:2500,LOSS:47.598041555548946
EPOCH:3000,LOSS:47.58457602467279
EPOCH:3500,LOSS:47.57341604332083
EPOCH:4000,LOSS:47.56385909431508
EPOCH:4500,LOSS:47.55551220712128
EPOCH:5000,LOSS:47.548126265358
EPOCH:5500,LOSS:47.541528686893415
EPOCH:6000,LOSS:47.53559230106244
EPOCH:6500,LOSS:47.53021927137989
EPOCH:7000,LOSS:47.52533197235119
EPOCH:7500,LOSS:47.520867397392024
EPOCH:8000,LOSS:47.51677351252583
EPOCH:8500,LOSS:47.51300675892997
EPOCH:9000,LOSS:47.50953027360567
EPOCH:9500,LOSS:47.506312580433146


In [16]:
# Inspect the one-hot encoding of a word
w2v.word2onehot('machine')

[0, 0, 0, 0, 0, 0, 1, 0, 0]

In [17]:
# Inspect the word vector
w2v.word_vec('machine')

array([-0.03096161, -1.41999537,  0.56853834, -0.82699764, -1.0779332 ,
        0.88248174,  1.59734871, -0.90061664,  0.81381965, -1.95577189])

In [18]:
# Find the most similar words
w2v.word_sim('language',10)

language 1.0
machine 0.4126119527471495
natural 0.2257632037979959
processing 0.1071285663634383
exciting -0.03474354041836012
learning -0.13885117084926915
is -0.2018037689718594
fun -0.2510742666497979
and -0.609545946473526
