# Word Embedding

**How do we represent the meaning of a word?**

Definition: meaning (Webster dictionary)
- the idea that is represented by a word, phrase, etc.
- the idea that a person wants to express by using words, signs, etc.
- the idea that is expressed in a work of writing, art, etc.

Commonest linguistic way of thinking of meaning:
- signifier $\Leftrightarrow$ signified (idea or thing) = denotation

**How do we have usable meaning in a computer?**

Common answer: Use a taxonomy like WordNet that has hypernym (is-a) relationships:
```python
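from nltk.corpus import wordnet as wn  # requires the WordNet data: nltk.download('wordnet')

panda = wn.synset('panda.n.01')
hyper = lambda s: s.hypernyms()
# Transitive closure over hypernyms: every "is-a" ancestor of panda
list(panda.closure(hyper))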
```

**Problems with this discrete representation**
- Great as a resource but missing nuances, e.g., **synonyms**:
   - adept, expert, good, practiced, proficient, skillful?
- Missing new words (impossible to keep up to date): wicked, badass, nifty, crack, ace, wizard, genius, ninja
- Subjective
- Requires human labor to create and adapt
- Hard to compute accurate word similarity

The vast majority of rule-based and statistical NLP work regards words as atomic symbols.

In vector space terms, this is a vector with one 1 and a lot of zeros:

$$[0\quad 0\quad 0\ \ldots\ 0\quad 1\quad 0\ \ldots\ 0]$$

Dimensionality: 20K (speech), 50K (PTB), 500K (big vocab), 13M (Google 1T)

We call this a one-hot representation.

It is a localist representation.

Its problem, e.g., for web search:
- If user searches for [<span style="color:pink">Dell notebook battery size</span>], we would like to match documents with "<span style="color:pink">Dell laptop battery capacity</span>"
- If user searches for [<span style="color:pink">Seattle motel</span>], we would like to match documents containing "<span style="color:pink">Seattle hotel</span>"

But

motel $= [0\ 0\ 0\ 0\ 0\ 0\ 0\ 0\ 0\ 0\ 1\ 0\ 0\ 0\ 0]^T$

hotel $= [0\ 0\ 0\ 0\ 0\ 0\ 0\ 1\ 0\ 0\ 0\ 0\ 0\ 0\ 0]^T$

$$\text{motel}^T\,\text{hotel} = 0$$

Our query and document vectors are <span style="color:purple">orthogonal</span>.

There is no natural notion of similarity in a set of one-hot vectors.
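
To make the orthogonality concrete, the following minimal sketch builds one-hot vectors for a small made-up vocabulary and checks their dot products. The word list and the helper names `one_hot` and `dot` are assumptions for illustration only.

```python
# Hypothetical toy vocabulary; real vocabularies have 20K to 13M entries.
vocab = ["seattle", "hotel", "motel", "battery", "laptop", "notebook"]
word_to_ix = {w: i for i, w in enumerate(vocab)}

def one_hot(word):
    """Return a list with a single 1 at the word's index and 0 elsewhere."""
    vec = [0] * len(vocab)
    vec[word_to_ix[word]] = 1
    return vec

def dot(u, v):
    """Plain dot product of two equal-length lists."""
    return sum(a * b for a, b in zip(u, v))

motel, hotel = one_hot("motel"), one_hot("hotel")
print(dot(motel, hotel))  # 0: two different words are always orthogonal
print(dot(motel, motel))  # 1: a word only matches itself
```

Whatever two distinct words are chosen, the result is 0, so the representation carries no information about how related they are.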

We could deal with similarity separately; instead, we explore a direct approach in which the vectors themselves encode it.

**Distributional similarity based representations**

You can get a lot of value by representing a word by means of its neighbors.

<span style="color:blue">"You shall know a word by the company it keeps"</span> (J. R. Firth, 1957)

This is one of the most successful ideas of modern statistical NLP.
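
As a rough illustration of describing a word by its neighbors, the sketch below collects the words that occur within a fixed window around each occurrence of a target word. The toy sentences, the window size, and the function name `context_counts` are assumptions made for this example.

```python
from collections import Counter

# Hypothetical toy sentences, chosen only for illustration.
sentences = [
    "government debt problems turning into banking crises",
    "europe needs unified banking regulation",
]

def context_counts(target, sentences, window=2):
    """Count the words appearing within `window` positions of `target`."""
    counts = Counter()
    for sentence in sentences:
        tokens = sentence.lower().split()
        for i, tok in enumerate(tokens):
            if tok != target:
                continue
            lo = max(0, i - window)
            hi = min(len(tokens), i + window + 1)
            counts.update(tokens[j] for j in range(lo, hi) if j != i)
    return counts

print(context_counts("banking", sentences))
```

The resulting counts (turning, into, crises, needs, unified, regulation) are a crude, sparse description of "banking" built purely from the company it keeps.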

**Word meaning is defined in terms of vectors**

We will build a dense vector for each word type, chosen so that it is good at predicting other words appearing in its context. Those other words are also represented by vectors, so it all gets a bit recursive.

**Directly learning low-dimensional word vectors**

Old idea. Relevant for this lecture and deep learning:
- Learning representations by back-propagating errors
- A neural probabilistic language model
- NLP (almost) from Scratch
- A recent, even simpler and faster model: word2vec.

**How do we select input and output words?**
- Method 1: continuous bag-of-words (CBOW)
- Method 2: skip-gram (SG)

![CBOW and Skip-Gram](images/image1-3.png)
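
The two methods differ in which words serve as input and which as output. A minimal sketch, assuming a symmetric window and the hypothetical helper names `skipgram_pairs` and `cbow_pairs`: skip-gram emits one (center, context) pair per neighbor, while CBOW emits one (context list, center) example per position.

```python
def skipgram_pairs(tokens, window=1):
    """Skip-gram: the center word is the input, each neighbor is an output."""
    pairs = []
    for i, center in enumerate(tokens):
        for j in range(max(0, i - window), min(len(tokens), i + window + 1)):
            if j != i:
                pairs.append((center, tokens[j]))
    return pairs

def cbow_pairs(tokens, window=1):
    """CBOW: the surrounding words are the input, the center word is the output."""
    examples = []
    for i, center in enumerate(tokens):
        lo = max(0, i - window)
        hi = min(len(tokens), i + window + 1)
        context = [tokens[j] for j in range(lo, hi) if j != i]
        examples.append((context, center))
    return examples

tokens = "the king rules".split()
print(skipgram_pairs(tokens))
# [('the', 'king'), ('king', 'the'), ('king', 'rules'), ('rules', 'king')]
print(cbow_pairs(tokens))
# [(['king'], 'the'), (['the', 'rules'], 'king'), (['king'], 'rules')]
```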

$$E=-\log p(w_{O,1},w_{O,2},\ldots,w_{O,C}\mid w_I)$$
- The loss function is the negative log probability of the context words given the input word
- The hidden layer simply selects a row of $W$ based on the input word; with $x$ the one-hot vector whose $k$-th entry is 1:
   $$h=x^TW=W_{(k,\cdot)}$$
- This is then mapped by another matrix to an output probability
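
The steps above can be checked numerically on a tiny example. The sketch below uses small hand-picked matrices (the values and the names `W`, `W_out`, `vec_mat`, and `softmax` are assumptions for illustration). It shows that multiplying a one-hot vector by $W$ just picks out row $k$, and that a second matrix followed by a softmax turns the hidden vector into a probability distribution over the vocabulary.

```python
import math

# Hypothetical 3-word vocabulary with 2-dimensional embeddings.
W = [[0.1, 0.2],   # row 0
     [0.3, 0.4],   # row 1
     [0.5, 0.6]]   # row 2
# Output matrix of shape (2, 3): maps the hidden vector to one score per word.
W_out = [[1.0, 0.0, -1.0],
         [0.0, 1.0, 1.0]]

def vec_mat(x, M):
    """Row vector times matrix: returns x^T M as a list."""
    return [sum(x[i] * M[i][j] for i in range(len(x))) for j in range(len(M[0]))]

def softmax(scores):
    """Numerically stable softmax."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

k = 1
x = [0, 1, 0]                       # one-hot vector for word k
h = vec_mat(x, W)                   # hidden layer: selects row k of W
probs = softmax(vec_mat(h, W_out))  # distribution over the 3 words
loss = -math.log(probs[2])          # E for a single context word with index 2

print(h == W[k])             # True
print(round(sum(probs), 6))  # 1.0
```

For several context words, the per-word losses are simply summed, which corresponds to the log of the product of their probabilities.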

```python
import torch
import torch.nn as nn
import torch.optim as optim
import numpy
from collections import defaultdict
```

```python
# 1. Prepare a simple corpus
# The corpus is designed so that 'king' and 'queen', and 'man' and 'woman',
# appear in similar contexts
corpus = [
    "The king is a strong man",
    "The queen is a wise woman",
    "A man can be a king",
    "A woman can be a queen",
    "The king rules the country",
    "The queen leads the people"
]

# 2. Preprocessing
# Merge all sentences into a single list of words
# (note: the context windows below can therefore span sentence boundaries)
words = []
for sentence in corpus:
    words.extend(sentence.lower().split())

# Build the vocabulary
# A defaultdict assigns the next free index to any word it has not seen yet
vocab = defaultdict(lambda: len(vocab))
# Deduplicate with a set, then build the vocabulary
# (set iteration order can differ between runs, so the indices may differ too)
unique_words = set(words)
for word in unique_words:
    vocab[word]  # the lookup triggers the defaultdict lambda, giving each word a unique index

# Build the reverse mapping to recover a word from its index
ix_to_word = {i: word for word, i in vocab.items()}
vocab_size = len(vocab)

print(f"Vocab size: {vocab_size}")
print(f"Vocab example: {list(vocab.items())[:5]}")
print("-" * 30)

# 3. Generate skip-gram training data
# The skip-gram task: given a center word, predict the context words around it
# window_size is the number of words considered on each side of the center word
window_size = 2
training_data = []

for i, word in enumerate(words):
    center_word_idx = vocab[word]
    # Iterate over the context words inside the window
    for j in range(i - window_size, i + window_size + 1):
        # Keep only valid positions, and skip the center word itself
        if j >= 0 and j < len(words) and i != j:
            context_word_idx = vocab[words[j]]
            training_data.append((center_word_idx, context_word_idx))

print(f"Generated {len(training_data)} training pairs")
print("Training data examples (center_word, context_word):")
# Print the first 5 training samples as words
for center_idx, context_idx in training_data[:5]:
    print(f"  ({ix_to_word[center_idx]}, {ix_to_word[context_idx]})")
```

```
Vocab size: 15
Vocab example: [('wise', 0), ('leads', 1), ('is', 2), ('rules', 3), ('the', 4)]
------------------------------
Generated 130 training pairs
Training data examples (center_word, context_word):
  (the, king)
  (the, is)
  (king, the)
  (king, is)
  (king, a)
```
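
The pair generation above first merges all sentences into one list, so a window near a sentence boundary can pair the last words of one sentence with the first words of the next. A possible variant, sketched here under the assumption that such cross-sentence pairs are unwanted (the function name `sentence_pairs` is hypothetical), builds the windows within each sentence separately.

```python
corpus = [
    "The king is a strong man",
    "The queen is a wise woman",
    "A man can be a king",
    "A woman can be a queen",
    "The king rules the country",
    "The queen leads the people",
]

def sentence_pairs(corpus, window_size=2):
    """Generate (center, context) word pairs without crossing sentence boundaries."""
    pairs = []
    for sentence in corpus:
        tokens = sentence.lower().split()
        for i, center in enumerate(tokens):
            lo = max(0, i - window_size)
            hi = min(len(tokens), i + window_size + 1)
            pairs.extend((center, tokens[j]) for j in range(lo, hi) if j != i)
    return pairs

pairs = sentence_pairs(corpus)
print(len(pairs))  # 100
print(pairs[:3])   # [('the', 'king'), ('the', 'is'), ('king', 'the')]
```

On the same six sentences this yields 100 pairs instead of 130; the 30 dropped pairs are exactly those that straddle two sentences.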


```python
class SkipGramModel(nn.Module):
    def __init__(self, vocab_size, embedding_dim):
        super().__init__()
        self.vocab_size = vocab_size
        self.embedding_dim = embedding_dim

        # Embedding layer for center words (W)
        # A matrix of shape (vocab_size, embedding_dim);
        # given a word index, it returns that word's embedding vector
        self.center_embeddings = nn.Embedding(vocab_size, embedding_dim)

        # Embedding layer for context words (W')
        # Also a matrix of shape (vocab_size, embedding_dim)
        self.context_embeddings = nn.Embedding(vocab_size, embedding_dim)

    def forward(self, center_word_idx):
        # 1. Look up the embedding of the center word
        # center_word_idx shape: (batch_size)
        # center_embedding shape: (batch_size, embedding_dim)
        center_embedding = self.center_embeddings(center_word_idx)

        # 2. Dot product of the center vector with every context vector
        # self.context_embeddings.weight shape: (vocab_size, embedding_dim)
        # We want a score for each word as a context word, so we use a matrix product:
        # (batch_size, embedding_dim) @ (embedding_dim, vocab_size) -> (batch_size, vocab_size)
        scores = torch.matmul(center_embedding, self.context_embeddings.weight.t())

        # Return the logits (raw scores); they are passed to CrossEntropyLoss,
        # which applies log_softmax internally
        return scores
```

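```python
# Hyperparameters
EMBEDDING_DIM = 10    # embedding dimension; in practice usually 50, 100, or 300
LEARNING_RATE = 0.01  # learning rate
EPOCHS = 50           # number of training epochs

# Initialize the model, loss function, and optimizer
model = SkipGramModel(vocab_size, EMBEDDING_DIM)
# Loss: cross-entropy, suited to multi-class problems; it applies LogSoftmax internally
loss_function = nn.CrossEntropyLoss()
# Optimizer: Adam, a widely used optimizer that works well in practice
optimizer = optim.Adam(model.parameters(), lr=LEARNING_RATE)
```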

```python
# Training loop
for epoch in range(EPOCHS):
    total_loss = 0
    for center_word_idx, context_word_idx in training_data:
        # Convert the inputs to PyTorch tensors
        # Both nn.Embedding and CrossEntropyLoss expect integer (long) indices
        center_tensor = torch.tensor([center_word_idx], dtype=torch.long)
        context_tensor = torch.tensor([context_word_idx], dtype=torch.long)

        # 1. Zero the gradients
        optimizer.zero_grad()

        # 2. Forward pass: compute the prediction scores
        scores = model(center_tensor)

        # 3. Compute the loss
        # scores has shape (1, vocab_size) and context_tensor has shape (1),
        # which is the input format CrossEntropyLoss expects
        loss = loss_function(scores, context_tensor)

        # 4. Backward pass
        loss.backward()

        # 5. Update the parameters
        optimizer.step()

        total_loss += loss.item()

    # Print the average loss every 10 epochs
    if (epoch + 1) % 10 == 0:
        print(f"Epoch {epoch+1}/{EPOCHS}, Loss: {total_loss/len(training_data):.4f}")

print("\nTraining complete!")
```

```
Epoch 10/50, Loss: 2.1238
Epoch 20/50, Loss: 2.0294
Epoch 30/50, Loss: 1.9959
Epoch 40/50, Loss: 1.9798
Epoch 50/50, Loss: 1.9700

Training complete!
```


```python
# Extract the learned word vectors
# (the center-word embeddings are usually taken as the final word vectors)
word_vectors = model.center_embeddings.weight.detach()

def find_most_similar(word, top_n=5):
    """
    Find the words most similar to the given word.
    """
    if word not in vocab:
        print(f"'{word}' is not in the vocabulary.")
        return

    # Vector of the input word
    input_vec = word_vectors[vocab[word]].unsqueeze(0)  # add a dimension for broadcasting

    # Cosine similarity
    # (1, dim) @ (dim, vocab_size) -> (1, vocab_size)
    similarities = torch.matmul(input_vec, word_vectors.t()) / (torch.norm(input_vec) * torch.norm(word_vectors, dim=1))
    similarities = similarities.squeeze(0)  # drop the extra dimension

    # Sort and take the top_n results
    # Sorting is descending, so index 0 is the word itself (similarity 1); skip it
    top_indices = torch.argsort(similarities, descending=True)[1:top_n+1]

    print(f"Words most similar to '{word}':")
    for idx in top_indices:
        sim_score = similarities[idx].item()
        print(f"  - {ix_to_word[idx.item()]} (similarity: {sim_score:.3f})")

# --- Check the results ---
# Note: the corpus is tiny, so the results may be imperfect, but some interesting patterns should show.
# For example, 'king' should be fairly similar to 'queen' and 'man'.
print("\n--- Word vector similarity test ---")
find_most_similar('king')
print("-" * 20)
find_most_similar('queen')
print("-" * 20)
find_most_similar('man')
```

```

--- Word vector similarity test ---
Words most similar to 'king':
  - man (similarity: 0.621)
  - country (similarity: 0.610)
  - leads (similarity: 0.444)
  - people (similarity: 0.340)
  - wise (similarity: 0.320)
--------------------
Words most similar to 'queen':
  - people (similarity: 0.456)
  - strong (similarity: 0.416)
  - woman (similarity: 0.295)
  - the (similarity: 0.267)
  - rules (similarity: 0.227)
--------------------
Words most similar to 'man':
  - king (similarity: 0.621)
  - country (similarity: 0.459)
  - woman (similarity: 0.453)
  - be (similarity: 0.366)
  - is (similarity: 0.352)
```
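
The similarity score reported above is the cosine of the angle between two vectors, $\cos(u,v)=\frac{u\cdot v}{\lVert u\rVert\,\lVert v\rVert}$. As a quick sanity check of what the values mean, here is a pure-Python sketch on hand-picked 2-dimensional vectors (the function name `cosine_similarity` and the vectors are assumptions for illustration).

```python
import math

def cosine_similarity(u, v):
    """cos(u, v) = (u . v) / (|u| |v|)"""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

print(cosine_similarity([1.0, 0.0], [2.0, 0.0]))   # 1.0: same direction
print(cosine_similarity([1.0, 0.0], [0.0, 3.0]))   # 0.0: orthogonal
print(cosine_similarity([1.0, 0.0], [-1.0, 0.0]))  # -1.0: opposite direction
```

Because the score ignores vector length, a value such as 0.621 for (king, man) says only that the two learned vectors point in broadly similar directions.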
