<div align='right'><h3> Yes, we <b>used similarity</b> to <font color='red'>find <i>related</i> words</font> but it was designed to <font color='blue'>predict context words</font>! </h3></div>
  

# Data

## 1. Get a corpus

The original model was trained on a roughly 100-billion-word portion of the Google News corpus.
I don't think that corpus can be found publicly, and we certainly could not train on something that large here.

We'll stick with WikiText-2 (version 1).

Links: [WikiText-2 description](https://paperswithcode.com/dataset/wikitext-2), [WikiText-2 dataset page](https://huggingface.co/datasets/wikitext/viewer/wikitext-2-v1/train)

In [None]:
from collections import Counter 
from tqdm.auto import tqdm, trange
from datasets import load_dataset
from matplotlib import pyplot as plt
import numpy as np
from pathlib import Path
import torch
import torch.nn as nn
import torch.nn.functional as F
from torch.utils.data import TensorDataset, DataLoader

In [None]:
# Load the dataset
wikitext = load_dataset("wikitext", "wikitext-2-v1")
wikitext, wikitext['train'][10]['text']

# Goals

PS: we do not replicate the Word2Vec paper exactly; for now we do something in the same spirit.

**predict the word given its surrounding words**

![Screenshot 2025-05-05 at 22.05.25.png](<attachment:Screenshot 2025-05-05 at 22.05.25.png>)


This is what we try to model, with a window of two words on each side: $P(w_i \mid w_{i-2}, w_{i-1}, w_{i+1}, w_{i+2})$

e.g. $P(\text{he} \mid \text{an}, \text{offer}, \text{can't}, \text{refuse})$
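
To make the conditioning window concrete, here is a minimal sketch on a toy sentence. It assumes a window of two words on each side and `<pad>` padding at the sentence edges; the helper name `context_windows` and the toy sentence are made up for illustration and are not part of the notebook's own code.

```python
def context_windows(doc, window=2, pad='<pad>'):
    """Return (context, target) pairs; context holds `window` words from each side."""
    padded = [pad] * window + doc + [pad] * window
    pairs = []
    for i, target in enumerate(doc):
        _i = i + window  # index of the target inside the padded list
        left = padded[_i - window:_i]
        right = padded[_i + 1:_i + 1 + window]
        pairs.append((left + right, target))
    return pairs

# Toy sentence (an assumption for illustration)
sentence = "make him an offer he can't refuse".split()
pairs = context_windows(sentence)
print(pairs[4])  # the instance whose target is "he"
```

Words near the edges get `<pad>` in place of the missing neighbours, so every instance has exactly four context tokens.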

# Let's start by pre-processing the data

## Get a Vocab

You know the drill by now: get a `Counter`, set `n_words`, and keep the `n_words` most frequent words.
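
As a reminder of the drill, here is a small sketch on a two-sentence toy corpus. The names `toy_docs`, `toy_counter`, `toy_n_words` and `toy_vocab` are hypothetical, and whitespace splitting is an assumed tokenisation; the real cells below work on the WikiText data instead.

```python
from collections import Counter

# Toy corpus standing in for the real dataset (an assumption for illustration)
toy_docs = ["the cat sat on the mat", "the dog sat on the log"]

# Count every whitespace-separated token
toy_counter = Counter()
for doc in toy_docs:
    toy_counter.update(doc.split())

toy_n_words = 3  # keep only the 3 most frequent words
toy_vocab = {'<unk>': 0, '<pad>': 1}
for word, _count in toy_counter.most_common(toy_n_words):
    toy_vocab[word] = len(toy_vocab)  # next free id

# Words outside the vocab fall back to the <unk> id
ids = [toy_vocab.get(w, toy_vocab['<unk>']) for w in "the cat sat".split()]
print(toy_vocab, ids)
```

Reserving ids 0 and 1 for `<unk>` and `<pad>` before adding real words keeps the special tokens stable regardless of the corpus.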

In [None]:
# Make a vocab
word_counter = Counter()
n_words = 10_000

...
    
len(word_counter), word_counter.most_common(10)

In [None]:
# Use the word frequencies to make the vocab
unk_token = '<unk>'
vocab = {'<unk>': 0, '<pad>': 1}
for ...

print(vocab['however'])

In [None]:
# Quickly convert all words to ids

# Break things into words
train_text = ...
valid_text = ...

# Just remove all docs which have no words
train_text = [x for x in train_text if len(x)>0]
valid_text = [x for x in valid_text if len(x)>0]

# Use vocab to turn them into ids
train_text_ids = []
for doc in train_text:
    train_text_ids.append(
        [vocab.get(word, vocab['<unk>']) for word in doc]
    )

valid_text_ids = []
for doc in valid_text:
    valid_text_ids.append(
        [vocab.get(word, vocab['<unk>']) for word in doc]
    )

In [None]:
train_text[0], train_text_ids[0]

In [None]:
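doc = train_text[0]
# Pad two tokens on each side so every word has two left and two right neighbours
_doc = ['<pad>', '<pad>'] + doc + ['<pad>', '<pad>']
i = 4       # position of the target word in the original doc
_i = i + 2  # position of the same word in the padded doc
print(_doc)
# left context, right context, padded index, start of left context, first padded tokens
_doc[_i-2:_i], _doc[_i+1:_i+3], _i, _i-2, _doc[:10]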

In [None]:
# Making data loaders

# We want to have inputs be [w_-2, w_-1, w_+1, w_+2]. The label for this instance would be w

contexts, targets = [], []
for doc in tqdm(train_text):
    _doc = ['<pad>', '<pad>'] + doc + ['<pad>', '<pad>']
    print(_doc)
    for i, word in enumerate(doc):
        # hint: _i = i + 2 (index of `word` inside the padded _doc)
        ...
    break

contexts, targets

In [None]:
# Scale it up for the entire dataset
pad_id = vocab['<pad>']

train_contexts, train_targets = [], []
for doc in tqdm(train_text_ids):
    _doc = [pad_id, pad_id] + doc + [pad_id, pad_id]
    for i, word in enumerate(doc):
        ...

valid_contexts, valid_targets = [], []
for doc in tqdm(valid_text_ids):
    _doc = [pad_id, pad_id] + doc + [pad_id, pad_id]
    for i, word in enumerate(doc):
        ...


print(len(train_contexts), len(train_targets), len(valid_contexts), len(valid_targets))

In [None]:
# Throw them into a dataloader
train_contexts = ...
train_targets = ...
valid_contexts = ...
valid_targets = ...
print(train_contexts.shape, train_targets.shape)

cbow_train_dataset = TensorDataset(train_contexts, train_targets)
cbow_valid_dataset = TensorDataset(valid_contexts, valid_targets)

train_dataloader = DataLoader(cbow_train_dataset, batch_size=10_000, shuffle=True)
valid_dataloader = DataLoader(cbow_valid_dataset, batch_size=10_000, shuffle=False)  # validation order does not matter

In [None]:
## Try it out:
for batch in train_dataloader:
    break

batch[0], batch[1]

![cbow](<../resources/cbow.png>)

# 2. Model

1. We start with the four context words
2. We assign each a vector (4 vectors, n dimensions) using an embedding layer/matrix
3. We average the four vectors to create a 'context vector'
4. We pass the 'context vector' to the output layer, which produces one score per vocabulary word
5. We apply a softmax to get a probability distribution over the vocabulary
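
The five steps above can be sketched without a class as follows. This is only one possible wiring, not necessarily the one intended for the exercise below: the embedding size of 64, the variable names, and the use of a single `nn.Linear` as the output layer are all assumptions.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

torch.manual_seed(0)
vocab_size, emb_dim = 10_000, 64  # emb_dim is an assumed value

embedding = nn.Embedding(vocab_size, emb_dim)   # step 2: one vector per word id
output_layer = nn.Linear(emb_dim, vocab_size)   # step 4: context vector -> scores

context_ids = torch.randint(1, vocab_size, (1, 4))  # step 1: (batch, 4) word ids
vectors = embedding(context_ids)                    # (batch, 4, emb_dim)
context_vector = vectors.mean(dim=1)                # step 3: (batch, emb_dim)
logits = output_layer(context_vector)               # (batch, vocab_size)
probs = F.softmax(logits, dim=-1)                   # step 5: distribution over vocab

print(vectors.shape, context_vector.shape, probs.shape)
```

Averaging over `dim=1` collapses the four context positions, which is why the order of the context words has no effect on the prediction in this kind of model.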

In [None]:
# Let's do it without a class first
inputs = torch.randint(1, 10_000, (1, 4))
inputs

In [None]:
# Let's make a class out of this
class CBOW(nn.Module):
    ...

---------------


# So let's start backproping???

![computetime](https://media.tenor.com/rDKZFPwK-00AAAAC/the-matrix-keanu-reeves.gif "backprop")

In [None]:
# Do the training
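
For orientation, a generic PyTorch training loop for this kind of model might look like the sketch below. Everything in it is an assumption made for illustration: `TinyCBOW` is a hypothetical stand-in rather than the notebook's `CBOW` class, the data is random, and the optimizer, learning rate and epoch count are arbitrary choices. On random data the falling loss only shows that the loop runs and the model memorises, nothing more.

```python
import torch
import torch.nn as nn
from torch.utils.data import TensorDataset, DataLoader

torch.manual_seed(0)

class TinyCBOW(nn.Module):  # hypothetical stand-in model
    def __init__(self, vocab_size, emb_dim):
        super().__init__()
        self.emb = nn.Embedding(vocab_size, emb_dim)
        self.out = nn.Linear(emb_dim, vocab_size)

    def forward(self, contexts):
        # Average the context embeddings, then score every vocabulary word
        return self.out(self.emb(contexts).mean(dim=1))

# Random toy data in place of the real (contexts, targets) tensors
toy_vocab_size = 50
toy_contexts = torch.randint(0, toy_vocab_size, (256, 4))
toy_targets = torch.randint(0, toy_vocab_size, (256,))
toy_loader = DataLoader(TensorDataset(toy_contexts, toy_targets),
                        batch_size=64, shuffle=True)

model = TinyCBOW(toy_vocab_size, emb_dim=16)
optimizer = torch.optim.Adam(model.parameters(), lr=1e-2)
loss_fn = nn.CrossEntropyLoss()  # expects raw scores; applies log-softmax itself

epoch_losses = []
for epoch in range(20):
    total = 0.0
    for contexts, targets in toy_loader:
        optimizer.zero_grad()                      # clear old gradients
        loss = loss_fn(model(contexts), targets)   # forward pass + loss
        loss.backward()                            # backprop
        optimizer.step()                           # parameter update
        total += loss.item() * len(targets)
    epoch_losses.append(total / len(toy_contexts))

print(epoch_losses[0], epoch_losses[-1])
```

Because `nn.CrossEntropyLoss` takes unnormalised scores, the model in this sketch returns logits and no explicit softmax is applied during training.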

# Too slow?

Let's use a GPU.
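
A minimal sketch of what that usually involves in PyTorch, assuming a CUDA device may or may not be present: pick a device, move the parameters there once, and move each batch there before the forward pass. The small embedding layer and random batch are placeholders, not the notebook's model or data.

```python
import torch
import torch.nn as nn

# Use the GPU when PyTorch can see one, otherwise fall back to the CPU
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

layer = nn.Embedding(100, 8).to(device)             # parameters must live on the device
batch = torch.randint(0, 100, (32, 4)).to(device)   # and so must every batch
out = layer(batch)                                   # computed on `device`

print(device, out.device, out.shape)
```

If the model and the batch end up on different devices, PyTorch raises a runtime error, so both moves are needed.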

# <font color="red">Problems! </font>: Inefficient

For each training instance (a context and its target word), we compute a distribution over the entire vocabulary.
Why? To normalize the scores.

###### Recall: 

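**score**: $f(u \cdot v)$, i.e. $f(u^T v)$. Our $f(\cdot)$ was $\exp(\cdot)$.


**normalization**: $\sum_{i=1}^{|\text{vocab}|} f(u \cdot v_i)$ <- **not good!** The sum has one term per vocabulary word, so every single prediction costs $|\text{vocab}|$ dot products.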
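
To see where the cost comes from, the sketch below computes one probability by hand in plain Python. The sizes (1,000 words, 16 dimensions), the random vectors and the target index are assumptions chosen so it runs instantly; `u` plays the role of the context vector and `V` holds one output vector per vocabulary word.

```python
import math
import random

random.seed(0)
vocab_size, dim = 1_000, 16  # small assumed sizes

# Context vector u and one output vector v_i per vocabulary word
u = [random.gauss(0, 0.1) for _ in range(dim)]
V = [[random.gauss(0, 0.1) for _ in range(dim)] for _ in range(vocab_size)]

def score(u, v):
    """f(u . v) with f = exp."""
    return math.exp(sum(a * b for a, b in zip(u, v)))

target = 42                                # arbitrary target word id
numerator = score(u, V[target])            # 1 dot product
normalizer = sum(score(u, v) for v in V)   # vocab_size dot products
prob = numerator / normalizer

print(prob)
```

The numerator needs a single dot product, while the normalizer touches every row of `V`, and this has to be repeated for every training instance.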


# Further Reading

A great overview of this entire thing - [Blogpost](https://mccormickml.com/2016/04/19/word2vec-tutorial-the-skip-gram-model/)


Another implementation of the entire thing - [Github](https://github.com/lukysummer/SkipGram_with_NegativeSampling_Pytorch/blob/master/SkipGram_NegativeSampling.ipynb)

Skip-gram embeddings with negative sampling are an implicit factorization of a word-context co-occurrence (shifted PMI) matrix - [Paper](https://papers.nips.cc/paper/2014/file/feab05aa91085b7a8012516bc3533958-Paper.pdf)

On biases in word embeddings, and ways to counteract them (only gender bias is targeted in this paper) - [Paper](https://arxiv.org/pdf/1607.06520.pdf)

WEAT (Word Embedding Association Test) - [Paper](https://arxiv.org/pdf/1608.07187.pdf)
