<a href="https://colab.research.google.com/github/alasarerhan/Deep-Learning-Projects/blob/main/word2vec_vlms.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Word2Vec Model Using Skip-Gram Architecture

## Install Dependencies

In [41]:
!pip install -q numpy requests tqdm torch scikit-learn matplotlib seaborn tabulate

# Ensure Reproducibility with Random Seeds

Many of the algorithms used in this notebook, such as sub-sampling, negative sampling, and model initialization, rely on randomness to function. While this randomness can improve efficiency and generalization, it also means that running the same code multiple times may yield slightly different results.

To make results consistent across different runs of the notebook, we set random seeds for every source of randomness used in the code: Python's random module, NumPy, and PyTorch. With the seeds fixed, operations like sampling and model initialization start from the same random state each time the notebook is executed. Some variation can remain, for example from nondeterministic GPU kernels or DataLoader worker processes, so small differences between runs are still possible.

In [42]:
import random
import numpy as np
import torch

SEED = 42
random.seed(SEED)
np.random.seed(SEED)
torch.manual_seed(SEED)

if torch.cuda.is_available():
  torch.cuda.manual_seed(SEED)
  torch.cuda.manual_seed_all(SEED)
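As a quick illustration of why seeding matters, the sketch below uses only Python's random module. The helper name `draw_samples` is hypothetical and not part of the notebook. It uses a private generator so the global seed set above is left untouched.

```python
import random

def draw_samples(seed, count=5):
    """Return `count` pseudo-random floats from a freshly seeded generator (hypothetical helper)."""
    rng = random.Random(seed)  # private generator, independent of random.seed(SEED)
    return [rng.random() for _ in range(count)]

first_run = draw_samples(42)
second_run = draw_samples(42)
other_seed = draw_samples(43)

print(first_run == second_run)  # same seed replays the same draws
print(first_run == other_seed)  # a different seed gives a different sequence
```

The same idea applies to NumPy and PyTorch: fixing the seed fixes the starting state of the generator, not the order in which later code consumes it.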

# Configure Runtime for GPU Acceleration

Let's make sure that we have access to a GPU. We can use the nvidia-smi command to check.

In [43]:
!nvidia-smi

Tue Jan 28 19:25:13 2025       
+---------------------------------------------------------------------------------------+
| NVIDIA-SMI 535.104.05             Driver Version: 535.104.05   CUDA Version: 12.2     |
|-----------------------------------------+----------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |         Memory-Usage | GPU-Util  Compute M. |
|                                         |                      |               MIG M. |
|   0  Tesla T4                       Off | 00000000:00:04.0 Off |                    0 |
| N/A   33C    P8               9W /  70W |      3MiB / 15360MiB |      0%      Default |
|                                         |                      |                  N/A |
+-----------------------------------------+----------------------+----------------------+
                                                                    

# Download and Prepare Dataset

This section downloads the text8 dataset, a pre-processed sample of English Wikipedia text commonly used for language modeling. The text8 dataset is already cleaned and formatted: it contains only lowercase letters and spaces, with punctuation and case distinctions removed and digits spelled out as words. Because the text is a single sequence of space-separated words, splitting on whitespace tokenizes it, making it ready for vocabulary construction and subsequent preprocessing steps.

In [44]:
import os
import requests
import zipfile

URL = "http://mattmahoney.net/dc/text8.zip"
FILENAME = "text8.zip"

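if not os.path.isfile(FILENAME):
    response = requests.get(URL, stream=True)
    # fail early on an HTTP error instead of saving an error page as the archive
    response.raise_for_status()
    with open(FILENAME, 'wb') as file_obj:
        for chunk in response.iter_content(chunk_size=1024):
            if chunk:
                file_obj.write(chunk)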

if not os.path.isfile("text8"):
    with zipfile.ZipFile(FILENAME, 'r') as zipped_file:
        zipped_file.extractall(".")

def load_text_file():
    with open("text8", "r", encoding="utf-8") as file_obj:
        text_data = file_obj.read()
    return text_data.strip().split()

words = load_text_file()
print(words[:20])

['anarchism', 'originated', 'as', 'a', 'term', 'of', 'abuse', 'first', 'used', 'against', 'early', 'working', 'class', 'radicals', 'including', 'the', 'diggers', 'of', 'the', 'english']


In [45]:
print(f"Number of words in the text8 dataset: {len(words)}")
print(f"Number of unique words in the text8 dataset: {len(set(words))}")

Number of words in the text8 dataset: 17005207
Number of unique words in the text8 dataset: 253854


# Build Vocabulary with Most Frequent Tokens

This section constructs a vocabulary by retaining only words that appear at least a specified number of times. Words below this frequency threshold are discarded entirely, which reduces noise and memory consumption and follows the Word2Vec methodology. Indices are assigned in order of decreasing frequency, so index 0 is the most frequent word.

In [46]:
from collections import Counter

def build_vocabulary(words, min_frequency):
    word_counter = Counter(words)
    mapping = {}
    for index, (word, count) in enumerate(word_counter.most_common()):
        if count < min_frequency:
            break
        mapping[word] = index
    return mapping
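To make the behavior concrete, here is a minimal sketch that runs the same function on a made-up eight-word corpus. The function is repeated so the snippet runs on its own, and the `toy_` names are illustrative only.

```python
from collections import Counter

def build_vocabulary(words, min_frequency):
    # Same logic as the cell above, repeated so this snippet is self-contained.
    word_counter = Counter(words)
    mapping = {}
    for index, (word, count) in enumerate(word_counter.most_common()):
        if count < min_frequency:
            break
        mapping[word] = index
    return mapping

toy_words = "the cat sat on the mat the cat".split()  # made-up corpus
toy_vocabulary = build_vocabulary(toy_words, min_frequency=2)
print(toy_vocabulary)  # {'the': 0, 'cat': 1}
```

Only "the" (3 occurrences) and "cat" (2 occurrences) reach the threshold. The words that appear once get no index at all.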

In [47]:

MIN_FREQUENCY = 10

word_to_index = build_vocabulary(words, min_frequency=MIN_FREQUENCY)
index_to_word = {val: key for key, val in word_to_index.items()}
vocabluary_size = len(word_to_index)

In [48]:
words = [w for w in words if w in word_to_index]
print(f"Number of words in the text8 dataset: {len(words)}")
print(f"Number of unique words in the text8 dataset: {len(set(words))}")

Number of words in the text8 dataset: 16561031
Number of unique words in the text8 dataset: 47134
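The effect of the frequency threshold can be summarized with a little arithmetic on the figures printed above. This sketch simply copies those four numbers; the variable names are illustrative.

```python
# Figures copied from the printed output before and after vocabulary filtering.
total_tokens_before = 17_005_207
total_tokens_after = 16_561_031
unique_words_before = 253_854
unique_words_after = 47_134

token_coverage = total_tokens_after / total_tokens_before
vocabulary_ratio = unique_words_after / unique_words_before

print(f"Tokens retained: {token_coverage:.2%}")
print(f"Distinct words retained: {vocabulary_ratio:.2%}")
```

Filtering keeps roughly 97% of the tokens while cutting the number of distinct words to under a fifth, because the discarded words are individually very rare.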


# Apply Sub-sampling to Reduce Frequent Words

To mitigate the dominance of frequent tokens, a sub-sampling technique is applied. Tokens that appear excessively often are probabilistically downsampled, which yields a more balanced dataset and makes embedding learning more efficient.

High-frequency tokens (e.g., "the", "and") have lower keep probabilities, meaning they are more likely to be skipped. Low-frequency tokens have higher keep probabilities and are more likely to be included; tokens whose relative frequency is at or below the threshold are always kept.

In [49]:
def subsample_words(words, threshold):
    word_counter = Counter(words)
    total = len(words)

    def should_discard(word):
        frequency = word_counter[word] / total
        if frequency > threshold:
            p = 1 - np.sqrt(threshold / frequency)
            return random.random() < p
        return False

    return [word for word in words if not should_discard(word)]

In [50]:

THRESHOLD_FREQUENCY = 1e-5

subsampled_words = subsample_words(words, threshold=THRESHOLD_FREQUENCY)
print("Original number of words:", len(words))
print("Number of words after sub-sampling:", len(subsampled_words))

Original number of words: 16561031
Number of words after sub-sampling: 4496739


In [51]:
print(words[:20])
print(subsampled_words[:20])

['anarchism', 'originated', 'as', 'a', 'term', 'of', 'abuse', 'first', 'used', 'against', 'early', 'working', 'class', 'radicals', 'including', 'the', 'diggers', 'of', 'the', 'english']
['anarchism', 'abuse', 'radicals', 'diggers', 'revolution', 'sans', 'revolution', 'pejorative', 'way', 'violent', 'up', 'label', 'defined', 'anarchists', 'anarchism', 'archons', 'ruler', 'chief', 'anarchism', 'rulers']


# Sub-sampling Analysis

After applying sub-sampling to reduce the dominance of high-frequency words, it's helpful to compare how many times each word appears before and after sub-sampling. The snippet below takes the distinct words among the first 20 tokens of the dataset, sorts them by original frequency, and displays their original and subsampled counts.

This shows how sub-sampling removes most occurrences of excessively frequent words while retaining less common (but potentially more informative) words.

In [52]:
from collections import Counter
from tabulate import tabulate

counts = Counter(words)
subsampled_counts = Counter(subsampled_words)

# we'll focus on the first 20 words in the dataset, sorted by original frequency (descending).
sample_words = sorted(set(words[:20]), key=lambda w: counts[w], reverse=True)

table_data = []
for w in sample_words:
    original_count = counts[w]
    after_count = subsampled_counts.get(w, 0)
    table_data.append([w, original_count, after_count])

print(tabulate(table_data, headers=["Word", "Original Count", "Subsampled Count"], tablefmt="simple"))

Word          Original Count    Subsampled Count
----------  ----------------  ------------------
the                  1061396               13047
of                    593677                9910
a                     325873                7249
as                    131815                4610
first                  28810                2208
used                   22737                1944
english                11868                1346
early                  10172                1321
including               9633                1162
against                 8432                1175
term                    7219                1070
class                   3412                 714
working                 2271                 615
originated               572                 281
abuse                    563                 304
anarchism                303                 231
radicals                 116                 116
diggers                   25                  25
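The counts in the table can be compared with what the sub-sampling rule predicts. The sketch below assumes the same rule as subsample_words (discard probability 1 - sqrt(threshold / frequency) above the threshold). The helper name `keep_probability` is hypothetical, and the counts are copied from the table.

```python
import math

def keep_probability(count, total, threshold):
    """Probability that one occurrence of a word survives sub-sampling (hypothetical helper)."""
    frequency = count / total
    if frequency <= threshold:
        return 1.0  # at or below the threshold, nothing is discarded
    return math.sqrt(threshold / frequency)  # 1 minus the discard probability

TOTAL_TOKENS = 16_561_031  # corpus size after vocabulary filtering, from the output above
THRESHOLD = 1e-5           # same value as THRESHOLD_FREQUENCY

# Original counts copied from the table above.
for word, count in [("the", 1_061_396), ("first", 28_810), ("diggers", 25)]:
    p_keep = keep_probability(count, TOTAL_TOKENS, THRESHOLD)
    print(f"{word:>8}: keep probability {p_keep:.4f}, expected count {count * p_keep:.0f}")
```

For "the", the rule predicts that a little over 1% of occurrences survive, which is in line with the observed drop from 1061396 to 13047. Words such as "diggers" and "radicals" fall below the threshold, so their counts are unchanged.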


# Prepare Negative Sampling Distribution

Negative sampling is a technique introduced in the Word2Vec paper to make training embeddings computationally efficient. Instead of computing gradients for every word in the vocabulary (which is expensive for large vocabularies), negative sampling updates weights for only a small subset of words: the positive (correct) context word plus a small number of "negative" (incorrect) samples for each pair. Negatives are drawn from the unigram distribution raised to the 0.75 power, which raises the relative probability of rare words.

In [53]:
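def compute_negative_sampling_distribution(indexed_words):
    # count how often each word index occurs in the corpus
    counts = np.bincount(indexed_words)
    probability = counts / counts.sum()
    # raise to the 0.75 power, then renormalize so the result sums to 1
    probability_75 = probability**0.75
    return probability_75 / probability_75.sum()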
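To see what the 0.75 exponent does, the following plain-Python sketch applies the same smoothing to three made-up counts. The helper name `smoothed_distribution` and the counts are illustrative. Raising counts rather than probabilities to the power gives the same result, because the constant factor cancels in the final normalization.

```python
def smoothed_distribution(counts, power=0.75):
    """Normalize counts raised to `power` (plain-Python stand-in for the NumPy cell above)."""
    weights = [count ** power for count in counts]
    total = sum(weights)
    return [weight / total for weight in weights]

toy_counts = [900, 90, 10]  # made-up counts for a frequent, a medium and a rare word
raw = [count / sum(toy_counts) for count in toy_counts]
smoothed = smoothed_distribution(toy_counts)

for raw_p, smooth_p in zip(raw, smoothed):
    print(f"raw {raw_p:.3f} -> smoothed {smooth_p:.3f}")
```

The frequent word loses probability mass and the rare word gains some, so rare words are drawn as negatives more often than their raw frequency would suggest.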

# Create Custom Dataset for Skip-Gram Training

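A PyTorch dataset is implemented to generate training samples for the Skip-Gram model. For each target word, it returns the context words within a dynamic window: the window size is drawn uniformly between 1 and a maximum, so nearby words are sampled more often than distant ones. Negative samples are added later, when batches are assembled.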

In [54]:
def get_target(words, index, max_window_size=5):
    window_size = random.randint(1, max_window_size)
    start_position = max(0, index - window_size)
    end_position = min(index + window_size + 1, len(words))
    return words[start_position:index] + words[index + 1:end_position]

In [55]:
get_target([i for i in range(20)], 5)

[3, 4, 6, 7]

In [56]:
get_target(subsampled_words[:20], 5)

['revolution', 'revolution']
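Because get_target draws its window size at random, its output varies from call to call. The sketch below is a deterministic variant with a fixed window, which makes the slicing and the clipping at the sequence boundaries easier to inspect. The name `fixed_window_context` is hypothetical.

```python
def fixed_window_context(words, index, window_size):
    """Context of words[index] for a fixed window (hypothetical, deterministic variant of get_target)."""
    start_position = max(0, index - window_size)
    end_position = min(index + window_size + 1, len(words))
    return words[start_position:index] + words[index + 1:end_position]

toy_sequence = list(range(10))
print(fixed_window_context(toy_sequence, 0, 2))  # [1, 2]        (clipped at the start)
print(fixed_window_context(toy_sequence, 5, 2))  # [3, 4, 6, 7]  (full window)
print(fixed_window_context(toy_sequence, 9, 2))  # [7, 8]        (clipped at the end)
```

Words near either end of the corpus therefore get fewer context words than words in the middle.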

In [57]:
from torch.utils.data import Dataset, DataLoader


class WordToVecDataset(Dataset):
    def __init__(self, indexed_words, window_size=4):
        self.indexed_words = indexed_words
        self.window_size = window_size

    def __len__(self):
        return len(self.indexed_words)

    def __getitem__(self, index):
        center_word = self.indexed_words[index]
        context_words = get_target(self.indexed_words, index, self.window_size)
        return center_word, context_words


In [58]:
dataset = WordToVecDataset(
    indexed_words=[i for i in range(20)],
    window_size=4
)

center, context = dataset[0]
print(center, context)

0 [1, 2, 3, 4]


# Implement Collate Function for Efficient Batching

This section provides a custom collate function to combine individual samples into batches for training. It flattens each sample's context list into (center, context) pairs and draws the negative samples for the whole batch in a single torch.multinomial call, which avoids per-sample sampling overhead.

In [61]:
import torch

def create_collate_fn(
    vocabulary_size,
    negative_sampling_distribution,
    number_of_negative_samples
):
  negative_distribution_tensor = torch.tensor(negative_sampling_distribution, dtype = torch.float)

  def collate_function(batch):
    all_center_words = []
    all_context_words = []

    # flatten out all center-context pairs
    for center_word, context_word_list in batch:
      for context_word in context_word_list:
        all_center_words.append(center_word)
        all_context_words.append(context_word)

    center_words_tensor = torch.LongTensor(all_center_words)
    context_words_tensor = torch.LongTensor(all_context_words)

    # generate negative samples for the entire batch in one shot
    total_pairs = len(center_words_tensor)
    negative_samples_flat = torch.multinomial(
        negative_distribution_tensor,
        total_pairs * number_of_negative_samples,
        replacement=True
    )

    negative_samples_tensor = negative_samples_flat.view(total_pairs, number_of_negative_samples)

    return center_words_tensor, context_words_tensor, negative_samples_tensor
  return collate_function

In [62]:
dataset = WordToVecDataset(indexed_words = [i for i in range(20)],
                           window_size = 4)

collect_fn = create_collate_fn(20, np.full(20, 1 / 20, dtype = np.float32), 5)

dataloader = DataLoader(
    dataset,
    batch_size = 4,
    shuffle = True,
    num_workers = 2,
    collate_fn = collect_fn,
    drop_last = True
)

centers_tensor, contexts_tensor, negatives_tensor = next(iter(dataloader))
print(centers_tensor)
print(contexts_tensor)
print(negatives_tensor)

tensor([14, 14, 14, 14, 14, 14, 14, 14, 12, 12, 13, 13, 13, 13, 13, 13,  8,  8,
         8,  8,  8,  8,  8,  8])
tensor([10, 11, 12, 13, 15, 16, 17, 18, 11, 13, 10, 11, 12, 14, 15, 16,  4,  5,
         6,  7,  9, 10, 11, 12])
tensor([[ 3,  0, 11,  8, 19],
        [19, 13, 16,  3, 17],
        [ 5, 17, 19,  0, 14],
        [ 6,  8,  1, 16,  0],
        [ 2,  3,  2,  3, 13],
        [17,  0, 17,  6, 14],
        [18,  7,  2, 18, 16],
        [18, 15, 17, 12,  8],
        [18, 13,  6, 17, 13],
        [14, 13,  4, 18, 17],
        [16, 12,  7,  4,  4],
        [ 9,  7,  9,  6, 12],
        [15, 19,  1,  2,  8],
        [15,  6, 12,  1, 16],
        [18,  5,  8, 18,  0],
        [18,  7,  6,  1, 11],
        [ 0, 18, 16, 14, 11],
        [ 1, 14, 16,  6, 13],
        [ 9,  2, 19,  8,  5],
        [12,  0, 14, 11,  1],
        [18,  9,  5,  8,  1],
        [ 7, 15, 10, 10, 16],
        [11,  3, 11, 10,  9],
        [ 2, 15, 18, 18, 18]])
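
The shapes in the output above follow from the flattening step: 24 (center, context) pairs, each with 5 negatives. The plain-Python sketch below mirrors that logic on a two-item batch. It is not the notebook's function: `flatten_and_sample` is a hypothetical helper, and random.choices stands in for torch.multinomial.

```python
import random

def flatten_and_sample(batch, weights, number_of_negative_samples, seed=0):
    """Plain-Python sketch of the collate step (hypothetical helper)."""
    rng = random.Random(seed)  # private generator; the global random state is untouched
    centers, contexts = [], []
    # One (center, context) pair per context word.
    for center_word, context_word_list in batch:
        for context_word in context_word_list:
            centers.append(center_word)
            contexts.append(context_word)
    vocabulary = range(len(weights))
    # Sample with replacement according to the given weights, one row per pair.
    negatives = [
        rng.choices(vocabulary, weights=weights, k=number_of_negative_samples)
        for _ in centers
    ]
    return centers, contexts, negatives

toy_batch = [(5, [3, 4, 6, 7]), (0, [1, 2])]  # (center, context list) items, as the dataset returns them
toy_centers, toy_contexts, toy_negatives = flatten_and_sample(
    toy_batch, weights=[1.0] * 10, number_of_negative_samples=3
)
print(toy_centers)   # [5, 5, 5, 5, 0, 0]
print(toy_contexts)  # [3, 4, 6, 7, 1, 2]
print(len(toy_negatives), len(toy_negatives[0]))  # 6 3
```

The number of pairs per batch varies with the randomly drawn window sizes, so the batch dimension seen by the model is not fixed even when batch_size is.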
