# Continuous Bag of Words (Implementation)

In this notebook, I implement the Continuous Bag of Words (CBOW) algorithm, a word2vec algorithm for learning word embeddings. To train the CBOW model, we first need some actual text data. For this, I will use Text8, a text file containing the first 100 MB of cleaned text from Wikipedia.

## Motivation

One-hot encoding doesn't carry much meaning, since every word is equally distant from every other word. Word embeddings allow us to capture relationships between words.
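
To make this limitation concrete, here is a minimal sketch using a made-up three-word vocabulary. The words and the helper `one_hot` are illustrative assumptions, not part of the notebook's data. Any two distinct one-hot vectors have a dot product of zero, so the encoding treats every pair of words as equally unrelated.

```python
import numpy as np

# Hypothetical toy vocabulary, for illustration only
toy_vocab = ["king", "queen", "banana"]

def one_hot(word, vocab):
    """Return a one-hot vector with a 1 at the word's position in vocab."""
    vector = np.zeros(len(vocab))
    vector[vocab.index(word)] = 1.0
    return vector

king, queen, banana = (one_hot(word, toy_vocab) for word in toy_vocab)

# Distinct one-hot vectors are orthogonal, so "king" is no closer
# to "queen" than it is to "banana"
similarity_king_queen = float(king @ queen)
similarity_king_banana = float(king @ banana)
print(similarity_king_queen, similarity_king_banana)
```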

In CBOW, we predict a target word based on context words.

In [42]:
from sklearn.preprocessing import OneHotEncoder
from sklearn.model_selection import train_test_split
import numpy as np
import gensim.downloader as api
from collections import Counter
import json
import os
import torch

## 1. Prepare data

CBOW works by training a model to predict the target word from its context words, allowing us to use the trained parameters as word embeddings. Thus, we must first prepare the words to be fed into the neural network by one-hot encoding them. This entails a few main steps:

1. Retrieve a list of words to serve as vocabulary. We will do this by taking the 1000 most frequent words in the text8 dataset.
2. One-hot encode each word in the vocabulary, resulting in a dictionary mapping from each word to a sparse vector.
3. Find the context for each word in the vocabulary, resulting in a dictionary mapping from each word to its context words.
4. One-hot encode the word-context dictionary using the word-onehot dictionary, resulting in a one-hot encoded word-context dictionary.
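
The steps above can be sketched end to end on a toy corpus. This is only an illustration under simplifying assumptions: the sentence, the vocabulary size of 3, the window of 1, and all variable names are hypothetical, and plain Python lists stand in for the sparse vectors used later in the notebook.

```python
from collections import Counter

# Hypothetical toy corpus standing in for text8
tokens = "the cat sat on the mat the cat sat".split()

# Vocabulary: the most frequent words
toy_vocab = [word for word, _ in Counter(tokens).most_common(3)]

# One-hot encode each vocabulary word
word_to_one_hot = {
    word: [1 if i == j else 0 for j in range(len(toy_vocab))]
    for i, word in enumerate(toy_vocab)
}

# Collect the context words (1 word on each side) of every occurrence
window = 1
word_to_context = {word: [] for word in toy_vocab}
for i, word in enumerate(tokens):
    if word in word_to_context:
        neighbors = tokens[max(0, i - window):i] + tokens[i + 1:i + 1 + window]
        word_to_context[word].extend(neighbors)

# One-hot encode the contexts, dropping out-of-vocabulary neighbors
encoded = {
    tuple(word_to_one_hot[word]): [
        word_to_one_hot[ctx] for ctx in contexts if ctx in word_to_one_hot
    ]
    for word, contexts in word_to_context.items()
}

print(word_to_context["cat"])
```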

### 1.1 Prepare Vocabulary List

In [2]:
# Load the text8 dataset
text8_dataset = api.load('text8')
text8_words = [word for words in text8_dataset for word in words]
print("Number of words in text8:", len(text8_words))

Number of words in text8: 17005207


Since the Text8 dataset is extremely large, we will restrict the vocabulary to the 1,000 most frequent words, which is more practical: the data is still large enough to be a good learning experience, but not so large that the computer runs into memory issues.

In [40]:
max_vocab_size = 1000
word_counts = Counter(text8_words)
vocab = [word for word, count in word_counts.most_common(max_vocab_size)]
print('Vocabulary:')
print(vocab[0:10], '...')

Vocabulary:
['the', 'of', 'and', 'one', 'in', 'a', 'to', 'zero', 'nine', 'two'] ...
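
One way to judge a cutoff such as 1,000 words is to check what fraction of the corpus tokens the chosen vocabulary covers. The sketch below assumes a hypothetical helper, `vocabulary_coverage`, and a toy token list; applying it to `text8_words` is left as an option and no result for text8 is claimed here.

```python
from collections import Counter

def vocabulary_coverage(tokens, vocab_size):
    """Fraction of tokens that belong to the vocab_size most frequent words."""
    counts = Counter(tokens)
    covered = sum(count for _, count in counts.most_common(vocab_size))
    return covered / len(tokens)

# Hypothetical toy tokens: 'a' x4, 'b' x3, 'c' x2, 'd' x1
toy_tokens = "a a a a b b b c c d".split()
coverage = vocabulary_coverage(toy_tokens, 2)
print(coverage)  # the two most frequent words cover 7 of 10 tokens
```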


### 1.2 One Hot Encode Words
Now that we have prepared the list of words we will use as our vocabulary, let's create a dictionary that one-hot encodes each of these words. This lets us feed the words into our model so that it can train its parameters.

In [28]:
encoder = OneHotEncoder() 
vocab_array = np.array(vocab).reshape(-1, 1)
encoder.fit(vocab_array)
words_to_one_hot = {word: encoder.transform([[word]])[0] for word in vocab}
print(type(words_to_one_hot))

<class 'dict'>
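
A similar mapping can also be built without scikit-learn by indexing rows of an identity matrix. This is an alternative sketch, not the notebook's method: the names `toy_vocab`, `word_to_index`, and `toy_words_to_one_hot` are hypothetical, the vectors are dense NumPy arrays rather than sparse matrices, and the position of the 1 follows list order, so it will not necessarily match the positions assigned by `OneHotEncoder`.

```python
import numpy as np

toy_vocab = ["the", "of", "and"]  # hypothetical stand-in for `vocab`

# Assign each word an integer index, then take the matching identity row
word_to_index = {word: index for index, word in enumerate(toy_vocab)}
identity = np.eye(len(toy_vocab))
toy_words_to_one_hot = {
    word: identity[index] for word, index in word_to_index.items()
}

print(toy_words_to_one_hot["of"].tolist())
```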


Let's save the words_to_one_hot dictionary to a json file.

In [29]:
words_to_one_hot_list = {word: one_hot.toarray().tolist()[0] for word, one_hot in words_to_one_hot.items()} # Convert csr_matrix to list before saving to JSON

os.makedirs('data', exist_ok=True)  # Make sure the output directory exists
file_path = 'data/words_to_one_hot.json'
with open(file_path, 'w') as file:
    json.dump(words_to_one_hot_list, file, indent=4)

### 1.3 Retrieve Word Contexts

Since CBOW predicts the center word from its context words, we first need to define which context words correspond to each word. So, in addition to encoding each word, we must also retrieve the contexts of each word. We do this by finding the words surrounding each occurrence of the target word, within some window size. Let's choose a window size of 5.

The code below creates a dictionary where each key is a word and the corresponding value is a list of context windows, one per occurrence of the word.

In [6]:
WINDOW_SIZE = 5
    
def build_context_dictionary(words, window_size=WINDOW_SIZE):
    """   
    Build a dictionary mapping each word to the context windows around its occurrences.

    Args:
        words (list of str): The words to scan, in order.
        window_size (int): The number of words to consider on each side of the target word.

    Returns:
        dict: Dictionary where keys are words and values are lists of context windows (one list of words per occurrence).
    """
    words_to_context = {} 
    for index, word in enumerate(words):
        start_index = max(0, index - window_size)
        end_index = min(len(words), index + window_size + 1)       
        word_context = words[start_index:index] + words[index + 1:end_index]
        
        if word in words_to_context:
            words_to_context[word].append(word_context)
        else:
            words_to_context[word] = [word_context]
            
    return words_to_context
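
To see what such a context dictionary looks like, here is a compact sketch of the same windowing logic applied to a toy sentence. The function name `collect_contexts`, the sentence, and the window size of 2 are assumptions made for illustration. Note that windows shrink at the edges of the sequence and that a repeated word receives one window per occurrence.

```python
def collect_contexts(tokens, window_size):
    """Map each token to the list of context windows in which it occurs."""
    contexts = {}
    for index, token in enumerate(tokens):
        start = max(0, index - window_size)
        end = min(len(tokens), index + window_size + 1)
        # Words before and after the target, excluding the target itself
        window = tokens[start:index] + tokens[index + 1:end]
        contexts.setdefault(token, []).append(window)
    return contexts

toy_tokens = "the cat sat on the mat".split()  # hypothetical toy sentence
toy_contexts = collect_contexts(toy_tokens, window_size=2)
print(toy_contexts["the"])
# [['cat', 'sat'], ['sat', 'on', 'mat']]
```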


In [None]:
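# Note: `vocab` is a frequency-ranked list of unique words, so each word gets a single
# window made of its neighbors in that ranking. Passing `text8_words` instead would
# collect contexts from the running text, at a much higher memory cost.
words_to_context = build_context_dictionary(vocab)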

As we did for the words_to_one_hot dictionary, let's save the words_to_context dictionary as a JSON file as well.

In [None]:
file_path = 'data/words_to_context.json'

with open(file_path, 'w') as file:
    json.dump(words_to_context, file, indent=4)

The code below reloads the words_to_context dictionary from its JSON file, assuming you have saved that file previously.

In [31]:
if os.path.exists(file_path):
    with open(file_path) as file:
        words_to_context = json.load(file)

### 1.4 Encode Words-Context Dictionary

Now we have two dictionaries:
- `words_to_context`: A dictionary where each key is a word and the corresponding value is a list of context windows (each window is a list of context words).
- `words_to_one_hot`: A dictionary mapping each word to its one-hot encoded vector.

Let's use these two dictionaries to finally create a dictionary where both keys and their corresponding context words are one-hot vectors. 

In [46]:
def convert_words_to_one_hot(words_to_context, words_to_one_hot):
    """
    Convert a dictionary of words and their contexts from words to one-hot encoded vectors.

    Args:
        words_to_context (dict): A dictionary where keys are words and values are lists of context windows (lists of words).
        words_to_one_hot (dict): A dictionary mapping words to their corresponding one-hot encoded vectors.

    Returns:
        dict: A dictionary where both keys and their corresponding context words are converted to string representations of one-hot vectors.
    """
    one_hot_dict = {}

    for word, contexts in words_to_context.items():
        if word in words_to_one_hot:
            # Convert the one-hot vector for the key word to a string key
            one_hot_word = json.dumps(words_to_one_hot[word].toarray()[0].tolist())
            # Each entry of `contexts` is one window (a list of words), so flatten the
            # windows and keep only the context words that are in the vocabulary
            one_hot_contexts = [
                json.dumps(words_to_one_hot[ctx].toarray()[0].tolist())
                for context in contexts
                for ctx in context
                if ctx in words_to_one_hot
            ]
            # Store in the new dictionary using the string representation of the one-hot vector as the key
            one_hot_dict[one_hot_word] = one_hot_contexts

    return one_hot_dict

encoded_words_to_context = convert_words_to_one_hot(words_to_context, words_to_one_hot)

As we did for the other dictionaries, let's save this as a JSON file.

In [38]:
file_path = 'data/encoded_words_to_context.json'
with open(file_path, 'w') as file:
    json.dump(encoded_words_to_context, file, indent=4)

### 1.5 Create Usable Dataset 

Now that we have `encoded_words_to_context`, a dictionary containing the words and their corresponding contexts represented as one-hot vectors, we can finally create a usable dataset to train the model. Since CBOW predicts the target word from its context words, the features here will be the context words, and the "label" will be the target word.
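
As a small worked example of how context words become a single feature vector, the sketch below averages the one-hot vectors of the context words. It assumes a hypothetical vocabulary of 4 words referred to by index, and it uses NumPy for brevity rather than the tensors used in the notebook.

```python
import numpy as np

vocab_size = 4  # hypothetical toy vocabulary size
identity = np.eye(vocab_size)

# Hypothetical example: the target is word 2, its context words are 0, 1, 1 and 3
context_indices = [0, 1, 1, 3]
target_index = 2

context_vectors = identity[context_indices]  # one one-hot row per context word
features = context_vectors.mean(axis=0)      # averaged context = model input
label = identity[target_index]               # one-hot target = model label

print(features.tolist())  # [0.25, 0.5, 0.0, 0.25]
print(label.tolist())     # [0.0, 0.0, 1.0, 0.0]
```

Word 1 appears twice in the context, so it receives twice the weight of words 0 and 3 in the averaged input.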

First, we must prepare the data by extracting features and labels from the dictionary appropriately.

In [44]:
inputs = []
targets = []

for target, contexts in encoded_words_to_context.items():
    # Skip targets without any in-vocabulary context words (avoids dividing by zero)
    if not contexts:
        continue
    # Keys and values are JSON strings of one-hot vectors, so decode them first
    context_vectors = torch.tensor([json.loads(context) for context in contexts])
    avg_context = context_vectors.mean(dim=0)  # Average the context vectors
    inputs.append(avg_context)
    targets.append(torch.tensor(json.loads(target)))  # Target vector

inputs = torch.stack(inputs)  # Convert list of tensors to a tensor
targets = torch.stack(targets)

In [None]:
# Split inputs and targets into train and test sets
X_train, X_test, y_train, y_test = train_test_split(inputs.numpy(), targets.numpy(), test_size=0.2, random_state=42)


## 2. Define and Train Model

In [45]:
first_five_keys = list(encoded_words_to_context.keys())[:5]

# Print the keys and their corresponding values
for key in first_five_keys:
    print(f"Key: {key}")
    print("Values:")
    if encoded_words_to_context[key]:  # Check if there are any values
        for value in encoded_words_to_context[key]:
            print(value)
    else:
        print("This key has no values or an empty list.")
    print("\n")  # Print a newline for better separation between entries

Key: [0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0,