In this notebook, I implement Continuous Bag of Words (CBOW) from scratch. CBOW is one of the word2vec algorithms for learning word embeddings. To train the model we first need a text corpus. For this, I will use Text8, a text file containing the first 100 MB of cleaned text from Wikipedia.

## Motivation

One-hot encoding doesn't carry much meaning, since every word's vector is equally far from every other word's vector. Word embeddings allow us to capture relationships between words.
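
To make this limitation concrete, here is a minimal sketch showing that any two distinct one-hot vectors are orthogonal, so their cosine similarity is 0 no matter how related the words are. The three-word vocabulary and the helper `cosine_similarity` are hypothetical, chosen only for illustration.

```python
import numpy as np

# Hypothetical toy vocabulary, for illustration only
toy_vocab = ["king", "queen", "banana"]
toy_one_hot = np.eye(len(toy_vocab))  # row i is the one-hot vector of toy_vocab[i]

def cosine_similarity(a, b):
    """Cosine similarity between two 1-D vectors."""
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

sim_king_queen = cosine_similarity(toy_one_hot[0], toy_one_hot[1])
sim_king_banana = cosine_similarity(toy_one_hot[0], toy_one_hot[2])
print(sim_king_queen, sim_king_banana)  # both are 0.0
```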

In CBOW, we predict a target word based on context words.

## 1. Prepare data

CBOW works by training a model to predict the target word from its context words, which lets us use the trained parameters as word embeddings. Before the words can be fed into the neural network, we need to one-hot encode them.
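
As a rough illustration of why the trained parameters double as embeddings, here is a minimal NumPy sketch of a CBOW forward pass. It assumes one common formulation (average the context embeddings, project to vocabulary scores, apply softmax). The toy sizes, the random weights, and the names `W_in`, `W_out`, and `cbow_forward` are hypothetical and are not taken from the rest of this notebook.

```python
import numpy as np

rng = np.random.default_rng(0)

vocab_size, embedding_dim = 6, 3                       # hypothetical toy sizes
W_in = rng.normal(size=(vocab_size, embedding_dim))    # input weights (one row per word)
W_out = rng.normal(size=(embedding_dim, vocab_size))   # output weights

def softmax(z):
    """Numerically stable softmax over a 1-D array."""
    z = z - z.max()
    e = np.exp(z)
    return e / e.sum()

def cbow_forward(context_ids, W_in, W_out):
    """Predict a distribution over the vocabulary from context word indices."""
    one_hots = np.eye(W_in.shape[0])[context_ids]      # (n_context, vocab_size)
    hidden = (one_hots @ W_in).mean(axis=0)            # average of context embeddings
    return softmax(hidden @ W_out)                     # (vocab_size,)

probs = cbow_forward([0, 2, 3, 5], W_in, W_out)
print(probs.shape, probs.sum())
```

Multiplying a one-hot vector by `W_in` simply selects one row of `W_in`, which is why each row can be read as that word's embedding once training is done.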

### 1.1 Prepare a vocabulary list and its corresponding one-hot encodings

In [3]:
from sklearn.preprocessing import OneHotEncoder
import numpy as np
import gensim.downloader as api
from collections import Counter
import json

In [4]:
# Load the text8 dataset
text8_dataset = api.load('text8')

# Convert it into a list of words
text8_words = [word for words in text8_dataset for word in words]

print("Number of words:", len(text8_words))

Number of words: 17005207


Since the Text8 dataset is extremely large, we will restrict the vocabulary to the 10,000 most frequent words and drop every other word from the text. This is more practical: the dataset is still large enough to be a good learning exercise, but small enough that the computer should not run into memory issues.

In [5]:
word_counts = Counter(text8_words)

def limit_vocabulary(word_counts, max_vocab_size=10000):
    """
    Keep only the most frequent words.

    Args:
        word_counts (Counter): A Counter object from the collections module, containing the frequency of each word in the dataset.
        max_vocab_size (int, optional): The maximum number of words to include in the reduced vocabulary. Defaults to 10,000.

    Returns:
        set: A set of strings, each representing a word. This set contains only the most frequent words as specified by 'max_vocab_size'.
    """
    limited_vocab = {word for word, count in word_counts.most_common(max_vocab_size)}
    return limited_vocab

limited_vocab = limit_vocabulary(word_counts)

limited_text8_words = [word for word in text8_words if word in limited_vocab]
print("Number of words:", len(set(limited_text8_words)))

Number of words: 10000
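
The same filtering step can be traced by hand on a tiny input. This is a sketch on a made-up sentence (`toy_tokens` and the cutoff of 2 are assumptions for illustration), not part of the Text8 pipeline.

```python
from collections import Counter

# Hypothetical toy corpus, for illustration only
toy_tokens = "the cat sat on the mat the cat ran".split()
toy_counts = Counter(toy_tokens)

# Keep only the 2 most frequent words ("the" occurs 3 times, "cat" twice)
toy_vocab = {word for word, _ in toy_counts.most_common(2)}
toy_filtered = [word for word in toy_tokens if word in toy_vocab]

print(toy_filtered)  # ['the', 'cat', 'the', 'the', 'cat']
```

Note that dropping out-of-vocabulary words changes which words are neighbours: the two consecutive occurrences of `the` in `toy_filtered` were separated by `mat` in the original sentence. The same effect applies to the context windows built from `limited_text8_words`.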


Let's one-hot encode each of those 10,000 words so that each word corresponds to a vector with a single 1 and zeros everywhere else.

In [6]:
# Initialize the OneHotEncoder
encoder = OneHotEncoder(sparse_output=False)  # Using dense array for easier handling

# Convert limited_vocab to a NumPy array with the correct shape
vocab_array = np.array(list(limited_vocab)).reshape(-1, 1)

# Fit the encoder
encoder.fit(vocab_array)
one_hot_dict = {word: encoder.transform([[word]])[0] for word in limited_vocab}
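
To show what the resulting mapping looks like, here is a NumPy-only sketch that builds one-hot vectors from a sorted vocabulary. It is an illustration of the idea rather than a replacement for `OneHotEncoder`. The names `toy_vocab`, `word_to_index`, and `one_hot` are hypothetical, and the sorted column order is my assumption about how to keep the mapping reproducible.

```python
import numpy as np

toy_vocab = {"the", "cat", "sat"}          # hypothetical toy vocabulary
sorted_vocab = sorted(toy_vocab)           # fixed, reproducible column order
word_to_index = {word: i for i, word in enumerate(sorted_vocab)}

def one_hot(word, word_to_index):
    """Return the dense one-hot vector for `word`."""
    vector = np.zeros(len(word_to_index))
    vector[word_to_index[word]] = 1.0
    return vector

cat_vector = one_hot("cat", word_to_index)
print(sorted_vocab)   # ['cat', 'sat', 'the']
print(cat_vector)     # [1. 0. 0.]
```

Keeping only the word-to-index mapping is much lighter than storing every vector: 10,000 dense vectors of length 10,000 in float64 take about 800 MB.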

### 1.2 Retrieve Word Contexts
Since in CBOW we are trying to predict the center word from its context words, we first need to define which context words correspond to each word. We do this by choosing a window size, i.e. the number of words taken on each side of the center word. Let's use a window size of 5. The code below creates a dictionary where each key is a word and its value is a list of context windows, one for every occurrence of that word in the text.

In [9]:
WINDOW_SIZE = 5
    
def build_context_dictionary(words, window_size=WINDOW_SIZE):
    """
    Collect the context window around every occurrence of every word.

    Args:
        words (list of str): The tokenized text, in its original order.
        window_size (int): The number of words to consider on each side of the target word.

    Returns:
        dict: Dictionary where keys are words and values are lists of context windows
            (each a list of str), one window per occurrence of the word.
    """
    words_to_context = {} 
    for index, word in enumerate(words):
        start_index = max(0, index - window_size)
        end_index = min(len(words), index + window_size + 1)       
        word_context = words[start_index:index] + words[index + 1:end_index]
        
        if word in words_to_context:
            words_to_context[word].append(word_context)
        else:
            words_to_context[word] = [word_context]
            
    return words_to_context

context_dict = build_context_dictionary(limited_text8_words)

# Save context_dict to JSON so it can be reloaded more easily later

#file_path = 'data/context_dict.json'
#with open(file_path, 'w') as file:
    #json.dump(context_dict, file, indent=4)
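
To see what the windows look like, here is a small self-contained sketch that applies the same slicing logic to a six-word sentence with a window size of 2. The function `toy_build_context_dictionary` and the sentence are hypothetical stand-ins, written compactly with `defaultdict`.

```python
from collections import defaultdict

def toy_build_context_dictionary(words, window_size):
    """Map each word to the list of context windows around its occurrences."""
    words_to_context = defaultdict(list)
    for index, word in enumerate(words):
        start_index = max(0, index - window_size)
        end_index = min(len(words), index + window_size + 1)
        # Words before the target plus words after it, excluding the target itself
        window = words[start_index:index] + words[index + 1:end_index]
        words_to_context[word].append(window)
    return dict(words_to_context)

toy_tokens = ["the", "cat", "sat", "on", "the", "mat"]
toy_context = toy_build_context_dictionary(toy_tokens, window_size=2)

print(toy_context["sat"])  # [['the', 'cat', 'on', 'the']]
print(toy_context["the"])  # [['cat', 'sat'], ['sat', 'on', 'mat']]
```

Windows near the start or end of the text are shorter, and a word that occurs several times (here `the`) collects one window per occurrence.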

Now that we have the context-target pairs ready, we need to encode these words as one-hot vectors, so that the model can be trained to predict the target word from its context words and its learned parameters can be used as word embeddings. Before we can do this, however, we need to flatten all words and contexts into a single list so that every word is included in the vocabulary of the `OneHotEncoder`.

In [5]:
def flatten_words_to_context(words_to_context):
    """
    Flatten the context dictionary into a single list of words.

    Args:
        words_to_context (dict): Dictionary where keys are words and values are lists of context windows.

    Returns:
        list of str: Every key followed by all of its context words (duplicates included).
    """
    all_words = []
    for word, contexts in words_to_context.items():
        all_words.append(word)
        for context in contexts:
            all_words.extend(context)
    return all_words

all_words = flatten_words_to_context(context_dict)

Now that we have flattened the list, let's one-hot encode the words.

In [6]:
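def one_hot_encode_words(all_words):
    """Fit a OneHotEncoder on the distinct words in all_words and return it."""
    # `sparse` was renamed to `sparse_output` in scikit-learn 1.2
    one_hot_encoder = OneHotEncoder(sparse_output=False)
    # The encoder only needs each distinct word once, so deduplicate before fitting
    unique_words_array = np.array(sorted(set(all_words))).reshape(-1, 1)
    one_hot_encoder.fit(unique_words_array)
    return one_hot_encoder

def get_one_hot_vector(one_hot_encoder, word):
    """Return the one-hot vector (1-D array) for a single word."""
    return one_hot_encoder.transform([[word]])[0]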

one_hot_encoder = one_hot_encode_words(all_words)



Finally, let's create the one-hot encoded dictionary.

In [None]:
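def create_one_hot_dict(words_to_context, one_hot_encoder):
    """Map each target word's one-hot vector (as a tuple) to the one-hot vectors of its context words."""
    one_hot_dict = {}
    for word, contexts in words_to_context.items():
        one_hot_word = get_one_hot_vector(one_hot_encoder, word)
        # All windows of this word are flattened into one list of context vectors
        one_hot_contexts = [get_one_hot_vector(one_hot_encoder, ctx_word) for context in contexts for ctx_word in context]
        # NumPy arrays are not hashable, so the target vector is converted to a tuple
        one_hot_dict[tuple(one_hot_word)] = one_hot_contexts
    return one_hot_dict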

one_hot_encoded_context_dict = create_one_hot_dict(context_dict, one_hot_encoder)
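
Because a one-hot vector is fully determined by the position of its single 1, the same information can be stored as integer indices and expanded to one-hot form only when needed. The sketch below is an alternative I am suggesting, not the approach used above. `build_training_pairs`, `word_to_index`, and the toy sentence are hypothetical names and data.

```python
import numpy as np

def build_training_pairs(tokens, word_to_index, window_size):
    """Return one (context_ids, target_id) pair per token position."""
    ids = [word_to_index[token] for token in tokens]
    pairs = []
    for position, target_id in enumerate(ids):
        start = max(0, position - window_size)
        # Slicing past the end of a list is safe in Python, so no upper clamp is needed
        context_ids = ids[start:position] + ids[position + 1:position + window_size + 1]
        pairs.append((context_ids, target_id))
    return pairs

toy_tokens = ["the", "cat", "sat", "on", "the", "mat"]
word_to_index = {word: i for i, word in enumerate(sorted(set(toy_tokens)))}
pairs = build_training_pairs(toy_tokens, word_to_index, window_size=1)

# Expand a single pair to one-hot form only when it is actually needed
context_ids, target_id = pairs[2]                          # target word is "sat"
context_one_hot = np.eye(len(word_to_index))[context_ids]  # shape (2, 5)
print(context_ids, target_id, context_one_hot.shape)
```

Unlike the dictionary above, this keeps each window separate and keeps one pair per occurrence, at the cost of a few integers per pair instead of full-length vectors.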

In [None]:
 """
def one_hot_encode_context_dict(words_to_context):
    
    
    
    one_hot_encoder = OneHotEncoder(sparse=False)  # Use sparse=False to handle the data more easily

    # Collect all unique words from keys and context lists
    unique_words = set()
    for word, contexts in words_to_context.items():
        unique_words.add(word)
        unique_words.update(contexts)
    
    # Fit the OneHotEncoder on all unique words
    unique_words = np.array(list(unique_words)).reshape(-1, 1)
    one_hot_encoder.fit(unique_words)
    
    # Create one-hot dictionary for all words
    one_hot_words = {word: one_hot_encoder.transform([[word]])[0] 
                     for word in unique_words.flatten()}
    
    # Map each word and its contexts to their one-hot vectors
    one_hot_words_to_context = {
        one_hot_words[word]: [one_hot_words[ctx] for ctx in contexts if ctx in one_hot_words]
        for word, contexts in words_to_context.items()
    }

    return one_hot_words_to_context

one_hot_words_to_context = one_hot_encode_context_dict(words_to_context)
"""

In [None]:
"""
# Convert words in words_to_context dictionary to one-hot
word_to_one_hot = {word[0]: one_hot_encoder.transform([[word[0]]])
                   for word in unique_text8_words}
one_hot_words_to_contexts = {}
for word, contexts in words_to_context.items():
    # Transform each context word to one-hot using the map
    one_hot_contexts = [word_to_one_hot[context_word] for context_word in contexts if context_word in word_to_one_hot]
    # Assign the one-hot encoded context words to the one-hot encoded target word
    one_hot_words_to_contexts[word_to_one_hot[word]] = one_hot_contexts
    """