# Continuous Bag of Words (Implementation)

In this notebook, I implement the Continuous Bag of Words (CBOW) algorithm, a word2vec algorithm for learning word embeddings. First, we need some actual text data to train the CBOW model on. For this, I will use a short excerpt from "Alice in Wonderland" by Lewis Carroll.

## Motivation

One-hot encoding doesn't carry much meaning: every pair of distinct words looks equally unrelated. Word embeddings allow us to capture relationships between words.

In CBOW, we predict a target word based on context words.
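
To make the contrast concrete, here is a minimal sketch using a toy vocabulary and made-up 2-dimensional embeddings. Both are assumptions for illustration only and are not taken from the corpus or from a trained model.

```python
# Toy vocabulary; the words are illustrative, not taken from the corpus.
vocab = ["cat", "dog", "car"]

def one_hot(word, vocab):
    """Return a list with a 1 at the word's position and 0 elsewhere."""
    return [1 if w == word else 0 for w in vocab]

def dot(u, v):
    """Dot product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))

cat, dog, car = (one_hot(w, vocab) for w in vocab)

# Every pair of distinct one-hot vectors has dot product 0,
# so "cat" looks exactly as unrelated to "dog" as it does to "car".
print(dot(cat, dog), dot(cat, car))

# Hypothetical dense embeddings (made-up numbers): related words can
# point in similar directions, which a dot product can pick up.
emb = {"cat": [0.9, 0.1], "dog": [0.8, 0.2], "car": [0.1, 0.9]}
print(dot(emb["cat"], emb["dog"]) > dot(emb["cat"], emb["car"]))
```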

In [22]:
import string

import numpy as np
import nltk
import torch
from nltk.tokenize import word_tokenize

# Download the Punkt tokenizer models used by word_tokenize
nltk.download('punkt')

[nltk_data] Downloading package punkt to
[nltk_data]     /Users/danieltiourine/nltk_data...
[nltk_data]   Package punkt is already up-to-date!


## Step 1: Prepare data

CBOW works by training a model to predict the target word from its context words, which lets us use the trained parameters as word embeddings. We must therefore first prepare the words as input for the neural network, which means encoding them numerically. This entails a few main steps:

1. Prepare the corpus: we will use an excerpt from "Alice in Wonderland" by Lewis Carroll.
2. Find the context for each word in the vocabulary, resulting in a dictionary mapping from each word to its context words.
3. Encode each word in the vocabulary, resulting in a dictionary mapping from each word to an integer index (a compact stand-in for a one-hot vector).
4. Encode the word-context dictionary using the word-index dictionary, resulting in the training data.

### Step 1.1: Prepare Corpus

In [23]:
corpus_path = 'corpus.txt'
with open(corpus_path, 'r', encoding='utf-8') as f:
    corpus = f.read()

print("Corpus:")
print(corpus)

Corpus:
Alice was beginning to get very tired of sitting by her sister on the bank, and of having nothing to do: once or twice she had peeped into the book her sister was reading, but it had no pictures or conversations in it, 'and what is the use of a book,' thought Alice 'without pictures or conversation?'

So she was considering in her own mind (as well as she could, for the hot day made her feel very sleepy and stupid), whether the pleasure of making a daisy chain would be worth the trouble of getting up and picking the daisies, when suddenly a White Rabbit with pink eyes ran close by her.

There was nothing so very remarkable in that; nor did Alice think it so very much out of the way to hear the Rabbit say to itself, 'Oh dear! Oh dear! I shall be late!' (when she thought it over afterwards, it occurred to her that she ought to have wondered at this, but at the time it all seemed quite natural); but when the Rabbit actually took a watch out of its waistcoat-pocket, and looked at i

In [24]:
# Preprocess words (lowercase, remove punctuation, tokenize)

# Lowercase the text
corpus = corpus.lower()

# Remove punctuation
translator = str.maketrans('', '', string.punctuation)
corpus = corpus.translate(translator)

# Tokenize the text
tokenized_corpus = word_tokenize(corpus)

# Create vocabulary
vocabulary = set(tokenized_corpus)

print("Vocabulary:", vocabulary)
print("Number of unique words:", len(vocabulary))

Vocabulary: {'fortunately', 'burning', 'rabbit', 'of', 'at', 'curiosity', 'whether', 'as', 'when', 'actually', 'could', 'hear', 'for', 'pleasure', 'see', 'on', 'to', 'shall', 'and', 'would', 'watch', 'think', 'by', 'oh', 'eyes', 'just', 'occurred', 'making', 'suddenly', 'own', 'wondered', 'twice', 'started', 'afterwards', 'itself', 'ought', 'with', 'its', 'feel', 'pictures', 'made', 'reading', 'rabbithole', 'took', 'but', 'having', 'is', 'waistcoatpocket', 'this', 'her', 'feet', 'that', 'in', 'out', 'what', 'mind', 'up', 'considering', 'remarkable', 'i', 'without', 'much', 'get', 'white', 'she', 'thought', 'hurried', 'was', 'so', 'use', 'sleepy', 'or', 'trouble', 'well', 'field', 'have', 'pop', 'there', 'book', 'either', 'under', 'worth', 'sister', 'pink', 'down', 'it', 'conversation', 'quite', 'after', 'take', 'across', 'a', 'very', 'hot', 'tired', 'over', 'ever', 'do', 'did', 'then', 'into', 'dear', 'daisies', 'all', 'natural', 'peeped', 'looked', 'no', 'ran', 'picking', 'conversatio
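
Note that deleting punctuation fuses hyphenated words, which is why `waistcoatpocket` and `rabbithole` appear in the vocabulary. The sketch below shows the effect on one phrase from the corpus, together with one possible alternative that maps hyphens to spaces first. The alternative is an assumption for illustration, not the approach used in this notebook.

```python
import string

text = "a watch out of its waistcoat-pocket"

# Deleting all punctuation (the approach used above) fuses hyphenated words.
delete_all = str.maketrans("", "", string.punctuation)
fused = text.translate(delete_all).split()
print(fused)

# Alternative (an assumption, not the notebook's choice):
# turn hyphens into spaces before deleting the remaining punctuation.
hyphen_to_space = str.maketrans("-", " ")
split_words = text.translate(hyphen_to_space).translate(delete_all).split()
print(split_words)
```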

### Step 1.2: Retrieve Word-Context Pairs

Since CBOW predicts the center word from its context words, we first need to define which context words belong to each word. We do this by collecting the words that surround each occurrence of the target word in the token stream, within some window size. Let's choose a window size of 2, meaning up to two words on each side of the target.

The code below creates a dictionary where each key is a word and the corresponding value is the list of its context words.

In [25]:
window_size = 2

def build_word_context_pairs(tokenized_corpus=tokenized_corpus, vocabulary=vocabulary, window_size=window_size):
    """Map each vocabulary word to the context words seen around all of its occurrences."""
    word_context_pairs = {word: [] for word in vocabulary}

    for index, target_word in enumerate(tokenized_corpus):
        if target_word in vocabulary:
            # Clamp the window to the boundaries of the corpus
            start_index = max(index - window_size, 0)
            end_index = min(index + window_size + 1, len(tokenized_corpus))

            # Collect every token in the window except the target itself
            context_words = [tokenized_corpus[i] for i in range(start_index, end_index) if i != index]
            word_context_pairs[target_word].extend(context_words)

    return word_context_pairs

word_context_pairs = build_word_context_pairs()
print("Context pairs for 'alice':", word_context_pairs['alice'])
print("Context pairs for 'natural':", word_context_pairs['natural'])
print("Context pairs for 'conversations':", word_context_pairs['conversations'])

Context pairs for 'alice': ['was', 'beginning', 'book', 'thought', 'without', 'pictures', 'nor', 'did', 'think', 'it', 'hurried', 'on', 'started', 'to']
Context pairs for 'natural': ['seemed', 'quite', 'but', 'when']
Context pairs for 'conversations': ['pictures', 'or', 'in', 'it']
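
The dictionary above merges the contexts from every occurrence of a word into one list. CBOW is usually described as predicting each target from the context window of a single occurrence, so an alternative is to keep one `(context, target)` example per token. The sketch below shows that variant on the first few tokens of the corpus. The function name `build_cbow_examples` is hypothetical and not part of the notebook.

```python
def build_cbow_examples(tokens, window_size=2):
    """Return one (context_words, target_word) example per token position."""
    examples = []
    for index, target in enumerate(tokens):
        # Clamp the window to the boundaries of the token list
        start = max(index - window_size, 0)
        end = min(index + window_size + 1, len(tokens))
        context = [tokens[i] for i in range(start, end) if i != index]
        examples.append((context, target))
    return examples

# First few tokens of the preprocessed corpus
tokens = "alice was beginning to get very tired".split()
examples = build_cbow_examples(tokens)
for context, target in examples[:3]:
    print(context, "->", target)
```

Tokens near the start or end of the corpus simply get a shorter context, which matches the clamping used in the dictionary version.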


### Step 1.3: Encode Words
Now that we have paired each vocabulary word with its context words, let's encode these words so they can be fed into the model. Instead of explicit one-hot vectors, word2vec implementations typically use a more compact representation, because one-hot vectors become expensive to store and multiply for large vocabularies. In this case, we will use index mapping: each word is assigned an integer index.

In [26]:
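def create_index_mapping(vocabulary):
    # Sort so the mapping is reproducible; the iteration order of a set of strings can differ between runs
    word_to_index = {word: i for i, word in enumerate(sorted(vocabulary))}
    index_to_word = {i: word for word, i in word_to_index.items()}
    return word_to_index, index_to_word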

word_to_index, index_to_word = create_index_mapping(vocabulary)
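
Index mapping can stand in for one-hot encoding because multiplying a one-hot vector by an embedding matrix just selects one row of that matrix. The sketch below checks this on a toy vocabulary with a made-up 3 x 2 matrix. The names `toy_vocab`, `toy_index`, and `E` and all values are assumptions for illustration.

```python
toy_vocab = ["alice", "book", "rabbit"]
toy_index = {word: i for i, word in enumerate(toy_vocab)}

# Hypothetical 3 x 2 embedding matrix (one row per word, made-up values)
E = [[0.1, 0.2],
     [0.3, 0.4],
     [0.5, 0.6]]

def one_hot(i, size):
    """One-hot row vector with a 1.0 at position i."""
    return [1.0 if j == i else 0.0 for j in range(size)]

def vec_mat(v, M):
    """Multiply a row vector v by a matrix M (list of rows)."""
    return [sum(v[r] * M[r][c] for r in range(len(M))) for c in range(len(M[0]))]

i = toy_index["rabbit"]
via_one_hot = vec_mat(one_hot(i, len(toy_vocab)), E)  # full multiplication
via_lookup = E[i]                                     # direct row lookup
print(via_one_hot == via_lookup)
```

The lookup does the same job without ever building the sparse vector, which is the efficiency gain that matters for large vocabularies.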

### Step 1.4: Set up training data
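Now that every word has an integer index, we can convert the word-context pairs into index pairs and use them as training data. Each row of `training_data` is a `(target index, context index)` pair.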

In [27]:
def get_training_data(word_context_pairs, word_to_index):
    """Flatten the word-context dictionary into an array of (target index, context index) rows."""
    training_data = []
    for word, contexts in word_context_pairs.items():
        word_idx = word_to_index[word]
        # Look up the index of every context word of this target
        context_indices = [word_to_index[context] for context in contexts]
        for context_idx in context_indices:
            training_data.append((word_idx, context_idx))
    return np.array(training_data, dtype='int32')

training_data = get_training_data(word_context_pairs, word_to_index)
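
A quick way to sanity-check pairs like these is to decode a few rows back into words. The sketch below does this with a toy mapping and hand-written rows that stand in for `training_data`. The `toy_*` names and values are assumptions for illustration.

```python
toy_word_to_index = {"alice": 0, "beginning": 1, "was": 2}
toy_index_to_word = {i: w for w, i in toy_word_to_index.items()}

# Hand-written (target index, context index) rows, standing in for training_data
toy_pairs = [(0, 2), (0, 1), (2, 0)]

# Decode each row back into a (target word, context word) pair
decoded = [(toy_index_to_word[t], toy_index_to_word[c]) for t, c in toy_pairs]
print(decoded)
```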

## Step 2: Define and Train Model

In [28]:
"""
first_five_keys = list(encoded_words_to_context.keys())[:5]

# Print the keys and their corresponding values
for key in first_five_keys:
    print(f"Key: {key}")
    print("Values:")
    if encoded_words_to_context[key]:  # Check if there are any values
        for value in encoded_words_to_context[key]:
            print(value)
    else:
        print("This key has no values or an empty list.")
    print("\n")  # Print a newline for better separation between entries
"""
print("")
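
As a rough illustration of what a CBOW model computes, here is a minimal sketch of the forward pass and a single gradient step, written in plain Python so that it has no dependencies. It is not the model this notebook goes on to define: the sizes, learning rate, weight names (`W_in`, `W_out`), and the example indices are all assumptions, and a real implementation would more likely use a tensor library.

```python
import math
import random

random.seed(0)

vocab_size, embedding_dim = 5, 3  # toy sizes, chosen for illustration

# Input embeddings and output weights, one row per vocabulary word
W_in = [[random.uniform(-0.5, 0.5) for _ in range(embedding_dim)] for _ in range(vocab_size)]
W_out = [[random.uniform(-0.5, 0.5) for _ in range(embedding_dim)] for _ in range(vocab_size)]

def softmax(scores):
    """Numerically stable softmax over a list of scores."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def forward(context_ids):
    """Average the context embeddings, then score every vocabulary word."""
    n = len(context_ids)
    hidden = [sum(W_in[c][d] for c in context_ids) / n for d in range(embedding_dim)]
    scores = [sum(W_out[w][d] * hidden[d] for d in range(embedding_dim)) for w in range(vocab_size)]
    return hidden, softmax(scores)

def train_step(context_ids, target_id, lr=0.1):
    """One SGD step on the cross-entropy loss; returns the loss before the update."""
    hidden, probs = forward(context_ids)
    loss = -math.log(probs[target_id])
    # Gradient of the loss with respect to the scores: probs - one_hot(target)
    err = [p - (1.0 if w == target_id else 0.0) for w, p in enumerate(probs)]
    # Backpropagate to the hidden layer before W_out is modified
    grad_hidden = [sum(err[w] * W_out[w][d] for w in range(vocab_size)) for d in range(embedding_dim)]
    for w in range(vocab_size):
        for d in range(embedding_dim):
            W_out[w][d] -= lr * err[w] * hidden[d]
    # Each context embedding contributes 1/n of the hidden vector
    for c in context_ids:
        for d in range(embedding_dim):
            W_in[c][d] -= lr * grad_hidden[d] / len(context_ids)
    return loss

# Hypothetical example: four context word indices predicting target index 2
context_ids, target_id = [0, 1, 3, 4], 2
losses = [train_step(context_ids, target_id) for _ in range(50)]
print(losses[0], losses[-1])  # compare the loss before and after repeated updates
```

After training on a real corpus, the rows of `W_in` are what would be used as the word embeddings.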


