In this notebook, I implement Continuous Bag of Words (CBOW) from scratch. CBOW is one of the word2vec algorithms for learning word embeddings. To train the model, we first need a text corpus. For this, I will use Text8, a text file containing the first 100 MB of cleaned text from Wikipedia.

In CBOW, we predict a target (center) word from the context words that surround it.
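
To make the prediction step concrete, here is a minimal sketch of a CBOW-style forward pass in plain Python. It is only an illustration under stated assumptions: the tiny vocabulary, `EMBED_DIM`, the randomly initialized weights `W_in` and `W_out`, and the helper `cbow_forward` are all hypothetical names, and no training happens here. The context embeddings are averaged, scored against every vocabulary word, and passed through a softmax.

```python
import math
import random

random.seed(0)

# Hypothetical toy vocabulary and embedding size, for illustration only
vocab = ["the", "quick", "brown", "fox", "jumps"]
word_to_id = {w: i for i, w in enumerate(vocab)}
EMBED_DIM = 3

# Input (context) embeddings and output weights: one row per vocabulary word.
# Randomly initialized here; a real model would learn these during training.
W_in = [[random.uniform(-0.5, 0.5) for _ in range(EMBED_DIM)] for _ in vocab]
W_out = [[random.uniform(-0.5, 0.5) for _ in range(EMBED_DIM)] for _ in vocab]

def cbow_forward(context_words):
    """Return one probability per vocabulary word, given the context words."""
    ids = [word_to_id[w] for w in context_words]
    # Average the context embeddings into a single hidden vector
    hidden = [sum(W_in[i][d] for i in ids) / len(ids) for d in range(EMBED_DIM)]
    # Score every vocabulary word against the hidden vector (dot product)
    scores = [sum(h * w for h, w in zip(hidden, row)) for row in W_out]
    # Softmax, shifted by the max score for numerical stability
    top = max(scores)
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

# Context around the (held-out) center word "brown"
probs = cbow_forward(["the", "quick", "fox", "jumps"])
print(dict(zip(vocab, probs)))
```

With untrained weights the probabilities are close to uniform. Training would adjust `W_in` and `W_out` so that the true center word receives a higher probability.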

In [2]:
import gensim.downloader as api

# Load the text8 dataset
text8_dataset = api.load('text8')

# Convert it into a list of words
text8_words = [word for words in text8_dataset for word in words]

print(text8_words[:100])  # Print the first 100 words to check
print("Number of words:", len(text8_words))

['anarchism', 'originated', 'as', 'a', 'term', 'of', 'abuse', 'first', 'used', 'against', 'early', 'working', 'class', 'radicals', 'including', 'the', 'diggers', 'of', 'the', 'english', 'revolution', 'and', 'the', 'sans', 'culottes', 'of', 'the', 'french', 'revolution', 'whilst', 'the', 'term', 'is', 'still', 'used', 'in', 'a', 'pejorative', 'way', 'to', 'describe', 'any', 'act', 'that', 'used', 'violent', 'means', 'to', 'destroy', 'the', 'organization', 'of', 'society', 'it', 'has', 'also', 'been', 'taken', 'up', 'as', 'a', 'positive', 'label', 'by', 'self', 'defined', 'anarchists', 'the', 'word', 'anarchism', 'is', 'derived', 'from', 'the', 'greek', 'without', 'archons', 'ruler', 'chief', 'king', 'anarchism', 'as', 'a', 'political', 'philosophy', 'is', 'the', 'belief', 'that', 'rulers', 'are', 'unnecessary', 'and', 'should', 'be', 'abolished', 'although', 'there', 'are', 'differing']
Number of words: 17005207


Since CBOW predicts the center word from its context words, we first need to define which context words correspond to each word. We do this by choosing a window size. Let's use a window size of 5, meaning up to 5 words on each side of the center word. The code below builds a dictionary in which each key is a word and each value is a list of context windows, one for every occurrence of that word in the corpus.

In [3]:
words_to_context = {}
WINDOW_SIZE = 5

for index, word in enumerate(text8_words):
    # Calculate start and end indices for context
    start_index = max(0, index - WINDOW_SIZE)
    end_index = min(len(text8_words), index + WINDOW_SIZE + 1)

    # Get context words excluding the target word itself
    word_context = text8_words[start_index:index] + text8_words[index + 1:end_index]

    # Append this occurrence's context window to the word's list,
    # creating the list the first time the word is seen
    words_to_context.setdefault(word, []).append(word_context)
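
To check what this loop produces, it can help to run the same windowing logic on a short sentence. The sketch below uses hypothetical stand-ins (`toy_words`, `TOY_WINDOW`, `toy_context`) rather than the real Text8 data, with a window of 2 so the output stays readable.

```python
# Hypothetical toy stand-ins for text8_words and WINDOW_SIZE
toy_words = "the quick brown fox jumps over the lazy dog".split()
TOY_WINDOW = 2

toy_context = {}
for index, word in enumerate(toy_words):
    # Clamp the window to the bounds of the word list
    start = max(0, index - TOY_WINDOW)
    end = min(len(toy_words), index + TOY_WINDOW + 1)

    # Context words on both sides, excluding the center word itself
    window = toy_words[start:index] + toy_words[index + 1:end]
    toy_context.setdefault(word, []).append(window)

print(toy_context["fox"])
# [['quick', 'brown', 'jumps', 'over']]
print(toy_context["the"])
# [['quick', 'brown'], ['jumps', 'over', 'lazy', 'dog']]
```

Note that "the" appears twice, so it maps to two context windows, and that the first window is shorter because it is cut off at the start of the text.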

