
<div style="color:#ffffff;
          font-size:50px;
          font-style:italic;
          text-align:left;
          font-family: 'Lucida Bright';
          background:#4686C8;">
  	&nbsp; Word2vec CBOW from scratch
</div>
<br>   
<div style="
          font-size:20px;
          text-align:left;
          font-family: 'Palatino';
          ">
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;Project: Embedding with word2vec & CBOW using PyTorch and Python<br>
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;Author: George Barrinuevo<br>
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;Date: 07/06/2025<br>
</div>

<br><div style="color:#ffffff;
          font-size:30px;
          font-style:italic;
          text-align:left;
          font-family: 'Lucida Bright';
          background:#4686C8;">
  	      &nbsp; Project Notes
</div>
<div style="
          font-size:16px;
          text-align:left;
          font-family: 'Cambria';">
    
<b>My Thoughts</b>
- This notebook implements word2vec embeddings with CBOW (Continuous Bag of Words) from scratch. It is intended for educational purposes.
- The script uses PyTorch and Python.

<b>Technical Details</b>
- word2vec is an embedding method that learns a dense vector for each word from the words that surround it in a text corpus. It has two variants: CBOW and Skip-Gram. This script implements the CBOW variant.
- CBOW splits the text into targets and contexts. The target is a single word. With a window size of 2, the context is the 2 words immediately before the target and the 2 words immediately after it. The context is the model input, and the target is the ground truth to be predicted.
- The accuracy of this model depends on the size of the text corpus, the number of training epochs, the available CPU/GPU power, and similar factors. In general, a larger corpus and more epochs improve accuracy, and more CPU/GPU power makes both practical.
- A vocabulary is created that maps each word to an index number, also known as the Token ID. The word_to_ix dictionary maps a word string to its Token ID, and ix_to_word maps a Token ID back to its word string.
- The EMDEDDING_DIM value is the length of each word's embedding vector and can be any positive integer. Its dimensions are loosely similar to 'features': for example, King and Queen share the feature 'royalty'. A smaller EMDEDDING_DIM value captures fewer features and less complexity but requires less CPU/GPU power. A larger value captures more features and complexity but needs more processing power. If you need more accuracy, try increasing this value.
- The WINDOW_SIZE is the number of context words taken on each side of the target word. If this value is 4, the context consists of the 4 words before and the 4 words after the target word.
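
The vocabulary and window notes above can be sketched on a toy sentence. This is a minimal illustration, not the notebook's code: the sentence and every `toy_` name are made up for the example, and a window size of 2 is assumed.

```python
# Toy illustration (hypothetical sentence and names, not the notebook's data).
toy_text = "the quick brown fox jumps over the lazy dog".split()
toy_window = 2  # assumed window size for this example

# Vocabulary: each unique word gets an integer Token ID.
toy_vocab = sorted(set(toy_text))
toy_word_to_ix = {word: ix for ix, word in enumerate(toy_vocab)}
toy_ix_to_word = {ix: word for word, ix in toy_word_to_ix.items()}

# CBOW pairs: (context words on both sides of the target, target word).
toy_pairs = []
for i in range(toy_window, len(toy_text) - toy_window):
    context = toy_text[i - toy_window:i] + toy_text[i + 1:i + toy_window + 1]
    toy_pairs.append((context, toy_text[i]))

first_context, first_target = toy_pairs[0]
print(first_context, "->", first_target)
# ['the', 'quick', 'fox', 'jumps'] -> brown
print([toy_word_to_ix[w] for w in first_context], "->", toy_word_to_ix[first_target])
# [7, 6, 2, 3] -> 0
```

Only positions with a full window on both sides become targets, so the 9-word sentence yields 5 pairs.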


In [1]:
!pip install torch

# After installing this package, restart the kernel and re-run this notebook.

In [2]:
import torch
import torch.nn as nn

WINDOW_SIZE = 4
EMDEDDING_DIM = 100

raw_text = '''Among the vicissitudes incident to life, no event could have filled me with greater anxieties than that of which the notification
was transmitted by your order, and received on the 14th day of the present month. On the one hand, I was summoned by my Country, whose voice I
can never hear but with veneration and love, from a retreat which I had chosen with the fondest predilection, and, in my flattering hopes, with
an immutable decision, as the asylum of my declining years — a retreat which was rendered every day more necessary as well as more dear to me by
the addition of habit to inclination, and of frequent interruptions in my health to the gradual waste committed on it by time. On the other hand,
the magnitude and difficulty of the trust to which the voice of my country called me, being sufficient to awaken in the wisest and most experienced
of her citizens a distrustful scrutiny into his qualifications, could not but overwhelm with despondence one who (inheriting inferior endowments
from nature and unpracticed in the duties of civil administration) ought to be peculiarly conscious of his own deficiencies. In this conflict of
emotions, all I dare aver is that it has been my faithful study to collect my duty from a just appreciation of every circumstance by which it might
be affected. All I dare hope is that if, in executing this task, I have been too much swayed by a grateful remembrance of former instances or by an
affectionate sensibility to this transcendent proof of the confidence of my fellow citizens, and have thence too little consulted my incapacity as
well as disinclination for the weighty and untried cares before me, my error will be palliated by the motives which mislead me, and its consequences
be judged by my country with some share of the partiality in which they originated. Such being the impressions under which I have, in obedience to
the public summons, repaired to the present station, it would be peculiarly improper to omit in this first official act my fervent supplications to
that Almighty Being who rules over the universe, who presides in the councils of nations, and whose providential aids can supply every human defect,
that His benediction may consecrate to the liberties and happiness of the people of the United States a Government instituted by themselves for these
essential purposes, and may enable every instrument employed in its administration to execute with success the functions allotted to his charge. In
tendering this homage to the Great Author of every public and private good, I assure myself that it expresses your sentiments not less than my own,
nor those of my fellow-citizens at large less than either. No people can be bound to acknowledge and adore the Invisible Hand which conducts the
affairs of men more than those of the United States. Every step by which they have advanced to the character of an independent nation seems to have
been distinguished by some token of providential agency. And in the important revolution just accomplished in the system of their united government,
the tranquil deliberations and voluntary consent of so many distinct communities from which the event has resulted can not be compared with the means
by which most governments have been established without some return of pious gratitude, along with an humble anticipation of the future blessings
which the past seem to presage. These reflections, arising out of the present crisis, have forced themselves too strongly on my mind to be suppressed.
You will join with me, I trust, in thinking that there are none under the influence of which the proceedings of a new and free government can more
auspiciously commence. By the article establishing the executive department, it is made the duty of the President "to recommend to your consideration
such measures as he shall judge necessary and expedient." The circumstances under which I now meet you will acquit me from entering into that subject
further than to refer to the great constitutional charter under which you are assembled, and which, in defining your powers, designates the objects to
which your attention is to be given. It will be more consistent with those circumstances, and far more congenial with the feelings which actuate me,
to substitute in place of a recommendation of particular measures the tribute that is due to the talents, the rectitude, and the patriotism which adorn
the characters selected to devise and adopt them. In these honorable qualifications, I behold the surest pledges that as on one side no local prejudices
or attachments, no separate views, nor party animosities will misdirect the comprehensive and equal eye which ought to watch over this great assemblage
of communities and interests; so, on another, that the foundations of our national policy will be laid in the pure and immutable principles of private
morality; and the preeminence of a free government be exemplified by all the attributes which can win the affections of its citizens and command the
respect of the world.'''.split()

vocab = sorted(set(raw_text))
vocab_size = len(vocab)

In [3]:
class model_CBOW(torch.nn.Module):
    def __init__(self, vocab_size, embedding_dim):
        super().__init__()

        # Lookup table: one trainable vector of length embedding_dim per Token ID.
        self.embeddings = nn.Embedding(vocab_size, embedding_dim)
        self.linear1 = nn.Linear(embedding_dim, 128)
        self.activation_function1 = nn.ReLU()

        # Project to one score per vocabulary word, returned as log-probabilities.
        self.linear2 = nn.Linear(128, vocab_size)
        self.activation_function2 = nn.LogSoftmax(dim = -1)

    def forward(self, inputs):
        # Sum the context word embeddings into a single (1, embedding_dim) vector.
        embeds = self.embeddings(inputs).sum(dim=0).view(1, -1)
        out = self.linear1(embeds)
        out = self.activation_function1(out)
        out = self.linear2(out)
        out = self.activation_function2(out)
        return out

def create_token_func():
    global word_to_ix
    global ix_to_word

    word_to_ix = {word:ix for ix, word in enumerate(vocab)}
    ix_to_word = {ix:word for ix, word in enumerate(vocab)}

def words_to_token(words, return_as_tensor=True):
    token_list = []

    for one_word in words:
        token_list.append(word_to_ix[one_word])

    if return_as_tensor:
        return torch.tensor(token_list, dtype=torch.long)
    else:
        return token_list

def CBOW():
    global input_target_words

    input_target_words = []
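    # Every position with WINDOW_SIZE words on both sides is used as a target.
    for i in range(WINDOW_SIZE, len(raw_text) - WINDOW_SIZE):
        # Context = WINDOW_SIZE words before and after position i. Slicing by
        # position (rather than removing the target by value) stays correct
        # when the target word also occurs elsewhere in the window.
        context = raw_text[i-WINDOW_SIZE:i] + raw_text[i+1:i+WINDOW_SIZE+1]
        target = [raw_text[i]]
        input_target_words.append((context, target))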
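
To make the forward pass of model_CBOW easier to follow, the sketch below traces tensor shapes through the same sequence of layers. The sizes (a vocabulary of 10 words, an embedding dimension of 6) and the Token IDs are arbitrary assumptions chosen for illustration, not values from this notebook.

```python
import torch
import torch.nn as nn

toy_vocab_size, toy_dim = 10, 6            # assumed sizes, for illustration only
embeddings = nn.Embedding(toy_vocab_size, toy_dim)
linear1 = nn.Linear(toy_dim, 128)
linear2 = nn.Linear(128, toy_vocab_size)

context_ids = torch.tensor([1, 4, 7, 9], dtype=torch.long)  # four made-up Token IDs

looked_up = embeddings(context_ids)        # (4, 6): one row per context word
summed = looked_up.sum(dim=0).view(1, -1)  # (1, 6): context collapsed into one vector
hidden = torch.relu(linear1(summed))       # (1, 128)
log_probs = torch.log_softmax(linear2(hidden), dim=-1)  # (1, 10)

print(looked_up.shape, summed.shape, hidden.shape, log_probs.shape)
# exp(log_probs) is a probability distribution over the vocabulary, so it sums to ~1.
print(log_probs.exp().sum().item())
```

Because the context embeddings are summed, the order of the context words does not affect the output, which is where the "bag of words" in CBOW comes from.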

In [4]:
def training():
    # NLLLoss expects log-probabilities, which the model's LogSoftmax layer provides.
    loss_function = nn.NLLLoss()
    optimizer = torch.optim.SGD(model.parameters(), lr=0.001)
    epoch_size = 100

    for epoch in range(epoch_size):
        total_loss = 0

        # Accumulate the loss over every (context, target) pair.
        for context, target in input_target_words:
            context_vector = words_to_token(context)
            log_probs = model(context_vector)      # shape: (1, vocab_size)
            target_id = words_to_token(target)     # shape: (1,)
            total_loss += loss_function(log_probs, target_id)

        # One full-batch gradient step per epoch on the summed loss.
        optimizer.zero_grad()
        total_loss.backward()
        optimizer.step()

def testing():
    select_context_idx = 41
    context, target = input_target_words[select_context_idx]
    print(f'context truth: {context}')
    print(f'target truth: {target}')
    print(f'-------')

    context_vector = words_to_token(context)
    y_pred = model(context_vector)

    print(f'Prediction: {ix_to_word[torch.argmax(y_pred[0]).item()]}')
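
The training() function above pairs a LogSoftmax output with NLLLoss. For a single example this combination is the cross-entropy loss, which can be checked by hand. The sketch below uses plain Python and made-up scores for a three-word vocabulary; none of the numbers come from the notebook.

```python
import math

toy_logits = [2.0, 1.0, 0.1]   # made-up scores for a 3-word vocabulary
target_id = 0                  # assumed index of the correct word

# log-softmax: x_i - log(sum_j exp(x_j))
log_sum_exp = math.log(sum(math.exp(x) for x in toy_logits))
log_probs = [x - log_sum_exp for x in toy_logits]

# Negative log-likelihood of the correct word.
nll = -log_probs[target_id]

print([round(math.exp(lp), 3) for lp in log_probs])  # probabilities per word
print(round(nll, 3))                                 # loss for this example
```

The loss shrinks toward 0 as the probability assigned to the correct word approaches 1, which is what the optimizer pushes toward during training.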

In [5]:
create_token_func()
CBOW()
model = model_CBOW(vocab_size, EMDEDDING_DIM)
training()
testing()

context truth: ['was', 'summoned', 'by', 'my', 'whose', 'voice', 'I', 'can']
target truth: ['Country,']
-------
Prediction: Country,
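
Beyond predicting a target word, the learned vectors themselves can be inspected, for example by comparing two words with cosine similarity. The sketch below is an assumption about how that might look, not part of the original notebook: `embedding_similarity` is a hypothetical helper, and `toy_embeddings` is a randomly initialised stand-in for a trained embedding table.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

# Stand-in for a trained model's embedding table (randomly initialised here).
toy_embeddings = nn.Embedding(5, 8)   # assumed: 5 words, 8 dimensions

def embedding_similarity(embedding_layer, ix_a, ix_b):
    """Cosine similarity between the vectors of two Token IDs (hypothetical helper)."""
    vec_a = embedding_layer.weight[ix_a].detach()
    vec_b = embedding_layer.weight[ix_b].detach()
    return F.cosine_similarity(vec_a, vec_b, dim=0).item()

same = embedding_similarity(toy_embeddings, 0, 0)       # a vector compared with itself
different = embedding_similarity(toy_embeddings, 0, 1)  # two different random vectors
print(same, different)
```

With the notebook's objects, the same helper could be given `model.embeddings` and two Token IDs looked up through word_to_ix. On a corpus this small, the resulting similarities are better treated as a demonstration than as meaningful semantics.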
