## Step 1: Preprocess Text

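We preprocess the text by converting it to lowercase, removing punctuation, and tokenizing it into lists of words. Because the function strips punctuation (periods included) before it splits on `.`, the whole review comes back as a single sentence, and the `<br />` tags in the raw data leave stray tokens such as `br` and `mebr`.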


In [None]:
import numpy as np
import pandas as pd
from collections import defaultdict
import re

# Example text
#text = "Do all the good you can, for all the people you can, in all the ways you can, as long as you can."
path = "D:\\Y5 AMS\\Information-WR\\TP-04\\IMDB Dataset.csv"
text = pd.read_csv(path)["review"][0]
print("Data:", text)

Data: One of the other reviewers has mentioned that after watching just 1 Oz episode you'll be hooked. They are right, as this is exactly what happened with me.<br /><br />The first thing that struck me about Oz was its brutality and unflinching scenes of violence, which set in right from the word GO. Trust me, this is not a show for the faint hearted or timid. This show pulls no punches with regards to drugs, sex or violence. Its is hardcore, in the classic use of the word.<br /><br />It is called OZ as that is the nickname given to the Oswald Maximum Security State Penitentary. It focuses mainly on Emerald City, an experimental section of the prison where all the cells have glass fronts and face inwards, so privacy is not high on the agenda. Em City is home to many..Aryans, Muslims, gangstas, Latinos, Christians, Italians, Irish and more....so scuffles, death stares, dodgy dealings and shady agreements are never far away.<br /><br />I would say the main appeal of the show is due to t

In [10]:
# Step 1: Split Sentences
def preprocess_text(text):
    text = text.lower()
    # Keep only letters, digits, and whitespace (periods are removed too)
    text = re.sub(r'[^a-zA-Z0-9\s]', '', text)
    sentences = text.split(".")
    # Tokenize each non-empty sentence on whitespace
    sentences = [sentence.split() for sentence in sentences if sentence]
    # Strip whitespace from each word (a no-op after split(), kept as a safeguard)
    sentences = [[word.strip() for word in sentence] for sentence in sentences]
    return sentences

sentences = preprocess_text(text)
print("Sentences:", sentences)

Sentences: [['one', 'of', 'the', 'other', 'reviewers', 'has', 'mentioned', 'that', 'after', 'watching', 'just', '1', 'oz', 'episode', 'youll', 'be', 'hooked', 'they', 'are', 'right', 'as', 'this', 'is', 'exactly', 'what', 'happened', 'with', 'mebr', 'br', 'the', 'first', 'thing', 'that', 'struck', 'me', 'about', 'oz', 'was', 'its', 'brutality', 'and', 'unflinching', 'scenes', 'of', 'violence', 'which', 'set', 'in', 'right', 'from', 'the', 'word', 'go', 'trust', 'me', 'this', 'is', 'not', 'a', 'show', 'for', 'the', 'faint', 'hearted', 'or', 'timid', 'this', 'show', 'pulls', 'no', 'punches', 'with', 'regards', 'to', 'drugs', 'sex', 'or', 'violence', 'its', 'is', 'hardcore', 'in', 'the', 'classic', 'use', 'of', 'the', 'wordbr', 'br', 'it', 'is', 'called', 'oz', 'as', 'that', 'is', 'the', 'nickname', 'given', 'to', 'the', 'oswald', 'maximum', 'security', 'state', 'penitentary', 'it', 'focuses', 'mainly', 'on', 'emerald', 'city', 'an', 'experimental', 'section', 'of', 'the', 'prison', 'wher
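
If one token list per sentence is wanted, a minimal sketch of one alternative is to remove the `<br />` tags and split on sentence-ending punctuation before stripping the remaining symbols. The name `preprocess_text_v2` and the sample string are assumptions made for illustration; they are not part of the original notebook.

```python
import re

def preprocess_text_v2(text):
    """Hypothetical variant: split into sentences before stripping punctuation."""
    text = text.lower()
    # Replace HTML line breaks such as <br /> with a space
    text = re.sub(r'<br\s*/?>', ' ', text)
    sentences = []
    # Split on runs of sentence-ending punctuation
    for raw in re.split(r'[.!?]+', text):
        # Drop any remaining non-alphanumeric characters, then tokenize
        tokens = re.sub(r'[^a-z0-9\s]', '', raw).split()
        if tokens:
            sentences.append(tokens)
    return sentences

sample = "They are right, as this happened with me.<br /><br />The first thing was its brutality."
print(preprocess_text_v2(sample))
```

Using this variant would change the vocabulary and every later output, so the results recorded in this notebook correspond to the original `preprocess_text`.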

## Step 2: Build Vocabulary

We will create a vocabulary of unique words and map each word to an index.


In [11]:
# Step 2: Make Vocabulary
def build_vocabulary(sentences):
    vocabulary = set()
    for sentence in sentences:
        vocabulary.update(sentence)
    word_to_index = {word: idx for idx, word in enumerate(vocabulary)}
    index_to_word = {idx: word for word, idx in word_to_index.items()}
    return word_to_index, index_to_word


word_to_index, index_to_word = build_vocabulary(sentences)
print("Word to Index:", word_to_index)

Word to Index: {'1': 0, 'maximum': 1, 'latinos': 2, 'is': 3, 'fact': 4, 'their': 5, 'are': 6, 'experience': 7, 'goes': 8, 'doesnt': 9, 'or': 10, 'regards': 11, 'scuffles': 12, 'wholl': 13, 'drugs': 14, 'if': 15, 'about': 16, 'romanceoz': 17, 'section': 18, 'has': 19, 'lack': 20, 'main': 21, 'you': 22, 'which': 23, 'developed': 24, 'out': 25, 'nickel': 26, 'got': 27, 'away': 28, 'gangstas': 29, 'appeal': 30, 'on': 31, 'never': 32, 'surreal': 33, 'in': 34, 'other': 35, 'agenda': 36, 'mannered': 37, 'dealings': 38, 'was': 39, 'shady': 40, 'mebr': 41, 'watched': 42, 'middle': 43, 'can': 44, 'inmates': 45, 'death': 46, 'struck': 47, 'penitentary': 48, 'irish': 49, 'side': 50, 'of': 51, 'where': 52, 'happened': 53, 'into': 54, 'a': 55, 'oswald': 56, 'show': 57, 'the': 58, 'so': 59, 'what': 60, 'it': 61, 'shows': 62, 'graphic': 63, 'hearted': 64, 'high': 65, 'bitches': 66, 'hardcore': 67, 'prison': 68, 'forget': 69, 'italians': 70, 'mainly': 71, 'exactly': 72, 'right': 73, 'go': 74, 'touch': 
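
The index assigned to each word above depends on the iteration order of a Python `set`, which for strings can differ from one interpreter session to the next. A small sketch of a reproducible variant, assuming alphabetical order is acceptable; `build_sorted_vocabulary` and the toy sentences are hypothetical names used only here.

```python
def build_sorted_vocabulary(sentences):
    """Hypothetical variant: sort the unique words so indices are stable across runs."""
    vocabulary = sorted({word for sentence in sentences for word in sentence})
    word_to_index = {word: idx for idx, word in enumerate(vocabulary)}
    index_to_word = {idx: word for word, idx in word_to_index.items()}
    return word_to_index, index_to_word

toy_sentences = [["do", "all", "the", "good", "you", "can"],
                 ["for", "all", "the", "people"]]
w2i, i2w = build_sorted_vocabulary(toy_sentences)
print(w2i)
```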

## Step 3: One-Hot Encoding

Convert each word into a one-hot encoded vector.


In [16]:
# Import numpy
import numpy as np
# Step 3: One-Hot Encode
def one_hot_encode(word, word_to_index):
    word = word.strip()  # Strip whitespace from the word
    vector = np.zeros(len(word_to_index))
    vector[word_to_index[word]] = 1
    return vector

# Example of one-hot encoding
example_word = "you"
print(f"One-hot encoding for '{example_word}':", one_hot_encode(example_word, word_to_index))

One-hot encoding for 'you': [0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 1. 0.
 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.
 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.
 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.
 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.
 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.
 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.
 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.]
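
Multiplying a one-hot vector by a weight matrix simply selects one row of that matrix, which is why a one-hot input can later be read as an embedding lookup. The sketch below checks this on a toy vocabulary and a toy matrix; both are assumptions made for illustration.

```python
import numpy as np

toy_word_to_index = {"all": 0, "good": 1, "you": 2}   # hypothetical toy vocabulary
toy_W = np.arange(6, dtype=float).reshape(3, 2)       # hypothetical 3x2 weight matrix

def one_hot(word, word_to_index):
    # Vector of zeros with a single 1 at the word's index
    vector = np.zeros(len(word_to_index))
    vector[word_to_index[word]] = 1
    return vector

via_one_hot = one_hot("you", toy_word_to_index) @ toy_W   # matrix product
via_lookup = toy_W[toy_word_to_index["you"]]              # direct row lookup
print(via_one_hot, via_lookup)
```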


## Step 4: Prepare Training Data

Generate training pairs: every word in a sentence becomes a target, and its context is the list of up to `window_size` words on each side of it within the same sentence.


In [17]:
# Step 4: Prepare Training Data
def generate_training_data(sentences, word_to_index, window_size=2):
    training_data = []
    for sentence in sentences:
        for i, target_word in enumerate(sentence):
            context = []
            for j in range(-window_size, window_size + 1):
                if j != 0 and 0 <= i + j < len(sentence):
                    context.append(sentence[i + j])
            training_data.append((context, target_word))
    return training_data

training_data = generate_training_data(sentences, word_to_index)
print("Training Data:", training_data)

Training Data: [(['of', 'the'], 'one'), (['one', 'the', 'other'], 'of'), (['one', 'of', 'other', 'reviewers'], 'the'), (['of', 'the', 'reviewers', 'has'], 'other'), (['the', 'other', 'has', 'mentioned'], 'reviewers'), (['other', 'reviewers', 'mentioned', 'that'], 'has'), (['reviewers', 'has', 'that', 'after'], 'mentioned'), (['has', 'mentioned', 'after', 'watching'], 'that'), (['mentioned', 'that', 'watching', 'just'], 'after'), (['that', 'after', 'just', '1'], 'watching'), (['after', 'watching', '1', 'oz'], 'just'), (['watching', 'just', 'oz', 'episode'], '1'), (['just', '1', 'episode', 'youll'], 'oz'), (['1', 'oz', 'youll', 'be'], 'episode'), (['oz', 'episode', 'be', 'hooked'], 'youll'), (['episode', 'youll', 'hooked', 'they'], 'be'), (['youll', 'be', 'they', 'are'], 'hooked'), (['be', 'hooked', 'are', 'right'], 'they'), (['hooked', 'they', 'right', 'as'], 'are'), (['they', 'are', 'as', 'this'], 'right'), (['are', 'right', 'this', 'is'], 'as'), (['right', 'as', 'is', 'exactly'], 'thi
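
To see how the window behaves at sentence boundaries, here is a compact sketch of the same pairing logic written with slices and applied to a four-word sentence. The helper name `context_windows` and the toy sentence are assumptions.

```python
def context_windows(sentence, window_size=2):
    """Hypothetical helper: return (context, target) pairs for one tokenized sentence."""
    pairs = []
    for i, target in enumerate(sentence):
        left = sentence[max(0, i - window_size):i]      # words before the target
        right = sentence[i + 1:i + 1 + window_size]     # words after the target
        pairs.append((left + right, target))
    return pairs

toy_sentence = ["do", "all", "the", "good"]
for context, target in context_windows(toy_sentence, window_size=1):
    print(context, "->", target)
```

Words at the start or end of a sentence get a shorter context, which matches the first pairs in the output above.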

## Step 5: Initialize Weights

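Randomly initialize the two weight matrices of the network: `W1` (vocabulary size × embedding dimension) holds one embedding per word, and `W2` (embedding dimension × vocabulary size) maps the hidden layer back to one score per vocabulary word.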


In [33]:
# Step 5: Initialize Weights
def initialize_weights(vocab_size, embedding_dim):
    W1 = np.random.rand(vocab_size, embedding_dim)
    W2 = np.random.rand(embedding_dim, vocab_size)
    return W1, W2

vocab_size = len(word_to_index)
embedding_dim = 10  # Free hyperparameter: any positive size works; the values are learned by the neural network
W1, W2 = initialize_weights(vocab_size, embedding_dim)
print("W1 Shape:", W1.shape)
print("W2 Shape:", W2.shape)

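# Embedding values are not restricted in sign: np.random.rand starts them in [0, 1), and training may push them negative.
# Intuitively, similar directions suggest related meaning and opposing directions suggest contrasting meaning.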

W1 Shape: (191, 10)
W2 Shape: (10, 191)
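
`np.random.rand` draws every weight from [0, 1) and gives different values on each run. A hedged sketch of a reproducible alternative follows; the function name `initialize_weights_seeded`, the seed, and the small symmetric range are assumptions, not settings used in this notebook.

```python
import numpy as np

def initialize_weights_seeded(vocab_size, embedding_dim, seed=0):
    """Hypothetical variant: seeded, zero-centered uniform initialization."""
    rng = np.random.default_rng(seed)
    scale = 0.5 / embedding_dim   # assumed range; smaller values keep initial scores small
    W1 = rng.uniform(-scale, scale, size=(vocab_size, embedding_dim))
    W2 = rng.uniform(-scale, scale, size=(embedding_dim, vocab_size))
    return W1, W2

W1_demo, W2_demo = initialize_weights_seeded(191, 10)
print(W1_demo.shape, W2_demo.shape)
```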


## Step 6: Forward Pass

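Use the context words to predict the target word: sum the one-hot vectors of the context words, project the result through `W1` and then `W2`, and apply softmax to obtain one probability per vocabulary word.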


In [19]:
# Step 6: Forward Pass
def forward_pass(context_words, W1, W2, word_to_index):
    context_vectors = np.sum([one_hot_encode(word, word_to_index) for word in context_words], axis=0)
    hidden_layer = np.dot(context_vectors, W1)
    output_layer = np.dot(hidden_layer, W2)
    predictions = softmax(output_layer)
    return predictions, hidden_layer

def softmax(x):
    exp_x = np.exp(x - np.max(x))
    return exp_x / exp_x.sum(axis=0)

# Example forward pass
context_words = ["the", "you"]
predictions, hidden_layer = forward_pass(context_words, W1, W2, word_to_index)
print("Predictions:", predictions)

Predictions: [0.00112012 0.01683232 0.00588892 0.0081579  0.00197456 0.04299571
 0.00719331 0.00123197 0.00264025 0.00085635 0.0087082  0.00129673
 0.00987993 0.00464819 0.00521012 0.00363398 0.00976825 0.00322798
 0.00268262 0.00864183 0.00360238 0.00158183 0.00945697 0.00263331
 0.00289186 0.00193461 0.00669332 0.001491   0.00512661 0.01191443
 0.00114412 0.00332812 0.0028873  0.00635626 0.00217089 0.01592017
 0.00285223 0.00593184 0.00290764 0.0080026  0.0031115  0.00506536
 0.00432771 0.00127654 0.01182434 0.00065049 0.00239956 0.00127871
 0.00637386 0.00871847 0.00221355 0.00315    0.01841785 0.00254405
 0.00317052 0.0007507  0.00213622 0.0180977  0.012872   0.00637939
 0.01332869 0.00184483 0.00392233 0.00697748 0.00185954 0.00308835
 0.00388916 0.01747616 0.0007934  0.00615317 0.00623406 0.00987063
 0.00676555 0.00831884 0.00235018 0.00477473 0.0091109  0.00213582
 0.00182079 0.00293324 0.00150348 0.01294486 0.00152171 0.00209218
 0.00634271 0.00311802 0.00458739 0.02107357 0.01

In [27]:
# Example forward pass with detailed output
context_words = ["the", "you"]
predictions, hidden_layer = forward_pass(context_words, W1, W2, word_to_index)

# Map predictions to words
predicted_probabilities = {index_to_word[idx]: prob for idx, prob in enumerate(predictions)}

# Print predictions
print("Context Words:", context_words)
print("Predicted Probabilities:")
for word, prob in predicted_probabilities.items():
    print(f"{word}: {prob:.4f}")

Context Words: ['the', 'you']
Predicted Probabilities:
1: 0.0000
maximum: 0.0021
latinos: 0.0000
is: 0.0000
fact: 0.0014
their: 0.0000
are: 0.0000
experience: 0.0000
goes: 0.0000
doesnt: 0.0000
or: 0.0000
regards: 0.0000
scuffles: 0.0000
wholl: 0.0000
drugs: 0.0000
if: 0.0000
about: 0.0000
romanceoz: 0.0000
section: 0.0000
has: 0.0000
lack: 0.0000
main: 0.0012
you: 0.0000
which: 0.0000
developed: 0.0001
out: 0.0000
nickel: 0.0000
got: 0.0000
away: 0.0000
gangstas: 0.0000
appeal: 0.0848
on: 0.0004
never: 0.0009
surreal: 0.0000
in: 0.0012
other: 0.0000
agenda: 0.0194
mannered: 0.0000
dealings: 0.0000
was: 0.0000
shady: 0.0000
mebr: 0.0042
watched: 0.0000
middle: 0.0000
can: 0.0000
inmates: 0.0000
death: 0.0000
struck: 0.0000
penitentary: 0.0000
irish: 0.0000
side: 0.0000
of: 0.2944
where: 0.0228
happened: 0.0000
into: 0.0000
a: 0.0000
oswald: 0.0007
show: 0.0002
the: 0.0000
so: 0.0000
what: 0.0000
it: 0.0000
shows: 0.0000
graphic: 0.0000
hearted: 0.0017
high: 0.0017
bitches: 0.0000
hardc

In [35]:
# Predict the most likely target word and print embedding vector
context_words = ["the", "you"]
predictions, hidden_layer = forward_pass(context_words, W1, W2, word_to_index)

# Find the word with the highest probability
max_prob_index = np.argmax(predictions)
predicted_word = index_to_word[max_prob_index]

# Print the result
print("Context Words:", context_words)
print("Predicted Target Word:", predicted_word)
print("Probability:", predictions[max_prob_index])

# Print the embedding vector for the predicted word
embedding_vector = W1[word_to_index[predicted_word]]
print("Embedding Vector for Predicted Word:", embedding_vector)

Context Words: ['the', 'you']
Predicted Target Word: a
Probability: 0.01766454166611733
Embedding Vector for Predicted Word: [0.46178689 0.19959484 0.67549317 0.41180717 0.80148279 0.3085899
 0.18632604 0.30465782 0.52571886 0.80320544]
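
Written out, `forward_pass` computes the following. The symbols are notation introduced here as an assumption: $V$ is the vocabulary size, $d$ is `embedding_dim`, and $\mathbf{e}_c$ is the one-hot vector of context word $c$.

```latex
\begin{aligned}
\mathbf{x} &= \sum_{c \in \text{context}} \mathbf{e}_c \in \mathbb{R}^{V}, \\
\mathbf{h} &= \mathbf{x} W_1 \in \mathbb{R}^{d}, \\
\mathbf{u} &= \mathbf{h} W_2 \in \mathbb{R}^{V}, \\
\hat{y}_k &= \frac{\exp\left(u_k - \max_j u_j\right)}{\sum_{m=1}^{V} \exp\left(u_m - \max_j u_j\right)}, \qquad k = 1, \dots, V.
\end{aligned}
```

Subtracting $\max_j u_j$ inside the exponentials leaves the probabilities unchanged and only guards against overflow.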


## Step 7: Calculate Loss

Compute the cross-entropy loss between the predicted distribution and the one-hot target to measure how far the predictions are from the actual target word.


In [20]:
# Step 7: Calculate Loss
def calculate_loss(predictions, target_word, word_to_index):
    target_vector = one_hot_encode(target_word, word_to_index)
    loss = -np.sum(target_vector * np.log(predictions))
    return loss

# Example loss calculation
target_word = "you"
loss = calculate_loss(predictions, target_word, word_to_index)
print("Loss:", loss)

Loss: 4.661002881496868
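
Because the target vector is one-hot, the sum in `calculate_loss` keeps a single term: the negative log of the probability assigned to the target word. The sketch below checks this on a made-up three-word distribution; the numbers are assumptions chosen for illustration.

```python
import numpy as np

toy_predictions = np.array([0.1, 0.7, 0.2])   # hypothetical softmax output
target_index = 1                              # hypothetical index of the target word

# Full form, as in calculate_loss
target_vector = np.zeros_like(toy_predictions)
target_vector[target_index] = 1
full_loss = -np.sum(target_vector * np.log(toy_predictions))

# Shortcut: only the target's probability matters
short_loss = -np.log(toy_predictions[target_index])
print(full_loss, short_loss)
```

For reference, a uniform guess over the 191-word vocabulary would give a loss of ln(191), roughly 5.25.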


## Step 8: Update Weights

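Adjust the weights using backpropagation to minimize the loss: compute the gradient of the loss with respect to `W1` and `W2`, then take one gradient-descent step scaled by `learning_rate`.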


In [21]:
# Step 8: Update Weights
def backpropagate(W1, W2, hidden_layer, context_words, predictions, target_word, word_to_index, learning_rate=0.01):
    target_vector = one_hot_encode(target_word, word_to_index)
    error = predictions - target_vector
    dW2 = np.outer(hidden_layer, error)
    dW1 = np.outer(np.sum([one_hot_encode(word, word_to_index) for word in context_words], axis=0), np.dot(W2, error))
    W1 -= learning_rate * dW1
    W2 -= learning_rate * dW2
    return W1, W2

# Example weight update
W1, W2 = backpropagate(W1, W2, hidden_layer, context_words, predictions, target_word, word_to_index)
print("Updated W1:", W1)
print("Updated W2:", W2)

Updated W1: [[0.97899608 0.80407992 0.8165085  ... 0.33992258 0.28518708 0.00120282]
 [0.24214833 0.92688365 0.1471619  ... 0.96228308 0.04343361 0.44555612]
 [0.64642993 0.10520371 0.35753076 ... 0.13218406 0.11879782 0.53407438]
 ...
 [0.94358912 0.16576628 0.89732151 ... 0.45993921 0.66833432 0.24997746]
 [0.20484033 0.13636099 0.78745049 ... 0.05136297 0.5654286  0.97378329]
 [0.51070908 0.15062529 0.97892189 ... 0.44463841 0.88339349 0.25950017]]
Updated W2: [[0.52378247 0.89222918 0.86997127 ... 0.73087749 0.87754141 0.74059442]
 [0.08821982 0.16543376 0.55790768 ... 0.2146925  0.45009949 0.7191441 ]
 [0.34772724 0.91402409 0.26119076 ... 0.72199846 0.42913997 0.86857217]
 ...
 [0.37915686 0.77550342 0.65197886 ... 0.86890509 0.76794262 0.16758902]
 [0.20290302 0.95457001 0.93696946 ... 0.64471561 0.13623499 0.29007315]
 [0.44467986 0.99884161 0.43020554 ... 0.54006021 0.1457283  0.43675004]]
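
One way to gain confidence in the gradient formulas used by `backpropagate` is to compare them with finite differences on a tiny model. The sketch below is self-contained; the sizes, seed, helper names (`loss_fn`, `analytic_grads`, `numeric_grad`), and the toy context are all assumptions.

```python
import numpy as np

def softmax(u):
    exp_u = np.exp(u - np.max(u))
    return exp_u / exp_u.sum()

def loss_fn(W1, W2, x, target_index):
    # Cross-entropy loss for one (context, target) pair
    y = softmax((x @ W1) @ W2)
    return -np.log(y[target_index])

def analytic_grads(W1, W2, x, target_index):
    # Same formulas as backpropagate: error = predictions - one-hot target
    h = x @ W1
    error = softmax(h @ W2)
    error[target_index] -= 1.0
    return np.outer(x, W2 @ error), np.outer(h, error)

def numeric_grad(f, W, eps=1e-6):
    # Central finite differences, perturbing one entry of W at a time (in place)
    grad = np.zeros_like(W)
    for idx in np.ndindex(W.shape):
        original = W[idx]
        W[idx] = original + eps
        plus = f()
        W[idx] = original - eps
        minus = f()
        W[idx] = original
        grad[idx] = (plus - minus) / (2 * eps)
    return grad

rng = np.random.default_rng(0)
V, d = 5, 3                        # toy vocabulary size and embedding dimension
W1 = rng.random((V, d))
W2 = rng.random((d, V))
x = np.zeros(V)
x[[0, 2]] = 1.0                    # summed one-hot vectors of two context words
target_index = 4

dW1, dW2 = analytic_grads(W1, W2, x, target_index)
num_dW1 = numeric_grad(lambda: loss_fn(W1, W2, x, target_index), W1)
num_dW2 = numeric_grad(lambda: loss_fn(W1, W2, x, target_index), W2)
print(np.max(np.abs(dW1 - num_dW1)), np.max(np.abs(dW2 - num_dW2)))
```

Both printed differences should be close to zero if the analytic gradients are right.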


## Training the CBOW Model

We train the CBOW model on the training pairs from the example review for 1,000 epochs and print the summed loss every 100 epochs.


In [22]:
# Training the CBOW Model
for epoch in range(1000):
    total_loss = 0
    for context_words, target_word in training_data:
        predictions, hidden_layer = forward_pass(context_words, W1, W2, word_to_index)
        loss = calculate_loss(predictions, target_word, word_to_index)
        total_loss += loss
        W1, W2 = backpropagate(W1, W2, hidden_layer, context_words, predictions, target_word, word_to_index)
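    # Report every 100th pass. The label is epoch + 100, so the line printed as
    # "Epoch 100" is the summed loss of the first pass (epoch index 0).
    if epoch % 100 == 0: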
        print(f"Epoch {epoch+100}, Loss: {total_loss:.4f}")

Epoch 100, Loss: 1927.5875
Epoch 200, Loss: 159.9771
Epoch 300, Loss: 27.1783
Epoch 400, Loss: 12.3953
Epoch 500, Loss: 7.6629
Epoch 600, Loss: 5.4328
Epoch 700, Loss: 4.1602
Epoch 800, Loss: 3.3463
Epoch 900, Loss: 2.7849
Epoch 1000, Loss: 2.3761


In [None]:
# Continue training the CBOW model for 10,000 more epochs (W1 and W2 carry over from the previous cell)
for epoch in range(10000):
    total_loss = 0
    for context_words, target_word in training_data:
        predictions, hidden_layer = forward_pass(context_words, W1, W2, word_to_index)
        loss = calculate_loss(predictions, target_word, word_to_index)
        total_loss += loss
        W1, W2 = backpropagate(W1, W2, hidden_layer, context_words, predictions, target_word, word_to_index)
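    # Report every 1000th pass. The label is epoch + 1000, so "Epoch 1000" is the first pass of this cell.
    if epoch % 1000 == 0: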
        print(f"Epoch {epoch+1000}, Loss: {total_loss:.4f}")

Epoch 1000, Loss: 2.0080
Epoch 2000, Loss: 0.8353
Epoch 3000, Loss: 0.5118
Epoch 4000, Loss: 0.3640
Epoch 5000, Loss: 0.2803
Epoch 6000, Loss: 0.2268
Epoch 7000, Loss: 0.1899
Epoch 8000, Loss: 0.1628
Epoch 9000, Loss: 0.1423
Epoch 10000, Loss: 0.1262
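
The training loop can also be sketched end to end on a toy sentence, using row lookups in place of one-hot products. This is a minimal illustration under assumed settings (the function name `train_cbow`, the seed, the embedding size, and the toy sentence are hypothetical), not a reproduction of the run above.

```python
import numpy as np

def softmax(u):
    exp_u = np.exp(u - np.max(u))
    return exp_u / exp_u.sum()

def train_cbow(pairs, vocab_size, embedding_dim=5, epochs=200, learning_rate=0.01, seed=0):
    """Train on (context_indices, target_index) pairs; return weights and per-epoch summed loss."""
    rng = np.random.default_rng(seed)
    W1 = rng.random((vocab_size, embedding_dim))
    W2 = rng.random((embedding_dim, vocab_size))
    history = []
    for epoch in range(epochs):
        total_loss = 0.0
        for context_indices, target_index in pairs:
            h = W1[context_indices].sum(axis=0)   # same as summed one-hot vectors @ W1
            y = softmax(h @ W2)
            total_loss += -np.log(y[target_index])
            error = y.copy()
            error[target_index] -= 1.0            # predictions minus one-hot target
            grad_h = W2 @ error                   # computed before W2 is updated
            W2 -= learning_rate * np.outer(h, error)
            for idx in context_indices:           # only the context rows of W1 change
                W1[idx] -= learning_rate * grad_h
        history.append(total_loss)
    return W1, W2, history

toy_sentence = ["do", "all", "the", "good", "you", "can"]        # all words are unique
w2i = {word: idx for idx, word in enumerate(toy_sentence)}
pairs = []
for i, target in enumerate(toy_sentence):
    context = toy_sentence[max(0, i - 2):i] + toy_sentence[i + 1:i + 3]
    pairs.append(([w2i[w] for w in context], w2i[target]))

W1_toy, W2_toy, history = train_cbow(pairs, vocab_size=len(w2i))
print(f"first epoch loss: {history[0]:.4f}, last epoch loss: {history[-1]:.4f}")
```

Returning the loss history, indexed by the actual epoch number, makes it easier to label or plot the curve than printing inside the loop.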
