## Step 1: Preprocess Text

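We preprocess the text by converting it to lowercase, removing every character that is not a letter, digit, or whitespace, and splitting the result into lists of words. Because the periods are removed before the text is split on ".", the whole review ends up as a single "sentence", and the `<br />` tags in the raw review leave fragments such as 'br' and 'wordbr' in the token list.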


In [27]:
import numpy as np
import pandas as pd
from collections import defaultdict
import re

# Example text
#text = "Do all the good you can, for all the people you can, in all the ways you can, as long as you can."
path = "D:\\Y5 AMS\\Information-WR\\TP-04\\IMDB Dataset.csv"
text = pd.read_csv(path)["review"][0]
print("Data:", text)

Data: One of the other reviewers has mentioned that after watching just 1 Oz episode you'll be hooked. They are right, as this is exactly what happened with me.<br /><br />The first thing that struck me about Oz was its brutality and unflinching scenes of violence, which set in right from the word GO. Trust me, this is not a show for the faint hearted or timid. This show pulls no punches with regards to drugs, sex or violence. Its is hardcore, in the classic use of the word.<br /><br />It is called OZ as that is the nickname given to the Oswald Maximum Security State Penitentary. It focuses mainly on Emerald City, an experimental section of the prison where all the cells have glass fronts and face inwards, so privacy is not high on the agenda. Em City is home to many..Aryans, Muslims, gangstas, Latinos, Christians, Italians, Irish and more....so scuffles, death stares, dodgy dealings and shady agreements are never far away.<br /><br />I would say the main appeal of the show is due to t

In [28]:
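# Step 1: Lowercase, strip punctuation, and tokenize
def preprocess_text(text):
    text = text.lower()
    # Keep only letters, digits, and whitespace. This also removes periods
    # and the angle brackets of "<br />", leaving "br" behind.
    text = re.sub(r'[^a-zA-Z0-9\s]', '', text)
    # The periods are already gone, so this split returns the text in one piece.
    sentences = text.split(".")
    # Tokenize on whitespace; split() with no arguments already strips each word.
    sentences = [sentence.split() for sentence in sentences if sentence]
    return sentences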

sentences = preprocess_text(text)
print("Sentences:", sentences)

Sentences: [['one', 'of', 'the', 'other', 'reviewers', 'has', 'mentioned', 'that', 'after', 'watching', 'just', '1', 'oz', 'episode', 'youll', 'be', 'hooked', 'they', 'are', 'right', 'as', 'this', 'is', 'exactly', 'what', 'happened', 'with', 'mebr', 'br', 'the', 'first', 'thing', 'that', 'struck', 'me', 'about', 'oz', 'was', 'its', 'brutality', 'and', 'unflinching', 'scenes', 'of', 'violence', 'which', 'set', 'in', 'right', 'from', 'the', 'word', 'go', 'trust', 'me', 'this', 'is', 'not', 'a', 'show', 'for', 'the', 'faint', 'hearted', 'or', 'timid', 'this', 'show', 'pulls', 'no', 'punches', 'with', 'regards', 'to', 'drugs', 'sex', 'or', 'violence', 'its', 'is', 'hardcore', 'in', 'the', 'classic', 'use', 'of', 'the', 'wordbr', 'br', 'it', 'is', 'called', 'oz', 'as', 'that', 'is', 'the', 'nickname', 'given', 'to', 'the', 'oswald', 'maximum', 'security', 'state', 'penitentary', 'it', 'focuses', 'mainly', 'on', 'emerald', 'city', 'an', 'experimental', 'section', 'of', 'the', 'prison', 'wher
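
The output above contains fused tokens such as 'mebr' and 'wordbr' and only one sentence. A minimal sketch of one way to avoid both, assuming it is acceptable to split on ".", "!" and "?" before stripping punctuation. The function name `preprocess_text_by_sentence` is hypothetical and is not used by the rest of this notebook, whose recorded outputs come from `preprocess_text`.

```python
import re

def preprocess_text_by_sentence(text):
    """Hypothetical variant of preprocess_text: split first, then clean."""
    text = text.lower()
    # Replace HTML line breaks with a space so neighbouring words do not fuse.
    text = re.sub(r"<br\s*/?>", " ", text)
    # Split on sentence-ending punctuation before it is stripped.
    raw_sentences = re.split(r"[.!?]+", text)
    sentences = []
    for raw in raw_sentences:
        # Remove the remaining punctuation, then tokenize on whitespace.
        cleaned = re.sub(r"[^a-z0-9\s]", "", raw)
        tokens = cleaned.split()
        if tokens:  # skip empty fragments
            sentences.append(tokens)
    return sentences

sample = "They are right, as this happened with me.<br /><br />The first thing was its brutality."
print(preprocess_text_by_sentence(sample))
```

Splitting into real sentences also changes the training pairs built later, since context windows then stop at sentence boundaries.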

## Step 2: Build Vocabulary

We will create a vocabulary of unique words and map each word to an index.


In [None]:
# Step 2: Make Vocabulary
def build_vocabulary(sentences):
    vocabulary = set()
    for sentence in sentences:
        vocabulary.update(sentence)
    word_to_index = {word: idx for idx, word in enumerate(vocabulary)}
    index_to_word = {idx: word for word, idx in word_to_index.items()}
    return word_to_index, index_to_word


word_to_index, index_to_word = build_vocabulary(sentences)
print("Word to Index:", word_to_index)

Word to Index: {'they': 0, 'what': 1, 'gangstas': 2, 'an': 3, 'their': 4, 'muslims': 5, 'face': 6, 'being': 7, 'get': 8, 'to': 9, 'exactly': 10, 'due': 11, 'word': 12, 'painted': 13, 'youll': 14, 'accustomed': 15, 'just': 16, 'after': 17, 'from': 18, 'hooked': 19, 'focuses': 20, 'italians': 21, 'guards': 22, 'thing': 23, 'oz': 24, 'watched': 25, 'levels': 26, 'has': 27, 'trust': 28, 'all': 29, 'well': 30, 'surreal': 31, 'viewingthats': 32, 'given': 33, 'use': 34, 'which': 35, 'of': 36, 'show': 37, 'stares': 38, 'me': 39, 'the': 40, 'city': 41, 'one': 42, 'a': 43, 'not': 44, 'em': 45, 'struck': 46, 'wordbr': 47, 'br': 48, 'so': 49, 'called': 50, 'injustice': 51, 'side': 52, 'graphic': 53, 'experience': 54, 'oswald': 55, 'high': 56, 'moreso': 57, 'ever': 58, 'dodgy': 59, 'goes': 60, 'wouldnt': 61, 'nasty': 62, 'its': 63, 'i': 64, 'doesnt': 65, 'latinos': 66, 'would': 67, 'scenes': 68, 'middle': 69, 'comfortable': 70, 'timid': 71, 'skills': 72, 'where': 73, 'about': 74, 'glass': 75, 'arou
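
The indices above come from iterating over a Python set, and the iteration order of a set of strings can change from one interpreter run to the next because string hashing is randomized by default. A small sketch, assuming a stable mapping is wanted across runs: sorting the vocabulary first. The name `build_sorted_vocabulary` is hypothetical.

```python
def build_sorted_vocabulary(sentences):
    """Hypothetical variant of build_vocabulary with a stable index order."""
    # Collect unique words, then sort so the same corpus always yields the same indices.
    vocabulary = sorted({word for sentence in sentences for word in sentence})
    word_to_index = {word: idx for idx, word in enumerate(vocabulary)}
    index_to_word = {idx: word for word, idx in word_to_index.items()}
    return word_to_index, index_to_word

toy_sentences = [["the", "first", "thing"], ["the", "show"]]
w2i, i2w = build_sorted_vocabulary(toy_sentences)
print(w2i)  # {'first': 0, 'show': 1, 'the': 2, 'thing': 3}
```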

## Step 3: One-Hot Encoding

Convert each word into a one-hot encoded vector.


In [None]:
# Import numpy
import numpy as np
# Step 3: One-Hot Encode
def one_hot_encode(word, word_to_index):
    word = word.strip()  # Strip whitespace from the word
    vector = np.zeros(len(word_to_index))
    vector[word_to_index[word]] = 1
    return vector

# Example of one-hot encoding
example_word = "you"
print(f"One-hot encoding for '{example_word}':", one_hot_encode(example_word, word_to_index))

One-hot encoding for 'you': [0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.
 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.
 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.
 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.
 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 1. 0. 0. 0.
 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.
 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.
 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.]
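
A one-hot vector matters mainly because multiplying it by a weight matrix selects a single row of that matrix. The sketch below checks this on a toy matrix; the sizes and the matrix `W` are made up for illustration and are not the notebook's weights.

```python
import numpy as np

rng = np.random.default_rng(0)
vocab_size, embedding_dim = 5, 3
W = rng.random((vocab_size, embedding_dim))  # toy embedding matrix

index = 2
one_hot = np.zeros(vocab_size)
one_hot[index] = 1

via_matmul = one_hot @ W   # full matrix product
via_lookup = W[index]      # direct row lookup
print(np.allclose(via_matmul, via_lookup))  # True
```

For larger vocabularies the row lookup gives the same result without building a vector of mostly zeros.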


## Step 4: Prepare Training Data

Generate training pairs of a target word and its surrounding context words.


In [31]:
# Step 4: Prepare Training Data
def generate_training_data(sentences, word_to_index, window_size=2):
    training_data = []
    for sentence in sentences:
        for i, target_word in enumerate(sentence):
            context = []
            for j in range(-window_size, window_size + 1):
                if j != 0 and 0 <= i + j < len(sentence):
                    context.append(sentence[i + j])
            training_data.append((context, target_word))
    return training_data

training_data = generate_training_data(sentences, word_to_index)
print("Training Data:", training_data)

Training Data: [(['of', 'the'], 'one'), (['one', 'the', 'other'], 'of'), (['one', 'of', 'other', 'reviewers'], 'the'), (['of', 'the', 'reviewers', 'has'], 'other'), (['the', 'other', 'has', 'mentioned'], 'reviewers'), (['other', 'reviewers', 'mentioned', 'that'], 'has'), (['reviewers', 'has', 'that', 'after'], 'mentioned'), (['has', 'mentioned', 'after', 'watching'], 'that'), (['mentioned', 'that', 'watching', 'just'], 'after'), (['that', 'after', 'just', '1'], 'watching'), (['after', 'watching', '1', 'oz'], 'just'), (['watching', 'just', 'oz', 'episode'], '1'), (['just', '1', 'episode', 'youll'], 'oz'), (['1', 'oz', 'youll', 'be'], 'episode'), (['oz', 'episode', 'be', 'hooked'], 'youll'), (['episode', 'youll', 'hooked', 'they'], 'be'), (['youll', 'be', 'they', 'are'], 'hooked'), (['be', 'hooked', 'are', 'right'], 'they'), (['hooked', 'they', 'right', 'as'], 'are'), (['they', 'are', 'as', 'this'], 'right'), (['are', 'right', 'this', 'is'], 'as'), (['right', 'as', 'is', 'exactly'], 'thi
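
The windowing above can also be written with list slices, which makes it easier to see that words near the start or end of a sentence receive fewer context words. This is a sketch with a hypothetical helper name, `context_windows`; it is intended to produce the same pairs as `generate_training_data` for a single sentence.

```python
def context_windows(sentence, window_size=2):
    """Return (context, target) pairs for one tokenized sentence."""
    pairs = []
    for i, target in enumerate(sentence):
        # Up to window_size words on each side; slices clip at the sentence edges.
        left = sentence[max(0, i - window_size):i]
        right = sentence[i + 1:i + 1 + window_size]
        pairs.append((left + right, target))
    return pairs

toy = ["do", "all", "the", "good"]
for context, target in context_windows(toy, window_size=1):
    print(context, "->", target)
```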

## Step 5: Initialize Weights

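Randomly initialize the two weight matrices of the network: W1 (vocabulary size × embedding dimension) holds the input embeddings, and W2 (embedding dimension × vocabulary size) maps the hidden layer back to one score per vocabulary word.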


In [32]:
# Step 5: Initialize Weights
def initialize_weights(vocab_size, embedding_dim):
    W1 = np.random.rand(vocab_size, embedding_dim)
    W2 = np.random.rand(embedding_dim, vocab_size)
    return W1, W2

vocab_size = len(word_to_index)
embedding_dim = 10
W1, W2 = initialize_weights(vocab_size, embedding_dim)
print("W1 Shape:", W1.shape)
print("W2 Shape:", W2.shape)

W1 Shape: (191, 10)
W2 Shape: (10, 191)
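
`np.random.rand` draws from [0, 1) without a fixed seed, so every run starts from different, all-positive weights. A minimal sketch of a reproducible alternative, assuming small zero-centered values are acceptable; the function name, the `seed` and the `scale` value are illustrative choices and not part of the original notebook.

```python
import numpy as np

def initialize_weights_seeded(vocab_size, embedding_dim, seed=0, scale=0.1):
    """Hypothetical variant of initialize_weights with a fixed seed."""
    rng = np.random.default_rng(seed)
    # Uniform values in [-scale, scale) keep the initial scores close to zero.
    W1 = rng.uniform(-scale, scale, size=(vocab_size, embedding_dim))
    W2 = rng.uniform(-scale, scale, size=(embedding_dim, vocab_size))
    return W1, W2

W1_a, W2_a = initialize_weights_seeded(191, 10)
W1_b, W2_b = initialize_weights_seeded(191, 10)
print(W1_a.shape, W2_a.shape)
print(np.array_equal(W1_a, W1_b))  # same seed, same weights
```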


## Step 6: Forward Pass

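Use the context words to predict the target word. The one-hot vectors of the context words are summed, projected through W1 to the hidden layer and through W2 to one score per vocabulary word, and softmax turns the scores into probabilities.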


In [33]:
# Step 6: Forward Pass
def forward_pass(context_words, W1, W2, word_to_index):
    context_vectors = np.sum([one_hot_encode(word, word_to_index) for word in context_words], axis=0)
    hidden_layer = np.dot(context_vectors, W1)
    output_layer = np.dot(hidden_layer, W2)
    predictions = softmax(output_layer)
    return predictions, hidden_layer

def softmax(x):
    exp_x = np.exp(x - np.max(x))
    return exp_x / exp_x.sum(axis=0)

# Example forward pass
context_words = ["the", "you"]
predictions, hidden_layer = forward_pass(context_words, W1, W2, word_to_index)
print("Predictions:", predictions)

Predictions: [0.00213227 0.0018785  0.00332152 0.0006865  0.00435646 0.00288194
 0.00205401 0.00560074 0.00597857 0.00726922 0.00270178 0.00111142
 0.00446999 0.0156748  0.01047186 0.00188084 0.00351845 0.01132483
 0.00132852 0.0010822  0.00131235 0.00505399 0.0089736  0.00641337
 0.00066024 0.00304097 0.00466539 0.00278789 0.00847183 0.00200756
 0.00055433 0.00118528 0.00182418 0.00737007 0.00423864 0.00783332
 0.01147866 0.00787929 0.00770173 0.01345859 0.00961481 0.00474966
 0.00270046 0.00309687 0.00264362 0.00112812 0.00552614 0.00603089
 0.00358825 0.0036818  0.00480967 0.01083732 0.00416332 0.00392931
 0.001359   0.00147904 0.00895109 0.00276624 0.00212335 0.00308182
 0.00721263 0.00174748 0.00468638 0.00159617 0.00201591 0.00163697
 0.00354256 0.0029527  0.00215963 0.00306316 0.01441095 0.00168572
 0.00824243 0.00377058 0.00407944 0.00370662 0.00195967 0.00925811
 0.00193402 0.0022038  0.00144293 0.02957747 0.00609281 0.01069854
 0.00723083 0.00259711 0.01888807 0.00432152 0.01
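
The `softmax` function above subtracts the maximum score before exponentiating. The sketch below shows why on made-up scores: the naive formula overflows for large inputs, while the shifted version gives the same probabilities without overflow. The name `softmax_stable` and the score values are illustrative.

```python
import numpy as np

def softmax_stable(x):
    # Shifting by the maximum leaves the result unchanged but keeps exp() finite.
    shifted = x - np.max(x)
    exp_x = np.exp(shifted)
    return exp_x / exp_x.sum()

scores = np.array([1000.0, 1001.0, 1002.0])

# Naive version: exp(1000) overflows to inf, and inf / inf is nan.
with np.errstate(over="ignore", invalid="ignore"):
    naive = np.exp(scores) / np.exp(scores).sum()

print("naive: ", naive)
print("stable:", softmax_stable(scores))
```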

## Step 7: Calculate Loss

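Compute the cross-entropy loss to measure how far the predictions are from the actual target. With a one-hot target it reduces to the negative log of the probability assigned to the target word.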


In [34]:
# Step 7: Calculate Loss
def calculate_loss(predictions, target_word, word_to_index):
    target_vector = one_hot_encode(target_word, word_to_index)
    loss = -np.sum(target_vector * np.log(predictions))
    return loss

# Example loss calculation
target_word = "you"
loss = calculate_loss(predictions, target_word, word_to_index)
print("Loss:", loss)

Loss: 5.8837579364215795
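
To put the value above in context: a model that assigns the same probability to each of the 191 vocabulary words has a loss of ln(191), about 5.25, so the untrained network starts slightly worse than a uniform guess. The sketch below checks this and also confirms that the full sum equals the negative log of the target entry; the target index is arbitrary.

```python
import numpy as np

vocab_size = 191                      # vocabulary size reported earlier
uniform = np.full(vocab_size, 1.0 / vocab_size)

target_index = 7                      # arbitrary index, for illustration only
target_vector = np.zeros(vocab_size)
target_vector[target_index] = 1

full_sum = -np.sum(target_vector * np.log(uniform))  # formula used in calculate_loss
shortcut = -np.log(uniform[target_index])            # only the target entry contributes

print(np.isclose(full_sum, shortcut))
print(shortcut)
```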


## Step 8: Update Weights

Adjust the weights using backpropagation to minimize the loss.


In [35]:
# Step 8: Update Weights
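def backpropagate(W1, W2, hidden_layer, context_words, predictions, target_word, word_to_index, learning_rate=0.01):
    target_vector = one_hot_encode(target_word, word_to_index)
    # Gradient of softmax + cross-entropy with respect to the output scores
    error = predictions - target_vector
    # Same summed one-hot input that forward_pass used
    context_vector = np.sum([one_hot_encode(word, word_to_index) for word in context_words], axis=0)
    dW2 = np.outer(hidden_layer, error)
    dW1 = np.outer(context_vector, np.dot(W2, error))
    # In-place gradient descent step; the caller's arrays are modified
    W1 -= learning_rate * dW1
    W2 -= learning_rate * dW2
    return W1, W2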

# Example weight update
W1, W2 = backpropagate(W1, W2, hidden_layer, context_words, predictions, target_word, word_to_index)
print("Updated W1:", W1)
print("Updated W2:", W2)

Updated W1: [[0.08310851 0.74618342 0.19801752 ... 0.40738392 0.35007402 0.77353624]
 [0.76821327 0.86323211 0.21154912 ... 0.17984746 0.16831906 0.32361618]
 [0.45822773 0.00705465 0.20741367 ... 0.20689923 0.13580686 0.56027178]
 ...
 [0.06330716 0.08635212 0.73143845 ... 0.3129106  0.29609666 0.40056424]
 [0.66156429 0.49835694 0.97768569 ... 0.52090851 0.06955654 0.31915977]
 [0.5397931  0.20921538 0.19101099 ... 0.42257942 0.39824093 0.94691191]]
Updated W2: [[0.83090592 0.65196339 0.52216267 ... 0.32916463 0.58848774 0.57794898]
 [0.21378237 0.87540133 0.0203004  ... 0.9942027  0.33454198 0.71350138]
 [0.09581668 0.00745751 0.81211325 ... 0.12431117 0.38603117 0.17427445]
 ...
 [0.87991565 0.37731752 0.74163782 ... 0.08974289 0.02208898 0.41896346]
 [0.91831192 0.83277559 0.83246297 ... 0.12639498 0.77119136 0.10360173]
 [0.78077818 0.00524953 0.76746312 ... 0.49719124 0.66159015 0.73420595]]
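
One way to gain confidence in the gradient formulas used in `backpropagate` is a finite-difference check on a tiny network. The sketch below is self-contained and uses made-up sizes, random weights, and a hypothetical helper `numerical_grad`; it compares the analytic gradients with central differences of the loss.

```python
import numpy as np

def softmax(x):
    e = np.exp(x - np.max(x))
    return e / e.sum()

def loss_fn(W1, W2, x, target_index):
    # Same forward pass as above: summed one-hot input -> hidden -> softmax.
    h = x @ W1
    p = softmax(h @ W2)
    return -np.log(p[target_index])

rng = np.random.default_rng(0)
V, D = 6, 3                      # toy vocabulary size and embedding dimension
W1 = rng.random((V, D))
W2 = rng.random((D, V))
x = np.zeros(V)
x[[1, 4]] = 1                    # two context words, summed one-hot
target_index = 2

# Analytic gradients, following the formulas in backpropagate.
h = x @ W1
p = softmax(h @ W2)
error = p.copy()
error[target_index] -= 1
dW2 = np.outer(h, error)
dW1 = np.outer(x, W2 @ error)

def numerical_grad(f, W, eps=1e-6):
    """Central-difference gradient of f() with respect to each entry of W."""
    grad = np.zeros_like(W)
    for idx in np.ndindex(W.shape):
        old = W[idx]
        W[idx] = old + eps
        plus = f()
        W[idx] = old - eps
        minus = f()
        W[idx] = old              # restore the original value
        grad[idx] = (plus - minus) / (2 * eps)
    return grad

num_dW1 = numerical_grad(lambda: loss_fn(W1, W2, x, target_index), W1)
num_dW2 = numerical_grad(lambda: loss_fn(W1, W2, x, target_index), W2)

print(np.allclose(dW1, num_dW1, atol=1e-6))
print(np.allclose(dW2, num_dW2, atol=1e-6))
```

Rows of `dW1` that belong to words outside the context are exactly zero, so only the context words' embeddings change in a given step.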


## Training the CBOW Model

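We train the CBOW model on the preprocessed review for 1,000 epochs, then continue from the resulting weights for another 10,000. The reported loss is the sum over all training pairs in one epoch, not an average.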


In [36]:
# Training the CBOW Model
for epoch in range(1000):
    total_loss = 0
    for context_words, target_word in training_data:
        predictions, hidden_layer = forward_pass(context_words, W1, W2, word_to_index)
        loss = calculate_loss(predictions, target_word, word_to_index)
        total_loss += loss
        W1, W2 = backpropagate(W1, W2, hidden_layer, context_words, predictions, target_word, word_to_index)
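    # Fires at epoch = 0, 100, ..., 900, so the printed label runs 100 ahead:
    # "Epoch 100" is the total loss of the very first pass over the data.
    if epoch % 100 == 0:
        print(f"Epoch {epoch+100}, Loss: {total_loss:.4f}")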

Epoch 100, Loss: 1876.6766
Epoch 200, Loss: 153.2660
Epoch 300, Loss: 26.4190
Epoch 400, Loss: 12.0695
Epoch 500, Loss: 7.4540
Epoch 600, Loss: 5.2811
Epoch 700, Loss: 4.0427
Epoch 800, Loss: 3.2513
Epoch 900, Loss: 2.7058
Epoch 1000, Loss: 2.3088


In [37]:
# Continue training from the current weights for 10,000 more epochs
for epoch in range(10000):
    total_loss = 0
    for context_words, target_word in training_data:
        predictions, hidden_layer = forward_pass(context_words, W1, W2, word_to_index)
        loss = calculate_loss(predictions, target_word, word_to_index)
        total_loss += loss
        W1, W2 = backpropagate(W1, W2, hidden_layer, context_words, predictions, target_word, word_to_index)
    # Fires at epoch = 0, 1000, ..., 9000, so the printed label runs 1000 ahead:
    # "Epoch 1000" is the total loss of the first pass in this cell.
    if epoch % 1000 == 0:
        print(f"Epoch {epoch+1000}, Loss: {total_loss:.4f}")

Epoch 1000, Loss: 2.0080
Epoch 2000, Loss: 0.8353
Epoch 3000, Loss: 0.5118
Epoch 4000, Loss: 0.3640
Epoch 5000, Loss: 0.2803
Epoch 6000, Loss: 0.2268
Epoch 7000, Loss: 0.1899
Epoch 8000, Loss: 0.1628
Epoch 9000, Loss: 0.1423
Epoch 10000, Loss: 0.1262
