
# Mathematical Formulation for IMDB Text Classification using an MLP
# Instructor: Dr. Ankur Mali
# University of South Florida (Spring 2025)

This document describes the mathematical framework for processing IMDB text data using a character-level bag-of-characters representation, passing it through a multi-layer perceptron (MLP), and training the model via gradient descent. The evaluation metrics include loss, accuracy, precision, and recall.

---
The findings and observations from the experiments are recorded in the notebook after the code for each model.

Name: Abrar Zahin

## 1. Tokenization and Input Representation

Given a raw text review \( T \), we first tokenize it at the character level. Let \( V \) be the vocabulary (i.e., the set of unique characters) extracted from the training data with size \( |V| = d \).

For each text review \( T \), we construct a binary bag-of-characters vector \( x \in \{0,1\}^d \) such that:

$$
x_j =
\begin{cases}
1, & \text{if the } j\text{-th character in } V \text{ appears in } T, \\
0, & \text{otherwise.}
\end{cases}
$$

Thus, each review is represented as:

$$
x = \mathrm{BOW}(T) \in \{0,1\}^d \subset \mathbb{R}^d.
$$
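
A minimal sketch of this binary encoding in plain Python may make the definition concrete. The helper names `build_vocab` and `bag_of_chars` and the two toy training texts are assumptions made for illustration; they are not part of the notebook's code.

```python
def build_vocab(texts):
    """Return the sorted list of unique characters seen in `texts` (the set V)."""
    return sorted(set("".join(texts)))


def bag_of_chars(text, vocab):
    """Binary vector x with x_j = 1 if the j-th vocabulary character occurs in `text`."""
    present = set(text)
    return [1 if ch in present else 0 for ch in vocab]


# Toy training data, chosen only for illustration.
train_texts = ["good film", "bad"]
vocab = build_vocab(train_texts)
x = bag_of_chars("bad", vocab)

print(vocab)  # [' ', 'a', 'b', 'd', 'f', 'g', 'i', 'l', 'm', 'o']
print(x)      # [0, 1, 1, 1, 0, 0, 0, 0, 0, 0]
```

Because the vector only records presence, repeated characters and character order are discarded.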

---

## 2. MLP Model

The MLP we consider has the following structure:
- **Input layer:** Receives $x \in \mathbb{R}^d$.
- **Hidden Layer 1:** With $h_1$ neurons; its post-activation output is $h^{(1)}$ (`z1` in the code).
- **Hidden Layer 2:** With \( h_2 \) neurons.
- **Output Layer:** With \( c \) neurons (for \( c = 2 \) classes in binary classification).

### 2.1. Model Parameters

- **First Hidden Layer:**
  - Weight matrix: $W^{(1)} \in \mathbb{R}^{d \times h_1}$
  - Bias vector: $b^{(1)} \in \mathbb{R}^{h_1}$

- **Second Hidden Layer:**
  - Weight matrix: $W^{(2)} \in \mathbb{R}^{h_1 \times h_2}$
  - Bias vector: $b^{(2)} \in \mathbb{R}^{h_2}$

- **Output Layer:**
  - Weight matrix: $W^{(3)} \in \mathbb{R}^{h_2 \times c}$
  - Bias vector: $b^{(3)} \in \mathbb{R}^{c}$

> **Note:** In the original code, a third hidden layer size (\( h_3 \)) is accepted as a parameter but is not used in the forward computation, so the model has two hidden layers. You can extend the pipeline to any number of layers; remember to add the corresponding weights and biases and to update the forward pass accordingly.
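
As a rough check on model size, the parameter shapes above can be turned into a total count. The sketch below is illustrative: `count_parameters` is a hypothetical helper, and the sizes \( d = 1000 \), \( h_1 = 256 \), \( h_2 = 128 \), \( c = 2 \) are assumed values chosen to mirror the settings that appear in the code later in this notebook.

```python
def count_parameters(layer_sizes):
    """Total number of weights and biases in a fully connected network.

    layer_sizes lists the layer widths from input to output, e.g. [d, h1, h2, c].
    """
    total = 0
    for fan_in, fan_out in zip(layer_sizes[:-1], layer_sizes[1:]):
        total += fan_in * fan_out  # weight matrix of shape (fan_in, fan_out)
        total += fan_out           # bias vector of shape (fan_out,)
    return total


# Assumed sizes: d = 1000, h1 = 256, h2 = 128, c = 2.
n_params = count_parameters([1000, 256, 128, 2])
print(n_params)  # 289410
```

Almost all of the parameters sit in the first weight matrix, since the input dimension is much larger than the hidden widths.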

### 2.2. Forward Pass

For an input vector \( x \), the forward propagation through the network is as follows:

1. **First Hidden Layer:**

   $$
   h^{(1)} = \text{ReLU}\Big( x\, W^{(1)} + b^{(1)} \Big)
   $$

2. **Second Hidden Layer:**

   $$
   h^{(2)} = \text{ReLU}\Big( h^{(1)}\, W^{(2)} + b^{(2)} \Big)
   $$

3. **Output Layer (Logits):**

   $$
   z = h^{(2)}\, W^{(3)} + b^{(3)}
   $$

The logits \( z \) are then converted to class probabilities using the softmax function:

$$
\hat{y}_k = \text{softmax}(z)_k = \frac{\exp(z_k)}{\sum_{j=1}^{c} \exp(z_j)}, \qquad k = 1, \dots, c
$$
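
The three layer equations and the softmax can be sketched in NumPy as follows. This is an illustrative stand-in rather than the notebook's TensorFlow implementation; the toy layer sizes, the random seed, and the function names are assumptions.

```python
import numpy as np


def relu(a):
    """Element-wise max(a, 0)."""
    return np.maximum(a, 0.0)


def softmax(z):
    """Row-wise softmax; subtracting the row max avoids overflow in exp."""
    shifted = z - z.max(axis=1, keepdims=True)
    exp_z = np.exp(shifted)
    return exp_z / exp_z.sum(axis=1, keepdims=True)


def forward(x, params):
    """Two-hidden-layer forward pass; returns (logits, probabilities)."""
    W1, b1, W2, b2, W3, b3 = params
    h1 = relu(x @ W1 + b1)    # first hidden layer
    h2 = relu(h1 @ W2 + b2)   # second hidden layer
    z = h2 @ W3 + b3          # output logits
    return z, softmax(z)


# Toy sizes, chosen only for illustration.
d, n_h1, n_h2, c = 6, 4, 3, 2
rng = np.random.default_rng(0)
params = (
    rng.normal(0.0, 0.1, (d, n_h1)), np.zeros(n_h1),
    rng.normal(0.0, 0.1, (n_h1, n_h2)), np.zeros(n_h2),
    rng.normal(0.0, 0.1, (n_h2, c)), np.zeros(c),
)
x = rng.integers(0, 2, size=(5, d)).astype(float)  # 5 binary input vectors
logits, probs = forward(x, params)
print(logits.shape)       # (5, 2)
print(probs.sum(axis=1))  # each row sums to 1 (up to rounding)
```

Inputs are stored as rows, which is why the equations multiply \( x \) on the left of each weight matrix.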

---

## 3. Loss Function

We use the **Categorical Cross Entropy Loss** (with logits) for training. For a single sample with true one-hot label \( y \) and predicted probabilities \( \hat{y} \), the loss is:

$$
L(y, \hat{y}) = -\sum_{j=1}^{c} y_j \log(\hat{y}_j)
$$

For a batch of \( N \) samples, the average loss is computed as:

$$
L = \frac{1}{N} \sum_{i=1}^{N} L(y^{(i)}, \hat{y}^{(i)})
$$
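
Computing the loss "with logits" usually means folding the softmax into the loss through a log-sum-exp, which avoids taking the logarithm of a probability that has underflowed to zero. A minimal pure-Python sketch under that assumption (the function names are illustrative, not the Keras API):

```python
import math


def log_softmax(z):
    """log(softmax(z)) computed with the log-sum-exp trick for stability."""
    m = max(z)
    log_sum = m + math.log(sum(math.exp(v - m) for v in z))
    return [v - log_sum for v in z]


def cross_entropy(y_onehot, z):
    """Per-sample loss L(y, y_hat) = -sum_j y_j * log(y_hat_j), from logits z."""
    return -sum(y * lp for y, lp in zip(y_onehot, log_softmax(z)))


def batch_loss(Y, Z):
    """Average of the per-sample losses over a batch of N samples."""
    return sum(cross_entropy(y, z) for y, z in zip(Y, Z)) / len(Y)


loss_uniform = cross_entropy([1, 0], [0.0, 0.0])     # equal logits: loss is log(2)
loss_extreme = cross_entropy([0, 1], [1000.0, 0.0])  # confidently wrong, no overflow
print(loss_uniform, loss_extreme)
print(batch_loss([[1, 0], [0, 1]], [[0.0, 0.0], [0.0, 0.0]]))
```

With equal logits the model assigns probability 1/2 to each class, so the loss is \( \log 2 \approx 0.693 \), a useful reference value when reading training logs for a two-class problem.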

---

## 4. Training via Gradient Descent

The goal is to minimize the loss \( L \) with respect to the model parameters:

$$
\Theta = \{ W^{(1)},\, b^{(1)},\, W^{(2)},\, b^{(2)},\, W^{(3)},\, b^{(3)} \}
$$

Adaptive methods such as Adam rescale the step for each parameter using running estimates of the gradient's moments. With plain gradient descent, each parameter \( \theta \in \Theta \) is updated as:

$$
\theta \leftarrow \theta - \eta\, \nabla_\theta L
$$

where:
- $\eta$ is the learning rate.
- $\nabla_\theta L$ denotes the gradient of the loss with respect to $\theta$.

Backpropagation is used to compute these gradients efficiently.
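
For softmax followed by cross-entropy, the gradient of the loss with respect to the logits is \( \hat{y} - y \); backpropagation starts from this quantity. The sketch below checks that identity against a finite-difference estimate and then applies one update step. As a simplification for illustration, it treats the logits themselves as the parameters being updated; the values of `z`, `y`, and `eta` are arbitrary assumptions.

```python
import math


def softmax(z):
    m = max(z)
    exps = [math.exp(v - m) for v in z]
    total = sum(exps)
    return [e / total for e in exps]


def loss(z, y):
    """Cross-entropy between one-hot y and softmax(z)."""
    return -sum(yj * math.log(pj) for yj, pj in zip(y, softmax(z)))


def analytic_grad(z, y):
    """Closed-form gradient of the loss w.r.t. the logits: softmax(z) - y."""
    return [pj - yj for pj, yj in zip(softmax(z), y)]


def numeric_grad(z, y, eps=1e-6):
    """Central finite-difference estimate of the same gradient."""
    grad = []
    for j in range(len(z)):
        up = z[:j] + [z[j] + eps] + z[j + 1:]
        down = z[:j] + [z[j] - eps] + z[j + 1:]
        grad.append((loss(up, y) - loss(down, y)) / (2 * eps))
    return grad


z = [0.5, -1.0]  # arbitrary logits
y = [0, 1]       # true class is the second one
g = analytic_grad(z, y)
g_num = numeric_grad(z, y)

eta = 0.1        # learning rate
z_new = [zj - eta * gj for zj, gj in zip(z, g)]  # theta <- theta - eta * grad
print(g, g_num)
print(loss(z, y), loss(z_new, y))  # the loss decreases after the step
```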

---

## 5. Evaluation Metrics

In addition to monitoring the loss during training, we evaluate the model performance using:

- **Accuracy:**

  $$
  \text{Accuracy} = \frac{\text{Number of correct predictions}}{\text{Total number of predictions}}
  $$

- **Precision:**

  $$
  \text{Precision} = \frac{\text{True Positives}}{\text{True Positives} + \text{False Positives}}
  $$

- **Recall:**

  $$
  \text{Recall} = \frac{\text{True Positives}}{\text{True Positives} + \text{False Negatives}}
  $$

These metrics are computed on the validation and test sets to assess the model’s generalization performance.
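
The three formulas can be computed directly from predicted and true labels. The sketch below is a plain-Python illustration with made-up labels; `binary_metrics` is a hypothetical helper, and returning 0.0 when a denominator is zero is a convention assumed here. The notebook's code uses scikit-learn's `precision_score` and `recall_score` for the same purpose.

```python
def binary_metrics(y_true, y_pred, positive=1):
    """Return (accuracy, precision, recall) for binary labels."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    correct = sum(1 for t, p in pairs if t == p)

    accuracy = correct / len(pairs)
    precision = tp / (tp + fp) if (tp + fp) else 0.0  # assumed convention
    recall = tp / (tp + fn) if (tp + fn) else 0.0     # assumed convention
    return accuracy, precision, recall


# Made-up labels for illustration: 2 true positives, 1 false positive, 1 false negative.
y_true = [1, 0, 1, 1, 0, 0]
y_pred = [1, 0, 0, 1, 1, 0]
accuracy, precision, recall = binary_metrics(y_true, y_pred)
print(accuracy, precision, recall)  # 4/6, 2/3, 2/3
```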

---

## 6. Summary of the Pipeline

1. **Tokenization:**  
   Each review \( T \) is tokenized at the character level and converted into a binary vector $x \in \{0,1\}^d$ representing the presence of each character in the vocabulary \( V \).

2. **MLP Forward Propagation:**  
   The input vector \( x \) is propagated through the MLP:
   - First hidden layer: $h^{(1)} = \text{ReLU}\big( x\, W^{(1)} + b^{(1)} \big)$
   - Second hidden layer: $h^{(2)} = \text{ReLU}\big( h^{(1)}\, W^{(2)} + b^{(2)} \big)$
   - Output layer: $z = h^{(2)}\, W^{(3)} + b^{(3)}$
   - Softmax conversion: $\hat{y} = \text{softmax}(z)$

3. **Loss Computation:**  
   The categorical cross-entropy loss $L$ is computed using the true labels and the predicted probabilities.

4. **Training:**  
   The model parameters $\Theta$ are updated using gradient descent (or Adam), where:

   $$
   \theta \leftarrow \theta - \eta\, \nabla_\theta L
   $$

5. **Evaluation:**  
   After training, the model is evaluated on the validation and test sets using the loss, accuracy, precision, and recall metrics.

---

This formulation captures the entire process—from transforming raw text into a numeric representation, through the forward and backward passes of an MLP, to the training and evaluation of the system. Shorter version of your slides :)


## MLP on IMDB Dataset

In [None]:
import tensorflow as tf
import tensorflow_datasets as tfds
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.metrics import precision_score, recall_score

tf.random.set_seed(1221)
np.random.seed(1221)
# -------------------------------
# Original MLP Class Definition
# -------------------------------
class MLP(object):
    def __init__(self, size_input, size_hidden1, size_hidden2, size_hidden3, size_output, device=None):
        """
        size_input: int, size of input layer
        size_hidden1: int, size of the 1st hidden layer
        size_hidden2: int, size of the 2nd hidden layer
        size_hidden3: int, size of the 3rd hidden layer (not used in compute_output here)
        size_output: int, size of output layer
        device: str or None, either 'cpu' or 'gpu' or None.
        """
        self.size_input = size_input
        self.size_hidden1 = size_hidden1
        self.size_hidden2 = size_hidden2
        self.size_hidden3 = size_hidden3  # (Currently not used in the forward pass)
        self.size_output = size_output
        self.device = device

        # Initialize weights and biases for first hidden layer
        self.W1 = tf.Variable(tf.random.normal([self.size_input, self.size_hidden1], stddev=0.1))
        self.b1 = tf.Variable(tf.zeros([1, self.size_hidden1]))

        # Initialize weights and biases for second hidden layer
        self.W2 = tf.Variable(tf.random.normal([self.size_hidden1, self.size_hidden2], stddev=0.1))
        self.b2 = tf.Variable(tf.zeros([1, self.size_hidden2]))

        # Initialize weights and biases for output layer
        self.W3 = tf.Variable(tf.random.normal([self.size_hidden2, self.size_output], stddev=0.1))
        self.b3 = tf.Variable(tf.zeros([1, self.size_output]))

        # List of variables to update during backpropagation
        self.variables = [self.W1, self.W2, self.W3, self.b1, self.b2, self.b3]

    def forward(self, X):
        """
        Forward pass.
        X: Tensor, inputs.
        """
        if self.device is not None:
            with tf.device('gpu:0' if self.device == 'gpu' else 'cpu'):
                self.y = self.compute_output(X)
        else:
            self.y = self.compute_output(X)
        return self.y

    def loss(self, y_pred, y_true):
        """
        Computes the loss between predicted and true outputs.
        y_pred: Tensor of shape (batch_size, size_output)
        y_true: Tensor of shape (batch_size, size_output)
        """
        y_true_tf = tf.cast(y_true, dtype=tf.float32)
        y_pred_tf = tf.cast(y_pred, dtype=tf.float32)
        cce = tf.keras.losses.CategoricalCrossentropy(from_logits=True)
        loss_x = cce(y_true_tf, y_pred_tf)
        return loss_x

    def backward(self, X_train, y_train):
        """
        Backward pass: compute gradients of the loss with respect to the variables.
        """
        with tf.GradientTape() as tape:
            predicted = self.forward(X_train)
            current_loss = self.loss(predicted, y_train)
        grads = tape.gradient(current_loss, self.variables)
        return grads

    def compute_output(self, X):
        """
        Custom method to compute the output tensor during the forward pass.
        """
        # Cast X to float32
        X_tf = tf.cast(X, dtype=tf.float32)
        # First hidden layer
        h1 = tf.matmul(X_tf, self.W1) + self.b1
        z1 = tf.nn.relu(h1)
        # Second hidden layer
        h2 = tf.matmul(z1, self.W2) + self.b2
        z2 = tf.nn.relu(h2)
        # Output layer (logits)
        output = tf.matmul(z2, self.W3) + self.b3
        return output

# -------------------------------
# Character-Level Tokenizer and Preprocessing Functions
# -------------------------------
def char_level_tokenizer(texts, num_words=1000):
    """
    Create and fit a Keras Tokenizer on the given texts.

    Note: despite the function name, char_level=False makes the tokenizer
    split on words, so the resulting features are word-level.

    Args:
        texts (list of str): List of texts.
        num_words (int or None): Maximum number of tokens to keep, by frequency.

    Returns:
        tokenizer: A fitted Tokenizer instance.
    """
    tokenizer = tf.keras.preprocessing.text.Tokenizer(num_words=num_words, char_level=False, lower=True)
    tokenizer.fit_on_texts(texts)
    return tokenizer

def texts_to_bow(tokenizer, texts):
    """
    Convert texts to a bag-of-characters representation.

    Args:
        tokenizer: A fitted character-level Tokenizer.
        texts (list of str): List of texts.

    Returns:
        Numpy array representing the binary bag-of-characters for each text.
    """
    # texts_to_matrix with mode 'binary' produces a fixed-length binary vector per text.
    matrix = tokenizer.texts_to_matrix(texts, mode='binary')
    return matrix

def one_hot_encode(labels, num_classes=2):
    """
    Convert numeric labels to one-hot encoded vectors.
    """
    return np.eye(num_classes)[labels]

# -------------------------------
# Load and Prepare the IMDB Dataset
# -------------------------------
print("Loading IMDB dataset...")
# Load the IMDB reviews dataset with the 'as_supervised' flag so that we get (text, label) pairs.
(ds_train, ds_test), ds_info = tfds.load('imdb_reviews',
                                           split=['train', 'test'],
                                           as_supervised=True,
                                           with_info=True)

# Convert training dataset to lists.
train_texts = []
train_labels = []
for text, label in tfds.as_numpy(ds_train):
    # Decode byte strings to utf-8 strings.
    train_texts.append(text.decode('utf-8'))
    train_labels.append(label)
train_labels = np.array(train_labels)

# Create a validation set from the training data (20% for validation).
train_texts, val_texts, train_labels, val_labels = train_test_split(
    train_texts, train_labels, test_size=0.2, random_state=42)

# Convert test dataset to lists.
test_texts = []
test_labels = []
for text, label in tfds.as_numpy(ds_test):
    test_texts.append(text.decode('utf-8'))
    test_labels.append(label)
test_labels = np.array(test_labels)

print(f"Train samples: {len(train_texts)}, Validation samples: {len(val_texts)}, Test samples: {len(test_texts)}")

# -------------------------------
# Preprocessing: Tokenization and Vectorization
# -------------------------------
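# Build the tokenizer on the training texts (word-level, because char_level=False).
tokenizer = char_level_tokenizer(train_texts)
# word_index holds every token seen during fitting; the feature vectors built
# below have only num_words columns (see X_train.shape[1]).
print("Tokenizer vocabulary size:", len(tokenizer.word_index) + 1)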

# Convert texts to bag-of-characters representation.
X_train = texts_to_bow(tokenizer, train_texts)
X_val   = texts_to_bow(tokenizer, val_texts)
X_test  = texts_to_bow(tokenizer, test_texts)

# Convert labels to one-hot encoding.
y_train = one_hot_encode(train_labels)
y_val   = one_hot_encode(val_labels)
y_test  = one_hot_encode(test_labels)

# -------------------------------
# Model Setup
# -------------------------------
# The input size is determined by the dimension of the bag-of-characters vector.
size_input = X_train.shape[1]
# Set hidden layer sizes as desired.
size_hidden1 = 256
size_hidden2 = 128
size_hidden3 = 32  # Placeholder (not used in the forward pass)
size_output  = 2

# Instantiate the MLP model.
model = MLP(size_input, size_hidden1, size_hidden2, size_hidden3, size_output, device=None)

# Define the optimizer.
optimizer = tf.keras.optimizers.Adam(learning_rate=0.0005)

# -------------------------------
# Training Parameters and Loop
# -------------------------------
batch_size = 128
epochs = 10
num_batches = int(np.ceil(X_train.shape[0] / batch_size))

print("\nStarting training...\n")
for epoch in range(epochs):
    # Shuffle training data at the start of each epoch.
    indices = np.arange(X_train.shape[0])
    np.random.shuffle(indices)
    X_train = X_train[indices]
    y_train = y_train[indices]

    epoch_loss = 0
    for i in range(num_batches):
        start = i * batch_size
        end = min((i+1) * batch_size, X_train.shape[0])
        X_batch = X_train[start:end]
        y_batch = y_train[start:end]

        # Forward pass used only to record the batch loss for logging.
        predictions = model.forward(X_batch)
        loss_value = model.loss(predictions, y_batch)
        # model.backward runs its own forward pass under a GradientTape
        # and returns the gradients for model.variables.
        grads = model.backward(X_batch, y_batch)
        optimizer.apply_gradients(zip(grads, model.variables))
        epoch_loss += loss_value.numpy() * (end - start)

    epoch_loss /= X_train.shape[0]

    # Evaluate on validation set.
    val_logits = model.forward(X_val)
    val_loss = model.loss(val_logits, y_val).numpy()
    val_preds = np.argmax(val_logits.numpy(), axis=1)
    true_val = np.argmax(y_val, axis=1)
    accuracy = np.mean(val_preds == true_val)
    precision = precision_score(true_val, val_preds)
    recall = recall_score(true_val, val_preds)

    print(f"Epoch {epoch+1:02d} | Training Loss: {epoch_loss:.4f} | Val Loss: {val_loss:.4f} | "
          f"Accuracy: {accuracy:.4f} | Precision: {precision:.4f} | Recall: {recall:.4f}")

# -------------------------------
# Final Evaluation on Test Set
# -------------------------------
print("\nEvaluating on test set...")
test_logits = model.forward(X_test)
test_loss = model.loss(test_logits, y_test).numpy()
test_preds = np.argmax(test_logits.numpy(), axis=1)
true_test = np.argmax(y_test, axis=1)
test_accuracy = np.mean(test_preds == true_test)
test_precision = precision_score(true_test, test_preds)
test_recall = recall_score(true_test, test_preds)

print(f"Test Loss: {test_loss:.4f} | Test Accuracy: {test_accuracy:.4f} | "
      f"Test Precision: {test_precision:.4f} | Test Recall: {test_recall:.4f}")


Loading IMDB dataset...
Train samples: 20000, Validation samples: 5000, Test samples: 25000
Tokenizer vocabulary size: 80169

Starting training...

Epoch 01 | Training Loss: 0.4650 | Val Loss: 0.3785 | Accuracy: 0.8362 | Precision: 0.8482 | Recall: 0.8065
Epoch 02 | Training Loss: 0.3107 | Val Loss: 0.3574 | Accuracy: 0.8458 | Precision: 0.8408 | Recall: 0.8412
Epoch 03 | Training Loss: 0.2617 | Val Loss: 0.3603 | Accuracy: 0.8472 | Precision: 0.8309 | Recall: 0.8597
Epoch 04 | Training Loss: 0.2135 | Val Loss: 0.3768 | Accuracy: 0.8432 | Precision: 0.8522 | Recall: 0.8185
Epoch 05 | Training Loss: 0.1606 | Val Loss: 0.4060 | Accuracy: 0.8376 | Precision: 0.8209 | Recall: 0.8507
Epoch 06 | Training Loss: 0.1043 | Val Loss: 0.4599 | Accuracy: 0.8340 | Precision: 0.8111 | Recall: 0.8573
Epoch 07 | Training Loss: 0.0585 | Val Loss: 0.5233 | Accuracy: 0.8308 | Precision: 0.8244 | Recall: 0.8271
Epoch 08 | Training Loss: 0.0277 | Val Loss: 0.6087 | Accuracy: 0.8300 | Precision: 0.8106 | Rec

## Random MLP on IMDB Dataset

In [7]:
import tensorflow as tf
import tensorflow_datasets as tfds
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.metrics import precision_score, recall_score
tf.random.set_seed(1221)
np.random.seed(1221)
# -------------------------------
# MLP with Frozen Random Hidden Layers (only the output layer is trained)
# -------------------------------
class MLP_rnd(object):
    def __init__(self, size_input, size_hidden1, size_hidden2, size_hidden3, size_output, device=None):
        """
        size_input: int, size of input layer
        size_hidden1: int, size of the 1st hidden layer
        size_hidden2: int, size of the 2nd hidden layer
        size_hidden3: int, size of the 3rd hidden layer (not used in compute_output here)
        size_output: int, size of output layer
        device: str or None, either 'cpu' or 'gpu' or None.
        """
        self.size_input = size_input
        self.size_hidden1 = size_hidden1
        self.size_hidden2 = size_hidden2
        self.size_hidden3 = size_hidden3  # (Currently not used in the forward pass)
        self.size_output = size_output
        self.device = device

        # Initialize weights and biases for first hidden layer
        self.W1 = tf.Variable(tf.random.normal([self.size_input, self.size_hidden1], stddev=0.1))
        self.b1 = tf.Variable(tf.zeros([1, self.size_hidden1]))

        # Initialize weights and biases for second hidden layer
        self.W2 = tf.Variable(tf.random.normal([self.size_hidden1, self.size_hidden2], stddev=0.1))
        self.b2 = tf.Variable(tf.zeros([1, self.size_hidden2]))

        # Initialize weights and biases for output layer
        self.W3 = tf.Variable(tf.random.normal([self.size_hidden2, self.size_output], stddev=0.1))
        self.b3 = tf.Variable(tf.zeros([1, self.size_output]))

        # Only the output layer is updated during training.
        # W1, b1, W2, b2 keep their random initial values (fixed random features).
        self.variables = [self.W3, self.b3]

    def forward(self, X):
        """
        Forward pass.
        X: Tensor, inputs.
        """
        if self.device is not None:
            with tf.device('gpu:0' if self.device == 'gpu' else 'cpu'):
                self.y = self.compute_output(X)
        else:
            self.y = self.compute_output(X)
        return self.y

    def loss(self, y_pred, y_true):
        """
        Computes the loss between predicted and true outputs.
        y_pred: Tensor of shape (batch_size, size_output)
        y_true: Tensor of shape (batch_size, size_output)
        """
        y_true_tf = tf.cast(y_true, dtype=tf.float32)
        y_pred_tf = tf.cast(y_pred, dtype=tf.float32)
        cce = tf.keras.losses.CategoricalCrossentropy(from_logits=True)
        loss_x = cce(y_true_tf, y_pred_tf)
        return loss_x

    def backward(self, X_train, y_train):
        """
        Backward pass: compute gradients of the loss with respect to the variables.
        """
        with tf.GradientTape() as tape:
            predicted = self.forward(X_train)
            current_loss = self.loss(predicted, y_train)
        grads = tape.gradient(current_loss, self.variables)
        return grads

    def compute_output(self, X):
        """
        Custom method to compute the output tensor during the forward pass.
        """
        # Cast X to float32
        X_tf = tf.cast(X, dtype=tf.float32)
        # First hidden layer
        h1 = tf.matmul(X_tf, self.W1) + self.b1
        z1 = tf.nn.relu(h1)
        # Second hidden layer
        h2 = tf.matmul(z1, self.W2) + self.b2
        z2 = tf.nn.relu(h2)
        # Output layer (logits)
        output = tf.matmul(z2, self.W3) + self.b3
        return output

# -------------------------------
# Character-Level Tokenizer and Preprocessing Functions
# -------------------------------
def char_level_tokenizer(texts, num_words=1000):
    """
    Create and fit a Keras Tokenizer on the given texts.

    Note: despite the function name, char_level=False makes the tokenizer
    split on words, so the resulting features are word-level.

    Args:
        texts (list of str): List of texts.
        num_words (int or None): Maximum number of tokens to keep, by frequency.

    Returns:
        tokenizer: A fitted Tokenizer instance.
    """
    tokenizer = tf.keras.preprocessing.text.Tokenizer(num_words=num_words, char_level=False, lower=True)
    tokenizer.fit_on_texts(texts)
    return tokenizer

def texts_to_bow(tokenizer, texts):
    """
    Convert texts to a bag-of-characters representation.

    Args:
        tokenizer: A fitted character-level Tokenizer.
        texts (list of str): List of texts.

    Returns:
        Numpy array representing the binary bag-of-characters for each text.
    """
    # texts_to_matrix with mode 'binary' produces a fixed-length binary vector per text.
    matrix = tokenizer.texts_to_matrix(texts, mode='binary')
    return matrix

def one_hot_encode(labels, num_classes=2):
    """
    Convert numeric labels to one-hot encoded vectors.
    """
    return np.eye(num_classes)[labels]

# -------------------------------
# Load and Prepare the IMDB Dataset
# -------------------------------
print("Loading IMDB dataset...")
# Load the IMDB reviews dataset with the 'as_supervised' flag so that we get (text, label) pairs.
(ds_train, ds_test), ds_info = tfds.load('imdb_reviews',
                                           split=['train', 'test'],
                                           as_supervised=True,
                                           with_info=True)

# Convert training dataset to lists.
train_texts = []
train_labels = []
for text, label in tfds.as_numpy(ds_train):
    # Decode byte strings to utf-8 strings.
    train_texts.append(text.decode('utf-8'))
    train_labels.append(label)
train_labels = np.array(train_labels)

# Create a validation set from the training data (20% for validation).
train_texts, val_texts, train_labels, val_labels = train_test_split(
    train_texts, train_labels, test_size=0.2, random_state=42)

# Convert test dataset to lists.
test_texts = []
test_labels = []
for text, label in tfds.as_numpy(ds_test):
    test_texts.append(text.decode('utf-8'))
    test_labels.append(label)
test_labels = np.array(test_labels)

print(f"Train samples: {len(train_texts)}, Validation samples: {len(val_texts)}, Test samples: {len(test_texts)}")

# -------------------------------
# Preprocessing: Tokenization and Vectorization
# -------------------------------
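# Build the tokenizer on the training texts (word-level, because char_level=False).
tokenizer = char_level_tokenizer(train_texts)
# word_index holds every token seen during fitting; the feature vectors built
# below have only num_words columns (see X_train.shape[1]).
print("Tokenizer vocabulary size:", len(tokenizer.word_index) + 1)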

# Convert texts to bag-of-characters representation.
X_train = texts_to_bow(tokenizer, train_texts)
X_val   = texts_to_bow(tokenizer, val_texts)
X_test  = texts_to_bow(tokenizer, test_texts)

# Convert labels to one-hot encoding.
y_train = one_hot_encode(train_labels)
y_val   = one_hot_encode(val_labels)
y_test  = one_hot_encode(test_labels)

# -------------------------------
# Model Setup
# -------------------------------
# The input size is determined by the dimension of the bag-of-characters vector.
size_input = X_train.shape[1]
# Set hidden layer sizes as desired.
size_hidden1 = 256
size_hidden2 = 128
size_hidden3 = 32  # Placeholder (not used in the forward pass)
size_output  = 2

# Instantiate the MLP model.
model = MLP_rnd(size_input, size_hidden1, size_hidden2, size_hidden3, size_output, device=None)

# Define the optimizer.
optimizer = tf.keras.optimizers.Adam(learning_rate=0.0005)

# -------------------------------
# Training Parameters and Loop
# -------------------------------
batch_size = 128
epochs = 10
num_batches = int(np.ceil(X_train.shape[0] / batch_size))

print("\nStarting training...\n")
for epoch in range(epochs):
    # Shuffle training data at the start of each epoch.
    indices = np.arange(X_train.shape[0])
    np.random.shuffle(indices)
    X_train = X_train[indices]
    y_train = y_train[indices]

    epoch_loss = 0
    for i in range(num_batches):
        start = i * batch_size
        end = min((i+1) * batch_size, X_train.shape[0])
        X_batch = X_train[start:end]
        y_batch = y_train[start:end]

        # Forward pass used only to record the batch loss for logging.
        predictions = model.forward(X_batch)
        loss_value = model.loss(predictions, y_batch)
        # model.backward runs its own forward pass under a GradientTape
        # and returns the gradients for model.variables (W3 and b3 only).
        grads = model.backward(X_batch, y_batch)
        optimizer.apply_gradients(zip(grads, model.variables))
        epoch_loss += loss_value.numpy() * (end - start)

    epoch_loss /= X_train.shape[0]

    # Evaluate on validation set.
    val_logits = model.forward(X_val)
    val_loss = model.loss(val_logits, y_val).numpy()
    val_preds = np.argmax(val_logits.numpy(), axis=1)
    true_val = np.argmax(y_val, axis=1)
    accuracy = np.mean(val_preds == true_val)
    precision = precision_score(true_val, val_preds)
    recall = recall_score(true_val, val_preds)

    print(f"Epoch {epoch+1:02d} | Training Loss: {epoch_loss:.4f} | Val Loss: {val_loss:.4f} | "
          f"Accuracy: {accuracy:.4f} | Precision: {precision:.4f} | Recall: {recall:.4f}")

# -------------------------------
# Final Evaluation on Test Set
# -------------------------------
print("\nEvaluating on test set...")
test_logits = model.forward(X_test)
test_loss = model.loss(test_logits, y_test).numpy()
test_preds = np.argmax(test_logits.numpy(), axis=1)
true_test = np.argmax(y_test, axis=1)
test_accuracy = np.mean(test_preds == true_test)
test_precision = precision_score(true_test, test_preds)
test_recall = recall_score(true_test, test_preds)

print(f"Test Loss: {test_loss:.4f} | Test Accuracy: {test_accuracy:.4f} | "
      f"Test Precision: {test_precision:.4f} | Test Recall: {test_recall:.4f}")


Loading IMDB dataset...
Train samples: 20000, Validation samples: 5000, Test samples: 25000
Tokenizer vocabulary size: 80169

Starting training...

Epoch 01 | Training Loss: 0.7183 | Val Loss: 0.6921 | Accuracy: 0.5556 | Precision: 0.5432 | Recall: 0.5243
Epoch 02 | Training Loss: 0.6803 | Val Loss: 0.6694 | Accuracy: 0.5898 | Precision: 0.5791 | Recall: 0.5635
Epoch 03 | Training Loss: 0.6599 | Val Loss: 0.6573 | Accuracy: 0.6062 | Precision: 0.5964 | Recall: 0.5804
Epoch 04 | Training Loss: 0.6481 | Val Loss: 0.6507 | Accuracy: 0.6154 | Precision: 0.5993 | Recall: 0.6238
Epoch 05 | Training Loss: 0.6414 | Val Loss: 0.6466 | Accuracy: 0.6206 | Precision: 0.6015 | Recall: 0.6440
Epoch 06 | Training Loss: 0.6366 | Val Loss: 0.6455 | Accuracy: 0.6234 | Precision: 0.5971 | Recall: 0.6861
Epoch 07 | Training Loss: 0.6333 | Val Loss: 0.6430 | Accuracy: 0.6276 | Precision: 0.6032 | Recall: 0.6778
Epoch 08 | Training Loss: 0.6312 | Val Loss: 0.6406 | Accuracy: 0.6318 | Precision: 0.6157 | Rec

## MLP with feedback alignment on IMDB Dataset
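
Feedback alignment leaves the forward pass unchanged but, in the backward pass, sends the error through fixed random matrices instead of the transposed forward weights. The NumPy sketch below isolates that one substitution for the last hidden layer. The shapes, the seed, and the random stand-in values for the pre-activations and the output error are assumptions made for illustration, not values from the training run.

```python
import numpy as np

rng = np.random.default_rng(0)
batch, n_hidden, n_out = 4, 5, 2  # toy sizes

h2 = rng.normal(size=(batch, n_hidden))             # stand-in pre-activations of hidden layer 2
W3 = rng.normal(0.0, 0.1, size=(n_hidden, n_out))   # forward weights of the output layer
B3 = rng.normal(size=(n_out, n_hidden))             # fixed random feedback matrix, never trained
delta3 = rng.normal(size=(batch, n_out))            # stand-in for the output error (y_pred - y_true)

relu_grad = (h2 > 0).astype(float)                  # derivative of ReLU at the pre-activations

# Standard backpropagation: error flows back through the transpose of W3.
delta2_backprop = (delta3 @ W3.T) * relu_grad

# Feedback alignment: the same step with B3 in place of W3.T.
delta2_fa = (delta3 @ B3) * relu_grad

print(delta2_backprop.shape, delta2_fa.shape)  # (4, 5) (4, 5)
```

Both error signals have the same shape and are zeroed at the same inactive units, so the weight-gradient formulas that consume them stay the same; only the direction of the propagated error differs.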

In [8]:
class MLP_FA(object):
    def __init__(self, size_input, size_hidden1, size_hidden2, size_hidden3, size_output, device=None):
        """
        size_input: int, size of input layer
        size_hidden1: int, size of the 1st hidden layer
        size_hidden2: int, size of the 2nd hidden layer
        size_hidden3: int, size of the 3rd hidden layer (Note: Not used in compute_output in this example)
        size_output: int, size of output layer
        device: str or None, either 'cpu' or 'gpu' or None.
        """
        self.size_input = size_input
        self.size_hidden1 = size_hidden1
        self.size_hidden2 = size_hidden2
        self.size_hidden3 = size_hidden3  # (Currently not used)
        self.size_output = size_output
        self.device = device

        # Initialize weights and biases for first hidden layer
        self.W1 = tf.Variable(tf.random.normal([self.size_input, self.size_hidden1], stddev=0.1))
        self.b1 = tf.Variable(tf.zeros([1, self.size_hidden1]))

        # Initialize weights and biases for second hidden layer
        self.W2 = tf.Variable(tf.random.normal([self.size_hidden1, self.size_hidden2], stddev=0.1))
        self.b2 = tf.Variable(tf.zeros([1, self.size_hidden2]))

        # Initialize weights and biases for output layer
        self.W3 = tf.Variable(tf.random.normal([self.size_hidden2, self.size_output], stddev=0.1))
        self.b3 = tf.Variable(tf.zeros([1, self.size_output]))

        # Create fixed random feedback matrices for feedback alignment:
        # B3: used to propagate the error from the output layer to the second hidden layer.
        # It replaces the use of W3^T. Its shape is (size_output, size_hidden2).
        self.B3 = tf.Variable(tf.random.normal([self.size_output, self.size_hidden2]), trainable=False)

        # B2: used to propagate the error from the second hidden layer to the first hidden layer.
        # Its shape is (size_hidden2, size_hidden1).
        self.B2 = tf.Variable(tf.random.normal([self.size_hidden2, self.size_hidden1]), trainable=False)

        # Define variables to be updated during training
        self.variables = [self.W1, self.W2, self.W3, self.b1, self.b2, self.b3]

    def forward(self, X):
        """
        Forward pass.
        X: Tensor, inputs.
        """
        if self.device is not None:
            with tf.device('gpu:0' if self.device == 'gpu' else 'cpu'):
                self.y = self.compute_output(X)
        else:
            self.y = self.compute_output(X)
        return self.y

    def loss(self, y_pred, y_true):
        """
        Computes the loss between predicted and true outputs.
        y_pred - Tensor of shape (batch_size, size_output)
        y_true - Tensor of shape (batch_size, size_output)
        """
        y_true_tf = tf.cast(y_true, dtype=tf.float32)
        y_pred_tf = tf.cast(y_pred, dtype=tf.float32)
        cce = tf.keras.losses.CategoricalCrossentropy(from_logits=True)
        loss_x = cce(y_true_tf, y_pred_tf)
        return loss_x

    def backward(self, X_train, y_train):
        """
        Backward pass using feedback alignment.
        Computes gradients manually using fixed random feedback matrices.
        X_train: Input data (numpy array)
        y_train: One-hot encoded labels (numpy array)
        Returns: List of gradients corresponding to [dW1, dW2, dW3, db1, db2, db3]
        """
        # Cast input to float32 tensor
        X_tf = tf.cast(X_train, tf.float32)

        # --- Forward Pass ---
        # First hidden layer
        h1 = tf.matmul(X_tf, self.W1) + self.b1
        a1 = tf.nn.relu(h1)
        # Second hidden layer
        h2 = tf.matmul(a1, self.W2) + self.b2
        a2 = tf.nn.relu(h2)
        # Output layer (logits)
        logits = tf.matmul(a2, self.W3) + self.b3
        # Softmax predictions
        y_pred = tf.nn.softmax(logits)

        # --- Compute Output Error ---
        # For cross-entropy with softmax, the derivative is (y_pred - y_true)
        delta3 = y_pred - tf.cast(y_train, tf.float32)  # shape: (batch, size_output)
        batch_size = tf.cast(tf.shape(X_tf)[0], tf.float32)

        # --- Gradients for Output Layer ---
        dW3 = tf.matmul(tf.transpose(a2), delta3) / batch_size
        db3 = tf.reduce_mean(delta3, axis=0, keepdims=True)

        # --- Feedback Alignment for Second Hidden Layer ---
        # Instead of delta2 = (delta3 dot W3^T) * ReLU'(h2), use a fixed random matrix B3.
        relu_grad_h2 = tf.cast(h2 > 0, tf.float32)
        # delta3 has shape (batch, size_output) and B3 has shape (size_output, size_hidden2)
        delta2 = tf.matmul(delta3, self.B3) * relu_grad_h2  # shape: (batch, size_hidden2)

        dW2 = tf.matmul(tf.transpose(a1), delta2) / batch_size
        db2 = tf.reduce_mean(delta2, axis=0, keepdims=True)

        # --- Feedback Alignment for First Hidden Layer ---
        # Instead of delta1 = (delta2 dot W2^T) * ReLU'(h1), use a fixed random matrix B2.
        relu_grad_h1 = tf.cast(h1 > 0, tf.float32)
        # delta2 has shape (batch, size_hidden2) and B2 has shape (size_hidden2, size_hidden1)
        delta1 = tf.matmul(delta2, self.B2) * relu_grad_h1  # shape: (batch, size_hidden1)

        dW1 = tf.matmul(tf.transpose(X_tf), delta1) / batch_size
        db1 = tf.reduce_mean(delta1, axis=0, keepdims=True)

        return [dW1, dW2, dW3, db1, db2, db3]

    def compute_output(self, X):
        """
        Custom method to obtain output tensor during the forward pass.
        """
        X_tf = tf.cast(X, dtype=tf.float32)
        h1 = tf.matmul(X_tf, self.W1) + self.b1
        z1 = tf.nn.relu(h1)
        h2 = tf.matmul(z1, self.W2) + self.b2
        z2 = tf.nn.relu(h2)
        output = tf.matmul(z2, self.W3) + self.b3
        return output


# -------------------------------
# Character-Level Tokenizer and Preprocessing Functions
# -------------------------------
def char_level_tokenizer(texts, num_words=1000):
    """
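    Create and fit a Keras Tokenizer on the given texts.

    Note: despite the function name, char_level=False below makes this a
    word-level tokenizer; set char_level=True for character-level tokens.

    Args:
        texts (list of str): List of texts.
        num_words (int or None): Maximum number of tokens to keep (most frequent first).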

    Returns:
        tokenizer: A fitted Tokenizer instance.
    """
    tokenizer = tf.keras.preprocessing.text.Tokenizer(num_words=num_words, char_level=False, lower=True)
    tokenizer.fit_on_texts(texts)
    return tokenizer

def texts_to_bow(tokenizer, texts):
    """
    Convert texts to a binary bag-of-tokens representation (bag-of-words or
    bag-of-characters, depending on how the tokenizer was fitted).

    Args:
        tokenizer: A fitted Tokenizer.
        texts (list of str): List of texts.

    Returns:
        Numpy array with one fixed-length 0/1 row per text.
    """
    # texts_to_matrix with mode 'binary' produces a fixed-length binary vector per text.
    matrix = tokenizer.texts_to_matrix(texts, mode='binary')
    return matrix

def one_hot_encode(labels, num_classes=2):
    """
    Convert numeric labels to one-hot encoded vectors.
    """
    return np.eye(num_classes)[labels]

# -------------------------------
# Load and Prepare the IMDB Dataset
# -------------------------------
print("Loading IMDB dataset...")
# Load the IMDB reviews dataset with the 'as_supervised' flag so that we get (text, label) pairs.
(ds_train, ds_test), ds_info = tfds.load('imdb_reviews',
                                           split=['train', 'test'],
                                           as_supervised=True,
                                           with_info=True)

# Convert training dataset to lists.
train_texts = []
train_labels = []
for text, label in tfds.as_numpy(ds_train):
    # Decode byte strings to utf-8 strings.
    train_texts.append(text.decode('utf-8'))
    train_labels.append(label)
train_labels = np.array(train_labels)

# Create a validation set from the training data (20% for validation).
train_texts, val_texts, train_labels, val_labels = train_test_split(
    train_texts, train_labels, test_size=0.2, random_state=42)

# Convert test dataset to lists.
test_texts = []
test_labels = []
for text, label in tfds.as_numpy(ds_test):
    test_texts.append(text.decode('utf-8'))
    test_labels.append(label)
test_labels = np.array(test_labels)

print(f"Train samples: {len(train_texts)}, Validation samples: {len(val_texts)}, Test samples: {len(test_texts)}")

# -------------------------------
# Preprocessing: Tokenization and Vectorization
# -------------------------------
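# Build the tokenizer on the training texts only (word-level here, since char_level=False).
tokenizer = char_level_tokenizer(train_texts)
# word_index lists every token seen; texts_to_matrix keeps only the top num_words (1000) columns.
print("Tokenizer vocabulary size:", len(tokenizer.word_index) + 1)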

# Convert texts to a binary bag-of-tokens representation.
X_train = texts_to_bow(tokenizer, train_texts)
X_val   = texts_to_bow(tokenizer, val_texts)
X_test  = texts_to_bow(tokenizer, test_texts)

# Convert labels to one-hot encoding.
y_train = one_hot_encode(train_labels)
y_val   = one_hot_encode(val_labels)
y_test  = one_hot_encode(test_labels)

# -------------------------------
# Model Setup
# -------------------------------
# The input size is the dimension of the bag-of-tokens vector (num_words columns).
size_input = X_train.shape[1]
# Set hidden layer sizes as desired.
size_hidden1 = 256
size_hidden2 = 128
size_hidden3 = 32  # Placeholder (not used in the forward pass)
size_output  = 2

# Instantiate the MLP model.
model = MLP_FA(size_input, size_hidden1, size_hidden2, size_hidden3, size_output, device=None)

# Define the optimizer.
optimizer = tf.keras.optimizers.Adam(learning_rate=0.0005)

# -------------------------------
# Training Parameters and Loop
# -------------------------------
batch_size = 128
epochs = 10
num_batches = int(np.ceil(X_train.shape[0] / batch_size))

print("\nStarting training...\n")
for epoch in range(epochs):
    # Shuffle training data at the start of each epoch.
    indices = np.arange(X_train.shape[0])
    np.random.shuffle(indices)
    X_train = X_train[indices]
    y_train = y_train[indices]

    epoch_loss = 0
    for i in range(num_batches):
        start = i * batch_size
        end = min((i+1) * batch_size, X_train.shape[0])
        X_batch = X_train[start:end]
        y_batch = y_train[start:end]

        # Compute gradients and update weights.
        # with tf.GradientTape() as tape:
        #     predictions = model.forward(X_batch)
        #     loss_value = model.loss(predictions, y_batch)
        # grads = tape.gradient(loss_value, model.variables)
        predictions = model.forward(X_batch)
        loss_value = model.loss(predictions, y_batch)
        grads = model.backward(X_batch, y_batch)
        optimizer.apply_gradients(zip(grads, model.variables))
        epoch_loss += loss_value.numpy() * (end - start)

    epoch_loss /= X_train.shape[0]

    # Evaluate on validation set.
    val_logits = model.forward(X_val)
    val_loss = model.loss(val_logits, y_val).numpy()
    val_preds = np.argmax(val_logits.numpy(), axis=1)
    true_val = np.argmax(y_val, axis=1)
    accuracy = np.mean(val_preds == true_val)
    precision = precision_score(true_val, val_preds)
    recall = recall_score(true_val, val_preds)

    print(f"Epoch {epoch+1:02d} | Training Loss: {epoch_loss:.4f} | Val Loss: {val_loss:.4f} | "
          f"Accuracy: {accuracy:.4f} | Precision: {precision:.4f} | Recall: {recall:.4f}")

# -------------------------------
# Final Evaluation on Test Set
# -------------------------------
print("\nEvaluating on test set...")
test_logits = model.forward(X_test)
test_loss = model.loss(test_logits, y_test).numpy()
test_preds = np.argmax(test_logits.numpy(), axis=1)
true_test = np.argmax(y_test, axis=1)
test_accuracy = np.mean(test_preds == true_test)
test_precision = precision_score(true_test, test_preds)
test_recall = recall_score(true_test, test_preds)

print(f"Test Loss: {test_loss:.4f} | Test Accuracy: {test_accuracy:.4f} | "
      f"Test Precision: {test_precision:.4f} | Test Recall: {test_recall:.4f}")

Loading IMDB dataset...
Train samples: 20000, Validation samples: 5000, Test samples: 25000
Tokenizer vocabulary size: 80169

Starting training...

Epoch 01 | Training Loss: 0.6719 | Val Loss: 0.4389 | Accuracy: 0.8062 | Precision: 0.7748 | Recall: 0.8461
Epoch 02 | Training Loss: 0.3739 | Val Loss: 0.3686 | Accuracy: 0.8458 | Precision: 0.8087 | Recall: 0.8932
Epoch 03 | Training Loss: 0.3217 | Val Loss: 0.3526 | Accuracy: 0.8550 | Precision: 0.8509 | Recall: 0.8498
Epoch 04 | Training Loss: 0.3025 | Val Loss: 0.3559 | Accuracy: 0.8518 | Precision: 0.8203 | Recall: 0.8890
Epoch 05 | Training Loss: 0.2823 | Val Loss: 0.3541 | Accuracy: 0.8530 | Precision: 0.8300 | Recall: 0.8762
Epoch 06 | Training Loss: 0.2607 | Val Loss: 0.3703 | Accuracy: 0.8486 | Precision: 0.8200 | Recall: 0.8812
Epoch 07 | Training Loss: 0.2344 | Val Loss: 0.3827 | Accuracy: 0.8430 | Precision: 0.8618 | Recall: 0.8053
Epoch 08 | Training Loss: 0.2043 | Val Loss: 0.4065 | Accuracy: 0.8384 | Precision: 0.8669 | Rec

# Assignment 2 Todos

## Overview
- **Objective:**  
  Modify your model’s text preprocessing by changing from character-level tokenization to word-level tokenization. Compare the performance of both tokenization methods. Additionally, perform hyper-parameter optimization by experimenting with various settings (learning rate, hidden layers, hidden sizes, batch sizes, optimizers, and activation functions) and report your findings.

## 1. Initial Setup
- [Done] **Set Random Seeds:**  
  Ensure reproducibility by setting seeds for all random number generators (e.g., Python’s `random`, NumPy, TensorFlow/PyTorch).
  
- [Done] **Prepare the Environment:**  
  - Create a new or update an existing Jupyter Notebook.
  - Ensure that all necessary libraries (e.g., NumPy, pandas, TensorFlow/PyTorch, matplotlib, etc.) are installed.
  
- [Done] **Version Control:**  
  Initialize a Git repository (if not already done) and commit your initial setup.
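
A minimal sketch of the seed-setting step, assuming a hypothetical helper named `set_global_seed` (not part of the notebook). It seeds Python's `random` module and, only if they are installed, NumPy and TensorFlow:

```python
import os
import random


def set_global_seed(seed: int) -> None:
    """Seed Python's RNG, plus NumPy and TensorFlow when they are installed."""
    # PYTHONHASHSEED only affects interpreters started after it is set
    # (e.g. subprocesses); it is recorded here for completeness.
    os.environ["PYTHONHASHSEED"] = str(seed)
    random.seed(seed)
    try:
        import numpy as np
        np.random.seed(seed)
    except ImportError:
        pass
    try:
        import tensorflow as tf
        tf.random.set_seed(seed)
    except ImportError:
        pass


set_global_seed(1234)
first_draw = random.random()
set_global_seed(1234)
second_draw = random.random()
print(first_draw == second_draw)  # True: same seed, same sequence
```

Calling the helper once at the top of the notebook, before the data split and model construction, is one reasonable placement. Note that seeding alone may not make GPU runs bit-for-bit identical.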

## 2. Data Preprocessing
- [Done] **Load Dataset:**  
  Load your dataset into the notebook.
  
- [Done] **Tokenization:**
  - **Character-Level Tokenization:**  
    - Tokenize the text data at the character level.
    - Save and log the processed data.
  - **Word-Level Tokenization:**  
    - Modify the tokenization process to tokenize the text by words.
    - Save and log the processed data.
    
- [Done] **Comparison:**  
  - Create a section in your notebook to compare the two tokenization approaches.
  - Visualize or tabulate differences in vocabulary size, sequence lengths, and other relevant metrics.

**Findings from tokenization comparison:**
- Bag of Words: Loss: 0.9187, Accuracy: 0.8382
- Bag of Characters: Loss: 0.6614, Accuracy: 0.6064
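
The gap between the two representations can be illustrated on toy data. The sketch below is an assumption-laden stand-in for the Keras tokenizer: `build_vocab` and `binary_bag` are hypothetical helpers, words are split on whitespace only, and the two sentences are made-up examples rather than IMDB reviews.

```python
def build_vocab(texts, char_level):
    """Map each distinct token (character or whitespace-separated word) to an index."""
    tokens = set()
    for text in texts:
        text = text.lower()
        tokens.update(text if char_level else text.split())
    return {tok: i for i, tok in enumerate(sorted(tokens))}


def binary_bag(text, vocab, char_level):
    """Return a 0/1 list marking which vocabulary tokens occur in the text."""
    text = text.lower()
    present = set(text if char_level else text.split())
    return [1 if tok in present else 0 for tok in vocab]


toy_texts = ["a good movie", "a bad movie"]  # hypothetical examples, not IMDB data
char_vocab = build_vocab(toy_texts, char_level=True)
word_vocab = build_vocab(toy_texts, char_level=False)

print(len(char_vocab), len(word_vocab))  # 10 4
print(binary_bag("a good movie", word_vocab, char_level=False))  # [1, 0, 1, 1]
```

In this toy case the two character-level vectors differ only in the entries for `g` and `b`, while the word-level vectors differ in the entries for `good` and `bad`. That is one plausible reason a binary bag of characters carries less sentiment signal than a bag of words.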

## 3. Model Architecture
- [Done] **Define the Model:**  
  Develop a model (or models) that can handle both tokenization types. Include the following adjustable hyper-parameters:
  - Learning rate
  - Number of hidden layers
  - Hidden sizes (neurons per layer)
  - Batch sizes
  - Optimizers (e.g., Adam, SGD, RMSProp)
  - Activation functions (e.g., ReLU, Tanh, LeakyReLU)

## 4. Hyper-Parameter Optimization
- [Done] **Experiment Setup:**  
  For each hyper-parameter configuration, perform at least 3 different tests to ensure robustness.
  
- [Done] **Grid/Random Search:**  
  Set up a search over the following hyper-parameter ranges (example values provided):
  - **Learning Rate:** `[0.001, 0.0005, 0.0001]`
  - **Hidden Layers:** `[1, 2, 3]`
  - **Hidden Sizes:** `[128, 256, 512]`
  - **Batch Sizes:** `[32, 64, 128]`
  - **Optimizers:** `[Adam, SGD, RMSProp]`
  - **Activation Functions:** `[ReLU, Tanh, LeakyReLU]`
  
- [Done] **Logging:**  
  Record the results (accuracy, loss, etc.) for each configuration in tables or charts.
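
As a rough sketch of the search setup, the example ranges listed above can be expanded into explicit configurations with `itertools.product`. The dictionary keys and the random-search sample size of 20 are illustrative assumptions, not values taken from the notebook.

```python
import itertools
import random

search_space = {
    "learning_rate": [0.001, 0.0005, 0.0001],
    "hidden_layers": [1, 2, 3],
    "hidden_size": [128, 256, 512],
    "batch_size": [32, 64, 128],
    "optimizer": ["Adam", "SGD", "RMSProp"],
    "activation": ["ReLU", "Tanh", "LeakyReLU"],
}

# Full grid: one dict per combination of values.
keys = list(search_space)
grid = [dict(zip(keys, values)) for values in itertools.product(*search_space.values())]
print(len(grid))  # 3**6 = 729 configurations

# Random search: evaluate only a fixed-size sample of the grid.
rng = random.Random(0)
sampled = rng.sample(grid, k=20)
```

With 729 combinations and at least 3 runs each, the full grid implies over 2,000 training runs, which is why sampling a subset is often the more practical option.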

**Best Model:**
- Learning rate: 0.0005
- Number of hidden layers: 2
- Hidden layer 1 size: 256
- Batch size: 128
- Optimizer: Adam
- Activation function: ReLU

Test accuracy: 0.8652

## 5. Model Training and Evaluation
- [Done] **Training with Each Configuration:**  
  Run experiments for both tokenization approaches with each set of hyper-parameters:
  - Train the model at least 3 times per configuration (keeping the seed constant at this stage).
  - Log training and validation performance.
  
- [Done] **Identify the Best Model:**  
  Select the best performing configuration based on validation metrics (e.g., accuracy).
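
One way to pick the best configuration from repeated runs is to average the validation accuracy per configuration and take the maximum. The sketch below assumes a simple log of `(configuration name, validation accuracy)` pairs. The names and numbers are placeholders, not results from the notebook.

```python
from collections import defaultdict
from statistics import mean

# Hypothetical log entries: (configuration name, validation accuracy of one run).
run_log = [
    ("config_a", 0.80), ("config_a", 0.82), ("config_a", 0.81),
    ("config_b", 0.85), ("config_b", 0.83), ("config_b", 0.84),
]

# Group the per-run accuracies by configuration.
by_config = defaultdict(list)
for name, val_acc in run_log:
    by_config[name].append(val_acc)

# Average over the repeated runs, then select the highest mean.
mean_acc = {name: mean(accs) for name, accs in by_config.items()}
best_config = max(mean_acc, key=mean_acc.get)
print(best_config)  # config_b
```

Selecting on validation accuracy rather than test accuracy keeps the test set untouched until the final evaluation.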

## 6. Final Experiments
- [Done] **Robustness Check:**  
  Once the best model is identified:
  - Re-run the experiments at least 3 times with different random seeds.
  - Record the performance (accuracy) for each run.
  
- [Done] **Statistical Reporting:**  
  - Compute the **mean accuracy** and **standard error** across these runs.
  - Include these statistics in your report.
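
For n runs with sample standard deviation s, the standard error of the mean is s / sqrt(n). A minimal sketch follows, assuming a hypothetical helper `mean_and_standard_error`. The three accuracies are placeholders to be replaced with the measured per-seed values.

```python
import math
import statistics


def mean_and_standard_error(values):
    """Return (mean, standard error), using the sample standard deviation (n - 1)."""
    n = len(values)
    if n < 2:
        raise ValueError("need at least two runs to estimate a standard error")
    return statistics.mean(values), statistics.stdev(values) / math.sqrt(n)


# Placeholder accuracies for three seeds; substitute the measured values.
accuracies = [0.85, 0.86, 0.87]
mean_acc, std_err = mean_and_standard_error(accuracies)
print(f"{mean_acc:.4f} +/- {std_err:.4f}")
```

With only three runs the standard error is itself a noisy estimate, so it is worth reporting the number of seeds alongside it.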

## 7. Documentation and Reporting
- [Done] **Jupyter Notebook:**  
  - Ensure that your notebook is well-commented and clearly documents each step.
  - Include code cells for setting seeds, data preprocessing, model building, training, evaluation, and visualization.
  
- [Done] **Detailed Report (Word Document):**  
  Prepare a report that includes:
  - **Introduction:** Objectives and overview of the work.
  - **Methodology:** Detailed explanation of tokenization changes and hyper-parameter optimization strategy.
  - **Experiments and Results:**  
    - Comparison between character-level and word-level tokenization.
    - Tables/graphs for hyper-parameter experiments.
    - Final model performance with mean accuracy and standard error.
  - **Discussion:** Analysis of results, challenges encountered, and insights.
  - **Conclusion:** Summarize the key findings.
  
- [ ] **Submission:**  
  - Submit your Jupyter Notebook.
  - Submit your Word document report.
  - Ensure that both files are included in your repository or submission package.

## 8. Final Checklist
- [ ] All experiments have at least 3 different tests.
- [ ] Random seeds are set before any experiment.
- [ ] Hyper-parameter optimization covers changes in learning rate, hidden layers, hidden sizes, batch sizes, optimizers, and activation functions.
- [ ] The best model’s performance is verified with experiments on different seeds.
- [ ] The best model is compared with the random model shown above.
- [ ] The report clearly documents the methodology, experiments, results, and final conclusions.
- [ ] Experiments with a deeper MLP_FA using the best settings are shown (extra credit: 2 points).

---

> **Note:**  
> Keep thorough logs and document any observations during your experiments. Clear documentation is key to reproducibility and understanding your results.

