# Tutorial 3: Intro to Recurrent Neural Networks: Math, Training, and the Copy Task

# Instructor: Dr. Ankur Mali
# University of South Florida (Spring 2025)
# Student: Abrar Zahin

#Objective
The objective of this experiment is to compare the ability of the different model architectures of RNN  to perform a copy task.

A copy task is a simple task that tests a model's ability to remember sequences of data over a time period. RNN's are able to complete this task due to their ability to maintain a memory through their hidden state and Back Propagation through time (BPTT).



The models are:
- Standard LSTM
- Multiplicative LSTM
- Standard GRU
- Multiplicative GRU

These gated architectures are used to address the issues faced by vanilla RNNs. While processing long sequences over time, the gradients of RNNs tend to converge towards zero (for vanishing) or towards infinity (for exploding). Thus, learning long range dependencies is inhibited.

The LSTM models maintain a memory cell and the GRU uses a gated architecuture to allow long-term dependencies. The multiplicative variants of LSTM and GRU use

#Experimental procedure
We begin by implementing the four architectures: Standard LSTM, Multiplicative LSTM, Standard GRU and Multiplicative GRU.

**CopyTaskDataset:**

We begin by generating a random dataset for the copytask. In the class CopyTaskDataset(Dataset) we generated a sequence of random integers using NumPy. The sequence is of a certain length and vocabulary size (types of tokens) and the amount of these sequences (num_samples) is specified as well. In the input sequence, the sequence is generated followed by an equal number of delimiters. In the target label, the delimiters are returned and the original sequence follows.

The input sequence is then added to a vector called data that is converted to a pytorch tensor of shape [num_samples, seq_length, 1]
The target sequence is also added to a vector called labels and converted to a tensor of the same shape.

Since we're using CrossEntropyLoss calculation for the training, we're also performing one hot encoding on the labels. CrossEntropyLoss expects one hot encoded labels.

**LSTMCellPytorch:**

In a function called LSTMPytorch, we define the model by initializing all the weight, bias, inputs and outputs. These parameters are first defined as zeros and transformed into their respective shapes for calculations.

The forward function contains all the equations for a single forward pass of the model cell.

**LSTMPytorch**

The LSTMPytorch function unrolls the model over each full sequence. At first we intialize the hidden and cell state to retain the memory. Using a for loop, we go through all time steps one at a time. Thus, the output of each time step is used to derive the new hidden state.

**train_lstm:**

The train_lstm function trains the model using a optimizer, and training the model in epochs. In each epoch, loss is calculated. The optimizer we use is sigmoid function.

**evaluate_lstm:**

This function tests the model. The sample is divided into chunks and a hidded state is maintained sequentially through the chunks. The accuracy is calculated through accumulating the accuracies throughout the chunks.

#Training and testing:

We train and test the models with the same parameters thorugh different sequence lengths T = {100, 200, 500, 1000}.

**Metrics**:
We calculate the loss for each epoch during training and calculate the mean loss. During testing, we calculate the accuracy.

For all four models, we compare these metrics for four different sequence lengths.




In [None]:
import torch
from torch.utils.data import Dataset, DataLoader  #Dataset and Dataloader used for training data
import time                   #For training time calculation
import numpy as np

#This function generates a dataset of integer symbols, to be used for training the models
class CopyTaskDataset(Dataset):
    def __init__(self, seq_length=100, vocab_size=10, num_samples=1000):
        self.seq_length = seq_length            #1000 pairs of sequences of len 100 and 10 type of tokens
        self.vocab_size = vocab_size
        self.num_samples = num_samples
        self.delimiter_token = vocab_size
        self.input_size = vocab_size + 1             #Number of unique tokens + Delimiter
        self.data, self.labels = self.generate_data()

    def generate_data(self):
        data = []
        labels = []
        for _ in range(self.num_samples):
            sequence = np.random.randint(0, self.vocab_size, size=(self.seq_length))

            input_seq = np.concatenate([sequence, [self.delimiter_token] * self.seq_length])  #Original sequence first and delimiter tokens follow

            target_seq = np.concatenate([[self.delimiter_token] * self.seq_length, sequence])   #Delimiter tokens first and original sequence follows

            data.append(input_seq)
            labels.append(target_seq)

        # We reshape the data to have 3 dimensions and convert them to pytorch tensors (num_samples, seq_length, 1)
        data = torch.tensor(data, dtype=torch.long)
        labels = torch.tensor(labels, dtype=torch.long)  #long indices for CrossEntropy calculation

        data_one_hot = torch.nn.functional.one_hot(data, num_classes=self.input_size).float()      #One hot encoding for training
        #labels_one_hot = torch.nn.functional.one_hot(labels, num_classes=self.input_size).float()
        return data_one_hot, labels

    def __len__(self):
        return self.num_samples       #This function returns the length

    def __getitem__(self, idx):
        return self.data[idx], self.labels[idx]     #To get the indices


###
###LSTM Implementation
class LSTMCellPytorch(torch.nn.Module):
    """
    A single-step RNN cell in PyTorch.
    h_t = tanh( x_t * W_x + h_{t-1} * W_h + b )

    LSTM:
    i_t = sigmoid(x_t * W_xi + h_{t-1} * U_i + b_i)
    f_t = sigmoid(x_t * W_xf + h_{t-1} * U_f + b_f)
    o_t = sigmoid(x_t * W_xo + h_{t-1} * U_o + b_o)
    c~_t = tanh(x_t * W_xc + h_{t-1} * U_c + b_c)
    c_t = f_t * c_{t-1} + i_t * c~_t
    h_t = o_t * tanh(c_t)

    """
    def __init__(self, input_size, hidden_size):
        super().__init__()
        self.hidden_size = hidden_size

        #Initialize the parameters used
        #input and forget gates:
        self.W_xi = torch.nn.Parameter(torch.randn(input_size, hidden_size) * 0.1)
        self.U_i = torch.nn.Parameter(torch.randn(hidden_size, hidden_size) * 0.1)
        self.W_xf = torch.nn.Parameter(torch.randn(input_size, hidden_size) * 0.1)
        self.U_f = torch.nn.Parameter(torch.randn(hidden_size, hidden_size) * 0.1)
        self.b_i = torch.nn.Parameter(torch.zeros(hidden_size))
        self.b_f = torch.nn.Parameter(torch.zeros(hidden_size))

        #output gate:
        self.W_xo = torch.nn.Parameter(torch.randn(input_size, hidden_size) * 0.1)
        self.U_o = torch.nn.Parameter(torch.randn(hidden_size, hidden_size) * 0.1)
        self.b_o = torch.nn.Parameter(torch.zeros(hidden_size))

        #Memory cell gate:
        self.W_xc = torch.nn.Parameter(torch.randn(input_size, hidden_size) * 0.1)
        self.U_c = torch.nn.Parameter(torch.randn(hidden_size, hidden_size) * 0.1)
        self.b_c = torch.nn.Parameter(torch.zeros(hidden_size))
    def forward(self, x_t, h_prev, c_prev):
        i_t = torch.sigmoid(x_t @ self.W_xi + h_prev @ self.U_i + self.b_i)
        f_t = torch.sigmoid(x_t @ self.W_xf + h_prev @ self.U_f + self.b_f)
        o_t = torch.sigmoid(x_t @ self.W_xo + h_prev @ self.U_o + self.b_o)
        c_t = f_t * c_prev + i_t * torch.tanh(x_t @ self.W_xc + h_prev @ self.U_c + self.b_c)
        h_t = o_t * torch.tanh(c_t)
        return i_t, f_t, o_t, c_t, h_t

###
#LSTM Unrolling over time function
class LSTMPyTorch(torch.nn.Module):
    """
    Unrolls the LSTM over a full sequence.
    """
    def __init__(self, input_size, hidden_size):
        super().__init__()
        self.hidden_size = hidden_size
        self.rnn_cell = LSTMCellPytorch(input_size, hidden_size)
        self.W_out = torch.nn.Parameter(torch.randn(hidden_size, input_size) * 0.1)

        self.b_out = torch.nn.Parameter(torch.zeros(input_size))

    def forward(self, X, hidden_state= None):
        batch_size, seq_length, _ = X.shape

        if hidden_state is None:
            h = torch.zeros(batch_size, self.hidden_size, device=X.device)
            c = torch.zeros(batch_size, self.hidden_size, device=X.device)
        else:
            h, c = hidden_state

        outputs = []
        for t in range(seq_length):
            x_t = X[:, t, :]  # Shap [batch_size, input_size]
            i_t, f_t, o_t, c_t, h_t = self.rnn_cell(x_t, h, c)
            out_t = h @ self.W_out + self.b_out
            outputs.append(out_t.unsqueeze(1))  # shape [batch_size,1,input_size]

            h, c = h_t, c_t
        # Concatenate across time:
        return torch.cat(outputs, dim=1), (h, c)  # [batch_size, seq_length, input_size]

###

# The training function- train_LSTM
def train_LSTM(model, train_loader, epochs=10, lr=0.01):
    device = torch.device("cuda")
    model.to(device)
    optimizer = torch.optim.SGD(model.parameters(), lr=lr, momentum=0.9)
    criterion = torch.nn.CrossEntropyLoss()

    start_time = time.time()

    hidden_state = None        #hidden state maintained during training

    for epoch in range(epochs):
        total_loss = 0
        total_batches = 0

        for X_batch, Y_batch in train_loader:
            X_batch, Y_batch = X_batch.to(device), Y_batch.to(device)


            optimizer.zero_grad()
            output, hidden_state = model(X_batch, hidden_state)
            hidden_state = (hidden_state[0].detach(), hidden_state[1].detach())   #Detaches from previous hidden state to prevent exploding gradient

            loss = criterion(output.view(-1, input_size), Y_batch.view(-1))
            loss.backward()
            optimizer.step()

            total_loss += loss.item()
            total_batches += 1

        epoch_loss = total_loss / total_batches
        print(f"Epoch {epoch+1}/{epochs} | Loss: {epoch_loss:.6f}")

    return time.time() - start_time, loss

#The testing function- evaluate_LSTM:
def evaluate_LSTM(model, test_loader, chunk_size = 100):
    model.eval()
    correct_predictions = 0
    total_predictions = 0
    device = next(model.parameters()).device


    with torch.no_grad():
        for X_batch, Y_batch in test_loader:
            X_batch, Y_batch = X_batch.to(device), Y_batch.to(device)
            batch_size, seq_length, _ = X_batch.shape
            hidden_state = None      #hiddnt state to be maintained throughout the batches
            num_chunks = seq_length // chunk_size

            chunk_correct = []
            chunk_total = []
            outputs = []
            for i in range(num_chunks):
                start = i * chunk_size
                end = (i + 1) * chunk_size

                X_chunk = X_batch[:, start:end, :]
                Y_chunk = Y_batch[:, start:end]

                output, hidden_state = model(X_chunk, hidden_state)
                hidden_state = (hidden_state[0].detach(), hidden_state[1].detach())
                outputs.append(output)


            outputs = torch.cat(outputs, dim=1)  # Reconstruct full sequence
            predicted_labels = torch.argmax(outputs, dim=-1)

            correct_predictions += (predicted_labels == Y_batch).sum().item()
            total_predictions += predicted_labels.numel()


    accuracy = correct_predictions / total_predictions if total_predictions > 0 else 0
    return accuracy

###
#For training dataset:
seq_length = 100
vocab_size = 10
num_samples = 1500
batch_size = 64

# Creating dataset and Dataloader for training:
dataset = CopyTaskDataset(seq_length, vocab_size, num_samples)
train_loader = DataLoader(dataset, batch_size=batch_size, shuffle=True, drop_last = True)

# Define model
input_size = vocab_size + 1  # Including delimiter token
hidden_size = 128
model = LSTMPyTorch(input_size, hidden_size)

# Train model:
training_time, loss = train_LSTM(model, train_loader, epochs=10, lr=0.005)
print(f"Training completed in {training_time:.2f} seconds")
print(f"Mean Loss: {loss:.4f}")
###

###
#For testing dataset:
seq_length = 1000
vocab_size = 10
num_samples = 1500
batch_size = 64

#Dataset and loader for testing:
test_dataset = CopyTaskDataset(seq_length, vocab_size, num_samples)
test_loader = DataLoader(dataset, batch_size=batch_size, shuffle=True, drop_last = True)

#Test the model:
test_accuracy = evaluate_LSTM(model, test_loader, chunk_size = 100)
print(f"LSTM Test Accuracy: {test_accuracy:.4f}")


Epoch 1/10 | Loss: 2.308744
Epoch 2/10 | Loss: 2.057651
Epoch 3/10 | Loss: 1.831775
Epoch 4/10 | Loss: 1.643091
Epoch 5/10 | Loss: 1.449918
Epoch 6/10 | Loss: 1.382073
Epoch 7/10 | Loss: 1.360832
Epoch 8/10 | Loss: 1.342116
Epoch 9/10 | Loss: 1.329215
Epoch 10/10 | Loss: 1.319866
Training completed in 69.95 seconds
Mean Loss: 1.3162
LSTM Test Accuracy: 0.5458


##Results

**T = 100**

Epoch 1/10 | Loss: 2.307068                                                           
Epoch 2/10 | Loss: 2.052329 <br>
Epoch 3/10 | Loss: 1.805295 <br>
Epoch 4/10 | Loss: 1.594715<br>
Epoch 5/10 | Loss: 1.412653<br>
Epoch 6/10 | Loss: 1.366210<br>
Epoch 7/10 | Loss: 1.349931<br>
Epoch 8/10 | Loss: 1.335301<br>
Epoch 9/10 | Loss: 1.324294<br>
Epoch 10/10 | Loss: 1.315721<br>
Training completed in 65.02 seconds<br>
Mean Loss: 1.3121<br>
LSTM Test Accuracy: 0.5460<br>

**T=200**

Epoch 1/10 | Loss: 2.300380<br>
Epoch 2/10 | Loss: 2.021753<br>
Epoch 3/10 | Loss: 1.737184<br>
Epoch 4/10 | Loss: 1.494123<br>
Epoch 5/10 | Loss: 1.383652<br>
Epoch 6/10 | Loss: 1.361969<br>
Epoch 7/10 | Loss: 1.344759<br>
Epoch 8/10 | Loss: 1.331481<br>
Epoch 9/10 | Loss: 1.321371<br>
Epoch 10/10 | Loss: 1.313421<br>
Training completed in 72.52 seconds<br>
Mean Loss: 1.3093<br>
LSTM Test Accuracy: 0.5461<br>

**T=500**

Epoch 1/10 | Loss: 2.321768<br>
Epoch 2/10 | Loss: 2.088610<br>
Epoch 3/10 | Loss: 1.870382<br>
Epoch 4/10 | Loss: 1.686539<br>
Epoch 5/10 | Loss: 1.494767<br>
Epoch 6/10 | Loss: 1.382270<br>
Epoch 7/10 | Loss: 1.360358<br>
Epoch 8/10 | Loss: 1.342912<br>
Epoch 9/10 | Loss: 1.329381<br>
Epoch 10/10 | Loss: 1.319294<br>
Training completed in 74.14 seconds<br>
Mean Loss: 1.3152<br>
LSTM Test Accuracy: 0.5456<br>

**T=1000**

        
Epoch 1/10 | Loss: 2.308744 <br>
Epoch 2/10 | Loss: 2.057651 <br>
Epoch 3/10 | Loss: 1.831775<br>
Epoch 4/10 | Loss: 1.643091<br>
Epoch 5/10 | Loss: 1.449918<br>
Epoch 6/10 | Loss: 1.382073<br>
Epoch 7/10 | Loss: 1.360832<br>
Epoch 8/10 | Loss: 1.342116<br>
Epoch 9/10 | Loss: 1.329215<br>
Epoch 10/10 | Loss: 1.319866<br>
Training completed in 69.95 seconds<br>
Mean Loss: 1.3162<br>
LSTM Test Accuracy: 0.5458<br>



---

#Multiplicative LSTM:

Now we train and test a multiplicative LSTM cell with identical hyperparameters.



In [7]:
import torch
from torch.utils.data import Dataset, DataLoader  #Dataset and Dataloader used for training data
import time                   #For training time calculation
import numpy as np

#This function generates a dataset of integer symbols, to be used for training the models
class CopyTaskDataset(Dataset):
    def __init__(self, seq_length=100, vocab_size=10, num_samples=1000):
        self.seq_length = seq_length            #1000 pairs of sequences of len 100 and 10 type of tokens
        self.vocab_size = vocab_size
        self.num_samples = num_samples
        self.delimiter_token = vocab_size
        self.input_size = vocab_size + 1             #Number of unique tokens + Delimiter
        self.data, self.labels = self.generate_data()

    def generate_data(self):
        data = []
        labels = []
        for _ in range(self.num_samples):
            sequence = np.random.randint(0, self.vocab_size, size=(self.seq_length))

            input_seq = np.concatenate([sequence, [self.delimiter_token] * self.seq_length])  #Original sequence first and delimiter tokens follow

            target_seq = np.concatenate([[self.delimiter_token] * self.seq_length, sequence])   #Delimiter tokens first and original sequence follows

            data.append(input_seq)
            labels.append(target_seq)

        # We reshape the data to have 3 dimensions and convert them to pytorch tensors (num_samples, seq_length, 1)
        data = torch.tensor(data, dtype=torch.long)
        labels = torch.tensor(labels, dtype=torch.long)  #long indices for CrossEntropy calculation

        data_one_hot = torch.nn.functional.one_hot(data, num_classes=self.input_size).float()      #One hot encoding for training
        #labels_one_hot = torch.nn.functional.one_hot(labels, num_classes=self.input_size).float()
        return data_one_hot, labels

    def __len__(self):
        return self.num_samples

    def __getitem__(self, idx):
        return self.data[idx], self.labels[idx]


###
###LSTM Implementation
class LSTMCellPytorch(torch.nn.Module):
    """
    A single-step RNN cell in PyTorch.
    h_t = tanh( x_t * W_x + h_{t-1} * W_h + b )

    Multiplicative LSTM:
    m_t = sigmoid(x_t * W_xm + h_{t-1} * U_m + b_m)
    ~x_t = m_t * x_t
    i_t = sigmoid(~x_t * W_xi + h_{t-1} * U_i + b_i)
    f_t = sigmoid(~x_t * W_xf + h_{t-1} * U_f + b_f)
    o_t = sigmoid(~x_t * W_xo + h_{t-1} * U_o + b_o)
    ~c_t = tanh(x_t * W_xc + h_{t-1} * U_c + b_c)
    c_t = f_t * c_{t-1} + i_t * ~c_t

    """
    def __init__(self, input_size, hidden_size):
        super().__init__()
        self.hidden_size = hidden_size

        #Initialize the variables used

        #memory matrix variables:
        self.W_xm = torch.nn.Parameter(torch.randn(input_size, input_size) * 0.1)
        self.U_m = torch.nn.Parameter(torch.randn(hidden_size, input_size) * 0.1)
        self.b_m = torch.nn.Parameter(torch.zeros(input_size))

        #input and forget gates:
        self.W_xi = torch.nn.Parameter(torch.randn(input_size, hidden_size) * 0.1)
        self.U_i = torch.nn.Parameter(torch.randn(hidden_size, hidden_size) * 0.1)
        self.W_xf = torch.nn.Parameter(torch.randn(input_size, hidden_size) * 0.1)
        self.U_f = torch.nn.Parameter(torch.randn(hidden_size, hidden_size) * 0.1)
        self.b_i = torch.nn.Parameter(torch.zeros(hidden_size))
        self.b_f = torch.nn.Parameter(torch.zeros(hidden_size))

        #output gate:
        self.W_xo = torch.nn.Parameter(torch.randn(input_size, hidden_size) * 0.1)
        self.U_o = torch.nn.Parameter(torch.randn(hidden_size, hidden_size) * 0.1)
        self.b_o = torch.nn.Parameter(torch.zeros(hidden_size))

        #Memory cell gate:
        self.W_xc = torch.nn.Parameter(torch.randn(input_size, hidden_size) * 0.1)
        self.U_c = torch.nn.Parameter(torch.randn(hidden_size, hidden_size) * 0.1)
        self.b_c = torch.nn.Parameter(torch.zeros(hidden_size))

    def forward(self, x_t, h_prev, c_prev):

        m_t = torch.sigmoid(x_t @ self.W_xm + h_prev @ self.U_m + self.b_m)         #memory matrix
        x_t = m_t * x_t


        i_t = torch.sigmoid(x_t @ self.W_xi + h_prev @ self.U_i + self.b_i)
        f_t = torch.sigmoid(x_t @ self.W_xf + h_prev @ self.U_f + self.b_f)
        o_t = torch.sigmoid(x_t @ self.W_xo + h_prev @ self.U_o + self.b_o)
        c_t = f_t * c_prev + i_t * torch.tanh(x_t @ self.W_xc + h_prev @ self.U_c + self.b_c)
        h_t = o_t * torch.tanh(c_t)
        return m_t, i_t, f_t, o_t, c_t, h_t

###
#LSTM Unrolling over time function
class LSTMPyTorch(torch.nn.Module):
    """
    Unrolls the LSTM over a full sequence.
    """
    def __init__(self, input_size, hidden_size):
        super().__init__()
        self.hidden_size = hidden_size
        self.rnn_cell = LSTMCellPytorch(input_size, hidden_size)
        # Output projection to match the original input dimension for copy task
        self.W_out = torch.nn.Parameter(torch.randn(hidden_size, input_size) * 0.1)

        self.b_out = torch.nn.Parameter(torch.zeros(input_size))

    def forward(self, X, hidden_state= None):
        # X: [batch_size, seq_length, input_size]
        batch_size, seq_length, _ = X.shape

        if hidden_state is None:
            h = torch.zeros(batch_size, self.hidden_size, device=X.device)
            c = torch.zeros(batch_size, self.hidden_size, device=X.device)
        else:
            h, c = hidden_state

        outputs = []
        for t in range(seq_length):
            x_t = X[:, t, :]  # [batch_size, input_size]
            m_t, i_t, f_t, o_t, c_t, h_t = self.rnn_cell(x_t, h, c)
            # Project hidden -> input_size
            out_t = h @ self.W_out + self.b_out
            outputs.append(out_t.unsqueeze(1))  # shape [batch_size,1,input_size]

            h, c = h_t, c_t
        # Concatenate across time
        return torch.cat(outputs, dim=1), (h, c)  # [batch_size, seq_length, input_size]

###

# The training function- train_LSTM
def train_LSTM(model, train_loader, epochs=10, lr=0.01):
    device = torch.device("cuda")
    model.to(device)
    optimizer = torch.optim.SGD(model.parameters(), lr=lr, momentum=0.9)
    criterion = torch.nn.CrossEntropyLoss()

    start_time = time.time()

    hidden_state = None        #hidden state maintained during training

    for epoch in range(epochs):
        total_loss = 0
        total_batches = 0

        for X_batch, Y_batch in train_loader:
            X_batch, Y_batch = X_batch.to(device), Y_batch.to(device)


            optimizer.zero_grad()
            output, hidden_state = model(X_batch, hidden_state)
            hidden_state = (hidden_state[0].detach(), hidden_state[1].detach())   #Detaches from previous hidden state to prevent exploding gradient

            loss = criterion(output.view(-1, input_size), Y_batch.view(-1))
            loss.backward()
            optimizer.step()

            total_loss += loss.item()
            total_batches += 1

        epoch_loss = total_loss / total_batches
        print(f"Epoch {epoch+1}/{epochs} | Loss: {epoch_loss:.6f}")

    return time.time() - start_time, loss

#The testing function- evaluate_LSTM:
def evaluate_LSTM(model, test_loader, chunk_size = 100):
    model.eval()
    correct_predictions = 0
    total_predictions = 0
    device = next(model.parameters()).device


    with torch.no_grad():
        for X_batch, Y_batch in test_loader:
            X_batch, Y_batch = X_batch.to(device), Y_batch.to(device)
            batch_size, seq_length, _ = X_batch.shape
            hidden_state = None      #hiddnt state to be maintained throughout the batches
            num_chunks = seq_length // chunk_size

            chunk_correct = []
            chunk_total = []
            outputs = []
            for i in range(num_chunks):
                start = i * chunk_size
                end = (i + 1) * chunk_size

                X_chunk = X_batch[:, start:end, :]
                Y_chunk = Y_batch[:, start:end]

                output, hidden_state = model(X_chunk, hidden_state)
                hidden_state = (hidden_state[0].detach(), hidden_state[1].detach())
                outputs.append(output)


            outputs = torch.cat(outputs, dim=1)  # Reconstruct full sequence
            predicted_labels = torch.argmax(outputs, dim=-1)

            correct_predictions += (predicted_labels == Y_batch).sum().item()
            total_predictions += predicted_labels.numel()


    accuracy = correct_predictions / total_predictions if total_predictions > 0 else 0
    return accuracy

###
#For training dataset:
seq_length = 100
vocab_size = 10
num_samples = 1500
batch_size = 64

# Creating dataset and Dataloader for training:
dataset = CopyTaskDataset(seq_length, vocab_size, num_samples)
train_loader = DataLoader(dataset, batch_size=batch_size, shuffle=True, drop_last = True)

# Define model
input_size = vocab_size + 1  # Including delimiter token
hidden_size = 128
model = LSTMPyTorch(input_size, hidden_size)

# Train model:
training_time, loss = train_LSTM(model, train_loader, epochs=10, lr=0.005)
print(f"Training completed in {training_time:.2f} seconds")
print(f"Mean Loss: {loss:.4f}")
###

###
#For testing dataset:
seq_length = 1000
vocab_size = 10
num_samples = 1500
batch_size = 64

#Dataset and loader for testing:
test_dataset = CopyTaskDataset(seq_length, vocab_size, num_samples)
test_loader = DataLoader(dataset, batch_size=batch_size, shuffle=True, drop_last = True)

#Test the model:
test_accuracy = evaluate_LSTM(model, test_loader, chunk_size = 100)
print(f"mLSTM Test Accuracy: {test_accuracy:.4f}")


Epoch 1/10 | Loss: 2.322238
Epoch 2/10 | Loss: 2.096273
Epoch 3/10 | Loss: 1.906069
Epoch 4/10 | Loss: 1.796025
Epoch 5/10 | Loss: 1.719152
Epoch 6/10 | Loss: 1.584395
Epoch 7/10 | Loss: 1.446875
Epoch 8/10 | Loss: 1.407637
Epoch 9/10 | Loss: 1.374652
Epoch 10/10 | Loss: 1.354479
Training completed in 100.52 seconds
Mean Loss: 1.3472
mLSTM Test Accuracy: 0.5448


**T = 100**
Epoch 1/10 | Loss: 2.325448<br>
Epoch 2/10 | Loss: 2.100564<br>
Epoch 3/10 | Loss: 1.911989<br>
Epoch 4/10 | Loss: 1.819186<br>
Epoch 5/10 | Loss: 1.775007<br>
Epoch 6/10 | Loss: 1.705509<br>
Epoch 7/10 | Loss: 1.558481<br>
Epoch 8/10 | Loss: 1.449990<br>
Epoch 9/10 | Loss: 1.407998<br>
Epoch 10/10 | Loss: 1.375835<br>
Training completed in 92.10 seconds<br>
Mean Loss: 1.3647<br>
mLSTM Test Accuracy: 0.5434<br>


**T = 200**
Epoch 1/10 | Loss: 2.326527<br>
Epoch 2/10 | Loss: 2.106548<br>
Epoch 3/10 | Loss: 1.930772<br>
Epoch 4/10 | Loss: 1.835865<br>
Epoch 5/10 | Loss: 1.784955<br>
Epoch 6/10 | Loss: 1.728241<br>
Epoch 7/10 | Loss: 1.631638<br>
Epoch 8/10 | Loss: 1.482542<br>
Epoch 9/10 | Loss: 1.410738<br>
Epoch 10/10 | Loss: 1.383396<br>
Training completed in 97.40 seconds<br>
Mean Loss: 1.3712<br>
mLSTM Test Accuracy: 0.5437<br>
<br>
**T = 500**
Epoch 1/10 | Loss: 2.324010<br>
Epoch 2/10 | Loss: 2.098319<br>
Epoch 3/10 | Loss: 1.911312<br>
Epoch 4/10 | Loss: 1.806705<br>
Epoch 5/10 | Loss: 1.742324<br>
Epoch 6/10 | Loss: 1.641926<br>
Epoch 7/10 | Loss: 1.479221<br>
Epoch 8/10 | Loss: 1.414648<br>
Epoch 9/10 | Loss: 1.383242<br>
Epoch 10/10 | Loss: 1.359688<br>
Training completed in 87.35 seconds<br>
Mean Loss: 1.3518<br>
mLSTM Test Accuracy: 0.5446<br>


**T = 1000**
Epoch 1/10 | Loss: 2.322238<br>
Epoch 2/10 | Loss: 2.096273<br>
Epoch 3/10 | Loss: 1.906069<br>
Epoch 4/10 | Loss: 1.796025<br>
Epoch 5/10 | Loss: 1.719152<br>
Epoch 6/10 | Loss: 1.584395<br>
Epoch 7/10 | Loss: 1.446875<br>
Epoch 8/10 | Loss: 1.407637<br>
Epoch 9/10 | Loss: 1.374652<br>
Epoch 10/10 | Loss: 1.354479<br>
Training completed in 100.52 seconds<br>
Mean Loss: 1.3472<br>
mLSTM Test Accuracy: 0.5448<br>

---

# GRU:
We perform the same training and testing procedure on a GRU model for comparison.





In [10]:
import torch
from torch.utils.data import Dataset, DataLoader  #Dataset and Dataloader used for training data
import time                   #For training time calculation
import numpy as np

#This function generates a dataset of integer symbols, to be used for training the models
class CopyTaskDataset(Dataset):
    def __init__(self, seq_length=100, vocab_size=10, num_samples=1000):
        self.seq_length = seq_length            #1000 pairs of sequences of len 100 and 10 type of tokens
        self.vocab_size = vocab_size
        self.num_samples = num_samples
        self.delimiter_token = vocab_size
        self.input_size = vocab_size + 1             #Number of unique tokens + Delimiter
        self.data, self.labels = self.generate_data()

    def generate_data(self):
        data = []
        labels = []
        for _ in range(self.num_samples):
            sequence = np.random.randint(0, self.vocab_size, size=(self.seq_length))

            input_seq = np.concatenate([sequence, [self.delimiter_token] * self.seq_length])  #Original sequence first and delimiter tokens follow

            target_seq = np.concatenate([[self.delimiter_token] * self.seq_length, sequence])   #Delimiter tokens first and original sequence follows

            data.append(input_seq)
            labels.append(target_seq)

        # We reshape the data to have 3 dimensions and convert them to pytorch tensors (num_samples, seq_length, 1)
        data = torch.tensor(data, dtype=torch.long)
        labels = torch.tensor(labels, dtype=torch.long)  #long indices for CrossEntropy calculation

        data_one_hot = torch.nn.functional.one_hot(data, num_classes=self.input_size).float()      #One hot encoding for training
        #labels_one_hot = torch.nn.functional.one_hot(labels, num_classes=self.input_size).float()
        return data_one_hot, labels

    def __len__(self):
        return self.num_samples

    def __getitem__(self, idx):
        return self.data[idx], self.labels[idx]


###
###GRU Implementation
class LSTMCellPytorch(torch.nn.Module):
    """
    A single-step RNN cell in PyTorch for reference:
    h_t = tanh( x_t * W_x + h_{t-1} * W_h + b )

    GRU:
    z_t = sigmoid(x_t * W_xz + h_{t-1} * W_hz + b_z)
    r_t = sigmoid(x_t * W_xr + h_{t-1} * W_hr + b_r)
    ~h_t = tanh(z_t * r_t * h_{t-1} + x_t * W_xh + b_h)
    h_t = (1 - z_t) * h_{t-1} + z_t * ~h_t
    """
    def __init__(self, input_size, hidden_size):
        super().__init__()
        self.hidden_size = hidden_size

        # Update gate:
        self.W_xz = torch.nn.Parameter(torch.randn(input_size, hidden_size) * 0.1)
        self.U_hz = torch.nn.Parameter(torch.randn(hidden_size, hidden_size) * 0.1)
        self.b_z = torch.nn.Parameter(torch.zeros(hidden_size))

        # Reset gate:
        self.W_xr = torch.nn.Parameter(torch.randn(input_size, hidden_size) * 0.1)
        self.U_hr = torch.nn.Parameter(torch.randn(hidden_size, hidden_size) * 0.1)
        self.b_r = torch.nn.Parameter(torch.zeros(hidden_size))

        # Candidate hidden state:
        self.W_xh = torch.nn.Parameter(torch.randn(input_size, hidden_size) * 0.1)
        self.U_hh = torch.nn.Parameter(torch.randn(hidden_size, hidden_size) * 0.1)
        self.b_h = torch.nn.Parameter(torch.zeros(hidden_size))

    def forward(self, x_t, h_prev):
        z_t = torch.sigmoid(x_t @ self.W_xz + h_prev @ self.U_hz + self.b_z)


        r_t = torch.sigmoid(x_t @ self.W_xr + h_prev @ self.U_hr + self.b_r)


        h_candidate = torch.tanh(x_t @ self.W_xh + (r_t * h_prev) @ self.U_hh + self.b_h)


        h_t = (1 - z_t) * h_prev + z_t * h_candidate

        return z_t, r_t, h_candidate, h_t

###
#GRU Unrolling over time function
class LSTMPyTorch(torch.nn.Module):
    """
    Unrolls the GRU over a full sequence.
    """
    def __init__(self, input_size, hidden_size):
        super().__init__()
        self.hidden_size = hidden_size
        self.rnn_cell = LSTMCellPytorch(input_size, hidden_size)
        self.W_out = torch.nn.Parameter(torch.randn(hidden_size, input_size) * 0.1)

        self.b_out = torch.nn.Parameter(torch.zeros(input_size))

    def forward(self, X, hidden_state= None):
        # X: [batch_size, seq_length, input_size]
        batch_size, seq_length, _ = X.shape

        if hidden_state is None:
            h = torch.zeros(batch_size, self.hidden_size, device=X.device)

        else:
            h = hidden_state

        outputs = []
        for t in range(seq_length):
            x_t = X[:, t, :]  # [batch_size, input_size]
            z_t, r_t, h_candidate, h_t = self.rnn_cell(x_t, h)
            out_t = h @ self.W_out + self.b_out
            outputs.append(out_t.unsqueeze(1))  # shape [batch_size,1,input_size]

            h = h_t
        # Concatenate across time
        return torch.cat(outputs, dim=1), h  # [batch_size, seq_length, input_size]

###

# The training function- train_LSTM
def train_LSTM(model, train_loader, epochs=10, lr=0.01):
    device = torch.device("cuda")
    model.to(device)
    optimizer = torch.optim.SGD(model.parameters(), lr=lr, momentum=0.9)
    criterion = torch.nn.CrossEntropyLoss()

    start_time = time.time()

    hidden_state = None        #hidden state maintained during training

    for epoch in range(epochs):
        total_loss = 0
        total_batches = 0

        for X_batch, Y_batch in train_loader:
            X_batch, Y_batch = X_batch.to(device), Y_batch.to(device)


            optimizer.zero_grad()
            output, hidden_state = model(X_batch, hidden_state)
            hidden_state = (hidden_state.detach())   #Detaches from previous hidden state to prevent exploding gradient

            loss = criterion(output.view(-1, input_size), Y_batch.view(-1))
            loss.backward()
            optimizer.step()

            total_loss += loss.item()
            total_batches += 1

        epoch_loss = total_loss / total_batches
        print(f"Epoch {epoch+1}/{epochs} | Loss: {epoch_loss:.6f}")

    return time.time() - start_time, loss

#The testing function- evaluate_LSTM:
def evaluate_LSTM(model, test_loader, chunk_size = 100):
    model.eval()
    correct_predictions = 0
    total_predictions = 0
    device = next(model.parameters()).device


    with torch.no_grad():
        for X_batch, Y_batch in test_loader:
            X_batch, Y_batch = X_batch.to(device), Y_batch.to(device)
            batch_size, seq_length, _ = X_batch.shape
            hidden_state = None      #hiddnt state to be maintained throughout the batches
            num_chunks = seq_length // chunk_size

            chunk_correct = []
            chunk_total = []
            outputs = []
            for i in range(num_chunks):
                start = i * chunk_size
                end = (i + 1) * chunk_size

                X_chunk = X_batch[:, start:end, :]
                Y_chunk = Y_batch[:, start:end]

                output, hidden_state = model(X_chunk, hidden_state)
                hidden_state = (hidden_state.detach())
                outputs.append(output)


            outputs = torch.cat(outputs, dim=1)  # Reconstruct full sequence
            predicted_labels = torch.argmax(outputs, dim=-1)

            correct_predictions += (predicted_labels == Y_batch).sum().item()
            total_predictions += predicted_labels.numel()


    accuracy = correct_predictions / total_predictions if total_predictions > 0 else 0
    return accuracy

###
#For training dataset:
seq_length = 100
vocab_size = 10
num_samples = 1500
batch_size = 64

# Creating dataset and Dataloader for training:
dataset = CopyTaskDataset(seq_length, vocab_size, num_samples)
train_loader = DataLoader(dataset, batch_size=batch_size, shuffle=True, drop_last = True)

# Define the model:
input_size = vocab_size + 1  # Including delimiter token
hidden_size = 128
model = LSTMPyTorch(input_size, hidden_size)

# Train model:
training_time, loss = train_LSTM(model, train_loader, epochs=10, lr=0.005)
print(f"Training completed in {training_time:.2f} seconds")
print(f"Mean Loss: {loss:.4f}")
###

###
#For testing dataset:
seq_length = 1000
vocab_size = 10
num_samples = 1500
batch_size = 64

#Dataset and loader for testing:
test_dataset = CopyTaskDataset(seq_length, vocab_size, num_samples)
test_loader = DataLoader(dataset, batch_size=batch_size, shuffle=True, drop_last = True)

#Test the model:
test_accuracy = evaluate_LSTM(model, test_loader, chunk_size = 100)
print(f"GRU Test Accuracy: {test_accuracy:.4f}")


Epoch 1/10 | Loss: 2.256180
Epoch 2/10 | Loss: 1.792469
Epoch 3/10 | Loss: 1.428502
Epoch 4/10 | Loss: 1.323199
Epoch 5/10 | Loss: 1.312905
Epoch 6/10 | Loss: 1.300798
Epoch 7/10 | Loss: 1.293492
Epoch 8/10 | Loss: 1.288133
Epoch 9/10 | Loss: 1.283637
Epoch 10/10 | Loss: 1.279767
Training completed in 67.30 seconds
Mean Loss: 1.2779
GRU Test Accuracy: 0.5476


**T= 100**

Epoch 1/10 | Loss: 2.262720<br>
Epoch 2/10 | Loss: 1.832611<br>
Epoch 3/10 | Loss: 1.478234<br>
Epoch 4/10 | Loss: 1.320243<br>
Epoch 5/10 | Loss: 1.310087<br>
Epoch 6/10 | Loss: 1.299058<br>
Epoch 7/10 | Loss: 1.291252<br>
Epoch 8/10 | Loss: 1.285523<br>
Epoch 9/10 | Loss: 1.280829<br>
Epoch 10/10 | Loss: 1.276729<br>
Training completed in 65.63 seconds<br>
Mean Loss: 1.2747<br>
GRU Test Accuracy: 0.5478<br>

**T = 200**

Epoch 1/10 | Loss: 2.277316<br>
Epoch 2/10 | Loss: 1.848805<br>
Epoch 3/10 | Loss: 1.488620<br>
Epoch 4/10 | Loss: 1.327172<br>
Epoch 5/10 | Loss: 1.310578<br>
Epoch 6/10 | Loss: 1.297373<br>
Epoch 7/10 | Loss: 1.289892<br>
Epoch 8/10 | Loss: 1.284580<br>
Epoch 9/10 | Loss: 1.279979<br>
Epoch 10/10 | Loss: 1.275893<br>
Training completed in 67.72 seconds<br>
Mean Loss: 1.2736<br>
GRU Test Accuracy: 0.5476<br>

**T=500**

Epoch 1/10 | Loss: 2.193659<br>
Epoch 2/10 | Loss: 1.634995<br>
Epoch 3/10 | Loss: 1.349776<br>
Epoch 4/10 | Loss: 1.319150<br>
Epoch 5/10 | Loss: 1.306936<br>
Epoch 6/10 | Loss: 1.296261<br>
Epoch 7/10 | Loss: 1.288920<br>
Epoch 8/10 | Loss: 1.283386<br>
Epoch 9/10 | Loss: 1.278781<br>
Epoch 10/10 | Loss: 1.274840<br>
Training completed in 66.03 seconds<br>
Mean Loss: 1.2733<br>
GRU Test Accuracy: 0.5478<br>

**T = 1000**

Epoch 1/10 | Loss: 2.256180<br>
Epoch 2/10 | Loss: 1.792469<br>
Epoch 3/10 | Loss: 1.428502<br>
Epoch 4/10 | Loss: 1.323199<br>
Epoch 5/10 | Loss: 1.312905<br>
Epoch 6/10 | Loss: 1.300798<br>
Epoch 7/10 | Loss: 1.293492<br>
Epoch 8/10 | Loss: 1.288133<br>
Epoch 9/10 | Loss: 1.283637<br>
Epoch 10/10 | Loss: 1.279767<br>
Training completed in 67.30 seconds<br>
Mean Loss: 1.2779<br>
GRU Test Accuracy: 0.5476<br>

#Multiplicative GRU

Lastly, we perform the training and tests on the multiplicative variant of the GRU model.

In [13]:
import torch
from torch.utils.data import Dataset, DataLoader  #Dataset and Dataloader used for training data
import time                   #For training time calculation
import numpy as np

#This function generates a dataset of integer symbols, to be used for training the models
class CopyTaskDataset(Dataset):
    def __init__(self, seq_length=100, vocab_size=10, num_samples=1000):
        self.seq_length = seq_length            #1000 pairs of sequences of len 100 and 10 type of tokens
        self.vocab_size = vocab_size
        self.num_samples = num_samples
        self.delimiter_token = vocab_size
        self.input_size = vocab_size + 1             #Number of unique tokens + Delimiter
        self.data, self.labels = self.generate_data()

    def generate_data(self):
        data = []
        labels = []
        for _ in range(self.num_samples):
            sequence = np.random.randint(0, self.vocab_size, size=(self.seq_length))

            input_seq = np.concatenate([sequence, [self.delimiter_token] * self.seq_length])  #Original sequence first and delimiter tokens follow

            target_seq = np.concatenate([[self.delimiter_token] * self.seq_length, sequence])   #Delimiter tokens first and original sequence follows

            data.append(input_seq)
            labels.append(target_seq)

        # We reshape the data to have 3 dimensions and convert them to pytorch tensors (num_samples, seq_length, 1)
        data = torch.tensor(data, dtype=torch.long)
        labels = torch.tensor(labels, dtype=torch.long)  #long indices for CrossEntropy calculation

        data_one_hot = torch.nn.functional.one_hot(data, num_classes=self.input_size).float()      #One hot encoding for training
        #labels_one_hot = torch.nn.functional.one_hot(labels, num_classes=self.input_size).float()
        return data_one_hot, labels

    def __len__(self):
        return self.num_samples

    def __getitem__(self, idx):
        return self.data[idx], self.labels[idx]


###
###mGRU Implementation
class LSTMCellPytorch(torch.nn.Module):
    """
    A single-step RNN cell in PyTorch for reference:
    h_t = tanh( x_t * W_x + h_{t-1} * W_h + b )

    Multiplicative GRU:
    m_t = sigmoid(x_t * W_xm + h_{t-1} * W_hm + b_m)
    x_t = x_t * m_t

    z_t = sigmoid(x_t * W_xz + h_{t-1} * W_hz + b_z)
    r_t = sigmoid(x_t * W_xr + h_{t-1} * W_hr + b_r)
    ~h_t = tanh(z_t * r_t * h_{t-1} + x_t * W_xh + b_h)
    h_t = (1 - z_t) * h_{t-1} + z_t * ~h_t
    """
    def __init__(self, input_size, hidden_size):
        super().__init__()
        self.hidden_size = hidden_size

        #Update gate:
        self.W_xz = torch.nn.Parameter(torch.randn(input_size, hidden_size) * 0.1)
        self.U_hz = torch.nn.Parameter(torch.randn(hidden_size, hidden_size) * 0.1)
        self.b_z = torch.nn.Parameter(torch.zeros(hidden_size))

        # Reset gate:
        self.W_xr = torch.nn.Parameter(torch.randn(input_size, hidden_size) * 0.1)
        self.U_hr = torch.nn.Parameter(torch.randn(hidden_size, hidden_size) * 0.1)
        self.b_r = torch.nn.Parameter(torch.zeros(hidden_size))

        # Candidate hidden state:
        self.W_xh = torch.nn.Parameter(torch.randn(input_size, hidden_size) * 0.1)
        self.U_hh = torch.nn.Parameter(torch.randn(hidden_size, hidden_size) * 0.1)
        self.b_h = torch.nn.Parameter(torch.zeros(hidden_size))

        self.W_m = torch.nn.Parameter(torch.randn(input_size, hidden_size) * 0.1)
        self.U_m = torch.nn.Parameter(torch.randn(hidden_size, hidden_size) * 0.1)
        self.b_m = torch.nn.Parameter(torch.zeros(hidden_size))

    def forward(self, x_t, h_prev):

        z_t = torch.sigmoid(x_t @ self.W_xz + h_prev @ self.U_hz + self.b_z)


        r_t = torch.sigmoid(x_t @ self.W_xr + h_prev @ self.U_hr + self.b_r)

        m_t = torch.sigmoid(x_t @ self.W_m + h_prev @ self.U_m + self.b_m)

        h_candidate = torch.tanh(x_t @ self.W_xh + (r_t * h_prev) @ self.U_hh + self.b_h)


        h_t = (1 - z_t) * h_prev + z_t * h_candidate

        return z_t, r_t, h_candidate, h_t

###
#GRU Unrolling over time function
class LSTMPyTorch(torch.nn.Module):
    """
    Unrolls the GRU over a full sequence.
    """
    def __init__(self, input_size, hidden_size):
        super().__init__()
        self.hidden_size = hidden_size
        self.rnn_cell = LSTMCellPytorch(input_size, hidden_size)
        # Output projection to match the original input dimension for copy task
        self.W_out = torch.nn.Parameter(torch.randn(hidden_size, input_size) * 0.1)

        self.b_out = torch.nn.Parameter(torch.zeros(input_size))

    def forward(self, X, hidden_state= None):
        # X: [batch_size, seq_length, input_size]
        batch_size, seq_length, _ = X.shape

        if hidden_state is None:
            h = torch.zeros(batch_size, self.hidden_size, device=X.device)

        else:
            h = hidden_state

        outputs = []
        for t in range(seq_length):
            x_t = X[:, t, :]  # [batch_size, input_size]
            z_t, r_t, h_candidate, h_t = self.rnn_cell(x_t, h)
            # Project hidden -> input_size
            out_t = h @ self.W_out + self.b_out
            outputs.append(out_t.unsqueeze(1))  # shape [batch_size,1,input_size]

            h = h_t
        # Concatenate across time
        return torch.cat(outputs, dim=1), h  # [batch_size, seq_length, input_size]

###

# The training function- train_LSTM
def train_LSTM(model, train_loader, epochs=10, lr=0.01):
    device = torch.device("cuda")
    model.to(device)
    optimizer = torch.optim.SGD(model.parameters(), lr=lr, momentum=0.9)
    criterion = torch.nn.CrossEntropyLoss()

    start_time = time.time()

    hidden_state = None        #hidden state maintained during training

    for epoch in range(epochs):
        total_loss = 0
        total_batches = 0

        for X_batch, Y_batch in train_loader:
            X_batch, Y_batch = X_batch.to(device), Y_batch.to(device)


            optimizer.zero_grad()
            output, hidden_state = model(X_batch, hidden_state)
            hidden_state = (hidden_state.detach())   #Detaches from previous hidden state to prevent exploding gradient

            loss = criterion(output.view(-1, input_size), Y_batch.view(-1))
            loss.backward()
            optimizer.step()

            total_loss += loss.item()
            total_batches += 1

        epoch_loss = total_loss / total_batches
        print(f"Epoch {epoch+1}/{epochs} | Loss: {epoch_loss:.6f}")

    return time.time() - start_time, loss

#The testing function- evaluate_LSTM:
def evaluate_LSTM(model, test_loader, chunk_size = 100):
    model.eval()
    correct_predictions = 0
    total_predictions = 0
    device = next(model.parameters()).device


    with torch.no_grad():
        for X_batch, Y_batch in test_loader:
            X_batch, Y_batch = X_batch.to(device), Y_batch.to(device)
            batch_size, seq_length, _ = X_batch.shape
            hidden_state = None      #hiddnt state to be maintained throughout the batches
            num_chunks = seq_length // chunk_size

            chunk_correct = []
            chunk_total = []
            outputs = []
            for i in range(num_chunks):
                start = i * chunk_size
                end = (i + 1) * chunk_size

                X_chunk = X_batch[:, start:end, :]
                Y_chunk = Y_batch[:, start:end]

                output, hidden_state = model(X_chunk, hidden_state)
                hidden_state = (hidden_state.detach())
                outputs.append(output)


            outputs = torch.cat(outputs, dim=1)  # Reconstruct full sequence
            predicted_labels = torch.argmax(outputs, dim=-1)

            correct_predictions += (predicted_labels == Y_batch).sum().item()
            total_predictions += predicted_labels.numel()


    accuracy = correct_predictions / total_predictions if total_predictions > 0 else 0
    return accuracy

###
#For training dataset:
seq_length = 100
vocab_size = 10
num_samples = 1500
batch_size = 64

# Creating dataset and Dataloader for training:
dataset = CopyTaskDataset(seq_length, vocab_size, num_samples)
train_loader = DataLoader(dataset, batch_size=batch_size, shuffle=True, drop_last = True)

# Define model
input_size = vocab_size + 1  # Including delimiter token
hidden_size = 128
model = LSTMPyTorch(input_size, hidden_size)

# Train model:
training_time, loss = train_LSTM(model, train_loader, epochs=10, lr=0.005)
print(f"Training completed in {training_time:.2f} seconds")
print(f"Mean Loss: {loss:.4f}")
###

###
#For testing dataset:
seq_length = 1000
vocab_size = 10
num_samples = 1500
batch_size = 64

#Dataset and loader for testing:
test_dataset = CopyTaskDataset(seq_length, vocab_size, num_samples)
test_loader = DataLoader(dataset, batch_size=batch_size, shuffle=True, drop_last = True)

#Test the model:
test_accuracy = evaluate_LSTM(model, test_loader, chunk_size = 100)
print(f"mGRU Test Accuracy: {test_accuracy:.4f}")

Epoch 1/10 | Loss: 2.279870
Epoch 2/10 | Loss: 1.795358
Epoch 3/10 | Loss: 1.421967
Epoch 4/10 | Loss: 1.324826
Epoch 5/10 | Loss: 1.313365
Epoch 6/10 | Loss: 1.301349
Epoch 7/10 | Loss: 1.293327
Epoch 8/10 | Loss: 1.287402
Epoch 9/10 | Loss: 1.282551
Epoch 10/10 | Loss: 1.278433
Training completed in 70.42 seconds
Mean Loss: 1.2768
mGRU Test Accuracy: 0.5474


**T = 100**

Epoch 1/10 | Loss: 2.255296<br>
Epoch 2/10 | Loss: 1.793793<br>
Epoch 3/10 | Loss: 1.432387<br>
Epoch 4/10 | Loss: 1.313185<br>
Epoch 5/10 | Loss: 1.307201<br>
Epoch 6/10 | Loss: 1.296796<br>
Epoch 7/10 | Loss: 1.289199<br>
Epoch 8/10 | Loss: 1.283684<br>
Epoch 9/10 | Loss: 1.279179<br>
Epoch 10/10 | Loss: 1.275271<br>
Training completed in 66.17 seconds<br>
Mean Loss: 1.2736<br>
mGRU Test Accuracy: 0.5482<br>

**T = 200**

Epoch 1/10 | Loss: 2.242930<br>
Epoch 2/10 | Loss: 1.718653<br>
Epoch 3/10 | Loss: 1.367778<br>
Epoch 4/10 | Loss: 1.328952<br>
Epoch 5/10 | Loss: 1.310786<br>
Epoch 6/10 | Loss: 1.297574<br>
Epoch 7/10 | Loss: 1.289462<br>
Epoch 8/10 | Loss: 1.283538<br>
Epoch 9/10 | Loss: 1.278701<br>
Epoch 10/10 | Loss: 1.274586<br>
Training completed in 72.74 seconds<br>
Mean Loss: 1.2726<br>
mGRU Test Accuracy: 0.5476<br>

**T = 500**

Epoch 1/10 | Loss: 2.161476<br>
Epoch 2/10 | Loss: 1.637637<br>
Epoch 3/10 | Loss: 1.344323<br>
Epoch 4/10 | Loss: 1.309537<br>
Epoch 5/10 | Loss: 1.301227<br>
Epoch 6/10 | Loss: 1.291401<br>
Epoch 7/10 | Loss: 1.284444<br>
Epoch 8/10 | Loss: 1.279242<br>
Epoch 9/10 | Loss: 1.274934<br>
Epoch 10/10 | Loss: 1.271261<br>
Training completed in 62.78 seconds<br>
Mean Loss: 1.2699<br>
mGRU Test Accuracy: 0.5479<br>

**T = 1000**

Epoch 1/10 | Loss: 2.279870<br>
Epoch 2/10 | Loss: 1.795358<br>
Epoch 3/10 | Loss: 1.421967<br>
Epoch 4/10 | Loss: 1.324826<br>
Epoch 5/10 | Loss: 1.313365<br>
Epoch 6/10 | Loss: 1.301349<br>
Epoch 7/10 | Loss: 1.293327<br>
Epoch 8/10 | Loss: 1.287402<br>
Epoch 9/10 | Loss: 1.282551<br>
Epoch 10/10 | Loss: 1.278433<br>
Training completed in 70.42 seconds<br>
Mean Loss: 1.2768<br>
mGRU Test Accuracy: 0.5474<br>


#Results and Inference:
As seen from the training and tests, all of the models perform nearly at the same accuracies, though there were differences in the losses calculated.

For the LSTM cell, we see:
Although the accuracies remain same for all T values, the losses gradually increase with increasing number of sequence.

For the multiplicative LSTM:
The losses are seen to be fluctuating with increasing value of T.

For GRU:
The losses also seem to be fluctuating around the same value for increasing values of T.

For multiplicative GRU:
The loss values also fluctuate. Accuracies remain near the same value.

#Conclusion:
From the experiment, we see that the models output similar accuracy and loss values for different lengths. The reason for this may be human error, the sequence lengths not being varied enough to make a difference etc.



# References:
1. Zhang, J., Lei, Q., & Dhillon, I. S. (2018, March 25). Stabilizing gradients for deep neural networks via efficient SVD parameterization. arXiv.org. https://arxiv.org/abs/1803.09327

2. Wikipedia contributors. (2025, January 30). Vanishing gradient problem. Wikipedia. https://en.wikipedia.org/wiki/Vanishing_gradient_problem

3. Ngutten. (n.d.). LSTMCopy/LSTM.ipynb at master Â· ngutten/LSTMCopy. GitHub. https://github.com/ngutten/LSTMCopy/blob/master/LSTM.ipynb

4. Omarzai, F. (2024, November 19). LSTM and GRU in depth - Fraidoon Omarzai - medium. Medium. https://medium.com/@fraidoonomarzai99/lstm-and-gru-in-depth-40aba24bfe53

5. OpenAI. (2025). ChatGPT [Large language model]. https://chat.openai.com/chat

6. Google. (2025). Gemini [Large language model]. https://gemini.google.com/app



Note: AI tools were used for debugging and learning purposes only.

