# Building, Training, and Executing a Long Short-Term Memory (LSTM) Model Manually 

This notebook builds an LSTM model manually by writing its class from scratch.

In a second notebook, I will use built-in PyTorch and Lightning functions to build the same model, showing how much easier the built-ins make it.

---

## Importing Modules

I have added copious comments to help better understand what each import is accomplishing.

In [1]:
import lightning as L # Lightning has tons of cool tools that make neural networks easier

import torch # torch will allow us to create tensors.
import torch.nn as nn # torch.nn allows us to create a neural network.
import torch.nn.functional as F # nn.functional gives us access to the activation and loss functions.
from torch.optim import Adam # optim contains many optimizers. This time I am using Adam
from torch.utils.data import TensorDataset, DataLoader # needed for training data

----
## Example - Building a Long Short-Term Memory Unit One Component at a Time Using `PyTorch` and `Lightning`


For this LSTM example, I imagine that I have two companies, Company A and Company B, each with five days' worth of stock prices.

![company stock](imgs/company_stock_prices.png)

Given this sequential data, I want to see whether the LSTM can remember what happened on Day 1 through Day 4 well enough to correctly predict what will happen on Day 5.

`The objective`: I run the data from Day 1 through Day 4 through the LSTM to see if I can predict the values for Day 5 for both Company A and Company B.

For Company A, the goal is to predict that the value on Day 5 = 0, and for Company B, the goal is to predict that the value on Day 5 = 1.

### Creating the LSTM Model Class
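As a reading aid for the class in this section, here is the recurrence that `lstm_unit` computes for a single day, written with the class's own parameter names. The gate symbols $f_t$, $i_t$, $g_t$, and $o_t$ are conventional LSTM notation that I am assuming here (they do not appear in the code). $x_t$ is `input_value`, $h_{t-1}$ is `short_memory`, $c_{t-1}$ is `long_memory`, and $\sigma$ is the sigmoid function.

$$
\begin{aligned}
f_t &= \sigma\left(w_{lr1}\, h_{t-1} + w_{lr2}\, x_t + b_{lr1}\right) \\
i_t &= \sigma\left(w_{pr1}\, h_{t-1} + w_{pr2}\, x_t + b_{pr1}\right) \\
g_t &= \tanh\left(w_{p1}\, h_{t-1} + w_{p2}\, x_t + b_{p1}\right) \\
c_t &= f_t\, c_{t-1} + i_t\, g_t \\
o_t &= \sigma\left(w_{o1}\, h_{t-1} + w_{o2}\, x_t + b_{o1}\right) \\
h_t &= o_t \tanh\left(c_t\right)
\end{aligned}
$$

In the code, $f_t$ is `long_remember_percent`, $i_t$ is `potential_remember_percent`, $g_t$ is `potential_memory`, $c_t$ is `updated_long_memory`, $o_t$ is `output_percent`, and $h_t$ is `updated_short_memory`.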

In [None]:
class LSTMbyHand(L.LightningModule):

    def __init__(self):

        super().__init__()

        # set the seed for the random number generator.
        L.seed_everything(seed=42)

        ###################
        #
        # Initialize the tensors for the LSTM
        #
        ###################

        # Using random values to initialize the tensors
        # Here are two different ways: 1) Normal Distribution and 2) Uniform Distribution
        # I use the Normal Distribution...
        mean = torch.tensor(0.0)
        std = torch.tensor(1.0)

        # Use the normal distribution for the Weights.
        # All Biases are initialized to 0.
        #
        # These are the Weights and Biases in the first stage that determine what percentage
        # of the long-term memory the LSTM unit will remember.
        self.wlr1 = nn.Parameter(torch.normal(mean=mean, std=std), requires_grad=True)
        self.wlr2 = nn.Parameter(torch.normal(mean=mean, std=std), requires_grad=True)
        self.blr1 = nn.Parameter(torch.tensor(0.), requires_grad=True)

        # These are the Weights and Biases in the second stage that determine the new
        # potential long-term memory and what percentage will be remembered.
        self.wpr1 = nn.Parameter(torch.normal(mean=mean, std=std), requires_grad=True)
        self.wpr2 = nn.Parameter(torch.normal(mean=mean, std=std), requires_grad=True)
        self.bpr1 = nn.Parameter(torch.tensor(0.), requires_grad=True)

        self.wp1 = nn.Parameter(torch.normal(mean=mean, std=std), requires_grad=True)
        self.wp2 = nn.Parameter(torch.normal(mean=mean, std=std), requires_grad=True)
        self.bp1 = nn.Parameter(torch.tensor(0.), requires_grad=True)

        # These are the Weights and Biases in the third stage that determine the
        # new short-term memory and what percentage will be sent to the output.
        self.wo1 = nn.Parameter(torch.normal(mean=mean, std=std), requires_grad=True)
        self.wo2 = nn.Parameter(torch.normal(mean=mean, std=std), requires_grad=True)
        self.bo1 = nn.Parameter(torch.tensor(0.), requires_grad=True)

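        # I could also initialize all Weights and Biases using a uniform distribution, which is
        # the kind of distribution nn.LSTM() draws from (it samples U(-sqrt(k), sqrt(k)) with
        # k = 1/hidden_size, whereas torch.rand(1) below samples U(0, 1) and gives shape-(1,) tensors).
        # Adding the code here to show what that might look like.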
#         self.wlr1 = nn.Parameter(torch.rand(1), requires_grad=True)
#         self.wlr2 = nn.Parameter(torch.rand(1), requires_grad=True)
#         self.blr1 = nn.Parameter(torch.rand(1), requires_grad=True)

#         self.wpr1 = nn.Parameter(torch.rand(1), requires_grad=True)
#         self.wpr2 = nn.Parameter(torch.rand(1), requires_grad=True)
#         self.bpr1 = nn.Parameter(torch.rand(1), requires_grad=True)

#         self.wp1 = nn.Parameter(torch.rand(1), requires_grad=True)
#         self.wp2 = nn.Parameter(torch.rand(1), requires_grad=True)
#         self.bp1 = nn.Parameter(torch.rand(1), requires_grad=True)

#         self.wo1 = nn.Parameter(torch.rand(1), requires_grad=True)
#         self.wo2 = nn.Parameter(torch.rand(1), requires_grad=True)
#         self.bo1 = nn.Parameter(torch.rand(1), requires_grad=True)


    def lstm_unit(self, input_value, long_memory, short_memory):
        # lstm_unit does the math for a single LSTM unit.

        # NOTES:
        # long term memory is also called "cell state"
        # short term memory is also called "hidden state"

        # 1) The first stage determines what percent of the current long-term memory
        #    should be remembered
        long_remember_percent = torch.sigmoid((short_memory * self.wlr1) +
                                              (input_value * self.wlr2) +
                                              self.blr1)

        # 2) The second stage creates a new, potential long-term memory and determines what
        #    percentage of that to add to the current long-term memory
        potential_remember_percent = torch.sigmoid((short_memory * self.wpr1) +
                                                   (input_value * self.wpr2) +
                                                   self.bpr1)
        potential_memory = torch.tanh((short_memory * self.wp1) +
                                      (input_value * self.wp2) +
                                      self.bp1)

        # Once we have gone through the first two stages, we can update the long-term memory
        updated_long_memory = ((long_memory * long_remember_percent) +
                       (potential_remember_percent * potential_memory))

        # 3) The third stage creates a new, potential short-term memory and determines what
        #    percentage of that should be remembered and used as output.
        output_percent = torch.sigmoid((short_memory * self.wo1) +
                                       (input_value * self.wo2) +
                                       self.bo1)
        updated_short_memory = torch.tanh(updated_long_memory) * output_percent

        # Finally, we return the updated long and short-term memories
        return updated_long_memory, updated_short_memory
    
    def forward(self, input: torch.Tensor) -> torch.Tensor:
        # forward() unrolls the LSTM for the training data by calling lstm_unit() for each day of training data
        # that I have. forward() also keeps track of the long and short-term memories after each day and returns
        # the final short-term memory, which is the 'output' of the LSTM.

        long_memory = 0 # long term memory is also called "cell state" and indexed with c0, c1, ..., cN
        short_memory = 0 # short term memory is also called "hidden state" and indexed with h0, h1, ..., hN
        day1 = input[0]
        day2 = input[1]
        day3 = input[2]
        day4 = input[3]

        # Day 1
        long_memory, short_memory = self.lstm_unit(day1, long_memory, short_memory)

        # Day 2
        long_memory, short_memory = self.lstm_unit(day2, long_memory, short_memory)

        # Day 3
        long_memory, short_memory = self.lstm_unit(day3, long_memory, short_memory)

        # Day 4
        long_memory, short_memory = self.lstm_unit(day4, long_memory, short_memory)

        # Now return short_memory, which is the 'output' of the LSTM.
        return short_memory


    def configure_optimizers(self): # this configures the optimizer we want to use for backpropagation.
        # return Adam(self.parameters(), lr=0.1) # NOTE: Setting the learning rate to 0.1 trains way faster than
                                                 # using the default learning rate, lr=0.001, which requires a lot more
                                                 # training. For now, just going to use the default
        return Adam(self.parameters())


    def training_step(self, batch, batch_idx): # take a step during gradient descent.
        input_i, label_i = batch # collect input
        output_i = self.forward(input_i[0]) # run input through the neural network
        loss = (output_i - label_i)**2 ## loss = squared residual

        ###################
        #
        # Logging the loss and the predicted values so we can evaluate the training
        #
        ###################
        self.log("train_loss", loss)
        # NOTE: Our dataset consists of two sequences of values representing Company A and Company B.
        # For Company A, the goal is to predict that the value on Day 5 = 0, and for Company B,
        # the goal is to predict that the value on Day 5 = 1. We use label_i, the value we want to
        # predict, to keep track of which company we just made a prediction for and
        # log that output value under a company-specific name ("out_0" or "out_1").
        if (label_i == 0):
            self.log("out_0", output_i)
        else:
            self.log("out_1", output_i)

        return loss

## Creating an Instance of the LSTM

Now that I have created the class `LSTMbyHand`, which defines an LSTM, I can use it to create a model and print out the randomly initialized `Weights` and `Biases`.

Then, just for fun, I'll see what those random Weights and Biases predict for **Company A** and **Company B**. If they are good predictions, then I am done! However, the chances of getting good predictions from random values are very small.

In [3]:
# Create the model object, print out parameters and see how well
# the untrained LSTM can make predictions...
model = LSTMbyHand()

print("Before optimization, the parameters are...")
for name, param in model.named_parameters():
    print(name, param.data)

print("\nNow let's compare the observed and predicted values...")
# NOTE: To make predictions, we pass in the first 4 days worth of stock values
# in an array for each company. In this case, the only difference between the
# input values for Company A and B occurs on the first day. Company A has 0 and
# Company B has 1.
print("Company A: Observed = 0, Predicted =",
      model(torch.tensor([0., 0.5, 0.25, 1.])).detach())
print("Company B: Observed = 1, Predicted =",
      model(torch.tensor([1., 0.5, 0.25, 1.])).detach())

Seed set to 42


Before optimization, the parameters are...
wlr1 tensor(0.3367)
wlr2 tensor(0.1288)
blr1 tensor(0.)
wpr1 tensor(0.2345)
wpr2 tensor(0.2303)
bpr1 tensor(0.)
wp1 tensor(-1.1229)
wp2 tensor(-0.1863)
bp1 tensor(0.)
wo1 tensor(2.2082)
wo2 tensor(-0.6380)
bo1 tensor(0.)

Now let's compare the observed and predicted values...
Company A: Observed = 0, Predicted = tensor(-0.0377)
Company B: Observed = 1, Predicted = tensor(-0.0383)


## Initial Results 
With the unoptimized parameters (i.e., using the initial random weights), the predicted value for **Company A**, **-0.0377**, isn't terrible, since it is relatively close to the observed value, **0**. However, the predicted value for **Company B**, **-0.0383**, _is_ terrible, because it is relatively far from the observed value, **1**. So, that means I need to train the LSTM.

Note: I would want to train the model anyway; this first pass was only to check whether the random starting values happened to be close enough.
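To make the forward pass concrete, here is a minimal sketch that replays the four days with plain Python floats instead of tensors. It assumes the initial Weights exactly as printed above (rounded to four decimals, so the last digit may differ slightly) and leaves out the Biases because they all start at 0. The dictionary `W` and the function `predict` are names I made up for this illustration.

```python
import math

def sigmoid(z: float) -> float:
    return 1.0 / (1.0 + math.exp(-z))

# Initial Weights copied from the printout above (4 decimals); all Biases are 0.
W = {"wlr1": 0.3367, "wlr2": 0.1288, "wpr1": 0.2345, "wpr2": 0.2303,
     "wp1": -1.1229, "wp2": -0.1863, "wo1": 2.2082, "wo2": -0.6380}

def lstm_unit(x, long_memory, short_memory):
    """One LSTM step with floats, mirroring LSTMbyHand.lstm_unit (zero biases omitted)."""
    long_remember_percent = sigmoid(short_memory * W["wlr1"] + x * W["wlr2"])
    potential_remember_percent = sigmoid(short_memory * W["wpr1"] + x * W["wpr2"])
    potential_memory = math.tanh(short_memory * W["wp1"] + x * W["wp2"])
    updated_long = (long_memory * long_remember_percent
                    + potential_remember_percent * potential_memory)
    output_percent = sigmoid(short_memory * W["wo1"] + x * W["wo2"])
    updated_short = math.tanh(updated_long) * output_percent
    return updated_long, updated_short

def predict(days):
    """Unroll the unit over the given days, starting from empty memories."""
    long_memory = short_memory = 0.0
    for x in days:
        long_memory, short_memory = lstm_unit(x, long_memory, short_memory)
    return short_memory

pred_a = predict([0.0, 0.5, 0.25, 1.0])
pred_b = predict([1.0, 0.5, 0.25, 1.0])
print(f"Company A: {pred_a:.4f}")
print(f"Company B: {pred_b:.4f}")
```

Both values should land within rounding distance of the untrained predictions printed by the model, which is a quick way to confirm that nothing beyond the twelve parameters is involved in the prediction.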




---
---

### Time to Train my LSTM

Train the LSTM unit, and use `Lightning` and `TensorBoard` to evaluate the training.



### Use `DataLoader`


In [4]:
## create the training data as a tensor for the neural network.
inputs = torch.tensor([[0., 0.5, 0.25, 1.], [1., 0.5, 0.25, 1.]]) #A and B
labels = torch.tensor([0., 1.]) # Anticipated output predictions for company A and company B

dataset = TensorDataset(inputs, labels)
dataloader = DataLoader(dataset)
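As a quick check of what the `DataLoader` hands to `training_step`, the sketch below rebuilds the same two-sequence dataset and prints the shape of each batch. It relies on the `DataLoader` defaults (`batch_size=1`, `shuffle=False`). The extra leading dimension of size 1 is why `training_step` passes `input_i[0]` to `forward()`.

```python
import torch
from torch.utils.data import TensorDataset, DataLoader

inputs = torch.tensor([[0., 0.5, 0.25, 1.], [1., 0.5, 0.25, 1.]])
labels = torch.tensor([0., 1.])
dataloader = DataLoader(TensorDataset(inputs, labels))  # defaults: batch_size=1, shuffle=False

batches = list(dataloader)
for input_i, label_i in batches:
    # input_i has shape (1, 4): a batch holding one 4-day sequence
    # label_i has shape (1,): the Day 5 value for that sequence
    print(tuple(input_i.shape), tuple(label_i.shape), input_i[0].tolist(), label_i.item())
```

With only two sequences and one sequence per batch, each epoch consists of two optimizer steps.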

Next, I have to create a `Lightning Trainer`.

* `L.Trainer` - A class that I use to facilitate training of the model
    * I start with 2000 epochs, which may or may not be enough
    * Recall, I used the default learning rate, 0.001, which makes learning slow
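To get a feel for why the default learning rate is slow, here is a rough sketch of the Adam update rule for a single parameter, written in plain Python. It assumes PyTorch's default Adam settings (`betas=(0.9, 0.999)`, `eps=1e-8`) and an idealized, constant gradient of 5.0 that I picked for illustration. Real gradients change in size and sign, so treat the totals as an illustration rather than a measurement of this model. The function name `adam_updates` is hypothetical.

```python
import math

def adam_updates(grads, lr=0.001, beta1=0.9, beta2=0.999, eps=1e-8):
    """Return the update Adam would apply to one parameter for each gradient in grads."""
    m = v = 0.0
    updates = []
    for t, g in enumerate(grads, start=1):
        m = beta1 * m + (1 - beta1) * g          # running mean of the gradients
        v = beta2 * v + (1 - beta2) * g * g      # running mean of the squared gradients
        m_hat = m / (1 - beta1 ** t)             # bias correction
        v_hat = v / (1 - beta2 ** t)
        updates.append(-lr * m_hat / (math.sqrt(v_hat) + eps))
    return updates

# Hypothetical constant gradient; 4000 steps = 2000 epochs x 2 sequences per epoch.
constant_grad = [5.0] * 4000
slow = adam_updates(constant_grad, lr=0.001)
fast = adam_updates(constant_grad, lr=0.1)
print(f"lr=0.001: total movement {abs(sum(slow)):.3f}")
print(f"lr=0.1:   total movement {abs(sum(fast)):.3f}")
```

Under this idealization each step moves the parameter by roughly the learning rate, regardless of how large the gradient is, so with `lr=0.001` a parameter can only travel a few units over the whole 2000-epoch run.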

In [5]:
trainer = L.Trainer(max_epochs=2000)
trainer.fit(model, train_dataloaders=dataloader)

Using default `ModelCheckpoint`. Consider installing `litmodels` package to enable `LitModelCheckpoint` for automatic upload to the Lightning model registry.
GPU available: True (mps), used: True
TPU available: False, using: 0 TPU cores
HPU available: False, using: 0 HPUs

  | Name         | Type | Params | Mode
---------------------------------------------
  | other params | n/a  | 12     | n/a 
---------------------------------------------
12        Trainable params
0         Non-trainable params
12        Total params
0.000     Total estimated model params size (MB)
0         Modules in train mode
0         Modules in eval mode
/Users/lancehester/Documents/ltsm_project_pytorch/.venv/lib/python3.12/site-packages/lightning/pytorch/trainer/connectors/data_connector.py:425: The 'train_dataloader' does not have many workers which may be a bottleneck. Consider increasing the value of the `num_workers` argument` to `num_workers=13` in the `DataLoader` to improve performance.
/Users/lancehe

Training: |          | 0/? [00:00<?, ?it/s]

`Trainer.fit` stopped: `max_epochs=2000` reached.


### Okay, now that I have trained the model for 2000 epochs, I can see how good the predictions are.

In [6]:
print("\nNow let's compare the observed and predicted values...")
print("Company A: Observed = 0, Predicted =", model(torch.tensor([0., 0.5, 0.25, 1.])).detach())
print("Company B: Observed = 1, Predicted =", model(torch.tensor([1., 0.5, 0.25, 1.])).detach())


Now let's compare the observed and predicted values...
Company A: Observed = 0, Predicted = tensor(0.4342)
Company B: Observed = 1, Predicted = tensor(0.6171)


### Summarizing these results:

* The predictions are terrible. 
    * Company A - Day 5 prediction is 0.4342 -- very far from 0
    * Company B - Day 5 prediction is 0.6171 -- very far from 1

* Conclusion:
    * I have more training to do. 
    * A good place to start is to look at the `loss` values and `predictions` that were saved in the log files using `TensorBoard`


[TensorBoard](https://www.tensorflow.org/tensorboard) is a visualization toolkit from the TensorFlow project that provides tools and visualizations for machine learning experimentation. It is particularly useful for understanding, debugging, and optimizing machine learning models, and it can read the logs that Lightning wrote to the `lightning_logs` directory during training.


### To get TensorBoard working with VS Code

* Open the command palette (Ctrl/Cmd + Shift + P)
    * You may need to add tensorboard to your current virtual environment
        * In the terminal, I used `uv add tensorboard`, as I use uv to add modules
        * Note: this should already be done for you in this project, as all dependencies are in the pyproject.toml file.

* I had to `restart and run the code` in the notebook, which resulted in a message offering to add the TensorBoard VS Code extension

* Or, if need be, search for the command “Python: Launch TensorBoard” and press Enter.

* You will be able to select the folder where your TensorBoard log files are located. By default, the current working directory will be used. Here, I used the `lightning_logs` directory

* VS Code will then open a new tab with TensorBoard and will manage its lifecycle as well. This means that to kill the TensorBoard process, all you have to do is close the TensorBoard tab.

### Looking at the TensorBoard Training Loss Results

In the figures below, I show the TensorBoard loss (`train_loss`) figure, followed by the predictions for Company A (`out_0`) and the predictions for Company B (`out_1`).
* Notes on the axes:
    * the X-axis refers to the epoch runs
    * the Y-axis refers to the logged value: the loss in the first figure and the stock prediction in the other two
* Recall:
    * Company A, I want to predict 0
    * Company B, I want to predict 1

![train_loss](imgs/tensorboard_train_loss_part1.png)

![out_0](imgs/out_0_part1.png)
![out_1](imgs/out_1_part1.png)


#### Discussion of the Loss
The **loss** (`train_loss`) is going down, which is good, but it still has further to go.

#### Discussion of `out_0` and `out_1`

* When I look at the predictions for **Company A** (`out_0`), I see that they started out pretty good, close to **0**, then got really bad early in training, shooting all the way up to **0.5**, and are now starting to come back down.

* In contrast, when I look at the predictions for **Company B** (`out_1`), I see that they started out really bad, close to **0**, but have been getting better ever since and look like they could keep improving if I kept training.

#### Summarization
I need to perform more training to improve the predictions. So, time to add more epochs.




---
---

# Optimizing (Training) the Weights and Biases in the LSTM: Adding More Epochs without Starting Over


The good news is that because I am using **Lightning**, I can pick up where I left off training without having to start over from scratch. 

This works because when I train with **Lightning**, it creates _checkpoint_ files that keep track of the `Weights` and `Biases` as they change. As a result, all I have to do to pick up where I left off is tell the `Trainer` where the checkpoint files are located.

How awesome, as this capability will save me a lot of time since I don't have to retrain the first **2000** epochs. 

So, I am going to add an additional **1000** epochs to the training.
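The bookkeeping behind "adding 1000 epochs" can be read off the checkpoint file name that Lightning reports when it restores the state (`epoch=1999-step=4000.ckpt` in my run). The sketch below parses that name with a regular expression. `parse_checkpoint_name` is a hypothetical helper I wrote for this illustration, not part of Lightning, and the path is shortened to the part relative to the project directory.

```python
import re

def parse_checkpoint_name(path: str) -> tuple[int, int]:
    """Hypothetical helper: extract (epoch, step) from a name like 'epoch=1999-step=4000.ckpt'."""
    match = re.search(r"epoch=(\d+)-step=(\d+)\.ckpt$", path)
    if match is None:
        raise ValueError(f"unrecognized checkpoint name: {path}")
    return int(match.group(1)), int(match.group(2))

epoch, step = parse_checkpoint_name(
    "lightning_logs/version_1/checkpoints/epoch=1999-step=4000.ckpt")

epochs_done = epoch + 1                 # epochs are numbered from 0
steps_per_epoch = step // epochs_done   # one optimizer step per company sequence
new_max_epochs = 3000                   # the value passed to the new Trainer
epochs_to_run = new_max_epochs - epochs_done

print(f"epochs done: {epochs_done}, steps per epoch: {steps_per_epoch}")
print(f"epochs still to run with max_epochs={new_max_epochs}: {epochs_to_run}")
```

Because `max_epochs` is a total rather than an increment, resuming from the 2000-epoch checkpoint with `max_epochs=3000` runs only the remaining 1000 epochs.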

In [None]:
# First, find where the most recent checkpoint files are stored
path_to_checkpoint = trainer.checkpoint_callback.best_model_path ## By default, "best" = "most recent"
print("The new trainer will start where the last left off, and the check point data is here: " +
      path_to_checkpoint + "\n")

## Then create a new Lightning Trainer
trainer = L.Trainer(max_epochs=3000) # Before, max_epochs=2000, so, by setting it to 3000, I am adding 1000 more.
# And then I call fit() using the path to the most recent checkpoint files
# so that I can pick up where I left off.
trainer.fit(model, train_dataloaders=dataloader, ckpt_path=path_to_checkpoint)

Using default `ModelCheckpoint`. Consider installing `litmodels` package to enable `LitModelCheckpoint` for automatic upload to the Lightning model registry.
GPU available: True (mps), used: True
TPU available: False, using: 0 TPU cores
HPU available: False, using: 0 HPUs
Restoring states from the checkpoint path at /Users/lancehester/Documents/ltsm_project_pytorch/lightning_logs/version_1/checkpoints/epoch=1999-step=4000.ckpt
/Users/lancehester/Documents/ltsm_project_pytorch/.venv/lib/python3.12/site-packages/lightning/pytorch/callbacks/model_checkpoint.py:362: The dirpath has changed from '/Users/lancehester/Documents/ltsm_project_pytorch/lightning_logs/version_1/checkpoints' to '/Users/lancehester/Documents/ltsm_project_pytorch/lightning_logs/version_2/checkpoints', therefore `best_model_score`, `kth_best_model_path`, `kth_value`, `last_model_path` and `best_k_models` won't be reloaded. Only `best_model_path` will be reloaded.

  | Name         | Type | Params | Mode
---------------

The new trainer will start where the last left off, and the check point data is here: /Users/lancehester/Documents/ltsm_project_pytorch/lightning_logs/version_1/checkpoints/epoch=1999-step=4000.ckpt



Training: |          | 0/? [00:00<?, ?it/s]

`Trainer.fit` stopped: `max_epochs=3000` reached.


In [None]:
# Now that I have added **1000** epochs to the training, I check the predictions...
print("\nNow let's compare the observed and predicted values...")
print("Company A: Observed = 0, Predicted =", model(torch.tensor([0., 0.5, 0.25, 1.])).detach())
print("Company B: Observed = 1, Predicted =", model(torch.tensor([1., 0.5, 0.25, 1.])).detach())


Now let's compare the observed and predicted values...
Company A: Observed = 0, Predicted = tensor(0.2708)
Company B: Observed = 1, Predicted = tensor(0.7534)


#### Results with Additional 1000 epochs

The results are better. Check out the blue lines.

![train_loss](imgs/loss_part_2.png)

The loss is getting smaller


And predictions are improving. 

![out_0](imgs/out_0_part2.png)

![out_1](imgs/out_1_part2.png)


I can do better. Let me add `2000` more epochs of training.


In [None]:
## First, find where the most recent checkpoint files are stored
path_to_checkpoint = trainer.checkpoint_callback.best_model_path ## By default, "best" = "most recent"
print("The new trainer will start where the last left off, and the check point data is here: " +
      path_to_checkpoint + "\n")

## Then create a new Lightning Trainer
trainer = L.Trainer(max_epochs=5000) # Before, max_epochs=3000, so, by setting it to 5000, I am adding 2000 more.
## And then I call fit() using the path to the most recent checkpoint files
## so that I can pick up where I left off.
trainer.fit(model, train_dataloaders=dataloader, ckpt_path=path_to_checkpoint)

Using default `ModelCheckpoint`. Consider installing `litmodels` package to enable `LitModelCheckpoint` for automatic upload to the Lightning model registry.
GPU available: True (mps), used: True
TPU available: False, using: 0 TPU cores
HPU available: False, using: 0 HPUs
Restoring states from the checkpoint path at /Users/lancehester/Documents/ltsm_project_pytorch/lightning_logs/version_2/checkpoints/epoch=2999-step=6000.ckpt
/Users/lancehester/Documents/ltsm_project_pytorch/.venv/lib/python3.12/site-packages/lightning/pytorch/callbacks/model_checkpoint.py:362: The dirpath has changed from '/Users/lancehester/Documents/ltsm_project_pytorch/lightning_logs/version_2/checkpoints' to '/Users/lancehester/Documents/ltsm_project_pytorch/lightning_logs/version_3/checkpoints', therefore `best_model_score`, `kth_best_model_path`, `kth_value`, `last_model_path` and `best_k_models` won't be reloaded. Only `best_model_path` will be reloaded.

  | Name         | Type | Params | Mode
---------------

The new trainer will start where the last left off, and the check point data is here: /Users/lancehester/Documents/ltsm_project_pytorch/lightning_logs/version_2/checkpoints/epoch=2999-step=6000.ckpt



Training: |          | 0/? [00:00<?, ?it/s]

`Trainer.fit` stopped: `max_epochs=5000` reached.


In [12]:
print("\nNow, I compare the observed and predicted values...")
print("Company A: Observed = 0, Predicted =", model(torch.tensor([0., 0.5, 0.25, 1.])).detach())
print("Company B: Observed = 1, Predicted =", model(torch.tensor([1., 0.5, 0.25, 1.])).detach())


Now, I compare the observed and predicted values...
Company A: Observed = 0, Predicted = tensor(0.0022)
Company B: Observed = 1, Predicted = tensor(0.9693)


## The predictions are the best yet (see red lines in the figures).

* The prediction for Company A, `0.0022`, is close to 0
* The prediction for Company B, `0.9693`, is close to 1

The TensorBoard results:

![train_loss](imgs/loss_part_3.png)

The loss has gotten small enough that adding more epochs probably is not going to provide any additional benefit, so I am going to say that I am done with the training phase. Yahoo!
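To put numbers on that trend, the sketch below recomputes the squared residual, the same loss used in `training_step`, from the predictions printed after each round of training. The prediction values are copied from the cell outputs in this notebook. The averaged column is my own summary and is not something Lightning logged.

```python
# (epochs trained, prediction for Company A, prediction for Company B),
# copied from the cell outputs in this notebook
results = [
    (0,    -0.0377, -0.0383),
    (2000,  0.4342,  0.6171),
    (3000,  0.2708,  0.7534),
    (5000,  0.0022,  0.9693),
]
target_a, target_b = 0.0, 1.0

summary = []
for epochs, pred_a, pred_b in results:
    loss_a = (pred_a - target_a) ** 2   # squared residual for Company A
    loss_b = (pred_b - target_b) ** 2   # squared residual for Company B
    mean_loss = (loss_a + loss_b) / 2
    summary.append((epochs, loss_a, loss_b, mean_loss))
    print(f"{epochs:>5} epochs: loss A = {loss_a:.6f}, loss B = {loss_b:.6f}, mean = {mean_loss:.6f}")
```

The averaged loss falls at every stage, even though the loss for Company A alone rises between the untrained model and the 2000-epoch model. That matches the early jump toward 0.5 seen in the `out_0` figure.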


Predictions are good as well. 

![out_0](imgs/out_0_part3.png)

![out_1](imgs/out_1_part3.png)

Finally, here is a look at the final weights and biases.



In [11]:
print("After optimization, the parameters are...")
for name, param in model.named_parameters():
    print(name, param.data)

After optimization, the parameters are...
wlr1 tensor(2.7043)
wlr2 tensor(1.6307)
blr1 tensor(1.6234)
wpr1 tensor(1.9983)
wpr2 tensor(1.6525)
bpr1 tensor(0.6204)
wp1 tensor(1.4122)
wp2 tensor(0.9393)
bp1 tensor(-0.3217)
wo1 tensor(4.3848)
wo2 tensor(-0.1943)
bo1 tensor(0.5935)
