# Long Short-Term Memory (LSTM)

**Problem with Basic RNNs**

Basic RNNs suffer from the "vanishing gradient" problem. When a long sequence is processed, the gradients propagated backward through time are multiplied by a similar factor at every step and can shrink toward zero, so the weights receive almost no learning signal from early time steps. As a result, basic RNNs have difficulty capturing long-term dependencies in the data, making them less effective for tasks like natural language processing (NLP), where long-range dependencies are common.
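
To make the shrinking gradient concrete, here is a minimal sketch using a scalar RNN $h_t = tanh(w * h_{t-1} + x_t)$. The function name `gradient_through_time`, the weight `0.9`, and the constant inputs are illustrative assumptions, not part of any library.

```python
import math

def gradient_through_time(w, inputs, h0=0.0):
    """Return d h_T / d h_0 for the scalar RNN h_t = tanh(w * h_{t-1} + x_t)."""
    h = h0
    grad = 1.0
    for x in inputs:
        h = math.tanh(w * h + x)
        # Chain rule: d h_t / d h_{t-1} = w * (1 - h_t^2)
        grad *= w * (1.0 - h * h)
    return grad

# Illustrative values (assumptions): recurrent weight 0.9, constant input 0.5
short = gradient_through_time(w=0.9, inputs=[0.5] * 5)
long = gradient_through_time(w=0.9, inputs=[0.5] * 50)
print(f"gradient after 5 steps:  {short:.3e}")
print(f"gradient after 50 steps: {long:.3e}")
```

Each step multiplies the gradient by a factor smaller than 1 here, so the contribution of the first time step becomes negligible for the longer sequence.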

Long Short-Term Memory (LSTM) is a type of recurrent neural network architecture designed to address the vanishing gradient problem. LSTM units have a more sophisticated cell structure, allowing them to selectively retain and forget information over time. The architecture includes gates that control the flow of information, enabling LSTMs to learn long-range dependencies more effectively than basic RNNs.

![RNN vs LSTM](./../../assets/rnn-vs-lstm.jpg)

## LSTM Cell Structure

The LSTM cell consists of a cell state and three gates:

- **Cell State ($C_t$):** This is the memory of the LSTM. It plays a role similar to the hidden state in basic RNNs, but it is modified only through the gates rather than being rewritten at every step, so it can retain information over long periods of time.
- **Input Gate ($i_t$):** This gate controls the amount of new information that should be added to the cell state. It decides what information from the current input and the previous hidden state should be stored in the cell state.
- **Forget Gate ($f_t$):** This gate controls what information should be discarded from the cell state. It decides which information from the previous cell state should be forgotten.
- **Output Gate ($o_t$):** This gate controls how much of the cell state should be exposed as the output.
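
As a rough illustration of what "controlling" information means, a gate is a vector of values between 0 and 1 that scales another vector element-wise. The sketch below uses made-up numbers; `memory` and `gate_logits` are hypothetical names chosen for this example.

```python
import math

def sigmoid(z):
    """Squash any real number into the open interval (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))

memory = [2.0, -1.0, 0.5]        # values the gate acts on (illustrative)
gate_logits = [6.0, 0.0, -6.0]   # pre-activation values of the gate (illustrative)

gate = [sigmoid(z) for z in gate_logits]
kept = [g * m for g, m in zip(gate, memory)]

print("gate:", [round(g, 3) for g in gate])
print("kept:", [round(k, 3) for k in kept])
```

A gate value near 1 passes the entry through almost unchanged, a value near 0 suppresses it, and 0.5 halves it.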

## Workflow of an LSTM Cell

The LSTM cell's architecture allows it to selectively retain or forget information at each time step, making it well-suited for tasks involving long-term dependencies. Let's explain the complete workflow of an LSTM cell in a step-by-step manner:

- **Initialization:** At the beginning of the sequence or training process, the LSTM cell is initialized with an initial cell state ($C_0$) and an initial hidden state ($h_0$). These states are often set to zeros or learned from data during training.
    - Initial cell state ($C_0$): $C_0 = 0$ or learned from data
    - Initial hidden state ($h_0$): $h_0 = 0$ or learned from data

- **Step 1 - Input Gate ($i_t$):** The LSTM cell takes the current input ($x_t$) and the previous hidden state ($h_{t-1}$) as inputs. The input gate ($i_t$) is calculated from these inputs using its own weights and biases, followed by a sigmoid, so each of its entries lies between 0 and 1. The input gate determines how much of the new candidate information should be added to the cell state ($C_t$).
    - Calculate the input gate $(i_t): i_t = sigmoid(W_{ii} * x_t + U_{ii} * h_{t-1} + b_{ii})$

- **Step 2 - Forget Gate ($f_t$):** The forget gate ($f_t$) is calculated using the current input ($x_t$) and the previous hidden state ($h_{t-1}$), again through a sigmoid. It decides what information from the previous cell state ($C_{t-1}$) should be forgotten: entries near 0 erase the corresponding part of the cell state, and entries near 1 keep it.
    - Calculate the forget gate $(f_t): f_t = sigmoid(W_{if} * x_t + U_{if} * h_{t-1} + b_{if})$

- **Step 3 - Cell State Update:** The cell state ($C_t$) is updated using the input gate ($i_t$) and the forget gate ($f_t$). The new cell state is the previous cell state ($C_{t-1}$) scaled by the forget gate, plus a candidate cell state ($\tilde C_t$) scaled by the input gate. Both products in the update are element-wise.
    - Calculate the candidate cell state $(\tilde C_t): \tilde C_t = tanh(W_{C} * x_t + U_{C} * h_{t-1} + b_{C})$
    - Update the cell state $(C_t): C_t = f_t * C_{t-1} + i_t * \tilde C_t$

- **Step 4 - Output Gate ($o_t$):** The output gate ($o_t$) is calculated using the current input ($x_t$) and the previous hidden state ($h_{t-1}$), again through a sigmoid. The output gate determines how much of the cell state ($C_t$) should be exposed as the output ($h_t$).
    - Calculate the output gate $(o_t): o_t = sigmoid(W_{io} * x_t + U_{io} * h_{t-1} + b_{io})$

- **Step 5 - Hidden State Update:** The hidden state ($h_t$) is updated using the output gate ($o_t$) and the updated cell state ($C_t$). The new hidden state is a gated, squashed version of the cell state. It is passed to the next time step ($t+1$), where it serves as the previous hidden state.
    - Calculate the new hidden state $(h_t): h_t = o_t * tanh(C_t)$

- **Repeat for the Next Time Step:** The process described in Steps 1 to 5 is repeated for each time step in the sequence, allowing the LSTM cell to process the entire input sequence and update its cell state and hidden state accordingly.

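- **Final Output:** After processing all time steps in the sequence, the final hidden state ($h_T$) can be used as a summary of the whole sequence for tasks like sentiment analysis. Tasks that need an output at every step, and sequence-to-sequence tasks such as machine translation, typically use the hidden states from all time steps or pass the final states on to a decoder.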
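
The steps above can be sketched in plain Python. This is a minimal illustration under stated assumptions, not how any library implements LSTMs: the helper names (`affine`, `lstm_step`, `random_params`), the random weight range, and the toy input sizes are all made up for this example, and vectors are plain lists rather than tensors.

```python
import math
import random

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def affine(W, U, b, x, h):
    """Compute W x + U h + b, with matrices given as lists of rows."""
    return [
        sum(w * x_j for w, x_j in zip(W_row, x))
        + sum(u * h_j for u, h_j in zip(U_row, h))
        + b_k
        for W_row, U_row, b_k in zip(W, U, b)
    ]

def lstm_step(x, h_prev, c_prev, params):
    """One time step: gates, candidate, cell state update, hidden state update."""
    i = [sigmoid(z) for z in affine(*params["i"], x, h_prev)]          # Step 1
    f = [sigmoid(z) for z in affine(*params["f"], x, h_prev)]          # Step 2
    c_tilde = [math.tanh(z) for z in affine(*params["c"], x, h_prev)]  # Step 3 (candidate)
    c = [f_k * c_k + i_k * ct_k                                        # Step 3 (update)
         for f_k, c_k, i_k, ct_k in zip(f, c_prev, i, c_tilde)]
    o = [sigmoid(z) for z in affine(*params["o"], x, h_prev)]          # Step 4
    h = [o_k * math.tanh(c_k) for o_k, c_k in zip(o, c)]               # Step 5
    return h, c

def random_params(input_size, hidden_size, seed=0):
    """Hypothetical initializer: small uniform weights, zero biases."""
    rng = random.Random(seed)

    def block():
        W = [[rng.uniform(-0.5, 0.5) for _ in range(input_size)] for _ in range(hidden_size)]
        U = [[rng.uniform(-0.5, 0.5) for _ in range(hidden_size)] for _ in range(hidden_size)]
        return W, U, [0.0] * hidden_size

    return {name: block() for name in ("i", "f", "o", "c")}

# Toy sequence of three 3-dimensional inputs, hidden size 2 (illustrative values)
params = random_params(input_size=3, hidden_size=2)
h, c = [0.0, 0.0], [0.0, 0.0]   # initialization: h_0 = 0, C_0 = 0
inputs = [[1.0, 0.0, -1.0], [0.5, 0.5, 0.5], [0.0, 1.0, 0.0]]
for x_t in inputs:
    h, c = lstm_step(x_t, h, c, params)
print("final hidden state:", h)
```

Real implementations batch the inputs and usually fuse the four affine transforms into a single matrix multiplication, but the arithmetic per time step is the same.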

The LSTM cell's ability to control the flow of information through input, forget, and output gates enables it to selectively retain or forget information over long periods, making it more effective for capturing long-term dependencies in sequential data compared to basic RNNs. This makes LSTMs well-suited for various NLP tasks where understanding context and long-range dependencies is crucial.
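
One way to see why this helps with vanishing gradients is to differentiate the cell state update from Step 3 along the direct path from $C_{t-1}$ to $C_t$. The following is a sketch under a simplifying assumption: the gate values are treated as constants, which ignores their indirect dependence on earlier states through $h_{t-1}$.

$$
\frac{\partial C_t}{\partial C_{t-1}} \approx \mathrm{diag}(f_t),
\qquad
\frac{\partial C_T}{\partial C_k} \approx \prod_{t=k+1}^{T} \mathrm{diag}(f_t)
$$

Along this path the gradient is scaled only by forget gate values. When the network learns to keep $f_t$ close to 1 for some entries of the cell state, the product stays close to 1 for those entries and the gradient can survive many time steps. In a basic RNN, by contrast, every factor includes the recurrent weight matrix and the derivative of $tanh$.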

```python
from tensorflow.keras.datasets import imdb
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Embedding, LSTM, Dense
from tensorflow.keras.preprocessing import sequence
```

Load and Preprocess the dataset

```python
# Set parameters
max_features = 10000  # Vocabulary size (use the top 10,000 most frequent words)
maxlen = 500  # Maximum sequence length (truncate/pad sequences to this length)
batch_size = 32

# Load the IMDB dataset
(x_train, y_train), (x_test, y_test) = imdb.load_data(num_words=max_features)

# Pad/truncate sequences to a fixed length
x_train = sequence.pad_sequences(x_train, maxlen=maxlen)
x_test = sequence.pad_sequences(x_test, maxlen=maxlen)
```
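
With its default arguments, `pad_sequences` pads with zeros at the front and, for sequences that are too long, drops tokens from the front. The helper below is a hypothetical pure-Python stand-in (`pad_pre` is not a Keras function) that mimics this default behavior for a single sequence, to show what happens to each review.

```python
def pad_pre(seq, maxlen, value=0):
    """Hypothetical helper: 'pre' padding and 'pre' truncation for one sequence."""
    truncated = seq[-maxlen:]  # keep only the last maxlen tokens
    return [value] * (maxlen - len(truncated)) + truncated

print(pad_pre([7, 8, 9], maxlen=5))           # [0, 0, 7, 8, 9]
print(pad_pre([1, 2, 3, 4, 5, 6], maxlen=4))  # [3, 4, 5, 6]
```

Padding at the front means the real tokens sit at the end of the sequence, right before the final hidden state is read out.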

Build and Compile the LSTM Model

```python
# Build the LSTM model
model = Sequential()
model.add(Embedding(max_features, 32))
model.add(LSTM(32))  # Use LSTM layer instead of SimpleRNN
model.add(Dense(1, activation='sigmoid'))

# Compile the model
model.compile(optimizer='rmsprop', loss='binary_crossentropy', metrics=['accuracy'])
```
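
As a sanity check on the size of this model, the number of trainable parameters can be worked out by hand. The arithmetic below is a sketch that assumes the default layer settings (biases enabled, one weight block each for the three gates and the candidate); the variable names are chosen for this example.

```python
vocab_size, embed_dim, units = 10000, 32, 32

# Embedding: one embed_dim-sized vector per vocabulary entry
embedding_params = vocab_size * embed_dim

# LSTM: four blocks (input, forget, output gates and candidate),
# each with input weights W, recurrent weights U, and a bias b
lstm_params = 4 * (embed_dim * units + units * units + units)

# Dense: one weight per LSTM unit plus one bias
dense_params = units * 1 + 1

total = embedding_params + lstm_params + dense_params
print(embedding_params, lstm_params, dense_params, total)
# 320000 8320 33 328353
```

Most of the parameters sit in the embedding table; the LSTM layer itself is comparatively small.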

Train the Model

```python
model.fit(x_train, y_train, epochs=10, batch_size=batch_size, validation_split=0.2)
```


Evaluate the Model

```python
loss, accuracy = model.evaluate(x_test, y_test)
print(f"Test loss: {loss:.4f}, Test accuracy: {accuracy:.4f}")
```

```
Test loss: 0.3480, Test accuracy: 0.8717
```
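
To clarify what the two reported numbers measure, here is a pure-Python sketch of binary cross-entropy and of accuracy with a 0.5 decision threshold. The function names and the four example predictions are made up for illustration; Keras computes these metrics internally on tensors, and the clipping constant `eps` is an assumption of this sketch.

```python
import math

def binary_cross_entropy(y_true, y_prob, eps=1e-7):
    """Mean negative log-likelihood of the true labels under the predicted probabilities."""
    total = 0.0
    for y, p in zip(y_true, y_prob):
        p = min(max(p, eps), 1.0 - eps)  # clip to avoid log(0)
        total += -(y * math.log(p) + (1 - y) * math.log(1.0 - p))
    return total / len(y_true)

def binary_accuracy(y_true, y_prob, threshold=0.5):
    """Fraction of examples whose thresholded prediction matches the label."""
    hits = sum(int(p > threshold) == y for y, p in zip(y_true, y_prob))
    return hits / len(y_true)

# Illustrative labels and sigmoid outputs (not from the IMDB run)
y_true = [1, 0, 1, 0]
y_prob = [0.9, 0.2, 0.4, 0.6]

print(f"loss: {binary_cross_entropy(y_true, y_prob):.4f}")
print(f"accuracy: {binary_accuracy(y_true, y_prob):.2f}")
```

The loss rewards confident correct probabilities, while accuracy only checks which side of the threshold each prediction falls on.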


This program loads the IMDB dataset, which consists of movie reviews labeled with sentiment (positive or negative) and already encoded as sequences of word indices. It pads or truncates each review to a fixed length, builds an LSTM model with the Keras Sequential API, and trains it on the training set, holding out 20% of it for validation. Finally, it evaluates the model on the test set and prints the test loss and accuracy.

## References

- [Understanding LSTM Networks](https://colah.github.io/posts/2015-08-Understanding-LSTMs/)