# 1997 LSTM

The seminal paper for Long Short-Term Memory (LSTM) is titled **"Long Short-Term Memory"** and was authored by **Sepp Hochreiter and Jürgen Schmidhuber**. It was published in **1997** in the journal *Neural Computation*. This paper introduced the LSTM architecture as a solution to the vanishing gradient problem in Recurrent Neural Networks (RNNs), enabling them to learn long-term dependencies more effectively.

The citation for this paper, followed by two tutorial references on LSTM networks:

- **[Hochreiter, S., & Schmidhuber, J. (1997). Long Short-Term Memory. Neural Computation, 9(8), 1735–1780.](https://deeplearning.cs.cmu.edu/F23/document/readings/LSTM.pdf)**

- [Staudemeyer, R. C., & Morris, E. R. (2019). Understanding LSTM: A tutorial into Long Short-Term Memory Recurrent Neural Networks. arXiv:1909.09586.](https://arxiv.org/pdf/1909.09586)

- [Sherstinsky, A. (2018). Fundamentals of Recurrent Neural Network (RNN) and Long Short-Term Memory (LSTM) Network. arXiv:1808.03314.](https://arxiv.org/pdf/1808.03314)


This work is foundational in the field of deep learning, particularly for sequence data tasks like speech recognition, natural language processing, and time-series forecasting.

The paper is a landmark in the field of recurrent neural networks (RNNs). Here's a breakdown of each section of the paper and its significance:

### 1. **Introduction**
   - **Importance**: This section sets up the context by explaining the challenge of learning long-term dependencies in RNNs due to vanishing gradients. LSTM addresses this by introducing a structure that ensures constant error flow, solving long time-lag problems without loss of short-term memory capabilities.

### 2. **Previous Work**
   - **Importance**: The authors review existing recurrent network methods like Backpropagation Through Time (BPTT), Real-Time Recurrent Learning (RTRL), and others. All suffer from issues when dealing with long time lags. This sets the stage for LSTM as a solution to these problems.
   
### 3. **Constant Error Backpropagation**
   - **Importance**: This section explains in detail why error gradients in traditional RNNs either explode or vanish when propagated backward in time. This leads to inefficient learning, especially over long time intervals. It highlights the core problem LSTM is designed to solve: preserving the error signal over long time steps, allowing effective learning over long sequences.
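To make the vanishing and exploding behaviour concrete, here is a minimal NumPy sketch. It is not taken from the paper: it assumes a toy scalar recurrence $h_t = f(w \cdot h_{t-1})$, and the names `backprop_factor`, `w`, and `h0` are illustrative. The gradient of the last state with respect to the first is the product of the per-step factors $w \cdot f'(\cdot)$, so it shrinks or grows exponentially unless that factor is exactly 1, which is the "constant error flow" idea.

```python
import numpy as np

def backprop_factor(w, steps, activation="tanh", h0=0.5):
    """Return |d h_T / d h_0| for the toy recurrence h_t = f(w * h_{t-1})."""
    h, grad = h0, 1.0
    for _ in range(steps):
        pre = w * h
        if activation == "tanh":
            h = np.tanh(pre)
            grad *= w * (1.0 - h ** 2)  # chain rule through tanh
        else:  # linear unit: f(z) = z, so f'(z) = 1
            h = pre
            grad *= w
    return abs(grad)

for steps in (1, 10, 50, 100):
    print(
        steps,
        backprop_factor(0.9, steps),            # shrinks towards 0 (vanishing)
        backprop_factor(1.1, steps, "linear"),  # grows without bound (exploding)
        backprop_factor(1.0, steps, "linear"),  # stays at 1 (constant error flow)
    )
```

The last column corresponds to a linear unit with a self-connection of weight 1, which is the situation the memory cell is built around.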

### 4. **The Concept of Long Short-Term Memory**
   - **Importance**: Here, LSTM’s architecture is introduced, featuring **memory cells** and multiplicative **gate units** (an input gate and an output gate). The input gate protects the cell's contents from irrelevant inputs, and the output gate protects other units from cell contents that are currently irrelevant. This lets the network learn what to store and when to read it out, which is crucial for tasks with long-term dependencies. The **forget gate** found in most modern LSTM implementations is not part of this paper; it was added later by Gers, Schmidhuber, and Cummins (2000).

### 5. **Experiments**
   - **Importance**: LSTM is tested against several long-time-lag problems, such as the Embedded Reber Grammar and adding and multiplication problems. The results show that LSTM outperforms previous methods, successfully solving tasks that other RNN architectures fail at due to their inability to handle long sequences. This section highlights LSTM's practical effectiveness.
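To give a feel for what a long-time-lag benchmark looks like, here is a sketch of a data generator for a simplified variant of the adding problem. This is an assumption-laden illustration, not the paper's exact setup: the value range, marker encoding, and target scaling used by Hochreiter and Schmidhuber differ, and the function name `make_adding_problem` is hypothetical. Each time step carries a random value and a marker; exactly two steps are marked, and the target is the sum of the two marked values.

```python
import numpy as np

def make_adding_problem(n_samples, seq_len, seed=0):
    """Simplified adding problem: sum the two values whose marker is 1."""
    rng = np.random.default_rng(seed)
    values = rng.uniform(0.0, 1.0, size=(n_samples, seq_len))
    markers = np.zeros((n_samples, seq_len))
    for i in range(n_samples):
        # Pick two distinct positions to mark in each sequence.
        first, second = rng.choice(seq_len, size=2, replace=False)
        markers[i, [first, second]] = 1.0
    X = np.stack([values, markers], axis=-1)  # shape: (n_samples, seq_len, 2)
    y = (values * markers).sum(axis=1)        # shape: (n_samples,)
    return X, y

X, y = make_adding_problem(n_samples=4, seq_len=100)
print(X.shape, y.shape)
```

Because the two marked positions can be far apart, the network has to carry the first value across many steps before it can produce the sum.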

### 6. **Discussion**
   - **Importance**: This section outlines the limitations and advantages of LSTM. While LSTM has some restrictions (e.g., it has difficulty with strongly delayed XOR-like problems and with precise counting of time steps), it excels in handling tasks with long time dependencies, noisy inputs, and distributed representations. LSTM is robust, works over a wide range of parameters, and offers a computational complexity comparable to other recurrent models but with superior long-time-lag performance.

### 7. **Conclusion**
   - **Importance**: The paper concludes by emphasizing LSTM's ability to overcome the limitations of traditional RNNs in handling long time lags. The architecture's constant error flow, controlled by gate units, is highlighted as the key innovation that allows LSTM to solve complex, long-sequence tasks. The authors also suggest areas for further research, including real-world applications like time-series prediction, speech processing, and music composition.

### **Significance of LSTM**:
The importance of LSTM lies in its architecture, which allows recurrent neural networks to remember information for extended periods. This makes it crucial for tasks such as language modeling, speech recognition, time-series forecasting, and more, where long-term dependencies are essential for performance. The ability to control what information gets stored and forgotten makes LSTM flexible and effective across many domains. This innovation has been foundational in the development of sequence-based models and has influenced modern architectures like GRUs and attention-based models (Transformers).

Here's an example of how to use an LSTM model for a simple language model in Python using TensorFlow and Keras. This example will demonstrate how to generate text based on a sequence of characters using an LSTM network.

In [None]:
#pip install tensorflow numpy

In [1]:
import numpy as np
import tensorflow as tf
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import LSTM, Dense, Embedding
from tensorflow.keras.utils import to_categorical

# Example text data
text = "We are learning how to build a language model using LSTM networks."

# Character level vocabulary
chars = sorted(list(set(text)))
char_to_index = {c: i for i, c in enumerate(chars)}
index_to_char = {i: c for i, c in enumerate(chars)}

# Convert the text into integer sequence
sequence_length = 10
step = 1
sequences = []
next_chars = []

for i in range(0, len(text) - sequence_length, step):
    sequences.append(text[i: i + sequence_length])
    next_chars.append(text[i + sequence_length])

# Convert sequences and next_chars to integer indices
X = np.zeros((len(sequences), sequence_length), dtype=np.int32)
y = np.zeros((len(sequences), len(chars)), dtype=np.float32)

for i, sequence in enumerate(sequences):
    for t, char in enumerate(sequence):
        X[i, t] = char_to_index[char]
    y[i] = to_categorical(char_to_index[next_chars[i]], num_classes=len(chars))

# Build the LSTM model
model = Sequential()
model.add(Embedding(input_dim=len(chars), output_dim=50, input_length=sequence_length))
model.add(LSTM(128, return_sequences=False))
model.add(Dense(len(chars), activation='softmax'))

# Compile the model
model.compile(optimizer='adam', loss='categorical_crossentropy')

# Train the model
model.fit(X, y, batch_size=32, epochs=20)

# Function to generate text from the model
def generate_text(model, seed_text, length=100):
    generated_text = seed_text
    for _ in range(length):
        x_pred = np.zeros((1, sequence_length), dtype=np.int32)
        for t, char in enumerate(seed_text):
            x_pred[0, t] = char_to_index[char]

        # Predict the next character
        preds = model.predict(x_pred, verbose=0)[0]
        next_index = np.argmax(preds)
        next_char = index_to_char[next_index]

        # Append next character to the generated text
        generated_text += next_char

        # Shift seed text
        seed_text = seed_text[1:] + next_char

    return generated_text

# Generate text based on a seed
seed_text = "We are lea"
generated_text = generate_text(model, seed_text, length=100)
print("Generated text:\n", generated_text)


2024-10-03 09:26:33.235860: I tensorflow/core/platform/cpu_feature_guard.cc:210] This TensorFlow binary is optimized to use available CPU instructions in performance-critical operations.
To enable the following instructions: AVX2 AVX512F FMA, in other operations, rebuild TensorFlow with the appropriate compiler flags.


Epoch 1/20




[1m2/2[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m2s[0m 12ms/step - loss: 3.1780
Epoch 2/20
[1m2/2[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m0s[0m 8ms/step - loss: 3.1680 
Epoch 3/20
[1m2/2[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m0s[0m 8ms/step - loss: 3.1590 
Epoch 4/20
[1m2/2[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m0s[0m 8ms/step - loss: 3.1468 
Epoch 5/20
[1m2/2[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m0s[0m 8ms/step - loss: 3.1282 
Epoch 6/20
[1m2/2[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m0s[0m 7ms/step - loss: 3.1121 
Epoch 7/20
[1m2/2[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m0s[0m 7ms/step - loss: 3.0831 
Epoch 8/20
[1m2/2[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m0s[0m 7ms/step - loss: 3.0447 
Epoch 9/20
[1m2/2[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m0s[0m 7ms/step - loss: 2.9671 
Epoch 10/20
[1m2/2[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m0s[0m 7ms/step - loss: 2.9385 
Epoch 11/20
[1m2/2[0m [32m━

### Explanation:

1. **Data Preparation**:
    - The input text is processed character by character.
    - Each sequence of `sequence_length` characters is used as input, and the following character is the target (i.e., what the LSTM should predict).
   
2. **Model**:
    - The model uses an Embedding layer to map each character to a 50-dimensional space.
    - The LSTM layer with 128 units processes the sequence.
    - A Dense layer with `softmax` activation outputs the probability distribution over all possible next characters.

3. **Training**:
    - The model is compiled with the Adam optimizer and categorical cross-entropy loss.
    - The model is trained on sequences of 10 characters from the input text.

4. **Text Generation**:
    - After training, the `generate_text` function predicts the next character based on the input seed and generates text character by character.
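The sliding-window step in the data preparation is easiest to see on a tiny string. The sketch below reuses the same loop as the notebook but wraps it in a helper, `make_windows`, which is a hypothetical name introduced here for illustration.

```python
def make_windows(text, sequence_length, step=1):
    """Split text into fixed-length input windows and the character that follows each."""
    inputs, targets = [], []
    for i in range(0, len(text) - sequence_length, step):
        inputs.append(text[i:i + sequence_length])
        targets.append(text[i + sequence_length])
    return inputs, targets

inputs, targets = make_windows("hello world", sequence_length=4)
for window, target in zip(inputs, targets):
    print(repr(window), "->", repr(target))
# 'hell' -> 'o'
# 'ello' -> ' '
# 'llo ' -> 'w'
# 'lo w' -> 'o'
# 'o wo' -> 'r'
# ' wor' -> 'l'
# 'worl' -> 'd'
```

An 11-character string with a window of 4 yields only 7 training pairs, which shows how little data the single example sentence provides.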

### Output:
The script prints the seed "We are lea" followed by 100 generated characters.

### Example Output:
```
Generated text:
 We are learning how to build a language model using LSTM networksto build a langu
```

This is a simple example to get us started. For larger and more complex language models, you would typically use larger datasets and possibly more complex LSTM architectures. Notice that the output is not quite what we expected.

It looks like the training process is working, but the generated text isn't producing meaningful results yet. This is likely because the model is still undertrained, and the loss hasn't decreased enough to capture meaningful patterns in the text.

To improve the quality of the generated text, you can try the following:

### 1. **Increase the number of epochs**:
Training for more epochs will help the model learn better. For small datasets, you may need to train for hundreds of epochs to see a significant improvement.

```python
# Try increasing epochs to 100 or more
model.fit(X, y, batch_size=32, epochs=100)
```

### 2. **Adjust model architecture**:
You can increase the complexity of the model by adding more LSTM layers or increasing the number of units.

For example, you could add another LSTM layer:

```python
model = Sequential()
model.add(Embedding(input_dim=len(chars), output_dim=50, input_length=sequence_length))
model.add(LSTM(128, return_sequences=True))  # First LSTM layer with return_sequences=True
model.add(LSTM(128))  # Second LSTM layer
model.add(Dense(len(chars), activation='softmax'))
```

### 3. **Tune the sampling temperature**:
When generating text, you can adjust the sampling strategy by applying "temperature" to the model's predictions. A higher temperature produces more randomness, while a lower temperature makes the predictions more conservative.

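(The idea is related to the temperature in simulated annealing: in both cases, a higher temperature flattens the probability distribution and a lower one concentrates it on the most likely options.)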

Modify the `generate_text` function like this:

```python
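def sample(preds, temperature=1.0):
    # Work in float64 so the renormalized probabilities sum to 1 closely
    # enough for np.random.multinomial.
    preds = np.asarray(preds).astype('float64')
    # Rescale log-probabilities by the temperature; the small constant avoids log(0).
    preds = np.log(preds + 1e-8) / temperature
    exp_preds = np.exp(preds)
    preds = exp_preds / np.sum(exp_preds)
    # Draw one sample from the rescaled distribution and return its index.
    probas = np.random.multinomial(1, preds, 1)
    return np.argmax(probas)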

def generate_text(model, seed_text, length=100, temperature=1.0):
    generated_text = seed_text
    for _ in range(length):
        x_pred = np.zeros((1, sequence_length), dtype=np.int32)
        for t, char in enumerate(seed_text):
            x_pred[0, t] = char_to_index[char]

        preds = model.predict(x_pred, verbose=0)[0]
        next_index = sample(preds, temperature)
        next_char = index_to_char[next_index]

        generated_text += next_char
        seed_text = seed_text[1:] + next_char

    return generated_text
```

You can then generate text with different temperatures to see how it affects the output:

```python
print(generate_text(model, seed_text, length=100, temperature=0.5))  # More deterministic
print(generate_text(model, seed_text, length=100, temperature=1.0))  # More creative/random
```
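To see what temperature does to a distribution in isolation, without training a model, the following sketch rescales a made-up three-way distribution. The helper name `apply_temperature` and the probabilities are illustrative assumptions; the rescaling itself is the same log, divide, exponentiate, renormalize sequence used in `sample`.

```python
import numpy as np

def apply_temperature(probs, temperature):
    """Rescale a probability vector: T < 1 sharpens it, T > 1 flattens it."""
    logits = np.log(np.asarray(probs, dtype=np.float64) + 1e-8) / temperature
    logits -= logits.max()  # subtract the max for numerical stability
    scaled = np.exp(logits)
    return scaled / scaled.sum()

probs = [0.6, 0.3, 0.1]  # hypothetical next-character probabilities
for t in (0.5, 1.0, 2.0):
    print(t, np.round(apply_temperature(probs, t), 3))
```

At a temperature of 1.0 the distribution is unchanged; lower values put more mass on the most likely character, and higher values spread it out.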

### 4. **Use a larger dataset**:
For a language model to produce coherent text, it usually needs more training data. The small text snippet used here is insufficient for the model to generalize well. Consider using a larger corpus or dataset to improve the model's performance.

By applying these changes, you should see a noticeable improvement in the quality of the generated text!

In [2]:
import numpy as np
import tensorflow as tf
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import LSTM, Dense, Embedding
from tensorflow.keras.optimizers import Adam
from tensorflow.keras.utils import to_categorical

# Load the dataset
# Assuming `text` contains the raw text data, we could load a larger dataset
#text = open("sample_text.txt").read().lower()

# With some sample text for testing:
text = """
Once upon a time, in a land far, far away, there lived a king. The king was wise and loved by all his subjects. 
Every day, he would walk through his kingdom, talking to the people and listening to their concerns. 
The kingdom prospered under his rule, and the people were happy and content.
""".lower()


# Create a character-level mapping
chars = sorted(list(set(text)))
char_to_index = {c: i for i, c in enumerate(chars)}
index_to_char = {i: c for i, c in enumerate(chars)}

# Sequence length for LSTM training
sequence_length = 40
step = 3

# Prepare the input and output data
X = []
y = []

for i in range(0, len(text) - sequence_length, step):
    X.append(text[i:i + sequence_length])
    y.append(text[i + sequence_length])

# Vectorize the input and output
X_new = np.zeros((len(X), sequence_length), dtype=np.int32)
y_new = np.zeros((len(X)), dtype=np.int32)

for i, sequence in enumerate(X):
    for t, char in enumerate(sequence):
        X_new[i, t] = char_to_index[char]
    y_new[i] = char_to_index[y[i]]

y_new = to_categorical(y_new, num_classes=len(chars))

# Build the LSTM model
model = Sequential()
model.add(Embedding(input_dim=len(chars), output_dim=50, input_length=sequence_length))
model.add(LSTM(128, return_sequences=True))
model.add(LSTM(128))
model.add(Dense(len(chars), activation='softmax'))

# Compile the model
model.compile(loss='categorical_crossentropy', optimizer=Adam(learning_rate=0.001))

# Train the model
model.fit(X_new, y_new, batch_size=128, epochs=100)

# Sampling function to introduce randomness
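def sample(preds, temperature=1.0):
    # Work in float64 so the renormalized probabilities sum to 1 closely
    # enough for np.random.multinomial.
    preds = np.asarray(preds).astype('float64')
    # Rescale log-probabilities by the temperature; the small constant avoids log(0).
    preds = np.log(preds + 1e-8) / temperature
    exp_preds = np.exp(preds)
    preds = exp_preds / np.sum(exp_preds)
    # Draw one sample from the rescaled distribution and return its index.
    probas = np.random.multinomial(1, preds, 1)
    return np.argmax(probas)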

# Function to generate text
def generate_text(model, seed_text, length=100, temperature=1.0):
    generated_text = seed_text
    for _ in range(length):
        x_pred = np.zeros((1, sequence_length), dtype=np.int32)
        for t, char in enumerate(seed_text):
            x_pred[0, t] = char_to_index[char]

        preds = model.predict(x_pred, verbose=0)[0]
        next_index = sample(preds, temperature)
        next_char = index_to_char[next_index]

        generated_text += next_char
        seed_text = seed_text[1:] + next_char

    return generated_text

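# Example seed text for text generation.
# Note: this seed has 15 characters, fewer than sequence_length (40), so the
# remaining positions of each input window stay at index 0.
seed_text = "we are learning"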
generated_text = generate_text(model, seed_text, length=200, temperature=0.5)
print("Generated text with temperature 0.5:")
print(generated_text)

generated_text = generate_text(model, seed_text, length=200, temperature=1.0)
print("\nGenerated text with temperature 1.0:")
print(generated_text)



Epoch 1/100
1/1 ━━━━━━━━━━━━━━━━━━━━ 3s 3s/step - loss: 3.2955
Epoch 2/100
1/1 ━━━━━━━━━━━━━━━━━━━━ 0s 131ms/step - loss: 3.2865
Epoch 3/100
1/1 ━━━━━━━━━━━━━━━━━━━━ 0s 133ms/step - loss: 3.2748
Epoch 4/100
1/1 ━━━━━━━━━━━━━━━━━━━━ 0s 130ms/step - loss: 3.2559
Epoch 5/100
1/1 ━━━━━━━━━━━━━━━━━━━━ 0s 131ms/step - loss: 3.2229
Epoch 6/100
1/1 ━━━━━━━━━━━━━━━━━━━━ 0s 126ms/step - loss: 3.1632
Epoch 7/100
1/1 ━━━━━━━━━━━━━━━━━━━━ 0s 127ms/step - loss: 3.0590
Epoch 8/100
1/1 ━━━━━━━━━━━━━━━━━━━━ 0s 128ms/step - loss: 2.9460
Epoch 9/100
1/1 ━━━━━━━━━━━━━━━━━━━━ 0s 125ms/step - loss: 3.0297
Epoch 10/100
1/1 ━━━━━━━━━━━━━━━━━━━━ 0s 139ms/step - loss: 2.9631
...

The generated text seems nonsensical, which is typical when training a character-level language model, especially when using small datasets or if the model hasn't been trained long enough. Here are a few ways to potentially improve the output:

1. **Train on a larger dataset**: A more extensive and coherent dataset would allow the model to learn patterns better. Try using more substantial text data, such as a book or articles, to give the model more context to learn from.

2. **Adjust hyperparameters**:
   - **Number of epochs**: While 100 epochs is often enough for small datasets, you might experiment with more epochs depending on the dataset size and complexity.
   - **LSTM layers and neurons**: Try adding more layers or neurons to the LSTM to give the model more capacity to learn patterns.
   
3. **Sampling technique**:
   - The temperature parameter affects randomness. With a temperature of 0.5, the model tends to produce repetitive patterns, while at 1.0, it's more random. You could experiment with intermediate temperatures, like 0.7, for more balanced outputs.

If you're aiming for more coherent text, these modifications can help improve results:

### Recommendations:
1. **Increase Dataset Size**: Use a more significant text corpus, such as an entire book or collection of books. This will help the model understand more language structure.

2. **Increase Model Capacity**: You can increase the number of LSTM layers or the number of neurons in each layer. For example, adding more layers or increasing the neurons from 128 to 256 could improve the model's capacity to learn more complex patterns.

3. **Improve Sampling Strategy**: While generating text, you could try intermediate temperature values like 0.7 or 0.8 to balance randomness and structure.


In [None]:
import numpy as np
import tensorflow as tf
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import LSTM, Dense, Activation, Embedding
from tensorflow.keras.utils import to_categorical
from tensorflow.keras.preprocessing.sequence import pad_sequences

# Load the dataset
with open('../../Data/pg2600.txt', 'r', encoding='utf-8') as file:
    text = file.read().lower()

# Create a character-level mapping
chars = sorted(list(set(text)))
char_to_index = {char: idx for idx, char in enumerate(chars)}
index_to_char = {idx: char for idx, char in enumerate(chars)}

# Prepare the dataset for training
SEQUENCE_LENGTH = 100  # Sequence length for each input
STEP = 1  # Step size between sequences
sentences = []
next_chars = []

for i in range(0, len(text) - SEQUENCE_LENGTH, STEP):
    sentences.append(text[i:i + SEQUENCE_LENGTH])
    next_chars.append(text[i + SEQUENCE_LENGTH])

# Convert sequences to numeric indices
X = np.zeros((len(sentences), SEQUENCE_LENGTH), dtype=np.int32)
y = np.zeros((len(sentences)), dtype=np.int32)

for i, sentence in enumerate(sentences):
    for t, char in enumerate(sentence):
        X[i, t] = char_to_index[char]
    y[i] = char_to_index[next_chars[i]]

# One-hot encode the target variable
y = to_categorical(y, num_classes=len(chars))

# Build the LSTM model
model = Sequential()
model.add(Embedding(input_dim=len(chars), output_dim=50, input_length=SEQUENCE_LENGTH))
model.add(LSTM(128, return_sequences=False))
model.add(Dense(len(chars)))
model.add(Activation('softmax'))

model.compile(loss='categorical_crossentropy', optimizer='adam')

# Train the model
model.fit(X, y, batch_size=128, epochs=20)

# Function to generate text after training
def generate_text(model, seed_text, length, temperature=1.0):
    generated = seed_text
    for _ in range(length):
        x_pred = np.zeros((1, SEQUENCE_LENGTH), dtype=np.int32)
        for t, char in enumerate(seed_text):
            x_pred[0, t] = char_to_index.get(char, 0)

        predictions = model.predict(x_pred, verbose=0)[0]
        predictions = np.asarray(predictions).astype('float64')
        predictions = np.log(predictions + 1e-8) / temperature
        exp_preds = np.exp(predictions)
        predictions = exp_preds / np.sum(exp_preds)
        probas = np.random.multinomial(1, predictions, 1)
        next_index = np.argmax(probas)
        next_char = index_to_char[next_index]

        generated += next_char
        seed_text = seed_text[1:] + next_char

    return generated

# Generate some text
seed_text = text[:SEQUENCE_LENGTH]  # Use the first sequence from the text as the seed
print("Generated text with temperature 0.5:")
print(generate_text(model, seed_text, 400, temperature=0.5))

print("\nGenerated text with temperature 1.0:")
print(generate_text(model, seed_text, 400, temperature=1.0))


Training started at 10:40 AM. Can we estimate when it will finish from the progress of the first epoch?

Epoch 1/20
 1695/25214 ━━━━━━━━━━━━━━━━━━━━ 3:35:42 550ms/step - loss: 2.7043

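Training began at 10:40 AM and it is now 10:59 AM, so we can estimate the rate from wall-clock time instead of relying only on the figures Keras reports (550 ms/step, 3:35:42 remaining). The wall-clock rate comes out slower, probably because it also includes startup overhead before the first step, so it gives a more conservative estimate.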

### Initial Information:
- **Started training at**: 10:40 AM
- **Current time**: 10:59 AM (19 minutes have passed)
- **Steps completed so far**: 1,695 steps
- **Total steps for the epoch**: 25,214 steps
- **Time per step**: 550 milliseconds (0.55 seconds)

### Time Spent for 1,695 Steps:
In 19 minutes (1,140 seconds), the model completed 1,695 steps.
The average time per step can be calculated as:

$\frac{1,140 \, \text{seconds}}{1,695 \, \text{steps}} \approx 0.672 \, \text{seconds per step}$


### Time Remaining for the First Epoch:
- **Remaining steps**: 25,214 - 1,695 = 23,519 steps
- **Estimated time for remaining steps**:

  $23,519 \, \text{steps} \times 0.672 \, \text{seconds/step} \approx 15,805 \, \text{seconds} \approx 4.39 \, \text{hours}$

### Total Estimated Time for the First Epoch:
The first epoch needs approximately **4.39 more hours** (about 4 hours 23 minutes), so it should finish around **3:22 PM** today. Counting the 19 minutes already elapsed, the whole epoch takes about **4.7 hours**.


### Revised Summary:
- **Epoch 1 will finish around 3:22 PM today**.
- At the observed rate, a full epoch of 25,214 steps takes about **4.7 hours**. Therefore, training the full 20 epochs would take approximately:

$4.7 \, \text{hours/epoch} \times 20 = 94 \, \text{hours} \approx 3.9 \, \text{days}$
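The same extrapolation can be scripted so it is easy to repeat as training progresses. This is a minimal sketch using only the standard library; the function name `estimate_epoch` is hypothetical, and the calendar date passed to `datetime` is an arbitrary placeholder since only the time of day matters here.

```python
from datetime import datetime, timedelta

def estimate_epoch(now, elapsed_seconds, steps_done, total_steps):
    """Extrapolate the observed wall-clock rate to the rest of the epoch."""
    seconds_per_step = elapsed_seconds / steps_done
    remaining_seconds = (total_steps - steps_done) * seconds_per_step
    return {
        "seconds_per_step": seconds_per_step,
        "remaining_hours": remaining_seconds / 3600,
        "finish_time": now + timedelta(seconds=remaining_seconds),
        "full_epoch_hours": total_steps * seconds_per_step / 3600,
    }

estimate = estimate_epoch(
    now=datetime(2024, 1, 1, 10, 59),  # placeholder date, 10:59 AM
    elapsed_seconds=19 * 60,           # 19 minutes since 10:40 AM
    steps_done=1695,
    total_steps=25214,
)
print(f"{estimate['seconds_per_step']:.3f} s/step")
print(f"{estimate['remaining_hours']:.2f} h remaining in this epoch")
print("epoch finishes at", estimate["finish_time"].strftime("%I:%M %p"))
print(f"{estimate['full_epoch_hours']:.2f} h per full epoch")
```

Re-running it with updated step counts later in the epoch gives a better estimate, since any fixed startup overhead then weighs less in the average.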



If you want to speed up training, consider adjusting parameters such as batch size or model complexity.
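One way to reason about these knobs is to count optimizer steps per epoch. The sketch below mirrors the window loop from the notebook; the function name `steps_per_epoch` is hypothetical, and the text length is an assumed value chosen only so that the default settings reproduce the 25,214 steps seen in the progress bar (the real file length is not stated here).

```python
import math

def steps_per_epoch(text_length, sequence_length, step, batch_size):
    """Number of batches per epoch for the sliding-window dataset."""
    n_samples = len(range(0, text_length - sequence_length, step))
    return math.ceil(n_samples / batch_size)

TEXT_LENGTH = 3_227_380  # assumed corpus size in characters (hypothetical)

print(steps_per_epoch(TEXT_LENGTH, 100, step=1, batch_size=128))  # 25214
print(steps_per_epoch(TEXT_LENGTH, 100, step=3, batch_size=128))  # 8405
print(steps_per_epoch(TEXT_LENGTH, 100, step=1, batch_size=512))  # 6304
```

Fewer steps do not translate one-to-one into less time: a larger batch makes each step slower, while a larger window step reduces the amount of training data seen per epoch.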

The **Long Short-Term Memory (LSTM)** architecture is a special kind of Recurrent Neural Network (RNN) that solves the vanishing gradient problem, making it effective for learning long-term dependencies in sequences. LSTM introduces a set of gates to control information flow, allowing it to retain relevant data for long periods while discarding unnecessary details.

Here's a breakdown of the LSTM architecture:

### LSTM Cell Architecture
Each LSTM cell consists of three key gates: the **Forget Gate**, the **Input Gate**, and the **Output Gate**. These gates regulate the flow of information in the network.

1. **Forget Gate**: Determines what information from the previous cell state $C_{t-1}$ should be discarded or retained.
   - Formula: 
     $$ f_t = \sigma(W_f \cdot [h_{t-1}, x_t] + b_f) $$
   - $f_t$ is the forget gate output, $h_{t-1}$ is the previous hidden state, $x_t$ is the input at time step $t$, $[h_{t-1}, x_t]$ is their concatenation, and $\sigma$ is the sigmoid activation.

2. **Input Gate**: Decides what new information will be stored in the cell state.
   - Formula:
     $$ i_t = \sigma(W_i \cdot [h_{t-1}, x_t] + b_i) $$
   - Alongside the gate, a tanh layer generates a candidate update:
     $$ \tilde{C}_t = \tanh(W_C \cdot [h_{t-1}, x_t] + b_C) $$

3. **Cell State Update**: The cell state $C_t$ is updated using the forget gate's output and the input gate's result. Here $\odot$ denotes elementwise multiplication.
   - Formula:
     $$ C_t = f_t \odot C_{t-1} + i_t \odot \tilde{C}_t $$

4. **Output Gate**: Controls what part of the cell state is output as the hidden state for the next time step.
   - Formula:
     $$ o_t = \sigma(W_o \cdot [h_{t-1}, x_t] + b_o) $$
   - The hidden state is updated:
     $$ h_t = o_t \odot \tanh(C_t) $$
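The four steps above can be sketched as a single forward step in NumPy. This is an illustrative sketch rather than how any particular library implements it: the helper names (`sigmoid`, `lstm_step`, `init_params`), the parameter dictionary layout, and the small random initialization are all assumptions made for the example.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_step(x_t, h_prev, c_prev, params):
    """One LSTM time step following the forget/input/update/output equations."""
    concat = np.concatenate([h_prev, x_t])                      # [h_{t-1}, x_t]
    f_t = sigmoid(params["W_f"] @ concat + params["b_f"])       # forget gate
    i_t = sigmoid(params["W_i"] @ concat + params["b_i"])       # input gate
    c_tilde = np.tanh(params["W_C"] @ concat + params["b_C"])   # candidate update
    c_t = f_t * c_prev + i_t * c_tilde                          # cell state update
    o_t = sigmoid(params["W_o"] @ concat + params["b_o"])       # output gate
    h_t = o_t * np.tanh(c_t)                                    # new hidden state
    return h_t, c_t

def init_params(input_size, hidden_size, seed=0):
    """Small random weights and zero biases for the four weight matrices."""
    rng = np.random.default_rng(seed)
    params = {}
    for gate in ("f", "i", "C", "o"):
        params[f"W_{gate}"] = rng.normal(0.0, 0.1, size=(hidden_size, hidden_size + input_size))
        params[f"b_{gate}"] = np.zeros(hidden_size)
    return params

input_size, hidden_size = 3, 4
params = init_params(input_size, hidden_size)
h = np.zeros(hidden_size)
c = np.zeros(hidden_size)

# Run the cell over a random sequence of 5 time steps.
for x_t in np.random.default_rng(1).normal(size=(5, input_size)):
    h, c = lstm_step(x_t, h, c, params)

print(h.shape, c.shape)
```

Note that `*` in the code is elementwise multiplication on vectors, while `@` is the matrix product applied to the concatenated vector.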

### LSTM Diagram Breakdown

In summary, an LSTM cell works as follows:
- **Input**: Takes the current input $x_t$ and the previous hidden state $h_{t-1}$.
- **Forget Gate**: Decides what part of the previous memory to forget.
- **Input Gate**: Decides what new information to store in memory.
- **Cell State Update**: Updates the memory (cell state).
- **Output Gate**: Decides what part of the updated cell state to output.

This architecture allows LSTMs to learn and remember long-term dependencies in sequence data, making them particularly useful for time-series data, natural language processing, and other sequential tasks.



The Long Short-Term Memory (LSTM) architecture consists of several components that work together to allow the network to retain important information and discard irrelevant information over long sequences of data. Here's a textual breakdown of what the architecture typically looks like:

1. **Input to LSTM:**
   - Input $x_t$: The current input vector at time step $t$.
   - Previous hidden state $h_{t-1}$: The hidden state from the previous time step.
   - Previous cell state $C_{t-1}$: The cell state from the previous time step.

2. **Forget Gate $f_t$:**
   - The forget gate decides which information to discard from the previous cell state $C_{t-1}$.
   - It takes $x_t$ and $h_{t-1}$ as inputs and uses a sigmoid activation to produce a number between 0 and 1 for each value in the cell state. A value of 0 means "completely forget" and 1 means "completely keep."
   - $$ f_t = \sigma(W_f \cdot [h_{t-1}, x_t] + b_f) $$

3. **Input Gate $i_t$ and Candidate Cell State $\tilde{C}_t$:**
   - The input gate decides which new information will be added to the current cell state.
   - The candidate cell state $\tilde{C}_t$ is computed from $x_t$ and $h_{t-1}$ using a tanh activation.
   - $$ i_t = \sigma(W_i \cdot [h_{t-1}, x_t] + b_i) $$
   - $$ \tilde{C}_t = \tanh(W_C \cdot [h_{t-1}, x_t] + b_C) $$

4. **Cell State Update $C_t$:**
   - The cell state is updated by combining the old cell state $C_{t-1}$ and the candidate cell state $\tilde{C}_t$.
   - This is done using the forget gate $f_t$ and the input gate $i_t$, both applied elementwise ($\odot$).
   - $$ C_t = f_t \odot C_{t-1} + i_t \odot \tilde{C}_t $$

5. **Output Gate $o_t$:**
   - The output gate decides what the next hidden state $h_t$ will be.
   - It applies a sigmoid activation to determine which parts of the cell state should be output.
   - The hidden state is then the cell state $C_t$ passed through a tanh activation and scaled elementwise ($\odot$) by the output gate.
   - $$ o_t = \sigma(W_o \cdot [h_{t-1}, x_t] + b_o) $$
   - $$ h_t = o_t \odot \tanh(C_t) $$


This architecture allows LSTM networks to maintain long-range dependencies, which is particularly useful in tasks like language modeling, time-series forecasting, and sequence prediction.
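The equations also determine how many trainable parameters an LSTM layer has: four weight matrices of shape (units, units + input_dim), each with a bias vector of length units. The sketch below works this out; `lstm_param_count` is a hypothetical helper, and the result should match what `model.summary()` reports for the LSTM layers used earlier, assuming default Keras settings with biases enabled.

```python
def lstm_param_count(input_dim, units):
    """Trainable parameters in a standard LSTM layer with biases."""
    per_gate = units * (units + input_dim) + units  # one weight matrix plus one bias
    return 4 * per_gate                             # forget, input, candidate, output

# LSTM(128) fed by the 50-dimensional Embedding layer
print(lstm_param_count(input_dim=50, units=128))   # 91648

# A second LSTM(128) stacked on top of the first one
print(lstm_param_count(input_dim=128, units=128))  # 131584
```

Doubling the number of units roughly quadruples the recurrent part of the count, which is worth keeping in mind when increasing model capacity.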


