# LSTM Working Example: Understanding Information Flow in a Text Sentiment Task

This Jupyter Notebook illustrates how an LSTM works in practice, using a sentiment analysis example. We'll trace how information flows through the Forget, Input, and Output gates to manage long-term and short-term memory, enabling the model to capture long-range contextual dependencies in text.

## High-Level Understanding

![LSTM training diagram](images/lstm_training.png)

## 1. **Recap: LSTM Architecture & Gates**
* **Forget Gate ($f_t$)**: Decides what to discard from the long-term memory ($C_{t-1}$).
* **Input Gate ($i_t$) & Candidate Memory ($\tilde{C}_t$)**: Decide what new information to add to the long-term memory.
* **Output Gate ($o_t$)**: Decides what part of the long-term memory ($C_t$) to expose as the short-term memory ($H_t$).
* **Memory Cells**:
    * **Long-Term Memory ($C_t$)**: The cell state, carries information over long sequences.
    * **Short-Term Memory ($H_t$)**: The hidden state, reflects the immediate context and is used for output/prediction.
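
For reference, the gates above are commonly written as the following equations. This is one standard formulation, not necessarily the exact notation used in the lecture: it assumes each gate sees the concatenation $[H_{t-1}, X_t]$, that $\sigma$ is the sigmoid function, and that $\odot$ denotes element-wise multiplication.

$$
\begin{aligned}
f_t &= \sigma\left(W_f\,[H_{t-1}, X_t] + b_f\right) \\
i_t &= \sigma\left(W_i\,[H_{t-1}, X_t] + b_i\right) \\
\tilde{C}_t &= \tanh\left(W_c\,[H_{t-1}, X_t] + b_c\right) \\
C_t &= f_t \odot C_{t-1} + i_t \odot \tilde{C}_t \\
o_t &= \sigma\left(W_o\,[H_{t-1}, X_t] + b_o\right) \\
H_t &= o_t \odot \tanh(C_t)
\end{aligned}
$$

Because $f_t$, $i_t$, and $o_t$ come out of a sigmoid, each entry lies between 0 and 1, which is why a gate value near 1 can be read as "keep" and a value near 0 as "discard".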

## 2. **Example Scenario: Restaurant Review Sentiment**
* **Task**: Predict whether the food in a restaurant (specifically a burger) is "good" (output = 1) or "bad" (output = 0) based on a review paragraph.
* **Review Text**: "I went to the restaurant and ordered burger. The burger looked tasty and crispy. But burger is not good for health. It has a lot of fats cholesterol. But this burger was made with whey protein. and only vegetables were used, so it was good."
* **Desired Output**: 1 (meaning the burger was good despite initial negative health comments, due to later positive context).

## 3. **Step-by-Step Information Flow in LSTM (Simplified View)**

The lecture simplifies the LSTM diagram into functional blocks for clarity, emphasizing the flow between the `Forget Gate`, `Input & Candidate Memory`, and `Output Gate`.

### **3.1. Word Vectorization (Embedding Layer)**
* **First Step**: Every word in the input text is converted into a numerical vector. This is typically done with an **embedding layer**, whose vectors can be learned during training or initialized from pretrained embeddings such as Word2Vec or GloVe.
* **Word2Vec Analogy**: The lecture uses a simplified Word2Vec analogy where words are embedded into a 3-dimensional space representing "good," "bad," and "healthy."
    * Example: For the word "tasty", its vector might be [0.9 (good), 0.0 (bad), 0.1 (healthy)].
* **Sequential Input**: These word vectors ($X_t$) are fed into the LSTM one word at a time (per time step $t$).
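
The lookup described above can be sketched as a small dictionary. This is only an illustration: the `EMBEDDINGS` table, the `UNKNOWN` vector, and the `embed` helper are hypothetical names, and only the "tasty" vector comes from the example above. Every other number is made up to fit the "good / bad / healthy" analogy.

```python
# Toy 3-dimensional embedding table: [good, bad, healthy].
# Only the "tasty" row comes from the notes; all other values are
# hypothetical numbers chosen for illustration.
EMBEDDINGS = {
    "tasty":       [0.9, 0.0, 0.1],
    "crispy":      [0.8, 0.0, 0.1],
    "fats":        [0.0, 0.8, 0.0],
    "cholesterol": [0.0, 0.9, 0.0],
    "vegetables":  [0.6, 0.0, 0.9],
    "good":        [1.0, 0.0, 0.2],
}
UNKNOWN = [0.0, 0.0, 0.0]  # assumed neutral vector for words not in the table


def embed(sentence):
    """Return one vector per word, in order (one X_t per time step)."""
    words = sentence.lower().replace(".", "").split()
    return [EMBEDDINGS.get(word, UNKNOWN) for word in words]


sequence = embed("The burger looked tasty and crispy.")
for t, x_t in enumerate(sequence, start=1):
    print(t, x_t)
```

A real embedding layer would hold a learned vector for every vocabulary word, usually with many more than three dimensions, and the dimensions would not have human-readable meanings.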

### **3.2. Processing the Review Through LSTM Gates**

Let's trace the flow of information for key phrases/sentences:

* **Initial Sentences**: "I went to the restaurant and ordered burger. The burger looked tasty and crispy."
    * **Input ($X_t$)**: Words like "tasty", "crispy" get converted into vectors (e.g., [0.9, 0.0, 0.1] for "tasty").
    * **Forget Gate**: Initially, the forget gate might output values close to `[1, 1, 1]` (or similar high values), meaning it wants to **keep** most of the previous context in the cell state ($C_{t-1}$) as the story unfolds.
    * **Input & Candidate Memory**: The input gate opens to **add** the new positive information (about "tasty" and "crispy") from the candidate memory to the long-term memory ($C_t$). The "good" dimension of the cell state, and to a lesser extent the "healthy" dimension, would start increasing.
    * **Output Gate**: This gate selects relevant parts of $C_t$ to form $H_t$, which reflects the current (positive) sentiment.

* **Context Shift**: "But burger is not good for health. It has a lot of fats cholesterol."
    * **Input ($X_t$)**: Words and phrases such as "not good", "health", "fats", and "cholesterol" enter the LSTM. Their vectors would emphasize the "bad" dimension and carry a low value on the "healthy" dimension.
    * **Forget Gate**: As the context shifts towards negative health aspects, the forget gate might start outputting lower values (closer to 0) for the "good" dimension of $C_{t-1}$, indicating that the earlier positive context is now less relevant *for the current health context*.
    * **Input & Candidate Memory**: The candidate memory ($\tilde{C}_t$) now carries the new negative, health-related information: its "bad" dimension would have high values, and the input gate would allow this to be **added** to $C_t$.
    * **Updated $C_t$**: The cell state ($C_t$) now contains a mix, but the "bad" dimension dominates and the "healthy" dimension has dropped.

* **Final Context Shift**: "But this burger was made with whey protein. and only vegetables were used, so it was good."
    * **Input ($X_t$)**: Words like "whey protein", "vegetables", "good" enter. Their vectors emphasize "good" and "healthy".
    * **Forget Gate**: As the sentiment shifts back to positive (the burger is good *because* of its ingredients), the forget gate might once again output high values for the "good" and "healthy" dimensions and lower values for the "bad" dimension, forgetting some of the earlier negative health context that is now overridden.
    * **Input & Candidate Memory**: The input gate admits this new, overriding positive information into the cell state ($C_t$). The "good" and "healthy" dimensions in the candidate memory ($\tilde{C}_t$) would be high, pushing $C_t$ in the same direction.
    * **Output Gate**: Finally, the output gate reads from the *final* $C_t$ (which now has emphasized "good" and "healthy" due to the last sentences) and generates an $H_t$ that strongly reflects a positive sentiment, leading to the correct prediction (output = 1).
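
The three phases above can be traced numerically with the standard update $C_t = f_t \odot C_{t-1} + i_t \odot \tilde{C}_t$ and $H_t = o_t \odot \tanh(C_t)$. The sketch below is a simplification under loud assumptions: each phase is collapsed into a single time step, and every gate value, candidate vector, and readout weight (`PHASES`, `o_t`, `w_out`, `b_out`) is hand-picked for illustration. In a trained LSTM these values are computed from learned weights, not set by hand.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def cell_update(c_prev, f_t, i_t, c_tilde):
    """C_t = f_t * C_{t-1} + i_t * C~_t, applied element-wise."""
    return [f * c + i * g for f, c, i, g in zip(f_t, c_prev, i_t, c_tilde)]


def hidden_state(c_t, o_t):
    """H_t = o_t * tanh(C_t), applied element-wise."""
    return [o * math.tanh(c) for o, c in zip(o_t, c_t)]


# Dimensions are [good, bad, healthy]. All numbers are hypothetical.
# Each tuple: (phase, forget gate f_t, input gate i_t, candidate C~_t)
PHASES = [
    ("tasty and crispy",       [1.0, 1.0, 1.0], [0.9, 0.1, 0.5], [0.9, 0.0, 0.1]),
    ("not good for health",    [0.3, 1.0, 0.3], [0.2, 0.9, 0.9], [0.0, 0.9, -0.8]),
    ("whey protein, so good",  [1.0, 0.2, 0.5], [0.9, 0.1, 0.9], [0.9, 0.0, 0.9]),
]

c_t = [0.0, 0.0, 0.0]  # empty long-term memory before the review starts
history = []
for phase, f_t, i_t, c_tilde in PHASES:
    c_t = cell_update(c_t, f_t, i_t, c_tilde)
    history.append(c_t)
    print(phase, [round(v, 3) for v in c_t])

# Output gate exposes the final cell state as the hidden state H_t.
o_t = [0.9, 0.9, 0.9]
h_t = hidden_state(c_t, o_t)

# Hypothetical readout layer: rewards "good"/"healthy", penalizes "bad".
w_out, b_out = [2.0, -2.0, 1.0], 0.0
probability = sigmoid(sum(w * h for w, h in zip(w_out, h_t)) + b_out)
prediction = 1 if probability >= 0.5 else 0
print(round(probability, 3), prediction)
```

With these hand-picked values, the "bad" dimension dominates the cell state after the second phase, and the "good" dimension dominates again after the third, so the readout lands on the positive class.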

## 4. **Key Takeaways from the Example**

* **Dynamic Memory Management**: The example demonstrates how the LSTM's gates (Forget and Input/Candidate) continuously adjust the long-term memory ($C_t$) based on incoming words and their context.
* **Contextual Understanding**: The ability to selectively forget and add information allows the LSTM to understand complex narratives where initial impressions are later contradicted or refined. This is crucial for handling long-term dependencies.
* **Forward and Backward Propagation**: During training, the forward pass feeds in the inputs, updates the states, and makes a prediction; backpropagation through time (BPTT) then adjusts the weights based on the prediction error. The gate behavior traced above is therefore learned rather than hand-set: the weights ($W_f, W_i, W_c, W_o$) and biases are updated at each training step so that the network manages its memory correctly and makes accurate predictions.
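
To make the forward half of that loop concrete, here is a minimal sketch in plain Python of one forward pass followed by a binary cross-entropy loss. It is not the lecture's implementation: the helper names (`affine`, `lstm_step`, `init_params`, `forward_loss`), the random initialization, the readout weights `w_out` and `b_out`, and the three input vectors are all assumptions made for illustration. The backward pass is omitted; BPTT would compute the gradient of this loss with respect to every weight and bias, which a deep learning framework normally does automatically.

```python
import math
import random


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def affine(W, b, v):
    """Compute W @ v + b for a list-of-lists matrix W."""
    return [sum(w * x for w, x in zip(row, v)) + b_k for row, b_k in zip(W, b)]


def lstm_step(x_t, h_prev, c_prev, params):
    """One forward step; every gate sees the concatenation [H_{t-1}, X_t]."""
    z = h_prev + x_t  # list concatenation
    f_t = [sigmoid(a) for a in affine(params["W_f"], params["b_f"], z)]
    i_t = [sigmoid(a) for a in affine(params["W_i"], params["b_i"], z)]
    c_tilde = [math.tanh(a) for a in affine(params["W_c"], params["b_c"], z)]
    o_t = [sigmoid(a) for a in affine(params["W_o"], params["b_o"], z)]
    c_t = [f * c + i * g for f, c, i, g in zip(f_t, c_prev, i_t, c_tilde)]
    h_t = [o * math.tanh(c) for o, c in zip(o_t, c_t)]
    return h_t, c_t


def init_params(hidden, inputs, rng):
    """Small random weights and zero biases for the f, i, c, o blocks."""
    params = {}
    for gate in "fico":
        params["W_" + gate] = [
            [rng.uniform(-0.5, 0.5) for _ in range(hidden + inputs)]
            for _ in range(hidden)
        ]
        params["b_" + gate] = [0.0] * hidden
    return params


def forward_loss(sequence, label, params, w_out, b_out):
    """Run the whole sequence, then score the final H_t against the label."""
    hidden = len(params["b_f"])
    h, c = [0.0] * hidden, [0.0] * hidden
    for x_t in sequence:
        h, c = lstm_step(x_t, h, c, params)
    p = sigmoid(sum(w * v for w, v in zip(w_out, h)) + b_out)
    loss = -(label * math.log(p) + (1 - label) * math.log(1 - p))
    return p, loss


rng = random.Random(0)  # fixed seed so the sketch is reproducible
params = init_params(hidden=3, inputs=3, rng=rng)
w_out, b_out = [rng.uniform(-0.5, 0.5) for _ in range(3)], 0.0

# Hypothetical [good, bad, healthy] vectors for three words of the review.
sequence = [[0.9, 0.0, 0.1], [0.0, 0.9, 0.0], [1.0, 0.0, 0.2]]
p, loss = forward_loss(sequence, label=1, params=params, w_out=w_out, b_out=b_out)
print(round(p, 3), round(loss, 3))
```

With untrained random weights the predicted probability carries no meaning yet; training would repeat this forward pass over many labeled reviews and nudge the parameters to reduce the loss.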

This detailed example highlights the power of LSTM in dealing with sequence data where the relevant information for a decision might appear much earlier or later in the input.

**Next Video**: Introduction to Gated Recurrent Units (GRUs), a simplified but effective alternative to LSTMs.