# Dissecting the LSTM Forget Gate: How LSTMs Decide What to Remember

This notebook focuses on the first critical component of an LSTM cell: the Forget Gate. We will break down its internal workings, the mathematical operations involved, and its role in managing the long-term memory of the network.

## High-Level Understanding

![Diagram of the LSTM forget gate](images/forget_gate.png)

## Key Learning Objectives:

### 1. **Recall: Overall LSTM Architecture**
* Briefly review the overall LSTM architecture with its three main gates: Forget Gate, Input Gate & Candidate Memory, and Output Gate.
* Reiterate the roles of $X_t$ (current input), $H_{t-1}$ (previous hidden state/short-term memory), and $C_{t-1}$ (previous cell state/long-term memory).

### 2. **Focus: The Forget Gate's Role**
* The **Forget Gate** is responsible for deciding what information from the *previous cell state* ($C_{t-1}$) is no longer relevant and should be "forgotten" (i.e., discarded or reduced in importance).

### 3. **Step-by-Step Breakdown of the Forget Gate Operation**

#### **3.1. Input Preparation (Concatenation)**
* **Inputs**: The forget gate takes two main inputs:
    * The current input vector, $X_t$ (e.g., a word embedding, 4-dimensional in the example).
    * The previous hidden state vector, $H_{t-1}$ (e.g., 3-dimensional in the example, the same dimension as $C_{t-1}$ and the output of the gate).
* **Concatenation**: $H_{t-1}$ and $X_t$ are **concatenated** (combined end-to-end) to form a single input vector.
    * Example: If $H_{t-1}$ is 3-dim and $X_t$ is 4-dim, the concatenated vector will be 7-dim.
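
The concatenation step can be sketched in NumPy as follows. The numeric values are made up purely to illustrate the shapes, and the variable names `h_prev`, `x_t`, and `concat` are assumptions of this sketch rather than anything fixed by the notes.

```python
import numpy as np

# Hypothetical values, chosen only to illustrate the shapes.
h_prev = np.array([0.1, -0.2, 0.3])      # H_{t-1}: 3-dimensional
x_t = np.array([1.0, 0.5, -1.0, 2.0])    # X_t: 4-dimensional

# Join the two vectors end-to-end to form [H_{t-1}, X_t].
concat = np.concatenate([h_prev, x_t])

print(concat.shape)  # (7,)
```

The order matters only in that it must match the row order of the weight matrix used in the next step.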

#### **3.2. Neural Network Layer (Linear Transformation + Sigmoid Activation)**
* The concatenated input vector ($[H_{t-1}, X_t]$) is fed into a neural network layer.
* This layer involves:
    * **Weight Matrix Multiplication**: The input vector is multiplied by a weight matrix ($W_f$) associated with the forget gate.
        * Example: If input is 1x7 and output of this layer is desired to be 1x3 (matching $C_{t-1}$'s dimension), then $W_f$ will be 7x3.
    * **Bias Addition**: A bias vector ($b_f$) is added to the result.
    * **Sigmoid Activation Function ($\sigma$)**: The result is passed through a sigmoid activation function.
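        * The sigmoid function squashes each value into the range between 0 and 1 (it approaches, but never exactly reaches, either end). This output, denoted as $f_t$, is the **forget gate vector**.
        * **Mathematical Representation**: $f_t = \sigma(W_f \cdot [H_{t-1}, X_t] + b_f)$. This is the column-vector form, in which $W_f$ is 3x7; with the 1x7 row-vector layout of the example above, the same computation is $f_t = \sigma([H_{t-1}, X_t] \cdot W_f + b_f)$ with a 7x3 $W_f$.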
        * **Significance**: Each element in $f_t$ represents a "forget factor" for the corresponding element in the previous cell state $C_{t-1}$. A value close to 0 means "forget this part completely," while a value close to 1 means "keep this part completely."
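
As a rough illustration of this layer, the sketch below computes $f_t$ with the row-vector layout from the example (a 1x7 input times a 7x3 weight matrix). The helper names `sigmoid` and `forget_gate` are hypothetical, and the weights are random stand-ins for values that a real LSTM would learn during training.

```python
import numpy as np

def sigmoid(z):
    """Element-wise logistic function: 1 / (1 + exp(-z))."""
    return 1.0 / (1.0 + np.exp(-z))

def forget_gate(h_prev, x_t, W_f, b_f):
    """Compute f_t = sigmoid([h_prev, x_t] @ W_f + b_f) (row-vector layout)."""
    concat = np.concatenate([h_prev, x_t])   # shape (7,)
    return sigmoid(concat @ W_f + b_f)       # shape (3,)

rng = np.random.default_rng(0)     # fixed seed, for reproducibility only
h_prev = rng.normal(size=3)        # stand-in for H_{t-1}
x_t = rng.normal(size=4)           # stand-in for X_t
W_f = rng.normal(size=(7, 3))      # random stand-in for learned weights
b_f = np.zeros(3)                  # stand-in for the learned bias

f_t = forget_gate(h_prev, x_t, W_f, b_f)
print(f_t.shape)                   # (3,), one forget factor per cell-state element
```

Every entry of `f_t` lies strictly between 0 and 1, which is what lets it act as a per-element scaling factor on $C_{t-1}$.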

#### **3.3. Point-wise Multiplication with Previous Cell State**
* The output of the sigmoid function, $f_t$, is then **point-wise multiplied** with the previous cell state, $C_{t-1}$.
* **Operation**: $f_t \times C_{t-1}$, applied element by element (often written $f_t \odot C_{t-1}$)
* **Purpose**: This multiplication selectively scales down or eliminates information from $C_{t-1}$:
    * If an element in $f_t$ is 0 (or very close to it), the corresponding element in $C_{t-1}$ becomes 0, effectively "forgetting" that piece of information.
    * If an element in $f_t$ is 1 (or very close to it), the corresponding element in $C_{t-1}$ remains unchanged, effectively "remembering" that piece of information.
    * If an element in $f_t$ is between 0 and 1 (e.g., 0.5), the corresponding element in $C_{t-1}$ is scaled down proportionally, reducing its influence.
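
To connect the sigmoid output to this scaling, here is a small sketch with hypothetical pre-activation values (the vector `z` below stands in for the linear transformation's output before the sigmoid and is not taken from the notes). Strongly negative values give a gate near 0, strongly positive values give a gate near 1, and 0 gives exactly 0.5.

```python
import numpy as np

def sigmoid(z):
    """Element-wise logistic function."""
    return 1.0 / (1.0 + np.exp(-z))

z = np.array([-10.0, 0.0, 10.0])     # hypothetical pre-activations
f_t = sigmoid(z)                     # near 0, exactly 0.5, near 1
c_prev = np.array([6.0, 8.0, 9.0])   # previous cell state C_{t-1}

# Point-wise (element-wise) multiplication: each gate value scales one element.
scaled = f_t * c_prev
print(scaled)
```

The first element is almost erased, the second is halved, and the third is almost fully preserved.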

### 4. **Illustrative Examples of Forget Gate Behavior:**
* **Scenario 1: Complete Forgetting**
    * If $f_t = [0, 0, 0]$ (all zeros), and $C_{t-1} = [6, 8, 9]$, then $f_t \times C_{t-1} = [0, 0, 0]$. All previous context is removed. This is an idealized limit (a sigmoid output only approaches 0); it can arise, for example, when the context of the sentence changes completely.
* **Scenario 2: Complete Remembering**
    * If $f_t = [1, 1, 1]$ (all ones), and $C_{t-1} = [6, 8, 9]$, then $f_t \times C_{t-1} = [6, 8, 9]$. No information is removed; all previous context is retained.
* **Scenario 3: Partial Forgetting**
    * If $f_t = [0.5, 1, 0.5]$, and $C_{t-1} = [6, 8, 9]$, then $f_t \times C_{t-1} = [3, 8, 4.5]$. Some parts of the context are reduced in importance, while others are fully retained. This allows for fine-grained control over what information from the past cell state is passed forward.
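
The three scenarios above can be checked with a few lines of NumPy. This is only a sketch of the arithmetic: the gate vectors are set by hand here, whereas in an actual LSTM they come out of the sigmoid layer.

```python
import numpy as np

c_prev = np.array([6.0, 8.0, 9.0])   # C_{t-1} from the scenarios

# Hand-picked gate vectors matching the three scenarios.
scenarios = {
    "complete forgetting": np.array([0.0, 0.0, 0.0]),
    "complete remembering": np.array([1.0, 1.0, 1.0]),
    "partial forgetting": np.array([0.5, 1.0, 0.5]),
}

# Element-wise product f_t * C_{t-1} for each scenario.
results = {name: f_t * c_prev for name, f_t in scenarios.items()}

for name, kept in results.items():
    print(f"{name}: {kept}")
```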

### Conclusion:
The Forget Gate, through its sigmoid activation and point-wise multiplication, provides the LSTM with a crucial mechanism to selectively forget or retain information from its long-term memory ($C_{t-1}$) based on the current input ($X_t$) and the previous short-term memory ($H_{t-1}$). This is a key reason why LSTMs can handle long-term dependencies effectively, as they can "decide" when old information is no longer relevant.

**Next Video**: The next lecture will discuss the **Input Gate** and **Candidate Memory**, which are responsible for deciding what *new* information gets added to the cell state.