### Introduction
This notebook demonstrates how to build and compare different Long Short-Term Memory (LSTM) network architectures for text classification using Keras. It explores four ways of representing a document as a single vector:
* Using the final hidden state of a standard LSTM.
* Using the concatenated final hidden states of a Bidirectional LSTM (BiLSTM).
* Averaging all the output states from a BiLSTM over the entire sequence.
* Taking the maximum value (max-pooling) of all output states from a BiLSTM over the entire sequence.

A key focus is on correctly handling padded sequences when performing averaging or max-pooling, which is achieved by implementing custom Keras layers with masking support.

#### **Cell 1: Imports**
This cell imports all the necessary libraries and modules. We import Keras for building the neural network, NumPy for numerical operations, and scikit-learn for label encoding. Specific layers like `LSTM`, `Dense`, `Embedding`, and `Bidirectional` are imported from Keras.

In [None]:
# Import the Keras library, the high-level API for TensorFlow
import keras
# Import NumPy for numerical operations, especially for handling arrays
import numpy as np
# Import the preprocessing module from scikit-learn for tasks like label encoding
from sklearn import preprocessing
# Import necessary layers and components from Keras to build the model
from keras.layers import Dense, Input, Embedding, GlobalAveragePooling1D, Lambda, Layer, Multiply, GlobalMaxPooling1D, Conv1D, Concatenate, Dropout, LSTM, Bidirectional
# Import the Model class to create a neural network model, and Sequential for linear stacks of layers
from keras.models import Model, Sequential
# Import the Keras backend (K) to access low-level functions, useful for custom layers
from keras import backend as K
# Import TensorFlow, which Keras uses as its backend engine
import tensorflow as tf
# Import callbacks for saving the model, stopping training early, and custom actions during training
from keras.callbacks import ModelCheckpoint, EarlyStopping, Callback

#### **Cell 2: `load_embeddings` Function**
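This function reads pre-trained word embeddings (like GloVe) from a text file in word2vec format, where the first line gives the number of words and the embedding dimension and each following line holds a word and its vector. It builds a vocabulary dictionary mapping words to integer IDs and an embedding matrix. Special tokens for padding (`_0_`) and unknown words (`_UNK_`) are added at the beginning, both with all-zero vectors.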

In [None]:
# Define a function to load pre-trained word embeddings from a file
def load_embeddings(filename, max_vocab_size):

    # Initialize a dictionary to store the vocabulary (word -> integer id)
    vocab={}
    # Initialize a list to store the embedding vectors
    embeddings=[]
    # Open and read the specified file (UTF-8, since the vocabulary can contain non-ASCII tokens)
    with open(filename, encoding="utf-8") as file:
        
        # Read the first line to get the number of words and embedding dimension
        cols=file.readline().split(" ")
        num_words=int(cols[0])
        size=int(cols[1])
        # Add a zero vector for the padding token (ID 0)
        embeddings.append(np.zeros(size))
        # Add a zero vector for the Unknown Word token (ID 1)
        embeddings.append(np.zeros(size))
        # Add the padding token to the vocabulary with ID 0
        vocab["_0_"]=0
        # Add the UNK token to the vocabulary with ID 1
        vocab["_UNK_"]=1
        
        # Iterate through each line of the embeddings file
        for idx,line in enumerate(file):

            # Stop reading if the maximum vocabulary size is reached
            if idx+2 >= max_vocab_size:
                break

            # Split the line into the word and its vector components
            cols=line.rstrip().split(" ")
            # Convert the vector components to a NumPy array of floats
            val=np.array(cols[1:], dtype="float32")
            # Get the word
            word=cols[0]
            
            # Add the embedding vector to our list of embeddings
            embeddings.append(val)
            # Add the word and its corresponding index (ID) to the vocabulary
            vocab[word]=idx+2

    # Convert the list of embeddings to a NumPy array and return it along with the vocabulary
    return np.array(embeddings), vocab
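
To make the expected file layout concrete, here is a minimal sketch that parses a made-up three-word file held in memory. The sample words and values are invented for illustration, and the loop mirrors the logic of `load_embeddings` rather than calling it, with the vectors cast to `float32` explicitly.

```python
import io
import numpy as np

# Invented sample in word2vec text format: a "<num_words> <dim>" header,
# then one "<word> <v1> ... <vdim>" line per word.
sample = io.StringIO(
    "3 4\n"
    "the 0.1 0.2 0.3 0.4\n"
    "cat 0.5 0.6 0.7 0.8\n"
    "sat 0.9 1.0 1.1 1.2\n"
)

num_words, size = (int(c) for c in sample.readline().split(" "))

# IDs 0 and 1 are reserved for padding and unknown words, both zero vectors.
toy_vocab = {"_0_": 0, "_UNK_": 1}
rows = [np.zeros(size, dtype="float32"), np.zeros(size, dtype="float32")]

for idx, line in enumerate(sample):
    cols = line.rstrip().split(" ")
    toy_vocab[cols[0]] = idx + 2
    rows.append(np.array(cols[1:], dtype="float32"))

toy_embeddings = np.array(rows)
print(toy_embeddings.shape)  # (5, 4): 2 reserved rows + 3 words, 4 dimensions each
print(toy_vocab["cat"])      # 3
```

Row `i` of the matrix is the vector for the word whose ID is `i`, which is the alignment the `Embedding` layer relies on later.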

#### **Cell 3: `read_data` Function**
This function reads a tab-separated value (TSV) file containing text data and corresponding labels. It splits each line into a label and a pre-tokenized text, then appends them to separate lists. The `vocab` argument is accepted but not used in the function body.

In [None]:
# Define a function to read text data and labels from a file
def read_data(filename, vocab):
    # Initialize a list to store the text sequences (documents)
    X=[]
    # Initialize a list to store the corresponding labels
    Y=[]
    # Open the file with UTF-8 encoding
    with open(filename, encoding="utf-8") as file:
        # Iterate over each line in the file
        for line in file:
            # Strip whitespace and split the line by the tab character
            cols=line.rstrip().split("\t")
            # The first column is the label
            label=cols[0]
            # The second column is the text, which is assumed to be already tokenized (space-separated)
            text=cols[1].split(" ")
            # Append the list of tokens to the main text list
            X.append(text)
            # Append the label to the main labels list
            Y.append(label)
    # Return the lists of texts and labels
    return X, Y
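
The line format can be illustrated with a few in-memory strings. This is a sketch on invented sample lines; `parse_tsv_line` is a hypothetical helper, not part of the notebook, and it adds one check the original function does not have: skipping lines that contain no tab.

```python
def parse_tsv_line(line):
    """Hypothetical helper: return (label, tokens), or None for a malformed line."""
    cols = line.rstrip("\n").split("\t")
    if len(cols) < 2:
        return None
    return cols[0], cols[1].split(" ")

# Invented sample lines in "<label>\t<space-separated tokens>" form.
sample_lines = [
    "pos\tthis movie was great\n",
    "neg\tdull and far too long\n",
    "line without a tab\n",  # a bare cols[1] lookup would raise IndexError here
]

parsed = [p for p in map(parse_tsv_line, sample_lines) if p is not None]
toy_X = [tokens for _, tokens in parsed]
toy_Y = [label for label, _ in parsed]

print(toy_Y)     # ['pos', 'neg']
print(toy_X[0])  # ['this', 'movie', 'was', 'great']
```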

#### **Cell 4: `get_word_ids` Function**
This function converts a list of documents (each a list of tokens) into a matrix of integer IDs. Each token is lowercased and looked up in the vocabulary; if it is not found, it's assigned the ID for "Unknown" (UNK). Documents longer than `max_length` are truncated, and shorter ones are padded with zeros, so all sequences have the same length.

In [None]:
# Define a function to convert documents (lists of tokens) into sequences of word IDs
def get_word_ids(docs, vocab, max_length=200):
    
    # Initialize a list to hold the ID sequences for all documents
    doc_ids=[]
    
    # Iterate through each document in the input list
    for doc in docs:
        # Initialize a list to hold the word IDs for the current document
        wids=[]

        # Iterate through each token in the document, up to the max_length
        for token in doc[:max_length]:
            # Look up the lowercased token in the vocabulary; fall back to the UNK ID (1) if it is missing
            val = vocab.get(token.lower(), 1)
            # Append the ID to the current document's list of IDs
            wids.append(val)
        
        # Pad the sequence with zeros (ID 0) up to the max_length
        wids.extend([0] * (max_length - len(wids)))

        # Add the padded sequence of IDs to the main list
        doc_ids.append(wids)

    # Convert the list of ID sequences to a NumPy array and return it
    return np.array(doc_ids)
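
A toy run shows the three cases the conversion has to handle: a known word, an unknown word, and a document that is shorter or longer than `max_length`. The vocabulary and documents below are invented, and `to_ids` is a hypothetical compact version of the same logic written so the example runs on its own.

```python
import numpy as np

# Invented five-entry vocabulary using the same reserved IDs as the notebook.
toy_vocab = {"_0_": 0, "_UNK_": 1, "the": 2, "cat": 3, "sat": 4}

def to_ids(tokens, vocab, max_length):
    """Hypothetical helper: lowercase, look up, truncate, then pad with 0."""
    ids = [vocab.get(t.lower(), 1) for t in tokens[:max_length]]
    return ids + [0] * (max_length - len(ids))

docs = [
    ["The", "cat", "sat"],                       # shorter than max_length: padded
    ["the", "dog", "sat", "on", "the", "mat"],   # longer than max_length: truncated
]

toy_ids = np.array([to_ids(d, toy_vocab, max_length=5) for d in docs])
print(toy_ids)
# [[2 3 4 0 0]
#  [2 1 4 1 2]]
```

In the second row, "dog" and "on" map to the UNK ID 1, and "mat" is dropped by the truncation.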

#### **Cell 5: Loading Embeddings**
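This cell calls the `load_embeddings` function to load pre-trained GloVe word embeddings from a file, keeping at most 100,000 entries (including the two special tokens). The `50K` in the file name suggests the file holds fewer vectors than that, in which case the cap is never reached. The function returns the embedding matrix and the vocabulary dictionary.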

In [None]:
# Load the pre-trained GloVe embeddings and vocabulary from the specified file
# The vocabulary is capped at 100,000 entries, including the padding and UNK tokens
embeddings, vocab=load_embeddings("../data/glove.42B.300d.50K.w2v.txt", 100000)

#### **Cell 6: Data Directory Path**
This cell specifies the directory where the training, development (validation), and test datasets are located.

In [None]:
# Set this to the directory with your data (from the CheckData_TODO.ipynb exercise).
# The directory should contain train.tsv, dev.tsv and test.tsv
directory="../data/text_classification_sample_data"

#### **Cell 7: Reading Training and Development Data**
Here, the `read_data` function is used to load the text and labels for both the training and development sets from their respective `.tsv` files.

In [None]:
# Read the training data (text and labels) from 'train.tsv'
trainText, trainY=read_data("%s/train.tsv" % directory, vocab)
# Read the development (validation) data from 'dev.tsv'
devText, devY=read_data("%s/dev.tsv" % directory, vocab)

#### **Cell 8: Converting Text to Word IDs**
The `get_word_ids` function is called to convert the raw text of the training and development sets into padded sequences of integer IDs, which can be fed into the neural network.

In [None]:
# Convert the training text data into padded sequences of word IDs
trainX = get_word_ids(trainText, vocab, max_length=200)
# Convert the development text data into padded sequences of word IDs
devX = get_word_ids(devText, vocab, max_length=200)

#### **Cell 9: Encoding Labels**
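The string labels (e.g., "positive", "negative") are converted to integers (0 and 1) using scikit-learn's `LabelEncoder`, which numbers the classes in sorted order. The encoder is fitted on the training labels only and then applied to both sets, so the models' sigmoid output and binary cross-entropy loss receive numeric 0/1 targets.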

In [None]:
# Initialize a LabelEncoder to convert string labels to integers
le = preprocessing.LabelEncoder()
# Fit the encoder on the training labels to learn the label-to-integer mapping
le.fit(trainY)
# Transform the training labels into their integer representations and convert to a NumPy array
Y_train=np.array(le.transform(trainY))
# Transform the development labels into their integer representations and convert to a NumPy array
Y_dev=np.array(le.transform(devY))
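
Since `LabelEncoder` numbers classes in sorted order, the mapping it produces can be reproduced with `np.unique`. The sketch below does that on invented labels; it is a NumPy stand-in for illustration only, not what the notebook runs.

```python
import numpy as np

# Invented labels; the notebook reads the real ones from train.tsv.
toy_labels = ["pos", "neg", "neg", "pos"]

# np.unique returns the sorted classes and, for each label, its class index.
classes, encoded = np.unique(toy_labels, return_inverse=True)
print(classes)  # ['neg' 'pos']
print(encoded)  # [1 0 0 1]

# Indexing the class array with the codes recovers the original strings.
decoded = classes[encoded]
print(decoded.tolist())  # ['pos', 'neg', 'neg', 'pos']
```

With two classes, the alphabetically first label becomes 0, so it is worth checking which class the sigmoid output of 1 refers to before interpreting predictions.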

#### **Cell 10: `train` Function**
This is a helper function that standardizes the training process. It takes a compiled Keras model, prints its summary, and then trains it on the training data (`trainX`, `Y_train`) for 30 epochs, using the development data (`devX`, `Y_dev`) for validation.

In [None]:
# Define a helper function to train a given Keras model
def train(model):
    # Print a summary of the model's architecture (layers, parameters, etc.);
    # summary() prints directly and returns None, so it is not wrapped in print()
    model.summary()
    # Train the model using the .fit() method
    model.fit(trainX, Y_train, 
                # Provide the development set for validation after each epoch
                validation_data=(devX, Y_dev),
                # Set the number of training epochs to 30
                epochs=30, 
                # Set the batch size to 32
                batch_size=32)

### Model 1: Simple LSTM
First, we'll train a simple LSTM. In this model, the entire document is represented by the final hidden state vector output by the LSTM.

#### **Cell 11: `get_simple_lstm` Function**
This function defines and compiles a simple LSTM model.
1.  **Input Layer**: Takes sequences of word IDs.
2.  **Embedding Layer**: Converts word IDs into dense vectors using the pre-trained embeddings, which stay frozen (`trainable=False`). `mask_zero=True` marks ID 0 as padding and passes a mask to subsequent layers, so the LSTM skips padded time steps.
3.  **LSTM Layer**: Processes the sequence of embeddings. `return_sequences=False` means it only outputs the final hidden state.
4.  **Dense Layer**: A fully connected output layer with a sigmoid activation function for binary classification.
5.  **Compilation**: The model is compiled with binary cross-entropy loss and the Adam optimizer.

In [None]:
# Define a function to create a simple LSTM model
def get_simple_lstm(embeddings, lstm_size=25, dropout_rate=0.2):

    # Get the vocabulary size and embedding dimension from the shape of the embeddings matrix
    vocab_size, word_embedding_dim=embeddings.shape
    
    # Define the input layer, which expects sequences of integers of variable length
    word_sequence_input = Input(shape=(None,), dtype='int32')
    
    # Define the embedding layer
    word_embedding_layer = Embedding(vocab_size,          # The size of the vocabulary
                                    word_embedding_dim,  # The dimension of the embeddings
                                    weights=[embeddings],  # Initialize with pre-trained embeddings
                                    mask_zero=True,      # Enable masking for padded inputs (value 0)
                                    trainable=False)     # Freeze the embedding weights during training

    
    # Pass the input sequence through the embedding layer
    embedded_sequences = word_embedding_layer(word_sequence_input)
    
    # Add an LSTM layer. It will only return the final hidden state because return_sequences=False.
    lstm = LSTM(lstm_size, return_sequences=False, activation='tanh', dropout=dropout_rate)(embedded_sequences)
  
    # Add the final dense output layer with a sigmoid activation for binary classification
    predictions=Dense(1, activation="sigmoid")(lstm)

    # Create the Keras Model by specifying its inputs and outputs
    model = Model(inputs=word_sequence_input, outputs=predictions)

    # Compile the model with binary cross-entropy loss, the Adam optimizer, and accuracy metric
    model.compile(loss='binary_crossentropy', optimizer='adam', metrics=['acc'])
    
    # Return the compiled model
    return model
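
Why masking matters for the final hidden state can be seen in a toy recurrence. The sketch below uses a plain tanh RNN cell with random weights instead of an LSTM (a deliberate simplification) and imitates the mask by carrying the previous state through padded steps, which is how Keras recurrent layers treat masked time steps. All names and sizes are made up for the example.

```python
import numpy as np

rng = np.random.default_rng(0)
W = rng.normal(size=(3, 4))  # input-to-hidden weights (3 input dims, 4 hidden units)
U = rng.normal(size=(4, 4))  # hidden-to-hidden weights

def final_state(x, mask):
    """Run a toy tanh RNN and return the last state, skipping masked steps."""
    h = np.zeros(4)
    for x_t, keep in zip(x, mask):
        if keep:
            h = np.tanh(x_t @ W + h @ U)
        # Otherwise the step is padding and the previous state is carried forward.
    return h

real = rng.normal(size=(2, 3))                # two real time steps
padded = np.vstack([real, np.zeros((3, 3))])  # the same sequence plus three pad steps

h_real = final_state(real, [True, True])
h_masked = final_state(padded, [True, True, False, False, False])
h_unmasked = final_state(padded, [True] * 5)

print(np.allclose(h_real, h_masked))  # True: masked pads leave the state untouched
print(np.abs(h_real - h_unmasked).max())  # non-zero: unmasked pads keep updating the state
```

Without the mask, the representation of a short document would depend on how much padding follows it.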

#### **Cell 12: Train the Simple LSTM**
This cell calls the `train` function to build and train the simple LSTM model defined above.

In [None]:
# Create and train the simple LSTM model using the helper function
train(get_simple_lstm(embeddings, lstm_size=25, dropout_rate=0.2))

### Model 2: Bidirectional LSTM (BiLSTM)
Next, we'll use a Bidirectional LSTM. A BiLSTM consists of two LSTMs: one processing the sequence from start to end (forward), and another from end to start (backward). The document is represented by concatenating the final hidden state of each, giving a vector of size `2 * lstm_size` that captures context from both directions.

#### **Cell 13: `get_simple_bilstm` Function**
This function defines a BiLSTM model. It's similar to the simple LSTM, but the `LSTM` layer is wrapped in a `Bidirectional` layer. The `merge_mode='concat'` argument specifies that the final forward and backward hidden states should be concatenated to form the final representation.

In [None]:
# Define a function to create a simple Bidirectional LSTM (BiLSTM) model
def get_simple_bilstm(embeddings, lstm_size=25, dropout_rate=0.2):

    # Get the vocabulary size and embedding dimension from the embeddings matrix
    vocab_size, word_embedding_dim=embeddings.shape

    # Define the input layer for sequences of word IDs
    word_sequence_input = Input(shape=(None,), dtype='int32')
    
    # Define the embedding layer, initialized with pre-trained weights and masking enabled
    word_embedding_layer = Embedding(vocab_size,
                                    word_embedding_dim,
                                    weights=[embeddings],
                                    mask_zero=True,
                                    trainable=False)

    
    # Pass the input through the embedding layer
    embedded_sequences = word_embedding_layer(word_sequence_input)
    
    # Add a Bidirectional LSTM layer. It wraps a standard LSTM.
    # merge_mode='concat' concatenates the final forward and backward hidden states.
    # return_sequences=False ensures only the final states are output.
    bi_lstm = Bidirectional(LSTM(lstm_size, return_sequences=False, activation='tanh', dropout=dropout_rate), merge_mode='concat')(embedded_sequences)
  
    # Add the final dense output layer for classification
    predictions=Dense(1, activation="sigmoid")(bi_lstm)

    # Create the Keras Model
    model = Model(inputs=word_sequence_input, outputs=predictions)

    # Compile the model
    model.compile(loss='binary_crossentropy', optimizer='adam', metrics=['acc'])
    
    # Return the compiled model
    return model
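
A quick way to sanity-check the output of `model.summary()` is to work out the trainable parameter counts by hand. The sketch below assumes 300-dimensional embeddings (an assumption taken from the `300d` in the GloVe file name) and the default LSTM configuration with biases; `lstm_params` and `dense_params` are hypothetical helpers.

```python
def lstm_params(input_dim, units):
    """Weights in one LSTM: four blocks (input, forget and output gates plus the
    cell candidate), each with input weights, recurrent weights and a bias."""
    return 4 * (units * (input_dim + units) + units)

def dense_params(input_dim, units=1):
    """Weights in a Dense layer: one weight per input plus a bias, per unit."""
    return units * (input_dim + 1)

embedding_dim, lstm_size = 300, 25  # 300 is assumed from the file name

uni = lstm_params(embedding_dim, lstm_size)
bi = 2 * uni  # forward and backward LSTMs have separate weights

print(uni, dense_params(lstm_size))     # 32600 26
print(bi, dense_params(2 * lstm_size))  # 65200 51
```

The BiLSTM doubles the recurrent parameters, and the `Dense` layer sees 50 inputs instead of 25 because of `merge_mode='concat'`. The frozen embedding matrix is reported separately as non-trainable parameters.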

#### **Cell 14: Train the Simple BiLSTM**
This cell builds and trains the simple BiLSTM model.

In [None]:
# Create and train the simple BiLSTM model
train(get_simple_bilstm(embeddings, lstm_size=25, dropout_rate=0.2))

### Advanced Pooling Strategies with Masking
The final hidden state can sometimes be a bottleneck, losing information from the beginning of a long sequence. A better approach is often to aggregate information from the *entire* sequence. This can be done by using the output of the LSTM at every time step (`return_sequences=True`).

However, since our sequences are padded with zeros, a simple average or max-pooling would be skewed by these pads. We need to create custom Keras layers that support masking to ensure the pooling operations are only performed over the actual, non-padded parts of the sequence.
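
The size of the distortion is easy to see on a toy example. The sketch below uses invented numbers for one document with two real steps and two padded steps. The padded rows are set to zero here for simplicity; what a recurrent layer actually emits at padded steps depends on the framework and need not be zero, which is one more reason to apply the mask explicitly.

```python
import numpy as np

# One document: 2 real time steps followed by 2 padded steps, 3 features each.
outputs = np.array([[0.2, 0.4, 0.6],
                    [0.4, 0.8, 0.2],
                    [0.0, 0.0, 0.0],
                    [0.0, 0.0, 0.0]])
mask = np.array([1.0, 1.0, 0.0, 0.0])

# Naive mean divides by all 4 steps, including the pads.
naive_mean = outputs.mean(axis=0)

# Masked mean zeroes the pads and divides by the 2 real steps only.
masked_mean = (outputs * mask[:, None]).sum(axis=0) / mask.sum()

print(naive_mean)   # approximately [0.15, 0.30, 0.20]
print(masked_mean)  # approximately [0.30, 0.60, 0.40]
```

The naive average is halved here because half of the steps are padding, so shorter documents would systematically get smaller representations.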

#### **Cell 15: `MaskedAveragePooling1D` Custom Layer**
This custom Keras layer performs average pooling over the time-step dimension but correctly handles masking. It multiplies the inputs by the mask to zero out the padded steps, then calculates the sum and divides by the true number of non-masked steps.

In [None]:
# Define a custom Keras Layer for average pooling that supports masking
class MaskedAveragePooling1D(Layer):
    # The initialization method for the layer
    def __init__(self, **kwargs):
        # Call the parent class's constructor first, so base-class setup runs before the flag is set
        super().__init__(**kwargs)
        # Indicate that this layer supports masking
        self.supports_masking = True

    # Define how the mask is computed for the output of this layer (it doesn't pass on a mask)
    def compute_mask(self, input, input_mask=None):
        return None

    # This is the main logic of the layer
    def call(self, x, mask=None):
        # Without a mask every time step is real, so a plain mean is correct
        if mask is None:
            return K.mean(x, axis=1)

        # Cast the boolean mask to float type (e.g., True -> 1.0, False -> 0.0)
        mask = K.cast(mask, K.floatx())
        # Repeat the mask to match the shape of the input tensor 'x' along the last dimension
        mask = K.repeat(mask, x.shape[-1])
        # Transpose the mask to align its dimensions for element-wise multiplication
        mask = tf.transpose(mask, [0,2,1])
        # Multiply the input 'x' by the mask to zero out the padded (masked) elements
        x = x * mask

        # Sum the elements of 'x' along the time step axis (axis=1)
        # and divide by the sum of the mask to get the true average over non-padded steps;
        # K.epsilon() guards against division by zero for a sequence that is all padding
        return K.sum(x, axis=1) / K.maximum(K.sum(mask, axis=1), K.epsilon())

    # Define the shape of the layer's output
    def compute_output_shape(self, input_shape):
        # The output shape is (batch_size, features), removing the time step dimension
        return (input_shape[0], input_shape[2])
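
The repeat-then-transpose step is the least obvious part of the layer, so here is a NumPy sketch of the same shape manipulation on made-up data. `K.repeat` turns a `(batch, time)` mask into `(batch, features, time)`, and the transpose brings it to `(batch, time, features)` so it lines up with the LSTM outputs. The sketch also shows that broadcasting over a trailing axis gives the same result.

```python
import numpy as np

batch, time, features = 2, 4, 3
x = np.arange(batch * time * features, dtype="float32").reshape(batch, time, features)
mask = np.array([[1, 1, 1, 0],
                 [1, 0, 0, 0]], dtype="float32")  # shape (batch, time)

# Equivalent of K.repeat(mask, features): shape (batch, features, time).
repeated = np.repeat(mask[:, None, :], features, axis=1)
# Equivalent of the transpose: shape (batch, time, features), aligned with x.
aligned = repeated.transpose(0, 2, 1)

# Broadcasting a trailing axis produces the same masked values in one step.
broadcast = mask[:, :, None]

pooled = (x * aligned).sum(axis=1) / aligned.sum(axis=1)

print(aligned.shape)                               # (2, 4, 3)
print(np.array_equal(x * aligned, x * broadcast))  # True
print(pooled)
# [[ 3.  4.  5.]
#  [12. 13. 14.]]
```

The first document averages its three real steps and the second keeps only its single real step, regardless of the values at the padded positions.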

#### **Cell 16: `MaskedMaxPooling1D` Custom Layer**
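This custom layer performs max pooling while respecting the mask. Zeroing out padded values is not enough, because a zero would still win whenever all real values in a feature are negative. Instead, the layer subtracts a large constant (10,000) from the masked positions. LSTM outputs with a `tanh` activation lie strictly between -1 and 1, so a penalized position can never be selected as the maximum as long as the sequence contains at least one real token.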

In [None]:
# Define a custom Keras Layer for max pooling that supports masking
class MaskedMaxPooling1D(Layer):
    # The initialization method for the layer
    def __init__(self, **kwargs):
        # Call the parent class's constructor first, so base-class setup runs before the flag is set
        super().__init__(**kwargs)
        # Indicate that this layer supports masking
        self.supports_masking = True

    # Define how the mask is computed for the output of this layer (it doesn't pass on a mask)
    def compute_mask(self, input, input_mask=None):
        return None

    # This is the main logic of the layer
    def call(self, x, mask=None):
        # Check if a mask was provided by the previous layer
        if mask is not None:
            # Invert the mask (e.g., True -> False, False -> True) because we want to penalize masked steps
            mask=tf.logical_not(mask)
            # Cast the boolean mask to float type (e.g., True -> 1.0, False -> 0.0)
            mask = K.cast(mask, K.floatx())
            # Repeat the mask to match the shape of the input tensor 'x'
            mask = K.repeat(mask, x.shape[-1])    
            # Transpose the mask to align its dimensions for element-wise operation
            mask = tf.transpose(mask, [0,2,1])
            
            # Multiply the mask by a large number (this will be subtracted)
            mask *= 10000
            # Subtract the large number from the masked positions in 'x'
            x = x - mask
        
        # Compute the maximum value along the time step axis (axis=1)
        return K.max(x, axis=1)
    
    # Define the shape of the layer's output
    def compute_output_shape(self, input_shape):
        # The output shape is (batch_size, features), removing the time step dimension
        return (input_shape[0], input_shape[2])
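
The difference between zeroing and penalizing the padded steps shows up as soon as all real values of a feature are negative. The NumPy sketch below uses invented values for one document and mirrors the layer's arithmetic without depending on Keras.

```python
import numpy as np

# One document whose real outputs are all negative, followed by two pad steps.
outputs = np.array([[-0.5, -0.2],
                    [-0.1, -0.7],
                    [ 0.3,  0.9],   # padded step, value should be ignored
                    [ 0.3,  0.9]])  # padded step, value should be ignored
mask = np.array([True, True, False, False])

# Zeroing the pads is not enough: 0 beats every real (negative) value.
zeroed_max = (outputs * mask[:, None]).max(axis=0)

# Subtracting a large constant at padded steps pushes them out of contention.
penalty = (~mask).astype("float32")[:, None] * 10000
penalized_max = (outputs - penalty).max(axis=0)

print(zeroed_max)     # [0. 0.]
print(penalized_max)  # [-0.1 -0.2]
```

Only the penalized version returns values that actually occur at real time steps.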

### Model 3: BiLSTM with Masked Average Pooling
Now we'll build a model that uses the custom `MaskedAveragePooling1D` layer. The BiLSTM is configured with `return_sequences=True` to output the hidden state at every time step. These outputs are then averaged by our custom layer to create a single vector representation for the document.

#### **Cell 17: `get_bilstm_with_average_pooling` Function**
This function defines the BiLSTM model followed by the custom average pooling layer.

In [None]:
# Define a function to create a BiLSTM model with masked average pooling
def get_bilstm_with_average_pooling(embeddings, lstm_size=25, dropout_rate=0.2):

    # Get vocab size and embedding dimension
    vocab_size, word_embedding_dim=embeddings.shape

    # Define the input layer
    word_sequence_input = Input(shape=(None,), dtype='int32')
    
    # Define the embedding layer with masking enabled
    word_embedding_layer = Embedding(vocab_size,
                                    word_embedding_dim,
                                    weights=[embeddings], 
                                    mask_zero=True,
                                    trainable=False)

    
    # Pass input through the embedding layer
    embedded_sequences = word_embedding_layer(word_sequence_input)
    
    # Add a BiLSTM layer. 'return_sequences=True' makes it output the hidden state for each time step.
    x = Bidirectional(LSTM(lstm_size, return_sequences=True, activation='tanh', dropout=dropout_rate), merge_mode='concat')(embedded_sequences)
    # Apply our custom masked average pooling layer to aggregate the sequences into a single vector
    x=MaskedAveragePooling1D()(x)

    # Add the final dense output layer
    x=Dense(1, activation="sigmoid")(x)

    # Create the Keras Model
    model = Model(inputs=word_sequence_input, outputs=x)

    # Compile the model
    model.compile(loss='binary_crossentropy', optimizer='adam', metrics=['acc'])
    
    # Return the compiled model
    return model

#### **Cell 18: Train the BiLSTM with Average Pooling**
This cell builds and trains the BiLSTM model with masked average pooling.

In [None]:
# Create and train the BiLSTM with masked average pooling
train(get_bilstm_with_average_pooling(embeddings, lstm_size=25, dropout_rate=0.2))

### Model 4: BiLSTM with Masked Max Pooling
This final model is similar to the previous one, but instead of averaging the BiLSTM outputs, it uses the custom `MaskedMaxPooling1D` layer to take the maximum value across the time steps. For each feature, max pooling keeps the strongest activation found anywhere in the sequence, so a signal that fires at a single position is not diluted by the length of the document.
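
A small numeric comparison illustrates how the two pooling strategies react to document length. The activations below are invented: a single feature fires strongly at one position, once in a short document and once in a longer one with no padding involved.

```python
import numpy as np

# One feature over time; a single step fires strongly in both documents.
short_doc = np.array([0.1, 0.9, 0.1])
long_doc = np.array([0.1, 0.9, 0.1, 0.1, 0.1, 0.1, 0.1, 0.1, 0.1])

# Max pooling reports the same peak for both documents.
print(short_doc.max(), long_doc.max())  # 0.9 0.9

# Average pooling shrinks as more low-activation steps are added.
print(short_doc.mean() > long_doc.mean())  # True
```

Neither behavior is better in general: averaging summarizes the whole document, while max pooling highlights its most salient position, which is why the notebook trains both and compares them on the development set.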

#### **Cell 19: `get_bilstm_with_max_pooling` Function**
This function defines the BiLSTM model followed by the custom max pooling layer.

In [None]:
# Define a function to create a BiLSTM model with masked max pooling
def get_bilstm_with_max_pooling(embeddings, lstm_size=25, dropout_rate=0.2):

    # Get vocab size and embedding dimension
    vocab_size, word_embedding_dim=embeddings.shape

    # Define the input layer
    word_sequence_input = Input(shape=(None,), dtype='int32')
    
    # Define the embedding layer with masking enabled
    word_embedding_layer = Embedding(vocab_size,
                                    word_embedding_dim,
                                    weights=[embeddings], 
                                    mask_zero=True,
                                    trainable=False)

    
    # Pass input through the embedding layer
    embedded_sequences = word_embedding_layer(word_sequence_input)
    
    # Add a BiLSTM layer that outputs the hidden state for each time step
    x = Bidirectional(LSTM(lstm_size, return_sequences=True, activation='tanh', dropout=dropout_rate), merge_mode='concat')(embedded_sequences)
    # Apply our custom masked max pooling layer to aggregate the sequences
    x=MaskedMaxPooling1D()(x)

    # Add the final dense output layer
    x=Dense(1, activation="sigmoid")(x)

    # Create the Keras Model
    model = Model(inputs=word_sequence_input, outputs=x)

    # Compile the model
    model.compile(loss='binary_crossentropy', optimizer='adam', metrics=['acc'])
    
    # Return the compiled model
    return model

#### **Cell 20: Train the BiLSTM with Max Pooling**
This final cell builds and trains the BiLSTM model with masked max pooling.

In [None]:
# Create and train the BiLSTM with masked max pooling
train(get_bilstm_with_max_pooling(embeddings, lstm_size=25, dropout_rate=0.2))