This notebook explores the use of attention for text classification, comparing a model that represents a document by averaging its word embeddings to one that uses an attention mechanism to compute a weighted average over those embeddings.

### Cell 1: Imports
This cell imports the libraries used throughout the notebook: Keras (with TensorFlow as its backend) for building the neural network, NumPy for numerical operations, scikit-learn for label encoding, and pandas for plotting the attention weights.

In [None]:
# Import the main Keras library for building neural networks
import keras
# Import NumPy for numerical operations, especially with arrays
import numpy as np
# Import preprocessing tools from scikit-learn, specifically for encoding labels
from sklearn import preprocessing
# Import specific layers and components from Keras to build the model
from keras.layers import Dense, Input, Embedding, Lambda, Layer, Multiply, Dropout, Dot
# Import the Model class to create a trainable model object
from keras.models import Model
# Import the Keras backend (we use it for custom layer operations)
from keras import backend as K
# Import TensorFlow, which Keras uses as its backend engine
import tensorflow as tf
# Import callbacks for saving the best model and stopping training early
from keras.callbacks import ModelCheckpoint, EarlyStopping, Callback
# Import pandas for data manipulation, used here for visualizing attention
import pandas as pd

### Cell 2: Load Embeddings Function
This function, `load_embeddings`, reads a file containing pre-trained word embeddings (like GloVe or Word2Vec). It builds a vocabulary mapping words to integer IDs and an embedding matrix where the row index corresponds to a word's ID. It also adds special tokens for padding (`_0_`) and unknown words (`_UNK_`).

In [None]:
# Define a function to load pre-trained word embeddings from a file
def load_embeddings(filename, max_vocab_size):

    # Create an empty dictionary to store our vocabulary (word -> integer ID)
    vocab={}
    # Create an empty list to store the embedding vectors
    embeddings=[]
    # Open the specified file to read the embeddings
    with open(filename, encoding="utf-8") as file:
        
        # Read the first line, which often contains the number of words and the embedding dimension
        cols=file.readline().split(" ")
        # Extract the total number of words in the file
        num_words=int(cols[0])
        # Extract the size (dimension) of each embedding vector
        size=int(cols[1])
        
        # Add a zero vector for the padding token (ID 0)
        embeddings.append(np.zeros(size))
        # Add a zero vector for the "Unknown" (UNK) token (ID 1)
        embeddings.append(np.zeros(size))
        # Add the padding token to our vocabulary
        vocab["_0_"]=0
        # Add the UNK token to our vocabulary
        vocab["_UNK_"]=1
        
        # Loop through each line in the embedding file
        for idx,line in enumerate(file):

            # Stop if we have reached the desired maximum vocabulary size
            if idx+2 >= max_vocab_size:
                break

            # Split the line into the word and its vector components
            cols=line.rstrip().split(" ")
            # Convert the vector components (read as strings) to a NumPy array of floats
            val=np.array(cols[1:], dtype="float32")
            # The first column is the word itself
            word=cols[0]
            
            # Add the word's vector to our embeddings list
            embeddings.append(val)
            # Add the word to our vocabulary, mapping it to its new ID (index + 2)
            vocab[word]=idx+2

    # Convert the list of embeddings to a NumPy array and return it along with the vocabulary and embedding size
    return np.array(embeddings), vocab, size
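
To make the expected file layout concrete, here is a minimal sketch that parses a tiny in-memory file in word2vec text format using the same steps as `load_embeddings`. The three-word file, the 2-dimensional vectors, and the helper name `parse_w2v_text` are made up for illustration and are not part of the notebook.

```python
import io
import numpy as np

# Hypothetical toy file: a "num_words dim" header, then one word and its vector per line
toy_file = io.StringIO("3 2\nthe 0.1 0.2\nmovie 0.3 0.4\nlove 0.5 0.6\n")

def parse_w2v_text(handle, max_vocab_size):
    # Header line: number of words and vector dimension
    num_words, size = (int(c) for c in handle.readline().split(" "))
    # IDs 0 and 1 are reserved for padding and unknown words
    vocab = {"_0_": 0, "_UNK_": 1}
    rows = [np.zeros(size), np.zeros(size)]
    for idx, line in enumerate(handle):
        # The cap counts the two reserved rows as well
        if idx + 2 >= max_vocab_size:
            break
        cols = line.rstrip().split(" ")
        vocab[cols[0]] = idx + 2
        rows.append(np.array(cols[1:], dtype="float32"))
    return np.array(rows), vocab, size

toy_embeddings, toy_vocab, toy_size = parse_w2v_text(toy_file, max_vocab_size=4)
print(toy_embeddings.shape)
print(toy_vocab)
```

With a cap of 4, only two real words are kept, because the two reserved rows count toward the limit.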

### Cell 3: Read Data Function
The `read_data` function reads a tab-separated values (TSV) file where each line contains a label and a text document. It separates them and returns two lists: one for the tokenized texts (`X`) and one for the labels (`Y`). The `vocab` argument is accepted but not used inside the function.

In [None]:
# Define a function to read text data from a file
def read_data(filename, vocab):
    # Initialize an empty list to store the text documents (features)
    X=[]
    # Initialize an empty list to store the labels
    Y=[]
    # Open the file, specifying utf-8 encoding for broad character support
    with open(filename, encoding="utf-8") as file:
        # Iterate over each line in the file
        for line in file:
            # Remove trailing whitespace and split the line by the tab character
            cols=line.rstrip().split("\t")
            # The first column is the label
            label=cols[0]
            # The second column is the text, which is assumed to be already tokenized (words separated by spaces)
            text=cols[1].split(" ")
            # Append the list of tokens to the features list X
            X.append(text)
            # Append the label to the labels list Y
            Y.append(label)
    # Return the lists of documents and labels
    return X, Y
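
As a quick illustration of the input layout this function expects, the sketch below runs the same split logic on a two-line in-memory TSV. The label strings `pos` and `neg` and the example sentences are assumptions; the actual labels in `train.tsv` may differ.

```python
import io

# Hypothetical two-line TSV: a label, a tab, then space-tokenized text
toy_tsv = io.StringIO("pos\ti love this movie !\nneg\ti do not love this movie !\n")

toy_X, toy_Y = [], []
for line in toy_tsv:
    # Split once on the tab to separate the label from the text
    label, text = line.rstrip().split("\t")
    # The text is assumed to be pre-tokenized, so a space split recovers the tokens
    toy_X.append(text.split(" "))
    toy_Y.append(label)

print(toy_Y)
print(toy_X[0])
```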

### Cell 4: Convert Words to IDs Function
This function, `get_word_ids`, takes a list of documents (each a list of tokens) and converts every token into its corresponding integer ID from the vocabulary. It also ensures all documents have the same length by truncating long ones and padding shorter ones with zeros. This fixed length is required for creating batches for the neural network.

In [None]:
# Define a function to convert documents (lists of tokens) into sequences of integer IDs
def get_word_ids(docs, vocab, max_length=200):
    
    # Initialize an empty list to hold the ID sequences for all documents
    doc_ids=[]
    
    # Iterate through each document in the input list
    for doc in docs:
        # Initialize an empty list for the current document's word IDs
        wids=[]
        # Iterate through the first `max_length` tokens of the document
        for token in doc[:max_length]:
            # Look up the lowercased token in the vocabulary. If not found, use the ID for UNK (1).
            val = vocab.get(token.lower(), 1)
            # Append the integer ID to the current document's list
            wids.append(val)
        
        # Pad the sequence with zeros until it reaches `max_length`
        for i in range(len(wids),max_length):
            wids.append(0)

        # Add the final padded sequence of IDs to our list of all documents
        doc_ids.append(wids)

    # Convert the list of lists into a NumPy array and return it
    return np.array(doc_ids)
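
The truncation, unknown-word, and padding behaviour is easiest to see on a toy input. This is a minimal sketch with a made-up six-entry vocabulary and a hypothetical helper `to_ids`; it mirrors the logic above but is not the notebook's function.

```python
import numpy as np

# Hypothetical vocabulary: IDs 0 and 1 are reserved for padding and unknown words
toy_vocab = {"_0_": 0, "_UNK_": 1, "i": 2, "love": 3, "this": 4, "movie": 5}

def to_ids(doc, vocab, max_length):
    # Truncate to max_length, lowercase each token, and map unseen tokens to 1
    ids = [vocab.get(tok.lower(), 1) for tok in doc[:max_length]]
    # Right-pad with zeros up to max_length
    return ids + [0] * (max_length - len(ids))

short_doc = ["I", "love", "Casablanca"]
long_doc = ["i", "love", "this", "movie", "i", "love", "this", "movie"]
toy_ids = np.array([to_ids(d, toy_vocab, max_length=5) for d in (short_doc, long_doc)])
print(toy_ids)
```

The short document ends in two padding zeros and contains one unknown-word ID, while the long document is cut to its first five tokens.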

If you haven't downloaded the GloVe vectors, do so first. The top 50K words in the "Common Crawl (42B)" vectors (300-dimensional) can be found here: [glove.42B.300d.50K.txt](https://drive.google.com/file/d/1n1jt0UIdI3CD26cY1EIeks39XH5S8O8M/view?usp=sharing); download the file and place it in your `data` directory.

### Cell 5: Data Preparation
This cell converts the pre-trained GloVe embeddings from their original text format to the word2vec text format, which the `load_embeddings` function is designed to read. The embeddings themselves are loaded into memory in the next cell.

In [None]:
# Import the glove2word2vec conversion script from the gensim library
from gensim.scripts.glove2word2vec import glove2word2vec

# Specify the path to the original GloVe embeddings file
glove_file="../data/glove.42B.300d.50K.txt"
# Specify the path for the output file in word2vec format
glove_in_w2v_format="../data/glove.42B.300d.50K.w2v.txt"
# Run the conversion utility. The underscore `_` is used to discard the return value.
_ = glove2word2vec(glove_file, glove_in_w2v_format)
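
The two text formats differ only in that the word2vec format begins with a header line giving the number of words and the vector dimension. The sketch below shows that difference on two made-up lines; it is a simplified illustration of the idea, not gensim's implementation.

```python
# Hypothetical GloVe-style lines: "word v1 v2 ..." with no header
glove_lines = ["the 0.1 0.2 0.3", "movie 0.4 0.5 0.6"]

# Count the words and infer the dimension from the first line
num_words = len(glove_lines)
dim = len(glove_lines[0].split(" ")) - 1

# word2vec text format is the same content preceded by a "num_words dim" header
w2v_lines = ["%d %d" % (num_words, dim)] + glove_lines
print(w2v_lines[0])
```

This header is what `load_embeddings` reads with its first `readline()` call.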

### Cell 6: Loading Embeddings
Here, we call the `load_embeddings` function to get our embedding matrix, vocabulary dictionary, and the dimension of the embeddings. We cap the vocabulary at 50,000 entries, a figure that includes the padding and unknown-word tokens.

In [None]:
# Call the function to load embeddings, limiting the vocabulary to 50,000 words
embeddings, vocab, embedding_size=load_embeddings("../data/glove.42B.300d.50K.w2v.txt", 50000)

### Cell 7: Setting Data Directory
This cell defines the directory where the training, development (validation), and test datasets are located.

In [None]:
# Set this to the directory containing your train.tsv, dev.tsv, and test.tsv files
directory="../data/lmrd"

### Cell 8: Reading Datasets
We use the `read_data` function to load the raw text and labels for the training and development sets.

In [None]:
# Read the training data from the specified directory
trainText, trainY=read_data("%s/train.tsv" % directory, vocab)
# Read the development (validation) data
devText, devY=read_data("%s/dev.tsv" % directory, vocab)

### Cell 9: Numerical Conversion
The text documents are converted into padded sequences of integer IDs using the `get_word_ids` function. This prepares the data to be fed into the model's embedding layer.

In [None]:
# Convert the training text documents into padded sequences of word IDs
trainX = get_word_ids(trainText, vocab, max_length=200)
# Convert the development text documents into padded sequences of word IDs
devX = get_word_ids(devText, vocab, max_length=200)

### Cell 10: Label Encoding
The string labels (e.g., "positive", "negative") are converted into integers (0 and 1). A neural network requires numerical inputs and outputs.

In [None]:
# Create an instance of the LabelEncoder
le = preprocessing.LabelEncoder()
# Fit the encoder on the training labels to learn the mapping (e.g., "pos" -> 1, "neg" -> 0)
le.fit(trainY)
# Transform the training labels into integers and convert to a NumPy array
Y_train=np.array(le.transform(trainY))
# Transform the development labels into integers and convert to a NumPy array
Y_dev=np.array(le.transform(devY))
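
`LabelEncoder` assigns integers according to the sorted order of the distinct labels. The sketch below reproduces that mapping with `np.unique` so it runs without scikit-learn; the label strings `pos` and `neg` are assumptions, since the dataset's actual labels are not shown here.

```python
import numpy as np

# Hypothetical string labels
toy_labels = ["pos", "neg", "neg", "pos"]

# np.unique returns the sorted distinct labels and, for each item, its index in that sorted list
classes, encoded = np.unique(toy_labels, return_inverse=True)
print(classes)
print(encoded)
```

Under these assumed labels, `neg` maps to 0 and `pos` maps to 1 because `neg` sorts first.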

First, let's try a simple model that represents a document by averaging the embeddings for the words it contains. We'll again use masking so that zero-padded positions don't count toward the average.

### Cell 12: Custom Pooling Layer
This custom layer, `MaskedAveragePooling1D`, performs average pooling but is aware of the mask generated by the `Embedding` layer (when `mask_zero=True`). It ensures that the padded time steps (words) are not included in the average calculation, preventing the padding from skewing the document representation.

In [None]:
# Define a custom Keras Layer for average pooling that respects masking
class MaskedAveragePooling1D(Layer):
    # The __init__ method is the constructor for the layer
    def __init__(self, **kwargs):
        # Set a flag to indicate that this layer supports masking
        self.supports_masking = True
        # Call the parent class's constructor
        super(MaskedAveragePooling1D, self).__init__(**kwargs)

    # This method computes the mask for the output of this layer. We return None as this layer condenses the sequence, so the mask is no longer needed.
    def compute_mask(self, input, input_mask=None):
        return None

    # The `call` method contains the layer's logic. It takes the input tensor `x` and its mask.
    def call(self, x, mask=None):
        # Check if a mask was provided by the previous layer
        if mask is not None:
            # Cast the boolean mask to a float tensor (e.g., True -> 1.0, False -> 0.0)
            mask = K.cast(mask, K.floatx())
            # The mask has shape (batch_size, timesteps). We need to expand it to match the input tensor's shape (batch_size, timesteps, embedding_dim).
            # `K.repeat` adds a new dimension and copies the mask along it.
            mask = K.repeat(mask, x.shape[-1])
            # Transpose the mask to align its dimensions with the input tensor `x` for element-wise multiplication
            mask = tf.transpose(mask, [0,2,1])
            # Multiply the input `x` by the mask. This sets the embedding vectors of padded words to zero.
            x = x * mask
            
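            # Sum the embeddings along the time steps axis (axis=1) and divide by the number of non-masked time steps to get the true average.
            # After the repeat and transpose, `K.sum(mask, axis=1)` holds the number of actual words in each sequence, repeated across the embedding dimension.
            return K.sum(x, axis=1) / K.sum(mask, axis=1)

        # Without a mask every time step counts, so a plain mean over the time axis is correct
        return K.mean(x, axis=1)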

    # This method defines the shape of the layer's output
    def compute_output_shape(self, input_shape):
        # The output is a single vector per document, so the shape is (batch_size, embedding_dim)
        return (input_shape[0], input_shape[2])
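
The effect of the mask is easier to see outside Keras. This NumPy sketch computes the same masked average on a made-up batch and compares it with a naive mean over all time steps; the array values are placeholders chosen so the difference is obvious.

```python
import numpy as np

# Hypothetical batch: 2 documents, 3 time steps, 2-dimensional embeddings
x = np.array([[[1.0, 2.0], [3.0, 4.0], [9.0, 9.0]],
              [[2.0, 2.0], [4.0, 6.0], [6.0, 10.0]]])
# True marks real tokens; the last step of the first document is padding
mask = np.array([[True, True, False],
                 [True, True, True]])

# Shape (batch, timesteps, 1) so the mask broadcasts over the embedding axis
m = mask.astype(x.dtype)[:, :, None]
# Zero out padded steps, sum over time, divide by the count of real tokens
masked_mean = (x * m).sum(axis=1) / m.sum(axis=1)
# Naive mean that treats padding as a real token
naive_mean = x.mean(axis=1)
print(masked_mean)
print(naive_mean)
```

For the first document the masked mean averages over two tokens, while the naive mean divides by three and is pulled toward the padded position.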

### Cell 13: Building the Averaging Model
This function constructs the Keras model. It consists of:
1.  An **Input** layer for the integer sequences.
2.  An **Embedding** layer that converts integers to dense vectors using the pre-trained GloVe embeddings. `mask_zero=True` tells Keras to ignore the padding value (0). `trainable=False` freezes the embeddings so they don't change during training.
3.  Our custom `MaskedAveragePooling1D` layer to get the document vector.
4.  A final **Dense** layer with a sigmoid activation for binary classification.

In [None]:
# Define a function to create the embedding-averaging model
def get_embedding_average(embeddings):

    # Get the vocabulary size and embedding dimension from the shape of the embeddings matrix
    vocab_size, word_embedding_dim=embeddings.shape
    
    # Define the input layer, which expects sequences of integers of variable length (None)
    word_sequence_input = Input(shape=(None,), dtype='int32')
    
    # Define the embedding layer
    word_embedding_layer = Embedding(vocab_size,         # The number of words in our vocabulary
                                    word_embedding_dim, # The dimension of each word vector
                                    weights=[embeddings], # Initialize with our pre-trained GloVe embeddings
                                    mask_zero=True,       # Enable masking to ignore padding (zeros)
                                    trainable=False)      # Freeze the embeddings; we won't update them during training

    
    # Pass the input sequence through the embedding layer
    embedded_sequences = word_embedding_layer(word_sequence_input)
    # Apply our custom masked average pooling layer to get a single document vector
    x=MaskedAveragePooling1D()(embedded_sequences)
    
    # Add a final dense layer with one neuron and a sigmoid activation function for binary classification
    predictions=Dense(1, activation="sigmoid")(x)

    # Create the Keras Model, specifying the inputs and outputs
    model = Model(inputs=word_sequence_input, outputs=predictions)

    # Compile the model, defining the loss function, optimizer, and metrics
    model.compile(loss='binary_crossentropy', optimizer='adam', metrics=['acc'])
    
    # Return the compiled model
    return model

### Cell 14: Model Summary
We instantiate the model and print its summary to see the architecture and number of parameters.

In [None]:
# Create the embedding averaging model
embedding_model=get_embedding_average(embeddings)
# Print a summary of the model's architecture (`summary()` prints directly and returns None, so no `print` is needed)
embedding_model.summary()
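
As a rough sanity check on the summary, the parameter counts can be worked out by hand. The sketch below assumes the embedding matrix has 50,000 rows and 300 columns, as the loading cell suggests; the variable names are for illustration only.

```python
# Assumed shape of the embedding matrix
vocab_size, dim = 50000, 300

# The embedding layer is frozen, so these count as non-trainable parameters
embedding_params = vocab_size * dim
# The sigmoid Dense layer has one weight per embedding dimension plus a bias
dense_params = dim * 1 + 1
print(embedding_params, dense_params)
```

Under these assumptions almost all parameters sit in the frozen embedding layer, and only the final classifier is trained.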

### Cell 15: Training the Averaging Model
Now we train the baseline model. We use a `ModelCheckpoint` callback to save the best version of the model (based on validation loss) during training.

In [None]:
# Set the current model to be the embedding averaging model
model=embedding_model

# Define the filename for saving the best model
modelName="embedding_model.hdf5"
# Create a ModelCheckpoint callback to monitor validation loss and save only the best model
checkpoint = ModelCheckpoint(modelName, monitor='val_loss', verbose=0, save_best_only=True, mode='min')

# Train the model
model.fit(trainX, Y_train,                          # Training data and labels
            validation_data=(devX, Y_dev),          # Validation data and labels
            epochs=30, batch_size=128,              # Number of epochs and batch size
            callbacks=[checkpoint])                 # List of callbacks to use during training
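
With `save_best_only=True` and `mode='min'`, the checkpoint file is overwritten only when the monitored value improves. The sketch below traces that rule over a made-up sequence of validation losses; it illustrates the decision logic and is not the Keras implementation.

```python
# Hypothetical validation losses, one per epoch
val_losses = [0.62, 0.48, 0.41, 0.43, 0.40, 0.45]

best = float("inf")
saved_epochs = []
for epoch, loss in enumerate(val_losses, start=1):
    # Save only when the loss is lower than anything seen so far
    if loss < best:
        best = loss
        saved_epochs.append(epoch)  # the callback would overwrite the file here

print(saved_epochs, best)
```

In this made-up run the file left on disk after training comes from epoch 5, not from the final epoch.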

Next, let's add attention to that simple model to learn a *weighted* average over words, giving more weight to the words that matter most for representing the document in this classification task.

### Cell 17: Custom Attention Layer
This custom `AttentionLayerMasking` layer calculates the attention weights. For each word vector in a sequence, it computes a score, turns these scores into a probability distribution (using softmax), and outputs these weights. It also correctly handles masking to ensure padded words get an attention weight of zero.

In [None]:
# Define a custom Keras Layer to compute attention weights, with support for masking
class AttentionLayerMasking(Layer):

    # The constructor stores `output_dim` for reference; the value does not affect the computation in this layer
    def __init__(self, output_dim, **kwargs):
        self.output_dim = output_dim
        super(AttentionLayerMasking, self).__init__(**kwargs)


    # The `build` method is where weights are created. It's called automatically by Keras.
    def build(self, input_shape):
        # Get the dimension of the input embeddings
        input_embedding_dim=input_shape[-1]
        
        # Create a trainable weight matrix (often called the context vector or kernel).
        # Its job is to learn to project the input embeddings into a single score.
        self.kernel = self.add_weight(name='kernel', 
                            shape=(input_embedding_dim,1), # Shape allows dot product with each word vector
                            initializer='uniform',         # Initialize weights uniformly
                            trainable=True)                # This weight will be learned during training
        super(AttentionLayerMasking, self).build(input_shape)

    # This layer consumes the mask, so we don't propagate it further
    def compute_mask(self, input, input_mask=None):
        return None

    # This `call` method contains the core logic for calculating attention
    def call(self, x, mask=None):
        
        # 1. Compute scores: Perform a dot product between each word's representation `x` and the learned kernel.
        # This results in a single score for each word in the sequence.
        x=K.dot(x, self.kernel)
        # 2. Exponentiate the scores. This is the numerator of the softmax; the normalization happens in step 4, after masking.
        x=K.exp(x)
        
        # 3. Apply the mask: Zero out the scores for any padded words.
        if mask is not None:
            # Cast the boolean mask to floats (True -> 1.0, False -> 0.0)
            mask = K.cast(mask, K.floatx())
            # Add a dimension to the mask so it can be broadcasted and multiplied with the scores `x`
            mask = K.expand_dims(mask, axis=-1)
            # Element-wise multiplication to zero out scores for padded positions
            x = x * mask
        
        # 4. Normalize scores to get weights (Softmax): Divide each score by the sum of all scores in the sequence.
        # This ensures the weights for each document sum to 1.
        x /= K.sum(x, axis=1, keepdims=True)
        # Squeeze the last dimension to get a final shape of (batch_size, timesteps)
        x=K.squeeze(x, axis=2)

        # Return the final attention weights
        return x

    # This method defines the shape of the layer's output
    def compute_output_shape(self, input_shape):
        # The output is a vector of weights, one for each time step.
        return (input_shape[0], input_shape[1])
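
The four steps in `call` amount to a softmax that skips padded positions. Here is a minimal NumPy sketch of the same computation on random inputs; the shapes, the seed, and the mask pattern are arbitrary choices for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
# Hypothetical inputs: 2 documents, 4 time steps, 3-dimensional vectors
x = rng.normal(size=(2, 4, 3))
kernel = rng.normal(size=(3, 1))
# 1.0 marks real tokens, 0.0 marks padding
mask = np.array([[1, 1, 1, 0],
                 [1, 1, 0, 0]], dtype=x.dtype)

# Steps 1 and 2: one score per time step, then exponentiate
scores = np.exp(x @ kernel)                      # (batch, timesteps, 1)
# Step 3: zero out the scores at padded positions
scores = scores * mask[:, :, None]
# Step 4: normalize over the time axis and drop the trailing dimension
weights = (scores / scores.sum(axis=1, keepdims=True))[:, :, 0]
print(weights.round(3))
```

Each row of `weights` sums to 1 and the padded positions receive exactly zero weight.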

### Cell 18: Building the Attention Model
This function builds the new model architecture:
1.  **Input** and **Embedding** layers are the same as before.
2.  A **Dense** layer with a `tanh` activation is applied to the word embeddings. This projects the embeddings into a new space, allowing the model to learn a representation specifically for calculating attention.
3.  Our custom `AttentionLayerMasking` takes these transformed embeddings and computes the attention weights.
4.  A **Lambda** layer performs a batch-wise dot product between the attention weights and the *original* word embeddings. This computes the weighted average, creating the final document representation (context vector).
5.  A final **Dense** layer with a sigmoid activation performs the classification.

In [None]:
# Define a function to create the model with the attention mechanism
def get_embedding_with_attention_masking(embeddings):

    # Get the vocabulary size and embedding dimension
    vocab_size, word_embedding_dim=embeddings.shape
    
    # Define the input layer for sequences of integers
    word_sequence_input = Input(shape=(None,), dtype='int32')
    
    # Define the embedding layer, initialized with pre-trained weights and masking enabled
    word_embedding_layer = Embedding(vocab_size,
                                    word_embedding_dim,
                                    weights=[embeddings], 
                                    mask_zero=True,
                                    trainable=False)

    
    # Get the embedded sequences from the input
    embedded_sequences = word_embedding_layer(word_sequence_input)
    
    # 1. Transform word embeddings into a new representation for attention calculation
    # A Dense layer with a 'tanh' activation is a common choice for this.
    attention_key_dim=300
    attention_input=Dense(attention_key_dim, activation='tanh')(embedded_sequences)

    # 2. Compute attention weights using our custom layer
    # The output `attention_output` has shape (batch_size, timesteps) and contains the weight for each word.
    attention_output = AttentionLayerMasking(word_embedding_dim, name="attention")(attention_input)
    
    # 3. Compute the document representation as a weighted average of the original embeddings
    # We use a Lambda layer to perform a batch dot product between the attention weights and the embeddings.
    # This effectively calculates sum(attention_weight_i * embedding_i) for each document.
    document_representation = Lambda(lambda x: K.batch_dot(x[0], x[1], axes=1), name='dot')([attention_output,embedded_sequences])

    # 4. Classify the resulting document representation
    x=Dense(1, activation="sigmoid")(document_representation)

    # Create the Keras Model
    model = Model(inputs=word_sequence_input, outputs=x)

    # Compile the model
    model.compile(loss='binary_crossentropy', optimizer='adam', metrics=['acc'])
    
    # Return the compiled model
    return model
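
The batch dot product in the `Lambda` layer contracts the time axis of the weights with the time axis of the embeddings. This NumPy sketch shows the same contraction with `np.einsum` on made-up values; it stands in for `K.batch_dot(..., axes=1)` and is not the Keras call itself.

```python
import numpy as np

# Hypothetical attention weights, shape (batch, timesteps); each row sums to 1
weights = np.array([[0.5, 0.5, 0.0],
                    [0.2, 0.3, 0.5]])
# Hypothetical embeddings, shape (batch, timesteps, dim)
embedded = np.array([[[1.0, 0.0], [3.0, 2.0], [9.0, 9.0]],
                     [[1.0, 1.0], [2.0, 2.0], [4.0, 0.0]]])

# For each document b: sum over t of weights[b, t] * embedded[b, t, :]
doc_repr = np.einsum("bt,btd->bd", weights, embedded)
print(doc_repr)
```

The third token of the first document has weight zero, so its embedding does not influence the result.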

### Cell 19: Attention Model Summary
We instantiate the attention model and print its summary. Notice the additional `Dense` and `AttentionLayerMasking` layers.

In [None]:
# Create the attention-based model
embedding_attention_model=get_embedding_with_attention_masking(embeddings)
# Print a summary of the model's architecture (`summary()` prints directly and returns None, so no `print` is needed)
embedding_attention_model.summary()
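
The trainable parameters added by attention can be counted by hand as a check against the summary. This sketch assumes 300-dimensional embeddings and the `attention_key_dim` of 300 used in the model function; the variable names are for illustration only.

```python
dim = 300                # embedding size (assumed)
attention_key_dim = 300  # size of the tanh projection

# tanh Dense layer: a weight matrix plus one bias per output unit
projection_params = dim * attention_key_dim + attention_key_dim
# Attention kernel: one weight per projected dimension, no bias
kernel_params = attention_key_dim * 1
# Final sigmoid Dense layer on the document representation
classifier_params = dim * 1 + 1

trainable_total = projection_params + kernel_params + classifier_params
print(trainable_total)
```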

### Cell 20: Training the Attention Model
We train the attention model, again using `ModelCheckpoint` to save the best performing version on the validation set.

In [None]:
# Set the current model to be the attention model
model=embedding_attention_model

# Define the filename for saving the best model
modelName="embedding_attention_model.hdf5"
# Create a ModelCheckpoint callback
checkpoint = ModelCheckpoint(modelName, monitor='val_loss', verbose=0, save_best_only=True, mode='min')

# Train the model
model.fit(trainX, Y_train, 
            validation_data=(devX, Y_dev),
            epochs=30, batch_size=128,
            callbacks=[checkpoint])

Now let's explore which words in a document the learned attention model attends to.

### Cell 22: Loading the Best Model
We first load the weights of the best model that were saved by the `ModelCheckpoint` callback during training.

In [None]:
# Reuse the attention model object that was trained above
model=embedding_attention_model
# Load the saved weights from the best epoch during training
model.load_weights("embedding_attention_model.hdf5")

### Cell 23: Attention Analysis Function
This function, `analyze`, visualizes the attention weights. It uses the Keras backend's `K.function` to create a "functor" that returns the outputs of the model's intermediate layers. We use this to get the output of our `AttentionLayerMasking` layer for a given input sentence. It then prints each word with its corresponding weight and plots the results.

In [None]:
# Define a function to analyze and visualize attention weights for a given document
def analyze(model, doc):
    
    # Tokenize the input document string
    words=doc.split(" ")
    # Convert the tokens into a padded sequence of word IDs
    text = get_word_ids([words], vocab, max_length=len(words))
   
    # Create a Keras Function to get the output of intermediate layers
    # `model.input` is the model's input tensor
    inp = model.input                                    
    # `outputs` is a list of the output tensors of each layer in the model (skipping the input layer itself)
    outputs = [layer.output for layer in model.layers[1:]]       
    # The functor takes the input and the learning phase (0 for test/inference) and returns the layer outputs
    functor = K.function([inp, K.learning_phase()], outputs) 

    # Prepare the input text for the model
    test = text[0]
    orig=words
    attention_weights=[]
    # Reshape the input to have a batch dimension of 1
    test=test.reshape((1,len(words)))
    # Run the input through the functor to get the outputs of all layers
    layer_outs = functor([test, 0.])

    # `outputs` skips the input layer, so `layer_outs` is indexed as 0=Embedding, 1=Dense, 2=Attention.
    # NOTE: This index changes if the model architecture is modified; the layer we want is the one named 'attention'.
    attention_layer=layer_outs[2]
    
    # Iterate through the words and their corresponding attention weights
    for i in range(len(words)):
        # Get the attention weight for the i-th word
        val=attention_layer[0,i]
        # Append it to our list
        attention_weights.append(val)
        # Print the weight and the word
        print ("%.3f\t%s" % (val, orig[i]))
        
    # Create a pandas DataFrame for easy plotting
    df = pd.DataFrame({'words':orig, 'attention':attention_weights})
    # Create a bar plot to visualize the attention weights
    ax = df.plot.bar(x='words', y='attention', figsize=(10,4))
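
The index `2` used for the attention output follows from skipping the input layer when building `outputs`. The sketch below works through that offset with a plain list of layer names; only `attention` and `dot` are names set in the model function, and the others are placeholders.

```python
# Layer order of the attention model; names other than "attention" and "dot" are placeholders
layer_names = ["input", "embedding", "dense_tanh", "attention", "dot", "dense_sigmoid"]

# `analyze` collects outputs from model.layers[1:], which drops the input layer
output_names = layer_names[1:]
# Position of the attention layer within that shortened list
attention_index = output_names.index("attention")
print(attention_index)
```

Looking the position up by name, as here, would be one way to avoid a hard-coded index if the architecture changes.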

### Cell 24: Visualizing Attention on a Positive Example
Let's see which words get the most attention in a simple positive sentence. We expect words like "love" to have a high weight.

In [None]:
# Define a positive input sentence
text="i love this movie !"
# Analyze the attention weights for this sentence
analyze(model, text)

### Cell 25: Visualizing Attention on a Negative Example
Now let's try a negative sentence. Here, we expect words like "not" and "love" to be important for the model to make the correct prediction. The attention mechanism should highlight them.

In [None]:
# Define a negative input sentence
text="i do not love this movie !"
# Analyze the attention weights for this sentence
analyze(model, text)