This notebook explores the use of a bidirectional LSTM with attention for text classification.

### 1. Imports
This cell imports all the necessary libraries for the project.
* **Keras** and **TensorFlow** are used for building and training the neural network.
* **NumPy** and **Pandas** are used for numerical operations and data manipulation.
* **Scikit-learn** provides utilities like the `LabelEncoder`.
* **SciPy** and **math** are used for statistical calculations, specifically for the confidence interval.

In [None]:
# Import the main Keras library
import keras
# Import NumPy for numerical operations
import numpy as np
# Import preprocessing tools from scikit-learn, specifically for encoding labels
from sklearn import preprocessing
# Import specific layers and components needed from Keras to build the model
from keras.layers import Dense, Input, Embedding, Lambda, Layer, Multiply, Dropout, Dot, Bidirectional, LSTM
# Import the Model class to create the final model object
from keras.models import Model
# Import the Keras backend for low-level operations
from keras import backend as K
# Import TensorFlow, often used with Keras
import tensorflow as tf
# Import Keras callbacks for saving the best model and stopping training early
from keras.callbacks import ModelCheckpoint, EarlyStopping, Callback
# Import pandas for data manipulation (not used elsewhere in this notebook)
import pandas as pd
# Import the normal distribution function from scipy.stats to calculate z-scores
from scipy.stats import norm
# Import the square root function for calculating standard error
from math import sqrt 

### 2. Load Word Embeddings
This function is designed to load pre-trained word embeddings from a file (in word2vec format). It reads each word and its corresponding vector, stores them in a matrix, and creates a vocabulary that maps each word to an integer index. It also adds special tokens for padding (`_0_`) and unknown words (`_UNK_`).

In [None]:
# Define a function to load word embeddings from a file
def load_embeddings(filename, max_vocab_size):

    # Initialize a dictionary to store the vocabulary (word -> index)
    vocab={}
    # Initialize a list to store the embedding vectors
    embeddings=[]
    # Open the specified file to read the embeddings (the GloVe files are UTF-8 encoded)
    with open(filename, encoding="utf-8") as file:
        
        # Read the first line, which contains the vocab size and embedding dimension
        cols=file.readline().split(" ")
        # Convert the number of words to an integer
        num_words=int(cols[0])
        # Convert the embedding dimension to an integer
        size=int(cols[1])
        # Append a vector of zeros for the padding token (at index 0)
        embeddings.append(np.zeros(size))
        # Append a vector of zeros for the unknown word token (at index 1)
        embeddings.append(np.zeros(size))
        # Add the padding token to our vocabulary with index 0
        vocab["_0_"]=0
        # Add the unknown word token to our vocabulary with index 1
        vocab["_UNK_"]=1
        
        # Iterate over each line in the embeddings file with an index
        for idx,line in enumerate(file):

            # If we have reached our desired maximum vocabulary size, stop reading
            if idx+2 >= max_vocab_size:
                break

            # Strip whitespace and split the line into word and vector components
            cols=line.rstrip().split(" ")
            # Convert the vector components (read as strings) into a NumPy array of floats
            val=np.array(cols[1:], dtype="float32")
            # The first column is the word itself
            word=cols[0]
            
            # Add the word's vector to our list of embeddings
            embeddings.append(val)
            # Add the word to our vocabulary, mapping it to its new index (idx + 2)
            vocab[word]=idx+2

    # Convert the list of embeddings to a NumPy array and return it along with the vocab and embedding size
    return np.array(embeddings), vocab, size
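
To make the file layout and the index offsets concrete, here is a minimal sketch that parses a made-up three-word file held in memory. The words, vectors, and `toy_` names are illustrative assumptions, not real GloVe data; the parsing steps mirror `load_embeddings` without the vocabulary cap.

```python
import io
import numpy as np

# Hypothetical file in word2vec text format: a "num_words dim" header,
# then one "word v1 v2 ..." line per word.
toy_file = io.StringIO("3 2\nthe 0.1 0.2\nmovie 0.3 0.4\ngood 0.5 0.6\n")

# Header line: vocabulary size and embedding dimension
num_words, size = (int(c) for c in toy_file.readline().split(" "))

# Indices 0 and 1 are reserved for padding and unknown words
toy_vocab = {"_0_": 0, "_UNK_": 1}
rows = [np.zeros(size), np.zeros(size)]

for idx, line in enumerate(toy_file):
    cols = line.rstrip().split(" ")
    toy_vocab[cols[0]] = idx + 2          # real words start at index 2
    rows.append(np.array(cols[1:], dtype="float32"))

toy_embeddings = np.array(rows)
print(toy_embeddings.shape)   # (5, 2): 2 special rows + 3 words
print(toy_vocab)
```

Row `i` of the matrix is the vector for the word whose vocabulary index is `i`, which is what the `Embedding` layer relies on later.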

### 3. Read Text Data
This function reads a tab-separated (TSV) file. It assumes each line contains a label followed by a tab, followed by pre-tokenized text. It parses these lines and returns two lists: one containing the tokenized texts (`X`) and another containing the corresponding labels (`Y`).

In [None]:
# Define a function to read the training/dev/test data
# (the vocab argument is accepted but not used inside the function)
def read_data(filename, vocab):
    # Initialize a list to hold the documents (features)
    X=[]
    # Initialize a list to hold the labels
    Y=[]
    # Open the data file, specifying UTF-8 encoding
    with open(filename, encoding="utf-8") as file:
        # Iterate over each line in the file
        for line in file:
            # Strip whitespace and split the line by the tab character
            cols=line.rstrip().split("\t")
            # The first column is the label
            label=cols[0]
            # The second column is the text, which is already tokenized; split it by spaces
            text=cols[1].split(" ")
            # Add the list of tokens to the features list
            X.append(text)
            # Add the label to the labels list
            Y.append(label)
    # Return the lists of texts and labels
    return X, Y
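
As a quick illustration of the expected line layout, the sketch below applies the same split logic to two made-up lines. The labels `pos` and `neg` and the sentences are assumptions for the example; the actual label strings depend on your data files.

```python
# Hypothetical lines in the "label<TAB>tokenized text" layout
toy_lines = [
    "pos\tI really liked this movie .\n",
    "neg\tNot my kind of film .\n",
]

toy_X, toy_Y = [], []
for line in toy_lines:
    # Drop the trailing newline, then separate the label from the text
    cols = line.rstrip().split("\t")
    toy_Y.append(cols[0])
    # The text is already tokenized, so a plain space split recovers the tokens
    toy_X.append(cols[1].split(" "))

print(toy_Y)
print(toy_X[0])
```

Note that a line without a tab would raise an `IndexError` at `cols[1]`, so the input files need to follow this layout exactly.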

### 4. Convert Words to Integer IDs
This function takes the tokenized documents and converts them into a numerical format that the model can understand. It maps each token to its ID from the vocabulary. It also ensures all sequences have the same length by truncating longer ones and padding shorter ones with a '0' ID.

In [None]:
# Define a function to convert documents of words into sequences of integer IDs
def get_word_ids(docs, vocab, max_length=200):
    
    # Initialize a list to store the ID sequences for all documents
    doc_ids=[]
    
    # Iterate through each document (which is a list of tokens)
    for doc in docs:
        # Initialize a list to store word IDs for the current document
        wids=[]
        # Iterate through each token in the document, up to the maximum length
        for token in doc[:max_length]:
            # Look up the lowercased token in the vocabulary; fall back to the UNK ID (1) if it is missing
            val = vocab.get(token.lower(), 1)
            # Append the corresponding ID to the list for the current document
            wids.append(val)
        
        # Pad the current document's ID sequence with 0s to make it 'max_length' long
        for i in range(len(wids),max_length):
            wids.append(0)

        # Add the final padded ID sequence for the document to the main list
        doc_ids.append(wids)

    # Convert the list of lists into a NumPy array and return it
    return np.array(doc_ids)
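
The lookup, truncation, and padding rules can be seen on a toy vocabulary. This is a compact sketch of the same logic for a single document; `toy_vocab` and the helper name `to_ids` are hypothetical and exist only for this example.

```python
# Hypothetical vocabulary; indices 0 and 1 are the padding and UNK IDs
toy_vocab = {"_0_": 0, "_UNK_": 1, "i": 2, "like": 3, "this": 4, "movie": 5}

def to_ids(tokens, vocab, max_length):
    # Truncate to max_length, lowercase each token, map unknown words to 1
    ids = [vocab.get(token.lower(), 1) for token in tokens[:max_length]]
    # Right-pad with 0 until the sequence is exactly max_length long
    return ids + [0] * (max_length - len(ids))

# "film" is not in the vocabulary, so it becomes 1; two pads are appended
print(to_ids(["I", "like", "this", "film"], toy_vocab, 6))   # [2, 3, 4, 1, 0, 0]
# A document longer than max_length is cut off
print(to_ids(["I", "like", "this", "movie"], toy_vocab, 3))  # [2, 3, 4]
```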

If you haven't downloaded the GloVe vectors, do so first. The top 50K words in the "Common Crawl (42B)" vectors (300-dimensional) can be found here: [glove.42B.300d.50K.txt](https://drive.google.com/file/d/1n1jt0UIdI3CD26cY1EIeks39XH5S8O8M/view?usp=sharing); download the file and place it in your `data` directory.

### 5. Convert GloVe to Word2Vec Format
The `load_embeddings` function is written to parse the word2vec file format (which has a header line with vocabulary size and dimension). The original GloVe file does not have this header. This cell uses a utility from the `gensim` library to convert the GloVe file into the word2vec format.

In [None]:
# Import the conversion utility from gensim
from gensim.scripts.glove2word2vec import glove2word2vec

# Specify the path to the input GloVe file
glove_file="../data/glove.42B.300d.50K.txt"
# Specify the path for the output file in word2vec format
glove_in_w2v_format="../data/glove.42B.300d.50K.w2v.txt"
# Run the conversion function; the return value is not needed, so it's assigned to _
_ = glove2word2vec(glove_file, glove_in_w2v_format)
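
Conceptually, the only difference between the two formats is the header line. The sketch below shows that idea on two made-up GloVe-style lines; it illustrates the file-format difference and is not gensim's actual implementation.

```python
# Hypothetical GloVe-style content: one "word v1 v2 ..." line per word, no header
glove_lines = ["the 0.1 0.2 0.3", "movie 0.4 0.5 0.6"]

# The word2vec text format starts with "<number of words> <vector dimension>"
dim = len(glove_lines[0].split(" ")) - 1
header = "%d %d" % (len(glove_lines), dim)

w2v_lines = [header] + glove_lines
print("\n".join(w2v_lines))
```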

### 6. Execute Embedding Loading
This cell calls the `load_embeddings` function to actually load the converted GloVe vectors into memory. This populates the `embeddings` matrix, the `vocab` dictionary, and the `embedding_size` variable for later use.

In [None]:
# Call the function to load the embeddings from the converted file, limiting the vocabulary to 50,000 words
embeddings, vocab, embedding_size=load_embeddings("../data/glove.42B.300d.50K.w2v.txt", 50000)

### 7. Set Data Directory
This cell specifies the directory where the training, development, and test data files (`train.tsv`, `dev.tsv`, `test.tsv`) are located.

In [None]:
# Change this to the directory with your data (from the CheckData_TODO.ipynb exercise).  
# The directory should contain train.tsv, dev.tsv and test.tsv
# Set the directory path for the dataset
directory="../data/lmrd"

### 8. Read All Datasets
Using the `read_data` function defined earlier, this cell reads the text and labels from the `train.tsv`, `dev.tsv`, and `test.tsv` files into memory.

In [None]:
# Read the training data from the train.tsv file
trainText, trainY=read_data("%s/train.tsv" % directory, vocab)
# Read the development (validation) data from the dev.tsv file
devText, devY=read_data("%s/dev.tsv" % directory, vocab)
# Read the test data from the test.tsv file
testText, testY=read_data("%s/test.tsv" % directory, vocab)

### 9. Vectorize All Text Data
This cell converts all the raw text data (for train, dev, and test sets) into padded sequences of integer IDs using the `get_word_ids` function. This is the final preprocessing step for the input features (`X`).

In [None]:
# Convert the training text into padded sequences of word IDs
trainX = get_word_ids(trainText, vocab, max_length=200)
# Convert the development text into padded sequences of word IDs
devX = get_word_ids(devText, vocab, max_length=200)
# Convert the test text into padded sequences of word IDs
testX = get_word_ids(testText, vocab, max_length=200)

### 10. Encode Labels
The model's output layer requires numerical labels. This cell uses `scikit-learn`'s `LabelEncoder` to convert the string labels into integers. The encoder numbers the classes in sorted order, so, for example, "negative" becomes 0 and "positive" becomes 1.

In [None]:
# Initialize a LabelEncoder object from scikit-learn
le = preprocessing.LabelEncoder()
# Fit the encoder on the training labels to learn the classes
le.fit(trainY)
# Transform the training labels into integers and convert to a NumPy array
Y_train=np.array(le.transform(trainY))
# Transform the development labels into integers and convert to a NumPy array
Y_dev=np.array(le.transform(devY))
# Transform the test labels into integers and convert to a NumPy array
Y_test=np.array(le.transform(testY))
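
For intuition about the resulting integers, here is a NumPy-only sketch of the mapping: distinct labels are sorted and each label is replaced by its position in that sorted list. The label strings are made up for the example, and `np.unique` is used here as a stand-in for the encoder rather than as its implementation.

```python
import numpy as np

# Hypothetical string labels
toy_labels = ["pos", "neg", "neg", "pos"]

# classes: the sorted distinct labels; encoded: index of each label in classes
classes, encoded = np.unique(toy_labels, return_inverse=True)

print(classes.tolist())   # ['neg', 'pos']
print(encoded.tolist())   # [1, 0, 0, 1]
```

Because the encoder is fit on the training labels only, a label that appears in the dev or test set but not in training would cause `transform` to raise an error.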

### 11. Custom Attention Layer
This cell defines a custom Keras layer for the attention mechanism. 
* **build()**: Creates the trainable weight `kernel` used to calculate attention scores.
* **call()**: Implements the forward pass. It computes an importance score for each input timestep, applies a softmax to get attention weights, and crucially, uses the `mask` to ignore padded parts of the sequence.
* **compute_mask()**: Indicates this layer consumes the mask but doesn't produce a new one.

In [None]:
# Define a custom Keras Layer for attention that handles masking
class AttentionLayerMasking(Layer):

    # The initializer for the layer
    def __init__(self, output_dim, **kwargs):
        # Store the output dimension
        self.output_dim = output_dim
        # Call the parent class's initializer
        super(AttentionLayerMasking, self).__init__(**kwargs)


    # This method creates the layer's trainable weights
    def build(self, input_shape):
        # Get the dimension of the input embeddings (last dimension of the input shape)
        input_embedding_dim=input_shape[-1]
        
        # Add a trainable weight (kernel) to the layer
        self.kernel = self.add_weight(name='kernel', 
                            # The shape is (input_dim, 1) to compute a dot product
                            shape=(input_embedding_dim,1),
                            # Initialize weights uniformly
                            initializer='uniform',
                            # This weight should be trainable
                            trainable=True)
        # Call the parent class's build method
        super(AttentionLayerMasking, self).build(input_shape)

    # This method specifies that the layer does not propagate a mask
    def compute_mask(self, input, input_mask=None):
        # Return None, as the output of this layer is the attention weights, not a sequence
        return None

    # This method contains the layer's logic (the forward pass)
    def call(self, x, mask=None):
        
        # Calculate the dot product of the input tensor 'x' and the learned kernel
        # This computes an unnormalized importance score for each time step
        x=K.dot(x, self.kernel)
        # Apply an exponentiation, a step in the softmax function
        x=K.exp(x)
        
        # If a mask is provided (from the Embedding layer), apply it
        if mask is not None:
            # Cast the boolean mask to the backend's float type (e.g., float32)
            mask = K.cast(mask, K.floatx())
            # Add a new dimension to the mask to make it compatible for element-wise multiplication
            mask = K.expand_dims(mask, axis=-1)
            # Multiply the scores by the mask to zero out scores for padded time steps
            x = x * mask
        
        # Normalize the scores by dividing by the sum over the time-step axis (axis 1)
        # This completes the softmax operation, yielding attention weights that sum to 1
        x /= K.sum(x, axis=1, keepdims=True)
        # Remove the last dimension, which is of size 1
        x=K.squeeze(x, axis=2)

        # Return the computed attention weights
        return x

    # This method computes the output shape of the layer
    def compute_output_shape(self, input_shape):
        # The output shape is (batch_size, num_timesteps)
        return (input_shape[0], input_shape[1])
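
The arithmetic in `call()` can be checked outside Keras. The following is a NumPy sketch of the same steps (score, exponentiate, mask, normalize) on random inputs; the shapes, the random seed, and the function name `masked_attention_weights` are assumptions made for the illustration.

```python
import numpy as np

def masked_attention_weights(x, kernel, mask):
    # x: (batch, timesteps, dim), kernel: (dim, 1), mask: (batch, timesteps)
    scores = np.exp(x @ kernel)                  # (batch, timesteps, 1)
    scores = scores * mask[..., None]            # zero out padded timesteps
    weights = scores / scores.sum(axis=1, keepdims=True)
    return weights.squeeze(axis=2)               # (batch, timesteps)

rng = np.random.default_rng(0)
x = rng.normal(size=(2, 4, 3))
kernel = rng.uniform(-0.05, 0.05, size=(3, 1))
# 1.0 marks a real token, 0.0 marks padding
mask = np.array([[1, 1, 1, 0],
                 [1, 1, 0, 0]], dtype=float)

weights = masked_attention_weights(x, kernel, mask)
print(weights.shape)          # (2, 4)
print(weights.sum(axis=1))    # each row sums to 1
print(weights[0, 3], weights[1, 2], weights[1, 3])  # padded positions get weight 0
```

Masking before normalizing is what keeps the padded positions from absorbing attention mass: the denominator only sums over real tokens.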

Q1: Implement a BiLSTM with attention. Feel free to base your code on the models in Attention.ipynb and LSTM.ipynb.

### 12. BiLSTM with Attention Model Architecture
This function constructs the Keras model.
1.  **Input Layer**: Defines the expected input shape.
2.  **Embedding Layer**: Maps integer IDs to GloVe vectors. `mask_zero=True` is critical; it tells subsequent layers to ignore the padded '0's.
3.  **Bidirectional LSTM**: Processes the sequence of vectors. `return_sequences=True` is necessary to get the output of every timestep for the attention mechanism.
4.  **Dense 'tanh' Layer**: Transforms the BiLSTM outputs into a new representation before calculating attention. The attention scores are then computed from this learned nonlinear projection rather than from the raw BiLSTM outputs.
5.  **Attention Layer**: Uses our custom `AttentionLayerMasking` to calculate attention weights.
6.  **Lambda Layer**: Calculates the weighted sum of the BiLSTM outputs using the attention weights, producing a single vector representation for the entire document.
7.  **Output Layer**: A final `Dense` layer with a sigmoid activation function for binary classification.

In [None]:
# Define a function that builds and returns the BiLSTM with attention model
def get_bilstm_with_attention_masking(embeddings, lstm_size=25, dropout_rate=0.25):

    # Get the vocabulary size and embedding dimension from the shape of the embeddings matrix
    vocab_size, word_embedding_dim=embeddings.shape
    
    # Define the input layer for sequences of integers
    word_sequence_input = Input(shape=(None,), dtype='int32')
    
    # Define the embedding layer
    word_embedding_layer = Embedding(vocab_size,               # The size of the vocabulary
                                    word_embedding_dim,         # The dimension of the embeddings
                                    weights=[embeddings],       # Initialize with our pre-trained embeddings
                                    mask_zero=True,             # Enable masking for padding (value 0)
                                    trainable=False)            # Freeze the embedding weights

    
    # Pass the input sequence through the embedding layer
    embedded_sequences = word_embedding_layer(word_sequence_input)
    # Pass the embedded sequences through a Bidirectional LSTM layer
    # return_sequences=True is essential for attention, as we need the output of every time step
    bilstm_output = Bidirectional(LSTM(lstm_size, return_sequences=True, activation='tanh', dropout=dropout_rate), merge_mode='concat')(embedded_sequences)

    # First, transform each BiLSTM hidden state into a new vector for calculating importance
    attention_key_dim=300
    # A Dense layer with 'tanh' activation is a common way to do this transformation
    attention_input=Dense(attention_key_dim, activation='tanh')(bilstm_output)

    # Next, pass the transformed inputs through our custom attention layer.
    # This returns a normalized attention value a_i for each token i, where sum(a_i) = 1.
    # (output_dim is stored by the layer but never used, so we pass the dimension of its actual input)
    attention_output = AttentionLayerMasking(attention_key_dim, name="attention")(attention_input)
    
    # Now, multiply the attention weights by the original BiLSTM outputs to get a weighted average.
    # The Lambda layer performs a batch dot product between attention weights and BiLSTM outputs.
    document_representation = Lambda(lambda x: K.batch_dot(x[0], x[1], axes=1), name='dot')([attention_output,bilstm_output])

    # Pass the final document representation through a Dense layer for classification
    # A single neuron with a 'sigmoid' activation is used for binary classification
    x=Dense(1, activation="sigmoid")(document_representation)

    # Create the Keras Model, defining the input and output layers
    model = Model(inputs=word_sequence_input, outputs=x)

    # Compile the model with loss function, optimizer, and evaluation metrics
    model.compile(loss='binary_crossentropy', optimizer='adam', metrics=['acc'])
    
    # Return the compiled model
    return model
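
The weighted sum in the `Lambda` layer contracts the timestep axis of the attention weights with the timestep axis of the BiLSTM outputs. As a rough NumPy analogue (an `einsum` standing in for `K.batch_dot(..., axes=1)`, on made-up numbers), the computation looks like this:

```python
import numpy as np

# Hypothetical attention weights: (batch=2, timesteps=3), each row sums to 1
attn = np.array([[0.5, 0.5, 0.0],
                 [1.0, 0.0, 0.0]])

# Hypothetical BiLSTM outputs: (batch=2, timesteps=3, hidden=2)
states = np.arange(12, dtype=float).reshape(2, 3, 2)

# For each document b: doc[b] = sum over t of attn[b, t] * states[b, t]
doc = np.einsum("bt,btd->bd", attn, states)

print(doc.shape)  # (2, 2): one vector per document
print(doc)
```

In the real model the hidden dimension is `2 * lstm_size`, because the forward and backward LSTM outputs are concatenated.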

### 13. Instantiate and Summarize the Model
This cell calls the builder function to create an instance of the model and then prints a summary of its architecture. The summary is useful for verifying the layers, their output shapes, and the number of trainable parameters.

In [None]:
# Create the model using the defined function with an LSTM size of 25 and dropout of 0.25
bilstm_attention_model=get_bilstm_with_attention_masking(embeddings, lstm_size=25, dropout_rate=0.25)
# Print a summary of the model's architecture (summary() prints directly and returns None)
bilstm_attention_model.summary()

### 14. Train the Model
This is where the model training happens. The `fit` method trains the model on the training data (`trainX`, `Y_train`) and validates it on the development set (`devX`, `Y_dev`) after each epoch. The `ModelCheckpoint` callback saves the model to a file whenever the validation loss improves, so we keep the version that performed best on the development set.

In [None]:
# Assign the created model to a new variable for clarity
model=bilstm_attention_model

# Define the filename for saving the best model
modelName="bilstm_attention_model.hdf5"
# Create a ModelCheckpoint callback to monitor validation loss and save only the best model
checkpoint = ModelCheckpoint(modelName, monitor='val_loss', verbose=0, save_best_only=True, mode='min')

# Train the model
model.fit(trainX, Y_train, 
            # Provide the development set for validation after each epoch
            validation_data=(devX, Y_dev),
            # Set the number of training epochs
            epochs=30, 
            # Set the batch size
            batch_size=128,
            # Pass in the checkpoint callback
            callbacks=[checkpoint])
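
With `save_best_only=True` and `mode='min'`, the checkpoint file is overwritten only in epochs where the monitored value reaches a new minimum. The plain-Python sketch below traces that bookkeeping over a made-up sequence of validation losses; it illustrates the rule and is not the callback's actual code.

```python
# Hypothetical validation losses, one per epoch
val_losses = [0.52, 0.41, 0.38, 0.40, 0.39]

best_loss = float("inf")
best_epoch = None
saved_epochs = []

for epoch, loss in enumerate(val_losses, start=1):
    if loss < best_loss:
        # This is the point where the checkpoint file would be overwritten
        best_loss, best_epoch = loss, epoch
        saved_epochs.append(epoch)

print(saved_epochs)            # [1, 2, 3]
print(best_epoch, best_loss)   # 3 0.38
```

The file left on disk after training therefore comes from epoch 3 in this example, even though training continued for two more epochs.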

Q2. What is the accuracy of your model on the test data?  Report the accuracy score with 95% confidence intervals.  Feel free to use the dev data for model selection (e.g., to make hyperparameter choices such as the size of the hidden LSTM state), but be careful not to use the test data for this.  See the Keras [model.predict](https://keras.io/models/model/#predict) documentation to generate predictions for a trained model.

### 15. Confidence Interval Function
This helper function calculates the accuracy of the model's predictions and computes a confidence interval for that accuracy score. This gives a more reliable estimate of the model's performance by providing a range in which the true accuracy likely lies. It uses the normal approximation to the binomial distribution.

In [None]:
# Define a function to calculate binomial confidence intervals for accuracy
def binomial_confidence_intervals(predictions, truth, confidence_level=0.95):
    # Create a list to store whether each prediction was correct (1) or not (0)
    correct=[]
    # Iterate over the predictions and the true labels simultaneously
    for pred, gold in zip(predictions, truth):
        # Append 1 if the prediction matches the true label, otherwise append 0
        correct.append(int(pred==gold))
        
    # Calculate the success rate (accuracy) as the mean of the 'correct' list
    success_rate=np.mean(correct)

    # For a two-tailed interval, find the probability mass in each tail
    tail_area=(1-confidence_level)/2
    # Find the z-score (the number of standard deviations from the mean) that cuts off that tail
    # norm.ppf is the inverse of the cumulative distribution function
    z_alpha=-1*norm.ppf(tail_area)
    
    # Calculate the standard error of the estimated proportion
    # The variance of a single Bernoulli trial is p*(1-p), so the mean of n trials has variance p*(1-p)/n
    standard_error=sqrt((success_rate*(1-success_rate))/len(correct))

    # Calculate the lower bound of the confidence interval
    lower=success_rate-z_alpha*standard_error
    # Calculate the upper bound of the confidence interval
    upper=success_rate+z_alpha*standard_error
    # Print the formatted results
    print("%.3f, %s%% Confidence interval: [%.3f,%.3f]" % (success_rate, confidence_level*100, lower, upper))
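
As a worked example of the interval arithmetic, the sketch below computes the normal-approximation interval for a made-up result: 90% accuracy on 1,000 test items. It uses the standard library's `statistics.NormalDist` in place of `scipy.stats.norm`, and the function name `normal_approx_interval` is hypothetical.

```python
from math import sqrt
from statistics import NormalDist

def normal_approx_interval(accuracy, n, confidence_level=0.95):
    # z-score that leaves (1 - confidence_level) / 2 in each tail
    z = NormalDist().inv_cdf(1 - (1 - confidence_level) / 2)
    # Standard error of a proportion estimated from n Bernoulli trials
    standard_error = sqrt(accuracy * (1 - accuracy) / n)
    return accuracy - z * standard_error, accuracy + z * standard_error

lower, upper = normal_approx_interval(0.90, 1000)
print("[%.3f, %.3f]" % (lower, upper))  # [0.881, 0.919]
```

The interval width shrinks with the square root of the test-set size, so four times as many test items roughly halves the margin.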

### 16. Load Best Model Weights
Before evaluating on the test set, this cell loads the weights from the file saved by `ModelCheckpoint`. This ensures that we are using the model that performed best on the validation set during training, not necessarily the model from the final epoch, which might be overfitted.

In [None]:
# Re-assign the model object (this is good practice in notebooks to ensure the right object is used)
model=bilstm_attention_model

# Load the weights of the best model that were saved during training
model.load_weights("bilstm_attention_model.hdf5")

### 17. Make Predictions on Test Data
The trained model is now used to make predictions on the unseen test data. The `predict` method outputs raw probabilities (from the sigmoid function), which are then converted into binary class labels (0 or 1) by applying a 0.5 decision threshold.

In [None]:
# Use the trained model to generate predictions on the test set
predictions = model.predict(testX, batch_size=128)
# Convert the output probabilities to binary labels (0/1) using a 0.5 threshold,
# and flatten the (n, 1) output to shape (n,) so it lines up with Y_test
binarized_predictions=(predictions > .5).astype(int).flatten()
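
To see what the thresholding does to the model output, here is a small NumPy sketch on made-up probabilities. It assumes the output has one column per example, as a single sigmoid unit produces.

```python
import numpy as np

# Hypothetical sigmoid outputs with shape (n, 1)
toy_probs = np.array([[0.91], [0.12], [0.50], [0.73]])

# Strictly greater than 0.5 maps to class 1; ravel() gives a flat (n,) vector
toy_labels = (toy_probs > 0.5).astype(int).ravel()

print(toy_labels.tolist())  # [1, 0, 0, 1]
```

A probability of exactly 0.5 falls on the negative side because the comparison is strict.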

### 18. Calculate and Report Final Accuracy
This final code cell calls the `binomial_confidence_intervals` function to compute and print the model's accuracy on the test set, complete with the 95% confidence interval.

In [None]:
# Calculate and print the accuracy with a 95% confidence interval on the test set
binomial_confidence_intervals(binarized_predictions, Y_test, confidence_level=0.95)

Q3. Take the sentence "I do not like this movie." How is representing this sentence by using attention over the individual word embeddings different from representing it with attention over the output of each time step in a bidirectional LSTM?  What information does the LSTM output encode that individual word embeddings don't have access to?

A3: Word embeddings encode information about the word *type* but not about its specific use in context; the output of an LSTM at time t encodes information about the context a word *token* was used in -- for a single forward LSTM, the context of the sequence from word 1 through word t; for a BiLSTM, the context of the entire sequence.
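
The type/token distinction can be demonstrated with a toy example. The sketch below uses a plain tanh recurrent network with random weights as a simplified stand-in for an LSTM (an assumption made to keep the example short); the embeddings, dimensions, and seed are likewise made up.

```python
import numpy as np

rng = np.random.default_rng(0)
words = ["i", "do", "not", "like", "this", "movie"]

# Static embeddings: one fixed vector per word type
emb = {w: rng.normal(size=4) for w in words}

# Random weights for a simple forward recurrent network (not a real LSTM)
W = 0.5 * rng.normal(size=(3, 4))
U = 0.5 * rng.normal(size=(3, 3))

def forward_states(tokens):
    # h_t depends on the current word and on everything read before it
    h = np.zeros(3)
    states = []
    for token in tokens:
        h = np.tanh(W @ emb[token] + U @ h)
        states.append(h)
    return states

s1 = "i do not like this movie".split()
s2 = "i like this movie".split()

# The embedding of "like" is identical in both sentences
print(np.array_equal(emb[s1[3]], emb[s2[1]]))   # True

# The recurrent state at "like" differs, because "do not" came before it in s1
h1_like = forward_states(s1)[s1.index("like")]
h2_like = forward_states(s2)[s2.index("like")]
print(np.allclose(h1_like, h2_like))            # False
```

Attention over the raw embeddings would therefore weight the same "like" vector in both sentences, while attention over recurrent outputs sees a representation of "like" that already reflects the preceding negation.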