### **Cell 1: Introduction**
This is the initial markdown cell that introduces the notebook's purpose and gives a necessary setup instruction.

---
This notebook explores convolutional neural networks for text, using the Keras `Sequential` and `Functional` interfaces.
<br>

Before getting started, install the pydot library:

```sh
conda install pydot=1.3.0
```

### **Cell 2: Importing Libraries**
This cell imports all the necessary libraries and modules. These include Keras for building the neural network, NumPy for numerical operations, and scikit-learn for data preprocessing. Utilities for visualizing the model are also imported.

In [None]:
# Import the main Keras library for building neural networks
import keras
# Import NumPy for efficient numerical operations, especially with arrays
import numpy as np
# Import the preprocessing module from scikit-learn for tasks like label encoding
from sklearn import preprocessing
# Import specific layers needed for the CNN model from Keras
from keras.layers import Dense, Input, Embedding, GlobalMaxPooling1D, Conv1D, Concatenate, Dropout
# Import the two main Keras model-building APIs: Sequential and Model (Functional)
from keras.models import Model, Sequential
# Import CountVectorizer for text feature extraction (though not used in the final model)
from sklearn.feature_extraction.text import CountVectorizer
# Import SVG to display model visualizations directly in the notebook
from IPython.display import SVG
# Import a utility to convert a Keras model to a dot format for visualization
from keras.utils.vis_utils import model_to_dot

### **Cell 3: Function to Load Word Embeddings**
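This cell defines the `load_embeddings` function, which reads a file of pre-trained word embeddings (like GloVe or Word2Vec) into memory. The file is expected to start with a header line giving the vocabulary size and embedding dimension, followed by one word and its vector per line. The function builds a vocabulary dictionary mapping words to integer IDs and an embedding matrix whose row index is the word's ID. IDs 0 and 1 are reserved for the padding token (`_0_`) and the unknown-word token (`_UNK_`), both initialized to zero vectors, and loading stops once `max_vocab_size` entries have been reached.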

In [None]:
# Define a function to load pre-trained word embeddings from a file
def load_embeddings(filename, max_vocab_size):

    # Initialize an empty dictionary to map words to their integer IDs
    vocab={}
    # Initialize an empty list to store the embedding vectors
    embeddings=[]
    # Open and read the specified embeddings file (UTF-8, since the vocabulary may contain non-ASCII words)
    with open(filename, encoding="utf-8") as file:
        
        # Read the first line, which contains the vocabulary size and embedding dimension
        cols=file.readline().split(" ")
        # Extract the total number of words in the file
        num_words=int(cols[0])
        # Extract the size (dimension) of each embedding vector
        size=int(cols[1])
        # Append a zero vector for the padding token (ID 0)
        embeddings.append(np.zeros(size))
        # Append another zero vector for the "Unknown" (UNK) token (ID 1)
        embeddings.append(np.zeros(size))
        # Add the padding token to our vocabulary with ID 0
        vocab["_0_"]=0
        # Add the UNK token to our vocabulary with ID 1
        vocab["_UNK_"]=1
        
        # Iterate through each line of the embeddings file with an index
        for idx,line in enumerate(file):

            # Stop reading if we have reached the desired maximum vocabulary size
            # We use idx+2 to account for the padding and UNK tokens
            if idx+2 >= max_vocab_size:
                break

            # Strip whitespace and split the line into the word and its vector parts
            cols=line.rstrip().split(" ")
            # Convert the vector parts (from the second element onwards) into a float NumPy array
            # Without an explicit dtype the values would stay as strings
            val=np.array(cols[1:], dtype="float32")
            # The first element is the word itself
            word=cols[0]
            
            # Add the word's vector to our list of embeddings
            embeddings.append(val)
            # Add the word to our vocabulary, mapping it to its new ID (index + 2)
            vocab[word]=idx+2

    # Convert the list of embeddings to a NumPy array and return it along with the vocabulary and embedding size
    return np.array(embeddings), vocab, size
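
To make the expected file layout concrete, the sketch below runs the same parsing logic on a tiny in-memory file. The three-word vocabulary, the 2-dimensional vectors, and the `toy_file` and `load_toy_embeddings` names are invented for illustration; the real GloVe file is assumed to follow the same layout (a header line, then one word and its vector per line).

```python
import io
import numpy as np

# Hypothetical toy file: "<num_words> <dim>" header,
# then one "<word> <v1> <v2> ..." line per word.
toy_file = io.StringIO("3 2\nthe 0.1 0.2\nmovie 0.3 0.4\ngreat 0.5 0.6\n")

def load_toy_embeddings(file, max_vocab_size):
    num_words, size = (int(x) for x in file.readline().split(" "))
    # IDs 0 and 1 are reserved for padding and unknown words (zero vectors)
    vocab = {"_0_": 0, "_UNK_": 1}
    rows = [np.zeros(size), np.zeros(size)]
    for idx, line in enumerate(file):
        # The cap counts the two special tokens as well
        if idx + 2 >= max_vocab_size:
            break
        cols = line.rstrip().split(" ")
        vocab[cols[0]] = idx + 2
        rows.append(np.array(cols[1:], dtype="float32"))
    return np.array(rows), vocab, size

toy_embeddings, toy_vocab, toy_size = load_toy_embeddings(toy_file, max_vocab_size=4)
print(toy_vocab)             # "great" is dropped: the cap of 4 includes the two special tokens
print(toy_embeddings.shape)  # one row per vocabulary entry
```

Note that a word's ID doubles as its row index in the matrix, which is what lets the embedding layer look vectors up directly by ID.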

### **Cell 4: Function to Read Data**
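This cell defines the `read_data` function, which reads a tab-separated values (TSV) file containing text data and corresponding labels. It parses each line, separating the label (first column) from the pre-tokenized text (second column), and returns the token lists and the labels as two separate lists. The `vocab` argument is accepted but not used inside the function.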

In [None]:
# Define a function to read text data from a labeled file
def read_data(filename, vocab):
    # Initialize an empty list to store the text sequences (features)
    X=[]
    # Initialize an empty list to store the labels
    Y=[]
    # Open the file, specifying UTF-8 encoding to handle a wide range of characters
    with open(filename, encoding="utf-8") as file:
        # Iterate through each line in the file
        for line in file:
            # Strip trailing whitespace and split the line by the tab character
            cols=line.rstrip().split("\t")
            # The first column is the label
            label=cols[0]
            # The second column is the text, which is assumed to be already tokenized (words separated by spaces)
            text=cols[1].split(" ")
            # Add the list of tokens to the features list
            X.append(text)
            # Add the label to the labels list
            Y.append(label)
    # Return the lists of text sequences and labels
    return X, Y
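
The expected input is one example per line, with the label and the space-tokenized text separated by a tab. A minimal sketch of that parsing step, using two invented lines (the `pos`/`neg` label strings and the review text are assumptions, not taken from the dataset):

```python
import io

# Hypothetical TSV content: "<label>\t<space-separated tokens>"
toy_tsv = io.StringIO("pos\tA great movie .\nneg\tToo long and dull .\n")

toy_X, toy_Y = [], []
for line in toy_tsv:
    # Split into label and text, then split the text into tokens
    label, text = line.rstrip().split("\t")
    toy_X.append(text.split(" "))
    toy_Y.append(label)

print(toy_X[0])  # ['A', 'great', 'movie', '.']
print(toy_Y)     # ['pos', 'neg']
```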

### **Cell 5: Function to Convert Text to Word IDs**
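The `get_word_ids` function takes documents (as lists of tokens) and converts them into sequences of numerical IDs based on the provided vocabulary. Tokens are lowercased before lookup, and tokens missing from the vocabulary map to the unknown-word ID `1`. All sequences end up with the same length: longer ones are truncated to `max_length`, and shorter ones are padded at the end with the padding ID `0`.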

In [None]:
# Define a function to convert documents (lists of tokens) into padded sequences of integer IDs
def get_word_ids(docs, vocab, max_length=1000):
    
    # Initialize a list to hold all the processed documents (as ID sequences)
    doc_ids=[]
    
    # Iterate through each document in the input list
    for doc in docs:
        # Initialize a list to store word IDs for the current document
        wids=[]

        # Iterate through each token in the document, but only up to the specified max_length
        for token in doc[:max_length]:
            # Look up the lowercase token in the vocabulary; if not found, use the UNK ID (1)
            val = vocab[token.lower()] if token.lower() in vocab else 1
            # Append the corresponding ID to the list for the current document
            wids.append(val)
        
        # After processing tokens, pad the sequence to ensure it reaches max_length
        # This loop runs from the current length of wids up to max_length
        for i in range(len(wids),max_length):
            # Append the padding ID (0) to the end of the list
            wids.append(0)

        # Add the final padded sequence of word IDs to the main list
        doc_ids.append(wids)

    # Convert the list of lists into a 2D NumPy array and return it
    return np.array(doc_ids)
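
The effect of truncation, padding, and the unknown-word fallback is easiest to see on a toy input. The sketch below is a compact equivalent of the loop above; the vocabulary, the documents, and the `to_ids` helper name are invented for illustration.

```python
import numpy as np

# Hypothetical vocabulary following the same convention: 0 = padding, 1 = unknown
toy_vocab = {"_0_": 0, "_UNK_": 1, "a": 2, "great": 3, "movie": 4}

def to_ids(docs, vocab, max_length):
    rows = []
    for doc in docs:
        # Truncate, lowercase, and fall back to the UNK ID (1) for unseen words
        ids = [vocab.get(tok.lower(), 1) for tok in doc[:max_length]]
        # Right-pad with the padding ID (0) up to max_length
        ids += [0] * (max_length - len(ids))
        rows.append(ids)
    return np.array(rows)

toy_docs = [["A", "great", "movie"], ["A", "truly", "great", "movie", "indeed"]]
toy_ids = to_ids(toy_docs, toy_vocab, max_length=4)
print(toy_ids)
# [[2 3 4 0]
#  [2 1 3 4]]
```

The first document is padded with one `0`; the second is cut to four tokens, and the unseen word "truly" becomes `1`.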

### **Cell 6: Loading the Pre-trained Embeddings**
This cell executes the `load_embeddings` function. It specifies the path to the GloVe word embeddings file and sets a vocabulary limit of 100,000 words. The returned embedding matrix, vocabulary dictionary, and embedding dimension are stored in their respective variables.

In [None]:
# Call the load_embeddings function to load GloVe embeddings
# Use a pre-trained file containing 300-dimensional vectors for 50K words
# Cap the vocabulary at 100,000 entries (including the two special tokens); a 50K-word file should fit within this cap
embeddings, vocab, embedding_size=load_embeddings("../data/glove.42B.300d.50K.w2v.txt", 100000)

### **Cell 7: Setting the Data Directory**
A simple variable assignment to hold the path to the dataset directory. This makes it easy to change the data source location in one place.

In [None]:
# Change this to the directory with your data (from the CheckData_TODO.ipynb exercise).  
# The directory should contain train.tsv, dev.tsv and test.tsv
# Set a variable to store the path to the data directory for easy access
directory="../data/lmrd"

### **Cell 8: Reading the Training and Development Data**
Here, the `read_data` function is called to load the training and development (validation) sets from their respective TSV files. The text and labels for each set are stored in separate variables.

In [None]:
# Read the training data from the 'train.tsv' file within the specified directory
trainText, trainY=read_data("%s/train.tsv" % directory, vocab)
# Read the development (validation) data from the 'dev.tsv' file
devText, devY=read_data("%s/dev.tsv" % directory, vocab)

### **Cell 9: Preparing the Datasets for the Model**
This cell uses the `get_word_ids` function to convert the raw text of the training and development sets into padded numerical sequences. The `max_length` is set to 200, meaning each review will be represented by a vector of 200 integers.

In [None]:
# Convert the training text documents into padded sequences of word IDs, with a max length of 200
trainX = get_word_ids(trainText, vocab, max_length=200)
# Convert the development text documents into padded sequences of word IDs, also with a max length of 200
devX = get_word_ids(devText, vocab, max_length=200)

### **Cell 10: Encoding the Labels**
The text labels (e.g., 'positive', 'negative') need to be converted into numbers (e.g., 1, 0) for the model to process. This cell uses scikit-learn's `LabelEncoder` to perform this transformation for both the training and development labels.

In [None]:
# Initialize a LabelEncoder object from scikit-learn to convert string labels to integers
le = preprocessing.LabelEncoder()
# Fit the encoder on the training labels to learn the mapping (e.g., 'pos' -> 1, 'neg' -> 0)
le.fit(trainY)
# Transform the training labels into their integer representations and convert to a NumPy array
Y_train=np.array(le.transform(trainY))
# Transform the development labels using the same learned mapping and convert to a NumPy array
Y_dev=np.array(le.transform(devY))
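
`LabelEncoder` assigns integer codes in sorted order of the distinct labels, so the mapping does not depend on which label appears first in the file. The NumPy sketch below reproduces that mapping without scikit-learn, assuming the labels are the strings `pos` and `neg` (the actual label strings in `train.tsv` are not shown in this notebook).

```python
import numpy as np

# Hypothetical labels; the real ones come from train.tsv
toy_labels = ["pos", "neg", "neg", "pos"]

# np.unique returns the sorted distinct labels and, with return_inverse=True,
# the index of each original label in that sorted array
classes, codes = np.unique(toy_labels, return_inverse=True)
print(classes.tolist())  # ['neg', 'pos']
print(codes.tolist())    # [1, 0, 0, 1]
```

With scikit-learn, `le.classes_` holds the same sorted array after `le.fit(trainY)`, which is a quick way to confirm which label was assigned to 1.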

### **Cell 11: Defining a CNN Model with the Sequential API**
This cell defines a function `cnn_sequential` that builds a simple CNN model using Keras's `Sequential` API. The model consists of an embedding layer (initialized with pre-trained GloVe vectors), a 1D convolutional layer to detect features (bigrams), a pooling layer to summarize the features, a dropout layer for regularization, and a final dense layer for binary classification.

In [None]:
# Define a function to create a CNN model using the Keras Sequential API
def cnn_sequential(embeddings, vocab_size, word_embedding_dim):
    # Initialize a Sequential model, which is a linear stack of layers
    model = Sequential()
    # Add the Embedding layer. It maps word IDs to dense vectors.
    # 'weights' initializes it with our pre-trained GloVe embeddings.
    # 'trainable=False' freezes the embeddings so they are not updated during training.
    model.add(Embedding(input_dim=vocab_size, output_dim=word_embedding_dim, weights=[embeddings], trainable=False))
    # Add a 1D convolutional layer. It acts as a feature detector sliding over the sequence.
    # 'filters=50': learns 50 different features.
    # 'kernel_size=2': looks at 2 words (bigrams) at a time.
    # 'activation="tanh"': applies the tanh activation function.
    model.add(Conv1D(filters=50, kernel_size=2, strides=1, padding="same", activation="tanh", name="CNN_bigram"))
    # Add a global max pooling layer. It takes the maximum value from each of the 50 feature maps.
    # This distills the most important feature detected in the entire sequence.
    model.add(GlobalMaxPooling1D())
    # Add a Dropout layer. It randomly sets 20% of its input units to 0 during training to prevent overfitting.
    model.add(Dropout(0.2))
    # Add the final output layer. A single neuron with a sigmoid activation is used for binary classification.
    model.add(Dense(1, activation='sigmoid'))
    # Compile the model, configuring it for training.
    # 'loss='binary_crossentropy'': The loss function for a two-class problem.
    # 'optimizer='adam'': An efficient optimization algorithm.
    # 'metrics=['acc']': The metric to monitor during training is accuracy.
    model.compile(loss='binary_crossentropy', optimizer='adam', metrics=['acc'])

    # Return the compiled model
    return model
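
To see what the convolution and pooling layers compute, here is a NumPy sketch of the forward pass for a single sequence. It uses random stand-in weights and small dimensions (5 tokens, 4-dimensional embeddings, 3 filters) rather than the model's real ones. It also assumes that same-padding with a kernel of size 2 appends one zero vector at the end of the sequence, which is TensorFlow's convention for even kernel sizes.

```python
import numpy as np

rng = np.random.default_rng(0)
seq_len, emb_dim, n_filters, kernel_size = 5, 4, 3, 2

x = rng.normal(size=(seq_len, emb_dim))                 # embedded tokens
W = rng.normal(size=(kernel_size, emb_dim, n_filters))  # stand-in conv kernel
b = np.zeros(n_filters)                                 # stand-in bias

# "same" padding: one zero vector appended so the output keeps seq_len positions
x_padded = np.vstack([x, np.zeros((kernel_size - 1, emb_dim))])

# Slide a window of kernel_size tokens over the sequence; each filter takes
# a dot product with the window, adds its bias, and applies tanh
feature_maps = np.stack([
    np.tanh(np.tensordot(x_padded[i:i + kernel_size], W, axes=([0, 1], [0, 1])) + b)
    for i in range(seq_len)
])

# Global max pooling keeps the largest activation of each filter
pooled = feature_maps.max(axis=0)

print(feature_maps.shape)  # (5, 3)
print(pooled.shape)        # (3,)
```

Because pooling collapses the sequence axis, the vector passed on to the dense layer has one entry per filter (50 in the model above), regardless of sequence length.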

### **Cell 12: Building and Visualizing the Sequential Model**
Here, the `cnn_sequential` function is called to create an instance of the model. The `.summary()` method is then used to print a textual representation of the model's architecture, and `model_to_dot` is used to create a graphical visualization.

In [None]:
# Create an instance of the sequential CNN model by calling the function defined above
cnn_sequential_model=cnn_sequential(embeddings, len(vocab), embedding_size)
# Print a summary of the model's architecture, including layers, output shapes, and number of parameters
# (summary() prints directly and returns None, so it is not wrapped in print)
cnn_sequential_model.summary()
# Generate a visual graph of the model architecture and display it as an SVG image in the notebook
SVG(model_to_dot(cnn_sequential_model).create(prog='dot', format='svg'))
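
The trainable parameter counts reported by `summary()` can be checked by hand. The arithmetic below assumes 300-dimensional embeddings, as the `300d` in the embeddings file name suggests; the embedding matrix itself is frozen and is not counted here.

```python
embedding_dim = 300   # assumed from the "300d" in the embeddings file name
filters, kernel_size = 50, 2

# Conv1D: one weight per (kernel position, embedding dimension, filter) plus one bias per filter
conv_params = kernel_size * embedding_dim * filters + filters

# Dense(1): one weight per pooled feature plus one bias
dense_params = filters * 1 + 1

print(conv_params)                 # 30050
print(dense_params)                # 51
print(conv_params + dense_params)  # 30101
```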

### **Cell 13: Defining a CNN Model with the Functional API**
This cell defines a function `cnn` that builds a more complex CNN using Keras's `Functional` API. This model has three parallel convolutional layers with kernel sizes 2, 3, and 4, which capture bigram, trigram, and 4-gram features respectively. Each branch is max-pooled separately, and the pooled outputs are then concatenated and passed to the final classification layers.

In [None]:
# Define a function to create a more complex multi-filter CNN using the Keras Functional API
def cnn(embeddings, vocab_size, word_embedding_dim):

    # Define the model's input layer. It expects sequences of integers of any length.
    word_sequence_input = Input(shape=(None,), dtype='int32')
    
    # Define the Embedding layer, initializing it with pre-trained weights and making it non-trainable.
    word_embedding_layer = Embedding(vocab_size,
                                    word_embedding_dim,
                                    weights=[embeddings],
                                    trainable=False)

    
    # Connect the embedding layer to the input. This creates the embedded sequences.
    embedded_sequences = word_embedding_layer(word_sequence_input)
    
    # --- Parallel Convolutional Layers ---
    # Create a Conv1D layer to detect bigram features (kernel_size=2).
    cnn2=Conv1D(filters=50, kernel_size=2, strides=1, padding="same", activation="tanh", name="CNN_bigram")(embedded_sequences)
    # Create a Conv1D layer to detect trigram features (kernel_size=3).
    cnn3=Conv1D(filters=50, kernel_size=3, strides=1, padding="same", activation="tanh", name="CNN_trigram")(embedded_sequences)
    # Create a Conv1D layer to detect 4-gram features (kernel_size=4).
    cnn4=Conv1D(filters=50, kernel_size=4, strides=1, padding="same", activation="tanh", name="CNN_4gram")(embedded_sequences)

    # --- Max Pooling for each convolutional path ---
    # Apply global max pooling to the output of the bigram CNN.
    maxpool2=GlobalMaxPooling1D()(cnn2)
    # Apply global max pooling to the output of the trigram CNN.
    maxpool3=GlobalMaxPooling1D()(cnn3)
    # Apply global max pooling to the output of the 4-gram CNN.
    maxpool4=GlobalMaxPooling1D()(cnn4)

    # Concatenate the results from all three max-pooling layers into a single flat vector.
    x=Concatenate()([maxpool2, maxpool3, maxpool4])

    # Apply a Dropout layer for regularization to the concatenated vector.
    x=Dropout(0.2)(x)
    # Add a fully connected (Dense) layer with 50 neurons.
    # No activation is specified, so this layer is linear.
    x=Dense(50)(x)
    # Add the final output Dense layer with a sigmoid activation for binary classification.
    x=Dense(1, activation="sigmoid")(x)

    # Create the final Model by specifying its input layer and output layer.
    model = Model(inputs=word_sequence_input, outputs=x)

    # Compile the model with the appropriate loss function, optimizer, and metrics.
    model.compile(loss='binary_crossentropy', optimizer='adam', metrics=['acc'])
    
    # Return the compiled functional model.
    return model

### **Cell 14: Building and Visualizing the Functional Model**
As in Cell 12, this cell instantiates the functional CNN model, prints its summary, and generates a visual diagram of its more complex, parallel architecture.

In [None]:
# Create an instance of the functional CNN model by calling the 'cnn' function
cnn_functional_model=cnn(embeddings, len(vocab), embedding_size)
# Print a summary of the functional model's architecture
# (summary() prints directly and returns None, so it is not wrapped in print)
cnn_functional_model.summary()
# Generate and display a visual graph of the functional model
SVG(model_to_dot(cnn_functional_model).create(prog='dot', format='svg'))
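
The trainable parameter counts of the three-branch model can be worked out by hand. Assuming 300-dimensional embeddings (as the `300d` in the embeddings file name suggests), the sketch below computes the width of the concatenated vector and the parameter count of each trainable layer; the frozen embedding matrix is left out.

```python
embedding_dim = 300          # assumed from the "300d" in the embeddings file name
filters = 50
kernel_sizes = [2, 3, 4]     # bigram, trigram, and 4-gram branches

# Each branch: kernel_size * embedding_dim * filters weights plus one bias per filter
conv_params = {k: k * embedding_dim * filters + filters for k in kernel_sizes}

# Global max pooling yields `filters` values per branch; Concatenate joins them
concat_width = filters * len(kernel_sizes)

hidden_params = concat_width * 50 + 50   # Dense(50)
output_params = 50 * 1 + 1               # Dense(1)
total = sum(conv_params.values()) + hidden_params + output_params

print(conv_params)   # {2: 30050, 3: 45050, 4: 60050}
print(concat_width)  # 150
print(total)         # 142751
```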

### **Cell 15: Training the Model**
This cell kicks off the training process. The `model.fit()` method is called on the functional model, providing it with the training data (`trainX`, `Y_train`) and the validation data (`devX`, `Y_dev`). The model will train for 10 epochs, updating its weights in batches of 128 samples.

In [None]:
# Assign the more complex functional model to the 'model' variable for training
model=cnn_functional_model
# Call the fit method to train the model
model.fit(trainX, Y_train, 
            # Provide the development set for validation after each epoch
            validation_data=(devX, Y_dev),
            # Train for 10 complete passes over the entire training dataset
            epochs=10, 
            # Process the data in batches of 128 samples at a time
            batch_size=128)
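
With `batch_size=128`, the number of weight updates per epoch is the training-set size divided by 128, rounded up, because the last batch may be smaller. A quick check, using a hypothetical training set of 25,000 examples (the actual size depends on your `train.tsv`):

```python
import math

n_train = 25_000   # hypothetical; use len(trainX) in the notebook
batch_size = 128
epochs = 10

steps_per_epoch = math.ceil(n_train / batch_size)
last_batch_size = n_train - (steps_per_epoch - 1) * batch_size

print(steps_per_epoch)           # 196
print(last_batch_size)           # 40
print(steps_per_epoch * epochs)  # 1960
```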

### **Cell 16: Empty Cell**
This is an empty code cell, often left for future experiments or code snippets.