### Introduction

This notebook demonstrates how to build and train a Multilayer Perceptron (MLP) for binary text classification. It uses the Keras library, a high-level neural networks API, running on top of TensorFlow. The process involves loading text data, converting it into a numerical format using a bag-of-words model, defining the neural network architecture, and then training and evaluating it.

### Cell 1: Environment Setup

The first markdown cell gives instructions for installing the specific versions of TensorFlow and Keras that the notebook was written for. Pinning these versions reduces the risk of compatibility issues.

This notebook explores the multilayer perceptron for binary text classification, using the Keras library. Before starting, be sure to install the following versions of TensorFlow and Keras (for Python 3.7):

```sh
pip install tensorflow==1.13.0-rc2
pip install keras==2.2.4
```
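After installing, it can help to confirm which versions the notebook kernel actually sees. The following is a minimal sketch, not part of the original notebook: the helper name `installed_version` is hypothetical, and it assumes each package exposes a `__version__` attribute whose string matches the pinned version.

```python
import importlib

# Versions pinned by the install commands above.
EXPECTED = {"tensorflow": "1.13.0-rc2", "keras": "2.2.4"}


def installed_version(package_name):
    """Return the package's __version__ string, or None if it cannot be imported."""
    try:
        module = importlib.import_module(package_name)
    except ImportError:
        return None
    return getattr(module, "__version__", None)


for name, expected in EXPECTED.items():
    found = installed_version(name)
    status = "ok" if found == expected else "check"
    print("%-10s expected %-10s found %-10s [%s]" % (name, expected, found, status))
```

A `check` status only means the reported string differs from the pinned one. It does not by itself mean the notebook will fail.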

### Cell 2: Importing Necessary Libraries

This cell imports all the required modules and functions. We import Keras for building the neural network, NumPy for numerical operations, and scikit-learn for data preprocessing and feature extraction.

In [None]:
# Import the Keras library, the main tool for building our neural network.
import keras
# Import NumPy for efficient numerical operations, especially with arrays.
import numpy as np
# Import the preprocessing module from scikit-learn to encode our labels.
from sklearn import preprocessing
# Import Dense for fully-connected layers and Dropout for regularization.
from keras.layers import Dense, Dropout
# Import Sequential to build our model layer-by-layer.
from keras.models import Sequential
# Import CountVectorizer to convert text data into a matrix of token counts.
from sklearn.feature_extraction.text import CountVectorizer
# Import callbacks to save the best model and stop training early if performance stagnates.
from keras.callbacks import ModelCheckpoint, EarlyStopping, Callback

### Cell 3: Data Reading Function

This cell defines a helper function `read_data` to load the dataset from a tab-separated values (TSV) file. The function reads each line, separates the label from the text, and stores them in two separate lists.

In [None]:
# Define a function to read data from a specified file.
def read_data(filename):
    # Initialize an empty list to store the text data (features).
    X=[]
    # Initialize an empty list to store the labels.
    Y=[]
    # Open the file with UTF-8 encoding to handle a wide range of characters.
    with open(filename, encoding="utf-8") as file:
        # Loop through each line in the file.
        for line in file:
            # Remove trailing whitespace and split the line by the tab character.
            cols=line.rstrip().split("\t")
            # The first column is the label.
            label=cols[0]
            # The second column is the text, assumed to be pre-tokenized
            # (tokens separated by whitespace).
            text=cols[1]
            # Add the text to the features list.
            X.append(text)
            # Add the label to the labels list.
            Y.append(label)

    # Return the lists of features and labels.
    return X, Y
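To make the expected file layout concrete, here is a small sketch that writes two made-up rows to a temporary TSV file and parses them with the same rule as `read_data`. The rows, the `pos`/`neg` label names, and the variable names are assumptions for illustration only. The real files may use different labels.

```python
import os
import tempfile

# Two made-up rows in the assumed "label<TAB>text" layout.
sample_rows = [
    "pos\tthis movie was great",
    "neg\tthis movie was terrible",
]

# Write the rows to a temporary file so the sketch runs without the real dataset.
with tempfile.NamedTemporaryFile("w", suffix=".tsv", delete=False, encoding="utf-8") as tmp:
    tmp.write("\n".join(sample_rows) + "\n")
    sample_path = tmp.name

texts, labels = [], []
with open(sample_path, encoding="utf-8") as file:
    for line in file:
        # Same parsing rule as read_data: strip trailing whitespace, split on tabs.
        cols = line.rstrip().split("\t")
        labels.append(cols[0])
        texts.append(cols[1])

os.remove(sample_path)

print(labels)  # ['pos', 'neg']
print(texts)   # ['this movie was great', 'this movie was terrible']
```

A row with no text after the tab would raise an `IndexError` under this rule, so the input files are assumed to contain both columns on every line.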

### Cell 4: Setting the Data Directory

This cell defines a variable `directory` that holds the path to the data folder. This makes it easy to change the data source location without modifying the rest of the code.

In [None]:
# Define the path to the directory containing the data files.
directory="../data/lmrd"

### Cell 5: Loading and Preprocessing Data

Here, the data is loaded using the `read_data` function. The text data is then converted into a numerical format using `CountVectorizer`, which creates a binary bag-of-words representation. Finally, the string labels (e.g., "positive", "negative") are converted into integers (e.g., 1, 0) using `LabelEncoder`.

In [None]:
# Load the training, development (validation), and test datasets using the function defined above.
trainX, trainY=read_data("%s/train.tsv" % directory)
devX, devY=read_data("%s/dev.tsv" % directory)
testX, testY=read_data("%s/test.tsv" % directory)

# Initialize the CountVectorizer.
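# max_features=10000: Keep only the 10,000 most frequent tokens in the training data.
# analyzer=str.split: Split each text into tokens on whitespace.
# lowercase=True, strip_accents=None: These settings have no effect here. When analyzer
#   is a callable, scikit-learn passes the raw text straight to it and skips its own
#   preprocessing, so tokens keep their original case and accents.
# binary=True: Use 1 if a token is present and 0 otherwise, instead of token counts.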
vectorizer = CountVectorizer(max_features=10000, analyzer=str.split, lowercase=True, strip_accents=None, binary=True)

# Learn the vocabulary from the training data and transform it into a feature matrix.
X_train = vectorizer.fit_transform(trainX)
# Transform the development data using the vocabulary learned from the training data.
X_dev = vectorizer.transform(devX)
# Transform the test data using the same vocabulary.
X_test = vectorizer.transform(testX)

# Get the size of the vocabulary (number of features) from the shape of the training matrix.
_,vocabSize=X_train.shape

# Initialize the LabelEncoder to convert string labels to integers.
le = preprocessing.LabelEncoder()
# Learn the label mapping from the training labels (e.g., 'pos' -> 1, 'neg' -> 0).
le.fit(trainY)

# Transform the labels for all datasets into their integer representations.
Y_train=le.transform(trainY)
Y_dev=le.transform(devY)
Y_test=le.transform(testY)
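The two transformations in this cell can be illustrated by hand on a toy corpus. This is a pure-Python sketch of the idea behind a binary bag-of-words with a capped vocabulary and alphabetical label encoding. It is not scikit-learn's implementation, and the texts, labels, and helper name `binarize` are made up for illustration.

```python
from collections import Counter

train_texts = ["a great great movie", "a dull movie", "great acting"]
train_labels = ["pos", "neg", "pos"]

# Count how often each whitespace-separated token occurs in the training texts.
counts = Counter(token for text in train_texts for token in text.split())

# Keep only the most frequent tokens (the role max_features plays in the notebook).
max_features = 3
vocabulary = sorted(token for token, _ in counts.most_common(max_features))
print(vocabulary)  # ['a', 'great', 'movie']


def binarize(text):
    """Return a 0/1 vector marking which vocabulary tokens occur in the text."""
    present = set(text.split())
    return [1 if token in present else 0 for token in vocabulary]


print([binarize(text) for text in train_texts])  # [[1, 1, 1], [1, 0, 1], [0, 1, 0]]

# Tokens never seen in training (or cut by max_features) are simply ignored.
print(binarize("a brilliant film"))  # [1, 0, 0]

# Label encoding: sort the distinct labels and number them in that order.
classes = sorted(set(train_labels))
label_to_id = {label: index for index, label in enumerate(classes)}
print([label_to_id[label] for label in train_labels])  # [1, 0, 1]
```

Note that "great" appears twice in the first text but still maps to 1, which is the effect of `binary=True`.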

### Cell 6: Defining the MLP Model Architecture

This function `mlp` defines the structure of our neural network. It's a simple sequential model with one hidden layer.

1.  **Input Layer**: Implicitly defined by `input_shape` in the first `Dense` layer. Its size equals our vocabulary size (at most 10,000, as set by `max_features`).
2.  **Hidden Layer**: A `Dense` layer with 10 neurons and a 'relu' activation function.
3.  **Dropout Layer**: Randomly sets 20% of the hidden layer's outputs to 0 during training to reduce overfitting. It is inactive at evaluation and prediction time.
4.  **Output Layer**: A `Dense` layer with a single neuron and a 'sigmoid' activation function, which outputs a value between 0 and 1 that can be read as the probability of the positive class.

The model is then compiled with a loss function, an optimizer, and a metric to monitor.
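As a rough check on the size of this architecture, the sketch below counts its trainable parameters and runs a tiny forward pass by hand. It is an illustration in plain Python, not Keras code: the function names, the 4-feature input, and all weight values are made up, and dropout is omitted because it only applies during training.

```python
import math


def count_parameters(vocab_size, hidden_units=10):
    """Weights plus biases for Dense(hidden_units) followed by Dense(1)."""
    hidden = vocab_size * hidden_units + hidden_units
    output = hidden_units * 1 + 1
    return hidden + output


# Assuming the vocabulary reaches the full 10,000 features.
print(count_parameters(10000))  # 100021


def forward(x, w_hidden, b_hidden, w_out, b_out):
    """Inference-time forward pass: ReLU hidden layer, then a sigmoid output."""
    hidden = [
        max(0.0, sum(xi * w for xi, w in zip(x, unit_weights)) + b)
        for unit_weights, b in zip(w_hidden, b_hidden)
    ]
    logit = sum(h * w for h, w in zip(hidden, w_out)) + b_out
    return 1.0 / (1.0 + math.exp(-logit))


# Toy example: 4 input features, 2 hidden units, hand-picked weights.
x = [1, 0, 1, 0]
w_hidden = [[0.5, -0.2, 0.1, 0.0], [-1.0, 0.3, 0.2, 0.4]]
b_hidden = [0.0, 0.1]
w_out = [1.0, -2.0]
b_out = -0.6

probability = forward(x, w_hidden, b_hidden, w_out, b_out)
print("%.3f" % probability)  # 0.500
```

In the toy example the second hidden unit receives a negative pre-activation, so ReLU clips it to 0 and only the first unit contributes to the output.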

In [None]:
# Define a function that creates and returns the MLP model.
def mlp():
    # Initialize a Sequential model, which allows building a model layer by layer.
    model = Sequential()
    # Add the first (hidden) layer: a fully-connected (Dense) layer with 10 neurons.
    # 'relu' (Rectified Linear Unit) is the activation function.
    # 'input_shape' specifies the number of features for the first layer (our vocabulary size).
    model.add(Dense(10, activation='relu', input_shape=(vocabSize,)))
    # Add a Dropout layer to reduce overfitting. During training it randomly zeroes 20% of the hidden layer's outputs.
    model.add(Dropout(0.2))
    # Add the output layer: a Dense layer with 1 neuron.
    # 'sigmoid' activation squashes the output to a value between 0 and 1, representing a probability.
    model.add(Dense(1, activation='sigmoid'))

    # Configure the model for training.
    # loss='binary_crossentropy' is the standard loss for binary (two-class) classification.
    # optimizer='adam' selects the Adam optimizer, an adaptive variant of stochastic gradient descent.
    # metrics=['acc'] reports accuracy during training and evaluation.
    model.compile(loss='binary_crossentropy', optimizer='adam', metrics=['acc'])
        
    # Return the compiled model.
    return model
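For intuition about the loss used in `compile`, binary cross-entropy can be written out by hand. The sketch below is a plain-Python illustration rather than the Keras implementation. The function name and the clipping constant `eps` are assumptions chosen for this example.

```python
import math


def binary_crossentropy(y_true, y_prob, eps=1e-7):
    """Mean of -[y*log(p) + (1-y)*log(1-p)] over all examples."""
    total = 0.0
    for y, p in zip(y_true, y_prob):
        # Clip the probability so that log(0) can never occur.
        p = min(max(p, eps), 1.0 - eps)
        total += -(y * math.log(p) + (1 - y) * math.log(1.0 - p))
    return total / len(y_true)


# Confident and correct predictions give a small loss.
print("%.3f" % binary_crossentropy([1, 0], [0.9, 0.1]))  # 0.105

# Confident but wrong predictions are penalized heavily.
print("%.3f" % binary_crossentropy([1, 0], [0.1, 0.9]))  # 2.303
```

The asymmetry between the two results is why this loss pushes the sigmoid output toward the correct label rather than merely across the 0.5 threshold.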

### Cell 7: Training and Evaluation Function

The `train_and_evaluate` function handles the entire workflow of training the model, monitoring its performance, and evaluating it on the dev and test sets. It uses two important callbacks:
* **`EarlyStopping`**: Halts training if the validation loss (`val_loss`) has not improved for a set number of epochs (`patience=10`), which avoids wasted computation and limits overfitting. Note that with `epochs=5`, as configured in this cell, the patience window is never exhausted, so this callback only takes effect if the epoch limit is raised.
* **`ModelCheckpoint`**: Saves the model to a file (`mymodel.hdf5`) only when the validation loss improves. This ensures we can later use the best version of the model, not necessarily the one from the final epoch.

After training, it loads the best saved weights and evaluates the final model's accuracy on both the development and test sets.
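The interaction between the two callbacks can be traced with a simplified simulation over a list of per-epoch validation losses. This is a sketch of the general logic, not the Keras source. The function name `simulate_callbacks` and the loss values are made up for illustration.

```python
def simulate_callbacks(val_losses, patience=10, min_delta=0.0):
    """Return (best_epoch, stopped_epoch) for a sequence of validation losses.

    best_epoch is the last epoch at which a save-best-only checkpoint would
    have been written. stopped_epoch is None if early stopping never triggers.
    """
    checkpoint_best = float("inf")
    best_epoch = None
    stopping_best = float("inf")
    wait = 0
    for epoch, loss in enumerate(val_losses, start=1):
        # Checkpoint: save whenever the loss is strictly better than before.
        if loss < checkpoint_best:
            checkpoint_best, best_epoch = loss, epoch
        # Early stopping: reset the counter on improvement, otherwise count up.
        if loss < stopping_best - min_delta:
            stopping_best, wait = loss, 0
        else:
            wait += 1
            if wait >= patience:
                return best_epoch, epoch
    return best_epoch, None


# Made-up validation losses for five epochs.
losses = [0.50, 0.40, 0.42, 0.45, 0.47]

print(simulate_callbacks(losses, patience=10))  # (2, None)
print(simulate_callbacks(losses, patience=2))   # (2, 4)
```

With five epochs and `patience=10`, training always runs to the end and the checkpoint from the best epoch is what gets reloaded. A smaller patience, or a larger epoch limit, is needed for early stopping to have any effect.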

In [None]:
# Define a function that takes a model, trains it, and evaluates its performance.
def train_and_evaluate(model):
    # Print a summary of the model's architecture (layers, parameters, etc.).
    # summary() prints directly and returns None, so it is not wrapped in print().
    model.summary()

    # Configure EarlyStopping to monitor the validation loss.
    early_stopping = EarlyStopping(monitor='val_loss',
        min_delta=0,       # Minimum change to qualify as an improvement.
        patience=10,       # Number of epochs with no improvement after which training will be stopped.
        verbose=0,         # Suppress verbose output.
        mode='auto')       # Automatically infer the direction of improvement (min for loss).

    # Define the filename for saving the best model.
    modelName="mymodel.hdf5"
    # Configure ModelCheckpoint to save the model with the best validation loss.
    checkpoint = ModelCheckpoint(modelName, monitor='val_loss', verbose=0, save_best_only=True, mode='min')

    # Start the training process.
    model.fit(X_train, Y_train, 
                validation_data=(X_dev, Y_dev), # Data to use for validation after each epoch.
                epochs=5,                       # The maximum number of times to iterate over the entire dataset.
                callbacks=[checkpoint, early_stopping]) # List of callbacks to use during training.
    
    # Load the weights of the best model saved by the ModelCheckpoint callback.
    model.load_weights(modelName)

    # Evaluate the performance of the best model on the development set.
    dev_loss, dev_accuracy = model.evaluate(X_dev, Y_dev, batch_size=128)
    # Print the development accuracy, formatted to three decimal places.
    print("Dev Accuracy: %.3f" % dev_accuracy)

    # Evaluate the performance of the best model on the test set.
    test_loss, test_accuracy = model.evaluate(X_test, Y_test, batch_size=128)
    # Print the final test accuracy.
    print("Test Accuracy: %.3f" % test_accuracy)
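The accuracy figures reported here come from comparing thresholded sigmoid outputs with the true labels. The following plain-Python sketch shows that computation on made-up values. The function name and the 0.5 threshold are assumptions for illustration, not a copy of the Keras metric.

```python
def accuracy(y_true, y_prob, threshold=0.5):
    """Fraction of examples whose thresholded probability matches the label."""
    predictions = [1 if p > threshold else 0 for p in y_prob]
    correct = sum(int(pred == y) for pred, y in zip(predictions, y_true))
    return correct / len(y_true)


# Made-up labels and model outputs for four examples.
y_true = [1, 0, 1, 0]
y_prob = [0.9, 0.2, 0.4, 0.6]

print("Accuracy: %.3f" % accuracy(y_true, y_prob))  # Accuracy: 0.500
```

Here the first two examples are classified correctly and the last two fall on the wrong side of the threshold, giving 2 out of 4.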


### Cell 8: Running the Experiment

This final cell brings everything together. It first calls `mlp()` to create a new instance of the compiled model and then passes this model to the `train_and_evaluate()` function to start the training and evaluation process.

In [None]:
# Create the MLP model by calling the mlp() function and then pass it to the
# train_and_evaluate function to run the complete training and evaluation pipeline.
train_and_evaluate(mlp())