### Notebook Explanation and Comments

This notebook demonstrates how to build and evaluate a Multilayer Perceptron (MLP) for binary text classification using the Keras library. It covers data preprocessing, model building, training, evaluation, and statistical analysis of the model's performance.

**Cell 1: Introduction and Setup**
This markdown cell introduces the notebook's purpose and provides the necessary `pip` commands to install specific versions of TensorFlow and Keras that are compatible with the code.

This notebook explores the multilayer perceptron for binary text classification, using the Keras library. Before starting, be sure to install the following versions of TensorFlow and Keras (for Python 3.7):

```sh
pip install tensorflow==1.13.0-rc2
pip install keras==2.2.4
```

**Cell 2: Importing Libraries**
This cell imports all the required libraries and modules.
* `keras`: The main library for building the neural network.
* `numpy`: For numerical operations, especially with arrays.
* `sklearn.preprocessing`: Used for `LabelEncoder` to convert text labels into numbers.
* `keras.layers`: Contains `Dense` (a fully connected layer) and `Dropout` (for regularization).
* `keras.models`: `Sequential` is used to build the model layer by layer.
* `sklearn.feature_extraction.text`: `CountVectorizer` is used to convert text data into a matrix of token counts.
* `keras.callbacks`: Provides tools like `ModelCheckpoint` and `EarlyStopping` to enhance the training process.
* `random.choices`: Used in the bootstrap function for random sampling.
* `pandas`: Used for data manipulation and for creating the density plot.

In [ ]:
# Import the main Keras library
import keras
# Import NumPy for numerical operations
import numpy as np
# Import preprocessing tools from scikit-learn, specifically for label encoding
from sklearn import preprocessing
# Import layers for building the MLP: Dense for fully-connected layers and Dropout for regularization
from keras.layers import Dense, Dropout
# Import the Sequential model type, which allows for building models layer-by-layer
from keras.models import Sequential
# Import CountVectorizer to convert text into numerical feature vectors
from sklearn.feature_extraction.text import CountVectorizer
# Import callbacks to monitor and control the training process
from keras.callbacks import ModelCheckpoint, EarlyStopping, Callback
# Import the 'choices' function for random sampling with replacement (used for bootstrapping)
from random import choices
# Import pandas for data manipulation and visualization
import pandas as pd

**Cell 3: Data Reading Function**
This cell defines a function `read_data` to load the dataset from a tab-separated values (TSV) file. It iterates through each line, splits it into a label and text, and appends them to separate lists.

In [ ]:
# Define a function to read data from a specified file
def read_data(filename):
    # Initialize empty lists to store the text (X) and labels (Y)
    X=[]
    Y=[]
    # Open the file with UTF-8 encoding to handle various characters
    with open(filename, encoding="utf-8") as file:
        # Iterate over each line in the file
        for line in file:
            # Remove trailing whitespace, then split the line on the tab character ("\t")
            cols=line.rstrip().split("\t")
            # The first column is the label
            label=cols[0]
            # The second column is the text
            text=cols[1]
            # Append the text to the X list
            X.append(text)
            # Append the label to the Y list
            Y.append(label)

    # Return the lists of texts and labels
    return X, Y
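As a quick check of the expected file layout, the sketch below writes a tiny TSV file to a temporary directory and reads it back. The sample rows and the `read_tsv` name are made up for illustration. It mirrors `read_data` above but also skips rows that lack a tab, which is an assumption about how malformed lines should be handled rather than something the notebook does.

```python
import os
import tempfile

def read_tsv(filename):
    """Hypothetical variant of read_data that skips rows lacking a tab."""
    texts, labels = [], []
    with open(filename, encoding="utf-8") as file:
        for line in file:
            cols = line.rstrip("\n").split("\t")
            if len(cols) < 2:
                continue  # malformed row: no label/text pair
            labels.append(cols[0])
            texts.append(cols[1])
    return texts, labels

# Made-up sample data in "label<TAB>text" format, plus one malformed row
sample = "pos\tgreat movie\nneg\tterrible plot\nbroken line without tab\n"
with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "train.tsv")
    with open(path, "w", encoding="utf-8") as out:
        out.write(sample)
    X, Y = read_tsv(path)

print(X)  # ['great movie', 'terrible plot']
print(Y)  # ['pos', 'neg']
```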

**Cell 4: Setting the Data Directory**
This cell defines a variable `directory` that holds the path to the data files. This path needs to be updated by the user to point to the correct location of `train.tsv`, `dev.tsv`, and `test.tsv`.

In [ ]:
# Change this to the directory with your data (from the CheckData_TODO.ipynb exercise).  
# The directory should contain train.tsv, dev.tsv and test.tsv
# Set the path to the folder containing the dataset files
directory="../data/text_classification_sample_data"

**Cell 5: Loading and Preprocessing Data**
This cell performs the main data loading and preprocessing steps.
1.  **Read Data**: It calls the `read_data` function to load the training, development (validation), and test sets.
2.  **Vectorize Text**: It initializes `CountVectorizer` to convert text into a "bag-of-words" numerical format. `binary=True` means it only records the presence (1) or absence (0) of a word, not its frequency. `max_features=10000` limits the vocabulary to the 10,000 most frequent words.
3.  **Fit and Transform**: `fit_transform` is used on the training data (`trainX`) to learn the vocabulary and transform the text into vectors. `transform` is used on the dev and test sets to ensure they are vectorized using the *same* vocabulary learned from the training data.
4.  **Encode Labels**: `LabelEncoder` is used to convert the string labels (e.g., "positive", "negative") into integers (e.g., 1, 0).

In [ ]:
# Read the training, development (dev), and test data using the function defined above
trainX, trainY=read_data("%s/train.tsv" % directory)
devX, devY=read_data("%s/dev.tsv" % directory)
testX, testY=read_data("%s/test.tsv" % directory)

# Initialize CountVectorizer to convert text into a matrix of word occurrences
# max_features=10000: Use the top 10,000 most frequent words as the vocabulary
# analyzer=str.split: Split text into words based on whitespace
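# lowercase=True, strip_accents=None: These preprocessing options apply only to scikit-learn's
#   built-in analyzers. With a callable analyzer such as str.split they are ignored,
#   so tokens keep their original case and accents.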
# binary=True: Use binary values (1 if a word is present, 0 otherwise) instead of word counts
vectorizer = CountVectorizer(max_features=10000, analyzer=str.split, lowercase=True, strip_accents=None, binary=True)

# Learn the vocabulary from the training data and transform it into a sparse matrix
X_train = vectorizer.fit_transform(trainX)
# Transform the dev and test data using the vocabulary learned from the training data
X_dev = vectorizer.transform(devX)
X_test = vectorizer.transform(testX)

# Get the size of the vocabulary from the shape of the training matrix
_,vocabSize=X_train.shape

# Initialize a LabelEncoder to convert string labels to integers
le = preprocessing.LabelEncoder()
# Fit the encoder on the training labels to learn the mapping (e.g., 'positive' -> 1, 'negative' -> 0)
le.fit(trainY)

# Transform the labels for all datasets into their integer representations
Y_train=le.transform(trainY)
Y_dev=le.transform(devY)
Y_test=le.transform(testY)
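To make the preprocessing steps concrete, here is a minimal pure-Python sketch of a binary bag-of-words encoding and of label encoding. The function names and sample texts are hypothetical, and the vocabulary ranking (by number of training texts containing a token) is a simplification; scikit-learn's exact ranking and tie-breaking may differ. Like `str.split` in the cell above, the sketch does not lowercase tokens.

```python
from collections import Counter

def fit_vocabulary(texts, max_features):
    """Keep the max_features tokens that occur in the most training texts."""
    counts = Counter(token for text in texts for token in set(text.split()))
    kept = [token for token, _ in counts.most_common(max_features)]
    # Assign column indices in sorted token order
    return {token: idx for idx, token in enumerate(sorted(kept))}

def binary_vector(text, vocabulary):
    """1 if a vocabulary token occurs in the text, 0 otherwise."""
    vector = [0] * len(vocabulary)
    for token in text.split():
        if token in vocabulary:  # tokens unseen in training are dropped
            vector[vocabulary[token]] = 1
    return vector

def encode_labels(labels, classes):
    """Map each label to its index in the sorted list of training classes."""
    index = {label: idx for idx, label in enumerate(classes)}
    return [index[label] for label in labels]

# Made-up training data
train_texts = ["good good film", "bad film", "good plot"]
train_labels = ["pos", "neg", "pos"]

vocab = fit_vocabulary(train_texts, max_features=2)
classes = sorted(set(train_labels))

print(vocab)                                   # {'film': 0, 'good': 1}
print(binary_vector("good good film", vocab))  # [1, 1]
print(binary_vector("Good film", vocab))       # [1, 0] ('Good' != 'good')
print(encode_labels(train_labels, classes))    # [1, 0, 1]
```

The vocabulary and the label mapping are both learned from the training data only, which is why the notebook calls `fit_transform` and `fit` on the training set and plain `transform` on the dev and test sets.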

**Cell 6: MLP Model Definition**
This cell defines a function `mlp` that creates a simple Multilayer Perceptron model.
* **`Sequential()`**: Creates a linear stack of layers.
* **`Dense(10, ...)`**: The first (hidden) layer with 10 neurons, a 'relu' activation function, and an input shape matching the vocabulary size.
* **`Dropout(0.5)`**: A regularization layer that randomly sets 50% of neuron activations to zero during training to prevent overfitting.
* **`Dense(1, ...)`**: The final (output) layer with a single neuron and a 'sigmoid' activation function, which outputs a probability between 0 and 1, suitable for binary classification.
* **`model.compile(...)`**: Configures the model for training, specifying the loss function (`binary_crossentropy`), the optimizer (`adam`), and the metric to track (`acc` for accuracy).

In [ ]:
# Define a function to create the Multilayer Perceptron (MLP) model
def mlp():
    # Initialize a Sequential model
    model = Sequential()
    # Add the first Dense (fully connected) layer. This is the hidden layer.
    # 10: Number of neurons in this layer.
    # activation='relu': Use the Rectified Linear Unit activation function.
    # input_shape=(vocabSize,): Specify the shape of the input data (the size of our vocabulary).
    model.add(Dense(10, activation='relu', input_shape=(vocabSize,)))
    # Add a Dropout layer for regularization.
    # 0.5: The fraction of input units to drop, helps prevent overfitting.
    model.add(Dropout(0.5))
    # Add the output layer.
    # 1: A single neuron for binary classification.
    # activation='sigmoid': Sigmoid activation squashes the output to a probability between 0 and 1.
    model.add(Dense(1, activation='sigmoid'))

    # Compile the model to configure the learning process.
    # loss='binary_crossentropy': Loss function suitable for binary (0/1) classification.
    # optimizer='adam': A popular and effective optimization algorithm.
    # metrics=['acc']: The metric to be evaluated by the model during training and testing.
    model.compile(loss='binary_crossentropy', optimizer='adam', metrics=['acc'])
        
    # Return the compiled model
    return model
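A rough way to compare architectures is to count their trainable parameters. The sketch below does this for a stack of `Dense` layers; `dense_param_count` is a hypothetical helper, and the vocabulary size of 10,000 is an assumption based on the `max_features` cap (the real value is `X_train.shape[1]`). `Dropout` layers add no parameters, so they are omitted.

```python
def dense_param_count(layer_sizes):
    """Count weights and biases for a stack of Dense layers.

    layer_sizes lists the input width followed by each Dense layer's width.
    """
    total = 0
    for fan_in, fan_out in zip(layer_sizes, layer_sizes[1:]):
        total += fan_in * fan_out + fan_out  # weight matrix + bias vector
    return total

# Assumed vocabulary size of 10,000
print(dense_param_count([10000, 10, 1]))  # 100021: one hidden layer of 10 units
print(dense_param_count([10000, 1]))      # 10001: no hidden layer
```

Almost all parameters sit in the first layer, because its weight matrix scales with the vocabulary size.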

**Cell 7: Training and Evaluation Function**
This cell defines `train_and_evaluate`, a helper function to streamline the model training and evaluation process.
* **Callbacks**: It sets up `EarlyStopping` to halt training if the validation loss (`val_loss`) doesn't improve for 5 epochs (`patience=5`), and `ModelCheckpoint` to save the best version of the model based on `val_loss`.
* **Training**: It trains the model using `model.fit`, passing the training and validation data.
* **Evaluation**: After training, it loads the best saved model weights and evaluates its accuracy on both the dev and test sets.

In [ ]:
# Define a function to handle the training and evaluation of a given model
def train_and_evaluate(model, verbose=1, epochs=10):
    # If verbose is greater than 0, print the model's architecture summary
    # (summary() prints directly and returns None, so it is not wrapped in print)
    if verbose > 0:
        model.summary()

    # Set up EarlyStopping to prevent overfitting.
    # monitor='val_loss': Stop training when the loss on the validation set stops improving.
    # min_delta=0: Any improvement, no matter how small, is considered.
    # patience=5: Number of epochs with no improvement after which training will be stopped.
    early_stopping = EarlyStopping(monitor='val_loss',
        min_delta=0,
        patience=5,
        verbose=0, 
        mode='auto')

    # Set up ModelCheckpoint to save the best model during training.
    modelName="mymodel.hdf5"
    # monitor='val_loss': The metric to monitor for saving the best model.
    # save_best_only=True: Only save the model if 'val_loss' has improved.
    # mode='min': The monitored quantity should be minimized (loss).
    checkpoint = ModelCheckpoint(modelName, monitor='val_loss', verbose=0, save_best_only=True, mode='min')

    # Train the model
    model.fit(X_train, Y_train, 
                # Provide the development (validation) data to monitor performance
                validation_data=(X_dev, Y_dev),
                # Number of times to iterate over the entire training dataset
                epochs=epochs,
                # Control the verbosity of the training output
                verbose=verbose,
                # List of callbacks to apply during training
                callbacks=[checkpoint, early_stopping])
    
    # Load the weights of the best model saved by the checkpoint
    model.load_weights(modelName)

    # Evaluate the best model's performance on the development (validation) set
    dev_loss, dev_accuracy = model.evaluate(X_dev, Y_dev, batch_size=128, verbose=verbose)
    if verbose > 0:
        print("Dev Accuracy: %.3f" % dev_accuracy)

    # Evaluate the best model's performance on the test set
    test_loss, test_accuracy = model.evaluate(X_test, Y_test, batch_size=128, verbose=verbose)
    if verbose > 0:
        print("Test Accuracy: %.3f" % test_accuracy)

    # Return the final accuracy scores for the dev and test sets
    return dev_accuracy, test_accuracy
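The patience rule can be illustrated without Keras. The sketch below loosely mirrors the early-stopping logic on a made-up sequence of validation losses; `simulate_early_stopping` is a hypothetical helper and not the Keras implementation.

```python
def simulate_early_stopping(val_losses, patience=5, min_delta=0.0):
    """Return (stopped_epoch, best_epoch) for a sequence of validation losses."""
    best = float("inf")
    best_epoch = None
    wait = 0  # epochs since the last improvement
    for epoch, loss in enumerate(val_losses):
        if loss < best - min_delta:
            best, best_epoch, wait = loss, epoch, 0
        else:
            wait += 1
            if wait >= patience:
                return epoch, best_epoch  # training halts here
    return len(val_losses) - 1, best_epoch  # ran through all epochs

# Made-up validation losses; the loss at index 8 is never reached
losses = [0.9, 0.8, 0.7, 0.75, 0.72, 0.71, 0.73, 0.74, 0.6]
print(simulate_early_stopping(losses, patience=5))  # (7, 2)
```

Training stops at epoch index 7, after five epochs without beating the loss from epoch index 2. That is also the epoch whose weights a best-only checkpoint would have kept, which is why `train_and_evaluate` reloads the saved weights before evaluating.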

**Cell 8: Question 1**
This markdown cell poses the first task: to experiment with five different MLP architectures to find the best-performing one on the development data. One of the models must be logistic regression (an MLP with no hidden layers).

Q1: Experiment with the network structure that works best for your binary classification dataset.  Explore the following choices: a.) number of layers in the MLP;  b.) the size of each layer; c.) the activation functions; d.) the use of dropout.  Which architecture performs best on the development data?  Create 5 different models and execute them below.  One of the models should be logistic regression (i.e., an MLP with *no* hidden layers).

**Cell 9: Model 1 (Example MLP)**
This cell defines and trains the first model for Q1. This is the same MLP architecture defined in the `mlp` function earlier: one hidden layer with 10 neurons, 'relu' activation, and 50% dropout.

In [ ]:
# Define a function to create the first model architecture
def get_model1():
    # Initialize a Sequential model
    model = Sequential()
    # Add a hidden layer with 10 neurons, 'relu' activation, and the specified input shape
    model.add(Dense(10, activation='relu', input_shape=(vocabSize,)))
    # Add a Dropout layer with a rate of 0.5 for regularization
    model.add(Dropout(0.5))
    # Add the output layer with 1 neuron and 'sigmoid' activation for binary classification
    model.add(Dense(1, activation='sigmoid'))

    # Compile the model with binary cross-entropy loss and the adam optimizer
    model.compile(loss='binary_crossentropy', optimizer='adam', metrics=['acc'])
        
    # Return the compiled model
    return model

# Create an instance of the first model
model1=get_model1()
# Train and evaluate the model using the helper function
dev_accuracy, test_accuracy=train_and_evaluate(model1)

**Cells 10-12: Placeholder for Models 2-4**
These cells are placeholders for the user to define and test their own model architectures as required by Q1.

In [ ]:
# --- MODEL 2 ---
# Define a function for the second model architecture
def get_model2():
    model = Sequential()
    # TODO: your model here. For example, a deeper network:
    # model.add(Dense(16, activation='relu', input_shape=(vocabSize,))) 
    # model.add(Dropout(0.5))
    # model.add(Dense(8, activation='relu'))
    # model.add(Dropout(0.5))
    # model.add(Dense(1, activation='sigmoid'))

    # Compile the model
    model.compile(loss='binary_crossentropy', optimizer='adam', metrics=['acc'])
        
    return model

# Create, train, and evaluate the second model
# model2=get_model2()
# dev_accuracy, test_accuracy=train_and_evaluate(model2)

In [ ]:
# --- MODEL 3 ---
# Define a function for the third model architecture
def get_model3():
    model = Sequential()
    # TODO: your model here. For example, a wider network:
    # model.add(Dense(32, activation='relu', input_shape=(vocabSize,))) 
    # model.add(Dropout(0.5))
    # model.add(Dense(1, activation='sigmoid'))

    # Compile the model
    model.compile(loss='binary_crossentropy', optimizer='adam', metrics=['acc'])
        
    return model

# Create, train, and evaluate the third model
# model3=get_model3()
# dev_accuracy, test_accuracy=train_and_evaluate(model3)

In [ ]:
# --- MODEL 4 ---
# Define a function for the fourth model architecture
def get_model4():
    model = Sequential()
    # TODO: your model here. For example, using a different activation like 'tanh':
    # model.add(Dense(10, activation='tanh', input_shape=(vocabSize,))) 
    # model.add(Dropout(0.25))
    # model.add(Dense(1, activation='sigmoid'))

    # Compile the model
    model.compile(loss='binary_crossentropy', optimizer='adam', metrics=['acc'])
        
    return model

# Create, train, and evaluate the fourth model
# model4=get_model4()
# dev_accuracy, test_accuracy=train_and_evaluate(model4)

**Cell 13: Model 5 (Logistic Regression)**
This cell defines and trains the logistic regression model. This is achieved by creating a `Sequential` model with only a single `Dense` output layer and no hidden layers. This architecture is mathematically equivalent to logistic regression.

In [ ]:
# Define a function to create the logistic regression model
def get_logreg():
    # This is an MLP with no hidden layers, which is equivalent to logistic regression.

    # Initialize a Sequential model
    model = Sequential()
    # Add only the output layer directly.
    # It takes the input and directly computes the final output via the sigmoid function.
    model.add(Dense(1, activation='sigmoid', input_shape=(vocabSize,)))
    # Compile the model
    model.compile(loss='binary_crossentropy', optimizer='adam', metrics=['acc'])
        
    # Return the compiled model
    return model
    
# Create an instance of the logistic regression model
logreg=get_logreg()
# Train and evaluate the model
dev_accuracy, test_accuracy=train_and_evaluate(logreg)
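The equivalence to logistic regression follows from what a single sigmoid unit computes: the probability sigmoid(w · x + b). A minimal pure-Python sketch, with made-up weights and a hypothetical function name:

```python
import math

def sigmoid(z):
    """Squash a real number into the interval (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))

def logreg_probability(x, weights, bias):
    """Probability of the positive class for one binary feature vector."""
    z = sum(w * xi for w, xi in zip(weights, x)) + bias
    return sigmoid(z)

# Made-up example: three vocabulary entries, two of them present in the text
x = [1, 0, 1]
weights = [2.0, -1.0, -2.0]
bias = 0.0
print(logreg_probability(x, weights, bias))  # 0.5, since z = 2 - 2 + 0 = 0
```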

**Cell 14: Prediction Explanation**
This markdown cell briefly explains how to use the trained model to make predictions on new data.

We can generate predictions for a given test set with the `predict_classes` function:

**Cell 15: Making Predictions**
This cell demonstrates how to generate class predictions (0 or 1) for the test set using the trained logistic regression model.

In [ ]:
# Select the model to use for predictions (here, the logistic regression model)
model=logreg
# Use the .predict_classes() method to get the predicted class labels for the test set
predictions=model.predict_classes(X_test)
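For a model with a single sigmoid output, `predict_classes` in the Keras version pinned above amounts to thresholding the predicted probability at 0.5. The sketch below shows that step on made-up probabilities; `to_class_labels` is a hypothetical helper. In newer TensorFlow/Keras releases, where `predict_classes` is no longer available to our knowledge, the same result can be obtained with something like `(model.predict(X_test) > 0.5).astype("int32")`.

```python
def to_class_labels(probabilities, threshold=0.5):
    """Label 1 when the probability is strictly above the threshold, else 0."""
    return [1 if p > threshold else 0 for p in probabilities]

# Made-up sigmoid outputs
probs = [0.2, 0.5, 0.9]
print(to_class_labels(probs))  # [0, 0, 1]
```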

**Cell 16: Question 2**
This markdown cell asks the user to calculate 95% confidence intervals for the test accuracy of their single best model from Q1, using the bootstrap resampling method.

Q2: For the single best model you identified in Q1 above, calculate 95% confidence intervals for the accuracy of the predictions it makes on the test data.

**Cell 17: Accuracy Metric Function**
This cell defines a simple helper function to calculate accuracy by comparing the ground truth labels with the model's predictions.

In [ ]:
# Define a function to calculate accuracy
def accuracy(truth, predictions):
    # Initialize a counter for correct predictions
    correct=0.
    # Loop through each true label and its corresponding prediction
    for idx in range(len(truth)):
        g=truth[idx]  # Ground truth label
        p=predictions[idx] # Predicted label
        # If the prediction matches the truth, increment the counter
        if g == p:
            correct+=1
    # Return the ratio of correct predictions to the total number of predictions
    return correct/len(truth)

**Cell 18: Bootstrap Function**
This cell defines the `bootstrap` function to calculate confidence intervals.
1.  **Resampling**: It repeatedly (B=10,000 times) creates a new dataset of the same size by sampling with replacement from the original (true label, prediction) pairs.
2.  **Metric Calculation**: For each resampled dataset, it calculates the accuracy.
3.  **Percentiles**: After all iterations, it collects all the calculated accuracies and finds the percentiles that correspond to the 95% confidence interval (the 2.5th and 97.5th percentiles) and the median (50th percentile).

In [ ]:
# Define a function to perform bootstrap resampling to estimate confidence intervals
# gold: The ground truth labels
# predictions: The model's predictions
# metric: The evaluation function to use (e.g., accuracy)
# B: The number of bootstrap samples to create
# confidence_level: The desired confidence level (e.g., 0.95 for 95%)
def bootstrap(gold, predictions, metric, B=10000, confidence_level=0.95):
    # Calculate the percentile boundaries for the confidence interval
    critical_value=(1-confidence_level)/2
    lower_sig=100*critical_value      # e.g., 2.5 for 95% CI
    upper_sig=100*(1-critical_value)  # e.g., 97.5 for 95% CI
    
    # Combine the ground truth and predictions into pairs for easy resampling
    data=[]
    for g, p in zip(gold, predictions):
        data.append([g,p])

    # List to store the metric score for each bootstrap sample
    accuracies=[]
    
    # Main bootstrap loop
    for b in range(B):
        # Create a new sample by choosing with replacement from the original data
        choice=choices(data, k=len(data))
        # Convert the sample to a NumPy array for easy slicing
        choice=np.array(choice)
        # Calculate the accuracy on this bootstrap sample
        accuracy_score=metric(choice[:,0], choice[:,1]) # choice[:,0] is truth, choice[:,1] is prediction
        
        # Store the calculated accuracy
        accuracies.append(accuracy_score)
    
    # Calculate the percentiles from the distribution of accuracies
    percentiles=np.percentile(accuracies, [lower_sig, 50, upper_sig])
    
    # Extract the lower bound, median, and upper bound of the confidence interval
    lower=percentiles[0]
    median=percentiles[1]
    upper=percentiles[2]
    
    # Return the calculated values
    return lower, median, upper
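The same procedure can be tried on toy data without NumPy. The sketch below is a standard-library variant under stated assumptions: it fixes a random seed for reproducibility, uses fewer resamples by default, and picks percentiles by nearest rank rather than the linear interpolation of `np.percentile`, so its bounds can differ slightly from those of `bootstrap` above. The labels and predictions are made up.

```python
import random

def bootstrap_accuracy(gold, predictions, B=1000, confidence_level=0.95, seed=0):
    """Percentile bootstrap interval for accuracy (nearest-rank percentiles)."""
    rng = random.Random(seed)
    pairs = list(zip(gold, predictions))
    n = len(pairs)
    scores = []
    for _ in range(B):
        # Resample (label, prediction) pairs with replacement
        sample = rng.choices(pairs, k=n)
        scores.append(sum(g == p for g, p in sample) / n)
    scores.sort()
    tail = (1 - confidence_level) / 2

    def nearest_rank(q):
        idx = min(B - 1, max(0, round(q * (B - 1))))
        return scores[idx]

    return nearest_rank(tail), nearest_rank(0.5), nearest_rank(1 - tail)

# Made-up labels and predictions (8 of 10 correct)
gold = [1, 0, 1, 1, 0, 1, 0, 0, 1, 1]
pred = [1, 0, 0, 1, 0, 1, 1, 0, 1, 1]
lower, median, upper = bootstrap_accuracy(gold, pred)
print("Accuracy: %.3f [%.3f, %.3f]" % (median, lower, upper))
```

With only ten examples the interval is very wide. The interval narrows as the test set grows.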

**Cell 19: Calculating and Printing Confidence Intervals**
This cell calls the `bootstrap` function with the test labels and predictions to compute the confidence interval for accuracy and then prints the result in a formatted string.

In [ ]:
# Calculate the 95% confidence interval for accuracy using the bootstrap function
lower, median, upper=bootstrap(Y_test, predictions, accuracy)
# Print the results, showing the median accuracy and the [lower, upper] confidence interval
print ("Accuracy: %.3f [%.3f, %.3f]" % (median, lower, upper))

**Cell 20: Question 3**
This markdown cell introduces the third task: to investigate the effect of random weight initialization on model performance. The user needs to train their best non-logistic-regression model 10 times, record the development accuracy each time, and plot the distribution of these accuracies.

Q3: Unlike logistic/linear regression, neural networks converge to different solutions as a function of their *initialization* (the random choice of the initial values for parameters).  For the best model that's not logistic regression you identified in Q1 above, train the model 10 times and save the accuracies attained on the development data.  Plot the distribution of dev accuracies using [pandas.DataFrame.plot.density](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.plot.density.html). 
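To see where the run-to-run variation comes from, the sketch below draws initial weights the way Glorot (Xavier) uniform initialization does, which is the default kernel initializer for Keras `Dense` layers: uniformly from [-limit, limit] with limit = sqrt(6 / (fan_in + fan_out)). The function name, layer sizes, and seeds are chosen for illustration only.

```python
import math
import random

def glorot_uniform(fan_in, fan_out, seed):
    """Draw a fan_in x fan_out weight matrix from the Glorot uniform range."""
    limit = math.sqrt(6.0 / (fan_in + fan_out))
    rng = random.Random(seed)
    return [[rng.uniform(-limit, limit) for _ in range(fan_out)]
            for _ in range(fan_in)]

# Two different seeds give two different starting points for training
weights_a = glorot_uniform(4, 3, seed=1)
weights_b = glorot_uniform(4, 3, seed=2)
print(weights_a[0])
print(weights_b[0])
print(weights_a == weights_b)  # False
```

Each freshly constructed model starts from a different draw like this, so gradient descent can end up at a different solution each time.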

**Cell 21: Re-training Loop**
This cell implements the experiment for Q3. It loops 10 times; in each iteration it builds a fresh copy of the model 1 architecture with `get_model1()`, so that the weights are randomly re-initialized, trains it, and appends the resulting development accuracy to a list. The `verbose=0` argument suppresses the detailed training output for each run.

In [ ]:
# Initialize an empty list to store the development accuracies from each run
dev_accuracies=[]

# Note: `fit` continues from the model's current weights; it does not re-initialize them.
# Re-using the already trained `model1` would keep training the same network instead of
# testing a new random initialization, so a new model is created inside the loop.

# Loop 10 times to train and evaluate the model repeatedly
for i in range(10):
    # Re-create the model so that its weights are freshly (randomly) initialized
    model=get_model1()
    # Train and evaluate the model, with verbose=0 to keep the output clean
    dev_accuracy, test_accuracy=train_and_evaluate(model, verbose=0)
    # Append the resulting development accuracy to our list
    dev_accuracies.append(dev_accuracy)
    # Print the results of the current iteration
    print("iteration: %s\t%.3f\t%.3f" % (i, dev_accuracy, test_accuracy))

**Cell 22: Plotting the Accuracy Distribution**
This cell uses pandas to create a Kernel Density Estimate (KDE) plot. A KDE plot is a way to visualize the distribution of a continuous variable. Here, it shows the distribution of the 10 development accuracy scores obtained in the previous cell.

In [ ]:
# Convert the list of accuracies into a pandas DataFrame
df=pd.DataFrame(dev_accuracies)
# Use the DataFrame's built-in plotting function to create a Kernel Density Estimate (KDE) plot.
# This visualizes the distribution of the development accuracies.
ax = df.plot.kde()
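Alongside the density plot, a numeric summary of the runs can be useful. The sketch below computes the mean and sample standard deviation with the standard library; the accuracy values are hypothetical placeholders, not results from this notebook, and would be replaced by `dev_accuracies`.

```python
import statistics

# Hypothetical placeholder values; substitute the dev_accuracies list from the loop
example_accuracies = [0.80, 0.82, 0.81, 0.79, 0.83]

mean_acc = statistics.mean(example_accuracies)
std_acc = statistics.stdev(example_accuracies)  # sample standard deviation
print("mean=%.3f std=%.3f min=%.3f max=%.3f"
      % (mean_acc, std_acc, min(example_accuracies), max(example_accuracies)))
```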