# **Info 557 Graduate Project**

Angelina Allen


This project follows a complete machine learning pipeline to predict whether specific emotions are present in a given text:

1. **Data Loading**: CSV files are loaded using the Hugging Face datasets library.
2. **Preprocessing**: Text is tokenized and labels are formatted for multi-label classification.
3. **Model Training**: A neural network that combines embeddings, a CNN, a GRU, and attention is trained with TensorFlow/Keras using early stopping and model checkpoints.
4. **Prediction**: The trained model generates predictions and saves them in a submission file.

Below, I walk through my code so that every part of the project is clear.

## **Imports**

In [None]:
import argparse
import pandas as pd
import numpy as np
import tensorflow as tf
import datasets
import transformers

## **Exploratory Data Analysis**

This was done separately from the file added to the GitHub repository.

*   Checked dataset shape and column names
*   Checked for missing values
*   Looked for duplicate rows
*   Analyzed text length distribution to identify possible outliers
*   Reviewed label distributions to understand class imbalance, in the hope of improving the prediction score



In [None]:
# Load the training data
df = pd.read_csv("train.csv")

# Basic structure
print("Shape of dataset:", df.shape)
print("\nColumn names:\n", df.columns)

# Check for missing values
print("\nMissing values per column:\n", df.isnull().sum())

# Check for duplicate rows
print("\nNumber of duplicate rows:", df.duplicated().sum())

# Text length analysis
df["text_length"] = df["text"].astype(str).apply(len)

print("Text length stats:\n", df["text_length"].describe())

# Outlier check: very long or very short texts
print("Shortest texts:")
print(df.nsmallest(5, "text_length")[["text", "text_length"]])

print("\nLongest texts:")
print(df.nlargest(5, "text_length")[["text", "text_length"]])
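
The label-distribution review listed above is not part of the cell shown. A minimal sketch of how it could look, assuming (as in the loading code later in this notebook) that every column other than `text` is a 0/1 emotion label. The helper name `label_summary` and the toy data are illustrative only:

```python
import pandas as pd

def label_summary(df, text_col="text"):
    """Count positives per label and report each label's share of rows."""
    label_cols = [c for c in df.columns if c != text_col]
    positives = df[label_cols].sum().astype(int)
    summary = pd.DataFrame({
        "positives": positives,
        "rate": positives / len(df),
    })
    # Rarest labels first, since those are the hardest to predict
    return summary.sort_values("positives")

# Toy data standing in for train.csv (values are made up)
toy = pd.DataFrame({
    "text": ["a", "b", "c", "d"],
    "joy": [1, 1, 0, 1],
    "fear": [0, 0, 0, 1],
})
summary = label_summary(toy)
print(summary)
```

On the real data, this would need to run before the `text_length` column is added (or with that column dropped), so that it is not counted as a label.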

## **Custom Layer**
I built this custom global sum pooling layer, which sums all values across the time (sequence) dimension, for use in my attention mechanism. We covered this in lecture, but I also got help from Google.


In [None]:
class GlobalSumPooling1D(tf.keras.layers.Layer):
    def call(self, inputs):
        return tf.reduce_sum(inputs, axis=1)

## **Tokenization Functions and Dataset Loading**

* Text inputs are tokenized using a pretrained tokenizer.  
* Sequences are truncated or padded to a fixed length to ensure consistency.
* Labels are converted into numerical multi-hot vectors for multi-label classification.

I kept all of these parts the same as the baseline, since that seemed the safest way to process the data correctly. I built my model to work with the given tokenizer and Hugging Face functions.

A **train/dev split** validation strategy was used:

* The training set was used to fit the model parameters.
* A separate validation (dev) set was used for hyperparameter tuning and early stopping.
* Early stopping was monitored using validation F1-score to prevent overfitting.

This approach ensures that model performance is measured on unseen data.

In [None]:
# Tokenizer from baseline
tokenizer = transformers.AutoTokenizer.from_pretrained("distilroberta-base")
MAX_LEN = 128
BATCH_SIZE = 16
EPOCHS = 10

def tokenize(examples):
    return tokenizer(examples["text"], truncation=True, max_length=MAX_LEN, padding="max_length")

# loading dataset
def load_datasets(train_path="train.csv", dev_path="dev.csv"):
    # train/dev split
    hf_dataset = datasets.load_dataset("csv", data_files={"train": train_path, "validation": dev_path})
    labels = hf_dataset["train"].column_names[1:]

    def gather_labels(example):
        return {"labels": [float(example[l]) for l in labels]}

    hf_dataset = hf_dataset.map(gather_labels)
    hf_dataset = hf_dataset.map(tokenize, batched=True)

    train_dataset = hf_dataset["train"].to_tf_dataset(
        columns="input_ids",
        label_cols="labels",
        batch_size=BATCH_SIZE,
        shuffle=True
    )
    dev_dataset = hf_dataset["validation"].to_tf_dataset(
        columns="input_ids",
        label_cols="labels",
        batch_size=BATCH_SIZE
    )

    return train_dataset, dev_dataset, labels
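
To make the two preprocessing ideas above concrete, here is a small pure-Python sketch of multi-hot label vectors and fixed-length sequences. It is only an illustration: `to_multi_hot` and `pad_or_truncate` are hypothetical helpers, the label names and `pad_id=1` are placeholder values, and the real tokenizer also handles special tokens, which this sketch ignores.

```python
def to_multi_hot(example, label_names):
    """Turn per-label columns into one list of floats, in a fixed label order."""
    return [float(example[name]) for name in label_names]

def pad_or_truncate(token_ids, max_len, pad_id):
    """Clip to max_len, then right-pad with pad_id up to max_len."""
    clipped = token_ids[:max_len]
    return clipped + [pad_id] * (max_len - len(clipped))

# Hypothetical row and label names, for illustration only
row = {"text": "so happy today", "anger": 0, "joy": 1, "sadness": 0}
labels = ["anger", "joy", "sadness"]
multi_hot = to_multi_hot(row, labels)

short = pad_or_truncate([5, 6, 7], max_len=5, pad_id=1)
long = pad_or_truncate([5, 6, 7, 8, 9, 10], max_len=5, pad_id=1)
print(multi_hot, short, long)
```

Every example ends up with the same label order and the same sequence length, which is what lets the batches be stacked into fixed-shape tensors.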

It is important to note that the hyperparameters shown were not the only ones tested. I extensively tuned and adjusted the model to identify the combination that produced the best performance. The values presented reflect the final configuration that achieved the highest score.

## **Neural Network Model**

The model combines:

* An L2 regularizer that adds weight decay to reduce overfitting
* An embedding layer to convert tokens into dense vectors
* Spatial dropout, which drops entire embedding channels (the same feature dimensions at every token position) to reduce overfitting
* A convolutional layer to extract local n-gram features
* A bidirectional GRU layer to capture context in both directions
* An attention mechanism to focus on the most important tokens
* A fully connected layer with dropout to refine the learned features
* A sigmoid output layer with one neuron per label for multi-label classification

In [None]:
def build_model(num_labels):
    # Adds L2 weight decay to reduce overfitting
    regularizer = tf.keras.regularizers.L2(0.0005)

    # inputs defines model input
    inputs = tf.keras.Input(shape=(MAX_LEN,), dtype=tf.int32, name="input_ids")

    # Embedding converts token IDs into dense vectors
    x = tf.keras.layers.Embedding(
        input_dim=tokenizer.vocab_size,
        output_dim=128,
        input_length=MAX_LEN,
        mask_zero=True,
        embeddings_regularizer=regularizer
    )(inputs)

    # Spatial dropout zeroes entire embedding channels (the same feature
    # dimensions at every token position) to reduce overfitting
    x = tf.keras.layers.SpatialDropout1D(0.2)(x)

    # CNN extracts local n-gram features and downsamples
    x = tf.keras.layers.Conv1D(filters=128, kernel_size=3, padding="same", activation="relu")(x)
    x = tf.keras.layers.MaxPooling1D(pool_size=2)(x)

    # Bidirectional GRU processes text forward and backward to capture context
    x = tf.keras.layers.Bidirectional(
        tf.keras.layers.GRU(
            64,
            dropout=0.3,
            recurrent_dropout=0.3,
            return_sequences=True
        )
    )(x)

    # Attention applies an attention mechanism that assigns importance weights to each token,
    # reweights token representations accordingly, and aggregates them into a single sequence-level feature vector.
    score = tf.keras.layers.Dense(1)(x)
    weights = tf.keras.layers.Softmax(axis=1)(score)
    x = tf.keras.layers.Multiply()([x, weights])
    x = GlobalSumPooling1D()(x)

    # Dense head applies a fully connected layer to refine learned features and uses
    # dropout for regularization to reduce overfitting.
    x = tf.keras.layers.Dense(64, activation="relu", kernel_regularizer=regularizer)(x)
    x = tf.keras.layers.Dropout(0.4)(x)

    # output one sigmoid neuron per label for multi-label classification
    outputs = tf.keras.layers.Dense(num_labels, activation="sigmoid")(x)

    # Build model
    model = tf.keras.Model(inputs=inputs, outputs=outputs)

    # Uses the Adam optimizer with an exponentially decaying learning rate and
    # gradient clipping to ensure stable and efficient training
    optimizer = tf.keras.optimizers.Adam(
        learning_rate=tf.keras.optimizers.schedules.ExponentialDecay(
            initial_learning_rate=1e-3,
            decay_steps=2000,
            decay_rate=0.95
        ),
        clipnorm=1.0
    )

    # Compiles the model using the Adam optimizer, binary cross-entropy loss,
    # and a micro-averaged F1 score metric with a fixed threshold.
    model.compile(
        optimizer=optimizer,
        loss=tf.keras.losses.BinaryCrossentropy(),
        # name="f1" makes Keras log "val_f1", the key the training callbacks monitor
        metrics=[tf.keras.metrics.F1Score(average="micro", threshold=0.4, name="f1")]
    )

    return model
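
The attention step in the model (a `Dense(1)` score per position, a softmax over the time axis, a multiply, then a sum over time) can be checked outside Keras. The NumPy sketch below mirrors that computation under simplifying assumptions: the scoring weights `w` and the input are random stand-ins rather than trained values, and `attention_pool` is a hypothetical helper name.

```python
import numpy as np

def attention_pool(x, w, b=0.0):
    """x: (batch, time, features); w: (features,) scoring weights."""
    scores = x @ w + b                                   # (batch, time), like Dense(1)
    scores = scores - scores.max(axis=1, keepdims=True)  # numerical stability
    weights = np.exp(scores)
    weights /= weights.sum(axis=1, keepdims=True)        # softmax over time
    pooled = (x * weights[..., None]).sum(axis=1)        # weighted sum over time
    return pooled, weights

rng = np.random.default_rng(0)
x = rng.normal(size=(2, 5, 4))   # 2 sequences, 5 time steps, 4 features
w = rng.normal(size=4)

pooled, weights = attention_pool(x, w)
print(pooled.shape)              # (2, 4)

# With all-zero scoring weights every position gets weight 1/5,
# so the pooled vector reduces to the plain mean over time.
uniform, _ = attention_pool(x, np.zeros(4))
```

Because the weights sum to 1 along the time axis, the sum pooling produces a weighted average of the GRU outputs rather than an unbounded sum.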


## **Training Method**

The model is trained using:
* Binary Cross-Entropy loss for multi-label classification
* Adam optimizer with learning rate decay
* Gradient clipping to stabilize training
* Early stopping to prevent overfitting

Validation F1-score is used to monitor performance.
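
For intuition about the learning rate decay, the schedule can be written out by hand. Keras documents `ExponentialDecay` (with its default `staircase=False`) as `initial_learning_rate * decay_rate ** (step / decay_steps)`; the sketch below reproduces that formula with this project's values. The function name is illustrative.

```python
def exponential_decay(step, initial_lr=1e-3, decay_steps=2000, decay_rate=0.95):
    """Continuous (non-staircase) exponential decay of the learning rate."""
    return initial_lr * decay_rate ** (step / decay_steps)

for step in (0, 2000, 10000):
    print(step, exponential_decay(step))
```

The rate drops by 5% every 2,000 optimizer steps, so after 10,000 steps it is roughly 77% of its starting value.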

In [None]:
def train(model_path="model", train_path="train.csv", dev_path="dev.csv"):
    train_dataset, dev_dataset, labels = load_datasets(train_path, dev_path)
    model = build_model(len(labels))

    model.fit(
        train_dataset,
        validation_data=dev_dataset,
        epochs=EPOCHS,
        callbacks=[
            tf.keras.callbacks.ModelCheckpoint( #Saves the best model
                filepath=model_path + ".keras",
                monitor="val_f1",
                mode="max",
                save_best_only=True
            ),
            tf.keras.callbacks.EarlyStopping( #Stops training early if validation F1 stops improving
                monitor="val_f1",
                patience=3,
                mode="max",
                restore_best_weights=True
            )
        ]
    )
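
The effect of `patience=3` can be traced on a list of validation scores. This is a simplified sketch of the patience logic only (it ignores options such as `min_delta`), and the scores are made up for illustration:

```python
def early_stopping_trace(val_scores, patience=3):
    """Return (stop_epoch, best_epoch) for a 'max' metric; epochs are 0-indexed."""
    best_score = float("-inf")
    best_epoch = None
    wait = 0
    for epoch, score in enumerate(val_scores):
        if score > best_score:
            # New best: remember it and reset the patience counter
            best_score, best_epoch, wait = score, epoch, 0
        else:
            wait += 1
            if wait >= patience:
                return epoch, best_epoch
    # Ran out of epochs without exhausting patience
    return len(val_scores) - 1, best_epoch

scores = [0.70, 0.75, 0.74, 0.73, 0.72, 0.80]
stop_epoch, best_epoch = early_stopping_trace(scores)
print(stop_epoch, best_epoch)
```

Here training stops after three epochs without improvement, and the weights from the best epoch are the ones kept, so the later 0.80 is never reached. That is the trade-off a small patience value makes.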

## **Prediction Method**

* Loads the saved trained model instead of training a new one.
* Uses only the test text as input.
* Converts model probability outputs into binary predictions using a 0.4 threshold.
* Inserts the predictions back into the test dataframe.
* Saves the result as `submission.zip`.
* Includes a small command-line interface so you can run `train` or `predict` from the terminal.

I kept this close to the baseline model. The main change I made was tuning the decision threshold used to convert predicted probabilities into binary class labels, ultimately selecting 0.4. I experimented with several thresholds to improve performance on lower-frequency emotions, but 0.4 provided the best overall results.
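
A threshold comparison of this kind can be sketched with NumPy. The example below is not the tuning script used for this project: `micro_f1` is a hypothetical helper, and the labels and probabilities are toy values chosen so the effect of the threshold is visible.

```python
import numpy as np

def micro_f1(y_true, y_pred):
    """Micro-averaged F1: pool TP/FP/FN over every (example, label) pair."""
    y_true = np.asarray(y_true).astype(bool)
    y_pred = np.asarray(y_pred).astype(bool)
    tp = np.sum(y_true & y_pred)
    fp = np.sum(~y_true & y_pred)
    fn = np.sum(y_true & ~y_pred)
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0

# Toy dev labels and predicted probabilities (made up)
y_true = np.array([[1, 0, 1], [0, 1, 0]])
probs = np.array([[0.9, 0.45, 0.3], [0.2, 0.6, 0.41]])

results = {t: micro_f1(y_true, probs > t) for t in (0.25, 0.4, 0.5)}
for t, score in results.items():
    print(f"threshold={t:.2f}  micro-F1={score:.3f}")
```

Lower thresholds recover more rare labels at the cost of more false positives, so the best value depends on the data and should be chosen on the dev set, not the test set.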

In [None]:
# Nearly the same as the baseline, updated to fit my model
def predict(model_path="model.keras", input_path="test-ref.csv"):
    model = tf.keras.models.load_model(
        model_path,
        compile=False,
        custom_objects={"GlobalSumPooling1D": GlobalSumPooling1D})
    df = pd.read_csv(input_path)
    labels = df.columns[1:]

    enc = tokenizer(df["text"].tolist(), truncation=True, max_length=MAX_LEN, padding="max_length", return_tensors="tf")
    tf_dataset = tf.data.Dataset.from_tensor_slices(enc["input_ids"]).batch(BATCH_SIZE)

    preds = model.predict(tf_dataset)
    preds = (preds > 0.4).astype(int)  # Converts predicted probabilities into binary class labels using a 0.4 decision threshold

    df.iloc[:, 1:] = preds
    df.to_csv("submission.zip", index=False, compression=dict(
        method='zip', archive_name='submission.csv'
    ))

if __name__ == "__main__":
    parser = argparse.ArgumentParser()
    parser.add_argument("command", choices={"train", "predict"})
    args = parser.parse_args()
    globals()[args.command]()

## **Project File Structure**

The project uses the following structure:

- `train.csv`: Training dataset
- `dev.csv`: Validation dataset
- `test-ref.csv`: Test dataset
- `model.keras`: Saved trained model
- `submission.zip`: ZIP file containing the final predictions

## **Submission Process**

1. The trained model is loaded using TensorFlow.
2. The dev/test data is tokenized using the same tokenizer as during training.
3. The model generates probability predictions.
4. Probabilities are converted into binary labels using a fixed threshold (0.4).
5. Predictions are saved into `submission.csv`.
6. The CSV file is compressed into `submission.zip` for submission.
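
The last two steps can be verified with a quick round trip. The sketch below uses a toy dataframe and a temporary directory (both assumptions for illustration) to confirm that pandas writes a ZIP archive containing a single `submission.csv`:

```python
import os
import tempfile
import zipfile

import pandas as pd

# Toy predictions standing in for the real test set (values are made up)
df = pd.DataFrame({"text": ["a", "b"], "joy": [1, 0], "fear": [0, 1]})

with tempfile.TemporaryDirectory() as tmp:
    zip_path = os.path.join(tmp, "submission.zip")
    # Same compression settings as the predict function above
    df.to_csv(zip_path, index=False,
              compression={"method": "zip", "archive_name": "submission.csv"})

    with zipfile.ZipFile(zip_path) as archive:
        names = archive.namelist()
    restored = pd.read_csv(zip_path)  # compression is inferred from ".zip"

print(names)
print(restored)
```

Checking the archive contents this way before uploading can catch a wrong file name or a stray index column early.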

## **Model Evaluation**

The model achieved a micro F1 score of 0.83 on the dev set and 0.82 on the test set. Compared to the baseline score of 0.75, this is a noticeable improvement. The small drop from dev to test also suggests that the model generalizes well. Overall, the model performed well, especially given the challenges of multi-label classification with low-frequency emotions.

## **Key Observations**
* Adding the attention mechanism allowed the model to focus on important tokens more effectively, though low-frequency emotions remain challenging.
* Dropout and L2 regularization reduced overfitting.
* Hyperparameter tuning improved stability and convergence. This part was the most tedious, but also the most rewarding once I found the right balance.

## **What I Learned**
This project helped me understand how deep learning models process text and how regularization and validation strategies improve generalization. It showed me how much work goes into building a deep learning model and how hard it is to achieve a perfect score. I learned that progress comes from making incremental improvements rather than aiming for perfection. I also learned the importance of generalization: a model should perform well not just on one specific dataset, but across other datasets and real-world applications.

## **Future Improvements**

In the future, I would like to focus on improving performance for the low-frequency emotions. Possible approaches include using a pretrained transformer with built-in language knowledge, experimenting with class weights, and using cross-validation for more robust evaluation. Overall, I learned a lot from this project and am excited to continue working with deep learning models.