# Toxic Comment Classification Challenge

The goal of this challenge is to develop methods for detecting and classifying toxicity in online comments, aiming to outperform the state-of-the-art models available via the Perspective API. This notebook walks through training machine learning models to classify six toxicity types and analyzes the results using accuracy, recall, precision, F1-score, and the challenge's main metric, mean column-wise ROC AUC.

## Step 1. Load in required libraries

In [None]:
import pandas as pd
import nltk
from nltk.tokenize import sent_tokenize, word_tokenize
import matplotlib.pyplot as plt
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
import re
from sklearn.linear_model import LogisticRegression
from sklearn.multiclass import OneVsRestClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score, f1_score, recall_score, precision_score, roc_auc_score, confusion_matrix, classification_report
from sklearn.svm import LinearSVC
import spacy
import numpy as np
import pickle
import os
import seaborn as sns
import warnings
from wordcloud import WordCloud
from sklearn.exceptions import InconsistentVersionWarning

In [None]:
# Suppress InconsistentVersionWarning present in some of the output
warnings.filterwarnings("ignore", category=InconsistentVersionWarning)

## Step 2. Load the data
**`train.csv` must be placed in a folder named `train`, and `test.csv` and `test_labels.csv` in a folder named `test`. Otherwise the cells below cannot load the data and the notebook will not run.**

In [None]:
# Used to load in the training and test data. 
train_df = pd.read_csv("train/train.csv")
test_df = pd.read_csv("test/test.csv")
test_labels = pd.read_csv("test/test_labels.csv")
print(len(train_df))
print(len(test_df))
print(len(test_labels))


### Build evaluation test set to match training data's structure
The `train.csv` file contains 8 columns: an ID (unique row identifier), comment text, and 6 toxic class labels. In contrast, `test.csv` has only 2 columns (ID and comment text), while `test_labels.csv` includes the ID and 6 class labels.

This structure works for generating a submission file matching the format of `sample_submission.csv`. However, to evaluate models, we need class labels to compare predictions. Some rows in `test_labels.csv` have -1 for all labels, indicating they are not used for scoring. The code below merges `test.csv` with `test_labels.csv`, excluding rows where all class labels are -1, leaving only rows used for evaluation to assess model performance locally.

In [None]:
# Print the columns of all three dataframes
print(test_df.columns)
print(test_labels.columns)
print(train_df.columns)

# Merges the test dataset into one since they both have the row ID columns
merged_df = pd.merge(test_df, test_labels, on='id')

# The class label columns
label_columns = ['toxic', 'severe_toxic', 'obscene', 'threat', 'insult', 'identity_hate']

# Filter out rows where the sum of the specified columns equals -6 (This indicates rows that are not used for scoring evaluation and therefore have placeholder labels)
final_test_df = merged_df[merged_df[label_columns].sum(axis=1) != -6]

# Preview the final test dataframe (a bare head() is only displayed when it is the last expression in a cell)
print(final_test_df.head())

print(len(final_test_df))
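
As a quick sanity check of the filtering rule above, the sketch below applies it to a few made-up rows (the IDs, comments, and labels are invented for illustration). It also shows an equivalent, slightly more explicit filter that keeps a row only if none of its labels is -1; this is an alternative formulation, not what the notebook's cell uses.

```python
import pandas as pd

label_columns = ['toxic', 'severe_toxic', 'obscene', 'threat', 'insult', 'identity_hate']

# Toy stand-ins for test.csv and test_labels.csv (all values are invented)
toy_test = pd.DataFrame({'id': ['a', 'b', 'c'],
                         'comment_text': ['first', 'second', 'third']})
toy_labels = pd.DataFrame(
    [['a', 0, 0, 0, 0, 0, 0],        # scored row, no toxic labels
     ['b', -1, -1, -1, -1, -1, -1],  # placeholder row, not scored
     ['c', 1, 0, 1, 0, 1, 0]],       # scored row with three labels
    columns=['id'] + label_columns)

toy_merged = pd.merge(toy_test, toy_labels, on='id')

# Rule used in the notebook: drop rows whose labels sum to -6
by_sum = toy_merged[toy_merged[label_columns].sum(axis=1) != -6]

# Alternative: keep rows where no label equals -1
by_mask = toy_merged[(toy_merged[label_columns] != -1).all(axis=1)]

print(by_sum['id'].tolist())
print(by_mask['id'].tolist())
```

Both filters keep the same rows here because a row's labels are either all -1 or all 0/1.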

## Step 3. Perform EDA (Exploratory Data Analysis)

In [None]:
train_df.columns # lists all columns

In [None]:
train_df.describe() # Summary statistics for the numeric label columns. There are no missing label values (count matches the number of rows)

In [None]:
len(train_df[train_df['comment_text'].isnull()]) # This checks the number of missing comments in the dataset

## Step 4. Analyze class distribution

In [None]:
# Download the Punkt tokenizer models that sent_tokenize and word_tokenize rely on
nltk.download('punkt')
nltk.download('punkt_tab')

The code below adds a column named `no_toxic_label` that flags rows with no class label assigned (all six class labels are 0). Counting the flagged rows shows how many non-toxic comments are present in the training data.

In [None]:
# Create a copy of the dataframe to make modifications without affecting the actual training data
data_analysis = train_df.copy()

# Sum the label values in each row, from the column at position 2 (the third column, which is the first label column, toxic) through the final column
row_sums = data_analysis.iloc[:, 2:].sum(axis=1)

# Add a new column named no_toxic_label used to identify all unlabelled comments
data_analysis['no_toxic_label'] = (row_sums == 0).astype(int)

# Find rows where the sum of all class labels is equal to zero (indicating unlabelled comments)
rows_with_zero_sum = data_analysis[row_sums == 0]

print(f"Count of rows that do not have any labels assigned to them indicating they are not toxic: {len(rows_with_zero_sum)}.")  # Count of unlabelled comments

print(f"Percentage of all comments that have no toxic labels: {round(len(rows_with_zero_sum)/ len(train_df) * 100,2)}%.") # We can see 89.83% of all comments have no labels

In [None]:
# Columns representing class labels
classes = ['toxic', 'severe_toxic', 'obscene', 'threat', 'insult', 'identity_hate', 'no_toxic_label']

# Step 1: Precompute sentences and tokens for all comments
data_analysis["sentence_count"] = data_analysis["comment_text"].apply(lambda text: len(sent_tokenize(text)))
data_analysis["token_count"] = data_analysis["comment_text"].apply(lambda text: len(word_tokenize(text)))

# Step 2: Aggregate counts for each class
class_sentence_counts = {}
class_token_counts = {}

# Loop through each class label and select the rows that have a value of 1 for that label. Then compute the total sentence count and token count
# for the class and store them in the respective dictionaries
for class_label in classes:
    class_rows = data_analysis[data_analysis[class_label] == 1]
    class_sentence_counts[class_label] = class_rows["sentence_count"].sum()
    class_token_counts[class_label] = class_rows["token_count"].sum()

The code below prints, for each class, the total sentences, the total tokens, the percentage of all sentences and of all tokens belonging to that class, and finally the class distribution.

In [None]:
# This will simply get the total tokens and sentences in the entire dataset
total_tokens = data_analysis["token_count"].sum()
total_sentences = data_analysis["sentence_count"].sum()

# Output results in a neat formatted way
print(f"{'Class':<15}{'Total Sentences':>20}{'Total Tokens':>20}{'Pct Sentences':>20}{'Pct Tokens':>20}")
print("-" * 100)

# For each class label, print out the class label, the total sentence count, total token count, percentage of all sentences belonging to this class and percentage of all tokens belonging to this class
for class_label in classes:
    total_sentences_class = class_sentence_counts[class_label]
    total_tokens_class = class_token_counts[class_label]
    percentage_sentences = round((total_sentences_class / total_sentences) * 100, 1)
    percentage_tokens = round((total_tokens_class / total_tokens) * 100, 1)
    
    print(f"{class_label:<15}{total_sentences_class:>20,}{total_tokens_class:>20,}{percentage_sentences:>20.1f}%{percentage_tokens:>20.1f}%")

print("-" * 100)

# Display class distribution summary
print("\nClass Distribution Summary:")
class_distribution = data_analysis[classes].sum()
print(class_distribution)

The code below plots these distribution values as three bar charts.

In [None]:
# Extract the labels and counts from the token dictionary
labels = list(class_token_counts.keys())
token_counts = list(class_token_counts.values())

# Extract the counts from the sentence dictionary
sentence_counts = list(class_sentence_counts.values())

# Create a figure with 3 subplots (1 row, 3 columns)
fig, axs = plt.subplots(1, 3, figsize=(18, 6))

# Plot class distribution (number of comments carrying each label)
axs[0].bar(class_distribution.index, class_distribution.values, color='lightskyblue')
axs[0].set_title('Class Distribution')
axs[0].set_xlabel('Toxicity Classes')
axs[0].set_ylabel('Number of Instances')
axs[0].tick_params(axis='x', rotation=45)

# Plot the sentence counts
axs[1].bar(labels, sentence_counts, color='lightcoral')
axs[1].set_title('Number of Sentences per Class')
axs[1].set_xlabel('Toxicity Classes')
axs[1].set_ylabel('Number of Sentences')
axs[1].tick_params(axis='x', rotation=45)

# Plot the token counts
axs[2].bar(labels, token_counts, color='lightgreen')
axs[2].set_title('Number of Tokens per Class')
axs[2].set_xlabel('Toxicity Classes')
axs[2].set_ylabel('Number of Tokens')
axs[2].tick_params(axis='x', rotation=45)

# Adjust layout to prevent overlap
plt.tight_layout()

# Show the plot
plt.show()

### Final thoughts on class distribution

Based on the analysis above, roughly 90% of all comments have no toxic label, so only about 10% of the training examples show the model what a toxic comment looks like. The dataset is therefore imbalanced, which inherently biases models toward predicting no toxic label. Among the toxic labels, `toxic` has the most examples, sentences, and tokens, considerably more than the least frequent classes: `threat`, `severe_toxic`, and `identity_hate`. Together these three make up about 3% of all training examples, which makes them challenging for the model to predict.
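
One common way to compensate for this kind of imbalance is class weighting, which the models later in this notebook request via `class_weight='balanced'`. The sketch below is only an illustration on an invented 90/10 label column; it uses scikit-learn's `compute_class_weight` to show the weights that the balanced heuristic would assign.

```python
import numpy as np
from sklearn.utils.class_weight import compute_class_weight

# Toy binary label column: 90 non-toxic (0) and 10 toxic (1) comments (invented data)
y = np.array([0] * 90 + [1] * 10)

# 'balanced' weights are n_samples / (n_classes * count_of_class)
weights = compute_class_weight(class_weight='balanced', classes=np.array([0, 1]), y=y)

for cls, weight in zip([0, 1], weights):
    print(f"class {cls}: weight {weight:.3f}")
```

The rare class receives a weight nine times larger than the common one, so each toxic example counts for more during training.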

## Most Common Words per Class

The code below prints out the top 10 most frequently occurring words per class to understand the most common words for each of them.

In [None]:
# Initialize CountVectorizer with parameters to ignore common stop words
vectorizer = CountVectorizer(stop_words='english', max_features=10)

# Prepare a list to store results
results = []

# Calculate and print top words per class
for class_label in classes:
    # Filter comments for the current class
    class_comments = data_analysis[data_analysis[class_label] == 1]['comment_text']
    
    # Transform the text into word frequency counts
    word_counts = vectorizer.fit_transform(class_comments)
    
    # Get the most common words
    common_words = vectorizer.get_feature_names_out()
    word_freq = word_counts.sum(axis=0).A1

    # Sort the words by frequency (highest to lowest)
    sorted_word_freq, sorted_common_words = zip(*sorted(zip(word_freq, common_words), reverse=True))
    
    # Store results in the list
    for word, freq in zip(sorted_common_words, sorted_word_freq):
        results.append({"Class": class_label, "Word": word, "Frequency": freq})

# Adjust pandas display settings to prevent truncation
pd.set_option('display.max_rows', None)  # Show all rows
pd.set_option('display.max_columns', None)  # Show all columns
pd.set_option('display.max_colwidth', None)  # Show full width of each column
pd.set_option('display.expand_frame_repr', False)  # Prevent wrapping of the table

# Convert results into a pandas DataFrame
results_df = pd.DataFrame(results)

# Print or save the table
print(results_df)

The code below plots the same top-10 words as bar charts, one chart per class.

In [None]:
# Create subplots
fig, axs = plt.subplots(1, 7, figsize=(24, 8))

# Loop through each class and plot the most common words
for idx, class_label in enumerate(classes):
    # Filter comments for the current class
    class_comments = data_analysis[data_analysis[class_label] == 1]['comment_text']
    
    # Transform the text into word frequency counts
    word_counts = vectorizer.fit_transform(class_comments)
    
    # Get the most common words and their frequencies
    common_words = vectorizer.get_feature_names_out()
    word_freq = word_counts.sum(axis=0).A1
    
    # Sort the words by frequency (highest to lowest)
    sorted_word_freq, sorted_common_words = zip(*sorted(zip(word_freq, common_words), reverse=True))
    
    # Plot the sorted words and their frequencies
    axs[idx].bar(sorted_common_words, sorted_word_freq, color='skyblue')
    axs[idx].set_title(f"Most Common Words for {class_label.capitalize()}")
    axs[idx].set_xlabel('Words')
    axs[idx].set_ylabel('Frequency')
    axs[idx].tick_params(axis='x', rotation=45)

# Adjust layout to prevent overlap
plt.tight_layout()

# Show the plot
plt.show()

The code below draws a word cloud of the most common words for each class.

In [None]:
# Create subplots (one for each class)
fig, axs = plt.subplots(1, 7, figsize=(24, 6))

# Loop through each class and generate a word cloud
for idx, class_label in enumerate(classes):
    # Filter comments for the current class
    class_comments = data_analysis[data_analysis[class_label] == 1]['comment_text']
    
    # Combine all comments into a single string
    combined_comments = " ".join(class_comments)
    
    # Generate the word cloud
    wordcloud = WordCloud(width=800, height=400, background_color='white').generate(combined_comments)
    
    # Plot the word cloud
    axs[idx].imshow(wordcloud, interpolation='bilinear')
    axs[idx].set_title(f"Word Cloud for {class_label.capitalize()}")
    axs[idx].axis('off')  # Hide axes for better visualization

# Adjust layout to prevent overlap
plt.tight_layout()

# Show the plot
plt.show()


## Step 5. Preprocess the data

In [None]:
def preprocess_text(text):
    """
    Preprocess the comment text to prepare it for the feature extraction methods.
    """
    # Lower case all letters
    text = text.lower()
    # Remove any unwanted characters and replace with a space
    text = re.sub(r'[^a-zA-Z0-9\s.,!?\'"-]', ' ', text)
    # Replace multiple spaces with a single space
    text = re.sub(r'\s+', ' ', text)  # raw string avoids the invalid escape sequence warning for \s
    # Removes any leading or trailing spaces
    text = text.strip(' ')
    return text
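
To make the effect of these steps concrete, here is a small self-contained sketch that repeats the same cleaning rules and applies them to an invented sample string.

```python
import re

def preprocess_text(text):
    """Lower-case, replace unwanted characters with spaces, and collapse whitespace."""
    text = text.lower()
    text = re.sub(r'[^a-zA-Z0-9\s.,!?\'"-]', ' ', text)
    text = re.sub(r'\s+', ' ', text)
    return text.strip(' ')

sample = "  Hello,   WORLD!! <3 #great  "
cleaned = preprocess_text(sample)
print(repr(cleaned))  # → 'hello, world!! 3 great'
```

Basic punctuation is kept, while symbols such as `<` and `#` become spaces that are then collapsed.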

In [None]:
print(train_df['comment_text'].head())  # Used to show the training comments before preprocessing
train_df_preprocessed = train_df.copy() # Create a copy of the training data so we do not use the original dataframe
train_df_preprocessed['comment_text'] = train_df['comment_text'].map(lambda text: preprocess_text(text))    # Pass each comment text into the function to be preprocessed
print(train_df_preprocessed['comment_text'].head()) # Shows the results of the preprocessing on a few examples

In [None]:
print(final_test_df['comment_text'].head())  # Used to show the evaluation test comments before preprocessing
test_df_preprocessed = final_test_df.copy() # Create a copy of the evaluation test data so we do not modify the original dataframe
test_df_preprocessed['comment_text'] = final_test_df['comment_text'].map(lambda text: preprocess_text(text))    # Pass each comment text into the function to be preprocessed
print(test_df_preprocessed['comment_text'].head()) # Shows the results of the preprocessing on a few examples

print(len(test_df_preprocessed))

In [None]:
print(test_df['comment_text'].head())  # Used to show the submission test comments before preprocessing
submission_test_df_preprocessed = test_df.copy() # Create a copy of the submission test data so we do not modify the original dataframe
submission_test_df_preprocessed['comment_text'] = submission_test_df_preprocessed['comment_text'].map(lambda text: preprocess_text(text))    # Pass each comment text into the function to be preprocessed
print(submission_test_df_preprocessed['comment_text'].head()) # Shows the results of the preprocessing on a few examples

print(len(submission_test_df_preprocessed))

## Step 6a. Use TfidfVectorizer to vectorize the data
To train ML algorithms on comment text, we need to convert it into a numeric representation that extracts key features. One method is `TfidfVectorizer`, which computes the term frequency-inverse document frequency (TF-IDF). This method assigns higher weights to terms that are frequent within a document but rare across the corpus, making them more informative. Common words like "the", "or", and "for" are de-emphasized; with `stop_words='english'`, as used below, such words are removed entirely.

In [None]:
# Initialize a TfidfVectorizer to first fit our training data and then convert the evaluation and submission test datasets
vectorizer = TfidfVectorizer(max_features=5000, stop_words='english')

In [None]:
# Extract the comment text columns from all 3 preprocessed dataframes
train_X = train_df_preprocessed['comment_text']
test_X = test_df_preprocessed['comment_text']
submission_test_X = submission_test_df_preprocessed['comment_text']

In [None]:
# Fit and then transform the training comments
train_X_vectorized = vectorizer.fit_transform(train_X)
train_X_vectorized

In [None]:
# Transform the evaluation test comments
test_X_vectorized = vectorizer.transform(test_X)
test_X_vectorized

In [None]:
# Transform the submission test comments
submission_test_X_vectorized = vectorizer.transform(submission_test_X)
submission_test_X_vectorized

## Step 6b. Use spaCy to convert the comments into word embeddings
Another method is to use word embeddings instead of count-based vectorization. Pretrained embeddings, like Word2Vec or GloVe, assign a numeric array to each word, capturing semantic relationships between words. By looking up the words of a comment in such a vocabulary, each word is converted into its numeric embedding, which can then be used to train ML algorithms. In this case we use spaCy's `en_core_web_md` model and take each comment's `doc.vector`, which by default is the average of its token vectors, so every comment becomes one fixed-length array.
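
The averaging idea can be sketched without loading a spaCy model. The example below uses a hypothetical three-dimensional vector table and a hypothetical helper, `average_embedding`; it illustrates how a comment-level vector can be formed as a mean of word vectors and does not reproduce spaCy's internals.

```python
import numpy as np

# Hypothetical 3-dimensional word vectors (real pretrained vectors are much longer)
toy_vectors = {
    "you": np.array([0.1, 0.3, -0.2]),
    "are": np.array([0.0, 0.5, 0.1]),
    "great": np.array([0.8, -0.1, 0.4]),
}

def average_embedding(tokens, vectors, dim=3):
    """Average the vectors of the known tokens; return zeros if none are known."""
    known = [vectors[t] for t in tokens if t in vectors]
    if not known:
        return np.zeros(dim)
    return np.mean(known, axis=0)

comment_vector = average_embedding("you are great".split(), toy_vectors)
print(comment_vector)
```

Averaging discards word order, which is one reason a comment-level mean vector can lose information that a model might otherwise use.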

In [None]:
def generate_word_embeddings(text, output_file):
    """
        This function aims to load in a spacy model and convert the passed in text to its word embedding form. It will then save the data to file. If the embedding file already exists,
        it will simply load it in instead of generating new embeddings to save on compute time.
    """
    try:
        # If the embedding file exists, load it
        if os.path.exists(output_file):
            print(f"Loading embeddings from {output_file}.")
            embeddings = np.load(output_file)
        else:
            # Generate new embeddings
            print(f"The file {output_file} could not be found. Generating new embeddings...")
            nlp = spacy.load("en_core_web_md")  # Load the spaCy model
            embeddings = np.array([doc.vector for doc in nlp.pipe(text, batch_size=128)])  # Generate embeddings
            
            # Ensure the directory for the output file exists
            os.makedirs(os.path.dirname(output_file), exist_ok=True)
            
            # Save the embeddings to the output file
            np.save(output_file, embeddings)
            print(f"Embeddings saved to {output_file}.")
    except Exception as e:
        print(f"An error occurred: {e}")
        raise
    return embeddings


In [None]:
# This will get the embeddings for all training comment texts
train_embeddings = generate_word_embeddings(train_df_preprocessed['comment_text'], "embeddings/train_embeddings.npy")

In [None]:
# This will get the embeddings for all evaluation test comment texts
test_embeddings = generate_word_embeddings(test_df_preprocessed['comment_text'], "embeddings/test_embeddings.npy")

In [None]:
# This will get the embeddings for all submission test comment texts
submission_test_embeddings = generate_word_embeddings(submission_test_df_preprocessed['comment_text'], "embeddings/submission_train_embeddings.npy")  # File name kept as-is so an existing cached file is still found

## Step 7. Building Multilabel Classification Model Functions
Since we aim to predict 6 class labels per comment, this is a multi-label problem. OneVsRestClassifier handles this by building 6 binary classifiers, each focusing on one label. For example, one classifier predicts whether a comment is toxic (1) or not toxic (0).
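
A minimal sketch of this behaviour on synthetic data is shown below. The data comes from scikit-learn's `make_multilabel_classification`, and the sizes and hyperparameters are arbitrary choices for illustration, not the notebook's settings.

```python
from sklearn.datasets import make_multilabel_classification
from sklearn.linear_model import LogisticRegression
from sklearn.multiclass import OneVsRestClassifier

# Synthetic multi-label data: 200 samples, 20 features, 6 binary labels per sample
X_toy, Y_toy = make_multilabel_classification(
    n_samples=200, n_features=20, n_classes=6, n_labels=2, random_state=0)

ovr = OneVsRestClassifier(LogisticRegression(max_iter=1000))
ovr.fit(X_toy, Y_toy)

# One fitted binary estimator per label column
print(len(ovr.estimators_))
# Predictions come back as a 0/1 indicator matrix with one column per label
print(ovr.predict(X_toy[:3]).shape)
```

Because each label gets its own classifier, a comment can receive several labels at once or none at all.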

In [None]:
# List all the class labels and access those columns from the training and evaluation test set
label_columns = ['toxic', 'severe_toxic', 'obscene', 'threat', 'insult', 'identity_hate']
train_y = train_df_preprocessed[label_columns]
test_y = test_df_preprocessed[label_columns]

### Logistic Regression Model

In [None]:
def train_logistic_regression_model(train_x, train_y, test_x, model_filename, submission_test_x=None, test_ids=None, submission_filename=None):
    """
    Train a Logistic Regression model and optionally create a submission file.

    Parameters:
    - train_x: Training feature matrix (numpy array if it is word embeddings or sparse matrix if vectorized).
    - train_y: Training target labels (pandas Dataframe).
    - test_x: Test feature matrix for evaluation (numpy array if it is word embeddings or sparse matrix if vectorized).
    - model_filename: File name to save the trained model.
    - submission_test_x (optional): Feature matrix for submission predictions (numpy array if it is word embeddings or sparse matrix if vectorized).
    - test_ids (optional): IDs corresponding to `submission_test_x`.
    - submission_filename (optional): File name to save the submission file.
    """
    logistic_regression_model = None

    # Try to load the model if it exists, otherwise train a new one
    try:
        if os.path.exists(model_filename):
            with open(model_filename, "rb") as f:
                logistic_regression_model = pickle.load(f)
        else:
            raise FileNotFoundError(f"Model file '{model_filename}' not found.")
    except FileNotFoundError as e:
        print(e)
        print("Training a new Logistic Regression model...")
        logistic_regression_model = OneVsRestClassifier(LogisticRegression(max_iter=5000, C=3, class_weight='balanced'))
        logistic_regression_model.fit(train_x, train_y)

        # Save the trained model to a file
        os.makedirs(os.path.dirname(model_filename), exist_ok=True)  # Ensure the directory exists
        with open(model_filename, "wb") as f:
            pickle.dump(logistic_regression_model, f)
        print(f"Model saved to {model_filename}.")

    # If a submission filename, test IDs, and submission features are all provided, create the submission file
    # (shape[0] works for both sparse matrices and numpy arrays, unlike getnnz())
    if (submission_filename and test_ids is not None and len(test_ids) > 0 and submission_test_x is not None and submission_test_x.shape[0] > 0):
        # Generate predictions for the submission test dataset
        submission_logistic_regression_predictions = logistic_regression_model.predict(submission_test_x)

        # Create a DataFrame with the predictions and the corresponding IDs
        submission_logistic_regression_predictions_df = pd.DataFrame(submission_logistic_regression_predictions, columns=train_y.columns)
        submission_logistic_regression_predictions_df.insert(0, 'id', test_ids.reset_index(drop=True))
        # Ensures the directory exists first before attempting to write the file
        os.makedirs(os.path.dirname(submission_filename), exist_ok=True)  # Ensure the directory exists

        # Save the predictions to a CSV file
        submission_logistic_regression_predictions_df.to_csv(submission_filename, index=False, encoding='utf-8')
        print(f"Predictions saved to {submission_filename}.")

    # Predict on the test set
    logistic_regression_predictions = logistic_regression_model.predict(test_x)
    return logistic_regression_predictions

### SVM Model

In [None]:
def train_svm_model(train_x, train_y, test_x, model_filename, submission_test_x=None, test_ids=None, submission_filename=None):
    """
    Train an SVM model and optionally create a submission file.

    Parameters:
    - train_x: Training feature matrix (numpy array if it is word embeddings or sparse matrix if vectorized).
    - train_y: Training target labels (pandas Dataframe).
    - test_x: Test feature matrix for evaluation (numpy array if it is word embeddings or sparse matrix if vectorized).
    - model_filename: File name to save the trained model.
    - submission_test_x (optional): Feature matrix for submission predictions (numpy array if it is word embeddings or sparse matrix if vectorized).
    - test_ids (optional): IDs corresponding to `submission_test_x`.
    - submission_filename (optional): File name to save the submission file.
    """
    svm_model = None

    # Try to load the model if it exists, otherwise train a new one
    try:
        if os.path.exists(model_filename):
            with open(model_filename, "rb") as f:
                svm_model = pickle.load(f)
        else:
            raise FileNotFoundError(f"Model file '{model_filename}' not found.")
    except FileNotFoundError as e:
        print(e)
        print("Training a new SVM model...")
        svm_model = OneVsRestClassifier(LinearSVC(random_state=42, max_iter=1000, class_weight="balanced"))
        svm_model.fit(train_x, train_y)

        # Save the trained model to a file
        os.makedirs(os.path.dirname(model_filename), exist_ok=True)  # Ensure the directory exists
        with open(model_filename, "wb") as f:
            pickle.dump(svm_model, f)
        print(f"Model saved to {model_filename}.")

    # If a submission filename, test IDs, and submission features are all provided, create the submission file
    # (shape[0] works for both sparse matrices and numpy arrays, unlike getnnz())
    if (submission_filename and test_ids is not None and len(test_ids) > 0 and submission_test_x is not None and submission_test_x.shape[0] > 0):
        # Generate predictions for the submission test dataset
        submission_svm_predictions = svm_model.predict(submission_test_x)
        # Create a DataFrame with the predictions and the corresponding IDs
        submission_svm_predictions_df = pd.DataFrame(submission_svm_predictions, columns=train_y.columns)
        submission_svm_predictions_df.insert(0, 'id', test_ids.reset_index(drop=True))

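        # Ensure the output directory exists, then save the predictions to a CSV file
        os.makedirs(os.path.dirname(submission_filename), exist_ok=True)
        submission_svm_predictions_df.to_csv(submission_filename, index=False, encoding='utf-8')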
        print(f"Predictions saved to {submission_filename}.")

    # Predict on test data
    svm_predictions = svm_model.predict(test_x)
    return svm_predictions


### Random Forest Model

In [None]:
def train_random_forest_model(train_x, train_y, test_x, model_filename, submission_test_x=None, test_ids=None, submission_filename=None):
    """
    Train a Random Forest model and optionally create a submission file.

    Parameters:
    - train_x: Training feature matrix (numpy array if it is word embeddings or sparse matrix if vectorized).
    - train_y: Training target labels (pandas Dataframe).
    - test_x: Test feature matrix for evaluation (numpy array if it is word embeddings or sparse matrix if vectorized).
    - model_filename: File name to save the trained model.
    - submission_test_x (optional): Feature matrix for submission predictions (numpy array if it is word embeddings or sparse matrix if vectorized).
    - test_ids (optional): IDs corresponding to `submission_test_x`.
    - submission_filename (optional): File name to save the submission file.
    """
    random_forest_model = None

    # Try to load the model if it exists, otherwise train a new one
    try:
        if os.path.exists(model_filename):
            with open(model_filename, "rb") as f:
                random_forest_model = pickle.load(f)
        else:
            raise FileNotFoundError(f"Model file '{model_filename}' not found.")
    except FileNotFoundError as e:
        print(e)
        print("Training a new Random Forest model...")
        random_forest_model = OneVsRestClassifier(RandomForestClassifier())
        random_forest_model.fit(train_x, train_y)

        # Save the trained model to a file
        os.makedirs(os.path.dirname(model_filename), exist_ok=True)  # Ensure the directory exists
        with open(model_filename, "wb") as f:
            pickle.dump(random_forest_model, f)
        print(f"Model saved to {model_filename}.")

    # If a submission filename, test IDs, and submission features are all provided, create the submission file
    # (shape[0] works for both sparse matrices and numpy arrays, unlike getnnz())
    if (submission_filename and test_ids is not None and len(test_ids) > 0 and submission_test_x is not None and submission_test_x.shape[0] > 0):
        # Generate predictions for the submission test dataset
        submission_random_forest_predictions = random_forest_model.predict(submission_test_x)
        # Create a DataFrame with the predictions and the corresponding IDs
        submission_random_forest_predictions_df = pd.DataFrame(submission_random_forest_predictions, columns=train_y.columns)
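        submission_random_forest_predictions_df.insert(0, 'id', test_ids.reset_index(drop=True))

        # Ensure the output directory exists, then save the predictions to a CSV file
        os.makedirs(os.path.dirname(submission_filename), exist_ok=True)
        submission_random_forest_predictions_df.to_csv(submission_filename, index=False, encoding='utf-8')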
        print(f"Predictions saved to {submission_filename}.")

    random_forest_predictions = random_forest_model.predict(test_x)
    return random_forest_predictions
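
The three functions above share the same load-or-train caching pattern. As a design sketch only, that pattern could be factored into a single helper; `load_or_train` below is a hypothetical name, and the toy data and file path are invented for the example.

```python
import os
import pickle
import tempfile

from sklearn.linear_model import LogisticRegression

def load_or_train(model, model_filename, train_x, train_y):
    """Return (model, was_loaded): load a pickled model if present, else fit and save it."""
    if os.path.exists(model_filename):
        with open(model_filename, "rb") as f:
            return pickle.load(f), True
    model.fit(train_x, train_y)
    directory = os.path.dirname(model_filename)
    if directory:  # os.makedirs("") raises, so skip it for bare file names
        os.makedirs(directory, exist_ok=True)
    with open(model_filename, "wb") as f:
        pickle.dump(model, f)
    return model, False

# Toy usage in a temporary directory (data and file name are invented)
with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "models", "toy_model.pkl")
    X_toy, y_toy = [[0.0], [1.0], [2.0], [3.0]], [0, 0, 1, 1]
    _, loaded_first = load_or_train(LogisticRegression(), path, X_toy, y_toy)
    _, loaded_second = load_or_train(LogisticRegression(), path, X_toy, y_toy)

print(loaded_first, loaded_second)  # → False True
```

Note that a cached pickle is loaded regardless of the hyperparameters currently written in the code, so a stale file has to be deleted to retrain. Pickle files should also only be loaded from trusted sources.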

## Step 8. Train models on the vectorized comments

In [None]:
# Extract the submission test IDs so each prediction can be matched to its row in the submission file
test_ids = submission_test_df_preprocessed['id']

# Logistic Regression model
logistic_regression_predictions_vectorized = train_logistic_regression_model(train_X_vectorized, train_y, test_X_vectorized, "models/Vectorized/Tuned/tuned_logistic_regression_vectorized.pkl", submission_test_X_vectorized, test_ids, "submission/vectorized_logistic_regression_submission.csv")
# SVM Model
svm_predictions_vectorized = train_svm_model(train_X_vectorized, train_y, test_X_vectorized, "models/Vectorized/Tuned/tuned_svm_vectorized.pkl", submission_test_X_vectorized, test_ids, "submission/vectorized_svm_submission.csv")

# Random Forest Model
random_forest_predictions_vectorized = train_random_forest_model(train_X_vectorized, train_y, test_X_vectorized, "models/Vectorized/Untuned/random_forest_classifier_vectorized.pkl")

## Step 9. Evaluate Model Performance of vectorized approach

In [None]:
def evaluate_model_performance(predictions, actual, model_name, feature_extraction, special_features=None):
    """
        Function that takes in the predictions and actual labels and evaluates the model's performance.

        Parameters:
        - predictions: predictions generated by the model
        - actual: actual ground truth class labels
        - model_name: the name of the model passed in as a string
        - feature_extraction: the name of the method passed in as a string
        - special_features (optional): list of all special hyperparameters for the model
    """
    # Convert actual to numpy array if it is a DataFrame
    actual = actual.values if isinstance(actual, pd.DataFrame) else actual

    # Printing out metrics like accuracy, f1 score, precision score, and recall score
    print(f"Evaluating the {model_name} with {feature_extraction} features using accuracy, F1 score, precision, recall, and mean column-wise ROC AUC.")
    if special_features:
        print(f"This model uses {special_features}.")
    print("A classification report and per-label confusion matrices follow.")
    print("Accuracy:", accuracy_score(actual, predictions))
    print("F1 Score (Micro):", f1_score(actual, predictions, average='micro'))
    print("Precision (Micro):", precision_score(actual, predictions, zero_division=0, average='micro'))
    print("Recall (Micro):", recall_score(actual, predictions, average='micro'))

    # Mean Column-Wise ROC AUC
    try:
        roc_auc_per_label = []
        # Loop through each class label, compute its ROC AUC, and collect the scores so their mean gives the mean column-wise ROC AUC.
        # Note: when `predictions` holds hard 0/1 labels from predict(), this value only approximates the challenge metric,
        # which is defined on continuous scores or probabilities.
        for i, label in enumerate(label_columns):
            roc_auc_label = roc_auc_score(actual[:, i], predictions[:, i])
            roc_auc_per_label.append(roc_auc_label)
        mean_roc_auc = np.mean(roc_auc_per_label)
        print("Mean Column-Wise ROC AUC:", mean_roc_auc)
    except ValueError:
        print("Mean Column-Wise ROC AUC could not be calculated. Ensure predictions are probabilities or scores.")

    print("\nClassification Report:\n", classification_report(actual, predictions, zero_division=0, target_names=label_columns))
    
    # Set up subplots for the confusion matrices
    fig, axes = plt.subplots(1, len(label_columns), figsize=(20, 4))
    fig.suptitle("Confusion Matrices for Each Label", fontsize=16)
    
    for i, label in enumerate(label_columns):
        actual_label = actual[:, i]
        predictions_label = predictions[:, i]
        
        cm = confusion_matrix(actual_label, predictions_label)
        
        # Plotting the confusion matrix for each label
        sns.heatmap(cm, annot=True, fmt="d", cmap="Blues", 
                    xticklabels=["Not " + label, label], 
                    yticklabels=["Not " + label, label],
                    cbar=False, ax=axes[i])
        axes[i].set_title(f"'{label}'")
        axes[i].set_xlabel('Predicted')
        axes[i].set_ylabel('Actual')
    
    plt.tight_layout(rect=[0, 0.03, 1, 0.95])  # Adjust layout to make room for the title
    plt.show()
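
ROC AUC is a ranking metric, so it is normally computed from continuous scores rather than thresholded labels. In scikit-learn such scores typically come from `predict_proba` (for example with `LogisticRegression` or `RandomForestClassifier`) or `decision_function` (for example with `LinearSVC`). The sketch below uses invented scores and a hypothetical helper, `mean_columnwise_roc_auc`, to show how much the metric can differ between scores and hard labels. It is an illustration, not a change to the evaluation function above.

```python
import numpy as np
from sklearn.metrics import roc_auc_score

def mean_columnwise_roc_auc(y_true, y_score):
    """Average the per-column ROC AUC over all label columns."""
    y_true, y_score = np.asarray(y_true), np.asarray(y_score)
    return float(np.mean([roc_auc_score(y_true[:, i], y_score[:, i])
                          for i in range(y_true.shape[1])]))

# Invented ground truth and scores for two labels and four comments
y_true = np.array([[0, 0], [0, 1], [1, 0], [1, 1]])
y_score = np.array([[0.10, 0.2], [0.30, 0.9], [0.40, 0.8], [0.45, 0.7]])

# Thresholding at 0.5 mimics the hard labels that predict() returns
y_hard = (y_score >= 0.5).astype(int)

auc_from_scores = mean_columnwise_roc_auc(y_true, y_score)
auc_from_labels = mean_columnwise_roc_auc(y_true, y_hard)
print(auc_from_scores, auc_from_labels)
```

In this toy case the first label is ranked perfectly by the scores but never crosses the 0.5 threshold, so the hard-label AUC for that column drops to chance level.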

#### Logistic Regression Model Performance

In [None]:
evaluate_model_performance(logistic_regression_predictions_vectorized, test_y, "Logistic Regression model", "TfidfVectorizer", "a C value of 3, balanced class weight, and max iterations of 5000")  # Evaluates performance of logistic regression that is tuned

#### SVM Model Performance

In [None]:
evaluate_model_performance(svm_predictions_vectorized, test_y, "SVM", "TfidfVectorizer", "C of 0.5, balanced class weight, and random state of 42")  # Evaluates performance of SVM. Achieved best results using class_weight="balanced". Note: train_svm_model does not set C, so a model trained from scratch uses LinearSVC's default C rather than 0.5

#### Random Forest Model Performance

In [None]:
evaluate_model_performance(random_forest_predictions_vectorized, test_y, "Random Forest", "TfidfVectorizer")  # Evaluates performance of Random Forest Model

## Step 10. Train models on the comment embeddings

In [None]:
# Extract test IDs to perform mapping operation
test_ids = test_df_preprocessed['id']

# Logistic Regression model
logistic_regression_predictions_embeddings = train_logistic_regression_model(train_embeddings, train_y, test_embeddings, "models/Embeddings/Tuned/tuned_logistic_regression_embeddings.pkl")
# SVM Model
svm_predictions_embeddings = train_svm_model(train_embeddings, train_y, test_embeddings, "models/Embeddings/Tuned/tuned_svm_classifier_embeddings.pkl")

# Random Forest Model. Only managed to train one random forest embedding model given the time it took to train
random_forest_predictions_embeddings = train_random_forest_model(train_embeddings, train_y, test_embeddings, "models/Embeddings/Tuned/tuned_random_forest_classifier_embeddings.pkl")

## Step 11. Evaluate embeddings approach

#### Logistic Regression Model Performance

In [None]:
evaluate_model_performance(logistic_regression_predictions_embeddings, test_y, "Logistic Regression", "Word Embeddings", "a C value of 3, balanced class weight, and max iterations of 5000")  # Evaluates performance of logistic regression

#### SVM Model Performance

In [None]:
evaluate_model_performance(svm_predictions_embeddings, test_y, "SVM", "Word Embeddings", "C of 0.5, balanced class weight, and random state of 42")  # Evaluates performance of SVM

#### Random Forest Model Performance

In [None]:
evaluate_model_performance(random_forest_predictions_embeddings, test_y, "Random Forest", "Word Embeddings")  # Evaluates performance of Random Forest