# Emotion-Aware Toxicity Detection

This project looks at how to automatically spot harmful comments on social media by combining standard toxicity indicators with emotional signals. I use the Jigsaw Unintended Bias dataset and combine simple TF–IDF text features with GoEmotions emotion probabilities, which capture feelings like anger, disgust, or annoyance. The idea is to check how these emotions relate to toxic language and whether they help the model make better decisions. Besides accuracy, I also look at fairness by examining how the model behaves on comments tied to different identity groups. This shows where adding emotional information helps and where it might cause problems. In the end, the project produces a small but useful pipeline that tests whether emotional context can improve harmful-content detection while cutting down on biased false positives.

## 1. Data Loading and Sanity Checks

I load train.csv and define the identity columns. Then I look at a few rows and the shape, check the average toxicity and the number of missing texts, add a text_len helper to see how long comments are, and look at how often each identity flag shows up. I also summarize the target distribution and the share above 0.5 to see the class imbalance. Finally, I remove rows without comment text.

In [12]:

from sklearn.metrics import roc_auc_score, average_precision_score, f1_score, precision_recall_curve, classification_report, confusion_matrix
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import StandardScaler
from sklearn.pipeline import Pipeline
from transformers import AutoTokenizer, AutoModelForSequenceClassification, Trainer, TrainingArguments
import torch
from torch.utils.data import Dataset
import pandas as pd
import numpy as np
from pathlib import Path
import os
import scipy.sparse as sp

train_df = pd.read_csv("train.csv")

# Used later for fairness and bias analysis. Each value is between 0 and 1: the fraction of annotators who said the comment mentions that identity (NaN if not annotated)
identity_cols = [
    'asian','atheist','bisexual','black','buddhist','christian','female','heterosexual','hindu',
    'homosexual_gay_or_lesbian','intellectual_or_learning_disability','jewish','latino','male','muslim',
    'other_disability','other_gender','other_race_or_ethnicity','other_religion',
    'other_sexual_orientation','physical_disability','psychiatric_or_mental_illness','transgender','white'
]

display(train_df.head())
print(f"Shape: {train_df.shape}")

# average toxicity score.
print(f"Target mean (toxicity): {train_df['target'].mean():.3f}")

print(f"Missing comment_text: {train_df['comment_text'].isna().sum()}")

# number of characters in each comment, plus summary statistics
train_df["text_len"] = train_df["comment_text"].str.len()
print(train_df["text_len"].describe().round(3))

# statistics for identity columns
iden_means = train_df[identity_cols].mean().sort_values(ascending=False)
print("Top identity indicators by mean:")
print(iden_means.head(10).round(3))

# target distribution; a comment counts as toxic if target > 0.5
print(train_df["target"].describe().round(3))
print(f"Targets > 0.5: {(train_df['target'] > 0.5).mean():.3f}")

train_df = train_df.dropna(subset=["comment_text"]).reset_index(drop=True)


Unnamed: 0,id,target,comment_text,severe_toxicity,obscene,identity_attack,insult,threat,asian,atheist,bisexual,black,buddhist,christian,female,heterosexual,hindu,homosexual_gay_or_lesbian,intellectual_or_learning_disability,jewish,latino,male,muslim,other_disability,other_gender,other_race_or_ethnicity,other_religion,other_sexual_orientation,physical_disability,psychiatric_or_mental_illness,transgender,white,created_date,publication_id,parent_id,article_id,rating,funny,wow,sad,likes,disagree,sexual_explicit,identity_annotator_count,toxicity_annotator_count
0,59848,0.0,"This is so cool. It's like, 'would you want yo...",0.0,0.0,0.0,0.0,0.0,,,,,,,,,,,,,,,,,,,,,,,,,2015-09-29 10:50:41.987077+00,2,,2006,rejected,0,0,0,0,0,0.0,0,4
1,59849,0.0,Thank you!! This would make my life a lot less...,0.0,0.0,0.0,0.0,0.0,,,,,,,,,,,,,,,,,,,,,,,,,2015-09-29 10:50:42.870083+00,2,,2006,rejected,0,0,0,0,0,0.0,0,4
2,59852,0.0,This is such an urgent design problem; kudos t...,0.0,0.0,0.0,0.0,0.0,,,,,,,,,,,,,,,,,,,,,,,,,2015-09-29 10:50:45.222647+00,2,,2006,rejected,0,0,0,0,0,0.0,0,4
3,59855,0.0,Is this something I'll be able to install on m...,0.0,0.0,0.0,0.0,0.0,,,,,,,,,,,,,,,,,,,,,,,,,2015-09-29 10:50:47.601894+00,2,,2006,rejected,0,0,0,0,0,0.0,0,4
4,59856,0.893617,haha you guys are a bunch of losers.,0.021277,0.0,0.021277,0.87234,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.25,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,2015-09-29 10:50:48.488476+00,2,,2006,rejected,0,0,0,1,0,0.0,4,47


Shape: (1804874, 45)
Target mean (toxicity): 0.103
Missing comment_text: 3
count    1804871.000
mean         297.235
std          269.197
min            1.000
25%           94.000
50%          202.000
75%          414.000
max         1906.000
Name: text_len, dtype: float64
Top identity indicators by mean:
female                           0.128
male                             0.109
christian                        0.095
white                            0.057
muslim                           0.049
black                            0.034
homosexual_gay_or_lesbian        0.026
jewish                           0.018
psychiatric_or_mental_illness    0.012
asian                            0.012
dtype: float64
count    1804874.000
mean           0.103
std            0.197
min            0.000
25%            0.000
50%            0.000
75%            0.167
max            1.000
Name: target, dtype: float64
Targets > 0.5: 0.059


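The dataset has about 1.8M rows and 45 columns; only 3 comments are missing text. About 5.9% of comments have a target above 0.5, so the toxic class is rare.
Top identity indicators: each identity column holds the fraction of annotators who said the comment mentions that identity, and pandas skips missing values when averaging. So female 0.128 is the mean annotator fraction for “female” among the comments that carry identity annotations, not the share of all comments that mention it.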

## 2. Data Preparation and Train/Validation Split

Convert the continuous toxicity score (target) into a binary label: 0 for non-toxic and 1 for toxic. Split the data into training (90%) and validation (10%) sets, making sure the class balance (ratio of toxic to non-toxic comments) stays the same in both splits.
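To make the two steps concrete, here is a minimal pure-Python sketch of binarizing at 0.5 and splitting per class. It is an illustration only: the helper name `stratified_split` and the toy targets are assumptions, and scikit-learn's `train_test_split` (used in the notebook) shuffles and rounds differently.

```python
import random
from collections import defaultdict

def stratified_split(labels, test_size=0.1, seed=42):
    """Hypothetical helper: split indices so each class keeps the same share in both parts."""
    by_class = defaultdict(list)
    for idx, y in enumerate(labels):
        by_class[y].append(idx)

    rng = random.Random(seed)
    train_idx, val_idx = [], []
    for idxs in by_class.values():
        rng.shuffle(idxs)                      # shuffle within the class
        n_val = round(len(idxs) * test_size)   # validation rows taken from this class
        val_idx.extend(idxs[:n_val])
        train_idx.extend(idxs[n_val:])
    return train_idx, val_idx

# Toy targets (made up). A score of exactly 0.5 is NOT toxic under "> 0.5".
targets = [0.0] * 900 + [0.5] * 40 + [0.9] * 60
labels = [int(t > 0.5) for t in targets]

train_idx, val_idx = stratified_split(labels)
train_rate = sum(labels[i] for i in train_idx) / len(train_idx)
val_rate = sum(labels[i] for i in val_idx) / len(val_idx)
print(f"toxic share: train={train_rate:.3f}, val={val_rate:.3f}")
```

Both parts end up with the same toxic share, which is the property that `stratify=train_df["label"]` is meant to guarantee in the real split.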

In [13]:

# results will be the same every run
RANDOM_SEED = 42

# produce the same sequence of values every time
np.random.seed(RANDOM_SEED)
torch.manual_seed(RANDOM_SEED)
torch.cuda.manual_seed_all(RANDOM_SEED)

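# Make cuDNN use deterministic algorithms, so the same input and weights give the same output on every run
torch.backends.cudnn.deterministic = True

# Disable the cuDNN auto-tuner, which may pick a different (fastest) kernel and algorithm on each run
torch.backends.cudnn.benchmark = False

# Remove rows with missing comment_text (no-op if they were already dropped in the previous cell)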
train_df = train_df.dropna(subset=["comment_text"]).reset_index(drop=True)

# binary classification label where 0 = not toxic, 1 = toxic
train_df["label"] = (train_df["target"] > 0.5).astype(int)

# Splits dataset into 90% train and 10% validation.
train_text, val_text, y_train, y_val = train_test_split(
    train_df["comment_text"], # x (text) - all comment texts
    train_df["label"], # y - all labels (0 or 1)
    test_size=0.1,
    stratify=train_df["label"], # keeps toxicity ratio equal in both splits
    random_state=RANDOM_SEED,
)


## 3. Logistic Regression Baseline with TF–IDF Features

Raw comment text is converted into numeric features using a TF–IDF vectorizer, which captures both single words and word pairs while filtering out very rare terms. These features are then used to train a Logistic Regression classifier that predicts whether a comment is toxic or not. After training, the model outputs toxicity probabilities for the validation set, which are turned into binary predictions using a 0.5 threshold. Evaluate how well the model works using three metrics: ROC-AUC (how well it ranks toxic vs non-toxic), PR-AUC (how well it detects the toxic class in an imbalanced setting), and F1-score at 0.5 (the balance between precision and recall at the chosen decision threshold).
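To see what `ngram_range=(1, 2)` produces, here is a simplified pure-Python sketch of the tokenization step. It assumes the regular expression below matches scikit-learn's default token pattern (tokens of two or more word characters). The function `word_ngrams` is hypothetical and does no TF–IDF weighting.

```python
import re

# Assumption: mirrors scikit-learn's default token_pattern,
# which keeps only tokens of two or more word characters.
TOKEN_RE = re.compile(r"(?u)\b\w\w+\b")

def word_ngrams(text, ngram_range=(1, 2)):
    """Return lowercase word n-grams for every n in ngram_range (inclusive)."""
    tokens = TOKEN_RE.findall(text.lower())
    lo, hi = ngram_range
    grams = []
    for n in range(lo, hi + 1):
        for i in range(len(tokens) - n + 1):
            grams.append(" ".join(tokens[i:i + n]))
    return grams

example = word_ngrams("You guys are a bunch of losers")
print(example)
```

The single-character token "a" is dropped before the bigrams are built, so the phrase yields 6 unigrams and 5 bigrams such as "bunch of" and "of losers".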

In [16]:
# TF–IDF vectorizer: converts raw text into numerical feature vectors that ML models can work with
tfidf = TfidfVectorizer(
    lowercase=True, # convert all text to lowercase (dog is treated the same as Dog)
    ngram_range=(1, 2), # insults can be phrases, so I use unigrams (single words) and bigrams (two-word phrases)
    min_df=5, # ignore terms that appear in fewer than 5 documents
    max_features=200_000, # limit the vocabulary to the 200k most frequent terms (so we don't get over a million features)
    strip_accents="unicode", # map accented characters to their unaccented form
    dtype=np.float32 # save memory
)

# learn the vocabulary and IDF weights from train_text and convert it into a sparse matrix
X_train = tfidf.fit_transform(train_text)

# only transform: the validation set reuses the vocabulary learned from the training text
X_val   = tfidf.transform(val_text)

clf = LogisticRegression(
    max_iter=1000,
    class_weight="balanced",
    solver="liblinear",        # very reliable for TF–IDF where I have lots of features (200k)
    C=1.0,                     # inverse regularization strength (smaller C -> stronger regularization)
    random_state=RANDOM_SEED,
)

# train
clf.fit(X_train, y_train)
val_scores = clf.predict_proba(X_val)[:, 1]
val_preds_05 = (val_scores >= 0.5).astype(int)

roc  = roc_auc_score(y_val, val_scores)
prauc = average_precision_score(y_val, val_scores)
f1_05 = f1_score(y_val, val_preds_05)

# Find threshold that maximizes F1 on validation
thresholds = np.linspace(0.01, 0.99, 99)
f1s = [f1_score(y_val, (val_scores >= t).astype(int)) for t in thresholds]
best_idx = int(np.argmax(f1s))
best_t = float(thresholds[best_idx])
best_f1 = float(f1s[best_idx])

print(f"ROC-AUC:        {roc:.3f}")
print(f"PR-AUC:         {prauc:.3f}")
print(f"F1@0.50:        {f1_05:.3f}")
print(f"Best F1:        {best_f1:.3f}  (threshold={best_t:.2f})")


ROC-AUC:        0.959
PR-AUC:         0.709
F1@0.50:        0.578
Best F1:        0.648  (threshold=0.80)


Logistic Regression scores:

ROC-AUC: 0.959
The model separates toxic from non-toxic comments very well: it consistently ranks harmful comments above harmless ones.

PR-AUC: 0.709
This shows strong performance on the toxic class, which is rare in the dataset (about 5.9% of comments). The model can detect toxic comments reasonably well without too many false positives.

F1 (0.5): 0.578
At the default 0.5 threshold, the balance between precision and recall is moderate. The score suggests that the threshold is not ideal and should be tuned.

The best F1 (0.648) is reached at a threshold of 0.80, because comments that are truly toxic often get very high probabilities. A likely contributing factor is class_weight="balanced", which shifts predicted probabilities upward, so many non-toxic comments still score above 0.5.
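Threshold tuning can be illustrated on a handful of made-up scores. The sketch below sweeps thresholds and keeps the one with the highest F1, mirroring the loop in the cell above in plain Python. The data and the helper `f1_at_threshold` are hypothetical.

```python
def f1_at_threshold(y_true, scores, thr):
    """F1 of the toxic class when every score >= thr is predicted toxic."""
    tp = sum(1 for y, s in zip(y_true, scores) if s >= thr and y == 1)
    fp = sum(1 for y, s in zip(y_true, scores) if s >= thr and y == 0)
    fn = sum(1 for y, s in zip(y_true, scores) if s < thr and y == 1)
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0

# Toy data (made up): several non-toxic comments score between 0.5 and 0.7,
# as can happen when class weighting inflates the probabilities.
y_true = [0, 0, 0, 0, 0, 0, 1, 1, 0, 1]
scores = [0.10, 0.30, 0.55, 0.60, 0.65, 0.70, 0.75, 0.85, 0.20, 0.95]

thresholds = [k / 100 for k in range(5, 100, 5)]   # 0.05, 0.10, ..., 0.95
best_thr = max(thresholds, key=lambda t: f1_at_threshold(y_true, scores, t))

print(f"F1@0.50 = {f1_at_threshold(y_true, scores, 0.5):.3f}")
print(f"best threshold = {best_thr:.2f}, F1 = {f1_at_threshold(y_true, scores, best_thr):.3f}")
```

Because the threshold is selected on the same validation data that is used for reporting, the best-F1 figure should be read as slightly optimistic.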

## 4. DistilBERT Toxicity Classification Baseline 

Unlike TF–IDF, which relies on surface-level word frequency, DistilBERT captures semantics such as intent, phrasing, and implicit meaning. Although the main baseline in this project is a TF–IDF + Logistic Regression model, DistilBERT is included as a reference baseline to show how much performance can be gained by using a modern deep language model that does not rely on handcrafted features or emotional signals. The cell below is disabled (wrapped in a string literal); its results are listed after the code.

In [4]:
'''
# train_df: must have "comment_text" (no NaNs) and "target"
train_df = train_df.dropna(subset=["comment_text"]).reset_index(drop=True)
train_df["label"] = (train_df["target"] > 0.5).astype(int)

N_SUBSAMPLE = 200_000 

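# Stratified subsample; if the dataset is smaller than N_SUBSAMPLE, just use all of it.
# The subsample is kept in sample_df so the full train_df stays available for the other sections.
if len(train_df) > N_SUBSAMPLE:
    sample_df, _ = train_test_split(
        train_df,
        train_size=N_SUBSAMPLE,
        stratify=train_df["label"],
        random_state=RANDOM_SEED
    )
else:
    sample_df = train_df

# Note: this reassigns train_text, val_text, y_train and y_val
train_text, val_text, y_train, y_val = train_test_split(
    sample_df["comment_text"],
    sample_df["label"],
    test_size=0.1,
    stratify=sample_df["label"],
    random_state=RANDOM_SEED,
)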

# Model and tokenizer setup

model_name = "distilbert-base-uncased"
max_length = 128

tokenizer = AutoTokenizer.from_pretrained(model_name)

class TextDataset(Dataset):
    def __init__(self, texts, labels):
        self.texts = list(texts)
        self.labels = list(labels)

    def __len__(self):
        return len(self.labels)

    def __getitem__(self, idx):
        enc = tokenizer(
            self.texts[idx],
            padding="max_length",
            truncation=True,
            max_length=max_length,
            return_tensors="pt",
        )
        return {
            "input_ids": enc["input_ids"].squeeze(0),
            "attention_mask": enc["attention_mask"].squeeze(0),
            "labels": torch.tensor(self.labels[idx], dtype=torch.long),
        }

train_ds = TextDataset(train_text, y_train)
val_ds   = TextDataset(val_text, y_val)

model = AutoModelForSequenceClassification.from_pretrained(
    model_name,
    num_labels=2,
    problem_type="single_label_classification",
)

# Metrics and training arguments

def compute_metrics(eval_pred):
    logits, labels = eval_pred
    probs = torch.softmax(torch.tensor(logits), dim=1)[:, 1].numpy()
    preds = (probs > 0.5).astype(int)
    return {
        "roc_auc": roc_auc_score(labels, probs),
        "pr_auc":  average_precision_score(labels, probs),
        "f1_0.5":  f1_score(labels, preds),
    }

training_args = TrainingArguments(
    output_dir="./bert-toxic-out",
    evaluation_strategy="epoch",  # in new HF versions: eval_strategy="epoch"
    save_strategy="no",
    per_device_train_batch_size=16,
    per_device_eval_batch_size=32,
    num_train_epochs=2,
    learning_rate=2e-5,
    weight_decay=0.01,
    logging_steps=200,
    load_best_model_at_end=False,
    fp16=torch.cuda.is_available(),  # automatically use GPU mixed precision if available
)

trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=train_ds,
    eval_dataset=val_ds,
    compute_metrics=compute_metrics,
)

# Train

trainer.train()

# Validation and threshold tuning

val_logits = trainer.predict(val_ds).predictions
val_probs = torch.softmax(torch.tensor(val_logits), dim=1)[:, 1].numpy()

thresholds = np.linspace(0.1, 0.9, 17)
f1s = [f1_score(y_val, val_probs > t) for t in thresholds]
best_t = thresholds[int(np.argmax(f1s))]

print(f"Best threshold: {best_t:.3f} | F1: {max(f1s):.3f}")
print(f"ROC-AUC: {roc_auc_score(y_val, val_probs):.3f}")
print(f"PR-AUC:  {average_precision_score(y_val, val_probs):.3f}")
print(f"F1@0.5:  {f1_score(y_val, val_probs > 0.5):.3f}")
print(f"F1@best: {max(f1s):.3f}")
'''


DistilBERT results interpretation:

- Best threshold: 0.450 | F1: 0.663
- ROC-AUC: 0.965
- PR-AUC:  0.753
- F1 (0.5):  0.661
- F1 (best): 0.663

DistilBERT achieves very strong performance (ROC-AUC = 0.965, PR-AUC = 0.753) but only slightly outperforms the TF–IDF + Logistic Regression baseline in terms of best F1-score (0.663 vs. 0.648 at the baseline's 0.80 threshold). The comparison is approximate, because DistilBERT was trained and validated on a 200k subsample rather than on the full split. The improvement over Logistic Regression is relatively small when compared to the substantially higher computational cost and training time. This highlights an important trade-off: while transformer-based models provide better semantic understanding, simpler linear models remain competitive and more efficient. That motivates exploring emotion-aware features as a lightweight way to close the performance gap without the overhead of full transformer models.

## 5. Emotion Feature Extraction with GoEmotions (Affective Signals)

This section extracts explicit emotion signals from each comment using a pretrained GoEmotions classifier. For every text, the model outputs 28 emotion scores (e.g., anger, disgust, annoyance, neutral). The model is multi-label, so each score is an independent probability from a sigmoid and the 28 values do not have to sum to 1. These emotion probabilities are cached and later concatenated with TF–IDF features to build the emotion-aware toxicity model and to analyze fairness effects.
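The extraction code applies a sigmoid to the logits rather than a softmax. A small sketch with made-up logits shows the difference: sigmoid scores each emotion independently, so several emotions can be high at once and the values need not sum to 1.

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def softmax(xs):
    m = max(xs)                               # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

# Hypothetical logits for three emotions of one comment.
logits = {"annoyance": 2.0, "anger": 1.0, "joy": -1.0}

multi_label = {k: sigmoid(v) for k, v in logits.items()}
single_label = dict(zip(logits, softmax(list(logits.values()))))

print("sigmoid:", {k: round(v, 3) for k, v in multi_label.items()},
      "sum =", round(sum(multi_label.values()), 3))
print("softmax:", {k: round(v, 3) for k, v in single_label.items()},
      "sum =", round(sum(single_label.values()), 3))
```

This matches the sample output of the notebook, where the top emotions of the first comment already add up to more than 1.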

In [17]:
GOEMOTIONS_MODEL = "SamLowe/roberta-base-go_emotions" # RoBERTa model was trained to detect emotions like anger, disgust, fear, joy, neutral, etc.

# create folder for cache
CACHE_DIR = "./cache_goemotions" 
os.makedirs(CACHE_DIR, exist_ok=True)

# cache files for the extracted emotion features
TRAIN_CACHE = os.path.join(CACHE_DIR, "goemotions_probs_train.npy")
VAL_CACHE   = os.path.join(CACHE_DIR, "goemotions_probs_val.npy")
LABELS_CACHE = os.path.join(CACHE_DIR, "goemotions_labels.npy")

device = "cuda" if torch.cuda.is_available() else "cpu"

# load the tokenizer and the pretrained model, move the model to the CPU/GPU, and switch it to evaluation mode
tokenizer_em = AutoTokenizer.from_pretrained(GOEMOTIONS_MODEL)
model_em = AutoModelForSequenceClassification.from_pretrained(GOEMOTIONS_MODEL).to(device)
model_em.eval()

# Label names (length should be 28) - (e.g., anger, disgust, neutral)
emotion_labels = [model_em.config.id2label[i] for i in range(model_em.config.num_labels)]
np.save(LABELS_CACHE, np.array(emotion_labels, dtype=object))

def extract_goemotions_probs(texts, batch_size=64, max_length=256):
    """
    Returns an (n_samples x n_emotions) numpy array of emotion probabilities.
    """
    probs_all = []
    with torch.no_grad(): # inference only: do not track gradients
        texts = list(texts)
        for i in range(0, len(texts), batch_size): # loop over texts in batches
            batch = texts[i:i+batch_size]
            enc = tokenizer_em( 
                batch,
                truncation=True,
                padding=True,
                max_length=max_length,
                return_tensors="pt", # return PyTorch tensors
            ).to(device) # tokenize the batch and move the tensors to the model's device

            logits = model_em(**enc).logits # run the model and get the raw emotion scores
            # sigmoid gives independent per-label probabilities (multi-label model), so a row need not sum to 1
            probs = torch.sigmoid(logits).cpu().numpy()
            probs_all.append(probs)

    return np.vstack(probs_all) # stack the per-batch results into one (n_samples x n_emotions) matrix

# load cached probabilities if they exist, otherwise compute and cache them
if os.path.exists(TRAIN_CACHE) and os.path.exists(VAL_CACHE):
    E_train = np.load(TRAIN_CACHE)
    E_val   = np.load(VAL_CACHE)
    print("Loaded cached GoEmotions probabilities.")
else:
    E_train = extract_goemotions_probs(train_text, batch_size=64, max_length=256)
    E_val   = extract_goemotions_probs(val_text, batch_size=64, max_length=256)

    np.save(TRAIN_CACHE, E_train)
    np.save(VAL_CACHE, E_val)
    print("Computed and cached GoEmotions probabilities.")

print("Emotion labels (28):", emotion_labels)
print("E_train shape:", E_train.shape)
print("E_val shape:  ", E_val.shape)
print("Example (first row, top-5 emotions):")
top5 = np.argsort(E_train[0])[::-1][:5]
for idx in top5:
    print(f"  {emotion_labels[idx]}: {E_train[0, idx]:.3f}")


Computed and cached GoEmotions probabilities.
Emotion labels (28): ['admiration', 'amusement', 'anger', 'annoyance', 'approval', 'caring', 'confusion', 'curiosity', 'desire', 'disappointment', 'disapproval', 'disgust', 'embarrassment', 'excitement', 'fear', 'gratitude', 'grief', 'joy', 'love', 'nervousness', 'optimism', 'pride', 'realization', 'relief', 'remorse', 'sadness', 'surprise', 'neutral']
E_train shape: (1624383, 28)
E_val shape:   (180488, 28)
Example (first row, top-5 emotions):
  surprise: 0.533
  curiosity: 0.334
  neutral: 0.214
  confusion: 0.030
  excitement: 0.015



## 6. Emotion-Only Toxicity Classification (GoEmotions + Logistic Regression)

This section trains a toxicity classifier using only emotion probabilities extracted with the GoEmotions model. Each comment is represented as a 28-dimensional emotional feature vector, and a Logistic Regression model is used to predict toxicity based on these signals alone. The goal is to evaluate how much emotional information contributes to toxicity detection without relying on textual content.

In [18]:
# standardize the emotion features so that their coefficients are on a comparable scale
emo_clf = Pipeline(steps=[
    ("scaler", StandardScaler(with_mean=True, with_std=True)),
    ("lr", LogisticRegression(
        max_iter=2000,
        solver="lbfgs",
        class_weight="balanced",
        n_jobs=-1 if "n_jobs" in LogisticRegression().get_params() else None
    ))
])

emo_clf.fit(E_train, y_train)

# Predict probabilities on validation
val_proba = emo_clf.predict_proba(E_val)[:, 1]

# Basic metrics (threshold 0.5) 
val_pred_05 = (val_proba >= 0.5).astype(int)

roc = roc_auc_score(y_val, val_proba)
pr  = average_precision_score(y_val, val_proba)
f1  = f1_score(y_val, val_pred_05)

print("Emotion-only (GoEmotions) -> LogisticRegression")
print(f"ROC-AUC: {roc:.4f}")
print(f"PR-AUC : {pr:.4f}")
print(f"F1@0.5 : {f1:.4f}")
print()

print("Confusion matrix @0.5 [ [TN FP], [FN TP] ]:")
print(confusion_matrix(y_val, val_pred_05))
print()

# choose a better threshold on validation by maximizing F1
prec, rec, thr = precision_recall_curve(y_val, val_proba)
# precision_recall_curve returns thr of length (len(prec)-1)
f1s = (2 * prec[:-1] * rec[:-1]) / (prec[:-1] + rec[:-1] + 1e-12)
best_idx = int(np.argmax(f1s))
best_thr = float(thr[best_idx])
best_f1 = float(f1s[best_idx])

val_pred_best = (val_proba >= best_thr).astype(int)

print(f"Best threshold by F1: {best_thr:.4f}")
print(f"F1@best_thr        : {best_f1:.4f}")
print()
print("Confusion matrix with best thr [ [TN FP], [FN TP] ]:")
print(confusion_matrix(y_val, val_pred_best))
print()
print("Classification report @best_thr:")
print(classification_report(y_val, val_pred_best, digits=4))

# show which emotions the model relies on most
# Positive coef => increases toxicity probability; negative => decreases it.
lr = emo_clf.named_steps["lr"]
coefs = lr.coef_.ravel()

if "emotion_labels" in globals() and len(emotion_labels) == len(coefs):
    idx_sorted = np.argsort(coefs)
    print("\nTop emotions decreasing toxicity (most negative coefficients):")
    for i in idx_sorted[:8]:
        print(f"  {emotion_labels[i]:>14s}: {coefs[i]: .4f}")

    print("\nTop emotions increasing toxicity (most positive coefficients):")
    for i in idx_sorted[-8:][::-1]:
        print(f"  {emotion_labels[i]:>14s}: {coefs[i]: .4f}")
else:
    # Fallback if emotion_labels not defined
    idx_sorted = np.argsort(coefs)
    print("\nTop coefficients (negative -> positive):")
    print(coefs[idx_sorted[:8]])
    print(coefs[idx_sorted[-8:][::-1]])


Emotion-only (GoEmotions) -> LogisticRegression
ROC-AUC: 0.8172
PR-AUC : 0.3424
F1@0.5 : 0.3214

Confusion matrix @0.5 [ [TN FP], [FN TP] ]:
[[146090  23754]
 [  4057   6587]]

Best threshold by F1: 0.7773
F1@best_thr        : 0.3860

Confusion matrix with best thr [ [TN FP], [FN TP] ]:
[[162480   7364]
 [  6337   4307]]

Classification report @best_thr:
              precision    recall  f1-score   support

           0     0.9625    0.9566    0.9595    169844
           1     0.3690    0.4046    0.3860     10644

    accuracy                         0.9241    180488
   macro avg     0.6657    0.6806    0.6728    180488
weighted avg     0.9275    0.9241    0.9257    180488


Top emotions decreasing toxicity (most negative coefficients):
         neutral: -0.4290
     disapproval: -0.4118
        approval: -0.3040
       confusion: -0.2333
  disappointment: -0.2329
      admiration: -0.1905
       gratitude: -0.1854
       curiosity: -0.1802

Top emotions increasing toxicity (most posi

Emotion-Only Toxicity Classification result:

- ROC-AUC (0.817)  -> Emotion probabilities alone carry a real ranking signal: toxic comments tend to score higher on certain emotions than non-toxic ones.
- PR-AUC (0.342)   -> Low: among the comments ranked as most toxic, many are emotional but harmless.
- F1 @ 0.5 (0.321) -> Low: at the default threshold, emotions alone trigger too many false positives (23,754 FP vs. 6,587 TP).
- Best F1 (0.386)  -> Tuning the threshold to 0.7773 helps, but performance remains limited.
- Confusion matrix -> FP are emotional but harmless comments wrongly labeled as toxic; FN are toxic comments that are calm or indirect and lack strong emotion.
- Accuracy         -> Misleading, because most comments are non-toxic: always predicting “non-toxic” would already give high accuracy.
- Coefficients     -> Emotions that reduce the toxicity prediction (neutral, approval, gratitude, curiosity) are signals of normal or positive communication. Emotions that increase it (annoyance, anger, disgust) are often present in toxic language, but also in non-toxic complaints.

The emotion-only model is sensitive to emotional intensity but lacks the semantic understanding required to distinguish harmful attacks from benign emotional expression. As a result, it produces many false positives and achieves limited F1 performance, confirming that emotional cues alone are insufficient for reliable toxicity detection.
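The numbers in the classification report can be recomputed from the confusion matrix at the best threshold, which also makes the accuracy point concrete. The sketch below uses the counts printed above; only the variable names are mine.

```python
# Counts from the confusion matrix at the best threshold (0.7773), as printed above.
tn, fp, fn, tp = 162480, 7364, 6337, 4307
total = tn + fp + fn + tp

precision = tp / (tp + fp)          # how many flagged comments are truly toxic
recall = tp / (tp + fn)             # how many toxic comments are caught
f1 = 2 * precision * recall / (precision + recall)
accuracy = (tp + tn) / total

# Accuracy of a trivial model that always predicts "non-toxic".
majority_accuracy = (tn + fp) / total

print(f"precision={precision:.4f} recall={recall:.4f} f1={f1:.4f}")
print(f"accuracy={accuracy:.4f} vs always-non-toxic={majority_accuracy:.4f}")
```

The trivial always-non-toxic model reaches about 0.941 accuracy, higher than the emotion-only model's 0.924, which is why accuracy is not a useful metric here.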

## 7. TF–IDF + Emotion Feature Fusion Classifier

This section builds a fusion toxicity model by combining two types of information for each comment:
- What is said (TF–IDF word/bigram features), and
- How it feels (28 GoEmotions probabilities).

The goal is to test whether adding emotional signals improves toxicity prediction (and later fairness), compared to using TF–IDF or emotions alone.

In [19]:
# safety checks - same number of comments in TF–IDF and emotions
assert X_train.shape[0] == E_train.shape[0], f"Mismatch: X_train {X_train.shape[0]} vs E_train {E_train.shape[0]}"
assert X_val.shape[0]   == E_val.shape[0],   f"Mismatch: X_val {X_val.shape[0]} vs E_val {E_val.shape[0]}"
assert len(y_train) == E_train.shape[0]
assert len(y_val)   == E_val.shape[0]

# convert emotions to sparse like TF–IDF and fuse with TF–IDF
Etr_sp = sp.csr_matrix(E_train.astype(np.float32))
Eva_sp = sp.csr_matrix(E_val.astype(np.float32))

# hstack - put features side-by-side -> words + emotions together
X_train_fused = sp.hstack([X_train, Etr_sp], format="csr")
X_val_fused   = sp.hstack([X_val,   Eva_sp], format="csr")

# verify dimensions, must be 200k + 28
print("Shapes:")
print("  X_train:", X_train.shape, "E_train:", E_train.shape, "=> fused:", X_train_fused.shape)
print("  X_val  :", X_val.shape,   "E_val  :", E_val.shape,   "=> fused:", X_val_fused.shape)

# train fused classifier
clf_fused = LogisticRegression(
    max_iter=2000,
    class_weight="balanced",
    n_jobs=-1
)
clf_fused.fit(X_train_fused, y_train)

# evaluate fused model
val_scores_fused = clf_fused.predict_proba(X_val_fused)[:, 1]
val_preds_fused_05 = (val_scores_fused > 0.5).astype(int)

print("\nTF–IDF + Emotions (FUSED) metrics:")
print(f"ROC-AUC: {roc_auc_score(y_val, val_scores_fused):.3f}")
print(f"PR-AUC:  {average_precision_score(y_val, val_scores_fused):.3f}")
print(f"F1@0.5:  {f1_score(y_val, val_preds_fused_05):.3f}")
print("Confusion @0.5:\n", confusion_matrix(y_val, val_preds_fused_05))

# best threshold by F1 on validation
prec, rec, thr = precision_recall_curve(y_val, val_scores_fused)
f1s = (2 * prec[:-1] * rec[:-1]) / (prec[:-1] + rec[:-1] + 1e-12)
best_idx = int(np.argmax(f1s))
best_thr = float(thr[best_idx])
best_f1  = float(f1s[best_idx])

val_preds_fused_best = (val_scores_fused >= best_thr).astype(int)

print(f"\nBest threshold by F1 (fused): {best_thr:.4f}")
print(f"Best F1 (fused):             {best_f1:.4f}")
print("Confusion @best_thr:\n", confusion_matrix(y_val, val_preds_fused_best))

# inspect emotion coefficients inside the fused model
# last 28 coefficients correspond to emotions (because we appended E_* at the end).
if "emotion_labels" not in globals():
    emotion_labels = [f"emo_{i}" for i in range(E_train.shape[1])]

emo_coef = clf_fused.coef_.ravel()[-len(emotion_labels):]
idx_sorted = np.argsort(emo_coef)

print("\nFUSED model: emotions decreasing toxicity (most negative):")
for i in idx_sorted[:8]:
    print(f"  {emotion_labels[i]:>14s}: {emo_coef[i]: .4f}")

print("\nFUSED model: emotions increasing toxicity (most positive):")
for i in idx_sorted[-8:][::-1]:
    print(f"  {emotion_labels[i]:>14s}: {emo_coef[i]: .4f}")

Shapes:
  X_train: (1624383, 200000) E_train: (1624383, 28) => fused: (1624383, 200028)
  X_val  : (180488, 200000) E_val  : (180488, 28) => fused: (180488, 200028)

TF–IDF + Emotions (FUSED) metrics:
ROC-AUC: 0.959
PR-AUC:  0.717
F1@0.5:  0.579
Confusion @0.5:
 [[158681  11163]
 [  1765   8879]]

Best threshold by F1 (fused): 0.8179
Best F1 (fused):             0.6537
Confusion @best_thr:
 [[165860   3984]
 [  3541   7103]]

FUSED model: emotions decreasing toxicity (most negative):
           grief: -2.0799
     realization: -1.6744
      excitement: -1.3661
          relief: -1.3437
     disapproval: -1.2050
     nervousness: -1.0860
       confusion: -0.9698
        approval: -0.8710

FUSED model: emotions increasing toxicity (most positive):
       annoyance:  3.0564
           anger:  1.6882
         disgust:  1.3618
   embarrassment:  0.5998
       amusement:  0.3033
            fear:  0.0519
         sadness: -0.2536
         remorse: -0.2564


TF–IDF + Emotion feature fusion result:

Compared to the TF–IDF-only model, the fusion model achieves the same ranking performance (ROC-AUC = 0.959 for both), indicating that lexical features remain the dominant source of information for toxicity detection. However, the fusion model slightly improves PR-AUC (0.717 vs. 0.709), suggesting somewhat better handling of the minority toxic class when precision and recall are jointly considered.
The best F1 score of the fusion model (0.654) is marginally higher than that of the TF–IDF-only model (0.648), a small improvement from adding emotional features.

The clearest change is in the coefficients. In the fused model, annoyance, anger, and disgust get the largest positive weights, and these emotional effects are stronger and more meaningful than in the emotion-only model, where emotions often caused false alarms. The magnitudes are not directly comparable, though, because the emotion-only model standardizes its inputs and the fused model does not. Overall, this indicates that emotions are most useful in context, not in isolation.
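One caveat when comparing these coefficients with the emotion-only ones: the emotion-only pipeline standardizes its inputs with StandardScaler, while the fused model receives raw probabilities, so the two sets of weights live on different scales. The sketch below shows the relationship for a single feature. The numbers are made up, and it ignores that regularization would change the fitted values.

```python
import statistics

# Hypothetical raw probabilities of one emotion across five comments.
xs = [0.02, 0.05, 0.10, 0.40, 0.03]
mu = statistics.fmean(xs)
sigma = statistics.pstdev(xs)   # population standard deviation, as StandardScaler uses

# Hypothetical weight and intercept on the standardized feature z = (x - mu) / sigma.
w_z, b_z = -0.43, 0.10

# Equivalent weight and intercept on the raw feature x.
w_raw = w_z / sigma
b_raw = b_z - w_z * mu / sigma

# Both parameterizations give the same linear score for every comment.
for x in xs:
    z = (x - mu) / sigma
    assert abs((w_z * z + b_z) - (w_raw * x + b_raw)) < 1e-9

print(f"sigma={sigma:.3f}  w_z={w_z:.3f}  w_raw={w_raw:.3f}")
```

When a feature has a small standard deviation, as one would expect for an emotion that rarely fires, the raw-scale weight is much larger in magnitude than the standardized one, even though both describe the same effect.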

## 8. Fairness by Identity Group (Baseline TF–IDF vs Fused TF–IDF+Emotions)

This code compares false positives and false negatives across identity groups (e.g., male, female, black, muslim) for two models: a baseline TF–IDF model and a fused TF–IDF + emotions model. The goal is to see whether adding emotions changes bias-related errors, especially false positive rate (FPR) on identity-linked comments.
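A minimal sketch of the per-group comparison on made-up data, assuming an identity is counted as present when its score is at least 0.5 and missing scores count as absent. The rows and the helper name `error_rates` are hypothetical.

```python
def error_rates(pairs):
    """Compute FPR and FNR from (true_label, predicted_label) pairs."""
    tn = sum(1 for y, p in pairs if y == 0 and p == 0)
    fp = sum(1 for y, p in pairs if y == 0 and p == 1)
    fn = sum(1 for y, p in pairs if y == 1 and p == 0)
    tp = sum(1 for y, p in pairs if y == 1 and p == 1)
    fpr = fp / (fp + tn) if (fp + tn) else float("nan")   # FP / all truly non-toxic
    fnr = fn / (fn + tp) if (fn + tp) else float("nan")   # FN / all truly toxic
    return {"n": len(pairs), "FPR": fpr, "FNR": fnr}

# Toy validation rows: (true label, predicted label, identity score or None).
rows = [
    (0, 0, 0.0), (0, 1, 0.8), (1, 1, 0.6), (0, 1, 0.9),
    (0, 0, 0.7), (1, 0, 0.0), (0, 0, None), (1, 1, 0.2),
]

overall = error_rates([(y, p) for y, p, _ in rows])
group = error_rates([(y, p) for y, p, s in rows if (s or 0.0) >= 0.5])

print("ALL  :", overall)
print("group:", group)
```

In this toy data the group's FPR (2/3) is higher than the overall FPR (0.4), which is the kind of gap the fairness table is meant to expose.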

In [20]:
# ensures val_text is a pandas Series so it has original row indices
assert hasattr(val_text, "index"), "val_text must be a pandas Series"

val_idx = val_text.index

# Pull identity metadata for validation set
val_meta = train_df.loc[val_idx, identity_cols].copy()

# get prediction scores for both models
base_scores = clf.predict_proba(X_val)[:, 1]
fused_scores = clf_fused.predict_proba(X_val_fused)[:, 1]

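# Choose thresholds:
# the baseline uses the default 0.5, the fused model its F1-optimal threshold from Section 7.
# Note: the two models are therefore compared at different operating points
# (the baseline's own F1-optimal threshold was 0.80).
BASE_THR = 0.5
FUSED_THR = 0.8179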

# Converts probabilities into hard predictions 0 or 1
base_pred = (base_scores >= BASE_THR).astype(int)
fused_pred = (fused_scores >= FUSED_THR).astype(int)

y_val_arr = np.asarray(y_val, dtype=int)

# metric helpers
def rates_from_preds(y_true, y_pred):
    """
    Returns rates + confusion counts.
    FPR = FP / (FP + TN)
    FNR = FN / (FN + TP)
    """
    # creates confusion matrix and extracts:
    # TN = correct non-toxic
    # FP = non-toxic wrongly predicted toxic (false positives)
    # FN = toxic missed (false negatives)
    # TP = correct toxic
    tn, fp, fn, tp = confusion_matrix(y_true, y_pred, labels=[0, 1]).ravel()

    # False Positive Rate: Out of all truly non-toxic comments, how many got wrongly flagged toxic?
    fpr = fp / (fp + tn) if (fp + tn) > 0 else np.nan
    
    # False Negative Rate: Out of all truly toxic comments, how many did the model miss?
    fnr = fn / (fn + tp) if (fn + tp) > 0 else np.nan

    # True Positive Rate / Recall: Out of all toxic comments, how many did the model catch?
    tpr = tp / (tp + fn) if (tp + fn) > 0 else np.nan 
    
    # Precision: When the model predicts toxic, how often is it correct?
    ppv = tp / (tp + fp) if (tp + fp) > 0 else np.nan 

    # prevalence (toxicity rate): What % of comments in that group are actually toxic?
    tox_rate = y_true.mean() if len(y_true) else np.nan

    return {
        "n": int(len(y_true)),
        "tox_rate": float(tox_rate),
        "TN": int(tn), "FP": int(fp), "FN": int(fn), "TP": int(tp),
        "FPR": float(fpr), "FNR": float(fnr),
        "TPR": float(tpr), "PPV": float(ppv)
    }

def subgroup_mask(identity_series: pd.Series, thr: float = 0.5):
    """Identity present if column >= thr."""
    return identity_series.fillna(0.0).values >= thr


# build fairness table (per identity + overall)
rows = []

# Overall (all val)
rows.append({
    "group": "ALL",
    "base": rates_from_preds(y_val_arr, base_pred),
    "fused": rates_from_preds(y_val_arr, fused_pred),
})

# Identity groups
for col in identity_cols:
    m = subgroup_mask(val_meta[col], thr=0.5)
    if m.sum() == 0:
        continue

    rows.append({
        "group": col,
        "base": rates_from_preds(y_val_arr[m], base_pred[m]),
        "fused": rates_from_preds(y_val_arr[m], fused_pred[m]),
    })

# Convert to a readable dataframe
def flatten_row(r):
    out = {"group": r["group"]}
    for prefix, d in [("base", r["base"]), ("fused", r["fused"])]:
        out[f"{prefix}_n"] = d["n"]
        out[f"{prefix}_tox_rate"] = d["tox_rate"]
        out[f"{prefix}_FPR"] = d["FPR"]
        out[f"{prefix}_FNR"] = d["FNR"]
        out[f"{prefix}_TPR"] = d["TPR"]
        out[f"{prefix}_PPV"] = d["PPV"]
        out[f"{prefix}_FP"] = d["FP"]
        out[f"{prefix}_FN"] = d["FN"]
    # deltas (fused - base)
    out["ΔFPR"] = out["fused_FPR"] - out["base_FPR"]
    out["ΔFNR"] = out["fused_FNR"] - out["base_FNR"]
    out["ΔPPV"] = out["fused_PPV"] - out["base_PPV"]
    out["ΔTPR"] = out["fused_TPR"] - out["base_TPR"]
    return out

fair_df = pd.DataFrame([flatten_row(r) for r in rows])

# Sort to see where FPR improved/worsened most
fair_df_sorted = fair_df.sort_values(by="ΔFPR")

# Pretty formatting for display
pd.set_option("display.max_rows", 200)
pd.set_option("display.max_columns", 200)

print("Fairness comparison (negative ΔFPR means fused reduces false positives):")
display(
    fair_df_sorted[
        ["group",
         "base_n", "base_tox_rate", "base_FPR", "base_FNR", "base_PPV", "base_TPR",
         "fused_n", "fused_tox_rate", "fused_FPR", "fused_FNR", "fused_PPV", "fused_TPR",
         "ΔFPR", "ΔFNR", "ΔPPV", "ΔTPR",
         "base_FP", "fused_FP", "base_FN", "fused_FN"
        ]
    ]
)

# summary diagnostics
print("\nThresholds used:")
print(f"  BASE_THR : {BASE_THR}")
print(f"  FUSED_THR: {FUSED_THR}")

# Where fused reduces FPR the most
top_improve = fair_df.sort_values("ΔFPR").head(10)[["group", "ΔFPR", "base_FPR", "fused_FPR", "base_n"]]
top_worsen  = fair_df.sort_values("ΔFPR", ascending=False).head(10)[["group", "ΔFPR", "base_FPR", "fused_FPR", "base_n"]]

print("\nTop 10 groups with biggest FPR reduction (fused - base):")
display(top_improve)

print("\nTop 10 groups with biggest FPR increase (fused - base):")
display(top_worsen)


Fairness comparison (negative ΔFPR means fused reduces false positives):


Unnamed: 0,group,base_n,base_tox_rate,base_FPR,base_FNR,base_PPV,base_TPR,fused_n,fused_tox_rate,fused_FPR,fused_FNR,fused_PPV,fused_TPR,ΔFPR,ΔFNR,ΔPPV,ΔTPR,base_FP,fused_FP,base_FN,fused_FN
20,physical_disability,7,0.0,0.285714,,0.0,,7,0.0,0.0,,,,-0.285714,,,,2,0,0,0
4,black,1486,0.218708,0.40913,0.169231,0.362416,0.830769,1486,0.218708,0.184324,0.350769,0.496471,0.649231,-0.224806,0.181538,0.134054,-0.181538,475,214,55,114
23,white,2417,0.19487,0.374615,0.142251,0.356575,0.857749,2417,0.19487,0.156732,0.352442,0.5,0.647558,-0.217883,0.210191,0.143425,-0.210191,729,305,67,166
8,heterosexual,124,0.177419,0.323529,0.181818,0.352941,0.818182,124,0.177419,0.117647,0.409091,0.52,0.590909,-0.205882,0.227273,0.167059,-0.227273,33,12,4,9
10,homosexual_gay_or_lesbian,1040,0.195192,0.365591,0.152709,0.359833,0.847291,1040,0.195192,0.16129,0.438424,0.457831,0.561576,-0.204301,0.285714,0.097999,-0.285714,306,135,31,89
17,other_race_or_ethnicity,56,0.089286,0.27451,0.0,0.263158,1.0,56,0.089286,0.078431,0.2,0.5,0.8,-0.196078,0.2,0.236842,-0.2,14,4,0,1
15,muslim,2170,0.1447,0.288793,0.200637,0.318933,0.799363,2170,0.1447,0.100754,0.487261,0.462644,0.512739,-0.188039,0.286624,0.143711,-0.286624,536,187,63,153
5,buddhist,45,0.022222,0.272727,1.0,0.0,0.0,45,0.022222,0.090909,1.0,0.0,0.0,-0.181818,0.0,0.0,0.0,12,4,1,1
22,transgender,236,0.097458,0.267606,0.086957,0.269231,0.913043,236,0.097458,0.093897,0.521739,0.354839,0.478261,-0.173709,0.434783,0.085608,-0.434783,57,20,2,12
3,bisexual,32,0.09375,0.37931,0.0,0.214286,1.0,32,0.09375,0.206897,0.333333,0.25,0.666667,-0.172414,0.333333,0.035714,-0.333333,11,6,0,1



Thresholds used:
  BASE_THR : 0.5
  FUSED_THR: 0.8179

Top 10 groups with biggest FPR reduction (fused - base):


Unnamed: 0,group,ΔFPR,base_FPR,fused_FPR,base_n
20,physical_disability,-0.285714,0.285714,0.0,7
4,black,-0.224806,0.40913,0.184324,1486
23,white,-0.217883,0.374615,0.156732,2417
8,heterosexual,-0.205882,0.323529,0.117647,124
10,homosexual_gay_or_lesbian,-0.204301,0.365591,0.16129,1040
17,other_race_or_ethnicity,-0.196078,0.27451,0.078431,56
15,muslim,-0.188039,0.288793,0.100754,2170
5,buddhist,-0.181818,0.272727,0.090909,45
22,transgender,-0.173709,0.267606,0.093897,236
3,bisexual,-0.172414,0.37931,0.206897,32



Top 10 groups with biggest FPR increase (fused - base):


Unnamed: 0,group,ΔFPR,base_FPR,fused_FPR,base_n
11,intellectual_or_learning_disability,0.0,0.0,0.0,13
19,other_sexual_orientation,0.0,0.0,0.0,3
16,other_gender,0.0,0.0,0.0,2
0,ALL,-0.041768,0.065225,0.023457,180488
6,christian,-0.066788,0.100444,0.033655,4057
18,other_religion,-0.081081,0.162162,0.081081,40
1,asian,-0.090244,0.153659,0.063415,450
7,female,-0.094791,0.151175,0.056384,5380
14,male,-0.10401,0.176441,0.072431,4476
2,atheist,-0.134921,0.18254,0.047619,133


Fairness by Identity Group result:

At the chosen thresholds, the fused TF–IDF + emotion model lowers the false positive rate in every listed identity group where the baseline produces false positives, and overall (0.065 → 0.023). This means fewer identity-related but non-toxic comments are incorrectly labeled as toxic. The cost is a substantial rise in false negatives for most of the groups shown, for example FNR 0.201 → 0.487 for muslim and 0.087 → 0.522 for transgender. Because the baseline is thresholded at 0.5 and the fused model at 0.8179, part of this shift may come from the stricter threshold rather than from the emotion features themselves. Groups with very few comments (for example physical_disability, n = 7) are too small to support conclusions.

Example: black: base_FPR 0.409 → fused_FPR 0.184 (ΔFPR ≈ -0.225), a large reduction in false positives, while FNR rises from 0.169 to 0.351.
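
Since the two models are thresholded differently (0.5 vs. 0.8179), one way to separate the effect of the features from the effect of the threshold is to give both models the same overall FPR and then compare subgroup FPRs. The sketch below illustrates that idea on synthetic scores only. `threshold_at_fpr`, `toy_base`, `toy_fused`, and the 5% target are assumptions made for illustration and are not part of the notebook.

```python
import numpy as np

def threshold_at_fpr(y_true, scores, target_fpr):
    """Threshold t such that flagging scores > t keeps overall FPR <= target_fpr."""
    neg_desc = np.sort(scores[y_true == 0])[::-1]  # non-toxic scores, highest first
    k = min(int(target_fpr * len(neg_desc)), len(neg_desc) - 1)
    return float(neg_desc[k])  # at most k non-toxic scores lie strictly above it

def fpr(y_true, y_pred):
    """Share of truly non-toxic items that were flagged."""
    neg = y_true == 0
    return float(y_pred[neg].mean()) if neg.any() else float("nan")

# Synthetic stand-ins (assumptions, not project data):
# the toy baseline scores one identity group higher than the toy fused model does
rng = np.random.default_rng(0)
n = 20_000
y = (rng.random(n) < 0.08).astype(int)
group = rng.random(n) < 0.10
toy_base = 0.20 + 0.40 * y + 0.15 * group + rng.normal(0, 0.10, n)
toy_fused = 0.20 + 0.40 * y + 0.05 * group + rng.normal(0, 0.10, n)

TARGET_FPR = 0.05
results = {}
for name, s in [("base", toy_base), ("fused", toy_fused)]:
    thr = threshold_at_fpr(y, s, TARGET_FPR)
    pred = (s > thr).astype(int)
    results[name] = {
        "thr": thr,
        "overall_fpr": fpr(y, pred),
        "group_fpr": fpr(y[group], pred[group]),
    }
    print(name, results[name])
```

If the subgroup FPR gap persists when `base_scores` and `fused_scores` are compared this way, it can be attributed to the features rather than to the operating point.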

## 9. Testing custom comments

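Next I run both models on a few hand-written comments, including indirect insults that contain no explicit slurs, and compare their probabilities, labels, and top emotions.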

In [23]:
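# -------------------------------
# 1) Load the GoEmotions model only if it is not already in memory
# -------------------------------
try: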
    emo_model
    emo_tokenizer
    emotion_labels
except NameError:
    from transformers import AutoTokenizer, AutoModelForSequenceClassification

    GOEMOTIONS_MODEL = "SamLowe/roberta-base-go_emotions"
    emo_tokenizer = AutoTokenizer.from_pretrained(GOEMOTIONS_MODEL)
    emo_model = AutoModelForSequenceClassification.from_pretrained(GOEMOTIONS_MODEL)

    device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
    emo_model.to(device)
    emo_model.eval()

    emotion_labels = [
        'admiration', 'amusement', 'anger', 'annoyance', 'approval', 'caring',
        'confusion', 'curiosity', 'desire', 'disappointment', 'disapproval',
        'disgust', 'embarrassment', 'excitement', 'fear', 'gratitude', 'grief',
        'joy', 'love', 'nervousness', 'optimism', 'pride', 'realization', 'relief',
        'remorse', 'sadness', 'surprise', 'neutral'
    ]

# Make sure model is in eval mode and we know its device
emo_model.eval()
_device = next(emo_model.parameters()).device

# -------------------------------
# 2) GoEmotions inference (sigmoid for multi-label)
# -------------------------------
def goemotions_probs(texts, batch_size=32, max_len=128):
    all_probs = []
    with torch.no_grad():
        for i in range(0, len(texts), batch_size):
            batch = texts[i:i+batch_size]
            enc = emo_tokenizer(
                batch,
                padding=True,
                truncation=True,
                max_length=max_len,
                return_tensors="pt",
            )
            enc = {k: v.to(_device) for k, v in enc.items()}

            logits = emo_model(**enc).logits
            probs = torch.sigmoid(logits)  # IMPORTANT: multi-label emotions
            all_probs.append(probs.cpu().numpy())
    return np.vstack(all_probs)

def topk_emotions(probs_row, k=5):
    idx = np.argsort(probs_row)[::-1][:k]
    return [(emotion_labels[i], float(probs_row[i])) for i in idx]

# -------------------------------
# 3) Predict toxicity: baseline + fused
# -------------------------------
def predict_toxicity(texts):
    texts = [str(t) for t in texts]

    # Emotions
    E = goemotions_probs(texts, batch_size=32, max_len=128)

    # TF–IDF
    X_txt = tfidf.transform(texts)

    # Baseline
    base_prob = clf.predict_proba(X_txt)[:, 1]
    base_pred = (base_prob >= BASE_THR).astype(int)

    # Fused
    E_sp = sp.csr_matrix(E.astype(np.float32))
    X_fused = sp.hstack([X_txt, E_sp], format="csr")
    fused_prob = clf_fused.predict_proba(X_fused)[:, 1]
    fused_pred = (fused_prob >= FUSED_THR).astype(int)

    # Build output table
    rows = []
    for i, txt in enumerate(texts):
        tops = topk_emotions(E[i], k=5)
        rows.append({
            "text": txt,
            "top_emotions": ", ".join([f"{e}:{p:.3f}" for e, p in tops]),
            "baseline_prob": float(base_prob[i]),
            "baseline_label": int(base_pred[i]),
            "fused_prob": float(fused_prob[i]),
            "fused_label": int(fused_pred[i]),
        })

    return pd.DataFrame(rows)

# -------------------------------
# 4) Put your comments here
# -------------------------------
my_comments = [
    "I totally disagree with you, but I respect your opinion.",
    "You're a complete idiot and nobody wants you here.",
    "I'm really upset about how people like me are treated.",
    "This is awesome, thanks for sharing!",
    "I’m honestly sick of people like you pretending to be reasonable.", # No direct insult word, strong annoyance / disgust
    "You contribute nothing of value, and everyone knows it.", # indirect, anger + annoyance + contempt-like emotions
    "Your presence here is exhausting, and I have no patience left.", # Lexically neutral-ish, emotionally very negative, high annoyance + anger
    "People like you make this place unbearable." # Short, no swear words, strong emotional rejection
]

demo_df = predict_toxicity(my_comments)
print(f"Thresholds: baseline={BASE_THR}, fused={FUSED_THR}")
display(demo_df)


Thresholds: baseline=0.5, fused=0.8179


Unnamed: 0,text,top_emotions,baseline_prob,baseline_label,fused_prob,fused_label
0,"I totally disagree with you, but I respect you...","disapproval:0.838, approval:0.180, annoyance:0...",0.082825,0,0.075663,0
1,You're a complete idiot and nobody wants you h...,"anger:0.529, annoyance:0.490, neutral:0.057, d...",0.999989,1,0.999995,1
2,I'm really upset about how people like me are ...,"disappointment:0.614, sadness:0.289, annoyance...",0.090693,0,0.094339,0
3,"This is awesome, thanks for sharing!","gratitude:0.974, admiration:0.604, approval:0....",0.010653,0,0.012543,0
4,I’m honestly sick of people like you pretendin...,"annoyance:0.577, disappointment:0.199, anger:0...",0.617109,1,0.780222,0
5,"You contribute nothing of value, and everyone ...","annoyance:0.320, neutral:0.314, disapproval:0....",0.321746,0,0.319636,0
6,"Your presence here is exhausting, and I have n...","disappointment:0.434, annoyance:0.352, sadness...",0.294113,0,0.494503,0
7,People like you make this place unbearable.,"annoyance:0.301, sadness:0.137, disgust:0.105,...",0.326752,0,0.503791,0


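In manual testing, cases where the fused model classified a comment as toxic while the text-only model did not were extremely rare, and none occur in the eight comments above. The only disagreement is comment 4, which the baseline flags (0.617 ≥ 0.5) and the fused model does not (0.780 < 0.8179). The fused probability is actually higher than the baseline probability for comments 4, 6, and 7, so the emotion features push these emotionally negative comments toward the toxic class. The fused model's more conservative labels therefore come largely from its higher threshold. This behavior is consistent with the reductions in false positive rates across identity groups, which were measured at the same thresholds.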
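
To see how much the emotion features move an individual prediction, the logit of a linear model can be split into a text part and an emotion part. The sketch below assumes the fused classifier is a scikit-learn `LogisticRegression` and that the emotion columns come last, as in the `sp.hstack([X_txt, E_sp])` call above. `split_logit` is a hypothetical helper, and all data here is synthetic.

```python
import numpy as np
import scipy.sparse as sp
from sklearn.linear_model import LogisticRegression

def split_logit(model, X_fused, n_emotions):
    """Hypothetical helper: per-row logit contribution of text vs. emotion columns."""
    w = model.coef_.ravel()
    n_text = X_fused.shape[1] - n_emotions
    text_part = np.asarray(X_fused[:, :n_text] @ w[:n_text]).ravel()
    emo_part = np.asarray(X_fused[:, n_text:] @ w[n_text:]).ravel()
    return text_part, emo_part, float(model.intercept_[0])

# Synthetic stand-ins: 50 sparse "TF-IDF" columns followed by 28 "emotion" columns
rng = np.random.default_rng(0)
dense_txt = rng.random((200, 50)) * (rng.random((200, 50)) < 0.10)
X_txt_toy = sp.csr_matrix(dense_txt)
E_toy = rng.random((200, 28))
y_toy = (E_toy[:, 2] + dense_txt[:, 0] + rng.normal(0, 0.1, 200) > 0.6).astype(int)

X_fused_toy = sp.hstack([X_txt_toy, sp.csr_matrix(E_toy)], format="csr")
toy_clf = LogisticRegression(max_iter=1000).fit(X_fused_toy, y_toy)

text_part, emo_part, bias = split_logit(toy_clf, X_fused_toy, n_emotions=28)
total_logit = text_part + emo_part + bias  # equals decision_function for a linear model
print(text_part[:3], emo_part[:3], bias)
```

Applied to the demo comments with `clf_fused`, a positive `emo_part` would mean the emotion features raise the toxicity score for that comment. This decomposition only holds if the features are passed to the classifier unchanged, with no scaling step in between.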

## 10. Conclusion

This project shows that emotional signals alone are insufficient for toxicity detection, but combined with lexical features they add useful context. At the chosen thresholds, the fused model reduces false positives, particularly for identity-related comments, while ROC-AUC stays at 0.959 and PR-AUC and best F1 improve slightly. The trade-off is a higher false negative rate in several identity groups. These findings suggest that emotions contribute more to fairness and reliability than to raw accuracy.