# **Jigsaw Toxic Comment Classification**
## **XGBoost Modeling and Fairness Evaluation**

This notebook builds a baseline tree-based classifier with **XGBoost** and evaluates it both overall and per identity subgroup. It uses averaged **Word2Vec sentence embeddings** as features and fairness-aware metrics to assess model bias, following Jigsaw’s official competition methodology.


### **1. Data Preparation**
- Loads `train.csv` and drops the ID columns (columns whose name contains `id` but not `identity`).
- Extracts target labels, including toxicity categories:
  - `target`, `severe_toxicity`, `obscene`, `threat`, `insult`, `identity_attack`, `sexual_explicit`
- Includes identity indicators:
  - `male`, `female`, `homosexual_gay_or_lesbian`, `christian`, `jewish`, `muslim`, `black`, `white`, `psychiatric_or_mental_illness`
- Fills missing comment text with empty strings and counts missing values in the label columns. The identity columns are left unfilled.

In [None]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import gc

import os

from google.colab import drive
drive.mount('/content/drive/')

Mounted at /content/drive/


In [None]:
data_path = "/content/drive/My Drive/Jigsaw/"

In [None]:
train = pd.read_csv(data_path + "train.csv")

In [None]:
id_cols = [col for col in train.columns if 'id' in col.lower() and 'identity' not in col.lower()]
train = train.drop(columns=id_cols)

In [None]:
train.columns

Index(['target', 'comment_text', 'severe_toxicity', 'obscene',
       'identity_attack', 'insult', 'threat', 'asian', 'atheist', 'bisexual',
       'black', 'buddhist', 'christian', 'female', 'heterosexual', 'hindu',
       'homosexual_gay_or_lesbian', 'intellectual_or_learning_disability',
       'jewish', 'latino', 'male', 'muslim', 'other_disability',
       'other_gender', 'other_race_or_ethnicity', 'other_religion',
       'other_sexual_orientation', 'physical_disability',
       'psychiatric_or_mental_illness', 'transgender', 'white', 'created_date',
       'rating', 'funny', 'wow', 'sad', 'likes', 'disagree', 'sexual_explicit',
       'identity_annotator_count', 'toxicity_annotator_count'],
      dtype='object')

In [None]:
# Columns kept in y:
#   - toxicity labels: target, severe_toxicity, obscene, threat, insult,
#     identity_attack, sexual_explicit
#   - identity indicators, used later for the bias metrics

target_cols = ['target', 'severe_toxicity', 'obscene', 'threat', 'insult', 'identity_attack', 'sexual_explicit'] \
            + ['male', 'female', 'homosexual_gay_or_lesbian', 'christian', 'jewish', 'muslim', 'black', 'white', 'psychiatric_or_mental_illness']
# .copy() so that later column assignments on y do not raise SettingWithCopyWarning
y = train[target_cols].copy()

In [None]:
X = train['comment_text']

In [None]:
del train

In [None]:
gc.collect()

93

In [None]:
X.isnull().sum()

3

In [None]:
X = X.fillna('')

In [None]:
y.isnull().sum()

| column | missing values |
| --- | --- |
| target | 0 |
| severe_toxicity | 0 |
| obscene | 0 |
| threat | 0 |
| insult | 0 |
| identity_attack | 0 |
| sexual_explicit | 0 |
| male | 1399744 |
| female | 1399744 |
| homosexual_gay_or_lesbian | 1399744 |
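
The identity columns are missing for 1,399,744 rows and are never filled in this notebook. Any comparison with NaN evaluates to False, so a row with a missing identity value satisfies neither `> 0.5` nor `<= 0.5`. The sketch below illustrates this with plain Python floats; the three scores are hypothetical, and pandas applies the same rule element by element.

```python
import math

# Hypothetical identity scores: two annotated rows and one unannotated (NaN) row.
identity_scores = [0.8, 0.0, math.nan]

in_subgroup = [s > 0.5 for s in identity_scores]
in_background = [s <= 0.5 for s in identity_scores]
in_neither = [not a and not b for a, b in zip(in_subgroup, in_background)]

print(in_subgroup)    # [True, False, False]
print(in_background)  # [False, True, False]
print(in_neither)     # [False, False, True]
```

In practice this means that filters built from these two comparisons only see rows that received identity annotations.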


### **2. Word2Vec Embedding Generation**
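- Trains a Word2Vec model (300 dimensions, window 5, `min_count=2`) on the tokenized comments.
- Computes each comment's sentence vector as the average of its in-vocabulary word vectors (a zero vector if no word is known).
- These two steps are commented out in the cells below; the notebook instead loads the pre-computed vectors from `train_cleaned.csv`.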
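
The averaging step can be illustrated without a trained model. The following is a minimal sketch in plain Python: `toy_vectors` and `average_vector` are hypothetical stand-ins for the trained Word2Vec vocabulary and for `get_avg_w2v`, and the 3-dimensional vectors are made up for readability.

```python
# Hypothetical 3-dimensional word vectors standing in for a trained Word2Vec vocabulary.
toy_vectors = {
    "you": [1.0, 0.0, 0.0],
    "are": [0.0, 1.0, 0.0],
    "kind": [0.0, 0.0, 1.0],
}

def average_vector(tokens, vectors, size):
    """Mean of the vectors of in-vocabulary tokens; a zero vector if none are known."""
    known = [vectors[t] for t in tokens if t in vectors]
    if not known:
        return [0.0] * size
    # zip(*known) groups the values of each dimension across all known tokens.
    return [sum(dim) / len(known) for dim in zip(*known)]

print(average_vector(["you", "are", "unknownword"], toy_vectors, 3))  # [0.5, 0.5, 0.0]
print(average_vector(["???"], toy_vectors, 3))                        # [0.0, 0.0, 0.0]
```

Out-of-vocabulary tokens are skipped rather than counted as zeros, so they do not shrink the average.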

In [None]:
import nltk
nltk.download('punkt_tab')

[nltk_data] Downloading package punkt_tab to /root/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt_tab.zip.


True

In [None]:
from gensim.models import Word2Vec
from nltk.tokenize import word_tokenize

def get_avg_w2v(sentence, model, vector_size):
    words = word_tokenize(sentence)
    vectors = [model.wv[word] for word in words if word in model.wv]
    return np.mean(vectors, axis=0) if vectors else np.zeros(vector_size)

In [None]:
# Tokenize text
# sentences = [word_tokenize(text) for text in X]

In [None]:
# Train Word2Vec
# w2v_model = Word2Vec(sentences, vector_size=300, window=5, min_count=2, workers=4)

In [None]:
# X = np.array([get_avg_w2v(comment, w2v_model, 300) for comment in X])

In [None]:
# X = pd.DataFrame(X)

In [None]:
data_path = "/content/drive/My Drive/Jigsaw/"

In [None]:
# X.to_csv(data_path + 'train_cleaned.csv', index=False)

In [None]:
X = pd.read_csv(data_path + 'train_cleaned.csv')

In [None]:
X = X.apply(pd.Series)

In [None]:
y['target'] = y['target'] > 0.5

### **3. XGBoost Model Training**
- **Train-test split**: 80/20
- **Feature input**: Sentence-level Word2Vec embeddings  
- **Target variable**: Binary toxicity label (`target > 0.5`)

**XGBoost Parameters:**
- `objective`: `'binary:logistic'`
- `eval_metric`: `'auc'`
- `eta`: `0.1`
- `max_depth`: `6`
- `subsample`: `0.8`
- `colsample_bytree`: `0.8`
- `num_boost_round`: `100`

**Performance:**
- Achieved an **AUC of approximately 0.85** (0.849) on the held-out test set, meaning the model ranks a toxic comment above a non-toxic one in roughly 85% of pairs.

**Key Takeaway:**  
Dense Word2Vec sentence vectors with gradient-boosted trees give a reasonable ranking baseline on the binary classification task, even without deep architectures or fine-tuned embeddings. AUC measures ranking only, so it says nothing about how the model behaves at a fixed decision threshold. The result serves as a benchmark for the bias-aware evaluation and for multi-label extensions.
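
ROC AUC can be read as the fraction of (toxic, non-toxic) pairs in which the toxic comment receives the higher score. The sketch below computes it that way on a hypothetical five-comment example; `pairwise_auc` is an illustrative helper, not the `roc_auc_score` implementation, and it is quadratic in the number of examples.

```python
def pairwise_auc(labels, scores):
    """Fraction of (positive, negative) pairs ranked correctly; ties count as half."""
    pos = [s for y, s in zip(labels, scores) if y]
    neg = [s for y, s in zip(labels, scores) if not y]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

# Hypothetical labels (1 = toxic) and model scores.
labels = [1, 0, 1, 0, 0]
scores = [0.9, 0.2, 0.4, 0.5, 0.1]

# 5 of the 6 (toxic, non-toxic) pairs are ordered correctly.
print(round(pairwise_auc(labels, scores), 3))  # 0.833
```

Note that the toxic comment scored 0.4 still contributes two correctly ordered pairs even though it would be missed at a 0.5 threshold.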

In [None]:
from sklearn.model_selection import train_test_split
import xgboost as xgb

# Split data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Convert data to DMatrix format for XGBoost
dtrain = xgb.DMatrix(X_train, label=y_train['target'])
dtest = xgb.DMatrix(X_test, label=y_test['target'])

In [None]:
# Set XGBoost parameters
params = {
    'objective': 'binary:logistic',
    'eval_metric': 'auc',  # Use AUC as evaluation metric
    'eta': 0.1,
    'max_depth': 6,
    'subsample': 0.8,
    'colsample_bytree': 0.8,
}

# Train the XGBoost model
model = xgb.train(params, dtrain, num_boost_round=100)

In [None]:
# Predict toxicity probabilities for the test set
y_pred = model.predict(dtest)

# Evaluate ranking quality with ROC AUC, the same metric set as `eval_metric` in params
from sklearn.metrics import roc_auc_score
auc = roc_auc_score(y_test['target'], y_pred)
print(f"AUC: {auc}")

AUC: 0.8491787180748991


### **4. Bias and Fairness Evaluation**

To assess the fairness of the model across identity subgroups, we compute three bias-aware metrics and combine them into a single fairness-adjusted score.

#### **Evaluation Metrics**
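- **Subgroup AUC**: AUC restricted to comments that mention the identity. Measures how well the model separates toxic from non-toxic comments within the group.
- **BPSN AUC** (Background Positive, Subgroup Negative): AUC on non-toxic comments that mention the identity plus toxic comments that do not. A low value means the model tends to score non-toxic subgroup comments above toxic background comments (false positives for the subgroup).
- **BNSP AUC** (Background Negative, Subgroup Positive): AUC on toxic comments that mention the identity plus non-toxic comments that do not. A low value means the model tends to score toxic subgroup comments below non-toxic background comments (false negatives for the subgroup).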
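
To make the three subsets concrete, here is a small sketch on hypothetical rows. The `auc` helper is a simple pairwise implementation written for this illustration, and the rows and scores are invented; only the subset definitions follow the metric descriptions above.

```python
def auc(labels, scores):
    """Fraction of (positive, negative) pairs ranked correctly; ties count as half."""
    pos = [s for y, s in zip(labels, scores) if y]
    neg = [s for y, s in zip(labels, scores) if not y]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

# Hypothetical rows: (in_subgroup, is_toxic, model_score)
rows = [
    (True,  True,  0.9),
    (True,  False, 0.7),  # non-toxic identity mention that the model scores high
    (True,  False, 0.3),
    (False, True,  0.6),
    (False, True,  0.8),
    (False, False, 0.2),
    (False, False, 0.1),
]

def subset_auc(keep):
    """AUC over the rows selected by the predicate `keep`."""
    kept = [r for r in rows if keep(r)]
    return auc([r[1] for r in kept], [r[2] for r in kept])

subgroup_auc = subset_auc(lambda r: r[0])
bpsn_auc = subset_auc(lambda r: (r[0] and not r[1]) or (not r[0] and r[1]))
bnsp_auc = subset_auc(lambda r: (r[0] and r[1]) or (not r[0] and not r[1]))

print(subgroup_auc, bpsn_auc, bnsp_auc)  # 1.0 0.75 1.0
```

Only BPSN AUC drops here, because the high-scoring non-toxic subgroup comment outranks one of the toxic background comments.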

#### **Bias-Aware Final Score**
A power mean with exponent -5 aggregates each of the three metrics across subgroups; the negative exponent weights the worst-performing subgroups most heavily. The three power means are averaged into a single bias score, and the final metric is `0.25 × overall AUC + 0.75 × bias score`.
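
A short sketch of this aggregation in plain Python follows. The function names and all AUC values are hypothetical and chosen only to show how the power mean with exponent -5 is pulled toward the weakest subgroup; the formulas mirror the description above.

```python
def power_mean(values, p):
    """Generalized mean: (mean of v**p) ** (1/p)."""
    return (sum(v ** p for v in values) / len(values)) ** (1 / p)

def final_metric(subgroup_aucs, bpsn_aucs, bnsp_aucs, overall_auc,
                 power=-5, overall_weight=0.25):
    """Blend the overall AUC with the average of the three per-metric power means."""
    bias_score = sum(
        power_mean(aucs, power) for aucs in (subgroup_aucs, bpsn_aucs, bnsp_aucs)
    ) / 3
    return overall_weight * overall_auc + (1 - overall_weight) * bias_score

# Hypothetical per-subgroup AUCs: two strong subgroups and one weak one.
aucs = [0.9, 0.9, 0.6]
arithmetic = sum(aucs) / len(aucs)
penalized = power_mean(aucs, -5)

print(round(arithmetic, 3))  # 0.8
print(round(penalized, 3))   # lower than 0.8, closer to the weak subgroup
print(round(final_metric(aucs, aucs, aucs, overall_auc=0.85), 3))
```

Because the bias score carries 75% of the weight, one weak subgroup can lower the final metric noticeably even when the overall AUC is high.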

#### **Bias Metrics Results**
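The overall AUC is about 0.85, but the fairness-adjusted final score drops to about 0.81. **Subgroup AUC is lowest** for `black` (0.752), `homosexual_gay_or_lesbian` (0.757), and `white` (0.760), while the smallest subgroup, `psychiatric_or_mental_illness`, has the highest (0.847), so subgroup size alone does not explain the gap. BPSN AUC is lowest for `white` (0.637) and `black` (0.717), meaning non-toxic comments that mention these identities are often scored above toxic background comments. Further mitigation (e.g., reweighting or debiasing techniques) may be needed to ensure equitable outcomes.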


In [None]:
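# From the baseline kernel. TOXICITY_COLUMN is defined in a later cell,
# and roc_auc_score was imported in the AUC cell above.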

def calculate_overall_auc(df, model_name):
    true_labels = df[TOXICITY_COLUMN]>0.5
    predicted_labels = df[model_name]
    return roc_auc_score(true_labels, predicted_labels)

def power_mean(series, p):
    total = sum(np.power(series, p))
    return np.power(total / len(series), 1 / p)

def get_final_metric(bias_df, overall_auc, POWER=-5, OVERALL_MODEL_WEIGHT=0.25):
    bias_score = np.average([
        power_mean(bias_df[SUBGROUP_AUC], POWER),
        power_mean(bias_df[BPSN_AUC], POWER),
        power_mean(bias_df[BNSP_AUC], POWER)
    ])
    return (OVERALL_MODEL_WEIGHT * overall_auc) + ((1 - OVERALL_MODEL_WEIGHT) * bias_score)



SUBGROUP_AUC = 'subgroup_auc'
BPSN_AUC = 'bpsn_auc'  # stands for background positive, subgroup negative
BNSP_AUC = 'bnsp_auc'  # stands for background negative, subgroup positive

def compute_auc(y_true, y_pred):
    try:
        return roc_auc_score(y_true, y_pred)
    except ValueError:
        return np.nan

def compute_subgroup_auc(df, subgroup, label, model_name):
    subgroup_examples = df[df[subgroup]>0.5]
    return compute_auc((subgroup_examples[label]>0.5), subgroup_examples[model_name])

def compute_bpsn_auc(df, subgroup, label, model_name):
    """Computes the AUC of the within-subgroup negative examples and the background positive examples."""
    subgroup_negative_examples = df[(df[subgroup]>0.5) & (df[label]<=0.5)]
    non_subgroup_positive_examples = df[(df[subgroup]<=0.5) & (df[label]>0.5)]
    # examples = subgroup_negative_examples.append(non_subgroup_positive_examples)
    examples = pd.concat([subgroup_negative_examples, non_subgroup_positive_examples])
    return compute_auc(examples[label]>0.5, examples[model_name])

def compute_bnsp_auc(df, subgroup, label, model_name):
    """Computes the AUC of the within-subgroup positive examples and the background negative examples."""
    subgroup_positive_examples = df[(df[subgroup]>0.5) & (df[label]>0.5)]
    non_subgroup_negative_examples = df[(df[subgroup]<=0.5) & (df[label]<=0.5)]
    # examples = subgroup_positive_examples.append(non_subgroup_negative_examples)
    examples = pd.concat([subgroup_positive_examples, non_subgroup_negative_examples])
    return compute_auc(examples[label]>0.5, examples[model_name])

def compute_bias_metrics_for_model(dataset,
                                   subgroups,
                                   model,
                                   label_col,
                                   include_asegs=False):
    """Computes per-subgroup metrics for all subgroups and one model."""
    records = []
    for subgroup in subgroups:
        record = {
            'subgroup': subgroup,
            'subgroup_size': len(dataset[dataset[subgroup]>0.5])
        }
        record[SUBGROUP_AUC] = compute_subgroup_auc(dataset, subgroup, label_col, model)
        record[BPSN_AUC] = compute_bpsn_auc(dataset, subgroup, label_col, model)
        record[BNSP_AUC] = compute_bnsp_auc(dataset, subgroup, label_col, model)
        records.append(record)
    return pd.DataFrame(records).sort_values(SUBGROUP_AUC, ascending=True)

In [None]:
identity_columns = [
    'male', 'female', 'homosexual_gay_or_lesbian', 'christian', 'jewish',
    'muslim', 'black', 'white', 'psychiatric_or_mental_illness']
y_columns=['target']

In [None]:
MODEL_NAME = 'model1'
TOXICITY_COLUMN = 'target'

# Work on a copy so adding the prediction column does not raise SettingWithCopyWarning
y_test = y_test.copy()
y_test[MODEL_NAME] = y_pred

bias_metrics_df = compute_bias_metrics_for_model(y_test, identity_columns, MODEL_NAME, TOXICITY_COLUMN)
get_final_metric(bias_metrics_df, calculate_overall_auc(y_test, MODEL_NAME))

0.8080898543096531

In [None]:
bias_metrics_df

| index | subgroup | subgroup_size | subgroup_auc | bpsn_auc | bnsp_auc |
| --- | --- | --- | --- | --- | --- |
| 6 | black | 2729 | 0.751513 | 0.717222 | 0.86336 |
| 2 | homosexual_gay_or_lesbian | 2104 | 0.756789 | 0.843098 | 0.759485 |
| 7 | white | 4661 | 0.759939 | 0.637381 | 0.910221 |
| 5 | muslim | 3940 | 0.772141 | 0.777581 | 0.83724 |
| 1 | female | 10136 | 0.80207 | 0.789344 | 0.851708 |
| 0 | male | 8025 | 0.809047 | 0.733787 | 0.892211 |
| 4 | jewish | 1431 | 0.821264 | 0.83058 | 0.826134 |
| 3 | christian | 7011 | 0.841584 | 0.890722 | 0.769287 |
| 8 | psychiatric_or_mental_illness | 861 | 0.847092 | 0.804135 | 0.8733 |


### **5. Classification Metrics**

This section evaluates the model using standard classification performance metrics. The predicted probabilities are thresholded at **0.5** to convert them into binary labels.

#### **Metrics Used**
- **Accuracy**: Proportion of total correct predictions.
- **Precision**: Of all predicted toxic comments, how many were actually toxic?
- **Recall**: Of all true toxic comments, how many were correctly identified?
- **F1 Score**: Harmonic mean of precision and recall, balancing both.

#### **Results**
At the 0.5 threshold the model reaches an accuracy of 0.944 and a precision of 0.745, but **recall is only 0.090** (F1 = 0.161), so it misses roughly 91% of the toxic comments in the test set. The model is conservative in predicting toxicity, favoring false negatives over false positives, and the high accuracy largely reflects how rare toxic comments are rather than good detection.
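
Since the probabilities are thresholded at 0.5, one way to explore the precision/recall trade-off is to evaluate other thresholds. The sketch below uses hypothetical labels and scores, and `precision_recall` is an illustrative helper rather than the scikit-learn functions used in this notebook.

```python
def precision_recall(labels, scores, threshold):
    """Precision and recall when scores above `threshold` are predicted toxic."""
    predicted = [s > threshold for s in scores]
    tp = sum(1 for y, p in zip(labels, predicted) if y and p)
    fp = sum(1 for y, p in zip(labels, predicted) if not y and p)
    fn = sum(1 for y, p in zip(labels, predicted) if y and not p)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall

# Hypothetical imbalanced data: 3 toxic comments out of 8, mostly with modest scores.
labels = [1, 1, 1, 0, 0, 0, 0, 0]
scores = [0.7, 0.4, 0.3, 0.35, 0.2, 0.1, 0.1, 0.05]

for threshold in (0.5, 0.25):
    precision, recall = precision_recall(labels, scores, threshold)
    print(f"threshold={threshold}: precision={precision:.2f}, recall={recall:.2f}")
# threshold=0.5: precision=1.00, recall=0.33
# threshold=0.25: precision=0.75, recall=1.00
```

On this toy data, lowering the threshold trades some precision for much higher recall; whether the same holds for the notebook's model would need to be checked on its actual predictions.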


In [None]:
# Calculate accuracy, precision, recall, and F1 score

from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score

# y_pred holds predicted probabilities; threshold at 0.5 to obtain 0/1 labels
y_pred_binary = (y_pred > 0.5).astype(int)

accuracy = accuracy_score(y_test['target'], y_pred_binary)
precision = precision_score(y_test['target'], y_pred_binary)
recall = recall_score(y_test['target'], y_pred_binary)
f1 = f1_score(y_test['target'], y_pred_binary)

print(f"Accuracy: {accuracy}")
print(f"Precision: {precision}")
print(f"Recall: {recall}")
print(f"F1 Score: {f1}")


Accuracy: 0.9440542973890158
Precision: 0.7450980392156863
Recall: 0.09026548672566372
F1 Score: 0.1610236384030576
