<a href="https://colab.research.google.com/github/anirbansaha96/AI-ML-Playground/blob/master/llm_toxicity.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

In [5]:
!pip install datasets transformers evaluate -q

In [6]:
from datasets import load_dataset
from evaluate import load
from sklearn.metrics import mean_squared_error

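# Use the first 100 rows of RealToxicityPrompts; each prompt carries a reference toxicity score.
dataset = load_dataset("allenai/real-toxicity-prompts", split="train[:100]")
# Alternative dataset: textdetox/multilingual_toxicity_dataset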



In [14]:

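def evaluate_model(model_name, toxic_label):
  """Return the MSE between a classifier's toxicity scores and the dataset's prompt toxicity (lower is better)."""
  # Load the `toxicity` measurement backed by the given Hugging Face classifier.
  toxicity = load("toxicity", model_name, module_type="measurement")
  predictions = []
  ground_truth = []
  for item in dataset:
    text = item["prompt"]["text"]
    label = item["prompt"]["toxicity"]
    if label is None:
      # Skip any prompt that has no reference toxicity score.
      continue
    # Score one prompt; `toxic_label` names the class this model uses for toxic text.
    prediction = toxicity.compute(predictions=[text], toxic_label=toxic_label)["toxicity"][0]
    predictions.append(prediction)
    ground_truth.append(label)

  return mean_squared_error(ground_truth, predictions)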
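
The function above scores each model by mean squared error against the dataset's continuous toxicity values. As a rough illustration of what that number measures, here is a dependency-free sketch. It also shows Matthews correlation, a possible alternative for thresholded labels that is not what the reported results use. The names `mse_sketch` and `mcc_sketch`, the 0.5 cutoff, and the sample scores are assumptions made for this example.

```python
import math

def mse_sketch(y_true, y_pred):
    """Mean of squared differences between reference and predicted scores."""
    if len(y_true) != len(y_pred) or not y_true:
        raise ValueError("inputs must be non-empty and of equal length")
    return sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)

def mcc_sketch(y_true, y_pred, threshold=0.5):
    """Matthews correlation after binarizing both lists at `threshold` (assumed cutoff)."""
    t = [v >= threshold for v in y_true]
    p = [v >= threshold for v in y_pred]
    tp = sum(a and b for a, b in zip(t, p))
    tn = sum(not a and not b for a, b in zip(t, p))
    fp = sum(not a and b for a, b in zip(t, p))
    fn = sum(a and not b for a, b in zip(t, p))
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    # Degenerate case (a class never appears): report 0.0 instead of dividing by zero.
    return (tp * tn - fp * fn) / denom if denom else 0.0

# Made-up scores for illustration only.
reference = [0.9, 0.1, 0.8, 0.2]
predicted = [0.7, 0.3, 0.6, 0.4]
print(mse_sketch(reference, predicted))
print(mcc_sketch(reference, predicted))
```

MSE rewards scores that are numerically close to the reference, while the thresholded variant only checks whether both sides land on the same side of the cutoff, so the two can rank models differently.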



In [15]:
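# MSE per model, keyed by model name.
model_results = {}

# Each entry pairs a Hugging Face classifier with the label that classifier uses for toxic text.
models = [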
    {"model_name": "facebook/roberta-hate-speech-dynabench-r4-target", "toxic_label": "hate"},
    {"model_name": "DaNLP/da-electra-hatespeech-detection", "toxic_label": "offensive"},
    {"model_name": "unitary/multilingual-toxic-xlm-roberta", "toxic_label": "toxic"},
    {"model_name": "unitary/toxic-bert", "toxic_label": "toxic"},
    {"model_name": "martin-ha/toxic-comment-model", "toxic_label": "toxic"},
    {"model_name": "textdetox/xlmr-large-toxicity-classifier", "toxic_label": "LABEL_1"},
    {"model_name": "s-nlp/roberta_toxicity_classifier", "toxic_label": "toxic"}
]

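for model in models:
  # Log progress so a slow or failing model is easy to spot.
  print(f"Toxicity Calculation Started for {model['model_name']} with toxicity_label {model['toxic_label']}")
  try:
    metric_score = evaluate_model(model_name=model["model_name"], toxic_label=model["toxic_label"])
    model_results[model["model_name"]] = metric_score
  except Exception as e:
    # Keep going if one model fails to load or score.
    print(f"Toxicity Calculation failed for {model['model_name']} with toxicity_label {model['toxic_label']} with error {e}")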



Toxicity Calculation Started for facebook/roberta-hate-speech-dynabench-r4-target with toxicity_label hate
Toxicity Calculation Started for DaNLP/da-electra-hatespeech-detection with toxicity_label offensive
Toxicity Calculation Started for unitary/multilingual-toxic-xlm-roberta with toxicity_label toxic
Toxicity Calculation Started for unitary/toxic-bert with toxicity_label toxic
Toxicity Calculation Started for martin-ha/toxic-comment-model with toxicity_label toxic
Toxicity Calculation Started for textdetox/xlmr-large-toxicity-classifier with toxicity_label LABEL_1
Toxicity Calculation Started for s-nlp/roberta_toxicity_classifier with toxicity_label toxic


Some weights of the model checkpoint at s-nlp/roberta_toxicity_classifier were not used when initializing RobertaForSequenceClassification: ['roberta.pooler.dense.bias', 'roberta.pooler.dense.weight']
- This IS expected if you are initializing RobertaForSequenceClassification from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing RobertaForSequenceClassification from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).


In [16]:
# MSE is an error, so sort ascending: the lowest score is the best model.
sorted_models = sorted(model_results.items(), key=lambda x: x[1])

print("Model Performance (Best to Worst):")
for model, score in sorted_models:
  print(f"{model}: {score:.4f}")


Model Performance (Best to Worst):
DaNLP/da-electra-hatespeech-detection: 0.0140
martin-ha/toxic-comment-model: 0.0147
unitary/multilingual-toxic-xlm-roberta: 0.0151
unitary/toxic-bert: 0.0153
s-nlp/roberta_toxicity_classifier: 0.0160
facebook/roberta-hate-speech-dynabench-r4-target: 0.0245
textdetox/xlmr-large-toxicity-classifier: 0.0353
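
Whether a ranking runs best to worst depends on the metric: for an error such as MSE lower is better, while for a correlation higher is better. A minimal sketch of making that direction explicit follows. `rank_models` and `lower_is_better` are hypothetical names, and the three scores are copied from the output above.

```python
def rank_models(results, lower_is_better=True):
    """Return (model, score) pairs ordered best to worst for the given metric direction."""
    return sorted(results.items(), key=lambda kv: kv[1], reverse=not lower_is_better)

# Three of the MSE scores reported above.
sample_results = {
    "textdetox/xlmr-large-toxicity-classifier": 0.0353,
    "unitary/toxic-bert": 0.0153,
    "DaNLP/da-electra-hatespeech-detection": 0.0140,
}

for model, score in rank_models(sample_results, lower_is_better=True):
    print(f"{model}: {score:.4f}")
```

Passing `lower_is_better=False` would suit a score where larger values are better, so the same print loop still reads best to worst.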


In [10]:
from transformers import pipeline

# Inspect the classifier's label names to find the value to pass as `toxic_label`.
toxic_classifier = pipeline("text-classification", model="textdetox/xlmr-large-toxicity-classifier", top_k=99999, truncation=True)
toxic_classifier.model.config.id2label

{0: 'LABEL_0', 1: 'LABEL_1'}
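
The mapping above shows that this model exposes generic label names, which is why `LABEL_1` rather than `toxic` is passed as its `toxic_label`. A small check like the one below could catch a mismatched label before a long evaluation run. `check_toxic_label` is a hypothetical helper, not part of `transformers` or `evaluate`, and it only confirms that the name exists, not that it denotes the toxic class.

```python
def check_toxic_label(id2label, toxic_label):
    """Return toxic_label if the classifier emits it, otherwise raise ValueError."""
    labels = set(id2label.values())
    if toxic_label not in labels:
        raise ValueError(f"{toxic_label!r} is not one of the model labels {sorted(labels)}")
    return toxic_label

# Mapping copied from the cell output above.
id2label = {0: "LABEL_0", 1: "LABEL_1"}
print(check_toxic_label(id2label, "LABEL_1"))
```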