Toxic Language Debiasing
This repo contains the code for our paper "Challenges in Automated Debiasing for Toxic Language Detection". In particular, it contains the code to fine-tune RoBERTa, both vanilla and with the ensemble-based debiasing method, on the task of toxic language prediction, along with the indices of the data points used in our experiments. Our experiments mainly focus on the dataset from "Large Scale Crowdsourcing and Characterization of Twitter Abusive Behavior".
Our implementation lives in the ./src folder: the fine-tuning script organizes the classifier, and modeling_roberta_debias.py builds the ensemble-based (debiased) RoBERTa model.
We require pytorch>=1.2 and transformers==2.3.0. Additional requirements are listed in the requirements file.
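A minimal environment setup consistent with the pinned versions above might look like this (the virtual-environment name is illustrative; only the two package versions come from this README):

```shell
# Create an isolated environment (the name "debias-env" is illustrative)
python -m venv debias-env
source debias-env/bin/activate

# Pin the versions stated above
pip install "torch>=1.2" "transformers==2.3.0"
```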
The index of the training data under each data selection method, as well as the complete list of data entries needed for our experiments, is included in this repo.
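As a sketch of how such index files can be applied (the row layout and field names below are hypothetical toy stand-ins, not the repo's actual format), selecting a training subset boils down to filtering the full dataset by row index:

```python
def select_by_index(rows, keep_indices):
    """Return only the rows whose position appears in keep_indices."""
    keep = set(keep_indices)
    return [row for i, row in enumerate(rows) if i in keep]

# Toy rows standing in for the full Twitter abusive-behavior dataset
rows = [
    {"text": "tweet a", "label": "normal"},
    {"text": "tweet b", "label": "abusive"},
    {"text": "tweet c", "label": "hateful"},
]

# Indices as they would be read from a released index file
subset = select_by_index(rows, keep_indices=[0, 2])
```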
Out-of-distribution (OOD) data: the two OOD datasets we use are publicly available:
- ONI-adv: the test set of "Build it Break it Fix it for Dialogue Safety: Robustness from Adversarial Human Attack"
- User-reported: the dataset from "User-Level Race and Ethnicity Predictors from Twitter Text"
Our word list for lexical bias is included in this repo.
Since we do not encourage building systems on top of our relabeled data, we have decided not to release the relabeling dataset publicly. For research purposes, please contact the first author for access to the dataset.
Measure Dataset Bias
Run

python ./tools/get_stats.py /location/of/your/data_file.csv

to obtain the Pearson correlation between toxicity and Tox-Trig word matches / AAE dialect probabilities.
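The statistic itself is simply the Pearson r between a binary toxicity label and a per-tweet bias feature (e.g. whether the tweet matches a Tox-Trig word, or its dialect probability). A dependency-free sketch with made-up toy data:

```python
import math

def pearson_r(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(xs)
    mx = sum(xs) / n
    my = sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)

# Toy data: 1 = labeled toxic; second list = tweet contains a Tox-Trig word
toxic   = [1, 1, 0, 1, 0, 0]
trigger = [1, 1, 0, 0, 0, 1]
r = pearson_r(toxic, trigger)
```

A positive r here means trigger-word presence correlates with the toxic label, which is exactly the dataset bias the script quantifies.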
Fine-tune a Vanilla RoBERTa
Fine-tune an Ensemble-based RoBERTa
You need to obtain the bias-only model first in order to train the ensemble model. Feel free to use the files we provide in this repo.
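At its core, the ensemble combination is a product of experts: the main model's log-probabilities are summed with the frozen bias-only model's log-probabilities before the loss is computed, so the main model is not rewarded for predictions the bias-only model already explains. A dependency-free sketch of that combination (the actual implementation lives in modeling_roberta_debias.py and operates on RoBERTa logits; the toy numbers here are illustrative):

```python
import math

def log_softmax(logits):
    """Numerically stable log-softmax over a list of logits."""
    m = max(logits)
    log_z = m + math.log(sum(math.exp(x - m) for x in logits))
    return [x - log_z for x in logits]

def poe_log_probs(main_logits, bias_logits):
    """Product of experts: sum the two log-distributions, then renormalize."""
    combined = [a + b for a, b in zip(log_softmax(main_logits),
                                      log_softmax(bias_logits))]
    return log_softmax(combined)

def nll_loss(log_probs, gold):
    """Negative log-likelihood of the gold class."""
    return -log_probs[gold]

# Toy 2-class example: the bias-only model is confident in class 1,
# so when the gold label is class 0 the ensemble loss is much larger
# than the main model's own loss, pushing the main model to compensate.
main = [0.2, 0.1]
bias = [-2.0, 2.0]
loss = nll_loss(poe_log_probs(main, bias), gold=0)
```

At test time the bias-only model is dropped and only the main model is used, which is what makes the ensemble a debiasing method rather than just a bigger classifier.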
Model Evaluation & Measuring Models' Bias
You can use the same fine-tuning script to obtain predictions from models.
The bias-measuring script takes these predictions as input and outputs the models' performance along with lexical/dialectal bias scores; the script is included in this repo.
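As an illustration of what a lexical bias score can look like (a simplified stand-in for the released script, with made-up field names), one natural measure is the false positive rate restricted to non-toxic examples that contain a Tox-Trig word:

```python
def fpr_on_triggered(examples):
    """False positive rate among non-toxic examples containing a trigger word.

    Each example is a dict with 'gold' (0/1 toxicity label),
    'pred' (0/1 model prediction), and 'has_trigger' (bool).
    """
    pool = [e for e in examples if e["gold"] == 0 and e["has_trigger"]]
    if not pool:
        return 0.0
    return sum(e["pred"] for e in pool) / len(pool)

examples = [
    {"gold": 0, "pred": 1, "has_trigger": True},   # biased false positive
    {"gold": 0, "pred": 0, "has_trigger": True},
    {"gold": 0, "pred": 1, "has_trigger": False},  # FP, but not trigger-related
    {"gold": 1, "pred": 1, "has_trigger": True},   # true positive, excluded
]
score = fpr_on_triggered(examples)
```

A debiased model should lower this score relative to the vanilla model without sacrificing overall accuracy.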