
Safety recipes project folder #3199

Merged: 5 commits on Oct 15, 2020
15 changes: 8 additions & 7 deletions parlai/tasks/bot_adversarial_dialogue/README.md
@@ -1,12 +1,13 @@
Task: Bot Adversarial Dialogue Dataset
===========================
Description: Dialogue datasets labeled with offensiveness from Bot Adversarial Dialogue task
Link:
Arxiv Paper:
===========================
# Task: Bot Adversarial Dialogue Dataset

## Description
Dialogue datasets labeled with offensiveness from Bot Adversarial Dialogue task

[Project](parl.ai/projects/recipes/safety_recipes)
[Arxiv Paper](https://arxiv.org/abs/2010.07079)

## Teachers
The `BotAdversarialDialogueTeacher` in `agents.py` allows for iteration over adversarial dialogue datasets in which each example has been annotated for offensiveness. The `label` field indicates the offensiveness of the final utterance in the `text` field, given the dialogue context that is also included in the `text` field.
The `HumanSafetyEvaluationTeacher` in `agents.py` displays the adversarial dialogue truncations used for the human safety evaluation task, where the final utterance in the `text` field of each episode is evaluated by crowdsourced workers for offensiveness. The exact turn indices of each dialogue truncation shown to the crowdworkers are indicated by the `human_eval_turn_range` field.
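
For a quick look at what each teacher yields, the standard ParlAI `display_data` script can be pointed at the corresponding task names (a minimal sketch; these are the same display commands listed in the safety recipes project README):

```
# Iterate over the offensiveness-annotated dialogues (BotAdversarialDialogueTeacher)
parlai display_data -t bot_adversarial_dialogue

# Iterate over the human safety evaluation truncations (HumanSafetyEvaluationTeacher)
parlai display_data -t bot_adversarial_dialogue:HumanSafetyEvaluation
```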

## Files
4 changes: 4 additions & 0 deletions projects/README.md
@@ -62,12 +62,16 @@ _Task & models for chitchat with a given persona._
- **Dialogue Safety** [[project]](https://parl.ai/projects/dialogue_safety/) [[paper]](https://arxiv.org/abs/1908.06083).
_Task and method for improving the detection of offensive language in the context of dialogue._

- **Recipes for Safety in Open-Domain Chatbots** [[project]](https://parl.ai/projects/safety_recipes/) [[paper]](https://arxiv.org/abs/2010.07079).
_Methods for improving the safety of open-domain chatbots._

- **Mitigating Genderation Bias** [[project]](https://parl.ai/projects/genderation_bias/) [[paper]](https://arxiv.org/abs/1911.03842).
_Analysis and methods for mitigating gender bias in dialogue generation, using LIGHT as a testbed._

- **Multi-Dimensional Gender Bias Classification** [[project]](https://parl.ai/projects/md_gender/) [[paper]](https://arxiv.org/abs/2005.00614)
_Training fine-grained gender bias classifiers to identify gender bias in text._


## Knowledge Grounded

- **Wizard of Wikipedia** [[project]](http://parl.ai/projects/wizard_of_wikipedia/) [[paper]](https://openreview.net/forum?id=r1l73iRqKm).
Binary file added projects/safety_recipes/BAD_safety_diagram.png
64 changes: 64 additions & 0 deletions projects/safety_recipes/README.md
@@ -0,0 +1,64 @@
# Recipes for Safety in Open-domain Chatbots

Jing Xu, Da Ju, Margaret Li, Y-Lan Boureau, Jason Weston, Emily Dinan

## Abstract

Models trained on large unlabeled corpora of human interactions will learn patterns and mimic behaviors therein, which include offensive or otherwise toxic behavior and unwanted biases. We investigate a variety of methods to mitigate these issues in the context of open-domain generative dialogue models. We introduce a new human-and-model-in-the-loop framework for both training safer models and for evaluating them, as well as a novel method to distill safety considerations inside generative models without the use of an external classifier at deployment time. We conduct experiments comparing these methods and find our new techniques are (i) safer than existing models as measured by automatic and human evaluations while (ii) maintaining usability metrics such as engagingness relative to the state of the art. We then discuss the limitations of this work by analyzing failure cases of our models.

## Paper

[Link](https://arxiv.org/abs/2010.07079)


## Data

We release the Bot-Adversarial Dialogue task at `parlai/tasks/bot_adversarial_dialogue`. To view the data, run:

```
parlai display_data -t bot_adversarial_dialogue
```

To view the data used for the fixed test set, run:

```
parlai display_data -t bot_adversarial_dialogue:HumanSafetyEvaluation
```

<p align="center"><img width="60%" src="BAD_safety_diagram.png" /></p>


Data (and models) from the [Build-it, Break-it, Fix-it paper](https://arxiv.org/abs/1908.06083) can be found [here](parl.ai/projects/dialogue_safety).

## Models

A classifier trained on the new Bot-Adversarial Dialogue (BAD) task (as well as other existing safety tasks) can be found at `zoo:bot_adversarial_dialogue/multi_turn_v0/model`.

This model can be downloaded and evaluated on the BAD task test set with the following command:
```
parlai eval_model -t bot_adversarial_dialogue:bad_num_turns=4 -dt test -mf zoo:bot_adversarial_dialogue/multi_turn_v0/model -bs 128
```
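
Beyond test-set evaluation, the classifier can also be probed interactively on typed-in messages. The command below is an illustrative sketch rather than something documented in this project; it assumes the standard ParlAI classifier agent options, where `--print-scores` prints the predicted class and its probability:

```
# Type a message and see whether the BAD classifier flags it as __ok__ or __notok__
parlai interactive -mf zoo:bot_adversarial_dialogue/multi_turn_v0/model --print-scores true
```

Each message you enter should come back labeled with one of the classes used at training time (`__ok__` or `__notok__`) along with the model's confidence.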

To train your own classifier on the BAD dataset and other safety tasks, try the following command:
```
parlai train_model -t dialogue_safety:WikiToxicComments,dialogue_safety:standard,dialogue_safety:adversarial,bot_adversarial_dialogue --model transformer/classifier --load-from-pretrained-ranker True --init-model zoo:pretrained_transformers/bi_model_huge_reddit/model --dict-file zoo:pretrained_transformers/bi_model_huge_reddit/model.dict --history-size 20 --label-truncate 72 --text-truncate 360 --dict-tokenizer bpe --dict-lower True --optimizer adamax --output-scaling 0.06 --variant xlm --reduction-type mean --share-encoders False --learn-positional-embeddings True --n-layers 12 --n-heads 12 --ffn-size 3072 --attention-dropout 0.1 --relu-dropout 0.0 --dropout 0.1 --n-positions 1024 --embedding-size 768 --activation gelu --embeddings-scale False --n-segments 2 --learn-embeddings True --share-word-embeddings False --dict-endtoken __start__ --classes __notok__ __ok__ --round 3 --use-test-set True --model transformer/classifier --multitask-weights 3,1,1,1 -lr 5e-05 -bs 20 --data-parallel True -vtim 60 -vp 30 -stim 60 -vme 10000 --lr-scheduler fixed --lr-scheduler-patience 3 --lr-scheduler-decay 0.9 --warmup_updates 1000 --validation-metric class___notok___f1 --validation-metric-mode max --save-after-valid True --model-file <your model file path>
```
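
After training completes, the resulting checkpoint can be scored on the BAD test set with the same evaluation command used above for the released model, simply swapping in your own checkpoint path:

```
# Evaluate your newly trained classifier on the BAD test set
parlai eval_model -t bot_adversarial_dialogue:bad_num_turns=4 -dt test -mf <your model file path> -bs 128
```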


## Human Evaluations

- Evaluating safety: A Mechanical Turk task for analyzing the safety of models will be released shortly. *Check back soon!*

- Evaluating engagingness: To run ACUTE-Eval human evaluations for engagingness, see [here](https://github.com/facebookresearch/ParlAI/tree/master/parlai/mturk/tasks/acute_eval).


## Citation

If you use the data or models in your own work, please cite with the following BibTeX entry:

    @article{xu2020safetyrecipes,
      author={Jing Xu and Da Ju and Margaret Li and Y-Lan Boureau and Jason Weston and Emily Dinan},
      title={Recipes for Safety in Open-domain Chatbots},
      journal={arXiv preprint arXiv:2010.07079},
      year={2020},
    }