Toxicity

The toxicity model classifies whether a comment is rude, disrespectful, or unreasonable in a way that is likely to make people leave a discussion.

Overview

Figure: ROC curve for the current TOXICITY model (TOXICITY@6), showing True Positive Rate (y-axis) against False Positive Rate (x-axis).
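For readers less familiar with ROC curves, the sketch below shows how the points on such a curve can be computed from labels and model scores. The function name `roc_points` and the sample data are illustrative assumptions, not part of Perspective's tooling or results.

```python
def roc_points(labels, scores):
    """Return (false_positive_rate, true_positive_rate) pairs, one per threshold.

    labels: 1 for toxic, 0 for non-toxic. scores: model scores in [0, 1].
    A comment is predicted toxic when its score is >= the threshold.
    """
    positives = sum(labels)
    negatives = len(labels) - positives
    points = [(0.0, 0.0)]  # threshold above every score: nothing flagged
    for threshold in sorted(set(scores), reverse=True):
        tp = sum(1 for y, s in zip(labels, scores) if s >= threshold and y == 1)
        fp = sum(1 for y, s in zip(labels, scores) if s >= threshold and y == 0)
        points.append((fp / negatives, tp / positives))
    return points


# Hypothetical labels and scores, for illustration only.
labels = [1, 1, 0, 1, 0, 0]
scores = [0.9, 0.8, 0.7, 0.4, 0.3, 0.1]
points = roc_points(labels, scores)
```

Plotting the second element of each pair against the first gives the curve; lowering the threshold moves along it from (0, 0) towards (1, 1).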

Intended use

Human-assisted moderation

Make moderation easier with an ML-assisted tool that helps prioritize comments for human moderation, and create custom tasks for automated actions. See our moderator tool as an example.
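As a minimal sketch of what prioritization could look like, the snippet below orders already-scored comments so that the highest-scoring ones reach a human moderator first. The function name, comment IDs, and scores are hypothetical; this is not the moderator tool's implementation.

```python
def prioritize_for_review(scored_comments):
    """Sort (comment_id, toxicity_score) pairs, highest score first.

    The scores only decide the order in which humans look at comments;
    the moderation decision itself stays with the moderator.
    """
    return sorted(scored_comments, key=lambda item: item[1], reverse=True)


# Hypothetical comment IDs and scores, for illustration only.
queue = prioritize_for_review([("c1", 0.12), ("c2", 0.91), ("c3", 0.47)])
```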

Author feedback

Assist authors in real time when their comments might violate your community guidelines or may be perceived as “Toxic” to the conversation. Use simple feedback tools when the assistant gets it wrong. See our authorship demo as an example.

Read better comments

Organize comments on topics that are often difficult to discuss online. Build new tools that help people explore the conversation.

Uses to avoid

Fully automated moderation

Perspective is not intended to be used for fully automated moderation. Machine learning models will always make some mistakes, so it is essential to build in systems for humans to catch and correct those mistakes.

Character judgement

In order to maintain user privacy, the TOXICITY model only helps detect toxicity in an individual statement, and is not intended to detect anything about the individual who said it. In addition, Perspective does not use prior information about an individual to inform toxicity predictions.

Model details

Training data

The training data is proprietary to Perspective API. It includes comments from online forums such as Wikipedia (CC-BY-SA3 license) and The New York Times, with crowdsourced labels of whether each comment is “toxic”, defined as “a rude, disrespectful, or unreasonable comment that is likely to make people leave a discussion”.

Model architecture

The model is a Convolutional Neural Network (CNN) trained with GloVe word embeddings, which are fine-tuned during training. You can also train your own deep CNN for text classification on our public toxicity dataset, and explore our open-source model training tools to train your own models.

Values

Community, Transparency, Inclusivity, Privacy, and Topic neutrality. These values guide our product and research decisions.

Evaluation data

Overall evaluation data

The overall evaluation result (shown above) is calculated using the held-out test set associated with the training set for the specific model. This means that each new model version is likely to have a different training and test set, so overall results are not directly comparable across models.

Unintended bias evaluation data

The unintended bias evaluation result is calculated using a synthetically generated test set in which a range of identity terms are swapped into template sentences, both toxic and non-toxic. Results are presented grouped by identity term. Note that this evaluation looks only at the identity terms present in the text. To protect user privacy, we do not look at the identities of comment authors or readers.
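The template approach described above can be sketched as follows. The templates, identity terms, and the `build_synthetic_set` helper are illustrative assumptions and are not taken from the actual evaluation set.

```python
from itertools import product

# Hypothetical templates: (sentence with an {identity} slot, toxicity label).
templates = [
    ("I am a {identity} person.", 0),
    ("All {identity} people are awful.", 1),
]
# Hypothetical identity terms.
identity_terms = ["gay", "straight", "transgender"]


def build_synthetic_set(templates, identity_terms):
    """Fill every template with every identity term.

    Each term appears in the same toxic and non-toxic contexts, so score
    differences between terms can be attributed to the term itself.
    """
    return [
        {"text": template.format(identity=term), "identity": term, "toxic": label}
        for (template, label), term in product(templates, identity_terms)
    ]


rows = build_synthetic_set(templates, identity_terms)
```

Because every identity term fills the same templates, each term ends up with the same number of toxic and non-toxic examples.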

Group factors

Identity terms referencing frequently attacked groups, focusing on sexual orientation, gender identity, and race.

Caveats

The current synthetic test data covers only a small set of very specific comments and identities. While these are designed to be representative of common use cases and concerns, the test data is not comprehensive.

Unitary Identity Subgroup Evaluation

To measure unintended bias, we calculate three separate ROC-AUC results for each identity. Each result captures a different type of unintended bias and each is calculated by restricting the data set to different subsets:

Test set	Description
Subgroup AUC	Here, we restrict the data set to only the examples that mention the specific identity subgroup. A low value in this metric means the model does a poor job of distinguishing between toxic and non-toxic comments that mention the identity.
BPSN AUC	Here, we restrict the test set to the non-toxic examples that mention the identity and the toxic examples that do not. A low value in this metric means that the model confuses non-toxic examples that mention the identity with toxic examples that do not, likely meaning that the model predicts higher toxicity scores than it should for non-toxic examples mentioning the identity.
BNSP AUC	Here, we restrict the test set to the toxic examples that mention the identity and the non-toxic examples that do not. A low value here means that the model confuses toxic examples that mention the identity with non-toxic examples that do not, likely meaning that the model predicts lower toxicity scores than it should for toxic examples mentioning the identity.
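The three restrictions in the table can be sketched in a few lines of Python. This is a minimal illustration under assumed names (`auc`, `bias_aucs`) with made-up example data; it is not the evaluation code used to produce the published results.

```python
def auc(labels, scores):
    """ROC-AUC: the probability that a random toxic example scores higher
    than a random non-toxic one (ties count as half)."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        return float("nan")  # undefined without both classes
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def bias_aucs(examples):
    """examples: list of (mentions_identity, toxic_label, score) tuples."""
    def restricted(keep):
        subset = [(t, s) for m, t, s in examples if keep(m, t)]
        return auc([t for t, _ in subset], [s for _, s in subset])

    return {
        # Only examples that mention the identity.
        "subgroup_auc": restricted(lambda m, t: m),
        # Non-toxic identity mentions plus toxic examples without the identity.
        "bpsn_auc": restricted(lambda m, t: (m and t == 0) or (not m and t == 1)),
        # Toxic identity mentions plus non-toxic examples without the identity.
        "bnsp_auc": restricted(lambda m, t: (m and t == 1) or (not m and t == 0)),
    }


# Hypothetical data: the non-toxic identity mention gets a high score (0.8).
examples = [
    (True, 1, 0.9),
    (True, 0, 0.8),
    (False, 1, 0.7),
    (False, 0, 0.1),
]
results = bias_aucs(examples)
```

In this toy data the non-toxic comment that mentions the identity outscores the toxic comment that does not, so the BPSN AUC drops to 0.0 while the other two metrics stay at 1.0. That is the false-positive pattern the BPSN metric is designed to surface.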

Below are unintended bias evaluation results for a subset of identities for two versions of our model: the initial TOXICITY@1, launched in February 2017, and the latest TOXICITY@6, launched in August 2018. See our results for all versions of Toxicity models here, including results for more identity terms and more intersectional results.

Intersectional Identity Subgroup Evaluation

The intersectional evaluation shows results for comments mentioning two identities.

Get involved

If you have any questions, feedback, or additional things you'd like to see in the model card, please reach out to us here.
