# Jigsaw Unintended Bias in Toxicity Classification

At the end of 2017, the [Civil Comments](https://medium.com/@aja_15265/saying-goodbye-to-civil-comments-41859d3a2b1d) platform shut down and chose to make its ~2 million public comments available in a lasting open archive so that researchers could understand and improve civility in online conversations for years to come. Jigsaw sponsored this effort and extended the annotation of this data by human raters for various toxic conversational attributes.



# 1 About Data:

In the data supplied for this competition, the text of the individual comment is found in the `comment_text` column. Each comment in Train has a toxicity label (`target`), and models should predict the `target` toxicity for the Test data.

The data also has several additional toxicity subtype attributes. Models do not need to predict these attributes for the competition; they are included as an additional avenue for research. The subtype attributes are:

- `severe_toxicity`
- `obscene`
- `threat`
- `insult`
- `identity_attack`
- `sexual_explicit`

Additionally, a subset of comments have been labelled with a variety of identity attributes, representing the identities that are mentioned in the comment.

- `male`
- `female`
- `homosexual_gay_or_lesbian`
- `christian`
- `jewish`
- `muslim`
- `black`
- `white`
- `psychiatric_or_mental_illness`

There are also some additional features:

- `created_date`
- `publication_id`
- `parent_id`
- `article_id`
- `rating`
- `funny`
- `wow`
- `sad`
- `likes`
- `disagree`
- `sexual_explicit`
- `identity_annotator_count`
- `toxicity_annotator_count`

<h3> 1.1 Business Problem: </h3>

The Conversation AI team, a research initiative founded by Jigsaw and Google, built a toxicity model and found that it had incorrectly learned to associate the names of frequently attacked identities with toxicity. The model predicted high toxicity for comments containing words such as gay, black, Muslim, white, or lesbian, even when the comments were not actually toxic (e.g. "I am a gay woman."). This happened because the training data was collected from sources where these identities are mostly mentioned in offensive comments. The task is to build a model that detects __toxicity__ in comments while minimizing __unintended bias__ with respect to the identities mentioned.

- __Toxic__ comments are offensive comments that can make people leave a discussion (for example, on public forums).

- __Unintended bias__ is bias that nobody planned for the model to have. Here it arises because the training data was collected from sources where certain words (or identities) appear mostly in offensive comments.


<h3> 1.2 Objective: </h3>

- Predict whether a comment is toxic or not.
- Minimize unintended bias.

<h3> 1.3 Constraints: </h3>

- No strict latency requirements.

# 2 Mapping the real-world problem to an ML problem:

<h3> 2.1 Type of Machine Learning Problem </h3> 

This is a binary classification task: 

- Target label 0 means __non-toxic__ comments 

- Target label 1 means __toxic__ comments.
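
The binary labels above can be derived from the raw `target` column. The following is a minimal sketch, assuming that `target` holds the fraction of raters who marked a comment as toxic and that a cutoff of 0.5 separates the two classes (the cutoff is an assumption taken from the competition's evaluation rules, not from this document). The sample values and the helper name `to_binary_label` are hypothetical.

```python
# Hypothetical sample of fractional `target` values
# (assumed to be the share of raters who flagged the comment as toxic).
targets = [0.0, 0.2, 0.5, 0.83, 0.4]

# Assumption: comments with target >= 0.5 count as toxic.
THRESHOLD = 0.5


def to_binary_label(target, threshold=THRESHOLD):
    """Return 1 (toxic) if target >= threshold, otherwise 0 (non-toxic)."""
    return int(target >= threshold)


labels = [to_binary_label(t) for t in targets]
print(labels)  # [0, 0, 1, 1, 0]
```

The model itself can still be trained on the fractional values; the threshold only matters when a hard toxic/non-toxic label is needed, for example when computing AUC.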

<h3> 2.2 Performance Metric </h3> 

- Source: https://www.kaggle.com/c/jigsaw-unintended-bias-in-toxicity-classification/overview/evaluation

- Metric(s):

This competition uses a newly developed metric that combines several submetrics to balance overall performance with various aspects of unintended bias.

First, we'll define each submetric.

__Overall AUC:__
This is the ROC-AUC for the full evaluation set.

__Bias AUCs:__
To measure unintended bias, we again calculate the ROC-AUC, this time on three specific subsets of the test set for each identity, each capturing a different aspect of unintended bias. These metrics are described in more detail in Conversation AI's paper [Nuanced Metrics for Measuring Unintended Bias with Real Data in Text Classification](https://arxiv.org/abs/1903.04561).

*Subgroup AUC:* Here, we restrict the data set to only the examples that mention the specific identity subgroup. A low value in this metric means the model does a poor job of distinguishing between toxic and non-toxic comments that mention the identity.

*BPSN (Background Positive, Subgroup Negative) AUC:* Here, we restrict the test set to the non-toxic examples that mention the identity and the toxic examples that do not. A low value in this metric means that the model confuses non-toxic examples that mention the identity with toxic examples that do not, likely meaning that the model predicts higher toxicity scores than it should for non-toxic examples mentioning the identity.

*BNSP (Background Negative, Subgroup Positive) AUC:* Here, we restrict the test set to the toxic examples that mention the identity and the non-toxic examples that do not. A low value here means that the model confuses toxic examples that mention the identity with non-toxic examples that do not, likely meaning that the model predicts lower toxicity scores than it should for toxic examples mentioning the identity.
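
To make the three subsets concrete, here is a small pure-Python sketch of how they could be computed for a single identity. It is an illustration under stated assumptions, not the competition's reference implementation: the function names `roc_auc` and `bias_aucs` and the toy data are hypothetical, and the hand-written pairwise AUC stands in for a library routine such as scikit-learn's `roc_auc_score`.

```python
def roc_auc(labels, scores):
    """ROC-AUC as the fraction of (positive, negative) pairs ranked correctly.

    Ties count as half a correct pair. Returns NaN if a class is missing.
    """
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        return float("nan")
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def bias_aucs(labels, scores, in_subgroup):
    """Compute Subgroup, BPSN and BNSP AUC for one identity subgroup."""
    rows = list(zip(labels, scores, in_subgroup))
    # Subgroup: only examples that mention the identity.
    subgroup = [(y, s) for y, s, g in rows if g]
    # BPSN: non-toxic subgroup examples + toxic background examples.
    bpsn = [(y, s) for y, s, g in rows if (g and y == 0) or (not g and y == 1)]
    # BNSP: toxic subgroup examples + non-toxic background examples.
    bnsp = [(y, s) for y, s, g in rows if (g and y == 1) or (not g and y == 0)]

    def auc(pairs):
        return roc_auc([y for y, _ in pairs], [s for _, s in pairs])

    return {
        "subgroup_auc": auc(subgroup),
        "bpsn_auc": auc(bpsn),
        "bnsp_auc": auc(bnsp),
    }


# Toy data: 1 = toxic, 0 = non-toxic; the first three comments mention the identity.
labels = [1, 0, 0, 1, 0, 1]
scores = [0.9, 0.7, 0.2, 0.6, 0.1, 0.8]
in_subgroup = [True, True, True, False, False, False]

result = bias_aucs(labels, scores, in_subgroup)
print(result)  # {'subgroup_auc': 1.0, 'bpsn_auc': 0.75, 'bnsp_auc': 1.0}
```

In this toy example the non-toxic identity comment scored 0.7 outranks the toxic background comment scored 0.6, which is exactly the kind of confusion that lowers the BPSN AUC.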

__Generalized Mean of Bias AUCs:__

To combine the per-identity Bias AUCs into one overall measure, we calculate their generalized mean as defined below:

![Generalized Mean of Bias AUCs](https://miro.medium.com/max/288/1*mdaEgvW3QN3nD1HjRDSgoQ.png)

![Variables](https://miro.medium.com/max/474/1*OIxmlRN66YE23g8kvDckAQ.png)
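
The generalized (power) mean can be sketched in a few lines of Python. The default exponent `p = -5` is an assumption taken from the competition's evaluation page; the function name and the sample AUC values are hypothetical.

```python
def generalized_mean(values, p=-5):
    """Power mean M_p = (mean of v**p) ** (1/p) for positive values.

    A negative p pulls the result toward the smallest value, so a single
    badly served identity drags the combined score down.
    """
    n = len(values)
    return (sum(v ** p for v in values) / n) ** (1.0 / p)


# Hypothetical per-identity AUCs for one bias submetric.
per_identity_aucs = [0.9, 0.9, 0.6]
combined = generalized_mean(per_identity_aucs)
print(combined)
```

With these values the result lands at roughly 0.71, well below the arithmetic mean of 0.8, which shows how the metric rewards models that perform consistently across identities rather than well on average.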



__Final Metric/Score/AUC:__

![final auc](https://miro.medium.com/max/560/1*oWQoDSnOt41GTWDp8V9FbA.png)


![variable](https://miro.medium.com/max/1043/1*Ufhmj7YkqXooBHTtG_16cw.png)
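
Putting the pieces together, the final score can be sketched as a weighted sum of the overall AUC and the generalized means of the three bias AUCs. This is a minimal sketch, assuming four equal weights of 0.25 and `p = -5` as stated on the competition's evaluation page; the function names and argument layout are hypothetical.

```python
def power_mean(values, p=-5):
    """Generalized mean of positive values with exponent p."""
    return (sum(v ** p for v in values) / len(values)) ** (1.0 / p)


def final_score(overall_auc, subgroup_aucs, bpsn_aucs, bnsp_aucs, p=-5, weight=0.25):
    """Combine the overall AUC with the three bias submetrics.

    Each *_aucs argument is a list with one AUC per identity subgroup.
    Assumption: all four terms share the same weight (0.25).
    """
    bias_terms = [power_mean(aucs, p) for aucs in (subgroup_aucs, bpsn_aucs, bnsp_aucs)]
    return weight * overall_auc + sum(weight * term for term in bias_terms)


# Hypothetical values for two identities.
score = final_score(
    overall_auc=0.95,
    subgroup_aucs=[0.90, 0.85],
    bpsn_aucs=[0.88, 0.80],
    bnsp_aucs=[0.93, 0.91],
)
print(score)
```

Because three of the four terms come from the bias AUCs, a model with a high overall AUC but poor behaviour on identity mentions is capped well below a perfect score.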



<h3> 2.3. Machine Learning Objectives and Constraints </h3> 

- Objective: Predict the probability that each data point (comment) is toxic, and maximize the final score defined in the section above.

- Constraints: There are no strict latency requirements. However, we want to penalize errors on data points that mention an identity.