# Analyzing Bias in Detox Dataset

## Step 1: Download datasets

In [1]:
import urllib.request  # the request submodule must be imported explicitly for urlretrieve
import pandas as pd
import seaborn as sns
from matplotlib import pyplot as plt
import numpy as np
import os
import warnings
warnings.filterwarnings('ignore')

In [2]:
# Create the output directories if they do not exist yet
os.makedirs('data', exist_ok=True)
os.makedirs('images', exist_ok=True)

In [3]:
def download_file(url, fname):
    urllib.request.urlretrieve(url, fname)

In [4]:
TOXICITY_ANNOTATED_COMMENTS_URL = 'https://ndownloader.figshare.com/files/7394542'
TOXICITY_ANNOTATIONS_URL = 'https://ndownloader.figshare.com/files/7394539'
TOXICITY_WORKER_DEMOGRAPHICS_URL = 'https://ndownloader.figshare.com/files/7640581'

download_file(TOXICITY_ANNOTATED_COMMENTS_URL, 'data/toxicity_annotated_comments.tsv')
download_file(TOXICITY_ANNOTATIONS_URL, 'data/toxicity_annotations.tsv')
download_file(TOXICITY_WORKER_DEMOGRAPHICS_URL, 'data/toxicity_worker_demographics.tsv')

In [5]:
ATTACK_ANNOTATED_COMMENTS_URL = 'https://ndownloader.figshare.com/files/7554634'
ATTACK_ANNOTATIONS_URL = 'https://ndownloader.figshare.com/files/7554637'
ATTACK_WORKER_DEMOGRAPHICS_URL = 'https://ndownloader.figshare.com/files/7640752'

download_file(ATTACK_ANNOTATED_COMMENTS_URL, 'data/attack_annotated_comments.tsv')
download_file(ATTACK_ANNOTATIONS_URL, 'data/attack_annotations.tsv')
download_file(ATTACK_WORKER_DEMOGRAPHICS_URL, 'data/attack_worker_demographics.tsv')

## Step 2: Load tables using pandas

In [6]:
toxicity_annotations = pd.read_csv("data/toxicity_annotations.tsv", delimiter="\t")
toxicity_annotated_comments = pd.read_csv("data/toxicity_annotated_comments.tsv", delimiter="\t")
toxicity_worker_demographics = pd.read_csv("data/toxicity_worker_demographics.tsv", delimiter="\t")

toxicity_worker_demographics = toxicity_worker_demographics.set_index("worker_id")

In [7]:
attack_annotations = pd.read_csv("data/attack_annotations.tsv", delimiter="\t")
attack_annotated_comments = pd.read_csv("data/attack_annotated_comments.tsv", delimiter="\t")
attack_worker_demographics = pd.read_csv("data/attack_worker_demographics.tsv", delimiter="\t")

attack_worker_demographics = attack_worker_demographics.set_index("worker_id")

# Analysis Questions

## Question 1: Analyze the level of disagreement among crowdworkers around certain labels

#### How much do labelers tend to agree when labelling hostile speech? Do people disagree more about comments that are more likely to be labeled as hostile? Are some kinds of hostile speech harder for people to agree on than others? For example, do labelers tend to disagree more about “personal attacks” vs. “toxicity”?

For this analysis, we use entropy to measure the level of disagreement between workers on the label of each comment. Entropy measures the randomness in the data: the higher it is, the harder it is to draw conclusions from the labels. There are several ways to measure entropy. In this analysis, we use [Shannon's entropy](https://arxiv.org/ftp/arxiv/papers/1405/1405.2061.pdf#:~:text=Meaning%20of%20Entropy,of%20information%20in%20that%20variable), which for a binary label returns a value between 0 and 1. A value of 0 indicates complete agreement, while 1 indicates maximum disagreement, i.e. a 50-50 split between labels.
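As a minimal sketch of the measure described above, the function below computes Shannon entropy for a two-class split. The name `binary_entropy` and the example vote counts are illustrative assumptions, not part of the notebook's pipeline.

```python
import math

def binary_entropy(p):
    """Shannon entropy (in bits) of a two-class split where P(label == 1) = p."""
    if p in (0.0, 1.0):
        return 0.0  # by convention, 0 * log2(0) is treated as 0
    return -(p * math.log2(p) + (1 - p) * math.log2(1 - p))

# Hypothetical splits among 10 annotators: number of "toxic" votes
for toxic_votes in (0, 1, 2, 5):
    p = toxic_votes / 10
    print(toxic_votes, "of 10 toxic ->", round(binary_entropy(p), 6))
```

A unanimous vote gives 0 and an even split gives 1. A 1-to-9 split gives roughly 0.469 and a 2-to-8 split roughly 0.722.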

##### Step 1: We first compute the entropy in the toxicity and personal attacks annotations respectively.  

We also find the majority vote (is_toxic/is_attack) for each comment (e.g. is_toxic = 1 if more than half of the workers label the comment as toxic).
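The per-comment entropy and majority vote can also be derived in one pass, because the mean of a 0/1 label is the fraction of workers who chose 1. The sketch below uses a small made-up annotation table; the frame `toy` and its values are assumptions for illustration only.

```python
import numpy as np
import pandas as pd

# Toy annotations (hypothetical values): one row per (comment, worker) label
toy = pd.DataFrame({
    "rev_id":   [1, 1, 1, 1, 2, 2, 2, 2, 3, 3, 3, 3],
    "toxicity": [0, 0, 0, 0, 1, 1, 0, 0, 1, 1, 1, 0],
})

# Fraction of "toxic" votes and number of votes per comment
per_comment = (
    toy.groupby("rev_id")["toxicity"]
    .agg(p_toxic="mean", total_count="count")
    .reset_index()
)

p = per_comment["p_toxic"].to_numpy()
# log2(0) yields -inf and 0 * -inf yields NaN; silence the warnings and map NaN to 0
with np.errstate(divide="ignore", invalid="ignore"):
    entropy = -(p * np.log2(p) + (1 - p) * np.log2(1 - p))
per_comment["entropy"] = np.nan_to_num(entropy, nan=0.0)

# Strict majority: a tie counts as not toxic
per_comment["is_toxic"] = per_comment["p_toxic"] > 0.5
print(per_comment)
```

Note that a tie (comment 2 in the toy data) has maximum entropy but is not counted as toxic under a strict-majority rule.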

### Toxicity

In [9]:
# Mean toxicity score per comment (all comments, not only the first five)
toxicity_mean = toxicity_annotations.groupby("rev_id")["toxicity_score"].mean().to_frame().reset_index()

In [10]:
toxic_count = pd.DataFrame({"toxic_count": toxicity_annotations.groupby("rev_id")["toxicity"].apply(lambda c : c.sum())}).reset_index()

In [11]:
nontoxic_count = pd.DataFrame({"nontoxic_count": toxicity_annotations.groupby("rev_id")["toxicity"].apply(lambda c : (c == 0).sum())}).reset_index()

In [12]:
joined_toxicity_count = toxic_count.set_index("rev_id").join(nontoxic_count.set_index("rev_id")).reset_index()
joined_toxicity_count['total_count'] = joined_toxicity_count["toxic_count"] + joined_toxicity_count["nontoxic_count"]
joined_toxicity_count['p_toxic'] = joined_toxicity_count['toxic_count']/joined_toxicity_count['total_count']
joined_toxicity_count['p_nontoxic'] = joined_toxicity_count['nontoxic_count']/joined_toxicity_count['total_count']
joined_toxicity_count["entropy"] = -(np.log2(joined_toxicity_count['p_toxic'])*joined_toxicity_count['p_toxic']) - (np.log2(joined_toxicity_count['p_nontoxic'])*joined_toxicity_count['p_nontoxic'])
joined_toxicity_count["entropy"] = joined_toxicity_count["entropy"].fillna(0)
joined_toxicity_count['is_toxic'] = joined_toxicity_count['toxic_count']> joined_toxicity_count['nontoxic_count']

In [13]:
joined_toxicity_count.head()

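| | rev_id | toxic_count | nontoxic_count | total_count | p_toxic | p_nontoxic | entropy | is_toxic |
|---|---|---|---|---|---|---|---|---|
| 0 | 2232.0 | 1 | 9 | 10 | 0.1 | 0.9 | 0.468996 | False |
| 1 | 4216.0 | 0 | 10 | 10 | 0.0 | 1.0 | 0.0 | False |
| 2 | 8953.0 | 0 | 10 | 10 | 0.0 | 1.0 | 0.0 | False |
| 3 | 26547.0 | 0 | 10 | 10 | 0.0 | 1.0 | 0.0 | False |
| 4 | 28959.0 | 2 | 8 | 10 | 0.2 | 0.8 | 0.721928 | False |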


### Personal Attacks

In [14]:
attack_count = pd.DataFrame({"attack_count": attack_annotations.groupby("rev_id")["attack"].apply(lambda c : c.sum())}).reset_index()

In [None]:
nonattack_count = pd.DataFrame({"nonattack_count": attack_annotations.groupby("rev_id")["attack"].apply(lambda c : (c == 0).sum())}).reset_index()

In [None]:
joined_attack_count = attack_count.set_index("rev_id").join(nonattack_count.set_index("rev_id")).reset_index()
joined_attack_count['total_count'] = joined_attack_count["attack_count"] + joined_attack_count["nonattack_count"]
joined_attack_count['p_attack'] = joined_attack_count['attack_count']/joined_attack_count['total_count']
joined_attack_count['p_nonattack'] = joined_attack_count['nonattack_count']/joined_attack_count['total_count']
joined_attack_count["entropy"] = -(np.log2(joined_attack_count['p_attack'])*joined_attack_count['p_attack']) - (np.log2(joined_attack_count['p_nonattack'])*joined_attack_count['p_nonattack'])
joined_attack_count["entropy"] = joined_attack_count["entropy"].fillna(0)
joined_attack_count['is_attack'] = joined_attack_count['attack_count']> joined_attack_count['nonattack_count']

In [None]:
joined_attack_count.head()

##### Step 2: Next we compare the distribution of entropy (level of disagreement) between the comment labels for toxicity and personal attacks.

In [None]:
toxicity_vs_attack_entropy = joined_attack_count[['rev_id', 'entropy']].merge(joined_toxicity_count[['rev_id', 'entropy']], on = 'rev_id')[['entropy_x', 'entropy_y']]
toxicity_vs_attack_entropy.columns = ['attack', 'toxicity']

In [None]:
# Long format: one row per (label type, entropy value)
entropy_long = pd.melt(toxicity_vs_attack_entropy)
fig, ax = plt.subplots(figsize=(15,8))
ax.set_title("Distribution of Level of Disagreement in Different Kinds of Hostile Speech")
sns.violinplot(x="variable", y="value", data=entropy_long, ax=ax, palette='magma')
plt.savefig("images/label_disagreement_toxic_vs_attack.png")

We observe that the distribution and the mean of the entropy are quite similar for the toxicity and personal attack annotations. This suggests that labelers reach a similar level of agreement when identifying the two types of hostile speech.
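A visual comparison can be backed by a few summary statistics per label type. The sketch below is one way to do that, using made-up entropy values; the frame `toy_entropy` and the column `share_unanimous` are hypothetical names introduced for illustration.

```python
import pandas as pd

# Hypothetical per-comment entropies for the two label types
toy_entropy = pd.DataFrame({
    "attack":   [0.0, 0.0, 0.47, 0.72, 1.0],
    "toxicity": [0.0, 0.47, 0.47, 0.88, 0.97],
})

# One row per label type, one column per statistic
summary = toy_entropy.agg(["mean", "median", "std"]).T

# Share of comments on which all workers agreed (entropy of exactly 0)
summary["share_unanimous"] = (toy_entropy == 0).mean()
print(summary)
```

Applied to the real `attack` and `toxicity` entropy columns, a table like this would show whether the two distributions differ in centre, spread, or the share of unanimous comments.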

##### Step 3: We compare the entropy between comments which were labelled as hostile by a majority of the labellers vs comments that were labelled as not hostile

We first compare the number of hostile and non-hostile comments. In both datasets (toxicity and personal attacks), the classes are imbalanced: far fewer comments are marked as hostile by majority vote than as non-hostile.

In [None]:
joined_toxicity_count.groupby(['is_toxic'])['rev_id'].count().reset_index()

In [None]:
joined_attack_count.groupby(['is_attack'])['rev_id'].count().reset_index()

### Toxicity

In [None]:
joined_toxicity_count.groupby('is_toxic')['entropy'].mean().reset_index()

In [None]:
fig, ax = plt.subplots(figsize=(15,8))
ax.set_title("Distribution of Level of Disagreement in Toxicity annotations, by majority vote")
sns.violinplot( x="is_toxic", y="entropy", data=joined_toxicity_count, ax=ax , palette = 'cubehelix')
plt.savefig("images/label_disagreement_toxicity.png")

We see that the mean entropy for comments labeled as not toxic (0.24) is lower than the mean entropy for comments labeled as toxic (0.53). In other words, even when a majority of labelers mark a comment as toxic, the vote is less unanimous than it is for non-toxic comments.

### Personal Attacks

In [None]:
joined_attack_count.groupby('is_attack')['entropy'].mean().reset_index()

In [None]:
fig, ax = plt.subplots(figsize=(15,8))
ax.set_title("Distribution of Level of Disagreement in Personal Attack annotations, by majority vote")
sns.violinplot( x="is_attack", y="entropy", data=joined_attack_count, ax=ax, palette = 'cubehelix' )
plt.savefig("images/label_disagreement_attack.png")

As with the toxicity data, the mean entropy for comments labeled as not attacks (0.28) is lower than the mean entropy for comments labeled as attacks (0.58). In other words, even when a majority of labelers mark a comment as an attack, the vote is less unanimous than it is for non-attack comments.

### Results  

Labelers agree and disagree to a similar degree across the two kinds of hostile speech (toxicity and personal attacks).  
The higher mean entropy for comments voted hostile by a majority of annotators shows that labelers disagree more about those comments. The decision to label a comment as toxic or as an attack is therefore more ambiguous than the decision to label it as benign, so a model trained on this data might not detect toxic comments or personal attacks very reliably.

## Question 2: Explore relationships between worker demographics and labeling behavior

#### How consistent are labelling behaviors among workers with different demographic profiles? For example, are female-identified labelers more or less likely to label comments as aggressive than male-identified labelers?

For this analysis, we compare the proportion of annotations marked as hostile (toxic or attack) between groups within each demographic attribute: gender, age group, education and first language. If labeling behavior does not depend on demographics, we should observe similar proportions across the groups.
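The proportion per group can be sketched with a single `groupby`, since the mean of a 0/1 label within a group is the share of annotations marked hostile. The helper name `hostile_proportion_by_group` and the toy data are assumptions for illustration. Unlike dividing two separately computed count tables, this keeps a group that has no hostile labels (with proportion 0) and reports the group size next to each proportion.

```python
import pandas as pd

def hostile_proportion_by_group(labels, group_col, label_col):
    """Share of annotations equal to 1 within each group, plus the group size."""
    return (
        labels.dropna(subset=[group_col])   # skip annotations without demographic data
        .groupby(group_col)[label_col]
        .agg(proportion="mean", n_labels="count")
        .reset_index()
    )

# Toy joined table (hypothetical values): one row per annotation
toy = pd.DataFrame({
    "gender":   ["female", "female", "male", "male", "male", "other"],
    "toxicity": [1, 0, 0, 0, 1, 0],
})
result = hostile_proportion_by_group(toy, "gender", "toxicity")
print(result)
```

The `n_labels` column matters when reading the results: a proportion computed from a very small group is much noisier than one computed from a large group.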

##### Step 1: Join the annotation data with the worker demographics data

In [None]:
toxicity_worker_demographics = toxicity_worker_demographics.reset_index()
attack_worker_demographics = attack_worker_demographics.reset_index()

In [None]:
toxicity_label_demographics = toxicity_annotations.set_index('worker_id').join(toxicity_worker_demographics.set_index('worker_id'))

In [None]:
toxicity_label_demographics.head()

In [None]:
attack_label_demographics = attack_annotations.set_index('worker_id').join(attack_worker_demographics.set_index('worker_id'))

In [None]:
attack_label_demographics.head()

##### Step 2: Find proportion of comments marked as hostile for different demographic profile groups

### Gender

In [None]:
toxicity_gender_labels = toxicity_label_demographics[toxicity_label_demographics['toxicity'] == 1].groupby('gender')['rev_id'].count().reset_index()
toxicity_gender_totals = toxicity_label_demographics.groupby('gender')['rev_id'].count().reset_index()
toxicity_gender_labels['proportion'] = toxicity_gender_labels['rev_id']/toxicity_gender_totals['rev_id']

In [None]:
toxicity_gender_labels.head()

In [None]:
attack_gender_labels = attack_label_demographics[attack_label_demographics['attack'] == 1].groupby('gender')['rev_id'].count().reset_index()
attack_gender_totals = attack_label_demographics.groupby('gender')['rev_id'].count().reset_index()
attack_gender_labels['proportion'] = attack_gender_labels['rev_id']/attack_gender_totals['rev_id']

In [None]:
attack_gender_labels

In [None]:
toxicity_vs_attack_gender = toxicity_gender_labels.merge(attack_gender_labels, on = 'gender')
toxicity_vs_attack_gender = pd.DataFrame({'gender': toxicity_vs_attack_gender['gender'], 'toxicity':  toxicity_vs_attack_gender['proportion_x'], 'attack': toxicity_vs_attack_gender['proportion_y']})
toxicity_vs_attack_gender = toxicity_vs_attack_gender.set_index('gender').stack().reset_index()

In [None]:
toxicity_vs_attack_gender.columns = ['gender', 'type', 'comments_proportion']

In [None]:
fig, ax = plt.subplots(figsize=(10,5))
ax.set_title("Proportion of Hostile Comments by Gender")
sns.barplot(x = 'gender', y = 'comments_proportion', hue = 'type', palette = 'magma', data = toxicity_vs_attack_gender)
plt.savefig("images/hostile_comments_by_gender.png")

### Age Group

In [None]:
toxicity_age_labels = toxicity_label_demographics[toxicity_label_demographics['toxicity'] == 1].groupby('age_group')['rev_id'].count().reset_index()
toxicity_age_totals = toxicity_label_demographics.groupby('age_group')['rev_id'].count().reset_index()
toxicity_age_labels['proportion'] = toxicity_age_labels['rev_id']/toxicity_age_totals['rev_id']

In [None]:
toxicity_age_labels

In [None]:
attack_age_labels = attack_label_demographics[attack_label_demographics['attack'] == 1].groupby('age_group')['rev_id'].count().reset_index()
attack_age_totals = attack_label_demographics.groupby('age_group')['rev_id'].count().reset_index()
attack_age_labels['proportion'] = attack_age_labels['rev_id']/attack_age_totals['rev_id']

In [None]:
attack_age_labels

In [None]:
toxicity_vs_attack_age = toxicity_age_labels.merge(attack_age_labels, on = 'age_group')
toxicity_vs_attack_age = pd.DataFrame({'age_group': toxicity_vs_attack_age['age_group'], 'toxicity':  toxicity_vs_attack_age['proportion_x'], 'attack': toxicity_vs_attack_age['proportion_y']})
toxicity_vs_attack_age = toxicity_vs_attack_age.set_index('age_group').stack().reset_index()

In [None]:
fig, ax = plt.subplots(figsize=(10,5))
ax.set_title("Proportion of Hostile Comments by Age Group")
toxicity_vs_attack_age.columns = ['age_group', 'type', 'comments_proportion']
sns.barplot(x = 'age_group', y = 'comments_proportion', hue = 'type', palette = 'magma', data = toxicity_vs_attack_age)
plt.savefig("images/hostile_comments_by_age.png")

### Education

In [None]:
toxicity_education_labels = toxicity_label_demographics[toxicity_label_demographics['toxicity'] == 1].groupby('education')['rev_id'].count().reset_index()
toxicity_education_totals = toxicity_label_demographics.groupby('education')['rev_id'].count().reset_index()
toxicity_education_labels['proportion'] = toxicity_education_labels['rev_id']/toxicity_education_totals['rev_id']

In [None]:
toxicity_education_labels

In [None]:
attack_education_labels = attack_label_demographics[attack_label_demographics['attack'] == 1].groupby('education')['rev_id'].count().reset_index()
attack_education_totals = attack_label_demographics.groupby('education')['rev_id'].count().reset_index()
attack_education_labels['proportion'] = attack_education_labels['rev_id']/attack_education_totals['rev_id']

In [None]:
attack_education_labels

In [None]:
toxicity_vs_attack_education = toxicity_education_labels.merge(attack_education_labels, on = 'education')
toxicity_vs_attack_education = pd.DataFrame({'education': toxicity_vs_attack_education['education'], 'toxicity':  toxicity_vs_attack_education['proportion_x'], 'attack': toxicity_vs_attack_education['proportion_y']})
toxicity_vs_attack_education = toxicity_vs_attack_education.set_index('education').stack().reset_index()

In [None]:
fig, ax = plt.subplots(figsize=(10,5))
ax.set_title("Proportion of Hostile Comments by Education")
toxicity_vs_attack_education.columns = ['education', 'type', 'comments_proportion']
sns.barplot(x = 'education', y = 'comments_proportion', hue = 'type', palette = 'magma', data = toxicity_vs_attack_education)
plt.savefig("images/hostile_comments_by_education.png")

### First language

In [None]:
toxicity_language_labels = toxicity_label_demographics[toxicity_label_demographics['toxicity'] == 1].groupby('english_first_language')['rev_id'].count().reset_index()
toxicity_language_totals = toxicity_label_demographics.groupby('english_first_language')['rev_id'].count().reset_index()
toxicity_language_labels['proportion'] = toxicity_language_labels['rev_id']/toxicity_language_totals['rev_id']

In [None]:
toxicity_language_labels

In [None]:
attack_language_labels = attack_label_demographics[attack_label_demographics['attack'] == 1].groupby('english_first_language')['rev_id'].count().reset_index()
attack_language_totals = attack_label_demographics.groupby('english_first_language')['rev_id'].count().reset_index()
attack_language_labels['proportion'] = attack_language_labels['rev_id']/attack_language_totals['rev_id']

In [None]:
attack_language_labels

In [None]:
toxicity_vs_attack_language = toxicity_language_labels.merge(attack_language_labels, on = 'english_first_language')
toxicity_vs_attack_language = pd.DataFrame({'english_first_language': toxicity_vs_attack_language['english_first_language'], 'toxicity':  toxicity_vs_attack_language['proportion_x'], 'attack': toxicity_vs_attack_language['proportion_y']})
toxicity_vs_attack_language = toxicity_vs_attack_language.set_index('english_first_language').stack().reset_index()

In [None]:
toxicity_vs_attack_language.columns = ['english_first_language', 'type', 'comments_proportion']
sns.barplot(x = 'english_first_language', y = 'comments_proportion', hue = 'type', palette = 'magma', data = toxicity_vs_attack_language)

### Results  

We observe that similar proportions of comments are tagged as toxic or as attacks across age groups and first-language groups. The proportion of comments marked as toxic is also similar across genders, but workers who identify their gender as 'other' label a higher proportion of comments as containing attacks. A possible reason is that the comments annotated for attacks contain gender-related words or hostility toward non-binary people.  
We also see that workers who reported no formal education labelled fewer comments as containing personal attacks.

# Further Implications

**Q1: Which, if any, of these demo applications would you expect the Perspective API—or any model trained on the Wikipedia Talk corpus—to perform well in? Why?**  
1. Comment Blur Filter: (link) The Perspective API, or any model trained on the Wikipedia Talk corpus, should categorise comments as toxic or non-toxic reasonably well here. The use case involves open discussion platforms with no niche scope, which is close to the data the models were trained on, so I would expect it to perform well.  
2. WikiDetox: (link) The Perspective API, or any model trained on the Wikipedia Talk corpus, should perform well here because the application's goal matches the purpose for which the API and the data were created.  
3. Author Perspective for Drupal: (link) This application needs to filter toxic and harassing comments, which is the use case the Perspective API and the Wikipedia Talk corpus were designed for.  

**Q2: Which, if any, of these demo applications would you expect the Perspective API to perform poorly in? Why?**  
1. Toxicity Timeline: (link) It would likely perform poorly because the Wikipedia corpus does not record timing information at the accuracy this application requires.  
2. An authorship experience demo for Perspective API: (link) The Perspective API may not perform well here because this is feedback data. Comments are generally free-form and do not follow a theme. Also, people can comment on anything, whether they like it or not, but they are mostly biased toward giving feedback only when it is negative.  
3. Hot Topics: (link) The Wikipedia Talk corpus consists of comments, whereas this demo scores whole documents. The Perspective API may or may not perform well here.  

**Q3: What are some kinds of hostile speech that would be difficult to accurately detect using the approach used to train the Perspective API models?**  
1. Comments that are not in English would be difficult to detect accurately. The model is trained on English text, so if it encounters a negative word in another language or script, it may not recognise it as toxic or aggressive.
2. Sarcasm could be difficult to detect with these models.
3. Text that is not comment-like may not be handled well. For example, it would be difficult to detect a lengthy toxic news article, because comments are generally 1-10 sentences long.