# Mini Challenge 1: Bias

## Evaluating biases in a predictor for toxic text comments

In this mini-challenge, we examine the biases that arise when classifying text comments as toxic.

In 2018 Jigsaw sponsored a challenge on <a href="https://www.kaggle.com/competitions/jigsaw-toxic-comment-classification-challenge/overview">Kaggle</a> to classify toxic comments. In a <a href="https://www.kaggle.com/competitions/jigsaw-unintended-bias-in-toxicity-classification/data">2019 follow-up challenge</a>, forum comments were labelled with toxicity levels and identity groups with the goal of minimising unintended identity bias. <br>

In this mini challenge we delve into a part of the data set (roughly 100 thousand of the 2 million comments) and try to get an idea of how bias is represented in language.

## Imports

In [None]:
#import some libraries that are required or might prove helpful
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.utils import shuffle
from sklearn.metrics import confusion_matrix, ConfusionMatrixDisplay

import matplotlib

## Reducing the original dataset

The original data has been prepared as follows and provided to you in the file "mini_data.csv":

```
np.random.seed(0)

full_data = pd.read_csv("./2017/all_data.csv").dropna()
data = pd.read_csv("./2017/all_data.csv")[["comment_text","toxicity"]].dropna()

toxics = (data["toxicity"] > 0.7)
racist_data = data[toxics]
cleaner_data = data[~toxics]

subset = shuffle(pd.concat([cleaner_data.sample(n=40000), racist_data]))

subset.to_csv("./mini_data.csv")
```

## Load the data

We load the provided data set. We see that it has three columns "original_id" (int), "comment_text" (string) and "toxicity" (float in the interval [0,1]).

In [None]:
dataset = pd.read_csv("./mini_data.csv", header=0)
dataset.describe()

In [None]:
dataset.head()

## Prepare the data

We choose the column "comment_text" as our independent variable and "toxicity" as our dependent variable. Comments with a toxicity score above 0.7 will be labelled as "toxic", the others as "not toxic".

In [None]:
toxicity_cut = 0.7

comments = dataset["comment_text"]
target = (dataset["toxicity"] > toxicity_cut).astype(int)

If you want to check out some of the comments, run the following cells. Can you already get an idea how toxicity might be classified?

In [None]:
bools = (dataset["toxicity"] > toxicity_cut)
toxic_comments = dataset[bools]
toxic_toxicity = dataset["toxicity"][bools]

non_toxic_comments = dataset[~bools]
non_toxic_toxicity = dataset["toxicity"][~bools]

In [None]:
#print comments labelled as toxic
for i, (comment, tar) in enumerate(zip(toxic_comments["comment_text"].iloc[0:10], toxic_toxicity.iloc[0:10])):
        print(f"{i} (score: {tar}): {comment}\n")

In [None]:
#print comments not labelled as toxic
for i, (comment, tar) in enumerate(zip(non_toxic_comments["comment_text"].iloc[0:10], non_toxic_toxicity.iloc[0:10])):
        print(f"{i} (score: {tar}): {comment}\n")

## Train the model

For simplicity, we chose a logistic regression as this is a classification problem. This choice will show some drawbacks, which are examined later in this notebook.

In [None]:
# set the random seed to get comparable results
np.random.seed(0)

# prepare the data
comments_train, comments_test, y_train, y_test = train_test_split(comments, target, test_size = 0.3, stratify = target) 

# we use a count vectorizer to break the individual comments into tokens, one token per word
vectorizer = CountVectorizer()
vectorizer.fit(comments_train)

# we create the training and test matrices
X_train = vectorizer.transform(comments_train)
X_test = vectorizer.transform(comments_test)

# train the classifier
classifier = LogisticRegression(max_iter=2000)
classifier.fit(X_train, y_train)
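
To get a feeling for what the count vectorizer produces, here is a minimal sketch on two made-up sentences. The names `demo_vectorizer` and `demo_counts` are hypothetical and not part of the notebook. With the default settings, the text is lowercased, punctuation is dropped, and tokens shorter than two characters (such as "I" and "a") are ignored.

```python
from sklearn.feature_extraction.text import CountVectorizer

# Toy sentences (made up for illustration)
demo_vectorizer = CountVectorizer()
demo_counts = demo_vectorizer.fit_transform(["I have a friend!", "A friend, a GOOD friend."])

# The vocabulary contains lowercase tokens only; columns follow alphabetical order
print(sorted(demo_vectorizer.vocabulary_))  # ['friend', 'good', 'have']

# One row per sentence, one column per vocabulary word, entries are word counts
print(demo_counts.toarray())  # [[1 0 1], [2 1 0]]
```

Each comment is thus reduced to a bag of word counts; word order and context are lost.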

## Test the model

The model classifies about 90% of the comments correctly.

In [None]:
#test the data
score = classifier.score(X_test, y_test)
print("Accuracy:", score)

A look at the confusion matrix shows that false positives and false negatives are about equal.

In [None]:
#plot the confusion matrix
cm = confusion_matrix(y_test, classifier.predict(X_test))
disp = ConfusionMatrixDisplay(confusion_matrix=cm)
disp.plot()
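
If you prefer numbers over a plot, the four cells of a binary confusion matrix can be unpacked directly. The following is a small sketch on made-up labels (`y_true_demo` and `y_pred_demo` are hypothetical); in the notebook you would presumably pass `y_test` and the model predictions instead.

```python
import numpy as np
from sklearn.metrics import confusion_matrix

# Toy labels (made up), 1 = toxic, 0 = not toxic
y_true_demo = np.array([0, 0, 0, 0, 1, 1, 1, 0, 1, 0])
y_pred_demo = np.array([0, 0, 1, 0, 1, 0, 1, 0, 1, 0])

# Rows are true classes, columns are predicted classes,
# so ravel() yields the counts in the order tn, fp, fn, tp.
tn, fp, fn, tp = confusion_matrix(y_true_demo, y_pred_demo).ravel()

precision = tp / (tp + fp)  # share of predicted-toxic comments that are toxic
recall = tp / (tp + fn)     # share of toxic comments that were found
print(f"TN={tn} FP={fp} FN={fn} TP={tp}")
print(f"precision={precision:.2f} recall={recall:.2f}")
```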

## Data Wrangling

Let us have a look at some falsely identified comments.

**Question 1:** Looking at the comments, what might be potential reasons for the false classifications? Find three notable examples. Also classify the bias type and think of ways to prevent it.

In [None]:
# examine false negatives
for k, (i, j) in enumerate(zip(y_test, X_test)):
    if ( i and classifier.predict(j) == 0):
        print(f"{k}: {comments_test.iloc[k]}\n")

In [None]:
# examine false positives
for k, (i, j) in enumerate(zip(y_test, X_test)):
    if (not i and classifier.predict(j) == 1):
        print(f"{k}: {comments_test.iloc[k]}\n")
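
The loops above call the classifier once per comment, which can be slow. A possible alternative is to predict once and select the errors with boolean masks. The sketch below uses made-up arrays (`y_true_demo`, `y_pred_demo` and `comments_demo` are hypothetical stand-ins for `y_test`, the predictions and `comments_test`).

```python
import numpy as np

# Toy stand-ins (made up) for the true labels, the predictions and the comments
y_true_demo = np.array([1, 0, 1, 0, 0])
y_pred_demo = np.array([0, 0, 1, 1, 0])
comments_demo = np.array(["comment a", "comment b", "comment c", "comment d", "comment e"])

# Boolean masks select the two kinds of errors
false_negative_mask = (y_true_demo == 1) & (y_pred_demo == 0)
false_positive_mask = (y_true_demo == 0) & (y_pred_demo == 1)

false_negatives = comments_demo[false_negative_mask]
false_positives = comments_demo[false_positive_mask]
print("False negatives:", list(false_negatives))
print("False positives:", list(false_positives))
```

In the notebook, the predictions would presumably come from a single call to `classifier.predict(X_test)`.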

## Examine the decision strategies of the model

The model lowercases the input and strips its punctuation, tokenises the words separated by spaces, and assigns each word a value that functions as a "toxicity score". Let's take a look at the 50 words the model considers the strongest indicators of toxicity.

**Question 2:** Can you find potential problems with some entries?

In [None]:
# We are abusing some model internals by extracting the coefficients directly. 
# Each coefficient represents the weight an individual occurrence of the word adds 
# to the decision whether the whole comment is toxic.

coefficients = pd.DataFrame({"word": sorted(list(vectorizer.vocabulary_.keys())), "coeff": classifier.coef_[0]})

coefficients.sort_values(by=['coeff']).tail(50)

The following function predicts (based on our model) whether a string is considered toxic or not. We shall use it.

In [None]:
# Function to classify any string
def classify_string(string, investigate=False):
    prediction = classifier.predict(vectorizer.transform([string]))[0]
    if prediction == 0:
        print("NOT TOXIC:", string)
    else:
        print("TOXIC:", string)

Test the model on some strings, starting with "i have a friend". What happens if the friend is white, Christian, Muslim or black?<br>
Does this imply that black people are talked about better than Muslims?
What about statements like "that is damn good"?<br>

Do not use punctuation or capitalisation! The coefficient lookup below only matches lowercase words exactly as they appear in the vocabulary.

In [None]:
# Set the value of new_comment
new_comment = "i have a friend"

# Do not change the code below
classify_string(new_comment)
coefficients[coefficients.word.isin(new_comment.split())]

In [None]:
# Set the value of new_comment
new_comment = "i have a christian friend"

# Do not change the code below
classify_string(new_comment)
coefficients[coefficients.word.isin(new_comment.split())]

In [None]:
# Set the value of new_comment
new_comment = "i have a black car"

# Do not change the code below
classify_string(new_comment)
coefficients[coefficients.word.isin(new_comment.split())]

In [None]:
# Set the value of new_comment
new_comment = "that is damn good"

# Do not change the code below
classify_string(new_comment)
coefficients[coefficients.word.isin(new_comment.split())]

Let's have a look at the coefficients for various words of interest. Feel free to add any words you deem interesting. <br>

**Question 3:** Discuss some scores. Relate them to words with similar meaning and think about the impact they have on the model's application.

In [None]:
# Set the value of new_comment
new_comment = "black muslim faggot lesbian gay homosexual heterosexual indian native hispanic american mexican immigrant nazi military color colour cross halfmoon allah god trump hitler man woman male female"
coefficients[coefficients.word.isin(new_comment.split())].sort_values(by=['coeff'])

## Changing hyperparameters

**Question 4:** In the beginning, we used an arbitrary value of 0.7 as the threshold for determining toxic comments. What can you observe if you change the threshold to 0.6? Are the comments still clearly toxic? Have a look at the first 20 comments.
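
As a starting point, here is a minimal sketch of how the labels change with the threshold. It uses a made-up data frame (`toy` and its scores are hypothetical); in the notebook you would presumably change `toxicity_cut` and re-run the cells that build `target`, `toxic_comments` and the model.

```python
import pandas as pd

# Toy data (made up) with the same column names as the provided data set
toy = pd.DataFrame({
    "comment_text": ["c1", "c2", "c3", "c4", "c5"],
    "toxicity": [0.10, 0.65, 0.72, 0.90, 0.61],
})

# Count how many comments are labelled toxic for each threshold
toxic_counts = {}
for cut in (0.7, 0.6):
    labels = (toy["toxicity"] > cut).astype(int)
    toxic_counts[cut] = int(labels.sum())
    print(f"threshold {cut}: {toxic_counts[cut]} comments labelled toxic")
```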

## Identifying Biases

**Question 5:** Which biases can you find in the data set and the model? And why? Explicitly, think of:
- Representation Bias
- Selection Bias
- Measurement Bias
- Exclusion Bias
- Algorithmic Bias
- Historical Bias (opinions about topics might change over time)

You can find the complete and original data set with its description <a href="https://www.kaggle.com/competitions/jigsaw-unintended-bias-in-toxicity-classification/data">here</a>.

TEACHER ONLY

Examples of reasons for bias:

- Selection Bias: the data represents mostly British people with access to the internet that had an interest in the forum.
- Representation Bias: those who don't fulfill the above criteria will be underrepresented
- Measurement Bias: Spelling errors
- Exclusion Bias: Potentially some extreme comments (death threats etc) were deleted by moderators
- Algorithmic Bias: "Go sleep with the fishes" consists only of words the algorithm considers good. The model cannot differentiate between "anti-xxx" and "xxx".
- Historical Bias: The data is a snapshot until 2017. Biases in society that were prevalent at that time influence the data, especially when big events occur ("migration crisis").

For the next task, be aware that the author of this exercise MIGHT have meddled with the data set in order to enhance some effects for didactic purposes.

**Question 6:** How can you tell that the data was selected with that intent? Which biases are caused by the author maliciously manipulating the dataset for this exercise?

## Algorithmic bias

The model we chose, logistic regression, is quite a simple and linear model and shows clear limitations. 

**Question 7:** How does the classification work for long sentences or multiple occurrences of the same word? How do combinations of words matter? 

Tip: examine the following code with `toxicity_cut = 0.7` (it should switch from non-toxic to toxic)

In [None]:
magic_word =  "cholestoral "
# classify_string prints its result itself and returns None, so no extra print is needed
classify_string(magic_word * 61)
classify_string(magic_word * 62)
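
To see why repetition matters, it helps to look at the raw score of a logistic regression on word counts: it is the intercept plus, for every word, its coefficient times its count. The sketch below trains a tiny model on a made-up corpus (all `toy_*` names and the example sentences are hypothetical) and compares the model's score with that formula.

```python
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression

# Toy corpus (made up), 1 = toxic, 0 = not toxic
toy_comments = ["you are stupid", "stupid idea", "nice idea",
                "you are nice", "have a nice day", "stupid stupid day"]
toy_labels = [1, 1, 0, 0, 0, 1]

toy_vectorizer = CountVectorizer()
toy_model = LogisticRegression().fit(toy_vectorizer.fit_transform(toy_comments), toy_labels)

word = "stupid"
weight = toy_model.coef_[0][toy_vectorizer.vocabulary_[word]]
intercept = toy_model.intercept_[0]

# Compare the model's raw score with intercept + n * weight for n repetitions
scores = {}
for n in (1, 2, 5):
    text = " ".join([word] * n)
    scores[n] = toy_model.decision_function(toy_vectorizer.transform([text]))[0]
    print(n, round(scores[n], 4), round(intercept + n * weight, 4))
```

A comment is classified as toxic once this score exceeds zero, so each repetition of a word shifts the score by the same amount, regardless of the surrounding words.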

## Cultural Differences/Deployment

**Question 8:** Take this list of Australian swear words and check how the model judges them. Explain why it struggles to identify them and which bias is in play if you deploy it in an Australian context. How do you mitigate this?

In [None]:
australian_swear_words = ["bugger off", "fair suck of the sav", "get stuffed", "dads", "derro", "fanny", "drongo", "bogan"]

In [None]:
for word in australian_swear_words:
    classify_string(word)
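
One way to investigate is to check which tokens of a string were ever seen during training. The sketch below uses a tiny, made-up vocabulary; `demo_vectorizer` and the helper `unknown_tokens` are hypothetical and not part of the notebook. In the notebook you could pass the fitted `vectorizer` instead.

```python
from sklearn.feature_extraction.text import CountVectorizer

# Vectorizer fitted on two made-up sentences
demo_vectorizer = CountVectorizer().fit(["you are a stupid idiot", "have a nice day"])

def unknown_tokens(text, fitted_vectorizer):
    """Return the tokens of `text` that are missing from the fitted vocabulary."""
    analyzer = fitted_vectorizer.build_analyzer()
    return [token for token in analyzer(text) if token not in fitted_vectorizer.vocabulary_]

print(unknown_tokens("you drongo", demo_vectorizer))

# Unknown words are silently ignored: the resulting count vector has no non-zero entries
print(demo_vectorizer.transform(["drongo bogan"]).nnz)
```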

**Question 9:** Think of a real world use case of an algorithm like this. How should it weigh false positives against false negatives? Which control structures (human in/out of the loop etc) should accompany it?

## Prevention/Mitigation

Consider <a href = "https://ai.meta.com/blog/measure-fairness-and-mitigate-ai-bias/">this </a> approach by Meta. 

**Question 10:** Discuss the effectiveness and the implicit requirements. Consider the context of Meta's business model.

## Further Reading

**The original on which this course is based:**

- [Identifying Bias in AI](https://www.kaggle.com/code/alexisbcook/identifying-bias-in-ai)

**Some papers:**

- [A Framework for Understanding Sources of Harm throughout the Machine Learning Life Cycle](https://arxiv.org/pdf/1901.10002)
- [Nuanced Metrics for Measuring Unintended Bias with Real Data for Text Classification](https://storage.googleapis.com/gweb-research2023-media/pubtools/5007.pdf)
- [Measuring and Mitigating Unintended Bias in Text Classification](https://storage.googleapis.com/gweb-research2023-media/pubtools/4578.pdf)

**Tutorial about bias mitigation:**

- [How To: Preprocessing for GloVe Part1: EDA](https://www.kaggle.com/code/christofhenkel/how-to-preprocessing-for-glove-part1-eda)