# Unintended Bias in Toxicity Classification

*Final Project for LING-583, Spring 2019*

*Bryan D. Hayes*

Given the enormous volume of traffic on internet discussion forums and the propensity for forum users to behave uncivilly towards one another, there is a demand for automated detectors of toxic behavior. While simple classifiers prove quite effective at determining when toxic behavior is taking place, these classifiers can become biased against certain identity groups because the names of these groups tend to be invoked in toxic comments. The word "gay", for example, can be used as a neutral description of identity in a non-toxic comment and as an insult in a toxic one. As a result, the word "gay" may become an informative feature for a classifier in deciding that a comment is toxic, so that non-toxic comments containing the word are flagged as toxic in the process.

Our goal is to build a simple classifier, measure its bias, and then attempt to reduce the bias against certain identity groups by separating toxic uses of an identity label from non-toxic uses.

In [1]:
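import numpy as np  # needed by mean_auc in the bias evaluation
import pandas as pd
from cytoolz import *  # provides identity, used as the CountVectorizer analyzer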

The data used for this task is taken from the Jigsaw Unintended Bias in Toxicity Classification competition.

https://www.kaggle.com/c/jigsaw-unintended-bias-in-toxicity-classification

The provided dataset contains 1.8 million comments; to ease computation, we will sample just 100,000 of them for training and 10,000 for testing.

In [2]:
data = pd.read_csv("train.csv").sample(110000, random_state = 583)

The dataset provides the fraction of annotators who believed each comment is toxic. To simplify our task, we will mark all comments with a target score of 0.5 or greater as toxic and all others as nontoxic.

In [3]:
data['toxic'] = data['target'] >= 0.5

## Text Parsing

We first use spaCy to process the raw comment text.

In [3]:
import spacy
from spacy import displacy
from spacy.tokens import Token

nlp = spacy.load('en_core_web_sm', disable = ['ner'])

In [5]:
%%time

data['doc'] = list(nlp.pipe(data['comment_text']))

Wall time: 17min 32s


## Split Training Data

Now we separate our training data from our testing data.

In [4]:
from sklearn.model_selection import train_test_split

In [94]:
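train, test = train_test_split(data,
                               train_size = 100000, test_size = 10000,
                               stratify = data['toxic'],
                               random_state = 583)
# Take independent copies so that adding columns later does not raise SettingWithCopyWarning.
train, test = train.copy(), test.copy()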

## Baseline Classifier

In [8]:
train.groupby('toxic').size()

toxic
False    91988
True      8012
dtype: int64

We can see that about 92% of comments are nontoxic, so we expect a dummy classifier that labels every comment nontoxic to reach roughly 92% accuracy.
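
As a quick check on that expectation, the accuracy of an always-nontoxic classifier on the training set follows directly from the class counts. The snippet below is a minimal sketch that hard-codes the counts from the `groupby` output rather than reading them from `train`.

```python
# Class counts copied from the groupby output above.
counts = {False: 91988, True: 8012}

total = sum(counts.values())
majority_label = max(counts, key=counts.get)

# Accuracy obtained by always predicting the most frequent class.
majority_accuracy = counts[majority_label] / total
print(majority_label, round(majority_accuracy * 100, 2))
```

Because the split is stratified, the test set should show nearly the same proportion.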

In [6]:
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.pipeline import make_pipeline
from sklearn.linear_model import *
from sklearn.model_selection import *
from sklearn.metrics import *
from sklearn.dummy import *

In [23]:
baseline = make_pipeline(CountVectorizer(analyzer = identity), DummyClassifier(strategy = 'most_frequent'))
baseline.fit(train['comment_text'], train['toxic'])

Pipeline(memory=None,
     steps=[('countvectorizer', CountVectorizer(analyzer=<cyfunction identity at 0x0000026CEC250C80>,
        binary=False, decode_error='strict', dtype=<class 'numpy.int64'>,
        encoding='utf-8', input='content', lowercase=True, max_df=1.0,
        max_features=None, min_df=1, ngram_range=(1, 1), p...), ('dummyclassifier', DummyClassifier(constant=None, random_state=None, strategy='most_frequent'))])

In [24]:
baseline.score(test['comment_text'], test['toxic']) * 100.

91.99000000000001

## Baseline Logistic Regression

Now we will build a simple logistic regression classifier that classifies comments based on bag-of-words counts of their lowercased tokens.

In [7]:
def tokens(doc):
    return [tok.lower_ for tok in doc]

train['tokens'] = train['doc'].apply(tokens)
test['tokens'] = test['doc'].apply(tokens)


In [8]:
baseline_lr = make_pipeline(CountVectorizer(analyzer = identity),
                            LogisticRegression(solver ='liblinear', max_iter = 500))
baseline_lr.fit(train['tokens'], train['toxic'])

Pipeline(memory=None,
     steps=[('countvectorizer', CountVectorizer(analyzer=<cyfunction identity at 0x000001AE1CA74608>,
        binary=False, decode_error='strict', dtype=<class 'numpy.int64'>,
        encoding='utf-8', input='content', lowercase=True, max_df=1.0,
        max_features=None, min_df=1, ngram_range=(1, 1), p...ty='l2', random_state=None, solver='liblinear',
          tol=0.0001, verbose=0, warm_start=False))])

In [13]:
baseline_lr.score(test['tokens'], test['toxic']) * 100.

94.01

The logistic regression classifier improves on the dummy classifier's accuracy by about 2 percentage points (94.01% versus 91.99%).

## Hyperparameter Tuning

We can further improve our classifier's score by selecting optimal hyperparameters.

In [39]:
%%time
params = {'logisticregression__C': [0.001, 0.01, 0.1, 1.0, 10.0, 100.0],
         'countvectorizer__min_df':[1, 2, 5],
         'countvectorizer__max_df':[0.5, 0.75, 0.9]}
grid = GridSearchCV(baseline_lr, params, n_jobs=-1, cv=3)
grid.fit(train['tokens'], train['toxic'])



Wall time: 14min 14s
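
The run time is easier to interpret with the size of the search in mind. The sketch below counts the candidate settings and model fits implied by the grid. It uses `itertools.product` instead of scikit-learn, and it assumes the search performs one final refit on the full training set (`refit=True`, as the `GridSearchCV` output in this notebook reports).

```python
from itertools import product

# Same grid as in the cell above.
params = {'logisticregression__C': [0.001, 0.01, 0.1, 1.0, 10.0, 100.0],
          'countvectorizer__min_df': [1, 2, 5],
          'countvectorizer__max_df': [0.5, 0.75, 0.9]}
cv_folds = 3

# Every combination of one value per hyperparameter.
combinations = list(product(*params.values()))
n_candidates = len(combinations)

# One fit per candidate per fold, plus the final refit on all training data.
n_fits = n_candidates * cv_folds + 1
print(n_candidates, n_fits)  # prints: 54 163
```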


In [59]:
grid.fit(train['tokens'], train['toxic'])



GridSearchCV(cv=3, error_score='raise-deprecating',
       estimator=Pipeline(memory=None,
     steps=[('countvectorizer', CountVectorizer(analyzer=<cyfunction identity at 0x0000026CEC250C80>,
        binary=False, decode_error='strict', dtype=<class 'numpy.int64'>,
        encoding='utf-8', input='content', lowercase=True, max_df=1.0,
        max_features=None, min_df=1, ngram_range=(1, 1), p...ty='l2', random_state=None, solver='liblinear',
          tol=0.0001, verbose=0, warm_start=False))]),
       fit_params=None, iid='warn', n_jobs=-1,
       param_grid={'logisticregression__C': [0.001, 0.01, 0.1, 1.0, 10.0, 100.0], 'countvectorizer__min_df': [1, 2, 5], 'countvectorizer__max_df': [0.5, 0.75, 0.9]},
       pre_dispatch='2*n_jobs', refit=True, return_train_score='warn',
       scoring=None, verbose=0)

In [60]:
grid.best_params_

{'countvectorizer__max_df': 0.5,
 'countvectorizer__min_df': 1,
 'logisticregression__C': 1.0}

In [11]:
baseline_lr.set_params(**grid.best_params_)
baseline_lr.fit(train['tokens'], train['toxic'])
baseline_lr.score(test['tokens'], test['toxic']) * 100.

94.1

## Bias Evaluation

We have a classifier that is rather successful at predicting the toxicity of a comment, but is it biased? To measure this, we will look at three ROC-AUC metrics:

- Subgroup: the performance of the classifier only on comments mentioning an identity subgroup.
- Background-positive, subgroup-negative: the performance of the classifier on non-toxic comments mentioning the identity and toxic comments that don't mention the identity.
- Background-negative, subgroup-positive: the performance of the classifier on toxic comments mentioning the identity and non-toxic comments that don't mention the identity.
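
To make the three definitions concrete, here is a small sketch on made-up data. The rows, the `metric` helper, and the hand-rolled `auc` function (a pairwise-ranking stand-in for scikit-learn's `roc_auc_score`) are all illustrative assumptions, not part of the project code.

```python
def auc(labels, scores):
    """ROC-AUC as the fraction of (toxic, non-toxic) pairs ranked correctly.

    Ties count as half a correct pair.
    """
    pos = [s for y, s in zip(labels, scores) if y]
    neg = [s for y, s in zip(labels, scores) if not y]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

# Hypothetical toy rows: (mentions_identity, toxic, model_score)
rows = [
    (True,  True,  0.9),
    (True,  False, 0.7),   # non-toxic identity mention with an inflated score
    (True,  False, 0.2),
    (False, True,  0.6),
    (False, True,  0.8),
    (False, False, 0.1),
    (False, False, 0.3),
]

def metric(keep):
    # Restrict the rows to the slice selected by `keep`, then score that slice.
    subset = [(toxic, score) for ident, toxic, score in rows if keep(ident, toxic)]
    labels, scores = zip(*subset)
    return auc(labels, scores)

subgroup = metric(lambda ident, toxic: ident)
bpsn = metric(lambda ident, toxic: (ident and not toxic) or (not ident and toxic))
bnsp = metric(lambda ident, toxic: (ident and toxic) or (not ident and not toxic))
print(subgroup, bpsn, bnsp)  # prints: 1.0 0.75 1.0
```

In this toy set only the BPSN score drops, because the one inflated score belongs to a non-toxic comment that mentions the identity and it outranks a toxic background comment.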

The identity subgroups considered are those that are present in more than 500 comments in the original dataset.

In [12]:
identities = ["male", "female", "homosexual_gay_or_lesbian", "christian", "jewish", "muslim", "black", "white", "psychiatric_or_mental_illness"]

In [101]:
def subgroup_auc(group_label, model, col_label = 'tokens'):
    subgroup_test_set = test[test[group_label] > 0]
    score = roc_auc_score(subgroup_test_set['toxic'], model.predict_proba(subgroup_test_set[col_label])[:,1])
    print("Subgroup | " + group_label + ": " + str(score))
    return score
    
def bpsn_auc(group_label, model, col_label = 'tokens'):
    subgroup_test_set = test[((test[group_label] > 0) & (test['toxic'] == False)) | 
                             ((test[group_label] == 0) & (test['toxic'] == True))]
    score = roc_auc_score(subgroup_test_set['toxic'], model.predict_proba(subgroup_test_set[col_label])[:,1])
    print("BPSN | " + group_label + ": " + str(score))
    return score
    
def bnsp_auc(group_label, model, col_label = 'tokens'):
    subgroup_test_set = test[((test[group_label] > 0) & (test['toxic'] == True)) | 
                             ((test[group_label] == 0) & (test['toxic'] == False))]
    score = roc_auc_score(subgroup_test_set['toxic'], model.predict_proba(subgroup_test_set[col_label])[:,1])
    print("BNSP | " + group_label + ": " + str(score))
    return score

To compare bias across subgroups, we use a power mean function to more heavily penalize the model for its poorest-performing subgroup classification.

In [102]:
def mean_auc(group_labels, model, metric_func, col_label = 'tokens', p = -5):
    subgroup_scores = [(metric_func(label, model, col_label) ** p) for label in group_labels]
    mean_auc = (np.mean(subgroup_scores)) ** (1 / p)
    return mean_auc
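
The effect of the negative exponent is easiest to see on a few numbers. The following sketch uses hypothetical per-subgroup scores and a standalone `power_mean` helper (an illustrative name, not part of the project code) to compare the ordinary average with the power mean at `p = -5`.

```python
def power_mean(values, p):
    """Generalized (power) mean: (mean(v ** p)) ** (1 / p), for p != 0."""
    return (sum(v ** p for v in values) / len(values)) ** (1 / p)

# Hypothetical per-subgroup AUCs: one weak subgroup among three strong ones.
aucs = [0.90, 0.90, 0.90, 0.60]

arithmetic = power_mean(aucs, 1)    # ordinary average
penalized = power_mean(aucs, -5)    # same exponent as the default in mean_auc
print(round(arithmetic, 3), round(penalized, 3))
```

With `p = -5` the result lands well below the arithmetic mean and closer to the weakest score, so a single poorly served subgroup cannot be hidden by strong scores elsewhere.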

To get an overall sense of the model's bias, we take a weighted average of the overall ROC-AUC score and the power means of the three submetric scores defined above.

In [103]:
def overall_score(group_labels, model, metric_funcs, weights, col_label = 'tokens', p = -5):
    score = weights[0] * roc_auc_score(test['toxic'], model.predict_proba(test[col_label])[:,1])
    for i in range(0, len(metric_funcs)):
        score += (weights[i + 1] * mean_auc(group_labels, model, metric_funcs[i], col_label, p))
    return score
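
Written out, the quantity computed by `overall_score` can be summarized as follows. The symbols are notation introduced here for illustration: $m_{s,a}$ is submetric $a$ evaluated on identity subgroup $s$, $N$ is the number of subgroups, and $A$ is the number of submetrics.

$$
M_p(m_{\cdot,a}) = \left( \frac{1}{N} \sum_{s=1}^{N} m_{s,a}^{\,p} \right)^{1/p},
\qquad
\text{score} = w_0 \, \mathrm{AUC}_{\text{overall}} + \sum_{a=1}^{A} w_a \, M_p(m_{\cdot,a})
$$

In this notebook $A = 3$ (Subgroup, BPSN, BNSP), $N = 9$, and $p = -5$. The calls to `overall_score` set every weight to $0.25$.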

We can now measure the bias in our classifier:

In [63]:
overall_score(identities, baseline_lr, [subgroup_auc, bpsn_auc, bnsp_auc], [0.25, 0.25, 0.25, 0.25])

Subgroup | male: 0.8213306215524945
Subgroup | female: 0.8435256151674062
Subgroup | homosexual_gay_or_lesbian: 0.7009803921568627
Subgroup | christian: 0.8856655290102389
Subgroup | jewish: 0.860655737704918
Subgroup | muslim: 0.8048279404211608
Subgroup | black: 0.7920751633986928
Subgroup | white: 0.7904761904761904
Subgroup | psychiatric_or_mental_illness: 0.8424479166666666
BPSN | male: 0.8374824460615345
BPSN | female: 0.8291709767991725
BPSN | homosexual_gay_or_lesbian: 0.8013840830449827
BPSN | christian: 0.8764232081911264
BPSN | jewish: 0.8486665035478346
BPSN | muslim: 0.7944479319243916
BPSN | black: 0.7986685032139578
BPSN | white: 0.7821428571428573
BPSN | psychiatric_or_mental_illness: 0.799599358974359
BNSP | male: 0.8326326413546227
BNSP | female: 0.8545525784884205
BNSP | homosexual_gay_or_lesbian: 0.7501765536723163
BNSP | christian: 0.8383829039271011
BNSP | jewish: 0.8562005277044855
BNSP | muslim: 0.8619052329607281
BNSP | black: 0.8258398900961659
BNSP | white: 0

0.8327445661725307

We can easily see that bias is not evenly distributed. The model performs worst on the homosexual_gay_or_lesbian subgroup, which has the lowest Subgroup AUC (0.70) and a BNSP AUC of only 0.75. The low Subgroup score means the classifier struggles to rank toxic comments about this subgroup above non-toxic ones, which includes assigning high toxicity scores to non-toxic comments that merely mention the identity.

The comment below, for example, was rated toxic by only 1 of 5 annotators, but the model assigned it a 76% probability of being toxic.

In [143]:
test.loc[107433]['comment_text']

"Sounds like typical liberal, resorting to name calling, if anyone is a bigot its you against Christians,  I'm not a Christian or a bigot, I do have several gay friends that are really embarrassed about gay pride parade and all of the fuss,  they don't want it and have a good life,  its the liberal agenda pushing it,  if you think anyone will change their minds by the government saying it will be so,  think again."

## Reducing Bias

One way we might reduce bias in the model is by distinguishing references to an identity group used in a toxic context from those used in a non-toxic context. We could accomplish this in part by tagging any words used in conjunction with a recognized profanity. This would allow the classifier to separate normal uses of a term from those used in a clearly insulting context.

We will use Google's list of profanities, obtained from https://github.com/RobertJGabriel/Google-profanity-words.

In [76]:
with open("bad_words.txt", "r") as f:
    bad_words = [line.strip("\n") for line in f]

In [78]:
bad_doc = list(nlp.pipe(bad_words))
bad_tokens = [word[0] for word in bad_doc]
bad_strings = [tok.text for tok in bad_tokens]

In [36]:
Token.set_extension('profane', default=False, force=True)

Specifically, we will look for words that are modified by a recognized profanity.

In [95]:
for doc in data['doc']:
    for tok in doc:
        if tok.text in bad_strings and tok.dep_ == 'amod':
            tok.head._.profane = True

We can now modify tokens used in a profane context, adding in additional dependency relationships along the way.

In [96]:
def profane(tok):
    return 'PROFANE:' + tok.lower_ if tok._.profane else tok.lower_

def everything(doc):
    return [profane(w.head) + '_' + profane(w) for w in doc if w.head != w ] + \
           [profane(w) for w in doc]
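
To show what these features look like without running spaCy, the sketch below mimics `profane` and `everything` on two hand-built parses. `ToyToken`, `label`, `features`, and the head indices are hypothetical stand-ins chosen for illustration. A real spaCy parse may attach the words differently.

```python
from dataclasses import dataclass

@dataclass
class ToyToken:
    """Hypothetical stand-in for a spaCy token."""
    text: str
    head: int              # index of the syntactic head; the root points to itself
    profane: bool = False  # mirrors the custom `profane` extension

def label(tok):
    # Same rule as profane(): prefix tokens flagged as used in a profane context.
    return 'PROFANE:' + tok.text.lower() if tok.profane else tok.text.lower()

def features(doc):
    # head_dependent pairs for every non-root token, followed by the unigrams.
    pairs = [label(doc[t.head]) + '_' + label(t)
             for i, t in enumerate(doc) if t.head != i]
    return pairs + [label(t) for t in doc]

# "my neighbors waved": neutral use, nothing flagged.
neutral = [ToyToken('my', 1), ToyToken('neighbors', 2), ToyToken('waved', 2)]
# "damn neighbors waved": "damn" modifies "neighbors", so "neighbors" is flagged.
insult = [ToyToken('damn', 1), ToyToken('neighbors', 2, profane=True), ToyToken('waved', 2)]

print(features(neutral))
print(features(insult))
```

The noun appears as `neighbors` in the first comment and as `PROFANE:neighbors` in the second, so the classifier can learn separate weights for the two uses.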

In [97]:
train['everything'] = train['doc'].apply(everything)
test['everything'] = test['doc'].apply(everything)


In [99]:
baseline_lr.fit(train['everything'], train['toxic'])
baseline_lr.score(test['everything'], test['toxic']) * 100.

94.13

We see that this method of tagging the data provides a very small increase in overall accuracy (94.13% versus 94.10%). How was bias affected?

In [104]:
overall_score(identities, baseline_lr, [subgroup_auc, bpsn_auc, bnsp_auc], [0.25, 0.25, 0.25, 0.25], col_label = 'everything')

Subgroup | male: 0.8344581060676096
Subgroup | female: 0.8609116579265834
Subgroup | homosexual_gay_or_lesbian: 0.7331932773109244
Subgroup | christian: 0.9176949330532949
Subgroup | jewish: 0.8319672131147541
Subgroup | muslim: 0.8061119671289163
Subgroup | black: 0.809640522875817
Subgroup | white: 0.8184523809523809
Subgroup | psychiatric_or_mental_illness: 0.8841145833333334
BPSN | male: 0.8392697561598366
BPSN | female: 0.8373282104329836
BPSN | homosexual_gay_or_lesbian: 0.8241637831603229
BPSN | christian: 0.8921638225255972
BPSN | jewish: 0.8494005382921458
BPSN | muslim: 0.8109437120736556
BPSN | black: 0.8068755739210286
BPSN | white: 0.7704573934837092
BPSN | psychiatric_or_mental_illness: 0.8166666666666667
BNSP | male: 0.8551346562978777
BNSP | female: 0.8751604781833037
BNSP | homosexual_gay_or_lesbian: 0.7750453995157385
BNSP | christian: 0.8701142513529765
BNSP | jewish: 0.8533641160949867
BNSP | muslim: 0.8574537540805223
BNSP | black: 0.8470557012613962
BNSP | white: 

0.8465474409689075

This single transformation raised our overall score from 0.833 to 0.847, an increase of about 1.4 points. Clearly, a substantial amount of bias remains in the model, and more sophisticated techniques for recognizing toxicity would be necessary to eliminate it. However, we hope to have shown that it is possible to reduce model bias, so unbiased automated toxicity detection may someday be a reality.