# Exercise 8: Hate Speech Detection with BERT

In this exercise, you will finetune a BERT model to do hate speech detection on tweets. You will also modify the training dataset to make training more efficient.

You should complete the parts of the exercise that are marked as **TODO**.
A correctly completed **TODO** gives 2 bonus points. Partially correct answers give 1 bonus point.
Some **TODO**s are inside a comment in a code block: Here, you should complete the line of code.
Other **TODO**s are inside a text block: Here, you should write a few sentences to answer the question.

**Important:** Some students were under the impression that you have to complete a TODO in a _single_ line of code. That is not the case: you can use as many lines as you need.

**Submission deadline:** 27.01.2021, 23:59 Central European Time

**Instructions for submission:** After completing the exercise, save a copy of the notebook as exercise8_twitterhate_MATRIKELNUMMER.ipynb, where MATRIKELNUMMER is your student ID number. Then upload the notebook to moodle (submission exercise sheet 8).

To understand the code, it can be helpful to experiment a bit during development, e.g., to print tensors or their shapes. But please remove these changes before submitting the notebook. If we cannot run your notebook, or if print statements flood stdout, we cannot grade it.

To make the most of this exercise, you should try to read and understand the entire code, not just the parts that contain a **TODO**. If you have questions, write them down for the exercise, which will happen in the week after the submission deadline.

**CUDA:** You can use a GPU for this exercise (on colab: Runtime -> Change Runtime Type -> GPU). This is not mandatory, but it will speed up training epochs, thereby allowing you to test more hyperparameters.

# Required libraries
When working with 🤗 transformers, or any fast-changing software library, you should be extra careful to fix the library versions when you begin your project, and not change versions while you're developing.

In [1]:
!pip3 install transformers==4.2.0
!pip3 install datasets==1.2.0
!pip3 install tensorflow



In [2]:
from transformers import BertForSequenceClassification, DistilBertTokenizerFast, Trainer, TrainingArguments, EarlyStoppingCallback
from datasets import load_dataset
import numpy as np
from sklearn.metrics import accuracy_score, precision_recall_fscore_support

In [3]:
NUM_EPOCHS = 2
TRAIN_BATCH_SIZE = 16
EVAL_BATCH_SIZE = 64
WARMUP_STEPS = 50
WEIGHT_DECAY = 0.01
LOGGING_STEPS = 50
LEARNING_RATE = 5e-05

# Data
In this exercise, we will finetune a BERT model to perform hate speech detection on data from Twitter. Hate speech detection is the task of classifying sentences (in this case, tweets) as hate speech or not hate speech, for example so that they can be reported or filtered out automatically. The dataset we're using is from the 🤗/datasets library, so we can load it very easily: https://huggingface.co/datasets/tweets_hate_speech_detection 
As the dataset currently only contains a training portion, we are going to use the Slicing API (https://huggingface.co/docs/datasets/splits.html) to divide it into a training and a development set.


In [4]:
train_dataset = load_dataset('tweets_hate_speech_detection', split='train[:80%]')
dev_dataset = load_dataset('tweets_hate_speech_detection', split='train[80%:]')

Using custom data configuration default
Reusing dataset tweets_hate_speech_detection (/home/flo/.cache/huggingface/datasets/tweets_hate_speech_detection/default/0.0.0/c32a982d8b2d6233065d820ac655454174f8aaa8faddc74979cf793486acd3b0)
Using custom data configuration default
Reusing dataset tweets_hate_speech_detection (/home/flo/.cache/huggingface/datasets/tweets_hate_speech_detection/default/0.0.0/c32a982d8b2d6233065d820ac655454174f8aaa8faddc74979cf793486acd3b0)


Now let's look at some examples of tweets containing hate speech.

In [5]:
label_mask = np.array(train_dataset['label']) == 1
hate_speech_examples = train_dataset[label_mask]["tweet"]

print('\n'.join(hate_speech_examples[15:20]))

you might be a libtard if... #libtard  #sjw #liberal #politics 
@user take out the #trash america...  - i voted against #hate - i voted against  - i voted against  - i vot…
if you hold open a door for a woman because she's a woman and not because it's a nice thing to do, that's . don't even try to deny it
@user this man ran for governor of ny, the state with the biggest african-american population    #…
#stereotyping #prejudice  offer no #hope or solutions but create the same old repetitive #hate #conflict…


# Tokenization
Now that we have the datasets loaded, we need to tokenize them. This is very easy with 🤗 transformers, but to make our model faster we are first going to find out the smallest sequence length that we can comfortably work with. Tweets are very short, so we should be able to choose a sequence length that is a lot shorter than the standard 512 that most BERT models run with. We are going to tokenize the whole dataset with a very generous sequence length, choose our new sequence length so that at least 95% of all tweets are within this length, and then tokenize again while truncating those that are longer.
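As a rough illustration of the percentile rule described above, the sketch below picks the smallest length that covers at least 95% of a set of token counts. The `toy_lengths` values and the helper `choose_sequence_len` are made up for illustration, and the nearest-rank rule used here is a plain-Python stand-in for `np.percentile`, so the exact cut-off can differ slightly from what the notebook computes.

```python
import math

def choose_sequence_len(lengths, percent=95):
    """Return the smallest observed length covering at least `percent`% of `lengths`."""
    ordered = sorted(lengths)
    # Nearest-rank percentile: 1-based position of the cut-off in the sorted list.
    rank = math.ceil(percent * len(ordered) / 100)
    return ordered[rank - 1]

# Hypothetical token counts for 20 tweets; one outlier is much longer than the rest.
toy_lengths = [12, 15, 18, 20, 22, 25, 27, 30, 33, 35,
               38, 40, 42, 45, 48, 50, 55, 60, 70, 120]

chosen_len = choose_sequence_len(toy_lengths)
# Fraction of tweets that fit into the chosen length without truncation.
covered = sum(length <= chosen_len for length in toy_lengths) / len(toy_lengths)
print(chosen_len, covered)
```

In this toy case only the single outlier would be truncated, while every padded sequence becomes much shorter than the outlier's length.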

In [6]:
def run_tokenizer(train_dataset, dev_dataset):
    tokenizer = DistilBertTokenizerFast.from_pretrained('bert-base-uncased') #actually the same as BertTokenizerFast
    def get_sequence_len(tokenizer, train_dataset, dev_dataset):

        def tokenize_for_lengths(batch):
            return tokenizer(batch['tweet'], padding=False, truncation=True, max_length=128, return_length = True)

        train_dataset_for_lengths = train_dataset.map(tokenize_for_lengths, batched=True, batch_size=len(train_dataset))

        tweet_lengths = np.array(train_dataset_for_lengths[:]['length'])
        chosen_sequence_len = int(np.percentile(np.sort(tweet_lengths),95)+1)
        
        return chosen_sequence_len

    chosen_sequence_len = get_sequence_len(tokenizer, train_dataset, dev_dataset)

    def tokenize(batch, sequence_len):
        return tokenizer(batch['tweet'], padding="max_length", truncation=True, max_length=sequence_len)

    train_dataset = train_dataset.map(tokenize, fn_kwargs={'sequence_len': chosen_sequence_len}, batched=True, batch_size=len(train_dataset))
    dev_dataset = dev_dataset.map(tokenize, fn_kwargs={'sequence_len': chosen_sequence_len}, batched=True, batch_size=len(dev_dataset))

    return (train_dataset, dev_dataset)

train_dataset, dev_dataset = run_tokenizer(train_dataset, dev_dataset)

Loading cached processed dataset at /home/flo/.cache/huggingface/datasets/tweets_hate_speech_detection/default/0.0.0/c32a982d8b2d6233065d820ac655454174f8aaa8faddc74979cf793486acd3b0/cache-9a9f5036699670ae.arrow



In [7]:
def set_format(train_dataset, dev_dataset):
    train_dataset.set_format('torch', columns=['input_ids', 'attention_mask', 'label'])
    dev_dataset.set_format('torch', columns=['input_ids', 'attention_mask', 'label'])

    return (train_dataset, dev_dataset)

train_dataset, dev_dataset = set_format(train_dataset, dev_dataset)

# Model Definition
To finetune, we load a pretrained BERT model, add a classification head on top of it, and then train it on our dataset and task. Note that although a distilled model such as DistilBERT would train faster, the code below loads the standard `bert-base-uncased` checkpoint. In this case, we're doing binary sequence classification: we classify sequences (tweets) as either hate speech (label 1) or not hate speech (label 0). Luckily for us, in 🤗 transformers we only need to instantiate a BertForSequenceClassification model from a pretrained generic BERT model and specify how many labels we want for the classification. We will get a warning that some of the weights (those of the classification head) have not been trained yet, but that's fine.

In [8]:
def define_model():
    model = BertForSequenceClassification.from_pretrained('bert-base-uncased',num_labels=2) 
    # according to the docs: kwargs can be used to update the configuration object
    # I think this is the easier implementation, than creating a new Config Object
    return model

model = define_model()

Some weights of the model checkpoint at bert-base-uncased were not used when initializing BertForSequenceClassification: ['cls.predictions.bias', 'cls.predictions.transform.dense.weight', 'cls.predictions.transform.dense.bias', 'cls.predictions.decoder.weight', 'cls.seq_relationship.weight', 'cls.seq_relationship.bias', 'cls.predictions.transform.LayerNorm.weight', 'cls.predictions.transform.LayerNorm.bias']
- This IS expected if you are initializing BertForSequenceClassification from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing BertForSequenceClassification from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
Some weights of BertForSequenceClassification were not initialized from the model checkpoint at

Here, we are going to set up two things. The first is a function that we pass to the Trainer class to tell it which metrics to compute on our development set. The second is an early stopping callback so that, just like in the last exercise sheet, we can stop training if the development performance stops improving.

In [9]:
def compute_metrics(pred):
    labels = pred.label_ids
    preds = pred.predictions.argmax(-1)
    precision, recall, f1, _ = precision_recall_fscore_support(labels, preds, average='binary')
    acc = accuracy_score(labels, preds)
    return {
        'accuracy': acc,
        'f1': f1,
        'precision': precision,
        'recall': recall
    }

early_stopping_callback = EarlyStoppingCallback(early_stopping_patience= 2, early_stopping_threshold = 0.0)
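To make the role of the patience parameter more concrete, here is a simplified plain-Python model of patience-based early stopping. It is only a sketch: the function `first_stop_index` and the `f1_history` values are hypothetical, and the actual `EarlyStoppingCallback` may differ in its details.

```python
def first_stop_index(scores, patience=2, threshold=0.0):
    """Return the index of the evaluation at which training would stop, or None."""
    best = None
    evals_without_improvement = 0
    for i, score in enumerate(scores):
        if best is None or score > best + threshold:
            # New best score: remember it and reset the counter.
            best = score
            evals_without_improvement = 0
        else:
            evals_without_improvement += 1
        if evals_without_improvement >= patience:
            return i
    return None

# Hypothetical development-set F1 scores, one per evaluation.
f1_history = [0.50, 0.62, 0.70, 0.68, 0.69, 0.75]
stop_at = first_stop_index(f1_history, patience=2)
print(stop_at)
```

With a patience of 2, this toy run stops after two consecutive evaluations that fail to beat the best score so far, so the later improvement to 0.75 is never reached. A larger patience trades extra training time for robustness against such temporary dips.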

Now, we are going to define some training arguments that we are going to pass to the Trainer class which will handle the training for us. Very important here is that we have set the metric for best model to F1-measure and load_best_model_at_end to True, so that F1-measure is used for early stopping and we load the best model at the end, not the one for which F1-measure has already decreased.

In [10]:
training_args = TrainingArguments(
    output_dir='./results',
    num_train_epochs=NUM_EPOCHS,
    per_device_train_batch_size=TRAIN_BATCH_SIZE,
    per_device_eval_batch_size=EVAL_BATCH_SIZE,
    warmup_steps=WARMUP_STEPS,
    weight_decay=WEIGHT_DECAY,
    logging_dir='./logs/',
    evaluation_strategy="steps",
    logging_steps=LOGGING_STEPS,
    learning_rate=LEARNING_RATE,
    metric_for_best_model="f1",
    load_best_model_at_end=True
)

Here we instantiate the Trainer class with our model, the training arguments, the training and development datasets, the metrics function, and the early stopping callback.

In [11]:
def define_trainer(model, training_args, train_dataset, dev_dataset, compute_metrics, early_stopping_callback):
    trainer = Trainer(
        model=model,
        args=training_args,
        train_dataset=train_dataset,
        eval_dataset=dev_dataset,
        compute_metrics=compute_metrics,
        callbacks = [early_stopping_callback]
    )
    
    return trainer

trainer = define_trainer(model, training_args, train_dataset, dev_dataset, compute_metrics, early_stopping_callback)

# Training

In [12]:
def train(trainer):
    trainer.train()
    return trainer

trainer = train(trainer)

  return torch.tensor(x, **format_kwargs)


| Step | Training Loss | Validation Loss | Accuracy | F1 | Precision | Recall | Runtime | Samples Per Second |
|---|---|---|---|---|---|---|---|---|
| 50 | 0.3992 | 0.249589 | 0.930695 | 0.0 | 0.0 | 0.0 | 9.6775 | 660.503 |
| 100 | 0.2133 | 0.168138 | 0.945401 | 0.41541 | 0.805195 | 0.27991 | 10.0429 | 636.468 |
| 150 | 0.1942 | 0.290523 | 0.839643 | 0.421884 | 0.281203 | 0.844244 | 9.8304 | 650.227 |
| 200 | 0.1997 | 0.15848 | 0.955882 | 0.59366 | 0.820717 | 0.465011 | 9.8874 | 646.477 |
| 250 | 0.1784 | 0.161324 | 0.946965 | 0.643533 | 0.602362 | 0.690745 | 10.346 | 617.825 |
| 300 | 0.1621 | 0.155532 | 0.956195 | 0.621622 | 0.774411 | 0.519187 | 10.3787 | 615.875 |
| 350 | 0.1825 | 0.146445 | 0.954318 | 0.552147 | 0.861244 | 0.406321 | 10.2597 | 623.019 |


  _warn_prf(average, modifier, msg_start, len(result))


In [13]:
def evaluate(trainer):
    # Return the metrics dict so the notebook displays it as the cell result
    return trainer.evaluate()

evaluate(trainer)

# What happened?

It looks like our accuracy is about 93% at the first evaluation, but precision, recall and F1-measure are all 0. What happened?

Let's take a look at our dataset again. How are the classes distributed?


In [14]:
hate_speech_dataset = load_dataset('tweets_hate_speech_detection', split='train')
print(np.mean(np.array(hate_speech_dataset['label'])))

Using custom data configuration default
Reusing dataset tweets_hate_speech_detection (/home/flo/.cache/huggingface/datasets/tweets_hate_speech_detection/default/0.0.0/c32a982d8b2d6233065d820ac655454174f8aaa8faddc74979cf793486acd3b0)


0.07014579813528565


It looks like only about 7% of the training dataset is actually hate speech. This extreme imbalance means that the model takes the path of least resistance for the loss, which is to predict "not hate speech" all the time. 
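The effect can be reproduced without any model. The sketch below uses a made-up set of 100 labels with 7 positives and a classifier that always predicts "not hate speech". The helper `binary_metrics` is hypothetical and written in plain Python; it treats undefined precision or recall (division by zero) as 0, which is an assumption of this sketch.

```python
def binary_metrics(labels, preds):
    """Compute accuracy, precision, recall and F1 for the positive class (label 1)."""
    tp = sum(1 for y, p in zip(labels, preds) if y == 1 and p == 1)
    fp = sum(1 for y, p in zip(labels, preds) if y == 0 and p == 1)
    fn = sum(1 for y, p in zip(labels, preds) if y == 1 and p == 0)
    correct = sum(1 for y, p in zip(labels, preds) if y == p)
    accuracy = correct / len(labels)
    # Undefined ratios (no predicted or no actual positives) are reported as 0.0 here.
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return {"accuracy": accuracy, "precision": precision, "recall": recall, "f1": f1}

# Toy data: 7 hate speech tweets among 100, mirroring the roughly 7% positive rate.
labels = [1] * 7 + [0] * 93
always_negative = [0] * len(labels)

metrics = binary_metrics(labels, always_negative)
print(metrics)
```

Accuracy rewards the majority-class guess, while precision, recall and F1 for the positive class expose that not a single hate speech tweet was found.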

Let's do something about that!

# Rebalancing the dataset

To help with this, we are simply going to rebalance the dataset: we keep all the hate speech examples but only an equal number of negative examples, so that the dataset is balanced 50-50.
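The index arithmetic behind this kind of undersampling can be sketched on a plain list of labels. The helper `balanced_positions` and the `toy_labels` values are hypothetical, and the sketch assumes that negatives outnumber positives and that the labels are sorted so that all negatives come first.

```python
def balanced_positions(labels):
    """Positions in the label-sorted data that yield a 50-50 class split."""
    num_label_1 = sum(1 for y in labels if y == 1)
    num_label_0 = len(labels) - num_label_1
    # After sorting by label, negatives occupy [0, num_label_0) and positives follow.
    keep_negatives = list(range(0, num_label_1))
    keep_positives = list(range(num_label_0, num_label_0 + num_label_1))
    return keep_negatives + keep_positives

# Toy labels: 2 positives among 10 examples.
toy_labels = [0, 0, 1, 0, 0, 0, 1, 0, 0, 0]
sorted_labels = sorted(toy_labels)
balanced_labels = [sorted_labels[i] for i in balanced_positions(toy_labels)]
print(balanced_labels)
```

Keeping the first negatives after sorting is the simplest choice; drawing the negatives at random would be an alternative, at the cost of needing a fixed seed for reproducibility.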

In [16]:
def rebalance_dataset(dataset):
    num_label_1 = len(dataset.filter(lambda obj: obj['label']==1)) #TODO: find out how often the label 1 appears in hate_speech_dataset's label column
    num_label_0 = len(dataset.filter(lambda obj: obj['label']==0)) #TODO: find out how often the label 0 appears in hate_speech_dataset's label column
    sorted_dataset = dataset.sort('label')
    balanced_dataset = sorted_dataset.select(list(range(0,num_label_1)) + list(range(num_label_0, num_label_0 + num_label_1)))
    return balanced_dataset.shuffle(seed=42)

balanced_dataset = rebalance_dataset(hate_speech_dataset)
l = len(balanced_dataset)
eighty_percent = int(l*.8)
balanced_train_dataset = balanced_dataset.select(list(range(0,eighty_percent))) #using the dataset.select method, select the first 80% of the balanced dataset
balanced_dev_dataset = balanced_dataset.select(list(range(eighty_percent,l))) #using the dataset.select method, select the last 20% of the balanced dataset


Loading cached processed dataset at /home/flo/.cache/huggingface/datasets/tweets_hate_speech_detection/default/0.0.0/c32a982d8b2d6233065d820ac655454174f8aaa8faddc74979cf793486acd3b0/cache-c00e6f14a823ac30.arrow
Loading cached processed dataset at /home/flo/.cache/huggingface/datasets/tweets_hate_speech_detection/default/0.0.0/c32a982d8b2d6233065d820ac655454174f8aaa8faddc74979cf793486acd3b0/cache-49ce4ac3ab2f8d3c.arrow
Loading cached sorted indices for dataset at /home/flo/.cache/huggingface/datasets/tweets_hate_speech_detection/default/0.0.0/c32a982d8b2d6233065d820ac655454174f8aaa8faddc74979cf793486acd3b0/cache-b327a09a8b54561b.arrow
Loading cached shuffled indices for dataset at /home/flo/.cache/huggingface/datasets/tweets_hate_speech_detection/default/0.0.0/c32a982d8b2d6233065d820ac655454174f8aaa8faddc74979cf793486acd3b0/cache-57d433a9ee4a78c9.arrow


Now that we have balanced the dataset, let's run everything again. 

In [17]:
balanced_train_dataset, balanced_dev_dataset = run_tokenizer(balanced_train_dataset, balanced_dev_dataset)
balanced_train_dataset, balanced_dev_dataset = set_format(balanced_train_dataset, balanced_dev_dataset)
balanced_model = define_model()
balanced_trainer = define_trainer(balanced_model, training_args, balanced_train_dataset, balanced_dev_dataset, compute_metrics, early_stopping_callback)
balanced_trainer = train(balanced_trainer)
evaluate(balanced_trainer)

Loading cached processed dataset at /home/flo/.cache/huggingface/datasets/tweets_hate_speech_detection/default/0.0.0/c32a982d8b2d6233065d820ac655454174f8aaa8faddc74979cf793486acd3b0/cache-4233158099525b21.arrow
Loading cached processed dataset at /home/flo/.cache/huggingface/datasets/tweets_hate_speech_detection/default/0.0.0/c32a982d8b2d6233065d820ac655454174f8aaa8faddc74979cf793486acd3b0/cache-ffed11ce5742e632.arrow
Loading cached processed dataset at /home/flo/.cache/huggingface/datasets/tweets_hate_speech_detection/default/0.0.0/c32a982d8b2d6233065d820ac655454174f8aaa8faddc74979cf793486acd3b0/cache-3b5786ba96ed5a4c.arrow
Some weights of the model checkpoint at bert-base-uncased were not used when initializing BertForSequenceClassification: ['cls.predictions.bias', 'cls.predictions.transform.dense.weight', 'cls.predictions.transform.dense.bias', 'cls.predictions.decoder.weight', 'cls.seq_relationship.weight', 'cls.seq_relationship.bias', 'cls.predictions.transform.LayerNorm.weight',

| Step | Training Loss | Validation Loss | Accuracy | F1 | Precision | Recall | Runtime | Samples Per Second |
|---|---|---|---|---|---|---|---|---|
| 50 | 0.5844 | 0.367716 | 0.851728 | 0.857449 | 0.821355 | 0.896861 | 1.4346 | 625.278 |
| 100 | 0.374 | 0.396487 | 0.84058 | 0.824108 | 0.912807 | 0.751121 | 1.477 | 607.319 |
| 150 | 0.2962 | 0.316887 | 0.867336 | 0.861144 | 0.89781 | 0.827354 | 1.4962 | 599.52 |
| 200 | 0.2992 | 0.306965 | 0.885173 | 0.883616 | 0.890661 | 0.876682 | 1.5127 | 592.974 |
| 250 | 0.2529 | 0.444368 | 0.861761 | 0.84878 | 0.930481 | 0.780269 | 1.5031 | 596.774 |
| 300 | 0.1981 | 0.366218 | 0.877369 | 0.884211 | 0.833333 | 0.941704 | 1.4671 | 611.391 |
| 350 | 0.1512 | 0.382991 | 0.886288 | 0.890323 | 0.855372 | 0.928251 | 1.4728 | 609.064 |
| 400 | 0.1724 | 0.354292 | 0.894091 | 0.889919 | 0.920863 | 0.860987 | 1.4946 | 600.171 |
| 450 | 0.1262 | 0.323049 | 0.904125 | 0.903153 | 0.90724 | 0.899103 | 1.4347 | 625.221 |


That's better! **TODO**: Write a few sentences about how much F1-measure has improved, and why.

In my run, the F1-measure was not 0 before balancing, but it was still poor. See below:

| Accuracy | F1 | Precision | Recall |
|---|---|---|---|
| 0.946965 | 0.643533 | 0.602362 | 0.690745 |

But that training run never finished, apparently because of a zero division.

After balancing the dataset I get better results:

| Accuracy | F1 | Precision | Recall |
|---|---|---|---|
| 0.904125 | 0.903153 | 0.907240 | 0.899103 |

Even though the accuracy went down a bit, precision and recall increased by a lot, because the model stopped labeling everything as not hate speech. F1 is the harmonic mean of precision and recall, so it increased along with them.
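For reference, F1 is the harmonic mean of precision and recall, F1 = 2PR / (P + R). The short check below recomputes F1 from the precision and recall values quoted in this answer; the helper name `f1_from` is made up for illustration, and returning 0 when both inputs are 0 is an assumption of this sketch.

```python
def f1_from(precision, recall):
    """Harmonic mean of precision and recall; 0.0 if both are zero."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)

# Precision and recall as reported before and after rebalancing.
f1_before = f1_from(0.602362, 0.690745)
f1_after = f1_from(0.907240, 0.899103)
print(round(f1_before, 4), round(f1_after, 4))
```

Both results should agree with the reported F1 values up to rounding. Because the harmonic mean is pulled towards the smaller of its two inputs, F1 only becomes high when precision and recall are both high.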