<a href="https://colab.research.google.com/github/danielsaggau/deep-learning-for-nlp/blob/main/exercise8_twitterhate.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Exercise 8: Hate Speech Detection with BERT

In this exercise, you will finetune a BERT model to do hate speech detection on tweets. You will also modify the training dataset to make training more efficient.

You should complete the parts of the exercise that are marked as **TODO**.
A correctly completed **TODO** gives 2 bonus points. Partially correct answers give 1 bonus point.
Some **TODO**s are inside a comment in a code block: Here, you should complete the line of code.
Other **TODO**s are inside a text block: Here, you should write a few sentences to answer the question.

**Important:** Some students were under the impression that you have to complete a TODO in a _single_ line of code. That is not the case; you can use as many lines as you need.

**Submission deadline:** 27.01.2021, 23:59 Central European Time

**Instructions for submission:** After completing the exercise, save a copy of the notebook as exercise8_twitterhate_MATRIKELNUMMER.ipynb, where MATRIKELNUMMER is your student ID number. Then upload the notebook to moodle (submission exercise sheet 8).

To understand the code, it can be helpful to experiment a bit during development, e.g., to print tensors or their shapes. But please remove these changes before submitting the notebook. If we cannot run your notebook, or if a print statement floods stdout, then we cannot grade it.

To make the most of this exercise, you should try to read and understand the entire code, not just the parts that contain a **TODO**. If you have questions, write them down for the exercise session, which takes place in the week after the submission deadline.

**CUDA:** You can use a GPU for this exercise (on colab: Runtime -> Change Runtime Type -> GPU). This is not mandatory, but it will speed up training epochs, thereby allowing you to test more hyperparameters.

# Required libraries
When working with 🤗 transformers, or any fast-changing software library, you should be extra careful to fix the library versions when you begin your project, and not change versions while you're developing.

In [None]:
!pip install transformers==4.2.0
!pip install datasets==1.2.0
!pip install tensorflow
!pip install tensorflow-gpu

Collecting transformers==4.2.0
  Downloading https://files.pythonhosted.org/packages/84/ea/634945faff8ad6984b98f7f3d98f6d83083a18af44e349744d90bde81f80/transformers-4.2.0-py3-none-any.whl (1.8MB)
Collecting sacremoses
  Downloading https://files.pythonhosted.org/packages/7d/34/09d19aff26edcc8eb2a01bed8e98f13a1537005d31e95233fd48216eed10/sacremoses-0.0.43.tar.gz (883kB)
Collecting tokenizers==0.9.4
  Downloading https://files.pythonhosted.org/packages/0f/1c/e789a8b12e28be5bc1ce2156cf87cb522b379be9cadc7ad8091a4cc107c4/tokenizers-0.9.4-cp36-cp36m-manylinux2010_x86_64.whl (2.9MB)
Building wheels for collected packages: sacremoses
  Building wheel for sacremoses (setup.py) ... done
  Created wheel for sacremoses: filename=sacremoses-0.0.43-cp36-none-any.whl size=893261 sha256=0bf763c70f6e

In [None]:
from transformers import BertForSequenceClassification, DistilBertTokenizerFast, Trainer, TrainingArguments, EarlyStoppingCallback
from datasets import load_dataset
import numpy as np
from sklearn.metrics import accuracy_score, precision_recall_fscore_support

In [None]:
NUM_EPOCHS = 2
TRAIN_BATCH_SIZE = 16
EVAL_BATCH_SIZE = 64
WARMUP_STEPS = 50
WEIGHT_DECAY = 0.01
LOGGING_STEPS = 50
LEARNING_RATE = 5e-05

# Data
In this exercise, we will finetune a BERT model to perform hate speech detection on data from Twitter. Hate speech detection is the task of classifying sentences (in this case, tweets) as hate speech or not hate speech, for example so that they can be automatically reported or filtered out. The dataset we're using is from the 🤗 datasets library, so we can load it very easily: https://huggingface.co/datasets/tweets_hate_speech_detection
As the dataset currently only contains a training portion, we are going to use the Slicing API (https://huggingface.co/docs/datasets/splits.html) to divide it into a training and a development set.


In [None]:
train_dataset = load_dataset('tweets_hate_speech_detection', split='train[:80%]')  # create the train dataset, using the first 80% of the dataset
dev_dataset = load_dataset('tweets_hate_speech_detection', split='train[-20%:]')  # create the dev dataset, using the last 20% of the dataset

Using custom data configuration default
Reusing dataset tweets_hate_speech_detection (/root/.cache/huggingface/datasets/tweets_hate_speech_detection/default/0.0.0/c32a982d8b2d6233065d820ac655454174f8aaa8faddc74979cf793486acd3b0)
Using custom data configuration default
Reusing dataset tweets_hate_speech_detection (/root/.cache/huggingface/datasets/tweets_hate_speech_detection/default/0.0.0/c32a982d8b2d6233065d820ac655454174f8aaa8faddc74979cf793486acd3b0)


Now let's look at some examples of tweets containing hate speech.

In [None]:
hate_speech_examples = (np.array(train_dataset['label']) == 1)
train_dataset[hate_speech_examples]['tweet'][15:20]
# TODO: create a list with all tweets that are marked as hate speech

['you might be a libtard if... #libtard  #sjw #liberal #politics ',
 '@user take out the #trash america...  - i voted against #hate - i voted against  - i voted against  - i votâ\x80¦ ',
 "if you hold open a door for a woman because she's a woman and not because it's a nice thing to do, that's . don't even try to deny it",
 '@user this man ran for governor of ny, the state with the biggest african-american population    #â\x80¦',
 '#stereotyping #prejudice  offer no #hope or solutions but create the same old repetitive #hate #conflictâ\x80¦ ']

# Tokenization
Now that we have the datasets loaded, we need to tokenize them. This is very easy with 🤗 transformers, but to make our model faster we are first going to find out the smallest sequence length that we can comfortably work with. Tweets are very short, so we should be able to choose a sequence length that is a lot shorter than the standard 512 that most BERT models run with. We are going to tokenize the whole dataset with a very generous sequence length, choose our new sequence length so that at least 95% of all tweets are within this length, and then tokenize again while truncating those that are longer.
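The idea of picking a sequence length that covers at least 95% of the tweets can be sketched on toy data. The helper `percentile_nearest_rank` and the list `toy_lengths` are hypothetical names introduced here for illustration; the numbers are made up and not taken from the dataset. The sketch uses a nearest-rank percentile, which may differ slightly from the interpolated value that `np.percentile` returns.

```python
import math

def percentile_nearest_rank(values, q):
    """Return the smallest value v such that at least q% of the values are <= v."""
    ordered = sorted(values)
    rank = max(1, math.ceil(q * len(ordered) / 100))
    return ordered[rank - 1]

# Toy token counts per tweet (made-up numbers, not from the real dataset).
toy_lengths = [12, 15, 18, 20, 22, 22, 25, 27, 30, 31,
               33, 35, 36, 38, 40, 41, 44, 47, 52, 96]

# 95th percentile plus 1, mirroring the "add 1 to be on the safe side" rule.
chosen_len = percentile_nearest_rank(toy_lengths, 95) + 1

# Fraction of toy tweets that fit without truncation.
covered = sum(length <= chosen_len for length in toy_lengths) / len(toy_lengths)
print(chosen_len, covered)
```

In this toy example, only the single outlier of 96 tokens would be truncated, while every other tweet is padded to a much shorter length than 128.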

In [None]:
def run_tokenizer(train_dataset, dev_dataset):
    tokenizer = DistilBertTokenizerFast.from_pretrained('bert-base-uncased') #actually the same as BertTokenizerFast

    def get_sequence_len(tokenizer, train_dataset, dev_dataset):

        def tokenize_for_lengths(batch):
            return tokenizer(batch['tweet'], padding=False, truncation=True, max_length=128, return_length = True)

        train_dataset_for_lengths = train_dataset.map(tokenize_for_lengths, batched=True, batch_size=len(train_dataset))

        tweet_lengths = np.array(train_dataset_for_lengths[:]['length'])
        #TODO: find out what sequence length is at the 95th percentile of the tweet_lengths list and add 1 so we're on the safe side
        percentile = int(np.percentile(tweet_lengths, 95))
        chosen_sequence_len = percentile + 1
        
        return chosen_sequence_len

    chosen_sequence_len = get_sequence_len(tokenizer, train_dataset, dev_dataset)

    def tokenize(batch, sequence_len):
        return tokenizer(batch['tweet'], padding="max_length", truncation=True, max_length=sequence_len)

    train_dataset = train_dataset.map(tokenize, fn_kwargs={'sequence_len': chosen_sequence_len}, batched=True, batch_size=len(train_dataset))
    dev_dataset = dev_dataset.map(tokenize, fn_kwargs={'sequence_len': chosen_sequence_len}, batched=True, batch_size=len(dev_dataset))

    return (train_dataset, dev_dataset)

train_dataset, dev_dataset = run_tokenizer(train_dataset, dev_dataset)

In [None]:
def set_format(train_dataset, dev_dataset):
    train_dataset.set_format('torch', columns=['input_ids', 'attention_mask', 'label'])
    dev_dataset.set_format('torch', columns=['input_ids', 'attention_mask', 'label'])

    return (train_dataset, dev_dataset)

train_dataset, dev_dataset = set_format(train_dataset, dev_dataset)

# Model Definition
To make training as fast as possible, we are going to load the DistilBERT model. To finetune it, we load the pretrained model, add a classification head on top of it, and then train on our dataset and task. In this case, we're doing binary sequence classification: we classify sequences (here, tweets) as either hate speech (label 1) or not hate speech (label 0). Luckily for us, in 🤗 transformers we only need to instantiate a BertForSequenceClassification model from a pretrained checkpoint and specify how many labels we want for the classification. We will get a warning that some of the weights (those of the classification head) have not been trained yet, but that's fine.

In [None]:
def define_model():
    model = BertForSequenceClassification.from_pretrained("distilbert-base-uncased", num_labels=2)
    #TODO: instantiate a BertForSequenceClassification model from distilbert-base-uncased for classification with two possible labels
    return model
model = define_model()

Some weights of the model checkpoint at distilbert-base-uncased were not used when initializing BertForSequenceClassification: ['distilbert.embeddings.word_embeddings.weight', 'distilbert.embeddings.position_embeddings.weight', 'distilbert.embeddings.LayerNorm.weight', 'distilbert.embeddings.LayerNorm.bias', 'distilbert.transformer.layer.0.attention.q_lin.weight', 'distilbert.transformer.layer.0.attention.q_lin.bias', 'distilbert.transformer.layer.0.attention.k_lin.weight', 'distilbert.transformer.layer.0.attention.k_lin.bias', 'distilbert.transformer.layer.0.attention.v_lin.weight', 'distilbert.transformer.layer.0.attention.v_lin.bias', 'distilbert.transformer.layer.0.attention.out_lin.weight', 'distilbert.transformer.layer.0.attention.out_lin.bias', 'distilbert.transformer.layer.0.sa_layer_norm.weight', 'distilbert.transformer.layer.0.sa_layer_norm.bias', 'distilbert.transformer.layer.0.ffn.lin1.weight', 'distilbert.transformer.layer.0.ffn.lin1.bias', 'distilbert.transformer.layer.0.
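
Besides the expected note about an untrained classification head, the warning above lists weights prefixed with `distilbert.` as not used. A rough sketch of how this can happen, assuming pretrained weights are matched to model parameters by name: the `distilbert.` names below are copied from the warning, while the `bert.`-style names are assumptions about how a BERT model names its parameters and are illustrative only.

```python
# Parameter names found in the checkpoint (two examples taken from the warning).
checkpoint_keys = {
    "distilbert.embeddings.word_embeddings.weight",
    "distilbert.transformer.layer.0.attention.q_lin.weight",
}

# Assumed BERT-style parameter names expected by the model class (illustrative).
model_keys = {
    "bert.embeddings.word_embeddings.weight",
    "bert.encoder.layer.0.attention.self.query.weight",
    "classifier.weight",
}

# Names present in only one of the two sets cannot be matched.
unused_checkpoint_keys = sorted(checkpoint_keys - model_keys)
newly_initialized_keys = sorted(model_keys - checkpoint_keys)
loaded_keys = sorted(checkpoint_keys & model_keys)

print("unused:", unused_checkpoint_keys)
print("newly initialized:", newly_initialized_keys)
print("loaded:", loaded_keys)
```

If this reading is right, none of the pretrained encoder weights would be loaded in such a mismatch. A class that matches the checkpoint's architecture, such as `DistilBertForSequenceClassification`, would be expected to pick them up.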

Here, we are going to set up two things. The first is a function that we pass to the Trainer class to tell it which metrics to compute on our development set. The second is an early stopping callback so that, just like in the last exercise sheet, we can stop training if the development performance isn't increasing.

In [None]:
def compute_metrics(pred):
    labels = pred.label_ids
    preds = pred.predictions.argmax(-1)
    precision, recall, f1, _ = precision_recall_fscore_support(labels, preds, average='binary')
    acc = accuracy_score(labels, preds)
    return {
        'accuracy': acc,
        'f1': f1,
        'precision': precision,
        'recall': recall
    }

early_stopping_callback = EarlyStoppingCallback(early_stopping_patience=2, early_stopping_threshold=0.0)
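
The patience and threshold settings can be illustrated with a simplified sketch. This is not the library's implementation; `evals_until_stop` is a hypothetical helper, and the F1 values are made up. It assumes that an evaluation counts as an improvement only if it beats the best score so far by more than the threshold.

```python
def evals_until_stop(f1_history, patience=2, threshold=0.0):
    """Return the 1-based evaluation at which training would stop, or None."""
    best = None
    evals_without_improvement = 0
    for eval_number, f1 in enumerate(f1_history, start=1):
        if best is None or f1 > best + threshold:
            # New best score: remember it and reset the counter.
            best = f1
            evals_without_improvement = 0
        else:
            evals_without_improvement += 1
            if evals_without_improvement >= patience:
                return eval_number
    return None

# Made-up development F1 scores, one per evaluation.
print(evals_until_stop([0.60, 0.65, 0.64, 0.65, 0.70]))
print(evals_until_stop([0.10, 0.20, 0.30]))
```

In the first toy history, the third and fourth evaluations fail to beat 0.65, so training would stop at the fourth evaluation and never reach the 0.70. In the second, the score improves at every evaluation, so training is not stopped early.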

Now, we are going to define the training arguments that we pass to the Trainer class, which will handle the training for us. Two settings are especially important: metric_for_best_model is set to F1-measure, so that F1-measure is used for early stopping, and load_best_model_at_end is set to True, so that we end up with the best model rather than one for which F1-measure has already decreased.

In [None]:
training_args = TrainingArguments(
    output_dir='./results',
    num_train_epochs=NUM_EPOCHS,
    per_device_train_batch_size=TRAIN_BATCH_SIZE,
    per_device_eval_batch_size=EVAL_BATCH_SIZE,
    warmup_steps=WARMUP_STEPS,
    weight_decay=WEIGHT_DECAY,
    logging_dir='./logs/',
    evaluation_strategy="steps",
    logging_steps=LOGGING_STEPS,
    learning_rate=LEARNING_RATE,
    metric_for_best_model="f1",
    load_best_model_at_end=True
)

Here we instantiate the Trainer class with our model, the training arguments, the datasets, the metrics function, and the early stopping callback.

In [None]:
def define_trainer(model, training_args, train_dataset, dev_dataset, compute_metrics, early_stopping_callback):
    trainer = Trainer(
        model=model,
        args=training_args,
        train_dataset=train_dataset,
        eval_dataset=dev_dataset,
        compute_metrics=compute_metrics,
        callbacks = [early_stopping_callback]
    )
    
    return trainer

trainer = define_trainer(model, training_args, train_dataset, dev_dataset, compute_metrics, early_stopping_callback)

# Training

In [None]:
def train(trainer):
    trainer.train() #TODO: tell the Trainer object to train 
    return trainer

trainer = train(trainer)



| Step | Training Loss | Validation Loss | Accuracy | F1 | Precision | Recall | Runtime | Samples Per Second |
|---|---|---|---|---|---|---|---|---|
| 50 | 0.3257 | 0.29459 | 0.930695 | 0.0 | 0.0 | 0.0 | 14.1579 | 451.48 |
| 100 | 0.2471 | 0.252357 | 0.930695 | 0.0 | 0.0 | 0.0 | 14.7178 | 434.303 |
| 150 | 0.266 | 0.266822 | 0.930695 | 0.0 | 0.0 | 0.0 | 14.9695 | 427.001 |




In [None]:
def evaluate(trainer):
    # Return the metrics dictionary so that the notebook displays it
    return trainer.evaluate()

evaluate(trainer)



# What happened?

It looks like our accuracy is 93%, but our Precision, Recall and F1-measure are all 0. What happened?

Let's take a look at our dataset again. How are the classes distributed?


In [None]:
# We got the following warning:
# Precision and F-score are ill-defined and being set to 0.0 due to no predicted samples.
# Use `zero_division` parameter to control this behavior.
# In other words: hate speech is rare in the data, and our classifier labeled every tweet as not hate speech.
# With no predicted positives, precision and F1 are undefined and are reported as 0.
# Accuracy is 93% because every tweet is predicted as not hate speech; the remaining 7% of errors are the hate speech tweets.

In [None]:
hate_speech_dataset = load_dataset('tweets_hate_speech_detection', split='train')
print(np.mean(np.array(hate_speech_dataset['label'])))

Using custom data configuration default
Reusing dataset tweets_hate_speech_detection (/root/.cache/huggingface/datasets/tweets_hate_speech_detection/default/0.0.0/c32a982d8b2d6233065d820ac655454174f8aaa8faddc74979cf793486acd3b0)


0.07014579813528565


It looks like only about 7% of the training dataset is actually hate speech. This extreme imbalance means that the model takes the path of least resistance for the loss, which is to predict "not hate speech" all the time. 
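
The effect can be checked by hand for a classifier that always predicts label 0. The sketch below uses the class counts of the full dataset (2242 hate speech tweets, 29720 others) and assumes the convention from the warning that undefined scores are reported as 0.0.

```python
num_hate = 2242       # tweets with label 1
num_not_hate = 29720  # tweets with label 0

# A classifier that always predicts label 0 ("not hate speech").
true_pos = 0
false_pos = 0
false_neg = num_hate
true_neg = num_not_hate

accuracy = (true_pos + true_neg) / (num_hate + num_not_hate)

# Undefined ratios (division by zero) are set to 0.0, as in the warning.
predicted_pos = true_pos + false_pos
precision = true_pos / predicted_pos if predicted_pos else 0.0
recall = true_pos / (true_pos + false_neg)
f1 = 2 * precision * recall / (precision + recall) if (precision + recall) else 0.0

print(round(accuracy, 4), precision, recall, f1)
```

Accuracy rewards the majority class, whereas precision, recall, and F1 are computed for the hate speech class only, so they drop to 0 when that class is never predicted.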

Let's do something about that!

# Rebalancing the dataset

To help with this, we are simply going to rebalance the dataset so that it contains all the hate speech examples, but only as many negative examples as there are hate speech examples, so that the dataset is balanced 50-50.
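
This kind of undersampling can be sketched on a plain list of labels. The helper `balanced_indices` and the list `toy_labels` are hypothetical and only for illustration. Note that the sketch draws the negative examples at random, which is a variation on the sort-and-select approach used in this notebook, and it assumes there are at least as many negative as positive examples.

```python
import random

def balanced_indices(labels, seed=42):
    """Return shuffled indices with all positives and equally many negatives."""
    positives = [i for i, label in enumerate(labels) if label == 1]
    negatives = [i for i, label in enumerate(labels) if label == 0]
    rng = random.Random(seed)
    # Keep only as many negatives as there are positives.
    kept_negatives = rng.sample(negatives, len(positives))
    indices = positives + kept_negatives
    rng.shuffle(indices)
    return indices

# Made-up labels: 2 positives, 8 negatives.
toy_labels = [0, 0, 1, 0, 0, 0, 1, 0, 0, 0]
idx = balanced_indices(toy_labels)
print(idx, [toy_labels[i] for i in idx])
```

A list of indices like this could then be passed to the `select` method of a dataset to build the balanced subset.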

In [None]:
print(sum(np.array(hate_speech_dataset['label'])==1))
print(sum(np.array(hate_speech_dataset['label'])==0))

2242
29720


In [None]:
def rebalance_dataset(dataset):
    num_label_1 = sum(np.array(dataset['label'])==1) #TODO: find out how often the label 1 appears in hate_speech_dataset's label column
    num_label_0 = sum(np.array(dataset['label'])==0) #TODO: find out how often the label 0 appears in hate_speech_dataset's label column
    sorted_dataset = dataset.sort('label')
    balanced_dataset = sorted_dataset.select(list(range(0, num_label_1)) + list(range(num_label_0, num_label_0 + num_label_1)))
    return balanced_dataset.shuffle(seed=42)
balanced_dataset = rebalance_dataset(hate_speech_dataset)

split_index = int(0.8 * len(balanced_dataset)) #index that separates the first 80% from the last 20%
balanced_train_dataset = balanced_dataset.select(range(split_index)) #using the dataset.select method, select the first 80% of the balanced dataset
balanced_dev_dataset = balanced_dataset.select(range(split_index, len(balanced_dataset))) #using the dataset.select method, select the last 20% of the balanced dataset

# one line solution:
# balanced_dataset.train_test_split(test_size=0.2)

Loading cached sorted indices for dataset at /root/.cache/huggingface/datasets/tweets_hate_speech_detection/default/0.0.0/c32a982d8b2d6233065d820ac655454174f8aaa8faddc74979cf793486acd3b0/cache-d7ad6ed779a0089f.arrow


Now that we have balanced the dataset, let's run everything again. 

In [None]:
balanced_train_dataset, balanced_dev_dataset = run_tokenizer(balanced_train_dataset, balanced_dev_dataset)
balanced_train_dataset, balanced_dev_dataset = set_format(balanced_train_dataset, balanced_dev_dataset)
balanced_model = define_model()
balanced_trainer = define_trainer(balanced_model, training_args, balanced_train_dataset, balanced_dev_dataset, compute_metrics, early_stopping_callback)
balanced_trainer = train(balanced_trainer)
evaluate(balanced_trainer)


That's better! **TODO**: Write a few sentences about how much F1-measure has improved, and why.

In [None]:
# Rebalancing the sample lets the classifier actually label some tweets as hate speech.
# This is only possible if the dataset contains a sufficient share of hate speech examples
# for the algorithm to learn from, as opposed to simply classifying everything as not hate speech.
# The improvement should be taken with a grain of salt,
# given that the rebalanced setup mainly yields a meaningful score at all,
# whereas before the score was 0 because our classifier was assigning every tweet to the same class.
# Because I struggled with computing the rebalanced dataset, I am unable to give a precise interpretation.