# Text classification with transformers

Transformer-based language models like [BERT](https://arxiv.org/abs/1810.04805v2) have revolutionized NLP in the past 5 years. One important point is that these models are capable of *transfer learning*: they undergo (self-supervised) pretraining on a large text corpus and can later be *fine-tuned* with a rather small dataset of annotated training data.

## Hugging Face

[Hugging Face](https://huggingface.co) provides a comprehensive library of open-source implementations
of recent language models, a [model hub](https://huggingface.co/models) of pretrained models, and a large collection of [datasets](https://huggingface.co/datasets).

We want to use a BERT model to classify a Twitter dataset for hate speech. We start by installing the necessary libraries.


In [16]:
!pip install datasets
!pip install transformers
!pip install evaluate
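
After the installation, it can be useful to confirm that the three packages are importable in the current kernel. The following is a small sketch using only the standard library; the `required` list and the variable names are illustrative.

```python
import importlib.util

# Packages installed above; find_spec returns None if a package cannot be found.
required = ["datasets", "transformers", "evaluate"]
missing = [name for name in required if importlib.util.find_spec(name) is None]

if missing:
    print("Missing packages:", ", ".join(missing))
else:
    print("All required packages are importable.")
```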



## Downloading the data

We now download the [Measuring Hate Speech dataset](https://arxiv.org/abs/2009.10277) from the University of California, Berkeley.

> ⚠️ **Warning:** This dataset contains hate speech of various kinds, including sexual harassment and hate against minorities.

In [1]:
from datasets import load_dataset


dataset = load_dataset("ucberkeley-dlab/measuring-hate-speech")


## Splitting the dataset

We split the dataset into a training part (80%) and a test part (20%).

In [2]:
dataset = dataset['train'].train_test_split(test_size=0.2, seed=42)

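
Conceptually, a seeded split shuffles the row indices once and cuts them into two parts. The following standard-library sketch illustrates the idea and the resulting sizes. It is not the implementation used by `datasets`; the helper name `split_indices` and the rounding rule (test part rounded up) are assumptions chosen to match the row counts reported in this notebook.

```python
import math
import random

def split_indices(n_rows, test_size=0.2, seed=42):
    """Return (train_indices, test_indices) for a reproducible random split."""
    indices = list(range(n_rows))
    random.Random(seed).shuffle(indices)      # same seed -> same permutation
    n_test = math.ceil(n_rows * test_size)    # assumption: test part is rounded up
    return indices[n_test:], indices[:n_test]

# 135556 = 108444 training rows + 27112 evaluation rows reported in this notebook
train_idx, test_idx = split_indices(135556)
print(len(train_idx), len(test_idx))  # 108444 27112
```

Fixing the seed is what makes the split reproducible across runs: calling the helper twice with the same arguments yields the same two index lists.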


The dataset contains quite a few columns; we only use the `text` and `hatespeech` columns in this demo.

In [None]:
# dataset['train'].column_names

In [4]:
cols_to_remove = dataset['train'].column_names
cols_to_remove.remove("text")
#cols_to_remove.remove("hate_speech_score")
cols_to_remove.remove("hatespeech")

train_ds = dataset['train'].remove_columns(cols_to_remove)
test_ds = dataset['test'].remove_columns(cols_to_remove)
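
An alternative to removing names one by one is to derive the list from a set of columns to keep, which avoids mutating a list in place. In this sketch only `text`, `hatespeech`, and `hate_speech_score` are taken from the notebook; the other column names and the helper `columns_to_remove` are hypothetical.

```python
def columns_to_remove(all_columns, keep):
    """Return every column name that is not in `keep`, preserving order."""
    missing = set(keep) - set(all_columns)
    if missing:
        raise ValueError(f"Columns not found: {sorted(missing)}")
    return [name for name in all_columns if name not in keep]

# Hypothetical column list; the real one comes from dataset['train'].column_names.
all_columns = ["example_id", "text", "hatespeech", "hate_speech_score", "example_source"]
cols_to_remove = columns_to_remove(all_columns, keep={"text", "hatespeech"})
print(cols_to_remove)  # ['example_id', 'hate_speech_score', 'example_source']
```

The resulting list could be passed to `remove_columns` exactly as in the cell above.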

In [5]:
train_ds

Dataset({
    features: ['hatespeech', 'text'],
    num_rows: 108444
})

## Tokenizing the text

To tokenize the text, we download the tokenizer for our model.

In [6]:
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")

We now do the following:

- apply the tokenize function to our data,
- rename the `hatespeech` column to `label`, and
- cast the `label` column to a `ClassLabel`.

In [7]:
from datasets import ClassLabel

def tokenize_function(examples):
    # Pad or truncate every text to the model's maximum input length.
    return tokenizer(examples["text"], padding="max_length", truncation=True)

# 'a', 'b', 'c' are placeholder names for the three classes of the `hatespeech` column.
label_feature = ClassLabel(names=['a', 'b', 'c'])

train_ds = train_ds.map(tokenize_function, batched=True).rename_column('hatespeech', 'label').cast_column('label', label_feature)
test_ds = test_ds.map(tokenize_function, batched=True).rename_column('hatespeech', 'label').cast_column('label', label_feature)


In [8]:
test_ds.features

{'label': ClassLabel(names=['a', 'b', 'c'], id=None),
 'text': Value(dtype='string', id=None),
 'input_ids': Sequence(feature=Value(dtype='int32', id=None), length=-1, id=None),
 'token_type_ids': Sequence(feature=Value(dtype='int8', id=None), length=-1, id=None),
 'attention_mask': Sequence(feature=Value(dtype='int8', id=None), length=-1, id=None)}
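
The `input_ids` and `attention_mask` columns result from converting each text to token ids and bringing all sequences to the same length. The toy sketch below mimics `padding="max_length"` and `truncation=True` for a single sequence. It is a simplification, not the tokenizer's implementation: the ids 101, 102, and 0 for `[CLS]`, `[SEP]`, and `[PAD]` are assumed to match the `bert-base-uncased` vocabulary, the word ids are made up, and `max_length=8` is chosen only for readability.

```python
CLS_ID, SEP_ID, PAD_ID = 101, 102, 0  # assumed special-token ids

def encode(token_ids, max_length=8):
    """Add special tokens, truncate to max_length, then pad to max_length."""
    body = token_ids[: max_length - 2]            # leave room for [CLS] and [SEP]
    input_ids = [CLS_ID] + body + [SEP_ID]
    attention_mask = [1] * len(input_ids)         # 1 = real token
    n_pad = max_length - len(input_ids)
    input_ids += [PAD_ID] * n_pad
    attention_mask += [0] * n_pad                 # 0 = padding, not attended to
    return {"input_ids": input_ids, "attention_mask": attention_mask}

short_example = encode([2023, 2003, 1037])        # made-up word ids
long_example = encode(list(range(1000, 1020)))    # 20 ids, gets truncated
print(short_example["input_ids"])       # [101, 2023, 2003, 1037, 102, 0, 0, 0]
print(short_example["attention_mask"])  # [1, 1, 1, 1, 1, 0, 0, 0]
print(long_example["input_ids"])        # [101, 1000, 1001, 1002, 1003, 1004, 1005, 102]
```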

## Loading a pretrained model

We now download our model from the model hub.

In [9]:
from transformers import AutoModelForSequenceClassification

model = AutoModelForSequenceClassification.from_pretrained("bert-base-uncased", num_labels=3)

Some weights of the model checkpoint at bert-base-uncased were not used when initializing BertForSequenceClassification: ['cls.seq_relationship.weight', 'cls.predictions.transform.LayerNorm.weight', 'cls.predictions.transform.dense.bias', 'cls.predictions.transform.LayerNorm.bias', 'cls.predictions.decoder.weight', 'cls.predictions.transform.dense.weight', 'cls.seq_relationship.bias', 'cls.predictions.bias']
- This IS expected if you are initializing BertForSequenceClassification from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing BertForSequenceClassification from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
Some weights of BertForSequenceClassification were not initialized from the model checkpoint at
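
This warning is expected here: the pretraining heads listed in the message are discarded, and the classification layer on top of the encoder starts from random weights, which is why the model has to be fine-tuned. As a rough size check, assuming the classification head is a single linear layer on a 768-dimensional hidden state (the hidden size of BERT base), the head accounts for only a tiny share of the trainable parameters reported in the training log further below.

```python
hidden_size = 768             # assumption: hidden size of BERT base
num_labels = 3                # as passed to from_pretrained above
total_trainable = 109484547   # "Number of trainable parameters" from the training log

head_params = hidden_size * num_labels + num_labels   # weight matrix + bias
encoder_params = total_trainable - head_params        # everything below the head

print(head_params)      # 2307
print(encoder_params)   # 109482240
print(f"{head_params / total_trainable:.6%}")
```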

## Configuring the training

All configuration options for the training are abstracted into a `TrainingArguments` class.

In [10]:
from transformers import TrainingArguments, Trainer

training_args = TrainingArguments(output_dir="test_trainer", per_device_train_batch_size=64, evaluation_strategy="steps")

## Defining a quality metric

Hugging Face provides several metrics for evaluation. We use *accuracy* here.

In [11]:
import numpy as np
import evaluate

metric = evaluate.load("accuracy")

def compute_metrics(eval_pred):
    logits, labels = eval_pred
    predictions = np.argmax(logits, axis=-1)
    return metric.compute(predictions=predictions, references=labels)
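
To see what `compute_metrics` does, the following standard-library sketch reproduces its two steps, taking the arg-max over the logits and comparing the result with the labels, on made-up numbers. The helper names and the toy logits are illustrative; the notebook itself relies on NumPy and the `evaluate` library.

```python
def argmax(row):
    """Index of the largest value in a list of logits."""
    return max(range(len(row)), key=row.__getitem__)

def accuracy(logits, labels):
    """Fraction of rows whose arg-max class equals the reference label."""
    predictions = [argmax(row) for row in logits]
    correct = sum(p == y for p, y in zip(predictions, labels))
    return correct / len(labels)

toy_logits = [[2.0, 0.1, -1.0], [0.2, 0.3, 0.1], [0.0, 0.0, 5.0]]
toy_labels = [0, 2, 2]
print(accuracy(toy_logits, toy_labels))  # 0.6666666666666666
```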

## Doing the training

We can now set up a `Trainer` and start the actual training with a single call.

In [12]:
trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=train_ds,
    eval_dataset=test_ds,
    compute_metrics=compute_metrics,
)

In [None]:
trainer.train()

The following columns in the training set don't have a corresponding argument in `BertForSequenceClassification.forward` and have been ignored: text. If text are not expected by `BertForSequenceClassification.forward`,  you can safely ignore this message.
***** Running training *****
  Num examples = 108444
  Num Epochs = 3
  Instantaneous batch size per device = 64
  Total train batch size (w. parallel, distributed & accumulation) = 64
  Gradient Accumulation steps = 1
  Total optimization steps = 5085
  Number of trainable parameters = 109484547


| Step | Training Loss | Validation Loss | Accuracy |
|------|---------------|-----------------|----------|
| 500  | 0.6017        | 0.557136        | 0.788396 |


The following columns in the evaluation set don't have a corresponding argument in `BertForSequenceClassification.forward` and have been ignored: text. If text are not expected by `BertForSequenceClassification.forward`,  you can safely ignore this message.
***** Running Evaluation *****
  Num examples = 27112
  Batch size = 8
Saving model checkpoint to test_trainer/checkpoint-500
Configuration saved in test_trainer/checkpoint-500/config.json
Model weights saved in test_trainer/checkpoint-500/pytorch_model.bin
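
The numbers in the log can be reproduced with a little arithmetic. The sketch assumes that each epoch processes every training example once, with a partially filled last batch, and that evaluation runs every 500 steps, as the first table row suggests.

```python
import math

num_train, num_eval = 108444, 27112   # "Num examples" from the log
train_batch, eval_batch = 64, 8       # batch sizes from the log
epochs = 3

steps_per_epoch = math.ceil(num_train / train_batch)
total_steps = steps_per_epoch * epochs
eval_batches = math.ceil(num_eval / eval_batch)
num_evaluations = total_steps // 500  # assumption: one evaluation every 500 steps

print(steps_per_epoch)   # 1695
print(total_steps)       # 5085
print(eval_batches)      # 3389
print(num_evaluations)   # 10
```

Because every evaluation pass iterates over all 3389 evaluation batches, a larger evaluation batch size (the `per_device_eval_batch_size` argument of `TrainingArguments`) would reduce the number of batches per pass.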
