# Simple Training with the 🤗 Transformers Trainer

[here](https://www.youtube.com/watch?v=u--UVvH-LIQ&t=132s)

## Dataset: EDA & preparation

In [1]:
from datasets import load_dataset

emotion_dataset = load_dataset("emotion")
emotion_dataset

Found cached dataset emotion (/home/jupyter/.cache/huggingface/datasets/emotion/default/0.0.0/348f63ca8e27b3713b6c04d723efe6d824a56fb3d1449794716c0f0296072705)


DatasetDict({
    train: Dataset({
        features: ['text', 'label'],
        num_rows: 16000
    })
    validation: Dataset({
        features: ['text', 'label'],
        num_rows: 2000
    })
    test: Dataset({
        features: ['text', 'label'],
        num_rows: 2000
    })
})

Q. How can you download the dataset separately and save it locally?

In [2]:
emotion_dataset["train"][0]

{'text': 'i didnt feel humiliated', 'label': 0}

In [3]:
emotion_df = emotion_dataset["train"].to_pandas()
emotion_df.head()

| | text | label |
|---|---|---|
| 0 | i didnt feel humiliated | 0 |
| 1 | i can go from feeling so hopeless to so damned... | 0 |
| 2 | im grabbing a minute to post i feel greedy wrong | 3 |
| 3 | i am ever feeling nostalgic about the fireplac... | 2 |
| 4 | i am feeling grouchy | 3 |


----

In [4]:
type(emotion_dataset["train"])

datasets.arrow_dataset.Dataset

In [5]:
features = emotion_dataset["train"].features
features

{'text': Value(dtype='string', id=None),
 'label': ClassLabel(names=['sadness', 'joy', 'love', 'anger', 'fear', 'surprise'], id=None)}

In [6]:
features['label'].int2str(0)

'sadness'

In [7]:
n_classes = features['label'].num_classes

id2label = {idx:features['label'].int2str(idx) for idx in range(n_classes)}
id2label

{0: 'sadness', 1: 'joy', 2: 'love', 3: 'anger', 4: 'fear', 5: 'surprise'}

In [8]:
label2id = {label: idx for idx, label in id2label.items()}
label2id

{'sadness': 0, 'joy': 1, 'love': 2, 'anger': 3, 'fear': 4, 'surprise': 5}

----

In [9]:
class2prob = emotion_df["label"].value_counts(normalize=True).sort_index()
print(class2prob)

0    0.291625
1    0.335125
2    0.081500
3    0.134937
4    0.121063
5    0.035750
Name: label, dtype: float64


In [10]:
for i,v in class2prob.items():
    print(f"{id2label[i]:<10} {v:.5F}")

sadness    0.29163
joy        0.33513
love       0.08150
anger      0.13494
fear       0.12106
surprise   0.03575


Notice that class 5 (`surprise`) is under-represented: it accounts for only about 3.6% of the training examples, while `joy` accounts for about 33.5%.

_How do we deal with this?_

## Tokenize

We will fine-tune [microsoft/MiniLM-L12-H384-uncased](https://huggingface.co/microsoft/MiniLM-L12-H384-uncased?text=I+like+you.+I+love+you), available on the Hugging Face Model Hub.

> Pre-trained language models (e.g., BERT (Devlin et al., 2018) and its variants) have achieved remarkable success in varieties of NLP tasks. However, these models usually consist of hundreds of millions of parameters which brings challenges for fine-tuning and online serving in real-life applications due to latency and capacity constraints. In this work, we present a simple and effective approach to compress large Transformer (Vaswani et al., 2017) based pre-trained models, termed as deep self-attention distillation. The small model (student) is trained by deeply mimicking the self-attention module, which plays a vital role in Transformer networks, of the large model (teacher). Specifically, we propose distilling the self-attention module of the last Transformer layer of the teacher, which is effective and flexible for the student. Furthermore, we introduce the scaled dot-product between values in the self-attention module as the new deep self-attention knowledge, in addition to the attention distributions (i.e., the scaled dot-product of queries and keys) that have been used in existing works. Moreover, we show that introducing a teacher assistant (Mirzadeh et al., 2019) also helps the distillation of large pre-trained Transformer models. Experimental results demonstrate that our monolingual model outperforms state-of-the-art baselines in different parameter size of student models. In particular, it retains more than 99% accuracy on SQuAD 2.0 and several GLUE benchmark tasks using 50% of the Transformer parameters and computations of the teacher model. We also obtain competitive results in applying deep self-attention distillation to multilingual pre-trained models.

Read the [MiniLM: Deep Self-Attention Distillation for Task-Agnostic Compression of Pre-trained Transformers](https://arxiv.org/abs/2002.10957) paper.

In [11]:
from transformers import AutoTokenizer

checkpoint = "microsoft/MiniLM-L12-H384-uncased"
tokenizer = AutoTokenizer.from_pretrained(checkpoint)

##### `input_ids`

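Indices of the tokens in the tokenizer's vocabulary, which the model uses to look up the corresponding token embeddings. A single word can map to more than one token.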

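##### `token_type_ids`

Segment ids that tell the model which sequence each token belongs to when two sequences are encoded together (for example, a question and its context). For single-sentence inputs such as ours, they are all 0.
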
##### `attention_mask`

> Attention masks are tensors with the exact same shape as the input IDs tensor, filled with 0s and 1s: 1s indicate the corresponding tokens should be attended to, and 0s indicate the corresponding tokens should not be attended to (i.e., they should be ignored by the attention layers of the model).
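
To make the relationship between padding and `attention_mask` concrete, here is a minimal pure-Python sketch. The helper `pad_batch` is hypothetical (it is not part of 🤗 Transformers), and it assumes a pad token id of 0, which matches the `pad_token_id` in this model's config. The second sequence is an arbitrary, shorter list of ids chosen for illustration.

```python
def pad_batch(batch_input_ids, pad_token_id=0):
    """Right-pad every sequence to the longest one and build attention masks."""
    max_len = max(len(ids) for ids in batch_input_ids)
    padded, masks = [], []
    for ids in batch_input_ids:
        n_pad = max_len - len(ids)
        padded.append(ids + [pad_token_id] * n_pad)
        # 1 = real token (attend to it), 0 = padding (ignore it)
        masks.append([1] * len(ids) + [0] * n_pad)
    return {"input_ids": padded, "attention_mask": masks}


batch = pad_batch([
    [101, 1045, 2134, 2102, 2514, 26608, 102],  # ids of the first training example
    [101, 1045, 2514, 102],                     # shorter, illustrative sequence
])
print(batch["input_ids"])
print(batch["attention_mask"])
```

When every sequence in a batch has the same length, as in the single-example call in this notebook, the mask is all 1s.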

In [12]:
tokenizer(emotion_dataset["train"]["text"][:1])

{'input_ids': [[101, 1045, 2134, 2102, 2514, 26608, 102]], 'token_type_ids': [[0, 0, 0, 0, 0, 0, 0]], 'attention_mask': [[1, 1, 1, 1, 1, 1, 1]]}

In [13]:
def tokenize_text(examples):
    return tokenizer(examples["text"], truncation=True, max_length=512)

In [14]:
emotion_dataset = emotion_dataset.map(tokenize_text, batched=True)
emotion_dataset

Loading cached processed dataset at /home/jupyter/.cache/huggingface/datasets/emotion/default/0.0.0/348f63ca8e27b3713b6c04d723efe6d824a56fb3d1449794716c0f0296072705/cache-5d942269825b2479.arrow
Loading cached processed dataset at /home/jupyter/.cache/huggingface/datasets/emotion/default/0.0.0/348f63ca8e27b3713b6c04d723efe6d824a56fb3d1449794716c0f0296072705/cache-1ea28bd647c98398.arrow
Loading cached processed dataset at /home/jupyter/.cache/huggingface/datasets/emotion/default/0.0.0/348f63ca8e27b3713b6c04d723efe6d824a56fb3d1449794716c0f0296072705/cache-75b7553e76ca0b70.arrow


DatasetDict({
    train: Dataset({
        features: ['text', 'label', 'input_ids', 'token_type_ids', 'attention_mask'],
        num_rows: 16000
    })
    validation: Dataset({
        features: ['text', 'label', 'input_ids', 'token_type_ids', 'attention_mask'],
        num_rows: 2000
    })
    test: Dataset({
        features: ['text', 'label', 'input_ids', 'token_type_ids', 'attention_mask'],
        num_rows: 2000
    })
})

From the [API docs on Trainer](https://huggingface.co/docs/transformers/main_classes/trainer):

> The Trainer class is optimized for 🤗 Transformers models and can have surprising
> behaviors when you use it on other models. When using it on your own model, make sure:
> * your model always return tuples or subclasses of ModelOutput.
> * your model can compute the loss if a `labels` argument is provided and that loss is returned as the first element of the tuple (if your model returns tuples)
> * your model can accept multiple label arguments (use the `label_names` in your TrainingArguments to indicate their name to the Trainer) but none of them should be named "`label`".

So we need to rename the `label` column in our dataset to `labels`.

In [15]:
emotion_dataset = emotion_dataset.rename_column("label", "labels")
emotion_dataset

DatasetDict({
    train: Dataset({
        features: ['text', 'labels', 'input_ids', 'token_type_ids', 'attention_mask'],
        num_rows: 16000
    })
    validation: Dataset({
        features: ['text', 'labels', 'input_ids', 'token_type_ids', 'attention_mask'],
        num_rows: 2000
    })
    test: Dataset({
        features: ['text', 'labels', 'input_ids', 'token_type_ids', 'attention_mask'],
        num_rows: 2000
    })
})

## Dealing with imbalanced classes

In [16]:
class_weights = (1 - (emotion_df["label"].value_counts().sort_index() / len(emotion_df))).values
print(class_weights)
print(type(class_weights))

[0.708375  0.664875  0.9185    0.8650625 0.8789375 0.96425  ]
<class 'numpy.ndarray'>


In [17]:
import torch

class_weights = torch.from_numpy(class_weights).float().to("cuda")
class_weights
print(type(class_weights))
print(class_weights.dim())

<class 'torch.Tensor'>
1


Now, when training a classifier on a dataset with imbalanced classes, one way to deal with the situation is to up-sample the under-represented class(es) to offset the difference(s). However, Transformer models are good at memorizing patterns, so repeating the same examples might not be a good idea in our case.

What we can do instead is correct for the class imbalance in the loss function during training.

From the [Trainer](https://huggingface.co/docs/transformers/main_classes/trainer#trainer) documentation:

> To inject custom behavior you can subclass them and override the following methods:
> ...
> * [`compute_loss`](https://huggingface.co/docs/transformers/main_classes/trainer#transformers.Trainer.compute_loss) - Computes the loss on a batch of training inputs.


From the PyTorch docs on [torch.nn.CrossEntropyLoss](https://pytorch.org/docs/stable/generated/torch.nn.CrossEntropyLoss.html):

> If provided, the optional argument `weight` should be a 1D Tensor assigning weight to each of the classes. This is particularly useful when you have an unbalanced training set.
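
To see what the `weight` argument does numerically, the following pure-Python sketch recomputes a class-weighted cross-entropy by hand. It assumes the default `reduction="mean"`, for which the weighted sum of per-example losses is divided by the sum of the weights of the target classes. The function name, weights, logits, and targets are all made up for illustration.

```python
import math


def weighted_cross_entropy(logits, targets, class_weights):
    """Class-weighted cross-entropy with 'mean' reduction, computed by hand."""
    weighted_sum = 0.0
    weight_total = 0.0
    for row, target in zip(logits, targets):
        # negative log-softmax of the target class, computed stably
        row_max = max(row)
        log_sum_exp = row_max + math.log(sum(math.exp(x - row_max) for x in row))
        nll = log_sum_exp - row[target]
        weighted_sum += class_weights[target] * nll
        weight_total += class_weights[target]
    return weighted_sum / weight_total


# Hypothetical 3-class example: class 2 is rare, so it gets the largest weight.
toy_weights = [0.3, 0.4, 0.9]
toy_logits = [[2.0, 0.5, 0.1], [0.2, 0.1, 1.5], [0.3, 2.2, 0.4]]
toy_targets = [0, 2, 2]

unweighted = weighted_cross_entropy(toy_logits, toy_targets, [1.0, 1.0, 1.0])
weighted = weighted_cross_entropy(toy_logits, toy_targets, toy_weights)
print(f"unweighted={unweighted:.4f} weighted={weighted:.4f}")
```

In this toy batch the misclassified example belongs to the rare class, so the weighted loss comes out larger than the unweighted one: mistakes on rare classes are penalized more heavily.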

In [18]:
from torch import nn
from transformers import Trainer


class WeightedLossTrainer(Trainer):

    def compute_loss(self, model, inputs, return_outputs=False, **kwargs):
        # **kwargs absorbs extra arguments (such as num_items_in_batch)
        # that newer Trainer versions pass to compute_loss

        # feed inputs to model and extract logits
        outputs = model(**inputs)
        logits = outputs.get("logits")

        # extract labels (the column is named "labels", not "label")
        labels = inputs.get("labels")

        # define loss function with class weights on the same device as the logits
        loss_func = nn.CrossEntropyLoss(weight=class_weights.to(logits.device))

        # compute loss
        loss = loss_func(logits, labels)

        return (loss, outputs) if return_outputs else loss


----

## Now bring it all together!

In [19]:
from transformers import AutoModelForSequenceClassification

model = AutoModelForSequenceClassification.from_pretrained(checkpoint,
                                                           num_labels=n_classes,
                                                           id2label=id2label,
                                                           label2id=label2id)

Some weights of BertForSequenceClassification were not initialized from the model checkpoint at microsoft/MiniLM-L12-H384-uncased and are newly initialized: ['classifier.weight', 'classifier.bias']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


In [20]:
from sklearn.metrics import f1_score

def compute_metrics(pred):
    labels = pred.label_ids
    preds = pred.predictions.argmax(-1)
    f1 = f1_score(labels, preds, average="weighted")
    return {"f1": f1}
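
As a rough illustration of what `average="weighted"` means, the sketch below computes a per-class F1 and averages it with each class weighted by its number of true examples (its support). This is a simplified pure-Python re-implementation for intuition, not scikit-learn's code; the function name and toy labels are hypothetical.

```python
from collections import Counter


def weighted_f1(y_true, y_pred):
    """Support-weighted average of per-class F1 scores."""
    support = Counter(y_true)
    total = 0.0
    for cls, n_true in support.items():
        tp = sum(1 for t, p in zip(y_true, y_pred) if t == cls and p == cls)
        fp = sum(1 for t, p in zip(y_true, y_pred) if t != cls and p == cls)
        fn = n_true - tp
        denom = 2 * tp + fp + fn
        f1 = 2 * tp / denom if denom else 0.0
        # each class contributes in proportion to its support
        total += n_true * f1
    return total / len(y_true)


toy_true = [0, 0, 0, 1, 1, 2]
toy_pred = [0, 0, 1, 1, 1, 2]
score = weighted_f1(toy_true, toy_pred)
print(f"{score:.4f}")
```

Because the average is weighted by support, frequent classes dominate the score, which is worth keeping in mind on an imbalanced dataset like this one.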

Q. How do you know if your GPU supports mixed-precision training? What is mixed-precision training?
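
A partial answer in code, offered as a sketch rather than a definitive check. Mixed-precision training runs most of the forward and backward pass in 16-bit floats while keeping 32-bit master weights, and it scales the loss so that small gradients do not underflow in half precision. The first part below uses only the standard library to show that underflow. The second part encodes the rule of thumb that NVIDIA GPUs with compute capability 7.0 or higher have Tensor Cores that accelerate fp16 math. The helper names are hypothetical, and on a CUDA machine the capability tuple would come from `torch.cuda.get_device_capability()`.

```python
import struct


def to_fp16(value):
    """Round a Python float to the nearest IEEE half-precision value."""
    return struct.unpack("<e", struct.pack("<e", value))[0]


def has_fp16_tensor_cores(capability):
    """Rule of thumb: Tensor Cores arrived with compute capability 7.0 (Volta)."""
    return tuple(capability) >= (7, 0)


tiny_gradient = 1e-8
loss_scale = 1024.0  # illustrative scale factor

print(to_fp16(tiny_gradient))               # too small for fp16, underflows
print(to_fp16(tiny_gradient * loss_scale))  # representable once scaled

# Example capability tuples: (6, 1) is a Pascal-class GPU, (7, 5) is Turing-class.
print(has_fp16_tensor_cores((6, 1)), has_fp16_tensor_cores((7, 5)))
```

Older GPUs can often still run fp16 code, but without dedicated hardware the speed-up may be small or absent.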

In [21]:
from transformers import TrainingArguments

batch_size = 64

logging_steps = len(emotion_dataset["train"]) // batch_size

output_dir = "test-minilm-finetuned-emotion"

training_args = TrainingArguments(
    output_dir=output_dir,
    num_train_epochs=5,
    learning_rate=2e-5,
    per_device_train_batch_size=batch_size,
    per_device_eval_batch_size=batch_size,
    weight_decay=0.01,
    evaluation_strategy="epoch",
    logging_steps=logging_steps,
    fp16=True,
    push_to_hub=False
)
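
The arithmetic behind `logging_steps` can be checked with a short sketch. It assumes a single device and no gradient accumulation, which matches the training log further down, and it assumes the checkpoint interval is the `TrainingArguments` default of 500 steps, which is consistent with the `checkpoint-500` directory in the output. Variable names are illustrative.

```python
import math

num_examples = 16_000
batch_size = 64
num_epochs = 5
save_steps = 500  # assumed default checkpoint interval

steps_per_epoch = math.ceil(num_examples / batch_size)
total_steps = steps_per_epoch * num_epochs
logging_steps = num_examples // batch_size  # one log entry per epoch
checkpoints = list(range(save_steps, total_steps + 1, save_steps))

print(steps_per_epoch, total_steps, logging_steps, checkpoints)
```

Under these assumptions the last checkpoint written during training is at step 1000, which would explain why the inference cell loads `checkpoint-1000`.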

In [22]:
trainer = WeightedLossTrainer(
    model=model,
    args=training_args,
    compute_metrics=compute_metrics,
    train_dataset=emotion_dataset["train"],
    eval_dataset=emotion_dataset["validation"],
    tokenizer=tokenizer
)

Using cuda_amp half precision backend


In [23]:
trainer.train()

The following columns in the training set don't have a corresponding argument in `BertForSequenceClassification.forward` and have been ignored: text. If text are not expected by `BertForSequenceClassification.forward`,  you can safely ignore this message.
***** Running training *****
  Num examples = 16000
  Num Epochs = 5
  Instantaneous batch size per device = 64
  Total train batch size (w. parallel, distributed & accumulation) = 64
  Gradient Accumulation steps = 1
  Total optimization steps = 1250
You're using a BertTokenizerFast tokenizer. Please note that with a fast tokenizer, using the `__call__` method is faster than using a method to encode the text followed by a call to the `pad` method to get a padded encoding.


| Epoch | Training Loss | Validation Loss | F1 |
|---|---|---|---|
| 1 | 1.4069 | 1.052148 | 0.606863 |
| 2 | 0.9087 | 0.706614 | 0.806596 |
| 3 | 0.6463 | 0.563339 | 0.864676 |
| 4 | 0.4974 | 0.43355 | 0.901976 |
| 5 | 0.4232 | 0.404301 | 0.908285 |


The following columns in the evaluation set don't have a corresponding argument in `BertForSequenceClassification.forward` and have been ignored: text. If text are not expected by `BertForSequenceClassification.forward`,  you can safely ignore this message.
***** Running Evaluation *****
  Num examples = 2000
  Batch size = 64
Saving model checkpoint to test-minilm-finetuned-emotion/checkpoint-500
Configuration saved in test-minilm-finetuned-emotion/checkpoint-500/config.json
Model weights saved in test-minilm-finetuned-emotion/checkpoint-500/pytorch_model.bin
tokenizer config file saved in test-minilm-finetuned-emotion/checkpoint-500/tokenizer_config.json
Special tokens file saved in test-minilm-finetuned-emotion/checkpoint-500/special_tokens_map.json

TrainOutput(global_step=1250, training_loss=0.7765083251953125, metrics={'train_runtime': 136.5489, 'train_samples_per_second': 585.871, 'train_steps_per_second': 9.154, 'total_flos': 582587326282752.0, 'train_loss': 0.7765083251953125, 'epoch': 5.0})

----

## Now use it!

In [24]:
from transformers import pipeline

pipl = pipeline("text-classification", 
                model="test-minilm-finetuned-emotion/checkpoint-1000")

loading configuration file test-minilm-finetuned-emotion/checkpoint-1000/config.json
Model config BertConfig {
  "_name_or_path": "test-minilm-finetuned-emotion/checkpoint-1000",
  "architectures": [
    "BertForSequenceClassification"
  ],
  "attention_probs_dropout_prob": 0.1,
  "classifier_dropout": null,
  "hidden_act": "gelu",
  "hidden_dropout_prob": 0.1,
  "hidden_size": 384,
  "id2label": {
    "0": "sadness",
    "1": "joy",
    "2": "love",
    "3": "anger",
    "4": "fear",
    "5": "surprise"
  },
  "initializer_range": 0.02,
  "intermediate_size": 1536,
  "label2id": {
    "anger": 3,
    "fear": 4,
    "joy": 1,
    "love": 2,
    "sadness": 0,
    "surprise": 5
  },
  "layer_norm_eps": 1e-12,
  "max_position_embeddings": 512,
  "model_type": "bert",
  "num_attention_heads": 12,
  "num_hidden_layers": 12,
  "pad_token_id": 0,
  "position_embedding_type": "absolute",
  "problem_type": "single_label_classification",
  "torch_dtype": "float32",
  "transformers_version": "4.2

In [25]:
pipl("well, i'm your huckleberry!")

[{'label': 'sadness', 'score': 0.7310232520103455}]
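
For intuition about the `score` field: for a single-label model like this one, the text-classification pipeline applies a softmax to the logits and reports the most probable class. The sketch below mimics that post-processing in pure Python. The function name is hypothetical, and the logits are invented for illustration; they are not the model's actual outputs.

```python
import math

id2label = {0: "sadness", 1: "joy", 2: "love", 3: "anger", 4: "fear", 5: "surprise"}


def postprocess(logits, id2label):
    """Softmax over the logits, then return the top label and its probability."""
    max_logit = max(logits)
    exps = [math.exp(x - max_logit) for x in logits]
    total = sum(exps)
    probs = [e / total for e in exps]
    best = max(range(len(probs)), key=probs.__getitem__)
    return {"label": id2label[best], "score": probs[best]}


example_logits = [2.1, 0.3, -0.5, 0.4, -0.2, -1.0]  # hypothetical values
result = postprocess(example_logits, id2label)
print(result)
```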