# Simple Training with the 🤗 Transformers Trainer


cf. Lewis Tunstall & Hugging Face 🤗 video on [Simple Training with the 🤗 Transformers Trainer](https://www.youtube.com/watch?v=u--UVvH-LIQ&t=132s)

from huggingface_hub import whoami

whoami()

from huggingface_hub import HfFolder

HfFolder().get_token()

from huggingface_hub import create_repo

# no token argument needed: create_repo falls back to the token saved at login
create_repo("test-minilm-finetuned-emotion")

from huggingface_hub import notebook_login

notebook_login()

----

## Prepare for sharing a new model

Open up an SSH shell, and log in to your VM instance.

Activate your Python virtual environment.

Using the `huggingface-cli` command-line utility, log in to Hugging Face and enter your write token.

    # log in to Hugging Face, and enter your user access token with the write role
    huggingface-cli login

    # create a new Hugging Face repo for this model
    huggingface-cli repo create test-minilm-finetuned-emotion

## Dataset: EDA & preparation

In [1]:
from datasets import load_dataset

emotion_dataset = load_dataset("SetFit/emotion")
emotion_dataset


DatasetDict({
    train: Dataset({
        features: ['text', 'label', 'label_text'],
        num_rows: 16000
    })
    test: Dataset({
        features: ['text', 'label', 'label_text'],
        num_rows: 2000
    })
    validation: Dataset({
        features: ['text', 'label', 'label_text'],
        num_rows: 2000
    })
})

What do the dataset items look like?

In [2]:
emotion_dataset["train"][0]

{'text': 'i didnt feel humiliated', 'label': 0, 'label_text': 'sadness'}

Sometimes it is easier to inspect a dataset as a pandas DataFrame.

In [3]:
emotion_df = emotion_dataset["train"].to_pandas()
emotion_df.head()

|   | text | label | label_text |
|---|------|-------|------------|
| 0 | i didnt feel humiliated | 0 | sadness |
| 1 | i can go from feeling so hopeless to so damned... | 0 | sadness |
| 2 | im grabbing a minute to post i feel greedy wrong | 3 | anger |
| 3 | i am ever feeling nostalgic about the fireplac... | 2 | love |
| 4 | i am feeling grouchy | 3 | anger |


----

In [4]:
type(emotion_dataset["train"])

datasets.arrow_dataset.Dataset

In [5]:
features = emotion_dataset["train"].features
features

{'text': Value(dtype='string', id=None),
 'label': Value(dtype='int64', id=None),
 'label_text': Value(dtype='string', id=None)}

----

In [None]:
import pandas as pd

In [None]:
kvs = pd.unique(emotion_df[['label', 'label_text']].values.ravel('C'))

In [None]:
id2label = dict(zip(kvs[0::2], kvs[1::2]))

In [None]:
id2label

In [None]:
label2id = {k:v for v,k in id2label.items()}
label2id

In [None]:
n_classes = len(id2label)
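
The same two mappings can also be built without relying on the interleaved order of a flattened array. The following is a minimal sketch, assuming rows shaped like the dataset items shown earlier; the three sample rows are only an illustrative subset, so the resulting dictionaries cover just two of the classes.

```python
# Toy rows shaped like the items of emotion_dataset["train"];
# an illustrative subset, not the full dataset.
rows = [
    {"text": "i didnt feel humiliated", "label": 0, "label_text": "sadness"},
    {"text": "i am feeling grouchy", "label": 3, "label_text": "anger"},
    {"text": "im grabbing a minute to post i feel greedy wrong", "label": 3, "label_text": "anger"},
]

# Map each integer class id to its name, sorted by id.
id2label = dict(sorted({row["label"]: row["label_text"] for row in rows}.items()))

# Invert the mapping to go from name back to id.
label2id = {name: idx for idx, name in id2label.items()}

print(id2label)
print(label2id)
```

Pairing each id with its name row by row keeps the mapping correct even if the two columns were ever to share a value.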

----

In [None]:
class2prob = emotion_df["label"].value_counts(normalize=True).sort_index()
print(class2prob)

In [None]:
for i,v in class2prob.items():
    print(f"{id2label[i]:<10} {v:.5F}")

Notice that class 5 (`surprise`) is poorly represented in the dataset.

How do we deal with this?

## Tokenization

We will fine-tune [microsoft/MiniLM-L12-H384-uncased](https://huggingface.co/microsoft/MiniLM-L12-H384-uncased?text=I+like+you.+I+love+you), available on the Hugging Face Hub. From the abstract of the MiniLM paper:

> Pre-trained language models (e.g., BERT (Devlin et al., 2018) and its variants) have achieved remarkable success in varieties of NLP tasks. However, these models usually consist of hundreds of millions of parameters which brings challenges for fine-tuning and online serving in real-life applications due to latency and capacity constraints. In this work, we present a simple and effective approach to compress large Transformer (Vaswani et al., 2017) based pre-trained models, termed as deep self-attention distillation. The small model (student) is trained by deeply mimicking the self-attention module, which plays a vital role in Transformer networks, of the large model (teacher). Specifically, we propose distilling the self-attention module of the last Transformer layer of the teacher, which is effective and flexible for the student. Furthermore, we introduce the scaled dot-product between values in the self-attention module as the new deep self-attention knowledge, in addition to the attention distributions (i.e., the scaled dot-product of queries and keys) that have been used in existing works. Moreover, we show that introducing a teacher assistant (Mirzadeh et al., 2019) also helps the distillation of large pre-trained Transformer models. Experimental results demonstrate that our monolingual model outperforms state-of-the-art baselines in different parameter size of student models. In particular, it retains more than 99% accuracy on SQuAD 2.0 and several GLUE benchmark tasks using 50% of the Transformer parameters and computations of the teacher model. We also obtain competitive results in applying deep self-attention distillation to multilingual pre-trained models.

Read the [MiniLM: Deep Self-Attention Distillation for Task-Agnostic Compression of Pre-trained Transformers](https://arxiv.org/abs/2002.10957) paper.

In [None]:
from transformers import AutoTokenizer

checkpoint = "microsoft/MiniLM-L12-H384-uncased"
tokenizer = AutoTokenizer.from_pretrained(checkpoint)

`input_ids`

> Index ids of the word embeddings corresponding to the given word.

`token_type_ids`

> ... a binary mask identifying the two types of sequence in the model ...
> ... the “context” used for the question, has all its tokens represented by a 0, whereas the second sequence, corresponding to the “question”, has all its tokens represented by a 1...
> Some models, like XLNetModel use an additional token represented by a 2.

`attention_mask`

> Attention masks are tensors with the exact same shape as the input IDs tensor, filled with 0s and 1s: 1s indicate the corresponding tokens should be attended to, and 0s indicate the corresponding tokens should not be attended to (i.e., they should be ignored by the attention layers of the model).
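
To make the relationship between `input_ids` and `attention_mask` concrete, here is a rough sketch of what padding a batch involves. The helper `pad_batch`, the token ids, and the pad id of 0 are all hypothetical, chosen for illustration; the real tokenizer handles this internally.

```python
def pad_batch(batch_ids, pad_id=0):
    """Right-pad token-id lists to equal length and build attention masks."""
    max_len = max(len(ids) for ids in batch_ids)
    input_ids, attention_mask = [], []
    for ids in batch_ids:
        n_pad = max_len - len(ids)
        # real tokens keep their ids; padding positions get pad_id
        input_ids.append(ids + [pad_id] * n_pad)
        # 1 = attend to this position, 0 = ignore it (padding)
        attention_mask.append([1] * len(ids) + [0] * n_pad)
    return {"input_ids": input_ids, "attention_mask": attention_mask}


# Hypothetical token ids for two sentences of different lengths.
batch = pad_batch([[101, 7592, 102], [101, 7592, 2088, 999, 102]])
print(batch["input_ids"])
print(batch["attention_mask"])
```

The shorter sequence is extended with pad ids, and its mask marks exactly those added positions with 0.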

In [None]:
tokenizer(emotion_dataset["train"]["text"][:1])

In [None]:
def tokenize_text(examples):
    return tokenizer(examples["text"], truncation=True, max_length=512)

In [None]:
emotion_dataset = emotion_dataset.map(tokenize_text, batched=True)
emotion_dataset

From the API docs on Trainer:

> The Trainer class is optimized for 🤗 Transformers models and can have surprising behaviors when you use it on other models. When using it on your own model, make sure:
>
> * your model always return tuples or subclasses of ModelOutput.
> * your model can compute the loss if a labels argument is provided and that loss is returned as the first element of the tuple (if your model returns tuples)
> * your model can accept multiple label arguments (use the label_names in your TrainingArguments to indicate their name to the Trainer) but none of them should be named "label".

So we rename the `label` column of our dataset to `labels`, the argument name that the model's forward pass expects.

In [None]:
emotion_dataset = emotion_dataset.rename_column("label", "labels")
emotion_dataset

----

## Dealing with imbalanced classes

In [None]:
class_weights = (1 - (emotion_df["label"].value_counts().sort_index() / len(emotion_df))).values
print(class_weights)
print(type(class_weights))

In [None]:
import torch

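# use the GPU when one is available instead of assuming "cuda"
device = "cuda" if torch.cuda.is_available() else "cpu"
class_weights = torch.from_numpy(class_weights).float().to(device)
print(class_weights)
print(type(class_weights))
print(class_weights.dim())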

When training a classifier on a dataset that suffers from imbalanced classes, one way to deal with the situation is to up-sample the under-represented class(es) to offset the difference(s). However, Transformer models are good at memorizing patterns, so repeating the same examples might not be a good idea in our case.

What we can do is this: correct for the class imbalance with the loss function during training.

From the [Trainer](https://huggingface.co/docs/transformers/main_classes/trainer#trainer) documentation:

> To inject custom behavior you can subclass them and override the following methods: ...
> * compute_loss - Computes the loss on a batch of training inputs.


From PyTorch docs on [`torch.nn.CrossEntropyLoss`](https://pytorch.org/docs/stable/generated/torch.nn.CrossEntropyLoss.html):

> If provided, the optional argument weight should be a 1D Tensor assigning weight to each of the classes. This is particularly useful when you have an unbalanced training set.
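
To see what the `weight` argument does, the sketch below re-implements the weighted loss in plain Python, following the documented behaviour for the default `"mean"` reduction. The function `weighted_cross_entropy`, the class counts, and the logits are all made up for illustration; only the weighting rule (1 minus the class frequency) is taken from the `class_weights` cell above.

```python
import math


def weighted_cross_entropy(logits, labels, weights):
    """Weighted mean cross-entropy over a batch.

    Each example's loss is scaled by the weight of its true class, and the
    total is divided by the sum of those weights ("mean" reduction).
    """
    total, norm = 0.0, 0.0
    for row, y in zip(logits, labels):
        log_z = math.log(sum(math.exp(v) for v in row))  # log of the softmax denominator
        nll = log_z - row[y]                             # -log softmax(row)[y]
        total += weights[y] * nll
        norm += weights[y]
    return total / norm


# Toy class counts (not the emotion dataset): class 2 is rare.
counts = [60, 30, 10]
weights = [1 - c / sum(counts) for c in counts]
print(weights)

# Hypothetical logits for two examples, both favouring class 0.
logits = [[2.0, 0.0, 0.0], [2.0, 0.0, 0.0]]
labels = [0, 2]  # the first is classified correctly, the second is a missed rare class

unweighted = weighted_cross_entropy(logits, labels, [1.0, 1.0, 1.0])
weighted = weighted_cross_entropy(logits, labels, weights)
print(unweighted, weighted)
```

With these toy numbers the weighted loss comes out higher than the unweighted one, because the mistake on the rare class counts for more.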

In [None]:
from torch import nn
from transformers import Trainer


class WeightedLossTrainer(Trainer):
    """Trainer that applies per-class weights in the cross-entropy loss."""

    # **kwargs absorbs extra arguments (e.g. num_items_in_batch) that newer
    # transformers releases pass to compute_loss
    def compute_loss(self, model, inputs, return_outputs=False, **kwargs):
        # feed inputs to model and extract logits
        outputs = model(**inputs)
        logits = outputs.get("logits")

        # extract labels; note the key is "labels", not "label"
        labels = inputs.get("labels")

        # define loss function with class weights, on the same device as the logits
        loss_func = nn.CrossEntropyLoss(weight=class_weights.to(logits.device))

        # compute loss
        loss = loss_func(logits, labels)

        return (loss, outputs) if return_outputs else loss

----

## Now bring it all together!

In [None]:
from transformers import AutoModelForSequenceClassification

model = AutoModelForSequenceClassification.from_pretrained(checkpoint,
                                                           num_labels=n_classes,
                                                           id2label=id2label,
                                                           label2id=label2id)

In [None]:
from sklearn.metrics import f1_score

def compute_metrics(pred):
    labels = pred.label_ids
    preds = pred.predictions.argmax(-1)
    f1 = f1_score(labels, preds, average="weighted")
    return {"f1": f1}
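
With `average="weighted"`, the F1 score is computed per class and then averaged, each class weighted by its number of true instances. The sketch below spells that out by hand; `weighted_f1` is a hypothetical helper, not part of scikit-learn, and the labels and predictions are made up.

```python
def weighted_f1(labels, preds):
    """Support-weighted average of per-class F1 scores."""
    total = 0.0
    for c in sorted(set(labels)):
        tp = sum(1 for y, p in zip(labels, preds) if y == c and p == c)
        fp = sum(1 for y, p in zip(labels, preds) if y != c and p == c)
        fn = sum(1 for y, p in zip(labels, preds) if y == c and p != c)
        f1 = 2 * tp / (2 * tp + fp + fn)   # harmonic mean of precision and recall
        support = labels.count(c)          # number of true instances of class c
        total += f1 * support
    return total / len(labels)


# Made-up labels and predictions for a two-class problem.
toy_labels = [0, 0, 0, 1]
toy_preds = [0, 0, 1, 1]
score = weighted_f1(toy_labels, toy_preds)
print(score)
```

Because the average is weighted by support, frequent classes dominate the score, which is worth keeping in mind when a class such as `surprise` is rare.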

In [None]:
from transformers import TrainingArguments

batch_size = 64

logging_steps = len(emotion_dataset["train"]) // batch_size

output_dir = "test-minilm-finetuned-emotion"

training_args = TrainingArguments(
    output_dir=output_dir,
    num_train_epochs=5,
    learning_rate=2e-5,
    per_device_train_batch_size=batch_size,
    per_device_eval_batch_size=batch_size,
    weight_decay=0.01,
    evaluation_strategy="epoch",  # named eval_strategy in newer transformers releases
    logging_steps=logging_steps,
    fp16=True,
    push_to_hub=True
)
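
As a quick sanity check on these settings, the arithmetic below works out how many optimizer steps training takes. It is a sketch that assumes a single device and no gradient accumulation; the row count comes from the `DatasetDict` shown earlier.

```python
import math

n_train = 16000        # rows in the train split
batch_size = 64
num_train_epochs = 5

# Assumes a single device and no gradient accumulation.
steps_per_epoch = math.ceil(n_train / batch_size)
total_steps = steps_per_epoch * num_train_epochs

print(steps_per_epoch, total_steps)
```

Since `logging_steps` is set to `len(emotion_dataset["train"]) // batch_size`, the training loss is logged once per epoch under these assumptions.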

In [None]:
trainer = WeightedLossTrainer(
    model=model,
    args=training_args,
    compute_metrics=compute_metrics,
    train_dataset=emotion_dataset["train"],
    eval_dataset=emotion_dataset["validation"],
    tokenizer=tokenizer
)

You may see a log message stating `Using cuda_amp half precision backend`; it appears because `fp16=True` enables mixed-precision training.

From [NVIDIA, Deep Learning Performance Documentation, Train with Mixed Precision](https://docs.nvidia.com/deeplearning/performance/mixed-precision-training/index.html):

> There are numerous benefits to using numerical formats with lower precision than 32-bit floating point. First, they require less memory, enabling the training and deployment of larger neural networks. Second, they require less memory bandwidth which speeds up data transfer operations. Third, math operations run much faster in reduced precision, especially on GPUs with Tensor Core support for that precision. Mixed precision training achieves all these benefits while ensuring that no task-specific accuracy is lost compared to full precision training. It does so by identifying the steps that require full precision and using 32-bit floating point for only those steps while using 16-bit floating point everywhere else.

cf. [CUDA Automatic Mixed Precision Examples](https://pytorch.org/docs/stable/notes/amp_examples.html#cuda-automatic-mixed-precision-examples)
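
One reason some steps must stay in 32-bit is that 16-bit floats cannot represent very small values. The sketch below uses the standard library's half-precision format code to show a tiny value vanishing in fp16, and surviving when it is scaled up first, which is the idea behind the loss scaling described in the PyTorch notes linked above. The gradient value and scale factor are hypothetical.

```python
import struct


def to_fp16(x):
    """Round a Python float to IEEE 754 half precision and back."""
    return struct.unpack("e", struct.pack("e", x))[0]


tiny_grad = 1e-8   # hypothetical small gradient value
scale = 1024.0     # hypothetical loss-scaling factor

stored_directly = to_fp16(tiny_grad)
stored_scaled = to_fp16(tiny_grad * scale)

print(stored_directly)        # the value after a round trip through fp16
print(stored_scaled / scale)  # scaled before storing, unscaled in full precision
```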

In [None]:
trainer.train()

# upload the final model; the token saved by `huggingface-cli login` is used automatically
trainer.push_to_hub()

----

## Try out our newly-trained model

In [None]:
from transformers import pipeline

pipl = pipeline("text-classification", 
                model="buruzaemon/test-minilm-finetuned-emotion")

In [None]:
pipl("I am sorry to see you go")
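
For a single-label classifier, the pipeline's result boils down to a softmax over the model's logits followed by a lookup of the best class name. The sketch below illustrates that step only. The logits are hypothetical, not real model output, and the ids for `joy` and `fear` are an assumption based on the dataset's label encoding, since they do not appear in the rows shown earlier.

```python
import math

# Class names by id; 1 (joy) and 4 (fear) are assumed, see the note above.
id2label = {0: "sadness", 1: "joy", 2: "love", 3: "anger", 4: "fear", 5: "surprise"}


def top_prediction(logits):
    """Softmax over the logits, then pick the highest-scoring class."""
    m = max(logits)                              # subtract the max for numerical stability
    exps = [math.exp(v - m) for v in logits]
    probs = [e / sum(exps) for e in exps]
    best = max(range(len(probs)), key=probs.__getitem__)
    return {"label": id2label[best], "score": probs[best]}


# Hypothetical logits, not real model output.
result = top_prediction([3.1, 0.2, -0.5, 0.4, 0.1, -1.0])
print(result)
```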