# Huggingface Trainer

## Learning Objectives
1. Learn how to use the Huggingface Trainer.

**Contents**
1. Preparing the text classification data and model
2. A look at TrainingArguments
3. Using `compute_metrics`
4. Initializing a Trainer and training

In [None]:
!pip install transformers

In [None]:
!pip install datasets

In [None]:
!pip install evaluate

In [4]:
import numpy as np
import evaluate
from datasets import load_dataset
from transformers import (
    AutoTokenizer,
    AutoModelForSequenceClassification,
    TrainingArguments,
    Trainer
)

## 1. Preparing the Text Classification Data and Model

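```
💡 What is Trainer❓

Trainer is an API provided by Huggingface that covers everything from model training to evaluation,
so that you do not have to write a train() function every time you train a model.

To see how it is used, we train a simple text classification model with Trainer.
```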

Reference: https://huggingface.co/docs/transformers/training

```
First, we load the Yelp data for the text classification task.
It can be loaded easily with the datasets library provided by Huggingface.

The Yelp dataset pairs each customer review with a star rating.
The ratings (labels) are 1 star (0), 2 stars (1), 3 stars (2), 4 stars (3), and 5 stars (4),
and the model performs a simple multi-class classification task: predict the rating from the review text.
```
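As a small illustration of the label encoding described above, the sketch below converts between class ids and star ratings. The helper names `label_to_stars` and `stars_to_label` are hypothetical and are not part of the `datasets` library.

```python
NUM_LABELS = 5  # star ratings 1..5 are stored as class ids 0..4


def label_to_stars(label: int) -> int:
    """Convert a class id (0-4) to a star rating (1-5)."""
    if not 0 <= label < NUM_LABELS:
        raise ValueError(f"label must be in [0, {NUM_LABELS - 1}], got {label}")
    return label + 1


def stars_to_label(stars: int) -> int:
    """Convert a star rating (1-5) back to a class id (0-4)."""
    if not 1 <= stars <= NUM_LABELS:
        raise ValueError(f"stars must be in [1, {NUM_LABELS}], got {stars}")
    return stars - 1


print(label_to_stars(0))  # a label of 0 corresponds to a 1-star review
print(stars_to_label(5))  # a 5-star review is stored as label 4
```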

Reference: https://huggingface.co/datasets/yelp_review_full

In [5]:
dataset = load_dataset("yelp_review_full")

Downloading and preparing dataset yelp_review_full/yelp_review_full to /root/.cache/huggingface/datasets/yelp_review_full/yelp_review_full/1.0.0/e8e18e19d7be9e75642fc66b198abadb116f73599ec89a69ba5dd8d1e57ba0bf...


Dataset yelp_review_full downloaded and prepared to /root/.cache/huggingface/datasets/yelp_review_full/yelp_review_full/1.0.0/e8e18e19d7be9e75642fc66b198abadb116f73599ec89a69ba5dd8d1e57ba0bf. Subsequent calls will reuse this data.


> The data consists of a training (train) split and an inference (test) split.

In [6]:
dataset.keys()

dict_keys(['train', 'test'])

In [7]:
# Example record
dataset["train"][100]

{'label': 0,
 'text': 'My expectations for McDonalds are t rarely high. But for one to still fail so spectacularly...that takes something special!\\nThe cashier took my friends\'s order, then promptly ignored me. I had to force myself in front of a cashier who opened his register to wait on the person BEHIND me. I waited over five minutes for a gigantic order that included precisely one kid\'s meal. After watching two people who ordered after me be handed their food, I asked where mine was. The manager started yelling at the cashiers for \\"serving off their orders\\" when they didn\'t have their food. But neither cashier was anywhere near those controls, and the manager was the one serving food to customers and clearing the boards.\\nThe manager was rude when giving me my order. She didn\'t make sure that I had everything ON MY RECEIPT, and never even had the decency to apologize that I felt I was getting poor service.\\nI\'ve eaten at various McDonalds restaurants for over 30 years. 

> For text classification we use BERT, one of the most representative models for the task.

In [8]:
tokenizer = AutoTokenizer.from_pretrained("bert-base-cased")

> The `padding` and `truncation` parameters make batched training possible.
>
> That is, every text, whether long or short, is padded or truncated to BERT's maximum sequence length of 512 tokens.
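To make the effect of padding and truncation concrete, here is a minimal pure-Python sketch. It is not the tokenizer's implementation: `pad_or_truncate` is a hypothetical helper, the token ids are illustrative, and real tokenizers also keep the trailing special token when truncating, which this sketch ignores.

```python
from typing import Dict, List

PAD_ID = 0  # padding id assumed for this sketch


def pad_or_truncate(token_ids: List[int], max_length: int,
                    pad_id: int = PAD_ID) -> Dict[str, List[int]]:
    """Force a token id sequence to exactly max_length entries."""
    kept = token_ids[:max_length]      # truncation: drop anything past max_length
    n_pad = max_length - len(kept)     # padding: number of slots left to fill
    return {
        "input_ids": kept + [pad_id] * n_pad,
        # 1 marks a real token, 0 marks padding the model should ignore
        "attention_mask": [1] * len(kept) + [0] * n_pad,
    }


short = pad_or_truncate([101, 7592, 102], max_length=5)
long = pad_or_truncate([101, 1, 2, 3, 4, 5, 102], max_length=5)
print(short)
print(long)
```

Because every example ends up with the same length, the examples can be stacked into a single rectangular batch.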

In [9]:
def tokenize_function(examples):
    return tokenizer(examples["text"], padding="max_length", truncation=True)

> Datasets loaded through Huggingface provide a built-in `map` method: pass it a preprocessing function and it is applied to every example.
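The sketch below illustrates, under simplifying assumptions, what a batched `map` hands to the preprocessing function: a dict of columns holding several rows at once. `batched_map` and `count_words` are hypothetical stand-ins written for this example and do not reproduce the `datasets` implementation.

```python
from typing import Callable, Dict

Batch = Dict[str, list]


def batched_map(columns: Batch, fn: Callable[[Batch], Batch],
                batch_size: int = 1000) -> Batch:
    """Apply fn to consecutive row slices and attach the columns it returns."""
    n_rows = len(next(iter(columns.values())))
    new_columns: Batch = {}
    for start in range(0, n_rows, batch_size):
        # Each call sees a dict of lists, not a single example
        batch = {name: values[start:start + batch_size]
                 for name, values in columns.items()}
        for name, values in fn(batch).items():
            new_columns.setdefault(name, []).extend(values)
    return {**columns, **new_columns}


batch_sizes = []  # records how many rows each call received


def count_words(batch: Batch) -> Batch:
    batch_sizes.append(len(batch["text"]))
    return {"n_words": [len(text.split()) for text in batch["text"]]}


data = {"text": ["good food", "bad", "would come back again"], "label": [4, 0, 3]}
result = batched_map(data, count_words, batch_size=2)
print(result["n_words"])
print(batch_sizes)
```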

In [10]:
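tokenized_datasets = dataset.map(tokenize_function, batched=True)   # batched=True passes examples to the function in batches (1,000 by default) rather than one at a time; parallel processing is controlled separately by num_proc.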

> Since training a strong model is not the goal of this exercise, we test with only 1,000 samples.

In [11]:
small_train_dataset = tokenized_datasets["train"].shuffle(seed=42).select(range(1000))
small_eval_dataset = tokenized_datasets["test"].shuffle(seed=42).select(range(1000))    # Evaluating on the test set during training is bad practice; here it serves only as a sanity check.

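> `~ForSequenceClassification` models place an additional linear layer (the classifier) on top of the encoder. This linear layer has no pre-trained weights (see the warning message). The `num_labels` parameter must be specified so that the classifier predicts over the correct number of classes.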
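As a rough illustration of what the classifier head does, the NumPy sketch below maps a pooled encoder output to one logit per class. The random inputs, the initialization scale of 0.02, and the variable names are assumptions made for this example; the real model also applies dropout before the linear layer.

```python
import numpy as np

hidden_size = 768  # hidden size of BERT-base models
num_labels = 5     # one output per star rating

rng = np.random.default_rng(0)

# Stand-in for the encoder's pooled output for a batch of 2 reviews
pooled_output = rng.normal(size=(2, hidden_size))

# Freshly initialized classifier: there are no pre-trained weights to load for it
W = rng.normal(scale=0.02, size=(hidden_size, num_labels))
b = np.zeros(num_labels)

logits = pooled_output @ W + b
print(logits.shape)  # one row per review, one column per class
```

Changing `num_labels` changes only the shape of `W` and `b`, which is why the checkpoint's encoder weights can be reused while the head is trained from scratch.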

In [12]:
model = AutoModelForSequenceClassification.from_pretrained("bert-base-cased", num_labels=5)

Some weights of the model checkpoint at bert-base-cased were not used when initializing BertForSequenceClassification: ['cls.predictions.decoder.weight', 'cls.predictions.transform.LayerNorm.bias', 'cls.seq_relationship.bias', 'cls.predictions.transform.LayerNorm.weight', 'cls.predictions.transform.dense.bias', 'cls.predictions.transform.dense.weight', 'cls.predictions.bias', 'cls.seq_relationship.weight']
- This IS expected if you are initializing BertForSequenceClassification from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing BertForSequenceClassification from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
Some weights of BertForSequenceClassification were not initialized from the model checkpoint at b

## 2. A Look at TrainingArguments

```
💡 What is TrainingArguments❓

It is the collection of parameters needed for training.
It lets you specify the optimizer type, learning rate, number of epochs, scheduler, whether to use half precision, and more.
```

Reference: https://huggingface.co/docs/transformers/v4.19.2/en/main_classes/trainer#transformers.TrainingArguments

In [13]:
training_args = TrainingArguments(
    output_dir='./results',          # output directory
    num_train_epochs=1,              # total number of training epochs
    per_device_train_batch_size=8,   # batch size per device during training
    per_device_eval_batch_size=8,    # batch size per device during evaluation
    warmup_steps=100,                # number of warmup steps for the learning rate scheduler
    weight_decay=0.01,               # strength of weight decay
    logging_dir='./logs',            # directory for storing logs
    logging_steps=100,               # log every 100 steps
    do_train=True,                   # perform training
    do_eval=True,                    # perform evaluation
    evaluation_strategy="epoch",     # evaluate after each epoch (renamed eval_strategy in newer transformers releases)
    gradient_accumulation_steps=1,   # number of batches whose gradients are accumulated before each optimizer update
    fp16=False,                      # whether to use mixed precision
    fp16_opt_level="O1",             # Apex AMP optimization level (letter O, not zero); only relevant when fp16 uses Apex
    seed=42                          # seed for experiment reproducibility
)
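To connect these arguments to the training log, the sketch below estimates the number of optimizer steps and the learning rate under linear warmup followed by linear decay. It assumes a single device, a base learning rate of 5e-5, and a linear schedule (believed to be the Trainer defaults, but stated here as assumptions); both helper functions are hypothetical and only approximate what the library computes.

```python
import math


def total_optimization_steps(num_examples: int, per_device_batch_size: int,
                             grad_accum_steps: int, num_epochs: int,
                             num_devices: int = 1) -> int:
    """Estimate how many optimizer updates a run performs."""
    batches_per_epoch = math.ceil(num_examples / (per_device_batch_size * num_devices))
    updates_per_epoch = max(batches_per_epoch // grad_accum_steps, 1)
    return updates_per_epoch * num_epochs


def linear_warmup_lr(step: int, base_lr: float, warmup_steps: int,
                     total_steps: int) -> float:
    """Learning rate with linear warmup, then linear decay to zero."""
    if step < warmup_steps:
        return base_lr * step / max(1, warmup_steps)
    return base_lr * max(0.0, (total_steps - step) / max(1, total_steps - warmup_steps))


steps = total_optimization_steps(num_examples=1000, per_device_batch_size=8,
                                 grad_accum_steps=1, num_epochs=1)
print(steps)

for step in (0, 50, 100, 125):
    print(step, linear_warmup_lr(step, base_lr=5e-5, warmup_steps=100, total_steps=steps))
```

With 1,000 examples and a batch size of 8, `warmup_steps=100` means that 100 of the 125 optimizer steps are spent warming up, so the peak learning rate is reached only near the end of this short run.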

## 3. Using `compute_metrics`

```
💡 What is compute_metrics❓

Trainer does not automatically evaluate model performance during training.
Passing a compute_metrics function is what lets Trainer compute and report metrics during training.

You can define the metric function yourself.
If evaluation_strategy is set in TrainingArguments,
you can monitor the evaluation metrics while training runs.
```

> 💡 Huggingface provides the evaluate library, from which you can load many metrics such as accuracy, precision, recall, and F1.

In [14]:
metric = evaluate.load("accuracy")

> You can implement the metric inside `compute_metrics` yourself, but simple metrics such as accuracy can just be loaded.

In [15]:
def compute_metrics(eval_pred):
    logits, labels = eval_pred
    predictions = np.argmax(logits, axis=-1)
    return metric.compute(predictions=predictions, references=labels)
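For comparison, here is a hand-written version of the same metric using only NumPy. It follows the same contract (a `(logits, labels)` pair in, a dict of named metrics out); the function name `compute_accuracy` and the toy inputs are made up for this sketch.

```python
import numpy as np


def compute_accuracy(eval_pred):
    """Take (logits, labels) and return a dict of named metrics."""
    logits, labels = eval_pred
    predictions = np.argmax(logits, axis=-1)  # most likely class for each row
    return {"accuracy": float((predictions == labels).mean())}


toy_logits = np.array([
    [2.0, 0.1, 0.0, 0.0, 0.0],  # predicts class 0
    [0.0, 0.3, 1.5, 0.0, 0.0],  # predicts class 2
    [0.0, 0.0, 0.0, 0.2, 0.9],  # predicts class 4
    [0.5, 0.4, 0.0, 0.0, 0.0],  # predicts class 0
])
toy_labels = np.array([0, 2, 3, 0])

result = compute_accuracy((toy_logits, toy_labels))
print(result)  # 3 of the 4 predictions match the labels
```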

## 4. Initializing a Trainer and Training

In [16]:
trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=small_train_dataset,
    eval_dataset=small_eval_dataset,
    compute_metrics=compute_metrics,
)

In [17]:
# Train
trainer.train()

The following columns in the training set don't have a corresponding argument in `BertForSequenceClassification.forward` and have been ignored: text. If text are not expected by `BertForSequenceClassification.forward`,  you can safely ignore this message.
***** Running training *****
  Num examples = 1000
  Num Epochs = 1
  Instantaneous batch size per device = 8
  Total train batch size (w. parallel, distributed & accumulation) = 8
  Gradient Accumulation steps = 1
  Total optimization steps = 125
  Number of trainable parameters = 108314117


Epoch,Training Loss,Validation Loss,Accuracy
1,1.505,1.234964,0.441


The following columns in the evaluation set don't have a corresponding argument in `BertForSequenceClassification.forward` and have been ignored: text. If text are not expected by `BertForSequenceClassification.forward`,  you can safely ignore this message.
***** Running Evaluation *****
  Num examples = 1000
  Batch size = 8


Training completed. Do not forget to share your model on huggingface.co/models =)




TrainOutput(global_step=125, training_loss=1.4551904754638672, metrics={'train_runtime': 133.4507, 'train_samples_per_second': 7.493, 'train_steps_per_second': 0.937, 'total_flos': 263118142464000.0, 'train_loss': 1.4551904754638672, 'epoch': 1.0})

In [20]:
# Inference
res = trainer.predict(small_eval_dataset)
print(res)

The following columns in the test set don't have a corresponding argument in `BertForSequenceClassification.forward` and have been ignored: text. If text are not expected by `BertForSequenceClassification.forward`,  you can safely ignore this message.
***** Running Prediction *****
  Num examples = 1000
  Batch size = 8


PredictionOutput(predictions=array([[ 0.40580794,  1.2029712 ,  0.7478649 ,  0.2475175 , -0.75517446],
       [-0.0555532 ,  1.0840354 ,  1.0694494 ,  0.52948904, -0.6456897 ],
       [ 1.9604565 ,  0.61424345, -0.33459976, -0.545025  , -0.46351534],
       ...,
       [-1.3943344 , -0.39032632,  0.74318975,  1.1755632 ,  0.7792639 ],
       [ 0.02216605,  1.1025462 ,  0.9173853 ,  0.58567345, -0.5156993 ],
       [ 1.6816765 ,  1.0406543 ,  0.03713107, -0.39047658, -0.92845494]],
      dtype=float32), label_ids=array([2, 4, 1, 4, 3, 4, 2, 3, 2, 3, 0, 0, 3, 2, 2, 1, 3, 1, 2, 2, 1, 2,
       3, 1, 1, 3, 4, 0, 0, 2, 2, 2, 1, 3, 4, 0, 0, 1, 3, 2, 0, 2, 0, 0,
       3, 0, 3, 2, 3, 0, 1, 1, 3, 3, 4, 4, 1, 4, 1, 3, 1, 0, 0, 1, 4, 1,
       4, 3, 2, 4, 1, 0, 3, 3, 4, 1, 2, 1, 0, 4, 4, 4, 2, 3, 3, 1, 4, 0,
       4, 2, 3, 0, 0, 0, 3, 4, 0, 0, 1, 4, 4, 0, 0, 1, 1, 0, 4, 2, 2, 1,
       1, 4, 0, 4, 0, 3, 2, 0, 4, 4, 4, 2, 0, 0, 0, 1, 3, 0, 2, 0, 3, 2,
       2, 2, 0, 3, 4, 3, 0, 1, 0, 1, 0, 0, 4

In [24]:
print(res.predictions.shape)

preds = np.argmax(res.predictions, axis=1)
print(preds)

(1000, 5)
[1 1 0 3 3 3 1 3 3 3 0 0 1 1 1 1 1 1 2 1 0 1 2 1 1 4 4 0 0 1 1 1 0 0 4 3 0
 0 0 3 1 2 0 1 0 0 1 1 3 0 1 0 3 3 3 3 0 3 1 3 0 0 0 0 4 0 4 4 1 4 1 0 3 3
 3 1 1 0 0 3 3 3 1 1 4 3 4 1 3 1 3 0 0 0 2 1 0 0 2 4 3 3 0 0 0 0 3 1 1 0 1
 3 0 4 0 2 1 0 3 3 0 0 0 0 0 0 3 0 0 0 1 0 1 1 0 3 3 0 0 1 0 0 0 0 1 1 2 1
 3 0 3 4 0 1 0 3 0 3 1 0 1 0 0 0 0 0 0 0 3 1 0 4 1 3 0 0 0 1 0 3 3 0 0 3 1
 0 1 0 3 3 0 0 0 0 0 0 3 0 0 0 0 0 3 4 2 1 0 1 0 3 3 1 4 2 0 0 0 0 0 1 0 0
 3 0 0 1 0 0 3 0 1 0 3 2 3 1 1 4 3 3 1 0 3 1 0 0 3 1 3 0 0 3 0 3 0 3 1 3 3
 1 4 0 1 0 4 4 3 0 0 0 1 0 3 0 3 0 1 0 0 0 3 1 0 3 3 0 0 3 2 2 3 4 3 1 0 0
 0 3 1 3 3 2 0 0 0 3 3 1 0 0 0 3 0 4 0 0 0 0 3 3 0 3 2 0 3 1 0 0 1 3 0 0 3
 4 0 1 0 0 1 1 1 0 1 1 1 3 0 3 4 0 1 4 3 3 0 1 0 2 1 1 4 1 3 1 0 3 4 1 0 4
 0 3 2 0 1 0 0 0 0 2 2 1 1 0 2 4 1 1 3 3 0 0 1 0 4 1 3 0 1 0 0 3 3 0 3 3 1
 1 3 1 0 3 1 3 0 0 3 1 0 0 3 3 1 0 0 4 1 1 0 0 0 0 4 1 1 3 1 1 1 1 0 4 3 2
 1 1 4 3 1 0 0 0 1 3 0 3 1 2 0 0 0 3 1 4 1 1 3 4 0 3 1 1 0 0 0 1 0 2 3 3 1
 4 0 1 0 0 1 3 

In [25]:
print(preds == res.label_ids)

[False False False False  True False False  True False  True  True  True
 False False False  True False  True  True False False False False  True
  True False  True  True  True False False False False False  True False
  True False False False False  True  True False False  True False False
  True  True  True False  True  True False False False False  True  True
 False  True  True False  True False  True False False  True  True  True
  True  True False  True False False  True False False False False False
 False False  True False False False  True  True  True  True False False
  True  True False  True False False  True False False  True False False
 False False  True False  True  True  True False False  True False False
 False False  True  True  True False  True  True False  True False False
 False False  True  True False False  True  True  True False  True  True
 False False False  True  True  True  True  True  True False False  True
 False  True  True False  True  True False  True Fa

In [26]:
num_correct = (preds == res.label_ids).sum()
print(f"accuracy: {num_correct}/{len(small_eval_dataset)}")

accuracy: 441/1000
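A single accuracy number hides which ratings get confused with which. One possible next step, sketched here with NumPy, is a confusion matrix. The `confusion_matrix` helper is hypothetical, and the toy arrays are just the first eight labels and predictions shown in the outputs above, so the numbers say nothing about the full evaluation set.

```python
import numpy as np


def confusion_matrix(labels: np.ndarray, preds: np.ndarray,
                     num_labels: int = 5) -> np.ndarray:
    """Count examples by (true label, predicted label); rows are true labels."""
    matrix = np.zeros((num_labels, num_labels), dtype=int)
    np.add.at(matrix, (labels, preds), 1)  # handles repeated index pairs correctly
    return matrix


# First eight entries of res.label_ids and preds from the cells above
toy_labels = np.array([2, 4, 1, 4, 3, 4, 2, 3])
toy_preds = np.array([1, 1, 0, 3, 3, 3, 1, 3])

cm = confusion_matrix(toy_labels, toy_preds)
print(cm)

# The diagonal holds the correct predictions, so accuracy is trace / total
print(f"accuracy: {np.trace(cm)}/{cm.sum()}")
```

On the real outputs the same function could be called as `confusion_matrix(res.label_ids, preds)`.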
