# HuggingFace Custom Project

The [KLUE benchmark](https://klue-benchmark.com/) is the Korean counterpart of the GLUE benchmark.  
- Like GLUE, it is a dataset benchmark built to advance Korean natural language understanding.
- It consists of 8 datasets in total.

This project does not use the KLUE datasets. Instead, it uses the KLUE model (klue/bert-base) to tackle the NSMC (Naver Sentiment Movie Corpus) task.

See the links below for details on the model and the data.
- [KLUE/Bert-base](https://huggingface.co/klue/bert-base)
- [NSMC](https://github.com/e9t/nsmc)

In [1]:
import os
import datasets
import numpy as np
import transformers
import tensorflow as tf
import tensorflow_datasets as tfds
from datasets import Dataset, load_dataset, load_metric
from transformers import Trainer, TrainingArguments
from transformers import AutoTokenizer, AutoModelForSequenceClassification
from transformers import BertTokenizer, BertForSequenceClassification, TFBertForSequenceClassification, AdamWeightDecay

print(tf.__version__)
print(np.__version__)
print(transformers.__version__)
print(datasets.__version__)

2.6.0
1.21.4
4.11.3
1.14.0


## Analyzing the NSMC data and building a Huggingface dataset
The dataset can be downloaded from GitHub or loaded from [Huggingface datasets](https://huggingface.co/datasets).

### Loading the NSMC dataset

In [2]:
huggingface_nsmc_dataset = load_dataset('nsmc')
print(huggingface_nsmc_dataset)

Using custom data configuration default
Reusing dataset nsmc (/aiffel/.cache/huggingface/datasets/nsmc/default/1.1.0/bfd4729bf1a67114e5267e6916b9e4807010aeb238e4a3c2b95fbfa3a014b5f3)


  0%|          | 0/2 [00:00<?, ?it/s]

DatasetDict({
    train: Dataset({
        features: ['id', 'document', 'label'],
        num_rows: 150000
    })
    test: Dataset({
        features: ['id', 'document', 'label'],
        num_rows: 50000
    })
})


#### Printing train dataset samples

In [3]:
train = huggingface_nsmc_dataset['train']
cols = train.column_names
cols

['id', 'document', 'label']

In [4]:
for i in range(5):
    for col in cols:
        print(col, ":", train[col][i])
    print('\n')

id : 9976970
document : 아 더빙.. 진짜 짜증나네요 목소리
label : 0


id : 3819312
document : 흠...포스터보고 초딩영화줄....오버연기조차 가볍지 않구나
label : 1


id : 10265843
document : 너무재밓었다그래서보는것을추천한다
label : 0


id : 9045019
document : 교도소 이야기구먼 ..솔직히 재미는 없다..평점 조정
label : 0


id : 6483659
document : 사이몬페그의 익살스런 연기가 돋보였던 영화!스파이더맨에서 늙어보이기만 했던 커스틴 던스트가 너무나도 이뻐보였다
label : 1




#### Printing test dataset samples

In [5]:
test = huggingface_nsmc_dataset['test']
cols = test.column_names
cols

['id', 'document', 'label']

In [6]:
for i in range(5):
    for col in cols:
        print(col, ":", test[col][i])
    print('\n')

id : 6270596
document : 굳 ㅋ
label : 1


id : 9274899
document : GDNTOPCLASSINTHECLUB
label : 0


id : 8544678
document : 뭐야 이 평점들은.... 나쁘진 않지만 10점 짜리는 더더욱 아니잖아
label : 0


id : 6825595
document : 지루하지는 않은데 완전 막장임... 돈주고 보기에는....
label : 0


id : 6723715
document : 3D만 아니었어도 별 다섯 개 줬을텐데.. 왜 3D로 나와서 제 심기를 불편하게 하죠??
label : 0




#### Analyzing the class label distribution

In [7]:
train_labels = huggingface_nsmc_dataset['train']['label']
test_labels = huggingface_nsmc_dataset['test']['label']

print("Train label distribution:", {0: train_labels.count(0), 1: train_labels.count(1)})
print("Test label distribution:", {0: test_labels.count(0), 1: test_labels.count(1)})

Train label distribution: {0: 75173, 1: 74827}
Test label distribution: {0: 24827, 1: 25173}


## Loading the klue/bert-base model and tokenizer

### Loading the `klue/bert-base` tokenizer

In [8]:
huggingface_tokenizer = AutoTokenizer.from_pretrained('klue/bert-base')

In [9]:
print(huggingface_tokenizer)

PreTrainedTokenizerFast(name_or_path='klue/bert-base', vocab_size=32000, model_max_len=512, is_fast=True, padding_side='right', special_tokens={'unk_token': '[UNK]', 'sep_token': '[SEP]', 'pad_token': '[PAD]', 'cls_token': '[CLS]', 'mask_token': '[MASK]'})


### Loading the tokenizer for the small `KoELECTRA` model

In [10]:
tokenizer = AutoTokenizer.from_pretrained('monologg/koelectra-small-v3-discriminator')

In [11]:
print(tokenizer)

PreTrainedTokenizerFast(name_or_path='monologg/koelectra-small-v3-discriminator', vocab_size=35000, model_max_len=512, is_fast=True, padding_side='right', special_tokens={'unk_token': '[UNK]', 'sep_token': '[SEP]', 'pad_token': '[PAD]', 'cls_token': '[CLS]', 'mask_token': '[MASK]'})


### Loading the klue/bert-base model

In [12]:
huggingface_model = AutoModelForSequenceClassification.from_pretrained('klue/bert-base', num_labels=2)

Some weights of the model checkpoint at klue/bert-base were not used when initializing BertForSequenceClassification: ['cls.predictions.transform.dense.weight', 'cls.predictions.decoder.bias', 'cls.predictions.transform.dense.bias', 'cls.predictions.transform.LayerNorm.bias', 'cls.predictions.transform.LayerNorm.weight', 'cls.seq_relationship.bias', 'cls.seq_relationship.weight', 'cls.predictions.bias', 'cls.predictions.decoder.weight']
- This IS expected if you are initializing BertForSequenceClassification from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing BertForSequenceClassification from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
Some weights of BertForSequenceClassification were not initialized

In [13]:
print(huggingface_model)

BertForSequenceClassification(
  (bert): BertModel(
    (embeddings): BertEmbeddings(
      (word_embeddings): Embedding(32000, 768, padding_idx=0)
      (position_embeddings): Embedding(512, 768)
      (token_type_embeddings): Embedding(2, 768)
      (LayerNorm): LayerNorm((768,), eps=1e-12, elementwise_affine=True)
      (dropout): Dropout(p=0.1, inplace=False)
    )
    (encoder): BertEncoder(
      (layer): ModuleList(
        (0): BertLayer(
          (attention): BertAttention(
            (self): BertSelfAttention(
              (query): Linear(in_features=768, out_features=768, bias=True)
              (key): Linear(in_features=768, out_features=768, bias=True)
              (value): Linear(in_features=768, out_features=768, bias=True)
              (dropout): Dropout(p=0.1, inplace=False)
            )
            (output): BertSelfOutput(
              (dense): Linear(in_features=768, out_features=768, bias=True)
              (LayerNorm): LayerNorm((768,), eps=1e-12, element

### Loading the KoELECTRA model

In [14]:
model = AutoModelForSequenceClassification.from_pretrained('monologg/koelectra-small-v3-discriminator', num_labels=2)

Some weights of the model checkpoint at monologg/koelectra-small-v3-discriminator were not used when initializing ElectraForSequenceClassification: ['discriminator_predictions.dense.bias', 'discriminator_predictions.dense_prediction.weight', 'discriminator_predictions.dense.weight', 'discriminator_predictions.dense_prediction.bias']
- This IS expected if you are initializing ElectraForSequenceClassification from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing ElectraForSequenceClassification from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
Some weights of ElectraForSequenceClassification were not initialized from the model checkpoint at monologg/koelectra-small-v3-discriminator and are newly initialized

In [15]:
print(model)

ElectraForSequenceClassification(
  (electra): ElectraModel(
    (embeddings): ElectraEmbeddings(
      (word_embeddings): Embedding(35000, 128, padding_idx=0)
      (position_embeddings): Embedding(512, 128)
      (token_type_embeddings): Embedding(2, 128)
      (LayerNorm): LayerNorm((128,), eps=1e-12, elementwise_affine=True)
      (dropout): Dropout(p=0.1, inplace=False)
    )
    (embeddings_project): Linear(in_features=128, out_features=256, bias=True)
    (encoder): ElectraEncoder(
      (layer): ModuleList(
        (0): ElectraLayer(
          (attention): ElectraAttention(
            (self): ElectraSelfAttention(
              (query): Linear(in_features=256, out_features=256, bias=True)
              (key): Linear(in_features=256, out_features=256, bias=True)
              (value): Linear(in_features=256, out_features=256, bias=True)
              (dropout): Dropout(p=0.1, inplace=False)
            )
            (output): ElectraSelfOutput(
              (dense): Linear(in_

## Preprocessing the dataset with the tokenizer and training the model

### Dataset preprocessing

#### Defining the tokenize function

In [16]:
def transform(data):
    return huggingface_tokenizer(data['document'], 
                                 truncation=True, 
                                 padding='max_length', 
                                 max_length=128, 
                                 return_token_type_ids=False)

In [17]:
# def transform(data):
#     return tokenizer(data['document'], 
#                      truncation=True, 
#                      padding='max_length', 
#                      max_length=128, 
#                      return_token_type_ids=False)

#### Applying the tokenizer to the dataset

In [18]:
tokenized_nsmc_dataset = huggingface_nsmc_dataset.map(transform, batched=True)

Loading cached processed dataset at /aiffel/.cache/huggingface/datasets/nsmc/default/1.1.0/bfd4729bf1a67114e5267e6916b9e4807010aeb238e4a3c2b95fbfa3a014b5f3/cache-65a493f18f4ee7f1.arrow
Loading cached processed dataset at /aiffel/.cache/huggingface/datasets/nsmc/default/1.1.0/bfd4729bf1a67114e5267e6916b9e4807010aeb238e4a3c2b95fbfa3a014b5f3/cache-d8bb27747a8d1199.arrow


### train & test split

In [19]:
hf_train_dataset = tokenized_nsmc_dataset['train']
hf_test_dataset = tokenized_nsmc_dataset['test']

In [20]:
print(len(hf_train_dataset))
print(len(hf_test_dataset))

150000
50000


### TrainingArguments
Set the training configuration in advance.

In [27]:
# output_dir = os.getenv('HOME')+'/aiffel/transformers'

training_args = TrainingArguments(
    output_dir='./results',                # where the model and checkpoints are saved
    num_train_epochs=3,                    # number of training epochs
    per_device_train_batch_size=8,         # training batch size per device
    per_device_eval_batch_size=8,          # evaluation batch size per device
    warmup_steps=500,                      # number of learning-rate warmup steps
    weight_decay=0.01,                     # weight decay for regularization
    logging_dir='./logs',                  # where logs are written
    logging_steps=10,                      # log every 10 steps
    evaluation_strategy='epoch',           # evaluate at the end of every epoch
    eval_steps=500,                        # only used with evaluation_strategy='steps'; no effect here
    save_strategy='epoch',                 # save a checkpoint at the end of every epoch
    gradient_accumulation_steps=2,         # accumulate gradients over 2 steps (effective batch size 16)
    learning_rate=5e-5,                    # learning rate
    load_best_model_at_end=True,           # reload the best checkpoint when training ends
)

PyTorch: setting up devices
The default value for the training argument `--report_to` will change in v5 (from all installed integrations to none). In v5, you will need to use `--report_to all` to get the same behavior as now. You should start updating your code and make this info disappear :-).


### compute_metrics

In [28]:
metric = load_metric('accuracy')

In [29]:
def compute_metrics(eval_pred):    
    predictions, labels = eval_pred
    predictions = np.argmax(predictions, axis=1)
    return metric.compute(predictions=predictions, references=labels)
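
To make the role of `compute_metrics` concrete, here is a minimal sketch of the same argmax-then-compare logic on made-up logits. It uses plain NumPy instead of `load_metric`, and both the helper name `accuracy_from_logits` and the toy values are illustrative assumptions, not part of the notebook.

```python
import numpy as np

def accuracy_from_logits(logits, labels):
    """Hypothetical helper: pick the highest-scoring class, then average the matches."""
    predictions = np.argmax(logits, axis=1)
    return {"accuracy": float(np.mean(predictions == labels))}

# Toy logits for 4 reviews and 2 classes (made-up numbers).
toy_logits = np.array([[2.0, -1.0], [0.1, 0.9], [1.5, 1.4], [-0.3, 0.2]])
toy_labels = np.array([0, 1, 1, 1])

result = accuracy_from_logits(toy_logits, toy_labels)
print(result)
```

Three of the four toy predictions match their labels, so the reported accuracy is 0.75. The Trainer passes the real logits and labels in the same `(predictions, labels)` layout.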

### Training the model

In [30]:
trainer = Trainer(
    model=huggingface_model,             # the model loaded earlier
    args=training_args,                  # training configuration
    train_dataset=hf_train_dataset,      # training dataset
    eval_dataset=hf_test_dataset,        # evaluation dataset
    compute_metrics=compute_metrics      # function that computes the evaluation metric
)
trainer.train()

The following columns in the training set  don't have a corresponding argument in `BertForSequenceClassification.forward` and have been ignored: document, id.
***** Running training *****
  Num examples = 150000
  Num Epochs = 3
  Instantaneous batch size per device = 8
  Total train batch size (w. parallel, distributed & accumulation) = 16
  Gradient Accumulation steps = 2
  Total optimization steps = 28125


| Epoch | Training Loss | Validation Loss | Accuracy |
|---|---|---|---|
| 1 | 0.3069 | 0.27533 | 0.89444 |
| 2 | 0.2476 | 0.284077 | 0.9007 |
| 3 | 0.0923 | 0.404648 | 0.90234 |


The following columns in the evaluation set  don't have a corresponding argument in `BertForSequenceClassification.forward` and have been ignored: document, id.
***** Running Evaluation *****
  Num examples = 50000
  Batch size = 8
Saving model checkpoint to ./results/checkpoint-9375
Configuration saved in ./results/checkpoint-9375/config.json
Model weights saved in ./results/checkpoint-9375/pytorch_model.bin
The following columns in the evaluation set  don't have a corresponding argument in `BertForSequenceClassification.forward` and have been ignored: document, id.
***** Running Evaluation *****
  Num examples = 50000
  Batch size = 8
Saving model checkpoint to ./results/checkpoint-18750
Configuration saved in ./results/checkpoint-18750/config.json
Model weights saved in ./results/checkpoint-18750/pytorch_model.bin
The following columns in the evaluation set  don't have a corresponding argument in `BertForSequenceClassification.forward` and have been ignored: document, id.
***** Runn

TrainOutput(global_step=28125, training_loss=0.20928465972105662, metrics={'train_runtime': 12099.3827, 'train_samples_per_second': 37.192, 'train_steps_per_second': 2.324, 'total_flos': 2.9599993728e+16, 'train_loss': 0.20928465972105662, 'epoch': 3.0})
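
The step counts in the log follow from the batch settings. The sketch below reproduces them under the assumption of a single device, which is consistent with the logged total batch size of 16. The function name is illustrative, and the rounding mirrors my understanding of how the Trainer counts steps rather than its exact code.

```python
import math

def optimization_steps(num_examples, per_device_batch, grad_accum, epochs, num_devices=1):
    """Estimate effective batch size, optimizer steps per epoch, and total optimizer steps."""
    effective_batch = per_device_batch * grad_accum * num_devices
    # Number of mini-batches the dataloader yields in one epoch.
    batches_per_epoch = math.ceil(num_examples / (per_device_batch * num_devices))
    # One optimizer step happens every `grad_accum` mini-batches.
    steps_per_epoch = max(batches_per_epoch // grad_accum, 1)
    return effective_batch, steps_per_epoch, steps_per_epoch * epochs

effective, per_epoch, total = optimization_steps(150000, 8, 2, 3)
print(effective, per_epoch, total)
```

With 150,000 examples, batch size 8, and 2 accumulation steps, this gives an effective batch of 16, 9,375 steps per epoch, and 28,125 steps in total, matching the `checkpoint-9375` and `checkpoint-18750` directory names and `global_step=28125`.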

#### RuntimeError: CUDA error: device-side assert triggered
Increasing the batch size triggered this RuntimeError, so the batch size was fixed at 8.  
In the end I could not resolve it and simply ran the model as is.
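
A device-side assert often (though not always) comes from an out-of-range index, for example a token ID at or beyond the embedding size, or a label outside `[0, num_labels)`. Whether that was the cause here is not known. The sketch below is one way to rule it out on the CPU before training; `find_out_of_range` is a hypothetical helper and the sample rows are made up.

```python
def find_out_of_range(rows, vocab_size, num_labels):
    """Return indices of rows whose token IDs or label fall outside the valid ranges."""
    bad = []
    for i, row in enumerate(rows):
        ids_ok = all(0 <= token_id < vocab_size for token_id in row["input_ids"])
        label_ok = 0 <= row["label"] < num_labels
        if not (ids_ok and label_ok):
            bad.append(i)
    return bad

# Made-up rows in the same shape as a tokenized dataset entry.
sample_rows = [
    {"input_ids": [2, 1537, 3], "label": 0},
    {"input_ids": [2, 32000, 3], "label": 1},  # 32000 is out of range for vocab_size=32000
    {"input_ids": [2, 77, 3], "label": 2},     # label 2 is invalid for num_labels=2
]

bad_rows = find_out_of_range(sample_rows, vocab_size=32000, num_labels=2)
print(bad_rows)
```

On the real data, the rows would come from the tokenized dataset and `vocab_size` from the model's embedding size. Setting `CUDA_LAUNCH_BLOCKING=1` before CUDA is initialized, or running one batch on the CPU, usually yields a more precise error location.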

In [31]:
!nvidia-smi

Tue Mar 26 13:37:28 2024       
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 470.223.02   Driver Version: 470.223.02   CUDA Version: 11.4     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|   0  Tesla T4            Off  | 00000000:00:04.0 Off |                    0 |
| N/A   73C    P0    31W /  70W |   5112MiB / 15109MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
                                                                               
+---------------------------------------------------------------------------

In [32]:
# os.environ['CUDA_LAUNCH_BLOCKING'] = "1"

### Evaluating the model

In [None]:
trainer.evaluate(hf_test_dataset)
# I accidentally converted this code cell to markdown, so the output was lost.
# The accuracy came out at about 0.90.
# I am not up for rerunning 3 hours of training..

## Improving model performance (accuracy) through fine-tuning
Adjust the data preprocessing, TrainingArguments, and other settings to raise the model's accuracy above 90%.

The initial TrainingArguments:
```python
training_args = TrainingArguments(
    output_dir='./results',          # model output directory
    num_train_epochs=3,              # total number of training epochs
    per_device_train_batch_size=8,   # training batch size
    per_device_eval_batch_size=8,    # evaluation batch size
    warmup_steps=500,                # number of warmup steps
    evaluation_strategy='epoch',     # evaluate at the end of every epoch
    logging_dir='./logs',            # log directory
)
```

The tuned TrainingArguments:

```python
training_args = TrainingArguments(
    output_dir='./results',                # where the model and checkpoints are saved
    num_train_epochs=3,                    # number of training epochs
    per_device_train_batch_size=8,         # training batch size per device
    per_device_eval_batch_size=8,          # evaluation batch size per device
    warmup_steps=500,                      # number of learning-rate warmup steps
    weight_decay=0.01,                     # weight decay for regularization
    logging_dir='./logs',                  # where logs are written
    logging_steps=10,                      # log every 10 steps
    evaluation_strategy='epoch',           # evaluate at the end of every epoch
    eval_steps=500,                        # only used with evaluation_strategy='steps'; no effect here
    save_strategy='epoch',                 # save a checkpoint at the end of every epoch
    gradient_accumulation_steps=2,         # accumulate gradients over 2 steps (effective batch size 16)
    learning_rate=5e-5,                    # learning rate
    load_best_model_at_end=True,           # reload the best checkpoint when training ends
)
```

## Training with bucketing and comparing against the STEP 4 results
Using the links below, find out what bucketing and dynamic padding are, then apply them to model training.
- [Data Collator](https://huggingface.co/docs/transformers/v4.30.0/en/main_classes/data_collator)
- [`group_by_length` in Trainer.TrainingArguments](https://huggingface.co/docs/transformers/main_classes/trainer#transformers.TrainingArguments)

Compare the STEP 4 results with the results of training with bucketing, and examine what each offers in terms of model performance and training time.
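
As a rough illustration of the two ideas, the sketch below uses plain Python lists instead of the real `DataCollatorWithPadding` and `group_by_length` machinery. The helper names and the toy sequence lengths are assumptions made for this example. As far as I understand, the actual Trainer sampler groups by length within randomly drawn chunks rather than sorting the whole dataset, so this is a simplification.

```python
def pad_batch(batch, pad_id=0):
    """Dynamic padding: pad each sequence only up to the longest one in this batch."""
    longest = max(len(seq) for seq in batch)
    return [seq + [pad_id] * (longest - len(seq)) for seq in batch]

def make_batches(sequences, batch_size, bucket=False):
    """Split sequences into batches; with bucket=True, sort by length first."""
    ordered = sorted(sequences, key=len) if bucket else list(sequences)
    return [ordered[i:i + batch_size] for i in range(0, len(ordered), batch_size)]

def count_pad_tokens(batches, pad_id=0):
    """Total number of padding tokens added across all dynamically padded batches."""
    return sum(
        len(padded) - len(original)
        for batch in batches
        for padded, original in zip(pad_batch(batch, pad_id), batch)
    )

# Toy "tokenized" sequences of lengths 3, 12, 4, 11 (token value 1 is a placeholder).
toy = [[1] * n for n in (3, 12, 4, 11)]

plain_pads = count_pad_tokens(make_batches(toy, batch_size=2))
bucketed_pads = count_pad_tokens(make_batches(toy, batch_size=2, bucket=True))
print(plain_pads, bucketed_pads)
```

In this toy case dynamic padding alone adds 16 padding tokens, and bucketing reduces that to 2. Padding the same four sequences to a fixed length of 128 would add 482. Dynamic padding only helps if the tokenize step does not already pad to `max_length`.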

In [34]:
from transformers import DataCollatorWithPadding

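# Define the data collator.
# Caution: `tokenizer` here is the KoELECTRA tokenizer, but the dataset was tokenized
# above with `huggingface_tokenizer` (klue/bert-base) and padded to max_length=128,
# so this collator has no shorter sequences left to pad dynamically.
data_collator = DataCollatorWithPadding(tokenizer=tokenizer, return_tensors="pt")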

In [35]:
training_args = TrainingArguments(
    output_dir='./results',
    num_train_epochs=3,
    per_device_train_batch_size=8,
    per_device_eval_batch_size=8,
    warmup_steps=100,
    weight_decay=0.01,
    logging_dir='./logs',
    logging_steps=10,
    evaluation_strategy='epoch',
    save_strategy='epoch',
    load_best_model_at_end=True,
    group_by_length=True,
)

PyTorch: setting up devices
The default value for the training argument `--report_to` will change in v5 (from all installed integrations to none). In v5, you will need to use `--report_to all` to get the same behavior as now. You should start updating your code and make this info disappear :-).


In [36]:
trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=hf_train_dataset,
    eval_dataset=hf_test_dataset,
    data_collator=data_collator,
    compute_metrics=compute_metrics,
)

trainer.train()

The following columns in the training set  don't have a corresponding argument in `ElectraForSequenceClassification.forward` and have been ignored: document, id.
***** Running training *****
  Num examples = 150000
  Num Epochs = 3
  Instantaneous batch size per device = 8
  Total train batch size (w. parallel, distributed & accumulation) = 8
  Gradient Accumulation steps = 1
  Total optimization steps = 56250


| Epoch | Training Loss | Validation Loss | Accuracy |
|---|---|---|---|
| 1 | 0.5591 | 0.45532 | 0.80922 |
| 2 | 0.4279 | 0.398825 | 0.82508 |
| 3 | 0.2713 | 0.418422 | 0.83802 |


The following columns in the evaluation set  don't have a corresponding argument in `ElectraForSequenceClassification.forward` and have been ignored: document, id.
***** Running Evaluation *****
  Num examples = 50000
  Batch size = 8
Saving model checkpoint to ./results/checkpoint-18750
Configuration saved in ./results/checkpoint-18750/config.json
Model weights saved in ./results/checkpoint-18750/pytorch_model.bin
The following columns in the evaluation set  don't have a corresponding argument in `ElectraForSequenceClassification.forward` and have been ignored: document, id.
***** Running Evaluation *****
  Num examples = 50000
  Batch size = 8
Saving model checkpoint to ./results/checkpoint-37500
Configuration saved in ./results/checkpoint-37500/config.json
Model weights saved in ./results/checkpoint-37500/pytorch_model.bin
The following columns in the evaluation set  don't have a corresponding argument in `ElectraForSequenceClassification.forward` and have been ignored: document, id

TrainOutput(global_step=56250, training_loss=0.44554781651390923, metrics={'train_runtime': 3439.6731, 'train_samples_per_second': 130.826, 'train_steps_per_second': 16.353, 'total_flos': 3309709593600000.0, 'train_loss': 0.44554781651390923, 'epoch': 3.0})

In [37]:
trainer.evaluate(hf_test_dataset)

The following columns in the evaluation set  don't have a corresponding argument in `ElectraForSequenceClassification.forward` and have been ignored: document, id.
***** Running Evaluation *****
  Num examples = 50000
  Batch size = 8


{'eval_loss': 0.3988247215747833,
 'eval_accuracy': 0.82508,
 'eval_runtime': 85.7678,
 'eval_samples_per_second': 582.97,
 'eval_steps_per_second': 72.871,
 'epoch': 3.0}

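### Comparison
- Training time
    - Dynamic padding and length-based batch grouping minimize unnecessary padding, which shortens the processing time of each batch.
    - This shortens overall training time.
        - About 3 hours (12,099 s) vs. about 1 hour (3,440 s). Note that the second run also switched to the smaller KoELECTRA model and reused inputs already padded to max_length=128, so the speedup cannot be attributed to bucketing alone.
- Model performance
    - Processing samples of similar length together can help the model learn more efficiently from each batch.
    - This can lead to better model performance.
        - In this experiment, accuracy came out lower than the earlier model (0.825 vs. about 0.90). The change of model, with inputs tokenized by the klue/bert-base tokenizer, likely contributes to this gap.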
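
The "3 hours / 1 hour" figures can be read off the two `TrainOutput` records. A small calculation using the logged `train_runtime` values (the variable names are illustrative):

```python
# train_runtime values (in seconds) copied from the two TrainOutput records.
bert_runtime_s = 12099.3827     # klue/bert-base run
electra_runtime_s = 3439.6731   # KoELECTRA run with group_by_length=True

bert_hours = bert_runtime_s / 3600
electra_hours = electra_runtime_s / 3600
speedup = bert_runtime_s / electra_runtime_s

print(f"{bert_hours:.2f} h vs {electra_hours:.2f} h, ratio {speedup:.2f}x")
```

The ratio describes the two runs as a whole. Since they differ in model and gradient accumulation as well as in batching strategy, it should not be read as the speedup from bucketing by itself.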

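### Retrospective
I think I spent about 10 hours waiting, starting right after lunch.   
It was tough.. I even lost an hour and a half partway through, otherwise I could have finished earlier..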