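### 1. The GLUE dataset and Hugging Face
- In the past, pretrained models were compared on a single well-known dataset such as SQuAD. More recently, the practice has shifted to having one model perform a variety of tasks and comparing performance across them.
- The GLUE (General Language Understanding Evaluation) benchmark dataset is a representative choice for measuring the performance of NLP models.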

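> The GLUE dataset covers a variety of natural language processing (NLP) tasks. Each task addresses the following:

1. **CoLA (Corpus of Linguistic Acceptability)**: Judges whether a sentence is grammatically acceptable.

2. **MNLI (Multi-Genre Natural Language Inference)**: Judges the relationship between a premise and a hypothesis, classifying it as entailment, contradiction, or neutral.

3. **MNLI-MM (MNLI-Mismatched)**: The same task as MNLI, but evaluated on sentence pairs from genres that do not appear in the training data.

4. **MRPC (Microsoft Research Paraphrase Corpus)**: Judges whether two sentences are paraphrases of each other (semantically equivalent).

5. **SST-2 (Stanford Sentiment Treebank)**: Performs binary sentiment classification (positive or negative) on a sentence.

6. **STS-B (Semantic Textual Similarity Benchmark)**: Scores the semantic similarity of two sentences as a continuous value, which makes it a regression task.

7. **QQP (Quora Question Pairs)**: Judges whether two questions are semantically equivalent.

8. **QNLI (Question NLI)**: Judges the entailment relationship between a question and a sentence from a paragraph, that is, whether the sentence contains the answer to the question.

9. **RTE (Recognizing Textual Entailment)**: Judges the relationship between two sentences, classifying it as entailment or not_entailment.

10. **WNLI (Winograd NLI)**: Derived from the Winograd Schema Challenge. Judges whether a sentence in which the pronoun has been replaced by a candidate referent is entailed by the original sentence.

11. **Diagnostic Main**: A natural language inference set used to diagnose a model's language understanding ability.

These tasks are used to evaluate and improve different aspects of NLP models. Hugging Face makes the GLUE dataset easy to access, so a variety of models can be trained and evaluated on it.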
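
To make the task list easier to use from code, the sketch below maps each task to the configuration name used by the Hugging Face `glue` dataset and to the metrics usually reported for it. The mapping is written from memory of the library's conventions, so treat the names as assumptions to verify against the dataset card; `GLUE_TASKS` and `describe` are hypothetical names introduced only for illustration.

```python
# Illustrative mapping: task name -> (assumed `glue` config name, metrics usually reported).
GLUE_TASKS = {
    "CoLA": ("cola", ["matthews_correlation"]),
    "MNLI": ("mnli", ["accuracy"]),
    "MNLI-MM": ("mnli_mismatched", ["accuracy"]),
    "MRPC": ("mrpc", ["accuracy", "f1"]),
    "SST-2": ("sst2", ["accuracy"]),
    "STS-B": ("stsb", ["pearson", "spearmanr"]),
    "QQP": ("qqp", ["accuracy", "f1"]),
    "QNLI": ("qnli", ["accuracy"]),
    "RTE": ("rte", ["accuracy"]),
    "WNLI": ("wnli", ["accuracy"]),
    "Diagnostic Main": ("ax", []),  # diagnostic set; no metric listed in this sketch
}


def describe(task):
    """Return a one-line summary for a task name from the list above."""
    config, metrics = GLUE_TASKS[task]
    shown = ", ".join(metrics) if metrics else "n/a"
    return f"{task}: config='{config}', metrics={shown}"


for name in GLUE_TASKS:
    print(describe(name))
```

With this table, the MRPC task used later in this notebook corresponds to the `'mrpc'` configuration and is scored with accuracy and F1.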


- If the code below does not run properly, enter the following commands in a terminal.
```
pip uninstall transformers -y
pip install transformers
mkdir -p transformers
```

In [1]:
!python -c "from transformers import pipeline; print(pipeline('sentiment-analysis')('I love you'))"

No model was supplied, defaulted to distilbert-base-uncased-finetuned-sst-2-english (https://huggingface.co/distilbert-base-uncased-finetuned-sst-2-english)
Downloading: 100%|██████████████████████████████| 629/629 [00:00<00:00, 499kB/s]
Downloading: 100%|███████████████████████████| 255M/255M [00:03<00:00, 87.5MB/s]
Downloading: 100%|███████████████████████████| 48.0/48.0 [00:00<00:00, 34.6kB/s]
Downloading: 100%|███████████████████████████| 226k/226k [00:00<00:00, 6.64MB/s]
[{'label': 'POSITIVE', 'score': 0.9998656511306763}]
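
The `score` in the output above is a class probability. As a rough sketch of how a two-class sentiment head arrives at such a result, the snippet below applies a softmax to a pair of logits and reports the most probable label. The logit values are hypothetical (they are not the model's actual outputs), and the label order is assumed to be 0 = NEGATIVE, 1 = POSITIVE.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


# Assumed label order for the SST-2 checkpoint.
ID2LABEL = {0: "NEGATIVE", 1: "POSITIVE"}

# Hypothetical logits for the input "I love you".
logits = [-4.3, 4.6]

probs = softmax(logits)
best = max(range(len(probs)), key=lambda i: probs[i])
result = [{"label": ID2LABEL[best], "score": probs[best]}]
print(result)
```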


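### 2. Building a custom project (1) Dataset
- As the proverb goes, even three bushels of beads become a treasure only once they are strung together. Let's process the data to fit the task.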


In [2]:
import datasets
from datasets import load_dataset

huggingface_mrpc_dataset = load_dataset('glue', 'mrpc')
print(huggingface_mrpc_dataset)

Reusing dataset glue (/aiffel/.cache/huggingface/datasets/glue/mrpc/1.0.0/dacbe3125aa31d7f70367a07a8a9e72a5a0bfeb5fc42e75c9db75b96da6053ad)


DatasetDict({
    train: Dataset({
        features: ['sentence1', 'sentence2', 'label', 'idx'],
        num_rows: 3668
    })
    validation: Dataset({
        features: ['sentence1', 'sentence2', 'label', 'idx'],
        num_rows: 408
    })
    test: Dataset({
        features: ['sentence1', 'sentence2', 'label', 'idx'],
        num_rows: 1725
    })
})


In [3]:
train = huggingface_mrpc_dataset['train']
cols = train.column_names
cols

['sentence1', 'sentence2', 'label', 'idx']

In [4]:
for i in range(5):
    row = train[i]  # fetch one example as a dict
    for col in cols:
        print(col, ":", row[col])
    print('\n')

sentence1 : Amrozi accused his brother , whom he called " the witness " , of deliberately distorting his evidence .
sentence2 : Referring to him as only " the witness " , Amrozi accused his brother of deliberately distorting his evidence .
label : 1
idx : 0


sentence1 : Yucaipa owned Dominick 's before selling the chain to Safeway in 1998 for $ 2.5 billion .
sentence2 : Yucaipa bought Dominick 's in 1995 for $ 693 million and sold it to Safeway for $ 1.8 billion in 1998 .
label : 0
idx : 1


sentence1 : They had published an advertisement on the Internet on June 10 , offering the cargo for sale , he added .
sentence2 : On June 10 , the ship 's owners had published an advertisement on the Internet , offering the explosives for sale .
label : 1
idx : 2


sentence1 : Around 0335 GMT , Tab shares were up 19 cents , or 4.4 % , at A $ 4.56 , having earlier set a record high of A $ 4.57 .
sentence2 : Tab shares jumped 20 cents , or 4.6 % , to set a record closing high at A $ 4.57 .
label : 0
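
The integer labels printed above are easier to read with names attached. The sketch below assumes the MRPC convention that 1 means the two sentences are paraphrases and 0 means they are not; the label names and the `rows` list (copied from the `label` and `idx` values shown above) are written out by hand for illustration rather than read from the dataset.

```python
from collections import Counter

# Assumed MRPC label names: index 0 = not a paraphrase, index 1 = paraphrase.
LABEL_NAMES = ["not_equivalent", "equivalent"]

# (idx, label) pairs copied from the first three examples printed above.
rows = [(0, 1), (1, 0), (2, 1)]

for idx, label in rows:
    print(f"idx {idx}: {LABEL_NAMES[label]}")

# Count how often each class appears in this small sample.
counts = Counter(LABEL_NAMES[label] for _, label in rows)
print(dict(counts))
```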

#### Building a custom dataset

In [5]:
import tensorflow_datasets as tfds
from datasets import Dataset

tf_dataset, tf_dataset_info = tfds.load('glue/mrpc', with_info=True)

In [26]:
examples = tf_dataset['test'].take(5)

for example in examples:
    for col in cols:
        print(col, ":", example[col])
    print('\n')

sentence1 : tf.Tensor(b'Shares in BA were down 1.5 percent at 168 pence by 1420 GMT , off a low of 164p , in a slightly stronger overall London market .', shape=(), dtype=string)
sentence2 : tf.Tensor(b'Shares in BA were down three percent at 165-1 / 4 pence by 0933 GMT , off a low of 164 pence , in a stronger market .', shape=(), dtype=string)
label : tf.Tensor(-1, shape=(), dtype=int64)
idx : tf.Tensor(163, shape=(), dtype=int32)


sentence1 : tf.Tensor(b'The South Korean Agriculture and Forestry Ministry also said it would throw out or send back all Canadian beef currently in store .', shape=(), dtype=string)
sentence2 : tf.Tensor(b'The South Korean Agriculture and Forestry Ministry said it would scrap or return all Canadian beef in store .', shape=(), dtype=string)
label : tf.Tensor(-1, shape=(), dtype=int64)
idx : tf.Tensor(131, shape=(), dtype=int32)


sentence1 : tf.Tensor(b'" New Yorkers didn \'t embrace these units like they could have , " said Matthew Daus , chairman of the c

In [6]:
examples = tf_dataset['train'].take(5)

for example in examples:
    for col in cols:
        print(col, ":", example[col])
    print('\n')

sentence1 : tf.Tensor(b'The identical rovers will act as robotic geologists , searching for evidence of past water .', shape=(), dtype=string)
sentence2 : tf.Tensor(b'The rovers act as robotic geologists , moving on six wheels .', shape=(), dtype=string)
label : tf.Tensor(0, shape=(), dtype=int64)
idx : tf.Tensor(1680, shape=(), dtype=int32)


sentence1 : tf.Tensor(b"Less than 20 percent of Boise 's sales would come from making lumber and paper after the OfficeMax purchase is completed .", shape=(), dtype=string)
sentence2 : tf.Tensor(b"Less than 20 percent of Boise 's sales would come from making lumber and paper after the OfficeMax purchase is complete , assuming those businesses aren 't sold .", shape=(), dtype=string)
label : tf.Tensor(0, shape=(), dtype=int64)
idx : tf.Tensor(1456, shape=(), dtype=int32)


sentence1 : tf.Tensor(b'Spider-Man snatched $ 114.7 million in its debut last year and went on to capture $ 403.7 million .', shape=(), dtype=string)
sentence2 : tf.Tensor(b'Spi

In [7]:
# Convert the TensorFlow dataset structure into Python dicts,
# keeping the data split into train, validation, and test.
train_dataset = tfds.as_dataframe(tf_dataset['train'], tf_dataset_info)
val_dataset = tfds.as_dataframe(tf_dataset['validation'], tf_dataset_info)
test_dataset = tfds.as_dataframe(tf_dataset['test'], tf_dataset_info)

# Convert each DataFrame into a dict that maps column names to lists.
# String columns are still bytes here; transform_tf decodes them later.
train_dataset = train_dataset.to_dict('list')
val_dataset = val_dataset.to_dict('list')
test_dataset = test_dataset.to_dict('list')

# Build Hugging Face Datasets from the dicts.
tf_train_dataset = Dataset.from_dict(train_dataset)
tf_val_dataset = Dataset.from_dict(val_dataset)
tf_test_dataset = Dataset.from_dict(test_dataset)
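
The conversion above ends with a dict that maps each column name to a list of values, with the sentences still stored as bytes. The following pure-Python sketch mimics that shape using two of the training examples printed earlier; `records_to_columns` and `decode_columns` are hypothetical helpers for illustration and are not part of TensorFlow Datasets or Hugging Face Datasets.

```python
def records_to_columns(records):
    """Turn a list of row dicts into a dict of column lists."""
    columns = {key: [] for key in records[0]}
    for record in records:
        for key, value in record.items():
            columns[key].append(value)
    return columns


def decode_columns(columns, text_keys=("sentence1", "sentence2")):
    """Decode the byte-string columns to str, leaving other columns untouched."""
    return {
        key: [v.decode("utf-8") for v in values] if key in text_keys else list(values)
        for key, values in columns.items()
    }


# Two rows copied from the train split output shown above.
records = [
    {
        "sentence1": b"The identical rovers will act as robotic geologists , searching for evidence of past water .",
        "sentence2": b"The rovers act as robotic geologists , moving on six wheels .",
        "label": 0,
        "idx": 1680,
    },
    {
        "sentence1": b"Less than 20 percent of Boise 's sales would come from making lumber and paper after the OfficeMax purchase is completed .",
        "sentence2": b"Less than 20 percent of Boise 's sales would come from making lumber and paper after the OfficeMax purchase is complete , assuming those businesses aren 't sold .",
        "label": 0,
        "idx": 1456,
    },
]

columns = records_to_columns(records)
decoded = decode_columns(columns)
print(list(columns.keys()))
print(type(columns["sentence1"][0]).__name__, "->", type(decoded["sentence1"][0]).__name__)
```

Note that the test split printed earlier carries the placeholder label -1, so a dataset built from it this way has no usable gold labels for evaluation.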

### 3. Building a custom project (2) Tokenizer and Model
- Let's load the model and tokenizer for the custom project.

In [8]:
import transformers
from transformers import AutoTokenizer, AutoModelForSequenceClassification

huggingface_tokenizer = AutoTokenizer.from_pretrained('distilbert-base-uncased')
huggingface_model = AutoModelForSequenceClassification.from_pretrained('distilbert-base-uncased', num_labels = 2)

Some weights of the model checkpoint at distilbert-base-uncased were not used when initializing DistilBertForSequenceClassification: ['vocab_transform.weight', 'vocab_projector.weight', 'vocab_transform.bias', 'vocab_layer_norm.bias', 'vocab_layer_norm.weight', 'vocab_projector.bias']
- This IS expected if you are initializing DistilBertForSequenceClassification from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing DistilBertForSequenceClassification from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
Some weights of DistilBertForSequenceClassification were not initialized from the model checkpoint at distilbert-base-uncased and are newly initialized: ['classifier.weight', 'classifier.bias', 'pre_classifier

In [9]:
def transform(data):
    return huggingface_tokenizer(
        data['sentence1'],
        data['sentence2'],
        truncation = True,
        padding = 'max_length',
        return_token_type_ids = False,
        )
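
To give an intuition for what `transform` returns for a sentence pair, here is a toy sketch of pair encoding with truncation and `max_length` padding. It splits on whitespace instead of using WordPiece and works on token strings instead of ids, so it is only an approximation of the real tokenizer's behavior; `encode_pair` is a hypothetical helper and assumes `max_length` is at least 3.

```python
def encode_pair(sentence1, sentence2, max_length=16):
    """Toy version of pair encoding: [CLS] s1 [SEP] s2 [SEP] plus padding."""
    a = sentence1.lower().split()
    b = sentence2.lower().split()

    # Truncation: trim the longer sentence first until the pair fits,
    # leaving room for the three special tokens.
    budget = max_length - 3
    while len(a) + len(b) > budget:
        (a if len(a) >= len(b) else b).pop()

    tokens = ["[CLS]"] + a + ["[SEP]"] + b + ["[SEP]"]

    # Padding to max_length: real tokens get mask 1, padding gets mask 0.
    pad = max_length - len(tokens)
    attention_mask = [1] * len(tokens) + [0] * pad
    tokens = tokens + ["[PAD]"] * pad
    return {"tokens": tokens, "attention_mask": attention_mask}


short = encode_pair("a b c", "d e", max_length=10)
print(short["tokens"])
print(short["attention_mask"])

long = encode_pair("a b c d e f", "g h", max_length=8)
print(long["tokens"])
```

In the notebook, `padding='max_length'` pads every example to the model's maximum length, which is why each encoded example has the same size regardless of how short the sentences are.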

In [10]:
hf_dataset = huggingface_mrpc_dataset.map(transform, batched=True)

# train & validation & test split
hf_train_dataset = hf_dataset['train']
hf_val_dataset = hf_dataset['validation']
hf_test_dataset = hf_dataset['test']


In [11]:
def transform_tf(batch):
    sentence1 = [s.decode('utf-8') for s in batch['sentence1']]
    sentence2 = [s.decode('utf-8') for s in batch['sentence2']]
    return huggingface_tokenizer(
        sentence1,
        sentence2,
        truncation=True,
        padding='max_length',
        return_token_type_ids=False,
    )

# Apply tokenization and padding
tf_train_dataset = tf_train_dataset.map(transform_tf, batched=True)
tf_val_dataset = tf_val_dataset.map(transform_tf, batched=True)
tf_test_dataset = tf_test_dataset.map(transform_tf, batched=True)


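### 4. Building a custom project (3) Train/Evaluation and Test
- Let's train with the Hugging Face Trainer on both versions of the data: the Hugging Face dataset and the custom dataset built from TensorFlow Datasets.

#### 4-1. Training with Trainer
- Training is carried out with Hugging Face's Trainer.
- To use Trainer, the training settings must be specified in advance through TrainingArguments.
- The datasets used are hf_train_dataset, hf_val_dataset, and hf_test_dataset.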

In [12]:
import os
import numpy as np
from transformers import Trainer, TrainingArguments

output_dir = os.getenv('HOME')+'/aiffel/transformers'

training_arguments = TrainingArguments(
    output_dir,                       # directory where outputs are saved
    evaluation_strategy="epoch",      # how often to evaluate
    learning_rate=2e-5,               # learning rate
    per_device_train_batch_size=8,    # training batch size per device
    per_device_eval_batch_size=8,     # batch size during evaluation
    num_train_epochs=3,               # total number of training epochs
    weight_decay=0.01,                # weight decay
)
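
These settings determine how many optimizer steps a run takes. The arithmetic below is a small sketch using the 3668 training examples reported for the MRPC train split, and it assumes a single device and no gradient accumulation; the result matches the "Total optimization steps" value that the Trainer reports in its training log.

```python
import math

num_train_examples = 3668          # rows in the MRPC train split
per_device_train_batch_size = 8
num_train_epochs = 3
num_devices = 1                    # assumption: one GPU
gradient_accumulation_steps = 1    # assumption: no accumulation

# Examples consumed per optimizer step.
effective_batch_size = (
    per_device_train_batch_size * num_devices * gradient_accumulation_steps
)

# The last, partially filled batch still counts as a step.
steps_per_epoch = math.ceil(num_train_examples / effective_batch_size)
total_steps = steps_per_epoch * num_train_epochs

print(steps_per_epoch, total_steps)
```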

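- One of the arguments that can be passed to Trainer is the compute_metrics function.
- The shape of the model output differs depending on whether the task is classification or regression, so **the way model performance is computed must be specified with the output format of each task in mind.**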
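
The difference between the two output formats can be sketched as follows: a classification head produces one logit per class and the prediction is the argmax, while a regression head (as for STS-B) produces a single value per example that is used directly. `postprocess` is a hypothetical helper, and the logits are made-up numbers.

```python
import numpy as np


def postprocess(logits, is_regression):
    """Turn raw model outputs into predictions that a metric can consume."""
    logits = np.asarray(logits)
    if is_regression:
        # Shape (batch, 1) -> (batch,): the single output is the prediction.
        return np.squeeze(logits, axis=-1)
    # Shape (batch, num_labels) -> (batch,): pick the highest-scoring class.
    return np.argmax(logits, axis=-1)


class_preds = postprocess([[0.1, 0.9], [2.0, -1.0]], is_regression=False)
score_preds = postprocess([[3.2], [0.5]], is_regression=True)
print(class_preds, score_preds)
```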

In [13]:
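from datasets import load_metric

# GLUE metric for MRPC: reports accuracy and F1
metric = load_metric('glue', 'mrpc')

def compute_metrics(eval_pred):
    # eval_pred unpacks into the model's logits and the gold labels
    predictions, labels = eval_pred
    # pick the highest-scoring class for each example
    predictions = np.argmax(predictions, axis=1)
    return metric.compute(predictions=predictions, references=labels)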
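
For reference, the two numbers reported for MRPC can be computed by hand. The sketch below is not the library's implementation; it is a minimal NumPy version of accuracy and binary F1 (with label 1 as the positive class), applied to made-up logits and labels. `accuracy_and_f1` is a hypothetical name.

```python
import numpy as np


def accuracy_and_f1(predictions, references):
    """Accuracy and binary F1, treating label 1 as the positive class."""
    p = np.asarray(predictions)
    r = np.asarray(references)

    accuracy = float((p == r).mean())

    tp = int(((p == 1) & (r == 1)).sum())
    fp = int(((p == 1) & (r == 0)).sum())
    fn = int(((p == 0) & (r == 1)).sum())

    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return {"accuracy": accuracy, "f1": f1}


# Made-up logits for four examples and their gold labels.
logits = np.array([[0.2, 1.1], [1.5, -0.3], [0.1, 0.4], [-0.2, 0.9]])
labels = np.array([1, 0, 0, 1])

preds = np.argmax(logits, axis=1)
scores = accuracy_and_f1(preds, labels)
print(preds, scores)
```

Here three of four predictions are correct, and the single false positive lowers precision to 2/3 while recall stays at 1.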

In [14]:
trainer = Trainer(
    model=huggingface_model,           # model to train
    args=training_arguments,           # arguments configured through TrainingArguments
    train_dataset=hf_train_dataset,    # training dataset
    eval_dataset=hf_val_dataset,       # evaluation dataset
    compute_metrics=compute_metrics,   # metric function defined above
)
trainer.train()

The following columns in the training set  don't have a corresponding argument in `DistilBertForSequenceClassification.forward` and have been ignored: idx, sentence1, sentence2.
***** Running training *****
  Num examples = 3668
  Num Epochs = 3
  Instantaneous batch size per device = 8
  Total train batch size (w. parallel, distributed & accumulation) = 8
  Gradient Accumulation steps = 1
  Total optimization steps = 1377


| Epoch | Training Loss | Validation Loss | Accuracy | F1 |
|---|---|---|---|---|
| 1 | No log | 0.419615 | 0.838235 | 0.890728 |
| 2 | 0.503800 | 0.488906 | 0.845588 | 0.896552 |
| 3 | 0.322300 | 0.555631 | 0.830882 | 0.883642 |


The following columns in the evaluation set  don't have a corresponding argument in `DistilBertForSequenceClassification.forward` and have been ignored: idx, sentence1, sentence2.
***** Running Evaluation *****
  Num examples = 408
  Batch size = 8
Saving model checkpoint to /aiffel/aiffel/transformers/checkpoint-500
Configuration saved in /aiffel/aiffel/transformers/checkpoint-500/config.json
Model weights saved in /aiffel/aiffel/transformers/checkpoint-500/pytorch_model.bin
The following columns in the evaluation set  don't have a corresponding argument in `DistilBertForSequenceClassification.forward` and have been ignored: idx, sentence1, sentence2.
***** Running Evaluation *****
  Num examples = 408
  Batch size = 8
Saving model checkpoint to /aiffel/aiffel/transformers/checkpoint-1000
Configuration saved in /aiffel/aiffel/transformers/checkpoint-1000/config.json
Model weights saved in /aiffel/aiffel/transformers/checkpoint-1000/pytorch_model.bin
The following columns in the evalua

TrainOutput(global_step=1377, training_loss=0.36206477434633433, metrics={'train_runtime': 572.0924, 'train_samples_per_second': 19.235, 'train_steps_per_second': 2.407, 'total_flos': 1457671254810624.0, 'train_loss': 0.36206477434633433, 'epoch': 3.0})

In [15]:
trainer.evaluate(hf_test_dataset)

The following columns in the evaluation set  don't have a corresponding argument in `DistilBertForSequenceClassification.forward` and have been ignored: idx, sentence1, sentence2.
***** Running Evaluation *****
  Num examples = 1725
  Batch size = 8


{'eval_loss': 0.5840746164321899,
 'eval_accuracy': 0.8260869565217391,
 'eval_f1': 0.8739495798319329,
 'eval_runtime': 28.6173,
 'eval_samples_per_second': 60.278,
 'eval_steps_per_second': 7.548,
 'epoch': 3.0}
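
The throughput figures in this output follow directly from the example count, the batch size, and the runtime. The short check below reproduces them from the values printed above, assuming the last partial batch counts as a full step.

```python
import math

num_eval_examples = 1725     # "Num examples" in the evaluation log
eval_batch_size = 8          # "Batch size" in the evaluation log
eval_runtime = 28.6173       # 'eval_runtime' in seconds

eval_steps = math.ceil(num_eval_examples / eval_batch_size)

samples_per_second = round(num_eval_examples / eval_runtime, 3)
steps_per_second = round(eval_steps / eval_runtime, 3)

print(eval_steps, samples_per_second, steps_per_second)
```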

#### 4-2. Training on the custom dataset
- Train with tf_train_dataset, tf_val_dataset, and tf_test_dataset as well, and compare the results.

In [16]:
# Free memory by dropping the reference to the previous model.
del huggingface_model
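
One caveat worth keeping in mind: `del` removes a name, not the object itself, and the object is released only once nothing else refers to it. In this notebook the earlier `trainer` still holds the model, so memory is likely freed only after `trainer` is reassigned or deleted as well. The toy sketch below illustrates the idea with stand-in classes (`ToyModel` and `ToyTrainer` are hypothetical and unrelated to the real library classes).

```python
import gc
import weakref


class ToyModel:
    """Stand-in for a large model object."""


class ToyTrainer:
    """Stand-in for a trainer that keeps a reference to its model."""

    def __init__(self, model):
        self.model = model


model = ToyModel()
trainer = ToyTrainer(model)
probe = weakref.ref(model)   # lets us check whether the object still exists

del model                    # removes only the name `model`
gc.collect()
alive_after_del_model = probe() is not None

del trainer                  # drops the last reference to the model
gc.collect()
alive_after_del_trainer = probe() is not None

print(alive_after_del_model, alive_after_del_trainer)
```

When the model lives on a GPU, calling `torch.cuda.empty_cache()` after the last reference is gone may additionally return cached memory to the device.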

In [25]:
import os
os.environ['CUDA_LAUNCH_BLOCKING'] = '1'
os.environ["CUDA_VISIBLE_DEVICES"] = '0'

In [19]:
huggingface_tokenizer = AutoTokenizer.from_pretrained('distilbert-base-uncased')
huggingface_model = AutoModelForSequenceClassification.from_pretrained('distilbert-base-uncased', num_labels = 2)

training_arguments = TrainingArguments(
    output_dir,                       # directory where outputs are saved
    evaluation_strategy="epoch",      # how often to evaluate
    learning_rate=2e-5,               # learning rate
    per_device_train_batch_size=8,    # training batch size per device
    per_device_eval_batch_size=8,     # batch size during evaluation
    num_train_epochs=3,               # total number of training epochs
    weight_decay=0.01,                # weight decay
)

trainer = Trainer(
    model=huggingface_model,           # model to train
    args=training_arguments,           # arguments configured through TrainingArguments
    train_dataset=tf_train_dataset,    # training dataset
    eval_dataset=tf_val_dataset,       # evaluation dataset
    compute_metrics=compute_metrics,   # metric function defined above
)
trainer.train()

loading configuration file https://huggingface.co/distilbert-base-uncased/resolve/main/config.json from cache at /aiffel/.cache/huggingface/transformers/23454919702d26495337f3da04d1655c7ee010d5ec9d77bdb9e399e00302c0a1.91b885ab15d631bf9cee9dc9d25ece0afd932f2f5130eba28f2055b2220c0333
Model config DistilBertConfig {
  "activation": "gelu",
  "architectures": [
    "DistilBertForMaskedLM"
  ],
  "attention_dropout": 0.1,
  "dim": 768,
  "dropout": 0.1,
  "hidden_dim": 3072,
  "initializer_range": 0.02,
  "max_position_embeddings": 512,
  "model_type": "distilbert",
  "n_heads": 12,
  "n_layers": 6,
  "pad_token_id": 0,
  "qa_dropout": 0.1,
  "seq_classif_dropout": 0.2,
  "sinusoidal_pos_embds": false,
  "tie_weights_": true,
  "transformers_version": "4.11.3",
  "vocab_size": 30522
}

loading file https://huggingface.co/distilbert-base-uncased/resolve/main/vocab.txt from cache at /aiffel/.cache/huggingface/transformers/0e1bbfda7f63a99bb52e3915dcf10c3c92122b827d92eb2d34ce94ee79ba486c.d789d6

| Epoch | Training Loss | Validation Loss | Accuracy | F1 |
|---|---|---|---|---|
| 1 | No log | 0.511601 | 0.818627 | 0.88141 |
| 2 | 0.511100 | 0.39908 | 0.843137 | 0.890034 |
| 3 | 0.311900 | 0.527691 | 0.848039 | 0.893103 |


The following columns in the evaluation set  don't have a corresponding argument in `DistilBertForSequenceClassification.forward` and have been ignored: idx, sentence1, sentence2.
***** Running Evaluation *****
  Num examples = 408
  Batch size = 8
Saving model checkpoint to /aiffel/aiffel/transformers/checkpoint-500
Configuration saved in /aiffel/aiffel/transformers/checkpoint-500/config.json
Model weights saved in /aiffel/aiffel/transformers/checkpoint-500/pytorch_model.bin
The following columns in the evaluation set  don't have a corresponding argument in `DistilBertForSequenceClassification.forward` and have been ignored: idx, sentence1, sentence2.
***** Running Evaluation *****
  Num examples = 408
  Batch size = 8
Saving model checkpoint to /aiffel/aiffel/transformers/checkpoint-1000
Configuration saved in /aiffel/aiffel/transformers/checkpoint-1000/config.json
Model weights saved in /aiffel/aiffel/transformers/checkpoint-1000/pytorch_model.bin
The following columns in the evalua

TrainOutput(global_step=1377, training_loss=0.35851471235376386, metrics={'train_runtime': 563.0815, 'train_samples_per_second': 19.542, 'train_steps_per_second': 2.445, 'total_flos': 1457671254810624.0, 'train_loss': 0.35851471235376386, 'epoch': 3.0})
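
To compare the two runs, the per-epoch validation numbers from both training logs can be placed side by side. The sketch below copies those values by hand and picks, for each run, the epoch with the lowest validation loss; the run names and the `best_epoch` helper are hypothetical labels for illustration.

```python
# (epoch, validation loss, accuracy, F1) copied from the two training logs.
runs = {
    "hf_dataset": [
        (1, 0.419615, 0.838235, 0.890728),
        (2, 0.488906, 0.845588, 0.896552),
        (3, 0.555631, 0.830882, 0.883642),
    ],
    "tfds_dataset": [
        (1, 0.511601, 0.818627, 0.88141),
        (2, 0.39908, 0.843137, 0.890034),
        (3, 0.527691, 0.848039, 0.893103),
    ],
}


def best_epoch(rows):
    """Return the row with the lowest validation loss."""
    return min(rows, key=lambda row: row[1])


for name, rows in runs.items():
    epoch, val_loss, acc, f1 = best_epoch(rows)
    print(f"{name}: best epoch {epoch}, val_loss={val_loss}, acc={acc}, f1={f1}")

# Difference in final-epoch accuracy between the two runs.
final_gap = runs["tfds_dataset"][-1][2] - runs["hf_dataset"][-1][2]
print(round(final_gap, 6))
```

Both runs end within about two accuracy points of each other on the validation set, and in both the validation loss rises by the third epoch while the training loss keeps falling.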