# Project: Building a Custom Project from Scratch
==========================================================================================

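## Rubric Criteria
1. A dedicated Processor class for the MNLI dataset is implemented correctly.
    - A unit test on at least one example runs successfully against the Processor class.
2. The dataset is built correctly by combining the BERT tokenizer with the Processor.
    - The resulting tf.data.Dataset instance matches the MNLI definition of inputs and labels.
3. A suitable model is fine-tuned on the MNLI dataset.
    - Model training runs without errors.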

---
## Table of Contents
>### 1. Using a TensorFlow Dataset
>>### 1.1 Loading the TensorFlow Dataset
>>### 1.2 Loading the Hugging Face Model and Tokenizer for TensorFlow
>>### 1.3 Training and Evaluating the TensorFlow Model
>### 2. Using a Hugging Face Dataset
>>### 2.1 Loading the Hugging Face Dataset, Model, and Tokenizer
>>### 2.2 Training and Evaluating the Hugging Face Model
>### 3. Retrospective
>### 4. Reference
>### 5. Resolutions and Regrets

---
## 1. Using a TensorFlow Dataset

---
## 1.1 Loading the TensorFlow Dataset

In [1]:
!pip uninstall transformers -y
!pip install transformers
!pip install tensorflow-datasets -U

Found existing installation: transformers 4.25.1
Uninstalling transformers-4.25.1:
  Successfully uninstalled transformers-4.25.1
Collecting transformers
  Using cached transformers-4.25.1-py3-none-any.whl (5.8 MB)
Installing collected packages: transformers
Successfully installed transformers-4.25.1


In [2]:
import os
import csv  # needed by DataProcessor._read_tsv below
import numpy as np
from argparse import ArgumentParser

import tensorflow as tf
import tensorflow_datasets as tfds

import transformers
from transformers import DistilBertTokenizer, TFDistilBertForSequenceClassification
from transformers import AutoConfig, AutoTokenizer, AutoModelForSequenceClassification
from transformers import Trainer, TrainingArguments
from transformers.data.processors.utils import DataProcessor, InputExample, InputFeatures

import datasets
from datasets import load_dataset, load_metric
from dataclasses import asdict

import warnings
warnings.filterwarnings('ignore')

import logging
logging.disable(logging.WARNING)

transformers.logging.set_verbosity_error()

In [3]:
# Download the glue/mnli dataset from TensorFlow Datasets
data = tfds.load('glue/mnli')

In [4]:
class DataProcessor:
    """Base class for data converters for sequence classification data sets."""

    def get_example_from_tensor_dict(self, tensor_dict):
        """
        Gets an example from a dict with tensorflow tensors.

        Args:
            tensor_dict: Keys and values should match the corresponding Glue
                tensorflow_dataset examples.
        """
        raise NotImplementedError()

    def get_train_examples(self, data_dir):
        """Gets a collection of :class:`InputExample` for the train set."""
        raise NotImplementedError()

    def get_dev_examples(self, data_dir):
        """Gets a collection of :class:`InputExample` for the dev set."""
        raise NotImplementedError()

    def get_test_examples(self, data_dir):
        """Gets a collection of :class:`InputExample` for the test set."""
        raise NotImplementedError()

    def get_labels(self):
        """Gets the list of labels for this data set."""
        raise NotImplementedError()

    def tfds_map(self, example):
        """
        Some tensorflow_datasets datasets are not formatted the same way the GLUE datasets are. This method converts
        examples to the correct format.
        """
        if len(self.get_labels()) > 1:
            example.label = self.get_labels()[int(example.label)]
        return example

    @classmethod
    def _read_tsv(cls, input_file, quotechar=None):
        """Reads a tab separated value file."""
        with open(input_file, "r", encoding="utf-8-sig") as f:
            return list(csv.reader(f, delimiter="\t", quotechar=quotechar))

In [5]:
class MnliProcessor(DataProcessor):
    """Processor for the MNLI data set (GLUE version)."""
    def __init__(self, *args, **kwargs):
        super().__init__(*args, **kwargs)

    def get_example_from_tensor_dict(self, tensor_dict):
        """See base class."""
        # Note: the hypothesis is passed as text_a and the premise as text_b,
        # the reverse of the usual (premise, hypothesis) order.
        return InputExample(
            tensor_dict["idx"].numpy(),
            tensor_dict["hypothesis"].numpy().decode("utf-8"),
            tensor_dict["premise"].numpy().decode("utf-8"),
            str(tensor_dict["label"].numpy()),
        )

    def get_train_examples(self, data_dir):
        """See base class."""
        print("LOOKING AT {}".format(os.path.join(data_dir, "train.tsv")))
        return self._create_examples(self._read_tsv(os.path.join(data_dir, "train.tsv")), "train")

    def get_dev_examples(self, data_dir):
        """See base class."""
        return self._create_examples(self._read_tsv(os.path.join(data_dir, "dev.tsv")), "dev")

    def get_test_examples(self, data_dir):
        """See base class."""
        return self._create_examples(self._read_tsv(os.path.join(data_dir, "test.tsv")), "test")

    def get_labels(self):
        """See base class."""
        return ["0", "1", "2"]

    def _create_examples(self, lines, set_type):
        """Creates examples for the training, dev and test sets."""
        examples = []
        for (i, line) in enumerate(lines):
            if i == 0:
                continue
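            guid = "%s-%s" % (set_type, i)
            # NOTE: these column indices follow the MRPC TSV layout; the MNLI TSV
            # files use different columns. This TSV code path is not exercised in
            # this notebook, which reads the data from TFDS instead.
            text_a = line[3]
            text_b = line[4]
            label = None if set_type == "test" else line[0]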
            examples.append(InputExample(guid=guid, text_a=text_a, text_b=text_b, label=label))
        return examples

In [6]:
processor = MnliProcessor()
examples = data['train'].take(1)

for example in examples:
    print('------ raw data ------')
    print(example)
    example = processor.get_example_from_tensor_dict(example)
    print('------ processor output ------')
    print(example)

------ raw data ------
{'hypothesis': <tf.Tensor: shape=(), dtype=string, numpy=b'Meaningful partnerships with stakeholders is crucial.'>, 'idx': <tf.Tensor: shape=(), dtype=int32, numpy=16399>, 'label': <tf.Tensor: shape=(), dtype=int64, numpy=1>, 'premise': <tf.Tensor: shape=(), dtype=string, numpy=b'In recognition of these tensions, LSC has worked diligently since 1995 to convey the expectations of the State Planning Initiative and to establish meaningful partnerships with stakeholders aimed at fostering a new symbiosis between the federal provider and recipients of legal services funding.'>}
------ processor output ------
InputExample(guid=16399, text_a='Meaningful partnerships with stakeholders is crucial.', text_b='In recognition of these tensions, LSC has worked diligently since 1995 to convey the expectations of the State Planning Initiative and to establish meaningful partnerships with stakeholders aimed at fostering a new symbiosis between the federal provider and recipients of legal

In [7]:
label_list = processor.get_labels()
label_list

['0', '1', '2']

In [8]:
label_map = {label: i for i, label in enumerate(label_list)}
label_map

{'0': 0, '1': 1, '2': 2}

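- The output confirms that the processor class works as intended: the raw tensor dict is converted into an `InputExample`, and the labels map to `0`, `1`, and `2`.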
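The check above can also be written as an automated unit test. The following is a minimal sketch that runs without TensorFlow: `SimpleExample` and `example_from_raw_dict` are hypothetical stand-ins for `InputExample` and `get_example_from_tensor_dict`, and the premise string is a toy value, not real data.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class SimpleExample:
    """Hypothetical stand-in for transformers' InputExample."""
    guid: int
    text_a: str
    text_b: str
    label: Optional[str] = None


def example_from_raw_dict(raw):
    """Mirror get_example_from_tensor_dict for plain Python values.

    Text fields arrive as bytes, as TFDS string tensors do after .numpy().
    """
    return SimpleExample(
        guid=int(raw["idx"]),
        text_a=raw["hypothesis"].decode("utf-8"),
        text_b=raw["premise"].decode("utf-8"),
        label=str(raw["label"]),
    )


raw = {
    "idx": 16399,
    "hypothesis": b"Meaningful partnerships with stakeholders is crucial.",
    "premise": b"A toy premise used only for this test.",  # not real data
    "label": 1,
}
example = example_from_raw_dict(raw)

# The unit test: every field must survive the conversion unchanged.
assert example.guid == 16399
assert example.text_a == "Meaningful partnerships with stakeholders is crucial."
assert example.text_b == "A toy premise used only for this test."
assert example.label == "1"
```

Keeping the conversion logic separate from TensorFlow makes it possible to test it without downloading the dataset.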

---
## 1.2 Loading the Hugging Face Model and Tokenizer for TensorFlow

In [9]:
tokenizer = DistilBertTokenizer.from_pretrained("typeform/distilbert-base-uncased-mnli")
model = TFDistilBertForSequenceClassification.from_pretrained("typeform/distilbert-base-uncased-mnli")

- Both the model and the tokenizer are loaded from `typeform/distilbert-base-uncased-mnli`.

In [10]:
def _glue_convert_examples_to_features(examples, tokenizer, max_length, processor, label_list=None, output_mode="classification"):
    if max_length is None:
        # `max_len` is deprecated; `model_max_length` is the current attribute.
        max_length = tokenizer.model_max_length
    if label_list is None:
        label_list = processor.get_labels()
        print("Using label list %s" % (label_list))

    label_map = {label: i for i, label in enumerate(label_list)}
    labels = [label_map[example.label] for example in examples]

    batch_encoding = tokenizer(
        [(example.text_a, example.text_b) for example in examples],
        max_length=max_length,
        padding="max_length",
        truncation=True,
    )

    features = []
    for i in range(len(examples)):
        inputs = {k: batch_encoding[k][i] for k in batch_encoding}

        feature = InputFeatures(**inputs, label=labels[i])
        features.append(feature)

    for i, example in enumerate(examples[:2]):
        print("*** Example ***")
        print("guid: %s" % (example.guid))
        print("features: %s" % features[i])

    return features

In [11]:
def tf_glue_convert_examples_to_features(examples, tokenizer, max_length, processor, label_list=None, output_mode="classification") :
    """
    :param examples: tf.data.Dataset
    :param tokenizer: pretrained tokenizer
    :param max_length: maximum example length (defaults to the tokenizer's maximum length)
    :param processor: processor that converts tensor dicts into InputExample objects
    :param label_list: list of labels
    :param output_mode: "regression" or "classification"

    :return: a tf.data.Dataset whose features are arranged for the task
    """
    examples = [processor.tfds_map(processor.get_example_from_tensor_dict(example)) for example in examples]
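    # Forward label_list and output_mode instead of silently dropping them.
    features = _glue_convert_examples_to_features(
        examples, tokenizer, max_length, processor, label_list=label_list, output_mode=output_mode
    )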
    label_type = tf.int64

    def gen():
        for ex in features:
            d = {k: v for k, v in asdict(ex).items() if v is not None}
            label = d.pop("label")
            yield (d, label)

    # model_input_names already contains "input_ids", so there is no need to prepend it.
    input_names = tokenizer.model_input_names

    return tf.data.Dataset.from_generator(
        gen,
        ({k: tf.int32 for k in input_names}, label_type),
        ({k: tf.TensorShape([None]) for k in input_names}, tf.TensorShape([])),
    )
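The inner `gen()` function is the step that turns each feature object into an `(inputs, label)` pair. The sketch below reproduces that step in plain Python; `SimpleFeatures` is a hypothetical stand-in for `InputFeatures`, and the token ids are toy values.

```python
from dataclasses import dataclass, asdict
from typing import List, Optional


@dataclass(frozen=True)
class SimpleFeatures:
    """Hypothetical stand-in for transformers' InputFeatures."""
    input_ids: List[int]
    attention_mask: Optional[List[int]] = None
    token_type_ids: Optional[List[int]] = None
    label: Optional[int] = None


def to_model_inputs(features):
    """Yield (inputs, label) pairs, dropping fields whose value is None."""
    for ex in features:
        d = {k: v for k, v in asdict(ex).items() if v is not None}
        label = d.pop("label")
        yield d, label


# DistilBERT has no token_type_ids, so that field stays None and is dropped.
features = [SimpleFeatures(input_ids=[101, 102, 0], attention_mask=[1, 1, 0], label=1)]
pairs = list(to_model_inputs(features))
print(pairs)
```

Filtering out the `None` fields is why the resulting dataset only carries `input_ids` and `attention_mask`.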

In [12]:
train_dataset = tf_glue_convert_examples_to_features(data['train'], tokenizer, max_length=64, processor=processor)

Using label list ['0', '1', '2']
*** Example ***
guid: 16399
features: InputFeatures(input_ids=[101, 15902, 13797, 2007, 22859, 2003, 10232, 1012, 102, 1999, 5038, 1997, 2122, 13136, 1010, 1048, 11020, 2038, 2499, 29454, 29206, 14626, 2144, 2786, 2000, 16636, 1996, 10908, 1997, 1996, 2110, 4041, 6349, 1998, 2000, 5323, 15902, 13797, 2007, 22859, 6461, 2012, 6469, 2075, 1037, 2047, 25353, 14905, 10735, 2483, 2090, 1996, 2976, 10802, 1998, 15991, 1997, 3423, 2578, 4804, 1012, 102, 0, 0], attention_mask=[1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 0], token_type_ids=None, label=1)
*** Example ***
guid: 206287
features: InputFeatures(input_ids=[101, 1996, 7207, 8771, 2921, 2000, 1996, 3020, 2598, 1999, 1996, 6594, 1012, 102, 1996, 7207, 7505, 21799, 2015, 2036, 2218, 1996, 2152, 2598, 1999, 1996, 6123, 2162, 1012, 102, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0

In [13]:
examples = train_dataset.take(1)
for example in examples:
    print(example)

({'input_ids': <tf.Tensor: shape=(64,), dtype=int32, numpy=
array([  101, 15902, 13797,  2007, 22859,  2003, 10232,  1012,   102,
        1999,  5038,  1997,  2122, 13136,  1010,  1048, 11020,  2038,
        2499, 29454, 29206, 14626,  2144,  2786,  2000, 16636,  1996,
       10908,  1997,  1996,  2110,  4041,  6349,  1998,  2000,  5323,
       15902, 13797,  2007, 22859,  6461,  2012,  6469,  2075,  1037,
        2047, 25353, 14905, 10735,  2483,  2090,  1996,  2976, 10802,
        1998, 15991,  1997,  3423,  2578,  4804,  1012,   102,     0,
           0], dtype=int32)>, 'attention_mask': <tf.Tensor: shape=(64,), dtype=int32, numpy=
array([1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
       1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
       1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 0],
      dtype=int32)>}, <tf.Tensor: shape=(), dtype=int64, numpy=1>)
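The trailing zeros in `input_ids` and `attention_mask` come from `padding="max_length"`. As a rough illustration only, the helper below (a hypothetical `pad_to_max_length`, not the tokenizer's actual implementation) shows the idea with a pad id of 0; the real tokenizer additionally keeps the special tokens intact when it truncates.

```python
def pad_to_max_length(token_ids, max_length, pad_id=0):
    """Truncate or right-pad token_ids and build the matching attention mask."""
    ids = list(token_ids[:max_length])
    mask = [1] * len(ids)            # 1 marks a real token
    n_pad = max_length - len(ids)    # 0 when the sequence was truncated
    return ids + [pad_id] * n_pad, mask + [0] * n_pad


input_ids, attention_mask = pad_to_max_length([101, 2008, 2795, 102], max_length=8)
print(input_ids)
print(attention_mask)
```

The attention mask tells the model which positions are padding and should be ignored.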


In [14]:
# train dataset
train_dataset_batch = train_dataset.shuffle(100).batch(16).repeat(2)

In [15]:
# validation dataset
validation_dataset = tf_glue_convert_examples_to_features(data['validation_mismatched'], tokenizer, max_length=64, processor=processor)
validation_dataset_batch = validation_dataset.shuffle(100).batch(16)

Using label list ['0', '1', '2']
*** Example ***
guid: 9410
features: InputFeatures(input_ids=[101, 2122, 3934, 2024, 4321, 6439, 1998, 2123, 1005, 1056, 4254, 3087, 1012, 102, 3934, 2029, 4372, 3669, 8159, 1998, 4372, 13149, 1996, 3076, 3325, 1998, 4009, 2070, 1997, 2256, 10418, 5784, 1998, 5089, 2000, 2256, 3721, 1011, 1011, 1998, 2000, 2256, 2103, 1012, 102, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0], attention_mask=[1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0], token_type_ids=None, label=2)
*** Example ***
guid: 4506
features: InputFeatures(input_ids=[101, 1996, 2110, 2610, 1999, 2225, 3448, 2020, 11925, 4953, 4311, 1997, 5357, 2948, 2015, 1012, 102, 1996, 2415, 2036, 11925, 1996, 2225, 3448, 2110, 2610, 1998, 2356, 3251, 2151, 4311, 1997, 1037, 20164, 2948, 2018, 2042, 2363, 1012, 102, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0

In [16]:
# test dataset
test_dataset = tf_glue_convert_examples_to_features(data['test_mismatched'], tokenizer, max_length=64, processor=processor)
test_dataset_batch = test_dataset.shuffle(100).batch(16)

Using label list ['0', '1', '2']
*** Example ***
guid: 3498
features: InputFeatures(input_ids=[101, 2008, 2795, 2038, 2042, 1999, 2026, 2155, 2005, 8213, 1012, 102, 4312, 1010, 1045, 8813, 2008, 2795, 1012, 102, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0], attention_mask=[1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0], token_type_ids=None, label=2)
*** Example ***
guid: 5295
features: InputFeatures(input_ids=[101, 1996, 20864, 2311, 2001, 2718, 2011, 2019, 13297, 2044, 3157, 1051, 1005, 5119, 1012, 102, 1996, 20864, 2018, 2042, 4930, 2011, 2137, 6255, 2012, 1023, 1024, 4261, 1024, 4805, 1012, 102, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0], attention_mask=[1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1

- The train, validation, and test datasets are each batched with a batch size of 16.

---
## 1.3 Training and Evaluating the TensorFlow Model

In [17]:
num_classes = len(processor.get_labels()) # three classes: 0, 1, 2

optimizer = tf.keras.optimizers.Adam(learning_rate=3e-5)
loss = tf.keras.losses.SparseCategoricalCrossentropy(from_logits=True)

In [18]:
model.compile(optimizer=optimizer, loss=loss, metrics=['acc'])

model.summary()

Model: "tf_distil_bert_for_sequence_classification"
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
distilbert (TFDistilBertMain multiple                  66362880  
_________________________________________________________________
pre_classifier (Dense)       multiple                  590592    
_________________________________________________________________
classifier (Dense)           multiple                  2307      
_________________________________________________________________
dropout_19 (Dropout)         multiple                  0         
Total params: 66,955,779
Trainable params: 66,955,779
Non-trainable params: 0
_________________________________________________________________


- The loaded DistilBERT model has about 67 million parameters.

In [19]:
# Use the batched datasets (xxxx_dataset_batch) prepared in the previous step
model.fit(train_dataset_batch, epochs=5, steps_per_epoch=115, 
                validation_data=validation_dataset_batch)

Epoch 1/5
Epoch 2/5
Epoch 3/5
Epoch 4/5
Epoch 5/5


<keras.callbacks.History at 0x7f464040d880>
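It helps to work out how much data this `fit` call actually consumes. The arithmetic below is a sketch that assumes the TFDS train split has the same 392,702 rows as the Hugging Face listing of the same split shown later in this notebook.

```python
import math

batch_size = 16
steps_per_epoch = 115
epochs = 5
repeats = 2            # from .repeat(2) on the batched train dataset
train_rows = 392702    # assumed size of the MNLI train split

steps_total = steps_per_epoch * epochs                # optimizer steps in total
examples_seen = steps_total * batch_size              # examples fed to the model
batches_available = math.ceil(train_rows / batch_size) * repeats
fraction_of_split = examples_seen / train_rows

print(steps_total, examples_seen, batches_available)
print(f"{fraction_of_split:.1%} of the train split")
```

Under these assumptions the run uses 575 steps and 9,200 examples, a small fraction of the split, so the repeated dataset is never exhausted.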

In [20]:
result = model.evaluate(test_dataset_batch)
print(result)

[2.4400217533111572, 0.3401035964488983]


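- Train and validation accuracy looked fine, but test accuracy is very low (about 0.34).
- This number does not measure the model's robustness. The GLUE test split has no public labels (they are stored as -1), and `tfds_map` resolves `get_labels()[-1]` to `'2'`, so every test example is scored against class 2, as the `label=2` in both printed test examples suggests.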
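One likely cause of the low test score is Python's negative indexing rather than the model itself. The sketch below assumes the TFDS test split stores -1 as its label, as the Hugging Face version of the same split does; `map_label_checked` is a hypothetical guard, not part of the original processor.

```python
labels = ["0", "1", "2"]


def map_label_unsafe(raw_label):
    """Same indexing as tfds_map: -1 silently selects the last label."""
    return labels[int(raw_label)]


def map_label_checked(raw_label):
    """Hypothetical guard: return None for out-of-range (unlabeled) examples."""
    index = int(raw_label)
    return labels[index] if 0 <= index < len(labels) else None


print(map_label_unsafe(-1))   # prints 2
print(map_label_checked(-1))  # prints None
print(map_label_checked(1))   # prints 1
```

With the unchecked version, every unlabeled test example ends up with label 2, so the reported accuracy only reflects how often the model predicts class 2.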

In [21]:
output_dir = os.getenv('HOME')+'/aiffel/Going_Deeper/16_GD/transformers' # directory for saving result files
output_eval_file = os.path.join(output_dir, "tf_eval_results.txt")

with open(output_eval_file, "w") as writer:
    for i, v in enumerate(result) :
        if i == 0 :
            writer.write("Loss = %f\t" %(v))
        if i == 1 :
            writer.write("Accuracy = %f\n" %(v))
print("Done")

# Check the test results written to the file
!cat ~/aiffel/Going_Deeper/16_GD/transformers/tf_eval_results.txt

Done
Loss = 2.440022	Accuracy = 0.340104


---
## 2. Using a Hugging Face Dataset

---
## 2.1 Loading the Hugging Face Dataset, Model, and Tokenizer

In [22]:
# Delete the variables used for TensorFlow training to free memory

del data
del processor
del tokenizer
del model
del train_dataset
del train_dataset_batch
del validation_dataset
del validation_dataset_batch
del test_dataset
del test_dataset_batch

In [23]:
import datasets
from datasets import load_dataset

huggingface_mnli_dataset = load_dataset('glue', 'mnli')
print(huggingface_mnli_dataset)

DatasetDict({
    train: Dataset({
        features: ['premise', 'hypothesis', 'label', 'idx'],
        num_rows: 392702
    })
    validation_matched: Dataset({
        features: ['premise', 'hypothesis', 'label', 'idx'],
        num_rows: 9815
    })
    validation_mismatched: Dataset({
        features: ['premise', 'hypothesis', 'label', 'idx'],
        num_rows: 9832
    })
    test_matched: Dataset({
        features: ['premise', 'hypothesis', 'label', 'idx'],
        num_rows: 9796
    })
    test_mismatched: Dataset({
        features: ['premise', 'hypothesis', 'label', 'idx'],
        num_rows: 9847
    })
})


- The Hugging Face dataset loaded correctly.

---
## 2.2 Training and Evaluating the Hugging Face Model

In [24]:
huggingface_tokenizer = AutoTokenizer.from_pretrained("typeform/distilbert-base-uncased-mnli")
huggingface_model = AutoModelForSequenceClassification.from_pretrained("typeform/distilbert-base-uncased-mnli", num_labels = 3)

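- The same `typeform/distilbert-base-uncased-mnli` checkpoint is used as above, this time loaded through the `Auto*` classes (`AutoTokenizer`, `AutoModelForSequenceClassification`).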

In [25]:
def transform(data):
  return huggingface_tokenizer(
      data['hypothesis'],
      data['premise'],
      truncation = True,
      padding = 'max_length',
      return_token_type_ids = False,
      )
  
examples = huggingface_mnli_dataset['train'][:2]
examples_transformed = transform(examples)

print(examples)
print(examples_transformed)

{'premise': ['Conceptually cream skimming has two basic dimensions - product and geography.', 'you know during the season and i guess at at your level uh you lose them to the next level if if they decide to recall the the parent team the Braves decide to call to recall a guy from triple A then a double A guy goes up to replace him and a single A guy goes up to replace him'], 'hypothesis': ['Product and geography are what make cream skimming work. ', 'You lose the things to the following level if the people recall.'], 'label': [1, 0], 'idx': [0, 1]}
{'input_ids': [[101, 4031, 1998, 10505, 2024, 2054, 2191, 6949, 8301, 25057, 2147, 1012, 102, 17158, 2135, 6949, 8301, 25057, 2038, 2048, 3937, 9646, 1011, 4031, 1998, 10505, 1012, 102, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0

In [26]:
encoded_dataset = huggingface_mnli_dataset.map(transform, batched=True)


In [27]:
# Configure the training arguments for use with Trainer

metric_name = 'accuracy'

training_arguments = TrainingArguments(
    output_dir,                        # directory where outputs are saved
    evaluation_strategy="epoch",       # how often to evaluate
    learning_rate=2e-5,                # learning rate
    per_device_train_batch_size=16,    # training batch size per device
    per_device_eval_batch_size=16,     # evaluation batch size per device
    num_train_epochs=1,                # total number of training epochs
    weight_decay=0.01,                 # weight decay
)

In [28]:
metric = load_metric('glue', 'mnli')

def compute_metrics(eval_pred):    
    predictions,labels = eval_pred
    predictions = np.argmax(predictions, axis=1)
    return metric.compute(predictions=predictions, references = labels)
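`compute_metrics` takes the arg-max of the logits and compares it with the reference labels. The same computation can be sketched in plain Python; `accuracy_from_logits` is a hypothetical helper and the logits are toy values, not model output.

```python
def argmax(row):
    """Index of the largest value in a list of scores."""
    return max(range(len(row)), key=row.__getitem__)


def accuracy_from_logits(logits, labels):
    """Fraction of rows whose highest-scoring class equals the label."""
    predictions = [argmax(row) for row in logits]
    correct = sum(int(p == y) for p, y in zip(predictions, labels))
    return {"accuracy": correct / len(labels)}


# Toy logits for four examples over the three MNLI classes.
logits = [[2.0, 0.1, -1.0], [0.2, 0.3, 1.5], [0.0, 3.0, 0.5], [1.0, 0.9, 0.8]]
labels = [0, 2, 1, 2]
result = accuracy_from_logits(logits, labels)
print(result)  # prints {'accuracy': 0.75}
```

The predictions here are 0, 2, 1, 0, so three of the four examples match their labels.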

In [29]:
trainer = Trainer(
    model=huggingface_model,                                # model to train
    args=training_arguments,                                # arguments configured with TrainingArguments
    train_dataset=encoded_dataset['train'],                 # training dataset
    eval_dataset=encoded_dataset['validation_mismatched'],  # evaluation dataset
    compute_metrics=compute_metrics,
)
trainer.train()

{'loss': 0.6647, 'learning_rate': 1.9592568448500654e-05, 'epoch': 0.02}
{'loss': 0.3945, 'learning_rate': 1.9185136897001307e-05, 'epoch': 0.04}
{'loss': 0.3677, 'learning_rate': 1.8777705345501956e-05, 'epoch': 0.06}
{'loss': 0.3381, 'learning_rate': 1.837027379400261e-05, 'epoch': 0.08}
{'loss': 0.332, 'learning_rate': 1.796284224250326e-05, 'epoch': 0.1}
{'loss': 0.3296, 'learning_rate': 1.7555410691003914e-05, 'epoch': 0.12}
{'loss': 0.3309, 'learning_rate': 1.7147979139504566e-05, 'epoch': 0.14}
{'loss': 0.3056, 'learning_rate': 1.6740547588005215e-05, 'epoch': 0.16}
{'loss': 0.3176, 'learning_rate': 1.6333116036505868e-05, 'epoch': 0.18}
{'loss': 0.3218, 'learning_rate': 1.592568448500652e-05, 'epoch': 0.2}
{'loss': 0.3193, 'learning_rate': 1.5518252933507173e-05, 'epoch': 0.22}
{'loss': 0.3061, 'learning_rate': 1.5110821382007822e-05, 'epoch': 0.24}
{'loss': 0.3047, 'learning_rate': 1.4703389830508477e-05, 'epoch': 0.26}
{'loss': 0.2998, 'learning_rate': 1.4295958279009128e-05,

TrainOutput(global_step=24544, training_loss=0.296798639922285, metrics={'train_runtime': 17997.9086, 'train_samples_per_second': 21.819, 'train_steps_per_second': 1.364, 'train_loss': 0.296798639922285, 'epoch': 1.0})

- Even a single epoch takes a long time: about 18,000 seconds (roughly 5 hours) according to `train_runtime`.
- The evaluation shows an accuracy of about 81%.
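The figures in `TrainOutput` can be cross-checked with a little arithmetic. This sketch assumes training ran on a single device, so the global batch size equals `per_device_train_batch_size`.

```python
import math

train_rows = 392702           # rows in the train split
batch_size = 16               # per_device_train_batch_size, single device assumed
train_runtime_s = 17997.9086  # train_runtime from TrainOutput

steps_per_epoch = math.ceil(train_rows / batch_size)
samples_per_second = train_rows / train_runtime_s
hours = train_runtime_s / 3600

print(steps_per_epoch)               # prints 24544
print(round(samples_per_second, 3))  # prints 21.819
print(round(hours, 2))               # prints 5.0
```

Both the step count and the throughput agree with the `global_step` and `train_samples_per_second` values reported above.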

In [36]:
encoded_dataset

DatasetDict({
    train: Dataset({
        features: ['attention_mask', 'hypothesis', 'idx', 'input_ids', 'label', 'premise'],
        num_rows: 392702
    })
    validation_matched: Dataset({
        features: ['attention_mask', 'hypothesis', 'idx', 'input_ids', 'label', 'premise'],
        num_rows: 9815
    })
    validation_mismatched: Dataset({
        features: ['attention_mask', 'hypothesis', 'idx', 'input_ids', 'label', 'premise'],
        num_rows: 9832
    })
    test_matched: Dataset({
        features: ['attention_mask', 'hypothesis', 'idx', 'input_ids', 'label', 'premise'],
        num_rows: 9796
    })
    test_mismatched: Dataset({
        features: ['attention_mask', 'hypothesis', 'idx', 'input_ids', 'label', 'premise'],
        num_rows: 9847
    })
})

In [50]:
encoded_dataset['train']['label']

[1, 0, 0, 0, 1, 0, 1, 0, 2, 2, 0, 2, 1, 1, 2, 0, ...]


In [53]:
encoded_dataset['test_mismatched']['label']

[-1, -1, -1, -1, -1, -1, -1, -1, ...]


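- [Hugging Face MNLI test dataset](https://huggingface.co/datasets/glue/viewer/mnli_mismatched/test)
- As the link above also shows, every label in the test split is -1. GLUE does not release its test labels publicly (predictions are scored on the benchmark's server), so -1 serves as a placeholder.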
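Before evaluating on a split, it is worth checking whether it carries gold labels at all. A minimal sketch follows; `has_gold_labels` is a hypothetical helper and the label lists are toy values rather than the real splits.

```python
def has_gold_labels(labels, placeholder=-1):
    """Return True only if no label equals the placeholder value."""
    return all(label != placeholder for label in labels)


# Toy label lists standing in for the real splits.
splits = {
    "validation_mismatched": [2, 0, 1],
    "test_mismatched": [-1, -1, -1],
}
evaluable = [name for name, labels in splits.items() if has_gold_labels(labels)]
print(evaluable)  # prints ['validation_mismatched']
```

Only splits that pass this check can yield a meaningful accuracy; for the others, predictions have to be submitted to the benchmark instead.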

In [41]:
os.environ['CUDA_LAUNCH_BLOCKING'] = "1"
os.environ["CUDA_VISIBLE_DEVICES"] = "0"

In [57]:
# The test split is not evaluated because all of its labels are -1.
#trainer.evaluate(eval_dataset = encoded_dataset['test_mismatched'])

print("Done")

Done


---
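## 3. Retrospective

## Difficulties in this project
>- The hardest part was loading and using data in a way different from what I was used to. Because it was hard to inspect the shape of the data, I could not build an intuition for it and struggled quite a bit.

## What I learned in this project
>- I learned a lot about using Hugging Face datasets and models and about searching for models, but I still do not fully understand how to fine-tune a model.

## Findings and unclear points
>- It was good to get to know Hugging Face, but the details of how to use it are still unclear to me.

## Efforts made to meet the rubric
>- __1. A unit test on at least one example runs successfully against the Processor class.__ In Section 1 (Using a TensorFlow Dataset), I declared the processor class and used it to compare the processed example with the raw example.
>- __2. The resulting tf.data.Dataset instance matches the MNLI definition of inputs and labels.__ This also worked as expected. The difference from the course node was that MRPC has two labels whereas MNLI has three.
>- __3. A suitable model is fine-tuned on the MNLI dataset.__ I chose a suitable model, and training ran without errors.

## If the rubric was not met, why
>- Because my work was not grounded in a precise understanding, there may be some room for doubt on rubric items 2 and 3. I need to keep building my understanding of the framework.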

---
## 4. Reference

- [Hugging Face MNLI test dataset](https://huggingface.co/datasets/glue/viewer/mnli_mismatched/test)
- [Hugging Face BERT model](https://huggingface.co/typeform/distilbert-base-uncased-mnli)

---
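## 5. Resolutions and Regrets

- This project is quite useful, but I wonder why it is the last Going Deeper node. It would have been better to introduce it earlier and follow it with more nodes for applying it; because it came last, it was hard.
- Still, since this is a framework I will be using from now on, I think I need to practice more.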