
Training a Token Classification Model with HuggingFace Transformers

This notebook walks through fine-tuning the `klue/bert-base` model on the **DP** (dependency parsing) dataset from **KLUE**.


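After training, we also look at a short example of how the trained model is used.

All source code is adapted from the [`huggingface-tutorial`](https://huggingface.co/course/chapter7/2).

First, install the libraries the notebook needs. `transformers` is required for model training, and `datasets` for loading the training data. `scipy` and `scikit-learn` are installed as well for evaluating model performance, and the install command below also pulls in `seqeval`.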

In [None]:
#https://towardsdatascience.com/how-to-create-and-train-a-multi-task-transformer-model-18c54a14624
#https://medium.com/@shahrukhx01/multi-task-learning-with-transformers-part-1-multi-prediction-heads-b7001cf014bf

In [3]:
!pip install -U seqeval transformers datasets scipy scikit-learn

Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/
Collecting seqeval
  Downloading seqeval-1.2.2.tar.gz (43 kB)
  Preparing metadata (setup.py) ... done
Collecting transformers
  Downloading transformers-4.28.1-py3-none-any.whl (7.0 MB)
Collecting datasets
  Downloading datasets-2.12.0-py3-none-any.whl (474 kB)
Collecting huggingface-hub<1.0,>=0.11.0
  Downloading huggingface_hub-0.14.1-py3-none-any.whl (224 kB)
Collecting tokenizers!=0.11.3,

In [None]:
#from huggingface_hub import notebook_login

#notebook_login()

## Training the token classification model

Import all the libraries needed to run the notebook.

In [4]:
import logging
import os
import random
import sys
from dataclasses import dataclass, field
from typing import Optional

import datasets
import numpy as np
from datasets import load_dataset

import transformers
from transformers import (
    AutoConfig,
    AutoModelForSequenceClassification,
    AutoTokenizer,
    DataCollatorWithPadding,
    DataCollatorForTokenClassification,
    EvalPrediction,
    HfArgumentParser,
    PretrainedConfig,
    Trainer,
    TrainingArguments,
    default_data_collator,
    set_seed,
)
from transformers.trainer_utils import get_last_checkpoint
from transformers.utils import check_min_version, send_example_telemetry
from transformers.utils.versions import require_version

In [5]:

import pandas as pd

from datasets import load_dataset, load_metric, ClassLabel, Sequence
# Will error if the minimal version of Transformers is not installed. Remove at your own risks.


task_to_keys = {
    "cola": ("sentence", None),
    "mnli": ("premise", "hypothesis"),
    "mrpc": ("sentence1", "sentence2"),
    "qnli": ("question", "sentence"),
    "qqp": ("question1", "question2"),
    "rte": ("sentence1", "sentence2"),
    "sst2": ("sentence", None),
    "stsb": ("sentence1", "sentence2"),
    "wnli": ("sentence1", "sentence2"),
}

logger = logging.getLogger(__name__)



In [6]:
import argparse
import logging
import os
from typing import Any, List, Optional, Tuple
from torch.utils.data import DataLoader, TensorDataset
from transformers import PreTrainedModel,AutoTokenizer, AutoModelForSequenceClassification, TrainingArguments, Trainer, AutoModel, PreTrainedTokenizer
from torch import nn
import torch
from transformers.modeling_outputs import TokenClassifierOutput
from torch.nn.utils.rnn import pad_packed_sequence, pack_padded_sequence
import torch.nn.functional as F
from torch.nn.parameter import Parameter


In [7]:

@dataclass
class DataTrainingArguments:
    """
    Arguments pertaining to what data we are going to input our model for training and eval.
    Using `HfArgumentParser` we can turn this class
    into argparse arguments to be able to specify them on
    the command line.
    """

    task_name: Optional[str] = field(
        default=None,
        metadata={"help": "The name of the task to train on: " + ", ".join(task_to_keys.keys())},
    )
    dataset_name: Optional[str] = field(
        default=None, metadata={"help": "The name of the dataset to use (via the datasets library)."}
    )
    dataset_config_name: Optional[str] = field(
        default=None, metadata={"help": "The configuration name of the dataset to use (via the datasets library)."}
    )
    max_seq_length: int = field(
        default=128,
        metadata={
            "help": (
                "The maximum total input sequence length after tokenization. Sequences longer "
                "than this will be truncated, sequences shorter will be padded."
            )
        },
    )
    overwrite_cache: bool = field(
        default=False, metadata={"help": "Overwrite the cached preprocessed datasets or not."}
    )
    pad_to_max_length: bool = field(
        default=True,
        metadata={
            "help": (
                "Whether to pad all samples to `max_seq_length`. "
                "If False, will pad the samples dynamically when batching to the maximum length in the batch."
            )
        },
    )
    max_train_samples: Optional[int] = field(
        default=None,
        metadata={
            "help": (
                "For debugging purposes or quicker training, truncate the number of training examples to this "
                "value if set."
            )
        },
    )
    max_eval_samples: Optional[int] = field(
        default=None,
        metadata={
            "help": (
                "For debugging purposes or quicker training, truncate the number of evaluation examples to this "
                "value if set."
            )
        },
    )
    max_predict_samples: Optional[int] = field(
        default=None,
        metadata={
            "help": (
                "For debugging purposes or quicker training, truncate the number of prediction examples to this "
                "value if set."
            )
        },
    )
    train_file: Optional[str] = field(
        default=None, metadata={"help": "A tsv or a json file containing the training data."}
    )
    validation_file: Optional[str] = field(
        default=None, metadata={"help": "A tsv or a json file containing the validation data."}
    )
    test_file: Optional[str] = field(default=None, metadata={"help": "A tsv or a json file containing the test data."})

    def __post_init__(self):
        if self.task_name is not None:
            self.task_name = self.task_name.lower()
            if self.task_name not in task_to_keys.keys():
                raise ValueError("Unknown task, you should pick one in " + ",".join(task_to_keys.keys()))
        elif self.dataset_name is not None:
            pass
        elif self.train_file is None or self.validation_file is None:
            raise ValueError("Need either a GLUE task, a training/validation file or a dataset name.")
        else:
            train_extension = self.train_file.split(".")[-1]
            assert train_extension in ["tsv", "json"], "`train_file` should be a tsv or a json file."
            validation_extension = self.validation_file.split(".")[-1]
            assert (
                validation_extension == train_extension
            ), "`validation_file` should have the same extension (tsv or json) as `train_file`."


@dataclass
class ModelArguments:
    """
    Arguments pertaining to which model/config/tokenizer we are going to fine-tune from.
    """

    encoder_name_or_path: str = field(
        metadata={"help": "Path to pretrained model or model identifier from huggingface.co/models"}
    )
    config_name: Optional[str] = field(
        default=None, metadata={"help": "Pretrained config name or path if not the same as encoder_name_or_path"}
    )
    tokenizer_name: Optional[str] = field(
        default=None, metadata={"help": "Pretrained tokenizer name or path if not the same as encoder_name_or_path"}
    )
    cache_dir: Optional[str] = field(
        default=None,
        metadata={"help": "Where do you want to store the pretrained models downloaded from huggingface.co"},
    )
    use_fast_tokenizer: bool = field(
        default=True,
        metadata={"help": "Whether to use one of the fast tokenizer (backed by the tokenizers library) or not."},
    )
    model_revision: str = field(
        default="main",
        metadata={"help": "The specific model version to use (can be a branch name, tag name or commit id)."},
    )
    use_auth_token: bool = field(
        default=False,
        metadata={
            "help": (
                "Will use the token generated when running `huggingface-cli login` (necessary to use this script "
                "with private models)."
            )
        },
    )
    ignore_mismatched_sizes: bool = field(
        default=False,
        metadata={"help": "Will enable to load a pretrained model whose head dimensions are different."},
    )
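
The `max_seq_length` and `pad_to_max_length` help strings above describe two padding strategies. As a minimal sketch of the difference (the helper names `pad_or_truncate` and `pad_dynamic` are hypothetical and not part of `transformers`, and a pad id of 0 is assumed):

```python
from typing import List


def pad_or_truncate(ids: List[int], max_len: int, pad_id: int = 0) -> List[int]:
    """Static padding: right-pad with pad_id, then cut to exactly max_len."""
    return (ids + [pad_id] * max_len)[:max_len]


def pad_dynamic(batch: List[List[int]], pad_id: int = 0) -> List[List[int]]:
    """Dynamic padding: pad every sequence to the longest one in the batch."""
    longest = max(len(ids) for ids in batch)
    return [ids + [pad_id] * (longest - len(ids)) for ids in batch]


batch = [[2, 717, 3], [2, 717, 2259, 1127, 3]]
static = [pad_or_truncate(ids, max_len=4) for ids in batch]
dynamic = pad_dynamic(batch)
print(static)   # [[2, 717, 3, 0], [2, 717, 2259, 1127]]
print(dynamic)  # [[2, 717, 3, 0, 0], [2, 717, 2259, 1127, 3]]
```

Note that this naive truncation drops the final special token of the longer sequence. A real tokenizer called with `truncation=True` handles special tokens itself, so the sketch only illustrates the length behaviour.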


In [8]:
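def load_datasets(dataprocessor, data_args):
    """Build the train/validation datasets and the two token-level tasks."""
    train_dataset = dataprocessor._create_dataset(data_args.train_file, "train")
    validation_dataset = dataprocessor._create_dataset(data_args.validation_file, "validation")

    # Task 0 predicts the head index of each word.
    head_task = Task(
        id=0,
        name="head_id",
        num_labels=40,
        type="token_classification",
    )

    # Task 1 predicts the dependency relation; 38 == len(get_dep_labels()).
    dep_task = Task(
        id=1,
        name="dep_ids",
        num_labels=38,
        type="token_classification",
    )

    # DatasetDict is a dict subclass, so it can hold the TensorDataset splits.
    dataset = datasets.DatasetDict(
        {"train": train_dataset, "validation": validation_dataset}
    )
    tasks = [head_task, dep_task]
    return tasks, dataset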

Record the information needed for training in variables.

This notebook uses the `klue/bert-base` model, but many more pretrained language models are available at https://huggingface.co/klue.

The training task is dependency parsing, and the batch size is 32.

In [9]:
def get_dep_labels() -> List[str]:
    """
    label for dependency relations format:
    {structure}_(optional){function}
    """
    dep_labels = [
        "NP",
        "NP_AJT",
        "VP",
        "NP_SBJ",
        "VP_MOD",
        "NP_OBJ",
        "AP",
        "NP_CNJ",
        "NP_MOD",
        "VNP",
        "DP",
        "VP_AJT",
        "VNP_MOD",
        "NP_CMP",
        "VP_SBJ",
        "VP_CMP",
        "VP_OBJ",
        "VNP_CMP",
        "AP_MOD",
        "X_AJT",
        "VP_CNJ",
        "VNP_AJT",
        "IP",
        "X",
        "X_SBJ",
        "VNP_OBJ",
        "VNP_SBJ",
        "X_OBJ",
        "AP_AJT",
        "L",
        "X_MOD",
        "X_CNJ",
        "VNP_CNJ",
        "X_CMP",
        "AP_CMP",
        "AP_SBJ",
        "R",
        "NP_SVJ",
    ]
    return dep_labels


def get_pos_labels() -> List[str]:
    """label for part-of-speech tags"""

    return [
        "NNG",
        "NNP",
        "NNB",
        "NP",
        "NR",
        "VV",
        "VA",
        "VX",
        "VCP",
        "VCN",
        "MMA",
        "MMD",
        "MMN",
        "MAG",
        "MAJ",
        "JC",
        "IC",
        "JKS",
        "JKC",
        "JKG",
        "JKO",
        "JKB",
        "JKV",
        "JKQ",
        "JX",
        "EP",
        "EF",
        "EC",
        "ETN",
        "ETM",
        "XPN",
        "XSN",
        "XSV",
        "XSA",
        "XR",
        "SF",
        "SP",
        "SS",
        "SE",
        "SO",
        "SL",
        "SH",
        "SW",
        "SN",
        "NA",
    ]
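
The docstring of `get_dep_labels` describes each label as `{structure}_(optional){function}`. A small sketch of how such labels can be split and mapped to integer ids follows; the helper `split_dep_label` and the three-label sample are illustrative only and do not appear elsewhere in this notebook.

```python
from typing import Optional, Tuple


def split_dep_label(label: str) -> Tuple[str, Optional[str]]:
    """Split '{structure}_{function}' into its parts; the function may be absent."""
    structure, _, function = label.partition("_")
    return structure, (function or None)


# A few labels taken from get_dep_labels(); the full list has 38 entries.
sample_labels = ["NP_SBJ", "VP", "VNP_MOD"]
label2id = {label: i for i, label in enumerate(sample_labels)}
id2label = {i: label for label, i in label2id.items()}

print(split_dep_label("NP_SBJ"))         # ('NP', 'SBJ')
print(split_dep_label("VP"))             # ('VP', None)
print(label2id["VNP_MOD"], id2label[0])  # 2 NP_SBJ
```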

In [10]:
@dataclass
class Task:
    id: int
    name: str
    type: str
    num_labels: int

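Next, we point the data arguments at the KLUE dependency parsing TSV files and load the tokenizer. The files are read from local paths under `/content`, so they need to be in place before the cell runs.

Once the data is loaded, the resulting dataset object contains both a training split and a validation split.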
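
The parser defined later in this notebook (`KlueDPProcessor._create_examples`) treats lines of the form `## guid<TAB>text` as sentence headers and reads tab-separated columns 0, 1, 3, 4 and 5 of every other line as token id, token, POS, head and dependency label. The sketch below applies the same rules to a made-up sample. The sentence, its `sample-0001` id and the annotations are illustrative rather than taken from the KLUE data, the column at index 2 is assumed to be a lemma field, and `parse_dp_tsv` is a hypothetical helper.

```python
from typing import Dict, List

# Hypothetical three-word sentence in the layout the parser expects.
SAMPLE = (
    "## sample-0001\t나는 밥을 먹는다.\n"
    "1\t나는\t나 는\tNP+JX\t3\tNP_SBJ\n"
    "2\t밥을\t밥 을\tNNG+JKO\t3\tNP_OBJ\n"
    "3\t먹는다.\t먹 는다 .\tVV+EF+SF\t0\tVP\n"
)


def parse_dp_tsv(text: str) -> List[Dict[str, object]]:
    """Return one dict per token, tagged with the sentence it belongs to."""
    rows: List[Dict[str, object]] = []
    guid, sentence = None, None
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith("#"):
            parts = line.split("\t")
            if len(parts) == 2:  # sentence header; other '#' lines are metadata
                guid = parts[0].replace("##", "").strip()
                sentence = parts[1].strip()
            continue
        cols = line.split("\t")
        rows.append({
            "guid": guid,
            "text": sentence,
            "token_id": int(cols[0]),
            "token": cols[1],
            "pos": cols[3].split("+")[-1],  # keep only the last POS tag
            "head": int(cols[4]),
            "dep": cols[5],
        })
    return rows


rows = parse_dp_tsv(SAMPLE)
print([(r["token"], r["pos"], r["head"], r["dep"]) for r in rows])
```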

In [11]:
data_args = DataTrainingArguments(train_file = "/content/klue-dp-v1.1_train.tsv", validation_file = "/content/klue-dp-v1.1_dev.tsv", max_seq_length = 128)
model_args = ModelArguments(encoder_name_or_path="klue/bert-base")

tokenizer = AutoTokenizer.from_pretrained(
    model_args.encoder_name_or_path,
    cache_dir=model_args.cache_dir,
    use_fast=model_args.use_fast_tokenizer,
    revision=model_args.model_revision,
    use_auth_token=True if model_args.use_auth_token else None,
)



In [12]:

class KlueDPInputExample:
    """A single training/test example for Dependency Parsing in .conllu format
    Args:
        guid : Unique id for the example
        text : string. the original form of sentence
        sent_id : running index of the sentence within the file
        token_id : token id
        token : eojeol (a space-delimited Korean word unit)
        pos : POS tag(s)
        head : dependency head
        dep : dependency relation
    """

    def __init__(
        self, guid: str, text: str, sent_id: int, token_id: int, token: str, pos: str, head: str, dep: str
    ) -> None:
        self.guid = guid
        self.text = text
        self.sent_id = sent_id
        self.token_id = token_id
        self.token = token
        self.pos = pos
        self.head = head
        self.dep = dep


class KlueDPInputFeatures:
    """A single set of features of data. Property names are the same names as the corresponding inputs to a model.
    Args:
        guid: Unique id for the example.
        ids: Indices of input sequence tokens in the vocabulary (stored as ``input_ids``).
        mask: Mask to avoid performing attention on padding token indices (stored as ``attention_mask``).
            Mask values selected in ``[0, 1]``: Usually ``1`` for tokens that are NOT MASKED, ``0`` for MASKED (padded)
            tokens.
        bpe_head_mask : marks the first subword of each eojeol
        bpe_tail_mask : marks the last subword of each eojeol
        head_labels : head id of each eojeol, stored at its first-subword index
        dep_labels : dependency relation of each eojeol, stored at its first-subword index
        pos_ids : POS tag of each eojeol, stored at its first-subword index
    """

    def __init__(
        self,
        guid: str,
        ids: List[int],
        mask: List[int],
        bpe_head_mask: List[int],
        bpe_tail_mask: List[int],
        head_labels: List[int],
        dep_labels: List[int],
        pos_ids: List[int],
    ) -> None:
        self.guid = guid
        self.input_ids = ids
        self.attention_mask = mask
        self.bpe_head_mask = bpe_head_mask
        self.bpe_tail_mask = bpe_tail_mask
        self.head_labels = head_labels
        self.dep_labels = dep_labels
        self.pos_ids = pos_ids


class KlueDPProcessor:

    origin_train_file_name = "klue-dp-v1.1_train.tsv"
    origin_dev_file_name = "klue-dp-v1.1_dev.tsv"
    origin_test_file_name = "klue-dp-v1.1_test.tsv"


    def __init__(self, max_seq_length: int, tokenizer: PreTrainedTokenizer) -> None:
        self.tokenizer = tokenizer
        self.max_seq_length = max_seq_length

    def _create_examples(self, file_path: str, dataset_type: str) -> List[KlueDPInputExample]:
        sent_id = -1
        examples = []
        with open(file_path, "r", encoding="utf-8") as f:
            for line in f:
                line = line.strip()
                if line == "" or line == "\n" or line == "\t":
                    continue
                if line.startswith("#"):
                    parsed = line.strip().split("\t")
                    if len(parsed) != 2:  # metadata line about dataset
                        continue
                    else:
                        sent_id += 1
                        text = parsed[1].strip()
                        guid = parsed[0].replace("##", "").strip()
                else:
                    token_list = [token.replace("\n", "") for token in line.split("\t")] + ["-", "-"]
                    examples.append(
                        KlueDPInputExample(
                            guid=guid,
                            text=text,
                            sent_id=sent_id,
                            token_id=int(token_list[0]),
                            token=token_list[1],
                            pos=token_list[3],
                            head=token_list[4],
                            dep=token_list[5],
                        )
                    )
        return examples

    def convert_examples_to_features(
        self,
        examples: List[KlueDPInputExample],
        tokenizer: PreTrainedTokenizer,
        max_length: int,
        pos_label_list: List[str],
        dep_label_list: List[str],
    ) -> List[KlueDPInputFeatures]:

        pos_label_map = {label: i for i, label in enumerate(pos_label_list)}
        dep_label_map = {label: i for i, label in enumerate(dep_label_list)}

        SENT_ID = 0

        token_list: List[str] = []
        pos_list: List[str] = []
        head_list: List[int] = []
        dep_list: List[str] = []

        features = []
        for example in examples:
            if SENT_ID != example.sent_id:
                SENT_ID = example.sent_id
                encoded = tokenizer.encode_plus(
                    " ".join(token_list),
                    None,
                    add_special_tokens=True,
                    max_length=max_length,
                    truncation=True,
                    padding="max_length",
                )

                ids, mask = encoded["input_ids"], encoded["attention_mask"]

                bpe_head_mask = [0]
                bpe_tail_mask = [0]
                head_labels = [-100]
                dep_labels = [-100]
                pos_ids = [-100]  # --> CLS token

                for token, head, dep, pos in zip(token_list, head_list, dep_list, pos_list):
                    bpe_len = len(tokenizer.tokenize(token))
                    head_token_mask = [1] + [0] * (bpe_len - 1)
                    tail_token_mask = [0] * (bpe_len - 1) + [1]
                    bpe_head_mask.extend(head_token_mask)
                    bpe_tail_mask.extend(tail_token_mask)

                    head_mask = [head] + [-100] * (bpe_len - 1)
                    head_labels.extend(head_mask)
                    dep_mask = [dep_label_map[dep]] + [-100] * (bpe_len - 1)
                    dep_labels.extend(dep_mask)
                    pos_mask = [pos_label_map[pos]] + [-100] * (bpe_len - 1)
                    pos_ids.extend(pos_mask)

                bpe_head_mask.append(0)
                bpe_tail_mask.append(0)
                head_labels.append(-100)
                dep_labels.append(-100)
                pos_ids.append(-100)  # END token
                if len(bpe_head_mask) > max_length:
                    bpe_head_mask = bpe_head_mask[:max_length]
                    bpe_tail_mask = bpe_tail_mask[:max_length]
                    head_labels = head_labels[:max_length]
                    dep_labels = dep_labels[:max_length]
                    pos_ids = pos_ids[:max_length]

                else:
                    bpe_head_mask.extend([0] * (max_length - len(bpe_head_mask)))  # padding by max_len
                    bpe_tail_mask.extend([0] * (max_length - len(bpe_tail_mask)))  # padding by max_len
                    head_labels.extend([-100] * (max_length - len(head_labels)))  # padding by max_len
                    dep_labels.extend([-100] * (max_length - len(dep_labels)))  # padding by max_len
                    pos_ids.extend([-100] * (max_length - len(pos_ids)))

                feature = KlueDPInputFeatures(
                    guid=example.guid,
                    ids=ids,
                    mask=mask,
                    bpe_head_mask=bpe_head_mask,
                    bpe_tail_mask=bpe_tail_mask,
                    head_labels=head_labels,
                    dep_labels=dep_labels,
                    pos_ids=pos_ids,
                )
                features.append(feature)

                token_list = []
                pos_list = []
                head_list = []
                dep_list = []

            token_list.append(example.token)
            pos_list.append(example.pos.split("+")[-1])  # use only the last POS tag
            head_list.append(int(example.head))
            dep_list.append(example.dep)

        encoded = tokenizer.encode_plus(
            " ".join(token_list),
            None,
            add_special_tokens=True,
            max_length=max_length,
            truncation=True,
            padding="max_length",
        )

        ids, mask = encoded["input_ids"], encoded["attention_mask"]

        bpe_head_mask = [0]
        bpe_tail_mask = [0]
        head_labels = [-100]
        dep_labels = [-100]
        pos_ids = [-100]  # --> CLS token

        for token, head, dep, pos in zip(token_list, head_list, dep_list, pos_list):
            bpe_len = len(tokenizer.tokenize(token))
            head_token_mask = [1] + [0] * (bpe_len - 1)
            tail_token_mask = [0] * (bpe_len - 1) + [1]
            bpe_head_mask.extend(head_token_mask)
            bpe_tail_mask.extend(tail_token_mask)

            head_mask = [head] + [-100] * (bpe_len - 1)
            head_labels.extend(head_mask)
            dep_mask = [dep_label_map[dep]] + [-100] * (bpe_len - 1)
            dep_labels.extend(dep_mask)
            pos_mask = [pos_label_map[pos]] + [-100] * (bpe_len - 1)
            pos_ids.extend(pos_mask)

        bpe_head_mask.append(0)
        bpe_tail_mask.append(0)
        head_labels.append(-100)
        dep_labels.append(-100)
        pos_ids.append(-100)  # END token
        # Pad to max_length, then cut, so an over-long last sentence is truncated like the ones in the loop.
        bpe_head_mask = (bpe_head_mask + [0] * max_length)[:max_length]
        bpe_tail_mask = (bpe_tail_mask + [0] * max_length)[:max_length]
        head_labels = (head_labels + [-100] * max_length)[:max_length]
        dep_labels = (dep_labels + [-100] * max_length)[:max_length]
        pos_ids = (pos_ids + [-100] * max_length)[:max_length]

        feature = KlueDPInputFeatures(
            guid=example.guid,
            ids=ids,
            mask=mask,
            bpe_head_mask=bpe_head_mask,
            bpe_tail_mask=bpe_tail_mask,
            head_labels=head_labels,
            dep_labels=dep_labels,
            pos_ids=pos_ids,
        )
        features.append(feature)

        for feature in features[:3]:
            logger.info("*** Example ***")
            logger.info("input_ids: %s" % feature.input_ids)
            logger.info("attention_mask: %s" % feature.attention_mask)
            logger.info("bpe_head_mask: %s" % feature.bpe_head_mask)
            logger.info("bpe_tail_mask: %s" % feature.bpe_tail_mask)
            logger.info("head_id: %s" % feature.head_labels)
            logger.info("dep_labels: %s" % feature.dep_labels)
            logger.info("pos_ids: %s" % feature.pos_ids)

        return features

    def _convert_features(self, examples: List[KlueDPInputExample]) -> List[KlueDPInputFeatures]:
        return self.convert_examples_to_features(
            examples,
            self.tokenizer,
            max_length=self.max_seq_length,
            dep_label_list=get_dep_labels(),
            pos_label_list=get_pos_labels(),
        )

    def _create_dataset(self, file_path: str, dataset_type: str) -> TensorDataset:
        examples = self._create_examples(file_path, dataset_type)
        features = self._convert_features(examples)

        all_input_ids = torch.tensor([f.input_ids for f in features], dtype=torch.long)
        all_attention_mask = torch.tensor([f.attention_mask for f in features], dtype=torch.long)
        all_bpe_head_mask = torch.tensor([f.bpe_head_mask for f in features], dtype=torch.long)
        all_bpe_tail_mask = torch.tensor([f.bpe_tail_mask for f in features], dtype=torch.long)
        all_head_labels = torch.tensor([f.head_labels for f in features], dtype=torch.long)
        all_dep_labels = torch.tensor([f.dep_labels for f in features], dtype=torch.long)
        all_pos_ids = torch.tensor([f.pos_ids for f in features], dtype=torch.long)

        return TensorDataset(
            all_input_ids,
            all_attention_mask,
            all_bpe_head_mask,
            all_bpe_tail_mask,
            all_head_labels,
            all_dep_labels,
            all_pos_ids,
        )

    def collate_fn(self, batch: List[Tuple]) -> dict:
        # 1. set args
        batch_size = len(batch)
        #pos_padding_idx = None if self.hparams.no_pos else len(get_pos_labels())
        pos_padding_idx = len(get_pos_labels())
        # 2. build inputs : input_ids, attention_mask, bpe_head_mask, bpe_tail_mask
        batch_input_ids = []
        batch_attention_masks = []
        batch_bpe_head_masks = []
        batch_bpe_tail_masks = []
        batch_token_head_labels = []
        batch_token_type_labels = []
        for batch_id in range(batch_size):
            (
                input_id,
                attention_mask,
                bpe_head_mask,
                bpe_tail_mask,
                token_head_labels,
                token_type_labels,
                _,
            ) = batch[batch_id]
            batch_input_ids.append(input_id)
            batch_attention_masks.append(attention_mask)
            batch_bpe_head_masks.append(bpe_head_mask)
            batch_bpe_tail_masks.append(bpe_tail_mask)
            batch_token_head_labels.append(token_head_labels)
            batch_token_type_labels.append(token_type_labels)
        # 2. build inputs : packing tensors
        # 나는 밥을 먹는다. => [CLS] 나 ##는 밥 ##을 먹 ##는 ##다 . [SEP]
        # input_id : [2, 717, 2259, 1127, 2069, 1059, 2259, 2062, 18, 3, 0, 0, ...]
        # bpe_head_mask : [0, 1, 0, 1, 0, 1, 0, 0, 0, 0, ...] (indicate word start (head) idx)
        input_ids = torch.stack(batch_input_ids)
        attention_masks = torch.stack(batch_attention_masks)
        bpe_head_masks = torch.stack(batch_bpe_head_masks)
        bpe_tail_masks = torch.stack(batch_bpe_tail_masks)
        # 3. token_to_words : set in-batch max_word_length
        max_word_length = max(torch.sum(bpe_head_masks, dim=1)).item()
        # 3. token_to_words : placeholders
        head_ids = torch.zeros(batch_size, max_word_length).long()
        type_ids = torch.zeros(batch_size, max_word_length).long()
        pos_ids = torch.zeros(batch_size, max_word_length + 1).long()
        mask_e = torch.zeros(batch_size, max_word_length + 1).long()
        # 3. token_to_words : head_ids, type_ids, pos_ids, mask_e, mask_d
        for batch_id in range(batch_size):
            (
                _,
                _,
                bpe_head_mask,
                _,
                token_head_ids,
                token_type_ids,
                token_pos_ids,
            ) = batch[batch_id]
            # head_id : [1, 3, 5] (prediction candidates)
            # token_head_ids : [-100, 3, -100, 3, -100, 0, -100, -100, -100, -100, ...] (ground truth head ids)
            head_id = [i for i, token in enumerate(bpe_head_mask) if token == 1]
            word_length = len(head_id)
            head_id.extend([0] * (max_word_length - word_length))
            head_ids[batch_id] = token_head_ids[head_id]
            type_ids[batch_id] = token_type_ids[head_id]
            pos_ids[batch_id][0] = torch.tensor(pos_padding_idx)
            pos_ids[batch_id][1:] = token_pos_ids[head_id]
            pos_ids[batch_id][int(torch.sum(bpe_head_mask)) + 1 :] = torch.tensor(pos_padding_idx)
            mask_e[batch_id] = torch.LongTensor([1] * (word_length + 1) + [0] * (max_word_length - word_length))
        mask_d = mask_e[:, 1:]
        # 4. pack everything
        masks = (attention_masks, bpe_head_masks, bpe_tail_masks, mask_e, mask_d)
        ids = (head_ids, type_ids, pos_ids)
        # Reference signature kept from the original notes:
        #   self, input_ids=None, attention_mask=None, bpe_head_masks=None,
        #   bpe_tail_masks=None, mask_e=None, mask_d=None, head_labels=None,
        #   type_labels=None, pos_ids=None, **kwargs
        return {
            "input_ids": input_ids,
            "attention_masks": attention_masks,
            "bpe_head_masks": bpe_head_masks,
            "max_word_length": max_word_length,
            "bpe_tail_masks": bpe_tail_masks,
            "mask_e": mask_e,
            "mask_d": mask_d,
            "head_labels": head_ids,
            "type_labels": type_ids,
            "pos_ids": pos_ids,
        }


In [15]:
dataprocessor = KlueDPProcessor(128, tokenizer)
tasks, raw_datasets = load_datasets(dataprocessor, data_args)
raw_datasets['validation'][1]

(tensor([    2,  7443,  2259,  4153,  2079,  4646,  2470,  4103,  2170,  3618,
          4901,  2116, 28959,  8248,  1670,  2483,  2522,   648,  2483,  2116,
          4006,  2125,  3992,  2138,  1170,  4000,  1540,  2069,  4089,  2371,
          2062,    18,     3,     0,     0,     0,     0,     0,     0,     0,
             0,     0,     0,     0,     0,     0,     0,     0,     0,     0,
             0,     0,     0,     0,     0,     0,     0,     0,     0,     0,
             0,     0,     0,     0,     0,     0,     0,     0,     0,     0,
             0,     0,     0,     0,     0,     0,     0,     0,     0,     0,
             0,     0,     0,     0,     0,     0,     0,     0,     0,     0,
             0,     0,     0,     0,     0,     0,     0,     0,     0,     0,
             0,     0,     0,     0,     0,     0,     0,     0,     0,     0,
             0,     0,     0,     0,     0,     0,     0,     0,     0,     0,
             0,     0,     0,     0,     0,     0,  
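
Each item printed above is a tuple of token-level tensors. As a rough sketch of how the processor gets from words to token-level labels, and how `collate_fn` then gathers them back to word level, the following uses plain Python lists instead of tensors. The hard-coded subword table `TOY_PIECES` (copied from the example comment in `collate_fn`) stands in for the real tokenizer, and the helper names `align` and `gather_word_labels` are hypothetical.

```python
from typing import Dict, List, Tuple

IGNORE = -100  # label value for positions that carry no label

# Hypothetical word -> subword table standing in for the real tokenizer.
TOY_PIECES: Dict[str, List[str]] = {
    "나는": ["나", "##는"],
    "밥을": ["밥", "##을"],
    "먹는다.": ["먹", "##는", "##다", "."],
}


def align(words: List[str], heads: List[int]) -> Tuple[List[int], List[int]]:
    """Return (bpe_head_mask, head_labels), including [CLS] and [SEP] slots."""
    mask, labels = [0], [IGNORE]               # [CLS]
    for word, head in zip(words, heads):
        n = len(TOY_PIECES[word])
        mask += [1] + [0] * (n - 1)            # 1 only on the first subword
        labels += [head] + [IGNORE] * (n - 1)  # label only on the first subword
    mask.append(0)                             # [SEP]
    labels.append(IGNORE)
    return mask, labels


def gather_word_labels(bpe_head_mask: List[int], token_labels: List[int], max_words: int) -> List[int]:
    """Pick the label at each word-start position; pad with position 0 ([CLS])."""
    positions = [i for i, flag in enumerate(bpe_head_mask) if flag == 1]
    positions += [0] * (max_words - len(positions))
    return [token_labels[i] for i in positions]


mask, labels = align(["나는", "밥을", "먹는다."], heads=[3, 3, 0])
word_labels = gather_word_labels(mask, labels, max_words=5)
n_words = sum(mask)
mask_e = [1] * (n_words + 1) + [0] * (5 - n_words)  # one extra leading slot, as in collate_fn
mask_d = mask_e[1:]

print(mask)         # [0, 1, 0, 1, 0, 1, 0, 0, 0, 0]
print(labels)       # [-100, 3, -100, 3, -100, 0, -100, -100, -100, -100]
print(word_labels)  # [3, 3, 0, -100, -100]
print(mask_e)       # [1, 1, 1, 1, 0, 0]
print(mask_d)       # [1, 1, 1, 0, 0]
```

Padding the position list with index 0 works because the `[CLS]` slot always holds `-100`, so padded word slots automatically receive the ignore label.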

In [29]:

class BiAttention(nn.Module):
    def __init__(  # type: ignore[no-untyped-def]
        self, input_size_encoder: int, input_size_decoder: int, num_labels: int, biaffine: bool = True, **kwargs
    ) -> None:
        super(BiAttention, self).__init__()
        self.input_size_encoder = input_size_encoder
        self.input_size_decoder = input_size_decoder
        self.num_labels = num_labels
        self.biaffine = biaffine

        self.W_e = Parameter(torch.Tensor(self.num_labels, self.input_size_encoder))
        self.W_d = Parameter(torch.Tensor(self.num_labels, self.input_size_decoder))
        self.b = Parameter(torch.Tensor(self.num_labels, 1, 1))
        if self.biaffine:
            self.U = Parameter(torch.Tensor(self.num_labels, self.input_size_decoder, self.input_size_encoder))
        else:
            self.register_parameter("U", None)

        self.reset_parameters()

    def reset_parameters(self) -> None:
        nn.init.xavier_uniform_(self.W_e)
        nn.init.xavier_uniform_(self.W_d)
        nn.init.constant_(self.b, 0.0)
        if self.biaffine:
            nn.init.xavier_uniform_(self.U)

    def forward(
        self,
        input_d: torch.Tensor,
        input_e: torch.Tensor,
        mask_d: Optional[torch.Tensor] = None,
        mask_e: Optional[torch.Tensor] = None,
    ) -> torch.Tensor:
        assert input_d.size(0) == input_e.size(0)
        batch, length_decoder, _ = input_d.size()
        _, length_encoder, _ = input_e.size()

        out_d = torch.matmul(self.W_d, input_d.transpose(1, 2)).unsqueeze(3)
        out_e = torch.matmul(self.W_e, input_e.transpose(1, 2)).unsqueeze(2)

        if self.biaffine:
            output = torch.matmul(input_d.unsqueeze(1), self.U)
            output = torch.matmul(output, input_e.unsqueeze(1).transpose(2, 3))
            output = output + out_d + out_e + self.b
        else:
            output = out_d + out_e + self.b

        if mask_d is not None:
            output = output * mask_d.unsqueeze(1).unsqueeze(3) * mask_e.unsqueeze(1).unsqueeze(2)

        return output


class BiLinear(nn.Module):
    def __init__(self, left_features: int, right_features: int, out_features: int):
        super(BiLinear, self).__init__()
        self.left_features = left_features
        self.right_features = right_features
        self.out_features = out_features

        self.U = Parameter(torch.Tensor(self.out_features, self.left_features, self.right_features))
        self.W_l = Parameter(torch.Tensor(self.out_features, self.left_features))
        self.W_r = Parameter(torch.Tensor(self.out_features, self.right_features))
        self.bias = Parameter(torch.Tensor(out_features))

        self.reset_parameters()

    def reset_parameters(self) -> None:
        nn.init.xavier_uniform_(self.W_l)
        nn.init.xavier_uniform_(self.W_r)
        nn.init.constant_(self.bias, 0.0)
        nn.init.xavier_uniform_(self.U)

    def forward(self, input_left: torch.Tensor, input_right: torch.Tensor) -> torch.Tensor:
        left_size = input_left.size()
        right_size = input_right.size()
        assert left_size[:-1] == right_size[:-1], "batch size of left and right inputs mis-match: (%s, %s)" % (
            left_size[:-1],
            right_size[:-1],
        )
        batch = int(np.prod(left_size[:-1]))

        input_left = input_left.contiguous().view(batch, self.left_features)
        input_right = input_right.contiguous().view(batch, self.right_features)

        output = F.bilinear(input_left, input_right, self.U, self.bias)
        output = output + F.linear(input_left, self.W_l, None) + F.linear(input_right, self.W_r, None)
        return output.view(left_size[:-1] + (self.out_features,))
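
For one label, `BiAttention.forward` scores a (decoder position, encoder position) pair as d^T U e + W_d·d + W_e·e + b. The plain-Python sketch below recomputes that value for tiny hand-picked vectors, which can help when checking tensor shapes by hand. The helper name `biaffine_score` and all numbers are illustrative assumptions, not part of the notebook.

```python
from typing import List

Vector = List[float]
Matrix = List[List[float]]


def dot(a: Vector, b: Vector) -> float:
    """Plain dot product of two equally sized vectors."""
    return sum(x * y for x, y in zip(a, b))


def biaffine_score(d: Vector, e: Vector, U: Matrix, w_d: Vector, w_e: Vector, b: float) -> float:
    """Score of one (decoder, encoder) pair for one label: d^T U e + w_d.d + w_e.e + b."""
    # d^T U is a row vector with one entry per encoder dimension
    d_U = [sum(d[i] * U[i][j] for i in range(len(d))) for j in range(len(e))]
    return dot(d_U, e) + dot(w_d, d) + dot(w_e, e) + b


# Hypothetical 2-dimensional decoder and encoder states
d = [1.0, 2.0]
e = [3.0, 4.0]
U = [[1.0, 0.0], [0.0, 1.0]]  # identity, so d^T U e reduces to d.e
w_d = [0.5, 0.5]
w_e = [1.0, -1.0]

score = biaffine_score(d, e, U, w_d, w_e, b=0.25)
print(score)
```

With `biaffine=False` the bilinear term d^T U e is dropped and only the two linear terms and the bias remain.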

In [39]:


class TokenClassificationHead(nn.Module):
    def __init__(self, hidden_size, task, dropout_p=0.1):
        super().__init__()
        self.dropout = nn.Dropout2d(p=0.33)
        self.num_labels = task.num_labels
        self.hidden_size = hidden_size
        self.task_id = task.id 
        self.n_pos_labels = len(get_pos_labels())
        self.n_dp_labels = len(get_dep_labels())

        self.pos_dim = 128
        self.arc_space = 512
        self.type_space = 256
        self.pos_embedding = nn.Embedding(self.n_pos_labels + 1, self.pos_dim)

        enc_dim = self.hidden_size * 2
        if self.pos_embedding is not None:
            enc_dim += self.pos_dim



        # Bidirectional LSTM encoder
        self.encoder = nn.LSTM(enc_dim, self.hidden_size, batch_first=True,dropout = 0.33, bidirectional=True)

        # Unidirectional LSTM decoder
        self.decoder = nn.LSTM(self.hidden_size , self.hidden_size, batch_first=True)

        self.src_dense = nn.Linear(self.hidden_size * 2, self.hidden_size)
        self.hx_dense = nn.Linear(self.hidden_size * 2, self.hidden_size)

        self.classifier = nn.Linear(self.hidden_size, self.num_labels)

        if str(self.task_id) == "0": 

          self.arc_c = nn.Linear(self.hidden_size * 2, self.arc_space)
          self.arc_h = nn.Linear(self.hidden_size, self.arc_space)
          self.attention = BiAttention(self.arc_space, self.arc_space, self.num_labels)

        else:

          self.type_c = nn.Linear(self.hidden_size * 2, self.type_space)
          self.type_h = nn.Linear(self.hidden_size, self.type_space)
          self.bilinear = BiLinear(self.type_space, self.type_space, self.num_labels)



        self._init_weights()

    def _init_weights(self):
        self.classifier.weight.data.normal_(mean=0.0, std=0.02)
        if self.classifier.bias is not None:
            self.classifier.bias.data.zero_()

    def forward(
        self, sequence_output, labels=None, pos_ids = None, max_word_length=None,  head_masks=None, tail_masks=None,mask_e=None, mask_d=None, task = None, **kwargs
    ):

        outputs = sequence_output
        outputs, sent_len = self.resize_outputs(outputs, head_masks, tail_masks, max_word_length)


        if self.pos_embedding is not None:
            pos_outputs = self.pos_embedding(pos_ids)
            pos_outputs = self.dropout(pos_outputs)
            outputs = torch.cat([outputs, pos_outputs], dim=2)

        # encoder
        packed_outputs = pack_padded_sequence(outputs, sent_len, batch_first=True, enforce_sorted=False)
        encoder_outputs, hn = self.encoder(packed_outputs)
        encoder_outputs, outputs_len = pad_packed_sequence(encoder_outputs, batch_first=True)
        encoder_outputs = self.dropout(encoder_outputs.transpose(1, 2)).transpose(1, 2)  # apply dropout for last layer
        hn = self._transform_decoder_init_state(hn)

        # decoder
        src_encoding = F.elu(self.src_dense(encoder_outputs[:, 1:]))
        sent_len = [i - 1 for i in sent_len]
        packed_outputs = pack_padded_sequence(src_encoding, sent_len, batch_first=True, enforce_sorted=False)
        decoder_outputs, _ = self.decoder(packed_outputs, hn)
        decoder_outputs, outputs_len = pad_packed_sequence(decoder_outputs, batch_first=True)
        decoder_outputs = self.dropout(decoder_outputs.transpose(1, 2)).transpose(1, 2)  # apply dropout for last layer

        # TODO: finish the attention mechanism part
        # compute output for arc and type
        #if str(task.id) == '0':

        #  arc_c = F.elu(self.arc_c(encoder_outputs))
        #  arc_h = F.elu(self.arc_h(decoder_outputs))
        #  logits = self.attention(arc_h, arc_c, mask_d=mask_d, mask_e=mask_e).squeeze(dim=1)
          #print(logits.shape) (batch, num_label, 20, 21)
        #else:

        #  type_c = F.elu(self.type_c(encoder_outputs))
        #  type_h = F.elu(self.type_h(decoder_outputs))
        #  logits = self.bilinear(type_h, type_c)
          #print(logits.shape)


        # Pass the decoder output through the classifier
        logits = self.classifier(decoder_outputs)
        
        # Additional masking: for the head-prediction task (task id 0), push the scores of
        # padded encoder positions towards -inf so that they are never selected as a head
        if str(self.task_id) == "0" and mask_e is not None:
            minus_inf = -1e8

            # Pad mask_e with zeros along the last dimension up to num_labels
            padding_size = self.num_labels - mask_e.size(-1)
            zeros_padding = torch.zeros(mask_e.size(0), padding_size, device=logits.device)
            mask_e_padded = torch.cat((mask_e, zeros_padding), dim=-1)

            minus_mask_e = (1 - mask_e_padded) * minus_inf
            logits = logits + minus_mask_e.unsqueeze(1)

        loss = None
        if labels is not None:
            loss_fct = torch.nn.CrossEntropyLoss()

            labels = labels.long()

            # Only keep active parts of the loss
            if mask_d is not None:
                active_loss = mask_d.view(-1) == 1

                active_logits = logits.view(-1, self.num_labels)
                active_labels = torch.where(
                    active_loss,
                    labels.view(-1),
                    torch.tensor(loss_fct.ignore_index).type_as(labels),
                )
    
                loss = loss_fct(active_logits, active_labels)
            else:
                loss = loss_fct(logits.view(-1, self.num_labels), labels.view(-1))

        return logits, loss

    def resize_outputs(
        self, outputs: torch.Tensor, bpe_head_mask: torch.Tensor, bpe_tail_mask: torch.Tensor, max_word_length: int
    ) -> Tuple[torch.Tensor, List]:
        """Resize output of pre-trained transformers (bsz, max_token_length, hidden_dim) to word-level outputs (bsz, max_word_length, hidden_dim*2). """
        batch_size, input_size, hidden_size = outputs.size()
        word_outputs = torch.zeros(batch_size, max_word_length + 1, hidden_size * 2).to(outputs.device)
        sent_len = list()

        for batch_id in range(batch_size):
            head_ids = [i for i, token in enumerate(bpe_head_mask[batch_id]) if token == 1]
            tail_ids = [i for i, token in enumerate(bpe_tail_mask[batch_id]) if token == 1]
            assert len(head_ids) == len(tail_ids)

            word_outputs[batch_id][0] = torch.cat(
                (outputs[batch_id][0], outputs[batch_id][0])
            )  # replace root with [CLS]
            for i, (head, tail) in enumerate(zip(head_ids, tail_ids)):
                word_outputs[batch_id][i + 1] = torch.cat((outputs[batch_id][head], outputs[batch_id][tail]))
            # Number of words plus the root position; also safe when a sentence has no words
            sent_len.append(len(head_ids) + 1)

        return word_outputs, sent_len

    def _transform_decoder_init_state(
        self, hn: Tuple[torch.Tensor, torch.Tensor]
    ) -> Tuple[torch.Tensor, torch.Tensor]:
        hn, cn = hn
        cn = cn[-2:]  # take the last layer
        _, batch_size, hidden_size = cn.size()
        cn = cn.transpose(0, 1).contiguous()
        cn = cn.view(batch_size, 1, 2 * hidden_size).transpose(0, 1)
        cn = self.hx_dense(cn)
        if self.decoder.num_layers > 1:
            cn = torch.cat(
                [
                    cn,
                    cn.new_zeros(self.decoder.num_layers - 1, batch_size, hidden_size),
                ],
                dim=0,
            )
        hn = torch.tanh(cn)
        hn = (hn, cn)
        return hn
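
`resize_outputs` builds word-level vectors by concatenating the hidden states of each word's first and last subword, with position 0 reserved for the root and filled from `[CLS]`. The sketch below reproduces only the index bookkeeping in plain Python. The helper name `word_spans` and the example masks are hypothetical, chosen to show how `sent_len` comes out as the number of words plus one.

```python
from typing import List, Tuple


def word_spans(head_mask: List[int], tail_mask: List[int]) -> List[Tuple[int, int]]:
    """Pair the first and last subword index of every word."""
    head_ids = [i for i, flag in enumerate(head_mask) if flag == 1]
    tail_ids = [i for i, flag in enumerate(tail_mask) if flag == 1]
    assert len(head_ids) == len(tail_ids), "every word needs one head and one tail"
    return list(zip(head_ids, tail_ids))


# Hypothetical tokenization: [CLS] w1a w1b w2 w3a w3b w3c [SEP]
head_mask = [0, 1, 0, 1, 1, 0, 0, 0]
tail_mask = [0, 0, 1, 1, 0, 0, 1, 0]

spans = word_spans(head_mask, tail_mask)

# Position 0 of the word-level sequence is the root slot filled from [CLS],
# so the sequence length is the number of words plus one.
sent_len = len(spans) + 1
print(spans, sent_len)
```

A single-subword word has the same head and tail index, so its word vector is that one hidden state concatenated with itself.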

In [18]:

class SequenceClassificationHead(nn.Module):
    def __init__(self, hidden_size, num_labels, dropout_p=0.1):
        super().__init__()
        self.num_labels = num_labels
        self.dropout = nn.Dropout(dropout_p)
        self.classifier = nn.Linear(hidden_size, num_labels)

        self._init_weights()

    def forward(self, sequence_output, pooled_output, labels=None, **kwargs):
        pooled_output = self.dropout(pooled_output)
        logits = self.classifier(pooled_output)

        loss = None
        if labels is not None:
            if labels.dim() != 1:
                # Remove padding
                labels = labels[:, 0]

            loss_fct = nn.CrossEntropyLoss()
            loss = loss_fct(
                logits.view(-1, self.num_labels), labels.long().view(-1)
            )

        return logits, loss

    def _init_weights(self):
        self.classifier.weight.data.normal_(mean=0.0, std=0.02)
        if self.classifier.bias is not None:
            self.classifier.bias.data.zero_()

In [33]:

class MultiTaskModel(nn.Module):
    def __init__(self, encoder_name_or_path, tasks: List):
        super().__init__()
        self.encoder = AutoModel.from_pretrained(encoder_name_or_path, config = AutoConfig.from_pretrained(encoder_name_or_path))
        self.tasks = tasks
        self.output_heads = nn.ModuleDict()
        for task in self.tasks :
            decoder = self._create_output_head(self.encoder.config.hidden_size, task)
            # ModuleDict requires keys to be strings
            self.output_heads[str(task.id)] = decoder

    @staticmethod
    def _create_output_head(encoder_hidden_size: int, task):
        if task.type == "seq_classification":
            return SequenceClassificationHead(encoder_hidden_size, task.num_labels)
        elif task.type == "token_classification":
            return TokenClassificationHead(encoder_hidden_size, task)
        else:
            raise NotImplementedError()

    def forward(
            self,
            input_ids=None,
            attention_mask=None,
            bpe_head_masks=None,
            bpe_tail_masks=None,
            max_word_length=None,
            mask_e=None,
            mask_d=None,
            head_labels=None,
            type_labels=None,
            pos_ids=None,
            **kwargs,
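        ):
            """Encode the batch once, then run every task head on the shared encoder output.

            `head_labels` and `type_labels` are the targets for task ids 0 and 1, respectively.
            Returns a dict with the mean of the per-task losses and the list of per-task logits.
            """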
            
            outputs = self.encoder(
                input_ids=input_ids,
                attention_mask=attention_mask
            )
            
            sequence_output, pooled_output = outputs[:2]

            unique_task_ids_list = self.tasks 

            loss_list = []
            logits_list = []

            logits = None
 
            gt = [ head_labels, type_labels] #type_ids is dep_ids
            for task in self.tasks :
                labels = gt[task.id]
                logits, task_loss = self.output_heads[str(task.id)].forward(
                    sequence_output,
                    labels=labels,
                    max_word_length= max_word_length,
                    pos_ids = pos_ids,
                    head_masks=bpe_head_masks,
                    tail_masks=bpe_tail_masks,
                    mask_e = mask_e, 
                    mask_d = mask_d, 
                    task = task
                )
                logits_list.append(logits)
                if labels is not None:
                    loss_list.append(task_loss)

            # logits are only used for eval. and in case of eval the batch is not multi task
            # For training only the loss is used

            # Average the per-task losses; omit "loss" when no labels were passed
            output = {"logits": logits_list}
            if loss_list:
                output = {"loss": torch.stack(loss_list).mean(), **output}
            return output
            #outputs = (logits, outputs[2:])
            #return TokenClassifierOutput(loss=loss, logits=logits)
            #return outputs # (loss, outputs)

In [20]:
from seqeval.metrics import f1_score
from seqeval.metrics import accuracy_score
from seqeval.metrics import classification_report

In [21]:
def compute_metrics(p: EvalPrediction):
    """Compute seqeval F1 and accuracy for each task head.

    `p.predictions[i]` holds the logits of task `i` and `p.label_ids[i]` the matching labels.
    """
    result_dict = {}

    for i in range(len(p.predictions)):
        # Pick the highest-scoring label at every word position
        predictions = np.argmax(p.predictions[i], axis=2)

        # Remove ignored index (-100) and convert ids to strings, as seqeval expects
        true_predictions = [
            [str(pred) for (pred, lab) in zip(prediction, label) if lab != -100]
            for prediction, label in zip(predictions, p.label_ids[i])
        ]
        true_labels = [
            [str(lab) for (pred, lab) in zip(prediction, label) if lab != -100]
            for prediction, label in zip(predictions, p.label_ids[i])
        ]

        # https://github.com/chakki-works/seqeval/blob/cd01b5210eaa65e691c22320aba56f2be9e9fc43/seqeval/metrics/v1.py#L136
        result_dict[f"f1_{i}"] = f1_score(true_labels, true_predictions)
        result_dict[f"accuracy_{i}"] = accuracy_score(true_labels, true_predictions)
    return result_dict
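
Positions labelled `-100` are padding and must be dropped from both predictions and labels before scoring. The plain-Python sketch below shows that filtering step followed by a simple token-level accuracy. The helper names `drop_ignored` and `token_accuracy` and the toy data are assumptions for illustration; this is not the seqeval implementation.

```python
from typing import List, Tuple

IGNORE_INDEX = -100  # default ignore_index of torch.nn.CrossEntropyLoss

Rows = List[List[int]]


def drop_ignored(predictions: Rows, labels: Rows) -> Tuple[Rows, Rows]:
    """Keep only the positions whose label is not IGNORE_INDEX."""
    kept_preds, kept_labels = [], []
    for pred_row, label_row in zip(predictions, labels):
        pairs = [(p, l) for p, l in zip(pred_row, label_row) if l != IGNORE_INDEX]
        kept_preds.append([p for p, _ in pairs])
        kept_labels.append([l for _, l in pairs])
    return kept_preds, kept_labels


def token_accuracy(predictions: Rows, labels: Rows) -> float:
    """Fraction of kept positions where prediction and label agree."""
    correct = sum(p == l for pr, lr in zip(predictions, labels) for p, l in zip(pr, lr))
    total = sum(len(row) for row in labels)
    return correct / total if total else 0.0


# Hypothetical batch of two sentences, padded to length 4
predictions = [[2, 0, 1, 1], [3, 3, 0, 0]]
labels = [[2, 0, 0, -100], [3, 1, -100, -100]]

kept_preds, kept_labels = drop_ignored(predictions, labels)
accuracy = token_accuracy(kept_preds, kept_labels)
print(kept_preds, kept_labels, accuracy)
```

Filtering first keeps padded positions from inflating or deflating the score, whatever the model happened to predict there.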

 

In [22]:
import warnings
warnings.filterwarnings("ignore")

In [40]:
def main(model_args, data_args, training_args):

    # Setup logging
    logging.basicConfig(
        format="%(asctime)s - %(levelname)s - %(name)s - %(message)s",
        datefmt="%m/%d/%Y %H:%M:%S",
        handlers=[logging.StreamHandler(sys.stdout)],
    )
    
    log_level = training_args.get_process_log_level()
    logger.setLevel(logging.WARNING)
    datasets.utils.logging.set_verbosity(log_level)
    transformers.utils.logging.set_verbosity(log_level)
    transformers.utils.logging.enable_default_handler()
    transformers.utils.logging.enable_explicit_format()

    # Log on each process the small summary:
    logger.warning(
        f"Process rank: {training_args.local_rank}, device: {training_args.device}, n_gpu: {training_args.n_gpu}, "
        + f"distributed training: {bool(training_args.local_rank != -1)}, 16-bits training: {training_args.fp16}"
    )
    logger.info(f"Training/evaluation parameters {training_args}")

    # Detecting last checkpoint.
    last_checkpoint = None
    if (
        os.path.isdir(training_args.output_dir)
        and training_args.do_train
        and not training_args.overwrite_output_dir
    ):
        last_checkpoint = get_last_checkpoint(training_args.output_dir)
        if last_checkpoint is None and len(os.listdir(training_args.output_dir)) > 0:
            raise ValueError(
                f"Output directory ({training_args.output_dir}) already exists and is not empty. "
                "Use --overwrite_output_dir to overcome."
            )
        elif (
            last_checkpoint is not None and training_args.resume_from_checkpoint is None
        ):
            logger.info(
                f"Checkpoint detected, resuming training at {last_checkpoint}. To avoid this behavior, change "
                "the `--output_dir` or add `--overwrite_output_dir` to train from scratch."
            )

    set_seed(training_args.seed)

    tokenizer = AutoTokenizer.from_pretrained(
        model_args.encoder_name_or_path,
        cache_dir=model_args.cache_dir,
        use_fast=model_args.use_fast_tokenizer,
        revision=model_args.model_revision,
        use_auth_token=True if model_args.use_auth_token else None,
    )
    dataprocessor = KlueDPProcessor(128, tokenizer)

    tasks, raw_datasets = load_datasets(dataprocessor, data_args)

    model = MultiTaskModel(model_args.encoder_name_or_path, tasks)

    if training_args.do_train:
        if "train" not in raw_datasets:
            raise ValueError("--do_train requires a train dataset")
        train_dataset = raw_datasets["train"]
        if data_args.max_train_samples is not None:
            train_dataset = train_dataset.select(range(data_args.max_train_samples))

    if training_args.do_eval:
        
        if (
            "validation" not in raw_datasets
            and "validation_matched" not in raw_datasets
        ):
            raise ValueError("--do_eval requires a validation dataset")
        eval_dataset = raw_datasets["validation"]
        """
        if data_args.max_eval_samples is not None:
            new_ds = []
            for ds in eval_datasets:
                new_ds.append(ds.select(range(data_args.max_eval_samples)))

            eval_datasets = new_ds
        """
    # Log a few random samples from the training set:
    if training_args.do_train:
        for index in random.sample(range(len(train_dataset)), 3):
            logger.info(f"Sample {index} of the training set: {train_dataset[index]}.")

    # Log a few random samples from the evaluation set:
    if training_args.do_eval:
        for index in random.sample(range(len(eval_dataset)), 3):
            logger.info(f"Sample {index} of the eval set: {eval_dataset[index]}.")


    data_collator = DataCollatorForTokenClassification(
        tokenizer, pad_to_multiple_of=8 if training_args.fp16 else None
    )

    # Initialize our Trainer
    trainer = Trainer(
        model=model,
        args=training_args,
        train_dataset=train_dataset if training_args.do_train else None,
        eval_dataset=eval_dataset if training_args.do_eval else None,
        compute_metrics=compute_metrics,
        tokenizer=tokenizer,
        data_collator=dataprocessor.collate_fn,
    )
    logger.warning("Training")
    # Training
    if training_args.do_train:
        checkpoint = None
        if training_args.resume_from_checkpoint is not None:
            checkpoint = training_args.resume_from_checkpoint
        elif last_checkpoint is not None:
            checkpoint = last_checkpoint
        #train_result = trainer.train(resume_from_checkpoint=checkpoint)
        train_result = trainer.train()
        metrics = train_result.metrics
        max_train_samples = (
            data_args.max_train_samples
            if data_args.max_train_samples is not None
            else len(train_dataset)
        )
        metrics["train_samples"] = min(max_train_samples, len(train_dataset))

        trainer.save_model()  # Saves the tokenizer too for easy upload

        trainer.log_metrics("train", metrics)
        trainer.save_metrics("train", metrics)
        trainer.save_state()

    # Evaluation
    if training_args.do_eval:
        evals = [eval_dataset, eval_dataset]
        for eval, task in zip(evals, tasks):


            metrics = trainer.evaluate(eval_dataset=eval)
            max_eval_samples = (
                data_args.max_eval_samples
                if data_args.max_eval_samples is not None
                else len(eval)
            )
            metrics["eval_samples"] = min(max_eval_samples, len(eval))

            trainer.log_metrics("eval", metrics)
            trainer.save_metrics("eval", metrics)
            
if __name__ == "__main__":
    model_args = ModelArguments(encoder_name_or_path="klue/bert-base")
    training_args = TrainingArguments(
        "test-dp",
        evaluation_strategy="epoch",
        save_strategy="epoch",
        do_train = True,
        do_eval = True,
        per_device_train_batch_size = 8, 
        per_device_eval_batch_size = 8, 
        learning_rate=3e-5,
        num_train_epochs=20,
        overwrite_output_dir=True,
        logging_dir = './log'
    )
    data_args = DataTrainingArguments(train_file = "/content/klue-dp-v1.1_train.tsv", validation_file = "/content/klue-dp-v1.1_dev.tsv", max_seq_length = 128)
    main(model_args, data_args, training_args)  # dropout is active only during training, so validation loss starts below training loss; the later gap (validation above training) reflects fitting to the training set

- This IS expected if you are initializing BertModel from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing BertModel from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).


Epoch,Training Loss,Validation Loss,F1 0,Accuracy 0,F1 1,Accuracy 1
1,0.7927,0.494106,0.316858,0.697324,0.687154,0.962616
2,0.4091,0.345901,0.588249,0.808099,0.821979,0.972262
3,0.3258,0.290099,0.683923,0.853396,0.840858,0.974529
4,0.2472,0.268021,0.71555,0.86771,0.84493,0.97404
5,0.215,0.25657,0.74713,0.879267,0.855776,0.975773
6,0.1811,0.250955,0.769758,0.889403,0.852934,0.975951
7,0.1596,0.251962,0.780886,0.893803,0.861837,0.975685
8,0.1354,0.25261,0.786688,0.897004,0.854595,0.975284
9,0.1213,0.252773,0.794869,0.902649,0.850037,0.975196
10,0.1097,0.249646,0.805602,0.906561,0.858824,0.975018


***** train metrics *****
  epoch                    =       20.0
  total_flos               =        0GF
  train_loss               =     0.1898
  train_runtime            = 3:05:03.39
  train_samples            =      10000
  train_samples_per_second =     18.013
  train_steps_per_second   =      2.252


***** eval metrics *****
  epoch                   =       20.0
  eval_accuracy_0         =     0.9173
  eval_accuracy_1         =     0.9764
  eval_f1_0               =     0.8272
  eval_f1_1               =     0.8624
  eval_loss               =     0.2705
  eval_runtime            = 0:00:50.81
  eval_samples            =       2000
  eval_samples_per_second =      39.36
  eval_steps_per_second   =       4.92
***** eval metrics *****
  epoch                   =       20.0
  eval_accuracy_0         =     0.9173
  eval_accuracy_1         =     0.9764
  eval_f1_0               =     0.8272
  eval_f1_1               =     0.8624
  eval_loss               =     0.2705
  eval_runtime            = 0:00:50.65
  eval_samples            =       2000
  eval_samples_per_second =     39.485
  eval_steps_per_second   =      4.936
