
Training a Token Classification Model with HuggingFace Transformers

In this notebook, we fine-tune the `klue/roberta-base` model on the **NER** (named entity recognition) dataset from **KLUE**.

After training, we also look at how the model can be used through a short example.

All source code is based on [`huggingface-tutorial`](https://huggingface.co/course/chapter7/2).

First, install the libraries required to run the notebook. `transformers` is needed for model training, and `datasets` is needed for loading the training data. In addition, install `scipy` and `scikit-learn` for evaluating model performance.

In [None]:
!pip install  evaluate 
!pip install accelerate
# To run the training on TPU, you will need to uncomment the following line:
# !pip install cloud-tpu-client==0.10 torch==1.9.0 https://storage.googleapis.com/tpu-pytorch/wheels/torch_xla-1.9-cp37-cp37m-linux_x86_64.whl
!apt install git-lfs



In [None]:
!pip install -U transformers datasets scipy scikit-learn


In [None]:
#from huggingface_hub import notebook_login

#notebook_login()

## Training the Token Classification Model

Import all the libraries needed to run the notebook.

In [None]:
import random
import logging
from IPython.display import display, HTML

import numpy as np
import pandas as pd
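# `load_metric` is not imported: metrics are loaded with the `evaluate` library below.
# The `datasets` module itself is not imported either, because the name `datasets`
# is later bound to the loaded DatasetDict.
from datasets import load_dataset, ClassLabel, Sequence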
from transformers import AutoTokenizer, AutoModelForSequenceClassification, TrainingArguments, Trainer

Record the settings needed for training in variables.

This notebook uses the `klue/roberta-base` model, but a wider range of pretrained language models is available at https://huggingface.co/klue.

We set the training task to `ner` and the batch size to 64.

In [None]:
model_checkpoint = "klue/roberta-base"
batch_size = 64
task = "ner"

Now download the NER data from the KLUE dataset registered in the HuggingFace `datasets` library.

In [None]:
#['ynat', 'sts', 'nli', 'ner', 're', 'dp', 'mrc', 'wos']
datasets = load_dataset("klue", task)




Inspecting the `datasets` object obtained after downloading or loading shows that it contains a training split and a validation split.

In [None]:
datasets

DatasetDict({
    train: Dataset({
        features: ['sentence', 'tokens', 'ner_tags'],
        num_rows: 21008
    })
    validation: Dataset({
        features: ['sentence', 'tokens', 'ner_tags'],
        num_rows: 5000
    })
})

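Each example consists of a sentence, its character-level tokens, and one NER tag per token. The tag set can be read from the `ner_tags` feature as shown below.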

In [None]:
ner_feature = datasets["train"].features["ner_tags"]
ner_feature

Sequence(feature=ClassLabel(names=['B-DT', 'I-DT', 'B-LC', 'I-LC', 'B-OG', 'I-OG', 'B-PS', 'I-PS', 'B-QT', 'I-QT', 'B-TI', 'I-TI', 'O'], id=None), length=-1, id=None)

In [None]:
label_names = ner_feature.feature.names
label_names

['B-DT',
 'I-DT',
 'B-LC',
 'I-LC',
 'B-OG',
 'I-OG',
 'B-PS',
 'I-PS',
 'B-QT',
 'I-QT',
 'B-TI',
 'I-TI',
 'O']

In [None]:
words = datasets["train"][0]["tokens"]
labels = datasets["train"][0]["ner_tags"]
line1 = ""
line2 = ""
for word, label in zip(words, labels):
    full_label = label_names[label]
    max_length = max(len(word), len(full_label))
    line1 += word + " " * (max_length - len(word) + 1)
    line2 += full_label + " " * (max_length - len(full_label) + 1)

print(line1)
print(line2)

특 히   영    동    고    속    도    로      강    릉      방 향   문    막    휴    게    소    에 서   만    종    분    기    점    까 지   5    ㎞      구 간 에 는   승 용 차   전 용   임 시   갓 길 차 로 제 를   운 영 하 기 로   했 다 . 
O O O B-LC I-LC I-LC I-LC I-LC I-LC O B-LC I-LC O O O O B-LC I-LC I-LC I-LC I-LC O O O B-LC I-LC I-LC I-LC I-LC O O O B-QT I-QT O O O O O O O O O O O O O O O O O O O O O O O O O O O O O O O O 
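The BIO tags printed above can be grouped back into entity strings. The following is a minimal sketch, assuming the label list shown earlier and using only the first 15 tokens and tags of the first training example; the helper `extract_entities` is a hypothetical name and is not part of the original notebook.

```python
# Label list copied from the notebook's `label_names` output.
label_names = ['B-DT', 'I-DT', 'B-LC', 'I-LC', 'B-OG', 'I-OG',
               'B-PS', 'I-PS', 'B-QT', 'I-QT', 'B-TI', 'I-TI', 'O']


def extract_entities(tokens, tag_ids):
    """Group character tokens into (entity_type, text) pairs using BIO tags."""
    entities = []
    current_type, current_chars = None, []
    for token, tag_id in zip(tokens, tag_ids):
        tag = label_names[tag_id]
        if tag == "O":
            prefix, entity_type = "O", None
        else:
            prefix, entity_type = tag.split("-", 1)
        # An I- tag of the same type continues the open entity.
        continues = prefix == "I" and entity_type == current_type
        if current_type is not None and not continues:
            entities.append((current_type, "".join(current_chars)))
            current_type, current_chars = None, []
        if prefix == "B" or (prefix == "I" and current_type is None):
            # Start a new entity (an orphan I- tag is treated leniently as a start).
            current_type, current_chars = entity_type, [token]
        elif continues:
            current_chars.append(token)
    if current_type is not None:
        entities.append((current_type, "".join(current_chars)))
    return entities


# First 15 character tokens and tags of the first training example.
tokens = list("특히 영동고속도로 강릉 방향")
tag_ids = [12, 12, 12, 2, 3, 3, 3, 3, 3, 12, 2, 3, 12, 12, 12]
entities = extract_entities(tokens, tag_ids)
print(entities)
```

Both recovered entities carry the `LC` (location) type, which agrees with the `B-LC`/`I-LC` runs in the printed tag line.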


To get an overall view of the dataset, define the following display function.

In [None]:
def show_random_elements(dataset, num_examples=10):
    assert num_examples <= len(dataset), "Can't pick more elements than there are in the dataset."

    picks = []
    
    for _ in range(num_examples):
        pick = random.randint(0, len(dataset)-1)

        # If an already selected index is drawn, draw again
        while pick in picks:
            pick = random.randint(0, len(dataset)-1)

        picks.append(pick)

    # Build a DataFrame from the randomly selected indices
    df = pd.DataFrame(dataset[picks])

    for column, typ in dataset.features.items():
        # Convert label class ids to strings
        if isinstance(typ, ClassLabel):
            df[column] = df[column].transform(lambda i: typ.names[i])

    display(HTML(df.to_html()))

Let's use the function defined above to look at the training data.

The benefit of inspecting the data this way is that it gives a feel for what kinds of spans each label covers.


In [None]:
show_random_elements(datasets["train"])

Unnamed: 0,sentence,tokens,ner_tags
0,매회 방송될때 마다 먹음직스러운 음식처럼 자꾸만 땡기는 식샤~~~ 매주 <목요일:DT> <밤 11시:TI> 본방 사수하고 있어여~^^,"[매, 회, , 방, 송, 될, 때, , 마, 다, , 먹, 음, 직, 스, 러, 운, , 음, 식, 처, 럼, , 자, 꾸, 만, , 땡, 기, 는, , 식, 샤, ~, ~, ~, , 매, 주, , 목, 요, 일, , 밤, , 1, 1, 시, , 본, 방, , 사, 수, 하, 고, , 있, 어, 여, ~, ^, ^]","[12, 12, 12, 12, 12, 12, 12, 12, 12, 12, 12, 12, 12, 12, 12, 12, 12, 12, 12, 12, 12, 12, 12, 12, 12, 12, 12, 12, 12, 12, 12, 12, 12, 12, 12, 12, 12, 12, 12, 12, 0, 1, 1, 12, 10, 11, 11, 11, 11, 12, 12, 12, 12, 12, 12, 12, 12, 12, 12, 12, 12, 12, 12, 12]"
1,그대로 베껴서 한거라기보다 <한국:LC>판으로 리메이크되서 새롭게 보는 느낌까지 든다.,"[그, 대, 로, , 베, 껴, 서, , 한, 거, 라, 기, 보, 다, , 한, 국, 판, 으, 로, , 리, 메, 이, 크, 되, 서, , 새, 롭, 게, , 보, 는, , 느, 낌, 까, 지, , 든, 다, .]","[12, 12, 12, 12, 12, 12, 12, 12, 12, 12, 12, 12, 12, 12, 12, 2, 3, 12, 12, 12, 12, 12, 12, 12, 12, 12, 12, 12, 12, 12, 12, 12, 12, 12, 12, 12, 12, 12, 12, 12, 12, 12, 12]"
2,제 평점은요 <10점:QT> 만점에 <10점:QT> 되겠읍니다,"[제, , 평, 점, 은, 요, , 1, 0, 점, , 만, 점, 에, , 1, 0, 점, , 되, 겠, 읍, 니, 다]","[12, 12, 12, 12, 12, 12, 12, 8, 9, 9, 12, 12, 12, 12, 12, 8, 9, 9, 12, 12, 12, 12, 12, 12]"
3,영화 스토리 상으론<6점:QT> 짜집기 연출 구성상으론 <10 점:QT>인영화 가볍게 보기엔 좋은,"[영, 화, , 스, 토, 리, , 상, 으, 론, 6, 점, , 짜, 집, 기, , 연, 출, , 구, 성, 상, 으, 론, , 1, 0, , 점, 인, 영, 화, , 가, 볍, 게, , 보, 기, 엔, , 좋, 은]","[12, 12, 12, 12, 12, 12, 12, 12, 12, 12, 8, 9, 12, 12, 12, 12, 12, 12, 12, 12, 12, 12, 12, 12, 12, 12, 8, 9, 9, 9, 12, 12, 12, 12, 12, 12, 12, 12, 12, 12, 12, 12, 12, 12]"
4,유치한 설정에 <이연걸:PS>의 액션까지 빛이 바랜다,"[유, 치, 한, , 설, 정, 에, , 이, 연, 걸, 의, , 액, 션, 까, 지, , 빛, 이, , 바, 랜, 다]","[12, 12, 12, 12, 12, 12, 12, 12, 6, 7, 7, 12, 12, 12, 12, 12, 12, 12, 12, 12, 12, 12, 12, 12]"
5,<1회:QT>땐 좋은데 <2회:QT>부터 갈수록 엉망이던데,"[1, 회, 땐, , 좋, 은, 데, , 2, 회, 부, 터, , 갈, 수, 록, , 엉, 망, 이, 던, 데]","[8, 9, 12, 12, 12, 12, 12, 12, 8, 9, 12, 12, 12, 12, 12, 12, 12, 12, 12, 12, 12, 12]"
6,<메시:PS>는 지난 <5일:DT>(<한국:LC>시각) <바르셀로나:OG> 훈련에 모습을 드러내며 옆 머리를 짧게 자른 헤어스타일을 선보였는데요.,"[메, 시, 는, , 지, 난, , 5, 일, (, 한, 국, 시, 각, ), , 바, 르, 셀, 로, 나, , 훈, 련, 에, , 모, 습, 을, , 드, 러, 내, 며, , 옆, , 머, 리, 를, , 짧, 게, , 자, 른, , 헤, 어, 스, 타, 일, 을, , 선, 보, 였, 는, 데, 요, .]","[6, 7, 12, 12, 12, 12, 12, 0, 1, 12, 2, 3, 12, 12, 12, 12, 4, 5, 5, 5, 5, 12, 12, 12, 12, 12, 12, 12, 12, 12, 12, 12, 12, 12, 12, 12, 12, 12, 12, 12, 12, 12, 12, 12, 12, 12, 12, 12, 12, 12, 12, 12, 12, 12, 12, 12, 12, 12, 12, 12, 12]"
7,소녀시절의 <페넬로페 크로즈:PS>의 아름다움.,"[소, 녀, 시, 절, 의, , 페, 넬, 로, 페, , 크, 로, 즈, 의, , 아, 름, 다, 움, .]","[12, 12, 12, 12, 12, 12, 6, 7, 7, 7, 7, 7, 7, 7, 12, 12, 12, 12, 12, 12, 12]"
8,<이스라엘 안보 내각:OG>은 <유엔:OG>의 요청에 따라 <가자지구:LC>에 대한 인도주의적 정전을 <27일:DT> <자정:TI>(이하 현지시간)까지 <24시간:TI> 연장하기로 결정했다고 <이스라엘 정부:OG> 관계자가 <26일:DT> 밝혔다.,"[이, 스, 라, 엘, , 안, 보, , 내, 각, 은, , 유, 엔, 의, , 요, 청, 에, , 따, 라, , 가, 자, 지, 구, 에, , 대, 한, , 인, 도, 주, 의, 적, , 정, 전, 을, , 2, 7, 일, , 자, 정, (, 이, 하, , 현, 지, 시, 간, ), 까, 지, , 2, 4, 시, 간, , 연, 장, 하, 기, 로, , 결, 정, 했, 다, 고, , 이, 스, 라, 엘, , 정, 부, , 관, 계, 자, 가, , 2, 6, 일, , 밝, 혔, 다, .]","[4, 5, 5, 5, 5, 5, 5, 5, 5, 5, 12, 12, 4, 5, 12, 12, 12, 12, 12, 12, 12, 12, 12, 2, 3, 3, 3, 12, 12, 12, 12, 12, 12, 12, 12, 12, 12, 12, 12, 12, 12, 12, 0, 1, 1, 12, 10, 11, 12, 12, 12, 12, 12, 12, 12, 12, 12, 12, 12, 12, 10, 11, 11, 11, 12, 12, 12, 12, 12, 12, 12, 12, 12, 12, 12, 12, 12, 4, 5, 5, 5, 5, 5, 5, 12, 12, 12, 12, 12, 12, 0, 1, 1, 12, 12, 12, 12, 12]"
9,웬만해선 영화보다 안자는데 이 영화 <40분:TI> 정도보다 포기,"[웬, 만, 해, 선, , 영, 화, 보, 다, , 안, 자, 는, 데, , 이, , 영, 화, , 4, 0, 분, , 정, 도, 보, 다, , 포, 기]","[12, 12, 12, 12, 12, 12, 12, 12, 12, 12, 12, 12, 12, 12, 12, 12, 12, 12, 12, 12, 10, 11, 11, 12, 12, 12, 12, 12, 12, 12, 12]"


Next, load the tokenizer for the checkpoint. A fast tokenizer is required here because the label alignment below relies on `word_ids()`, which only fast tokenizers provide.


In [None]:
tokenizer = AutoTokenizer.from_pretrained(model_checkpoint, use_fast=True)

In [None]:
tokenizer.is_fast

True

In [None]:
inputs = tokenizer(datasets["train"][0]["tokens"], is_split_into_words=True)
inputs.tokens()

['[CLS]',
 '특',
 '히',
 '영',
 '동',
 '고',
 '속',
 '도',
 '로',
 '강',
 '릉',
 '방',
 '향',
 '문',
 '막',
 '휴',
 '게',
 '소',
 '에',
 '서',
 '만',
 '종',
 '분',
 '기',
 '점',
 '까',
 '지',
 '5',
 '㎞',
 '구',
 '간',
 '에',
 '는',
 '승',
 '용',
 '차',
 '전',
 '용',
 '임',
 '시',
 '갓',
 '길',
 '차',
 '로',
 '제',
 '를',
 '운',
 '영',
 '하',
 '기',
 '로',
 '했',
 '다',
 '.',
 '[SEP]']

In [None]:
inputs.word_ids()

[None,
 0,
 1,
 3,
 4,
 5,
 6,
 7,
 8,
 10,
 11,
 13,
 14,
 16,
 17,
 18,
 19,
 20,
 21,
 22,
 24,
 25,
 26,
 27,
 28,
 29,
 30,
 32,
 33,
 35,
 36,
 37,
 38,
 40,
 41,
 42,
 44,
 45,
 47,
 48,
 50,
 51,
 52,
 53,
 54,
 55,
 57,
 58,
 59,
 60,
 61,
 63,
 64,
 65,
 None]

In [None]:
def align_labels_with_tokens(labels, word_ids):
    new_labels = []
    current_word = None
    for word_id in word_ids:
        if word_id != current_word:
            # Start of a new word!
            current_word = word_id
            label = -100 if word_id is None else labels[word_id]
            new_labels.append(label)
        elif word_id is None:
            # Special token
            new_labels.append(-100)
        else:
            # Same word as previous token
            label = labels[word_id]
            # If the label is B-XXX we change it to I-XXX.
            # In the KLUE label list B-XXX sits at an even id and I-XXX at the
            # next id, so we check the tag name rather than the parity alone.
            if label_names[label].startswith("B-"):
                label += 1
            new_labels.append(label)

    return new_labels

In [None]:
labels = datasets["train"][0]["ner_tags"]
word_ids = inputs.word_ids()
print(labels)
print(align_labels_with_tokens(labels, word_ids))

[12, 12, 12, 2, 3, 3, 3, 3, 3, 12, 2, 3, 12, 12, 12, 12, 2, 3, 3, 3, 3, 12, 12, 12, 2, 3, 3, 3, 3, 12, 12, 12, 8, 9, 12, 12, 12, 12, 12, 12, 12, 12, 12, 12, 12, 12, 12, 12, 12, 12, 12, 12, 12, 12, 12, 12, 12, 12, 12, 12, 12, 12, 12, 12, 12, 12]
[-100, 12, 12, 2, 3, 3, 3, 3, 3, 2, 3, 12, 12, 2, 3, 3, 3, 3, 12, 12, 2, 3, 3, 3, 3, 12, 12, 8, 9, 12, 12, 12, 12, 12, 12, 12, 12, 12, 12, 12, 12, 12, 12, 12, 12, 12, 12, 12, 12, 12, 12, 12, 12, 12, -100]
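The sample above is tokenized character by character, so no word is ever split into several sub-tokens and the "same word as previous token" branch is never taken. The sketch below exercises that branch with a hypothetical `word_ids` list. It assumes the KLUE label layout shown earlier (each `B-` tag at an even id, the matching `I-` tag at the next id, and `O` at id 12) and is a standalone variant written for illustration, not the notebook's own cell.

```python
O_ID = 12  # id of the 'O' tag in the KLUE NER label list


def align_labels_with_tokens(labels, word_ids):
    """Expand word-level label ids to token level, masking special tokens with -100."""
    new_labels = []
    current_word = None
    for word_id in word_ids:
        if word_id is None:
            # Special token such as [CLS] or [SEP]
            new_labels.append(-100)
            current_word = None
        elif word_id != current_word:
            # First sub-token of a new word keeps the word's label
            current_word = word_id
            new_labels.append(labels[word_id])
        else:
            # Later sub-token of the same word: B-XXX (even id) becomes I-XXX
            label = labels[word_id]
            if label != O_ID and label % 2 == 0:
                label += 1
            new_labels.append(label)
    return new_labels


# Hypothetical input: three words, the second one split into two sub-tokens.
word_labels = [12, 6, 12]                 # O, B-PS, O
word_ids = [None, 0, 1, 1, 2, None]       # [CLS], w0, w1, w1, w2, [SEP]
aligned = align_labels_with_tokens(word_labels, word_ids)
print(aligned)
```

The second sub-token of the `B-PS` word receives id 7 (`I-PS`), while the special tokens at both ends are masked with -100.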


In [None]:
def tokenize_and_align_labels(examples):
    tokenized_inputs = tokenizer(
        examples["tokens"], truncation=True, is_split_into_words=True
    )
    all_labels = examples["ner_tags"]
    new_labels = []
    for i, labels in enumerate(all_labels):
        word_ids = tokenized_inputs.word_ids(i)
        new_labels.append(align_labels_with_tokens(labels, word_ids))

    tokenized_inputs["labels"] = new_labels
    return tokenized_inputs

In [None]:
tokenized_datasets = datasets.map(
    tokenize_and_align_labels,
    batched=True,
    remove_columns=datasets["train"].column_names,
)


In [None]:
from transformers import DataCollatorForTokenClassification

data_collator = DataCollatorForTokenClassification(tokenizer=tokenizer)

In [None]:
batch = data_collator([tokenized_datasets["train"][i] for i in range(2)])
batch["labels"]

You're using a BertTokenizerFast tokenizer. Please note that with a fast tokenizer, using the `__call__` method is faster than using a method to encode the text followed by a call to the `pad` method to get a padded encoding.


tensor([[-100,   12,   12,    2,    3,    3,    3,    3,    3,    2,    3,   12,
           12,    2,    3,    3,    3,    3,   12,   12,    2,    3,    3,    3,
            3,   12,   12,    8,    9,   12,   12,   12,   12,   12,   12,   12,
           12,   12,   12,   12,   12,   12,   12,   12,   12,   12,   12,   12,
           12,   12,   12,   12,   12,   12, -100],
        [-100,    8,    9,    9,   12,   12,   12,   12,   12,   12,   12,   12,
           12,   12,   12,   12,   12,   12,   12, -100, -100, -100, -100, -100,
         -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100,
         -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100,
         -100, -100, -100, -100, -100, -100, -100]])
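In the batch above, the shorter example's labels are filled with -100 up to the length of the longer one. `DataCollatorForTokenClassification` uses -100 as its default label padding value, and PyTorch's cross-entropy loss ignores that index by default, so padded positions do not contribute to the loss. A rough pure-Python sketch of the padding step, using a hypothetical helper `pad_labels` and right-side padding as in this notebook:

```python
def pad_labels(batch_labels, pad_id=-100):
    """Right-pad every label list to the length of the longest one in the batch."""
    max_len = max(len(labels) for labels in batch_labels)
    return [labels + [pad_id] * (max_len - len(labels)) for labels in batch_labels]


# Two short, made-up label sequences of different lengths.
batch_labels = [
    [-100, 12, 2, 3, 12, -100],
    [-100, 8, 9, -100],
]
padded = pad_labels(batch_labels)
for row in padded:
    print(row)
```

This only illustrates the label side. The real collator also pads `input_ids` and `attention_mask` through the tokenizer.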

In [None]:
batch

{'input_ids': tensor([[   0, 1813, 1969, 1437,  856,  594, 1283,  848,  991,  553, 1026, 1129,
         1904, 1091, 1037, 1956,  578, 1282, 1421, 1258, 1038, 1558, 1175,  645,
         1540,  653, 1583,   25,  207,  615,  545, 1421,  793, 1324, 1468, 1632,
         1537, 1468, 1510, 1325,  551,  647, 1632,  991, 1545, 1022, 1471, 1437,
         1889,  645,  991, 1902,  809,   18,    2],
        [   0, 1891,  617,  842, 1258, 1885, 1023, 1498,  743, 1088,  727, 1187,
         1891, 1518, 1873, 1511,  801,  809,   18,    2,    1,    1,    1,    1,
            1,    1,    1,    1,    1,    1,    1,    1,    1,    1,    1,    1,
            1,    1,    1,    1,    1,    1,    1,    1,    1,    1,    1,    1,
            1,    1,    1,    1,    1,    1,    1]]), 'token_type_ids': tensor([[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
         0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
         0, 0, 0, 0, 0, 0, 0],
        [0, 0, 0, 0, 

In [None]:
for i in range(2):
    print(tokenized_datasets["train"][i]["labels"])

[-100, 12, 12, 2, 3, 3, 3, 3, 3, 2, 3, 12, 12, 2, 3, 3, 3, 3, 12, 12, 2, 3, 3, 3, 3, 12, 12, 8, 9, 12, 12, 12, 12, 12, 12, 12, 12, 12, 12, 12, 12, 12, 12, 12, 12, 12, 12, 12, 12, 12, 12, 12, 12, 12, -100]
[-100, 8, 9, 9, 12, 12, 12, 12, 12, 12, 12, 12, 12, 12, 12, 12, 12, 12, 12, -100]


In [None]:
!pip install seqeval




In [None]:
!pip install evaluate



In [None]:
import evaluate

metric = evaluate.load("seqeval")


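For reference, the cell below generates random integer predictions and labels of the kind a simple classification metric would take. `seqeval` instead expects lists of tag strings, so these arrays are not used in the check that follows.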

In [None]:
fake_preds = np.random.randint(0, 2, size=(64,))
fake_labels = np.random.randint(0, 2, size=(64,))
fake_preds, fake_labels

(array([0, 0, 1, 1, 1, 1, 1, 1, 0, 1, 1, 1, 0, 1, 0, 1, 1, 1, 0, 1, 1, 1,
        1, 1, 1, 0, 1, 1, 1, 1, 0, 1, 0, 0, 1, 1, 0, 0, 0, 0, 1, 1, 1, 0,
        0, 0, 0, 1, 0, 0, 0, 0, 1, 0, 1, 1, 0, 0, 0, 1, 0, 1, 0, 0]),
 array([1, 1, 0, 1, 0, 1, 1, 0, 0, 0, 0, 1, 0, 0, 0, 0, 1, 1, 1, 0, 0, 1,
        1, 1, 0, 1, 0, 1, 0, 0, 1, 1, 1, 1, 0, 1, 1, 1, 0, 1, 0, 1, 0, 1,
        0, 1, 1, 0, 0, 1, 0, 1, 0, 0, 0, 1, 0, 1, 1, 0, 1, 1, 1, 0]))

To check that `compute()` works, take the tags of the first training example as the reference, copy them as the prediction, and change a single tag.

In [None]:
labels = datasets["train"][0]["ner_tags"]
labels = [label_names[i] for i in labels]


In [None]:
predictions = labels.copy()
predictions[3] = 'O' 
metric.compute(predictions=[predictions], references=[labels])

{'LC': {'precision': 0.75, 'recall': 0.75, 'f1': 0.75, 'number': 4},
 'QT': {'precision': 1.0, 'recall': 1.0, 'f1': 1.0, 'number': 1},
 'overall_precision': 0.8,
 'overall_recall': 0.8,
 'overall_f1': 0.8000000000000002,
 'overall_accuracy': 0.9848484848484849}
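Changing a single tag lowers the entity-level scores much more than the token-level accuracy, because an entity counts as correct only when its type and both boundaries match. The sketch below shows that idea in plain Python. It is an illustration under the assumption of lenient BIO decoding (an `I-` tag with no open entity starts a new one) and is not seqeval's actual implementation; `get_spans` and `entity_prf` are hypothetical helper names.

```python
def get_spans(tags):
    """Return the set of (type, start, end) entity spans found in a BIO tag list."""
    spans = set()
    start, current = None, None
    for i, tag in enumerate(tags + ["O"]):  # trailing sentinel closes the last span
        prefix, _, etype = tag.partition("-")
        continues = prefix == "I" and etype == current
        if current is not None and not continues:
            spans.add((current, start, i - 1))
            current = None
        if prefix == "B" or (prefix == "I" and current is None):
            start, current = i, etype
    return spans


def entity_prf(predictions, references):
    """Entity-level precision, recall and F1 for a single sentence."""
    pred, ref = get_spans(predictions), get_spans(references)
    true_positives = len(pred & ref)
    precision = true_positives / len(pred) if pred else 0.0
    recall = true_positives / len(ref) if ref else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1


# Made-up sentence with one LC and one QT entity; one B-LC tag is replaced by O.
references = ["O", "B-LC", "I-LC", "O", "B-QT", "I-QT"]
predictions = references.copy()
predictions[1] = "O"
scores = entity_prf(predictions, references)
print(scores)
```

Here the predicted `LC` span starts one position late, so it does not match the reference span and only the `QT` entity counts as correct, even though five of six tags agree.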

In [None]:
import numpy as np


def compute_metrics(eval_preds):
    logits, labels = eval_preds
    predictions = np.argmax(logits, axis=-1)

    # Remove ignored index (special tokens) and convert to labels
    true_labels = [[label_names[l] for l in label if l != -100] for label in labels]
    true_predictions = [
        [label_names[p] for (p, l) in zip(prediction, label) if l != -100]
        for prediction, label in zip(predictions, labels)
    ]
    all_metrics = metric.compute(predictions=true_predictions, references=true_labels)
    return {
        "precision": all_metrics["overall_precision"],
        "recall": all_metrics["overall_recall"],
        "f1": all_metrics["overall_f1"],
        "accuracy": all_metrics["overall_accuracy"],
    }

In [None]:
id2label = {i: label for i, label in enumerate(label_names)}
label2id = {v: k for k, v in id2label.items()}

In [None]:
from transformers import AutoModelForTokenClassification

model = AutoModelForTokenClassification.from_pretrained(
    model_checkpoint,
    id2label=id2label,
    label2id=label2id,
)


Some weights of the model checkpoint at klue/roberta-base were not used when initializing RobertaForTokenClassification: ['lm_head.dense.bias', 'lm_head.decoder.weight', 'lm_head.layer_norm.bias', 'lm_head.bias', 'lm_head.layer_norm.weight', 'lm_head.decoder.bias', 'lm_head.dense.weight']
- This IS expected if you are initializing RobertaForTokenClassification from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing RobertaForTokenClassification from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
Some weights of RobertaForTokenClassification were not initialized from the model checkpoint at klue/roberta-base and are newly initialized: ['classifier.weight', 'classifier.bias']
You should probably TRAIN this mode

In [None]:
model.config.num_labels

13

In [None]:
args = TrainingArguments(
    "test-ner",
    evaluation_strategy="epoch",
    save_strategy="epoch",
    learning_rate=2e-5,
    num_train_epochs=3,
    weight_decay=0.01)

In [None]:
tokenizer

BertTokenizerFast(name_or_path='klue/roberta-base', vocab_size=32000, model_max_length=512, is_fast=True, padding_side='right', truncation_side='right', special_tokens={'bos_token': '[CLS]', 'eos_token': '[SEP]', 'unk_token': '[UNK]', 'sep_token': '[SEP]', 'pad_token': '[PAD]', 'cls_token': '[CLS]', 'mask_token': '[MASK]'})

In [None]:
from transformers import Trainer

trainer = Trainer(
    model=model,
    args=args,
    train_dataset=tokenized_datasets["train"],
    eval_dataset=tokenized_datasets["validation"],
    data_collator=data_collator,
    compute_metrics=compute_metrics,
    tokenizer=tokenizer,
)
trainer.train()



Epoch,Training Loss,Validation Loss,Precision,Recall,F1,Accuracy
1,0.158,0.168854,0.755618,0.749947,0.752772,0.948118
2,0.1066,0.141025,0.770749,0.797293,0.783796,0.956368
3,0.0723,0.143877,0.794912,0.81097,0.802861,0.959206


TrainOutput(global_step=7878, training_loss=0.13535993163977278, metrics={'train_runtime': 1353.984, 'train_samples_per_second': 46.547, 'train_steps_per_second': 5.818, 'total_flos': 2529804225374064.0, 'train_loss': 0.13535993163977278, 'epoch': 3.0})

Once training is finished, evaluate the final model on the validation set.

In [None]:
trainer.evaluate()

{'eval_loss': 0.14387665688991547,
 'eval_precision': 0.7949123410106566,
 'eval_recall': 0.8109700498000982,
 'eval_f1': 0.8028609124366364,
 'eval_accuracy': 0.9592061204124626,
 'eval_runtime': 35.6955,
 'eval_samples_per_second': 140.074,
 'eval_steps_per_second': 17.509,
 'epoch': 3.0}

In [None]:
from transformers import pipeline

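# Replace this with your own checkpoint.
# A separate variable is used so that `model_checkpoint` keeps pointing at the
# pretrained base model, which is loaded again in the next section.
finetuned_checkpoint = "/content/test-ner/checkpoint-7878"
token_classifier = pipeline(
    "token-classification", model=finetuned_checkpoint, aggregation_strategy="simple"
)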
token_classifier("2013년도 강릉에서 열리는 모터쇼에 참가한 레이싱 모델 허윤미 씨의 귀요미 셀카입니다.")

[{'entity_group': 'DT',
  'score': 0.95129395,
  'word': '2013년도',
  'start': 0,
  'end': 6},
 {'entity_group': 'LC', 'score': 0.707459, 'word': '강릉', 'start': 7, 'end': 9},
 {'entity_group': 'PS',
  'score': 0.9956853,
  'word': '허윤미',
  'start': 32,
  'end': 35}]
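Each result carries `start` and `end` character offsets into the input string, along with a confidence `score`. As a small sketch of how this output might be consumed, the snippet below copies the results shown above into a list and keeps only entities above a threshold. The helper `confident_entities` and the threshold `min_score=0.9` are assumptions made for illustration.

```python
text = "2013년도 강릉에서 열리는 모터쇼에 참가한 레이싱 모델 허윤미 씨의 귀요미 셀카입니다."

# Pipeline output copied from the cell above.
results = [
    {"entity_group": "DT", "score": 0.95129395, "word": "2013년도", "start": 0, "end": 6},
    {"entity_group": "LC", "score": 0.707459, "word": "강릉", "start": 7, "end": 9},
    {"entity_group": "PS", "score": 0.9956853, "word": "허윤미", "start": 32, "end": 35},
]


def confident_entities(text, results, min_score=0.9):
    """Return (entity_group, surface string) pairs whose score reaches min_score."""
    return [
        (r["entity_group"], text[r["start"]:r["end"]])
        for r in results
        if r["score"] >= min_score
    ]


kept = confident_entities(text, results)
print(kept)
```

Slicing the original text with the offsets reproduces the `word` field for these results, and the lower-scoring `LC` prediction is dropped at this threshold.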

## Parallel Training with the `accelerate` Module

In [None]:
from torch.utils.data import DataLoader

train_dataloader = DataLoader(
    tokenized_datasets["train"],
    shuffle=True,
    collate_fn=data_collator,
    batch_size=8,
)
eval_dataloader = DataLoader(
    tokenized_datasets["validation"], collate_fn=data_collator, batch_size=8
)

In [None]:
model = AutoModelForTokenClassification.from_pretrained(
    model_checkpoint,
    id2label=id2label,
    label2id=label2id,
)

In [None]:
from torch.optim import AdamW

optimizer = AdamW(model.parameters(), lr=2e-5)

In [None]:
from accelerate import Accelerator

accelerator = Accelerator()
model, optimizer, train_dataloader, eval_dataloader = accelerator.prepare(
    model, optimizer, train_dataloader, eval_dataloader
)

In [None]:
from transformers import get_scheduler

num_train_epochs = 3
num_update_steps_per_epoch = len(train_dataloader)
num_training_steps = num_train_epochs * num_update_steps_per_epoch

lr_scheduler = get_scheduler(
    "linear",
    optimizer=optimizer,
    num_warmup_steps=0,
    num_training_steps=num_training_steps,
)
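With `"linear"` and no warmup, the learning rate decays from its initial value to zero over `num_training_steps`. The sketch below reproduces that arithmetic in plain Python under the assumption of a single process, where the dataloader yields ceil(21008 / 8) batches per epoch. The function `linear_lr` is a hypothetical stand-in written for illustration and not the library's scheduler object.

```python
import math


def linear_lr(step, base_lr, num_training_steps, num_warmup_steps=0):
    """Learning rate at `step` for linear warmup followed by linear decay to zero."""
    if step < num_warmup_steps:
        return base_lr * step / max(1, num_warmup_steps)
    remaining = num_training_steps - step
    decay_steps = max(1, num_training_steps - num_warmup_steps)
    return base_lr * max(0.0, remaining / decay_steps)


# 21008 training examples, batch size 8, 3 epochs (single-process assumption).
steps_per_epoch = math.ceil(21008 / 8)
num_training_steps = 3 * steps_per_epoch
print(steps_per_epoch, num_training_steps)

for step in (0, num_training_steps // 2, num_training_steps):
    print(step, linear_lr(step, 2e-5, num_training_steps))
```

The resulting step count matches the `global_step=7878` reported by the `Trainer` run earlier, which used the same batch size and number of epochs.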

In [None]:
from huggingface_hub import Repository, get_full_repo_name

model_name = "bert-finetuned-ner-accelerate"
repo_name = get_full_repo_name(model_name)
repo_name

In [None]:
output_dir = "bert-finetuned-ner-accelerate"
repo = Repository(output_dir, clone_from=repo_name)

In [None]:
def postprocess(predictions, labels):
    predictions = predictions.detach().cpu().clone().numpy()
    labels = labels.detach().cpu().clone().numpy()

    # Remove ignored index (special tokens) and convert to labels
    true_labels = [[label_names[l] for l in label if l != -100] for label in labels]
    true_predictions = [
        [label_names[p] for (p, l) in zip(prediction, label) if l != -100]
        for prediction, label in zip(predictions, labels)
    ]
    return true_labels, true_predictions

In [None]:
from tqdm.auto import tqdm
import torch

progress_bar = tqdm(range(num_training_steps))

for epoch in range(num_train_epochs):
    # Training
    model.train()
    for batch in train_dataloader:
        outputs = model(**batch)
        loss = outputs.loss
        accelerator.backward(loss)

        optimizer.step()
        lr_scheduler.step()
        optimizer.zero_grad()
        progress_bar.update(1)

    # Evaluation
    model.eval()
    for batch in eval_dataloader:
        with torch.no_grad():
            outputs = model(**batch)

        predictions = outputs.logits.argmax(dim=-1)
        labels = batch["labels"]

        # Necessary to pad predictions and labels for being gathered
        predictions = accelerator.pad_across_processes(predictions, dim=1, pad_index=-100)
        labels = accelerator.pad_across_processes(labels, dim=1, pad_index=-100)

        predictions_gathered = accelerator.gather(predictions)
        labels_gathered = accelerator.gather(labels)

        # postprocess returns (true_labels, true_predictions), in that order
        true_labels, true_predictions = postprocess(predictions_gathered, labels_gathered)
        metric.add_batch(predictions=true_predictions, references=true_labels)

    results = metric.compute()
    print(
        f"epoch {epoch}:",
        {
            key: results[f"overall_{key}"]
            for key in ["precision", "recall", "f1", "accuracy"]
        },
    )

    # Save and upload
    accelerator.wait_for_everyone()
    unwrapped_model = accelerator.unwrap_model(model)
    unwrapped_model.save_pretrained(output_dir, save_function=accelerator.save)
    if accelerator.is_main_process:
        tokenizer.save_pretrained(output_dir)
        repo.push_to_hub(
            commit_message=f"Training in progress epoch {epoch}", blocking=False
        )