<a href="https://colab.research.google.com/github/gauss5930/Huggingface-Course/blob/main/Chapter%207./Token_Classification.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Token Classification

Token classification covers any problem that assigns a label to each token in a sentence, such as:

- Named entity recognition (NER)
- Part-of-speech tagging (POS)
- Chunking

## Preparing the data

First, we need a dataset for token classification. Here we use the CoNLL-2003 dataset, which contains news stories from Reuters.

To load the CoNLL-2003 dataset, we use the load_dataset() function from the 🤗 Datasets library.

In [1]:
!pip install datasets
!pip install --upgrade accelerate
from datasets import load_dataset

raw_datasets = load_dataset('conll2003')


In [2]:
raw_datasets

DatasetDict({
    train: Dataset({
        features: ['id', 'tokens', 'pos_tags', 'chunk_tags', 'ner_tags'],
        num_rows: 14041
    })
    validation: Dataset({
        features: ['id', 'tokens', 'pos_tags', 'chunk_tags', 'ner_tags'],
        num_rows: 3250
    })
    test: Dataset({
        features: ['id', 'tokens', 'pos_tags', 'chunk_tags', 'ner_tags'],
        num_rows: 3453
    })
})

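The dataset contains labels for three tasks: NER, POS tagging, and chunking. The biggest difference from other datasets is that the input texts are not presented as sentences or documents but as lists of words.

Let's look at the first element of the training set.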

In [3]:
raw_datasets['train'][0]['tokens']

['EU', 'rejects', 'German', 'call', 'to', 'boycott', 'British', 'lamb', '.']

Since we are going to perform named entity recognition (NER), let's look at the NER tags.

In [4]:
raw_datasets['train'][0]['ner_tags']

[3, 0, 7, 0, 0, 0, 7, 0, 0]

In [5]:
ner_feature = raw_datasets['train'].features['ner_tags']   # used to convert labels from integers to names

label_names = ner_feature.feature.names

label_names

['O', 'B-PER', 'I-PER', 'B-ORG', 'I-ORG', 'B-LOC', 'I-LOC', 'B-MISC', 'I-MISC']

In [6]:
# Decode the labels using the names shown above
words = raw_datasets['train'][0]['tokens']
labels = raw_datasets['train'][0]['ner_tags']
line1 = ''
line2 = ''
for word, label in zip(words, labels):
  full_label = label_names[label]
  max_length = max(len(word), len(full_label))
  line1 += word + " " * (max_length - len(word) + 1)
  line2 += full_label + " " * (max_length - len(full_label) + 1)

print(line1)
print(line2)

EU    rejects German call to boycott British lamb . 
B-ORG O       B-MISC O    O  O       B-MISC  O    O 


## Processing the data

In [7]:
!pip install transformers
from transformers import AutoTokenizer

model_checkpoint = 'bert-base-cased'
tokenizer = AutoTokenizer.from_pretrained(model_checkpoint)


In [8]:
# Pre-tokenized input can be handled simply by adding is_split_into_words=True
inputs = tokenizer(raw_datasets['train'][0]['tokens'], is_split_into_words = True)
inputs.tokens()

['[CLS]',
 'EU',
 'rejects',
 'German',
 'call',
 'to',
 'boycott',
 'British',
 'la',
 '##mb',
 '.',
 '[SEP]']

In [9]:
inputs.word_ids()

[None, 0, 1, 2, 3, 4, 5, 6, 7, 7, 8, None]
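
The word IDs make it possible to map every sub-token back to the word it came from. The following is a minimal sketch of that idea, with the tokens and word IDs copied from the two outputs above as plain lists so that no tokenizer is needed. The helper name `group_tokens_by_word` and the `example_*` variables are hypothetical and exist only for this illustration.

```python
def group_tokens_by_word(tokens, word_ids):
    """Rebuild the original words from WordPiece tokens and their word IDs."""
    words = {}
    for token, word_id in zip(tokens, word_ids):
        if word_id is None:
            continue  # special tokens such as [CLS] and [SEP] belong to no word
        # Continuation pieces carry a '##' prefix that is not part of the word
        piece = token[2:] if token.startswith('##') else token
        words[word_id] = words.get(word_id, '') + piece
    return [words[i] for i in sorted(words)]


example_tokens = ['[CLS]', 'EU', 'rejects', 'German', 'call', 'to',
                  'boycott', 'British', 'la', '##mb', '.', '[SEP]']
example_word_ids = [None, 0, 1, 2, 3, 4, 5, 6, 7, 7, 8, None]

print(group_tokens_by_word(example_tokens, example_word_ids))
```

The two tokens that share word ID 7 ('la' and '##mb') are merged back into 'lamb', which is why a single word label has to be spread over several tokens in the next step.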

In [10]:
def align_labels_with_tokens(labels, word_ids):
  new_labels = []
  current_word = None
  for word_id in word_ids:
    if word_id != current_word:
      # Start of a new word
      current_word = word_id
      label = -100 if word_id is None else labels[word_id]
      new_labels.append(label)
    elif word_id is None:
      # Special token
      new_labels.append(-100)
    else:
      # Same word as the previous token
      label = labels[word_id]
      # If the label is B-XXX, change it to I-XXX
      if label % 2 == 1:
        label += 1
      new_labels.append(label)

  return new_labels

In [11]:
# Try it on the first sentence of the dataset
labels = raw_datasets['train'][0]['ner_tags']
word_ids = inputs.word_ids()
print(labels)
print(align_labels_with_tokens(labels, word_ids))

[3, 0, 7, 0, 0, 0, 7, 0, 0]
[-100, 3, 0, 7, 0, 0, 0, 7, 0, 0, 0, -100]
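
The `label % 2 == 1` check in align_labels_with_tokens() relies on the ordering of the label list shown earlier: every B-XXX label sits at an odd index and its I-XXX counterpart at the next even index. The sketch below makes that explicit. The helper `to_inside_label` and the `demo_*` names are hypothetical, and the three-sub-token word is an invented case, not an example from the dataset.

```python
demo_label_names = ['O', 'B-PER', 'I-PER', 'B-ORG', 'I-ORG',
                    'B-LOC', 'I-LOC', 'B-MISC', 'I-MISC']


def to_inside_label(label):
    """Map a B-XXX label ID to the matching I-XXX ID; leave O and I-XXX unchanged."""
    return label + 1 if label % 2 == 1 else label


# Show what each label becomes when it is carried over to a continuation token
for label_id, name in enumerate(demo_label_names):
    print(f'{name:7} -> {demo_label_names[to_inside_label(label_id)]}')

# Invented case: one word labelled B-LOC (ID 5) that is split into three sub-tokens.
# The first sub-token keeps B-LOC, the continuation sub-tokens become I-LOC.
sub_token_labels = [5] + [to_inside_label(5)] * 2
print([demo_label_names[i] for i in sub_token_labels])
```

With a label list in a different order this parity trick would not hold, so the mapping would have to be built from the label names instead.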


In [12]:
# Preprocess the whole dataset: tokenize all inputs and apply align_labels_with_tokens() to all labels
def tokenize_and_align_labels(examples):
  tokenized_inputs = tokenizer(
      examples['tokens'], truncation = True, is_split_into_words = True
  )
  all_labels = examples['ner_tags']
  new_labels = []
  for i, labels in enumerate(all_labels):
    word_ids = tokenized_inputs.word_ids(i)
    new_labels.append(align_labels_with_tokens(labels, word_ids))

  tokenized_inputs['labels'] = new_labels
  return tokenized_inputs

In [13]:
tokenized_datasets = raw_datasets.map(
    tokenize_and_align_labels,
    batched = True,
    remove_columns = raw_datasets['train'].column_names,
)


## Fine-tuning the model with the Trainer API

### Data collation

We cannot use DataCollatorWithPadding here because it pads only the inputs. The labels must be padded in exactly the same way as the inputs so that both stay the same size.

DataCollatorForTokenClassification takes care of all of this.

In [14]:
from transformers import DataCollatorForTokenClassification

data_collator = DataCollatorForTokenClassification(tokenizer = tokenizer)
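
To illustrate what padding the labels together with the inputs means, here is a minimal pure-Python sketch. It is not the library's implementation: the function `pad_batch` is hypothetical, the token IDs are made up, and right-side padding with a pad token ID of 0 is an assumption. The label padding value of -100 matches the value used for special tokens above, so padded positions are ignored by the loss in the same way.

```python
def pad_batch(features, pad_token_id=0, label_pad_id=-100):
    """Pad input_ids, attention_mask and labels to the longest example in the batch."""
    max_len = max(len(f['input_ids']) for f in features)
    batch = {'input_ids': [], 'attention_mask': [], 'labels': []}
    for f in features:
        n_real = len(f['input_ids'])
        n_pad = max_len - n_real
        batch['input_ids'].append(f['input_ids'] + [pad_token_id] * n_pad)
        batch['attention_mask'].append([1] * n_real + [0] * n_pad)
        # Labels are padded with -100 so that padded positions do not count in the loss
        batch['labels'].append(f['labels'] + [label_pad_id] * n_pad)
    return batch


# Made-up token IDs; only the lengths and the label layout matter here
demo_features = [
    {'input_ids': [101, 11, 12, 13, 102], 'labels': [-100, 3, 0, 7, -100]},
    {'input_ids': [101, 21, 102], 'labels': [-100, 1, -100]},
]
demo_batch = pad_batch(demo_features)
for key, rows in demo_batch.items():
    print(key, rows)
```

After padding, every row of `input_ids`, `attention_mask`, and `labels` has the same length, which is what allows the batch to be stacked into tensors.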

### Metrics

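To have the Trainer compute a metric every epoch, we need to define a compute_metrics() function that takes the arrays of predictions and labels and returns a dictionary with the metric names and values.

The framework traditionally used to evaluate token classification predictions is seqeval. To use this metric, we first need to install the seqeval library.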

In [15]:
!pip install seqeval
!pip install evaluate

import evaluate

metric = evaluate.load('seqeval')


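This metric does not behave like standard accuracy: it takes the lists of labels as strings, not integers. So we need to fully decode the predictions and labels before passing them to the metric.

The compute_metrics() function takes the argmax of the logits to turn them into predictions. It then converts both labels and predictions from integers to strings, drops every position where the label is -100, and passes the results to the metric.compute() method.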

In [16]:
import numpy as np

def compute_metrics(eval_preds):
  logits, labels = eval_preds
  predictions = np.argmax(logits, axis = -1)

  # Remove ignored indices (special tokens) and convert to label names
  true_labels = [[label_names[l] for l in label if l != -100] for label in labels]
  true_predictions = [
      [label_names[p] for (p, l) in zip(prediction, label) if l != -100] for prediction, label in zip(predictions, labels)
  ]
  all_metrics = metric.compute(predictions = true_predictions, references = true_labels)
  return {
      'precision': all_metrics['overall_precision'],
      'recall': all_metrics['overall_recall'],
      'f1': all_metrics['overall_f1'],
      'accuracy': all_metrics['overall_accuracy'],
  }
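
The filtering step inside compute_metrics() can be checked on a tiny hand-made example without loading seqeval. In the sketch below the logits, the labels, and the three-entry label list are all invented for illustration, and the `demo_*` and `kept_*` names are hypothetical.

```python
import numpy as np

demo_names = ['O', 'B-PER', 'I-PER']

# Invented logits for one sequence with 4 positions and 3 classes
demo_logits = np.array([[[2.0, 0.1, 0.1],    # [CLS]-like position
                         [0.1, 3.0, 0.2],
                         [0.2, 0.1, 1.5],
                         [1.0, 0.3, 0.2]]])  # [SEP]-like position
# -100 marks positions that must be ignored (special tokens)
demo_labels = np.array([[-100, 1, 2, -100]])

# Highest-scoring class per position
demo_predictions = np.argmax(demo_logits, axis=-1)

# Keep only positions whose reference label is not -100, then decode to strings
kept_labels = [[demo_names[l] for l in row if l != -100] for row in demo_labels]
kept_predictions = [
    [demo_names[p] for p, l in zip(pred_row, label_row) if l != -100]
    for pred_row, label_row in zip(demo_predictions, demo_labels)
]

print(kept_labels)
print(kept_predictions)
```

Note that the predictions are filtered with the reference labels, not with the predictions themselves, so both lists always end up with the same length.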

### Defining the model

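Since we are working on a token classification problem, we will use the AutoModelForTokenClassification class. The most important thing to remember when defining this model is to pass along information about the number of labels we have. The simplest way is to pass that number with the num_labels argument, but for things to work well it is better to set the correct label correspondences instead.

These are set with two dictionaries, id2label and label2id.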

In [17]:

id2label = {i: label for i, label in enumerate(label_names)}
label2id = {v: k for k, v in id2label.items()}
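
As a quick sanity check, the two dictionaries should be exact inverses of each other. The sketch below rebuilds them from a copy of the label list shown earlier (under hypothetical `demo_*` names, so the notebook's own variables are left untouched) and decodes the NER tags of the first training sentence.

```python
demo_label_names = ['O', 'B-PER', 'I-PER', 'B-ORG', 'I-ORG',
                    'B-LOC', 'I-LOC', 'B-MISC', 'I-MISC']

demo_id2label = {i: label for i, label in enumerate(demo_label_names)}
demo_label2id = {label: i for i, label in demo_id2label.items()}

# Every ID must survive a round trip through both mappings
round_trip_ok = all(demo_label2id[demo_id2label[i]] == i for i in demo_id2label)
print(round_trip_ok)

# Decode the NER tags of the first training sentence
decoded = [demo_id2label[i] for i in [3, 0, 7, 0, 0, 0, 7, 0, 0]]
print(decoded)
```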

In [18]:
# Now we can pass these to AutoModelForTokenClassification.from_pretrained()
from transformers import AutoModelForTokenClassification

model = AutoModelForTokenClassification.from_pretrained(
    model_checkpoint,
    id2label = id2label,
    label2id = label2id,
)

Some weights of the model checkpoint at bert-base-cased were not used when initializing BertForTokenClassification: ['cls.predictions.transform.dense.weight', 'cls.predictions.transform.dense.bias', 'cls.predictions.transform.LayerNorm.bias', 'cls.predictions.decoder.weight', 'cls.seq_relationship.bias', 'cls.predictions.transform.LayerNorm.weight', 'cls.predictions.bias', 'cls.seq_relationship.weight']
- This IS expected if you are initializing BertForTokenClassification from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing BertForTokenClassification from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
Some weights of BertForTokenClassification were not initialized from the model checkpoint at bert-base-cas

In [19]:
# Check that the model has the right number of labels
model.config.num_labels

9

### Fine-tuning the model

Now everything is ready to train the model! Two things remain before defining the Trainer: logging in to Hugging Face and defining the training arguments.

In [20]:
from huggingface_hub import notebook_login

notebook_login()

In [21]:
# Define the TrainingArguments
from transformers import TrainingArguments

args = TrainingArguments(
    'bert-finetuned-ner',
    evaluation_strategy = 'epoch',
    save_strategy = 'epoch',
    learning_rate = 2e-5,
    num_train_epochs = 3,
    weight_decay = 0.01,
    push_to_hub = True
)

In [22]:
# Now pass everything to the Trainer and start training!
from transformers import Trainer

trainer = Trainer(
    model = model,
    args = args,
    train_dataset = tokenized_datasets['train'],
    eval_dataset = tokenized_datasets['validation'],
    data_collator = data_collator,
    compute_metrics = compute_metrics,
    tokenizer = tokenizer
)
trainer.train()

Cloning https://huggingface.co/Cartinoe5930/bert-finetuned-ner into local empty directory.
You're using a BertTokenizerFast tokenizer. Please note that with a fast tokenizer, using the `__call__` method is faster than using a method to encode the text followed by a call to the `pad` method to get a padded encoding.


| Epoch | Training Loss | Validation Loss | Precision | Recall | F1 | Accuracy |
|---|---|---|---|---|---|---|
| 1 | 0.09 | 0.073159 | 0.915344 | 0.931673 | 0.923436 | 0.981383 |
| 2 | 0.0346 | 0.069128 | 0.928347 | 0.946314 | 0.937245 | 0.985224 |
| 3 | 0.0183 | 0.063441 | 0.933698 | 0.947997 | 0.940793 | 0.985783 |


TrainOutput(global_step=5268, training_loss=0.06761122926583142, metrics={'train_runtime': 692.2425, 'train_samples_per_second': 60.85, 'train_steps_per_second': 7.61, 'total_flos': 923954298531210.0, 'train_loss': 0.06761122926583142, 'epoch': 3.0})
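
The `global_step=5268` in this output can be reproduced from the dataset size. The arithmetic below assumes a per-device batch size of 8 (the TrainingArguments default, since no batch size was passed above), a single device, and no gradient accumulation; these are assumptions about the run, not values reported in the output.

```python
import math

num_train_examples = 14041   # size of the train split
num_epochs = 3               # num_train_epochs passed to TrainingArguments
batch_size = 8               # assumption: default per_device_train_batch_size, one device

# The last batch of each epoch is smaller, so round up
steps_per_epoch = math.ceil(num_train_examples / batch_size)
total_steps = steps_per_epoch * num_epochs

print(steps_per_epoch, total_steps)
```

This gives 1756 steps per epoch and 5268 steps in total, matching the reported `global_step`.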

In [23]:
# Upload the latest version of the model with push_to_hub()
trainer.push_to_hub(commit_message = 'Training complete')


'https://huggingface.co/Cartinoe5930/bert-finetuned-ner/commit/1e1092d94ae160e78ca3e2dd126f54c990e1f88a'

### Using the fine-tuned model

The fine-tuned model can be used on the Model Hub through the inference widget. To use it locally in a pipeline, just specify the appropriate model identifier.

In [27]:
from transformers import pipeline

# Replace with your own checkpoint
model_checkpoint = 'Cartinoe5930/bert-finetuned-ner'
token_classifier = pipeline(
    'token-classification', model = model_checkpoint, aggregation_strategy = 'simple'
)
token_classifier('My name is Jake and I work at Apple in California.')

[{'entity_group': 'PER',
  'score': 0.9953784,
  'word': 'Jake',
  'start': 11,
  'end': 15},
 {'entity_group': 'ORG',
  'score': 0.9980902,
  'word': 'Apple',
  'start': 30,
  'end': 35},
 {'entity_group': 'LOC',
  'score': 0.99932504,
  'word': 'California',
  'start': 39,
  'end': 49}]
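
The `start` and `end` fields in this output are character offsets into the input string, so each entity can be recovered by slicing the original text. A minimal sketch, with the sentence and the offsets copied from the output above into plain Python objects (scores omitted, `demo_*` names hypothetical):

```python
demo_text = 'My name is Jake and I work at Apple in California.'

# Entity groups, words and offsets copied from the pipeline output above
demo_entities = [
    {'entity_group': 'PER', 'word': 'Jake', 'start': 11, 'end': 15},
    {'entity_group': 'ORG', 'word': 'Apple', 'start': 30, 'end': 35},
    {'entity_group': 'LOC', 'word': 'California', 'start': 39, 'end': 49},
]

# Slice the original text with the offsets; 'end' is exclusive
demo_spans = [demo_text[e['start']:e['end']] for e in demo_entities]
for entity, span in zip(demo_entities, demo_spans):
    print(entity['entity_group'], repr(span))
```

Slicing by offset is useful when the exact original spelling is needed, because the `word` field is rebuilt from tokens and may not always match the input text character for character.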