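# Named Entity Recognition with BERT
This notebook performs named entity recognition (NER) with BERT (Bidirectional Encoder Representations from Transformers). BERT became widely used because its pre-training and fine-tuning scheme lets anyone adapt a pre-trained model to a downstream task with little effort. The dataset is CoNLL-2003.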

# Setup

In [1]:
pip install -q transformers datasets seqeval



In [1]:
import numpy as np

from datasets import load_dataset, load_metric
from transformers import AutoModelForTokenClassification, TrainingArguments, Trainer
from transformers import DataCollatorForTokenClassification



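# Loading the dataset
The CoNLL-2003 dataset is loaded with the datasets library. Each example is already a dictionary of fields (key-value pairs) with the NER labels stored as integer class IDs, so it can be passed to the model with little extra work.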

In [2]:
datasets = load_dataset("conll2003")

Reusing dataset conll2003 (/root/.cache/huggingface/datasets/conll2003/conll2003/1.0.0/63f4ebd1bcb7148b1644497336fd74643d4ce70123334431a3c053b7ee4e96ee)
100% 3/3 [00:00<00:00, 541.71it/s]


In [3]:
datasets

DatasetDict({
    train: Dataset({
        features: ['id', 'tokens', 'pos_tags', 'chunk_tags', 'ner_tags'],
        num_rows: 14042
    })
    validation: Dataset({
        features: ['id', 'tokens', 'pos_tags', 'chunk_tags', 'ner_tags'],
        num_rows: 3251
    })
    test: Dataset({
        features: ['id', 'tokens', 'pos_tags', 'chunk_tags', 'ner_tags'],
        num_rows: 3454
    })
})

In [4]:
datasets["train"][0]

{'id': '0',
 'tokens': ['EU',
  'rejects',
  'German',
  'call',
  'to',
  'boycott',
  'British',
  'lamb',
  '.'],
 'pos_tags': [22, 42, 16, 21, 35, 37, 16, 21, 7],
 'chunk_tags': [11, 21, 11, 12, 21, 22, 11, 12, 0],
 'ner_tags': [3, 0, 7, 0, 0, 0, 7, 0, 0]}

In [5]:
datasets["train"].features["ner_tags"]

Sequence(feature=ClassLabel(num_classes=9, names=['O', 'B-PER', 'I-PER', 'B-ORG', 'I-ORG', 'B-LOC', 'I-LOC', 'B-MISC', 'I-MISC'], id=None), length=-1, id=None)

In [6]:
label_list = datasets["train"].features["ner_tags"].feature.names
label_list

['O', 'B-PER', 'I-PER', 'B-ORG', 'I-ORG', 'B-LOC', 'I-LOC', 'B-MISC', 'I-MISC']
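
The integer `ner_tags` are easier to read once mapped back to their names. The following is a minimal sketch of that mapping using the label names and the first training example shown above; `id2label` and `label2id` are helper names introduced here for illustration and are not part of the original notebook.

```python
# Label names copied from the ClassLabel feature shown above.
label_list = ['O', 'B-PER', 'I-PER', 'B-ORG', 'I-ORG', 'B-LOC', 'I-LOC', 'B-MISC', 'I-MISC']

# Hypothetical helper dictionaries (illustrative names).
id2label = dict(enumerate(label_list))
label2id = {name: idx for idx, name in id2label.items()}

# tokens and ner_tags of datasets["train"][0], copied from the output above.
tokens = ['EU', 'rejects', 'German', 'call', 'to', 'boycott', 'British', 'lamb', '.']
ner_tags = [3, 0, 7, 0, 0, 0, 7, 0, 0]

# Convert every integer tag to its string label.
decoded = [id2label[t] for t in ner_tags]
for tok, lab in zip(tokens, decoded):
    print(f"{tok:10s} {lab}")
```

In this example "EU" is tagged B-ORG, while "German" and "British" are tagged B-MISC.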

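# Preprocessing

To feed text to BERT, it must be tokenized and converted into the format the model expects. The next cell loads the tokenizer that matches the model architecture and downloads the vocabulary used during pre-training.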

In [7]:
from transformers import AutoTokenizer

model_checkpoint = "bert-base-uncased"
tokenizer = AutoTokenizer.from_pretrained(model_checkpoint)

In [8]:
# Example of using the tokenizer
tokenizer("Hello, this is one sentence!")

{'input_ids': [101, 7592, 1010, 2023, 2003, 2028, 6251, 999, 102], 'token_type_ids': [0, 0, 0, 0, 0, 0, 0, 0, 0], 'attention_mask': [1, 1, 1, 1, 1, 1, 1, 1, 1]}

In [9]:
# When the input is already split into words
split_words = "Hello, this is one sentence! split into words .".split()
print(split_words)
tokenizer(split_words,is_split_into_words=True)

['Hello,', 'this', 'is', 'one', 'sentence!', 'split', 'into', 'words', '.']


{'input_ids': [101, 7592, 1010, 2023, 2003, 2028, 6251, 999, 3975, 2046, 2616, 1012, 102], 'token_type_ids': [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0], 'attention_mask': [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1]}

In [10]:
# Splitting into subwords
# Pick an arbitrary example from the training set
example = datasets["train"][4]
print(example["tokens"],"\n")

# Apply the subword-level tokenizer
tokenized_input = tokenizer(example["tokens"], is_split_into_words=True)
tokens = tokenizer.convert_ids_to_tokens(tokenized_input["input_ids"])
print(tokens)

['Germany', "'s", 'representative', 'to', 'the', 'European', 'Union', "'s", 'veterinary', 'committee', 'Werner', 'Zwingmann', 'said', 'on', 'Wednesday', 'consumers', 'should', 'buy', 'sheepmeat', 'from', 'countries', 'other', 'than', 'Britain', 'until', 'the', 'scientific', 'advice', 'was', 'clearer', '.'] 

['[CLS]', 'germany', "'", 's', 'representative', 'to', 'the', 'european', 'union', "'", 's', 'veterinary', 'committee', 'werner', 'z', '##wing', '##mann', 'said', 'on', 'wednesday', 'consumers', 'should', 'buy', 'sheep', '##me', '##at', 'from', 'countries', 'other', 'than', 'britain', 'until', 'the', 'scientific', 'advice', 'was', 'clearer', '.', '[SEP]']


The word sheepmeat is split into the subwords 'sheep', '##me', and '##at'.
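
As a rough illustration of how such a split can arise, here is a sketch of greedy longest-match-first subword splitting, the scheme WordPiece tokenizers use at inference time. The function `wordpiece` and the tiny `toy_vocab` are assumptions made for this example; the real tokenizer uses the full pre-training vocabulary and additional normalization.

```python
def wordpiece(word, vocab, unk="[UNK]"):
    """Greedy longest-match-first split of one lowercased word into subwords."""
    pieces, start = [], 0
    while start < len(word):
        end = len(word)
        piece = None
        # Try the longest remaining substring first, then shrink it.
        while start < end:
            candidate = word[start:end]
            if start > 0:
                candidate = "##" + candidate  # continuation pieces carry a ## prefix
            if candidate in vocab:
                piece = candidate
                break
            end -= 1
        if piece is None:
            return [unk]  # no piece matched: the whole word becomes unknown
        pieces.append(piece)
        start = end
    return pieces


# Toy vocabulary (hypothetical, chosen only to reproduce the split above).
toy_vocab = {"sheep", "she", "##me", "##at", "##e", "from"}
print(wordpiece("sheepmeat", toy_vocab))
```

With this toy vocabulary the call yields `['sheep', '##me', '##at']`, matching the tokenizer output shown above.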

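The tokenizer adds special tokens such as [CLS] (start of the input) and [SEP] (separator between sentences and end of the input), so the number of tokens no longer matches the number of word-level labels. Splitting words into subwords causes the same mismatch. However, the tokenizer's word_ids() returns the same word index for every subword produced from one word (and None for special tokens), so each subword can be mapped back to its source word and that word's label.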
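
To make the role of the word indices concrete, the sketch below groups subword tokens by word index and glues them back into words. The short `tokens` and `word_ids` lists are hand-written for illustration (modeled on the output format above), and `group_by_word` is a hypothetical helper.

```python
# Hand-written toy input in the same format as the tokenizer output.
tokens = ['[CLS]', 'buy', 'sheep', '##me', '##at', 'from', '[SEP]']
word_ids = [None, 0, 1, 1, 1, 2, None]


def group_by_word(tokens, word_ids):
    """Collect the subword tokens that belong to each word index."""
    groups = {}
    for tok, wid in zip(tokens, word_ids):
        if wid is None:  # special tokens have no source word
            continue
        groups.setdefault(wid, []).append(tok)
    return groups


groups = group_by_word(tokens, word_ids)

# Rebuild each word by stripping the ## continuation prefix.
words = [
    "".join(p[2:] if p.startswith("##") else p for p in pieces)
    for _, pieces in sorted(groups.items())
]
print(groups)
print(words)
```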

In [11]:
len(example["ner_tags"]), len(tokenized_input["input_ids"])

(31, 39)

In [12]:
print(tokenized_input.word_ids())

[None, 0, 1, 1, 2, 3, 4, 5, 6, 7, 7, 8, 9, 10, 11, 11, 11, 12, 13, 14, 15, 16, 17, 18, 18, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, None]


In [13]:
# Special tokens have word id None and get the label -100; every other token gets the label
# of its source word, so the label list ends up with the same length as input_ids
word_ids = tokenized_input.word_ids()
aligned_labels = [-100 if i is None else example["ner_tags"][i] for i in word_ids]
aligned_labels

[-100,
 5,
 0,
 0,
 0,
 0,
 0,
 3,
 4,
 0,
 0,
 0,
 0,
 1,
 2,
 2,
 2,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 5,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 -100]

In [14]:
print(len(aligned_labels), len(tokenized_input["input_ids"]))

39 39


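Here the special tokens are labeled -100 (the index that PyTorch's cross-entropy loss ignores by default), and every other token receives the label of the word it came from.
An alternative is to label only the first token of each word and assign -100 to the remaining subword tokens of that word. A flag is prepared so that the two strategies can be switched.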
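
The two strategies can be compared on a toy input without loading a tokenizer. This is a sketch only: `align_labels` is a hypothetical standalone helper, and the word indices are hand-written to mimic "Werner Zwingmann said" being tokenized as `[CLS] werner z ##wing ##mann said [SEP]`.

```python
IGNORE_INDEX = -100  # label value skipped by the loss


def align_labels(word_ids, word_labels, label_all_tokens):
    """Expand word-level labels to token level using the word indices."""
    aligned, previous = [], None
    for wid in word_ids:
        if wid is None:                 # special token
            aligned.append(IGNORE_INDEX)
        elif wid != previous:           # first subword of a word
            aligned.append(word_labels[wid])
        else:                           # remaining subwords of the same word
            aligned.append(word_labels[wid] if label_all_tokens else IGNORE_INDEX)
        previous = wid
    return aligned


word_ids = [None, 0, 1, 1, 1, 2, None]
word_labels = [1, 2, 0]  # B-PER, I-PER, O

all_tokens = align_labels(word_ids, word_labels, label_all_tokens=True)
first_only = align_labels(word_ids, word_labels, label_all_tokens=False)
print(all_tokens)
print(first_only)
```

Labeling every subword gives the model more supervised positions, while labeling only the first subword keeps exactly one scored position per word.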

In [15]:
label_all_tokens = True

In [16]:
def tokenize_and_align_labels(examples):
    tokenized_inputs = tokenizer(examples["tokens"], truncation=True, is_split_into_words=True)

    labels = []
    for i, label in enumerate(examples["ner_tags"]):
        word_ids = tokenized_inputs.word_ids(batch_index=i)
        previous_word_idx = None
        label_ids = []
        for word_idx in word_ids:
            # Special tokens get the label -100
            if word_idx is None:
                label_ids.append(-100)
            # The first token of each word gets the word's label
            elif word_idx != previous_word_idx:
                label_ids.append(label[word_idx])
            # Remaining tokens get either the word's label or -100, depending on the chosen strategy
            else:
                label_ids.append(label[word_idx] if label_all_tokens else -100)
            previous_word_idx = word_idx

        labels.append(label_ids)

    tokenized_inputs["labels"] = labels
    return tokenized_inputs

In [17]:
tokenize_and_align_labels(datasets['train'][:5])

{'input_ids': [[101, 7327, 19164, 2446, 2655, 2000, 17757, 2329, 12559, 1012, 102], [101, 2848, 13934, 102], [101, 9371, 2727, 1011, 5511, 1011, 2570, 102], [101, 1996, 2647, 3222, 2056, 2006, 9432, 2009, 18335, 2007, 2446, 6040, 2000, 10390, 2000, 18454, 2078, 2329, 12559, 2127, 6529, 5646, 3251, 5506, 11190, 4295, 2064, 2022, 11860, 2000, 8351, 1012, 102], [101, 2762, 1005, 1055, 4387, 2000, 1996, 2647, 2586, 1005, 1055, 15651, 2837, 14121, 1062, 9328, 5804, 2056, 2006, 9317, 10390, 2323, 4965, 8351, 4168, 4017, 2013, 3032, 2060, 2084, 3725, 2127, 1996, 4045, 6040, 2001, 24509, 1012, 102]], 'token_type_ids': [[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0], [0, 0, 0, 0], [0, 0, 0, 0, 0, 0, 0, 0], [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0], [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]], 'attention_mask': [[1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1], [1, 1, 1, 1], [1, 1, 1, 1, 1, 1

In [18]:
# Apply the tokenization and label alignment to the whole dataset
tokenized_datasets = datasets.map(tokenize_and_align_labels, batched=True)

Loading cached processed dataset at /root/.cache/huggingface/datasets/conll2003/conll2003/1.0.0/63f4ebd1bcb7148b1644497336fd74643d4ce70123334431a3c053b7ee4e96ee/cache-efe49fb60bb4ec73.arrow
Loading cached processed dataset at /root/.cache/huggingface/datasets/conll2003/conll2003/1.0.0/63f4ebd1bcb7148b1644497336fd74643d4ce70123334431a3c053b7ee4e96ee/cache-234ea76de4e2f826.arrow
Loading cached processed dataset at /root/.cache/huggingface/datasets/conll2003/conll2003/1.0.0/63f4ebd1bcb7148b1644497336fd74643d4ce70123334431a3c053b7ee4e96ee/cache-67c116316e0ca1b8.arrow


In [19]:
tokenized_datasets["train"][0]

{'id': '0',
 'tokens': ['EU',
  'rejects',
  'German',
  'call',
  'to',
  'boycott',
  'British',
  'lamb',
  '.'],
 'pos_tags': [22, 42, 16, 21, 35, 37, 16, 21, 7],
 'chunk_tags': [11, 21, 11, 12, 21, 22, 11, 12, 0],
 'ner_tags': [3, 0, 7, 0, 0, 0, 7, 0, 0],
 'input_ids': [101,
  7327,
  19164,
  2446,
  2655,
  2000,
  17757,
  2329,
  12559,
  1012,
  102],
 'token_type_ids': [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0],
 'attention_mask': [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1],
 'labels': [-100, 3, 0, 7, 0, 0, 0, 7, 0, 0, -100]}
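
The `-100` entries in `labels` matter because they are left out of the loss. The sketch below reimplements a mean cross-entropy with an ignore index in plain Python to show the effect; `masked_cross_entropy` is a hypothetical name, and it only approximates the behavior of PyTorch's `CrossEntropyLoss` with its default `ignore_index=-100` (for instance, this sketch returns 0.0 when every position is ignored).

```python
import math


def masked_cross_entropy(logits, labels, ignore_index=-100):
    """Mean cross-entropy over the positions whose label is not ignore_index."""
    total, count = 0.0, 0
    for row, label in zip(logits, labels):
        if label == ignore_index:
            continue  # ignored positions contribute nothing
        m = max(row)  # subtract the max for numerical stability
        log_sum_exp = m + math.log(sum(math.exp(x - m) for x in row))
        total += log_sum_exp - row[label]
        count += 1
    return total / count if count else 0.0


# Toy logits for three tokens and two classes; the first token is ignored.
logits = [[9.0, 0.0], [0.0, 0.0], [0.0, 0.0]]
labels = [-100, 0, 1]
loss = masked_cross_entropy(logits, labels)

# Changing the logits at the ignored position leaves the loss unchanged.
other = masked_cross_entropy([[-5.0, 7.0], [0.0, 0.0], [0.0, 0.0]], labels)
print(loss, other)
```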

# Modeling

In [20]:
# Load the pre-trained model
model = AutoModelForTokenClassification.from_pretrained(model_checkpoint,num_labels=len(label_list))

Some weights of the model checkpoint at bert-base-uncased were not used when initializing BertForTokenClassification: ['cls.predictions.transform.LayerNorm.weight', 'cls.predictions.decoder.weight', 'cls.predictions.bias', 'cls.predictions.transform.dense.weight', 'cls.seq_relationship.weight', 'cls.seq_relationship.bias', 'cls.predictions.transform.LayerNorm.bias', 'cls.predictions.transform.dense.bias']
- This IS expected if you are initializing BertForTokenClassification from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing BertForTokenClassification from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
Some weights of BertForTokenClassification were not initialized from the model checkpoint at bert-base-u

In [21]:
# Hyperparameters
batch_size = 8  # a batch size of 16 runs out of GPU memory in my environment
args = TrainingArguments(
    "ner-conll2003",
    evaluation_strategy="epoch",
    learning_rate=2e-5,
    per_device_train_batch_size=batch_size,
    per_device_eval_batch_size=batch_size,
    num_train_epochs=3,
    weight_decay=0.01,
)

In [22]:
# Use a data collator to pad every example in a batch to the same length

data_collator = DataCollatorForTokenClassification(tokenizer)
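
Conceptually, the collator pads each batch to the length of its longest example. The following is a simplified sketch of that idea, not the library's implementation: `pad_batch` is a hypothetical helper, it assumes right-padding with pad token ID 0 and label padding with -100, and the labels of the second example are illustrative values.

```python
def pad_batch(features, pad_token_id=0, label_pad_id=-100):
    """Right-pad every example in the batch to the longest example."""
    max_len = max(len(f["input_ids"]) for f in features)
    batch = {"input_ids": [], "attention_mask": [], "labels": []}
    for f in features:
        n_pad = max_len - len(f["input_ids"])
        batch["input_ids"].append(f["input_ids"] + [pad_token_id] * n_pad)
        batch["attention_mask"].append(f["attention_mask"] + [0] * n_pad)
        # Padded positions get -100 so they are skipped by the loss.
        batch["labels"].append(f["labels"] + [label_pad_id] * n_pad)
    return batch


features = [
    {   # first training example, copied from the output above
        "input_ids": [101, 7327, 19164, 2446, 2655, 2000, 17757, 2329, 12559, 1012, 102],
        "attention_mask": [1] * 11,
        "labels": [-100, 3, 0, 7, 0, 0, 0, 7, 0, 0, -100],
    },
    {   # second example; the labels here are illustrative
        "input_ids": [101, 2848, 13934, 102],
        "attention_mask": [1] * 4,
        "labels": [-100, 1, 2, -100],
    },
]

batch = pad_batch(features)
print(batch["input_ids"][1])
print(batch["labels"][1])
```

Padding per batch rather than to a global maximum keeps short batches small, which saves memory and computation.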

In [23]:
# Evaluation metric for sequence labeling
metric = load_metric("seqeval")

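The metric takes predicted labels and reference labels as lists of label strings. A function that post-processes the model outputs is therefore defined. It does the following:
- Select the predicted index (the largest logit) for each token and convert it to its string label.
- Skip positions whose label is -100.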

In [24]:
def compute_metrics(p):
    predictions, labels = p
    predictions = np.argmax(predictions, axis=2)

    # Remove ignored index (special tokens)
    true_predictions = [
        [label_list[p] for (p, l) in zip(prediction, label) if l != -100]
        for prediction, label in zip(predictions, labels)
    ]
    true_labels = [
        [label_list[l] for (p, l) in zip(prediction, label) if l != -100]
        for prediction, label in zip(predictions, labels)
    ]

    results = metric.compute(predictions=true_predictions, references=true_labels)
    return {
        "precision": results["overall_precision"],
        "recall": results["overall_recall"],
        "f1": results["overall_f1"],
        "accuracy": results["overall_accuracy"],
    }
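
The precision, recall, and F1 returned here are scored on whole entities, whereas accuracy is scored per token. The sketch below illustrates the difference on one toy sentence. It is a simplified stand-in, not seqeval's implementation: `extract_entities` and `entity_prf` are hypothetical helpers, and an `I-` tag without a preceding `B-` is leniently treated as the start of an entity.

```python
def extract_entities(tags):
    """Return the set of (type, start, end) spans in a BIO tag sequence."""
    entities, start, ent_type = set(), None, None
    for i, tag in enumerate(list(tags) + ["O"]):  # sentinel closes the last span
        prefix, _, label = tag.partition("-")
        continues = prefix == "I" and label == ent_type
        if start is not None and not continues:
            entities.add((ent_type, start, i - 1))
            start, ent_type = None, None
        if prefix == "B" or (prefix == "I" and start is None):
            start, ent_type = i, label
    return entities


def entity_prf(gold_tags, pred_tags):
    """Entity-level precision, recall, and F1 for a single sentence."""
    gold, pred = extract_entities(gold_tags), extract_entities(pred_tags)
    tp = len(gold & pred)  # an entity counts only if type and span both match
    precision = tp / len(pred) if pred else 0.0
    recall = tp / len(gold) if gold else 0.0
    f1 = 2 * precision * recall / (precision + recall) if tp else 0.0
    return precision, recall, f1


gold_tags = ["B-PER", "I-PER", "O", "B-LOC"]
pred_tags = ["B-PER", "O", "O", "B-LOC"]

precision, recall, f1 = entity_prf(gold_tags, pred_tags)
token_accuracy = sum(g == p for g, p in zip(gold_tags, pred_tags)) / len(gold_tags)
print(precision, recall, f1, token_accuracy)
```

A single wrong token inside the PER span costs a whole entity, so the entity-level scores are 0.5 while token accuracy is 0.75.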

In [25]:
trainer = Trainer(
    model,
    args,
    train_dataset=tokenized_datasets["train"],
    eval_dataset=tokenized_datasets["validation"],
    data_collator=data_collator,
    tokenizer=tokenizer,
    compute_metrics=compute_metrics
)

In [26]:
# fine-tuning
trainer.train()

The following columns in the training set  don't have a corresponding argument in `BertForTokenClassification.forward` and have been ignored: ner_tags, id, chunk_tags, pos_tags, tokens.
***** Running training *****
  Num examples = 14042
  Num Epochs = 3
  Instantaneous batch size per device = 8
  Total train batch size (w. parallel, distributed & accumulation) = 8
  Gradient Accumulation steps = 1
  Total optimization steps = 5268


| Epoch | Training Loss | Validation Loss | Precision | Recall | F1 | Accuracy |
|---|---|---|---|---|---|---|
| 1 | 0.083 | 0.064508 | 0.929916 | 0.933661 | 0.931785 | 0.983367 |
| 2 | 0.0356 | 0.057946 | 0.938646 | 0.944737 | 0.941682 | 0.986036 |
| 3 | 0.0227 | 0.062174 | 0.935751 | 0.946638 | 0.941163 | 0.986227 |


Saving model checkpoint to ner-conll2003/checkpoint-500
Configuration saved in ner-conll2003/checkpoint-500/config.json
Model weights saved in ner-conll2003/checkpoint-500/pytorch_model.bin
tokenizer config file saved in ner-conll2003/checkpoint-500/tokenizer_config.json
Special tokens file saved in ner-conll2003/checkpoint-500/special_tokens_map.json
Saving model checkpoint to ner-conll2003/checkpoint-1000
Configuration saved in ner-conll2003/checkpoint-1000/config.json
Model weights saved in ner-conll2003/checkpoint-1000/pytorch_model.bin
tokenizer config file saved in ner-conll2003/checkpoint-1000/tokenizer_config.json
Special tokens file saved in ner-conll2003/checkpoint-1000/special_tokens_map.json
Saving model checkpoint to ner-conll2003/checkpoint-1500
Configuration saved in ner-conll2003/checkpoint-1500/config.json
Model weights saved in ner-conll2003/checkpoint-1500/pytorch_model.bin
tokenizer config file saved in ner-conll2003/checkpoint-1500/tokenizer_config.json
Special tok

TrainOutput(global_step=5268, training_loss=0.06592675475523883, metrics={'train_runtime': 577.9548, 'train_samples_per_second': 72.888, 'train_steps_per_second': 9.115, 'total_flos': 893043283669020.0, 'train_loss': 0.06592675475523883, 'epoch': 3.0})
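
The step count and throughput in the log can be checked with a few lines of arithmetic. This sketch assumes a single device, no gradient accumulation, and that the last incomplete batch of each epoch is kept, which is consistent with the log above.

```python
import math

# Values copied from the training log and TrainOutput above.
num_examples = 14042
batch_size = 8
num_epochs = 3
train_runtime = 577.9548  # seconds

# The last batch of each epoch is smaller than 8, hence the ceiling.
steps_per_epoch = math.ceil(num_examples / batch_size)
total_steps = steps_per_epoch * num_epochs

steps_per_second = total_steps / train_runtime
samples_per_second = num_examples * num_epochs / train_runtime

print(steps_per_epoch, total_steps)
print(round(steps_per_second, 3), round(samples_per_second, 3))
```

The result, 1756 steps per epoch and 5268 steps in total, agrees with `Total optimization steps` and `global_step`.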

In [27]:
predictions, labels, _ = trainer.predict(tokenized_datasets["test"])
predictions = np.argmax(predictions, axis=2)

# Remove ignored index (special tokens)
true_predictions = [
    [label_list[p] for (p, l) in zip(prediction, label) if l != -100]
    for prediction, label in zip(predictions, labels)
]
true_labels = [
    [label_list[l] for (p, l) in zip(prediction, label) if l != -100]
    for prediction, label in zip(predictions, labels)
]

results = metric.compute(predictions=true_predictions, references=true_labels)
results

The following columns in the test set  don't have a corresponding argument in `BertForTokenClassification.forward` and have been ignored: ner_tags, id, chunk_tags, pos_tags, tokens.
***** Running Prediction *****
  Num examples = 3454
  Batch size = 8


{'LOC': {'precision': 0.8984517304189436,
  'recall': 0.9289077212806026,
  'f1': 0.913425925925926,
  'number': 2124},
 'MISC': {'precision': 0.7821466524973433,
  'recall': 0.7389558232931727,
  'f1': 0.7599380485286527,
  'number': 996},
 'ORG': {'precision': 0.8673929376408716,
  'recall': 0.8921947449768161,
  'f1': 0.8796190476190477,
  'number': 2588},
 'PER': {'precision': 0.9706755753526355,
  'recall': 0.9621044885945548,
  'f1': 0.966371027346637,
  'number': 2718},
 'overall_precision': 0.8987401389379489,
 'overall_recall': 0.9058865416567766,
 'overall_f1': 0.9022991902594716,
 'overall_accuracy': 0.9769870865271736}
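
The overall scores can be cross-checked against the per-type scores. The sketch below recovers entity counts from the reported values and recomputes the overall precision, recall, and F1. It assumes the overall figures are micro-averaged over all entity types, which the numbers are consistent with.

```python
# Per-type scores copied from the `results` output above.
per_type = {
    "LOC":  {"precision": 0.8984517304189436, "recall": 0.9289077212806026, "number": 2124},
    "MISC": {"precision": 0.7821466524973433, "recall": 0.7389558232931727, "number": 996},
    "ORG":  {"precision": 0.8673929376408716, "recall": 0.8921947449768161, "number": 2588},
    "PER":  {"precision": 0.9706755753526355, "recall": 0.9621044885945548, "number": 2718},
}
overall_precision = 0.8987401389379489
overall_recall = 0.9058865416567766
overall_f1 = 0.9022991902594716

# recall = correct / gold entities, so correct = recall * number.
true_pos = {k: round(v["recall"] * v["number"]) for k, v in per_type.items()}
# precision = correct / predicted entities, so predicted = correct / precision.
predicted = {k: round(true_pos[k] / v["precision"]) for k, v in per_type.items()}

tp_total = sum(true_pos.values())
micro_precision = tp_total / sum(predicted.values())
micro_recall = tp_total / sum(v["number"] for v in per_type.values())
micro_f1 = 2 * micro_precision * micro_recall / (micro_precision + micro_recall)

print(true_pos)
print(micro_precision, micro_recall, micro_f1)
```

MISC has both the fewest gold entities and the lowest scores, so it pulls the overall F1 below the LOC and PER scores.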

In [29]:
predictions[0]

array([0, 0, 0, 5, 0, 0, 0, 0, 5, 0, 0, 0, 0, 5, 0, 0, 0, 0, 0, 0, 0, 0,
       0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
       0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
       0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
       0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
       0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
       0, 0, 0, 0, 0, 0, 0, 0])

In [33]:
labels[0]

array([-100,    0,    0,    5,    0,    0,    0,    0,    1,    0,    0,
          0,    0, -100, -100, -100, -100, -100, -100, -100, -100, -100,
       -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100,
       -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100,
       -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100,
       -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100,
       -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100,
       -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100,
       -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100,
       -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100,
       -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100,
       -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100,
       -100, -100, -100, -100, -100, -100, -100, -100])

In [32]:
tokenized_datasets["test"][0]["tokens"]

['SOCCER',
 '-',
 'JAPAN',
 'GET',
 'LUCKY',
 'WIN',
 ',',
 'CHINA',
 'IN',
 'SURPRISE',
 'DEFEAT',
 '.']
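
The three outputs above can be lined up to see where the model disagrees with the reference. The sketch below copies the first 14 entries of `predictions[0]` and `labels[0]` (the rest is padding) and drops the positions labeled -100. It assumes each of the 12 words was tokenized into exactly one subword, which matches the 12 labeled positions.

```python
label_list = ['O', 'B-PER', 'I-PER', 'B-ORG', 'I-ORG', 'B-LOC', 'I-LOC', 'B-MISC', 'I-MISC']

# Copied from the outputs above (only the unpadded prefix).
tokens = ['SOCCER', '-', 'JAPAN', 'GET', 'LUCKY', 'WIN', ',', 'CHINA', 'IN',
          'SURPRISE', 'DEFEAT', '.']
pred_ids = [0, 0, 0, 5, 0, 0, 0, 0, 5, 0, 0, 0, 0, 5]
gold_ids = [-100, 0, 0, 5, 0, 0, 0, 0, 1, 0, 0, 0, 0, -100]

# Keep only the positions that carry a real label ([CLS] and [SEP] are -100).
kept = [(label_list[g], label_list[p]) for p, g in zip(pred_ids, gold_ids) if g != -100]

mismatches = []
for tok, (gold, pred) in zip(tokens, kept):
    marker = "" if gold == pred else "  <-- differs"
    print(f"{tok:10s} gold={gold:6s} pred={pred:6s}{marker}")
    if gold != pred:
        mismatches.append((tok, gold, pred))
```

In this sentence the only disagreement is "CHINA", whose reference label is B-PER while the model predicts B-LOC. The prediction of 5 at the final position falls on [SEP] and is ignored.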