<a href="https://colab.research.google.com/github/forrestpark/NLPwithDeepLearning/blob/main/Named_Entity_Recognition_(NER)_using_LSTM_(Long_Short_Term_Memory).ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

##  0.  Objective / 목표

The objective of this project is named entity recognition (NER) on Korean text using Long Short-Term Memory (LSTM), a recurrent neural network (RNN) architecture.

---



이 과제의 목표는 LSTM을 사용하여 한국어 개체명 인식을 수행하는 것이다.

##  1.  Configurations / 설정

##  1.1. Importing Libraries / 라이브러리 가져오기

In [None]:
# Written by / 작성: Robert Guthrie (https://pytorch.org/tutorials/beginner/nlp/sequence_models_tutorial.html#example-an-lstm-for-part-of-speech-tagging)
# Edited by / 수정: Jang Woo (Forrest) Park

import torch
import numpy as np
from torch import nn, optim
from torch.nn import functional as F
from torch.nn.utils.rnn import pad_sequence

from pathlib import Path
from collections import defaultdict
from tqdm.notebook import trange, tqdm
from sklearn.metrics import classification_report

torch.manual_seed(42)

<torch._C.Generator at 0x7f2cafb07a70>

In [None]:
resource_dir = "resources/"
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

## 2. Data / 데이터


First, let us download the NER training dataset from the NAVER x Changwon National University (CWNU) NLP Challenge.

---


먼저 네이버와  창원대가 함께하는 NLP-Challenge에서 공개한 NER 훈련 집합 데이터 파일을 내려받는다.

In [None]:
!wget -P $resource_dir https://raw.githubusercontent.com/naver/nlp-challenge/master/missions/ner/data/train/train_data

--2021-06-19 16:11:01--  https://raw.githubusercontent.com/naver/nlp-challenge/master/missions/ner/data/train/train_data
Resolving raw.githubusercontent.com (raw.githubusercontent.com)... 185.199.108.133, 185.199.109.133, 185.199.110.133, ...
Connecting to raw.githubusercontent.com (raw.githubusercontent.com)|185.199.108.133|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 16945023 (16M) [text/plain]
Saving to: ‘resources/train_data.1’


2021-06-19 16:11:01 (158 MB/s) - ‘resources/train_data.1’ saved [16945023/16945023]



In [None]:
!head -20 "$resource_dir""train_data"

1	비토리오	PER_B
2	양일	DAT_B
3	만에	-
4	영사관	ORG_B
5	감호	CVL_B
6	용퇴,	-
7	항룡	-
8	압력설	-
9	의심만	-
10	가율	-

1	이	-
2	음경동맥의	-
3	직경이	-
4	8	NUM_B
5	19mm입니다	NUM_B
6	.	-

1	9세이브로	NUM_B
2	구완	-



Let us write a function `txt_to_seq()` that receives a string corresponding to one sentence in the format above and returns a tuple consisting of a list of words and a list of tags.

---


위와 같은 형식에서 한 문장에 해당하는 문자열을 받아서 단어의 리스트와 태그의 리스트로 이루어진 튜플을 반환하는 함수 `txt_to_seq()`을 작성한다.

In [None]:
def txt_to_seq(text: str):
    words = (word.split("\t") for word in text.split("\n"))
    _, sentences, tags = zip(*words)
    return list(sentences), list(tags)

In [None]:
text = '''1	비토리오	PER_B
2	양일	DAT_B
3	만에	-
4	영사관	ORG_B
5	감호	CVL_B
6	용퇴,	-
7	항룡	-
8	압력설	-
9	의심만	-
10	가율	-'''

print(txt_to_seq(text))
# (['비토리오', '양일', '만에', '영사관', '감호', '용퇴,', '항룡', '압력설', '의심만', '가율'], ['PER_B', 'DAT_B', '-', 'ORG_B', 'CVL_B', '-', '-', '-', '-', '-'])


(['비토리오', '양일', '만에', '영사관', '감호', '용퇴,', '항룡', '압력설', '의심만', '가율'], ['PER_B', 'DAT_B', '-', 'ORG_B', 'CVL_B', '-', '-', '-', '-', '-'])
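
The core of `txt_to_seq()` is the `zip(*rows)` idiom, which transposes a list of rows into columns. A minimal sketch on a three-line sample (the variable names `block`, `rows`, and `ids` are illustrative and not part of the notebook):

```python
# Each line has three tab-separated fields: position, word, tag.
block = "1\t비토리오\tPER_B\n2\t양일\tDAT_B\n3\t만에\t-"

# Split into rows, then transpose rows into columns with zip(*rows).
rows = [line.split("\t") for line in block.split("\n")]
ids, words, tags = zip(*rows)

print(words)  # ('비토리오', '양일', '만에')
print(tags)   # ('PER_B', 'DAT_B', '-')
```

`zip` returns tuples, which is why `txt_to_seq()` wraps the results in `list()`.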


Using the `txt_to_seq()` function written above, let us convert the entire file into a list of `(list of words, list of tags)` tuples and store it as `training_data`, so that the code behaves as in the example below.


---

아래의 예시와 같이 작동할 수 있도록,  위에서 만든 함수 `txt_to_seq()`를 활용하여 파일 전체를 [([단어의 리스트],  [태그의 리스트])  튜플의 리스트]로 변환하고 `training_data`로 저장하라.

In [None]:
resource_dir = Path(resource_dir)
with resource_dir.joinpath('train_data').open() as f:
    training_data = []
    text = ""
    for line in f:
        if not line.strip() and text.strip():
            training_data.append(txt_to_seq(text.rstrip()))
            text = ""
        else:
            text += line
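
As an alternative sketch of the same blank-line splitting, `itertools.groupby` can partition the lines into blank and non-blank runs. The helper name `split_sentences` is hypothetical. Unlike the loop above, this version also keeps a final sentence that is not followed by a blank line, so the resulting count could differ by one depending on how the file ends.

```python
from itertools import groupby

def split_sentences(lines):
    """Yield (words, tags) for each blank-line-separated block of TSV lines."""
    # groupby groups consecutive lines by whether they are blank.
    for is_blank, block in groupby(lines, key=lambda line: not line.strip()):
        if is_blank:
            continue  # skip the separator run between sentences
        rows = [line.rstrip("\n").split("\t") for line in block]
        _, words, tags = zip(*rows)
        yield list(words), list(tags)

# Toy input in the same format as train_data (no trailing blank line).
sample = ["1\t이\t-\n", "2\t8\tNUM_B\n", "\n", "1\t구완\t-\n"]
parsed = list(split_sentences(sample))
print(parsed)
```

Because the function accepts any iterable of lines, an open file object could be passed in directly.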

We delete the last entry of `training_data` if it is an empty tuple.


---


단, `training_data`의 마지막 데이터가 빈 튜플로 이루어져 있는 경우 삭제한다.

In [None]:
#print(training_data[-1:])
#training_data.pop()

In [None]:
print(len(training_data))
# 90000
print(training_data[0][0])
# ['비토리오', '양일', '만에', '영사관', '감호', '용퇴,', '항룡', '압력설', '의심만', '가율']
print(training_data[0][1])
# ['PER_B', 'DAT_B', '-', 'ORG_B', 'CVL_B', '-', '-', '-', '-', '-']

90000
['비토리오', '양일', '만에', '영사관', '감호', '용퇴,', '항룡', '압력설', '의심만', '가율']
['PER_B', 'DAT_B', '-', 'ORG_B', 'CVL_B', '-', '-', '-', '-', '-']


Let us convert each pair of a word list and a tag list into sequences of integer indices so that they can be used in PyTorch.


---


단어의 리스트와 태그의 리스트로 이루어진 리스트를 PyTorch에서 사용할 수 있도록 단어와 태그를 각각 인덱스로 변환하자.

In [None]:
# a method that converts a sequence of either words or tags into a sequence of indices
# 단어나 태그의 시퀀스를 인덱스의 시퀀스로 만들어 주는 함수
def prepare_sequence(seq, to_ix):
    idxs = [to_ix[w] for w in seq]
    return torch.tensor(idxs, dtype=torch.long, device=device)

We build a dictionary `word_to_ix` with words as keys and indices as values. Index 0 is reserved for the special padding token `<pad>`.


---


단어를 key로, 인덱스를 value로 하는 딕셔너리 `word_to_ix`를 만든다. 단,  패딩을 위해 특수 단어  `<pad>`에 0번 인덱스를 먼저 부여한다.

In [None]:
word_to_ix = {'<pad>': 0}
# For each words-list (sentence) and tags-list in each tuple of training_data
for (sentence, tags) in training_data:
    for word in sentence:
        if word not in word_to_ix:  # word has not been assigned an index yet
            word_to_ix[word] = len(word_to_ix)  # Assign each word with a unique index

One can observe that the vocabulary contains 331,197 entries, including the `<pad>` token.


---


이 데이터에는 모두 331,197가지의 단어가 있음을 알 수 있다.

In [None]:
print(len(word_to_ix))
# 331197
print(word_to_ix['<pad>'])
# 0

331197
0


In [None]:
tag_to_ix = {'<pad>': 0}
for _, tags in training_data:
    for tag in tags:
        if tag not in tag_to_ix:  # tag has not been assigned an index yet
            tag_to_ix[tag] = len(tag_to_ix)

In [None]:
print(len(tag_to_ix))
# 30
print(tag_to_ix['<pad>'])
#  0

30
0
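
Both dictionaries follow the same pattern: reserve index 0 for `<pad>`, then give each new item the next free index. A minimal sketch on toy data, where `build_index` is a hypothetical helper rather than a function from this notebook:

```python
def build_index(sequences, pad_token="<pad>"):
    """Map each distinct item to a unique index, reserving 0 for padding."""
    to_ix = {pad_token: 0}
    for seq in sequences:
        for item in seq:
            # len(to_ix) is evaluated before insertion, so indices stay dense.
            to_ix.setdefault(item, len(to_ix))
    return to_ix

toy_sentences = [["a", "b"], ["b", "c"]]
toy_word_to_ix = build_index(toy_sentences)
print(toy_word_to_ix)  # {'<pad>': 0, 'a': 1, 'b': 2, 'c': 3}

# Converting a sentence to indices is then a plain lookup.
print([toy_word_to_ix[w] for w in ["b", "c"]])  # [2, 3]
```

Note that a lookup of a word that never appeared in the data raises `KeyError`; handling unseen words would need an extra token, which this notebook does not define.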


Using both dictionaries created above, let us convert the words and tags of `training_data` into indices.

---


위에서 만든 두 딕셔너리를 사용하여 `training_data`의 단어와 태그를 인덱스로 바꾸어 준다.

In [None]:
for (i, (sentence, tags)) in enumerate(training_data):
  words_ix = prepare_sequence(sentence, word_to_ix)
  tags_ix = prepare_sequence(tags, tag_to_ix)
  training_data[i] = words_ix, tags_ix

Let us split the 90,000 sentences into a training set of 80,000 and a test set of 10,000.


---


90000개의 데이터를 80000개의 훈련 집합과 10000개의 실험 집합으로 분할해 준다.

In [None]:
training_data, test_data = torch.utils.data.random_split(
    training_data, [80000, 10000], generator=torch.Generator().manual_seed(42))

With the generator seed fixed to 42, the split should produce the following result.


---


seed를 42로 고정하고,  분할된 결과가 아래와 같아야 한다.

In [None]:
print(training_data[0])
# (tensor([221820, 269329, 269330, 269331, 269332, 162950,  86700,   2219,     16]), tensor([3, 3, 5, 3, 3, 3, 3, 3, 3]))
print(test_data[0])
# (tensor([ 11024,   8111, 143545, 143546,  15872,   6827,  20755,    471, 143547,
            # 16]), tensor([3, 4, 6, 3, 3, 3, 3, 3, 3, 3]))

(tensor([221820, 269329, 269330, 269331, 269332, 162950,  86700,   2219,     16],
       device='cuda:0'), tensor([3, 3, 5, 3, 3, 3, 3, 3, 3], device='cuda:0'))
(tensor([ 11024,   8111, 143545, 143546,  15872,   6827,  20755,    471, 143547,
            16], device='cuda:0'), tensor([3, 4, 6, 3, 3, 3, 3, 3, 3, 3], device='cuda:0'))
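
The reproducibility comes from passing a freshly seeded `torch.Generator` to `random_split`. A small sketch on a toy list (the sizes 8 and 2 are arbitrary choices for illustration):

```python
import torch
from torch.utils.data import random_split

data = list(range(10))

# Two splits with identically seeded generators select the same items.
train_a, test_a = random_split(data, [8, 2], generator=torch.Generator().manual_seed(42))
train_b, test_b = random_split(data, [8, 2], generator=torch.Generator().manual_seed(42))

items_a = [train_a[i] for i in range(len(train_a))]
items_b = [train_b[i] for i in range(len(train_b))]
print(items_a == items_b)  # True
```

The exact permutation may still differ across PyTorch versions, so the expected tensors printed above are tied to the environment in which the notebook was run.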


To process data in batches, let us create a function `pad_collate()` and a `DataLoader()` object as follows. The batch size can be set to a different value during experimentation.


---


배치 처리를 위해 아래와 같이 `pad_collate()` 함수를 만들고 `DataLoader()` 객체를 만들 수 있다. 실험 과정에서 배치 사이즈를 변경할 수 있다.

In [None]:
def pad_collate(batch):
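  # batch is a list of (word_indices, tag_indices) pairs of varying length
  (xx, yy) = zip(*batch)

  # Inputs are padded to (max_len, batch) because nn.LSTM defaults to batch_first=False
  xx_pad = pad_sequence(xx, padding_value=0)
  # Targets are padded to (batch, max_len) to match the batch-first model output
  yy_pad = pad_sequence(yy, batch_first=True, padding_value=0)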

  return xx_pad, yy_pad
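
The two `pad_sequence` calls in `pad_collate()` produce different layouts. A short sketch with two toy sequences shows the difference (the tensor values are made up for illustration):

```python
import torch
from torch.nn.utils.rnn import pad_sequence

seqs = [torch.tensor([5, 6, 7]), torch.tensor([8, 9])]

# Default layout: (max_len, batch), one column per sequence.
time_major = pad_sequence(seqs, padding_value=0)
# batch_first=True layout: (batch, max_len), one row per sequence.
batch_major = pad_sequence(seqs, batch_first=True, padding_value=0)

print(time_major.shape)      # torch.Size([3, 2])
print(batch_major.shape)     # torch.Size([2, 3])
print(batch_major.tolist())  # [[5, 6, 7], [8, 9, 0]]
```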

Let us set the hyperparameters: `EMBEDDING_DIM`, the size of each word embedding, and `HIDDEN_DIM`, the size of the LSTM hidden state, along with the batch size, the number of epochs, and the learning rate. All of them may be changed during experimentation.


---


하이퍼패러미터로 단어의 임베딩 크기 `EMBEDDING_DIM`과 LSTM RNN의 은닉층 크기 `HIDDEN_DIM`의 값을 설정한다.  두 값 모두 실험 과정에서 변경할 수 있다.

In [None]:
# Size of each word embedding and of the LSTM hidden state.
# Both are starting points and can be tuned during experimentation.
EMBEDDING_DIM = 200
HIDDEN_DIM = 100

BATCH_SIZE = 256
EPOCHS = 30
LR = 0.003
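
To get a feel for these sizes, the parameter count of an embedding + single-layer LSTM + linear tagger can be worked out by hand. The sketch below assumes the vocabulary size (331,197) and tag count (30) printed earlier, and the standard `nn.LSTM` layout of four gates, each with input weights, recurrent weights, and two bias vectors. The helper name `tagger_param_count` is hypothetical.

```python
def tagger_param_count(vocab_size, embedding_dim, hidden_dim, tagset_size):
    """Return (embedding, lstm, linear) parameter counts for the tagger."""
    embedding = vocab_size * embedding_dim
    # Four gates, each with input weights, recurrent weights, and two biases.
    lstm = 4 * (embedding_dim * hidden_dim + hidden_dim * hidden_dim + 2 * hidden_dim)
    linear = hidden_dim * tagset_size + tagset_size
    return embedding, lstm, linear

counts = tagger_param_count(vocab_size=331197, embedding_dim=200,
                            hidden_dim=100, tagset_size=30)
print(counts)       # (66239400, 120800, 3030)
print(sum(counts))  # 66363230
```

Under these assumptions the embedding table accounts for almost all parameters, so `EMBEDDING_DIM` has by far the largest effect on model size.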

## 3. Model / 모형

+ `embedding_dim`: size of word embeddings; flexible
+ `hidden_dim`: size of the LSTM hidden state; flexible
+ `vocab_size`:  number of unique words; size of `word_to_ix`
+ `tagset_size`: number of unique tags; size of `tag_to_ix`

---


+ `embedding_dim`: 단어 임베딩의 크기. 변경할 수 있다.
+ `hidden_dim`: 은닉층의 크기. 변경할 수 있다.
+ `vocab_size`:  단어의 가짓수.  `word_to_ix`의 크기.
+ `tagset_size`: 태그의 가짓수.  `tag_to_ix`의 크기.

In [None]:
class LSTMTagger(nn.Module):
    def __init__(self, embedding_dim, hidden_dim, vocab_size, tagset_size):
        super().__init__()
        self.hidden_dim = hidden_dim

        # nn.Embedding: sequence of word indices -> sequence of embedding vectors.
        # Index 0 (<pad>) is the padding index, so its embedding stays zero.
        # As in [숙제7] (Homework 7), pre-trained FastText vectors could be used instead.
        self.word_embeddings = nn.Embedding(vocab_size, embedding_dim, padding_idx=0)

        # nn.LSTM: sequence of embedding vectors -> sequence of hidden states
        # with dimensionality hidden_dim.
        self.lstm = nn.LSTM(embedding_dim, hidden_dim)

        # nn.Linear: sequence of hidden states -> sequence of tag logits
        # (the values just before softmax).
        self.hidden2tag = nn.Linear(hidden_dim, tagset_size)

    def forward(self, sentences):
        # sentences: padded word indices (not strings), shape (max_len, batch).
        embeds = self.word_embeddings(sentences)
        # hidden holds the LSTM output at every time step; _ is the final
        # hidden state h_t and cell state c_t.
        hidden, _ = self.lstm(embeds)
        # (max_len, batch, hidden_dim) -> (batch, max_len, tagset_size)
        tag_space = self.hidden2tag(hidden.transpose(0, 1))
        # nn.CrossEntropyLoss expects the class dimension second: (batch, tagset_size, max_len)
        tag_space = tag_space.transpose(1, 2)
        return tag_space
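
The two transposes in `forward()` are easier to follow with concrete shapes. The sketch below replays the same layer sequence on random toy data; all sizes are arbitrary values chosen for illustration.

```python
import torch
from torch import nn

seq_len, batch, vocab, emb, hid, n_tags = 5, 3, 20, 8, 6, 4

tokens = torch.randint(1, vocab, (seq_len, batch))         # (5, 3)
embeds = nn.Embedding(vocab, emb, padding_idx=0)(tokens)   # (5, 3, 8)
hidden, _ = nn.LSTM(emb, hid)(embeds)                      # (5, 3, 6)
logits = nn.Linear(hid, n_tags)(hidden.transpose(0, 1))    # (3, 5, 4)
logits = logits.transpose(1, 2)                            # (3, 4, 5)

# Targets are batch-first, matching the output of pad_collate().
targets = torch.randint(1, n_tags, (batch, seq_len))       # (3, 5)
loss = nn.CrossEntropyLoss(ignore_index=0)(logits, targets)
print(logits.shape, loss.shape)
```

For sequence inputs, `nn.CrossEntropyLoss` takes logits of shape `(batch, classes, length)` and targets of shape `(batch, length)`, which is why the class dimension is moved to position 1.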

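To train the network, we configure the model, the loss function, and the optimizer. Class weights inversely proportional to each tag's frequency, with add-one smoothing, are passed to the loss function. The type of optimizer can be changed during experimentation.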


---


신경망 훈련을 위해 모형,  손실함수,  최적화기를 설정한다. 최적화기의 종류는 실험 과정에서 변경할 수 있다.

In [None]:
training_loader = torch.utils.data.DataLoader(training_data, batch_size=BATCH_SIZE, collate_fn=pad_collate)
test_loader = torch.utils.data.DataLoader(test_data, batch_size=BATCH_SIZE, collate_fn=pad_collate)

# Add-1 smoothing
cnt_per_ix = torch.ones(len(tag_to_ix), dtype=torch.float32, device=device)
for _, tags in training_loader:
    cnts = torch.bincount(tags.ravel(), minlength=len(tag_to_ix))
    cnt_per_ix += cnts

weight = cnt_per_ix.reciprocal()
weight[0] = 0
weight = weight / torch.linalg.norm(weight) * (len(tag_to_ix) - 1)

model = LSTMTagger(EMBEDDING_DIM, HIDDEN_DIM, len(word_to_ix), len(tag_to_ix))
model.to(device)
loss_function = nn.CrossEntropyLoss(weight=weight, ignore_index=0)
optimizer = optim.Adam(model.parameters(), lr=LR)

weight
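
The weight computation above can be traced on a toy batch. This sketch uses four tag indices and a made-up tag tensor, and otherwise follows the same steps: add-one smoothed counts, reciprocal, zero weight for padding, then rescaling by the norm.

```python
import torch

n_tags = 4
toy_tags = torch.tensor([[1, 1, 2, 0],
                         [1, 3, 0, 0]])  # 0 is the padding index

# Add-one smoothing: start every count at 1, then add observed counts.
counts = torch.ones(n_tags) + torch.bincount(toy_tags.ravel(), minlength=n_tags)

toy_weight = counts.reciprocal()   # rarer tags get larger weights
toy_weight[0] = 0                  # padding never contributes to the loss
toy_weight = toy_weight / torch.linalg.norm(toy_weight) * (n_tags - 1)
print(toy_weight)  # approximately [0, 1, 2, 2]
```

Here tags 2 and 3 occur once each and tag 1 three times, so the rarer tags end up with twice the weight of tag 1.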

We now start training. The number of epochs can be changed during experimentation.


---


이제 훈련을 시작한다. epoch 횟수는 실험 과정에서 변경할 수 있다.

In [None]:
model.train()
for epoch in trange(EPOCHS, desc="epochs"):
    for sentences, tags in tqdm(training_loader, desc="batches", leave=False):
        # Step 1. Remember that Pytorch accumulates gradients.
        # We need to clear them out before each instance
        model.zero_grad()

        # Step 2. Run our forward pass.
        tag_scores = model(sentences)

        # Step 3. Compute the loss, gradients, and update the parameters by
        #  calling optimizer.step()
        loss = loss_function(tag_scores, tags)
        loss.backward()
        optimizer.step()
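
The loop above does not report the loss, which makes it hard to tell whether training is converging. One possible sketch wraps a single epoch in a helper that returns the mean batch loss. The name `run_epoch` and the toy model and data are assumptions for illustration, not part of the notebook.

```python
import math
import torch
from torch import nn, optim

def run_epoch(model, loader, loss_function, optimizer):
    """Train for one pass over `loader` and return the mean batch loss."""
    model.train()
    total, n_batches = 0.0, 0
    for inputs, targets in loader:
        optimizer.zero_grad()                         # clear accumulated gradients
        loss = loss_function(model(inputs), targets)  # forward pass
        loss.backward()                               # backward pass
        optimizer.step()                              # parameter update
        total += loss.item()
        n_batches += 1
    return total / max(n_batches, 1)

# Toy stand-ins: a linear classifier and five random batches.
torch.manual_seed(0)
toy_model = nn.Linear(4, 3)
toy_loader = [(torch.randn(8, 4), torch.randint(0, 3, (8,))) for _ in range(5)]
toy_optimizer = optim.Adam(toy_model.parameters(), lr=0.003)

mean_loss = run_epoch(toy_model, toy_loader, nn.CrossEntropyLoss(), toy_optimizer)
print(f"mean loss: {mean_loss:.4f}")
```

Calling such a helper once per epoch with `model`, `training_loader`, `loss_function`, and `optimizer` would give one loss value per epoch to monitor.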


##  4. Evaluation / 평가


Collect the gold labels (`true`) from the test set and compute the model's predictions (`pred`).

---


실험 집합에서 정답(true)을 구하고 예측(pred)을 계산한다.

In [None]:
model.eval()
with torch.no_grad():
  true = []  # gold tag indices, flattened per batch
  pred = []  # predicted tag indices, flattened per batch

  for sentences, tags in test_loader:
    output = model(sentences)
    true.append(tags.ravel())
    pred.append(output.argmax(1).ravel())
  
  true = torch.cat(true).cpu()
  pred = torch.cat(pred).cpu()
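
Padded positions carry the tag index 0 in `true`, so they should be excluded before any metric is computed. A minimal sketch of a masked accuracy on made-up tensors (the names `toy_true`, `toy_pred`, and `mask` are illustrative):

```python
import torch

toy_true = torch.tensor([1, 2, 0, 3, 0])  # 0 marks padded positions
toy_pred = torch.tensor([1, 3, 2, 3, 1])

mask = toy_true != 0  # keep only real tokens
accuracy = (toy_pred[mask] == toy_true[mask]).float().mean().item()
print(f"accuracy on non-padded tokens: {accuracy:.3f}")  # 0.667
```

Predictions at padded positions are arbitrary, so leaving them in would distort the score.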

Inspect the precision, recall, F1, and accuracy values and use them to improve the model.


---


정밀도, 재현도, F1, 정확도의 값을 살펴보고 모형을 개선.

In [None]:
nonzero_idxs = true.nonzero(as_tuple=True)
print(classification_report(y_true=true[nonzero_idxs],
                            y_pred=pred[nonzero_idxs],
                            labels=list(tag_to_ix.values())[1:],
                            target_names=list(tag_to_ix)[1:],
                            zero_division=0))

              precision    recall  f1-score   support

       PER_B       0.44      0.39      0.42      4930
       DAT_B       0.69      0.79      0.74      2889
           -       0.91      0.87      0.89     81614
       ORG_B       0.51      0.50      0.50      4498
       CVL_B       0.47      0.39      0.42      6538
       NUM_B       0.65      0.61      0.63      6275
       LOC_B       0.44      0.49      0.46      2403
       EVT_B       0.39      0.53      0.45      1210
       TRM_B       0.35      0.47      0.40      2052
       TRM_I       0.12      0.21      0.15       335
       EVT_I       0.42      0.53      0.47       719
       PER_I       0.14      0.19      0.17       568
       CVL_I       0.09      0.15      0.11       406
       NUM_I       0.51      0.65      0.57       964
       TIM_B       0.35      0.70      0.46       339
       TIM_I       0.48      0.81      0.60       100
       ANM_B       0.29      0.49      0.37       651
       DAT_I       0.67    