## 2 Applying the Models


### 2.2 SKT KoBERT 


&nbsp;&nbsp;  ■ References: 1) [SKT KoBERT](https://github.com/SKTBrain/KoBERT) repository
2) [Hugging Face shared model: monologg/kobert](https://huggingface.co/monologg/kobert)
> **A brief introduction to KoBERT** \
KoBERT is a Korean-specific BERT trained on a large Korean corpus to compensate for the limited Korean performance of Google's BERT base multilingual cased model. \
Training data: Korean Wikipedia (5M sentences, 54M words) + Korean news (20M sentences, 270M words). \
Vocabulary size: 8,002 \
Tokenizer: SentencePiece, trained on Korean Wikipedia + news text

■ Three approaches tried for training:
1) Using the shared Hugging Face model (.from_pretrained) \
2) Building on the example published by SKT KoBERT \
3) A custom implementation

<br><br>
#### (1) Environment setup and library imports
- Install KoBERT
- [Install PyTorch (the command depends on your environment)](https://pytorch.org/)


In [None]:
# Install KoBERT
# git clone https://github.com/SKTBrain/KoBERT.git
# cd KoBERT
# pip install -r requirements.txt
# pip install .

In [2]:
import pandas as pd
import numpy as np
import torch
import transformers
import matplotlib.pyplot as plt

In [3]:
import os
os.environ["CUDA_DEVICE_ORDER"]="PCI_BUS_ID"   
os.environ["CUDA_VISIBLE_DEVICES"]="0"

# Use the GPU if one is available
if torch.cuda.is_available():    
    # Tell PyTorch to use the GPU.
    device = torch.device('cuda')
#     print('There are %d GPU(s) available.' % torch.cuda.device_count())
#     print('We will use the GPU:', torch.cuda.get_device_name(0))

# Otherwise fall back to the CPU
else:
    print('No GPU available, using the CPU instead.')
    device = torch.device("cpu")

<br><br>
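#### (2) Loading the data
- Load <code>train_data.tsv</code> and <code>test_data.tsv</code>, which were saved during data exploration and preprocessing.
- Each dataset consists of a label (the value to predict) and text (the input used for prediction).
- The loading procedure is the same as in 2.1 Google BERT.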

In [4]:
# Load the dataset 
train = pd.read_csv("data/train_data.tsv", sep ='\t', header=0, encoding='utf-8', names=['label','text'])
test = pd.read_csv("data/test_data.tsv", sep ='\t', header=0, encoding='utf-8', names=['label','text'])

# Number of sentences
print('Number of training sentences: {:,}'.format(train.shape[0]))
print('Number of test sentences: {:,}'.format(test.shape[0]))

# Extract the text and label columns of train/test as lists.
train_texts = train.text.values.tolist()
train_labels = train.label.values.tolist()
test_texts = test.text.values.tolist()
test_labels = test.label.values.tolist()

# Show a sample of the data
train_texts[:2] , train_labels[:2]

Number of training sentences: 34,396
Number of test sentences: 8,327


(['고객님 정보 확인 불가 문의 안녕하세요 당일 고객님 분실정지 진행하려고 하는데 인입시 고객정보가 조회되지 않습니다 처리결과 값이 올바르지 않습니다 라고 나오면 정보조회의 기본정보 성명 생년월일등 가 확인되지 않습니다 해당 내용 확인 부탁 드립니다 감사합니다',
  '긴급 tv m oss 결과보고시 기존바코드 확인안되어hds연동할수 없음 업무경로 데이터 국사 원주 회선번호 수리접수번호 상품정보 서비스계약번호 연락처 이윤범 현장 확인사항 수리오더로 결과보고시 연동시 기존바코드가 올라오지않아 교체업무 진행을 할 수 가 없습니다 완료보고 확인시 기존바코드에 빌트인 로 확인되며 단말원부확인시 댁내설치된기존바코드 확인시 사용 사용으로 되어 있습니다 긴급건으로 확인 및 조치 부탁 드립니다'],
 ['ASM34824', 'ASM31688'])

<br><br>
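#### (3) Converting the data format
1. Label (output: y) encoding - map each label from a string to an integer
2. Text (input: x) tokenization - convert the data into a format that BERT can be trained on. \
  . tokenize each sentence \
  . map tokens to vocab-based integer indices & pad \
  . apply the input text mask \
  (Text tokenization is implemented separately in each of the three approaches below.)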
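
The text-side steps listed above (integer indexing, padding, masking) can be sketched without any model dependency. The snippet below is a minimal illustration that assumes the token ids are already available; `pad_and_mask`, `pad_id`, and the sample ids are hypothetical names and values, not part of this notebook.

```python
from typing import List, Tuple


def pad_and_mask(
    batch_ids: List[List[int]], max_len: int, pad_id: int = 0
) -> Tuple[List[List[int]], List[List[int]]]:
    """Truncate or pad each id sequence to max_len and build the matching attention mask."""
    padded, masks = [], []
    for ids in batch_ids:
        ids = ids[:max_len]                          # truncate long sequences
        n_pad = max_len - len(ids)
        padded.append(ids + [pad_id] * n_pad)        # right-pad with pad_id
        masks.append([1] * len(ids) + [0] * n_pad)   # 1 = real token, 0 = padding
    return padded, masks


# Hypothetical token ids for two short sentences
padded, masks = pad_and_mask([[2, 7229, 7945], [2, 5330]], max_len=4)
print(padded)  # [[2, 7229, 7945, 0], [2, 5330, 0, 0]]
print(masks)   # [[1, 1, 1, 0], [1, 1, 0, 0]]
```

The mask tells the model which positions hold real tokens, so the value used for padding does not influence attention.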

In [5]:
# 1. Integer-encode the labels (output: y)
# Note: the mapping is built from the test labels only, so every training label must also appear in the test set.
label_idx = {j:i for i,j in enumerate(sorted(set(test.label.values.tolist())))}  # dict{ 'ASM14261' : 0 , 'ASM14262' : 1 , ...} 
train_labels = [label_idx[i] for i in train.label.values.tolist()]
test_labels = [label_idx[i] for i in test.label.values.tolist()]
print('Number of train rows :',len(train_labels) , '\nNumber of test rows :',len(test_labels))
print('\n▶ Label integer-encoding example :' , test.label.values.tolist()[:3], '->', test_labels[:3] )

Number of train rows : 34396 
Number of test rows : 8327

▶ Label integer-encoding example : ['ASM14326', 'ASM30034', 'ASM30014'] -> [6, 65, 58]


<br><br>
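#### (4) Model training (three approaches)

A. Using the shared Hugging Face model (.from_pretrained) \
B. Adapting the public example from the SKT KoBERT GitHub repository \
C. KoBERT implemented in PyTorch (from scratch) \
C-2. (Additional) training with .py script files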


<br><br>
#### A. Using the shared Hugging Face model (.from_pretrained)
- Tokenizing with monologg/kobert produces many [UNK] tokens
- Not reliable enough to use

In [8]:
# 2. Tokenize the text (input: x)
from transformers import BertTokenizer, AutoTokenizer

# Load the tokenizer of the shared Hugging Face model.
tokenizer = AutoTokenizer.from_pretrained("monologg/kobert")

# Example) print the original text
print('Original train sentence: ', train_texts[0])

# Example) tokenize the sentence
print('\nTokenized: ', tokenizer.tokenize(train_texts[0]))

# Example) map the tokens to token ids (integers)
print('\nToken ids: ', tokenizer.convert_tokens_to_ids(tokenizer.tokenize(train_texts[0])))

Original train sentence:  고객님 정보 확인 불가 문의 안녕하세요 당일 고객님 분실정지 진행하려고 하는데 인입시 고객정보가 조회되지 않습니다 처리결과 값이 올바르지 않습니다 라고 나오면 정보조회의 기본정보 성명 생년월일등 가 확인되지 않습니다 해당 내용 확인 부탁 드립니다 감사합니다

Tokenized:  ['[UNK]', '정보', '확인', '[UNK]', '[UNK]', '[UNK]', '[UNK]', '[UNK]', '[UNK]', '[UNK]', '하는데', '[UNK]', '[UNK]', '[UNK]', '[UNK]', '[UNK]', '[UNK]', '[UNK]', '[UNK]', '라고', '[UNK]', '[UNK]', '[UNK]', '[UNK]', '[UNK]', '가', '[UNK]', '[UNK]', '[UNK]', '[UNK]', '확인', '[UNK]', '드립니다', '[UNK]']

Token ids:  [0, 7229, 7945, 0, 0, 0, 0, 0, 0, 0, 7795, 0, 0, 0, 0, 0, 0, 0, 0, 6004, 0, 0, 0, 0, 0, 5330, 0, 0, 0, 0, 7945, 0, 5925, 0]


In [None]:
# The model trained with this tokenizer also performed poorly

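> The shared KoBERT model on Hugging Face ("monologg/kobert") proved unreliable in our tests, so we did not adopt it. \
> As the example above shows, the tokenizer loaded this way maps **most tokens in a sentence to [UNK]**, which loses information. \
> At the time of writing there was no official SKT release on Hugging Face; this model was shared by an individual who ported KoBERT to the Hugging Face format.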
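
To put a number on the information loss described above, the share of `[UNK]` tokens can be measured over a set of sentences. The following is a minimal sketch; `unk_ratio` is a hypothetical helper, and `toy_tokenize` is a stand-in tokenizer used only so that the example runs without downloading a model.

```python
from typing import Callable, Iterable, List


def unk_ratio(
    texts: Iterable[str],
    tokenize: Callable[[str], List[str]],
    unk_token: str = "[UNK]",
) -> float:
    """Return the fraction of all produced tokens that equal unk_token."""
    total = 0
    unk = 0
    for text in texts:
        tokens = tokenize(text)
        total += len(tokens)
        unk += sum(1 for t in tokens if t == unk_token)
    return unk / total if total else 0.0


# Stand-in tokenizer for illustration only: whitespace split with a tiny known vocabulary
known = {"정보", "확인", "하는데"}


def toy_tokenize(sentence: str) -> List[str]:
    return [w if w in known else "[UNK]" for w in sentence.split()]


print(unk_ratio(["고객님 정보 확인 불가"], toy_tokenize))  # 0.5
```

With the tokenizer loaded in this section, one would pass `tokenizer.tokenize` instead of `toy_tokenize`. In the single example printed above, 27 of the 34 tokens are `[UNK]`.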

<br><br>
#### B. Adapting the public example from the SKT KoBERT GitHub repository
- Implemented by customizing the public example

In [1]:
!pip install mxnet-cu101
!pip install gluonnlp pandas tqdm
!pip install sentencepiece==0.1.85
!pip install transformers==2.1.1
!pip install torch==1.3.1

Collecting mxnet-cu101
  Downloading https://files.pythonhosted.org/packages/40/26/9655677b901537f367c3c473376e4106abc72e01a8fc25b1cb6ed9c37e8c/mxnet_cu101-1.7.0-py2.py3-none-manylinux2014_x86_64.whl (846.0MB)
Collecting graphviz<0.9.0,>=0.8.1
  Downloading https://files.pythonhosted.org/packages/53/39/4ab213673844e0c004bed8a0781a0721a3f6bb23eb8854ee75c236428892/graphviz-0.8.4-py2.py3-none-any.whl
Installing collected packages: graphviz, mxnet-cu101
  Found existing installation: graphviz 0.10.1
    Uninstalling graphv

In [2]:
!pip install git+https://git@github.com/SKTBrain/KoBERT.git@master

Collecting git+https://****@github.com/SKTBrain/KoBERT.git@master
  Cloning https://****@github.com/SKTBrain/KoBERT.git (to revision master) to /tmp/pip-req-build-wk2cz5rs
  Running command git clone -q 'https://****@github.com/SKTBrain/KoBERT.git' /tmp/pip-req-build-wk2cz5rs
Building wheels for collected packages: kobert
  Building wheel for kobert (setup.py) ... done
  Created wheel for kobert: filename=kobert-0.1.1-cp36-none-any.whl size=12825 sha256=724be3bc8f59370043f4eca70147e50fd1716b5fe66b83730f3328f1d64d851b
  Stored in directory: /tmp/pip-ephem-wheel-cache-85qegn8t/wheels/a2/b0/41/435ee4e918f91918be41529283c5ff86cd010f02e7525aecf3
Successfully built kobert
Installing collected packages: kobert
Successfully installed kobert-0.1.1


In [3]:
import torch
from torch import nn
import torch.nn.functional as F
import torch.optim as optim
from torch.utils.data import Dataset, DataLoader
import gluonnlp as nlp
import numpy as np
from tqdm import tqdm, tqdm_notebook

from kobert.utils import get_tokenizer
from kobert.pytorch_kobert import get_pytorch_kobert_model

from transformers import AdamW
from transformers.optimization import WarmupLinearSchedule

## When using a GPU
device = torch.device("cuda:0")

In [4]:
# 2. Tokenize the text (input: x)

# Model
bertmodel, vocab = get_pytorch_kobert_model()
    
# Load the data
dataset_train = nlp.data.TSVDataset('data/train_data.tsv', field_indices=[0,1], num_discard_samples=1, allow_missing=True)
dataset_test = nlp.data.TSVDataset('data/test_data.tsv', field_indices=[0,1], num_discard_samples=1, allow_missing=True)

tokenizer = get_tokenizer()
tok = nlp.data.BERTSPTokenizer(tokenizer, vocab, lower=False)


using cached model


In [10]:
# Load the dataset 
import pandas as pd
train = pd.read_csv("train_data.tsv", sep ='\t', header=0, encoding='utf-8', names=['label','text'])
test = pd.read_csv("test_data.tsv", sep ='\t', header=0, encoding='utf-8', names=['label','text'])

# Number of sentences
print('Number of training sentences: {:,}\n'.format(train.shape[0]))
print('Number of test sentences: {:,}\n'.format(test.shape[0]))

# Extract the text and label columns of train/test as lists.
train_texts = train.text.values.tolist()
train_labels = train.label.values.tolist()

test_texts = test.text.values.tolist()
test_labels = test.label.values.tolist()

# Note: the mapping is built from the test labels only, so every training label must also appear in the test set.
label_dict = {j:i for i,j in enumerate(sorted(set(test.label.values.tolist())))}  # dict{ 'ASM14261' : 0 , 'ASM14262' : 1 , ...} 
train_labels = [label_dict[i] for i in train.label.values.tolist()]
test_labels = [label_dict[i] for i in test.label.values.tolist()]

Number of training sentences: 34,396

Number of test sentences: 8,327



In [11]:
class BERTDataset(Dataset):
    def __init__(self, dataset, labels, label_idx, sent_idx, bert_tokenizer, max_len,
                 pad, pair):
        transform = nlp.data.BERTSentenceTransform(
            bert_tokenizer, max_seq_length=max_len, pad=pad, pair=pair)

        self.sentences = [transform([i[sent_idx]]) for i in dataset]
        self.labels = [np.int32(i) for i in labels]

    def __getitem__(self, i):
        return (self.sentences[i] + (self.labels[i], ))

    def __len__(self):
        return (len(self.labels))

In [12]:
## Setting parameters
max_len = 256
batch_size = 16
warmup_ratio = 0.1 #0
num_epochs = 8
max_grad_norm = 1
log_interval = 200
learning_rate =  5e-5

In [13]:
data_train = BERTDataset(dataset_train, train_labels, 0, 1, tok, max_len, True, False)
data_test = BERTDataset(dataset_test, test_labels, 0, 1, tok, max_len, True, False)

In [14]:
train_dataloader = torch.utils.data.DataLoader(data_train, batch_size=batch_size, num_workers=5)
test_dataloader = torch.utils.data.DataLoader(data_test, batch_size=batch_size, num_workers=5)

In [15]:
len(label_dict)

183

In [16]:
class BERTClassifier(nn.Module):
    def __init__(self,
                 bert,
                 hidden_size = 768,
                 num_classes= 183,
                 dr_rate=None,
                 params=None):
        super(BERTClassifier, self).__init__()
        self.bert = bert
        self.dr_rate = dr_rate
        self.classifier = nn.Linear(hidden_size , num_classes)
        if dr_rate:
            self.dropout = nn.Dropout(p=dr_rate)
    
    def gen_attention_mask(self, token_ids, valid_length):
        attention_mask = torch.zeros_like(token_ids)
        for i, v in enumerate(valid_length):
            attention_mask[i][:v] = 1
        return attention_mask.float()

    def forward(self, token_ids, valid_length, segment_ids):
        attention_mask = self.gen_attention_mask(token_ids, valid_length)
        _, pooler = self.bert(input_ids = token_ids, token_type_ids = segment_ids.long(), attention_mask = attention_mask.float().to(token_ids.device))
        # Apply dropout only when dr_rate is set; otherwise pass the pooled output through unchanged
        out = self.dropout(pooler) if self.dr_rate else pooler
        return self.classifier(out)

In [17]:
model = BERTClassifier(bertmodel, dr_rate=0.5).to(device)

In [19]:
# Prepare optimizer and schedule (linear warmup and decay)
no_decay = ['bias', 'LayerNorm.weight']
optimizer_grouped_parameters = [
    {'params': [p for n, p in model.named_parameters() if not any(nd in n for nd in no_decay)], 'weight_decay': 0.01},
    {'params': [p for n, p in model.named_parameters() if any(nd in n for nd in no_decay)], 'weight_decay': 0.0}
]
# Optimizer
optimizer = AdamW(optimizer_grouped_parameters, lr=learning_rate)
loss_fn = nn.CrossEntropyLoss()

# Scheduler
t_total = len(train_dataloader) * num_epochs
warmup_step = int(t_total * warmup_ratio)
scheduler = WarmupLinearSchedule(optimizer, warmup_steps=warmup_step, t_total=t_total)

In [22]:
def calc_accuracy(X,Y):
    max_vals, max_indices = torch.max(X, 1)
    train_acc = (max_indices == Y).sum().data.cpu().numpy()/max_indices.size()[0]
    return train_acc

■ Start training
- max_len: 256, lr = 5e-5
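
The scheduler configured above raises the learning rate linearly during warmup and then lowers it linearly to zero. The sketch below reproduces that multiplier in plain Python, assuming the usual linear warmup/decay formula; `lr_multiplier` is a hypothetical helper, and the step counts follow from 34,396 training rows at batch size 16 over 8 epochs.

```python
import math


def lr_multiplier(step: int, warmup_steps: int, t_total: int) -> float:
    """Linear warmup from 0 to 1, then linear decay from 1 back to 0."""
    if step < warmup_steps:
        return step / max(1, warmup_steps)
    return max(0.0, (t_total - step) / max(1, t_total - warmup_steps))


steps_per_epoch = math.ceil(34396 / 16)   # batches per epoch
t_total = steps_per_epoch * 8             # num_epochs = 8
warmup_steps = int(t_total * 0.1)         # warmup_ratio = 0.1

# Learning rate at the start, mid-warmup, the peak, and the final step
for step in (0, warmup_steps // 2, warmup_steps, t_total):
    print(step, 5e-5 * lr_multiplier(step, warmup_steps, t_total))
```

The peak value equals the configured `learning_rate` and is reached once warmup ends; every later step uses a smaller rate.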

In [23]:
# Keep the per-epoch test metrics outside the loop so they persist across epochs
test_acc_list = []
test_loss_list = []
for e in range(num_epochs):
    train_acc = 0.0
    test_acc = 0.0
    test_loss = 0.0
    model.train()
    for batch_id, (token_ids, valid_length, segment_ids, label) in enumerate(tqdm_notebook(train_dataloader)):
        optimizer.zero_grad()
        token_ids = token_ids.long().to(device)
        segment_ids = segment_ids.long().to(device)
        label = label.long().to(device)
        out = model(token_ids, valid_length, segment_ids)
        loss = loss_fn(out, label)
        loss.backward()
        torch.nn.utils.clip_grad_norm_(model.parameters(), max_grad_norm)
        optimizer.step()
        scheduler.step()  # Update learning rate schedule
        train_acc += calc_accuracy(out, label)
        if batch_id % log_interval == 0:
            print("epoch {} batch id {} loss {} train acc {}".format(e+1, batch_id+1, loss.data.cpu().numpy(), train_acc / (batch_id+1)))
    print("epoch {} train acc {}".format(e+1, train_acc / (batch_id+1)))
    model.eval()
    with torch.no_grad():  # no gradients are needed for evaluation
        for batch_id, (token_ids, valid_length, segment_ids, label) in enumerate(tqdm_notebook(test_dataloader)):
            token_ids = token_ids.long().to(device)
            segment_ids = segment_ids.long().to(device)
            label = label.long().to(device)
            out = model(token_ids, valid_length, segment_ids)
            loss = loss_fn(out, label)
            test_acc += calc_accuracy(out, label)
            test_loss += loss.item()  # accumulate a Python float, not a tensor
    avg_loss = test_loss / (batch_id+1)
    avg_acc = test_acc / (batch_id+1)
    print("epoch {} test loss {}".format(e+1, avg_loss))
    print("epoch {} test acc {}".format(e+1, avg_acc))
    test_acc_list.append(avg_acc)
    # Save a checkpoint whenever the test loss matches or beats the best value so far
    if not test_loss_list or avg_loss <= min(test_loss_list):
        torch.save(model.state_dict(), 'kobert_256_16.pt')
    test_loss_list.append(avg_loss)


epoch 1 batch id 1 loss 5.303109645843506 train acc 0.0
epoch 1 batch id 201 loss 5.0639777183532715 train acc 0.02767412935323383
epoch 1 batch id 401 loss 4.778959274291992 train acc 0.04239401496259352
epoch 1 batch id 601 loss 4.11456298828125 train acc 0.05178868552412646
epoch 1 batch id 801 loss 3.341724395751953 train acc 0.07997815230961298
epoch 1 batch id 1001 loss 2.7731356620788574 train acc 0.12793456543456544
epoch 1 batch id 1201 loss 2.7290821075439453 train acc 0.17938176519567028
epoch 1 batch id 1401 loss 2.468221426010132 train acc 0.2221627408993576
epoch 1 batch id 1601 loss 1.7572476863861084 train acc 0.25975952529668955
epoch 1 batch id 1801 loss 1.4746654033660889 train acc 0.2940380344253193
epoch 1 batch id 2001 loss 1.746311068534851 train acc 0.3232446276861569

epoch 1 train acc 0.34375




epoch 1 test acc 0.6062517137373183



epoch 2 batch id 1 loss 1.67597234249115 train acc 0.5625
epoch 2 batch id 201 loss 1.0161962509155273 train acc 0.5718283582089553
epoch 2 batch id 401 loss 2.393324136734009 train acc 0.593360349127182
epoch 2 batch id 601 loss 1.1267873048782349 train acc 0.5992096505823628
epoch 2 batch id 801 loss 1.1827679872512817 train acc 0.6113451935081149
epoch 2 batch id 1001 loss 2.041337013244629 train acc 0.6221903096903096
epoch 2 batch id 1201 loss 1.0758068561553955 train acc 0.6343671940049959
epoch 2 batch id 1401 loss 1.851233720779419 train acc 0.645521056388294
epoch 2 batch id 1601 loss 0.96855628490448 train acc 0.6546299188007495
epoch 2 batch id 1801 loss 0.41522830724716187 train acc 0.6634855635757912
epoch 2 batch id 2001 loss 0.8762131929397583 train acc 0.6714455272363818

epoch 2 train acc 0.6765116279069767




epoch 2 test acc 0.7033006580751302



epoch 3 batch id 1 loss 1.3271163702011108 train acc 0.625
epoch 3 batch id 201 loss 0.4705163240432739 train acc 0.7154850746268657
epoch 3 batch id 401 loss 1.6134333610534668 train acc 0.7227244389027432
epoch 3 batch id 601 loss 0.5181348323822021 train acc 0.721089850249584
epoch 3 batch id 801 loss 0.660610020160675 train acc 0.7272159800249688
epoch 3 batch id 1001 loss 1.505534291267395 train acc 0.7328921078921079
epoch 3 batch id 1201 loss 0.5381229519844055 train acc 0.739748126561199
epoch 3 batch id 1401 loss 1.1948156356811523 train acc 0.7465649536045682
epoch 3 batch id 1601 loss 0.5854792594909668 train acc 0.7524594003747658
epoch 3 batch id 1801 loss 0.4388090968132019 train acc 0.7570099944475291
epoch 3 batch id 2001 loss 0.2723481357097626 train acc 0.7615567216391804

epoch 3 train acc 0.7647383720930233




epoch 3 test acc 0.7417226487523992



epoch 4 batch id 1 loss 1.2908886671066284 train acc 0.75
epoch 4 batch id 201 loss 0.44723010063171387 train acc 0.7820273631840796
epoch 4 batch id 401 loss 1.234215497970581 train acc 0.7835099750623441
epoch 4 batch id 601 loss 0.4354294240474701 train acc 0.7792221297836939
epoch 4 batch id 801 loss 0.49687978625297546 train acc 0.783941947565543
epoch 4 batch id 1001 loss 1.2532260417938232 train acc 0.7900224775224776
epoch 4 batch id 1201 loss 0.23885023593902588 train acc 0.7955349708576187
epoch 4 batch id 1401 loss 1.1808233261108398 train acc 0.8010795860099929
epoch 4 batch id 1601 loss 0.4593297839164734 train acc 0.8059025608994379
epoch 4 batch id 1801 loss 0.23374906182289124 train acc 0.8098972792892837
epoch 4 batch id 2001 loss 0.21739551424980164 train acc 0.8135932033983009

epoch 4 train acc 0.8155232558139535




epoch 4 test acc 0.7612763915547025



epoch 5 batch id 1 loss 0.8204097151756287 train acc 0.8125
epoch 5 batch id 201 loss 0.398654967546463 train acc 0.8280472636815921
epoch 5 batch id 401 loss 0.9888365864753723 train acc 0.8308915211970075
epoch 5 batch id 601 loss 0.32603320479393005 train acc 0.8299708818635607
epoch 5 batch id 801 loss 0.707271158695221 train acc 0.8320848938826467
epoch 5 batch id 1001 loss 1.1512913703918457 train acc 0.8361013986013986
epoch 5 batch id 1201 loss 0.17350754141807556 train acc 0.8409138218151541
epoch 5 batch id 1401 loss 0.9455651640892029 train acc 0.8456905781584583
epoch 5 batch id 1601 loss 0.2592124044895172 train acc 0.84946908182386
epoch 5 batch id 1801 loss 0.21805977821350098 train acc 0.8531024430871738
epoch 5 batch id 2001 loss 0.1412857174873352 train acc 0.8566654172913544

epoch 5 train acc 0.8587015503875969




epoch 5 test acc 0.7642754318618042



epoch 6 batch id 1 loss 0.6702744364738464 train acc 0.8125
epoch 6 batch id 201 loss 0.20723655819892883 train acc 0.8610074626865671
epoch 6 batch id 401 loss 0.6920619010925293 train acc 0.8644014962593516
epoch 6 batch id 601 loss 0.3106262981891632 train acc 0.8631447587354409
epoch 6 batch id 801 loss 0.31842437386512756 train acc 0.8658707865168539
epoch 6 batch id 1001 loss 0.7936205267906189 train acc 0.8692557442557443
epoch 6 batch id 1201 loss 0.06561380624771118 train acc 0.8742714404662781
epoch 6 batch id 1401 loss 0.9733357429504395 train acc 0.8785242683797287
epoch 6 batch id 1601 loss 0.16807642579078674 train acc 0.8826514678326046
epoch 6 batch id 1801 loss 0.028620421886444092 train acc 0.8854802887284842
epoch 6 batch id 2001 loss 0.06731796264648438 train acc 0.8889930034982508

epoch 6 train acc 0.8899709302325581




epoch 6 test acc 0.7785508637236085



epoch 7 batch id 1 loss 0.4677696228027344 train acc 0.8125
epoch 7 batch id 201 loss 0.2717874050140381 train acc 0.8908582089552238
epoch 7 batch id 401 loss 0.6238974928855896 train acc 0.8952618453865336
epoch 7 batch id 601 loss 0.3166370987892151 train acc 0.8950707154742097
epoch 7 batch id 801 loss 0.09493833780288696 train acc 0.8968476903870163
epoch 7 batch id 1001 loss 0.7508553862571716 train acc 0.8980394605394605
epoch 7 batch id 1201 loss 0.02687174081802368 train acc 0.9012801831806828
epoch 7 batch id 1401 loss 0.7920633554458618 train acc 0.9052462526766595
epoch 7 batch id 1601 loss 0.052518486976623535 train acc 0.9084556527170519
epoch 7 batch id 1801 loss 0.025860190391540527 train acc 0.9104664075513603
epoch 7 batch id 2001 loss 0.034074246883392334 train acc 0.9132621189405298

epoch 7 train acc 0.9141957364341085




epoch 7 test acc 0.7863483685220729



epoch 8 batch id 1 loss 0.4070279002189636 train acc 0.9375
epoch 8 batch id 201 loss 0.02105635404586792 train acc 0.9104477611940298
epoch 8 batch id 401 loss 0.40893927216529846 train acc 0.9141209476309227
epoch 8 batch id 601 loss 0.25954002141952515 train acc 0.9137895174708819
epoch 8 batch id 801 loss 0.09685277938842773 train acc 0.9158863920099876
epoch 8 batch id 1001 loss 0.38326725363731384 train acc 0.9177072927072927
epoch 8 batch id 1201 loss 0.027375519275665283 train acc 0.920587010824313
epoch 8 batch id 1401 loss 0.9562413096427917 train acc 0.923313704496788
epoch 8 batch id 1601 loss 0.0733366310596466 train acc 0.9262960649594004
epoch 8 batch id 1801 loss 0.009888052940368652 train acc 0.9279219877845641
epoch 8 batch id 2001 loss 0.07057502865791321 train acc 0.9302848575712144

epoch 8 train acc 0.9304554263565892




epoch 8 test acc 0.7905470249520153


> With the public-example version, learning_rate = 5e-5 gave <code>test acc : 0.7905</code>.
(For reference, lr = 2e-5 gave 0.7816.)

<br><br>
#### C. KoBERT implemented in PyTorch (from scratch)
- The training module was implemented from scratch in PyTorch.
- Instead of truncating every sentence to max_len and computing input_ids up front (as in the Google BERT version), input_ids and masks are built on the fly for the batch needed at each step. \
    (This helps reduce memory pressure.)
- Live training progress is displayed with the pbar of the tqdm package.
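
The memory argument above can be made concrete by counting padded token slots. The sketch below compares padding every sentence to a fixed `max_len` with padding each batch only to its own longest sentence; both helper functions and the sample lengths are hypothetical and serve only as an illustration.

```python
from typing import List


def padded_tokens_fixed(lengths: List[int], max_len: int) -> int:
    """Token slots when every sequence is padded to max_len."""
    return len(lengths) * max_len


def padded_tokens_dynamic(lengths: List[int], max_len: int, batch_size: int) -> int:
    """Token slots when each batch is padded only to its own longest (truncated) sequence."""
    total = 0
    for start in range(0, len(lengths), batch_size):
        batch = [min(n, max_len) for n in lengths[start:start + batch_size]]
        total += len(batch) * max(batch)
    return total


# Hypothetical token counts for six sentences
lengths = [12, 40, 7, 300, 25, 18]
print(padded_tokens_fixed(lengths, max_len=256))                  # 1536
print(padded_tokens_dynamic(lengths, max_len=256, batch_size=2))  # 642
```

Per-batch padding never uses more slots than fixed padding, and the saving grows when most sentences are much shorter than `max_len`.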

In [7]:
import random
import numpy as np
seed_val = 2020
random.seed(seed_val)
np.random.seed(seed_val)
torch.manual_seed(seed_val)
torch.cuda.manual_seed_all(seed_val)

In [8]:
from kobert.pytorch_kobert import get_pytorch_kobert_model
from torch import nn

# 1) BERT classification model
class BertClassification(nn.Module):
    def __init__(self, n_class):
        super(BertClassification, self).__init__()
        self.encoder, _  = get_pytorch_kobert_model()
        self.classification_head = nn.Linear(768,n_class)
        nn.init.normal_(self.classification_head.weight,std=0.02)

    def forward(self, input_ids, input_mask, token_type_ids):
        # Pass arguments by keyword so the call does not depend on the positional order of the encoder's forward()
        sequence_output, pooled_output = self.encoder(input_ids=input_ids, attention_mask=input_mask, token_type_ids=token_type_ids)
        return self.classification_head(pooled_output)

In [9]:
from gluonnlp.data import SentencepieceTokenizer
from kobert.utils import get_tokenizer
from kobert.pytorch_kobert import get_pytorch_kobert_model
import torch

# 2) Index, pad, and mask one batch at a time
class Batchfier:
    def __init__(self, padding_idx = 0):
        tok_path = get_tokenizer()
        # 2. Tokenize the text (input: x)
        self.tokenizer = SentencepieceTokenizer(tok_path) # use SentencepieceTokenizer
        _, self.vocab = get_pytorch_kobert_model() 
        self.padding_idx = padding_idx

    def batchfy(self, texts, labels=None):
        """
        > param texts : lists of texts
        > param labels : lists of indexed labels
        > return : indexed texts, masks, token_type_ids
        """
        def index_one(text):
            # Prepend the [CLS] token to the sentence (for classification).
            indexed = [self.vocab[self.vocab.cls_token]] + self.vocab[self.tokenizer(text)]
            return indexed[:MAX_LEN]   
            
        indexed = [index_one(text) for text in texts]
        padded = self.pad_indexed(indexed)
        masks = self.get_masks(indexed)
        types = self.get_types(indexed)
        
        if labels is not None:
            return padded, masks, types, torch.LongTensor(labels)
        else:
            return padded, masks, types

    def pad_indexed(self, indexed):
        padded = [torch.LongTensor(text) for text in indexed]
        padded = torch.nn.utils.rnn.pad_sequence(padded, batch_first=True, padding_value=self.padding_idx)
        return padded

    def get_masks(self, indexed):
        masks = [torch.LongTensor([1 for _ in range(len(text))]) for text in indexed]
        masks = torch.nn.utils.rnn.pad_sequence(masks, batch_first=True, padding_value=0)
        return masks

    def get_types(self, indexed):
        bs, l = len(indexed), max(map(len,indexed))
        return torch.zeros((bs,l), dtype=torch.int64)

In [10]:
from torch import nn
from tqdm import tqdm
import random

# 3) Trainer class implementing the training and test (validation) functions
class Trainer:
    def __init__(self, model, optimizer, scheduler, batch_size=16 ): 
        self.model = model
        self.optimizer = optimizer
        self.batchfier = Batchfier()
        self.batch_size = batch_size
        self.cur_idx = 0
        self.criteria = nn.CrossEntropyLoss()
        self.scheduler = scheduler 
        
    def get_acc(self, logits, y):
        _, predicted = torch.max(logits.data, 1)
        total = y.size(0)
        correct = (predicted == y).sum().item()
        return correct, total

    def get_batch(self, texts, labels):
        # Stop once every example has been consumed (>= avoids returning an empty batch)
        if self.cur_idx >= len(texts):
            return None
        # Slice out one batch
        batch_texts = texts[self.cur_idx:self.cur_idx + self.batch_size]
        batch_labels = labels[self.cur_idx:self.cur_idx + self.batch_size]
        self.cur_idx += self.batch_size
        return self.batchfier.batchfy(batch_texts, batch_labels)    
    
    def train_epochs(self, texts, labels):
        self.cur_idx = 0
        self.model.train()
        self.model.zero_grad()
        pbar = tqdm()
        tot_loss = 0
        tot_correct = 0
        total_n = 0
        total_step = 0
        # Shuffle the training data
        combine  = list(zip(texts,labels))
        random.shuffle(combine)
        texts,labels = zip(*combine)
        while True:
            res = self.get_batch(texts,labels) 
            if not res:
                break
            x, masks, token_index, y = res
            x = x.to(device)
            masks = masks.to(device)
            token_index = token_index.to(device)
            y = y.to(device)
            logits = self.model(x, masks, token_index)
            loss = self.criteria(logits, y)
            correct, n = self.get_acc(logits,y)
            loss.backward()
            
            torch.nn.utils.clip_grad_norm_(self.model.parameters(), 1.0)  # clip gradients to a max norm of 1.0
            self.optimizer.step()
            self.scheduler.step()  # update the learning rate
            self.model.zero_grad()
            tot_correct += correct
            total_n += n
            tot_loss += loss.item()  # accumulate a float so the autograd graph is not retained
            total_step +=1
            pbar.set_description(
                "training loss : %f training acc : %f step : %f" % (
                    tot_loss / total_step, tot_correct / total_n, total_step) )
        pbar.close()

    def test_epochs(self, texts, labels):
        self.cur_idx = 0
        self.model.eval()
        pbar = tqdm()
        tot_loss = 0
        tot_correct = 0
        total_n = 0
        total_step = 0
        while True:
            with torch.no_grad():
                res = self.get_batch(texts, labels)
                if not res:
                    break
                x, masks, token_index, y = res
                x = x.to(device)
                masks = masks.to(device)
                token_index = token_index.to(device)
                y = y.to(device)
                logits = self.model(x, masks, token_index)
                loss = self.criteria(logits, y)
                correct, n = self.get_acc(logits, y)
                tot_correct += correct
                total_n += n
                tot_loss += loss.item()
                total_step += 1               
                pbar.set_description(
                    "test loss : %f test acc : %f step : %f" % (
                        tot_loss / total_step, tot_correct / total_n, total_step))
                epoch_loss = tot_loss/total_step
                epoch_acc = tot_correct/total_n
        pbar.close()
        return epoch_loss, epoch_acc


In [11]:
import math
import torch
from transformers import  get_linear_schedule_with_warmup

# Experiment with a large max_len
epochs = 12
MAX_LEN = 400 
BATCH_SIZE = 8

# Build the model, optimizer, and scheduler
def get_model(lr,train_texts):
    model = BertClassification(n_class=183).to(device)
    optimizer = torch.optim.AdamW(model.parameters(), lr=lr, eps = 1e-8) 
    # The scheduler steps once per batch, so count whole batches per epoch
    total_steps = math.ceil(len(train_texts) / BATCH_SIZE) * epochs
    scheduler = get_linear_schedule_with_warmup(optimizer, 
                                            num_warmup_steps = 0, 
                                            num_training_steps = total_steps) 
    return model, optimizer , scheduler

# Main function
def main(train_texts, train_labels, test_texts, test_labels):
    lrs = [2.5e-5]
    for lrss in lrs:
        print('--------{}-----'.format(lrss))
        model, optimizer, scheduler = get_model(lr=lrss, train_texts=train_texts) # re-initialize for each learning rate
        trainer = Trainer(model, optimizer, scheduler, batch_size=BATCH_SIZE)
        loss_list = []
        for i in range(epochs): 
            trainer.train_epochs(train_texts, train_labels)
            epoch_loss, epoch_acc = trainer.test_epochs(test_texts, test_labels)
#             if not loss_list or epoch_loss <= min(loss_list):
#                 torch.save(model.state_dict(), f'kobert_{lrss}_{MAX_LEN}_{BATCH_SIZE}.pt') 
#             loss_list.append(epoch_loss)

In [16]:
# Check the training data
print('Number of train texts :',len(train_texts) , '\nNumber of train labels :',len(train_labels))
print('Number of test texts :',len(test_texts) , '\nNumber of test labels :',len(test_labels))

Number of train texts : 34396 
Number of train labels : 34396
Number of test texts : 8327 
Number of test labels : 8327


■ Start training
- With the custom KoBERT pipeline, lr=2.5e-5 and max_len=400 gave comparatively good performance.
- Final performance was 0.7772 by the best_loss criterion and 0.7918 by the best_acc criterion.
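
The two selection criteria quoted above can pick different epochs. The sketch below shows the difference using the (test loss, test acc) pairs of the first six epochs printed in the first training log of this section; `best_epoch` is a hypothetical helper, not code from this notebook.

```python
from typing import List, Tuple

# (test loss, test acc) per epoch, copied from the first six epochs of the first run
history: List[Tuple[float, float]] = [
    (1.482641, 0.664585),
    (1.175307, 0.732317),
    (1.053249, 0.753212),
    (1.041169, 0.762459),
    (1.036964, 0.777231),
    (1.066042, 0.782635),
]


def best_epoch(history: List[Tuple[float, float]], by: str = "loss") -> Tuple[int, float, float]:
    """Return (1-based epoch, loss, acc) for the lowest loss or the highest accuracy."""
    if by == "loss":
        idx = min(range(len(history)), key=lambda i: history[i][0])
    else:
        idx = max(range(len(history)), key=lambda i: history[i][1])
    loss, acc = history[idx]
    return idx + 1, loss, acc


print(best_epoch(history, by="loss"))  # (5, 1.036964, 0.777231)
print(best_epoch(history, by="acc"))   # (6, 1.066042, 0.782635)
```

Among the epochs shown, the lowest test loss occurs at epoch 5 with accuracy 0.7772, which is consistent with the best_loss figure above. Accuracy keeps rising after the loss has bottomed out, so the best_acc criterion selects a later epoch.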


In [None]:
# Start training!
main(train_texts = train_texts, train_labels = train_labels, test_texts = test_texts, test_labels = test_labels)

--------2.5e-05-----
using cached model
using cached model
using cached model
using cached model
using cached model


training loss : 2.292726 training acc : 0.508489 step : 4300.000000: : 0it [11:18, ?it/s]
test loss : 1.482641 test acc : 0.664585 step : 1041.000000: : 0it [00:46, ?it/s]
training loss : 1.160836 training acc : 0.729591 step : 4300.000000: : 0it [11:04, ?it/s]
test loss : 1.175307 test acc : 0.732317 step : 1041.000000: : 0it [00:39, ?it/s]
training loss : 0.858477 training acc : 0.793057 step : 4300.000000: : 0it [10:28, ?it/s]
test loss : 1.053249 test acc : 0.753212 step : 1041.000000: : 0it [00:40, ?it/s]
training loss : 0.665886 training acc : 0.835533 step : 4300.000000: : 0it [11:26, ?it/s]
test loss : 1.041169 test acc : 0.762459 step : 1041.000000: : 0it [00:42, ?it/s]
training loss : 0.518977 training acc : 0.872253 step : 4300.000000: : 0it [11:27, ?it/s]
test loss : 1.036964 test acc : 0.777231 step : 1041.000000: : 0it [00:41, ?it/s]
training loss : 0.403604 training acc : 0.901587 step : 4300.000000: : 0it [11:22, ?it/s]
test loss : 1.066042 test acc : 0.782635 step : 10

In [50]:
# Start training!
main(train_texts = train_texts, train_labels = train_labels, test_texts = test_texts, test_labels = test_labels)

--------2.5e-05-----
using cached model
using cached model
using cached model
using cached model
using cached model


training loss : 2.387952 training acc : 0.471712 step : 4300.000000: : 0it [10:41, ?it/s]
test loss : 1.577102 test acc : 0.631200 step : 1041.000000: : 0it [00:39, ?it/s]
training loss : 1.235698 training acc : 0.709559 step : 4300.000000: : 0it [10:31, ?it/s]
test loss : 1.231580 test acc : 0.718146 step : 1041.000000: : 0it [00:39, ?it/s]
training loss : 0.921678 training acc : 0.775584 step : 4300.000000: : 0it [10:45, ?it/s]
test loss : 1.067951 test acc : 0.750691 step : 1041.000000: : 0it [00:39, ?it/s]
training loss : 0.727624 training acc : 0.819979 step : 4300.000000: : 0it [10:47, ?it/s]
test loss : 1.069843 test acc : 0.762459 step : 1041.000000: : 0it [00:39, ?it/s]
training loss : 0.595471 training acc : 0.853006 step : 4300.000000: : 0it [14:53, ?it/s]
test loss : 1.076355 test acc : 0.772427 step : 1041.000000: : 0it [01:16, ?it/s]
training loss : 0.486731 training acc : 0.880742 step : 4300.000000: : 0it [14:14, ?it/s]
test loss : 1.121378 test acc : 0.777231 step : 10

> The custom implementation reached up to <code>test acc : 0.7918</code>. \
> Since the project target of acc 0.80 looks hard to reach, we set KoBERT aside and continue training and experimenting with a different BERT model.

<br><br>
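#### C-2. (Additional) Saving each functional class as a .py script and training via bash
- The classes built in C. KoBERT implemented in PyTorch (from scratch) were split by function into .py files.
- Training was run through a bash script.
- (See the ./kobert folder for the related files.)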
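
When training moves from notebook cells to .py scripts, the hyperparameters usually become command-line flags that a bash script can set. The sketch below shows one possible shape of such an entry point using `argparse`. The flag names and defaults are hypothetical and are not taken from the files in ./kobert.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    """Hypothetical CLI for a training script; flag names are illustrative only."""
    parser = argparse.ArgumentParser(description="KoBERT fine-tuning (sketch)")
    parser.add_argument("--lr", type=float, default=2.5e-5, help="learning rate")
    parser.add_argument("--max-len", type=int, default=400, help="maximum tokens per sentence")
    parser.add_argument("--batch-size", type=int, default=8, help="examples per batch")
    parser.add_argument("--epochs", type=int, default=12, help="number of training epochs")
    return parser


# Parse an explicit argument list here; a real script would call parse_args() with no arguments
args = build_parser().parse_args(["--lr", "5e-5", "--batch-size", "16"])
print(vars(args))
```

A bash script can then run several configurations by calling the training script with different flag values, which keeps the experiment settings out of the Python source.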

In [None]:
cd kobert

In [7]:
!bash scripts/kobert_train_1.sh 

--------2.5e-05--------
--------epodch 1-----
training loss : 2.675329 training acc : 0.441418 step : 2150.000000 
test loss : 1.596798 test acc : 0.652936 step : 521.000000 
--------epodch 2-----
training loss : 1.294857 training acc : 0.708396 step : 2150.000000 
test loss : 1.199180 test acc : 0.727513 step : 521.000000 
--------epodch 3-----
training loss : 0.936986 training acc : 0.776893 step : 2150.000000 
test loss : 1.050810 test acc : 0.755494 step : 521.000000 
--------epodch 4-----
training loss : 0.727372 training acc : 0.822043 step : 2150.000000 
test loss : 0.985938 test acc : 0.763540 step : 521.000000 
--------epodch 5-----
training loss : 0.577965 training acc : 0.853966 step : 2150.000000 
test loss : 0.970217 test acc : 0.769064 step : 521.000000 
--------epodch 6-----
training loss : 0.462415 training acc : 0.883591 step : 2150.000000 
test loss : 0.998379 test acc : 0.773628 step : 521.000000 
--------epodch 7-----
training loss : 0.372296 training acc : 0.907053

<br><br>
#### (10) lessons learned

- Using KoBERT gave roughly 0.5% higher accuracy than Google BERT.
- As with Google BERT, the hyperparameters max_len = 256 and learning_rate = 5e-5 gave good results for KoBERT.
- We studied and applied training methods using PyTorch.
- We learned how to move from implementing code inside a notebook to implementing it as per-function/per-module .py scripts and training via bash.

<br><br><br><br><br><br><br>
<hr>

Author: 박은진