Reference: https://github.com/SKTBrain/KoBERT
## Classifying five emotions (joy, sadness, excitement, anger, depression) with KoBERT

+ Mount Google Drive and install the libraries.

In [1]:
from google.colab import drive
drive.mount('/content/drive')

Mounted at /content/drive


In [2]:
!pip install mxnet-cu101
!pip install gluonnlp pandas tqdm
!pip install sentencepiece==0.1.85
!pip install transformers==2.1.1
!pip install torch==1.3.1

Collecting mxnet-cu101
  Downloading https://files.pythonhosted.org/packages/40/26/9655677b901537f367c3c473376e4106abc72e01a8fc25b1cb6ed9c37e8c/mxnet_cu101-1.7.0-py2.py3-none-manylinux2014_x86_64.whl (846.0MB)
Collecting graphviz<0.9.0,>=0.8.1
  Downloading https://files.pythonhosted.org/packages/53/39/4ab213673844e0c004bed8a0781a0721a3f6bb23eb8854ee75c236428892/graphviz-0.8.4-py2.py3-none-any.whl
Installing collected packages: graphviz, mxnet-cu101
  Found existing installation: graphviz 0.10.1
    Uninstalling graphv

In [3]:
!pip install git+https://git@github.com/SKTBrain/KoBERT.git@master

Collecting git+https://****@github.com/SKTBrain/KoBERT.git@master
  Cloning https://****@github.com/SKTBrain/KoBERT.git (to revision master) to /tmp/pip-req-build-vcrmcthx
  Running command git clone -q 'https://****@github.com/SKTBrain/KoBERT.git' /tmp/pip-req-build-vcrmcthx
Building wheels for collected packages: kobert
  Building wheel for kobert (setup.py) ... done
  Created wheel for kobert: filename=kobert-0.1.1-cp36-none-any.whl size=12825 sha256=dc403803f253d5cc4634d3c842f2d507bb84c294c941fe64cd286f401abc54a7
  Stored in directory: /tmp/pip-ephem-wheel-cache-fmkv29ng/wheels/a2/b0/41/435ee4e918f91918be41529283c5ff86cd010f02e7525aecf3
Successfully built kobert
Installing collected packages: kobert
Successfully installed kobert-0.1.1


In [4]:
import pandas as pd
import torch
from torch import nn
import torch.nn.functional as F
import torch.optim as optim
from torch.utils.data import Dataset, DataLoader
import gluonnlp as nlp
import numpy as np
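from tqdm import tqdm
from tqdm.notebook import tqdm as tqdm_notebook  # alias keeps later cells unchanged and avoids the deprecation warning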

In [5]:
from kobert.utils import get_tokenizer
from kobert.pytorch_kobert import get_pytorch_kobert_model

In [6]:
from transformers import AdamW
from transformers.optimization import WarmupLinearSchedule

## Data preparation
+ Upload the angry, happy, love, sad, and depressed Excel files from https://github.com/lini1634/AdvancedML_project/tree/master/dataset to Google Drive, load them from there, and build the training data with pandas.


In [7]:
## When using a GPU
device = torch.device("cuda:0")
bertmodel, vocab = get_pytorch_kobert_model()



In [8]:
angry = pd.read_excel("/content/drive/My Drive/Colab Notebooks/4-2고급기계플젝/dataset/angry.xlsx")
happy = pd.read_excel("/content/drive/My Drive/Colab Notebooks/4-2고급기계플젝/dataset/happy.xlsx")
love = pd.read_excel("/content/drive/My Drive/Colab Notebooks/4-2고급기계플젝/dataset/love.xlsx")
sad = pd.read_excel("/content/drive/My Drive/Colab Notebooks/4-2고급기계플젝/dataset/sad.xlsx")
depressed = pd.read_excel("/content/drive/My Drive/Colab Notebooks/4-2고급기계플젝/dataset/depressed.xlsx")

In [9]:
angry.columns=['sentence','label']
happy.columns=['sentence','label']
love.columns=['sentence','label']
sad.columns=['sentence','label']
depressed.columns=['sentence','label']

In [10]:
angry = angry.dropna()
happy = happy.dropna()
love = love.dropna()
sad = sad.dropna()
depressed = depressed.dropna()

In [11]:
data=pd.concat([angry,happy,love,sad,depressed])
data = data.reset_index(drop=True)

In [12]:
data.head()

| | sentence | label |
|---|---|---|
| 0 | 화나다와 화내다 | 4 |
| 1 | 여러분물어볼게있는데요.화나다하고속상... | 4 |
| 2 | 화나다와 화내다의 구분방법 | 4 |
| 3 | ‘화나다’는 동사인가요 형용사인가요? | 4 |
| 4 | 화나다 / 차이다 | 4 |


In [13]:
from sklearn.model_selection import train_test_split
train,test = train_test_split(data,test_size=0.33, random_state=42)

In [14]:
train.to_csv('train.txt', sep="\t")

In [15]:
test.to_csv('test.txt', sep="\t")

+ Use gluonnlp and KoBERT (Korean BERT) to preprocess the text so that each sentence can be embedded as vectors that represent word meaning well.

In [16]:
dataset_train = nlp.data.TSVDataset("train.txt", field_indices=[1,2], num_discard_samples=1)
dataset_test = nlp.data.TSVDataset("test.txt", field_indices=[1,2], num_discard_samples=1)
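The `field_indices=[1, 2]` and `num_discard_samples=1` arguments follow from the layout that `to_csv` writes: one header row, then the DataFrame index, the sentence, and the label on each line. A minimal sketch of that selection in plain Python, where `read_tsv` is a hypothetical stand-in for `TSVDataset` and the second data row is made up:

```python
import csv
import io

# Layout written by DataFrame.to_csv(sep="\t"): header row, then index, sentence, label.
rows = [
    ("", "sentence", "label"),
    ("12", "화나다와 화내다", "4"),
    ("7", "hypothetical second sentence", "0"),
]
buf = io.StringIO()
csv.writer(buf, delimiter="\t", lineterminator="\n").writerows(rows)

def read_tsv(text, field_indices, num_discard_samples):
    """Skip the first lines, then keep only the requested columns of each row."""
    lines = text.splitlines()[num_discard_samples:]
    return [[line.split("\t")[i] for i in field_indices] for line in lines]

samples = read_tsv(buf.getvalue(), field_indices=[1, 2], num_discard_samples=1)
print(samples)
```

Column 0 (the shuffled DataFrame index) is dropped, so each sample is a `[sentence, label]` pair.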

In [17]:
tokenizer = get_tokenizer()
tok = nlp.data.BERTSPTokenizer(tokenizer, vocab, lower=False)

using cached model


In [18]:
class BERTDataset(Dataset):
    def __init__(self, dataset, sent_idx, label_idx, bert_tokenizer, max_len,
                 pad, pair):
        transform = nlp.data.BERTSentenceTransform(
            bert_tokenizer, max_seq_length=max_len, pad=pad, pair=pair)

        self.sentences = [transform([i[sent_idx]]) for i in dataset]
        self.labels = [np.int32(i[label_idx]) for i in dataset]

    def __getitem__(self, i):
        return (self.sentences[i] + (self.labels[i], ))

    def __len__(self):
        return (len(self.labels))
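Each item of `BERTDataset` is a tuple of token ids, valid length, segment ids, and label. The sketch below mimics that layout in plain Python, assuming a toy vocabulary and whitespace tokenization (both hypothetical; the notebook uses KoBERT's SentencePiece vocabulary):

```python
# Toy vocabulary; the real ids come from the KoBERT vocab object.
toy_vocab = {"[PAD]": 0, "[CLS]": 1, "[SEP]": 2, "[UNK]": 3, "나는": 4, "기쁘다": 5}

def encode(sentence, max_len):
    """Return (token_ids, valid_length, segment_ids) for a single sentence."""
    tokens = sentence.split()[: max_len - 2]      # leave room for [CLS] and [SEP]
    tokens = ["[CLS]"] + tokens + ["[SEP]"]
    ids = [toy_vocab.get(t, toy_vocab["[UNK]"]) for t in tokens]
    valid_length = len(ids)                       # number of non-padding positions
    ids += [toy_vocab["[PAD]"]] * (max_len - valid_length)
    segment_ids = [0] * max_len                   # single sentence: one segment
    return ids, valid_length, segment_ids

token_ids, valid_length, segment_ids = encode("나는 기쁘다", max_len=8)
print(token_ids, valid_length, segment_ids)
```

The `valid_length` value is what the classifier later uses to build its attention mask.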


## Model training

In [19]:
## Setting parameters
max_len = 64
batch_size = 64
warmup_ratio = 0.1
num_epochs = 5
max_grad_norm = 1
log_interval = 200
learning_rate =  5e-5

In [20]:
data_train = BERTDataset(dataset_train, 0, 1, tok, max_len, True, False)
data_test = BERTDataset(dataset_test, 0, 1, tok, max_len, True, False)

In [21]:
train_dataloader = torch.utils.data.DataLoader(data_train, batch_size=batch_size, num_workers=5)
test_dataloader = torch.utils.data.DataLoader(data_test, batch_size=batch_size, num_workers=5)

In [22]:
class BERTClassifier(nn.Module):
    def __init__(self,
                 bert,
                 hidden_size = 768,
                 num_classes=5,
                 dr_rate=None,
                 params=None):
        super(BERTClassifier, self).__init__()
        self.bert = bert
        self.dr_rate = dr_rate
                 
        self.classifier = nn.Linear(hidden_size , num_classes)
        if dr_rate:
            self.dropout = nn.Dropout(p=dr_rate)
    
    def gen_attention_mask(self, token_ids, valid_length):
        attention_mask = torch.zeros_like(token_ids)
        for i, v in enumerate(valid_length):
            attention_mask[i][:v] = 1
        return attention_mask.float()

    def forward(self, token_ids, valid_length, segment_ids):
        attention_mask = self.gen_attention_mask(token_ids, valid_length)
        
        _, pooler = self.bert(input_ids = token_ids, token_type_ids = segment_ids.long(), attention_mask = attention_mask.float().to(token_ids.device))
        # Apply dropout only when a rate was given; otherwise use the pooled output as-is.
        out = self.dropout(pooler) if self.dr_rate else pooler
        return self.classifier(out)
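`gen_attention_mask` sets the first `valid_length` positions of each row to 1 and leaves the padding at 0, so that attention ignores padded tokens. A plain-Python sketch of the same rule (the helper name and the sample lengths are illustrative):

```python
def attention_mask(valid_lengths, max_len):
    """1.0 for real tokens, 0.0 for padding, one row per sentence."""
    return [[1.0 if pos < v else 0.0 for pos in range(max_len)] for v in valid_lengths]

mask = attention_mask([4, 2], max_len=6)
for row in mask:
    print(row)
```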

In [23]:
model = BERTClassifier(bertmodel,  dr_rate=0.5).to(device)

In [24]:
# Prepare optimizer and schedule (linear warmup and decay)
no_decay = ['bias', 'LayerNorm.weight']
optimizer_grouped_parameters = [
    {'params': [p for n, p in model.named_parameters() if not any(nd in n for nd in no_decay)], 'weight_decay': 0.01},
    {'params': [p for n, p in model.named_parameters() if any(nd in n for nd in no_decay)], 'weight_decay': 0.0}
]
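The two groups are selected purely by parameter name: any name containing `bias` or `LayerNorm.weight` is exempt from weight decay. A sketch of that filter on a few made-up parameter names (hypothetical, not read from the actual model):

```python
no_decay = ['bias', 'LayerNorm.weight']

# Hypothetical names in the style of model.named_parameters().
names = [
    'bert.encoder.layer.0.attention.self.query.weight',
    'bert.encoder.layer.0.attention.self.query.bias',
    'bert.encoder.layer.0.output.LayerNorm.weight',
    'classifier.weight',
    'classifier.bias',
]

decayed = [n for n in names if not any(nd in n for nd in no_decay)]
exempt = [n for n in names if any(nd in n for nd in no_decay)]
print(decayed)
print(exempt)
```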

In [25]:
optimizer = AdamW(optimizer_grouped_parameters, lr=learning_rate)
loss_fn = nn.CrossEntropyLoss()

In [26]:
t_total = len(train_dataloader) * num_epochs
warmup_step = int(t_total * warmup_ratio)

In [27]:
scheduler = WarmupLinearSchedule(optimizer, warmup_steps=warmup_step, t_total=t_total)
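To see what a linear warmup followed by linear decay does to the learning rate, the sketch below computes the multiplier by hand. It assumes, for illustration, 258 batches per epoch; `lr_factor` is a hypothetical helper written for this example, not the library's code:

```python
num_batches = 258      # assumed number of batches per epoch
num_epochs = 5
warmup_ratio = 0.1
learning_rate = 5e-5

t_total = num_batches * num_epochs
warmup_step = int(t_total * warmup_ratio)

def lr_factor(step):
    """Multiplier applied to the base learning rate at a given optimizer step."""
    if step < warmup_step:
        return step / max(1, warmup_step)                              # linear warmup from 0 to 1
    return max(0.0, (t_total - step) / max(1, t_total - warmup_step))  # linear decay to 0

for step in (0, 64, warmup_step, t_total // 2, t_total):
    print(step, learning_rate * lr_factor(step))
```

Newer `transformers` releases provide this schedule as `get_linear_schedule_with_warmup`, so the `WarmupLinearSchedule` import is tied to the pinned 2.1.1 version.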

In [28]:
def calc_accuracy(X,Y):
    max_vals, max_indices = torch.max(X, 1)
    train_acc = (max_indices == Y).sum().data.cpu().numpy()/max_indices.size()[0]
    return train_acc
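`calc_accuracy` takes the arg-max over the class dimension and compares it with the labels. The same computation on plain lists, with made-up logits for three sentences:

```python
def accuracy(logits, labels):
    """Fraction of rows whose highest-scoring class equals the label."""
    predictions = [max(range(len(row)), key=row.__getitem__) for row in logits]
    correct = sum(int(p == y) for p, y in zip(predictions, labels))
    return correct / len(labels)

# Hypothetical scores over the five emotion classes.
logits = [
    [0.1, 0.2, 0.1, 0.3, 2.5],   # highest score at class 4
    [1.9, 0.4, 0.2, 0.1, 0.3],   # highest score at class 0
    [0.2, 0.3, 1.1, 0.9, 0.1],   # highest score at class 2
]
labels = [4, 0, 3]
print(accuracy(logits, labels))
```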

In [29]:
for e in range(num_epochs):
    train_acc = 0.0
    test_acc = 0.0
    model.train()
    for batch_id, (token_ids, valid_length, segment_ids, label) in enumerate(tqdm_notebook(train_dataloader)):
        optimizer.zero_grad()
        token_ids = token_ids.long().to(device)
        segment_ids = segment_ids.long().to(device)
        valid_length= valid_length
        label = label.long().to(device)
        out = model(token_ids, valid_length, segment_ids)
        loss = loss_fn(out, label)
        loss.backward()
        torch.nn.utils.clip_grad_norm_(model.parameters(), max_grad_norm)
        optimizer.step()
        scheduler.step()  # Update learning rate schedule
        train_acc += calc_accuracy(out, label)
        if batch_id % log_interval == 0:
            print("epoch {} batch id {} loss {} train acc {}".format(e+1, batch_id+1, loss.data.cpu().numpy(), train_acc / (batch_id+1)))
    print("epoch {} train acc {}".format(e+1, train_acc / (batch_id+1)))
    model.eval()
    with torch.no_grad():  # no gradients are needed for evaluation
        for batch_id, (token_ids, valid_length, segment_ids, label) in enumerate(tqdm_notebook(test_dataloader)):
            token_ids = token_ids.long().to(device)
            segment_ids = segment_ids.long().to(device)
            label = label.long().to(device)
            out = model(token_ids, valid_length, segment_ids)
            test_acc += calc_accuracy(out, label)
    print("epoch {} test acc {}".format(e+1, test_acc / (batch_id+1)))

epoch 1 batch id 1 loss 1.6061840057373047 train acc 0.234375
epoch 1 batch id 201 loss 0.18748632073402405 train acc 0.7481343283582089
epoch 1 train acc 0.797425563777308
epoch 1 test acc 0.969467336180435

epoch 2 batch id 1 loss 0.2953975796699524 train acc 0.90625
epoch 2 batch id 201 loss 0.07292285561561584 train acc 0.9736473880597015
epoch 2 train acc 0.9761385658914729
epoch 2 test acc 0.9783464566929134

epoch 3 batch id 1 loss 0.23736630380153656 train acc 0.9375
epoch 3 batch id 201 loss 0.019408833235502243 train acc 0.9860851990049752
epoch 3 train acc 0.9874636627906976
epoch 3 test acc 0.9812992125984252

epoch 4 batch id 1 loss 0.0992492064833641 train acc 0.984375
epoch 4 batch id 201 loss 0.005658421665430069 train acc 0.9934701492537313
epoch 4 train acc 0.9939437984496124
epoch 4 test acc 0.9825295275590551

epoch 5 batch id 1 loss 0.09586261957883835 train acc 0.984375
epoch 5 batch id 201 loss 0.008562367409467697 train acc 0.9964241293532339
epoch 5 train acc 0.9967296511627907
epoch 5 test acc 0.9833907480314961
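The reported train and test accuracies are means of per-batch accuracies, which weight a smaller final batch the same as a full one. If an exact per-sample figure is needed, one option is to accumulate correct counts and sample counts instead. A sketch with made-up batch results:

```python
# (correct predictions, batch size) per batch; the last batch is smaller. Hypothetical numbers.
batches = [(62, 64), (63, 64), (5, 10)]

mean_of_batch_acc = sum(c / n for c, n in batches) / len(batches)
per_sample_acc = sum(c for c, _ in batches) / sum(n for _, n in batches)

print(round(mean_of_batch_acc, 4))
print(round(per_sample_acc, 4))
```

With many full batches and a single short one the two figures are usually close, so this matters mainly for small datasets.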


## Saving the model

In [30]:
torch.save(model, "/content/first_model.pt")



## Loading the model
+ A previously saved model can be loaded and used directly for inference, without going through training again.

In [39]:
model = torch.load("/content/first_model.pt")


## Model inference

In [80]:
x=input("Enter a sentence: ")

Enter a sentence: 아 빡쳐


In [81]:
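# Write the sentence to a TSV file so it goes through the same pipeline as the training data.
sample = pd.DataFrame({'sentence': [x]})
sample.to_csv('sample2.txt', sep="\t")
# Column 0 is the row index (used here as a dummy label), column 1 is the sentence.
dt = nlp.data.TSVDataset("./sample2.txt", field_indices=[0,1], num_discard_samples=1)
datatest = BERTDataset(dt, 1, 0, tok, max_len, True, False)
test_dataloader = torch.utils.data.DataLoader(datatest, batch_size=batch_size, num_workers=5)
# There is a single batch; keep its tensors for the prediction cell.
for (token_ids, valid_length, segment_ids, label) in test_dataloader:
    t = token_ids.long().to(device)
    s = segment_ids.long().to(device)
    v = valid_length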





In [82]:
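model.eval()  # make sure dropout is disabled for inference
with torch.no_grad():
    output = model(t, v, s).argmax()
index = int(output.cpu().numpy())

prefix = "Your current emotion is"  # renamed from `str`, which shadowed the built-in
if index==0:
    print(prefix,"joy.")
elif index==1:
    print(prefix,"sadness.")
elif index==2:
    print(prefix,"depression.")
elif index==3:
    print(prefix,"excitement.")
elif index==4:
    print(prefix,"anger.")


Your current emotion is anger.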
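The `if`/`elif` chain can also be written as a lookup table. The index-to-emotion mapping below is taken from the branches above, with the emotion names given in English; the `describe` helper is hypothetical:

```python
# Class index -> emotion name, in the order used by the branches above.
EMOTIONS = {0: "joy", 1: "sadness", 2: "depression", 3: "excitement", 4: "anger"}

def describe(index):
    """Format the predicted class index as a sentence."""
    return "Your current emotion is {}.".format(EMOTIONS[index])

print(describe(4))
```

A table like this keeps the label order in one place, which helps if the dataset labels are ever renumbered.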
