<a href="https://colab.research.google.com/github/devyulbae/AIClass/blob/main/Project_0)_Naver_Review_GPT_pre_trained.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Sentiment Classification

### Task
* 네이버에서 영화평을 가지고 positive/negative인지 구분해보자.
* 데이터 불러오기를 제외한 딥러닝 트레이닝 과정을 직접 구현해보는 것이 목표 입니다.

### Dataset
* [Naver sentiment movie corpus v1.0](https://github.com/e9t/nsmc/)

### Base code
* Dataset: train, val, test로 split
* Input data shape: (`batch_size`, `max_sequence_length`)
* Output data shape: (`batch_size`, 1)
* Training
* Evaluation

### Try some techniques
* Training-epochs 조절
* Change model architectures (Custom model)
  * Use another cells (LSTM, GRU, etc.)
  * Use dropout layers
* Embedding size 조절
  * 또는 one-hot vector로 학습
* Number of words in the vocabulary 변화
* `pad` 옵션 변화
* Data augmentation (if possible)

## Import modules

In [80]:
from google.colab import drive
drive.mount('/content/drive')

Drive already mounted at /content/drive; to attempt to forcibly remount, call drive.mount("/content/drive", force_remount=True).


In [114]:
!pip install sentencepiece
!pip install konlpy

Collecting konlpy
  Downloading konlpy-0.6.0-py2.py3-none-any.whl (19.4 MB)
[2K     [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m19.4/19.4 MB[0m [31m46.0 MB/s[0m eta [36m0:00:00[0m
[?25hCollecting JPype1>=0.7.0 (from konlpy)
  Downloading JPype1-1.5.0-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (488 kB)
[2K     [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m488.6/488.6 kB[0m [31m49.2 MB/s[0m eta [36m0:00:00[0m
Installing collected packages: JPype1, konlpy
Successfully installed JPype1-1.5.0 konlpy-0.6.0


In [82]:
from __future__ import absolute_import
from __future__ import division
from __future__ import print_function
from __future__ import unicode_literals

import os
import time
import shutil
import tarfile

import numpy as np
import matplotlib.pyplot as plt
%matplotlib inline
from IPython.display import clear_output
import urllib.request

import pandas as pd

import tensorflow as tf
from tensorflow.keras.preprocessing.sequence import pad_sequences

import sentencepiece as spm

from collections import Counter, defaultdict


## Load Data

* ratings_train.txt: 훈련용으로 사용되는 15만 개의 리뷰
* ratings_test.txt: 테스트용으로 보류된 5만 개의 리뷰
* 모든 리뷰는 140자 이내입니다
* 각 감정 클래스는 동등하게 샘플링되었습니다 (즉, 무작위 추측은 50%의 정확도를 보입니다)
* 10만 개의 부정적 리뷰 (원래 1-4점의 리뷰)
* 10만 개의 긍정적 리뷰 (원래 9-10점의 리뷰)
* 중립적 리뷰 (원래 5-8점의 리뷰)는 제외되었습니다


In [83]:
urllib.request.urlretrieve("https://raw.githubusercontent.com/e9t/nsmc/master/ratings_train.txt", filename="ratings_train.txt")
urllib.request.urlretrieve("https://raw.githubusercontent.com/e9t/nsmc/master/ratings_test.txt", filename="ratings_test.txt")

('ratings_test.txt', <http.client.HTTPMessage at 0x7edb7b203ac0>)

In [84]:
train_data = pd.read_table('ratings_train.txt')
train_data = train_data.dropna()
test_data = pd.read_table('ratings_test.txt')
test_data = test_data.dropna()

In [85]:
train_data.head()

Unnamed: 0,id,document,label
0,9976970,아 더빙.. 진짜 짜증나네요 목소리,0
1,3819312,흠...포스터보고 초딩영화줄....오버연기조차 가볍지 않구나,1
2,10265843,너무재밓었다그래서보는것을추천한다,0
3,9045019,교도소 이야기구먼 ..솔직히 재미는 없다..평점 조정,0
4,6483659,사이몬페그의 익살스런 연기가 돋보였던 영화!스파이더맨에서 늙어보이기만 했던 커스틴 ...,1


In [86]:
test_data.head()

Unnamed: 0,id,document,label
0,6270596,굳 ㅋ,1
1,9274899,GDNTOPCLASSINTHECLUB,0
2,8544678,뭐야 이 평점들은.... 나쁘진 않지만 10점 짜리는 더더욱 아니잖아,0
3,6825595,지루하지는 않은데 완전 막장임... 돈주고 보기에는....,0
4,6723715,3D만 아니었어도 별 다섯 개 줬을텐데.. 왜 3D로 나와서 제 심기를 불편하게 하죠??,0


### Tokenizing


In [115]:
# sp = spm.SentencePieceProcessor()
# sp.load('/content/drive/MyDrive/datas/naver_review/naver_review.model')  # 모델 경로 설정


from konlpy.tag import Hannanum
hn = Hannanum()

In [116]:
for i, (line) in enumerate(train_data['document']):
    print(line)
    print(hn.morphs(line))
    print(hn.morphs(line))
    if i == 5:
        break

아 더빙.. 진짜 짜증나네요 목소리
['아', '더빙', '..', '진짜', '짜증나', '네', '요', '목소리']
['아', '더빙', '..', '진짜', '짜증나', '네', '요', '목소리']
흠...포스터보고 초딩영화줄....오버연기조차 가볍지 않구나
['흠', '.', '..', '포스터보고', '초딩영화줄', '....', '오버연기', '조차', '가볍', '지', '않', '구나']
['흠', '.', '..', '포스터보고', '초딩영화줄', '....', '오버연기', '조차', '가볍', '지', '않', '구나']
너무재밓었다그래서보는것을추천한다
['너무재밓었다그래서보는것을추천한다']
['너무재밓었다그래서보는것을추천한다']
교도소 이야기구먼 ..솔직히 재미는 없다..평점 조정
['교도소', '이야기구먼', '..', '솔직히', '재미', '는', '없', '다', '..', '평점', '조정']
['교도소', '이야기구먼', '..', '솔직히', '재미', '는', '없', '다', '..', '평점', '조정']
사이몬페그의 익살스런 연기가 돋보였던 영화!스파이더맨에서 늙어보이기만 했던 커스틴 던스트가 너무나도 이뻐보였다
['사이몬페그', '의', '익살', '스런', '연기', '가', '돋보이', '었던', '영화', '!', '스파이더맨', '에서', '늙', '어', '보이', '기', '만', '하', '었던', '커스틴', '던스트', '가', '너무나', '도', '이쁘', '어', '보이', '었다']
['사이몬페그', '의', '익살', '스런', '연기', '가', '돋보이', '었던', '영화', '!', '스파이더맨', '에서', '늙', '어', '보이', '기', '만', '하', '었던', '커스틴', '던스트', '가', '너무나', '도', '이쁘', '어', '보이', '었다']
막 걸음마 뗀 3세부터 초등학교 1학년생인 8살용영화.ㅋㅋㅋ...별반개도 아까움.
['막', '걸음마', '떼', 

In [89]:
# eos_token = '[BOS]'
# eos_id = sp.piece_to_id(eos_token)

# print(f"토큰 '{eos_token}'의 ID: {eos_id}")

토큰 '[BOS]'의 ID: 2


In [90]:
# sp.encode_as_ids(['[EOS]'])

[[4379, 7127, 6566, 6866, 7344]]

In [91]:
# BOS_id = sp.piece_to_id('[BOS]')
# EOS_id = sp.piece_to_id('[EOS]')
BOS_id = ##
EOS_id = ##

# 결과 출력
lengths = []
input_train_text, target_train_text = [], []
for line in train_data['document']:
    # Encode line, add BOS at the beginning and EOS at the end
    input_line = [BOS_id] + hn.morphs(line) # TODO
    target_line = hn.morphs(line) + [EOS_id] # TODO
    input_train_text.append(tf.convert_to_tensor(input_line, dtype=tf.int32))
    target_train_text.append(tf.convert_to_tensor(target_line, dtype=tf.int32))
    lengths.append(len(line))

input_test_text, target_test_text = [], []
for line in test_data['document']:
    # Similar process for test data
    input_line = [BOS_id] + hn.morphs(line) # TODO
    target_line = hn.morphs(line) + [EOS_id] # TODO
    input_test_text.append(tf.convert_to_tensor(input_line, dtype=tf.int32))
    target_test_text.append(tf.convert_to_tensor(target_line, dtype=tf.int32))
    lengths.append(len(line))

print(max(lengths))

146


In [92]:
print(len(input_test_text), len(target_train_text))

49997 149995


In [93]:
print(input_test_text[0], target_train_text[0])

tf.Tensor([   2 1293  558    3], shape=(4,), dtype=int32) tf.Tensor([  14 1226    7   88 2990   55 2393    3], shape=(8,), dtype=int32)


### Padding and truncating data using pad sequences
* 전부 길이가 다른 리뷰들의 길이를 통일해주자

In [94]:
batch_size = 16
max_seq_length = 150 # TODO # max length

In [95]:
input_train_data_pad = pad_sequences(input_train_text, padding='post', maxlen=max_seq_length,value=0)
target_train_data_pad = pad_sequences(target_train_text, padding='post', maxlen=max_seq_length,value=0)
input_test_data_pad = pad_sequences(input_test_text, padding='post', maxlen=max_seq_length,value=0)
target_test_data_pad = pad_sequences(target_test_text, padding='post', maxlen=max_seq_length,value=0)

print(input_train_data_pad.shape, target_train_data_pad.shape)

(149995, 150) (149995, 150)


### Dataset 구성

In [96]:
# for train
train_dataset = tf.data.Dataset.from_tensor_slices((input_train_data_pad, target_train_data_pad))
train_dataset = train_dataset.shuffle(10000).repeat().batch(batch_size=batch_size)
print(train_dataset)

# for test
test_dataset = tf.data.Dataset.from_tensor_slices((input_test_data_pad, target_test_data_pad))
test_dataset = test_dataset.batch(batch_size=batch_size)
print(test_dataset)


<_BatchDataset element_spec=(TensorSpec(shape=(None, 150), dtype=tf.int32, name=None), TensorSpec(shape=(None, 150), dtype=tf.int32, name=None))>
<_BatchDataset element_spec=(TensorSpec(shape=(None, 150), dtype=tf.int32, name=None), TensorSpec(shape=(None, 150), dtype=tf.int32, name=None))>


## Build the model


## Setup hyper-parameters

In [97]:
kargs = {'model_name': 'GPT',
         'num_layers': 12,
         'd_model': 256,
         'num_heads': 8,
         'dff': 256 * 4,
         'input_vocab_size': sp.get_piece_size(),
         'target_vocab_size': sp.get_piece_size(),
         'maximum_position_encoding': max_seq_length,
         'segment_encoding': 2,
         'end_token_idx': sp.piece_to_id('[EOS]'),
         'rate': 0.1
        }

In [98]:
def get_angles(pos, i, d_model):
    angle_rates = 1 / np.power(10000, (2 * i//2) / np.float32(d_model))
    return pos * angle_rates

In [99]:
def positional_encoding(position, d_model):
    angle_rads = get_angles(np.arange(position)[:, np.newaxis],
                          np.arange(d_model)[np.newaxis, :],
                          d_model)

    # apply sin to even indices in the array; 2i
    angle_rads[:, 0::2] = np.sin(angle_rads[:, 0::2])

    # apply cos to odd indices in the array; 2i+1
    angle_rads[:, 1::2] = np.cos(angle_rads[:, 1::2])

    pos_encoding = angle_rads[np.newaxis, ...]

    return tf.cast(pos_encoding, dtype=tf.float32)

In [100]:
def scaled_dot_product_attention(q, k, v, mask):
    """Calculate the attention weights.
    q, k, v must have matching leading dimensions.
    k, v must have matching penultimate dimension, i.e.: seq_len_k = seq_len_v.
    The mask has different shapes depending on its type(padding or look ahead)
    but it must be broadcastable for addition.

    Args:
    q: query shape == (..., seq_len_q, depth)
    k: key shape == (..., seq_len_k, depth)
    v: value shape == (..., seq_len_v, depth_v)
    mask: Float tensor with shape broadcastable
          to (..., seq_len_q, seq_len_k). Defaults to None.

    Returns:
    output, attention_weights
    """

    matmul_qk = tf.matmul(q, k, transpose_b=True)  # (..., seq_len_q, seq_len_k)

    # scale matmul_qk
    dk = tf.cast(tf.shape(k)[-1], tf.float32)
    scaled_attention_logits = matmul_qk / tf.math.sqrt(dk)

    # add the mask to the scaled tensor.
    if mask is not None:
        scaled_attention_logits += (mask * -1e9)

    # softmax is normalized on the last axis (seq_len_k) so that the scores
    # add up to 1.
    attention_weights = tf.nn.softmax(scaled_attention_logits, axis=-1)  # (..., seq_len_q, seq_len_k)
    output = tf.matmul(attention_weights, v)  # (..., seq_len_q, depth_v)

    return output, attention_weights

In [101]:
class MultiHeadAttention(tf.keras.layers.Layer):
    def __init__(self, **kargs):
        super(MultiHeadAttention, self).__init__()
        self.num_heads = kargs['num_heads']
        self.d_model = kargs['d_model']

        assert self.d_model % self.num_heads == 0

        self.depth = self.d_model // self.num_heads

        self.wq = tf.keras.layers.Dense(kargs['d_model'])
        self.wk = tf.keras.layers.Dense(kargs['d_model'])
        self.wv = tf.keras.layers.Dense(kargs['d_model'])

        self.dense = tf.keras.layers.Dense(kargs['d_model'])

    def split_heads(self, x, batch_size):
        """Split the last dimension into (num_heads, depth).
        Transpose the result such that the shape is (batch_size, num_heads, seq_len, depth)
        """
        x = tf.reshape(x, (batch_size, -1, self.num_heads, self.depth))
        return tf.transpose(x, perm=[0, 2, 1, 3])

    def call(self, v, k, q, mask):
        batch_size = tf.shape(q)[0]

        q = self.wq(q)  # (batch_size, seq_len, d_model)
        k = self.wk(k)  # (batch_size, seq_len, d_model)
        v = self.wv(v)  # (batch_size, seq_len, d_model)

        q = self.split_heads(q, batch_size)  # (batch_size, num_heads, seq_len_q, depth)
        k = self.split_heads(k, batch_size)  # (batch_size, num_heads, seq_len_k, depth)
        v = self.split_heads(v, batch_size)  # (batch_size, num_heads, seq_len_v, depth)

        # scaled_attention.shape == (batch_size, num_heads, seq_len_q, depth)
        # attention_weights.shape == (batch_size, num_heads, seq_len_q, seq_len_k)
        scaled_attention, attention_weights = scaled_dot_product_attention(
            q, k, v, mask)

        scaled_attention = tf.transpose(scaled_attention, perm=[0, 2, 1, 3])  # (batch_size, seq_len_q, num_heads, depth)

        concat_attention = tf.reshape(scaled_attention,
                                      (batch_size, -1, self.d_model))  # (batch_size, seq_len_q, d_model)

        output = self.dense(concat_attention)  # (batch_size, seq_len_q, d_model)

        return output, attention_weights

In [102]:
def point_wise_feed_forward_network(**kargs):
    return tf.keras.Sequential([
            tf.keras.layers.Conv1D(kargs['dff'], 1, activation='relu'),  # (batch_size, seq_len, dff)
            tf.keras.layers.Conv1D(kargs['d_model'], 1)  # (batch_size, seq_len, d_model)
        ])

In [103]:
class DecoderLayer(tf.keras.layers.Layer):
    def __init__(self, **kargs):
        super(DecoderLayer, self).__init__()

        self.mha = MultiHeadAttention(**kargs)

        self.ffn = point_wise_feed_forward_network(**kargs)

        self.layernorm1 = tf.keras.layers.LayerNormalization(epsilon=1e-6)
        self.layernorm2 = tf.keras.layers.LayerNormalization(epsilon=1e-6)
        self.layernorm3 = tf.keras.layers.LayerNormalization(epsilon=1e-6)

        self.dropout1 = tf.keras.layers.Dropout(kargs['rate'])
        self.dropout2 = tf.keras.layers.Dropout(kargs['rate'])
        self.dropout3 = tf.keras.layers.Dropout(kargs['rate'])


    def call(self, x, look_ahead_mask, padding_mask):
        attn1, attn_weights_block1 = self.mha(x, x, x, look_ahead_mask)  # (batch_size, target_seq_len, d_model)
        attn1 = self.dropout1(attn1)
        out1 = self.layernorm1(attn1 + x)

        ffn_output = self.ffn(out1)  # (batch_size, target_seq_len, d_model)
        ffn_output = self.dropout3(ffn_output)
        out2 = self.layernorm3(ffn_output + out1)  # (batch_size, target_seq_len, d_model)

        return out2, attn_weights_block1

In [104]:
class Decoder(tf.keras.layers.Layer):
    def __init__(self, **kargs):
        super(Decoder, self).__init__()

        self.d_model = kargs['d_model']
        self.num_layers = kargs['num_layers']

        self.embedding = tf.keras.layers.Embedding(kargs['target_vocab_size'], self.d_model)
        self.pos_encoding = positional_encoding(kargs['maximum_position_encoding'], self.d_model)

        self.dec_layers = [DecoderLayer(**kargs)
                           for _ in range(self.num_layers)]
        self.dropout = tf.keras.layers.Dropout(kargs['rate'])

    def call(self, x, look_ahead_mask, padding_mask):
        seq_len = tf.shape(x)[1]
        attention_weights = {}

        x = self.embedding(x)  # (batch_size, target_seq_len, d_model)
        x *= tf.math.sqrt(tf.cast(self.d_model, tf.float32))
        x += self.pos_encoding[:, :seq_len, :]

        x = self.dropout(x)

        for i in range(self.num_layers):
            x, block1 = self.dec_layers[i](x, look_ahead_mask, padding_mask)

            attention_weights['decoder_layer{}_block1'.format(i+1)] = block1

        # x.shape == (batch_size, target_seq_len, d_model)
        return x, attention_weights

In [105]:
def create_padding_mask(seq):
    seq = tf.cast(tf.math.equal(seq, 0), tf.float32)

    # add extra dimensions to add the padding
    # to the attention logits.
    return seq[:, tf.newaxis, tf.newaxis, :]  # (batch_size, 1, 1, seq_len)

def create_look_ahead_mask(size):
        mask = 1 - tf.linalg.band_part(tf.ones((size, size)), -1, 0)
        return mask  # (seq_len, seq_len)

def create_masks(input):
    # Used in the 2nd attention block in the decoder.
    # This padding mask is used to mask the encoder outputs.
    dec_padding_mask = create_padding_mask(input)

    # Used in the 1st attention block in the decoder.
    # It is used to pad and mask future tokens in the input received by
    # the decoder.
    look_ahead_mask = create_look_ahead_mask(tf.shape(input)[1])
    dec_target_padding_mask = create_padding_mask(input)
    look_ahead_mask = tf.maximum(dec_target_padding_mask, look_ahead_mask)

    return look_ahead_mask, dec_padding_mask

In [106]:
class GPT(tf.keras.Model):
    def __init__(self, **kargs):
        super(GPT, self).__init__(name=kargs['model_name'])
        self.end_token_idx = kargs['end_token_idx']
        self.decoder = Decoder(**kargs)
        self.outputs_layer = tf.keras.layers.Dense(kargs['d_model'],
                                                   activation='gelu')

        self.final_layer = tf.keras.layers.Dense(2)

    def call(self, x):
        look_ahead_mask, mask = create_masks(x)

        dec_output, attn = self.decoder(x, look_ahead_mask, mask)  # (batch_size, inp_seq_len, d_model)
        dec_output = self.outputs_layer(dec_output)  # (batch_size, inp_seq_len, d_model)
        final_output = self.final_layer(dec_output)  # (batch_size, tar_vocab_size)

        return final_output

In [107]:
model = GPT(**kargs)

## Train the model

In [108]:
loss_object = tf.keras.losses.SparseCategoricalCrossentropy(
    from_logits=True, reduction='none')

train_accuracy = tf.keras.metrics.SparseCategoricalAccuracy(name='accuracy')

def loss(real, pred):
    mask = tf.math.logical_not(tf.math.equal(real, 0))
    loss_ = loss_object(real, pred)

    mask = tf.cast(mask, dtype=loss_.dtype)
    loss_ *= mask

    return tf.reduce_mean(loss_)

def accuracy(real, pred):
    mask = tf.math.logical_not(tf.math.equal(real, 0))
    mask = tf.expand_dims(tf.cast(mask, dtype=pred.dtype), axis=-1)
    pred *= mask
    acc = train_accuracy(real, pred)

    return tf.reduce_mean(acc)

In [109]:
model.compile(optimizer=tf.keras.optimizers.Adam(1e-5),
              loss=loss,
              metrics=[accuracy])

In [110]:
early_stopping_cb = tf.keras.callbacks.EarlyStopping(patience=10,
                                                     monitor='val_loss',
                                                     restore_best_weights=True,
                                                     verbose=1)

In [111]:
history = model.fit(train_dataset,
                    steps_per_epoch = len(input_train_data_pad) // batch_size,
                    epochs = 2,
                    validation_data = test_dataset,
                    validation_steps = len(input_test_data_pad) // batch_size,
                    verbose = 1
                    )


Epoch 1/2
Epoch 2/2


## Test the model

In [112]:
results = model.evaluate(test_dataset)
# loss
print("loss value: {:.3f}".format(results[0]))
# accuracy
print("accuracy value: {:.3f}".format(results[1]))

loss value: nan
accuracy value: 0.880
