<font size=6><b>KLUE-BERT-based News Sentiment Classification Model (Test)</b></font>

In [1]:
import os
import pandas as pd
import numpy as np
import re
from tqdm import tqdm
import urllib.request
import seaborn as sns
import matplotlib.pyplot as plt
import tensorflow_addons as tfa
import tensorflow as tf
from tensorflow.keras.callbacks import EarlyStopping, ModelCheckpoint
from transformers import BertTokenizer, TFBertForSequenceClassification
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score, f1_score, precision_score, recall_score, \
                            roc_auc_score, confusion_matrix, classification_report, \
                            matthews_corrcoef, cohen_kappa_score, log_loss

import warnings
warnings.filterwarnings(action='ignore')


TensorFlow Addons (TFA) has ended development and introduction of new features.
TFA has entered a minimal maintenance and release mode until a planned end of life in May 2024.
Please modify downstream libraries to take dependencies from other repositories in our TensorFlow community (e.g. Keras, Keras-CV, and Keras-NLP). 

For more information see: https://github.com/tensorflow/addons/issues/2807 



# Load the language model and tokenizer
- Load the pretrained language model and its tokenizer from Hugging Face.
- This notebook uses the [KLUE BERT-base model](https://huggingface.co/klue/bert-base).

In [2]:
MODEL_NAME = "klue/bert-base"
model = TFBertForSequenceClassification.from_pretrained(MODEL_NAME, num_labels=3, from_pt=True)
tokenizer = BertTokenizer.from_pretrained(MODEL_NAME)

Some weights of the PyTorch model were not used when initializing the TF 2.0 model TFBertForSequenceClassification: ['bert.embeddings.position_ids']
- This IS expected if you are initializing TFBertForSequenceClassification from a PyTorch model trained on another task or with another architecture (e.g. initializing a TFBertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing TFBertForSequenceClassification from a PyTorch model that you expect to be exactly identical (e.g. initializing a TFBertForSequenceClassification model from a BertForSequenceClassification model).
Some weights or buffers of the TF 2.0 model TFBertForSequenceClassification were not initialized from the PyTorch model and are newly initialized: ['classifier.weight', 'classifier.bias']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


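# Check that the GPU is available
- Training and inference with this model are compute-heavy and very slow on a CPU, so running the notebook in a GPU environment such as Colab is recommended.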

In [3]:
device_name = tf.test.gpu_device_name()
if device_name == '/device:GPU:0':
    print("GPU is available")
    mirrored_strategy = tf.distribute.MirroredStrategy()
else:
    print("GPU is not available")

GPU is not available


# Load the economic news test data

In [4]:
newsdf = pd.read_csv('./datasets/test_news.csv')
newsdf.head()

,key_rdate,key_title
0,2019-12-13 15:34:00,(마감)코스피 기관 4775억 매수 우위
1,2019-12-16 07:16:00,"""제한된 미중 '1단계' 무역합의…글로벌 경기 회복 지연 경계"""
2,2019-12-16 15:47:00,"위안·달러 환율 떨어지면…""외국인 매수 신호"""
3,2019-12-17 07:59:00,"펩트론, JP모건 헬스케어 컨퍼런스 초청받아…빅파마와 미팅"
4,2019-12-17 17:44:00,"잘나갈 때 무리했나…여천NCC, 부메랑 된 설비투자"


In [5]:
newsdf = newsdf.set_index('key_rdate')
newsdf.head()

key_rdate,key_title
2019-12-13 15:34:00,(마감)코스피 기관 4775억 매수 우위
2019-12-16 07:16:00,"""제한된 미중 '1단계' 무역합의…글로벌 경기 회복 지연 경계"""
2019-12-16 15:47:00,"위안·달러 환율 떨어지면…""외국인 매수 신호"""
2019-12-17 07:59:00,"펩트론, JP모건 헬스케어 컨퍼런스 초청받아…빅파마와 미팅"
2019-12-17 17:44:00,"잘나갈 때 무리했나…여천NCC, 부메랑 된 설비투자"


## Convert the data to the BERT input format

In [6]:
# Maximum input (sentence) length in tokens
MAX_SEQ_LEN = 64
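
To make the effect of a fixed length concrete, here is a minimal, dependency-free sketch of what truncating and padding to a maximum length roughly does. The helper `pad_or_truncate` and the toy ids (2 for `[CLS]`, 3 for `[SEP]`, 0 for padding) are assumptions for illustration only; the notebook itself delegates this work to `tokenizer.encode`.

```python
def pad_or_truncate(token_ids, max_len, pad_id=0, sep_id=3):
    """Force a list of token ids to exactly max_len entries.

    Long inputs are cut and re-terminated with sep_id;
    short inputs are right-padded with pad_id.
    """
    if len(token_ids) > max_len:
        return token_ids[:max_len - 1] + [sep_id]
    return token_ids + [pad_id] * (max_len - len(token_ids))


# Hypothetical ids: 2 stands for [CLS], 3 for [SEP], 0 for [PAD].
short_ids = [2, 11, 12, 3]
long_ids = [2, 11, 12, 13, 14, 15, 16, 3]

padded = pad_or_truncate(short_ids, max_len=6)
truncated = pad_or_truncate(long_ids, max_len=6)
print(padded)
print(truncated)
```

Either way the result has exactly `max_len` entries, which is what lets all headlines be stacked into a single array later.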

In [7]:
def convert_data(X_data, y_data):
    # Lists that collect the BERT inputs (token, mask, segment) and the targets
    tokens, masks, segments, targets = [], [], [], []

    for X, y in tqdm(zip(X_data, y_data)):
        # token: tokenize the input sentence, truncated or padded to MAX_SEQ_LEN
        token = tokenizer.encode(X, truncation = True, padding = 'max_length', max_length = MAX_SEQ_LEN)

        # mask: 1 for real tokens, 0 for padding
        # (padding is appended on the right, so counting the pad ids is enough)
        num_zeros = token.count(tokenizer.pad_token_id)
        mask = [1] * (MAX_SEQ_LEN - num_zeros) + [0] * num_zeros

        # segment: sentence A/B indicator; each input is a single sentence, so all zeros
        segment = [0]*MAX_SEQ_LEN

        tokens.append(token)
        masks.append(mask)
        segments.append(segment)
        targets.append(y)

    # Convert to numpy arrays
    tokens = np.array(tokens)
    masks = np.array(masks)
    segments = np.array(segments)
    targets = np.array(targets)

    return [tokens, masks, segments], targets
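
The mask logic in `convert_data` counts pad ids and assumes they all sit at the end of the sequence. As a minimal sketch of the same idea, the hypothetical helper `build_attention_mask` below marks each position by comparing its id with the pad id. The token ids are made up for illustration and are not real KLUE vocabulary entries.

```python
def build_attention_mask(token_ids, pad_id=0):
    """Return 1 for real tokens and 0 for padding positions."""
    return [0 if tok == pad_id else 1 for tok in token_ids]


# Toy ids for illustration only; 0 is assumed to be the pad id.
toy_ids = [2, 1510, 2088, 3, 0, 0, 0, 0]
mask = build_attention_mask(toy_ids)
segment = [0] * len(toy_ids)  # single-sentence input, so every segment id is 0
print(mask)
```

Hugging Face tokenizers can also return this information directly: calling the tokenizer as `tokenizer(text, ...)` yields `input_ids`, `token_type_ids`, and `attention_mask`, which may be a simpler route than rebuilding the mask by hand.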

In [8]:
newsdf['label'] = 0
test_x, _ = convert_data(newsdf['key_title'], newsdf['label'])

19477it [00:04, 4001.27it/s]


# Load the sentiment classification model

In [9]:
# Define the token, mask, and segment inputs
token_inputs = tf.keras.layers.Input((MAX_SEQ_LEN,), dtype = tf.int32, name = 'input_word_ids')
mask_inputs = tf.keras.layers.Input((MAX_SEQ_LEN,), dtype = tf.int32, name = 'input_masks')
segment_inputs = tf.keras.layers.Input((MAX_SEQ_LEN,), dtype = tf.int32, name = 'input_segment')
bert_outputs = model([token_inputs, mask_inputs, segment_inputs])

In [10]:
bert_output = bert_outputs[0]

In [11]:
DROPOUT_RATE = 0.5
NUM_CLASS = 3
dropout = tf.keras.layers.Dropout(DROPOUT_RATE)(bert_output)

# This is a multi-class classification problem, so the activation function is softmax
sentiment_layer = tf.keras.layers.Dense(NUM_CLASS, activation='softmax', kernel_initializer = tf.keras.initializers.TruncatedNormal(stddev=0.02))(dropout)
sentiment_model = tf.keras.Model([token_inputs, mask_inputs, segment_inputs], sentiment_layer)

In [12]:
# Load the saved model
BEST_MODEL_NAME = './model/best_model.h5'
sentiment_model_best = tf.keras.models.load_model(BEST_MODEL_NAME,
                                                  custom_objects={'TFBertForSequenceClassification': TFBertForSequenceClassification})

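# Classify sentiment and save the results
* This file has no ground-truth labels, so metrics such as accuracy and F1 score cannot be computed.
* The predictions are saved to a file so they can be inspected manually.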

In [None]:
# Get the labels predicted by the model
predicted_value = sentiment_model_best.predict(test_x)
predicted_label = np.argmax(predicted_value, axis = 1)
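
Without ground-truth labels, one simple sanity check is to look at how the predicted classes are distributed. The sketch below is dependency-free and uses made-up softmax outputs; `argmax_labels` is a hypothetical helper that mirrors `np.argmax(..., axis = 1)`. Which class index corresponds to which sentiment is not stated in this notebook, so the indices are left unnamed.

```python
from collections import Counter


def argmax_labels(probabilities):
    """Pick the index of the largest probability in each row."""
    return [max(range(len(row)), key=row.__getitem__) for row in probabilities]


# Made-up softmax outputs for three headlines (illustration only).
toy_probs = [
    [0.10, 0.70, 0.20],
    [0.80, 0.15, 0.05],
    [0.25, 0.30, 0.45],
]
toy_labels = argmax_labels(toy_probs)
label_counts = Counter(toy_labels)
print(toy_labels, dict(label_counts))
```

On the real predictions, a distribution that collapses onto a single class would be a hint to recheck the input format or the loaded weights before trusting the saved file.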



In [None]:
newsdf['label'] = predicted_label
newsdf.to_csv('./datasets/news_pred_result.csv')