# Distil-BERT, SentimentText to Aspect
- Uses Distil-BERT, a lightweight version of BERT
- Trains with SentimentText as X and Aspect as y
- The BASE_FOLDER value must be set manually

In [None]:
import os
import tensorflow as tf
import pandas as pd
from transformers import AutoTokenizer, TFDistilBertForSequenceClassification
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import LabelEncoder

In [None]:
BASE_FOLDER = "Set this to the location of the compass folder"
target_data_path = os.path.join(BASE_FOLDER, "data/garments.csv")

In [None]:
base_df = pd.read_csv(target_data_path).dropna()
base_df.head(2)

Unnamed: 0,Index,RawText,Source,Domain,MainCategory,ProductName,ReviewScore,Syllable,Word,RDate,GeneralPolarity,Aspect,SentimentText,SentimentWord,SentimentPolarity
0,128481,아들에게 선물했는데 불편하고 활동하기 안좋다고 잘 안입고 다니네요,쇼핑몰,패션,남성의류,OO 남성 매** 데님 3종,100,37,8,20180626,-1.0,착용감,불편하고,1,-1
1,128481,아들에게 선물했는데 불편하고 활동하기 안좋다고 잘 안입고 다니네요,쇼핑몰,패션,남성의류,OO 남성 매** 데님 3종,100,37,8,20180626,-1.0,착용감,활동하기 안좋다고 잘 안입고 다니네요,5,-1


In [None]:
base_df["Aspect"].value_counts()[:6]

디자인    16177
사이즈    13793
가격     13450
품질     11627
착화감    10602
기능     10280
Name: Aspect, dtype: int64

In [None]:
# 착화감 (how shoes feel when worn) applies only to footwear, so exclude it and keep the top 5 remaining aspects
targets = ["디자인", "사이즈", "가격", "품질", "기능"]
target_df = base_df.loc[base_df["Aspect"].isin(targets), :]
target_df.head(2)

Unnamed: 0,Index,RawText,Source,Domain,MainCategory,ProductName,ReviewScore,Syllable,Word,RDate,GeneralPolarity,Aspect,SentimentText,SentimentWord,SentimentPolarity
4,128484,이번에구매한데님은사이즈가잘맞네요 색상구성도괜찮고맘에든답니다 잘입겠습니다,쇼핑몰,패션,남성의류,OO 남성 매** 데님 3종,100,39,3,20180315,1.0,사이즈,사이즈가잘맞네요,1,1
15,128494,바지는 너무 편하고 좋은데 좀크게나온듯 그리고 허리고리 하나가 안달려서 밑단수선하면...,쇼핑몰,패션,남성의류,OO 남성 매** 데님 3종,60,118,24,20180317,0.0,사이즈,좀크게나온듯,1,-1


In [None]:
# # Merging Aspects into broader categories was abandoned because model training took too long

# def compress_category(x):
#     categories = {
#         "size": ["사이즈", "핏", "두께", "길이", "사이즈/폭/길이/두께", "치수/사이즈", "굽"],
#         "design": ["색상", "디자인"],
#         "quality": ["마감", "품질", "소재", "촉감", "냄새", "내구성"],
#         "usability": ["사용성/편의성", "사용성", "수납", "활용성", "제품구성", "무게", "신축성", "기능", "기능성", "착화감", "착용감"],
#         "price": ["가격"]
#     }

#     for category, detail in categories.items():
#         if x in detail:
#             return category

# compressed_categories = base_df["Aspect"].map(compress_category)
# compressed_categories.value_counts()

In [None]:
label_encoder = LabelEncoder()
enc_data = label_encoder.fit_transform(target_df["Aspect"])
num_labels = len(set(enc_data))

In [None]:
X, y = target_df.loc[:, "SentimentText"].to_list(), enc_data

In [None]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=88)

In [None]:
# # Use KLUE-BERT
# model = TFBertForSequenceClassification.from_pretrained(HUGGING_FACE_PATH, num_labels=num_labels, from_pt=True)
# tokenizer = AutoTokenizer.from_pretrained(HUGGING_FACE_PATH)

In [None]:
# DistilBERT multilingual cased model
HUGGING_FACE_PATH = "distilbert-base-multilingual-cased"
model = TFDistilBertForSequenceClassification.from_pretrained(HUGGING_FACE_PATH, num_labels=num_labels, from_pt=True)
tokenizer = AutoTokenizer.from_pretrained(HUGGING_FACE_PATH)

Some weights of the PyTorch model were not used when initializing the TF 2.0 model TFDistilBertForSequenceClassification: ['vocab_transform.bias', 'vocab_layer_norm.bias', 'vocab_projector.weight', 'vocab_projector.bias', 'vocab_layer_norm.weight', 'vocab_transform.weight']
- This IS expected if you are initializing TFDistilBertForSequenceClassification from a PyTorch model trained on another task or with another architecture (e.g. initializing a TFBertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing TFDistilBertForSequenceClassification from a PyTorch model that you expect to be exactly identical (e.g. initializing a TFBertForSequenceClassification model from a BertForSequenceClassification model).
Some weights or buffers of the TF 2.0 model TFDistilBertForSequenceClassification were not initialized from the PyTorch model and are newly initialized: ['pre_classifier.weight', 'pre_classifier.bias', 'classifier.weight', 'cla

In [None]:
# # Use saved model - load a model stored locally
# MODEL_SAVE_PATH = "Enter the location where the model is saved."
# model = TFDistilBertForSequenceClassification.from_pretrained(MODEL_SAVE_PATH, num_labels=num_labels, local_files_only=True)
# tokenizer = AutoTokenizer.from_pretrained(MODEL_SAVE_PATH, local_files_only=True)

In [None]:
X_train_encoding = tokenizer(X_train, padding=True, truncation=True, max_length=42)

In [None]:
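SHUFFLE_PARAM = 1000

# reshuffle_each_iteration=False keeps one fixed order, so the skip/take split
# used later yields disjoint training and validation sets on every pass.
train_dataset = tf.data.Dataset.from_tensor_slices((
    dict(X_train_encoding),
    y_train
)).shuffle(SHUFFLE_PARAM, reshuffle_each_iteration=False)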

In [None]:
optimizer = tf.keras.optimizers.Adam(learning_rate=5e-5)
# No loss is passed here, so the Hugging Face TF model falls back to its built-in loss.
model.compile(optimizer=optimizer, metrics=["accuracy"])
model.summary()

Model: "tf_distil_bert_for_sequence_classification_1"
_________________________________________________________________
 Layer (type)                Output Shape              Param #   
 distilbert (TFDistilBertMai  multiple                 134734080 
 nLayer)                                                         
                                                                 
 pre_classifier (Dense)      multiple                  590592    
                                                                 
 classifier (Dense)          multiple                  3845      
                                                                 
 dropout_39 (Dropout)        multiple                  0         
                                                                 
Total params: 135,328,517
Trainable params: 135,328,517
Non-trainable params: 0
_________________________________________________________________


In [None]:
BATCH_PARAM = 16

validation_length = len(X_train) // 10  # Use 10% of the training data for validation
train_except_val = train_dataset.skip(validation_length).batch(BATCH_PARAM)
validation_data = train_dataset.take(validation_length).batch(BATCH_PARAM)

In [None]:
# The datasets are already batched, so batch_size is not passed to fit().
model.fit(
    train_except_val,
    epochs=1,
    validation_data=validation_data)



<keras.callbacks.History at 0x7a8894a47250>

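- The accuracy came out at 93%, but since validation was run on SentimentText, a high score is to be expected
- At predict time, the plan is to feed RawText into the model and obtain a probability for each label via softmax
- From the per-label probabilities, either take the top 2-3 labels or treat an Aspect as present when its probability exceeds a set threshold
- Example prediction: [0.47, 0.31, 0.03, 0.08, 0.11]; with a threshold of 0.3, the first two labels can be taken as the Aspects of the RawText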
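
The selection rule described in these notes can be sketched in plain Python. This is a minimal illustration, not the notebook's actual prediction code: `softmax` and `select_aspects` are hypothetical helper names, and the label order is an assumption based on `LabelEncoder` sorting its classes, so it should be checked against `label_encoder.classes_`. The model itself returns logits, which would need to pass through a softmax first.

```python
import math

def softmax(logits):
    """Convert raw scores (logits) into probabilities that sum to 1."""
    peak = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def select_aspects(probs, labels, threshold=None, top_k=None):
    """Return (label, probability) pairs, highest probability first.

    threshold: if given, keep only labels with probability >= threshold.
    top_k: if given, keep at most this many labels.
    """
    ranked = sorted(zip(labels, probs), key=lambda pair: pair[1], reverse=True)
    if threshold is not None:
        ranked = [pair for pair in ranked if pair[1] >= threshold]
    if top_k is not None:
        ranked = ranked[:top_k]
    return ranked

# Assumed label order (sorted class names); verify with label_encoder.classes_.
labels = ["가격", "기능", "디자인", "사이즈", "품질"]
# The example prediction from the notes above.
example_probs = [0.47, 0.31, 0.03, 0.08, 0.11]

by_threshold = select_aspects(example_probs, labels, threshold=0.3)
by_top_k = select_aspects(example_probs, labels, top_k=3)
print(by_threshold)
print(by_top_k)
```

With the 0.3 threshold, only the first two labels survive, matching the example in the notes. The threshold and top-k options can also be combined, for instance to cap the result at three Aspects while still dropping low-probability labels.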