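## Candidate approaches

Approach 1: TF-IDF + machine-learning classifier
Extract features from the text with TF-IDF and feed them to a classifier. Candidate classifiers include SVM, logistic regression (LR), and XGBoost.

Approach 2: FastText
FastText is the entry-level word-vector option; a classifier can be built quickly with the FastText tool released by Facebook.

Approach 3: Word2Vec + deep-learning classifier
Word2Vec is the intermediate word-vector option, paired with a deep-learning classifier. Candidate network architectures include TextCNN, TextRNN, and BiLSTM.

Approach 4: BERT embeddings
BERT is the high-end option, with strong modeling capacity.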
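
To make Approach 1 concrete, the sketch below computes TF-IDF weights from scratch with the standard library. It is an illustration, not scikit-learn's implementation: the function name `tfidf_vectors` and the sample reviews are hypothetical, tokens are assumed to be already segmented and separated by spaces, and the IDF uses the smoothed form `ln((1 + n) / (1 + df)) + 1` followed by L2 normalisation.

```python
import math
from collections import Counter

def tfidf_vectors(docs):
    """Return one {term: weight} dict per document (L2-normalised TF-IDF)."""
    tokenised = [doc.split() for doc in docs]
    n_docs = len(tokenised)
    # Document frequency: number of documents that contain each term.
    df = Counter(term for tokens in tokenised for term in set(tokens))
    # Smoothed inverse document frequency.
    idf = {term: math.log((1 + n_docs) / (1 + count)) + 1
           for term, count in df.items()}
    vectors = []
    for tokens in tokenised:
        tf = Counter(tokens)
        weights = {term: freq * idf[term] for term, freq in tf.items()}
        norm = math.sqrt(sum(w * w for w in weights.values()))
        # An empty document yields an empty vector.
        vectors.append({t: w / norm for t, w in weights.items()} if norm else {})
    return vectors

# Hypothetical, already-segmented reviews.
docs = ["房間 乾淨 舒適", "房間 價格 便宜", "服務 親切 房間"]
vectors = tfidf_vectors(docs)
print(vectors[0])
```

A term that occurs in every document ("房間" here) receives the smallest weight in each vector. One difference worth keeping in mind: this sketch keeps every token, whereas `TfidfVectorizer`'s default `token_pattern` only keeps tokens of two or more word characters, so single-character words are dropped unless that pattern is overridden.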

## Preparing the datasets for the six models

In [25]:
import matplotlib.pyplot as plt
import os
import pandas as pd
import numpy as np
import re
path = '../data/0407/review_data(seg+pos+stopwords)_train_n+v+f+p.csv'
df_train = pd.read_csv(path)
path = '../data/0407/review_data(seg+pos+stopwords)_test_n+v+f+p.csv'
df_test = pd.read_csv(path)

### Checking for duplicates and missing values

In [26]:
# Print duplicated rows
print(df_test[df_test.duplicated()])
print(df_train[df_train.duplicated()])

Empty DataFrame
Columns: [reviews, value, comfort, location, cleanliness, service, facilities, ws_pos_reviews, filtered, filtered_word]
Index: []
Empty DataFrame
Columns: [reviews, value, comfort, location, cleanliness, service, facilities, ws_pos_reviews, filtered, filtered_word]
Index: []


In [27]:
# Remove duplicated rows
#df = df.drop_duplicates()
#print(df.shape)

In [28]:
# Print rows that contain missing values
#df_train[df_train.isnull().T.any()]

### Splitting into six datasets

In [114]:
def split_df(df):
    # Build one two-column frame (label, filtered_word) per review aspect.
    # rename() returns a new frame, so the source DataFrame is left untouched.
    aspects = ['value', 'comfort', 'location', 'cleanliness', 'service', 'facilities']
    return tuple(
        df[[aspect, 'filtered_word']].rename(columns={aspect: 'label'})
        for aspect in aspects
    )

In [115]:
df_value_train, df_comfort_train, df_location_train, df_cleanliness_train, df_service_train ,df_facilities_train = split_df(df_train)
df_value_test, df_comfort_test, df_location_test, df_cleanliness_test, df_service_test, df_facilities_test = split_df(df_test)

### Cleaning the data (removing part-of-speech tags)

In [116]:
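def remove_N_comma(sentence):
    # Strip the trailing POS tags such as (N..), (V..), (F..), (P..)
    sentence = str(sentence)
    # The character class lists only the tag initials; a comma inside the
    # brackets would also match a literal "(," sequence.
    pattern = re.compile(r"\([NVFP].*?\)")  # remove POS tags
    sentence = re.sub(pattern, '', sentence)
    pattern = re.compile(r",")  # replace commas with spaces
    sentence = re.sub(pattern, ' ', sentence)
    return sentence
pd.options.mode.chained_assignment = None  # suppress SettingWithCopyWarning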
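
The effect of the cleaning step is easiest to see on a single string. The sketch below is a standalone illustration: the helper name `strip_pos_tags` and the sample string are hypothetical, and the input format (comma-separated tokens, each followed by a parenthesised tag beginning with N, V, F, or P) is an assumption inferred from the regular expression, not confirmed by the data files.

```python
import re

# Matches a parenthesised tag whose first letter is N, V, F, or P.
POS_TAG = re.compile(r"\([NVFP].*?\)")

def strip_pos_tags(text):
    """Remove POS tags, then turn the comma separators into spaces."""
    text = POS_TAG.sub("", str(text))
    return text.replace(",", " ")

# Hypothetical segmented review with tags attached to each token.
sample = "房間(Na),乾淨(VH),在(P),附近(Ncd)"
print(strip_pos_tags(sample))        # 房間 乾淨 在 附近

# str() turns a missing value into the literal text "nan".
print(strip_pos_tags(float("nan")))  # nan
```

The second call shows a side effect of the `str()` conversion: a missing value becomes the token `nan` instead of raising an error, so rows with empty text pass through silently unless they are filtered out beforehand.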

In [117]:
# Strip POS tags from the text column of every training and test frame
for frame in (
    df_value_train, df_comfort_train, df_location_train,
    df_cleanliness_train, df_service_train, df_facilities_train,
    df_value_test, df_comfort_test, df_location_test,
    df_cleanliness_test, df_service_test, df_facilities_test,
):
    frame['filtered_word'] = frame['filtered_word'].apply(remove_N_comma)

In [118]:
df_value_test.shape

(420, 2)

## Model architecture

### Package imports

In [313]:
# Import packages
# Text vectorization
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_extraction.text import TfidfTransformer
from sklearn.feature_extraction.text import TfidfVectorizer
from scipy.sparse import coo_matrix

from imblearn.pipeline import Pipeline, make_pipeline
from sklearn.model_selection import cross_validate
import pickle  # for saving models
from sklearn.model_selection import train_test_split
# Class resampling
import imblearn.over_sampling as over_sampling
import imblearn.under_sampling as under_sampling
import imblearn.combine as combine
from imblearn.pipeline import make_pipeline as make_pipeline_imb


# Models
# sklearn.metrics.classification_report is not imported: this notebook
# defines its own classification_report, and the two names would collide.
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn import svm
from sklearn import neighbors
from sklearn.naive_bayes import MultinomialNB
#from xgboost.sklearn import XGBClassifier

# Model performance metrics
import sklearn.metrics as metrics

### Reporting evaluation results

In [314]:
def classification_report(y_test, pre):
    # Confusion matrix (rows: true class, columns: predicted class)
    confusion = metrics.confusion_matrix(y_test, pre)
    TP = confusion[1,1]
    TN = confusion[0,0]
    FP = confusion[0,1]
    FN = confusion[1,0]
    # Accuracy
    accuracy = (TP+TN)/float(TP+TN+FN+FP)
    print("Accuracy：", accuracy)
    # Sensitivity (recall)
    recall = TP/float(TP+FN)
    print("Recall：", recall)
    # Specificity
    specificity = TN/float(TN+FP)
    print("Specificity：", specificity)
    # Precision
    precision = TP/float(TP+FP)
    print("Precision：", precision)
    # F1 score
    f1_score = ((2*precision*recall)/(precision+recall))
    print("f1_score：", f1_score)
    # AUC, computed from hard class predictions rather than decision scores
    print("AUC：", metrics.roc_auc_score(y_test, pre))
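
The formulas in `classification_report` can be checked by hand. The sketch below recomputes them from four confusion-matrix counts. The counts are not printed anywhere in this notebook: they are inferred here as the integers that reproduce the value baseline figures reported later, assuming the 420 test rows shown for `df_value_test`. The helper name `binary_metrics` is hypothetical.

```python
def binary_metrics(tp, fn, tn, fp):
    """Recompute the reported metrics from raw confusion-matrix counts."""
    recall = tp / (tp + fn)
    specificity = tn / (tn + fp)
    precision = tp / (tp + fp)
    return {
        "accuracy": (tp + tn) / (tp + fn + tn + fp),
        "recall": recall,
        "specificity": specificity,
        "precision": precision,
        "f1": 2 * precision * recall / (precision + recall),
        # With hard 0/1 predictions the ROC curve has a single operating
        # point, so its area reduces to the mean of recall and specificity.
        "balanced_accuracy": (recall + specificity) / 2,
    }

# Inferred counts: 64 positives (42 found), 356 negatives (352 found).
m = binary_metrics(tp=42, fn=22, tn=352, fp=4)
for name, value in m.items():
    print(f"{name}: {value:.4f}")
```

Because `model.predict` returns class labels rather than scores, the "AUC" line printed by `classification_report` coincides with this balanced accuracy. Passing the output of `decision_function` to `roc_auc_score` instead would measure ranking quality across all thresholds.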

### Splitting features and labels

In [315]:
def split_label(df_train, df_test):
    X_train = df_train.filtered_word.tolist()
    y_train = df_train.label
    X_test = df_test.filtered_word.tolist()
    y_test = df_test.label
    return X_train, y_train, X_test, y_test

### SVM model design

#### (1) Baseline

In [316]:
def SVM_model(df_train, df_test):
    print("SVM baseline")
    # Split into features and labels
    X_train,y_train,X_test,y_test = split_label(df_train, df_test)
    # Model pipeline
    model = make_pipeline_imb(TfidfVectorizer(max_df=0.8,min_df=5,dtype=np.float32), svm.SVC(kernel='linear'))
    model.fit(X_train, y_train)

    # Predict
    pre = model.predict(X_test)
    # Evaluate
    classification_report(y_test, pre)
    print("\n")
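
The baseline passes `max_df=0.8` and `min_df=5` to `TfidfVectorizer`. As a rough illustration of what those two thresholds do, the standard-library sketch below prunes a vocabulary by document frequency: a float `max_df` is read as a share of all documents and an integer `min_df` as an absolute document count. The function name `prune_vocabulary` and the toy corpus are hypothetical, and this is not scikit-learn's code.

```python
from collections import Counter

def prune_vocabulary(tokenised_docs, min_df=5, max_df=0.8):
    """Keep terms found in at least min_df docs and at most max_df of all docs."""
    n_docs = len(tokenised_docs)
    # Count each term once per document.
    df = Counter(term for tokens in tokenised_docs for term in set(tokens))
    return sorted(
        term for term, count in df.items()
        if count >= min_df and count <= max_df * n_docs
    )

# Hypothetical corpus of 10 tokenised documents:
# "room" appears in all 10, "clean" in 6, "pool" in 2.
corpus = [["room", "clean"]] * 6 + [["room", "pool"]] * 2 + [["room"]] * 2
print(prune_vocabulary(corpus))  # ['clean']
```

Here "room" is too common (10 documents, above the limit of 8) and "pool" too rare (2 documents, below 5), so only "clean" survives.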

#### (2) Resampling to address class imbalance

In [318]:
def SVM_model2(df_train, df_test):
    print("ADASYN")
    # Split into features and labels
    X_train,y_train,X_test,y_test = split_label(df_train, df_test)
    # Model pipeline (default TfidfVectorizer settings, unlike the baseline)
    model = make_pipeline_imb(TfidfVectorizer(), over_sampling.ADASYN(), svm.SVC(kernel='linear'))
    model.fit(X_train, y_train)
    # Predict
    pre = model.predict(X_test)
    # Evaluate
    classification_report(y_test, pre)
    print("\n")

In [319]:
def SVM_model3(df_train, df_test):
    print("SMOTE")
    # Split into features and labels
    X_train,y_train,X_test,y_test = split_label(df_train, df_test)
    # Model pipeline
    model = make_pipeline_imb(TfidfVectorizer(), over_sampling.SMOTE(), svm.SVC(kernel='linear'))
    model.fit(X_train, y_train)
    # Predict
    pre = model.predict(X_test)
    # Evaluate
    classification_report(y_test, pre)
    print("\n")

In [320]:
def SVM_model4(df_train, df_test):
    print("RandomOverSampler")
    # Split into features and labels
    X_train,y_train,X_test,y_test = split_label(df_train, df_test)
    # Model pipeline
    model = make_pipeline_imb(TfidfVectorizer(), over_sampling.RandomOverSampler(), svm.SVC(kernel='linear'))
    model.fit(X_train, y_train)
    # Predict
    pre = model.predict(X_test)
    # Evaluate
    classification_report(y_test, pre)
    print("\n")

In [321]:
def SVM_model5(df_train, df_test):
    print("RandomUnderSampler")
    # Split into features and labels
    X_train,y_train,X_test,y_test = split_label(df_train, df_test)
    # Model pipeline
    model = make_pipeline_imb(TfidfVectorizer(), under_sampling.RandomUnderSampler(), svm.SVC(kernel='linear'))
    model.fit(X_train, y_train)
    # Predict
    pre = model.predict(X_test)
    # Evaluate
    classification_report(y_test, pre)
    print("\n")
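
The four resampling variants above differ only in how the training set is rebalanced before the SVM is fitted. As a rough picture of the two random samplers, the standard-library sketch below duplicates minority rows (oversampling) or discards majority rows (undersampling) until the classes are the same size. The function names and toy data are hypothetical, and this is not imbalanced-learn's implementation. SMOTE and ADASYN work differently: they synthesise new minority samples by interpolating between neighbours, which is not shown here.

```python
import random
from collections import Counter

def _group_by_class(X, y):
    """Collect the rows of X under their class label."""
    groups = {}
    for features, label in zip(X, y):
        groups.setdefault(label, []).append(features)
    return groups

def random_oversample(X, y, seed=0):
    """Duplicate rows at random until every class matches the largest one."""
    rng = random.Random(seed)
    groups = _group_by_class(X, y)
    target = max(len(rows) for rows in groups.values())
    X_out, y_out = [], []
    for label, rows in groups.items():
        extra = [rng.choice(rows) for _ in range(target - len(rows))]
        X_out.extend(rows + extra)
        y_out.extend([label] * target)
    return X_out, y_out

def random_undersample(X, y, seed=0):
    """Keep a random subset of each class, sized to the smallest class."""
    rng = random.Random(seed)
    groups = _group_by_class(X, y)
    target = min(len(rows) for rows in groups.values())
    X_out, y_out = [], []
    for label, rows in groups.items():
        X_out.extend(rng.sample(rows, target))
        y_out.extend([label] * target)
    return X_out, y_out

# Hypothetical imbalanced toy data: 2 positive rows, 8 negative rows.
X = [f"review_{i}" for i in range(10)]
y = [1, 1] + [0] * 8

X_over, y_over = random_oversample(X, y)
X_under, y_under = random_undersample(X, y)
print(Counter(y_over))   # both classes have 8 rows
print(Counter(y_under))  # both classes have 2 rows
```

The fixed `seed` makes each call repeatable. The imbalanced-learn samplers accept a `random_state` argument for the same purpose. None is set in the functions above, which is one plausible reason the two recorded runs on the value data report different SMOTE figures.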

## Model training and results

### Model 1: value

In [329]:
#X_train, y_train, X_test, y_test = data_to_vec(df_value_train, df_value_test)
SVM_model(df_value_train, df_value_test)
#XG_model(df_value_train, df_value_test)
SVM_model2(df_value_train, df_value_test)
SVM_model3(df_value_train, df_value_test)
SVM_model4(df_value_train, df_value_test)
SVM_model5(df_value_train, df_value_test)

SVM baseline
Accuracy： 0.9380952380952381
Recall： 0.65625
Specificity： 0.9887640449438202
Precision： 0.9130434782608695
f1_score： 0.7636363636363634
AUC： 0.8225070224719101


ADASYN
Accuracy： 0.9047619047619048
Recall： 0.71875
Specificity： 0.9382022471910112
Precision： 0.6764705882352942
f1_score： 0.696969696969697
AUC： 0.8284761235955055


SMOTE
Accuracy： 0.9071428571428571
Recall： 0.6875
Specificity： 0.9466292134831461
Precision： 0.6984126984126984
f1_score： 0.6929133858267716
AUC： 0.817064606741573


RandomOverSampler
Accuracy： 0.9095238095238095
Recall： 0.65625
Specificity： 0.9550561797752809
Precision： 0.7241379310344828
f1_score： 0.6885245901639345
AUC： 0.8056530898876405


RandomUnderSampler
Accuracy： 0.8095238095238095
Recall： 0.71875
Specificity： 0.8258426966292135
Precision： 0.42592592592592593
f1_score： 0.5348837209302325
AUC： 0.7722963483146068




### Model 2: comfort

In [330]:
#X_train, y_train, X_test, y_test = data_to_vec(df_comfort_train, df_comfort_test)
#SVM_model(X_train, y_train, X_test, y_test)
SVM_model(df_comfort_train, df_comfort_test)
SVM_model2(df_comfort_train, df_comfort_test)
SVM_model3(df_comfort_train, df_comfort_test)
SVM_model4(df_comfort_train, df_comfort_test)
SVM_model5(df_comfort_train, df_comfort_test)

SVM baseline
Accuracy： 0.8261904761904761
Recall： 0.72
Specificity： 0.8711864406779661
Precision： 0.703125
f1_score： 0.7114624505928854
AUC： 0.795593220338983


ADASYN
Accuracy： 0.7761904761904762
Recall： 0.776
Specificity： 0.7762711864406779
Precision： 0.5950920245398773
f1_score： 0.6736111111111112
AUC： 0.776135593220339


SMOTE
Accuracy： 0.7857142857142857
Recall： 0.768
Specificity： 0.7932203389830509
Precision： 0.6114649681528662
f1_score： 0.6808510638297872
AUC： 0.7806101694915254


RandomOverSampler
Accuracy： 0.7761904761904762
Recall： 0.744
Specificity： 0.7898305084745763
Precision： 0.6
f1_score： 0.6642857142857143
AUC： 0.7669152542372881


RandomUnderSampler
Accuracy： 0.7428571428571429
Recall： 0.792
Specificity： 0.7220338983050848
Precision： 0.5469613259668509
f1_score： 0.6470588235294118
AUC： 0.7570169491525424




### Model 3: location

In [331]:
#X_train, y_train, X_test, y_test = data_to_vec(df_location_train, df_location_test)
#SVM_model(X_train, y_train, X_test, y_test)
SVM_model(df_location_train, df_location_test)
SVM_model2(df_location_train, df_location_test)
SVM_model3(df_location_train, df_location_test)
SVM_model4(df_location_train, df_location_test)
SVM_model5(df_location_train, df_location_test)

SVM baseline
Accuracy： 0.8595238095238096
Recall： 0.6287878787878788
Specificity： 0.9652777777777778
Precision： 0.8924731182795699
f1_score： 0.7377777777777779
AUC： 0.7970328282828283


ADASYN
Accuracy： 0.8547619047619047
Recall： 0.6666666666666666
Specificity： 0.9409722222222222
Precision： 0.8380952380952381
f1_score： 0.7426160337552743
AUC： 0.8038194444444444


SMOTE
Accuracy： 0.8642857142857143
Recall： 0.7045454545454546
Specificity： 0.9375
Precision： 0.8378378378378378
f1_score： 0.7654320987654323
AUC： 0.8210227272727273


RandomOverSampler
Accuracy： 0.8547619047619047
Recall： 0.6742424242424242
Specificity： 0.9375
Precision： 0.8317757009345794
f1_score： 0.7447698744769874
AUC： 0.8058712121212122


RandomUnderSampler
Accuracy： 0.8166666666666667
Recall： 0.7272727272727273
Specificity： 0.8576388888888888
Precision： 0.7007299270072993
f1_score： 0.7137546468401488
AUC： 0.7924558080808081




### Model 4: cleanliness

In [332]:
#X_train, y_train, X_test, y_test = data_to_vec(df_cleanliness_train, df_cleanliness_test)
#SVM_model(X_train, y_train, X_test, y_test)
SVM_model(df_cleanliness_train, df_cleanliness_test)
SVM_model2(df_cleanliness_train, df_cleanliness_test)
SVM_model3(df_cleanliness_train, df_cleanliness_test)
SVM_model4(df_cleanliness_train, df_cleanliness_test)
SVM_model5(df_cleanliness_train, df_cleanliness_test)

SVM baseline
Accuracy： 0.9476190476190476
Recall： 0.8347107438016529
Specificity： 0.9933110367892977
Precision： 0.9805825242718447
f1_score： 0.9017857142857142
AUC： 0.9140108902954753


ADASYN
Accuracy： 0.9523809523809523
Recall： 0.859504132231405
Specificity： 0.9899665551839465
Precision： 0.9719626168224299
f1_score： 0.912280701754386
AUC： 0.9247353437076757


SMOTE
Accuracy： 0.95
Recall： 0.8512396694214877
Specificity： 0.9899665551839465
Precision： 0.9716981132075472
f1_score： 0.9074889867841409
AUC： 0.9206031123027171


RandomOverSampler
Accuracy： 0.9571428571428572
Recall： 0.8760330578512396
Specificity： 0.9899665551839465
Precision： 0.9724770642201835
f1_score： 0.9217391304347826
AUC： 0.9329998065175931


RandomUnderSampler
Accuracy： 0.95
Recall： 0.8677685950413223
Specificity： 0.9832775919732442
Precision： 0.9545454545454546
f1_score： 0.9090909090909091
AUC： 0.9255230935072832




### Model 5: service

In [333]:
#X_train, y_train, X_test, y_test = data_to_vec(df_service_train, df_service_test)
#SVM_model(X_train, y_train, X_test, y_test)
SVM_model(df_service_train, df_service_test)
SVM_model2(df_service_train, df_service_test)
SVM_model3(df_service_train, df_service_test)
SVM_model4(df_service_train, df_service_test)
SVM_model5(df_service_train, df_service_test)

SVM baseline
Accuracy： 0.8738095238095238
Recall： 0.8592592592592593
Specificity： 0.9
Precision： 0.9392712550607287
f1_score： 0.8974854932301741
AUC： 0.8796296296296297


ADASYN
Accuracy： 0.8928571428571429
Recall： 0.8925925925925926
Specificity： 0.8933333333333333
Precision： 0.9377431906614786
f1_score： 0.9146110056925997
AUC： 0.892962962962963


SMOTE
Accuracy： 0.8833333333333333
Recall： 0.8888888888888888
Specificity： 0.8733333333333333
Precision： 0.9266409266409267
f1_score： 0.9073724007561437
AUC： 0.8811111111111111


RandomOverSampler
Accuracy： 0.8904761904761904
Recall： 0.8962962962962963
Specificity： 0.88
Precision： 0.9307692307692308
f1_score： 0.9132075471698113
AUC： 0.8881481481481482


RandomUnderSampler
Accuracy： 0.8761904761904762
Recall： 0.8555555555555555
Specificity： 0.9133333333333333
Precision： 0.9467213114754098
f1_score： 0.8988326848249028
AUC： 0.8844444444444445




### Model 6: facilities

In [334]:
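#X_train, y_train, X_test, y_test = data_to_vec(df_facilities_train, df_facilities_test)
#SVM_model(X_train, y_train, X_test, y_test)
# Note: the output recorded below came from a run that passed
# df_value_train / df_value_test to these calls, so it reports the value
# data again, not facilities.
SVM_model(df_facilities_train, df_facilities_test)
SVM_model2(df_facilities_train, df_facilities_test)
SVM_model3(df_facilities_train, df_facilities_test)
SVM_model4(df_facilities_train, df_facilities_test)
SVM_model5(df_facilities_train, df_facilities_test)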

SVM baseline
Accuracy： 0.9380952380952381
Recall： 0.65625
Specificity： 0.9887640449438202
Precision： 0.9130434782608695
f1_score： 0.7636363636363634
AUC： 0.8225070224719101


ADASYN
Accuracy： 0.9047619047619048
Recall： 0.71875
Specificity： 0.9382022471910112
Precision： 0.6764705882352942
f1_score： 0.696969696969697
AUC： 0.8284761235955055


SMOTE
Accuracy： 0.9166666666666666
Recall： 0.640625
Specificity： 0.9662921348314607
Precision： 0.7735849056603774
f1_score： 0.7008547008547009
AUC： 0.8034585674157304


RandomOverSampler
Accuracy： 0.9071428571428571
Recall： 0.6875
Specificity： 0.9466292134831461
Precision： 0.6984126984126984
f1_score： 0.6929133858267716
AUC： 0.817064606741573


RandomUnderSampler
Accuracy： 0.819047619047619
Recall： 0.734375
Specificity： 0.8342696629213483
Precision： 0.44339622641509435
f1_score： 0.5529411764705883
AUC： 0.7843223314606741


