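## Possible approaches

Approach 1: TF-IDF + machine-learning classifier
Extract features from the text directly with TF-IDF, then classify them with a conventional classifier such as SVM, LR (logistic regression), or XGBoost.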

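Approach 2: FastText
FastText is the entry-level word-vector option; the FastText tool released by Facebook makes it quick to build a classifier.

Approach 3: Word2Vec + deep-learning classifier
Word2Vec is the intermediate word-vector option; classification is done by a deep-learning model built on top of the vectors. The network architecture can be TextCNN, TextRNN, or BiLSTM.

Approach 4: BERT word vectors
BERT is the high-end word-vector option, with strong modelling capacity.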
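
To make the TF-IDF step of Approach 1 concrete, here is a minimal pure-Python sketch of how the weights can be computed for whitespace-separated tokens. The `tfidf` helper and the three toy documents are assumptions for illustration; the formula is the smoothed IDF with L2 normalisation that scikit-learn's `TfidfVectorizer` applies by default, but the tokenisation here is a plain `split()`.

```python
import math
from collections import Counter

def tfidf(docs):
    """Return one {term: weight} dict per whitespace-tokenised document."""
    tokenised = [doc.split() for doc in docs]
    n_docs = len(tokenised)
    # Document frequency: number of documents that contain each term.
    df = Counter(term for tokens in tokenised for term in set(tokens))
    # Smoothed IDF: ln((1 + n) / (1 + df)) + 1.
    idf = {t: math.log((1 + n_docs) / (1 + d)) + 1 for t, d in df.items()}
    vectors = []
    for tokens in tokenised:
        tf = Counter(tokens)
        raw = {t: count * idf[t] for t, count in tf.items()}
        # L2-normalise each document vector (guard against empty documents).
        norm = math.sqrt(sum(w * w for w in raw.values())) or 1.0
        vectors.append({t: w / norm for t, w in raw.items()})
    return vectors

# Toy documents in the same space-separated form as the cleaned reviews.
docs = ["價格 房間 早餐 早餐", "內部 房間 場地", "房間"]
vectors = tfidf(docs)
for doc, vec in zip(docs, vectors):
    print(doc, {t: round(w, 3) for t, w in vec.items()})
```

A term that appears in every document (`房間`) gets the lowest weight, while a term repeated inside one document (`早餐`) gets the highest.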

## Preparing the datasets for the five models

In [56]:
import matplotlib.pyplot as plt
import os
import pandas as pd
import numpy as np
import re
path = '../data/review_data(seg+pos+stopwords)_noun.csv'
df = pd.read_csv(path)
print(df.shape)
df.head()

(1019, 9)


| | reviews | value | comfort | location | cleanliness | service | ws_pos_reviews | filtered | filtered_noun |
|---|---|---|---|---|---|---|---|---|---|
| 0 | 價格合理舒適的房間老闆娘人很好還會自己做早餐給旅客重點是早餐吃到飽 | 1.0 | 1.0 | 0.0 | 0.0 | 1.0 | 價格(Na),合理(VH),舒適(VH),的(DE),房間(Nc),老闆娘(Na),人(Na... | 價格(Na),合理(VH),舒適(VH),房間(Nc),老闆娘(Na),人(Na),好(VH... | 價格(Na),房間(Nc),老闆娘(Na),人(Na),早餐(Na),旅客(Na),重點(N... |
| 1 | 內部房間滿乾淨的還有大場地可以讓團體來使用我覺得很棒 | 0.0 | 1.0 | 0.0 | 1.0 | 0.0 | 內部(Ncd),房間(Nc),滿(Dfa),乾淨(VH),的(DE),還(D),有(V_2)... | 內部(Ncd),房間(Nc),乾淨(VH),場地(Na),團體(Na),使用(VC),覺得(... | 內部(Ncd),房間(Nc),場地(Na),團體(Na) |
| 2 | 隔音還可以房間不小但美中不足沒有乾濕分離 | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 | 隔音(A),還可以(D),房間(Nc),不(D),小(VH),但(Cbb),美中不足(VH)... | 隔音(A),還可以(D),房間(Nc),小(VH),美中不足(VH),乾(VH),濕(VH)... | 房間(Nc) |
| 3 | 房子設計的很棒房間採光很好大廳挑高氣派房價合理台東必住民宿 | 1.0 | 1.0 | 0.0 | 0.0 | 0.0 | 房子(Na),設計(VC),的(DE),很(Dfa),棒(VH),房間(Nc),採光(Na)... | 房子(Na),設計(VC),棒(VH),房間(Nc),採光(Na),好(VH),大廳(Nc)... | 房子(Na),房間(Nc),採光(Na),大廳(Nc),氣派(Na),房價(Na),台東(N... |
| 4 | Cp值高乾淨舒適空間大樓下有免費吐司和咖啡老闆回復速度快 | 1.0 | 1.0 | 0.0 | 0.0 | 1.0 | Cp值(FW),高(VH),乾淨(VH),舒適(VH),空間(Na),大樓(Na),下(Nc... | Cp值(FW),高(VH),乾淨(VH),舒適(VH),空間(Na),大樓(Na),下(Nc... | 空間(Na),大樓(Na),下(Ncd),吐司(Na),咖啡(Na),老闆(Na),速度(Na) |


In [62]:
# One two-column frame (label, filtered_noun) per aspect.
# rename() without inplace returns a new DataFrame instead of modifying a selection of df.
df_value = df[['value','filtered_noun']].rename(columns={'value': 'label'})
df_comfort = df[['comfort','filtered_noun']].rename(columns={'comfort': 'label'})
df_location = df[['location','filtered_noun']].rename(columns={'location': 'label'})
df_cleanliness = df[['cleanliness','filtered_noun']].rename(columns={'cleanliness': 'label'})
df_service = df[['service','filtered_noun']].rename(columns={'service': 'label'})

### Cleaning the data

In [63]:
def remove_N_comma(sentence):
    # Strip the trailing POS tags that start with N, e.g. (Na), (Nc), (Ncd)
    sentence = str(sentence)
    pattern = re.compile(r"\(N.*?\)")  # remove the POS tags
    sentence = re.sub(pattern, '', sentence)
    pattern = re.compile(r",")  # replace commas with spaces
    sentence = re.sub(pattern, ' ', sentence)
    return sentence
pd.options.mode.chained_assignment = None  # suppress SettingWithCopyWarning
# Clean each frame's own filtered_noun column
df_value['filtered_noun'] = df_value.apply(lambda x: remove_N_comma(x['filtered_noun']),axis=1)
df_comfort['filtered_noun'] = df_comfort.apply(lambda x: remove_N_comma(x['filtered_noun']),axis=1)
df_location['filtered_noun'] = df_location.apply(lambda x: remove_N_comma(x['filtered_noun']),axis=1)
df_cleanliness['filtered_noun'] = df_cleanliness.apply(lambda x: remove_N_comma(x['filtered_noun']),axis=1)
df_service['filtered_noun'] = df_service.apply(lambda x: remove_N_comma(x['filtered_noun']),axis=1)
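
A quick check of what the two regular expressions in `remove_N_comma` do on sample strings may help. The sketch below reuses the same noun-tag pattern; `strip_tags`, `ANY_TAG`, and the `(A)` input are hypothetical additions that show the pattern leaves non-noun tags alone and that a missing value becomes the literal string `nan` after `str()`.

```python
import re

NOUN_TAG = re.compile(r"\(N.*?\)")          # same pattern as remove_N_comma
ANY_TAG = re.compile(r"\([A-Za-z0-9_]+\)")  # hypothetical broader pattern

def strip_tags(sentence, pattern=NOUN_TAG):
    """Remove POS tags matched by `pattern`, then turn commas into spaces."""
    return pattern.sub("", str(sentence)).replace(",", " ")

print(strip_tags("價格(Na),房間(Nc),老闆娘(Na)"))  # noun tags are stripped
print(strip_tags("隔音(A),房間(Nc)"))             # (A) does not start with N, so it stays
print(strip_tags("隔音(A),房間(Nc)", ANY_TAG))    # broader pattern strips every tag
print(strip_tags(float("nan")))                  # a missing value becomes "nan"
```

Because `filtered_noun` only holds noun tags, the narrower pattern is enough here; the broader one would only matter if the same function were applied to the `filtered` column.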

## Model training

In [242]:
#import package
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.pipeline import make_pipeline
from sklearn.model_selection import cross_validate
import pickle  # for saving models
from sklearn.model_selection import train_test_split
# models
from sklearn.svm import SVC
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier

### Splitting the dataset

In [268]:
def split_data(df):
    # Return features and labels unsplit; evaluation below uses 10-fold cross-validation
    x = df.filtered_noun
    y = df.label
    # x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.2)  # train:test = 8:2
    return x,y
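
Since `split_data` hands back the whole dataset, the train/test separation happens inside `cross_validate(..., cv=10)`. For a classifier with binary labels, scikit-learn uses stratified folds in that case, so each fold keeps roughly the same share of positive labels. The following is a simplified pure-Python sketch of that idea, not scikit-learn's actual algorithm; `stratified_folds` and the toy labels are assumptions.

```python
from collections import defaultdict

def stratified_folds(labels, n_splits):
    """Assign each sample index to a fold, keeping class ratios similar."""
    by_class = defaultdict(list)
    for idx, label in enumerate(labels):
        by_class[label].append(idx)
    folds = [[] for _ in range(n_splits)]
    for indices in by_class.values():
        # Deal the samples of each class round-robin across the folds.
        for position, idx in enumerate(indices):
            folds[position % n_splits].append(idx)
    return folds

labels = [1, 0, 0, 1, 1, 0, 0, 0, 1, 0]  # toy labels, not from the dataset
folds = stratified_folds(labels, n_splits=2)
for k, test_idx in enumerate(folds):
    train_idx = [i for i in range(len(labels)) if i not in test_idx]
    positives = sum(labels[i] for i in test_idx)
    print(f"fold {k}: test={sorted(test_idx)} train_size={len(train_idx)} positives={positives}")
```

Each fold serves once as the test set while the remaining samples are used for training, which is why no separate hold-out split is needed for the scores reported below.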

### Training the three models

In [269]:
def train_SVM(X,y):
    # vectorize with TF-IDF
    vectorizer = TfidfVectorizer()
    clf = make_pipeline(vectorizer, SVC(kernel='linear'))
    scores = cross_validate(clf, X, y, scoring=['accuracy','recall','precision','f1'], cv=10, return_train_score=False)
    return scores
def train_LR(X,y):
    # vectorize with TF-IDF
    vectorizer = TfidfVectorizer()
    clf = make_pipeline(vectorizer, LogisticRegression(random_state=0))
    scores = cross_validate(clf, X, y, scoring=['accuracy','recall','precision','f1'], cv=10, return_train_score=False)
    return scores
def train_RF(X,y):
    # vectorize with TF-IDF
    vectorizer = TfidfVectorizer()
    clf = make_pipeline(vectorizer, RandomForestClassifier(n_estimators=200, max_depth=3, random_state=0))
    scores = cross_validate(clf, X, y, scoring=['accuracy','recall','precision','f1'], cv=10, return_train_score=False)
    return scores
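
One detail of `TfidfVectorizer()` with default settings is worth checking on this data: its default `token_pattern` only keeps tokens of two or more word characters, so single-character nouns such as `人` or `下` are dropped before the TF-IDF weights are computed. The sketch below reproduces that with plain `re`; `WHITESPACE_PATTERN` is a hypothetical alternative that could be passed as `token_pattern` to keep every space-separated token, and whether that helps the scores has not been tested here.

```python
import re

DEFAULT_TOKEN_PATTERN = r"(?u)\b\w\w+\b"  # TfidfVectorizer's default token_pattern
WHITESPACE_PATTERN = r"(?u)\S+"           # hypothetical alternative: split on whitespace only

text = "價格 房間 老闆娘 人 早餐"
default_tokens = re.findall(DEFAULT_TOKEN_PATTERN, text)
whitespace_tokens = re.findall(WHITESPACE_PATTERN, text)

print(default_tokens)     # the single-character token 人 is missing
print(whitespace_tokens)  # all five tokens are kept
```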

### Displaying the training results

In [322]:
def show_output(scores_SVM,scores_LR,scores_RF):
    total_score=[]
    SVM_list = [scores_SVM['test_accuracy'].mean(),
                scores_SVM['test_recall'].mean(),
                scores_SVM['test_precision'].mean(),
                scores_SVM['test_f1'].mean()]
    LR_list = [scores_LR['test_accuracy'].mean(),
                scores_LR['test_recall'].mean(),
                scores_LR['test_precision'].mean(),
                scores_LR['test_f1'].mean()]
    RF_list = [scores_RF['test_accuracy'].mean(),
                scores_RF['test_recall'].mean(),
                scores_RF['test_precision'].mean(),
                scores_RF['test_f1'].mean()]
    total_score.append(SVM_list)
    total_score.append(LR_list)
    total_score.append(RF_list)
    df_scores = pd.DataFrame(total_score, columns=["accuracy", "recall", "precision", "f1_score"],index=['SVM','LR','RF'])
    return df_scores

## Model 1: value

In [323]:
df_value.head()

| | label | filtered_noun |
|---|---|---|
| 0 | 1.0 | 價格 房間 老闆娘 人 早餐 旅客 重點 早餐 |
| 1 | 0.0 | 內部 房間 場地 團體 |
| 2 | 0.0 | 房間 |
| 3 | 1.0 | 房子 房間 採光 大廳 氣派 房價 台東 民宿 |
| 4 | 1.0 | 空間 大樓 下 吐司 咖啡 老闆 速度 |


In [324]:
x, y = split_data(df_value)
scores_SVM = train_SVM(x,y)
scores_LR = train_LR(x,y)
scores_RF = train_RF(x,y)

  _warn_prf(average, modifier, msg_start, len(result))


In [325]:
show_output(scores_SVM,scores_LR,scores_RF)

| | accuracy | recall | precision | f1_score |
|---|---|---|---|---|
| SVM | 0.881256 | 0.332026 | 0.949603 | 0.481895 |
| LR | 0.851835 | 0.137908 | 1.0 | 0.237402 |
| RF | 0.828266 | 0.0 | 0.0 | 0.0 |
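
The RF row (recall, precision and F1 all 0.0, yet accuracy above 0.8) together with the `_warn_prf` lines is the pattern one would expect if the model predicted 0 for every review: precision becomes 0/0, which scikit-learn reports as 0 while emitting an `UndefinedMetricWarning`. The sketch below works through that arithmetic; `binary_metrics` and the 17% positive rate are assumptions chosen for illustration, not values read from the dataset.

```python
def binary_metrics(y_true, y_pred):
    """Accuracy, precision, recall and F1 for 0/1 labels."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    accuracy = (tp + tn) / len(pairs)
    # 0/0 is reported as 0.0 here; scikit-learn does the same by default
    # and emits an UndefinedMetricWarning.
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return accuracy, precision, recall, f1

# Toy labels with an assumed 17% positive rate (not taken from the dataset).
y_true = [1] * 17 + [0] * 83
all_negative = binary_metrics(y_true, [0] * 100)
all_positive = binary_metrics(y_true, [1] * 100)
print("always predict 0:", all_negative)
print("always predict 1:", all_positive)
```

The mirror case, always predicting 1, gives recall 1.0 and a precision equal to the accuracy, so accuracy alone says little on imbalanced labels.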


## Model 2: comfort

In [326]:
df_comfort.head()

| | label | filtered_noun |
|---|---|---|
| 0 | 1.0 | 價格 房間 老闆娘 人 早餐 旅客 重點 早餐 |
| 1 | 1.0 | 內部 房間 場地 團體 |
| 2 | 1.0 | 房間 |
| 3 | 1.0 | 房子 房間 採光 大廳 氣派 房價 台東 民宿 |
| 4 | 1.0 | 空間 大樓 下 吐司 咖啡 老闆 速度 |


In [327]:
x, y = split_data(df_comfort)
scores_SVM = train_SVM(x,y)
scores_LR = train_LR(x,y)
scores_RF = train_RF(x,y)

In [328]:
show_output(scores_SVM,scores_LR,scores_RF)

| | accuracy | recall | precision | f1_score |
|---|---|---|---|---|
| SVM | 0.75961 | 0.895775 | 0.789026 | 0.83855 |
| LR | 0.740934 | 0.970423 | 0.739393 | 0.839239 |
| RF | 0.696768 | 1.0 | 0.696768 | 0.821286 |


## Model 3: location

In [329]:
df_location.head()

| | label | filtered_noun |
|---|---|---|
| 0 | 0.0 | 價格 房間 老闆娘 人 早餐 旅客 重點 早餐 |
| 1 | 0.0 | 內部 房間 場地 團體 |
| 2 | 0.0 | 房間 |
| 3 | 0.0 | 房子 房間 採光 大廳 氣派 房價 台東 民宿 |
| 4 | 0.0 | 空間 大樓 下 吐司 咖啡 老闆 速度 |


In [330]:
x, y = split_data(df_location)
scores_SVM = train_SVM(x,y)
scores_LR = train_LR(x,y)
scores_RF = train_RF(x,y)

  _warn_prf(average, modifier, msg_start, len(result))


In [331]:
show_output(scores_SVM,scores_LR,scores_RF)

| | accuracy | recall | precision | f1_score |
|---|---|---|---|---|
| SVM | 0.887168 | 0.704032 | 0.911032 | 0.792164 |
| LR | 0.834168 | 0.484173 | 0.949438 | 0.634422 |
| RF | 0.691856 | 0.0 | 0.0 | 0.0 |


## Model 4: cleanliness

In [332]:
df_cleanliness.head()

| | label | filtered_noun |
|---|---|---|
| 0 | 0.0 | 價格 房間 老闆娘 人 早餐 旅客 重點 早餐 |
| 1 | 1.0 | 內部 房間 場地 團體 |
| 2 | 0.0 | 房間 |
| 3 | 0.0 | 房子 房間 採光 大廳 氣派 房價 台東 民宿 |
| 4 | 0.0 | 空間 大樓 下 吐司 咖啡 老闆 速度 |


In [333]:
x, y = split_data(df_cleanliness)
scores_SVM = train_SVM(x,y)
scores_LR = train_LR(x,y)
scores_RF = train_RF(x,y)

  _warn_prf(average, modifier, msg_start, len(result))


In [334]:
show_output(scores_SVM,scores_LR,scores_RF)

| | accuracy | recall | precision | f1_score |
|---|---|---|---|---|
| SVM | 0.768414 | 0.248718 | 0.609773 | 0.349644 |
| LR | 0.75959 | 0.149145 | 0.659365 | 0.240612 |
| RF | 0.743865 | 0.0 | 0.0 | 0.0 |


## Model 5: service

In [335]:
df_service.head()

| | label | filtered_noun |
|---|---|---|
| 0 | 1.0 | 價格 房間 老闆娘 人 早餐 旅客 重點 早餐 |
| 1 | 0.0 | 內部 房間 場地 團體 |
| 2 | 0.0 | 房間 |
| 3 | 0.0 | 房子 房間 採光 大廳 氣派 房價 台東 民宿 |
| 4 | 1.0 | 空間 大樓 下 吐司 咖啡 老闆 速度 |


In [336]:
x, y = split_data(df_service)
scores_SVM = train_SVM(x,y)
scores_LR = train_LR(x,y)
scores_RF = train_RF(x,y)

In [337]:
show_output(scores_SVM,scores_LR,scores_RF)

| | accuracy | recall | precision | f1_score |
|---|---|---|---|---|
| SVM | 0.855795 | 0.860312 | 0.89932 | 0.877938 |
| LR | 0.844991 | 0.912295 | 0.846596 | 0.876754 |
| RF | 0.604514 | 1.0 | 0.604514 | 0.753507 |
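
To compare the five runs side by side, the F1 scores reported in the tables above can be collected into one structure and the best model picked per aspect. This is only a bookkeeping sketch: the numbers are copied from the outputs above, and the `f1_scores` and `best` names are assumptions.

```python
# Mean 10-fold F1 scores, copied from the result tables above.
f1_scores = {
    "value":       {"SVM": 0.481895, "LR": 0.237402, "RF": 0.0},
    "comfort":     {"SVM": 0.83855,  "LR": 0.839239, "RF": 0.821286},
    "location":    {"SVM": 0.792164, "LR": 0.634422, "RF": 0.0},
    "cleanliness": {"SVM": 0.349644, "LR": 0.240612, "RF": 0.0},
    "service":     {"SVM": 0.877938, "LR": 0.876754, "RF": 0.753507},
}

# Pick the model with the highest F1 for each aspect.
best = {aspect: max(scores, key=scores.get) for aspect, scores in f1_scores.items()}
for aspect, model in best.items():
    print(f"{aspect}: {model} (F1 = {f1_scores[aspect][model]})")
```

By F1, the linear SVM leads on four of the five aspects and LR is marginally ahead on comfort; the gaps on comfort and service are small enough that they may not be meaningful.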
