## Several approaches

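Approach 1: TF-IDF + machine-learning classifier
Extract features from the text with TF-IDF, then pass them to a classifier. Candidate classifiers include SVM, logistic regression (LR), and XGBoost.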
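To make the feature-extraction step concrete, here is a minimal sketch of the textbook TF-IDF weighting on three toy documents. The documents are made up for illustration, and the formula (term frequency times `log(N / df)`) is the plain variant; scikit-learn's `TfidfVectorizer` smooths the IDF term and L2-normalizes each row by default, so its numbers will differ.

```python
import math
from collections import Counter

# Toy, pre-segmented documents (hypothetical, not taken from the dataset).
docs = [
    "房間 乾淨 舒適",
    "房間 小",
    "價格 合理 房間 舒適",
]
tokenized = [d.split() for d in docs]
n_docs = len(tokenized)

# Document frequency: in how many documents each token appears.
doc_freq = Counter(tok for doc in tokenized for tok in set(doc))

def tfidf(doc):
    """Textbook TF-IDF: (count / doc length) * log(N / document frequency)."""
    counts = Counter(doc)
    return {
        tok: (c / len(doc)) * math.log(n_docs / doc_freq[tok])
        for tok, c in counts.items()
    }

weights = [tfidf(doc) for doc in tokenized]
for doc_weights in weights:
    print(doc_weights)
```

A token that occurs in every document (here 房間) gets weight 0, while a token confined to one document (價格) gets the highest weight in that document, which is why TF-IDF tends to highlight aspect-specific vocabulary.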

Approach 2: FastText
FastText is an entry-level word-embedding method. With the FastText tool released by Facebook, a classifier can be built quickly.
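As a rough illustration of what FastText needs as input, the sketch below formats labeled, pre-segmented text into the one-example-per-line layout that fastText's supervised mode reads, using its default `__label__` prefix. The helper name `to_fasttext_line` and the two sample rows are assumptions for illustration; the training call is only mentioned in a comment and is not run here.

```python
def to_fasttext_line(label, text):
    """Format one example as '__label__<label> <space-separated tokens>'."""
    return f"__label__{int(label)} {text}"

# Hypothetical rows in the (label, cleaned text) shape used later in this notebook.
rows = [
    (1.0, "價格 合理 舒適 房間"),
    (0.0, "房間 小 美中不足"),
]
lines = [to_fasttext_line(label, text) for label, text in rows]
print("\n".join(lines))

# After writing `lines` to a text file, the official Python package can train on it
# with something like fasttext.train_supervised(input="train.txt").
```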

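Approach 3: Word2Vec + deep-learning classifier
Word2Vec is a step up as a word-embedding method; its embeddings are fed into a deep-learning classifier. The network architecture can be TextCNN, TextRNN, or BiLSTM.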

Approach 4: BERT embeddings
BERT is the high-end option for word embeddings, with strong modeling and learning capacity.

## Preparing the datasets for the five models

In [2]:
import matplotlib.pyplot as plt
import os
import pandas as pd
import numpy as np
import re
path = '../data/0407/review_data(seg+pos+stopwords)_train_n+v+f+p.csv'
df_train = pd.read_csv(path)
path = '../data/0407/review_data(seg+pos+stopwords)_test_n+v+f+p.csv'
df_test = pd.read_csv(path)

### Checking for duplicates and null values

In [4]:
df_test

Unnamed: 0,reviews,value,comfort,location,cleanliness,service,facilities,ws_pos_reviews,filtered,filtered_word
0,而且廚房有提供餅乾飲料可以買其實也可以不用去到全家買,0.0,0.0,0.0,0.0,1.0,0.0,"而且(Cbb),廚房(Nc),有(V_2),提供(VD),餅乾(Na),飲料(Na),可以(...","廚房(Nc),提供(VD),餅乾(Na),飲料(Na),買(VC),不用(D),全(Neqa...","廚房(Nc),提供(VD),餅乾(Na),飲料(Na),買(VC),全(Neqa),家(Nc..."
1,而且離車站客運站都很近有機會會想回住,0.0,0.0,1.0,0.0,0.0,0.0,"而且(Cbb),離(P),車站(Nc),客運站(Nc),都(D),很(Dfa),近(VH),...","離(P),車站(Nc),客運站(Nc),機會(Na),想(VE),回住(VCL)","離(P),車站(Nc),客運站(Nc),機會(Na),想(VE),回住(VCL)"
2,離車站好近房間很大隨時訂放都有人接洽雖然不是主題民宿推薦洽公背包客住宿,0.0,1.0,1.0,0.0,1.0,0.0,"離(P),車站(Nc),好(Dfa),近(VH),房間(Nc),很(Dfa),大(VH),隨...","離(P),車站(Nc),好(Dfa),房間(Nc),隨時(D),訂放(VC),人(Na),接...","離(P),車站(Nc),房間(Nc),訂放(VC),人(Na),接洽(VC),主題(Na),..."
3,4人房空間超大床埔也大整體乾淨舒適超讚,0.0,1.0,0.0,1.0,0.0,0.0,"4(Neu),人(Na),房(Na),空間(Na),超大(A),床埔(Nc),也(D),大(...","人(Na),房(Na),空間(Na),超大(A),床埔(Nc),整體(Na),乾淨(VH),...","人(Na),房(Na),空間(Na),床埔(Nc),整體(Na),乾淨(VH),舒適(VH)"
4,價格公道的背包客民宿老闆也蠻客氣的旁邊還有機車等代步工具可租乘,1.0,0.0,0.0,0.0,1.0,1.0,"價格(Na),公道(VH),的(DE),背包客(Na),民宿(Nc),老闆(Na),也(D)...","價格(Na),公道(VH),背包客(Na),民宿(Nc),老闆(Na),客氣(VH),旁邊(...","價格(Na),公道(VH),背包客(Na),民宿(Nc),老闆(Na),客氣(VH),旁邊(..."
...,...,...,...,...,...,...,...,...,...,...
415,房間就可以看到海馬哥非常熱情民宿風格簡約很多ikea的東西,0.0,0.0,1.0,0.0,1.0,1.0,"房間(Nc),就(D),可以(D),看到(VE),海馬哥(Nb),非常(Dfa),熱情(VH...","房間(Nc),看到(VE),海馬哥(Nb),熱情(VH),民宿(Nc),風格(Na),簡約(...","房間(Nc),看到(VE),海馬哥(Nb),熱情(VH),民宿(Nc),風格(Na),簡約(..."
416,老闆娘還說如果先問可以借嬰兒小推車我們就不用帶了帶小孩吃烤肉必備,0.0,0.0,0.0,0.0,0.0,1.0,"老闆娘(Na),還(D),說(VE),如果(Cbb),先(D),問(VE),可以(D),借(...","老闆娘(Na),說(VE),先(D),問(VE),嬰兒(Na),小推車(VH),不用(D),...","老闆娘(Na),說(VE),問(VE),嬰兒(Na),小推車(VH),小孩(Na),吃(VC..."
417,乾淨舒適闆娘和小幫手超熱情推薦在地景點及美食招待的水果好甜好好吃完食,0.0,1.0,0.0,1.0,1.0,0.0,"乾淨(VH),舒適(VH),闆娘(Na),和(Caa),小(VH),幫手(Na),超(Dfa...","乾淨(VH),舒適(VH),闆娘(Na),小(VH),幫手(Na),超(Dfa),熱情(VH...","乾淨(VH),舒適(VH),闆娘(Na),小(VH),幫手(Na),熱情(VH),推薦(VC..."
418,老闆娘親切贈予水果點心房間舒適整潔小孩很喜歡,0.0,1.0,0.0,1.0,1.0,0.0,"老闆娘(Na),親切(VH),贈予(VD),水果(Na),點心(Na),房間(Nc),舒適(...","老闆娘(Na),親切(VH),贈予(VD),水果(Na),點心(Na),房間(Nc),舒適(...","老闆娘(Na),親切(VH),贈予(VD),水果(Na),點心(Na),房間(Nc),舒適(..."


In [6]:
# Print duplicated rows
print(df_test[df_test.duplicated()])
print(df_train[df_train.duplicated()])

Empty DataFrame
Columns: [reviews, value, comfort, location, cleanliness, service, facilities, ws_pos_reviews, filtered, filtered_word]
Index: []
Empty DataFrame
Columns: [reviews, value, comfort, location, cleanliness, service, facilities, ws_pos_reviews, filtered, filtered_word]
Index: []


In [10]:
# Remove duplicates
#df = df.drop_duplicates()
#print(df.shape)

In [11]:
# Print rows that contain null values
#df_train[df_train.isnull().T.any()]
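The duplicate and null checks above can be sketched on a small toy frame as follows. The frame and its values are made up for illustration; `isnull().any(axis=1)` is used as an equivalent of the `.isnull().T.any()` form in the commented-out cell.

```python
import pandas as pd

# Toy frame (hypothetical values) with one duplicated row and one null.
toy = pd.DataFrame({
    "reviews": ["a", "b", "b", None],
    "value": [1.0, 0.0, 0.0, 1.0],
})

dup_rows = toy[toy.duplicated()]            # later copies of repeated rows
null_rows = toy[toy.isnull().any(axis=1)]   # rows with at least one null
cleaned = toy.drop_duplicates().dropna().reset_index(drop=True)

print(len(dup_rows), len(null_rows), len(cleaned))
```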

### Splitting into five datasets

In [12]:
# .copy() makes each subset an independent DataFrame, and the non-inplace rename
# returns a new frame, so pandas does not raise SettingWithCopyWarning.
df_value = df_train[['value','filtered_word']].copy().rename(columns={'value': 'label'})
df_comfort = df_train[['comfort','filtered_word']].copy().rename(columns={'comfort': 'label'})
df_location = df_train[['location','filtered_word']].copy().rename(columns={'location': 'label'})
df_cleanliness = df_train[['cleanliness','filtered_word']].copy().rename(columns={'cleanliness': 'label'})
df_service = df_train[['service','filtered_word']].copy().rename(columns={'service': 'label'})

### Cleaning the data

In [14]:
def remove_N_comma(sentence):
    # Strip the trailing POS tags such as (N..), (V..), (F..), (P..)
    sentence = str(sentence)
    pattern = re.compile(r"\([NVFP].*?\)")  # remove POS tags
    sentence = re.sub(pattern, '', sentence)
    pattern = re.compile(r",")  # replace commas with spaces
    sentence = re.sub(pattern, ' ', sentence)
    return sentence
pd.options.mode.chained_assignment = None  # suppress SettingWithCopyWarning
# Clean each dataset's own filtered_word column
df_value['filtered_word'] = df_value['filtered_word'].apply(remove_N_comma)
df_comfort['filtered_word'] = df_comfort['filtered_word'].apply(remove_N_comma)
df_location['filtered_word'] = df_location['filtered_word'].apply(remove_N_comma)
df_cleanliness['filtered_word'] = df_cleanliness['filtered_word'].apply(remove_N_comma)
df_service['filtered_word'] = df_service['filtered_word'].apply(remove_N_comma)
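To show what the cleaning step does to a single value, here is a small standalone sketch applied to one `filtered_word` string from the test-set preview above. The helper name `strip_pos_tags` is hypothetical; it is a compact equivalent of the tag-and-comma removal, not a replacement for the notebook's function.

```python
import re

# Matches a parenthesized POS tag that starts with N, V, F, or P, e.g. "(Nc)" or "(VH)".
POS_TAG = re.compile(r"\([NVFP].*?\)")

def strip_pos_tags(sentence):
    """Drop N/V/F/P tags, then turn the comma separators into spaces."""
    return POS_TAG.sub("", str(sentence)).replace(",", " ")

sample = "離(P),車站(Nc),客運站(Nc),機會(Na),想(VE),回住(VCL)"
cleaned = strip_pos_tags(sample)
print(cleaned)  # 離 車站 客運站 機會 想 回住

# Tags outside N/V/F/P are left untouched by this pattern.
untouched = strip_pos_tags("不用(D)")
print(untouched)  # 不用(D)
```

Because the pattern only covers the four tag families that survive the earlier filtering, any other tag that slipped through would remain in the text and become part of a token.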

In [325]:
#df_location.to_csv('../data/review_data(df_location)_n+v+f+p.csv', encoding='utf_8_sig', index=False)

In [327]:
#df_location.loc[df_location['filtered_noun'].str.contains('在') & (df_location['label']==1)]

## Model training

In [382]:
#import package
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_extraction.text import TfidfTransformer
from sklearn.feature_extraction.text import TfidfVectorizer 
from imblearn.pipeline import Pipeline, make_pipeline
from sklearn.model_selection import cross_validate
import pickle  # for saving models
from sklearn.model_selection import train_test_split
from imblearn.over_sampling import SMOTE,ADASYN
from imblearn.combine import SMOTEENN

# Models
from sklearn.svm import SVC
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier

### Splitting the dataset (X: features; y: label)

In [383]:
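def split_data(df):
    # x is the cleaned, space-separated token string; y is the binary label.
    # The outputs below come from a run whose text column was named `filtered_noun`;
    # the cells above name it `filtered_word`, so adjust the attribute to match.
    x = df.filtered_noun
    y = df.label
    # x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.2)  # train:test = 8:2
    return x,y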

### Cross-validating three models

In [384]:
def train_SVM(X,y):
    # Vectorize the text with TF-IDF
    vectorizer = TfidfVectorizer()
    #clf = make_pipeline(vectorizer, SVC(kernel='linear'))
    clf = make_pipeline(vectorizer, ADASYN(random_state=12), SVC(kernel='linear'))
    # clf = Pipeline([
    #   ('vect', CountVectorizer()),
    #    ('tfidf', TfidfTransformer()),
    #    ('smote', SMOTE(random_state=42)),
    #    ('svc', SVC(kernel='linear'))
    # ])
    scores = cross_validate(clf, X, y, scoring=['accuracy','recall','precision','f1','roc_auc'], cv=10, return_train_score=False)
    return scores
def train_LR(X,y):
    # Vectorize the text with TF-IDF
    vectorizer = TfidfVectorizer()
    #clf = make_pipeline(vectorizer, LogisticRegression(random_state=0))
    clf = make_pipeline(vectorizer, ADASYN(random_state=12), LogisticRegression(random_state=0))
    scores = cross_validate(clf, X, y, scoring=['accuracy','recall','precision','f1','roc_auc'], cv=10, return_train_score=False)
    return scores
def train_RF(X,y):
    # Vectorize the text with TF-IDF
    vectorizer = TfidfVectorizer()
    #clf = make_pipeline(vectorizer, RandomForestClassifier(n_estimators=200, max_depth=3, random_state=0))
    clf = make_pipeline(vectorizer, ADASYN(random_state=12), RandomForestClassifier(n_estimators=200, max_depth=3, random_state=0))
    scores = cross_validate(clf, X, y, scoring=['accuracy','recall','precision','f1','roc_auc'], cv=10, return_train_score=False)
    return scores
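One detail worth checking in the three functions above is tokenization. `TfidfVectorizer()` with default settings extracts tokens with the pattern `(?u)\b\w\w+\b`, which keeps only tokens of two or more word characters, so single-character Chinese tokens such as 小 or 棒 would be dropped before TF-IDF is computed. The sketch below mimics that step with `re` alone on one cleaned row from this notebook. The whitespace pattern `(?u)\S+` is an assumed alternative (it could be passed as `token_pattern=`), not something the notebook used.

```python
import re

DEFAULT_PATTERN = r"(?u)\b\w\w+\b"   # scikit-learn's default token_pattern
WHITESPACE_PATTERN = r"(?u)\S+"      # assumed alternative: keep every space-separated token

text = "房間 小 美中不足 乾 濕 分離"

default_tokens = re.findall(DEFAULT_PATTERN, text)
all_tokens = re.findall(WHITESPACE_PATTERN, text)

print(default_tokens)  # ['房間', '美中不足', '分離']
print(all_tokens)      # ['房間', '小', '美中不足', '乾', '濕', '分離']
```

Whether keeping the single-character tokens helps is an empirical question for this dataset; the point is only that the default silently discards them.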

### Displaying the training results

In [385]:
def show_output(scores_SVM,scores_LR,scores_RF):
    total_score=[]
    SVM_list = [scores_SVM['test_accuracy'].mean(),
                scores_SVM['test_recall'].mean(),
                scores_SVM['test_precision'].mean(),
                scores_SVM['test_f1'].mean(),
                scores_SVM['test_roc_auc'].mean(),]
    LR_list = [scores_LR['test_accuracy'].mean(),
                scores_LR['test_recall'].mean(),
                scores_LR['test_precision'].mean(),
                scores_LR['test_f1'].mean(),
               scores_LR['test_roc_auc'].mean(),]
    RF_list = [scores_RF['test_accuracy'].mean(),
                scores_RF['test_recall'].mean(),
                scores_RF['test_precision'].mean(),
                scores_RF['test_f1'].mean(),
               scores_RF['test_roc_auc'].mean()]
    total_score.append(SVM_list)
    total_score.append(LR_list)
    total_score.append(RF_list)
    df_scores = pd.DataFrame(total_score, columns=["accuracy", "recall", "precision", "f1_score",'AUC'],index=['SVM','LR','RF'])
    return df_scores

## Model 1: value

In [386]:
df_value.head()

Unnamed: 0,label,filtered_noun
0,1.0,價格 合理 舒適 房間 老闆娘 人 好 做 早餐 旅客 重點 早餐 吃到飽
1,0.0,內部 房間 乾淨 場地 團體 使用 覺得 棒
2,0.0,房間 小 美中不足 乾 濕 分離
3,1.0,房子 設計 棒 房間 採光 好 大廳 挑高 氣派 房價 合理 台東 住 民宿
4,1.0,Cp值 高 乾淨 舒適 空間 大樓 下 免費 吐司 咖啡 老闆 回復 速度


In [387]:
x, y = split_data(df_value)
scores_SVM = train_SVM(x,y)
scores_LR = train_LR(x,y)
scores_RF = train_RF(x,y)

In [388]:
show_output(scores_SVM,scores_LR,scores_RF)

Unnamed: 0,accuracy,recall,precision,f1_score,AUC
SVM,0.910402,0.69245,0.777919,0.726484,0.903338
LR,0.905173,0.688746,0.747411,0.715407,0.898178
RF,0.867909,0.402991,0.755979,0.513457,0.834358
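When reading this table, note that every column is a mean over the 10 cross-validation folds, so the `f1_score` column is not exactly the harmonic mean of the `precision` and `recall` columns. The sketch below, using the SVM row above, shows the small gap; the helper `f1_from` is a hypothetical name introduced for illustration.

```python
def f1_from(precision, recall):
    """Harmonic mean of precision and recall (0.0 when both are zero)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)

# Fold-averaged SVM scores for the value model, copied from the table above.
svm_precision, svm_recall, svm_f1_reported = 0.777919, 0.69245, 0.726484

svm_f1_from_means = f1_from(svm_precision, svm_recall)
print(round(svm_f1_from_means, 4), svm_f1_reported)
```

The table also shows why accuracy alone is a weak guide here: RF reaches about 0.87 accuracy while recalling only about 40% of the positive reviews.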


## Model 2: comfort

In [389]:
df_comfort.head()

Unnamed: 0,label,filtered_noun
0,1.0,價格 合理 舒適 房間 老闆娘 人 好 做 早餐 旅客 重點 早餐 吃到飽
1,1.0,內部 房間 乾淨 場地 團體 使用 覺得 棒
2,1.0,房間 小 美中不足 乾 濕 分離
3,1.0,房子 設計 棒 房間 採光 好 大廳 挑高 氣派 房價 合理 台東 住 民宿
4,1.0,Cp值 高 乾淨 舒適 空間 大樓 下 免費 吐司 咖啡 老闆 回復 速度


In [390]:
x, y = split_data(df_comfort)
scores_SVM = train_SVM(x,y)
scores_LR = train_LR(x,y)
scores_RF = train_RF(x,y)

In [391]:
show_output(scores_SVM,scores_LR,scores_RF)

Unnamed: 0,accuracy,recall,precision,f1_score,AUC
SVM,0.752116,0.769676,0.834968,0.799402,0.824699
LR,0.765871,0.767646,0.857291,0.807632,0.839008
RF,0.692647,0.609256,0.879216,0.716653,0.80091


## Model 3: location

In [392]:
df_location.head()

Unnamed: 0,label,filtered_noun
0,0.0,價格 合理 舒適 房間 老闆娘 人 好 做 早餐 旅客 重點 早餐 吃到飽
1,0.0,內部 房間 乾淨 場地 團體 使用 覺得 棒
2,0.0,房間 小 美中不足 乾 濕 分離
3,0.0,房子 設計 棒 房間 採光 好 大廳 挑高 氣派 房價 合理 台東 住 民宿
4,0.0,Cp值 高 乾淨 舒適 空間 大樓 下 免費 吐司 咖啡 老闆 回復 速度


In [393]:
x, y = split_data(df_location)
scores_SVM = train_SVM(x,y)
scores_LR = train_LR(x,y)
scores_RF = train_RF(x,y)

In [394]:
show_output(scores_SVM,scores_LR,scores_RF)

Unnamed: 0,accuracy,recall,precision,f1_score,AUC
SVM,0.872454,0.754348,0.810912,0.779634,0.910829
LR,0.875056,0.767391,0.811933,0.787581,0.915442
RF,0.81818,0.769565,0.677654,0.718379,0.881553


## Model 4: cleanliness

In [395]:
df_cleanliness.head()

Unnamed: 0,label,filtered_noun
0,0.0,價格 合理 舒適 房間 老闆娘 人 好 做 早餐 旅客 重點 早餐 吃到飽
1,1.0,內部 房間 乾淨 場地 團體 使用 覺得 棒
2,0.0,房間 小 美中不足 乾 濕 分離
3,0.0,房子 設計 棒 房間 採光 好 大廳 挑高 氣派 房價 合理 台東 住 民宿
4,0.0,Cp值 高 乾淨 舒適 空間 大樓 下 免費 吐司 咖啡 老闆 回復 速度


In [396]:
x, y = split_data(df_cleanliness)
scores_SVM = train_SVM(x,y)
scores_LR = train_LR(x,y)
scores_RF = train_RF(x,y)

In [397]:
show_output(scores_SVM,scores_LR,scores_RF)

Unnamed: 0,accuracy,recall,precision,f1_score,AUC
SVM,0.964044,0.888232,0.973694,0.927889,0.976417
LR,0.938528,0.808841,0.951103,0.87357,0.971489
RF,0.885548,0.674451,0.863391,0.749991,0.93435


## Model 5: service

In [398]:
df_service.head()

Unnamed: 0,label,filtered_noun
0,1.0,價格 合理 舒適 房間 老闆娘 人 好 做 早餐 旅客 重點 早餐 吃到飽
1,0.0,內部 房間 乾淨 場地 團體 使用 覺得 棒
2,0.0,房間 小 美中不足 乾 濕 分離
3,0.0,房子 設計 棒 房間 採光 好 大廳 挑高 氣派 房價 合理 台東 住 民宿
4,1.0,Cp值 高 乾淨 舒適 空間 大樓 下 免費 吐司 咖啡 老闆 回復 速度


In [399]:
x, y = split_data(df_service)
scores_SVM = train_SVM(x,y)
scores_LR = train_LR(x,y)
scores_RF = train_RF(x,y)

In [400]:
show_output(scores_SVM,scores_LR,scores_RF)

Unnamed: 0,accuracy,recall,precision,f1_score,AUC
SVM,0.888145,0.891304,0.922456,0.905691,0.947073
LR,0.889469,0.870652,0.94277,0.904527,0.951755
RF,0.847609,0.78913,0.950145,0.861413,0.928545
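As a quick way to compare the five result tables, the sketch below collects the fold-averaged F1 scores reported above and picks the highest-scoring classifier per aspect. The numbers are copied from the tables; choosing F1 as the selection criterion is an assumption, and a different metric (for example AUC) could rank the models differently.

```python
# Fold-averaged F1 scores copied from the five result tables above.
f1_scores = {
    "value":       {"SVM": 0.726484, "LR": 0.715407, "RF": 0.513457},
    "comfort":     {"SVM": 0.799402, "LR": 0.807632, "RF": 0.716653},
    "location":    {"SVM": 0.779634, "LR": 0.787581, "RF": 0.718379},
    "cleanliness": {"SVM": 0.927889, "LR": 0.873570, "RF": 0.749991},
    "service":     {"SVM": 0.905691, "LR": 0.904527, "RF": 0.861413},
}

# For each aspect, keep the classifier with the highest mean F1.
best = {aspect: max(scores, key=scores.get) for aspect, scores in f1_scores.items()}

for aspect, model in best.items():
    print(f"{aspect}: {model} (F1 = {f1_scores[aspect][model]:.3f})")
```

By this criterion SVM and LR trade places across aspects, often by a narrow margin (service differs only in the third decimal place), while RF trails on all five.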
