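## Candidate approaches

Approach 1: TF-IDF + machine-learning classifier
Extract features from the text with TF-IDF and feed them to a classifier. Candidate classifiers include SVM, LR (logistic regression), and XGBoost.

Approach 2: FastText
FastText is the entry-level option for word vectors. The FastText tool released by Facebook makes it quick to build a classifier.

Approach 3: Word2Vec + deep-learning classifier
Word2Vec is the intermediate option for word vectors; classification is done by a deep-learning model built on top of them. Candidate network architectures include TextCNN, TextRNN, and BiLSTM.

Approach 4: BERT word vectors
BERT is the high-end option for word vectors, with strong modelling capacity.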
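
To make Approach 1 more concrete, the sketch below computes TF-IDF weights by hand for a few toy, slash-separated reviews. It follows the weighting that scikit-learn's `TfidfVectorizer` applies with its default settings (smoothed IDF, L2-normalised rows), but it splits on `/` rather than using the vectorizer's own tokenizer. The toy documents and the helper name `tfidf` are illustrative assumptions, not part of the original notebook.

```python
import math
from collections import Counter

def tfidf(docs):
    """Return one {term: weight} dict per document.

    Uses idf = ln((1 + n) / (1 + df)) + 1 and L2-normalises each row.
    """
    n = len(docs)
    tokenized = [doc.split('/') for doc in docs]
    # Document frequency: in how many documents each term occurs
    df = Counter(term for tokens in tokenized for term in set(tokens))
    idf = {t: math.log((1 + n) / (1 + c)) + 1 for t, c in df.items()}
    rows = []
    for tokens in tokenized:
        tf = Counter(tokens)
        raw = {t: tf[t] * idf[t] for t in tf}
        norm = math.sqrt(sum(w * w for w in raw.values()))
        rows.append({t: w / norm for t, w in raw.items()})
    return rows

# Toy documents in the same "token/token/..." layout as filtered_word
docs = ["房間/乾淨/舒適", "房間/價格/合理", "房間/服務/親切"]
weights = tfidf(docs)
for row in weights:
    print(row)
```

A term that occurs in every document (here `房間`) receives a lower weight than terms that occur in only one, which is the property the classifiers in Approach 1 rely on.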

## Preparing the datasets for the 6 models

In [1]:
import matplotlib.pyplot as plt
import os
import pandas as pd
import numpy as np
import re
path = '../data/0414/review_data(seg+pos+stopwords)_n+v+f+p+a.csv'
df = pd.read_csv(path)

### Checking for duplicates and missing values

In [2]:
# Print duplicated rows
print(df[df.duplicated()])

Empty DataFrame
Columns: [reviews, value, comfort, location, cleanliness, service, facilities, ws_pos_reviews, filtered_word]
Index: []


In [3]:
# Remove duplicates
#df = df.drop_duplicates()
#print(df.shape)

In [4]:
# Print rows that contain missing values
#df_train[df_train.isnull().T.any()]

### Splitting into 6 datasets

In [5]:
def split_df(df):
    df_value = df[['value','filtered_word']]
    df_value.rename(columns={'value': 'label'}, inplace=True)
    df_comfort = df[['comfort','filtered_word']]
    df_comfort.rename(columns={'comfort': 'label'}, inplace=True)
    df_location = df[['location','filtered_word']]
    df_location.rename(columns={'location': 'label'}, inplace=True)
    df_cleanliness = df[['cleanliness','filtered_word']]
    df_cleanliness.rename(columns={'cleanliness': 'label'}, inplace=True)
    df_service = df[['service','filtered_word']]
    df_service.rename(columns={'service': 'label'}, inplace=True)
    df_facilities = df[['facilities','filtered_word']]
    df_facilities.rename(columns={'facilities': 'label'}, inplace=True)
    return df_value, df_comfort, df_location, df_cleanliness, df_service, df_facilities

In [6]:
#df_value_train, df_comfort_train, df_location_train, df_cleanliness_train, df_service_train ,df_facilities_train = split_df(df_train)
#df_value_test, df_comfort_test, df_location_test, df_cleanliness_test, df_service_test, df_facilities_test = split_df(df_test)
df_value, df_comfort, df_location, df_cleanliness, df_service ,df_facilities = split_df(df)

A value is trying to be set on a copy of a slice from a DataFrame

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  df_value.rename(columns={'value': 'label'}, inplace=True)
A value is trying to be set on a copy of a slice from a DataFrame

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  df_comfort.rename(columns={'comfort': 'label'}, inplace=True)
A value is trying to be set on a copy of a slice from a DataFrame

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  df_location.rename(columns={'location': 'label'}, inplace=True)
A value is trying to be set on a copy of a slice from a DataFrame

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-
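
The `SettingWithCopyWarning` messages above are triggered by calling `rename(..., inplace=True)` on a column subset of `df`. A minimal sketch of a split that avoids in-place modification follows. The helper name `split_by_aspect`, the `ASPECTS` list, and the toy table are assumptions for illustration; the column names are taken from the `Columns:` output shown earlier.

```python
import pandas as pd

ASPECTS = ['value', 'comfort', 'location', 'cleanliness', 'service', 'facilities']

def split_by_aspect(df, aspects=ASPECTS, text_col='filtered_word'):
    """Return {aspect: DataFrame with columns ['label', text_col]}."""
    # rename() without inplace=True returns a new DataFrame, so nothing is
    # written back into a slice of df.
    return {
        aspect: df[[aspect, text_col]].rename(columns={aspect: 'label'})
        for aspect in aspects
    }

# Tiny stand-in for the real review table (values are made up)
toy = pd.DataFrame({
    'value': [1.0, 0.0], 'comfort': [0.0, 1.0], 'location': [0.0, 0.0],
    'cleanliness': [1.0, 1.0], 'service': [0.0, 1.0], 'facilities': [0.0, 0.0],
    'filtered_word': ['價格(Na)/合理(VH)', '房間(Nc)/乾淨(VH)'],
})
parts = split_by_aspect(toy)
print(parts['value'])
```

Returning a dictionary keyed by aspect also removes the need to unpack six variables in a fixed order.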

In [7]:
df_value

Unnamed: 0,label,filtered_word
0,1.0,價格(Na)/合理(VH)/舒適(VH)/房間(Nc)/老闆娘(Na)/人(Na)/好(VH...
1,0.0,內部(Ncd)/房間(Nc)/乾淨(VH)/有(V_2)/大(VH)/場地(Na)/讓(VL...
2,0.0,隔音(A)/房間(Nc)/小(VH)/美中不足(VH)/沒有(VJ)/乾(VH)/濕(VH)...
3,1.0,房子(Na)/設計(VC)/棒(VH)/房間(Nc)/採光(Na)/好(VH)/大廳(Nc)...
4,1.0,Cp值(FW)/高(VH)/乾淨(VH)/舒適(VH)/空間(Na)/大(VH)/樓下(Nc...
...,...,...
1267,0.0,港式(Na)/飲茶(VA)/餐廳(Nc)/口味(Na)/棒(VH)/環境(Na)/乾淨(VH...
1268,0.0,場地(Na)/氣派(VH)/丁香魚(Na)/酥脆(VH)/服務(Nv)/親切(VH)/蠟味(...
1269,0.0,交通(Na)/方便(VH)/地下室(Nc)/停車場(Nc)/良好(VH)/菜色(Na)/好(...
1270,0.0,地點(Na)/佳(VH)/離(P)/逢甲(Nb)/夜市(Nc)/近(VH)/老闆娘(Na)/...


### Cleaning the data (removing part-of-speech tags)

In [8]:
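def remove_N_comma(sentence):
    # Strip the trailing (N..), (V..), (F..), (P..) part-of-speech tags
    sentence = str(sentence)
    # Note: the commas inside [...] are literal characters, and tags that start
    # with any other letter, such as (A), are left in place
    pattern = re.compile(r"\([N,V,F,P].*?\)")  # remove POS tags
    sentence = re.sub(pattern, '', sentence)
    pattern = re.compile(r",")  # replace commas with spaces
    sentence = re.sub(pattern, ' ', sentence)
    return sentence
pd.options.mode.chained_assignment = None  # silence SettingWithCopyWarning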
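
The character class in `remove_N_comma` only covers tags beginning with `N`, `V`, `F`, or `P`, which is why `隔音(A)` keeps its tag in the output further down. As a hedged alternative, the sketch below strips any parenthesised tag made of ASCII letters, digits, and underscores. That tag shape is an assumption inferred from the samples shown in this notebook, and `strip_pos_tags` is a hypothetical helper name.

```python
import re

ORIGINAL = re.compile(r"\([N,V,F,P].*?\)")          # pattern used in remove_N_comma
ANY_TAG = re.compile(r"\([A-Za-z][A-Za-z0-9_]*\)")  # assumed tag shape: ASCII letters, digits, underscore

def strip_pos_tags(sentence):
    """Remove every parenthesised ASCII tag such as (Na), (VH), (V_2) or (A)."""
    return ANY_TAG.sub('', str(sentence))

sample = "隔音(A)/房間(Nc)/有(V_2)/小(VH)"
print(ORIGINAL.sub('', sample))   # the (A) tag survives
print(strip_pos_tags(sample))     # all tags are removed
```

Switching to this pattern would change the extracted features, so the metrics reported later in the notebook would no longer correspond exactly.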

In [9]:
# Clean the text column of the value dataset
df_value['filtered_word'] = df_value.apply(lambda x: remove_N_comma(x['filtered_word']),axis=1)

In [10]:
print(df_value.shape)
df_value

(1272, 2)


Unnamed: 0,label,filtered_word
0,1.0,價格/合理/舒適/房間/老闆娘/人/好/自己/做/早餐/給/旅客/重點/早餐/吃到飽
1,0.0,內部/房間/乾淨/有/大/場地/讓/團體/使用/我/覺得/棒
2,0.0,隔音(A)/房間/小/美中不足/沒有/乾/濕/分離
3,1.0,房子/設計/棒/房間/採光/好/大廳/挑高/氣派/房價/合理/台東/住/民宿
4,1.0,Cp值/高/乾淨/舒適/空間/大/樓下/有/免費/吐司/咖啡/老闆/回復/速度/快
...,...,...
1267,0.0,港式/飲茶/餐廳/口味/棒/環境/乾淨/機車/汽車/有/停車位/位於/高鐵/附近/適合/宴客
1268,0.0,場地/氣派/丁香魚/酥脆/服務/親切/蠟味/蘿蔔糕/份量/多/一些/好/牛肉粥/好吃
1269,0.0,交通/方便/地下室/停車場/良好/菜色/好/空間/設計好/說/一流/飯店
1270,0.0,地點/佳/離/逢甲/夜市/近/老闆娘/親切/服務/房間/大/舒適/浴室/乾淨


## Model architecture

### Imports

In [11]:
# Imports
# Vectorization
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_extraction.text import TfidfTransformer
from sklearn.feature_extraction.text import TfidfVectorizer
from scipy.sparse import coo_matrix

from imblearn.pipeline import Pipeline, make_pipeline
from sklearn.model_selection import cross_validate
import pickle  # for saving models
from sklearn.model_selection import train_test_split
# Class resampling
import imblearn.over_sampling as over_sampling
import imblearn.under_sampling as under_sampling
import imblearn.combine as combine
from imblearn.pipeline import make_pipeline as make_pipeline_imb


# Models
from sklearn.metrics import classification_report  # shadowed by the custom classification_report defined below
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier
from sklearn import svm
from sklearn.ensemble import AdaBoostClassifier
#from xgboost.sklearn import XGBClassifier

# Model performance metrics
import sklearn.metrics as metrics

### Reporting results

In [12]:
def classification_report(y_test, pre):
    # Note: this definition replaces sklearn's classification_report in this namespace
    # Confusion matrix (rows: true class, columns: predicted class)
    confusion = metrics.confusion_matrix(y_test, pre)
    TP = confusion[1,1]
    TN = confusion[0,0]
    FP = confusion[0,1]
    FN = confusion[1,0]
    print("TP:",TP)
    print("TN:",TN)
    print("FP:",FP)
    print("FN:",FN)
    # Accuracy
    accuracy = (TP+TN)/float(TP+TN+FN+FP)
    print("Accuracy：", accuracy)
    # Sensitivity (recall)
    recall = TP/float(TP+FN)
    print("Recall：", recall)
    # Specificity
    specificity = TN/float(TN+FP)
    print("Specificity：", specificity)
    # Precision (nan when the model predicts no positives, i.e. TP+FP == 0)
    precision = TP/float(TP+FP)
    print("Precision：", precision)
    # F1 score
    f1_score = ((2*precision*recall)/(precision+recall))
    print("f1_score：", f1_score)
    # AUC, computed from hard 0/1 predictions, so it equals (recall + specificity) / 2
    print("AUC：", metrics.roc_auc_score(y_test, pre))
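
The report above divides by `TP+FP` without a guard, so a model that never predicts the positive class yields `nan` for precision and F1. The sketch below recomputes the same quantities from raw counts with a zero-denominator guard. Returning `0.0` in that case is a convention chosen here for illustration, and `safe_div` and `metrics_from_counts` are hypothetical helper names.

```python
def safe_div(num, den):
    """Return num / den, or 0.0 when the denominator is zero."""
    return num / den if den else 0.0

def metrics_from_counts(tp, tn, fp, fn):
    recall = safe_div(tp, tp + fn)
    specificity = safe_div(tn, tn + fp)
    precision = safe_div(tp, tp + fp)
    return {
        'accuracy': safe_div(tp + tn, tp + tn + fp + fn),
        'recall': recall,
        'specificity': specificity,
        'precision': precision,
        'f1': safe_div(2 * precision * recall, precision + recall),
        # With hard 0/1 predictions the ROC curve has a single operating
        # point, so its area equals the mean of recall and specificity.
        'balanced_accuracy': (recall + specificity) / 2,
    }

# A classifier that never predicts the positive class (hypothetical counts)
print(metrics_from_counts(tp=0, tn=90, fp=0, fn=10))
```

Because `roc_auc_score` receives predicted labels rather than scores, the "AUC" line in this notebook is best read as balanced accuracy. A threshold-independent AUC would need `decision_function` or `predict_proba` outputs instead.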

In [13]:
df_value

Unnamed: 0,label,filtered_word
0,1.0,價格/合理/舒適/房間/老闆娘/人/好/自己/做/早餐/給/旅客/重點/早餐/吃到飽
1,0.0,內部/房間/乾淨/有/大/場地/讓/團體/使用/我/覺得/棒
2,0.0,隔音(A)/房間/小/美中不足/沒有/乾/濕/分離
3,1.0,房子/設計/棒/房間/採光/好/大廳/挑高/氣派/房價/合理/台東/住/民宿
4,1.0,Cp值/高/乾淨/舒適/空間/大/樓下/有/免費/吐司/咖啡/老闆/回復/速度/快
...,...,...
1267,0.0,港式/飲茶/餐廳/口味/棒/環境/乾淨/機車/汽車/有/停車位/位於/高鐵/附近/適合/宴客
1268,0.0,場地/氣派/丁香魚/酥脆/服務/親切/蠟味/蘿蔔糕/份量/多/一些/好/牛肉粥/好吃
1269,0.0,交通/方便/地下室/停車場/良好/菜色/好/空間/設計好/說/一流/飯店
1270,0.0,地點/佳/離/逢甲/夜市/近/老闆娘/親切/服務/房間/大/舒適/浴室/乾淨


### Splitting into training and test data

In [14]:
from sklearn.model_selection import train_test_split
def split_label(df,seed):
    X = df.filtered_word.tolist()
    y = df.label
    X_train,X_test,y_train,y_test=train_test_split(X,y,test_size=1/3,random_state=seed)
    return X_train,X_test,y_train,y_test

In [15]:
X_train_1,X_test_1,y_train_1,y_test_1 = split_label(df_value,1)
X_train_2,X_test_2,y_train_2,y_test_2 = split_label(df_value,2)
X_train_3,X_test_3,y_train_3,y_test_3 = split_label(df_value,3)

In [16]:
s1 = pd.Series(y_train_1)
freq1 = s1.value_counts() 
print(freq1) 
s2 = pd.Series(y_train_2)
freq2 = s2.value_counts() 
print(freq2) 
s3 = pd.Series(y_train_3)
freq3 = s3.value_counts() 
print(freq3) 

0.0    688
1.0    160
Name: label, dtype: int64
0.0    697
1.0    151
Name: label, dtype: int64
0.0    697
1.0    151
Name: label, dtype: int64
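
The counts above show that the number of positive training samples shifts with the seed (160 versus 151). If a constant class ratio across splits is wanted, `train_test_split` accepts a `stratify` argument. The sketch below is a variant of `split_label` under that assumption; the function name `split_label_stratified` and the toy data are illustrative only.

```python
from collections import Counter
from sklearn.model_selection import train_test_split

def split_label_stratified(texts, labels, seed):
    """Like split_label, but keeps the class ratio the same in both parts."""
    return train_test_split(
        texts, labels, test_size=1/3, random_state=seed, stratify=labels
    )

# Toy data: 24 negatives and 6 positives (20% positive)
texts = [f"doc{i}" for i in range(30)]
labels = [0] * 24 + [1] * 6
for seed in (1, 2, 3):
    X_tr, X_te, y_tr, y_te = split_label_stratified(texts, labels, seed)
    print(seed, Counter(y_tr), Counter(y_te))
```

With stratification the seed only changes which samples land in each part, not how many of each class.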


### Model design

In [17]:
stop_words_list = ['ㄧ些', '一下', '一些','一點','一點點','bb','9oo']

#### (1) baseline

In [18]:
def SVM_model(X_train,X_test,y_train,y_test):
    print("SVM baseline")
    # Split the dataset (now done by the caller)
    #X_train,y_train,X_test,y_test = split_label(df_train, df_test)
    # Model definition
    model = make_pipeline_imb(TfidfVectorizer(max_df=0.8,min_df=5,stop_words=stop_words_list,max_features=1400), svm.SVC(kernel='linear'))
    model.fit(X_train, y_train)
    # Predict
    pre = model.predict(X_test)
    # Evaluate
    classification_report(y_test, pre)
    print("\n")
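
One detail worth checking in these pipelines is tokenization. `TfidfVectorizer` with default settings uses the token pattern `(?u)\b\w\w+\b`, which keeps only tokens of two or more word characters, so single-character words in the slash-separated text (for example `好` or `棒`) never become features. The sketch below compares the default analyzer with a hypothetical `slash_tokenizer`; whether single-character tokens help these classifiers is not something the notebook tests.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

def slash_tokenizer(text):
    """Split on '/' and drop empty pieces (assumes slash-separated tokens)."""
    return [tok for tok in text.split('/') if tok]

sample = "價格/合理/好"

default_analyzer = TfidfVectorizer().build_analyzer()
custom_analyzer = TfidfVectorizer(
    tokenizer=slash_tokenizer, token_pattern=None
).build_analyzer()

print(default_analyzer(sample))  # the single-character token is dropped
print(custom_analyzer(sample))   # all three tokens are kept
```

Passing a custom tokenizer would change the vocabulary and therefore the metrics reported below.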

In [19]:
def LR_model(X_train,X_test,y_train,y_test):
    print("LR baseline")
    # Split the dataset (now done by the caller)
    #X_train,y_train,X_test,y_test = split_label(df_train, df_test)
    # Model definition
    model = make_pipeline_imb(TfidfVectorizer(max_df=0.8,min_df=5,stop_words=stop_words_list,max_features=1400), LogisticRegression(random_state=0))
    model.fit(X_train, y_train)
    # Predict
    pre = model.predict(X_test)
    # Evaluate
    classification_report(y_test, pre)
    print("\n")

In [20]:
def RF_model(X_train,X_test,y_train,y_test):
    print("RF baseline")
    # Split the dataset (now done by the caller)
    #X_train,y_train,X_test,y_test = split_label(df_train, df_test)
    # Model definition
    model = make_pipeline_imb(TfidfVectorizer(max_df=0.8,min_df=5,stop_words=stop_words_list,max_features=1400), RandomForestClassifier(max_depth=2, random_state=0))
    model.fit(X_train, y_train)
    # Predict
    pre = model.predict(X_test)
    # Evaluate
    classification_report(y_test, pre)
    print("\n")

In [21]:
def AdaBoost_model(X_train,X_test,y_train,y_test):
    print("AdaBoost_model")
    
    # Model definition
    model = make_pipeline_imb(TfidfVectorizer(stop_words=stop_words_list,max_features=1400), AdaBoostClassifier(n_estimators=200))
    model.fit(X_train, y_train)
    
    # Get all terms in the bag-of-words vocabulary
    # (newer scikit-learn releases expose this as get_feature_names_out())
    feature_names = model.named_steps["tfidfvectorizer"].get_feature_names()
    #print(feature_names)
    
    # Extract the tf-idf matrix; element w[i][j] is the tf-idf weight of term j in text i
    #weight = model.named_steps["tfidfvectorizer"].fit_transform(X_train)
    #print(weight)
    
    
    # Predict
    pre = model.predict(X_test)
    # Evaluate
    classification_report(y_test, pre)
    print("\n")

In [22]:
AdaBoost_model(X_train_1,X_test_1,y_train_1,y_test_1)

AdaBoost_model
TP: 47
TN: 346
FP: 15
FN: 16
Accuracy： 0.9268867924528302
Recall： 0.746031746031746
Specificity： 0.9584487534626038
Precision： 0.7580645161290323
f1_score： 0.752
AUC： 0.8522402497471749




#### (2) Resampling to address class imbalance (SVM)

In [23]:
def SVM_model2(X_train,X_test,y_train,y_test):
    print("ADASYN")
    #切分數據集
    #X_train,y_train,X_test,y_test = split_label(df_train, df_test)
    #模型架構
    model = make_pipeline_imb(TfidfVectorizer(stop_words=stop_words_list,max_features=1400), over_sampling.ADASYN(), svm.SVC(kernel='linear'))
    model.fit(X_train, y_train)
    
    #模型預測
    pre = model.predict(X_test)
    #麼行評估
    classification_report(y_test, pre)
    print("\n")

In [24]:
def SVM_model3(X_train,X_test,y_train,y_test):
    print("SMOTE")
    # Split the dataset (now done by the caller)
    #X_train,y_train,X_test,y_test = split_label(df_train, df_test)
    # Model definition
    model = make_pipeline_imb(TfidfVectorizer(stop_words=stop_words_list,max_features=1400), over_sampling.SMOTE(), svm.SVC(kernel='linear'))
    model.fit(X_train, y_train)
    # Predict
    pre = model.predict(X_test)
    # Evaluate
    classification_report(y_test, pre)
    print("\n")

In [25]:
def SVM_model4(X_train,X_test,y_train,y_test):
    print("RandomOverSampler")
    # Split the dataset (now done by the caller)
    #X_train,y_train,X_test,y_test = split_label(df_train, df_test)
    # Model definition
    model = make_pipeline_imb(TfidfVectorizer(stop_words=stop_words_list,max_features=1400), over_sampling.RandomOverSampler(), svm.SVC(kernel='linear'))
    model.fit(X_train, y_train)
    # Predict
    pre = model.predict(X_test)
    # Evaluate
    classification_report(y_test, pre)
    print("\n")

In [26]:
def SVM_model5(X_train,X_test,y_train,y_test):
    print("RandomUnderSampler")
    # Split the dataset (now done by the caller)
    #X_train,y_train,X_test,y_test = split_label(df_train, df_test)
    # Model definition
    model = make_pipeline_imb(TfidfVectorizer(stop_words=stop_words_list,max_features=1400), under_sampling.RandomUnderSampler(), svm.SVC(kernel='linear'))
    model.fit(X_train, y_train)
    # Predict
    pre = model.predict(X_test)
    # Evaluate
    classification_report(y_test, pre)
    print("\n")
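
To illustrate what random oversampling does conceptually, the sketch below duplicates minority-class samples, drawn with replacement, until both classes have the same size. It is a plain-Python illustration, not imbalanced-learn's implementation, and `random_oversample` is a hypothetical helper. The class sizes mirror the first training split shown earlier.

```python
import random
from collections import Counter

def random_oversample(X, y, seed=0):
    """Duplicate minority-class samples (with replacement) until every class
    has as many samples as the largest class."""
    rng = random.Random(seed)
    counts = Counter(y)
    target = max(counts.values())
    X_out, y_out = list(X), list(y)
    for label, count in counts.items():
        pool = [x for x, lab in zip(X, y) if lab == label]
        extra = [rng.choice(pool) for _ in range(target - count)]
        X_out.extend(extra)
        y_out.extend([label] * len(extra))
    return X_out, y_out

# Same class sizes as the first training split: 688 negatives, 160 positives
X = [f"neg{i}" for i in range(688)] + [f"pos{i}" for i in range(160)]
y = [0.0] * 688 + [1.0] * 160
X_res, y_res = random_oversample(X, y)
print(Counter(y_res))
```

In the pipelines above the sampler sits between the vectorizer and the classifier. An imbalanced-learn pipeline applies samplers only while fitting, so the test data is predicted without resampling.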

#### (3) Resampling to address class imbalance (AdaBoost)

In [27]:
def AdaBoost_model2(X_train,X_test,y_train,y_test):
    print("ADASYN")
    # Split the dataset (now done by the caller)
    #X_train,y_train,X_test,y_test = split_label(df_train, df_test)
    # Model definition
    model = make_pipeline_imb(TfidfVectorizer(stop_words=stop_words_list,max_features=1400), over_sampling.ADASYN(), AdaBoostClassifier(n_estimators=200))
    model.fit(X_train, y_train)
    # Predict
    pre = model.predict(X_test)
    # Evaluate
    classification_report(y_test, pre)
    print("\n")

In [28]:
def AdaBoost_model3(X_train,X_test,y_train,y_test):
    print("SMOTE")
    # Split the dataset (now done by the caller)
    #X_train,y_train,X_test,y_test = split_label(df_train, df_test)
    # Model definition
    model = make_pipeline_imb(TfidfVectorizer(stop_words=stop_words_list,max_features=1400), over_sampling.SMOTE(), AdaBoostClassifier(n_estimators=200))
    model.fit(X_train, y_train)
    # Predict
    pre = model.predict(X_test)
    # Evaluate
    classification_report(y_test, pre)
    print("\n")

In [29]:
def AdaBoost_model4(X_train,X_test,y_train,y_test):
    print("RandomOverSampler")
    # Split the dataset (now done by the caller)
    #X_train,y_train,X_test,y_test = split_label(df_train, df_test)
    # Model definition
    model = make_pipeline_imb(TfidfVectorizer(stop_words=stop_words_list,max_features=1400), over_sampling.RandomOverSampler(), AdaBoostClassifier(n_estimators=200))
    model.fit(X_train, y_train)
    # Predict
    pre = model.predict(X_test)
    # Evaluate
    classification_report(y_test, pre)
    print("\n")

In [30]:
def AdaBoost_model5(X_train,X_test,y_train,y_test):
    print("RandomUnderSampler")
    # Split the dataset (now done by the caller)
    #X_train,y_train,X_test,y_test = split_label(df_train, df_test)
    # Model definition
    model = make_pipeline_imb(TfidfVectorizer(stop_words=stop_words_list,max_features=1400), under_sampling.RandomUnderSampler(), AdaBoostClassifier(n_estimators=200))
    model.fit(X_train, y_train)
    # Predict
    pre = model.predict(X_test)
    # Evaluate
    classification_report(y_test, pre)
    print("\n")

## Model training & results

### Dataset 1

#### baseline

In [31]:
SVM_model(X_train_1,X_test_1,y_train_1,y_test_1)
LR_model(X_train_1,X_test_1,y_train_1,y_test_1)
RF_model(X_train_1,X_test_1,y_train_1,y_test_1)
AdaBoost_model(X_train_1,X_test_1,y_train_1,y_test_1)

SVM baseline
TP: 43
TN: 359
FP: 2
FN: 20
Accuracy： 0.9481132075471698
Recall： 0.6825396825396826
Specificity： 0.9944598337950139
Precision： 0.9555555555555556
f1_score： 0.7962962962962964
AUC： 0.8384997581673482


LR baseline
TP: 18
TN: 361
FP: 0
FN: 45
Accuracy： 0.8938679245283019
Recall： 0.2857142857142857
Specificity： 1.0
Precision： 1.0
f1_score： 0.4444444444444445
AUC： 0.6428571428571428


RF baseline
TP: 0
TN: 361
FP: 0
FN: 63
Accuracy： 0.8514150943396226
Recall： 0.0
Specificity： 1.0
Precision： nan
f1_score： nan
AUC： 0.5


AdaBoost_model
TP: 47
TN: 351
FP: 10
FN: 16
Accuracy： 0.9386792452830188
Recall： 0.746031746031746
Specificity： 0.9722991689750693
Precision： 0.8245614035087719
f1_score： 0.7833333333333334
AUC： 0.8591654575034077




#### Handling class imbalance (SVM)

In [32]:
SVM_model2(X_train_1,X_test_1,y_train_1,y_test_1)
SVM_model3(X_train_1,X_test_1,y_train_1,y_test_1)
SVM_model4(X_train_1,X_test_1,y_train_1,y_test_1)
SVM_model5(X_train_1,X_test_1,y_train_1,y_test_1)

ADASYN
TP: 46
TN: 350
FP: 11
FN: 17
Accuracy： 0.9339622641509434
Recall： 0.7301587301587301
Specificity： 0.9695290858725761
Precision： 0.8070175438596491
f1_score： 0.7666666666666667
AUC： 0.8498439080156531


SMOTE
TP: 44
TN: 354
FP: 7
FN: 19
Accuracy： 0.9386792452830188
Recall： 0.6984126984126984
Specificity： 0.9806094182825484
Precision： 0.8627450980392157
f1_score： 0.7719298245614035
AUC： 0.8395110583476234


RandomOverSampler
TP: 46
TN: 352
FP: 9
FN: 17
Accuracy： 0.9386792452830188
Recall： 0.7301587301587301
Specificity： 0.9750692520775623
Precision： 0.8363636363636363
f1_score： 0.7796610169491525
AUC： 0.8526139911181463


RandomUnderSampler
TP: 47
TN: 330
FP: 31
FN: 16
Accuracy： 0.8891509433962265
Recall： 0.746031746031746
Specificity： 0.9141274238227147
Precision： 0.6025641025641025
f1_score： 0.6666666666666666
AUC： 0.8300795849272304




#### Handling class imbalance (AdaBoost)

In [33]:
AdaBoost_model2(X_train_1,X_test_1,y_train_1,y_test_1)
AdaBoost_model3(X_train_1,X_test_1,y_train_1,y_test_1)
AdaBoost_model4(X_train_1,X_test_1,y_train_1,y_test_1)
AdaBoost_model5(X_train_1,X_test_1,y_train_1,y_test_1)

ADASYN
TP: 48
TN: 342
FP: 19
FN: 15
Accuracy： 0.9198113207547169
Recall： 0.7619047619047619
Specificity： 0.9473684210526315
Precision： 0.7164179104477612
f1_score： 0.7384615384615385
AUC： 0.8546365914786967


SMOTE
TP: 46
TN: 347
FP: 14
FN: 17
Accuracy： 0.9268867924528302
Recall： 0.7301587301587301
Specificity： 0.961218836565097
Precision： 0.7666666666666667
f1_score： 0.7479674796747968
AUC： 0.8456887833619136


RandomOverSampler
TP: 49
TN: 343
FP: 18
FN: 14
Accuracy： 0.9245283018867925
Recall： 0.7777777777777778
Specificity： 0.9501385041551247
Precision： 0.7313432835820896
f1_score： 0.7538461538461538
AUC： 0.8639581409664512


RandomUnderSampler
TP: 51
TN: 324
FP: 37
FN: 12
Accuracy： 0.8844339622641509
Recall： 0.8095238095238095
Specificity： 0.8975069252077562
Precision： 0.5795454545454546
f1_score： 0.6754966887417219
AUC： 0.8535153673657828




### Dataset 2

#### baseline

In [34]:
SVM_model(X_train_2,X_test_2,y_train_2,y_test_2)
LR_model(X_train_2,X_test_2,y_train_2,y_test_2)
RF_model(X_train_2,X_test_2,y_train_2,y_test_2)
AdaBoost_model(X_train_2,X_test_2,y_train_2,y_test_2)

SVM baseline
TP: 49
TN: 352
FP: 0
FN: 23
Accuracy： 0.9457547169811321
Recall： 0.6805555555555556
Specificity： 1.0
Precision： 1.0
f1_score： 0.8099173553719008
AUC： 0.8402777777777778


LR baseline
TP: 20
TN: 352
FP: 0
FN: 52
Accuracy： 0.8773584905660378
Recall： 0.2777777777777778
Specificity： 1.0
Precision： 1.0
f1_score： 0.4347826086956522
AUC： 0.6388888888888888


RF baseline
TP: 0
TN: 352
FP: 0
FN: 72
Accuracy： 0.8301886792452831
Recall： 0.0
Specificity： 1.0
Precision： nan
f1_score： nan
AUC： 0.5


AdaBoost_model
TP: 57
TN: 340
FP: 12
FN: 15
Accuracy： 0.9363207547169812
Recall： 0.7916666666666666
Specificity： 0.9659090909090909
Precision： 0.8260869565217391
f1_score： 0.8085106382978724
AUC： 0.8787878787878788




#### Handling class imbalance (SVM)

In [35]:
SVM_model2(X_train_2,X_test_2,y_train_2,y_test_2)
SVM_model3(X_train_2,X_test_2,y_train_2,y_test_2)
SVM_model4(X_train_2,X_test_2,y_train_2,y_test_2)
SVM_model5(X_train_2,X_test_2,y_train_2,y_test_2)

ADASYN
TP: 57
TN: 343
FP: 9
FN: 15
Accuracy： 0.9433962264150944
Recall： 0.7916666666666666
Specificity： 0.9744318181818182
Precision： 0.8636363636363636
f1_score： 0.8260869565217391
AUC： 0.8830492424242424


SMOTE
TP: 55
TN: 346
FP: 6
FN: 17
Accuracy： 0.9457547169811321
Recall： 0.7638888888888888
Specificity： 0.9829545454545454
Precision： 0.9016393442622951
f1_score： 0.8270676691729323
AUC： 0.8734217171717171


RandomOverSampler
TP: 56
TN: 343
FP: 9
FN: 16
Accuracy： 0.9410377358490566
Recall： 0.7777777777777778
Specificity： 0.9744318181818182
Precision： 0.8615384615384616
f1_score： 0.8175182481751826
AUC： 0.8761047979797979


RandomUnderSampler
TP: 64
TN: 316
FP: 36
FN: 8
Accuracy： 0.8962264150943396
Recall： 0.8888888888888888
Specificity： 0.8977272727272727
Precision： 0.64
f1_score： 0.7441860465116279
AUC： 0.8933080808080808




#### Handling class imbalance (AdaBoost)

In [36]:
AdaBoost_model2(X_train_2,X_test_2,y_train_2,y_test_2)
AdaBoost_model3(X_train_2,X_test_2,y_train_2,y_test_2)
AdaBoost_model4(X_train_2,X_test_2,y_train_2,y_test_2)
AdaBoost_model5(X_train_2,X_test_2,y_train_2,y_test_2)

ADASYN
TP: 58
TN: 340
FP: 12
FN: 14
Accuracy： 0.9386792452830188
Recall： 0.8055555555555556
Specificity： 0.9659090909090909
Precision： 0.8285714285714286
f1_score： 0.8169014084507044
AUC： 0.8857323232323232


SMOTE
TP: 56
TN: 342
FP: 10
FN: 16
Accuracy： 0.9386792452830188
Recall： 0.7777777777777778
Specificity： 0.9715909090909091
Precision： 0.8484848484848485
f1_score： 0.8115942028985507
AUC： 0.8746843434343433


RandomOverSampler
TP: 57
TN: 345
FP: 7
FN: 15
Accuracy： 0.9481132075471698
Recall： 0.7916666666666666
Specificity： 0.9801136363636364
Precision： 0.890625
f1_score： 0.8382352941176471
AUC： 0.8858901515151514


RandomUnderSampler
TP: 60
TN: 329
FP: 23
FN: 12
Accuracy： 0.9174528301886793
Recall： 0.8333333333333334
Specificity： 0.9346590909090909
Precision： 0.7228915662650602
f1_score： 0.7741935483870969
AUC： 0.8839962121212123




### Dataset 3

#### baseline

In [37]:
SVM_model(X_train_3,X_test_3,y_train_3,y_test_3)
LR_model(X_train_3,X_test_3,y_train_3,y_test_3)
RF_model(X_train_3,X_test_3,y_train_3,y_test_3)
AdaBoost_model(X_train_3,X_test_3,y_train_3,y_test_3)

SVM baseline
TP: 49
TN: 351
FP: 1
FN: 23
Accuracy： 0.9433962264150944
Recall： 0.6805555555555556
Specificity： 0.9971590909090909
Precision： 0.98
f1_score： 0.8032786885245902
AUC： 0.8388573232323232


LR baseline
TP: 16
TN: 352
FP: 0
FN: 56
Accuracy： 0.8679245283018868
Recall： 0.2222222222222222
Specificity： 1.0
Precision： 1.0
f1_score： 0.3636363636363636
AUC： 0.6111111111111112


RF baseline
TP: 0
TN: 352
FP: 0
FN: 72
Accuracy： 0.8301886792452831
Recall： 0.0
Specificity： 1.0
Precision： nan
f1_score： nan
AUC： 0.5


AdaBoost_model
TP: 51
TN: 342
FP: 10
FN: 21
Accuracy： 0.9268867924528302
Recall： 0.7083333333333334
Specificity： 0.9715909090909091
Precision： 0.8360655737704918
f1_score： 0.7669172932330828
AUC： 0.8399621212121212




#### Handling class imbalance (SVM)

In [38]:
SVM_model2(X_train_3,X_test_3,y_train_3,y_test_3)
SVM_model3(X_train_3,X_test_3,y_train_3,y_test_3)
SVM_model4(X_train_3,X_test_3,y_train_3,y_test_3)
SVM_model5(X_train_3,X_test_3,y_train_3,y_test_3)

ADASYN
TP: 55
TN: 347
FP: 5
FN: 17
Accuracy： 0.9481132075471698
Recall： 0.7638888888888888
Specificity： 0.9857954545454546
Precision： 0.9166666666666666
f1_score： 0.8333333333333334
AUC： 0.8748421717171717


SMOTE
TP: 54
TN: 349
FP: 3
FN: 18
Accuracy： 0.9504716981132075
Recall： 0.75
Specificity： 0.9914772727272727
Precision： 0.9473684210526315
f1_score： 0.8372093023255814
AUC： 0.8707386363636364


RandomOverSampler
TP: 52
TN: 349
FP: 3
FN: 20
Accuracy： 0.9457547169811321
Recall： 0.7222222222222222
Specificity： 0.9914772727272727
Precision： 0.9454545454545454
f1_score： 0.8188976377952756
AUC： 0.8568497474747475


RandomUnderSampler
TP: 58
TN: 311
FP: 41
FN: 14
Accuracy： 0.8702830188679245
Recall： 0.8055555555555556
Specificity： 0.8835227272727273
Precision： 0.5858585858585859
f1_score： 0.6783625730994152
AUC： 0.8445391414141414




#### Handling class imbalance (AdaBoost)

In [39]:
AdaBoost_model2(X_train_3,X_test_3,y_train_3,y_test_3)
AdaBoost_model3(X_train_3,X_test_3,y_train_3,y_test_3)
AdaBoost_model4(X_train_3,X_test_3,y_train_3,y_test_3)
AdaBoost_model5(X_train_3,X_test_3,y_train_3,y_test_3)

ADASYN
TP: 51
TN: 346
FP: 6
FN: 21
Accuracy： 0.9363207547169812
Recall： 0.7083333333333334
Specificity： 0.9829545454545454
Precision： 0.8947368421052632
f1_score： 0.7906976744186046
AUC： 0.8456439393939393


SMOTE
TP: 51
TN: 348
FP: 4
FN: 21
Accuracy： 0.9410377358490566
Recall： 0.7083333333333334
Specificity： 0.9886363636363636
Precision： 0.9272727272727272
f1_score： 0.8031496062992126
AUC： 0.8484848484848486


RandomOverSampler
TP: 51
TN: 342
FP: 10
FN: 21
Accuracy： 0.9268867924528302
Recall： 0.7083333333333334
Specificity： 0.9715909090909091
Precision： 0.8360655737704918
f1_score： 0.7669172932330828
AUC： 0.8399621212121212


RandomUnderSampler
TP: 64
TN: 304
FP: 48
FN: 8
Accuracy： 0.8679245283018868
Recall： 0.8888888888888888
Specificity： 0.8636363636363636
Precision： 0.5714285714285714
f1_score： 0.6956521739130435
AUC： 0.8762626262626262
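
To compare configurations across the three splits, F1 can be recomputed directly from the confusion counts printed above. The sketch below does this for three of the configurations. The `(TP, FP, FN)` tuples are copied from the outputs in this notebook, while the helper name `f1_from_counts` and the choice of which rows to compare are illustrative.

```python
from statistics import mean

def f1_from_counts(tp, fp, fn):
    """F1 = 2*TP / (2*TP + FP + FN)."""
    return 2 * tp / (2 * tp + fp + fn)

# (TP, FP, FN) copied from the outputs above, one tuple per split
runs = {
    'SVM baseline': [(43, 2, 20), (49, 0, 23), (49, 1, 23)],
    'SVM + SMOTE': [(44, 7, 19), (55, 6, 17), (54, 3, 18)],
    'AdaBoost baseline': [(47, 10, 16), (57, 12, 15), (51, 10, 21)],
}
for name, counts in runs.items():
    scores = [f1_from_counts(*c) for c in counts]
    print(f"{name}: mean F1 = {mean(scores):.4f}")
```

Three splits give only a rough picture. Several estimators and samplers here are created without a fixed `random_state`, so rerunning the notebook can shift these numbers, as the two differing `AdaBoost_model` outputs on the first split already show.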


