## 幾種思路

思路1：TF-IDF + 機器學習分類器
直接使用TF-IDF對文本提取特徵，並使用分類器進行分類。在分類器的選擇上，可以使用SVM、LR、或者XGBoost。

思路2：FastText
FastText是入門款的詞向量，利用Facebook提供的FastText工具，可以快速構建出分類器。

思路3：WordVec + 深度學習分類器
WordVec是進階款的詞向量，並通過構建深度學習分類完成分類。深度學習分類的網絡結構可以選擇TextCNN、TextRNN或者BiLSTM。

思路4：Bert詞向量
Bert是高配款的詞向量，具有強大的建模學習能力。

## 獲取6種模型的資料集

In [1]:
import matplotlib.pyplot as plt
import os
import pandas as pd
import numpy as np
import re
path = '../data/0414/review_data(seg+pos+stopwords)_n+v+f+p.csv'
df = pd.read_csv(path)

### 檢查重複值、空值

In [2]:
#印出重複資料
print(df[df.duplicated()])

Empty DataFrame
Columns: [reviews, value, comfort, location, cleanliness, service, facilities, ws_pos_reviews, filtered, filtered_word]
Index: []


In [3]:
#移除重複值
#df = df.drop_duplicates()
#print(df.shape)

In [4]:
#印出空值資料
#df_train[df_train.isnull().T.any()]

### 切分為6個資料集

In [5]:
def split_df(df):
    df_value = df[['value','filtered_word']]
    df_value.rename(columns={'value': 'label'}, inplace=True)
    df_comfort = df[['comfort','filtered_word']]
    df_comfort.rename(columns={'comfort': 'label'}, inplace=True)
    df_location = df[['location','filtered_word']]
    df_location.rename(columns={'location': 'label'}, inplace=True)
    df_cleanliness = df[['cleanliness','filtered_word']]
    df_cleanliness.rename(columns={'cleanliness': 'label'}, inplace=True)
    df_service = df[['service','filtered_word']]
    df_service.rename(columns={'service': 'label'}, inplace=True)
    df_facilities = df[['facilities','filtered_word']]
    df_facilities.rename(columns={'facilities': 'label'}, inplace=True)
    return df_value, df_comfort, df_location, df_cleanliness, df_service, df_facilities

In [6]:
#df_value_train, df_comfort_train, df_location_train, df_cleanliness_train, df_service_train ,df_facilities_train = split_df(df_train)
#df_value_test, df_comfort_test, df_location_test, df_cleanliness_test, df_service_test, df_facilities_test = split_df(df_test)
df_value, df_comfort, df_location, df_cleanliness, df_service ,df_facilities = split_df(df)

A value is trying to be set on a copy of a slice from a DataFrame

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  df_value.rename(columns={'value': 'label'}, inplace=True)
A value is trying to be set on a copy of a slice from a DataFrame

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  df_comfort.rename(columns={'comfort': 'label'}, inplace=True)
A value is trying to be set on a copy of a slice from a DataFrame

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  df_location.rename(columns={'location': 'label'}, inplace=True)
A value is trying to be set on a copy of a slice from a DataFrame

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-

In [7]:
df_facilities

Unnamed: 0,label,filtered_word
0,0.0,"價格(Na),合理(VH),舒適(VH),房間(Nc),老闆娘(Na),人(Na),好(VH..."
1,1.0,"內部(Ncd),房間(Nc),乾淨(VH),場地(Na),團體(Na),使用(VC),覺得(..."
2,0.0,"房間(Nc),小(VH),美中不足(VH),乾(VH),濕(VH),分離(VHC)"
3,1.0,"房子(Na),設計(VC),棒(VH),房間(Nc),採光(Na),好(VH),大廳(Nc)..."
4,0.0,"Cp值(FW),高(VH),乾淨(VH),舒適(VH),空間(Na),大樓(Na),下(Nc..."
...,...,...
1267,1.0,"港式(Na),飲茶(VA),餐廳(Nc),口味(Na),棒(VH),環境(Na),乾淨(VH..."
1268,0.0,"場地(Na),氣派(Na),丁香魚(Na),酥脆(VH),服務(VC),親切(VH),蠟味(..."
1269,1.0,"交通(Na),方便(VH),地下室(Nc),停車場(Nc),良好(VH),菜色(Na),好(..."
1270,0.0,"地點(Na),佳(VH),離(P),逢甲(Nb),夜市(Nc),老闆娘(Na),親切(VH)..."


### 清理資料(移除詞性標註的文字)

In [8]:
def remove_N_comma(sentence):
    # 把後面(N..)(V..)(F..)拿掉
    sentence = str(sentence)
    pattern = re.compile(r"\([N,V,F,P].*?\)") #移除詞性標示
    sentence = re.sub(pattern, '', sentence)
    pattern = re.compile(r",") #將逗號替換為空格
    sentence = re.sub(pattern, ' ', sentence)
    return sentence
pd.options.mode.chained_assignment = None  # 忽略警告

In [9]:
#訓練集
df_facilities['filtered_word'] = df_facilities.apply(lambda x: remove_N_comma(x['filtered_word']),axis=1)

In [10]:
print(df_facilities.shape)
df_facilities

(1272, 2)


Unnamed: 0,label,filtered_word
0,0.0,價格 合理 舒適 房間 老闆娘 人 好 做 早餐 旅客 重點 早餐 吃到飽
1,1.0,內部 房間 乾淨 場地 團體 使用 覺得 棒
2,0.0,房間 小 美中不足 乾 濕 分離
3,1.0,房子 設計 棒 房間 採光 好 大廳 挑高 氣派 房價 合理 台東 住 民宿
4,0.0,Cp值 高 乾淨 舒適 空間 大樓 下 免費 吐司 咖啡 老闆 回復 速度
...,...,...
1267,1.0,港式 飲茶 餐廳 口味 棒 環境 乾淨 機車 汽車 停車位 位於 高鐵 附近 適合 宴客
1268,0.0,場地 氣派 丁香魚 酥脆 服務 親切 蠟味 蘿蔔糕 份量 一些 好 牛肉粥 好吃
1269,1.0,交通 方便 地下室 停車場 良好 菜色 好 空間 設計好 說 一流 飯店
1270,0.0,地點 佳 離 逢甲 夜市 老闆娘 親切 服務 房間 舒適 浴室 乾淨


## 模型架構

### 套件引用

In [11]:
#import package
#轉向量用
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_extraction.text import TfidfTransformer
from sklearn.feature_extraction.text import TfidfVectorizer 
from scipy.sparse import coo_matrix

from imblearn.pipeline import Pipeline, make_pipeline
from sklearn.model_selection import cross_validate
import pickle #儲存模型用
from sklearn.model_selection import train_test_split
#類別採樣
import imblearn.over_sampling as over_sampling
import imblearn.under_sampling as under_sampling
import imblearn.combine as combine
from imblearn.pipeline import make_pipeline as make_pipeline_imb


#模型
from sklearn.metrics import classification_report
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier
from sklearn import svm
from sklearn.ensemble import AdaBoostClassifier
#from xgboost.sklearn import XGBClassifier

#模型效能表現
import sklearn.metrics as metrics

### 顯示訓練結果

In [12]:
def classification_report(y_test, pre):
    #混淆矩陣
    confusion = metrics.confusion_matrix(y_test, pre)
    TP = confusion[1,1]
    TN = confusion[0,0]
    FP = confusion[0,1]
    FN = confusion[1,0]
    print("TP:",TP)
    print("TN:",TN)
    print("FP:",FP)
    print("FN:",FN)
    #Accuracy
    accuracy = (TP+TN)/float(TP+TN+FN+FP)
    print("Accuracy：", accuracy)
    #Sensitivity(Recall)
    recall = TP/float(TP+FN)
    print("Recall：", recall)
    #Specificity
    specificity = TN/float(TN+FP)
    print("Specificity：", specificity)
    #Precision
    precision = TP/float(TP+FP)
    print("Precision：", precision)
    #f1-score
    f1_score = ((2*precision*recall)/(precision+recall))
    print("f1_score：", f1_score)
    #AUC
    print("AUC：", metrics.roc_auc_score(y_test, pre))

In [13]:
df_facilities

Unnamed: 0,label,filtered_word
0,0.0,價格 合理 舒適 房間 老闆娘 人 好 做 早餐 旅客 重點 早餐 吃到飽
1,1.0,內部 房間 乾淨 場地 團體 使用 覺得 棒
2,0.0,房間 小 美中不足 乾 濕 分離
3,1.0,房子 設計 棒 房間 採光 好 大廳 挑高 氣派 房價 合理 台東 住 民宿
4,0.0,Cp值 高 乾淨 舒適 空間 大樓 下 免費 吐司 咖啡 老闆 回復 速度
...,...,...
1267,1.0,港式 飲茶 餐廳 口味 棒 環境 乾淨 機車 汽車 停車位 位於 高鐵 附近 適合 宴客
1268,0.0,場地 氣派 丁香魚 酥脆 服務 親切 蠟味 蘿蔔糕 份量 一些 好 牛肉粥 好吃
1269,1.0,交通 方便 地下室 停車場 良好 菜色 好 空間 設計好 說 一流 飯店
1270,0.0,地點 佳 離 逢甲 夜市 老闆娘 親切 服務 房間 舒適 浴室 乾淨


### 切分訓練、測試數據

In [14]:
from sklearn.model_selection import train_test_split
def split_label(df,seed):
    X = df.filtered_word.tolist()
    y = df.label
    X_train,X_test,y_train,y_test=train_test_split(X,y,test_size=1/3,random_state=seed)
    return X_train,X_test,y_train,y_test

In [15]:
X_train_1,X_test_1,y_train_1,y_test_1 = split_label(df_facilities,1)
X_train_2,X_test_2,y_train_2,y_test_2 = split_label(df_facilities,2)
X_train_3,X_test_3,y_train_3,y_test_3 = split_label(df_facilities,3)

In [16]:
s1 = pd.Series(y_train_1)
freq1 = s1.value_counts() 
print(freq1) 
s2 = pd.Series(y_train_2)
freq2 = s2.value_counts() 
print(freq2) 
s3 = pd.Series(y_train_3)
freq3 = s3.value_counts() 
print(freq3) 

0.0    638
1.0    210
Name: label, dtype: int64
0.0    639
1.0    209
Name: label, dtype: int64
0.0    652
1.0    196
Name: label, dtype: int64


### 模型設計

#### (1) baseline

In [17]:
def SVM_model(X_train,X_test,y_train,y_test):
    print("SVM baseline")
    #切分數據集
    #X_train,y_train,X_test,y_test = split_label(df_train, df_test)
    #模型架構
    model = make_pipeline_imb(TfidfVectorizer(max_df=0.8,min_df=5,dtype=np.float32), svm.SVC(kernel='linear'))
    model.fit(X_train, y_train)
    #模型預測
    pre = model.predict(X_test)
    #麼行評估
    classification_report(y_test, pre)
    print("\n")

In [18]:
def LR_model(X_train,X_test,y_train,y_test):
    print("LR baseline")
    #切分數據集
    #X_train,y_train,X_test,y_test = split_label(df_train, df_test)
    #模型架構
    model = make_pipeline_imb(TfidfVectorizer(max_df=0.8,min_df=5,dtype=np.float32), LogisticRegression(random_state=0))
    model.fit(X_train, y_train)
    #模型預測
    pre = model.predict(X_test)
    #麼行評估
    classification_report(y_test, pre)
    print("\n")

In [19]:
def RF_model(X_train,X_test,y_train,y_test):
    print("RF baseline")
    #切分數據集
    #X_train,y_train,X_test,y_test = split_label(df_train, df_test)
    #模型架構
    model = make_pipeline_imb(TfidfVectorizer(max_df=0.8,min_df=5,dtype=np.float32), RandomForestClassifier(max_depth=2, random_state=0))
    model.fit(X_train, y_train)
    #模型預測
    pre = model.predict(X_test)
    #麼行評估
    classification_report(y_test, pre)
    print("\n")

In [20]:
def AdaBoost_model(X_train,X_test,y_train,y_test):
    print("AdaBoost_model")
    #切分數據集
    #X_train,y_train,X_test,y_test = split_label(df_train, df_test)
    #模型架構
    model = make_pipeline_imb(TfidfVectorizer(), AdaBoostClassifier(n_estimators=200))
    model.fit(X_train, y_train)
    #模型預測
    pre = model.predict(X_test)
    #麼行評估
    classification_report(y_test, pre)
    print("\n")

#### (2)執行採樣 => 解決類別不平衡 (SVM)

In [60]:
def SVM_model2(X_train,X_test,y_train,y_test):
    print("ADASYN")
    #切分數據集
    #X_train,y_train,X_test,y_test = split_label(df_train, df_test)
    #模型架構
    model = make_pipeline_imb(TfidfVectorizer(), over_sampling.ADASYN(), svm.SVC(kernel='linear'))
    print(model)
    # Get the names of each feature
    model.fit(X_train, y_train)
    feature_names = model.named_steps["tfidfvectorizer"].get_feature_names()
    print(feature_names)
    #模型預測
    pre = model.predict(X_test)
    #麼行評估
    classification_report(y_test, pre)
    print("\n")

In [61]:
SVM_model2(X_train_1,X_test_1,y_train_1,y_test_1)

ADASYN
Pipeline(steps=[('tfidfvectorizer', TfidfVectorizer()), ('adasyn', ADASYN()),
                ('svc', SVC(kernel='linear'))])


NotFittedError: Vocabulary not fitted or provided

In [22]:
def SVM_model3(X_train,X_test,y_train,y_test):
    print("SMOTE")
    #切分數據集
    #X_train,y_train,X_test,y_test = split_label(df_train, df_test)
    #模型架構
    model = make_pipeline_imb(TfidfVectorizer(), over_sampling.SMOTE(), svm.SVC(kernel='linear'))
    model.fit(X_train, y_train)
    #模型預測
    pre = model.predict(X_test)
    #麼行評估
    classification_report(y_test, pre)
    print("\n")

In [23]:
def SVM_model4(X_train,X_test,y_train,y_test):
    print("RandomOverSampler")
    #切分數據集
    #X_train,y_train,X_test,y_test = split_label(df_train, df_test)
    #模型架構
    model = make_pipeline_imb(TfidfVectorizer(), over_sampling.RandomOverSampler(), svm.SVC(kernel='linear'))
    model.fit(X_train, y_train)
    #模型預測
    pre = model.predict(X_test)
    #麼行評估
    classification_report(y_test, pre)
    print("\n")

In [24]:
def SVM_model5(X_train,X_test,y_train,y_test):
    print("RandomUnderSampler")
    #切分數據集
    #X_train,y_train,X_test,y_test = split_label(df_train, df_test)
    #模型架構
    model = make_pipeline_imb(TfidfVectorizer(), under_sampling.RandomUnderSampler(), svm.SVC(kernel='linear'))
    model.fit(X_train, y_train)
    #模型預測
    pre = model.predict(X_test)
    #麼行評估
    classification_report(y_test, pre)
    print("\n")

#### (3)執行採樣 => 解決類別不平衡 (Adaboost)

In [25]:
def AdaBoost_model2(X_train,X_test,y_train,y_test):
    print("ADASYN")
    #切分數據集
    #X_train,y_train,X_test,y_test = split_label(df_train, df_test)
    #模型架構
    model = make_pipeline_imb(TfidfVectorizer(), over_sampling.ADASYN(), AdaBoostClassifier(n_estimators=200))
    model.fit(X_train, y_train)
    #模型預測
    pre = model.predict(X_test)
    #麼行評估
    classification_report(y_test, pre)
    print("\n")

In [26]:
def AdaBoost_model3(X_train,X_test,y_train,y_test):
    print("SMOTE")
    #切分數據集
    #X_train,y_train,X_test,y_test = split_label(df_train, df_test)
    #模型架構
    model = make_pipeline_imb(TfidfVectorizer(), over_sampling.SMOTE(), AdaBoostClassifier(n_estimators=200))
    model.fit(X_train, y_train)
    #模型預測
    pre = model.predict(X_test)
    #麼行評估
    classification_report(y_test, pre)
    print("\n")

In [27]:
def AdaBoost_model4(X_train,X_test,y_train,y_test):
    print("RandomOverSampler")
    #切分數據集
    #X_train,y_train,X_test,y_test = split_label(df_train, df_test)
    #模型架構
    model = make_pipeline_imb(TfidfVectorizer(), over_sampling.RandomOverSampler(), AdaBoostClassifier(n_estimators=200))
    model.fit(X_train, y_train)
    #模型預測
    pre = model.predict(X_test)
    #麼行評估
    classification_report(y_test, pre)
    print("\n")

In [28]:
def AdaBoost_model5(X_train,X_test,y_train,y_test):
    print("RandomUnderSampler")
    #切分數據集
    #X_train,y_train,X_test,y_test = split_label(df_train, df_test)
    #模型架構
    model = make_pipeline_imb(TfidfVectorizer(), under_sampling.RandomUnderSampler(), AdaBoostClassifier(n_estimators=200))
    model.fit(X_train, y_train)
    #模型預測
    pre = model.predict(X_test)
    #麼行評估
    classification_report(y_test, pre)
    print("\n")

## 模型訓練&結果

### 資料集1

#### baseline

In [29]:
SVM_model(X_train_1,X_test_1,y_train_1,y_test_1)
LR_model(X_train_1,X_test_1,y_train_1,y_test_1)
RF_model(X_train_1,X_test_1,y_train_1,y_test_1)
AdaBoost_model(X_train_1,X_test_1,y_train_1,y_test_1)

SVM baseline
TP: 38
TN: 316
FP: 8
FN: 62
Accuracy： 0.8349056603773585
Recall： 0.38
Specificity： 0.9753086419753086
Precision： 0.8260869565217391
f1_score： 0.5205479452054794
AUC： 0.6776543209876542


LR baseline
TP: 18
TN: 321
FP: 3
FN: 82
Accuracy： 0.7995283018867925
Recall： 0.18
Specificity： 0.9907407407407407
Precision： 0.8571428571428571
f1_score： 0.2975206611570248
AUC： 0.5853703703703703


RF baseline
TP: 0
TN: 324
FP: 0
FN: 100
Accuracy： 0.7641509433962265
Recall： 0.0
Specificity： 1.0
Precision： nan
f1_score： nan
AUC： 0.5


AdaBoost_model


  precision = TP/float(TP+FP)


TP: 53
TN: 294
FP: 30
FN: 47
Accuracy： 0.8183962264150944
Recall： 0.53
Specificity： 0.9074074074074074
Precision： 0.6385542168674698
f1_score： 0.5792349726775956
AUC： 0.7187037037037037




#### 處理類別不平衡(SVM)

In [30]:
SVM_model2(X_train_1,X_test_1,y_train_1,y_test_1)
SVM_model3(X_train_1,X_test_1,y_train_1,y_test_1)
SVM_model4(X_train_1,X_test_1,y_train_1,y_test_1)
SVM_model5(X_train_1,X_test_1,y_train_1,y_test_1)

ADASYN
TP: 61
TN: 294
FP: 30
FN: 39
Accuracy： 0.8372641509433962
Recall： 0.61
Specificity： 0.9074074074074074
Precision： 0.6703296703296703
f1_score： 0.6387434554973822
AUC： 0.7587037037037038


SMOTE
TP: 59
TN: 297
FP: 27
FN: 41
Accuracy： 0.839622641509434
Recall： 0.59
Specificity： 0.9166666666666666
Precision： 0.686046511627907
f1_score： 0.6344086021505376
AUC： 0.7533333333333332


RandomOverSampler
TP: 59
TN: 299
FP: 25
FN: 41
Accuracy： 0.8443396226415094
Recall： 0.59
Specificity： 0.9228395061728395
Precision： 0.7023809523809523
f1_score： 0.6413043478260868
AUC： 0.7564197530864197


RandomUnderSampler
TP: 76
TN: 236
FP: 88
FN: 24
Accuracy： 0.7358490566037735
Recall： 0.76
Specificity： 0.7283950617283951
Precision： 0.4634146341463415
f1_score： 0.5757575757575758
AUC： 0.7441975308641976




#### 處理類別不平衡(Adaboost)

In [31]:
AdaBoost_model2(X_train_1,X_test_1,y_train_1,y_test_1)
AdaBoost_model3(X_train_1,X_test_1,y_train_1,y_test_1)
AdaBoost_model4(X_train_1,X_test_1,y_train_1,y_test_1)
AdaBoost_model5(X_train_1,X_test_1,y_train_1,y_test_1)

ADASYN
TP: 56
TN: 292
FP: 32
FN: 44
Accuracy： 0.8207547169811321
Recall： 0.56
Specificity： 0.9012345679012346
Precision： 0.6363636363636364
f1_score： 0.5957446808510639
AUC： 0.7306172839506173


SMOTE
TP: 58
TN: 289
FP: 35
FN: 42
Accuracy： 0.8183962264150944
Recall： 0.58
Specificity： 0.8919753086419753
Precision： 0.6236559139784946
f1_score： 0.6010362694300518
AUC： 0.7359876543209876


RandomOverSampler
TP: 55
TN: 290
FP: 34
FN: 45
Accuracy： 0.8136792452830188
Recall： 0.55
Specificity： 0.8950617283950617
Precision： 0.6179775280898876
f1_score： 0.582010582010582
AUC： 0.7225308641975309


RandomUnderSampler
TP: 66
TN: 197
FP: 127
FN: 34
Accuracy： 0.6202830188679245
Recall： 0.66
Specificity： 0.6080246913580247
Precision： 0.34196891191709844
f1_score： 0.4505119453924915
AUC： 0.6340123456790124




### 資料集2

#### baseline

In [32]:
SVM_model(X_train_2,X_test_2,y_train_2,y_test_2)
LR_model(X_train_2,X_test_2,y_train_2,y_test_2)
RF_model(X_train_2,X_test_2,y_train_2,y_test_2)
AdaBoost_model(X_train_2,X_test_2,y_train_2,y_test_2)

SVM baseline
TP: 28
TN: 316
FP: 7
FN: 73
Accuracy： 0.8113207547169812
Recall： 0.27722772277227725
Specificity： 0.978328173374613
Precision： 0.8
f1_score： 0.411764705882353
AUC： 0.6277779480734451


LR baseline
TP: 17
TN: 320
FP: 3
FN: 84
Accuracy： 0.7948113207547169
Recall： 0.16831683168316833
Specificity： 0.9907120743034056
Precision： 0.85
f1_score： 0.2809917355371901
AUC： 0.579514452993287


RF baseline
TP: 0
TN: 323
FP: 0
FN: 101
Accuracy： 0.7617924528301887
Recall： 0.0
Specificity： 1.0
Precision： nan
f1_score： nan
AUC： 0.5


AdaBoost_model


  precision = TP/float(TP+FP)


TP: 56
TN: 289
FP: 34
FN: 45
Accuracy： 0.8136792452830188
Recall： 0.5544554455445545
Specificity： 0.8947368421052632
Precision： 0.6222222222222222
f1_score： 0.5863874345549739
AUC： 0.7245961438249089




#### 處理類別不平衡(SVM)

In [33]:
SVM_model2(X_train_2,X_test_2,y_train_2,y_test_2)
SVM_model3(X_train_2,X_test_2,y_train_2,y_test_2)
SVM_model4(X_train_2,X_test_2,y_train_2,y_test_2)
SVM_model5(X_train_2,X_test_2,y_train_2,y_test_2)

ADASYN
TP: 57
TN: 289
FP: 34
FN: 44
Accuracy： 0.8160377358490566
Recall： 0.5643564356435643
Specificity： 0.8947368421052632
Precision： 0.6263736263736264
f1_score： 0.5937499999999999
AUC： 0.7295466388744138


SMOTE
TP: 58
TN: 291
FP: 32
FN: 43
Accuracy： 0.8231132075471698
Recall： 0.5742574257425742
Specificity： 0.9009287925696594
Precision： 0.6444444444444445
f1_score： 0.6073298429319371
AUC： 0.7375931091561169


RandomOverSampler
TP: 53
TN: 293
FP: 30
FN: 48
Accuracy： 0.8160377358490566
Recall： 0.5247524752475248
Specificity： 0.9071207430340558
Precision： 0.6385542168674698
f1_score： 0.5760869565217391
AUC： 0.7159366091407903


RandomUnderSampler
TP: 72
TN: 234
FP: 89
FN: 29
Accuracy： 0.7216981132075472
Recall： 0.7128712871287128
Specificity： 0.7244582043343654
Precision： 0.4472049689440994
f1_score： 0.549618320610687
AUC： 0.718664745731539




#### 處理類別不平衡(Adaboost)

In [None]:
AdaBoost_model2(X_train_2,X_test_2,y_train_2,y_test_2)
AdaBoost_model3(X_train_2,X_test_2,y_train_2,y_test_2)
AdaBoost_model4(X_train_2,X_test_2,y_train_2,y_test_2)
AdaBoost_model5(X_train_2,X_test_2,y_train_2,y_test_2)

ADASYN
TP: 54
TN: 294
FP: 29
FN: 47
Accuracy： 0.8207547169811321
Recall： 0.5346534653465347
Specificity： 0.9102167182662538
Precision： 0.6506024096385542
f1_score： 0.5869565217391304
AUC： 0.7224350918063942


SMOTE
TP: 52
TN: 301
FP: 22
FN: 49
Accuracy： 0.8325471698113207
Recall： 0.5148514851485149
Specificity： 0.9318885448916409
Precision： 0.7027027027027027
f1_score： 0.5942857142857144
AUC： 0.7233700150200779


RandomOverSampler
TP: 57
TN: 287
FP: 36
FN: 44
Accuracy： 0.8113207547169812
Recall： 0.5643564356435643
Specificity： 0.8885448916408669
Precision： 0.6129032258064516
f1_score： 0.5876288659793815
AUC： 0.7264506636422157


RandomUnderSampler
TP: 65
TN: 255
FP: 68
FN: 36
Accuracy： 0.7547169811320755
Recall： 0.6435643564356436
Specificity： 0.7894736842105263
Precision： 0.48872180451127817
f1_score： 0.5555555555555556
AUC： 0.716519020323085




### 資料集3

#### baseline

In [None]:
SVM_model(X_train_3,X_test_3,y_train_3,y_test_3)
LR_model(X_train_3,X_test_3,y_train_3,y_test_3)
RF_model(X_train_3,X_test_3,y_train_3,y_test_3)
AdaBoost_model(X_train_3,X_test_3,y_train_3,y_test_3)

SVM baseline
TP: 37
TN: 305
FP: 5
FN: 77
Accuracy： 0.8066037735849056
Recall： 0.32456140350877194
Specificity： 0.9838709677419355
Precision： 0.8809523809523809
f1_score： 0.4743589743589744
AUC： 0.6542161856253537


LR baseline
TP: 24
TN: 306
FP: 4
FN: 90
Accuracy： 0.7783018867924528
Recall： 0.21052631578947367
Specificity： 0.9870967741935484
Precision： 0.8571428571428571
f1_score： 0.33802816901408456
AUC： 0.598811544991511


RF baseline
TP: 0
TN: 310
FP: 0
FN: 114
Accuracy： 0.7311320754716981
Recall： 0.0
Specificity： 1.0
Precision： nan
f1_score： nan
AUC： 0.5


AdaBoost_model


  precision = TP/float(TP+FP)


TP: 59
TN: 277
FP: 33
FN: 55
Accuracy： 0.7924528301886793
Recall： 0.5175438596491229
Specificity： 0.8935483870967742
Precision： 0.6413043478260869
f1_score： 0.5728155339805826
AUC： 0.7055461233729485




#### 處理類別不平衡(SVM)

In [None]:
SVM_model2(X_train_3,X_test_3,y_train_3,y_test_3)
SVM_model3(X_train_3,X_test_3,y_train_3,y_test_3)
SVM_model4(X_train_3,X_test_3,y_train_3,y_test_3)
SVM_model5(X_train_3,X_test_3,y_train_3,y_test_3)

ADASYN
TP: 65
TN: 286
FP: 24
FN: 49
Accuracy： 0.8278301886792453
Recall： 0.5701754385964912
Specificity： 0.9225806451612903
Precision： 0.7303370786516854
f1_score： 0.6403940886699507
AUC： 0.7463780418788908


SMOTE
TP: 60
TN: 290
FP: 20
FN: 54
Accuracy： 0.8254716981132075
Recall： 0.5263157894736842
Specificity： 0.9354838709677419
Precision： 0.75
f1_score： 0.6185567010309279
AUC： 0.7308998302207131


RandomOverSampler
TP: 60
TN: 292
FP: 18
FN: 54
Accuracy： 0.8301886792452831
Recall： 0.5263157894736842
Specificity： 0.9419354838709677
Precision： 0.7692307692307693
f1_score： 0.625
AUC： 0.7341256366723259


RandomUnderSampler
TP: 80
TN: 228
FP: 82
FN: 34
Accuracy： 0.7264150943396226
Recall： 0.7017543859649122
Specificity： 0.7354838709677419
Precision： 0.49382716049382713
f1_score： 0.5797101449275363
AUC： 0.718619128466327




#### 處理類別不平衡(Adaboost)

In [None]:
AdaBoost_model2(X_train_3,X_test_3,y_train_3,y_test_3)
AdaBoost_model3(X_train_3,X_test_3,y_train_3,y_test_3)
AdaBoost_model4(X_train_3,X_test_3,y_train_3,y_test_3)
AdaBoost_model5(X_train_3,X_test_3,y_train_3,y_test_3)

ADASYN
TP: 63
TN: 285
FP: 25
FN: 51
Accuracy： 0.8207547169811321
Recall： 0.5526315789473685
Specificity： 0.9193548387096774
Precision： 0.7159090909090909
f1_score： 0.6237623762376238
AUC： 0.7359932088285229


SMOTE
TP: 59
TN: 280
FP: 30
FN: 55
Accuracy： 0.7995283018867925
Recall： 0.5175438596491229
Specificity： 0.9032258064516129
Precision： 0.6629213483146067
f1_score： 0.5812807881773399
AUC： 0.7103848330503679


RandomOverSampler
TP: 59
TN: 281
FP: 29
FN: 55
Accuracy： 0.8018867924528302
Recall： 0.5175438596491229
Specificity： 0.9064516129032258
Precision： 0.6704545454545454
f1_score： 0.5841584158415842
AUC： 0.7119977362761745


RandomUnderSampler
TP: 78
TN: 230
FP: 80
FN: 36
Accuracy： 0.7264150943396226
Recall： 0.6842105263157895
Specificity： 0.7419354838709677
Precision： 0.4936708860759494
f1_score： 0.5735294117647058
AUC： 0.7130730050933787


