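## [Assignment focus]
By now you should have a clear picture of what the data in a dataset looks like, including its features and labels. Keep in mind that in any future project the data must be cleaned into this same format before it can be fed to a model for training.
Today's assignment introduces the decision tree, a very important model. Make sure you understand what each hyperparameter means, then try adjusting them and observe how the final predictions change.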

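## Assignment

1. Try adjusting the parameters of DecisionTreeClassifier(...) and observe whether the results change.
2. Switch to other datasets (boston, wine) and compare the results with those of the regression models.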

In [461]:
from sklearn import datasets, metrics

# For classification problems use DecisionTreeClassifier; for regression problems use DecisionTreeRegressor
from sklearn.tree import DecisionTreeClassifier, DecisionTreeRegressor
from sklearn.model_selection import train_test_split

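## Four steps to build a model
Building a machine learning model in Scikit-learn is quite simple. The workflow is roughly the following four steps:

1. Read in the data and check its shape (how many samples (rows), how many features (columns), and what type the label is).
    - Ways to read data:
        - Read a .csv file with pandas: pd.read_csv
        - Read a .txt file with numpy: np.loadtxt
        - Use a dataset built into Scikit-learn: sklearn.datasets.load_xxx
    - Check the amount of data: data.shape (data should be an np.array or a DataFrame)
2. Split the data into training (train) and test (test) sets.
    - train_test_split(data)
3. Build the model and fit it to the training data.
    - clf = DecisionTreeClassifier()
    - clf.fit(x_train, y_train)
4. Feed the test features into the trained model to get predictions, then evaluate them against the test labels (y_test).
    - y_pred = clf.predict(x_test)
    - accuracy_score(y_test, y_pred)
    - f1_score(y_test, y_pred) (multiclass labels such as iris need an average argument, e.g. average='macro')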
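The four steps can be sketched end to end as follows. This is a minimal illustration on the built-in iris dataset, not the notebook's own helper functions. The `random_state=0` on the classifier and `average="macro"` for the F1 score are assumptions chosen to make the run repeatable and to handle the three-class labels.

```python
from sklearn import datasets, metrics
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Step 1: read the data and check its shape and label type
iris = datasets.load_iris()
print("data shape:", iris.data.shape, "label dtype:", iris.target.dtype)

# Step 2: split into training and test sets
x_train, x_test, y_train, y_test = train_test_split(
    iris.data, iris.target, test_size=0.25, random_state=4
)

# Step 3: build the model and fit it to the training data
clf = DecisionTreeClassifier(random_state=0)  # random_state is an assumption for repeatability
clf.fit(x_train, y_train)

# Step 4: predict on the test features and evaluate against y_test
y_pred = clf.predict(x_test)
acc = metrics.accuracy_score(y_test, y_pred)
f1 = metrics.f1_score(y_test, y_pred, average="macro")  # multiclass needs an explicit average
print("accuracy:", acc, "macro F1:", f1)
```

Accuracy and F1 are computed from the same predictions, so differences between them reflect how errors are distributed across classes rather than a different model.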

In [462]:
def LoadData(dataload_fun,test_size=0.25,random_state=4):
    # dataload_fun is a loader function such as datasets.load_iris (pass the function itself, not its result)
    print("Load data from Data Function:",dataload_fun.__name__.upper())
    
    # Load the dataset
    datas = dataload_fun()
    print("Features=",datas.feature_names)
    
    # Split into training and test sets
    print(" Data Rows=",datas.data.shape[0]," Features=",datas.data.shape[1])
    print(" test_size=",test_size, "Random_state=",random_state)
    
    x_train, x_test, y_train, y_test = train_test_split(datas.data, datas.target, test_size=test_size, random_state=random_state)
    print(" Train size=",x_train.shape[0]," Test size=",x_test.shape[0])
    
    return x_train, x_test, y_train, y_test
    

In [451]:
# Classification
# Iris dataset
#LoadData(datasets.load_iris)

# Regression
# Boston house prices (load_boston was removed in scikit-learn 1.2)
#LoadData(datasets.load_boston)

In [463]:
# Load the data
x_train, x_test, y_train, y_test=LoadData(datasets.load_iris)

Load data from Data Function: LOAD_IRIS
Features= ['sepal length (cm)', 'sepal width (cm)', 'petal length (cm)', 'petal width (cm)']
 Data Rows= 150  Features= 4
 test_size= 0.25 Random_state= 4
 Train size= 112  Test size= 38


In [464]:
def RunModel_Class(datafun,x_train,x_test,y_train,y_test,_Criterion='gini',_Max_depth=None,_Min_samples_split=2,_Min_samples_leaf=1):
    # _Criterion: metric for measuring how similar the samples in a node are, 'gini' or 'entropy'
    # _Max_depth: maximum depth the tree is allowed to grow to (None means no limit)
    # _Min_samples_split: minimum number of samples a node needs before it may be split (default 2)
    # _Min_samples_leaf: minimum number of samples required at each final leaf node (default 1)
    print("====DecisionTreeClassifier===")
    print(" Train Rows=",x_train.shape[0]," Train Features=",x_train.shape[1])
    print(" Sample data=\n",x_train[0:3])
    print(" Sample result=",y_train[0:10])  # check whether the labels look like classification or regression targets
    
    print(" Parameter: Criterion=",_Criterion," Max_Depth=",_Max_depth," Min_Samples_Split=",_Min_samples_split," Min_Samples_Leaf=",_Min_samples_leaf)
    # Build the model (min_samples_leaf is now passed through as well)
    clf = DecisionTreeClassifier(criterion=_Criterion,max_depth=_Max_depth,min_samples_split=_Min_samples_split,min_samples_leaf=_Min_samples_leaf)
    print(" Model x_train shape",x_train.shape)
    # Train the model
    clf.fit(x_train, y_train)

    # Predict on the test set
    y_pred = clf.predict(x_test)
    
    acc = metrics.accuracy_score(y_test, y_pred)
    print(" Accuracy: [", acc,"]")
    
    print(" Data feature",datafun().feature_names)
    print(" Feature importance: ",end='')
    for i in clf.feature_importances_:
        print(i,"\t",end='')
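To make the criterion parameter more concrete, the sketch below computes Gini impurity and entropy by hand for the labels in a single node. The helper names `gini` and `entropy` are hypothetical, and the base-2 logarithm is an assumption for illustration. This is not scikit-learn's internal code.

```python
import numpy as np

def gini(labels):
    """Gini impurity of a node: 1 - sum_k p_k^2 (hypothetical helper)."""
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return float(1.0 - np.sum(p ** 2))

def entropy(labels):
    """Entropy of a node: -sum_k p_k * log2(p_k) (hypothetical helper)."""
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()  # every p_k > 0 because only observed classes are counted
    return float(-np.sum(p * np.log2(p)))

pure_node = [0, 0, 0, 0]    # a single class: no impurity
mixed_node = [0, 0, 1, 1]   # two classes, evenly mixed: maximum impurity for two classes

print("pure :", gini(pure_node), entropy(pure_node))
print("mixed:", gini(mixed_node), entropy(mixed_node))
```

Both measures are zero for a pure node and grow as the classes become more evenly mixed, which is why a split that lowers them produces child nodes that are more homogeneous.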

In [465]:
RunModel_Class(datasets.load_iris,x_train,x_test,y_train,y_test,_Min_samples_split=2)

====DecisionTreeClassifier===
 Train Rows= 112  Train Features= 4
 Sample data=
 [[ 6.7  3.1  4.7  1.5]
 [ 5.1  3.8  1.6  0.2]
 [ 7.7  3.   6.1  2.3]]
 Sample result= [1 0 2 0 1 0 2 0 0 1]
 Parameter: Criterion= gini  Max_Depth= None  Min_Samples_Split= 2  Min_Samples_Leaf= 1
 Model x_train shape (112, 4)
 Accuracy: [ 0.973684210526 ]
 Data feature ['sepal length (cm)', 'sepal width (cm)', 'petal length (cm)', 'petal width (cm)']
 Feature importance: 0.0 	0.0179659929419 	0.0599236810462 	0.922110326012 	

In [466]:
RunModel_Class(datasets.load_iris,x_train,x_test,y_train,y_test,_Min_samples_split=3)

====DecisionTreeClassifier===
 Train Rows= 112  Train Features= 4
 Sample data=
 [[ 6.7  3.1  4.7  1.5]
 [ 5.1  3.8  1.6  0.2]
 [ 7.7  3.   6.1  2.3]]
 Sample result= [1 0 2 0 1 0 2 0 0 1]
 Parameter: Criterion= gini  Max_Depth= None  Min_Samples_Split= 3  Min_Samples_Leaf= 1
 Model x_train shape (112, 4)
 Accuracy: [ 0.973684210526 ]
 Data feature ['sepal length (cm)', 'sepal width (cm)', 'petal length (cm)', 'petal width (cm)']
 Feature importance: 0.0 	0.0179659929419 	0.522291342259 	0.459742664799 	

In [467]:
RunModel_Class(datasets.load_iris,x_train,x_test,y_train,y_test,_Min_samples_split=50)

====DecisionTreeClassifier===
 Train Rows= 112  Train Features= 4
 Sample data=
 [[ 6.7  3.1  4.7  1.5]
 [ 5.1  3.8  1.6  0.2]
 [ 7.7  3.   6.1  2.3]]
 Sample result= [1 0 2 0 1 0 2 0 0 1]
 Parameter: Criterion= gini  Max_Depth= None  Min_Samples_Split= 50  Min_Samples_Leaf= 1
 Model x_train shape (112, 4)
 Accuracy: [ 0.973684210526 ]
 Data feature ['sepal length (cm)', 'sepal width (cm)', 'petal length (cm)', 'petal width (cm)']
 Feature importance: 0.0 	0.0 	0.528053933902 	0.471946066098 	
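The three runs above report the same accuracy while the feature importances shift, so it can help to look at the tree structure directly. The sketch below repeats the sweep over min_samples_split and records depth and leaf count for each tree. It is a standalone illustration. The fixed `random_state=0` on the classifier is an assumption for repeatability, and the `results` list is a hypothetical name.

```python
from sklearn import datasets, metrics
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

iris = datasets.load_iris()
x_train, x_test, y_train, y_test = train_test_split(
    iris.data, iris.target, test_size=0.25, random_state=4
)

results = []  # (min_samples_split, depth, n_leaves, accuracy)
for min_split in (2, 3, 50):
    clf = DecisionTreeClassifier(min_samples_split=min_split, random_state=0)
    clf.fit(x_train, y_train)
    acc = metrics.accuracy_score(y_test, clf.predict(x_test))
    results.append((min_split, clf.get_depth(), clf.get_n_leaves(), acc))
    print("min_samples_split=", min_split,
          " depth=", clf.get_depth(),
          " leaves=", clf.get_n_leaves(),
          " accuracy=", acc)
```

A larger min_samples_split stops splitting earlier, so the tree becomes smaller even when test accuracy on a small dataset like iris barely moves.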

def RunModel_Regress(datafun,x_train,x_test,y_train,y_test,_Criterion='squared_error',_Max_depth=None,_Min_samples_split=2,_Min_samples_leaf=1):
    # _Criterion: metric for measuring split quality; a regressor takes 'squared_error' ('mse' before scikit-learn 1.0), not 'gini' / 'entropy'
    # _Max_depth: maximum depth the tree is allowed to grow to (None means no limit)
    # _Min_samples_split: minimum number of samples a node needs before it may be split (default 2)
    # _Min_samples_leaf: minimum number of samples required at each final leaf node (default 1)
    print("x_Train.Shape=",x_train.shape)
    print("y_Train.Shape=",y_train.shape)
    
    print(_Criterion,_Max_depth,_Min_samples_split,_Min_samples_leaf)
    # Build the model
    reg = DecisionTreeRegressor(criterion=_Criterion,max_depth=_Max_depth,min_samples_split=_Min_samples_split,min_samples_leaf=_Min_samples_leaf)
    print("model x_train shape",x_train.shape)
    # Train the model
    reg.fit(x_train, y_train)

    # Predict on the test set
    y_pred = reg.predict(x_test)
    
    # accuracy_score only applies to classification, so use regression metrics here
    mse = metrics.mean_squared_error(y_test, y_pred)
    r2 = metrics.r2_score(y_test, y_pred)
    print("MSE: ", mse, " R2: ", r2)
    
    print(datafun().feature_names)
    print("Feature importance: ", reg.feature_importances_)

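# Pass the loader function itself (not its result) and keep the returned splits
x_train, x_test, y_train, y_test = LoadData(datasets.load_iris)
print(y_test)
# load_boston only exists in scikit-learn versions before 1.2
x_train, x_test, y_train, y_test = LoadData(datasets.load_boston)
print(y_test)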
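For the regression half of the assignment, a comparison between a decision tree and a linear regression model might look like the sketch below. Because load_boston is unavailable in recent scikit-learn releases, this example substitutes load_diabetes. That substitution, the `max_depth=3` setting, and the `scores` dictionary are assumptions for illustration only.

```python
from sklearn import datasets, metrics
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeRegressor

# Assumption: load_diabetes stands in for the boston dataset
diabetes = datasets.load_diabetes()
x_train, x_test, y_train, y_test = train_test_split(
    diabetes.data, diabetes.target, test_size=0.25, random_state=4
)

models = {
    "DecisionTreeRegressor": DecisionTreeRegressor(max_depth=3, random_state=0),
    "LinearRegression": LinearRegression(),
}

scores = {}  # model name -> (MSE, R2) on the test set
for name, model in models.items():
    model.fit(x_train, y_train)
    y_pred = model.predict(x_test)
    mse = metrics.mean_squared_error(y_test, y_pred)
    r2 = metrics.r2_score(y_test, y_pred)
    scores[name] = (mse, r2)
    print(name, " MSE=", mse, " R2=", r2)
```

Regression targets are continuous, so the comparison uses mean squared error and R2 instead of accuracy.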


#Wines
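For the wine dataset named in the assignment, one possible comparison is a decision tree against logistic regression, since wine is a classification problem. The sketch below is an assumption about how that comparison could be set up: the StandardScaler pipeline, `max_iter=1000`, and the `accuracies` dictionary are illustrative choices, not part of the original notebook.

```python
from sklearn import datasets, metrics
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.tree import DecisionTreeClassifier

wine = datasets.load_wine()
x_train, x_test, y_train, y_test = train_test_split(
    wine.data, wine.target, test_size=0.25, random_state=4
)

models = {
    "DecisionTreeClassifier": DecisionTreeClassifier(random_state=0),
    # Logistic regression is scale-sensitive, so features are standardized first
    "LogisticRegression": make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000)),
}

accuracies = {}  # model name -> test accuracy
for name, model in models.items():
    model.fit(x_train, y_train)
    accuracies[name] = metrics.accuracy_score(y_test, model.predict(x_test))
    print(name, " accuracy=", accuracies[name])

# Feature importances are only available on the tree model
tree = models["DecisionTreeClassifier"]
for feature, importance in zip(wine.feature_names, tree.feature_importances_):
    print(feature, importance)
```

The tree needs no feature scaling because each split compares one feature against a threshold, whereas the linear model benefits from standardized inputs.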

# Example for DecisionTreeRegressor
# https://scikit-learn.org/stable/modules/generated/sklearn.tree.DecisionTreeRegressor.html