## [範例重點]
了解機器學習建模的步驟、資料型態以及評估結果等流程

In [1]:
from sklearn import datasets, metrics

# 如果是分類問題，請使用 DecisionTreeClassifier，若為回歸問題，請使用 DecisionTreeRegressor
from sklearn.tree import DecisionTreeClassifier, DecisionTreeRegressor
from sklearn.model_selection import train_test_split

## 建立模型四步驟

在 Scikit-learn 中，建立一個機器學習的模型其實非常簡單，流程大略是以下四個步驟

1. 讀進資料，並檢查資料的 shape (有多少 samples (rows), 多少 features (columns)，label 的型態是什麼？)
    - 讀取資料的方法：
        - **使用 pandas 讀取 .csv 檔：**pd.read_csv
        - **使用 numpy 讀取 .txt 檔：**np.loadtxt 
        - **使用 Scikit-learn 內建的資料集：**sklearn.datasets.load_xxx
    - **檢查資料數量：**data.shape (data should be np.array or dataframe)
2. 將資料切為訓練 (train) / 測試 (test)
    - train_test_split(data)
3. 建立模型，將資料 fit 進模型開始訓練
    - clf = DecisionTreeClassifier()
    - clf.fit(x_train, y_train)
4. 將測試資料 (features) 放進訓練好的模型中，得到 prediction，與測試資料的 label (y_test) 做評估
    - clf.predict(x_test)
    - accuracy_score(y_test, y_pred)
    - f1_score(y_test, y_pred)

In [54]:
# 讀取鳶尾花資料集
iris = datasets.load_iris()

# 切分訓練集/測試集
x_train, x_test, y_train, y_test = train_test_split(iris.data, iris.target, test_size=0.25, random_state=4)

# 建立模型
clf = DecisionTreeClassifier(criterion='entropy',max_depth=5,min_samples_split=70)

# 訓練模型
clf.fit(x_train, y_train)

# 預測測試集
y_pred = clf.predict(x_test)

In [55]:
acc = metrics.accuracy_score(y_test, y_pred)
print("Acuuracy: ", acc)

Acuuracy:  0.9736842105263158


In [56]:
print(iris.feature_names)

['sepal length (cm)', 'sepal width (cm)', 'petal length (cm)', 'petal width (cm)']


In [57]:
print("Feature importance: ", clf.feature_importances_)

Feature importance:  [0. 0. 0. 1.]


## 作業

1. 試著調整 DecisionTreeClassifier(...) 中的參數，並觀察是否會改變結果？
2. 改用其他資料集 (boston, wine)，並與回歸模型的結果進行比較

In [70]:
wine = datasets.load_wine()

x_train, x_test, y_train, y_test = train_test_split(wine.data, wine.target, test_size=0.25, random_state=4)

clf = DecisionTreeClassifier()

clf.fit(x_train,y_train)

y_pred = clf.predict(x_test)

In [63]:
print(metrics.accuracy_score(y_test, y_pred))

0.8888888888888888


In [64]:
print("Feature importance: ", clf.feature_importances_)

Feature importance:  [0.01364138 0.03076567 0.         0.         0.04405085 0.04296585
 0.10004551 0.         0.         0.33702516 0.         0.04285558
 0.38865   ]


In [65]:
Regress = DecisionTreeRegressor()

Regress.fit(x_train,y_train)

y_pred = Regress.predict(x_test)

In [66]:
print(metrics.accuracy_score(y_test, y_pred))

0.9555555555555556


In [75]:
print("Feature importance: ", Regress.feature_importances_)

Feature importance:  [0.         0.         0.         0.00299987 0.03685556 0.
 0.61367513 0.00985672 0.         0.08462354 0.         0.
 0.25198917]


In [78]:
from sklearn.linear_model import LogisticRegression
logic = LogisticRegression(max_iter=3000)

logic.fit(x_train,y_train)

y_pred = Regress.predict(x_test)

In [79]:
print(metrics.accuracy_score(y_test, y_pred))

0.9555555555555556


In [80]:
print(logic.coef_)

[[ 0.56700416  0.49137144  0.70426498 -0.20653685 -0.02199934  0.29348082
   0.79242076  0.08027977  0.0835099   0.21299119  0.01556893  0.61757306
   0.00928195]
 [-0.57525603 -0.76734927 -0.64690129  0.13122416 -0.02556365  0.12342167
   0.18664228  0.03300076  0.4944253  -1.09348377  0.2553346   0.08643239
  -0.00731922]
 [ 0.00825187  0.27597783 -0.05736369  0.07531269  0.04756299 -0.41690249
  -0.97906304 -0.11328054 -0.5779352   0.88049258 -0.27090352 -0.70400545
  -0.00196273]]
