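## Random Forest
- Random forest classifier: for classification problems, where the target to predict is discrete (e.g. Titanic: died vs. survived)
- Random forest regressor: for regression problems, where the target to predict is continuous (e.g. house price: the sale price)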

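#### *Before processing the data, inspect the columns and split them into numeric and categorical types
e.g. the house_price data has 81 columns; MSSubClass is stored as integer codes but is categorical, so it needs One-Hot encoding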
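
A minimal sketch of why an integer-coded categorical column needs explicit handling. The toy frame and its values are made up for illustration; only the column names come from the dataset. By default `pd.get_dummies` expands only object, string, and category columns, so an integer column such as MSSubClass is passed through unless it is named in `columns`.

```python
import pandas as pd

# Toy frame (hypothetical values): MSSubClass holds integers,
# but each integer is a category code rather than a quantity.
toy = pd.DataFrame({
    "MSSubClass": [20, 60, 20, 70],
    "LotArea": [8450, 9600, 11250, 9550],
})

# Without `columns`, the integer column is left as-is.
untouched = pd.get_dummies(toy)

# Naming the column forces it to be expanded into one column per code.
encoded = pd.get_dummies(toy, columns=["MSSubClass"])

print(list(untouched.columns))
print(list(encoded.columns))
```

An alternative with the same effect would be to cast the column to `str` or `category` first and then call `pd.get_dummies` without `columns`.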

> #### Step 1: Load the house_price dataset

In [1]:
import pandas as pd
train_df = pd.read_csv("../house_train.csv", encoding="utf-8")
test_df = pd.read_csv("../house_test.csv", encoding="utf-8")

> #### Step 2: Concatenate the training and test sets, and separate out the training target

In [2]:
total_df = pd.concat([train_df, test_df], axis=0)
data = total_df.drop(["SalePrice"], axis=1)

y_train = train_df["SalePrice"]

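### From here on, the working table is data (training and test sets combined, without the training target)
- Why merge first: if the two sets were One-Hot encoded separately, one of them might not contain every category value, and the expanded tables would end up with different columns
- If you do not merge first, pandas.DataFrame.align can be used to line the columns up afterwards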
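
The `align` alternative mentioned above can be sketched as follows. The two small frames are hypothetical; they only mimic the situation where a category value ("Grvl") appears in the training set but not in the test set.

```python
import pandas as pd

# Hypothetical mini training/test sets; "Grvl" never appears in the test part.
train_part = pd.DataFrame({"Street": ["Pave", "Grvl", "Pave"]})
test_part = pd.DataFrame({"Street": ["Pave", "Pave"]})

# Encoding separately produces different column sets.
train_enc = pd.get_dummies(train_part)
test_enc = pd.get_dummies(test_part)

# join="left" keeps the training columns; any column missing from the
# test frame is created there and filled with 0.
train_aligned, test_aligned = train_enc.align(
    test_enc, join="left", axis=1, fill_value=0
)

print(list(test_enc.columns))
print(list(test_aligned.columns))
```

With `join="left"`, a category that exists only in the test set would be dropped from the test frame, which is usually what a model trained on the training columns needs.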

> #### Step 3: Fill missing values, and One-Hot encode the categorical columns

In [3]:
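# Fill missing values in the numeric columns with each column's median
# (numeric_only=True skips the text columns; MSSubClass is excluded
#  because it is a category code, not a quantity)
med = data.median(numeric_only=True).drop(["MSSubClass"])
data = data.fillna(med)

# Categorical columns are not filled with the most frequent value; they are One-Hot encoded directly
data = pd.get_dummies(data)
# One-Hot encode the MSSubClass column
data = pd.get_dummies(data, columns=["MSSubClass"])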
data = data.drop(["Id"], axis=1)
data

Unnamed: 0,LotFrontage,LotArea,OverallQual,OverallCond,YearBuilt,YearRemodAdd,MasVnrArea,BsmtFinSF1,BsmtFinSF2,BsmtUnfSF,...,MSSubClass_70,MSSubClass_75,MSSubClass_80,MSSubClass_85,MSSubClass_90,MSSubClass_120,MSSubClass_150,MSSubClass_160,MSSubClass_180,MSSubClass_190
0,65.0,8450,7,5,2003,2003,196.0,706.0,0.0,150.0,...,0,0,0,0,0,0,0,0,0,0
1,80.0,9600,6,8,1976,1976,0.0,978.0,0.0,284.0,...,0,0,0,0,0,0,0,0,0,0
2,68.0,11250,7,5,2001,2002,162.0,486.0,0.0,434.0,...,0,0,0,0,0,0,0,0,0,0
3,60.0,9550,7,5,1915,1970,0.0,216.0,0.0,540.0,...,1,0,0,0,0,0,0,0,0,0
4,84.0,14260,8,5,2000,2000,350.0,655.0,0.0,490.0,...,0,0,0,0,0,0,0,0,0,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
1454,21.0,1936,4,7,1970,1970,0.0,0.0,0.0,546.0,...,0,0,0,0,0,0,0,1,0,0
1455,21.0,1894,4,5,1970,1970,0.0,252.0,0.0,294.0,...,0,0,0,0,0,0,0,1,0,0
1456,160.0,20000,5,7,1960,1996,0.0,1224.0,0.0,0.0,...,0,0,0,0,0,0,0,0,0,0
1457,62.0,10441,5,5,1992,1992,0.0,337.0,0.0,575.0,...,0,0,0,1,0,0,0,0,0,0


In [4]:
# Check whether any missing values remain
s = data.isna().sum()
s[s > 0]

Series([], dtype: int64)
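
The empty result is expected even though the categorical columns were never filled. A minimal sketch on a made-up two-column frame (the values are illustrative): `fillna` with a Series fills only the columns named in that Series, and `pd.get_dummies` turns a missing category into a row of all zeros rather than a NaN.

```python
import numpy as np
import pandas as pd

# Hypothetical frame with one numeric and one categorical column, both with a gap.
toy = pd.DataFrame({
    "LotFrontage": [65.0, np.nan, 80.0],
    "Alley": ["Grvl", np.nan, "Pave"],
})

# Per-column medians of the numeric columns only.
med = toy.median(numeric_only=True)

# Only LotFrontage is filled; Alley keeps its NaN at this point.
filled = toy.fillna(med)

# The missing Alley value becomes 0 in every Alley_* column.
encoded = pd.get_dummies(filled)
remaining = int(encoded.isna().sum().sum())

print(filled)
print(encoded)
print(remaining)
```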

> #### Step 4: Split the processed data table back into training and test sets

In [9]:
# .shape is a tuple (rows, columns)
# number of rows in the training set
train_df.shape[0]

# loc selects by row label; iloc selects by integer position 0, 1, ...
# iloc -> ["first row", "second row", ...]
x_train = data.iloc[0:train_df.shape[0]]
x_test = data.iloc[train_df.shape[0]:]
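
A short sketch of why the split uses `iloc` rather than `loc`. It assumes, as in Step 2, that the two frames were concatenated without `ignore_index=True`, so the row labels restart at 0 for the second frame. The frames `a` and `b` are hypothetical stand-ins for the training and test sets.

```python
import pandas as pd

a = pd.DataFrame({"v": [10, 11, 12]})   # stand-in for the training set
b = pd.DataFrame({"v": [20, 21]})       # stand-in for the test set

# Labels are kept as-is, so the index becomes 0, 1, 2, 0, 1.
both = pd.concat([a, b], axis=0)

n = a.shape[0]
head = both.iloc[:n]    # first n rows by position
tail = both.iloc[n:]    # everything after them

# Label-based lookup is ambiguous here: label 0 matches two rows.
by_label = both.loc[0]

print(list(both.index))
print(len(by_label))
```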

> #### Step 5: Use GridSearchCV to find the best parameters

In [None]:
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import GridSearchCV
from sklearn.metrics import r2_score
reg = RandomForestRegressor()
params = {
    "n_estimators": range(20, 150, 20),
    "max_depth": range(5, 15)
}
# Set the scoring criterion to R² (r2_score)
cv = GridSearchCV(reg, params, 
                  scoring="r2", 
                  cv=10, 
                  n_jobs=4)
cv.fit(x_train, y_train)
print(cv.best_params_)
print(cv.best_score_)
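
To get a feel for how expensive this search is, the grid can be counted before running it. The sketch below reuses the same two parameter ranges and the fold count of 10 from the cell above; `ParameterGrid` is the scikit-learn helper that enumerates the combinations.

```python
from sklearn.model_selection import ParameterGrid

params = {
    "n_estimators": list(range(20, 150, 20)),  # 20, 40, ..., 140
    "max_depth": list(range(5, 15)),           # 5, 6, ..., 14
}

n_candidates = len(ParameterGrid(params))  # every combination of the two lists
n_folds = 10
n_fits = n_candidates * n_folds            # one fit per candidate per fold

print(n_candidates, n_fits)
```

With the default `refit=True`, GridSearchCV fits once more on the full training set using the best combination, so the total is one fit higher.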

> #### Step 6: Build the model with RandomForestRegressor, plugging in the best parameters found in the previous step; then predict

In [11]:
reg = RandomForestRegressor(n_estimators=75, max_depth=8)
reg.fit(x_train, y_train)

pre = reg.predict(x_test)
test_id = test_df["Id"]
result = pd.DataFrame({
    "Id": test_id,
    "SalePrice": pre
})
result

Unnamed: 0,Id,SalePrice
0,1461,126369.887590
1,1462,151969.023825
2,1463,178944.169988
3,1464,181210.354497
4,1465,202144.335042
...,...,...
1454,2915,88077.204969
1455,2916,90891.891429
1456,2917,149569.421147
1457,2918,115866.120950
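
Instead of typing the best parameters in by hand, they can be unpacked from the search object. This is a sketch on synthetic data from `make_regression`, with a deliberately tiny grid so it runs quickly; the data, the grid values, and the name `search` are assumptions for illustration and are not the notebook's actual settings.

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import GridSearchCV

# Synthetic stand-in for x_train / y_train.
X, y = make_regression(n_samples=120, n_features=5, noise=10.0, random_state=0)

search = GridSearchCV(
    RandomForestRegressor(random_state=0),
    {"n_estimators": [10, 20], "max_depth": [3, 5]},
    scoring="r2",
    cv=3,
)
search.fit(X, y)

# Unpack the winning settings straight into a new model.
reg = RandomForestRegressor(random_state=0, **search.best_params_)
reg.fit(X, y)
pred = reg.predict(X[:5])

print(search.best_params_)
```

Because `refit=True` is the default, `search.best_estimator_` is already a model fitted with these parameters and could be used for prediction directly.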


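### Choosing an algorithm
- Using a random forest (decision trees) for a regression problem tends to overfit (the model reads too much into the data): the fitted regression curve is step-shaped, when in practice a straight line often describes the data better
- For classification this problem does not arise, because the effect is cancelled out in the final prediction
- Hence the recommendation: classification -> a complex algorithm is fine; regression -> prefer a simple algorithm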
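
The step-shaped curve can be seen on a small synthetic example. This sketch uses a single `DecisionTreeRegressor` rather than a full forest to keep the effect obvious (a forest averages many trees, so its steps are finer but the prediction is still piecewise constant). The data is made up: a noisy straight line with slope 3.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.tree import DecisionTreeRegressor

# Synthetic data: y is roughly 3 * x plus noise.
rng = np.random.default_rng(0)
X = np.linspace(0, 10, 200).reshape(-1, 1)
y = 3.0 * X.ravel() + rng.normal(scale=1.0, size=200)

tree = DecisionTreeRegressor(max_depth=3, random_state=0).fit(X, y)
line = LinearRegression().fit(X, y)

# Predict on a dense grid and count how many distinct values each model outputs.
grid = np.linspace(0, 10, 1000).reshape(-1, 1)
tree_levels = np.unique(tree.predict(grid))
line_levels = np.unique(line.predict(grid))

# A depth-3 tree has at most 8 leaves, hence at most 8 flat steps.
print(len(tree_levels), len(line_levels))
```

Plotting `tree.predict(grid)` against `grid` would show the staircase directly, while the linear model produces a different value at every grid point.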