# Goal of this notebook: learning to use the powerful XGBoost algorithm

---

# References

[XGBoost parameter settings](https://github.com/dmlc/xgboost/blob/master/doc/parameter.md)

[Introduction to XGBoost (Python)](http://xgboost.readthedocs.io/en/latest/python/python_intro.html)

[XGBoost GPU support](https://xgboost.readthedocs.io/en/latest/gpu/index.html)

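# Index

[1. Preparing the data](#1.-Preparing-the-data)

[2. Training the model](#2.-Training-the-model)

[3. Inspecting the training progress](#3.-Inspecting-the-training-progress)

[4. Inspecting feature importance](#4.-Inspecting-feature-importance)

[5. Evaluating the regression with $R^2$](#5.-Evaluating-the-regression-with-$R^2$)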

---

In [1]:
import pandas as pd
import numpy as np

import matplotlib.pyplot as plt
%matplotlib inline
import seaborn as sns
sns.set()

from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score

In [2]:
import xgboost as xgb
xgb.__version__

'0.90'

### 1. Preparing the data

In [3]:
df=pd.read_csv('../datasets/blFriday/train.csv') # load the data

In [4]:
df.head(5).round(3)

Unnamed: 0,User_ID,Product_ID,Gender,Age,Occupation,City_Category,Stay_In_Current_City_Years,Marital_Status,Product_Category_1,Product_Category_2,Product_Category_3,Purchase
0,1000001,P00069042,F,0-17,10,A,2,0,3,,,8370
1,1000001,P00248942,F,0-17,10,A,2,0,1,6.0,14.0,15200
2,1000001,P00087842,F,0-17,10,A,2,0,12,,,1422
3,1000001,P00085442,F,0-17,10,A,2,0,12,14.0,,1057
4,1000002,P00285442,M,55+,16,C,4+,0,8,,,7969


In [5]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 550068 entries, 0 to 550067
Data columns (total 12 columns):
User_ID                       550068 non-null int64
Product_ID                    550068 non-null object
Gender                        550068 non-null object
Age                           550068 non-null object
Occupation                    550068 non-null int64
City_Category                 550068 non-null object
Stay_In_Current_City_Years    550068 non-null object
Marital_Status                550068 non-null int64
Product_Category_1            550068 non-null int64
Product_Category_2            376430 non-null float64
Product_Category_3            166821 non-null float64
Purchase                      550068 non-null int64
dtypes: float64(2), int64(5), object(5)
memory usage: 50.4+ MB


In [6]:
df.head(3)

Unnamed: 0,User_ID,Product_ID,Gender,Age,Occupation,City_Category,Stay_In_Current_City_Years,Marital_Status,Product_Category_1,Product_Category_2,Product_Category_3,Purchase
0,1000001,P00069042,F,0-17,10,A,2,0,3,,,8370
1,1000001,P00248942,F,0-17,10,A,2,0,1,6.0,14.0,15200
2,1000001,P00087842,F,0-17,10,A,2,0,12,,,1422


In [7]:
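print('Before filling missing values:')
print(df.isnull().sum())  # count the missing values in each column

df=df.fillna(0)           # fill missing values with 0

print('\nAfter filling missing values:')
print(df.isnull().sum())  # count the missing values in each column

Before filling missing values:
User_ID                            0
Product_ID                         0
Gender                             0
Age                                0
Occupation                         0
City_Category                      0
Stay_In_Current_City_Years         0
Marital_Status                     0
Product_Category_1                 0
Product_Category_2            173638
Product_Category_3            383247
Purchase                           0
dtype: int64

After filling missing values: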
User_ID                       0
Product_ID                    0
Gender                        0
Age                           0
Occupation                    0
City_Category                 0
Stay_In_Current_City_Years    0
Marital_Status                0
Product_Category_1            0
Product_Category_2            0
Product_Category_3            0
Purchase                      0
dtype: int64


In [8]:
# The last column is the target; all other columns are features used to predict it
x=df.iloc[:,0:11]
target=df['Purchase']

In [9]:
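# Use pd.factorize() to encode every column as integer codes
# (label encoding, not one-hot dummy variables)
dataEncoded=pd.DataFrame()
encInfo={}
for col in x.columns:
    factorized=pd.factorize(x[col])

    dataEncoded[col]=factorized[0]  # integer codes, one per row
    encInfo[col]=factorized[1]      # unique original values, used to decode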
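
To make the encoding step concrete, here is a minimal sketch of what `pd.factorize()` returns. The toy `ages` series is made up for illustration (its values are modeled on the `Age` column) and is not part of the dataset.

```python
import pandas as pd

# Toy column with invented values, modeled on the Age column
ages = pd.Series(["0-17", "55+", "0-17", "26-35", "55+"])

codes, uniques = pd.factorize(ages)

# codes: one integer per row, numbered in order of first appearance
print(codes.tolist())    # [0, 1, 0, 2, 1]
# uniques: the original values; position i corresponds to code i
print(uniques.tolist())  # ['0-17', '55+', '26-35']

# Decoding: index into uniques with the codes to recover the labels
decoded = uniques.take(codes)
print(decoded.tolist())  # ['0-17', '55+', '0-17', '26-35', '55+']
```

The integer codes carry no ordering beyond first appearance, which tree-based models usually tolerate, though one-hot encoding is a common alternative.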

In [10]:
# Use scikit-learn's built-in train_test_split to split the data into 70% training and 30% test
trainX,testX,trainY,testY=train_test_split(dataEncoded,target,
                                           test_size=0.3)
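
The split above is random, so the rows in each set (and the scores later in this notebook) can change from run to run. A minimal sketch of making a split reproducible follows; the toy arrays and the seed value `42` are arbitrary assumptions, not taken from the original notebook.

```python
import numpy as np
from sklearn.model_selection import train_test_split

X = np.arange(20).reshape(10, 2)   # 10 made-up rows, 2 features
y = np.arange(10)                  # made-up targets

# Passing the same random_state yields the identical split on every call
X_tr1, X_te1, y_tr1, y_te1 = train_test_split(X, y, test_size=0.3, random_state=42)
X_tr2, X_te2, y_tr2, y_te2 = train_test_split(X, y, test_size=0.3, random_state=42)

print(X_tr1.shape, X_te1.shape)      # (7, 2) (3, 2)
print(np.array_equal(y_te1, y_te2))  # True
```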

In [11]:
# Check the shapes of the data
print('size of the train data (x):\t',trainX.shape)
print('size of the train data (y):\t',trainY.shape)
print('size of the test data (x):\t',testX.shape)
print('size of the test data (y):\t',testY.shape)

size of the train data (x):	 (385047, 11)
size of the train data (y):	 (385047,)
size of the test data (x):	 (165021, 11)
size of the test data (y):	 (165021,)


In [12]:
# Convert the data into the DMatrix format that XGBoost requires
data_train = xgb.DMatrix( trainX, label=trainY)
data_test  = xgb.DMatrix( testX, label=testY)



[Back to index](#Index)

### 2. Training the model

In [13]:
%%time

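# Set the model parameters that tell the algorithm how to train
param = {}
param['objective'] = 'reg:linear' # regression with squared-error loss (deprecated name; newer releases call it 'reg:squarederror')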
param['tree_method'] = 'hist'
param['silent']=1
param['max_depth']=10
eval_list  = [(data_train,'train'),(data_test,'test')]
num_round = 50
eval_history={}

# Train the model
model = xgb.train( param, data_train, num_round,eval_list,
                  evals_result=eval_history,verbose_eval=False)

Wall time: 5.7 s


In [None]:
# If a GPU is available, the code below can be run to speed up training.

# %%time

# # Set the model parameters that tell the algorithm how to train
# param = {}
# param['objective'] = 'reg:linear'
# param['n_gpus']=1
# param['gpu_id']=0
# param['tree_method'] = 'gpu_hist'
# param['silent']=1
# param['max_depth']=6
# eval_list  = [(data_train,'train'),(data_test,'test')]
# num_round = 50
# eval_history={}

# # Train the model
# model = xgb.train( param, data_train, num_round,eval_list,
#                   evals_result=eval_history,verbose_eval=False)

In [None]:
rmse_train=eval_history['train']['rmse']
rmse_test=eval_history['test']['rmse']
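
The `evals_result` dictionary is nested by dataset name and then by metric name, with one value per boosting round. As a hedged sketch, the snippet below uses a hypothetical dictionary with the same nesting (the RMSE numbers are invented) to find the round with the lowest test error.

```python
# Hypothetical contents: same nesting as eval_history,
# but the RMSE numbers are invented for illustration.
eval_history_example = {
    "train": {"rmse": [5200.0, 4100.0, 3300.0, 2900.0, 2700.0]},
    "test":  {"rmse": [5250.0, 4200.0, 3450.0, 3400.0, 3420.0]},
}

test_rmse = eval_history_example["test"]["rmse"]

# Index (0-based boosting round) of the smallest test RMSE
best_round = min(range(len(test_rmse)), key=test_rmse.__getitem__)
print(best_round, test_rmse[best_round])  # 3 3400.0
```

In this made-up example the training error keeps falling after round 3 while the test error rises again, which is the usual sign that further rounds only fit the training set.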

[Back to index](#Index)

### 3. Inspecting the training progress

In [None]:
plt.plot(rmse_train,ms=10,marker='.',label='train_eval')
plt.plot(rmse_test,ms=10,marker='v',label='test_eval')
plt.legend()
plt.show()

In [None]:
# Check the final RMS error on the test data
model.eval(data_test)
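
The metric tracked in this notebook is the root-mean-square error (RMSE). As a minimal sketch of what that number means, it can be computed by hand from true and predicted values; the function name `rmse` and the toy numbers are illustrative assumptions.

```python
import math

def rmse(y_true, y_pred):
    """Root-mean-square error: sqrt(mean((y_true - y_pred)^2))."""
    squared_errors = [(t - p) ** 2 for t, p in zip(y_true, y_pred)]
    return math.sqrt(sum(squared_errors) / len(squared_errors))

y_true = [3.0, 5.0, 7.0]   # made-up targets
y_pred = [2.0, 5.0, 9.0]   # made-up predictions

# Squared errors are 1, 0, 4, so the result is sqrt(5/3), about 1.291
print(rmse(y_true, y_pred))
```

Because the errors are squared before averaging, a few large misses raise RMSE more than many small ones, and the result is in the same units as `Purchase`.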

[Back to index](#Index)

### 4. Inspecting feature importance

In [None]:
from xgboost import plot_importance
plot_importance(model)
plt.show()

[Back to index](#Index)

### 5. Evaluating the regression with $R^2$

In [None]:
from sklearn.metrics import r2_score
testY_pred=model.predict(data_test)
r2_score(testY, testY_pred)

In [None]:
trainY_pred = model.predict(data_train)
r2_score(trainY, trainY_pred)

The model reaches $R^2 = 0.78$ on the training data and $R^2 = 0.73$ on the test data.
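
For reference, $R^2 = 1 - SS_{res}/SS_{tot}$, where $SS_{res}$ is the sum of squared prediction errors and $SS_{tot}$ is the sum of squared deviations of the targets from their mean. The sketch below computes it by hand on made-up numbers; the helper name `r2` is an illustrative assumption and is not the scikit-learn implementation.

```python
def r2(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean_true = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)
    return 1.0 - ss_res / ss_tot

y_true = [1.0, 2.0, 3.0, 4.0]   # made-up targets
y_pred = [1.0, 2.0, 3.0, 2.0]   # made-up predictions

print(r2(y_true, y_pred))        # SS_res = 4, SS_tot = 5, so about 0.2
print(r2(y_true, [2.5] * 4))     # always predicting the mean gives 0.0
print(r2(y_true, y_true))        # perfect predictions give 1.0
```

Read this way, a score of 0.73 means the model's squared error on the test set is 27% of what always predicting the mean purchase would give.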

[Back to index](#Index)

---

#### Exercise 1: Increase the tree depth and see how the model's accuracy changes

In [None]:
# Practice here
# ..

#### Exercise 2: Adjust the strength of the L1/L2 regularization terms and see whether the model's accuracy changes

In [None]:
# Practice here
# ..

#### Exercise 3: Remove the outliers from the Purchase column, then rebuild the model

In [None]:
# Practice here
# ..