# Extreme Gradient Boosting with XGBoost

### [C2] Regression with XGBoost

In [1]:
import pandas as pd
import numpy as np
import xgboost as xgb

from sklearn.model_selection import train_test_split

In [2]:
URL = 'https://assets.datacamp.com/production/repositories/943/datasets/4dbcaee889ef06fb0763e4a8652a4c1f268359b2/ames_housing_trimmed_processed.csv'

In [3]:
df = pd.read_csv(URL)
df.head()

Unnamed: 0,MSSubClass,LotFrontage,LotArea,OverallQual,OverallCond,YearBuilt,Remodeled,GrLivArea,BsmtFullBath,BsmtHalfBath,...,HouseStyle_1.5Unf,HouseStyle_1Story,HouseStyle_2.5Fin,HouseStyle_2.5Unf,HouseStyle_2Story,HouseStyle_SFoyer,HouseStyle_SLvl,PavedDrive_P,PavedDrive_Y,SalePrice
0,60,65.0,8450,7,5,2003,0,1710,1,0,...,0,0,0,0,1,0,0,0,1,208500
1,20,80.0,9600,6,8,1976,0,1262,0,1,...,0,1,0,0,0,0,0,0,1,181500
2,60,68.0,11250,7,5,2001,1,1786,1,0,...,0,0,0,0,1,0,0,0,1,223500
3,70,60.0,9550,7,5,1915,1,1717,1,0,...,0,0,0,0,1,0,0,0,1,140000
4,60,84.0,14260,8,5,2000,0,2198,1,0,...,0,0,0,0,1,0,0,0,1,250000


Creating features and target arrays:

In [4]:
X, y = df.iloc[:, :-1], df.iloc[:, -1]

Splitting the data, fitting the model and predicting values using trees for regression:

In [5]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=123)

In [6]:
# 'reg:squarederror' is the current name of the deprecated 'reg:linear' objective
xg_reg = xgb.XGBRegressor(objective='reg:squarederror', n_estimators=10, seed=123)

xg_reg.fit(X_train, y_train)
y_pred = xg_reg.predict(X_test)



In [7]:
from sklearn.metrics import mean_squared_error

In [8]:
rmse = np.sqrt(mean_squared_error(y_test, y_pred))
print(f'RMSE: {round(rmse, 2)}')

RMSE: 28106.46


As we saw, decision trees can serve as the base model for regression. XGBoost also supports a less common base model, the __linear learner__ (`booster='gblinear'`), which produces a regularized linear regression.

Let's create DMatrices:

In [9]:
dm_train = xgb.DMatrix(data=X_train, label=y_train)
dm_test = xgb.DMatrix(data=X_test, label=y_test)

Setting model parameters:

In [10]:
params = {
    'booster': 'gblinear',
    'objective': 'reg:squarederror'
}

In [11]:
xg_reg = xgb.train(params=params, dtrain=dm_train, num_boost_round=5)



In [12]:
y_pred = xg_reg.predict(dm_test)

In [13]:
rmse = np.sqrt(mean_squared_error(y_test, y_pred))
print(f'RMSE: {round(rmse, 2)}')

RMSE: 45337.82
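
One way to see why the linear learner behaves differently from trees: a sum of linear functions is itself linear, so however many boosting rounds are run, the ensemble collapses into a single linear model. The sketch below illustrates this on toy 1-D data, assuming a plain least-squares fit to the residuals in each round. It is not XGBoost's `gblinear` algorithm, and the function names, data, and `eta` value are hypothetical.

```python
# Illustrative sketch only: boosting with a linear base learner on toy 1-D data.
# This is NOT XGBoost's gblinear implementation; names and data are hypothetical.

def fit_line(xs, ys):
    """Ordinary least-squares fit of ys ~ slope * xs + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    var_x = sum((x - mean_x) ** 2 for x in xs)
    cov_xy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = cov_xy / var_x
    return slope, mean_y - slope * mean_x


def boost_linear(xs, ys, n_rounds=30, eta=0.5):
    """Add up shrunken linear fits to the residuals, one per boosting round."""
    slope, intercept = 0.0, 0.0
    for _ in range(n_rounds):
        residuals = [y - (slope * x + intercept) for x, y in zip(xs, ys)]
        step_slope, step_intercept = fit_line(xs, residuals)
        # The sum of linear functions is linear, so two numbers describe the ensemble.
        slope += eta * step_slope
        intercept += eta * step_intercept
    return slope, intercept


xs = [0.0, 1.0, 2.0, 3.0, 4.0]
ys = [3.0 * x + 2.0 for x in xs]  # noiseless toy target: y = 3x + 2
slope, intercept = boost_linear(xs, ys)
print(f"slope={slope:.4f}, intercept={intercept:.4f}")
```

Because the combined model stays linear, it cannot capture interactions or thresholds the way trees can, which may be one reason the linear learner's RMSE above is higher than the tree-based one.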


#### __Evaluating model quality__

We will compare the RMSE and MAE of a cross-validated XGBoost model:

In [14]:
df_dmatrix = xgb.DMatrix(data=X, label=y)

In [15]:
params = {
    'objective': 'reg:squarederror',
    'max_depth': 4
}

Computing RMSE:

In [16]:
cv_results_rmse = xgb.cv(dtrain=df_dmatrix, params=params, nfold=4, num_boost_round=5,
                    metrics='rmse', as_pandas=True, seed=123)

cv_results_rmse



Unnamed: 0,train-rmse-mean,train-rmse-std,test-rmse-mean,test-rmse-std
0,141767.535156,429.442896,142980.433594,1193.789595
1,102832.541015,322.467623,104891.394531,1223.154578
2,75872.617188,266.47325,79478.9375,1601.344539
3,57245.651368,273.626997,62411.924804,2220.148314
4,44401.295899,316.422824,51348.280274,2963.380073


In [17]:
rmse = cv_results_rmse['test-rmse-mean'].iloc[-1]
print(f'RMSE: {round(rmse, 2)}')

RMSE: 51348.28
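
Each row of the results table corresponds to one boosting round, and the mean and standard deviation columns aggregate that round's metric across the folds. The sketch below shows roughly what a 4-fold split looks like: every row is held out exactly once. It is a simplified illustration with a hypothetical helper, `kfold_indices`, and does not reproduce the exact fold assignment that `xgb.cv` makes internally.

```python
import random


def kfold_indices(n_rows, n_folds, seed):
    """Yield (train_idx, test_idx) pairs for a shuffled k-fold split."""
    indices = list(range(n_rows))
    random.Random(seed).shuffle(indices)
    # Distribute rows round-robin so fold sizes differ by at most one.
    folds = [indices[i::n_folds] for i in range(n_folds)]
    for k in range(n_folds):
        test_idx = folds[k]
        train_idx = [i for j, fold in enumerate(folds) if j != k for i in fold]
        yield train_idx, test_idx


splits = list(kfold_indices(n_rows=10, n_folds=4, seed=123))
for fold_number, (train_idx, test_idx) in enumerate(splits):
    print(f"fold {fold_number}: {len(train_idx)} train rows, {len(test_idx)} test rows")
```

A model is trained on each `train_idx` and scored on the matching `test_idx`, so the test metric always comes from rows the model did not see during training.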


Computing MAE:

In [18]:
cv_results_mae = xgb.cv(dtrain=df_dmatrix, params=params, nfold=4, num_boost_round=5,
                    metrics='mae', as_pandas=True, seed=123)

cv_results_mae



Unnamed: 0,train-mae-mean,train-mae-std,test-mae-mean,test-mae-std
0,127343.56836,668.348346,127633.992188,2403.993258
1,89770.05664,456.94963,90122.496093,2107.910017
2,63580.791016,263.405561,64278.557617,1887.567262
3,45633.140625,151.885298,46819.167969,1459.813514
4,33587.090821,87.001007,35670.650391,1140.607288


In [19]:
mae = cv_results_mae['test-mae-mean'].iloc[-1]
print(f'MAE: {round(mae, 2)}')

MAE: 35670.65
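
RMSE squares each error before averaging, so a few large misses raise it much more than they raise MAE. On the same predictions RMSE can never be smaller than MAE. The toy sketch below, with made-up prices and hypothetical helper functions, compares two sets of predictions that have the same MAE but different RMSE.

```python
import math


def rmse(y_true, y_pred):
    """Root mean squared error: penalizes large errors quadratically."""
    return math.sqrt(sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true))


def mae(y_true, y_pred):
    """Mean absolute error: every unit of error counts the same."""
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)


# Made-up sale prices, for illustration only.
y_true = [200_000, 150_000, 300_000, 250_000]
even_errors = [190_000, 160_000, 290_000, 260_000]   # every prediction off by 10,000
one_big_miss = [200_000, 150_000, 300_000, 210_000]  # one prediction off by 40,000

for name, y_pred in [("even errors", even_errors), ("one big miss", one_big_miss)]:
    print(f"{name}: MAE={mae(y_true, y_pred):.0f}, RMSE={rmse(y_true, y_pred):.0f}")
```

Both sets of predictions have an MAE of 10,000, but the single large miss doubles the RMSE to 20,000. A wide gap between the two metrics, like the one in the cross-validation results above, is therefore a hint that some individual predictions are far off.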


#### __Regularization and base learners__

Regularization parameters in XGBoost:

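- __gamma__: minimum loss reduction required for a split to occur; larger values make the trees more conservative
- __alpha (L1)__: L1 penalty on leaf weights; larger values push more weights to exactly zero
- __lambda (L2)__: L2 penalty on leaf weights; it shrinks weights smoothly toward zero without making them exactly zero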
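
The three parameters can be made concrete with the leaf-weight and split-gain expressions from the second-order objective that XGBoost's tree booster is built on. The sketch below is a simplified illustration, not XGBoost's actual code: the function names are hypothetical, `grad_sum` and `hess_sum` stand for the summed gradients and Hessians of the rows in a leaf, and the gain calculation ignores `alpha` for brevity.

```python
# Simplified illustration of how gamma, alpha and lambda act on a single leaf.
# Function names are hypothetical; the real library handles more cases.

def leaf_weight(grad_sum, hess_sum, reg_lambda=0.0, reg_alpha=0.0):
    """Regularized leaf weight: soft-threshold by alpha, then shrink by lambda."""
    if abs(grad_sum) <= reg_alpha:
        return 0.0  # the L1 penalty zeroes the weight outright
    if grad_sum > 0:
        thresholded = grad_sum - reg_alpha
    else:
        thresholded = grad_sum + reg_alpha
    return -thresholded / (hess_sum + reg_lambda)


def leaf_score(grad_sum, hess_sum, reg_lambda):
    """Quality score of a leaf under an L2 penalty (alpha ignored here)."""
    return grad_sum ** 2 / (hess_sum + reg_lambda)


def split_gain(left, right, reg_lambda=0.0, gamma=0.0):
    """Loss reduction from a split, net of gamma; keep the split only if positive."""
    (g_l, h_l), (g_r, h_r) = left, right
    gain = 0.5 * (
        leaf_score(g_l, h_l, reg_lambda)
        + leaf_score(g_r, h_r, reg_lambda)
        - leaf_score(g_l + g_r, h_l + h_r, reg_lambda)
    )
    return gain - gamma


print(leaf_weight(-8.0, 4.0))                    # no regularization
print(leaf_weight(-8.0, 4.0, reg_lambda=100.0))  # L2: much smaller, but not zero
print(leaf_weight(-8.0, 4.0, reg_alpha=10.0))    # L1: exactly zero
print(split_gain((-6.0, 2.0), (4.0, 2.0)))               # positive: split is kept
print(split_gain((-6.0, 2.0), (4.0, 2.0), gamma=20.0))   # negative: split is rejected
```

Raising `lambda` shrinks every leaf weight, which is consistent with a training RMSE that grows as the L2 penalty increases.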

The following example tunes `lambda`; exactly the same procedure applies to `alpha`:

In [20]:
reg_params = [1, 10, 100]

In [21]:
params = {
    'objective': 'reg:squarederror',
    'max_depth': 3
}

In [22]:
rmse_l2 = []

for reg in reg_params:
    # update L2 strength
    params['lambda'] = reg

    cv_results = xgb.cv(dtrain=df_dmatrix, params=params, nfold=2, num_boost_round=5,
                        metrics='rmse', as_pandas=True, seed=123)
    
    # append the final-round training RMSE to rmse_l2
    rmse_l2.append(cv_results['train-rmse-mean'].iloc[-1])

df_results = pd.DataFrame(list(zip(reg_params, rmse_l2)), columns=['L2', 'rmse'])



In [23]:
print(f'Best rmse as a function of L2:')
print(df_results)

Best rmse as a function of L2:
    L2          rmse
0    1  46935.978515
1   10  54721.828125
2  100  75796.894532


#### __Visualizing individual XGBoost trees__

Let's use the `plot_tree()` function:

In [24]:
params = {"objective":"reg:squarederror", "max_depth":2}

In [25]:
xg_reg = xgb.train(params=params, dtrain=df_dmatrix, num_boost_round=10)



Plotting the first tree:

In [26]:
import matplotlib.pyplot as plt

In [27]:
#xgb.plot_tree(xg_reg, num_trees=0)
#plt.show()