# XGBoost Regression

XGBoost belongs to a family of boosting algorithms that convert weak learners into strong learners. A weak learner is a model that performs only slightly better than random guessing.
There are two kinds of base learner, selected with the booster parameter: booster = gbtree and booster = gblinear. You already know gbtree, which boosts decision trees. With gblinear, XGBoost builds a generalized linear model and optimizes it using regularization (L1, L2) and gradient descent. As in any boosting scheme, each subsequent model is built on the residuals (actual - predicted) left by the previous iterations.
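
The residual-fitting idea can be sketched in a few lines of plain Python. This is a minimal illustration under stated assumptions, not XGBoost's actual algorithm: the weak learner is a one-split decision stump, the loss is squared error, and the helper names (`mse`, `fit_stump`, `boost`) and the toy sine data are made up for the example.

```python
import math


def mse(ys, preds):
    """Mean squared error between targets and predictions."""
    return sum((y - p) ** 2 for y, p in zip(ys, preds)) / len(ys)


def fit_stump(xs, residuals):
    """Fit a one-split regression stump (a weak learner) to the residuals.

    Returns (threshold, left_value, right_value) with the lowest squared error.
    """
    best = None
    for t in sorted(set(xs))[:-1]:  # skip the largest x so both sides are non-empty
        left = [r for x, r in zip(xs, residuals) if x <= t]
        right = [r for x, r in zip(xs, residuals) if x > t]
        lv = sum(left) / len(left)
        rv = sum(right) / len(right)
        err = sum((r - lv) ** 2 for r in left) + sum((r - rv) ** 2 for r in right)
        if best is None or err < best[0]:
            best = (err, t, lv, rv)
    return best[1], best[2], best[3]


def boost(xs, ys, n_rounds=50, learning_rate=0.1):
    """Additive boosting: each stump is fitted to the current residuals."""
    base = sum(ys) / len(ys)          # start from the mean prediction
    preds = [base] * len(ys)
    history = [mse(ys, preds)]        # training error before any stump is added
    stumps = []
    for _ in range(n_rounds):
        residuals = [y - p for y, p in zip(ys, preds)]  # actual - predicted
        t, lv, rv = fit_stump(xs, residuals)
        stumps.append((t, lv, rv))
        # Shrink each stump's contribution by the learning rate.
        preds = [p + learning_rate * (lv if x <= t else rv)
                 for x, p in zip(xs, preds)]
        history.append(mse(ys, preds))
    return base, stumps, history


# Toy one-dimensional regression problem (made up for illustration).
xs = [i / 10 for i in range(40)]
ys = [math.sin(x) for x in xs]
base, stumps, history = boost(xs, ys)
print("MSE before boosting:", history[0])
print("MSE after 50 rounds:", history[-1])
```

Each stump on its own is a poor predictor, but because every round targets what the previous rounds got wrong, the training error never increases from one round to the next.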

In [2]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import xgboost
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split

# KC_House_Data

## Data Description

Online property companies offer valuations of houses using machine learning techniques. The aim here is to predict house sale prices in King County, Washington State, USA, comparing Multiple Linear Regression (MLR) with XGBoost. The dataset consists of historic data on houses sold between May 2014 and May 2015.
Price is the target variable.

id
date: Date house was sold
price: Price of the sold house
bedrooms: Number of Bedrooms
bathrooms: Number of bathrooms
sqft_living: Square footage of the living space
sqft_lot: Square footage of the lot
floors: Total floors in the house
waterfront: Whether the house is on a waterfront(1: yes, 0: no)
view: special view?
condition: Condition of the house
grade: unknown
sqft_above: Square footage of house apart from basement
sqft_basement: Square footage of the basement
yr_built: Built year
yr_renovated: Year when the house was renovated
zipcode: zipcode of the house
lat: Latitude coordinate
long: Longitude coordinate
sqft_living15: Living room area in 2015 (implies some renovations)
sqft_lot15: Lot area in 2015 (implies some renovations)

In [3]:
data = pd.read_csv("kc_house_data.csv")

In [4]:
data.head()

Unnamed: 0,id,date,price,bedrooms,bathrooms,sqft_living,sqft_lot,floors,waterfront,view,...,grade,sqft_above,sqft_basement,yr_built,yr_renovated,zipcode,lat,long,sqft_living15,sqft_lot15
0,7129300520,20141013T000000,221900.0,3,1.0,1180,5650,1.0,0,0,...,7,1180,0,1955,0,98178,47.5112,-122.257,1340,5650
1,6414100192,20141209T000000,538000.0,3,2.25,2570,7242,2.0,0,0,...,7,2170,400,1951,1991,98125,47.721,-122.319,1690,7639
2,5631500400,20150225T000000,180000.0,2,1.0,770,10000,1.0,0,0,...,6,770,0,1933,0,98028,47.7379,-122.233,2720,8062
3,2487200875,20141209T000000,604000.0,4,3.0,1960,5000,1.0,0,0,...,7,1050,910,1965,0,98136,47.5208,-122.393,1360,5000
4,1954400510,20150218T000000,510000.0,3,2.0,1680,8080,1.0,0,0,...,8,1680,0,1987,0,98074,47.6168,-122.045,1800,7503


In [5]:
data.shape

(21613, 21)

Checking for Null Values

In [6]:
data.isnull().sum()

id               0
date             0
price            0
bedrooms         0
bathrooms        0
sqft_living      0
sqft_lot         0
floors           0
waterfront       0
view             0
condition        0
grade            0
sqft_above       0
sqft_basement    0
yr_built         0
yr_renovated     0
zipcode          0
lat              0
long             0
sqft_living15    0
sqft_lot15       0
dtype: int64

The date variable is a string and is not required to predict the target variable, so we drop it from the data.

In [7]:
x = data.drop("price",axis=1)
X = x.drop("date",axis=1)
Y = data.price.values

Splitting The data into train and test sets

In [8]:
X_train, X_test, y_train, y_test = train_test_split(X, Y ,test_size=0.2)

Applying Linear Regression

In [9]:
lr = LinearRegression()
lr.fit(X_train,y_train)

LinearRegression(copy_X=True, fit_intercept=True, n_jobs=None,
         normalize=False)

Applying XGBoost Regression

In [10]:
xg = xgboost.XGBRegressor(random_state=3)
xg.fit(X_train,y_train)

XGBRegressor(base_score=0.5, booster='gbtree', colsample_bylevel=1,
       colsample_bytree=1, gamma=0, learning_rate=0.1, max_delta_step=0,
       max_depth=3, min_child_weight=1, missing=None, n_estimators=100,
       n_jobs=1, nthread=None, objective='reg:linear', random_state=3,
       reg_alpha=0, reg_lambda=1, scale_pos_weight=1, seed=None,
       silent=True, subsample=1)

Scoring Linear Regression and XGBoost with default Parameters

In [11]:
print("Linear Regression train score:",lr.score(X_train,y_train))
print("Xgboost Regression train score:",xg.score(X_train,y_train))
print("Linear Regression test score:",lr.score(X_test,y_test))
print("Xgboost Regression test score:",xg.score(X_test,y_test))

Linear Regression train score: 0.7055845307421902
Xgboost Regression train score: 0.8990612034704353
Linear Regression test score: 0.6787861395754815
Xgboost Regression test score: 0.8698031037815236


As we can see, XGBoost scores much better than Linear Regression on both the train and test sets. However, its gap between train and test scores is slightly larger than that of Linear Regression. Let us go through the parameters and try to tune them.
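
Before tuning, it helps to recall what `score` reports. For scikit-learn style regressors, `score` returns the coefficient of determination, R² = 1 - SS_res / SS_tot, so 1.0 is a perfect fit and 0.0 is no better than always predicting the mean. Below is a minimal sketch using a hypothetical helper `r2` and made-up toy values, followed by the train/test gaps computed from the scores printed above.

```python
def r2(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean = sum(y_true) / len(y_true)
    ss_res = sum((y - p) ** 2 for y, p in zip(y_true, y_pred))  # residual sum of squares
    ss_tot = sum((y - mean) ** 2 for y in y_true)               # total sum of squares
    return 1.0 - ss_res / ss_tot


# Toy targets (made up for illustration).
y_true = [3.0, 5.0, 7.0, 9.0]
perfect = r2(y_true, y_true)        # predictions equal the targets
mean_only = r2(y_true, [6.0] * 4)   # always predict the mean of y_true

# Train/test gaps, using the scores reported in the output above.
lr_gap = 0.7055845307421902 - 0.6787861395754815
xgb_gap = 0.8990612034704353 - 0.8698031037815236

print("R2 of perfect predictions:", perfect)
print("R2 of mean-only predictions:", mean_only)
print("Linear Regression gap:", round(lr_gap, 4))
print("XGBoost gap:", round(xgb_gap, 4))
```

Both gaps are small (roughly 0.027 and 0.029), so XGBoost falls only marginally further below its training score than Linear Regression does.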

__max_depth (int)__ (default = 3) – Maximum tree depth for base learners. Deeper trees make the model more complex and more likely to overfit; decreasing the depth too far might cause underfitting.

__learning_rate (float)__ (default = 0.1)– Boosting learning rate (xgb’s “eta”). Usually between 0.01-0.2

__n_estimators (int)__ (default = 100)– Number of boosted trees to fit. 

__silent (boolean)__ (default = 1)– Whether to print messages while running boosting.

__objective (string or callable)__(default = "reg:linear") – Specify the learning task and the corresponding learning objective or a custom objective function to be used.
reg:linear - for linear regression <br>
binary:logistic - logistic regression for binary classification. It returns class probabilities <br>
multi:softmax - multiclassification using softmax objective. It returns predicted class labels. It requires setting num_class parameter denoting number of unique prediction classes. <br>
multi:softprob - multiclassification using softmax objective. It returns predicted class probabilities.

__booster (string)__ (default = 'gbtree')– Specify which booster to use: gbtree, gblinear or dart.

__nthread (int)__(default = None) – Number of parallel threads used to run xgboost. (Deprecated, please use n_jobs)

__n_jobs (int)__ (default = 1) – Number of parallel threads used to run xgboost. (replaces nthread)

__gamma (float)__ (default = 0) – Minimum loss reduction required to make a further partition on a leaf node of the tree. Increasing the value of gamma regularizes the model and helps prevent overfitting.

__min_child_weight (int)__ (default = 1) – Minimum sum of instance weight (hessian) needed in a child. The larger the value, the more conservative the algorithm, which helps prevent overfitting.

__max_delta_step (int)__ (default = 0)– Maximum delta step we allow each tree’s weight estimation to be.

__subsample (float)__ (default = 1)– Subsample ratio of the training instance.

__colsample_bytree (float)__ (default = 1)– Subsample ratio of columns when constructing each tree.

__colsample_bylevel (float)__(default = 1) – Subsample ratio of columns for each split, in each level.

__reg_alpha (float (xgb's alpha))__(default = 0) – L1 regularization term on weights.

__reg_lambda (float (xgb's lambda))__ (default = 1)– L2 regularization term on weights.

__scale_pos_weight (float)__ (default = 1)– Balancing of positive and negative weights.

__base_score__(default = 0.5) – The initial prediction score of all instances, global bias.

__seed (int)__(default = None) – Random number seed. (Deprecated, please use random_state)

__random_state (int)__ (default = 0)– Random number seed. (replaces seed)

__missing (float, optional)__ (default = None) – Value in the data which needs to be treated as a missing value. If None, defaults to np.nan.
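
To make gamma, reg_lambda and min_child_weight more concrete, the sketch below evaluates the split gain formula given in the XGBoost documentation: Gain = 1/2 [G_L²/(H_L+λ) + G_R²/(H_R+λ) - (G_L+G_R)²/(H_L+H_R+λ)] - γ, where G and H are the sums of gradients and hessians in each child. It assumes a squared-error objective (gradient = predicted - actual, hessian = 1 per instance), ignores reg_alpha, and uses hypothetical helper names and made-up gradients. The library's real split finding is more involved, so treat this only as an illustration of how the three parameters interact.

```python
def leaf_score(G, H, reg_lambda):
    """Structure score of a leaf: G^2 / (H + lambda)."""
    return G * G / (H + reg_lambda)


def split_gain(grad_left, grad_right, reg_lambda=1.0, gamma=0.0):
    """Gain of splitting one leaf into two children.

    Assumes squared error, so each instance contributes a hessian of 1
    and the hessian sum of a child is simply its number of instances.
    """
    G_L, H_L = sum(grad_left), float(len(grad_left))
    G_R, H_R = sum(grad_right), float(len(grad_right))
    children = leaf_score(G_L, H_L, reg_lambda) + leaf_score(G_R, H_R, reg_lambda)
    parent = leaf_score(G_L + G_R, H_L + H_R, reg_lambda)
    return 0.5 * (children - parent) - gamma


def split_allowed(grad_left, grad_right, reg_lambda=1.0, gamma=0.0,
                  min_child_weight=1.0):
    """Simplified acceptance rule (an assumption made for this sketch)."""
    # Each child's hessian sum must reach min_child_weight.
    if len(grad_left) < min_child_weight or len(grad_right) < min_child_weight:
        return False
    # The split must still pay off after subtracting the gamma penalty.
    return split_gain(grad_left, grad_right, reg_lambda, gamma) > 0


# Made-up gradients (predicted - actual) for the two candidate children.
left = [-2.0, -1.5, -2.5]
right = [1.0, 1.5, 2.0, 1.5]

print("gain with gamma=0:", split_gain(left, right))
print("allowed by default:", split_allowed(left, right))
print("allowed with gamma=10:", split_allowed(left, right, gamma=10.0))
print("allowed with min_child_weight=4:",
      split_allowed(left, right, min_child_weight=4.0))
```

A larger gamma or min_child_weight rejects splits that would otherwise be made, and a larger reg_lambda shrinks every leaf score, which is why all three act as regularizers.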



In [12]:
xg = xgboost.XGBRegressor(colsample_bytree=0.4603, gamma=0,
                          learning_rate=0.1, max_depth=3,
                          min_child_weight=1.7817, n_estimators=100,
                          reg_alpha=0.4640, reg_lambda=0.8571, silent=1,
                          random_state=3)
xg.fit(X_train,y_train)

XGBRegressor(base_score=0.5, booster='gbtree', colsample_bylevel=1,
       colsample_bytree=0.4603, gamma=0, learning_rate=0.1,
       max_delta_step=0, max_depth=3, min_child_weight=1.7817,
       missing=None, n_estimators=100, n_jobs=1, nthread=None,
       objective='reg:linear', random_state=3, reg_alpha=0.464,
       reg_lambda=0.8571, scale_pos_weight=1, seed=None, silent=1,
       subsample=1)

In [13]:
print("Xgboost Regression train score:",xg.score(X_train,y_train))
print("Xgboost Regression test score:",xg.score(X_test,y_test))

Xgboost Regression train score: 0.897150668962933
Xgboost Regression test score: 0.8673451395220376


## Visualizing the XGBoost tree

In [16]:
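# Set the figure size before plotting so that it applies to this figure
plt.rcParams['figure.figsize'] = [50, 40]
# Plot the tree at index 2 of the fitted model (requires graphviz)
xgboost.plot_tree(xg, num_trees=2)
plt.show()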

Reference - https://www.hackerearth.com/practice/machine-learning/machine-learning-algorithms/beginners-tutorial-on-xgboost-parameter-tuning-r/tutorial/