<h2>XGBoost Algorithm</h2>
XGBoost (eXtreme Gradient Boosting) is an implementation of gradient-boosted decision trees, an ensemble learning technique that reduces training error by combining many weak learners into a single strong learner. It is used for supervised learning problems such as regression and classification.
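
To make the idea of combining weak learners concrete, here is a minimal sketch of plain gradient boosting for squared error, using one-split "stumps" on synthetic data. It is an illustration only, not XGBoost's actual algorithm (which adds regularization and second-order gradient information); the helper names `fit_stump`, `predict_stump` and `boost` are hypothetical.

```python
import numpy as np

def fit_stump(x, residuals):
    """Find the single threshold on x that minimizes squared error on the residuals."""
    best = None
    for threshold in np.unique(x)[:-1]:  # skip the max so the right side is never empty
        left = residuals[x <= threshold]
        right = residuals[x > threshold]
        sse = ((left - left.mean()) ** 2).sum() + ((right - right.mean()) ** 2).sum()
        if best is None or sse < best[0]:
            best = (sse, threshold, left.mean(), right.mean())
    _, threshold, left_value, right_value = best
    return threshold, left_value, right_value

def predict_stump(stump, x):
    threshold, left_value, right_value = stump
    return np.where(x <= threshold, left_value, right_value)

def boost(x, y, n_rounds=50, learning_rate=0.1):
    """Start from the mean, then repeatedly fit a stump to the current residuals."""
    base = y.mean()
    prediction = np.full_like(y, base, dtype=float)
    stumps = []
    for _ in range(n_rounds):
        residuals = y - prediction
        stump = fit_stump(x, residuals)
        prediction += learning_rate * predict_stump(stump, x)
        stumps.append(stump)
    return base, stumps, prediction

# Synthetic data (assumed for illustration, unrelated to the Boston dataset)
rng = np.random.default_rng(0)
x = rng.uniform(0, 10, size=200)
y = np.sin(x) + rng.normal(scale=0.1, size=200)

base, stumps, fitted = boost(x, y)
rmse_baseline = np.sqrt(np.mean((y - y.mean()) ** 2))
rmse_boosted = np.sqrt(np.mean((y - fitted) ** 2))
print("mean-only RMSE:", rmse_baseline)
print("boosted RMSE:  ", rmse_boosted)
```

Each stump on its own is a poor model, but every round removes part of the remaining error, so the training RMSE of the combined model ends up below that of the mean-only baseline.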

In this lesson, you'll learn how to tackle a regression problem with XGBoost. The dataset comes from the UCI Machine Learning Repository and is also available in the datasets module of sklearn (in scikit-learn versions before 1.2). The task is to predict the median value of owner-occupied homes, in units of $1000, using 13 explanatory variables that describe various attributes of residential areas in Boston.

In [1]:
#Importing essential standard Libraries for computing our data
import numpy as np
import pandas as pd

In [2]:
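#Importing Boston Housing dataset from sklearn library
#Note: load_boston was deprecated in scikit-learn 1.0 and removed in 1.2, so this cell needs an older version.
from sklearn.datasets import load_boston
df = load_boston()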

Data Exploration

In [3]:
#Keys of the Bunch object returned by load_boston
df.keys()

dict_keys(['data', 'target', 'feature_names', 'DESCR', 'filename'])

In [4]:
#Brief Description of open source Boston Housing Dataset
print(df.DESCR)

.. _boston_dataset:

Boston house prices dataset
---------------------------

**Data Set Characteristics:**  

    :Number of Instances: 506 

    :Number of Attributes: 13 numeric/categorical predictive. Median Value (attribute 14) is usually the target.

    :Attribute Information (in order):
        - CRIM     per capita crime rate by town
        - ZN       proportion of residential land zoned for lots over 25,000 sq.ft.
        - INDUS    proportion of non-retail business acres per town
        - CHAS     Charles River dummy variable (= 1 if tract bounds river; 0 otherwise)
        - NOX      nitric oxides concentration (parts per 10 million)
        - RM       average number of rooms per dwelling
        - AGE      proportion of owner-occupied units built prior to 1940
        - DIS      weighted distances to five Boston employment centres
        - RAD      index of accessibility to radial highways
        - TAX      full-value property-tax rate per $10,000
        - PTRATIO  pu

In [5]:
#Exploring the attributes (columns) of our dataset.
print(df.feature_names)

['CRIM' 'ZN' 'INDUS' 'CHAS' 'NOX' 'RM' 'AGE' 'DIS' 'RAD' 'TAX' 'PTRATIO'
 'B' 'LSTAT']


In [6]:
#Converting the data present in dictionary format to Dataframe Format.
boston = pd.DataFrame(df.data,columns = df.feature_names)
boston.head()

Unnamed: 0,CRIM,ZN,INDUS,CHAS,NOX,RM,AGE,DIS,RAD,TAX,PTRATIO,B,LSTAT
0,0.00632,18.0,2.31,0.0,0.538,6.575,65.2,4.09,1.0,296.0,15.3,396.9,4.98
1,0.02731,0.0,7.07,0.0,0.469,6.421,78.9,4.9671,2.0,242.0,17.8,396.9,9.14
2,0.02729,0.0,7.07,0.0,0.469,7.185,61.1,4.9671,2.0,242.0,17.8,392.83,4.03
3,0.03237,0.0,2.18,0.0,0.458,6.998,45.8,6.0622,3.0,222.0,18.7,394.63,2.94
4,0.06905,0.0,2.18,0.0,0.458,7.147,54.2,6.0622,3.0,222.0,18.7,396.9,5.33


Note that the DataFrame doesn't include the target (price) column, because the target values are stored in a separate attribute, df.target. Append df.target to the pandas DataFrame as a new column named TARGET_VALUE.

In [7]:
boston['TARGET_VALUE'] = df.target
boston.head()

Unnamed: 0,CRIM,ZN,INDUS,CHAS,NOX,RM,AGE,DIS,RAD,TAX,PTRATIO,B,LSTAT,TARGET_VALUE
0,0.00632,18.0,2.31,0.0,0.538,6.575,65.2,4.09,1.0,296.0,15.3,396.9,4.98,24.0
1,0.02731,0.0,7.07,0.0,0.469,6.421,78.9,4.9671,2.0,242.0,17.8,396.9,9.14,21.6
2,0.02729,0.0,7.07,0.0,0.469,7.185,61.1,4.9671,2.0,242.0,17.8,392.83,4.03,34.7
3,0.03237,0.0,2.18,0.0,0.458,6.998,45.8,6.0622,3.0,222.0,18.7,394.63,2.94,33.4
4,0.06905,0.0,2.18,0.0,0.458,7.147,54.2,6.0622,3.0,222.0,18.7,396.9,5.33,36.2


In [8]:
#Checking for presence of null values in our dataset.
boston.isnull().sum()

CRIM            0
ZN              0
INDUS           0
CHAS            0
NOX             0
RM              0
AGE             0
DIS             0
RAD             0
TAX             0
PTRATIO         0
B               0
LSTAT           0
TARGET_VALUE    0
dtype: int64

Next, split the dataset into training and testing sets: the training set is used to fit the XGBoost model, and the testing set is used to evaluate its performance.

In [9]:
from sklearn.model_selection import train_test_split

# X holds the features used to train the model and
# Y holds the target variable that the model has to predict.
# X (features): all columns except the "TARGET_VALUE" column
# Y (labels): the "TARGET_VALUE" column
X = boston.drop('TARGET_VALUE',axis=1)
Y = boston['TARGET_VALUE']

X_train, X_test, y_train, y_test = train_test_split(X,Y,test_size=0.2,random_state=100)

print(X_train.shape)
print(X_test.shape)
print(y_train.shape)
print(y_test.shape)

(404, 13)
(102, 13)
(404,)
(102,)
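
The shapes above follow from the 80/20 split of 506 rows. The sketch below shows the arithmetic with a hand-rolled shuffled split. The function `shuffled_split` is hypothetical, and rounding the test size up is an assumption about how `train_test_split` sizes the sets; the sketch does not reproduce the exact rows that sklearn selects for `random_state=100`.

```python
import math
import numpy as np

def shuffled_split(n_samples, test_size=0.2, seed=100):
    """Shuffle row indices, then cut off the first n_test of them as the test set."""
    n_test = math.ceil(test_size * n_samples)  # 0.2 * 506 = 101.2, rounded up to 102
    rng = np.random.default_rng(seed)
    order = rng.permutation(n_samples)
    return order[n_test:], order[:n_test]

train_idx, test_idx = shuffled_split(506)
print(len(train_idx), len(test_idx))  # 404 102
```

Every row lands in exactly one of the two sets, so the model is never evaluated on data it was trained on.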


Creating the Model

In [10]:
import xgboost as xgb

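# 'reg:squarederror' replaces the deprecated 'reg:linear' objective (same squared-error loss);
# reg_alpha is the L1 regularization term (an alias of the native alpha parameter).
xgboost_reg = xgb.XGBRegressor(objective='reg:squarederror', colsample_bytree=0.3,
                learning_rate=0.1, max_depth=5, reg_alpha=10, n_estimators=10)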

Fitting the XGBoost regressor on the training set and making predictions on the test set.

In [11]:
xgboost_reg.fit(X_train,y_train)

predictions = xgboost_reg.predict(X_test)



Evaluating the performance of our Model.

In [12]:
from sklearn.metrics import mean_squared_error

rmse = np.sqrt(mean_squared_error(y_test, predictions))
print("RMSE: %f" % (rmse))

RMSE: 10.032320


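The RMSE for the price prediction comes out to about 10.03. Because the target is expressed in units of $1000, this corresponds to a typical prediction error of roughly $10,000. With only 10 boosting rounds at a learning rate of 0.1, the model is probably underfit, so a higher n_estimators may lower this error.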
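
RMSE is the square root of the mean squared difference between the actual and the predicted values, so it is expressed in the same units as the target. The sketch below computes it by hand for four target values taken from the table above, paired with made-up predictions (an assumption for illustration, not output from the model), and compares the result with a baseline that always predicts the mean.

```python
import numpy as np

def rmse(y_true, y_pred):
    """Root mean squared error between two equal-length sequences."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    return float(np.sqrt(np.mean((y_true - y_pred) ** 2)))

actual = [24.0, 21.6, 34.7, 33.4]      # first four TARGET_VALUE entries
predicted = [22.0, 23.6, 30.7, 33.4]   # hypothetical predictions

# Errors are 2, -2, 4, 0, so the mean squared error is (4 + 4 + 16 + 0) / 4 = 6
model_rmse = rmse(actual, predicted)

# Baseline: always predict the mean of the actual values
baseline_rmse = rmse(actual, [np.mean(actual)] * len(actual))

print("model RMSE:   ", model_rmse)
print("baseline RMSE:", baseline_rmse)
```

Comparing a model's RMSE against a simple baseline like this is one way to judge whether a figure such as 10.03 is good or bad for the data at hand.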