# **What is XGBoost?**

XGBoost, short for eXtreme Gradient Boosting, is one of the most commonly used machine learning libraries. It has been relied upon for both classification and regression problems since its release in 2014. It implements optimised, distributed gradient boosting and provides interfaces for C++, Java, Python, R and Julia.

Unlike boosting algorithms such as AdaBoost, where the weights of misclassified samples are increased, gradient boosting fits each new tree to the gradient of the loss function at the current predictions, so every round directly reduces the loss. XGBoost is an advanced implementation of gradient boosting that also adds regularization terms to the objective.
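To make the idea concrete, here is a minimal pure-Python sketch of gradient boosting for squared-error loss, using one-split trees (stumps) on a single feature. The function names, the toy data and the settings are illustrative assumptions; this is not how XGBoost is implemented internally, which adds regularisation, second-order gradient information and many optimisations.

```python
# Toy gradient boosting for squared-error loss with depth-1 trees (stumps).
# Illustrative only: names, data and settings are made up for this sketch.

def fit_stump(x, residuals):
    """Find the single threshold split on x that best fits the residuals."""
    best = None
    for threshold in sorted(set(x))[:-1]:  # skip the max so the right side is never empty
        left = [r for xi, r in zip(x, residuals) if xi <= threshold]
        right = [r for xi, r in zip(x, residuals) if xi > threshold]
        left_mean = sum(left) / len(left)
        right_mean = sum(right) / len(right)
        sse = sum((r - left_mean) ** 2 for r in left)
        sse += sum((r - right_mean) ** 2 for r in right)
        if best is None or sse < best[0]:
            best = (sse, threshold, left_mean, right_mean)
    _, threshold, left_mean, right_mean = best
    return lambda xi: left_mean if xi <= threshold else right_mean

def mse(y, pred):
    """Mean squared error between targets and predictions."""
    return sum((a - b) ** 2 for a, b in zip(y, pred)) / len(y)

def boost(x, y, n_rounds=20, learning_rate=0.3):
    """Start from the mean, then repeatedly fit a stump to the residuals."""
    base = sum(y) / len(y)
    pred = [base] * len(y)
    stumps = []
    for _ in range(n_rounds):
        # For the loss 0.5 * (y - pred)^2 the negative gradient is the residual.
        residuals = [yi - pi for yi, pi in zip(y, pred)]
        stump = fit_stump(x, residuals)
        stumps.append(stump)
        pred = [pi + learning_rate * stump(xi) for xi, pi in zip(x, pred)]
    return base, stumps, pred

x = [1, 2, 3, 4, 5, 6, 7, 8]
y = [1.0, 1.2, 1.9, 3.1, 4.8, 5.1, 7.2, 7.9]
base, stumps, pred = boost(x, y)
print("MSE of the mean-only model:", mse(y, [base] * len(y)))
print("MSE after boosting:", mse(y, pred))
```

Each round nudges the predictions towards the targets by a fraction (the learning rate) of what the new stump suggests, which is why the error after boosting is lower than that of the mean-only model.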

For a complete understanding of this algorithm, refer to this post: [Understanding XGBoost in Detail](https://analyticsindiamag.com/xgboost-internal-working-to-make-decision-trees-and-deduce-predictions/).

# **The Reason Behind Its Popularity**

XGBoost provides a number of features that can greatly improve the efficiency and performance of a model. Parallelisation, distributed computing, cache optimisation and automatic handling of missing data are some of the features that make it stand out from other algorithms and libraries.

Other factors are its speed of execution and the performance of its models on both regression and classification problems. The algorithm is effective at producing models with lower variance and more stable predictions.

# **Implement XGBoost In Python**

We will use the dataset provided for the hackathon “Predicting House Prices In Bengaluru” at MachineHack.com (click [here](https://machinehack.com/hackathons/predicting_house_prices_in_bengaluru/data) to download the complete dataset).

## **Installing XGBoost**

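Use the Python pip installer to install the XGBoost library and the other packages used below. The leading `!` runs the command from a notebook cell; drop it to run the same command in your terminal.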

In [None]:
!python -m pip install pip --upgrade --user -q
!python -m pip install numpy pandas seaborn matplotlib scipy scikit-learn statsmodels xgboost --user -q

In [None]:
# Restart the kernel so that the newly installed packages are picked up
import IPython
IPython.Application.instance().kernel.do_shutdown(True)

## **Let’s get coding:**

## **Importing the libraries**

In [None]:
import numpy as np
import pandas as pd

## **Importing the dataset**

In [None]:
dataset = pd.read_csv('XGB_Train.csv')

Let's look at the first few rows, check for missing values, and inspect the column types and the shape of the data.

In [None]:
dataset.head()

In [None]:
dataset.isnull().any()

In [None]:
# Drop every row that contains at least one missing value
dataset = dataset.dropna()

In [None]:
dataset.dtypes

In [None]:
dataset.shape
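Because `dropna()` discards every row with at least one missing value, it can be worth counting how much data is lost. The following is a minimal sketch on a hypothetical toy frame; the values are made up for illustration and are not taken from the hackathon data.

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame standing in for the real training data.
toy = pd.DataFrame({
    "total_sqft": ["1056", "2600", None, "1521"],
    "bath": [2.0, np.nan, 2.0, 3.0],
    "price": [39.07, 120.0, 62.0, 95.0],
})

missing_per_column = toy.isnull().sum()  # number of missing cells in each column
rows_before = len(toy)
cleaned = toy.dropna()                   # keeps only rows with no missing values
rows_after = len(cleaned)

print(missing_per_column)
print(f"dropna() removed {rows_before - rows_after} of {rows_before} rows")
```

If a large share of rows disappears, imputing values or leaving them as missing may be preferable, since XGBoost can handle missing data automatically.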

In [None]:
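def convert_to_ft(x):
  """
  Convert a total_sqft string to a float in square feet.

  Handles square metres, square yards, acres, ranges such as "a - b"
  (the midpoint is returned) and plain numbers.
  """
  if "Sq. Meter" in x:
    x = x.replace("Sq. Meter", "")
    x = float(x)
    x_ft = x * 10.764  # 1 sq. metre = 10.764 sq. ft
  elif "Sq. Yards" in x:
    x = x.replace("Sq. Yards", "")
    x = float(x)
    x_ft = x * 9.0  # 1 sq. yard = 9 sq. ft
  elif "Acres" in x:
    x = x.replace("Acres", "")
    x = float(x)
    x_ft = x * 43560  # 1 acre = 43,560 sq. ft
  elif "-" in x:
    # A range such as "1133 - 1384": use the midpoint
    o, t = x.split("-")
    o = float(o)
    t = float(t)
    x_ft = (o + t) / 2
  else:
    x_ft = float(x)

  return x_ft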

In [None]:
# Convert the object (string) column total_sqft to float
def func(x):
  """
  Thin wrapper around convert_to_ft, kept as a hook for error handling.
  """
  return convert_to_ft(x)

dataset['total_sqft'] = dataset['total_sqft'].apply(func)
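The unit handling can also be written as a lookup table, which makes the conversion factors easy to audit and extend. The sketch below is a hypothetical alternative, not the notebook's own function: `SQFT_PER_UNIT` and `to_sqft` are names invented here, and returning `None` for unparseable values is an assumption about how you might want to treat them. Note that 1 square yard is 9 square feet.

```python
import re

# Square feet per unit, for the unit labels handled in this notebook.
# Table-driven variant written for illustration only.
SQFT_PER_UNIT = {
    "Sq. Meter": 10.764,
    "Sq. Yards": 9.0,
    "Acres": 43560.0,
}

def to_sqft(value):
    """Convert an area string to square feet, or return None if it cannot be parsed."""
    text = str(value).strip()
    # Values such as "100Sq. Yards": strip the unit and scale the number.
    for unit, factor in SQFT_PER_UNIT.items():
        if text.endswith(unit):
            return float(text[: -len(unit)]) * factor
    # Ranges such as "1133 - 1384": return the midpoint.
    match = re.fullmatch(r"([\d.]+)\s*-\s*([\d.]+)", text)
    if match:
        low, high = map(float, match.groups())
        return (low + high) / 2
    # Plain numbers; anything else is reported as unparseable.
    try:
        return float(text)
    except ValueError:
        return None

for sample in ["1056", "1133 - 1384", "100Sq. Yards", "2Acres", "12 unknown unit"]:
    print(repr(sample), "->", to_sqft(sample))
```

Returning `None` instead of raising lets you inspect or drop the rows that could not be converted, rather than stopping at the first unexpected value.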

The data consists of features of houses in locations across Bangalore. The problem is to predict the ‘price’ of a house from ‘total_sqft’, ‘size’, ‘bath’, ‘balcony’, ‘area_type’ and ‘location’.

The data has been preprocessed to some extent. Click [here](https://analyticsindiamag.com/data-pre-processing-in-python/) for instructions on data preprocessing in Python.

## **Extracting the values and separating the features into dependent and independent variables**

In [None]:
X = dataset.iloc[:,[0,2,3,5,6,7]].values
y = dataset.iloc[:, 8].values


> * **X**: Set of independent features (‘total_sqft’, ‘size’, ‘bath’, ‘balcony’, ‘area_type’ and ‘location’).
> * **y**: Dependent variable (‘price’).
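Selecting columns by position works only as long as the column order of the file never changes. Selecting by name is a more readable alternative. The sketch below uses a hypothetical two-row frame; the column order, the placeholder columns `unused_1` and `unused_4`, and all values are assumptions made so that the positional indices used above line up with the named features.

```python
import pandas as pd

# Hypothetical toy frame. The column order is assumed, and "unused_1" and
# "unused_4" are placeholders for the columns that are not used as features.
toy = pd.DataFrame({
    "area_type": ["Built-up Area", "Plot Area"],
    "unused_1": ["a", "b"],
    "location": ["Location A", "Location B"],
    "size": ["2 BHK", "4 Bedroom"],
    "unused_4": ["c", "d"],
    "total_sqft": [1056.0, 2600.0],
    "bath": [2.0, 5.0],
    "balcony": [1.0, 3.0],
    "price": [39.07, 120.0],
})

feature_columns = ["area_type", "location", "size", "total_sqft", "bath", "balcony"]

X_by_name = toy[feature_columns].values                  # select features by name
X_by_position = toy.iloc[:, [0, 2, 3, 5, 6, 7]].values   # select features by position
y_by_name = toy["price"].values

# Under the assumed column order both selections give the same array.
print((X_by_name == X_by_position).all())
print(X_by_name.shape, y_by_name.shape)
```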


## **Handling categorical Variables**

In [None]:
X

The first three columns of X hold the categorical features ‘size’, ‘area_type’ and ‘location’, so they must be encoded as numbers before training.

We will use the label encoder to encode the features into numerical values.

In [None]:
from sklearn.preprocessing import LabelEncoder

le_X_0= LabelEncoder()
le_X_1= LabelEncoder()
le_X_2= LabelEncoder()

X[:, 2] = le_X_0.fit_transform(X[:, 2])
X[:, 0] = le_X_1.fit_transform(X[:, 0])
X[:, 1] = le_X_2.fit_transform(X[:, 1])

In [None]:
X
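To see what the label encoder does to a single column, here is a small sketch on made-up category values (the strings are illustrative, not taken from the dataset). `LabelEncoder` sorts the distinct values and replaces each one with its index in that sorted list.

```python
from sklearn.preprocessing import LabelEncoder

# Made-up category values for illustration.
sizes = ["2 BHK", "3 BHK", "2 BHK", "4 Bedroom"]

encoder = LabelEncoder()
codes = encoder.fit_transform(sizes)         # learn the classes and encode in one step
decoded = encoder.inverse_transform(codes)   # map the integer codes back to the strings

print(list(encoder.classes_))  # distinct values in sorted order
print(list(codes))             # integer code for each input value
print(list(decoded))           # the original strings again
```

The integer codes imply an ordering that the categories do not necessarily have. Tree-based models split on thresholds of these codes, so this often works acceptably in practice, but it is worth keeping in mind when interpreting the model.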

## **Splitting the data set into test and training set**

In [None]:
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.2, random_state = 0)
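The effect of `test_size` and `random_state` is easiest to see on synthetic data. In the sketch below, the arrays `X_toy` and `y_toy` are made up: each row of `X_toy` is built from its row index so that we can check that features and targets stay aligned after shuffling.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic data: row i of X_toy is [2*i, 2*i + 1] and its target is i.
X_toy = np.arange(200).reshape(100, 2)
y_toy = np.arange(100)

X_tr, X_te, y_tr, y_te = train_test_split(X_toy, y_toy, test_size=0.2, random_state=0)
print(X_tr.shape, X_te.shape)  # 80% of the rows go to training, 20% to testing

# Rows are shuffled, but each feature row still matches its target.
print(bool((X_te[:, 0] == 2 * y_te).all()))

# The same random_state reproduces exactly the same split.
_, X_te_again, _, _ = train_test_split(X_toy, y_toy, test_size=0.2, random_state=0)
print(bool((X_te == X_te_again).all()))
```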

## **Creating and initialising the XGBoost Regressor**

In [None]:
from xgboost import XGBRegressor
regressor  = XGBRegressor()

## **Fitting the regressor with data**

After initialising the regressor, we fit it on the training data so that it can learn the relationship between the features and the price, and give accurate predictions for new inputs.

In [None]:
regressor.fit(X_train, y_train)

## **Predicting the prices**

Predicting for training set:

In [None]:
Y_pred_train = regressor.predict(X_train)

The above code passes X_train (the training set of independent features) to the regressor, which predicts the prices. The predicted values are stored in the NumPy array Y_pred_train.

Predicting for test set:

In [None]:
y_pred = regressor.predict(X_test)

The above code passes X_test (the test set of independent features) to the regressor, which predicts the prices. The predicted values are stored in the NumPy array y_pred.

## **Evaluating accuracy with RMSLE**

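The score computed below is 1 minus the root mean squared logarithmic error (RMSLE), using base-10 logarithms, so a higher value is better and 1 means a perfect fit.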

In [None]:
def rmsle(y_pred, y_test):
  """Return 1 - RMSLE (base-10 logarithms), so that higher is better."""
  error = np.square(np.log10(y_pred + 1) - np.log10(y_test + 1)).mean() ** 0.5
  return 1 - error

In [None]:
print("Score (1 - RMSLE) attained on Training Set = ", rmsle(Y_pred_train, y_train))
print("Score (1 - RMSLE) attained on Test Set = ", rmsle(y_pred, y_test))
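A small worked example helps in reading this score. The sketch below re-implements the same formula with the standard library only; the function name `one_minus_rmsle` and the numbers are made up for illustration. Because the error is measured on a log scale, it depends on the ratio between prediction and truth rather than on their absolute difference.

```python
import math

def one_minus_rmsle(y_pred, y_true):
    """1 - RMSLE with base-10 logarithms, mirroring the formula used above."""
    squared_log_errors = [
        (math.log10(p + 1) - math.log10(t + 1)) ** 2
        for p, t in zip(y_pred, y_true)
    ]
    return 1 - math.sqrt(sum(squared_log_errors) / len(squared_log_errors))

# Perfect predictions: the log error is zero, so the score is 1.
perfect = one_minus_rmsle([50.0, 120.0], [50.0, 120.0])

# Every (prediction + 1) is ten times (truth + 1): each log10 error is 1,
# so the RMSLE is 1 and the score drops to about 0.
tenfold = one_minus_rmsle([99.0, 999.0], [9.0, 99.0])

print(perfect, tenfold)
```

Note that the score can become negative when predictions are off by more than a factor of ten on average, and that the logarithm is undefined for predictions of -1 or lower.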

# **Related Articles**

> * [Understand XGBoost](https://analyticsindiamag.com/xgboost-internal-working-to-make-decision-trees-and-deduce-predictions/)
> * [XGBoost vs LightGBM](https://analyticsindiamag.com/comparing-the-gradient-boosting-decision-tree-packages-xgboost-vs-lightgbm/)
> * [Random Forest vs XGBoost](https://analyticsindiamag.com/random-forest-vs-xgboost-comparing-tree-based-algorithms-with-codes/)
> * [Basics of Ensemble Learning](https://analyticsindiamag.com/basics-of-ensemble-learning-in-classification-techniques-explained/)
> * [Bagging vs Boosting](https://analyticsindiamag.com/guide-to-ensemble-methods-bagging-vs-boosting/)
> * [Guide to Ensemble Learning](https://analyticsindiamag.com/a-hands-on-guide-to-hybrid-ensemble-learning-models-with-python-code/)
> * [Ensemble Methods](https://analyticsindiamag.com/primer-ensemble-learning-bagging-boosting/)