# XGBOOST

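1. The main advantage of XGBoost is its speed compared with other gradient boosting implementations.

2. It uses parallel processing to increase speed: the work of building each tree (for example, evaluating candidate splits) is spread across cores.

3. Trees are still grown sequentially: each new tree is fitted to correct the errors of the trees built before it.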

4. However, XGBoost is more difficult to understand, visualize, and tune than AdaBoost and random forests.

5. XGBoost also handles missing values in the dataset internally, so during data wrangling a separate treatment of missing values is optional.

6. XGBoost is very popular because it has been the winning algorithm in a number of Kaggle competitions.
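The idea of sequential tree growing can be illustrated with a small, pure-Python sketch. It is not XGBoost's actual algorithm (which also uses second-order gradients and regularization); it only shows the core boosting loop for squared error, where each new one-split tree ("stump") is fitted to the residuals left by the previous ones. The function names, the toy data, and the `learning_rate` value are assumptions made for illustration.

```python
def fit_stump(x, residuals):
    """Find the single threshold split on x that minimizes squared error."""
    best = None
    # Skip the smallest value so the left side of the split is never empty.
    for threshold in sorted(set(x))[1:]:
        left = [r for xi, r in zip(x, residuals) if xi < threshold]
        right = [r for xi, r in zip(x, residuals) if xi >= threshold]
        left_mean = sum(left) / len(left)
        right_mean = sum(right) / len(right)
        error = (sum((r - left_mean) ** 2 for r in left)
                 + sum((r - right_mean) ** 2 for r in right))
        if best is None or error < best[0]:
            best = (error, threshold, left_mean, right_mean)
    _, threshold, left_mean, right_mean = best
    return lambda xi: left_mean if xi < threshold else right_mean


def boost(x, y, n_rounds=20, learning_rate=0.3):
    """Grow stumps one after another, each fitted to the current residuals."""
    base = sum(y) / len(y)          # start from the mean prediction
    predictions = [base] * len(y)
    stumps, history = [], []
    for _ in range(n_rounds):
        residuals = [yi - pi for yi, pi in zip(y, predictions)]
        stump = fit_stump(x, residuals)
        stumps.append(stump)
        predictions = [pi + learning_rate * stump(xi)
                       for xi, pi in zip(x, predictions)]
        # Track the training mean squared error after each round.
        history.append(sum((yi - pi) ** 2
                           for yi, pi in zip(y, predictions)) / len(y))

    def predict(xi):
        return base + learning_rate * sum(s(xi) for s in stumps)

    return predict, history


# Toy data, made up for illustration only.
x = [1, 2, 3, 4, 5, 6, 7, 8]
y = [1.0, 1.5, 1.2, 3.8, 4.1, 4.0, 7.9, 8.2]
predict, history = boost(x, y)
print(history[0], history[-1])  # training error after the first and last round
```

Because every stump depends on the residuals of all earlier stumps, the rounds cannot run in parallel; only the work inside each round can.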

In [1]:
# install xgboost before using
!pip install xgboost



# Used car price prediction

In [2]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
%matplotlib inline

# preprocessing
from sklearn.model_selection import train_test_split,GridSearchCV

# metrics and models
from sklearn.metrics import r2_score,mean_squared_error

import xgboost as xgb

import warnings 
warnings.filterwarnings('ignore')

# Read the dataset

Note: this file is on GitHub, but because of its size some of the data has been removed so that it fits in the repository.

In [3]:
df=pd.read_csv('vehicles_data_students.csv')
df

,Unnamed: 0,id,price,year,manufacturer,model,condition,cylinders,fuel,odometer,title_status,transmission,drive,size,type,paint_color,county,state,lat,long
0,55483,7315914053,0,2018.0,ram,promaster 2500,excellent,,gas,44244.0,clean,automatic,,,van,,,ca,32.792800,-116.966500
1,162368,7310885048,13995,2017.0,mazda,cx-3,,4 cylinders,gas,7037.0,rebuilt,automatic,,,SUV,white,,ia,41.207382,-96.023096
2,234393,7308243856,19990,2019.0,mitsubishi,eclipse cross sp,good,,gas,35313.0,clean,other,4wd,,hatchback,white,,nc,35.190000,-80.830000
3,276110,7315817729,0,2019.0,honda,cr-v,,,gas,25626.0,clean,automatic,,,SUV,orange,,ny,40.854573,-74.120219
4,349033,7301620999,42900,2015.0,chevrolet,corvette,excellent,8 cylinders,gas,29000.0,clean,automatic,,,convertible,black,,sc,34.755562,-82.906419
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
64027,250960,7310954770,3495,1997.0,mercury,grand marquis,good,8 cylinders,gas,106253.0,clean,automatic,rwd,,sedan,white,,nj,40.588352,-74.291039
64028,217058,7305666517,25590,2018.0,lincoln,mkz reserve hybrid sedan,good,,hybrid,33467.0,clean,other,fwd,,sedan,white,,mn,44.010000,-92.470000
64029,323468,7316544829,22421,2015.0,mercedes-benz,e-class,excellent,,gas,55494.0,clean,automatic,rwd,,sedan,black,,or,45.504350,-122.532962
64030,132003,7314796788,29999,2013.0,jeep,wrangler unlimited sport,,,gas,63408.0,clean,automatic,4wd,,SUV,white,,id,43.619740,-116.294690


In [4]:
list(df.columns)

['Unnamed: 0',
 'id',
 'price',
 'year',
 'manufacturer',
 'model',
 'condition',
 'cylinders',
 'fuel',
 'odometer',
 'title_status',
 'transmission',
 'drive',
 'size',
 'type',
 'paint_color',
 'county',
 'state',
 'lat',
 'long']

In [5]:
# drop identifier, location and other columns that are not used as features
drop_columns=['Unnamed: 0','id','title_status','transmission', 'drive',  'lat','long','county']
df=df.drop(columns=drop_columns)

In [6]:
df.shape

(64032, 12)

In [7]:
df.isna().sum()

price               0
year              158
manufacturer     2569
model             802
condition       26097
cylinders       26511
fuel              424
odometer          669
size            45841
type            13785
paint_color     19505
state               0
dtype: int64
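As noted in the introduction, XGBoost can handle missing values internally, so dropping these rows (as the next cell does) is a choice rather than a requirement. Roughly, each split learns a "default direction" for rows whose feature value is missing. The sketch below illustrates that idea with plain squared error on a single split; XGBoost's real criterion is a gradient-based gain, and the function names and toy data here are hypothetical.

```python
def sse(values):
    """Sum of squared errors around the mean (0 for an empty list)."""
    if not values:
        return 0.0
    mean = sum(values) / len(values)
    return sum((v - mean) ** 2 for v in values)


def default_direction(x, y, threshold):
    """Decide whether rows with a missing x should go left or right."""
    left = [yi for xi, yi in zip(x, y) if xi is not None and xi < threshold]
    right = [yi for xi, yi in zip(x, y) if xi is not None and xi >= threshold]
    missing = [yi for xi, yi in zip(x, y) if xi is None]
    # Try both options and keep the one with the lower total error.
    error_if_left = sse(left + missing) + sse(right)
    error_if_right = sse(left) + sse(right + missing)
    return "left" if error_if_left <= error_if_right else "right"


# Toy data: None marks a missing feature value.
x = [1, 2, None, 8, 9, None]
y = [10.0, 11.0, 10.5, 20.0, 21.0, 10.0]
print(default_direction(x, y, threshold=5))  # left
```

In this toy example the rows with missing `x` have targets close to the left group, so sending them left gives the lower error.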

In [8]:
df=df.dropna()
df

,price,year,manufacturer,model,condition,cylinders,fuel,odometer,size,type,paint_color,state
5,0,2006.0,chrysler,300,like new,8 cylinders,gas,149000.0,full-size,sedan,white,fl
9,20995,2011.0,chevrolet,silverado 1500,excellent,8 cylinders,gas,92001.0,mid-size,truck,blue,wi
30,13000,2012.0,infiniti,g37x,good,6 cylinders,gas,49200.0,mid-size,sedan,black,oh
32,12500,2017.0,buick,encore premium sport8,like new,4 cylinders,gas,73057.0,mid-size,SUV,black,ny
39,44500,2015.0,gmc,yukon xl denali 4x4,excellent,8 cylinders,gas,91000.0,full-size,SUV,white,az
...,...,...,...,...,...,...,...,...,...,...,...,...
64008,2150,2002.0,chevrolet,blazer,fair,6 cylinders,gas,151255.0,mid-size,SUV,black,oh
64010,3995,2003.0,acura,mdx,good,6 cylinders,gas,161000.0,full-size,SUV,white,az
64014,15850,2008.0,jeep,rubicon,good,6 cylinders,gas,151628.0,compact,SUV,green,wi
64015,8999,2010.0,honda,odyssey,good,6 cylinders,gas,125989.0,full-size,van,blue,pa


In [9]:
df.shape

(12281, 12)

In [10]:
df.describe()

,price,year,odometer
count,12281.0,12281.0,12281.0
mean,14411.55,2008.448416,127443.6
std,154162.5,10.038207,282707.0
min,0.0,1918.0,0.0
25%,4950.0,2006.0,73128.0
50%,8999.0,2011.0,114700.0
75%,16998.0,2014.0,156233.0
max,17000000.0,2021.0,10000000.0


# Check for duplicates and remove any that are found

In [11]:
df.drop_duplicates(inplace=True)
df

,price,year,manufacturer,model,condition,cylinders,fuel,odometer,size,type,paint_color,state
5,0,2006.0,chrysler,300,like new,8 cylinders,gas,149000.0,full-size,sedan,white,fl
9,20995,2011.0,chevrolet,silverado 1500,excellent,8 cylinders,gas,92001.0,mid-size,truck,blue,wi
30,13000,2012.0,infiniti,g37x,good,6 cylinders,gas,49200.0,mid-size,sedan,black,oh
32,12500,2017.0,buick,encore premium sport8,like new,4 cylinders,gas,73057.0,mid-size,SUV,black,ny
39,44500,2015.0,gmc,yukon xl denali 4x4,excellent,8 cylinders,gas,91000.0,full-size,SUV,white,az
...,...,...,...,...,...,...,...,...,...,...,...,...
64008,2150,2002.0,chevrolet,blazer,fair,6 cylinders,gas,151255.0,mid-size,SUV,black,oh
64010,3995,2003.0,acura,mdx,good,6 cylinders,gas,161000.0,full-size,SUV,white,az
64014,15850,2008.0,jeep,rubicon,good,6 cylinders,gas,151628.0,compact,SUV,green,wi
64015,8999,2010.0,honda,odyssey,good,6 cylinders,gas,125989.0,full-size,van,blue,pa


In [12]:
df.isnull().sum()

price           0
year            0
manufacturer    0
model           0
condition       0
cylinders       0
fuel            0
odometer        0
size            0
type            0
paint_color     0
state           0
dtype: int64

# filter categorical features

In [13]:
#separate object data type columns
object_type_columns=list(df.select_dtypes(include='object'))
object_type_columns

['manufacturer',
 'model',
 'condition',
 'cylinders',
 'fuel',
 'size',
 'type',
 'paint_color',
 'state']

In [14]:
# object_type_columns=list(object_columns.columns)
# object_type_columns 

# Encoding categorical columns using get_dummies

Note: see the encoder and imputer live class recordings for more information on get_dummies.

In [15]:
df_dummies=pd.get_dummies(df[object_type_columns],drop_first=True)

In [16]:
df_dummies.head()

,manufacturer_alfa-romeo,manufacturer_audi,manufacturer_bmw,manufacturer_buick,manufacturer_cadillac,manufacturer_chevrolet,manufacturer_chrysler,manufacturer_datsun,manufacturer_dodge,manufacturer_ferrari,...,state_sd,state_tn,state_tx,state_ut,state_va,state_vt,state_wa,state_wi,state_wv,state_wy
5,False,False,False,False,False,False,True,False,False,False,...,False,False,False,False,False,False,False,False,False,False
9,False,False,False,False,False,True,False,False,False,False,...,False,False,False,False,False,False,False,True,False,False
30,False,False,False,False,False,False,False,False,False,False,...,False,False,False,False,False,False,False,False,False,False
32,False,False,False,True,False,False,False,False,False,False,...,False,False,False,False,False,False,False,False,False,False
39,False,False,False,False,False,False,False,False,False,False,...,False,False,False,False,False,False,False,False,False,False


In [17]:
df_dummies.shape

(11694, 3508)
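The very wide result above follows from how one-hot encoding works: a feature with k distinct values becomes k - 1 indicator columns when `drop_first=True`, so free-text columns with many distinct values (most likely `model` here) contribute most of the columns. The sketch below is a conceptual re-implementation in plain Python, not pandas itself; the function name and the toy rows are made up for illustration.

```python
def one_hot_drop_first(rows, column):
    """One indicator per level of `column`, skipping the first sorted level."""
    levels = sorted({row[column] for row in rows})
    kept = levels[1:]  # mimics drop_first=True: the first level is implied
    return [
        {f"{column}_{level}": row[column] == level for level in kept}
        for row in rows
    ]


# Toy rows, for illustration only.
rows = [{"fuel": "gas"}, {"fuel": "diesel"}, {"fuel": "hybrid"}, {"fuel": "gas"}]
encoded = one_hot_drop_first(rows, "fuel")
print(encoded[0])  # {'fuel_gas': True, 'fuel_hybrid': False}
print(encoded[1])  # {'fuel_gas': False, 'fuel_hybrid': False}
```

The dropped level ("diesel" in this toy case) is represented by all indicators being False, which is why no information is lost.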

In [18]:
df=df.join(df_dummies)

In [19]:
df.shape

(11694, 3520)

In [20]:
df.head()

,price,year,manufacturer,model,condition,cylinders,fuel,odometer,size,type,...,state_sd,state_tn,state_tx,state_ut,state_va,state_vt,state_wa,state_wi,state_wv,state_wy
5,0,2006.0,chrysler,300,like new,8 cylinders,gas,149000.0,full-size,sedan,...,False,False,False,False,False,False,False,False,False,False
9,20995,2011.0,chevrolet,silverado 1500,excellent,8 cylinders,gas,92001.0,mid-size,truck,...,False,False,False,False,False,False,False,True,False,False
30,13000,2012.0,infiniti,g37x,good,6 cylinders,gas,49200.0,mid-size,sedan,...,False,False,False,False,False,False,False,False,False,False
32,12500,2017.0,buick,encore premium sport8,like new,4 cylinders,gas,73057.0,mid-size,SUV,...,False,False,False,False,False,False,False,False,False,False
39,44500,2015.0,gmc,yukon xl denali 4x4,excellent,8 cylinders,gas,91000.0,full-size,SUV,...,False,False,False,False,False,False,False,False,False,False


In [21]:
# the original text columns are no longer needed once the dummy columns are joined
df.drop(columns=object_type_columns,inplace=True)

In [22]:
df.head(2)

,price,year,odometer,manufacturer_alfa-romeo,manufacturer_audi,manufacturer_bmw,manufacturer_buick,manufacturer_cadillac,manufacturer_chevrolet,manufacturer_chrysler,...,state_sd,state_tn,state_tx,state_ut,state_va,state_vt,state_wa,state_wi,state_wv,state_wy
5,0,2006.0,149000.0,False,False,False,False,False,False,True,...,False,False,False,False,False,False,False,False,False,False
9,20995,2011.0,92001.0,False,False,False,False,False,True,False,...,False,False,False,False,False,False,False,True,False,False


While practicing, try other encoding techniques as well and compare the final scores.

Selecting realistic data: domain knowledge helps a lot in deciding what the lower and upper price limits should be.

In this example, keep only cars priced above 1000 and below 25000.


In [23]:
df=df[df['price']>1000]
df=df[df['price']<25000]

In [24]:
df.shape

(9325, 3511)
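To see why restricting the price range matters, the sketch below shows how a single extreme listing distorts the mean while barely affecting the median. The five prices are illustrative values borrowed from the `describe()` summary above (minimum, quartiles, maximum), not actual rows of the dataset, and the bounds mirror the filter used in this notebook.

```python
from statistics import mean, median

# Illustrative prices only: the summary statistics reported by describe().
prices = [0, 4950, 8999, 16998, 17000000]

print(mean(prices))    # dominated by the single extreme listing
print(median(prices))  # 8999

# Same bounds as the notebook's filter: keep 1000 < price < 25000.
filtered = [p for p in prices if 1000 < p < 25000]
print(filtered)        # [4950, 8999, 16998]
print(mean(filtered))
```

A model trained on unfiltered prices would spend much of its capacity on a handful of implausible listings, such as prices of 0 or in the millions.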

# divide dataset into features and label

In [25]:
y=df['price']
X=df.drop(['price'],axis=1)

In [28]:
# data split into train test
train_X,test_x,train_Y,test_y=train_test_split(X,y,test_size=0.25,random_state=10)

# XGB

In [29]:
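import xgboost as xgb
# Use a separate name for the model so the xgb module is not overwritten.
# Note: XGBRFRegressor is XGBoost's random forest variant; xgb.XGBRegressor is the gradient-boosted model.
model=xgb.XGBRFRegressor()
model.fit(train_X,train_Y)

In [30]:
y_pred=model.predict(test_x)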

In [31]:
r2_score(test_y,y_pred)

0.5511315388467979
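For reference, the R² score reported above is one minus the ratio of the residual sum of squares to the total sum of squares around the mean. The notebook also imports `mean_squared_error`; its square root (RMSE) is often easier to interpret because it is in the same units as the price. The sketch below computes both by hand on made-up numbers, not on the notebook's predictions.

```python
import math


def r2(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean_true = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)
    return 1 - ss_res / ss_tot


def rmse(y_true, y_pred):
    """Root mean squared error, in the same units as the target."""
    return math.sqrt(sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true))


# Toy values, for illustration only.
y_true = [3.0, -0.5, 2.0, 7.0]
y_hat = [2.5, 0.0, 2.0, 8.0]
print(round(r2(y_true, y_hat), 4))    # 0.9486
print(round(rmse(y_true, y_hat), 4))  # 0.6124
```

An R² of about 0.55 means the model explains a little over half of the variance in price on the test set.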

Tune the parameters and check whether you can increase the score.
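One common way to approach this exercise is a grid search: list a few candidate values per parameter, evaluate every combination, and keep the best. `GridSearchCV`, already imported at the top of the notebook, automates this together with cross-validation. The sketch below only shows how a grid expands into candidate settings. The parameter names are real `XGBRegressor` arguments, but the candidate values and the `toy_score` function are placeholders; in practice the scoring step would train a model and return a validation score.

```python
from itertools import product

# Candidate values are placeholders, not recommendations.
param_grid = {
    "n_estimators": [100, 300],
    "max_depth": [3, 6],
    "learning_rate": [0.05, 0.1, 0.3],
}


def expand_grid(grid):
    """Return one dict per combination of parameter values."""
    names = list(grid)
    return [dict(zip(names, values))
            for values in product(*(grid[name] for name in names))]


def best_params(grid, score_fn):
    """Pick the combination with the highest score."""
    return max(expand_grid(grid), key=score_fn)


def toy_score(params):
    # Stand-in for "train a model and return its validation score".
    return -abs(params["max_depth"] - 6) - abs(params["learning_rate"] - 0.1)


candidates = expand_grid(param_grid)
print(len(candidates))  # 12
best = best_params(param_grid, toy_score)
print(best)
```

The number of candidates grows multiplicatively with each added parameter, so it is usually worth starting with a small grid.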

Difference between Bagging and Boosting

Bagging

Training data subsets are drawn randomly with replacement from the entire training dataset.

Bagging attempts to tackle the over-fitting issue.

Every model receives an equal weight.

The objective is to decrease variance, not bias.

Every model is built independently.


Boosting

Each new subset gives more weight to the samples that previous models misclassified.

Boosting tries to reduce bias.

Models are weighted by their performance.

The objective is to decrease bias, not variance.

Each new model is affected by the performance of the previously developed models.
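
The two sampling ideas can be contrasted in a few lines of plain Python. Bagging draws each training subset independently, with replacement, while a classic boosting scheme (AdaBoost-style) raises the weight of samples the previous model got wrong. This is a simplified sketch: the function names and the `factor` value are assumptions made for illustration, and XGBoost itself fits new trees to gradients rather than reweighting samples in this way.

```python
import random


def bootstrap_sample(data, rng):
    """Bagging: draw len(data) items at random, with replacement."""
    return [rng.choice(data) for _ in data]


def reweight(weights, misclassified, factor=2.0):
    """Boosting (simplified): raise the weight of misclassified samples."""
    raised = [w * factor if miss else w
              for w, miss in zip(weights, misclassified)]
    total = sum(raised)
    return [w / total for w in raised]  # renormalize so weights sum to 1


rng = random.Random(0)
data = list(range(10))
# Each bagging model gets its own independent sample; duplicates are expected.
print(bootstrap_sample(data, rng))
print(bootstrap_sample(data, rng))

# Boosting: the second sample was misclassified, so its weight goes up.
weights = [0.25, 0.25, 0.25, 0.25]
new_weights = reweight(weights, [False, True, False, False])
print(new_weights)  # [0.2, 0.4, 0.2, 0.2]
```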