# Extreme Gradient Boosting with XGBoost

### [C4] Using XGBoost in Pipelines

In [1]:
import pandas as pd
import numpy as np
import xgboost as xgb


This time we use the original, unprocessed Ames housing dataset:

In [2]:
URL = 'https://assets.datacamp.com/production/repositories/943/datasets/17a7c5c0acd7bfa253827ea53646cf0db7d39649/ames_unprocessed_data.csv'

In [3]:
df = pd.read_csv(URL)
df.head()

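| | MSSubClass | MSZoning | LotFrontage | LotArea | Neighborhood | BldgType | HouseStyle | OverallQual | OverallCond | YearBuilt | ... | GrLivArea | BsmtFullBath | BsmtHalfBath | FullBath | HalfBath | BedroomAbvGr | Fireplaces | GarageArea | PavedDrive | SalePrice |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 60 | RL | 65.0 | 8450 | CollgCr | 1Fam | 2Story | 7 | 5 | 2003 | ... | 1710 | 1 | 0 | 2 | 1 | 3 | 0 | 548 | Y | 208500 |
| 1 | 20 | RL | 80.0 | 9600 | Veenker | 1Fam | 1Story | 6 | 8 | 1976 | ... | 1262 | 0 | 1 | 2 | 0 | 3 | 1 | 460 | Y | 181500 |
| 2 | 60 | RL | 68.0 | 11250 | CollgCr | 1Fam | 2Story | 7 | 5 | 2001 | ... | 1786 | 1 | 0 | 2 | 1 | 3 | 1 | 608 | Y | 223500 |
| 3 | 70 | RL | 60.0 | 9550 | Crawfor | 1Fam | 2Story | 7 | 5 | 1915 | ... | 1717 | 1 | 0 | 1 | 0 | 3 | 1 | 642 | Y | 140000 |
| 4 | 60 | RL | 84.0 | 14260 | NoRidge | 1Fam | 2Story | 8 | 5 | 2000 | ... | 2198 | 1 | 0 | 2 | 1 | 4 | 1 | 836 | Y | 250000 |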


Exploring the data:

In [4]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1460 entries, 0 to 1459
Data columns (total 21 columns):
 #   Column        Non-Null Count  Dtype  
---  ------        --------------  -----  
 0   MSSubClass    1460 non-null   int64  
 1   MSZoning      1460 non-null   object 
 2   LotFrontage   1201 non-null   float64
 3   LotArea       1460 non-null   int64  
 4   Neighborhood  1460 non-null   object 
 5   BldgType      1460 non-null   object 
 6   HouseStyle    1460 non-null   object 
 7   OverallQual   1460 non-null   int64  
 8   OverallCond   1460 non-null   int64  
 9   YearBuilt     1460 non-null   int64  
 10  Remodeled     1460 non-null   int64  
 11  GrLivArea     1460 non-null   int64  
 12  BsmtFullBath  1460 non-null   int64  
 13  BsmtHalfBath  1460 non-null   int64  
 14  FullBath      1460 non-null   int64  
 15  HalfBath      1460 non-null   int64  
 16  BedroomAbvGr  1460 non-null   int64  
 17  Fireplaces    1460 non-null   int64  
 18  GarageArea    1460 non-null 

`LotFrontage` is the only column shown with missing values (1201 of 1460 entries are non-null, so 259 are missing). Here we fill them with `0`:

In [5]:
df.LotFrontage = df.LotFrontage.fillna(0)

In [6]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1460 entries, 0 to 1459
Data columns (total 21 columns):
 #   Column        Non-Null Count  Dtype  
---  ------        --------------  -----  
 0   MSSubClass    1460 non-null   int64  
 1   MSZoning      1460 non-null   object 
 2   LotFrontage   1460 non-null   float64
 3   LotArea       1460 non-null   int64  
 4   Neighborhood  1460 non-null   object 
 5   BldgType      1460 non-null   object 
 6   HouseStyle    1460 non-null   object 
 7   OverallQual   1460 non-null   int64  
 8   OverallCond   1460 non-null   int64  
 9   YearBuilt     1460 non-null   int64  
 10  Remodeled     1460 non-null   int64  
 11  GrLivArea     1460 non-null   int64  
 12  BsmtFullBath  1460 non-null   int64  
 13  BsmtHalfBath  1460 non-null   int64  
 14  FullBath      1460 non-null   int64  
 15  HalfBath      1460 non-null   int64  
 16  BedroomAbvGr  1460 non-null   int64  
 17  Fireplaces    1460 non-null   int64  
 18  GarageArea    1460 non-null 
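
Filling with `0` is one choice among several, and whether zero is a sensible value for `LotFrontage` depends on what a missing entry means in the data. As a minimal sketch on made-up values (not taken from the Ames data), the following compares a zero fill with a median fill:

```python
import numpy as np
import pandas as pd

# Toy data (hypothetical values) with one missing entry
toy = pd.DataFrame({'LotFrontage': [65.0, 80.0, np.nan, 60.0]})

# Option used in this notebook: replace missing values with 0
filled_zero = toy['LotFrontage'].fillna(0)

# Alternative: replace missing values with the column median
# (median() skips NaN by default)
filled_median = toy['LotFrontage'].fillna(toy['LotFrontage'].median())

print(filled_zero.tolist())    # [65.0, 80.0, 0.0, 60.0]
print(filled_median.tolist())  # [65.0, 80.0, 65.0, 60.0]
```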

Categorical (`object`) columns must be encoded as numbers before fitting the XGBoost model. First, we identify them with a boolean mask:

In [7]:
categorical_mask = df.dtypes == 'object'
categorical_mask

MSSubClass      False
MSZoning         True
LotFrontage     False
LotArea         False
Neighborhood     True
BldgType         True
HouseStyle       True
OverallQual     False
OverallCond     False
YearBuilt       False
Remodeled       False
GrLivArea       False
BsmtFullBath    False
BsmtHalfBath    False
FullBath        False
HalfBath        False
BedroomAbvGr    False
Fireplaces      False
GarageArea      False
PavedDrive       True
SalePrice       False
dtype: bool

In [8]:
categorical_columns = df.columns[categorical_mask].tolist()
categorical_columns

['MSZoning', 'Neighborhood', 'BldgType', 'HouseStyle', 'PavedDrive']
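
The same list can also be obtained with `DataFrame.select_dtypes`. The sketch below uses a small made-up frame; the `object` dtype is set explicitly so the example does not depend on how the installed pandas version infers string columns.

```python
import pandas as pd

# Toy frame (hypothetical values) with two string columns and one numeric column
toy = pd.DataFrame({
    'MSZoning': pd.Series(['RL', 'RM'], dtype='object'),
    'LotArea': [8450, 9600],
    'PavedDrive': pd.Series(['Y', 'N'], dtype='object'),
})

# Approach used above: boolean mask over the dtypes
mask_columns = toy.columns[toy.dtypes == 'object'].tolist()

# Alternative: let pandas select the columns by dtype
selected_columns = toy.select_dtypes(include='object').columns.tolist()

print(mask_columns)      # ['MSZoning', 'PavedDrive']
print(selected_columns)  # ['MSZoning', 'PavedDrive']
```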

#### __Encoding the categorical variables__

Using LabelEncoder:

In [9]:
URL = 'https://assets.datacamp.com/production/repositories/943/datasets/17a7c5c0acd7bfa253827ea53646cf0db7d39649/ames_unprocessed_data.csv'
df = pd.read_csv(URL)
df.LotFrontage = df.LotFrontage.fillna(0)

In [10]:
from sklearn.preprocessing import LabelEncoder
le = LabelEncoder()

In [11]:
df_encoded = df[categorical_columns].apply(lambda x: le.fit_transform(x))
df_encoded.head()

| | MSZoning | Neighborhood | BldgType | HouseStyle | PavedDrive |
|---|---|---|---|---|---|
| 0 | 3 | 5 | 0 | 5 | 2 |
| 1 | 3 | 24 | 0 | 2 | 2 |
| 2 | 3 | 5 | 0 | 5 | 2 |
| 3 | 3 | 6 | 0 | 5 | 2 |
| 4 | 3 | 15 | 0 | 5 | 2 |
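
One caveat of reusing a single `LabelEncoder` inside `apply` is that it is refitted on every column, so afterwards it only remembers the classes of the last column it saw. If the integer codes need to be decoded later, one option is to keep one encoder per column. A minimal sketch on made-up data (the `encoders` dictionary is an illustrative name, not part of the original notebook):

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

# Toy data (hypothetical values)
toy = pd.DataFrame({
    'MSZoning': ['RL', 'RM', 'RL'],
    'PavedDrive': ['Y', 'N', 'Y'],
})

# Keep a separate fitted encoder for each column
encoders = {}
encoded = pd.DataFrame(index=toy.index)
for col in toy.columns:
    encoders[col] = LabelEncoder()
    encoded[col] = encoders[col].fit_transform(toy[col])

print(encoded['MSZoning'].tolist())    # [0, 1, 0]
print(encoded['PavedDrive'].tolist())  # [1, 0, 1]

# Each encoder can map its own column back to the original labels
decoded = encoders['MSZoning'].inverse_transform(encoded['MSZoning'])
print(list(decoded))  # ['RL', 'RM', 'RL']
```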


Using OneHotEncoding:

In [12]:
URL = 'https://assets.datacamp.com/production/repositories/943/datasets/17a7c5c0acd7bfa253827ea53646cf0db7d39649/ames_unprocessed_data.csv'
df = pd.read_csv(URL)
df.LotFrontage = df.LotFrontage.fillna(0)

In [13]:
from sklearn.preprocessing import OneHotEncoder
# On scikit-learn >= 1.2 this argument is named `sparse_output`
ohe = OneHotEncoder(sparse=False)

In [14]:
df_encoded = ohe.fit_transform(df[categorical_columns])
df_encoded

array([[0., 0., 0., ..., 0., 0., 1.],
       [0., 0., 0., ..., 0., 0., 1.],
       [0., 0., 0., ..., 0., 0., 1.],
       ...,
       [0., 0., 0., ..., 0., 0., 1.],
       [0., 0., 0., ..., 0., 0., 1.],
       [0., 0., 0., ..., 0., 0., 1.]])
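
The raw array does not show which column corresponds to which category. The fitted encoder stores that information in `categories_`, one array per input column. A small sketch on made-up data; it uses the default sparse output and converts it with `.toarray()`, which sidesteps the version-dependent name of the dense-output argument:

```python
import pandas as pd
from sklearn.preprocessing import OneHotEncoder

# Toy data (hypothetical values)
toy = pd.DataFrame({
    'MSZoning': ['RL', 'RM', 'RL'],
    'PavedDrive': ['Y', 'N', 'Y'],
})

toy_ohe = OneHotEncoder()  # default output is a SciPy sparse matrix
dense = toy_ohe.fit_transform(toy).toarray()

print(dense.shape)  # (3, 4): two categories for each of the two columns

# categories_ lists the categories per input column, in output order
for col, cats in zip(toy.columns, toy_ohe.categories_):
    print(col, list(cats))
```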

Using DictVectorizer:

In [15]:
URL = 'https://assets.datacamp.com/production/repositories/943/datasets/17a7c5c0acd7bfa253827ea53646cf0db7d39649/ames_unprocessed_data.csv'
df = pd.read_csv(URL)
df.LotFrontage = df.LotFrontage.fillna(0)

In [16]:
from sklearn.feature_extraction import DictVectorizer

In [17]:
df_dict = df.to_dict(orient='records')

In [18]:
dv = DictVectorizer(sparse=False)
df_encoded = dv.fit_transform(df_dict)

In [19]:
df_encoded

array([[3.00000e+00, 1.00000e+00, 0.00000e+00, ..., 0.00000e+00,
        2.08500e+05, 2.00300e+03],
       [3.00000e+00, 1.00000e+00, 0.00000e+00, ..., 0.00000e+00,
        1.81500e+05, 1.97600e+03],
       [3.00000e+00, 1.00000e+00, 0.00000e+00, ..., 1.00000e+00,
        2.23500e+05, 2.00100e+03],
       ...,
       [4.00000e+00, 1.00000e+00, 0.00000e+00, ..., 1.00000e+00,
        2.66500e+05, 1.94100e+03],
       [2.00000e+00, 1.00000e+00, 0.00000e+00, ..., 1.00000e+00,
        1.42125e+05, 1.95000e+03],
       [3.00000e+00, 1.00000e+00, 0.00000e+00, ..., 0.00000e+00,
        1.47500e+05, 1.96500e+03]])

The `.vocabulary_` attribute maps each feature name to its column index in the encoded array:

In [20]:
dv.vocabulary_

{'MSSubClass': 23,
 'MSZoning=RL': 27,
 'LotFrontage': 22,
 'LotArea': 21,
 'Neighborhood=CollgCr': 34,
 'BldgType=1Fam': 1,
 'HouseStyle=2Story': 18,
 'OverallQual': 55,
 'OverallCond': 54,
 'YearBuilt': 61,
 'Remodeled': 59,
 'GrLivArea': 11,
 'BsmtFullBath': 6,
 'BsmtHalfBath': 7,
 'FullBath': 9,
 'HalfBath': 12,
 'BedroomAbvGr': 0,
 'Fireplaces': 8,
 'GarageArea': 10,
 'PavedDrive=Y': 58,
 'SalePrice': 60,
 'Neighborhood=Veenker': 53,
 'HouseStyle=1Story': 15,
 'Neighborhood=Crawfor': 35,
 'Neighborhood=NoRidge': 44,
 'Neighborhood=Mitchel': 40,
 'HouseStyle=1.5Fin': 13,
 'Neighborhood=Somerst': 50,
 'Neighborhood=NWAmes': 43,
 'MSZoning=RM': 28,
 'Neighborhood=OldTown': 46,
 'Neighborhood=BrkSide': 32,
 'BldgType=2fmCon': 2,
 'HouseStyle=1.5Unf': 14,
 'Neighborhood=Sawyer': 48,
 'Neighborhood=NridgHt': 45,
 'Neighborhood=NAmes': 41,
 'BldgType=Duplex': 3,
 'Neighborhood=SawyerW': 49,
 'Neighborhood=IDOTRR': 38,
 'PavedDrive=N': 56,
 'Neighborhood=MeadowV': 39,
 'BldgType=TwnhsE': 
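
To see how `DictVectorizer` builds this mapping, here is a small sketch on two made-up records. Numeric values are passed through unchanged, while each distinct string value becomes its own `name=value` column:

```python
from sklearn.feature_extraction import DictVectorizer

# Two toy records (hypothetical values)
records = [
    {'LotArea': 8450, 'MSZoning': 'RL'},
    {'LotArea': 9600, 'MSZoning': 'RM'},
]

toy_dv = DictVectorizer(sparse=False)
toy_encoded = toy_dv.fit_transform(records)

# Feature names are sorted, and vocabulary_ maps each name to its column index
print(toy_dv.feature_names_)  # ['LotArea', 'MSZoning=RL', 'MSZoning=RM']
print(toy_dv.vocabulary_)
print(toy_encoded)
```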

#### __Using a pipeline__

In [21]:
URL = 'https://assets.datacamp.com/production/repositories/943/datasets/17a7c5c0acd7bfa253827ea53646cf0db7d39649/ames_unprocessed_data.csv'
df = pd.read_csv(URL)

# Fill missing LotFrontage values with 0
df.LotFrontage = df.LotFrontage.fillna(0)

Splitting into variables $X$ and target $y$:

In [22]:
X, y = df.iloc[:, :-1], df.iloc[:, -1]

In [23]:
from sklearn.pipeline import Pipeline

Setting the pipeline's steps:

In [24]:
steps = [('ohe_onestep', DictVectorizer(sparse=False)),
         ('xgb_model', xgb.XGBRegressor())]

xgb_pipeline = Pipeline(steps)

Fitting the pipeline:

In [25]:
xgb_pipeline.fit(X.to_dict(orient='records'), y)

Pipeline(steps=[('ohe_onestep', DictVectorizer(sparse=False)),
                ('xgb_model',
                 XGBRegressor(base_score=0.5, booster='gbtree',
                              colsample_bylevel=1, colsample_bynode=1,
                              colsample_bytree=1, enable_categorical=False,
                              gamma=0, gpu_id=-1, importance_type=None,
                              interaction_constraints='',
                              learning_rate=0.300000012, max_delta_step=0,
                              max_depth=6, min_child_weight=1, missing=nan,
                              monotone_constraints='()', n_estimators=100,
                              n_jobs=4, num_parallel_tree=1, predictor='auto',
                              random_state=0, reg_alpha=0, reg_lambda=1,
                              scale_pos_weight=1, subsample=1,
                              tree_method='exact', validate_parameters=1,
                              verbosity=None))])

#### __Additional components for pipelines__

Some additional components that are useful in pipelines, both from the `sklearn_pandas` package:

- `DataFrameMapper`: interoperability between pandas (DataFrames) and scikit-learn (arrays)
- `CategoricalImputer`: imputation of missing categorical values before they are converted to integers
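
The sketch below does not use `CategoricalImputer` itself. As a stand-in, it uses scikit-learn's `SimpleImputer` with `strategy='most_frequent'` to illustrate the idea of filling missing categorical values before encoding; the toy column is made up.

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

# Toy categorical column (hypothetical values) with one missing entry
toy = pd.DataFrame({
    'MSZoning': pd.Series(['RL', 'RM', np.nan, 'RL'], dtype='object'),
})

# Stand-in for a categorical imputer: fill with the most frequent value
imputer = SimpleImputer(strategy='most_frequent')
imputed = imputer.fit_transform(toy)

print(imputed.ravel().tolist())  # ['RL', 'RM', 'RL', 'RL']
```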

Let's apply cross-validation:

In [26]:
from sklearn.model_selection import cross_val_score

In [27]:
steps = [('ohe_onestep', DictVectorizer(sparse=False)),
         # 'reg:linear' is deprecated; 'reg:squarederror' is its current name
         ('xgb_model', xgb.XGBRegressor(max_depth=2, objective='reg:squarederror'))]

xgb_pipeline = Pipeline(steps)

In [28]:
cross_val_scores = cross_val_score(xgb_pipeline, X.to_dict('records'), y,
                                   scoring='neg_mean_squared_error', cv=10)



In [29]:
rmse = np.mean(np.sqrt(np.abs(cross_val_scores)))

print(f'10-fold RMSE: {round(rmse, 2)}')

10-fold RMSE: 27683.04
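
scikit-learn scorers follow a "higher is better" convention, so the MSE is returned negated, which is why the absolute value is taken above. Note also that the cell takes the square root of each fold's MSE and then averages, which is not the same as taking the square root of the averaged MSE. A small numeric sketch with two made-up fold scores:

```python
import numpy as np

# Two hypothetical fold scores as returned by 'neg_mean_squared_error'
neg_mse_scores = np.array([-4.0, -16.0])

# Approach used above: RMSE per fold, then the mean of those RMSEs
mean_of_fold_rmse = np.mean(np.sqrt(np.abs(neg_mse_scores)))

# Alternative: average the MSE across folds first, then take the root
rmse_of_mean_mse = np.sqrt(np.mean(np.abs(neg_mse_scores)))

print(mean_of_fold_rmse)  # 3.0
print(rmse_of_mean_mse)   # about 3.162
```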


#### __Fine tuning using pipelines__

In [30]:
from sklearn.model_selection import RandomizedSearchCV

In [31]:
X_train = X.to_dict(orient='records')
y_train = y.values

In [32]:
steps = [('ohe_onestep', DictVectorizer(sparse=False)),
         ('xgb_reg', xgb.XGBRegressor())]

xgb_pipeline = Pipeline(steps)

In [33]:
gbm_param_grid = {
    'xgb_reg__max_depth': np.arange(3, 10, 1),
}

Performing a randomized search:

In [None]:
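# SalePrice is a continuous target, so use a regression metric
# ('roc_auc' only applies to classification)
randomized_neg_mse = RandomizedSearchCV(
    estimator=xgb_pipeline,
    param_distributions=gbm_param_grid,
    n_iter=2,
    scoring='neg_mean_squared_error',
    cv=2,
    verbose=1
)

randomized_neg_mse.fit(X_train, y_train)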