# Extreme Gradient Boosting with XGBoost

### [C1] Classification with XGBoost

In [10]:
import pandas as pd
import numpy as np
import xgboost as xgb

Loading the data:

In [2]:
URL = 'https://assets.datacamp.com/production/repositories/943/datasets/4dbcaee889ef06fb0763e4a8652a4c1f268359b2/ames_housing_trimmed_processed.csv'

In [3]:
df = pd.read_csv(URL)
df.head()

Unnamed: 0,MSSubClass,LotFrontage,LotArea,OverallQual,OverallCond,YearBuilt,Remodeled,GrLivArea,BsmtFullBath,BsmtHalfBath,...,HouseStyle_1.5Unf,HouseStyle_1Story,HouseStyle_2.5Fin,HouseStyle_2.5Unf,HouseStyle_2Story,HouseStyle_SFoyer,HouseStyle_SLvl,PavedDrive_P,PavedDrive_Y,SalePrice
0,60,65.0,8450,7,5,2003,0,1710,1,0,...,0,0,0,0,1,0,0,0,1,208500
1,20,80.0,9600,6,8,1976,0,1262,0,1,...,0,1,0,0,0,0,0,0,1,181500
2,60,68.0,11250,7,5,2001,1,1786,1,0,...,0,0,0,0,1,0,0,0,1,223500
3,70,60.0,9550,7,5,1915,1,1717,1,0,...,0,0,0,0,1,0,0,0,1,140000
4,60,84.0,14260,8,5,2000,0,2198,1,0,...,0,0,0,0,1,0,0,0,1,250000


In [21]:
df['SalePrice'].describe()

count      1460.000000
mean     180921.195890
std       79442.502883
min       34900.000000
25%      129975.000000
50%      163000.000000
75%      214000.000000
max      755000.000000
Name: SalePrice, dtype: float64

For classification purposes, we need to turn the target into a binary variable. The cutoff of 181000 approximates the mean sale price (about 180921):

In [24]:
df['price_over_mean'] = (df['SalePrice'] > 181000).astype(int)
df['price_over_mean']

0       1
1       1
2       1
3       0
4       1
       ..
1455    0
1456    1
1457    1
1458    0
1459    0
Name: price_over_mean, Length: 1460, dtype: int64
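
As a small illustration of the thresholding step, the sketch below applies the same comparison to the five sale prices visible in the `df.head()` output. The variable names `sale_price`, `threshold`, and `positive_share` are introduced here for illustration only and are not part of the notebook.

```python
import pandas as pd

# First five sale prices shown in the df.head() output above.
sale_price = pd.Series([208500, 181500, 223500, 140000, 250000], name="SalePrice")

# Same cutoff as the notebook: 181000, a rounded version of the mean.
threshold = 181_000

# The comparison yields booleans; astype(int) maps True/False to 1/0.
price_over_mean = (sale_price > threshold).astype(int)

# Share of positive labels, a quick check of class balance.
positive_share = price_over_mean.mean()

print(price_over_mean.tolist())  # [1, 1, 1, 0, 1]
print(positive_share)
```

The labels match the first five rows of `price_over_mean` shown above. Checking the share of positives on the full column is a cheap way to see how balanced the two classes are before reading too much into an accuracy figure.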

In [25]:
del df['SalePrice']

In [28]:
df.head()

Unnamed: 0,MSSubClass,LotFrontage,LotArea,OverallQual,OverallCond,YearBuilt,Remodeled,GrLivArea,BsmtFullBath,BsmtHalfBath,...,HouseStyle_1.5Unf,HouseStyle_1Story,HouseStyle_2.5Fin,HouseStyle_2.5Unf,HouseStyle_2Story,HouseStyle_SFoyer,HouseStyle_SLvl,PavedDrive_P,PavedDrive_Y,price_over_mean
0,60,65.0,8450,7,5,2003,0,1710,1,0,...,0,0,0,0,1,0,0,0,1,1
1,20,80.0,9600,6,8,1976,0,1262,0,1,...,0,1,0,0,0,0,0,0,1,1
2,60,68.0,11250,7,5,2001,1,1786,1,0,...,0,0,0,0,1,0,0,0,1,1
3,70,60.0,9550,7,5,1915,1,1717,1,0,...,0,0,0,0,1,0,0,0,1,0
4,60,84.0,14260,8,5,2000,0,2198,1,0,...,0,0,0,0,1,0,0,0,1,1


Creating features and target arrays:

In [29]:
X = df.iloc[:, :-1]
y = df.iloc[:, -1]

Splitting the data:

In [7]:
from sklearn.model_selection import train_test_split

In [30]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=123)
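
To see what an 80/20 split of 1460 rows produces, here is a minimal sketch on placeholder data. The arrays `X_demo` and `y_demo` are made up for illustration, and `stratify=y_demo` is an optional variation that the notebook does not use; it keeps the class proportions the same in both parts.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Placeholder data with the same number of rows as the housing set (1460).
n_rows = 1460
X_demo = np.zeros((n_rows, 3))
y_demo = np.arange(n_rows) % 2  # alternating 0/1 labels, purely illustrative

# stratify is an assumption added here, not part of the original call.
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=123, stratify=y_demo
)

print(X_tr.shape, X_te.shape)    # (1168, 3) (292, 3)
print(y_tr.mean(), y_te.mean())  # class share is preserved in both parts
```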

Fitting and predicting with an XGBoost classifier:

In [31]:
# random_state is the scikit-learn style name for the seed
xg_cl = xgb.XGBClassifier(objective='binary:logistic', n_estimators=10, random_state=123)

xg_cl.fit(X_train, y_train)
y_pred = xg_cl.predict(X_test)





In [32]:
accuracy = float(np.sum(y_pred == y_test)) / y_test.shape[0]
print(f'accuracy: {round(accuracy, 6)}')

accuracy: 0.931507
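
The accuracy above is simply the fraction of predictions that match the true labels. The sketch below checks that on toy arrays (`y_true` and `y_hat` are invented values, not the model's output) and shows that taking the mean of the boolean comparison gives the same number.

```python
import numpy as np

# Toy labels and predictions, invented for illustration.
y_true = np.array([1, 0, 1, 1, 0, 1])
y_hat = np.array([1, 0, 0, 1, 0, 1])

# Same formula as in the notebook: matches divided by number of samples.
acc_manual = float(np.sum(y_hat == y_true)) / y_true.shape[0]

# Equivalent shortcut: the mean of a boolean array is the share of True values.
acc_mean = float(np.mean(y_hat == y_true))

print(acc_manual, acc_mean)  # both equal 5/6
```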


__Using cross-validation capabilities__

We'll use XGBoost's learning API through its built-in cross-validation capabilities. For that, we need to convert our data into a `DMatrix`:

In [33]:
df_dmatrix = xgb.DMatrix(data=X, label=y)

Next we define the parameters. We'll use `nfold=3`, the number of folds used in cross-validation, and `num_boost_round=5`, the number of boosting rounds (one tree is added to the ensemble per round):

In [34]:
params = {
    'objective': 'binary:logistic',  # same logistic loss as 'reg:logistic', intended for binary labels
    'max_depth': 3
}

In [36]:
cv_results = xgb.cv(dtrain=df_dmatrix, params=params, nfold=3, num_boost_round=5,
                    metrics='error', as_pandas=True, seed=123)

cv_results

Unnamed: 0,train-error-mean,train-error-std,test-error-mean,test-error-std
0,0.100343,0.004784,0.123286,0.006018
1,0.096236,0.007268,0.11918,0.002964
2,0.087669,0.004095,0.113708,0.009945
3,0.077055,0.002232,0.104799,0.004542
4,0.06918,0.00381,0.095213,0.008703
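
Each row of this table summarises one boosting round across the folds: the metric is computed on every held-out fold and then reported as a mean and a standard deviation. The sketch below illustrates the general k-fold idea with NumPy only; it is not XGBoost's internal code, and the per-fold error values in `fold_errors` are hypothetical.

```python
import numpy as np

n_rows, nfold = 1460, 3

# Shuffle the row indices, then cut them into nfold nearly equal parts.
rng = np.random.default_rng(123)
shuffled = rng.permutation(n_rows)
folds = np.array_split(shuffled, nfold)
fold_sizes = [len(f) for f in folds]
print(fold_sizes)  # [487, 487, 486]

# Each fold serves once as the held-out set; the rest is used for training.
for k, test_idx in enumerate(folds):
    train_idx = np.concatenate([f for j, f in enumerate(folds) if j != k])
    assert len(train_idx) + len(test_idx) == n_rows

# Hypothetical per-fold errors for one boosting round, aggregated as in the table.
fold_errors = np.array([0.10, 0.12, 0.11])
mean_error = fold_errors.mean()
std_error = fold_errors.std()
print(mean_error, std_error)
```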


We're interested in the test error after the final boosting round, which we convert to accuracy as `1 - error`:

In [38]:
accuracy = 1 - cv_results['test-error-mean'].iloc[-1]
print(f'accuracy: {round(accuracy, 6)}')

accuracy: 0.904787


Measuring the AUC: for this, we only need to change the `metrics` parameter in the call to `xgb.cv`:

In [40]:
cv_results = xgb.cv(dtrain=df_dmatrix, params=params, nfold=3, num_boost_round=5,
                    metrics='auc', as_pandas=True, seed=123)

cv_results

Unnamed: 0,train-auc-mean,train-auc-std,test-auc-mean,test-auc-std
0,0.942242,0.000899,0.92178,0.005403
1,0.960882,0.001369,0.945141,0.008937
2,0.968627,0.002563,0.951584,0.00623
3,0.975862,0.001657,0.957878,0.005808
4,0.980463,0.002479,0.962099,0.003871


In [42]:
auc = cv_results['test-auc-mean'].iloc[-1]
print(f'AUC: {round(auc, 6)}')

AUC: 0.962099
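
One way to read an AUC value is as the probability that a randomly chosen positive example receives a higher score than a randomly chosen negative one. The sketch below computes it that way on toy data; `pairwise_auc` is a hypothetical helper written for this illustration, not an XGBoost function, and its all-pairs comparison is only practical for small inputs.

```python
import numpy as np

def pairwise_auc(y_true, scores):
    """AUC as the share of (positive, negative) pairs ranked correctly."""
    y_true = np.asarray(y_true)
    scores = np.asarray(scores, dtype=float)
    pos = scores[y_true == 1]
    neg = scores[y_true == 0]
    # Compare every positive score with every negative score.
    diff = pos[:, None] - neg[None, :]
    # A correctly ordered pair counts 1, a tie counts 0.5.
    wins = np.sum(diff > 0) + 0.5 * np.sum(diff == 0)
    return float(wins) / (len(pos) * len(neg))

# Toy labels and scores, invented for illustration.
auc_demo = pairwise_auc([0, 0, 1, 1], [0.1, 0.4, 0.35, 0.8])
print(auc_demo)  # 0.75
```

Here three of the four positive/negative pairs are ordered correctly, which gives 0.75. Unlike accuracy, this measure does not depend on a 0.5 decision threshold.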
