# Tuning the XGBoost and the use of Google Colab's GPU

## 1. Case Study: 1994 census US Income

This case study will perform an predictive analysis related to 1994 census data set on U.S. income. It contains information on **marital status**, **age**, **type of work**, and **more**. The target column, **high_income**, records salaries less than or equal to 50k a year (0), and more than 50k a year (1).

You can download the data from the [University of California](http://archive.ics.uci.edu/ml/datasets/Adult), Irvine's website.

### 1.1. Load Libraries


In [0]:
import pandas as pd
import numpy as np
import xgboost as xgb
import pickle
import time
from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import SelectKBest, chi2
from sklearn.metrics import make_scorer, accuracy_score, classification_report, confusion_matrix
from sklearn.model_selection import train_test_split, StratifiedKFold, cross_val_score, GridSearchCV
from sklearn.pipeline import Pipeline, FeatureUnion
from sklearn.preprocessing import StandardScaler, MinMaxScaler
from xgboost import XGBClassifier

### 1.2. Get the data

In [0]:
# Loading the dataset and showing the five first records.
income = pd.read_csv("income.csv")
income.head()

Unnamed: 0,age,workclass,fnlwgt,education,education_num,marital_status,occupation,relationship,race,sex,capital_gain,capital_loss,hours_per_week,native_country,high_income
0,39,State-gov,77516,Bachelors,13,Never-married,Adm-clerical,Not-in-family,White,Male,2174,0,40,United-States,<=50K
1,50,Self-emp-not-inc,83311,Bachelors,13,Married-civ-spouse,Exec-managerial,Husband,White,Male,0,0,13,United-States,<=50K
2,38,Private,215646,HS-grad,9,Divorced,Handlers-cleaners,Not-in-family,White,Male,0,0,40,United-States,<=50K
3,53,Private,234721,11th,7,Married-civ-spouse,Handlers-cleaners,Husband,Black,Male,0,0,40,United-States,<=50K
4,28,Private,338409,Bachelors,13,Married-civ-spouse,Prof-specialty,Wife,Black,Female,0,0,40,Cuba,<=50K


In [0]:
# Verifying if there are missing values.
income.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 32561 entries, 0 to 32560
Data columns (total 15 columns):
age               32561 non-null int64
workclass         32561 non-null object
fnlwgt            32561 non-null int64
education         32561 non-null object
education_num     32561 non-null int64
marital_status    32561 non-null object
occupation        32561 non-null object
relationship      32561 non-null object
race              32561 non-null object
sex               32561 non-null object
capital_gain      32561 non-null int64
capital_loss      32561 non-null int64
hours_per_week    32561 non-null int64
native_country    32561 non-null object
high_income       32561 non-null object
dtypes: int64(6), object(9)
memory usage: 3.7+ MB


### 1.3. Clean, prepare and manipulate Data (feature engineering)

This dataset contains a mix of **categorical (9 columns)** and **numerical (6 columns)** independent variables which as we know will need to pre-processed in different ways and separately.

This means that initially they’ll have to go through **separate pipelines** to be pre-processed appropriately and then we’ll combine them together. So the first step in both pipelines would have to be to extract the appropriate columns that need to be pushed down for pre-processing.

In [0]:
# The only step necessary to be done outside of pipeline.
# Converting the target column to categorical.
col = pd.Categorical(income.high_income)
income["high_income"] = col.codes

In [0]:
# Custom Transformer that extracts columns passed as argument to its constructor.
class FeatureSelector(BaseEstimator, TransformerMixin):
    # Class Constructor.
    def __init__( self, feature_names ):
        self.feature_names = feature_names

    # Returns self nothing else to do here.
    def fit( self, X, y = None):
        return self

    # Method that describes what we need this transformer to do.
    def transform( self, X, y = None):
        return X[self.feature_names]

#### 1.3.1. Categorical Pipeline

In [0]:
# Converts certain features to categorical.
class CategoricalTransformer(BaseEstimator, TransformerMixin):
    # Class constructor method that takes a boolean as its argument.
    def __init__(self, new_features=True):
        self.new_features = new_features

    # Returns self nothing else to do here.
    def fit( self, X, y = None):
        return self

    # Transformer method we wrote for this transformer.
    def transform(self, X , y = None):
        df = X.copy()
        if self.new_features:
            # Treats ? workclass as unknown.
            df['workclass']= df['workclass'].replace('?','Unknown')
            # Two many category level, convert just US and Non-US.
            df.loc[df['native_country'] != ' United-States', 'native_country'] = 'non_usa'

        # Converts columns to categorical.
        for name in df.columns.to_list():
            col = pd.Categorical(df[name])
            df[name] = col.codes

        # Returns numpy array.
        return df

### 1.4. Modeling (train and test)

In [0]:
# Global varibles.
seed = 42
num_folds = 10
scoring = {'AUC': 'roc_auc', 'Accuracy': make_scorer(accuracy_score)}

In [0]:
# Split-out train/validation and test dataset.
X_train, X_test, y_train, y_test = train_test_split(income.drop(labels="high_income",axis=1),
                                                    income["high_income"],
                                                    test_size=0.20,
                                                    random_state=seed,
                                                    shuffle=True,
                                                    stratify=income["high_income"])

In [0]:
# Categrical features to pass down the categorical pipeline.
categorical_features = income.select_dtypes("object").columns.to_list()

# Numerical features to pass down the numerical pipeline.
numerical_features = income.select_dtypes("int64").columns.to_list()

# Defining the steps in the categorical pipeline.
categorical_pipeline = Pipeline(steps = [('cat_selector', FeatureSelector(categorical_features)),
                                         ('cat_transformer', CategoricalTransformer())])

# Defining the steps in the numerical pipeline.
numerical_pipeline = Pipeline(steps = [('num_selector', FeatureSelector(numerical_features)),
                                       ('std_scaler', MinMaxScaler())])

# Combining numerical and categorical piepline into one full big pipeline horizontally using FeatureUnion.
full_pipeline_preprocessing = FeatureUnion(transformer_list = [('categorical_pipeline', categorical_pipeline),
                                                               ('numerical_pipeline', numerical_pipeline)])

### 1.5. Algorithm Tuning

In [0]:
# The full pipeline as a step in another pipeline with an estimator as the final step.
pipe = Pipeline(steps = [("full_pipeline", full_pipeline_preprocessing),
                         ("fs", SelectKBest()),
                         ("clf", XGBClassifier())])

# Creating a dictionary with the hyperparameters.
search_space = [
                {"clf": [RandomForestClassifier()],
                 "clf__n_estimators": [800],
                 "clf__criterion": ["gini", "entropy"],
                 "clf__max_leaf_nodes": [300],
                 "clf__random_state": [seed],
                 "clf__oob_score": [True],
                 "fs__score_func": [chi2],
                 "fs__k": [10]},
                {"clf": [XGBClassifier()],
                 "clf__n_estimators": [300],
                 "clf__max_depth": [4],
                 "clf__learning_rate": [0.1],
                 "clf__random_state": [seed],
                 "clf__subsample": [1],
                 "clf__colsample_bytree": [1],
#                  "clf__tree_method": ["gpu_hist"],  # For using the GPU.
                 "fs__score_func":[chi2],
                 "fs__k":[13]}
]

# Defining StratifiedKFold object.
kfold = StratifiedKFold(n_splits=num_folds, random_state=seed, shuffle=True)

#############################################################################
# return_train_score=True
# official documentation: "computing the scores on the training set can be
# computationally expensive and is not strictly required to
# select the parameters that yield the best generalization performance".
#############################################################################

# Creating the GridSearchCV object.
grid = GridSearchCV(estimator=pipe, 
                    param_grid=search_space,
                    cv=kfold,
                    scoring=scoring,
                    return_train_score=True,
                    n_jobs=-1,
                    refit="AUC")

# Getting the time start.
tmp = time.time()

# Fitting the GridSearchCV object.
best_model = grid.fit(X_train, y_train)

# Printing the time spent.
print("CPU Training Time: %s seconds" % (str(time.time() - tmp)))  # 311.7510848045349 seconds
# print("GPU Training Time: %s seconds" % (str(time.time() - tmp)))  # 301.1867105960846 seconds

CPU Training Time: 311.7510848045349 seconds


In [0]:
print("Best: %f using %s" % (best_model.best_score_,best_model.best_params_))  # 0.920415 (GPU) | 0.920279 (CPU)

Best: 0.920279 using {'clf': XGBClassifier(base_score=0.5, booster='gbtree', colsample_bylevel=1,
              colsample_bynode=1, colsample_bytree=1, gamma=0,
              learning_rate=0.1, max_delta_step=0, max_depth=4,
              min_child_weight=1, missing=None, n_estimators=300, n_jobs=1,
              nthread=None, objective='binary:logistic', random_state=42,
              reg_alpha=0, reg_lambda=1, scale_pos_weight=1, seed=None,
              silent=None, subsample=1, verbosity=1), 'clf__colsample_bytree': 1, 'clf__learning_rate': 0.1, 'clf__max_depth': 4, 'clf__n_estimators': 300, 'clf__random_state': 42, 'clf__subsample': 1, 'fs__k': 13, 'fs__score_func': <function chi2 at 0x7f08fc22c6a8>}


In [0]:
# Analyzing the results of cross-validation process.
result = pd.DataFrame(best_model.cv_results_)
result.head()

Unnamed: 0,mean_fit_time,std_fit_time,mean_score_time,std_score_time,param_clf,param_clf__criterion,param_clf__max_leaf_nodes,param_clf__n_estimators,param_clf__oob_score,param_clf__random_state,param_fs__k,param_fs__score_func,param_clf__colsample_bytree,param_clf__learning_rate,param_clf__max_depth,param_clf__subsample,params,split0_test_AUC,split1_test_AUC,split2_test_AUC,split3_test_AUC,split4_test_AUC,split5_test_AUC,split6_test_AUC,split7_test_AUC,split8_test_AUC,split9_test_AUC,mean_test_AUC,std_test_AUC,rank_test_AUC,split0_train_AUC,split1_train_AUC,split2_train_AUC,split3_train_AUC,split4_train_AUC,split5_train_AUC,split6_train_AUC,split7_train_AUC,split8_train_AUC,split9_train_AUC,mean_train_AUC,std_train_AUC,split0_test_Accuracy,split1_test_Accuracy,split2_test_Accuracy,split3_test_Accuracy,split4_test_Accuracy,split5_test_Accuracy,split6_test_Accuracy,split7_test_Accuracy,split8_test_Accuracy,split9_test_Accuracy,mean_test_Accuracy,std_test_Accuracy,rank_test_Accuracy,split0_train_Accuracy,split1_train_Accuracy,split2_train_Accuracy,split3_train_Accuracy,split4_train_Accuracy,split5_train_Accuracy,split6_train_Accuracy,split7_train_Accuracy,split8_train_Accuracy,split9_train_Accuracy,mean_train_Accuracy,std_train_Accuracy
0,18.210739,0.101418,0.927936,0.007255,"RandomForestClassifier(bootstrap=True, class_w...",gini,300.0,800,True,42,10,<function chi2 at 0x7f08fc22c6a8>,,,,,"{'clf': RandomForestClassifier(bootstrap=True,...",0.923003,0.910321,0.906696,0.924319,0.917938,0.912002,0.918228,0.905922,0.915092,0.906398,0.913992,0.00645,3,0.939462,0.94005,0.940241,0.939373,0.9398,0.94003,0.939486,0.940109,0.940028,0.940117,0.93987,0.000301,0.863776,0.852264,0.854566,0.872169,0.865259,0.853303,0.859447,0.857143,0.863287,0.854839,0.859605,0.006071,1,0.884907,0.884651,0.885974,0.884315,0.885211,0.885685,0.885813,0.885728,0.884917,0.885258,0.885246,0.000522
1,20.769695,0.127662,1.004835,0.041484,"RandomForestClassifier(bootstrap=True, class_w...",entropy,300.0,800,True,42,10,<function chi2 at 0x7f08fc22c6a8>,,,,,"{'clf': RandomForestClassifier(bootstrap=True,...",0.92376,0.910589,0.907789,0.924127,0.918645,0.912665,0.919372,0.906761,0.91483,0.907201,0.914574,0.006295,2,0.940572,0.941107,0.941312,0.940279,0.940777,0.94133,0.940619,0.941232,0.941232,0.941315,0.940978,0.000363,0.863392,0.849578,0.855718,0.871401,0.864491,0.855991,0.859447,0.859063,0.860983,0.855223,0.859529,0.005723,2,0.877741,0.878082,0.878935,0.876765,0.878429,0.879585,0.878604,0.879372,0.878391,0.879543,0.878545,0.000836
2,7.460662,0.06686,0.161731,0.008662,"XGBClassifier(base_score=0.5, booster='gbtree'...",,,300,,42,13,<function chi2 at 0x7f08fc22c6a8>,1.0,0.1,4.0,1.0,"{'clf': XGBClassifier(base_score=0.5, booster=...",0.931583,0.918099,0.90838,0.929589,0.926899,0.925231,0.92304,0.913778,0.919776,0.906407,0.920279,0.008196,1,0.941436,0.941725,0.941998,0.941231,0.94197,0.94202,0.941465,0.941914,0.942004,0.94209,0.941785,0.000288,0.866078,0.85495,0.846508,0.876008,0.872937,0.865591,0.849846,0.859063,0.860215,0.841398,0.85926,0.010695,3,0.884737,0.884268,0.884907,0.883078,0.884571,0.884533,0.884917,0.884149,0.884277,0.884448,0.884389,0.000503


In [0]:
# ROC of best model.
result[result.rank_test_AUC == 1][['mean_train_AUC', 'std_train_AUC','mean_test_AUC', 'std_test_AUC']]

Unnamed: 0,mean_train_AUC,std_train_AUC,mean_test_AUC,std_test_AUC
2,0.941785,0.000288,0.920279,0.008196


In [0]:
# Training score much higher than test score.
# The standard deviation of the test score is large.
result_auc = result[['mean_train_AUC', 'std_train_AUC','mean_test_AUC', 'std_test_AUC']]
result_auc

Unnamed: 0,mean_train_AUC,std_train_AUC,mean_test_AUC,std_test_AUC
0,0.93987,0.000301,0.913992,0.00645
1,0.940978,0.000363,0.914574,0.006295
2,0.941785,0.000288,0.920279,0.008196


In [0]:
result_acc = result[['mean_train_Accuracy', 'std_train_Accuracy','mean_test_Accuracy', 'std_test_Accuracy']]
result_acc

Unnamed: 0,mean_train_Accuracy,std_train_Accuracy,mean_test_Accuracy,std_test_Accuracy
0,0.885246,0.000522,0.859605,0.006071
1,0.878545,0.000836,0.859529,0.005723
2,0.884389,0.000503,0.85926,0.010695


### 1.6. Finalizing the Model

In [0]:
# Serializing the best model.
with open('pipe.pkl', 'wb') as file:
    pickle.dump(best_model, file)

In [0]:
# Restoring the best model.
with open("pipe.pkl", "rb") as file:
    best_model = pickle.load(file)

In [0]:
# Testing final model.
predict = best_model.predict(X_test)
print("Accuracy of testing: ", accuracy_score(y_test, predict), "\n")  # 0.8765545831414094 (GPU) | 0.8747121142330723 (CPU)
print("Confusion Matrix:\n", confusion_matrix(y_test,predict), "\n")
print("Classification report:\n", classification_report(y_test,predict))

Accuracy of testing:  0.8747121142330723 

Confusion Matrix:
 [[4652  293]
 [ 523 1045]] 

Classification report:
               precision    recall  f1-score   support

           0       0.90      0.94      0.92      4945
           1       0.78      0.67      0.72      1568

    accuracy                           0.87      6513
   macro avg       0.84      0.80      0.82      6513
weighted avg       0.87      0.87      0.87      6513

