## XGBoost

XGBoost takes numerical values as input and frames every problem, including classification, as a regression-style predictive modeling problem. Your data must therefore be prepared into the expected format:

- Encode string output variables as integers for classification.
- Encode categorical input variables (e.g. with one-hot encoding).
- Handle missing data.


### Simple Case: numeric inputs with categorical outputs

The four numeric input columns of the iris dataset are:

   1. sepal length in cm
   2. sepal width in cm
   3. petal length in cm
   4. petal width in cm

In [18]:
import pandas
import numpy
import xgboost
from sklearn import model_selection
from sklearn.metrics import accuracy_score
from sklearn.preprocessing import LabelEncoder

import matplotlib.pyplot as plt

%matplotlib inline

In [3]:
# load data from http://archive.ics.uci.edu/ml/machine-learning-databases/iris/iris.data 
data = pandas.read_csv('../data/iris.data', header=None)
print(type(data))
dataset = data.values # convert pandas dataframe to <class 'numpy.ndarray'>
print(type(dataset))

<class 'pandas.core.frame.DataFrame'>
<class 'numpy.ndarray'>


In [4]:
# split data into X and y
X = dataset[:,0:4]
Y = dataset[:,4]

In [5]:
# print every 30th data point
print(type(X), X.shape)
print(X[::30])
print(Y[::30])

<class 'numpy.ndarray'> (150, 4)
[[5.1 3.5 1.4 0.2]
 [4.8 3.1 1.6 0.2]
 [5.0 2.0 3.5 1.0]
 [5.5 2.6 4.4 1.2]
 [6.9 3.2 5.7 2.3]]
['Iris-setosa' 'Iris-setosa' 'Iris-versicolor' 'Iris-versicolor'
 'Iris-virginica']


In [6]:
# encode string class values as integers
label_encoder = LabelEncoder()
label_encoder = label_encoder.fit(Y)
label_encoded_y = label_encoder.transform(Y)
print(label_encoded_y[::30])

[0 0 1 1 2]
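To read integer predictions back as class names, the fitted encoder can be used in the other direction. The following is a minimal sketch on a few toy labels (illustrative values, not the loaded dataset); the variable names `labels`, `encoder`, and `codes` are assumptions made for this example.

```python
from sklearn.preprocessing import LabelEncoder

# Toy labels standing in for the iris class column (illustrative values only).
labels = ["Iris-setosa", "Iris-virginica", "Iris-versicolor", "Iris-setosa"]

encoder = LabelEncoder().fit(labels)

# classes_ holds the sorted unique labels; a label's position is its integer code.
print(list(encoder.classes_))
codes = encoder.transform(labels)
print(list(codes))

# inverse_transform maps integer codes (e.g. model predictions) back to class names.
print(list(encoder.inverse_transform([2, 0, 1])))
```

Because the codes follow the sorted order of the labels, 0 corresponds to Iris-setosa, 1 to Iris-versicolor, and 2 to Iris-virginica.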


#### Handling new unseen categories

If you use a LabelEncoder like this one to convert categories to integer labels for the model's inputs, a category that appears at test time but was not seen in the training set makes `transform` raise a `ValueError`. To handle this, you may need to design a custom two-way dictionary that maps number to category and back and that also copes with new, unseen categories.
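One possible shape for such a two-way dictionary is sketched below. `SafeLabelEncoder` is a hypothetical helper written for this example, not part of scikit-learn, and reserving code 0 for unseen categories is just one design choice among several.

```python
class SafeLabelEncoder:
    """Two-way category <-> integer mapping with a reserved code for unseen values.

    Hypothetical helper for illustration; not part of scikit-learn.
    """

    UNKNOWN = "<unknown>"

    def fit(self, values):
        # Code 0 is reserved for unseen categories; known ones follow in sorted order.
        categories = [self.UNKNOWN] + sorted(set(values))
        self.to_code = {cat: i for i, cat in enumerate(categories)}
        self.to_category = {i: cat for cat, i in self.to_code.items()}
        return self

    def transform(self, values):
        # Unseen categories fall back to the reserved code instead of raising.
        return [self.to_code.get(v, 0) for v in values]

    def inverse_transform(self, codes):
        return [self.to_category[c] for c in codes]


encoder = SafeLabelEncoder().fit(["left", "right", "left"])
print(encoder.transform(["right", "central"]))  # "central" was never seen during fit
print(encoder.inverse_transform([1, 2, 0]))
```

Mapping every unseen value to a single reserved code keeps the pipeline running, at the cost of the model treating all unknown categories as the same thing.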

In [7]:
seed = 7
test_size = 0.33
X_train, X_test, y_train, y_test = model_selection.train_test_split(X, 
                                                                    label_encoded_y, 
                                                                    test_size=test_size, 
                                                                    random_state=seed)

In [8]:
# fit model on training data
model = xgboost.XGBClassifier()
model.fit(X_train, y_train)
print(model)

XGBClassifier(base_score=0.5, booster=None, colsample_bylevel=1,
              colsample_bynode=1, colsample_bytree=1, gamma=0, gpu_id=-1,
              importance_type='gain', interaction_constraints=None,
              learning_rate=0.300000012, max_delta_step=0, max_depth=6,
              min_child_weight=1, missing=nan, monotone_constraints=None,
              n_estimators=100, n_jobs=0, num_parallel_tree=1,
              objective='multi:softprob', random_state=0, reg_alpha=0,
              reg_lambda=1, scale_pos_weight=None, subsample=1,
              tree_method=None, validate_parameters=False, verbosity=None)


In [9]:
# make predictions for test data
y_pred = model.predict(X_test)
predictions = y_pred.tolist()  # predict() already returns integer class labels, so no rounding is needed
print(predictions[::5])
# evaluate predictions
accuracy = accuracy_score(y_test, predictions)
print("Accuracy: %.2f%%" % (accuracy * 100.0))

[2, 0, 2, 2, 1, 2, 2, 0, 2, 0]
Accuracy: 92.00%


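### Categorical inputs with categorical outputs

The columns of the Breast Cancer Data Set are listed below. In the raw file the class label is the first column, so the code that follows uses the first nine columns (Class through breast-quad) as inputs and the last column (irradiat) as the target.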

1. Class: no-recurrence-events, recurrence-events
2. age: 10-19, 20-29, 30-39, 40-49, 50-59, 60-69, 70-79, 80-89, 90-99.
3. menopause: lt40, ge40, premeno.
4. tumor-size: 0-4, 5-9, 10-14, 15-19, 20-24, 25-29, 30-34, 35-39, 40-44, 45-49, 50-54, 55-59
5. inv-nodes: 0-2, 3-5, 6-8, 9-11, 12-14, 15-17, 18-20, 21-23, 24-26, 27-29, 30-32, 33-35, 36-39
6. node-caps: yes, no.
7. deg-malig: 1, 2, 3.
8. breast: left, right.
9. breast-quad: left-up, left-low, right-up, right-low, central.
10. irradiat: yes, no.

In [14]:
# load data
data = pandas.read_csv('../data/breast-cancer.data', header=None)
dataset = data.values

# split data into X and y
X = dataset[:,0:9]
Y = dataset[:,9]

print(type(X), X.shape)
print(X[0])
X = X.astype(str) # deg-malig: int -> str


print(X[0])
print(Y[::50])

<class 'numpy.ndarray'> (286, 9)
['no-recurrence-events' '30-39' 'premeno' '30-34' '0-2' 'no' 3 'left'
 'left_low']
['no-recurrence-events' '30-39' 'premeno' '30-34' '0-2' 'no' '3' 'left'
 'left_low']
['no' 'no' 'no' 'no' 'no' 'no']


XGBoost may assume that the encoded integer values for each input variable have an ordinal relationship. For example, it may treat ‘left-up’ encoded as 0 and ‘left-low’ encoded as 1 for the breast-quad variable as if their numeric order were meaningful. In this case, that assumption is untrue.

Instead, we must map these integer values onto new binary variables, one new variable for each categorical value.

left-up, left-low, right-up, right-low, central ->

1,0,0,0,0

0,1,0,0,0

0,0,1,0,0

0,0,0,1,0

0,0,0,0,1
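The mapping above can be sketched in plain Python before reaching for a library. The helper name `one_hot` is an assumption made for this illustration.

```python
# Categories of the breast-quad variable, in the order listed above.
categories = ["left-up", "left-low", "right-up", "right-low", "central"]


def one_hot(value, categories):
    """Return a list with a single 1 at the position of `value`."""
    if value not in categories:
        raise ValueError(f"unknown category: {value!r}")
    return [1 if value == cat else 0 for cat in categories]


for cat in categories:
    print(cat, one_hot(cat, categories))
```

Each category gets its own binary column, so no numeric order is implied between categories.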

We can do this with the OneHotEncoder class in scikit-learn.

We one-hot encode each feature after we have label encoded it. First we reshape the label-encoded feature into a 2-dimensional NumPy array in which each integer value is a feature vector of length 1. We can then create the OneHotEncoder and encode the feature array.



In [29]:
# OneHotEncoder lives in scikit-learn's preprocessing module
from sklearn.preprocessing import OneHotEncoder

# what are all the unique categories of the first feature?
print(set(X[:,0]))
label_encoder = LabelEncoder()
feature = label_encoder.fit_transform(X[:,0])
# they have now been encoded as integers, but the categories are not ordinal
print(set(feature), feature.shape)
feature = feature.reshape(X.shape[0], 1) # add dimension: (286,) -> (286, 1)
print(feature[:2], feature.shape)
# note: scikit-learn 1.2 renamed the `sparse` argument to `sparse_output`
onehot_encoder = OneHotEncoder(sparse=False, categories='auto')
feature = onehot_encoder.fit_transform(feature)
print(feature[:2])

{'no-recurrence-events', 'recurrence-events'}
{0, 1} (286,)
[[0]
 [0]] (286, 1)
[[1. 0.]
 [1. 0.]]


In [19]:
from sklearn.preprocessing import OneHotEncoder

# one-hot encode every column and concatenate the results side by side
encoded_x = None
for i in range(0, X.shape[1]):
    label_encoder = LabelEncoder()
    feature = label_encoder.fit_transform(X[:,i])
    feature = feature.reshape(X.shape[0], 1)
    # note: scikit-learn 1.2 renamed the `sparse` argument to `sparse_output`
    onehot_encoder = OneHotEncoder(sparse=False, categories='auto')
    feature = onehot_encoder.fit_transform(feature)
    if encoded_x is None:
        encoded_x = feature
    else:
        encoded_x = numpy.concatenate((encoded_x, feature), axis=1)

print("X shape:", encoded_x.shape)

X shape: (286, 43)


In [32]:
encoded_x[::90] # each row is the concatenation of every feature's one-hot encoding

array([[1., 0., 0., 1., 0., 0., 0., 0., 0., 0., 1., 0., 0., 0., 0., 0.,
        1., 0., 0., 0., 0., 0., 1., 0., 0., 0., 0., 0., 0., 0., 1., 0.,
        0., 0., 1., 1., 0., 0., 0., 1., 0., 0., 0.],
       [1., 0., 0., 1., 0., 0., 0., 0., 0., 0., 1., 1., 0., 0., 0., 0.,
        0., 0., 0., 0., 0., 0., 1., 0., 0., 0., 0., 0., 0., 0., 1., 0.,
        0., 1., 0., 0., 1., 0., 1., 0., 0., 0., 0.],
       [1., 0., 0., 0., 1., 0., 0., 0., 0., 0., 1., 0., 0., 0., 0., 0.,
        0., 1., 0., 0., 0., 0., 1., 0., 0., 0., 0., 0., 0., 0., 0., 1.,
        0., 0., 1., 0., 1., 0., 0., 0., 1., 0., 0.],
       [0., 1., 0., 0., 0., 1., 0., 0., 1., 0., 0., 0., 0., 0., 0., 0.,
        1., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 1., 0., 0., 0., 1.,
        0., 0., 1., 1., 0., 0., 0., 0., 0., 1., 0.]])
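The per-feature loop above can likely be collapsed into a single call, assuming scikit-learn 0.20 or later, where OneHotEncoder accepts a 2-D array of strings directly. The sketch below uses a small stand-in array (`X_demo`, illustrative rows rather than the real data) and also shows `handle_unknown="ignore"`, which encodes categories unseen during fitting as all zeros.

```python
import numpy
from sklearn.preprocessing import OneHotEncoder

# Small stand-in for the string feature matrix X (illustrative rows, not the real data).
X_demo = numpy.array([
    ["30-39", "premeno", "left"],
    ["40-49", "ge40", "right"],
    ["40-49", "premeno", "left"],
])

# handle_unknown="ignore" encodes categories unseen during fit as all zeros.
encoder = OneHotEncoder(handle_unknown="ignore")
encoded_demo = encoder.fit_transform(X_demo).toarray()  # sparse matrix -> dense array

print(encoded_demo.shape)
print(encoder.categories_)

# A row containing an unseen age band still transforms without raising an error.
new_row = numpy.array([["50-59", "ge40", "left"]])
print(encoder.transform(new_row).toarray())
```

Calling `.toarray()` on the default sparse result avoids depending on the name of the dense-output argument, which differs between scikit-learn versions.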

In [33]:
# encode string class values as integers
label_encoder = LabelEncoder()
label_encoder = label_encoder.fit(Y)
label_encoded_y = label_encoder.transform(Y)

In [35]:
seed = 7
test_size = 0.33
X_train, X_test, y_train, y_test = model_selection.train_test_split(encoded_x, 
                                                    label_encoded_y, 
                                                    test_size=test_size, 
                                                    random_state=seed)

In [37]:
# fit model on training data
model = xgboost.XGBClassifier()
model.fit(X_train, y_train)
print(model)
# make predictions for test data
y_pred = model.predict(X_test)
predictions = y_pred.tolist()  # predict() already returns integer class labels, so no rounding is needed
# evaluate predictions
accuracy = accuracy_score(y_test, predictions)
print("Accuracy: %.2f%%" % (accuracy * 100.0))

XGBClassifier(base_score=0.5, booster=None, colsample_bylevel=1,
              colsample_bynode=1, colsample_bytree=1, gamma=0, gpu_id=-1,
              importance_type='gain', interaction_constraints=None,
              learning_rate=0.300000012, max_delta_step=0, max_depth=6,
              min_child_weight=1, missing=nan, monotone_constraints=None,
              n_estimators=100, n_jobs=0, num_parallel_tree=1,
              objective='binary:logistic', random_state=0, reg_alpha=0,
              reg_lambda=1, scale_pos_weight=1, subsample=1, tree_method=None,
              validate_parameters=False, verbosity=None)
Accuracy: 71.58%
