## XGBoost

XGBoost models treat every problem as a regression-style predictive modeling problem over numerical inputs, so your data must be prepared in the expected format:

- Encode string output variables as integers for classification.
- Prepare categorical input variables (e.g., one-hot encoding).
- Handle missing data (XGBoost treats `NaN` as missing by default).
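
As a rough illustration of the last two preparation steps, the sketch below one-hot encodes a categorical input with `pandas.get_dummies` and leaves missing values as `NaN`. The toy frame and its column names (`petal_length`, `color`) are made up for this example and are not part of the iris data used below.

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame: one numeric column with a gap, one categorical column.
raw = pd.DataFrame({
    "petal_length": [1.4, np.nan, 5.7],
    "color": ["red", "blue", "red"],
})

# One-hot encode the categorical input; the numeric column passes through unchanged.
encoded = pd.get_dummies(raw, columns=["color"], dtype=float)

# Missing values are kept as NaN rather than imputed.
features = encoded.to_numpy(dtype=float)

print(encoded.columns.tolist())
print(features)
```

The resulting `features` array is fully numeric, which is the shape of input the classifier in the next section expects.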


### Simple Case: numeric inputs with categorical outputs

In [1]:
import pandas
import xgboost
from sklearn import model_selection
from sklearn.metrics import accuracy_score
from sklearn.preprocessing import LabelEncoder

In [13]:
# load data from http://archive.ics.uci.edu/ml/machine-learning-databases/iris/iris.data 
data = pandas.read_csv('../data/iris.data', header=None)
print(type(data))
dataset = data.values # convert pandas dataframe to <class 'numpy.ndarray'>
print(type(dataset))

<class 'pandas.core.frame.DataFrame'>
<class 'numpy.ndarray'>


In [3]:
# split data into X and y
X = dataset[:,0:4]
Y = dataset[:,4]

In [11]:
# print every 30th data point
print(type(X), X.shape)
print(X[::30])
print(Y[::30])

<class 'numpy.ndarray'> (150, 4)
[[5.1 3.5 1.4 0.2]
 [4.8 3.1 1.6 0.2]
 [5.0 2.0 3.5 1.0]
 [5.5 2.6 4.4 1.2]
 [6.9 3.2 5.7 2.3]]
['Iris-setosa' 'Iris-setosa' 'Iris-versicolor' 'Iris-versicolor'
 'Iris-virginica']


In [10]:
# encode string class values as integers
label_encoder = LabelEncoder()
label_encoder = label_encoder.fit(Y)
label_encoded_y = label_encoder.transform(Y)
print(label_encoded_y[::30])

[0 0 1 1 2]
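
The fitted encoder also keeps the reverse mapping, which is useful for turning integer predictions back into class names. A minimal sketch, using a small hand-written label list rather than the full iris column:

```python
from sklearn.preprocessing import LabelEncoder

# Small illustrative label list (not the full dataset).
labels = ["Iris-setosa", "Iris-versicolor", "Iris-virginica", "Iris-setosa"]

encoder = LabelEncoder().fit(labels)
codes = encoder.transform(labels)

# classes_ holds the sorted unique labels; the position of a label is its integer code.
print(list(encoder.classes_))
print(codes)

# inverse_transform maps integer codes back to the original strings.
decoded = encoder.inverse_transform(codes)
print(decoded)
```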


### Handling new, unseen categories
If you use this `label_encoder` to convert categories to integer labels for the input to the model and you encounter a category at test time that was not seen in the training set, `label_encoder.transform` raises a `ValueError`. In that case you may need to design a custom two-way dictionary (category to number and number to category) that can handle new, unseen categories.
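
One possible shape for such a two-way dictionary is sketched below. The class name `CategoryMap`, the `<unknown>` placeholder, and the choice to reserve code 0 for unseen values are all assumptions made for illustration, not an established API.

```python
class CategoryMap:
    """Hypothetical two-way mapping between categories and integer codes."""

    UNKNOWN = "<unknown>"

    def __init__(self, categories):
        # Code 0 is reserved for anything not seen when the map was built.
        self.index_to_category = [self.UNKNOWN] + sorted(set(categories))
        self.category_to_index = {
            category: index
            for index, category in enumerate(self.index_to_category)
        }

    def encode(self, values):
        # Unseen categories fall back to code 0 instead of raising an error.
        return [self.category_to_index.get(value, 0) for value in values]

    def decode(self, codes):
        return [self.index_to_category[code] for code in codes]


mapping = CategoryMap(["red", "green", "blue"])
codes = mapping.encode(["green", "purple"])  # "purple" was never seen
names = mapping.decode(codes)
print(codes, names)
```

Whether a single shared "unknown" code is appropriate depends on the model and the data; it simply keeps the encoding step from failing at test time.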

In [9]:
seed = 7
test_size = 0.33
X_train, X_test, y_train, y_test = model_selection.train_test_split(X, 
                                                                    label_encoded_y, 
                                                                    test_size=test_size, 
                                                                    random_state=seed)

In [14]:
# fit model on training data
model = xgboost.XGBClassifier()
model.fit(X_train, y_train)
print(model)

XGBClassifier(base_score=0.5, booster=None, colsample_bylevel=1,
              colsample_bynode=1, colsample_bytree=1, gamma=0, gpu_id=-1,
              importance_type='gain', interaction_constraints=None,
              learning_rate=0.300000012, max_delta_step=0, max_depth=6,
              min_child_weight=1, missing=nan, monotone_constraints=None,
              n_estimators=100, n_jobs=0, num_parallel_tree=1,
              objective='multi:softprob', random_state=0, reg_alpha=0,
              reg_lambda=1, scale_pos_weight=None, subsample=1,
              tree_method=None, validate_parameters=False, verbosity=None)


In [17]:
# make predictions for test data
# XGBClassifier.predict already returns integer class labels, so no rounding is needed
y_pred = model.predict(X_test)
predictions = y_pred
# evaluate predictions
accuracy = accuracy_score(y_test, predictions)
print("Accuracy: %.2f%%" % (accuracy * 100.0))

Accuracy: 92.00%
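
Accuracy is the fraction of test points whose predicted label matches the true label. The sketch below checks this against `accuracy_score` and adds a confusion matrix to show which classes get mixed up; the two label arrays are made-up values, not the predictions from the model above.

```python
import numpy as np
from sklearn.metrics import accuracy_score, confusion_matrix

# Hypothetical true and predicted labels for six test points.
y_true = np.array([0, 0, 1, 1, 2, 2])
y_hat = np.array([0, 0, 1, 2, 2, 2])

# Accuracy computed by hand: share of positions where the labels agree.
manual_accuracy = float(np.mean(y_true == y_hat))
library_accuracy = accuracy_score(y_true, y_hat)

# Rows are true classes, columns are predicted classes.
matrix = confusion_matrix(y_true, y_hat)

print("Accuracy: %.2f%%" % (manual_accuracy * 100.0))
print(matrix)
```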
