#  Formatting Data for XGBoostin' 

1. Encoding string output variables for classification
2. Preparing categorical input variables using one-hot encoding
3. Handling missing data automatically

In [5]:
import pandas as pd
import numpy as np
import xgboost as xgb

from sklearn import model_selection
from sklearn.metrics import accuracy_score
from sklearn.preprocessing import LabelEncoder
# train_test_split lives in sklearn.model_selection (sklearn.cross_validation was removed)
from sklearn.model_selection import train_test_split



## 1 - Label Encode String Classes

In [2]:
url = "https://archive.ics.uci.edu/ml/machine-learning-databases/iris/iris.data"
dataset = pd.read_csv(url, header=None)
dataset.head()

|   | 0 | 1 | 2 | 3 | 4 |
|---|---|---|---|---|---|
| 0 | 5.1 | 3.5 | 1.4 | 0.2 | Iris-setosa |
| 1 | 4.9 | 3.0 | 1.4 | 0.2 | Iris-setosa |
| 2 | 4.7 | 3.2 | 1.3 | 0.2 | Iris-setosa |
| 3 | 4.6 | 3.1 | 1.5 | 0.2 | Iris-setosa |
| 4 | 5.0 | 3.6 | 1.4 | 0.2 | Iris-setosa |


In [3]:
dataval = dataset.values
X = dataval[:, 0:4]
Y = dataval[:, 4]

In [4]:
label_encoder = LabelEncoder()
label_encoder = label_encoder.fit(Y)
label_encoded_y = label_encoder.transform(Y)

In [5]:
label_encoded_y

array([0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
       0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
       0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
       1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
       1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2,
       2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2,
       2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2])
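
As a small illustration of what the fitted encoder holds, the sketch below uses a toy list of labels (an assumption standing in for the last iris column). It shows the `classes_` attribute and how `inverse_transform` maps integer predictions back to the original strings.

```python
import numpy as np
from sklearn.preprocessing import LabelEncoder

# Toy labels standing in for the string class column (illustrative only).
labels = ["Iris-setosa", "Iris-versicolor", "Iris-virginica", "Iris-setosa"]

encoder = LabelEncoder()
codes = encoder.fit_transform(labels)  # fit and transform in one call

# classes_ lists the original strings in the order of their integer codes.
print(list(encoder.classes_))
print(codes.tolist())

# inverse_transform maps integer predictions back to the string labels.
decoded = encoder.inverse_transform(np.array([2, 0]))
print(decoded.tolist())
```

Keeping the fitted encoder around is what makes it possible to report predictions in terms of the original class names.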

In [6]:
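seed = 7
test_size = 0.33
# Note: seed and test_size are not passed below, so the split uses the
# default 25% test size and is not reproducible between runs.
X_train, X_test, y_train, y_test = train_test_split(X, label_encoded_y)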

In [7]:
model = xgb.XGBClassifier()
model.fit(X_train, y_train)
print(model)

XGBClassifier(base_score=0.5, booster='gbtree', colsample_bylevel=1,
       colsample_bytree=1, gamma=0, learning_rate=0.1, max_delta_step=0,
       max_depth=3, min_child_weight=1, missing=None, n_estimators=100,
       n_jobs=1, nthread=None, objective='multi:softprob', random_state=0,
       reg_alpha=0, reg_lambda=1, scale_pos_weight=1, seed=None,
       silent=True, subsample=1)


In [8]:
y_pred = model.predict(X_test)
predictions = [round(value) for value in y_pred]

# evaluate predictions
accuracy = accuracy_score(y_test, predictions)
print("Accuracy: %.2f%%" % (accuracy * 100.0))

Accuracy: 94.74%



XGBoost automatically treats this as a multiclass classification problem and uses the multi:softprob objective, a variation on the softmax loss function that models class probabilities. This suggests that, internally, the integer class labels are converted into a one-hot style encoding.
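
To make the softmax idea concrete, here is a minimal NumPy sketch. It is not XGBoost's internal code: the raw scores are made up, and `softmax` is a helper written only for this illustration.

```python
import numpy as np

def softmax(scores):
    """Convert per-class raw scores into probabilities that sum to 1 per row."""
    shifted = scores - scores.max(axis=1, keepdims=True)  # numerical stability
    exp_scores = np.exp(shifted)
    return exp_scores / exp_scores.sum(axis=1, keepdims=True)

# Hypothetical raw scores for two samples and three classes.
raw_scores = np.array([[2.0, 0.5, -1.0],
                       [0.1, 0.2, 3.0]])
probabilities = softmax(raw_scores)

# The predicted class is the one with the highest probability.
predicted_class = probabilities.argmax(axis=1)

# One-hot view of integer labels, the form a softmax loss compares against.
labels = np.array([0, 2])
one_hot_labels = np.eye(3, dtype=int)[labels]

print(probabilities.round(3))
print(predicted_class, one_hot_labels.tolist())
```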

## 2 - One Hot Encoding Categorical Data

In [9]:
from sklearn.preprocessing import OneHotEncoder

To prevent XGBoost from assuming that integer-encoded categories have an ordinal relationship, use one-hot encoding.
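
A toy example may help show the difference. The `color` column below is made up for illustration: integer codes impose an order (blue < green < red), while the one-hot columns do not.

```python
import pandas as pd

# Hypothetical categorical feature.
colors = pd.Series(["red", "green", "blue", "green"], name="color")

# Integer codes follow the sorted category order: blue=0, green=1, red=2.
integer_codes = colors.astype("category").cat.codes

# One-hot encoding gives each category its own 0/1 column instead.
one_hot = pd.get_dummies(colors, prefix="color", dtype=int)

print(integer_codes.tolist())
print(one_hot)
```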

In [None]:
# encode string class values as integers
label_encoder = LabelEncoder()
label_encoder = label_encoder.fit(Y)
label_encoded_y = label_encoder.transform(Y)

In [None]:
# encode string input values as integers
features = []
for i in range(0, X.shape[1]):
    label_encoder = LabelEncoder()
    feature = label_encoder.fit_transform(X[:,i])
    features.append(feature)
# each entry in features is one column, so stack them column-wise
# (reshaping the row-wise array would scramble the values)
encoded_x = np.column_stack(features)

In [None]:
# one-hot encode a single integer-encoded feature
feature = feature.reshape(X.shape[0], 1)
# in scikit-learn >= 1.2 the sparse argument is named sparse_output
onehot_encoder = OneHotEncoder(sparse=False)
feature = onehot_encoder.fit_transform(feature)

# label-encode, then one-hot encode every input column and concatenate the results
encoded_x = None
for i in range(0, X.shape[1]):
    label_encoder = LabelEncoder()
    feature = label_encoder.fit_transform(X[:,i])
    feature = feature.reshape(X.shape[0], 1)
    onehot_encoder = OneHotEncoder(sparse=False)
    feature = onehot_encoder.fit_transform(feature)
    if encoded_x is None:
        encoded_x = feature
    else:
        encoded_x = np.concatenate((encoded_x, feature), axis=1)
print("X shape: ", encoded_x.shape)
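
The per-column loop can likely be shortened, since `OneHotEncoder` in scikit-learn 0.20 and later accepts a 2D array of string categories directly. The sketch below assumes a made-up two-column array; calling `.toarray()` on the default sparse output avoids the `sparse` / `sparse_output` argument altogether.

```python
import numpy as np
from sklearn.preprocessing import OneHotEncoder

# Hypothetical categorical inputs: one row per sample, one column per feature.
X_toy = np.array([["red", "small"],
                  ["green", "large"],
                  ["blue", "small"]], dtype=object)

encoder = OneHotEncoder()
# fit_transform returns a sparse matrix by default; convert it for display.
encoded = encoder.fit_transform(X_toy).toarray()

# categories_ holds the sorted categories found in each column.
print([c.tolist() for c in encoder.categories_])
print("X shape: ", encoded.shape)
```

The output has one column per category across all features: three colors plus two sizes gives five columns here.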

For a two-class target, XGBoost chooses the binary:logistic objective automatically, which is the right objective for a binary classification problem.

## 3 - Missing Values

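About 30% of the values in the horse colic dataset are missing, so the data must be formatted carefully before training.

XGBoost was designed to work with sparse data, so the model itself can handle missing values once they are marked consistently.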
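
One way to mark missing entries consistently is to let pandas convert the placeholder to NaN at load time. The sketch below uses a few made-up whitespace-delimited rows with `?` as the placeholder, rather than the real file, and then measures how much of the table is missing.

```python
import io
import pandas as pd

# Toy rows in the same style as the data file: whitespace-delimited, '?' = missing.
raw = "2 1 38.5 ?\n1 ? 39.2 88\n2 1 ? 40\n"

# na_values tells read_csv to turn '?' into NaN while parsing.
toy = pd.read_csv(io.StringIO(raw), sep=r"\s+", header=None, na_values="?")

# Fraction of all cells that are missing.
missing_fraction = toy.isna().to_numpy().mean()

print(toy)
print("Missing: %.0f%%" % (missing_fraction * 100.0))
```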

In [1]:
cat ./data/horse-colic.names.txt

1. TItle: Horse Colic database

2. Source Information
   -- Creators: Mary McLeish & Matt Cecile
	  	Department of Computer Science
		University of Guelph
		Guelph, Ontario, Canada N1G 2W1
		mdmcleish@water.waterloo.edu
   -- Donor:    Will Taylor (taylor@pluto.arc.nasa.gov)
   -- Date:     8/6/89

3. Past Usage:
   -- Unknown

4. Relevant Information:

   -- 2 data files 
      -- horse-colic.data: 300 training instances
      -- horse-colic.test: 68 test instances
   -- Possible class attributes: 24 (whether lesion is surgical)
     -- others include: 23, 25, 26, and 27
   -- Many Data types: (continuous, discrete, and nominal)

5. Number of Instances: 368 (300 for training, 68 for testing)

6. Number of attributes: 28

7. Attribute Information:

  1:  surgery?
          1 = Yes, it had surgery
          2 = It was treated without surgery

  2:  Age 
          1 = Adult horse
          2 = Young (< 6 months)

  3:  Hospital Number 
          - nu

In [2]:
cols = ['Surgery', 'Age', 'Hospital Number', 'Rectal Temp', 'Pulse', 'Respiratory Rate', 'Extremity Temp', 
        'Peripheral Pulse', 'Mucous Membranes', 'Capillary Refill Time', 'Pain', 'Peristalsis', 'Abdominal Distension',
       'Nasogastric Tube', 'Nasogastric Reflux', 'Nasogastric Reflux PH', 'Rectal Exam - Feces', 'Abdomen', 'Packed Cell Volume',
       'Total Protein', 'Abdominocentesis Appearance', 'Abdominocentesis Total Protein', 'Outcome', 'Surgical Lesion',
       'Lesion Site', 'Lesion Type', 'Lesion Subtype', 'Pathology Data Present']

In [3]:
len(cols)

28

In [17]:
pd.read_csv?

In [6]:
df = pd.read_csv('./data/horse-colic.data.txt', delim_whitespace=True, header=None, names=cols)
df.head()

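|   | Surgery | Age | Hospital Number | Rectal Temp | Pulse | Respiratory Rate | Extremity Temp | Peripheral Pulse | Mucous Membranes | Capillary Refill Time | ... | Packed Cell Volume | Total Protein | Abdominocentesis Appearance | Abdominocentesis Total Protein | Outcome | Surgical Lesion | Lesion Site | Lesion Type | Lesion Subtype | Pathology Data Present |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 2 | 1 | 530101 | 38.5 | 66 | 28 | 3 | 3 | ? | 2 | ... | 45.0 | 8.4 | ? | ? | 2 | 2 | 11300 | 0 | 0 | 2 |
| 1 | 1 | 1 | 534817 | 39.2 | 88 | 20 | ? | ? | 4 | 1 | ... | 50.0 | 85.0 | 2 | 2 | 3 | 2 | 2208 | 0 | 0 | 2 |
| 2 | 2 | 1 | 530334 | 38.3 | 40 | 24 | 1 | 1 | 3 | 1 | ... | 33.0 | 6.7 | ? | ? | 1 | 2 | 0 | 0 | 0 | 1 |
| 3 | 1 | 9 | 5290409 | 39.1 | 164 | 84 | 4 | 1 | 6 | 2 | ... | 48.0 | 7.2 | 3 | 5.30 | 2 | 1 | 2208 | 0 | 0 | 1 |
| 4 | 2 | 1 | 530255 | 37.3 | 104 | 35 | ? | ? | 6 | 2 | ... | 74.0 | 7.4 | ? | ? | 2 | 2 | 4300 | 0 | 0 | 2 |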


In [16]:
horsie = df.copy()
horseset = horsie.values

In [18]:
X = horseset[:, 0:27]
Y = horseset[:, 27 ]

Change all '?' strings to 0 and convert everything to float.

In [19]:
X[X == '?'] = 0
X = X.astype('float32')
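
One thing to watch in the cells above: basic slicing of a NumPy array returns a view, so a masked assignment on the slice also changes the array it was sliced from. The toy arrays below are made up to show the effect and how `.copy()` avoids it.

```python
import numpy as np

# Toy object array with '?' placeholders, similar in spirit to horseset.
base = np.array([[1, '?', 3],
                 [4, 5, '?']], dtype=object)

view = base[:, 0:2]       # basic slicing returns a view on base
view[view == '?'] = 0     # the assignment writes through to base

print(base[0, 1])         # the '?' in base has been overwritten

# Taking a copy first leaves the source array untouched.
source = np.array([[1, '?', 3],
                   [4, 5, '?']], dtype=object)
safe = source[:, 0:2].copy()
safe[safe == '?'] = 0

print(source[0, 1])       # still '?'
```

If the same source array is reused later for a different missing-value strategy, slicing with `.copy()` keeps the original placeholders intact.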

Many class values are coded as 1/2. LabelEncoder converts them to 0/1.

In [21]:
label_encoder = LabelEncoder()
label_encoder = label_encoder.fit(Y)
label_encoded_y = label_encoder.transform(Y)

In [25]:
label_encoded_y[0:5]

array([1, 1, 0, 0, 1])

In [27]:
seed = 7
test_size = 0.33
X_train, X_test, y_train, y_test = train_test_split(X, label_encoded_y, test_size=test_size, random_state=seed)
model = xgb.XGBClassifier()
model.fit(X_train, y_train)
print(model)
y_pred = model.predict(X_test)
predictions = [round(value) for value in y_pred]
accuracy = accuracy_score(y_test, predictions)
print("Accuracy: %.2f%%" % (accuracy * 100.0))

XGBClassifier(base_score=0.5, booster='gbtree', colsample_bylevel=1,
       colsample_bytree=1, gamma=0, learning_rate=0.1, max_delta_step=0,
       max_depth=3, min_child_weight=1, missing=None, n_estimators=100,
       n_jobs=1, nthread=None, objective='binary:logistic', random_state=0,
       reg_alpha=0, reg_lambda=1, scale_pos_weight=1, seed=None,
       silent=True, subsample=1)
Accuracy: 83.84%



## How to Impute Values

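In [28]:
# Imputer was removed from sklearn.preprocessing in scikit-learn 0.22;
# SimpleImputer is its replacement
from sklearn.impute import SimpleImputer

In [30]:
# Caution: the earlier X[X == '?'] = 0 was applied to a view of horseset,
# so horseset no longer contains any '?' entries at this point
X = horseset[:, 0:27]
Y = horseset[:, 27]
X[X == '?'] = np.nan
X = X.astype('float32')

SimpleImputer() fills each missing value with the mean of its column by default.

In [32]:
imputer = SimpleImputer()
imputed_x = imputer.fit_transform(X)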
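
For intuition, mean imputation can be reproduced by hand with NumPy. This is a sketch on a made-up array, not the imputer's actual implementation: compute each column's mean while ignoring NaN, then substitute it wherever a value is missing.

```python
import numpy as np

# Toy feature matrix with missing entries marked as NaN.
X_toy = np.array([[1.0, np.nan],
                  [3.0, 4.0],
                  [np.nan, 8.0]])

# Column means computed over the non-missing values only.
column_means = np.nanmean(X_toy, axis=0)

# Replace each NaN with the mean of its column (broadcast across rows).
imputed = np.where(np.isnan(X_toy), column_means, X_toy)

print(column_means)
print(imputed)
```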

In [33]:
label_encoder = LabelEncoder()
label_encoder = label_encoder.fit(Y)
label_encoded_y = label_encoder.transform(Y)

In [35]:
seed = 7
test_size = 0.33
X_train, X_test, y_train, y_test = train_test_split(imputed_x, label_encoded_y, test_size=test_size, random_state=seed)
model = xgb.XGBClassifier()
model.fit(X_train, y_train)
print(model)
y_pred = model.predict(X_test)
predictions = [round(value) for value in y_pred]
accuracy = accuracy_score(y_test, predictions)
print("Accuracy: %.2f%%" % (accuracy * 100.0))

XGBClassifier(base_score=0.5, booster='gbtree', colsample_bylevel=1,
       colsample_bytree=1, gamma=0, learning_rate=0.1, max_delta_step=0,
       max_depth=3, min_child_weight=1, missing=None, n_estimators=100,
       n_jobs=1, nthread=None, objective='binary:logistic', random_state=0,
       reg_alpha=0, reg_lambda=1, scale_pos_weight=1, seed=None,
       silent=True, subsample=1)
Accuracy: 83.84%



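Here, imputation appears to give the same accuracy as setting the missing values to zero. The identical score is expected in this run: the earlier assignment X[X == '?'] = 0 modified horseset in place (X was a view of it), so no '?' entries were left to mark as NaN and the imputer had nothing to fill. Rebuild the array from df (for example, horseset = df.values) before marking missing values to get a real comparison.

It is still important to test both approaches, as they will not always give the same result.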