XGBoost is widely regarded as one of the strongest machine learning algorithms currently available, particularly for structured (tabular) data.

This experiment uses the Iris data set, which includes the width and length of the petals and sepals of many Iris flowers, along with the species each flower belongs to.

The challenge is to predict the species of a flower sample based only on the sizes of its petals and sepals.

In [1]:
from sklearn.datasets import load_iris

# load data from lib
iris = load_iris()

# explore shape of data
numSamples, numFeatures = iris.data.shape
print(numSamples) # 150 samples
print(numFeatures) # each flower has 4 features
print(list(iris.target_names)) # 3 different labels

150
4
['setosa', 'versicolor', 'virginica']


Divide our data so that 20% is reserved for testing our model and the remaining 80% is used to train it.

By withholding our test data, we can make sure we're evaluating the model on new flowers it hasn't seen before.

Typically we refer to our features (in this case, the petal and sepal measurements) as X, and the labels (in this case, the species) as y.

In [2]:
from sklearn.model_selection import train_test_split

# split up data 80/20
X_train, X_test, y_train, y_test = train_test_split(iris.data, iris.target, test_size=0.2, random_state=0)
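
As a quick sanity check on the split described above, the sketch below repeats the same call and prints the resulting array shapes. It assumes only scikit-learn is installed and introduces no names beyond those already used in this notebook.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split

iris = load_iris()

# Same 80/20 split as in the notebook cell above
X_train, X_test, y_train, y_test = train_test_split(
    iris.data, iris.target, test_size=0.2, random_state=0)

# 150 samples in total: 120 go to training, 30 are held out for testing
print(X_train.shape, X_test.shape)
print(y_train.shape, y_test.shape)
```

Fixing random_state makes the split reproducible, so the same 30 flowers are held out on every run.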

Load up XGBoost, and convert our data into the DMatrix format it expects: one for the training data, and one for the test data.

In [3]:
import xgboost as xgb

# training data consists of training features and labels
train = xgb.DMatrix(X_train, label=y_train)

# test data consists of test features and labels
test = xgb.DMatrix(X_test, label=y_test)

Define our hyperparameters. We choose the multi:softmax objective since this is a multiclass classification problem; the other parameters should ideally be tuned through experimentation.

In [4]:
# define values for hyperparameters
param = {
    'max_depth': 4,
    'eta': 0.3,
    'objective': 'multi:softmax',
    'num_class': 3} 
epochs = 10 

Train our model using these parameters as a first guess.

In [5]:
# train model with the parameters, the training DMatrix, and the number of boosting rounds
model = xgb.train(param, train, epochs)



Use the trained model to predict classifications for the data we set aside for testing. Each classification number we get back corresponds to a specific species of Iris.

In [6]:
# predict category each flower is in
predictions = model.predict(test)

In [7]:
print(predictions)

[2. 1. 0. 2. 0. 2. 0. 1. 1. 1. 2. 1. 1. 1. 1. 0. 1. 1. 0. 0. 2. 1. 0. 0.
 2. 0. 0. 1. 1. 0.]
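
To read these numbers as species, each predicted class index can be looked up in the list of target names printed earlier. The sketch below is self-contained for illustration: it hard-codes the three names from that earlier output and only the first six predictions from the output above.

```python
import numpy as np

# Class names in label order, as printed by iris.target_names earlier
target_names = np.array(['setosa', 'versicolor', 'virginica'])

# First six predictions from the output above; predict() returns floats
predictions = np.array([2., 1., 0., 2., 0., 2.])

# Cast to int so the values can be used as array indices
predicted_species = target_names[predictions.astype(int)]
print(predicted_species.tolist())
```

In the notebook itself, the same lookup would be iris.target_names[predictions.astype(int)].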


Measure the accuracy on the test data...

In [8]:
from sklearn.metrics import accuracy_score

# compare the known correct labels in y_test with the predictions
accuracy_score(y_test, predictions)

1.0

The model classifies all 30 test flowers correctly, and that's with hyperparameters we simply guessed!
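
For reference, the score returned by accuracy_score is the fraction of predictions that match the true labels. The sketch below reproduces that calculation with NumPy; the arrays y_true and y_pred are hypothetical values chosen for illustration, not the notebook's data.

```python
import numpy as np

# Hypothetical labels for illustration only; not the notebook's y_test
y_true = np.array([2, 1, 0, 2, 0])
# Floats, in the same form model.predict returns them
y_pred = np.array([2., 1., 0., 1., 0.])

# Fraction of positions where the prediction equals the true label
accuracy = float(np.mean(y_true == y_pred))
print(accuracy)
```

Here four of the five hypothetical predictions match, so the score is 0.8; a score of 1.0 means every prediction matched.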