# An example of creating a gradient boosted classifier for the Iris data

This Python notebook shows how to build a classifier for the Iris data using xgboost, pandas, urllib, and scikit-learn (sklearn).

 * Load the data (in this case from the UCI ML repository)
 * Convert the data into a pandas DataFrame
 * Split the data into X, the predictor variables (petal and sepal length and width), and Y, the target (Iris species name as a string)
 * Use LabelEncoder() to encode the species names as integers
 * Hold back some of the data (about 1/3) so the model can be tested on data it was not trained on, which is where overfitting would show up. train_test_split() sets up the training data and the test data.
 * Create the model and train it (fit) on the training data set
 * Test the model (predict) to see how it performs on the unseen test data
 * Compute the accuracy score: the fraction of unseen samples assigned to the correct one of the three Iris species
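
The encoding and hold-out steps listed above can be tried in isolation on a tiny example. This is a minimal sketch rather than the notebook's own code: the six-row `species` array and the placeholder `features` matrix are made up for illustration, while `LabelEncoder` and `train_test_split` are the same scikit-learn helpers used in the cell below.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import LabelEncoder

# Hypothetical six-row sample standing in for the species column.
species = np.array([
    "Iris-setosa", "Iris-virginica", "Iris-versicolor",
    "Iris-setosa", "Iris-versicolor", "Iris-virginica",
])
# Placeholder measurements (two made-up columns), one row per sample.
features = np.arange(12, dtype=float).reshape(6, 2)

# LabelEncoder assigns integers to the sorted class names.
encoder = LabelEncoder()
encoded = encoder.fit_transform(species)

# inverse_transform maps the integers back to the original strings.
decoded = encoder.inverse_transform(encoded)

# Hold back roughly one third of the rows for testing.
X_train, X_test, y_train, y_test = train_test_split(
    features, encoded, test_size=0.33, random_state=7
)

print(list(encoder.classes_))
print(list(encoded))
print(len(X_train), len(X_test))
```

Because the classes are sorted alphabetically, setosa, versicolor, and virginica map to 0, 1, and 2. Keeping the fitted encoder around is what lets integer predictions be turned back into species names later.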

In [1]:
import urllib.request
import pandas as pd
import xgboost
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import LabelEncoder
from sklearn.metrics import accuracy_score
 
# URL for the Iris dataset (UCI Machine Learning Repository)
url = "http://archive.ics.uci.edu/ml/machine-learning-databases/iris/iris.data"

# download the file
raw_data = urllib.request.urlopen(url)

# load the CSV file into a pandas DataFrame, then take the underlying numpy array
data = pd.read_csv(raw_data, header=None)
dataset = data.values

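# split data into X (petal and sepal lengths/widths) and y (species name as string)
# the mixed-type array has object dtype, so cast the numeric columns to float
X = dataset[:, 0:4].astype(float)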
Y = dataset[:, 4]

# encode string class (species names) values as integers
label_encoder = LabelEncoder()
label_encoder = label_encoder.fit(Y)
label_encoded_y = label_encoder.transform(Y)

seed = 7
test_size = 0.33
X_train, X_test, y_train, y_test = train_test_split(X, label_encoded_y, test_size=test_size, random_state=seed)

# fit model on training data
model = xgboost.XGBClassifier()
model.fit(X_train, y_train)
# print the model and its various parameters
print(model)

# make the predictions on the test data
y_pred = model.predict(X_test)
# predict() already returns integer class labels, so rounding leaves them unchanged
predictions = [round(value) for value in y_pred]

# determine the accuracy of the classifier
accuracy = accuracy_score(y_test, predictions)

print("Accuracy: {0:.2f}%".format(accuracy * 100.0))


XGBClassifier(base_score=0.5, booster='gbtree', colsample_bylevel=1,
       colsample_bytree=1, gamma=0, learning_rate=0.1, max_delta_step=0,
       max_depth=3, min_child_weight=1, missing=None, n_estimators=100,
       n_jobs=1, nthread=None, objective='multi:softprob', random_state=0,
       reg_alpha=0, reg_lambda=1, scale_pos_weight=1, seed=None,
       silent=True, subsample=1)
Accuracy: 92.00%
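
A single accuracy figure does not show which species were confused with each other. The sketch below illustrates how `accuracy_score` relates to a confusion matrix. It uses made-up `y_true` and `y_pred` arrays (not the notebook's actual predictions) and assumes `confusion_matrix` from `sklearn.metrics`, which the cell above does not import.

```python
from sklearn.metrics import accuracy_score, confusion_matrix

# Hypothetical encoded labels: 0 = setosa, 1 = versicolor, 2 = virginica.
y_true = [0, 0, 1, 1, 2, 2, 1, 2]
y_pred = [0, 0, 1, 2, 2, 2, 1, 1]

# Accuracy is the fraction of predictions that match the true label.
accuracy = accuracy_score(y_true, y_pred)

# Rows are true classes, columns are predicted classes.
matrix = confusion_matrix(y_true, y_pred)

# The diagonal counts correct predictions, so it sums to accuracy * n.
correct = int(matrix.trace())

print("Accuracy: {0:.2f}%".format(accuracy * 100.0))
print(matrix)
print(correct, "of", len(y_true), "correct")
```

In this made-up example every error is a mix-up between classes 1 and 2, which a lone accuracy figure would hide. Applied to the notebook's own `y_test` and `predictions`, the same call would show where the misclassified test rows fall.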
