# XGBoost

Notes: https://github.com/daviskregers/notes/blob/master/data-science/04-machine-learning-with-python/09-xgboost.md

In [2]:
!pip install xgboost

Collecting xgboost
  Downloading xgboost-1.3.0.post0-py3-none-manylinux2010_x86_64.whl (157.5 MB)
Installing collected packages: xgboost
Successfully installed xgboost-1.3.0.post0


We'll use the iris data set that ships with scikit-learn. It contains the length and width of the petals and sepals of 150 iris flowers, along with the species each flower belongs to.

In [3]:
from sklearn.datasets import load_iris

iris = load_iris()

numSamples, numFeatures = iris.data.shape
print(numSamples)
print(numFeatures)
print(list(iris.target_names))

150
4
['setosa', 'versicolor', 'virginica']


Split the data into training and test sets, holding out 20% of the samples for testing.

In [4]:
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(iris.data, iris.target, test_size=0.2, random_state=0)
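
To make the 80/20 split concrete, here is a minimal standard-library sketch of the shuffle-and-slice idea behind it. The helper `shuffle_split` is hypothetical and is not how scikit-learn implements `train_test_split`; it only illustrates why 150 samples with `test_size=0.2` yield 120 training and 30 test samples, and why fixing a seed makes the split reproducible.

```python
import random

def shuffle_split(n_samples, test_size=0.2, seed=0):
    """Return (train_indices, test_indices) after a seeded shuffle.

    Hypothetical helper for illustration only; not scikit-learn's algorithm.
    """
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)      # same seed -> same shuffle
    n_test = round(n_samples * test_size)     # 150 * 0.2 -> 30
    return indices[n_test:], indices[:n_test]

train_idx, test_idx = shuffle_split(150, test_size=0.2, seed=0)
print(len(train_idx), len(test_idx))  # 120 30
```

The real `train_test_split` call above plays the same role: `random_state=0` pins the shuffle so the notebook gives the same split on every run.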

Import XGBoost and convert the data into DMatrix objects, the data structure its native training API expects.

In [5]:
import xgboost as xgb

train = xgb.DMatrix(X_train, label=y_train)
test = xgb.DMatrix(X_test, label=y_test)

Now define the hyperparameters. We'll use the softmax objective since this is a multi-class classification problem. The other parameters should be tuned through experimentation.

In [6]:
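param = {
    'max_depth': 4,                # maximum depth of each tree
    'eta': 0.3,                    # learning rate (step-size shrinkage)
    'objective': 'multi:softmax',  # multi-class objective; predict() returns class labels
    'num_class': 3}                # number of iris species
epochs = 10                        # number of boosting rounds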
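
As a rough illustration of what the softmax objective does at prediction time, the sketch below turns one raw score per class into probabilities and picks the most likely class. The scores are made-up values, not output from the model in this notebook, and the `softmax` helper is a plain-Python stand-in rather than XGBoost's internal code.

```python
import math

def softmax(scores):
    """Convert raw per-class scores into probabilities that sum to 1."""
    top = max(scores)
    exps = [math.exp(s - top) for s in scores]  # subtract max for numerical stability
    total = sum(exps)
    return [e / total for e in exps]

raw_scores = [0.5, 2.0, -1.0]            # hypothetical scores for classes 0, 1, 2
probs = softmax(raw_scores)
predicted_class = probs.index(max(probs))  # index of the most probable class
print(predicted_class)  # 1
```

With `'objective': 'multi:softmax'`, `predict` returns only the winning class label. If per-class probabilities are needed, `'multi:softprob'` returns them instead.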

In [7]:
model = xgb.train(param, train, epochs)
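
One simple way to organize the hyperparameter experimentation is a small grid of candidate settings. The sketch below only generates the parameter dictionaries; the grid values and the `candidate_params` helper are hypothetical choices for illustration, not recommendations.

```python
from itertools import product

# Settings shared by every candidate (taken from the cell above).
base = {'objective': 'multi:softmax', 'num_class': 3}

# Hypothetical search grid; the values are illustrative assumptions.
grid = {'max_depth': [2, 4, 6], 'eta': [0.1, 0.3]}

def candidate_params(base, grid):
    """Yield one parameter dict per combination of grid values."""
    keys = list(grid)
    for values in product(*(grid[k] for k in keys)):
        yield {**base, **dict(zip(keys, values))}

candidates = list(candidate_params(base, grid))
print(len(candidates))  # 6
```

Each candidate could then be passed to `xgb.train` and scored on a validation split carved out of the training data, so that the test set stays untouched until the final evaluation.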



Now use the trained model to predict the classes of the test set.

In [8]:
predictions = model.predict(test)
print(predictions)

[2. 1. 0. 2. 0. 2. 0. 1. 1. 1. 2. 1. 1. 1. 1. 0. 1. 1. 0. 0. 2. 1. 0. 0.
 2. 0. 0. 1. 1. 0.]
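
The predictions come back as floats that encode class indices. A small sketch of mapping them to species names follows; it hard-codes the target names and the first five predictions printed above so that it runs on its own. In the notebook you would index `iris.target_names` with the full `predictions` array instead.

```python
# Class names in index order, as printed by list(iris.target_names) above.
target_names = ['setosa', 'versicolor', 'virginica']

# First five values from the prediction output above.
sample_predictions = [2.0, 1.0, 0.0, 2.0, 0.0]

# Cast each float label to int before using it as a list index.
species = [target_names[int(p)] for p in sample_predictions]
print(species)  # ['virginica', 'versicolor', 'setosa', 'virginica', 'setosa']
```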


Measure the accuracy on the 30 test samples.

In [10]:
from sklearn.metrics import accuracy_score
accuracy_score(y_test, predictions)

1.0
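
For reference, classification accuracy is just the fraction of predictions that match the true labels. The sketch below shows that computation in plain Python; the `accuracy` helper and the demo labels are hypothetical and are not taken from this notebook's data.

```python
def accuracy(y_true, y_pred):
    """Fraction of positions where the prediction equals the true label."""
    if len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must have the same length")
    matches = sum(1 for t, p in zip(y_true, y_pred) if t == p)
    return matches / len(y_true)

# Hypothetical labels: integer truth vs. float predictions, one mismatch.
y_true_demo = [0, 1, 2, 2, 1]
y_pred_demo = [0.0, 1.0, 2.0, 1.0, 1.0]
print(accuracy(y_true_demo, y_pred_demo))  # 0.8
```

A score of 1.0 above therefore means all 30 test flowers were classified correctly. With a test set this small, a single split can be optimistic, so cross-validation would likely give a more reliable estimate.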