Using XGBoost is easy. Maybe too easy, considering it's widely regarded as one of the strongest ML algorithms for tabular data right now.

To install it, just:

pip install xgboost

Let's experiment using the Iris data set. This data set includes the width and length of the petals and sepals of many Iris flowers, and the specific species of Iris each flower belongs to. Our challenge is to predict the species of a flower sample based only on the sizes of its petals and sepals. We'll revisit this data set later when we talk about principal component analysis too.

In [1]:
from sklearn.datasets import load_iris

iris = load_iris()

numSamples, numFeatures = iris.data.shape
print(numSamples)
print(numFeatures)
print(list(iris.target_names))



150
4
['setosa', 'versicolor', 'virginica']


Let's divide our data into 20% reserved for testing our model, and the remaining 80% to train it with. By withholding our test data, we can make sure we're evaluating the model on new flowers it hasn't seen before. Typically we refer to our features (in this case, the four petal and sepal measurements) as X, and the labels (in this case, the species) as y.

In [2]:
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(iris.data, iris.target, test_size=0.2, random_state=0)
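
With only 150 flowers, it's worth checking how the species are spread across the two parts. As a small sketch (the `_s` variable names are made up here so they don't overwrite the split above), passing `stratify=` to `train_test_split` keeps each species at the same share in both parts:

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split

iris = load_iris()

# Same 80/20 split as above, plus stratify= so every species keeps its share.
X_train_s, X_test_s, y_train_s, y_test_s = train_test_split(
    iris.data, iris.target, test_size=0.2, random_state=0, stratify=iris.target
)

print(X_train_s.shape, X_test_s.shape)  # 120 training rows and 30 test rows
print(np.bincount(y_test_s))            # number of test flowers per species
```

The rest of this notebook keeps using the unstratified split; this is just one way to look at what the split contains.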

Now we'll load up XGBoost and convert our data into the DMatrix format it expects. We need one DMatrix for the training data and one for the test data.

In [6]:
import xgboost as xgb

train = xgb.DMatrix(X_train, label=y_train)
test = xgb.DMatrix(X_test, label=y_test)

Now we'll define our hyperparameters. We're choosing softmax since this is a multi-class classification problem (three species, hence a num_class of 3), but the other parameters should ideally be tuned through experimentation.

In [7]:
param = {
    'max_depth': 4,
    'eta': 0.3,
    'objective': 'multi:softmax',
    'num_class': 3} 
epochs = 10 
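
To get a feel for what the softmax objective does, here is a minimal sketch in plain NumPy. The raw scores are made up for illustration and this is not XGBoost's internal code: the idea is that one raw score per class is turned into probabilities, and `multi:softmax` reports the index of the largest one.

```python
import numpy as np

def softmax(scores):
    """Convert raw per-class scores into probabilities that sum to 1."""
    shifted = scores - np.max(scores)  # subtract the max for numerical stability
    exp_scores = np.exp(shifted)
    return exp_scores / exp_scores.sum()

# Made-up raw scores for one flower, one per class (setosa, versicolor, virginica).
raw_scores = np.array([0.2, 2.1, -0.5])

probabilities = softmax(raw_scores)
predicted_class = int(np.argmax(probabilities))

print(probabilities)    # three probabilities that sum to 1
print(predicted_class)  # 1, the index of the largest score
```

If you want the probabilities themselves rather than just the winning class, XGBoost also offers the `multi:softprob` objective.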

Let's go ahead and train our model using these parameters as a first guess.

In [8]:
model = xgb.train(param, train, epochs)

Now we'll use the trained model to predict classifications for the data we set aside for testing. Each classification number we get back corresponds to a specific species of Iris.

In [9]:
predictions = model.predict(test)

In [10]:
print(predictions)

[2. 1. 0. 2. 0. 2. 0. 1. 1. 1. 2. 1. 1. 1. 1. 0. 1. 1. 0. 0. 2. 1. 0. 0.
 2. 0. 0. 1. 1. 0.]
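
Those numbers are easier to read as species names. A small sketch, using the names printed by `iris.target_names` earlier and the first five predictions from the output above (`sample_predictions` is just an illustrative name):

```python
import numpy as np

# Species names in label order, as printed by iris.target_names earlier.
target_names = np.array(['setosa', 'versicolor', 'virginica'])

# First five predictions from the output above; XGBoost returns them as floats.
sample_predictions = np.array([2., 1., 0., 2., 0.])

# Cast to int so the predictions can be used as indices into the names.
species = target_names[sample_predictions.astype(int)].tolist()
print(species)  # ['virginica', 'versicolor', 'setosa', 'virginica', 'setosa']
```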


Let's measure the accuracy on the test data...

In [11]:
from sklearn.metrics import accuracy_score

accuracy_score(y_test, predictions)

1.0

Holy crow! It's perfect, and that's with hyperparameters we just guessed at!

Normally I'd have you experiment to find better hyperparameters as an activity, but you can't improve on those results. Instead, see what it takes to make the results worse! How few epochs (iterations) can you get away with? How low can you set the max_depth? Basically, try to make the model as simple as possible while keeping that perfect accuracy.

In [24]:
train = xgb.DMatrix(X_train, label=y_train)
test = xgb.DMatrix(X_test, label=y_test)

In [65]:
# param = {
#     'max_depth': 2,
#     'eta': 0.3,
#     'objective': 'multi:softmax',
#     'num_class': 3} 
# epochs = 2

param = {
    'max_depth': 3,
    'eta': 0.3,
    'objective': 'multi:softmax',
    'num_class': 3} 
epochs = 1

In [66]:
model = xgb.train(param, train, epochs)

In [67]:
predictions = model.predict(test)

In [68]:
accuracy_score(y_test, predictions)

1.0

### Concepts applied in this notebook:

### XGBoost Model

XGBoost stands for eXtreme Gradient Boosting and is an implementation of gradient boosted decision trees designed for speed and performance. It is a highly efficient and scalable version of gradient boosting, and it has gained popularity in machine learning competitions for its performance in classification, regression, and ranking problems. XGBoost works by sequentially adding trees to an ensemble, each one fit to correct the errors of the ensemble built so far. Unlike a plain gradient boosting implementation, XGBoost also adds a regularization term to the loss function to control over-fitting, which often improves how well the model generalizes.
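
The "each one correcting its predecessor" idea can be shown in a few lines. The following is a bare-bones sketch of plain gradient boosting with one-split trees (stumps) on made-up one-dimensional data. It is not XGBoost's actual algorithm: there is no regularization and no second-order information, and all of the function names are hypothetical.

```python
import numpy as np

def fit_stump(x, residuals):
    """Find the single threshold on x that best fits the residuals (squared error)."""
    best = None
    for threshold in np.unique(x)[:-1]:  # skip the max so the right side is never empty
        left = residuals[x <= threshold]
        right = residuals[x > threshold]
        error = ((left - left.mean()) ** 2).sum() + ((right - right.mean()) ** 2).sum()
        if best is None or error < best[0]:
            best = (error, threshold, left.mean(), right.mean())
    _, threshold, left_value, right_value = best
    return threshold, left_value, right_value

def predict_stump(stump, x):
    threshold, left_value, right_value = stump
    return np.where(x <= threshold, left_value, right_value)

# Made-up regression data: a noisy sine wave.
rng = np.random.default_rng(0)
x = np.sort(rng.uniform(0, 6, 200))
y = np.sin(x) + rng.normal(0, 0.1, 200)

learning_rate = 0.3                     # plays the role of eta
prediction = np.full_like(y, y.mean())  # start from a constant guess
errors = []
for _ in range(20):
    residuals = y - prediction          # what the ensemble still gets wrong
    stump = fit_stump(x, residuals)     # the new learner targets those errors
    prediction += learning_rate * predict_stump(stump, x)
    errors.append(np.mean((y - prediction) ** 2))

print(errors[0], errors[-1])            # training error after round 1 and round 20
```

Each round fits a new stump to whatever error is left over, so the training error shrinks as rounds are added.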

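The regularization term can also be written down. As a sketch, following the notation used in the XGBoost documentation's introduction to boosted trees (none of these symbols are defined elsewhere in this notebook), the objective for an ensemble of $K$ trees is:

```latex
\text{obj} = \sum_{i=1}^{n} l\left(y_i, \hat{y}_i\right) + \sum_{k=1}^{K} \Omega\left(f_k\right),
\qquad
\Omega(f) = \gamma T + \frac{1}{2} \lambda \sum_{j=1}^{T} w_j^{2}
```

Here $l$ is the training loss, $T$ is the number of leaves in tree $f$, and $w_j$ is the score of leaf $j$. The constants $\gamma$ and $\lambda$ correspond to the `gamma` and `lambda` parameters, so larger values penalize trees with many leaves or large leaf scores.
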
### Parameters of XGBoost

XGBoost has a wide array of parameters that can be divided into several categories:

- **General Parameters:** These define the overall functionality of XGBoost.
  - `booster`: Type of model to run at each iteration. It can be `gbtree` (tree-based models), `gblinear` (linear models), or `dart` (Dropouts meet Multiple Additive Regression Trees).
  - `nthread`: Number of parallel threads used to run XGBoost.
  - `verbosity`: Level of verbosity of printing messages.

- **Booster Parameters:** These parameters depend on which booster you have chosen.
  - `eta`: Learning rate. It shrinks the contribution of each new tree, which makes boosting more conservative and less prone to over-fitting.
  - `min_child_weight`: Minimum sum of instance weight (hessian) needed in a child.
  - `max_depth`: Maximum depth of a tree. Increasing this value makes the model more complex and more likely to overfit.
  - `subsample`: Fraction of samples to be used for fitting the individual base learners.
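  - `colsample_bytree`: Fraction of features (columns) randomly sampled when building each tree. Like `subsample`, this adds randomness that can reduce over-fitting and speed up training; it does not reduce the dimensionality of the data itself.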

- **Learning Task Parameters:** These parameters specify the learning task and the corresponding learning objective.
  - `objective`: Specifies the learning task and the corresponding learning objective, such as `reg:squarederror` for regression, `binary:logistic` for binary classification, or `multi:softmax` for multi-class classification (the one used in this notebook).
  - `eval_metric`: Evaluation metrics for validation data, allowing users to monitor performance during training.

### Features of XGBoost

- **Regularized Boosting:** XGBoost introduces regularization terms in the cost function to control over-fitting, which improves model performance and robustness.

- **Parallel Processing:** Boosting itself is sequential, since each tree depends on the ones before it, so XGBoost does not build separate trees in parallel. Instead, it parallelizes the work inside each tree, such as the search for the best split across features, over the available CPU cores (see `nthread`). This speeds up training significantly.

- **Incremental Learning:** XGBoost can continue training from an already trained model (via the `xgb_model` argument of `xgb.train`) rather than starting from scratch, which helps in applications where data arrives in chunks over time.

- **Plug In Your Own Optimization Objectives and Evaluation Criteria:** XGBoost allows users to define custom optimization objectives and evaluation criteria, adding a layer of flexibility that can be tailored to very specific and complex industrial problems.

- **Tree Pruning:** Some gradient boosting implementations stop splitting a node as soon as the best available split fails to reduce the loss. XGBoost instead grows each tree up to `max_depth` and then prunes backward, removing splits whose loss reduction falls below the `gamma` (`min_split_loss`) threshold. This lets it keep a weak split that enables a strong split further down the tree.

XGBoost has emerged as a highly effective and versatile machine learning algorithm capable of tackling a wide range of data science challenges. Its performance, scalability, and wide array of features make it a popular choice among data scientists and machine learning practitioners.
