<a href="https://colab.research.google.com/github/axel-sirota/decision-trees-and-random-forests/blob/main/3_Gradient_Boosting.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Gradient Boosting

Today I’m going to walk you through training a simple classification model. Although scikit-learn and other packages contain simpler models with few parameters (SVM comes to mind), gradient boosted trees have proven to be very powerful classifiers across a wide variety of datasets and problems.

XGBoost is one of the most popular libraries for classification and regression that does not resort to deep learning techniques such as neural networks trained in Keras or PyTorch. If you started your journey into data science by comparing different types of regressions or Naive Bayes classifiers, XGBoost is a wonderful tool for producing models that are often more accurate and robust than even the well-performing RandomForestClassifier found in scikit-learn.


## Starting with data:


In [None]:
!pip install xgboost

In [None]:
import pandas as pd
import numpy as np
import seaborn as sns
import time

from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score, average_precision_score, f1_score
from sklearn.preprocessing import LabelEncoder

import xgboost as xgb

import warnings  # "do not disturb" mode
warnings.filterwarnings("ignore")

If you’re using virtualenv or another Python environment management system, feel free to run `pip install xgboost` from your shell instead of the notebook cell above, or use your own preferred method of installation.

## Setting a baseline:

We’re importing a RandomForestClassifier as a baseline to compare against, and using datasets and metrics functions from scikit-learn. Let’s go ahead and import the sample dataset:

In [None]:
bc_dataset = load_breast_cancer()
bc_dataset

In [None]:
X = pd.DataFrame(bc_dataset.data)
X.columns = bc_dataset.feature_names
y = bc_dataset.target


In [None]:
X.head()

In [None]:
y

In [None]:
pd.Series(y).value_counts()

In [None]:

# Split data into two parts: 80% train, 20% test
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

The ratio of negative to positive samples (computed below as `weights`) is around 0.6, so this dataset isn’t terribly imbalanced. How convenient! Still, we can pass this value to the model as `scale_pos_weight` so that it accounts for the imbalance, which we’ll do below.
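
The ratio discussed above is simply the number of negative samples divided by the number of positive samples. As a minimal sketch, it could also be computed from the training labels alone, so that the test split plays no part in configuring the model. The name `train_ratio` is hypothetical and not part of the original notebook.

```python
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split

# Same data and split settings as in this notebook
X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

# Negative-to-positive ratio from the training labels only (assumption:
# we prefer not to look at the test split when configuring the model)
train_ratio = np.sum(y_train == 0) / np.sum(y_train == 1)
print("negative/positive ratio on the training split: {:.3f}".format(train_ratio))
```

On a dataset this balanced, the difference from the full-data ratio should be small; the habit matters more when the classes are heavily skewed.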

### An aside, looking at a Random Forest Classifier to compare to a simpler model:

Just so that we reassure ourselves that a simpler and less computationally intensive model won't serve us even better, let's try out a Random Forest Classifier from scikit-learn:

In [None]:
clf = RandomForestClassifier(max_depth=2, random_state=42)
clf.fit(X_train, y_train)
accuracy_score(y_test, clf.predict(X_test))
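
A single train/test split can give a noisy estimate on a dataset this small (569 rows). As a hedged sketch, the same baseline could also be scored with cross-validation. The choice of five folds and the names `baseline` and `cv_scores` are assumptions made for illustration.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

X, y = load_breast_cancer(return_X_y=True)

# Same baseline configuration as the cell above
baseline = RandomForestClassifier(max_depth=2, random_state=42)

# For classifiers, an integer `cv` uses stratified folds by default
cv_scores = cross_val_score(baseline, X, y, cv=5, scoring="accuracy")
print("accuracy per fold:", cv_scores.round(3))
print("mean accuracy: {:.3f}".format(cv_scores.mean()))
```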

In [None]:
weights = (y == 0).sum() / (1.0 * (y == 1).sum())
weights

In [None]:
model = xgb.XGBClassifier(
    scale_pos_weight=weights,    # ratio of negative to positive samples
    n_jobs=4,
    objective='binary:logistic',
    eval_metric='logloss',       # set here: recent XGBoost releases no longer accept it in fit()
)

# Time the training step
start = time.time()
model.fit(X_train, y_train)
fitting_time = time.time() - start

# Time the prediction step
start = time.time()
prediction = model.predict(X_test)
inference_time = time.time() - start

# AUPRC is computed from the predicted probability of the positive class
probabilities = model.predict_proba(X_test)
auprc = average_precision_score(y_test, probabilities[:, 1])
f1 = f1_score(y_test, prediction)
acc = accuracy_score(y_test, prediction)

print('AUPRC = {}'.format(auprc))
print('F1 Score = {}'.format(f1))
print('Fitting Time = {}'.format(fitting_time))
print('Inference Time = {}'.format(inference_time))
print('Accuracy = {}'.format(acc))

As you can see, we now have multiple metrics that we can track: AUPRC, F1, accuracy, fitting time, and inference time. Depending on your use case, any of these values may make or break your model in production.
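
To compare models on equal terms, the same metrics can be collected by one small helper. The sketch below is one possible way to do that, not the notebook's own method: the function name `evaluate` and the dictionary keys are hypothetical, and it is demonstrated on the random forest baseline so that it only depends on scikit-learn.

```python
import time

from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score, average_precision_score, f1_score
from sklearn.model_selection import train_test_split


def evaluate(model, X_train, X_test, y_train, y_test):
    """Fit a binary classifier and return the metrics tracked in this notebook."""
    start = time.perf_counter()
    model.fit(X_train, y_train)
    fitting_time = time.perf_counter() - start

    start = time.perf_counter()
    prediction = model.predict(X_test)
    inference_time = time.perf_counter() - start

    # AUPRC is computed from the predicted probability of the positive class
    positive_proba = model.predict_proba(X_test)[:, 1]
    return {
        "AUPRC": average_precision_score(y_test, positive_proba),
        "F1": f1_score(y_test, prediction),
        "Accuracy": accuracy_score(y_test, prediction),
        "Fitting time (s)": fitting_time,
        "Inference time (s)": inference_time,
    }


X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

results = evaluate(
    RandomForestClassifier(max_depth=2, random_state=42),
    X_train, X_test, y_train, y_test,
)
for name, value in results.items():
    print("{} = {:.4f}".format(name, value))
```

Any estimator that exposes `fit`, `predict`, and `predict_proba` can be passed in, including an `XGBClassifier`.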



### Background on XGBoost’s parameters:

* `min_child_weight` is the minimum sum of instance weights (hessian) required in a child node; a split that would leave a child below this threshold is not made. It is used to control over-fitting: higher values prevent a model from learning relations which might be highly specific to the particular sample selected for a tree.
* `max_depth` is the maximum depth of a tree. Lowering it limits over-fitting, because a higher depth allows the model to learn relations very specific to a particular sample.
* `gamma` specifies the minimum loss reduction required to make a split. The larger gamma is, the more conservative the algorithm will be.
* `subsample` defines the fraction of training instances to sample. For example, setting it to 0.5 means that XGBoost randomly samples half of the training data prior to growing trees, which helps prevent overfitting.
* `colsample_bytree` defines the fraction of features to be randomly sampled for each tree.
* `alpha` and `lambda` (`reg_alpha` and `reg_lambda` in the scikit-learn API) are the L1 and L2 regularization terms on weights. Both tend to prevent overfitting as they are increased. Additionally, `alpha` can be useful in cases of very high dimensionality, where it may help training run faster.
* `learning_rate` controls the weighting of new trees added to the model. Lowering this value helps prevent overfitting, but requires the model to add a larger number of trees.


# Now you do it
<img src="https://www.dropbox.com/scl/fi/s9kv1dytq4qzr8g19y3r0/hands_on.jpg?rlkey=yz8kq22sfdgc7lsgmm1e0fksr&raw=1" width="100" height="100" align="right"/>


You will predict the position of NBA players based on their statistics.

In [None]:
%%writefile get_data.sh
mkdir -p ./data
if [ ! -f data/NBA_players_2015.csv ]; then
  wget -O data/NBA_players_2015.csv "https://www.dropbox.com/scl/fi/0jgo8u5lbphvwwl2btq1w/NBA_players_2015.csv?rlkey=q86m5lp3ycndh5jbegvjewwzu&dl=0"
fi

In [None]:
!bash get_data.sh

### Use these parameters for testing

> random_state = 99

> test_size = 0.2

In [None]:
nba = pd.read_csv('./data/NBA_players_2015.csv')
nba.head()