# Gradient boosting | Gradient Boosting Machine | GBM

**Gradient boosting:** an ensemble algorithm that fits a sequence of decision trees, each one trained to reduce the error left by the previous trees by following the gradient of a loss function. Several implementations of gradient boosting are available in Python.
We will train a classifier with each of them and compare their speed and accuracy.
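
To make the idea of "following the gradient of the error" concrete, here is a minimal sketch that is not part of the benchmark. It assumes a 1-D regression problem with squared-error loss, where the negative gradient is simply the residual, and uses one-split trees (stumps). The helper names `fit_stump` and `predict_stump` are hypothetical, and the libraries compared below are far more elaborate than this.

```python
import numpy as np


def fit_stump(x, residual):
    """Return the single threshold split on x that best fits residual (least squares)."""
    best = None
    # Skip the largest value so the right-hand side of a split is never empty.
    for threshold in np.unique(x)[:-1]:
        left = residual[x <= threshold]
        right = residual[x > threshold]
        sse = ((left - left.mean()) ** 2).sum() + ((right - right.mean()) ** 2).sum()
        if best is None or sse < best[0]:
            best = (sse, threshold, left.mean(), right.mean())
    _, threshold, left_value, right_value = best
    return threshold, left_value, right_value


def predict_stump(stump, x):
    """Predict the left or right leaf value depending on the threshold."""
    threshold, left_value, right_value = stump
    return np.where(x <= threshold, left_value, right_value)


# Toy data: a noisy sine curve (an assumption made only for this illustration).
rng = np.random.default_rng(0)
x = rng.uniform(-3, 3, size=200)
y = np.sin(x) + rng.normal(scale=0.1, size=200)

learning_rate = 0.1
prediction = np.full_like(y, y.mean())  # start from a constant model
losses = []
for _ in range(50):
    residual = y - prediction  # negative gradient of 0.5 * (y - F)^2 with respect to F
    stump = fit_stump(x, residual)
    prediction += learning_rate * predict_stump(stump, x)  # small step along the fitted gradient
    losses.append(np.mean((y - prediction) ** 2))

print(f"MSE after 1 round:   {losses[0]:.4f}")
print(f"MSE after 50 rounds: {losses[-1]:.4f}")
```

Each round fits a new weak learner to what the current ensemble still gets wrong, so the training error shrinks step by step. For classification the same loop applies, with the gradient of a classification loss such as log loss in place of the residual.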

In [1]:
# Import the required libraries

# Basics
import numpy as np
import pandas as pd
from time import time
from IPython.core.debugger import set_trace
from sklearn.datasets import make_classification

# Scikit-Learn
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import cross_val_score
from sklearn.model_selection import RepeatedStratifiedKFold

In [6]:
# Generate an example dataset
X, y = make_classification(
    n_samples=100000, 
    n_features=20, 
    n_informative=15, 
    n_redundant=5, 
    random_state=0)

print(X.shape)
print(y.shape)

(100000, 20)
(100000,)


In [4]:
# Create empty dictionaries to store the results of each model
accuracy = {}
speed = {}

### 1) Gradient Boosting implementation with Scikit-Learn

In [7]:
model = GradientBoostingClassifier()

start = time()
cv = RepeatedStratifiedKFold(n_splits=5, n_repeats=2, random_state=0)
score = cross_val_score(model, X, y, scoring="accuracy", cv=cv, n_jobs=-1)

speed["GradientBoosting"] = np.round(time() - start, 3)
accuracy["GradientBoosting"] = np.mean(score).round(3)

print(
    f"Mean Accuracy: {accuracy['GradientBoosting']}\nStd: {np.std(score):.3f}\nRun time: {speed['GradientBoosting']}s"
)

Mean Accuracy: 0.894
Std: 0.003
Run time: 82.519s
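
The timing and scoring pattern in the cell above is repeated for every library in this notebook. As a sketch only, it could be wrapped in a helper; `benchmark` and its parameters are hypothetical names, not part of the original notebook. Five splits repeated twice yield ten accuracy scores per model. The demo below uses a much smaller dataset and `n_jobs=1` so that it runs quickly.

```python
from time import time

import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import RepeatedStratifiedKFold, cross_val_score


def benchmark(name, model, X, y, accuracy, speed, n_jobs=-1):
    """Cross-validate `model` and store its mean accuracy and wall-clock time under `name`."""
    cv = RepeatedStratifiedKFold(n_splits=5, n_repeats=2, random_state=0)
    start = time()
    scores = cross_val_score(model, X, y, scoring="accuracy", cv=cv, n_jobs=n_jobs)
    speed[name] = round(time() - start, 3)
    accuracy[name] = round(float(np.mean(scores)), 3)
    return scores


# Small dataset so the sketch finishes in seconds; the notebook uses 100000 samples.
X_small, y_small = make_classification(
    n_samples=500, n_features=20, n_informative=15, n_redundant=5, random_state=0
)
demo_accuracy, demo_speed = {}, {}
scores = benchmark(
    "GradientBoosting",
    GradientBoostingClassifier(n_estimators=20),
    X_small,
    y_small,
    demo_accuracy,
    demo_speed,
    n_jobs=1,
)
print(f"Folds scored: {len(scores)}")
print(f"Mean accuracy: {demo_accuracy['GradientBoosting']}")
print(f"Run time: {demo_speed['GradientBoosting']}s")
```

With such a helper, each later section would reduce to one call, for example `benchmark("XGB", XGBClassifier(), X, y, accuracy, speed)`.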



##### Alternative: HistGradientBoostingClassifier

In [8]:
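# Only needed on scikit-learn < 1.0, where HistGradientBoostingClassifier was experimental;
# newer releases expose the estimator directly and this import can be dropped.
from sklearn.experimental import enable_hist_gradient_boosting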
from sklearn.ensemble import HistGradientBoostingClassifier


In [9]:
model = HistGradientBoostingClassifier()

start = time()
cv = RepeatedStratifiedKFold(n_splits=5, n_repeats=2, random_state=0)
score = cross_val_score(model, X, y, scoring="accuracy", cv=cv, n_jobs=-1)

speed["HistGradientBoosting"] = np.round(time() - start, 3)
accuracy["HistGradientBoosting"] = np.mean(score).round(3)

print(
    f"Mean Accuracy: {accuracy['HistGradientBoosting']}\nStd: {np.std(score):.3f}\nRun time: {speed['HistGradientBoosting']}s"
)

Mean Accuracy: 0.964
Std: 0.001
Run time: 4.498s



#### XGBoost

In [10]:
#!pipenv install xgboost --skip-lock
from xgboost import XGBClassifier


In [12]:
model = XGBClassifier()

start = time()
cv = RepeatedStratifiedKFold(n_splits=5, n_repeats=2, random_state=0)
score = cross_val_score(model, X, y, scoring="accuracy", cv=cv, n_jobs=-1)

speed["XGB"] = np.round(time() - start, 3)
accuracy["XGB"] = np.mean(score).round(3)

print(
    f"Mean Accuracy: {accuracy['XGB']}\nStd: {np.std(score):.3f}\nRun time: {speed['XGB']}s"
)

Mean Accuracy: 0.976
Std: 0.001
Run time: 39.936s



#### LightGBM

In [13]:
#!pipenv install lightgbm --skip-lock
from lightgbm import LGBMClassifier


In [14]:
model = LGBMClassifier()

start = time()
cv = RepeatedStratifiedKFold(n_splits=5, n_repeats=2, random_state=0)
score = cross_val_score(model, X, y, scoring="accuracy", cv=cv, n_jobs=-1)

speed["LGBM"] = np.round(time() - start, 3)
accuracy["LGBM"] = np.mean(score).round(3)

print(
    f"Mean Accuracy: {accuracy['LGBM']}\nStd: {np.std(score):.3f}\nRun time: {speed['LGBM']}s"
)

Mean Accuracy: 0.963
Std: 0.001
Run time: 3.276s



#### CatBoost

In [15]:
#!pipenv install catboost --skip-lock
from catboost import CatBoostClassifier


In [16]:
model = CatBoostClassifier()

start = time()
cv = RepeatedStratifiedKFold(n_splits=5, n_repeats=2, random_state=0)
score = cross_val_score(model, X, y, scoring="accuracy", cv=cv, n_jobs=-1)

speed["CatBoost"] = np.round(time() - start, 3)
accuracy["CatBoost"] = np.mean(score).round(3)

print(
    f"Mean Accuracy: {accuracy['CatBoost']}\nStd: {np.std(score):.3f}\nRun time: {speed['CatBoost']}s"
)

Mean Accuracy: 0.983
Std: 0.002
Run time: 232.924s



In [17]:
print("Accuracy:")
{k: v for k, v in sorted(accuracy.items(), key=lambda i: i[1], reverse=True)}

Accuracy:


{'CatBoost': 0.983,
 'XGB': 0.976,
 'HistGradientBoosting': 0.964,
 'LGBM': 0.963,
 'GradientBoosting': 0.894}


In [18]:
print("Speed:")
{k: v for k, v in sorted(speed.items(), key=lambda i: i[1], reverse=False)}

Speed:


{'LGBM': 3.276,
 'HistGradientBoosting': 4.498,
 'XGB': 39.936,
 'GradientBoosting': 82.519,
 'CatBoost': 232.924}
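
The two rankings above are easier to compare side by side. As a small sketch, the dictionaries can be combined into a single pandas table; the values below are copied from the outputs printed above, and the column names `accuracy` and `seconds` are our own choice.

```python
import pandas as pd

# Values copied from the notebook outputs above.
accuracy = {
    "CatBoost": 0.983,
    "XGB": 0.976,
    "HistGradientBoosting": 0.964,
    "LGBM": 0.963,
    "GradientBoosting": 0.894,
}
speed = {
    "LGBM": 3.276,
    "HistGradientBoosting": 4.498,
    "XGB": 39.936,
    "GradientBoosting": 82.519,
    "CatBoost": 232.924,
}

# The model names become the index; rows are ordered from fastest to slowest.
summary = pd.DataFrame({"accuracy": accuracy, "seconds": speed}).sort_values("seconds")
print(summary)
```

Read this way, on this dataset and with default settings, CatBoost is the most accurate but also the slowest, while LightGBM and HistGradientBoosting are the fastest at a cost of about two accuracy points.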
