# Part 2.13: Supervised Learning - Gradient Boosting Classifier

The Gradient Boosting Classifier is another powerful ensemble method that builds its model sequentially. Each new weak learner (typically a shallow decision tree) is fit to the errors that remain after combining all previous learners; more precisely, it is fit to the negative gradient of the loss with respect to the current ensemble's predictions. This stage-wise error correction often gives strong accuracy on tabular data with little tuning.
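To make the sequential, error-correcting idea concrete, here is a minimal from-scratch sketch. It is an illustration under simplifying assumptions, not scikit-learn's implementation: it uses squared-error regression (where the negative gradient is simply the residual) rather than the log loss used for classification, and the toy data and the names `X_demo`, `y_demo`, `n_stages`, and `mse_history` are made up for this example.

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

# Toy 1-D regression problem (hypothetical data, for illustration only)
rng = np.random.default_rng(42)
X_demo = rng.uniform(-3, 3, size=(200, 1))
y_demo = np.sin(X_demo[:, 0]) + rng.normal(scale=0.1, size=200)

learning_rate = 0.1
n_stages = 50

# Stage 0: start from a constant prediction (the mean of the targets)
prediction = np.full_like(y_demo, y_demo.mean())
mse_history = [np.mean((y_demo - prediction) ** 2)]

trees = []
for _ in range(n_stages):
    # For squared error, the negative gradient is the residual
    residuals = y_demo - prediction
    # Fit a small tree (weak learner) to what the ensemble still gets wrong
    tree = DecisionTreeRegressor(max_depth=2, random_state=0)
    tree.fit(X_demo, residuals)
    # Add a shrunken version of the tree's correction to the ensemble
    prediction += learning_rate * tree.predict(X_demo)
    trees.append(tree)
    mse_history.append(np.mean((y_demo - prediction) ** 2))

print(f"Training MSE before boosting: {mse_history[0]:.4f}")
print(f"Training MSE after {n_stages} stages: {mse_history[-1]:.4f}")
```

The training error shrinks stage by stage because every tree targets only the part of the signal the current ensemble has not yet captured. The `learning_rate` factor deliberately slows this down so that no single tree dominates.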

In [1]:
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split

# Generate a dataset
X, y = make_classification(n_samples=1000, n_features=10, n_informative=5, n_redundant=0, random_state=42)

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

### Training a Gradient Boosting Classifier

In [2]:
# n_estimators: The number of boosting stages (trees) to perform.
# learning_rate: Shrinks the contribution of each tree; smaller values usually need more trees.
# max_depth: Maximum depth of each individual tree, which limits how complex each weak learner can be.
gb_clf = GradientBoostingClassifier(n_estimators=100, learning_rate=0.1, max_depth=3, random_state=42)
gb_clf.fit(X_train, y_train)

print(f"Accuracy on test set: {gb_clf.score(X_test, y_test):.4f}")

Accuracy on test set: 0.9000
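To see how `n_estimators` affects the result, one option is to score the ensemble after each boosting stage with `staged_predict`, which yields predictions for 1, 2, ..., `n_estimators` trees. The sketch below rebuilds the same data and model so it runs on its own; `staged_acc` is a name introduced here for illustration, and the exact numbers depend on your scikit-learn version.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

# Same data and model settings as in the cells above
X, y = make_classification(n_samples=1000, n_features=10, n_informative=5,
                           n_redundant=0, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

gb_clf = GradientBoostingClassifier(n_estimators=100, learning_rate=0.1,
                                    max_depth=3, random_state=42)
gb_clf.fit(X_train, y_train)

# staged_predict yields test-set predictions after each boosting stage
staged_acc = [accuracy_score(y_test, y_pred) for y_pred in gb_clf.staged_predict(X_test)]

for n_trees in (1, 10, 50, 100):
    print(f"{n_trees:>3} trees: test accuracy = {staged_acc[n_trees - 1]:.4f}")
```

The last entry of `staged_acc` corresponds to the full ensemble, so it matches the value returned by `gb_clf.score(X_test, y_test)`.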


### Early Stopping
A useful technique to prevent overfitting is **early stopping**: hold out part of the training data as a validation set and stop adding trees once the validation score stops improving. In scikit-learn, setting `n_iter_no_change` turns this on; `validation_fraction` controls the size of the held-out set, and `tol` sets the minimum improvement that counts as progress.

In [3]:
gb_clf_early_stop = GradientBoostingClassifier(
    n_estimators=500, # Set a high number of estimators
    learning_rate=0.1, 
    max_depth=3, 
    random_state=42, 
    validation_fraction=0.1, # Fraction of training data to set aside as validation
    n_iter_no_change=10, # Stop if the validation score doesn't improve by at least tol for 10 iterations
    tol=0.01 # Minimum improvement in validation score that counts as progress
)
gb_clf_early_stop.fit(X_train, y_train)

print(f"Number of estimators used after early stopping: {gb_clf_early_stop.n_estimators_}")
print(f"Accuracy with early stopping: {gb_clf_early_stop.score(X_test, y_test):.4f}")

Number of estimators used after early stopping: 89
Accuracy with early stopping: 0.9150
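The same idea can be reproduced by hand, which may help clarify what is being monitored. The sketch below is a simplified analogue and not scikit-learn's internal logic: it trains all 500 trees, evaluates validation log loss after every stage with `staged_predict_proba`, and picks the stage with the lowest loss, without the `tol` and patience rule. The names `X_fit`, `X_val`, `val_loss`, and `best_n` are introduced here for illustration.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.metrics import log_loss
from sklearn.model_selection import train_test_split

# Same data as in the cells above
X, y = make_classification(n_samples=1000, n_features=10, n_informative=5,
                           n_redundant=0, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Hold out 10% of the training data as an explicit validation set
X_fit, X_val, y_fit, y_val = train_test_split(X_train, y_train, test_size=0.1, random_state=42)

gb_full = GradientBoostingClassifier(n_estimators=500, learning_rate=0.1,
                                     max_depth=3, random_state=42)
gb_full.fit(X_fit, y_fit)

# Validation log loss after each boosting stage
val_loss = [
    log_loss(y_val, proba, labels=gb_full.classes_)
    for proba in gb_full.staged_predict_proba(X_val)
]

# Number of trees at which the validation loss was lowest
best_n = int(np.argmin(val_loss)) + 1
print(f"Best number of trees on the validation set: {best_n}")
print(f"Validation log loss at that point: {val_loss[best_n - 1]:.4f}")
print(f"Validation log loss with all 500 trees: {val_loss[-1]:.4f}")
```

Unlike `n_iter_no_change`, this approach pays the cost of training every tree before choosing a cutoff, so the built-in option is usually the more practical choice.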


### Advanced Gradient Boosting Libraries
For top performance, consider these specialized libraries:
- **XGBoost**: The library that popularized gradient boosting for competitions.
- **LightGBM**: A very fast and efficient implementation.
- **CatBoost**: Excellent at handling categorical features automatically.