Gradient boosting is an ensemble machine learning technique. It builds multiple weak predictive models (usually shallow decision trees) sequentially. Each new model is fitted to reduce the loss function (the error) of the ensemble built so far.

Three components are involved: a loss function to be optimized, a weak learner that makes predictions, and an additive model that adds weak learners one at a time to minimize the loss function. Every new tree corrects the mistakes made by the ones before it.
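To make the three components concrete, here is a minimal from-scratch sketch for the squared-error loss, where the negative gradient is simply the residual. The synthetic data, `learning_rate`, `n_trees`, and tree depth are illustrative assumptions, and this is not how scikit-learn implements boosting internally.

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

# Synthetic 1-D regression data (illustrative assumption, not the Diabetes dataset)
rng = np.random.default_rng(0)
X = rng.uniform(-3, 3, size=(200, 1))
y = np.sin(X[:, 0]) + rng.normal(scale=0.1, size=200)

learning_rate = 0.1  # shrinks each tree's contribution
n_trees = 50         # number of weak learners to add

# Additive model: start from a constant prediction (the mean of the targets)
prediction = np.full_like(y, y.mean())
trees = []
mse_history = [np.mean((y - prediction) ** 2)]

for _ in range(n_trees):
    # Loss function: for squared error, the negative gradient is the residual
    residuals = y - prediction
    # Weak learner: a shallow tree fitted to the current residuals
    tree = DecisionTreeRegressor(max_depth=2, random_state=0)
    tree.fit(X, residuals)
    # Add the new tree's (shrunken) output to the running prediction
    prediction += learning_rate * tree.predict(X)
    trees.append(tree)
    mse_history.append(np.mean((y - prediction) ** 2))

print(f"Training MSE before boosting: {mse_history[0]:.4f}")
print(f"Training MSE after {n_trees} trees: {mse_history[-1]:.4f}")
```

Each pass through the loop is one boosting step: the training error recorded in `mse_history` shrinks as trees are added, because every tree targets what the previous ones got wrong.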

Evaluation Metrics

    For Classification: Accuracy, Precision, Recall, F1 Score.
    For Regression: Mean Squared Error (MSE), R-squared.
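As a quick illustration of the classification metrics, the sketch below scores a small set of made-up labels with scikit-learn. The `y_true` and `y_pred` lists are hypothetical and chosen only so the arithmetic is easy to follow.

```python
from sklearn.metrics import accuracy_score, f1_score, precision_score, recall_score

# Hypothetical binary labels: 3 true positives, 2 true negatives,
# 2 false positives, and 1 false negative
y_true = [1, 0, 1, 1, 0, 1, 0, 0]
y_pred = [1, 0, 0, 1, 0, 1, 1, 1]

accuracy = accuracy_score(y_true, y_pred)    # (TP + TN) / all = 5/8
precision = precision_score(y_true, y_pred)  # TP / (TP + FP) = 3/5
recall = recall_score(y_true, y_pred)        # TP / (TP + FN) = 3/4
f1 = f1_score(y_true, y_pred)                # harmonic mean of precision and recall

print(f"Accuracy:  {accuracy:.3f}")
print(f"Precision: {precision:.3f}")
print(f"Recall:    {recall:.3f}")
print(f"F1 score:  {f1:.3f}")
```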

Applying with scikit-learn

We’ll use the Diabetes dataset for Gradient Boosting. Our goal will be to predict the progression of diabetes based on various features. We’ll train a gradient-boosting model and evaluate its performance.

Let’s see the steps we’ll follow below:

1. Load the Diabetes Dataset:

    The Diabetes dataset includes ten features: age, sex, body mass index, average blood pressure, and six blood serum measurements. The target variable is a quantitative measure of disease progression one year after baseline.

2. Create and Train the Gradient Boosting Model:

    We initialize a Gradient Boosting Regressor. Gradient boosting builds an additive model in a forward, stage-wise manner and allows the optimization of any differentiable loss function.
    We train (fit) this model on the training data. In this step, the model learns to predict the diabetes progression based on the features.

3. Predict:

    We use the trained Gradient Boosting model to predict the disease progression on the test data. This step involves applying the model to unseen data to assess its predictive capabilities.

4. Evaluate:

    The model’s performance is assessed using two key metrics:
    • Mean Squared Error (MSE): The average of the squared differences between the actual and predicted values. It measures the quality of an estimator; values closer to zero indicate better quality.
    • R-squared: The proportion of the variance in the target variable that the model explains. A value of 1 indicates a perfect fit, while 0 means the model does no better than always predicting the mean.
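Before applying the two regression metrics to the real data, it can help to compute them by hand once. The sketch below uses four made-up values (`y_true` and `y_hat` are hypothetical) and checks the manual formulas against scikit-learn's `mean_squared_error` and `r2_score`.

```python
import numpy as np
from sklearn.metrics import mean_squared_error, r2_score

# Hypothetical actual and predicted values
y_true = np.array([3.0, 5.0, 2.5, 7.0])
y_hat = np.array([2.5, 5.0, 4.0, 8.0])

# MSE: mean of the squared errors
mse_manual = np.mean((y_true - y_hat) ** 2)

# R-squared: 1 - (residual sum of squares / total sum of squares)
ss_res = np.sum((y_true - y_hat) ** 2)
ss_tot = np.sum((y_true - y_true.mean()) ** 2)
r2_manual = 1.0 - ss_res / ss_tot

print("MSE (manual):", mse_manual, "| sklearn:", mean_squared_error(y_true, y_hat))
print("R2  (manual):", r2_manual, "| sklearn:", r2_score(y_true, y_hat))
```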

# Import necessary libraries
from sklearn.datasets import load_diabetes
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error, r2_score

# Load Diabetes dataset
diabetes = load_diabetes()
X, y = diabetes.data, diabetes.target

# Splitting the dataset into training & testing sets
# Note: train_size=0.3 trains on only 30% of the samples and tests on the remaining 70%
X_train, X_test, y_train, y_test = train_test_split(X, y, train_size=0.3, random_state=42)

# Creating & training Gradient Boosting Model
model = GradientBoostingRegressor(random_state=42)
model.fit(X_train, y_train)

# Predicting the test set results
y_pred = model.predict(X_test)

# Evaluating the model results
mse_gb = mean_squared_error(y_test, y_pred)
r2_gb = r2_score(y_test, y_pred)

# Printing the results

print("MSE:", mse_gb)
print("R2 score:", r2_gb)

MSE: 3637.5315897877335
R2 score: 0.359823784755923


These results indicate that the gradient-boosting model has a modest level of accuracy in predicting diabetes progression.

The R-squared value of about 0.36 means that roughly 36% of the variance in the target variable is explained by the model, which is a modest result for a complex task like this.
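One way to put an R-squared value in context is to compare it with a trivial baseline that always predicts the training mean. The sketch below does this with scikit-learn's `DummyRegressor`, reusing the same split settings as above; the baseline comparison is our own addition rather than part of the original walkthrough, and exact scores may vary with the library version.

```python
from sklearn.datasets import load_diabetes
from sklearn.dummy import DummyRegressor
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.metrics import r2_score
from sklearn.model_selection import train_test_split

# Same data and split settings as in the walkthrough
X, y = load_diabetes(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, train_size=0.3, random_state=42
)

# Baseline: always predict the mean of the training targets
baseline = DummyRegressor(strategy="mean").fit(X_train, y_train)
model = GradientBoostingRegressor(random_state=42).fit(X_train, y_train)

r2_baseline = r2_score(y_test, baseline.predict(X_test))
r2_model = r2_score(y_test, model.predict(X_test))

print("Baseline R2:", r2_baseline)
print("Gradient boosting R2:", r2_model)
```

The baseline scores at or slightly below zero on the test set, so any positive R-squared from the boosted model is variance explained beyond simply guessing the average.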

The MSE is the average squared difference between the observed outcomes and the model’s predictions. Its square root (the RMSE) is about 60.3, which is expressed in the same units as the target variable.
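Since boosting adds trees one at a time, we can also watch the test MSE change as the ensemble grows. The sketch below uses the regressor's `staged_predict` method, which yields the predictions after each boosting stage; the choice of which stages to print is arbitrary, and exact numbers may vary with the library version.

```python
from sklearn.datasets import load_diabetes
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split

# Same data and split settings as in the walkthrough
X, y = load_diabetes(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, train_size=0.3, random_state=42
)

model = GradientBoostingRegressor(random_state=42)
model.fit(X_train, y_train)

# Test MSE after 1 tree, 2 trees, ..., up to all fitted trees
stage_mse = [
    mean_squared_error(y_test, stage_pred)
    for stage_pred in model.staged_predict(X_test)
]

# Print a few stages (arbitrary selection) to see the trend
for n in (1, 10, 50, len(stage_mse)):
    print(f"Trees: {n:3d}  Test MSE: {stage_mse[n - 1]:.1f}")
```

If the test MSE stops improving well before the last stage, that can be a hint that fewer trees or a smaller learning rate would generalize just as well.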