# Gradient Boosting

In [25]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

In [26]:
import xgboost as xgb
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.metrics import mean_squared_error, r2_score
from sklearn.ensemble import GradientBoostingRegressor

In [27]:
df = pd.read_csv('../data/data_feature_engineering.csv')
df.head()

| Unnamed: 0 | price | name | distance | source | destination | precipIntensity | humidity | temperatureHigh | apparentTemperatureHigh | uvIndex | precipIntensityMax | temperatureMax | apparentTemperatureMax |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 12.0 | 4 | 1.11 | 6 | 11 | 0.0 | 0.6 | 42.52 | 40.53 | 0 | 0.0003 | 42.52 | 40.53 |
| 1 | 16.0 | 0 | 1.11 | 6 | 11 | 0.0 | 0.66 | 33.83 | 32.85 | 0 | 0.0001 | 33.83 | 32.85 |
| 2 | 7.5 | 3 | 1.11 | 6 | 11 | 0.0 | 0.56 | 33.83 | 32.85 | 0 | 0.0001 | 33.83 | 32.85 |
| 3 | 7.5 | 5 | 1.11 | 6 | 11 | 0.0567 | 0.86 | 43.83 | 38.38 | 0 | 0.1252 | 43.83 | 38.38 |
| 4 | 26.0 | 1 | 1.11 | 6 | 11 | 0.0 | 0.64 | 33.83 | 32.85 | 0 | 0.0001 | 33.83 | 32.85 |


In [28]:
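# Note: 'Unnamed: 0' (the row index saved in the CSV) stays in X as a feature; drop it here if that is not intended.
X = df.drop(columns=['price'])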
y = df['price']

In [29]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.20, random_state = 42)

#### What is Gradient Boosting?

Gradient Boosting is a machine learning technique that builds an ensemble of models sequentially. It combines the predictions of many decision trees to generate the final prediction: each new tree is fit to correct the errors made by the trees trained before it. This approach makes gradient boosting particularly useful for analyzing complex datasets with nonlinear relationships and interaction effects. We chose gradient boosting as one of our models for its robustness and ability to handle various types of data efficiently, aiming to predict our target variable with high accuracy.
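
The sequential error-correcting idea can be sketched by hand for squared-error loss. This is a minimal illustration on synthetic data, not the implementation used by `GradientBoostingRegressor`; the names `X_demo`, `y_demo`, `n_stages`, and `mse_per_stage` are made up for this example.

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

# Synthetic 1-D regression data (illustrative only, not the ride-price data)
rng = np.random.default_rng(42)
X_demo = rng.uniform(0, 10, size=(200, 1))
y_demo = np.sin(X_demo[:, 0]) + rng.normal(0, 0.1, size=200)

learning_rate = 0.1
n_stages = 50

# Stage 0: start from a constant prediction (the mean of the target)
prediction = np.full_like(y_demo, y_demo.mean())
mse_per_stage = [np.mean((y_demo - prediction) ** 2)]

for _ in range(n_stages):
    # For squared error, the residual is the negative gradient of the loss
    # (up to a constant factor)
    residuals = y_demo - prediction
    tree = DecisionTreeRegressor(max_depth=2, random_state=42)
    tree.fit(X_demo, residuals)  # the new tree learns the remaining error
    prediction += learning_rate * tree.predict(X_demo)  # shrunken update
    mse_per_stage.append(np.mean((y_demo - prediction) ** 2))

print(f"Training MSE after 0 stages: {mse_per_stage[0]:.4f}")
print(f"Training MSE after {n_stages} stages: {mse_per_stage[-1]:.4f}")
```

Each stage only has to model what the current ensemble still gets wrong, which is why a small `learning_rate` typically needs more stages to reach the same training error.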

#### Random Forest VS Gradient Boosting

Random Forest and Gradient Boosting are both ensembles of decision trees, but they differ in how the trees are built and combined. Random Forest builds many decision trees independently and combines their predictions through averaging (regression) or voting (classification). Each tree is trained on a random subset of the data and features, which increases diversity and reduces overfitting. Gradient Boosting builds trees sequentially, with each tree correcting the errors made by the ensemble of previous trees, so the process concentrates on the model's remaining weaknesses. In short, Gradient Boosting focuses on iterative improvement, while Random Forest emphasizes diversity and averaging.
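
The difference can be made concrete with scikit-learn on synthetic data. This is a small sketch under assumed settings (the Friedman #1 toy dataset and 50 trees per model are arbitrary choices for illustration, not values from this notebook).

```python
import numpy as np
from sklearn.datasets import make_friedman1
from sklearn.ensemble import GradientBoostingRegressor, RandomForestRegressor
from sklearn.metrics import mean_squared_error

# Synthetic nonlinear regression data (illustrative only)
X_demo, y_demo = make_friedman1(n_samples=300, noise=1.0, random_state=42)

gb = GradientBoostingRegressor(
    n_estimators=50, learning_rate=0.1, max_depth=3, random_state=42
).fit(X_demo, y_demo)
rf = RandomForestRegressor(n_estimators=50, random_state=42).fit(X_demo, y_demo)

# Boosting is sequential: staged_predict yields the ensemble prediction after
# each added tree, so the training error can be tracked stage by stage.
gb_stage_mse = [mean_squared_error(y_demo, p) for p in gb.staged_predict(X_demo)]

# Random Forest is parallel: the final prediction is the plain average of the
# predictions of its independently trained trees.
tree_preds = np.stack([tree.predict(X_demo) for tree in rf.estimators_])
rf_is_average = np.allclose(tree_preds.mean(axis=0), rf.predict(X_demo))

print("GB training MSE, first vs. last stage:", gb_stage_mse[0], gb_stage_mse[-1])
print("RF prediction equals mean of its trees:", rf_is_average)
```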

#### Gradient Boosting Benchmark

In [33]:
# Model initialization with GradientBoostingRegressor
model_gb = GradientBoostingRegressor(n_estimators=500, learning_rate=.01, max_features=5, max_depth=5, random_state=42)
# n_estimators=500: The number of boosting stages (trees). Each stage adds one tree that corrects the current ensemble's errors.
# learning_rate=.01: Shrinks the contribution of each tree. A smaller learning rate needs more trees but can lead to a more accurate model.
# max_features=5: The number of features considered when looking for the best split, which can speed up training and reduce overfitting.
# max_depth=5: The maximum depth of each tree. Limiting depth keeps each tree simple and helps control overfitting.
# random_state=42: A seed to ensure reproducibility of the results.

# Model Training
model_gb.fit(X_train, y_train)
# The model learns to predict the target variable y_train from the features X_train
preds_test = model_gb.predict(X_test)
# The model uses the learned relationships to predict the target variable for new data, X_test.

#Evaluating the Model
mse = mean_squared_error(y_test, preds_test)
rmse = np.sqrt(mse)
r2 = r2_score(y_test, preds_test)
# These metrics are used for all models so that their performance can be compared directly.

# Print the performance metrics
print("MSE (Mean Squared Error):", mse)
print("RMSE (Root Mean Squared Error):", rmse)
print("R² (Coefficient of Determination):", r2)


MSE (Mean Squared Error): 3.7984472773156606
RMSE (Root Mean Squared Error): 1.9489605633043632
R² (Coefficient of Determination): 0.9475249975833804
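
The three metrics printed above can be computed by hand, which makes their relationship explicit: RMSE is the square root of MSE, and R² compares the model's squared error with that of always predicting the mean. The values of `y_true` and `y_hat` below are toy numbers chosen for illustration, not model output.

```python
import numpy as np
from sklearn.metrics import mean_squared_error, r2_score

# Toy values for illustration only (not predictions from model_gb)
y_true = np.array([12.0, 16.0, 7.5, 7.5, 26.0])
y_hat = np.array([11.0, 17.0, 8.0, 7.0, 25.0])

# MSE: average squared difference between actual and predicted values
mse_manual = np.mean((y_true - y_hat) ** 2)

# RMSE: square root of MSE, expressed in the same units as the target
rmse_manual = np.sqrt(mse_manual)

# R²: 1 - (residual sum of squares / total sum of squares around the mean)
ss_res = np.sum((y_true - y_hat) ** 2)
ss_tot = np.sum((y_true - y_true.mean()) ** 2)
r2_manual = 1 - ss_res / ss_tot

# The manual values agree with scikit-learn's implementations
print(np.isclose(mse_manual, mean_squared_error(y_true, y_hat)))
print(np.isclose(r2_manual, r2_score(y_true, y_hat)))
```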


#### Hyperparameter Tuning

After an initial iteration of Gradient Boosting with fixed parameters, I have a benchmark model that sets a baseline for MSE, RMSE, and R². To improve on these metrics, I used XGBoost and tuned its hyperparameters with GridSearchCV. Extreme Gradient Boosting (XGBoost) is an optimized implementation of gradient boosting that adds regularization and parallelized tree construction, which makes large data sets easier to process. The grid search evaluates every parameter combination and selects the best one based on the negative mean squared error metric. The goal is a set of hyperparameters that improves on the initial benchmark while remaining robust and generalizable to new, unseen data.

In [50]:

# Initialize the XGBoost regressor with a fixed seed (42) for reproducibility
xg_reg = xgb.XGBRegressor(objective='reg:squarederror', random_state = 42)

# Define the parameter grid to search
param_grid = {
    'colsample_bytree': [0.3, 0.7],
    'learning_rate': [0.01, 0.1],
    'max_depth': [5, 10],
    'alpha': [5, 10],
    'n_estimators': [100, 200, 500, 800]
}

# A dictionary named param_grid is created, specifying the parameters to be tuned and the range of values for each. This includes:
    # - colsample_bytree: Fraction of features used per tree.
    # - learning_rate: Step size shrinkage used to prevent overfitting.
    # - max_depth: Maximum depth of the trees.
    # - alpha: L1 regularization term on weights.
    # - n_estimators: Number of trees in the ensemble.

# Setup GridSearchCV
grid_search_cv3 = GridSearchCV(estimator=xg_reg, param_grid=param_grid, cv=3, n_jobs=-1, scoring='neg_mean_squared_error', verbose=1, return_train_score=True)
# GridSearchCV explores every combination of the hyperparameters in param_grid to find the best-performing model.
    # - estimator: The model to optimize.
    # - param_grid: The hyperparameters to be tested.
    # - cv: The number of folds in K-Fold cross-validation. With cv=3, each round trains on 2 folds and validates on the remaining 1, repeated 3 times. (Stratified K-Fold is only used for classifiers.)
    # - n_jobs=-1: Run the search in parallel on all available CPU cores.
    # - scoring: The metric used to evaluate each set of hyperparameters. Mean Squared Error (MSE) measures the average squared difference between predicted and actual values. GridSearchCV always maximizes the score, so MSE is negated: a lower MSE becomes a higher (less negative) score.
    # - verbose=1: Controls how much progress information is printed.
    # - return_train_score=True: Also records scores on the training folds in cv_results_.

# Fit the grid search to the data
grid_search_cv3.fit(X_train, y_train)

# Best parameters and best score with a cross-validation split of 3
print("Best parameters found: ", grid_search_cv3.best_params_)
print("Best score found: ", np.sqrt(-grid_search_cv3.best_score_))

# Best estimator (model) with a cross-validation split of 3
best_model3 = grid_search_cv3.best_estimator_


Fitting 3 folds for each of 64 candidates, totalling 192 fits
Best parameters found:  {'alpha': 10, 'colsample_bytree': 0.7, 'learning_rate': 0.1, 'max_depth': 10, 'n_estimators': 200}
Best score found:  1.859852786509415
Best model:  XGBRegressor(alpha=10, base_score=None, booster=None, callbacks=None,
             colsample_bylevel=None, colsample_bynode=None,
             colsample_bytree=0.7, device=None, early_stopping_rounds=None,
             enable_categorical=False, eval_metric=None, feature_types=None,
             gamma=None, grow_policy=None, importance_type=None,
             interaction_constraints=None, learning_rate=0.1, max_bin=None,
             max_cat_threshold=None, max_cat_to_onehot=None,
             max_delta_step=None, max_depth=10, max_leaves=None,
             min_child_weight=None, missing=nan, monotone_constraints=None,
             multi_strategy=None, n_estimators=200, n_jobs=None,
             num_parallel_tree=None, ...)
Best model:  {'alpha': 10, 'cols

Running the grid search with different numbers of cross-validation (CV) folds gives a more nuanced view of model performance across varying levels of data segmentation. This strategy helps in identifying a good balance between training time and the reliability of the accuracy estimate. By comparing results across these CV settings, one can better understand the trade-offs involved and select a CV strategy that aligns best with the project objectives.

CV is used to assess how well a model will generalize to unseen data. It involves splitting the data into k folds, training the model on k-1 of them, and evaluating it on the held-out fold. This process is repeated k times, with each fold taking a turn as the validation set, to produce a more reliable and less biased estimate of the model's performance. Cross-validation helps in identifying the model that performs best on unseen data, thereby reducing the likelihood of overfitting.
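
The fold rotation and the negated-MSE scoring can be seen in a few lines. This is a sketch on synthetic data with an arbitrary small model; `X_demo`, `y_demo`, and the model settings are assumptions for illustration only. `KFold` without shuffling is used here because that is what an integer `cv` resolves to for a regressor in scikit-learn.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.model_selection import KFold, cross_val_score

# Synthetic data (illustrative only): 90 rows so that 3 folds split evenly
X_demo, y_demo = make_regression(n_samples=90, n_features=5, noise=10.0, random_state=42)

kf = KFold(n_splits=3)

# Each round trains on 2 folds (60 rows) and validates on the remaining 1 (30 rows)
fold_sizes = [(len(train_idx), len(val_idx)) for train_idx, val_idx in kf.split(X_demo)]
print("(train, validation) sizes per round:", fold_sizes)

model = GradientBoostingRegressor(n_estimators=50, random_state=42)

# Scores are negated MSE values, one per fold, so higher (closer to 0) is better
neg_mse_scores = cross_val_score(
    model, X_demo, y_demo, cv=kf, scoring="neg_mean_squared_error"
)

# Flip the sign of the mean score and take the square root to report an RMSE
cv_rmse = np.sqrt(-neg_mse_scores.mean())
print("Cross-validated RMSE:", cv_rmse)
```

The last step mirrors the `np.sqrt(-grid_search.best_score_)` pattern used in the grid search cells.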

In [51]:

# Set up GridSearchCV with 5 folds
grid_search_cv5 = GridSearchCV(estimator=xg_reg, param_grid=param_grid, cv=5, n_jobs=-1, scoring='neg_mean_squared_error', verbose=1, return_train_score=True)

# Fit the grid search to the data
grid_search_cv5.fit(X_train, y_train)

# Best parameters and best score with a cross-validation split of 5
print("Best parameters found: ", grid_search_cv5.best_params_)
print("Best score found: ", np.sqrt(-grid_search_cv5.best_score_))

# Best estimator (model) with a cross-validation split of 5
best_model5 = grid_search_cv5.best_estimator_

Fitting 5 folds for each of 64 candidates, totalling 320 fits
Best parameters found:  {'alpha': 10, 'colsample_bytree': 0.7, 'learning_rate': 0.1, 'max_depth': 10, 'n_estimators': 200}
Best score found:  1.852872653020119


In assessing the XGBoost models with 3-fold and 5-fold cross-validation, the 5-fold search produced a marginally lower cross-validated Root Mean Squared Error (RMSE) than the 3-fold search (1.853 vs. 1.860). This difference is slight, and part of it is expected because each 5-fold model trains on 80% of the training data rather than about 67%; a higher number of folds can therefore offer a more refined estimate of the model's ability to generalize to unseen data. Notably, both cross-validation strategies converged on the same optimal hyperparameters, which suggests the selected configuration is robust. However, it's important to highlight that increasing the folds to CV=5 resulted in a substantially longer processing time (320 fits instead of 192). This trade-off between a more reliable estimate and increased computational demand emphasizes the need for a balanced approach, especially when considering the constraints of the project.
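
The cost side of this trade-off follows directly from the grid size. As a quick check, scikit-learn's `ParameterGrid` can enumerate the same `param_grid` used above; the variable names `n_candidates`, `fits_cv3`, and `fits_cv5` are introduced here for illustration.

```python
from sklearn.model_selection import ParameterGrid

# Same grid as in the search above
param_grid = {
    'colsample_bytree': [0.3, 0.7],
    'learning_rate': [0.01, 0.1],
    'max_depth': [5, 10],
    'alpha': [5, 10],
    'n_estimators': [100, 200, 500, 800],
}

# Number of hyperparameter combinations: 2 * 2 * 2 * 2 * 4
n_candidates = len(ParameterGrid(param_grid))

# Every candidate is fit once per fold
fits_cv3 = n_candidates * 3
fits_cv5 = n_candidates * 5

print(n_candidates, fits_cv3, fits_cv5)  # 64 192 320
```

These counts match the `Fitting ... folds for each of 64 candidates` lines in the search output. With the default `refit=True`, GridSearchCV also fits the best configuration once more on the full training set.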

#### Best Model 

In [52]:
# Use the best estimator to make predictions
predictions = best_model3.predict(X_test)

# Evaluate the model
mse = mean_squared_error(y_test, predictions)
rmse = np.sqrt(mse)
r2 = r2_score(y_test, predictions)

# Print the performance metrics
print("MSE (Mean Squared Error):", mse)
print("RMSE (Root Mean Squared Error):", rmse)
print("R² (Coefficient of Determination):", r2)

MSE (Mean Squared Error): 3.3717752399377443
RMSE (Root Mean Squared Error): 1.8362394288157915
R² (Coefficient of Determination): 0.9534194103678413
