## Q1. What is Gradient Boosting Regression?

Gradient Boosting Regression is a machine learning technique that builds an ensemble of decision trees sequentially to predict continuous numerical values. In brief:

- **Ensemble Learning**: Gradient Boosting Regression combines multiple weak learners (typically shallow decision trees) sequentially to create a strong learner.

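- **Gradient Descent**: Unlike AdaBoost, which reweights training instances, Gradient Boosting Regression minimizes a differentiable loss function (usually squared error for regression tasks) by gradient descent in function space: each new tree is fit to the negative gradient of the loss with respect to the current predictions.

- **Sequential Training**: It builds trees one at a time, where each new tree fits the residuals (errors) of the ensemble built so far. For squared error loss, these residuals are exactly the negative gradient.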

- **Model Complexity**: It can fit complex nonlinear relationships in data due to its ability to capture interactions and dependencies between features.

- **Regularization**: Various regularization techniques are employed to prevent overfitting, such as controlling tree depth, learning rate, and early stopping.

Gradient boosting implementations such as Gradient Boosting Machines (GBM), XGBoost, LightGBM, and CatBoost are widely used because they often predict continuous outcomes with high accuracy. When properly regularized they can also be reasonably robust to noisy data, although an unregularized model can overfit the noise.
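The link between residuals and gradient descent can be checked numerically. The sketch below is illustrative only: it assumes the half squared error loss `0.5 * (y - f)^2` (the factor 0.5 is a convenience so the gradient has no factor of 2), and the arrays and the helper name `squared_error_loss` are made up for the example.

```python
import numpy as np

def squared_error_loss(y_true, y_pred):
    # Half squared error per sample (hypothetical helper for this sketch)
    return 0.5 * (y_true - y_pred) ** 2

# Made-up targets and current ensemble predictions
y_true = np.array([3.0, -1.0, 2.5])
y_pred = np.array([2.0, 0.5, 2.5])

# Central finite-difference estimate of dL/d(prediction) for each sample
eps = 1e-6
numeric_grad = (squared_error_loss(y_true, y_pred + eps)
                - squared_error_loss(y_true, y_pred - eps)) / (2 * eps)

residuals = y_true - y_pred
negative_gradient = -numeric_grad

print("Residuals:        ", residuals)
print("Negative gradient:", negative_gradient)
```

Up to floating-point error the two printed arrays agree, which is why fitting a tree to the residuals is a gradient step for squared error. For other losses the negative gradient is no longer the plain residual.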

## Q2. Implement a simple gradient boosting algorithm from scratch using Python and NumPy. Use a simple regression problem as an example and train the model on a small dataset. Evaluate the model's performance using metrics such as mean squared error and R-squared.

In [1]:
import numpy as np

# Generate synthetic dataset
np.random.seed(0)
X = np.random.rand(100, 1) * 10
y = 2 * (X - 5) ** 2 + np.random.randn(100, 1)

# Split data into training and testing sets
split_ratio = 0.8
split_index = int(split_ratio * len(X))

X_train, X_test = X[:split_index], X[split_index:]
y_train, y_test = y[:split_index], y[split_index:]


In [2]:
class GradientBoostingRegressor:
    def __init__(self, n_estimators=100, learning_rate=0.1):
        self.n_estimators = n_estimators
        self.learning_rate = learning_rate
        self.models = []
        self.residuals = []
    
    def fit(self, X, y):
        # Reset state so that calling fit() again does not stack onto old stumps
        self.models = []
        self.residuals = []

        # Stage 0: a constant prediction equal to the mean of y (y must be 1-D)
        initial_prediction = np.mean(y)
        self.models.append(initial_prediction)

        # Fit each estimator sequentially
        for i in range(self.n_estimators):
            # Residuals of the current ensemble (the negative gradient of squared error)
            residuals = y - self.predict(X)
            self.residuals.append(residuals)

            # Fit decision stump (weak learner) to residuals
            stump = DecisionStump()
            stump.fit(X, residuals)

            # Add the stump to the ensemble; predict() applies the learning rate
            self.models.append(stump)
    
    def predict(self, X):
        predictions = np.ones(len(X)) * self.models[0]
        for model in self.models[1:]:
            predictions += self.learning_rate * model.predict(X)
        return predictions

class DecisionStump:
    def __init__(self):
        self.split_feature = None
        self.split_threshold = None
        self.prediction = None
    
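    def fit(self, X, y):
        # Fallback if no valid split exists (all feature values equal): predict the mean
        self.split_feature, self.split_threshold = 0, -np.inf
        self.prediction = (np.mean(y), np.mean(y))

        # Find the split that minimizes the total squared error of the two leaves
        min_error = float('inf')
        for feature in range(X.shape[1]):
            # Skip the smallest value: "x < min" would leave the left leaf empty
            # and make np.mean emit an empty-slice RuntimeWarning
            thresholds = np.unique(X[:, feature])[1:]
            for threshold in thresholds:
                left_indices = X[:, feature] < threshold
                right_indices = ~left_indices
                left_mean = np.mean(y[left_indices])
                right_mean = np.mean(y[right_indices])
                error = np.sum((y[left_indices] - left_mean) ** 2) \
                      + np.sum((y[right_indices] - right_mean) ** 2)
                if error < min_error:
                    min_error = error
                    self.split_feature = feature
                    self.split_threshold = threshold
                    self.prediction = (left_mean, right_mean)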
    
    def predict(self, X):
        return np.where(X[:, self.split_feature] < self.split_threshold, self.prediction[0], self.prediction[1])


In [3]:
# Initialize and train gradient boosting regressor
gb_regressor = GradientBoostingRegressor(n_estimators=100, learning_rate=0.1)
gb_regressor.fit(X_train, y_train.flatten())

# Make predictions on test set
y_pred = gb_regressor.predict(X_test)

# Evaluate performance using metrics
from sklearn.metrics import mean_squared_error, r2_score

mse = mean_squared_error(y_test, y_pred)
r2 = r2_score(y_test, y_pred)

print(f"Mean Squared Error: {mse}")
print(f"R-squared: {r2}")


Mean Squared Error: 6.803210586991037
R-squared: 0.9726079781335314
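Since the exercise asks for Python and NumPy, the two metrics can also be computed without scikit-learn. The following is a minimal sketch; the function names `mse_numpy` and `r2_numpy` and the demo arrays are made up for illustration. Both inputs are flattened first, because subtracting a 1-D array from an `(n, 1)` column (the shape of `y_test` above) would silently broadcast to an `(n, n)` matrix.

```python
import numpy as np

def mse_numpy(y_true, y_pred):
    # Flatten both inputs so (n, 1) and (n,) arrays line up element by element
    y_true = np.asarray(y_true, dtype=float).ravel()
    y_pred = np.asarray(y_pred, dtype=float).ravel()
    return np.mean((y_true - y_pred) ** 2)

def r2_numpy(y_true, y_pred):
    y_true = np.asarray(y_true, dtype=float).ravel()
    y_pred = np.asarray(y_pred, dtype=float).ravel()
    ss_res = np.sum((y_true - y_pred) ** 2)          # residual sum of squares
    ss_tot = np.sum((y_true - y_true.mean()) ** 2)   # total sum of squares
    return 1.0 - ss_res / ss_tot

# Made-up example: a column-shaped target and 1-D predictions
y_true_demo = np.array([[3.0], [5.0], [7.0], [9.0]])
y_pred_demo = np.array([2.5, 5.5, 7.0, 8.0])

mse_demo = mse_numpy(y_true_demo, y_pred_demo)
r2_demo = r2_numpy(y_true_demo, y_pred_demo)
print(f"Mean Squared Error: {mse_demo}")
print(f"R-squared: {r2_demo}")
```

These functions follow the standard definitions, so on the same inputs they should agree with `mean_squared_error` and `r2_score` for single-output regression.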


## Q3. Experiment with different hyperparameters such as learning rate, number of trees, and tree depth to optimise the performance of the model. Use grid search or random search to find the best hyperparameters.

In [4]:
import numpy as np
from sklearn.model_selection import GridSearchCV
from sklearn.metrics import mean_squared_error, make_scorer
from sklearn.datasets import make_regression
from sklearn.ensemble import GradientBoostingRegressor

# Generate synthetic dataset
X, y = make_regression(n_samples=100, n_features=1, noise=0.1, random_state=0)

# Define the Gradient Boosting Regressor
gb_regressor = GradientBoostingRegressor()

# Define the grid of parameters to search
param_grid = {
    'learning_rate': [0.05, 0.1, 0.2],
    'n_estimators': [50, 100, 200],
    'max_depth': [1, 2, 3, 4]
}

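# Define the scoring method: negated mean squared error, because GridSearchCV
# maximizes the score (greater_is_better=False flips the sign of the MSE)
scorer = make_scorer(mean_squared_error, greater_is_better=False)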

# Perform grid search
grid_search = GridSearchCV(estimator=gb_regressor, param_grid=param_grid, scoring=scorer, cv=5, verbose=1)
grid_search.fit(X, y)

# Print the best parameters and best score
# (best_score_ is the negated MSE, so it prints as a negative number)
print("Best Parameters:", grid_search.best_params_)
print("Best MSE:", grid_search.best_score_)

# Evaluate the best model on test data (if available)
# X_test, y_test = ...
# y_pred = grid_search.best_estimator_.predict(X_test)
# print("Test MSE:", mean_squared_error(y_test, y_pred))


Fitting 5 folds for each of 36 candidates, totalling 180 fits
Best Parameters: {'learning_rate': 0.2, 'max_depth': 4, 'n_estimators': 200}
Best MSE: -15.145719917331515
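The commented-out test evaluation can be made concrete by holding out part of the data before searching. This is a sketch under assumptions, not the notebook's original procedure: the 80/20 split, the reduced parameter grid, and `random_state=0` are choices made here for speed and reproducibility. It uses scikit-learn's built-in `"neg_mean_squared_error"` scoring string, which is equivalent to the custom scorer above.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import GridSearchCV, train_test_split

# Same kind of synthetic data as above, with a held-out test set (assumed 80/20 split)
X, y = make_regression(n_samples=100, n_features=1, noise=0.1, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0
)

# Deliberately small grid so the sketch runs quickly
param_grid = {
    "learning_rate": [0.05, 0.1],
    "n_estimators": [50, 100],
    "max_depth": [1, 2],
}

search = GridSearchCV(
    GradientBoostingRegressor(random_state=0),
    param_grid,
    scoring="neg_mean_squared_error",  # higher is better, so MSE is negated
    cv=5,
)
search.fit(X_train, y_train)  # cross-validation only sees the training split

cv_mse = -search.best_score_  # flip the sign back to a regular MSE
test_mse = mean_squared_error(y_test, search.best_estimator_.predict(X_test))

print("Best Parameters:", search.best_params_)
print("Cross-validated MSE:", cv_mse)
print("Test MSE:", test_mse)
```

Because the test split plays no part in the search, the test MSE is a less optimistic estimate than the cross-validated score of the winning configuration.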


In [5]:
from sklearn.model_selection import RandomizedSearchCV
from scipy.stats import randint, uniform

# Define the random search grid of parameters to sample from
param_dist = {
    'learning_rate': uniform(0.05, 0.3),  # uniform(loc, scale): continuous uniform on [0.05, 0.35]
    'n_estimators': randint(50, 300),     # integers from 50 to 299 (upper bound is exclusive)
    'max_depth': randint(1, 5)            # integers from 1 to 4 (upper bound is exclusive)
}

# Perform random search
random_search = RandomizedSearchCV(estimator=gb_regressor, param_distributions=param_dist, n_iter=100, scoring=scorer, cv=5, verbose=1)
random_search.fit(X, y)

# Print the best parameters and best score
# (best_score_ is the negated MSE, so it prints as a negative number)
print("Best Parameters:", random_search.best_params_)
print("Best MSE:", random_search.best_score_)

# Evaluate the best model on test data (if available)
# X_test, y_test = ...
# y_pred = random_search.best_estimator_.predict(X_test)
# print("Test MSE:", mean_squared_error(y_test, y_pred))


Fitting 5 folds for each of 100 candidates, totalling 500 fits
Best Parameters: {'learning_rate': 0.08607100646678517, 'max_depth': 3, 'n_estimators': 246}
Best MSE: -15.143677209201774
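For the number of trees specifically, a full search is not the only option. As a rough sketch, one model can be trained with a generous `n_estimators` and the validation error inspected after every stage using `staged_predict`, a method of scikit-learn's `GradientBoostingRegressor`. The dataset, split, and settings below are assumptions chosen for illustration.

```python
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split

# Made-up quadratic dataset, similar in spirit to the one used in Q2
rng = np.random.RandomState(0)
X = rng.rand(200, 1) * 10
y = 2 * (X[:, 0] - 5) ** 2 + rng.randn(200)

X_tr, X_val, y_tr, y_val = train_test_split(X, y, test_size=0.25, random_state=0)

# Train once with more trees than are likely to be needed
model = GradientBoostingRegressor(
    n_estimators=300, learning_rate=0.1, max_depth=1, random_state=0
)
model.fit(X_tr, y_tr)

# staged_predict yields the ensemble's predictions after 1, 2, ..., 300 trees
val_mse = [mean_squared_error(y_val, pred) for pred in model.staged_predict(X_val)]

best_n_trees = int(np.argmin(val_mse)) + 1  # +1 because stages are counted from 1
print("Best number of trees:", best_n_trees)
print("Validation MSE at that point:", val_mse[best_n_trees - 1])
```

This costs a single fit instead of one fit per candidate value, at the price of tuning the tree count for one fixed learning rate and depth.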


## Q4. What is a weak learner in Gradient Boosting?

In the context of Gradient Boosting, a weak learner is a simple model that performs only slightly better than a trivial baseline: random guessing for classification, or always predicting the mean for regression. Specifically:

1. **Definition**: A weak learner is typically a model that has limited predictive power on its own, often with low complexity. For classification tasks, weak learners might be decision stumps (single-level decision trees), shallow decision trees, or linear models. For regression tasks, they could be decision stumps or simple linear regression models.

2. **Role in Gradient Boosting**: In Gradient Boosting, weak learners are sequentially added to the ensemble. Each weak learner is trained on the residuals (errors) of the ensemble up to the current stage. By focusing on the mistakes made by the previous models, weak learners incrementally improve the overall prediction accuracy of the ensemble.

3. **Characteristics**:
   - **Low Complexity**: Weak learners are intentionally kept simple to facilitate the boosting process and prevent overfitting.
   - **Slightly Better than Random**: While weak learners are not highly accurate individually, they perform better than a trivial baseline (random guessing, or predicting the mean in regression), ensuring that each subsequent model contributes to reducing the overall prediction error.

4. **Example**: In practice, weak learners can vary based on the specific implementation of Gradient Boosting. They might include decision trees with a maximum depth of one or two, linear models, or other models that are computationally inexpensive and easy to train.

5. **Boosting Mechanism**: The concept of boosting revolves around combining multiple weak learners into a strong learner. Through iterative training, each weak learner is trained to correct the errors of the previous ones, gradually improving the ensemble's predictive performance.

In summary, in Gradient Boosting, a weak learner is a basic model that is incorporated into the ensemble to collectively contribute to the final prediction. Despite their simplicity, when combined effectively through boosting, these weak learners can produce highly accurate predictions for both regression and classification tasks.
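The gap between one weak learner and a boosted ensemble of them can be seen in a small experiment. The sketch below is illustrative: the quadratic dataset, the 200/100 split, and the settings are assumptions. It compares a single depth-1 tree with 200 boosted depth-1 trees from scikit-learn.

```python
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.metrics import r2_score
from sklearn.tree import DecisionTreeRegressor

# Made-up nonlinear regression problem
rng = np.random.RandomState(0)
X = rng.rand(300, 1) * 10
y = 2 * (X[:, 0] - 5) ** 2 + rng.randn(300)
X_train, X_test = X[:200], X[200:]
y_train, y_test = y[:200], y[200:]

# A single decision stump: one split, two constant predictions
stump = DecisionTreeRegressor(max_depth=1, random_state=0)
stump.fit(X_train, y_train)

# The same kind of weak learner, boosted 200 times
boosted = GradientBoostingRegressor(
    n_estimators=200, learning_rate=0.1, max_depth=1, random_state=0
)
boosted.fit(X_train, y_train)

stump_r2 = r2_score(y_test, stump.predict(X_test))
boosted_r2 = r2_score(y_test, boosted.predict(X_test))
print(f"Single stump R-squared:   {stump_r2:.3f}")
print(f"Boosted stumps R-squared: {boosted_r2:.3f}")
```

A single stump can only approximate the parabola with two flat segments, while the boosted ensemble adds up many small corrections into a much closer fit.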

## Q5. What is the intuition behind the Gradient Boosting algorithm?

## Q6. How does Gradient Boosting algorithm build an ensemble of weak learners?