# Boosting-2

## Q1. What is Gradient Boosting Regression?

Gradient Boosting Regression is an ensemble machine learning technique used for regression tasks. It is based on the idea of boosting, where multiple weak learners (usually decision trees) are combined to create a strong predictive model.

The algorithm works as follows:

1. Initialization: The initial prediction is set to the mean (or median) of the target variable. This initial prediction acts as the starting point for the ensemble.
2. Residual Calculation: The difference between the actual target values and the initial prediction is calculated. These residuals represent the errors of the current model on the training data.
3. Training Weak Learner: A new weak learner (often a decision tree) is trained to predict the residuals. The weak learner is fitted to minimize the loss function (e.g., mean squared error) between its predictions and the residuals.
4. Weighted Combination: The predictions of the new weak learner are combined with the current model's predictions. The contribution of the new weak learner is scaled by a learning rate, which controls the step size of each iteration.
5. Update Residuals: The residuals are updated by subtracting the weighted predictions of the new weak learner from the current residuals.
6. Iterative Process: Steps 3 to 5 are repeated for a fixed number of iterations (controlled by the number of weak learners) or until a specific stopping criterion is met.
7. Final Prediction: The final prediction of the ensemble model is the sum of the initial prediction and the weighted predictions of all weak learners.
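
The steps above can be traced by hand on a tiny dataset. The sketch below is only an illustration: the four data points, the learning rate of 0.5, and the hard-coded stump threshold of 3.5 are assumptions chosen to keep the arithmetic easy, and only a single boosting iteration is shown.

```python
import numpy as np

# Toy data (hypothetical values chosen so the arithmetic is easy to follow).
x = np.array([1.0, 2.0, 3.0, 4.0])
y = np.array([1.0, 2.0, 3.0, 6.0])
learning_rate = 0.5  # assumed step size

# Step 1: the initial prediction is the mean of the target -> 3 for every point.
prediction = np.full_like(y, y.mean())

# Step 2: residuals of the current model -> -2, -1, 0, 3.
residuals = y - prediction
mse_before = np.mean(residuals ** 2)  # 3.5

# Step 3: a decision stump fitted to the residuals. The threshold 3.5 is
# hard-coded; it is the least-squares split for this particular data.
left = x < 3.5
stump_output = np.where(left, residuals[left].mean(), residuals[~left].mean())

# Step 4: add the stump's scaled output to the current prediction
# -> 2.5, 2.5, 2.5, 4.5.
prediction = prediction + learning_rate * stump_output

# Step 5: updated residuals -> -1.5, -0.5, 0.5, 1.5.
residuals = y - prediction
mse_after = np.mean(residuals ** 2)  # 1.25

print(mse_before, mse_after)
```

One iteration lowers the training MSE from 3.5 to 1.25; repeating steps 3 to 5 keeps shrinking the residuals.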

Gradient Boosting Regression is known for its ability to handle complex nonlinear relationships in data and produce accurate predictions. It is widely used in various regression tasks, such as house price prediction, stock market forecasting, and other numerical value prediction problems. As with any ensemble method, it's essential to tune hyperparameters to optimize performance and prevent overfitting.

## Q2. Implement a simple gradient boosting algorithm from scratch using Python and NumPy. Use a simple regression problem as an example and train the model on a small dataset. Evaluate the model's performance using metrics such as mean squared error and R-squared.

In [None]:
import numpy as np

# Generate random data for regression
np.random.seed(42)
X = np.random.rand(100, 1)
y = 2 * X[:, 0] + np.random.normal(0, 0.1, 100)


In [None]:
class DecisionTreeRegressor:
    def __init__(self, max_depth=1):
        self.max_depth = max_depth
        self.tree = None

    def fit(self, X, y):
        self.tree = self._build_tree(X, y, depth=0)

    def predict(self, X):
        return np.array([self._predict_one(x, self.tree) for x in X])

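    def _build_tree(self, X, y, depth):
        # A leaf is stored as a number (the mean target); an internal node as a tuple.
        if depth >= self.max_depth or len(np.unique(y)) == 1:
            return np.mean(y)

        feature_idx, split_value = self._find_best_split(X, y)
        if feature_idx is None:
            # No threshold separates the samples (all feature values identical).
            return np.mean(y)
        left_mask = X[:, feature_idx] < split_value
        left_tree = self._build_tree(X[left_mask], y[left_mask], depth + 1)
        right_tree = self._build_tree(X[~left_mask], y[~left_mask], depth + 1)
        return (feature_idx, split_value, left_tree, right_tree)

    def _find_best_split(self, X, y):
        # Exhaustive search: every feature, every observed value as a threshold.
        best_mse = float('inf')
        best_feature_idx = None
        best_split_value = None

        for feature_idx in range(X.shape[1]):
            unique_values = np.unique(X[:, feature_idx])
            for split_value in unique_values:
                left_mask = X[:, feature_idx] < split_value
                y_left = y[left_mask]
                y_right = y[~left_mask]
                # Skip degenerate splits that leave one side empty
                # (np.mean of an empty array is NaN and raises a RuntimeWarning).
                if len(y_left) == 0 or len(y_right) == 0:
                    continue
                # Split score: sum of the two children's MSEs (not weighted by child size).
                mse = np.mean((y_left - np.mean(y_left))**2) + np.mean((y_right - np.mean(y_right))**2)
                if mse < best_mse:
                    best_mse = mse
                    best_feature_idx = feature_idx
                    best_split_value = split_value

        return best_feature_idx, best_split_value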

    def _predict_one(self, x, tree):
        if not isinstance(tree, tuple):
            return tree

        feature_idx, split_value, left_tree, right_tree = tree
        if x[feature_idx] < split_value:
            return self._predict_one(x, left_tree)
        else:
            return self._predict_one(x, right_tree)


In [None]:
class GradientBoostingRegressor:
    def __init__(self, n_estimators=100, learning_rate=0.1, max_depth=1):
        self.n_estimators = n_estimators
        self.learning_rate = learning_rate
        self.max_depth = max_depth
        self.trees = []

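    def fit(self, X, y):
        # Start from a zero prediction for simplicity; starting from a constant
        # such as np.mean(y) is the more common choice.
        self.trees = []
        prediction = np.zeros(len(y))
        for _ in range(self.n_estimators):
            # For squared-error loss the residual is the negative gradient.
            residual = y - prediction
            tree = DecisionTreeRegressor(max_depth=self.max_depth)
            tree.fit(X, residual)
            self.trees.append(tree)
            prediction += self.learning_rate * tree.predict(X)

    def predict(self, X):
        # Built-in sum over the per-tree arrays; np.sum on a generator is deprecated.
        return sum(self.learning_rate * tree.predict(X) for tree in self.trees)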

    def mse(self, X, y_true):
        y_pred = self.predict(X)
        return np.mean((y_true - y_pred) ** 2)

    def r_squared(self, X, y_true):
        y_pred = self.predict(X)
        ss_res = np.sum((y_true - y_pred) ** 2)
        ss_tot = np.sum((y_true - np.mean(y_true)) ** 2)
        return 1 - ss_res / ss_tot


In [None]:
import warnings
warnings.filterwarnings("ignore")

# Split the data into training and testing sets
X_train, X_test = X[:80], X[80:]
y_train, y_test = y[:80], y[80:]

# Create and train the gradient boosting regressor
gb_regressor = GradientBoostingRegressor(n_estimators=100, learning_rate=0.1, max_depth=1)
gb_regressor.fit(X_train, y_train)

# Evaluate the model on the test set
mse = gb_regressor.mse(X_test, y_test)
r_squared = gb_regressor.r_squared(X_test, y_test)

print("Mean Squared Error:", mse)
print("R-squared:", r_squared)


Mean Squared Error: 0.025593672828457547
R-squared: 0.9127186373840868


## Q3. Experiment with different hyperparameters such as learning rate, number of trees, and tree depth to optimise the performance of the model. Use grid search or random search to find the best hyperparameters

In [2]:
import numpy as np
from sklearn.model_selection import RandomizedSearchCV
from sklearn.metrics import mean_squared_error, r2_score
from sklearn.ensemble import GradientBoostingRegressor
from scipy.stats import randint


In [3]:

# Generate random data for regression
np.random.seed(42)
X = np.random.rand(100, 1)
y = 2 * X[:, 0] + np.random.normal(0, 0.1, 100)



In [4]:
# Split the data into training and testing sets
X_train, X_test = X[:80], X[80:]
y_train, y_test = y[:80], y[80:]


In [5]:

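# Define the parameter distributions for random search
param_dist = {
    'n_estimators': randint(50, 200),      # Number of trees: integers 50..199 (upper bound is exclusive)
    'learning_rate': np.linspace(0.01, 0.2, 20),  # Learning rate: 20 evenly spaced values from 0.01 to 0.2
    'max_depth': randint(1, 5)            # Tree depth: integers 1..4
}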

# Create the GradientBoostingRegressor model
gb_model = GradientBoostingRegressor()


In [6]:

# Perform random search
random_search = RandomizedSearchCV(
    gb_model,
    param_distributions=param_dist,
    n_iter=50,        # Number of parameter settings that are sampled
    cv=5,             # Cross-validation folds
    scoring='neg_mean_squared_error',  # Evaluation metric (negative MSE)
    random_state=42
)



In [7]:

# Fit the random search to the data
random_search.fit(X_train, y_train)


In [8]:

# Print the best hyperparameters found during the search
print("Best Hyperparameters:")
print(random_search.best_params_)


Best Hyperparameters:
{'learning_rate': 0.11, 'max_depth': 1, 'n_estimators': 85}


In [9]:

# Evaluate the model with the best hyperparameters on the test set
best_model = random_search.best_estimator_
y_pred = best_model.predict(X_test)
mse = mean_squared_error(y_test, y_pred)
r2 = r2_score(y_test, y_pred)

print("Mean Squared Error:", mse)
print("R-squared:", r2)

Mean Squared Error: 0.012379467793888033
R-squared: 0.9577826588332036


## Q4. What is a weak learner in Gradient Boosting?

In the context of Gradient Boosting, a weak learner refers to a simple and relatively low-complexity model that performs only slightly better than random guessing on a given classification or regression task. Weak learners are often used as building blocks in ensemble methods like Gradient Boosting to create a strong predictive model.

In the case of Gradient Boosting Regression, the weak learner is typically a shallow decision tree with limited depth, containing only a small number of nodes and splits. The simplest case is a decision stump: a tree of depth 1 that makes its decision based on a single feature and a threshold. Such trees are not powerful enough to capture complex relationships in the data on their own, but they are capable of learning basic patterns.

For Gradient Boosting Classification, weak learners are usually weak classifiers that perform only slightly better than random guessing. These weak classifiers could be, for example, decision stumps based on a single feature and threshold, or even simple models like linear classifiers.

The strength of Gradient Boosting lies in its ability to sequentially combine multiple weak learners into a strong ensemble model. Each weak learner is trained on the errors (residuals) made by the ensemble up to that point, which allows them to focus on the challenging instances that the ensemble currently struggles to predict correctly. The weighted combination of these weak learners in the ensemble gradually improves the model's predictive performance, making it a strong learner.

By combining a series of weak learners, Gradient Boosting is able to create complex models that can handle intricate relationships and provide highly accurate predictions, even on challenging datasets. The iterative nature of boosting, in which each new learner is fitted to the residuals of the current ensemble, allows the model to learn from its mistakes and continually improve, resulting in a powerful ensemble model.
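
To make "weak" concrete, here is a minimal NumPy sketch of a single decision stump fitted to a nonlinear target. The helper names `fit_stump` and `predict_stump` and the sine-shaped data are hypothetical, introduced only for this illustration.

```python
import numpy as np

def fit_stump(x, y):
    """Return (threshold, left_value, right_value) minimizing squared error."""
    best, best_sse = None, np.inf
    values = np.unique(x)
    # Candidate thresholds: midpoints between consecutive distinct values,
    # so both sides of every split are non-empty.
    for threshold in (values[:-1] + values[1:]) / 2:
        left = x < threshold
        left_value, right_value = y[left].mean(), y[~left].mean()
        sse = np.sum((y[left] - left_value) ** 2) + np.sum((y[~left] - right_value) ** 2)
        if sse < best_sse:
            best, best_sse = (threshold, left_value, right_value), sse
    return best

def predict_stump(stump, x):
    threshold, left_value, right_value = stump
    return np.where(x < threshold, left_value, right_value)

# Hypothetical nonlinear target: a stump can only output two distinct levels.
x = np.linspace(0.0, 1.0, 21)
y = np.sin(2 * np.pi * x)

stump = fit_stump(x, y)
mse_constant = np.mean((y - y.mean()) ** 2)              # baseline: always predict the mean
mse_stump = np.mean((y - predict_stump(stump, x)) ** 2)  # one weak learner

print("constant baseline MSE:", mse_constant)
print("single stump MSE:", mse_stump)
```

The stump beats the constant baseline but still leaves most of the curve unexplained, which is the sense in which it is a weak learner.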

## Q5. What is the intuition behind the Gradient Boosting algorithm?

The intuition behind the Gradient Boosting algorithm can be understood through the metaphor of learning from mistakes or teamwork. Here's the intuition in simple terms:

Learning from Mistakes: Imagine you have a group of people trying to solve a complex problem together. Each person has their own area of expertise and can contribute something valuable to the solution. However, no one person has all the answers. They work together in iterations, where each person identifies the mistakes made by the group in the previous iteration and focuses on correcting those mistakes.

Teamwork and Adaptability: Initially, the group makes predictions based on their collective knowledge, which may not be accurate. In the first iteration, they observe the mistakes they made, and each person adapts their strategy to focus on the areas where they can make the most significant improvement. They iteratively repeat this process, with each iteration building upon the previous one.

Weighted Collaboration: In each iteration, the group members give more importance to the areas where their previous predictions were incorrect. The team members specialize in different aspects of the problem, so they collectively cover more ground and become increasingly better at solving the problem as they work together.

Strong Team Emerges: As the iterations progress, the team members become more specialized in their areas of expertise. The group's collective intelligence grows, and a strong team with a diverse skill set emerges. Together, they create a powerful ensemble model that can solve the problem more effectively than any individual could.

In Gradient Boosting, the weak learners (e.g., decision trees) are like the members of the group, and they work collaboratively to solve a complex prediction problem. Each weak learner focuses on correcting the errors made by the ensemble up to that point. The ensemble's predictions and the actual target values are compared, and the errors (residuals) are used to train the next weak learner in the next iteration.

The algorithm's name, "Gradient Boosting," comes from the use of gradients (derivatives) of the loss function to optimize the model's predictions. Each new learner is fitted to the negative gradient of the loss with respect to the current predictions, so the ensemble gradually minimizes the loss function by moving in the direction of steepest descent, correcting its mistakes along the way. This iterative process continues until a certain stopping criterion is met, or until the ensemble achieves a satisfactory level of performance.
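
The link between gradients and residuals can be checked numerically. The sketch below is an illustration under assumed example values: it computes the negative gradient of two common losses with respect to the predictions and compares each with a finite-difference estimate. The function names are hypothetical.

```python
import numpy as np

def negative_gradient_squared_error(y, pred):
    # Loss 0.5 * (y - pred)^2  ->  -dL/dpred = y - pred (the ordinary residual).
    return y - pred

def negative_gradient_absolute_error(y, pred):
    # Loss |y - pred|  ->  -dL/dpred = sign(y - pred).
    return np.sign(y - pred)

def numerical_negative_gradient(loss, y, pred, eps=1e-6):
    # Central finite difference of the loss with respect to each prediction.
    return -(loss(y, pred + eps) - loss(y, pred - eps)) / (2 * eps)

def squared_loss(y, pred):
    return 0.5 * (y - pred) ** 2

def absolute_loss(y, pred):
    return np.abs(y - pred)

y = np.array([3.0, -1.0, 2.0])     # hypothetical targets
pred = np.array([2.5, 0.0, 4.0])   # hypothetical current ensemble predictions

# Squared error: the negative gradient is the residual (0.5, -1.0, -2.0).
print(negative_gradient_squared_error(y, pred))
# Absolute error: only the sign of the residual remains (1, -1, -1).
print(negative_gradient_absolute_error(y, pred))

# Both analytic forms agree with the finite-difference estimate.
print(np.allclose(negative_gradient_squared_error(y, pred),
                  numerical_negative_gradient(squared_loss, y, pred), atol=1e-5))
print(np.allclose(negative_gradient_absolute_error(y, pred),
                  numerical_negative_gradient(absolute_loss, y, pred), atol=1e-5))
```

For squared error, "fit the next learner to the residuals" and "fit the next learner to the negative gradient" are the same instruction; for other losses only the second phrasing applies.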

The intuition of teamwork and learning from mistakes makes Gradient Boosting a powerful and effective technique for various machine learning tasks. It can handle complex patterns in the data and provide high accuracy, although it can overfit when too many trees or too large a learning rate are used, so its hyperparameters need tuning. It is one of the most popular and widely used ensemble methods in machine learning.

## Q6. How does Gradient Boosting algorithm build an ensemble of weak learners?

The Gradient Boosting algorithm builds an ensemble of weak learners in an iterative and adaptive manner. The ensemble is constructed step-by-step, and each weak learner is trained to correct the mistakes made by the previous learners. The process can be summarized as follows:

1. Initialization: The ensemble starts with an initial prediction, typically the mean (or median) of the target variable for regression tasks. This initial prediction acts as the starting point for the iterative process.
2. Residual Calculation: The difference between the actual target values and the current prediction of the ensemble is calculated. These residuals represent the errors of the current model on the training data.
3. Training Weak Learner: A new weak learner (often a decision tree) is trained to predict the residuals. The weak learner is fitted to minimize the loss function (e.g., mean squared error) between its predictions and the residuals.
4. Weighted Combination: The predictions of the new weak learner are combined with the current model's predictions. The contribution of the new weak learner is controlled by a learning rate, which scales the impact of the new learner's predictions.
5. Update Residuals: The residuals are updated by subtracting the weighted predictions of the new weak learner from the current residuals. The updated residuals represent the errors that are not yet explained by the current ensemble.
6. Iterative Process: Steps 3 to 5 are repeated for a fixed number of iterations (controlled by the number of weak learners) or until a specific stopping criterion is met.
7. Final Prediction: The final prediction of the ensemble model is the sum of the initial prediction and the weighted predictions of all weak learners.

Each iteration of the Gradient Boosting algorithm focuses on correcting the mistakes made by the ensemble up to that point. The weak learners are trained to specialize in capturing the patterns that the ensemble currently struggles to predict accurately. Because each subsequent learner is trained on the current residuals, the ensemble gradually improves its predictive performance over iterations.
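
The construction can be sketched end to end with decision stumps as the weak learners, recording the training MSE after each learner is added. This is a simplified illustration for one-dimensional inputs; the helper names, the synthetic data, and the settings (50 stumps, learning rate 0.1) are assumptions.

```python
import numpy as np

def fit_stump(x, r):
    """Least-squares stump on 1-D inputs: (threshold, left_value, right_value)."""
    values = np.unique(x)
    best, best_sse = None, np.inf
    for threshold in (values[:-1] + values[1:]) / 2:
        left = x < threshold
        lv, rv = r[left].mean(), r[~left].mean()
        sse = np.sum((r[left] - lv) ** 2) + np.sum((r[~left] - rv) ** 2)
        if sse < best_sse:
            best, best_sse = (threshold, lv, rv), sse
    return best

def stump_predict(stump, x):
    threshold, lv, rv = stump
    return np.where(x < threshold, lv, rv)

def boost(x, y, n_estimators=50, learning_rate=0.1):
    """Build the ensemble one stump at a time; also return the training MSE per stage."""
    f0 = y.mean()                                  # initialization
    prediction = np.full(len(y), f0)
    stumps, history = [], [np.mean((y - prediction) ** 2)]
    for _ in range(n_estimators):
        residuals = y - prediction                 # residual calculation
        stump = fit_stump(x, residuals)            # train weak learner on residuals
        prediction = prediction + learning_rate * stump_predict(stump, x)
        stumps.append(stump)
        history.append(np.mean((y - prediction) ** 2))
    return f0, stumps, history

def ensemble_predict(f0, stumps, x, learning_rate=0.1):
    # Final prediction: initial value plus the scaled sum of all stump outputs.
    return f0 + learning_rate * sum(stump_predict(s, x) for s in stumps)

rng = np.random.default_rng(0)
x = rng.random(60)
y = np.sin(2 * np.pi * x) + rng.normal(0, 0.1, 60)

f0, stumps, history = boost(x, y)
print("training MSE after 0, 10, 50 stumps:", history[0], history[10], history[50])
```

With least-squares stumps and a learning rate between 0 and 1, the training MSE never increases from one stage to the next. Performance on unseen data has to be checked separately.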

## Q7. What are the steps involved in constructing the mathematical intuition of Gradient Boosting algorithm?

Constructing the mathematical intuition of the Gradient Boosting algorithm involves understanding the underlying principles and mathematical concepts that drive its iterative learning process. The key steps involved in constructing the mathematical intuition of Gradient Boosting are as follows:

1. Objective Function: Define the objective function that the algorithm aims to optimize. For regression tasks, the objective function is typically the mean squared error (MSE), while for classification tasks, it can be the cross-entropy loss or other appropriate loss functions.
2. Gradient Descent: Understand the concept of gradient descent, which is a numerical optimization technique used to minimize the objective function. Gradient descent involves computing the gradient (or derivative) of the objective function with respect to the model's parameters. The gradient points in the direction of the steepest increase in the objective function, and gradient descent moves in the opposite direction to find a local minimum. In Gradient Boosting, the gradient is taken with respect to the ensemble's predictions rather than a fixed set of parameters.
3. Weak Learner Fitting: Define the weak learner used in the ensemble (e.g., decision trees or linear models). Compute the negative gradient of the objective function with respect to the current ensemble's predictions; for squared error this is, up to a constant factor, the residual between the actual target values and the current predictions. Train the weak learner on the training data to approximate this negative gradient.
4. Learning Rate: Introduce the learning rate, which controls the step size of each iteration. The learning rate scales the contribution of the weak learner's predictions to the ensemble's overall prediction. A smaller learning rate reduces the impact of each weak learner, making the learning process more conservative.
5. Weighted Combination: Combine the predictions of the weak learner with the current ensemble's predictions, scaled by the learning rate. This weighted combination updates the current prediction of the ensemble.
6. Update Residuals: Update the residuals by subtracting the weighted predictions of the weak learner from the current residuals. The updated residuals represent the errors that are not yet explained by the current ensemble, and they guide the subsequent iterations to focus on difficult-to-predict instances.
7. Iterative Process: Repeat steps 3 to 6 for a fixed number of iterations or until a specific stopping criterion is met. At each iteration, the ensemble's predictions gradually improve as the weak learners learn to correct the mistakes made in the previous iterations.
8. Final Prediction: The final prediction of the Gradient Boosting ensemble is the sum of the initial prediction and the weighted predictions of all weak learners.
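
In symbols, the steps above are commonly written as follows. The notation is introduced here for illustration and is an assumption rather than part of the text above: $L$ is the loss, $F_m$ the ensemble after $m$ weak learners, $h_m$ the $m$-th weak learner, $\eta$ the learning rate, $M$ the number of learners, and $(x_i, y_i)$, $i = 1, \dots, n$, the training data.

$$
\begin{aligned}
F_0(x) &= \arg\min_{c} \sum_{i=1}^{n} L(y_i, c) && \text{(initial prediction)} \\
r_{i,m} &= -\left[ \frac{\partial L\big(y_i, F(x_i)\big)}{\partial F(x_i)} \right]_{F = F_{m-1}} && \text{(negative gradient)} \\
h_m &= \arg\min_{h} \sum_{i=1}^{n} \big( r_{i,m} - h(x_i) \big)^2 && \text{(weak learner fitting)} \\
F_m(x) &= F_{m-1}(x) + \eta \, h_m(x) && \text{(weighted combination)} \\
F_M(x) &= F_0(x) + \eta \sum_{m=1}^{M} h_m(x) && \text{(final prediction)}
\end{aligned}
$$

For the squared-error loss $L(y, F) = \tfrac{1}{2}(y - F)^2$, the initial prediction $F_0$ is the mean of the targets and the negative gradient reduces to the residual, $r_{i,m} = y_i - F_{m-1}(x_i)$, which is why the regression description speaks of fitting residuals.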