### Ensemble Methods Using scikit-learn

Ensemble methods are machine learning techniques that combine the predictions of multiple models to improve on the performance of any single model. The idea is that the errors of individual models partly cancel out when their predictions are combined, so the ensemble tends to be more robust and less susceptible to overfitting than its members.

The sections below cover the most common types of ensemble methods:

### Bagging (Bootstrap Aggregating)


Bagging, short for bootstrap aggregating, is an ensemble method that combines the predictions of multiple models, each trained on a different random subset of the data. Because each model sees a slightly different training set, combining their predictions reduces variance and makes the ensemble less susceptible to overfitting than any individual model.

The basic procedure for bagging is as follows:

- A dataset is randomly sampled with replacement to create multiple new datasets, called bootstrap samples.
- A model is trained on each bootstrap sample.
- The predictions of the individual models are combined to make a final prediction.

In classification problems, the final prediction is typically made by majority voting. In regression problems, the final prediction is typically made by averaging the predictions of the individual models.
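The procedure above can be sketched with the standard library alone. This is a minimal illustration rather than how scikit-learn implements bagging; the helper names and the per-model predictions are hypothetical, and no real model is trained.

```python
import random
from collections import Counter


def bootstrap_sample(rows, rng):
    """Draw len(rows) items from rows with replacement."""
    return [rng.choice(rows) for _ in rows]


def majority_vote(labels):
    """Return the most common label (ties go to the label seen first)."""
    return Counter(labels).most_common(1)[0][0]


def average(values):
    """Return the arithmetic mean of numeric predictions."""
    return sum(values) / len(values)


rng = random.Random(0)
data = list(range(10))

# Step 1: each bootstrap sample has the same size as the original data,
# but some rows repeat and others are left out.
samples = [bootstrap_sample(data, rng) for _ in range(3)]

# Step 2 (training one model per sample) is omitted in this sketch.

# Step 3: combine per-model predictions (hypothetical values for one test row).
class_predictions = ["cat", "dog", "cat"]
regression_predictions = [2.0, 3.0, 4.0]
final_class = majority_vote(class_predictions)
final_value = average(regression_predictions)
```

Here `final_class` is the label chosen by two of the three hypothetical models, and `final_value` is the mean of the three hypothetical regression outputs.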

Bagging can be applied to any type of model, and it is particularly useful for high-variance models such as decision trees. A related technique, the random subspace method, samples features rather than rows and is often used for high-dimensional datasets.

Bagging is a simple but powerful ensemble method that improves a model's performance mainly by reducing the variance of its predictions. However, keep in mind that bagging can be computationally expensive, since it trains many models instead of one, and that it works best when there is enough data for each bootstrap sample to be representative.
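The variance reduction can be illustrated with a small simulation. This sketch assumes each model's prediction is the true value plus independent Gaussian noise, which is an idealization: real bagged models are correlated, so the reduction in practice is smaller. All names and constants here are hypothetical.

```python
import random
import statistics

rng = random.Random(42)
TRUE_VALUE = 5.0
N_MODELS = 10
N_TRIALS = 2000


def noisy_prediction():
    """One hypothetical model: the true value plus unit-variance Gaussian noise."""
    return TRUE_VALUE + rng.gauss(0.0, 1.0)


# Predictions from a single model, repeated over many trials.
single = [noisy_prediction() for _ in range(N_TRIALS)]

# Predictions from an ensemble that averages N_MODELS independent models.
averaged = [
    statistics.fmean([noisy_prediction() for _ in range(N_MODELS)])
    for _ in range(N_TRIALS)
]

var_single = statistics.pvariance(single)
var_averaged = statistics.pvariance(averaged)
print(f"variance of one model: {var_single:.3f}")
print(f"variance of the {N_MODELS}-model average: {var_averaged:.3f}")
```

Under the independence assumption, the variance of the average is roughly the single-model variance divided by the number of models.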

Bagging can be implemented in scikit-learn using the BaggingClassifier or BaggingRegressor classes, depending on whether the task is classification or regression.


Here is an example of how to use the BaggingClassifier in scikit-learn:


In [2]:
from sklearn.ensemble import BaggingClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split

# Load the iris dataset
iris = load_iris()
X, y = iris.data, iris.target

# Split the data into training and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Create a decision tree classifier
base_estimator = DecisionTreeClassifier()

# Create a bagging classifier. n_estimators sets the number of base estimators in the ensemble,
# max_samples sets the proportion of samples used to fit each base estimator, and
# max_features sets the proportion of features used to fit each base estimator.
# The base estimator is passed positionally because its keyword was renamed from
# base_estimator to estimator in scikit-learn 1.2.
bagging = BaggingClassifier(base_estimator, n_estimators=10, max_samples=0.8, max_features=0.8)

# Fit the bagging classifier on the training data
bagging.fit(X_train, y_train)

# Make predictions on the test data
y_pred = bagging.predict(X_test)

# Evaluate the performance of the ensemble
from sklearn.metrics import accuracy_score
print(accuracy_score(y_test, y_pred))

1.0


Here is an example of how to use the BaggingRegressor in scikit-learn:

In [3]:
from sklearn.ensemble import BaggingRegressor
from sklearn.tree import DecisionTreeRegressor
from sklearn.datasets import load_boston
from sklearn.model_selection import train_test_split

# Load the boston dataset
boston = load_boston()
X, y = boston.data, boston.target

# Split the data into training and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Create a decision tree regressor
base_estimator = DecisionTreeRegressor()

# Create a bagging regressor. n_estimators sets the number of base estimators in the ensemble,
# max_samples sets the proportion of samples used to fit each base estimator, and
# max_features sets the proportion of features used to fit each base estimator.
# The base estimator is passed positionally because its keyword was renamed from
# base_estimator to estimator in scikit-learn 1.2.
bagging = BaggingRegressor(base_estimator, n_estimators=10, max_samples=0.8, max_features=0.8)

# Fit the bagging regressor on the training data
bagging.fit(X_train, y_train)

# Make predictions on the test data
y_pred = bagging.predict(X_test)

# Evaluate the performance of the ensemble
from sklearn.metrics import mean_squared_error
print(mean_squared_error(y_test, y_pred))

10.271417647058824



    The Boston housing prices dataset has an ethical problem. You can refer to
    the documentation of this function for further details.

    The scikit-learn maintainers therefore strongly discourage the use of this
    dataset unless the purpose of the code is to study and educate about
    ethical issues in data science and machine learning.

    In this special case, you can fetch the dataset from the original
    source::

        import pandas as pd
        import numpy as np


        data_url = "http://lib.stat.cmu.edu/datasets/boston"
        raw_df = pd.read_csv(data_url, sep="\s+", skiprows=22, header=None)
        data = np.hstack([raw_df.values[::2, :], raw_df.values[1::2, :2]])
        target = raw_df.values[1::2, 2]

    Alternative datasets include the California housing dataset (i.e.
    :func:`~sklearn.datasets.fetch_california_housing`) and the Ames housing
    dataset. You can load the datasets as follows::

        from sklearn.datasets import fetch_california_h

### Boosting

Boosting is an ensemble method that builds a strong model from weak ones by training them sequentially: each new model concentrates on the observations that the previous models got wrong, for example by increasing the weights of misclassified observations. The predictions of all the weak models are then combined into a single model that is more accurate than any individual weak model.

The two best-known families of boosting algorithms are:

__Adaptive Boosting (AdaBoost):__ AdaBoost is one of the earliest and most popular boosting algorithms, introduced by Freund and Schapire in the mid-1990s. It works by iteratively fitting a weak model, such as a decision tree with a small number of splits, and then increasing the weights of the observations that the previous weak model misclassified so that the next model focuses on them.

__Gradient Boosting:__ Gradient boosting was proposed by Friedman in 1999. Instead of adjusting the weights of the observations, it fits each new weak model to the negative gradient of the loss function with respect to the current predictions, which amounts to gradient descent on the loss. Popular implementations include XGBoost and LightGBM.

Boosting algorithms are considered powerful algorithms that can be used to improve the performance of a model and reduce the bias of the predictions. However, it's important to keep in mind that boosting algorithms can also be computationally expensive and require a lot of data. Additionally, boosting algorithms can also be sensitive to noise and overfitting if the weak model is too complex or if the number of iterations is too large.
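To make the AdaBoost reweighting step concrete, here is one round of the classic binary formulation worked by hand. The labels and the weak learner's predictions are hypothetical, and scikit-learn's AdaBoostClassifier differs in its details, so treat this as a sketch of the idea rather than the library's implementation.

```python
import math

# Hypothetical labels and weak-learner predictions, encoded as -1 / +1.
y_true = [1, 1, -1, -1, 1]
y_pred = [1, -1, -1, -1, 1]   # the weak learner misclassifies index 1

n = len(y_true)
weights = [1.0 / n] * n       # start with uniform weights

# Weighted error of the weak learner: total weight of misclassified points.
error = sum(w for w, t, p in zip(weights, y_true, y_pred) if t != p)

# Vote strength of this learner: larger when the error is smaller.
alpha = 0.5 * math.log((1.0 - error) / error)

# Increase weights of misclassified points, decrease the rest, then renormalize.
weights = [w * math.exp(-alpha * t * p) for w, t, p in zip(weights, y_true, y_pred)]
total = sum(weights)
weights = [w / total for w in weights]

print(f"error={error:.2f}, alpha={alpha:.3f}")
print("updated weights:", [round(w, 3) for w in weights])
```

After the update, the single misclassified observation carries half of the total weight, so the next weak model is pushed to classify it correctly.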

Here is an example of how to use the AdaBoostClassifier in scikit-learn:

In [4]:
from sklearn.ensemble import AdaBoostClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split

# Load the iris dataset
iris = load_iris()
X, y = iris.data, iris.target

# Split the data into training and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Create a decision tree classifier
base_estimator = DecisionTreeClassifier()

# Create an AdaBoost classifier. n_estimators sets the maximum number of base estimators
# in the ensemble, and learning_rate scales the contribution of each classifier.
# The base estimator is passed positionally because its keyword was renamed from
# base_estimator to estimator in scikit-learn 1.2.
ada_boost = AdaBoostClassifier(base_estimator, n_estimators=50, learning_rate=1)

# Fit the AdaBoost classifier on the training data
ada_boost.fit(X_train, y_train)

# Make predictions on the test data
y_pred = ada_boost.predict(X_test)

# Evaluate the performance of the ensemble
from sklearn.metrics import accuracy_score
print(accuracy_score(y_test, y_pred))

1.0


Here is an example of how to use the AdaBoostRegressor in scikit-learn:

In [5]:
from sklearn.ensemble import AdaBoostRegressor
from sklearn.tree import DecisionTreeRegressor
from sklearn.datasets import load_boston
from sklearn.model_selection import train_test_split

# Load the boston dataset
boston = load_boston()
X, y = boston.data, boston.target

# Split the data into training and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Create a decision tree regressor
base_estimator = DecisionTreeRegressor()

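# Create an AdaBoost regressor. The base estimator is passed positionally because its
# keyword was renamed from base_estimator to estimator in scikit-learn 1.2.
ada_boost = AdaBoostRegressor(base_estimator, n_estimators=50, learning_rate=1)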

# Fit the AdaBoost regressor on the training data
ada_boost.fit(X_train, y_train)

# Make predictions on the test data
y_pred = ada_boost.predict(X_test)

# Evaluate the performance of the ensemble
from sklearn.metrics import mean_squared_error
print(mean_squared_error(y_test, y_pred))


11.084999999999996




### Gradient Boosting

Gradient boosting is a boosting technique that uses gradient descent to minimize the loss function of the ensemble as a whole. Popular implementations include XGBoost and LightGBM.

Gradient boosting works by iteratively fitting a weak model, such as a shallow decision tree, to the residuals of the current ensemble. For squared-error loss, the residuals are the differences between the true values and the current predictions; for other losses, the negative gradient of the loss plays the same role. Each new model is added to the ensemble, usually scaled by a learning rate, and the process is repeated until a certain number of models is reached or the improvement in the loss function falls below a certain threshold.
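The residual-fitting loop can be sketched from scratch for squared-error loss. This is a simplified illustration on hypothetical 1-D data using one-split trees (stumps), not the algorithm used by GradientBoostingRegressor; `fit_stump`, `mse`, and `predict` are helper names invented for this example.

```python
def fit_stump(xs, residuals):
    """Find the 1-D threshold split that minimizes squared error on the residuals."""
    best = None
    for threshold in sorted(set(xs))[:-1]:
        left = [r for x, r in zip(xs, residuals) if x <= threshold]
        right = [r for x, r in zip(xs, residuals) if x > threshold]
        left_mean = sum(left) / len(left)
        right_mean = sum(right) / len(right)
        sse = sum((r - left_mean) ** 2 for r in left)
        sse += sum((r - right_mean) ** 2 for r in right)
        if best is None or sse < best[0]:
            best = (sse, threshold, left_mean, right_mean)
    _, threshold, left_mean, right_mean = best
    return lambda x: left_mean if x <= threshold else right_mean


def mse(ys, preds):
    """Mean squared error between targets and predictions."""
    return sum((y - p) ** 2 for y, p in zip(ys, preds)) / len(ys)


# Toy 1-D regression data (hypothetical values).
xs = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0]
ys = [1.2, 1.9, 3.2, 3.8, 5.1, 6.3, 6.8, 8.1]

learning_rate = 0.5
n_rounds = 20

# Start from a constant prediction: the mean of the targets.
base = sum(ys) / len(ys)
preds = [base] * len(ys)
stumps = []
history = [mse(ys, preds)]

for _ in range(n_rounds):
    # For squared error, the negative gradient is just the residual.
    residuals = [y - p for y, p in zip(ys, preds)]
    stump = fit_stump(xs, residuals)
    stumps.append(stump)
    # Add a shrunken copy of the new model to the ensemble.
    preds = [p + learning_rate * stump(x) for p, x in zip(preds, xs)]
    history.append(mse(ys, preds))


def predict(x):
    """Ensemble prediction: the base value plus the scaled sum of all stumps."""
    return base + learning_rate * sum(s(x) for s in stumps)


print(f"training MSE before boosting: {history[0]:.4f}")
print(f"training MSE after {n_rounds} rounds: {history[-1]:.4f}")
```

The training error recorded in `history` never increases from one round to the next, which is why the number of rounds has to be limited by other means to avoid overfitting.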

The main advantage of gradient boosting is that it scales to large datasets and high-dimensional feature spaces, especially with optimized libraries. It can also be used for both regression and classification problems, since it works with any differentiable loss function.

The main disadvantage of gradient boosting is its sensitivity to overfitting if the number of iterations is too high or the weak model is too complex. Additionally, gradient boosting can also be computationally expensive and require a lot of memory.
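One common guard against running too many iterations is early stopping on a validation set. The sketch below assumes a list of per-round validation losses is already available; the function name, the `patience` parameter, and the loss values are hypothetical and not tied to any particular library's API.

```python
def early_stopping_round(val_losses, patience):
    """Return the index of the best round seen before stopping.

    Training stops once `patience` consecutive rounds fail to improve
    on the best validation loss observed so far.
    """
    best_round, best_loss = 0, float("inf")
    for i, loss in enumerate(val_losses):
        if loss < best_loss:
            best_round, best_loss = i, loss
        elif i - best_round >= patience:
            break
    return best_round


# Hypothetical validation losses, one per boosting round.
val_losses = [0.90, 0.70, 0.55, 0.50, 0.52, 0.51, 0.53, 0.49]

best_short = early_stopping_round(val_losses, patience=3)
best_long = early_stopping_round(val_losses, patience=5)
print(best_short, best_long)
```

With a patience of 3 the loop stops before it reaches the late improvement in the final round, while a patience of 5 finds it; the choice trades training time against the risk of stopping too early.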

You can use GradientBoostingRegressor or GradientBoostingClassifier from sklearn.ensemble to implement gradient boosting in your project.


Several libraries implement gradient boosting; the most popular ones are:

__XGBoost (eXtreme Gradient Boosting):__ XGBoost is an optimized version of gradient boosting that is designed to be more efficient and faster than other gradient boosting libraries. It was developed by Tianqi Chen and was first released in 2014. XGBoost is known for its good performance on large datasets and high-dimensional feature spaces.

__LightGBM:__ LightGBM is another gradient boosting library that is designed to be more efficient and faster than other gradient boosting libraries. It was developed by Microsoft and was first released in 2016. LightGBM is known for its good performance on large datasets and high-dimensional feature spaces and for its ability to handle categorical features without one-hot encoding.

Here is an example of how to use the GradientBoostingClassifier in scikit-learn:

In [7]:
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split

# Load the iris dataset
iris = load_iris()
X, y = iris.data, iris.target

# Split the data into training and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Create a Gradient Boosting classifier
gb = GradientBoostingClassifier(n_estimators=100, learning_rate=0.1, max_depth=1)

# Fit the Gradient Boosting classifier on the training data
gb.fit(X_train, y_train)

# Make predictions on the test data
y_pred = gb.predict(X_test)

# Evaluate the performance of the ensemble
from sklearn.metrics import accuracy_score
print(accuracy_score(y_test, y_pred))


1.0


Here is an example of how to use the GradientBoostingRegressor in scikit-learn:

In [6]:
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.datasets import load_boston
from sklearn.model_selection import train_test_split

# Load the boston dataset
boston = load_boston()
X, y = boston.data, boston.target

# Split the data into training and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Create a Gradient Boosting regressor
gb = GradientBoostingRegressor(n_estimators=100, learning_rate=0.1, max_depth=1)

# Fit the Gradient Boosting regressor on the training data
gb.fit(X_train, y_train)

# Make predictions on the test data
y_pred = gb.predict(X_test)

# Evaluate the performance of the ensemble
from sklearn.metrics import mean_squared_error
print(mean_squared_error(y_test, y_pred))

14.52681979717445




### XGBoost

XGBoost (eXtreme Gradient Boosting) is an open-source library for gradient boosting that is designed to be more efficient and faster than other gradient boosting libraries. XGBoost was developed by Tianqi Chen and was first released in 2014. It is widely used in machine learning competitions and has been a go-to tool for many data scientists.

XGBoost has several advantages over other gradient boosting libraries:

- It is designed to handle large datasets and high-dimensional feature spaces.
- It supports "out-of-core" computation, which allows it to handle data that is too large to fit in memory.
- It supports parallel processing, which makes it faster than other gradient boosting libraries.
- It has a built-in regularization mechanism that helps prevent overfitting.
- It includes several features such as early stopping, cross-validation, and automatic handling of missing values.
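The effect of the built-in regularization can be illustrated with the closed-form leaf value from the XGBoost paper's derivation, where a leaf's value is the negative sum of gradients divided by the sum of Hessians plus an L2 penalty. The sketch below is not XGBoost code: `leaf_weight` is a hypothetical helper, the data are made up, and squared-error loss is assumed.

```python
def leaf_weight(gradients, hessians, reg_lambda):
    """Leaf value -G / (H + lambda), where G and H are gradient and Hessian sums."""
    return -sum(gradients) / (sum(hessians) + reg_lambda)


# Hypothetical leaf containing three rows, with current predictions of zero.
# For the loss 0.5 * (y - pred)^2, the gradient is pred - y and the Hessian is 1.
y = [3.0, 4.0, 5.0]
pred = [0.0, 0.0, 0.0]
g = [p - t for p, t in zip(pred, y)]
h = [1.0] * len(y)

w_unregularized = leaf_weight(g, h, reg_lambda=0.0)
w_regularized = leaf_weight(g, h, reg_lambda=1.0)
print(w_unregularized, w_regularized)
```

Without the penalty the leaf value equals the mean residual; with the penalty it is shrunk toward zero, which limits how far any single tree can move the prediction.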

Here is an example of how to use the XGBClassifier, which follows the scikit-learn estimator API:

In [13]:
from xgboost import XGBClassifier
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split

# Load the iris dataset
iris = load_iris()
X, y = iris.data, iris.target

# Split the data into training and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Create a XGBoost classifier
xgb_clf = XGBClassifier(n_estimators=100, learning_rate=0.1, max_depth=1)

# Fit the XGBoost classifier on the training data
xgb_clf.fit(X_train, y_train)

# Make predictions on the test data
y_pred = xgb_clf.predict(X_test)

# Evaluate the performance of the ensemble
from sklearn.metrics import accuracy_score
print(accuracy_score(y_test, y_pred))


1.0


Here is an example of how to use the XGBRegressor, which follows the scikit-learn estimator API:

In [12]:
from xgboost import XGBRegressor
from sklearn.datasets import load_boston
from sklearn.model_selection import train_test_split

# Load the boston dataset
boston = load_boston()
X, y = boston.data, boston.target

# Split the data into training and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Create a XGBoost regressor
xgb_reg = XGBRegressor(n_estimators=100, learning_rate=0.1, max_depth=1)

# Fit the XGBoost regressor on the training data
xgb_reg.fit(X_train, y_train)

# Make predictions on the test data
y_pred = xgb_reg.predict(X_test)

# Evaluate the performance of the ensemble
from sklearn.metrics import mean_squared_error
print(mean_squared_error(y_test, y_pred))


14.398599735606238




### LightGBM

LightGBM is another popular gradient boosting library, designed to be fast and memory-efficient. It uses a technique called "gradient-based one-side sampling" (GOSS) to reduce the amount of data used in the tree-growing process, which speeds up training. Additionally, LightGBM grows trees "leaf-wise", always splitting the leaf with the largest loss reduction. This can give a lower loss than level-wise growth for the same number of leaves, but it may produce deeper trees that overfit on small datasets.
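The sampling idea can be sketched as follows: keep the rows with the largest absolute gradients, draw a random subset of the remaining rows, and up-weight that subset to compensate for the rows that were dropped. This is a simplified stand-alone illustration, not LightGBM's implementation; `goss_sample` and the gradient values are hypothetical.

```python
import random


def goss_sample(gradients, top_rate, other_rate, rng):
    """Return (index, weight) pairs chosen by a GOSS-style rule."""
    n = len(gradients)
    # Rank rows by absolute gradient, largest first.
    order = sorted(range(n), key=lambda i: abs(gradients[i]), reverse=True)
    n_top = round(top_rate * n)
    n_other = round(other_rate * n)
    top = order[:n_top]
    rest = order[n_top:]
    # Randomly keep a fraction of the small-gradient rows.
    sampled = rng.sample(rest, n_other)
    # Up-weight the sampled rows to compensate for the ones that were dropped.
    amplify = (1.0 - top_rate) / other_rate
    return [(i, 1.0) for i in top] + [(i, amplify) for i in sampled]


rng = random.Random(0)
gradients = [0.9, -0.1, 0.05, -0.8, 0.2, 0.02, -0.6, 0.1, -0.03, 0.4]
selected = goss_sample(gradients, top_rate=0.2, other_rate=0.3, rng=rng)

for index, weight in selected:
    print(f"row {index}: gradient={gradients[index]:+.2f}, weight={weight:.2f}")
```

Only half of the ten rows are used in this example, yet the rows that the current model fits worst are always kept.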

Here is an example of how to use the LGBMClassifier, which follows the scikit-learn estimator API:

In [17]:
from lightgbm import LGBMClassifier
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split

# Load the iris dataset
iris = load_iris()
X, y = iris.data, iris.target

# Split the data into training and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Create a LightGBM classifier
lgb_clf = LGBMClassifier(n_estimators=100, learning_rate=0.1, max_depth=1)

# Fit the LightGBM classifier on the training data
lgb_clf.fit(X_train, y_train)

# Make predictions on the test data
y_pred = lgb_clf.predict(X_test)

# Evaluate the performance of the ensemble
from sklearn.metrics import accuracy_score
print(accuracy_score(y_test, y_pred))


1.0


Here is an example of how to use the LGBMRegressor, which follows the scikit-learn estimator API:

In [16]:
from lightgbm import LGBMRegressor
from sklearn.datasets import load_boston
from sklearn.model_selection import train_test_split

# Load the boston dataset
boston = load_boston()
X, y = boston.data, boston.target

# Split the data into training and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Create a LightGBM regressor
lgb_reg = LGBMRegressor(n_estimators=100, learning_rate=0.1, max_depth=1)

# Fit the LightGBM regressor on the training data
lgb_reg.fit(X_train, y_train)

# Make predictions on the test data
y_pred = lgb_reg.predict(X_test)

# Evaluate the performance of the ensemble
from sklearn.metrics import mean_squared_error
print(mean_squared_error(y_test, y_pred))


14.33363981900555




In [15]:
pip install lightgbm



### Random Forest ensemble method

Random Forest is an ensemble machine learning method that is used for both classification and regression problems. It applies the idea of bagging to decision trees: many trees are trained and their predictions are combined.

Each tree is trained on a bootstrap sample of the training data, and at each split only a random subset of the features is considered. The final prediction is made by majority vote across the trees for classification and by averaging their predictions for regression. The randomness in both the rows and the features makes the individual trees less correlated with one another, which helps to reduce overfitting.
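The two sources of randomness can be sketched without training any trees. The helper names below are hypothetical, and the square-root rule for the feature subset size is an assumption (it is a common choice for classification), so this is only an illustration of what varies from tree to tree.

```python
import math
import random


def bootstrap_indices(n_rows, rng):
    """Row indices for one tree, drawn with replacement."""
    return [rng.randrange(n_rows) for _ in range(n_rows)]


def candidate_features(n_features, rng):
    """Random feature subset considered at one split (square-root rule)."""
    k = max(1, int(math.sqrt(n_features)))
    return sorted(rng.sample(range(n_features), k))


rng = random.Random(7)
n_rows, n_features, n_trees = 8, 4, 3

for tree_id in range(n_trees):
    rows = bootstrap_indices(n_rows, rng)
    features = candidate_features(n_features, rng)
    print(f"tree {tree_id}: rows={rows}, features at first split={features}")
```

Each tree sees a different multiset of rows, and each split inside a tree would draw a fresh feature subset in the same way.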

Here is an example of how to use the RandomForestClassifier in scikit-learn:

In [18]:
from sklearn.ensemble import RandomForestClassifier
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split

# Load the iris dataset
iris = load_iris()
X, y = iris.data, iris.target

# Split the data into training and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Create a Random Forest classifier
rf_clf = RandomForestClassifier(n_estimators=100, max_depth=1)

# Fit the Random Forest classifier on the training data
rf_clf.fit(X_train, y_train)

# Make predictions on the test data
y_pred = rf_clf.predict(X_test)

# Evaluate the performance of the ensemble
from sklearn.metrics import accuracy_score
print(accuracy_score(y_test, y_pred))


0.9666666666666667


Here is an example of how to use the RandomForestRegressor in scikit-learn:

In [19]:
from sklearn.ensemble import RandomForestRegressor
from sklearn.datasets import load_boston
from sklearn.model_selection import train_test_split

# Load the boston dataset
boston = load_boston()
X, y = boston.data, boston.target

# Split the data into training and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Create a Random Forest regressor
rf_reg = RandomForestRegressor(n_estimators=100, max_depth=1)

# Fit the Random Forest regressor on the training data
rf_reg.fit(X_train, y_train)

# Make predictions on the test data
y_pred = rf_reg.predict(X_test)

# Evaluate the performance of the ensemble
from sklearn.metrics import mean_squared_error
print(mean_squared_error(y_test, y_pred))


34.49386518215856




##### Md. Ashiqur Rahman
##### Thank You