## ENSEMBLE LEARNING TECHNIQUES

![alt text](https://media.geeksforgeeks.org/wp-content/uploads/20250516170015848931/Ensemble-learning.webp)

## TYPES OF ENSEMBLES

## 1. BAGGING (BOOTSTRAP AGGREGATING)
- VOTING METHOD FOR A CLASSIFICATION PROBLEM
- Models are trained independently on different random subsets of the training data. Their results are then combined—usually by averaging (for regression) or voting (for classification). This helps reduce variance and prevents overfitting.
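
To make the mechanism concrete, here is a minimal from-scratch sketch of bagging with majority voting. It is an illustration under stated assumptions, not scikit-learn's internal implementation: the names `n_models`, `rng`, and `ensemble_pred` are hypothetical, and the choice of 25 trees is arbitrary.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

rng = np.random.default_rng(42)
n_models = 25  # arbitrary choice for illustration
all_preds = []

for _ in range(n_models):
    # Bootstrap sample: draw len(X_train) row indices with replacement
    idx = rng.integers(0, len(X_train), size=len(X_train))
    tree = DecisionTreeClassifier(random_state=0)
    tree.fit(X_train[idx], y_train[idx])
    all_preds.append(tree.predict(X_test))

all_preds = np.array(all_preds)  # shape: (n_models, n_test_samples)

# Majority vote: pick the most frequent class label in each column
ensemble_pred = np.array(
    [np.bincount(column, minlength=3).argmax() for column in all_preds.T]
)
print("Bagged accuracy:", np.mean(ensemble_pred == y_test))
```

Each tree sees a slightly different resampling of the training rows, so the trees disagree on borderline points and the vote smooths those disagreements out.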

![alt text](https://media.geeksforgeeks.org/wp-content/uploads/20250516170016504785/Bagging.webp)

**A. Importing Libraries and Loading Data**
- Import from scikit-learn:
  - BaggingClassifier: for creating an ensemble of classifiers trained on different subsets of data.
  - DecisionTreeClassifier: the base classifier used in the bagging ensemble.
  - load_iris: to load the Iris dataset for classification.
  - train_test_split: to split the dataset into training and testing subsets.
  - accuracy_score: to evaluate the model’s prediction accuracy.

In [1]:
from sklearn.ensemble import BaggingClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score

**B. Loading and Splitting the Iris Dataset**

- data = load_iris(): loads the Iris dataset, which includes features and target labels.
- X = data.data: extracts the feature matrix (input variables).
- y = data.target: extracts the target vector (class labels).
- train_test_split(...): splits the data into training (80%) and testing (20%) sets, with random_state=42 to ensure reproducibility.

In [2]:
data = load_iris()
X = data.data
y = data.target
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
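
As a quick sanity check on the split, the array shapes can be printed. This is a small self-contained sketch that repeats the loading step so it runs on its own.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split

data = load_iris()
X, y = data.data, data.target
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

print(X.shape)        # (150, 4): 150 flowers, 4 measurements each
print(X_train.shape)  # (120, 4): 80% of the rows
print(X_test.shape)   # (30, 4): 20% of the rows
print(data.target_names)
```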

**C. Create a Base Classifier**
- A decision tree is chosen as the base model. Decision trees are prone to overfitting, especially on small datasets, which makes them good candidates for bagging.
- base_classifier = DecisionTreeClassifier(): initializes a Decision Tree classifier, which will serve as the base estimator in the Bagging ensemble.

In [3]:
base_classifier = DecisionTreeClassifier()
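
One rough way to see the overfitting tendency is to compare a single unconstrained tree's training accuracy with its test accuracy. The sketch below assumes the same Iris split as above; `single_tree`, `train_acc`, and `test_acc` are illustrative names.

```python
from sklearn.datasets import load_iris
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

# No depth limit: the tree keeps splitting until its leaves are pure
single_tree = DecisionTreeClassifier(random_state=42)
single_tree.fit(X_train, y_train)

train_acc = accuracy_score(y_train, single_tree.predict(X_train))
test_acc = accuracy_score(y_test, single_tree.predict(X_test))

print("Tree depth:", single_tree.get_depth())
print("Training accuracy:", train_acc)
print("Test accuracy:", test_acc)
```

A large gap between the two scores would indicate high variance. On a dataset as small and clean as Iris the gap may be small, so treat this as a diagnostic pattern rather than a demonstration.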

**D. Create and Train the Bagging Classifier**
- A BaggingClassifier is created using the decision tree as the base classifier.
- n_estimators = 10 specifies that 10 decision trees will be trained on different bootstrapped subsets of the training data (number of trees can vary).

In [4]:
bagging_classifier = BaggingClassifier(base_classifier, n_estimators=10, random_state=42)
bagging_classifier.fit(X_train, y_train)
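
After fitting, the ensemble can be inspected. The sketch below is a variation on the cell above, not a replacement for it: it assumes 50 trees instead of 10 and turns on `oob_score=True`, so that rows left out of each bootstrap sample provide a built-in validation estimate.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.ensemble import BaggingClassifier
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

bagging = BaggingClassifier(
    DecisionTreeClassifier(),
    n_estimators=50,   # more trees, so each row is likely out-of-bag at least once
    oob_score=True,    # score each row using only trees that did not train on it
    random_state=42,
)
bagging.fit(X_train, y_train)

print("Number of fitted trees:", len(bagging.estimators_))

# Row indices drawn (with replacement) for the first tree
first_sample = bagging.estimators_samples_[0]
print("Distinct rows seen by the first tree:",
      len(np.unique(first_sample)), "of", len(X_train))

print("Out-of-bag accuracy:", bagging.oob_score_)
```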

**E. Make Predictions and Evaluate Accuracy**
- The trained bagging model predicts labels for test data.
- The accuracy of the predictions is calculated by comparing the predicted labels (y_pred) to the actual labels (y_test).

In [5]:
y_pred = bagging_classifier.predict(X_test)
accuracy = accuracy_score(y_test, y_pred)
print("Accuracy:", accuracy)

Accuracy: 1.0
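
A single 30-sample test set gives a coarse estimate, so a perfect score should be read with some caution. As a hedged sketch of a more robust check, 5-fold cross-validation can be run on the full dataset (the choice of `cv=5` is an assumption, not part of the original walkthrough).

```python
from sklearn.datasets import load_iris
from sklearn.ensemble import BaggingClassifier
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)

model = BaggingClassifier(DecisionTreeClassifier(), n_estimators=10, random_state=42)

# Fit and score the model on 5 different train/validation partitions
scores = cross_val_score(model, X, y, cv=5)

print("Fold accuracies:", scores)
print("Mean accuracy:", scores.mean())
```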


## 2. BAGGING (BOOTSTRAP AGGREGATING)
- AVERAGING FOR A REGRESSION PROBLEM
- Averaging method: mainly used for regression problems.
- It builds multiple models independently and returns the average of their predictions.
- In general, the combined output is better than an individual output because variance is reduced.
- In this example, three regression models (linear regression, XGBoost, and random forest) are trained on the same training set and their predictions are averaged. The final prediction is stored in pred_final.
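
The variance claim can be made precise under a simplifying assumption: suppose each of the $M$ models has prediction error with the same variance $\sigma^2$, and every pair of models has error correlation $\rho$. These symbols are introduced here for illustration and do not appear in the original text.

```latex
\operatorname{Var}\!\left(\frac{1}{M}\sum_{m=1}^{M} f_m(x)\right)
  = \rho\,\sigma^2 + \frac{1-\rho}{M}\,\sigma^2
```

With uncorrelated errors ($\rho = 0$) the variance falls to $\sigma^2 / M$. With perfectly correlated errors ($\rho = 1$) averaging gives no reduction at all, which is why combining diverse models tends to help more than combining near-identical ones.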

**A. Import utility libraries and modules**

In [11]:
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error, accuracy_score

# importing machine learning models for prediction (base models)
from sklearn.ensemble import RandomForestRegressor
import xgboost as xgb
from sklearn.linear_model import LinearRegression


**B. Import and split data**

In [None]:
# import the dataset (diamonds)
df = pd.read_excel("diamonds_new.xlsx")

# target column to predict
target = df["price"]

# feature columns: everything except the target
# (columns= is required; df.drop("price") alone would look for a row labelled "price")
train = df.drop(columns="price")

# split the data into training and validation sets
X_train, X_test, y_train, y_test = train_test_split(
    train, target, test_size=0.20)
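
The feature/target separation can be tried on a toy DataFrame without the Excel file. The column names and values below are made up for illustration and are not taken from `diamonds_new.xlsx`.

```python
import pandas as pd

# Toy stand-in for the diamonds data; names and values are hypothetical
toy = pd.DataFrame({
    "carat": [0.23, 0.31, 0.90],
    "depth": [61.5, 63.3, 62.1],
    "price": [326, 335, 2757],
})

target = toy["price"]
features = toy.drop(columns="price")  # equivalent to toy.drop("price", axis=1)

print(list(features.columns))  # ['carat', 'depth']

# Without columns= or axis=1, drop() searches the row index and fails
try:
    toy.drop("price")
except KeyError as exc:
    print("KeyError:", exc)
```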

**C. Initialize and Train Base Models**

In [None]:
# initializing all the model objects with default parameters
model_1 = LinearRegression()
model_2 = xgb.XGBRegressor()
model_3 = RandomForestRegressor()

# training all the models on the training dataset
model_1.fit(X_train, y_train)
model_2.fit(X_train, y_train)
model_3.fit(X_train, y_train)


**D. Predict output on validation dataset**

In [None]:
# predicting the output on the validation dataset
pred_1 = model_1.predict(X_test)
pred_2 = model_2.predict(X_test)
pred_3 = model_3.predict(X_test)

**E. Average the Predictions and Evaluate**

In [None]:
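# final prediction: the average of the predictions of all 3 models
pred_final = (pred_1 + pred_2 + pred_3) / 3.0

# printing the mean squared error between real value and predicted value
print(mean_squared_error(y_test, pred_final))

# accuracy_score applies only to classification labels and raises an error
# on continuous targets, so report R^2 as the regression counterpart
from sklearn.metrics import r2_score
print(r2_score(y_test, pred_final))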
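
scikit-learn also provides `VotingRegressor`, which performs the same unweighted averaging in one object. The sketch below is self-contained and makes two substitutions that should be read as assumptions: it uses synthetic data from `make_regression` instead of the diamonds file, and `GradientBoostingRegressor` stands in for XGBoost so that no extra package is needed.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.ensemble import (
    GradientBoostingRegressor,
    RandomForestRegressor,
    VotingRegressor,
)
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error, r2_score
from sklearn.model_selection import train_test_split

# Synthetic regression data (stand-in for the diamonds dataset)
X, y = make_regression(n_samples=500, n_features=8, noise=15.0, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

voting = VotingRegressor([
    ("lr", LinearRegression()),
    ("gb", GradientBoostingRegressor(random_state=42)),  # swapped in for XGBoost
    ("rf", RandomForestRegressor(random_state=42)),
])
voting.fit(X_train, y_train)
pred_final = voting.predict(X_test)

# The ensemble prediction equals the plain mean of the fitted models' predictions
manual_average = np.mean([m.predict(X_test) for m in voting.estimators_], axis=0)
print("Matches manual average:", np.allclose(pred_final, manual_average))

print("MSE:", mean_squared_error(y_test, pred_final))
print("R^2:", r2_score(y_test, pred_final))
```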


## 3. BOOSTING (SEQUENTIAL) ALGORITHM
- Combines multiple weak learners to create a strong learner.
- Weak models are trained in series so that each new model tries to correct the errors of the previous one, until the training data is predicted correctly or a maximum number of models is reached.
- One of the most well-known boosting algorithms is AdaBoost (Adaptive Boosting).
- Here is an overview of the boosting algorithm:

- Initialize Model Weights: Begin with a single weak learner and assign equal weights to all training examples.
- Train Weak Learner: Train a weak learner on the weighted dataset.
- Sequential Learning: Boosting works by training models sequentially, where each model focuses on correcting the errors of its predecessor. Boosting typically uses a single type of weak learner, such as decision trees.
- Weight Adjustment: Boosting assigns weights to training data points. Misclassified examples receive higher weights in the next iteration so that subsequent models pay more attention to them.

![alt text](https://media.geeksforgeeks.org/wp-content/uploads/20250516170016802150/Boosting.webp)
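
The steps above can be sketched from scratch for the binary case. This is a minimal illustration of the classic discrete AdaBoost update with labels in {-1, +1}, not scikit-learn's implementation (which also handles multiple classes). The synthetic dataset, the 10 rounds, and the names `weights`, `alphas`, and `stumps` are assumptions made for the example.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier

X, y01 = make_classification(n_samples=200, n_features=5, random_state=0)
y = np.where(y01 == 1, 1, -1)  # relabel classes as -1 / +1

n_samples = len(y)
weights = np.full(n_samples, 1.0 / n_samples)  # step 1: equal weights
stumps, alphas = [], []

for _ in range(10):
    # Step 2: train a weak learner on the weighted data
    stump = DecisionTreeClassifier(max_depth=1)
    stump.fit(X, y, sample_weight=weights)
    pred = stump.predict(X)

    # Weighted error rate, clipped to keep the logarithm finite
    err = np.clip(np.sum(weights[pred != y]), 1e-10, 1 - 1e-10)

    # Learner's vote: larger when its weighted error is smaller
    alpha = 0.5 * np.log((1 - err) / err)

    # Step 4: raise weights of misclassified points, lower the rest
    weights = weights * np.exp(-alpha * y * pred)
    weights /= weights.sum()

    stumps.append(stump)
    alphas.append(alpha)

# Final prediction: sign of the alpha-weighted sum of stump votes
scores = sum(a * s.predict(X) for a, s in zip(alphas, stumps))
ensemble_pred = np.where(scores >= 0, 1, -1)
train_accuracy = np.mean(ensemble_pred == y)
print("Training accuracy after 10 rounds:", train_accuracy)
```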


**A. Importing Libraries and Modules**
- AdaBoostClassifier from sklearn.ensemble: for building the AdaBoost ensemble model.
- DecisionTreeClassifier from sklearn.tree: as the base weak learner for AdaBoost.
- load_iris from sklearn.datasets: to load the Iris dataset.
- train_test_split from sklearn.model_selection: to split the dataset into training and testing sets.
- accuracy_score from sklearn.metrics: to evaluate the model’s accuracy.

In [6]:
from sklearn.ensemble import AdaBoostClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score

**B. Load and Split the Dataset**
- data = load_iris(): loads the Iris dataset, which includes features and target labels.
- X = data.data: extracts the feature matrix (input variables).
- y = data.target: extracts the target vector (class labels).
- train_test_split(...): splits the data into training (80%) and testing (20%) sets, with random_state=42 to ensure reproducibility.

In [7]:
data = load_iris()
X = data.data
y = data.target
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

**C. Defining the Weak Learner (Base Model)**
- Create the base classifier as a decision tree with maximum depth 1 (a decision stump). This simple tree will act as a weak learner for the AdaBoost algorithm, which iteratively improves by combining many such weak learners.

In [8]:
base_classifier = DecisionTreeClassifier(max_depth=1)
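
To see why a stump counts as a weak learner here, it can be fitted on its own. With a single split it has two leaves, so on the three-class Iris data it can output at most two distinct classes. The sketch below assumes the same split as above.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

stump = DecisionTreeClassifier(max_depth=1)
stump.fit(X_train, y_train)
stump_pred = stump.predict(X_test)

print("Depth:", stump.get_depth())        # 1
print("Leaves:", stump.get_n_leaves())    # 2
print("Classes it predicts:", np.unique(stump_pred))
print("Stump accuracy alone:", accuracy_score(y_test, stump_pred))
```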

**D. Create and Train the AdaBoost Classifier**
- base_classifier: The weak learner used in boosting.
- n_estimators = 50: Number of weak learners to train sequentially.
- learning_rate = 1.0: Controls the contribution of each weak learner to the final model.
- random_state = 42: Ensures reproducibility.

In [9]:
adaboost_classifier = AdaBoostClassifier(
    base_classifier, n_estimators=50, learning_rate=1.0, random_state=42
)
adaboost_classifier.fit(X_train, y_train)
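
To observe how accuracy changes as weak learners are added, `staged_score` yields the ensemble's score after each boosting round. This is a self-contained sketch with the same hyperparameters as above; the checkpoint values 1, 5, 10, and 50 are arbitrary.

```python
from sklearn.datasets import load_iris
from sklearn.ensemble import AdaBoostClassifier
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

clf = AdaBoostClassifier(
    DecisionTreeClassifier(max_depth=1),
    n_estimators=50,
    learning_rate=1.0,
    random_state=42,
)
clf.fit(X_train, y_train)

# One test-set accuracy per boosting round actually performed
staged = list(clf.staged_score(X_test, y_test))

# Boosting can stop early, so only report checkpoints that exist
checkpoints = [n for n in (1, 5, 10, 50) if n <= len(staged)]
for n in checkpoints:
    print(f"{n} weak learner(s): accuracy = {staged[n - 1]:.3f}")
```

A smaller `learning_rate` shrinks each learner's contribution and usually calls for more estimators, so the two parameters are typically tuned together.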



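**E. Make Predictions and Calculate Accuracy**
- Predict labels for the test data with the trained AdaBoost model.
- Calculate the accuracy of the model by comparing the true labels y_test with the predicted labels y_pred.
- The accuracy_score function returns the proportion of correctly predicted samples.

In [10]:
# predict with the AdaBoost model; otherwise y_pred still holds the bagging predictions
y_pred = adaboost_classifier.predict(X_test)
accuracy = accuracy_score(y_test, y_pred)
print("Accuracy:", accuracy)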

Accuracy: 1.0
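
For reference, here is what `accuracy_score` computes on a tiny hand-made example. The labels below are invented for illustration, and `confusion_matrix` is added to show where the single error falls.

```python
from sklearn.metrics import accuracy_score, confusion_matrix

y_true = [0, 1, 2, 2]
y_hat = [0, 1, 1, 2]  # the third sample is misclassified as class 1

acc = accuracy_score(y_true, y_hat)
cm = confusion_matrix(y_true, y_hat)

print(acc)  # 0.75: three of the four labels match
print(cm)   # rows are true classes, columns are predicted classes
```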
