
In [1]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

Hard Voting

In [2]:
from sklearn.datasets import make_moons # importing make_moons dataset for demonstration
from sklearn.ensemble import RandomForestClassifier, VotingClassifier #Importing classifiers
from sklearn.linear_model import LogisticRegression  # import logistic regression model
from sklearn.model_selection import train_test_split # importing train_test_split for data splitting
from sklearn.svm import SVC  # importing support vector machine classifier

# Generating a synthetic two-class dataset with make_moons
X, y = make_moons(n_samples=500, noise=0.3, random_state=42)

# Splitting the dataset into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

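# Creating a VotingClassifier that combines logistic regression, a random forest, and an SVC
# (voting='hard' is the default, so the ensemble predicts the majority class)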

voting_clf = VotingClassifier(
    estimators=[('lr', LogisticRegression(random_state=42)),
     ('rf', RandomForestClassifier(random_state=42)),
      ('svc', SVC(random_state=42))
      ]
)

voting_clf.fit(X_train, y_train)



In [3]:
# Accessing the named_estimators_
voting_clf.named_estimators_

{'lr': LogisticRegression(random_state=42),
 'rf': RandomForestClassifier(random_state=42),
 'svc': SVC(random_state=42)}

In [4]:
# Scoring each fitted base estimator on the test set
for name, clf in voting_clf.named_estimators_.items():
    print(name, '=', clf.score(X_test, y_test))

lr = 0.864
rf = 0.896
svc = 0.896


In [5]:
voting_clf.predict(X_test[:1])

array([1])

In [6]:
y_test[:1]

array([1])

In [7]:
voting_clf.estimators

[('lr', LogisticRegression(random_state=42)),
 ('rf', RandomForestClassifier(random_state=42)),
 ('svc', SVC(random_state=42))]

In [8]:
[clf.predict(X_test[:1]) for clf in voting_clf.estimators_]

[array([1]), array([1]), array([0])]
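
The two-to-one vote above is exactly what hard voting counts. As a minimal sketch, the snippet below rebuilds the same ensemble so that it runs on its own and reproduces the majority vote by hand. The helper name `majority_vote` is an assumption introduced here for illustration; it is not part of scikit-learn.

```python
import numpy as np
from sklearn.datasets import make_moons
from sklearn.ensemble import RandomForestClassifier, VotingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

# Same data and ensemble as in the cells above
X, y = make_moons(n_samples=500, noise=0.3, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

voting_clf = VotingClassifier(
    estimators=[('lr', LogisticRegression(random_state=42)),
                ('rf', RandomForestClassifier(random_state=42)),
                ('svc', SVC(random_state=42))]
)
voting_clf.fit(X_train, y_train)


def majority_vote(estimators, X):
    """Hypothetical helper: return the most frequent predicted label per sample."""
    # Rows are estimators, columns are samples
    votes = np.array([est.predict(X) for est in estimators])
    # np.bincount counts the votes per class; argmax picks the winner
    return np.array([np.bincount(column).argmax() for column in votes.T])


manual = majority_vote(voting_clf.estimators_, X_test)
agrees = np.array_equal(manual, voting_clf.predict(X_test))
print(agrees)
```

With three estimators and two classes a tie cannot occur, so the manual vote and `voting_clf.predict` should agree on every test instance.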

In [9]:
voting_clf.score(X_test, y_test)

0.912

Soft Voting

In [10]:
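# Switching the VotingClassifier to soft voting, which averages predicted class probabilities,
# so the SVC must be configured to produce probability estimates
voting_clf.voting = 'soft'
# named_estimators (no trailing underscore) holds the original, unfitted estimators;
# named_estimators_ holds the fitted clones, which are rebuilt from the originals on the next fit()
voting_clf.named_estimators['svc'].probability = True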
voting_clf.fit(X_train, y_train)

#evaluating the VotingClassifier with soft voting on the test set
voting_clf.score(X_test, y_test)

0.92
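
Soft voting averages the class probabilities of the base estimators and predicts the class with the highest mean probability. The following sketch is self-contained, so it builds a soft-voting ensemble directly instead of modifying the one above, and then checks the averaging by hand. The variable names `mean_proba` and `manual` are assumptions introduced for illustration.

```python
import numpy as np
from sklearn.datasets import make_moons
from sklearn.ensemble import RandomForestClassifier, VotingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

X, y = make_moons(n_samples=500, noise=0.3, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

soft_clf = VotingClassifier(
    estimators=[('lr', LogisticRegression(random_state=42)),
                ('rf', RandomForestClassifier(random_state=42)),
                ('svc', SVC(probability=True, random_state=42))],
    voting='soft'
)
soft_clf.fit(X_train, y_train)

# Average the per-estimator probabilities: shape (n_samples, n_classes)
mean_proba = np.mean(
    [est.predict_proba(X_test) for est in soft_clf.estimators_], axis=0
)
# Predict the class whose mean probability is highest
manual = soft_clf.classes_[mean_proba.argmax(axis=1)]

same_proba = np.allclose(mean_proba, soft_clf.predict_proba(X_test))
same_labels = np.array_equal(manual, soft_clf.predict(X_test))
print(same_proba, same_labels)
```

Unlike hard voting, a single very confident estimator can outweigh two mildly confident ones here, which is one reason soft voting can score differently on the same data.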

Bagging and Pasting
Bagging (Bootstrap Aggregating)
Samples with replacement: Bagging draws multiple bootstrap samples (random samples with replacement) from the training set, so the same instance can appear several times in one sample.

Trains the same algorithm on each sample: Each bootstrap sample is used to train a separate instance of the same base model. The ensemble then predicts by majority vote (classification) or by averaging (regression).

Pasting
Samples without replacement: Pasting also draws multiple random samples, but without replacement, so an instance appears at most once in any given sample. The same instance can still appear in several different samples.

Trains the same algorithm on each sample: As in bagging, each sample is used to train a separate instance of the same base model.

Both techniques train copies of one model on different random subsets of the same training data; the only difference is whether the sampling is done with or without replacement.

The diversity among the resulting models reduces the variance of the ensemble, which curbs overfitting and improves generalization.
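
The difference between the two sampling schemes can be sketched in a few lines. The first part draws indices with NumPy; the toy sizes (10 instances, samples of 6) are assumptions chosen only for readability. The second part uses the `bootstrap` parameter of scikit-learn's `BaggingClassifier`, where `bootstrap=False` switches from bagging to pasting; the choice of 100 estimators is likewise arbitrary.

```python
import numpy as np
from sklearn.datasets import make_moons
from sklearn.ensemble import BaggingClassifier
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Part 1: the two sampling schemes on a toy index set
rng = np.random.default_rng(42)
indices = np.arange(10)
bagging_sample = rng.choice(indices, size=6, replace=True)   # duplicates allowed
pasting_sample = rng.choice(indices, size=6, replace=False)  # every index distinct
print(bagging_sample, pasting_sample)

# Part 2: the same switch in scikit-learn is the bootstrap parameter
X, y = make_moons(n_samples=500, noise=0.3, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

scores = {}
for name, bootstrap in (("bagging", True), ("pasting", False)):
    clf = BaggingClassifier(DecisionTreeClassifier(),
                            n_estimators=100,
                            max_samples=100,
                            bootstrap=bootstrap,
                            random_state=42)
    clf.fit(X_train, y_train)
    scores[name] = clf.score(X_test, y_test)

print(scores)
```

Which of the two scores higher depends on the dataset and the random seed, so the comparison is best treated as an experiment and not as a rule.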

In [11]:
from sklearn.ensemble import BaggingClassifier
from sklearn.tree import DecisionTreeClassifier

#Creating a BaggingClassifier of 500 decision trees, each trained on 100 instances sampled with replacement
bag_clf = BaggingClassifier(DecisionTreeClassifier(),
                            n_estimators=500,
                            max_samples=100,
                            n_jobs=-1,
                            random_state=42)

#fitting the BaggingClassifier on the training data
bag_clf.fit(X_train, y_train)

In [12]:
# evaluating the baggingClassifier on the test set
bag_clf.score(X_test, y_test)

0.904

Out-of-Bag Evaluation

In [13]:
#creating a BaggingClassifier with out-of-bag (OOB) score calculation
bag_clf = BaggingClassifier(DecisionTreeClassifier(),
                            n_estimators=500,
                            oob_score=True,
                            n_jobs=-1,
                            random_state=42
                            )

#Fitting the bagging classifier on the training data
bag_clf.fit(X_train, y_train)

#Accessing the OOB score
bag_clf.oob_score_

0.896

In bagging, when each bootstrap sample has the same size m as the training set, each base model sees only about 63% of the distinct training instances on average. The probability that a given instance is never drawn in m draws with replacement is (1 - 1/m)^m, which approaches 1/e ≈ 0.37 as m grows, so about 37% of the instances are left out of each sample. These are that model's out-of-bag (OOB) instances, and they differ from one base model to the next.

Setting oob_score=True in a bagging ensemble (e.g., BaggingClassifier or BaggingRegressor in scikit-learn) evaluates the ensemble on these OOB instances: each training instance is predicted using only the base models that did not see it during training, and the resulting score is stored in oob_score_.

The OOB score therefore acts as a built-in validation metric that needs no separate validation set, and it estimates how well the ensemble is likely to generalize to unseen data.
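
The 63% / 37% split can be checked numerically. This sketch assumes bootstrap samples of the same size as the training set (375 instances here, since train_test_split keeps 75% of the 500 instances by default). It compares the closed-form value 1 - (1 - 1/m)^m with a simulation; the trial count of 2000 is an arbitrary choice.

```python
import numpy as np

m = 375  # training-set size used above (75% of 500 instances)

# Probability that a given instance is drawn at least once in m draws with replacement
expected_in_bag = 1 - (1 - 1 / m) ** m

# Simulation: draw m indices with replacement and count the distinct ones
rng = np.random.default_rng(42)
n_trials = 2000
fractions = [
    np.unique(rng.integers(0, m, size=m)).size / m
    for _ in range(n_trials)
]
simulated_in_bag = float(np.mean(fractions))

print(f"expected in-bag fraction:  {expected_in_bag:.3f}")
print(f"simulated in-bag fraction: {simulated_in_bag:.3f}")
print(f"expected out-of-bag:       {1 - expected_in_bag:.3f}")
```

Both in-bag values should land close to 1 - 1/e, and the out-of-bag fraction close to 1/e.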

In [14]:
from sklearn.metrics import accuracy_score

y_pred = bag_clf.predict(X_test)

accuracy_score(y_test, y_pred)

0.92

In [15]:

from sklearn.ensemble import RandomForestClassifier

# Creating a RandomForestClassifier with specified parameters
rnd_clf = RandomForestClassifier(n_estimators=500,
                                 max_leaf_nodes=16,
                                 n_jobs=-1,
                                 random_state=42)

# Fitting the RandomForestClassifier on the training data
rnd_clf.fit(X_train, y_train)

# Generating predictions using the RandomForestClassifier on the test set
y_pred_rf = rnd_clf.predict(X_test)

In [16]:
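# Building a BaggingClassifier of decision trees that mimics the random forest above:
# each tree considers a random subset of sqrt(n_features) features at every split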
bag_clf = BaggingClassifier(
    DecisionTreeClassifier(max_features='sqrt', max_leaf_nodes=16),
    n_estimators=500,
    n_jobs=-1,
    random_state=42
)

In [17]:
#Fitting the BaggingClassifier on the training data
bag_clf.fit(X_train, y_train)

#Generating predictions using the BaggingClassifier
y_pred_bag = bag_clf.predict(X_test)

#Comparing predictions between BaggingClassifier and RandomForestClassifier
np.all(y_pred_bag == y_pred_rf)

np.True_

Feature Importance

In [18]:
from sklearn.datasets import load_iris

#Loading the Iris dataset as a DataFrame
iris = load_iris(as_frame=True)

#Training a random forest on the full Iris dataset to inspect its feature importances
rnd_clf = RandomForestClassifier(n_estimators=500,
                                 random_state=42)

rnd_clf.fit(iris.data, iris.target)

for score, name in zip(rnd_clf.feature_importances_, iris.data.columns):
    print(round(score, 2)*100,'%', name)

11.0 % sepal length (cm)
2.0 % sepal width (cm)
44.0 % petal length (cm)
42.0 % petal width (cm)


Boosting
AdaBoost

In [19]:
from sklearn.ensemble import AdaBoostClassifier

ada_clf = AdaBoostClassifier(
    DecisionTreeClassifier(max_depth=1),
    n_estimators=30,
    learning_rate=0.5,
    random_state=42)

ada_clf.fit(X_train, y_train)

In [20]:
ada_clf.score(X_test, y_test)

0.88

Gradient Boosting

In [21]:
import numpy as np
from sklearn.tree import DecisionTreeRegressor

np.random.seed(42)

X = np.random.rand(100, 1) - 0.5
y = 3 * X[:, 0] ** 2 + 0.05 * np.random.randn(100)

tree_reg1 = DecisionTreeRegressor(max_depth=2, random_state=42)
tree_reg1.fit(X, y)

In [22]:
y2 = y - tree_reg1.predict(X)

tree_reg2 = DecisionTreeRegressor(max_depth=2, random_state=42)

tree_reg2.fit(X, y2)

In [23]:
y3 = y2 - tree_reg2.predict(X)

tree_reg3 = DecisionTreeRegressor(max_depth=2, random_state=42)

tree_reg3.fit(X, y3)

In [24]:
X_new = np.array([[-0.4],[0], [0.5]])

sum(tree.predict(X_new) for tree in (tree_reg1, tree_reg2, tree_reg3))

array([0.49484029, 0.04021166, 0.75026781])
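
The three manual steps above follow one pattern: fit a tree to the current residuals, subtract its predictions, and repeat. A minimal sketch of that pattern as a loop is shown below. The function name `fit_residual_trees` and the `train_mse` bookkeeping are assumptions introduced for illustration, and the loop uses no learning rate or initial constant prediction, so it mirrors the manual cells and is not a reimplementation of GradientBoostingRegressor.

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

# Same noisy quadratic data as in the cells above
np.random.seed(42)
X = np.random.rand(100, 1) - 0.5
y = 3 * X[:, 0] ** 2 + 0.05 * np.random.randn(100)


def fit_residual_trees(X, y, n_trees=3, max_depth=2):
    """Hypothetical helper: fit each tree to the residuals left by the previous ones."""
    trees, train_mse = [], []
    residual = y.copy()
    for _ in range(n_trees):
        tree = DecisionTreeRegressor(max_depth=max_depth, random_state=42)
        tree.fit(X, residual)
        # What is still unexplained after adding this tree
        residual = residual - tree.predict(X)
        trees.append(tree)
        train_mse.append(float(np.mean(residual ** 2)))
    return trees, train_mse


trees, train_mse = fit_residual_trees(X, y)

# The ensemble prediction is the sum of the individual tree predictions
X_new = np.array([[-0.4], [0.0], [0.5]])
y_new = sum(tree.predict(X_new) for tree in trees)

print(train_mse)
print(y_new)
```

The training error recorded in `train_mse` should not increase from one tree to the next, because each tree predicts the mean residual within its leaves.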

In [25]:
from sklearn.ensemble import GradientBoostingRegressor

gbrt = GradientBoostingRegressor(max_depth=2,
                                 n_estimators=3,
                                 learning_rate=1.0,
                                 random_state=42)

gbrt.fit(X, y)

In [26]:
gbrt_best = GradientBoostingRegressor(
    max_depth=2,
    learning_rate=0.05,
    n_estimators=500,
    n_iter_no_change=10,
    random_state=42)

gbrt_best.fit(X, y)

In [27]:
gbrt_best.n_estimators_

92

StackingClassifier

In [29]:
from sklearn.ensemble import StackingClassifier

stacking_clf = StackingClassifier(
    estimators=[
        ('lr', LogisticRegression(random_state=42)),
        ('rf', RandomForestClassifier(random_state=42)),
        ('svc', SVC(probability=True, random_state=42))
    ],
    final_estimator=RandomForestClassifier(random_state=43),
    cv=5
)
stacking_clf.fit(X_train, y_train)

In [30]:
stacking_clf.score(X_test, y_test)

0.928