
In [1]:
# This Python 3 environment comes with many helpful analytics libraries installed
# It is defined by the kaggle/python Docker image: https://github.com/kaggle/docker-python
# For example, here are several helpful packages to load

import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)
from sklearn.ensemble import BaggingClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.svm import SVC
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.metrics import accuracy_score

# Input data files are available in the read-only "../input/" directory
# For example, running this (by clicking run or pressing Shift+Enter) will list all files under the input directory

import os
for dirname, _, filenames in os.walk('/kaggle/input'):
    for filename in filenames:
        print(os.path.join(dirname, filename))

# You can write up to 20GB to the current directory (/kaggle/working/) that gets preserved as output when you create a version using "Save & Run All" 
# You can also write temporary files to /kaggle/temp/, but they won't be saved outside of the current session

/kaggle/input/fully-cleaned-titanac-dataset/cleaned_titanac_dataset


In [2]:
df = pd.read_csv('/kaggle/input/fully-cleaned-titanac-dataset/cleaned_titanac_dataset')
df.head()

Unnamed: 0,Survived,Pclass,Age,SibSp,Parch,Fare,Sex_male,Embarked_Q,Embarked_S
0,0,3,34.5,0.0,0.0,7.8292,1,1,0
1,1,3,47.0,1.0,0.0,7.0,0,0,1
2,0,2,62.0,0.0,0.0,9.6875,1,1,0
3,0,3,27.0,0.0,0.0,8.6625,1,0,1
4,1,3,22.0,1.0,1.0,12.2875,0,0,1


# **Split the data into training and test sets**

In [3]:
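# Features: every column except the target.
# Note: 'Unnamed: 0' looks like a saved row index from the CSV and is still included as a feature here.
X = df.drop('Survived', axis=1)
y = df['Survived']  # target: 1 = survived, 0 = did not survive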

In [4]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42, shuffle=True)

# **Use DecisionTree for Classification**

In [5]:
dt = DecisionTreeClassifier()

In [6]:
dt.fit(X_train, y_train)

# **Predict with the Decision Tree**

In [7]:
y_pred = dt.predict(X_test)

In [8]:
print('Accuracy:', accuracy_score(y_test, y_pred))

Accuracy: 1.0
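
A test accuracy of exactly 1.0 is unusual and worth a sanity check before trusting the model. In the five rows printed by `df.head()` above, `Survived` is 1 exactly where `Sex_male` is 0; whether that pattern holds for the whole file is not shown, so treat it as a hypothesis. The sketch below is one way to test it. The helper name `single_feature_determines_target` and the toy frame are illustrative assumptions, not part of the notebook.

```python
import pandas as pd

def single_feature_determines_target(df, target):
    """Return the feature columns whose value alone fixes the target (hypothetical helper)."""
    leaky = []
    for col in df.columns.drop(target):
        # A column of all-unique values (such as a row ID) passes trivially, so skip it
        if df[col].is_unique:
            continue
        # If every group of equal feature values maps to a single target value,
        # this feature predicts the target perfectly on this data
        if df.groupby(col)[target].nunique().max() == 1:
            leaky.append(col)
    return leaky

# Toy frame copying three columns of the five rows shown by df.head()
toy = pd.DataFrame({
    "Survived": [0, 1, 0, 0, 1],
    "Pclass":   [3, 3, 2, 3, 3],
    "Sex_male": [1, 0, 1, 1, 0],
})
leaky_columns = single_feature_determines_target(toy, "Survived")
print(leaky_columns)
```

Running the same helper on the full `df` would show whether any single column explains the perfect score.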


# **Bagging Classifier**

In [9]:
bag = BaggingClassifier(
    estimator=DecisionTreeClassifier(),
    n_estimators=500,
    max_samples=0.25,
    bootstrap=True,
    random_state=42,
    n_jobs=-1
)

In [10]:
bag.fit(X_train, y_train)

In [11]:
y_pred = bag.predict(X_test)

In [12]:
print('Accuracy:', accuracy_score(y_test, y_pred))

Accuracy: 1.0


In [13]:
bag.estimators_samples_[499]

array([ 68, 281, 152, 156,  48,  30, 257, 330,  62, 297, 112, 223,  17,
       152,  36, 276,  59,  67,  74, 256,  13,  58,  12,  75, 182, 157,
       301, 284, 133, 323, 267, 297, 288,  32, 161,  22,  85, 279, 105,
       269, 294, 322, 102, 202,  67, 248, 207, 255, 226, 229,  24, 244,
       146,  72,  53, 263, 317, 284,  86, 245, 308, 183,   7, 147,  45,
        58,  77, 184, 200, 124, 130, 221, 296, 277, 240, 125, 322, 138,
         7, 119, 230,  63, 200])

# **Bagging Using SVM**

In [14]:
bag = BaggingClassifier(
    estimator=SVC(probability=True),
    n_estimators=500,
    max_samples=.25,
    bootstrap=True,
    random_state=42,
    n_jobs=-1,
    verbose=1
)

In [15]:
bag.fit(X_train, y_train)

[Parallel(n_jobs=4)]: Using backend LokyBackend with 4 concurrent workers.
[Parallel(n_jobs=4)]: Done   2 out of   4 | elapsed:    1.0s remaining:    1.0s
[Parallel(n_jobs=4)]: Done   4 out of   4 | elapsed:    1.0s finished


In [16]:
y_pred = bag.predict(X_test)

[Parallel(n_jobs=4)]: Using backend LokyBackend with 4 concurrent workers.
[Parallel(n_jobs=4)]: Done   2 out of   4 | elapsed:    0.2s remaining:    0.2s
[Parallel(n_jobs=4)]: Done   4 out of   4 | elapsed:    0.3s finished


In [17]:
print('SVM Accuracy:', np.round(accuracy_score(y_test, y_pred), 2))

SVM Accuracy: 0.6
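
The bagged SVM scores well below the bagged trees. One possible contributor (an assumption, not verified on this dataset) is that `SVC` with its default RBF kernel is sensitive to feature scale, and the features are passed in unscaled. A minimal sketch of bagging a scaled SVM, using synthetic stand-in data from `make_classification` rather than the notebook's file:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import BaggingClassifier
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Synthetic stand-in data (an assumption; not the Titanic file)
X_demo, y_demo = make_classification(n_samples=300, n_features=8, random_state=42)
X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo, test_size=0.2, random_state=42)

# Each bagged member standardizes its own bootstrap sample before fitting the SVM
scaled_svc = make_pipeline(StandardScaler(), SVC())
svm_bag = BaggingClassifier(
    scaled_svc,
    n_estimators=20,
    max_samples=0.25,
    bootstrap=True,
    random_state=42,
)
svm_bag.fit(X_tr, y_tr)
svm_accuracy = svm_bag.score(X_te, y_te)
print('Scaled SVM bagging accuracy:', round(svm_accuracy, 2))
```

Putting the scaler inside the pipeline keeps the scaling statistics local to each bootstrap sample, so no information from the test set reaches the model.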


# **OOB Score**
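> **Bagging samples rows with replacement, so a given Decision Tree may never see some rows and may see others several times. When the bootstrap sample is as large as the training set, each tree sees about 63% of the distinct rows on average (1 - 1/e ≈ 0.632) and misses the remaining 37%. With `max_samples=0.25`, as used below, each tree sees fewer than 25% of the rows. The rows a tree never saw are its out-of-bag (OOB) rows, and they act as a built-in validation set: with `oob_score=True`, `BaggingClassifier` scores each training row using only the trees that did not train on it.**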
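
The roughly 63% figure quoted above can be checked with a short simulation. This is a sketch under the assumption of a full-size bootstrap sample (as many draws as rows); the row count of 10,000 is arbitrary.

```python
import numpy as np

rng = np.random.default_rng(42)
n_rows = 10_000

# Draw n_rows indices with replacement, as a full-size bootstrap sample does
sample = rng.integers(0, n_rows, size=n_rows)
in_bag_fraction = np.unique(sample).size / n_rows

# Closed form: 1 - (1 - 1/n)^n, which tends to 1 - 1/e as n grows
expected_fraction = 1 - (1 - 1 / n_rows) ** n_rows

print(round(in_bag_fraction, 3), round(expected_fraction, 3))
```

Both numbers should land close to 0.632, leaving about 37% of the rows out of bag for any one tree.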

In [18]:
bag1 = BaggingClassifier(
    estimator=DecisionTreeClassifier(),
    n_estimators=500,
    max_samples=.25,
    bootstrap=True,
    oob_score=True,
    verbose=1,
    random_state=42
)

In [19]:
bag1.fit(X_train, y_train)

[Parallel(n_jobs=1)]: Done   1 out of   1 | elapsed:    1.0s finished


In [20]:
y_pred = bag1.predict(X_test)

[Parallel(n_jobs=1)]: Done   1 out of   1 | elapsed:    0.1s finished


In [21]:
print('Accuracy:', accuracy_score(y_test, y_pred))

Accuracy: 1.0
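
The accuracy printed above is measured on the test set; the out-of-bag estimate itself is stored on the fitted model as `oob_score_`. A minimal sketch of reading it, using synthetic stand-in data from `make_classification` (an assumption, not the notebook's dataset) and a smaller ensemble to keep it quick:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import BaggingClassifier
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in data (an assumption; not the Titanic file)
X_demo, y_demo = make_classification(n_samples=300, n_features=8, random_state=42)

oob_bag = BaggingClassifier(
    DecisionTreeClassifier(),
    n_estimators=50,
    max_samples=0.25,
    bootstrap=True,
    oob_score=True,   # required for oob_score_ to be computed during fit
    random_state=42,
)
oob_bag.fit(X_demo, y_demo)

# Accuracy on the training rows, each predicted only by trees that never saw it
oob_accuracy = oob_bag.oob_score_
print('OOB accuracy:', round(oob_accuracy, 3))
```

In the notebook, the equivalent call would be `bag1.oob_score_` after `bag1.fit(X_train, y_train)`.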


# **GridSearchCV**
> We can tune the hyperparameters of the ensemble with **GridSearchCV**

In [22]:
parameters = {
    'n_estimators': [50, 100, 500],
    'max_samples': [0.1, 0.4, 0.7, 1.0],
    'bootstrap': [True, False],
    'max_features': [0.1, 0.4, 0.7, 1.0]
}

In [23]:
# BaggingClassifier() with no arguments uses a decision tree as its base estimator
search = GridSearchCV(BaggingClassifier(), parameters, cv=5)

In [24]:
search.fit(X_train, y_train)

In [25]:
search.best_params_

{'bootstrap': True,
 'max_features': 0.7,
 'max_samples': 0.1,
 'n_estimators': 50}

In [26]:
search.best_score_

np.float64(1.0)
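
`best_score_` is the mean cross-validated accuracy on the training folds, so it is worth confirming on the held-out test set as well. A minimal sketch, assuming synthetic stand-in data and a deliberately small grid (`small_grid` and `demo_search` are illustrative names):

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import BaggingClassifier
from sklearn.model_selection import GridSearchCV, train_test_split

# Synthetic stand-in data (an assumption; not the Titanic file)
X_demo, y_demo = make_classification(n_samples=300, n_features=8, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo, test_size=0.2, random_state=0)

# A small grid keeps the sketch fast; the notebook's grid is larger
small_grid = {'n_estimators': [10, 30], 'max_samples': [0.4, 1.0]}
demo_search = GridSearchCV(BaggingClassifier(random_state=0), small_grid, cv=3)
demo_search.fit(X_tr, y_tr)

# With the default refit=True, best_estimator_ is retrained on all of X_tr
test_accuracy = demo_search.best_estimator_.score(X_te, y_te)
print(demo_search.best_params_, round(test_accuracy, 3))
```

In the notebook, the matching check would be `search.best_estimator_.score(X_test, y_test)`.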

# **Bagging Regressor**
> Now we can apply the **Bagging Regressor**
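
The notebook stops before showing the regressor, so here is a minimal sketch of how `BaggingRegressor` could be applied. It uses synthetic data from `make_regression` as a stand-in (an assumption, since the Titanic target is a class label rather than a continuous value), and the hyperparameters are illustrative only.

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import BaggingRegressor
from sklearn.metrics import r2_score
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeRegressor

# Synthetic regression data (an assumption; not from the notebook)
X_reg, y_reg = make_regression(n_samples=300, n_features=5, noise=10.0, random_state=42)
X_tr, X_te, y_tr, y_te = train_test_split(X_reg, y_reg, test_size=0.2, random_state=42)

bag_reg = BaggingRegressor(
    DecisionTreeRegressor(),
    n_estimators=50,
    max_samples=0.5,
    bootstrap=True,
    random_state=42,
)
bag_reg.fit(X_tr, y_tr)

# The ensemble prediction is the average of the individual trees' predictions
reg_pred = bag_reg.predict(X_te)
reg_r2 = r2_score(y_te, reg_pred)
print('R^2:', round(reg_r2, 3))
```

The API mirrors `BaggingClassifier`; the main difference is that predictions are averaged instead of voted on, and the score is R² rather than accuracy.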