# Chapter 7: Ensemble Learning and Random Forests

**Objective:** Understand ensemble techniques: Voting, Bagging/Pasting, Random Patches/Subspaces, Random Forests, Extra‑Trees, Boosting (AdaBoost & Gradient Boosting), and Stacking.

---

## 1. Voting Classifier

- **Idea:** Combine the predictions of several models (“weak learners”) by voting:  
  - **Hard voting** → the majority class label  
  - **Soft voting** → the class with the highest average predicted probability  
- Useful for improving stability and accuracy when the models are diverse.

In [1]:
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.svm import SVC
from sklearn.ensemble import VotingClassifier
from sklearn.metrics import accuracy_score

# Prepare the Iris data
iris = load_iris()
X_train, X_test, y_train, y_test = train_test_split(
    iris.data, iris.target, random_state=42
)

# Define the base learners
clf1 = LogisticRegression(max_iter=200)
clf2 = SVC(kernel='rbf', probability=True)
clf3 = LogisticRegression(C=0.5, max_iter=200)

# Hard Voting
voting_clf = VotingClassifier(
    estimators=[('lr1',clf1),('svc',clf2),('lr2',clf3)],
    voting='hard'
)
voting_clf.fit(X_train, y_train)
y_pred = voting_clf.predict(X_test)
print("Voting (hard) Accuracy:", accuracy_score(y_test, y_pred))

Voting (hard) Accuracy: 1.0
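
The cell above only exercises hard voting. As a minimal sketch of the soft-voting variant, assuming the same Iris split and the same three base learners (the names `soft_clf`, `avg_proba`, and `manual_pred` are introduced here for illustration), the ensemble's prediction can be reproduced by averaging the base learners' `predict_proba` outputs by hand:

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.ensemble import VotingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

iris = load_iris()
X_train, X_test, y_train, y_test = train_test_split(
    iris.data, iris.target, random_state=42
)

# Soft voting requires every base learner to expose predict_proba
soft_clf = VotingClassifier(
    estimators=[
        ("lr1", LogisticRegression(max_iter=200)),
        ("svc", SVC(kernel="rbf", probability=True, random_state=42)),
        ("lr2", LogisticRegression(C=0.5, max_iter=200)),
    ],
    voting="soft",
)
soft_clf.fit(X_train, y_train)
soft_pred = soft_clf.predict(X_test)

# Reproduce the vote manually: average the per-class probabilities
# of the fitted base learners, then pick the most probable class
avg_proba = np.mean(
    [est.predict_proba(X_test) for est in soft_clf.estimators_], axis=0
)
manual_pred = soft_clf.classes_[np.argmax(avg_proba, axis=1)]

print("Voting (soft) Accuracy:", accuracy_score(y_test, soft_pred))
print("Manual average matches:", np.array_equal(manual_pred, soft_pred))
```

With no `weights` passed, every learner contributes equally to the average, so a confident learner can outvote two hesitant ones.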


## 2. Bagging & Pasting
- Bagging (Bootstrap AGGregating): sampling with replacement → each learner trains on a different subset of the data

- Pasting: sampling without replacement

- Both reduce variance, so they suit high‑variance models such as Decision Trees.

In [2]:
from sklearn.ensemble import BaggingClassifier
from sklearn.tree import DecisionTreeClassifier

bag_clf = BaggingClassifier(
    DecisionTreeClassifier(), n_estimators=500,
    max_samples=100, bootstrap=True, random_state=42
)
bag_clf.fit(X_train, y_train)
print("Bagging Accuracy:", accuracy_score(y_test, bag_clf.predict(X_test)))

Bagging Accuracy: 1.0
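
To make the difference between the two sampling schemes concrete, here is a small NumPy sketch on a toy index range (the sizes `n_instances` and `n_drawn` are arbitrary illustrative values, not taken from the Iris example):

```python
import numpy as np

rng = np.random.default_rng(42)
n_instances, n_drawn = 10, 8

# Bagging: draw indices WITH replacement, so an instance may repeat
bagging_idx = rng.choice(n_instances, size=n_drawn, replace=True)

# Pasting: draw indices WITHOUT replacement, so every index is unique
pasting_idx = rng.choice(n_instances, size=n_drawn, replace=False)

print("bagging sample:", np.sort(bagging_idx))
print("pasting sample:", np.sort(pasting_idx))
print("unique instances (bagging):", len(np.unique(bagging_idx)))
print("unique instances (pasting):", len(np.unique(pasting_idx)))
```

In `BaggingClassifier`, the same switch is the `bootstrap` parameter: `bootstrap=True` gives bagging and `bootstrap=False` gives pasting.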


## 3. Random Patches & Random Subspaces
- Random Subspaces: sample a subset of the features for each learner

- Random Patches: sample a subset of the data and a subset of the features

- Both are an additional way to reduce the correlation between models.

In [3]:
# Random Patches example: bootstrap + max_features
patches_clf = BaggingClassifier(
    DecisionTreeClassifier(), n_estimators=200,
    max_samples=100, max_features=2,
    bootstrap=True, bootstrap_features=True,
    random_state=42
)
patches_clf.fit(X_train, y_train)
print("Random Patches Accuracy:", accuracy_score(y_test, patches_clf.predict(X_test)))

Random Patches Accuracy: 1.0
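
The cell above covers Random Patches only. A comparable sketch for Random Subspaces, assuming the same Iris split (the name `subspaces_clf` and the choice of 2 features per tree are illustrative), keeps every training instance and samples only the features:

```python
from sklearn.datasets import load_iris
from sklearn.ensemble import BaggingClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

iris = load_iris()
X_train, X_test, y_train, y_test = train_test_split(
    iris.data, iris.target, random_state=42
)

subspaces_clf = BaggingClassifier(
    DecisionTreeClassifier(), n_estimators=200,
    max_samples=1.0, bootstrap=False,          # keep all training instances
    max_features=2, bootstrap_features=False,  # 2 distinct features per tree
    random_state=42,
)
subspaces_clf.fit(X_train, y_train)
subspaces_acc = accuracy_score(y_test, subspaces_clf.predict(X_test))

# estimators_features_ lists the feature indices each tree was trained on
print("Features seen by the first 3 trees:", subspaces_clf.estimators_features_[:3])
print("Random Subspaces Accuracy:", subspaces_acc)
```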


## 4. Random Forest
- Random Forest = Bagging of Decision Trees + random feature selection at each split

- Key parameters: `n_estimators`, `max_features`, `max_depth`.

In [4]:
from sklearn.ensemble import RandomForestClassifier

rnd_clf = RandomForestClassifier(
    n_estimators=500, max_features='sqrt', random_state=42
)
rnd_clf.fit(X_train, y_train)
print("Random Forest Accuracy:", accuracy_score(y_test, rnd_clf.predict(X_test)))

# Feature importance
import pandas as pd
feat_imp = pd.Series(rnd_clf.feature_importances_, index=iris.feature_names)
feat_imp.sort_values(ascending=False)

Random Forest Accuracy: 1.0


feature,importance
petal length (cm),0.447097
petal width (cm),0.405924
sepal length (cm),0.113184
sepal width (cm),0.033795
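
The "bagging + random feature selection at each split" description can be sketched directly: a `BaggingClassifier` wrapping trees that use `max_features="sqrt"` is roughly, though not exactly, what `RandomForestClassifier` does. The names `bag_of_trees`, `forest`, and `agreement` are introduced here for illustration, and the two models are not expected to produce identical trees:

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.ensemble import BaggingClassifier, RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

iris = load_iris()
X_train, X_test, y_train, y_test = train_test_split(
    iris.data, iris.target, random_state=42
)

# Bagging of trees that consider a random feature subset at every split
bag_of_trees = BaggingClassifier(
    DecisionTreeClassifier(max_features="sqrt"),
    n_estimators=200, random_state=42,
)
bag_of_trees.fit(X_train, y_train)

# The dedicated (and more optimized) Random Forest implementation
forest = RandomForestClassifier(
    n_estimators=200, max_features="sqrt", random_state=42
)
forest.fit(X_train, y_train)

# Fraction of test instances on which both ensembles agree
agreement = np.mean(bag_of_trees.predict(X_test) == forest.predict(X_test))
print("Prediction agreement:", agreement)
```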


## 5. Extra‑Trees (Extremely Randomized Trees)
- Similar to Random Forest, but with more aggressive randomization:

  - Split thresholds are chosen at random instead of searching for the best one

- Bias rises slightly and variance falls; training is usually faster.

In [5]:
from sklearn.ensemble import ExtraTreesClassifier

et_clf = ExtraTreesClassifier(
    n_estimators=500, max_features='sqrt', random_state=42
)
et_clf.fit(X_train, y_train)
print("Extra-Trees Accuracy:", accuracy_score(y_test, et_clf.predict(X_test)))

Extra-Trees Accuracy: 1.0
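
One way to see where the extra randomization comes from is to inspect the fitted trees. This is a small sketch using scikit-learn's default settings (the names `rf` and `et` are illustrative); it only reads configuration attributes and makes no claim about accuracy or speed:

```python
from sklearn.datasets import load_iris
from sklearn.ensemble import ExtraTreesClassifier, RandomForestClassifier

X, y = load_iris(return_X_y=True)

rf = RandomForestClassifier(n_estimators=50, random_state=42).fit(X, y)
et = ExtraTreesClassifier(n_estimators=50, random_state=42).fit(X, y)

# Random Forest trees search for the best threshold; Extra-Trees draw it at random
print("RF splitter:", rf.estimators_[0].splitter, "| bootstrap:", rf.bootstrap)
print("ET splitter:", et.estimators_[0].splitter, "| bootstrap:", et.bootstrap)
```

By default, Extra-Trees also trains each tree on the full training set (`bootstrap=False`), so its diversity comes from the random thresholds and feature subsets rather than from resampling.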


## 6. Boosting
### A. AdaBoost (Adaptive Boosting)
- Gives more weight to misclassified instances so that the next learner focuses on them

- Weak learner: a Decision Stump (a tree with depth=1)

In [7]:
from sklearn.ensemble import AdaBoostClassifier
from sklearn.tree import DecisionTreeClassifier # Import DecisionTreeClassifier

ada_clf = AdaBoostClassifier(
    DecisionTreeClassifier(max_depth=1),
    n_estimators=200, algorithm="SAMME", random_state=42
)
ada_clf.fit(X_train, y_train)
print("AdaBoost Accuracy:", accuracy_score(y_test, ada_clf.predict(X_test)))



AdaBoost Accuracy: 0.9736842105263158
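
The reweighting idea can be traced by hand for a single boosting round. The sketch below follows the multiclass SAMME update with a learning rate of 1, fitted on the full Iris data for simplicity; it is an illustration of the update rule, not a copy of scikit-learn's internals, and the variable names are introduced here:

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)
n_classes = len(np.unique(y))

# Round 0: every instance starts with the same weight
weights = np.full(len(y), 1 / len(y))

# Fit one decision stump using the current instance weights
stump = DecisionTreeClassifier(max_depth=1, random_state=42)
stump.fit(X, y, sample_weight=weights)
missed = stump.predict(X) != y

# Weighted error rate of the stump and its say in the final vote (SAMME)
err = np.sum(weights * missed) / np.sum(weights)
alpha = np.log((1 - err) / err) + np.log(n_classes - 1)

# Increase the weights of misclassified instances, then renormalize
new_weights = weights * np.exp(alpha * missed)
new_weights /= new_weights.sum()

print("weighted error:", err)
print("learner weight alpha:", alpha)
print("weight of a missed instance:", new_weights[missed][0])
print("weight of a correct instance:", new_weights[~missed][0])
```

The next stump would be trained with `new_weights`, which pushes it toward the instances the first stump got wrong.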


### B. Gradient Boosting
- Builds trees sequentially; each new tree fits the residual errors of the ensemble so far (the negative gradient of the loss)

- Key parameters: `learning_rate`, `n_estimators`, `max_depth`

In [8]:
from sklearn.ensemble import GradientBoostingClassifier

gb_clf = GradientBoostingClassifier(
    n_estimators=200, learning_rate=0.1, max_depth=3, random_state=42
)
gb_clf.fit(X_train, y_train)
print("Gradient Boosting Accuracy:", accuracy_score(y_test, gb_clf.predict(X_test)))

Gradient Boosting Accuracy: 1.0
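
The residual-fitting loop is easiest to see in a regression setting with squared error, where the negative gradient is exactly the residual. The sketch below uses synthetic quadratic data and three shallow trees; the data, the `learning_rate` of 1.0, and the helper `boosted_predict` are all illustrative assumptions, and the classifier above uses a different loss:

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

# Synthetic 1-D regression problem: noisy quadratic
rng = np.random.default_rng(42)
X = rng.uniform(-0.5, 0.5, size=(100, 1))
y = 3 * X[:, 0] ** 2 + 0.05 * rng.normal(size=100)

learning_rate = 1.0
trees = []
residual = y.copy()

for _ in range(3):
    # Each tree is trained on what the previous trees failed to explain
    tree = DecisionTreeRegressor(max_depth=2, random_state=42)
    tree.fit(X, residual)
    residual = residual - learning_rate * tree.predict(X)
    trees.append(tree)


def boosted_predict(X_new):
    """Sum the (scaled) predictions of all trees in the sequence."""
    return learning_rate * sum(tree.predict(X_new) for tree in trees)


mse_first = np.mean((y - trees[0].predict(X)) ** 2)
mse_boosted = np.mean((y - boosted_predict(X)) ** 2)
print("training MSE, first tree only:", mse_first)
print("training MSE, three trees:   ", mse_boosted)
```

A smaller `learning_rate` shrinks each tree's contribution, which usually calls for more trees but tends to generalize better.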


## 7. Stacking
- Uses the predictions of several level‑0 models as input features for a level‑1 learner

- The meta‑learner learns how to combine those predictions, which can improve accuracy.

In [9]:
from sklearn.ensemble import StackingClassifier
from sklearn.neighbors import KNeighborsClassifier

stack_clf = StackingClassifier(
    estimators=[
        ('rf', rnd_clf),
        ('gb', gb_clf),
        ('knn', KNeighborsClassifier())
    ],
    final_estimator=LogisticRegression(),
    cv=5
)
stack_clf.fit(X_train, y_train)
print("Stacking Accuracy:", accuracy_score(y_test, stack_clf.predict(X_test)))

Stacking Accuracy: 1.0
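
What `StackingClassifier` does with `cv=5` can be approximated by hand: out-of-fold predictions from the level‑0 models become the training features of the level‑1 model. The following is a simplified sketch with two base models (the names `base_models`, `meta_train`, and `meta_learner` are illustrative), not a reproduction of the library's exact procedure:

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import cross_val_predict, train_test_split
from sklearn.neighbors import KNeighborsClassifier

iris = load_iris()
X_train, X_test, y_train, y_test = train_test_split(
    iris.data, iris.target, random_state=42
)

base_models = [
    RandomForestClassifier(n_estimators=100, random_state=42),
    KNeighborsClassifier(),
]

# Level 0: out-of-fold class probabilities become the meta-features,
# so the meta-learner never sees predictions made on a model's own training data
meta_train = np.hstack([
    cross_val_predict(model, X_train, y_train, cv=5, method="predict_proba")
    for model in base_models
])

# Level 1: the meta-learner is trained on the stacked probabilities
meta_learner = LogisticRegression(max_iter=1000).fit(meta_train, y_train)

# At prediction time, each base model is refit on the full training set
meta_test = np.hstack([
    model.fit(X_train, y_train).predict_proba(X_test) for model in base_models
])
manual_acc = accuracy_score(y_test, meta_learner.predict(meta_test))
print("Manual stacking Accuracy:", manual_acc)
```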


# Chapter 7 Summary
- Voting combines several models' predictions by majority label (hard) or averaged probabilities (soft).

- Bagging/Pasting reduce variance.

- Random Forest & Extra-Trees: ensembles of Decision Trees.

- Boosting (AdaBoost & Gradient Boosting) trains learners sequentially, reducing bias and often variance as well.

- Stacking: a meta‑learner trained on the predictions of other models.