<a href="https://colab.research.google.com/github/ceyxasm/ml/blob/main/Ensemble_learning.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

**Ensemble Learning**

Suppose you build an ensemble of 1,000 classifiers that are each correct
only 51% of the time (barely better than random guessing). If you predict
the majority-voted class, you can hope for up to 75% accuracy! However, this
holds only if all classifiers are perfectly independent and make uncorrelated errors.
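
Under the independence assumption, the majority-vote accuracy is a binomial tail probability, so the figure can be checked directly. The following is a minimal sketch: `exact_majority_accuracy` is a hypothetical helper, and it counts only strict majorities (a tie counts as wrong), so the value for 1,000 classifiers may land slightly below the 75% quoted above.

```python
import math

def exact_majority_accuracy(n_classifiers, p_correct):
    """Probability that a strict majority of independent classifiers is correct."""
    total = 0.0
    # Sum the binomial probabilities for every outcome where more than half are right
    for k in range(n_classifiers // 2 + 1, n_classifiers + 1):
        # Work in log space so large binomial coefficients do not overflow
        log_pmf = (math.lgamma(n_classifiers + 1)
                   - math.lgamma(k + 1)
                   - math.lgamma(n_classifiers - k + 1)
                   + k * math.log(p_correct)
                   + (n_classifiers - k) * math.log(1 - p_correct))
        total += math.exp(log_pmf)
    return total

for n in (1, 10, 100, 1000, 10000):
    print(n, round(exact_majority_accuracy(n, 0.51), 3))
```

The accuracy grows with the number of classifiers, but only because the sketch assumes every classifier errs independently of the others.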



In [1]:
from sklearn.ensemble import RandomForestClassifier
from sklearn.ensemble import VotingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.svm import SVC
from sklearn.datasets import make_moons
from sklearn.model_selection import train_test_split

# Two interleaving half-moons: a small, nearly noise-free binary classification task
X, y = make_moons(n_samples=500, noise=0.035, random_state=20)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.33, random_state=42)

log_clf = LogisticRegression()
rnd_clf = RandomForestClassifier(max_depth=4)
svm_clf = SVC(C=0.05)

# Hard voting: each classifier casts one vote and the majority class wins
voting_clf = VotingClassifier(
    estimators=[('lr', log_clf), ('rf', rnd_clf), ('svc', svm_clf)],
    voting='hard')
voting_clf.fit(X_train, y_train)


VotingClassifier(estimators=[('lr', LogisticRegression()),
                             ('rf', RandomForestClassifier(max_depth=4)),
                             ('svc', SVC(C=0.05))])

In [2]:
from sklearn.metrics import accuracy_score

# Fit each classifier on its own and compare its test accuracy with the ensemble
for clf in (log_clf, rnd_clf, svm_clf, voting_clf):
    clf.fit(X_train, y_train)
    y_pred = clf.predict(X_test)
    print(clf.__class__.__name__, accuracy_score(y_test, y_pred))

LogisticRegression 0.896969696969697
RandomForestClassifier 0.9878787878787879
SVC 0.9636363636363636
VotingClassifier 0.9636363636363636


In [3]:
# Soft voting: average the predicted class probabilities across classifiers

log_clf = LogisticRegression()
rnd_clf = RandomForestClassifier(max_depth=4)
# SVC does not expose predict_proba by default; probability=True enables it
svm_clf = SVC(C=0.05, probability=True)
voting_clf = VotingClassifier(
    estimators=[('lr', log_clf), ('rf', rnd_clf), ('svc', svm_clf)],
    voting='soft')  # changed from 'hard'
voting_clf.fit(X_train, y_train)

for clf in (log_clf, rnd_clf, svm_clf, voting_clf):
    clf.fit(X_train, y_train)
    y_pred = clf.predict(X_test)
    print(clf.__class__.__name__, accuracy_score(y_test, y_pred))

LogisticRegression 0.896969696969697
RandomForestClassifier 0.9878787878787879
SVC 0.9636363636363636
VotingClassifier 0.9454545454545454
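
The two voting modes used above can disagree on the same instance, which a small NumPy sketch makes concrete. The probabilities below are made up for illustration and do not come from the classifiers in this notebook.

```python
import numpy as np

# Hypothetical predicted probabilities for ONE instance from three classifiers.
# Columns are [P(class 0), P(class 1)]; the numbers are invented for illustration.
probas = np.array([
    [0.45, 0.55],   # classifier A leans slightly toward class 1
    [0.45, 0.55],   # classifier B leans slightly toward class 1
    [0.90, 0.10],   # classifier C is very confident in class 0
])

# Hard voting: each classifier votes for its most likely class; the majority wins.
votes = probas.argmax(axis=1)
hard_vote = np.bincount(votes, minlength=2).argmax()

# Soft voting: average the probabilities, then pick the most likely class.
soft_vote = probas.mean(axis=0).argmax()

print("hard vote:", hard_vote)  # class 1 (two votes to one)
print("soft vote:", soft_vote)  # class 0 (mean probabilities 0.6 vs 0.4)
```

Soft voting gives more weight to confident classifiers, so it helps only when those confidence estimates are reasonably well calibrated.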



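Clearly, our ensemble is not performing as it should: with both hard and soft voting, the VotingClassifier scores below the best individual classifier (the random forest).
A likely reason is that all the models are trained on the same data, so their errors are correlated.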
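
How much correlated errors hurt a majority vote can be illustrated with a small simulation. This is a sketch under a stated assumption, not a model of the classifiers above: `simulated_majority_accuracy` and its `copy_prob` parameter are hypothetical, and correlation is introduced by letting each classifier copy one shared outcome with probability `copy_prob` instead of making its own independent guess. Each classifier is still correct 51% of the time on its own.

```python
import numpy as np

def simulated_majority_accuracy(n_classifiers, p_correct, copy_prob,
                                n_trials=2000, seed=0):
    """Estimate majority-vote accuracy when classifiers partly share their errors."""
    rng = np.random.default_rng(seed)
    # One shared right/wrong outcome per trial (the source of correlation)
    shared = rng.random((n_trials, 1)) < p_correct
    # An independent right/wrong outcome for every classifier in every trial
    own = rng.random((n_trials, n_classifiers)) < p_correct
    # Each classifier copies the shared outcome with probability copy_prob
    copies = rng.random((n_trials, n_classifiers)) < copy_prob
    correct = np.where(copies, shared, own)
    # The ensemble is right when a strict majority of classifiers is right
    majority_correct = correct.sum(axis=1) > n_classifiers / 2
    return majority_correct.mean()

independent = simulated_majority_accuracy(1000, 0.51, copy_prob=0.0)
correlated = simulated_majority_accuracy(1000, 0.51, copy_prob=0.5)
print("independent errors:", independent)
print("correlated errors: ", correlated)
```

With fully independent errors the vote is noticeably better than any single classifier; once half of the votes follow a shared outcome, the ensemble falls back to roughly the accuracy of one classifier.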

Using different classifiers is one way of doing ensemble learning.

 ----
**Bagging and Pasting**

Another approach is to use the same training algorithm for every
predictor but to train each one on a different random subset of the training set. When
sampling is performed with replacement, this method is called bagging (short for bootstrap aggregating). When sampling is performed without replacement, it is called
pasting.
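
The difference between the two sampling schemes can be sketched with NumPy. The sizes below (`n_train`, `n_subset`) are hypothetical and chosen only to keep the output readable.

```python
import numpy as np

rng = np.random.default_rng(42)
n_train = 10    # hypothetical training-set size
n_subset = 6    # instances drawn for each predictor

# Bagging: sample WITH replacement, so an instance can appear more than once
bagging_idx = rng.choice(n_train, size=n_subset, replace=True)

# Pasting: sample WITHOUT replacement, so every drawn instance is distinct
pasting_idx = rng.choice(n_train, size=n_subset, replace=False)

print("bagging:", np.sort(bagging_idx))
print("pasting:", np.sort(pasting_idx))
```

Each predictor in the ensemble would be trained on the rows selected by one such index array, with a fresh draw for every predictor.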

In [5]:
from sklearn.ensemble import BaggingClassifier  # BaggingRegressor for regression
from sklearn.tree import DecisionTreeClassifier

# 500 decision trees, each trained on 100 instances sampled at random from the
# training set with replacement (bagging); set bootstrap=False for pasting
bag_clf = BaggingClassifier(
    DecisionTreeClassifier(), n_estimators=500,
    max_samples=100, bootstrap=True, n_jobs=-1)

bag_clf.fit(X_train, y_train)
y_pred = bag_clf.predict(X_test)
print('Accuracy:  ', accuracy_score(y_test, y_pred))

Accuracy:   0.9878787878787879
