
In [1]:
filename = "https://raw.githubusercontent.com/jbrownlee/Datasets/master/pima-indians-diabetes.data.csv"
features = ['preg', 'plas', 'pres', 'skin', 'test', 'mass', 'pedi', 'age', 'class']

### Bagging Algorithms : -   

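Bootstrap Aggregation (or Bagging) involves taking multiple samples from your training dataset (with replacement) and training a model on each sample. The final prediction combines the predictions of all of the sub-models: an average for regression, and a majority vote or averaged class probabilities for classification. This section covers three bagging models: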

* Bagged Decision Trees
* Random Forest
* Extra Trees
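
The resample, train, and combine steps can be sketched in plain Python. This is a toy illustration rather than the scikit-learn implementation: the 1-D data, the `fit_stump` sub-model, and the helper names are all invented for the example.

```python
import random
from statistics import mean

def bootstrap_sample(rows, rng):
    """Draw len(rows) rows with replacement."""
    return rng.choices(rows, k=len(rows))

def fit_stump(rows):
    """Toy sub-model: threshold halfway between the two class means of x."""
    class0 = [x for x, label in rows if label == 0]
    class1 = [x for x, label in rows if label == 1]
    threshold = (mean(class0) + mean(class1)) / 2
    return lambda x: 1 if x > threshold else 0

def bagged_predict(models, x):
    """Majority vote over the sub-models' 0/1 predictions."""
    votes = [model(x) for model in models]
    return 1 if mean(votes) >= 0.5 else 0

rng = random.Random(7)
# Toy data (an assumption for this sketch): class 0 centred at 0, class 1 at 3.
rows = [(rng.gauss(0, 1), 0) for _ in range(50)]
rows += [(rng.gauss(3, 1), 1) for _ in range(50)]

# One sub-model per bootstrap sample.
models = [fit_stump(bootstrap_sample(rows, rng)) for _ in range(25)]
print(bagged_predict(models, -1.0), bagged_predict(models, 4.0))
```

Each stump sees a slightly different resampled dataset, so the thresholds differ from model to model; the vote smooths out those differences.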

### Bagged Decision Trees : -  

Bagging performs best with algorithms that have high variance. A popular example is the decision tree, often grown without pruning.
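
One way to see the effect is to score a single unpruned tree and a bagged ensemble of the same tree side by side. The sketch below uses a synthetic dataset from `make_classification` instead of the diabetes data, so its numbers are not comparable with the results in this notebook; the sample size, fold count, and seeds are arbitrary choices made for the example.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import BaggingClassifier
from sklearn.model_selection import KFold, cross_val_score
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in data (an assumption for this sketch).
X_demo, y_demo = make_classification(n_samples=300, n_features=8, random_state=7)
cv = KFold(n_splits=5, shuffle=True, random_state=7)

single_tree = DecisionTreeClassifier(random_state=7)
bagged_trees = BaggingClassifier(DecisionTreeClassifier(), n_estimators=50, random_state=7)

# Same folds for both models, so the comparison is like for like.
tree_scores = cross_val_score(single_tree, X_demo, y_demo, cv=cv)
bagged_scores = cross_val_score(bagged_trees, X_demo, y_demo, cv=cv)

print("Single tree :", tree_scores.mean(), "+/-", tree_scores.std())
print("Bagged trees:", bagged_scores.mean(), "+/-", bagged_scores.std())
```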

In [2]:
import pandas as pd
import numpy as np
from sklearn.model_selection import KFold
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import BaggingClassifier
import warnings
warnings.filterwarnings("ignore")

In [3]:
df = pd.read_csv(filename, names=features)
df.head()

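|   | preg | plas | pres | skin | test | mass | pedi  | age | class |
|---|------|------|------|------|------|------|-------|-----|-------|
| 0 | 6    | 148  | 72   | 35   | 0    | 33.6 | 0.627 | 50  | 1     |
| 1 | 1    | 85   | 66   | 29   | 0    | 26.6 | 0.351 | 31  | 0     |
| 2 | 8    | 183  | 64   | 0    | 0    | 23.3 | 0.672 | 32  | 1     |
| 3 | 1    | 89   | 66   | 23   | 94   | 28.1 | 0.167 | 21  | 0     |
| 4 | 0    | 137  | 40   | 35   | 168  | 43.1 | 2.288 | 33  | 1     |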


In [4]:
array = df.values
X = array[:, 0:8]
y = array[:, 8]

In [5]:
kfold = KFold(n_splits=10)
dt = DecisionTreeClassifier()
num_trees = 100
seed = 7
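# Pass the base tree positionally: the keyword was renamed from
# `base_estimator` to `estimator` in scikit-learn 1.2.
model = BaggingClassifier(dt, n_estimators=num_trees, random_state=seed)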
results = cross_val_score(model, X, y, cv=kfold)
print("Results based on Bagging :", results.mean())

Results based on Bagging : 0.7720437457279563


### Random Forest : -  

Random Forest is an extension of bagged decision trees. Samples of the training dataset are taken with replacement, but the trees are constructed in a way that reduces the correlation between individual classifiers. Specifically, rather than greedily choosing the best split point among all features when constructing each tree, only a random subset of features is considered for each split.
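
The difference between the two split searches can be sketched in plain Python. This is an illustration under simplifying assumptions, not scikit-learn's tree builder: the Gini-based `best_split` helper and the random toy data are made up for the example.

```python
import random

def gini(labels):
    """Gini impurity of a list of 0/1 labels."""
    if not labels:
        return 0.0
    p = sum(labels) / len(labels)
    return 2 * p * (1 - p)

def best_split(rows, labels, feature_indices):
    """Return (impurity, feature, threshold) of the best split among the given features."""
    best = None
    for f in feature_indices:
        for threshold in sorted({row[f] for row in rows}):
            left = [y for row, y in zip(rows, labels) if row[f] <= threshold]
            right = [y for row, y in zip(rows, labels) if row[f] > threshold]
            score = (len(left) * gini(left) + len(right) * gini(right)) / len(labels)
            if best is None or score < best[0]:
                best = (score, f, threshold)
    return best

rng = random.Random(7)
n_features, max_features = 8, 3
rows = [[rng.random() for _ in range(n_features)] for _ in range(40)]
labels = [1 if row[0] + row[5] > 1 else 0 for row in rows]

# Bagged tree: every feature competes at every split.
all_features_split = best_split(rows, labels, range(n_features))

# Random forest: a fresh random subset of features is drawn for each split.
candidates = rng.sample(range(n_features), max_features)
subset_split = best_split(rows, labels, candidates)

print("all features :", all_features_split)
print("subset", candidates, ":", subset_split)
```

Because each split only sees a few features, different trees are pushed toward different splits, which is what lowers the correlation between them.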

In [6]:
from sklearn.ensemble import RandomForestClassifier
max_features = 3
num_trees = 100
kfold = KFold(n_splits=10)
seed = 7
model = RandomForestClassifier(n_estimators=num_trees, max_features=max_features, random_state=seed)
results = cross_val_score(model, X, y, cv=kfold)
print("Results using Random Forest Classifier : ", results.mean())


Results using Random Forest Classifier :  0.7733766233766234


### Extra Trees : -  

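Extra Trees (extremely randomized trees) are another modification of bagging in which the splits themselves are randomized: instead of searching for the best threshold, a threshold is drawn at random for each candidate feature and the best of these random splits is used. In scikit-learn's `ExtraTreesClassifier`, each tree is trained on the full training set by default rather than on a bootstrap sample.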
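
As a rough sketch of how a randomized split differs from a searched one, the snippet below compares the two on a single feature. The helper names and the toy data are assumptions made for illustration; it is not scikit-learn's implementation.

```python
import random

def gini(labels):
    """Gini impurity of a list of 0/1 labels."""
    if not labels:
        return 0.0
    p = sum(labels) / len(labels)
    return 2 * p * (1 - p)

def split_impurity(values, labels, threshold):
    """Weighted Gini impurity after splitting one feature at `threshold`."""
    left = [y for v, y in zip(values, labels) if v <= threshold]
    right = [y for v, y in zip(values, labels) if v > threshold]
    return (len(left) * gini(left) + len(right) * gini(right)) / len(labels)

rng = random.Random(7)
values = [rng.random() for _ in range(30)]        # one toy feature
labels = [1 if v > 0.6 else 0 for v in values]

# Searched split: try every observed value and keep the best threshold.
searched = min(sorted(set(values)), key=lambda t: split_impurity(values, labels, t))

# Randomized split: draw one threshold uniformly between the feature's min and max.
drawn = rng.uniform(min(values), max(values))

print("searched:", searched, split_impurity(values, labels, searched))
print("drawn   :", drawn, split_impurity(values, labels, drawn))
```

A single random threshold is usually worse than the searched one, but it is much cheaper to compute and adds extra diversity across the trees.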

In [7]:
from sklearn.ensemble import ExtraTreesClassifier
max_features = 3
num_trees = 100
kfold = KFold(n_splits=10)
seed = 7

model = ExtraTreesClassifier(n_estimators=num_trees, max_features=max_features, random_state=seed)
results = cross_val_score(model, X, y, cv=kfold)
print("Results using Extra Tree Classifier : ", results.mean())

Results using Extra Tree Classifier :  0.7603725222146275


### Boosting Algorithms : -  

Boosting ensemble algorithms create a sequence of models, each of which attempts to correct the mistakes of the models before it in the sequence. Once created, the models make predictions, which may be weighted by their demonstrated accuracy, and the results are combined to create a final output prediction.
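
The accuracy-weighted combination can be sketched as follows. Weighting each model by the log-odds of its accuracy is one common choice and is an assumption of this example; the function name and the numbers are hypothetical.

```python
import math

def weighted_vote(predictions, accuracies):
    """Combine +1/-1 predictions, weighting each model by the log-odds of its accuracy."""
    weights = [math.log(a / (1 - a)) for a in accuracies]
    total = sum(w * p for w, p in zip(weights, predictions))
    return 1 if total > 0 else -1

predictions = [+1, -1, -1]      # three models disagree
accuracies = [0.9, 0.6, 0.6]    # the first model has been far more reliable

print(weighted_vote(predictions, accuracies))       # prints 1
print(weighted_vote(predictions, [0.6, 0.6, 0.6]))  # prints -1
```

With equal weights the two weaker models win the vote; once accuracy is taken into account, the single strong model outweighs them.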

### AdaBoost : -  

AdaBoost was perhaps the first successful boosting ensemble algorithm. It generally works by weighting instances in the dataset by how easy or difficult they are to classify, allowing the algorithm to pay more or less attention to them in the construction of subsequent models.
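
A single round of instance reweighting, in the classic two-class (discrete) AdaBoost formulation with labels in {-1, +1}, might look like the sketch below. It assumes the weights sum to 1 and the weighted error lies strictly between 0 and 1; scikit-learn's `AdaBoostClassifier` uses a multi-class variant, so this is not its exact update.

```python
import math

def reweight(weights, y_true, y_pred):
    """One AdaBoost round: return the model weight alpha and the updated instance weights."""
    # Weighted error of the current model.
    err = sum(w for w, t, p in zip(weights, y_true, y_pred) if t != p)
    alpha = 0.5 * math.log((1 - err) / err)
    # t * p is +1 for a correct prediction and -1 for a mistake,
    # so mistakes are scaled up and correct instances are scaled down.
    updated = [w * math.exp(-alpha * t * p) for w, t, p in zip(weights, y_true, y_pred)]
    total = sum(updated)
    return alpha, [w / total for w in updated]

weights = [0.2] * 5
y_true = [+1, +1, -1, -1, +1]
y_pred = [+1, +1, -1, -1, -1]   # only the last instance is misclassified

alpha, new_weights = reweight(weights, y_true, y_pred)
print(alpha, new_weights)
```

After the update, the one misclassified instance carries half of the total weight, so the next model is pushed to get it right.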

In [8]:
from sklearn.ensemble import AdaBoostClassifier

num_trees = 30
seed = 7

kfold = KFold(n_splits=10)
model = AdaBoostClassifier(n_estimators=num_trees, random_state=seed)
results = cross_val_score(model, X, y, cv=kfold)
print("Results based on AdaBoostClassifier : ", results.mean())

Results based on AdaBoostClassifier :  0.760457963089542


### Stochastic Gradient Boosting : -  

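Stochastic Gradient Boosting (also called Gradient Boosting Machines) is one of the most sophisticated ensemble techniques, and it is often among the best-performing ways to improve accuracy with an ensemble. Each new tree is fit to the errors of the current ensemble; in the stochastic variant, each tree is trained on a random subsample of the rows. In scikit-learn this is controlled by `subsample`, whose default of 1.0 (used here) trains every tree on all rows.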
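
The idea can be sketched for squared-error regression, where the negative gradient is simply the residual. Everything here is a simplifying assumption for illustration (1-D toy data, one-split stumps, the chosen learning rate and subsample fraction); it is not how `GradientBoostingClassifier` is implemented.

```python
import random
from statistics import mean

def fit_stump(xs, targets):
    """Least-squares regression stump on 1-D inputs."""
    best = None
    for threshold in sorted(set(xs))[:-1]:   # drop the max so both sides are non-empty
        left = [t for x, t in zip(xs, targets) if x <= threshold]
        right = [t for x, t in zip(xs, targets) if x > threshold]
        left_mean, right_mean = mean(left), mean(right)
        sse = sum((t - left_mean) ** 2 for t in left) + sum((t - right_mean) ** 2 for t in right)
        if best is None or sse < best[0]:
            best = (sse, threshold, left_mean, right_mean)
    _, threshold, left_mean, right_mean = best
    return lambda x: left_mean if x <= threshold else right_mean

rng = random.Random(7)
xs = [i / 2 for i in range(20)]
ys = [x ** 2 for x in xs]

learning_rate, subsample = 0.1, 0.8
base = mean(ys)      # stage 0: predict the mean
stages = []

def predict(x):
    return base + learning_rate * sum(stage(x) for stage in stages)

def mse():
    return mean((y - predict(x)) ** 2 for x, y in zip(xs, ys))

mse_before = mse()
for _ in range(50):
    # Residuals of the current ensemble are the targets for the next stump.
    residuals = [y - predict(x) for x, y in zip(xs, ys)]
    # "Stochastic": fit each stage on a random subsample drawn without replacement.
    picked = rng.sample(range(len(xs)), int(subsample * len(xs)))
    stages.append(fit_stump([xs[i] for i in picked], [residuals[i] for i in picked]))
mse_after = mse()

print("MSE before boosting:", mse_before)
print("MSE after boosting :", mse_after)
```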

In [9]:
from sklearn.ensemble import GradientBoostingClassifier

num_trees = 100

seed = 7

kfold = KFold(n_splits=10)
model = GradientBoostingClassifier(n_estimators=num_trees, random_state=seed)
results = cross_val_score(model, X, y, cv=kfold)
print("Results based on GradientBoostingClassifier (GB) : ", results.mean())

Results based on GradientBoostingClassifier (GB) :  0.7681989063568012


### Voting Ensemble : -  

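Voting is one of the simplest ways of combining the predictions from multiple machine learning algorithms. It works by first creating two or more standalone models from your training dataset. A Voting Classifier can then be used to wrap your models and combine the predictions of the sub-models when asked to make predictions for new data. By default, scikit-learn's `VotingClassifier` uses hard voting (a majority vote over predicted class labels); with `voting='soft'` it averages the predicted class probabilities instead.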
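
Majority voting and probability averaging can disagree, as the small sketch below shows. The helper functions and the three probabilities are hypothetical; they stand in for the class-1 probabilities that three fitted sub-models might return for one sample.

```python
from statistics import mean

def hard_vote(probabilities):
    """Majority vote: each model casts one vote for its most likely class."""
    votes = [1 if p >= 0.5 else 0 for p in probabilities]
    return 1 if sum(votes) > len(votes) / 2 else 0

def soft_vote(probabilities):
    """Average the class-1 probabilities, then threshold the average."""
    return 1 if mean(probabilities) >= 0.5 else 0

# One confident model and two mildly doubtful ones.
probabilities = [0.9, 0.4, 0.4]

print(hard_vote(probabilities))  # prints 0
print(soft_vote(probabilities))  # prints 1
```

Soft voting requires every sub-model to expose predicted probabilities; in scikit-learn that means, for example, constructing `SVC` with `probability=True`.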

In [10]:
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier
from sklearn.svm import SVC
from sklearn.ensemble import VotingClassifier

In [13]:
## create the sub models

estimators = []

model1 = LogisticRegression()
estimators.append(('Logistic Regression', model1))

model2 = DecisionTreeClassifier()
estimators.append(('Decision Tree', model2))

model3 = SVC()
estimators.append(('SVM', model3))

In [14]:
## Create the ensemble model 

ensemble = VotingClassifier(estimators)

results = cross_val_score(ensemble, X, y, cv=kfold)

print("Results based on Voting Classifier : ", results.mean())

Results based on Voting Classifier :  0.7668831168831168
