### Objectives

1. Understanding Ensemble Methods
2. Bagging vs Boosting
3. Recap RandomForest
4. AdaBoost
5. VotingClassifier
6. Gradient Boosted Trees (GBT)

<hr>

### Ensemble Methods
* Base estimators - Simple estimators such as Linear Regression, Decision Trees, Nearest Neighbours, and Naive Bayes
* Ensemble methods - Combine the same or different base estimators to create a single predictor
* For example, a Decision Tree is a base estimator & RandomForest is an ensemble method built from many trees
* Ensemble methods generally result in more robust models
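
To get a feel for why combining estimators can give a more robust model, here is a small worked calculation. It is an idealised sketch, not a property of any particular library: it assumes a binary problem, an odd number of estimators, and (unrealistically) that the estimators make independent errors. The helper name `majority_vote_accuracy` is made up for this illustration.

```python
from math import comb


def majority_vote_accuracy(n_estimators: int, p_correct: float) -> float:
    """Probability that a strict majority of independent voters is correct.

    Assumes every estimator is right with probability p_correct and that
    their mistakes are independent (a simplifying assumption).
    """
    needed = n_estimators // 2 + 1  # smallest number of votes that forms a majority
    return sum(
        comb(n_estimators, k) * p_correct**k * (1 - p_correct) ** (n_estimators - k)
        for k in range(needed, n_estimators + 1)
    )


# Odd ensemble sizes avoid ties in a two-class vote.
for n in (1, 5, 25, 101):
    print(n, round(majority_vote_accuracy(n, 0.6), 4))
```

Under these assumptions the accuracy of the vote grows with the number of estimators even though each one is only slightly better than chance. Real base estimators are correlated, so the gain in practice is smaller.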

### Types of Ensemble Methods
* Bagging - Build several estimators independently & average their predictions. Example - RandomForest.
* Boosting - Build estimators sequentially, each one focusing on the data the previous one got wrong. Example - AdaBoost.
    - Each data point carries a weight.
    - Initially, all data points have the same weight.
    - The weight tells how important it is to classify that data point correctly.
    - Each new model is trained with higher weights on the data points that the previously trained model misclassified.
    - For prediction, all the weak classifiers are consulted.
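
The re-weighting idea described for boosting can be sketched with one round of the textbook binary AdaBoost update. This is only an illustration under that assumption: scikit-learn's `AdaBoostClassifier` uses a multi-class variant that differs in detail, and the function name `reweight` is made up here.

```python
import math


def reweight(weights, misclassified):
    """One round of the textbook binary AdaBoost weight update.

    weights       -- current sample weights, summing to 1
    misclassified -- booleans, True where the weak classifier was wrong
    Returns (alpha, new_weights), where alpha is the classifier's say
    in the final vote.
    """
    # Weighted error: total weight of the samples the classifier got wrong.
    error = sum(w for w, wrong in zip(weights, misclassified) if wrong)
    if not 0 < error < 0.5:
        raise ValueError("weak classifier must have weighted error in (0, 0.5)")
    alpha = 0.5 * math.log((1 - error) / error)
    # Increase the weight of misclassified samples, decrease the rest.
    updated = [
        w * math.exp(alpha if wrong else -alpha)
        for w, wrong in zip(weights, misclassified)
    ]
    total = sum(updated)
    return alpha, [w / total for w in updated]  # renormalise to sum to 1


weights = [0.2] * 5                                  # all samples start equal
misclassified = [False, False, True, False, False]  # third sample was wrong
alpha, new_weights = reweight(weights, misclassified)
print(alpha, new_weights)
```

After the update the single misclassified sample holds half of the total weight, so the next weak classifier is pushed hard to get it right.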

In [10]:
from sklearn.ensemble import AdaBoostClassifier, RandomForestClassifier
from sklearn.tree import DecisionTreeClassifier

In [3]:
adaboost = AdaBoostClassifier(base_estimator=DecisionTreeClassifier(), n_estimators=100)

In [18]:
adaboost = AdaBoostClassifier(base_estimator= RandomForestClassifier(n_estimators=100), n_estimators=100)

In [6]:
from sklearn.datasets import load_digits
from sklearn.model_selection import train_test_split

In [5]:
digits = load_digits()

In [33]:
trainX, testX, trainY, testY = train_test_split(digits.data, digits.target)

In [19]:
adaboost.fit(trainX, trainY)

AdaBoostClassifier(algorithm='SAMME.R',
                   base_estimator=RandomForestClassifier(bootstrap=True,
                                                         class_weight=None,
                                                         criterion='gini',
                                                         max_depth=None,
                                                         max_features='auto',
                                                         max_leaf_nodes=None,
                                                         min_impurity_decrease=0.0,
                                                         min_impurity_split=None,
                                                         min_samples_leaf=1,
                                                         min_samples_split=2,
                                                         min_weight_fraction_leaf=0.0,
                                                         n_estimators=100,
                        

In [20]:
adaboost.score(testX,testY)

0.9777777777777777

In [14]:
rf = RandomForestClassifier(n_estimators=100)

In [16]:
rf.fit(trainX, trainY)

RandomForestClassifier(bootstrap=True, class_weight=None, criterion='gini',
                       max_depth=None, max_features='auto', max_leaf_nodes=None,
                       min_impurity_decrease=0.0, min_impurity_split=None,
                       min_samples_leaf=1, min_samples_split=2,
                       min_weight_fraction_leaf=0.0, n_estimators=100,
                       n_jobs=None, oob_score=False, random_state=None,
                       verbose=0, warm_start=False)

In [17]:
rf.score(testX,testY)

0.98

In [21]:
from sklearn.linear_model import LogisticRegression

In [23]:
adab = AdaBoostClassifier(n_estimators=10, base_estimator=LogisticRegression())

In [34]:
adab.fit(trainX,trainY)



AdaBoostClassifier(algorithm='SAMME.R',
                   base_estimator=LogisticRegression(C=1.0, class_weight=None,
                                                     dual=False,
                                                     fit_intercept=True,
                                                     intercept_scaling=1,
                                                     l1_ratio=None,
                                                     max_iter=100,
                                                     multi_class='warn',
                                                     n_jobs=None, penalty='l2',
                                                     random_state=None,
                                                     solver='warn', tol=0.0001,
                                                     verbose=0,
                                                     warm_start=False),
                   learning_rate=1.0, n_estimators=10, random_state=None)

In [35]:
adab.score(testX,testY)

0.9377777777777778

In [36]:
lr = LogisticRegression()

In [37]:
lr.fit(trainX,trainY)



LogisticRegression(C=1.0, class_weight=None, dual=False, fit_intercept=True,
                   intercept_scaling=1, l1_ratio=None, max_iter=100,
                   multi_class='warn', n_jobs=None, penalty='l2',
                   random_state=None, solver='warn', tol=0.0001, verbose=0,
                   warm_start=False)

In [38]:
lr.score(testX,testY)

0.9533333333333334

In [39]:
from sklearn.ensemble import GradientBoostingClassifier

In [40]:
gbt = GradientBoostingClassifier(n_estimators=100)

In [41]:
gbt.fit(trainX,trainY)

GradientBoostingClassifier(criterion='friedman_mse', init=None,
                           learning_rate=0.1, loss='deviance', max_depth=3,
                           max_features=None, max_leaf_nodes=None,
                           min_impurity_decrease=0.0, min_impurity_split=None,
                           min_samples_leaf=1, min_samples_split=2,
                           min_weight_fraction_leaf=0.0, n_estimators=100,
                           n_iter_no_change=None, presort='auto',
                           random_state=None, subsample=1.0, tol=0.0001,
                           validation_fraction=0.1, verbose=0,
                           warm_start=False)

In [42]:
gbt.score(testX, testY)

0.9555555555555556

* We will discuss GBT & XGBoost separately

### VotingClassifier
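* So far, each ensemble combined a single type of base estimator
* What if we want to combine different types of estimators?
* Hard voting - Each estimator votes for a class label & the majority wins
* Soft voting - The predicted class probabilities are averaged & the class with the highest average wins
* Weights - Optionally, different algorithms can be given different weightage in the vote
* How do we figure out the best combination of weights?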
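
In scikit-learn's `VotingClassifier`, hard voting takes a majority vote over the predicted labels, while soft voting averages the predicted class probabilities; weights can optionally be applied. The sketch below reproduces that logic in plain Python on made-up probabilities for a single sample, to show that the two schemes can disagree. The function names and the numbers are illustrative assumptions, not library code.

```python
from collections import Counter


def hard_vote(probas, classes):
    """Each classifier votes for its most probable class; the majority wins."""
    votes = [classes[max(range(len(p)), key=p.__getitem__)] for p in probas]
    return Counter(votes).most_common(1)[0][0]


def soft_vote(probas, classes, weights=None):
    """Average the class probabilities (optionally weighted); highest average wins."""
    weights = weights or [1] * len(probas)
    total = sum(weights)
    averaged = [
        sum(w * p[c] for w, p in zip(weights, probas)) / total
        for c in range(len(classes))
    ]
    return classes[max(range(len(classes)), key=averaged.__getitem__)]


classes = ["A", "B"]
# Hypothetical predict_proba outputs of three classifiers for one sample.
probas = [[0.9, 0.1], [0.4, 0.6], [0.45, 0.55]]

print(hard_vote(probas, classes))                     # two of three prefer B
print(soft_vote(probas, classes))                     # the confident first classifier tips it to A
print(soft_vote(probas, classes, weights=[1, 3, 3]))  # down-weighting it flips the result back
```

Soft voting lets one very confident estimator outvote two lukewarm ones, which is why the choice of weights matters.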

In [43]:
from sklearn.ensemble import VotingClassifier

In [44]:
from sklearn.ensemble import VotingClassifier,RandomForestClassifier,AdaBoostClassifier
from sklearn.svm import SVC
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.datasets import load_digits
from sklearn.model_selection import train_test_split

In [45]:
estimators = [ 
    ('rf',RandomForestClassifier(n_estimators=20)),
    ('svc',SVC(kernel='rbf', probability=True)),
    ('knc',KNeighborsClassifier()),
    ('abc',AdaBoostClassifier(base_estimator=DecisionTreeClassifier() ,n_estimators=20)),
    ('lr',LogisticRegression()) 
]

In [46]:
vc = VotingClassifier(estimators=estimators, voting='hard')

In [47]:
vc.fit(trainX,trainY)



VotingClassifier(estimators=[('rf',
                              RandomForestClassifier(bootstrap=True,
                                                     class_weight=None,
                                                     criterion='gini',
                                                     max_depth=None,
                                                     max_features='auto',
                                                     max_leaf_nodes=None,
                                                     min_impurity_decrease=0.0,
                                                     min_impurity_split=None,
                                                     min_samples_leaf=1,
                                                     min_samples_split=2,
                                                     min_weight_fraction_leaf=0.0,
                                                     n_estimators=20,
                                                     n_jobs=None,
           

In [48]:
for (name, _), est in zip(vc.estimators, vc.estimators_):
    print(name, est.score(testX, testY))

rf 0.9622222222222222
svc 0.5088888888888888
knc 0.9888888888888889
abc 0.8577777777777778
lr 0.9533333333333334


In [49]:
vc.score(testX,testY)

0.9755555555555555

In [50]:
vc.predict(testX[:2])

array([3, 2])

In [51]:
estimators = [ 
    ('rf',RandomForestClassifier(n_estimators=20)),
    ('knc',KNeighborsClassifier()),
    ('abc',AdaBoostClassifier(base_estimator=DecisionTreeClassifier() ,n_estimators=20)),
    ('lr',LogisticRegression()) 
]

In [56]:
vc = VotingClassifier(estimators=estimators, voting='soft', weights=[3,5,1,1])

In [57]:
vc.fit(trainX,trainY)



VotingClassifier(estimators=[('rf',
                              RandomForestClassifier(bootstrap=True,
                                                     class_weight=None,
                                                     criterion='gini',
                                                     max_depth=None,
                                                     max_features='auto',
                                                     max_leaf_nodes=None,
                                                     min_impurity_decrease=0.0,
                                                     min_impurity_split=None,
                                                     min_samples_leaf=1,
                                                     min_samples_split=2,
                                                     min_weight_fraction_leaf=0.0,
                                                     n_estimators=20,
                                                     n_jobs=None,
           

In [58]:
for (name, _), est in zip(vc.estimators, vc.estimators_):
    print(name, est.score(testX, testY))

rf 0.9688888888888889
knc 0.9888888888888889
abc 0.8488888888888889
lr 0.9533333333333334


In [59]:
vc.score(testX,testY)

0.9866666666666667

In [60]:
from sklearn.model_selection import GridSearchCV

In [67]:
vc.estimators

[('rf',
  RandomForestClassifier(bootstrap=True, class_weight=None, criterion='gini',
                         max_depth=None, max_features='auto', max_leaf_nodes=None,
                         min_impurity_decrease=0.0, min_impurity_split=None,
                         min_samples_leaf=1, min_samples_split=2,
                         min_weight_fraction_leaf=0.0, n_estimators=20,
                         n_jobs=None, oob_score=False, random_state=None,
                         verbose=0, warm_start=False)),
 ('knc',
  KNeighborsClassifier(algorithm='auto', leaf_size=30, metric='minkowski',
                       metric_params=None, n_jobs=None, n_neighbors=5, p=2,
                       weights='uniform')),
 ('abc', AdaBoostClassifier(algorithm='SAMME.R',
                     base_estimator=DecisionTreeClassifier(class_weight=None,
                                                           criterion='gini',
                                                           max_depth=None,
   

In [70]:
gs = GridSearchCV(vc, param_grid={'weights':[[3,2,1,1],[3,3,1,2]]},cv=5)

In [71]:
gs.fit(trainX,trainY)



GridSearchCV(cv=5, error_score='raise-deprecating',
             estimator=VotingClassifier(estimators=[('rf',
                                                     RandomForestClassifier(bootstrap=True,
                                                                            class_weight=None,
                                                                            criterion='gini',
                                                                            max_depth=None,
                                                                            max_features='auto',
                                                                            max_leaf_nodes=None,
                                                                            min_impurity_decrease=0.0,
                                                                            min_impurity_split=None,
                                                                            min_samples_leaf=1,
                      

In [64]:
gs.best_params_

{'weights': [3, 3, 1, 2]}

In [65]:
gs.best_score_

0.9799554565701559
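
The grid above tries only two hand-picked weight vectors. One way to search more broadly is to generate every combination from a small set of candidate values. The sketch below does that with the standard library; the helper `weight_grid` and its `choices` argument are hypothetical names for this illustration. Vectors that are a plain multiple of another (such as `[2, 2, 2, 2]` versus `[1, 1, 1, 1]`) are skipped, because scaling all weights by the same factor does not change the vote.

```python
from functools import reduce
from itertools import product
from math import gcd


def weight_grid(n_estimators, choices=(1, 2, 3)):
    """All weight vectors of length n_estimators with entries from choices.

    Vectors whose entries share a common factor are dropped, since they
    are equivalent to a smaller vector already in the grid.
    """
    return [
        list(combo)
        for combo in product(choices, repeat=n_estimators)
        if reduce(gcd, combo) == 1
    ]


# Four estimators (rf, knc, abc, lr), as in the VotingClassifier above.
param_grid = {"weights": weight_grid(4)}
print(len(param_grid["weights"]))
```

The resulting `param_grid` could be passed to `GridSearchCV` in place of the two-entry grid, at the cost of fitting every candidate once per fold.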

In [74]:
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import MinMaxScaler

In [75]:
pipeline = make_pipeline(MinMaxScaler(), AdaBoostClassifier(n_estimators=100, base_estimator=DecisionTreeClassifier()))

In [76]:
pipeline.fit(trainX,trainY)

Pipeline(memory=None,
         steps=[('minmaxscaler', MinMaxScaler(copy=True, feature_range=(0, 1))),
                ('adaboostclassifier',
                 AdaBoostClassifier(algorithm='SAMME.R',
                                    base_estimator=DecisionTreeClassifier(class_weight=None,
                                                                          criterion='gini',
                                                                          max_depth=None,
                                                                          max_features=None,
                                                                          max_leaf_nodes=None,
                                                                          min_impurity_decrease=0.0,
                                                                          min_impurity_split=None,
                                                                          min_samples_leaf=1,
                                            

In [80]:
gs = GridSearchCV(pipeline,param_grid={'adaboostclassifier__base_estimator':[DecisionTreeClassifier(), LogisticRegression()]}, cv=5)

In [81]:
gs.fit(trainX, trainY)



GridSearchCV(cv=5, error_score='raise-deprecating',
             estimator=Pipeline(memory=None,
                                steps=[('minmaxscaler',
                                        MinMaxScaler(copy=True,
                                                     feature_range=(0, 1))),
                                       ('adaboostclassifier',
                                        AdaBoostClassifier(algorithm='SAMME.R',
                                                           base_estimator=DecisionTreeClassifier(class_weight=None,
                                                                                                 criterion='gini',
                                                                                                 max_depth=None,
                                                                                                 max_features=None,
                                                                                                 max_lea

In [82]:
gs.best_params_

{'adaboostclassifier__base_estimator': LogisticRegression(C=1.0, class_weight=None, dual=False, fit_intercept=True,
                    intercept_scaling=1, l1_ratio=None, max_iter=100,
                    multi_class='warn', n_jobs=None, penalty='l2',
                    random_state=None, solver='warn', tol=0.0001, verbose=0,
                    warm_start=False)}

In [83]:
gs.best_score_

0.9064587973273942
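
The grid key `adaboostclassifier__base_estimator` follows the naming used by `make_pipeline`: each step is named after its class in lowercase, and a double underscore separates the step name from the parameter. The sketch below lists the step names and the tunable AdaBoost keys for a pipeline like the one above. Note, as a caveat, that newer scikit-learn releases renamed the `base_estimator` parameter to `estimator`, so the exact key depends on the installed version; listing the keys is a simple way to check.

```python
from sklearn.ensemble import AdaBoostClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import MinMaxScaler

pipeline = make_pipeline(MinMaxScaler(), AdaBoostClassifier(n_estimators=100))

# make_pipeline derives each step name from the lowercased class name.
step_names = [name for name, _ in pipeline.steps]
print(step_names)

# Keys usable in a GridSearchCV param_grid for the AdaBoost step.
tunable = sorted(
    key for key in pipeline.get_params() if key.startswith("adaboostclassifier__")
)
print(tunable)
```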

In [84]:
import pandas as pd

In [98]:
data = pd.read_csv('Data/NewsAggregatorDataset/newsCorpora.csv', encoding='utf-8', sep='\t', index_col='ID',
                   names=['ID', 'TITLE', 'URL', 'PUBLISHER', 'CATEGORY', 'STORY', 'HOSTNAME', 'TIMESTAMP'])

In [99]:
data.head()

| ID | TITLE | URL | PUBLISHER | CATEGORY | STORY | HOSTNAME | TIMESTAMP |
|---|---|---|---|---|---|---|---|
| 1 | Fed official says weak data caused by weather,... | http://www.latimes.com/business/money/la-fi-mo... | Los Angeles Times | b | ddUyU0VZz0BRneMioxUPQVP6sIxvM | www.latimes.com | 1394470370698 |
| 2 | Fed's Charles Plosser sees high bar for change... | http://www.livemint.com/Politics/H2EvwJSK2VE6O... | Livemint | b | ddUyU0VZz0BRneMioxUPQVP6sIxvM | www.livemint.com | 1394470371207 |
| 3 | US open: Stocks fall after Fed official hints ... | http://www.ifamagazine.com/news/us-open-stocks... | IFA Magazine | b | ddUyU0VZz0BRneMioxUPQVP6sIxvM | www.ifamagazine.com | 1394470371550 |
| 4 | Fed risks falling 'behind the curve', Charles ... | http://www.ifamagazine.com/news/fed-risks-fall... | IFA Magazine | b | ddUyU0VZz0BRneMioxUPQVP6sIxvM | www.ifamagazine.com | 1394470371793 |
| 5 | Fed's Plosser: Nasty Weather Has Curbed Job Gr... | http://www.moneynews.com/Economy/federal-reser... | Moneynews | b | ddUyU0VZz0BRneMioxUPQVP6sIxvM | www.moneynews.com | 1394470372027 |
