# Ensemble Learning

The intuition is to combine several different classifiers into a meta-classifier that often performs better than any of the individual classifiers alone.

In [1]:
import numpy as np
import sklearn

print(np.__version__)
print(sklearn.__version__)

1.9.2
0.17


## Majority Vote Concept

Weighted majority vote: suppose three classifiers label an observation as class [ 0, 0, 1 ] respectively. If the classifiers carry the weights [ 0.2, 0.2, 0.6 ], class 0 collects a total weight of 0.4 and class 1 a total weight of 0.6, so the observation is classified as class 1.

In [6]:
# weights must have the same length as the list of predicted labels ( one weight per classifier )
np.argmax( np.bincount( [ 0, 0, 1 ], weights = [ 0.2, 0.2, 0.6 ] ) )

1
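To see why the result is class 1, it helps to look at the intermediate array that `np.bincount` produces: position k holds the summed weight of the classifiers that voted for class k. A minimal sketch (the helper name `weighted_vote` is hypothetical and only used for illustration):

```python
import numpy as np

def weighted_vote(labels, weights):
    """Return the winning class and the summed weight per class."""
    # totals[k] is the summed weight of every classifier that predicted class k
    totals = np.bincount(labels, weights=weights)
    return int(np.argmax(totals)), totals

winner, totals = weighted_vote([0, 0, 1], [0.2, 0.2, 0.6])
print(totals)   # summed weight for class 0 and class 1
print(winner)   # class with the largest summed weight

# with equal weights the two classifiers voting for class 0 win instead
unweighted_winner, _ = weighted_vote([0, 0, 1], [1, 1, 1])
print(unweighted_winner)
```

The weights only matter when the classifiers disagree; here they let the single heavily weighted classifier overrule the other two.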

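When the classifiers output class probabilities instead of class labels ( each row below holds one classifier's probability for class 0 and class 1 ), the weighted vote becomes a weighted average of those probabilities, and the predicted class is the one with the highest average.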

In [7]:
ex = np.array( [ [ 0.9, 0.1 ], [ 0.8, 0.2 ], [ 0.4, 0.6 ] ] )
p = np.average( ex, axis = 0, weights = [ 0.2, 0.2, 0.6 ] )
print(p)
print( np.argmax(p) )

[ 0.58  0.42]
0
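The weighted average can be checked by hand. The sketch below repeats the computation without `np.average`, using the same numbers as the cell above; the variable names `w` and `p_manual` are introduced here for illustration.

```python
import numpy as np

ex = np.array([[0.9, 0.1], [0.8, 0.2], [0.4, 0.6]])
w = np.array([0.2, 0.2, 0.6])

# weight each classifier's probability row, sum over the classifiers,
# then divide by the total weight ( np.average applies the same normalisation )
p_manual = (w[:, np.newaxis] * ex).sum(axis=0) / w.sum()

# class 0: 0.2 * 0.9 + 0.2 * 0.8 + 0.6 * 0.4 = 0.58
# class 1: 0.2 * 0.1 + 0.2 * 0.2 + 0.6 * 0.6 = 0.42
print(p_manual)
print(np.argmax(p_manual))
```

Note that the three classifiers would individually predict [ 0, 0, 1 ], the same votes as in the class-label example, yet the probability-based vote picks class 0: the third classifier is only mildly confident in class 1, so its large weight is not enough to outweigh the other two.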


## Preparing the Dataset and Algorithm

In [9]:
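from sklearn import datasets
# note: sklearn.cross_validation was replaced by sklearn.model_selection in
# scikit-learn 0.18 and removed in 0.20; the imports below match version 0.17
from sklearn.cross_validation import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.preprocessing import LabelEncoder
from sklearn.cross_validation import cross_val_score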
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier
from sklearn.neighbors import KNeighborsClassifier 
from sklearn.pipeline import Pipeline
import numpy as np

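Load the iris data. Use only two of the features ( sepal width and petal length, columns 1 and 2 ) and only two of the three classes ( the last 100 rows ), which turns this into a binary classification problem.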

In [10]:
iris = datasets.load_iris()
X, y = iris.data[ 50:, [1, 2] ], iris.target[50:]
le = LabelEncoder()
y  = le.fit_transform(y)

# split the data into 50% training / 50% testing
X_train, X_test, y_train, y_test = train_test_split( X, y, test_size = 0.5, random_state = 1 )
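The `LabelEncoder` step is there because the slice `iris.target[50:]` contains the labels 1 and 2, while the voting code later on expects classes numbered from 0. As a rough illustration of what the encoder does for numeric labels, the same mapping can be sketched with plain NumPy (the array `y_raw` is made up for this example):

```python
import numpy as np

y_raw = np.array([2, 1, 2, 1, 1])   # hypothetical labels, not the iris data

# classes: sorted distinct labels; y_enc: position of each label in classes
classes, y_enc = np.unique(y_raw, return_inverse=True)
print(classes)
print(y_enc)

# indexing classes with the encoded labels recovers the original labels,
# which is the role inverse_transform plays for a LabelEncoder
print(classes[y_enc])
```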

Train three different classifiers: 1) logistic regression; 2) decision tree; 3) k-nearest neighbors. Look at their individual performance via 10-fold cross-validation before combining them into an ensemble classifier.

Note that unlike tree-based algorithms, logistic regression and k-nearest neighbors are not scale-invariant, so it is a good habit to standardize the features for them ( here via a Pipeline ).
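To make the scale-invariance point concrete, here is a small sketch with made-up numbers (not the iris data) in which the nearest neighbor of a query point changes once the features are standardized. The helper `nearest` and all values are assumptions chosen for illustration.

```python
import numpy as np

# two hypothetical training points: feature 0 spans about 1 unit,
# feature 1 spans about 1000 units
X_demo = np.array([[1.0, 1000.0],
                   [2.0,    0.0]])
query = np.array([1.9, 600.0])

def nearest(X, q):
    """Index of the row of X closest to q in Euclidean distance."""
    return int(np.argmin(np.linalg.norm(X - q, axis=1)))

# standardize with statistics estimated from the training points only
mean, std = X_demo.mean(axis=0), X_demo.std(axis=0)
X_std, q_std = (X_demo - mean) / std, (query - mean) / std

print(nearest(X_demo, query))   # raw features: the large-scale feature dominates
print(nearest(X_std, q_std))    # standardized: both features contribute
```

On the raw features the distance is driven almost entirely by the second column, so the query is matched to the first point; after standardizing, it is matched to the second. A decision tree splits on one feature at a time and would be unaffected by this rescaling.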

In [11]:
clf1 = LogisticRegression( penalty = 'l2', 
                           C = 0.001, 
                           random_state = 0 )

clf2 = DecisionTreeClassifier(max_depth = 1, 
                              criterion = 'entropy', 
                              random_state = 0 )

clf3 = KNeighborsClassifier( n_neighbors = 1, 
                             p = 2, 
                             metric = 'minkowski')

pipe1 = Pipeline( [ [ 'sc', StandardScaler() ],
                    [ 'clf', clf1 ] ] )
pipe3 = Pipeline( [ [ 'sc', StandardScaler() ],
                    [ 'clf', clf3 ] ] )

clf_labels = [ 'Logistic Regression', 'Decision Tree', 'KNN' ]

for clf, label in zip( [ pipe1, clf2, pipe3 ], clf_labels ) :
    # cross_val_score returns an array with one score per fold
    scores = cross_val_score( estimator = clf, X = X_train, y = y_train, 
                              cv = 10, scoring = 'roc_auc' )
    print( "ROC AUC: %0.2f (+/- %0.2f) [%s]" % 
           ( scores.mean(), scores.std(), label ) )

ROC AUC: 0.92 (+/- 0.20) [Logistic Regression]
ROC AUC: 0.92 (+/- 0.15) [Decision Tree]
ROC AUC: 0.93 (+/- 0.10) [KNN]


## Using the Majority Voter

Implement the majority voter from scratch. First, a quick tour of some of the parts that may be confusing.

### sklearn's `_name_estimators` example usage.

`_name_estimators` ( a private helper in `sklearn.pipeline` ) returns a list of tuples. We can convert it into a dictionary, where the key is the name that `_name_estimators` generates for the estimator and the value is the corresponding estimator.

In [63]:
from sklearn.pipeline import _name_estimators # generate names for estimators 
classifiers = [ pipe1, clf2, pipe3 ]
named_classifiers = { key: value for key, value in _name_estimators( classifiers ) }
named_classifiers

{'decisiontreeclassifier': DecisionTreeClassifier(class_weight=None, criterion='entropy', max_depth=1,
             max_features=None, max_leaf_nodes=None, min_samples_leaf=1,
             min_samples_split=2, min_weight_fraction_leaf=0.0,
             presort=False, random_state=0, splitter='best'),
 'pipeline-1': Pipeline(steps=[['sc', StandardScaler(copy=True, with_mean=True, with_std=True)], ['clf', LogisticRegression(C=0.001, class_weight=None, dual=False, fit_intercept=True,
           intercept_scaling=1, max_iter=100, multi_class='ovr', n_jobs=1,
           penalty='l2', random_state=0, solver='liblinear', tol=0.0001,
           verbose=0, warm_start=False)]]),
 'pipeline-2': Pipeline(steps=[['sc', StandardScaler(copy=True, with_mean=True, with_std=True)], ['clf', KNeighborsClassifier(algorithm='auto', leaf_size=30, metric='minkowski',
            metric_params=None, n_jobs=1, n_neighbors=1, p=2,
            weights='uniform')]])}
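The names in the output follow a simple pattern: the lowercased class name, with a running number appended when the same class appears more than once. The sketch below approximates that naming scheme in plain Python; it is not scikit-learn's source, and `name_estimators_sketch` as well as the two empty stand-in classes are hypothetical.

```python
from collections import Counter

def name_estimators_sketch(estimators):
    """Approximate the naming: lowercase class name, numbered if repeated."""
    names = [type(est).__name__.lower() for est in estimators]
    repeated = {n for n, count in Counter(names).items() if count > 1}
    seen = Counter()
    result = []
    for name, est in zip(names, estimators):
        if name in repeated:
            seen[name] += 1
            name = "%s-%d" % (name, seen[name])
        result.append((name, est))
    return result

# empty stand-in classes so the sketch runs without scikit-learn;
# they only share their names with the real estimators
class Pipeline(object):
    pass

class DecisionTreeClassifier(object):
    pass

demo = [Pipeline(), DecisionTreeClassifier(), Pipeline()]
demo_names = [name for name, _ in name_estimators_sketch(demo)]
print(demo_names)
```

For the list [ pipe1, clf2, pipe3 ] this reproduces the keys seen above: `pipeline-1`, `decisiontreeclassifier` and `pipeline-2`.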

### Python's exceptions: raising a `ValueError`

Side note on the difference between str and repr in Python: https://www.youtube.com/watch?v=5cvM-crlDvg.
str is meant to be readable for users, while repr is meant to be unambiguous ( easier to debug, since it typically looks like the expression that would create the object ).   
`%r` formats the object using its repr.

In [14]:
vote = 1
if vote not in ('probability', 'classlabel') :
    raise ValueError( "vote must be 'probability' or 'classlabel'"
                      "; got (%r)" % vote )

ValueError: vote must be 'probability' or 'classlabel'; got (1)
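A short illustration of why `%r` is the better choice in an error message: with `%s` the integer 1 and the string '1' look identical, while `%r` keeps the quotes and so shows the type of the offending value.

```python
for vote in (1, '1'):
    print("str : got (%s)" % vote)
    print("repr: got (%r)" % vote)

# the str form cannot tell the two values apart, the repr form can
same_str = ("%s" % 1) == ("%s" % '1')
same_repr = ("%r" % 1) == ("%r" % '1')
print(same_str, same_repr)
```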

Loop through all the classifiers and fit a clone of each one.

In [15]:
from sklearn.base import clone
lablenc_ = LabelEncoder()
lablenc_.fit(y)
classes_ = lablenc_.classes_ # the distinct class labels seen by the LabelEncoder
classifiers_ = []
for clf in classifiers:
    fitted_clf = clone(clf).fit( X, lablenc_.transform(y) )
    classifiers_.append(fitted_clf)

Code for when you only want the final ensemble class label of each observation.

In [50]:
# `weights` is the list of classifier weights defined in cell In [49]
predictions = np.asarray([ clf.predict(X) for clf in classifiers_ ]).T
# apply the weighted vote to every row of the array ( one row per observation )
maj_vote = np.apply_along_axis( lambda x:
                                np.argmax( np.bincount( x, weights = weights ) ),
                                axis = 1,
                                arr = predictions )
maj_vote = lablenc_.inverse_transform(maj_vote)
maj_vote

array([0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0,
       0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
       0, 0, 0, 0, 1, 1, 1, 1, 1, 1, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
       1, 1, 1, 1, 1, 1, 1, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
       1, 1, 1, 1, 1, 1, 1, 1])

In [59]:
classifiers_[0].get_params( deep = False )

{'steps': [['sc', StandardScaler(copy=True, with_mean=True, with_std=True)],
  ['clf',
   LogisticRegression(C=0.001, class_weight=None, dual=False, fit_intercept=True,
             intercept_scaling=1, max_iter=100, multi_class='ovr', n_jobs=1,
             penalty='l2', random_state=0, solver='liblinear', tol=0.0001,
             verbose=0, warm_start=False)]]}

In [23]:
# print out the first five predicted rows for all three classifiers
np.asarray( [clf.predict(X) for clf in classifiers_ ] ).T[ :5, : ]

array([[1, 0, 0],
       [0, 0, 0],
       [1, 1, 0],
       [0, 0, 0],
       [0, 0, 0]])

In [39]:
probas = np.asarray( [ clf.predict_proba(X) for clf in classifiers_ ] )
probas

array([[[ 0.49879125,  0.50120875],
        [ 0.5011151 ,  0.4988849 ],
        [ 0.49756567,  0.50243433],
        [ 0.51680288,  0.48319712],
        [ 0.50434615,  0.49565385],
        [ 0.50550797,  0.49449203],
        [ 0.49769299,  0.50230701],
        [ 0.5238264 ,  0.4761736 ],
        [ 0.50324794,  0.49675206],
        [ 0.51357472,  0.48642528],
        [ 0.52589051,  0.47410949],
        [ 0.50679702,  0.49320298],
        [ 0.51789983,  0.48210017],
        [ 0.50208605,  0.49791395],
        [ 0.51486292,  0.48513708],
        [ 0.50337526,  0.49662474],
        [ 0.5033116 ,  0.4966884 ],
        [ 0.51125229,  0.48874771],
        [ 0.51209547,  0.48790453],
        [ 0.51576937,  0.48423063],
        [ 0.49762933,  0.50237067],
        [ 0.51131592,  0.48868408],
        [ 0.50415519,  0.49584481],
        [ 0.50318429,  0.49681571],
        [ 0.50673338,  0.49326662],
        [ 0.50447346,  0.49552654],
        [ 0.50202239,  0.49797761],
        [ 0.49750201,  0.502

In [41]:
probas.shape 

(3, 100, 2)

In [44]:
avg_proba = np.average( probas, axis = 1, weights = None )
avg_proba

array([[ 0.5000001,  0.4999999],
       [ 0.5      ,  0.5      ],
       [ 0.53     ,  0.47     ]])
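Averaging over `axis = 1`, as in the cell above, collapses the 100 observations and leaves one mean probability row per classifier. For a probability-based vote the average would normally be taken across the classifiers instead ( `axis = 0` ), which keeps one row per observation. A minimal sketch with a small made-up array standing in for `probas` ( 3 classifiers, 2 observations, 2 classes ); the names `probas_demo` and `weights_demo` are assumptions for this example:

```python
import numpy as np

probas_demo = np.array([
    [[0.9, 0.1], [0.3, 0.7]],   # classifier 1
    [[0.8, 0.2], [0.4, 0.6]],   # classifier 2
    [[0.4, 0.6], [0.2, 0.8]],   # classifier 3
])
weights_demo = [0.2, 0.2, 0.6]

# weighted average across classifiers -> shape ( n_observations, n_classes )
avg = np.average(probas_demo, axis=0, weights=weights_demo)
labels = np.argmax(avg, axis=1)

print(avg.shape)
print(avg)
print(labels)   # predicted class per observation
```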

In [49]:
weights = [ 0.4, 0.4, 0.3 ]

In [None]:
#mv_clf = MajorityVoteClassifier(
#                classifiers=[pipe1, clf2, pipe3])
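The `MajorityVoteClassifier` referred to in the commented-out cell is not defined in this section. The sketch below is one possible way to assemble the pieces explored above into a class; it is a simplified illustration under stated assumptions, not the original implementation. To stay free of scikit-learn it uses `copy.deepcopy` where the cells above use `sklearn.base.clone` and `np.unique` where they use `LabelEncoder`. The names `MajorityVoteSketch` and `ConstantStub` are hypothetical.

```python
import copy
import numpy as np

class MajorityVoteSketch(object):
    """Simplified weighted majority-vote ensemble ( illustrative only )."""

    def __init__(self, classifiers, vote='classlabel', weights=None):
        if vote not in ('probability', 'classlabel'):
            raise ValueError("vote must be 'probability' or 'classlabel'"
                             "; got (%r)" % vote)
        if weights is not None and len(weights) != len(classifiers):
            raise ValueError("need exactly one weight per classifier")
        self.classifiers = classifiers
        self.vote = vote
        self.weights = weights

    def fit(self, X, y):
        # map arbitrary labels to 0 .. n_classes - 1 so np.bincount can be used
        self.classes_, y_enc = np.unique(y, return_inverse=True)
        # fit copies so the estimators passed in are left untouched
        self.classifiers_ = [copy.deepcopy(clf).fit(X, y_enc)
                             for clf in self.classifiers]
        return self

    def predict_proba(self, X):
        probas = np.asarray([clf.predict_proba(X) for clf in self.classifiers_])
        # average across classifiers -> shape ( n_observations, n_classes )
        return np.average(probas, axis=0, weights=self.weights)

    def predict(self, X):
        if self.vote == 'probability':
            winners = np.argmax(self.predict_proba(X), axis=1)
        else:
            preds = np.asarray([clf.predict(X) for clf in self.classifiers_]).T
            n_classes = len(self.classes_)
            winners = np.array([
                np.argmax(np.bincount(row, weights=self.weights,
                                      minlength=n_classes))
                for row in preds.astype(int)])
        # translate the encoded winners back to the original labels
        return self.classes_[winners]


class ConstantStub(object):
    """Hypothetical stand-in estimator that always predicts one encoded class."""

    def __init__(self, label, n_classes=2):
        self.label, self.n_classes = label, n_classes

    def fit(self, X, y):
        return self

    def predict(self, X):
        return np.full(len(X), self.label, dtype=int)

    def predict_proba(self, X):
        proba = np.zeros((len(X), self.n_classes))
        proba[:, self.label] = 1.0
        return proba


X_demo = np.zeros((4, 2))
y_demo = np.array(['a', 'b', 'a', 'b'])
stubs = [ConstantStub(0), ConstantStub(0), ConstantStub(1)]
mv = MajorityVoteSketch(stubs, weights=[0.2, 0.2, 0.6]).fit(X_demo, y_demo)
print(mv.predict(X_demo))
```

With the weights [ 0.2, 0.2, 0.6 ] the third stub overrules the other two, mirroring the example from the Majority Vote Concept section. Fitted scikit-learn estimators expose the same `fit`, `predict` and `predict_proba` methods, so in principle the pipelines above could be passed in place of the stubs, though that has not been verified here.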

In [65]:
for name, step in named_classifiers.items():
    print(name, step)

pipeline-2 Pipeline(steps=[['sc', StandardScaler(copy=True, with_mean=True, with_std=True)], ['clf', KNeighborsClassifier(algorithm='auto', leaf_size=30, metric='minkowski',
           metric_params=None, n_jobs=1, n_neighbors=1, p=2,
           weights='uniform')]])
pipeline-1 Pipeline(steps=[['sc', StandardScaler(copy=True, with_mean=True, with_std=True)], ['clf', LogisticRegression(C=0.001, class_weight=None, dual=False, fit_intercept=True,
          intercept_scaling=1, max_iter=100, multi_class='ovr', n_jobs=1,
          penalty='l2', random_state=0, solver='liblinear', tol=0.0001,
          verbose=0, warm_start=False)]])
decisiontreeclassifier DecisionTreeClassifier(class_weight=None, criterion='entropy', max_depth=1,
            max_features=None, max_leaf_nodes=None, min_samples_leaf=1,
            min_samples_split=2, min_weight_fraction_leaf=0.0,
            presort=False, random_state=0, splitter='best')
