# COMP47590: Advanced Machine Learning
# Assignment 1: The Super Learner

## Environment Setup

Project completed using a Conda virtual env.

   - [Anaconda3](https://conda.io/docs/user-guide/install/index.html)

   - [Managing Conda Environments](https://conda.io/docs/user-guide/tasks/manage-environments.html)

To set up an environment identical to the one used:

    conda env create -f environment.yml

Or, to build your own:

    conda create --name ENV_NAME --file spec-file.txt
    
Both files can be found in the env folder.


Unofficial Jupyter extensions were used to make Jupyter Notebook a more in-depth IDE. 
They aren't included in spec-file.txt or environment.yml, but they are very cool.

   - [nbextensions](http://jupyter-contrib-nbextensions.readthedocs.io/en/latest/index.html)

## Import Packages Etc

In [1]:

import numpy as np
import pandas as pd
import unittest
import os

from time import time
from functools import reduce
from inspect import getmembers

from itertools import combinations
from itertools import permutations
from itertools import chain

from IPython.display import display, HTML, Image
from IPython.html.services.config import ConfigManager

from sklearn.base import BaseEstimator
from sklearn.base import ClassifierMixin

from sklearn.neural_network import MLPClassifier

from sklearn.neighbors import KNeighborsClassifier
from sklearn.neighbors import NearestCentroid

from sklearn.svm import SVC

from sklearn.ensemble import ExtraTreesClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.ensemble import AdaBoostClassifier
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.ensemble import RandomForestRegressor

from sklearn.gaussian_process import GaussianProcessClassifier

from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.discriminant_analysis import QuadraticDiscriminantAnalysis

from sklearn.tree import DecisionTreeClassifier
from sklearn.tree import DecisionTreeRegressor

from sklearn.naive_bayes import GaussianNB
from sklearn.naive_bayes import MultinomialNB
from sklearn.naive_bayes import BernoulliNB

from sklearn.linear_model import LogisticRegression
from sklearn.linear_model import SGDClassifier
from sklearn.linear_model import PassiveAggressiveClassifier


from sklearn.utils import resample
from sklearn.utils.validation import check_X_y
from sklearn.utils.validation import check_array
from sklearn.utils.validation import check_is_fitted


from sklearn.utils.multiclass import unique_labels

from sklearn.metrics import f1_score
from sklearn.metrics import accuracy_score
from sklearn.metrics import matthews_corrcoef
from sklearn.metrics import log_loss
from sklearn.metrics import confusion_matrix

from sklearn.model_selection import train_test_split
from sklearn.model_selection import StratifiedKFold
from sklearn.model_selection import cross_val_predict
from sklearn.model_selection import cross_val_score
from sklearn.model_selection import GridSearchCV
from sklearn.model_selection import ParameterGrid
from sklearn.model_selection import ShuffleSplit
from sklearn.model_selection import KFold


from sklearn.exceptions import NotFittedError
from sklearn.exceptions import ConvergenceWarning

from sklearn.datasets import load_iris

from sklearn.externals import joblib



#%matplotlib inline
#%qtconsole





### Paths and PKLS
In this notebook I use joblib to pickle (PKL) classifiers, which saves re-training models and re-running grid searches. It is also a safety net against loss of results. To keep things organised, the pickles are kept in the `pkls` folder. They cannot, however, be included in the submission as they are too large. Some searches were kept because they hold the search's best model and the results across the grid search; these give valuable information for reflecting on the model and understanding which parameters and base learners suit the MNIST set.

#### IMPORTANT
To create a new set of pkls, simply delete, move, or rename the old ones.
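The load-if-present, otherwise-fit-and-save pattern described above can be sketched as follows. This is a minimal illustration, not the notebook's own caching code: `fit_or_load` is a hypothetical helper, the temporary directory stands in for the real `pkls` folder, and the standalone `joblib` package is assumed (the notebook itself imports joblib through `sklearn.externals`).

```python
import os
import tempfile

import joblib
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier


def fit_or_load(estimator, X, y, path):
    """Hypothetical helper: reuse a pickled estimator if one exists."""
    if os.path.exists(path):
        # A pickle is already on disk, so skip training entirely.
        return joblib.load(path)
    estimator.fit(X, y)
    os.makedirs(os.path.dirname(path), exist_ok=True)
    joblib.dump(estimator, path)
    return estimator


X_demo, y_demo = load_iris(return_X_y=True)
demo_root = tempfile.mkdtemp()  # stand-in for the notebook's pkls folder
demo_path = os.path.join(demo_root, 'pkls', 'tuned_bases', 'tree.pkl')

# The first call trains and saves; the second call loads from disk.
first = fit_or_load(DecisionTreeClassifier(random_state=0),
                    X_demo, y_demo, demo_path)
second = fit_or_load(DecisionTreeClassifier(random_state=0),
                     X_demo, y_demo, demo_path)
```

Deleting, moving, or renaming `demo_path` would force the next call to retrain, which mirrors the instruction above.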


In [2]:
pkls = os.path.join(os.getcwd(),'pkls')
tuned_bases = os.path.join(pkls, 'tuned_bases')
super_learners = os.path.join(pkls, 'super_learners')
searches = os.path.join(pkls, 'searches')


#Some useful functions.

def name(class_str):
    '''A small function for parsing class names'''
    class_str = type(class_str)
    class_str = str(class_str).split('.')[-1]
    alphas = [c for c in list(class_str) if c.isalnum()]
    return ''.join(alphas)



#This function should exist in SKLearn!
def bootstrapped_KFold(X, y=None, n_splits=3):
    '''Returns a generator for KFold with resampling with replacement'''
    kf = KFold(n_splits=n_splits, shuffle=True)
    bootstrapped_cv = ((resample(train), test) for train, test in kf.split(X, y))
    return bootstrapped_cv

list(bootstrapped_KFold(np.arange(10), n_splits=3))

[(array([3, 6, 8, 3, 6, 8]), array([1, 2, 4, 5])),
 (array([0, 0, 7, 1, 4, 4, 4]), array([3, 6, 9])),
 (array([5, 5, 1, 2, 2, 5, 5]), array([0, 7, 8]))]
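A quick sanity check on the idea, written as a self-contained sketch: the test folds should still partition the data, and bootstrapping the training indices should never leak a test index into training. The function name `bootstrapped_kfold_demo` and the fixed `seed` are assumptions added here for reproducibility; the notebook's version is unseeded.

```python
import numpy as np
from sklearn.model_selection import KFold
from sklearn.utils import resample


def bootstrapped_kfold_demo(X, n_splits=3, seed=0):
    """Seeded variant of the idea above, returned as a list of folds."""
    kf = KFold(n_splits=n_splits, shuffle=True, random_state=seed)
    # Only the training indices are resampled (with replacement).
    return [(resample(train, random_state=seed), test)
            for train, test in kf.split(X)]


folds = bootstrapped_kfold_demo(np.arange(10))

# Every sample appears in exactly one test fold.
all_test = np.sort(np.concatenate([test for _, test in folds]))

# Number of indices shared between each bootstrapped train set and its
# test fold; resampling draws from train only, so this should be zero.
leaks = [np.intersect1d(train, test).size for train, test in folds]

# resample keeps the training set the same length as before.
sizes_match = [len(train) == 10 - len(test) for train, test in folds]
```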

### Useful Diversity Metrics

In [3]:
'''Pairwise diversity measures.
ref : Kuncheva et al. 2003
Similar implementation @ http://brew.readthedocs.io/en/latest/
brew.diversity.paired.py has a Python2 implementation for its ensemble API.
Key differences:
Brew produces an aggregation of each metric for the ensemble.
Brew is faster. 

My functions produce pairwise metrics for each pair of base estimators,
allowing for more in-depth analysis. These functions are more fit for
purpose in this project.
'''

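def coefs(y_pred_1, y_pred_2, y):
    '''Counts of samples where both (a), only the first (b), only the
    second (c) or neither (d) prediction is correct.'''
    A = np.equal(y_pred_1, y)
    B = np.equal(y_pred_2, y)
    # Fix the label order so a 2x2 matrix is returned even when a
    # classifier is always right (or always wrong).
    d,c,b,a = confusion_matrix(A, B, labels=[False, True]).ravel()
    return a,b,c,d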

def disagree(y_pred_1, y_pred_2, y):
    a,b,c,d = coefs(y_pred_1, y_pred_2, y)
    return (b + c) / sum((a,b,c,d))

def q_stat(y_pred_1, y_pred_2, y):
    a,b,c,d = coefs(y_pred_1, y_pred_2, y)
    return (a*d - b*c)/(a*d + b*c)

def double_fault(y_pred_1, y_pred_2, y):
    a,b,c,d = coefs(y_pred_1, y_pred_2, y)
    return d / sum((a,b,c,d))


'''Ensemble expertise metrics; I made these to get more diagnostic
information on how well the ensemble covers data in the test set.'''
def unanimous_agreement(ensemble, X, y):
    '''Measures the proportion of the data where every base predicts the
    correct label, i.e. the bases unanimously agree on the right answer.
    '''
    accum = np.equal(y,y)
    for base in ensemble:
        data = np.equal(base.predict(X), y)
        accum = (data*accum)
    return np.sum(accum) / len(accum)
        
def missing_expertise(ensemble, X, y):
    '''Measures the proportion of the data where no member has produced a
    correct prediction.
    '''
    accum = np.zeros(len(y), dtype=bool) #Falses, also works for string labels
    for base in ensemble:
        data = np.equal(base.predict(X), y) #True for correct predictions
        accum = np.logical_or(data, accum) #True for any correct prediction.
    return np.sum(~accum) / len(accum)
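To make the pairwise measures concrete, here is a small worked example on made-up predictions. It is a sketch that counts the four cases directly with NumPy rather than through `confusion_matrix`; the helper name `pair_counts` and the toy arrays are assumptions for illustration only.

```python
import numpy as np


def pair_counts(y_pred_1, y_pred_2, y):
    """Return (a, b, c, d): both right, only first, only second, neither."""
    ok1 = np.equal(y_pred_1, y)
    ok2 = np.equal(y_pred_2, y)
    a = int(np.sum(ok1 & ok2))
    b = int(np.sum(ok1 & ~ok2))
    c = int(np.sum(~ok1 & ok2))
    d = int(np.sum(~ok1 & ~ok2))
    return a, b, c, d


# Toy labels and predictions from two hypothetical classifiers.
y_true = [0, 1, 1, 0, 1, 0, 1, 1]
pred_1 = [0, 1, 0, 0, 1, 1, 1, 1]
pred_2 = [0, 0, 1, 0, 1, 1, 1, 0]

a, b, c, d = pair_counts(pred_1, pred_2, y_true)   # (4, 2, 1, 1)
total = a + b + c + d

disagreement = (b + c) / total                     # 3 / 8 = 0.375
double_fault_rate = d / total                      # 1 / 8 = 0.125
q_statistic = (a * d - b * c) / (a * d + b * c)    # 2 / 6, about 0.333
```

Note that the Q statistic is undefined (0/0) when `a*d + b*c` is zero, for example when one classifier makes no mistakes and the pair has no double faults.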


## Define Super Learner Classifier

The *Super Learner* is a heterogeneous stacked ensemble classifier. This is a classification model that uses a set of base classifiers of different types, the outputs of which are then combined in another classifier at the stacked layer. The Super Learner was described in [(van der Laan et al, 2007)](https://pdfs.semanticscholar.org/19e9/c732082706f39d2ba12845851309714db135.pdf) but the stacked ensemble idea has been around for a long time. 

Figure 1 shows a flow diagram of the Super Learner process (this is from (van der Laan et al, 2007) and the process is also described in the COMP47590 lecture "[COMP47590 2017-2018 L04 Supervised Learning Ensembles 3](https://www.dropbox.com/s/1ksx94nxtuyn4l8/COMP47590%202017-2018%20L04%20Supervised%20Learning%20Ensembles%203.pdf?raw=1)"). The base classifiers are trained and their outputs are combined along with the training dataset labels into a training set for the stack layer classifier. To avoid overfitting, the generation of the stacked layer training set uses a k-fold cross validation process (described as V-fold in Figure 1). To add further variety to the base estimators, a bootstrapping selection (as used in the bagging ensemble approach) is applied to the training folds.
 
![Super Learner Process Flow](SuperLearnerProcessFlow.png "Figure 1: Super Learner process flow")
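The flow described above can be sketched in a few lines with plain scikit-learn calls. This is a simplified illustration under stated assumptions, not the `SuperLearnerClassifier` defined later: it uses ordinary stratified 5-fold cross validation without bootstrapping, label outputs only, and an arbitrary pair of base classifiers.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_predict
from sklearn.tree import DecisionTreeClassifier

X_demo, y_demo = load_iris(return_X_y=True)
bases = [LogisticRegression(max_iter=1000),
         DecisionTreeClassifier(random_state=0)]

# Step 1: out-of-fold predictions, one column per base classifier.
# Each prediction comes from a model that never saw that sample.
stacked_train = np.column_stack(
    [cross_val_predict(clf, X_demo, y_demo, cv=5) for clf in bases])

# Step 2: the stack layer learns from the base outputs and the labels.
meta = DecisionTreeClassifier(random_state=0).fit(stacked_train, y_demo)

# Step 3: refit the bases on all of the data for use at prediction time.
for clf in bases:
    clf.fit(X_demo, y_demo)

# Step 4: to predict, feed the base outputs through the stack layer.
stacked_new = np.column_stack([clf.predict(X_demo) for clf in bases])
predictions = meta.predict(stacked_new)
```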



### Define the SuperLearnerClassifier Class

The following documentation was adhered to when creating the SuperLearnerClassifier:

   - [Sklearn contributors guidelines](http://scikit-learn.org/stable/developers/contributing.html)
   
    
   - [PEP8 style for SKLearn consistency](https://www.python.org/dev/peps/pep-0008/)
   
   
   - [NumPy/SciPy docstring conventions](https://github.com/numpy/numpy/blob/master/doc/HOWTO_DOCUMENT.rst.txt)
   
TL;DR:

   - Docstring line lengths <= 75 characters.

   - Code line lengths <= 79 characters (PEP 8).

   - Docstrings are split into sections, e.g. Short Summary, Parameters, Raises, Returns.

   - Private methods and attributes are left out of the docs if they are not part of the public API.
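As an illustration of these conventions, here is a small function documented in the NumPy docstring style. The function `majority_label` is a hypothetical example written for this note and is not part of the notebook's code.

```python
def majority_label(labels):
    """Return the most frequent label in a sequence.

    Parameters
    ----------
    labels : sequence of hashable
        Class labels, e.g. the predictions of several base estimators
        for one sample.

    Returns
    -------
    label : hashable
        The label that occurs most often. Ties go to the label that
        appears first.

    Raises
    ------
    ValueError
        If `labels` is empty.
    """
    labels = list(labels)
    if not labels:
        raise ValueError('labels must not be empty')
    # dict.fromkeys keeps first-seen order and max returns the first
    # maximal item, so ties resolve to the earliest label.
    return max(dict.fromkeys(labels), key=labels.count)


# Check the docstring against the 75 character limit noted above.
doc_lines = majority_label.__doc__.splitlines()
longest_doc_line = max(len(line) for line in doc_lines)
```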

In [4]:

class SuperLearnerClassifier(BaseEstimator, ClassifierMixin):
    """An ensemble classifier that uses heterogeneous models in its base 
    layer and an aggregation model at the stacked layer. K-fold cross
    validation is used to generate training data for the stacked layer.

    The class inherits from BaseEstimator (base class for all estimators
    in scikit-learn) and ClassifierMixin (mixin class for all classifiers
    in scikit-learn). Constructor parameters are documented in __init__.

    
    Attributes
    ----------
    __allowed_bases : tuple of class names
                      permitted base layer estimators.
    __allowed_stacked : tuple of class names
                        permitted stacked layer estimators.
    estimators : list of estimators
                 Instances of base estimators.
    stacked_estimator: estimator
                       estimator used in the stack layer.                       
    probability : bool, optional
                  True/False to use probability/class labels in stacked
                  dataset.    
    keep_original : bool, optional
                    include/exclude original input data to stacked layer                    
    probabilities : dict
                    stores whether a base can output a probability label.
    is_fitted_ : bool
                 stores whether classifier has been fitted.
    
    Methods
    -------    
    fit(X, y)
        Fit the model according to the given training data.
    get_params([deep])
        Get parameters for this estimator.
    predict(X)
        Predict class labels for samples in X.
    predict_proba(X)
        Probability estimates.
    set_params(**params)
        Set the parameters of this estimator.    
    todo: score(X, y)
    todo: add sample_weight for fit,score
    
    
    Notes
    -----
    1. ensures conformity to allowed bases.
    2. ensures conformity to allowed stacked estimators.
    3. To pass GridSearch estimators; use their best_estimator_ attribute.
    4. handles GridSearch type estimators and checks for their conformity.
    5. allows for use of n_jobs where appropriate.
    6. No Pandas dependency; data is handled using NumPy/SKLearn.
    7. param and attribute names appropriately aligned with SKLearn.
    8. Can select which estimators give probability-based labels.

    References
    ----------
    [1]  van der Laan, M., Polley, E. & Hubbard, A. (2007). 
         Super Learner. Statistical Applications in Genetics 
         and Molecular Biology, 6(1) 
         doi:10.2202/1544-6115.1309
    
    Examples
    --------
    >>> from sklearn.datasets import load_iris
    >>> from sklearn.model_selection import cross_val_score
    >>> clf = SuperLearnerClassifier()
    >>> iris = load_iris()
    >>> cross_val_score(clf, iris.data, iris.target, cv=10)
    """
        
    # Constructor for the classifier object
    def __init__(self, estimators=None, stacked_estimator=None,
                 cv=10, probability=False, keep_original=False):       
        """Setup a SuperLearner classifier.
        Parameters
        ----------
        estimators : list of estimators or int, optional (default=None)
                     A list of the instances of base estimators. If None,
                     estimators default to a logistic regression and a
                     decision tree. If an int n where
                     0 < n < len(allowed_bases), the list equals the first
                     n estimators in allowed_bases.
        stacked_estimator : estimator, optional (default=None)
                            Estimator to be used on the stack layer. If
                            None, defaults to a decision tree.
        cv : int, optional (default=10)
             folds of cross validation used to create stacked layer.
        probability : bool, optional (default=False)
                      if True, probability estimations for all possible
                      learners are used to generate stacked dataset instead
                      of class labels. False, class label predictions are
                      added to stacked dataset.
        keep_original : bool, optional (default=False)
                        True/False to include/exclude raw input data in the
                        stacked training set.        

        Returns
        -------
        None
        
        Notes
        -------
        1. No logic, just attribute assignment. Consistent with SKLearn.

        """
        self.__allowed_bases = (
            MLPClassifier,
            KNeighborsClassifier,
            NearestCentroid,
            #SVC, #probability=True
            ExtraTreesClassifier,
            RandomForestClassifier,
            AdaBoostClassifier,
            GradientBoostingClassifier,
            GaussianProcessClassifier,
            GaussianNB,
            BernoulliNB,
            MultinomialNB,
            LogisticRegression,
            #SGDClassifier,
            #PassiveAggressiveClassifier,
            DecisionTreeClassifier,
            LinearDiscriminantAnalysis,
            QuadraticDiscriminantAnalysis
        )
        
        self.__allowed_stacked = (
            #MLPClassifier,
            KNeighborsClassifier,
            NearestCentroid,
            #SVC, #predict_proba if probability=True
            ExtraTreesClassifier,
            RandomForestClassifier,
            AdaBoostClassifier,
            GradientBoostingClassifier,
            GaussianProcessClassifier,
            GaussianNB,
            BernoulliNB,
            MultinomialNB,
            LogisticRegression,
            #SGDClassifier,
            #PassiveAggressiveClassifier,
            DecisionTreeClassifier,
            LinearDiscriminantAnalysis,
            QuadraticDiscriminantAnalysis
        )
        
        self.estimators = estimators
        self.stacked_estimator = stacked_estimator
        self.cv = cv    
        self.probability = probability
        self.keep_original = keep_original
        
       
    
    def fit(self, X, y):
        """Build a SuperLearner classifier from the training set (X, y).
        Parameters
        ----------
        X : array-like, shape = [n_samples, n_features]
            The training input samples. 
        y : array-like, shape = [n_samples] 
            The target values (class labels) as integers or strings.
        
        Returns
        -------
        self
        
        """        
        X, y = check_X_y(X, y)
        self.classes_, y = np.unique(y, return_inverse=True)
        self._setup_base_estimators()
        self._setup_stacked_estimators()
        self._setup_probabilities()
        stacked_set = self.__setup_stacked_set(X,y, purpose='train')
        for clf in self.estimators:
            clf.fit(X, y)
        # n_jobs is a constructor parameter of estimators, not an argument
        # of fit, so the stacked estimator is fitted directly.
        self.stacked_estimator.fit(stacked_set, y)
        self.is_fitted_ = True
        return self


    
    def predict(self, X, y=None):
        """Predict class labels of the input samples X.
        Parameters
        ----------
        X : array-like matrix of shape = [n_samples, n_features]
            The input samples. 
        y : not used
       
        Raises
        ------
        NotFittedError
            if 'fit' has not been executed before 'predict' is called.
        
        Returns
        -------
        p : array of shape = [n_samples, ].
            The predicted class labels of the input samples.
        
        Notes
        ------
        1. Predicts outcome according to each base classifier
        2. feeds those predictions (and raw data if keep_original=True) to 
           stacked estimator.
        
        """
        X = check_array(X)
        check_is_fitted(self, ['is_fitted_'])
        stacked_set = self.__setup_stacked_set(X,y, purpose='predict')
        p = self.stacked_estimator.predict(stacked_set)
        # fit() encodes y as indices into classes_, so map them back.
        return self.classes_[p]
        #except NotFittedError as e:
            #print('Error in SuperLearnerClassifier',repr(e))
            #raise NotFittedError('Call fit before prediction')
    

    
    def predict_proba(self, X):
        """Predict class probabilities of the input samples X.
        Parameters
        ----------
        X : array-like matrix of shape = [n_samples, n_features]
            The input samples. 
        
        Raises
        ------
        NotFittedError
            if 'fit' has not been executed before 'predict_proba' is called.
        
        Returns
        -------
        p : array of shape = [n_samples, n_labels].
            The predicted class label probabilities of the input samples. 
        """
        X = check_array(X)
        check_is_fitted(self, ['is_fitted_'])
        stacked_set = self.__setup_stacked_set(X, purpose='predict')
        p = self.stacked_estimator.predict_proba(stacked_set)
        return p
    
    
    def diversity(self, X, y):
        '''Returns pairwise diversity metrics for all base pairs.
        Parameters
        ----------
        X : array-like, shape = [n_samples, n_features]
            The training input samples. 
        y : array-like, shape = [n_samples] 
            The target values (class labels) as integers or strings.
        
        Returns
        -------
        pairwise : dict
                   dictionary diversity metrics for each pair of
                   estimators in self.estimators
        '''
        X, y = check_X_y(X, y)
        check_is_fitted(self, ['is_fitted_'])
        all_pairs = list(combinations(self.estimators,2))
        
        def pair(*args):
            return tuple([name(arg) for arg in args])
        
        pairwise = {'Pairs' : [pair(one, two)\
                              for one, two in all_pairs],       
            'Matthews Coef.'  : [matthews_corrcoef(one.predict(X),
                                                   two.predict(X)) \
                                 for one,two in all_pairs],
            'Disagree': [disagree(one.predict(X),
                                  two.predict(X),
                                  y) \
                        for one, two in all_pairs],
            'Q': [q_stat(one.predict(X),
                         two.predict(X),
                         y) \
                  for one, two in all_pairs],
            'Double Fault':[double_fault(one.predict(X),
                                         two.predict(X),
                                         y) \
                            for one, two in all_pairs]
           }
        return pairwise
        
    
    def score_bases(self, X, y):
        '''Returns the accuracy of each base estimator on (X, y).
        Parameters
        ----------
        X : array-like, shape = [n_samples, n_features]
            The input samples. 
        y : array-like, shape = [n_samples] 
            The target values (class labels) as integers or strings.
        
        Returns
        -------
        scores : dict
                 dictionary holding the name and accuracy of each base
                 estimator in self.estimators
        '''
        X, y = check_X_y(X, y)
        check_is_fitted(self, ['is_fitted_'])
        scores = {'Estimator': [name(base) \
                               for base in self.estimators],
                  'Accuracy' : [accuracy_score(y, base.predict(X))\
                               for base in self.estimators],
             }
        return scores
    
    
    def coverage(self, X, y):
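        '''Returns ensemble coverage metrics for the base estimators.
        Parameters
        ----------
        X : array-like, shape = [n_samples, n_features]
            The input samples. 
        y : array-like, shape = [n_samples] 
            The target values (class labels) as integers or strings.
        
        Returns
        -------
        scores : dict
                 proportion of samples that every base predicts correctly
                 ('Unanimous_Agreement') and that no base predicts
                 correctly ('Missing_Expertise').
        '''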
        X, y = check_X_y(X, y)
        check_is_fitted(self, ['is_fitted_'])
        scores = {'Unanimous_Agreement' : unanimous_agreement(self.estimators,
                                                          X,
                                                          y),
                  'Missing_Expertise' : missing_expertise(self.estimators,
                                                          X,
                                                          y)
                 }
        return scores
    
    
    def _setup_base_estimators(self):
        '''runs setup logic for estimators attribute.
        Parameters
        ----------
        None
        
        Raises
        ------
        ValueError
             if invalid int is used
             or forbidden base estimator is used.
        TypeError
             if an invalid type for estimators is used. e.g. dict.
        
        Returns
        -------
        None
        '''
        if self.estimators is None: 
            self.estimators = [LogisticRegression(),
                               DecisionTreeClassifier()]  
        elif isinstance(self.estimators, int):
            if 0 < self.estimators < len(self.__allowed_bases): 
                subset = self.__allowed_bases[0:self.estimators]
                self.estimators = [clf() for clf in subset]
            else: 
                raise ValueError('Set classifiers to an int between 0 and',
                                 len(self.__allowed_bases))
        elif any([self._is_valid_base(clf) for clf in self.estimators]):
            if all([self._is_valid_base(clf) for clf in self.estimators]): 
                self.estimators = list(self.estimators)
            else: 
                raise ValueError('Invalid base estimator used.',
                                 'Use any of--->', self.__allowed_bases)
        else: raise TypeError('Unexpected type--->',type(self.estimators),\
                              '. Use--->int or list of classifiers')
            
    
    
    def _setup_stacked_estimators(self):
        '''runs setup logic for stacked estimators attribute
        Parameters
        ----------
        None
        
        Raises
        ------

        TypeError
             if an invalid estimators is used. e.g. random forest regressor
        
        Returns
        -------
        None
        '''
        if self.stacked_estimator is None:
            self.stacked_estimator = DecisionTreeClassifier()
        elif not isinstance(self.stacked_estimator, self.__allowed_stacked): 
            raise TypeError('Invalid stacked estimator used--->',\
                            type(self.stacked_estimator),\
                            'Use any of--->',\
                            self.__allowed_stacked )
     
    
    
    def _setup_probabilities(self):
        '''runs setup logic for probabilities attribute
        Parameters
        ----------
        None
        
        Returns
        -------
        None
        '''
        def has_proba(clf):
            return callable(getattr(clf, 'predict_proba', None))
        if self.probability:
            self.__probabilities = {clf:has_proba(clf) for clf in self.estimators}
        else:
            self.__probabilities = {clf:False for clf in self.estimators}
    
    
    
    def _is_valid_base(self, clf):
        """checks if clf is a permitted base estimator
        Parameters
        ----------
        clf : estimator
        
        Returns
        -------
        permitted : bool
                    True if classifier is valid, false otherwise.
        """
        is_allowed = isinstance(clf, self.__allowed_bases)
        is_grid_search = (hasattr(clf, 'best_estimator_') \
                          and isinstance(clf.best_estimator_, 
                                         self.__allowed_bases))
        permitted = is_allowed or is_grid_search
        return permitted


    
    def __setup_stacked_set(self, X, y=None, purpose='predict'):
        """Build the dataset that is fed to the stacked layer.
        
        Parameters
        ----------
        X : array-like matrix of shape = [n_samples, n_features]
            The input samples. 
        y : array-like, shape = [n_samples], optional (default=None)
            The target values. Only used when purpose is 'train'.
        purpose : str, optional (default='predict')
                  'predict' if the stacked set is built for prediction.
                  'train' if the stacked set is built for training/fitting.
        
        Returns
        -------
        stacked_set : array of shape = [n_samples, n_stacked_features].
                      dataset used for stacked layer, in training and 
                      prediction
        """
        X = check_array(X)
        no_rows, no_cols = X.shape
        if not self.keep_original: no_cols = 0
        i = no_cols
        size_method = self.__size_method(purpose)
        extra_cols = sum(size_method[clf][0] for clf in self.estimators)
        stacked_set = np.zeros((no_rows, no_cols + extra_cols))
        
        if purpose == 'train':                
            for clf in self.estimators:
                size, method = size_method[clf]
                
                #cross_val_predict with bootstrapping
                stacked_set[:,i:i+size] = cross_val_predict(
                    estimator=clf,
                    X=X, 
                    y=y, 
                    cv=bootstrapped_KFold(X, n_splits=self.cv),
                    method=method).reshape(no_rows, size)
                i += size
        
        elif purpose == 'predict':        
            for clf in self.estimators:
                size, method_func = size_method[clf]
                stacked_set[:,i:i+size] = method_func(
                    X).reshape(no_rows, size)
                i += size
        
        else:
            raise ValueError('set purpose to either `train` or `predict`')
        
        if self.keep_original: stacked_set[0:no_rows, 0:no_cols] = X            
        return stacked_set
    
    
    
    def __size_method(self, purpose):
        """Map each base estimator to its output size and output method.
        Parameters
        ----------
        purpose : str
                  'train' or 'predict'; see _method_of_output.
        
        Returns
        -------
        size_method : dictionary with key: clf and value tuple(size, method)
        """
        size = self._size_of_output
        method = self._method_of_output
        size_method = {clf:(size(clf), 
                            method(clf,purpose)) for clf in self.estimators}
        return size_method
    
    
    
    def _size_of_output(self, clf):
        """returns the number of columns clf adds to the stacked set.
        Parameters
        ----------
        clf : estimator
        
        Returns
        -------
        size : int
                if classifier uses probability output labels, size is 
                number of labels, otherwise 1.
        """
        if self.probability and self.__probabilities[clf]:
            return len(self.classes_)
        else:
            return 1
        
    
    
    def _method_of_output(self, clf, purpose):
        """returns the method used to get predictions from clf.
        Parameters
        ----------
        clf : estimator
        purpose : str
                  'train' or 'predict'.
        
        Returns
        -------
        method : str, function_object
                 method string if purpose is training
                 function object if purpose is prediction.
        """
        try_method = getattr(clf, 'predict_proba', clf.predict)
        if hasattr(clf, 'predict_proba'):
            try_string= 'predict_proba'  
        else:
            try_string = 'predict'
                    
        results = {'train':['predict', try_string],
                  'predict':[clf.predict, try_method]}
        condition = 1 if self.__probabilities[clf] else 0
        method = results[purpose][condition]
        return method
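The `probability` and `keep_original` options determine how wide the stacked dataset is: a label prediction contributes one column per base estimator, a probability prediction contributes one column per class, and the raw features are prepended when they are kept. The sketch below reproduces that bookkeeping with plain scikit-learn calls. The helper `stacked_columns`, the choice of bases, and the unbootstrapped 5-fold split are assumptions for illustration, not the class's internal code.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.model_selection import cross_val_predict
from sklearn.naive_bayes import GaussianNB
from sklearn.tree import DecisionTreeClassifier

X_demo, y_demo = load_iris(return_X_y=True)   # 4 features, 3 classes
bases = [GaussianNB(), DecisionTreeClassifier(random_state=0)]


def stacked_columns(method, keep_original):
    """Hypothetical helper: assemble a stacked training set."""
    blocks = [X_demo] if keep_original else []
    for clf in bases:
        out = cross_val_predict(clf, X_demo, y_demo, cv=5, method=method)
        # Label output is 1-D, so give it an explicit column axis.
        blocks.append(out.reshape(len(X_demo), -1))
    return np.hstack(blocks)


# Labels only: one column per base estimator.
labels_only = stacked_columns('predict', keep_original=False)

# Probabilities plus raw input: 4 features + 3 classes * 2 bases.
probas_plus_raw = stacked_columns('predict_proba', keep_original=True)
```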


### Test the SuperLearnerClassifier

Perform a simple test using the SuperLearnerClassifier on the Iris dataset.

In [5]:
%%time

SLC = SuperLearnerClassifier(keep_original=False, probability=True)
iris = load_iris()
X,y = iris.data, iris.target
SLC.fit(X,y)

Wall time: 82 ms


In [6]:
%%time

SLC = SuperLearnerClassifier(keep_original=True, probability=False)
iris = load_iris()
X,y = iris.data, iris.target
SLC.fit(X,y)
score = accuracy_score(SLC.predict(X), y)
cv_score = np.mean(cross_val_score(SLC, X,y, cv=10))
print('Overtrained-Acc',score, sep='\t')
print('10xCv-Acc', cv_score, sep='\t')
display(pd.DataFrame(SLC.diversity(X, y)))
display(pd.DataFrame(SLC.coverage(X, y), index=list(range(2))))
display(pd.DataFrame(SLC.score_bases(X, y)))

Overtrained-Acc	1.0
10xCv-Acc	0.953333333333




| | Disagree | Double Fault | Matthews Coef. | Pairs | Q |
|---|---|---|---|---|---|
| 0 | 0.04 | 0.0 | 0.941004 | (LogisticRegression, DecisionTreeClassifier) | NaN |


| | Missing_Expertise | Unanimous_Agreement |
|---|---|---|
| 0 | 0.0 | 0.96 |
| 1 | 0.0 | 0.96 |


| | Accuracy | Estimator |
|---|---|---|
| 0 | 0.96 | LogisticRegression |
| 1 | 1.0 | DecisionTreeClassifier |


Wall time: 388 ms
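The two coverage figures in the output above can be reproduced by hand on toy data. In this sketch the prediction arrays are made up and stand in for the outputs of two fitted base classifiers; it only illustrates what the two proportions mean.

```python
import numpy as np

y_true = np.array([0, 1, 1, 0])

# Made-up predictions from two hypothetical base classifiers.
base_predictions = [np.array([0, 1, 0, 1]),
                    np.array([0, 0, 1, 1])]

# Row i marks the samples that base i predicted correctly.
correct = np.array([pred == y_true for pred in base_predictions])

# Share of samples that every base gets right (only sample 0 here).
unanimous = correct.all(axis=0).mean()

# Share of samples that no base gets right (only sample 3 here).
missing = (~correct.any(axis=0)).mean()
```

A high missing-expertise value suggests that no amount of clever combination at the stack layer can recover those samples from label predictions alone.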


### UnitTesting

A simple unittest class illustrating that the SuperLearnerClassifier has the desired behaviours, features and error handling.
The tests use the Iris dataset to run many different test cases.
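Inside a notebook, a test case can be run without `unittest.main()` by loading it into a suite and handing it to a runner. The sketch below shows one way to do that; `TestNotFitted` is a hypothetical stand-in test that exercises a plain scikit-learn estimator, not the notebook's `TestSLC` class.

```python
import unittest

from sklearn.exceptions import NotFittedError
from sklearn.tree import DecisionTreeClassifier


class TestNotFitted(unittest.TestCase):
    '''Stand-in test case used only to demonstrate the runner.'''

    def test_predict_before_fit(self):
        clf = DecisionTreeClassifier()
        # Predicting with an unfitted estimator should raise.
        with self.assertRaises(NotFittedError):
            clf.predict([[0.0, 0.0]])


# Collect the tests of one class and run them in-process.
suite = unittest.TestLoader().loadTestsFromTestCase(TestNotFitted)
result = unittest.TextTestRunner(verbosity=2).run(suite)
```

The returned `result` object exposes `testsRun`, `failures` and `errors`, which is handy for summarising a run in a later cell.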


In [7]:
class TestSLC(unittest.TestCase):
    '''Unittest class for SuperLearnerClassifier'''
    
    def use_iris(self):
        '''
        1. used to load/unpack iris conveniently
        '''
        iris = load_iris()
        return iris.data, iris.target
    
    
    
    def test_default_init(self):
        '''
        1. Tests param defaults on instantiation
        '''
        SLC = SuperLearnerClassifier()
        self.assertEqual(SLC.estimators, None)
        self.assertEqual(SLC.stacked_estimator, None)
        self.assertEqual(SLC.cv, 10)
        self.assertEqual(SLC.probability, False)
        self.assertEqual(SLC.keep_original, False)
        
    
    def test_default_fit(self):       
        '''
        1. Base/stacked classifiers are set during fit method
        2. Base classifiers match defined defaults
        3. Stacked classifier matches defined default
        '''
        X,y = self.use_iris()
        SLC = SuperLearnerClassifier()
        SLC.fit(X,y)
        default = [LogisticRegression, DecisionTreeClassifier]
        for i in range(max(len(default), len(SLC.estimators))):
            self.assertTrue(isinstance(SLC.estimators[i], default[i]))
        self.assertTrue(isinstance(SLC.stacked_estimator, 
                                   DecisionTreeClassifier))
        
        
    def test_default_predict(self):
        '''
        1. predict/predict_proba raise NotFittedError before fitting
        2. predict/predict_proba do not raise an error after fitting
        '''
        X,y = self.use_iris()
        SLC = SuperLearnerClassifier()
        with self.assertRaises(NotFittedError):
            SLC.predict(X)
        with self.assertRaises(NotFittedError):
            SLC.predict_proba(X)
        SLC.fit(X,y)
        self.assertEqual(len(X),len(SLC.predict(X)))
        self.assertEqual(len(X),len(SLC.predict_proba(X)))

        
    def test_arg_init(self):
        '''
        1. Default params modified on instantiation if args are supplied
        '''
        clfs = [RandomForestClassifier(),
                KNeighborsClassifier()]
        stacked = LogisticRegression()
        cv = 5
        probability, keep_original = True, True
        SLC = SuperLearnerClassifier(clfs, stacked, cv,
                                     probability, keep_original)
        self.assertEqual(SLC.estimators, clfs)
        self.assertEqual(SLC.stacked_estimator, stacked)
        self.assertEqual(SLC.cv, cv)
        self.assertEqual(SLC.probability, probability)
        self.assertEqual(SLC.keep_original, keep_original)
        
    
    def test_kwarg_init(self):
        '''
        1. Default params modified on instantiation if kwargs are supplied
        '''
        clfs = [RandomForestClassifier(), 
                KNeighborsClassifier()]
        stacked = LogisticRegression()
        cv = 5
        probability, keep_original = True, True
        SLC = SuperLearnerClassifier(estimators=clfs, 
                                     stacked_estimator=stacked, 
                                     cv=cv,
                                     probability=probability, 
                                     keep_original=keep_original)
        self.assertEqual(SLC.estimators, clfs)
        self.assertEqual(SLC.stacked_estimator, stacked)
        self.assertEqual(SLC.cv, cv)
        self.assertEqual(SLC.probability, probability)
        self.assertEqual(SLC.keep_original, keep_original)
        
    
    def test_allowed_bases(self):
        '''
        1. Any allowed base can be used.
        2. base is not fitted until fit method is called
        3. base is fitted during slc fit method
        '''
        X,y = self.use_iris()
        SLC = SuperLearnerClassifier()
        allowed = SLC._SuperLearnerClassifier__allowed_bases
        for clf in allowed:
            SLC2 = SuperLearnerClassifier(estimators=[clf()])
            self.assertTrue(isinstance(SLC2.estimators[0], clf))
            SLC2.fit(X,y)
            self.assertEqual(len(X),len(SLC2.predict(X)))
        
            
    def test_forbidden_bases(self):
        '''
        1. forbidden bases raise TypeError during slc fit method.
        '''
        X,y = self.use_iris()
        classifiers = [DecisionTreeRegressor, RandomForestRegressor]
        for clf in classifiers:
            SLC = SuperLearnerClassifier([clf()])
            self.assertTrue(isinstance(SLC.estimators[0], clf))
            with self.assertRaises(TypeError):
                SLC.fit(X,y)                                                    
        classifiers = [clf() for clf in classifiers]
        SLC = SuperLearnerClassifier(classifiers)
        with self.assertRaises(TypeError):
            SLC.fit(X,y)                                                           

        
    def test_allowed_stacked(self):
        '''
        1. any allowed stacked estimators can be used.
        2. stacked estimator is not fitted until fit method is called.
        3. stacked estimator is fitted with slc fit method
        '''
        X,y = self.use_iris()
        SLC = SuperLearnerClassifier()
        allowed = SLC._SuperLearnerClassifier__allowed_stacked
        for clf in allowed:
            SLC2 = SuperLearnerClassifier(stacked_estimator=clf())
            self.assertTrue(isinstance(SLC2.stacked_estimator, clf))
            SLC2.fit(X,y)
            self.assertEqual(len(X),len(SLC2.predict(X)))
        
            
    
    def test_forbidden_stacked(self):
        '''
        1. forbidden stacked estimator raise TypeError during fit method.
        '''
        X,y = self.use_iris()
        SLC = SuperLearnerClassifier(
            stacked_estimator=RandomForestRegressor())
        self.assertTrue(isinstance(SLC.stacked_estimator, 
                                   RandomForestRegressor))
        with self.assertRaises(TypeError):
            SLC.fit(X,y)
            
            
    def test_pandas(self):
        '''SLC can handle pandas DataFrames'''
        X,y = self.use_iris()
        X = pd.DataFrame(X)
        SLC = SuperLearnerClassifier()
        SLC.fit(X,y)
        self.assertEqual(len(X),len(SLC.predict(X)))

    
    def test_predict_param_grid(self):
        '''
        1. Validates that a grid of expected parameters pass testing
        2. Validates different combinations of program flows.
        3. Tuples, lists and ints can be used to specify classifiers
        '''
        X,y = self.use_iris()
        param_grid = {'estimators':[None, 
                                     [RandomForestClassifier()], 
                                     (RandomForestClassifier(),),
                                     [RandomForestClassifier(), 
                                          KNeighborsClassifier()],
                                     (RandomForestClassifier(),
                                          KNeighborsClassifier()),
                                     1,2
                                    ],
                      'stacked_estimator':[None, LogisticRegression()],
                      'cv':list(range(2,4)),
                      'probability':[True,False],
                      'keep_original':[True,False]
                     }
        all_combos = ParameterGrid(param_grid)
        for params in all_combos:
            SLC = SuperLearnerClassifier(**params)
            with self.assertRaises(NotFittedError):
                SLC.predict(X)
            with self.assertRaises(NotFittedError):
                SLC.predict_proba(X)
            SLC.fit(X,y)
            self.assertEqual(len(X),len(SLC.predict(X)))
            self.assertEqual(len(X),len(SLC.predict_proba(X)))
      
                
             

In [8]:

if __name__ == '__main__':
    unittest.main(argv=['first-arg-is-ignored'], exit=False)
    

.
----------------------------------------------------------------------
Ran 11 tests in 24.085s

OK


## Load & Partition Data

### Setup - IMPORTANT

Take only a sample of the dataset for fast testing.
Set up the number of folds used by all grid searches.
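
As a rough illustration of what "stratified" means for the sampling step, the sketch below draws the same fraction from every class so the class weighting is preserved. The toy frame, the helper name `stratified_sample_seeded` and the fixed seed are assumptions for illustration only; it uses pandas' `DataFrameGroupBy.sample` rather than the `apply`-based helper defined later in this notebook.

```python
import pandas as pd

# Hypothetical toy frame: 100 rows of class 0 and 300 rows of class 1.
toy = pd.DataFrame({
    "label": [0] * 100 + [1] * 300,
    "pixel1": range(400),
})


def stratified_sample_seeded(df, by, frac, seed=1):
    """Sample the same fraction from every group so class weights are kept."""
    sampled = df.groupby(by).sample(frac=frac, random_state=seed)
    return sampled.reset_index(drop=True)


sample = stratified_sample_seeded(toy, by="label", frac=0.10)

# Class proportions before and after sampling.
before = toy["label"].value_counts(normalize=True).sort_index()
after = sample["label"].value_counts(normalize=True).sort_index()
print(before.tolist(), after.tolist())
```

Fixing the seed is a design choice made here so that repeated runs draw the same sample; the notebook's own helper does not set one.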

In [9]:
data_sampling_rate = 0.10

#Stratified: Preserves class weighting
cv_folds = StratifiedKFold(n_splits=5, shuffle=True, random_state=1)

### Load Dataset

Load the dataset and explore it.

### Pre-process & Partition Data

Perform data pre-processing and manipulation as required

In [10]:
def stratified_sample(df, by, frac):
    '''takes stratified sample of dataset based on some grouping'''
    df = df.groupby(by=[by]).apply(lambda df : df.sample(frac=frac))
    df = df.reset_index(drop=True)
    return df

num_classes = 10
classes = {0: "T-shirt/top", 1:"Trouser", 
           2: "Pullover",    3:"Dress", 
           4:"Coat",         5:"Sandal", 
           6:"Shirt",        7:"Sneaker", 
           8:"Bag",          9:"Ankle boot"}

dataset = pd.read_csv('fashion-mnist_train.csv')
dataset = stratified_sample(dataset, by='label', frac=data_sampling_rate)
X_train = dataset[dataset.columns[1:]]
y_train = np.array(dataset["label"])
X_train = X_train/255
display(X_train.head())


dataset = pd.read_csv('fashion-mnist_test.csv')
X_test = dataset[dataset.columns[1:]]
y_test = np.array(dataset["label"])
X_test = X_test/255
display(X_test.head())


FileNotFoundError: File b'fashion-mnist_train.csv' does not exist

## Train and Evaluate a Simple Model

Train a Super Learner Classifier using the prepared dataset

In [None]:
%%time

#Some plain learners. Keeping it simple
classifiers = [DecisionTreeClassifier(min_samples_split=20),
               RandomForestClassifier(min_samples_split=100),
               AdaBoostClassifier(),
               ExtraTreesClassifier(min_samples_split=100),
               GradientBoostingClassifier(n_estimators=10, min_samples_split=100),
               MLPClassifier(),
               LogisticRegression(),
               KNeighborsClassifier(),
               NearestCentroid(),
               GaussianNB(),
               MultinomialNB(),
               LinearDiscriminantAnalysis(),
               QuadraticDiscriminantAnalysis()
            ]

address = os.path.join(super_learners, 'simple_model.pkl' )
    
try:
    clf = joblib.load(address)
    print('Loaded...')

except FileNotFoundError:
    print('Building...')
    clf = SuperLearnerClassifier(estimators = classifiers, cv=5)
    clf.fit(X_train, y_train)
    joblib.dump(clf, address)

Evaluate the trained classifier

In [None]:
%%time

#Warning: SLOW
y_pred = clf.predict(X_test)
accuracy = accuracy_score(y_test, y_pred)
print('Accuracy :', accuracy)
display(pd.crosstab(np.array(y_test), 
            y_pred,
            rownames=['True'],
            colnames=['Predicted'],
            margins=True))


In [None]:
%%time

#Warning: SLOW
for base in clf.estimators:
    start = time()
    base.fit(X_train, y_train)
    delta_fit = format(time() - start, '.2f')
    
    start = time()
    y_pred = base.predict(X_test)
    #stop the clock before scoring so only predict() is timed
    delta_pred = format(time() - start, '.2f')
    accuracy = accuracy_score(y_test, y_pred)
    
    print('Fit_Time :', delta_fit,
          '\tPred_Time :', delta_pred,
          '\tAccuracy :', accuracy,
           name(base))

### First Impression

#### Train Time Slow: 10 mins
This was expected, as each base estimator must run cross_val_predict() as part of the fit method.
Fewer CV folds would speed up training, but could cause a steep drop in accuracy if the model is overtrained because it has not been fitted on enough data. KNN's slow prediction time contributes to this, as do models with long training times such as GradBoosting, AdaBoost, MLP and LogisticRegression.
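
To make the cost concrete, here is a minimal sketch of how out-of-fold predictions can be assembled into a stack layer training set with `cross_val_predict`. It uses Iris and two stand-in base learners rather than this notebook's ensemble, and it assumes one final fit of each base on all the data after the out-of-fold pass; the actual SuperLearnerClassifier may differ in those details.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_predict
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)
bases = [LogisticRegression(max_iter=1000),
         DecisionTreeClassifier(random_state=0)]
k = 5

# One column of out-of-fold label predictions per base estimator.
stack_X = np.column_stack([cross_val_predict(b, X, y, cv=k) for b in bases])

# Each base is fitted k times for the out-of-fold predictions.
# Assumption: one more fit per base on the full data afterwards.
fits = len(bases) * (k + 1)
print(stack_X.shape, fits)
```

Under these assumptions the number of base fits grows linearly with both the ensemble size and the number of folds, which is consistent with the long train time observed above.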

#### Accuracy Good: 78+
Good for a model that hasn't been tuned, but some individual members of the ensemble perform better than the ensemble as a whole.
Among the untuned base learners, RandomForest and ExtraTrees (which are themselves ensembles) have similar performance to the SuperLearner.
MLP and LogisticRegression both perform noticeably better than the SuperLearner. LDA and 5NN had similar or better performance than the SuperLearner despite being surprisingly simple algorithms.
A reasonable goal would be to beat the MLP and reach 86+.


#### Prediction Time Slow: 2.75 mins
KNN is the source of this problem. With so many features and samples, the nearest-neighbour computation is expensive. This is the price it pays for not "generalising" in its training phase. Since the fit method of the SuperLearnerClassifier requires KNN to make predictions, KNN is also culpable for slow training.

#### Response:
GradBoosting was outperformed by two simpler ensembles and is a major contributor to train time.
KNN should be removed from the ensemble if it does not significantly contribute to ensemble accuracy. KNN slows prediction and has similar overall performance to LDA. Other members such as QDA, GaussianNB and AdaBoost should be trimmed to improve train/prediction times, as they most likely weaken the ensemble due to their poor performance.

#### Result:
    Accuracy increased to 81+.
    Train time reduced to 3.33 mins.
    Prediction time is now in the accepted range, less than 0.5 seconds.


In [None]:
%%time

classifiers = [DecisionTreeClassifier(min_samples_split=20),
               RandomForestClassifier(min_samples_split=100),
               ExtraTreesClassifier(min_samples_split=100),
               GradientBoostingClassifier(n_estimators=10, min_samples_split=100),
               MLPClassifier(),
               LogisticRegression(),
               NearestCentroid(),
               MultinomialNB(),
               LinearDiscriminantAnalysis(),
              ]

address = os.path.join(super_learners, 'pruned_simple_model.pkl')

try:
    pruned_clf = joblib.load(address)
    print('Loaded...')

except FileNotFoundError:
    pruned_clf = SuperLearnerClassifier(estimators=classifiers, cv=5)
    print('Building...')
    pruned_clf.fit(X_train, y_train)
    joblib.dump(pruned_clf, address, compress=9)
    

In [None]:
%%time

y_pred = pruned_clf.predict(X_test)
accuracy = accuracy_score(y_test, y_pred)
print('Accuracy :', accuracy)
display(pd.crosstab(np.array(y_test), 
            y_pred,
            rownames=['True'],
            colnames=['Predicted'],
            margins=True))

display(pd.DataFrame(pruned_clf.diversity(X_test, y_test)))
display(pd.DataFrame(pruned_clf.coverage(X_test, y_test), index=list(range(2))))
display(pd.DataFrame(pruned_clf.score_bases(X_test, y_test)))

## Cross Validation Experiment (Task 2)

Perform a 10-fold cross validation experiment to evaluate the performance of the SuperLearnerClassifier.

In [None]:
%%time

accuracys = cross_val_score(pruned_clf, X_train, y_train, cv=10)
mean = format(np.mean(accuracys),'.4f')
std = format(accuracys.std(), '.4f')
print('10xCV Accuracy : ', mean, '+-', std)

    10xCV accuracy is similar to accuracy on the test set: 81%.
    The standard deviation is 1%, so variance is relatively small.
    10xCV takes 30 mins, which is quite an expensive way to measure accuracy.
    The SuperLearnerClassifier seems to generalise quite well.


## Comparing the Performance of Different Stack Layer Approaches (Task 5)

Compare the performance of the ensemble when a label based stack layer training set and a probability based stack layer training set are used.

In [None]:
%%time


param_grid = {'estimators':[classifiers],
              'probability':[True, False],
              'stacked_estimator':[DecisionTreeClassifier(), 
                                   LogisticRegression() ],
              'cv':[5]
             }

address = os.path.join(searches, 'stack_layer_approaches.pkl')

try:
    search = joblib.load(address)
    print('Loaded...')

except FileNotFoundError:
    print('Building...')
    search = GridSearchCV(SuperLearnerClassifier(), 
                      param_grid, 
                      cv=cv_folds, 
                      #verbose = 2, 
                      return_train_score=True,
                      refit=True).fit(X_train,y_train)
    joblib.dump(search, address, compress=9)


In [None]:

results = pd.DataFrame(search.cv_results_)
results['stacked'] = results['param_stacked_estimator']
info = [ 'mean_test_score',
         'std_test_score',
         'mean_fit_time',
         'std_fit_time',
         'mean_score_time',
         'std_score_time',
         'param_probability',
         'stacked']
results = results.sort_values('rank_test_score')
display(results[info].round(2))
display(results)


In [None]:
display(pd.DataFrame(search.best_estimator_.diversity(X_test, y_test)))
display(pd.DataFrame(search.best_estimator_.coverage(X_test, y_test), index=list(range(2))))
display(pd.DataFrame(search.best_estimator_.score_bases(X_test, y_test)))

In [None]:
y_pred = search.best_estimator_.predict(X_test)
accuracy = accuracy_score(y_test, y_pred)
print('Accuracy :', accuracy)
display(pd.crosstab(np.array(y_test), 
                    y_pred,
                    rownames=['True'],
                    colnames=['Predicted'],
                    margins=True))

### Different Stacked Layer Approaches


#### Overview
In theory, probability outputs are continuous, allowing base learners to contribute their estimated probability distribution to the stacked layer. The information they provide is of higher resolution and should help capture boundary cases. It is not expected to catch any outliers in the data.

From the reasoning above, the expected result is that, in the majority of cases, probability based labels would improve the performance of the SuperLearners. This is not the case. Instead, the pairwise interaction between these hyperparameters appears to be significant. However, the strongest learner did use probability based labels. It benefitted from the higher resolution information from its ensemble.
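
The difference between the two kinds of stack layer input can be sketched as follows. This is an illustration on Iris with two stand-in base learners, not the SuperLearnerClassifier's own code: label based inputs give one column per base, while probability based inputs give one column per base per class.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_predict
from sklearn.naive_bayes import GaussianNB

X, y = load_iris(return_X_y=True)
bases = [LogisticRegression(max_iter=1000), GaussianNB()]

# Label based: each base contributes a single predicted class per sample.
label_X = np.column_stack(
    [cross_val_predict(b, X, y, cv=5) for b in bases])

# Probability based: each base contributes one probability per class.
proba_X = np.hstack(
    [cross_val_predict(b, X, y, cv=5, method="predict_proba") for b in bases])

print(label_X.shape, proba_X.shape)
```

With three classes the probability based set is three times wider, which is the "higher resolution" referred to above.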

There are a number of phenomena to note from the results.

1. The LogisticRegression stacked layer is enhanced by using probability based labels.
   The DecisionTree stacked layer is weakened by using probability based labels.

2. The magnitude of this change is quite large in both cases.

   The LogisticRegression stacked layer appears to be more sensitive to this change.
   It produced both the best and the worst results, depending on whether the probability parameter was enabled.
   The difference in accuracy is nearly 25%.

   DecisionTrees perform reasonably well in both cases, giving them a mid-table performance.
   For DecisionTrees the difference is nearly 2%, which is still quite big.

3. The pairwise interaction of hyperparameters is quite significant. Probabilities do not always enhance the ensemble.
   Using LogisticRegression rather than DecisionTrees in the stacked layer does not always enhance accuracy.
   Therefore, it is necessary to perform wide searches of the parameter space using GridSearches and RandomSearches.

4. Use of probability based labels slightly increases prediction time, in general.
   Probability based labels give a negligible increase in fit time for LogisticRegression.
   Probability based labels noticeably increase fit time for the DecisionTree stacked layer.

These observations were taken from the aggregated results above; they are also consistent with the results of each of the 5 splits performed on the data.
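
One way to inspect a pairwise interaction between two hyperparameters is to pivot the grid search results into a small table. The sketch below uses a stand-in search (an SVC over `kernel` and `C` on Iris) purely to show the mechanics; the same pivot could be applied to `param_probability` and the stacked estimator columns of the results above. The name `search_demo` is hypothetical.

```python
import pandas as pd
from sklearn.datasets import load_iris
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVC

X, y = load_iris(return_X_y=True)

# Stand-in grid with two hyperparameters that may interact.
grid = {"kernel": ["linear", "rbf"], "C": [0.1, 10.0]}
search_demo = GridSearchCV(SVC(), grid, cv=5).fit(X, y)

res = pd.DataFrame(search_demo.cv_results_)
# Cast the parameter columns to strings so they can be used as pivot keys.
res["kernel"] = res["param_kernel"].astype(str)
res["C"] = res["param_C"].astype(str)

# Rows: one hyperparameter, columns: the other, cells: mean CV accuracy.
table = res.pivot_table(index="kernel", columns="C", values="mean_test_score")
print(table.round(3))
```

If the best column changes from row to row, the two hyperparameters interact and cannot be tuned independently.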


#### Logistic Regression
LogisticRegression appears to be very sensitive to using probability labels in the stacked layer, performing strongly when they are present and much more weakly when they are not. This could be due to having relatively few features to fit in the stacked layer. With a larger ensemble LogisticRegression may perform better; however, each member added to the ensemble would noticeably increase training time, and diversity becomes increasingly difficult to achieve. Enabling probability achieves the desired effect with minimal training/prediction costs.

It is understandable why LogisticRegression performs well in the stacked layer: given probability based labels, it is easy to perceive how certain each classifier is of its prediction. By setting the coefficients, the predictions of the different base classifiers essentially become weighted. Viewing the coefficients of the regression should give insight into which classifiers it is favouring; it could be the case that it is simply predicting collinearly with the MLP in all cases. To improve on this prediction, base classifiers need to be comparably accurate to the MLP while guessing differently.
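
A minimal sketch of the coefficient inspection suggested above, assuming a stack layer built from out-of-fold probabilities. It uses Iris and two stand-in bases; the column naming scheme and the "total absolute weight" summary are illustrative choices, not part of the SuperLearnerClassifier.

```python
import numpy as np
import pandas as pd
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_predict
from sklearn.naive_bayes import GaussianNB
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)
bases = {"tree": DecisionTreeClassifier(random_state=0), "nb": GaussianNB()}

# Build probability based stack features and remember which base made them.
blocks, columns = [], []
for name, est in bases.items():
    blocks.append(cross_val_predict(est, X, y, cv=5, method="predict_proba"))
    columns += [f"{name}_p{c}" for c in np.unique(y)]
stack_X = np.hstack(blocks)

# Fit the stacked estimator and label its coefficients by source column.
meta = LogisticRegression(max_iter=1000).fit(stack_X, y)
coefs = pd.DataFrame(meta.coef_, index=meta.classes_, columns=columns)

# Crude summary: total absolute weight given to each base's columns.
weight_per_base = {
    name: float(np.abs(coefs.filter(like=name + "_")).to_numpy().sum())
    for name in bases
}
print(coefs.round(2))
print(weight_per_base)
```

A base whose columns carry nearly all of the weight would suggest the stacked layer is largely following that one classifier.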

#### Decision Tree
DecisionTrees appear to be less sensitive to this change but, unlike the LogisticRegression stacked layer, they are weakened by using probability based labels. Perhaps the tree is overtrained from picking up noise in the probability labels, weakening its generalisation accuracy. An interesting experiment would be to tune the stacked estimator for both probability based and class based labels and see which gives the stronger performance. The DecisionTree may benefit from keep_original, which adds the original data to the stack layer, allowing it to favour certain classifiers for certain input data. This could help it use the MLP's power for some data while capturing the diversity of other classifiers for different data.
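
The idea of keeping the original features can be sketched as a simple column concatenation. This is an illustration on Iris with stand-in bases; the column order (base predictions first, original features after) is an assumption and may not match what `keep_original` does internally.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_predict
from sklearn.naive_bayes import GaussianNB
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)
bases = [LogisticRegression(max_iter=1000), GaussianNB()]

# Out-of-fold label predictions from each base estimator.
base_preds = np.column_stack(
    [cross_val_predict(b, X, y, cv=5) for b in bases])

# Without the original features: the stack layer sees predictions only.
stack_only = base_preds
# With the original features: predictions and raw inputs side by side.
stack_plus_original = np.hstack([base_preds, X])

# A tree can now split on a raw feature before choosing which base to trust.
meta = DecisionTreeClassifier(random_state=0).fit(stack_plus_original, y)
print(stack_only.shape, stack_plus_original.shape)
```

For Fashion-MNIST this adds 784 pixel columns to the stack layer, so the stacked estimator's fit time would be expected to rise accordingly.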

## Grid Search Through SuperLearnerClassifier Architectures & Parameters (Task 7)

Perform a grid search experiment to determine the optimal architecture and hyper-parameter values for the SuperLearnerClassifier for the Fashion-MNIST classification problem.

### Analyse Bases
To set up a search space for the SuperLearner, the bases will first be tuned slightly to increase the accuracy of each member.
The top 10 results for each base will be tabulated for human interpretation. The reason for this is to verify that each base classifier is reasonably accurate and will not hamper fit/prediction time during the SuperLearner GridSearch.

The best classifiers will be listed, and every combination of them will be taken for each ensemble size in the chosen range. These combinations are passed into the param_grid for a GridSearchCV in the usual fashion.

Similarly, a stacked estimator search space will be prepared, to be passed into the GridSearch param_grid.
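
The size of the resulting search space can be estimated before running anything. The sketch below uses placeholder names instead of fitted estimators and a hypothetical two-value grid for the stacked estimator, so the counts are illustrative only.

```python
from itertools import chain, combinations

from sklearn.model_selection import ParameterGrid

# Placeholder names standing in for the tuned base classifiers.
names = ["LogisticRegression", "RandomForest", "ExtraTrees", "LDA", "MLP"]
lower, upper = 4, 5

# Every combination of bases for each ensemble size in [lower, upper].
combos = list(chain.from_iterable(
    combinations(names, size) for size in range(lower, upper + 1)))

# Hypothetical stacked estimator grid (two settings, for illustration).
stacked_grid = ParameterGrid({"fit_intercept": [True, False]})

n_candidates = len(combos) * len(stacked_grid)
print(len(combos), len(stacked_grid), n_candidates)
```

Each candidate is then fitted once per outer fold by GridSearchCV, so widening the size range or the stacked grid multiplies the total run time.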


In [None]:
%%time

base_search_space = {
    #consistently strong learners.
    LogisticRegression:{'fit_intercept':[True,False],
                        'solver':['liblinear',
                                 # 'sag' #too costly for minimal accuracy gain.
                                 ]},
    
    #gini learners were observed to be more successful than entropy
    RandomForestClassifier:{'n_estimators':list(range(10,101,10)),
                            'min_samples_split':list(range(10,80,10)),
                            'criterion':['gini',]},
    
    # Tends to perform weakly
    #DecisionTreeClassifier:{'min_samples_split':list(range(2,103,25)),
    #                        'criterion':['gini',
    #                                     'entropy',
    #                                    ]},
    
    #gini learners were observed to be more successful than entropy.
    ExtraTreesClassifier:{'n_estimators':list(range(10,101,10)),
                          'min_samples_split':list(range(10,80,10)),
                          'criterion':['gini',
                                      #'entropy',
                                      ]},
    
    #range tended to produce inexpensive and accurate MLPClassifiers
    MLPClassifier:{'hidden_layer_sizes':list(range(200,301,10)),
                    'activation':['relu'], #tanh:ok, logistic:weak
                   'learning_rate':['adaptive'], #faster when bigger
                   'early_stopping':[True]}, #good for time
    
    #too weak
    NearestCentroid:{'shrink_threshold':list(np.arange(0,1,0.2))},
    
    #too weak
    MultinomialNB:{'alpha':list(np.arange(0,1,0.2))},
    
    #consistently good baseline.
    LinearDiscriminantAnalysis:[{'shrinkage':['auto'],
                                 'solver':['lsqr']},
                               {'solver':['lsqr','svd']}
                               ]                 
}

'''todo: score using log-loss to optimise probability based stacked layer'''


#give this an additional param for naming
def make_filepath(class_str):
    class_str = str(class_str).split('.')[-1]
    alphas = [c for c in list(class_str) if c.isalnum()]
    filename = ''.join(alphas+['.pkl'])
    path = os.path.join(tuned_bases, filename)
    return path

try:
    results_clfs = [joblib.load(make_filepath(est)) \
                   for est, est_params in base_search_space.items()]
    print('Loaded...')

except FileNotFoundError:
    def search_base(estimator, params):
        search = GridSearchCV(estimator=estimator, 
                              param_grid=params, 
                              cv=cv_folds, 
                              #verbose=2, 
                              n_jobs=-1, 
                              refit=True,
                              return_train_score=True)
        best = search.fit(X_train, y_train)
        address = make_filepath(str(type(best.best_estimator_)))
        joblib.dump(best, address, compress=9)
        return best
    
    print('Building...')
    results_clfs = [search_base(est(), est_params) \
                    for est,est_params in base_search_space.items()]
           


In [None]:
info = [ 'params',
         'mean_test_score',
         'std_test_score',
         'mean_fit_time',
         'std_fit_time',
         'mean_score_time',
         'std_score_time',
       ]

for search in results_clfs:
    df = pd.DataFrame(search.cv_results_).sort_values('rank_test_score')
    clf_type = type(search.best_estimator_)
    
    #Title
    display(clf_type)
    
    #gives clear overview of params used for top 10
    display(df[info].round(3).head(10)['params'].values)
    
    #performance breakdown
    display(df[info].round(3).head(10))
    print()

In [None]:
%%time

test = [result.best_estimator_ for result in results_clfs]
test_slc = SuperLearnerClassifier(test, stacked_estimator=LogisticRegression())
test_slc.fit(X_train, y_train)



In [None]:
y_pred = test_slc.predict(X_test)
accuracy = accuracy_score(y_test, y_pred)
print('Accuracy :', accuracy)
display(pd.crosstab(np.array(y_test), 
                    y_pred,
                    rownames=['True'],
                    colnames=['Predicted'],
                    margins=True))


display(pd.DataFrame(test_slc.diversity(X_test, y_test)))
display(pd.DataFrame(test_slc.coverage(X_test, y_test), index=list(range(2))))
display(pd.DataFrame(test_slc.score_bases(X_test, y_test)))


### Base Classifiers
GridSearching base learners can be used not only to maximise accuracy, but also to get an idea of the overall accuracy and speed of classifiers under different sets of hyperparameters.

It also gives a flavour of which hyperparameters are better suited to this sort of problem. E.g. a liblinear logistic regression fits 5 times faster than a 'saga' one, with only a 0.1% difference in accuracy. It is important not to compromise on accuracy too much, as members must be strong and must predict the more difficult examples.


### Pruning
As was seen in the test run, pruning weaker ensemble members can increase the accuracy of the whole, as they can be sources of noise rather than diversity. The model's complexity is also reduced, making it less expensive to train and predict. Increasing the number of estimators in Random Forests, or adding more layers to an MLP, makes the models more expensive to train, which can impede gridsearching. By assessing the top results for each base, classifiers can be chosen that are both fast and accurate. After exploring the optimal overall architecture, members can be retuned purely for accuracy.


### LinearDiscriminantAnalysis: BaseLine
LDA is a very simple and well understood process. Like KNN, it is often used as a baseline against which other classifiers are compared. In this case KNN is too expensive for prediction, but LDA shows strong results for little expense. If a model cannot outperform LDA, it will not be included in the ensemble. Its accuracy is consistently 80+, so that will serve as the accuracy cut-off.

### LogisticRegression
Consistently strong at 83+. The sag and saga solvers tend to be marginally more accurate, by a fraction of a percent, but train slowly. Liblinear was used as it is negligibly weaker but significantly faster.

### Decision Trees 
DecisionTrees had a reasonable performance of around 74%. However, this is weaker than random forests and extra trees which, despite being ensembles, train in faster or comparable times. DecisionTrees do not meet the baseline set by LDA, so they are more likely to be a source of noise than of diversity. For this reason, decision trees will be removed.

### RandomForests OR ExtraTrees
Both tend to score 83+.
As both are strong learners, both will be kept until the main gridsearch, which will decide whether the ensemble is enhanced by the presence of both models or whether one should be dropped. It is noted that ExtraTrees tend to fit faster, which makes them slightly more desirable.

### NearestCentroid and MultinomialNB
Both perform at roughly 68%. NearestCentroid and MultinomialNB did not improve much with tuning and are still weak learners. They probably add more noise than diversity. While they are inexpensive to train, they are probably best removed as they are far from the standard set by LDA.

### MLP
A strong learner overall. Many combinations of hyperparameters were investigated beforehand, testing different activation functions and learning rate combinations. The list in the param search was refined to produce models that trained quickly and predicted accurately. Tends to be 84+.

In [None]:
#keep only the tuned bases whose type is in the allowed set
allowed = (LogisticRegression,
           RandomForestClassifier,
           ExtraTreesClassifier,
           LinearDiscriminantAnalysis,
           MLPClassifier)
allowed_bases = [clf for clf in test if isinstance(clf, allowed)]

In [None]:
#range of desired ensemble sizes to explore
lower, upper = 4,5

best_bases = [base for base in allowed_bases]
all_base_combos = list(chain.from_iterable(
    [combinations(best_bases,i) for i in range(lower,upper+1)]))
stacked_search_space = {LogisticRegression:{
                             'fit_intercept':[True,
                                              #False
                                             ]
                        },
                        #Too weak, removed from gridsearch after testing
                        #DecisionTreeClassifier:{
                        #    'min_samples_split':list(range(2,202,50)),
                        #    'criterion':['gini',
                        #                 'entropy'],
                        #}
                       }
all_stacked_estimators = [[clf(**params) for params in ParameterGrid(space)]\
                         for clf, space in stacked_search_space.items()]
all_stacked_estimators = list(chain.from_iterable(all_stacked_estimators))



In [None]:
%%time

param_grid = {
    'estimators':all_base_combos,
    'probability' : [True, 
                     #False
                    ],
    'keep_original' : [False],
    'stacked_estimator' : all_stacked_estimators,
    'cv': [#3,tends to be weak (especially with smaller data) 
          5,
          #10, #too long with no significant contribution to scores.
          ]
}

address = os.path.join(searches, 'overall_architectures.pkl')

try:
    search = joblib.load(address)
    print('Loaded...')

except FileNotFoundError:
    print('Building...')
    search = GridSearchCV(SuperLearnerClassifier(), 
                          param_grid, 
                          cv=cv_folds, 
                          #verbose = 2, 
                          return_train_score=True,
                          refit=True).fit(X_train,y_train)
    joblib.dump(search, address, compress=9)


In [None]:
df = pd.DataFrame(search.cv_results_).sort_values('rank_test_score')    
#performance breakdown
top_params = df[info].round(4).head(12)['params'].values
display(df[info].round(4).head(12))
print()

print('Ensemble Sizes')
display([len(entry['estimators']) for entry in top_params])
print()

print('Ensemble Params:\n',top_params)

### GridSearch Results
The top SuperLearners have improved on the accuracies of their base estimators, suggesting that there is some benefit to ensembling them. Many results are similar, so it may be worth performing paired t-tests to see if the differences are statistically significant.
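
A paired t-test on per-fold scores could look like the sketch below. It compares two stand-in classifiers on Iris using identical folds, with SciPy's `ttest_rel`; it is not run on the SuperLearner results themselves. Fold scores share training data, so the test's independence assumption holds only approximately and the p-value should be read with caution.

```python
from scipy.stats import ttest_rel
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)

# A fixed random_state means both models are scored on the same splits,
# which is what makes the scores "paired".
folds = StratifiedKFold(n_splits=5, shuffle=True, random_state=1)

scores_a = cross_val_score(LogisticRegression(max_iter=1000), X, y, cv=folds)
scores_b = cross_val_score(DecisionTreeClassifier(random_state=0), X, y, cv=folds)

# Paired test on the per-fold differences.
t_stat, p_value = ttest_rel(scores_a, scores_b)
print(scores_a.round(3), scores_b.round(3))
print("t =", t_stat, "p =", p_value)
```

For a fitted GridSearchCV, the per-fold scores of each candidate are available in `cv_results_` under the `split0_test_score`, `split1_test_score`, ... keys, so two rows of that table could be compared in the same way.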

### CV=5 Vs. CV=10
Interestingly, cv=5 tends to perform better on average, but the top model has cv=10 and excludes LDA, which is a very stable classifier that does not require a lot of data to fit well. The models using 10xCV tend to need 2-3 times longer to train, while cv=5 is weaker by only 0.1-0.2 percent. So, as the investigation continues, cv=5 should continue to be used to save time.

### Ensemble Sizes
The space searched allowed for ensemble sizes of 4 or 5. In this case the larger ensemble performed best, probably because each classifier contributes to the diversity while being accurate. This size seems reasonable, as pruning weaker members did raise the accuracy. A good feature to implement for growing the ensemble would be a forward search that adds more exotic classifiers from the scikit-learn toolkit.
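
A greedy forward search of the kind mentioned above might be sketched as follows. This is only an outline under stated assumptions: it uses scikit-learn's `StackingClassifier` as a stand-in for the SuperLearnerClassifier, Iris as the data, and the hypothetical helpers `cv_score` and `forward_select`. At each round it adds the candidate that most improves the cross-validated score and stops when no candidate helps.

```python
from sklearn.datasets import load_iris
from sklearn.ensemble import StackingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.naive_bayes import GaussianNB
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)

# Named candidates, in the (name, estimator) form StackingClassifier expects.
candidates = [("tree", DecisionTreeClassifier(random_state=0)),
              ("nb", GaussianNB()),
              ("knn", KNeighborsClassifier())]


def cv_score(members):
    """Mean CV accuracy of a stacked ensemble built from `members`."""
    model = StackingClassifier(
        estimators=members,
        final_estimator=LogisticRegression(max_iter=1000),
        cv=3)
    return cross_val_score(model, X, y, cv=3).mean()


def forward_select(pool, score_fn):
    """Greedily add the member that most improves the score; stop otherwise."""
    selected, best_score = [], float("-inf")
    remaining = list(pool)
    while remaining:
        scored = [(score_fn(selected + [c]), c) for c in remaining]
        top_score, top = max(scored, key=lambda pair: pair[0])
        if top_score <= best_score:
            break  # no remaining candidate improves the ensemble
        selected.append(top)
        remaining.remove(top)
        best_score = top_score
    return selected, best_score


selected, best_score = forward_select(candidates, cv_score)
print([name for name, _ in selected], best_score)
```

Forward search evaluates far fewer ensembles than the exhaustive combinations used in this notebook, at the cost of possibly missing a good ensemble whose members are individually unremarkable.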

In earlier searches with larger ensembles, the largest ensembles performed mid-pack, implying that there is some estimator that detracts from the ensemble and should probably be pruned. However, its presence is not as significant as the loss of an effective ensemble member. This supported the argument that pruning the weak members was the optimal choice.

Evaluate the performance of the model selected by the grid search on a hold-out dataset

In [None]:
%%time

#best two models params.
top_2 = df[info].round(4).head(10)['params'].values[:2]
filenames = ['first.pkl', 'second.pkl']
addresses = [os.path.join(super_learners, name) for name in filenames]

for i in range(2):
    try:
        tuned_SLC = joblib.load(addresses[i])
        print('Loaded...')
    except FileNotFoundError:
        print('Building...')
        tuned_SLC = SuperLearnerClassifier(**top_2[i])
        tuned_SLC.fit(X_train, y_train)
        joblib.dump(tuned_SLC, addresses[i], compress=9)
    
    y_pred = tuned_SLC.predict(X_test)
    accuracy = accuracy_score(y_test, y_pred)
    print('Accuracy :', accuracy)
    display(pd.crosstab(np.array(y_test), 
                        y_pred,
                        rownames=['True'],
                        colnames=['Predicted'],
                        margins=True))



### GridSearch Results
Both models perform slightly better on the hold-out set than they did during the grid search. Both are very close to 86% accuracy, and the difference between the top two grid search results is marginal.

## Evaluating the Impact of Adding Original Descriptive Features at the Stack Layer (Task 8)

Evaluate the impact of adding original descriptive features at the stack layer.

In [None]:
%%time

filenames = ['first_with_original.pkl', 'second_with_original.pkl']
addresses = [os.path.join(super_learners, name) for name in filenames]

for i in range(2):
    try:
        new_SLC = joblib.load(addresses[i])
    except FileNotFoundError:
        # Copy the params so the Task 7 settings stored in top_2 are not mutated.
        params = dict(top_2[i], keep_original=True)
        new_SLC = SuperLearnerClassifier(**params)
        new_SLC.fit(X_train, y_train)
        joblib.dump(new_SLC, addresses[i], compress=9)

top_clfs = [joblib.load(address) for address in addresses]

In [None]:
%%time

for clf in top_clfs:
    y_pred = clf.predict(X_test)
    accuracy = accuracy_score(y_test, y_pred)
    print('Accuracy :', accuracy)
    display(pd.crosstab(np.array(y_test), 
                        y_pred,
                        rownames=['True'],
                        colnames=['Predicted'],
                        margins=True))
    
    #accuracys = cross_val_score(clf, X_train, y_train, cv=5)
    #mean = format(np.mean(accuracys),'.4f')
    #std = format(accuracys.std(), '.4f')
    #print('5xCV Accuracy : ', mean, '+-', std)
    



### Adding Original Descriptive Features

Adding the original descriptive features seems to slightly improve each model, though on some runs this parameter weakened the models slightly. It seems more likely that these small deviations in accuracy come from the randomness within the algorithm each time a SuperLearner is trained.

When the original descriptive features are dropped, the logistic regression stacked layer receives inputs that are relatively clean and simple. The probability outputs from the base classifiers should correlate well with the actual class labels, allowing a logistic model to fit them easily. In a sense, the base classifiers have already extracted most of the usable information from the raw data. Using heterogeneous ensembles allows the classifiers to cancel out some of each other's noise, which stems from noise in the raw data. Including the raw data may simply add more noise, weakening the ensemble slightly.

An argument can be made for using a DecisionTree stacked layer, which can in essence favor different classifiers depending on the raw data. At this stage it would be interesting to give the DecisionTree stacked layer another chance to see how it performs.
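
To illustrate what the stacked layer sees in each case, the sketch below builds its input rows with and without the raw features. It assumes that `keep_original=True` simply appends each example's original features to the base estimators' probability outputs. `build_stack_inputs` is a hypothetical helper and all numbers are made up, so this is not necessarily how `SuperLearnerClassifier` does it internally.

```python
def build_stack_inputs(base_outputs, X=None):
    """Concatenate each row's base-estimator outputs, optionally followed by raw features.

    base_outputs: one list per base estimator, each holding one probability
                  vector per example.
    X: optional list of raw feature vectors, one per example.
    """
    n_rows = len(base_outputs[0])
    rows = []
    for i in range(n_rows):
        row = []
        for outputs in base_outputs:
            row.extend(outputs[i])  # class probabilities from one base estimator
        if X is not None:
            row.extend(X[i])        # original descriptive features (assumed layout)
        rows.append(row)
    return rows


# Two base estimators, two examples, two classes (illustrative values only).
probs_rf = [[0.9, 0.1], [0.3, 0.7]]
probs_lr = [[0.8, 0.2], [0.4, 0.6]]
X_raw = [[5.1, 3.5, 1.4], [6.2, 2.9, 4.3]]

without_original = build_stack_inputs([probs_rf, probs_lr])
with_original = build_stack_inputs([probs_rf, probs_lr], X_raw)
print(len(without_original[0]), len(with_original[0]))
```

Under this assumption the stacked layer's input width grows by the number of raw features, which is where the extra noise discussed above would enter.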


## Explore the Ensemble Model (Task 9)

Perform an analysis to investigate the strength of the base estimators and the strengths of the correlations between them.

In [None]:
%%time

first, second = top_clfs

print('Pairwise Analysis')
display(pd.DataFrame(first.diversity(X_test, y_test)).round(2))

print('Individual Analysis')
display(pd.DataFrame(first.score_bases(X_test, y_test)).round(2))    

print('Ensemble Coverage')
display(pd.DataFrame(first.coverage(X_test, y_test), index=list(range(2))))

    

    

### Accuracy Vs. Diversity
It should first be noted that each of these classifiers tends to overlap significantly with the others. The four base estimators in the ensemble agree on 73% of the data without any conflict. At the other extreme, approximately 8% of the data is not correctly classified by any member of the ensemble. Hypothetically, if the stacked layer were able to identify which estimator's predictions to trust for each example, the ensemble's accuracy could reach 92%.

Each base is an effective estimator for the hold-out dataset, with all accuracies above 80%. While this is a desirable trait, it may come at the expense of producing models that are highly correlated and show little disagreement. For this reason there is only marginal improvement in the accuracy of the ensemble over its strongest member. Ideally, another learner could be added that learns to predict the data with missing expertise. Boosting could help in that case by focusing on the tougher 8% of the data, decreasing the Missing Expertise and, by extension, lowering the mean Double Fault measure and encouraging more disagreement.
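
The upper bound quoted above is simply 100% minus the share of examples that no base estimator gets right (100% - 8% = 92%). A minimal sketch of that calculation follows, assuming a 0/1 correctness table per base estimator. `coverage_summary` is a hypothetical helper, not the notebook's own `coverage` method, and the toy table is made up.

```python
def coverage_summary(correct):
    """Summarise ensemble coverage.

    correct[k][i] is truthy when base estimator k classifies example i correctly.
    """
    n = len(correct[0])
    per_example = list(zip(*correct))  # one tuple of flags per example
    all_right = sum(all(row) for row in per_example) / n
    none_right = sum(not any(row) for row in per_example) / n
    return {
        'all_correct': all_right,
        'missing_expertise': none_right,
        # Best case: the stack layer always trusts a correct member when one exists.
        'oracle_accuracy': 1 - none_right,
    }


# Three toy estimators scored on ten examples (1 = correct, 0 = wrong).
toy_correct = [
    [1, 1, 1, 1, 1, 1, 1, 0, 0, 0],
    [1, 1, 1, 1, 1, 1, 0, 1, 0, 0],
    [1, 1, 1, 1, 1, 0, 1, 1, 1, 0],
]
summary = coverage_summary(toy_correct)
print(summary)
```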

### RandomForests and ExtraTrees
Pairwise comparison shows that the RandomForestClassifier and ExtraTreesClassifier are the most similar. This makes sense, as they are both forests of DecisionTrees with similar hyperparameters. The difference between the two algorithms is minimal, hence they show little disagreement and high correlation, and despite being the two "strongest" members, between them they have the least combined expertise. This manifests itself through high double faults, low disagreement, and large Q statistics and Matthews coefficients.

### LogisticRegression and RandomForest
The best pair is LogisticRegression and RandomForest, whose Double Fault metric is only about 10%. This shows there is greater overall coverage. Using two dissimilar algorithms allows for increased disagreement, which can lower the Double Fault measure, resulting in greater overall expertise.
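
For reference, the two pairwise measures discussed here can be computed from per-example correctness flags. The sketch below uses the usual definitions (double fault is the fraction of examples both classifiers get wrong, disagreement is the fraction where exactly one is correct). The function names and toy flags are illustrative and may differ from what the notebook's `diversity` method does.

```python
def double_fault(correct_a, correct_b):
    """Fraction of examples that both classifiers misclassify."""
    pairs = list(zip(correct_a, correct_b))
    return sum((not a) and (not b) for a, b in pairs) / len(pairs)


def disagreement(correct_a, correct_b):
    """Fraction of examples where exactly one classifier is correct."""
    pairs = list(zip(correct_a, correct_b))
    return sum(bool(a) != bool(b) for a, b in pairs) / len(pairs)


# Toy correctness flags for two classifiers on ten examples (1 = correct).
clf_a = [1, 1, 1, 1, 0, 0, 1, 1, 0, 1]
clf_b = [1, 1, 0, 1, 1, 0, 1, 0, 0, 1]

df_ab = double_fault(clf_a, clf_b)
dis_ab = disagreement(clf_a, clf_b)
print('double fault:', df_ab, 'disagreement:', dis_ab)
```

A pair with low double fault and higher disagreement leaves fewer examples on which the stacked layer has no correct vote to choose from.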

### Q Stat
According to Kuncheva et al. (2003), classifiers that tend to be right and wrong on the same examples will have high Q statistics; by extension, an ensemble with a high mean pairwise Q statistic is not optimised to have broad expertise. Classifiers that tend to commit their errors on different examples will have negative Q statistics, at which point broad expertise tends to just be noise. Ideally the Q statistic is zero, at which point the classifiers are independent and the estimators are optimally diverse. This suggests that, when searching through different sets of base classifiers, optimising the mean Q statistic of the ensemble should produce ensembles that are sufficiently accurate and diverse.
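
The pairwise Q statistic is usually defined from the counts of examples both classifiers get right (N11), both get wrong (N00), and exactly one gets right (N10, N01) as Q = (N11·N00 - N01·N10) / (N11·N00 + N01·N10). A minimal sketch of that formula follows. `q_statistic` and `mean_pairwise_q` are hypothetical helpers and the correctness flags are made up.

```python
from itertools import combinations


def q_statistic(correct_a, correct_b):
    """Yule's Q for two classifiers given per-example correctness flags."""
    n11 = n00 = n10 = n01 = 0
    for a, b in zip(correct_a, correct_b):
        if a and b:
            n11 += 1
        elif a:
            n10 += 1
        elif b:
            n01 += 1
        else:
            n00 += 1
    denom = n11 * n00 + n01 * n10
    if denom == 0:
        return float('nan')  # undefined when both products are zero
    return (n11 * n00 - n01 * n10) / denom


def mean_pairwise_q(correct):
    """Average Q over every pair of classifiers in the ensemble."""
    values = [q_statistic(a, b) for a, b in combinations(correct, 2)]
    return sum(values) / len(values)


# Toy correctness flags for two classifiers on ten examples (1 = correct).
clf_a = [1, 1, 1, 1, 0, 0, 1, 1, 0, 1]
clf_b = [1, 1, 0, 1, 1, 0, 1, 0, 0, 1]

q_ab = q_statistic(clf_a, clf_b)
print('Q =', round(q_ab, 3))
```

Q ranges from -1 to 1, so a set of base classifiers could be compared by how close `mean_pairwise_q` is to zero, alongside their individual accuracies.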

