# Automated Machine Learning

A data scientist's modelling workflow is complex, but finding the right combination of transformers and estimators is essentially a search problem, and that search can be automated with the right strategy.

## `tpot`
`tpot` is a data science assistant that iteratively constructs `sklearn` pipelines and optimises them with genetic programming. These algorithms can optimise multiple criteria simultaneously while keeping pipeline complexity low. It builds on a package called [deap](https://deap.readthedocs.io/en/master/).

- Supports regression and classification
- Supports the usual performance metrics
- Is meant to run for hours to days
- We can inspect the process anytime and look at intermediate results
- We can limit algorithms and hyperparameter space (not so useful at the moment because we have to specify the whole parameter range and basically get stuck doing grid search)
- `tpot` can generate python code to reconstruct the best models
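
To see why a restricted configuration ends up resembling grid search, the sketch below counts the hyperparameter combinations in a small, hypothetical search space. It assumes the dictionary layout used by the commented-out configuration later in this notebook (estimator import path mapped to candidate values per parameter). Only the `min_samples_split` values come from that configuration; every other value is made up for illustration.

```python
from math import prod

# Hypothetical restricted search space: import path -> {parameter: candidate values}.
# Only the min_samples_split list is taken from this notebook; the rest is illustrative.
restricted_config = {
    "sklearn.tree.DecisionTreeClassifier": {
        "criterion": ["gini", "entropy"],
        "max_depth": list(range(1, 11)),
        "min_samples_split": [2, 5, 8, 15, 30, 50],
    },
    "sklearn.linear_model.LogisticRegression": {
        "C": [0.01, 0.1, 1.0, 10.0],
        "penalty": ["l2"],
    },
}


def count_combinations(config):
    """Return the number of distinct hyperparameter settings per estimator."""
    return {
        name: prod(len(values) for values in params.values())
        for name, params in config.items()
    }


combination_counts = count_combinations(restricted_config)
for name, count in combination_counts.items():
    print(f"{name}: {count} combinations")
```

Every value has to be listed explicitly, so the space grows multiplicatively with each parameter that is added.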

### Load data

In [11]:
import feather

df = feather.read_dataframe('./data/mapped_df.feather')

y = df['y']
# as_matrix() was removed in pandas 1.0; to_numpy() returns the same array
X = df.drop('y', axis = 1)\
 .to_numpy()


array([[ 1.        ,  0.        ,  0.        , ...,  0.43279337,
        -0.47367361, -0.50244517],
       [ 0.        ,  1.        ,  0.        , ...,  0.43279337,
        -0.47367361,  0.78684529],
       [ 0.        ,  0.        ,  0.        , ..., -0.4745452 ,
        -0.47367361, -0.48885426],
       ...,
       [ 0.        ,  0.        ,  0.        , ...,  0.43279337,
         2.00893337, -0.17626324],
       [ 1.        ,  1.        ,  0.        , ..., -0.4745452 ,
        -0.47367361, -0.04438104],
       [ 1.        ,  0.        ,  1.        , ..., -0.4745452 ,
        -0.47367361, -0.49237783]])

### Limited run
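First we restrict the search. A custom configuration with only a decision tree and logistic regression is shown commented out below; the run itself uses the built-in `'TPOT light'` configuration, which only contains fast algorithms.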

In [28]:
from tpot import TPOTClassifier
from scipy import stats

# at least one of the Estimators with the corresponding parameters must
# be part of the pipeline, not so useful at the moment.

# t_pot_config = {'sklearn.tree.DecisionTreeClassifier'
#                 :{ 'min_samples_split' : [2, 5, 8, 15, 30, 50] }
#                , 'sklearn.linear_model.LogisticRegression':{} 
#               }

pipeline_optimizer = TPOTClassifier(generations=20
                                    , population_size=20
                                    , offspring_size = 100
                                    ## TPOT will evaluate population_size
                                    ## + generations × offspring_size pipelines
                                    ## in total (here 20 + 20 × 100 = 2020).
                                    , cv=5
                                    , random_state=42 ##seed
                                    , verbosity=2 ## print progressbar
                                    , n_jobs = 4
                                    , warm_start = True ## allows us to restart
                                    , scoring = 'roc_auc'
                                    , config_dict = 'TPOT light' ## only uses fast algorithms
                                   )

pipeline_optimizer.fit(X,y)

Generation 1 - Current best internal CV score: 0.875048057951106
Generation 2 - Current best internal CV score: 0.8751402845650718
Generation 3 - Current best internal CV score: 0.8760511868017323
Generation 4 - Current best internal CV score: 0.8760511868017323
Generation 5 - Current best internal CV score: 0.8762918285129622
Generation 6 - Current best internal CV score: 0.8762918285129622
Generation 7 - Current best internal CV score: 0.8763840551269281
Generation 8 - Current best internal CV score: 0.8763840551269281
Generation 9 - Current best internal CV score: 0.8763840551269281
Generation 10 - Current best internal CV score: 0.8763840551269281
Generation 11 - Current best internal CV score: 0.8763840551269281
Generation 12 - Current best internal CV score: 0.8763840551269281
Generation 13 - Current best internal CV score: 0.8763840551269281
Generation 14 - Current best internal CV score: 0.8763840551269281
Generation 15 - Current best internal CV score: 0.8763840551269281
Generation 16 - Current best internal CV score: 0.8763840551269281
Generation 17 - Current best internal CV score: 0.8763840551269281
Generation 18 - Current best internal CV score: 0.8763840551269281
Generation 19 - Current best internal CV score: 0.8763840551269281
Generation 20 - Current best internal CV score: 0.876581683585426

Best pipeline: DecisionTreeClassifier(CombineDFs(Binarizer(input_matrix, threshold=0.9), input_matrix), criterion=entropy, max_depth=9, min_samples_leaf=20, min_samples_split=19)


TPOTClassifier(config_dict={'sklearn.naive_bayes.GaussianNB': {}, 'sklearn.naive_bayes.BernoulliNB': {'alpha': [0.001, 0.01, 0.1, 1.0, 10.0, 100.0], 'fit_prior': [True, False]}, 'sklearn.naive_bayes.MultinomialNB': {'alpha': [0.001, 0.01, 0.1, 1.0, 10.0, 100.0], 'fit_prior': [True, False]}, 'sklearn.tree.DecisionT...e_selection.VarianceThreshold': {'threshold': [0.0001, 0.0005, 0.001, 0.005, 0.01, 0.05, 0.1, 0.2]}},
        crossover_rate=0.1, cv=5, disable_update_check=False,
        early_stop=None, generations=20, max_eval_time_mins=5,
        max_time_mins=None, memory=None, mutation_rate=0.9, n_jobs=4,
        offspring_size=100, periodic_checkpoint_folder=None,
        population_size=20, random_state=42, scoring=None, subsample=1.0,
        verbosity=2, warm_start=True)
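
The evaluation budget of this run follows from the formula quoted in the comments of the `TPOTClassifier` call. As a small worked example, the helper below (a hypothetical function, not part of `tpot`) applies that formula to the settings used here.

```python
def total_pipelines(population_size, generations, offspring_size):
    """Initial population plus the offspring produced in every generation."""
    return population_size + generations * offspring_size


# Settings taken from the TPOTClassifier call in this notebook.
budget = total_pipelines(population_size=20, generations=20, offspring_size=100)
print(budget)  # 20 + 20 * 100 = 2020
```

With 5-fold cross-validation each of those pipelines is fitted five times, which is why longer runs are expected to take hours.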

In [30]:
pipeline_optimizer.evaluated_individuals_

{'KNeighborsClassifier(input_matrix, KNeighborsClassifier__n_neighbors=21, KNeighborsClassifier__p=1, KNeighborsClassifier__weights=distance)': {'crossover_count': 0,
  'generation': 0,
  'internal_cv_score': 0.8470927552585381,
  'mutation_count': 0,
  'operator_count': 1,
  'predecessor': ('ROOT',)},
 'GaussianNB(input_matrix)': {'crossover_count': 0,
  'generation': 0,
  'internal_cv_score': 0.816504621285001,
  'mutation_count': 0,
  'operator_count': 1,
  'predecessor': ('ROOT',)},
 'KNeighborsClassifier(input_matrix, KNeighborsClassifier__n_neighbors=100, KNeighborsClassifier__p=2, KNeighborsClassifier__weights=distance)': {'crossover_count': 0,
  'generation': 0,
  'internal_cv_score': 0.8403183901475156,
  'mutation_count': 0,
  'operator_count': 1,
  'predecessor': ('ROOT',)},
 'DecisionTreeClassifier(input_matrix, DecisionTreeClassifier__criterion=gini, DecisionTreeClassifier__max_depth=2, DecisionTreeClassifier__min_samples_leaf=12, DecisionTreeClassifier__min_samples_split=
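
Because `evaluated_individuals_` is a plain dictionary keyed by the pipeline string, intermediate results can be ranked with ordinary Python. The sketch below is one possible way to do that; `top_pipelines` is a hypothetical helper, and the sample dictionary holds only the three complete entries from the output above, reduced to a few fields.

```python
# Excerpt copied from the evaluated_individuals_ output (three entries, fewer fields).
evaluated = {
    "KNeighborsClassifier(input_matrix, KNeighborsClassifier__n_neighbors=21, "
    "KNeighborsClassifier__p=1, KNeighborsClassifier__weights=distance)": {
        "generation": 0,
        "internal_cv_score": 0.8470927552585381,
        "operator_count": 1,
    },
    "GaussianNB(input_matrix)": {
        "generation": 0,
        "internal_cv_score": 0.816504621285001,
        "operator_count": 1,
    },
    "KNeighborsClassifier(input_matrix, KNeighborsClassifier__n_neighbors=100, "
    "KNeighborsClassifier__p=2, KNeighborsClassifier__weights=distance)": {
        "generation": 0,
        "internal_cv_score": 0.8403183901475156,
        "operator_count": 1,
    },
}


def top_pipelines(individuals, n=2):
    """Return the n (pipeline string, CV score) pairs with the highest score."""
    ranked = sorted(
        individuals.items(),
        key=lambda item: item[1]["internal_cv_score"],
        reverse=True,
    )
    return [(name, info["internal_cv_score"]) for name, info in ranked[:n]]


best = top_pipelines(evaluated)
for name, score in best:
    print(f"{score:.4f}  {name}")
```

On a fitted optimizer the same helper could be called as `top_pipelines(pipeline_optimizer.evaluated_individuals_, n=10)`.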

In [29]:
pipeline_optimizer.export('./data/lim_pipe.py')

True
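
The best pipeline reported above can also be approximated by hand in plain `sklearn`. The sketch below is an assumption about how that pipeline string maps to `sklearn` objects, not the contents of the exported file: `CombineDFs` is modelled as a feature union of the binarized features and an identity `FunctionTransformer`, and synthetic data stands in for `mapped_df.feather`.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline, make_union
from sklearn.preprocessing import Binarizer, FunctionTransformer
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in data (assumption); the notebook itself uses mapped_df.feather.
X_demo, y_demo = make_classification(n_samples=500, n_features=10, random_state=42)

best_pipeline = make_pipeline(
    # Binarized copy of the features side by side with the unchanged features.
    make_union(
        Binarizer(threshold=0.9),
        FunctionTransformer(),  # func=None acts as the identity transform
    ),
    DecisionTreeClassifier(
        criterion="entropy",
        max_depth=9,
        min_samples_leaf=20,
        min_samples_split=19,
        random_state=42,
    ),
)

# Same evaluation protocol as the TPOT run: 5-fold CV scored by ROC AUC.
scores = cross_val_score(best_pipeline, X_demo, y_demo, cv=5, scoring="roc_auc")
print(scores.mean())
```

The score on synthetic data says nothing about the real data set; the point is only that the result is an ordinary `sklearn` pipeline that can be refitted and cross-validated without `tpot`.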

## `auto-sklearn`
