# Part 4: TPOT

Without an extensive background in the statistics and mathematics behind different machine learning models, it can be difficult to determine which model is best for a given dataset. The same goes for tuning hyperparameters: as you have probably noticed, the models we've used in this workshop so far have many different parameters, and it is by no means obvious how to tune them.

Moreover, testing many different models, each with many different combinations of parameters, can be extremely time-consuming and impractical.

[TPOT](https://github.com/rhiever/tpot) is a tool that automates model selection and hyperparameter tuning using genetic programming. It also determines what preprocessing, if any, is needed, such as PCA or standard scaling. It can then export the resulting pipeline to a file with the scikit-learn code written for you.

Although it is in your best interest to learn as much about the theory behind machine learning as possible, tools like TPOT can, in principle, do much of this work for you.

TPOT can be used for both classification and regression.
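
To make the idea of a genetic search more concrete, here is a toy sketch in plain Python. It is not TPOT's algorithm: TPOT evolves whole scikit-learn pipelines and is considerably more elaborate. The `fitness` function, the two made-up parameters (`depth`, `rate`), and the mutation rule are all assumptions invented for illustration. The sketch only shows the role of the two settings used later in this part, `generations` (how many rounds of selection and mutation to run) and `population_size` (how many candidates are kept in each round).

```python
import random


def fitness(candidate):
    # Hypothetical stand-in for a cross-validation score.
    # It peaks at 1.0 when depth == 5 and rate == 0.1.
    depth, rate = candidate
    return 1.0 / (1.0 + (depth - 5) ** 2 + 100.0 * (rate - 0.1) ** 2)


def random_candidate(rng):
    # A candidate is a (depth, rate) pair of made-up hyperparameters.
    return (rng.randint(1, 10), rng.uniform(0.01, 1.0))


def mutate(candidate, rng):
    # Randomly nudge one of the two hyperparameters, staying within bounds.
    depth, rate = candidate
    if rng.random() < 0.5:
        depth = min(10, max(1, depth + rng.choice([-1, 1])))
    else:
        rate = min(1.0, max(0.01, rate + rng.gauss(0.0, 0.05)))
    return (depth, rate)


def evolve(generations=5, population_size=20, seed=10):
    rng = random.Random(seed)
    population = [random_candidate(rng) for _ in range(population_size)]
    best_per_generation = []
    for _ in range(generations):
        # Rank candidates, keep the better half, refill with mutated copies.
        ranked = sorted(population, key=fitness, reverse=True)
        survivors = ranked[: max(1, population_size // 2)]
        children = [
            mutate(rng.choice(survivors), rng)
            for _ in range(population_size - len(survivors))
        ]
        population = survivors + children
        best_per_generation.append(fitness(ranked[0]))
    return max(population, key=fitness), best_per_generation


best, history = evolve()
print("best candidate:", best)
print("best fitness per generation:", history)
```

Because the best candidates always survive into the next round, the best fitness can never get worse from one generation to the next. More generations or a larger population mean more candidates are evaluated, which costs more time.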

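Let's set NumPy's random seed so that steps that draw on NumPy's global random state, such as the train/test split below, are repeatable. The TPOT estimators also accept a `random_state` argument if you want to seed the search itself.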

In [1]:
import numpy as np

np.random.seed(10)

## Classification

In [2]:
from sklearn.datasets import load_iris
iris = load_iris()

In [3]:
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(iris.data, iris.target,
                                                    train_size=0.75, test_size=0.25)

In [5]:
from tpot import TPOTClassifier

tpot = TPOTClassifier(generations=5, population_size=20, verbosity=1)
tpot.fit(X_train, y_train)
print(tpot.score(X_test, y_test))
tpot.export('tpot_iris_pipeline.py')



Best pipeline: GaussianNB(LogisticRegression(input_matrix, C=10.0, dual=False, penalty=l2))
1.0


Let's look at the model TPOT created for us:

In [6]:
!cat tpot_iris_pipeline.py

import numpy as np
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB
from sklearn.pipeline import make_pipeline, make_union
from tpot.builtins import StackingEstimator

# NOTE: Make sure that the outcome column is labeled 'target' in the data file
tpot_data = pd.read_csv('PATH/TO/DATA/FILE', sep='COLUMN_SEPARATOR', dtype=np.float64)
features = tpot_data.drop('target', axis=1)
training_features, testing_features, training_target, testing_target = \
            train_test_split(features, tpot_data['target'], random_state=None)

# Average CV score on the training set was: 0.9826086956521738
exported_pipeline = make_pipeline(
    StackingEstimator(estimator=LogisticRegression(C=10.0, dual=False, penalty="l2")),
    GaussianNB()
)

exported_pipeline.fit(training_features, training_target)
results = exported_pipeline.predict(testing_features)
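
The exported file expects a CSV path and separator to be filled in, so it will not run as is. As a rough sketch of what the pipeline does, the snippet below rebuilds a similar two-step model on the iris data using scikit-learn only. `PredictionStacker` is a hypothetical, simplified stand-in for `tpot.builtins.StackingEstimator`; it assumes that the stacking step passes the wrapped estimator's predicted label and class probabilities on as extra features. The fixed `random_state=10` and `max_iter=1000` are our own choices, not something TPOT selected.

```python
import numpy as np
from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB
from sklearn.pipeline import make_pipeline


class PredictionStacker(TransformerMixin, BaseEstimator):
    """Hypothetical, simplified stand-in for tpot.builtins.StackingEstimator."""

    def __init__(self, estimator):
        self.estimator = estimator

    def fit(self, X, y=None):
        # Train the wrapped estimator on the incoming features.
        self.estimator.fit(X, y)
        return self

    def transform(self, X):
        X = np.asarray(X)
        # Prepend the predicted label and class probabilities to the features.
        labels = self.estimator.predict(X).reshape(-1, 1)
        probabilities = self.estimator.predict_proba(X)
        return np.hstack((labels, probabilities, X))


iris = load_iris()
X_train, X_test, y_train, y_test = train_test_split(
    iris.data, iris.target, train_size=0.75, test_size=0.25, random_state=10
)

# Logistic regression feeds its predictions, as extra columns, into naive Bayes.
stacker = PredictionStacker(LogisticRegression(C=10.0, max_iter=1000))
pipeline = make_pipeline(stacker, GaussianNB())
pipeline.fit(X_train, y_train)

accuracy = pipeline.score(X_test, y_test)
stacked = pipeline.steps[0][1].transform(X_test)
print("test accuracy:", accuracy)
print("columns seen by GaussianNB:", stacked.shape[1])
```

The final `GaussianNB` step therefore sees the four original iris measurements plus one predicted label and three class probabilities from the logistic regression.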


## Regression

First we'll load the preprocessed heart data that we created during the regression tutorial.

In [7]:
heart_data = np.load('data/heart_preproc.npz')

X_train = heart_data['X_train']
X_test = heart_data['X_test']
y_train = heart_data['y_train']
y_test = heart_data['y_test']

Now we let TPOT search for a regression pipeline.

In [8]:
from tpot import TPOTRegressor

# generations: number of iterations of the genetic search
# population_size: number of candidate pipelines kept in each generation
tpot = TPOTRegressor(generations=5, population_size=20, verbosity=1, scoring='r2')
tpot.fit(X_train, y_train.ravel())
print(tpot.score(X_test, y_test.ravel()))
tpot.export('tpot_heart_pipeline.py')

Best pipeline: AdaBoostRegressor(ZeroCount(GradientBoostingRegressor(input_matrix, alpha=0.95, learning_rate=0.1, loss=huber, max_depth=7, max_features=0.5, min_samples_leaf=10, min_samples_split=13, n_estimators=100, subsample=0.05)), learning_rate=0.01, loss=exponential, n_estimators=100)
0.31193089935474727
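
Since we asked for `scoring='r2'`, the number printed above should be the coefficient of determination on the test set. As a reminder of what that measures, here is a minimal plain-Python sketch of the usual definition, R² = 1 - SS_res / SS_tot. The function name `r2_score_sketch` and the small example values are made up for illustration and are not taken from the heart data.

```python
def r2_score_sketch(y_true, y_pred):
    # R^2 = 1 - (residual sum of squares) / (total sum of squares)
    mean_true = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)
    return 1.0 - ss_res / ss_tot


y_true = [3.0, 5.0, 7.0, 9.0]

perfect = r2_score_sketch(y_true, y_true)       # exact predictions give 1.0
baseline = r2_score_sketch(y_true, [6.0] * 4)   # always predicting the mean gives 0.0
partial = r2_score_sketch(y_true, [4.0, 5.0, 7.0, 8.0])

print(perfect, baseline, partial)
```

On this scale, a score of about 0.31 means the pipeline explains roughly a third of the variance in the test targets: better than always predicting the mean, but far from a perfect fit.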


In [9]:
!cat tpot_heart_pipeline.py

import numpy as np
import pandas as pd
from sklearn.ensemble import AdaBoostRegressor, GradientBoostingRegressor
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline, make_union
from tpot.builtins import StackingEstimator, ZeroCount

# NOTE: Make sure that the outcome column is labeled 'target' in the data file
tpot_data = pd.read_csv('PATH/TO/DATA/FILE', sep='COLUMN_SEPARATOR', dtype=np.float64)
features = tpot_data.drop('target', axis=1)
training_features, testing_features, training_target, testing_target = \
            train_test_split(features, tpot_data['target'], random_state=None)

# Average CV score on the training set was: 0.24848178023575235
exported_pipeline = make_pipeline(
    StackingEstimator(estimator=GradientBoostingRegressor(alpha=0.95, learning_rate=0.1, loss="huber", max_depth=7, max_features=0.5, min_samples_leaf=10, min_samples_split=13, n_estimators=100, subsample=0.05)),
    ZeroCount(),
    AdaBoostRegressor(learning_