# TPOT Demo

Small demo notebook to create a trained model using TPOT.

## Setup

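First we import our dependencies and load the Boston housing price dataset. Note that `load_boston` was deprecated in scikit-learn 1.0 and removed in 1.2, so this notebook needs an older scikit-learn release.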

In [1]:
from pathlib import Path

import pandas as pd
from sklearn.datasets import load_boston
from sklearn.model_selection import train_test_split
from tpot import TPOTRegressor

boston = load_boston()

## Dataset

We're using a common dataset of housing prices in the Boston area from the 1970s. Scikit-learn provides a description of the dataset and its features:

In [2]:
print(boston.DESCR)

.. _boston_dataset:

Boston house prices dataset
---------------------------

**Data Set Characteristics:**  

    :Number of Instances: 506 

    :Number of Attributes: 13 numeric/categorical predictive. Median Value (attribute 14) is usually the target.

    :Attribute Information (in order):
        - CRIM     per capita crime rate by town
        - ZN       proportion of residential land zoned for lots over 25,000 sq.ft.
        - INDUS    proportion of non-retail business acres per town
        - CHAS     Charles River dummy variable (= 1 if tract bounds river; 0 otherwise)
        - NOX      nitric oxides concentration (parts per 10 million)
        - RM       average number of rooms per dwelling
        - AGE      proportion of owner-occupied units built prior to 1940
        - DIS      weighted distances to five Boston employment centres
        - RAD      index of accessibility to radial highways
        - TAX      full-value property-tax rate per $10,000
        - PTRATIO  pu

Let's look at the dataset a bit before building our model by loading it into a DataFrame. The features and the target are stored separately, so we add the home values back as an extra column.

In [3]:
bdf = pd.DataFrame(data=boston.data, columns=boston.feature_names)
bdf['MDEV'] = pd.Series(boston.target)

bdf.head(10)

Unnamed: 0,CRIM,ZN,INDUS,CHAS,NOX,RM,AGE,DIS,RAD,TAX,PTRATIO,B,LSTAT,MDEV
0,0.00632,18.0,2.31,0.0,0.538,6.575,65.2,4.09,1.0,296.0,15.3,396.9,4.98,24.0
1,0.02731,0.0,7.07,0.0,0.469,6.421,78.9,4.9671,2.0,242.0,17.8,396.9,9.14,21.6
2,0.02729,0.0,7.07,0.0,0.469,7.185,61.1,4.9671,2.0,242.0,17.8,392.83,4.03,34.7
3,0.03237,0.0,2.18,0.0,0.458,6.998,45.8,6.0622,3.0,222.0,18.7,394.63,2.94,33.4
4,0.06905,0.0,2.18,0.0,0.458,7.147,54.2,6.0622,3.0,222.0,18.7,396.9,5.33,36.2
5,0.02985,0.0,2.18,0.0,0.458,6.43,58.7,6.0622,3.0,222.0,18.7,394.12,5.21,28.7
6,0.08829,12.5,7.87,0.0,0.524,6.012,66.6,5.5605,5.0,311.0,15.2,395.6,12.43,22.9
7,0.14455,12.5,7.87,0.0,0.524,6.172,96.1,5.9505,5.0,311.0,15.2,396.9,19.15,27.1
8,0.21124,12.5,7.87,0.0,0.524,5.631,100.0,6.0821,5.0,311.0,15.2,386.63,29.93,16.5
9,0.17004,12.5,7.87,0.0,0.524,6.004,85.9,6.5921,5.0,311.0,15.2,386.71,17.1,18.9


## Training

We need to randomly divide our data into training and testing sets. Scoring a model on the same data it was trained on hides overfitting: the score looks better than it should and tells us little about how the model will perform on data it has not seen.

In [4]:
TRAIN_SIZE = 0.8

X_train, X_test, Y_train, Y_test = train_test_split(
    boston.data,
    boston.target,
    train_size=TRAIN_SIZE,
    test_size=1-TRAIN_SIZE,
)
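
To see why the held-out set matters, here is a minimal sketch in plain Python. Everything in it is an assumption for illustration: the toy data, the 80/20 cut mirroring `TRAIN_SIZE`, and a deliberately bad "model" that only memorises its training rows. It is not part of the TPOT workflow.

```python
import random

random.seed(0)  # fixed seed so the sketch is repeatable

# Toy data (an assumption for illustration): y = 2x plus noise.
xs = [i / 10 for i in range(100)]
data = [(x, 2 * x + random.gauss(0, 1)) for x in xs]

# Shuffle, then hold out 20% of the rows, mirroring TRAIN_SIZE = 0.8.
random.shuffle(data)
cut = int(len(data) * 0.8)
train, test = data[:cut], data[cut:]

# A "model" that memorises the training rows and falls back to the
# training mean for any input it has not seen before.
lookup = dict(train)
train_mean = sum(y for _, y in train) / len(train)


def predict(x):
    return lookup.get(x, train_mean)


def mse(rows):
    """Mean squared error of predict() over (x, y) rows."""
    return sum((predict(x) - y) ** 2 for x, y in rows) / len(rows)


train_mse = mse(train)
test_mse = mse(test)
print(f"train MSE: {train_mse:.3f}")
print(f"test MSE:  {test_mse:.3f}")
```

The memorising model looks perfect when scored on its own training rows, and only the held-out rows reveal how poorly it generalises.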

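Now we train the model. TPOT will run the data through many possible model types and hyperparameter configurations, using genetic programming to search for pipelines that score well under cross-validation. This should give us a finished model that fits the data well without our having to choose the model type and hyperparameters ourselves.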

Because the predicted value is a dollar amount (a continuous, non-discrete value), we use a regression model instead of a classification model.

In [5]:
model = TPOTRegressor(generations=10, population_size=50, verbosity=2, n_jobs=1)
model.fit(X_train, Y_train)

Generation 1 - Current best internal CV score: -11.668290331251995
Generation 2 - Current best internal CV score: -11.668290331251995
Generation 3 - Current best internal CV score: -11.668290331251995
Generation 4 - Current best internal CV score: -11.452473568925464
Generation 5 - Current best internal CV score: -11.237423444555033
Generation 6 - Current best internal CV score: -11.197742650314488
Generation 7 - Current best internal CV score: -11.197742650314488
Generation 8 - Current best internal CV score: -11.197742650314488
Generation 9 - Current best internal CV score: -10.868745779772542
Generation 10 - Current best internal CV score: -10.868745779772542

Best pipeline: XGBRegressor(ExtraTreesRegressor(input_matrix, bootstrap=False, max_features=0.7000000000000001, min_samples_leaf=6, min_samples_split=10, n_estimators=100), learning_rate=0.1, max_depth=9, min_child_weight=8, n_estimators=100, nthread=1, objective=reg:squarederror, subsample=0.7000000000000001)


TPOTRegressor(config_dict=None, crossover_rate=0.1, cv=5,
              disable_update_check=False, early_stop=None, generations=10,
              max_eval_time_mins=5, max_time_mins=None, memory=None,
              mutation_rate=0.9, n_jobs=1, offspring_size=None,
              periodic_checkpoint_folder=None, population_size=50,
              random_state=None, scoring=None, subsample=1.0, template=None,
              use_dask=False, verbosity=2, warm_start=False)
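
In the log above, the best internal CV score either improves or holds steady from one generation to the next. The following toy sketch shows one way a generational search can behave like that. It is not TPOT's actual algorithm (TPOT evolves whole pipelines, applies crossover as well as mutation, and scores with cross-validation). The one-parameter problem, the constants, and the `score` helper are all assumptions made for illustration.

```python
import random

random.seed(1)  # fixed seed so the sketch is repeatable

# Toy problem (assumed for illustration): find the slope w that fits y = 3x.
points = [(x, 3.0 * x) for x in range(1, 6)]


def score(w):
    """Negative mean squared error: higher (closer to zero) is better."""
    return -sum((w * x - y) ** 2 for x, y in points) / len(points)


POPULATION_SIZE = 20
GENERATIONS = 10

# Start from random candidates.
population = [random.uniform(-10, 10) for _ in range(POPULATION_SIZE)]
best_scores = []

for generation in range(1, GENERATIONS + 1):
    # Mutate every candidate to create offspring.
    offspring = [w + random.gauss(0, 1) for w in population]
    # Keep the best candidates from parents and offspring combined,
    # so the best score can never get worse between generations.
    combined = sorted(population + offspring, key=score, reverse=True)
    population = combined[:POPULATION_SIZE]
    best_scores.append(score(population[0]))
    print(f"Generation {generation} - best score: {best_scores[-1]:.4f}")
```

Because the previous best candidate always survives selection, the printed best score is non-decreasing, matching the pattern in the TPOT log.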

## Score and Export

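Now that our model is trained, let's see the final score on the held-out test set. `TPOTRegressor` scores with negative mean squared error by default (the `scoring=None` setting shown above), so the score is never positive, and the closer it is to zero, the more accurate our model is.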

In [6]:
print(model.score(X_test, Y_test))

-10.4003523971119
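
A score like this is easier to read as an error in the target's own units. The sketch below makes two assumptions: that the score is a negative mean squared error (TPOT's default regression scoring), and that the target is the median home value in thousands of dollars, as in the dataset description. It simply takes the square root of the error.

```python
import math

# Test-set score printed above; assumed to be a negative mean squared error.
neg_mse = -10.4003523971119

mse = -neg_mse
rmse = math.sqrt(mse)

# The target is the median home value in thousands of dollars,
# so the RMSE is in the same unit.
print(f"MSE:  {mse:.2f}")
print(f"RMSE: {rmse:.2f} (about ${rmse * 1000:,.0f})")
```

RMSE penalises large misses more heavily than small ones, so individual predictions can be off by more or less than this figure.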


Now that TPOT has found a good model to use, we need to export the code to run that model. TPOT makes it easy to save this code in a Python file. It even plays well with `Path` objects 😃

In [7]:
output_code = Path('pipeline.py')
model.export(output_code)

That's great for ML pipelines, but we want to see that code. Let's just read it back in.

In [8]:
print(output_code.read_text())

import numpy as np
import pandas as pd
from sklearn.ensemble import ExtraTreesRegressor
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline, make_union
from tpot.builtins import StackingEstimator
from xgboost import XGBRegressor

# NOTE: Make sure that the class is labeled 'target' in the data file
tpot_data = pd.read_csv('PATH/TO/DATA/FILE', sep='COLUMN_SEPARATOR', dtype=np.float64)
features = tpot_data.drop('target', axis=1).values
training_features, testing_features, training_target, testing_target = \
            train_test_split(features, tpot_data['target'].values, random_state=None)

# Average CV score on the training set was:-10.868745779772542
exported_pipeline = make_pipeline(
    StackingEstimator(estimator=ExtraTreesRegressor(bootstrap=False, max_features=0.7000000000000001, min_samples_leaf=6, min_samples_split=10, n_estimators=100)),
    XGBRegressor(learning_rate=0.1, max_depth=9, min_child_weight=8, n_estimators=100, nthread=1

That's it. We loaded a dataset, saw some values, trained a model, and exported the necessary code.