# TPOT

<https://epistasislab.github.io/tpot/>

TPOT automates the most tedious part of machine learning by exploring thousands of candidate pipelines to find the best one for your data.

![](./data_image/tpot-ml-pipeline.png)

An example machine learning pipeline



Once TPOT is finished searching (or you get tired of waiting), it provides you with the Python code for the best pipeline it found so you can tinker with the pipeline from there.

![](./data_image/tpot-pipeline-example.png)

In [3]:
!conda install -c conda-forge tpot -y
!conda install -c conda-forge tpot xgboost dask dask-ml scikit-mdr skrebate -y

Collecting package metadata (current_repodata.json): ...working... done
Solving environment: ...working... done

# All requested packages already installed.

Retrieving notices: ...working... done
Collecting package metadata (current_repodata.json): ...working... done
Solving environment: ...working... done

# All requested packages already installed.

Retrieving notices: ...working... done


<https://epistasislab.github.io/tpot/api/>

class tpot.TPOTClassifier(generations=100, population_size=100,
                          offspring_size=None, mutation_rate=0.9,
                          crossover_rate=0.1,
                          scoring='accuracy', cv=5,
                          subsample=1.0, n_jobs=1,
                          max_time_mins=None, max_eval_time_mins=5,
                          random_state=None, config_dict=None,
                          template=None,
                          warm_start=False,
                          memory=None,
                          use_dask=False,
                          periodic_checkpoint_folder=None,
                          early_stop=None,
                          verbosity=0,
                          disable_update_check=False,
                          log_file=None)

class tpot.TPOTRegressor(generations=100, population_size=100,
                         offspring_size=None, mutation_rate=0.9,
                         crossover_rate=0.1,
                         scoring='neg_mean_squared_error', cv=5,
                         subsample=1.0, n_jobs=1,
                         max_time_mins=None, max_eval_time_mins=5,
                         random_state=None, config_dict=None,
                         template=None,
                         warm_start=False,
                         memory=None,
                         use_dask=False,
                         periodic_checkpoint_folder=None,
                         early_stop=None,
                         verbosity=0,
                         disable_update_check=False)
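The `generations`, `population_size` and `offspring_size` parameters together determine how much work a search does. According to the TPOT documentation, a run evaluates `population_size + generations * offspring_size` pipelines, with `offspring_size` defaulting to `population_size`. The sketch below turns that into a rough budget estimate; the helper names are hypothetical, and the fit count ignores timeouts and any pipelines TPOT skips.

```python
def pipelines_evaluated(generations=100, population_size=100, offspring_size=None):
    """Initial population plus one batch of offspring per generation."""
    if offspring_size is None:
        # TPOT uses population_size when offspring_size is left at None.
        offspring_size = population_size
    return population_size + generations * offspring_size


def model_fits(generations=100, population_size=100, offspring_size=None, cv=5):
    """Rough number of model fits: every pipeline is scored with cv-fold cross-validation."""
    return pipelines_evaluated(generations, population_size, offspring_size) * cv


print(pipelines_evaluated(generations=5, population_size=50))  # 300
print(model_fits(generations=5, population_size=50, cv=5))     # 1500
print(pipelines_evaluated())                                   # 10100 with the defaults
```

The small settings used in the examples of this notebook (5 generations, population of 50) keep the search to a few hundred pipelines, whereas the defaults evaluate more than ten thousand.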


<https://epistasislab.github.io/tpot/using/>

<https://epistasislab.github.io/tpot/using/#built-in-tpot-configurations>





| Configuration Name | Description                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                    | Operators                        |
|--------------------|----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|----------------------------------|
| Default TPOT       | TPOT will search over a broad range of preprocessors, feature constructors, feature selectors, models, and parameters to find a series of operators that minimize the error of the model predictions. Some of these operators are complex and may take a long time to run, especially on larger datasets.<br><br>Note: This is the default configuration for TPOT. To use this configuration, use the default value (None) for the config_dict parameter.                                                                                                      | Classification<br><br>Regression |
| TPOT light         | TPOT will search over a restricted range of preprocessors, feature constructors, feature selectors, models, and parameters to find a series of operators that minimize the error of the model predictions. Only simpler and fast-running operators will be used in these pipelines, so TPOT light is useful for finding quick and simple pipelines for a classification or regression problem.<br><br>This configuration works for both the TPOTClassifier and TPOTRegressor.                                                                                  | Classification<br><br>Regression |
| TPOT MDR           | TPOT will search over a series of feature selectors and Multifactor Dimensionality Reduction models to find a series of operators that maximize prediction accuracy. The TPOT MDR configuration is specialized for genome-wide association studies (GWAS), and is described in detail online here.<br><br>Note that TPOT MDR may be slow to run because the feature selection routines are computationally expensive, especially on large datasets.                                                                                                            | Classification<br><br>Regression |
| TPOT sparse        | TPOT uses a configuration dictionary with a one-hot encoder and the operators normally included in TPOT that also support sparse matrices.<br><br>This configuration works for both the TPOTClassifier and TPOTRegressor.                                                                                                                                                                                                                                                                                                                                      | Classification<br><br>Regression |
| TPOT NN            | TPOT uses the same configuration as "Default TPOT" plus additional neural network estimators written in PyTorch (currently only `tpot.builtins.PytorchLRClassifier` and `tpot.builtins.PytorchMLPClassifier`).<br><br>Currently only classification is supported, but future releases will include regression estimators.                                                                                                                                                                                                                                      | Classification                   |
| TPOT cuML          | TPOT will search over a restricted configuration using the GPU-accelerated estimators in RAPIDS cuML and DMLC XGBoost. This configuration requires an NVIDIA Pascal architecture or better GPU with compute capability 6.0+, and that the library cuML is installed. With this configuration, all model training and predicting will be GPU-accelerated.<br><br>This configuration is particularly useful for medium-sized and larger datasets on which CPU-based estimators are a common bottleneck, and works for both the TPOTClassifier and TPOTRegressor. | Classification<br><br>Regression |
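Besides the built-in names in the table, `config_dict` also accepts a custom dictionary. The sketch below follows the format described in the TPOT documentation: each key is the import path of an operator and each value maps a hyperparameter name to the values TPOT may try. The particular operators and ranges are illustrative assumptions, and `build_classifier` is a hypothetical helper, not part of TPOT.

```python
# Keys: operator import paths. Values: hyperparameter name -> candidate values.
custom_config = {
    "sklearn.naive_bayes.GaussianNB": {},
    "sklearn.neighbors.KNeighborsClassifier": {
        "n_neighbors": list(range(1, 31)),
        "weights": ["uniform", "distance"],
        "p": [1, 2],
    },
    "sklearn.preprocessing.Normalizer": {
        "norm": ["l1", "l2", "max"],
    },
}


def build_classifier(config=None, **kwargs):
    """Create a TPOTClassifier restricted to the operators in `config`."""
    # Imported here so the dictionary above can be inspected without TPOT installed.
    from tpot import TPOTClassifier

    return TPOTClassifier(config_dict=config or custom_config, **kwargs)
```

A call such as `build_classifier(generations=5, population_size=20, verbosity=2)` would then search only over these three operators; passing a built-in name such as `config_dict='TPOT light'` directly to `TPOTClassifier` selects one of the configurations from the table instead.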

# Classification

## Iris dataset

In [10]:
from tpot import TPOTClassifier
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
import numpy as np

iris = load_iris()
X_train, X_test, y_train, y_test = train_test_split(iris.data.astype(np.float64),
    iris.target.astype(np.float64), train_size=0.75, test_size=0.25, random_state=42)

tpot = TPOTClassifier(generations=5, population_size=50, verbosity=2, random_state=42, scoring="accuracy")
tpot.fit(X_train, y_train)
print(tpot.score(X_test, y_test))
tpot.export('tpot_iris_pipeline.py')


                                                                                
Generation 1 - Current best internal CV score: 0.9727272727272727
Generation 2 - Current best internal CV score: 0.9727272727272727
Generation 3 - Current best internal CV score: 0.9727272727272727
Generation 4 - Current best internal CV score: 0.9731225296442687
Generation 5 - Current best internal CV score: 0.9818181818181818

Best pipeline: RandomForestClassifier(LinearSVC(RandomForestClassifier(RBFSampler(input_matrix, gamma=0.30000000000000004), bootstrap=False, criterion=entropy, max_features=0.75, min_s



In [1]:
from tpot import TPOTClassifier
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
import numpy as np

iris = load_iris()
X_train, X_test, y_train, y_test = train_test_split(iris.data.astype(np.float64),
    iris.target.astype(np.float64), train_size=0.75, test_size=0.25, random_state=42)

tpot = TPOTClassifier(generations=5, population_size=50, verbosity=2, random_state=42,
                      scoring="accuracy", use_dask=True)
tpot.fit(X_train, y_train)
print(tpot.score(X_test, y_test))



                                                                                                                                      
Generation 1 - Current best internal CV score: 0.9727272727272727
Generation 2 - Current best internal CV score: 0.9727272727272727
Generation 3 - Current best internal CV score: 0.9727272727272727
Generation 4 - Current best internal CV score: 0.9731225296442687
Generation 5 - Current best internal CV score: 0.981818181818



In [3]:
from tpot import TPOTClassifier
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
import numpy as np

iris = load_iris()
X_train, X_test, y_train, y_test = train_test_split(iris.data.astype(np.float64),
    iris.target.astype(np.float64), train_size=0.75, test_size=0.25, random_state=42)

tpot = TPOTClassifier(generations=5, population_size=50, verbosity=2, random_state=42,
                      scoring="accuracy", use_dask=True, config_dict='TPOT light')
tpot.fit(X_train, y_train)
print(tpot.score(X_test, y_test))

                                                                                                                                      
Generation 1 - Current best internal CV score: 0.9727272727272727
Generation 2 - Current best internal CV score: 0.9818181818181818
Generation 3 - Current best internal CV score: 0.9818181818181818
Generation 4 - Current best internal CV score: 0.9818181818181818
Generation 5 - Current best internal CV score: 0.981818181818



In [6]:


from tpot import TPOTClassifier
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
import numpy as np

iris = load_iris()
X_train, X_test, y_train, y_test = train_test_split(iris.data.astype(np.float64),
    iris.target.astype(np.float64), train_size=0.75, test_size=0.25, random_state=42)

tpot = TPOTClassifier(generations=5, population_size=50, verbosity=2, random_state=42,
                      scoring="accuracy", use_dask=True,
                      config_dict=None)  # None selects the default TPOT configuration
tpot.fit(X_train, y_train)
print(tpot.score(X_test, y_test))

                                                                                                                                      
Generation 1 - Current best internal CV score: 0.9727272727272727
Generation 2 - Current best internal CV score: 0.9727272727272727
Generation 3 - Current best internal CV score: 0.9727272727272727
Generation 4 - Current best internal CV score: 0.9731225296442687
Generation 5 - Current best internal CV score: 0.981818181818



In [9]:
from tpot import TPOTClassifier
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
import numpy as np

iris = load_iris()
X_train, X_test, y_train, y_test = train_test_split(iris.data.astype(np.float64),
    iris.target.astype(np.float64), train_size=0.75, test_size=0.25, random_state=42)

tpot = TPOTClassifier(generations=5, population_size=50, verbosity=2, random_state=42,
                      scoring="f1_weighted", use_dask=True, n_jobs=-1,
                      config_dict=None)  # None selects the default TPOT configuration
tpot.fit(X_train, y_train)
print(tpot.score(X_test, y_test))

                                                                                                                                      
Generation 1 - Current best internal CV score: 0.9711613745106569
Generation 2 - Current best internal CV score: 0.9711613745106569
Generation 3 - Current best internal CV score: 0.9725833785424578
Generation 4 - Current best internal CV score: 0.9725833785424578
Generation 5 - Current best internal CV score: 0.972583378542



## Other Scores
List of scores:

- 'accuracy'
- 'adjusted_rand_score'
- 'average_precision'
- 'balanced_accuracy'
- 'f1'
- 'f1_macro'
- 'f1_micro'
- 'f1_samples'
- 'f1_weighted'
- 'neg_log_loss'
- 'neg_mean_absolute_error'
- 'neg_mean_squared_error'
- 'neg_median_absolute_error'
- 'precision'
- 'precision_macro'
- 'precision_micro'
- 'precision_samples'
- 'precision_weighted'
- 'r2'
- 'recall'
- 'recall_macro'
- 'recall_micro'
- 'recall_samples'
- 'recall_weighted'
- 'roc_auc'
- 'my_module.scorer_name*'
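The last entry in the list stands for a user-defined scorer. One way to build one, following the pattern shown in the TPOT documentation, is to wrap a metric function with scikit-learn's `make_scorer` and pass the result as `scoring`. The metric and helper names below are hypothetical; the metric is plain accuracy, chosen only to keep the sketch short.

```python
def exact_match_rate(y_true, y_pred):
    """Fraction of predictions that equal the true label (plain accuracy)."""
    matches = sum(1 for t, p in zip(y_true, y_pred) if t == p)
    return matches / len(y_true)


def build_custom_scorer():
    """Wrap the metric so it can be passed as the `scoring` argument."""
    # Imported here so the metric itself stays usable without scikit-learn.
    from sklearn.metrics import make_scorer

    # greater_is_better=True tells scikit-learn that higher values are better.
    return make_scorer(exact_match_rate, greater_is_better=True)


print(exact_match_rate([0, 1, 1, 2], [0, 1, 0, 2]))  # 0.75
```

The scorer would then be used as `TPOTClassifier(scoring=build_custom_scorer(), ...)`. For a loss-style metric where lower is better, `greater_is_better=False` makes scikit-learn negate the value, which is why the built-in error metrics in the list carry a `neg_` prefix.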

In [15]:
tpot_precision = TPOTClassifier(generations=5, population_size=50, verbosity=2, random_state=42,
                                scoring='precision_macro')
tpot_precision.fit(X_train, y_train)
print(tpot_precision.score(X_test, y_test))


                                                                                
Generation 1 - Current best internal CV score: 0.9818181818181818
Generation 2 - Current best internal CV score: 0.9818181818181818
Generation 3 - Current best internal CV score: 0.9818181818181818
Generation 4 - Current best internal CV score: 0.9818181818181818
Generation 5 - Current best internal CV score: 0.9866666666666667

Best pipeline: DecisionTreeClassifier(KNeighborsClassifier(Normalizer(input_matrix, norm=l2), n_neighbors=14, p=1, weights=uniform), criterion=gini, max_depth=5, min_samples_leaf=1, mi



In [17]:
tpot = TPOTClassifier(generations=2, population_size=20, verbosity=3,
                      scoring='recall_macro', config_dict='TPOT light')
tpot.fit(X_train, y_train)
print(tpot.score(X_test, y_test))


19 operators have been imported by TPOT.
_pre_test decorator: _random_mutation_operator: num_test=0 Solver lbfgs supports only 'l2' or 'none' penalties, got l1 penalty..
_pre_test decorator: _random_mutation_operator: num_test=0 Negative values in data passed to MultinomialNB (input X).
                                                                                
Generation 1 - Current Pareto front scores:
-1	0.9571428571428573	MultinomialNB(input_matrix, MultinomialNB__alpha=1.0, MultinomialNB__fit_prior=True)
-2	0.9654761904761905	KNeighborsClassifier(MultinomialNB(input_matrix, MultinomialNB__alpha=100.0, MultinomialNB__fit_prior=False), KNeighborsClassifier__n_neighbors=45, KNeighborsClassifier__p=1, KNeighborsClassifier__weights=distance)
_pre_test decorator: _random_mutation_operator: num_test=0 Solver lbfgs supports 



In [None]:
from tpot import TPOTClassifier
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
import numpy as np

iris = load_iris()
X_train, X_test, y_train, y_test = train_test_split(iris.data.astype(np.float64),
    iris.target.astype(np.float64), train_size=0.75, test_size=0.25, random_state=42)

tpot = TPOTClassifier(generations=5, population_size=50, verbosity=3, random_state=42,
                      scoring="accuracy", use_dask=True)
tpot.fit(X_train, y_train)
print(tpot.score(X_test, y_test))

In [3]:
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import Normalizer
from tpot.export_utils import set_param_recursive

# NOTE: The outcome column is labeled 'species' in this data file
tpot_data = pd.read_csv('./data_image/iris.csv', sep=',')
features = tpot_data.drop('species', axis=1)
training_features, testing_features, training_target, testing_target = \
            train_test_split(features, tpot_data['species'], random_state=42)

# Average CV score on the training set was: 0.9826086956521738
exported_pipeline = make_pipeline(
    Normalizer(norm="l2"),
    KNeighborsClassifier(n_neighbors=5, p=2, weights="distance")
)
# Fix random state for all the steps in exported pipeline
set_param_recursive(exported_pipeline.steps, 'random_state', 42)

exported_pipeline.fit(training_features, training_target)
results = exported_pipeline.predict(testing_features)


In [4]:
exported_pipeline.feature_names_in_

array(['sepal_length', 'sepal_width', 'petal_length', 'petal_width'],
      dtype=object)

In [5]:
exported_pipeline.get_params

<bound method Pipeline.get_params of Pipeline(steps=[('normalizer', Normalizer()),
                ('kneighborsclassifier',
                 KNeighborsClassifier(weights='distance'))])>
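The exported script above fits the pipeline and predicts, but it does not report a score, and it depends on a local CSV file. The sketch below is a self-contained variant that assumes scikit-learn's bundled `load_iris` data in place of `./data_image/iris.csv` and adds a held-out accuracy check. It also calls `get_params()` with parentheses, which returns the parameter dictionary rather than the bound method shown above.

```python
from sklearn.datasets import load_iris
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import Normalizer

# Assumption: the bundled iris data stands in for the CSV file used above.
X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

# Same two steps as the exported pipeline.
pipeline = make_pipeline(
    Normalizer(norm="l2"),
    KNeighborsClassifier(n_neighbors=5, p=2, weights="distance"),
)
pipeline.fit(X_train, y_train)

# Score the predictions on the held-out split.
held_out_accuracy = accuracy_score(y_test, pipeline.predict(X_test))
print(f"held-out accuracy: {held_out_accuracy:.3f}")

# Calling get_params() returns a dict keyed as '<step name>__<parameter>'.
params = pipeline.get_params()
print(params["kneighborsclassifier__weights"])
```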

## Digits dataset


In [6]:
from tpot import TPOTClassifier
from sklearn.datasets import load_digits
from sklearn.model_selection import train_test_split

digits = load_digits()
X_train, X_test, y_train, y_test = train_test_split(digits.data, digits.target,
                                                    train_size=0.75, test_size=0.25, random_state=42)

tpot = TPOTClassifier(generations=5, population_size=50, verbosity=2, random_state=42)
tpot.fit(X_train, y_train)
print(tpot.score(X_test, y_test))
# tpot.export('tpot_digits_pipeline.py')


                                                                                
Generation 1 - Current best internal CV score: 0.9844058928817294
Generation 2 - Current best internal CV score: 0.9866363761531047
Generation 3 - Current best internal CV score: 0.9866363761531047
Generation 4 - Current best internal CV score: 0.9866363761531047
Generation 5 - Current best internal CV score: 0.9866363761531047

Best pipeline: KNeighborsClassifier(input_matrix, n_neighbors=2, p=2, weights=distance)
0.9822222222222222




# Regression
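Similarly, TPOT can optimize pipelines for regression problems. Below is a minimal working example using the Boston housing prices dataset. Note that `load_boston` was deprecated in scikit-learn 1.0 and removed in 1.2, so this cell runs only on older scikit-learn versions; the warning printed in its output explains why.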
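For newer scikit-learn versions, the same search can be sketched on the California housing dataset, which the scikit-learn deprecation notice suggests as an alternative. The function name is hypothetical and the settings simply mirror the Boston cell. The dataset is downloaded on first use and has roughly 20,000 rows, so expect the search to take noticeably longer.

```python
def run_california_search(generations=5, population_size=50, random_state=42):
    """Fit a TPOTRegressor on California housing; return (tpot, test_score)."""
    # Imported here so that defining the function needs neither TPOT
    # nor a dataset download.
    from sklearn.datasets import fetch_california_housing
    from sklearn.model_selection import train_test_split
    from tpot import TPOTRegressor

    X, y = fetch_california_housing(return_X_y=True)
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, train_size=0.75, test_size=0.25, random_state=random_state)

    tpot = TPOTRegressor(generations=generations, population_size=population_size,
                         verbosity=2, random_state=random_state)
    tpot.fit(X_train, y_train)
    # With the default scoring this is the negative mean squared error.
    return tpot, tpot.score(X_test, y_test)
```

Calling `tpot, score = run_california_search()` runs the search; `tpot.export('tpot_california_pipeline.py')` would then write out the best pipeline, as in the classification examples.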

In [18]:
from tpot import TPOTRegressor
from sklearn.datasets import load_boston
from sklearn.model_selection import train_test_split

housing = load_boston()
X_train, X_test, y_train, y_test = train_test_split(housing.data, housing.target,
                                                    train_size=0.75, test_size=0.25, random_state=42)

tpot = TPOTRegressor(generations=5, population_size=50, verbosity=2, random_state=42)
tpot.fit(X_train, y_train)
print(tpot.score(X_test, y_test))
# tpot.export('tpot_boston_pipeline.py')


    The Boston housing prices dataset has an ethical problem. You can refer to
    the documentation of this function for further details.

    The scikit-learn maintainers therefore strongly discourage the use of this
    dataset unless the purpose of the code is to study and educate about
    ethical issues in data science and machine learning.

    In this special case, you can fetch the dataset from the original
    source::

        import pandas as pd
        import numpy as np

        data_url = "http://lib.stat.cmu.edu/datasets/boston"
        raw_df = pd.read_csv(data_url, sep="\s+", skiprows=22, header=None)
        data = np.hstack([raw_df.values[::2, :], raw_df.values[1::2, :2]])
        target = raw_df.values[1::2, 2]

    Alternative datasets include the California housing dataset (i.e.
    :func:`~sklearn.datasets.fetch_california_housing`) and the Ames housing
    dataset. You can load the datasets as follows::

        from sklearn.datasets import fetch_california_ho

                                                                                
Generation 1 - Current best internal CV score: -11.644597516229156
Generation 2 - Current best internal CV score: -11.644597516229156
Generation 3 - Current best internal CV score: -11.644597516229156
Generation 4 - Current best internal CV score: -11.644597516229156
Generation 5 - Current best internal CV score: -11.644597516229156

Best pipeline: XGBRegressor(input_matrix, learning_rate=0.1, max_depth=6, min_child_weight=11, n_estimators=100, n_jobs=1, objective=reg:squarederror, subsample=0.7000000000000001
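The internal CV scores above are negative because `TPOTRegressor` defaults to `scoring='neg_mean_squared_error'`, where values closer to zero are better. A small helper (hypothetical, not part of TPOT) converts such a score into a root mean squared error in the units of the target, which is usually easier to interpret.

```python
import math


def neg_mse_to_rmse(neg_mse):
    """Convert a negative mean squared error score into an RMSE."""
    if neg_mse > 0:
        raise ValueError("expected a score <= 0 from 'neg_mean_squared_error'")
    return math.sqrt(-neg_mse)


best_cv_score = -11.644597516229156  # best internal CV score reported above
print(f"RMSE: {neg_mse_to_rmse(best_cv_score):.2f}")  # RMSE: 3.41
```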

