# TPOT: Automated machine learning using genetic programming


## Installation

See http://rhiever.github.io/tpot/installing/ for TPOT installation instructions.

## Minimal working example

In [2]:
from tpot import TPOTClassifier
from sklearn.datasets import load_digits
from sklearn.model_selection import train_test_split

digits = load_digits()
X_train, X_test, y_train, y_test = train_test_split(digits.data, digits.target,
                                                    train_size=0.75, test_size=0.25)

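# Evolve a population of 20 pipelines for 5 generations
tpot = TPOTClassifier(generations=5, population_size=20, verbosity=2)
tpot.fit(X_train, y_train)
# Score the best pipeline on the held-out test set
print(tpot.score(X_test, y_test))
# Write the best pipeline out as a standalone Python script
tpot.export('tpot_mnist_pipeline.py')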

Optimization Progress:  33%|███▎      | 40/120 [05:03<05:55,  4.45s/pipeline]  

Generation 1 - Current best internal CV score: 0.972552911410494


Optimization Progress:  50%|█████     | 60/120 [06:27<03:31,  3.52s/pipeline]

Generation 2 - Current best internal CV score: 0.9732964058342859


Optimization Progress:  67%|██████▋   | 80/120 [08:03<02:34,  3.86s/pipeline]

Generation 3 - Current best internal CV score: 0.9843984805774681


Optimization Progress:  83%|████████▎ | 100/120 [10:04<00:34,  1.73s/pipeline]

Generation 4 - Current best internal CV score: 0.9843984805774681


Generation 5 - Current best internal CV score: 0.9843984805774681

Best pipeline: LogisticRegression(PolynomialFeatures(input_matrix, degree=2, include_bias=False, interaction_only=False), C=0.001, dual=False, penalty=l2)
0.995555555556


True
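
The `120` in the progress bar above is the total number of pipelines TPOT plans to evaluate. According to the TPOT documentation, that budget is `population_size + generations × offspring_size`, where `offspring_size` defaults to `population_size`. The sketch below only reproduces that arithmetic; `total_evaluations` is a hypothetical helper name, not part of TPOT.

```python
def total_evaluations(population_size, generations, offspring_size=None):
    """Return the number of pipelines evaluated for the given settings.

    Hypothetical helper (not a TPOT function); assumes offspring_size
    falls back to population_size when it is not given.
    """
    if offspring_size is None:
        offspring_size = population_size
    return population_size + generations * offspring_size


# Settings from the minimal example: generations=5, population_size=20
print(total_evaluations(population_size=20, generations=5))
```

For `generations=2, population_size=10` the same formula gives 30, so smaller settings shrink the search budget quickly.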

## Speeding up: parallel jobs and/or max time threshold

Tips:

- Pass `n_jobs=-1` to evaluate pipelines in parallel on all available CPU cores.
- Set `max_eval_time_mins` to cap the time (in minutes) TPOT may spend evaluating any single pipeline.

In [107]:
from tpot import TPOTClassifier
from sklearn.datasets import load_digits
from sklearn.model_selection import train_test_split
import time

digits = load_digits()
X_train, X_test, y_train, y_test = train_test_split(digits.data, digits.target,
                                                    train_size=0.75, test_size=0.25)

# Normal run
#tpot = TPOTClassifier(generations=2, population_size=10, verbosity=2, random_state=42)
# Parallel jobs
tpot = TPOTClassifier(generations=2, population_size=10, verbosity=2, n_jobs=-1, random_state=42)
# Parallel jobs and max time threshold (0.02 minutes, about 1.2 seconds, per pipeline)
#tpot = TPOTClassifier(generations=2, population_size=10, verbosity=2, max_eval_time_mins=0.02, n_jobs=-1, random_state=42)

time_start = time.time()
tpot.fit(X_train, y_train)
print(tpot.score(X_test, y_test))
print('\nTime used (seconds):', time.time() - time_start)

Optimization Progress:  67%|██████▋   | 20/30 [01:56<07:27, 44.79s/pipeline]

Generation 1 - Current best internal CV score: 0.9881273426019426


Generation 2 - Current best internal CV score: 0.9881273426019426

Best pipeline: LogisticRegression(PolynomialFeatures(input_matrix, degree=2, include_bias=False, interaction_only=False), C=25.0, dual=False, penalty=l1)
0.986666666667

Time used (seconds): 131.5601029396057
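
To compare the three configurations in the cell above without copying the timing lines, the measurement can be wrapped in a small context manager. This is a sketch using only the standard library; `timed` is a hypothetical helper, and `time.perf_counter` is swapped in for `time.time` because it is a monotonic clock intended for measuring intervals.

```python
import time
from contextlib import contextmanager


@contextmanager
def timed(label, results):
    """Record the wall-clock seconds spent in the with-block under `label`."""
    start = time.perf_counter()
    try:
        yield
    finally:
        # Store the elapsed time even if the block raises
        results[label] = time.perf_counter() - start


timings = {}
with timed('demo', timings):
    # Stand-in workload; replace with tpot.fit(X_train, y_train)
    sum(i * i for i in range(100_000))

for label, seconds in timings.items():
    print(f'{label}: {seconds:.3f} s')
```

Running each TPOT configuration inside its own `with timed(...)` block collects the durations in one dictionary, which makes the serial and parallel runs easy to compare side by side.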
