# tpot - Data Science Automated!

### Parameters
1. generations - any positive integer
    - The number of generations to run pipeline optimization over. Generally, TPOT works better when you give it more generations (and therefore more time) to optimize over. TPOT evaluates generations x population_size pipelines in total.
2. population_size - any positive integer
    - The number of individuals in the GP population. Generally, TPOT works better when you give it more individuals (and therefore more time) to optimize over. TPOT evaluates generations x population_size pipelines in total.
3. mutation_rate - [0.0,1.0]
    - The mutation rate for the genetic programming algorithm in the range [0.0, 1.0]. This tells the genetic programming algorithm how many pipelines to apply random changes to every generation. We don't recommend that you tweak this parameter unless you know what you're doing.
4. crossover_rate - [0.0,1.0]
    - The crossover rate for the genetic programming algorithm in the range [0.0, 1.0]. This tells the genetic programming algorithm how many pipelines to "breed" every generation. We don't recommend that you tweak this parameter unless you know what you're doing.
5. num_cv_folds - [2,10]
    - The number of folds to evaluate each pipeline over in k-fold cross-validation during the TPOT pipeline optimization process.
6. scoring - ['accuracy', 'adjusted_rand_score', 'average_precision', 'f1', 'f1_macro', 'f1_micro', 'f1_samples', 'f1_weighted', 'log_loss', 'mean_absolute_error', 'mean_squared_error', 'median_absolute_error', 'precision', 'precision_macro', 'precision_micro', 'precision_samples', 'precision_weighted', 'r2', 'recall', 'recall_macro', 'recall_micro', 'recall_samples', 'recall_weighted', 'roc_auc' or a callable function with signature scorer(y_true, y_pred)]
    - Function used to evaluate the quality of a given pipeline for the problem. By default, balanced accuracy is used for classification and mean squared error is used for regression. TPOT assumes that any function with "error" or "loss" in the name is meant to be minimized, whereas any other functions will be maximized. See the section on scoring functions for more details.
7. max_time_mins - any positive integer
    - How many minutes TPOT has to optimize the pipeline. This setting overrides the generations parameter.
8. max_eval_time_mins - any positive integer
    - How many minutes TPOT has to evaluate a single pipeline. Higher values let TPOT explore more complex pipelines but also let TPOT run longer.
9. random_state - any integer
    - The random number generator seed for TPOT. Use this to make sure that TPOT will give you the same results each time you run it against the same data set with that seed.
10. verbosity - {0,1,2,3}
    - How much information TPOT communicates while it's running. 0 = none, 1 = minimal, 2 = high, 3 = all. A setting of 2 or higher will add a progress bar to calls to fit().
11. disable_update_check - [True, False]
    - Flag indicating whether the TPOT version checker should be disabled.
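
Two of the rules in the list above are easy to state in code. The sketch below computes the evaluation budget the way the list describes it (generations x population_size) and applies the name-based rule that decides whether a scoring function is minimized or maximized. The helpers `total_pipeline_evaluations`, `is_minimized`, and `my_scorer` are hypothetical names made up for illustration; this is plain Python mirroring the descriptions, not TPOT's own code.

```python
def total_pipeline_evaluations(generations, population_size):
    """Evaluation budget as described in the list: generations x population_size."""
    if generations < 1 or population_size < 1:
        raise ValueError("generations and population_size must be positive integers")
    return generations * population_size


def is_minimized(scoring_name):
    """Name-based rule from the scoring entry: 'error' or 'loss' means minimize."""
    return "error" in scoring_name or "loss" in scoring_name


def my_scorer(y_true, y_pred):
    """Hypothetical custom scorer with the scorer(y_true, y_pred) signature.

    Returns plain accuracy: the fraction of predictions that match the labels.
    """
    correct = sum(1 for t, p in zip(y_true, y_pred) if t == p)
    return correct / len(y_true)


if __name__ == "__main__":
    # generations=5 and population_size=20 are the settings used in the examples.
    print(total_pipeline_evaluations(5, 20))      # 100
    print(is_minimized("mean_squared_error"))     # True
    print(is_minimized("accuracy"))               # False
    print(my_scorer([1, 0, 1, 1], [1, 1, 1, 0]))  # 0.5
```

Under this rule a custom callable named `my_scorer` would be maximized, because its name contains neither "error" nor "loss".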


# Example using sklearn Digits dataset

In [1]:
from tpot import TPOTClassifier
from sklearn.datasets import load_digits
from sklearn.model_selection import train_test_split

digits = load_digits()
X_train, X_test, y_train, y_test = train_test_split(digits.data, digits.target,
                                                    train_size=0.75, test_size=0.25)

tpot = TPOTClassifier(generations=5, population_size=20, verbosity=2)
tpot.fit(X_train, y_train)
print(tpot.score(X_test, y_test))
tpot.export('tpot_mnist_pipeline.py')

Generation 1 - Current best internal CV score: 0.9891333737360423

Generation 2 - Current best internal CV score: 0.9891333737360423

Generation 3 - Current best internal CV score: 0.9891333737360423

Generation 4 - Current best internal CV score: 0.9891333737360423

Generation 5 - Current best internal CV score: 0.989584050069423

Best pipeline: GradientBoostingClassifier(KNeighborsClassifier(MultinomialNB(input_matrix, 0.72999999999999998), 27, 4), 0.14999999999999999, 0.87)
0.993186778345
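
The "internal CV score" reported for each generation is a k-fold cross-validation score, with the number of folds set by num_cv_folds. The following is a minimal pure-Python sketch of that idea, assuming contiguous, unshuffled folds. `k_fold_indices` and `cross_val_mean` are hypothetical helpers written for this illustration; TPOT builds on scikit-learn, whose cross-validation utilities would be used in practice rather than code like this.

```python
def k_fold_indices(n_samples, n_folds):
    """Yield (train_indices, test_indices) for each of n_folds contiguous folds."""
    if not 2 <= n_folds <= n_samples:
        raise ValueError("n_folds must be between 2 and n_samples")
    base, extra = divmod(n_samples, n_folds)
    start = 0
    for fold in range(n_folds):
        # The first `extra` folds get one additional sample each.
        size = base + (1 if fold < extra else 0)
        stop = start + size
        test = list(range(start, stop))
        train = list(range(0, start)) + list(range(stop, n_samples))
        yield train, test
        start = stop


def cross_val_mean(score_fold, n_samples, n_folds):
    """Average score_fold(train_indices, test_indices) over all folds."""
    scores = [score_fold(train, test)
              for train, test in k_fold_indices(n_samples, n_folds)]
    return sum(scores) / len(scores)


if __name__ == "__main__":
    folds = list(k_fold_indices(10, 3))
    print([len(test) for _, test in folds])  # [4, 3, 3]
```

Each sample lands in exactly one test fold, so every pipeline is scored on data it was not fitted on within that fold.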


In [None]:
#Output - 'tpot_mnist_pipeline.py'
import numpy as np

from sklearn.ensemble import GradientBoostingClassifier, VotingClassifier
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import MultinomialNB
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline, make_union
from sklearn.preprocessing import FunctionTransformer

# NOTE: Make sure that the class is labeled 'class' in the data file
tpot_data = np.recfromcsv('PATH/TO/DATA/FILE', delimiter='COLUMN_SEPARATOR', dtype=np.float64)
features = np.delete(tpot_data.view(np.float64).reshape(tpot_data.size, -1), tpot_data.dtype.names.index('class'), axis=1)
training_features, testing_features, training_classes, testing_classes = \
    train_test_split(features, tpot_data['class'], random_state=42)

exported_pipeline = make_pipeline(
    make_union(VotingClassifier([("est", MultinomialNB(alpha=0.73, fit_prior=True))]), FunctionTransformer(lambda X: X)),
    make_union(VotingClassifier([("est", KNeighborsClassifier(n_neighbors=5, weights="uniform"))]), FunctionTransformer(lambda X: X)),
    GradientBoostingClassifier(learning_rate=0.15, max_features=0.15, n_estimators=500)
)

exported_pipeline.fit(training_features, training_classes)
results = exported_pipeline.predict(testing_features)
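
In the exported pipeline, each `make_union(VotingClassifier(...), FunctionTransformer(lambda X: X))` step appears to act as a stacking step: the wrapped estimator's predictions are placed next to the unchanged input features, and the combined matrix is passed to the next step. The sketch below shows that idea in plain Python. `stack_predictions` and `threshold_rule` are hypothetical stand-ins for illustration, not TPOT or scikit-learn code.

```python
def stack_predictions(rows, predict):
    """Return new rows: the estimator's prediction followed by the original features."""
    return [[predict(row)] + list(row) for row in rows]


def threshold_rule(row):
    """Toy stand-in for a fitted estimator: predict 1 when the first feature exceeds 0.5."""
    return 1 if row[0] > 0.5 else 0


if __name__ == "__main__":
    X = [[0.2, 3.0], [0.9, 1.0]]
    print(stack_predictions(X, threshold_rule))  # [[0, 0.2, 3.0], [1, 0.9, 1.0]]
```

The final estimator in the chain then sees both the raw features and the earlier estimators' guesses.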


# Example using sklearn Iris dataset

In [2]:
from tpot import TPOTClassifier
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
import numpy as np

iris = load_iris()
X_train, X_test, y_train, y_test = train_test_split(iris.data.astype(np.float64),
    iris.target.astype(np.float64), train_size=0.75, test_size=0.25)

tpot = TPOTClassifier(generations=5, population_size=20, verbosity=2)
tpot.fit(X_train, y_train)
print(tpot.score(X_test, y_test))
tpot.export('tpot_iris_pipeline.py')

Generation 1 - Current best internal CV score: 0.9651554001554001

Generation 2 - Current best internal CV score: 0.9714154364154365

Generation 3 - Current best internal CV score: 0.9791945091945092

Generation 4 - Current best internal CV score: 0.9791945091945092

Generation 5 - Current best internal CV score: 0.9791945091945092

Best pipeline: KNeighborsClassifier(Normalizer(input_matrix, 14), 22, 32)
0.943677849928


In [None]:
#Output - 'tpot_iris_pipeline.py'
import numpy as np

from sklearn.ensemble import VotingClassifier
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline, make_union
from sklearn.preprocessing import FunctionTransformer, Normalizer

# NOTE: Make sure that the class is labeled 'class' in the data file
tpot_data = np.recfromcsv('PATH/TO/DATA/FILE', delimiter='COLUMN_SEPARATOR', dtype=np.float64)
features = np.delete(tpot_data.view(np.float64).reshape(tpot_data.size, -1), tpot_data.dtype.names.index('class'), axis=1)
training_features, testing_features, training_classes, testing_classes = \
    train_test_split(features, tpot_data['class'], random_state=42)

exported_pipeline = make_pipeline(
    Normalizer(norm="max"),
    KNeighborsClassifier(n_neighbors=5, weights="uniform")
)

exported_pipeline.fit(training_features, training_classes)
results = exported_pipeline.predict(testing_features)
