# TPOT Classification Tutorial

In this tutorial we’ll explore how to use TPOT to automatically tune ML models to our data. We’ll focus on classification models. TPOT also supports regression; for details, see the documentation at https://epistasislab.github.io/tpot/.

First we’ll import the necessary packages as well as the data we want to work with. Please note that the notebook below is a Python 3.x notebook, so basic familiarity with Python is assumed.


In [None]:
##############################################
##############   SETTINGS   ##################

DATASET_NAME = "data.csv"
TARGET_VARIABLE = "REPEATER"
TEST_SIZE = 0.2

GENERATIONS = 10
POPULATION = 10
CROSSVALIDATION_SPLIT = 3

##############################################
##############################################

In [None]:
# <--- Press play for step one:
# Installs the TPOT AutoML library as well as its dependencies
!pip install -q deap update_checker tqdm stopit tpot

The code below will download our sample dataset. To use your own dataset instead, upload a CSV file and make sure DATASET_NAME in the settings cell matches its filename (the default there is "data.csv"). If you upload your own dataset, you don't need to run the code below.

In [None]:
# <--- Press play:
# Downloads the sample dataset.
# If you want to upload your own dataset, please use the upload
# folder on the left-hand side of Google Colab. For simplicity upload
# your data as a csv file.
!wget [EXCLUDED FOR DATA PRIVACY REASONS]

We now have to separate our target variable from the rest of the dataset. Our target variable is the one we want our model to predict - in our case, classify. We’ll use the X, y notation that is conventional in the ML community. X is the matrix of predictor variables, while y is the target variable. The outputs of our model will be ŷ (y hat).

With our data separated, we can use the sklearn library to easily split our data into training and test sets. The train_test_split function takes X and y and splits them into 4 sets: X_train, X_test, y_train, y_test. The purpose of the test set is to later assess our model on data it has not seen beforehand, which gives a more realistic measure of how well the model generalises to unseen data.


In [None]:
# <--- Press play:

# Imports the modules needed
from tpot import TPOTClassifier
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Read the CSV file named in DATASET_NAME. In the case of uploading
# an excel file, it is possible to read the data by using
# pd.read_excel("dataset.xlsx")
X = pd.read_csv(DATASET_NAME)
# Keep only numeric columns (the target column must be numeric too,
# otherwise it is dropped here) and remove rows with missing values
X = X.select_dtypes([np.number])
X.dropna(inplace=True)

# Uses the TARGET_VARIABLE column of the data as the target variable.
# Change TARGET_VARIABLE in the settings cell to match the name of
# your own target column.
y = X[TARGET_VARIABLE]
X.drop(TARGET_VARIABLE, axis=1, inplace=True)

# Splits the data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=TEST_SIZE)
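
For classification it can help to keep the class proportions the same in the training and test sets. The following is a minimal sketch on synthetic data (the data frame and the column names "feature" and "label" are made up for illustration) that uses the stratify argument of train_test_split; random_state=42 is an arbitrary seed and not part of the tutorial's settings.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Synthetic, imbalanced data: 80 rows of class 0 and 20 rows of class 1.
demo = pd.DataFrame({
    "feature": np.arange(100),
    "label": [0] * 80 + [1] * 20,
})
X_demo = demo.drop(columns="label")
y_demo = demo["label"]

# stratify=y_demo keeps the 80/20 class ratio in both splits.
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, stratify=y_demo, random_state=42
)

# Class shares in each split
print(y_tr.value_counts(normalize=True).to_dict())
print(y_te.value_counts(normalize=True).to_dict())
```

In the tutorial's own split, the same effect could be obtained by passing stratify=y, assuming every class has at least two rows.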

To initialize TPOT we need to define a number of parameters. For a full and detailed overview of all parameters please see the documentation at https://epistasislab.github.io/tpot/api/#classification.  

Parameters of interest:  
- generations: defines the number of generations (iterations) to run the search process.  
- population_size: defines the number of individuals (pipelines) in each generation.  
- scoring: the objective function. It is not set below, so TPOTClassifier uses its default, accuracy.  
- cv: how many folds to use for cross-validation.  
- n_jobs: how many cores of your CPU to use. -1 will use all cores, -2 all but one core.  
- random_state: seed for reproducibility. It is not set below, so results will vary between runs.  
- verbosity: defines how much TPOT prints during the search.  
- periodic_checkpoint_folder: folder in which TPOT periodically saves the Pareto front pipelines found so far.  

In [None]:
# <--- Press play:

# Initializes tpot
tpot = TPOTClassifier(generations=GENERATIONS, population_size=POPULATION,
                      cv=CROSSVALIDATION_SPLIT, verbosity=2, n_jobs=-1,
                      periodic_checkpoint_folder='/content/results')
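
The cell above leaves scoring and random_state at their defaults. As a hedged sketch of how they could be set, the settings dict below adds both; the chosen values ("balanced_accuracy" and the seed 42) and the helper build_classifier are assumptions for illustration, not part of the original notebook. The import sits inside the helper so the settings can be inspected before TPOT is installed.

```python
# Hypothetical settings for illustration; adjust to your own data.
tpot_settings = {
    "generations": 10,
    "population_size": 10,
    "cv": 3,
    "scoring": "balanced_accuracy",  # instead of the default, accuracy
    "random_state": 42,              # arbitrary seed for reproducible runs
    "verbosity": 2,
    "n_jobs": -1,
}


def build_classifier(settings):
    """Create a TPOTClassifier from a settings dict (requires tpot)."""
    from tpot import TPOTClassifier  # imported lazily, only when called
    return TPOTClassifier(**settings)


# Usage, once tpot is installed:
# tpot = build_classifier(tpot_settings)
```

Balanced accuracy is one reasonable choice when the classes are imbalanced; plain accuracy can look good simply by predicting the majority class.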

In the section above we only defined the parameters of TPOT; we now call fit on it and pass it the training data. This starts the search over models and hyperparameters, and how long it runs depends on the parameters set above. For smaller datasets, a couple of hours of search can already find well-performing models. Still, it is common in the ML community to run AutoML for a couple of days or even longer, depending on the data we are working with.

In [None]:
# <--- Press play:
# Starts the search for models (will take some time)
tpot.fit(X_train, y_train)

Please note that for the purpose of this tutorial we only ran TPOT for roughly 5 minutes. As mentioned above, a typical search can take anywhere from 2-3 hours up to 48-72 hours, and longer still for large, complex datasets. To increase the search time, increase the number of generations; a larger population size also increases both the search space and the run time. TPOT automatically prints the best pipeline and its parameters at the end of a run.
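
To get a feel for how these settings translate into work: the TPOT documentation states that a run evaluates population_size + generations × offspring_size pipelines in total, where offspring_size defaults to population_size. The sketch below applies that formula to the settings used in this tutorial. The helper name pipelines_evaluated is made up for illustration, and the number of model fits is only a rough estimate that assumes every pipeline is scored once per cross-validation fold.

```python
def pipelines_evaluated(generations, population_size, offspring_size=None):
    """Total pipelines a TPOT run evaluates, per the formula in the TPOT docs."""
    if offspring_size is None:
        # TPOT's default: as many offspring as individuals in the population
        offspring_size = population_size
    return population_size + generations * offspring_size


# Settings from the top of this notebook
n_pipelines = pipelines_evaluated(generations=10, population_size=10)
approx_fits = n_pipelines * 3  # 3 = CROSSVALIDATION_SPLIT

print(f"Pipelines evaluated: {n_pipelines}")
print(f"Approximate model fits: {approx_fits}")
```

Doubling either generations or population size roughly doubles the work, which is why long searches are usually reserved for the final run.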

Once the training process has ended, the fitted TPOT object itself can be used as a model to make predictions. We can use it to predict on the test data we split off initially and compute a few metrics to see how well it performs on unseen data.

In [None]:
# <--- Press play:

import sklearn.metrics as skm

y_pred = tpot.predict(X_test)

print("RESULTS OF BEST MODEL:\n")
print(f"Accuracy:             {skm.accuracy_score(y_test, y_pred)}")
print(f"Balanced Accuracy:    {skm.balanced_accuracy_score(y_test, y_pred)}")
print(f"F1 Score:             {skm.f1_score(y_test, y_pred, average='weighted')}")

In [None]:
skm.confusion_matrix(y_test, y_pred)
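
The metrics above can be read directly off a confusion matrix. As a minimal sketch with a made-up 2×2 matrix (rows are true classes and columns are predicted classes, following scikit-learn's convention), accuracy is the share of samples on the diagonal, and balanced accuracy is the mean of the per-class recalls:

```python
import numpy as np

# Made-up confusion matrix: 90 samples of class 0, 10 samples of class 1.
cm = np.array([[80, 10],
               [5, 5]])

# Accuracy: correctly classified samples divided by all samples.
acc = np.trace(cm) / cm.sum()

# Recall per class: diagonal entry divided by the row total (true count).
recall_per_class = np.diag(cm) / cm.sum(axis=1)

# Balanced accuracy: unweighted mean of the per-class recalls.
bal_acc = recall_per_class.mean()

print(f"Accuracy:          {acc:.4f}")
print(f"Recall per class:  {recall_per_class}")
print(f"Balanced accuracy: {bal_acc:.4f}")
```

Here accuracy is dominated by the majority class, while balanced accuracy exposes the weak recall on the minority class.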

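The code below summarises the pipelines on the Pareto front found during the search, together with their cross-validation scores. To export the best pipeline as a Python script, TPOT also provides the tpot.export() method.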


In [None]:
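# All pipelines evaluated during the search, keyed by pipeline string
individuals = tpot.evaluated_individuals_
# Fitted pipelines on the Pareto front, keyed by the same strings.
# Note: per the TPOT docs, this attribute may only be populated
# when verbosity=3.
pareto_indiv = tpot.pareto_front_fitted_pipelines_

accuracy = []
model_name = []
model_params = []

for model in pareto_indiv:
  # Look up the CV score of this pipeline (not always the first one)
  accuracy.append(individuals[model]['internal_cv_score'])
  model_name.append(pareto_indiv[model].steps[-1][0])
  model_params.append(pareto_indiv[model].steps[-1])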

print(f"Total models evaluated: {len(individuals)}")

df = pd.DataFrame(list(zip(accuracy, model_name, model_params)), columns=['CV_Accuracy', 'Model', 'Model_Params'])

df