## TPOT Automated Machine Learning in Python

This notebook records my experiments with TPOT, an automated machine learning (AutoML) tool for Python.

1. [Load dataset](#1)

2. [Splitting the Data into Training and Test Sets](#2)

3. [Implementing TPOT](#3)

## <a name="1">Load dataset</a>

In [1]:
from sklearn.datasets import load_iris

iris=load_iris()
iris.data[0:5],iris.target

(array([[5.1, 3.5, 1.4, 0.2],
        [4.9, 3. , 1.4, 0.2],
        [4.7, 3.2, 1.3, 0.2],
        [4.6, 3.1, 1.5, 0.2],
        [5. , 3.6, 1.4, 0.2]]),
 array([0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
        0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
        0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
        1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
        1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2,
        2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2,
        2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2]))

## <a name="2">Splitting the Data into Training and Test Sets</a>

In [2]:
from sklearn.model_selection import train_test_split

# Splitting data into training and test set
X_train,X_test,y_train,y_test=train_test_split(iris.data,iris.target,test_size=0.25,random_state=0)

X_train.shape,X_test.shape,y_train.shape,y_test.shape

((112, 4), (38, 4), (112,), (38,))
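As a side check, it can help to see how many samples of each class ended up in the training set. The sketch below repeats the split above and compares it with a stratified variant. The `stratify` argument is an optional extra that the rest of this notebook does not use, and the variable names are only for illustration.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split

iris = load_iris()

# Same split as above: 25% held out, fixed seed.
X_train, X_test, y_train, y_test = train_test_split(
    iris.data, iris.target, test_size=0.25, random_state=0
)
plain_counts = np.bincount(y_train)

# Variant (not used elsewhere in this notebook): keep class proportions
# roughly equal in the training and test sets.
_, _, y_train_strat, _ = train_test_split(
    iris.data, iris.target, test_size=0.25, random_state=0,
    stratify=iris.target,
)
strat_counts = np.bincount(y_train_strat)

print("Plain split, training samples per class:     ", plain_counts)
print("Stratified split, training samples per class:", strat_counts)
```

With only 150 samples, an unstratified split can leave the classes slightly unbalanced, which is worth keeping in mind when reading the test accuracy later on.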

## <a name="3">Implementing TPOT</a>

TPOT (Tree-Based Pipeline Optimization Tool) uses genetic programming to search for the best-performing ML pipelines. It is built on top of scikit-learn.

Once our dataset is cleaned and ready to be used, TPOT will help us with the following steps of our ML pipeline:
- Feature preprocessing
- Feature construction and selection
- Model selection
- Hyper-parameter optimization
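To make these steps concrete, here is a rough sketch of what doing them by hand with plain scikit-learn might look like. The choice of scaler, feature selector, model and parameter grid is my own assumption for illustration and is not what TPOT does internally.

```python
from sklearn.datasets import load_iris
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import LinearSVC

iris = load_iris()
X_train, X_test, y_train, y_test = train_test_split(
    iris.data, iris.target, test_size=0.25, random_state=0
)

# Each stage is picked by hand here; TPOT searches over such choices.
manual_pipeline = Pipeline([
    ("scale", StandardScaler()),         # feature preprocessing
    ("select", SelectKBest(f_classif)),  # feature selection
    ("clf", LinearSVC(dual=False)),      # model, fixed in advance
])

# Hyper-parameter optimization over a small, hand-written grid.
param_grid = {
    "select__k": [2, 3, 4],
    "clf__C": [0.1, 1.0, 10.0],
}
search = GridSearchCV(manual_pipeline, param_grid, cv=5)
search.fit(X_train, y_train)

manual_accuracy = search.score(X_test, y_test)
print("Best parameters:", search.best_params_)
print("Test accuracy: {:.3f}".format(manual_accuracy))
```

The grid search only tunes the parameters of a pipeline whose structure is fixed in advance, whereas TPOT also varies which steps and models appear in the pipeline.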

In [6]:
from tpot import TPOTClassifier

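# Evolve pipelines for 8 generations with a population of 50 candidates;
# verbosity=2 prints the best internal CV score after each generation.
tpot = TPOTClassifier(generations=8, population_size=50, verbosity=2)
tpot.fit(X_train, y_train)

# Score the best pipeline found on the held-out test set.
print("Accuracy is {}%".format(tpot.score(X_test, y_test) * 100))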

Generation 1 - Current best internal CV score: 0.9731225296442687
Generation 2 - Current best internal CV score: 0.9731225296442687
Generation 3 - Current best internal CV score: 0.9731225296442687
Generation 4 - Current best internal CV score: 0.9731225296442687
Generation 5 - Current best internal CV score: 0.9735177865612649
Generation 6 - Current best internal CV score: 0.9735177865612649
Generation 7 - Current best internal CV score: 0.9822134387351777
Generation 8 - Current best internal CV score: 0.9822134387351777

Best pipeline: LinearSVC(input_matrix, C=10.0, dual=False, loss=squared_hinge, penalty=l2, tol=0.01)
Accuracy is 94.73684210526315%
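The reported best pipeline is a single scikit-learn estimator, so it can be rebuilt by hand without TPOT. The sketch below copies the hyper-parameters from the output above and evaluates the model with 5-fold cross-validation on the training set and on the held-out test set. I am assuming 5 folds roughly matches TPOT's internal CV, and the exact numbers may differ from the TPOT run.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import cross_val_score, train_test_split
from sklearn.svm import LinearSVC

iris = load_iris()
X_train, X_test, y_train, y_test = train_test_split(
    iris.data, iris.target, test_size=0.25, random_state=0
)

# Hyper-parameters copied from the "Best pipeline" line above.
best_model = LinearSVC(
    C=10.0, dual=False, loss="squared_hinge", penalty="l2", tol=0.01
)

# Cross-validated accuracy on the training data only.
cv_scores = cross_val_score(best_model, X_train, y_train, cv=5)
print("Mean CV accuracy: {:.4f}".format(cv_scores.mean()))

# Refit on the full training set and score on the held-out test set.
best_model.fit(X_train, y_train)
test_accuracy = best_model.score(X_test, y_test)
print("Test accuracy: {:.4f}".format(test_accuracy))
```

Comparing the cross-validated score with the test score gives a quick sense of how well the internal CV score predicts performance on unseen data.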


Finally, we are going to export this pipeline:

In [None]:
tpot.export("tpot_pipeline.py")

Because genetic programming is stochastic, the best pipeline can differ from run to run. Passing a fixed random_state to TPOTClassifier makes a run repeatable.
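To illustrate why results vary between runs, here is a toy evolutionary search that tunes only the C parameter of a LinearSVC. This is a deliberately simplified stand-in and not TPOT's algorithm: the helper names `fitness` and `toy_evolution`, the mutation rule and the population settings are all hypothetical choices of mine.

```python
import random

from sklearn.datasets import load_iris
from sklearn.model_selection import cross_val_score, train_test_split
from sklearn.svm import LinearSVC

iris = load_iris()
X_train, _, y_train, _ = train_test_split(
    iris.data, iris.target, test_size=0.25, random_state=0
)


def fitness(c_value):
    """Mean 5-fold CV accuracy of a LinearSVC with the given C."""
    model = LinearSVC(C=c_value, dual=False)
    return cross_val_score(model, X_train, y_train, cv=5).mean()


def toy_evolution(seed, generations=3, population_size=6):
    """Hypothetical helper: evolve C by selection and random mutation."""
    rng = random.Random(seed)
    # Random initial population of C values between 1e-3 and 1e2.
    population = [10 ** rng.uniform(-3, 2) for _ in range(population_size)]
    for _ in range(generations):
        # Keep the better half, then refill with mutated copies of it.
        ranked = sorted(population, key=fitness, reverse=True)
        survivors = ranked[: population_size // 2]
        children = [c * 10 ** rng.gauss(0, 0.3) for c in survivors]
        population = survivors + children
    return max(population, key=fitness)


run_a = toy_evolution(seed=1)
run_b = toy_evolution(seed=2)
run_a_again = toy_evolution(seed=1)

print("Seed 1, best C:      ", run_a)
print("Seed 2, best C:      ", run_b)
print("Seed 1 again, best C:", run_a_again)
```

The initial population and every mutation come from the random number generator, so a different seed explores different candidates, while repeating the same seed reproduces the same search.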