# AutoML and TPOT
**OPIM 5512: Data Science Using Python - University of Connecticut**

---------------------------------
Someone wrote code that writes your code for you... this is the future...

With AutoML, you can spend more time bringing diverse datasets together and less time 'tuning' (turning knobs).



# Intro to TPOT

<center>

![tpot logo](https://raw.githubusercontent.com/EpistasisLab/tpot/master/images/tpot-logo.jpg)

</center>

Consider TPOT your Data Science Assistant! TPOT is a Python Automated Machine Learning tool that **optimizes machine learning pipelines using genetic programming**.

### What does TPOT do?

It performs an intelligent search over machine learning pipelines that can contain supervised classification and regression models, preprocessors, feature selection techniques, and any other estimator or transformer that follows the scikit-learn API. `TPOTClassifier` and `TPOTRegressor` will also search over the hyperparameters of all objects in the pipeline.

![what does TPOT do?](https://raw.githubusercontent.com/EpistasisLab/tpot/master/images/tpot-ml-pipeline.png)
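
To make the idea concrete, here is a minimal sketch of the kind of object TPOT searches over: a scikit-learn pipeline with a preprocessor, a feature selector, and a model, each with its own hyperparameters. It uses a plain grid search rather than genetic programming, and the step names, grid values, and the bundled `load_breast_cancer` data are assumptions made for illustration, not something TPOT produced.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# sklearn's bundled copy of the breast cancer data (assumption: used here
# only so the sketch runs without downloading anything)
X, y = load_breast_cancer(return_X_y=True)

# a pipeline = preprocessor -> feature selection -> model
pipe = Pipeline([
    ("scale", StandardScaler()),
    ("select", SelectKBest(score_func=f_classif)),
    ("model", LogisticRegression(max_iter=1000)),
])

# hyperparameters of every step are addressed as <step name>__<parameter>
param_grid = {
    "select__k": [5, 10, 20],
    "model__C": [0.1, 1.0, 10.0],
}

# a tiny, exhaustive search; TPOT explores a far larger space and also
# changes which steps are in the pipeline
search = GridSearchCV(pipe, param_grid, cv=5, scoring="accuracy")
search.fit(X, y)

print(search.best_params_)
print("Best CV accuracy: %.3f" % search.best_score_)
```

A grid search only tunes the knobs of one fixed pipeline. The point of TPOT is that the structure of the pipeline is part of the search too.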

### Example of an optimized pipeline from TPOT

This is an example of an 'optimal' pipeline derived from TPOT.

![example pipeline](https://raw.githubusercontent.com/EpistasisLab/tpot/master/images/tpot-pipeline-example.png)

### Just how easy is it?

Only a few lines of code... this is the future of machine learning!

![Example of TPOT on MNIST](https://raw.githubusercontent.com/EpistasisLab/tpot/master/images/tpot-demo.gif)

# Import Modules / Install TPOT

In [None]:
# relevant packages
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt


In [None]:
# if it hasn't been installed already... this is how you install a new module in Colab
!pip install tpot

Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/


# Read Classification Data
We will use the Breast Cancer dataset for demonstration purposes.

In [None]:
# let's use gdown to get the data instead of mounting the drive
# https://drive.google.com/file/d/1UwCOmgdOwvpMd58lVlwqUL3w1IRaYJa-/view?usp=sharing
!gdown --id 1UwCOmgdOwvpMd58lVlwqUL3w1IRaYJa-

Downloading...
From: https://drive.google.com/uc?id=1UwCOmgdOwvpMd58lVlwqUL3w1IRaYJa-
To: /content/breastcancer.csv
100% 125k/125k [00:00<00:00, 55.8MB/s]


In [None]:
df = pd.read_csv('breastcancer.csv')
df.head()

Unnamed: 0,id,diagnosis,radius_mean,texture_mean,perimeter_mean,area_mean,smoothness_mean,compactness_mean,concavity_mean,concave points_mean,...,texture_worst,perimeter_worst,area_worst,smoothness_worst,compactness_worst,concavity_worst,concave points_worst,symmetry_worst,fractal_dimension_worst,Unnamed: 32
0,842302,M,17.99,10.38,122.8,1001.0,0.1184,0.2776,0.3001,0.1471,...,17.33,184.6,2019.0,0.1622,0.6656,0.7119,0.2654,0.4601,0.1189,
1,842517,M,20.57,17.77,132.9,1326.0,0.08474,0.07864,0.0869,0.07017,...,23.41,158.8,1956.0,0.1238,0.1866,0.2416,0.186,0.275,0.08902,
2,84300903,M,19.69,21.25,130.0,1203.0,0.1096,0.1599,0.1974,0.1279,...,25.53,152.5,1709.0,0.1444,0.4245,0.4504,0.243,0.3613,0.08758,
3,84348301,M,11.42,20.38,77.58,386.1,0.1425,0.2839,0.2414,0.1052,...,26.5,98.87,567.7,0.2098,0.8663,0.6869,0.2575,0.6638,0.173,
4,84358402,M,20.29,14.34,135.1,1297.0,0.1003,0.1328,0.198,0.1043,...,16.67,152.2,1575.0,0.1374,0.205,0.4,0.1625,0.2364,0.07678,


The target variable will be `diagnosis`. Let's drop that last unnamed column while we are here. And since `id` doesn't have predictive power, let's drop that too.

In [None]:
df.drop('Unnamed: 32', axis=1, inplace=True)
df.drop('id', axis=1, inplace=True)
df.columns # voila - it's gone!

Index(['diagnosis', 'radius_mean', 'texture_mean', 'perimeter_mean',
       'area_mean', 'smoothness_mean', 'compactness_mean', 'concavity_mean',
       'concave points_mean', 'symmetry_mean', 'fractal_dimension_mean',
       'radius_se', 'texture_se', 'perimeter_se', 'area_se', 'smoothness_se',
       'compactness_se', 'concavity_se', 'concave points_se', 'symmetry_se',
       'fractal_dimension_se', 'radius_worst', 'texture_worst',
       'perimeter_worst', 'area_worst', 'smoothness_worst',
       'compactness_worst', 'concavity_worst', 'concave points_worst',
       'symmetry_worst', 'fractal_dimension_worst'],
      dtype='object')

In [None]:
df.info() # check for any missing values - all looks good!

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 569 entries, 0 to 568
Data columns (total 31 columns):
 #   Column                   Non-Null Count  Dtype  
---  ------                   --------------  -----  
 0   diagnosis                569 non-null    object 
 1   radius_mean              569 non-null    float64
 2   texture_mean             569 non-null    float64
 3   perimeter_mean           569 non-null    float64
 4   area_mean                569 non-null    float64
 5   smoothness_mean          569 non-null    float64
 6   compactness_mean         569 non-null    float64
 7   concavity_mean           569 non-null    float64
 8   concave points_mean      569 non-null    float64
 9   symmetry_mean            569 non-null    float64
 10  fractal_dimension_mean   569 non-null    float64
 11  radius_se                569 non-null    float64
 12  texture_se               569 non-null    float64
 13  perimeter_se             569 non-null    float64
 14  area_se                  5

If you look at the unique values in `diagnosis`, you will see **M** for malignant and **B** for benign.



In [None]:
from collections import Counter
Counter(df['diagnosis'])

Counter({'M': 212, 'B': 357})

Our data is imbalanced, but we will ignore this for now. Later on, we can use SMOTE with an imblearn Pipeline (which is different from an sklearn pipeline - be careful!).

To avoid problems with string labels later on (for example, in a logistic regression), let's convert them to integers with `LabelEncoder()` from `sklearn`.

In [None]:
from sklearn.preprocessing import LabelEncoder
LE = LabelEncoder()
df['diagnosis'] = LE.fit_transform(df['diagnosis'])
Counter(df['diagnosis'])

Counter({1: 212, 0: 357})

As you can see, B is 0 and M is 1. You could use SMOTE now before all of your pipelines (if you wanted to use it for everything). But for now, we simply ignore the class imbalance.
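
If you ever forget which integer goes with which label, you can ask the encoder directly. This is a small sketch on a made-up list of labels (the list and the variable names are assumptions for illustration); `LabelEncoder` sorts the classes, which is why B comes before M.

```python
from sklearn.preprocessing import LabelEncoder

# a tiny made-up label column, just for illustration
labels = ["M", "B", "B", "M", "B"]

encoder = LabelEncoder()
encoded = encoder.fit_transform(labels)

# classes_ is sorted, and the position of each class is its integer code
mapping = {label: int(code) for code, label in enumerate(encoder.classes_)}
print(mapping)

# inverse_transform goes from integer codes back to the original labels
print(encoder.inverse_transform([0, 1]))
```

Keeping the fitted encoder around is handy when you want to report predictions as M/B instead of 1/0.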

In [None]:
# Split-out validation df
X = df.drop('diagnosis', axis=1) #covariates - just drop the target!
y = df['diagnosis'] #target variable
validation_size = 0.20
seed = 123 # so you will split the same way and evaluate the SAME dataset

# split!
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y,
                                                    test_size=validation_size,
                                                    random_state=seed)

# TPOT for Classification (Breast Cancer)
Warning: this will take quite a bit of time!

In [None]:
from tpot import TPOTClassifier
import time

# Construct and fit TPOT classifier
start_time = time.time()
tpot = TPOTClassifier(generations=5, max_time_mins = 3, verbosity=2)
tpot.fit(X_train, y_train)
end_time = time.time()

# Results
print('TPOT classifier finished in %s seconds' % (end_time - start_time))
print('Best pipeline test accuracy: %.3f' % tpot.score(X_test, y_test))

# Save best pipeline as Python script file
# make sure you update this path
tpot.export('tpot_breastcancer_pipeline.py') # saves the script to the Colab file browser on the left

Optimization Progress:   0%|          | 0/100 [00:00<?, ?pipeline/s]


3.06 minutes have elapsed. TPOT will close down.
TPOT closed during evaluation in one generation.


TPOT closed prematurely. Will use the current best pipeline.

Best pipeline: GradientBoostingClassifier(input_matrix, learning_rate=0.1, max_depth=10, max_features=0.7000000000000001, min_samples_leaf=18, min_samples_split=12, n_estimators=100, subsample=0.8)
TPOT classifier finished in 184.41928005218506 seconds
Best pipeline test accuracy: 0.974


Depending on randomness, you may see **STACKED ESTIMATORS** with models going into other models... AWESOME!!!
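
What does "models going into other models" mean in practice? Roughly, one model's predictions get added to the data as an extra column, and the next model learns from the original features plus that column. Below is a hedged, sklearn-only sketch of that idea. `PredictionAsFeature` is a hypothetical helper written for this example, not TPOT's own implementation, and the choice of models and the bundled `load_breast_cancer` data are assumptions.

```python
import numpy as np
from sklearn.base import BaseEstimator, TransformerMixin, clone
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler


class PredictionAsFeature(TransformerMixin, BaseEstimator):
    """Hypothetical helper: append an estimator's predictions as a new column."""

    def __init__(self, estimator):
        self.estimator = estimator

    def fit(self, X, y=None):
        # fit a fresh copy so the estimator passed in is left untouched
        self.estimator_ = clone(self.estimator).fit(X, y)
        return self

    def transform(self, X):
        X = np.asarray(X)
        preds = self.estimator_.predict(X).reshape(-1, 1)
        # original features + one extra column holding the predictions
        return np.hstack([X, preds])


X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.20, random_state=123)

# gradient boosting feeds its predictions into a logistic regression
stacked = make_pipeline(
    StandardScaler(),
    PredictionAsFeature(GradientBoostingClassifier(random_state=0)),
    LogisticRegression(max_iter=1000),
)
stacked.fit(X_train, y_train)
stacked_accuracy = stacked.score(X_test, y_test)
print('Stacked pipeline test accuracy: %.3f' % stacked_accuracy)
```

One caveat with this simple version: the inner model predicts on the same rows it was trained on, so the extra column looks better during training than it will on new data.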

## Run the Model
Note: your pipeline may be a little different due to randomness... but you will still get a GREAT pipeline!

In [None]:
# go look at the resulting pipeline, on the left...
# copy and paste it in
# update paths, evaluate it!
from sklearn.metrics import accuracy_score

In [None]:
import numpy as np
import pandas as pd
from sklearn.feature_selection import SelectFwe, f_classif
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import RobustScaler

# NOTE: Make sure that the outcome column is labeled 'target' in the data file
# tpot_data = pd.read_csv('PATH/TO/DATA/FILE', sep='COLUMN_SEPARATOR', dtype=np.float64)
tpot_data = df # the clean data set

# RENAME THE TARGET VARIABLE!
# you need to add this line of code...
# this is not included in the TPOT output!
tpot_data.rename(columns={'diagnosis' : 'target'}, inplace=True)

features = tpot_data.drop('target', axis=1)
training_features, testing_features, training_target, testing_target = \
            train_test_split(features, tpot_data['target'], random_state=None)

# Average CV score on the training set was: 0.9824175824175825
exported_pipeline = make_pipeline(
    RobustScaler(),
    SelectFwe(score_func=f_classif, alpha=0.002),
    LogisticRegression(C=1.0, dual=False, penalty="l2")
)

exported_pipeline.fit(training_features, training_target)
results = exported_pipeline.predict(testing_features)

# you need to add this line of code to
# evaluate the accuracy - about 97% on this split
# (your number will vary because random_state=None)
accuracy_score(testing_target, results)

0.965034965034965
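
Accuracy is only one number, and with imbalanced classes it can hide which kind of mistake the model makes. Here is a short sketch of a fuller evaluation. It is self-contained, so it fits a simple stand-in pipeline on sklearn's bundled `load_breast_cancer` data rather than on the exported TPOT pipeline (both are assumptions for illustration). Be careful: in the bundled copy, malignant is coded 0 and benign is coded 1, the opposite of the `LabelEncoder` result above.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, confusion_matrix, recall_score
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import RobustScaler

X, y = load_breast_cancer(return_X_y=True)  # here 0 = malignant, 1 = benign
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.20, random_state=123, stratify=y)

# a simple stand-in for whatever pipeline TPOT exported
model = make_pipeline(RobustScaler(), LogisticRegression(max_iter=1000))
model.fit(X_train, y_train)
preds = model.predict(X_test)

# rows = actual class, columns = predicted class
cm = confusion_matrix(y_test, preds)
accuracy = accuracy_score(y_test, preds)

# share of malignant cases (label 0 in this copy) that the model caught
malignant_recall = recall_score(y_test, preds, pos_label=0)

print(cm)
print('Accuracy: %.3f' % accuracy)
print('Recall for malignant: %.3f' % malignant_recall)
```

To evaluate the TPOT pipeline itself, pass `testing_target` and `results` from the cell above to the same metric functions.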

# Read Regression Data

In [None]:
# let's use a regressor instead of a classifier
from tpot import TPOTRegressor

In [None]:
# Load dataset
# we will use Gdown to load our Boston Housing dataset
# https://drive.google.com/file/d/1a0aNGSFWB-pf5ut1NsjE5ECIsbHHoAwI/view?usp=sharing
!gdown --id 1a0aNGSFWB-pf5ut1NsjE5ECIsbHHoAwI

# look left! it downloaded a local copy of 'BostonHousing.csv'

Downloading...
From: https://drive.google.com/uc?id=1a0aNGSFWB-pf5ut1NsjE5ECIsbHHoAwI
To: /content/BostonHousing.csv
100% 35.2k/35.2k [00:00<00:00, 22.4MB/s]


In [None]:
df = pd.read_csv('BostonHousing.csv')
df.head()

Unnamed: 0,crim,zn,indus,chas,nox,rm,age,dis,rad,tax,ptratio,b,lstat,medv
0,0.00632,18.0,2.31,0,0.538,6.575,65.2,4.09,1,296,15.3,396.9,4.98,24.0
1,0.02731,0.0,7.07,0,0.469,6.421,78.9,4.9671,2,242,17.8,396.9,9.14,21.6
2,0.02729,0.0,7.07,0,0.469,7.185,61.1,4.9671,2,242,17.8,392.83,4.03,34.7
3,0.03237,0.0,2.18,0,0.458,6.998,45.8,6.0622,3,222,18.7,394.63,2.94,33.4
4,0.06905,0.0,2.18,0,0.458,7.147,54.2,6.0622,3,222,18.7,396.9,5.33,36.2


In [None]:
# Split-out validation df
X = df.drop('medv', axis=1) #covariates - just drop the target!
y = df['medv'] #target variable
validation_size = 0.20
seed = 123 # so you will split the same way and evaluate the SAME dataset

# split!
X_train, X_test, y_train, y_test = train_test_split(X, y,
                                                    test_size=validation_size,
                                                    random_state=seed)

# TPOT for Regression (Boston Housing)
The more generations you use, the longer it will take. Check the documentation to see all arguments you can use.

In [None]:
# Construct and fit TPOT regressor
start_time = time.time()
tpot = TPOTRegressor(generations=5, max_time_mins=3, verbosity=2, scoring='neg_mean_absolute_error')
tpot.fit(X_train, y_train)
end_time = time.time()

# Results
print('TPOT regressor finished in %s seconds' % (end_time - start_time))
print('Best pipeline test neg(MAE): %.3f' % tpot.score(X_test, y_test))

# Save best pipeline as Python script file
tpot.export('tpot_BostonRegressor_pipeline.py') # look left!

Optimization Progress:   0%|          | 0/100 [00:00<?, ?pipeline/s]


Generation 1 - Current best internal CV score: -2.2507526986937063

3.01 minutes have elapsed. TPOT will close down.
TPOT closed during evaluation in one generation.


TPOT closed prematurely. Will use the current best pipeline.

Best pipeline: LinearSVR(RandomForestRegressor(XGBRegressor(input_matrix, learning_rate=0.1, max_depth=9, min_child_weight=4, n_estimators=100, n_jobs=1, objective=reg:squarederror, subsample=0.5, verbosity=0), bootstrap=False, max_features=0.35000000000000003, min_samples_leaf=11, min_samples_split=17, n_estimators=100), C=25.0, dual=False, epsilon=0.001, loss=squared_epsilon_insensitive, tol=0.0001)
TPOT regressor finished in 181.90155959129333 seconds
Best pipeline test neg(MAE): -2.319
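
Why is the score negative? scikit-learn scorers follow a "higher is better" convention, so error metrics such as MAE are reported with the sign flipped. A minimal sketch on synthetic data shows the relationship (the data, the model, and the variable names are assumptions for illustration):

```python
from sklearn.datasets import make_regression
from sklearn.linear_model import LinearRegression
from sklearn.metrics import get_scorer, mean_absolute_error
from sklearn.model_selection import train_test_split

# synthetic regression data, just for illustration
X, y = make_regression(n_samples=200, n_features=5, noise=10.0, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.20, random_state=123)

model = LinearRegression().fit(X_train, y_train)

# the plain metric: lower is better
mae = mean_absolute_error(y_test, model.predict(X_test))

# the scorer with the same name used in scoring=...: higher is better
scorer = get_scorer('neg_mean_absolute_error')
neg_mae = scorer(model, X_test, y_test)

print('MAE:      %.3f' % mae)
print('neg(MAE): %.3f' % neg_mae)
```

So a neg(MAE) of -2.319 corresponds to an MAE of 2.319 in the units of the target.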


## Run The Model
Not sure what something in the pipeline is? Google it! There's so much functionality in `sklearn` that there are bound to be things you've never seen before...

In [None]:
from sklearn.metrics import mean_absolute_error

In [None]:
# this is from the .py file
import numpy as np
import pandas as pd
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.linear_model import LassoLarsCV
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline, make_union
from tpot.builtins import StackingEstimator, ZeroCount

# NOTE: Make sure that the outcome column is labeled 'target' in the data file
# make sure you update the paths!
tpot_data = df

# RENAME THE TARGET VARIABLE!
# this is not included in the TPOT output!
tpot_data.rename(columns={'medv' : 'target'}, inplace=True)

features = tpot_data.drop('target', axis=1)
training_features, testing_features, training_target, testing_target = \
            train_test_split(features, tpot_data['target'], random_state=None)

# Average CV score on the training set was: -2.0421724260530723
exported_pipeline = make_pipeline(
    ZeroCount(),
    StackingEstimator(estimator=GradientBoostingRegressor(alpha=0.75, learning_rate=0.1, loss="quantile", max_depth=6, max_features=0.6500000000000001, min_samples_leaf=4, min_samples_split=14, n_estimators=100, subsample=0.7500000000000001)),
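    # if a newer scikit-learn raises a TypeError here, remove the
    # `normalize` argument (it was dropped from LassoLarsCV)
    LassoLarsCV(normalize=False)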
)

exported_pipeline.fit(training_features, training_target)
results = exported_pipeline.predict(testing_features)
print(mean_absolute_error(testing_target, results))

# results is just a vector - you can make scatterplots and calculate error metrics.

You can do all of your typical analysis of the results... like this...

In [None]:
# make a scatterplot
plt.scatter(x=testing_target, y=results)
plt.xlabel('Actual')
plt.ylabel('Predicted')
plt.title('Actual vs. Predicted: Regression')
plt.show()
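
Beyond the scatterplot, a few error metrics summarize how far off the predictions are. This sketch uses the first five `medv` values from the table above as the actuals and a set of made-up predictions (the predicted values are invented purely for illustration). In your own notebook you would pass `testing_target` and `results` instead.

```python
import numpy as np
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score

# actuals: first five medv values shown earlier
actual = np.array([24.0, 21.6, 34.7, 33.4, 36.2])
# predictions: MADE-UP numbers for illustration only
predicted = np.array([25.1, 20.9, 33.0, 34.8, 35.0])

mae = mean_absolute_error(actual, predicted)
# RMSE penalizes large misses more heavily than MAE does
rmse = np.sqrt(mean_squared_error(actual, predicted))
# R^2: share of the variance in the actuals explained by the predictions
r2 = r2_score(actual, predicted)

# residuals are useful for spotting systematic over- or under-prediction
residuals = actual - predicted

print('MAE:  %.3f' % mae)
print('RMSE: %.3f' % rmse)
print('R^2:  %.3f' % r2)
```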

HOW EASY WAS THAT?!

# Other Thoughts
Imagine if you use this in a loop... you could get MANY candidate models and see how the pipeline changes...

Try looking at [the arguments](http://epistasislab.github.io/tpot/api/) you can update.

Note that all of the pre-processing needs to happen BEFORE you fit your models.
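
As a hedged sketch of what "pre-processing first" can look like, the cell below fills a missing value and one-hot encodes a text column so that the feature matrix is fully numeric before any model sees it. The tiny DataFrame and its column names are made up for illustration.

```python
import numpy as np
import pandas as pd

# a made-up raw dataset with a missing value and a text column
raw = pd.DataFrame({
    'size': [1.2, np.nan, 3.4, 2.2],
    'color': ['red', 'blue', 'red', 'green'],
    'target': [0, 1, 0, 1],
})

features = raw.drop('target', axis=1)

# fill the missing number with the column median
features['size'] = features['size'].fillna(features['size'].median())

# turn the text column into 0/1 indicator columns
features = pd.get_dummies(features, columns=['color'], dtype=float)

print(features)
```

On a real project, compute things like the median on the training split only, so that nothing from the test set leaks into the fit.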

# Resources
Here are some excellent resources to review.

**Examples**
* https://www.kdnuggets.com/2018/01/managing-machine-learning-workflows-scikit-learn-pipelines-part-4.html
* https://towardsdatascience.com/tpot-automated-machine-learning-in-python-4c063b3e5de9

**Documentation**
* https://epistasislab.github.io/tpot/using/
* https://epistasislab.github.io/tpot/examples/

**More on Pipelines**
(These are the three posts that lead up to the TPOT post above - Part 4.)
* https://www.kdnuggets.com/2017/12/managing-machine-learning-workflows-scikit-learn-pipelines-part-1.html
* https://www.kdnuggets.com/2018/01/managing-machine-learning-workflows-scikit-learn-pipelines-part-2.html
* https://www.kdnuggets.com/2018/01/managing-machine-learning-workflows-scikit-learn-pipelines-part-3.html