## TPOT
TPOT (Tree-based Pipeline Optimization Tool) is an AutoML library that uses genetic programming to search over scikit-learn pipelines.

In [None]:
!pip install tpot

Collecting tpot
  Downloading TPOT-0.11.7-py3-none-any.whl (87 kB)
Collecting update-checker>=0.16
  Downloading update_checker-0.18.0-py3-none-any.whl (7.0 kB)
Collecting stopit>=1.1.1
  Downloading stopit-1.1.2.tar.gz (18 kB)
Collecting xgboost>=1.1.0
  Downloading xgboost-1.5.2-py3-none-manylinux2014_x86_64.whl (173.6 MB)

In [None]:
# check tpot version
import tpot
print('tpot: %s' % tpot.__version__)

tpot: 0.11.7


In [None]:
import pandas as pd
import matplotlib.pyplot as plt
import numpy as np
import seaborn as sns
from tensorflow.keras.utils import plot_model

# Setting up the datasets

## Superconductors dataset (regression task)

Source: https://archive.ics.uci.edu/ml/datasets/Superconductivty+Data


The dataset contains 81 numerical features of 21263 superconductors. The label corresponds to their critical temperature measured in Kelvin.

In [None]:
!wget 'https://raw.githubusercontent.com/abcom-mltutorials/automl/main/superconductors.csv'

--2022-02-02 12:28:57--  https://raw.githubusercontent.com/abcom-mltutorials/automl/main/superconductors.csv
Resolving raw.githubusercontent.com (raw.githubusercontent.com)... 185.199.111.133, 185.199.110.133, 185.199.109.133, ...
Connecting to raw.githubusercontent.com (raw.githubusercontent.com)|185.199.111.133|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 23859780 (23M) [text/plain]
Saving to: ‘superconductors.csv’


2022-02-02 12:28:58 (218 MB/s) - ‘superconductors.csv’ saved [23859780/23859780]



In [None]:
regressor_df=pd.read_csv('/content/superconductors.csv')

In [None]:
regressor_df.shape

(21263, 82)

In [None]:
features_regressor = regressor_df.iloc[:,:-1]
label_regressor = regressor_df.iloc[:,-1]

In [None]:
from sklearn.model_selection import train_test_split

In [None]:
X_train_regressor, X_val_regressor, label_train_regressor, label_val_regressor = train_test_split(features_regressor, label_regressor, test_size=0.2, random_state=42)

In [None]:
from sklearn.metrics import mean_squared_error
from sklearn.metrics import r2_score

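def error_metrics(y_pred, y_val):
    """Print MSE, RMSE and R^2 for predictions y_pred against true values y_val."""
    mse = mean_squared_error(y_val, y_pred)
    print('MSE: ', mse)
    print('RMSE: ', np.sqrt(mse))
    # r2_score expects (y_true, y_pred) and is not symmetric in its arguments
    print('Coefficient of determination: ', r2_score(y_val, y_pred))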
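
For reference, the three quantities printed by `error_metrics` can be written out by hand. The sketch below uses plain Python and made-up toy values; the helper names `mse` and `r2` are illustrative and not part of scikit-learn. It also shows that R², unlike MSE, depends on which argument is treated as the ground truth, because it normalizes by the variance of the true values.

```python
import math

def mse(y_true, y_pred):
    """Mean of the squared differences."""
    return sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)

def r2(y_true, y_pred):
    """1 - (residual sum of squares / total sum of squares of y_true)."""
    mean_true = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)
    return 1 - ss_res / ss_tot

# Toy values, invented for illustration only
y_true = [1.0, 2.0, 3.0, 4.0]
y_pred = [1.0, 2.0, 3.0, 6.0]

toy_mse = mse(y_true, y_pred)
toy_rmse = math.sqrt(toy_mse)
r2_correct_order = r2(y_true, y_pred)   # ground truth first
r2_swapped_order = r2(y_pred, y_true)   # arguments swapped

print('MSE: ', toy_mse)
print('RMSE: ', toy_rmse)
print('R2 (true, pred): ', r2_correct_order)
print('R2 (pred, true): ', r2_swapped_order)
```

MSE and RMSE are unchanged when the arguments are swapped, but the two R² values differ.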

## Biodegradation dataset (classification task)

Source: https://archive.ics.uci.edu/ml/datasets/QSAR+biodegradation


The dataset contains 41 numerical features (molecular descriptors) of 1055 chemicals. The label is the experimental class: ready biodegradable ("RB") or not ready biodegradable ("NRB").

In [None]:
!wget 'https://raw.githubusercontent.com/abcom-mltutorials/automl/main/biodeg.csv'

--2022-02-02 12:28:59--  https://raw.githubusercontent.com/abcom-mltutorials/automl/main/biodeg.csv
Resolving raw.githubusercontent.com (raw.githubusercontent.com)... 185.199.110.133, 185.199.108.133, 185.199.109.133, ...
Connecting to raw.githubusercontent.com (raw.githubusercontent.com)|185.199.110.133|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 155987 (152K) [text/plain]
Saving to: ‘biodeg.csv’


2022-02-02 12:28:59 (9.93 MB/s) - ‘biodeg.csv’ saved [155987/155987]



In [None]:
classifier_df=pd.read_csv('/content/biodeg.csv', delimiter=';', header=None)

In [None]:
classifier_df.shape

(1055, 42)

In [None]:
classifier_df.rename(columns={41:'label'}, inplace=True)

In [None]:
classifier_df.columns = classifier_df.columns.astype(str)

In [None]:
features_classifier = classifier_df.iloc[:,:-1]
label_classifier = classifier_df.iloc[:,-1]

In [None]:
!pip install imbalanced-learn



In [None]:
from imblearn.over_sampling import SMOTE

In [None]:
sm = SMOTE(random_state=42)
X_classifier, y_classifier = sm.fit_resample(features_classifier, label_classifier)

In [None]:
y_classifier.value_counts()

RB     699
NRB    699
Name: label, dtype: int64

In [None]:
y_classifier = y_classifier.replace('NRB',0).replace('RB',1)

In [None]:
X_train_classifier, X_val_classifier, label_train_classifier, label_val_classifier = train_test_split(X_classifier, y_classifier, random_state=42, test_size = 0.2)
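
Note that the cells above oversample the full dataset and only then split it, so synthetic minority samples in the validation set may be interpolated from real samples that sit in the training set. A common alternative is to split first and oversample only the training portion. The sketch below illustrates that ordering in plain Python. It uses random duplication as a stand-in for SMOTE to stay dependency-free, and `split_then_oversample` and the toy rows are hypothetical, not part of imbalanced-learn.

```python
import random
from collections import Counter

def split_then_oversample(rows, labels, test_fraction=0.2, seed=42):
    """Hold out a validation set first, then balance only the training set."""
    rng = random.Random(seed)
    indices = list(range(len(rows)))
    rng.shuffle(indices)
    n_val = int(len(indices) * test_fraction)
    val_idx, train_idx = indices[:n_val], indices[n_val:]

    # Group the training indices by class label
    by_class = {}
    for i in train_idx:
        by_class.setdefault(labels[i], []).append(i)

    # Duplicate random members of each smaller class up to the majority size
    target = max(len(members) for members in by_class.values())
    balanced_idx = []
    for members in by_class.values():
        balanced_idx.extend(members)
        balanced_idx.extend(rng.choices(members, k=target - len(members)))

    train = [(rows[i], labels[i]) for i in balanced_idx]
    val = [(rows[i], labels[i]) for i in val_idx]
    return train, val

# Toy imbalanced data, invented for illustration only
toy_rows = [(i,) for i in range(10)]
toy_labels = ['NRB'] * 7 + ['RB'] * 3

train, val = split_then_oversample(toy_rows, toy_labels)
train_counts = Counter(label for _, label in train)
print(train_counts)
print(len(val))
```

The validation rows are never used as a source for resampling, so the validation score reflects only real, unseen samples.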

### Classifier

In [None]:
from sklearn.model_selection import RepeatedStratifiedKFold
cv = RepeatedStratifiedKFold(n_splits=2, n_repeats=2, random_state=1)

In [None]:
from tpot import TPOTClassifier
model_class = TPOTClassifier(generations=3, population_size=50, cv=cv, scoring='accuracy', verbosity=2, random_state=1, n_jobs=-1)

In [None]:
import time
tic = time.perf_counter()

In [None]:
model_class.fit(X_train_classifier, label_train_classifier)

Optimization Progress:   0%|          | 0/200 [00:00<?, ?pipeline/s]


Generation 1 - Current best internal CV score: 0.8895348837209303

Generation 2 - Current best internal CV score: 0.8895348837209303

Generation 3 - Current best internal CV score: 0.8895348837209303

Best pipeline: GradientBoostingClassifier(GaussianNB(input_matrix), learning_rate=0.1, max_depth=10, max_features=0.15000000000000002, min_samples_leaf=13, min_samples_split=18, n_estimators=100, subsample=0.9500000000000001)


TPOTClassifier(cv=RepeatedStratifiedKFold(n_repeats=2, n_splits=2, random_state=1),
               generations=3, n_jobs=-1, population_size=50, random_state=1,
               scoring='accuracy', verbosity=2)

In [None]:
toc = time.perf_counter()
print(f"Elapsed time {toc - tic:0.4f} seconds")

Elapsed time 321.0972 seconds
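
The elapsed time is easier to interpret alongside the search budget. According to the TPOT documentation, a run evaluates `population_size + generations * offspring_size` pipelines, and `offspring_size` defaults to `population_size`. Each pipeline is then fitted once per cross-validation split. The sketch below is plain arithmetic on the settings used above; `tpot_budget` is a hypothetical helper, not a TPOT function, and the fit count is an upper bound because TPOT can reuse scores for duplicate pipelines and skip invalid ones.

```python
def tpot_budget(population_size, generations, n_splits, n_repeats, offspring_size=None):
    """Return (pipelines evaluated, upper bound on model fits) for one TPOT run."""
    if offspring_size is None:
        offspring_size = population_size  # TPOT's documented default
    pipelines = population_size + generations * offspring_size
    fits = pipelines * n_splits * n_repeats
    return pipelines, fits

# Settings of the classifier run above
pipelines, fits = tpot_budget(population_size=50, generations=3, n_splits=2, n_repeats=2)
elapsed_seconds = 321.0972  # value printed by the cell above

print(f"Pipelines evaluated: {pipelines}")
print(f"Cross-validation fits (upper bound): {fits}")
print(f"Average wall time per pipeline: {elapsed_seconds / pipelines:0.2f} s")
```

The pipeline count of 200 matches the total shown in the optimization progress bar.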


In [None]:
model_class.predict(X_val_classifier)



array([0, 0, 0, 0, 1, 0, 1, 0, 0, 1, 1, 0, 0, 0, 1, 1, 1, 1, 1, 0, 0, 1,
       0, 1, 0, 1, 1, 1, 0, 1, 1, 1, 0, 1, 0, 1, 0, 1, 0, 0, 0, 0, 1, 0,
       1, 1, 0, 0, 0, 0, 1, 1, 1, 0, 0, 0, 1, 0, 1, 1, 1, 1, 0, 0, 0, 0,
       1, 0, 1, 0, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1, 1, 1,
       1, 1, 0, 0, 0, 1, 1, 0, 0, 1, 0, 1, 0, 0, 0, 0, 0, 1, 0, 0, 1, 1,
       1, 1, 0, 1, 0, 1, 0, 0, 1, 0, 0, 1, 1, 0, 1, 1, 1, 1, 0, 0, 0, 1,
       0, 1, 1, 0, 0, 1, 1, 0, 0, 1, 1, 0, 1, 1, 0, 0, 0, 1, 1, 0, 0, 1,
       0, 1, 1, 1, 0, 0, 0, 0, 1, 1, 0, 1, 0, 1, 1, 1, 1, 0, 0, 1, 1, 0,
       0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 1, 0, 1, 1, 1, 1, 0, 1, 1, 0, 0,
       0, 0, 1, 0, 1, 0, 0, 1, 0, 0, 1, 0, 1, 1, 0, 1, 0, 1, 1, 0, 1, 1,
       1, 1, 0, 0, 1, 0, 0, 0, 1, 0, 0, 0, 1, 0, 1, 0, 0, 1, 1, 1, 1, 1,
       0, 0, 0, 0, 1, 1, 0, 0, 0, 1, 0, 1, 0, 1, 1, 1, 1, 1, 1, 0, 0, 1,
       0, 1, 1, 1, 1, 1, 0, 1, 0, 0, 1, 0, 0, 0, 1, 1])

In [None]:
from sklearn.metrics import classification_report

In [None]:
print(classification_report(label_val_classifier,model_class.predict(X_val_classifier)))

              precision    recall  f1-score   support

           0       0.94      0.93      0.93       143
           1       0.93      0.93      0.93       137

    accuracy                           0.93       280
   macro avg       0.93      0.93      0.93       280
weighted avg       0.93      0.93      0.93       280





### Regressor

In [None]:
from sklearn.model_selection import RepeatedKFold
cv = RepeatedKFold(n_splits=2, n_repeats=2, random_state=1)

In [None]:
from tpot import TPOTRegressor
model_reg = TPOTRegressor(generations=5, population_size=50, scoring='neg_mean_absolute_error', cv=cv, verbosity=2, random_state=1, n_jobs=-1)

In [None]:
import time
tic = time.perf_counter()

In [None]:
model_reg.fit(X_train_regressor,label_train_regressor)

Optimization Progress:   0%|          | 0/300 [00:00<?, ?pipeline/s]


Generation 1 - Current best internal CV score: -6.528235754680916

Generation 2 - Current best internal CV score: -6.528235754680916

Generation 3 - Current best internal CV score: -6.155736193843216

Generation 4 - Current best internal CV score: -6.012741228500996

Generation 5 - Current best internal CV score: -5.91625950912702

Best pipeline: RandomForestRegressor(MaxAbsScaler(input_matrix), bootstrap=False, max_features=0.6500000000000001, min_samples_leaf=3, min_samples_split=11, n_estimators=100)


TPOTRegressor(cv=RepeatedKFold(n_repeats=2, n_splits=2, random_state=1),
              generations=5, n_jobs=-1, population_size=50, random_state=1,
              scoring='neg_mean_absolute_error', verbosity=2)
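
TPOT prints the best pipeline as a nested expression that is read from the inside out: `input_matrix` is passed through `MaxAbsScaler`, and the scaled features feed `RandomForestRegressor`. As a rough sketch, the same pipeline can be written directly in scikit-learn as below. The synthetic data and the `random_state` value are assumptions added so that the example runs on its own; they are not taken from the TPOT run.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import MaxAbsScaler

# Hyperparameters copied from the "Best pipeline" line above
best_pipeline = make_pipeline(
    MaxAbsScaler(),
    RandomForestRegressor(
        bootstrap=False,
        max_features=0.65,
        min_samples_leaf=3,
        min_samples_split=11,
        n_estimators=100,
        random_state=1,  # assumption: fixed only to make this sketch reproducible
    ),
)

# Synthetic stand-in data (not the superconductors dataset)
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(200, 5))
y_demo = X_demo[:, 0] * 3.0 + rng.normal(scale=0.1, size=200)

best_pipeline.fit(X_demo, y_demo)
demo_predictions = best_pipeline.predict(X_demo[:5])
print(demo_predictions)
```

On the fitted TPOT object itself, `model_reg.fitted_pipeline_` holds the fitted scikit-learn pipeline, and `model_reg.export('tpot_pipeline.py')` writes equivalent code to a file (the file name here is arbitrary).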

In [None]:
toc = time.perf_counter()
print(f"Elapsed time {toc - tic:0.4f} seconds")

Elapsed time 7997.2848 seconds


In [None]:
model_reg.predict(X_val_regressor)

array([11.54500873, 84.40896594, 27.01459802, ...,  4.0760195 ,
       10.2324228 ,  3.71842276])

In [None]:
error_metrics(model_reg.predict(X_val_regressor),label_val_regressor)

MSE:  78.55015022333929
RMSE:  8.862852262299045
Coefficient of determination:  0.9260222950313017
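
The scores in the generation log and the validation metrics above are on different scales. The search used `scoring='neg_mean_absolute_error'`, which is the mean absolute error (MAE) multiplied by -1 so that larger is better, while `error_metrics` reports MSE, RMSE and R². The sketch below converts the logged scores back to MAE in kelvin, then uses invented toy residuals to show that RMSE is never smaller than MAE on the same errors, so the two numbers should not be compared directly.

```python
import math

# Best internal CV scores copied from the generation log above
cv_scores = [-6.528235754680916, -6.528235754680916, -6.155736193843216,
             -6.012741228500996, -5.91625950912702]
cv_mae = [-score for score in cv_scores]  # undo the sign flip
print(f"Best CV MAE: {min(cv_mae):0.3f} K")

# Toy residuals, invented for illustration only
residuals = [1.0, -2.0, 0.5, 8.0]
mae = sum(abs(r) for r in residuals) / len(residuals)
rmse = math.sqrt(sum(r ** 2 for r in residuals) / len(residuals))
print(f"MAE: {mae:0.3f}, RMSE: {rmse:0.3f}")
```

RMSE weights large errors more heavily than MAE, which is why a single large residual in the toy data pulls the two values apart.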
