<a href="https://colab.research.google.com/github/csanicola74/AutoML-examples/blob/main/AutoML_examples.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

In [1]:
!pip install tpot mljar-supervised

Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/
Collecting tpot
  Downloading TPOT-0.11.7-py3-none-any.whl (87 kB)
Collecting mljar-supervised
  Downloading mljar-supervised-0.11.3.tar.gz (112 kB)
Collecting update-checker>=0.16
  Downloading update_checker-0.18.0-py3-none-any.whl (7.0 kB)
Collecting deap>=1.2
  Downloading deap-1.3.3-cp37-cp37m-manylinux_2_5_x86_64.manylinux1_x86_64.manylinux_2_17_x86_64.manylinux2014_x86_64.whl (139 kB)
Collecting xgboost>=1.1.0
  Downloading xgboost-1.6.2-py3-none-manylinux2014_x86_64.whl (255.9 MB)
Collecting stopit>=1.1.1
  Downloading stopit-1.1.2.tar.gz (18 kB)
Collecting lightgbm>=3.0.0
  Downloading lightgbm-3.3.3-py3-none-manylinux1_x86_64.whl

In [2]:
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from supervised.automl import AutoML

# Options Available

- mode: the package ships with four built-in modes.
  - Explain is meant for exploring and understanding the data. It produces feature-importance plots as well as tree visualizations.
  - Perform is meant for building ML models for production.
  - Compete is meant for building models for machine learning competitions.
  - Optuna is meant for searching for highly tuned ML models.
- algorithms: the algorithms you would like to use, usually passed in as a list.
- results_path: the path where the results will be stored.
- total_time_limit: the total time, in seconds, allowed for training.
- train_ensemble: whether an ensemble is created at the end of the training process.
- stack_models: whether a stack of models is created.
- eval_metric: the metric that will be optimized. With "auto", logloss is used for classification problems and rmse for regression problems.
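To make the two default metrics concrete, here is a minimal pure-Python sketch of logloss and RMSE. The function names and the toy inputs are illustrative assumptions; they are not taken from the package or from the heart data.

```python
import math

def logloss(y_true, probas, eps=1e-15):
    """Mean negative log-probability assigned to the true class."""
    total = 0.0
    for label, row in zip(y_true, probas):
        p = min(max(row[label], eps), 1 - eps)  # clip to avoid log(0)
        total += -math.log(p)
    return total / len(y_true)

def rmse(y_true, y_pred):
    """Square root of the mean squared error."""
    squared = [(t - p) ** 2 for t, p in zip(y_true, y_pred)]
    return math.sqrt(sum(squared) / len(squared))

# Toy inputs, made up for illustration
labels = [0, 1, 1]
probas = [[0.9, 0.1], [0.2, 0.8], [0.4, 0.6]]
print(logloss(labels, probas))                  # lower is better
print(rmse([233, 250, 204], [240, 245, 210]))   # same units as the target
```

Both metrics are "lower is better", which is why the model with the smallest value is reported as the best one in the training logs.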

In [None]:
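# Template for the options above (kept commented out).
# The empty strings are placeholders to fill in before running.
# automl = AutoML(
#     mode="Explain",
#     algorithms="",            # placeholder: a list of algorithm names
#     results_path="AutoML_22",
#     total_time_limit=30 * 60,
#     train_ensemble=True,
#     stack_models="",          # placeholder
#     eval_metric="",           # placeholder
# )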

# Heart Attack Analysis and Prediction

## Load in Dataset

In [3]:
import pandas as pd
heart = pd.read_csv('https://raw.githubusercontent.com/csanicola74/AutoML-examples/main/data/heart.csv')
heart

Unnamed: 0,age,sex,cp,trtbps,chol,fbs,restecg,thalachh,exng,oldpeak,slp,caa,thall,output
0,63,1,3,145,233,1,0,150,0,2.3,0,0,1,1
1,37,1,2,130,250,0,1,187,0,3.5,0,0,2,1
2,41,0,1,130,204,0,0,172,0,1.4,2,0,2,1
3,56,1,1,120,236,0,1,178,0,0.8,2,0,2,1
4,57,0,0,120,354,0,1,163,1,0.6,2,0,2,1
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
298,57,0,0,140,241,0,1,123,1,0.2,1,0,3,0
299,45,1,3,110,264,0,1,132,0,1.2,1,0,3,0
300,68,1,0,144,193,1,1,141,0,3.4,1,2,3,0
301,57,1,0,130,131,0,1,115,1,1.2,1,1,3,0


**About this dataset**
- age : age of the patient

- sex : sex of the patient (1 = male; 0 = female)

- exng : exercise-induced angina (1 = yes; 0 = no)

- caa : number of major vessels (0-3)

- cp : chest pain type. The source description lists values 1-4; in this file the column takes the values 0-3.
  - Value 1: typical angina
  - Value 2: atypical angina
  - Value 3: non-anginal pain
  - Value 4: asymptomatic

- trtbps : resting blood pressure (in mm Hg)

- chol : cholesterol in mg/dl fetched via BMI sensor

- fbs : fasting blood sugar > 120 mg/dl (1 = true; 0 = false)

- restecg : resting electrocardiographic results
  - Value 0: normal
  - Value 1: having ST-T wave abnormality (T wave inversions and/or ST elevation or depression of > 0.05 mV)
  - Value 2: showing probable or definite left ventricular hypertrophy by Estes' criteria

- thalachh : maximum heart rate achieved

- output : 0 = less chance of heart attack; 1 = more chance of heart attack

[Source](https://www.kaggle.com/datasets/rashikrahmanpritom/heart-attack-analysis-prediction-dataset)


In [4]:
heart.columns

Index(['age', 'sex', 'cp', 'trtbps', 'chol', 'fbs', 'restecg', 'thalachh',
       'exng', 'oldpeak', 'slp', 'caa', 'thall', 'output'],
      dtype='object')

# Potential Variables of Interest

1. Focus on classification (binary or multi-class outcome variable)
  - Sex (sex)
  - Exercise-Induced Angina (exng)
  - Chest Pain Type (cp)
  - Resting EKG Results (restecg)
  - Fasting Blood Sugar (fbs)
2. Focus on regression (continuous outcome variable)
  - Age (age)
  - Maximum Heart Rate Achieved (thalachh)
  - Resting Blood Pressure (trtbps)
  - Cholesterol (chol)
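The split between the two groups above comes down to the kind of values the target takes. The sketch below is a hypothetical heuristic for telling them apart; `guess_task` and its `max_classes` threshold are assumptions made for illustration and are not the rule the AutoML package applies.

```python
def guess_task(values, max_classes=20):
    """Hypothetical heuristic: label a target by its distinct values."""
    distinct = set(values)
    if len(distinct) == 2:
        return "binary_classification"
    whole_numbers = all(float(v).is_integer() for v in distinct)
    if whole_numbers and len(distinct) <= max_classes:
        return "multiclass_classification"
    return "regression"

print(guess_task([1, 0, 1, 1, 0]))          # like sex, fbs, exng
print(guess_task([0, 2, 1, 3, 0]))          # like cp
print(guess_task(list(range(126, 565))))    # like chol: many distinct values
```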

# Experiment #1
---
## Focus on Classification

In [5]:
heart['sex'].value_counts()

1    207
0     96
Name: sex, dtype: int64

In [6]:
heart['cp'].value_counts()

0    143
2     87
1     50
3     23
Name: cp, dtype: int64

In [7]:
heart['fbs'].value_counts()

0    258
1     45
Name: fbs, dtype: int64

In [8]:
heart['restecg'].value_counts()

1    152
0    147
2      4
Name: restecg, dtype: int64

In [9]:
heart['exng'].value_counts()

0    204
1     99
Name: exng, dtype: int64

In [10]:
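# NOTE: heart.columns[:-1] drops only the last column ('output'), so the target
# 'cp' is still among the features. This leakage explains the near-zero logloss
# in the training log that follows.
X_train, X_test, y_train, y_test = train_test_split(
    heart[heart.columns[:-1]], heart["cp"], test_size=0.25)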
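Slicing with `heart.columns[:-1]` removes only the last column, so it is worth checking whether the target ends up among the features. A minimal sketch using the column list printed by `heart.columns`; the variable names are illustrative:

```python
# Column names as printed by heart.columns
columns = ['age', 'sex', 'cp', 'trtbps', 'chol', 'fbs', 'restecg',
           'thalachh', 'exng', 'oldpeak', 'slp', 'caa', 'thall', 'output']
target = 'cp'

leaky_features = columns[:-1]                        # mirrors heart.columns[:-1]
safe_features = [c for c in columns if c != target]  # drops the target itself

print(target in leaky_features)  # True: the model can read the answer
print(target in safe_features)   # False
```

With a DataFrame, the equivalent of `safe_features` is `heart.drop(columns=[target])`.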

In [11]:
automl = AutoML()
automl.fit(X_train, y_train)

AutoML directory: AutoML_1
The task is multiclass_classification with evaluation metric logloss
AutoML will use algorithms: ['Baseline', 'Linear', 'Decision Tree', 'Random Forest', 'Xgboost', 'Neural Network']
AutoML will ensemble available models
AutoML steps: ['simple_algorithms', 'default_algorithms', 'ensemble']
* Step simple_algorithms will try to check up to 3 models
1_Baseline logloss 1.224215 trained in 0.4 seconds




2_DecisionTree logloss 3e-06 trained in 16.31 seconds
3_Linear logloss 0.302134 trained in 11.87 seconds
* Step default_algorithms will try to check up to 3 models
4_Default_Xgboost logloss 0.014134 trained in 15.93 seconds
5_Default_NeuralNetwork logloss 0.168143 trained in 1.17 seconds
6_Default_RandomForest logloss 3e-06 trained in 13.33 seconds
* Step ensemble will try to check up to 1 model
Ensemble logloss 3e-06 trained in 0.3 seconds
AutoML fit time: 69.51 seconds
AutoML best model: 2_DecisionTree


AutoML()

In [12]:
predictions = automl.predict(X_test)
predictions

array([0, 1, 2, 0, 0, 2, 1, 2, 2, 2, 1, 3, 2, 0, 0, 1, 0, 2, 2, 3, 0, 0,
       2, 2, 2, 2, 2, 1, 0, 0, 1, 3, 2, 0, 1, 0, 1, 0, 0, 0, 1, 2, 0, 0,
       0, 0, 0, 0, 0, 3, 0, 2, 2, 0, 0, 0, 0, 2, 2, 0, 0, 2, 1, 1, 2, 2,
       0, 2, 0, 2, 0, 2, 2, 0, 0, 2], dtype=int32)

In [13]:
automl.report()

Best model,name,model_type,metric_type,metric_value,train_time
,1_Baseline,Baseline,logloss,1.22422,1.07
the best,2_DecisionTree,Decision Tree,logloss,3e-06,17.31
,3_Linear,Linear,logloss,0.302134,13.09
,4_Default_Xgboost,Xgboost,logloss,0.0141337,17.02
,5_Default_NeuralNetwork,Neural Network,logloss,0.168143,2.02
,6_Default_RandomForest,Random Forest,logloss,3e-06,14.45
,Ensemble,Ensemble,logloss,3e-06,0.3

Model,Weight
2_DecisionTree,1

Unnamed: 0,0,1,2,3,accuracy,macro avg,weighted avg,logloss
precision,1,1,1,1,1,1,1,3e-06
recall,1,1,1,1,1,1,1,3e-06
f1-score,1,1,1,1,1,1,1,3e-06
support,27,10,15,5,1,57,57,3e-06

Unnamed: 0,Predicted as 0,Predicted as 1,Predicted as 2,Predicted as 3
Labeled as 0,27,0,0,0
Labeled as 1,0,10,0,0
Labeled as 2,0,0,15,0
Labeled as 3,0,0,0,5

Unnamed: 0,0,1,2,3,accuracy,macro avg,weighted avg,logloss
precision,1,1,1,1,1,1,1,3e-06
recall,1,1,1,1,1,1,1,3e-06
f1-score,1,1,1,1,1,1,1,3e-06
support,27,10,15,5,1,57,57,3e-06

Unnamed: 0,Predicted as 0,Predicted as 1,Predicted as 2,Predicted as 3
Labeled as 0,27,0,0,0
Labeled as 1,0,10,0,0
Labeled as 2,0,0,15,0
Labeled as 3,0,0,0,5

Unnamed: 0,0,1,2,3,accuracy,macro avg,weighted avg,logloss
precision,1,1,1,1,1,1,1,0.0141337
recall,1,1,1,1,1,1,1,0.0141337
f1-score,1,1,1,1,1,1,1,0.0141337
support,27,10,15,5,1,57,57,0.0141337

Unnamed: 0,Predicted as 0,Predicted as 1,Predicted as 2,Predicted as 3
Labeled as 0,27,0,0,0
Labeled as 1,0,10,0,0
Labeled as 2,0,0,15,0
Labeled as 3,0,0,0,5

Unnamed: 0,0,1,2,3,accuracy,macro avg,weighted avg,logloss
precision,0.473684,0,0,0,0.473684,0.118421,0.224377,1.22422
recall,1.0,0,0,0,0.473684,0.25,0.473684,1.22422
f1-score,0.642857,0,0,0,0.473684,0.160714,0.304511,1.22422
support,27.0,10,15,5,0.473684,57.0,57.0,1.22422

Unnamed: 0,Predicted as 0,Predicted as 1,Predicted as 2,Predicted as 3
Labeled as 0,27,0,0,0
Labeled as 1,10,0,0,0
Labeled as 2,15,0,0,0
Labeled as 3,5,0,0,0

Unnamed: 0,0,1,2,3,accuracy,macro avg,weighted avg,logloss
precision,0.964286,1.0,0.833333,1.0,0.929825,0.949405,0.939223,0.302134
recall,1.0,0.8,1.0,0.6,0.929825,0.85,0.929825,0.302134
f1-score,0.981818,0.888889,0.909091,0.75,0.929825,0.882449,0.926041,0.302134
support,27.0,10.0,15.0,5.0,0.929825,57.0,57.0,0.302134

Unnamed: 0,Predicted as 0,Predicted as 1,Predicted as 2,Predicted as 3
Labeled as 0,27,0,0,0
Labeled as 1,1,8,1,0
Labeled as 2,0,0,15,0
Labeled as 3,0,0,2,3

Unnamed: 0,0,1,2,3
intercept,1.15734,1.72402,0.727319,-3.60868
age,0.0138502,0.178015,0.139614,-0.33148
sex,0.205235,-0.171231,-0.200061,0.166058
cp,-4.46316,-1.21802,1.62929,4.05189
trtbps,-0.126326,-0.170597,-0.173924,0.470847
chol,-0.0733405,0.0189566,0.111197,-0.0568132
fbs,-0.00214881,-0.462543,0.401716,0.0629758
restecg,0.0736313,0.116551,0.00792824,-0.198111
thalachh,-0.0831523,0.259048,0.0543465,-0.230242
exng,0.599549,-0.624016,0.174684,-0.150217

Unnamed: 0,0,1,2,3,accuracy,macro avg,weighted avg,logloss
precision,1,1,1,1,1,1,1,3e-06
recall,1,1,1,1,1,1,1,3e-06
f1-score,1,1,1,1,1,1,1,3e-06
support,27,10,15,5,1,57,57,3e-06

Unnamed: 0,Predicted as 0,Predicted as 1,Predicted as 2,Predicted as 3
Labeled as 0,27,0,0,0
Labeled as 1,0,10,0,0
Labeled as 2,0,0,15,0
Labeled as 3,0,0,0,5

Unnamed: 0,0,1,2,3,accuracy,macro avg,weighted avg,logloss
precision,0.964286,1.0,1,1,0.982456,0.991071,0.983083,0.168143
recall,1.0,0.9,1,1,0.982456,0.975,0.982456,0.168143
f1-score,0.981818,0.947368,1,1,0.982456,0.982297,0.982154,0.168143
support,27.0,10.0,15,5,0.982456,57.0,57.0,0.168143

Unnamed: 0,Predicted as 0,Predicted as 1,Predicted as 2,Predicted as 3
Labeled as 0,27,0,0,0
Labeled as 1,1,9,0,0
Labeled as 2,0,0,15,0
Labeled as 3,0,0,0,5
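The per-class precision, recall and F1 values in the report can be recomputed from a confusion matrix. The sketch below does this for the Linear model's matrix (the one with off-diagonal entries in rows 1 and 3); the helper name is illustrative.

```python
# Rows are true labels, columns are predicted labels (Linear model, from the report)
confusion = [
    [27, 0, 0, 0],
    [1, 8, 1, 0],
    [0, 0, 15, 0],
    [0, 0, 2, 3],
]

def per_class_metrics(cm):
    """Return a (precision, recall, f1) tuple for each class."""
    n = len(cm)
    results = []
    for k in range(n):
        tp = cm[k][k]
        predicted_k = sum(cm[r][k] for r in range(n))  # column total
        actual_k = sum(cm[k])                          # row total
        precision = tp / predicted_k if predicted_k else 0.0
        recall = tp / actual_k if actual_k else 0.0
        denom = precision + recall
        f1 = 2 * precision * recall / denom if denom else 0.0
        results.append((precision, recall, f1))
    return results

metrics = per_class_metrics(confusion)
accuracy = sum(confusion[i][i] for i in range(4)) / sum(map(sum, confusion))

for label, (p, r, f) in enumerate(metrics):
    print(label, round(p, 6), round(r, 6), round(f, 6))
print("accuracy", round(accuracy, 6))
```

The rounded values agree with the Linear rows of the report, for example precision 0.964286 for class 0 and accuracy 0.929825.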


In [18]:
automl = AutoML(results_path='heart_cp',mode='Explain')

## Test the Data Model Selected

In [14]:
# Rebuild the features without the target column 'cp'

x = heart.drop(columns=['cp'])

In [15]:
y = heart['cp']

In [16]:
x

Unnamed: 0,age,sex,trtbps,chol,fbs,restecg,thalachh,exng,oldpeak,slp,caa,thall,output
0,63,1,145,233,1,0,150,0,2.3,0,0,1,1
1,37,1,130,250,0,1,187,0,3.5,0,0,2,1
2,41,0,130,204,0,0,172,0,1.4,2,0,2,1
3,56,1,120,236,0,1,178,0,0.8,2,0,2,1
4,57,0,120,354,0,1,163,1,0.6,2,0,2,1
...,...,...,...,...,...,...,...,...,...,...,...,...,...
298,57,0,140,241,0,1,123,1,0.2,1,0,3,0
299,45,1,110,264,0,1,132,0,1.2,1,0,3,0
300,68,1,144,193,1,1,141,0,3.4,1,2,3,0
301,57,1,130,131,0,1,115,1,1.2,1,1,3,0


In [17]:
y

0      3
1      2
2      1
3      1
4      0
      ..
298    0
299    3
300    0
301    0
302    1
Name: cp, Length: 303, dtype: int64

In [21]:
x_train, x_test, y_train, y_test = train_test_split(x, y, stratify=y, test_size=0.25)
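Passing `stratify=y` asks `train_test_split` to keep the class proportions similar in both parts, which matters here because class 3 has only 23 rows. A minimal sketch with synthetic labels that have the same class counts as `heart['cp']`; the placeholder feature and the `random_state` value are assumptions (the notebook does not set a seed).

```python
from collections import Counter
from sklearn.model_selection import train_test_split

# Synthetic labels with the class counts from heart['cp'].value_counts()
y = [0] * 143 + [1] * 50 + [2] * 87 + [3] * 23
X = [[i] for i in range(len(y))]  # placeholder feature, one value per row

X_tr, X_te, y_tr, y_te = train_test_split(
    X, y, stratify=y, test_size=0.25, random_state=0)

print(len(y_tr), len(y_te))
print(sorted(Counter(y_te).items()))  # every class appears in the test part
```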

In [22]:
x_test

Unnamed: 0,age,sex,trtbps,chol,fbs,restecg,thalachh,exng,oldpeak,slp,caa,thall,output
30,41,0,105,198,0,1,168,0,0.0,2,1,2,1
74,43,0,122,213,0,1,165,0,0.2,1,0,2,1
123,54,0,108,267,0,0,167,0,0.0,2,0,2,1
84,42,0,102,265,0,0,122,0,0.6,1,0,2,1
160,56,1,120,240,0,1,169,0,0.0,0,0,2,1
...,...,...,...,...,...,...,...,...,...,...,...,...,...
83,52,1,152,298,1,1,178,0,1.2,1,0,3,1
281,52,1,128,204,1,1,156,1,1.0,1,0,0,0
52,62,1,130,231,0,1,146,0,1.8,1,3,3,1
159,56,1,130,221,0,0,163,0,0.0,2,0,3,1


In [23]:
automl.fit(x_train, y_train)

AutoML directory: heart_cp
The task is multiclass_classification with evaluation metric logloss
AutoML will use algorithms: ['Baseline', 'Linear', 'Decision Tree', 'Random Forest', 'Xgboost', 'Neural Network']
AutoML will ensemble available models
AutoML steps: ['simple_algorithms', 'default_algorithms', 'ensemble']
* Step simple_algorithms will try to check up to 3 models
1_Baseline logloss 1.226009 trained in 1.81 seconds
2_DecisionTree logloss 2.153685 trained in 17.77 seconds
3_Linear logloss 1.289912 trained in 9.61 seconds
* Step default_algorithms will try to check up to 3 models
4_Default_Xgboost logloss 1.175089 trained in 11.58 seconds
5_Default_NeuralNetwork logloss 1.314182 trained in 1.24 seconds
6_Default_RandomForest logloss 1.164299 trained in 14.13 seconds
* Step ensemble will try to check up to 1 model
Ensemble logloss 1.11079 trained in 0.3 seconds
AutoML fit time: 66.53 seconds
AutoML best model: Ensemble


AutoML(results_path='heart_cp')

In [24]:
pred = automl.predict(x_test)
pred

array([1, 2, 2, 2, 2, 2, 2, 0, 0, 0, 2, 0, 0, 2, 0, 1, 0, 1, 2, 2, 0, 0,
       0, 2, 1, 0, 0, 2, 0, 0, 0, 2, 0, 0, 0, 0, 0, 2, 0, 0, 0, 0, 0, 0,
       2, 1, 0, 2, 2, 2, 0, 2, 2, 0, 2, 0, 2, 2, 0, 0, 2, 0, 2, 0, 0, 0,
       0, 0, 2, 0, 0, 2, 0, 3, 1, 1], dtype=int32)

In [30]:
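# NOTE: heart.sample(76) draws 76 random rows that are unrelated to x_test, so the
# 'actual' values taken from it do not line up with pred. y_test holds the labels
# that match the predictions.
heart_withcp = heart.sample(76)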


In [32]:
values_actual = heart_withcp['cp'].values.tolist()
values_predicted = pred.tolist()
output = pd.DataFrame({'actual': values_actual, 'predicted': values_predicted})
output

Unnamed: 0,actual,predicted
0,2,1
1,0,2
2,0,2
3,1,2
4,2,2
...,...,...
71,0,2
72,2,0
73,2,3
74,1,1
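The table above pairs the predictions with labels from `heart.sample(76)`, which picks random rows rather than the rows in `x_test`, so the two columns are not comparable row by row. A minimal sketch of an aligned comparison, using small stand-ins for `y_test` and `pred`; the values are made up for illustration.

```python
import pandas as pd

# Toy stand-in for y_test: it keeps the original row labels after a shuffled split
y_test = pd.Series([2, 0, 1, 0], index=[30, 74, 123, 84], name='cp')
# Hypothetical model output, in the same row order as x_test
pred = [2, 0, 0, 0]

comparison = pd.DataFrame(
    {'actual': y_test.to_numpy(), 'predicted': pred},
    index=y_test.index,  # keep the original row labels for traceability
)
accuracy = (comparison['actual'] == comparison['predicted']).mean()

print(comparison)
print(accuracy)  # 0.75
```

In the notebook itself, the same pattern would use the real `y_test` and `pred` from the cells above.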


# Experiment #2
---
## Focus on Regression

In [33]:
heart['age'].describe()

count    303.000000
mean      54.366337
std        9.082101
min       29.000000
25%       47.500000
50%       55.000000
75%       61.000000
max       77.000000
Name: age, dtype: float64

In [34]:
heart['thalachh'].describe()

count    303.000000
mean     149.646865
std       22.905161
min       71.000000
25%      133.500000
50%      153.000000
75%      166.000000
max      202.000000
Name: thalachh, dtype: float64

In [35]:
heart['trtbps'].describe()

count    303.000000
mean     131.623762
std       17.538143
min       94.000000
25%      120.000000
50%      130.000000
75%      140.000000
max      200.000000
Name: trtbps, dtype: float64

In [36]:
heart['chol'].describe()

count    303.000000
mean     246.264026
std       51.830751
min      126.000000
25%      211.000000
50%      240.000000
75%      274.500000
max      564.000000
Name: chol, dtype: float64

In [37]:
X_without_chol = heart.drop(columns=['chol'])
y_chol = heart['chol']

In [38]:
X_without_chol

Unnamed: 0,age,sex,cp,trtbps,fbs,restecg,thalachh,exng,oldpeak,slp,caa,thall,output
0,63,1,3,145,1,0,150,0,2.3,0,0,1,1
1,37,1,2,130,0,1,187,0,3.5,0,0,2,1
2,41,0,1,130,0,0,172,0,1.4,2,0,2,1
3,56,1,1,120,0,1,178,0,0.8,2,0,2,1
4,57,0,0,120,0,1,163,1,0.6,2,0,2,1
...,...,...,...,...,...,...,...,...,...,...,...,...,...
298,57,0,0,140,0,1,123,1,0.2,1,0,3,0
299,45,1,3,110,0,1,132,0,1.2,1,0,3,0
300,68,1,0,144,1,1,141,0,3.4,1,2,3,0
301,57,1,0,130,0,1,115,1,1.2,1,1,3,0


In [39]:
y_chol

0      233
1      250
2      204
3      236
4      354
      ... 
298    241
299    264
300    193
301    131
302    236
Name: chol, Length: 303, dtype: int64

In [40]:
automl_2 = AutoML()
automl_2.fit(X_without_chol,y_chol)

AutoML directory: AutoML_2
The task is regression with evaluation metric rmse
AutoML will use algorithms: ['Baseline', 'Linear', 'Decision Tree', 'Random Forest', 'Xgboost', 'Neural Network']
AutoML will ensemble available models
AutoML steps: ['simple_algorithms', 'default_algorithms', 'ensemble']
* Step simple_algorithms will try to check up to 3 models
1_Baseline rmse 40.458838 trained in 2.12 seconds
2_DecisionTree rmse 46.646306 trained in 11.2 seconds
3_Linear rmse 40.094017 trained in 3.12 seconds
* Step default_algorithms will try to check up to 3 models
4_Default_Xgboost rmse 43.4703 trained in 4.94 seconds
5_Default_NeuralNetwork rmse 36.975542 trained in 1.04 seconds
6_Default_RandomForest rmse 37.908018 trained in 8.32 seconds
* Step ensemble will try to check up to 1 model
Ensemble rmse 36.759846 trained in 0.31 seconds
AutoML fit time: 39.24 seconds
AutoML best model: Ensemble


AutoML()

In [43]:
automl_2.report()

Best model,name,model_type,metric_type,metric_value,train_time
,1_Baseline,Baseline,rmse,40.4588,2.46
,2_DecisionTree,Decision Tree,rmse,46.6463,11.96
,3_Linear,Linear,rmse,40.094,3.88
,4_Default_Xgboost,Xgboost,rmse,43.4703,5.78
,5_Default_NeuralNetwork,Neural Network,rmse,36.9755,1.66
,6_Default_RandomForest,Random Forest,rmse,37.908,9.15
the best,Ensemble,Ensemble,rmse,36.7598,0.31

Model,Weight
3_Linear,1
5_Default_NeuralNetwork,4
6_Default_RandomForest,1
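One way to read the weight table above is as relative weights in a weighted average of the member models' predictions. The sketch below assumes that interpretation (the notebook does not confirm how the package combines the models); the per-model predictions are made up for illustration.

```python
# Weights from the ensemble table in the report
weights = {
    '3_Linear': 1,
    '5_Default_NeuralNetwork': 4,
    '6_Default_RandomForest': 1,
}

def blend(predictions, weights):
    """Weighted average of per-model predictions, row by row."""
    total = sum(weights.values())
    n_rows = len(next(iter(predictions.values())))
    return [
        sum(weights[m] * predictions[m][i] for m in weights) / total
        for i in range(n_rows)
    ]

# Hypothetical predictions for two rows (not taken from the notebook)
preds = {
    '3_Linear': [240.0, 250.0],
    '5_Default_NeuralNetwork': [246.0, 238.0],
    '6_Default_RandomForest': [252.0, 244.0],
}
print(blend(preds, weights))  # [246.0, 241.0]
```

Under this reading, the neural network contributes four sixths of each blended prediction.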

Metric,Score
MAE,29.6805
MSE,1351.29
RMSE,36.7598
R2,0.156103
MAPE,0.130562

Metric,Score
MAE,31.2722
MSE,1437.02
RMSE,37.908
R2,0.102562
MAPE,0.137837

Metric,Score
MAE,35.4884
MSE,1889.67
RMSE,43.4703
R2,-0.180123
MAPE,0.157708

Metric,Score
MAE,33.7845
MSE,1636.92
RMSE,40.4588
R2,-0.0222779
MAPE,0.148408

Metric,Score
MAE,31.3677
MSE,1607.53
RMSE,40.094
R2,-0.00392512
MAPE,0.136148

feature,Learner_1
thall,0.169459
age,0.162651
exng,0.161141
thalachh,0.114071
slp,0.0711428
fbs,0.0570083
trtbps,0.048487
caa,0.00736213
oldpeak,-0.0292982
cp,-0.0378626

Metric,Score
MAE,38.0484
MSE,2175.88
RMSE,46.6463
R2,-0.358866
MAPE,0.162539

Metric,Score
MAE,29.9102
MSE,1367.19
RMSE,36.9755
R2,0.14617
MAPE,0.13152
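The metric blocks above are tied together by two identities: RMSE is the square root of MSE, and R2 = 1 - MSE / Var(y). A negative R2, as for the Baseline and the Decision Tree, therefore means the model does worse than always predicting the mean of the validation targets. The sketch below checks the Ensemble figures against the Baseline figures, assuming every model in the report was scored on the same validation rows.

```python
import math

# Values copied from the report
baseline_mse, baseline_r2 = 1636.92, -0.0222779
ensemble_mse = 1351.29

# R2 = 1 - MSE / Var(y)  =>  Var(y) = MSE / (1 - R2)
target_variance = baseline_mse / (1 - baseline_r2)

ensemble_r2 = 1 - ensemble_mse / target_variance
ensemble_rmse = math.sqrt(ensemble_mse)

print(round(target_variance, 2))
print(round(ensemble_r2, 4))    # report lists R2 = 0.156103
print(round(ensemble_rmse, 4))  # report lists RMSE = 36.7598
```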


In [41]:
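# NOTE: these predictions are made on the same rows the model was fitted on (in-sample),
# so they are not an estimate of performance on unseen data.
heart['reg_pred'] = automl_2.predict(X_without_chol)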

In [42]:
print('reg_pred')
print(heart[['chol', 'reg_pred']].head())

reg_pred
   chol    reg_pred
0   233  242.030588
1   250  233.076180
2   204  256.958405
3   236  236.239116
4   354  253.012820
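The predictions above are for rows the model was fitted on, so they can look better than predictions on unseen rows would. The sketch below illustrates the gap with a deliberately simple 1-nearest-neighbour predictor on synthetic data; the data, the seed and the helper names are all assumptions for illustration and have nothing to do with the AutoML models.

```python
import math
import random

random.seed(0)
# Synthetic data: the target is pure noise, so there is nothing real to learn
X = [random.random() for _ in range(200)]
y = [random.gauss(246, 50) for _ in range(200)]

def predict_1nn(train_X, train_y, query):
    """Return the target of the closest training point."""
    nearest = min(range(len(train_X)), key=lambda i: abs(train_X[i] - query))
    return train_y[nearest]

def rmse(actual, predicted):
    squared = [(a - p) ** 2 for a, p in zip(actual, predicted)]
    return math.sqrt(sum(squared) / len(squared))

train_X, train_y = X[:150], y[:150]
test_X, test_y = X[150:], y[150:]

in_sample = rmse(train_y, [predict_1nn(train_X, train_y, q) for q in train_X])
hold_out = rmse(test_y, [predict_1nn(train_X, train_y, q) for q in test_X])

print(in_sample)  # 0.0: each training row is its own nearest neighbour
print(hold_out)   # much larger on rows the model never saw
```

A held-out split, like the one used in Experiment #1, gives a fairer estimate for the regression model as well.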
