<a href="https://colab.research.google.com/github/samsung-ai-course/6-7-edition/blob/main/Supervised%20Learning/wine_auto_ml.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# AutoML with TPOT - Wine Quality Dataset
This notebook demonstrates the use of AutoML through the **TPOT** library on the Wine Quality dataset. [TPOT](https://epistasislab.github.io/tpot/using/) automatically finds the best machine learning pipeline for your data.

All major cloud providers offer some type of AutoML service where you can upload data and have a model trained and deployed for you. As you might expect, the resulting model tends to perform worse than a customized solution, but as always it is a matter of trade-offs: you give up some accuracy in exchange for speed and convenience.

In [2]:
# See https://epistasislab.github.io/tpot/installing/ for the full installation walkthrough
%pip install tpot

Collecting tpot
  Downloading TPOT-0.12.2-py3-none-any.whl.metadata (2.0 kB)
Collecting deap>=1.2 (from tpot)
  Downloading deap-1.4.1-cp310-cp310-manylinux_2_5_x86_64.manylinux1_x86_64.manylinux_2_17_x86_64.manylinux2014_x86_64.whl.metadata (13 kB)
Collecting update-checker>=0.16 (from tpot)
  Downloading update_checker-0.18.0-py3-none-any.whl.metadata (2.3 kB)
Collecting stopit>=1.1.1 (from tpot)
  Downloading stopit-1.1.2.tar.gz (18 kB)
  Preparing metadata (setup.py) ... done
Downloading TPOT-0.12.2-py3-none-any.whl (87 kB)
Downloading deap-1.4.1-cp310-cp310-manylinux_2_5_x86_64.manylinux1_x86_64.manylinux_2_17_x86_64.manylinux2014_x86_64.whl (135 kB)
Downloading update_checker-0.18.0-py3-none-any.whl (7.0 kB)
Building wheel

In [4]:
# Import libraries
import pandas as pd
from sklearn.model_selection import train_test_split
from tpot import TPOTClassifier

# Load the dataset
data = pd.read_csv('https://raw.githubusercontent.com/samsung-ai-course/6-7-edition/refs/heads/main/Supervised%20Learning/Datasets/WineQT.csv')
X = data.drop('quality', axis=1)
y = data['quality']

# NOTE: the raw quality scores are used as class labels (no binning).
# Feel free to bin them into coarser classes and see how the results change.

# Split data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

print(f"Training data shape: {X_train.shape}")
print(f"Testing data shape: {X_test.shape}")

Training data shape: (914, 12)
Testing data shape: (229, 12)
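
The note in the cell above mentions binning the target. A minimal sketch of one way to do it with `pd.cut` follows. The function name `bin_quality`, the cut points, and the class labels are illustrative assumptions rather than part of the original notebook, and the small `sample` series stands in for `data['quality']`.

```python
import pandas as pd


def bin_quality(quality: pd.Series) -> pd.Series:
    """Map integer quality scores to three coarse classes.

    The cut points below are hypothetical; pick ones that suit your data.
    Intervals are right-inclusive: (0, 4], (4, 6], (6, 10].
    """
    return pd.cut(quality, bins=[0, 4, 6, 10], labels=["low", "medium", "high"])


# Small stand-in for data['quality']
sample = pd.Series([3, 5, 6, 7, 8], name="quality")
binned = bin_quality(sample)
print(binned.tolist())
```

In the notebook you could assign `y = bin_quality(data['quality'])` before the train/test split. Fewer, broader classes usually make the classification task easier, so accuracy figures before and after binning are not directly comparable.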


## Running AutoML with TPOT
TPOT will explore various pipelines to find the best one for the dataset.

In [5]:
# Initialize TPOT
tpot = TPOTClassifier(generations=5, population_size=50, verbosity=2, random_state=42, n_jobs=-1)

# Fit TPOT on the training data
tpot.fit(X_train, y_train)

# Evaluate the best pipeline on the test set
print(f"Test Accuracy: {tpot.score(X_test, y_test):.4f}")

Optimization Progress:   0%|          | 0/300 [00:00<?, ?pipeline/s]


Generation 1 - Current best internal CV score: 0.6181108508977362

Generation 2 - Current best internal CV score: 0.6432354530715185

Generation 3 - Current best internal CV score: 0.6476250525430853

Generation 4 - Current best internal CV score: 0.6476250525430853

Generation 5 - Current best internal CV score: 0.6476250525430853

Best pipeline: ExtraTreesClassifier(input_matrix, bootstrap=False, criterion=entropy, max_features=0.35000000000000003, min_samples_leaf=1, min_samples_split=5, n_estimators=100)
Test Accuracy: 0.6987
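
The progress bar total of 300 follows from the search settings. According to the TPOT documentation, a run evaluates `population_size + generations * offspring_size` pipelines, and `offspring_size` defaults to `population_size` when it is not set. The helper below is only a sketch of that arithmetic; `tpot_total_evaluations` is a hypothetical name, not a TPOT function.

```python
from typing import Optional


def tpot_total_evaluations(
    generations: int,
    population_size: int,
    offspring_size: Optional[int] = None,
) -> int:
    """Number of pipelines evaluated: the initial population plus the offspring of each generation."""
    if offspring_size is None:
        # Assumed default, per the TPOT docs: offspring_size equals population_size
        offspring_size = population_size
    return population_size + generations * offspring_size


total = tpot_total_evaluations(generations=5, population_size=50)
print(total)  # 50 + 5 * 50 = 300
```

Because each evaluation involves cross-validating a full pipeline, run time grows roughly in proportion to this count, which is worth keeping in mind before raising `generations` or `population_size`.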


## Exporting the Best Pipeline
TPOT can save the best pipeline as a Python script for further analysis.

In [6]:
# Export the best pipeline
tpot.export('best_pipeline.py')

## Analysis of the Best Pipeline
The exported Python script contains the code for the best pipeline discovered by TPOT. You can load and analyze it to understand the steps TPOT took.

In [7]:
# Print the contents of the exported pipeline script

!cat best_pipeline.py

import numpy as np
import pandas as pd
from sklearn.ensemble import ExtraTreesClassifier
from sklearn.model_selection import train_test_split

# NOTE: Make sure that the outcome column is labeled 'target' in the data file
tpot_data = pd.read_csv('PATH/TO/DATA/FILE', sep='COLUMN_SEPARATOR', dtype=np.float64)
features = tpot_data.drop('target', axis=1)
training_features, testing_features, training_target, testing_target = \
            train_test_split(features, tpot_data['target'], random_state=42)

# Average CV score on the training set was: 0.6476250525430853
exported_pipeline = ExtraTreesClassifier(bootstrap=False, criterion="entropy", max_features=0.35000000000000003, min_samples_leaf=1, min_samples_split=5, n_estimators=100)
# Fix random state in exported estimator
if hasattr(exported_pipeline, 'random_state'):
    setattr(exported_pipeline, 'random_state', 42)

exported_pipeline.fit(training_features, training_target)
results = exported_pipeline.predict(testing_features)
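
The exported script is a template: the file path, the column separator, and the `'target'` column name are placeholders that must be adapted before it runs. The sketch below shows one possible adaptation that takes a DataFrame and a target column name (here `'quality'`) and reports both a 5-fold cross-validated accuracy on the training split and the held-out accuracy. The function name `run_exported_pipeline` and the synthetic `demo` frame are assumptions made for illustration; in the notebook you would pass the `data` frame loaded earlier.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import ExtraTreesClassifier
from sklearn.model_selection import cross_val_score, train_test_split


def run_exported_pipeline(df: pd.DataFrame, target_col: str = "quality", seed: int = 42):
    """Fit the exported ExtraTrees configuration; return (mean CV accuracy, test accuracy)."""
    features = df.drop(target_col, axis=1)
    X_tr, X_te, y_tr, y_te = train_test_split(features, df[target_col], random_state=seed)

    # Hyperparameters copied from the exported pipeline (max_features written as 0.35)
    model = ExtraTreesClassifier(
        bootstrap=False,
        criterion="entropy",
        max_features=0.35,
        min_samples_leaf=1,
        min_samples_split=5,
        n_estimators=100,
        random_state=seed,
    )

    # 5-fold cross-validated accuracy on the training split only
    cv_mean = cross_val_score(model, X_tr, y_tr, cv=5, scoring="accuracy").mean()

    # Refit on the whole training split and score on the held-out rows
    model.fit(X_tr, y_tr)
    return cv_mean, model.score(X_te, y_te)


# Synthetic stand-in data so the sketch runs without downloading anything
rng = np.random.default_rng(0)
demo = pd.DataFrame(rng.normal(size=(200, 4)), columns=["f1", "f2", "f3", "f4"])
demo["quality"] = (demo["f1"] + demo["f2"] > 0).astype(int) + 5  # two classes: 5 and 6

cv_accuracy, test_accuracy = run_exported_pipeline(demo)
print(f"CV accuracy: {cv_accuracy:.3f}, test accuracy: {test_accuracy:.3f}")
```

Note that the exported template calls `train_test_split` with its default test size, while the notebook uses `test_size=0.2`, so scores from the script will not match the notebook's test accuracy exactly unless the split is made the same.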
