## Image Classification Pipeline for the Selfie Classification Task

First, import the class `AutoMLClassifier`.

In this example, we generate pipelines for an image dataset described by a CSV file that lists image paths and labels.
We sample 1,000 rows and divide them into training and test sets using `train_test_split`.

For this task, we use the Selfie Dataset, a custom dataset for distinguishing selfies from other images.  
The original images were collected from the [Selfie-Image-Detection-Dataset](https://www.kaggle.com/datasets/jigrubhatt/selfieimagedetectiondataset) on Kaggle.  
You can download the dataset via the following [Google Drive link](https://drive.google.com/file/d/1y5d_3LT5jQ4RF7LAKmXjEFu041dH7-Uk/view?usp=drive_link).

In [1]:
import os
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

from alpha_automl import AutoMLClassifier

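# Folder that contains the image files referenced by the CSV
media_path = os.path.join(os.getcwd(), 'datasets/selfie/')
# Load the metadata and draw a random sample of 1,000 rows
dataset = pd.read_csv('datasets/selfie/learningData.csv').sample(1000)
# Prefix each image file name with the media folder path
dataset["image"] = dataset["image"].apply(lambda x: os.path.join(media_path, x))
# Features are the image paths; the target is the 0/1 label
X = dataset[["image"]]
y = dataset[["label"]]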
X_train, X_test, y_train, y_test = train_test_split(
    X, 
    y, 
    test_size=0.2, 
    shuffle=True,
    random_state=42,
)
X_train

| | image |
|---|---|
| 1513 | /home/yfw215/alpha-automl/examples/datasets/se... |
| 5341 | /home/yfw215/alpha-automl/examples/datasets/se... |
| 1334 | /home/yfw215/alpha-automl/examples/datasets/se... |
| 3615 | /home/yfw215/alpha-automl/examples/datasets/se... |
| 3783 | /home/yfw215/alpha-automl/examples/datasets/se... |
| ... | ... |
| 6413 | /home/yfw215/alpha-automl/examples/datasets/se... |
| 7205 | /home/yfw215/alpha-automl/examples/datasets/se... |
| 4624 | /home/yfw215/alpha-automl/examples/datasets/se... |
| 3891 | /home/yfw215/alpha-automl/examples/datasets/se... |
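
As a rough sanity check on the split sizes, the following dependency-free sketch mimics an 80/20 shuffled split over 1,000 row indices. It only illustrates the arithmetic and is not how scikit-learn implements `train_test_split`; the helper name `shuffled_split` is hypothetical.

```python
import random


def shuffled_split(n_rows, test_size=0.2, seed=42):
    """Shuffle row indices and cut them into train/test parts (illustrative only)."""
    indices = list(range(n_rows))
    random.Random(seed).shuffle(indices)  # seeded, so the split is reproducible
    n_test = round(n_rows * test_size)    # simple rounding; an assumption of this sketch
    return indices[n_test:], indices[:n_test]


train_idx, test_idx = shuffled_split(1000)
print(len(train_idx), len(test_idx))  # 800 200
```

With 1,000 sampled rows and `test_size=0.2`, the training set holds 800 rows and the test set 200, which matches the class counts reported for `y_train`.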


In [2]:
y_train.value_counts()

label
0        406
1        394
dtype: int64
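
The two classes are nearly balanced, so plain accuracy is a reasonable metric here. A small sketch of the majority-class baseline, using the counts printed above, gives a reference point for reading the scores later on: a useful pipeline should do clearly better than always predicting the most frequent label.

```python
# Class counts copied from the y_train.value_counts() output above.
class_counts = {0: 406, 1: 394}

total = sum(class_counts.values())
majority_label = max(class_counts, key=class_counts.get)
baseline_accuracy = class_counts[majority_label] / total

print(total)                       # 800
print(majority_label)              # 0
print(f"{baseline_accuracy:.4f}")  # 0.5075
```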

### Adding New Primitives into AlphaAutoML's Search Space

In [3]:

automl = AutoMLClassifier(time_bound=20)

In [4]:
from alpha_automl.wrapper_primitives.huggingface_image import HuggingfaceImageTransformer 

model_id = 'openai/clip-vit-base-patch32'
my_clip_encoder = HuggingfaceImageTransformer(model_id=model_id)
automl.add_primitives([(my_clip_encoder, 'IMAGE_ENCODER')])



In [5]:
automl.fit(X_train, y_train)

INFO:alpha_automl.automl_api:Found pipeline, time=0:00:07, scoring...
INFO:alpha_automl.automl_api:Scored pipeline, score=0.635
INFO:alpha_automl.automl_api:Found pipeline, time=0:00:14, scoring...
INFO:alpha_automl.automl_api:Scored pipeline, score=0.51
INFO:alpha_automl.automl_api:Found pipeline, time=0:00:21, scoring...
INFO:alpha_automl.automl_api:Scored pipeline, score=0.635
INFO:alpha_automl.automl_api:Found pipeline, time=0:00:29, scoring...
INFO:alpha_automl.automl_api:Scored pipeline, score=0.51
INFO:alpha_automl.automl_api:Found pipeline, time=0:00:36, scoring...
INFO:alpha_automl.automl_api:Scored pipeline, score=0.49
INFO:alpha_automl.automl_api:Found pipeline, time=0:00:44, scoring...
INFO:alpha_automl.automl_api:Scored pipeline, score=0.595
INFO:alpha_automl.automl_api:Found pipeline, time=0:03:05, scoring...
INFO:alpha_automl.automl_api:Scored pipeline, score=0.88
INFO:alpha_automl.automl_api:Found pipeline, time=0:03:12, scoring...
INFO:alpha_automl.automl_api:Scored pi
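
The search log interleaves "Found pipeline" and "Scored pipeline" messages. If the log is captured as text, a sketch like the following pulls out the scores so the best one can be tracked. The regular expression assumes the line format shown above, `log_text` holds a few lines copied from that output, and `extract_scores` is a hypothetical helper, not part of alpha_automl.

```python
import re

# A few lines copied from the search log above.
log_text = """\
INFO:alpha_automl.automl_api:Found pipeline, time=0:00:07, scoring...
INFO:alpha_automl.automl_api:Scored pipeline, score=0.635
INFO:alpha_automl.automl_api:Found pipeline, time=0:00:14, scoring...
INFO:alpha_automl.automl_api:Scored pipeline, score=0.51
INFO:alpha_automl.automl_api:Found pipeline, time=0:03:05, scoring...
INFO:alpha_automl.automl_api:Scored pipeline, score=0.88
"""

SCORE_PATTERN = re.compile(r"Scored pipeline, score=([0-9.]+)")


def extract_scores(text):
    """Return every pipeline score found in the log text, in order of appearance."""
    return [float(value) for value in SCORE_PATTERN.findall(text)]


scores = extract_scores(log_text)
print(scores)       # [0.635, 0.51, 0.88]
print(max(scores))  # 0.88
```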

### Exploring Pipelines

After the pipeline search is complete, we can display the leaderboard:

In [6]:
automl.plot_leaderboard()

| ranking | pipeline | accuracy_score |
|---|---|---|
| 1 | ColumnTransformer, HuggingfaceCLIPTransformer, MaxAbsScaler, SelectPercentile, DecisionTreeClassifier | 0.97 |
| 2 | ColumnTransformer, HuggingfaceCLIPTransformer, MaxAbsScaler, DecisionTreeClassifier | 0.945 |
| 3 | ColumnTransformer, HuggingfaceCLIPTransformer, MaxAbsScaler, SelectKBest, DecisionTreeClassifier | 0.93 |
| 4 | ColumnTransformer, HuggingfaceCLIPTransformer, MaxAbsScaler, GenericUnivariateSelect, LinearSVC | 0.905 |
| 5 | ColumnTransformer, HuggingfaceCLIPTransformer, MaxAbsScaler, GenericUnivariateSelect, DecisionTreeClassifier | 0.88 |
| 6 | ColumnTransformer, HuggingfaceCLIPTransformer, MaxAbsScaler, GenericUnivariateSelect, RandomForestClassifier | 0.88 |
| 7 | ColumnTransformer, HuggingfaceCLIPTransformer, RobustScaler, GenericUnivariateSelect, DecisionTreeClassifier | 0.88 |
| 8 | ColumnTransformer, HogTransformer, MaxAbsScaler, SelectKBest, RandomForestClassifier | 0.785 |
| 9 | ColumnTransformer, HogTransformer, MaxAbsScaler, SelectKBest, ExtraTreesClassifier | 0.775 |
| 10 | ColumnTransformer, HogTransformer, MaxAbsScaler, SelectKBest, LinearSVC | 0.755 |
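
To see how much the image encoder matters, one can group the leaderboard rows by encoder. The sketch below hard-codes the encoder name and accuracy of each row from the table above (it does not call alpha_automl) and compares the scores per encoder.

```python
from collections import defaultdict
from statistics import mean

# (image encoder, accuracy) for each leaderboard row, copied from the table above.
leaderboard = [
    ("HuggingfaceCLIPTransformer", 0.97),
    ("HuggingfaceCLIPTransformer", 0.945),
    ("HuggingfaceCLIPTransformer", 0.93),
    ("HuggingfaceCLIPTransformer", 0.905),
    ("HuggingfaceCLIPTransformer", 0.88),
    ("HuggingfaceCLIPTransformer", 0.88),
    ("HuggingfaceCLIPTransformer", 0.88),
    ("HogTransformer", 0.785),
    ("HogTransformer", 0.775),
    ("HogTransformer", 0.755),
]

# Collect the accuracies of all pipelines that share an encoder.
scores_by_encoder = defaultdict(list)
for encoder, accuracy in leaderboard:
    scores_by_encoder[encoder].append(accuracy)

for encoder, scores in scores_by_encoder.items():
    print(f"{encoder}: n={len(scores)}, mean={mean(scores):.3f}, best={max(scores):.3f}")
# HuggingfaceCLIPTransformer: n=7, mean=0.913, best=0.970
# HogTransformer: n=3, mean=0.772, best=0.785
```

In this run, every CLIP-based pipeline scored higher than every HOG-based one.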


To explore the produced pipelines, we can use [PipelineProfiler](https://github.com/VIDA-NYU/PipelineVis), a visualization tool that enables users to compare and explore the pipelines generated by the AlphaAutoML system.

Once the pipeline search has completed, PipelineProfiler can be launched with:

In [None]:
automl.plot_comparison_pipelines()

### Testing Pipelines

Predictions for the test set are obtained with:

In [7]:
y_pred = automl.predict(X_test)
y_pred

array([1, 1, 1, 1, 0, 1, 1, 1, 0, 1, 1, 0, 1, 1, 1, 1, 1, 0, 0, 0, 1, 1,
       0, 0, 0, 1, 0, 0, 1, 0, 1, 1, 0, 0, 0, 0, 1, 1, 1, 1, 0, 0, 0, 0,
       1, 1, 1, 0, 1, 1, 1, 1, 1, 0, 0, 1, 1, 0, 1, 1, 0, 0, 0, 1, 0, 0,
       1, 1, 1, 0, 1, 0, 0, 1, 0, 0, 1, 0, 1, 1, 0, 1, 0, 1, 1, 0, 1, 1,
       0, 1, 0, 0, 1, 1, 1, 1, 0, 0, 1, 1, 0, 0, 0, 0, 1, 0, 1, 0, 1, 1,
       0, 1, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 1, 0, 1, 1, 0, 1,
       1, 0, 0, 1, 0, 1, 1, 0, 0, 0, 0, 1, 1, 1, 1, 1, 0, 0, 0, 0, 1, 1,
       1, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 0, 0, 0, 1, 1, 0, 1, 0, 1, 0, 1,
       0, 1, 0, 0, 1, 0, 1, 1, 0, 0, 1, 0, 1, 0, 0, 1, 0, 1, 0, 0, 1, 0,
       0, 1])

The pipeline can be evaluated on the held-out test set with:

In [8]:
automl.score(X_test, y_test)

INFO:alpha_automl.automl_api:Metric: accuracy_score, Score: 0.98


{'metric': 'accuracy_score', 'score': 0.98}
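
For a binary task it is often useful to look beyond a single accuracy number. The following sketch shows what accuracy measures (the fraction of matching labels) and tallies the four confusion-matrix cells by hand. The label lists are small made-up examples, not the notebook's data, and treating label 1 as the positive class is an assumption. With the notebook's variables, one would pass something like `y_test["label"].tolist()` and `y_pred.tolist()` instead.

```python
from collections import Counter


def accuracy(y_true, y_pred):
    """Fraction of positions where the predicted label equals the true label."""
    if len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must have the same length")
    correct = sum(t == p for t, p in zip(y_true, y_pred))
    return correct / len(y_true)


def confusion_counts(y_true, y_pred, positive=1):
    """Count true/false positives and negatives for a binary task."""
    counts = Counter()
    for t, p in zip(y_true, y_pred):
        if t == positive:
            counts["tp" if p == positive else "fn"] += 1
        else:
            counts["fp" if p == positive else "tn"] += 1
    return counts


# Made-up labels for illustration only.
y_true_demo = [1, 0, 1, 1, 0, 0, 1, 0]
y_pred_demo = [1, 0, 0, 1, 0, 1, 1, 0]

print(accuracy(y_true_demo, y_pred_demo))  # 0.75
cells = confusion_counts(y_true_demo, y_pred_demo)
print(cells["tp"], cells["fp"], cells["fn"], cells["tn"])  # 3 1 1 3
```

For reference, an accuracy of 0.98 on the 200 test images corresponds to 196 correct predictions.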