## Solving Tabular Classification Tasks

First, import the `AutoMLClassifier` class:

In [1]:
from alpha_automl import AutoMLClassifier
import pandas as pd

### Generating Pipelines for CSV Datasets

In this example, we generate pipelines for a CSV dataset, using the 299_libras_move dataset.

In [2]:
output_path = '/Users/rlopez/D3M/tmp/'
train_dataset = pd.read_csv('/Users/rlopez/D3M/examples/datasets/299_libras_move/train_data.csv')
test_dataset = pd.read_csv('/Users/rlopez/D3M/examples/datasets/299_libras_move/test_data.csv')
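
The paths above are specific to one machine. As a minimal sketch, the two CSV paths can be derived from a single base directory with the standard-library `pathlib` module, so only one string changes between environments. The helper name `dataset_paths` is hypothetical and not part of AlphaAutoML; the file names are the ones used in the cell above.

```python
from pathlib import Path


def dataset_paths(base_dir, dataset_name="299_libras_move"):
    """Return (train_path, test_path) for a dataset folder under base_dir."""
    dataset_dir = Path(base_dir) / dataset_name
    return dataset_dir / "train_data.csv", dataset_dir / "test_data.csv"


# The base directory is the one used in this notebook; adjust it locally.
train_path, test_path = dataset_paths("/Users/rlopez/D3M/examples/datasets")
print(train_path.name, test_path.name)
```

The resulting `Path` objects can be passed directly to `pd.read_csv`.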

Removing the target column from the features of the training dataset:

In [3]:
X_train = train_dataset.drop(columns=['class'])
X_train

Unnamed: 0,xcoord1,ycoord1,xcoord2,ycoord2,xcoord3,ycoord3,xcoord4,ycoord4,xcoord5,ycoord5,...,xcoord41,ycoord41,xcoord42,ycoord42,xcoord43,ycoord43,xcoord44,ycoord44,xcoord45,ycoord45
0,0.82979,0.76620,0.82979,0.76620,0.82979,0.77083,0.82785,0.77083,0.82979,0.76620,...,0.41199,0.45370,0.37524,0.43750,0.33269,0.43056,0.29787,0.44213,0.26886,0.47222
1,0.80271,0.54630,0.80077,0.54398,0.80271,0.54398,0.80271,0.54630,0.80271,0.54398,...,0.20503,0.64583,0.20503,0.68056,0.20696,0.71296,0.21083,0.74537,0.21277,0.77315
2,0.78917,0.59028,0.79110,0.59028,0.79110,0.59028,0.79304,0.59028,0.79110,0.59028,...,0.20503,0.57407,0.19149,0.57176,0.18569,0.57407,0.18956,0.57639,0.19149,0.56944
3,0.88395,0.61574,0.88201,0.61806,0.87234,0.62037,0.87041,0.61574,0.84526,0.61574,...,0.27079,0.65972,0.26886,0.62731,0.27660,0.59259,0.27660,0.56250,0.27853,0.53009
4,0.60155,0.77315,0.59768,0.77315,0.59961,0.77315,0.59381,0.77083,0.58801,0.75000,...,0.63830,0.47917,0.63250,0.52778,0.63830,0.57407,0.63830,0.63194,0.63830,0.68750
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
283,0.85300,0.57639,0.85493,0.57407,0.85300,0.57639,0.85300,0.57407,0.85493,0.56944,...,0.19923,0.80787,0.17021,0.79630,0.14313,0.75231,0.12959,0.69213,0.12959,0.62037
284,0.66925,0.78009,0.66925,0.78009,0.66925,0.78009,0.67118,0.78009,0.66925,0.78009,...,0.67311,0.28704,0.66925,0.26389,0.66731,0.24306,0.66731,0.22454,0.66538,0.20602
285,0.57060,0.65741,0.57060,0.65741,0.57060,0.65509,0.56867,0.65046,0.54739,0.64120,...,0.36944,0.62500,0.42166,0.62269,0.47389,0.62269,0.52418,0.63194,0.56286,0.64120
286,0.62282,0.65278,0.62476,0.65046,0.62669,0.64815,0.61315,0.63426,0.55319,0.60185,...,0.28433,0.66435,0.30754,0.63889,0.33462,0.61574,0.37331,0.58102,0.42747,0.54398


Selecting the target column of the training dataset:

In [4]:
y_train = train_dataset[['class']]
y_train

Unnamed: 0,class
0,12
1,15
2,7
3,12
4,3
...,...
283,12
284,8
285,2
286,1


### Adding New Primitives into AlphaAutoML's Search Space

In [5]:
automl = AutoMLClassifier(output_path, time_bound=10)

In [6]:
from xgboost import XGBClassifier
from lightgbm import LGBMClassifier

xgboost_object = XGBClassifier()
lgbm_object = LGBMClassifier()

# Adding XGBClassifier and LGBMClassifier to AlphaAutoML
automl.add_primitives([(xgboost_object, 'CLASSIFICATION')])
automl.add_primitives([(lgbm_object, 'CLASSIFICATION')])

### Searching for Pipelines

In [7]:
automl.fit(X_train, y_train)

INFO:alpha_automl.automl_api:Found pipeline, time=0:00:02, scoring...
INFO:alpha_automl.automl_api:Scored pipeline, score=0.19444444444444445
INFO:alpha_automl.automl_api:Found pipeline, time=0:00:02, scoring...
INFO:alpha_automl.automl_api:Scored pipeline, score=0.19444444444444445
INFO:alpha_automl.automl_api:Found pipeline, time=0:00:02, scoring...
INFO:alpha_automl.automl_api:Scored pipeline, score=0.20833333333333334
INFO:alpha_automl.automl_api:Found pipeline, time=0:00:02, scoring...
INFO:alpha_automl.automl_api:Scored pipeline, score=0.1388888888888889
INFO:alpha_automl.automl_api:Found pipeline, time=0:00:02, scoring...
INFO:alpha_automl.automl_api:Scored pipeline, score=0.041666666666666664
INFO:alpha_automl.automl_api:Found pipeline, time=0:00:02, scoring...
INFO:alpha_automl.automl_api:Scored pipeline, score=0.3194444444444444
INFO:alpha_automl.automl_api:Found pipeline, time=0:00:02, scoring...
INFO:alpha_automl.automl_api:Scored pipeline, score=0.3333333333333333
...

INFO:alpha_automl.automl_api:Scored pipeline, score=0.2638888888888889
INFO:alpha_automl.automl_api:Found pipeline, time=0:00:06, scoring...
INFO:alpha_automl.automl_api:Scored pipeline, score=0.3888888888888889
INFO:alpha_automl.automl_api:Found pipeline, time=0:00:06, scoring...
INFO:alpha_automl.automl_api:Scored pipeline, score=0.6111111111111112
INFO:alpha_automl.automl_api:Found pipeline, time=0:00:08, scoring...
INFO:alpha_automl.automl_api:Scored pipeline, score=0.4583333333333333
INFO:alpha_automl.automl_api:Found pipeline, time=0:00:08, scoring...
INFO:alpha_automl.automl_api:Scored pipeline, score=0.16666666666666666
INFO:alpha_automl.automl_api:Found pipeline, time=0:00:08, scoring...
INFO:alpha_automl.automl_api:Scored pipeline, score=0.4861111111111111
INFO:alpha_automl.automl_api:Found pipeline, time=0:00:08, scoring...
INFO:alpha_automl.automl_api:Scored pipeline, score=0.1527777777777778
INFO:alpha_automl.automl_api:Found pipeline, time=0:00:08, scoring...
...

INFO:alpha_automl.automl_api:Found pipeline, time=0:00:20, scoring...
INFO:alpha_automl.automl_api:Scored pipeline, score=0.041666666666666664
INFO:alpha_automl.automl_api:Found pipeline, time=0:00:21, scoring...
INFO:alpha_automl.automl_api:Scored pipeline, score=0.4583333333333333
INFO:alpha_automl.automl_api:Found pipeline, time=0:00:21, scoring...
INFO:alpha_automl.automl_api:Scored pipeline, score=0.4722222222222222
INFO:alpha_automl.automl_api:Found pipeline, time=0:00:21, scoring...
INFO:alpha_automl.automl_api:Scored pipeline, score=0.4861111111111111
INFO:alpha_automl.automl_api:Found pipeline, time=0:00:21, scoring...
INFO:alpha_automl.automl_api:Scored pipeline, score=0.3888888888888889
INFO:alpha_automl.automl_api:Found pipeline, time=0:00:21, scoring...
INFO:alpha_automl.automl_api:Scored pipeline, score=0.3194444444444444
INFO:alpha_automl.automl_api:Found pipeline, time=0:00:21, scoring...
INFO:alpha_automl.automl_api:Scored pipeline, score=0.4861111111111111
...

INFO:alpha_automl.automl_api:Found pipeline, time=0:01:01, scoring...
INFO:alpha_automl.automl_api:Scored pipeline, score=0.7916666666666666
INFO:alpha_automl.automl_api:Found pipeline, time=0:01:01, scoring...
INFO:alpha_automl.automl_api:Scored pipeline, score=0.20833333333333334
INFO:alpha_automl.automl_api:Found pipeline, time=0:01:01, scoring...
INFO:alpha_automl.automl_api:Scored pipeline, score=0.6111111111111112
INFO:alpha_automl.automl_api:Found pipeline, time=0:01:02, scoring...
INFO:alpha_automl.automl_api:Scored pipeline, score=0.4722222222222222
INFO:alpha_automl.automl_api:Found pipeline, time=0:01:02, scoring...
INFO:alpha_automl.automl_api:Scored pipeline, score=0.4444444444444444
INFO:alpha_automl.automl_api:Found pipeline, time=0:01:02, scoring...
INFO:alpha_automl.automl_api:Scored pipeline, score=0.3333333333333333
INFO:alpha_automl.automl_api:Found pipeline, time=0:01:03, scoring...
INFO:alpha_automl.automl_api:Scored pipeline, score=0.7916666666666666
...
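
The search log is long, so it can help to summarize it. The sketch below parses log text in the format shown above and pairs each `Found pipeline` line with the `Scored pipeline` line that follows it. It assumes the two line types always alternate, as they do in this output; the function `parse_search_log` is illustrative and not part of AlphaAutoML. The sample lines are copied from the log above.

```python
import re

FOUND = re.compile(r"Found pipeline, time=(\d+):(\d{2}):(\d{2})")
SCORED = re.compile(r"Scored pipeline, score=([0-9.]+)")


def parse_search_log(log_text):
    """Return a list of (elapsed_seconds, score) pairs from the search log."""
    results = []
    elapsed = None
    for line in log_text.splitlines():
        found = FOUND.search(line)
        if found:
            hours, minutes, seconds = (int(g) for g in found.groups())
            elapsed = hours * 3600 + minutes * 60 + seconds
            continue
        scored = SCORED.search(line)
        # Only keep a score that follows a "Found pipeline" line.
        if scored and elapsed is not None:
            results.append((elapsed, float(scored.group(1))))
            elapsed = None
    return results


sample_log = """\
INFO:alpha_automl.automl_api:Found pipeline, time=0:00:02, scoring...
INFO:alpha_automl.automl_api:Scored pipeline, score=0.19444444444444445
INFO:alpha_automl.automl_api:Found pipeline, time=0:00:06, scoring...
INFO:alpha_automl.automl_api:Scored pipeline, score=0.6111111111111112
INFO:alpha_automl.automl_api:Found pipeline, time=0:01:01, scoring...
INFO:alpha_automl.automl_api:Scored pipeline, score=0.7916666666666666
"""

results = parse_search_log(sample_log)
best_time, best_score = max(results, key=lambda pair: pair[1])
print(f"{len(results)} pipelines scored; best {best_score:.3f} at {best_time}s")
```

Truncated or unmatched lines are skipped, so the function also works on partial log output.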

After the pipeline search is complete, we can display the leaderboard:

In [8]:
automl.plot_leaderboard()

ranking,summary,accuracy
1,"SimpleImputer, RobustScaler, SVC",0.847222
2,"SimpleImputer, MaxAbsScaler, ExtraTreesClassifier",0.847222
3,"SimpleImputer, MaxAbsScaler, RandomForestClassifier",0.847222
4,"SimpleImputer, StandardScaler, SVC",0.833333
5,"SimpleImputer, SVC",0.833333
6,"SimpleImputer, MaxAbsScaler, SVC",0.819444
7,"SimpleImputer, MaxAbsScaler, LinearSVC",0.805556
8,"SimpleImputer, MaxAbsScaler, BaggingClassifier",0.805556
9,"SimpleImputer, MaxAbsScaler, New_lgbmc",0.791667
10,"SimpleImputer, MaxAbsScaler, KNeighborsClassifier",0.791667


Removing the target column from the features of the test dataset:

In [9]:
X_test = test_dataset.drop(columns=['class'])
X_test

Unnamed: 0,xcoord1,ycoord1,xcoord2,ycoord2,xcoord3,ycoord3,xcoord4,ycoord4,xcoord5,ycoord5,...,xcoord41,ycoord41,xcoord42,ycoord42,xcoord43,ycoord43,xcoord44,ycoord44,xcoord45,ycoord45
0,0.14700,0.47917,0.17602,0.47917,0.22244,0.48380,0.27853,0.50463,0.35783,0.50694,...,0.49130,0.49769,0.41779,0.50231,0.33656,0.50463,0.26306,0.49306,0.19729,0.48843
1,0.57060,0.51852,0.56867,0.51852,0.56867,0.52083,0.56093,0.52546,0.55126,0.47685,...,0.49903,0.49537,0.49323,0.49769,0.48549,0.46065,0.49130,0.51389,0.50097,0.49537
2,0.85300,0.74537,0.85880,0.74074,0.85880,0.74074,0.85687,0.74074,0.85300,0.74537,...,0.32302,0.46296,0.29981,0.44676,0.27466,0.44444,0.24565,0.44444,0.21470,0.46528
3,0.49710,0.74074,0.49903,0.73843,0.49903,0.73843,0.49903,0.73843,0.49710,0.73380,...,0.36750,0.33333,0.35590,0.33333,0.35397,0.34954,0.35203,0.37269,0.35977,0.43287
4,0.74662,0.25694,0.74662,0.25463,0.74662,0.25694,0.74662,0.25463,0.74662,0.25926,...,0.20116,0.36343,0.20309,0.32870,0.21083,0.29630,0.21857,0.26852,0.22631,0.24074
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
67,0.38685,0.28009,0.38878,0.28009,0.38878,0.28009,0.38878,0.28009,0.38685,0.28009,...,0.39265,0.71065,0.37718,0.70602,0.35977,0.70370,0.35010,0.70139,0.35203,0.70139
68,0.80851,0.53241,0.80658,0.53472,0.80658,0.53472,0.80271,0.53241,0.80077,0.53472,...,0.32882,0.50694,0.31335,0.50463,0.29400,0.50231,0.27853,0.49769,0.26499,0.49537
69,0.84526,0.51157,0.84526,0.51157,0.84526,0.51157,0.84333,0.51157,0.84526,0.51157,...,0.83559,0.62037,0.83366,0.59954,0.83752,0.57176,0.83752,0.54630,0.83366,0.51620
70,0.25145,0.54398,0.25145,0.54398,0.25145,0.54398,0.25145,0.54167,0.25145,0.54398,...,0.51064,0.73611,0.53385,0.73843,0.55513,0.73843,0.57447,0.74074,0.59574,0.74306


Selecting the target column of the test dataset:

In [10]:
y_test = test_dataset[['class']]
y_test

Unnamed: 0,class
0,2
1,9
2,12
3,3
4,14
...,...
67,5
68,7
69,6
70,14


Pipeline predictions are accessed with:

In [11]:
y_pred = automl.predict(X_test)
y_pred

array([ 2,  9, 12,  3, 12,  8, 14, 12,  9,  4,  1,  7,  1,  8,  5,  7, 12,
        9, 11, 11,  3, 13, 11,  7, 15,  9, 13,  3, 15, 12, 10,  7, 15, 11,
        9,  8, 12,  9, 11,  4,  2, 10,  3,  6,  8, 13,  6, 12,  9, 14,  9,
        2,  2,  3, 10,  9,  5, 13, 13,  7, 13,  5, 15,  6, 15,  2,  4,  5,
        7,  6, 14,  2])

The pipeline can be evaluated against a held-out dataset with:

In [12]:
automl.score(X_test, y_test)

INFO:alpha_automl.automl_api:Metric: accuracy, Score: 0.7361111111111112


{'metric': 'accuracy', 'score': 0.7361111111111112}
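
To make the reported number concrete, here is a minimal sketch of how an accuracy score is computed, assuming `score` uses plain classification accuracy as the `Metric: accuracy` log line suggests. The `accuracy` helper is illustrative and not part of AlphaAutoML. The labels and predictions are the first five rows shown in the outputs above.

```python
def accuracy(y_true, y_pred):
    """Return the fraction of predictions that equal the true label."""
    if len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must have the same length")
    if not y_true:
        raise ValueError("cannot compute accuracy on empty input")
    correct = sum(1 for t, p in zip(y_true, y_pred) if t == p)
    return correct / len(y_true)


# First five test labels and predictions, copied from the outputs above.
y_true_head = [2, 9, 12, 3, 14]
y_pred_head = [2, 9, 12, 3, 12]

print(accuracy(y_true_head, y_pred_head))  # 4 of 5 correct -> 0.8
```

By the same definition, and assuming all 72 predictions shown are scored, a score of 0.7361 corresponds to 53 correct labels, since 53/72 ≈ 0.7361.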

### Visualizing Pipelines Using PipelineProfiler

To explore the generated pipelines, we can use [PipelineProfiler](https://github.com/VIDA-NYU/PipelineVis), a visualization tool that lets users compare and explore the pipelines generated by the AlphaAutoML system.

After the pipeline search process is completed, we can use PipelineProfiler with:

In [None]:
automl.plot_comparison_pipelines()

For more information about how to use PipelineProfiler, see [this article](https://towardsdatascience.com/exploring-auto-sklearn-models-with-pipelineprofiler-5b2c54136044). A [video demo](https://www.youtube.com/watch?v=2WSYoaxLLJ8) is also available.