## Solving Tabular Regression Tasks

First, import the `AutoMLRegressor` class:

In [1]:
from alpha_automl import AutoMLRegressor
import pandas as pd

### Generating Pipelines for CSV Datasets

In this example, we generate pipelines for a CSV dataset, 196_autoMpg. The training and test splits are loaded from separate CSV files.

In [2]:
output_path = '/Users/rlopez/D3M/tmp/'
train_dataset = pd.read_csv('/Users/rlopez/D3M/examples/datasets/196_autoMpg/train_data.csv')
test_dataset = pd.read_csv('/Users/rlopez/D3M/examples/datasets/196_autoMpg/test_data.csv')
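The absolute paths above are specific to one machine. As a minimal sketch of a more portable approach, the paths can be built from a single base directory with `pathlib`. The helper `load_split` is hypothetical (it is not part of alpha_automl), and it assumes the `<split>_data.csv` naming used above. The demo writes a tiny stand-in CSV to a temporary directory so that it runs anywhere.

```python
from pathlib import Path
import tempfile

import pandas as pd


def load_split(dataset_dir, split):
    """Read '<split>_data.csv' from dataset_dir into a DataFrame (hypothetical helper)."""
    return pd.read_csv(Path(dataset_dir) / f"{split}_data.csv")


# Demo on a throwaway directory; in practice, point dataset_dir at a local
# copy of the 196_autoMpg folder instead.
with tempfile.TemporaryDirectory() as tmp:
    toy = pd.DataFrame(
        {"cylinders": [8, 4], "weight": [3693, 2665], "class": [15.0, 32.0]}
    )
    toy.to_csv(Path(tmp) / "train_data.csv", index=False)
    loaded = load_split(tmp, "train")

print(loaded.shape)
```

With this layout, only the base directory needs to change when the notebook moves to another machine.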

Remove the target column (`class`) from the training data to obtain the features:

In [3]:
X_train = train_dataset.drop(columns=['class'])
X_train

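|  | cylinders | displacement | horsepower | weight | acceleration | model | origin |
| --- | --- | --- | --- | --- | --- | --- | --- |
| 0 | 8 | 350.0 | 165.0 | 3693 | 11.5 | 70 | 1 |
| 1 | 8 | 318.0 | 150.0 | 3436 | 11.0 | 70 | 1 |
| 2 | 8 | 302.0 | 140.0 | 3449 | 10.5 | 70 | 1 |
| 3 | 8 | 454.0 | 220.0 | 4354 | 9.0 | 70 | 1 |
| 4 | 8 | 440.0 | 215.0 | 4312 | 8.5 | 70 | 1 |
| ... | ... | ... | ... | ... | ... | ... | ... |
| 293 | 4 | 144.0 | 96.0 | 2665 | 13.9 | 82 | 3 |
| 294 | 4 | 135.0 | 84.0 | 2370 | 13.0 | 82 | 1 |
| 295 | 4 | 151.0 | 90.0 | 2950 | 17.3 | 82 | 1 |
| 296 | 4 | 135.0 | 84.0 | 2295 | 11.6 | 82 | 1 |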


Select the target column for the training dataset:

In [4]:
y_train = train_dataset[['class']]
y_train

|  | class |
| --- | --- |
| 0 | 15.0 |
| 1 | 18.0 |
| 2 | 17.0 |
| 3 | 14.0 |
| 4 | 14.0 |
| ... | ... |
| 293 | 32.0 |
| 294 | 36.0 |
| 295 | 27.0 |
| 296 | 32.0 |
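The two cells above split one DataFrame into features and target. The sketch below repeats the same pandas operations on a small made-up frame (the values are illustrative, not the full dataset) to show what each selection returns: `drop(columns=...)` returns a new DataFrame and leaves the original untouched, double brackets yield a one-column DataFrame, and single brackets yield a Series.

```python
import pandas as pd

# Toy stand-in for the training data; only the column name 'class' matches
# the notebook, the rest is illustrative.
toy = pd.DataFrame(
    {
        "cylinders": [8, 4, 4],
        "weight": [3693, 2665, 2370],
        "class": [15.0, 32.0, 36.0],
    }
)

X = toy.drop(columns=["class"])  # features: every column except the target
y_frame = toy[["class"]]         # double brackets -> DataFrame, shape (3, 1)
y_series = toy["class"]          # single brackets -> Series, shape (3,)

print(X.columns.tolist())
print(y_frame.shape, y_series.shape)
```

The notebook uses the double-bracket form, so `y_train` is a one-column DataFrame rather than a Series.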


### Searching Pipelines

In [5]:
automl = AutoMLRegressor(output_path, time_bound=10)
automl.fit(X_train, y_train)

INFO:alpha_automl.automl_api:Found pipeline, time=0:00:02, scoring...
INFO:alpha_automl.automl_api:Scored pipeline, score=-11.548409932589315
INFO:alpha_automl.automl_api:Found pipeline, time=0:00:02, scoring...
INFO:alpha_automl.automl_api:Scored pipeline, score=-14.708540608138133
INFO:alpha_automl.automl_api:Found pipeline, time=0:00:02, scoring...
INFO:alpha_automl.automl_api:Scored pipeline, score=-12.080522883240945
INFO:alpha_automl.automl_api:Found pipeline, time=0:00:02, scoring...
INFO:alpha_automl.automl_api:Scored pipeline, score=-14.708540608138133
INFO:alpha_automl.automl_api:Found pipeline, time=0:00:02, scoring...
INFO:alpha_automl.automl_api:Scored pipeline, score=-11.463559556336683
INFO:alpha_automl.automl_api:Found pipeline, time=0:00:02, scoring...
INFO:alpha_automl.automl_api:Scored pipeline, score=-14.491872393245075
INFO:alpha_automl.automl_api:Found pipeline, time=0:00:02, scoring...
INFO:alpha_automl.automl_api:Scored pipeline, score=-14.714207441820857
INFO:a

After the pipeline search is complete, we can display the leaderboard:

In [6]:
automl.plot_leaderboard()

| ranking | summary | max_error |
| --- | --- | --- |
| 1 | SimpleImputer, MaxAbsScaler, Lars | -11.222741 |
| 2 | SimpleImputer, MaxAbsScaler, LinearRegression | -11.46356 |
| 3 | SimpleImputer, StandardScaler, LinearRegression | -11.46356 |
| 4 | SimpleImputer, RobustScaler, LinearRegression | -11.46356 |
| 5 | SimpleImputer, StandardScaler, RidgeCV | -11.54841 |
| 6 | SimpleImputer, MaxAbsScaler, ARDRegression | -11.627414 |
| 7 | SimpleImputer, MaxAbsScaler, RidgeCV | -12.080523 |
| 8 | SimpleImputer, MaxAbsScaler, PassiveAggressiveRegressor | -12.979345 |
| 9 | SimpleImputer, MaxAbsScaler, HuberRegressor | -13.019221 |
| 10 | SimpleImputer, MaxAbsScaler, TheilSenRegressor | -13.05332 |


Remove the target column (`class`) from the test data to obtain the features:

In [7]:
X_test = test_dataset.drop(columns=['class'])
X_test

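|  | cylinders | displacement | horsepower | weight | acceleration | model | origin |
| --- | --- | --- | --- | --- | --- | --- | --- |
| 0 | 8 | 307 | 130.0 | 3504 | 12.0 | 70 | 1 |
| 1 | 8 | 304 | 150.0 | 3433 | 12.0 | 70 | 1 |
| 2 | 8 | 429 | 198.0 | 4341 | 10.0 | 70 | 1 |
| 3 | 8 | 390 | 190.0 | 3850 | 8.5 | 70 | 1 |
| 4 | 6 | 198 | 95.0 | 2833 | 15.5 | 70 | 1 |
| ... | ... | ... | ... | ... | ... | ... | ... |
| 95 | 6 | 181 | 110.0 | 2945 | 16.4 | 82 | 1 |
| 96 | 6 | 232 | 112.0 | 2835 | 14.7 | 82 | 1 |
| 97 | 4 | 140 | 86.0 | 2790 | 15.6 | 82 | 1 |
| 98 | 4 | 97 | 52.0 | 2130 | 24.6 | 82 | 2 |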


Select the target column for the test dataset:

In [8]:
y_test = test_dataset[['class']]
y_test

|  | class |
| --- | --- |
| 0 | 18.0 |
| 1 | 16.0 |
| 2 | 15.0 |
| 3 | 15.0 |
| 4 | 22.0 |
| ... | ... |
| 95 | 25.0 |
| 96 | 22.0 |
| 97 | 27.0 |
| 98 | 44.0 |


Pipeline predictions are accessed with:

In [9]:
y_pred = automl.predict(X_test)
y_pred

array([14.08142539, 13.85865698, 12.29690254, 14.05646111, 17.66295652,
       24.73077502, 21.16589936,  5.54482198, 23.13437337, 25.54645937,
       21.96006082, 11.24752756,  5.82209153, 20.56797595, 22.93415109,
       27.55726534, 23.93092069, 26.23409822, 12.90464557, 12.17674539,
       11.85104829, 11.11803549, 11.36098008, 18.40272393, 23.59795392,
       19.39107422, 24.54403185, 27.08405218, 10.04304236, 10.98174645,
       11.98663973, 19.88041871,  9.70423376, 26.80221186, 20.1810553 ,
       24.37842419, 12.66069436, 26.91407252, 16.49343766, 21.04234553,
       24.19524186, 10.96844655, 26.04201085, 26.88902146, 20.37126107,
       10.59633736, 24.03331731, 26.79954045, 28.94407867, 23.42044126,
       23.16749291, 33.30189166, 27.71775068, 34.20812229, 16.10132785,
       23.44139647, 32.01441478, 14.32894416, 21.19521185, 18.63063606,
       17.56029548, 25.42300381, 27.8593681 , 33.89073162, 35.65145164,
       20.7160088 , 23.07435433, 21.65879452, 25.6652809 , 21.05

The pipeline can be evaluated against a held-out dataset by calling `score`:

In [10]:
automl.score(X_test, y_test)

INFO:alpha_automl.automl_api:Metric: max_error, Score: 8.197032975604646


{'metric': 'max_error', 'score': 8.197032975604646}
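To make the reported metric concrete, here is a minimal sketch of the maximum-error calculation: the largest absolute difference between a true value and its prediction. It assumes that alpha_automl's `max_error` follows the usual definition (the one scikit-learn uses for `sklearn.metrics.max_error`), which this notebook does not confirm. The inputs are the first five test targets and the first five predictions shown earlier, so the result describes that five-row subset only and is not expected to match the full-dataset score.

```python
import numpy as np

# First five test targets and predictions, copied from the outputs above.
y_true = np.array([18.0, 16.0, 15.0, 15.0, 22.0])
y_hat = np.array([14.08142539, 13.85865698, 12.29690254, 14.05646111, 17.66295652])

# Maximum error: the single worst absolute residual.
residuals = np.abs(y_true - y_hat)
worst_row = int(np.argmax(residuals))
max_err = float(residuals[worst_row])

print(f"max_error on this subset: {max_err:.4f} (row {worst_row})")
```

On these five rows the worst residual is about 4.34, at row 4. The negative values in the leaderboard's `max_error` column are consistent with the common convention of negating error metrics so that larger is better, but that reading is an assumption rather than something the output states.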

### Visualizing Pipelines Using PipelineProfiler

To explore the generated pipelines, we can use [PipelineProfiler](https://github.com/VIDA-NYU/PipelineVis), a visualization tool for comparing and exploring the pipelines generated by the AlphaAutoML system.

After the pipeline search process is completed, we can use PipelineProfiler with:

In [None]:
automl.plot_comparison_pipelines()

For more information about how to use PipelineProfiler, see [this article](https://towardsdatascience.com/exploring-auto-sklearn-models-with-pipelineprofiler-5b2c54136044). A [video demo](https://www.youtube.com/watch?v=2WSYoaxLLJ8) is also available.