## Solving Tabular Regression Tasks

First, import the `AutoMLRegressor` class, along with pandas for loading the data:

In [1]:
from alpha_automl import AutoMLRegressor
import pandas as pd

### Generating Pipelines for CSV Datasets

In this example, we generate pipelines for a CSV dataset, using the 196_autoMpg dataset.

In [2]:
output_path = 'tmp/'
train_dataset = pd.read_csv('datasets/196_autoMpg/train_data.csv')
test_dataset = pd.read_csv('datasets/196_autoMpg/test_data.csv')

Remove the target column from the training dataset to obtain the features:

In [3]:
target_column = 'class'
X_train = train_dataset.drop(columns=[target_column])
X_train

Unnamed: 0,cylinders,displacement,horsepower,weight,acceleration,model,origin
0,8,350.0,165.0,3693,11.5,70,1
1,8,318.0,150.0,3436,11.0,70,1
2,8,302.0,140.0,3449,10.5,70,1
3,8,454.0,220.0,4354,9.0,70,1
4,8,440.0,215.0,4312,8.5,70,1
...,...,...,...,...,...,...,...
293,4,144.0,96.0,2665,13.9,82,3
294,4,135.0,84.0,2370,13.0,82,1
295,4,151.0,90.0,2950,17.3,82,1
296,4,135.0,84.0,2295,11.6,82,1


Select the target column of the training dataset:

In [4]:
y_train = train_dataset[[target_column]]
y_train

Unnamed: 0,class
0,15.0
1,18.0
2,17.0
3,14.0
4,14.0
...,...
293,32.0
294,36.0
295,27.0
296,32.0
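
The two cells above split one DataFrame into a feature matrix and a target. As a minimal sketch of the same split, the snippet below applies it to a small in-memory frame. The frame name `toy` is hypothetical, and its three rows are copied from the first rows of the outputs above. It illustrates that `drop` returns a new frame without the target, and that the double brackets keep the target as a one-column DataFrame rather than a Series.

```python
import pandas as pd

# Hypothetical toy frame: the first three training rows shown above.
toy = pd.DataFrame({
    'cylinders': [8, 8, 8],
    'displacement': [350.0, 318.0, 302.0],
    'horsepower': [165.0, 150.0, 140.0],
    'weight': [3693, 3436, 3449],
    'acceleration': [11.5, 11.0, 10.5],
    'model': [70, 70, 70],
    'origin': [1, 1, 1],
    'class': [15.0, 18.0, 17.0],
})

target_column = 'class'
X_toy = toy.drop(columns=[target_column])  # features only; `toy` itself is unchanged
y_toy = toy[[target_column]]               # double brackets give a one-column DataFrame

# Sanity checks: the target is gone from the features and the rows still line up.
assert target_column not in X_toy.columns
assert isinstance(y_toy, pd.DataFrame)
assert X_toy.index.equals(y_toy.index)
```

The same checks can be run on `X_train` and `y_train` before starting a search, since misaligned rows between features and target would silently degrade every pipeline.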


### Searching Pipelines

In [5]:
automl = AutoMLRegressor(output_path, time_bound=10)
automl.fit(X_train, y_train)

INFO:alpha_automl.automl_api:Found pipeline, time=0:00:00, scoring...
INFO:alpha_automl.automl_api:Scored pipeline, score=3.3182052729132256
INFO:alpha_automl.automl_api:Found pipeline, time=0:00:00, scoring...
INFO:alpha_automl.automl_api:Scored pipeline, score=3.3182052729132256
INFO:alpha_automl.automl_api:Found pipeline, time=0:00:01, scoring...
INFO:alpha_automl.automl_api:Scored pipeline, score=2.8462896066896053
INFO:alpha_automl.automl_api:Found pipeline, time=0:00:01, scoring...
INFO:alpha_automl.automl_api:Scored pipeline, score=2.8483592673992657
INFO:alpha_automl.automl_api:Found pipeline, time=0:00:01, scoring...
INFO:alpha_automl.automl_api:Scored pipeline, score=2.848401895838938
INFO:alpha_automl.automl_api:Found pipeline, time=0:00:01, scoring...
INFO:alpha_automl.automl_api:Scored pipeline, score=2.773362899435579
INFO:alpha_automl.automl_api:Found pipeline, time=0:00:01, scoring...
INFO:alpha_automl.automl_api:Scored pipeline, score=1.8657199999999996
INFO:alpha_auto

### Exploring Pipelines

After the pipeline search is complete, we can display the leaderboard:

In [6]:
automl.plot_leaderboard()

ranking,pipeline,mean_absolute_error
1,"SimpleImputer, RobustScaler, RandomForestRegressor",1.85
2,"SimpleImputer, StandardScaler, RandomForestRegressor",1.866
3,"SimpleImputer, MaxAbsScaler, ExtraTreesRegressor",1.868
4,"SimpleImputer, MaxAbsScaler, RandomForestRegressor",1.882
5,"SimpleImputer, RobustScaler, ExtraTreesRegressor",1.884
6,"SimpleImputer, StandardScaler, ExtraTreesRegressor",1.886
7,"SimpleImputer, RandomForestRegressor",1.898
8,"SimpleImputer, ExtraTreesRegressor",1.901
9,"SimpleImputer, RobustScaler, GradientBoostingRegressor",2.048
10,"SimpleImputer, StandardScaler, GradientBoostingRegressor",2.055


To explore the produced pipelines, we can use [PipelineProfiler](https://github.com/VIDA-NYU/PipelineVis), a visualization tool that lets users compare and explore the pipelines generated by the AlphaAutoML system.

Once the pipeline search has finished, PipelineProfiler can be launched with:

In [None]:
automl.plot_comparison_pipelines()

For more information about how to use PipelineProfiler, see [this article](https://towardsdatascience.com/exploring-auto-sklearn-models-with-pipelineprofiler-5b2c54136044). A video demo is also available [here](https://www.youtube.com/watch?v=2WSYoaxLLJ8).

### Testing Pipelines

Remove the target column from the test dataset to obtain the features:

In [7]:
X_test = test_dataset.drop(columns=[target_column])
X_test

Unnamed: 0,cylinders,displacement,horsepower,weight,acceleration,model,origin
0,8,307,130.0,3504,12.0,70,1
1,8,304,150.0,3433,12.0,70,1
2,8,429,198.0,4341,10.0,70,1
3,8,390,190.0,3850,8.5,70,1
4,6,198,95.0,2833,15.5,70,1
...,...,...,...,...,...,...,...
95,6,181,110.0,2945,16.4,82,1
96,6,232,112.0,2835,14.7,82,1
97,4,140,86.0,2790,15.6,82,1
98,4,97,52.0,2130,24.6,82,2


Select the target column of the test dataset:

In [8]:
y_test = test_dataset[[target_column]]
y_test

Unnamed: 0,class
0,18.0
1,16.0
2,15.0
3,15.0
4,22.0
...,...
95,25.0
96,22.0
97,27.0
98,44.0


Predictions for the test features are obtained with:

In [9]:
y_pred = automl.predict(X_test)
y_pred

array([16.006, 16.881, 13.73 , 13.815, 18.974, 26.6  , 23.967, 11.46 ,
       23.847, 25.053, 19.981, 13.03 , 12.285, 18.766, 22.43 , 27.748,
       27.225, 24.734, 12.96 , 12.28 , 14.775, 14.535, 14.42 , 20.725,
       23.575, 20.497, 22.571, 26.035, 11.79 , 14.43 , 11.85 , 18.834,
       12.51 , 25.03 , 22.33 , 25.337, 13.405, 26.015, 14.26 , 19.364,
       24.946, 14.309, 25.82 , 26.168, 19.517, 14.305, 24.234, 26.525,
       28.47 , 23.815, 22.852, 30.166, 27.431, 30.246, 14.808, 21.247,
       30.495, 18.14 , 18.818, 15.379, 15.288, 24.687, 28.594, 37.447,
       35.187, 20.838, 19.92 , 18.72 , 23.022, 19.548, 20.977, 17.521,
       16.644, 34.246, 19.123, 26.632, 38.061, 29.279, 26.114, 36.044,
       32.836, 26.38 , 35.321, 35.247, 36.533, 29.394, 29.787, 29.829,
       26.586, 34.518, 36.513, 35.733, 36.856, 35.577, 33.981, 28.263,
       29.105, 26.971, 40.338, 29.909])

The pipeline can be evaluated against a held-out dataset with the `score` method:

In [10]:
automl.score(X_test, y_test)

INFO:alpha_automl.automl_api:Metric: mean_absolute_error, Score: 1.7378600000000004


{'metric': 'mean_absolute_error', 'score': 1.7378600000000004}
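
To make the reported metric concrete, the following sketch computes the mean absolute error by hand, i.e. the average of |actual - predicted|. It uses only the first five targets and predictions displayed in the outputs above, so the result differs from the score over the full test set. The helper `mean_absolute_error` is defined locally for illustration and is not the library's implementation.

```python
# First five targets and predictions as displayed in the outputs above.
y_true = [18.0, 16.0, 15.0, 15.0, 22.0]
y_hat = [16.006, 16.881, 13.73, 13.815, 18.974]


def mean_absolute_error(actual, predicted):
    """Average of |actual - predicted| over paired observations."""
    if len(actual) != len(predicted):
        raise ValueError("inputs must have the same length")
    return sum(abs(a - p) for a, p in zip(actual, predicted)) / len(actual)


mae = mean_absolute_error(y_true, y_hat)
print(round(mae, 4))  # 1.6712 for these five rows
```

Because the metric is expressed in the units of the target, a value of about 1.7 means the predictions are off by roughly 1.7 target units on average.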