## Solving Tabular Regression Tasks

First, import the `AutoMLRegressor` class and pandas:

In [1]:
from alpha_automl import AutoMLRegressor
import pandas as pd

### Generating Pipelines for CSV Datasets

In this example, we generate pipelines for a CSV dataset, using the 196_autoMpg dataset.

In [2]:
output_path = 'tmp/'
train_dataset = pd.read_csv('datasets/196_autoMpg/train_data.csv')
test_dataset = pd.read_csv('datasets/196_autoMpg/test_data.csv')

Remove the target column to obtain the features of the train dataset:

In [3]:
target_column = 'class'
X_train = train_dataset.drop(columns=[target_column])
X_train

Unnamed: 0,cylinders,displacement,horsepower,weight,acceleration,model,origin
0,8,350.0,165.0,3693,11.5,70,1
1,8,318.0,150.0,3436,11.0,70,1
2,8,302.0,140.0,3449,10.5,70,1
3,8,454.0,220.0,4354,9.0,70,1
4,8,440.0,215.0,4312,8.5,70,1
...,...,...,...,...,...,...,...
293,4,144.0,96.0,2665,13.9,82,3
294,4,135.0,84.0,2370,13.0,82,1
295,4,151.0,90.0,2950,17.3,82,1
296,4,135.0,84.0,2295,11.6,82,1


Select the target column of the train dataset:

In [4]:
y_train = train_dataset[[target_column]]
y_train

Unnamed: 0,class
0,15.0
1,18.0
2,17.0
3,14.0
4,14.0
...,...
293,32.0
294,36.0
295,27.0
296,32.0
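The feature/target split above can be reproduced on a tiny in-memory frame. This is a minimal sketch that uses the first three training rows shown above (with only three of the feature columns, to keep it short) instead of the CSV file. It also illustrates that double brackets, as used in the notebook, return a one-column `DataFrame`, while single brackets return a `Series`; the names `y_frame` and `y_series` are introduced here for illustration only.

```python
import pandas as pd

# First three training rows, copied from the outputs above (subset of columns).
train_dataset = pd.DataFrame({
    "cylinders": [8, 8, 8],
    "displacement": [350.0, 318.0, 302.0],
    "horsepower": [165.0, 150.0, 140.0],
    "class": [15.0, 18.0, 17.0],
})
target_column = "class"

# drop() returns a new frame; train_dataset itself is left unchanged.
X_train = train_dataset.drop(columns=[target_column])

# Double brackets keep the 2-D shape (DataFrame); single brackets give a Series.
y_frame = train_dataset[[target_column]]
y_series = train_dataset[target_column]

print(X_train.shape, y_frame.shape, y_series.shape)
```

Which of the two target shapes a given estimator prefers depends on the library, so it is worth checking the shape whenever a fit call complains about its target argument.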


### Searching Pipelines

In [5]:
automl = AutoMLRegressor(output_path, time_bound=10, optimizing=True, optimizing_number=20)
automl.fit(X_train, y_train)

INFO:alpha_automl.automl_api:Found pipeline, time=0:00:00, scoring...
INFO:alpha_automl.automl_api:Scored pipeline, score=22.393773769993338
INFO:alpha_automl.automl_api:Found pipeline, time=0:00:00, scoring...
INFO:alpha_automl.automl_api:Scored pipeline, score=22.393773769993338
INFO:alpha_automl.automl_api:Found pipeline, time=0:00:00, scoring...
INFO:alpha_automl.automl_api:Scored pipeline, score=18.44031117445266
INFO:alpha_automl.automl_api:Found pipeline, time=0:00:00, scoring...
INFO:alpha_automl.automl_api:Scored pipeline, score=18.411598425935725
INFO:alpha_automl.automl_api:Found pipeline, time=0:00:00, scoring...
INFO:alpha_automl.automl_api:Scored pipeline, score=18.73502642853782
INFO:alpha_automl.automl_api:Found pipeline, time=0:00:00, scoring...
INFO:alpha_automl.automl_api:Scored pipeline, score=18.38271983841094
INFO:alpha_automl.automl_api:Found pipeline, time=0:00:01, scoring...
INFO:alpha_automl.automl_api:Scored pipeline, score=9.417245853333338
INFO:alpha_automl

### Exploring Pipelines

After the pipeline search is complete, we can display the leaderboard:

In [6]:
automl.plot_leaderboard()

ranking,pipeline,mean_squared_error
1,"SimpleImputer, MaxAbsScaler, RandomForestRegressor",9.183
2,"SimpleImputer, RobustScaler, RandomForestRegressor",9.343
3,"SimpleImputer, RandomForestRegressor",9.348
4,"SimpleImputer, StandardScaler, RandomForestRegressor",9.417
5,"SimpleImputer, MaxAbsScaler, GradientBoostingRegressor",9.528
6,"SimpleImputer, MaxAbsScaler, ExtraTreesRegressor",9.544
7,"SimpleImputer, RobustScaler, ExtraTreesRegressor",9.736
8,"SimpleImputer, StandardScaler, GradientBoostingRegressor",9.765
9,"SimpleImputer, RobustScaler, GradientBoostingRegressor",9.841
10,"SimpleImputer, ExtraTreesRegressor",10.077
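To make the leaderboard entries more concrete, the sketch below builds the top-ranked combination by hand. It assumes the step names refer to the scikit-learn classes of the same name; the leaderboard lists step names only, so the hyperparameters here (`strategy="mean"`, `n_estimators=10`, `random_state=0`) are illustrative assumptions, not the values the search selected. The data are the first five training rows shown earlier.

```python
from sklearn.ensemble import RandomForestRegressor
from sklearn.impute import SimpleImputer
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import MaxAbsScaler

# First five training rows, copied from the X_train output above
# (cylinders, displacement, horsepower, weight, acceleration, model, origin).
X_small = [
    [8, 350.0, 165.0, 3693, 11.5, 70, 1],
    [8, 318.0, 150.0, 3436, 11.0, 70, 1],
    [8, 302.0, 140.0, 3449, 10.5, 70, 1],
    [8, 454.0, 220.0, 4354, 9.0, 70, 1],
    [8, 440.0, 215.0, 4312, 8.5, 70, 1],
]
# Matching targets from the y_train output above.
y_small = [15.0, 18.0, 17.0, 14.0, 14.0]

# Hand-built equivalent of "SimpleImputer, MaxAbsScaler, RandomForestRegressor".
# Hyperparameters are assumptions made for this sketch.
pipeline = make_pipeline(
    SimpleImputer(strategy="mean"),
    MaxAbsScaler(),
    RandomForestRegressor(n_estimators=10, random_state=0),
)
pipeline.fit(X_small, y_small)
preds = pipeline.predict(X_small)

print([name for name, _ in pipeline.steps])
```

Five rows are far too few to say anything about model quality; the point is only to show how the three steps chain together.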


To explore the produced pipelines, we can use [PipelineProfiler](https://github.com/VIDA-NYU/PipelineVis), a visualization tool that enables users to compare and explore the pipelines generated by the AlphaAutoML system.

To use PipelineProfiler on the search results, call:

In [None]:
automl.plot_comparison_pipelines()

For more information about how to use PipelineProfiler, see [this article](https://towardsdatascience.com/exploring-auto-sklearn-models-with-pipelineprofiler-5b2c54136044). A [video demo](https://www.youtube.com/watch?v=2WSYoaxLLJ8) is also available.

### Testing Pipelines

Remove the target column to obtain the features of the test dataset:

In [8]:
X_test = test_dataset.drop(columns=[target_column])
X_test

Unnamed: 0,cylinders,displacement,horsepower,weight,acceleration,model,origin
0,8,307,130.0,3504,12.0,70,1
1,8,304,150.0,3433,12.0,70,1
2,8,429,198.0,4341,10.0,70,1
3,8,390,190.0,3850,8.5,70,1
4,6,198,95.0,2833,15.5,70,1
...,...,...,...,...,...,...,...
95,6,181,110.0,2945,16.4,82,1
96,6,232,112.0,2835,14.7,82,1
97,4,140,86.0,2790,15.6,82,1
98,4,97,52.0,2130,24.6,82,2


Select the target column of the test dataset:

In [9]:
y_test = test_dataset[[target_column]]
y_test

Unnamed: 0,class
0,18.0
1,16.0
2,15.0
3,15.0
4,22.0
...,...
95,25.0
96,22.0
97,27.0
98,44.0


Pipeline predictions are accessed with:

In [10]:
y_pred = automl.predict(X_test)
y_pred

array([15.845, 17.301, 13.36 , 14.025, 18.794, 26.92 , 24.56 , 11.67 ,
       22.886, 25.622, 18.994, 12.99 , 12.41 , 19.177, 21.48 , 26.52 ,
       27.213, 24.749, 12.72 , 11.745, 15.39 , 14.791, 14.485, 20.884,
       23.225, 21.314, 23.339, 27.17 , 11.51 , 14.02 , 11.69 , 19.182,
       12.32 , 26.075, 20.599, 26.08 , 12.71 , 25.38 , 13.315, 20.641,
       24.383, 14.549, 26.14 , 25.511, 19.792, 14.51 , 24.167, 26.969,
       28.175, 23.224, 22.992, 31.094, 27.44 , 31.657, 14.725, 20.979,
       32.248, 20.055, 18.818, 15.54 , 15.668, 25.603, 28.817, 36.511,
       34.647, 20.781, 20.439, 19.045, 24.262, 19.26 , 21.477, 17.972,
       16.167, 33.364, 18.44 , 25.135, 40.156, 30.821, 25.275, 35.56 ,
       33.029, 25.189, 33.702, 34.442, 35.037, 28.557, 29.885, 30.498,
       26.406, 34.937, 37.425, 35.804, 35.268, 35.407, 32.963, 28.423,
       30.444, 26.976, 38.821, 29.493])

The pipeline can be evaluated against a held-out dataset with:

In [11]:
automl.score(X_test, y_test)

INFO:alpha_automl.automl_api:Metric: mean_absolute_error, Score: 1.774809999999999


{'metric': 'mean_absolute_error', 'score': 1.774809999999999}
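Note that the score above is reported as `mean_absolute_error`, while the leaderboard column is `mean_squared_error`, so the two numbers are not directly comparable. The sketch below computes both metrics by hand on the first five test rows and predictions shown above. The helper functions are written for illustration and are not the library's scorer; because only five rows are used, the results differ from the full-dataset score.

```python
# First five test targets and predictions, copied from the outputs above.
y_true = [18.0, 16.0, 15.0, 15.0, 22.0]
y_hat = [15.845, 17.301, 13.36, 14.025, 18.794]


def mean_absolute_error(actual, predicted):
    """Average of the absolute differences |actual - predicted|."""
    return sum(abs(a - p) for a, p in zip(actual, predicted)) / len(actual)


def mean_squared_error(actual, predicted):
    """Average of the squared differences (actual - predicted)^2."""
    return sum((a - p) ** 2 for a, p in zip(actual, predicted)) / len(actual)


mae = mean_absolute_error(y_true, y_hat)
mse = mean_squared_error(y_true, y_hat)

print(f"MAE on five rows: {mae:.4f}")
print(f"MSE on five rows: {mse:.4f}")
```

On these five rows the MAE is about 1.86 and the MSE about 4.05. MSE squares each error, so it penalizes large misses more heavily and is expressed in squared target units, whereas MAE stays in the units of the target itself.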