Improve user method of seeing pipelines generated #1298

Closed
eddiebergman opened this issue Nov 10, 2021 · 17 comments · Fixed by #1321

@eddiebergman
Contributor

Currently, the easiest way for a user to see the pipelines included in the ensemble is through estimator.show_models(), which just returns a str that has to be manually parsed and read through. There could definitely be a nicer format for viewing any such pipeline and providing easy access.
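
For illustration, the current interaction looks roughly like this (a sketch assuming a fitted AutoSklearnClassifier named estimator):

# show_models() currently returns one large string describing every
# pipeline in the ensemble, which has to be read or parsed by hand
models_str = estimator.show_models()
print(type(models_str))   # <class 'str'>
print(models_str[:500])   # inspect the beginning of the dump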

eddiebergman added the maintenance label Nov 10, 2021
project-bot added this to LabelMe in Maintenance Nov 10, 2021
eddiebergman changed the title from "Improve use method of seeing pipelines generated" to "Improve user method of seeing pipelines generated" Nov 10, 2021
@eddiebergman
Contributor Author

If anyone would like to take a look at this, it would require going through how show_models generates the string and returning the object version of that instead, perhaps with a nicer interface for accessing the returned object.

@eddiebergman
Contributor Author

From #1224, how to access individual pipeline steps:

from autosklearn.pipeline.util import get_dataset
from autosklearn.classification import AutoSklearnClassifier

X_train, y_train, X_test, y_test = get_dataset('iris')
automodel = AutoSklearnClassifier(time_left_for_this_task=60)
automodel.fit(X_train, y_train)

# A list of pipelines with their weights: [ (ensemble_weight, Pipeline), ... ]
models_with_weights = automodel.get_models_with_weights()

# Get the first model with its weight
weight, model = models_with_weights[0]

# Note that these models and the objects below are sklearn compatible.
# The steps in the model's pipeline:
# [
#    ('data_preprocessing', DataPreprocessor),
#    ('balancing', Balancing),
#    ('feature_preprocessor', FeaturePreprocessorChoice),
#    ('classifier', ClassifierChoice)
# ]
model_steps = model.steps

# Get the ClassifierChoice wrapper
classifier_str, classifier = model.steps[-1]

# The autosklearn-wrapped model
classifier = classifier.choice
print(type(classifier))  # autosklearn.pipeline.components.classification.random_forest.RandomForest

# The underlying sklearn model
sklearn_classifier = classifier.estimator
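
As a small usage sketch building on the snippet above (no new API assumed), each member of the ensemble can be summarised in a loop:

# Print each ensemble member's weight and the type of its final estimator
for weight, model in models_with_weights:
    choice = dict(model.steps)['classifier'].choice
    print(f"{weight:.2f} -> {type(choice).__name__}")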

@eddiebergman
Contributor Author

From #1206, how to access a specific model in the ensemble:

import sklearn
from sklearn import datasets
from autosklearn.classification import AutoSklearnClassifier

X, y = datasets.load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = sklearn.model_selection.train_test_split(X, y, random_state=1)

clf = AutoSklearnClassifier(time_left_for_this_task=120, per_run_time_limit=30)
clf.fit(X_train, y_train)

# Model IDs can be found with the `leaderboard()` function
wanted_model_id = ...
wanted_model = None

# automl_.models_ is keyed by (seed, model_id, budget) tuples
for (seed, model_id, budget), model in clf.automl_.models_.items():
    if model_id == wanted_model_id:
        wanted_model = model
        break
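
As a sketch of how the placeholder above might be filled in, assuming the default leaderboard() output (a dataframe indexed by model_id and sorted with rank 1 first):

# e.g. take the top-ranked model's ID from the leaderboard
leaderboard = clf.leaderboard()
wanted_model_id = leaderboard.index[0]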

@UserFindingSelf
Contributor

UserFindingSelf commented Nov 23, 2021

Hello! This is a really great project, I recently used Auto-Sklearn in one of my projects and it really helped me. Thank you!

However, I faced the same issue of manually parsing the str output of the show_models() method. I am looking to start contributing to open source, and I would like to work on this issue.

Can you tell me what you think would be a good solution for this? Taking inspiration from the code snippets you have shared, I have come up with the following idea:

  • Create a dictionary for the whole ensemble where the rank of the model is the key.
  • The ensemble dictionary holds a dictionary for each pipeline, which contains:
    • ensemble_weight
    • balancing
    • data_preprocessor
    • feature_preprocessor
    • classifier or regressor (auto-sklearn wrapped model)
    • sklearn_model
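
A rough sketch of how this structure could be assembled from the accessors shown earlier in the thread (reusing automodel from the first snippet; the 'data_preprocessing' step name comes from there, and everything here is illustrative rather than the final implementation):

# Illustrative only: build the proposed rank-keyed dictionary
# from get_models_with_weights()
proposed = {}
for rank, (weight, model) in enumerate(automodel.get_models_with_weights(), start=1):
    steps = dict(model.steps)
    choice = steps['classifier'].choice
    proposed[rank] = {
        'ensemble_weight': weight,
        'balancing': steps['balancing'],
        'data_preprocessor': steps['data_preprocessing'],
        'feature_preprocessor': steps['feature_preprocessor'],
        'classifier': choice,               # auto-sklearn wrapped model
        'sklearn_model': choice.estimator,  # underlying sklearn estimator
    }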

I have implemented it using the code you shared and, for reference, attached a screenshot of the resulting dictionary.
[screenshot omitted]

What do you think about this? I will create a PR after editing the show_models function if you approve.

@eddiebergman
Contributor Author

Hi @UserFindingSelf,

Thanks for the kind words and for wanting to contribute :) This looks like it would be quite useful for sure! I would be happy to help with a PR for this if you would like to submit one.

The only thing I would change is perhaps using the run_id as the key of the dictionary instead of rank, since it is a unique key that identifies a model, leaving the rank inside the dictionary. Having the rank as the dict key means the whole dict may as well be a list where the order identifies rank.

Moving forward, as we want easier access to models, the run_id is a better identifier than rank. In a hypothetical situation where we allow checkpointing and resuming, the ranks would likely change, which would lead to issues.

@UserFindingSelf
Contributor

Thank you so much for your help! Your suggestion definitely makes more sense.

Can you confirm whether run_id is the same as model_id in the table created by the leaderboard() method, or something else? I would then use model_id as the key and add rank to the model dictionary instead. If it's something else, could you point me to where I might find the run_id?

Thanks again! :)

@eddiebergman
Contributor Author

Yup, that's the one! For context, the optimizer SMAC gives each "run" a number, but for us a "run" corresponds to a model configuration that gets trained, hence it makes sense to present it as model_id. Sorry for the confusion.

# autosklearn/estimators.py, line 680
{
    'model_id': rval.additional_info['num_run'],
}

@UserFindingSelf
Contributor

Great! Just one more question. Is it okay to use leaderboard() to get 'rank' and 'model_id', or should I write independent code for that inside the show_models() function?

@eddiebergman
Contributor Author

So leaderboard does a lot of extra things, mainly validation and sorting. It produces a relatively small dataframe, and the sorting is necessary anyway, so I think reusing leaderboard should be fine :)
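
Putting the two suggestions together, a minimal sketch (illustrative, not the merged code) of keying by model_id while reading rank and ensemble_weight from leaderboard(), reusing clf from the earlier snippet:

# Illustrative: key by model_id and store the rank inside each entry
result = {}
for model_id, row in clf.leaderboard().iterrows():
    result[model_id] = {
        'rank': row['rank'],
        'ensemble_weight': row['ensemble_weight'],
        # ...plus the pipeline components proposed above
    }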

@UserFindingSelf
Contributor

Hello @eddiebergman! I wasn't able to use leaderboard() inside the automl.py file, so I wrote the code required for getting the ranks and ensemble weights myself. That should be alright, since those two are the only things I had planned to include from the leaderboard() table.

Now, in line with the process described in CONTRIBUTING.md, I am running the tests. If all goes well, I will create the PR. Can you tell me how long the tests usually take to complete? And do I need to run all of them?

Thank you!

@eddiebergman
Contributor Author

Hi @UserFindingSelf,

Glad to hear it's working, and even more glad you're running the tests! In general, on the GitHub Actions servers where we run the automated tests, they can take around 45 minutes (something we need to optimize).

You can create the PR and I can schedule the tests to run on our servers, saving your machine from having to run them!

@UserFindingSelf
Contributor

Awesome! Working on creating the PR then.

@UserFindingSelf
Contributor

Hey @eddiebergman! I have created the PR; please review it whenever you get the chance. I will then make any necessary changes based on your suggestions.

@eddiebergman
Contributor Author

Improved with #1321. It might need further improvements in the future, but for now I'm closing this.
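
For readers landing here later, a sketch of the post-#1321 access pattern as discussed in this thread (show_models() returning a dict keyed by model_id; check your installed version for the exact key names):

# show_models() now returns a dict keyed by model_id instead of one big string
models = clf.show_models()
for model_id, info in models.items():
    print(model_id, info['rank'], info['ensemble_weight'])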

Maintenance automation moved this from LabelMe to Done Jan 2, 2022
@TomPham97

TomPham97 commented Sep 16, 2022

Quoting the snippet from the earlier comment above ("From #1206, how to access a specific model in the ensemble").

After the #1321 improvement, would the updated code be like this?

# Can find model IDs from the `leaderboard()` function
leaderboard = clf.leaderboard()
wanted_model_id = leaderboard[leaderboard['rank'] == ...].index[0]

wanted_model = clf.show_models()[wanted_model_id]

@mfeurer
Contributor

mfeurer commented Sep 20, 2022

Please open a new issue @TomPham97 for any new questions.
