Improve user method of seeing pipelines generated #1298

Closed
eddiebergman opened this issue Nov 10, 2021 · 17 comments · Fixed by #1321

@eddiebergman
Contributor

Currently, the easiest way for a user to see the pipelines included in the ensemble is through estimator.show_models(), which just returns a str that has to be manually parsed and read through. There could definitely be a nicer format for viewing any such pipeline and providing easy access.
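
For illustration, the current interaction looks roughly like this (a sketch assuming a fitted AutoSklearnClassifier named estimator):

# show_models() currently returns one large string describing every
# pipeline in the ensemble, which has to be read or parsed by hand
models_str = estimator.show_models()
print(type(models_str))   # <class 'str'>
print(models_str[:500])   # inspect the beginning of the dump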

eddiebergman added the maintenance label Nov 10, 2021
project-bot added this to LabelMe in Maintenance Nov 10, 2021
eddiebergman changed the title from "Improve use method of seeing pipelines generated" to "Improve user method of seeing pipelines generated" Nov 10, 2021
@eddiebergman
Contributor Author

If anyone would like to take a look at this, it would require going through how show_models generates the string and returning the object version of that instead, perhaps with a nicer interface for accessing the returned object.

@eddiebergman
Contributor Author

From #1224, how to access individual pipeline steps:

from autosklearn.pipeline.util import get_dataset
from autosklearn.classification import AutoSklearnClassifier

X_train, y_train, X_test, y_test = get_dataset('iris')
automodel = AutoSklearnClassifier(time_left_for_this_task=60)
automodel.fit(X_train, y_train)

# A list of pipelines with their weights: [ (ensemble_weight, Pipeline), ... ]
models_with_weights = automodel.get_models_with_weights()

# Get the first model with its weight
weight, model = models_with_weights[0]

# Note that these models and the objects below are sklearn compatible.
# The steps in the model's pipeline:
# [
#    ('data_preprocessing', DataPreprocessor),
#    ('balancing', Balancing),
#    ('feature_preprocessor', FeaturePreprocessorChoice),
#    ('classifier', ClassifierChoice)
# ]
model_steps = model.steps

# Get the ClassifierChoice wrapper
classifier_str, classifier = model.steps[-1]

# The autosklearn-wrapped model
classifier = classifier.choice
print(type(classifier))  # autosklearn.pipeline.components.classification.random_forest.RandomForest

# The underlying sklearn model
sklearn_classifier = classifier.estimator
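
As a small usage sketch building on the snippet above (no new API assumed), each member of the ensemble can be summarised in a loop:

# Print each ensemble member's weight and the type of its final estimator
for weight, model in models_with_weights:
    choice = dict(model.steps)['classifier'].choice
    print(f"{weight:.2f} -> {type(choice).__name__}")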

@eddiebergman
Contributor Author

From #1206, how to access a specific model in the ensemble:

import sklearn
from sklearn import datasets
from autosklearn.classification import AutoSklearnClassifier

X, y = datasets.load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = sklearn.model_selection.train_test_split(X, y, random_state=1)

clf = AutoSklearnClassifier(time_left_for_this_task=120, per_run_time_limit=30)
clf.fit(X_train, y_train)

# Model IDs can be found with the `leaderboard()` function
wanted_model_id = ...
wanted_model = None

# automl_.models_ is keyed by (seed, model_id, budget) tuples
for (seed, model_id, budget), model in clf.automl_.models_.items():
    if model_id == wanted_model_id:
        wanted_model = model
        break
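
As a sketch of how the placeholder above might be filled in, assuming the default leaderboard() output (a dataframe indexed by model_id and sorted with rank 1 first):

# e.g. take the top-ranked model's ID from the leaderboard
leaderboard = clf.leaderboard()
wanted_model_id = leaderboard.index[0]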

@UserFindingSelf
Contributor

UserFindingSelf commented Nov 23, 2021

Hello! This is a really great project, I recently used Auto-Sklearn in one of my projects and it really helped me. Thank you!

However, I faced the same issue of manually parsing the str output of the show_models() method. I am looking to start contributing to open source, and I would like to work on this issue.

Can you tell me what you think would be a good solution for this? Taking inspiration from the code snippets you have shared, I have come up with the following idea:

  • Create a dictionary for the whole ensemble where the rank of the model is the key.
  • The ensemble dictionary holds a dictionary for each pipeline, which contains:
    • ensemble_weight
    • balancing
    • data_preprocessor
    • feature_preprocessor
    • classifier or regressor (auto-sklearn wrapped model)
    • sklearn_model
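
A rough sketch of how this structure could be assembled from the accessors shown earlier in the thread (reusing automodel from the first snippet; the 'data_preprocessing' step name comes from there, and everything here is illustrative rather than the final implementation):

# Illustrative only: build the proposed rank-keyed dictionary
# from get_models_with_weights()
proposed = {}
for rank, (weight, model) in enumerate(automodel.get_models_with_weights(), start=1):
    steps = dict(model.steps)
    choice = steps['classifier'].choice
    proposed[rank] = {
        'ensemble_weight': weight,
        'balancing': steps['balancing'],
        'data_preprocessor': steps['data_preprocessing'],
        'feature_preprocessor': steps['feature_preprocessor'],
        'classifier': choice,               # auto-sklearn wrapped model
        'sklearn_model': choice.estimator,  # underlying sklearn estimator
    }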

I have implemented it using the code you shared and, for reference, attached a screenshot of the resulting dictionary.
[screenshot omitted]

What do you think about this? I will create a PR after editing the show_models function if you approve.

@eddiebergman
Contributor Author

Hi @UserFindingSelf,

Thanks for the kind words and for wanting to contribute :) This looks like it would be quite useful for sure! I would be happy to help with a PR for this if you would like to submit one.

The only thing I would change is perhaps using the run_id as the key of the dictionary instead of rank, since it is a unique key that identifies a model, leaving the rank inside the dictionary. Having the rank as the dict key means the whole dict may as well be a list where the order identifies rank.

Moving forward, as we want easier access to models, the run_id is a better identifier than rank. In a hypothetical situation where we allow checkpointing and resuming, the ranks would likely change, which would lead to issues.

@UserFindingSelf
Contributor

Thank you so much for your help! Your suggestion definitely makes more sense.

Can you confirm whether run_id is the same as model_id in the table created by the leaderboard() method, or something else? I would then use model_id as the key and add rank to the model dictionary instead. If it's something else, could you point me to where I might find the run_id?

Thanks again! :)

@eddiebergman
Contributor Author

Yup, that's the one! For context, the optimizer SMAC gives each "run" a number, but for us a "run" corresponds to a model configuration that gets trained, hence it makes sense to present it as model_id. Sorry for the confusion.

# autosklearn/estimators.py, line 680
{
    'model_id': rval.additional_info['num_run'],
}

@UserFindingSelf
Contributor

Great! Just one more question. Is it okay to use leaderboard() to get 'rank' and 'model_id', or should I write independent code for that inside the show_models() function?

@eddiebergman
Contributor Author

So leaderboard does a lot of extra things, mainly validation and sorting. It produces a relatively small dataframe, and the sorting is necessary anyway, so I think reusing leaderboard should be fine :)
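
Putting the two suggestions together, a minimal sketch (illustrative, not the merged code) of keying by model_id while reading rank and ensemble_weight from leaderboard(), reusing clf from the earlier snippet:

# Illustrative: key by model_id and store the rank inside each entry
result = {}
for model_id, row in clf.leaderboard().iterrows():
    result[model_id] = {
        'rank': row['rank'],
        'ensemble_weight': row['ensemble_weight'],
        # ...plus the pipeline components proposed above
    }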

@UserFindingSelf
Contributor

Hello @eddiebergman! I wasn't able to use leaderboard() inside the automl.py file, so I wrote the code required for getting the ranks and ensemble weights myself. That should be alright, since those two are the only things I had planned to include from the leaderboard() table.

Now, in line with the process described in CONTRIBUTING.md, I am running the tests. If all goes well, I will create the PR. Can you tell me how long the tests usually take to complete? And do I need to run all of them?

Thank you!

@eddiebergman
Contributor Author

Hi @UserFindingSelf,

Glad to hear it's working, and even more glad you're running the tests! In general, on the GitHub Actions servers where we run the automated tests, they can take around 45 minutes (something we need to optimize).

You can create the PR and I can schedule the tests to run on our servers, saving your machine from having to run them!

@UserFindingSelf
Contributor

Awesome! Working on creating the PR then.

@UserFindingSelf
Contributor

Hey @eddiebergman! I have created the PR; please review it whenever you get the chance. I will then make any necessary changes based on your suggestions.

@eddiebergman
Contributor Author

Improved with #1321. It might need further improvements in the future, but for now I'm closing this.
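
For readers landing here later, a sketch of the post-#1321 access pattern as discussed in this thread (show_models() returning a dict keyed by model_id; check your installed version for the exact key names):

# show_models() now returns a dict keyed by model_id instead of one big string
models = clf.show_models()
for model_id, info in models.items():
    print(model_id, info['rank'], info['ensemble_weight'])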

Maintenance automation moved this from LabelMe to Done Jan 2, 2022
@TomPham97

TomPham97 commented Sep 16, 2022

Quoting the snippet from the earlier comment above ("From #1206, how to access a specific model in the ensemble").

After the #1321 improvement, would the updated code be like this?

# Can find model IDs from the `leaderboard()` function
leaderboard = clf.leaderboard()
wanted_model_id = leaderboard[leaderboard['rank'] == ...].index[0]

wanted_model = clf.show_models()[wanted_model_id]

@mfeurer
Contributor

mfeurer commented Sep 20, 2022

Please open a new issue @TomPham97 for any new questions.
