Speed and time budgets #57
Unfortunately, such a strategy is not yet implemented, but we are aware of the issue. It would be great to have some best practices on per_run_time_limit, but so far there aren't any. I usually set it to something that allows a single function evaluation to take rather long, much like the 19000 seconds, and then use far more than 20 hours on a cluster. This is not yet efficient enough and we're working on it, but this is not an easy problem in our setting. Since you know the nature of your problem, you could remove certain classifiers which don't scale by specifying which classifiers to keep with the corresponding argument.
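As an illustration, here is a minimal sketch of restricting the search space this way. The include_estimators keyword and the component names listed are assumptions on my part, so check them against the auto-sklearn version you are running; the budgets and the fit call are the ones mentioned in this thread.

```python
from sklearn.datasets import make_classification
from autosklearn.classification import AutoSklearnClassifier

# Illustrative data; the thread's actual problem is text classification.
X_train, y_train = make_classification(n_samples=1000, n_features=40, random_state=0)

# Sketch only: keep the search to classifiers that scale. The `include_estimators`
# keyword and the component names below are assumed, not confirmed in this thread;
# verify them against the documentation of your auto-sklearn version.
automl = AutoSklearnClassifier(
    time_left_for_this_task=72000,   # overall budget in seconds
    per_run_time_limit=19000,        # budget for a single model fit in seconds
    include_estimators=['sgd', 'passive_aggressive', 'liblinear_svc'],
)
automl.fit(X_train, y_train, metric='f1_metric')
```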
The SGDClassifier is fast now, but I have surely spent a few days getting to that point, including testing other (slow) classifiers and running the grid optimization pipeline a few times. So if we could get to the same accuracy in one go with auto-sklearn alone, even if it needs 20 hours, it is already a big achievement.
I understand that you keep all converged models in an ensemble and wouldn't drop them. The question is also how much time each model gets from the overall budget to try to get better.
How time is assigned to each individual model is a big research question, and your input is certainly very helpful. What happens in auto-sklearn right now is that each model is trained until convergence. For the future we want something that evaluates models only partially and refines them if they are promising, as well as evaluating models on subsets of the data. Models are tested sequentially, and new models are proposed by the Bayesian optimization toolkit SMAC. We do not add all converged models to the ensemble; we only consider them for the ensemble. The ensemble building strategy then selects only a subset of these. Thus, a model gets as much time to converge as you specify with the argument per_run_time_limit.
Thanks, that does answer my question. I will have to learn more about SMAC to see how my suggestion could be harmonized with that.
Matthias, you have mentioned above that you "use far more than 20 hours on a cluster"; did you refer to the shared mode of SMAC there? I have read the SMAC guide and started the run with the following parameters to try to leverage all the cores on one of my machines:
Can you comment on whether that is the correct way of leveraging this mode? Could you share your setup for a cluster if you have more than one machine involved?
Yes, I referred to the shared mode of SMAC. This is mostly correct. You need to specify a different seed for each of the classifiers with the argument seed.
Does this help? I will leave this issue open so I'm reminded to put this into the documentation.
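As a concrete illustration of that setup, here is a minimal sketch of the worker side. Only the parameter names that appear elsewhere in this thread (seed, time_left_for_this_task, per_run_time_limit, ml_memory_limit, initial_configurations_via_metalearning) are taken from the discussion; shared_mode, the folder arguments, and ensemble_size=0 are assumptions and may be named differently in other auto-sklearn versions.

```python
import multiprocessing

from sklearn.datasets import make_classification
from autosklearn.classification import AutoSklearnClassifier

# Illustrative data; the thread's actual problem is text classification.
X_train, y_train = make_classification(n_samples=1000, n_features=40, random_state=0)


def spawn_classifier(seed):
    """One auto-sklearn run per process. A distinct seed per process keeps
    SMAC's shared mode from evaluating the same configurations twice.
    shared_mode, tmp_folder, output_folder and ensemble_size are assumed
    keyword names; check them against your auto-sklearn version."""
    automl = AutoSklearnClassifier(
        time_left_for_this_task=72000,
        per_run_time_limit=19000,
        ml_memory_limit=10000,
        shared_mode=True,                             # assumed switch for SMAC's shared mode
        tmp_folder='/tmp/autosklearn_shared_tmp',     # assumed; same folder for all workers
        output_folder='/tmp/autosklearn_shared_out',  # assumed
        initial_configurations_via_metalearning=0,    # see the meta-learning discussion below
        ensemble_size=0,                              # assumed: workers only search, no ensemble
        seed=seed,
    )
    automl.fit(X_train, y_train, metric='f1_metric')


if __name__ == '__main__':
    # Seeds > 1 for the workers; seed 1 is reserved for the ensemble-building
    # instance in the main process, as discussed further down in this thread.
    workers = [multiprocessing.Process(target=spawn_classifier, args=(seed,))
               for seed in range(2, 6)]
    for w in workers:
        w.start()
    for w in workers:
        w.join()
```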
Sorry, it is not quite clear to me yet how one runs this on one machine. Suppose I have the below code to parallelize in a sub-process; what else should I add?
Then my description wasn't helpful, sorry. I'm running this script multiple times. I'm not sure if this approach will work, but you can give it a try. I think your script is mostly correct; you only need to create a new AutoSklearnClassifier in the main process:
Yet another classifier in the main process with seed=1 is there to get the ensemble builder hooked up, correct? The other classifiers should then have seeds > 1, if I get it right.
Yes to the first question. I'm not sure whether the others need to have seeds > 1 or whether >= 1 would be sufficient, but you're on the safe side with seeds > 1.
So here is the exact code I run (after the data is loaded):
and I get this error message:
I see all 6 or 7 Python processes started and occupying the usual amount of memory for the planned time in the "top" listing, so the part before run_ensemble_builder may be OK (I don't know for sure). It looks like the task type is missing, but I can imagine the problem is somewhere else in the missing configuration. What is still missing?
It seems like the task is only set when fit() is called, so it needs to be set manually before calling run_ensemble_builder().
One more thing that I'm not sure we have covered: should metalearning_configurations = 0 for all classifiers, or should the first one (the one that is not fitted) have metalearning_configurations = 25? I admit I haven't grokked the philosophy of autosklearn + SMAC, so please bear with me as we get to the first working example. It would really help to have minimal example code that runs end to end on a multicore machine; that should be standard hardware everywhere by now.
I can't answer that question empirically, but it would make sense to run the first classifier in spawn_classifier with metalearning_configurations = 25. It certainly won't make SMAC work any worse, and it can make the process of finding a good solution quicker. Let me know once you have the example up and running; I'm interested to know whether it works and helps find good solutions more quickly.
Yes, I am trying to write this example myself, and it looks like I'm very close to making it work. But it is not very efficient as of now, because I have to find the parameters of run_ensemble_builder empirically.
Here I get this error even though I have set the precision above:
Wouldn't it be faster if you could run the exact same script with your data yourself and make sure it gets to completion?
The code below works. I have set initial_configurations_via_metalearning=0 for all classifiers, and max_iterations=1, ensemble_size=50 for run_ensemble_builder. Is there a better approach? Should I play with those parameters?
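For anyone reproducing this setup, here is a hedged sketch of the main-process side. Only run_ensemble_builder(max_iterations=1, ensemble_size=50), initial_configurations_via_metalearning=0, seed=1, and the budgets are taken from this thread; shared_mode and the folder arguments are assumptions, and the exact call pattern should be checked against the auto-sklearn version in use.

```python
from autosklearn.classification import AutoSklearnClassifier

# Main-process instance with seed=1: it does not search itself, it only collects
# the models found by the worker processes (seeds > 1) into an ensemble.
automl = AutoSklearnClassifier(
    time_left_for_this_task=72000,
    per_run_time_limit=19000,
    ml_memory_limit=10000,
    initial_configurations_via_metalearning=0,
    shared_mode=True,                             # assumed keyword, as in the worker sketch above
    tmp_folder='/tmp/autosklearn_shared_tmp',     # assumed; must match the workers
    output_folder='/tmp/autosklearn_shared_out',  # assumed
    seed=1,
)

# As discussed above, this instance is never fitted, so the task type (and possibly
# the metric and precision) may need to be set manually before this call; the exact
# attribute names depend on the auto-sklearn version.
automl.run_ensemble_builder(max_iterations=1, ensemble_size=50)
```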
Closing this for now; please reopen if you think it's not solved.
In a text classification task, the SGDClassifier needs just a few minutes to get to the same result as auto-sklearn, which I let run for 20 hours. A smaller time budget for auto-sklearn resulted in absolute or relative failure in prediction.
I wonder if there is a strategy to try the fastest algorithms first and, if time runs out, at least use their results?
Another question is about the recommended per_run_time_limit value. Is there a rule of thumb for choosing it?
SGDClassifier (Test FR): Precision 0.20, Recall 0.53, F1 0.29
Auto-sklearn: Precision 0.28, Recall 0.31, F1 0.29

```python
classifier = AutoSklearnClassifier(
    time_left_for_this_task=72000,
    per_run_time_limit=19000,
    ml_memory_limit=10000,
)
classifier.fit(X_train, y_train, metric='f1_metric')
```