Speed and time budgets #57

Closed
Motorrat opened this issue Mar 2, 2016 · 17 comments

@Motorrat
Contributor

Motorrat commented Mar 2, 2016

In a text classification task, the SGDClassifier needs just a few minutes to reach the same result as auto-sklearn, which I let run for 20 hours. A smaller time budget for auto-sklearn resulted in complete or partial failure in prediction.

I wonder if there is a strategy to try the fastest algorithms first and, if time runs out, at least use their results?

Another question is about the recommended per_run_time_limit value. Is there a rule of thumb for choosing it?

SGDClassifier (Test FR): Precision: 0.20, Recall: 0.53, F1: 0.29
Auto-sklearn: Precision: 0.28, Recall: 0.31, F1: 0.29

classifier.fit(X_train, y_train, metric='f1_metric')
AutoSklearnClassifier(time_left_for_this_task=72000, per_run_time_limit=19000, ml_memory_limit=10000)

@mfeurer
Contributor

mfeurer commented Mar 2, 2016

Unfortunately, such a strategy is not yet implemented, but we are aware of the issue.

It would be great to have some best practices for per_run_time_limit, but so far there are none. I usually set it to something that allows a single function evaluation to take rather long, much like your 19000 seconds, and then use far more than 20 hours on a cluster. This is not yet efficient enough and we're working on it, but it is not an easy problem in our setting. Since you know the nature of your problem, you could remove certain classifiers which don't scale by specifying which classifiers to keep with the argument include_classifiers (see the sketch below). I already opened an issue to make this argument more intuitive.
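
A minimal sketch of restricting the search space this way; the component names passed to include_classifiers are assumptions and may differ between versions:

from autosklearn.classification import AutoSklearnClassifier

# Sketch: keep only fast linear models in the search space.
# 'sgd' and 'passive_aggressive' are assumed component names.
automl = AutoSklearnClassifier(
    time_left_for_this_task=3600,
    per_run_time_limit=360,
    include_classifiers=['sgd', 'passive_aggressive'])
automl.fit(X_train, y_train)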

@Motorrat
Contributor Author

Motorrat commented Mar 2, 2016

The SGDClassifier is fast now, but I certainly spent a few days getting there, including testing other (slow) classifiers and running the grid optimization pipeline a few times. So if auto-sklearn alone could reach the same accuracy in one go, even if it needs 20 hours, that is already a big achievement.
I just want to list the steps I went through to arrive at that fast and accurate SGDClassifier; maybe this can be formalized later for auto-sklearn (a rough sketch of the first two steps follows the list).

  1. When looking for a classifier, I tested all available classifiers with regard to their speed and accuracy.
  2. Tested the fastest ones on a grid and evaluated which parameters get them there (regularization, error term).
  3. Tested the slow ones (random forest is sloooow!) again to see if they could improve the accuracy significantly.
  4. Because the slow ones did not get better, I discarded them all, took the fastest one, and ran extensive grid optimization on multiple datasets to get the best performance out of it.
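
A rough sketch of steps 1 and 2 with plain scikit-learn (the classifiers, the parameter grid, and the X_train/X_test split names are illustrative, not my exact setup):

import time
from sklearn.linear_model import SGDClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV
from sklearn.metrics import f1_score

# Step 1: compare training time and accuracy of the candidate classifiers.
for clf in (SGDClassifier(), RandomForestClassifier()):
    start = time.time()
    clf.fit(X_train, y_train)
    print(type(clf).__name__, time.time() - start,
          f1_score(y_test, clf.predict(X_test)))

# Step 2: grid-search the fastest candidate (regularization and loss).
grid = GridSearchCV(SGDClassifier(),
                    {'alpha': [1e-5, 1e-4, 1e-3],
                     'loss': ['hinge', 'modified_huber']},
                    scoring='f1')
grid.fit(X_train, y_train)
print(grid.best_params_, grid.best_score_)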

I understand that you keep all converged models in an ensemble and wouldn't drop them. The question is also how much time each model gets from the overall budget to try to get better.

@mfeurer
Contributor

mfeurer commented Mar 3, 2016

How time is assigned to each individual model is a big research question, and your input is certainly very helpful. What's happening in auto-sklearn right now is that each model is trained until convergence. For the future, we want something that evaluates models only partially and refines them if they are promising, and that also evaluates models on subsets of the data.

Models are tested sequentially, and new models are proposed by the Bayesian optimization toolkit SMAC. We do not add all converged models to the ensemble; we only consider them for the ensemble, and the ensemble building strategy then selects a subset of these. Thus, a model gets as much time to converge as you specify with the argument per_run_time_limit. Does this answer your question?

@Motorrat
Contributor Author

Thanks, that does answer my question. I will have to learn more about SMAC to see how my suggestion could be harmonized with that.

@Motorrat
Contributor Author

Motorrat commented Apr 6, 2016

Matthias, you mentioned above "use far more than 20 hours on a cluster"; did you refer to the shared mode of SMAC there? I have read the SMAC guide and started a run with the following parameters to try to leverage all the cores on one of my machines:

c = AutoSklearnClassifier(
    time_left_for_this_task=10000, per_run_time_limit=3000,
    shared_mode=True, tmp_folder='./atskln_tmp', output_folder='./atskln_output',
    delete_tmp_folder_after_terminate=False, delete_output_folder_after_terminate=False,
    ml_memory_limit=22000)

Can you comment on whether that is the correct way of leveraging this mode?
I will increase the time budgets once I see that this goes through, so don't worry about that.

Could you share your setup for a cluster if you have more than one machine involved?
Thanks!

@mfeurer
Contributor

mfeurer commented Apr 6, 2016

Yes, I referred to the shared mode of SMAC. This is mostly correct. You need to specify a different seed for each of the classifiers with the argument seed=<int>. Then you can

  1. use all cores of one machine by starting your script several times, or
  2. use several cores on several machines, if they share a common file system, by starting the script several times.

The cluster I'm using has a shared file system, so I can use option two. My script looks something like this:
import time

from autosklearn.classification import AutoSklearnClassifier

# parse arguments
args = parse_args()
output_directory = args.output_directory
seed = args.seed

if seed <= 1:
    metalearning_configurations = 25
else:
    metalearning_configurations = 0

# Load data
X, y = load_data()

automl = AutoSklearnClassifier(
    tmp_dir=output_directory,
    output_dir=output_directory,
    time_left_for_this_task=172800,
    per_run_time_limit=10800,
    initial_configurations_via_metalearning=metalearning_configurations,
    # Don't set the ensemble size here to not start the ensemble
    # script
    ensemble_size=0,
    ensemble_nbest=200,
    ml_memory_limit=6500,
    resampling_strategy='cv',
    resampling_strategy_arguments={'folds': 5},
    shared_mode=True,
    seed=seed)

automl.fit(X, y)
ensemble_size = 50  # size of the final ensemble (assumed value)
automl.run_ensemble_builder(0, 1, ensemble_size).wait()
time.sleep(5)

# use the model/ensemble in whatever way
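
A minimal sketch of what the parse_args() placeholder above could look like (hypothetical; each copy of the script is started with its own --seed and the shared --output-directory):

import argparse

def parse_args():
    # Hypothetical helper: every invocation of the script passes its own
    # seed and the shared output directory on the command line.
    parser = argparse.ArgumentParser()
    parser.add_argument('--seed', type=int, required=True)
    parser.add_argument('--output-directory', dest='output_directory', required=True)
    return parser.parse_args()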

Does this help?

I will leave this issue open so I'm reminded to put this into the documentation.

@Motorrat
Contributor Author

Motorrat commented Apr 6, 2016

Sorry, it is not quite clear to me yet how one runs this on one machine.
Can I run this from inside the same script using multiprocessing? If yes, will just the constructor and fit methods run in parallel?

Suppose I have the code below to parallelize in a sub-process; what else should I add?

# seed = {1 to N} for parallel processes, will be assigned in the sub-process call logic
# should the sub-process code start here?
c = AutoSklearnClassifier( time_left_for_this_task=10000, per_run_time_limit=3000,
    shared_mode=True, tmp_folder='./atskln_tmp', output_folder='./atskln_output',
    delete_tmp_folder_after_terminate=False, delete_output_folder_after_terminate=False,
    ensemble_size=0, initial_configurations_via_metalearning=0, seed=seed,
    ml_memory_limit=22000)
c.fit(X_train, y_train)
# and does the sub-process code end here, with the results reconciled afterwards?
c.run_ensemble_builder(0, 1, ensemble_size).wait()
y_pred = c.predict(X_test)
print_prediction_report(y_test, y_pred)

@mfeurer
Contributor

mfeurer commented Apr 6, 2016

Then my description wasn't helpful, sorry. I run this script multiple times as separate processes. I'm not sure whether your multiprocessing approach will work, but you can give it a try. I think your script is mostly correct; you only need to create a new AutoSklearnClassifier in the main process:

# seed = {1 to N} for parallel processes, will be assigned in the sub-process call logic
# Start the subprocess code here
c = AutoSklearnClassifier( time_left_for_this_task=10000, per_run_time_limit=3000,
    shared_mode=True, tmp_folder='./atskln_tmp', output_folder='./atskln_output',
    delete_tmp_folder_after_terminate=False, delete_output_folder_after_terminate=False,
    ensemble_size=0, initial_configurations_via_metalearning=0, seed=seed,
    ml_memory_limit=22000)
c.fit(X_train, y_train)
# End the subprocess code here
# Need a new classifier object here because we're in the main process again.
c = AutoSklearnClassifier( time_left_for_this_task=10000, per_run_time_limit=3000,
    shared_mode=True, tmp_folder='./atskln_tmp', output_folder='./atskln_output',
    delete_tmp_folder_after_terminate=False, delete_output_folder_after_terminate=False,
    ensemble_size=0, initial_configurations_via_metalearning=0, seed=1,
    ml_memory_limit=22000)
ensemble_size = 50  # size of the final ensemble (assumed value)
c.run_ensemble_builder(0, 1, ensemble_size).wait()
y_pred = c.predict(X_test)
print_prediction_report(y_test, y_pred)

@Motorrat
Contributor Author

Motorrat commented Apr 6, 2016

The additional classifier in the main process with seed=1 is there to hook up the ensemble builder, correct? The other classifiers should then have seeds > 1, if I understand it right.

@mfeurer
Contributor

mfeurer commented Apr 6, 2016

Yes to the first question. I'm not sure whether the others need seeds > 1 or whether >= 1 would be sufficient, but you're on the safe side with seeds > 1.

@Motorrat
Contributor Author

Motorrat commented Apr 7, 2016

So here is the exact code I run (after the data is loaded):

from autosklearn.classification import AutoSklearnClassifier
from multiprocessing import Pool
import shutil

shutil.rmtree('./atskln_tmp',ignore_errors=True)
shutil.rmtree('./atskln_output',ignore_errors=True)

def spawn_classifier(seed):
    c = AutoSklearnClassifier( time_left_for_this_task=300, per_run_time_limit=90,
        shared_mode=True, tmp_folder='./atskln_tmp', output_folder='./atskln_output',
        delete_tmp_folder_after_terminate=False, delete_output_folder_after_terminate=False,
        ensemble_size=0, initial_configurations_via_metalearning=0, seed=seed,
        ml_memory_limit=22000)
    c.fit(X_train, y_train, metric='f1_metric')

if __name__ == '__main__':
    pl = Pool(6)  # the number is limited by available memory
    pl.map(spawn_classifier, [2, 3, 4, 5, 6, 7])

seed=1
c = AutoSklearnClassifier( time_left_for_this_task=300, per_run_time_limit=90,
    shared_mode=True, tmp_folder='./atskln_tmp', output_folder='./atskln_output',
    delete_tmp_folder_after_terminate=False, delete_output_folder_after_terminate=False,
    ensemble_size=0, initial_configurations_via_metalearning=0, seed=seed,
    ml_memory_limit=22000)

ensemble_size = 6
c.run_ensemble_builder(0, 1, ensemble_size).wait()

and I get this error message:

Traceback (most recent call last):
  File "autosklearn-multy.py", line 194, in <module>
    c.run_ensemble_builder(0, 1, ensemble_size).wait()
  File "/home/MY_USER/anaconda2/lib/python2.7/site-packages/AutoSklearn-0.0.1.dev0-py2.7-linux-x86_64.egg/autosklearn/automl.py", line 564, in run_ensemble_builder
    precision=self.precision
  File "/home/MY_USER/anaconda2/lib/python2.7/site-packages/AutoSklearn-0.0.1.dev0-py2.7-linux-x86_64.egg/autosklearn/util/submit_process.py", line 56, in run_ensemble_builder
    task_type = TASK_TYPES_TO_STRING[task_type]
KeyError: None
ensemble_size = 6 or ensemble_size = 0 result in the same error message.

I see all 6 or 7 Python processes start and occupy the usual amount of memory for the planned time in the "top" listing, so the part before run_ensemble_builder may be OK (I don't know for sure).

It looks like the task type is missing, but I can imagine the problem lies somewhere else in the missing configuration. What is still missing?

@mfeurer
Contributor

mfeurer commented Apr 8, 2016

It seems like the task is only set when fit() is called. You can try:

from autosklearn.constants import *
c._task = BINARY_CLASSIFICATION
c._metric = F1_METRIC

before calling c.run_ensemble_builder(). There might be more attributes missing; here is the code that calls the ensemble builder: https://github.com/automl/auto-sklearn/blob/master/autosklearn/automl.py#L552

@Motorrat
Contributor Author

Motorrat commented Apr 8, 2016

One more thing that I'm not sure we have covered: should metalearning_configurations = 0 for all classifiers, or should the first one (the one that is not fitted) have metalearning_configurations = 25?

I admit I haven't grokked the philosophy of auto-sklearn + SMAC yet, so please bear with me as we get to the first working example. It would really help to have minimal example code that runs end to end while leveraging a multicore machine; this should be standard hardware everywhere by now.

@mfeurer
Contributor

mfeurer commented Apr 8, 2016

I can't answer that question empirically, but it would make sense to run the first classifier in spawn_classifier with metalearning_configurations = 25. It certainly won't make SMAC work worse, and it can make the process of finding a good solution quicker; see the sketch below.
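
A sketch of what that could look like in your spawn_classifier (the only change is the seed-dependent value; all other arguments mirror your snippet):

def spawn_classifier(seed):
    # Let only the first worker start from meta-learning suggestions;
    # all other workers rely purely on SMAC.
    initial_configs = 25 if seed == 1 else 0
    c = AutoSklearnClassifier(
        time_left_for_this_task=300, per_run_time_limit=90, ml_memory_limit=22000,
        shared_mode=True, tmp_folder='./atskln_tmp', output_folder='./atskln_output',
        delete_tmp_folder_after_terminate=False, delete_output_folder_after_terminate=False,
        ensemble_size=0, initial_configurations_via_metalearning=initial_configs,
        seed=seed)
    c.fit(X_train, y_train, metric='f1_metric')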

Let me know once you have the example up and running; I'm interested to know whether it works and helps find good solutions quicker.

@Motorrat
Contributor Author

Motorrat commented Apr 8, 2016

Yes, I am trying to write this example myself, and it looks like I'm very close to making it work. But it is not very efficient as of now because I have to find the parameters of run_ensemble_builder empirically.

from autosklearn.classification import AutoSklearnClassifier
from autosklearn.constants import *
from multiprocessing import Pool
import shutil
import time

shutil.rmtree('./atskln_tmp',ignore_errors=True)
shutil.rmtree('./atskln_output',ignore_errors=True)

seed = 1
c = AutoSklearnClassifier( time_left_for_this_task=300, per_run_time_limit=90, ml_memory_limit=22000,
    shared_mode=True, tmp_folder='./atskln_tmp', output_folder='./atskln_output',
    delete_tmp_folder_after_terminate=False, delete_output_folder_after_terminate=False,
    ensemble_size=0, initial_configurations_via_metalearning=0,
    seed=seed)
c._task = BINARY_CLASSIFICATION
c._metric = F1_METRIC
c._precision = '32'

def spawn_classifier(seed):
    c = AutoSklearnClassifier( time_left_for_this_task=300, per_run_time_limit=90, ml_memory_limit=22000,
        shared_mode=True, tmp_folder='./atskln_tmp', output_folder='./atskln_output',
        delete_tmp_folder_after_terminate=False, delete_output_folder_after_terminate=False,
        ensemble_size=0, initial_configurations_via_metalearning=0,
        seed=seed)
    c.fit(X_train, y_train, metric='f1_metric')

if __name__ == '__main__':
    pl = Pool(10)  # the number is limited by available memory
    pl.map(spawn_classifier, [i+1 for i in range(10)])

ensemble_size = 50
c.run_ensemble_builder(0, 1, ensemble_size).wait()
time.sleep(5)

Here I get this error even though I have set the precision above:
Traceback (most recent call last):
  File "truffles_autosklearn-multy.py", line 194, in <module>
    c.run_ensemble_builder(0, 1, ensemble_size).wait()
  File "/home/MY_USER/anaconda2/lib/python2.7/site-packages/AutoSklearn-0.0.1.dev0-py2.7-linux-x86_64.egg/autosklearn/automl.py", line 564, in run_ensemble_builder
    precision=self.precision
  File "/home/MY_USER/anaconda2/lib/python2.7/site-packages/AutoSklearn-0.0.1.dev0-py2.7-linux-x86_64.egg/autosklearn/util/submit_process.py", line 74, in run_ensemble_builder
    call = ' '.join(call)
TypeError: sequence item 4: expected string, NoneType found

Wouldn't it be faster if you could run this exact same script with your own data and make sure it runs to completion?

@Motorrat
Contributor Author

Motorrat commented Apr 8, 2016

The code below works. I have set initial_configurations_via_metalearning=0 for all classifiers, and max_iterations=1, ensemble_size=50 for run_ensemble_builder. Is there a better approach? Should I play with those parameters?

from autosklearn.classification import AutoSklearnClassifier
from autosklearn.constants import *
from multiprocessing import Pool
import shutil
import time

shutil.rmtree('./atskln_tmp',ignore_errors=True)
shutil.rmtree('./atskln_output',ignore_errors=True)

seed = 1
c = AutoSklearnClassifier( time_left_for_this_task=3000, per_run_time_limit=900, ml_memory_limit=22000,
        shared_mode=True, tmp_folder='./atskln_tmp', output_folder='./atskln_output',
        delete_tmp_folder_after_terminate=False, delete_output_folder_after_terminate=False,
        ensemble_size=0, initial_configurations_via_metalearning=0,
        seed=seed)
c._task = BINARY_CLASSIFICATION
c._metric = F1_METRIC
c._precision = '32'
c._dataset_name = 'FooBar'

def spawn_classifier(seed):
    c = AutoSklearnClassifier( time_left_for_this_task=3000, per_run_time_limit=900, ml_memory_limit=22000,
            shared_mode=True, tmp_folder='./atskln_tmp', output_folder='./atskln_output',
            delete_tmp_folder_after_terminate=False, delete_output_folder_after_terminate=False,
            ensemble_size=0, initial_configurations_via_metalearning=0,
            seed=seed)
    c.fit(X_train, y_train, metric='f1_metric')

if __name__ == '__main__':
    pl = Pool(10)  # the number is limited by available memory
    pl.map(spawn_classifier, [i+1 for i in range(10)])

c.run_ensemble_builder(
    time_left_for_ensembles=0,
    max_iterations=1,
    ensemble_size=50,
    ).wait()

time.sleep(5)
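
For completeness, predictions could then be made with the same seed-1 object, following the pattern from the earlier snippets (print_prediction_report is the helper used above):

y_pred = c.predict(X_test)
print_prediction_report(y_test, y_pred)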

@mfeurer
Contributor

mfeurer commented Oct 12, 2016

Closing this for now - please reopen if you think it's not solved.
