getting final result from pSMAC on a distributed compute cluster #446

Closed
Mestalbet opened this issue Jul 4, 2018 · 14 comments

@Mestalbet

Mestalbet commented Jul 4, 2018

Hi,

I have managed to run pSMAC on a distributed Dask cluster. I run it as follows:

cs = ConfigurationSpace()
scenario = Scenario({"run_obj": "quality",
                     "runcount-limit": 50,
                     "cs": cs,
                     "deterministic": True,
                     "shared_model": True,
                     "initial_incumbent": 'RANDOM',
                     "output_dir": "/shared_data/smac3-output/runs/",
                     "input_psmac_dirs": "/shared_data/smac3-output/runs/run_*"})

def errortomimize(cfg):
    return trainmodel(**cfg)

tae = ExecuteTAFuncDict(errortomimize, use_pynisher=False)
smac = SMAC(scenario=scenario, rng=np.random.RandomState(i), tae_runner=tae, run_id=i)

where i in this case is the worker number.

This creates a directory structure in the shared space like the following:

smac3-output/runs
    |_______run_1
    |_______run_2
        ...
    |_______run_N

where N is the number of workers in the cluster.

However, the traj_aclib2.json files in the run_* subdirectories are identical. I tried adding verbose_level = "DEBUG" to errortomimize, but I don't get any output from it.

Any idea what the problem is? How can I tell whether pSMAC is actually sharing the model (as opposed to just running the same thing N times)?

Cheers,
Noah

@mlindauer
Contributor

As the first lines of your script, you should add:

import logging
logging.basicConfig(level=logging.DEBUG)

Afterwards you should see lines such as:
DEBUG:smac.optimizer.smbo.SMBO:Shared model mode: Loaded 5 new runs from ...

@mlindauer
Contributor

Yes, the parallel SMAC runs do not coordinate which configuration is considered as the incumbent (i.e., the currently best known configuration) and thus, each run will return its own incumbent.

@Mestalbet
Author

Hi,

Thanks. I managed to see the output. It looks like it's working correctly - how do I return the final incumbent values after a distributed run? I can see the values in the log...

Thanks again,
Noah

@mlindauer
Contributor

You can use smac.get_runhistory() to get a handle on the runhistory (i.e., all evaluated target algorithm runs).
runhistory.get_cost(incumbent) returns the estimated empirical cost.
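For example, a minimal sketch on each worker (assuming the worker has set up smac as in your snippet above and keeps the incumbent returned by smac.optimize()):

# sketch: query this worker's own SMAC object once its optimization loop has finished
incumbent = smac.optimize()                 # best configuration found by this worker
runhistory = smac.get_runhistory()          # all target algorithm runs seen by this worker
est_cost = runhistory.get_cost(incumbent)   # estimated empirical cost of that incumbent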

@Mestalbet
Author

Hi,

Thanks! Awesome. In distributed mode I get multiple SMAC objects, since one was created on each worker, so I am not sure how to implement your suggestion.

Cheers,
Noah

@Mestalbet changed the title from "pSMAC not sharing model" to "getting final result from pSMAC on a distributed compute cluster" on Jul 5, 2018
@mlindauer
Contributor

I don't know either, since I'm not familiar with your Dask cluster. Either your workers or your master should have access to the SMAC object at some point, right?

@Mestalbet
Author

They all have access to their own SMAC object - that's my point. Each one writes to a run_<run_id> directory, which is also read via the input_psmac_dirs setting.

@mlindauer
Contributor

Sorry, I don't understand your problem.
Each SMAC object returns its own incumbent configuration.
You only have to decide which of these to use in the end.
One simple way is to ask each SMAC object for its estimate of the empirical cost of its incumbent configuration and simply choose the best one.
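For instance (a sketch, assuming each worker sends its (incumbent, estimated cost) pair back to the master, e.g. collected with Dask's client.gather; the names client, futures and results are hypothetical):

# sketch: gather one (incumbent, est_cost) pair per worker and keep the cheapest one
results = client.gather(futures)  # hypothetical Dask futures, one per SMAC worker
best_incumbent, best_cost = min(results, key=lambda pair: pair[1])
print("overall best incumbent:", best_incumbent, "with estimated cost:", best_cost)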

@Mestalbet
Author

Okay, I understand now. I have to figure out how to get the intermediate result (the SMAC object) out of the computational graph. The issue is optimizing the distributed computation (the final result is a list of incumbents, not the SMAC objects). Of course, it's trivial to get the SMAC objects, but non-trivial (for me anyway) to optimize the computation then.

I'll post the solution here when I have it so it helps other people wanting to run SMAC in a distributed computation setting.

Thanks.

@brenting
Contributor

brenting commented Oct 24, 2018

@Mestalbet
I was having the same problem. If you run pSMAC on a cluster via the command line (which I assumed you did), you are using the SMACCLI object. This object creates the SMAC object inside the smac package, which is only accessible if you modify the package.
EDIT: I just noticed that you did not use the SMACCLI, so it is indeed possible to extract the run history directly. However, the full run history will only be available after the last worker has finished.

You can also rebuild the entire runhistory afterward and use that to predict the cost of every final incumbent of the parallel pools. I'll explain below with commented code.

import os
import glob
import logging

from smac.optimizer import pSMAC
from smac.scenario.scenario import Scenario
from smac.runhistory.runhistory import RunHistory
from smac.utils.io.traj_logging import TrajLogger

# get a list of all the directories with smac logs
pool_dirs = glob.glob('<smac_dir>/run_*')

# initiate runhistory object
runhistory = RunHistory(aggregate_func=None)

# get configuration space via scenario. 
# You need just 1 scenario file for the config_space as it does not differ per pool.
# I just select the one in the first folder
scen_file = os.path.join(pool_dirs[0], 'scenario.txt')
config_space = Scenario(scen_file).cs

# construct history using parallel smac logs
# Here you load the entire run history into the object 'runhistory' that is included
pSMAC.read(runhistory, pool_dirs, config_space, logging.getLogger())

for pool_dir in pool_dirs:
    # get the file name of the trajectory json file in this pool
    traj_path = os.path.join(pool_dir, 'traj_aclib2.json')

    # read the trajectory
    trajectory = TrajLogger.read_traj_aclib_format(fn=traj_path, cs=config_space)

    # get the last incumbent of the trajectory in this pool
    last_incumbent = trajectory[-1]['incumbent']

    # get the estimated cost of the incumbent using the runhistory of ALL pools
    est_cost = runhistory.get_cost(last_incumbent)

    # you can now use 'est_cost' and 'last_incumbent' to your liking.
    # however, 'last_incumbent' is a ConfigSpace Configuration object; to convert it to a dictionary, see below
    last_incumbent_dict = last_incumbent.get_dictionary()

@mlindauer
What I do notice is that earlier incumbents sometimes have a lower estimated cost. Should I check all incumbents that are logged instead of only the last one, or is there an explanation for that?

@mlindauer
Contributor

SMAC (potentially) evaluates a configuration only on a subset of instances, and this set of instances grows over time. Let's say SMAC decides that configuration X performs better than configuration Y on instance set S; so, the performance estimate of X will be smaller than that of Y given S. But in the next iteration, SMAC will evaluate X on an additional instance i, so that the subset of instances is now larger. (Please note that since Y was worse than X, SMAC won't evaluate Y on i.) If X performs worse on i than on average over S, its performance estimate will increase.

Long story short:
You cannot easily use the performance estimates of the runhistory to compare configurations (if your scenario includes instances).

@brenting
Contributor

I do indeed use instances for SMAC, so I have probably been doing it wrong then.

Can I then conclude the following?
Within a single worker, the cost of the SMAC incumbent can be estimated by using its own run history only. This estimated cost can then be used to compare it against incumbents of other workers.

@mlindauer
Contributor

In general, you should only compare the final incumbents against each other. You can do it based on the performance estimate of the runhistory if each worker looked at a representative set of instances for the final incumbent (i.e., the instance subset is not too small).
You could also try to fit an EPM model on the runhistories from all workers and use the predictions from the EPM to compare incumbents. However, that approach requires that you have good instance features.

@stale

stale bot commented Jun 18, 2022

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.

@stale stale bot added the stale label Jun 18, 2022
@stale stale bot closed this as completed Jun 25, 2022