Add support for HPO early stopping #125
Conversation
Some high-level comments:
- It looks like, from reading the code, the motivation behind implementing `save_intermediate_model` is that there are fewer requirements and constraints compared to checkpointing for spot instances. Is this correct?
- Are we required to save the model at the end of each iteration? With checkpointing, customers had to opt in, but this PR enables model saving to disk by default. The file I/O will add up for large models. Maybe this is due to my lack of understanding of the requirements. If the only requirement is to save the model when SIGTERM is received, can't we simply save the model on SIGTERM and avoid all the recurring file I/O? Something like:
```python
import os
import pickle as pkl
import signal
import sys


class SaveIntermediateModel:
    LATEST_MODEL = None
    SIGTERM_RECEIVED = False

    def __call__(self, env):
        if not SaveIntermediateModel.SIGTERM_RECEIVED:
            SaveIntermediateModel.LATEST_MODEL = env.model


save_intermediate_model = SaveIntermediateModel()  # global scope


def make_sigterm_handler(model_dir, is_master):
    # signal handlers receive (signum, frame), so close over the extra args
    def save_model_and_terminate(signum, frame):
        SaveIntermediateModel.SIGTERM_RECEIVED = True
        if is_master:
            with open(os.path.join(model_dir, "xgboost-model"), "wb") as f:
                pkl.dump(SaveIntermediateModel.LATEST_MODEL, f)
        sys.exit(0)

    return save_model_and_terminate


signal.signal(signal.SIGTERM, make_sigterm_handler(model_dir, is_master))
```
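To sanity-check that idea without xgboost, here is a self-contained simulation of the same save-on-SIGTERM flow. `FakeEnv` is a hypothetical stand-in for xgboost's callback environment (only `.model` is used), the model is a plain dict, and `sys.exit` is deliberately omitted from the handler so the script can verify the saved file afterwards:

```python
import os
import pickle as pkl
import signal
import tempfile


class SaveIntermediateModel:
    LATEST_MODEL = None
    SIGTERM_RECEIVED = False

    def __call__(self, env):
        # Keep a reference to the most recent model; no file I/O per iteration.
        if not SaveIntermediateModel.SIGTERM_RECEIVED:
            SaveIntermediateModel.LATEST_MODEL = env.model


class FakeEnv:  # stand-in for xgboost's CallbackEnv
    def __init__(self, model):
        self.model = model


def make_sigterm_handler(model_dir, is_master):
    def handler(signum, frame):
        SaveIntermediateModel.SIGTERM_RECEIVED = True
        if is_master:
            with open(os.path.join(model_dir, "xgboost-model"), "wb") as f:
                pkl.dump(SaveIntermediateModel.LATEST_MODEL, f)

    return handler


model_dir = tempfile.mkdtemp()
signal.signal(signal.SIGTERM, make_sigterm_handler(model_dir, is_master=True))

callback = SaveIntermediateModel()
for i in range(3):  # pretend boosting iterations
    callback(FakeEnv(model={"iteration": i}))

signal.raise_signal(signal.SIGTERM)  # simulate HPO early stopping

with open(os.path.join(model_dir, "xgboost-model"), "rb") as f:
    saved = pkl.load(f)
print(saved)  # {'iteration': 2}
```

The last model seen by the callback is the one persisted, and nothing is written to disk until the signal actually arrives.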
- If we have to save the model every iteration, did you consider exposing a hyperparameter so customers can choose?
```python
hyperparameters = {
    "hpo_early_stopping": False
}

if train_cfg.get("hpo_early_stopping"):
    callbacks.append(save_intermediate_model)
```
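One wrinkle if this route is taken: SageMaker delivers hyperparameters to the container as strings, so a naive truthiness check on `"False"` would still enable the feature. A small parser (`str_to_bool` is a hypothetical helper, not part of this PR) avoids the trap:

```python
def str_to_bool(value, default=False):
    """Parse a hyperparameter that arrives as a string ("True"/"False")."""
    if value is None:
        return default
    return str(value).strip().lower() in ("1", "true", "yes")


train_cfg = {"hpo_early_stopping": "False"}  # as delivered by SageMaker
print(str_to_bool(train_cfg.get("hpo_early_stopping")))  # False
print(bool(train_cfg.get("hpo_early_stopping")))         # True -- the trap
```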
- How was this PR tested? Are there any functional/integration tests?
I've recently learned about the `early_stopping_type` parameter in `HyperparameterTuner`. Is this parameter what triggers SIGTERM? If so, how will it be connected to our `hpo_early_stopping` hyperparameter? I'm just trying to see what the workflow looks like from the customer's perspective. Will the customer have to set two parameters (`HyperparameterTuner.early_stopping_type` and `train_cfg.hpo_early_stopping`) to enable this feature?
Yes, the
Sorry if I wasn't clear. Let me rephrase. Let's say a customer uses the Python SDK to trigger a tuning job and sets

Is the container going to save intermediate models if
Yes, the customer will also have to set

My point with the aside was that our early stopping support can be enabled if
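For reference, the two-setting workflow this exchange is describing would look roughly like the following with the SageMaker Python SDK. This is a sketch, not runnable as-is: `xgb_estimator` and `ranges` are placeholder objects, and the metric name is illustrative.

```python
from sagemaker.tuner import HyperparameterTuner

# Tuner-side switch: with "Auto", HPO may stop under-performing training
# jobs early, which is what delivers SIGTERM to the container.
tuner = HyperparameterTuner(
    estimator=xgb_estimator,                 # placeholder estimator
    objective_metric_name="validation:rmse", # illustrative metric
    hyperparameter_ranges=ranges,            # placeholder ranges
    early_stopping_type="Auto",
)

# Container-side switch discussed in this thread: opt in to saving the
# intermediate model when SIGTERM arrives.
xgb_estimator.set_hyperparameters(hpo_early_stopping=True)
```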
Overall it looks good to me. An action item is to clarify with HPO the expected behavior of `early_stopping_type`, but that can be a follow-up item.
One more comment: `test/resources/early_stopping/data/train/abalone.train_0` and `abalone.train_1` seem like duplicates of the files in `test/resources/abalone/data/train`. Can't we just re-use those?
Parameter behavior:
LGTM.
Merging to have this included in the upcoming deployment.
Issue #, if available:

Description of changes:

Adds support for external early stopping techniques, such as HPO:

- Handles the `SIGTERM` call received for early termination
- Adds a `save_intermediate_model` callback
- Saves the model when `SIGTERM` is received and captured

Tested with `tox`, integration tests, and new functional tests for single and multiple instances (CR-28179853).

By submitting this pull request, I confirm that you can use, modify, copy, and redistribute this contribution, under the terms of your choice.