
Add support for HPO early stopping #125

Merged — 10 commits merged into aws:master on Jul 20, 2020

Conversation

@salmankhurshid1 (Contributor) commented Jun 23, 2020

Issue #, if available:

Description of changes:

  • Adds support for external early stopping techniques, such as HPO:

    • Captures the SIGTERM signal sent for early termination
    • Saves the model to disk after each iteration using a save_intermediate_model callback
    • Cleans up the model directory when a SIGTERM is received and handled
    • Adds a flag to enable early stopping support

Tested with tox, integration tests, and new functional tests for single and multiple instances (CR-28179853).
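The save-every-iteration mechanism described above can be sketched as a minimal xgboost-style callback. This is illustrative only, based on the description in this PR; the class name, the `is_master` flag, and the `xgboost-model` file name are assumptions, not the actual PR code:

```python
import os
import pickle as pkl


class SaveIntermediateModel:
    """Write the current model to disk after every boosting iteration,
    so a copy already exists if a later SIGTERM cannot be handled cleanly."""

    def __init__(self, model_dir, is_master, name="xgboost-model"):
        self.model_path = os.path.join(model_dir, name)
        self.is_master = is_master

    def __call__(self, env):
        # env is the xgboost CallbackEnv passed to callbacks after each iteration;
        # only the master instance writes the model file.
        if self.is_master:
            with open(self.model_path, "wb") as f:
                pkl.dump(env.model, f)
```

The callback would be appended to the training callback list, so xgboost invokes it once per iteration.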

By submitting this pull request, I confirm that you can use, modify, copy, and redistribute this contribution, under the terms of your choice.

@edwardjkim (Contributor) left a comment

Some high-level comments:

  • From reading the code, it looks like the motivation for implementing save_intermediate_model is that it has fewer requirements and constraints than checkpointing for spot instances. Is this correct?

  • Are we required to save the model at the end of each iteration? With checkpointing, customers had to opt in, but this PR enables model saving to disk by default. The file I/O will add up for large models. Maybe this is due to my lack of understanding of the requirements. If the only requirement is to save the model when SIGTERM is received, can’t we simply save the model when there is a SIGTERM and avoid all the file I/O? Something like

import functools
import os
import pickle as pkl
import signal
import sys

class SaveIntermediateModel:

    LATEST_MODEL = None
    SIGTERM_RECEIVED = False

    def __call__(self, env):
        # xgboost invokes this callback with a CallbackEnv after each iteration;
        # keep the latest model in memory instead of writing it to disk.
        if not SaveIntermediateModel.SIGTERM_RECEIVED:
            SaveIntermediateModel.LATEST_MODEL = env.model

save_intermediate_model = SaveIntermediateModel()  # global scope

def save_model_and_terminate(signum, frame, model_dir, is_master):
    SaveIntermediateModel.SIGTERM_RECEIVED = True
    if is_master:
        with open(os.path.join(model_dir, "xgboost-model"), "wb") as f:
            pkl.dump(SaveIntermediateModel.LATEST_MODEL, f)
    sys.exit(0)

# model_dir and is_master are assumed to be available at registration time;
# signal handlers are called with (signum, frame), so bind the rest via partial.
signal.signal(
    signal.SIGTERM,
    functools.partial(save_model_and_terminate, model_dir=model_dir, is_master=is_master),
)
  • If we have to save the model every iteration, did you consider exposing a hyperparameter to let customers choose?
hyperparameters = {
    "hpo_early_stopping": False
}

if train_cfg.get("hpo_early_stopping"):
    callbacks.append(save_intermediate_model)
  • How was this PR tested? Are there any functional/integration tests?

src/sagemaker_xgboost_container/checkpointing.py — inline review comments (outdated, resolved)
test/unit/algorithm_mode/test_train_utils.py — inline review comments (outdated, resolved)
@salmankhurshid1 (Contributor, Author) commented Jul 2, 2020

  1. Correct, that's one part of the motivation. The other is that if changes need to be made to save_checkpoint in the future, it's easier to maintain the two as separate classes with separate logic.

  2. This would not guarantee that a model is saved for multiple instances, and possibly not even for single instances. During training with multiple instances, there are quite a few system calls (due to socket connections) that can't be interrupted; if they are interrupted, they raise errors and terminate the program. In such cases we would not be able to use the signal handler to save the model to disk, so it's important that we already have the latest copy on disk beforehand.

  3. I'm currently working on adding a hyperparameter so that customers can make this decision on their own (with guidance from our side through documentation, etc.). My plan was to add this as an extra parameter in the SageMaker Python SDK, like the ones for checkpointing. Do you think it's better to add it as a training hyperparameter?

  4. The PR was tested with unit tests for the cleanup_dir function and the save_intermediate_model class. The overall functionality was tested through new functional tests that send a SIGTERM and a SIGKILL, and through manual E2E training + inference jobs on live instances. It can't be covered by our integration tests, since they have no mechanism to send a SIGTERM and wait for the program to finish.
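The save-then-clean-up flow described here can be sketched roughly as follows. The names cleanup_dir and the "xgboost-model" file follow this conversation, but the handler structure is an assumption, not the actual PR code:

```python
import os
import shutil
import signal
import sys


def cleanup_dir(model_dir, keep_file):
    """Remove everything in model_dir except the final model file."""
    for name in os.listdir(model_dir):
        if name == keep_file:
            continue
        path = os.path.join(model_dir, name)
        if os.path.isdir(path):
            shutil.rmtree(path)
        else:
            os.remove(path)


def make_sigterm_handler(model_dir, is_master):
    def handler(signum, frame):
        # The intermediate model was already written to disk each iteration,
        # so on SIGTERM we only need to clean up extra files and exit.
        if is_master:
            cleanup_dir(model_dir, keep_file="xgboost-model")
        sys.exit(0)
    return handler


# Registration (model_dir and is_master come from the training context):
# signal.signal(signal.SIGTERM, make_sigterm_handler(model_dir, is_master))
```

Keeping the handler itself free of file writes is what makes it safe even when the process is blocked in an uninterruptible call at termination time.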

@edwardjkim (Contributor) left a comment

I've recently learned about the early_stopping_type parameter in HyperparameterTuner. Is this the parameter that triggers SIGTERM? If so, how will it be connected to our hpo_early_stopping hyperparameter? I'm trying to understand the workflow from the customer's perspective. Will the customer have to set two parameters (HyperparameterTuner.early_stopping_type and train_cfg.hpo_early_stopping) to enable this feature?

test/unit/test_checkpointing.py — inline review comments (outdated, resolved)
src/sagemaker_xgboost_container/algorithm_mode/train.py — inline review comments (outdated, resolved)
src/sagemaker_xgboost_container/algorithm_mode/train.py — inline review comments (outdated, resolved)
@salmankhurshid1 (Contributor, Author) commented

Yes, the early_stopping_type parameter triggers the SIGTERM. We do not want to force large I/O costs on all users who set that parameter if they have no need for this feature; for this reason, we have another parameter. As an aside, the customer doesn't have to set both parameters to enable this feature. To enable this feature, they simply have to set our current parameter (hpo_early_stopping).

@edwardjkim (Contributor) commented

> As an aside, the customer doesn't have to set both parameters to enable this feature. To enable this feature, they simply have to set our current parameter (hpo_early_stopping).

Sorry if I wasn't clear. Let me rephrase. Let's say a customer uses the Python SDK to trigger a tuning job and sets early_stopping_type='Auto':

HyperparameterTuner(
    xgboost_estimator,
    early_stopping_type="Auto"
)

Is the container going to save intermediate models if early_stopping_type='Auto'? Do the customers also have to set hpo_early_stopping="true" in order to have model files for early-stopped jobs?

@salmankhurshid1 (Contributor, Author) commented Jul 13, 2020

> Is the container going to save intermediate models if early_stopping_type='Auto'? Do the customers also have to set hpo_early_stopping="true" in order to have model files for early-stopped jobs?

Yes, the customer will also have to set hpo_early_stopping="true" even if early_stopping_type=Auto is set. The reason is that a user who wants early stopping of HPO jobs through early_stopping_type=Auto might still not want our early stopping support (saving the intermediate model), for example because of high I/O costs for large models or no need for intermediate models, and we don't want to force it on them for every job.

My point with the aside was that our early stopping support can also be enabled when early_stopping_type=Auto is not set. I think the confusion comes from the term "feature": the feature is saving the intermediate model when a SIGTERM is received, and the use case for this feature is supporting external early stopping techniques.

@edwardjkim (Contributor) left a comment

Overall it looks good to me. An action item is to clarify with HPO on the expected behavior of early_stopping_type, but it can be done as a follow-up item.

One more comment: test/resources/early_stopping/data/train/abalone.train_0 and abalone.train_1 seem like duplicates of the files in test/resources/abalone/data/train. Can't we just re-use those?

src/sagemaker_xgboost_container/algorithm_mode/train.py — inline review comments (outdated, resolved)
@salmankhurshid1 (Contributor, Author) commented

  • Sure, I'll track that as a separate action item.

  • Those two files can be reused; my only concern is that if they are modified or removed for any reason, the early stopping tests would fail as well. If we don't expect that to happen, I can change the path in the early stopping tests and simply reuse those files.

@salmankhurshid1 (Contributor, Author) commented

Parameter behavior:

  • The name is set to "save_model_on_termination", based on its functionality.
  • The default behavior is disabled ("false").
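Under these decisions, the container-side wiring might look roughly like the following. train_cfg, save_intermediate_model, and build_callbacks are illustrative names based on this conversation, not the exact implementation:

```python
def build_callbacks(train_cfg, save_intermediate_model):
    """Append the intermediate-model callback only when the customer opts in.

    SageMaker hyperparameters arrive as strings, so the flag is compared
    against "true" rather than a boolean; the default is "false" (disabled).
    """
    callbacks = []
    if train_cfg.get("save_model_on_termination", "false").lower() == "true":
        callbacks.append(save_intermediate_model)
    return callbacks
```

Because the default is "false", existing training jobs see no extra file I/O unless they explicitly set the new hyperparameter.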

@edwardjkim (Contributor) left a comment

LGTM.

@edwardjkim (Contributor) commented

Merging to have this included in the upcoming deployment.

@edwardjkim edwardjkim merged commit 198566d into aws:master Jul 20, 2020