
Add support for HPO early stopping #125

Merged — 10 commits merged into aws:master on Jul 20, 2020

Conversation

@salmankhurshid1 (Contributor) commented Jun 23, 2020

Issue #, if available:

Description of changes:

  • Adds support for external early stopping techniques, such as HPO:

    • Captures the SIGTERM signal sent for early termination
    • Saves the model to disk after each iteration using a save_intermediate_model callback
    • Cleans up the model directory when a SIGTERM is received and handled
    • Adds a flag to enable early stopping support

Tested with tox, integration tests, and new functional tests for single and multiple instances (CR-28179853).
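The save-every-iteration mechanism described above can be sketched as a minimal xgboost-style callback. This is illustrative only, based on the description in this PR; the class name, the `is_master` flag, and the `xgboost-model` file name are assumptions, not the actual PR code:

```python
import os
import pickle as pkl


class SaveIntermediateModel:
    """Write the current model to disk after every boosting iteration,
    so a copy already exists if a later SIGTERM cannot be handled cleanly."""

    def __init__(self, model_dir, is_master, name="xgboost-model"):
        self.model_path = os.path.join(model_dir, name)
        self.is_master = is_master

    def __call__(self, env):
        # env is the xgboost CallbackEnv passed to callbacks after each iteration;
        # only the master instance writes the model file.
        if self.is_master:
            with open(self.model_path, "wb") as f:
                pkl.dump(env.model, f)
```

The callback would be appended to the training callback list, so xgboost invokes it once per iteration.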

By submitting this pull request, I confirm that you can use, modify, copy, and redistribute this contribution, under the terms of your choice.

@edwardjkim (Contributor) left a comment

Some high-level comments:

  • From reading the code, it looks like the motivation for implementing save_intermediate_model is that it has fewer requirements and constraints than checkpointing for spot instances. Is this correct?

  • Are we required to save the model at the end of each iteration? With checkpointing, customers had to opt in, but this PR enables model saving to disk by default. The file I/O will add up for large models. Maybe this is due to my lack of understanding of the requirements. If the only requirement is to save the model when SIGTERM is received, can’t we simply save the model when there is a SIGTERM and avoid all the file I/O? Something like

import functools
import os
import pickle as pkl
import signal
import sys

class SaveIntermediateModel:

    LATEST_MODEL = None
    SIGTERM_RECEIVED = False

    def __call__(self, env):
        # xgboost invokes this callback with a CallbackEnv after each iteration;
        # keep the latest model in memory instead of writing it to disk.
        if not SaveIntermediateModel.SIGTERM_RECEIVED:
            SaveIntermediateModel.LATEST_MODEL = env.model

save_intermediate_model = SaveIntermediateModel()  # global scope

def save_model_and_terminate(signum, frame, model_dir, is_master):
    SaveIntermediateModel.SIGTERM_RECEIVED = True
    if is_master:
        with open(os.path.join(model_dir, "xgboost-model"), "wb") as f:
            pkl.dump(SaveIntermediateModel.LATEST_MODEL, f)
    sys.exit(0)

# model_dir and is_master are assumed to be available at registration time;
# signal handlers are called with (signum, frame), so bind the rest via partial.
signal.signal(
    signal.SIGTERM,
    functools.partial(save_model_and_terminate, model_dir=model_dir, is_master=is_master),
)
  • If we have to save the model every iteration, did you consider exposing a hyperparameter to let customers choose?
hyperparameters = {
    "hpo_early_stopping": False
}

if train_cfg.get("hpo_early_stopping"):
    callbacks.append(save_intermediate_model)
  • How was this PR tested? Are there any functional/integration tests?

src/sagemaker_xgboost_container/checkpointing.py — inline review comments (outdated, resolved)
test/unit/algorithm_mode/test_train_utils.py — inline review comments (outdated, resolved)
@salmankhurshid1 (Contributor, Author) commented Jul 2, 2020

  1. Correct, that's one part of the motivation. The other is that if changes need to be made to save_checkpoint in the future, it's easier to maintain the two as separate classes with separate logic.

  2. This would not guarantee that a model is saved for multiple instances, and possibly not even for single instances. During training with multiple instances, there are quite a few system calls (due to socket connections) that can't be interrupted; if they are interrupted, they raise errors and terminate the program. In such cases we would not be able to use the signal handler to save the model to disk, so it's important that we already have the latest copy on disk beforehand.

  3. I'm currently working on adding a hyperparameter so that customers can make this decision on their own (with guidance from our side through documentation, etc.). My plan was to add this as an extra parameter in the SageMaker Python SDK, like the ones for checkpointing. Do you think it's better to add it as a training hyperparameter?

  4. The PR was tested with unit tests for the cleanup_dir function and the save_intermediate_model class. The overall functionality was tested through new functional tests that send a SIGTERM and a SIGKILL, and through manual E2E training + inference jobs on live instances. It can't be covered by our integration tests, since they have no mechanism to send a SIGTERM and wait for the program to finish.
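The save-then-clean-up flow described here can be sketched roughly as follows. The names cleanup_dir and the "xgboost-model" file follow this conversation, but the handler structure is an assumption, not the actual PR code:

```python
import os
import shutil
import signal
import sys


def cleanup_dir(model_dir, keep_file):
    """Remove everything in model_dir except the final model file."""
    for name in os.listdir(model_dir):
        if name == keep_file:
            continue
        path = os.path.join(model_dir, name)
        if os.path.isdir(path):
            shutil.rmtree(path)
        else:
            os.remove(path)


def make_sigterm_handler(model_dir, is_master):
    def handler(signum, frame):
        # The intermediate model was already written to disk each iteration,
        # so on SIGTERM we only need to clean up extra files and exit.
        if is_master:
            cleanup_dir(model_dir, keep_file="xgboost-model")
        sys.exit(0)
    return handler


# Registration (model_dir and is_master come from the training context):
# signal.signal(signal.SIGTERM, make_sigterm_handler(model_dir, is_master))
```

Keeping the handler itself free of file writes is what makes it safe even when the process is blocked in an uninterruptible call at termination time.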

@edwardjkim (Contributor) left a comment

I've recently learned about the early_stopping_type parameter in HyperparameterTuner. Is this the parameter that triggers SIGTERM? If so, how will it be connected to our hpo_early_stopping hyperparameter? I'm trying to understand the workflow from the customer's perspective. Will the customer have to set two parameters (HyperparameterTuner.early_stopping_type and train_cfg.hpo_early_stopping) to enable this feature?

test/unit/test_checkpointing.py — inline review comments (outdated, resolved)
src/sagemaker_xgboost_container/algorithm_mode/train.py — inline review comments (outdated, resolved)
src/sagemaker_xgboost_container/algorithm_mode/train.py — inline review comments (outdated, resolved)
@salmankhurshid1 (Contributor, Author) commented

Yes, the early_stopping_type parameter triggers the SIGTERM. We do not want to force large I/O costs on all users who set that parameter if they have no need for this feature; for this reason, we have another parameter. As an aside, the customer doesn't have to set both parameters to enable this feature. To enable this feature, they simply have to set our current parameter (hpo_early_stopping).

@edwardjkim (Contributor) commented

> As an aside, the customer doesn't have to set both parameters to enable this feature. To enable this feature, they simply have to set our current parameter (hpo_early_stopping).

Sorry if I wasn't clear. Let me rephrase. Let's say a customer uses the Python SDK to trigger a tuning job and sets early_stopping_type='Auto':

HyperparameterTuner(
    xgboost_estimator,
    early_stopping_type="Auto"
)

Is the container going to save intermediate models if early_stopping_type='Auto'? Do the customers also have to set hpo_early_stopping="true" in order to have model files for early-stopped jobs?

@salmankhurshid1 (Contributor, Author) commented Jul 13, 2020

> Is the container going to save intermediate models if early_stopping_type='Auto'? Do the customers also have to set hpo_early_stopping="true" in order to have model files for early-stopped jobs?

Yes, the customer will also have to set hpo_early_stopping="true" even if early_stopping_type=Auto is set. The reason is that a user who wants early stopping of HPO jobs through early_stopping_type=Auto might still not want our early stopping support (saving the intermediate model), for example because of high I/O costs for large models or no need for intermediate models, and we don't want to force it on them for every job.

My point with the aside was that our early stopping support can also be enabled when early_stopping_type=Auto is not set. I think the confusion comes from the term "feature": the feature is saving the intermediate model when a SIGTERM is received, and the use case for this feature is supporting external early stopping techniques.

@edwardjkim (Contributor) left a comment

Overall it looks good to me. An action item is to clarify with HPO on the expected behavior of early_stopping_type, but it can be done as a follow-up item.

One more comment: test/resources/early_stopping/data/train/abalone.train_0 and abalone.train_1 seem like duplicates of the files in test/resources/abalone/data/train. Can't we just re-use those?

src/sagemaker_xgboost_container/algorithm_mode/train.py — inline review comments (outdated, resolved)
@salmankhurshid1 (Contributor, Author) commented

  • Sure, I'll track that as a separate action item.

  • Those two files can be reused; my only concern is that if they are modified or removed for any reason, the early stopping tests would fail as well. If we don't expect that to happen, I can change the path in the early stopping tests and simply reuse those files.

@salmankhurshid1 (Contributor, Author) commented

Parameter behavior:

  • The name is set to "save_model_on_termination", based on its functionality.
  • The default behavior is disabled ("false").
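Under these decisions, the container-side wiring might look roughly like the following. train_cfg, save_intermediate_model, and build_callbacks are illustrative names based on this conversation, not the exact implementation:

```python
def build_callbacks(train_cfg, save_intermediate_model):
    """Append the intermediate-model callback only when the customer opts in.

    SageMaker hyperparameters arrive as strings, so the flag is compared
    against "true" rather than a boolean; the default is "false" (disabled).
    """
    callbacks = []
    if train_cfg.get("save_model_on_termination", "false").lower() == "true":
        callbacks.append(save_intermediate_model)
    return callbacks
```

Because the default is "false", existing training jobs see no extra file I/O unless they explicitly set the new hyperparameter.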

@edwardjkim (Contributor) left a comment

LGTM.

@edwardjkim (Contributor) commented

Merging to have this included in the upcoming deployment.

@edwardjkim edwardjkim merged commit 198566d into aws:master Jul 20, 2020