feat: Try again in remote launching when ResourceLimitExceeded is caught #676

Merged
merged 1 commit into from
May 26, 2023

Conversation

mseeger
Collaborator

@mseeger mseeger commented May 16, 2023

I could not directly use your decorator. Maybe there is a better solution.


By submitting this pull request, I confirm that you can use, modify, copy, and redistribute this contribution, under the terms of your choice.

@mseeger mseeger requested a review from jgolebiowski May 16, 2023 13:45
@codecov

codecov bot commented May 16, 2023

Codecov Report

Patch coverage: 13.63% and project coverage change: -0.05 ⚠️

Comparison is base (c09eb22) 65.90% compared to head (2310a76) 65.85%.

❗ Current head 2310a76 differs from pull request most recent head 07fdb48. Consider uploading reports for the commit 07fdb48 to get more accurate results

Additional details and impacted files
@@            Coverage Diff             @@
##             main     #676      +/-   ##
==========================================
- Coverage   65.90%   65.85%   -0.05%     
==========================================
  Files         399      399              
  Lines       27810    27826      +16     
==========================================
- Hits        18327    18326       -1     
- Misses       9483     9500      +17     
| Impacted Files | Coverage Δ |
|---|---|
| benchmarking/commons/hpo_main_common.py | 61.11% <ø> (ø) |
| benchmarking/commons/launch_remote_common.py | 0.00% <0.00%> (ø) |
| benchmarking/commons/launch_remote_local.py | 0.00% <0.00%> (ø) |
| benchmarking/commons/launch_remote_simulator.py | 0.00% <0.00%> (ø) |
| ...izer/schedulers/searchers/regularized_evolution.py | 71.83% <ø> (ø) |
| syne_tune/remote/scheduling.py | 100.00% <100.00%> (ø) |
| ...chmarking/benchmark_commons/test_hpo_main_local.py | 100.00% <100.00%> (ø) |

... and 1 file with indirect coverage changes


@@ -25,6 +27,7 @@
from benchmarking.nursery.benchmark_multiobjective.hpo_main import main


@pytest.mark.skip("NEEDS FIXING")
Collaborator

The test started failing after #672 was merged.

It's a new test that I just added, I am checking now to see if we can keep it enabled or if we really need to skip it.

Collaborator

Fix for test: #678

Collaborator

@jgolebiowski jgolebiowski left a comment

Small changes; most importantly, I think we should combine the two decorators into one.

@@ -39,3 +43,39 @@ def wrapper(*args, **kwargs):
        return wrapper

    return errorcatch


def backoff_boto_clienterror(
Collaborator

The only difference I see is `if errorname not in str(ex):` vs. `if not e.__class__.__name__ == errorname:`. I think we should just agree on one and combine those two decorators.

Collaborator Author

I don't know how to do that. I don't know what `e.__class__.__name__` would be for `botocore.exceptions.ClientError`. Also, there are different `ClientError`s; they differ only by error message.

I find this matching on class name quite brittle.
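To illustrate the concern: botocore raises the same `ClientError` class for many different service errors, so the class name alone cannot separate a quota error from, say, a validation error. The sketch below uses a stand-in `ClientError` (an assumption, so that it runs without botocore installed) that mimics botocore's message layout and its `response["Error"]["Code"]` field. The operation name and messages are made up for illustration.

```python
class ClientError(Exception):
    """Stand-in for botocore.exceptions.ClientError (assumption: same
    message layout and ``response`` dict as the real class)."""

    def __init__(self, error_response, operation_name):
        self.response = error_response
        self.operation_name = operation_name
        error = error_response.get("Error", {})
        super().__init__(
            f"An error occurred ({error.get('Code', 'Unknown')}) when calling "
            f"the {operation_name} operation: {error.get('Message', 'Unknown')}"
        )


# Two different service errors (hypothetical payloads), one exception class
quota_error = ClientError(
    {"Error": {"Code": "ResourceLimitExceeded", "Message": "Quota exhausted"}},
    "CreateTrainingJob",
)
other_error = ClientError(
    {"Error": {"Code": "ValidationException", "Message": "Bad input"}},
    "CreateTrainingJob",
)

# Both share one class name, so matching on it cannot tell them apart
same_class_name = quota_error.__class__.__name__ == other_error.__class__.__name__

# The error code is visible both in the message and in the response dict
in_message = "ResourceLimitExceeded" in str(quota_error)
code = quota_error.response["Error"]["Code"]
```

Matching on the message string or on the response code both distinguish the two cases; matching on `__class__.__name__` does not.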

                try:
                    return some_function(*args, **kwargs)
                except ClientError as ex:
                    if errorname not in str(ex):
Collaborator

I see the point of catching multiple errors, but this seems very permissive; for example, an empty string here will catch even keyboard interrupts. I think we should have a solution that is flexible but stricter. How about passing a list of errors to be caught rather than just a single one?

Collaborator Author

Oh wait, this is wrong! I only want to catch ClientError

Comment on lines 108 to 117
    @backoff_boto_clienterror(
        errorname="ResourceLimitExceeded", length2sleep=backoff_wait_time
    )
    def fit_sagemaker_estimator_with_backoff(estimator: EstimatorBase, **kwargs):
        estimator.fit(**kwargs)

    if backoff_wait_time > 0:
        fit_sagemaker_estimator_with_backoff(estimator, **kwargs)
    else:
        estimator.fit(**kwargs)
Collaborator

I think this is abusing decorators a little. I would try to bake the logic in directly here, or otherwise use something along the lines of:

  if backoff_wait_time > 0:
      backoff_boto_clienterror(errorname="ResourceLimitExceeded", length2sleep=backoff_wait_time)(estimator.fit)(**kwargs)
  else:
      estimator.fit(**kwargs)
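For readers less familiar with applying a parameterized decorator without the `@` syntax, here is a minimal sketch. The `retry` factory and the `flaky` function are hypothetical and only stand in for the decorator and `estimator.fit` discussed here. The key point is that the decorator must receive the function object itself; the arguments are passed to the resulting wrapper afterwards.

```python
import functools


def retry(times):
    """Hypothetical decorator factory: call the function up to ``times`` times."""

    def decorate(func):
        @functools.wraps(func)
        def wrapper(*args, **kwargs):
            last_error = None
            for _ in range(times):
                try:
                    return func(*args, **kwargs)
                except RuntimeError as ex:
                    last_error = ex
            raise last_error

        return wrapper

    return decorate


calls = []


def flaky(value):
    """Fails on the first call, succeeds afterwards."""
    calls.append(value)
    if len(calls) < 2:
        raise RuntimeError("transient failure")
    return value * 2


# Wrap the function object first, then call the wrapper with the arguments.
# Writing retry(times=3)(flaky(21)) instead would call flaky before any
# retry logic is in place.
result = retry(times=3)(flaky)(21)
```

This avoids defining a nested helper function just to carry the decorator, at the cost of a somewhat denser call expression.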

Collaborator Author

I don't know this stuff well

Collaborator Author

You now know what we need; maybe you have a proposal.

Collaborator Author

We need something that triggers on botocore.exceptions.ClientError, and only if the error string contains "ResourceLimitExceeded". Any other exception, or a ClientError with a different message, should pass through.
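A minimal sketch of that requirement, under stated assumptions: `ClientError` below is a stand-in for `botocore.exceptions.ClientError` so the example runs without botocore, and `call_with_backoff`, its parameters, and the injectable `sleep` argument are hypothetical names chosen for illustration, not the code that was merged.

```python
import time


class ClientError(Exception):
    """Stand-in for botocore.exceptions.ClientError (assumption)."""


def call_with_backoff(func, wait_seconds, max_attempts=100, sleep=time.sleep):
    """Retry ``func`` only on a ClientError mentioning ResourceLimitExceeded."""
    for _ in range(max_attempts):
        try:
            return func()
        except ClientError as ex:
            if "ResourceLimitExceeded" not in str(ex):
                raise  # a ClientError with a different message passes through
        # Other exception types are never caught, so they pass through as well
        sleep(wait_seconds)
    raise RuntimeError(f"Still failing after {max_attempts} attempts")


attempts = []


def fit():
    """Simulates a job launch that hits the quota twice, then succeeds."""
    attempts.append(1)
    if len(attempts) < 3:
        raise ClientError(
            "An error occurred (ResourceLimitExceeded) when calling the "
            "CreateTrainingJob operation: quota exhausted"
        )
    return "started"


waits = []  # record requested sleeps instead of actually sleeping
outcome = call_with_backoff(fit, wait_seconds=10, sleep=waits.append)
```

Passing `sleep` as a parameter is only a testing convenience here; in real use the default `time.sleep` would apply.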

Collaborator Author

I'd also be fine to make the whole decorator specific to this exact type of exception, because that is what we need it for.

Collaborator

@jgolebiowski jgolebiowski May 24, 2023

I would suggest baking the decorator logic directly into this function, given that we are already specifying a lot of bespoke elements (see the new decorator) that are not used anywhere else. For example:

def fit_sagemaker_estimator(backoff_wait_time: int, estimator: EstimatorBase, ntimes_resource_wait: int = 100, length2sleep: int = 360, **kwargs):

    # If backoff_wait_time is 0, run standard fitting without retries
    if backoff_wait_time == 0:
        return estimator.fit(**kwargs)

    for idx in range(ntimes_resource_wait):
        try:
            return estimator.fit(**kwargs)
        except botocore.exceptions.ClientError as e:
            # Retry only on quota errors; re-raise any other ClientError
            if "ResourceLimitExceeded" not in str(e):
                raise
        logger.info(
            "botocore.exceptions.ClientError[ResourceLimitExceeded] detected "
            "when calling estimator.fit. Waiting "
            f"{length2sleep / 60} minutes before retrying"
        )
        time.sleep(length2sleep)

Collaborator

Alternatively, we can adapt the previously available decorator and then use it:

def backoff(error: Type[Exception], ntimes_resource_wait: int = 100, length2sleep: float = 600):
    """
    Decorator that backs off for a fixed number of seconds after a given error is detected
    """
    def errorcatch(some_function):
        @functools.wraps(some_function)
        def wrapper(*args, **kwargs):
            for idx in range(ntimes_resource_wait):
                try:
                    return some_function(*args, **kwargs)
                except Exception as e:
                    # Compare by type: an exception instance never equals its class
                    if not isinstance(e, error):
                        raise

                logger.info(
                    f"{error} detected when calling <{some_function.__name__}>, waiting {length2sleep / 60} minutes before retrying"
                )
                time.sleep(length2sleep)
        return wrapper

    return errorcatch

and the inside of the function becomes

    @backoff(
        error=botocore.exceptions.ClientError, length2sleep=backoff_wait_time
    )
    def fit_sagemaker_estimator_with_backoff(estimator: EstimatorBase, **kwargs):
        estimator.fit(**kwargs)

    if backoff_wait_time > 0:
        fit_sagemaker_estimator_with_backoff(estimator, **kwargs)
    else:
        estimator.fit(**kwargs)

@mseeger
Collaborator Author

mseeger commented May 22, 2023

@jgolebiowski I think we should aim to close this. The situation is this:

  • We have a specific need, namely that a certain exception is thrown, of a certain subtype, when you run out of quota. Having this in benchmarking (or elsewhere in Syne Tune) would be great
  • I do not have the skills to connect your solution to this need. I don't know how stable it is to match class names of exceptions, in particular if they are from some external dependency. Your solution also does not allow dependence on the exception message

Solutions:

  • Either we replace your general case with the specific case we need at the moment (I don't really see a broader one)
  • Or you make sure your general case works for the specific one, hopefully not making it a lot more complex

@mseeger
Collaborator Author

mseeger commented May 22, 2023

How about a refactoring of your code like this?

def backoff(
    errorclass: Type[Exception],
    message_contains: Optional[str] = None,
    ntimes_resource_wait: int = 100,
    length2sleep: float = 600
):
    def errorcatch(some_function):
        @functools.wraps(some_function)
        def wrapper(*args, **kwargs):
            for idx in range(ntimes_resource_wait):
                try:
                    return some_function(*args, **kwargs)
                except Exception as e:
                    # Re-raise anything that is not the targeted error type
                    if not isinstance(e, errorclass):
                        raise
                    # Re-raise if the message does not match the requested text
                    if message_contains is not None and message_contains not in str(e):
                        raise

                logging.info(
                    f"{errorclass} detected when calling <{some_function.__name__}>, waiting {length2sleep / 60} minutes before retrying"
                )
                time.sleep(length2sleep)

        return wrapper

    return errorcatch

That would allow the use case here and be clean (matching type instead of class name)
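One way to check a decorator like this without waiting ten minutes per retry is to apply it to stubs with `length2sleep=0`. The sketch below includes a compact copy of the decorator proposed above so that it is self-contained; `FakeClientError` and the two stub functions are hypothetical stand-ins for `botocore.exceptions.ClientError` and `estimator.fit`.

```python
import functools
import logging
import time
from typing import Optional, Type

logger = logging.getLogger(__name__)


def backoff(
    errorclass: Type[Exception],
    message_contains: Optional[str] = None,
    ntimes_resource_wait: int = 100,
    length2sleep: float = 600,
):
    """Compact copy of the proposed decorator, repeated for self-containment."""

    def errorcatch(some_function):
        @functools.wraps(some_function)
        def wrapper(*args, **kwargs):
            for _ in range(ntimes_resource_wait):
                try:
                    return some_function(*args, **kwargs)
                except errorclass as e:
                    if message_contains is not None and message_contains not in str(e):
                        raise
                logger.info("%s detected, retrying", errorclass.__name__)
                time.sleep(length2sleep)

        return wrapper

    return errorcatch


class FakeClientError(Exception):
    """Hypothetical stand-in for botocore.exceptions.ClientError."""


calls = {"quota": 0, "denied": 0}


@backoff(FakeClientError, message_contains="ResourceLimitExceeded", length2sleep=0)
def fit_after_two_quota_errors():
    calls["quota"] += 1
    if calls["quota"] <= 2:
        raise FakeClientError("An error occurred (ResourceLimitExceeded)")
    return "fitted"


@backoff(FakeClientError, message_contains="ResourceLimitExceeded", length2sleep=0)
def fit_with_access_denied():
    calls["denied"] += 1
    raise FakeClientError("An error occurred (AccessDenied)")


# Quota errors are retried until the call succeeds
result = fit_after_two_quota_errors()

# A non-matching message is re-raised on the first attempt
try:
    fit_with_access_denied()
    passed_through = False
except FakeClientError:
    passed_through = True
```

Note that if all `ntimes_resource_wait` attempts fail with the matching error, the wrapper as written falls out of the loop and returns `None`; raising at that point may be preferable, depending on the caller.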

Collaborator

@jgolebiowski jgolebiowski left a comment

I have combined all the comments under the fit_sagemaker_estimator function

@mseeger
Collaborator Author

mseeger commented May 25, 2023

OK, I think this is it. Please do check it again.

@mseeger
Collaborator Author

mseeger commented May 25, 2023

Tested this:

(st_with_ray) 88665a46461b:syne-tune matthis$ python benchmarking/examples/launch_local/launch_remote.py --method BO --benchmark transformer_wikitext2 --experiment_tag blurb1 --instance_type ml.g5.24xlarge --estimator_fit_backoff_wait_time 10
Failed to import apex. You can still train with --precision {float|double}.
coolname is not installed, will not be used
Master random_seed = 163844129
0%| | 0/1 [00:00<?, ?it/s]blurb1-BO-0
hyperparameters = {'experiment_tag': 'blurb1', 'benchmark': 'transformer_wikitext2', 'method': 'BO', 'save_tuner': 0, 'num_seeds': 1, 'start_seed': 0, 'random_seed': 163844129, 'scale_max_wallclock_time': 0, 'launched_remotely': 1, 'instance_type': 'ml.g5.24xlarge', 'verbose': 0, 'remote_tuning_metrics': 1, 'delete_checkpoints': 0, 'num_gpus_per_trial': 1}
Results written to s3://sagemaker-us-west-2-719355911555/syne-tune/blurb1/BO-0/
botocore.exceptions.ClientError[ResourceLimitExceeded] detected when calling estimator.fit. Waiting 0.16666666666666666 minutes before retrying
botocore.exceptions.ClientError[ResourceLimitExceeded] detected when calling estimator.fit. Waiting 0.16666666666666666 minutes before retrying

@mseeger
Collaborator Author

mseeger commented May 25, 2023

Nice, I have 0 quota for ml.g5.24xlarge. Seems to work

Collaborator

@jgolebiowski jgolebiowski left a comment

This solution looks good, thanks for adapting.

@mseeger mseeger merged commit 9ff15fe into main May 26, 2023
30 checks passed
@mseeger mseeger deleted the backoff_benchmarking branch May 26, 2023 09:48