feat: Try again in remote launching when ResourceLimitExceeded is caught #676

mseeger · 2023-05-16T13:45:06Z

I could not directly use your decorator. Maybe there is a better solution.

By submitting this pull request, I confirm that you can use, modify, copy, and redistribute this contribution, under the terms of your choice.

codecov · 2023-05-16T14:40:59Z

Codecov Report

Patch coverage: 13.63% and project coverage change: -0.05 ⚠️

Comparison is base (c09eb22) 65.90% compared to head (2310a76) 65.85%.

❗ Current head 2310a76 differs from pull request most recent head 07fdb48. Consider uploading reports for the commit 07fdb48 to get more accurate results

Additional details and impacted files

@@            Coverage Diff             @@
##             main     #676      +/-   ##
==========================================
- Coverage   65.90%   65.85%   -0.05%     
==========================================
  Files         399      399              
  Lines       27810    27826      +16     
==========================================
- Hits        18327    18326       -1     
- Misses       9483     9500      +17

Impacted Files	Coverage Δ
benchmarking/commons/hpo_main_common.py	`61.11% <ø> (ø)`
benchmarking/commons/launch_remote_common.py	`0.00% <0.00%> (ø)`
benchmarking/commons/launch_remote_local.py	`0.00% <0.00%> (ø)`
benchmarking/commons/launch_remote_simulator.py	`0.00% <0.00%> (ø)`
...izer/schedulers/searchers/regularized_evolution.py	`71.83% <ø> (ø)`
syne_tune/remote/scheduling.py	`100.00% <100.00%> (ø)`
...chmarking/benchmark_commons/test_hpo_main_local.py	`100.00% <100.00%> (ø)`

... and 1 file with indirect coverage changes


wesk · 2023-05-16T15:00:27Z

tst/benchmarking/benchmark_commons/test_hpo_main_local.py

@@ -25,6 +27,7 @@
 from benchmarking.nursery.benchmark_multiobjective.hpo_main import main


+@pytest.mark.skip("NEEDS FIXING")


The test started failing after #672 was merged.

It's a new test that I just added, I am checking now to see if we can keep it enabled or if we really need to skip it.

Fix for test: #678

jgolebiowski

Small changes, most importantly I think we should combine the two decorators into one

jgolebiowski · 2023-05-16T14:54:16Z

syne_tune/remote/scheduling.py

@@ -39,3 +43,39 @@ def wrapper(*args, **kwargs):
        return wrapper

    return errorcatch
+
+
+def backoff_boto_clienterror(


I see the only difference is if errorname not in str(ex): vs if not e.__class__.__name__ == errorname:. I think we should just agree on one and combine those two decorators.

I don't know how to do that. I don't know what e.__class__.__name__ would be for botocore.exceptions.ClientError. Also, there are different kinds of ClientError; they differ only by error message.

I find this matching on class name quite brittle.
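To make the class-name question concrete, here is a minimal sketch. `StandInClientError` is a hypothetical stand-in so the snippet runs without botocore. It assumes, as discussed above, that one exception class is raised for many different error codes and that the code only shows up in the message.

```python
class StandInClientError(Exception):
    """Hypothetical stand-in for botocore.exceptions.ClientError."""

    def __init__(self, code: str, operation: str):
        self.code = code
        super().__init__(
            f"An error occurred ({code}) when calling the {operation} operation"
        )


quota = StandInClientError("ResourceLimitExceeded", "CreateTrainingJob")
denied = StandInClientError("AccessDenied", "CreateTrainingJob")

# The class name is the same for both errors, so it cannot tell them apart
same_class_name = quota.__class__.__name__ == denied.__class__.__name__

# The message does differ, which is what `errorname not in str(ex)` relies on
quota_matches = "ResourceLimitExceeded" in str(quota)
denied_matches = "ResourceLimitExceeded" in str(denied)
```

Under these assumptions, matching on `e.__class__.__name__` can select the exception type but not the quota error specifically.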

jgolebiowski · 2023-05-16T14:55:21Z

syne_tune/remote/scheduling.py

+                try:
+                    return some_function(*args, **kwargs)
+                except ClientError as ex:
+                    if errorname not in str(ex):


I see the point of catching multiple errors but this seems very permissive, for example empty string here will catch even keyboard interrupts. I think we should have a solution that is flexible but more strict, how about passing a list of errors to be caught rather than just a single one?

Oh wait, this is wrong! I only want to catch ClientError
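A small sketch of the two points raised here, using a hypothetical `StandInClientError` in place of `botocore.exceptions.ClientError`: an empty `errorname` does match every message, but the `except` clause still limits what is caught to that one class.

```python
class StandInClientError(Exception):
    """Hypothetical stand-in for botocore.exceptions.ClientError."""


def is_retried(exc: BaseException, errorname: str) -> bool:
    """Mimic the check under review: retry only a matching StandInClientError."""
    try:
        raise exc
    except StandInClientError as ex:
        # Same substring test as in the diff above
        return errorname in str(ex)
    except BaseException:
        # Anything else would propagate past `except StandInClientError`
        return False


# The empty string is a substring of every string, so it matches any message
empty_matches_all = is_retried(StandInClientError("AccessDenied"), "")
# KeyboardInterrupt is not a StandInClientError, so it is never retried
interrupt_retried = is_retried(KeyboardInterrupt(), "")
```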

jgolebiowski · 2023-05-16T16:12:39Z

benchmarking/commons/launch_remote_common.py

+    @backoff_boto_clienterror(
+        errorname="ResourceLimitExceeded", length2sleep=backoff_wait_time
+    )
+    def fit_sagemaker_estimator_with_backoff(estimator: EstimatorBase, **kwargs):
+        estimator.fit(**kwargs)
+
+    if backoff_wait_time > 0:
+        fit_sagemaker_estimator_with_backoff(estimator, **kwargs)
+    else:
+        estimator.fit(**kwargs)


I think this is abusing decorators a little. I would try to bake the logic in directly here or otherwise use something along the lines of:

if backoff_wait_time > 0:
    backoff_boto_clienterror(errorname="ResourceLimitExceeded", length2sleep=backoff_wait_time)(estimator.fit(**kwargs))
else:
    estimator.fit(**kwargs)
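One detail when applying a decorator factory inline: the factory's result has to receive the function itself, and the arguments go into a second call. A minimal sketch with a hypothetical `retry` factory (not the decorator from this PR):

```python
import functools


def retry(times: int):
    """Hypothetical decorator factory: call the function up to `times` times."""

    def decorator(func):
        @functools.wraps(func)
        def wrapper(*args, **kwargs):
            last_error = None
            for _ in range(times):
                try:
                    return func(*args, **kwargs)
                except ValueError as ex:
                    last_error = ex
            raise last_error

        return wrapper

    return decorator


calls = []


def flaky(x: int) -> int:
    """Fail on the first call, succeed afterwards."""
    calls.append(x)
    if len(calls) < 2:
        raise ValueError("not yet")
    return x * 2


# Wrap the callable, then call the wrapper with the arguments.
# Writing retry(times=3)(flaky(21)) would instead call flaky once, unwrapped.
result = retry(times=3)(flaky)(21)
```

Applied to the suggestion above, the inline form would presumably pass `estimator.fit` rather than `estimator.fit(**kwargs)` to the decorator.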

I don't know this stuff well

You know now what we need, maybe you have a proposal

We need something that triggers on botocore.exceptions.ClientError, and only if the error string contains "ResourceLimitExceeded". Any other exception, or ClientError with other message, should pass through

I'd also be fine to make the whole decorator specific to this exact type of exception, because that is what we need it for.
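A minimal sketch of that requirement written as a plain retry loop. `call_with_backoff`, its parameters and `StandInClientError` are hypothetical names used only so the snippet runs without botocore or SageMaker; in the PR the error class would be `botocore.exceptions.ClientError` and the call would be `estimator.fit`.

```python
import time


class StandInClientError(Exception):
    """Hypothetical stand-in for botocore.exceptions.ClientError."""


def call_with_backoff(
    func,
    wait_seconds: float,
    max_attempts: int = 100,
    error_class=StandInClientError,
    message_contains: str = "ResourceLimitExceeded",
):
    """Call func(); retry only on error_class whose message contains the marker."""
    for attempt in range(max_attempts):
        try:
            return func()
        except error_class as ex:
            # A ClientError with another message passes through unchanged
            if message_contains not in str(ex):
                raise
            # Give up after the last attempt instead of returning None
            if attempt == max_attempts - 1:
                raise
            time.sleep(wait_seconds)


attempts = {"n": 0}


def fake_fit():
    """Pretend the quota frees up on the third attempt."""
    attempts["n"] += 1
    if attempts["n"] < 3:
        raise StandInClientError("(ResourceLimitExceeded) quota is 0")
    return "started"


outcome = call_with_backoff(fake_fit, wait_seconds=0)
```

Any other exception type is never caught by `except error_class`, so it propagates as required.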

I would suggest baking the decorator logic directly into this function, given that we are already specifying a lot of bespoke elements (see new decorator) that are not used anywhere else. For example:

def fit_sagemaker_estimator(
    backoff_wait_time: int,
    estimator: EstimatorBase,
    ntimes_resource_wait: int = 100,
    length2sleep: int = 360,
    **kwargs,
):
    # If backoff_wait_time is None, run standard fitting
    if backoff_wait_time == 0:
        estimator.fit(**kwargs)
    for idx in range(ntimes_resource_wait):
        try:
            return estimator.fit(**kwargs)
        except Exception as e:
            if not e == botocore.exceptions.ClientError:
                raise (e)

        logger.info(
            f"botocore.exceptions.ClientError[{errorname}] detected "
            f"when calling <{some_function.__name__}>. Waiting "
            f"{length2sleep / 60} minutes before retrying"
        )
        time.sleep(length2sleep)

Alternatively, we can adapt the previously available decorator and then use it:

def backoff(error: Exception, ntimes_resource_wait: int = 100, length2sleep: float = 600):
    """
    Decorator that backs off for a fixed amount of s after a given error is detected
    """

    def errorcatch(some_function):
        @functools.wraps(some_function)
        def wrapper(*args, **kwargs):
            for idx in range(ntimes_resource_wait):
                try:
                    return some_function(*args, **kwargs)
                except Exception as e:
                    if not e == error:
                        raise (e)

                logger.info(
                    f"{error} detected when calling <{some_function.__name__}>, waiting {length2sleep / 60} minutes before retrying"
                )
                time.sleep(length2sleep)
                continue

        return wrapper

    return errorcatch

and the insides of the function becomes

@backoff_boto_clienterror(
    error=botocore.exceptions.ClientError, length2sleep=backoff_wait_time
)
def fit_sagemaker_estimator_with_backoff(estimator: EstimatorBase, **kwargs):
    estimator.fit(**kwargs)

if backoff_wait_time > 0:
    fit_sagemaker_estimator_with_backoff(estimator, **kwargs)
else:
    estimator.fit(**kwargs)
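A note on the `e == error` check used in the two snippets above: it compares an exception instance with an exception class. A minimal sketch of what that evaluates to, with a hypothetical `StandInClientError` in place of `botocore.exceptions.ClientError`:

```python
class StandInClientError(Exception):
    """Hypothetical stand-in for botocore.exceptions.ClientError."""


error = StandInClientError
e = StandInClientError("ResourceLimitExceeded")

# An exception instance does not compare equal to its class,
# so `if not e == error: raise (e)` would re-raise every exception
equal_to_class = e == error

# isinstance matches instances of the class (and of its subclasses)
matches_class = isinstance(e, error)
```

So with `==` the loop would never reach the sleep, whereas an `isinstance` check would retry on the intended type.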

mseeger · 2023-05-22T16:08:56Z

@jgolebiowski I think we should aim to close this. The situation is this:

We have a specific need, namely that a certain exception is thrown, of a certain subtype, when you run out of quota. Having this in benchmarking (or elsewhere in Syne Tune) would be great
I do not have the skills to connect your solution to this need. I don't know how stable it is to match class names of exceptions, in particular if they are from some external dependency. Your solution also does not allow dependence on the exception message

Solutions:

Either we replace your general case with the specific case we need at the moment (I don't really see a broader one)
Or you make sure your general case works for the specific one, hopefully not making it a lot more complex

mseeger · 2023-05-22T20:47:57Z

How about a refactoring of your code like this?

def backoff(
    errorclass: Exception,
    message_contains: Optional[str] = None,
    ntimes_resource_wait: int = 100,
    length2sleep: float = 600
):
    def errorcatch(some_function):
        @functools.wraps(some_function)
        def wrapper(*args, **kwargs):
            for idx in range(ntimes_resource_wait):
                try:
                    return some_function(*args, **kwargs)
                except Exception as e:
                    if not isinstance(e, errorclass):
                        raise (e)
                    if message_contains is not None and message_contains not in str(e):
                        raise(e)

                logging.info(
                    f"{errorclass} detected when calling <{some_function.__name__}>, waiting {length2sleep / 60} minutes before retrying"
                )
                time.sleep(length2sleep)
                continue

        return wrapper

    return errorcatch

That would allow the use case here and be clean (matching type instead of class name)
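A usage sketch for this proposal, under stated assumptions. The decorator is restated in condensed form so the snippet runs on its own, `StandInClientError` is a hypothetical stand-in for `botocore.exceptions.ClientError`, and `length2sleep=0` is used only to keep the example fast.

```python
import functools
import logging
import time
from typing import Optional, Type


def backoff(
    errorclass: Type[Exception],
    message_contains: Optional[str] = None,
    ntimes_resource_wait: int = 100,
    length2sleep: float = 600,
):
    """Condensed restatement of the proposal above."""

    def errorcatch(some_function):
        @functools.wraps(some_function)
        def wrapper(*args, **kwargs):
            for idx in range(ntimes_resource_wait):
                try:
                    return some_function(*args, **kwargs)
                except Exception as e:
                    # Wrong type or wrong message: pass the exception through
                    if not isinstance(e, errorclass):
                        raise
                    if message_contains is not None and message_contains not in str(e):
                        raise
                logging.info(f"{errorclass} detected, retrying in {length2sleep} s")
                time.sleep(length2sleep)

        return wrapper

    return errorcatch


class StandInClientError(Exception):
    """Hypothetical stand-in for botocore.exceptions.ClientError."""


calls = {"n": 0}


@backoff(StandInClientError, message_contains="ResourceLimitExceeded", length2sleep=0)
def fit_once_quota_frees_up():
    calls["n"] += 1
    if calls["n"] < 3:
        raise StandInClientError("ResourceLimitExceeded: quota is 0")
    return "fit started"


@backoff(StandInClientError, message_contains="ResourceLimitExceeded", length2sleep=0)
def fit_without_permission():
    raise StandInClientError("AccessDenied")


result = fit_once_quota_frees_up()  # retried until the third call succeeds
try:
    fit_without_permission()
    passed_through = False
except StandInClientError:
    passed_through = True  # other messages are not retried
```

One thing to keep in mind with this shape: if all `ntimes_resource_wait` attempts fail with the matching error, the wrapper falls out of the loop and returns `None` rather than raising.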

jgolebiowski

I have combined all the comments under the fit_sagemaker_estimator function

mseeger · 2023-05-25T18:10:56Z

OK, I think this is it. Please do check it again.

mseeger · 2023-05-25T21:38:33Z

Tested this:

(st_with_ray) 88665a46461b:syne-tune matthis$ python benchmarking/examples/launch_local/launch_remote.py --method BO --benchmark transformer_wikitext2 --experiment_tag blurb1 --instance_type ml.g5.24xlarge --estimator_fit_backoff_wait_time 10
Failed to import apex. You can still train with --precision {float|double}.
coolname is not installed, will not be used
Master random_seed = 163844129
0%| | 0/1 [00:00<?, ?it/s]blurb1-BO-0
hyperparameters = {'experiment_tag': 'blurb1', 'benchmark': 'transformer_wikitext2', 'method': 'BO', 'save_tuner': 0, 'num_seeds': 1, 'start_seed': 0, 'random_seed': 163844129, 'scale_max_wallclock_time': 0, 'launched_remotely': 1, 'instance_type': 'ml.g5.24xlarge', 'verbose': 0, 'remote_tuning_metrics': 1, 'delete_checkpoints': 0, 'num_gpus_per_trial': 1}
Results written to s3://sagemaker-us-west-2-719355911555/syne-tune/blurb1/BO-0/
botocore.exceptions.ClientError[ResourceLimitExceeded] detected when calling estimator.fit. Waiting 0.16666666666666666 minutes before retrying
botocore.exceptions.ClientError[ResourceLimitExceeded] detected when calling estimator.fit. Waiting 0.16666666666666666 minutes before retrying

mseeger · 2023-05-25T21:39:09Z

Nice, I have 0 quota for ml.g5.24xlarge. Seems to work

jgolebiowski

This solution looks good, thanks for adapting.

mseeger requested a review from jgolebiowski May 16, 2023 13:45

wesk reviewed May 16, 2023


jgolebiowski requested changes May 16, 2023


mseeger force-pushed the backoff_benchmarking branch from 3807f7d to 7a5b66d Compare May 17, 2023 09:54

jgolebiowski requested changes May 24, 2023


mseeger force-pushed the backoff_benchmarking branch from 7a5b66d to fe2944e Compare May 25, 2023 18:08

feat: Try again in remote launching when ResourceLimitExceeded is caught

07fdb48

mseeger force-pushed the backoff_benchmarking branch from 2310a76 to 07fdb48 Compare May 26, 2023 09:28

jgolebiowski approved these changes May 26, 2023


mseeger merged commit 9ff15fe into main May 26, 2023
30 checks passed

mseeger deleted the backoff_benchmarking branch May 26, 2023 09:48

jgolebiowski added the feature label Jun 20, 2023
