-
Notifications
You must be signed in to change notification settings - Fork 49
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
feat: Try again in remote launching when ResourceLimitExceeded is caught #676
Conversation
Codecov ReportPatch coverage:
Additional details and impacted files@@ Coverage Diff @@
## main #676 +/- ##
==========================================
- Coverage 65.90% 65.85% -0.05%
==========================================
Files 399 399
Lines 27810 27826 +16
==========================================
- Hits 18327 18326 -1
- Misses 9483 9500 +17
☔ View full report in Codecov by Sentry. |
@@ -25,6 +27,7 @@ | |||
from benchmarking.nursery.benchmark_multiobjective.hpo_main import main | |||
|
|||
|
|||
@pytest.mark.skip("NEEDS FIXING") |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The test started failing after #672 was merged.
It's a new test that I just added, I am checking now to see if we can keep it enabled or if we really need to skip it.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Fix for test: #678
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Small changes, most importantly I think we should combine the two decorators into one
syne_tune/remote/scheduling.py
Outdated
@@ -39,3 +43,39 @@ def wrapper(*args, **kwargs): | |||
return wrapper | |||
|
|||
return errorcatch | |||
|
|||
|
|||
def backoff_boto_clienterror( |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I see the only difference is if errorname not in str(ex):
vs if not e.__class__.__name__ == errorname:
. I think we should just agree on one and combine those two decorators.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't know how to do that. I don't know what e.__class__.__name__ would be for
botocore.exceptions.ClientError. Also, there are different
ClientError`, they differ only by error message.
I find this matching on class name quite brittle.
syne_tune/remote/scheduling.py
Outdated
try: | ||
return some_function(*args, **kwargs) | ||
except ClientError as ex: | ||
if errorname not in str(ex): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I see the point of catchiing multiple errors but this seems very permissive, for example empty string here will catch even keyboard interrupts. I think we should have a solution that is flexible but more strict, how about passing a list of errors to be caught rather than just a single one?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Oh wait, this is wrong! I only want to catch ClientError
@backoff_boto_clienterror( | ||
errorname="ResourceLimitExceeded", length2sleep=backoff_wait_time | ||
) | ||
def fit_sagemaker_estimator_with_backoff(estimator: EstimatorBase, **kwargs): | ||
estimator.fit(**kwargs) | ||
|
||
if backoff_wait_time > 0: | ||
fit_sagemaker_estimator_with_backoff(estimator, **kwargs) | ||
else: | ||
estimator.fit(**kwargs) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I tihnk this is abusing decorators a little. I would try to bake the logic in directly here or otherwise use something along the lines of:
if backoff_wait_time > 0:
backoff_boto_clienterror(errorname="ResourceLimitExceeded", length2sleep=backoff_wait_time)(estimator.fit(**kwargs))
else:
estimator.fit(**kwargs)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't know this stuff well
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
You know now what we need, maybe you have a proposal
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We need something that triggers on botocore.exceptions.ClientError
, and only if the error string contains "ResourceLimitExceeded". Any other exception, or ClientError
with other message, should pass through
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'd also be fine to make the whole decorator specific to this exact type of exception, because that is what we need it for.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I would suggest to bake the decorator logic directly into this function given that we are already specifying a lot of bespoke elements (see new decorator) that are not used anywere else. For example as:
def fit_sagemaker_estimator(backoff_wait_time: int, estimator: EstimatorBase, ntimes_resource_wait: int = 100, length2sleep: int = 360, **kwargs):
# If backoff_wait_time is None, run standard fitting
if backoff_wait_time == 0:
estimator.fit(**kwargs)
for idx in range(ntimes_resource_wait):
try:
return estimator.fit(**kwargs)
except Exception as e:
if not e == botocore.exceptions.ClientError:
raise (e)
logger.info(
f"botocore.exceptions.ClientError[{errorname}] detected "
f"when calling <{some_function.__name__}>. Waiting "
f"{length2sleep / 60} minutes before retrying"
)
time.sleep(length2sleep)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Alternatively, we can adapt the previously available decorator and then use it:
def backoff(error: Exception, ntimes_resource_wait: int = 100, length2sleep: float = 600):
"""
Decorator that back offs for a fixed about of s after a given error is detected
"""
def errorcatch(some_function):
@functools.wraps(some_function)
def wrapper(*args, **kwargs):
for idx in range(ntimes_resource_wait):
try:
return some_function(*args, **kwargs)
except Exception as e:
if not e == error:
raise (e)
logger.info(
f"{error} detected when calling <{some_function.__name__}>, waiting {length2sleep / 60} minutes before retring"
)
time.sleep(length2sleep)
continue
return wrapper
return errorcatch
and the insides of the function becomes
@backoff_boto_clienterror(
error=botocore.exceptions.ClientError, length2sleep=backoff_wait_time
)
def fit_sagemaker_estimator_with_backoff(estimator: EstimatorBase, **kwargs):
estimator.fit(**kwargs)
if backoff_wait_time > 0:
fit_sagemaker_estimator_with_backoff(estimator, **kwargs)
else:
estimator.fit(**kwargs)
3807f7d
to
7a5b66d
Compare
@jgolebiowski I think we should aim to close this. The situation is this:
Solutions:
|
How about a refactoring of your code like this?
That would allow the use case here and be clean (matching type instead of class name) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I have combined all the comments under the fit_sagemaker_estimator function
7a5b66d
to
fe2944e
Compare
OK, I think this is it. Please do check it again. |
Tested this:
|
Nice, I have 0 quota for |
2310a76
to
07fdb48
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This solution looks good, thanks for adapting.
I could not directly use your decorater. Maybe there is a better solution.
By submitting this pull request, I confirm that you can use, modify, copy, and redistribute this contribution, under the terms of your choice.