Deflake progress metrics test with retry. #7455
Conversation
R: @tvalentyn
Force-pushed from 73ea215 to f55f46e
    DEFAULT_SAMPLING_PERIOD_MS = 0

    def retry(attempts):
Consider moving this to third_party/py/apache_beam/testing/util.py. If you do so:
- Consider adding a docstring with sample usage, and clarify that the input param is the total number of attempts, not the number of retries. Perhaps a different name (total_attempts?) would help disambiguate.
- I would add a comment to use this very sparingly, since we may otherwise start sweeping flakes under the rug; these may be legitimate bugs that happen due to rare race conditions.
- Consider adding a test for the decorator to util_test.py (sketched below).
If you think that in general this is a good idea but don't have time to do it, I can do that in a follow-up PR.
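For illustration only, a minimal sketch of what such a decorator test might look like. The import path follows the reviewer's suggestion and is hypothetical (in the PR itself the decorator lives in the test module), and it assumes the decorator retries on assertion failures:

```python
import unittest

# Hypothetical location, per the suggestion above; not the actual Beam code.
from apache_beam.testing.util import retry


class RetryDecoratorTest(unittest.TestCase):

  def test_succeeds_after_transient_failure(self):
    calls = []

    @retry(3)  # 3 total attempts, i.e. at most 2 retries.
    def flaky():
      calls.append(None)
      if len(calls) < 2:
        raise AssertionError('transient failure')

    flaky()
    self.assertEqual(2, len(calls))

  def test_reraises_after_exhausting_attempts(self):
    @retry(2)
    def always_fails():
      raise AssertionError('permanent failure')

    with self.assertRaises(AssertionError):
      always_fails()


if __name__ == '__main__':
  unittest.main()
```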
Looks like there are a few libraries that offer retry decorators for the same purpose; we can consider adopting one of them, see: https://lists.apache.org/thread.html/16060fd7f4d408857a5e4a2598cc96ebac0f744b65bf4699001350af@%3Cdev.beam.apache.org%3E
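For comparison, a hedged sketch of how one such library, tenacity, looks in this role (this PR does not actually adopt it; the flaky_check function is made up for illustration):

```python
import random

from tenacity import retry, stop_after_attempt


# reraise=True surfaces the final AssertionError itself rather than
# wrapping it in tenacity's RetryError.
@retry(stop=stop_after_attempt(3), reraise=True)
def flaky_check():
  # Stand-in for a flaky assertion with a ~1% failure rate per run.
  assert random.random() > 0.01, 'rare sampling race'


flaky_check()
```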
I thought about test_util, but that's primarily user-facing. Let's see where the discussion goes on tenacity.
    pcoll_b = p | 'b' >> beam.Create(['b'])
    assert_that((pcoll_a, pcoll_b) | First(), equal_to(['a']))

    @retry(2)
Let's add a comment from the PR description here to explain why we had to add a retry.
A second comment on the same @retry(2) line:
I think you meant to decorate the test_progress_metrics method instead.
Oops, yes.
    def apply(f):
      @functools.wraps(f)
      def wrapper(*args):
        for _ in range(attempts - 1):
Do we need to return the value of the decorated function, or pass kwargs? I guess neither is relevant for test methods.
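For illustration, a sketch of a variant that does both, applied to the test method as discussed above. This is an assumption-laden sketch, not the exact code in the PR: it retries on any exception, and the enclosing class name is hypothetical.

```python
import functools
import unittest


def retry(attempts):
  """Re-runs the decorated function up to `attempts` times in total.

  Use very sparingly: retries can hide legitimate bugs caused by rare
  race conditions.
  """
  def apply(f):
    @functools.wraps(f)
    def wrapper(*args, **kwargs):  # forward kwargs as well
      for _ in range(attempts - 1):
        try:
          return f(*args, **kwargs)  # propagate the return value
        except Exception:  # assumption: retry on any failure
          pass
      # Final attempt runs unguarded so the last failure propagates.
      return f(*args, **kwargs)
    return wrapper
  return apply


class ProgressMetricsTest(unittest.TestCase):  # hypothetical class name

  # Due to the not perfectly deterministic nature of state sampling
  # and sleep, this test has a ~1% failure rate; a single retry drops
  # that to roughly 1 in 10,000.
  @retry(2)
  def test_progress_metrics(self):
    self.assertTrue(True)  # placeholder for the real metric assertions


if __name__ == '__main__':
  unittest.main()
```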
Force-pushed from f55f46e to 07ef317
PTAL. This should look very similar to #7492 now :).
tvalentyn left a comment:
LGTM. I assume the failing PreCommit test test_error_traceback_includes_user_code is a preexisting failure, but we should fix that before merge to make sure it's not a surprising side effect...
retest this please
    all_metrics_via_montoring_infos, namespace, name, step='MyStep')

    # Due to somewhat non-deterministic nature of state sampling and sleep,
    # this test is flaky when state duraiton is low.
nit: there was a typo in my PR; it should be "duration" on line 572.
Due to the not perfectly deterministic nature of state sampling and sleep, this test has a 1% failure rate. Retrying with three total attempts drops that to one in a million.
Force-pushed from 07ef317 to e0f64e9
Tenacity is breaking the Flink Python tests, see: https://builds.apache.org/job/beam_PostCommit_Python_VR_Flink/lastCompletedBuild/console
@angoenka we should fix the Flink test suite, see: https://issues.apache.org/jira/browse/BEAM-6469.
We may consider replacing apache_beam.utils.retry with tenacity at some point.
Due to the not perfectly deterministic nature of state sampling and sleep,
this test has a 1% failure rate. Adding a single retry drops that to
1 in 10,000.
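The arithmetic behind both of the figures quoted in this thread, assuming attempts fail independently with probability p = 0.01: the chance that all n attempts fail is p**n. A quick check:

```python
p = 0.01  # observed per-run failure rate

# Probability that every one of n independent attempts fails is p**n.
for attempts in (1, 2, 3):
  print(attempts, p ** attempts)  # ~1e-2, ~1e-4 (1 in 10,000), ~1e-6 (one in a million)
```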