[BEAM-12603] Add retry on grpc data channel and remove retry from test. #17537

Merged
merged 3 commits into from
May 5, 2022

@y1chi y1chi commented May 4, 2022

@github-actions github-actions bot added the python label May 4, 2022
y1chi commented May 4, 2022

R: @TheNeuralBit

I was able to get 100 successful runs with this patch. Do you mind taking a look and helping me validate it and decide whether this is an acceptable fix?

codecov bot commented May 4, 2022

Codecov Report

Merging #17537 (3405928) into master (0a01fbe) will increase coverage by 0.03%.
The diff coverage is 100.00%.

@@            Coverage Diff             @@
##           master   #17537      +/-   ##
==========================================
+ Coverage   73.85%   73.89%   +0.03%     
==========================================
  Files         691      691              
  Lines       91255    91547     +292     
==========================================
+ Hits        67396    67645     +249     
- Misses      22626    22669      +43     
  Partials     1233     1233              
Flag Coverage Δ
python 83.68% <100.00%> (+<0.01%) ⬆️

Impacted Files Coverage Δ
...ks/python/apache_beam/runners/worker/data_plane.py 87.57% <100.00%> (-1.64%) ⬇️
...ks/python/apache_beam/runners/worker/sdk_worker.py 89.09% <100.00%> (-0.45%) ⬇️
.../python/apache_beam/testing/test_stream_service.py 88.09% <0.00%> (-4.77%) ⬇️
sdks/python/apache_beam/coders/row_coder.py 94.49% <0.00%> (-2.65%) ⬇️
sdks/python/apache_beam/runners/common.py 87.94% <0.00%> (-2.33%) ⬇️
...che_beam/runners/interactive/interactive_runner.py 89.43% <0.00%> (-1.41%) ⬇️
sdks/python/apache_beam/io/source_test_utils.py 88.01% <0.00%> (-1.39%) ⬇️
sdks/python/apache_beam/io/localfilesystem.py 90.97% <0.00%> (-0.76%) ⬇️
... and 9 more

Δ = absolute <relative> (impact), ø = not affected, ? = missing data

@TheNeuralBit TheNeuralBit left a comment

Thanks! I think this workaround is preferable to retrying the entire test. A couple of questions/concerns:

  • Are we sure it's safe to retry these UNAVAILABLE responses? The gRPC docs note "it is not always safe to retry non-idempotent operations."
  • I think this is a better workaround, but it's still a little concerning that we don't understand the root cause - any thoughts on how we could dig deeper?
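
To make the idempotency concern concrete, here is a minimal sketch. It is not code from this patch: `ConnectionLost`, `CounterService`, and `retry_once` are hypothetical names, and the toy service only stands in for a server that applies a side effect before the connection breaks. Under those assumptions, a retried call ends up applying the same logical operation twice.

```python
class ConnectionLost(Exception):
    """Hypothetical stand-in for a call that fails with UNAVAILABLE."""


class CounterService:
    """Toy server whose increment() is not idempotent."""

    def __init__(self):
        self.value = 0
        self._drop_next_reply = True

    def increment(self):
        # The side effect is applied first...
        self.value += 1
        # ...and then the connection breaks before the reply is delivered.
        if self._drop_next_reply:
            self._drop_next_reply = False
            raise ConnectionLost("reply lost after the request was handled")
        return self.value


def retry_once(operation):
    """Run operation, retrying a single time on ConnectionLost."""
    try:
        return operation()
    except ConnectionLost:
        return operation()


service = CounterService()
result = retry_once(service.increment)
print(result)  # the caller asked for one increment, but two were applied
```

A retry is only safe here if the failure is known to happen before the server handles the request, or if the operation itself is idempotent.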

y1chi commented May 4, 2022

> Thanks! I think this workaround is preferable to retrying the entire test. A couple of questions/concerns:
>
>   • Are we sure it's safe to retry these UNAVAILABLE responses? The gRPC docs note "it is not always safe to retry non-idempotent operations."
>   • I think this is a better workaround, but it's still a little concerning that we don't understand the root cause - any thoughts on how we could dig deeper?

I believe UNAVAILABLE in this case means that the underlying TCP connection was broken before the request was handled (I'm not sure why, and it's unclear how to debug that), so the request should be retriable. I've done more than 250 runs and didn't see any side effects, such as wrong results, after adding the retry. I also enabled the gRPC debug log but didn't find anything interesting.
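
As a rough illustration of the scenario described above (again, not the code in this patch), the sketch below retries a call with exponential backoff when it fails before being handled. `UnavailableError`, `FlakyChannel`, and `call_with_retry` are hypothetical names, and the attempt count and delays are assumed values, not the ones used in the change.

```python
import time


class UnavailableError(Exception):
    """Hypothetical stand-in for a gRPC error with status UNAVAILABLE."""


def call_with_retry(operation, max_attempts=5, initial_delay_secs=0.01,
                    sleep=time.sleep):
    """Call operation, retrying with exponential backoff on UnavailableError."""
    delay = initial_delay_secs
    for attempt in range(1, max_attempts + 1):
        try:
            return operation()
        except UnavailableError:
            if attempt == max_attempts:
                raise  # give up and surface the last error
            sleep(delay)
            delay *= 2


class FlakyChannel:
    """Toy channel that fails a fixed number of times before handling a call."""

    def __init__(self, failures):
        self.failures_left = failures
        self.handled = 0

    def send(self):
        if self.failures_left > 0:
            # The connection breaks before the request is handled,
            # so no side effect has happened yet.
            self.failures_left -= 1
            raise UnavailableError("connection broken before handling")
        self.handled += 1
        return "ok"


channel = FlakyChannel(failures=2)
delays = []  # record the backoff delays instead of actually sleeping
result = call_with_retry(channel.send, sleep=delays.append)
print(result, channel.handled, len(delays))
```

Because each failure happens before the request is handled, the request is processed exactly once despite the two retries, which matches the assumption that makes the retry safe.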

@y1chi y1chi merged commit 9154f8b into apache:master May 5, 2022