rpc: don't leave poison zero-nodeID connections in pool #37204

Merged
merged 1 commit into cockroachdb:master from tbg:fix/rpc-poison Apr 30, 2019

@tbg (Member) commented Apr 30, 2019

An optimization to share the (target,remoteNodeID) connection under a
second name (target,0) backfired because we were never unregistering
the latter, meaning that clients requesting (target,0) would be handed
an eternally broken connection.

See #37200.

Release note (bug fix): Avoid a source of internal connectivity problems
that would resolve after restarting the affected node.
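
To make the failure mode concrete, here is a minimal sketch of a pool that shares one connection under two names and unregisters it under both when it fails. It is not the code from this PR: `Pool`, `Conn`, `connKey`, `Dial`, and `Remove` are hypothetical names, and the real rpc package is structured differently.

```go
package main

import (
	"errors"
	"sync"
)

// NodeID identifies a remote node; 0 stands for "node ID not known".
type NodeID int32

// connKey is the (target, remoteNodeID) name a connection is pooled under.
type connKey struct {
	target string
	nodeID NodeID
}

// Conn is a stand-in for a pooled RPC connection (hypothetical type).
type Conn struct {
	err error // non-nil once the connection is permanently broken
}

// Pool is a toy connection pool keyed by (target, nodeID).
type Pool struct {
	mu    sync.Mutex
	conns map[connKey]*Conn
}

// NewPool returns an empty pool.
func NewPool() *Pool { return &Pool{conns: map[connKey]*Conn{}} }

// Dial returns the pooled connection for (target, nodeID), creating it if
// needed. A connection created with a known node ID is also registered
// under the second name (target, 0) unless that name is already taken.
func (p *Pool) Dial(target string, nodeID NodeID) *Conn {
	p.mu.Lock()
	defer p.mu.Unlock()
	key := connKey{target, nodeID}
	if c, ok := p.conns[key]; ok {
		return c
	}
	c := &Conn{}
	p.conns[key] = c
	if nodeID != 0 {
		zero := connKey{target, 0}
		if _, taken := p.conns[zero]; !taken {
			p.conns[zero] = c
		}
	}
	return c
}

// Remove marks c as broken and unregisters it under every name it may have
// been stored under. Deleting only (target, nodeID) would leave the broken
// connection reachable via (target, 0), which is the bug described above.
func (p *Pool) Remove(target string, nodeID NodeID, c *Conn) {
	p.mu.Lock()
	defer p.mu.Unlock()
	c.err = errors.New("connection closed")
	for _, key := range []connKey{{target, nodeID}, {target, 0}} {
		if p.conns[key] == c {
			delete(p.conns, key)
		}
	}
}
```

In this sketch, `Remove` checks that the stored value is the same connection before deleting, so a healthy connection that some other caller registered under `(target, 0)` is left alone.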

@tbg tbg requested a review from cockroachdb/core-prs as a code owner Apr 30, 2019

@tbg tbg requested a review from knz Apr 30, 2019

@tbg tbg force-pushed the tbg:fix/rpc-poison branch from 93098e5 to 0dd9ca7 Apr 30, 2019

tbg added a commit to tbg/cockroach that referenced this pull request Apr 30, 2019

roachtest: reduce hangs in acceptance-chaos tests
These tests are pretty janky, and can end up failing with a timeout and
a deadlocked test, which is not something roachtest can really ever
handle gracefully. Sprinkle more contexts around and set a statement
timeout for the central query that is most likely to get stuck under the
crucial lock that we think "causes" most of the deadlocks.

Of course there is likely a real problem with CRDB, which this PR does
nothing about. All that is (hopefully) achieved here is a clean failure
mode. The failure prompting this PR is fixed by cockroachdb#37204. Unfortunately,
it also turns out that the statement timeout added in this PR did not
prevent the statement from hanging. It is probably still worth merging
this.

Release note: None

knz approved these changes Apr 30, 2019

@knz (Member) left a comment

Nice find and elegant fix. I suppose this was precisely the kind of problem Peter was foreseeing when I initially made the change. I'm sorry I did not consider this before. LGTM in any case. Thank you!

Reviewed 2 of 2 files at r1.
Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained

@tbg (Member, Author) commented Apr 30, 2019

> I'm sorry I did not consider this before

Well, I'm sorry I didn't see it in review :-) It happens.

bors r=knz

craig bot pushed a commit that referenced this pull request Apr 30, 2019

Merge #37204
37204: rpc: don't leave poison zero-nodeID connections in pool r=knz a=tbg

An optimization to share the `(target,remoteNodeID)` connection under a
second name `(target,0)` backfired because we were never unregistering
the latter, meaning that clients requesting `(target,0)` would be handed
an eternally broken connection.

See #37200.

Release note (bug fix): Avoid a source of internal connectivity problems
that would resolve after restarting the affected node.

Co-authored-by: Tobias Schottdorf <tobias.schottdorf@gmail.com>

@craig commented Apr 30, 2019

Build succeeded

@craig craig bot merged commit 0dd9ca7 into cockroachdb:master Apr 30, 2019

3 checks passed

GitHub CI (Cockroach): TeamCity build finished
bors: Build succeeded
license/cla: Contributor License Agreement is signed.

craig bot pushed a commit that referenced this pull request May 1, 2019

Merge #37205
37205: roachtest: reduce hangs in acceptance-chaos tests r=andreimatei a=tbg

These tests are pretty janky, and can end up failing with a timeout and
a deadlocked test, which is not something roachtest can really ever
handle gracefully. Sprinkle more contexts around and set a statement
timeout for the central query that is most likely to get stuck under the
crucial lock that we think "causes" most of the deadlocks.

Of course there is likely a real problem with CRDB, which this PR does
nothing about. All that is (hopefully) achieved here is a clean failure
mode. The failure prompting this PR is fixed by #37204. Unfortunately,
it also turns out that the statement timeout added in this PR did not
prevent the statement from hanging. It is probably still worth merging
this.

Release note: None

Co-authored-by: Tobias Schottdorf <tobias.schottdorf@gmail.com>