Fix upgrade reconnect failure #637

Merged
merged 4 commits into apple:release-5.2 on Jul 28, 2018

Conversation

3 participants
@satherton
Contributor

satherton commented Jul 28, 2018

There are three flow connection-related fixes here. The most important: reliable packet sends to an already-connected peer that had been determined to be incompatible were being silently dropped. In certain rare sequences of events, this could cause the multi-version client to never connect to a cluster after an upgrade. The fix is to silently drop reliable packets only for peers newer than the current process, because the current process will never communicate with such a peer in its lifetime. Older peers, however, may be upgraded.

The other fixes address a memory leak and a rare failed assertion about the incompatible connection count, which could occur if a cluster was rapidly cycled between versions (in a very abusive and unsupported way).

satherton added some commits Jul 28, 2018

Bug fix causing clients to sometimes (rarely) not reconnect to upgraded clusters. Reliable packets to incompatible peers were being dropped intentionally; now this is done only if the peer is newer, since successful communication with a newer peer will never be possible.

@satherton satherton requested a review from etschannen Jul 28, 2018

@etschannen etschannen merged commit 7552a07 into apple:release-5.2 Jul 28, 2018

@ajbeamon

Contributor

ajbeamon commented Jul 28, 2018

One bug was already fixed in release 6.0, and since it was fixed differently here, you'll need to resolve the merge. Also, I don't think this bug requires any weird version cycling; I've seen it in normal upgrades.

#616

I'm a little curious about the main fix. The idea behind neutering the connection was that if a process were upgraded, the connection would be dropped and replaced. Is that not happening? Or is something going on with us caching something (e.g. the Peer) where a fix could be made instead?

I'm not sure the fix is completely correct as written, because it may be possible for the protocol version to downgrade in a case where it is still compatible (if the version differs only outside the mask). EDIT: actually, I think this paragraph can be disregarded because the behavior is guarded by a compatibility check.

Also, the fix dramatically reduces the benefit we were intending to provide. Some incompatible connections will now be expensive, so we can't describe them as generally low cost anymore. I don't think this is just theoretical, either: there are legitimate cases where a newer incompatible version stays connected for a lengthy period of time. Even if that period is relatively short, it's often long enough that we couldn't just ignore the difference.

If we intend to use this fix, it should probably be accompanied by filing an issue to restore the performance of incompatible connections.

@ajbeamon

Contributor

ajbeamon commented Jul 28, 2018

One other question:

Is the fix that duplicates the one made on release-6.0 still susceptible to accounting issues here? In the other version of the fix, we track when the counter is incremented and decrement it only if it had been incremented. Here, we decrement based on a property that isn't necessarily associated with the incrementing. For example, it appears that a peer which is both initially incompatible and actually incompatible will cause the counter to be incremented but not decremented. I'd also be concerned about the presence of waits in this code that could change the state between when this bool is set and when the counter is incremented.

@ajbeamon

Contributor

ajbeamon commented Jul 28, 2018

I talked this over with Evan through another channel. It sounds like the problem, roughly speaking, was that some packet sends were being waited on indefinitely after being dropped, and the waiters never learned when the connection became compatible again.

We also talked about what we believe the performance impact of the change to be. After thinking it over, the performance problem this was originally intended to resolve was likely related to failure monitoring. However, Evan checked and confirmed that failure monitoring won't be running in the case that was changed (failure monitoring requires talking to a cluster controller to start, and a newer client will never have talked to one).

So I think I'm in agreement now that the above change works without too much impact, pending a test to confirm that it was indeed failure monitoring that was costly and not something else.

@ajbeamon

Contributor

ajbeamon commented Jul 28, 2018

Also, Evan fixed the change for incompatible connection accounting in a subsequent commit and then removed the fix here in favor of the 6.0 version when he merged.
