Fix upgrade reconnect failure #637
There are three flow connection related fixes here. The most important is that reliable packet sends to an already-connected peer which had been determined to be incompatible were silently dropped. This could, in certain rare sequences of events, cause the multi-version client to never connect to a cluster after an upgrade. The fix is to only silently drop reliable packets to peers which are newer than the current process, because the current process will never communicate with such a peer in its lifetime. Older peers, however, may be upgraded.
The other fixes were a memory leak and a rare failed assertion regarding incompatible connection count which could occur if a cluster was rapidly cycled between versions (in a very abusive and unsupported way).
One bug was already fixed in release-6.0, and since it was fixed differently here, you'll need to resolve the merge conflict. Also, I don't think this bug requires any weird version cycling; I've seen it in normal upgrades.
I'm a little curious about the main fix. The idea behind neutering the connection was that if a process were upgraded, the connection would be dropped and replaced. Is that not happening? Or is there something maybe going on with us caching something (e.g. the Peer) where a fix could be made instead?
I'm not sure the fix is completely correct as written, because it may be possible for the protocol version to downgrade in a case that is still compatible (if the version differs only outside the mask). EDIT: actually, I think this paragraph can be disregarded, because the behavior is guarded by a compatibility check.
Also, the fix dramatically reduces the benefit we were intending to provide. It means some incompatible connections will now be expensive, so we can't describe them as being generally low cost anymore. I don't think this is just theoretical, either, as there are legitimate cases where you can end up with a newer incompatible version connecting for a lengthy period of time. Even if it's a relatively short period of time, it's often long enough that we couldn't just ignore the difference.
If we intend to use this fix, it should probably be accompanied by filing an issue to restore the performance of incompatible connections.
One other question:
Is the fix that's a duplicate of the one made on release-6.0 still susceptible here to accounting issues? On the other version of the fix, we track when the counter is incremented and decrement it only if it had been incremented. Here, we decrement based on some property which isn't necessarily associated with the incrementing. For example, it appears that a peer which is both initially incompatible and actually incompatible will cause the counter to be incremented and not decremented. I'd also be concerned about the presence of waits in this code that could change the state between when this bool is set and when the counter is incremented.
I talked this over with Evan through another channel. It sounds like the problem, roughly speaking, was that sends of some packets were waited on indefinitely after being dropped, and the waiters were unaware when the connection became compatible again.
We also talked about what we believe the performance impact of the change to be. After thinking it over, the performance problem this was intended to resolve was likely related to failure monitoring. However, Evan checked and confirmed that failure monitoring won't be running in the case that was changed (because failure monitoring requires talking to a cluster controller to start, and by being a newer client, it will never have talked to one).
So I think I'm in agreement now that the above change works without too much impact, pending a test to confirm that it was indeed failure monitoring that was costly and not something else.