This Bug Report affects these Traffic Control components:
Current behavior:
In an optimistic quorum formed by three TMs, when a single TM detects an ATS server as down, its /publish/CrStates report flaps between available and unavailable. This results in HTTP 503 responses from Traffic Router whenever it checks that TM at an instant when it reports unavailable. The TM in question seems to be disregarding its peers' reports of available.
Looking deeper, I noticed that the flapping of /publish/CrStates is just a consequence of another failure that occurs when the TM checks its peers. /publish/PeerStates also flaps between available and unavailable for both of its peers. I took packet captures of the calls to /publish/CrStates?raw on the peers, and they actually return an "available" state for the cache. Somewhere in the TM that detects the ATS as down, the local copy of the peers' states is being changed to unavailable.
Following the code, the bug seems to be somewhere in traffic_monitor/peer/peer.go or traffic_monitor/manager/manager.go. I could not pinpoint the exact function where it fails, as the variable names are a bit cryptic and I don't have much experience reading Go.
Expected behavior:
In an optimistic quorum, a TM that detects an ATS as down should always take the optimistic value reported by its peers. If the other two TMs report the ATS as available, the TM in question should also report it as available.
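To make the expected behavior concrete, here is a minimal sketch of the optimistic combination rule described above. It is not the actual Traffic Monitor code: the `CacheName` type and the `combineOptimistic` function are hypothetical names invented for this illustration, and the sketch assumes each TM's view can be reduced to a map from cache name to a boolean.

```go
package main

import "fmt"

// CacheName identifies a cache server. Hypothetical type for this sketch,
// not taken from the Traffic Monitor source.
type CacheName string

// combineOptimistic returns the state a TM should publish for every cache it
// knows about locally: a cache is available if the local TM OR at least one
// peer reports it as available.
func combineOptimistic(local map[CacheName]bool, peers []map[CacheName]bool) map[CacheName]bool {
	combined := make(map[CacheName]bool, len(local))
	for cache, available := range local {
		combined[cache] = available
		if available {
			continue // already available locally, no need to consult peers
		}
		for _, peer := range peers {
			if peer[cache] { // a missing entry reads as false
				combined[cache] = true
				break
			}
		}
	}
	return combined
}

func main() {
	// The local TM cannot reach edge-01, but both peers can.
	local := map[CacheName]bool{"edge-01": false, "edge-02": true}
	peers := []map[CacheName]bool{
		{"edge-01": true, "edge-02": true},
		{"edge-01": true, "edge-02": true},
	}
	fmt.Println(combineOptimistic(local, peers))
}
```

Under this rule the combined state for `edge-01` stays available for as long as the peers keep reporting it as available, so the published state should not flap.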
Steps to reproduce:
- Deploy an optimistic quorum of at least 3 TMs
- Simulate a connection drop between a single TM and an ATS server (e.g. with a firewall)
- Look at /publish/CrStates for this TM and see the state flap between available and unavailable
- Look at /publish/PeerStates and see the state flap between available and unavailable
- Make several streaming requests against TR (curl or browser stream)
- See TR also flap between successful requests and HTTP 503 errors (this propagates from the flaps in the affected TM)
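The flapping in the steps above can be quantified instead of watched by eye. The sketch below parses successive /publish/CrStates response bodies and counts how often one cache changes state. It assumes the response is JSON of the form `{"caches": {"<name>": {"isAvailable": <bool>}}}`; the helper names `cacheAvailable` and `countFlaps`, the cache name `edge-01`, and the sample bodies are made up for illustration. In practice the bodies would come from polling the affected TM at a fixed interval.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// crStates models only the part of the response this sketch needs.
// The field layout is an assumption about the /publish/CrStates JSON.
type crStates struct {
	Caches map[string]struct {
		IsAvailable bool `json:"isAvailable"`
	} `json:"caches"`
}

// cacheAvailable extracts the availability of one cache from a response body.
func cacheAvailable(body []byte, cache string) (bool, error) {
	var states crStates
	if err := json.Unmarshal(body, &states); err != nil {
		return false, err
	}
	entry, ok := states.Caches[cache]
	if !ok {
		return false, fmt.Errorf("cache %q not present in response", cache)
	}
	return entry.IsAvailable, nil
}

// countFlaps counts transitions between consecutive samples.
func countFlaps(samples []bool) int {
	flaps := 0
	for i := 1; i < len(samples); i++ {
		if samples[i] != samples[i-1] {
			flaps++
		}
	}
	return flaps
}

func main() {
	// Sample bodies standing in for three consecutive polls of the affected TM.
	bodies := []string{
		`{"caches":{"edge-01":{"isAvailable":true}}}`,
		`{"caches":{"edge-01":{"isAvailable":false}}}`,
		`{"caches":{"edge-01":{"isAvailable":true}}}`,
	}
	var samples []bool
	for _, b := range bodies {
		available, err := cacheAvailable([]byte(b), "edge-01")
		if err != nil {
			fmt.Println("error:", err)
			return
		}
		samples = append(samples, available)
	}
	fmt.Println("flaps:", countFlaps(samples))
}
```

A healthy optimistic quorum should yield zero flaps for a cache that the peers can still reach. Any non-zero count on the affected TM reproduces the bug.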