This repository was archived by the owner on Nov 24, 2025. It is now read-only.

Failures on the Optimistic Health Check for Traffic Monitor #6377

@mikeV02

Description

This Bug Report affects these Traffic Control components:

  • Traffic Monitor

Current behavior:

In an optimistic quorum formed by three TMs, when a single TM detects an ATS server as down, its report at /publish/CrStates flaps between available and unavailable. This results in HTTP 503 responses from Traffic Router whenever it checks that TM at an instant when it reports unavailable. The TM in question seems to be disregarding its peers' reports of available.

Looking deeper, I noticed that the flapping of /publish/CrStates is just a consequence of another failure that occurs when the TM checks its peers. /publish/PeerStates also flaps between available and unavailable for both of its peers. I took packet captures of the calls to /publish/CrStates?raw on its peers, and they do return an "available" state for the cache, but somewhere in the TM that detects the ATS as down, the local copy of the peers' states is being changed to unavailable.

Following the code, the bug seems to be somewhere in traffic_monitor/peer/peer.go or traffic_monitor/manager/manager.go. I could not pinpoint the exact function where it fails, as the variable names are a bit cryptic and I don't have much experience reading Go.

Expected behavior:

In an optimistic quorum, a TM that detects an ATS as down should always take the optimistic value reported by its peers. If the other two TMs report the ATS as available, the TM in question should also report it as available.
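
To make the expected behavior concrete, here is a minimal sketch of the combination rule described above. It is not taken from the Traffic Monitor source; the function name `optimisticAvailable` and its parameters are hypothetical, and it assumes the rule is simply "available if the local check or any peer says available".

```go
package main

import "fmt"

// optimisticAvailable is a hypothetical helper (not Traffic Monitor code).
// It returns true if the local health check or at least one peer reports
// the cache as available, which is the behavior this report expects.
func optimisticAvailable(local bool, peers []bool) bool {
	if local {
		return true
	}
	for _, peerSaysAvailable := range peers {
		if peerSaysAvailable {
			return true
		}
	}
	return false
}

func main() {
	// The local TM sees the ATS as down, but both peers see it as up.
	fmt.Println(optimisticAvailable(false, []bool{true, true})) // prints "true"
}
```

Under this rule the affected TM would report the cache as unavailable only if both of its peers also reported it as unavailable.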

Steps to reproduce:

  1. Deploy an optimistic quorum of at least 3 TMs
  2. Simulate a connection drop between a single TM and an ATS server (e.g. with a firewall rule)
  3. Look at /publish/CrStates for this TM and see the state flap between available and unavailable
  4. Look at /publish/PeerStates and see the peers' state flap between available and unavailable
  5. Make several streaming requests against TR (curl or browser stream)
  6. See TR also flap between successful requests and HTTP 503 errors (this propagates from the flaps in the affected TM)

Metadata

Labels

  • Traffic Monitor: related to Traffic Monitor
  • bug: something isn't working as intended
  • high impact: impacts the basic function, deployment, or operation of a CDN