New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Add health client parent health #7096
Conversation
2d8314d
to
316ee4a
Compare
316ee4a
to
1d993dc
Compare
dataOffset := 5 // because we have no options? | ||
|
||
native := TCPHdrNative{ | ||
SrcPort: uint16(srcPort), |
Check failure
Code scanning / CodeQL
Incorrect conversion between integer types
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
False positive, bounds are checked in GetAndHoldEphemeralPort.
608c7bf
to
712350b
Compare
destPort := host.Port | ||
dataOffset := 5 // because we have no options? | ||
native := TCPHdrNative{ | ||
SrcPort: uint16(srcPort), |
Check failure
Code scanning / CodeQL
Incorrect conversion between integer types
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
False positive, bounds are checked in GetAndHoldEphemeralPort
712350b
to
fede1f6
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looks good to me. One small typo in the README.md example tc-health-client.json has comma after last key value pair.
"parent-health-log-location": "/var/log/trafficcontrol/tc-health-client_parent-health.log", "health-methods": ["traffic-monitor", "parent-l4", "parent-l7", "parent-service"], "markdown-methods": ["traffic-monitor", "parent-l4", "parent-l7", "parent-service"], "hostname": "", }
fede1f6
to
a0ca168
Compare
Fixed |
a0ca168
to
e0b7d46
Compare
Adds parent health to the health client.
This allows caches to log and markdown parent health based on whether the cache can directly request its parents, in addition to the existing Traffic Monitor health.
Intra-datacenter network issues can cause a child cache to be unable to get to its parent, even though externally both are accessible to the Traffic Monitor. When this happens, we implicitly rely on the ATS parent health and markdown system, which is reactive and typically leads to client timeouts before a markdown and retry of a different parent can be completed. Direct parent health provides faster proactive markdown during these intra-datacenter network events.
This also significantly refactors the health client. It was a single thread with a loop to reload config, refresh TO data, get TM health, and markdown. This makes all operations their own goroutine/microthread service, as the single-threaded work loop just wasn't feasible with the size of work for parent health polling.
It adds 3 health mechanisms: L4 health (a TCP syn-ack-rst), L7 health (a successful HTTP response), and a meta-parent poll which polls the parent's own health client parent health and uses a heuristic of unavailable parents on the parent.
All new parent health mechanisms default to disabled, and should be considered experimental.
Which Traffic Control components are affected by this PR?
What is the best way to verify this PR?
Enable parent health on the health client, observe logs and markdowns
If this is a bugfix, which Traffic Control versions contained the bug?
Not a bug fix
PR submission checklist