Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add health client parent health #7096

Merged
merged 1 commit into from Oct 24, 2022
Merged

Conversation

rob05c
Copy link
Member

@rob05c rob05c commented Sep 28, 2022

Adds parent health to the health client.

This allows caches to log and markdown parent health based on whether the cache can directly request its parents, in addition to the existing Traffic Monitor health.

Intra-datacenter network issues can cause a child cache to be unable to get to its parent, even though externally both are accessible to the Traffic Monitor. When this happens, we implicitly rely on the ATS parent health and markdown system, which is reactive and typically leads to client timeouts before a markdown and retry of a different parent can be completed. Direct parent health provides faster proactive markdown during these intra-datacenter network events.

This also significantly refactors the health client. It was a single thread with a loop to reload config, refresh TO data, get TM health, and markdown. This makes all operations their own goroutine/microthread service, as the single-threaded work loop just wasn't feasible with the size of work for parent health polling.

It adds 3 health mechanisms: L4 health (a TCP syn-ack-rst), L7 health (a successful HTTP response), and a meta-parent poll which polls the parent's own health client parent health and uses a heuristic of unavailable parents on the parent.

All new parent health mechanisms default to disabled, and should be considered experimental.

Which Traffic Control components are affected by this PR?

  • Traffic Control Health Client (tc-health-client)

What is the best way to verify this PR?

Enable parent health on the health client, observe logs and markdowns

If this is a bugfix, which Traffic Control versions contained the bug?

Not a bug fix

PR submission checklist

@rob05c rob05c added new feature A new feature, capability or behavior tc-health-client Traffic Control Health Client labels Sep 28, 2022
@rob05c rob05c force-pushed the add-hc-parent-health branch 6 times, most recently from 2d8314d to 316ee4a Compare September 28, 2022 21:53
@rob05c rob05c marked this pull request as ready for review September 28, 2022 21:53
dataOffset := 5 // because we have no options?

native := TCPHdrNative{
SrcPort: uint16(srcPort),

Check failure

Code scanning / CodeQL

Incorrect conversion between integer types

Incorrect conversion of an integer with architecture-dependent bit size from [strconv.Atoi](1) to a lower bit size type uint16 without an upper bound check.
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

False positive, bounds are checked in GetAndHoldEphemeralPort.

@rob05c rob05c force-pushed the add-hc-parent-health branch 6 times, most recently from 608c7bf to 712350b Compare September 28, 2022 22:43
destPort := host.Port
dataOffset := 5 // because we have no options?
native := TCPHdrNative{
SrcPort: uint16(srcPort),

Check failure

Code scanning / CodeQL

Incorrect conversion between integer types

Incorrect conversion of an integer with architecture-dependent bit size from [strconv.Atoi](1) to a lower bit size type uint16 without an upper bound check.
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

False positive, bounds are checked in GetAndHoldEphemeralPort

@ocket8888 ocket8888 added the medium impact impacts a significant portion of a CDN, or has the potential to do so label Sep 29, 2022
Copy link
Contributor

@jpappa200 jpappa200 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks good to me. One small typo in the README.md example tc-health-client.json has comma after last key value pair.
"parent-health-log-location": "/var/log/trafficcontrol/tc-health-client_parent-health.log", "health-methods": ["traffic-monitor", "parent-l4", "parent-l7", "parent-service"], "markdown-methods": ["traffic-monitor", "parent-l4", "parent-l7", "parent-service"], "hostname": "", }

@rob05c
Copy link
Member Author

rob05c commented Oct 17, 2022

One small typo in the README.md

Fixed

tc-health-client/config/config_test.go Outdated Show resolved Hide resolved
tc-health-client/sar/ephemeralportholder.go Show resolved Hide resolved
tc-health-client/tmagent/markdownservice.go Outdated Show resolved Hide resolved
@zrhoffman zrhoffman merged commit 5ab489f into apache:master Oct 24, 2022
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
medium impact impacts a significant portion of a CDN, or has the potential to do so new feature A new feature, capability or behavior tc-health-client Traffic Control Health Client
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

4 participants