This repository was archived by the owner on Nov 24, 2025. It is now read-only.
Multi interface health #4916
Merged
mattjackson220 merged 115 commits into apache:master on Sep 11, 2020
shamrickus
suggested changes
Aug 6, 2020
Looks really close to me. TO and TM unit tests pass, API tests pass, docs build and look good, manual testing works as expected.
shamrickus
reviewed
Sep 9, 2020
When a set of interface info had only one of IPv4 or IPv6, and non-service interfaces appeared after the service interface in the iteration, the value stored at the interface name pointer would be changed because the iteration variable is referential; therefore although IP information was calculated correctly, the wrong InterfaceName would be set. This commit not only fixes that, but also actually speeds up the conversion process in such cases.
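The pointer problem described in this commit message can be reproduced in a few lines. The following is only a minimal sketch, not the actual conversion code: the `Interface` type and both function names are made up for illustration, and the shared variable is declared outside the loop on purpose so that the behavior is the same on every Go version (range variables were reused across iterations before Go 1.22).

```go
package main

import "fmt"

// Interface is a hypothetical, pared-down stand-in for a server interface.
type Interface struct {
	Name    string
	Service bool
}

// buggyServiceName stores a pointer into a variable that is overwritten on
// every iteration, so interfaces visited later change the pointed-at name.
func buggyServiceName(ifaces []Interface) string {
	var name *string
	var iface Interface // shared by all iterations, like a pre-1.22 range variable
	for i := range ifaces {
		iface = ifaces[i]
		if iface.Service && name == nil {
			name = &iface.Name
		}
	}
	if name == nil {
		return ""
	}
	return *name
}

// fixedServiceName copies the name by value and stops at the first match,
// which also avoids walking the remaining interfaces.
func fixedServiceName(ifaces []Interface) string {
	for _, iface := range ifaces {
		if iface.Service {
			return iface.Name
		}
	}
	return ""
}

func main() {
	ifaces := []Interface{{Name: "eth0", Service: true}, {Name: "eth1"}}
	fmt.Println(buggyServiceName(ifaces)) // prints "eth1"
	fmt.Println(fixedServiceName(ifaces)) // prints "eth0"
}
```

Returning early is one way a fix could also make the conversion faster in such cases; whether the real change does exactly this is an assumption.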
…g cache is included in status
mattjackson220
approved these changes
Sep 10, 2020
Looks good to me! lib, monitor, and ops unit tests pass, docs build and look good, v1 v2 and v3 API tests pass, testing in full environment all looks great!
shamrickus
approved these changes
Sep 10, 2020
LGTM
There are still a couple issues, specifically with upgrading ATC (#5001), but those will be resolved in a separate PR.
What does this PR (Pull Request) do?
This removes the temporary shim "aggregate" interface used by Traffic Monitor - and Traffic Ops to a much lesser degree - and adds the ability to truly monitor cache servers according to both aggregate statistics as well as network interface thresholds.
Because interface stats now appear in cache stats TM API responses, this also introduces an equivalent to the stats query parameter used to filter general cache stats, called interfaceStats.
This PR also properly wraps use of the Traffic Ops Go client in Traffic Monitor for tasks requiring APIv3-format data (there are still some places where APIv2 is used explicitly). If login fails with the current API version (3), login is retried with the older version (2.0) and, if that succeeds, all future requests use API v2.0, coercing legacy data into modern structures as necessary. There are unit tests for those coercions.
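The login fallback described above can be pictured roughly as follows. This is a sketch under stated assumptions rather than the wrapper added in this PR: `loginFunc`, `session`, `loginWithFallback` and `legacyOnlyLogin` are hypothetical names, and the version strings are placeholders.

```go
package main

import (
	"errors"
	"fmt"
)

// loginFunc is a hypothetical stand-in for "log in to Traffic Ops using
// this API version"; it is not the real client's signature.
type loginFunc func(version string) error

// session remembers which API version later requests should use.
type session struct {
	APIVersion string
}

// loginWithFallback tries each version in order and keeps the first one
// that succeeds. If none succeeds, the last error is returned.
func loginWithFallback(login loginFunc, versions ...string) (*session, error) {
	if len(versions) == 0 {
		return nil, errors.New("no API versions to try")
	}
	var lastErr error
	for _, v := range versions {
		if err := login(v); err != nil {
			lastErr = fmt.Errorf("login with API v%s: %w", v, err)
			continue
		}
		return &session{APIVersion: v}, nil
	}
	return nil, lastErr
}

// legacyOnlyLogin simulates an older Traffic Ops that only accepts v2.0.
func legacyOnlyLogin(version string) error {
	if version != "2.0" {
		return errors.New("unsupported API version")
	}
	return nil
}

func main() {
	s, err := loginWithFallback(legacyOnlyLogin, "3.0", "2.0")
	if err != nil {
		fmt.Println("login failed:", err)
		return
	}
	fmt.Println("using API version", s.APIVersion)
}
```

Keeping the chosen version on the session object is what lets every later request decide whether legacy data needs to be coerced into the newer structures.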
It also consolidates double-definitions of cache server interface information and IP address structures into one, and consequently moved some functionality between files as appropriate.
It also adds support for signed and unsigned integral values of cache statistics, which previously only allowed floating point-type numbers.
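As an illustration of what accepting integral stat values can involve, here is a small sketch that normalizes several numeric kinds to float64 so they can be compared against a threshold. The function name `toFloat` and the set of accepted types are assumptions made for this example, not the code in this PR.

```go
package main

import "fmt"

// toFloat converts the numeric kinds a decoded stat value might hold into a
// float64. The boolean result is false for anything non-numeric.
func toFloat(v interface{}) (float64, bool) {
	switch n := v.(type) {
	case float64:
		return n, true
	case float32:
		return float64(n), true
	case int:
		return float64(n), true
	case int64:
		return float64(n), true
	case uint:
		return float64(n), true
	case uint64:
		return float64(n), true
	default:
		return 0, false
	}
}

func main() {
	for _, v := range []interface{}{int64(-3), uint64(7), 2.5, "n/a"} {
		f, ok := toFloat(v)
		fmt.Println(f, ok)
	}
}
```

Note that very large 64-bit integers lose precision when converted to float64, so a sketch like this is only suitable where approximate comparison is acceptable.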
It also adds some table styling to the TM UI to fix error rows not highlighted in red - that's just a temporary, small workaround to ease testing of this PR, and a better fix is coming in #4746.
It also fixes a bug in a "CRConfig" Snapshot API unit test where the test was using servers without any IP addresses - it wasn't failing, but was merely not actually testing as much as it ought to have been.
It also fixes a bug when converting server interface data to the legacy format which would cause them to report an incorrect interface name under certain circumstances, adding a test.
It also fixes a bug parsing lib/go-tc.TMParameters structures from JSON with certain comparators in health thresholds, adding a test.
Finally, it also adds a load of GoDoc comments to objects that previously did not have them, and several tests for things that did not previously have those.
Which Traffic Control components are affected by this PR?
What is the best way to verify this PR?
Run all unit tests - for /lib, Traffic Monitor, and Traffic Ops. Also ensure all of the Traffic Ops API tests - versions 1-3 - work and pass.
To test the interface monitoring behavior, given some set of cache servers monitored by the Traffic Monitor instance being tested, add dummy interfaces you're sure won't be in the returned health data from the cache servers' health polling endpoints. Ensure that when "monitor" is not true for these interfaces, they don't affect the health status of the cache servers. Ensure that when this changes ("monitor" is true), Traffic Monitor marks the cache servers as unhealthy because it can't find any data for the interfaces in the polling data. Also, check that "maxBandwidth" limits on each actually poll-able interface are respected, by setting the limit to something lower than the interface is actually serving (I use zero, because that's nearly impossible to reach) such that cache servers are marked unavailable when the limits are exceeded, and are not - at least not for that reason - when they are not exceeded. Finally, check that there is no setting of "maxBandwidth" on any interface with "monitor" set to false that will result in the cache server being marked unavailable (again, I use zero).
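The expected outcomes of the manual checks above can be summarized as a small decision function. This is a simplified model for reasoning about the test cases, not Traffic Monitor's health logic: the `ifaceConfig` type, the `available` function, the bandwidth units, and the strict "greater than" comparison are all assumptions.

```go
package main

import "fmt"

// ifaceConfig is a hypothetical view of one configured server interface.
type ifaceConfig struct {
	Name         string
	Monitor      bool
	MaxBandwidth uint64 // same (unspecified) units as the polled values
}

// available reports whether a cache server would be considered available,
// given its configured interfaces and the bandwidth polled per interface.
// Interfaces with Monitor == false are ignored entirely.
func available(cfg []ifaceConfig, polled map[string]uint64) (bool, string) {
	for _, c := range cfg {
		if !c.Monitor {
			continue
		}
		bw, ok := polled[c.Name]
		if !ok {
			return false, "no polling data for monitored interface " + c.Name
		}
		if bw > c.MaxBandwidth {
			return false, "maxBandwidth exceeded on " + c.Name
		}
	}
	return true, ""
}

func main() {
	polled := map[string]uint64{"eth0": 500}
	cfg := []ifaceConfig{
		{Name: "eth0", Monitor: true, MaxBandwidth: 1000},
		{Name: "dummy0", Monitor: false, MaxBandwidth: 0},
	}
	fmt.Println(available(cfg, polled)) // unmonitored dummy interface is ignored

	cfg[1].Monitor = true
	fmt.Println(available(cfg, polled)) // dummy interface has no polling data

	cfg[1].Monitor = false
	cfg[0].MaxBandwidth = 0
	fmt.Println(available(cfg, polled)) // limit of zero is exceeded on eth0
}
```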
If this is a bug fix, what versions of Traffic Control are affected?
The following criteria are ALL met by this PR