Skip to content
This repository was archived by the owner on Nov 24, 2025. It is now read-only.

Multi interface health#4916

Merged
mattjackson220 merged 115 commits intoapache:masterfrom
ocket8888:multi-interface-health
Sep 11, 2020
Merged

Multi interface health#4916
mattjackson220 merged 115 commits intoapache:masterfrom
ocket8888:multi-interface-health

Conversation

@ocket8888
Copy link
Contributor

@ocket8888 ocket8888 commented Jul 27, 2020

What does this PR (Pull Request) do?

This removes the temporary shim "aggregate" interface used by Traffic Monitor - and Traffic Ops to a much lesser degree - and adds the ability to truly monitor cache servers according to both aggregate statistics as well as network interface thresholds.

Because interface stats now appear in cache stats TM API responses, this also introduces an equivalent to the stats query parameter used to filter general cache stats called interfaceStats.

This PR also properly wraps use of the Traffic Ops Go client in Traffic Monitor - for tasks requiring APIv3-format data (there are still some places where APIv2 is used explicitly) - such that if login fails with the current API version (3), login will be tried with the older version (2.0) and, if successful, all future requests will use API v2.0, coercing legacy data into modern structures as necessary - and there are unit tests for those coercions.

It also consolidates double-definitions of cache server interface information and IP address structures into one, and consequently moved some functionality between files as appropriate.

It also adds support for signed and unsigned integral values of cache statistics, which previously only allowed floating point-type numbers.

It also adds some table styling to the TM UI to fix error rows not highlighted in red - that's just a temporary, small workaround to ease testing of this PR, and a better fix is coming in #4746.

It also fixes a bug in a "CRConfig" Snapshot API unit test where the test was using servers without any IP addresses - it wasn't failing, but was merely not actually testing as much as it ought to have been.

It also fixes a bug when converting server interface data to the legacy format which would cause them to report an incorrect interface name under certain circumstances, adding a test.

It also fixes a bug parsing lib/go-tc.TMParameters structures from JSON with certain comparators in health thresholds, adding a test.

Finally, it also adds a load of GoDoc comments to objects that previously did not have them, and several tests for things that did not previously have those.

Which Traffic Control components are affected by this PR?

  • Documentation
  • Traffic Monitor
  • Traffic Ops

What is the best way to verify this PR?

Run all unit tests - for /lib, Traffic Monitor, and Traffic Ops. Also ensure all of the Traffic Ops API tests - versions 1-3 - work and pass.

To test the interface monitoring behavior, given some set of cache servers monitored by the Traffic Monitor instance being tested, add dummy interfaces you're sure won't be in the returned health data from the cache servers' health polling endpoints. Ensure that when "monitor" is not true for this/these interface(s) that they don't affect health status setting for the cache servers. Ensure that when this changes ("monitor" is true) the Traffic Monitor marks the cache servers as unhealthy because it can't find any data for the interfaces in the polling data. Also, check that "maxBandwidth" limits on each actually poll-able interface is respected, by setting it to something lower than the interface is actually serving (I use zero, because that's nearly impossible to reach) such that cache servers are marked unavailable when they are exceeded, and are not - at least not for that reason - when they are not exceeded. Finally, check that there is no setting of "maxBandwidth" on any interface with "monitor" set to false that will result in the cache server being marked unavailable (again, I use zero).

If this is a bug fix, what versions of Traffic Control are affected?

  • master

The following criteria are ALL met by this PR

  • This PR includes tests
  • This PR includes documentation
  • This PR includes an update to CHANGELOG.md
  • This PR includes any and all required license headers
  • This PR does not include a database migration
  • This PR DOES NOT FIX A SERIOUS SECURITY VULNERABILITY

@ocket8888 ocket8888 added new feature A new feature, capability or behavior Traffic Ops related to Traffic Ops Traffic Monitor related to Traffic Monitor high impact impacts the basic function, deployment, or operation of a CDN labels Jul 27, 2020
@ocket8888 ocket8888 force-pushed the multi-interface-health branch 2 times, most recently from 42b0a15 to 21ff6f6 Compare July 29, 2020 18:51
@ocket8888 ocket8888 marked this pull request as ready for review July 29, 2020 21:13
@ocket8888 ocket8888 force-pushed the multi-interface-health branch from 27ba698 to 979a44b Compare August 3, 2020 23:44
Copy link
Contributor

@mattjackson220 mattjackson220 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Just a couple minor comments. TM unit tests pass, TO unit and API v1, v2, and v3 tests pass, /lib tests pass, docs build and look good. still testing functionality in ciab

@ocket8888 ocket8888 force-pushed the multi-interface-health branch 2 times, most recently from 7701475 to f89c037 Compare August 18, 2020 23:35
@ocket8888 ocket8888 force-pushed the multi-interface-health branch from 98273eb to 712ac39 Compare August 20, 2020 21:59
@mattjackson220
Copy link
Contributor

Looks really close to me. TO and TM unit tests pass, API tests pass, docs build and look good, manual testing works as expected.
only thing ive noticed is that all interfaces in /cache-statuses have the same bandwidth, making the total bandwidth off too. i think its a bug in cachestate.go lines 181-187. should use health[0].InterfaceVitals[inf.Name] data instead of lastStat?

ocket8888 and others added 24 commits September 10, 2020 14:09
When a set of interface info had only one of IPv4 or IPv6, and
non-service interfaces appeared after the service interface in the
iteration, the value stored at the interface name pointer would be
changed because the iteration variable is referential; therefore
although IP information was calculated correctly, the wrong
InterfaceName would be set.

This commit not only fixes that, but also actually speeds up the
conversion process in such cases.
@ocket8888 ocket8888 force-pushed the multi-interface-health branch from 6dac1c9 to 70191a5 Compare September 10, 2020 20:09
Copy link
Contributor

@mattjackson220 mattjackson220 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks good to me! lib, monitor, and ops unit tests pass, docs build and look good, v1 v2 and v3 API tests pass, testing in full environment all looks great!

Copy link
Member

@shamrickus shamrickus left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM
There are still a couple issues, specifically with upgrading ATC (#5001), but those will be resolved in a separate PR.

@mattjackson220 mattjackson220 merged commit 2609e6f into apache:master Sep 11, 2020
@ocket8888 ocket8888 deleted the multi-interface-health branch September 11, 2020 15:09
@zrhoffman zrhoffman mentioned this pull request Oct 5, 2020
7 tasks
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.

Labels

high impact impacts the basic function, deployment, or operation of a CDN new feature A new feature, capability or behavior Traffic Monitor related to Traffic Monitor Traffic Ops related to Traffic Ops

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Enable Traffic Monitor to read in multiple interfaces for a cache server

4 participants