Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[heartbeat] States and Improved Errors #30632

Merged
merged 129 commits into from Sep 13, 2022
Merged

[heartbeat] States and Improved Errors #30632

merged 129 commits into from Sep 13, 2022

Conversation

andrewvc
Copy link
Contributor

@andrewvc andrewvc commented Mar 2, 2022

Fixes #32163 , corresponding mapping changes for synthetics package in elastic/integrations#4023

Adds a notion of state across checks, with flapping as a bonus. At a high level this PR does the following:

  1. Adds new root level state fields
  2. Enhances the ecserr package and types to make them more testable and usable
  3. Refactors timeout, http status, and could not connect errors to use the new ecserr package to make testing this PR/feature easier (these are the easiest types of errors to replicate) with lightweight monitors.
  4. Adds support for the standard mage goIntegTest task, already supported by CI that thus far has been a noop for heartbeat.
  5. Adds a notion of flapping states, in addition to up / down states.
  6. Automatically connects to ES to retrieve the last state value for the given monitor when a monitor first starts, this is necessary to continue the previous state across restarts of heartbeat
  7. Replaces the add_observer_metadata processor with a new heartbeat.location global setting and location per monitor setting. This lets us set a location ID (which is then set to observer.name. See details below:

Note: flapping is currently disabled

Per the discussion in the review, it's a complex feature, let's add it in a follow-up

What are states, and how are they implemented here?

The main goal of this PR is to resolve #32163 , which this goes, but it also recognizes that the goal of grouping errors is a subset of the more general problem of grouping both up and down states. It's useful to group both since it's useful to see something like:

State Duration Reason
UP 18 hours
DOWN 30 minutes status 400
Up 1 month

Hence, the introduction of the various state.* fields, which group contiguous blocks of 'up' and 'down' states together.

A sample of the state.* fields can be seen below:

{
  "state": { // new state field in addition to existing monitor fields
            // globally unique ID for this state, this ID is sortable as a timestamp 
            // to speed up aggregations. The format is id-timestampMsHex-serialHex
            // which is more compact than a UUID, and also chronologically sortable
            "id": "dummy-182a27ea210-2dc",
            // when this state first started, with this we can see when the first event in the
            // state occurred without having to retrieve that event
            "started_at": "2022-08-15T12:13:04.2721958-05:00",
            // number of milliseconds this state has been active for
            "duration_ms": 4655149,
            // number of checks that have occurred within this state
            // broken out by up/down. Flapping states will have non-zero values for both up/down
            "checks": 2290,
            "up": 0,
            "down": 2290
             // status of the state, which can be 'up', 'down', or 'flapping'
            // usually identical to monitor.status except in the case of a flapping monitor
            "status": "down",
            // the last FLAPPING_THRESHOLD-1 checks, used to reconstruct flapping state
            // when resuming state from ES
            "flap_history": [], 
            // The prior state, the nice thing about `state.ends` is that these states do not change
            // since they are complete, so they are easy to query / aggregate since the values are stable
            // in actual use these are only attached to events with `state.checks: 1` so they appear
            // exactly once
            "ends": {
              "started_at": "2022-08-15T12:12:57.1792082-05:00",
              "duration_ms": 5069,
              "status": "flap",
              "up": 3,
              "down": 1,
              "flap_history": null, // omitted on ends states since it's just dead-weight
              "id": "dummy-182a27e865b-2db",
              "checks": 4,
              "ends": null // we don't recurse end states
            },
          },
}

Notes on location

The new location field can be set as follows:

#globally
heartbeat.location:
  id: "us-east-1a"
  geo:
    name: "US-East Coast"
    location: "44.123, 45.12345"
heartbeat.monitors:
- type: http
  id: my-monitor
  urls: "http://elastic.co"
  run_from:
    id: "us-east-1a"
      geo:
        name: "US-East Coast"
        location: "44.123, 45.12345"

Notes on flapping states

The new flapping state serves an important purpose, to reduce the cardinality of states for unstable sites. This is important for UX and UI reasons, since having large numbers of states to visualize in a list is a key thing we'd like to improve.

The flapping threshold in this PR is hard coded to the number 7. This number is equivalent to the number of consecutive identical 'up' or 'down' states the algorithm uses to determine whether a monitor is stable or not. As an example, if a monitor experiences 7 consecutive up checks, followed by seven consecutive down checks it will be reflected as a single up state of 7 checks followed by a single down state of 7 checks. If, by contrast, there are 7 consecutive up checks followed by 6 consecutive down checks, then a single up check there will be two consecutive states, of up followed by flapping; if 6 consecutive down checks were to follow the last of these new events would constitute a new down state following the flapping state, since the monitor would now be stable.

Please see the unit tests for monitor states for additional more nuanced detail. I don't think it makes sense to expose this to users yet, though we could in the future. It's a bit complex to explain, and I think this is a good starting point. We'll likely want to tweak this algorithm in the future, but we that could be done in a follow-up. One concern I have is that it could take a while for monitors to recover if they run infrequently.

It should be noted that flapping checks start as a simple up or down check, but change into a flapping check if they see a different result before the flapping threshold is hit. So the most recent state.status is the only accurate value that should be used. It is also for this reason that two consecutive up or down states cannot happen, but multiple consecutive flapping states could happen if after what looks like a recover instability occurs again. We may want to tweak this to allow for shorter stable states. Again, I think this could happen in a flapping follow-up.

Why is it important?

Checklist

  • My code follows the style guidelines of this project
  • I have commented my code, particularly in hard-to-understand areas
  • I have made corresponding changes to the documentation
  • I have made corresponding change to the default configuration files
  • I have added tests that prove my fix is effective or that my feature works
  • I have added an entry in CHANGELOG.next.asciidoc or CHANGELOG-developer.next.asciidoc.

How to test this PR locally

Create some monitors and verify that the state field is populated as expected in Elasticsearch. Since heartbeat just re-uses the ES connection, the easiest way is just to use the elastic user for your ES output, or just set cloud.id and cloud.auth.

You could try the following kibana dashboard (exported and compressed as a zip file) to make an easier time of it.

states-dashboard-export.zip

@andrewvc
Copy link
Contributor Author

andrewvc commented Sep 7, 2022

Tests are failing on win, blocked on #32994 for now

Copy link
Member

@vigneshshanmugam vigneshshanmugam left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks good overall, I havent digged much in to the flapping states and thrershold when we switch them as we are disabling that particular feature for this PR.

heartbeat/monitors/wrappers/monitorstate/tracker.go Outdated Show resolved Hide resolved
return state
}

tries := 3
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I am not sure if retrying 3 times would be good to do for run_once mode. Maybe fine for long running HB deployments.

  • Should we set this to 1 by default and do multiple retries only when Run once mode is off?
  • Also should we set at timeout for this query?

Thoughts?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think this would be fine for run_once mode because it should be constrained to 1^2+2^2+500*3=6500ms total delay + the time to actually exec the requests

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

To add a little more detail here, the internet can be a little flakey sometimes, so I think it's a good thing to do a sort of minimal retry along the lines we do here for run_once as well. Also, the regular case is similar to run_once in that we don't want to impede the startup of heartbeat if ES is down, but it's OK to delay it... a little bit. So, it would make sense to use the same logic for both.

time.Sleep(time.Millisecond * 10)
ms.recordCheck(TestSf, StatusUp)
// Pretty forgiving upper bound to account for flaky CI
require.True(t, ms.DurationMs > 9 && ms.DurationMs < 300, "Expected duration to be ~10ms, got %d", ms.DurationMs)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: we could remove the 300ms upper bound here.I see no harm in that.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

My worry is that there could be a bug in how we calculate the duration in the future. Let me up it to say 900 though, that's very forgiving.


The key types in here are:

- Scenario: A description of a given heartbeat configuration with some additional parameters
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Love the flexibility of these new tests framework.

x-pack/heartbeat/scenarios/framework/framework.go Outdated Show resolved Hide resolved
@andrewvc
Copy link
Contributor Author

andrewvc commented Sep 9, 2022

@vigneshshanmugam FYI 6ecac3e fixes an interesting little bug, where heartbeat would try to connect to ES with the default options when non-ES outputs were defined. This was caught by the python tests

@mergify
Copy link
Contributor

mergify bot commented Sep 13, 2022

This pull request is now in conflicts. Could you fix it? 🙏
To fixup this pull request, you can check out it locally. See documentation: https://help.github.com/articles/checking-out-pull-requests-locally/

git fetch upstream
git checkout -b tv2 upstream/tv2
git merge upstream/main
git push upstream tv2

Copy link
Member

@vigneshshanmugam vigneshshanmugam left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM with pending comments, This is exciting 🎉

Tested with lightweight and browser jobs. State seems to be populated correctly, Location data is reflected properly on all documents.

@@ -44,7 +44,7 @@ func newMonitorState(sf stdfields.StdMonitorFields, status StateStatus, ctr int,
// ID is unique and sortable by time for easier aggregations
// Note that we add an incrementing counter to help with the fact that
// millisecond res isn't quite enough for uniqueness (esp. in tests)
ID: fmt.Sprintf("%s-%s-%x-%x", sf.ID, sf.Location, now.UnixMilli(), ctr),
ID: fmt.Sprintf("%s-%s-%x-%x", sf.ID, sf.RunFrom, now.UnixMilli(), ctr),
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Need to account for missing RunFrom config. This is what I see in states

  "state": {
    "started_at": "2022-09-13T11:22:16.489384-07:00",
    "up": 0,
    "ends": null,
    "id": "my-monitor-%!s(*config.LocationWithID=<nil>)-183381669a9-0",
    "duration_ms": 20003,
    "status": "down",
    "checks": 3,
    "down": 3,
    "flap_history": []
  },

We can default to unknown location ?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Also it should be RunFrom.id, since its using interface now.

@@ -27,8 +27,8 @@ func newLoaderDB() *loaderDB {
}

func loaderDbKey(sf stdfields.StdMonitorFields) string {
if sf.Location != nil {
return fmt.Sprintf("%s-%s", sf.ID, sf.Location.ID)
if sf.RunFrom != nil {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Move this logic to monitorstate.go?

@@ -33,6 +39,7 @@ type Config struct {
Scheduler Scheduler `config:"scheduler"`
Autodiscover *autodiscover.Config `config:"autodiscover"`
Jobs map[string]*JobLimit `config:"jobs"`
Location *LocationWithID `config:"location"`
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should we also change this to RunFrom

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah, I was thinking this was a good top level name, but now I think you're right, consistency is more important.

@andrewvc andrewvc merged commit 0e3ab4a into elastic:main Sep 13, 2022
andrewvc added a commit to elastic/integrations that referenced this pull request Oct 4, 2022
This is the mapping counterpart to elastic/beats#30632

It adds supports for the new state.* fields
chrisberkhout pushed a commit that referenced this pull request Jun 1, 2023
Fixes #32163 , corresponding mapping changes for synthetics package in elastic/integrations#4023

Adds a notion of state across checks, with flapping as a bonus. At a high level this PR does the following:

Adds new root level state fields
Enhances the ecserr package and types to make them more testable and usable
Refactors timeout, http status, and could not connect errors to use the new ecserr package to make testing this PR/feature easier (these are the easiest types of errors to replicate) with lightweight monitors.
Adds support for the standard mage goIntegTest task, already supported by CI that thus far has been a noop for heartbeat.
Adds a notion of flapping states, in addition to up / down states.
Automatically connects to ES to retrieve the last state value for the given monitor when a monitor first starts, this is necessary to continue the previous state across restarts of heartbeat
Replaces the add_observer_metadata processor with a new heartbeat.location global setting and location per monitor setting. This lets us set a location ID (which is then set to observer.name. See details below:
Note: flapping is currently disabled

Per the discussion in the review, it's a complex feature, let's add it in a follow-up

What are states, and how are they implemented here?
The main goal of this PR is to resolve #32163 , which this goes, but it also recognizes that the goal of grouping errors is a subset of the more general problem of grouping both up and down states. It's useful to group both since it's useful to see something like:

State	Duration	Reason
UP	18 hours	
DOWN	30 minutes	status 400
Up	1 month	
Hence, the introduction of the various state.* fields, which group contiguous blocks of 'up' and 'down' states together.

A sample of the state.* fields can be seen below:

{
  "state": { // new state field in addition to existing monitor fields
            // globally unique ID for this state, this ID is sortable as a timestamp 
            // to speed up aggregations. The format is id-timestampMsHex-serialHex
            // which is more compact than a UUID, and also chronologically sortable
            "id": "dummy-182a27ea210-2dc",
            // when this state first started, with this we can see when the first event in the
            // state occurred without having to retrieve that event
            "started_at": "2022-08-15T12:13:04.2721958-05:00",
            // number of milliseconds this state has been active for
            "duration_ms": 4655149,
            // number of checks that have occurred within this state
            // broken out by up/down. Flapping states will have non-zero values for both up/down
            "checks": 2290,
            "up": 0,
            "down": 2290
             // status of the state, which can be 'up', 'down', or 'flapping'
            // usually identical to monitor.status except in the case of a flapping monitor
            "status": "down",
            // the last FLAPPING_THRESHOLD-1 checks, used to reconstruct flapping state
            // when resuming state from ES
            "flap_history": [], 
            // The prior state, the nice thing about `state.ends` is that these states do not change
            // since they are complete, so they are easy to query / aggregate since the values are stable
            // in actual use these are only attached to events with `state.checks: 1` so they appear
            // exactly once
            "ends": {
              "started_at": "2022-08-15T12:12:57.1792082-05:00",
              "duration_ms": 5069,
              "status": "flap",
              "up": 3,
              "down": 1,
              "flap_history": null, // omitted on ends states since it's just dead-weight
              "id": "dummy-182a27e865b-2db",
              "checks": 4,
              "ends": null // we don't recurse end states
            },
          },
}
Notes on location
The new location field can be set as follows:

#globally
heartbeat.location:
  id: "us-east-1a"
  geo:
    name: "US-East Coast"
    location: "44.123, 45.12345"
heartbeat.monitors:
- type: http
  id: my-monitor
  urls: "http://elastic.co"
  run_from:
    id: "us-east-1a"
      geo:
        name: "US-East Coast"
        location: "44.123, 45.12345"
Notes on flapping states
The new flapping state serves an important purpose, to reduce the cardinality of states for unstable sites. This is important for UX and UI reasons, since having large numbers of states to visualize in a list is a key thing we'd like to improve.

The flapping threshold in this PR is hard coded to the number 7. This number is equivalent to the number of consecutive identical 'up' or 'down' states the algorithm uses to determine whether a monitor is stable or not. As an example, if a monitor experiences 7 consecutive up checks, followed by seven consecutive down checks it will be reflected as a single up state of 7 checks followed by a single down state of 7 checks. If, by contrast, there are 7 consecutive up checks followed by 6 consecutive down checks, then a single up check there will be two consecutive states, of up followed by flapping; if 6 consecutive down checks were to follow the last of these new events would constitute a new down state following the flapping state, since the monitor would now be stable.

Please see the unit tests for monitor states for additional more nuanced detail. I don't think it makes sense to expose this to users yet, though we could in the future. It's a bit complex to explain, and I think this is a good starting point. We'll likely want to tweak this algorithm in the future, but we that could be done in a follow-up. One concern I have is that it could take a while for monitors to recover if they run infrequently.

It should be noted that flapping checks start as a simple up or down check, but change into a flapping check if they see a different result before the flapping threshold is hit. So the most recent state.status is the only accurate value that should be used. It is also for this reason that two consecutive up or down states cannot happen, but multiple consecutive flapping states could happen if after what looks like a recover instability occurs again. We may want to tweak this to allow for shorter stable states. Again, I think this could happen in a flapping follow-up.
@andrewvc andrewvc deleted the tv2 branch June 22, 2023 14:55
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
backport-skip Skip notification from the automated backport with mergify enhancement Heartbeat Team:obs-ds-hosted-services Label for the Observability Hosted Services team v8.5.0
Projects
None yet
Development

Successfully merging this pull request may close these issues.

[Heartbeat][Spike] Stateful Errors
7 participants