[heartbeat] States and Improved Errors #30632

andrewvc · 2022-03-02T01:43:41Z

Fixes #32163. The corresponding mapping changes for the synthetics package are in elastic/integrations#4023.

Adds a notion of state across checks, with flapping as a bonus. At a high level this PR does the following:

Adds new root-level state fields.
Enhances the ecserr package and types to make them more testable and usable.
Refactors timeout, HTTP status, and could-not-connect errors to use the new ecserr package, which makes this PR/feature easier to test with lightweight monitors (these are the easiest types of errors to replicate).
Adds support for the standard mage goIntegTest task, which CI already supports but which has thus far been a no-op for heartbeat.
Adds a notion of flapping states, in addition to up / down states.
Automatically connects to ES to retrieve the last state value for a given monitor when that monitor first starts. This is necessary to continue the previous state across restarts of heartbeat.
Replaces the add_observer_metadata processor with a new heartbeat.location global setting and a location per monitor setting. This lets us set a location ID (which is then set to observer.name). See details below.

Note: flapping is currently disabled

Per the discussion in the review, it's a complex feature, so let's add it in a follow-up.

What are states, and how are they implemented here?

The main goal of this PR is to resolve #32163, which it does, but it also recognizes that the goal of grouping errors is a subset of the more general problem of grouping both up and down states. Grouping both is useful because it lets us see something like:

State	Duration	Reason
UP	18 hours
DOWN	30 minutes	status 400
UP	1 month

Hence, the introduction of the various state.* fields, which group contiguous blocks of 'up' and 'down' states together.
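As a rough illustration of what "grouping contiguous blocks" means, the sketch below collapses a sequence of check results into runs of identical status. The `run` type and `groupRuns` function are hypothetical names used only for this example; they are not part of heartbeat, and the sketch ignores flapping entirely.

```go
package main

import "fmt"

// run is a hypothetical stand-in for a state: a contiguous block of
// checks that all share the same status.
type run struct {
	Status string
	Checks int
}

// groupRuns collapses a sequence of check results into contiguous runs.
func groupRuns(results []string) []run {
	var runs []run
	for _, r := range results {
		// Extend the current run when the status is unchanged.
		if n := len(runs); n > 0 && runs[n-1].Status == r {
			runs[n-1].Checks++
			continue
		}
		// Otherwise a new run (state) begins.
		runs = append(runs, run{Status: r, Checks: 1})
	}
	return runs
}

func main() {
	results := []string{"up", "up", "up", "down", "down", "up"}
	for _, r := range groupRuns(results) {
		fmt.Printf("%s x%d\n", r.Status, r.Checks)
	}
}
```

Each run corresponds to one row of the table above; the real state.* fields additionally carry an ID, a start time, and a duration.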

A sample of the state.* fields can be seen below:

{
  "state": { // new state field in addition to existing monitor fields
    // globally unique ID for this state; this ID is sortable as a timestamp
    // to speed up aggregations. The format is id-timestampMsHex-serialHex,
    // which is more compact than a UUID and also chronologically sortable
    "id": "dummy-182a27ea210-2dc",
    // when this state first started; with this we can see when the first event in the
    // state occurred without having to retrieve that event
    "started_at": "2022-08-15T12:13:04.2721958-05:00",
    // number of milliseconds this state has been active for
    "duration_ms": 4655149,
    // number of checks that have occurred within this state,
    // broken out by up/down. Flapping states will have non-zero values for both up/down
    "checks": 2290,
    "up": 0,
    "down": 2290,
    // status of the state, which can be 'up', 'down', or 'flapping';
    // usually identical to monitor.status except in the case of a flapping monitor
    "status": "down",
    // the last FLAPPING_THRESHOLD-1 checks, used to reconstruct flapping state
    // when resuming state from ES
    "flap_history": [],
    // The prior state. The nice thing about `state.ends` is that these states do not change
    // since they are complete, so they are easy to query / aggregate since the values are stable.
    // In actual use these are only attached to events with `state.checks: 1`, so they appear
    // exactly once
    "ends": {
      "started_at": "2022-08-15T12:12:57.1792082-05:00",
      "duration_ms": 5069,
      "status": "flap",
      "up": 3,
      "down": 1,
      "flap_history": null, // omitted on ends states since it's just dead-weight
      "id": "dummy-182a27e865b-2db",
      "checks": 4,
      "ends": null // we don't recurse end states
    }
  }
}

Notes on location

The new location field can be set as follows:

# globally
heartbeat.location:
  id: "us-east-1a"
  geo:
    name: "US-East Coast"
    location: "44.123, 45.12345"

heartbeat.monitors:
- type: http
  id: my-monitor
  urls: "http://elastic.co"
  # per monitor
  run_from:
    id: "us-east-1a"
    geo:
      name: "US-East Coast"
      location: "44.123, 45.12345"

Notes on flapping states

The new flapping state serves an important purpose: reducing the cardinality of states for unstable sites. This matters for UX and UI reasons, since having large numbers of states to visualize in a list is a key thing we'd like to improve.

The flapping threshold in this PR is hard coded to the number 7. This number is the count of consecutive identical 'up' or 'down' checks the algorithm uses to determine whether a monitor is stable or not. As an example, if a monitor experiences 7 consecutive up checks followed by 7 consecutive down checks, it will be reflected as a single up state of 7 checks followed by a single down state of 7 checks. If, by contrast, there are 7 consecutive up checks followed by 6 consecutive down checks and then a single up check, there will be two consecutive states: up followed by flapping. If 6 consecutive down checks were to follow, the last of these new events would constitute a new down state following the flapping state, since the monitor would now be stable.

Please see the unit tests for monitor states for additional, more nuanced detail. I don't think it makes sense to expose this to users yet, though we could in the future. It's a bit complex to explain, and I think this is a good starting point. We'll likely want to tweak this algorithm in the future, but that could be done in a follow-up. One concern I have is that it could take a while for monitors to recover if they run infrequently.

It should be noted that flapping checks start as a simple up or down check, but change into a flapping check if they see a different result before the flapping threshold is hit. So the most recent state.status is the only accurate value that should be used. It is also for this reason that two consecutive up or down states cannot happen, but multiple consecutive flapping states could happen if, after what looks like a recovery, instability occurs again. We may want to tweak this to allow for shorter stable states. Again, I think this could happen in a flapping follow-up.

Why is it important?

Checklist

My code follows the style guidelines of this project
I have commented my code, particularly in hard-to-understand areas
I have made corresponding changes to the documentation
I have made corresponding change to the default configuration files
I have added tests that prove my fix is effective or that my feature works
I have added an entry in CHANGELOG.next.asciidoc or CHANGELOG-developer.next.asciidoc.

How to test this PR locally

Create some monitors and verify that the state field is populated as expected in Elasticsearch. Since heartbeat just re-uses the ES connection, the easiest way is just to use the elastic user for your ES output, or just set cloud.id and cloud.auth.

You could try the following Kibana dashboard (exported and compressed as a zip file) to make this easier.

states-dashboard-export.zip


andrewvc · 2022-09-07T01:42:26Z

Tests are failing on win, blocked on #32994 for now

vigneshshanmugam

Looks good overall. I haven't dug much into the flapping states and the threshold at which we switch them, as we are disabling that particular feature for this PR.

heartbeat/monitors/wrappers/monitorstate/tracker.go

vigneshshanmugam · 2022-09-07T04:22:24Z

heartbeat/monitors/wrappers/monitorstate/tracker.go

+		return state
+	}
+
+	tries := 3


I am not sure if retrying 3 times would be good to do for run_once mode. Maybe fine for long running HB deployments.

Should we set this to 1 by default and do multiple retries only when Run once mode is off?

Also, should we set a timeout for this query?

Thoughts?

I think this would be fine for run_once mode because it should be constrained to (1^2 + 2^2) s + 3 * 500 ms = 6500 ms total delay, plus the time to actually exec the requests.

To add a little more detail here, the internet can be a little flakey sometimes, so I think it's a good thing to do a sort of minimal retry along the lines we do here for run_once as well. Also, the regular case is similar to run_once in that we don't want to impede the startup of heartbeat if ES is down, but it's OK to delay it... a little bit. So, it would make sense to use the same logic for both.

vigneshshanmugam · 2022-09-07T17:10:54Z

heartbeat/monitors/wrappers/monitorstate/monitorstate_test.go

+	time.Sleep(time.Millisecond * 10)
+	ms.recordCheck(TestSf, StatusUp)
+	// Pretty forgiving upper bound to account for flaky CI
+	require.True(t, ms.DurationMs > 9 && ms.DurationMs < 300, "Expected duration to be ~10ms, got %d", ms.DurationMs)


nit: we could remove the 300ms upper bound here. I see no harm in that.

My worry is that there could be a bug in how we calculate the duration in the future. Let me up it to say 900 though, that's very forgiving.

heartbeat/monitors/wrappers/monitorstate/tracker_test.go

vigneshshanmugam · 2022-09-07T17:18:02Z

x-pack/heartbeat/scenarios/README.md

+
+The key types in here are:
+
+- Scenario: A description of a given heartbeat configuration with some additional parameters


Love the flexibility of this new test framework.

x-pack/heartbeat/scenarios/framework/framework.go


andrewvc · 2022-09-09T03:24:16Z

@vigneshshanmugam FYI 6ecac3e fixes an interesting little bug, where heartbeat would try to connect to ES with the default options when non-ES outputs were defined. This was caught by the python tests


mergify · 2022-09-13T11:10:05Z

This pull request is now in conflicts. Could you fix it? 🙏
To fix up this pull request, you can check it out locally. See documentation: https://help.github.com/articles/checking-out-pull-requests-locally/

git fetch upstream
git checkout -b tv2 upstream/tv2
git merge upstream/main
git push upstream tv2

vigneshshanmugam

LGTM with pending comments, This is exciting 🎉

Tested with lightweight and browser jobs. State seems to be populated correctly, Location data is reflected properly on all documents.

vigneshshanmugam · 2022-09-13T18:24:28Z

heartbeat/monitors/wrappers/monitorstate/monitorstate.go

@@ -44,7 +44,7 @@ func newMonitorState(sf stdfields.StdMonitorFields, status StateStatus, ctr int,
 		// ID is unique and sortable by time for easier aggregations
 		// Note that we add an incrementing counter to help with the fact that
 		// millisecond res isn't quite enough for uniqueness (esp. in tests)
-		ID:              fmt.Sprintf("%s-%s-%x-%x", sf.ID, sf.Location, now.UnixMilli(), ctr),
+		ID:              fmt.Sprintf("%s-%s-%x-%x", sf.ID, sf.RunFrom, now.UnixMilli(), ctr),


Need to account for missing RunFrom config. This is what I see in states

"state": { "started_at": "2022-09-13T11:22:16.489384-07:00", "up": 0, "ends": null, "id": "my-monitor-%!s(*config.LocationWithID=<nil>)-183381669a9-0", "duration_ms": 20003, "status": "down", "checks": 3, "down": 3, "flap_history": [] },

Could we default to an unknown location?

Also, it should be RunFrom.id, since it's using an interface now.

vigneshshanmugam · 2022-09-13T18:30:24Z

x-pack/heartbeat/scenarios/framework/fakeloader.go

@@ -27,8 +27,8 @@ func newLoaderDB() *loaderDB {
 }

 func loaderDbKey(sf stdfields.StdMonitorFields) string {
-	if sf.Location != nil {
-		return fmt.Sprintf("%s-%s", sf.ID, sf.Location.ID)
+	if sf.RunFrom != nil {


Move this logic to monitorstate.go?

vigneshshanmugam · 2022-09-13T18:52:24Z

heartbeat/config/config.go

@@ -33,6 +39,7 @@ type Config struct {
 	Scheduler      Scheduler            `config:"scheduler"`
 	Autodiscover   *autodiscover.Config `config:"autodiscover"`
 	Jobs           map[string]*JobLimit `config:"jobs"`
+	Location       *LocationWithID      `config:"location"`


Should we also change this to RunFrom?

Yeah, I was thinking this was a good top level name, but now I think you're right, consistency is more important.


andrewvc added 30 commits July 4, 2019 11:27

Moar intervals

dee25ff

Checkpoint

a3a7cbd

Checkpoint

72fab36

Checkpoint

22e7c12

Merge remote-tracking branch 'origin/master' into intervals

aab5d8a

[Heartbeat] Report next_run info per event

b46f81d

Add support for `next_run` and `next_run_in` fields to the monitors object in events. This allows for the computation of SLA statistics in elasticsearch.

Add changelog

e90a046

Incorporate PR feedback

cdeac64

Checkpoint

077be58

Checkpoint

b06323f

Just report the timespan

a371f6e

Merge remote-tracking branch 'origin/master' into next-run-range

9c8dbe5

fix tests

3e4e8ea

fix relnote

c9b39cb

Fmt

a00aa0f

Tweaks

e1d0c65

Factor timeout into timespans

da1fde1

fmt

4e08b3d

Merge remote-tracking branch 'origin/master' into next-run-range

51bffaa

Merge remote-tracking branch 'origin/master' into next-run-range

f74c85d

Merge remote-tracking branch 'origin/master' into next-run-range

1a7c1f4

Don't require docs on date_range sub-keys

0150408

Remove print

95acb3c

Merge remote-tracking branch 'origin/master' into next-run-range

d565042

FMT

7b2162a

Merge remote-tracking branch 'origin/master' into intervals

0b28b6f

Checkpoint

8feb085

Merge remote-tracking branch 'origin/master' into next-run-range

9232921

Merge branch 'next-run-range' into intervals

7bf22b5

Merge remote-tracking branch 'origin/master' into intervals

89563ea

vigneshshanmugam reviewed Sep 7, 2022


andrewvc and others added 6 commits September 7, 2022 13:17

Update heartbeat/monitors/wrappers/monitorstate/tracker.go

70bd444

Co-authored-by: Vignesh Shanmugam <vignesh.shanmugam22@gmail.com>

Incorporate PR feedback

c11982f

Merge remote-tracking branch 'andrewvc/tv2' into tv2

903c841

Remove unnecessary state loader assignment

ee08dc7

Remove browser from win tests

eed6a89

Fix state loader to only use ES state loader with ES output

6ecac3e

Don't run integ tests on windows

bc39d5a

andrewvc mentioned this pull request Sep 12, 2022

feat: support pushing lightweight monitors elastic/synthetics#593

Merged

andrewvc added 2 commits September 12, 2022 10:57

Revert "ci: enable windows for testing heartbeat (elastic#32937)"

045969b

This reverts commit 8552e34.

Rename monitor.location to monitor.run_from and add tests for observer metadata

779da1b

dominiqueclarke mentioned this pull request Sep 13, 2022

[Synthetics] Add run_from key to Heartbeat data streams elastic/kibana#140564

Closed

Merge branch 'main' into tv2

04d29e5

This was referenced Sep 13, 2022

[Meta] Synthetics UI - Page Level Errors elastic/kibana#134949

Closed

[Synthetics UI] [Overview] Error distribution elastic/kibana#135156

Closed

[Synthetics UI] [Overview] Error overlay elastic/kibana#135162

Closed

vigneshshanmugam approved these changes Sep 13, 2022


andrewvc added 3 commits September 13, 2022 15:22

Incorporate PR feedback

a404afa

Merge remote-tracking branch 'origin/main' into tv2

816734d

Merge remote-tracking branch 'andrewvc/tv2' into tv2

34783dc

andrewvc merged commit 0e3ab4a into elastic:main Sep 13, 2022

vigneshshanmugam added the v8.5.0 label Sep 23, 2022

andrewvc added a commit to elastic/integrations that referenced this pull request Oct 4, 2022

Support new heartbeat 'state' fields (#4023)

a691c64

This is the mapping counterpart to elastic/beats#30632 It adds supports for the new state.* fields

This was referenced Oct 27, 2022

Add docs on run_from fields #33466

Merged

[hearbeat] Add observer.name to observer metadata processor #32555

Closed

andrewvc deleted the tv2 branch June 22, 2023 14:55
