Introduce SLI for preview environment start #10732
Merged
NOTE: I have added the hold label because the build fails, as we seem to have problems with VMs right now.
Description
This PR instruments our build job to include annotations on the root span that we can use as part of our "Preview environments should start successfully" SLI.
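As a rough illustration of the idea (not the code in this PR), the sketch below records the two attributes on a root span and then mirrors the three-valued SLI logic in TypeScript. The `SpanLike` interface, the `annotateRootSpan` and `evaluateSli` helpers, and the `hasError` flag are all hypothetical names introduced here. The real SLI is a Honeycomb derived column, not TypeScript, and the assumption that the k3s attribute is only set once that step is reached is mine.

```typescript
// Hypothetical minimal span interface; the build job's real tracing span type would replace it.
interface SpanLike {
  setAttribute(key: string, value: boolean): void;
}

// Attribute bag as the SLI would see it; a missing key means "attribute does not exist".
type Attributes = Record<string, boolean | undefined>;

// Record the outcome of each stage on the root span.
// Assumption: the k3s attribute is only set if that step was actually reached.
function annotateRootSpan(span: SpanLike, gitpodBuilt: boolean, k3sCreated?: boolean): void {
  span.setAttribute("preview.gitpod_built_successfully", gitpodBuilt);
  if (k3sCreated !== undefined) {
    span.setAttribute("preview.k3s_successfully_created", k3sCreated);
  }
}

// Mirrors the SLI semantics: true = success, false = failure, null = not part of the SLI.
function evaluateSli(attrs: Attributes, hasError: boolean): boolean | null {
  const built = attrs["preview.gitpod_built_successfully"];
  const k3s = attrs["preview.k3s_successfully_created"];
  // Only count events where Gitpod built successfully and the k3s attribute exists.
  if (built !== true || k3s === undefined) {
    return null;
  }
  // Success requires no error on the root span and a successfully created k3s cluster.
  return !hasError && k3s === true;
}
```

Under these assumptions, a job whose Gitpod build succeeded but whose VM failed to come up evaluates to `false` (a failure), while a job that never built Gitpod evaluates to `null` and is excluded from the SLI.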
These annotations are used to create a derived column in Honeycomb that represents our SLI. The column is structured as IF(condition, then-value, else-value) (see here). Additionally, for SLIs in Honeycomb, true means the event should count as a success, false means it should count as a failure, and null means the event isn't part of the SLI.
So in our case, we only count the event if preview.gitpod_built_successfully is true and the preview.k3s_successfully_created attribute exists. We consider it a success if the root span doesn't have an error set and preview.k3s_successfully_created is true. Otherwise it's a failure.
Related Issue(s)
Fixes https://github.com/gitpod-io/ops/issues/2728
Fixes https://github.com/gitpod-io/ops/issues/2729
Fixes https://github.com/gitpod-io/ops/issues/2731
How to test
I started a job off the branch so it loaded my new TS code
The VM happened to fail as we're overloaded right now, so that's great for testing 😅 See the trace here and the screenshot of it below:
Here is a simple query showing the count of events grouped by the SLI: there are a bunch of events that don't count as part of the SLI, and one failure (no successes because Harvester is overloaded).
Also created an SLO for the fun of it here.
Release Notes
Documentation
N/A