Adds workflow_job_seconds metric #5

janakerman · 2022-04-19T08:05:50Z

Fixes #4

This PR brings a number of improvements, mainly to support testing:

- Adds workflow_job_duration_seconds
  - This metric is made possible by the workflow_job events
  - This metric has a state label with two possible values:
    - queued - the amount of time the job spent queued waiting for a runner
    - in_progress - the amount of time the job spent in progress whilst completing all steps
  - WorkflowJobObserver interface created to allow testing of observations without interacting with the Prometheus library.
- Adds integration tests
  - Extracts application logic out of main and into a Server struct to allow starting and stopping the server with different configurations.

The PR brings more changes than are actually needed to add the new metric (i.e. extracting behaviour to a Server struct in an additional package). I can try to unpick these changes into a smaller PR, but I think the other changes are worth considering.

The individual commits should largely be complete and documented.

Still TODO as part of this PR:

- Run locally and test with real GitHub events
  - Tested with two workflows and manually verified /metrics content
- ~~Remove check_run event handling~~ - this doesn't make sense - we'd need to add the workflow event handler for that one and I'd rather do that as a separate PR.
- ~~Record workflow_execution_time_seconds using the workflow_job completed event~~

Still TODO, hopefully out of scope for this PR:

- Backfill tests for other observations
- Improve integration tests to check for additional metrics
- Fix observation inconsistencies - some observations are against the global collector, and the new observations are against an interface.
- Track additional metrics made possible by this new event type.

* Intention is to test against the package interface using the server_test package. Couldn't get this to work with main_test.
* Side effect is improved readability due to less global state.
* ghActionsExporter is private to the package as it is created in Server setup.

Assumes job queue time can be calculated from the difference between the job start time and the start time of the first step.

Labels:
* runner_group to identify queues for specific runners
* state queued, as durations can also be published for completed jobs
* no workflow label, as it is not possible to identify the workflow without additional API calls

cpanato · 2022-04-19T15:54:10Z

Looks great! I will run it locally. Let me know when you are ready, and then take the PR out of draft.
Thanks so much for this.

janakerman · 2022-04-19T18:01:56Z

Awesome, thanks!

The build failed - the lint stage passes locally for me. The error suggests something about updating the tool so I've bumped the golangci-lint version to latest:

level=error msg="Running error: buildir: failed to load package goarch: could not load export data: cannot import \"internal/goarch\" (unknown iexport format version 2), export data is newer version - update tool"

janakerman · 2022-04-19T18:40:06Z

I think this is ready for review. I've tested the exporter locally and ran a couple of workflows. I spotted a few issues which I've fixed up and added tests for.

cpanato

I will merge, and if we find any issues we can fix them in a follow-up.
I need to refactor the release pipeline and decouple some things, which I will do soon.

Thanks so much for this change, it is super cool!

janakerman added 16 commits April 18, 2022 13:42

server to struct

50c0223

* Intention to support instantiation of server for integration testing. Side effect is improving readability.
* Extracts root handler into func.
* Error logging handled by main.

adds /metrics integration test

79cab72

* Intention was to improve coverage of http server implementation.
* Probably makes more sense to test the handlers directly for most cases.

adds tests for webhook event handler

8c2fbba

* Backfills tests before behaviour changes.
* CheckRun test only checks for 200 due to async behaviour.

upgrade google/go-github/ v33 to v43

0d804da

adds test for workflow job queued

d5684f7

introduces workflowJobObserver to simplify testing

98dbe95

* Simplifies testing by moving observation of workflow job metrics under interface.
* Leaves other metrics untouched for now, even though inconsistent.

adds test observer to remaining exporter tests

93f82a0

adds integration test for workflow job metrics

9769d83

* Use custom serve mux to allow for creation of multiple servers in integration tests.
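As a rough illustration of why a custom serve mux helps here: registering the same pattern twice on `http.DefaultServeMux` panics, so each server instance needs its own `*http.ServeMux`. The helper names `newMux` and `get` are hypothetical, and the sketch uses `httptest.NewRecorder` instead of real listening servers to stay compact; the PR's integration tests may be structured differently.

```go
package main

import (
	"fmt"
	"net/http"
	"net/http/httptest"
)

// newMux builds a fresh mux per server instance, so the same "/metrics"
// pattern can be registered more than once within one test process.
func newMux(body string) *http.ServeMux {
	mux := http.NewServeMux()
	mux.HandleFunc("/metrics", func(w http.ResponseWriter, r *http.Request) {
		fmt.Fprint(w, body)
	})
	return mux
}

// get sends an in-memory GET request to the handler and returns the body.
func get(h http.Handler, path string) string {
	rec := httptest.NewRecorder()
	h.ServeHTTP(rec, httptest.NewRequest(http.MethodGet, path, nil))
	return rec.Body.String()
}

func main() {
	// Two independently configured "servers" living side by side.
	first := newMux("config-one")
	second := newMux("config-two")

	fmt.Println(get(first, "/metrics"))
	fmt.Println(get(second, "/metrics"))
}
```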

updates timeout on test to sensible value

170145f

pulls assertion into observation helper struct

8e9d254

observes workflow job duration for in_progress

dd759ad

workflow job queued metric corrected

1f5db8e

go mod vendor

eea7d22

go mod tidy

37c0c95

Updates golangci-lint version

abb1407

janakerman added 3 commits April 19, 2022 19:03

corrects hard coded state

f51a8ff

adds test for negative queue durations

aee6ace

Negative queue durations are possible because step started_at times appear to be rounded to the closest second, whereas the job started_at time includes milliseconds.

handles skipped jobs without panic

504a3f2

janakerman marked this pull request as ready for review April 19, 2022 18:37

renames workflow_job__seconds to workflow_job_duration_seconds

5528cde

janakerman force-pushed the jan-workflow-job-metrics branch from a8eab39 to 5528cde Compare April 19, 2022 20:22

cpanato approved these changes Apr 20, 2022

cpanato merged commit e005195 into cpanato:main Apr 20, 2022

janakerman deleted the jan-workflow-job-metrics branch April 20, 2022 11:29
