Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Adds workflow_job_seconds metric #5

Merged
merged 21 commits into from
Apr 20, 2022

Conversation

janakerman
Copy link
Contributor

@janakerman janakerman commented Apr 19, 2022

Fixes #4

This PR brings a number of improvements, mainly to support testing:

  • Adds workflow_job_duration_seconds
    • This metric is made possible by the workflow_job events
    • This metric tracks the dimension state which has two possible values:
      • queued - the amount of time the job spent queued waiting for a runner
      • in_progress - the amount of time the job spent in progress whilst completing all steps
    • WorkflowJobObserver interface created to allow for testing of observations without interacting with Prometheus lib.
  • Adds integration tests
    • Extracts application logic out of main and into Server struct to allow for start + stop of server with different configurations.

The PR brings more changes than actually needed to add the new metric (i.e extracting behaviour to a Server struct in an additional package. I can try to unpick these changes into a smaller but I think the other changes are worth considering.

The individual commits should largely be complete and documented.


Still TODO as part of this PR:

  • Run locally and test with real Github events
    • Tested with two workflows and manually verified /metrics content
  • Remove check_run event handling - this doesn't make sense - we'd need to add the workflow event handler for that one and I'd rather do that as a separate PR.
    • Record workflow_execution_time_seconds using the workflow_job completed event

Still TODO hopefully out of scope for this PR:

  • Backfill tests for other observations
  • Improve integration tests to check for additional metrics
  • Fix observation inconsistencies - some observations are against the global collector, and the new observations are against an interface.
  • Track additional metrics made possible by this new event type.

* Intention to support instantiation of server for
integration testing. Side effect is improving readability.
* Extracts root handler into func.
* Error logging handled by main.
* Intention is to test against package interface
using server_test package. Couldn't get to work
with main_test.
* Side effect is improved readability due to less
global state.
* ghActionsExporter private to package as created
in Server setup
* Intention was to improve coverage of http server
implementation.
* Probably makes more sense to test the handlers
directly for most cases.
* Backfills tests before behaviour changes.
* CheckRun test only checks for 200 due to
async behaviour.
Assumes job queue time can be calculated from difference
between job start time and start time of first step.

Labels:
* runner_group to identify queues for specific runners
* state queued as durations can also be published for
  completed jobs
* no workflow label as it is not possible to identify
  workflow without additional API calls
* Simplifies testing by moving observation of workflow job
metrics under interface.
* Leaves other metrics untouched for now, even though
inconsistent.
* Use custom serve mux to allow for creation of
multiple servers in integration tests.
@cpanato
Copy link
Owner

cpanato commented Apr 19, 2022

looks great, i will run locally, let me know when you are ready and then set the PR out from draft!
thanks so much for this

@janakerman
Copy link
Contributor Author

Awesome, thanks!

The build failed - the lint stage passes locally for me. The error suggests something about updating the tool so I've bumped the golangci-lint version to latest:

level=error msg="Running error: buildir: failed to load package goarch: could not load export data: cannot import \"internal/goarch\" (unknown iexport format version 2), export data is newer version - update tool"

This is possible as it seems step started_at times are
rounded to the closest second whereas job started_at time
has miliseconds included.
@janakerman janakerman marked this pull request as ready for review April 19, 2022 18:37
@janakerman
Copy link
Contributor Author

janakerman commented Apr 19, 2022

I think this is ready for review. I've tested the exporter locally and ran a couple of workflows. I spotted a few issues which I've fixed up and added tests for.

Copy link
Owner

@cpanato cpanato left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I will merge and if we found any issues we can fix that in a follow up
I need to refactor the release pipeline and decouple some things which I will do soon

thanks so much for this change it is super cool!

@cpanato cpanato merged commit e005195 into cpanato:main Apr 20, 2022
@janakerman janakerman deleted the jan-workflow-job-metrics branch April 20, 2022 11:29
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Add support for workflow_job events
2 participants