Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

feat: add prometheus metrics & endpoint #60

Merged
merged 3 commits into from
Jan 26, 2023

Conversation

maigl
Copy link
Contributor

@maigl maigl commented Jan 19, 2023

This adds a prometheus metrics endpoint and 2 simple metrics to allow garm being monitored.

  • The first metrics shows which runner instances currently exist and their status(es).
  • The second metric counts the webhooks garm receives.

Remarks:

  • the funcs ListAllInstances and ListAllPools now have a version without the admin check that can be used internally.
    I didn't want to fake admin

Michael Kuhnt michael.kuhnt@mercedes-benz.com Mercedes-Benz Tech Innovation GmbH (ProviderInformation)

@gabriel-samfira
Copy link
Member

This looks amazing!

I need to get some proper sleep before looking at this. I will definitely do a full review tomorrow morning!

For now, just one note in regards to the need for ListAllInstances() and ListAllPools() for internal use. We have the auth.GetAdminContext() function that was made specifically for this purpose. It is intended to allow us to use these functions internally without needing to wrap them. It's available here:

https://github.com/cloudbase/garm/blob/main/auth/context.go#L219-L227

It's mainly used for testing now, but the intention for it was to be used for your exact use case.

Copy link
Member

@gabriel-samfira gabriel-samfira left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Added a couple of comments. Will look at everything a bit closer tomorrow. This is a great feature to have!

apiserver/controllers/controllers.go Outdated Show resolved Hide resolved
log.Printf("got not found error from DispatchWorkflowJob. webhook not meant for us?: %q", err)
return
} else if strings.Contains(err.Error(), "signature") {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is there a specific error type we return when the signature does not match? We could use errors.Is() if there is. It seems like a significant event.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

btw .. this error never come. There is a bug in the webhook validation which has the effect that every webhook for orgs end enterprises are always valid. see #61

apiserver/routers/routers.go Outdated Show resolved Hide resolved
@maigl maigl force-pushed the metrics branch 2 times, most recently from aae1886 to d654738 Compare January 19, 2023 12:59
@maigl maigl force-pushed the metrics branch 10 times, most recently from 3f5030c to 4fe31b4 Compare January 19, 2023 15:59
@maigl maigl force-pushed the metrics branch 13 times, most recently from a07e32f to 4f9075e Compare January 19, 2023 17:32
@maigl maigl force-pushed the metrics branch 2 times, most recently from 24d21d0 to dbaa191 Compare January 20, 2023 07:53
@maigl
Copy link
Contributor Author

maigl commented Jan 20, 2023

Hi,

I changed the code so now you can:

  • set a port for metrics, must be different from main port
  • disable metrics endpoint
  • disable auth for metrics endpoint

the cli now has

garm-cli metrics-token create

with this git you can curl the metrics endpoint, but not the normal server endpoints, vice versa the admin token cannot curl the metrics endpoint

I skipped adding a dedicated metrics user in the db.

the configuration in the garm/config.toml looks like this:

[apiserver]
  bind = "0.0.0.0"
  port = 9997
  use_tls = false
  cors_origins = ["*"]
  [apiserver.metrics]
    no_auth = false
    port = 8082
    disabled = false
....

I did not update the docs and examples. I wanted to check with you if this is going in the right direction?!

@maigl maigl marked this pull request as ready for review January 20, 2023 08:00
@gabriel-samfira
Copy link
Member

I think we can skip having to add another listener for metrics. In some cases, that's just another port to justify to the security team, and while it does seem to be optional, I think it's best we don't add a moving part unless absolutely unavoidable.

In this case, if we really want to, we can do this by putting nginx in front of garm. Garm can listen on 127.0.0.1 and a reverse proxy can be configured in whichever fashion the user wants. They can create a new server{} block on a different port just for location /metrics or they can restrict which remote IP can access that location. If you'd like, I can create a PR against your branch with a sample.

Our job should be to make sure that authentication and authorization is properly fleshed out. Tokens need to have just enough privileges to do the things they need to do. There's no need for a dedicated "metrics" user. In fact, the user ID can be left blank, or a new claims type can be created. At a later time, it may be useful to create a unified claims type that allows us to set access levels. At this point, we can simply add a new field in the claims that states the token is only valid for metrics. We parse the token, and set the context appropriately.

Almost all current handlers and runner functions check if the context is that of admin. There are also a few functions that allow instances to call home and set their own status.

Same can be done for metrics.

I will look over this PR today in more detail and can give more constructive feedback point by point. But this is shaping up nicely and is a great addition to garm. Can't wait to see it in action 😄.

p1 Outdated Show resolved Hide resolved
@gabriel-samfira
Copy link
Member

gabriel-samfira commented Jan 20, 2023

added some nginx samples: mercedes-benz#3

This should mitigate the need for a separate listener, just for metrics.

Copy link
Member

@gabriel-samfira gabriel-samfira left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Finally had a chance to do a proper review 😄

apiserver/controllers/metrics.go Outdated Show resolved Hide resolved
apiserver/controllers/metrics.go Outdated Show resolved Hide resolved
}
}

hostname, controllerID := c.apiController.GetControllerInfo()
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This could be c.store.ControllerInfo() if you give the collector access to the store. If you want to limit the scope of the dependencies we're injecting, you could (although not mandatory) give it access just to that function if you want to be restrictive by defining a struct field as:

type GarmCollector struct {
	instanceMetric *prometheus.Desc
	apiController    *APIController
        ctrlInfoFn          func() (params.ControllerInfo, error)
}

And just pass the store.ControllerInfo function in when initializing the collector in main.go. I would give it access to the store to make things simple. Either way is fine with me.

Copy link
Contributor Author

@maigl maigl Jan 26, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I removed the dependency in to apicontroller in the collector .. but still I think this func makes sense, as described above

apiserver/controllers/metrics.go Outdated Show resolved Hide resolved
auth/auth.go Outdated Show resolved Hide resolved
cmd/garm/main.go Outdated Show resolved Hide resolved
config/config.go Outdated
@@ -476,14 +476,30 @@ func (t *TLSConfig) Validate() error {
return nil
}

type MetricsConfig struct {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Please add a Validate() function to this config type. All config types have validation.

Copy link
Contributor Author

@maigl maigl Jan 26, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

since you want to use the same router, there is no port to validate .. so basically the only real config toggle is a bool (no_auth). Validation makes not much sense here ..
do you still want a validate func ?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

viper would really be nice here :)

We'll get there. Eventually 😄. It's not worth the effort for now.

do you still want a validate func ?

It's a good idea to have a Validate() func, even if it's a noop, if we have a separate type for a particular set of settings. If we embed the metrics auth in an existing type, and it's a bool, there is no need to validate it.

The idea is that we may extend a config at some point, and may forget to add a Validate() func later on.

btw, I removed the disable metrics toggle, mainly because in the router I don't have access to the config and I don't know if one would want to disable metrics.

You can pass in a new param to NewAPIRouter() which includes whichever config you need. It didn't make sense to pass in a config to that func until now, but if that changes, I have nothing against adding a param that makes sense.

I think it may be useful to have a toggle to disable metrics.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

.. I now pass the config into NewAPIRouter

config/config.go Outdated Show resolved Hide resolved
runner/runner.go Outdated
@@ -265,6 +266,14 @@ type Runner struct {
credentials map[string]config.Github
}

func (r *Runner) GetControllerID() (uuid.UUID, error) {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This can be removed if you give the collector access to the store.

As a side note, all functions should accept context.Context, for at least the following reasons:

  • The context may get canceled and if we pass it along, it allows the various libraries we use to cancel their tasks and return gracefully.
  • It allows us to do some authorization checks

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I didn't give the collector access directly to the store, for now this is not necessary, access to the runner seems to be a good level of abstraction.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I agree. See above wall of text 😄

apiserver/controllers/controllers.go Outdated Show resolved Hide resolved
@maigl
Copy link
Contributor Author

maigl commented Jan 20, 2023

Thank you, I'm a bit busy next week, but I'll fix everything as soon as I can.

@gabriel-samfira
Copy link
Member

Thank you, I'm a bit busy next week, but I'll fix everything as soon as I can.

No worries at all! Thanks for all the work! This will be a great addition to garm!

@gabriel-samfira
Copy link
Member

I fixed some linting errors and added a new workflow that checks things like gofmt, linting, vendoring consistency, etc. There are now a few makefile targets that should help out.

make verify

will run the lint, gofmt, vendor checks.

make test

will run verify and the go tests.

This is the last time I cause conflicts for your PR 😄. Sorry about that!

config/config.go Outdated Show resolved Hide resolved
@gabriel-samfira
Copy link
Member

The last two comments are the last nitpicking I have 😄

cmd/garm/main.go Outdated Show resolved Hide resolved
@maigl
Copy link
Contributor Author

maigl commented Jan 26, 2023

.. so .. finally, I think you should be good to do another review ..

runner/runner.go Outdated
}

func (r *Runner) GetControllerInfo(ctx context.Context) (params.ControllerInfo, error) {
if r.controllerInfo == (params.ControllerInfo{}) {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We have the controller ID already set in the Runner.poolManagerCtrl.controllerID. You should be able to access it. You can also add it to the Runner{} itself if you wish. The runner.NewRunner() function, errors out if it can't get the controller ID, so having the controller ID set is guaranteed.

Which means we only have to get the hostname, which shouldn't error out.

So this could be as simple as:

return params.ControllerInfo{
  ControllerID: r.poolManagerCtrl.controllerID,
  Hostname: os.Hostname(),
}

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Everything else looks fine!

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ahh. My mistake. os.Hostname() can return an error. So that would be something like:

hostname, err := os.Hostname()
if err != nil {
    return params.ControllerInfo{}, errors.Wrap(err, "getting hostname")
}
return params.ControllerInfo{
  ControllerID: r.poolManagerCtrl.controllerID,
  Hostname: hostname,
}, nil

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

hm .. accessing the id through Runner.poolManagerCtrl.controllerID is not that easy .. as you decoupled the access via an interface, I didn't want to extend this .. and type assertion seems not the right way ..

os.Hostname will probably not really error .. and if so we will ignore it anyway .. we cannot crash, right?
So I went for having no error result, but there's still the context in the runner.GetControllerInfo .. do you think that still makes much sense ?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You are right. Then it might be worth adding a new un-exported field to Runner{} like controllerID uuid.UUID, and cache the controller ID there for later use.

Note, the NewRunner() function is only called once. If you want to catch a hostname change, you probably want to call that in GetControllerInfo(). The call to os.Hostname() shouldn't error, but if they added that value as a return, it may happen, in which case you risk getting an empty string there. Is that ok in your current code path?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

So I went for having no error result, but there's still the context in the runner.GetControllerInfo .. do you think that still makes much sense ?

The context is useful if we want to limit who gets to fetch this info. There is no real sensitive info returned by this function, but we should probably still check for authorization here. In the future we may extend this function to return more metadata about the controller, especially if we will decouple the code and possibly deploy multiple instances of that particular component.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I will not nitpick too much on this. I am okay with merging as is, and making changes later as needed. The only thing I think is worth changing is the location of the call to os.Hostname(). It's worth moving it into GetControllerInfo() as that should resolve the hostname on every call to GetControllerInfo(), as opposed to caching it for the duration of the lifetime of the current process.

You mentioned previously that you would rather pick up on hostname changes without restarting garm.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You mentioned previously that you would rather pick up on hostname changes without restarting garm.

right .. thanks for your reviews and your patience with me 😄

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you for the work you put into this!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

2 participants