
All Prometheus' targets are omitted if any single ServiceMonitor is configured incorrectly #380

Closed
jollinshead opened this issue May 19, 2017 · 9 comments

jollinshead commented May 19, 2017

We had:

  • prometheus-operator running in its own namespace
  • multiple ServiceMonitors in different namespaces which allowed Prometheus to scrape metrics from various sources

...however, when a new ServiceMonitor was created that specified a port number (rather than the port name), all targets were removed from Prometheus.


apiVersion: monitoring.coreos.com/v1alpha1
kind: ServiceMonitor
metadata:
  labels:
    app: seahorse-generic-app
  name: seahorse-generic-app
  namespace: seahorse
spec:
  endpoints:
  - port: 80
  selector:
    matchLabels:
      release: seahorse

We understand port numbers are invalid (only names are valid), but the invalid configuration should not have cleared all targets in Prometheus.
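
For reference, here is a corrected sketch of the same ServiceMonitor that references the port by name instead of by number. The name web is only an example and has to match the port name declared on the selected Service:

apiVersion: monitoring.coreos.com/v1alpha1
kind: ServiceMonitor
metadata:
  labels:
    app: seahorse-generic-app
  name: seahorse-generic-app
  namespace: seahorse
spec:
  endpoints:
  # port must be the name of a port on the selected Service, not its number
  - port: web
  selector:
    matchLabels:
      release: seahorse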

FYI the error we saw in the prometheus-operator pod's logs was:

github.com/coreos/prometheus-operator/vendor/k8s.io/client-go/tools/cache/reflector.go:94: Failed to list *v1alpha1.ServiceMonitor: json: cannot unmarshal number into Go struct field Endpoint.port of type string

brancz commented May 19, 2017

Yes, this is something that should be part of validating manifests at creation time. Sadly, this is currently not possible with third-party resources, and I'm hesitant to make errors in the workqueue fail gracefully, because that would likely cause actual errors to be swallowed and never seen.

Validation at creation time will be possible one day, though, using user-aggregated API servers, which is one of the next things on our list to tackle.


stale bot commented Aug 15, 2019

This issue has been automatically marked as stale because it has not had any activity in last 60d. Thank you for your contributions.

stale bot added the stale label Aug 15, 2019

alice-sawatzky commented Oct 7, 2019

Still an issue with alerting rules: we had a PrometheusRule with an invalid for: value ("<nil>", from a buggy Helm template), and it caused the Prometheus pods to crash-loop.
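
For illustration, a minimal sketch of what such a PrometheusRule could look like; the names, namespace, and expression are made up, and the point is the for: field rendered as the literal string "<nil>":

apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: example-rules
  namespace: monitoring
spec:
  groups:
  - name: example.rules
    rules:
    - alert: ExampleAlert
      expr: up == 0
      # the buggy template rendered a nil value here; "<nil>" is not a valid
      # Prometheus duration, so rule loading fails
      for: <nil>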

stale bot removed the stale label Oct 7, 2019
paulfantom commented

@alice-sawatzky your case looks like something that should be addressed by the team responsible for the Helm charts, as we are not maintaining them.

alice-sawatzky commented

This was an internal Helm chart that I maintain. I fixed the bug just fine; the issue is that an invalid PrometheusRule object shouldn't take down Prometheus.


brancz commented Oct 18, 2019

I agree that this is something that needs to be validated much more strictly. No single PrometheusRule should be able to take out alerting, and such an object probably shouldn't be accepted in the first place, as in it shouldn't even be possible to create it.


stale bot commented Dec 17, 2019

This issue has been automatically marked as stale because it has not had any activity in last 60d. Thank you for your contributions.

stale bot added the stale label Dec 17, 2019
brancz removed the stale label Dec 17, 2019

stale bot commented Feb 15, 2020

This issue has been automatically marked as stale because it has not had any activity in last 60d. Thank you for your contributions.

stale bot added the stale label Feb 15, 2020

brancz commented Jul 10, 2020

This should now be fixed with the validating webhooks. If you use the validating webhooks and this still occurs, please open a new issue.
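
For anyone landing here later, a minimal sketch of a ValidatingWebhookConfiguration that sends PrometheusRule objects to the operator's admission webhook for validation. The service name, namespace, webhook path, and the elided caBundle are assumptions and have to match your actual deployment:

apiVersion: admissionregistration.k8s.io/v1
kind: ValidatingWebhookConfiguration
metadata:
  name: prometheus-operator-rule-validation
webhooks:
- name: prometheusrules.monitoring.coreos.com
  admissionReviewVersions: ["v1"]
  sideEffects: None
  failurePolicy: Fail
  rules:
  - apiGroups: ["monitoring.coreos.com"]
    apiVersions: ["*"]
    operations: ["CREATE", "UPDATE"]
    resources: ["prometheusrules"]
  clientConfig:
    service:
      # assumed name, namespace, and path of the operator's admission webhook
      name: prometheus-operator
      namespace: monitoring
      path: /admission-prometheusrules/validate
    # caBundle: <base64-encoded CA that signs the webhook's serving certificate>

With failurePolicy: Fail, a broken rule is rejected at creation time instead of being picked up by the operator and breaking Prometheus.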

brancz closed this as completed Jul 10, 2020