
vm-agent 1.94.0 fails to discover targets #5216

Closed
mindw opened this issue Oct 20, 2023 · 10 comments
Labels: bug, vmagent

mindw commented Oct 20, 2023

Describe the bug

Valid targets previously discovered by 1.93.4 aren't discovered by 1.94.0.

Also, kubernetes_sd_configs with role: endpoints fails to add service labels (very likely related):

        kubernetes_sd_configs:
        - role: endpoints

To Reproduce

The target should have a history of not being discovered, for example because it was previously missing annotations or some labels.

Deploy the kube-state-metrics chart into the monitoring namespace:

helm install -n monitoring kube-state-metrics --create-namespace prometheus-community/kube-state-metrics
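
This assumes the prometheus-community chart repository has already been added (a hypothetical setup step, using the standard repository URL):

helm repo add prometheus-community https://prometheus-community.github.io/helm-charts
helm repo update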

Add a scrape job with the following discovery rule:

      - job_name: kube-state-metrics
        honor_labels: true
        kubernetes_sd_configs:
        - role: endpoints
        relabel_configs:
        - action: keep
          source_labels:
          - __meta_kubernetes_namespace
          - __meta_kubernetes_ednpoint_label_app
          - __meta_kubernetes_endpoint_port_name
          regex: monitoring;kube-state-metrics;http
        - target_label: endpoint
          replacement: http-metrics
        metric_relabel_configs:
        - action: labeldrop
          regex: (instance|endpoint)
        - action: drop
          source_labels: [__name__, secret]
          regex: 'kube_secret_.+;.+\.v\d+'

The kube-state-metrics target should list 0 discovered targets.
Add a custom label:

helm diff upgrade -n monitoring kube-state-metrics prometheus-community/kube-state-metrics --set customLabels.foo=kube-state-metrics

Discovered targets should still be 0.
Downgrade to 1.93.6.
Discovered targets should now be 1.
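
For reference, one quick way to check which labels actually end up on the Endpoints object that gets discovered (a hypothetical helper command, assuming the chart's default resource names) is:

kubectl -n monitoring get endpoints kube-state-metrics --show-labels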

Version

docker.io/victoriametrics/victoria-metrics:v1.94.0

Logs

Didn't see anything relevant.

Screenshots

1.94.0: (screenshot)
1.93.6: (screenshot)

Used command-line flags

-envflag.enable="true"
-envflag.prefix="VM_"
-http.connTimeout="12h0m0s"
-http.idleConnTimeout="5m0s"
-loggerFormat="json"
-promscrape.config="/scrapeconfig/scrape.yml"
-promscrape.configCheckInterval="1m0s"
-retentionPeriod="8d"
-search.disableAutoCacheReset="true"
-storageDataPath="/storage"
-vmalert.proxyURL="http://victoria-metrics-alerts-server:8880"

Additional information

Possible cause of trouble: commit 00685b6 and #4855

mindw added the bug label on Oct 20, 2023
mindw changed the title from "vm-agent 1.94.0 fails to discover tagets" to "vm-agent 1.94.0 fails to discover targets" on Oct 20, 2023
mindw changed the title from "vm-agent 1.94.0 fails to discover targets" to "vm-agent 1.94.0/1.95.5+ fails to discover targets" on Oct 20, 2023
mindw changed the title from "vm-agent 1.94.0/1.95.5+ fails to discover targets" to "vm-agent 1.94.0 fails to discover targets" on Oct 20, 2023
f41gh7 added the vmagent label on Oct 20, 2023
valyala (Collaborator) commented Oct 20, 2023

@mindw, thanks for filing a detailed bug report! The issue may also be related to #4850.

Amper self-assigned this on Oct 24, 2023
Amper (Contributor) commented Oct 24, 2023

Hey @mindw.
I have managed to reproduce this issue with your instructions, but:

  1. The name __meta_kubernetes_ednpoint_label_app has a typo in the word endpoint.
  2. There is no label app on the endpoints for kube-state-metrics if I install it with your instructions.

If I replace __meta_kubernetes_ednpoint_label_app with __meta_kubernetes_service_name, then everything works fine for me.
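
For illustration, the corrected keep rule might look like this (a minimal sketch, assuming the rest of the job stays unchanged and the Service is named kube-state-metrics):

        relabel_configs:
        - action: keep
          source_labels:
          - __meta_kubernetes_namespace
          - __meta_kubernetes_service_name
          - __meta_kubernetes_endpoint_port_name
          regex: monitoring;kube-state-metrics;http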

mindw (Author) commented Oct 25, 2023

@Amper yes, there is an unfortunate typo there.
I assume that it's tested on 1.94.0.
It still doesn't explain the discrepancy in the discovered targets, nor the lack of __meta_kubernetes_service* labels during target discovery.

Amper (Contributor) commented Oct 26, 2023

@mindw, sorry, I can't fully understand the problem. Could you elaborate a bit more?

It still doesn't explain the discrepancy in the discovered targets

  • Do you have the label app on the endpoints for kube-state-metrics, and the target still isn't discovered when you use __meta_kubernetes_endpoint_label_app (without the typo), right?
  • Is the problem still reproduced when __meta_kubernetes_ednpoint_label_app is replaced by __meta_kubernetes_service_name?

nor the lack of __meta_kubernetes_service* labels during target discovery.

Could you provide more info about that?

valyala (Collaborator) commented Oct 27, 2023

@mindw , could you also share logs generated by vmagent when performing the given steps to reproduce the issue for v1.93.6 and v1.94.0?

It would also be great to compare the output of the http://vmagent:8429/service-discovery page for vmagent v1.93.6 and v1.94.0 just after performing the steps given above, and to share the difference here. The /service-discovery page contains the list of all the discovered targets with all the discovered labels before relabeling is applied to them.
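
For example, the page could be captured for both versions and diffed like this (hypothetical commands, assuming vmagent listens on the default port 8429):

curl -s http://vmagent:8429/service-discovery > sd-v1.93.6.html   # while running v1.93.6
curl -s http://vmagent:8429/service-discovery > sd-v1.94.0.html   # while running v1.94.0
diff sd-v1.93.6.html sd-v1.94.0.html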

valyala (Collaborator) commented Oct 27, 2023

Update: it looks like v1.93.6 has code identical to v1.94.0 in the lib/promscrape/discovery/kubernetes/ package, which is responsible for Kubernetes service discovery in vmagent. This means that both releases of vmagent - v1.93.6 and v1.94.0 - should exhibit identical behavior when discovering Kubernetes scrape targets.

You probably mean the v1.93.4 release instead of v1.93.6? The v1.93.6 release contains changes related to #4850, which can lead to the issue described in this bug report:

diff --git a/lib/promscrape/discovery/kubernetes/api_watcher.go b/lib/promscrape/discovery/kubernetes/api_watcher.go
index 78bb7ddd7..b677bd241 100644
--- a/lib/promscrape/discovery/kubernetes/api_watcher.go
+++ b/lib/promscrape/discovery/kubernetes/api_watcher.go
@@ -1,6 +1,7 @@
 package kubernetes

 import (
+       "context"
        "encoding/json"
        "errors"
        "flag"
@@ -16,11 +17,13 @@ import (
        "sync/atomic"
        "time"

+       "github.com/VictoriaMetrics/metrics"
+
        "github.com/VictoriaMetrics/VictoriaMetrics/lib/cgroup"
        "github.com/VictoriaMetrics/VictoriaMetrics/lib/logger"
        "github.com/VictoriaMetrics/VictoriaMetrics/lib/promauth"
        "github.com/VictoriaMetrics/VictoriaMetrics/lib/promutils"
-       "github.com/VictoriaMetrics/metrics"
+       "github.com/VictoriaMetrics/VictoriaMetrics/lib/timerpool"
 )

 var apiServerTimeout = flag.Duration("promscrape.kubernetes.apiServerTimeout", 30*time.Minute, "How frequently to reload the full state from Kubernetes API server")
@@ -269,7 +272,11 @@ func selectorsKey(selectors []Selector) string {

 var (
        groupWatchersLock sync.Mutex
-       groupWatchers     = make(map[string]*groupWatcher)
+       groupWatchers     = func() map[string]*groupWatcher {
+               gws := make(map[string]*groupWatcher)
+               go groupWatchersCleaner(gws)
+               return gws
+       }()

        _ = metrics.NewGauge(`vm_promscrape_discovery_kubernetes_group_watchers`, func() float64 {
                groupWatchersLock.Lock()
@@ -279,6 +286,21 @@ var (
        })
 )

+func groupWatchersCleaner(gws map[string]*groupWatcher) {
+       for {
+               time.Sleep(7 * time.Second)
+               groupWatchersLock.Lock()
+               for key, gw := range gws {
+                       gw.mu.Lock()
+                       if len(gw.m) == 0 {
+                               delete(gws, key)
+                       }
+                       gw.mu.Unlock()
+               }
+               groupWatchersLock.Unlock()
+       }
+}
+
 type swosByKeyWithLock struct {
        mu        sync.Mutex
        swosByKey map[string][]interface{}
@@ -378,31 +400,14 @@ func (gw *groupWatcher) startWatchersForRole(role string, aw *apiWatcher) {
                                // This should guarantee that the ScrapeWork objects for these objects are properly updated
                                // as soon as the objects they depend on are updated.
                                // This should fix https://github.com/VictoriaMetrics/VictoriaMetrics/issues/1240 .
-                               go func() {
-                                       const minSleepTime = 5 * time.Second
-                                       sleepTime := minSleepTime
-                                       for {
-                                               time.Sleep(sleepTime)
-                                               startTime := time.Now()
-                                               gw.mu.Lock()
-                                               if uw.needRecreateScrapeWorks {
-                                                       uw.needRecreateScrapeWorks = false
-                                                       uw.recreateScrapeWorksLocked(uw.objectsByKey, uw.aws)
-                                                       sleepTime = time.Since(startTime)
-                                                       if sleepTime < minSleepTime {
-                                                               sleepTime = minSleepTime
-                                                       }
-                                               }
-                                               gw.mu.Unlock()
-                                       }
-                               }()
+                               go uw.recreateScrapeWorks()
                        }
                }
        }
 }

 // doRequest performs http request to the given requestURL.
-func (gw *groupWatcher) doRequest(requestURL string) (*http.Response, error) {
+func (gw *groupWatcher) doRequest(ctx context.Context, requestURL string) (*http.Response, error) {
        if strings.Contains(requestURL, "/apis/networking.k8s.io/v1/") && atomic.LoadUint32(&gw.useNetworkingV1Beta1) == 1 {
                // Update networking URL for old Kubernetes API, which supports only v1beta1 path.
                requestURL = strings.Replace(requestURL, "/apis/networking.k8s.io/v1/", "/apis/networking.k8s.io/v1beta1/", 1)
@@ -411,7 +416,7 @@ func (gw *groupWatcher) doRequest(requestURL string) (*http.Response, error) {
                // Update discovery URL for old Kubernetes API, which supports only v1beta1 path.
                requestURL = strings.Replace(requestURL, "/apis/discovery.k8s.io/v1/", "/apis/discovery.k8s.io/v1beta1/", 1)
        }
-       req, err := http.NewRequest(http.MethodGet, requestURL, nil)
+       req, err := http.NewRequestWithContext(ctx, http.MethodGet, requestURL, nil)
        if err != nil {
                logger.Fatalf("cannot create a request for %q: %s", requestURL, err)
        }
@@ -423,11 +428,11 @@ func (gw *groupWatcher) doRequest(requestURL string) (*http.Response, error) {
        if resp.StatusCode == http.StatusNotFound {
                if strings.Contains(requestURL, "/apis/networking.k8s.io/v1/") && atomic.LoadUint32(&gw.useNetworkingV1Beta1) == 0 {
                        atomic.StoreUint32(&gw.useNetworkingV1Beta1, 1)
-                       return gw.doRequest(requestURL)
+                       return gw.doRequest(ctx, requestURL)
                }
                if strings.Contains(requestURL, "/apis/discovery.k8s.io/v1/") && atomic.LoadUint32(&gw.useDiscoveryV1Beta1) == 0 {
                        atomic.StoreUint32(&gw.useDiscoveryV1Beta1, 1)
-                       return gw.doRequest(requestURL)
+                       return gw.doRequest(ctx, requestURL)
                }
        }
        return resp, nil
@@ -446,6 +451,9 @@ func (gw *groupWatcher) unsubscribeAPIWatcher(aw *apiWatcher) {
        defer gw.mu.Unlock()
        for _, uw := range gw.m {
                uw.unsubscribeAPIWatcherLocked(aw)
+               if len(uw.aws)+len(uw.awsPending) == 0 {
+                       time.AfterFunc(10*time.Second, uw.stopIfNoUsers)
+               }
        }
 }

@@ -458,6 +466,9 @@ type urlWatcher struct {
        apiURL    string
        gw        *groupWatcher

+       ctx    context.Context
+       cancel context.CancelFunc
+
        parseObject     parseObjectFunc
        parseObjectList parseObjectListFunc

@@ -488,11 +499,16 @@ type urlWatcher struct {
 func newURLWatcher(role, apiURL string, gw *groupWatcher) *urlWatcher {
        parseObject, parseObjectList := getObjectParsersForRole(role)
        metrics.GetOrCreateCounter(fmt.Sprintf(`vm_promscrape_discovery_kubernetes_url_watchers{role=%q}`, role)).Inc()
+
+       ctx, cancel := context.WithCancel(context.Background())
        uw := &urlWatcher{
                role:   role,
                apiURL: apiURL,
                gw:     gw,

+               ctx:    ctx,
+               cancel: cancel,
+
                parseObject:     parseObject,
                parseObjectList: parseObjectList,

@@ -510,6 +526,44 @@ func newURLWatcher(role, apiURL string, gw *groupWatcher) *urlWatcher {
        return uw
 }

+func (uw *urlWatcher) stopIfNoUsers() {
+       gw := uw.gw
+       gw.mu.Lock()
+       if len(uw.aws)+len(uw.awsPending) == 0 {
+               uw.cancel()
+               delete(gw.m, uw.apiURL)
+       }
+       gw.mu.Unlock()
+}
+
+func (uw *urlWatcher) recreateScrapeWorks() {
+       const minSleepTime = 5 * time.Second
+       sleepTime := minSleepTime
+       gw := uw.gw
+       stopCh := uw.ctx.Done()
+       for {
+               t := timerpool.Get(sleepTime)
+               select {
+               case <-stopCh:
+                       timerpool.Put(t)
+                       return
+               case <-t.C:
+                       timerpool.Put(t)
+               }
+               startTime := time.Now()
+               gw.mu.Lock()
+               if uw.needRecreateScrapeWorks {
+                       uw.needRecreateScrapeWorks = false
+                       uw.recreateScrapeWorksLocked(uw.objectsByKey, uw.aws)
+                       sleepTime = time.Since(startTime)
+                       if sleepTime < minSleepTime {
+                               sleepTime = minSleepTime
+                       }
+               }
+               gw.mu.Unlock()
+       }
+}
+
 func (uw *urlWatcher) subscribeAPIWatcherLocked(aw *apiWatcher) {
        if _, ok := uw.aws[aw]; !ok {
                if _, ok := uw.awsPending[aw]; !ok {
@@ -587,9 +641,11 @@ func (uw *urlWatcher) reloadObjects() string {
        // and https://github.com/VictoriaMetrics/VictoriaMetrics/issues/4855 .
        delimiter := getQueryArgsDelimiter(apiURL)
        requestURL := apiURL + delimiter + "resourceVersion=0&resourceVersionMatch=NotOlderThan"
-       resp, err := uw.gw.doRequest(requestURL)
+       resp, err := uw.gw.doRequest(uw.ctx, requestURL)
        if err != nil {
-               logger.Errorf("cannot perform request to %q: %s", requestURL, err)
+               if !errors.Is(err, context.Canceled) {
+                       logger.Errorf("cannot perform request to %q: %s", requestURL, err)
+               }
                return ""
        }
        if resp.StatusCode != http.StatusOK {
@@ -653,10 +709,18 @@ func (uw *urlWatcher) reloadObjects() string {
 //
 // See https://kubernetes.io/docs/reference/using-api/api-concepts/#efficient-detection-of-changes
 func (uw *urlWatcher) watchForUpdates() {
+       stopCh := uw.ctx.Done()
        backoffDelay := time.Second
        maxBackoffDelay := 30 * time.Second
        backoffSleep := func() {
-               time.Sleep(backoffDelay)
+               t := timerpool.Get(backoffDelay)
+               select {
+               case <-stopCh:
+                       timerpool.Put(t)
+                       return
+               case <-t.C:
+                       timerpool.Put(t)
+               }
                backoffDelay *= 2
                if backoffDelay > maxBackoffDelay {
                        backoffDelay = maxBackoffDelay
@@ -667,16 +731,26 @@ func (uw *urlWatcher) watchForUpdates() {
        timeoutSeconds := time.Duration(0.9 * float64(uw.gw.client.Timeout)).Seconds()
        apiURL += delimiter + "watch=1&allowWatchBookmarks=true&timeoutSeconds=" + strconv.Itoa(int(timeoutSeconds))
        for {
+               select {
+               case <-stopCh:
+                       metrics.GetOrCreateCounter(fmt.Sprintf(`vm_promscrape_discovery_kubernetes_url_watchers{role=%q}`, uw.role)).Dec()
+                       logger.Infof("stopped %s watcher for %q", uw.role, uw.apiURL)
+                       return
+               default:
+               }
+
                resourceVersion := uw.reloadObjects()
                if resourceVersion == "" {
                        backoffSleep()
                        continue
                }
                requestURL := apiURL + "&resourceVersion=" + url.QueryEscape(resourceVersion)
-               resp, err := uw.gw.doRequest(requestURL)
+               resp, err := uw.gw.doRequest(uw.ctx, requestURL)
                if err != nil {
-                       logger.Errorf("cannot perform request to %q: %s", requestURL, err)
-                       backoffSleep()
+                       if !errors.Is(err, context.Canceled) {
+                               logger.Errorf("cannot perform request to %q: %s", requestURL, err)
+                               backoffSleep()
+                       }
                        continue
                }
                if resp.StatusCode != http.StatusOK {
@@ -697,7 +771,7 @@ func (uw *urlWatcher) watchForUpdates() {
                err = uw.readObjectUpdateStream(resp.Body)
                _ = resp.Body.Close()
                if err != nil {
-                       if !errors.Is(err, io.EOF) {
+                       if !(errors.Is(err, io.EOF) || errors.Is(err, context.Canceled)) {
                                logger.Errorf("error when reading WatchEvent stream from %q: %s", requestURL, err)
                                uw.resourceVersion = ""
                        }

Note that the changes related to #4855 are already included in v1.93.4 in commit 3665c16.

valyala added a commit that referenced this issue Oct 27, 2023
… belong to a particular groupWatcher, at once

Previously url watchers for pod, service and node objects could be mistakenly closed
when service discovery was set up only for endpoints and endpointslice roles,
since watchers for these roles may start pod, service and node url watchers
with nil apiWatcher passed to groupWatcher.startWatchersForRole().

Now all the url watchers, which belong to a particular groupWatcher, are stopped at once
when this groupWatcher has no apiWatcher subscribers.

Updates #5216

The issue has been introduced in v1.93.5 when addressing #4850
valyala (Collaborator) commented Oct 27, 2023

@mindw , the commit 632d788 should fix the issue. Could you build vmagent from this commit according to these instructions and verify whether it fixes the role: endpoints service discovery for Kubernetes? You can also build single-node VictoriaMetrics from the same commit according to these instructions if you prefer using single-node VictoriaMetrics for scraping Prometheus-compatible targets.

You can also use the Docker tag heads-public-single-node-0-g632d788b6 for vmagent and single-node VictoriaMetrics - this tag contains the bugfix from the commit referred to above.
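
For example (hypothetical pull commands, assuming the images are published under the usual Docker Hub repositories):

docker pull victoriametrics/vmagent:heads-public-single-node-0-g632d788b6
docker pull victoriametrics/victoria-metrics:heads-public-single-node-0-g632d788b6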

valyala (Collaborator) commented Nov 2, 2023

FYI, vmagent should properly discover role: endpoints targets in Kubernetes starting from the v1.93.7 LTS release. The bugfix will also be included in the next non-LTS release (v1.95.0).

Closing the issue as fixed.

@mindw, feel free to re-open the issue if you still see inconsistent behavior in role: endpoints target discovery in the v1.93.7 release of vmagent or single-node VictoriaMetrics.

valyala closed this as completed on Nov 2, 2023
valyala (Collaborator) commented Nov 15, 2023

FYI, the bugfix has been included in vmagent starting from v1.95.0.

mindw (Author) commented Nov 30, 2023

@valyala my apologies for not getting back sooner! I was kept away by other matters.

I've deployed 1.95.1 and so far so good. All targets are detected as expected.
I'll re-open / file a new issue if the above changes.
Thank you for your support and for VM! A truly high-quality product :)
