vmagent debug kube-state-metrics 0/0 target #5389

Closed · 2 of 3 tasks
k0nstantinv opened this issue Nov 24, 2023 · 29 comments
Labels: question, vmagent

@k0nstantinv

Is your question request related to a specific component?

vmagent

Describe the question in detail

I'm in the process of migrating from Prometheus to VictoriaMetrics in a huge AWS cluster.
I'm running kube-prometheus-stack and vm-stack side by side during the migration.

VictoriaMetrics is deployed via the victoria-metrics-k8s-stack chart. All the ServiceMonitors were converted to VMServiceScrapes.

We've discovered that lots of vmagent targets show 0/0 and we have no idea why. There are no related error logs in the vmagents or anywhere else.

Please give some advice on how to debug a target, for example the kube-state-metrics target.

(screenshot: vmagent targets page showing 0/0 for the kube-state-metrics target)

Also, vmagent's discovered targets endpoint does not contain any entries for the kube-state-metrics target.
I'm ready to post any logs or attachments.

Environment info:
AWS EKS
vmcluster - v1.94.0
vmagent - v1.94.0 (also tried v1.95.1)

VMServiceScrape kube-state-metrics:

apiVersion: operator.victoriametrics.com/v1beta1
kind: VMServiceScrape
metadata:
  labels:
    app.kubernetes.io/component: metrics
    app.kubernetes.io/instance: dev-eks-eu-central1-monitoring
    app.kubernetes.io/managed-by: Helm
    app.kubernetes.io/name: kube-state-metrics
    app.kubernetes.io/part-of: kube-state-metrics
    app.kubernetes.io/version: 2.6.0
    argocd.argoproj.io/instance: dev-eks-eu-central1-monitoring
    helm.sh/chart: kube-state-metrics-4.18.0
    k8slens-edit-resource-version: v1
    release: dev-eks-eu-central1-monitoring
  name: kube-state-metrics
  namespace: monitoring
  ownerReferences:
  - apiVersion: monitoring.coreos.com/v1
    blockOwnerDeletion: true
    controller: true
    kind: ServiceMonitor
    name: kube-state-metrics
    uid: 17b65c47-fcf5-4bea-82b4-4c7ae3724a6e
  resourceVersion: "1294781553"
  uid: f9bace3e-cc40-480c-93d7-0a96684f04eb
spec:
  endpoints:
  - attach_metadata: {}
    honorLabels: true
    interval: 1m
    port: http
    scrapeTimeout: 30s
  jobLabel: app.kubernetes.io/name
  namespaceSelector: {}
  selector:
    matchLabels:
      app.kubernetes.io/instance: dev-eks-eu-central1-monitoring
      app.kubernetes.io/name: kube-state-metrics

Job definition generated for vmagent:

- job_name: serviceScrape/monitoring/kube-state-metrics/0
  honor_labels: true
  kubernetes_sd_configs:
  - role: endpoints
    namespaces:
      names:
      - monitoring
  scrape_interval: 1m
  scrape_timeout: 30s
  relabel_configs:
  - action: keep
    source_labels:
    - __meta_kubernetes_service_label_app_kubernetes_io_instance
    regex: dev-eks-eu-central1-monitoring
  - action: keep
    source_labels:
    - __meta_kubernetes_service_label_app_kubernetes_io_name
    regex: kube-state-metrics
  - action: keep
    source_labels:
    - __meta_kubernetes_endpoint_port_name
    regex: http
  - source_labels:
    - __meta_kubernetes_endpoint_address_target_kind
    - __meta_kubernetes_endpoint_address_target_name
    separator: ;
    regex: Node;(.*)
    replacement: ${1}
    target_label: node
  - source_labels:
    - __meta_kubernetes_endpoint_address_target_kind
    - __meta_kubernetes_endpoint_address_target_name
    separator: ;
    regex: Pod;(.*)
    replacement: ${1}
    target_label: pod
  - source_labels:
    - __meta_kubernetes_pod_name
    target_label: pod
  - source_labels:
    - __meta_kubernetes_pod_container_name
    target_label: container
  - source_labels:
    - __meta_kubernetes_namespace
    target_label: namespace
  - source_labels:
    - __meta_kubernetes_service_name
    target_label: service
  - source_labels:
    - __meta_kubernetes_service_name
    target_label: job
    replacement: ${1}
  - source_labels:
    - __meta_kubernetes_service_label_app_kubernetes_io_name
    target_label: job
    regex: (.+)
    replacement: ${1}
  - target_label: endpoint
    replacement: http

Thanks for your attention, I appreciate any help.

Troubleshooting docs

k0nstantinv added the question label on Nov 24, 2023
@dmitryk-dk
Contributor

dmitryk-dk commented Nov 24, 2023

Hi @k0nstantinv! Can you share the values.yml? It will help us understand where the mistake might be.
vmagent v1.94.0 has a bug which was fixed in v1.93.7 LTS and v1.95.0.
But as I can see, you already tried v1.95.1, and that release should not have this bug.

@k0nstantinv
Author

k0nstantinv commented Nov 24, 2023

@dmitryk-dk kube-state-metrics is deployed separately from vm-stack; for now it's part of the kube-prometheus-stack chart, so vm-stack's values file contains nothing about kube-state-metrics except enabled: false.

@dmitryk-dk
Contributor

kube-state-metrics is deployed separately from vm-stack

Hi @k0nstantinv, we want to reproduce your issue, but we can't deploy kube-state-metrics without a values.yml.
Is it possible to get this file? Or can you explain how you deploy your kube-state-metrics?

At the moment we have our own test setup where we can see the target:
(screenshot: vmagent UI showing the kube-state-metrics target)

How we did it:

  1. We deployed kube-state-metrics with
    helm install vm/victoria-metrics-k8s-stack -f charts/victoria-metrics-k8s-stack/values.yaml using the default values.yaml
  2. We changed the standard service scrape to yours; here is an example:
apiVersion: operator.victoriametrics.com/v1beta1
kind: VMServiceScrape
metadata:
  labels:
    app.kubernetes.io/component: metrics
    app.kubernetes.io/instance: dev-eks-eu-central1-monitoring
    app.kubernetes.io/managed-by: Helm
    app.kubernetes.io/name: kube-state-metrics
    app.kubernetes.io/part-of: kube-state-metrics
    app.kubernetes.io/version: 2.6.0
    argocd.argoproj.io/instance: dev-eks-eu-central1-monitoring
    helm.sh/chart: kube-state-metrics-4.18.0
    k8slens-edit-resource-version: v1
    release: dev-eks-eu-central1-monitoring
  name: kube-state-metrics
#  namespace: monitoring
#  ownerReferences:
#    - apiVersion: monitoring.coreos.com/v1
#      blockOwnerDeletion: true
#      controller: true
#      kind: ServiceMonitor
#      name: kube-state-metrics
#      uid: 17b65c47-fcf5-4bea-82b4-4c7ae3724a6e
#  resourceVersion: "1294781553"
#  uid: f9bace3e-cc40-480c-93d7-0a96684f04eb
spec:
  endpoints:
    - attach_metadata: {}
      honorLabels: true
      interval: 1m
      port: http
      scrapeTimeout: 30s
  jobLabel: app.kubernetes.io/name
  namespaceSelector: {}
  selector:
    matchLabels:
      app.kubernetes.io/instance: k8s-test
      app.kubernetes.io/name: kube-state-metrics
  3. After that we checked the vmagent UI; please see the image above.

@k0nstantinv
Author

k0nstantinv commented Nov 26, 2023

@dmitryk-dk thanks! I understand your installation approach. As I mentioned above, kube-state-metrics itself is part of kube-prometheus-stack, which was deployed earlier. I want to understand how to debug zero targets in vmagent, so it shouldn't matter how kube-state-metrics was actually deployed.

I have the kube-state-metrics ServiceMonitor, I have the kube-state-metrics VMServiceScrape, and I have the job definition in the vmagent config, but this particular target shows only 0/0. There are no error logs or anything like that, so my goal is to learn how to make it work.

@k0nstantinv
Author

@dmitryk-dk some additional info, the cluster is really huge:

  • there are almost 10,000 VMServiceScrapes across all namespaces
  • clustered VictoriaMetrics setup
  • 5 sharded vmagents
  • 13 million active time series
    etc.

My active targets tab shows thousands of (0/0) targets, like:
(screenshot: vmagent active targets tab with many 0/0 targets)

I'm really stuck debugging the cause, with no luck so far; kube-state-metrics was just an example to start from.

Here is the values file:

nameOverride: ""
fullnameOverride: vm-stack
tenant: "0"
argocdReleaseOverride: "victoriametrics"

victoria-metrics-operator:
 env:
 - name: VM_VMALERTDEFAULT_CONFIGRELOADERCPU
   value: 200m
 - name: VM_VMALERTDEFAULT_CONFIGRELOADERMEMORY
   value: 250Mi
 - name: VM_VMAGENTDEFAULT_CONFIGRELOADERCPU
   value: 200m
 - name: VM_VMAGENTDEFAULT_CONFIGRELOADERMEMORY
   value: 250Mi
 annotations:
   argocd.argoproj.io/sync-options: ServerSideApply=true
 enabled: true
 fullnameOverride: victoria-metrics-operator
 cleanupCRD: true
 cleanupImage:
   repository: gcr.io/google_containers/hyperkube
   tag: v1.18.0
   pullPolicy: IfNotPresent
 createCRD: false
 operator:
   disable_prometheus_converter: false
   psp_auto_creation_enabled: false
   prometheus_converter_add_argocd_ignore_annotations: true
   enable_converter_ownership: true
   useCustomConfigReloader: true
 nodeSelector:
   dedicated-to: infra
 tolerations:
 - effect: NoSchedule
   key: dedicated-to
   operator: Equal
   value: victoriametrics
 resources:
   limits:
     cpu: 500m
     memory: 3Gi
   requests:
     cpu: 300m
     memory: 2Gi

serviceAccount:
 create: true
 name: ""
 
defaultRules:
 create: true
 rules:
   etcd: true
   general: true
   k8s: true
   kubeApiserver: true
   kubeApiserverAvailability: true
   kubeApiserverBurnrate: true
   kubeApiserverHistogram: true
   kubeApiserverSlos: true
   kubelet: true
   kubePrometheusGeneral: true
   kubePrometheusNodeRecording: true
   kubernetesApps: true
   kubernetesResources: true
   kubernetesStorage: true
   kubernetesSystem: true
   kubeScheduler: true
   kubeStateMetrics: true
   network: true
   node: true
   vmagent: true
   vmsingle: false
   vmhealth: true
   alertmanager: true
 runbookUrl: https://runbooks.prometheus-operator.dev/runbooks
 appNamespacesTarget: ".*"

defaultDashboardsEnabled: true

experimentalDashboardsEnabled: true

vmsingle:
 enabled: false

vmcluster:
 enabled: true
 spec:
   retentionPeriod: "7"
   replicationFactor: 1
   
   vmstorage:
     image:
       tag: v1.94.0-cluster
     replicaCount: 6
     extraArgs:
       search.maxUniqueTimeseries: "12000000"
       dedup.minScrapeInterval: 60s
       loggerFormat: json
     nodeSelector:
       dedicated-to: infra
     tolerations:
     - effect: NoSchedule
       key: dedicated-to
       operator: Equal
       value: victoriametrics    
     affinity:
       podAntiAffinity:
         requiredDuringSchedulingIgnoredDuringExecution:
           - labelSelector:
               matchExpressions:
                 - key: app.kubernetes.io/name
                   operator: In
                   values:
                     - vmstorage
             topologyKey: kubernetes.io/hostname
     storageDataPath: "/vm-data"
     storage:
       volumeClaimTemplate:
         spec:
           accessModes: [ "ReadWriteOnce" ]
           storageClassName: gp3
           resources:
             requests:
               storage: 200Gi
     resources:
       limits:
         cpu: 4
         memory: 5Gi
       requests:
         cpu: 2
         memory: 3Gi
   
   vmselect:
     image:
       tag: v1.94.0-cluster
     replicaCount: 6
     extraArgs:
       loggerFormat: json
       dedup.minScrapeInterval: 60s
       search.maxQueryDuration: 240s
       search.maxConcurrentRequests: "64"
       search.logSlowQueryDuration: 10s
       search.maxQueryLen: "65536"
       memory.allowedPercent: "60"
     nodeSelector:
       dedicated-to: infra 
     tolerations:
     - effect: NoSchedule
       key: dedicated-to
       operator: Equal
       value: victoriametrics
     affinity:
       podAntiAffinity:
         requiredDuringSchedulingIgnoredDuringExecution:
           - labelSelector:
               matchExpressions:
                 - key: app.kubernetes.io/name
                   operator: In
                   values:
                     - vmselect
             topologyKey: kubernetes.io/hostname
     cacheMountPath: "/select-cache"
     storage:
       volumeClaimTemplate:
         spec:
           accessModes: [ "ReadWriteOnce" ]
           storageClassName: gp3
           resources:
             requests:
               storage: 20Gi
     resources:
       limits:
         cpu: 1
         memory: 2Gi
       requests:
         cpu: 500m
         memory: 500Mi
   
   vminsert:
     image:
       tag: v1.94.0-cluster
     replicaCount: 6
     extraArgs:
       loggerFormat: json
       maxLabelsPerTimeseries: "100"
     nodeSelector:
       dedicated-to: infra
     tolerations:
     - effect: NoSchedule
       key: dedicated-to
       operator: Equal
       value: victoriametrics    
     affinity:
       podAntiAffinity:
         requiredDuringSchedulingIgnoredDuringExecution:
           - labelSelector:
               matchExpressions:
                 - key: app.kubernetes.io/name
                   operator: In
                   values:
                     - vminsert
             topologyKey: kubernetes.io/hostname
     resources:
       limits:
         cpu: 2
         memory: 2Gi
       requests:
         cpu: 1
         memory: 1Gi

 ingress:
   storage:
     enabled: false
   select:
     enabled: false
   insert:
     enabled: false

alertmanager:
 enabled: false

vmalert:
 enabled: true
 remoteWriteVMAgent: false
 spec:
   selectAllByDefault: true
   evaluationInterval: 45s
   nodeSelector:
     dedicated-to: infra
   tolerations:
   - effect: NoSchedule
     key: dedicated-to
     operator: Equal
     value: victoriametrics
   resources:
     requests:
       cpu: 200m
       memory: 200Mi
     limits:
       cpu: 1
       memory: 1Gi

vmagent:
 enabled: true
 replicaCount: 1
 shardCount: 5
 scrapeInterval: 30s
 spec:
   extraArgs:
     loggerFormat: json
     promscrape.maxScrapeSize: 1GB
     promscrape.noStaleMarkers: "true"
     promscrape.streamParse: "true"
     promscrape.suppressScrapeErrors: "true"
     remoteWrite.maxBlockSize: 256MB
     remoteWrite.maxRowsPerBlock: "50000"
     remoteWrite.queues: "150"
     remoteWrite.tlsInsecureSkipVerify: "true"
   externalLabels:
     cluster: dev-eks-eu-central1
   nodeSelector:
     dedicated-to: infra
   tolerations:
   - effect: NoSchedule
     key: dedicated-to
     operator: Equal
     value: victoriametrics
   affinity:
     podAntiAffinity:
       requiredDuringSchedulingIgnoredDuringExecution:
         - labelSelector:
             matchExpressions:
               - key: app.kubernetes.io/name
                 operator: In
                 values:
                   - vmagent
           topologyKey: kubernetes.io/hostname
   resources:
     requests:
       cpu: 1
       memory: 4Gi
     limits:
       cpu: 8
       memory: 7Gi
 ingress:
   enabled: false
 additionalScrapeConfigs: |
   - job_name: opencost
     honor_labels: true
     scrape_interval: 1m
     scrape_timeout: 10s
     metrics_path: /metrics
     scheme: http
     dns_sd_configs:
     - names:
       - opencost.opencost
       type: 'A'
       port: 9003

prometheus-node-exporter:
 enabled: false
 vmServiceScrape:
   enabled: true

kube-state-metrics:
 enabled: false
 vmServiceScrape: # TODO is it working???
   enabled: true

kubelet:
 enabled: true
 cadvisor: true
 probes: true
 spec:
   scheme: "https"
   honorLabels: true
   interval: "30s"
   scrapeTimeout: "5s"
   tlsConfig:
     insecureSkipVerify: true
     caFile: "/var/run/secrets/kubernetes.io/serviceaccount/ca.crt"
   bearerTokenFile: "/var/run/secrets/kubernetes.io/serviceaccount/token"
   # drop high cardinality label and useless metrics for cadvisor and kubelet
   metricRelabelConfigs:
     - action: labeldrop
       regex: (uid)
     - action: labeldrop
       regex: (id|name)
     - action: drop
       source_labels: [__name__]
       regex: (rest_client_request_duration_seconds_bucket|rest_client_request_duration_seconds_sum|rest_client_request_duration_seconds_count)
     - action: labeldrop
       regex: container_id
     - sourceLabels: [__name__]
       separator: ;
       regex: container_(|tasks_state|cpu_load_average_10s)
       replacement: $1
       action: drop
     - sourceLabels: [__name__]
       action: drop
       regex: 'container_network_tcp_.*'
     - sourceLabels: [__name__]
       action: drop
       regex: 'container_network_udp_.*'
     - sourceLabels: [__name__]
       action: drop
       regex: 'container_net_tcp_.*'
     - sourceLabels: [__name__]
       action: drop
       regex: 'container_http_.*'
     - sourceLabels: [__name__]
       action: drop
       regex: 'container_fs_(reads|writes)_total'
     - sourceLabels: [__name__]
       action: drop
       regex: 'container_blkio_device_usage_total'
     - sourceLabels: [__name__]
       action: drop
       regex: 'container_network_(transmit|receive)_errors_total'
     - sourceLabels: [__name__]
       action: drop
       regex: 'container_network_transmit_packets_total'
     - sourceLabels: [__name__]
       action: drop
       regex: 'container_network_receive_packets_dropped_total'
     - sourceLabels: [__name__]
       action: drop
       regex: 'container_memory_failures_total'
   relabelConfigs:
     - action: labelmap
       regex: __meta_kubernetes_node_label_(.+)
     - sourceLabels: [__metrics_path__]
       targetLabel: metrics_path
     - targetLabel: "job"
       replacement: "kubelet"
   # ignore timestamps of cadvisor's metrics by default
   # more info here https://github.com/VictoriaMetrics/VictoriaMetrics/issues/4697#issuecomment-1656540535
   honorTimestamps: false

crds:
 enabled: true

Hope it helps.

@dmitryk-dk
Contributor

kube-state-metrics:
 enabled: false
 vmServiceScrape: # TODO is it working???
   enabled: true

Hi @k0nstantinv! First of all, you should use vmagent v1.95.1 because it has a bugfix for Kubernetes service discovery.

Can you share an example of a Service which the service scrape should scrape? You should check the labels of the Service and of the VMServiceScrape selector: those labels should match, and the namespaces should match as well. If you can share this information, it could help to find the problem.
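
For illustration, here is a minimal sketch of what has to line up (the VMServiceScrape side is copied from the manifest quoted above; the Service side is an assumption until the actual manifest is shared). Note that with an empty namespaceSelector the scrape object normally matches only Services in its own namespace, following the Prometheus Operator convention:

# VMServiceScrape (namespace: monitoring), selector quoted from above
spec:
  endpoints:
  - port: http          # must match a named port on the Service
  selector:
    matchLabels:
      app.kubernetes.io/instance: dev-eks-eu-central1-monitoring
      app.kubernetes.io/name: kube-state-metrics

# The matching Service must live in the same namespace and carry the same labels
apiVersion: v1
kind: Service
metadata:
  name: kube-state-metrics
  namespace: monitoring
  labels:
    app.kubernetes.io/instance: dev-eks-eu-central1-monitoring
    app.kubernetes.io/name: kube-state-metrics
spec:
  ports:
  - name: http          # the port name referenced by the scrape endpoint
    port: 8080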

@k0nstantinv
Author

k0nstantinv commented Nov 27, 2023

you should use vmagent v1.95.1

I tried vmagent v1.95.1 and the result is the same.

Can you share an example of a Service which the service scrape should scrape?

Sure, here is the kube-state-metrics Service:

apiVersion: v1
kind: Service
metadata:
  annotations:
    kubectl.kubernetes.io/last-applied-configuration: |
      {"apiVersion":"v1","kind":"Service","metadata":{"annotations":{"prometheus.io/scrape":"true"},"labels":{"app.kubernetes.io/component":"metrics","app.kubernetes.io/instance":"dev-eks-eu-central1-monitoring","app.kubernetes.io/managed-by":"Helm","app.kubernetes.io/name":"kube-state-metrics","app.kubernetes.io/part-of":"kube-state-metrics","app.kubernetes.io/version":"2.6.0","argocd.argoproj.io/instance":"dev-eks-eu-central1-monitoring","helm.sh/chart":"kube-state-metrics-4.18.0","release":"dev-eks-eu-central1-monitoring"},"name":"kube-state-metrics","namespace":"monitoring"},"spec":{"ports":[{"name":"http","port":8080,"protocol":"TCP","targetPort":8080}],"selector":{"app.kubernetes.io/instance":"dev-eks-eu-central1-monitoring","app.kubernetes.io/name":"kube-state-metrics"},"type":"ClusterIP"}}
    prometheus.io/scrape: "true"
  creationTimestamp: "2023-01-16T17:37:23Z"
  labels:
    app.kubernetes.io/component: metrics
    app.kubernetes.io/instance: dev-eks-eu-central1-monitoring
    app.kubernetes.io/managed-by: Helm
    app.kubernetes.io/name: kube-state-metrics
    app.kubernetes.io/part-of: kube-state-metrics
    app.kubernetes.io/version: 2.6.0
    argocd.argoproj.io/instance: dev-eks-eu-central1-monitoring
    helm.sh/chart: kube-state-metrics-4.18.0
    release: dev-eks-eu-central1-monitoring
  name: kube-state-metrics
  namespace: monitoring
  resourceVersion: "56783309"
  uid: 0d357d8a-23ad-40b7-8b70-8364806042e7
spec:
  clusterIP: 172.20.254.110
  clusterIPs:
  - 172.20.254.110
  internalTrafficPolicy: Cluster
  ipFamilies:
  - IPv4
  ipFamilyPolicy: SingleStack
  ports:
  - name: http
    port: 8080
    protocol: TCP
    targetPort: 8080
  selector:
    app.kubernetes.io/instance: dev-eks-eu-central1-monitoring
    app.kubernetes.io/name: kube-state-metrics
  sessionAffinity: None
  type: ClusterIP
status:
  loadBalancer: {}

The labels match the ones in the VMServiceScrape above,
and the namespaces match as well.

Everything, both vm-stack and prometheus-stack, is deployed in the monitoring namespace.

@dmitryk-dk
Contributor

dmitryk-dk commented Nov 27, 2023

namespaceSelector

Can you check whether the describe output of the Service shows any matching endpoints (targets)? Because the labels are identical and the namespaces are the same.

@k0nstantinv
Author

@dmitryk-dk sure

$ k -n monitoring describe svc kube-state-metrics
Name:              kube-state-metrics
Namespace:         monitoring
Labels:            app.kubernetes.io/component=metrics
                   app.kubernetes.io/instance=dev-eks-eu-central1-monitoring
                   app.kubernetes.io/managed-by=Helm
                   app.kubernetes.io/name=kube-state-metrics
                   app.kubernetes.io/part-of=kube-state-metrics
                   app.kubernetes.io/version=2.6.0
                   argocd.argoproj.io/instance=dev-eks-eu-central1-monitoring
                   helm.sh/chart=kube-state-metrics-4.18.0
                   release=dev-eks-eu-central1-monitoring
Annotations:       prometheus.io/scrape: true
Selector:          app.kubernetes.io/instance=dev-eks-eu-central1-monitoring,app.kubernetes.io/name=kube-state-metrics
Type:              ClusterIP
IP Family Policy:  SingleStack
IP Families:       IPv4
IP:                172.20.254.110
IPs:               172.20.254.110
Port:              http  8080/TCP
TargetPort:        8080/TCP
Endpoints:         10.10.39.57:8080 <-------
Session Affinity:  None
Events:            <none>

$ k -n monitoring get pods -o wide | grep 10.10.39.57
kube-state-metrics-8c7857964-mcm2j                    1/1     Running            0     4d3h    10.10.39.57     ip-10-80-53-73.eu-central-1.compute.internal     <none>           <none>

Seems like there are some more complicated issues with my setup, maybe some bottleneck in vmagent or a kubernetes_sd misconfiguration... no idea.

@dmitryk-dk
Contributor

@k0nstantinv could you try downgrading vmagent to version 1.93.5 to check whether vmagent itself is the bottleneck?

@k0nstantinv
Author

k0nstantinv commented Nov 27, 2023

@k0nstantinv could you try downgrading vmagent to version 1.93.5 to check whether vmagent itself is the bottleneck?

Tried, with no luck:

$ k -n monitoring logs vmagent-vm-stack-1-9d55fc8f5-c6j9d vmagent
{"ts":"2023-11-27T13:04:11.356Z","level":"info","caller":"VictoriaMetrics/lib/logger/flag.go:12","msg":"build version: vmagent-20230919-042301-tags-v1.93.5-0-g3efbb0af2b"}
....
{"ts":"2023-11-27T13:04:35.470Z","level":"info","caller":"VictoriaMetrics/lib/promscrape/config.go:128","msg":"started service discovery routines in 18.956 seconds"}
{"ts":"2023-11-27T13:04:35.470Z","level":"info","caller":"VictoriaMetrics/lib/promscrape/scraper.go:156","msg":"SIGHUP received; reloading Prometheus configs from \"/etc/vmagent/config_out/vmagent.env.yaml\""}
{"ts":"2023-11-27T13:04:45.083Z","level":"info","caller":"VictoriaMetrics/lib/promscrape/scraper.go:166","msg":"nothing changed in \"/etc/vmagent/config_out/vmagent.env.yaml\""}
{"ts":"2023-11-27T13:05:30.554Z","level":"info","caller":"VictoriaMetrics/lib/promscrape/scraper.go:430","msg":"kubernetes_sd_configs: added targets: 2996, removed targets: 0; total targets: 2996"}
{"ts":"2023-11-27T13:10:46.460Z","level":"info","caller":"VictoriaMetrics/lib/promscrape/scraper.go:430","msg":"kubernetes_sd_configs: added targets: 0, removed targets: 1; total targets: 2995"}
{"ts":"2023-11-27T13:12:27.940Z","level":"info","caller":"VictoriaMetrics/lib/promscrape/scraper.go:430","msg":"kubernetes_sd_configs: added targets: 1, removed targets: 1; total targets: 2995"}

Still 0/0 targets, with no errors.

@dmitryk-dk
Contributor

kubernetes_sd_configs

I would like to set up an environment with precisely the same configuration. Can I ask you to share the deployment config?
Can you share the vmagent CRD?

@k0nstantinv
Author

Here is the VMAgent:

apiVersion: operator.victoriametrics.com/v1beta1
kind: VMAgent
metadata:
  finalizers:
  - apps.victoriametrics.com/finalizer
  labels:
    app.kubernetes.io/instance: monitoring
    app.kubernetes.io/managed-by: Helm
    app.kubernetes.io/name: victoria-metrics-k8s-stack
    app.kubernetes.io/version: v1.94.0
    argocd.argoproj.io/instance: dev-eks-eu-central1-monitoring-vm
    helm.sh/chart: victoria-metrics-k8s-stack-0.18.5
  name: vm-stack
  namespace: monitoring
  resourceVersion: "1351427863"
  uid: ae799f53-13dd-4d8b-a21c-cc86e2999c3a
spec:
  affinity:
    podAntiAffinity:
      requiredDuringSchedulingIgnoredDuringExecution:
      - labelSelector:
          matchExpressions:
          - key: app.kubernetes.io/name
            operator: In
            values:
            - vmagent
        topologyKey: kubernetes.io/hostname
  arbitraryFSAccessThroughSMs: {}
  containers:
  - name: config-reloader
    resources:
      limits:
        cpu: 200m
        memory: 250Mi
  externalLabels:
    cluster: dev-eks-eu-central1
  extraArgs:
    loggerFormat: json
    promscrape.disableCompression: "true"
    promscrape.discovery.concurrency: "300"
    promscrape.kubernetesSDCheckInterval: 90s
    promscrape.maxDroppedTargets: "6000"
    promscrape.maxScrapeSize: 2GB
    promscrape.noStaleMarkers: "true"
    promscrape.streamParse: "true"
    promscrape.suppressScrapeErrors: "true"
    remoteWrite.maxBlockSize: 256MB
    remoteWrite.maxRowsPerBlock: "20000"
    remoteWrite.queues: "200"
    remoteWrite.tlsInsecureSkipVerify: "true"
  image:
    tag: v1.94.0
  initContainers:
  - name: config-init
    resources:
      limits:
        cpu: 200m
        memory: 250Mi
  nodeSelector:
    dedicated-to: infra
  remoteWrite:
  - url: http://vminsert-vm-stack.monitoring.svc:8480/insert/0/prometheus/api/v1/write
  replicaCount: 1
  resources:
    limits:
      cpu: 8
      memory: 7Gi
    requests:
      cpu: 1
      memory: 4Gi
  scrapeInterval: 30s
  selectAllByDefault: true
  shardCount: 5
  tolerations:
  - effect: NoSchedule
    key: dedicated-to
    operator: Equal
    value: victoriametrics

@dmitryk-dk
Contributor

vmagent:
 enabled: true
 replicaCount: 1
 shardCount: 5
 scrapeInterval: 30s
 spec:
   extraArgs:
     loggerFormat: json
     promscrape.maxScrapeSize: 1GB
     promscrape.noStaleMar

Hi @k0nstantinv! In the previous message, I asked about the Deployment manifest itself. Could you please share it?

@k0nstantinv
Author

Here is the Deployment of the first vmagent shard:

apiVersion: apps/v1
kind: Deployment
metadata:
  annotations:
    deployment.kubernetes.io/revision: "57"
  creationTimestamp: "2023-11-22T08:52:28Z"
  finalizers:
  - apps.victoriametrics.com/finalizer
  generation: 60
  labels:
    app.kubernetes.io/component: monitoring
    app.kubernetes.io/instance: vm-stack
    app.kubernetes.io/managed-by: Helm
    app.kubernetes.io/name: vmagent
    app.kubernetes.io/version: v1.94.0
    argocd.argoproj.io/instance: dev-eks-eu-central1-monitoring-vm
    helm.sh/chart: victoria-metrics-k8s-stack-0.18.5
    managed-by: vm-operator
  name: vmagent-vm-stack-0
  namespace: monitoring
  ownerReferences:
  - apiVersion: operator.victoriametrics.com/v1beta1
    blockOwnerDeletion: true
    controller: true
    kind: VMAgent
    name: vm-stack
    uid: ae799f53-13dd-4d8b-a21c-cc86e2999c3a
  resourceVersion: "1365891400"
  uid: 7ecc3eac-24c9-4175-8947-aeec09b00967
spec:
  progressDeadlineSeconds: 600
  replicas: 1
  revisionHistoryLimit: 10
  selector:
    matchLabels:
      app.kubernetes.io/component: monitoring
      app.kubernetes.io/instance: vm-stack
      app.kubernetes.io/name: vmagent
      managed-by: vm-operator
      shard-num: "0"
  strategy:
    rollingUpdate:
      maxSurge: 25%
      maxUnavailable: 25%
    type: RollingUpdate
  template:
    metadata:
      creationTimestamp: null
      labels:
        app.kubernetes.io/component: monitoring
        app.kubernetes.io/instance: vm-stack
        app.kubernetes.io/name: vmagent
        managed-by: vm-operator
        shard-num: "0"
    spec:
      affinity:
        podAntiAffinity:
          requiredDuringSchedulingIgnoredDuringExecution:
          - labelSelector:
              matchExpressions:
              - key: app.kubernetes.io/name
                operator: In
                values:
                - vmagent
            topologyKey: kubernetes.io/hostname
      containers:
      - args:
        - --reload-url=http://localhost:8429/-/reload
        - --config-envsubst-file=/etc/vmagent/config_out/vmagent.env.yaml
        - --watched-dir=/etc/vm/relabeling
        - --watched-dir=/etc/vm/stream-aggr
        - --config-secret-name=monitoring/vmagent-vm-stack
        - --config-secret-key=vmagent.yaml.gz
        command:
        - /usr/local/bin/config-reloader
        env:
        - name: POD_NAME
          valueFrom:
            fieldRef:
              apiVersion: v1
              fieldPath: metadata.name
        image: victoriametrics/operator:config-reloader-v0.38.0
        imagePullPolicy: IfNotPresent
        name: config-reloader
        resources:
          limits:
            cpu: 200m
            memory: 250Mi
          requests:
            cpu: 200m
            memory: 250Mi
        terminationMessagePath: /dev/termination-log
        terminationMessagePolicy: FallbackToLogsOnError
        volumeMounts:
        - mountPath: /etc/vmagent/config_out
          name: config-out
        - mountPath: /etc/vm/relabeling
          name: relabeling-assets
          readOnly: true
        - mountPath: /etc/vm/stream-aggr
          name: stream-aggr-conf
          readOnly: true
      - args:
        - -httpListenAddr=:8429
        - -loggerFormat=json
        - -promscrape.cluster.name=dev-eks-eu-central1
        - -promscrape.config=/etc/vmagent/config_out/vmagent.env.yaml
        - -promscrape.disableCompression=false
        - -promscrape.discovery.concurrency=200
        - -promscrape.maxDroppedTargets=5000
        - -promscrape.maxScrapeSize=4GB
        - -promscrape.noStaleMarkers=true
        - -promscrape.streamParse=true
        - -promscrape.suppressScrapeErrors=true
        - -remoteWrite.maxBlockSize=200MB
        - -remoteWrite.maxDiskUsagePerURL=1073741824
        - -remoteWrite.maxRowsPerBlock=11000
        - -remoteWrite.queues=170
        - -remoteWrite.tlsInsecureSkipVerify=true
        - -remoteWrite.tmpDataPath=/tmp/vmagent-remotewrite-data
        - -remoteWrite.url=http://vminsert-vm-stack.monitoring.svc:8480/insert/0/prometheus/api/v1/write
        - -promscrape.cluster.membersCount=6
        - -promscrape.cluster.memberNum=0
        image: victoriametrics/vmagent:v1.94.0
        imagePullPolicy: IfNotPresent
        livenessProbe:
          failureThreshold: 10
          httpGet:
            path: /health
            port: 8429
            scheme: HTTP
          periodSeconds: 5
          successThreshold: 1
          timeoutSeconds: 5
        name: vmagent
        ports:
        - containerPort: 8429
          name: http
          protocol: TCP
        readinessProbe:
          failureThreshold: 10
          httpGet:
            path: /health
            port: 8429
            scheme: HTTP
          periodSeconds: 5
          successThreshold: 1
          timeoutSeconds: 5
        resources:
          limits:
            cpu: 3200m
            memory: 7Gi
          requests:
            cpu: 1500m
            memory: 4Gi
        terminationMessagePath: /dev/termination-log
        terminationMessagePolicy: FallbackToLogsOnError
        volumeMounts:
        - mountPath: /tmp/vmagent-remotewrite-data
          name: persistent-queue-data
        - mountPath: /etc/vmagent/config_out
          name: config-out
          readOnly: true
        - mountPath: /etc/vmagent-tls/certs
          name: tls-assets
          readOnly: true
        - mountPath: /etc/vm/relabeling
          name: relabeling-assets
          readOnly: true
        - mountPath: /etc/vm/stream-aggr
          name: stream-aggr-conf
          readOnly: true
        - mountPath: /etc/vmagent/config
          name: config
          readOnly: true
      dnsPolicy: ClusterFirst
      initContainers:
      - args:
        - --reload-url=http://localhost:8429/-/reload
        - --config-envsubst-file=/etc/vmagent/config_out/vmagent.env.yaml
        - --watched-dir=/etc/vm/relabeling
        - --watched-dir=/etc/vm/stream-aggr
        - --config-secret-name=monitoring/vmagent-vm-stack
        - --config-secret-key=vmagent.yaml.gz
        - --only-init-config
        command:
        - /usr/local/bin/config-reloader
        image: victoriametrics/operator:config-reloader-v0.38.0
        imagePullPolicy: IfNotPresent
        name: config-init
        resources:
          limits:
            cpu: 200m
            memory: 250Mi
          requests:
            cpu: 200m
            memory: 250Mi
        terminationMessagePath: /dev/termination-log
        terminationMessagePolicy: File
        volumeMounts:
        - mountPath: /etc/vmagent/config_out
          name: config-out
      nodeSelector:
        dedicated-to: infra
      restartPolicy: Always
      schedulerName: default-scheduler
      securityContext: {}
      serviceAccount: vmagent-vm-stack
      serviceAccountName: vmagent-vm-stack
      terminationGracePeriodSeconds: 30
      tolerations:
      - effect: NoSchedule
        key: dedicated-to
        operator: Equal
        value: victoriametrics
      volumes:
      - emptyDir: {}
        name: persistent-queue-data
      - name: tls-assets
        secret:
          defaultMode: 420
          secretName: tls-assets-vmagent-vm-stack
      - emptyDir: {}
        name: config-out
      - configMap:
          defaultMode: 420
          name: relabelings-assets-vmagent-vm-stack
        name: relabeling-assets
      - configMap:
          defaultMode: 420
          name: stream-aggr-vmagent-vm-stack
        name: stream-aggr-conf
      - name: config
        secret:
          defaultMode: 420
          secretName: vmagent-vm-stack

@k0nstantinv
Author

k0nstantinv commented Nov 29, 2023

@dmitryk-dk
Looks like increasing scrapeTimeout in the kube-state-metrics VMServiceScrape did the trick with the metrics.
I see metrics, I see the job in the up state, like:
(screenshot: query result showing the kube-state-metrics job in the up state)
But the target still shows 0/0, which seems incorrect. It's really unclear what is going on with the target; I mean, there should be either scrape errors or logs if the scrape can't be finished, but vmagent is completely silent about it.

I understand my VMAgent definition has the option -promscrape.suppressScrapeErrors=true, but I can assure you I've checked it even without that option.

@valyala
Collaborator

valyala commented Nov 30, 2023

If vmagent shows jobs with 0/0 targets on the http://vmagent:8429/targets page, this means that targets for this job have been discovered and then dropped during the relabeling phase. The original labels for such targets are displayed on the http://vmagent:8429/service-discovery page. The maximum number of targets which can be displayed on this page is controlled via the -promscrape.maxDroppedTargets command-line flag. If your vmagent discovers more than 10K targets, then try increasing the -promscrape.maxDroppedTargets command-line flag value to 20000. This may increase vmagent memory usage, but it should help identify the original labels for the dropped targets on the http://vmagent:8429/service-discovery page. This page also contains a debug relabeling link for every dropped target; click on the link in order to debug the actual relabeling rules against the actual discovered target labels. This may help you understand why the target has been dropped.
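
For reference, a minimal sketch of how this could be raised in the setup from this thread, via spec.extraArgs of the VMAgent CR (the field is already used in the manifests quoted above; the value is just the suggestion from this comment):

apiVersion: operator.victoriametrics.com/v1beta1
kind: VMAgent
metadata:
  name: vm-stack
  namespace: monitoring
spec:
  extraArgs:
    # keep up to 20000 dropped targets for display on the /service-discovery page;
    # larger values increase vmagent memory usage
    promscrape.maxDroppedTargets: "20000"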

@k0nstantinv
Author

k0nstantinv commented Nov 30, 2023

@valyala Thanks a lot! I know http://vmagent:8429/service-discovery can help with debugging targets. I've tried setting promscrape.maxDroppedTargets to 20000 (it was 5000). Nothing has changed on the service-discovery page (I have nearly 2000 targets down due to timeouts, connection refusals, etc.). There are neither drop nor error entries about kube-state-metrics on the service-discovery page. And it's still unclear to me how it can be that you have a job in the up state, you have metrics, you have no scrape errors/drops, and you have a 0/0 target state at the same time. I just want to make it clear.

@dmitryk-dk
Contributor

Hi @k0nstantinv! I will try to reproduce it today on my local setup. If I find that the scrape targets are available, we will need to check the code for what can cause the issue when a lot of targets are present.

@dmitryk-dk
Contributor

here is the deployment of the first vmagent shard
(quoted vmagent Deployment manifest, identical to the one posted above, omitted)

Hi @k0nstantinv! Can you share the Deployment for kube-state-metrics? I want to create the same deployment as you have, but I do not have this manifest.

@k0nstantinv
Author

@dmitryk-dk sure

apiVersion: apps/v1
kind: Deployment
metadata:
  labels:
    app.kubernetes.io/component: metrics
    app.kubernetes.io/instance: dev-eks-eu-central1-monitoring
    app.kubernetes.io/managed-by: Helm
    app.kubernetes.io/name: kube-state-metrics
    app.kubernetes.io/part-of: kube-state-metrics
    app.kubernetes.io/version: 2.6.0
    argocd.argoproj.io/instance: dev-eks-eu-central1-monitoring
    helm.sh/chart: kube-state-metrics-4.18.0
    k8slens-edit-resource-version: v1
    release: dev-eks-eu-central1-monitoring
  name: kube-state-metrics
  namespace: monitoring
spec:
  progressDeadlineSeconds: 600
  replicas: 1
  revisionHistoryLimit: 10
  selector:
    matchLabels:
      app.kubernetes.io/instance: dev-eks-eu-central1-monitoring
      app.kubernetes.io/name: kube-state-metrics
  strategy:
    rollingUpdate:
      maxSurge: 25%
      maxUnavailable: 25%
    type: RollingUpdate
  template:
    metadata:
      creationTimestamp: null
      labels:
        app.kubernetes.io/component: metrics
        app.kubernetes.io/instance: dev-eks-eu-central1-monitoring
        app.kubernetes.io/managed-by: Helm
        app.kubernetes.io/name: kube-state-metrics
        app.kubernetes.io/part-of: kube-state-metrics
        app.kubernetes.io/version: 2.6.0
        helm.sh/chart: kube-state-metrics-4.18.0
        release: dev-eks-eu-central1-monitoring
    spec:
      containers:
      - args:
        - --port=8080
        - --resources=certificatesigningrequests,configmaps,cronjobs,daemonsets,deployments,endpoints,horizontalpodautoscalers,verticalpodautoscalers,ingresses,jobs,limitranges,mutatingwebhookconfigurations,namespaces,networkpolicies,nodes,persistentvolumeclaims,persistentvolumes,poddisruptionbudgets,pods,replicasets,replicationcontrollers,resourcequotas,secrets,services,statefulsets,storageclasses,validatingwebhookconfigurations,volumeattachments
        image: registry.k8s.io/kube-state-metrics/kube-state-metrics:v2.6.0
        imagePullPolicy: IfNotPresent
        livenessProbe:
          failureThreshold: 3
          httpGet:
            path: /healthz
            port: 8080
            scheme: HTTP
          initialDelaySeconds: 5
          periodSeconds: 10
          successThreshold: 1
          timeoutSeconds: 5
        name: kube-state-metrics
        ports:
        - containerPort: 8080
          name: http
          protocol: TCP
        readinessProbe:
          failureThreshold: 3
          httpGet:
            path: /
            port: 8080
            scheme: HTTP
          initialDelaySeconds: 5
          periodSeconds: 10
          successThreshold: 1
          timeoutSeconds: 5
        resources:
          limits:
            cpu: 1800m
            memory: 5Gi
          requests:
            cpu: 600m
            memory: 3Gi
        terminationMessagePath: /dev/termination-log
        terminationMessagePolicy: File
      dnsPolicy: ClusterFirst
      nodeSelector:
        dedicated-to: infra
      restartPolicy: Always
      schedulerName: default-scheduler
      securityContext:
        fsGroup: 65534
        runAsGroup: 65534
        runAsUser: 65534
      serviceAccount: kube-state-metrics
      serviceAccountName: kube-state-metrics
      terminationGracePeriodSeconds: 30
      tolerations:
      - effect: NoSchedule
        key: dedicated-to
        operator: Equal
        value: infra

valyala added a commit that referenced this issue Dec 1, 2023
…discovery page

Previously the /service-discovery page didn't show targets dropped because of sharding
( https://docs.victoriametrics.com/vmagent.html#scraping-big-number-of-targets ).

Show also the reason why every target is dropped at /service-discovery page.
This should improve debugging of why particular targets are dropped.

While at it, do not remove dropped targets from the list at /service-discovery page
until the total number of targets exceeds the limit passed to -promscrape.maxDroppedTargets .
Previously the list was cleaned up every 10 minutes from the entries, which weren't updated
for the last minute. This could complicate debugging of dropped targets.

Updates #5389
@valyala
Collaborator

valyala commented Dec 1, 2023

it's still unclear to me how it can be that you have a job in the up state, you have metrics, you have no scrape errors/drops, and you have a 0/0 target state at the same time

Oh, I didn't pay attention that you pass -promscrape.cluster.* command-line flags to vmagent as shown in this comment. It looks like you spread scrape targets among -promscrape.cluster.membersCount=6 vmagent instances. In this case every vmagent instance discovers all the targets, but then drops all the targets except its -promscrape.cluster.replicationFactor / -promscrape.cluster.membersCount share before applying target relabeling. For example, if vmagent discovers 12K targets with -promscrape.cluster.membersCount=6, then only 12K/6=2K targets are left per vmagent instance before the target relabeling phase starts. Unfortunately, vmagent doesn't show targets dropped before relabeling on the http://vmagent:8429/service-discovery page. It is likely that the kube-state-metrics target has been dropped before relabeling on the given vmagent instance, so it isn't visible on the /service-discovery page. The temporary workaround is to search for the kube-state-metrics target on the /service-discovery page of the remaining 5 vmagent instances, which have -promscrape.cluster.memberNum in the range 1-5.

I think it would be better from debuggability PoV to show all the dropped targets at /service-discovery page (up to the limit provided via -promscrape.maxDroppedTargets command-line flag). This functionality is implemented in the commit 487f638 , which will be included in the next release. You can test this functionality by building vmagent from this commit according to these docs.
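
To make the sharding effect concrete, here is an annotated excerpt of the cluster flags from the Deployment quoted earlier (the per-shard estimate is approximate and assumes the default -promscrape.cluster.replicationFactor of 1):

- -promscrape.cluster.membersCount=6   # total number of vmagent shards
- -promscrape.cluster.memberNum=0      # index of this particular shard (0..5)
# Each shard keeps roughly 1/membersCount of all discovered targets before relabeling,
# e.g. ~2996 / 6 ≈ 500 targets per shard given the kubernetes_sd log above, so a target
# such as kube-state-metrics may be assigned to a different shard and therefore shows
# nothing (0/0) in this shard's UI. Checking /targets and /service-discovery on each of
# the shards should locate it.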

@dmitryk-dk
Contributor

Hi @k0nstantinv! I have tested your configuration, and it works as expected:

  1. In your configuration you have 5 shards of vmagent and one kube-state-metrics endpoint
  2. After the targets have been discovered, each vmagent calculates its assigned targets based on -promscrape.cluster.memberNum and -promscrape.cluster.membersCount provided by the operator
  3. Each vmagent's UI displays only the targets assigned to that vmagent

@k0nstantinv
Author

@valyala @dmitryk-dk Thanks for the details! It is clear to me now. I'll try it ASAP.

valyala added a commit that referenced this issue Dec 6, 2023
… instances, which scrape the given dropped target at /service-discovery page

The /service-discovery page contains the list of all the discovered targets
after the commit 487f638 on all the vmagent instances
in cluster mode ( https://docs.victoriametrics.com/vmagent.html#scraping-big-number-of-targets ).

This commit improves debuggability of targets in cluster mode by providing a list of -promscrape.cluster.memberNum
values per each target at /service-discovery page, which has been dropped because of sharding,
e.g. if this target is scraped by other vmagent instances in the cluster.

Updates #5389
Updates #4018
valyala added a commit that referenced this issue Dec 6, 2023
…axDroppedTargets` at `/service-discovery` page

Suggest increasing `-promscrape.maxDroppedTargets` command-line flag value if /service-discovery page
misses some dropped targets.

Updates #5389
Updates #4018
@valyala
Collaborator

valyala commented Dec 9, 2023

FYI, the next release of vmagent will also show, for every dropped target which is scraped by other vmagent instances in the cluster, the list of those vmagent instances on the /service-discovery page, as explained here.

@k0nstantinv
Author

Thank you for your attention! Seems like everything works as described here. Glad to know our report helped to improve something in the project.

@valyala
Collaborator

valyala commented Dec 13, 2023

Starting from release v1.96.0, vmagent displays all the discovered targets on the http://vmagent:8429/service-discovery page, including the targets that were dropped because of -promscrape.cluster.* settings.

valyala added a commit that referenced this issue Feb 14, 2024
…els command-line flag is set

This should save some CPU

This regression has been introduced in 487f638
when working on #5389
valyala added a commit that referenced this issue Feb 14, 2024
… to promutils.PutLabels()

This should reduce memory allocations.

This is a follow-up for b09bd6c

Updates #5389
@valyala
Collaborator

valyala commented Feb 14, 2024

FYI, vmagent versions between v1.96.0 (including) and v1.98.0 (excluding) have a performance regression, which is triggered when the -promscrape.dropOriginalLabels command-line flag is specified. This performance regression could lead to higher CPU and RAM usage when vmagent discovers a big number of targets. The regression has been fixed in v1.98.0.
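
A minimal sketch (reusing the VMAgent CR names from this thread) of pinning vmagent to a release that contains the fix:

apiVersion: operator.victoriametrics.com/v1beta1
kind: VMAgent
metadata:
  name: vm-stack
  namespace: monitoring
spec:
  image:
    tag: v1.98.0   # v1.97.2 LTS also contains the regression fix mentioned above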

@valyala
Collaborator

valyala commented Feb 14, 2024

FYI, the regression fix has been also included in v1.97.2 LTS release.
