
Metrics become very large over time on Kubernetes platform #157

Closed
iqbalaydrus opened this issue Sep 16, 2022 · 6 comments

Comments

@iqbalaydrus

Installed on kubernetes via the helm chart provided on this repo.

I see the metrics endpoint also has a hostname label. The thing with Kubernetes is that the hostname has a randomly generated suffix if you use a Deployment resource, so every restart/update of the pods generates a new hostname.

I'm also using Kubernetes' CronJob to call Celery tasks. This also generates a new hostname every time a job is called.

And now the Grafana dashboard load time worsens as time goes by. Do you know of any approach I can take to tackle this issue?

@danihodovic
Owner

cc @adinhodovic

@adinhodovic
Collaborator

adinhodovic commented Sep 16, 2022

Hi, you can always provide custom relabeling configs. Here's an example and quick hotfix to drop the hostname label:

serviceMonitor:
  enabled: true
  relabelings:
    - action: "labeldrop"
      regex: "hostname"

We could maybe provide a regex that renames the hostname label by removing the pod's randomly generated suffix.

https://prometheus.io/docs/prometheus/latest/configuration/configuration/#relabel_config
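
As an illustration of that idea, here is a hedged sketch using a plain Prometheus scrape config with a replace action instead of labeldrop. The target address and the regex are assumptions: Deployment pod names typically end in a ReplicaSet hash plus a random pod suffix, so the capture group keeps only the stable prefix of the hostname label. Adjust the regex to whatever your worker hostnames actually look like.

scrape_configs:
  - job_name: celery-exporter
    static_configs:
      - targets:
          - celery-exporter:9808   # placeholder exporter address
    metric_relabel_configs:
      # Illustrative only: assumes hostnames like "celery-worker-7d9f8b6c5d-x2k4p"
      - action: replace
        source_labels: [hostname]
        regex: "(.*)-[a-z0-9]+-[a-z0-9]{5}"
        target_label: hostname
        replacement: "$1"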

@kittywaresz
Contributor

kittywaresz commented Sep 21, 2022

Hi! @adinhodovic

I am not sure whether I should create a separate issue, but I would like to discuss your reply.

As you mentioned, there is a workaround to reduce cardinality using relabelings, but what happens with metrics like celery_worker_up if we omit the hostname label? I tried to reproduce this behaviour using metric_relabel_configs in a local environment, and as far as I can see we will fetch either 1 or 0 as the value of celery_worker_up, since the metric is no longer unique. Here is the config I used:

scrape_configs:
  - job_name: celery-worker_labeldrop
    static_configs:
      - targets:
          - my-cool-exporter:9808
    metric_relabel_configs:
      - action: labeldrop
        regex: hostname

As I see it, one possible way to handle this issue is to implement some logic that clears outdated metrics stored in the exporter's memory (in my opinion, when a worker goes offline all of its metrics become outdated and useless to scrape), i.e. when a worker goes offline, we need to remove all metrics whose hostname label equals the worker's hostname.

Does this solution make any sense from your perspective?

@adinhodovic
Collaborator

adinhodovic commented Sep 21, 2022

Yep, metric relabeling was just a quick hotfix with downsides - I think other metrics get squashed as well. In general we'd lean towards the solution you mentioned. Maybe we could introduce a flag like Flower's FLOWER_PURGE_OFFLINE_WORKERS and remove metrics related to a worker after a certain period of time. But what happens if it comes back online? Tricky situation, but maybe it does not matter all that much.

I guess the best temporary workaround is to use a StatefulSet, which has fixed hostnames (celery-worker-0, celery-worker-1, celery-worker-2).

https://flower.readthedocs.io/en/latest/config.html#purge-offline-workers
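
A minimal sketch of that pruning idea, using prometheus_client with hypothetical names (this is not the exporter's actual code): remember when each worker hostname was last seen and drop its series once it has been offline longer than a configurable timeout.

import time

from prometheus_client import Gauge

# Hypothetical metric object; the real exporter defines its own metrics.
worker_up = Gauge("celery_worker_up", "Whether the worker is online", ["hostname"])

last_seen = {}                # hostname -> unix timestamp of the last event
PURGE_AFTER_SECONDS = 600     # e.g. 10 minutes, matching the eventual default

def on_worker_event(hostname):
    """Called for every heartbeat/event received from a worker."""
    worker_up.labels(hostname=hostname).set(1)
    last_seen[hostname] = time.time()

def purge_offline_workers():
    """Drop all series for workers that have been silent for too long."""
    now = time.time()
    for hostname, seen in list(last_seen.items()):
        if now - seen > PURGE_AFTER_SECONDS:
            # Stop exposing this worker's series on /metrics.
            worker_up.remove(hostname)
            del last_seen[hostname]

In the real exporter the same removal would have to be applied to every metric carrying the hostname label; if a purged worker comes back online, its series simply reappear on the next event it emits.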

@danihodovic
Owner

The thing with Kubernetes is that the hostname has a randomly generated suffix if you use a Deployment resource, so every restart/update of the pods generates a new hostname.

Use StatefulSets, which recycle the hostname: celery-worker-0, celery-worker-1, celery-worker-2 and so on.

I'm also using Kubernetes' CronJob to call Celery tasks. This also generates a new hostname every time a job is called.

Why would a new hostname be generated for a worker if a Celery task is called 🤔 ?
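
For reference, a minimal sketch of the StatefulSet workaround mentioned above (the names, image and command are placeholders): pods get stable ordinal hostnames such as celery-worker-0, so the hostname label stays bounded across restarts.

apiVersion: apps/v1
kind: StatefulSet
metadata:
  name: celery-worker
spec:
  serviceName: celery-worker       # headless service name; placeholder
  replicas: 3
  selector:
    matchLabels:
      app: celery-worker
  template:
    metadata:
      labels:
        app: celery-worker
    spec:
      containers:
        - name: worker
          image: my-app:latest     # placeholder image
          command: ["celery", "-A", "my_app", "worker"]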

@adinhodovic
Collaborator

@kittywaresz @iqbalaydrus The newest release will prune metrics for a worker that goes offline, after 10 minutes by default (adjustable). This should result in far fewer active time series.
