
Metrics become very large over time on Kubernetes platform #157

Closed
iqbalaydrus opened this issue Sep 16, 2022 · 6 comments

Comments

@iqbalaydrus

Installed on kubernetes via the helm chart provided on this repo.

I see the metrics endpoint also has a hostname label. The thing with Kubernetes is that the hostname has a randomly generated suffix if you use a Deployment resource, so every restart/update of the pods generates a new hostname.

I'm also using Kubernetes' CronJob to call Celery tasks. This also generates a new hostname every time a job is called.

And now the Grafana dashboard load time worsens as time goes by. Do you know of any approach I can take to tackle this issue?

@danihodovic
Owner

cc @adinhodovic

@adinhodovic
Collaborator

adinhodovic commented Sep 16, 2022

Hi, you can always provide custom relabeling configs. Here's an example and quick hotfix to drop the hostname label:

serviceMonitor:
  enabled: true
  relabelings:
    - action: "labeldrop"
      regex: "hostname"

We could maybe provide a regex that renames the hostname label by removing the pod's randomly generated suffix.

https://prometheus.io/docs/prometheus/latest/configuration/configuration/#relabel_config
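
As an illustration of that idea, here is a hedged sketch using a plain Prometheus scrape config with a replace action instead of labeldrop. The target address and the regex are assumptions: Deployment pod names typically end in a ReplicaSet hash plus a random pod suffix, so the capture group keeps only the stable prefix of the hostname label. Adjust the regex to whatever your worker hostnames actually look like.

scrape_configs:
  - job_name: celery-exporter
    static_configs:
      - targets:
          - celery-exporter:9808   # placeholder exporter address
    metric_relabel_configs:
      # Illustrative only: assumes hostnames like "celery-worker-7d9f8b6c5d-x2k4p"
      - action: replace
        source_labels: [hostname]
        regex: "(.*)-[a-z0-9]+-[a-z0-9]{5}"
        target_label: hostname
        replacement: "$1"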

@kittywaresz
Contributor

kittywaresz commented Sep 21, 2022

Hi! @adinhodovic

I am not sure whether I should create a separate issue, but I would like to discuss your reply.

As you mentioned, there is a workaround to reduce cardinality using relabelings, but what happens with metrics like celery_worker_up if we omit the hostname label? I tried to reproduce this behaviour using metric_relabel_configs in a local environment, and as far as I can see we will fetch either 1 or 0 as the value of celery_worker_up, since the metric is no longer unique. Here is the config I used:

scrape_configs:
  - job_name: celery-worker_labeldrop
    static_configs:
      - targets:
          - my-cool-exporter:9808
    metric_relabel_configs:
      - action: labeldrop
        regex: hostname

As I see it, one possible way to handle this issue is to implement some logic that clears outdated metrics stored in the exporter's memory (in my opinion, when a worker goes offline all of its metrics become outdated and useless to scrape), i.e. when a worker goes offline, we need to remove all metrics whose hostname label equals the worker's hostname.

Does this solution make any sense from your perspective?

@adinhodovic
Collaborator

adinhodovic commented Sep 21, 2022

Yep, metric relabeling was just a quick hotfix with downsides - I think other metrics get squashed as well. In general we'd lean towards the solution you mentioned. Maybe we could introduce a flag like Flower's FLOWER_PURGE_OFFLINE_WORKERS and remove metrics related to a worker after a certain period of time. But what happens if it comes back online? Tricky situation, but maybe it does not matter all that much.

I guess the best temporary workaround is to use a StatefulSet, which has fixed hostnames (celery-worker-0, celery-worker-1, celery-worker-2).

https://flower.readthedocs.io/en/latest/config.html#purge-offline-workers
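
A minimal sketch of that pruning idea, using prometheus_client with hypothetical names (this is not the exporter's actual code): remember when each worker hostname was last seen and drop its series once it has been offline longer than a configurable timeout.

import time

from prometheus_client import Gauge

# Hypothetical metric object; the real exporter defines its own metrics.
worker_up = Gauge("celery_worker_up", "Whether the worker is online", ["hostname"])

last_seen = {}                # hostname -> unix timestamp of the last event
PURGE_AFTER_SECONDS = 600     # e.g. 10 minutes, matching the eventual default

def on_worker_event(hostname):
    """Called for every heartbeat/event received from a worker."""
    worker_up.labels(hostname=hostname).set(1)
    last_seen[hostname] = time.time()

def purge_offline_workers():
    """Drop all series for workers that have been silent for too long."""
    now = time.time()
    for hostname, seen in list(last_seen.items()):
        if now - seen > PURGE_AFTER_SECONDS:
            # Stop exposing this worker's series on /metrics.
            worker_up.remove(hostname)
            del last_seen[hostname]

In the real exporter the same removal would have to be applied to every metric carrying the hostname label; if a purged worker comes back online, its series simply reappear on the next event it emits.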

@danihodovic
Owner

The thing with Kubernetes is that the hostname has a randomly generated suffix if you use a Deployment resource, so every restart/update of the pods generates a new hostname.

Use StatefulSets, which recycle the hostname: celery-worker-0, celery-worker-1, celery-worker-2 and so on.

I'm also using Kubernetes' CronJob to call Celery tasks. This also generates a new hostname every time a job is called.

Why would a new hostname be generated for a worker if a Celery task is called 🤔 ?
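
For reference, a minimal sketch of the StatefulSet workaround mentioned above (the names, image and command are placeholders): pods get stable ordinal hostnames such as celery-worker-0, so the hostname label stays bounded across restarts.

apiVersion: apps/v1
kind: StatefulSet
metadata:
  name: celery-worker
spec:
  serviceName: celery-worker       # headless service name; placeholder
  replicas: 3
  selector:
    matchLabels:
      app: celery-worker
  template:
    metadata:
      labels:
        app: celery-worker
    spec:
      containers:
        - name: worker
          image: my-app:latest     # placeholder image
          command: ["celery", "-A", "my_app", "worker"]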

@adinhodovic
Collaborator

@kittywaresz @iqbalaydrus The newest release will prune metrics for a worker that goes offline, after 10 minutes by default (adjustable). This should result in far fewer active time series.
