
VMAgent dropping target on high load #582

Closed
AzSiAz opened this issue Jun 23, 2020 · 10 comments
Labels
bug Something isn't working

Comments

@AzSiAz

AzSiAz commented Jun 23, 2020

Describe the bug
When there is a large number of targets to scrape, vmagent has errors while scraping; this seems to happen only with a large number of nodes/pods (cf. the last screenshot), before it drops all scrape targets.

Expected behavior
It should scrape the targets, like Prometheus does.

Screenshots
[screenshot]
Sometimes vmagent also removes all scrape targets:
[screenshot]

Last 1h (number of pods)
[screenshot]
Last 12h (number of pods)
[screenshot]

Version
The one from your Helm chart; VMAgent appVersion: v1.37.2

Used command-line flags
The ones from your vmagent Helm chart, plus 3 custom ones (sketched below):

  • remoteWrite.maxBlockSize: "1000000"
  • remoteWrite.basicAuth.username
  • remoteWrite.basicAuth.password
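
For reference, a minimal sketch of how these might be set in the chart's values; the extraArgs key is an assumption about this chart's values schema, not taken from the issue, and credentials are redacted:

```yaml
# Hedged sketch only: assumes the chart forwards entries under an
# extraArgs map as vmagent command-line flags (key name is an assumption).
extraArgs:
  remoteWrite.maxBlockSize: "1000000"
  remoteWrite.basicAuth.username: "<redacted>"
  remoteWrite.basicAuth.password: "<redacted>"
```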

Additional context
K8s is Azure AKS

@valyala
Collaborator

valyala commented Jun 23, 2020

The first screenshot with error logs shows that the K8S API server had some issues. It couldn't dial certain K8S nodes via /api/v1/nodes/*/proxy/metrics/cadvisor, failing with the error no route to host. This error means the given K8S nodes were unreachable from the K8S API server at that time.

vmagent couldn't dial certain targets on port 3101 during the same period, failing with a dialing to the given TCP address timed out error.

These errors suggest that there were networking issues in K8S during this time frame.

Sometimes vmagent also removes all scrape targets

This looks like a bug in vmagent. It should keep the previous targets if it cannot obtain a new target list due to the errors listed on the first screenshot. Could you provide the log messages emitted before the total targets: 0 message?

@valyala valyala added the bug Something isn't working label Jun 23, 2020
@AzSiAz
Author

AzSiAz commented Jun 23, 2020

The first screenshot with error logs shows that the K8S API server had some issues. It couldn't dial certain K8S nodes via /api/v1/nodes/*/proxy/metrics/cadvisor, failing with the error no route to host. This error means the given K8S nodes were unreachable from the K8S API server at that time.

vmagent couldn't dial certain targets on port 3101 during the same period, failing with a dialing to the given TCP address timed out error.

These errors suggest that there were networking issues in K8S during this time frame.

That's why it's strange: I don't recall having those errors with Prometheus, and no gaps in the graphs either.

This looks like a bug in vmagent. It should keep the previous targets if it cannot obtain a new target list due to the errors listed on the first screenshot. Could you provide the log messages emitted before the total targets: 0 message?

There are more logs like the first screenshot, but I also found this, which should be linked to the target drop:
[screenshot]
There was also no problem with discovery in Prometheus, or at least no log of it.

@valyala
Collaborator

valyala commented Jun 23, 2020

That's why it's strange: I don't recall having those errors with Prometheus, and no gaps in the graphs either.

Prometheus doesn't log scrape errors. The last error per target can be seen on the /targets page in both Prometheus and vmagent. Logging of scrape errors can be suppressed by passing the -promscrape.suppressScrapeErrors command-line flag to vmagent. See https://victoriametrics.github.io/vmagent.html#troubleshooting for details.

As for the gaps, they may be related to the bug with dropped targets in vmagent, which leads to gaps on graphs.
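
For illustration, a hedged sketch of passing that flag through the Helm chart; the extraArgs key is an assumption about the chart's values schema, and only the flag name itself comes from the comment above:

```yaml
# Hedged sketch: suppress per-target scrape error logging in vmagent.
# Assumes the chart forwards extraArgs entries as command-line flags.
extraArgs:
  promscrape.suppressScrapeErrors: "true"
```

With the flag set, the per-target errors remain visible on the /targets page even though they are no longer logged.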

There are more logs like the first screenshot, but I also found this, which should be linked to the target drop

Thanks for these screenshots! They show the real cause of the issue with dropped targets: when vmagent couldn't query the K8S API server for updates, it was logging an error when discovering kubernetes targets error and then dropping all the scrape targets. This will be fixed soon.

@AzSiAz
Author

AzSiAz commented Jun 23, 2020

That's why it's strange: I don't recall having those errors with Prometheus, and no gaps in the graphs either.

Prometheus doesn't log scrape errors. The last error per target can be seen on the /targets page in both Prometheus and vmagent. Logging of scrape errors can be suppressed by passing the -promscrape.suppressScrapeErrors command-line flag to vmagent. See https://victoriametrics.github.io/vmagent.html#troubleshooting for details.

As for the gaps, they may be related to the bug with dropped targets in vmagent, which leads to gaps on graphs.

Oh, that's good to know for Prometheus, thanks.
I can't say for certain, since there are a lot of nodes and pods scraped, but last I checked I don't think there were scrape errors on that page either.

There are more logs like the first screenshot, but I also found this, which should be linked to the target drop

Thanks for these screenshots! They show the real cause of the issue with dropped targets: when vmagent couldn't query the K8S API server for updates, it was logging an error when discovering kubernetes targets error and then dropping all the scrape targets. This will be fixed soon.

Happy to help :)
Thanks for your work; I'll wait to test the fix then :)

valyala added a commit that referenced this issue Jun 23, 2020
@valyala
Collaborator

valyala commented Jun 23, 2020

@AzSiAz, the fix is available in commit 8f0bcec. Could you build vmagent from this commit according to these instructions and verify whether it stops dropping targets on discovery errors when the K8S API server is temporarily unavailable?

I can't say for certain, since there are a lot of nodes and pods scraped, but last I checked I don't think there were scrape errors on that page either

Both Prometheus and vmagent record an up metric per scrape target. The value of this metric equals 1 on a successful scrape and 0 on a scrape error, so it is easy to find failing targets with the following query: avg_over_time(up[5m]) < 1. This query returns non-empty data points for targets that were temporarily unavailable during the 5 minutes preceding each data point.
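
For example, the same expression can be dropped into a Prometheus-style alerting rule; the group and alert names below are illustrative, not from the issue:

```yaml
# Illustrative rules file built around the query above.
groups:
  - name: scrape-health
    rules:
      - alert: TargetScrapesFailing
        expr: avg_over_time(up[5m]) < 1
        for: 10m
        annotations:
          summary: "{{ $labels.instance }} had failed scrapes over the last 5 minutes"
```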

@AzSiAz
Author

AzSiAz commented Jun 23, 2020

@AzSiAz, the fix is available in commit 8f0bcec. Could you build vmagent from this commit according to these instructions and verify whether it stops dropping targets on discovery errors when the K8S API server is temporarily unavailable?

Thanks, I will try this commit and come back, hopefully with good news.

Both Prometheus and vmagent record an up metric per scrape target. The value of this metric equals 1 on a successful scrape and 0 on a scrape error, so it is easy to find failing targets with the following query: avg_over_time(up[5m]) < 1. This query returns non-empty data points for targets that were temporarily unavailable during the 5 minutes preceding each data point.

Well, I did not think of that one; I will use it to check uptime with the new version.

@AzSiAz
Author

AzSiAz commented Jun 23, 2020

There are still a lot of scrape errors, but it's not dropping targets anymore with your latest fix, thanks :)

[screenshot]

valyala added a commit that referenced this issue Jun 23, 2020
@AzSiAz
Author

AzSiAz commented Jun 25, 2020

Well, after regularly forcing scaling for 2 days, I am happy to say this problem did not happen again, so all is good now. The issue can be closed with the next vmagent release 😄

@valyala
Collaborator

valyala commented Jun 25, 2020

@AzSiAz , thanks for the update!

@valyala
Collaborator

valyala commented Jun 25, 2020

The bugfix has been included in v1.37.3. Closing the issue as fixed.

@valyala valyala closed this as completed Jun 25, 2020