
VMAgent dropping target on high load #582

Closed
AzSiAz opened this issue Jun 23, 2020 · 10 comments
Labels
bug Something isn't working

Comments

@AzSiAz

AzSiAz commented Jun 23, 2020

Describe the bug
When there is a large number of targets to scrape, vmagent has errors while scraping; this seems to happen only with a large number of nodes/pods (cf. the last screenshot), before it drops all scrape targets.

Expected behavior
It should scrape the targets, like Prometheus does.

Screenshots
[screenshot]
Sometimes vmagent also removes all scrape targets:
[screenshot]

Last 1h (number of pods)
[screenshot]
Last 12h (number of pods)
[screenshot]

Version
The one from your Helm chart; VMAgent appVersion: v1.37.2

Used command-line flags
The ones from your vmagent Helm chart, plus 3 custom ones (sketched below):

  • remoteWrite.maxBlockSize: "1000000"
  • remoteWrite.basicAuth.username
  • remoteWrite.basicAuth.password
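
For reference, a minimal sketch of how these might be set in the chart's values; the extraArgs key is an assumption about this chart's values schema, not taken from the issue, and credentials are redacted:

```yaml
# Hedged sketch only: assumes the chart forwards entries under an
# extraArgs map as vmagent command-line flags (key name is an assumption).
extraArgs:
  remoteWrite.maxBlockSize: "1000000"
  remoteWrite.basicAuth.username: "<redacted>"
  remoteWrite.basicAuth.password: "<redacted>"
```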

Additional context
K8s is Azure AKS

@valyala
Collaborator

valyala commented Jun 23, 2020

The first screenshot with error logs shows that the K8S API server had some issues. It couldn't dial certain K8S nodes via /api/v1/nodes/*/proxy/metrics/cadvisor, failing with the error no route to host. This error means the given K8S nodes were unreachable from the K8S API server at that time.

vmagent couldn't dial certain targets on port 3101 during the same period, failing with a dialing to the given TCP address timed out error.

These errors suggest that there were networking issues in K8S during this time frame.

Sometimes vmagent also removes all scrape targets

This looks like a bug in vmagent. It should keep the previous targets if it cannot obtain a new target list due to the errors listed on the first screenshot. Could you provide the log messages emitted before the total targets: 0 message?

@valyala valyala added the bug Something isn't working label Jun 23, 2020
@AzSiAz
Author

AzSiAz commented Jun 23, 2020

The first screenshot with error logs shows that the K8S API server had some issues. It couldn't dial certain K8S nodes via /api/v1/nodes/*/proxy/metrics/cadvisor, failing with the error no route to host. This error means the given K8S nodes were unreachable from the K8S API server at that time.

vmagent couldn't dial certain targets on port 3101 during the same period, failing with a dialing to the given TCP address timed out error.

These errors suggest that there were networking issues in K8S during this time frame.

That's why it's strange: I don't recall having those errors with Prometheus, and no gaps in the graphs either.

This looks like a bug in vmagent. It should keep the previous targets if it cannot obtain a new target list due to the errors listed on the first screenshot. Could you provide the log messages emitted before the total targets: 0 message?

There are more logs like the first screenshot, but I also found this, which should be linked to the target drop:
[screenshot]
There was also no problem with discovery in Prometheus, or at least no log of it.

@valyala
Collaborator

valyala commented Jun 23, 2020

That's why it's strange: I don't recall having those errors with Prometheus, and no gaps in the graphs either.

Prometheus doesn't log scrape errors. The last error per target can be seen on the /targets page in both Prometheus and vmagent. Logging of scrape errors can be suppressed by passing the -promscrape.suppressScrapeErrors command-line flag to vmagent. See https://victoriametrics.github.io/vmagent.html#troubleshooting for details.

As for the gaps, they may be related to the bug with dropped targets in vmagent, which leads to gaps on graphs.
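
For illustration, a hedged sketch of passing that flag through the Helm chart; the extraArgs key is an assumption about the chart's values schema, and only the flag name itself comes from the comment above:

```yaml
# Hedged sketch: suppress per-target scrape error logging in vmagent.
# Assumes the chart forwards extraArgs entries as command-line flags.
extraArgs:
  promscrape.suppressScrapeErrors: "true"
```

With the flag set, the per-target errors remain visible on the /targets page even though they are no longer logged.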

There are more logs like the first screenshot, but I also found this, which should be linked to the target drop

Thanks for these screenshots! They show the real cause of the issue with dropped targets: when vmagent couldn't query the K8S API server for updates, it was logging an error when discovering kubernetes targets error and then dropping all the scrape targets. This will be fixed soon.

@AzSiAz
Author

AzSiAz commented Jun 23, 2020

That's why it's strange: I don't recall having those errors with Prometheus, and no gaps in the graphs either.

Prometheus doesn't log scrape errors. The last error per target can be seen on the /targets page in both Prometheus and vmagent. Logging of scrape errors can be suppressed by passing the -promscrape.suppressScrapeErrors command-line flag to vmagent. See https://victoriametrics.github.io/vmagent.html#troubleshooting for details.

As for the gaps, they may be related to the bug with dropped targets in vmagent, which leads to gaps on graphs.

Oh, that's good to know for Prometheus, thanks.
I can't say for certain, since there are a lot of nodes and pods scraped, but last I checked I don't think there were scrape errors on that page either.

There are more logs like the first screenshot, but I also found this, which should be linked to the target drop

Thanks for these screenshots! They show the real cause of the issue with dropped targets: when vmagent couldn't query the K8S API server for updates, it was logging an error when discovering kubernetes targets error and then dropping all the scrape targets. This will be fixed soon.

Happy to help :)
Thanks for your work; I'll wait to test the fix then :)

valyala added a commit that referenced this issue Jun 23, 2020
@valyala
Collaborator

valyala commented Jun 23, 2020

@AzSiAz, the fix is available in commit 8f0bcec. Could you build vmagent from this commit according to these instructions and verify whether it stops dropping targets on discovery errors when the K8S API server is temporarily unavailable?

I can't say for certain, since there are a lot of nodes and pods scraped, but last I checked I don't think there were scrape errors on that page either

Both Prometheus and vmagent record an up metric per scrape target. The value of this metric equals 1 on a successful scrape and 0 on a scrape error, so it is easy to find failing targets with the following query: avg_over_time(up[5m]) < 1. This query returns non-empty data points for targets that were temporarily unavailable during the 5 minutes preceding each data point.
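
For example, the same expression can be dropped into a Prometheus-style alerting rule; the group and alert names below are illustrative, not from the issue:

```yaml
# Illustrative rules file built around the query above.
groups:
  - name: scrape-health
    rules:
      - alert: TargetScrapesFailing
        expr: avg_over_time(up[5m]) < 1
        for: 10m
        annotations:
          summary: "{{ $labels.instance }} had failed scrapes over the last 5 minutes"
```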

@AzSiAz
Author

AzSiAz commented Jun 23, 2020

@AzSiAz, the fix is available in commit 8f0bcec. Could you build vmagent from this commit according to these instructions and verify whether it stops dropping targets on discovery errors when the K8S API server is temporarily unavailable?

Thanks, I will try this commit and come back, hopefully with good news.

Both Prometheus and vmagent record an up metric per scrape target. The value of this metric equals 1 on a successful scrape and 0 on a scrape error, so it is easy to find failing targets with the following query: avg_over_time(up[5m]) < 1. This query returns non-empty data points for targets that were temporarily unavailable during the 5 minutes preceding each data point.

Well, I did not think of that one; I will use it to check uptime with the new version.

@AzSiAz
Author

AzSiAz commented Jun 23, 2020

There are still a lot of scrape errors, but it's not dropping targets anymore with your latest fix, thanks :)

[screenshot]

valyala added a commit that referenced this issue Jun 23, 2020
@AzSiAz
Author

AzSiAz commented Jun 25, 2020

Well, after regularly forcing scaling for 2 days, I am happy to say this problem did not happen again, so all is good now. The issue can be closed with the next vmagent release 😄

@valyala
Collaborator

valyala commented Jun 25, 2020

@AzSiAz , thanks for the update!

@valyala
Collaborator

valyala commented Jun 25, 2020

The bugfix has been included in v1.37.3. Closing the issue as fixed.

@valyala valyala closed this as completed Jun 25, 2020