vmselect breaks once api/v1/labels is requested #4932
Comments
To check our idea, we modified the vmselect code (WARNING: the change is incorrect and only meant for debugging) to keep reading the response, and vmselect started working fine with this "broken" tenant data set. Of course, it then broke on correct data. So it seems there is something in the buffer (probably a label name) that makes readBytes at https://github.com/VictoriaMetrics/VictoriaMetrics/blob/v1.91.3-cluster/app/vmselect/netstorage/netstorage.go#L2206 decide there is nothing left to read when there actually is.
OK, we found the problem label name. I have no idea how this could happen. I am looking for a way to insert a metric with an empty string as a label name.
Thanks for the investigation, I'll take a look at it.
This is a global vmcluster RPC issue. vmstorage uses an empty string "" as the delimiter that marks the end of the request. If an empty string is written earlier for some reason (e.g. label name == ""), vmselect stops reading at that point, the remaining data stays in the connection buffer, and it is read by the next request. This breaks the current request and every following request until the leftover data has been read from the buffer. FYI @valyala @hagen1778
it breaks cluster communication, since vmselect incorrectly reads request buffer, leaving unread data on it #4932
Hello @ptimofee, can you try building vmstorage from commit 1544d67 and test the API behavior? Build instructions: https://docs.victoriametrics.com/Cluster-VictoriaMetrics.html#building-from-sources
@f41gh7 thanks a lot, I highly appreciate such a fast reply!
It works fine!
I have a few notes regarding your fix here:
I found a way to reproduce this, i.e. how to insert data that breaks vmselect - vmstorage communication. Create a simple web server, such as nginx, that returns one static file.
Then ask vmagent to scrape this web server and send the data to vminsert via api/v1/write. This gives me an idea of how this data could have been inserted. I have just verified that
The issue with broken
Yes, I think VictoriaMetrics should reject samples with empty label names, since such label names aren't used in practice. Let's track this in a separate issue and close this one.
@valyala thanks a lot! Should I create a separate issue for rejecting empty label names, or has somebody already done it?
@valyala @f41gh7 now I know the root cause of these weird labels being inserted into the db. We rolled back vmagent and the problem disappeared. I'll create a separate issue for vmagent.
Describe the bug
We have a multi-tenant vmcluster.
Something happened with one tenant.
Once we call api/v1/labels to get metric label names for this tenant, vmselect becomes unusable.
I.e. it returns the response, but all subsequent calls to vmselect fail for every tenant.
The workaround is to wait until vmstorage prints a "connection reset by peer" error (see below),
or to restart vmselect, after which it works fine again.
We suspect that "bad" label names (with non-printable characters) were inserted into the tsdb, because the first successful response from vmselect contains many non-printable characters:
According to our debugging (query tracing, code modification, etc.), when an api/v1/labels request arrives, vmselect queries all 3 vmstorage nodes. All of them answer, but the connections to 2 of them appear to "hang" for some reason.
It seems that these hung connections to vmstorage are reused by vmselect. That's why subsequent requests to vmselect fail until the connections are reset (see vmstorage's "connection reset by peer" error).
The initial "good" response may come from the one vmstorage that is working fine (the two others, where the connection hangs, are "bad"?).
Later we came to the idea that vmstorage spends more time on the request than vmselect expects: vmselect thinks it has read the whole response while data from vmstorage is still coming. That's why we see those random label names after the last colon in the vmselect errors. They look like remnants in the vmselect <-> vmstorage buffer.
Can someone give us a clue where to dig further?
To Reproduce
Can't provide the actual steps to reproduce it since we are not sure what is causing it.
Version
vmselect-20230630-163028-tags-v1.91.3-cluster-0-g12f262c331
vmstorage-20230630-163052-tags-v1.91.3-cluster-0-g12f262c331
Logs
vmselect logs are:
2023-08-31T19:42:37.295Z warn VictoriaMetrics/app/vmselect/main.go:487 error in "/select/20/prometheus/api/v1/labels?start=1693500710&end=1693510710": cannot obtain labels: cannot fetch labels from vmstorage nodes: cannot get labels from vmstorage vmstorage-vmcluster-0.vmstorage-vmcluster.sre-monitoring:8401: cannot execute funcName="labelNames_v5" on vmstorage "10.42.1.216:8401": handler
2023-08-31T20:01:59.838Z warn VictoriaMetrics/app/vmselect/main.go:487 error in "/select/20/prometheus/api/v1/labels?start=1693434601&end=1693434901": cannot obtain labels: cannot fetch labels from vmstorage nodes: cannot get labels from vmstorage vmstorage-vmcluster-0.vmstorage-vmcluster.sre-monitoring:8401: cannot execute funcName="labelNames_v5" on vmstorage "10.42.1.216:8401": raper
2023-08-31T20:02:05.978Z warn VictoriaMetrics/app/vmselect/main.go:487 error in "/select/20/prometheus/api/v1/labels?start=1693434601&end=1693434901": cannot obtain labels: cannot fetch labels from vmstorage nodes: cannot get labels from vmstorage vmstorage-vmcluster-0.vmstorage-vmcluster.sre-monitoring:8401: cannot execute funcName="labelNames_v5" on vmstorage "10.42.1.216:8401": dialer_name
vmstorage logs are:
2023-08-31T20:15:05.171Z error VictoriaMetrics/lib/vmselectapi/server.go:216 cannot process vmselect conn 10.42.3.241:47916: cannot process vmselect request: cannot read rpcName: cannot read data size: cannot read data in 3.221 seconds: read tcp4 10.42.1.216:8401->10.42.3.241:47916: read: connection reset by peer
2023-08-31T20:15:16.951Z error VictoriaMetrics/lib/vmselectapi/server.go:216 cannot process vmselect conn 10.42.3.244:36964: cannot process vmselect request: cannot read rpcName: cannot read data size: cannot read data in 44.613 seconds: read tcp4 10.42.1.216:8401->10.42.3.244:36964: read: connection reset by peer
Screenshots
No response
Used command-line flags
No response
Additional information
No response