
vmselect breaks once api/v1/labels requested #4932

Closed
ptimofee opened this issue Aug 31, 2023 · 12 comments
Labels
bug Something isn't working

Comments

@ptimofee

ptimofee commented Aug 31, 2023

Describe the bug

We have a multi-tenant vmcluster.
Something went wrong with one of the tenants.

Once we call api/v1/labels to get metric label names for this tenant, vmselect becomes unusable.
I.e. it returns the response, but any subsequent calls to vmselect fail for any tenant.
The workaround is to wait for some time until vmstorage prints the "connection reset by peer" error (see below), or to simply restart vmselect, after which it works fine again.

We suspect "bad" label names (containing non-printable characters) were inserted into the TSDB, because the first successful response from vmselect contains many non-printable characters:

{
"status": "success",
"isPartial": false,
"data": [
"\u0006\u0012\u000eeCsma",
"\tsubsys",
"\n\u0007ifIndex\u0012",
"\n65\n\r\u000265\n",
"\u000b777�\u0016\u000b888",
"\u0017\u000c0",
"$\u0018sent\u0010\t",
"2121222",
"2n__fab",
"30\n(\n\u0003jo",
"33333",
"33333333",
"444",
"525262",
"5555",
"5556667",
"66666666�",
"8999",
"9��\u0001\n(\u001csnm",
"From",
"GOARCH",
"GOOS",
"Local",
"Remote",
"Serial",
"To",
"UUID",
"__name__",
"access_mode",
...
}

According to our debugging (query tracing, code modification, etc.), when an api/v1/labels request happens, vmselect asks all 3 vmstorages for an answer. All of the vmstorages answer, but the connections to 2 of them appear to "hang" for some reason.
It seems these hung connections to vmstorage are reused by vmselect, which is why subsequent requests to vmselect fail until the connections are reset (see vmstorage's "connection reset by peer" error).
Presumably the initial "good" response comes from the one vmstorage that is working fine (the two others, where the connection hangs, are "bad"?).
Later we came to the idea that vmstorage keeps working on the request longer than vmselect expects: vmselect thinks it has read the whole response while data from vmstorage is still coming. That's why we see those random label names in the vmselect errors after the last colon; they look like remnants in the vmselect <-> vmstorage buffer.

Can someone give us a clue where to dig further?

To Reproduce

Can't provide the actual steps to reproduce it since we are not sure what is causing it.

Version

vmselect-20230630-163028-tags-v1.91.3-cluster-0-g12f262c331
vmstorage-20230630-163052-tags-v1.91.3-cluster-0-g12f262c331

Logs

vmselect logs are:
2023-08-31T19:42:37.295Z warn VictoriaMetrics/app/vmselect/main.go:487 error in "/select/20/prometheus/api/v1/labels?start=1693500710&end=1693510710": cannot obtain labels: cannot fetch labels from vmstorage nodes: cannot get labels from vmstorage vmstorage-vmcluster-0.vmstorage-vmcluster.sre-monitoring:8401: cannot execute funcName="labelNames_v5" on vmstorage "10.42.1.216:8401": handler
2023-08-31T20:01:59.838Z warn VictoriaMetrics/app/vmselect/main.go:487 error in "/select/20/prometheus/api/v1/labels?start=1693434601&end=1693434901": cannot obtain labels: cannot fetch labels from vmstorage nodes: cannot get labels from vmstorage vmstorage-vmcluster-0.vmstorage-vmcluster.sre-monitoring:8401: cannot execute funcName="labelNames_v5" on vmstorage "10.42.1.216:8401": raper
2023-08-31T20:02:05.978Z warn VictoriaMetrics/app/vmselect/main.go:487 error in "/select/20/prometheus/api/v1/labels?start=1693434601&end=1693434901": cannot obtain labels: cannot fetch labels from vmstorage nodes: cannot get labels from vmstorage vmstorage-vmcluster-0.vmstorage-vmcluster.sre-monitoring:8401: cannot execute funcName="labelNames_v5" on vmstorage "10.42.1.216:8401": dialer_name

vmstorage logs are:
2023-08-31T20:15:05.171Z error VictoriaMetrics/lib/vmselectapi/server.go:216 cannot process vmselect conn 10.42.3.241:47916: cannot process vmselect request: cannot read rpcName: cannot read data size: cannot read data in 3.221 seconds: read tcp4 10.42.1.216:8401->10.42.3.241:47916: read: connection reset by peer
2023-08-31T20:15:16.951Z error VictoriaMetrics/lib/vmselectapi/server.go:216 cannot process vmselect conn 10.42.3.244:36964: cannot process vmselect request: cannot read rpcName: cannot read data size: cannot read data in 44.613 seconds: read tcp4 10.42.1.216:8401->10.42.3.244:36964: read: connection reset by peer

Screenshots

No response

Used command-line flags

No response

Additional information

No response

@ptimofee ptimofee added the bug Something isn't working label Aug 31, 2023
@ptimofee
Author

ptimofee commented Aug 31, 2023

So, to check our idea, we modified the vmselect code (WARNING: it's incorrect, just for debugging) to keep reading the response further, and vmselect started working fine with this "broken" tenant data set! Of course, it broke with correct data as well.

So it seems there is something in the buffer (probably a label name) that makes readBytes here https://github.com/VictoriaMetrics/VictoriaMetrics/blob/v1.91.3-cluster/app/vmselect/netstorage/netstorage.go#L2206 decide there is nothing left to read while actually there is.
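
To illustrate what we think is happening, here is a minimal, self-contained sketch (this is not the actual netstorage code; the length-prefixed wire format and helper names below are assumptions): label names arrive as a stream of strings terminated by an empty string, so an empty label name in the data makes the reader stop early and leaves unread bytes in the connection buffer.

package main

import (
    "bytes"
    "encoding/binary"
    "fmt"
    "io"
)

// writeString writes a length-prefixed string (hypothetical wire format).
func writeString(w *bytes.Buffer, s string) {
    var size [8]byte
    binary.BigEndian.PutUint64(size[:], uint64(len(s)))
    w.Write(size[:])
    w.WriteString(s)
}

// readString reads a single length-prefixed string.
func readString(r io.Reader) (string, error) {
    var size [8]byte
    if _, err := io.ReadFull(r, size[:]); err != nil {
        return "", err
    }
    b := make([]byte, binary.BigEndian.Uint64(size[:]))
    if _, err := io.ReadFull(r, b); err != nil {
        return "", err
    }
    return string(b), nil
}

func main() {
    var conn bytes.Buffer

    // "vmstorage" side: the stored label names include a broken empty name.
    for _, name := range []string{"__name__", "", "ifIndex", "subsys"} {
        writeString(&conn, name)
    }
    writeString(&conn, "") // intended end-of-response marker

    // "vmselect" side: read label names until the empty-string marker.
    var labels []string
    for {
        s, err := readString(&conn)
        if err != nil || s == "" {
            break // the empty label name is mistaken for the end of the response
        }
        labels = append(labels, s)
    }
    fmt.Println("labels read:", labels)                         // only [__name__]
    fmt.Println("unread bytes left in the buffer:", conn.Len()) // leftover data breaks subsequent requests
}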

@ptimofee
Author

ptimofee commented Sep 1, 2023

OK, we found the problematic label name.
Somehow an empty string was inserted into the TSDB as a label name. This is the root cause.

I have no idea how this could happen. I'm looking for a way to insert a metric with an empty string as a label name.

@f41gh7
Contributor

f41gh7 commented Sep 1, 2023

Thanks for the investigation, I'll take a look at it.

@f41gh7
Contributor

f41gh7 commented Sep 1, 2023

It's a global vmcluster RPC issue. vmstorage uses an empty string "" as the delimiter for the end of the request.

If for some reason an empty string is written during the request (e.g. label name == ""), it leaves data inside the connection buffer, and that data will be read by the next request.

Such behavior breaks the current request and any other requests until the data is read from the buffer.
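
A minimal sketch of the fix direction on the vmstorage side (illustrative only; the writeString helper and wire format are hypothetical, not the actual vmselectapi code): drop empty label names before writing, so an empty string can only ever mean "end of response".

package main

import (
    "bytes"
    "encoding/binary"
    "fmt"
)

// writeString writes a length-prefixed string (hypothetical wire format,
// not the actual vmselectapi encoding).
func writeString(w *bytes.Buffer, s string) {
    var size [8]byte
    binary.BigEndian.PutUint64(size[:], uint64(len(s)))
    w.Write(size[:])
    w.WriteString(s)
}

// writeLabelNames sends label names followed by the end-of-response marker,
// skipping empty names so the marker can never appear mid-stream.
func writeLabelNames(conn *bytes.Buffer, labelNames []string) {
    for _, name := range labelNames {
        if name == "" {
            continue // would otherwise be read as the end-of-response marker
        }
        writeString(conn, name)
    }
    writeString(conn, "") // the real end-of-response marker
}

func main() {
    var conn bytes.Buffer
    writeLabelNames(&conn, []string{"__name__", "", "job"}) // the "" entry is skipped
    fmt.Println("bytes written:", conn.Len())
}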

FYI @valyala @hagen1778

f41gh7 added a commit that referenced this issue Sep 1, 2023
it breaks cluster communication, since vmselect incorrectly reads request buffer, leaving unread data on it
#4932
@f41gh7
Contributor

f41gh7 commented Sep 1, 2023

Hello @ptimofee, can you try to build vmstorage from commit 1544d67 and test the API behavior?

Build instructions: https://docs.victoriametrics.com/Cluster-VictoriaMetrics.html#building-from-sources

@ptimofee
Author

ptimofee commented Sep 1, 2023

@f41gh7 thanks a lot, I highly appreciate such a fast reply!
That's exactly what we did to read all the data.
I'm sure it works, but let me check it again just in case.

@ptimofee
Author

ptimofee commented Sep 1, 2023

It works fine!

@ptimofee
Author

ptimofee commented Sep 1, 2023

I have a few notes regarding your fix here:

  • the comment says "Skip empty label names, since they have no sense for prometheus". It refers to Prometheus, however the problem is not related to Prometheus; it's a general problem breaking vmselect - vmstorage communication. I'd say the comment is a bit misleading.
  • the fix is for vmstorage, in case such data was somehow already inserted. But what about a vminsert fix to prevent the insertion of such data in the first place?

@ptimofee
Author

ptimofee commented Sep 1, 2023

I found a way to reproduce this, i.e. how to insert data which breaks vmselect - vmstorage communication:

Create a simple web server, like nginx, which returns one static file.
Put this into the file:

echo -ne 'metric_name{label1="val1",\0="val2"} 1' > static_file

Then ask vmagent to scrape this web server and send the data to vminsert via api/v1/write.
This NUL byte (\0) as a label name will break the vmcluster.
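
For a throwaway test environment, a tiny Go scrape target equivalent to that nginx static file might look like this (the port and handler path are arbitrary; don't point a real vmagent at it unless you want to break the tenant):

package main

import (
    "log"
    "net/http"
)

func main() {
    http.HandleFunc("/metrics", func(w http.ResponseWriter, r *http.Request) {
        // Same payload as: echo -ne 'metric_name{label1="val1",\0="val2"} 1'
        // The second label name is a single NUL byte.
        w.Write([]byte("metric_name{label1=\"val1\",\x00=\"val2\"} 1"))
    })
    log.Fatal(http.ListenAndServe(":9999", nil))
}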

This gives me an idea of how we could have got this inserted.
Somehow vmagent scraped some binary output. It could have happened while we were reconfiguring its scrapes and pointed it at some "bad" endpoint.
Another idea is SNMP. This tenant has a bunch of different PDUs and network devices which get added/removed from time to time. SNMP has been known for bad/broken/weird implementations since the beginning, so this data could have come from some device.

I have just verified that the usePromCompatibleNaming option set on vminsert mitigates the issue for new inserts. Instead of an empty label name you get just an underscore ('_'), as expected.
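
A rough sketch of the kind of sanitization we observe with usePromCompatibleNaming (illustrative only, not the actual vminsert code; the exact allowed character set below is an assumption): characters outside the Prometheus label-name alphabet are replaced with '_', so the NUL-byte label name shows up as '_' instead of an empty name.

package main

import (
    "fmt"
    "regexp"
)

// Characters outside the usual Prometheus label-name alphabet
// (assumption: the real flag may differ in details).
var unsupportedChars = regexp.MustCompile(`[^a-zA-Z0-9_]`)

func sanitizeLabelName(name string) string {
    if name == "" {
        return "_" // never let an empty label name through (assumption)
    }
    return unsupportedChars.ReplaceAllString(name, "_")
}

func main() {
    fmt.Printf("%q\n", sanitizeLabelName("\x00"))    // "_"
    fmt.Printf("%q\n", sanitizeLabelName("ifIndex")) // "ifIndex"
}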

valyala added a commit that referenced this issue Sep 1, 2023
…#4936)

* lib/vmselectapi: do not send empty label names for labelNames request
it breaks cluster communication, since vmselect incorrectly reads request buffer, leaving unread data on it
#4932

* typo fix

* wip

---------

Co-authored-by: Aliaksandr Valialkin <valyala@victoriametrics.com>
@valyala
Collaborator

valyala commented Sep 2, 2023

The issue with broken /api/v1/labels when VictoriaMetrics contains an empty label name has been fixed in v1.93.3.

the fix is for vmstorage in case the data was inserted somehow. But what about vminsert fix to prevent insertion of such data?

Yes, I think VictoriaMetrics should reject inserting samples with empty label names, since such label names aren't used in practice. Let's track this in a separate issue, while closing this one.

@valyala valyala closed this as completed Sep 2, 2023
@ptimofee
Author

ptimofee commented Sep 2, 2023

@valyala thanks a lot! Am I supposed to create a separate issue for rejecting the insertion of empty label names? Or has somebody already done it?

@ptimofee
Author

ptimofee commented Sep 5, 2023

@valyala @f41gh7 now I know the root cause of these weird labels being inserted into the DB.
We got the same problem for other tenants once we updated vmagent from v1.91.3 to v1.93.1 (or to v1.93.3, not fully sure).
Now we have random values and random labels attached to random metrics.
It looks like all metrics coming through vmagent are now randomly mixed up, and I have a feeling that remote_write or relabeling broke between those releases. This BUGFIX looks suspicious.

We rolled back vmagent and the problem disappeared.

I'll create a separate issue for vmagent.
