
vmselect breaks once api/v1/labels requested #4932

Closed
ptimofee opened this issue Aug 31, 2023 · 12 comments
Labels
bug Something isn't working

Comments

@ptimofee

ptimofee commented Aug 31, 2023

Describe the bug

We have a multi-tenant vmcluster.
Something went wrong with one of the tenants.

Once we call api/v1/labels to get metric label names for this tenant, vmselect becomes unusable.
I.e. it returns the response, but any subsequent calls to vmselect fail for any tenant.
The workaround is to wait for some time until vmstorage prints the "connection reset by peer" error (see below), or to simply restart vmselect, after which it works fine again.

We suspect "bad" label names (containing non-printable characters) were inserted into the TSDB, because the first successful response from vmselect contains many non-printable characters:

{
"status": "success",
"isPartial": false,
"data": [
"\u0006\u0012\u000eeCsma",
"\tsubsys",
"\n\u0007ifIndex\u0012",
"\n65\n\r\u000265\n",
"\u000b777�\u0016\u000b888",
"\u0017\u000c0",
"$\u0018sent\u0010\t",
"2121222",
"2n__fab",
"30\n(\n\u0003jo",
"33333",
"33333333",
"444",
"525262",
"5555",
"5556667",
"66666666�",
"8999",
"9��\u0001\n(\u001csnm",
"From",
"GOARCH",
"GOOS",
"Local",
"Remote",
"Serial",
"To",
"UUID",
"__name__",
"access_mode",
...
}

According to our debugging (query tracing, code modification, etc.), when an api/v1/labels request happens, vmselect asks all 3 vmstorages for an answer. All of the vmstorages answer, but the connections to 2 of them appear to "hang" for some reason.
It seems these hung connections to vmstorage are reused by vmselect, which is why subsequent requests to vmselect fail until the connections are reset (see vmstorage's "connection reset by peer" error).
Presumably the initial "good" response comes from the one vmstorage that is working fine (the two others, where the connection hangs, are "bad"?).
Later we came to the idea that vmstorage keeps working on the request longer than vmselect expects: vmselect thinks it has read the whole response while data from vmstorage is still coming. That's why we see those random label names in the vmselect errors after the last colon; they look like remnants in the vmselect <-> vmstorage buffer.

Can someone give us a clue where to dig further?

To Reproduce

Can't provide the actual steps to reproduce it since we are not sure what is causing it.

Version

vmselect-20230630-163028-tags-v1.91.3-cluster-0-g12f262c331
vmstorage-20230630-163052-tags-v1.91.3-cluster-0-g12f262c331

Logs

vmselect logs are:
2023-08-31T19:42:37.295Z warn VictoriaMetrics/app/vmselect/main.go:487 error in "/select/20/prometheus/api/v1/labels?start=1693500710&end=1693510710": cannot obtain labels: cannot fetch labels from vmstorage nodes: cannot get labels from vmstorage vmstorage-vmcluster-0.vmstorage-vmcluster.sre-monitoring:8401: cannot execute funcName="labelNames_v5" on vmstorage "10.42.1.216:8401": handler
2023-08-31T20:01:59.838Z warn VictoriaMetrics/app/vmselect/main.go:487 error in "/select/20/prometheus/api/v1/labels?start=1693434601&end=1693434901": cannot obtain labels: cannot fetch labels from vmstorage nodes: cannot get labels from vmstorage vmstorage-vmcluster-0.vmstorage-vmcluster.sre-monitoring:8401: cannot execute funcName="labelNames_v5" on vmstorage "10.42.1.216:8401": raper
2023-08-31T20:02:05.978Z warn VictoriaMetrics/app/vmselect/main.go:487 error in "/select/20/prometheus/api/v1/labels?start=1693434601&end=1693434901": cannot obtain labels: cannot fetch labels from vmstorage nodes: cannot get labels from vmstorage vmstorage-vmcluster-0.vmstorage-vmcluster.sre-monitoring:8401: cannot execute funcName="labelNames_v5" on vmstorage "10.42.1.216:8401": dialer_name

vmstorage logs are:
2023-08-31T20:15:05.171Z error VictoriaMetrics/lib/vmselectapi/server.go:216 cannot process vmselect conn 10.42.3.241:47916: cannot process vmselect request: cannot read rpcName: cannot read data size: cannot read data in 3.221 seconds: read tcp4 10.42.1.216:8401->10.42.3.241:47916: read: connection reset by peer
2023-08-31T20:15:16.951Z error VictoriaMetrics/lib/vmselectapi/server.go:216 cannot process vmselect conn 10.42.3.244:36964: cannot process vmselect request: cannot read rpcName: cannot read data size: cannot read data in 44.613 seconds: read tcp4 10.42.1.216:8401->10.42.3.244:36964: read: connection reset by peer

Screenshots

No response

Used command-line flags

No response

Additional information

No response

@ptimofee ptimofee added the bug Something isn't working label Aug 31, 2023
@ptimofee
Author

ptimofee commented Aug 31, 2023

So, to check our idea, we modified the vmselect code (WARNING: it's incorrect, just for debugging) to keep reading the response further, and vmselect started working fine with this "broken" tenant data set! Of course, it broke with correct data as well.

So it seems there is something in the buffer (probably a label name) that makes readBytes here https://github.com/VictoriaMetrics/VictoriaMetrics/blob/v1.91.3-cluster/app/vmselect/netstorage/netstorage.go#L2206 decide there is nothing left to read while actually there is.
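
To illustrate what we think is happening, here is a minimal, self-contained sketch (this is not the actual netstorage code; the length-prefixed wire format and helper names below are assumptions): label names arrive as a stream of strings terminated by an empty string, so an empty label name in the data makes the reader stop early and leaves unread bytes in the connection buffer.

package main

import (
    "bytes"
    "encoding/binary"
    "fmt"
    "io"
)

// writeString writes a length-prefixed string (hypothetical wire format).
func writeString(w *bytes.Buffer, s string) {
    var size [8]byte
    binary.BigEndian.PutUint64(size[:], uint64(len(s)))
    w.Write(size[:])
    w.WriteString(s)
}

// readString reads a single length-prefixed string.
func readString(r io.Reader) (string, error) {
    var size [8]byte
    if _, err := io.ReadFull(r, size[:]); err != nil {
        return "", err
    }
    b := make([]byte, binary.BigEndian.Uint64(size[:]))
    if _, err := io.ReadFull(r, b); err != nil {
        return "", err
    }
    return string(b), nil
}

func main() {
    var conn bytes.Buffer

    // "vmstorage" side: the stored label names include a broken empty name.
    for _, name := range []string{"__name__", "", "ifIndex", "subsys"} {
        writeString(&conn, name)
    }
    writeString(&conn, "") // intended end-of-response marker

    // "vmselect" side: read label names until the empty-string marker.
    var labels []string
    for {
        s, err := readString(&conn)
        if err != nil || s == "" {
            break // the empty label name is mistaken for the end of the response
        }
        labels = append(labels, s)
    }
    fmt.Println("labels read:", labels)                         // only [__name__]
    fmt.Println("unread bytes left in the buffer:", conn.Len()) // leftover data breaks subsequent requests
}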

@ptimofee
Author

ptimofee commented Sep 1, 2023

OK, we found the problematic label name.
Somehow an empty string was inserted into the TSDB as a label name. This is the root cause.

I have no idea how this could happen. I'm looking for a way to insert a metric with an empty string as a label name.

@f41gh7
Contributor

f41gh7 commented Sep 1, 2023

Thanks for the investigation, I'll take a look at it.

@f41gh7
Contributor

f41gh7 commented Sep 1, 2023

It's a global vmcluster RPC issue. vmstorage uses an empty string "" as the delimiter for the end of the request.

If for some reason an empty string is written during the request (e.g. label name == ""), it leaves data inside the connection buffer, and that data will be read by the next request.

Such behavior breaks the current request and any other requests until the data is read from the buffer.
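
A minimal sketch of the fix direction on the vmstorage side (illustrative only; the writeString helper and wire format are hypothetical, not the actual vmselectapi code): drop empty label names before writing, so an empty string can only ever mean "end of response".

package main

import (
    "bytes"
    "encoding/binary"
    "fmt"
)

// writeString writes a length-prefixed string (hypothetical wire format,
// not the actual vmselectapi encoding).
func writeString(w *bytes.Buffer, s string) {
    var size [8]byte
    binary.BigEndian.PutUint64(size[:], uint64(len(s)))
    w.Write(size[:])
    w.WriteString(s)
}

// writeLabelNames sends label names followed by the end-of-response marker,
// skipping empty names so the marker can never appear mid-stream.
func writeLabelNames(conn *bytes.Buffer, labelNames []string) {
    for _, name := range labelNames {
        if name == "" {
            continue // would otherwise be read as the end-of-response marker
        }
        writeString(conn, name)
    }
    writeString(conn, "") // the real end-of-response marker
}

func main() {
    var conn bytes.Buffer
    writeLabelNames(&conn, []string{"__name__", "", "job"}) // the "" entry is skipped
    fmt.Println("bytes written:", conn.Len())
}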

FYI @valyala @hagen1778

f41gh7 added a commit that referenced this issue Sep 1, 2023
it breaks cluster communication, since vmselect incorrectly reads request buffer, leaving unread data on it
#4932
@f41gh7
Contributor

f41gh7 commented Sep 1, 2023

Hello @ptimofee, can you try to build vmstorage from commit 1544d67 and test the API behavior?

Build instructions: https://docs.victoriametrics.com/Cluster-VictoriaMetrics.html#building-from-sources

@ptimofee
Author

ptimofee commented Sep 1, 2023

@f41gh7 thanks a lot, I highly appreciate such a fast reply!
That's exactly what we did to read all the data.
I'm sure it works, but let me check it again just in case.

@ptimofee
Author

ptimofee commented Sep 1, 2023

It works fine!

@ptimofee
Author

ptimofee commented Sep 1, 2023

I have a few notes regarding your fix here:

  • the comment says "Skip empty label names, since they have no sense for prometheus". It refers to Prometheus, however the problem is not related to Prometheus; it's a general problem breaking vmselect - vmstorage communication. I'd say the comment is a bit misleading.
  • the fix is for vmstorage, in case such data was somehow already inserted. But what about a vminsert fix to prevent the insertion of such data in the first place?

@ptimofee
Author

ptimofee commented Sep 1, 2023

I found a way to reproduce this, i.e. how to insert data which breaks vmselect - vmstorage communication:

Create a simple web server, like nginx, which returns one static file.
Put this into the file:

echo -ne 'metric_name{label1="val1",\0="val2"} 1' > static_file

Then ask vmagent to scrape this web server and send the data to vminsert via api/v1/write.
This NUL byte (\0) as a label name will break the vmcluster.
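
For a throwaway test environment, a tiny Go scrape target equivalent to that nginx static file might look like this (the port and handler path are arbitrary; don't point a real vmagent at it unless you want to break the tenant):

package main

import (
    "log"
    "net/http"
)

func main() {
    http.HandleFunc("/metrics", func(w http.ResponseWriter, r *http.Request) {
        // Same payload as: echo -ne 'metric_name{label1="val1",\0="val2"} 1'
        // The second label name is a single NUL byte.
        w.Write([]byte("metric_name{label1=\"val1\",\x00=\"val2\"} 1"))
    })
    log.Fatal(http.ListenAndServe(":9999", nil))
}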

This gives me an idea of how we could have got this inserted.
Somehow vmagent scraped some binary output. It could have happened while we were reconfiguring its scrapes and pointed it at some "bad" endpoint.
Another idea is SNMP. This tenant has a bunch of different PDUs and network devices which get added/removed from time to time. SNMP has been known for bad/broken/weird implementations since the beginning, so this data could have come from some device.

I have just verified that the usePromCompatibleNaming option set on vminsert mitigates the issue for new inserts. Instead of an empty label name you get just an underscore ('_'), as expected.
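
A rough sketch of the kind of sanitization we observe with usePromCompatibleNaming (illustrative only, not the actual vminsert code; the exact allowed character set below is an assumption): characters outside the Prometheus label-name alphabet are replaced with '_', so the NUL-byte label name shows up as '_' instead of an empty name.

package main

import (
    "fmt"
    "regexp"
)

// Characters outside the usual Prometheus label-name alphabet
// (assumption: the real flag may differ in details).
var unsupportedChars = regexp.MustCompile(`[^a-zA-Z0-9_]`)

func sanitizeLabelName(name string) string {
    if name == "" {
        return "_" // never let an empty label name through (assumption)
    }
    return unsupportedChars.ReplaceAllString(name, "_")
}

func main() {
    fmt.Printf("%q\n", sanitizeLabelName("\x00"))    // "_"
    fmt.Printf("%q\n", sanitizeLabelName("ifIndex")) // "ifIndex"
}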

valyala added a commit that referenced this issue Sep 1, 2023
…#4936)

* lib/vmselectapi: do not send empty label names for labelNames request
it breaks cluster communication, since vmselect incorrectly reads request buffer, leaving unread data on it
#4932

* typo fix

* wip

---------

Co-authored-by: Aliaksandr Valialkin <valyala@victoriametrics.com>
@valyala
Collaborator

valyala commented Sep 2, 2023

The issue with broken /api/v1/labels when VictoriaMetrics contains an empty label name has been fixed in v1.93.3.

the fix is for vmstorage in case the data was inserted somehow. But what about vminsert fix to prevent insertion of such data?

Yes, I think VictoriaMetrics should reject inserting samples with empty label names, since such label names aren't used in practice. Let's track this in a separate issue, while closing this one.

@valyala valyala closed this as completed Sep 2, 2023
@ptimofee
Author

ptimofee commented Sep 2, 2023

@valyala thanks a lot! Am I supposed to create a separate issue for rejecting the insertion of empty label names? Or has somebody already done it?

@ptimofee
Author

ptimofee commented Sep 5, 2023

@valyala @f41gh7 now I know the root cause of these weird labels being inserted into the DB.
We got the same problem for other tenants once we updated vmagent from v1.91.3 to v1.93.1 (or to v1.93.3, not fully sure).
Now we have random values and random labels attached to random metrics.
It looks like all metrics coming through vmagent are now randomly mixed up, and I have a feeling that remote_write or relabeling broke between those releases. This BUGFIX looks suspicious.

We rolled back vmagent and the problem disappeared.

I'll create a separate issue for vmagent.
