
[vmselect] Add a capability to set search.denyPartialResponse per storage groups #5883

Sinketsu opened this issue Feb 27, 2024 · 8 comments

@Sinketsu

Is your feature request related to a problem? Please describe

Hello!
We have several vmstorage clusters in different availability zones (AZ). We also have one vmselect above them.
We would like to do the following - when one vmstorage node in one AZ is unavailable, ignore responses from that AZ (because we do not use replication).

Now there is a solution to our problem through raising additional vmselect in each AZ and one above them (multi-level vmselect). However, it doesn't give us any advantages beyond that, but it consumes a lot of resources (for additional instances).
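For context, the multi-level vmselect workaround mentioned above looks roughly like this (a sketch only; host names and ports are placeholders, and `-clusternativeListenAddr` is the flag that lets a vmselect serve the clusternative protocol to an upper-level vmselect):

```shell
# AZ-level vmselect: queries the local vmstorage nodes and exposes
# the clusternative protocol for the top-level vmselect.
/path/to/vmselect \
  -storageNode=vmstorage-az1-1:8401,vmstorage-az1-2:8401 \
  -clusternativeListenAddr=:8401

# Top-level vmselect: treats the per-AZ vmselects as storage nodes
# and merges their responses.
/path/to/vmselect \
  -storageNode=vmselect-az1:8401,vmselect-az2:8401
```

This is what makes the workaround expensive: every query fans out through two layers of vmselect instances.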

Describe the solution you'd like

We saw that vmselect already knows how to divide hosts into groups, but it does this only for replication. It would be great to support setting -search.denyPartialResponse per group.
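If I recall the existing group syntax correctly, the request could be sketched like this (group and host names are placeholders; the last flag is hypothetical, it is exactly what this issue proposes and does not exist today):

```shell
# Existing: vmstorage nodes split into named groups, with
# per-group replication factors.
/path/to/vmselect \
  -storageNode=az1/vmstorage-1:8401,az1/vmstorage-2:8401 \
  -storageNode=az2/vmstorage-3:8401,az2/vmstorage-4:8401 \
  -replicationFactor=az1:1,az2:1

# Proposed (hypothetical flag syntax): deny partial responses
# per group instead of globally.
/path/to/vmselect ... -search.denyPartialResponse=az1:true,az2:true
```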

Thanks!

Describe alternatives you've considered

No response

Additional information

No response

@Sinketsu Sinketsu added the enhancement New feature or request label Feb 27, 2024
@Haleygo
Collaborator

Haleygo commented Feb 28, 2024

Hello, and thank you for the issue!

We have several vmstorage clusters in different availability zones (AZ).

You mean you have the exact same data in each vmstorage cluster, right? Are you planning to achieve something like this?
[attached image: diagram of a single vmselect querying multiple vmstorage pools]

Each storage pool contains complete data, and vmselect returns the result from the fastest storage pool with complete data. vmselect drops the result from a storage pool when -search.denyPartialResponse=true and more than replicationFactor-1 vmstorage nodes in that pool are unavailable during querying.

@Sinketsu
Author

You mean you have the exact same data in each vmstorage cluster, right? Are you planning to achieve something like this?

Yes, we have the exact same data in each vmstorage pool.

Each storage pool contains complete data, and vmselect returns the result from the fastest storage pool with complete data. vmselect drops the result from a storage pool when -search.denyPartialResponse=true and more than replicationFactor-1 vmstorage nodes in that pool are unavailable during querying.

Yes

@Haleygo
Collaborator

Haleygo commented Feb 29, 2024

In that case, I would recommend running a separate vmselect in each AZ and using vmauth as the query endpoint; see VictoriaMetrics/helm-charts#789 (comment) for reference.
The main benefits:

  1. vmstorage and vmselect nodes within one AZ can easily be managed as one vmcluster, and the components in each AZ can be configured and upgraded independently;
  2. reduced cross-AZ network usage: vmauth prefers the "local" vmselect first and uses the other vmselects only if the "local" one is unavailable;
  3. reduced resource usage: the global vmselect described here always spreads query requests across all vmstorage nodes, causing higher total resource usage and slower responses.
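A minimal vmauth setup for this scheme might look like the following (a sketch under my assumptions: addresses are placeholders, and `load_balancing_policy: first_available` is the vmauth option that prefers the first reachable backend instead of spreading load):

```shell
# Write a vmauth config (heredoc used for illustration): route
# queries to the per-AZ vmselects, preferring the first one that
# responds and falling back to the next on failure.
cat > /etc/vmauth/config.yml <<'EOF'
unauthorized_user:
  url_prefix:
    - http://vmselect-az1:8481/select/0/prometheus/
    - http://vmselect-az2:8481/select/0/prometheus/
  load_balancing_policy: first_available
EOF

/path/to/vmauth -auth.config=/etc/vmauth/config.yml
```

To get the "prefer local AZ" behavior, each AZ would run its own vmauth with its local vmselect listed first.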

@Haleygo Haleygo added question The question issue and removed enhancement New feature or request labels Feb 29, 2024
@Haleygo Haleygo self-assigned this Feb 29, 2024
@Sinketsu
Author

Sinketsu commented Feb 29, 2024

In that case, I would recommend running a separate vmselect in each AZ and using vmauth as the query endpoint; see VictoriaMetrics/helm-charts#789 (comment) for reference. The main benefits:

  1. vmstorage and vmselect nodes within one AZ can easily be managed as one vmcluster, and the components in each AZ can be configured and upgraded independently;
  2. reduced cross-AZ network usage: vmauth prefers the "local" vmselect first and uses the other vmselects only if the "local" one is unavailable;
  3. reduced resource usage: the global vmselect described here always spreads query requests across all vmstorage nodes, causing higher total resource usage and slower responses.

OK, thanks a lot for the idea, I will try it under our load!

@Sinketsu
Author

Sinketsu commented Mar 7, 2024

Thanks for the advice! We tried this setup on our side and it worked. It also reduced the overall load on the cluster.

I'm closing the issue.

@Sinketsu Sinketsu closed this as completed Mar 7, 2024
@Sinketsu
Author

Sinketsu commented Mar 7, 2024

But no, I was in a hurry, sorry.

We have a problem when one AZ becomes temporarily unavailable. After it is restored, it is missing data for the duration of the outage.
Meanwhile, queries are still routed to it, and we serve incorrect data. Can you please give us any advice for this case?

Thanks!

@Sinketsu Sinketsu reopened this Mar 7, 2024
@Haleygo
Collaborator

Haleygo commented Mar 11, 2024

But no, I was in a hurry, sorry.

We have a problem when one AZ becomes temporarily unavailable. After it is restored, it is missing data for the duration of the outage. Meanwhile, queries are still routed to it, and we serve incorrect data. Can you please give us any advice for this case?

Thanks!

When one of the AZs is down for a short period of time, a client like vmagent stores the pending data on its disk, waits for the remote storage to become available again, and resends it, so there is no data loss.
But if recovering the AZ takes too long, or the client doesn't have enough disk space to store the pending data, there will be data loss. In that case, if you have another healthy AZ that contains the data, you can use vmctl to restore it from there; see https://docs.victoriametrics.com/vmctl/#migrating-data-from-victoriametrics for details.
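The restore step with vmctl's native migration mode might look like this (a sketch: addresses and the time range are placeholders for the outage window, and the source/destination would be the healthy and the recovered AZ respectively):

```shell
# Copy the missing time range from a healthy AZ into the
# recovered AZ using vmctl's vm-native mode.
/path/to/vmctl vm-native \
  --vm-native-src-addr=http://vmselect-az2:8481/select/0/prometheus \
  --vm-native-dst-addr=http://vminsert-az1:8480/insert/0/prometheus \
  --vm-native-filter-time-start='2024-03-07T10:00:00Z' \
  --vm-native-filter-time-end='2024-03-07T12:00:00Z'
```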

@Sinketsu
Author

When one of the AZs is down for a short period of time, a client like vmagent stores the pending data on its disk, waits for the remote storage to become available again, and resends it, so there is no data loss.

We don't have enough disk space to queue the data temporarily.
But even if we did, we would still need to automatically close the AZ for reads at some point (and reopen it later) so that recording and alerting rules don't run on incomplete data.

But if recovering the AZ takes too long, or the client doesn't have enough disk space to store the pending data, there will be data loss. In that case, if you have another healthy AZ that contains the data, you can use vmctl to restore it from there; see https://docs.victoriametrics.com/vmctl/#migrating-data-from-victoriametrics for details.

Yes, we thought about it, but it has some disadvantages:

  • We need to automatically detect the period when one AZ has more data than another, and close the corresponding vmauth in that AZ
  • We need to automatically start the recovery process (migrate data via vmctl)
  • We need to automatically reopen vmauth in that AZ after the migration is complete

It would be nice if a tool appeared that could merge (and deduplicate) results from multiple vmselects in different AZs. vmselect's multi-level scheme does something like this today, but it consumes a lot of resources because all the raw data is merged only at the top level.
