Add a new chart for multi-AZ setup #789

Open
Haleygo opened this issue Dec 11, 2023 · 12 comments
Labels
enhancement New feature or request

Comments

@Haleygo
Contributor

Haleygo commented Dec 11, 2023

For now, we provide charts like victoria-metrics-cluster and victoria-metrics-k8s-stack to create a customized vmcluster instance with other optional components like vmalert. But there are cases where users want to set up multiple availability zones with one release; a new chart like victoria-metrics-cluster-distributed could help.

Describe alternatives you've considered

Install the victoria-metrics-k8s-stack chart per availability zone, with different affinity settings for the components.
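For illustration, a rough sketch of what this per-zone alternative could look like as a values override for one zone. The key paths assume the victoria-metrics-k8s-stack layout where `vmcluster.spec` is passed through to the VMCluster resource; they should be verified against the chart's values.yaml, and the zone label value is a placeholder:

```yaml
# values-zone-a.yaml - one victoria-metrics-k8s-stack release per availability zone
# (hypothetical file name; the zone value is a placeholder)
vmcluster:
  spec:
    vmstorage:
      nodeSelector:
        topology.kubernetes.io/zone: us-east-1a   # pin storage pods to this zone
    vminsert:
      nodeSelector:
        topology.kubernetes.io/zone: us-east-1a
    vmselect:
      nodeSelector:
        topology.kubernetes.io/zone: us-east-1a
```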

Additional information

mimir-distributed supports zoneAwareReplication.

@Haleygo
Contributor Author

Haleygo commented Dec 11, 2023

cc @f41gh7 @Amper @hagen1778

Haleygo added the enhancement (New feature or request) label Dec 11, 2023
@hagen1778
Contributor

Thanks for creating the issue @Haleygo!
I think we need to agree on the VM configuration for a multi-AZ setup. We have a related doc here, but it only explains the general direction for implementing this. We need to carefully define the ingestion and read paths, and the components and settings involved.

@tenmozes
Collaborator

tenmozes commented Jan 8, 2024

What I typically recommend to people for a multi-AZ setup:

  1. Use two AZs. Sometimes people want to use three AZs because of their experience running other distributed applications; for VM this is not the case.
  2. Set up a cluster or a single node per AZ, put vmagent in front of them and multiplex the data between the AZs.

There are three ways you can do that (all of them have their own pros and cons).

Solution A - simple option

  1. vmagent in front of the two AZs
  2. vminsert + vmstorage per AZ
  3. vmselect knows about ALL storages in each AZ
[diagram]

note: the replication factor in this case should be 1 (flags sketched below)

pros

  • simple schema

cons

  • may be costly because of cross-AZ traffic on both the vmagent and the vmselect side
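To make Solution A concrete, a minimal sketch of the flag wiring with placeholder host names; the YAML keys here are only labels for grouping and not part of any chart:

```yaml
# Solution A wiring sketch (placeholder addresses; top-level keys are illustrative only).
# vmagent replicates every sample to each -remoteWrite.url, i.e. to the vminsert of both AZs:
vmagent:
  args:
    - -remoteWrite.url=http://vminsert-az-a:8480/insert/0/prometheus/api/v1/write
    - -remoteWrite.url=http://vminsert-az-b:8480/insert/0/prometheus/api/v1/write

# Each vminsert only knows the vmstorage nodes of its own AZ, with replication factor 1:
vminsert-az-a:
  args:
    - -storageNode=vmstorage-az-a-0:8400,vmstorage-az-a-1:8400
    - -replicationFactor=1

# vmselect queries the vmstorage nodes of both AZs:
vmselect:
  args:
    - -storageNode=vmstorage-az-a-0:8401,vmstorage-az-a-1:8401
    - -storageNode=vmstorage-az-b-0:8401,vmstorage-az-b-1:8401
```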

Solution B - complex but more cost-saving. Data writing: here I cover only ingestion; reading can be taken from Solution A or C.

  1. Set up vminsert + vmstorage per AZ and expose two write endpoints. The application/agent must write data into both endpoints (Prometheus, vmagent and client libraries support that) - see the config sketch below.
[diagram]

pros

  • it should be less costly in terms of cross-AZ traffic; you delegate this responsibility to the writer, so it may still be expensive on the writer's side

cons

  • more complex setup; you must be sure that every producer writes into both endpoints (one per AZ)
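As an example of Solution B, a producer such as Prometheus carries the fan-out itself with two remote_write entries (placeholder URLs); vmagent can be configured equivalently with two -remoteWrite.url flags:

```yaml
# prometheus.yml fragment: write the same data to both AZ endpoints (placeholder URLs)
remote_write:
  - url: https://vm-az-a.example.com/insert/0/prometheus/api/v1/write
  - url: https://vm-az-b.example.com/insert/0/prometheus/api/v1/write
```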

Solution C - complex, but more control over reading. Here I cover only reading; data writing can be taken from Solution A or B.

  1. Set up vmselect in each AZ, each with its own list of storage nodes
  2. Set up a load balancer (vmauth, nginx, ...) in front of the per-AZ vmselects,
    OR
  3. Set up a top-level vmselect (multi-level vmselect) that reads from the per-AZ vmselects (see the sketch below)
[diagram]

pros

  • more granular control over how the data is read; may result in cost savings if each query reads data from only one AZ

cons

  • harder to implement
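A sketch of the multi-level vmselect variant from option 3, with placeholder addresses; the flags follow the VictoriaMetrics multi-level cluster setup and should be double-checked against the docs for the version in use:

```yaml
# Per-AZ vmselect exposes a cluster-native port so a top-level vmselect can query it
# like a storage node (placeholder addresses; top-level keys are illustrative only):
vmselect-az-a:
  args:
    - -storageNode=vmstorage-az-a-0:8401,vmstorage-az-a-1:8401
    - -clusternativeListenAddr=:8401

# Top-level vmselect treats the per-AZ vmselects as its "storage nodes":
vmselect-global:
  args:
    - -storageNode=vmselect-az-a:8401,vmselect-az-b:8401
```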

I recommend implementing Solution A.
Side note: I don't recommend a multi-level vminsert setup, because in that case you work with one big cluster instead of two independent ones and you lose all the advantages of a multi-AZ setup.

@hagen1778
Contributor

hagen1778 commented Jan 10, 2024

Hi! I think option 3 is the most valuable one. But I'd propose to tweak it a bit. See picture below:
[diagram]

  • this topology should provide single URLs for writes and reads
  • per-zone URLs for reads, if the user wants to run a per-zone Grafana datasource
  • alerting should be out of scope for this helm chart; it is up to the user to decide where to deploy alerting, and they can use the global URLs for it
  • there should be an option to disable a specified AZ for reads and/or writes for maintenance purposes (sketched below)
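For illustration, the single-URL entry point and the maintenance switch could look roughly like this in vmauth configuration terms. Addresses are placeholders and the option names, in particular load_balancing_policy, should be verified against the vmauth docs for the version in use:

```yaml
# Global vmauth sketch: one write URL and one read URL for the whole installation.
# Removing a zone's URL from a list takes that zone out of rotation for maintenance.
unauthorized_user:
  load_balancing_policy: first_available   # stick to the first healthy backend
  url_map:
    # writes go to vmagent, which replicates the data to both zones
    - src_paths:
        - "/api/v1/write"
      url_prefix: "http://vmagent:8429/"
    # reads prefer the first listed zone and fall back to the other one on failure
    - src_paths:
        - "/api/v1/query.*"
        - "/api/v1/series.*"
        - "/api/v1/label.*"
      url_prefix:
        - "http://vmauth-az-a:8427/"
        - "http://vmauth-az-b:8427/"
```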

@chenlujjj
Contributor

Hi, I'm wondering whether data consistency between the two AZs should be considered in this setup.

Data ingestion cannot be 100% identical for the two clusters. Imagine a scenario like this: the first query goes to AZ 1, then the second query goes to AZ 2; if there is ingestion lag in AZ 2, the user may find that the second query lacks part of the first query's result. It's confusing and can also cause false alerts.

@Haleygo
Contributor Author

Haleygo commented Apr 8, 2024

Hi @chenlujjj ,

the first query goes to AZ 1, then the second query goes to AZ 2; if there is ingestion lag in AZ 2, the user may find that the second query lacks part of the first query's result.

This doesn't only exist in a multi-AZ setup; it can happen in any single vmcluster. If you have ingestion lag, let's say 1 minute, caused by slow ingestion or a slow network, vmalert, whose queries are subject to -search.latencyOffset (default 30s) on vmselect, will always get incomplete results and could produce false alerts, see this doc for details.
The best way is to fix the cause of the ingestion lag, which could be a lack of resources, a slow disk or a slow network. If the lag can't be fixed, increase -search.latencyOffset on vmselect to ensure the data is complete by the time it is queried.
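For example, in the victoria-metrics-cluster chart the offset could presumably be raised via vmselect extraArgs; the exact key path may differ between charts and versions:

```yaml
# values.yaml fragment: give slow ingestion more headroom before data is considered queryable
vmselect:
  extraArgs:
    search.latencyOffset: 2m   # default is 30s; 2m is an arbitrary example value
```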

The same goes here: since vmagent sends identical data to multiple zones and vmselect has -search.latencyOffset (default 30s), if there is no bottleneck in resources or network bandwidth, the query result must be the same no matter which vmcluster is serving it. But if the lag between zones can't be eliminated, you can also use the second vmauth (vmauth hot-standby in this diagram) as your query endpoint, since it always prefers the "local" vmcluster and only switches to the other vmcluster when the "local" one is down.
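A minimal sketch of that per-zone hot-standby vmauth, here the one deployed in AZ a, with placeholder addresses; the mirror-image config in AZ b would list its own vmselect first, and the option names should be verified against the vmauth docs:

```yaml
# vmauth-az-a config sketch: prefer the zone-local vmcluster, fail over to the other zone
unauthorized_user:
  load_balancing_policy: first_available   # use the first healthy URL in the list
  url_prefix:
    - "http://vmselect-az-a:8481/select/0/prometheus/"   # local zone, preferred
    - "http://vmselect-az-b:8481/select/0/prometheus/"   # remote zone, fallback only
```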

@chenlujjj
Contributor

@Haleygo Thanks for the detailed explanation

@hagen1778
Contributor

the first query goes to AZ 1, then the second query goes to AZ 2; if there is ingestion lag in AZ 2, the user may find that the second query lacks part of the first query's result

@chenlujjj it is expected that all queries will stick to one AZ (see vmauth hot-standby mode) and will switch to a different AZ only if the first AZ responds with an unrecoverable error (partial result or service unavailability). When you stick to the same datasource, there will be no discrepancy for the user. A discrepancy might appear in the case of a fallback to another AZ, but it is expected that by the time this happens, the system administrator will have detected and fixed the "ingestion lag in AZ 2" problem.
If AZ1 fails and AZ2 has a constant ingestion delay, then the whole system is unreliable and there are bigger problems than a discrepancy between identical queries.

@J-Owens

J-Owens commented Apr 11, 2024

it is expected that all queries will stick to one AZ (see vmauth hot-standby mode) and will switch to a different AZ only if the first AZ responds with an unrecoverable error (partial result or service unavailability).

This is not the case in the reference diagram; there is a vmauth hot-standby in both AZs and the load balancer is balancing queries across both AZs.

So the scenario outlined, where queries are initially routed to separate AZs, is not solved: even with the vmauth hot-standby sticking to the local cluster, that preference only applies after the initial routing to one of the AZs.

If we were to route all queries to one AZ only and fall back to the other AZ, it would defeat the point of the hot-standby vmauths. So outside of fixing the root cause of the lag or adjusting parameters, as per @Haleygo, I don't think there is another solution?

@hagen1778
Contributor

hagen1778 commented Apr 12, 2024

This is not the case in the reference diagram; there is a vmauth hot-standby in both AZs and the load balancer is balancing queries across both AZs.

Right. The balancer is there to provide a single endpoint for accessing the service. The diagram doesn't mention it, but it is expected that this balancer will be configured by the user according to their priorities and environment. With two AZs it is always better to stick to one of them until it fails. The sticking logic could be based on geospatial info (using the geographically closest AZ), for example.
For any system it doesn't make much sense to balance requests across all available AZs for the following reasons:

  1. There will always be a difference in data freshness just because of the distance between AZs
  2. Hitting different AZs for read queries won't utilize caches efficiently
  3. Geographical or infrastructure differences between AZs may introduce other factors such as latency.

@J-Owens

J-Owens commented Apr 12, 2024

Thank you for the detailed response @hagen1778!

In that case, a couple of questions:

  1. Is that for both insertion and querying?
  2. Should the standby AZ be sitting at minimum capacity for vmselect nodes and then scale up if the switch is made? vminsert and vmstorage nodes would still need to be at equal capacity in both AZs, as insertions are all replicated.

So instead of two clusters that are balanced, it's two clusters where one is a backup, but insertions are replicated to both, so if the main cluster fails then insertions and queries switch to the backup AZ, vmstorage nodes scale up, and all of the data is already there due to the replication?

@hagen1778
Contributor

Is that for both insertion and querying?

It is for querying. For insertion, latency, freshness, etc. don't matter that much.

Should the standby AZ should be sitting at minimum capacity for vmselect nodes and then scale up if the switch is made?

This makes sense. HPA should work pretty well here for vmselects.
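A minimal HorizontalPodAutoscaler sketch for the standby zone's vmselect pool; the resource names are placeholders and the target kind depends on how vmselect is deployed in your setup:

```yaml
# HPA sketch for the vmselect pool in the standby zone (placeholder names)
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: vmselect-az-b
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment        # or StatefulSet, depending on how vmselect is deployed
    name: vmselect-az-b
  minReplicas: 1            # keep the standby zone at minimum capacity
  maxReplicas: 8            # scale out when queries are switched over
  metrics:
    - type: Resource
      resource:
        name: cpu
        target:
          type: Utilization
          averageUtilization: 70
```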

So instead of two clusters that are balanced, it's two clusters where one is a backup, but insertions are replicated to both, so if the main cluster fails then insertions and queries switch to the backup AZ, vmstorage nodes scale up, and all of the data is already there due to the replication?

Precisely. If we imagine our two AZs are in different US states, or one in the US and another in the EU - it makes sense to read from the closest one. Btw, if you also have two teams using the service, one in the US and a second in the EU - both of them could read data from the closest AZ. And both of them would have consistent data for subsequent requests.
