Add a new chart for multi-AZ setup #789

Open
Haleygo opened this issue Dec 11, 2023 · 12 comments
Labels
enhancement New feature or request

Comments

@Haleygo
Contributor

Haleygo commented Dec 11, 2023

For now, we provide charts like victoria-metrics-cluster and victoria-metrics-k8s-stack to create a customized vmcluster instance with other optional components like vmalert. But there are cases where users want to set up multiple availability zones with one release; a new chart like victoria-metrics-cluster-distributed could help.

Describe alternatives you've considered

Install the victoria-metrics-k8s-stack chart per availability zone, with different affinity settings for the components.
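For illustration, a rough sketch of what this per-zone alternative could look like as a values override for one zone. The key paths assume the victoria-metrics-k8s-stack layout where `vmcluster.spec` is passed through to the VMCluster resource; they should be verified against the chart's values.yaml, and the zone label value is a placeholder:

```yaml
# values-zone-a.yaml - one victoria-metrics-k8s-stack release per availability zone
# (hypothetical file name; the zone value is a placeholder)
vmcluster:
  spec:
    vmstorage:
      nodeSelector:
        topology.kubernetes.io/zone: us-east-1a   # pin storage pods to this zone
    vminsert:
      nodeSelector:
        topology.kubernetes.io/zone: us-east-1a
    vmselect:
      nodeSelector:
        topology.kubernetes.io/zone: us-east-1a
```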

Additional information

mimir-distributed supports zoneAwareReplication.

@Haleygo
Contributor Author

Haleygo commented Dec 11, 2023

cc @f41gh7 @Amper @hagen1778

Haleygo added the enhancement (New feature or request) label Dec 11, 2023
@hagen1778
Contributor

Thanks for creating the issue @Haleygo!
I think we need to agree on the VM configuration for a multi-AZ setup. We have a related doc here, but it only explains the general direction for implementing this. We need to carefully define the ingestion and read paths, and the components and settings involved.

@tenmozes
Collaborator

tenmozes commented Jan 8, 2024

What I typically recommend to people for a multi-AZ setup:

  1. Use two AZs. Sometimes people want to use three AZs because of their experience running other distributed applications; for VM this is not the case.
  2. Set up a cluster or a single node per AZ, put vmagent in front of them and multiplex the data between the AZs.

There are three ways you can do that (all of them have their own pros and cons).

Solution A - simple option

  1. vmagent in front of the two AZs
  2. vminsert + vmstorage per AZ
  3. vmselect knows about ALL storages in each AZ
[diagram]

note: the replication factor in this case should be 1 (flags sketched below)

pros

  • simple schema

cons

  • may be costly because of cross-AZ traffic on both the vmagent and the vmselect side
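To make Solution A concrete, a minimal sketch of the flag wiring with placeholder host names; the YAML keys here are only labels for grouping and not part of any chart:

```yaml
# Solution A wiring sketch (placeholder addresses; top-level keys are illustrative only).
# vmagent replicates every sample to each -remoteWrite.url, i.e. to the vminsert of both AZs:
vmagent:
  args:
    - -remoteWrite.url=http://vminsert-az-a:8480/insert/0/prometheus/api/v1/write
    - -remoteWrite.url=http://vminsert-az-b:8480/insert/0/prometheus/api/v1/write

# Each vminsert only knows the vmstorage nodes of its own AZ, with replication factor 1:
vminsert-az-a:
  args:
    - -storageNode=vmstorage-az-a-0:8400,vmstorage-az-a-1:8400
    - -replicationFactor=1

# vmselect queries the vmstorage nodes of both AZs:
vmselect:
  args:
    - -storageNode=vmstorage-az-a-0:8401,vmstorage-az-a-1:8401
    - -storageNode=vmstorage-az-b-0:8401,vmstorage-az-b-1:8401
```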

Solution B - complex but more cost-saving. Data writing: here I cover only ingestion; reading can be taken from Solution A or C.

  1. Set up vminsert + vmstorage per AZ and expose two write endpoints. The application/agent must write data into both endpoints (Prometheus, vmagent and client libraries support that) - see the config sketch below.
[diagram]

pros

  • it should be less costly in terms of cross-AZ traffic; you delegate this responsibility to the writer, so it may still be expensive on the writer's side

cons

  • more complex setup; you must be sure that every producer writes into both endpoints (one per AZ)
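As an example of Solution B, a producer such as Prometheus carries the fan-out itself with two remote_write entries (placeholder URLs); vmagent can be configured equivalently with two -remoteWrite.url flags:

```yaml
# prometheus.yml fragment: write the same data to both AZ endpoints (placeholder URLs)
remote_write:
  - url: https://vm-az-a.example.com/insert/0/prometheus/api/v1/write
  - url: https://vm-az-b.example.com/insert/0/prometheus/api/v1/write
```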

Solution C - complex, but more control over reading. Here I cover only reading; data writing can be taken from Solution A or B.

  1. Set up vmselect in each AZ, each with its own list of storage nodes
  2. Set up a load balancer (vmauth, nginx, ...) in front of the per-AZ vmselects,
    OR
  3. Set up a top-level vmselect (multi-level vmselect) that reads from the per-AZ vmselects (see the sketch below)
[diagram]

pros

  • more granular control over how the data is read; may result in cost savings if each query reads data from only one AZ

cons

  • harder to implement
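A sketch of the multi-level vmselect variant from option 3, with placeholder addresses; the flags follow the VictoriaMetrics multi-level cluster setup and should be double-checked against the docs for the version in use:

```yaml
# Per-AZ vmselect exposes a cluster-native port so a top-level vmselect can query it
# like a storage node (placeholder addresses; top-level keys are illustrative only):
vmselect-az-a:
  args:
    - -storageNode=vmstorage-az-a-0:8401,vmstorage-az-a-1:8401
    - -clusternativeListenAddr=:8401

# Top-level vmselect treats the per-AZ vmselects as its "storage nodes":
vmselect-global:
  args:
    - -storageNode=vmselect-az-a:8401,vmselect-az-b:8401
```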

I recommend implementing Solution A.
Side note: I don't recommend a multi-level vminsert setup, because in that case you work with one big cluster instead of two independent ones and you lose all the advantages of a multi-AZ setup.

@hagen1778
Contributor

hagen1778 commented Jan 10, 2024

Hi! I think option 3 is the most valuable one. But I'd propose to tweak it a bit. See picture below:
[diagram]

  • this topology should provide single URLs for writes and reads
  • per-zone URLs for reads, if the user wants to run a per-zone Grafana datasource
  • alerting should be out of scope for this helm chart; it is up to the user to decide where to deploy alerting, and they can use the global URLs for it
  • there should be an option to disable a specified AZ for reads and/or writes for maintenance purposes (sketched below)
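For illustration, the single-URL entry point and the maintenance switch could look roughly like this in vmauth configuration terms. Addresses are placeholders and the option names, in particular load_balancing_policy, should be verified against the vmauth docs for the version in use:

```yaml
# Global vmauth sketch: one write URL and one read URL for the whole installation.
# Removing a zone's URL from a list takes that zone out of rotation for maintenance.
unauthorized_user:
  load_balancing_policy: first_available   # stick to the first healthy backend
  url_map:
    # writes go to vmagent, which replicates the data to both zones
    - src_paths:
        - "/api/v1/write"
      url_prefix: "http://vmagent:8429/"
    # reads prefer the first listed zone and fall back to the other one on failure
    - src_paths:
        - "/api/v1/query.*"
        - "/api/v1/series.*"
        - "/api/v1/label.*"
      url_prefix:
        - "http://vmauth-az-a:8427/"
        - "http://vmauth-az-b:8427/"
```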

@chenlujjj
Contributor

Hi, I'm wondering whether data consistency between the two AZs should be considered in this setup.

Data ingestion cannot be 100% identical for the two clusters. Imagine a scenario like this: the first query goes to AZ 1, then the second query goes to AZ 2; if there is ingestion lag in AZ 2, the user may find that the second query lacks part of the first query's result. It's confusing and can also cause false alerts.

@Haleygo
Contributor Author

Haleygo commented Apr 8, 2024

Hi @chenlujjj ,

the first query goes to AZ 1, then the second query goes to AZ 2; if there is ingestion lag in AZ 2, the user may find that the second query lacks part of the first query's result.

This doesn't only exist in a multi-AZ setup; it can happen in any single vmcluster. If you have ingestion lag, let's say 1 minute, caused by slow ingestion or a slow network, vmalert, whose queries are subject to -search.latencyOffset (default 30s) on vmselect, will always get incomplete results and could produce false alerts, see this doc for details.
The best way is to fix the cause of the ingestion lag, which could be a lack of resources, a slow disk or a slow network. If the lag can't be fixed, increase -search.latencyOffset on vmselect to ensure the data is complete by the time it is queried.
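For example, in the victoria-metrics-cluster chart the offset could presumably be raised via vmselect extraArgs; the exact key path may differ between charts and versions:

```yaml
# values.yaml fragment: give slow ingestion more headroom before data is considered queryable
vmselect:
  extraArgs:
    search.latencyOffset: 2m   # default is 30s; 2m is an arbitrary example value
```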

The same goes here: since vmagent sends identical data to multiple zones and vmselect has -search.latencyOffset (default 30s), if there is no bottleneck in resources or network bandwidth, the query result must be the same no matter which vmcluster is serving it. But if the lag between zones can't be eliminated, you can also use the second vmauth (vmauth hot-standby in this diagram) as your query endpoint, since it always prefers the "local" vmcluster and only switches to the other vmcluster when the "local" one is down.
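A minimal sketch of that per-zone hot-standby vmauth, here the one deployed in AZ a, with placeholder addresses; the mirror-image config in AZ b would list its own vmselect first, and the option names should be verified against the vmauth docs:

```yaml
# vmauth-az-a config sketch: prefer the zone-local vmcluster, fail over to the other zone
unauthorized_user:
  load_balancing_policy: first_available   # use the first healthy URL in the list
  url_prefix:
    - "http://vmselect-az-a:8481/select/0/prometheus/"   # local zone, preferred
    - "http://vmselect-az-b:8481/select/0/prometheus/"   # remote zone, fallback only
```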

@chenlujjj
Contributor

@Haleygo Thanks for the detailed explanation

@hagen1778
Contributor

the first query goes to AZ 1, then the second query goes to AZ 2; if there is ingestion lag in AZ 2, the user may find that the second query lacks part of the first query's result

@chenlujjj it is expected that all queries will stick to one AZ (see vmauth hot-standby mode) and will switch to a different AZ only if the first AZ responds with an unrecoverable error (partial result or service unavailability). When you stick to the same datasource, there will be no discrepancy for the user. A discrepancy might appear in the case of a fallback to another AZ, but it is expected that by the time this happens, the system administrator will have detected and fixed the "ingestion lag in AZ 2" problem.
If AZ1 fails and AZ2 has a constant ingestion delay, then the whole system is unreliable and there are bigger problems than a discrepancy between identical queries.

@J-Owens

J-Owens commented Apr 11, 2024

it is expected that all queries will stick to one AZ (see vmauth hot-standby mode) and will switch to a different AZ only if the first AZ responds with an unrecoverable error (partial result or service unavailability).

This is not the case in the reference diagram; there is a vmauth hot-standby in both AZs and the load balancer is balancing queries across both AZs.

So the scenario outlined, where queries are initially routed to separate AZs, is not solved: even with the vmauth hot-standby sticking to the local cluster, that preference only applies after the initial routing to one of the AZs.

If we were to route all queries to one AZ only and fall back to the other AZ, it would defeat the point of the hot-standby vmauths. So outside of fixing the root cause of the lag or adjusting parameters, as per @Haleygo, I don't think there is another solution?

@hagen1778
Contributor

hagen1778 commented Apr 12, 2024

This is not the case in the reference diagram; there is a vmauth hot-standby in both AZs and the load balancer is balancing queries across both AZs.

Right. The balancer is there to provide a single endpoint for accessing the service. The diagram doesn't mention it, but it is expected that this balancer will be configured by the user according to their priorities and environment. With two AZs it is always better to stick to one of them until it fails. The sticking logic could be based on geospatial info (using the geographically closest AZ), for example.
For any system it doesn't make much sense to balance requests across all available AZs for the following reasons:

  1. There will always be a difference in data freshness just because of the distance between AZs
  2. Hitting different AZs for read queries won't utilize caches efficiently
  3. Geographical or infrastructure differences between AZs may introduce other factors such as latency.

@J-Owens

J-Owens commented Apr 12, 2024

Thank you for the detailed response @hagen1778!

In that case, a couple of questions:

  1. Is that for both insertion and querying?
  2. Should the standby AZ be sitting at minimum capacity for vmselect nodes and then scale up if the switch is made? vminsert and vmstorage nodes would still need to be at equal capacity in both AZs, as insertions are all replicated.

So instead of two clusters that are balanced, it's two clusters where one is a backup, but insertions are replicated to both, so if the main cluster fails then insertions and queries switch to the backup AZ, vmstorage nodes scale up, and all of the data is already there due to the replication?

@hagen1778
Contributor

Is that for both insertion and querying?

It is for querying. For insertion, latency, freshness, etc. don't matter that much.

Should the standby AZ should be sitting at minimum capacity for vmselect nodes and then scale up if the switch is made?

This makes sense. HPA should work pretty well here for vmselects.
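A minimal HorizontalPodAutoscaler sketch for the standby zone's vmselect pool; the resource names are placeholders and the target kind depends on how vmselect is deployed in your setup:

```yaml
# HPA sketch for the vmselect pool in the standby zone (placeholder names)
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: vmselect-az-b
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment        # or StatefulSet, depending on how vmselect is deployed
    name: vmselect-az-b
  minReplicas: 1            # keep the standby zone at minimum capacity
  maxReplicas: 8            # scale out when queries are switched over
  metrics:
    - type: Resource
      resource:
        name: cpu
        target:
          type: Utilization
          averageUtilization: 70
```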

So instead of two clusters that are balanced, it's two clusters where one is a backup, but insertions are replicated to both, so if the main cluster fails then insertions and queries switch to the backup AZ, vmstorage nodes scale up, and all of the data is already there due to the replication?

Precisely. If we imagine our two AZs are in different US states, or one in the US and another in the EU - it makes sense to read from the closest one. Btw, if you also have two teams using the service, one in the US and a second in the EU - both of them could read data from the closest AZ. And both of them would have consistent data for subsequent requests.
