vmselect: query over multiple availability zones #4792

Closed
hagen1778 opened this issue Aug 7, 2023 · 8 comments
Labels: enhancement (New feature or request), question (The question issue)

Comments

hagen1778 (Collaborator) commented Aug 7, 2023

Is your question request related to a specific component?

vmselect

Describe the question in detail

It is a common approach to run a separate VictoriaMetrics cluster in each availability zone (AZ) for reliability purposes. Usually, each AZ contains identical data, ensuring that if one AZ fails, another AZ continues returning complete results.

The recommended multi-AZ setup assumes that the user has more than one AZ and that each AZ hosts a cluster of vmstorage and vminsert nodes. The data stored in each AZ is supposed to be identical. For querying the data, the user is expected to choose one of the following options:

  1. Run one or more vmselects closer to the user. Each vmselect is configured with the vmstorage nodes from all AZs. This way, each request to vmselect fetches data from all vmstorage nodes in all AZs, merges it and returns the result to the user. Deduplication on vmselect ensures the user gets only one copy of the data.
  2. Run a multi-level vmselect setup, which does a similar job to option 1 but is simpler from an architectural PoV (see the sketch after this list).
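
For illustration, a rough sketch of option 2, assuming the multi-level cluster setup described in the VictoriaMetrics cluster docs (hostnames and ports are hypothetical; the lower-level vmselect nodes must run with -clusternativeListenAddr enabled):

# per-AZ (lower-level) vmselect, configured with that AZ's vmstorage nodes
# and accepting cluster-native requests from the top-level vmselect:
/vmselect-prod \
    -storageNode=vmstorage-az1-1:8401 -storageNode=vmstorage-az1-2:8401 \
    -clusternativeListenAddr=:8401

# global (top-level) vmselect, configured with the per-AZ vmselects as "storage" nodes:
/vmselect-prod \
    -storageNode=vmselect-az1:8401 -storageNode=vmselect-az2:8401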

Both cases protect the user if one of the AZs becomes unreachable or returns partial results. But each of them suffers from the increased latency introduced by the slowest AZ, since vmselect waits for responses from all vmstorage nodes in all AZs.

One way to solve this would be to make vmselect smarter by introducing storage groups. With this additional logic, vmselect could query only the fastest AZ and fall back to other options if the fastest AZ is unreachable or returns incomplete data.

However, such an approach could also be implemented without adding extra logic to vmselect. Using nginx or any other reverse proxy would suffice. For example, HTTP load balancing in nginx already provides load-balancing methods such as least_conn and least_time, as well as automatic failover to the next upstream on errors.

The proposed setup has two VM clusters, one in zone A and one in zone B. These are independent clusters, but with identical data.
To read from both clusters we use two layers of nginx (a rough config sketch for the 1st layer follows the list):

  1. The 2nd-layer nginx is dedicated to a specific zone and balances the load using the LeastConnections method among the vmselects of that zone's VM cluster. If one or more vmstorage nodes in the cluster fails to respond, vmselect returns an error to the 1st-layer nginx due to the -search.denyPartialResponse setting.
  2. The 1st-layer nginx balances the load using the LeastTime method to choose the fastest (and likely the closest) zone (any other balancing method can be used). If a 2nd-layer nginx returns an error (due to a partial response) or becomes unreachable, the other zone is queried instead.
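
A minimal 1st-layer nginx sketch of this idea (hostnames and ports are hypothetical; the least_time method is available in NGINX Plus, while open-source nginx can still provide the failover part via proxy_next_upstream):

upstream vm_zones {
    least_time header;                 # prefer the fastest (likely closest) zone
    server nginx-zone-a.internal:8427; # 2nd-layer nginx in zone A
    server nginx-zone-b.internal:8427; # 2nd-layer nginx in zone B
}

server {
    listen 8427;
    location / {
        proxy_pass http://vm_zones;
        # retry the other zone on errors produced by -search.denyPartialResponse
        proxy_next_upstream error timeout http_500 http_502 http_503;
    }
}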
hagen1778 added the 'question' label Aug 7, 2023
valyala added the 'enhancement' label Aug 8, 2023
valyala (Collaborator) commented Aug 11, 2023

The nginx can be replaced with vmauth in the scheme above - it supports load balancing among the specified backends. It also simplifies configuring the needed authorization, routing, filtering and concurrency limiting for incoming requests compared to nginx. For example, the following -auth.config can be used at the top level of vmauth for spreading the incoming requests among availability zones:

unauthorized_user:
  url_prefix:
  - http://vmauth-zone-A/?deny_partial_response=1
  - http://vmauth-zone-B/?deny_partial_response=1

This config adds the deny_partial_response=1 query arg to all the requests sent to the lower-level per-zone vmauth services, in order to guarantee that they return either full responses or an error if some of the vmstorage nodes in the zone are temporarily unavailable. See these docs for details.

The lower level of vmauth nodes can have the following config for spreading incoming requests among per-zone vmselect nodes:

unauthorized_user:
  url_map:
  - src_paths:
    - /api/v1/query
    - /api/v1/query_range
    url_prefix:
    - http://vmselect-1:8481/select/0/prometheus
    # ...
    - http://vmselect-N:8481/select/0/prometheus

valyala added a commit that referenced this issue Sep 7, 2023
…d and some of vmstorage nodes are temporarily unavailable

This should help detect this case and automatically retry the query at a healthy cluster replica
in another availability zone.

This commit is needed as a preparation for automatic query retry at another backend at vmauth on 5xx errors
as described at #4792 (comment)
valyala added a commit that referenced this issue Sep 7, 2023
…odes

This should allow implementing high availability scheme described at #4792 (comment)

See also #4893
valyala (Collaborator) commented Nov 1, 2023

The related issue - #5197

valyala added a commit that referenced this issue Dec 8, 2023
…load balancing policy

vmauth in `hot standby` mode sends requests to the first url_prefix while it is available.
If the first url_prefix becomes unavailable, then vmauth falls back to the next url_prefix.
This allows building highly available setup as described at https://docs.victoriametrics.com/vmauth.html#high-availability

Updates #4893
Updates #4792
valyala (Collaborator) commented Dec 8, 2023

FYI, the upcoming release of vmauth will provide the ability to send all requests to the closest AZ and to fall back to other AZs only when the closest AZ is unavailable. For example, the following -auth.config instructs vmauth to send requests to https://vmselect-az1/ while it is available and can return a full response. It falls back to https://vmselect-az2/ if https://vmselect-az1/ isn't available or cannot return a full response:

unauthorized_user:
  url_prefix:
  - 'https://vmselect-az1/?deny_partial_response=1'
  - 'https://vmselect-az2/?deny_partial_response=1'
  retry_status_codes: [500, 502, 503]
  load_balancing_policy: first_available

vmselect responds with the 503 Service Unavailable status code when it cannot produce a full response because some of the vmstorage nodes are temporarily unavailable and the deny_partial_response query arg is present in the query. See these docs for information about the deny_partial_response query arg. In this case vmauth retries the request at https://vmselect-az2/, because retry_status_codes lists the 503 status code.

This allows building highly available setups as described in these docs.

vmselect nodes in every AZ can be hidden behind an additional vmauth with the following config, which evenly distributes incoming requests among the available vmselect nodes:

unauthorized_user:
  url_prefix:
  - http://vmselect-1:8481/
  # ...
  - http://vmselect-N:8481/

See these docs for details on how vmauth balances load among the configured url_prefix entries.

This functionality can be tested by building vmauth from the commit 0422675 according to these docs.
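
A sketch of those build steps, assuming the standard build procedure from the VictoriaMetrics repository (the vmauth make target and the bin/ output path come from the repo's Makefile):

git clone https://github.com/VictoriaMetrics/VictoriaMetrics
cd VictoriaMetrics
git checkout 0422675
make vmauth   # the resulting binary is placed under bin/vmauth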

valyala added a commit that referenced this issue Dec 8, 2023
Link to the related issue - #4792
Fix heading for `Modifying HTTP headers` chapter at docs/vmagent.md
ivankovnatsky commented Dec 11, 2023

FYI, the upcoming release of vmauth will provide the ability to send all the requests to the closest AZ and to fall back to other AZs only when the closest AZ is unavailable

But what if vmauth runs in az2 and gets a full response from the az1 vmselect endpoint? That could not be counted as the closest one, I presume.

I understand that not everyone runs their VM clusters on k8s, yet if they do, I think using topology-aware routing could be a pretty decent solution for ensuring we send requests to a same-AZ endpoint. [1]

Update: tried it, and it does not seem to work at all: kubernetes/kubernetes#121516.

References:

[1] https://kubernetes.io/docs/concepts/services-networking/topology-aware-routing/#three-or-more-endpoints-per-zone
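
For reference, a hypothetical Service manifest enabling topology-aware routing (the service.kubernetes.io/topology-mode annotation requires Kubernetes 1.27+; older versions use the topology-aware-hints annotation instead):

apiVersion: v1
kind: Service
metadata:
  name: vmselect   # hypothetical Service name
  annotations:
    service.kubernetes.io/topology-mode: Auto   # keep traffic in the client's zone when possible
spec:
  selector:
    app: vmselect
  ports:
  - port: 8481
    targetPort: 8481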


We are very interested in something similar yet a bit different: we want to make sure we only use the same-AZ endpoint, but for vmstorage, as the ingestion traffic is by far the biggest contributor to our cross-AZ costs. We are currently running a single node, but are thinking about moving to the cluster version for HA.

While writing this comment I checked the docs, and it seems vminsert spreads the traffic "evenly" between the vmstorage nodes. I fetched the entrypoint of one of the vminsert replicas we're running for tests, and it shows:

/vminsert-prod \
    --storageNode=vmcluster-victoria-metrics-cluster-vmstorage-0.vmcluster-victoria-metrics-cluster-vmstorage.vmcluster.svc.cluster.local:8400 \
    --storageNode=vmcluster-victoria-metrics-cluster-vmstorage-1.vmcluster-victoria-metrics-cluster-vmstorage.vmcluster.svc.cluster.local:8400 \
    --envflag.enable=true --envflag.prefix=VM_ --loggerFormat=json \
    --maxLabelsPerTimeseries=50

It hardly seems we could make it balance by AZ, at least with this configuration. Also, it seems like pushing only to the same AZ would defeat the HA concept here.

What do you think?

valyala (Collaborator) commented Dec 13, 2023

But what if vmauth runs in az2 and gets a full response from the az1 vmselect endpoint? That could not be counted as the closest one, I presume.

vmauth doesn't detect the closest backend - there is no magic here. It just proxies requests to the first backend in the url_prefix list while this backend is available, if load_balancing_policy is set to first_available, as described in these docs. It starts proxying requests to the next backend in the url_prefix list when the first backend becomes unavailable. This means that you must run distinct vmauth instances in each AZ, each with a different order of backends in the url_prefix list.
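
For example, the vmauth instance running in AZ2 would use the same config as above but with its local zone listed first (hostnames are hypothetical):

unauthorized_user:
  url_prefix:
  - 'https://vmselect-az2/?deny_partial_response=1'
  - 'https://vmselect-az1/?deny_partial_response=1'
  retry_status_codes: [500, 502, 503]
  load_balancing_policy: first_available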

we want to make sure we only use same AZ endpoint, but for vmstorage, as the ingesting traffic is by far the biggest contributor to our cross-AZ costs

It isn't recommended to spread vmstorage nodes of a single cluster across multiple AZs, since this may result in reduced data ingestion performance because of bigger network latencies between AZs. This also may result in higher costs, since cross-AZ traffic is usually billed. It is recommended to run all the vmstorage nodes for a single VictoriaMetrics cluster in the same AZ with low network latencies. Read more details in these docs.

it seems like pushing to same AZ would defeat the HA concept here.

You need to replicate incoming data among multiple completely independent VictoriaMetrics clusters located in different AZs in order to achieve real HA. If a single AZ becomes unavailable, then all the data remains available for querying in the other AZs, while new data continues flowing into those AZs. The data can be replicated among multiple AZs with vmagent as described here and here.

valyala (Collaborator) commented Dec 13, 2023

The -loadBalancingPolicy command-line flag and the load_balancing_policy option are available in vmauth starting from v1.96.0. See these docs for details. Closing this feature request, since vmauth now allows building HA setups across multiple backends with the same data, as described here.

valyala closed this as completed Dec 13, 2023
ivankovnatsky commented Dec 13, 2023

Sorry to post on a closed issue, but while the context here is fresh I wanted to clarify. Do I understand correctly that, practically, with this setup:

unauthorized_user:
  url_prefix:
  - 'https://vmselect-az1/?deny_partial_response=1'
  - 'https://vmselect-az2/?deny_partial_response=1'
  retry_status_codes: [500, 502, 503]
  load_balancing_policy: first_available

I can basically run two single-node servers in different zones, each pushing metrics only to its own AZ (I will need to configure dynamic write URLs for services/vmagent in the different AZs)? Meaning every VM instance will only have metrics from the services in its own AZ.

valyala (Collaborator) commented Dec 13, 2023

@ivankovnatsky, you need to write the same data to the VictoriaMetrics instances in every AZ. This can be done by specifying multiple -remoteWrite.url command-line options at vmagent, so it sends the same data to all the configured remote storage instances. See these docs. See also the high availability docs for single-node VictoriaMetrics.
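
A minimal sketch of such a vmagent setup, assuming two single-node VictoriaMetrics instances in different AZs (hostnames are hypothetical):

# vmagent replicates all collected samples to every configured -remoteWrite.url
/vmagent-prod \
    -remoteWrite.url=http://victoria-metrics-az1:8428/api/v1/write \
    -remoteWrite.url=http://victoria-metrics-az2:8428/api/v1/write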
