vmselect: query over multiple availability zones #4792

Closed
hagen1778 opened this issue Aug 7, 2023 · 8 comments
Labels: enhancement (New feature or request), question (The question issue)

Comments

hagen1778 (Collaborator) commented Aug 7, 2023

Is your question request related to a specific component?

vmselect

Describe the question in detail

It is a common approach to run a separate VictoriaMetrics cluster in each availability zone (AZ) for reliability purposes. Usually, each AZ contains identical data, ensuring that if one AZ fails, another AZ continues returning complete results.

The recommended multi-AZ setup assumes that the user has more than one AZ and that each AZ hosts a cluster of vmstorage and vminsert nodes. The data stored in each AZ is supposed to be identical. For querying the data, the user is expected to choose one of the following options:

  1. Run one or more vmselects closer to the user. Each vmselect is configured with the vmstorage nodes from all AZs. This way, each request to vmselect fetches data from all vmstorage nodes in all AZs, merges it and returns the result to the user. Deduplication on vmselect ensures the user gets only one copy of the data.
  2. Run a multi-level vmselect setup, which does a similar job to option 1 but is simpler from an architectural PoV (see the sketch after this list).
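
For illustration, a rough sketch of option 2, assuming the multi-level cluster setup described in the VictoriaMetrics cluster docs (hostnames and ports are hypothetical; the lower-level vmselect nodes must run with -clusternativeListenAddr enabled):

# per-AZ (lower-level) vmselect, configured with that AZ's vmstorage nodes
# and accepting cluster-native requests from the top-level vmselect:
/vmselect-prod \
    -storageNode=vmstorage-az1-1:8401 -storageNode=vmstorage-az1-2:8401 \
    -clusternativeListenAddr=:8401

# global (top-level) vmselect, configured with the per-AZ vmselects as "storage" nodes:
/vmselect-prod \
    -storageNode=vmselect-az1:8401 -storageNode=vmselect-az2:8401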

Both cases protect the user if one of the AZs becomes unreachable or returns partial results. But each of them suffers from the increased latency introduced by the slowest AZ, since vmselect waits for responses from all vmstorage nodes in all AZs.

One way to solve this would be to make vmselect smarter by introducing storage groups. With this additional logic, vmselect could query only the fastest AZ and fall back to other options if the fastest AZ is unreachable or returns incomplete data.

However, such an approach could also be implemented without adding extra logic to vmselect. Using nginx or any other reverse proxy would suffice. For example, HTTP load balancing in nginx already provides load-balancing methods such as least_conn and least_time, as well as automatic failover to the next upstream on errors.

The proposed setup has two VM clusters, one in zone A and one in zone B. These are independent clusters, but with identical data.
To read from both clusters we use two layers of nginx (a rough config sketch for the 1st layer follows the list):

  1. The 2nd-layer nginx is dedicated to a specific zone and balances the load using the LeastConnections method among the vmselects of that zone's VM cluster. If one or more vmstorage nodes in the cluster fails to respond, vmselect returns an error to the 1st-layer nginx due to the -search.denyPartialResponse setting.
  2. The 1st-layer nginx balances the load using the LeastTime method to choose the fastest (and likely the closest) zone (any other balancing method can be used). If a 2nd-layer nginx returns an error (due to a partial response) or becomes unreachable, the other zone is queried instead.
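
A minimal 1st-layer nginx sketch of this idea (hostnames and ports are hypothetical; the least_time method is available in NGINX Plus, while open-source nginx can still provide the failover part via proxy_next_upstream):

upstream vm_zones {
    least_time header;                 # prefer the fastest (likely closest) zone
    server nginx-zone-a.internal:8427; # 2nd-layer nginx in zone A
    server nginx-zone-b.internal:8427; # 2nd-layer nginx in zone B
}

server {
    listen 8427;
    location / {
        proxy_pass http://vm_zones;
        # retry the other zone on errors produced by -search.denyPartialResponse
        proxy_next_upstream error timeout http_500 http_502 http_503;
    }
}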
hagen1778 added the 'question' label Aug 7, 2023
valyala added the 'enhancement' label Aug 8, 2023
valyala (Collaborator) commented Aug 11, 2023

The nginx can be replaced with vmauth in the scheme above - it supports load balancing among the specified backends. It also simplifies configuring the needed authorization, routing, filtering and concurrency limiting for incoming requests compared to nginx. For example, the following -auth.config can be used at the top level of vmauth for spreading the incoming requests among availability zones:

unauthorized_user:
  url_prefix:
  - http://vmauth-zone-A/?deny_partial_response=1
  - http://vmauth-zone-B/?deny_partial_response=1

This config adds the deny_partial_response=1 query arg to all the requests sent to the lower-level per-zone vmauth services, in order to guarantee that they return either full responses or an error if some of the vmstorage nodes in the zone are temporarily unavailable. See these docs for details.

The lower level of vmauth nodes can have the following config for spreading incoming requests among per-zone vmselect nodes:

unauthorized_user:
  url_map:
  - src_paths:
    - /api/v1/query
    - /api/v1/query_range
    url_prefix:
    - http://vmselect-1:8481/select/0/prometheus
    # ...
    - http://vmselect-N:8481/select/0/prometheus

valyala added a commit that referenced this issue Sep 7, 2023
…d and some of vmstorage nodes are temporarily unavailable

This should help detect this case and automatically retry the query at a healthy cluster replica
in another availability zone.

This commit is needed as a preparation for automatic query retry at another backend at vmauth on 5xx errors
as described at #4792 (comment)
valyala added a commit that referenced this issue Sep 7, 2023
…odes

This should allow implementing high availability scheme described at #4792 (comment)

See also #4893
valyala (Collaborator) commented Nov 1, 2023

The related issue - #5197

valyala added a commit that referenced this issue Dec 8, 2023
…load balancing policy

vmauth in `hot standby` mode sends requests to the first url_prefix while it is available.
If the first url_prefix becomes unavailable, then vmauth falls back to the next url_prefix.
This allows building highly available setup as described at https://docs.victoriametrics.com/vmauth.html#high-availability

Updates #4893
Updates #4792
valyala (Collaborator) commented Dec 8, 2023

FYI, the upcoming release of vmauth will provide the ability to send all requests to the closest AZ and to fall back to other AZs only when the closest AZ is unavailable. For example, the following -auth.config instructs vmauth to send requests to https://vmselect-az1/ while it is available and can return a full response. It falls back to https://vmselect-az2/ if https://vmselect-az1/ isn't available or cannot return a full response:

unauthorized_user:
  url_prefix:
  - 'https://vmselect-az1/?deny_partial_response=1'
  - 'https://vmselect-az2/?deny_partial_response=1'
  retry_status_codes: [500, 502, 503]
  load_balancing_policy: first_available

vmselect responds with the 503 Service Unavailable status code when it cannot produce a full response because some of the vmstorage nodes are temporarily unavailable and the deny_partial_response query arg is present in the query. See these docs for information about the deny_partial_response query arg. In this case vmauth retries the request at https://vmselect-az2/, because retry_status_codes lists the 503 status code.

This allows building highly available setups as described in these docs.

vmselect nodes in every AZ can be hidden behind an additional vmauth with the following config, which evenly distributes incoming requests among the available vmselect nodes:

unauthorized_user:
  url_prefix:
  - http://vmselect-1:8481/
  # ...
  - http://vmselect-N:8481/

See these docs for details on how vmauth balances load among the configured url_prefix entries.

This functionality can be tested by building vmauth from the commit 0422675 according to these docs.
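
A sketch of those build steps, assuming the standard build procedure from the VictoriaMetrics repository (the vmauth make target and the bin/ output path come from the repo's Makefile):

git clone https://github.com/VictoriaMetrics/VictoriaMetrics
cd VictoriaMetrics
git checkout 0422675
make vmauth   # the resulting binary is placed under bin/vmauth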

valyala added a commit that referenced this issue Dec 8, 2023
Link to the related issue - #4792
Fix heading for `Modifying HTTP headers` chapter at docs/vmagent.md
ivankovnatsky commented Dec 11, 2023

FYI, the upcoming release of vmauth will provide the ability to send all the requests to the closest AZ and to fall back to other AZs only when the closest AZ is unavailable

But what if vmauth runs in az2 and gets a full response from the az1 vmselect endpoint? That could not be counted as the closest one, I presume.

I understand that not everyone runs their VM clusters on k8s, yet if they do, I think using topology-aware routing could be a pretty decent solution for ensuring we send requests to a same-AZ endpoint. [1]

Update: tried it, and it does not seem to work at all: kubernetes/kubernetes#121516.

References:

[1] https://kubernetes.io/docs/concepts/services-networking/topology-aware-routing/#three-or-more-endpoints-per-zone
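
For reference, a hypothetical Service manifest enabling topology-aware routing (the service.kubernetes.io/topology-mode annotation requires Kubernetes 1.27+; older versions use the topology-aware-hints annotation instead):

apiVersion: v1
kind: Service
metadata:
  name: vmselect   # hypothetical Service name
  annotations:
    service.kubernetes.io/topology-mode: Auto   # keep traffic in the client's zone when possible
spec:
  selector:
    app: vmselect
  ports:
  - port: 8481
    targetPort: 8481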


We are very interested in something similar yet a bit different: we want to make sure we only use the same-AZ endpoint, but for vmstorage, as the ingestion traffic is by far the biggest contributor to our cross-AZ costs. We are currently running a single node, but are thinking about moving to the cluster version for HA.

While writing this comment I checked the docs, and it seems vminsert spreads the traffic "evenly" between the vmstorage nodes. I fetched the entrypoint of one of the vminsert replicas we're running for tests, and it shows:

/vminsert-prod \
    --storageNode=vmcluster-victoria-metrics-cluster-vmstorage-0.vmcluster-victoria-metrics-cluster-vmstorage.vmcluster.svc.cluster.local:8400 \
    --storageNode=vmcluster-victoria-metrics-cluster-vmstorage-1.vmcluster-victoria-metrics-cluster-vmstorage.vmcluster.svc.cluster.local:8400 \
    --envflag.enable=true --envflag.prefix=VM_ --loggerFormat=json \
    --maxLabelsPerTimeseries=50

It hardly seems we could make it balance by AZ, at least with this configuration. Also, it seems like pushing only to the same AZ would defeat the HA concept here.

What do you think?

valyala (Collaborator) commented Dec 13, 2023

But what if vmauth runs in az2 and gets a full response from the az1 vmselect endpoint? That could not be counted as the closest one, I presume.

vmauth doesn't detect the closest backend - there is no magic here. It just proxies requests to the first backend in the url_prefix list while this backend is available, if load_balancing_policy is set to first_available, as described in these docs. It starts proxying requests to the next backend in the url_prefix list when the first backend becomes unavailable. This means that you must run distinct vmauth instances in each AZ, each with a different order of backends in the url_prefix list.
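
For example, the vmauth instance running in AZ2 would use the same config as above but with its local zone listed first (hostnames are hypothetical):

unauthorized_user:
  url_prefix:
  - 'https://vmselect-az2/?deny_partial_response=1'
  - 'https://vmselect-az1/?deny_partial_response=1'
  retry_status_codes: [500, 502, 503]
  load_balancing_policy: first_available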

we want to make sure we only use same AZ endpoint, but for vmstorage, as the ingesting traffic is by far the biggest contributor to our cross-AZ costs

It isn't recommended to spread vmstorage nodes of a single cluster across multiple AZs, since this may result in reduced data ingestion performance because of bigger network latencies between AZs. This also may result in higher costs, since cross-AZ traffic is usually billed. It is recommended to run all the vmstorage nodes for a single VictoriaMetrics cluster in the same AZ with low network latencies. Read more details in these docs.

it seems like pushing to same AZ would defeat the HA concept here.

You need to replicate incoming data among multiple completely independent VictoriaMetrics clusters located in different AZs in order to achieve real HA. If a single AZ becomes unavailable, then all the data remains available for querying in the other AZs, while new data continues flowing into those AZs. The data can be replicated among multiple AZs with vmagent as described here and here.

valyala (Collaborator) commented Dec 13, 2023

The -loadBalancingPolicy command-line flag and the load_balancing_policy option are available in vmauth starting from v1.96.0. See these docs for details. Closing this feature request, since vmauth now allows building HA setups across multiple backends with the same data, as described here.

valyala closed this as completed Dec 13, 2023
ivankovnatsky commented Dec 13, 2023

Sorry to post on a closed issue, but while the context here is fresh I wanted to clarify. Do I understand correctly that, practically, with this setup:

unauthorized_user:
  url_prefix:
  - 'https://vmselect-az1/?deny_partial_response=1'
  - 'https://vmselect-az2/?deny_partial_response=1'
  retry_status_codes: [500, 502, 503]
  load_balancing_policy: first_available

I can basically run two single-node servers in different zones, each pushing metrics only to its own AZ (I will need to configure dynamic write URLs for services/vmagent in the different AZs)? Meaning every VM instance will only have metrics from the services in its own AZ.

valyala (Collaborator) commented Dec 13, 2023

@ivankovnatsky, you need to write the same data to the VictoriaMetrics instances in every AZ. This can be done by specifying multiple -remoteWrite.url command-line options at vmagent, so it sends the same data to all the configured remote storage instances. See these docs. See also the high availability docs for single-node VictoriaMetrics.
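
A minimal sketch of such a vmagent setup, assuming two single-node VictoriaMetrics instances in different AZs (hostnames are hypothetical):

# vmagent replicates all collected samples to every configured -remoteWrite.url
/vmagent-prod \
    -remoteWrite.url=http://victoria-metrics-az1:8428/api/v1/write \
    -remoteWrite.url=http://victoria-metrics-az2:8428/api/v1/write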
