Add ability to set different timing offsets for scraping the same targets across distinct vmagent clusters #2679

@valyala

The issue

It is possible to shard scrape targets among multiple vmagent instances according to these docs. Sometimes multiple clusters of vmagent instances must be set up to scrape the same targets. For instance, the clusters may be located in different availability zones (AZs) for high availability: if a single AZ becomes unavailable, the vmagent cluster in the remaining AZ continues scraping targets and sending the scraped data to a centralized VictoriaMetrics. Deduplication must be configured in the centralized VictoriaMetrics in order to remove duplicate samples scraped by vmagent instances in distinct AZs, e.g. -dedup.minScrapeInterval must be set to the interval between samples (aka scrape_interval).

Unfortunately, deduplication in such a multi-AZ setup doesn't work as expected. It leaves random samples scraped by vmagent instances from both AZs, while the expected behavior is to consistently keep the samples scraped for a given target by a vmagent in a single AZ under normal conditions (i.e. unless that AZ is unavailable). This behavior is explained as follows:

  • A given scrape target is scraped at the same time offset by vmagent instances in both AZs. This means that the timestamps of the samples collected by vmagent instances in both AZs are identical with millisecond precision.
  • VictoriaMetrics stores sample timestamps with millisecond precision, so samples received from both AZs for the same time series have identical timestamps.
  • Samples for the same series received from distinct AZs may have different values because of network and physical timing jitter during scraping. This is especially true for metrics that change at high frequency (e.g. more than 1000 times per second).
  • Deduplication in VictoriaMetrics keeps a random sample out of multiple samples for the same time series if they have identical timestamps with millisecond precision.
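The four points above can be illustrated with a small simulation. It is a sketch under stated assumptions, not VictoriaMetrics code: it assumes deduplication keeps the sample with the biggest timestamp in each discrete -dedup.minScrapeInterval bucket and breaks ties between identical timestamps at random, as described above. The Sample type, its az field and the helper names are hypothetical, and the offsets are arbitrary example values.

```python
import random
from dataclasses import dataclass


@dataclass(frozen=True)
class Sample:
    timestamp_ms: int  # millisecond precision, as VictoriaMetrics stores it
    value: float
    az: str  # hypothetical label recording which vmagent cluster sent the sample


def dedup(samples, min_scrape_interval_ms, rng):
    """Keep one sample per discrete min_scrape_interval_ms-wide bucket.

    Simplified model (an assumption, not VictoriaMetrics' source): the sample
    with the biggest timestamp in a bucket wins, and samples that tie on that
    timestamp are resolved by a random choice.
    """
    buckets = {}
    for s in samples:
        buckets.setdefault(s.timestamp_ms // min_scrape_interval_ms, []).append(s)
    kept = []
    for key in sorted(buckets):
        group = buckets[key]
        latest = max(s.timestamp_ms for s in group)
        ties = [s for s in group if s.timestamp_ms == latest]
        kept.append(rng.choice(ties))
    return kept


def scrape_both_azs(offset_az1_ms, offset_az2_ms, interval_ms, n):
    """Simulate n scrapes of one target by both AZs.

    The values differ slightly between AZs, standing in for scrape jitter.
    """
    samples = []
    for i in range(n):
        base = i * interval_ms
        samples.append(Sample(base + offset_az1_ms, 100.0 + i, "AZ1"))
        samples.append(Sample(base + offset_az2_ms, 100.5 + i, "AZ2"))
    return samples


INTERVAL_MS = 10_000
rng = random.Random(42)

# Same offset in both AZs: every bucket holds two samples with identical
# timestamps, so the surviving AZ changes randomly from bucket to bucket.
same = dedup(scrape_both_azs(1_234, 1_234, INTERVAL_MS, 100), INTERVAL_MS, rng)
print("same offset:", sorted({s.az for s in same}))

# Distinct offsets: timestamps never tie, so the same AZ wins every bucket.
distinct = dedup(scrape_both_azs(1_234, 6_789, INTERVAL_MS, 100), INTERVAL_MS, rng)
print("distinct offsets:", sorted({s.az for s in distinct}))
```

With identical offsets the kept samples come from both AZs, mixing their slightly different values into one series. With distinct offsets every bucket keeps the AZ2 sample, because its timestamp is always the latest one in the bucket.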

The solution

Introduce a -promscrape.cluster.name command-line flag. The vmagent instances in each AZ must be configured with a value of this flag that is distinct from the other AZs. For example, all the vmagent instances in AZ1 must run with -promscrape.cluster.name=AZ1, while all the vmagent instances in AZ2 must run with -promscrape.cluster.name=AZ2. vmagent transforms this value into a deterministic time offset in the range (0..scrape_interval), which it applies to all the targets it scrapes. As a result, the same target is scraped at different times by vmagent instances located in different AZs, so deduplication consistently keeps the per-target samples scraped by the vmagent located in a single AZ.
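One way to turn the cluster name into such an offset is to hash it. The sketch below is hypothetical: the function names, the choice of SHA-256 and the reduction modulo scrape_interval are assumptions for illustration, not the actual vmagent implementation. It maps a cluster name to an offset in [0, scrape_interval) and computes the next scrape time aligned to that offset.

```python
import hashlib


def cluster_offset_ms(cluster_name: str, scrape_interval_ms: int) -> int:
    """Map a cluster name to a deterministic offset in [0, scrape_interval_ms).

    Hypothetical sketch: SHA-256 is chosen only because it gives the same
    result in every process (Python's built-in hash() of str is randomized
    per process); vmagent's real hashing scheme may differ.
    """
    digest = hashlib.sha256(cluster_name.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % scrape_interval_ms


def next_scrape_time_ms(now_ms: int, offset_ms: int, scrape_interval_ms: int) -> int:
    """Return the earliest time >= now_ms that is congruent to offset_ms
    modulo scrape_interval_ms, i.e. the next slot this cluster scrapes at."""
    return now_ms + (offset_ms - now_ms) % scrape_interval_ms


interval_ms = 10_000
for name in ("AZ1", "AZ2"):
    offset = cluster_offset_ms(name, interval_ms)
    print(name, offset, next_scrape_time_ms(1_700_000_000_000, offset, interval_ms))
```

Hashing keeps the offset stable across restarts without any coordination between AZs. The trade-off is that two cluster names can, with a probability of roughly 1/scrape_interval_ms in this sketch, map to the same offset, in which case the timestamps would collide again.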
