vmalert: plotting a recorded metric along with the original in Grafana requires an offset for them to match #1232

Closed
raags opened this issue Apr 19, 2021 · 11 comments
Labels: bug, vmalert

raags commented Apr 19, 2021

Describe the bug
The screenshot below shows the difference between the original query and the recording rule. The graphs do not match unless an offset equal to the scrape interval is applied.

[screenshot: original query vs. recording rule, misaligned by one scrape interval]

The following parameters have been explored:

  • -search.latencyOffset=15s (victoriametrics) - ensures the recording rule doesn't miss any data points (due to delay in scraping)
  • -datasource.queryStep=15s (vmalert) - ensures the query step matches the Grafana step
  • -evaluationInterval=15s (vmalert) - matches the scrape interval
  • -datasource.lookback=15s (vmalert) - lets vmalert's "offset" match the search latency offset. This should ideally fix the issue I'm seeing.

To Reproduce
docker-compose.yml

version: '2.4'
services:
  victoriametrics:
    container_name: victoriametrics
    image: victoriametrics/victoria-metrics
    ports:
      - 8428:8428
    volumes:
      - vmdata:/vmdata
    command:
      - '-storageDataPath=/vmdata'
      - '-httpListenAddr=:8428'
      - '-selfScrapeInterval=0'
      - '-search.latencyOffset=15s'
  vmagent:
    container_name: vmagent
    image: victoriametrics/vmagent
    depends_on:
      - victoriametrics
    ports:
      - 8429:8429
    volumes:
      - vmagentdata:/vmagentdata
      - ./vm.yml:/tmp/vm.yml
    command:
      - '-promscrape.config=/tmp/vm.yml'
      - '-remoteWrite.url=http://victoriametrics:8428/api/v1/write'
  vmalert:
    container_name: vmalert
    image: victoriametrics/vmalert
    depends_on:
      - victoriametrics
    volumes:
      - ./vmalert_rules.yml:/tmp/alert.rules
    command:
      - '-rule=/tmp/alert.rules'
      # dummy notifier
      - '-notifier.url=http://127.0.0.1:9093'
      - '-datasource.url=http://victoriametrics:8428'
      - '-remoteWrite.url=http://victoriametrics:8428'
      - '-datasource.queryStep=15s'
      - '-evaluationInterval=15s'
      - '-datasource.lookback=15s'

# named volumes referenced by the services above
volumes:
  vmdata:
  vmagentdata:

vm.yml

---
global:
  scrape_interval:     15s
  evaluation_interval: 15s

scrape_configs:
  - job_name: static_scrape
    static_configs:
    - targets:
        - victoriametrics:8428

alert.rules

groups:
 - name: HistogramAggregations
   rules:
     - record: test:scrape_duration_seconds:15s
       expr: >
         sum(rate(scrape_duration_seconds{instance=~"victoriametrics.*"}[15s]))

Plotting test:scrape_duration_seconds:15s and sum(rate(scrape_duration_seconds{instance=~"victoriametrics.*"}[15s])) still requires a 15s offset to match.
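For reference, this is roughly what the two Grafana panel queries look like; the offset should not be necessary, which is the bug being reported:

# original expression
sum(rate(scrape_duration_seconds{instance=~"victoriametrics.*"}[15s]))

# recording rule output - currently only lines up with the original when shifted by one scrape interval
test:scrape_duration_seconds:15s offset 15s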

Expected behavior
The original and the recorded metric should match without requiring an offset.

Version

$ victoriametrics --version
victoria-metrics-20210408-064138-tags-v1.58.0-0-gedd1590ac

One round of discussion here: #832

valyala added the bug and vmalert labels on Apr 20, 2021
hagen1778 (Collaborator) commented Apr 27, 2021

Hi @raags! I think I understand now why that would happen. I'll come up with something this week. Thank you for the detailed report!

hagen1778 (Collaborator) commented Apr 30, 2021

Hi @raags! Please see PR #1257, which may help in this case.
For the case described in this ticket, don't forget to either bump datasource.lookback to 30s or lower search.latencyOffset to 15s. datasource.lookback shouldn't be lower than search.latencyOffset.

Please see how to build vmalert from sources here https://docs.victoriametrics.com/vmalert.html#how-to-build-from-sources
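To make that concrete, a sketch of the two options against the compose file from this report (the values mirror the suggestion above, they are not defaults):

# option 1: widen vmalert's lookback past the VictoriaMetrics latency offset
      - '-datasource.lookback=30s'     # vmalert
# option 2: keep both equal to the scrape interval
      - '-search.latencyOffset=15s'    # victoriametrics
      - '-datasource.lookback=15s'     # vmalert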

valyala added commits that referenced this issue on Apr 30, 2021
raags (Author) commented Apr 30, 2021

@hagen1778 Thanks, I'll check against the PR.

The values for search.latencyOffset and datasource.lookback can be equal, right? I'm assuming that optimally they should be equal, which results in the minimum possible lag (equal to the scrape interval). If I bump datasource.lookback, then the recording rules will be delayed by that much, correct?

hagen1778 (Collaborator) commented:

> The values for search.latencyOffset and datasource.lookback can be equal, right?

Correct. I've updated my previous comment.

> If I bump datasource.lookback, then the recording rules will be delayed by that much, correct?

Yes, the produced time series will be delayed (missing the last data points) but still aligned with the original expression.

valyala (Collaborator) commented May 1, 2021

All the commits mentioned above have been included in v1.59.0. @raags, could you verify whether vmalert v1.59.0 generates the expected recording rule results?

hagen1778 (Collaborator) commented:

Hi @raags! Any updates on this?

raags (Author) commented May 23, 2021

Hi @hagen1778, it works as expected: the recorded rule matches the original query exactly. This has allowed us to replace slow histogram graphs with their corresponding recording rules, which are much faster now.

However, there is one gripe - sometimes the recorded rule has extra data points compared to the original query. This only happens with histograms, where the original metric disappears.

Check the screenshot below, where I tried to reproduce it.

[screenshot: recording rule with extra trailing data points compared to the original query]

It's always equal to the last data point, and can span more than one data point. This screenshot is from the production dashboard:

[screenshot: production dashboard showing the same extra trailing data points]

Could this be due to the way histograms are calculated in a recording rule?
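For context, the recording rules in question follow this general shape; the metric name and quantile below are illustrative placeholders, not taken from the actual dashboard:

groups:
  - name: HistogramAggregations
    rules:
      # hypothetical example: precompute a p99 latency from raw histogram buckets
      - record: job:request_duration_seconds:p99
        expr: >
          histogram_quantile(0.99,
            sum(rate(request_duration_seconds_bucket[5m])) by (le, job))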

hagen1778 (Collaborator) commented May 24, 2021

I suspect it could happen due to the following reasons:

  1. VM automatically adjusts the staleness period based on the scrape_interval of a target. For example, if a metric hasn't been updated for 3x the scrape interval, it is marked as stale and disappears. Until then, VM continues to return "phantom" data points carrying the last recorded value.
  2. Because of this, vmalert will continue to receive responses for that metric even after it has stopped existing.
  3. Hence, in such cases vmalert's recording rule will always have more data points than the original query, because for a short period of time it was processing those "phantom" data points.

Schematically, it may be displayed in the following way:
Original metric: - - - - * *
Recording rule:  - - - - - - * *
where:
- is a real data point
* is a phantom data point based on the last value

But this doesn't explain why it happens only to histograms in your case.

raags (Author) commented Aug 24, 2021

@hagen1778 got it - in that case, can vmalert somehow be excluded from the staleness behaviour? I see there is a way to set -search.maxStalenessInterval to adjust this, but that will apply to all queries, not only to vmalert recording rules (where these phantom data points are actually adding data that shouldn't exist).
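For reference, a sketch of where that flag would go in the compose file from this report; the 15s value is only an example and, as noted above, it affects every query the instance serves:

  victoriametrics:
    command:
      - '-storageDataPath=/vmdata'
      - '-httpListenAddr=:8428'
      # shrink the staleness window globally; applies to all queries, not only vmalert's
      - '-search.maxStalenessInterval=15s'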

hagen1778 (Collaborator) commented Aug 30, 2021

It can't be done, unfortunately. The staleness threshold is a global setting and can't be adjusted by clients. This might be a feature request, though (wdyt @valyala?).
@raags what exactly happens to make "the original metric disappear"? Could the staleness markers support help here?

hagen1778 (Collaborator) commented:

Closed as inactive
