vmalert: plotting a recorded metric along with the original in Grafana requires an offset for them to match #1232

Closed
raags opened this issue Apr 19, 2021 · 11 comments
Labels: bug, vmalert

raags commented Apr 19, 2021

Describe the bug
The screenshot below shows the difference between the original query and the recording rule. The graphs do not match unless an offset equal to the scrape interval is applied.

[screenshot: original query vs. recording rule, misaligned by one scrape interval]

The following parameters have been explored:

  • -search.latencyOffset=15s (victoriametrics) - ensures the recording rule doesn't miss any data points (due to delay in scraping)
  • -datasource.queryStep=15s (vmalert) - ensures the query step matches the Grafana step
  • -evaluationInterval=15s (vmalert) - matches the scrape interval
  • -datasource.lookback=15s (vmalert) - lets vmalert's "offset" match the search latency offset. This should ideally fix the issue I'm seeing.

To Reproduce
docker-compose.yml

version: '2.4'
services:
  victoriametrics:
    container_name: victoriametrics
    image: victoriametrics/victoria-metrics
    ports:
      - 8428:8428
    volumes:
      - vmdata:/vmdata
    command:
      - '-storageDataPath=/vmdata'
      - '-httpListenAddr=:8428'
      - '-selfScrapeInterval=0'
      - '-search.latencyOffset=15s'
  vmagent:
    container_name: vmagent
    image: victoriametrics/vmagent
    depends_on:
      - victoriametrics
    ports:
      - 8429:8429
    volumes:
      - vmagentdata:/vmagentdata
      - ./vm.yml:/tmp/vm.yml
    command:
      - '-promscrape.config=/tmp/vm.yml'
      - '-remoteWrite.url=http://victoriametrics:8428/api/v1/write'
  vmalert:
    container_name: vmalert
    image: victoriametrics/vmalert
    depends_on:
      - victoriametrics
    volumes:
      - ./vmalert_rules.yml:/tmp/alert.rules
    command:
      - '-rule=/tmp/alert.rules'
      # dummy notifier
      - '-notifier.url=http://127.0.0.1:9093'
      - '-datasource.url=http://victoriametrics:8428'
      - '-remoteWrite.url=http://victoriametrics:8428'
      - '-datasource.queryStep=15s'
      - '-evaluationInterval=15s'
      - '-datasource.lookback=15s'

# named volumes referenced by the services above
volumes:
  vmdata:
  vmagentdata:

vm.yml

---
global:
  scrape_interval:     15s
  evaluation_interval: 15s

scrape_configs:
  - job_name: static_scrape
    static_configs:
    - targets:
        - victoriametrics:8428

alert.rules

groups:
 - name: HistogramAggregations
   rules:
     - record: test:scrape_duration_seconds:15s
       expr: >
         sum(rate(scrape_duration_seconds{instance=~"victoriametrics.*"}[15s]))

Plotting test:scrape_duration_seconds:15s and sum(rate(scrape_duration_seconds{instance=~"victoriametrics.*"}[15s])) still requires a 15s offset to match.
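For reference, this is roughly what the two Grafana panel queries look like; the offset should not be necessary, which is the bug being reported:

# original expression
sum(rate(scrape_duration_seconds{instance=~"victoriametrics.*"}[15s]))

# recording rule output - currently only lines up with the original when shifted by one scrape interval
test:scrape_duration_seconds:15s offset 15s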

Expected behavior
The original and the recorded metric should match without requiring an offset.

Version

$ victoriametrics --version
victoria-metrics-20210408-064138-tags-v1.58.0-0-gedd1590ac

One round of discussion here: #832

valyala added the bug and vmalert labels on Apr 20, 2021
hagen1778 (Collaborator) commented Apr 27, 2021

Hi @raags! I think I understand now why that would happen. I'll come up with something this week. Thank you for the detailed report!

hagen1778 (Collaborator) commented Apr 30, 2021

Hi @raags! Please see PR #1257, which may help in this case.
For the case described in this ticket, don't forget to either bump datasource.lookback to 30s or lower search.latencyOffset to 15s. datasource.lookback shouldn't be lower than search.latencyOffset.

Please see how to build vmalert from sources here https://docs.victoriametrics.com/vmalert.html#how-to-build-from-sources
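To make that concrete, a sketch of the two options against the compose file from this report (the values mirror the suggestion above, they are not defaults):

# option 1: widen vmalert's lookback past the VictoriaMetrics latency offset
      - '-datasource.lookback=30s'     # vmalert
# option 2: keep both equal to the scrape interval
      - '-search.latencyOffset=15s'    # victoriametrics
      - '-datasource.lookback=15s'     # vmalert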

valyala added commits that referenced this issue on Apr 30, 2021
raags (Author) commented Apr 30, 2021

@hagen1778 Thanks, I'll check against the PR.

The values for search.latencyOffset and datasource.lookback can be equal, right? I'm assuming that optimally they should be equal, which results in the minimum possible lag (equal to the scrape interval). If I bump datasource.lookback, then the recording rules will be delayed by that much, correct?

hagen1778 (Collaborator) commented:

> The values for search.latencyOffset and datasource.lookback can be equal, right?

Correct. I've updated my previous comment.

> If I bump datasource.lookback, then the recording rules will be delayed by that much, correct?

Yes, the produced time series will be delayed (missing the last data points) but still aligned with the original expression.

valyala (Collaborator) commented May 1, 2021

All the commits mentioned above have been included in v1.59.0. @raags, could you verify whether vmalert v1.59.0 generates the expected recording rule results?

hagen1778 (Collaborator) commented:

Hi @raags! Any updates on this?

raags (Author) commented May 23, 2021

Hi @hagen1778, it works as expected: the recorded rule matches the original query exactly. This has allowed us to replace slow histogram graphs with their corresponding recording rules, which are much faster now.

However, there is one gripe - sometimes the recorded rule has extra data points compared to the original query. This only happens with histograms, where the original metric disappears.

Check the screenshot below, where I tried to reproduce it.

[screenshot: recording rule with extra trailing data points compared to the original query]

It's always equal to the last data point, and can span more than one data point. This screenshot is from the production dashboard:

[screenshot: production dashboard showing the same extra trailing data points]

Could this be due to the way histograms are calculated in a recording rule?
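For context, the recording rules in question follow this general shape; the metric name and quantile below are illustrative placeholders, not taken from the actual dashboard:

groups:
  - name: HistogramAggregations
    rules:
      # hypothetical example: precompute a p99 latency from raw histogram buckets
      - record: job:request_duration_seconds:p99
        expr: >
          histogram_quantile(0.99,
            sum(rate(request_duration_seconds_bucket[5m])) by (le, job))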

hagen1778 (Collaborator) commented May 24, 2021

I suspect it could happen due to the following reasons:

  1. VM automatically adjusts the staleness period based on the scrape_interval of a target. For example, if a metric hasn't been updated for 3x the scrape interval, it is marked as stale and disappears. Until then, VM continues to return "phantom" data points carrying the last recorded value.
  2. Because of this, vmalert will continue to receive responses for that metric even after it has stopped existing.
  3. Hence, in such cases vmalert's recording rule will always have more data points than the original query, because for a short period of time it was processing those "phantom" data points.

Schematically, it may be displayed in the following way:
Original metric: - - - - * *
Recording rule:  - - - - - - * *
where:
- is a real data point
* is a phantom data point based on the last value

But this doesn't explain why it happens only to histograms in your case.

raags (Author) commented Aug 24, 2021

@hagen1778 got it - in that case, can vmalert somehow be excluded from the staleness behaviour? I see there is a way to set -search.maxStalenessInterval to adjust this, but that will apply to all queries, not only to vmalert recording rules (where these phantom data points are actually adding data that shouldn't exist).
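For reference, a sketch of where that flag would go in the compose file from this report; the 15s value is only an example and, as noted above, it affects every query the instance serves:

  victoriametrics:
    command:
      - '-storageDataPath=/vmdata'
      - '-httpListenAddr=:8428'
      # shrink the staleness window globally; applies to all queries, not only vmalert's
      - '-search.maxStalenessInterval=15s'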

hagen1778 (Collaborator) commented Aug 30, 2021

It can't be done, unfortunately. The staleness threshold is a global setting and can't be adjusted by clients. This might be a feature request, though (wdyt @valyala?).
@raags what exactly happens to make "the original metric disappear"? Could the staleness markers support help here?

hagen1778 (Collaborator) commented:

Closed as inactive
