vmalert: recording rules produce unexpected spikes #4768

Open
mbrancato opened this issue Aug 2, 2023 · 8 comments
Assignees: hagen1778
Labels: bug (Something isn't working), need more info, vmalert

Comments

@mbrancato

Describe the bug

I am seeing frequent outliers, in the form of delta functions, in recording rules evaluated by vmalert.

I'm using recording rules for histogram quantiles. The rules look like this:

      - record: http_request_latency:1m_quantiles_cluster
        expr: |
          histogram_quantile(0.5, sum(rate(http_request_latency_bucket:30s_without_instance_pod_sum_samples[1m])) by (le,cluster))
        labels:
          quantile: "0.50"

And while I first attributed this to starts/restarts of vmagent, over time I found that these delta functions appear at random across different recorded metrics.

If I run the expression manually, the spike is not there.

[screenshot]

I'd rather the data not be there at all than be wrong and throw off the scale.

To Reproduce

Create a histogram_quantile recording rule.

Version

1.90.0

Logs

No response

Screenshots

No response

Used command-line flags

No response

Additional information

No response

mbrancato added the bug label on Aug 2, 2023
hagen1778 changed the title from "Spikes / outliers in vmagent recording rules" to "vmalert: recording rules produce unexpected spikes" on Aug 7, 2023
@hagen1778
Collaborator

@mbrancato do you have non-default settings for vmalert? Have you read https://docs.victoriametrics.com/vmalert.html#troubleshooting? Could you please elaborate on how data gets into VictoriaMetrics? How exactly is http_request_latency_bucket:30s_without_instance_pod_sum_samples calculated: is it generated by vmalert or by another component?

hagen1778 self-assigned this on Aug 7, 2023
@mbrancato
Author

@hagen1778 We're using the VM operator to deploy and configure everything. To clarify: I've since switched to the total output for streaming aggregation, and I still see this. To get data in, vmagent pushes the streaming-aggregation metrics to the VM cluster. The summary metric http_request_latency_bucket:30s_without_instance_pod_sum_samples (or http_request_latency_bucket:30s_without_instance_pod_total) is generated by vmagent and shipped to VM via remote write.
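
For context, the streaming-aggregation rule that produces a metric with that name presumably looks something like the sketch below (reconstructed from the metric name, not copied from our actual config):

      - match: "http_request_latency_bucket"
        interval: "30s"
        # the earlier series used the "sum_samples" output before switching to "total"
        outputs: [ "total" ]
        without: [ "instance", "pod" ]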

Again, if I run the expr directly, I do not see the spikes. I'll look into the lookback setting (I expect metrics are delayed at times), but that is not a supported option in VMAlertDatasourceSpec.

Example in table form:
[screenshot]

@hagen1778
Collaborator

hagen1778 commented Aug 16, 2023

@mbrancato thanks for the details! Do you use one or multiple vmagents?
I've set up a testing env for this and will check whether I get the same discrepancy as yours. I'll report my findings later.

@hagen1778
Collaborator

@mbrancato I've been running a test on a dev environment to calculate recording rules based on aggregates generated by vmagent. I used the following config for the rules:

  groups:
    - name: quantiles-for-stream-aggregated-histogram-buckets
      rules:
        - record: storage_operation_duration:quantile:5m
          expr: |
            histogram_quantile(0.5, sum(rate(storage_operation_duration_seconds_bucket:2m_by_instance_job_le_status_total[5m])) by(le, status))
          labels:
            quantile: "0.50"
        - record: storage_operation_duration:quantile:5m
          expr: |
            histogram_quantile(1, sum(rate(storage_operation_duration_seconds_bucket:2m_by_instance_job_le_status_total[5m])) by(le, status))
          labels:
            quantile: "1"
        - record: storage_operation_duration:quantile:5m
          expr: |
            histogram_quantile(0.99, sum(rate(storage_operation_duration_seconds_bucket:2m_by_instance_job_le_status_total[5m])) by(le, status))
          labels:
            quantile: "0.99"

For the streaming aggregation I used the following config:

          - match: "storage_operation_duration_seconds_bucket"
            interval: "2m"
            outputs: [ "total" ]
            by: [ "job", "instance", "status", "le" ]

I ran the test for more than 7 days and saw no discrepancy:

[screen recording: Screen.Recording.2023-08-24.at.09.47.29.mov]

Could you help identify what in your config could be causing the issue you mentioned?

@mbrancato
Author

@hagen1778 sorry, I haven't had much time to get back to this, but today I did. I made two changes: I doubled the evaluation interval to 1 minute, and I added a lookback parameter, following the guidance at https://docs.victoriametrics.com/vmalert.html#data-delay. Unfortunately, this did not seem to help.

  evaluationInterval: "1m"
  extraArgs:
    # attempt to solve blank spaces in recorded rule graphs due to delayed data
    datasource.lookback: "3m"
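
For reference, assuming the operator passes these settings through unchanged, they should amount to roughly the following vmalert flags (assumed, not copied from the running pod):

      -evaluationInterval=1m
      -datasource.lookback=3m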

My recording rules look like this:

  groups:
  - name: my-rules
    interval: 1m
    rules:
      - record: process_time:1m_quantiles
        expr: |
          histogram_quantile(0.5, sum(rate(process_time_bucket:30s_without_instance_pod_total[1m])) by (le))
        labels:
          quantile: "0.50"

However, the recorded data still has gaps. I did find that the source data sometimes has a gap, but not usually, and it doesn't correspond with the recording rule gaps (green below). I inverted the recording rule to make it easier to see:
[screenshot]

@mbrancato
Author

I just realized the gaps are not really what this issue is about. Since the lookback change I've seen fewer spikes, but they still happen.

@hagen1778
Collaborator

@mbrancato could it be that vmagent is struggling to deliver the mentioned metric process_time_bucket:30s_without_instance_pod_total? Can you check vmagent's Grafana dashboard and see whether the PersistentQueue had correlated increases? Or maybe vmagent was restarted at the same moments in time?
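
For example (a sketch assuming the standard vmagent remote-write metrics are scraped), a sustained increase in the query below around the spike timestamps would point to a delivery backlog:

      sum(vmagent_remotewrite_pending_data_bytes) by (instance)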

@mbrancato
Author

@hagen1778 There are no restarts at that time, and Kubernetes shows no container restarts for those pods. The persistent queue graph is empty; I see it's filtered to only values > 2e6. However, if I remove that filter, I get this:
[screenshot]

I now think this is mainly related to #4966; at least, a lot of the recording rule spikes are correlated with streaming aggregation spikes.
