vmalert: recording rules produce unexpected spikes #4768

Open
mbrancato opened this issue Aug 2, 2023 · 8 comments
Assignees: hagen1778
Labels: bug (Something isn't working), need more info, vmalert

Comments

@mbrancato

Describe the bug

I am seeing frequent outliers, in the form of delta functions, in recording rules evaluated by vmalert.

I'm using recording rules for histogram quantiles. The rules look like this:

      - record: http_request_latency:1m_quantiles_cluster
        expr: |
          histogram_quantile(0.5, sum(rate(http_request_latency_bucket:30s_without_instance_pod_sum_samples[1m])) by (le,cluster))
        labels:
          quantile: "0.50"

And while I first attributed this to starts/restarts of vmagent, over time I found that these delta functions appear at random across different recorded metrics.

If I run the expression manually, the spike is not there.

[screenshot]

I'd rather the data not be there at all than be wrong and throw off the scale.

To Reproduce

Create a histogram_quantile recording rule.

Version

1.90.0

Logs

No response

Screenshots

No response

Used command-line flags

No response

Additional information

No response

mbrancato added the bug label on Aug 2, 2023
hagen1778 changed the title from "Spikes / outliers in vmagent recording rules" to "vmalert: recording rules produce unexpected spikes" on Aug 7, 2023
@hagen1778
Collaborator

@mbrancato do you have non-default settings for vmalert? Have you read https://docs.victoriametrics.com/vmalert.html#troubleshooting? Could you please elaborate on how data gets into VictoriaMetrics? How exactly is http_request_latency_bucket:30s_without_instance_pod_sum_samples calculated: is it generated by vmalert or by another component?

hagen1778 self-assigned this on Aug 7, 2023
@mbrancato
Author

@hagen1778 We're using the VM operator to deploy and configure everything. To clarify: I've since switched to the total output for streaming aggregation, and I still see this. To get data in, vmagent pushes the streaming-aggregation metrics to the VM cluster. The summary metric http_request_latency_bucket:30s_without_instance_pod_sum_samples (or http_request_latency_bucket:30s_without_instance_pod_total) is generated by vmagent and shipped to VM via remote write.
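
For context, the streaming-aggregation rule that produces a metric with that name presumably looks something like the sketch below (reconstructed from the metric name, not copied from our actual config):

      - match: "http_request_latency_bucket"
        interval: "30s"
        # the earlier series used the "sum_samples" output before switching to "total"
        outputs: [ "total" ]
        without: [ "instance", "pod" ]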

Again, if I run the expr directly, I do not see the spikes. I'll look into the lookback setting (I expect metrics are delayed at times), but that is not a supported option in VMAlertDatasourceSpec.

Example in table form:
[screenshot]

@hagen1778
Collaborator

hagen1778 commented Aug 16, 2023

@mbrancato thanks for the details! Do you use one or multiple vmagents?
I've set up a testing env for this and will check whether I get the same discrepancy as yours. I'll report my findings later.

@hagen1778
Collaborator

@mbrancato I've been running a test on a dev environment to calculate recording rules based on aggregates generated by vmagent. I used the following config for the rules:

  groups:
    - name: quantiles-for-stream-aggregated-histogram-buckets
      rules:
        - record: storage_operation_duration:quantile:5m
          expr: |
            histogram_quantile(0.5, sum(rate(storage_operation_duration_seconds_bucket:2m_by_instance_job_le_status_total[5m])) by(le, status))
          labels:
            quantile: "0.50"
        - record: storage_operation_duration:quantile:5m
          expr: |
            histogram_quantile(1, sum(rate(storage_operation_duration_seconds_bucket:2m_by_instance_job_le_status_total[5m])) by(le, status))
          labels:
            quantile: "1"
        - record: storage_operation_duration:quantile:5m
          expr: |
            histogram_quantile(0.99, sum(rate(storage_operation_duration_seconds_bucket:2m_by_instance_job_le_status_total[5m])) by(le, status))
          labels:
            quantile: "0.99"

For the streaming aggregation I used the following config:

          - match: "storage_operation_duration_seconds_bucket"
            interval: "2m"
            outputs: [ "total" ]
            by: [ "job", "instance", "status", "le" ]

I ran the test for more than 7 days and saw no discrepancy:

[screen recording: Screen.Recording.2023-08-24.at.09.47.29.mov]

Could you help identify what in your config could be causing the issue you mentioned?

@mbrancato
Author

@hagen1778 sorry, I haven't had much time to get back to this, but today I did. I made two changes: I doubled the evaluation interval to 1 minute, and I added a lookback parameter, following the guidance at https://docs.victoriametrics.com/vmalert.html#data-delay. Unfortunately, this did not seem to help.

  evaluationInterval: "1m"
  extraArgs:
    # attempt to solve blank spaces in recorded rule graphs due to delayed data
    datasource.lookback: "3m"
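
For reference, assuming the operator passes these settings through unchanged, they should amount to roughly the following vmalert flags (assumed, not copied from the running pod):

      -evaluationInterval=1m
      -datasource.lookback=3m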

My recording rules look like this:

  groups:
  - name: my-rules
    interval: 1m
    rules:
      - record: process_time:1m_quantiles
        expr: |
          histogram_quantile(0.5, sum(rate(process_time_bucket:30s_without_instance_pod_total[1m])) by (le))
        labels:
          quantile: "0.50"

However, the recorded data still has gaps. I did find that the source data sometimes has a gap, but not usually, and it doesn't correspond with the recording rule gaps (green below). I inverted the recording rule to make it easier to see:
[screenshot]

@mbrancato
Author

I just realized the gaps are not really what this issue is about. Since the lookback change I've seen fewer spikes, but they still happen.

@hagen1778
Collaborator

@mbrancato could it be that vmagent is struggling to deliver the mentioned metric process_time_bucket:30s_without_instance_pod_total? Can you check vmagent's Grafana dashboard and see whether the PersistentQueue had correlated increases? Or maybe vmagent was restarted at the same moments in time?
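
For example (a sketch assuming the standard vmagent remote-write metrics are scraped), a sustained increase in the query below around the spike timestamps would point to a delivery backlog:

      sum(vmagent_remotewrite_pending_data_bytes) by (instance)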

@mbrancato
Author

@hagen1778 There are no restarts at that time, and Kubernetes shows no container restarts for those pods. The persistent queue graph is empty; I see it's filtered to only values > 2e6. However, if I remove that filter, I get this:
[screenshot]

I now think this is mainly related to #4966; at least, a lot of the recording rule spikes are correlated with streaming aggregation spikes.
