Streaming aggregation spikes in data reported #4966
Hello @mbrancato! Could you please check whether this vmagent was restarted or whether its config got updated around that time? The metrics that should help you with this are:
Hey @hagen1778 - based on the logs I do not think it restarted, but here are the metrics you mentioned around that time. It does seem like it added and dropped some target pods at that time (see log entry above), which is normal because we are spinning up and down pods frequently based on request volume. I have seen / correlated this with a prom config reload previously, but it's not always correlated with the spikes, as in this case.

`reload_success_timestamp_seconds`:

Even if there was a gap in data (maybe a remote-write issue), I wouldn't think there would be a jump in the counters.
Hm, I think the gap could explain it...
@hagen1778 vmagent is doing the scraping, and doing remote write + streaming aggregation.
This might be related to #4862
Hm, this is weird... Do you have a flat hierarchy of vmagents? Or are there vmagents writing into vmagents?
@hagen1778 vmagent shards write directly to a VM cluster (vminsert). So I'm sure there could be connection timeouts or drops, but what doesn't make sense is the sudden jump in the counter. Here is some additional data from today; these are login actions that are being graphed by 3 vmagent shards:
Shard 1 suddenly jumps from 166 to 208. The values of the other shards (0 and 3) do not change. There are no restarts or odd log entries, but I have filtered out "cannot scrape target" entries in these logs, as they are just misconfigured PodMonitor resources.
Thanks for your investigation!
I'm afraid it could be very complicated to define such a moment in time when everything is discovered and scraped, due to the following reasons:
From what you described, I assume the problem is the following:
I think the proper solution would be to persist the state of aggregates on vmagent shutdown, so that after a restart the produced counters won't start from 0.
Thanks @hagen1778 - I think simply waiting until some interval has fully passed for any aggregation since startup is also viable, but obviously it's lossy. Without loading persisted aggregation data (which is currently not available), I would ask why is

When using Kubernetes and the VM operator, it's not simply restarts. I mainly see this on pod replacement: same deployment and replicaset, but a new replacement pod is brought up for whatever reason - maybe a node was cycled or something. You can see this with the pod names below; the metric is

For persisting aggregation per shard across restarts, I think
This would require the user to specify this interval manually, which might not be trivial to explain and configure. Besides, it seems like it would have a similar result to extending the aggregation interval to a bigger value, which would cover possible delays in target discovery.
The aggregation happens in a process separate from scraping, because it could need to aggregate values from many targets. With this design, aggregation isn't aware of anything being scraped or discovered. It just performs aggregations in the background for everything it was fed with.
Agree. I think we can solve it in a different way. Looks like you're using
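If the suggestion here is to aggregate per-interval increases instead of a running total, a minimal sketch of such a rule would look like the following (the match expression and label list are illustrative placeholders, not from this thread):

```yaml
# Sketch: emit per-interval increases rather than a running total.
# The match expression and "without" labels are placeholders for illustration.
- match:
    - '{__name__=~"^nginx_.+"}'
  interval: "30s"
  outputs: ["increase"]
  without: ["pod", "instance"]
```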
Hi! We have exactly the same issue with spikes in data. We have an aggregation interval of 15 minutes for cold storage. VMAgent scrapes all targets. Deployed with VMOperator with
@sergeyshevch I think your problem is related to #5643, not to the case in this ticket. |
@hagen1778 I don't think a rate aggregation helps me here, since the rate (and probably increase) calculation would be limited to specific timeframes, so I couldn't then convert that from a
@hagen1778 After upgrading to 1.101, this issue (or a similar one) appears to be even more prevalent. I have not enabled the flushing to storage for

I was able to easily show some bad data using Grafana. Here, the nginx requests can be seen increasing by many millions over just a few seconds. I was able to then jump into the UI for that

I then port forwarded to each nginx pod and summed the values:
The results are nowhere near what
Btw, this is now possible to define via
Do you drop raw metrics? If not, could you please show the raw values of these time series without aggregation over the same period of time? Could you please also show the config of your aggregating rule? I find the metric name in the screenshot different from the rule you posted in the initial message.
I do drop raw metrics, but let me re-enable them for these nginx metrics and see if I can provide that. Currently my remote write config looks like this:

```yaml
remoteWrite:
  - url: ...
    sendTimeout: "2m"
    inlineUrlRelabelConfig:
      # Drop all container metrics unless they are aggregated. This reduces the value of the
      # metrics, but reduces the cardinality. Enable specific container metrics needed here
      # for investigation.
      - action: drop_metrics
        regex: "^container_[^:]*"
      - action: drop_metrics
        regex: "^kube_pod_[^:]*$"
      - action: drop_metrics
        regex: "^prober_probe_[^:]*$"
      - action: drop_metrics
        regex: "^kube_endpoint_address$"
      # Disabling this to help debug this issue:
      # https://github.com/VictoriaMetrics/VictoriaMetrics/issues/4966
      #- action: drop_metrics
      #  regex: "^nginx_[^:]*$"
  - url: ...
    sendTimeout: "2m"
    streamAggrConfig:
      keepInput: false
      rules:
        - match:
            - '{__name__=~"^container_.+"}'
          interval: "30s"
          outputs: ["total"]
          without: ["id", "name", "instance", "pod", "node"]
        - match:
            - '{__name__=~"^kube_pod_.+"}'
          interval: "30s"
          outputs: ["total"]
          without: ["uid", "pod", "container_id"]
        - match:
            - '{__name__=~"^prober_probe_.+"}'
          interval: "30s"
          outputs: ["total"]
          without: ["pod", "instance"]
        - match:
            - '{__name__=~"^nginx_.+"}'
          interval: "30s"
          outputs: ["total"]
          without: ["pod", "instance", "controller_pod"]
```

I just adjusted the config, but I might need to set
Hey @hagen1778 - I re-enabled the raw metrics, and it's a stark difference in values here:
@mbrancato From the graphs you've shared, the increase on the interval happens close to a pod restart. Please try to set
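If the setting being suggested here is the staleness interval (which is what is reported to help further down in this thread), a minimal sketch of one of the rules from this thread with it added - the 5m value is only illustrative:

```yaml
# Sketch: add staleness_interval so series that briefly disappear are not
# treated as brand new when they show up again. The 5m value is illustrative.
- match:
    - '{__name__=~"^nginx_.+"}'
  interval: "30s"
  outputs: ["total"]
  without: ["pod", "instance", "controller_pod"]
  staleness_interval: "5m"
```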
@AndrewChubatiuk I'll make those changes. I'm not quite sure I understand about the pod restarts - I don't see that happening. Also, the
@AndrewChubatiuk These are not pod restarts. The gaps are generated by

Is there a way to investigate, or a metric to identify, that

That said, it seems unrelated to the increases, as many times those sudden increases happen with no gap in metric data.
I will investigate using
@AndrewChubatiuk Setting the staleness interval helped a lot, however I still notice some

So I also tried
Here is an aggregated counter used to produce
And here is the same data just as

So I think there is still some issue with the streaming agg, but there's no reason that
The `total_prometheus` output in existing releases has the same suffix as the `total` output; the new release provides only a suffix change.
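For illustration only (this exact rule is not from the thread), a rule exercising both outputs:

```yaml
# Sketch: emit both the total and total_prometheus outputs. Per the comment
# above, existing releases use the same suffix for both in the produced metric
# names; only the newer release changes the total_prometheus suffix.
- match:
    - '{__name__=~"^nginx_.+"}'   # illustrative match expression
  interval: "30s"
  outputs: ["total", "total_prometheus"]
  without: ["pod", "instance"]
```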
Can you provide an example of metrics and aggregations for these metrics?
I'll probably need to wait till this is released - I don't want to disable what we have today.
The last two graphs above are those examples. I'm wondering if, with sharding, the pods are not "sticky" to a shard. As new pods come up and down, eventually a pod that has been running for some time gets moved over to a new shard or something. That still doesn't account for the fact that the
@AndrewChubatiuk I took some screenshots of just the aggregation - it's jumping around a lot. I don't feel like it should have discontinuities in the aggregation if it's a running total. How does it go down several times when there is no restart?
Let me give you an example to better understand how the total aggregation works.

and during a second one

for aggregation config

the total will be calculated as the sum of independent increases of each sample collection with a unique label set, aggregated over the labels which are set in

Please note that during the second scrape a metric appeared with a label collection which didn't exist during the previous scrapes.
As I showed in the example, if a metric for a pod with a new id appears (or was not present during the staleness interval), its initial value (
Looking at the code, the only possible reason for a total reset besides a restart that I see is when the total value is greater than
@AndrewChubatiuk Thanks for the explanation. The counters for these new pods always start at 0, so there is no good reason we would see large jumps. As for some integer / float limit on the number, the aggregated total was around

What happens to the aggregation when a metric / label set disappears, no longer exists, and falls out beyond the staleness interval? Do you have unit tests for this? I could play around with a branch locally to try and reproduce.
@AndrewChubatiuk I added your example to your existing test function:

```go
f(
	`
- interval: 1m
  without: [pod]
  outputs: [total]
`, `
foo_metric{pod="pod1", env="stage"} 5
foo_metric{pod="pod2", env="stage"} 3
foo_metric{pod="pod1", env="prod"} 2
foo_metric{pod="pod2", env="prod"} 8
foo_metric{pod="pod3", env="stage"} 7
foo_metric{pod="pod2", env="stage"} 4
foo_metric{pod="pod1", env="prod"} 5
foo_metric{pod="pod2", env="prod"} 10
`, `foo_metric:1m_without_pod_total{env="prod"} 15
foo_metric:1m_without_pod_total{env="stage"} 16
`, "11111111",
)
```
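For reference, my reading of how those expected totals come out: for `env="prod"` the initial samples contribute 2 + 8 = 10 and the per-pod increases add (5 - 2) + (10 - 8) = 5, giving 15; for `env="stage"` the initial samples contribute 5 + 3 = 8, pod2 adds (4 - 3) = 1 and the newly appearing pod3 adds 7, giving 16.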
I also didn't find any tests that really simulated staleness issues, so I hacked together another test function:

```go
func TestAggregatorsChanges(t *testing.T) {
	f := func(
		config string, interval time.Duration, inputMetrics []string, outputMetricsExpected string,
	) {
		t.Helper()

		// Initialize Aggregators
		var tssOutput []prompbmarshal.TimeSeries
		var tssOutputLock sync.Mutex
		pushFunc := func(tss []prompbmarshal.TimeSeries) {
			tssOutputLock.Lock()
			tssOutput = make([]prompbmarshal.TimeSeries, 0)
			for _, ts := range tss {
				tssOutput = append(
					tssOutput, prompbmarshal.TimeSeries{
						Labels:  append(ts.Labels[:0:0], ts.Labels...),
						Samples: append(ts.Samples[:0:0], ts.Samples...),
					},
				)
			}
			tssOutputLock.Unlock()
		}
		opts := &Options{
			FlushOnShutdown:        true,
			NoAlignFlushToInterval: true,
		}
		a, err := newAggregatorsFromData([]byte(config), pushFunc, opts)
		if err != nil {
			t.Fatalf("cannot initialize aggregators: %s", err)
		}

		// Push the inputMetrics to Aggregators
		for _, inputMetric := range inputMetrics {
			tssInput := mustParsePromMetrics(inputMetric)
			_ = a.Push(tssInput, nil)
			time.Sleep(interval)
		}
		a.MustStop()

		// Verify the tssOutput contains the expected metrics
		outputMetrics := timeSeriessToString(tssOutput)
		if outputMetrics != outputMetricsExpected {
			t.Errorf(
				"unexpected output metrics;\ngot\n%s\nwant\n%s", outputMetrics,
				outputMetricsExpected,
			)
		}
	}

	f(
		`
- interval: 1s
  without: [de]
  outputs: [total]
  staleness_interval: 2s
`, time.Second,
		[]string{
			`
foo{abc="qwe",de="asd"} 1
foo{abc="qwe",de="sj"} 2`,
			`
foo{abc="qwe",de="asd"} 3
foo{abc="qwe",de="sj"} 4`,
		}, `foo:1s_without_de_total{abc="qwe"} 4
`,
	)
	f(
		`
- interval: 1s
  without: [de]
  outputs: [total]
  staleness_interval: 2s
`, time.Second,
		[]string{
			// first values, these are ignored - 0
			`
foo{abc="qwe",de="asd"} 1
foo{abc="qwe",de="sj"} 2`,
			// increment each by 2, 4 total
			`
foo{abc="qwe",de="asd"} 3
foo{abc="qwe",de="sj"} 4`,
			// no change
			`
foo{abc="qwe",de="sj"} 4`,
			// no change
			`
foo{abc="qwe",de="sj"} 4`,
			// no change
			`
foo{abc="qwe",de="sj"} 4`,
			// new labels added, first value should be 0
			`
foo{abc="qwe",de="eu"} 5
foo{abc="qwe",de="sj"} 4`,
			// increment from 5 to 7, so 2 + 4 = 6
			`
foo{abc="qwe",de="eu"} 7
foo{abc="qwe",de="sj"} 4`,
		}, `foo:1s_without_de_total{abc="qwe"} 6
`,
	)
	f(
		`
- interval: 1s
  without: [pod]
  outputs: [total]
  staleness_interval: 2s
`, time.Second,
		[]string{
			`
foo_metric{pod="pod1", env="stage"} 5
foo_metric{pod="pod2", env="stage"} 3
foo_metric{pod="pod1", env="prod"} 2
foo_metric{pod="pod2", env="prod"} 8`,
			`
foo_metric{pod="pod3", env="stage"} 7
foo_metric{pod="pod2", env="stage"} 4
foo_metric{pod="pod1", env="prod"} 5
foo_metric{pod="pod2", env="prod"} 10`,
		}, `foo_metric_30s_without_pod_total{env="stage"} 16
foo_metric_30s_without_pod_total{env="prod"} 15
`,
	)
}
```

I think there's some nuance here where using
The tests do not work as expected because in the `total` output implementation the staleness interval is also used as an initial timeout during which first samples are ignored: VictoriaMetrics/lib/streamaggr/total.go, line 71 in ff513b4.
@AndrewChubatiuk @hagen1778 I just wanted to give a quick follow-up. I've been running

I've not seen any spikes ever since I enabled

Similar to what I posted above, aggregations for

The actual sum is not important - I know the raw data isn't the same as the running total aggregation. But the important thing is
**Describe the bug**

Several times per day, vmagent shards sending streaming aggregation data will suddenly have gigantic spikes in `total` counters. Note, I think this is unrelated to #4768, as this is reproducible when querying the stored metric data.

All of the examples below are using queries that look like this:

The spikes in rate over time look like this:

With the underlying data confirming the spike:

So the question is, why would that one shard have a huge spike? These counters are going up at a rate of single-digit thousands per second, not millions.
**To Reproduce**

The VMAgent is configured with streaming aggregation like this:
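A trimmed sketch of the aggregation rules (the full remoteWrite config, including the relabeling, is quoted in a comment above in this thread):

```yaml
# Trimmed sketch reconstructed from the config shared in this thread.
streamAggrConfig:
  keepInput: false
  rules:
    - match:
        - '{__name__=~"^nginx_.+"}'
      interval: "30s"
      outputs: ["total"]
      without: ["pod", "instance", "controller_pod"]
```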
**Version**

Using the official container tag v1.93.3.

**Logs**

Using the example above at 18:41:00, here are the logs from the vmagent shard `vmagent-vmagent-1-6d6b7df89-8tkpn`, filtering out scrape errors; this is the only log entry:
I have also seen config reload trigger logs that correlate with these spikes.
**Screenshots**

Sometimes, but not always, the spikes are seen across multiple metrics - related only in that they are counters being scraped by the same shard.
These are all counters from different workloads:
**Used command-line flags**

No response

**Additional information**

These are deployed using the VM operator v0.37.1, installed with the VM operator Helm chart 0.26.1.