
Streaming aggregation spikes in data reported #4966

Open
mbrancato opened this issue Sep 7, 2023 · 31 comments

@mbrancato

mbrancato commented Sep 7, 2023

Describe the bug

Several times per day, vmagent shards sending streaming aggregation data will suddenly show gigantic spikes in the total counters. Note: I think this is unrelated to #4768, as this is reproducible when querying the stored metric data.

All of the examples below are using queries that look like this:

sum(rate(target_request_count:30s_without_instance_pod_total[$granularity]))

The spikes in rate over time look like this:

image

With the underlying data confirming the spike:

image

So the question is: why would that one shard have a huge spike? These counters are going up at a rate of single-digit thousands per second, not millions.

To Reproduce

The VMAgent is configured with streaming aggregation like this:

          - match:
            - '{__name__=~"myapp_.+"}'
            interval: "30s"
            outputs: ["total","sum_samples"]
            without: ["pod", "instance"]
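
With this rule, each matched counter should produce aggregated output series named along the lines of <metric>:30s_without_instance_pod_total and <metric>:30s_without_instance_pod_sum_samples, which is the naming pattern of the series queried above.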

Version

Using the official container tag v1.93.3:

% docker run -it --rm docker.io/victoriametrics/vmagent:v1.93.3 --version
vmagent-20230902-001011-tags-v1.93.3-0-gc1f0a2b5fa

Logs

Using the example above at 18:41:00, here are logs from that vmagent shard, vmagent-vmagent-1-6d6b7df89-8tkpn.

Filtering out scrape errors, this is the only log entry:

{"caller":"VictoriaMetrics/lib/promscrape/scraper.go:430", "level":"info", "msg":"kubernetes_sd_configs: added targets: 146, removed targets: 30; total targets: 1308", "ts":"2023-09-06T22:39:28.129Z"}

I have also seen config reload trigger logs that correlate with these spikes.

Screenshots

Sometimes, though not always, the spikes are seen across multiple metrics, related only in that they are counters being scraped by the same shard.

These are all counters from different workloads:

image image image image

Used command-line flags

No response

Additional information

These are deployed using the VM operator v0.37.1 deployed with the VM operator helm chart 0.26.1.

mbrancato added the "bug" label on Sep 7, 2023
@hagen1778
Collaborator

Hello @mbrancato! Could you please check whether this vmagent gets restarted or its config gets updated around these spikes? The metrics that should help you with this are vm_app_uptime_seconds and vmagent_streamaggr_config_reloads_*.
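
For example (the exact label filters here are assumptions and depend on the setup), queries along these lines could be used to check for restarts and config reloads around the spike:

min(vm_app_uptime_seconds{job=~".*vmagent.*"}) by (pod)
increase(vmagent_streamaggr_config_reloads_total[10m])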

@mbrancato
Author

Hey @hagen1778 - based on the logs I do not think it restarted, but here are the metrics you mentioned around that time. It does seem like it added and dropped some target pods at that time (see log entry above), which is normal because we are spinning up and down pods frequently based on request volume.

I have seen / correlated this with a Prometheus config reload previously, but it's not always correlated with the spikes, as in this case.

uptime:
image

reloads_total:
image

reload_success_timestamp_seconds:
image

Even if there was a gap in data (maybe a remote-write issue), I wouldn't think there would be a jump in the counters.

@hagen1778
Collaborator

Hm, I think the gap could explain it...
Is this vmagent responsible for scraping, or does it receive writes from other applications?

@mbrancato
Author

@hagen1778 vmagent is doing the scraping, and doing remote write + streaming aggregation.

@hagen1778
Collaborator

I have seen / correlated this with a Prometheus config reload previously, but it's not always correlated with the spikes, as in this case.

This might be related to #4862

vmagent is doing the scraping, and doing remote write + streaming aggregation.

Hm, this is weird... Do you have a flat hierarchy of vmagents? Or are there vmagents writing into vmagents?
The data gaps are very suspicious to me, as vmagent shouldn't lose any metrics. It buffers data if the network breaks and delivers it as soon as possible.

@mbrancato
Author

@hagen1778 vmagent shards write directly to a VM cluster (vminsert). So I'm sure there could be connection timeouts or drops, but what doesn't make sense is that the _total calculation suddenly changes when the source data does not.

Here is some additional data from today; these are login actions graphed across 3 vmagent shards:

auth_login_total:30s_without_instance_pod_sum_samples - streaming, aggregated sum_samples:
image

auth_login_total:30s_without_instance_pod_total - streaming, aggregated total - this is incorrect and the source of spikes.
image

image

Shard 1 suddenly jumps from 166 to 208. The values of the other shards (0 and 3) do not change.

There are no restarts or odd log entries, but I have filtered out "cannot scrape target" entries in these logs, as they are just misconfigured PodMonitor resources.
image

Resources are also fine on that shard, if overprovisioned:
image

@mbrancato
Author

While I need to upgrade VM to see if this still happens, and will do so soon, I believe I have isolated this to restarts in vmagent shards. The fix here is probably that a vmagent shard should not return an aggregated metric until two things happen:

  • All scraping targets have been discovered.
  • A full scrape has been completed of all targets.

The problem is that when a vmagent shard in the deployment is rotated, the new vmagent pod will slowly start to accumulate values for a metric, reporting them gradually over time. The effect this has on the rate() calculation is to make it spike and become inaccurate.

Using a query, we can see the replacement happen:

sum(my_metric:30s_without_instance_pod_total{vmagent=~"vmagent-vmagent-2-75f7cff669-.*"}) by (vmagent)
image

When digging into what is going on with the values in this new pod, we see the count quickly accumulating as it performs initial scraping, until it levels off as it should. The values during initial accumulation produce the spikes in the rate() calculation, and should not be reported until they are stable.

image image

@hagen1778
Collaborator

Thanks for your investigation!

The fix here is probably that a vmagent shard should not return an aggregated metric until two things happen:
All scraping targets have been discovered.
A full scrape has been completed of all targets.

I'm afraid it could be very complicated to define such a moment in time when everything is discovered and scraped, for the following reasons:

  1. Service discovery is dynamic and non-deterministic. It is unclear when it finishes and how many targets it will produce. The first attempt to discover targets could fail, and only the next query will succeed. This could delay aggregation for a significant period of time.
  2. Targets are scraped separately at different moments in time; each target is scraped within its own scrape interval. Moreover, there could be jobs with different scrape intervals. So waiting for all of them to be scraped would mean either over-accumulating or under-accumulating data.

From what you described, I assume the problem is the following:

  1. There is a counter metric on the scraped targets; its value grows monotonically.
  2. vmagents scrape this metric and aggregate it, producing a monotonically increasing counter.
  3. On vmagent restart, the aggregated counter may produce a value lower than the value it produced before the restart. This makes VictoriaMetrics think that the counter reset to a low value and then quickly increased, hence rate() results in huge values.
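
For example (illustrative numbers, not taken from the graphs above): if the aggregated counter was at 1,000,000 before the restart, and the replacement shard flushes 20,000 on its first interval and 500,000 on the next one while it catches up, rate() treats the drop to 20,000 as a counter reset and then sees an increase of 480,000 within a single 30s interval, i.e. roughly 16,000/s, even though the real traffic is only a few thousand per second.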

I think the proper solution would be to persist the state of the aggregations on vmagent shutdown, so that after a restart the produced counters won't start from 0.

@mbrancato
Author

mbrancato commented Jan 17, 2024

Thanks @hagen1778 - I think simply waiting until some interval has fully passed for any aggregation since startup is also viable, but obviously it's lossy. Without loading persisted aggregation data (which is currently not available), I would ask why vmagent is reporting values it should know are not yet complete. I think reporting incorrect values here is the real issue. For determining when to start reporting, it doesn't have to be perfect, just better: something like 2 intervals after data is first captured seems like it would solve this, while leaving a gap in metrics, which already exists, so it's a net benefit in my view.

When using Kubernetes and the VM operator, it's not simply restarts. I mainly see this on pod replacement: same deployment and replica set, but a new replacement pod is brought up for whatever reason, maybe because a node was cycled. You can see this with the pod names below; the metric is sum(rate(...)), filtered to vmagent shard number 3.

As for persisting aggregation state per shard across restarts, I think vmagent may be left in the same boat, with slow metric loading from blob storage and gaps in metrics.

image

@hagen1778
Collaborator

Thanks @hagen1778 - I think simply waiting until some interval has fully passed for any aggregation since startup is also viable, but obviously it's lossy.

This would require the user to specify this interval manually, which might not be trivial to explain and configure. Besides, it seems like it would have a similar result to extending the aggregation interval to a bigger value, which would cover possible delays in target discovery.

Without loading persisted aggregation data (which is currently not available), I would ask why is vmagent reporting values it should know are not yet complete?

The aggregation happens in a process separate from scraping, because it may need to aggregate values from many targets. With this design, aggregation isn't aware of anything being scraped or discovered; it just performs aggregations in the background for everything it is fed.

I think reporting incorrect values here is more to the point.

Agreed. I think we can solve it in a different way. It looks like you're using the total output for calculating rate afterwards. Would it help if vmagent gained support for a rate aggregation function?

@sergeyshevch

Hi! We have exactly the same issue with spikes in the data.

We have an aggregation interval of 15 minutes for cold storage. VMAgent scrapes all targets. It is deployed with VMOperator with:

    spec:
      replicaCount: 2
      shardCount: 1

@hagen1778
Collaborator

@sergeyshevch I think your problem is related to #5643, not to the case in this ticket.

@mbrancato
Author

@hagen1778 I don't think a rate aggregation helps me here, since the rate (and probably increase) calculation would be limited to specific timeframes, so I couldn't then convert that from a 1m interval to a 1h interval.

@mbrancato
Author

@hagen1778 After upgrading to 1.101, this issue (or a similar one) appears to be even more prevalent. I have not enabled flushing to storage for vmagent shards yet, since the documentation recommended against it.

I was able to easily show some bad data using Grafana. Here, the nginx requests can be seen increasing by many millions over just a few seconds.
image

I was able to then jump into the UI for that vmagent and get the scrape target pods:
image

I then port-forwarded to each nginx pod and summed the values:

% curl --silent localhost:10254/metrics | grep nginx_ingress_controller_requests | cut -d'}' -f 2 | awk '{ sum += $1 } END { print sum }'
3948112
% curl --silent localhost:10254/metrics | grep nginx_ingress_controller_requests | cut -d'}' -f 2 | awk '{ sum += $1 } END { print sum }'
2857914

The results are nowhere near what vmagent is reporting for the total aggregation. I understand the total agg is a running increase, versus sum_samples which is not and may go up and down.

@hagen1778
Collaborator

  • I think simply waiting until some interval has fully passed for any aggregation since startup is also viable, but obviously it's lossy

Btw, this is now possible to define via ignore_first_intervals - see https://docs.victoriametrics.com/stream-aggregation/#ignore-aggregation-intervals-on-start
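
A minimal sketch of how that could look for the aggregation rule from the original report (the option name comes from the linked docs; the value 2 is only an illustration):

          - match:
            - '{__name__=~"myapp_.+"}'
            interval: "30s"
            outputs: ["total", "sum_samples"]
            without: ["pod", "instance"]
            # skip flushing results for the first N aggregation intervals after startup,
            # so partially accumulated state is not reported
            ignore_first_intervals: 2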

The results are nowhere near what vmagent is reporting for the total aggregation. I understand the total agg is a running increase, versus sum_samples which is not and may go up and down.

Do you drop the raw metrics? If not, could you please show the raw values of these time series, without aggregation, over the same period of time?

Could you please also show the config of your aggregation rule? The metric name in the screenshot is different from the rule you posted in the initial message.

@mbrancato
Author

mbrancato commented May 30, 2024

I do drop raw metrics, but let me re-enable them for these nginx metrics and see if I can provide that.

Currently my vmagent config looks like this:

  remoteWrite:
    - url: ...
      sendTimeout: "2m"
      inlineUrlRelabelConfig:
        # Drop all container metrics unless they are aggregated. This reduces the value of the
        # metrics, but reduces the cardinality. Enable specific container metrics needed here
        # for investigation.
        - action: drop_metrics
          regex: "^container_[^:]*"
        - action: drop_metrics
          regex: "^kube_pod_[^:]*$"
        - action: drop_metrics
          regex: "^prober_probe_[^:]*$"
        - action: drop_metrics
          regex: "^kube_endpoint_address$"
        # Disabling this to help debug this issue:
        # https://github.com/VictoriaMetrics/VictoriaMetrics/issues/4966
        #- action: drop_metrics
        #  regex: "^nginx_[^:]*$"
    - url: ...
      sendTimeout: "2m"
      streamAggrConfig:
        keepInput: false
        rules:
          - match:
              - '{__name__=~"^container_.+"}'
            interval: "30s"
            outputs: ["total"]
            without: ["id", "name", "instance", "pod", "node"]
          - match:
              - '{__name__=~"^kube_pod_.+"}'
            interval: "30s"
            outputs: ["total"]
            without: ["uid", "pod", "container_id"]
          - match:
              - '{__name__=~"^prober_probe_.+"}'
            interval: "30s"
            outputs: ["total"]
            without: ["pod", "instance"]
          - match:
              - '{__name__=~"^nginx_.+"}'
            interval: "30s"
            outputs: ["total"]
            without: ["pod", "instance", "controller_pod"]

I just adjusted the config, but I might need to set keepInput: true since I first created this config back when that had a bug in it.

@mbrancato
Author

Hey @hagen1778 - I re-enabled the raw metrics, and there is a stark difference in values here:
image

image

And zoomed in with 5s steps and summed by nginx pod.
image

@AndrewChubatiuk
Contributor

@mbrancato from the graphs you've shared, the increase in the interval happens close to a pod restart. Please try setting staleness_interval: 5m (the default staleness interval is 2x the aggregation interval) so that stale metrics are not removed from the state. You can also try total_prometheus instead of total to avoid adding the difference between zero and the initial sample value to the total.
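
A sketch of what those suggestions could look like for the nginx rule from the config posted above (the values are just the ones suggested in this comment, not a verified recommendation):

          - match:
              - '{__name__=~"^nginx_.+"}'
            interval: "30s"
            # keep per-series state longer than the default 2 x interval,
            # so series that briefly disappear are not treated as new
            staleness_interval: "5m"
            # total_prometheus ignores the very first sample of a new series
            # instead of counting it as an increase from zero
            outputs: ["total_prometheus"]
            without: ["pod", "instance", "controller_pod"]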

@mbrancato
Author

@AndrewChubatiuk I'll make those changes. I'm not quite sure I understand the point about pod restarts; I don't see that happening.

Also, the total aggregate appears to jump every minute or so, but this increase is not actually happening in the underlying counters.

@AndrewChubatiuk
Contributor

AndrewChubatiuk commented May 31, 2024

I'm not quite sure I understand about the pod restarts
image

At about 15:14:45 the yellow graph disappears for some time.
At about 15:16:45 the blue one disappears for a short period of time.
The same happens for the green one at 15:18:45.
The total_prometheus output is a better choice for such a case than total.

@mbrancato
Author

@AndrewChubatiuk these are not pod restarts. The gaps are generated by vmagent. It is scraping on what is likely a 30s interval, and some of the gaps are 5-10 seconds.

Is there a way to investigate, or a metric to identify, whether vmagent can't keep up with scraping or recording? We get some scrape errors, but they are normally just related to pods not (yet) having a metrics endpoint up. We have 4 shards currently.

That said, it seems unrelated to the increases, as many times those sudden increases happen with no gap in the metric data.

@mbrancato
Author

I will investigate using total_prometheus next week, but I'm cautious about changing the behavior there. That type of aggregation was not available when I first configured this.

@mbrancato
Author

@AndrewChubatiuk setting the staleness interval helped a lot; however, I still notice that some total aggregations for deployments that scale up and down a lot are having issues. These deployments frequently change the number of replicas, but there is no stale data: I'd say the max staleness of the data is 1m, and I set the staleness interval to 5m for aggregation.

So I also tried total_prometheus, which seemed to have no effect - I think it's related to this bugfix that is not yet released:

BUGFIX: [stream aggregation](https://docs.victoriametrics.com/stream-aggregation/): set correct suffix <output>_prometheus for aggregation outputs [increase_prometheus](https://docs.victoriametrics.com/stream-aggregation/#increase_prometheus) and [total_prometheus](https://docs.victoriametrics.com/stream-aggregation/#total_prometheus). Before, outputs total and total_prometheus or increase and increase_prometheus had the same suffix.

Here is an aggregated counter used to produce sum(rate()) - this is doing:

sum(rate(my_counter:30s_without_instance_pod_total{vmagent="vmagent-vmagent-0-66696444fc-xjfbm"}[5m])) by (vmagent)
image

And here is the same data just as sum() for that same vmagent pod:
image

So I think there is still some issue with the streaming agg, but there's no reason that rate() should produce such a huge spike.

@AndrewChubatiuk
Contributor

AndrewChubatiuk commented Jun 5, 2024

So I also tried total_prometheus which seemed to have no effect

The total_prometheus output in existing releases has the same suffix as the total output; the new release only changes the suffix.

These deployments frequently change the number of replicas, but there is no stale data

can you provide an example of metrics and aggregations for these metrics?

@mbrancato
Author

mbrancato commented Jun 5, 2024

The total_prometheus output in existing releases has the same suffix as the total output; the new release only changes the suffix.

I'll probably need to wait till this is released - I don't want to disable what we have today.

can you provide an example of metrics and aggregations for these metrics?

The last two graphs above are those examples. I'm wondering if, with sharding, the pods are not "sticky" to a shard: as new pods come up and go down, eventually a pod that has been running for some time gets moved over to a new shard or something. That still doesn't account for the fact that the sum() of the aggregated data and the sum(rate()) above show a rate increase that is not visible in the aggregated counter alone.

@mbrancato
Author

@AndrewChubatiuk I took some screenshots of just the aggregation; it's jumping around a lot. I don't feel like the aggregation should have discontinuities if it's a running total. How does it go down several times when there is no restart?

Screenshot 2024-06-05 at 7 30 42 PM Screenshot 2024-06-05 at 7 31 11 PM Screenshot 2024-06-05 at 7 31 28 PM Screenshot 2024-06-05 at 7 31 44 PM

@AndrewChubatiuk
Contributor

AndrewChubatiuk commented Jun 6, 2024

Let me give you an example to better understand how the total aggregation works.
Let's imagine that during a 30s interval the agent performed two scrapes, and during the first one it received:

foo_metric{pod="pod1", env="stage"} 5
foo_metric{pod="pod2", env="stage"} 3
foo_metric{pod="pod1", env="prod"} 2
foo_metric{pod="pod2", env="prod"} 8

and during the second one:

foo_metric{pod="pod3", env="stage"} 7
foo_metric{pod="pod2", env="stage"} 4
foo_metric{pod="pod1", env="prod"} 5
foo_metric{pod="pod2", env="prod"} 10

with the aggregation config:

- match:
    - '{__name__=~"^foo_.+"}'
  interval: "30s"
  outputs: ["total"]
  without: ["pod"]

total is calculated as the sum of the independent increases of each input series with a unique label set, aggregated over the labels listed in by or, for without, over all labels except those listed in the without section. In the example above, the result is:

  • for foo_metric:30s_without_pod_total{env="stage"} = 5-0 + 3-0 + 7-0 + 4-3 = 16
  • for foo_metric:30s_without_pod_total{env="prod"} = 2-0 + 8-0 + 5-2 + 10-8 = 15

Please note that during the second scrape a metric appeared with a label set that didn't exist during previous scrapes (pod="pod3", env="stage"), and despite the fact that a total value for env="stage" already exists, the increase for this label set is calculated independently: 7-0.

These deployments frequently change the number of replicas, but there is no stale data

As I showed in the example, if a metric for a pod with a new id appears (or was not present during the staleness interval), its initial value (7-0) will be added to the total. The total_prometheus output deals with initial samples in a different manner: it calculates the increase starting from the second sample, so as not to have spikes in the output. In the example above, if I replace the total output with total_prometheus, the result is:

  • for foo_metric:30s_without_pod_total{env="stage"} = 4-3 = 1
  • for foo_metric:30s_without_pod_total{env="prod"} = 5-2 + 10-8 = 5

I took some screenshots of just the aggregation; it's jumping around a lot. I don't feel like the aggregation should have discontinuities if it's a running total. How does it go down several times when there is no restart?

Looking at the code, the only possible reason for a total reset besides a restart that I see is when the total value is greater than 1 << 53 = 9007199254740992.
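
(For context, 2^53 is the largest integer a float64 can represent with full precision, which is presumably why the implementation resets the total at that point.)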

@mbrancato
Author

@AndrewChubatiuk Thanks for the explanation.

The counters for these new pods always start at 0, so there is no good reason we would see large jumps.

As for some integer / float limit on the number, the aggregated total was around 121934100 before resetting, well below that value.

What happens to the aggregation when a metric / label set disappears, no longer exists, and falls out beyond the staleness interval? Do you have unit tests for this? I could play around with a branch locally to try to reproduce it.

@mbrancato
Author

@AndrewChubatiuk I added your example to the existing TestAggregatorsSuccess function; it did not match the expected results.

	f(
		`     
- interval: 1m
  without: [pod]
  outputs: [total]
`, `
foo_metric{pod="pod1", env="stage"} 5
foo_metric{pod="pod2", env="stage"} 3
foo_metric{pod="pod1", env="prod"} 2
foo_metric{pod="pod2", env="prod"} 8
foo_metric{pod="pod3", env="stage"} 7
foo_metric{pod="pod2", env="stage"} 4
foo_metric{pod="pod1", env="prod"} 5
foo_metric{pod="pod2", env="prod"} 10
`, `foo_metric:1m_without_pod_total{env="prod"} 15
foo_metric:1m_without_pod_total{env="stage"} 16
`, "11111111",
	)
=== RUN   TestAggregatorsSuccess
    streamaggr_test.go:959: unexpected output metrics;
        got
        foo_metric:1m_without_pod_total{env="prod"} 5
        foo_metric:1m_without_pod_total{env="stage"} 1
        
        want
        foo_metric:1m_without_pod_total{env="prod"} 15
        foo_metric:1m_without_pod_total{env="stage"} 16
--- FAIL: TestAggregatorsSuccess (0.01s)

I also didn't find any tests that really simulated staleness issues, so I hacked together another test function:

func TestAggregatorsChanges(t *testing.T) {
	f := func(
		config string, interval time.Duration, inputMetrics []string, outputMetricsExpected string,
	) {
		t.Helper()

		// Initialize Aggregators
		var tssOutput []prompbmarshal.TimeSeries
		var tssOutputLock sync.Mutex
		pushFunc := func(tss []prompbmarshal.TimeSeries) {
			tssOutputLock.Lock()
			tssOutput = make([]prompbmarshal.TimeSeries, 0)
			for _, ts := range tss {
				tssOutput = append(
					tssOutput, prompbmarshal.TimeSeries{
						Labels:  append(ts.Labels[:0:0], ts.Labels...),
						Samples: append(ts.Samples[:0:0], ts.Samples...),
					},
				)
			}
			tssOutputLock.Unlock()
		}
		opts := &Options{
			FlushOnShutdown:        true,
			NoAlignFlushToInterval: true,
		}
		a, err := newAggregatorsFromData([]byte(config), pushFunc, opts)
		if err != nil {
			t.Fatalf("cannot initialize aggregators: %s", err)
		}

		// Push the inputMetrics to Aggregators
		for _, inputMetric := range inputMetrics {
			tssInput := mustParsePromMetrics(inputMetric)
			_ = a.Push(tssInput, nil)
			time.Sleep(interval)
		}

		a.MustStop()

		// Verify the tssOutput contains the expected metrics
		outputMetrics := timeSeriessToString(tssOutput)
		if outputMetrics != outputMetricsExpected {
			t.Errorf(
				"unexpected output metrics;\ngot\n%s\nwant\n%s", outputMetrics,
				outputMetricsExpected,
			)
		}
	}

	f(
		`
- interval: 1s 
  without: [de]
  outputs: [total]
  staleness_interval: 2s
`, time.Second,
		[]string{
			`
foo{abc="qwe",de="asd"} 1
foo{abc="qwe",de="sj"} 2`,
			`
foo{abc="qwe",de="asd"} 3
foo{abc="qwe",de="sj"} 4`,
		}, `foo:1s_without_de_total{abc="qwe"} 4
`,
	)

	f(
		`
- interval: 1s 
  without: [de]
  outputs: [total]
  staleness_interval: 2s
`, time.Second,
		[]string{
			// first values, these are ignored - 0
			`
foo{abc="qwe",de="asd"} 1
foo{abc="qwe",de="sj"} 2`,
			// increment each by 2, 4 total
			`
foo{abc="qwe",de="asd"} 3
foo{abc="qwe",de="sj"} 4`,
			// no change
			`
foo{abc="qwe",de="sj"} 4`,
			// no change
			`
foo{abc="qwe",de="sj"} 4`,
			// no change
			`
foo{abc="qwe",de="sj"} 4`,
			// new labels added, first value should be 0
			`
foo{abc="qwe",de="eu"} 5
foo{abc="qwe",de="sj"} 4`,
			// increment from 5 to 7, so 2 + 4 = 6
			`
foo{abc="qwe",de="eu"} 7
foo{abc="qwe",de="sj"} 4`,
		}, `foo:1s_without_de_total{abc="qwe"} 6
`,
	)

	f(
		`
- interval: 1s 
  without: [pod]
  outputs: [total]
  staleness_interval: 2s
`, time.Second,
		[]string{
			`
foo_metric{pod="pod1", env="stage"} 5
foo_metric{pod="pod2", env="stage"} 3
foo_metric{pod="pod1", env="prod"} 2
foo_metric{pod="pod2", env="prod"} 8`,
			`
foo_metric{pod="pod3", env="stage"} 7
foo_metric{pod="pod2", env="stage"} 4
foo_metric{pod="pod1", env="prod"} 5
foo_metric{pod="pod2", env="prod"} 10`,
		}, `foo_metric_30s_without_pod_total{env="stage"} 16
foo_metric_30s_without_pod_total{env="prod"} 15
`,
	)

}

I think there's some nuance here where total ignores the first datapoint for all metrics, but after it has started, it does not do that. I'm still looking to understand how the total jumps so much at once; I can only think that sharding issues are coming into play.

@AndrewChubatiuk
Contributor

The tests do not work as expected because, in the total output implementation, the staleness interval is also used as an initial timeout during which the first samples are ignored:

keepFirstSample := as.keepFirstSample && currentTime > as.ignoreFirstSampleDeadline
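
This would explain the results above: within that initial window, total effectively behaves like total_prometheus, so the first test produced stage = 4-3 = 1 and prod = (5-2) + (10-8) = 5 instead of the expected 16 and 15.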

@mbrancato
Author

@AndrewChubatiuk @hagen1778 I just wanted to give a quick follow-up. I've been running v1.102.0-rc1 for a while now since it contains the total_prometheus fix for the aggregation suffix.

I've not seen any spikes since I enabled total_prometheus. I've not seen any on the total aggregation anymore either.

Similar to what I posted above, the aggregations for total, total_prometheus, and the raw data look very similar.
image

image

The actual sum is not important; I know the raw data isn't the same as the running total aggregation. But the important thing is that rate and increase seem correct now.
