query results may incorrectly overlap time series #748

belm0 · 2020-09-07T08:49:17Z

Describe the bug

If query step is much greater than data interval (e.g. > 4x), and two series are adjacent in time but not overlapping, the query output or aggregation may incorrectly overlap the time series.

A typical example is build_info{version="..."}. A new version is deployed to instances, which stop updating build_info{version="1.0"} and start updating build_info{version="1.1"}. Due to this bug, at low zoom resolution there will be a point in time where count(build_info{instance="...")) returns 2, even though there is no overlap in the raw data points.

The equivalent query on Prometheus (Thanos) does not exhibit the problem.

To Reproduce

Raw datapoints:

build_info{instance="foo"}[20m]

time                 version
2020-07-22 00:45:10  20.05.2
2020-07-22 00:46:10  20.05.2
2020-07-22 00:47:10  20.05.2
2020-07-22 00:48:10  20.05.2
2020-07-22 00:51:56  20.05.3
2020-07-22 00:52:56  20.05.3
2020-07-22 00:58:56  20.05.3
2020-07-22 01:02:11  20.05.3

query, step 60 (no overlap)

build_info{instance="foo}

2020-07-22 00:47:00  20.05.2
2020-07-22 00:48:00  20.05.2
2020-07-22 00:49:00  20.05.2
2020-07-22 00:52:00  20.05.3
2020-07-22 00:53:00  20.05.3
2020-07-22 00:54:00  20.05.3

query, step 240

build_info{instance="foo}

2020-07-22 00:48:00  20.05.2
2020-07-22 00:52:00  20.05.2    ** overlap **
2020-07-22 00:52:00  20.05.3    ** overlap **
2020-07-22 00:56:00  20.05.3
2020-07-22 01:00:00  20.05.3

Expected behavior

If two series are not overlapping in time by raw data, the query should not treat them as overlapping when evaluating one interval in the output.

An example implementation would be to treat "start" and "end" points of a series differently when quantizing raw data points into time buckets: include the series in the bucket if 1) raw points are continuous in the bucket range, or 2) the series starts in the bucket range. Therefore, if a series ends in a bucket range, it is not included. It's similar to the concept of an open-ended range.

Screenshots

Example graph showing artificial spikes in count(build_info) when there is a deployment causing the version label to change:

Version

victoria-metrics-20200815-125320-tags-v1.40.0-0-ged00eb3f3

The text was updated successfully, but these errors were encountered:

valyala · 2020-09-07T20:37:38Z

@belm0 , thanks for the detailed bug report and the proposed solution! The solution looks good. We'll try implementing it and see how it works.

belm0 · 2020-09-08T00:50:30Z

Thank you. For discrepancies like this, it would be nice for VM to have unit tests against the output of the Prometheus query library.

propertone · 2020-09-19T02:10:13Z

+1, I've run into the same issue

propertone · 2020-09-19T15:21:23Z

The other (related) issue is that range queries do not appear to respect the -search.maxStalenessInterval option, while the instant query does.

valyala · 2020-09-21T18:59:42Z

The other (related) issue is that range queries do not appear to respect the -search.maxStalenessInterval option, while the instant query does.

Could you file a separate issue regarding this bug?

belm0 · 2020-10-13T00:14:48Z

@valyala would you consider prioritizing this? It's the main regression we have remaining vs. Prometheus.

This should prevent from double counting for time series at the time when it changes label. The most common case is in K8S, which changes pod uid label with each new deployment. Updates #748

valyala · 2020-10-13T09:37:26Z

@belm0 , could you verify whether the issue is fixed in the following commits:

single-node VictoriaMetrics - 9aa3b65 ( see build instructions)
cluster version of VictoriaMetrics - 217c192 (see build instructions )

Note that the response cache must be reset before testing the bugfix in order to remove previously cached incorrect results. See https://victoriametrics.github.io/#backfilling for more info on how to reset response cache.

valyala · 2020-10-13T15:08:03Z

The bugfix has been included in VictoriaMetrics v1.44.0. @belm0 , @propertone , could you verify whether it works as expected on your workloads?

belm0 · 2020-10-14T05:13:13Z

It seems to be resolved in v1.44, thank you!

I added a review comment to the commit.

belm0 · 2020-10-14T05:25:14Z

Now I see the opposite problem, where series unexpectedly disappear before they end (for example at head of the series).

I think it's related to my comment on the commit about correctness of the 90% heuristic.

valyala · 2020-10-17T07:17:53Z

@belm0 , could you share the original query used for building the graph above? It would be great if you could simplify and narrow down the query with more specific label filters, so it is based on a single source time series and still would expose the issue.

…ailing data points in time series Now trailing data points are additionally dropped for time series with a single raw sample Updates #748

valyala · 2020-10-17T07:47:54Z

Adjusted heuristic a bit in the following commits:

single-node VictoriaMetrics - 28353e4
cluster VictoriaMetrics - ee2902d

Now VictoriaMetrics should drop trailing points for time series containing only a single raw sample on the selected time range.

valyala · 2020-11-01T21:53:20Z

FYI, this introduces regression in queries to /api/v1/query for timestamps close to the current time - see #845 .

Updates #845 Updates #748

valyala · 2020-11-02T02:22:18Z

FYI, VictoriaMetrics v1.45.0 contains additional enhancements for this issue.

belm0 · 2020-11-03T03:13:46Z

the problem near end of query range seems to be resolved in 1.45.0

hagen1778 · 2021-08-09T11:29:32Z

#1526

Updates #1526 Updates #748 Updates #1509 Updates #1530 Updates #845

valyala · 2021-08-13T09:35:40Z

FYI, VictoriaMetrics gains support for Prometheus staleness markers in the next release as mentioned at #1526 . This should fix the original issue with overlapped time series in more generic way. The fix works only for queries over newly ingested data. It doesn't work for already existing data.

valyala · 2021-08-15T22:03:13Z

FYI, VictoriaMetrics and vmagent gained support for Prometheus staleness markers starting from the release v1.64.0.

… scrapers for the added targets This should prevent from possible time series overlap when old target is substituted by new target (for example, during Kubernetes deployments). Updates #1526 Updates #1530 Updates #748 Updates #1509

valyala added the bug Something isn't working label Sep 7, 2020

propertone mentioned this issue Sep 22, 2020

range queries don't respect maxStalenessInterval #784

Closed

belm0 closed this as completed Oct 14, 2020

belm0 reopened this Oct 14, 2020

valyala added a commit that referenced this issue Oct 17, 2020

app/vmselect/promql: an attempt to improve heuristics for dropping tr…

28353e4

…ailing data points in time series Now trailing data points are additionally dropped for time series with a single raw sample Updates #748

valyala added a commit that referenced this issue Oct 17, 2020

app/vmselect/promql: an attempt to improve heuristics for dropping tr…

ee2902d

…ailing data points in time series Now trailing data points are additionally dropped for time series with a single raw sample Updates #748

valyala added a commit that referenced this issue Nov 1, 2020

app/vmselect/promql: do not drop trailing datapoints for instant queries

7c0b658

Updates #845 Updates #748

valyala added a commit that referenced this issue Nov 1, 2020

app/vmselect/promql: do not drop trailing datapoints for instant queries

daacbc7

Updates #845 Updates #748

belm0 closed this as completed Nov 3, 2020

valyala mentioned this issue Jun 9, 2021

problem with count(), large step size, and grafana #1354

Closed

valyala mentioned this issue Aug 3, 2021

Query_range wrong last value #1509

Closed

valyala added a commit that referenced this issue Aug 13, 2021

all: add support for Prometheus staleness markers

4401464

Updates #1526 Updates #748 Updates #1509 Updates #1530 Updates #845

valyala added a commit that referenced this issue Aug 13, 2021

all: add support for Prometheus staleness markers

c1f81f0

Updates #1526 Updates #748 Updates #1509 Updates #1530 Updates #845

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

query results may incorrectly overlap time series #748

query results may incorrectly overlap time series #748

belm0 commented Sep 7, 2020 •

edited

Loading

valyala commented Sep 7, 2020

belm0 commented Sep 8, 2020

propertone commented Sep 19, 2020

propertone commented Sep 19, 2020

valyala commented Sep 21, 2020

belm0 commented Oct 13, 2020

valyala commented Oct 13, 2020 •

edited

Loading

valyala commented Oct 13, 2020

belm0 commented Oct 14, 2020

belm0 commented Oct 14, 2020

valyala commented Oct 17, 2020

valyala commented Oct 17, 2020

valyala commented Nov 1, 2020

valyala commented Nov 2, 2020

belm0 commented Nov 3, 2020

hagen1778 commented Aug 9, 2021

valyala commented Aug 13, 2021

valyala commented Aug 15, 2021

query results may incorrectly overlap time series #748

query results may incorrectly overlap time series #748

Comments

belm0 commented Sep 7, 2020 • edited Loading

Describe the bug

To Reproduce

Expected behavior

Screenshots

Version

valyala commented Sep 7, 2020

belm0 commented Sep 8, 2020

propertone commented Sep 19, 2020

propertone commented Sep 19, 2020

valyala commented Sep 21, 2020

belm0 commented Oct 13, 2020

valyala commented Oct 13, 2020 • edited Loading

valyala commented Oct 13, 2020

belm0 commented Oct 14, 2020

belm0 commented Oct 14, 2020

valyala commented Oct 17, 2020

valyala commented Oct 17, 2020

valyala commented Nov 1, 2020

valyala commented Nov 2, 2020

belm0 commented Nov 3, 2020

hagen1778 commented Aug 9, 2021

valyala commented Aug 13, 2021

valyala commented Aug 15, 2021

belm0 commented Sep 7, 2020 •

edited

Loading

valyala commented Oct 13, 2020 •

edited

Loading