Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

query results may incorrectly overlap time series #748

Closed
belm0 opened this issue Sep 7, 2020 · 18 comments
Closed

query results may incorrectly overlap time series #748

belm0 opened this issue Sep 7, 2020 · 18 comments
Labels
bug Something isn't working

Comments

@belm0
Copy link
Contributor

belm0 commented Sep 7, 2020

Describe the bug

If query step is much greater than data interval (e.g. > 4x), and two series are adjacent in time but not overlapping, the query output or aggregation may incorrectly overlap the time series.

A typical example is build_info{version="..."}. A new version is deployed to instances, which stop updating build_info{version="1.0"} and start updating build_info{version="1.1"}. Due to this bug, at low zoom resolution there will be a point in time where count(build_info{instance="...")) returns 2, even though there is no overlap in the raw data points.

The equivalent query on Prometheus (Thanos) does not exhibit the problem.

To Reproduce

Raw datapoints:

build_info{instance="foo"}[20m]

time                 version
2020-07-22 00:45:10  20.05.2
2020-07-22 00:46:10  20.05.2
2020-07-22 00:47:10  20.05.2
2020-07-22 00:48:10  20.05.2
2020-07-22 00:51:56  20.05.3
2020-07-22 00:52:56  20.05.3
2020-07-22 00:58:56  20.05.3
2020-07-22 01:02:11  20.05.3

query, step 60 (no overlap)

build_info{instance="foo}

2020-07-22 00:47:00  20.05.2
2020-07-22 00:48:00  20.05.2
2020-07-22 00:49:00  20.05.2
2020-07-22 00:52:00  20.05.3
2020-07-22 00:53:00  20.05.3
2020-07-22 00:54:00  20.05.3

query, step 240

build_info{instance="foo}

2020-07-22 00:48:00  20.05.2
2020-07-22 00:52:00  20.05.2    ** overlap **
2020-07-22 00:52:00  20.05.3    ** overlap **
2020-07-22 00:56:00  20.05.3
2020-07-22 01:00:00  20.05.3

Expected behavior

If two series are not overlapping in time by raw data, the query should not treat them as overlapping when evaluating one interval in the output.

An example implementation would be to treat "start" and "end" points of a series differently when quantizing raw data points into time buckets: include the series in the bucket if 1) raw points are continuous in the bucket range, or 2) the series starts in the bucket range. Therefore, if a series ends in a bucket range, it is not included. It's similar to the concept of an open-ended range.

Screenshots

Example graph showing artificial spikes in count(build_info) when there is a deployment causing the version label to change:
Screen Shot 2020-09-05 at 12 45 38 AM

Version

victoria-metrics-20200815-125320-tags-v1.40.0-0-ged00eb3f3

@valyala valyala added the bug Something isn't working label Sep 7, 2020
@valyala
Copy link
Collaborator

valyala commented Sep 7, 2020

@belm0 , thanks for the detailed bug report and the proposed solution! The solution looks good. We'll try implementing it and see how it works.

@belm0
Copy link
Contributor Author

belm0 commented Sep 8, 2020

Thank you. For discrepancies like this, it would be nice for VM to have unit tests against the output of the Prometheus query library.

@propertone
Copy link

+1, I've run into the same issue

@propertone
Copy link

The other (related) issue is that range queries do not appear to respect the -search.maxStalenessInterval option, while the instant query does.

@valyala
Copy link
Collaborator

valyala commented Sep 21, 2020

The other (related) issue is that range queries do not appear to respect the -search.maxStalenessInterval option, while the instant query does.

Could you file a separate issue regarding this bug?

@belm0
Copy link
Contributor Author

belm0 commented Oct 13, 2020

@valyala would you consider prioritizing this? It's the main regression we have remaining vs. Prometheus.

valyala added a commit that referenced this issue Oct 13, 2020
This should prevent from double counting for time series at the time when it changes label.
The most common case is in K8S, which changes pod uid label with each new deployment.

Updates #748
valyala added a commit that referenced this issue Oct 13, 2020
This should prevent from double counting for time series at the time when it changes label.
The most common case is in K8S, which changes pod uid label with each new deployment.

Updates #748
@valyala
Copy link
Collaborator

valyala commented Oct 13, 2020

@belm0 , could you verify whether the issue is fixed in the following commits:

Note that the response cache must be reset before testing the bugfix in order to remove previously cached incorrect results. See https://victoriametrics.github.io/#backfilling for more info on how to reset response cache.

@valyala
Copy link
Collaborator

valyala commented Oct 13, 2020

The bugfix has been included in VictoriaMetrics v1.44.0. @belm0 , @propertone , could you verify whether it works as expected on your workloads?

@belm0
Copy link
Contributor Author

belm0 commented Oct 14, 2020

It seems to be resolved in v1.44, thank you!

I added a review comment to the commit.

@belm0 belm0 closed this as completed Oct 14, 2020
@belm0
Copy link
Contributor Author

belm0 commented Oct 14, 2020

Now I see the opposite problem, where series unexpectedly disappear before they end (for example at head of the series).

I think it's related to my comment on the commit about correctness of the 90% heuristic.

Screen Shot 2020-10-14 at 2 21 48 PM

@belm0 belm0 reopened this Oct 14, 2020
@valyala
Copy link
Collaborator

valyala commented Oct 17, 2020

@belm0 , could you share the original query used for building the graph above? It would be great if you could simplify and narrow down the query with more specific label filters, so it is based on a single source time series and still would expose the issue.

valyala added a commit that referenced this issue Oct 17, 2020
…ailing data points in time series

Now trailing data points are additionally dropped for time series with a single raw sample

Updates #748
valyala added a commit that referenced this issue Oct 17, 2020
…ailing data points in time series

Now trailing data points are additionally dropped for time series with a single raw sample

Updates #748
@valyala
Copy link
Collaborator

valyala commented Oct 17, 2020

Adjusted heuristic a bit in the following commits:

  • single-node VictoriaMetrics - 28353e4
  • cluster VictoriaMetrics - ee2902d

Now VictoriaMetrics should drop trailing points for time series containing only a single raw sample on the selected time range.

@valyala
Copy link
Collaborator

valyala commented Nov 1, 2020

FYI, this introduces regression in queries to /api/v1/query for timestamps close to the current time - see #845 .

@valyala
Copy link
Collaborator

valyala commented Nov 2, 2020

FYI, VictoriaMetrics v1.45.0 contains additional enhancements for this issue.

@belm0
Copy link
Contributor Author

belm0 commented Nov 3, 2020

the problem near end of query range seems to be resolved in 1.45.0

@hagen1778
Copy link
Collaborator

#1526

valyala added a commit that referenced this issue Aug 13, 2021
valyala added a commit that referenced this issue Aug 13, 2021
@valyala
Copy link
Collaborator

valyala commented Aug 13, 2021

FYI, VictoriaMetrics gains support for Prometheus staleness markers in the next release as mentioned at #1526 . This should fix the original issue with overlapped time series in more generic way. The fix works only for queries over newly ingested data. It doesn't work for already existing data.

@valyala
Copy link
Collaborator

valyala commented Aug 15, 2021

FYI, VictoriaMetrics and vmagent gained support for Prometheus staleness markers starting from the release v1.64.0.

valyala added a commit that referenced this issue Aug 16, 2021
… scrapers for the added targets

This should prevent from possible time series overlap when old target is substituted by new target (for example, during Kubernetes deployments).

Updates #1526
Updates #1530
Updates #748
Updates #1509
valyala added a commit that referenced this issue Aug 16, 2021
… scrapers for the added targets

This should prevent from possible time series overlap when old target is substituted by new target (for example, during Kubernetes deployments).

Updates #1526
Updates #1530
Updates #748
Updates #1509
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bug Something isn't working
Projects
None yet
Development

No branches or pull requests

4 participants