
Grafana Table doesn't merge query results with VictoriaMetrics datasource #720

Closed
Keiske opened this issue Aug 25, 2020 · 20 comments
Labels: bug (Something isn't working), need more info

Comments

@Keiske

Keiske commented Aug 25, 2020

Describe the bug
Switching the datasource from Prometheus to VictoriaMetrics in Grafana table panels breaks row merging for the same data. We are trying to use VictoriaMetrics as long-term historical metrics storage for our in-cluster Kubernetes Prometheus. We have configured Prometheus to send data to VictoriaMetrics via the remote_write API.
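For reference, a minimal Prometheus remote_write stanza for single-node VictoriaMetrics looks roughly like this (the service name and port are assumptions for a helm-chart install; adjust for your setup):

remote_write:
  # Single-node VictoriaMetrics accepts Prometheus remote_write on /api/v1/write.
  - url: "http://victoria-metrics-single-server:8428/api/v1/write"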

To Reproduce

  1. Install the VictoriaMetrics single-node helm chart.
  2. Add remote_write to VictoriaMetrics in the Prometheus config.
  3. Change the datasource from Prometheus to VictoriaMetrics in Grafana dashboards that have tables with multiple queries. No other Grafana settings are changed.
  4. Query results are not merged anymore. See screenshots.
  5. Change the datasource back to Prometheus - the tables look as they should.

Expected behavior
Results of multiple queries with the same labels should be merged into rows in the Grafana table panel.

Screenshots
Prometheus datasource table result:
[screenshot: prom_grafana_merge2]

VictoriaMetrics datasource table result with same queries and settings:
[screenshot: vm_grafana_merge1]

Version
The line returned when passing the --version command line flag to the binary:

$ ./victoria-metrics-prod --version
victoria-metrics-20200815-125320-tags-v1.40.0-0-ged00eb3f3
@hagen1778
Collaborator

Hi @Keiske! Could you pls check the actual response in both cases and compare? Would be nice to post it here if possible.

@Keiske
Author

Keiske commented Aug 27, 2020

Hi @Keiske! Could you pls check the actual response in both cases and compare? Would be nice to post it here if possible.

Sure, here it is. It looks very similar, but with the order of metrics switched in the results.

First value column query result:
Prometheus:
{"status":"success","data":{"resultType":"vector","result":[{"metric":{"pod":"ag-postgres-portal-0"},"value":[1598532794,"0.25"]},{"metric":{"pod":"ag-postgres-user-0"},"value":[1598532794,"0.25"]}]}}

VictoriaMetrics:
{"status":"success","data":{"resultType":"vector","result":[{"metric":{"pod":"ag-postgres-user-0"},"value":[1598532794,"0.25"]},{"metric":{"pod":"ag-postgres-portal-0"},"value":[1598532794,"0.25"]}]}}

Second value column query results:
Prometheus:
{"status":"success","data":{"resultType":"vector","result":[{"metric":{"pod":"ag-postgres-portal-0"},"value":[1598532794,"0.0908893273381095"]},{"metric":{"pod":"ag-postgres-user-0"},"value":[1598532794,"0.0875724036524064"]}]}}

VictoriaMetrics:
{"status":"success","data":{"resultType":"vector","result":[{"metric":{"pod":"ag-postgres-user-0"},"value":[1598532794,"0.08627032711112406"]},{"metric":{"pod":"ag-postgres-portal-0"},"value":[1598532794,"0.09153162373333341"]}]}}

[screenshot: vmag]

@hagen1778
Collaborator

Thanks! Weird, the content looks identical except for the order. But the order is consistent across both VM requests, so that shouldn't be the cause. Have you tried building any other tables based on the VM datasource? I wonder if it is a Grafana bug...

@valyala
Collaborator

valyala commented Sep 1, 2020

@Keiske, try wrapping the queries into the sort() function. This should guarantee a consistent order of the returned metrics.

@balabalazhoucj

Hi! I have the same problem. My environment uses v1.40.0-cluster.

[screenshot: WX20200909-105454]

But when I change the Grafana time range (from now-5m to now-1m), it is OK; my Prometheus scrape_interval is 1m.

[screenshot: WX20200909-105416]

/ # ./vminsert-prod -version
vminsert-20200815-132700-tags-v1.40.0-cluster-0-gd9f7ea1c6

When I use v1.37.2-cluster, it is OK.

@hagen1778
Collaborator

Hi @balabalazhoucj! Have you tried enabling the instant option for your queries, like in the screenshot from @Keiske?

@Muxa1L

Muxa1L commented Sep 17, 2020

@hagen1778 I think it's because when "instant" is used, the returned timestamp is now - 30s (or the current value of the search.latencyOffset param set for vmselect) instead of the actual timestamp of the metric. And because the queries may complete at different times, these timestamps may differ by 1-2-10-inf ms, and that breaks the table.
Examples.
Not OK (3ms offset for metrics scraped at the same time):
[screenshot]
OK (after 10 refreshes):
[screenshot]
Nothing was changed in the queries or dashboards, only a few refreshes.
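For illustration only, a minimal Go sketch of the effect described above (the 30s offset and the 3ms gap are assumptions, not values taken from vmselect): without an explicit time arg, each query is evaluated at "now - latencyOffset", so the evaluation timestamps inherit the few milliseconds between the two requests and Grafana cannot merge the rows.

package main

import (
	"fmt"
	"time"
)

func main() {
	// Hypothetical offset mirroring the default -search.latencyOffset of 30s.
	latencyOffset := 30 * time.Second

	// Grafana fires both table queries on the same dashboard refresh,
	// but they reach vmselect a few milliseconds apart.
	queryA := time.Now()
	queryB := queryA.Add(3 * time.Millisecond)

	// Without an explicit `time` arg, each query is evaluated at "now - offset",
	// so the result timestamps inherit the millisecond difference.
	tsA := queryA.Add(-latencyOffset).UnixMilli()
	tsB := queryB.Add(-latencyOffset).UnixMilli()

	fmt.Println(tsA, tsB, "equal:", tsA == tsB)
	// Grafana groups table rows by timestamp, so values whose timestamps
	// differ by a few milliseconds land in separate rows instead of merging.
}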

@Keiske
Author

Keiske commented Sep 17, 2020

@valyala Sorting results didn't help.

@hagen1778 Yes, as @Muxa1L said, our issue looks just like this. Refreshing the dashboard in the browser sometimes makes the table merge correctly - about 1 in 10 page refreshes. But how can we make it always merge the results in the table correctly, just like the Prometheus datasource does?

@Muxa1L

Muxa1L commented Sep 17, 2020

By the way, when you disable "instant", thus running a ranged query, the last results of the query are correct.

@hagen1778
Collaborator

@Muxa1L Good catch! Instant queries are sent to the /query handler, and the response contains timestamps with millisecond precision. Regular queries are sent to /query_range, and the response contains timestamps with second precision. Should VM round timestamps up to seconds for instant queries, @valyala?

@Muxa1L

Muxa1L commented Sep 18, 2020

@hagen1778 I think it would be better to return the timestamp of the last metric value.
Example: a single metric and the result of an instant query (to path /query):
{"result":[{"metric":{"__name__":"go_cpu_count","instance":"self","job":"victoria-metrics"},"value":[1600421855.726,"3"]}]}
1600421855.726 is approximately 2020-09-18 12:37:35, i.e. now - 30s,
and the result of a ranged query (for the last minute):
{"result":[{"metric":{"__name__":"go_cpu_count","instance":"self","job":"victoria-metrics"},"values":[[1600421820,"3"],[1600421835,"3"],[1600421850,"3"],[1600421865,"3"],[1600421880,"3"]]}]}
1600421880 is 2020-09-18 12:38:00, which is the correct time of the last scrape.

Also, even if VM rounds timestamps up to seconds, it will still be possible to get different timestamps that break the table (for example, when two queries evaluate on opposite sides of a second boundary).

@starsliao

@hagen1778
I also have this problem. I found that it has existed since v1.39.0, but everything is normal in v1.38.1. Because the Grafana table cannot be used, I can only use v1.38.1 for now. I hope this problem can be solved.

  • v1.39.0
    [screenshot]

  • v1.38.1
    [screenshot]

@Muxa1L

Muxa1L commented Sep 21, 2020

@hagen1778 Never mind my previous comment. Returning timestamps with second precision will be enough. Prometheus also returns second precision, and VictoriaMetrics returned timestamps with second precision before v1.39.0, as @starsliao noticed.

valyala added a commit that referenced this issue Sep 21, 2020
valyala added a commit that referenced this issue Sep 21, 2020
@valyala
Collaborator

valyala commented Sep 21, 2020

The issue must be fixed in the following commits:

  • Single-node VictoriaMetrics - 2eb72e0
  • Cluster VictoriaMetrics - 07c6226

The bugfix rounds the default time value to seconds when the query to /api/v1/query doesn't contain the time query arg. This is a workaround, which reduces the probability of the original issue. The proper fix should be applied on the Grafana side - it must pass the time query arg with each query to /api/v1/query.
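A rough sketch of the idea behind that workaround (not the actual VictoriaMetrics code; the helper name is made up): when the request carries no time arg, the default evaluation timestamp is truncated to whole seconds, so queries fired within the same second evaluate at the same instant.

// defaultEvalTimestamp is a hypothetical helper illustrating the workaround.
// nowMs is the current time in milliseconds; when the request has no `time`
// query arg, the evaluation timestamp is rounded down to whole seconds,
// so queries fired within the same second share one timestamp and Grafana
// can merge their table rows.
func defaultEvalTimestamp(nowMs int64) int64 {
	return nowMs - nowMs%1000
}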

valyala added the "bug" label Sep 21, 2020
@Muxa1L

Muxa1L commented Sep 22, 2020

@valyala Grafana does pass time with the queries, but it does not seem to be taken into account anywhere.
[screenshot]

I think this part overwrites the start value:

if !searchutils.GetBool(r, "nocache") && ct-start < queryOffset {
    // Adjust start time only if `nocache` arg isn't set.
    // See https://github.com/VictoriaMetrics/VictoriaMetrics/issues/241
    start = ct - queryOffset
}

So setting -search.latencyOffset to something small, like 1ms, helps.
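To spell out the arithmetic of that branch (a simplified sketch, not the real handler; the function name is made up and all values are millisecond timestamps): with the default 30s offset, any start that Grafana passes within the last 30 seconds satisfies ct-start < queryOffset and is replaced by ct - queryOffset, which differs between a panel's queries by a few milliseconds; with a 1ms offset the condition almost never holds, so the timestamp Grafana passed survives.

// clampStart mimics the branch quoted above (simplified; `nocache` ignored).
// start is the timestamp Grafana passed, ct is the current time on vmselect,
// and queryOffset is the value of -search.latencyOffset, all in milliseconds.
func clampStart(start, ct, queryOffset int64) int64 {
	if ct-start < queryOffset {
		// Grafana's timestamp is discarded and replaced with "now - offset",
		// which differs per query by the milliseconds between the requests.
		return ct - queryOffset
	}
	return start
}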

valyala added a commit that referenced this issue Sep 23, 2020
valyala added a commit that referenced this issue Sep 23, 2020
@valyala
Collaborator

valyala commented Sep 23, 2020

@Muxa1L, thanks for spotting this! It must be fixed in the following commits:

  • Single-node VictoriaMetrics: 3ba5070
  • Cluster VictoriaMetrics: 1fce795

Unfortunately these commits weren't included in v1.41.1, but they will be included in the next release.

@Muxa1L

Muxa1L commented Sep 23, 2020

@valyala, great! Now it works correctly, thanks for the fixes!

@boazjohn

Is it advised to roll back to v1.38.1? Is this fix going to be released soon?
We are using v1.41.1 and facing the issue.

@valyala
Collaborator

valyala commented Sep 29, 2020

Is it advised to roll back to v1.38.1?

Unfortunately it is impossible to downgrade from v1.41.* to older releases due to an on-disk data format change. See the release notes for v1.41.0 for details. So it is better to wait for the next release, or to build VictoriaMetrics from sources according to the docs.

Is this fix going to be released soon?

The fix will be included in the upcoming release, which is going to be published in the next couple of days.

@valyala
Collaborator

valyala commented Sep 30, 2020

The bugfix is available starting from v1.42.0. Closing the bug as fixed.
