
explain potential cardinality statistics inaccuracy #3528

Closed

vijexa opened this issue Dec 23, 2022 · 9 comments
Labels: docs, enhancement (New feature or request), TBD (To Be Done)

vijexa commented Dec 23, 2022

Describe the bug
Hello, and thank you for this great project.
I was trying to make use of the cardinality stats from /prometheus/api/v1/status/tsdb, but the series count returned in seriesCountByMetricName doesn't seem to be accurate. It is several times higher than the count obtained by manually enumerating series via /prometheus/api/v1/series.

To Reproduce
Check metric series count in tsdb stats:

curl 'http://vmselect-vm-cluster:8481/select/0/prometheus/api/v1/status/tsdb?topN=10&date=2022-12-22' | jq '.data.seriesCountByMetricName' |  grep 'metric_name' -A 2 -B 1

      {
        "name": "metric_name",
        "value": 297613
      },

Retrieve metadata for all series of the same metric for the same day, then count unique series:

curl 'http://vmselect-vm-cluster:8481/select/0/prometheus/api/v1/series?start=2022-12-22T00:00:00Z&end=2022-12-22T23:59:59Z&match=metric_name' | jq '.data[]' -c  | wc -l

  122364

Expected behavior
The series count is expected to be the same in both cases; however, it differs significantly.

Version
v1.81.2-cluster; output of the vmselect binary with the --version flag:
vmselect-20220908-111729-tags-v1.81.2-cluster-0-ge9a0b803a

@dmitryk-dk (Contributor)

Hi @vijexa! I have checked your requests on the VictoriaMetrics play, and I got the following results:
curl 'https://play.victoriametrics.com/select/accounting/1/6a716b0f-38bc-4856-90ce-448fd713e3fe/prometheus/api/v1/status/tsdb?topN=10&date=2022-12-28' | jq '.data.seriesCountByMetricName' | grep 'flag' -A 2 -B 1

"seriesCountByMetricName":[{"name":"flag","value":1278}]

curl 'https://play.victoriametrics.com/select/accounting/1/6a716b0f-38bc-4856-90ce-448fd713e3fe/prometheus/api/v1/series?start=2022-12-28T00:00:00Z&end=2022-12-28T23:59:59Z&match[]=flag' | jq '.data[]' -c | wc -l

1278

@dmitryk-dk added the 'question' label Dec 28, 2022
vijexa (Author) commented Dec 28, 2022

Hello @dmitryk-dk,
Thank you for the response! It looks like there's no difference for 2022-12-28 on the play VM. However, I've checked a few other days there and found the same issue for 2022-12-24:

curl 'https://play.victoriametrics.com/select/accounting/1/6a716b0f-38bc-4856-90ce-448fd713e3fe/prometheus/api/v1/status/tsdb?topN=10&date=2022-12-24' | jq '.data.seriesCountByMetricName' |  grep 'flag' -A 2 -B 1

  {
    "name": "flag",
    "value": 1360
  },
curl 'https://play.victoriametrics.com/select/accounting/1/6a716b0f-38bc-4856-90ce-448fd713e3fe/prometheus/api/v1/series?start=2022-12-24T00:00:00Z&end=2022-12-24T23:59:59Z&match[]=flag' | jq '.data[]' -c | wc -l

    1278

/api/v1/status/tsdb reports a larger cardinality for the flag metric on 2022-12-24.

@dmitryk-dk (Contributor)

Hi @vijexa! Sorry for the late reply, I missed that you are using the cluster version. Do you use the replication factor flag?

About our play: these values can differ slightly because /api/v1/status/tsdb collects its stats by scanning the whole index, while /api/v1/series merges (de-duplicates) the entries when returning the result.

valyala (Collaborator) commented Dec 30, 2022

The /api/v1/series endpoint works in the following way in the VictoriaMetrics cluster:

  1. vmselect sends the /api/v1/series request to every vmstorage node in parallel.
  2. Every vmstorage node locates all the time series matching the provided match[] series selector.
  3. vmselect receives the metricName{labels} entries from step 2 for the matching time series from every vmstorage node.
  4. vmselect de-duplicates identical metricName{labels} entries received from vmstorage nodes and returns the result to the client (a sketch of this de-duplication follows the list). Duplicate entries may appear for the following reasons:
    • When the replication is enabled, then every incoming time series is stored at multiple distinct vmstorage nodes. So multiple vmstorage nodes may return identical metricName{labels} entries.
    • When some of vmstorage nodes cannot keep up with the data ingestion rate (or if they become temporarily unavailable), then vminsert may start re-routing the incoming time series from the temporarily slow or temporarily unavailable vmstorage nodes to the remaining healthy nodes. This will result in duplicate metricName{labels} entries across multiple vmstorage nodes.
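
A minimal shell sketch of the de-duplication in step 4, assuming hypothetical files node1.txt and node2.txt, each holding the metricName{labels} entries returned by one vmstorage node:

# Concatenate the per-node results and keep only unique entries.
# With replication or re-routing the same entry can appear in several files,
# so the de-duplicated count may be lower than the sum of the per-node counts.
cat node1.txt node2.txt | sort -u | wc -l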

The /api/v1/status/tsdb endpoint works in the following way in the VictoriaMetrics cluster:

  1. vmselect sends the /api/v1/status/tsdb request to every vmstorage node in parallel.
  2. vmstorage collects tsdb stats by scanning the inverted index (aka indexdb). Specifically, it scans metricName -> internal_series_id index for collecting the stats for seriesCountByMetricName.
  3. vmselect fetches the collected tsdb stats from every vmstorage node and merges them. The seriesCountByMetricName stats are simply summed across vmstorage nodes. This may lead to inflated values when samples for the same time series are spread across multiple vmstorage nodes, as described above (see the sketch below).

Additionally, sometimes vmstorage node may register multiple internal time series (aka TSID) for the same original time series when many samples for the same new time series are sent to vmstorage in a short period of time. This may also lead to inflated values for seriesCountByMetricName.
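
By contrast, a minimal sketch of the seriesCountByMetricName merge, using the same hypothetical per-node files as above: the per-node counts are simply summed, so every entry stored on several nodes is counted more than once:

# Sum the per-node counts without de-duplication.
# Any series duplicated across nodes inflates the total.
echo $(( $(wc -l < node1.txt) + $(wc -l < node2.txt) ))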

vijexa (Author) commented Dec 30, 2022

Yes, we use VM in the cluster configuration. The cluster I've been using as an example consists of:

  • 6 vmstorage
  • 2 vminsert
  • 2 vmselect
  • replication factor 1

So what you are saying is that /api/v1/status/tsdb cannot be relied on in cluster mode? Are there any plans to address this?

jnadler commented Jun 29, 2023

We're facing this problem as well.

I can understand why it might not be fixed - making TSDB stats mergeable across a cluster of vmstorage nodes sounds like a big job!

Can I suggest adding clear messaging to the documentation explaining that the numbers cannot be relied on in cluster mode? We spent a bit of time believing we could use this API as part of our cardinality monitoring / cardinality defense solution; that time could have been saved if this limitation were in the docs.

It deserves a mention in at least two places:
https://docs.victoriametrics.com/#tsdb-stats
https://docs.victoriametrics.com/#cardinality-explorer

The fact that it breaks the cardinality explorer for all cluster users may provide some motivation for thinking of a solution here!

valyala (Collaborator) commented Jun 29, 2023

Can I suggest adding clear messaging to the documentation explaining that the numbers cannot be relied on in cluster mode?

Absolutely agreed: the docs about the cardinality explorer should mention that some stats can be inaccurate for the VictoriaMetrics cluster, with an explanation of this fact. I also think that the source code responsible for calculating these stats for the VictoriaMetrics cluster must contain a comment with the same explanation, so developers can figure out and understand the issue more quickly.

@hagen1778 changed the title from 'Cardinality statistics inaccuracy' to 'explain potential cardinality statistics inaccuracy' Jul 19, 2023
@hagen1778 added the 'enhancement' and 'docs' labels and removed the 'question' label Jul 19, 2023
hagen1778 (Collaborator) commented Jul 19, 2023

The info about the potential inaccuracy should be added to the VMUI cardinality page as well.
Related ticket: #3070
