
explain potential cardinality statistics inaccuracy #3528

Closed

vijexa opened this issue Dec 23, 2022 · 9 comments
Labels: docs, enhancement (New feature or request), TBD (To Be Done)

vijexa commented Dec 23, 2022

Describe the bug
Hello, and thank you for this great project.
I was trying to make use of the cardinality stats from /prometheus/api/v1/status/tsdb, but the series count returned in seriesCountByMetricName doesn't seem to be accurate. It is several times higher than the count obtained by manually enumerating series via /prometheus/api/v1/series.

To Reproduce
Check metric series count in tsdb stats:

curl 'http://vmselect-vm-cluster:8481/select/0/prometheus/api/v1/status/tsdb?topN=10&date=2022-12-22' | jq '.data.seriesCountByMetricName' |  grep 'metric_name' -A 2 -B 1

      {
        "name": "metric_name",
        "value": 297613
      },

Retrieve metadata for all series of the same metric for the same day, then count unique series:

curl 'http://vmselect-vm-cluster:8481/select/0/prometheus/api/v1/series?start=2022-12-22T00:00:00Z&end=2022-12-22T23:59:59Z&match=metric_name' | jq '.data[]' -c  | wc -l

  122364

Expected behavior
The series count is expected to be the same in both cases; however, it differs significantly.

Version
v1.81.2-cluster; output of the vmselect binary with the --version flag:
vmselect-20220908-111729-tags-v1.81.2-cluster-0-ge9a0b803a

@dmitryk-dk (Contributor)

Hi @vijexa! I have checked your requests on the VictoriaMetrics play, and I got the following results:
curl 'https://play.victoriametrics.com/select/accounting/1/6a716b0f-38bc-4856-90ce-448fd713e3fe/prometheus/api/v1/status/tsdb?topN=10&date=2022-12-28' | jq '.data.seriesCountByMetricName' | grep 'flag' -A 2 -B 1

"seriesCountByMetricName":[{"name":"flag","value":1278}]

curl 'https://play.victoriametrics.com/select/accounting/1/6a716b0f-38bc-4856-90ce-448fd713e3fe/prometheus/api/v1/series?start=2022-12-28T00:00:00Z&end=2022-12-28T23:59:59Z&match[]=flag' | jq '.data[]' -c | wc -l

1278

@dmitryk-dk added the 'question' label Dec 28, 2022
vijexa (Author) commented Dec 28, 2022

Hello @dmitryk-dk,
Thank you for the response! It looks like there's no difference for 2022-12-28 on the play VM. However, I've checked a few other days there and found the same issue for 2022-12-24:

curl 'https://play.victoriametrics.com/select/accounting/1/6a716b0f-38bc-4856-90ce-448fd713e3fe/prometheus/api/v1/status/tsdb?topN=10&date=2022-12-24' | jq '.data.seriesCountByMetricName' |  grep 'flag' -A 2 -B 1

  {
    "name": "flag",
    "value": 1360
  },
curl 'https://play.victoriametrics.com/select/accounting/1/6a716b0f-38bc-4856-90ce-448fd713e3fe/prometheus/api/v1/series?start=2022-12-24T00:00:00Z&end=2022-12-24T23:59:59Z&match[]=flag' | jq '.data[]' -c | wc -l

    1278

/api/v1/status/tsdb reports a larger cardinality for the flag metric on 2022-12-24.

@dmitryk-dk (Contributor)

Hi @vijexa! Sorry for the late reply, I missed that you are using the cluster version. Do you use the replication factor flag?

About our play: these values can differ slightly because /api/v1/status/tsdb collects its stats by scanning the whole index, while /api/v1/series merges (de-duplicates) the entries when returning the result.

valyala (Collaborator) commented Dec 30, 2022

The /api/v1/series endpoint works in the following way in the VictoriaMetrics cluster:

  1. vmselect sends the /api/v1/series request to every vmstorage node in parallel.
  2. Every vmstorage node locates all the time series matching the provided match[] series selector.
  3. vmselect receives the metricName{labels} entries from step 2 for the matching time series from every vmstorage node.
  4. vmselect de-duplicates identical metricName{labels} entries received from vmstorage nodes and returns the result to the client (a sketch of this de-duplication follows the list). Duplicate entries may appear for the following reasons:
    • When the replication is enabled, then every incoming time series is stored at multiple distinct vmstorage nodes. So multiple vmstorage nodes may return identical metricName{labels} entries.
    • When some of vmstorage nodes cannot keep up with the data ingestion rate (or if they become temporarily unavailable), then vminsert may start re-routing the incoming time series from the temporarily slow or temporarily unavailable vmstorage nodes to the remaining healthy nodes. This will result in duplicate metricName{labels} entries across multiple vmstorage nodes.
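
A minimal shell sketch of the de-duplication in step 4, assuming hypothetical files node1.txt and node2.txt, each holding the metricName{labels} entries returned by one vmstorage node:

# Concatenate the per-node results and keep only unique entries.
# With replication or re-routing the same entry can appear in several files,
# so the de-duplicated count may be lower than the sum of the per-node counts.
cat node1.txt node2.txt | sort -u | wc -l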

The /api/v1/status/tsdb endpoint works in the following way in the VictoriaMetrics cluster:

  1. vmselect sends the /api/v1/status/tsdb request to every vmstorage node in parallel.
  2. vmstorage collects tsdb stats by scanning the inverted index (aka indexdb). Specifically, it scans metricName -> internal_series_id index for collecting the stats for seriesCountByMetricName.
  3. vmselect fetches the collected tsdb stats from every vmstorage node and merges them. The seriesCountByMetricName stats are simply summed across vmstorage nodes. This may lead to inflated values when samples for the same time series are spread across multiple vmstorage nodes, as described above (see the sketch below).

Additionally, sometimes vmstorage node may register multiple internal time series (aka TSID) for the same original time series when many samples for the same new time series are sent to vmstorage in a short period of time. This may also lead to inflated values for seriesCountByMetricName.
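
By contrast, a minimal sketch of the seriesCountByMetricName merge, using the same hypothetical per-node files as above: the per-node counts are simply summed, so every entry stored on several nodes is counted more than once:

# Sum the per-node counts without de-duplication.
# Any series duplicated across nodes inflates the total.
echo $(( $(wc -l < node1.txt) + $(wc -l < node2.txt) ))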

vijexa (Author) commented Dec 30, 2022

Yes, we use VM in the cluster configuration. The cluster I've been using as an example consists of:

  • 6 vmstorage
  • 2 vminsert
  • 2 vmselect
  • replication factor 1

So what you are saying is that /api/v1/status/tsdb cannot be relied on in cluster mode? Are there any plans to address this?

jnadler commented Jun 29, 2023

We're facing this problem as well.

I can understand why it might not be fixed - making TSDB stats mergeable across a cluster of vmstorage nodes sounds like a big job!

Can I suggest adding clear messaging to the documentation explaining that the numbers cannot be relied on in cluster mode? We spent a bit of time believing we could use this API as part of our cardinality monitoring / cardinality defense solution; that time could have been saved if this limitation were in the docs.

It deserves a mention in at least two places:
https://docs.victoriametrics.com/#tsdb-stats
https://docs.victoriametrics.com/#cardinality-explorer

The fact that it breaks the cardinality explorer for all cluster users may provide some motivation for thinking of a solution here!

valyala (Collaborator) commented Jun 29, 2023

Can I suggest adding clear messaging to the documentation explaining that the numbers cannot be relied on in cluster mode?

Absolutely agreed: the docs about the cardinality explorer should mention that some stats can be inaccurate for the VictoriaMetrics cluster, with an explanation of this fact. I also think that the source code responsible for calculating these stats for the VictoriaMetrics cluster must contain a comment with the same explanation, so developers can figure out and understand the issue more quickly.

@hagen1778 changed the title from 'Cardinality statistics inaccuracy' to 'explain potential cardinality statistics inaccuracy' Jul 19, 2023
@hagen1778 added the 'enhancement' and 'docs' labels and removed the 'question' label Jul 19, 2023
hagen1778 (Collaborator) commented Jul 19, 2023

The info about the potential inaccuracy should be added to the VMUI cardinality page as well.
Related ticket: #3070
