
Series limit applied at vmagent, but churn rate is high #3660

Closed
blesswinsamuel opened this issue Jan 16, 2023 · 6 comments
Labels
bug, vmagent

Comments

@blesswinsamuel

blesswinsamuel commented Jan 16, 2023

Describe the bug

When I use the series limit to limit the series ingested into VictoriaMetrics, the automatic metrics generated by vmagent show that new series are being dropped (scrape_series_limit_samples_dropped), which is expected. However, the VictoriaMetrics Grafana dashboard shows a high churn rate, which suggests that even though vmagent reports the series as dropped, they are still being ingested into VictoriaMetrics. I expect the samples that vmagent reports as dropped not to be ingested into VictoriaMetrics.

I used avalanche (https://github.com/prometheus-community/avalanche) to run this test.

To Reproduce

Start victoriametrics:

./victoria-metrics-prod

Start avalanche:

git clone https://github.com/prometheus-community/avalanche.git
cd avalanche
go run cmd/avalanche.go --metric-count=500 --series-count=30 --port=9101

Start vmagent with seriesLimitPerTarget setting:

scrape-config.yaml

global:
  scrape_interval: 30s
  scrape_timeout: 10s
scrape_configs:
  - job_name: 'vmagent'
    scrape_interval: 30s
    static_configs:
      - targets:
        - 'localhost:8429'
        labels:
          pod: vmagent-pod
  - job_name: 'victoriametrics'
    scrape_interval: 30s
    static_configs:
      - targets:
        - 'localhost:8428'
        labels:
          pod: victoriametrics-pod
  - job_name: 'avalanche'
    scrape_interval: 30s
    static_configs:
      - targets:
        - 'localhost:9101'
        labels:
          pod: avalanche-pod
./vmagent-prod -remoteWrite.url "http://localhost:8428/api/v1/write" -promscrape.config /tmp/scrape-config.yaml -promscrape.seriesLimitPerTarget 5000 -promscrape.streamParse=true
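
As a sanity check before looking at the dashboards (a hedged aside, assuming the default ports from the commands above), vmagent's target status page and the automatic per-target metric mentioned earlier can be inspected directly:

# all three targets should be listed as up on vmagent's target status page
curl -s 'http://localhost:8429/targets'

# the per-target counter of samples dropped by the series limit, queried from VictoriaMetrics
curl -s 'http://localhost:8428/api/v1/query' \
  --data-urlencode 'query=scrape_series_limit_samples_dropped'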

Version

❯ ./vmutils-darwin-amd64-v1.86.1/vmagent-prod --version
vmagent-20230111-093903-tags-v1.86.1-0-g351fc152e
❯ ./victoria-metrics-prod --version
victoria-metrics-20230111-093558-tags-v1.86.1-0-g351fc152e

Logs

Not relevant to this issue.

Screenshots

The automatic metrics generated by vmagent show that samples are being dropped (this is as expected because avalanche is generating completely new series every minute):

[screenshot: vmagent automatic metrics showing dropped samples]

The churn rate panel in the VictoriaMetrics cluster Grafana dashboard is consistently high, and the new series over 24h count keeps increasing:

[screenshot: churn rate and new series over 24h panels from the VictoriaMetrics Grafana dashboard]

Here, churn rate = 250 series/sec = 250x60 = 15,000 series/min.
The number of series generated by avalanche on every scrape (based on the above configuration) is 500x30 = 15,000 series. By default, avalanche generates entirely new series every 60s. So, judging by the churn rate, none of the series are being dropped by vmagent.

I expect the churn rate to be close to 0 here since all the new metrics emitted by avalanche should be dropped by vmagent.
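
(A hedged way to read this number without the Grafana dashboard, assuming single-node VictoriaMetrics on the default :8428; to my understanding the dashboard's churn rate panel is derived from the vm_new_timeseries_created_total counter:)

# churn rate: new time series registered per second, averaged over the last 5 minutes
curl -s 'http://localhost:8428/api/v1/query' \
  --data-urlencode 'query=sum(rate(vm_new_timeseries_created_total[5m]))'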

Used command-line flags

Mentioned under the "To Reproduce" section.

Additional information

No response

blesswinsamuel added the bug label on Jan 16, 2023
@hagen1778
Collaborator

Hello! Can confirm the issue.

hagen1778 added a commit that referenced this issue Jan 17, 2023
There are two changes here:
1. Do not account for `sw.Config.NoStaleMarkers`. Otherwise, disabling staleness markers
would also disable the `seriesLimiter`;
2. Prevent sending staleness markers if the series limit has been exceeded.
To send staleness markers we need to check which series disappeared between the current and
previous scrapes. But when the series limit is dropping series, there is no easy way
to calculate this anymore, so wrong markers could be sent to remote storage.
Moreover, each series rejected by the `seriesLimiter` would then be accounted as
a new time series by the VM TSDB once received in the form of a stale marker.
See #3660

Signed-off-by: hagen1778 <roman@victoriametrics.com>
valyala added a commit that referenced this issue Jan 17, 2023
Fix the following issues:

- The series limit wasn't applied when staleness tracking was disabled.
- The series limit didn't prevent sending staleness markers for new series exceeding the limit.

Updates #3660

Thanks to @hagen1778 for the initial attempt to fix the issue
at #3665
@valyala
Collaborator

valyala commented Jan 17, 2023

@blesswinsamuel , thanks for filing the detailed bug report! The issue should be fixed in commit 289af65. This commit will be included in the next release. In the meantime you can build vmagent or single-node VictoriaMetrics from this commit and verify whether they correctly apply the series limit. See the build instructions below:
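
(A minimal sketch of those build steps, assuming Go is installed and the victoria-metrics-prod / vmagent-prod Makefile targets are unchanged:)

git clone https://github.com/VictoriaMetrics/VictoriaMetrics.git
cd VictoriaMetrics
git checkout 289af65          # the commit mentioned above
make victoria-metrics-prod    # builds bin/victoria-metrics-prod
make vmagent-prod             # builds bin/vmagent-prod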

@blesswinsamuel
Author

@valyala @hagen1778 Thanks for fixing this so fast! I built vmagent from the commit a844b97, and it is working as expected. Thank you!

[screenshot]

One more thing: memory usage goes up a lot when running avalanche with the same configuration (15k series in total, changing completely every minute). I can raise a separate issue to track this if you'd like.

[screenshot: vmagent memory usage]

I have streamParse enabled, which, according to the docs, should help when targets export a big number of metrics. Unfortunately, I cleared the data in VictoriaMetrics before trying out the new vmagent, so I don't have the memory usage details from before the update.

@hagen1778
Collaborator

hagen1778 commented Jan 18, 2023

vmagent can require more memory than usual if seriesLimitPerTarget is enabled. To check whether a specific series has already been seen before, vmagent maintains a Bloom filter in memory. The filter requires memory proportional to the seriesLimitPerTarget limit (the higher the limit, the more memory is needed), and a separate Bloom filter is created per target (the more targets, the more memory is needed). This could be the reason why it needs more memory.

To verify this, please capture a memory profile and attach it to the issue. And yes, creating a new issue for the memory usage is preferable.
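
(One possible way to capture such a profile, assuming vmagent listens on the default :8429 and its pprof endpoints are enabled:)

# collect the heap profile from the running vmagent and inspect it locally
curl -s http://localhost:8429/debug/pprof/heap > vmagent-mem.pprof
go tool pprof -top vmagent-mem.pprof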

@blesswinsamuel
Author

@hagen1778 thanks for your response. I created a new issue #3675 with more details about the memory spike. After doing some tests, it looks like the memory spike happens when the target exposes a large number of previously unseen series on every scrape.

@valyala
Collaborator

valyala commented Jan 18, 2023

vmagent should properly apply series limit starting from v1.86.2. Closing this issue as fixed.

@valyala valyala closed this as completed Jan 18, 2023
valyala added a commit that referenced this issue Jan 21, 2024
…markers … (#5577)"

This reverts commit cfec258.

Reason for revert: the original code already doesn't store the last scrape response when stale markers are disabled.
The scrapeWork.areIdenticalSeries() function always returns true if stale markers are disabled.
This prevents storing the last response at scrapeWork.processScrapedData().

It looks like the reverted commit could also reintroduce issue #3660

Updates #5577