
vmagent: drop corrupted data blocks on start up #1030

Closed
jelmd opened this issue Jan 22, 2021 · 15 comments
Labels
enhancement, vmagent

Comments

@jelmd
Contributor

jelmd commented Jan 22, 2021

There seems to be a problem with how vmagent v1.52.0 handles cached data:

/usr/sbin/vmagent \
--remoteWrite.flushInterval=5s --httpListenAddr=127.0.0.1:8429 \
--promscrape.config=/data/vmagent/promscrape.json --promscrape.disableKeepAlive=false \
--promscrape.disableCompression --remoteWrite.url=http://vmdbA:8428/api/v1/write \
--remoteWrite.tmpDataPath=/data/vmagent
...
2021-01-22T15:06:38.344Z    info    VictoriaMetrics/app/vmagent/main.go:88  starting vmagent at "127.0.0.1:8429"...
2021-01-22T15:06:38.350Z    info    VictoriaMetrics/lib/memory/memory.go:43 limiting caches to 161940359577 bytes, leaving 107960239719 bytes to the OS according to -memory.allowedPercent=60
2021-01-22T15:06:38.350Z    error   VictoriaMetrics/lib/persistentqueue/persistentqueue.go:214  cannot read metainfo for persistent queue from "/data/vmagent/persistent-queue/1_28103F80D27BCD31/metainfo.json": invalid data read from "/data/vmagent/persistent-queue/1_28103F80D27BCD31/metainfo.json": readerOffset=536871040 cannot exceed writerOffset=16701900; re-creating "/data/vmagent/persistent-queue/1_28103F80D27BCD31"
2021-01-22T15:06:38.364Z    info    VictoriaMetrics/lib/persistentqueue/fastqueue.go:51 opened fast persistent queue at "/data/vmagent/persistent-queue/1_28103F80D27BCD31" with maxInmemoryBlocks=200, it contains 0 pending bytes
2021-01-22T15:06:38.364Z    info    VictoriaMetrics/app/vmagent/remotewrite/client.go:130   initialized client for -remoteWrite.url="1:secret-url"
2021-01-22T15:06:38.366Z    info    VictoriaMetrics/app/vmagent/main.go:111 started vmagent in 0.021 seconds
2021-01-22T15:06:38.366Z    info    VictoriaMetrics/lib/promscrape/scraper.go:91    reading Prometheus configs from "/data/vmagent/promscrape.json"
2021-01-22T15:06:38.366Z    info    VictoriaMetrics/lib/httpserver/httpserver.go:82 starting http server at http://127.0.0.1:8429/
2021-01-22T15:06:38.366Z    info    VictoriaMetrics/lib/httpserver/httpserver.go:83 pprof handlers are exposed at http://127.0.0.1:8429/debug/pprof/
2021-01-22T15:06:38.371Z    info    VictoriaMetrics/lib/promscrape/scraper.go:344   static_configs: added targets: 3, removed targets: 0; total targets: 3
queue/persistentqueue.go:512    FATAL: too big block size read from "/data/vmagent/persistent-queue/1_28103F80D27BCD31/0000000000000000": 8255741331277644361 bytes; cannot exceed 33554432 bytes
panic: FATAL: too big block size read from "/data/vmagent/persistent-queue/1_28103F80D27BCD31/0000000000000000": 8255741331277644361 bytes; cannot exceed 33554432 bytes
...

The vmagent scrapes:

  • 480 metrics every 10s from vmagent itself
  • 311 metrics every 2s from dcgm-exporter
  • 534 metrics every 1s from node-exporter

so roughly 63,720,000 data points a day (~26 MB?).
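(Checking the arithmetic: 480/10 + 311/2 + 534/1 = 48 + 155.5 + 534 = 737.5 samples/s, and 737.5 × 86,400 s/day = 63,720,000 samples/day.)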

@hagen1778
Collaborator

Hi @jelmd! The number of bytes 8255741331277644361 comes from a data block header, so this is likely either a bug in vmagent (when flushing such blocks to disk) or data corruption. I'd encourage you to check whether data corruption was possible. But so far I suspect it is a bug in vmagent. Thank you for the report!
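For illustration only, here is a minimal Go sketch of why a corrupted header yields such an absurd size (this is not vmagent's actual code; the little-endian uint64 header layout is an assumption): if the 8 header bytes get overwritten with arbitrary data, decoding them as an integer produces a huge number, which a sanity limit like the 33554432-byte one in the log can catch.

// sketch: decoding garbage header bytes as a block size
package main

import (
	"encoding/binary"
	"fmt"
)

const maxBlockSize = 33554432 // 32 MiB, the limit from the log message

func main() {
	// Pretend the 8-byte size header was overwritten with ASCII text.
	corrupted := []byte("garbage!")
	size := binary.LittleEndian.Uint64(corrupted)
	fmt.Printf("decoded block size: %d bytes\n", size) // absurdly large
	if size > maxBlockSize {
		fmt.Println("too big block size; drop the block instead of trusting it")
	}
}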

@hagen1778 hagen1778 added the bug and vmagent labels Jan 29, 2021
@jelmd
Contributor Author

jelmd commented Feb 1, 2021

Hi @hagen1778,

Well, the users on the related machines run experimental software which sometimes consumes all available memory (exactly to monitor this I installed vmagent ;-)), and then the machine needs to be rebooted because no process can make any progress anymore. I am not sure what vmagent does in such situations, but at least I can say that the data series have gaps.

The filesystem used is ZFS, which should always stay consistent thanks to its transaction-based design.

@hagen1778
Collaborator

Are these "normal" kind of reboots (with SIGTERM to all processes) or "power-off" kind of reboots?

@jelmd
Contributor Author

jelmd commented Feb 2, 2021

ipmitool -I lanplus -H $host -U $admin -o supermicro chassis power reset

So more like hitting the reset button of the machine.

@hagen1778
Collaborator

hagen1778 commented Feb 3, 2021

Hm, while VM itself is resilient to power resets, vmagent unfortunately is not. It needs a proper shutdown procedure to finish its on-disk writes. I guess it may be marked as an enhancement.
Thanks for the report!
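For context, a common crash-tolerant pattern here (a hedged, generic sketch, not vmagent's actual on-disk format) is to prefix every block with its length and a checksum, and on startup truncate the queue file at the first block that fails validation instead of panicking:

// sketch: validate [8-byte size][4-byte CRC32][payload] records on startup
// and cut off the corrupted tail; not vmagent's real implementation
package main

import (
	"encoding/binary"
	"hash/crc32"
	"io"
	"log"
	"os"
)

const maxBlockSize = 32 << 20 // 32 MiB sanity limit, as in the log above

func truncateAtFirstBadBlock(path string) error {
	f, err := os.OpenFile(path, os.O_RDWR, 0)
	if err != nil {
		return err
	}
	defer f.Close()

	var validEnd int64
	hdr := make([]byte, 12)
	for {
		if _, err := io.ReadFull(f, hdr); err != nil {
			break // clean EOF or a torn header: keep what we have so far
		}
		size := binary.LittleEndian.Uint64(hdr[:8])
		sum := binary.LittleEndian.Uint32(hdr[8:])
		if size > maxBlockSize {
			log.Printf("dropping corrupted tail: block size %d exceeds %d", size, maxBlockSize)
			break
		}
		buf := make([]byte, size)
		if _, err := io.ReadFull(f, buf); err != nil {
			break // torn write at the end of the file
		}
		if crc32.ChecksumIEEE(buf) != sum {
			log.Printf("dropping corrupted tail: checksum mismatch")
			break
		}
		validEnd += 12 + int64(size)
	}
	return f.Truncate(validEnd)
}

func main() {
	if err := truncateAtFirstBadBlock(os.Args[1]); err != nil {
		log.Fatal(err)
	}
}

With something like this, a torn write from a hardware reset costs at most the tail of the queue rather than crashing the whole process.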

@hagen1778 hagen1778 added the enhancement label and removed the bug label Feb 3, 2021
@jelmd
Contributor Author

jelmd commented Feb 5, 2021

Ah, ok. It would be nice to have as much data as possible, to be able to analyze "frozen" boxes like this. But making it too busy is bad as well ;-).

What is much more annoying is that vmagent seems to never retry unavailable Prometheus scrape targets. E.g. if the machine gets rebooted and poor systemd starts vmagent before e.g. node-exporter, it seems to do nothing forever, so I need to restart vmagent manually to get any data again. But I guess it is better to open another issue for it?
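As an aside on the ordering problem: under systemd this can be mitigated with explicit unit ordering and automatic restarts. A sketch with assumed unit and file names (node_exporter.service and the drop-in path are hypothetical):

# /etc/systemd/system/vmagent.service.d/ordering.conf (hypothetical path)
[Unit]
# Start vmagent only after the network and the exporters it scrapes.
After=network-online.target node_exporter.service
Wants=network-online.target

[Service]
# Restart vmagent automatically if it panics, e.g. on a corrupted queue.
Restart=on-failure
RestartSec=5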

@hagen1778
Collaborator

Yes, please open another issue. Thanks!

@jelmd
Contributor Author

jelmd commented Feb 22, 2021

Hmm, it really, really, really sucks that vmagent is not able to recover properly. It is of no use if we have to restart it manually after every reboot. Data collectors especially should be very robust, to avoid losing data. vmagent seems to be very fragile =:-((((

2021-02-22T17:09:32.928Z	info	VictoriaMetrics/lib/logger/flag.go:12	build version: vmagent-v1.52.0 (v1.52.0)
2021-02-22T17:09:32.928Z	info	VictoriaMetrics/lib/logger/flag.go:13	command line flags
2021-02-22T17:09:32.928Z	info	VictoriaMetrics/lib/logger/flag.go:20	flag "csvTrimTimestamp" = "1ms"
2021-02-22T17:09:32.928Z	info	VictoriaMetrics/lib/logger/flag.go:20	flag "dryRun" = "false"
2021-02-22T17:09:32.928Z	info	VictoriaMetrics/lib/logger/flag.go:20	flag "enableTCP6" = "false"
2021-02-22T17:09:32.928Z	info	VictoriaMetrics/lib/logger/flag.go:20	flag "envflag.enable" = "false"
2021-02-22T17:09:32.929Z	info	VictoriaMetrics/lib/logger/flag.go:20	flag "envflag.prefix" = ""
2021-02-22T17:09:32.929Z	info	VictoriaMetrics/lib/logger/flag.go:20	flag "fs.disableMmap" = "false"
2021-02-22T17:09:32.929Z	info	VictoriaMetrics/lib/logger/flag.go:20	flag "graphiteListenAddr" = ""
2021-02-22T17:09:32.929Z	info	VictoriaMetrics/lib/logger/flag.go:20	flag "graphiteTrimTimestamp" = "1s"
2021-02-22T17:09:32.929Z	info	VictoriaMetrics/lib/logger/flag.go:20	flag "http.connTimeout" = "2m0s"
2021-02-22T17:09:32.929Z	info	VictoriaMetrics/lib/logger/flag.go:20	flag "http.disableResponseCompression" = "false"
2021-02-22T17:09:32.929Z	info	VictoriaMetrics/lib/logger/flag.go:20	flag "http.idleConnTimeout" = "1m0s"
2021-02-22T17:09:32.929Z	info	VictoriaMetrics/lib/logger/flag.go:20	flag "http.maxGracefulShutdownDuration" = "7s"
2021-02-22T17:09:32.929Z	info	VictoriaMetrics/lib/logger/flag.go:20	flag "http.pathPrefix" = ""
2021-02-22T17:09:32.929Z	info	VictoriaMetrics/lib/logger/flag.go:20	flag "http.shutdownDelay" = "0s"
2021-02-22T17:09:32.929Z	info	VictoriaMetrics/lib/logger/flag.go:20	flag "httpAuth.password" = "secret"
2021-02-22T17:09:32.929Z	info	VictoriaMetrics/lib/logger/flag.go:20	flag "httpAuth.username" = ""
2021-02-22T17:09:32.929Z	info	VictoriaMetrics/lib/logger/flag.go:20	flag "httpListenAddr" = "127.0.0.1:8429"
2021-02-22T17:09:32.929Z	info	VictoriaMetrics/lib/logger/flag.go:20	flag "import.maxLineLen" = "104857600"
2021-02-22T17:09:32.929Z	info	VictoriaMetrics/lib/logger/flag.go:20	flag "influx.maxLineSize" = "262144"
2021-02-22T17:09:32.929Z	info	VictoriaMetrics/lib/logger/flag.go:20	flag "influxListenAddr" = ""
2021-02-22T17:09:32.929Z	info	VictoriaMetrics/lib/logger/flag.go:20	flag "influxMeasurementFieldSeparator" = "_"
2021-02-22T17:09:32.929Z	info	VictoriaMetrics/lib/logger/flag.go:20	flag "influxSkipMeasurement" = "false"
2021-02-22T17:09:32.929Z	info	VictoriaMetrics/lib/logger/flag.go:20	flag "influxSkipSingleField" = "false"
2021-02-22T17:09:32.930Z	info	VictoriaMetrics/lib/logger/flag.go:20	flag "influxTrimTimestamp" = "1ms"
2021-02-22T17:09:32.930Z	info	VictoriaMetrics/lib/logger/flag.go:20	flag "insert.maxQueueDuration" = "1m0s"
2021-02-22T17:09:32.930Z	info	VictoriaMetrics/lib/logger/flag.go:20	flag "loggerDisableTimestamps" = "false"
2021-02-22T17:09:32.930Z	info	VictoriaMetrics/lib/logger/flag.go:20	flag "loggerErrorsPerSecondLimit" = "0"
2021-02-22T17:09:32.930Z	info	VictoriaMetrics/lib/logger/flag.go:20	flag "loggerFormat" = "default"
2021-02-22T17:09:32.930Z	info	VictoriaMetrics/lib/logger/flag.go:20	flag "loggerLevel" = "INFO"
2021-02-22T17:09:32.930Z	info	VictoriaMetrics/lib/logger/flag.go:20	flag "loggerOutput" = "stderr"
2021-02-22T17:09:32.930Z	info	VictoriaMetrics/lib/logger/flag.go:20	flag "loggerWarnsPerSecondLimit" = "0"
2021-02-22T17:09:32.930Z	info	VictoriaMetrics/lib/logger/flag.go:20	flag "maxConcurrentInserts" = "160"
2021-02-22T17:09:32.930Z	info	VictoriaMetrics/lib/logger/flag.go:20	flag "maxInsertRequestSize" = "33554432"
2021-02-22T17:09:32.930Z	info	VictoriaMetrics/lib/logger/flag.go:20	flag "memory.allowedBytes" = "0"
2021-02-22T17:09:32.930Z	info	VictoriaMetrics/lib/logger/flag.go:20	flag "memory.allowedPercent" = "60"
2021-02-22T17:09:32.930Z	info	VictoriaMetrics/lib/logger/flag.go:20	flag "metricsAuthKey" = "secret"
2021-02-22T17:09:32.930Z	info	VictoriaMetrics/lib/logger/flag.go:20	flag "opentsdbHTTPListenAddr" = ""
2021-02-22T17:09:32.930Z	info	VictoriaMetrics/lib/logger/flag.go:20	flag "opentsdbListenAddr" = ""
2021-02-22T17:09:32.930Z	info	VictoriaMetrics/lib/logger/flag.go:20	flag "opentsdbTrimTimestamp" = "1s"
2021-02-22T17:09:32.930Z	info	VictoriaMetrics/lib/logger/flag.go:20	flag "opentsdbhttp.maxInsertRequestSize" = "33554432"
2021-02-22T17:09:32.930Z	info	VictoriaMetrics/lib/logger/flag.go:20	flag "opentsdbhttpTrimTimestamp" = "1ms"
2021-02-22T17:09:32.931Z	info	VictoriaMetrics/lib/logger/flag.go:20	flag "pprofAuthKey" = "secret"
2021-02-22T17:09:32.931Z	info	VictoriaMetrics/lib/logger/flag.go:20	flag "promscrape.config" = "/data/vmagent/promscrape.json"
2021-02-22T17:09:32.931Z	info	VictoriaMetrics/lib/logger/flag.go:20	flag "promscrape.config.dryRun" = "false"
2021-02-22T17:09:32.931Z	info	VictoriaMetrics/lib/logger/flag.go:20	flag "promscrape.config.strictParse" = "false"
2021-02-22T17:09:32.931Z	info	VictoriaMetrics/lib/logger/flag.go:20	flag "promscrape.configCheckInterval" = "0s"
2021-02-22T17:09:32.931Z	info	VictoriaMetrics/lib/logger/flag.go:20	flag "promscrape.consulSDCheckInterval" = "30s"
2021-02-22T17:09:32.931Z	info	VictoriaMetrics/lib/logger/flag.go:20	flag "promscrape.disableCompression" = "true"
2021-02-22T17:09:32.931Z	info	VictoriaMetrics/lib/logger/flag.go:20	flag "promscrape.disableKeepAlive" = "false"
2021-02-22T17:09:32.931Z	info	VictoriaMetrics/lib/logger/flag.go:20	flag "promscrape.discovery.concurrency" = "100"
2021-02-22T17:09:32.931Z	info	VictoriaMetrics/lib/logger/flag.go:20	flag "promscrape.discovery.concurrentWaitTime" = "1m0s"
2021-02-22T17:09:32.931Z	info	VictoriaMetrics/lib/logger/flag.go:20	flag "promscrape.dnsSDCheckInterval" = "30s"
2021-02-22T17:09:32.931Z	info	VictoriaMetrics/lib/logger/flag.go:20	flag "promscrape.dockerswarmSDCheckInterval" = "30s"
2021-02-22T17:09:32.931Z	info	VictoriaMetrics/lib/logger/flag.go:20	flag "promscrape.dropOriginalLabels" = "false"
2021-02-22T17:09:32.931Z	info	VictoriaMetrics/lib/logger/flag.go:20	flag "promscrape.ec2SDCheckInterval" = "1m0s"
2021-02-22T17:09:32.931Z	info	VictoriaMetrics/lib/logger/flag.go:20	flag "promscrape.eurekaSDCheckInterval" = "30s"
2021-02-22T17:09:32.931Z	info	VictoriaMetrics/lib/logger/flag.go:20	flag "promscrape.fileSDCheckInterval" = "30s"
2021-02-22T17:09:32.931Z	info	VictoriaMetrics/lib/logger/flag.go:20	flag "promscrape.gceSDCheckInterval" = "1m0s"
2021-02-22T17:09:32.931Z	info	VictoriaMetrics/lib/logger/flag.go:20	flag "promscrape.kubernetesSDCheckInterval" = "30s"
2021-02-22T17:09:32.931Z	info	VictoriaMetrics/lib/logger/flag.go:20	flag "promscrape.maxDroppedTargets" = "1000"
2021-02-22T17:09:32.931Z	info	VictoriaMetrics/lib/logger/flag.go:20	flag "promscrape.maxScrapeSize" = "16777216"
2021-02-22T17:09:32.931Z	info	VictoriaMetrics/lib/logger/flag.go:20	flag "promscrape.openstackSDCheckInterval" = "30s"
2021-02-22T17:09:32.931Z	info	VictoriaMetrics/lib/logger/flag.go:20	flag "promscrape.streamParse" = "false"
2021-02-22T17:09:32.931Z	info	VictoriaMetrics/lib/logger/flag.go:20	flag "promscrape.suppressDuplicateScrapeTargetErrors" = "false"
2021-02-22T17:09:32.931Z	info	VictoriaMetrics/lib/logger/flag.go:20	flag "promscrape.suppressScrapeErrors" = "false"
2021-02-22T17:09:32.931Z	info	VictoriaMetrics/lib/logger/flag.go:20	flag "remoteWrite.basicAuth.password" = "secret"
2021-02-22T17:09:32.932Z	info	VictoriaMetrics/lib/logger/flag.go:20	flag "remoteWrite.basicAuth.username" = ""
2021-02-22T17:09:32.932Z	info	VictoriaMetrics/lib/logger/flag.go:20	flag "remoteWrite.bearerToken" = "secret"
2021-02-22T17:09:32.932Z	info	VictoriaMetrics/lib/logger/flag.go:20	flag "remoteWrite.flushInterval" = "5s"
2021-02-22T17:09:32.932Z	info	VictoriaMetrics/lib/logger/flag.go:20	flag "remoteWrite.label" = ""
2021-02-22T17:09:32.932Z	info	VictoriaMetrics/lib/logger/flag.go:20	flag "remoteWrite.maxBlockSize" = "8388608"
2021-02-22T17:09:32.932Z	info	VictoriaMetrics/lib/logger/flag.go:20	flag "remoteWrite.maxDiskUsagePerURL" = "0"
2021-02-22T17:09:32.932Z	info	VictoriaMetrics/lib/logger/flag.go:20	flag "remoteWrite.proxyURL" = ""
2021-02-22T17:09:32.932Z	info	VictoriaMetrics/lib/logger/flag.go:20	flag "remoteWrite.queues" = "4"
2021-02-22T17:09:32.932Z	info	VictoriaMetrics/lib/logger/flag.go:20	flag "remoteWrite.relabelConfig" = ""
2021-02-22T17:09:32.932Z	info	VictoriaMetrics/lib/logger/flag.go:20	flag "remoteWrite.sendTimeout" = ""
2021-02-22T17:09:32.932Z	info	VictoriaMetrics/lib/logger/flag.go:20	flag "remoteWrite.showURL" = "false"
2021-02-22T17:09:32.932Z	info	VictoriaMetrics/lib/logger/flag.go:20	flag "remoteWrite.significantFigures" = "0"
2021-02-22T17:09:32.932Z	info	VictoriaMetrics/lib/logger/flag.go:20	flag "remoteWrite.tlsCAFile" = ""
2021-02-22T17:09:32.932Z	info	VictoriaMetrics/lib/logger/flag.go:20	flag "remoteWrite.tlsCertFile" = ""
2021-02-22T17:09:32.932Z	info	VictoriaMetrics/lib/logger/flag.go:20	flag "remoteWrite.tlsInsecureSkipVerify" = ""
2021-02-22T17:09:32.932Z	info	VictoriaMetrics/lib/logger/flag.go:20	flag "remoteWrite.tlsKeyFile" = "secret"
2021-02-22T17:09:32.932Z	info	VictoriaMetrics/lib/logger/flag.go:20	flag "remoteWrite.tlsServerName" = ""
2021-02-22T17:09:32.932Z	info	VictoriaMetrics/lib/logger/flag.go:20	flag "remoteWrite.tmpDataPath" = "/data/vmagent"
2021-02-22T17:09:32.932Z	info	VictoriaMetrics/lib/logger/flag.go:20	flag "remoteWrite.url" = "secret"
2021-02-22T17:09:32.932Z	info	VictoriaMetrics/lib/logger/flag.go:20	flag "remoteWrite.urlRelabelConfig" = ""
2021-02-22T17:09:32.932Z	info	VictoriaMetrics/lib/logger/flag.go:20	flag "tls" = "false"
2021-02-22T17:09:32.932Z	info	VictoriaMetrics/lib/logger/flag.go:20	flag "tlsCertFile" = ""
2021-02-22T17:09:32.932Z	info	VictoriaMetrics/lib/logger/flag.go:20	flag "tlsKeyFile" = "secret"
2021-02-22T17:09:32.932Z	info	VictoriaMetrics/lib/logger/flag.go:20	flag "version" = "false"
2021-02-22T17:09:32.932Z	info	VictoriaMetrics/app/vmagent/main.go:88	starting vmagent at "127.0.0.1:8429"...
2021-02-22T17:09:32.935Z	info	VictoriaMetrics/lib/memory/memory.go:43	limiting caches to 161940359577 bytes, leaving 107960239719 bytes to the OS according to -memory.allowedPercent=60
2021-02-22T17:09:32.936Z	warn	VictoriaMetrics/lib/persistentqueue/persistentqueue.go:318	"/data/vmagent/persistent-queue/1_28103F80D27BCD31/0000000000000000" size (6231028 bytes) is bigger than writer offset (6218989 bytes); this may be the case on unclean shutdown (OOM, `kill -9`, hardware reset); trying to fix it by adjusting fileSize to 6218989
2021-02-22T17:09:32.936Z	info	VictoriaMetrics/lib/persistentqueue/fastqueue.go:51	opened fast persistent queue at "/data/vmagent/persistent-queue/1_28103F80D27BCD31" with maxInmemoryBlocks=200, it contains 12447 pending bytes
2021-02-22T17:09:32.937Z	info	VictoriaMetrics/app/vmagent/remotewrite/client.go:130	initialized client for -remoteWrite.url="1:secret-url"
2021-02-22T17:09:32.938Z	info	VictoriaMetrics/app/vmagent/main.go:111	started vmagent in 0.006 seconds
2021-02-22T17:09:32.939Z	info	VictoriaMetrics/lib/promscrape/scraper.go:91	reading Prometheus configs from "/data/vmagent/promscrape.json"
2021-02-22T17:09:32.939Z	info	VictoriaMetrics/lib/httpserver/httpserver.go:82	starting http server at http://127.0.0.1:8429/
2021-02-22T17:09:32.939Z	info	VictoriaMetrics/lib/httpserver/httpserver.go:83	pprof handlers are exposed at http://127.0.0.1:8429/debug/pprof/
2021-02-22T17:09:32.944Z	info	VictoriaMetrics/lib/promscrape/scraper.go:344	static_configs: added targets: 3, removed targets: 0; total targets: 3
2021-02-22T17:09:32.966Z	error	VictoriaMetrics/app/vmagent/remotewrite/client.go:220	couldn't send a block with size 12439 bytes to "1:secret-url": Post "http://mdbt:8428/api/v1/write": dial tcp4: lookup mdbt on 141.44.21.1:53: write udp 141.44.34.10:59438->141.44.21.1:53: write: operation not permitted; re-sending the block in 2.000 seconds
2021-02-22T17:09:33.225Z	error	VictoriaMetrics/lib/promscrape/scrapework.go:235	error when scraping "http://localhost:9400/metrics" from job "nvidia" with labels {gz="kino2",instance="localhost:9400",job="nvidia"}: error when scraping "http://localhost:9400/metrics": dial tcp4 127.0.0.1:9400: connect: connection refused; try -enableTCP6 command-line flag if you scrape ipv6 addresses
2021-02-22T17:09:35.225Z	error	VictoriaMetrics/lib/promscrape/scrapework.go:235	error when scraping "http://localhost:9400/metrics" from job "nvidia" with labels {gz="kino2",instance="localhost:9400",job="nvidia"}: error when scraping "http://localhost:9400/metrics": dial tcp4 127.0.0.1:9400: connect: connection refused; try -enableTCP6 command-line flag if you scrape ipv6 addresses
2021-02-22T17:09:37.225Z	error	VictoriaMetrics/lib/promscrape/scrapework.go:235	error when scraping "http://localhost:9400/metrics" from job "nvidia" with labels {gz="kino2",instance="localhost:9400",job="nvidia"}: error when scraping "http://localhost:9400/metrics": dial tcp4 127.0.0.1:9400: connect: connection refused; try -enableTCP6 command-line flag if you scrape ipv6 addresses
2021-02-22T17:09:39.226Z	error	VictoriaMetrics/lib/promscrape/scrapework.go:235	error when scraping "http://localhost:9400/metrics" from job "nvidia" with labels {gz="kino2",instance="localhost:9400",job="nvidia"}: error when scraping "http://localhost:9400/metrics": dial tcp4 127.0.0.1:9400: connect: connection refused; try -enableTCP6 command-line flag if you scrape ipv6 addresses
2021-02-22T17:09:42.943Z	panic	VictoriaMetrics/lib/persistentqueue/persistentqueue.go:512	FATAL: too big block size read from "/data/vmagent/persistent-queue/1_28103F80D27BCD31/0000000000000000": 1198702969903609204 bytes; cannot exceed 33554432 bytes
panic: FATAL: too big block size read from "/data/vmagent/persistent-queue/1_28103F80D27BCD31/0000000000000000": 1198702969903609204 bytes; cannot exceed 33554432 bytes

goroutine 59 [running]:
github.com/VictoriaMetrics/VictoriaMetrics/lib/logger.logMessage(0xb11a68, 0x5, 0xc000cac000, 0xa1, 0x4)
	github.com/VictoriaMetrics/VictoriaMetrics/lib/logger/logger.go:258 +0xc8c
github.com/VictoriaMetrics/VictoriaMetrics/lib/logger.logLevelSkipframes(0x1, 0xb11a68, 0x5, 0xb15252, 0x9, 0xc0000a7d68, 0x1, 0x1)
	github.com/VictoriaMetrics/VictoriaMetrics/lib/logger/logger.go:126 +0xd0
github.com/VictoriaMetrics/VictoriaMetrics/lib/logger.logLevel(...)
	github.com/VictoriaMetrics/VictoriaMetrics/lib/logger/logger.go:118
github.com/VictoriaMetrics/VictoriaMetrics/lib/logger.Panicf(...)
	github.com/VictoriaMetrics/VictoriaMetrics/lib/logger/logger.go:114
github.com/VictoriaMetrics/VictoriaMetrics/lib/persistentqueue.(*Queue).MustReadBlock(0xc0000f0360, 0xc000292360, 0x0, 0x120, 0x0, 0x0, 0x0, 0xc0000c4000)
	github.com/VictoriaMetrics/VictoriaMetrics/lib/persistentqueue/persistentqueue.go:512 +0x219
github.com/VictoriaMetrics/VictoriaMetrics/lib/persistentqueue.(*FastQueue).MustReadBlock(0xc0001a0000, 0xc000292360, 0x0, 0x120, 0x0, 0x0, 0x0, 0x0)
	github.com/VictoriaMetrics/VictoriaMetrics/lib/persistentqueue/fastqueue.go:169 +0x146
github.com/VictoriaMetrics/VictoriaMetrics/app/vmagent/remotewrite.(*client).runWorker(0xc000022360)
	github.com/VictoriaMetrics/VictoriaMetrics/app/vmagent/remotewrite/client.go:164 +0xa3
github.com/VictoriaMetrics/VictoriaMetrics/app/vmagent/remotewrite.newClient.func1(0xc000022360)
	github.com/VictoriaMetrics/VictoriaMetrics/app/vmagent/remotewrite/client.go:127 +0x54
created by github.com/VictoriaMetrics/VictoriaMetrics/app/vmagent/remotewrite.newClient
	github.com/VictoriaMetrics/VictoriaMetrics/app/vmagent/remotewrite/client.go:125 +0x9da

@jelmd
Contributor Author

jelmd commented Feb 22, 2021

admin.box2 ~ > ll /data/vmagent/persistent-queue/1_28103F80D27BCD31/0000000000000000
-rw-r--r-- 1 daemon daemon 6234577 Feb 22 18:09 /data/vmagent/persistent-queue/1_28103F80D27BCD31/0000000000000000

@hagen1778 hagen1778 changed the title from "vmagent: cached data bug? Integer overflow?" to "vmagent: drop corrupted data blocks on start up" Feb 23, 2021
@hagen1778
Collaborator

As a workaround, have you tried adding a startup script which drops the vmagent queue after a reboot?

@jelmd
Contributor Author

jelmd commented Feb 24, 2021

Hmm, I am not sure what you mean. IMHO good apps handle errors and do not just panic and put all the burden on the user, who actually (at least in my case) has no clue whether it is really time to run away and panic as well ... ;-)

@valyala
Collaborator

valyala commented Mar 30, 2021

good apps handle errors and do not just panic and put all the burden on the user

Agreed. We need to figure out how to handle the error properly.

A temporary solution is to run rm -rf /data/vmagent/persistent-queue/* before starting vmagent after an unclean shutdown, as @hagen1778 pointed out.
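As a hedged sketch of wiring that workaround into systemd (the unit and drop-in names are hypothetical; the path is taken from the logs above), note that it discards any buffered-but-unsent samples on every start, trading durability for availability:

# /etc/systemd/system/vmagent.service.d/drop-queue.conf (hypothetical path)
[Service]
# Clear the persistent queue before each start so a corrupted block
# cannot crash vmagent; buffered-but-unsent data is lost.
ExecStartPre=/bin/sh -c 'rm -rf /data/vmagent/persistent-queue/*'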

@valyala
Collaborator

valyala commented Apr 5, 2021

Related issue: #687

valyala added a commit that referenced this issue Apr 5, 2021
@valyala
Collaborator

valyala commented Apr 5, 2021

@jelmd, vmagent should recover when reading from a corrupted persistent queue at -remoteWrite.tmpDataPath starting from commit 95dbebf. This commit will be included in the next release.

@valyala
Collaborator

valyala commented Apr 8, 2021

vmagent should automatically recover when reading from a corrupted persistent queue starting from v1.58.0. Closing the issue as fixed.

@valyala valyala closed this as completed Apr 8, 2021