vmstorage sometimes fails to start on NFSv3 disks #162

Closed
prezhdarov opened this issue Aug 26, 2019 · 10 comments
Labels: bug (Something isn't working), question (The question issue)

Comments

@prezhdarov

Deployed VM in Kubernetes yesterday, and this morning both vmstorage nodes were dead. Apparently they were looking for a file in a folder that didn't exist (no, the vmstorage nodes did not lie). I cleaned up and redeployed vmstorage (new PVCs), and a few hours later one of the nodes is experiencing the same difficulties. Here's what it says on startup:

2019-08-26T19:04:15.088+0000 info lib/logger/flag.go:11 build version: vmstorage-20190822-115633-tags-v1.26.0-cluster-0-ge9db22a5
2019-08-26T19:04:15.088+0000 info lib/logger/flag.go:12 command line flags
2019-08-26T19:04:15.088+0000 info lib/logger/flag.go:20 flag "http.disableResponseCompression" = "false"
2019-08-26T19:04:15.088+0000 info lib/logger/flag.go:20 flag "httpListenAddr" = ":8482"
2019-08-26T19:04:15.088+0000 info lib/logger/flag.go:20 flag "loggerLevel" = "INFO"
2019-08-26T19:04:15.088+0000 info lib/logger/flag.go:20 flag "memory.allowedPercent" = "60"
2019-08-26T19:04:15.088+0000 info lib/logger/flag.go:20 flag "precisionBits" = "64"
2019-08-26T19:04:15.088+0000 info lib/logger/flag.go:20 flag "retentionPeriod" = "48"
2019-08-26T19:04:15.088+0000 info lib/logger/flag.go:20 flag "rpc.disableCompression" = "false"
2019-08-26T19:04:15.088+0000 info lib/logger/flag.go:20 flag "search.maxTagKeys" = "secret"
2019-08-26T19:04:15.088+0000 info lib/logger/flag.go:20 flag "search.maxTagValues" = "10000"
2019-08-26T19:04:15.088+0000 info lib/logger/flag.go:20 flag "search.maxUniqueTimeseries" = "100000"
2019-08-26T19:04:15.088+0000 info lib/logger/flag.go:20 flag "snapshotAuthKey" = "secret"
2019-08-26T19:04:15.088+0000 info lib/logger/flag.go:20 flag "storageDataPath" = "/storage"
2019-08-26T19:04:15.088+0000 info lib/logger/flag.go:20 flag "version" = "false"
2019-08-26T19:04:15.088+0000 info lib/logger/flag.go:20 flag "vminsertAddr" = ":8400"
2019-08-26T19:04:15.088+0000 info lib/logger/flag.go:20 flag "vmselectAddr" = ":8401"
2019-08-26T19:04:15.088+0000 info app/vmstorage/main.go:34 opening storage at "/storage" with retention period 48 months
2019-08-26T19:04:15.090+0000 info lib/memory/memory.go:33 limiting caches to 2576980377 bytes of RAM according to -memory.allowedPercent=60
2019-08-26T19:04:15.090+0000 info lib/storage/storage.go:548 loading MetricName->TSID cache from "/storage/cache/metricName_tsid"...
2019-08-26T19:04:15.091+0000 info lib/storage/storage.go:553 loaded MetricName->TSID cache from "/storage/cache/metricName_tsid" in 472.738µs; entriesCount: 0; sizeBytes: 0
2019-08-26T19:04:15.091+0000 info lib/storage/storage.go:548 loading MetricID->TSID cache from "/storage/cache/metricID_tsid"...
2019-08-26T19:04:15.091+0000 info lib/storage/storage.go:553 loaded MetricID->TSID cache from "/storage/cache/metricID_tsid" in 706.625µs; entriesCount: 0; sizeBytes: 0
2019-08-26T19:04:15.091+0000 info lib/storage/storage.go:548 loading MetricID->MetricName cache from "/storage/cache/metricID_metricName"...
2019-08-26T19:04:15.092+0000 info lib/storage/storage.go:553 loaded MetricID->MetricName cache from "/storage/cache/metricID_metricName" in 303.152µs; entriesCount: 0; sizeBytes: 0
2019-08-26T19:04:15.092+0000 info lib/storage/storage.go:548 loading Date->MetricID cache from "/storage/cache/date_metricID"...
2019-08-26T19:04:15.092+0000 info lib/storage/storage.go:553 loaded Date->MetricID cache from "/storage/cache/date_metricID" in 367.411µs; entriesCount: 0; sizeBytes: 0
2019-08-26T19:04:15.092+0000 info lib/storage/storage.go:482 loading curr_hour_metric_ids from "/storage/cache/curr_hour_metric_ids"...
2019-08-26T19:04:15.092+0000 info lib/storage/storage.go:485 nothing to load from "/storage/cache/curr_hour_metric_ids"
2019-08-26T19:04:15.092+0000 info lib/storage/storage.go:482 loading prev_hour_metric_ids from "/storage/cache/prev_hour_metric_ids"...
2019-08-26T19:04:15.092+0000 info lib/storage/storage.go:485 nothing to load from "/storage/cache/prev_hour_metric_ids"
2019-08-26T19:04:15.097+0000 info lib/mergeset/table.go:128 opening table "/storage/indexdb/15BE7AFB904C1F0B"...
2019-08-26T19:04:15.123+0000 info lib/mergeset/table.go:160 table "/storage/indexdb/15BE7AFB904C1F0B" has been opened in 25.740749ms; partsCount: 4; blocksCount: 49, itemsCount: 48595; sizeBytes: 163308
2019-08-26T19:04:15.125+0000 info lib/mergeset/table.go:128 opening table "/storage/indexdb/15BE7AFB904C1F0A"...
2019-08-26T19:04:15.132+0000 info lib/mergeset/table.go:160 table "/storage/indexdb/15BE7AFB904C1F0A" has been opened in 6.545175ms; partsCount: 0; blocksCount: 0, itemsCount: 0; sizeBytes: 0
2019-08-26T19:04:15.145+0000 info lib/mergeset/table.go:170 waiting for raw items flusher to stop on "/storage/indexdb/15BE7AFB904C1F0B"...
2019-08-26T19:04:15.145+0000 info lib/mergeset/table.go:173 raw items flusher stopped in 384ns on "/storage/indexdb/15BE7AFB904C1F0B"
2019-08-26T19:04:15.145+0000 info lib/mergeset/table.go:175 waiting for part mergers to stop on "/storage/indexdb/15BE7AFB904C1F0B"...
2019-08-26T19:04:15.145+0000 info lib/mergeset/table.go:178 part mergers stopped in 395ns on "/storage/indexdb/15BE7AFB904C1F0B"
2019-08-26T19:04:15.145+0000 info lib/mergeset/table.go:180 flushing inmemory parts to files on "/storage/indexdb/15BE7AFB904C1F0B"...
2019-08-26T19:04:15.145+0000 info lib/mergeset/table.go:204 0 inmemory parts have been flushed to files in 2.118µs on "/storage/indexdb/15BE7AFB904C1F0B"
2019-08-26T19:04:15.146+0000 info lib/mergeset/table.go:170 waiting for raw items flusher to stop on "/storage/indexdb/15BE7AFB904C1F0A"...
2019-08-26T19:04:15.146+0000 info lib/mergeset/table.go:173 raw items flusher stopped in 4.566µs on "/storage/indexdb/15BE7AFB904C1F0A"
2019-08-26T19:04:15.146+0000 info lib/mergeset/table.go:175 waiting for part mergers to stop on "/storage/indexdb/15BE7AFB904C1F0A"...
2019-08-26T19:04:15.146+0000 info lib/mergeset/table.go:178 part mergers stopped in 237ns on "/storage/indexdb/15BE7AFB904C1F0A"
2019-08-26T19:04:15.146+0000 info lib/mergeset/table.go:180 flushing inmemory parts to files on "/storage/indexdb/15BE7AFB904C1F0A"...
2019-08-26T19:04:15.146+0000 info lib/mergeset/table.go:204 0 inmemory parts have been flushed to files in 1.098µs on "/storage/indexdb/15BE7AFB904C1F0A"
2019-08-26T19:04:15.148+0000 fatal app/vmstorage/main.go:38 cannot open a storage at /storage with retention period 48 months: cannot open table at "/storage/data": cannot open partitions in the table "/storage/data": cannot open partition "2019_08": cannot open small parts from "/storage/data/small/2019_08": cannot open part "/storage/data/small/2019_08/1502_1494_20190826161500.000_20190826161500.000_15BE7B0A0A234959": cannot open timestamps file: open /storage/data/small/2019_08/1502_1494_20190826161500.000_20190826161500.000_15BE7B0A0A234959/timestamps.bin: no such file or directory

The node is right again: there is no timestamps.bin in that folder, and in fact no other files either. The adjacent folders, however, do have those files, except maybe one or two.

Could this be a tiny little nuisance I've accidentally discovered, or am I holding it wrong? As you can see, it's a mostly basic setup - the persistent storage is NFSv3.

prezhdarov changed the title from "vmstorage seems to be losing its date after a while (or looking for the wrong kind if it)" to "vmstorage seems to be losing its data after a while (or looking for the wrong kind if it)" Aug 26, 2019
@prezhdarov
Author

Right, it seems this has been covered in #61 already. I might try to work out the NFS options needed to avoid such behaviour (NetApp Trident is the storage driver), but wouldn't it be easier if vmstorage simply ignored empty dirs and even tried to remove them at a later stage?

valyala changed the title from "vmstorage seems to be losing its data after a while (or looking for the wrong kind if it)" to "vmstorage sometimes fails to start on NFSv3 disks" Aug 28, 2019
@valyala
Collaborator

valyala commented Aug 28, 2019

It looks like VictoriaMetrics failed to properly flush files on the NFSv3 disk before stopping. Did the previous vmstorage container stop gracefully via a SIGINT or SIGTERM signal? Could you provide the log lines from the stopped vmstorage container before this error?

A temporary workaround is to manually remove the folder with incomplete data, /storage/data/small/2019_08/1502_1494_20190826161500.000_20190826161500.000_15BE7B0A0A234959 in this case. See the troubleshooting section for details. This can result in the loss of the data stored inside this folder.

wouldn't it be easier if vmstorage simply ignored empty dirs and even tried to remove them at a later stage?

This would hide real errors and may result in data loss, since automatic removal of broken directories could silently remove data stored inside them that could otherwise be recovered.

The proper solution is to use a more reliable filesystem for persistent volumes, such as GCP disks.

valyala added the bug and question labels Aug 29, 2019
@prezhdarov
Author

It is what we all think it is: NFS locks. Here's sample output from vmstorage crashing:

2019-09-03T14:55:22.128+0000 info app/vmstorage/transport/server.go:122 accepted vminsert conn from 172.30.7.10:39660
2019-09-03T14:55:22.130+0000 info app/vmstorage/transport/server.go:154 processing vminsert conn from 172.30.7.10:39660
2019-09-03T14:55:26.378+0000 panic lib/fs/fs.go:289 FATAL: cannot remove "/storage/indexdb/15BED237DBBFDA5E/2006_4_15C0F55E58C6ED7E" in 11 attempts: unlinkat /storage/indexdb/15BED237DBBFDA5E/2006_4_15C0F55E58C6ED7E/.nfs80e3356e00004d6c000065a9: device or resource busy
panic: FATAL: cannot remove "/storage/indexdb/15BED237DBBFDA5E/2006_4_15C0F55E58C6ED7E" in 11 attempts: unlinkat /storage/indexdb/15BED237DBBFDA5E/2006_4_15C0F55E58C6ED7E/.nfs80e3356e00004d6c000065a9: device or resource busy

goroutine 19 [running]:
github.com/VictoriaMetrics/VictoriaMetrics/lib/logger.logMessage(0xa56551, 0x5, 0xc0000da7e0, 0xd9, 0x3)
/VictoriaMetrics/lib/logger/logger.go:124 +0x53d
github.com/VictoriaMetrics/VictoriaMetrics/lib/logger.logLevel(0xa56551, 0x5, 0xa6d699, 0x2a, 0xc00e369fa0, 0x3, 0x3)
/VictoriaMetrics/lib/logger/logger.go:78 +0x113
github.com/VictoriaMetrics/VictoriaMetrics/lib/logger.Panicf(...)
/VictoriaMetrics/lib/logger/logger.go:62
github.com/VictoriaMetrics/VictoriaMetrics/lib/fs.dirRemover()
/VictoriaMetrics/lib/fs/fs.go:289 +0x1b5
created by github.com/VictoriaMetrics/VictoriaMetrics/lib/fs.init.0
/VictoriaMetrics/lib/fs/fs.go:306 +0x35

I'm not going to go to great lengths to defend NFS as a viable storage solution. It's enough to mention that this particular Kubernetes cluster uses an industry-standard storage system and the vendor's Kubernetes driver for persistent storage, and that there are InfluxDB and Prometheus instances running flawlessly on it; I could add GitLab, goharbor, Jenkins and ELK, among a few others. Also, GKE is not an option, since this cluster and a few more are locally run Kubernetes clusters. There I go... I went too far defending my beloved NFS. I can add that this vendor's NFS volumes are sought after in all public clouds for providing far superior performance and lower latency, even better than GCP disks. We will probably see more and more NFS, as it feels natural to the whole Kubernetes and cloud idea. But anyway:

Now, let's make VM happy with NFS locks. Apparently 11 attempts (or are these 11 ms?) are not enough to clear the locks, even though less than a second later the folders are empty. What about a quick and dirty workaround: a tiny in-memory KV holding all failed remove attempts; the main processes ignore all paths in it, while a sidecar process (I know Go is great at spawning those) attempts to clean them up at its leisure. At startup, a quick lookup finds all the empty ones, puts them in said KV, and off we go again... I should read through the source and find out what pile of gibberish I just wrote, but still, it's an idea that might work until something better comes to mind. I was also thinking of a sidecar container cron job that removes them all every minute; even if vmstorage dies for a minute, hopefully vminsert would keep the records, but you said this may corrupt data, so no...

@valyala
Collaborator

valyala commented Sep 4, 2019

What about a quick and dirty workaround: a tiny in-memory KV holding all failed remove attempts; the main processes ignore all paths in it, while a sidecar process (I know Go is great at spawning those) attempts to clean them up at its leisure.

This is how it works now; see the dirRemover code. The 11 attempts were performed over roughly 1s, with 100ms sleep intervals between attempts.

I'll remove the limit on the number of failed attempts in the follow-up commit.
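
For illustration, a minimal, self-contained sketch of this pattern could look like the code below. It is not the actual lib/fs/fs.go implementation; the channel, the scheduleDirRemoval helper and the isTemporaryNFSError check are assumptions made for the sketch.

```go
package main

import (
	"log"
	"os"
	"strings"
	"time"
)

// removeDirCh receives directories that could not be removed synchronously,
// e.g. because NFS left behind .nfsXXXX silly-rename files.
var removeDirCh = make(chan string, 1024)

// scheduleDirRemoval queues a directory for background removal.
func scheduleDirRemoval(path string) {
	removeDirCh <- path
}

// dirRemover retries os.RemoveAll with short sleeps between attempts.
// The bounded retry count mirrors the "11 attempts in ~1s" behaviour;
// dropping the bound (or retrying much longer) is what the follow-up change does.
func dirRemover(maxAttempts int, sleep time.Duration) {
	for path := range removeDirCh {
		var err error
		for attempt := 0; attempt < maxAttempts; attempt++ {
			err = os.RemoveAll(path)
			if err == nil || !isTemporaryNFSError(err) {
				break
			}
			time.Sleep(sleep)
		}
		if err != nil {
			log.Panicf("FATAL: cannot remove %q in %d attempts: %s", path, maxAttempts, err)
		}
	}
}

// isTemporaryNFSError reports whether the error looks like the transient
// "device or resource busy" returned while .nfsXXXX files still exist.
func isTemporaryNFSError(err error) bool {
	msg := err.Error()
	return strings.Contains(msg, "device or resource busy") || strings.Contains(msg, "directory not empty")
}

func main() {
	go dirRemover(11, 100*time.Millisecond)
	scheduleDirRemoval("/tmp/example-dir-to-remove") // illustrative path only
	time.Sleep(2 * time.Second)                      // give the background remover time to run
}
```

With 11 attempts and 100ms sleeps, a directory held busy by .nfsXXXX files for little more than a second triggers the panic shown in the stack trace above, which is why removing the attempt limit helps on laggy NFS.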

valyala added a commit that referenced this issue Sep 4, 2019
…orary lock

Do not give up after 11 attempts of directory removal on laggy NFS.

Add `vm_nfs_dir_remove_failed_attempts_total` metric for counting the number of failed attempts
on directory removal.

Log failed attempts on directory removal after long sleep times.

Updates #162
@valyala
Collaborator

valyala commented Sep 4, 2019

@prezhdarov, try building from 82bfe81 and verify whether the issue is fixed after removing the empty directories. See the build instructions.

Monitor the vmstorage logs and the vm_nfs_dir_remove_failed_attempts_total metric exported on the /metrics page by vmstorage nodes. This metric should increment on each failed attempt to remove a directory on NFS.
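If the metric is scraped into a Prometheus-compatible system, a query such as `increase(vm_nfs_dir_remove_failed_attempts_total[1m])` should show the number of failed removal attempts per minute (the exact query is just an example; any counter-rate function works).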

@prezhdarov
Author

It has been working fine for the past 5 hours:
NAME READY STATUS RESTARTS AGE
vmstorage-0 1/1 Running 2 4h52m
vmstorage-1 1/1 Running 0 4h52m

The vmstorage-0 restarts were triggered by already existing empty folders, which I removed quickly.
Removals per minute look like this:
[screenshot: "NFS dir remove" failed-attempts graph]

Roughly 60 failed attempts per minute. Now, if we could find out the delay required for removals, and if I can manage to get NFS latency per pod/container, we could easily see whether the two are related. I may also check whether this is negated by different NFS mount options, i.e. sync and hard mounts, which will increase latency and storage work but might just get rid of all the NFS locks... I should have thought about this earlier.

valyala added a commit that referenced this issue Sep 5, 2019
…are removed on graceful shutdown

This patch is mainly required for laggy NFS. See #162
@valyala
Collaborator

valyala commented Sep 5, 2019

@prezhdarov, fixes for the issue have been added to release v1.27.2.
Closing the issue as fixed.
Feel free to reopen it if similar issues appear again after v1.27.2.

valyala closed this as completed Sep 5, 2019
valyala added a commit that referenced this issue Nov 10, 2019
Continue trying to remove NFS directory on temporary errors for up to a minute.

The previous async removal process breaks in the following case during VictoriaMetrics start

- VictoriaMetrics opens index, finds incomplete merge transactions and starts replaying them.
- The transaction instructs removing old directories for parts, which were already merged into bigger part.
- VictoriaMetrics removes these directories, but their removal is delayed due to NFS errors.
- VictoriaMetrics scans partition directory after all the incomplete merge transactions are finished
  and finds directories, which should be removed, but weren't still removed due to NFS errors.
- VictoriaMetrics panics when it finds unexpected empty directory.

Updates #162
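
As an illustration of the fix direction, a rough sketch could retry removals on temporary errors up to a deadline and let the startup code wait for scheduled removals before scanning the partition directory. The names removeDirAsync and mustWaitForDirRemovals are hypothetical, not the actual lib/fs API, and the explicit wait is included only to make the race described above visible; the actual commit may solve it differently.

```go
package main

import (
	"log"
	"os"
	"sync"
	"time"
)

var pendingDirRemovals sync.WaitGroup

// removeDirAsync keeps retrying removal of a directory on errors for up to
// the given deadline, instead of giving up after a fixed number of attempts.
func removeDirAsync(path string, deadline time.Duration) {
	pendingDirRemovals.Add(1)
	go func() {
		defer pendingDirRemovals.Done()
		start := time.Now()
		for {
			err := os.RemoveAll(path)
			if err == nil {
				return
			}
			if time.Since(start) > deadline {
				log.Panicf("FATAL: cannot remove %q within %s: %s", path, deadline, err)
			}
			time.Sleep(100 * time.Millisecond)
		}
	}()
}

// mustWaitForDirRemovals blocks until all scheduled removals have finished,
// so a later scan of the partition directory cannot stumble upon
// half-removed (empty) part directories and panic.
func mustWaitForDirRemovals() {
	pendingDirRemovals.Wait()
}

func main() {
	removeDirAsync("/tmp/obsolete-part-dir", time.Minute) // illustrative path only
	mustWaitForDirRemovals()
	// ... the partition directory would be scanned here ...
}
```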
@valyala
Collaborator

valyala commented Nov 12, 2019

FYI, the issue was still present and should be finally fixed in v1.29.0.

valyala added a commit that referenced this issue Dec 2, 2019
…oned dirs are really removed

This should fix the issue on NFS when incompletely removed dirs may be left
after unclean shutdown (OOM, kill -9, hard reset, etc.), while the corresponding transaction
files are already removed.

Updates #162
@valyala
Collaborator

valyala commented Dec 2, 2019

FYI, the issue should be finally fixed by the commits referenced above.

The fix will be included in the next VictoriaMetrics release. Until then, it is possible to build VictoriaMetrics from source at the commits mentioned above according to the build instructions.

Related issue: #61.

valyala reopened this Dec 2, 2019
@valyala
Collaborator

valyala commented Dec 2, 2019

The issue should be fixed in v1.30.3.
