
Data corruption when using NBD mode #184

Closed
Nikratio opened this issue May 8, 2022 · 14 comments

@Nikratio
Contributor

Nikratio commented May 8, 2022

As far as I can tell, there is something that corrupts data when running s3backer in NBD mode.

I am running ZFS on top of the NBD devices, and I'm getting frequent errors that I think can only be explained by data corruption. For example, when running examples/create_zpool.py and then trying to import the fresh zpool, I have now several times gotten errors saying that vdevs had the wrong GUID, or just general "I/O errors".

When I replace the s3backer NBD backend with the file backend, the problems all seem to go away.

The errors are exactly the sort I would expect from eventual consistency (I think s3backer's protection mechanisms do not help when s3backer is restarted), but Amazon is pretty clear about offering strong consistency for everything: https://aws.amazon.com/s3/consistency/

Therefore, the only explanation I have is that something is not working right in the NBD-s3backer read or write path.

I was thinking about creating a simple unit test that writes data through the NBD interface (but not using the NBD server itself), reads it back, and confirms that the contents are correct. However, I strongly suspect that the problem is not that straightforward, and that it is something about the sequence of operations that ZFS performs.
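
Roughly what I have in mind (untested sketch; the device path, block size, and test range are placeholders, and the kernel page cache may mask backend problems unless it is bypassed or flushed):

import hashlib
import os
import random

DEV = "/dev/nbd0"            # placeholder: device backed by the s3backer nbdkit plugin
BLOCK = 4096                 # placeholder I/O size
SPAN = 256 * 1024 * 1024     # exercise only the first 256 MiB

fd = os.open(DEV, os.O_RDWR)
written = {}
for _ in range(1024):
    off = random.randrange(0, SPAN, BLOCK)        # block-aligned random offset
    buf = os.urandom(BLOCK)
    os.pwrite(fd, buf, off)
    written[off] = hashlib.sha256(buf).digest()   # remember what was written last

os.fsync(fd)                                      # flush cached writes to the device
for off, digest in written.items():
    if hashlib.sha256(os.pread(fd, BLOCK, off)).digest() != digest:
        print(f"mismatch at offset {off}")
os.close(fd)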

Is s3backer executing NBD requests synchronously, or are they deferred to background threads? Could there be an ordering issue between TRIM and WRITE requests, or something like that?

@archiecobbs
Owner

How data is written back is determined by the way it's configured, e.g., --blockCacheWriteDelay. If you eliminate the block cache altogether then reads and writes should be straightforward.

To verify it's not an eventual consistency problem, you could turn on/off the ec_protect layer (via --md5CacheSize).

You can also try to run nbdkit with --threads 1 to see if that changes anything.

We implement NBD TRIM synchronously (see s3b_nbd_plugin_trim()), so assuming S3 is truly consistent then that should be OK. But if you suspect TRIM is involved, just comment out that function and see if that fixes it.

There are ways we can gather more info. For example, create the filesystem with NBD, verify the problem, then unmount and remount with FUSE. If the corruption goes away, then it's a read problem, not a write problem.

Make sure you're not connecting multiple clients to the same server.

Try commenting out s3b_nbd_plugin_can_multi_conn()...

@Nikratio
Contributor Author

Nikratio commented May 9, 2022

Here's what happens when trying to import directly after export (without restarting s3backer):

# nbdkit --unix /tmp/tmpr9hete6o/nbd_socket_sb --foreground --filter=exitlast --threads 16 s3backer s3b_region=eu-west-2 s3b_size=50G s3b_force=true bucket=nikratio-backup/sb s3b_blockSize=32K &
nbdkit: auto-detecting block size and total file size...
nbdkit: auto-detected block size=32k and total size=50g
2022-05-09 05:52:18 INFO: established new mount token 0x5872d5fa
2022-05-09 05:52:18 INFO: mounting nikratio-backup

# nbdkit --unix /tmp/tmpr9hete6o/nbd_socket_lb --foreground --filter=exitlast --threads 16 s3backer s3b_region=eu-west-2 s3b_size=50G s3b_force=true bucket=nikratio-backup/lb s3b_blockSize=512K &
nbdkit: auto-detecting block size and total file size...
nbdkit: auto-detected block size=512k and total size=50g
2022-05-09 05:52:18 INFO: established new mount token 0x19b9d2dc
2022-05-09 05:52:18 INFO: mounting nikratio-backup

# nbd-client -unix /tmp/tmpr9hete6o/nbd_socket_sb /dev/nbd0
Warning: the oldstyle protocol is no longer supported.
This method now uses the newstyle protocol with a default export
Negotiation: ..size = 51200MB
Connected /dev/nbd0

# nbd-client -unix /tmp/tmpr9hete6o/nbd_socket_lb /dev/nbd1
Warning: the oldstyle protocol is no longer supported.
This method now uses the newstyle protocol with a default export
Negotiation: ..size = 51200MB
Connected /dev/nbd1

# zpool create -f -R /zpools -o ashift=9 -o autotrim=on -o failmode=continue -O acltype=posixacl -O relatime=on -O xattr=sa -O compression=zstd-19 -O checksum=sha256 -O sync=disabled -O special_small_blocks=131072 -O redundant_metadata=most -O recordsize=131072 -O encryption=on -O keyformat=passphrase -O keylocation=file:///home/nikratio/lib/s3_zfs.key s3backup /dev/nbd1 special /dev/nbd0

# zfs create s3backup/vostro
# zpool export s3backup
# sync

# zpool import -R /zpools -d /dev/nbd1 -d /dev/nbd0 s3backup
cannot import 's3backup': no such pool available

2022-05-09 05:53:03 INFO: unmount nikratio-backup: initiated
2022-05-09 05:53:03 INFO: unmount nikratio-backup: shutting down filesystem
2022-05-09 05:53:03 INFO: unmount nikratio-backup: clearing mount token
2022-05-09 05:53:04 INFO: unmount nikratio-backup: completed

# dmesg | tail
May 09 05:52:18 vostro.rath.org kernel: nbd0: detected capacity change from 0 to 104857600
May 09 05:52:19 vostro.rath.org kernel: nbd1: detected capacity change from 0 to 104857600
May 09 05:53:03 vostro.rath.org kernel: block nbd0: Other side returned error (108)
May 09 05:53:03 vostro.rath.org kernel: block nbd0: Other side returned error (108)
May 09 05:53:03 vostro.rath.org kernel: I/O error, dev nbd0, sector 32 op 0x0:(READ) flags 0x4000 phys_seg 8 prio class 2
May 09 05:53:03 vostro.rath.org kernel: I/O error, dev nbd0, sector 96 op 0x0:(READ) flags 0x4000 phys_seg 8 prio class 2
May 09 05:53:03 vostro.rath.org kernel: block nbd0: Other side returned error (108)
May 09 05:53:03 vostro.rath.org kernel: block nbd0: Other side returned error (108)
May 09 05:53:03 vostro.rath.org kernel: I/O error, dev nbd0, sector 160 op 0x0:(READ) flags 0x4000 phys_seg 8 prio class 2
May 09 05:53:03 vostro.rath.org kernel: I/O error, dev nbd0, sector 224 op 0x0:(READ) flags 0x0 phys_seg 4 prio class 2
May 09 05:53:03 vostro.rath.org kernel: block nbd0: Other side returned error (108)
May 09 05:53:03 vostro.rath.org kernel: block nbd1: Other side returned error (108)
May 09 05:53:03 vostro.rath.org kernel: block nbd0: Other side returned error (108)
May 09 05:53:03 vostro.rath.org kernel: I/O error, dev nbd0, sector 544 op 0x0:(READ) flags 0x4000 phys_seg 8 prio class 0
May 09 05:53:03 vostro.rath.org kernel: I/O error, dev nbd1, sector 32 op 0x0:(READ) flags 0x0 phys_seg 28 prio class 2
May 09 05:53:03 vostro.rath.org kernel: I/O error, dev nbd0, sector 608 op 0x0:(READ) flags 0x4000 phys_seg 8 prio class 0
May 09 05:53:03 vostro.rath.org kernel: block nbd0: Other side returned error (108)
May 09 05:53:03 vostro.rath.org kernel: block nbd0: Other side returned error (108)
May 09 05:53:03 vostro.rath.org kernel: I/O error, dev nbd0, sector 672 op 0x0:(READ) flags 0x4000 phys_seg 8 prio class 0
May 09 05:53:03 vostro.rath.org kernel: I/O error, dev nbd0, sector 736 op 0x0:(READ) flags 0x0 phys_seg 4 prio class 0
May 09 05:53:03 vostro.rath.org kernel: block nbd1: Other side returned error (108)
May 09 05:53:03 vostro.rath.org kernel: I/O error, dev nbd1, sector 544 op 0x0:(READ) flags 0x0 phys_seg 28 prio class 0
May 09 05:53:03 vostro.rath.org kernel: block nbd0: Other side returned error (108)
May 09 05:53:03 vostro.rath.org kernel: block nbd0: Other side returned error (108)
May 09 05:53:03 vostro.rath.org kernel: block nbd0: Other side returned error (108)
May 09 05:53:03 vostro.rath.org kernel: block nbd0: Other side returned error (108)
May 09 05:53:03 vostro.rath.org kernel: block nbd1: Other side returned error (108)
May 09 05:53:03 vostro.rath.org kernel: block nbd0: Other side returned error (108)
May 09 05:53:03 vostro.rath.org kernel: block nbd0: Other side returned error (108)
May 09 05:53:03 vostro.rath.org kernel: block nbd0: Other side returned error (108)
May 09 05:53:03 vostro.rath.org kernel: block nbd0: Other side returned error (108)
May 09 05:53:03 vostro.rath.org kernel: block nbd1: Other side returned error (108)
May 09 05:53:03 vostro.rath.org kernel: block nbd0: Receive control failed (result -32)

@Nikratio
Contributor Author

Nikratio commented May 9, 2022

I think the first nbdkit instance must have crashed, but it didn't leave behind a core file anywhere...

@archiecobbs
Owner

So it looks like you have a reliable recipe for reproducing this - which is good. Now let's try changing variables until the problem goes away.

See earlier post for some things to try.

Here's another one to try: set the s3backer block size to 4k in all cases. This should eliminate partial writes because the current NBD code also uses a block size of 4k.

@Nikratio
Contributor Author

For those following along, I have now added a very simple Python & boto based S3 plugin to nbdkit itself (https://gitlab.com/nbdkit/nbdkit/-/tree/master/plugins/S3). This should rule out any issues with memory management, compression, encryption, etc.

I then created a debug version of the plugin (https://gitlab.com/Nikolaus2/nbdkit/-/tree/s3_debug) which maintains a local copy of the bucket and compares contents after each write(), and on each read.
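
The comparison logic is essentially this (simplified sketch, not the actual plugin code; s3_pread/s3_pwrite stand in for the boto calls and the mirror path is made up):

MIRROR = "/var/tmp/bucket-mirror.img"   # hypothetical local copy of the bucket contents

def debug_pwrite(offset, data):
    s3_pwrite(offset, data)             # placeholder for the plugin's real S3 upload path
    with open(MIRROR, "r+b") as f:      # keep the local mirror in lock-step
        f.seek(offset)
        f.write(data)

def debug_pread(offset, length):
    remote = s3_pread(offset, length)   # placeholder for the plugin's real S3 download path
    with open(MIRROR, "rb") as f:
        f.seek(offset)
        local = f.read(length)
    if remote != local:
        raise RuntimeError(
            f"read(off={offset}, len={length}) differs between local and remote")
    return remote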

Interestingly enough, data errors continue to happen:

Running zpool create -f -R /zpools -o ashift=9 -o autotrim=on -o failmode=continue -O acltype=posixacl -O relatime=on -O xattr=sa -O compression=zstd-19 -O checksum=sha256 -O sync=disabled -O special_small_blocks=131072 -O redundant_metadata=most -O recordsize=131072 -O encryption=on -O keyformat=passphrase -O keylocation=file:///home/nikratio/lib/s3_zfs.key s3backup /dev/nbd1 special /dev/nbd0
nbdkit: python.4: error: /usr/local/lib/nbdkit/plugins/nbdkit-S3-plugin: pread: error: Traceback (most recent call last):
   File "/usr/local/lib/nbdkit/plugins/nbdkit-S3-plugin", line 203, in pread
    raise RuntimeError(
 RuntimeError: INTERNAL ERROR[1]: read(off=16384, len=32768) differs between local and remote

nbdkit: python.11: error: /usr/local/lib/nbdkit/plugins/nbdkit-S3-plugin: pread: error: Traceback (most recent call last):
   File "/usr/local/lib/nbdkit/plugins/nbdkit-S3-plugin", line 203, in pread
    raise RuntimeError(
 RuntimeError: INTERNAL ERROR[2]: read(off=114688, len=16384) differs between local and remote

nbdkit: python.7: error: /usr/local/lib/nbdkit/plugins/nbdkit-S3-plugin: pread: error: Traceback (most recent call last):
   File "/usr/local/lib/nbdkit/plugins/nbdkit-S3-plugin", line 203, in pread
    raise RuntimeError(
 RuntimeError: INTERNAL ERROR[3]: read(off=376832, len=16384) differs between local and remote

However, these errors are only happening in pread. In other words, when I am reading the data back from S3 directly after pwrite, the content is as expected, but when it is retrieved again by pread() later on, the contents have changed.

Work continues...

@archiecobbs
Owner

I love a good bug hunt. Keep us posted.

@Nikratio
Contributor Author

Nikratio commented May 20, 2022

I am now 99% sure that the problem is with Amazon S3, and it looks exactly like the behavior that I would expect from eventual consistency. Data is sometimes old and sometimes fresh.

I am surprised that s3backer's MD5 cache isn't detecting this. But I guess within the same process the request is silently retried until it succeeds, and if s3backer restarts then the old cache is gone, so it can't detect that it is delivering outdated data.

I guess I should ask Amazon support for help, but I'm not looking forward to the experience....

@archiecobbs
Owner

I am now 99% sure that the problem is with Amazon S3, and it looks exactly like the behavior that I would expect from eventual consistency. Data is sometimes old and sometimes fresh.

Hmmm. I'm skeptical that S3 would be so glaringly in violation of their documented behavior.

I am surprised that s3backer's MD5 cache isn't detecting this. But I guess within the same process the request is silently retried until it succeeds, and if s3backer restarts then the old cache is gone so it can't detect that it delivers outdated data..

Just to clarify terminology, the MD5 cache doesn't actually detect anything. All it does is guarantee consistency if the actual convergence time is less than --md5CacheTime and --minWriteDelay.

For what you're suggesting to be happening, the convergence time would have to be longer than the time it takes you to stop and restart s3backer.

So you're asserting that not only is S3 violating their stated consistency guarantees, they are doing so with a fairly large convergence time.

If you really believe this the next step would be to write a stripped-down test case that proves it.
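
For example, something along these lines (rough sketch; the bucket name is a placeholder, and this only probes single-key read-after-write, not the concurrent access pattern ZFS generates):

import os
import time
import boto3

BUCKET = "my-test-bucket"       # placeholder
KEY = "consistency-probe"
s3 = boto3.client("s3")

for i in range(1000):
    payload = os.urandom(64)
    s3.put_object(Bucket=BUCKET, Key=KEY, Body=payload)
    got = s3.get_object(Bucket=BUCKET, Key=KEY)["Body"].read()
    if got != payload:
        print(f"stale read on iteration {i}")
        break
    time.sleep(0.1)
else:
    print("no stale reads observed")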

@archiecobbs
Owner

Dumb question... do you see the same problem when running with --test instead of over S3?

@archiecobbs
Owner

It just occurred to me that the problem might be due to how s3backer handles writes of partial blocks. Partial writes are handled by reading, patching, and writing back. If there are two simultaneous partial writes that don't overlap but happen to fall within the same s3backer block, then one of them could "lose".

Can you test whether the problem occurs when you set the s3backer block size to 4K (same as NBD), so that there are never any partial writes?
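
To spell out the race (illustrative sketch only, not s3backer code; a dict stands in for the S3 objects):

BLOCK_SIZE = 32 * 1024                  # s3backer block size in this example
store = {0: bytes(BLOCK_SIZE)}          # stand-in for the S3 object holding block 0

def partial_write(block_num, offset, data):
    old = store[block_num]                                  # 1. read the whole block
    new = old[:offset] + data + old[offset + len(data):]    # 2. patch the sub-range
    store[block_num] = new                                   # 3. write the whole block back

# Thread A: partial_write(0, 0,    b"A" * 4096)
# Thread B: partial_write(0, 4096, b"B" * 4096)
# If both threads do step 1 before either reaches step 3, whichever write-back
# lands last overwrites the other thread's 4 KiB, and that update is silently lost.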

@Nikratio
Contributor Author

Hmm, I think you are right. This is definitely something that could happen, and may well be what's happening here. You wouldn't see it in FUSE mode because FUSE forces serialized writes. And it explains why I haven't seen this happen when forcing write serialization in NBD (I gave up on that because it was so slow, resulting in timeouts in the kernel block layer).

@Nikratio
Contributor Author

There's no way to set an NBD blocksize, just a *maximum* blocksize. And even that doesn't ensure that blocks are aligned, so I can't test it that way. I'll just fix the issue and see if the problem still occurs.

@archiecobbs
Owner

There's no way to set an NBD blocksize, just a maximum blocksize

We pass the -block-size 4096 flag to nbd-client (here). This is what I was referring to, though I admit I'm not clear exactly on what it actually does.

archiecobbs added a commit that referenced this issue May 21, 2022
…#184).

When these block sizes are different, it's possible for the kernel to be
reading and/or writing (different parts of) the same block simultaneously.
Because partial block writes are performed by reading, patching, then writing
back the entire block, this means data writes can get lost unless we impose
read/write locking for each block.

As part of this change, simplify the code by requiring all partial read/write
operations to be done at the top layer of the stack (i.e., FUSE or NBD layer).
This eliminates the "read_block_part" and "write_block_part" layer functions.
@archiecobbs
Owner

Should hopefully be fixed by d1bce95.
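
The shape of the fix, continuing the sketch above (Python for brevity, with a plain mutex instead of a reader/writer lock; the real change is in the C code):

import threading
from collections import defaultdict

block_locks = defaultdict(threading.Lock)    # one lock per s3backer block number

def locked_partial_write(block_num, offset, data):
    with block_locks[block_num]:             # serialize read-patch-write-back per block
        old = store[block_num]
        store[block_num] = old[:offset] + data + old[offset + len(data):]

def locked_partial_read(block_num, offset, length):
    with block_locks[block_num]:             # readers take the same lock in this sketch
        return store[block_num][offset:offset + length]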
