Data corruption when using NBD mode #184
How data is written back is determined by the way it's configured, e.g. …

To verify it's not an eventual consistency problem, you could turn the ec_protect layer on/off (via …). You can also try to run …

We implement NBD TRIM synchronously (see …).

There are ways we can gather more info. For example:

- Create the filesystem with NBD, verify the problem, then unmount and remount with FUSE. If the corruption goes away, then it's a read problem, not a write problem.
- Make sure you're not connecting multiple clients to the same server.
- Try commenting out …
Here's what happens when trying to directly import after export (without restarting s3backer): …
I think the first nbdkit instance must have crashed, but it didn't leave behind a core file anywhere...
So it looks like you have a reliable recipe for reproducing this - which is good. Now let's try changing variables until the problem goes away. See earlier post for some things to try. Here's another one to try: set the s3backer block size to 4k in all cases. This should eliminate partial writes because the current NBD code also uses a block size of 4k.
For those following along, I have now added a very simple Python & boto based S3 plugin to nbdkit itself (https://gitlab.com/nbdkit/nbdkit/-/tree/master/plugins/S3). This should rule out any issues with memory management, compression, encryption, etc. I then created a debug version of the plugin (https://gitlab.com/Nikolaus2/nbdkit/-/tree/s3_debug) which maintains a local copy of the bucket and compares contents after each write(), and on each read. Interestingly enough, data errors continue to happen:
However, these errors are only happening in …. Work continues...
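For readers who want to try the mirror-and-compare idea themselves, here is a rough sketch of it using boto3. The bucket name and the `MirroredStore` class are made up for illustration; the actual debug plugin linked above hooks into nbdkit directly and keeps its copy of the bucket on disk, whereas this sketch just mirrors writes in memory.

```python
# Sketch of the "keep a local mirror and compare" idea (hypothetical names,
# not the actual debug plugin). Requires boto3 credentials and a scratch bucket.
import boto3

class MirroredStore:
    """Writes go to S3 and to an in-memory mirror; reads compare both."""

    def __init__(self, bucket):
        self.s3 = boto3.client("s3")
        self.bucket = bucket
        self.mirror = {}          # key -> last data written through us

    def write_block(self, key, data):
        self.mirror[key] = bytes(data)
        self.s3.put_object(Bucket=self.bucket, Key=key, Body=data)

    def read_block(self, key):
        body = self.s3.get_object(Bucket=self.bucket, Key=key)["Body"].read()
        expected = self.mirror.get(key)
        if expected is not None and body != expected:
            raise RuntimeError(f"mismatch on {key}: S3 returned stale or garbled data")
        return body
```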
I love a good bug hunt. Keep us posted.
I am now 99% sure that the problem is with Amazon S3, and it looks exactly like the behavior that I would expect from eventual consistency. Data is sometimes old and sometimes fresh. I am surprised that s3backer's MD5 cache isn't detecting this. But I guess within the same process the request is silently retried until it succeeds, and if s3backer restarts then the old cache is gone, so it can't detect that it is delivering outdated data... I guess I should ask Amazon support for help, but I'm not looking forward to the experience...
Hmmm. I'm skeptical that S3 would be so glaringly in violation of their documented behavior.
Just to clarify terminology, the MD5 cache doesn't actually detect anything. All it does is guarantee consistency if the actual convergence time is less than ….

For what you're suggesting to actually be happening, the convergence time would have to be longer than the time it takes you to stop and restart s3backer. So you're asserting that not only is S3 violating their stated consistency guarantees, they are doing so with a fairly large convergence time. If you really believe this, the next step would be to write a stripped-down test case that proves it.
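A minimal version of such a stripped-down test case might look like the following (bucket and key names are hypothetical; requires boto3 and a scratch bucket). If S3 honors its strong read-after-write consistency guarantee, this should never report a mismatch.

```python
# Read-after-overwrite consistency probe against a scratch bucket.
import os
import boto3

BUCKET = "my-scratch-bucket"   # hypothetical
KEY = "consistency-probe"

s3 = boto3.client("s3")
for i in range(1000):
    payload = os.urandom(4096)
    s3.put_object(Bucket=BUCKET, Key=KEY, Body=payload)
    got = s3.get_object(Bucket=BUCKET, Key=KEY)["Body"].read()
    if got != payload:
        print(f"iteration {i}: read returned stale data!")
        break
else:
    print("no stale reads observed")
```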
Dumb question... do you see the same problem when running with …?
It just occurred to me that the problem might be due to how s3backer handles writes of partial blocks. Partial writes are handled by reading, patching, and writing back. If there are two simultaneous partial writes that don't overlap but happen to fall within the same s3backer block, then one of them could "lose". Can you test whether the problem occurs when you set the s3backer block size to 4K (same as NBD), so that there are never any partial writes?
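To make the lost-update scenario concrete, here is an illustration in Python (not s3backer's code): two threads patch different 4K halves of the same 8K "backend block" via read-modify-write with no per-block lock. Whether a write is actually lost on any given run is timing-dependent; in s3backer the window is much larger because the read and the write-back each go over the network.

```python
# Illustration of the racy read-patch-write-back sequence described above.
import threading

BLOCK_SIZE = 8192
backend = bytearray(BLOCK_SIZE)      # stands in for one S3 object

def partial_write(offset, data):
    block = bytes(backend)           # read the whole block
    patched = block[:offset] + data + block[offset + len(data):]   # patch
    backend[:] = patched             # write the whole block back

a = threading.Thread(target=partial_write, args=(0,    b"A" * 4096))
b = threading.Thread(target=partial_write, args=(4096, b"B" * 4096))
a.start(); b.start(); a.join(); b.join()

# If one thread read the block before the other wrote it back, one of the
# two 4K writes is silently lost.
print("first half ok: ", backend[:4096] == b"A" * 4096)
print("second half ok:", backend[4096:] == b"B" * 4096)
```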
Hmm, I think you are right. This is definitely something that could happen, and may well be what's happening here. You wouldn't see it in FUSE mode because this forces serialized writes. And it explains why I haven't seen this happen when forcing write serialization in NBD (I gave up on that because it was so slow, resulting in timeouts in the kernel block layer).
There's no way to set an NBD blocksize, just a *maximum* blocksize. And even that doesn't ensure that blocks are aligned, so I can't test it that way. I'll just fix the issue and see if the problem still occurs.
We pass the …
…#184). When these block sizes are different, it's possible for the kernel to be reading and/or writing (different parts of) the same block simultaneously. Because partial block writes are performed by reading, patching, then writing back the entire block, this means data writes can get lost unless we impose read/write locking for each block. As part of this change, simplify the code by requiring all partial read/write operations to be done at the top layer of the stack (i.e., FUSE or NBD layer). This eliminates the "read_block_part" and "write_block_part" layer functions.
Should hopefully be fixed by d1bce95.
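For illustration, the per-block locking approach described in the commit message above amounts to something like the following sketch (the real fix in d1bce95 is C code inside s3backer; names here are hypothetical):

```python
# Per-block locking so concurrent partial writes can no longer clobber each other.
import threading

BLOCK_SIZE = 8192
backend = {}                      # block number -> bytes (stand-in for S3 objects)
locks_guard = threading.Lock()    # protects creation of per-block locks
block_locks = {}                  # block number -> threading.Lock

def lock_for(block_num):
    with locks_guard:
        return block_locks.setdefault(block_num, threading.Lock())

def partial_write(block_num, offset, data):
    # Serialize every read-patch-write cycle that touches the same block.
    with lock_for(block_num):
        block = backend.get(block_num, bytes(BLOCK_SIZE))
        backend[block_num] = block[:offset] + data + block[offset + len(data):]
```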
As far as I can tell, there is something that corrupts data when running s3backer in NBD mode.
I am running ZFS on top of the NBD devices, and I'm getting frequent errors that I think can only be explained by data corruption. For example, when running examples/create_zpool.py and then trying to import the fresh zpool, I have now several times gotten errors that vdevs had the wrong guid, or just general "I/O errors". When replacing NBD's s3backer backend with the file backend, the problems all seem to go away.
The errors are exactly of the sort that I would expect from eventual consistency (I think s3backer's protection mechanisms do not help when s3backer is restarted), but Amazon is pretty clear about offering strong consistency for everything: https://aws.amazon.com/s3/consistency/
Therefore, the only explanation I have is that something is not working right in the NBD-s3backer read or write path.
I was thinking about creating a simple unit test that writes data through the NBD interface (but not using the NBD server itself), reads it back, and confirms that the contents are correct; a rough sketch of that idea is below. However, I strongly suspect that the problem is not that straightforward and that it is something about the sequence of operations that ZFS performs.
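As a variation on that idea, here is a minimal write/read-back check that goes through an attached NBD device (so it exercises the full kernel/nbdkit/s3backer path rather than the plugin in isolation). The device path is hypothetical, and running this overwrites the device, so it should only be pointed at a scratch export.

```python
# Write random 4K blocks at aligned offsets, then read them back and compare.
import os
import random

DEV = "/dev/nbd0"                      # hypothetical scratch device
IO_SIZE = 4096
NUM_OPS = 1000

fd = os.open(DEV, os.O_RDWR)
dev_size = os.lseek(fd, 0, os.SEEK_END)

written = {}
for _ in range(NUM_OPS):
    offset = random.randrange(0, dev_size - IO_SIZE, IO_SIZE)
    data = os.urandom(IO_SIZE)
    os.pwrite(fd, data, offset)
    written[offset] = data             # remember the last write at each offset

errors = sum(1 for offset, data in written.items()
             if os.pread(fd, IO_SIZE, offset) != data)
print(f"{errors} mismatched blocks out of {len(written)}")
os.close(fd)
```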
Is s3backer executing NBD requests synchronously, or are they deferred to background threads? Might it be possible that there is an ordering issue with TRIM and WRITE requests, or something like that?