
Can't open state after crash: too few bytes #79

Open
gromakovsky opened this issue Sep 14, 2016 · 30 comments

@gromakovsky
Contributor

If I run a bunch of updates on my state (local, on disk), then press Ctrl-C during some update, then try to open the state (openLocalStateFrom in stack ghci), sometimes I get the following error:

too few bytes
From:   demandInput

I tried to debug it on my own and can provide some extra information.
There are two checkpoints files with consecutive numbers. The last one is much smaller than the other one, so it's probably malformed. There are also events files, but they seem to be irrelevant.
The error happens in the newestEntry function. To be more precise, it happens when the last (probably malformed) checkpoint is being read. Going deeper, I can say that this runGetPartial call fails.
My guess is that the last checkpoint wasn't dumped properly because the program crashed, but the previous checkpoint also exists and should be fine. The problem is that newestEntry fails with an error if the last checkpoint is malformed, when instead it should try to read another checkpoint (if one exists). It's just my guess and I may be wrong, because I don't know this code.

I can also provide an example of a database which can't be read because of this bug. Unfortunately, this example is quite heavy, but maybe someone will look into it.
Here is database (zip archived): wallet-db.zip
The definition of this data type can be found in this repository. Please use revision f97db74cbf09e7d2aa403d2c47a7fe37f7583e8f (just in case).
Just run in stack ghci:

ghci> import qualified Data.Acid as A
ghci> import qualified RSCoin.User as U
ghci> st <- A.openLocalStateFrom "wallet-db" U.emptyWalletStorage

and you will get that error.

@neongreen
Contributor

neongreen commented Sep 16, 2016

I think it happens because appending to a file isn't atomic: https://github.com/acid-state/acid-state/blob/master/src/Data/Acid/Log.hs#L141. What would be the best way to fix it in acid-state?

  • find some way to append a (possibly long) bytestring to a file atomically, or
  • ignore truncated log entries instead of erroring on them?

Or is it actually handled somewhere in acid-state internals and I haven't noticed it?

@neongreen
Contributor

ignore truncated log entries instead of erroring on them?

Does acid-state ever append to checkpoint/event files that already existed before opening the state? If it doesn't, then ignoring truncated log entries (without even bothering to delete them) might be a viable solution.

neongreen pushed a commit to serokell/acid-state that referenced this issue Sep 16, 2016
@jagajaga

@stepcut some ideas?

neongreen pushed a commit to serokell/acid-state that referenced this issue Sep 17, 2016
@ddssff
Collaborator

ddssff commented Sep 18, 2016

I suppose this is the same issue as not being able to start a server after syncing its database from a running server with open event files?

@stepcut
Member

stepcut commented Sep 18, 2016

So, another reason you can get this error is if you change your types but forget to create a SafeCopy migration instance. Currently that causes the app to fail to start -- which seems like a sensible thing to do. With the proposed changes, it seems like acid-state would just silently ignore any entries that it could not read, causing a bunch of valid data to be lost?

I believe the intention is that acid-state is supposed to have a crash recovery tool which allows you to examine the data and have a developer make a decision about the right path forward. Though, such a tool does not yet exist. Does that sound correct @lemmih ?

@neongreen
Contributor

neongreen commented Sep 18, 2016

@stepcut To my knowledge, acid-state works in two stages:

  1. Read entries into an internal acid-state “entry” type
  2. Parse entries into user's types

My proposed fix doesn't affect 2), only 1). Incorrect migrations are handled at step 2).

neongreen pushed a commit to serokell/acid-state that referenced this issue Sep 19, 2016
@alexbiehl
Contributor

Since appending a bytestring is not atomic, we can first write to a temporary file and then rename it to the correct log file name. Renaming is atomic. That way a damaged log will never be loaded when reopening the state after a crash.
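
For illustration, here is a minimal sketch of that idea, assuming POSIX semantics where renaming within a single filesystem is atomic. The helper name `writeLogAtomically` is made up, not acid-state API, and a real implementation would need an actual fsync rather than just `hFlush`:

```haskell
import qualified Data.ByteString.Lazy as BL
import System.Directory (renameFile)
import System.IO (IOMode (WriteMode), hFlush, withBinaryFile)

-- Hypothetical helper: write the whole log to a temp file, then atomically
-- swap it into place so a crash never leaves a half-written log behind.
writeLogAtomically :: FilePath -> BL.ByteString -> IO ()
writeLogAtomically path contents = do
  let tmp = path ++ ".tmp"
  withBinaryFile tmp WriteMode $ \h -> do
    BL.hPut h contents   -- write the full contents to the temporary file
    hFlush h             -- flush Haskell buffers (real code would fsync here)
  renameFile tmp path    -- atomic on POSIX when both paths share a filesystem
```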

@neongreen
Contributor

Won't this cause lots of copying if the state is big?

Here another solution is offered, but it's a bit more complicated (not by much, since we only have to log appends):

If we want atomicity (so we either end up with a foo or a bar but nothing in between) one standard technique is to make a copy of the data we’re about to change in an undo log file, modify the “real” file, and then delete the log file. If a crash happens, we can recover from the log.

Read the post for the whole solution (which requires several fsyncs). Also:

Update: many people have read this post and suggested that, in the first file example, you should use the much simpler protocol of copying the file to be modified to a temp file, modifying the temp file, and then renaming the temp file to overwrite the original file. In fact, that's probably the most common comment I've gotten on this post. If you think this solves the problem, I'm going to ask you to pause for five seconds and consider the problems this might have. First, you still need to fsync in multiple places. Second, you will get very poor performance with large files. People have also suggested using many small files to work around that problem, but that will also give you very poor performance unless you do something fairly exotic. Third, if there's a hardlink, you've now made the problem of crash consistency much more complicated than in the original example. Fourth, you'll lose file metadata, sometimes in ways that can't be fixed up after the fact. That problem can, on some filesystems, be worked around with ioctls, but that only sometimes fixes the issue and now you've got fs specific code to preserve correctness even in the non-crash case. And that's just the beginning. The fact that so many people thought that this was a simple solution to the problem demonstrates that this problem is one that people are prone to underestimating, even when they're explicitly warned that people tend to underestimate this problem!

@neongreen
Contributor

In particular, in our case the log would simply be “how many bytes the file had before appending”, and recovery would be a single truncate.
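
A rough sketch of that undo-log idea, with made-up helper names and the fsync calls elided (so this is not crash-safe as written, just an illustration of the shape):

```haskell
import System.Directory (doesFileExist, getFileSize, removeFile)
import System.IO (IOMode (ReadWriteMode), hSetFileSize, withBinaryFile)

undoFile :: FilePath -> FilePath
undoFile logPath = logPath ++ ".undo"

-- Before appending: remember how long the file currently is.
beginAppend :: FilePath -> IO ()
beginAppend logPath = do
  size <- getFileSize logPath
  writeFile (undoFile logPath) (show size)

-- After a successful append: the undo record is no longer needed.
endAppend :: FilePath -> IO ()
endAppend logPath = removeFile (undoFile logPath)

-- On startup: if an undo record exists, the last append may be partial,
-- so truncate the log back to its recorded pre-append length.
recoverAppend :: FilePath -> IO ()
recoverAppend logPath = do
  pending <- doesFileExist (undoFile logPath)
  if pending
    then do
      size <- read <$> readFile (undoFile logPath)
      withBinaryFile logPath ReadWriteMode $ \h -> hSetFileSize h size
      removeFile (undoFile logPath)
    else pure ()
```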

@alexbiehl
Contributor

alexbiehl commented Oct 27, 2016

@neongreen no, if the rename/move happens on the same filesystem, it's simply a matter of changing some inode pointers, so it's pretty cheap. If you move a file between different filesystems, the content has to be copied of course, which is probably the most expensive case.

@neongreen
Contributor

neongreen commented Oct 27, 2016

I'm not talking about renaming being expensive, I'm talking about “write to a temporary file” being expensive.

  1. If we only use unsafe appends on events* files, then it'll be fast because those are usually small.
  2. If we also use unsafe appends on checkpoints* files, then it'll be slow because in order to append to a temporary file we would first need to copy the checkpoints file to the temporary file, and checkpoints files can be big.

I don't remember anymore whether 1. or 2. is the case.

@neongreen
Contributor

(I think I've seen appending to checkpoints files happening when I was using acid-state, but I'm not 100% sure.)

@alexbiehl
Contributor

alexbiehl commented Oct 27, 2016

Ah, I see your point. The route via temporary files would need a cutFileLog (localCheckpoints acidState) somewhere in createLocalCheckpoint. I don't think this would be too bad actually. Having one checkpoint == one log file makes sense to me.

@alexbiehl
Contributor

@neongreen yes, it uses the same FileLog for checkpoints and for events so one or more checkpoints per log file are possible.

@stepcut
Member

stepcut commented Nov 3, 2016

Being able to easily diagnose and recover from an error such as this one is perhaps the number one most important task for acid-state. So I do want to take a serious look into this issue. Unfortunately, if you write too much Haskell code you reach a point where you spend all your time updating to the latest build dependencies and you don't seem to have much time left to write new code or tackle serious bugs.

Regarding this statement in the initial bug report:

There are two checkpoints files with two consequent numbers. The last one is much small than another one. So probably it's malformed.

That is simply not true. A checkpoint file can contain more than one checkpoint. So, by itself, one checkpoint file being much smaller than another is meaningless.

Regarding the error message itself:

too few bytes
From:   demandInput

I think there are (at least) three very different situations in which that error can arise.

One way that can happen is if the application is killed in the middle of appending the event log. When this happens, the update call that triggered the write should never have returned -- and so the transaction was not considered durable yet, and so we can delete it without violating ACID properties. In this case, ignoring the truncated entry might be acceptable. I think some people have expressed concern about deleting a truncated log entry which could still contain some useful information even if it was not a completed transaction. I do not have a strong use case to demonstrate that concern -- but it is worth considering.

Another situation where this error could occur is when the database file has gotten corrupted after the fact. For example, a hard disk is starting to fail and a sector has gone bad. In this situation, silently deleting what used to be a valid entry sounds like a very wrong answer. I am pretty sure we do not want a database system that will silently delete corrupted data and then start processing new transactions as if nothing bad ever happened.

The most common situation where this occurs (if I am not mistaken) is when you accidentally change a type, but forget to change the SafeCopy version number and create a Migration instance. For example, if you changed an Int32 to an Int64, then when the server starts up again it will attempt to read 4 more bytes than were written. In this case, the data on the disk is intact and uncorrupted. Silently ignoring those entries is definitely not the right solution. The right solution is to fix the broken code in the server and try again.

So, it seems that we have 3 situations in which we might get the original error message, but in only 1 case can we argue that silently ignoring the entry is the correct thing to do.

So, the question is, can we reliably determine which case we are looking at?

@neongreen has suggested that deserializing happens in two steps -- first we deserialize an Entry and then we deserialize the data stored in that Entry. Entry is defined as:

type Entry = Lazy.ByteString

And if we look at putEntry,

putEntry :: Entry -> Builder

We see that it writes the length of the Lazy.ByteString and its crc16 hash.

So I think it is correct that we can distinguish between a user forgetting to update their 'SafeCopy' instances and the case where the Entry is truncated or corrupted.

Additionally, if the Entry is the right length, but has the wrong crc16 then we can also determine that the data has been corrupted.

However -- what about the case where the length says there should be 20 bytes, but we only find 10? Can we say with certainty that the Entry was never fully written? Could it not also be the case that it was fully written, but got corrupted after the fact?

Right now we treat both situations the same -- we abort. That is not wrong -- but it is certainly annoying when the problem is just a transaction that never completed. We could switch to always deleting the entry if it appears truncated -- but silently ignoring corruption would seem to be very wrong to me.
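
To make the three outcomes concrete, here is a simplified sketch. It assumes a framing of "declared length, then crc16, then payload", which only approximates the real Data.Acid.Archive format, and the `crc16` here is a placeholder rather than the archive's actual checksum:

```haskell
import qualified Data.ByteString as BS
import Data.Word (Word16)

data EntryCheck
  = EntryOk BS.ByteString   -- full payload present and CRC matches
  | EntryTruncated          -- fewer payload bytes than the header promised
  | EntryCorrupted          -- payload complete but CRC mismatch

checkEntry :: Int -> Word16 -> BS.ByteString -> EntryCheck
checkEntry expectedLen expectedCrc payload
  | BS.length payload < expectedLen = EntryTruncated
  | crc16 body /= expectedCrc       = EntryCorrupted
  | otherwise                       = EntryOk body
  where
    body = BS.take expectedLen payload
    -- Placeholder checksum; the real archive uses an actual CRC-16.
    crc16 :: BS.ByteString -> Word16
    crc16 = fromIntegral . BS.foldl' (\acc w -> acc + fromIntegral w) (0 :: Int)
```

The sketch shows exactly the ambiguity raised above: EntryTruncated covers both "never fully written" and "written but truncated later", and nothing inside the entry itself can tell the two apart.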

One proposed solution was to write log entries to a new file and rename them after the write is complete. This would seem to solve the problem -- however, I think it would also be a major performance hit. Surely creating a new file is more expensive than appending to an already open file?

At a very minimum we can do two things:

  1. write a tool to manually truncate a log file that is too short
  2. update the error message to say something meaningful and link to a webpage describing what they have encountered and how they might proceed.

Perhaps another option would be for acid-state to have some sort of secondary log that indicates which log entries it believes it completed? If the log entry is short -- but the secondary log says it successfully wrote that entry, then it is almost certainly corruption. If the log entry is short and the secondary log does not indicate it was completed, then it is more likely (but not guaranteed) that it was just aborted mid-write?
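
A sketch of what such a secondary log could look like, with made-up names and fsyncs omitted: after each append, the entry's index (or byte offset) is recorded in a small side file, and recovery compares the main log against it:

```haskell
import System.Directory (doesFileExist)
import System.IO (IOMode (AppendMode), hFlush, hPutStrLn, withFile)

completedLog :: FilePath -> FilePath
completedLog logPath = logPath ++ ".completed"

-- Record that an entry was durably written (a real version would fsync).
recordCompleted :: FilePath -> Int -> IO ()
recordCompleted logPath entryId =
  withFile (completedLog logPath) AppendMode $ \h -> do
    hPutStrLn h (show entryId)
    hFlush h

-- Highest entry id the secondary log claims was completed, if any.
lastCompleted :: FilePath -> IO (Maybe Int)
lastCompleted logPath = do
  exists <- doesFileExist (completedLog logPath)
  if not exists
    then pure Nothing
    else do
      ids <- map read . lines <$> readFile (completedLog logPath)
      pure (if null ids then Nothing else Just (maximum (ids :: [Int])))
```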

When HDDs ruled the world I would be concerned about the extra disk seeks that this might incur. In an SSD world -- I am not sure what to think.

tl;dr - How do we tell the difference between a transaction that never finished writing, and one that did but got corrupted later?

@neongreen
Contributor

That is simply not true. A checkpoint file can contain more than one checkpoint. So, by itself, one checkpoint file being much smaller than another is meaningless.

Agreed.

Additionally, if the Entry is the right length, but has the wrong crc16 then we can also determine that the data has been corrupted.

Agreed. I think that if this happens, the right course of action is to throw an error message instead of silently discarding the corrupted entry.

Can we say with certainty that the Entry was never fully written? Could it not also be the case that it was fully written, but got corrupted after the fact?

Good point. Unfortunately, I'm not sure it's possible even in theory to distinguish between these cases. If corruption is described as arbitrary modification of stored data, then for any state A which is detected as “unfinished write” you could argue that if you observe state A, it might mean that corruption took place, and not an unfinished write. I don't see a way out of it other than distinguishing between them on the filesystem level (as we assume that we can trust the filesystem) – e.g. we might create a special zero-length file when we begin writing, and delete the file when we finish writing. Then corruption and truncation could be distinguished by the special file's presence/absence. Alternatively we could just rename the file to checkpoints-N-INCOMPLETE.log while we're appending and rename it back afterwards. All of these things would have to be carefully fsynced (because judging by the article I linked, filesystems actually aren't very trustworthy).
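
A sketch of the marker-file variant, with illustrative names only; real code would additionally have to fsync the marker and the containing directory:

```haskell
import System.Directory (doesFileExist, removeFile)

markerFor :: FilePath -> FilePath
markerFor logPath = logPath ++ ".writing"

-- The zero-length marker exists exactly while an append is in flight, so it
-- is deliberately NOT removed if the append dies part-way through.
withAppendMarker :: FilePath -> IO a -> IO a
withAppendMarker logPath appendAction = do
  writeFile (markerFor logPath) ""
  result <- appendAction
  removeFile (markerFor logPath)
  pure result

-- On startup: a leftover marker suggests a crash mid-append rather than
-- later corruption of a completed write.
crashedMidWrite :: FilePath -> IO Bool
crashedMidWrite logPath = doesFileExist (markerFor logPath)
```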

Surely creating a new file is more expensive that appending to an already open file?

I'm not sure – this would have to be benchmarked. Filesystems work in mysterious ways. If you are a filesystem expert or can ask a filesystem expert, this changes things :)

@jberryman

I've never written a database before. Are there problems here that are unique to acid-state and haven't been solved in e.g. Postgres or any number of DBs using a WAL? A quick google brings up page 12 of this deck, for instance: https://www.pgcon.org/2012/schedule/attachments/258_212_Internals%20Of%20PostgreSQL%20Wal.pdf with:

Protection against partial writes in disk

WAL Page Partial Write Protection
• Each WAL record contains CRC and checking the validity of the WAL record's CRC will detect partial write.
• Each WAL page contains a magic number, the validity of which is checked after reading each page.

And is anyone working on this ticket actively?

@neongreen
Contributor

@jberryman my understanding is that CRC already gets written:

Additionally, if the Entry is the right length, but has the wrong crc16 then we can also determine that the data has been corrupted.

The problem isn't that we aren't protected against partial writes – we are. The problem is that we can't implement the following semantics:

  • When data is corrupted, we don't try to do any recovery and just show an error message to the user
  • When there has been a partial write, we recover by discarding the partially written Entry

The reason we can't implement it is that we can't distinguish between “data didn't get written fully” and “data got written fully but got truncated later”. The latter can happen easily if e.g. you copy the database onto a USB drive and eject it mid-write.

@neongreen
Contributor

Actually, now that I think of it, the easiest solution seems to be something like this:

  • write '\0' <> crc <> data
  • fsync
  • rewrite the '\0' with '\1'
  • fsync

Now incomplete writes can be distinguished from truncation by looking at the first byte.

For extra safety we can write 4-byte magic constants instead of '\0' and '\1'. If the INCOMPLETE constant still started with 0 and the COMPLETE constant started with 1, we would still be able to distinguish incomplete writes from truncation even if all bytes but the first one got truncated. The benefit of 4-byte constants is that in case of corruption, four bytes would have to be deliberately changed to a very specific value in order to trick us into thinking that it was just an incomplete write, while with one-byte constants only one byte would have to change (which can happen accidentally).
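
A rough sketch of that protocol with single-byte markers; `hFlush` only flushes Haskell-side buffers, so a real implementation would replace both flushes with genuine fsyncs, and the handle has to be opened in read/write mode (append mode would ignore the seek back to the flag byte):

```haskell
import qualified Data.ByteString as BS
import System.IO (Handle, SeekMode (AbsoluteSeek), hFlush, hSeek, hTell)

-- Write "flag, crc, payload" with the flag initially '\0' (incomplete),
-- then flip it to '\1' (complete) once everything is on disk.
appendWithFlag :: Handle -> BS.ByteString -> BS.ByteString -> IO ()
appendWithFlag h crc payload = do
  start <- hTell h
  BS.hPut h (BS.concat [BS.singleton 0, crc, payload])
  hFlush h                      -- first "fsync": data is on disk but unconfirmed
  hSeek h AbsoluteSeek start
  BS.hPut h (BS.singleton 1)    -- flip the flag: the entry is now confirmed
  hFlush h                      -- second "fsync"
  -- reposition at the end so the next append continues after this entry
  hSeek h AbsoluteSeek (start + fromIntegral (1 + BS.length crc + BS.length payload))
```

On recovery, a short entry whose flag is still '\0' can be dropped as an interrupted append, while a short entry flagged '\1' (or with a flag byte that is neither value) points to corruption.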

@stepcut thoughts?

@stepcut
Member

stepcut commented Mar 30, 2017

@jberryman acid-state is essentially a write-ahead log system, and does use CRCs to check validity. The issue is what happens when a CRC check fails: how do you distinguish between an incomplete write and a completed write that was later corrupted?

In theory, you can safely delete an incomplete write, but if the data has become corrupted after the fact, then you probably want to abort.

I think you are correct that acid-state is not doing anything fundamentally different than any other database. So, if we can learn what they are doing, we can probably implement the same solution.

@neongreen It sounds like that option involves seeks and overwriting data? It could work, but sounds terrible for performance. I think it is basically the same as the idea I suggested about having a secondary log file which just logs completed writes. The difference is whether we are logging that in the same file or a different one.

I also have at the back of my mind a concern about how these solutions might play out when I finally have time to implement an S3 backend. I do not believe S3 supports appending or modifying existing objects. Perhaps the File backend and S3 backend do not use the same solution -- who knows.

@jberryman

jberryman commented Mar 30, 2017

So did this fix the original issue: serokell@26c9339? It sounds like it would be fairly easy (and important) to write an automated test for this. EDIT oh sorry, didn't realize that was a fork.

@neongreen
Contributor

@stepcut

@neongreen It sounds like that option involves seeks and overwriting data? It could work, but sounds terrible for performance.

Huh, I didn't know that overwriting data was slow. Can you recommend any sources I could read? (I'd google, but since I'm a noob when it comes to low-level things, I don't know how to distinguish trustworthy sources from non-trustworthy ones.)

I think it is basically the same as the idea I suggested about having a secondary log file which just logs completed writes. The difference is whether we are logging that in the same file or a different one.

Okay, then I don't have an opinion on which one should actually be implemented :)

I also have at the back of my mind a concern about how these solutions might play out when I finally have time to implement an S3 backend.

Cool, I didn't know an S3 backend was planned. (Does S3 even have a notion of incomplete writes?)

@neongreen
Contributor

@jberryman

So did this fix the original issue?: serokell/acid-state@26c9339

Well, yes (as in “incomplete writes don't mess things up for us anymore”), but a) we don't have any way to detect corruption now so we wouldn't have noticed if things went wrong, and b) we've mostly moved to RocksDB by now anyway.

@jberryman

@neongreen oh I misunderstood, I assumed you were an acid-state maintainer and meant that acid-state's backend would be moving to RocksDB or something. Have you written about your experience with RocksDB anywhere? Are you using it in a way that's similar to acid-state?

@neongreen
Contributor

@jberryman

Have you written about your experience with rocksdb anywhere?

Nope.

Are you using it in a way that's similar to acid-state?

Not really – for instance, we don't use versioning or migrations, and all of our queries are either “get a primitive value by key” or “get a complex value by key and then parse it with binary”. We switched to RocksDB because a) we don't want to store data in-memory, b) acid-state doesn't seem to be actively developed, and c) acid-state's model is not a particularly good fit for us (e.g. we want Updates that can fail, we want batched writes, etc).

@gbaz

gbaz commented Dec 19, 2018

Since the issue is fixed by serokell@26c9339, shouldn't this be closed?

@parsonsmatt
Contributor

@gbaz that's a fork of acid-state. I am curious why this PR says that the fix won't be merged in -- it links back to this thread, but after a few skims I can't find any rejected PRs or similar.

@gbaz

gbaz commented Dec 20, 2018

It looks like this was the concern.

So I think it is correct that we can distinguish between a user forgetting to update their 'SafeCopy' instances and the case where the Entry is truncated or corrupted.

Additionally, if the Entry is the right length, but has the wrong crc16 then we can also determine that the data has been corrupted.

However -- what about the case where the length says there should be 20 bytes, but we only find 10? Can we say with certainty that the Entry was never fully written? Could it not also be the case that it was fully written, but got corrupted after the fact?

Right now we treat both situations the same -- we abort. That is not wrong -- but it is certainly annoying when the problem is just a transaction that never completed. We could switch to always deleting the entry if it appears truncated -- but silently ignoring corruption would seem to be very wrong to me.

My suggestion would be -- only roll back (i.e. drop the partial entry) at the very end of a file. Hence we don't have to worry about other forms of corruption -- this just handles the partial write in the event of a kill, power failure, etc.
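
As a sketch of that rule, with a toy length-prefixed framing standing in for the real archive parser: complete entries are kept, a checksum failure anywhere aborts, and a short read is tolerated precisely because it can only occur at the end of the remaining input:

```haskell
import qualified Data.ByteString as BS

data ParseStep
  = Parsed BS.ByteString BS.ByteString  -- one complete entry and the rest of the input
  | ShortRead                           -- fewer bytes left than the header promised
  | BadChecksum                         -- entry complete but its checksum is wrong

-- Toy framing (one length byte, no checksum) standing in for the real parser;
-- the real Data.Acid.Archive parser would also be able to report BadChecksum.
parseOneEntry :: BS.ByteString -> ParseStep
parseOneEntry input =
  case BS.uncons input of
    Nothing -> ShortRead
    Just (len, rest)
      | BS.length rest < fromIntegral len -> ShortRead
      | otherwise ->
          Parsed (BS.take (fromIntegral len) rest) (BS.drop (fromIntegral len) rest)

-- Keep complete entries; drop a truncated tail (it can only be the last thing
-- in the file); refuse to continue past anything that looks like corruption.
readEntries :: BS.ByteString -> Either String [BS.ByteString]
readEntries input
  | BS.null input = Right []
  | otherwise =
      case parseOneEntry input of
        Parsed entry rest -> (entry :) <$> readEntries rest
        ShortRead         -> Right []
        BadChecksum       -> Left "corrupted entry: refusing to recover silently"
```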

@adamgundry
Contributor

Right, I think the simplest way forward here is to detect that a file is truncated and roll back the last entry, but throw an error for other kinds of corruption. This should probably be an option chosen by the user when opening the state, because some users might not want it to happen automatically, or might have application-level mechanisms to distinguish clean from unclean shutdown. After all, we have no guarantee that a truncation is the result of a partial write as opposed to subsequent corruption, it's just the most likely cause.

I suppose we could also have the option to make a best effort at restoring the log, ignoring any corrupted entries, which might be helpful for debugging. Side thought: would it make sense to offer a read-only mode that opens a state for queries only, again to help recovery?
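
As a purely hypothetical sketch of what such a user-facing option might look like (none of these names exist in acid-state today):

```haskell
data TruncationPolicy
  = AbortOnTruncation   -- current behaviour: refuse to open the state
  | RollbackLastEntry   -- drop a truncated trailing entry and carry on
  | BestEffortRecovery  -- skip unreadable entries entirely (debugging only)

data OpenOptions = OpenOptions
  { truncationPolicy :: TruncationPolicy
  , readOnly         :: Bool  -- open the state for queries only
  }

-- Conservative defaults: fail loudly, allow writes.
defaultOpenOptions :: OpenOptions
defaultOpenOptions = OpenOptions AbortOnTruncation False
```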

Related to this is the need for better tools to analyse a (possibly corrupted) log file. That will hopefully become easier if/when my work on alternative serialisation layers (#88) gets merged.

@I3ck

I3ck commented Mar 28, 2019

Actually, now that I think of it, the easiest solution seems to be something like this:

* write `'\0' <> crc <> data`

* `fsync`

* rewrite the `'\0'` with `'\1'`

* `fsync`

Shouldn't this resolve this rather serious issue? Or what can a library user do in case of the too few bytes problem?
