
Can't open state after crash: too few bytes #79

Open
gromakovsky opened this issue Sep 14, 2016 · 30 comments

@gromakovsky
Contributor

If I run a bunch of updates on my state (local, on disk), then press Ctrl-C during some update, then try to open the state (openLocalStateFrom in stack ghci), sometimes I get the following error:

too few bytes
From:   demandInput

I tried to debug it on my own and can provide some extra information.
There are two checkpoints files with consecutive numbers. The last one is much smaller than the other one, so it's probably malformed. There are also events files, but they seem to be irrelevant.
The error happens in the newestEntry function. To be more precise, it happens when the last (probably malformed) checkpoint is being read. Going deeper, I can say that this runGetPartial call fails.
My guess is that the last checkpoint wasn't dumped properly because the program crashed, but the previous checkpoint also exists and should be fine. The problem is that newestEntry fails with an error if the last checkpoint is malformed, when instead it should try to read another checkpoint (if one exists). It's just my guess and I may be wrong, because I don't know this code.

I can also provide an example of a database which can't be read because of this bug. Unfortunately, this example is quite heavy, but maybe someone will look into it.
Here is database (zip archived): wallet-db.zip
The definition of this data type can be found in this repository. Please use revision f97db74cbf09e7d2aa403d2c47a7fe37f7583e8f (just in case).
Just run in stack ghci:

ghci> import qualified Data.Acid as A
ghci> import qualified RSCoin.User as U
ghci> st <- A.openLocalStateFrom "wallet-db" U.emptyWalletStorage

and you will get that error.

@neongreen
Contributor

neongreen commented Sep 16, 2016

I think it happens because appending to a file isn't atomic: https://github.com/acid-state/acid-state/blob/master/src/Data/Acid/Log.hs#L141. What would be the best way to fix it in acid-state?

  • find some way to append a (possibly long) bytestring to a file atomically, or
  • ignore truncated log entries instead of erroring on them?

Or is it actually handled somewhere in acid-state internals and I haven't noticed it?

@neongreen
Contributor

ignore truncated log entries instead of erroring on them?

Does acid-state ever append to checkpoint/event files that already existed before opening the state? If it doesn't, then ignoring truncated log entries (without even bothering to delete them) might be a viable solution.

neongreen pushed a commit to serokell/acid-state that referenced this issue Sep 16, 2016
@jagajaga

@stepcut some ideas?

neongreen pushed a commit to serokell/acid-state that referenced this issue Sep 17, 2016
@ddssff
Collaborator

ddssff commented Sep 18, 2016

I suppose this is the same issue as not being able to start a server after syncing its database from a running server with open event files?

@stepcut
Member

stepcut commented Sep 18, 2016

So, another reason you can get this error is if you change your types but forget to create a SafeCopy migration instance. Currently that causes the app to fail to start -- which seems like a sensible thing to do. With the proposed changes, it seems like acid-state would just silently ignore any entries that it could not read, causing a bunch of valid data to be lost?

I believe the intention is that acid-state is supposed to have a crash recovery tool which allows you to examine the data and have a developer make a decision about the right path forward. Though, such a tool does not yet exist. Does that sound correct @lemmih ?

@neongreen
Contributor

neongreen commented Sep 18, 2016

@stepcut To my knowledge, acid-state works in two stages:

  1. Read entries into an internal acid-state “entry” type
  2. Parse entries into user's types

My proposed fix doesn't affect 2), only 1). Incorrect migrations are handled at step 2).

neongreen pushed a commit to serokell/acid-state that referenced this issue Sep 19, 2016
@alexbiehl
Contributor

Since appending a bytestring is not atomic, we can first write to a temporary file and then rename it to the correct log file name. Renaming is atomic. That way a damaged log will never be loaded when reopening the state after a crash.
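
For illustration, here is a minimal sketch of that idea, assuming POSIX semantics where renaming within a single filesystem is atomic. The helper name `writeLogAtomically` is made up, not acid-state API, and a real implementation would need an actual fsync rather than just `hFlush`:

```haskell
import qualified Data.ByteString.Lazy as BL
import System.Directory (renameFile)
import System.IO (IOMode (WriteMode), hFlush, withBinaryFile)

-- Hypothetical helper: write the whole log to a temp file, then atomically
-- swap it into place so a crash never leaves a half-written log behind.
writeLogAtomically :: FilePath -> BL.ByteString -> IO ()
writeLogAtomically path contents = do
  let tmp = path ++ ".tmp"
  withBinaryFile tmp WriteMode $ \h -> do
    BL.hPut h contents   -- write the full contents to the temporary file
    hFlush h             -- flush Haskell buffers (real code would fsync here)
  renameFile tmp path    -- atomic on POSIX when both paths share a filesystem
```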

@neongreen
Contributor

Won't this cause lots of copying if the state is big?

Here another solution is offered, but it's a bit more complicated (not by much, since we only have to log appends):

If we want atomicity (so we either end up with a foo or a bar but nothing in between) one standard technique is to make a copy of the data we’re about to change in an undo log file, modify the “real” file, and then delete the log file. If a crash happens, we can recover from the log.

Read the post for the whole solution (which requires several fsyncs). Also:

Update: many people have read this post and suggested that, in the first file example, you should use the much simpler protocol of copying the file to be modified to a temp file, modifying the temp file, and then renaming the temp file to overwrite the original file. In fact, that's probably the most common comment I've gotten on this post. If you think this solves the problem, I'm going to ask you to pause for five seconds and consider the problems this might have. First, you still need to fsync in multiple places. Second, you will get very poor performance with large files. People have also suggested using many small files to work around that problem, but that will also give you very poor performance unless you do something fairly exotic. Third, if there's a hardlink, you've now made the problem of crash consistency much more complicated than in the original example. Fourth, you'll lose file metadata, sometimes in ways that can't be fixed up after the fact. That problem can, on some filesystems, be worked around with ioctls, but that only sometimes fixes the issue and now you've got fs specific code to preserve correctness even in the non-crash case. And that's just the beginning. The fact that so many people thought that this was a simple solution to the problem demonstrates that this problem is one that people are prone to underestimating, even when they're explicitly warned that people tend to underestimate this problem!

@neongreen
Contributor

In particular, in our case the log would simply be “how many bytes the file had before appending”, and recovery would be a single truncate.
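
A rough sketch of that undo-log idea, with made-up helper names and the fsync calls elided (so this is not crash-safe as written, just an illustration of the shape):

```haskell
import System.Directory (doesFileExist, getFileSize, removeFile)
import System.IO (IOMode (ReadWriteMode), hSetFileSize, withBinaryFile)

undoFile :: FilePath -> FilePath
undoFile logPath = logPath ++ ".undo"

-- Before appending: remember how long the file currently is.
beginAppend :: FilePath -> IO ()
beginAppend logPath = do
  size <- getFileSize logPath
  writeFile (undoFile logPath) (show size)

-- After a successful append: the undo record is no longer needed.
endAppend :: FilePath -> IO ()
endAppend logPath = removeFile (undoFile logPath)

-- On startup: if an undo record exists, the last append may be partial,
-- so truncate the log back to its recorded pre-append length.
recoverAppend :: FilePath -> IO ()
recoverAppend logPath = do
  pending <- doesFileExist (undoFile logPath)
  if pending
    then do
      size <- read <$> readFile (undoFile logPath)
      withBinaryFile logPath ReadWriteMode $ \h -> hSetFileSize h size
      removeFile (undoFile logPath)
    else pure ()
```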

@alexbiehl
Contributor

alexbiehl commented Oct 27, 2016

@neongreen no, if the rename/move happens on the same filesystem, it's simply a matter of changing some inode pointers, so it's pretty cheap. If you move a file between different filesystems, the content has to be copied of course, which is probably the most expensive case.

@neongreen
Contributor

neongreen commented Oct 27, 2016

I'm not talking about renaming being expensive, I'm talking about “write to a temporary file” being expensive.

  1. If we only use unsafe appends on events* files, then it'll be fast because those are usually small.
  2. If we also use unsafe appends on checkpoints* files, then it'll be slow because in order to append to a temporary file we would first need to copy the checkpoints file to the temporary file, and checkpoints files can be big.

I don't remember anymore whether 1. or 2. is the case.

@neongreen
Contributor

(I think I've seen appending to checkpoints files happening when I was using acid-state, but I'm not 100% sure.)

@alexbiehl
Contributor

alexbiehl commented Oct 27, 2016

Ah, I see your point. The route via temporary files would need a cutFileLog (localCheckpoints acidState) somewhere in createLocalCheckpoint. I don't think this would be too bad actually. Having one checkpoint == one log file makes sense to me.

@alexbiehl
Contributor

@neongreen yes, it uses the same FileLog for checkpoints and for events so one or more checkpoints per log file are possible.

@stepcut
Member

stepcut commented Nov 3, 2016

Being able to easily diagnose and recover from an error such as this one is perhaps the number one most important task for acid-state. So I do want to take a serious look into this issue. Unfortunately, if you write too much Haskell code you reach a point where you spend all your time updating to the latest build dependencies and you don't seem to have much time left to write new code or tackle serious bugs.

Regarding this statement in the initial bug report:

There are two checkpoints files with two consequent numbers. The last one is much small than another one. So probably it's malformed.

That is simply not true. A checkpoint file can contain more than one checkpoint. So, by itself, one checkpoint file being much smaller than another is meaningless.

Regarding the error message itself:

too few bytes
From:   demandInput

I think there are (at least) three very different situations in which that error can arise.

One way that can happen is if the application is killed in the middle of appending the event log. When this happens, the update call that triggered the write should never have returned -- and so the transaction was not considered durable yet, and so we can delete it without violating ACID properties. In this case, ignoring the truncated entry might be acceptable. I think some people have expressed concern about deleting a truncated log entry which could still contain some useful information even if it was not a completed transaction. I do not have a strong use case to demonstrate that concern -- but it is worth considering.

Another situation where this error could occur is when the database file has gotten corrupted after the fact. For example, a hard disk is starting to fail and a sector has gone bad. In this situation, silently deleting what used to be a valid entry sounds like a very wrong answer. I am pretty sure we do not want a database system that will silently delete corrupted data and then start processing new transactions as if nothing bad ever happened.

The most common situation where this occurs (if I am not mistaken) is when you accidentally change a type, but forget to change the SafeCopy version number and create a Migration instance. For example, if you changed an Int32 to an Int64, then when the server starts up again it will attempt to read 4 more bytes than were written. In this case, the data on the disk is intact and uncorrupted. Silently ignoring those entries is definitely not the right solution. The right solution is to fix the broken code in the server and try again.

So, it seems that we have 3 situations in which we might get the original error message, but in only 1 case can we argue that silently ignoring the entry is the correct thing to do.

So, the question is, can we reliably determine which case we are looking at?

@neongreen has suggested that deserializing happens in two steps -- first we deserialize an Entry and then we deserialize the data stored in that Entry. Entry is defined as:

type Entry = Lazy.ByteString

And if we look at putEntry,

putEntry :: Entry -> Builder

We see that it writes the length of the Lazy.ByteString and its crc16 hash.

So I think it is correct that we can distinguish between a user forgetting to update their 'SafeCopy' instances and the case where the Entry is truncated or corrupted.

Additionally, if the Entry is the right length, but has the wrong crc16 then we can also determine that the data has been corrupted.

However -- what about the case where the length says there should be 20 bytes, but we only find 10? Can we say with certainty that the Entry was never fully written? Could it not also be the case that it was fully written, but got corrupted after the fact?

Right now we treat both situations the same -- we abort. That is not wrong -- but it is certainly annoying when the problem is just a transaction that never completed. We could switch to always deleting the entry if it appears truncated -- but silently ignoring corruption would seem to be very wrong to me.
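
To make the three outcomes concrete, here is a simplified sketch. It assumes a framing of "declared length, then crc16, then payload", which only approximates the real Data.Acid.Archive format, and the `crc16` here is a placeholder rather than the archive's actual checksum:

```haskell
import qualified Data.ByteString as BS
import Data.Word (Word16)

data EntryCheck
  = EntryOk BS.ByteString   -- full payload present and CRC matches
  | EntryTruncated          -- fewer payload bytes than the header promised
  | EntryCorrupted          -- payload complete but CRC mismatch

checkEntry :: Int -> Word16 -> BS.ByteString -> EntryCheck
checkEntry expectedLen expectedCrc payload
  | BS.length payload < expectedLen = EntryTruncated
  | crc16 body /= expectedCrc       = EntryCorrupted
  | otherwise                       = EntryOk body
  where
    body = BS.take expectedLen payload
    -- Placeholder checksum; the real archive uses an actual CRC-16.
    crc16 :: BS.ByteString -> Word16
    crc16 = fromIntegral . BS.foldl' (\acc w -> acc + fromIntegral w) (0 :: Int)
```

The sketch shows exactly the ambiguity raised above: EntryTruncated covers both "never fully written" and "written but truncated later", and nothing inside the entry itself can tell the two apart.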

One proposed solution was to write log entries to a new file and rename them after the write is complete. This would seem to solve the problem -- however, I think it would also be a major performance hit. Surely creating a new file is more expensive than appending to an already open file?

At a very minimum we can do two things:

  1. write a tool to manually truncate a log file that is too short
  2. update the error message to say something meaningful and link to a webpage describing what they have encountered and how they might proceed.

Perhaps another option would be for acid-state to have some sort of secondary log that indicates which log entries it believes it completed? If the log entry is short -- but the secondary log says it successfully wrote that entry, then it is almost certainly corruption. If the log entry is short and the secondary log does not indicate it was completed, then it is more likely (but not guaranteed) that it was just aborted mid-write?
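
A sketch of what such a secondary log could look like, with made-up names and fsyncs omitted: after each append, the entry's index (or byte offset) is recorded in a small side file, and recovery compares the main log against it:

```haskell
import System.Directory (doesFileExist)
import System.IO (IOMode (AppendMode), hFlush, hPutStrLn, withFile)

completedLog :: FilePath -> FilePath
completedLog logPath = logPath ++ ".completed"

-- Record that an entry was durably written (a real version would fsync).
recordCompleted :: FilePath -> Int -> IO ()
recordCompleted logPath entryId =
  withFile (completedLog logPath) AppendMode $ \h -> do
    hPutStrLn h (show entryId)
    hFlush h

-- Highest entry id the secondary log claims was completed, if any.
lastCompleted :: FilePath -> IO (Maybe Int)
lastCompleted logPath = do
  exists <- doesFileExist (completedLog logPath)
  if not exists
    then pure Nothing
    else do
      ids <- map read . lines <$> readFile (completedLog logPath)
      pure (if null ids then Nothing else Just (maximum (ids :: [Int])))
```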

When HDDs ruled the world I would be concerned about the extra disk seeks that this might incur. In an SSD world -- I am not sure what to think.

tl;dr - How do we tell the difference between a transaction that never finished writing, and one that did but got corrupted later?

@neongreen
Contributor

That is simply not true. A checkpoint file can contain more than one checkpoint. So, by itself, one checkpoint file being much smaller than another is meaningless.

Agreed.

Additionally, if the Entry is the right length, but has the wrong crc16 then we can also determine that the data has been corrupted.

Agreed. I think that if this happens, the right course of action is to throw an error message instead of silently discarding the corrupted entry.

Can we say with certainty that the Entry was never fully written? Could it not also be the case that it was fully written, but got corrupted after the fact?

Good point. Unfortunately, I'm not sure it's possible even in theory to distinguish between these cases. If corruption is described as arbitrary modification of stored data, then for any state A which is detected as “unfinished write” you could argue that if you observe state A, it might mean that corruption took place, and not an unfinished write. I don't see a way out of it other than distinguishing between them on the filesystem level (as we assume that we can trust the filesystem) – e.g. we might create a special zero-length file when we begin writing, and delete the file when we finish writing. Then corruption and truncation could be distinguished by the special file's presence/absence. Alternatively we could just rename the file to checkpoints-N-INCOMPLETE.log while we're appending and rename it back afterwards. All of these things would have to be carefully fsynced (because judging by the article I linked, filesystems actually aren't very trustworthy).
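
A sketch of the marker-file variant, with illustrative names only; real code would additionally have to fsync the marker and the containing directory:

```haskell
import System.Directory (doesFileExist, removeFile)

markerFor :: FilePath -> FilePath
markerFor logPath = logPath ++ ".writing"

-- The zero-length marker exists exactly while an append is in flight, so it
-- is deliberately NOT removed if the append dies part-way through.
withAppendMarker :: FilePath -> IO a -> IO a
withAppendMarker logPath appendAction = do
  writeFile (markerFor logPath) ""
  result <- appendAction
  removeFile (markerFor logPath)
  pure result

-- On startup: a leftover marker suggests a crash mid-append rather than
-- later corruption of a completed write.
crashedMidWrite :: FilePath -> IO Bool
crashedMidWrite logPath = doesFileExist (markerFor logPath)
```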

Surely creating a new file is more expensive that appending to an already open file?

I'm not sure – this would have to be benchmarked. Filesystems work in mysterious ways. If you are a filesystem expert or can ask a filesystem expert, this changes things :)

@jberryman

I've never written a database before. Are there problems here that are unique to acid-state and haven't been solved in e.g. Postgres or any number of DBs using a WAL? A quick google brings up page 12 of this deck, for instance: https://www.pgcon.org/2012/schedule/attachments/258_212_Internals%20Of%20PostgreSQL%20Wal.pdf with:

Protection against partial writes in disk

WAL Page Partial Write Protection
• Each WAL record contains CRC and checking the validity of the WAL record's CRC will detect partial write.
• Each WAL page contains a magic number, the validity of which is checked after reading each page.

And is anyone working on this ticket actively?

@neongreen
Contributor

@jberryman my understanding is that CRC already gets written:

Additionally, if the Entry is the right length, but has the wrong crc16 then we can also determine that the data has been corrupted.

The problem isn't that we aren't protected against partial writes – we are. The problem is that we can't implement the following semantics:

  • When data is corrupted, we don't try to do any recovery and just show an error message to the user
  • When there has been a partial write, we recover by discarding the partially written Entry

The reason we can't implement it is that we can't distinguish between “data didn't get written fully” and “data got written fully but got truncated later”. The latter can happen easily if e.g. you copy the database onto a USB drive and eject it mid-write.

@neongreen
Contributor

Actually, now that I think of it, the easiest solution seems to be something like this:

  • write '\0' <> crc <> data
  • fsync
  • rewrite the '\0' with '\1'
  • fsync

Now incomplete writes can be distinguished from truncation by looking at the first byte.

For extra safety we can write 4-byte magic constants instead of '\0' and '\1'. If the INCOMPLETE constant still started with 0 and the COMPLETE constant started with 1, we would still be able to distinguish incomplete writes from truncation even if all bytes but the first one got truncated. The benefit of 4-byte constants is that in case of corruption, four bytes would have to be deliberately changed to a very specific value in order to trick us into thinking that it was just an incomplete write, while with one-byte constants only one byte would have to change (which can happen accidentally).
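
A rough sketch of that protocol with single-byte markers; `hFlush` only flushes Haskell-side buffers, so a real implementation would replace both flushes with genuine fsyncs, and the handle has to be opened in read/write mode (append mode would ignore the seek back to the flag byte):

```haskell
import qualified Data.ByteString as BS
import System.IO (Handle, SeekMode (AbsoluteSeek), hFlush, hSeek, hTell)

-- Write "flag, crc, payload" with the flag initially '\0' (incomplete),
-- then flip it to '\1' (complete) once everything is on disk.
appendWithFlag :: Handle -> BS.ByteString -> BS.ByteString -> IO ()
appendWithFlag h crc payload = do
  start <- hTell h
  BS.hPut h (BS.concat [BS.singleton 0, crc, payload])
  hFlush h                      -- first "fsync": data is on disk but unconfirmed
  hSeek h AbsoluteSeek start
  BS.hPut h (BS.singleton 1)    -- flip the flag: the entry is now confirmed
  hFlush h                      -- second "fsync"
  -- reposition at the end so the next append continues after this entry
  hSeek h AbsoluteSeek (start + fromIntegral (1 + BS.length crc + BS.length payload))
```

On recovery, a short entry whose flag is still '\0' can be dropped as an interrupted append, while a short entry flagged '\1' (or with a flag byte that is neither value) points to corruption.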

@stepcut thoughts?

@stepcut
Member

stepcut commented Mar 30, 2017

@jberryman acid-state is essentially a write-ahead log system, and does use CRCs to check validity. The issue is what happens when a CRC check fails: how do you distinguish between an incomplete write and a completed write that was later corrupted?

In theory, you can safely delete an incomplete write, but if the data has become corrupted after the fact, then you probably want to abort.

I think you are correct that acid-state is not doing anything fundamentally different than any other database. So, if we can learn what they are doing, we can probably implement the same solution.

@neongreen It sounds like that option involves seeks and overwriting data? It could work, but sounds terrible for performance. I think it is basically the same as the idea I suggested about having a secondary log file which just logs completed writes. The difference is whether we are logging that in the same file or a different one.

I also have at the back of my mind a concern about how these solutions might play out when I finally have time to implement an S3 backend. I do not believe S3 supports appending or modifying existing objects. Perhaps the File backend and S3 backend do not use the same solution -- who knows.

@jberryman

jberryman commented Mar 30, 2017

So did this fix the original issue: serokell@26c9339? It sounds like it would be fairly easy (and important) to write an automated test for this. EDIT oh sorry, didn't realize that was a fork.

@neongreen
Contributor

@stepcut

@neongreen It sounds like that option involves seeks and overwriting data? It could work, but sounds terrible for performance.

Huh, I didn't know that overwriting data was slow. Can you recommend any sources I could read? (I'd google, but since I'm a noob when it comes to low-level things, I don't know how to distinguish trustworthy sources from non-trustworthy ones.)

I think it is basically the same as the idea I suggested about having a secondary log file which just logs completed writes. The difference is whether we are logging that in the same file or a different one.

Okay, then I don't have an opinion on which one should actually be implemented :)

I also have at the back of my mind a concern about how these solutions might play out when I finally have time to implement an S3 backend.

Cool, I didn't know an S3 backend was planned. (Does S3 even have a notion of incomplete writes?)

@neongreen
Contributor

@jberryman

So did this fix the original issue?: serokell/acid-state@26c9339

Well, yes (as in “incomplete writes don't mess things up for us anymore”), but a) we don't have any way to detect corruption now so we wouldn't have noticed if things went wrong, and b) we've mostly moved to RocksDB by now anyway.

@jberryman

@neongreen oh I misunderstood, I assumed you were an acid-state maintainer and meant that acid-state's backend would be moving to RocksDB or something. Have you written about your experience with RocksDB anywhere? Are you using it in a way that's similar to acid-state?

@neongreen
Contributor

@jberryman

Have you written about your experience with rocksdb anywhere?

Nope.

Are you using it in a way that's similar to acid-state?

Not really – for instance, we don't use versioning or migrations, and all of our queries are either “get a primitive value by key” or “get a complex value by key and then parse it with binary”. We switched to RocksDB because a) we don't want to store data in-memory, b) acid-state doesn't seem to be actively developed, and c) acid-state's model is not a particularly good fit for us (e.g. we want Updates that can fail, we want batched writes, etc).

@gbaz

gbaz commented Dec 19, 2018

Since the issue is fixed by serokell@26c9339, shouldn't this be closed?

@parsonsmatt
Contributor

@gbaz that's a fork of acid-state. I am curious why this PR says that the fix won't be merged in -- it links back to this thread, but after a few skims I can't find any rejected PRs or similar.

@gbaz

gbaz commented Dec 20, 2018

It looks like this was the concern.

So I think it is correct that we can distinguish between a user forgetting to update their 'SafeCopy' instances and the case where the Entry is truncated or corrupted.

Additionally, if the Entry is the right length, but has the wrong crc16 then we can also determine that the data has been corrupted.

However -- what about the case where the length says there should be 20 bytes, but we only find 10? Can we say with certainty that the Entry was never fully written? Could it not also be the case that it was fully written, but got corrupted after the fact?

Right now we treat both situations the same -- we abort. That is not wrong -- but it is certainly annoying when the problem is just a transaction that never completed. We could switch to always deleting the entry if it appears truncated -- but silently ignoring corruption would seem to be very wrong to me.

My suggestion would be -- only roll back (i.e. drop the partial entry) at the very end of a file. Hence we don't have to worry about other forms of corruption -- this just handles the partial write in the event of a kill, power failure, etc.
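
As a sketch of that rule, with a toy length-prefixed framing standing in for the real archive parser: complete entries are kept, a checksum failure anywhere aborts, and a short read is tolerated precisely because it can only occur at the end of the remaining input:

```haskell
import qualified Data.ByteString as BS

data ParseStep
  = Parsed BS.ByteString BS.ByteString  -- one complete entry and the rest of the input
  | ShortRead                           -- fewer bytes left than the header promised
  | BadChecksum                         -- entry complete but its checksum is wrong

-- Toy framing (one length byte, no checksum) standing in for the real parser;
-- the real Data.Acid.Archive parser would also be able to report BadChecksum.
parseOneEntry :: BS.ByteString -> ParseStep
parseOneEntry input =
  case BS.uncons input of
    Nothing -> ShortRead
    Just (len, rest)
      | BS.length rest < fromIntegral len -> ShortRead
      | otherwise ->
          Parsed (BS.take (fromIntegral len) rest) (BS.drop (fromIntegral len) rest)

-- Keep complete entries; drop a truncated tail (it can only be the last thing
-- in the file); refuse to continue past anything that looks like corruption.
readEntries :: BS.ByteString -> Either String [BS.ByteString]
readEntries input
  | BS.null input = Right []
  | otherwise =
      case parseOneEntry input of
        Parsed entry rest -> (entry :) <$> readEntries rest
        ShortRead         -> Right []
        BadChecksum       -> Left "corrupted entry: refusing to recover silently"
```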

@adamgundry
Contributor

Right, I think the simplest way forward here is to detect that a file is truncated and roll back the last entry, but throw an error for other kinds of corruption. This should probably be an option chosen by the user when opening the state, because some users might not want it to happen automatically, or might have application-level mechanisms to distinguish clean from unclean shutdown. After all, we have no guarantee that a truncation is the result of a partial write as opposed to subsequent corruption, it's just the most likely cause.

I suppose we could also have the option to make a best effort at restoring the log, ignoring any corrupted entries, which might be helpful for debugging. Side thought: would it make sense to offer a read-only mode that opens a state for queries only, again to help recovery?
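
As a purely hypothetical sketch of what such a user-facing option might look like (none of these names exist in acid-state today):

```haskell
data TruncationPolicy
  = AbortOnTruncation   -- current behaviour: refuse to open the state
  | RollbackLastEntry   -- drop a truncated trailing entry and carry on
  | BestEffortRecovery  -- skip unreadable entries entirely (debugging only)

data OpenOptions = OpenOptions
  { truncationPolicy :: TruncationPolicy
  , readOnly         :: Bool  -- open the state for queries only
  }

-- Conservative defaults: fail loudly, allow writes.
defaultOpenOptions :: OpenOptions
defaultOpenOptions = OpenOptions AbortOnTruncation False
```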

Related to this is the need for better tools to analyse a (possibly corrupted) log file. That will hopefully become easier if/when my work on alternative serialisation layers (#88) gets merged.

@I3ck

I3ck commented Mar 28, 2019

Actually, now that I think of it, the easiest solution seems to be something like this:

* write `'\0' <> crc <> data`

* `fsync`

* rewrite the `'\0'` with `'\1'`

* `fsync`

Shouldn't this resolve this rather serious issue? Or what can a library user do in case of the too few bytes problem?
