Can't open state after crash: too few bytes #79
I think it happens because appending to file isn't atomic: https://github.com/acid-state/acid-state/blob/master/src/Data/Acid/Log.hs#L141. What would be the best way to fix it in acid-state?
Or is it actually handled somewhere in acid-state internals and I haven't noticed it? |
Does acid-state ever append to checkpoint/event files that already existed before opening the state? If it doesn't, then ignoring truncated log entries (without even bothering to delete them) might be a viable solution. |
This is an attempt to fix acid-state#79
@stepcut some ideas? |
I suppose this is the same issue as not being able to start a server after syncing its database from a running server with open event files? |
So, another reason you can get this error is if you change your types but forget to create a SafeCopy migration instance. Currently that causes the app to fail to start -- which seems like a sensible thing to do. With the proposed changes, it seems like acid-state would just silently ignore any entries that it could not read, causing a bunch of valid data to be lost? I believe the intention is that acid-state is supposed to have a crash recovery tool which allows you to examine the data and have a developer make a decision about the right path forward. Though, such a tool does not yet exist. Does that sound correct @lemmih ? |
@stepcut To my knowledge, acid-state works in two stages: 1) reading raw entries from the log, and 2) deserialising those entries with SafeCopy.
My proposed fix doesn't affect 2), only 1). Incorrect migrations are handled at step 2). |
This is an attempt to fix acid-state#79
Since appending a bytestring is not atomic, we could first write to a temporary file and then rename it to the correct log file name. Renaming is atomic. That way a damaged log will never be loaded when reopening the state after a crash. |
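A minimal sketch of the temp-file-and-rename idea, assuming the `bytestring` and `directory` packages (`writeLogAtomic` is an illustrative name, not acid-state's actual API, and a durable version would also need to fsync before the rename):

```haskell
import qualified Data.ByteString as BS
import System.Directory (renameFile)

-- Write the full log contents to a temporary file in the same
-- directory, then rename it over the real log file. rename(2) is
-- atomic within one filesystem, so a crash leaves either the old
-- file or the new one, never a torn write.
writeLogAtomic :: FilePath -> BS.ByteString -> IO ()
writeLogAtomic path contents = do
  let tmp = path ++ ".tmp"
  BS.writeFile tmp contents
  renameFile tmp path

main :: IO ()
main = do
  writeLogAtomic "events-0000.log" (BS.pack [1, 2, 3])
  bs <- BS.readFile "events-0000.log"
  print (BS.length bs)
```

Note the cost concern raised below: this rewrites the whole file on every append rather than extending an already open handle.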
Won't this cause lots of copying if the state is big? Another solution is offered here, but it's a bit more complicated (not much, since we only have to log appends):
Read the post for the whole solution (which requires several
|
In particular, for our case the log will be simply “how many bytes the file had before appending”, and recovering will be a single |
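For illustration, a toy sketch of that undo-log idea under the assumption that the only logged operation is an append (`recordSizeThenAppend` and `recover` are hypothetical names; a real implementation would fsync at the marked point):

```haskell
import System.IO

-- Before appending, durably record the log file's current size in a
-- side file. If we crash mid-append, the recorded size tells us where
-- the last known-good state ends.
recordSizeThenAppend :: FilePath -> FilePath -> String -> IO ()
recordSizeThenAppend undoPath logPath payload = do
  size <- withFile logPath ReadWriteMode hFileSize
  writeFile undoPath (show size)  -- a real implementation would fsync here
  appendFile logPath payload

-- Recovery is a single truncate back to the recorded size, which
-- undoes any partial append.
recover :: FilePath -> FilePath -> IO ()
recover undoPath logPath = do
  size <- read <$> readFile undoPath
  withFile logPath ReadWriteMode (`hSetFileSize` size)

main :: IO ()
main = do
  writeFile "demo.log" "abc"
  recordSizeThenAppend "demo.undo" "demo.log" "defg"
  -- pretend the append above was cut short by a crash, then recover:
  recover "demo.undo" "demo.log"
  readFile "demo.log" >>= putStrLn
```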
@neongreen no, if the rename/move happens on the same filesystem it's simply a matter of changing some inode pointers, so it's pretty cheap. If you move a file to a different filesystem it needs to copy the content, of course, which is probably the most expensive case. |
I'm not talking about renaming being expensive, I'm talking about “write to a temporary file” being expensive.
I no longer remember whether 1. or 2. is true. |
(I think I've seen appending to checkpoints files happening when I was using acid-state, but I'm not 100% sure.) |
Ah, I see your point. The route via temporary files would need a |
@neongreen yes, it uses the same |
Being able to easily diagnose and recover from an error such as this one is perhaps the number one most important task for acid-state. So I do want to take a serious look into this issue. Unfortunately, if you write too much Haskell code you reach a point where you spend all your time updating to the latest build dependencies and you don't seem to have much time left to write new code or tackle serious bugs. Regarding this statement in the initial bug report:
That is simply not true. A checkpoint file can contain more than one checkpoint. So, by itself, one checkpoint file being much smaller than another is meaningless. Regarding the error message itself:
I think there are (at least) three very different situations in which that error can arise.

One way that can happen is if the application is killed in the middle of appending to the event log. When this happens, the entry on disk is truncated.

Another situation where this error could occur is when the database file has gotten corrupted after the fact. For example, a hard disk is starting to fail and a sector has gone bad. In this situation, silently deleting what used to be a valid entry sounds like a very wrong answer. I am pretty sure we do not want a database system that will silently delete corrupted data and then start processing new transactions as if nothing bad ever happened.

The most common situation where this occurs (if I am not mistaken) is when you accidentally change a type, but forget to change the SafeCopy version number and create a Migration instance. For example, if you changed an Int32 to an Int64, then when the server starts up again it will attempt to read 4 more bytes than were written. In this case, the data on the disk is intact and uncorrupted. Silently ignoring those entries is definitely not the right solution. The right solution is to fix the broken code in the server and try again.

So, it seems that we have 3 situations in which we might get the original error message, but in only 1 case can we argue that silently ignoring the entry is the correct thing to do. So, the question is, can we reliably determine which case we are looking at. @neongreen has suggested that deserializing happens in two steps -- first we deserialize an `Entry`, and only then is it decoded with SafeCopy.
And if we look at `acid-state/src/Data/Acid/Archive.hs`, line 40 in c684fc5:
We see that it writes the length of the entry (along with a CRC).

So I think it is correct that we can distinguish between the case where a user forgot to update their 'SafeCopy' instances and the case where the entry itself is incomplete. Additionally, if the CRC does not match, we know the data has been corrupted.

However -- what about the case where the length says there should be 20 bytes, but we only find 10? Can we say with certainty that the Entry was never fully written? Could it not also be the case that it was fully written, but got corrupted after the fact? Right now we treat both situations the same -- we abort. That is not wrong -- but it is certainly annoying when the problem is just a transaction that never completed. We could switch to always deleting the entry if it appears truncated -- but silently ignoring corruption would seem to be very wrong to me.

One proposed solution was to write log entries to a new file and rename them after the write is complete. This would seem to solve the problem -- however, I think it would also be a major performance hit. Surely creating a new file is more expensive than appending to an already open file?

At a very minimum we can do two things:
Perhaps another option would be for acid-state to have some sort of secondary log that indicates which log entries it believes it completed? If a log entry is short -- but the secondary log says it successfully wrote that entry -- then it is almost certainly corruption. If the log entry is short and the secondary log does not indicate it was completed, then it is more likely (but not guaranteed) that it was just aborted mid-write? When HDDs ruled the world I would be concerned about the extra disk seeks that might incur. In an SSD world -- I am not sure what to think.

tl;dr - How do we tell the difference between a transaction that never finished writing, and one that did but got corrupted later? |
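To make the classification above concrete, here is a toy sketch of a length-plus-checksum entry reader. This is not acid-state's real on-disk format (that lives in `Data.Acid.Archive`, which uses a proper CRC rather than the byte-sum stand-in used here); it only illustrates how a short entry is distinguished from a checksum mismatch -- and, per the discussion, `Truncated` can still mean either a partial write or later damage:

```haskell
import qualified Data.ByteString as BS

data EntryStatus = Ok BS.ByteString | Truncated | Corrupt
  deriving (Show, Eq)

-- Toy entry format: [length byte][checksum byte][payload].
readEntry :: BS.ByteString -> EntryStatus
readEntry bs = case BS.uncons bs of
  Nothing -> Truncated
  Just (len, rest) -> case BS.uncons rest of
    Nothing -> Truncated
    Just (ck, payload)
      | BS.length payload < fromIntegral len            -> Truncated
      | checksum (BS.take (fromIntegral len) payload) /= ck -> Corrupt
      | otherwise -> Ok (BS.take (fromIntegral len) payload)
  where
    checksum = BS.foldl' (+) 0  -- stand-in for a real CRC

main :: IO ()
main = do
  let good  = BS.pack [2, 5, 2, 3]  -- len 2, checksum 5, payload [2,3]
      short = BS.pack [2, 5, 2]     -- payload cut off mid-write
      bad   = BS.pack [2, 9, 2, 3]  -- checksum mismatch
  print (readEntry good  == Ok (BS.pack [2, 3]))
  print (readEntry short == Truncated)
  print (readEntry bad   == Corrupt)
```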
Agreed.
Agreed. I think that if this happens, the right course of action is to throw an error message instead of silently discarding the corrupted entry.
Good point. Unfortunately, I'm not sure it's possible even in theory to distinguish between these cases. If corruption is described as arbitrary modification of stored data, then for any state
I'm not sure – this would have to be benchmarked. Filesystems work in mysterious ways. If you are a filesystem expert or can ask a filesystem expert, this changes things :) |
I've never written a database before. Are there problems here that are unique to acid-state?
And is anyone working on this ticket actively? |
@jberryman my understanding is that CRC already gets written:
The problem isn't that we aren't protected against partial writes – we are. The problem is that we can't implement the following semantics:
The reason we can't implement it is that we can't distinguish between “data didn't get written fully” and “data got written fully but got truncated later”. The latter can happen easily if e.g. you copy the database on a USB drive and eject it mid-write. |
Actually, now that I think of it, the easiest solution seems to be something like this:
Now incomplete writes can be distinguished from truncation by looking at the first byte. For extra safety we can write 4-byte magic constants instead of single bytes.

@stepcut thoughts? |
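A rough sketch of the marker-byte idea (illustrative only -- `appendEntry` is a made-up name, and a real implementation would fsync between the two writes): append the entry prefixed with a "pending" byte, flush, then seek back and overwrite the marker with "committed".

```haskell
import qualified Data.ByteString as BS
import System.IO

pending, committed :: BS.ByteString
pending   = BS.singleton 0
committed = BS.singleton 1

-- On recovery, a trailing entry still marked "pending" is an
-- incomplete write; a "committed" but short entry indicates later
-- truncation or corruption.
appendEntry :: FilePath -> BS.ByteString -> IO ()
appendEntry path payload =
  withFile path ReadWriteMode $ \h -> do
    hSeek h SeekFromEnd 0
    pos <- hTell h
    BS.hPut h pending
    BS.hPut h payload
    hFlush h  -- a real implementation would fsync here
    hSeek h AbsoluteSeek pos
    BS.hPut h committed

main :: IO ()
main = do
  BS.writeFile "marker.log" BS.empty
  appendEntry "marker.log" (BS.pack [104, 105])  -- "hi"
  bs <- BS.readFile "marker.log"
  print (BS.head bs)  -- 1 means the entry was committed
```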
@jberryman acid-state is essentially a write-ahead log system, and does use CRCs to check validity. The issue is what happens when a CRC check fails? How do you distinguish between an incomplete write and a completed write that was later corrupted? In theory, you can safely delete an incomplete write, but if the data has become corrupted after the fact, then you probably want to abort. I think you are correct that acid-state is not doing anything fundamentally different from any other database. So, if we can learn what they are doing, we can probably implement the same solution.

@neongreen It sounds like that option involves seeks and overwriting data? It could work, but sounds terrible for performance. I think it is basically the same as the idea I suggested about having a secondary log file which just logs completed writes. The difference is whether we are logging that in the same file or a different one.

I also have at the back of my mind a concern about how these solutions might play out when I finally have time to implement an S3 backend. I do not believe S3 supports appending to or modifying existing objects. Perhaps the File backend and S3 backend do not use the same solution -- who knows. |
So did this fix the original issue: serokell@26c9339? It sounds like it would be fairly easy (and important) to write an automated test for this. EDIT: oh sorry, didn't realize that was a fork. |
Huh, I didn't know that overwriting data was slow. Can you recommend any sources I could read? (I'd google, but since I'm a noob when it comes to low-level things, I don't know how to distinguish trustworthy sources from non-trustworthy ones.)
Okay, then I don't have an opinion on which one should actually be implemented :)
Cool, I didn't know an S3 backend was planned. (Does S3 even have a notion of incomplete writes?) |
Well, yes (as in “incomplete writes don't mess things up for us anymore”), but a) we don't have any way to detect corruption now so we wouldn't have noticed if things went wrong, and b) we've mostly moved to RocksDB by now anyway. |
@neongreen oh I misunderstood, I assumed you were an acid-state maintainer and meant that acid-state's backend would be moving to rocksdb or something. Have you written about your experience with rocksdb anywhere? Are you using it in a way that's similar to acid-state? |
Nope.
Not really – for instance, we don't use versioning or migrations, and all of our queries are either “get a primitive value by key” or “get a complex value by key and then parse it with |
Since the issue is fixed by serokell@26c9339, shouldn't this be closed? |
It looks like this was the concern.
My suggestion would be -- only rollback (i.e. drop the partial entry) at the very end of a file. Hence we don't have to worry about other forms of corruption -- this just handles the partial write in the event of a kill, power failure, etc. |
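That suggestion can be captured in a small pure decision function (all names here are made up for illustration): only a short entry in the final position of the file is treated as a partial write and rolled back; a short entry anywhere else means corruption, so we abort.

```haskell
data Status = Whole | Short deriving (Eq, Show)
data Action = LoadAll | DropLast | Abort deriving (Eq, Show)

-- Decide what to do given the per-entry read results, in file order.
recoveryAction :: [Status] -> Action
recoveryAction entries = case break (== Short) entries of
  (_, [])      -> LoadAll   -- no damage anywhere
  (_, [Short]) -> DropLast  -- damage only at the very end: partial write
  _            -> Abort     -- damage in the middle: corruption

main :: IO ()
main = mapM_ (print . recoveryAction)
  [ [Whole, Whole]         -- clean log
  , [Whole, Short]         -- partial write at end: roll back
  , [Whole, Short, Whole]  -- damage followed by good data: abort
  ]
```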
Right, I think the simplest way forward here is to detect that a file is truncated and roll back the last entry, but throw an error for other kinds of corruption. This should probably be an option chosen by the user when opening the state, because some users might not want it to happen automatically, or might have application-level mechanisms to distinguish clean from unclean shutdown. After all, we have no guarantee that a truncation is the result of a partial write as opposed to subsequent corruption; it's just the most likely cause. I suppose we could also have the option to make a best effort at restoring the log, ignoring any corrupted entries, which might be helpful for debugging.

Side thought: would it make sense to offer a read-only mode that opens a state for queries only, again to help recovery?

Related to this is the need for better tools to analyse a (possibly corrupted) log file. That will hopefully become easier if/when my work on alternative serialisation layers (#88) gets merged. |
Shouldn't this resolve this rather serious issue? Or what can a library user do in case of the |
If I run a bunch of updates on my state (local, on disk), then press `Ctrl-C` during some update, then try to open the state (`openLocalStateFrom` in `stack ghci`), sometimes I get the following error:

I tried to debug it on my own and can provide some extra information.

There are two `checkpoints` files with two consecutive numbers. The last one is much smaller than the other one, so it's probably malformed. There are also `events` files, but they seem to be irrelevant.

The error happens in the `newestEntry` function. To be more precise, it happens when the last (probably malformed) checkpoint is being read. If I go deeper, I can say that this `runGetPartial` fails.

My guess is that the last checkpoint wasn't dumped properly because the program crashed, but the previous checkpoint also exists and should be fine. The problem is that `newestEntry` fails with `error` if the last checkpoint is malformed, but instead it should try to read another checkpoint (if it exists). It's just my guess, I may be wrong, because I don't know this code.

I can also provide an example of a database which can't be read because of this bug. Unfortunately, this example is quite heavy, but maybe someone will look into it.

Here is the database (zip archived): wallet-db.zip

The definition of this data type can be found in this repository. Please use the f97db74cbf09e7d2aa403d2c47a7fe37f7583e8f revision (just in case).

Just run in `stack ghci`:

and you will get that error.