Corruption of persistent database file caused by sudden loss of power #189

Closed
thanhvtruong opened this Issue Jun 20, 2016 · 3 comments

@thanhvtruong

It looks like there is a brief window while the persistent file is being saved during which a sudden power loss will corrupt the persistent database file.

After subjecting our system to thousands of sudden power losses, we noticed the following:

  • Mosquitto starts up with "invalid argument" and "database" read errors.
  • mosquitto.db and mosquitto.db.new both exist in /var/lib/mosquitto.
  • Both mosquitto.db and mosquitto.db.new have the same inode (which suggests that a rename has occurred).

We had a similar problem with another application that we built. Reading the Linux documentation, we found that flushing or closing a file is not enough to get its contents written to disk. An fsync needs to be performed to confirm that the contents have been written to disk.

You can see this documented in the man page on Fedora 23 (man close, under "NOTES", second paragraph):

"A successful close does not guarantee that the data has been successfully saved to disk, as the kernel defers writes. It is not common for a filesystem to flush the buffers when the stream is closed. If you need to be sure that the data is physically stored, use fsync(2). (It will depend on the disk hardware at this point.)"
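To illustrate the point made by the man page, here is a minimal sketch of writing a file and calling fsync(2) before close(2). The helper name `write_durably` is hypothetical and is not part of Mosquitto; retrying on EINTR is left out for brevity.

```c
#define _POSIX_C_SOURCE 200809L  /* expose the POSIX declarations used below */
#include <fcntl.h>
#include <unistd.h>

/* Hypothetical helper: write len bytes of buf to path and ask the kernel
 * to push them to the storage device before closing.
 * Returns 0 on success, -1 on any failure. */
int write_durably(const char *path, const void *buf, size_t len)
{
    int fd = open(path, O_WRONLY | O_CREAT | O_TRUNC, 0600);
    if (fd < 0)
        return -1;

    const char *p = buf;
    while (len > 0) {
        ssize_t n = write(fd, p, len);  /* may write fewer bytes than asked */
        if (n < 0) {
            close(fd);
            return -1;
        }
        p += n;
        len -= (size_t)n;
    }

    /* Without this call the data may still sit in the kernel's page cache. */
    if (fsync(fd) != 0) {
        close(fd);
        return -1;
    }

    return close(fd);
}
```

Even after a successful fsync, as the man page notes, what reaches the physical medium still depends on the disk hardware.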

@ralight ralight added this to the Fixes-next milestone Jun 26, 2016
@ralight ralight added a commit that referenced this issue Jun 26, 2016
@ralight ralight [189] Call fsync after persisting data.
To ensure it is correctly written. Closes #189.

Thanks to thanhvtruong.

Bug: #189
84df2bb
@ralight
Contributor
ralight commented Jun 26, 2016

Thanks for the report, I've added code to call fsync() on the file and on its directory. Could you please confirm whether this fixes the problem for you?
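For readers unfamiliar with syncing a directory, the idea can be sketched roughly as follows. This is an illustration only and is not the code from commit 84df2bb; the function name `fsync_dir` is hypothetical.

```c
#define _POSIX_C_SOURCE 200809L  /* expose the POSIX declarations used below */
#include <fcntl.h>
#include <unistd.h>

/* Hypothetical helper: open a directory read-only and fsync it, so that
 * directory entries created or renamed inside it are pushed to disk.
 * Returns 0 on success, -1 on failure. */
int fsync_dir(const char *dir_path)
{
    int fd = open(dir_path, O_RDONLY);
    if (fd < 0)
        return -1;

    int rc = fsync(fd);
    if (close(fd) != 0)
        rc = -1;
    return rc;
}
```

For the paths mentioned in this issue, a call might look like `fsync_dir("/var/lib/mosquitto")` after the database file itself has been synced.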

@kcallin
Contributor
kcallin commented Jun 27, 2016

I recommend calling fflush before fsync to ensure that the application's buffers are completely flushed to the kernel's buffers before those are flushed to disk. It's awfully hard to reproduce this, but thanhvtruong and I have started a long-term test series to verify it mechanically.
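The ordering recommended above can be sketched like this, assuming the file is written through a stdio `FILE *` stream. The helper name `flush_and_sync` is hypothetical.

```c
#define _POSIX_C_SOURCE 200809L  /* for fileno() and fsync() */
#include <stdio.h>
#include <unistd.h>

/* Hypothetical helper: push a stdio stream all the way to the device.
 * Returns 0 on success, -1 on failure. */
int flush_and_sync(FILE *fp)
{
    /* Step 1: stdio's user-space buffer -> kernel buffers. */
    if (fflush(fp) != 0)
        return -1;

    /* Step 2: kernel buffers -> storage device. */
    if (fsync(fileno(fp)) != 0)
        return -1;

    return 0;
}
```

Calling fsync alone on a stdio stream's descriptor would miss any bytes still held in the stream's own buffer, which is why the fflush comes first.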

I do not believe the directory sync is required; the rename logic should work as-is.
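As a rough sketch of the write-then-rename pattern under discussion (not the code from the pull request): the new contents go to a temporary file, are flushed and synced, and only then replace the old file via rename(2). The function name `replace_file` and its parameters are hypothetical.

```c
#define _POSIX_C_SOURCE 200809L  /* for fileno() and fsync() */
#include <stdio.h>
#include <unistd.h>

/* Hypothetical helper: replace path with a file containing data, going
 * through tmp_path so that path never holds a half-written file.
 * Returns 0 on success, -1 on failure. */
int replace_file(const char *path, const char *tmp_path, const char *data)
{
    FILE *fp = fopen(tmp_path, "w");
    if (fp == NULL)
        return -1;

    /* Write, then flush stdio buffers, then sync kernel buffers to disk. */
    int ok = fputs(data, fp) >= 0
          && fflush(fp) == 0
          && fsync(fileno(fp)) == 0;

    if (fclose(fp) != 0)
        ok = 0;

    /* Only swap the files once the temporary file is known to be synced. */
    if (!ok || rename(tmp_path, path) != 0) {
        remove(tmp_path);
        return -1;
    }
    return 0;
}
```

With the file names from this issue, `path` would correspond to mosquitto.db and `tmp_path` to mosquitto.db.new. This sketch deliberately omits a directory sync, matching the position taken in the comment above.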

I opened a pull request for these changes and will update it as the long-term tests progress.

@kcallin kcallin added a commit to kcallin/mosquitto that referenced this issue Jul 6, 2016
@kcallin kcallin [189] Mosquitto database corrupted on power-loss.
Mosquitto database writes are not atomic and if power is lost during
a write the file will be permanently lost.  This commit makes writes as
atomic as possible.

Signed-off-by: Keegan Callin <kc@kcallin.net>
Bug: eclipse#189
b7ac6c2
@ralight ralight added a commit that referenced this issue Aug 16, 2016
@kcallin @ralight kcallin + ralight [189] Mosquitto database corrupted on power-loss. (#206)
Mosquitto database writes are not atomic and if power is lost during
a write the file will be permanently lost.  This commit makes writes as
atomic as possible.

Signed-off-by: Keegan Callin <kc@kcallin.net>
Bug: #189
7ba3f3d
@ralight
Contributor
ralight commented Aug 16, 2016

Thanks very much for your work on this, I'm closing this now based on your pull request.

@ralight ralight closed this Aug 16, 2016