Handle rename() failure in non-local FS #8192

riversand963 · 2021-04-15T06:23:29Z

In a distributed environment, a file rename() operation can succeed on the server (remote)
side while the client still returns a non-ok status to RocksDB. Possible reasons include a
network partition, a connection issue, etc. This can happen in rocksdb::SetCurrentFile(), which
can be called in LogAndApply() -> ProcessManifestWrites() if RocksDB tries to switch to a
new MANIFEST. We currently always delete the new MANIFEST if an error occurs.

This is problematic in a distributed world. If the server side successfully updates the CURRENT
file via renaming, then a subsequent DB::Open() will look for the new MANIFEST and fail.

As a fix, we can track the execution result of IO operations on the new MANIFEST.

- If IO operations on the new MANIFEST fail, then we know that CURRENT must point to the
  original MANIFEST. Therefore, it is safe to remove the new MANIFEST.
- If IO operations on the new MANIFEST all succeed, but somehow we end up in the cleanup
  code block, then we do not know whether CURRENT points to the new or old MANIFEST. (For a
  local POSIX-compliant FS, it should still point to the old MANIFEST, but it does not matter
  if we keep the new MANIFEST.) Therefore, we keep the new MANIFEST.
  - Any future LogAndApply() will switch to a new MANIFEST and update CURRENT.
  - If the process reopens the db immediately after the failure, then the CURRENT file can
    point to either the new MANIFEST or the old one, both of which exist. Therefore, recovery
    can succeed and ignore the other.
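
The two cases above can be sketched as a small decision function. This is only an illustration of the rule described in this PR, not the actual code in db/version_set.cc: `ManifestSwitchResult`, `ManifestCleanup`, and `DecideCleanup` are hypothetical names invented for the sketch.

```cpp
// Hypothetical, simplified stand-ins for illustration; these are NOT RocksDB types.
enum class ManifestCleanup {
  kNothingToDo,        // no error, or no new MANIFEST was created
  kDeleteNewManifest,  // CURRENT must still point to the old MANIFEST
  kKeepNewManifest     // CURRENT may point to either MANIFEST
};

struct ManifestSwitchResult {
  bool created_new_manifest;  // did this attempt create a new MANIFEST?
  bool manifest_io_ok;        // did all IO on the new MANIFEST succeed?
  bool overall_ok;            // final status, including the CURRENT rename
};

// Decide what the cleanup path should do with the new MANIFEST.
inline ManifestCleanup DecideCleanup(const ManifestSwitchResult& r) {
  if (r.overall_ok || !r.created_new_manifest) {
    return ManifestCleanup::kNothingToDo;
  }
  if (!r.manifest_io_ok) {
    // MANIFEST IO failed, so the rename of CURRENT was never attempted
    // successfully: removing the new MANIFEST is safe.
    return ManifestCleanup::kDeleteNewManifest;
  }
  // MANIFEST IO succeeded but something later (e.g. rename) reported an error.
  // On a remote FS the rename may have taken effect, so keep the file.
  return ManifestCleanup::kKeepNewManifest;
}
```

For example, a result with `manifest_io_ok == true` and `overall_ok == false` maps to `kKeepNewManifest`, which is the case this PR changes.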

Test plan:
make check

zhichao-cao · 2021-04-15T06:41:21Z

db/db_impl/db_impl_open.cc

+  }
+  if (!s.ok()) {
+    fs_->DeleteFile(manifest, IOOptions(), nullptr).PermitUncheckedError();
+    fs_->DeleteFile(CurrentFileName(dbname_), IOOptions(), nullptr)


I'm not sure about the behavior here. Before the rename, we have xxx.dbtmp and CURRENT. If the rename fails, we may still have both xxx.dbtmp and CURRENT, or we may have only the (new) CURRENT. In the former case, will we end up with no CURRENT file at all?

This is NewDB(). If NewDB() failed, then the DB::Open() call fails too. I think it's OK to delete all generated files here.

I see. For a newly created DB, it is fine to delete all of them, which will not cause any loss.

Furthermore, if we do not delete them, a subsequent attempt to create a new db may fail because some file systems do not support overwriting an existing file.

I did not read the code for the subsequent attempt. In that case, does the sequence number increase? If so, we always create a MANIFEST file with a new name, and likewise the xxxx.dbtmp file.

I am a little confused here after reading the code. The SetCurrentFile already deletes the temporary file if the rename failed so I am not sure why that is needed here. I can see where the Manifest could be left around if the SetCurrentFile failed, but am not sure why we need both DeleteFile calls here.

Thinking more. I think the original code, which deleted only the MANIFEST here, is correct, but the current one has issues. If we delete the MANIFEST, we should also make sure the CURRENT file is deleted in this case. Otherwise, a future call to DB::Open() will find the CURRENT file and think the DB exists. If the MANIFEST is deleted, then open will fail. The difficulty here is that we cannot make sure CURRENT is deleted if we already end up at this point.

- If the rename succeeded on the server side but the client returns an error, then we cannot rely on the client to delete the CURRENT file.
- If the rename failed on the remote side, then we know CURRENT does not exist, and we are fine.

The original motivation for deleting the MANIFEST is to avoid a future NewDB() call overwriting an existing file. Now I think this should not be done in the clean-up-after-error phase. Instead, this should be done at the beginning of NewDB(). Assume no data loss. If a previous call to NewDB() failed, then the caller may retry DB::Open(). There are several possibilities:

- Both CURRENT and MANIFEST exist. NewDB() won't be called, and the db is empty. This is OK.
- Only the MANIFEST exists (maybe along with a .tmp file that failed to be deleted). NewDB() will be called. In this case, we should avoid creating the same MANIFEST and tmp file.

In DeleteUnreferencedSst(), we will bump the next_file_number_ to be the current largest file number + 1
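
As a rough illustration of why a retried NewDB() would not reuse a leftover file name, the bump described above can be sketched as follows. `NextFileNumberAfter` is a hypothetical helper written for this sketch, not a RocksDB function, and taking the maximum with the current counter is an assumption so that the counter never moves backwards.

```cpp
#include <algorithm>
#include <cstdint>
#include <vector>

// Hypothetical helper (not RocksDB code): given the file numbers found on
// disk and the current value of the counter, return a counter value that is
// strictly larger than every existing file number.
inline uint64_t NextFileNumberAfter(const std::vector<uint64_t>& existing,
                                    uint64_t current_next) {
  uint64_t largest = 0;
  for (uint64_t n : existing) {
    largest = std::max(largest, n);
  }
  // Assumption for this sketch: never move the counter backwards.
  return std::max(current_next, largest + 1);
}
```

With leftover files numbered 1 and 4 and a counter at 2, the sketch returns 5, so a new MANIFEST or .dbtmp file would get a name that does not collide with the leftovers.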

db/db_test2.cc

mrambacher · 2021-04-15T15:01:32Z

db/db_impl/db_impl_open.cc

+  }
+  if (!s.ok()) {
+    fs_->DeleteFile(manifest, IOOptions(), nullptr).PermitUncheckedError();
+    fs_->DeleteFile(CurrentFileName(dbname_), IOOptions(), nullptr)


I am a little confused here after reading the code. The SetCurrentFile already deletes the temporary file if the rename failed so I am not sure why that is needed here. I can see where the Manifest could be left around if the SetCurrentFile failed, but am not sure why we need both DeleteFile calls here.

jay-zhuang

Are there other operations (reported as failed but possibly successful in the backend) that could cause a similar problem?

jay-zhuang · 2021-04-15T16:41:50Z

db/version_set.cc

+    // a) CURRENT points to the new MANIFEST, and the new MANIFEST is present.
+    // b) CURRENT points to the original MANIFEST, and the original MANIFEST
+    //    also exists.
+    if (new_descriptor_log && !manifest_io_status.ok()) {


Will the manifest file be orphaned and never get cleaned up after this?

I think a subsequent db closing will delete it (https://github.com/facebook/rocksdb/blob/6.19.fb/db/db_impl/db_impl.cc#L585).

jay-zhuang · 2021-04-15T17:13:06Z

db/version_set.cc

    // If manifest append failed for whatever reason, the file could be
    // corrupted. So we need to force the next version update to start a
    // new manifest file.
    descriptor_log_.reset();
-    if (new_descriptor_log) {
+    // If manifest operations failed, then we know the CURRENT file still


If rename() failed, should it just keep retrying? I guess we will retry ProcessManifestWrites() later by rewriting the manifest file and calling rename() again, but should it just retry the single failed operation, rename()?

We do not have a good retry policy, so it is better to return the error and let the application decide. We cannot keep retrying because we do not know how long it will take.
In the meantime, RocksDB should make sure there is no data loss. As you mentioned, the first subsequent LogAndApply() will retry as necessary.
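
To illustrate what "let the application decide" could look like on the caller's side, here is a minimal sketch of a bounded retry loop. It is not part of this PR or of RocksDB: `RetryBounded` is a hypothetical helper, and the operation is modeled as a callable that returns true on success.

```cpp
#include <functional>

// Hypothetical application-side helper (not RocksDB code): run `op` until it
// succeeds or `max_attempts` attempts have been made. Returns true on success.
inline bool RetryBounded(const std::function<bool()>& op, int max_attempts) {
  for (int attempt = 0; attempt < max_attempts; ++attempt) {
    if (op()) {
      return true;  // the operation succeeded on this attempt
    }
  }
  return false;  // gave up; the caller decides what to do next
}
```

An application might wrap a write that triggers LogAndApply() in such a loop, choosing the attempt limit (and any backoff, omitted here) based on its own latency budget.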

facebook-github-bot · 2021-04-15T21:28:57Z

@riversand963 has imported this pull request. If you are a Facebook employee, you can view this diff on Phabricator.

facebook-github-bot · 2021-04-16T00:58:21Z

@riversand963 has updated the pull request. You must reimport the pull request before landing.

facebook-github-bot · 2021-04-16T00:58:39Z

@riversand963 has imported this pull request. If you are a Facebook employee, you can view this diff on Phabricator.

facebook-github-bot · 2021-04-16T06:35:42Z

@riversand963 has updated the pull request. You must reimport the pull request before landing.

facebook-github-bot · 2021-04-16T06:40:17Z

@riversand963 has imported this pull request. If you are a Facebook employee, you can view this diff on Phabricator.

siying

LGTM

siying · 2021-04-16T17:19:02Z

db/db_test2.cc

+                         std::unique_ptr<WritableFile>* result,
+                         const EnvOptions& env_opts) override {
+    Status s = target()->FileExists(fname);
+    EXPECT_TRUE(s.IsNotFound()) << fname << " already exists.";


Nit: can this logic live in SpecialEnv and be guarded by a variable?

Sure. Let me update.
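
The suggestion can be pictured with a small in-memory stand-in. This is a sketch under stated assumptions, not the test code from this PR: `GuardedFakeEnv`, `check_no_overwrite`, and `overwrite_violations` are hypothetical names, and the real SpecialEnv wraps an actual Env rather than a set of strings.

```cpp
#include <set>
#include <string>

// Hypothetical in-memory stand-in for a test Env wrapper (not SpecialEnv).
// When the guard variable is set, creating a file that already exists is
// recorded as a violation instead of being silently allowed.
class GuardedFakeEnv {
 public:
  bool check_no_overwrite = false;  // guard variable; off by default
  int overwrite_violations = 0;     // number of detected overwrites

  bool FileExists(const std::string& fname) const {
    return files_.count(fname) > 0;
  }

  void NewWritableFile(const std::string& fname) {
    if (check_no_overwrite && FileExists(fname)) {
      ++overwrite_violations;  // a real test would fail an expectation here
    }
    files_.insert(fname);
  }

 private:
  std::set<std::string> files_;
};
```

Keeping the check behind a variable means existing tests that legitimately recreate files are unaffected unless they opt in.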

facebook-github-bot · 2021-04-16T19:27:41Z

@riversand963 has updated the pull request. You must reimport the pull request before landing.

facebook-github-bot · 2021-04-16T19:28:15Z

@riversand963 has updated the pull request. You must reimport the pull request before landing.

facebook-github-bot · 2021-04-16T19:29:45Z

@riversand963 has imported this pull request. If you are a Facebook employee, you can view this diff on Phabricator.

siying

LGTM

facebook-github-bot · 2021-04-19T15:51:51Z

@riversand963 has updated the pull request. You must reimport the pull request before landing.

facebook-github-bot · 2021-04-19T15:53:24Z

@riversand963 has imported this pull request. If you are a Facebook employee, you can view this diff on Phabricator.

zhichao-cao

LGTM, thanks for the fix!

facebook-github-bot · 2021-04-19T23:12:58Z

@riversand963 has updated the pull request. You must reimport the pull request before landing.

facebook-github-bot · 2021-04-19T23:14:31Z

@riversand963 has imported this pull request. If you are a Facebook employee, you can view this diff on Phabricator.

facebook-github-bot · 2021-04-20T01:11:23Z

@riversand963 merged this pull request in a376c22.

Summary:
In a distributed environment, a file `rename()` operation can succeed on server (remote) side, but the client can somehow return non-ok status to RocksDB. Possible reasons include network partition, connection issue, etc. This happens in `rocksdb::SetCurrentFile()`, which can be called in `LogAndApply() -> ProcessManifestWrites()` if RocksDB tries to switch to a new MANIFEST. We currently always delete the new MANIFEST if an error occurs.

This is problematic in distributed world. If the server-side successfully updates the CURRENT file via renaming, then a subsequent `DB::Open()` will try to look for the new MANIFEST and fail.

As a fix, we can track the execution result of IO operations on the new MANIFEST.

- If IO operations on the new MANIFEST fail, then we know the CURRENT must point to the original MANIFEST. Therefore, it is safe to remove the new MANIFEST.
- If IO operations on the new MANIFEST all succeed, but somehow we end up in the clean up code block, then we do not know whether CURRENT points to the new or old MANIFEST. (For local POSIX-compliant FS, it should still point to old MANIFEST, but it does not matter if we keep the new MANIFEST.) Therefore, we keep the new MANIFEST.
  - Any future `LogAndApply()` will switch to a new MANIFEST and update CURRENT.
  - If process reopens the db immediately after the failure, then the CURRENT file can point to either the new MANIFEST or the old one, both of which exist. Therefore, recovery can succeed and ignore the other.

Pull Request resolved: #8192
Test Plan: make check
Reviewed By: zhichao-cao
Differential Revision: D27804648
Pulled By: riversand963
fbshipit-source-id: 9c16f2a5ce41bc6aadf085e48449b19ede8423e4

Summary:
DB Stress to add --open_metadata_write_fault_one_in which would randomly fail in some file metadata modification operations during DB Open, including file creation, close, renaming and directory sync. Some operations can fail before and after the operations take place. If DB open fails, db_stress would retry without the failure ingestion, and DB is expected to open successfully. This option is enabled in crash test in half of the time. Some follow up changes would allow write failures in open time, and ingesting those failures in non-DB open cases.

Pull Request resolved: #8235
Test Plan: Run stress tests for a while and see failures got triggered. This can reproduce the bug fixed by #8192 and a similar one that fails when fsyncing parent directory.
Reviewed By: anand1976
Differential Revision: D28010944
fbshipit-source-id: 36a96da4dc3633e5f7680cef3ea0a900fcdb5558
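
The "one in N" style of fault injection mentioned in the commit message above can be sketched as follows. This is an illustrative sketch only, not db_stress code: `OneInFaultInjector` is a hypothetical class, and treating a value of 0 as "disabled" is an assumption made for the example.

```cpp
#include <cstdint>
#include <random>

// Hypothetical sketch (not db_stress code): decide, with probability 1/N,
// whether a metadata operation should be made to fail.
class OneInFaultInjector {
 public:
  OneInFaultInjector(uint32_t one_in, uint64_t seed)
      : one_in_(one_in), rng_(seed) {}

  // Returns true if the next operation should fail.
  bool ShouldFail() {
    if (one_in_ == 0) {
      return false;  // assumption for this sketch: 0 disables injection
    }
    std::uniform_int_distribution<uint32_t> dist(0, one_in_ - 1);
    return dist(rng_) == 0;
  }

 private:
  uint32_t one_in_;
  std::mt19937_64 rng_;
};
```

A test harness could consult `ShouldFail()` before and after an operation such as a rename, which corresponds to the "fail before and after the operations take place" behavior described in the summary.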

facebook-github-bot added the CLA Signed label Apr 15, 2021

zhichao-cao reviewed Apr 15, 2021

mrambacher reviewed Apr 15, 2021

jay-zhuang reviewed Apr 15, 2021

riversand963 force-pushed the fix-set-current branch from f3d695e to cf422ad Compare April 15, 2021 17:40

riversand963 marked this pull request as ready for review April 15, 2021 21:27

riversand963 requested a review from siying April 16, 2021 17:15

siying reviewed Apr 16, 2021

riversand963 force-pushed the fix-set-current branch from 5f90e96 to 2a93d13 Compare April 16, 2021 19:28

riversand963 requested review from siying and zhichao-cao April 16, 2021 23:22

siying reviewed Apr 17, 2021

riversand963 force-pushed the fix-set-current branch from 2a93d13 to 638ff0d Compare April 19, 2021 15:51

zhichao-cao approved these changes Apr 19, 2021

riversand963 added 6 commits April 19, 2021 16:12

Handle rename failure in distributed env

9df950b

add unit test

c6aec4e

Address comments and check status on success path

f622b35

fix jtest;

505f80d

Assert manifest IO status

ac3fba1

Address comment

efaef63

riversand963 added 3 commits April 19, 2021 16:12

Update history

2882d1c

Address review comments

1e8cd21

Update HISTORY

de7c4da

riversand963 force-pushed the fix-set-current branch from 638ff0d to de7c4da Compare April 19, 2021 23:12

facebook-github-bot closed this in a376c22 Apr 20, 2021

facebook-github-bot added the Merged label Apr 20, 2021

riversand963 deleted the fix-set-current branch April 20, 2021 01:11

siying mentioned this pull request Apr 26, 2021

db_stress to add --open_metadata_write_fault_one_in #8235

Closed
