-
Notifications
You must be signed in to change notification settings - Fork 6.3k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Handle rename() failure in non-local FS #8192
Conversation
db/db_impl/db_impl_open.cc
Outdated
} | ||
if (!s.ok()) { | ||
fs_->DeleteFile(manifest, IOOptions(), nullptr).PermitUncheckedError(); | ||
fs_->DeleteFile(CurrentFileName(dbname_), IOOptions(), nullptr) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm not sure the behavior here. Before we do the renaming, we have xxx.dbtmp and CURRENT. If rename failed, we may keep the same status xxx.dbtmp and CURRENT or we only have the CURRENT (new one). If it is the previous case, we will have not the CURRENT file at all?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is NewDB()
. If NewDB()
failed, then the DB::Open()
call fails too. I think it's OK to delete all generated files here.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I see. For newly created DB, it is fine to delete all of them, which will not create any lose.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Furthermore, if we do not delete them. A subsequent attempt to create new db may fail because some FS does not support overwriting an existing file.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Furthermore, if we do not delete them. A subsequent attempt to create new db may fail because some FS does not support overwriting an existing file.
I did not read the code or subsequent attempt. In that case, does the sequence number increases? If so, we aways create a Manifest file with a new name and so the xxxx.dbtmp.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I am a little confused here after reading the code. The SetCurrentFile already deletes the temporary file if the rename failed so I am not sure why that is needed here. I can see where the Manifest could be left around if the SetCurrentFile failed, but am not sure why we need both DeleteFile calls here.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thinking more. I think original code of deleting only the MANIFEST here is correct, but the current one has issues. If we delete the MANIFEST, we should also make sure the CURRENT file is deleted in this case. Otherwise, a future call to DB::Open()
will find the CURRENT file and think the DB exists. If the MANIFEST is deleted, then open will fail. The difficulty here is that we cannot make sure CURRENT is deleted if we already end up at this point.
- If the rename succeeded on server side, but client returns error. Then we cannot rely on the client to delete the CURRENT file.
- If rename failed on remote, then we know CURRENT does not exist, and we are fine.
The original motivation for deleting the MANIFEST is to avoid future NewDB()
call overwriting an existing file. Now I think this should not be done in the clean-up-after-error phase. Instead, this should be done at the beginning of NewDB()
. Assuming no data loss. If a previous call to NewDB()
failed, then the caller may retry DB::Open()
. There are several possibilities.
- Both CURRENT and MANIFEST exist.
NewDB()
won't be called, and the db is empty. This is OK. - Only the MANIFEST exists (maybe there is a .tmp file that failed to have been deleted).
NewDB()
will be called. In this case, we should avoid creating the same MANIFEST and tmp file.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Furthermore, if we do not delete them. A subsequent attempt to create new db may fail because some FS does not support overwriting an existing file.
I did not read the code or subsequent attempt. In that case, does the sequence number increases? If so, we aways create a Manifest file with a new name and so the xxxx.dbtmp.
In DeleteUnreferencedSst()
, we will bump the next_file_number_ to be the current largest file number + 1
db/db_impl/db_impl_open.cc
Outdated
} | ||
if (!s.ok()) { | ||
fs_->DeleteFile(manifest, IOOptions(), nullptr).PermitUncheckedError(); | ||
fs_->DeleteFile(CurrentFileName(dbname_), IOOptions(), nullptr) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I am a little confused here after reading the code. The SetCurrentFile already deletes the temporary file if the rename failed so I am not sure why that is needed here. I can see where the Manifest could be left around if the SetCurrentFile failed, but am not sure why we need both DeleteFile calls here.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is there other operation (failed but could be success in the backend) that could cause the similar problem?
// a) CURRENT points to the new MANIFEST, and the new MANIFEST is present. | ||
// b) CURRENT points to the original MANIFEST, and the original MANIFEST | ||
// also exists. | ||
if (new_descriptor_log && !manifest_io_status.ok()) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Will the manifest file be orphaned and never get cleaned up after this?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think a subsequent db closing will delete it (https://github.com/facebook/rocksdb/blob/6.19.fb/db/db_impl/db_impl.cc#L585).
// If manifest append failed for whatever reason, the file could be | ||
// corrupted. So we need to force the next version update to start a | ||
// new manifest file. | ||
descriptor_log_.reset(); | ||
if (new_descriptor_log) { | ||
// If manifest operations failed, then we know the CURRENT file still |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If rename()
failed, should it just keep retrying? I guess we will retry ProcessManifestWrites()
later by re-writing the manifest file and call rename()
again, but should it just retry the single failed operation rename()
?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We do not have a good retry policy, so it is better to return the error and let application decide. We cannot keep retrying because we do not know how long it takes.
In the meantime, RocksDB should make sure there is no data loss. As you mentioned, the first subsequent LogAndApply()
will do retry as necessary.
f3d695e
to
cf422ad
Compare
@riversand963 has imported this pull request. If you are a Facebook employee, you can view this diff on Phabricator. |
@riversand963 has updated the pull request. You must reimport the pull request before landing. |
@riversand963 has imported this pull request. If you are a Facebook employee, you can view this diff on Phabricator. |
@riversand963 has updated the pull request. You must reimport the pull request before landing. |
@riversand963 has imported this pull request. If you are a Facebook employee, you can view this diff on Phabricator. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM
db/db_test2.cc
Outdated
std::unique_ptr<WritableFile>* result, | ||
const EnvOptions& env_opts) override { | ||
Status s = target()->FileExists(fname); | ||
EXPECT_TRUE(s.IsNotFound()) << fname << " already exists."; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Nit: can this logic live in SpecialEnv and guard by a variable?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Sure. Let me update
@riversand963 has updated the pull request. You must reimport the pull request before landing. |
5f90e96
to
2a93d13
Compare
@riversand963 has updated the pull request. You must reimport the pull request before landing. |
@riversand963 has imported this pull request. If you are a Facebook employee, you can view this diff on Phabricator. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM
2a93d13
to
638ff0d
Compare
@riversand963 has updated the pull request. You must reimport the pull request before landing. |
@riversand963 has imported this pull request. If you are a Facebook employee, you can view this diff on Phabricator. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM, thanks for the fix!
638ff0d
to
de7c4da
Compare
@riversand963 has updated the pull request. You must reimport the pull request before landing. |
@riversand963 has imported this pull request. If you are a Facebook employee, you can view this diff on Phabricator. |
@riversand963 merged this pull request in a376c22. |
Summary: In a distributed environment, a file `rename()` operation can succeed on server (remote) side, but the client can somehow return non-ok status to RocksDB. Possible reasons include network partition, connection issue, etc. This happens in `rocksdb::SetCurrentFile()`, which can be called in `LogAndApply() -> ProcessManifestWrites()` if RocksDB tries to switch to a new MANIFEST. We currently always delete the new MANIFEST if an error occurs. This is problematic in distributed world. If the server-side successfully updates the CURRENT file via renaming, then a subsequent `DB::Open()` will try to look for the new MANIFEST and fail. As a fix, we can track the execution result of IO operations on the new MANIFEST. - If IO operations on the new MANIFEST fail, then we know the CURRENT must point to the original MANIFEST. Therefore, it is safe to remove the new MANIFEST. - If IO operations on the new MANIFEST all succeed, but somehow we end up in the clean up code block, then we do not know whether CURRENT points to the new or old MANIFEST. (For local POSIX-compliant FS, it should still point to old MANIFEST, but it does not matter if we keep the new MANIFEST.) Therefore, we keep the new MANIFEST. - Any future `LogAndApply()` will switch to a new MANIFEST and update CURRENT. - If process reopens the db immediately after the failure, then the CURRENT file can point to either the new MANIFEST or the old one, both of which exist. Therefore, recovery can succeed and ignore the other. Pull Request resolved: #8192 Test Plan: make check Reviewed By: zhichao-cao Differential Revision: D27804648 Pulled By: riversand963 fbshipit-source-id: 9c16f2a5ce41bc6aadf085e48449b19ede8423e4
Summary: In a distributed environment, a file `rename()` operation can succeed on server (remote) side, but the client can somehow return non-ok status to RocksDB. Possible reasons include network partition, connection issue, etc. This happens in `rocksdb::SetCurrentFile()`, which can be called in `LogAndApply() -> ProcessManifestWrites()` if RocksDB tries to switch to a new MANIFEST. We currently always delete the new MANIFEST if an error occurs. This is problematic in distributed world. If the server-side successfully updates the CURRENT file via renaming, then a subsequent `DB::Open()` will try to look for the new MANIFEST and fail. As a fix, we can track the execution result of IO operations on the new MANIFEST. - If IO operations on the new MANIFEST fail, then we know the CURRENT must point to the original MANIFEST. Therefore, it is safe to remove the new MANIFEST. - If IO operations on the new MANIFEST all succeed, but somehow we end up in the clean up code block, then we do not know whether CURRENT points to the new or old MANIFEST. (For local POSIX-compliant FS, it should still point to old MANIFEST, but it does not matter if we keep the new MANIFEST.) Therefore, we keep the new MANIFEST. - Any future `LogAndApply()` will switch to a new MANIFEST and update CURRENT. - If process reopens the db immediately after the failure, then the CURRENT file can point to either the new MANIFEST or the old one, both of which exist. Therefore, recovery can succeed and ignore the other. Pull Request resolved: #8192 Test Plan: make check Reviewed By: zhichao-cao Differential Revision: D27804648 Pulled By: riversand963 fbshipit-source-id: 9c16f2a5ce41bc6aadf085e48449b19ede8423e4
Summary: DB Stress to add --open_metadata_write_fault_one_in which would randomly fail in some file metadata modification operations during DB Open, including file creation, close, renaming and directory sync. Some operations can fail before and after the operations take place. If DB open fails, db_stress would retry without the failure ingestion, and DB is expected to open successfully. This option is enabled in crash test in half of the time. Some follow up changes would allow write failures in open time, and ingesting those failures in non-DB open cases. Pull Request resolved: #8235 Test Plan: Run stress tests for a while and see failures got triggered. This can reproduce the bug fixed by #8192 and a similar one that fails when fsyncing parent directory. Reviewed By: anand1976 Differential Revision: D28010944 fbshipit-source-id: 36a96da4dc3633e5f7680cef3ea0a900fcdb5558
Summary: DB Stress to add --open_metadata_write_fault_one_in which would randomly fail in some file metadata modification operations during DB Open, including file creation, close, renaming and directory sync. Some operations can fail before and after the operations take place. If DB open fails, db_stress would retry without the failure ingestion, and DB is expected to open successfully. This option is enabled in crash test in half of the time. Some follow up changes would allow write failures in open time, and ingesting those failures in non-DB open cases. Pull Request resolved: #8235 Test Plan: Run stress tests for a while and see failures got triggered. This can reproduce the bug fixed by #8192 and a similar one that fails when fsyncing parent directory. Reviewed By: anand1976 Differential Revision: D28010944 fbshipit-source-id: 36a96da4dc3633e5f7680cef3ea0a900fcdb5558
Summary: In a distributed environment, a file `rename()` operation can succeed on server (remote) side, but the client can somehow return non-ok status to RocksDB. Possible reasons include network partition, connection issue, etc. This happens in `rocksdb::SetCurrentFile()`, which can be called in `LogAndApply() -> ProcessManifestWrites()` if RocksDB tries to switch to a new MANIFEST. We currently always delete the new MANIFEST if an error occurs. This is problematic in distributed world. If the server-side successfully updates the CURRENT file via renaming, then a subsequent `DB::Open()` will try to look for the new MANIFEST and fail. As a fix, we can track the execution result of IO operations on the new MANIFEST. - If IO operations on the new MANIFEST fail, then we know the CURRENT must point to the original MANIFEST. Therefore, it is safe to remove the new MANIFEST. - If IO operations on the new MANIFEST all succeed, but somehow we end up in the clean up code block, then we do not know whether CURRENT points to the new or old MANIFEST. (For local POSIX-compliant FS, it should still point to old MANIFEST, but it does not matter if we keep the new MANIFEST.) Therefore, we keep the new MANIFEST. - Any future `LogAndApply()` will switch to a new MANIFEST and update CURRENT. - If process reopens the db immediately after the failure, then the CURRENT file can point to either the new MANIFEST or the old one, both of which exist. Therefore, recovery can succeed and ignore the other. Pull Request resolved: facebook/rocksdb#8192 Test Plan: make check Reviewed By: zhichao-cao Differential Revision: D27804648 Pulled By: riversand963 fbshipit-source-id: 9c16f2a5ce41bc6aadf085e48449b19ede8423e4 Signed-off-by: Changlong Chen <levisonchen@live.cn>
Summary: In a distributed environment, a file `rename()` operation can succeed on server (remote) side, but the client can somehow return non-ok status to RocksDB. Possible reasons include network partition, connection issue, etc. This happens in `rocksdb::SetCurrentFile()`, which can be called in `LogAndApply() -> ProcessManifestWrites()` if RocksDB tries to switch to a new MANIFEST. We currently always delete the new MANIFEST if an error occurs. This is problematic in distributed world. If the server-side successfully updates the CURRENT file via renaming, then a subsequent `DB::Open()` will try to look for the new MANIFEST and fail. As a fix, we can track the execution result of IO operations on the new MANIFEST. - If IO operations on the new MANIFEST fail, then we know the CURRENT must point to the original MANIFEST. Therefore, it is safe to remove the new MANIFEST. - If IO operations on the new MANIFEST all succeed, but somehow we end up in the clean up code block, then we do not know whether CURRENT points to the new or old MANIFEST. (For local POSIX-compliant FS, it should still point to old MANIFEST, but it does not matter if we keep the new MANIFEST.) Therefore, we keep the new MANIFEST. - Any future `LogAndApply()` will switch to a new MANIFEST and update CURRENT. - If process reopens the db immediately after the failure, then the CURRENT file can point to either the new MANIFEST or the old one, both of which exist. Therefore, recovery can succeed and ignore the other. Pull Request resolved: facebook/rocksdb#8192 Test Plan: make check Reviewed By: zhichao-cao Differential Revision: D27804648 Pulled By: riversand963 fbshipit-source-id: 9c16f2a5ce41bc6aadf085e48449b19ede8423e4 Signed-off-by: Changlong Chen <levisonchen@live.cn>
In a distributed environment, a file
rename()
operation can succeed on server (remote)side, but the client can somehow return non-ok status to RocksDB. Possible reasons include
network partition, connection issue, etc. This happens in
rocksdb::SetCurrentFile()
, whichcan be called in
LogAndApply() -> ProcessManifestWrites()
if RocksDB tries to switch to anew MANIFEST. We currently always delete the new MANIFEST if an error occurs.
This is problematic in distributed world. If the server-side successfully updates the CURRENT
file via renaming, then a subsequent
DB::Open()
will try to look for the new MANIFEST and fail.As a fix, we can track the execution result of IO operations on the new MANIFEST.
MANIFEST. Therefore, it is safe to remove the new MANIFEST.
code block, then we do not know whether CURRENT points to the new or old MANIFEST. (For local
POSIX-compliant FS, it should still point to old MANIFEST, but it does not matter if we keep the
new MANIFEST.) Therefore, we keep the new MANIFEST.
LogAndApply()
will switch to a new MANIFEST and update CURRENT.to either the new MANIFEST or the old one, both of which exist. Therefore, recovery can
succeed and ignore the other.
Test plan:
make check