Fix hang in MultiRead with O_DIRECT and io_uring #10368
@akankshamahajan15 has imported this pull request. If you are a Meta employee, you can view this diff on Phabricator.
I am also working on adding a unit test to catch this hang.
06905e0
to
d54b32b
Compare
@akankshamahajan15 has updated the pull request. You must reimport the pull request before landing.
LGTM. Please add a unit test for this scenario. I'm surprised that it is not covered in EnvPosixTestWithParam.MultiRead, where we explicitly generate cases where bytes_read=0. We might need to investigate whether that test actually does its job.
I am trying to repro the hang with this patch applied. It will take a few hours to get a signal. Otherwise I don't have an opinion on the diff because I don't know this part of RocksDB that well.
Thanks for testing with this patch, Mark. I am also trying to reproduce it with unit tests (currently this exact case is not covered).
In our unit tests, Line 1434 in a543773
In the other test, the sync point is not enabled, so partial results were never injected. Line 1404 in a543773
This test bug was introduced only 8 days ago, after the bug was introduced: #10278
I was unable to reproduce it in 15 attempts with the patch. I will let it run overnight, but without the patch I can repro it in 1 or 2 attempts.
Summary: Fix a bug in MultiRead with O_DIRECT and io_uring: at EOF, when bytes_read = 0, a wrong check causes the read to get stuck in an infinite loop. Test Plan: CircleCI
02725be
to
09cd71f
Compare
I am able to reproduce the issue (hang) in the unit test.
09cd71f
to
89ee33e
Compare
7668797
to
3f2f585
Compare
3f2f585
to
4b369b9
Compare
@siying Do you know which platforms support direct_io and which do not? I am getting an error
whereas it works fine on my devserver.
This is ramfs, so direct I/O is not supported. I think the best way to test the direct I/O scenario is to emulate the behavior. What many tests in env_test do to test direct I/O is turn off direct I/O when creating the fd and rely on the alignment assertion in the code to validate some behavior, like here: Lines 1836 to 1844 in 9620653
Is it true that the bug would happen in the non-direct I/O case too? I feel that if we try to read beyond the file size, we will fall into this infinite loop whether or not it is direct I/O. It's just that unless it is direct I/O, we never read beyond the file size. If that is the case, can we try to reproduce it too?
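To illustrate the kind of alignment check such an emulated test can lean on, here is a minimal sketch. The helper names `IsAligned` and `IsDirectIOCompatible` are hypothetical and are not taken from RocksDB; the assumption is only that direct I/O requires the buffer address, file offset, and length to be multiples of a power-of-two alignment.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

// Hypothetical helper: true if `value` is a multiple of `alignment`.
// `alignment` is assumed to be a non-zero power of two.
bool IsAligned(std::uint64_t value, std::size_t alignment) {
  assert(alignment != 0 && (alignment & (alignment - 1)) == 0);
  return (value & (alignment - 1)) == 0;
}

// Hypothetical helper: checks the three conditions a direct I/O read is
// assumed to need: aligned buffer address, aligned offset, aligned length.
bool IsDirectIOCompatible(const void* buf, std::uint64_t offset,
                          std::size_t len, std::size_t alignment) {
  return IsAligned(reinterpret_cast<std::uintptr_t>(buf), alignment) &&
         IsAligned(offset, alignment) && IsAligned(len, alignment);
}
```

A test that cannot open a file with O_DIRECT (for example on ramfs) could still assert on a check like this to confirm that the direct I/O code path was given aligned requests.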
@@ -1522,6 +1522,71 @@ TEST_F(EnvPosixTest, MultiReadNonAlignedLargeNum) {
  }
}

TEST_F(EnvPosixTest, MultiReadDirectIONonAlignedLargeNum) {
  EnvOptions soptions;
  soptions.use_direct_reads = true;
I suspect that the non-direct case also has the bug, and either way we should add the coverage.
The issue was with direct_io. In the non_direct_io case, when bytes_read is 0 we do the read again and update the results there, without adding the request to the incomplete list. We already have a test for that: MultiReadNonAlignedLargeNum.
1238cd4
to
733e145
Compare
  // Validate results
  for (size_t i = 0; i < num_reads; ++i) {
    ASSERT_OK(reqs[i].status);
We could validate the result too.
Summary: Fix a bug in MultiRead with O_DIRECT and io_uring. At EOF, when bytes_read = 0, a wrong check caused the request to be added to the incomplete list, where it got stuck in an infinite loop because every retry returns bytes_read = 0 again. The bug was introduced by PR #10197, which has not been released in any release branch. Pull Request resolved: #10368 Test Plan: Added new unit test Reviewed By: siying Differential Revision: D37885184 Pulled By: akankshamahajan15 fbshipit-source-id: 35b36a44b696d29b2f6f25301aa1b19547b4e03b
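The failure mode in the summary can be sketched with a small simulation. This is not the RocksDB code: `ReadRequest`, `SimulatedRead`, and `DrainRequest` are hypothetical names, the file is modeled only by its size, and a `max_submissions` cap is added so that the buggy variant terminates in a test instead of hanging.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>

// Hypothetical stand-in for a read request; not RocksDB's request type.
struct ReadRequest {
  std::size_t offset;    // starting offset in the file
  std::size_t len;       // bytes requested (may extend past EOF)
  std::size_t finished;  // bytes read so far
};

// Simulated read: returns how many bytes are available at `offset`,
// and 0 once `offset` is at or past the end of the file.
std::size_t SimulatedRead(std::size_t file_size, std::size_t offset,
                          std::size_t len) {
  if (offset >= file_size) return 0;
  return std::min(len, file_size - offset);
}

// Resubmits the request until it is complete. Returns the number of
// submissions made. If `treat_zero_as_eof` is false, a zero-byte read is
// treated as "incomplete" and retried, which never makes progress at EOF;
// `max_submissions` bounds that loop so the sketch stays testable.
std::size_t DrainRequest(ReadRequest& req, std::size_t file_size,
                         std::size_t max_chunk, bool treat_zero_as_eof,
                         std::size_t max_submissions) {
  assert(max_chunk > 0);
  std::size_t submissions = 0;
  while (req.finished < req.len && submissions < max_submissions) {
    ++submissions;
    std::size_t want = std::min(max_chunk, req.len - req.finished);
    std::size_t got =
        SimulatedRead(file_size, req.offset + req.finished, want);
    if (got == 0 && treat_zero_as_eof) break;  // EOF: stop retrying
    req.finished += got;
  }
  return submissions;
}
```

With a request that extends past the end of the file, the variant that treats a zero-byte read as EOF stops after a few submissions, while the other variant keeps resubmitting until it hits the cap, which mirrors the hang described above.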