Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Fix hang in MultiRead with O_DIRECT and io_uring #10368

Closed

Conversation

akankshamahajan15
Copy link
Contributor

@akankshamahajan15 akankshamahajan15 commented Jul 15, 2022

Summary: Fix bug in O_DIRECT and io_uring when its EOF and bytes_read =
0 because of wrong check, it got added into incomplete list and gets stuck in an infinite loop as it will always return bytes_read = 0. The bug was introduced by PR #10197 and that PR is not released yet in any release branch.

Test Plan: Added new unit test

Reviewers:

Subscribers:

Tasks:

Tags:

@facebook-github-bot
Copy link
Contributor

@akankshamahajan15 has imported this pull request. If you are a Meta employee, you can view this diff on Phabricator.

@akankshamahajan15
Copy link
Contributor Author

I am working on adding a unit test also to catch this hang.

@facebook-github-bot
Copy link
Contributor

@akankshamahajan15 has updated the pull request. You must reimport the pull request before landing.

@siying
Copy link
Contributor

siying commented Jul 15, 2022

LGTM. Please add a unit test for this scenario. I'm surprised that it is not covered in EnvPosixTestWithParam.MultiRead where we explicitly generate cases where bytes_read=0. We might need to investigate whether the test doesn't do the job.

@mdcallag
Copy link
Contributor

I am trying to repro the hang with this patch applied. It will take a few hours to get a signal. Otherwise I don't have an opinion on the diff because I don't know this part of RocksDB that well.

@akankshamahajan15
Copy link
Contributor Author

I am trying to repro the hang with this patch applied. It will take a few hours to get a signal. Otherwise I don't have an opinion on the diff because I don't know this part of RocksDB that well.

Thanks for the testing with this patch Mark, I am also trying to reproduce it with unit tests as well(currently this exact case is not covered).

@akankshamahajan15
Copy link
Contributor Author

In our unit tests,

// In this test we don't do aligned read, so it doesn't work for
we don't do direct_reads with nonalignment because of which this scenario was never hit. I will try to see if I can do the alignment reads.

@akankshamahajan15
Copy link
Contributor Author

In other test, sync point is not enabled, because of which partial results were never injected.

ROCKSDB_NAMESPACE::SyncPoint::GetInstance()->DisableProcessing();

@siying
Copy link
Contributor

siying commented Jul 15, 2022

In other test, sync point is not enabled, because of which partial results were never injected.

ROCKSDB_NAMESPACE::SyncPoint::GetInstance()->DisableProcessing();

This test bug was introduced only 8 days ago, after the bug was introduced: #10278
It should definitely be fixed but before the bug was introduced, it didn't catch the bug either.

@facebook-github-bot
Copy link
Contributor

@akankshamahajan15 has updated the pull request. You must reimport the pull request before landing.

@mdcallag
Copy link
Contributor

I was unable to reproduce it in 15 attempts. Will let it run overnight but without the patch I can repro it in 1 or 2 attempts.

Summary: Fix bug in O_DIRECT and io_uring when its EOF and bytes_read =
0 because of wrong check, it gets stuck in an infinite loop.

Test Plan: CircleCI

Reviewers:

Subscribers:

Tasks:

Tags:
@facebook-github-bot
Copy link
Contributor

@akankshamahajan15 has updated the pull request. You must reimport the pull request before landing.

@akankshamahajan15
Copy link
Contributor Author

I am able to reproduce the issue (hang) in the unit test.

Summary:

Test Plan:

Reviewers:

Subscribers:

Tasks:

Tags:
@facebook-github-bot
Copy link
Contributor

@akankshamahajan15 has updated the pull request. You must reimport the pull request before landing.

1 similar comment
@facebook-github-bot
Copy link
Contributor

@akankshamahajan15 has updated the pull request. You must reimport the pull request before landing.

@facebook-github-bot
Copy link
Contributor

@akankshamahajan15 has updated the pull request. You must reimport the pull request before landing.

@facebook-github-bot
Copy link
Contributor

@akankshamahajan15 has imported this pull request. If you are a Meta employee, you can view this diff on Phabricator.

Summary:

Test Plan:

Reviewers:

Subscribers:

Tasks:

Tags:
@facebook-github-bot
Copy link
Contributor

@akankshamahajan15 has updated the pull request. You must reimport the pull request before landing.

@facebook-github-bot
Copy link
Contributor

@akankshamahajan15 has imported this pull request. If you are a Meta employee, you can view this diff on Phabricator.

Summary:

Test Plan:

Reviewers:

Subscribers:

Tasks:

Tags:
@facebook-github-bot
Copy link
Contributor

@akankshamahajan15 has updated the pull request. You must reimport the pull request before landing.

@akankshamahajan15
Copy link
Contributor Author

@siying Do you know which platforms, direct_io is supported and which not? I am getting error

 env/env_test.cc:1538
env_->NewWritableFile(fname, &wfile, soptions)
IO error: While open a file for appending: /dev/shm/rocksdb.7OId/testfile_1370631_5849386736499158703: Invalid argument

whereas it works fine on my devserver.

Summary:

Test Plan:

Reviewers:

Subscribers:

Tasks:

Tags:
@facebook-github-bot
Copy link
Contributor

@akankshamahajan15 has updated the pull request. You must reimport the pull request before landing.

@siying
Copy link
Contributor

siying commented Jul 18, 2022

@siying Do you know which platforms, direct_io is supported and which not? I am getting error

 env/env_test.cc:1538
env_->NewWritableFile(fname, &wfile, soptions)
IO error: While open a file for appending: /dev/shm/rocksdb.7OId/testfile_1370631_5849386736499158703: Invalid argument

whereas it works fine on my devserver.

This is ramfs, so direct I/O is not supported. I think the best way of testing direct I/O scenario is to emulate the behavior. I think many tests in env_test do to test Direct I/O is to turn off direct I/O when creating the fd, and rely on the alignment assertion in the code to validate some behavior like here:

rocksdb/env/env_test.cc

Lines 1836 to 1844 in 9620653

#if !defined(OS_MACOSX) && !defined(OS_WIN) && !defined(OS_SOLARIS) && !defined(OS_AIX) && !defined(OS_OPENBSD) && !defined(OS_FREEBSD)
if (soptions.use_direct_writes) {
ROCKSDB_NAMESPACE::SyncPoint::GetInstance()->SetCallBack(
"NewWritableFile:O_DIRECT", [&](void* arg) {
int* val = static_cast<int*>(arg);
*val &= ~O_DIRECT;
});
}
#endif
This may or may not work in this case though.

@siying
Copy link
Contributor

siying commented Jul 18, 2022

Is it true that the bug would happen in non-Direct/IO case too? I feel that if we try to read beyond the file size, we will fall into this infinite look, no matter it is direct I/O. It's just that unless it is direct I/O, we never read beyond the file size. If that is the case, can we also try to reproduce it too?

Summary:

Test Plan:

Reviewers:

Subscribers:

Tasks:

Tags:
@facebook-github-bot
Copy link
Contributor

@akankshamahajan15 has updated the pull request. You must reimport the pull request before landing.

env/env_test.cc Outdated Show resolved Hide resolved
env/env_test.cc Outdated Show resolved Hide resolved
env/env_test.cc Outdated Show resolved Hide resolved
@facebook-github-bot
Copy link
Contributor

@akankshamahajan15 has updated the pull request. You must reimport the pull request before landing.

env/env_test.cc Outdated Show resolved Hide resolved
env/env_test.cc Outdated Show resolved Hide resolved
@@ -1522,6 +1522,71 @@ TEST_F(EnvPosixTest, MultiReadNonAlignedLargeNum) {
}
}

TEST_F(EnvPosixTest, MultiReadDirectIONonAlignedLargeNum) {
EnvOptions soptions;
soptions.use_direct_reads = true;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I suspect that non-direct case also has the bug and either way we should add the coverage.

Copy link
Contributor Author

@akankshamahajan15 akankshamahajan15 Jul 18, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The issue was with direct_io and in case of non_direct_io we do the read again and update the results there without adding to incomplete list in case bytes_reads are 0. We do have existing test for that MultiReadNonAlignedLargeNum

Summary:

Test Plan:

Reviewers:

Subscribers:

Tasks:

Tags:
@facebook-github-bot
Copy link
Contributor

@akankshamahajan15 has updated the pull request. You must reimport the pull request before landing.

Summary:

Test Plan:

Reviewers:

Subscribers:

Tasks:

Tags:
@facebook-github-bot
Copy link
Contributor

@akankshamahajan15 has updated the pull request. You must reimport the pull request before landing.

@facebook-github-bot
Copy link
Contributor

@akankshamahajan15 has imported this pull request. If you are a Meta employee, you can view this diff on Phabricator.


// Validate results
for (size_t i = 0; i < num_reads; ++i) {
ASSERT_OK(reqs[i].status);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We could validate result too.

siying pushed a commit that referenced this pull request Jul 18, 2022
Summary:
Fix bug in O_DIRECT and io_uring when its EOF and bytes_read =
0 because of wrong check, it got added into incomplete list and gets stuck in an infinite loop as it will always return bytes_read = 0. The bug was introduced by PR #10197 and that PR is not released yet in any release branch.

Pull Request resolved: #10368

Test Plan: Added new unit test

Reviewed By: siying

Differential Revision: D37885184

Pulled By: akankshamahajan15

fbshipit-source-id: 35b36a44b696d29b2f6f25301aa1b19547b4e03b
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants