
Revert "Issue #1791: Read Submission should bypass OSE Threads" #3106

Closed
eolivelli wants to merge 1 commit

Conversation

eolivelli
Contributor

This reverts commit 6b99ff7.

Description of the changes in this PR:
Revert #1792 "Issue #1791: Read Submission should bypass OSE Threads"

Motivation

See #3104

@eolivelli
Contributor Author

@diegosalvi this change is also important for HerdDB and DistributedLog

@lhotari
Member

lhotari commented Mar 14, 2022

There's #3105 as a possible solution for mitigating the performance issue that was the original motivation for the PR being reverted here.

@diegosalvi
Contributor

@eolivelli interesting, thank you. We'll test disabling the Netty recycler, as suggested in apache/pulsar#14436, for now.
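For reference, a minimal sketch of disabling the Netty recycler via the io.netty.recycler.maxCapacityPerThread system property; this is only an assumption about one common approach, and the exact setting recommended in the Pulsar issue may differ:

```java
// Sketch only: setting io.netty.recycler.maxCapacityPerThread to 0 disables
// Netty's object recycler. The property must be set before any Netty Recycler
// class is initialized, so in practice it is usually passed as a JVM flag:
//   -Dio.netty.recycler.maxCapacityPerThread=0
public class DisableNettyRecycler {
    public static void main(String[] args) {
        System.setProperty("io.netty.recycler.maxCapacityPerThread", "0");
        // ... start the BookKeeper/Pulsar client only after this point
    }
}
```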

@dlg99
Contributor

dlg99 commented Mar 14, 2022

@Ghatage FYI

Contributor

@dlg99 left a comment


@congbobo184 @lhotari As I understand it, there is no confirmation that reverting this PR actually fixes the problem reported in Pulsar.

If this problem is related to calling Netty's recycle from another thread (a class of bug Netty has had and fixed before), I'd love to see something that actually proves the theory so that Netty can fix it.

I don't see anything in PendingReadOp that requires initiate() and readEntryComplete() to run on the same thread for a read-only ledger. I may be wrong; please help me understand what I am missing. readEntryComplete() still runs on the OSE.

What you see in the log mentioned in the comment might happen if (see the sketch after this list):

  • the request took longer than the speculative retry time
  • a speculative read got submitted
  • both the original request and the speculative retry succeeded
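A minimal sketch of that race, using a hypothetical readFrom(bookie) helper rather than the actual PendingReadOp code:

```java
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;

public class SpeculativeReadSketch {
    static final ScheduledExecutorService SCHEDULER =
            Executors.newSingleThreadScheduledExecutor();

    // Fire the original read, then a speculative read if the first has not
    // completed within speculativeTimeoutMs; the first completion wins.
    static CompletableFuture<byte[]> readWithSpeculation(int firstBookie,
                                                         int secondBookie,
                                                         long speculativeTimeoutMs) {
        CompletableFuture<byte[]> result = new CompletableFuture<>();

        readFrom(firstBookie).thenAccept(result::complete);

        SCHEDULER.schedule(() -> {
            if (!result.isDone()) {
                // Both requests may still succeed; only the first completion
                // is observed through the result future.
                readFrom(secondBookie).thenAccept(result::complete);
            }
        }, speculativeTimeoutMs, TimeUnit.MILLISECONDS);

        return result;
    }

    // Hypothetical stand-in for an asynchronous read from one bookie.
    static CompletableFuture<byte[]> readFrom(int bookieIndex) {
        return CompletableFuture.supplyAsync(() -> new byte[0]);
    }
}
```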

It is possible that handling of this case somehow corrupts a recyclable entry, though I don't yet see where that could happen (callbacks should be handled on the same OSE thread, etc.).

I need more proof to confirm this actually fixes the problem, rather than just reverting a PR that touches the code you suspect.
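For context, a minimal sketch of the allocate-on-one-thread, recycle-on-another pattern in question; the class and field names are illustrative only, not the actual PendingReadOp or LedgerEntry code:

```java
import io.netty.util.Recycler;

public class RecycleAcrossThreads {

    static final class PooledEntry {
        private static final Recycler<PooledEntry> RECYCLER = new Recycler<PooledEntry>() {
            @Override
            protected PooledEntry newObject(Handle<PooledEntry> handle) {
                return new PooledEntry(handle);
            }
        };

        private final Recycler.Handle<PooledEntry> handle;
        long entryId;

        private PooledEntry(Recycler.Handle<PooledEntry> handle) {
            this.handle = handle;
        }

        static PooledEntry get(long entryId) {
            PooledEntry e = RECYCLER.get();
            e.entryId = entryId;
            return e;
        }

        void recycle() {
            entryId = -1;
            handle.recycle(this); // may run on a different thread than get()
        }
    }

    public static void main(String[] args) throws Exception {
        PooledEntry entry = PooledEntry.get(42L);  // obtained on the main thread
        Thread other = new Thread(entry::recycle); // recycled on another thread
        other.start();
        other.join();
    }
}
```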

@jvrao
Contributor

jvrao commented Mar 14, 2022

+1 on @dlg99's comment. I could not understand where exactly the problem is, or how it would be addressed by making create and recycle run on the same thread. You could always have outstanding read responses after erroring out. But even if that happens, what exactly is the error/corruption that is visible to the client?

@congbobo184
Contributor

#3110 will fix it

@lhotari
Member

lhotari commented Mar 15, 2022

You could always have outstanding read responses after erroring out. But even if that happens, what exactly the error/corruption that is visible to the client?

@jvrao The issue description is in #3104, and it references a Pulsar issue with a lot of troubleshooting comments. @congbobo184 has also summarized the behavior in the description of PR #3110.

@eolivelli
Contributor Author

Closing for now

@eolivelli eolivelli closed this Mar 15, 2022
Contributor

@hangc0276 left a comment


The root cause of #3104 is not a thread-safety problem. Also, isolating the thread pools for cold and hot data reads greatly reduces thread-switching overhead and makes throughput more stable.
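A minimal sketch of that kind of isolation, with hypothetical pool names and sizes rather than BookKeeper's actual configuration:

```java
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;

// Illustrative only: route "hot" (likely cached) and "cold" (likely disk-bound)
// reads to separate executors so slow cold reads do not stall hot ones.
public class ReadIsolationSketch {
    private final ExecutorService hotReadPool = Executors.newFixedThreadPool(4);
    private final ExecutorService coldReadPool = Executors.newFixedThreadPool(4);

    void submitRead(boolean likelyInCache, Runnable readTask) {
        (likelyInCache ? hotReadPool : coldReadPool).execute(readTask);
    }
}
```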

This issue has been fixed by #3110
