Fix the read performance issue in the offload readAsync #12443

Merged
merged 2 commits into apache:master
Oct 22, 2021

Conversation

zymap
Member

@zymap zymap commented Oct 21, 2021


Motivation

In #12123, I added a seek operation to the readAsync method.
It makes sure the data stream always seeks to the position of the first
entry to read, so the read does not hit an EOF exception.
However, the offload index groups a set of entries into a range, so the
seek operation moves the position to the first entry in that range.
This introduces a performance issue, because every read operation then
reads from the first entry in the range until it finds the actual first
entry to read.
Removing the seek operation, however, would bring back the EOF exception
in the readAsync method. This PR limits when the seek operation is performed.
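
To make the cost concrete, here is a minimal sketch of how much work an unconditional seek to the start of an index range can waste. It is not the Pulsar code: the class name, the `TreeMap` used as a stand-in for the offload index, the range size of 1000 entries, and the batch size of 100 are all assumptions chosen for illustration.

```java
import java.util.TreeMap;

class RangeSeekCostSketch {
    /**
     * Entries that must be read and discarded before reaching firstEntry
     * when the reader always seeks to the start of the index range.
     * The index maps "first entry id of a range" to "data offset" (hypothetical layout).
     */
    static long entriesSkipped(TreeMap<Long, Long> index, long firstEntry) {
        long rangeStart = index.floorKey(firstEntry); // range that contains firstEntry
        return firstEntry - rangeStart;
    }

    /** Total discarded entries when reading [0, totalEntries) sequentially in batches. */
    static long totalSkipped(TreeMap<Long, Long> index, long totalEntries, long batchSize) {
        long skipped = 0;
        for (long first = 0; first < totalEntries; first += batchSize) {
            skipped += entriesSkipped(index, first);
        }
        return skipped;
    }

    public static void main(String[] args) {
        TreeMap<Long, Long> index = new TreeMap<>();
        index.put(0L, 0L);          // range covering entries 0..999
        index.put(1000L, 1_000_000L); // range covering entries 1000..1999 (offset is made up)

        System.out.println("skipped for a read starting at entry 1500: "
                + entriesSkipped(index, 1500));
        System.out.println("skipped over 2000 entries read in batches of 100: "
                + totalSkipped(index, 2000, 100));
    }
}
```

With these assumed numbers, a sequential reader discards several times more entries than it returns, which is the overhead the description refers to.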

Modifications

Add an available method to the backedInputStream to find out how many
bytes can still be read from the stream.
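
The idea can be sketched as follows, assuming each stored entry is framed as a 4-byte length, an 8-byte entry id, and then the payload. This is an illustration only, not the code in this PR: `SeekableBytes`, `readNextEntryId`, `fallbackOffset`, and the 12-byte header check are hypothetical names and values. The reader seeks only when `available()` reports too few bytes to read the next entry header; otherwise it keeps reading from the current position.

```java
import java.nio.ByteBuffer;

class ConditionalSeekSketch {
    // Assumed framing: 4-byte payload length + 8-byte entry id.
    static final int ENTRY_HEADER_BYTES = Integer.BYTES + Long.BYTES;

    /** Hypothetical stand-in for a seekable backed input stream. */
    static final class SeekableBytes {
        private final ByteBuffer data;
        int seekCount; // how many times seek() was called, for demonstration

        SeekableBytes(byte[] bytes) { this.data = ByteBuffer.wrap(bytes); }

        void seek(int position) { data.position(position); seekCount++; }
        int available() { return data.remaining(); } // bytes left to read
        int readInt() { return data.getInt(); }
        long readLong() { return data.getLong(); }
        void skip(int n) { data.position(data.position() + n); }
    }

    /** Builds `count` entries with ids 0..count-1 and zero-filled payloads. */
    static byte[] encodeEntries(int count, int payloadSize) {
        ByteBuffer out = ByteBuffer.allocate(count * (ENTRY_HEADER_BYTES + payloadSize));
        for (long id = 0; id < count; id++) {
            out.putInt(payloadSize).putLong(id).put(new byte[payloadSize]);
        }
        return out.array();
    }

    /**
     * Reads one entry and returns its id. Seeks to fallbackOffset only when the
     * stream cannot supply a full header, instead of seeking on every read.
     */
    static long readNextEntryId(SeekableBytes stream, int fallbackOffset) {
        if (stream.available() < ENTRY_HEADER_BYTES) {
            stream.seek(fallbackOffset);
        }
        int length = stream.readInt();
        long entryId = stream.readLong();
        stream.skip(length); // payload is ignored in this sketch
        return entryId;
    }

    public static void main(String[] args) {
        SeekableBytes stream = new SeekableBytes(encodeEntries(3, 5));
        for (int i = 0; i < 4; i++) {
            long id = readNextEntryId(stream, 0);
            System.out.println("entry " + id + ", seeks so far: " + stream.seekCount);
        }
    }
}
```

In this sketch the first three reads proceed without seeking, and only the fourth read, issued after the stream is exhausted, triggers a seek back to the fallback offset.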

Verifying this change

  • Make sure that the change passes the CI checks.

@zymap zymap added this to the 2.10.0 milestone Oct 21, 2021
@zymap zymap self-assigned this Oct 21, 2021
@zymap zymap added the doc-not-needed Your PR changes do not impact docs label Oct 21, 2021
@eolivelli
Contributor

@zymap: Thanks for your contribution. For this PR, do we need to update docs?
(The PR template contains info about docs, which helps others know more about the changes. Can you provide doc-related info in this and future PR descriptions? Thanks)

@eolivelli
Contributor

@zymap: Thanks for providing doc info!

@codelipenghui codelipenghui merged commit b4d05ac into apache:master Oct 22, 2021
zymap added a commit that referenced this pull request Oct 22, 2021
(cherry picked from commit b4d05ac)
@zymap zymap added the cherry-picked/branch-2.8 Archived: 2.8 is end of life label Oct 22, 2021
eolivelli pushed a commit to eolivelli/pulsar that referenced this pull request Nov 29, 2021
codelipenghui pushed a commit that referenced this pull request Dec 20, 2021
(cherry picked from commit b4d05ac)
@codelipenghui codelipenghui added the cherry-picked/branch-2.9 Archived: 2.9 is end of life label Dec 20, 2021
lhotari pushed a commit to datastax/pulsar that referenced this pull request Apr 11, 2022
(cherry picked from commit b4d05ac)
lhotari pushed a commit to datastax/pulsar that referenced this pull request Apr 11, 2022
(cherry picked from commit b4d05ac)
Labels
area/tieredstorage cherry-picked/branch-2.8 Archived: 2.8 is end of life cherry-picked/branch-2.9 Archived: 2.9 is end of life doc-not-needed Your PR changes do not impact docs release/2.8.2 release/2.9.2
4 participants