Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[Java] ListVector.setNull doesn't update lastSet - performance issue #40796

Closed
jarohen opened this issue Mar 26, 2024 · 5 comments
Closed

[Java] ListVector.setNull doesn't update lastSet - performance issue #40796

jarohen opened this issue Mar 26, 2024 · 5 comments

Comments

@jarohen
Copy link
Contributor

jarohen commented Mar 26, 2024

Describe the enhancement requested

In ListVector, setNull doesn't update lastSet. This means that, if you set many null values in a row, the offset buffer is unnecessarily re-set for the intervening values - e.g.:

  • If I .setNull(3) with lastSet == 2, I set offset 3
  • If I then .setNull(4), .setNull(5), .setNull(6), these set 3->4, 3->5, and 3->6 respectively - i.e. O(n²).
  • With large and sparse enough vectors, this adds significant time to our internal benchmarks.

Naively, I'd put lastSet = index in setNull in the same way as startNewValue, but aware there might be an implicit/nuanced reason for not doing so?

Cheers,

James

Component(s)

Java

@vibhatha
Copy link
Collaborator

@jarohen Thanks for raising this issue. I haven't evaluated this on my own yet, but just curious, planning on making a PR for this?

@jarohen
Copy link
Contributor Author

jarohen commented Mar 27, 2024

Hey @vibhatha 👋 I certainly could, although I'm a little concerned that our fix misses a nuance of how lastSet is intended to be used. If someone with knowledge of the context could confirm we haven't missed anything, yep - more than happy to submit a PR.

@jarohen
Copy link
Contributor Author

jarohen commented Mar 27, 2024

@vibhatha FYI PR opened - #40810

@vibhatha
Copy link
Collaborator

Hey @vibhatha 👋 I certainly could, although I'm a little concerned that our fix misses a nuance of how lastSet is intended to be used. If someone with knowledge of the context could confirm we haven't missed anything, yep - more than happy to submit a PR.

I totally understand. Let's see if we can review the PR and make improvements. Thanks for working on this.

jarohen added a commit to xtdb/arrow that referenced this issue Mar 27, 2024
lidavidm pushed a commit that referenced this issue Mar 27, 2024
… in ListVectors with lots of nulls (#40810)

Would benefit from someone with knowledge of the context double-checking this doesn't have nuances I'm not aware of - particularly, there's a comment on the field: `the maximum index that is actually set` which one _could_ read to mean 'excluding nulls'?

### Are these changes tested?

Yes

### Are there any user-facing changes?

No
* GitHub Issue: #40796

Authored-by: James Henderson <james@jarohen.dev>
Signed-off-by: David Li <li.davidm96@gmail.com>
@lidavidm
Copy link
Member

Issue resolved by pull request 40810
#40810

@lidavidm lidavidm added this to the 16.0.0 milestone Mar 27, 2024
vibhatha pushed a commit to vibhatha/arrow that referenced this issue May 25, 2024
… O(n²) in ListVectors with lots of nulls (apache#40810)

Would benefit from someone with knowledge of the context double-checking this doesn't have nuances I'm not aware of - particularly, there's a comment on the field: `the maximum index that is actually set` which one _could_ read to mean 'excluding nulls'?

### Are these changes tested?

Yes

### Are there any user-facing changes?

No
* GitHub Issue: apache#40796

Authored-by: James Henderson <james@jarohen.dev>
Signed-off-by: David Li <li.davidm96@gmail.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

No branches or pull requests

3 participants