Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[C++] Create a fetch node based on a batch index property #34059

Closed
westonpace opened this issue Feb 6, 2023 · 0 comments · Fixed by #34060
Closed

[C++] Create a fetch node based on a batch index property #34059

westonpace opened this issue Feb 6, 2023 · 0 comments · Fixed by #34060

Comments

@westonpace
Copy link
Member

Describe the enhancement requested

This node will eventually be usable mid-plan allowing us to get rid of the concept of a TopKSink and also to add Substrait consumer support for fetch.

This feature introduces the idea of an "ExecBatch::index" which is described more fully in https://docs.google.com/document/d/1MfVE9td9D4n5y-PTn66kk4-9xG7feXs1zSFf-qxQgPs/edit

This feature is part of a larger set of features and this node will not be very useful on its own until some of those features are added (currently, all nodes except table_source will either fail to set the index or will reset it back to -1 and the fetch node will fail). See #32991 for the larger roadmap.

Component(s)

C++

westonpace added a commit that referenced this issue Feb 11, 2023
This PR introduces the concept of `ExecBatch:index` but does not yet do much with it.  As a proof of concept this PR adds a fetch node which can be inserted anywhere in the plan (not just at the sink) to satisfy `LIMIT x OFFSET y` (Substrait calls this fetch and so I have also).

This PR also introduces two sequencing accumulation queues which will be useful, I hope, for anyone implementing nodes that rely on ordered execution.

This PR unfortunately introduces a new query option which is whether or not the sink node should pay the small performance hit required to sequence output.  While considering how best to add this option I realized we will probably have more query options in the near future regarding "how much RAM to use" (e.g. spillover) and potentially more beyond that.

So I have taken all the options and put them into `arrow::compute::QueryOptions` (this already existed but it was not user facing and I added more things to it).  I added a new DeclarationToXyz overload that accepts QueryOptions.  This has, unfortunately, led to a bit of overload explosion but I think this should be the last new addition to the overload set (and we can deprecate the older overloads at some point).

This PR also includes a new `gen::Gen / gen::TestGen` facility for generating test tables for input.  I'd like to eventually use this to simplify some of the existing exec plan tests as well.  I'm willing to split this into a separate PR if that makes sense.
* Closes: #34059

Authored-by: Weston Pace <weston.pace@gmail.com>
Signed-off-by: Weston Pace <weston.pace@gmail.com>
@westonpace westonpace added this to the 12.0.0 milestone Feb 11, 2023
gringasalpastor pushed a commit to gringasalpastor/arrow that referenced this issue Feb 17, 2023
…#34060)

This PR introduces the concept of `ExecBatch:index` but does not yet do much with it.  As a proof of concept this PR adds a fetch node which can be inserted anywhere in the plan (not just at the sink) to satisfy `LIMIT x OFFSET y` (Substrait calls this fetch and so I have also).

This PR also introduces two sequencing accumulation queues which will be useful, I hope, for anyone implementing nodes that rely on ordered execution.

This PR unfortunately introduces a new query option which is whether or not the sink node should pay the small performance hit required to sequence output.  While considering how best to add this option I realized we will probably have more query options in the near future regarding "how much RAM to use" (e.g. spillover) and potentially more beyond that.

So I have taken all the options and put them into `arrow::compute::QueryOptions` (this already existed but it was not user facing and I added more things to it).  I added a new DeclarationToXyz overload that accepts QueryOptions.  This has, unfortunately, led to a bit of overload explosion but I think this should be the last new addition to the overload set (and we can deprecate the older overloads at some point).

This PR also includes a new `gen::Gen / gen::TestGen` facility for generating test tables for input.  I'd like to eventually use this to simplify some of the existing exec plan tests as well.  I'm willing to split this into a separate PR if that makes sense.
* Closes: apache#34059

Authored-by: Weston Pace <weston.pace@gmail.com>
Signed-off-by: Weston Pace <weston.pace@gmail.com>
fatemehp pushed a commit to fatemehp/arrow that referenced this issue Feb 24, 2023
…#34060)

This PR introduces the concept of `ExecBatch:index` but does not yet do much with it.  As a proof of concept this PR adds a fetch node which can be inserted anywhere in the plan (not just at the sink) to satisfy `LIMIT x OFFSET y` (Substrait calls this fetch and so I have also).

This PR also introduces two sequencing accumulation queues which will be useful, I hope, for anyone implementing nodes that rely on ordered execution.

This PR unfortunately introduces a new query option which is whether or not the sink node should pay the small performance hit required to sequence output.  While considering how best to add this option I realized we will probably have more query options in the near future regarding "how much RAM to use" (e.g. spillover) and potentially more beyond that.

So I have taken all the options and put them into `arrow::compute::QueryOptions` (this already existed but it was not user facing and I added more things to it).  I added a new DeclarationToXyz overload that accepts QueryOptions.  This has, unfortunately, led to a bit of overload explosion but I think this should be the last new addition to the overload set (and we can deprecate the older overloads at some point).

This PR also includes a new `gen::Gen / gen::TestGen` facility for generating test tables for input.  I'd like to eventually use this to simplify some of the existing exec plan tests as well.  I'm willing to split this into a separate PR if that makes sense.
* Closes: apache#34059

Authored-by: Weston Pace <weston.pace@gmail.com>
Signed-off-by: Weston Pace <weston.pace@gmail.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging a pull request may close this issue.

1 participant