[C++] The new scan node should use values from fragment guarantees instead of loading them from disk #15059

westonpace · 2022-12-21T15:11:52Z

Describe the enhancement requested

The main reason to do this is that the columns will not always be on disk (right now the new scan node fails in this case). Skipping the load of these columns is also a performance enhancement. The solution will, I suspect, also lay the groundwork for supporting the augmented columns (filename, batch index, file index).
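To make the idea concrete, here is a minimal sketch of the column split described above. It does not use the Arrow API: `Guarantees`, `ColumnPlan`, and `PlanColumns` are hypothetical names invented for illustration, and guarantees are assumed to be simple integer equalities keyed by column name.

```cpp
#include <map>
#include <string>
#include <vector>

// Hypothetical stand-in for a fragment's equality guarantees:
// column name -> the single value that column is known to hold.
using Guarantees = std::map<std::string, long long>;

// Result of splitting the requested columns into two groups.
struct ColumnPlan {
  std::vector<std::string> from_disk;       // must be read from the file
  std::vector<std::string> from_guarantee;  // value already known, no read needed
};

// Decide, per requested column, whether the value can come from a guarantee.
ColumnPlan PlanColumns(const std::vector<std::string>& requested,
                       const Guarantees& guarantees) {
  ColumnPlan plan;
  for (const std::string& name : requested) {
    if (guarantees.count(name) > 0) {
      plan.from_guarantee.push_back(name);
    } else {
      plan.from_disk.push_back(name);
    }
  }
  return plan;
}
```

With requested columns `x` and `y` and a guarantee on `x`, only `y` ends up in `from_disk`, so a file that has no `x` column never has to be asked for one.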

Component(s)

C++

…stead of fragment (#15129)

If a fragment has a guarantee like `x == 5` then we don't need to load the column `x` from disk and can instead just use the scalar `5`. This is not just a performance improvement. In many cases, users will create partitioned datasets without actually storing the partition value as a separate column (e.g. the file `my_dataset/x=5/foo.parquet` will not have a column named `x`).

* Closes: #15059

Authored-by: Weston Pace <weston.pace@gmail.com>
Signed-off-by: Weston Pace <weston.pace@gmail.com>
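The `my_dataset/x=5/foo.parquet` case can be sketched in plain C++ as follows. This is an illustration only, not the code from the PR: `PartitionValues`, `ParseHivePath`, and `TryFillFromGuarantee` are hypothetical helpers, the path is assumed to use `key=value` directory segments, and the partition value is assumed to be an integer.

```cpp
#include <cstddef>
#include <cstdint>
#include <map>
#include <sstream>
#include <string>
#include <vector>

// Hypothetical: partition values recovered from a path, kept as raw strings.
using PartitionValues = std::map<std::string, std::string>;

// Collect key=value pairs from every directory segment of the path.
// The final segment is treated as the file name and is skipped.
PartitionValues ParseHivePath(const std::string& path) {
  std::vector<std::string> segments;
  std::stringstream stream(path);
  std::string segment;
  while (std::getline(stream, segment, '/')) {
    segments.push_back(segment);
  }
  PartitionValues values;
  for (std::size_t i = 0; i + 1 < segments.size(); ++i) {
    const std::size_t eq = segments[i].find('=');
    if (eq != std::string::npos && eq > 0) {
      values[segments[i].substr(0, eq)] = segments[i].substr(eq + 1);
    }
  }
  return values;
}

// If `column` has a known value, fill `out` with `num_rows` copies of it and
// return true. Return false when the column has to be loaded from the file.
// Assumes the value parses as an integer (std::stoll throws otherwise).
bool TryFillFromGuarantee(const PartitionValues& values,
                          const std::string& column, std::size_t num_rows,
                          std::vector<std::int64_t>* out) {
  const PartitionValues::const_iterator it = values.find(column);
  if (it == values.end()) {
    return false;
  }
  out->assign(num_rows, static_cast<std::int64_t>(std::stoll(it->second)));
  return true;
}
```

Calling `ParseHivePath("my_dataset/x=5/foo.parquet")` yields a single entry for `x`, and `TryFillFromGuarantee` then produces the column `x` for a batch without touching the file. A column with no entry, such as `y`, returns `false` and would still be read from disk.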

westonpace added Type: enhancement Component: C++ labels Dec 21, 2022

westonpace added a commit to westonpace/arrow that referenced this issue Dec 26, 2022

apacheGH-15059 the scan node will now populate guarantees as columns …

511daab

…instead of loading the data from the fragment

westonpace added a commit to westonpace/arrow that referenced this issue Dec 30, 2022

apacheGH-15059 the scan node will now populate guarantees as columns …

0722d10

…instead of loading the data from the fragment

github-actions bot mentioned this issue Dec 30, 2022

GH-15059: [C++][Acero] populate guarantee columns from expression instead of fragment #15129

Merged

github-actions bot assigned westonpace Dec 30, 2022

westonpace added a commit to westonpace/arrow that referenced this issue Feb 22, 2023

apacheGH-15059 the scan node will now populate guarantees as columns …

48588c1

…instead of loading the data from the fragment

westonpace added a commit to westonpace/arrow that referenced this issue Feb 23, 2023

apacheGH-15059 the scan node will now populate guarantees as columns …

76ece33

…instead of loading the data from the fragment

westonpace closed this as completed in #15129 Feb 25, 2023

westonpace added this to the 12.0.0 milestone Feb 25, 2023

westonpace mentioned this issue Feb 25, 2023

[C++] Add an end-to-end fuzz test for the new scan node #34347

Open
