Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

ARROW-9749: [C++][GLib][Python][R][Ruby][Dataset] Introduce FragmentScanOptions, consolidate ScanContext/ScanOptions #9686

Closed
wants to merge 4 commits into from

Conversation

lidavidm
Copy link
Member

@lidavidm lidavidm commented Mar 12, 2021

  • ScanContext/ScanOptions have been merged, since they were essentially always passed together.
  • For scan options that are specific to a scan (e.g. CSV conversion options), a new FragmentScanOptions has been added to ScanOptions. Currently only CSV has this and it only wraps csv::ConvertOptions; follow up issues can tackle the rest.
  • GLib, Python, R, and Ruby bindings have been updated.

@github-actions
Copy link

@lidavidm lidavidm changed the title ARROW-9749: [C++][GLib][Python][R][Ruby][Dataset] WIP: Introduce FragmentScanOptions, consolidate ScanContext/ScanOptions ARROW-9749: [C++][GLib][Python][R][Ruby][Dataset] Introduce FragmentScanOptions, consolidate ScanContext/ScanOptions Mar 12, 2021
@lidavidm
Copy link
Member Author

As it stands implementing fragment scan options for a union dataset is a bit tricky, because the fragment doesn't know which dataset it originally belonged to (and it would be weird for every scan task implementation to have to know about a hypothetical UnionFragmentScanOptions). We could add a FragmentScanOptions field to Fragment, allowing UnionDataset::GetFragmentsImpl to do the work of matching up options to fragments there. (That'd also let an advanced caller manually specify per-fragment options if they really wanted to.)

However given we might just delete UnionDataset I'm going to leave this out of this PR for now.

@lidavidm lidavidm marked this pull request as ready for review March 15, 2021 13:10
@lidavidm lidavidm requested a review from bkietz March 15, 2021 13:10
@lidavidm
Copy link
Member Author

@mrkn would you be willing to look at the GLib/Datasets bindings changes here again? This reduces the API surface that has to be bound by a bit, fortunately.

@mrkn
Copy link
Member

mrkn commented Mar 15, 2021

@lidavidm I will look at it tomorrow.

Copy link
Member

@bkietz bkietz left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for this cleanup, just a couple comments:

cpp/src/jni/dataset/jni_wrapper.cc Show resolved Hide resolved
cpp/src/arrow/dataset/scanner.h Show resolved Hide resolved
@bkietz
Copy link
Member

bkietz commented Mar 15, 2021

Could you add a follow up JIRAs for extracting ParquetFragmentScanOptions and IpcFragmentScanOptions?

@lidavidm
Copy link
Member Author

Could you add a follow up JIRAs for extracting ParquetFragmentScanOptions and IpcFragmentScanOptions?

Done, ARROW-11972.

@mrkn
Copy link
Member

mrkn commented Mar 16, 2021

+1 for GLib part.

@mrkn
Copy link
Member

mrkn commented Mar 16, 2021

I created a new issue to support CsvFragmentScanOption in GLib binding.

Copy link
Member

@bkietz bkietz left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM, merging

@bkietz bkietz closed this in 6789465 Mar 16, 2021
GeorgeAp pushed a commit to sirensolutions/arrow that referenced this pull request Jun 7, 2021
…canOptions, consolidate ScanContext/ScanOptions

- ScanContext/ScanOptions have been merged, since they were essentially always passed together.
- For scan options that are specific to a scan (e.g. CSV conversion options), a new FragmentScanOptions has been added to ScanOptions. Currently only CSV has this and it only wraps csv::ConvertOptions; follow up issues can tackle the rest.
- GLib, Python, R, and Ruby bindings have been updated.

Closes apache#9686 from lidavidm/arrow-9749

Authored-by: David Li <li.davidm96@gmail.com>
Signed-off-by: Benjamin Kietzman <bengilgit@gmail.com>
michalursa pushed a commit to michalursa/arrow that referenced this pull request Jun 13, 2021
…canOptions, consolidate ScanContext/ScanOptions

- ScanContext/ScanOptions have been merged, since they were essentially always passed together.
- For scan options that are specific to a scan (e.g. CSV conversion options), a new FragmentScanOptions has been added to ScanOptions. Currently only CSV has this and it only wraps csv::ConvertOptions; follow up issues can tackle the rest.
- GLib, Python, R, and Ruby bindings have been updated.

Closes apache#9686 from lidavidm/arrow-9749

Authored-by: David Li <li.davidm96@gmail.com>
Signed-off-by: Benjamin Kietzman <bengilgit@gmail.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants