Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

ARROW-12355: [C++] Implement efficient async CSV scanning #10103

Conversation

westonpace
Copy link
Member

@westonpace westonpace commented Apr 19, 2021

A tiny bit of cleanup and combines ARROW-12392 and ARROW-12289 to create a true async CSV format.

Keeping in draft until ARROW-12392, ARROW-12289, and ARROW-12386 are resolved.

@github-actions
Copy link

@westonpace westonpace force-pushed the feature/ARROW-12355--c-implement-efficient-async-csv-scanning branch 2 times, most recently from 4f0b2e0 to 25338ec Compare April 21, 2021 20:07
auto source = file->source();
auto reader_fut =
OpenReaderAsync(source, *this, scan_options, internal::GetCpuThreadPool());
return GeneratorFromReader(std::move(reader_fut));
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should we go ahead and add readahead here?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The CSV reader does not support parallel readahead yet. Serial readahead might be redundant since the background generator is pretty much already doing that. I suppose it would give us a bit of pipeline parallelism (allowing us to filter & project while we parse & decode) but at the cost breaking up some of the processing locality (e.g. right now we always parse X, decode X, filter X, project X). I'll run some experiments and see if it has any noticeable effect in either direction.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Adding serial readahead here did not have any effect on performance.

@westonpace westonpace force-pushed the feature/ARROW-12355--c-implement-efficient-async-csv-scanning branch from fd34825 to b6c2c18 Compare May 6, 2021 02:43
@westonpace westonpace marked this pull request as ready for review May 6, 2021 03:48
ARROW-12355: MakeReadaheadGenerator is needed here for the case where everything is completely synchronous (e.g. cached I/O)
@westonpace westonpace force-pushed the feature/ARROW-12355--c-implement-efficient-async-csv-scanning branch from b6c2c18 to bb75e40 Compare May 6, 2021 03:52
Copy link
Member

@lidavidm lidavidm left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM. Did you get a chance to test the performance impact?

@lidavidm lidavidm closed this in 602a76a May 6, 2021
DileepSrigiri pushed a commit to DileepSrigiri/arrow that referenced this pull request May 6, 2021
A tiny bit of cleanup and combines ARROW-12392 and ARROW-12289 to create a true async CSV format.

~~Keeping in draft until ARROW-12392, ARROW-12289, and ARROW-12386 are resolved.~~

Closes apache#10103 from westonpace/feature/ARROW-12355--c-implement-efficient-async-csv-scanning

Authored-by: Weston Pace <weston.pace@gmail.com>
Signed-off-by: David Li <li.davidm96@gmail.com>
michalursa pushed a commit to michalursa/arrow that referenced this pull request Jun 10, 2021
A tiny bit of cleanup and combines ARROW-12392 and ARROW-12289 to create a true async CSV format.

~~Keeping in draft until ARROW-12392, ARROW-12289, and ARROW-12386 are resolved.~~

Closes apache#10103 from westonpace/feature/ARROW-12355--c-implement-efficient-async-csv-scanning

Authored-by: Weston Pace <weston.pace@gmail.com>
Signed-off-by: David Li <li.davidm96@gmail.com>
michalursa pushed a commit to michalursa/arrow that referenced this pull request Jun 13, 2021
A tiny bit of cleanup and combines ARROW-12392 and ARROW-12289 to create a true async CSV format.

~~Keeping in draft until ARROW-12392, ARROW-12289, and ARROW-12386 are resolved.~~

Closes apache#10103 from westonpace/feature/ARROW-12355--c-implement-efficient-async-csv-scanning

Authored-by: Weston Pace <weston.pace@gmail.com>
Signed-off-by: David Li <li.davidm96@gmail.com>
@westonpace westonpace deleted the feature/ARROW-12355--c-implement-efficient-async-csv-scanning branch January 6, 2022 08:17
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

2 participants