-
Notifications
You must be signed in to change notification settings - Fork 3.4k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
ARROW-12355: [C++] Implement efficient async CSV scanning #10103
ARROW-12355: [C++] Implement efficient async CSV scanning #10103
Conversation
4f0b2e0
to
25338ec
Compare
auto source = file->source(); | ||
auto reader_fut = | ||
OpenReaderAsync(source, *this, scan_options, internal::GetCpuThreadPool()); | ||
return GeneratorFromReader(std::move(reader_fut)); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Should we go ahead and add readahead here?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The CSV reader does not support parallel readahead yet. Serial readahead might be redundant since the background generator is pretty much already doing that. I suppose it would give us a bit of pipeline parallelism (allowing us to filter & project while we parse & decode) but at the cost breaking up some of the processing locality (e.g. right now we always parse X, decode X, filter X, project X). I'll run some experiments and see if it has any noticeable effect in either direction.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Adding serial readahead here did not have any effect on performance.
fd34825
to
b6c2c18
Compare
ARROW-12355: MakeReadaheadGenerator is needed here for the case where everything is completely synchronous (e.g. cached I/O)
b6c2c18
to
bb75e40
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM. Did you get a chance to test the performance impact?
A tiny bit of cleanup and combines ARROW-12392 and ARROW-12289 to create a true async CSV format. ~~Keeping in draft until ARROW-12392, ARROW-12289, and ARROW-12386 are resolved.~~ Closes apache#10103 from westonpace/feature/ARROW-12355--c-implement-efficient-async-csv-scanning Authored-by: Weston Pace <weston.pace@gmail.com> Signed-off-by: David Li <li.davidm96@gmail.com>
A tiny bit of cleanup and combines ARROW-12392 and ARROW-12289 to create a true async CSV format. ~~Keeping in draft until ARROW-12392, ARROW-12289, and ARROW-12386 are resolved.~~ Closes apache#10103 from westonpace/feature/ARROW-12355--c-implement-efficient-async-csv-scanning Authored-by: Weston Pace <weston.pace@gmail.com> Signed-off-by: David Li <li.davidm96@gmail.com>
A tiny bit of cleanup and combines ARROW-12392 and ARROW-12289 to create a true async CSV format. ~~Keeping in draft until ARROW-12392, ARROW-12289, and ARROW-12386 are resolved.~~ Closes apache#10103 from westonpace/feature/ARROW-12355--c-implement-efficient-async-csv-scanning Authored-by: Weston Pace <weston.pace@gmail.com> Signed-off-by: David Li <li.davidm96@gmail.com>
A tiny bit of cleanup and combines ARROW-12392 and ARROW-12289 to create a true async CSV format.
Keeping in draft until ARROW-12392, ARROW-12289, and ARROW-12386 are resolved.