New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
ARROW-3666: [C++] Improve C++ parser performance #2886
Conversation
I'm open to suggestions so as to make this more readable and maintainable. Should probably add some comments. |
For the record, we're about 30% faster than Paratext when reading a CSV file of floating-point numbers. Even if we call As for other types (binary, ints), Paratext seems either broken or insanely slow on those. |
Codecov Report
@@ Coverage Diff @@
## master #2886 +/- ##
=========================================
Coverage ? 87.27%
=========================================
Files ? 404
Lines ? 63056
Branches ? 0
=========================================
Hits ? 55031
Misses ? 7931
Partials ? 94
Continue to review full report at Codecov.
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Small comments. You might be able to avoid reallocs when resizing buffers with Resize(n, false)
.
I would have to spend some more time examining this code to think about how to refactor for readability / maintainability, but so long as we have enough micro/macrobenchmarks to be able to refactor without fear of performance regressions, I am OK moving forward since there are so many features we'll need to build in this module
} | ||
|
||
void Finish(std::shared_ptr<Buffer>* out_parsed) { | ||
ARROW_CHECK_OK(parsed_buffer_->Resize(parsed_size_)); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
You might pass shrink_to_fit=false
to avoid realloc here
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It doesn't seem to make much of a difference though. Shrinking may help reduce the memory footprint a bit.
cpp/src/arrow/csv/parser.cc
Outdated
|
||
while (data < data_end && num_rows_ < max_num_rows_) { | ||
template <typename SpecializedOptions, typename ValuesWriter, typename ParsedWriter> | ||
Status BlockParser::ParseChunk(ValuesWriter& values_writer, ParsedWriter& parsed_writer, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Consider using T*
instead of T&
for these first two arguments
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Will do.
Is the improved performance here mostly from the memory optimization in the presized case? |
Makes CSV parsing around 30% faster
Not only, I think it's also simpler code being generated. CSV parsing has a lot of data-dependent branching in its critical path. I don't know how reliable the numbers are, but |
73b38ed
to
1c60c54
Compare
+1, will merge. |
Make CSV parsing around 30% faster.
Benchmark of reading a CSV file with integer columns. Before:
After:
(on a 8-core 16-thread AMD Ryzen CPU)