-
Notifications
You must be signed in to change notification settings - Fork 3.4k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
GH-40270: [C++] Use LargeStringArray for casting when writing tables to CSV #40271
Conversation
|
91069ee
to
da7c8cb
Compare
CI results indicate segfaults, surely something is wrong in this PR? |
da7c8cb
to
75ba58b
Compare
Issue fixed, and has passed the |
The timeout error of s3 test shouldn't be caused by this PR: https://github.com/apache/arrow/actions/runs/8081349229/job/22079702366 |
After running the benchmarks locally, it seems that this is triggering a bunch of regressions when writing a StringArray to CSV:
|
10% looks a bit significant, how could I replay the regression benchmark? Could you please help show me the command line? |
You can build in release mode and pass |
I think this means that this should be more careful when casting. Only cast to large string if the input data is too large for a regular string array. |
75ba58b
to
7e7e5c6
Compare
Thanks for the suggestion, indeed that is exactly what I was considering. I have pushed an implementation which tries to cast to I failed to find how to run regression benchmarks, but I have run the main branch and this PR's branch twice, the numbers looks good now. |
Hi @pitrou, I would like to if there are further comments on this pull request? Thanks! |
@ursabot please benchmark |
Benchmark runs are scheduled for commit 7e7e5c6. Watch https://buildkite.com/apache-arrow and https://conbench.ursa.dev for updates. A comment will be posted here when the runs are complete. |
cpp/src/arrow/csv/writer.cc
Outdated
std::shared_ptr<Array> casted, | ||
compute::Cast(data, /*to_type=*/utf8(), compute::CastOptions(), &ctx)); | ||
casted_array_ = checked_pointer_cast<StringArray>(casted); | ||
auto casted = compute::Cast(data, /*to_type=*/utf8(), compute::CastOptions(), &ctx); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Perhaps not very important, but I think the algorithm should ideally be:
- if the input data is large_binary or large_utf8, cast it to large_utf8
- otherwise, try to cast it to utf8, and fallback to large_utf8
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Fixed.
Thanks for your patience. Conbench analyzed the 7 benchmarking runs that have been run so far on PR commit 7e7e5c6. There was 1 benchmark result indicating a performance regression:
The full Conbench report has more details. |
The regression of |
487cabb
to
a0dc29b
Compare
The CI failure shouldn't be caused by this PR:
|
cpp/src/arrow/csv/writer.cc
Outdated
@@ -108,6 +108,17 @@ int64_t CountQuotes(std::string_view s) { | |||
constexpr int64_t kQuoteCount = 2; | |||
constexpr int64_t kQuoteDelimiterCount = kQuoteCount + /*end_char*/ 1; | |||
|
|||
#ifndef CSV_WRITER_DISPATCH_STRING_ARRAY_TYPE | |||
#define CSV_WRITER_DISPATCH_STRING_ARRAY_TYPE(array, func, ...) \ |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Please replace this macro with usage of VisitArrayInline or so
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm afraid it would be a bit over-design. There's actually two method PopulateRows
and UpdateRowLengths
that would require such a dispatch. That means, I would need to define four visitor classes for two different methods in two different classes.
Any ideas? @bkietz
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Kindly ping @bkietz, do you think if we would just keep current implementation, for simplicity?
Thanks!
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Well, a visitor might be overkill, but it would be better to remove the macro and write the code explicitly, IMHO.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I also thinks a template helper function or just write the code is a better way
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Fixed. @pitrou could please help to take another look? thanks!
hello,will this issue be resolved in the near future? we encountered the same problem. |
Would you mind rebase this pr? It might fix some ci problem(and introduce some new) |
Rebased against current main. |
cpp/src/arrow/csv/writer.cc
Outdated
@@ -108,6 +108,17 @@ int64_t CountQuotes(std::string_view s) { | |||
constexpr int64_t kQuoteCount = 2; | |||
constexpr int64_t kQuoteDelimiterCount = kQuoteCount + /*end_char*/ 1; | |||
|
|||
#ifndef CSV_WRITER_DISPATCH_STRING_ARRAY_TYPE | |||
#define CSV_WRITER_DISPATCH_STRING_ARRAY_TYPE(array, func, ...) \ |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I also thinks a template helper function or just write the code is a better way
@@ -146,7 +165,7 @@ class ColumnPopulator { | |||
|
|||
protected: | |||
virtual Status UpdateRowLengths(int64_t* row_lengths) = 0; | |||
std::shared_ptr<StringArray> casted_array_; | |||
std::shared_ptr<Array> array_; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can you just a comment for array_
that it would be a StringArray
or LargeStringArray
?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Fixed.
cpp/src/arrow/csv/writer.cc
Outdated
} else { | ||
auto casted = compute::Cast(data, /*to_type=*/utf8(), compute::CastOptions(), &ctx); | ||
if (casted.ok()) { | ||
array_ = casted.ValueOrDie(); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
array_ = casted.ValueOrDie(); | |
array_ = std::move(casted).ValueOrDie(); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Fixed.
cpp/src/arrow/csv/writer.cc
Outdated
} else { \ | ||
return func<LargeStringArray>(__VA_ARGS__); \ |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Also dcheck the type is large_string here?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Fixed.
7e1d00f
to
dcf7100
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
+1 for me, besides, would it be hard for a testing?
At least we can use a large string array (with short strings in it) as input?
dcf7100
to
b492768
Compare
Testcase added. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM except for one minor comment
ea8d2b1
to
05ef5c1
Compare
Signed-off-by: Tao He <sighingnow@gmail.com>
Co-authored-by: Antoine Pitrou <pitrou@free.fr> Signed-off-by: Tao He <sighingnow@gmail.com>
457aa1c
to
ff4f7ea
Compare
@github-actions crossbow submit -g cpp |
Revision: ff4f7ea Submitted crossbow builds: ursacomputing/crossbow @ actions-8c63406bed |
@github-actions crossbow submit -g cpp |
Revision: ff4f7ea Submitted crossbow builds: ursacomputing/crossbow @ actions-547e0b6dcb |
After merging your PR, Conbench analyzed the 4 benchmarking runs that have been run so far on merge-commit 7288bd5. There were no benchmark performance regressions. 🎉 The full Conbench report has more details. It also includes information about 3 possible false positives for unstable benchmarks that are known to sometimes produce them. |
…ables to CSV (apache#40271) ### Rationale for this change Avoid casting failures when tables contains too long large string arrays. ### What changes are included in this PR? Replace the usage of `StringArray` to `LargeStringArray`. ### Are these changes tested? No extra test case is needed (as it is to fix some corner cases). ### Are there any user-facing changes? No user-facing changes. * GitHub Issue: apache#40270 Lead-authored-by: Tao He <sighingnow@gmail.com> Co-authored-by: Antoine Pitrou <antoine@python.org> Co-authored-by: Antoine Pitrou <pitrou@free.fr> Signed-off-by: Antoine Pitrou <antoine@python.org>
…ables to CSV (apache#40271) ### Rationale for this change Avoid casting failures when tables contains too long large string arrays. ### What changes are included in this PR? Replace the usage of `StringArray` to `LargeStringArray`. ### Are these changes tested? No extra test case is needed (as it is to fix some corner cases). ### Are there any user-facing changes? No user-facing changes. * GitHub Issue: apache#40270 Lead-authored-by: Tao He <sighingnow@gmail.com> Co-authored-by: Antoine Pitrou <antoine@python.org> Co-authored-by: Antoine Pitrou <pitrou@free.fr> Signed-off-by: Antoine Pitrou <antoine@python.org>
…ables to CSV (apache#40271) ### Rationale for this change Avoid casting failures when tables contains too long large string arrays. ### What changes are included in this PR? Replace the usage of `StringArray` to `LargeStringArray`. ### Are these changes tested? No extra test case is needed (as it is to fix some corner cases). ### Are there any user-facing changes? No user-facing changes. * GitHub Issue: apache#40270 Lead-authored-by: Tao He <sighingnow@gmail.com> Co-authored-by: Antoine Pitrou <antoine@python.org> Co-authored-by: Antoine Pitrou <pitrou@free.fr> Signed-off-by: Antoine Pitrou <antoine@python.org>
Rationale for this change
Avoid casting failures when tables contains too long large string arrays.
What changes are included in this PR?
Replace the usage of
StringArray
toLargeStringArray
.Are these changes tested?
No extra test case is needed (as it is to fix some corner cases).
Are there any user-facing changes?
No user-facing changes.
WriteTable
failed with: Failed casting from large_string to string: input array too large #40270