
GH-40270: [C++] Use LargeStringArray for casting when writing tables to CSV #40271

Merged 3 commits into apache:main on Apr 11, 2024

Conversation

@sighingnow
Contributor

sighingnow commented Feb 28, 2024

Rationale for this change

Avoid casting failures when tables contain large string arrays whose data is too long for a regular string array.

What changes are included in this PR?

Replace the usage of StringArray with LargeStringArray.

Are these changes tested?

No extra test case is needed (this fixes a corner case).

Are there any user-facing changes?

No user-facing changes.
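As background on the rationale above: a regular StringArray stores character data behind 32-bit offsets, so casting a column whose total string data exceeds roughly 2 GiB down to utf8 fails, and the CSV writer used to perform exactly that cast. A minimal sketch of the failure mode, assuming a large_utf8 column built in memory (the reproduction below is an illustration, not taken from the linked issue):

    // Sketch: before this PR the CSV writer cast every column to utf8, so a
    // large_utf8 column holding more character data than int32 offsets can
    // address made the write fail.
    #include <memory>
    #include <string>
    #include "arrow/api.h"
    #include "arrow/csv/writer.h"
    #include "arrow/io/api.h"

    arrow::Status WriteBigStringsToCsv() {
      arrow::LargeStringBuilder builder;
      std::string value(1 << 20, 'x');  // 1 MiB per row
      for (int i = 0; i < 3000; ++i) {  // ~3 GiB in total
        ARROW_RETURN_NOT_OK(builder.Append(value));
      }
      ARROW_ASSIGN_OR_RAISE(auto array, builder.Finish());
      auto table = arrow::Table::Make(
          arrow::schema({arrow::field("col", arrow::large_utf8())}), {array});
      ARROW_ASSIGN_OR_RAISE(auto out, arrow::io::BufferOutputStream::Create());
      // Previously failed here because of the implicit cast to utf8.
      return arrow::csv::WriteCSV(*table, arrow::csv::WriteOptions::Defaults(),
                                  out.get());
    }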


⚠️ GitHub issue #40270 has been automatically assigned in GitHub to PR creator.

@pitrou
Member

pitrou commented Feb 28, 2024

CI results indicate segfaults, surely something is wrong in this PR?

@sighingnow
Contributor Author

> CI results indicate segfaults, surely something is wrong in this PR?

Issue fixed; it now passes arrow-csv-test in my local environment.

@sighingnow
Contributor Author

The timeout error in the S3 test shouldn't be caused by this PR: https://github.com/apache/arrow/actions/runs/8081349229/job/22079702366

@pitrou
Member

pitrou commented Feb 28, 2024

After running the benchmarks locally, it seems that this is triggering a bunch of regressions when writing a StringArray to CSV:

-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
Regressions: (8)
-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
                   benchmark        baseline       contender  change %                                                                                                                                                                                         counters
    WriteCsvStringNoQuote/50 692.326 MiB/sec 621.875 MiB/sec   -10.176     {'family_index': 1, 'per_family_instance_index': 3, 'run_name': 'WriteCsvStringNoQuote/50', 'repetitions': 1, 'repetition_index': 0, 'threads': 1, 'iterations': 2217, 'null_percent': 50.0}
WriteCsvStringRejectQuote/50 704.688 MiB/sec 625.617 MiB/sec   -11.221 {'family_index': 3, 'per_family_instance_index': 3, 'run_name': 'WriteCsvStringRejectQuote/50', 'repetitions': 1, 'repetition_index': 0, 'threads': 1, 'iterations': 2418, 'null_percent': 50.0}
    WriteCsvStringNoQuote/10   1.478 GiB/sec   1.250 GiB/sec   -15.420     {'family_index': 1, 'per_family_instance_index': 2, 'run_name': 'WriteCsvStringNoQuote/10', 'repetitions': 1, 'repetition_index': 0, 'threads': 1, 'iterations': 2840, 'null_percent': 10.0}
     WriteCsvStringNoQuote/0   1.976 GiB/sec   1.652 GiB/sec   -16.403       {'family_index': 1, 'per_family_instance_index': 0, 'run_name': 'WriteCsvStringNoQuote/0', 'repetitions': 1, 'repetition_index': 0, 'threads': 1, 'iterations': 3384, 'null_percent': 0.0}
     WriteCsvStringNoQuote/1   1.857 GiB/sec   1.540 GiB/sec   -17.098       {'family_index': 1, 'per_family_instance_index': 1, 'run_name': 'WriteCsvStringNoQuote/1', 'repetitions': 1, 'repetition_index': 0, 'threads': 1, 'iterations': 3184, 'null_percent': 1.0}
WriteCsvStringRejectQuote/10   1.829 GiB/sec   1.465 GiB/sec   -19.942 {'family_index': 3, 'per_family_instance_index': 2, 'run_name': 'WriteCsvStringRejectQuote/10', 'repetitions': 1, 'repetition_index': 0, 'threads': 1, 'iterations': 3843, 'null_percent': 10.0}
 WriteCsvStringRejectQuote/0   2.878 GiB/sec   2.300 GiB/sec   -20.086   {'family_index': 3, 'per_family_instance_index': 0, 'run_name': 'WriteCsvStringRejectQuote/0', 'repetitions': 1, 'repetition_index': 0, 'threads': 1, 'iterations': 5652, 'null_percent': 0.0}
 WriteCsvStringRejectQuote/1   2.660 GiB/sec   1.966 GiB/sec   -26.074   {'family_index': 3, 'per_family_instance_index': 1, 'run_name': 'WriteCsvStringRejectQuote/1', 'repetitions': 1, 'repetition_index': 0, 'threads': 1, 'iterations': 4869, 'null_percent': 1.0}

@sighingnow
Contributor Author

sighingnow commented Feb 28, 2024

A 10% regression looks a bit significant. How could I replay the regression benchmark? Could you please show me the command line?

@pitrou
Member

pitrou commented Feb 28, 2024

You can build in release mode and pass -DARROW_BUILD_BENCHMARKS=ON to CMake. You'll then find some benchmark executables in the target directory, including arrow-csv-writer-benchmark.

@pitrou
Member

pitrou commented Feb 28, 2024

I think this means the cast should be more careful: only cast to large string if the input data is too large for a regular string array.

@sighingnow
Contributor Author

> I think this means the cast should be more careful: only cast to large string if the input data is too large for a regular string array.

Thanks for the suggestion; that is indeed exactly what I was considering. I have pushed an implementation that first tries to cast to StringArray and falls back to LargeStringArray on failure, then dispatches to a templated implementation based on the array's type at runtime. Since a virtual member function cannot be a template, this is the simplest / least invasive solution I could think of.

I couldn't find how to run the regression benchmarks, but I have run the main branch and this PR's branch twice; the numbers look good now.

[screenshot: local benchmark comparison]
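A condensed sketch of the dispatch half of that approach (class and method names simplified, not the PR's actual code): the virtual method cannot itself be a template, so the override inspects the casted array's runtime type and forwards to a private function template.

    #include <cstdint>
    #include "arrow/api.h"

    class ColumnPopulatorSketch {
     public:
      virtual ~ColumnPopulatorSketch() = default;

      // Virtual entry point: dispatch on the runtime type of the casted array.
      virtual arrow::Status UpdateRowLengths(const arrow::Array& casted) {
        if (casted.type_id() == arrow::Type::STRING) {
          return UpdateRowLengthsImpl<arrow::StringArray>(casted);
        }
        return UpdateRowLengthsImpl<arrow::LargeStringArray>(casted);
      }

     private:
      // Function template shared by the 32-bit and 64-bit offset layouts.
      template <typename ArrayType>
      arrow::Status UpdateRowLengthsImpl(const arrow::Array& casted) {
        const auto& strings = static_cast<const ArrayType&>(casted);
        int64_t total = 0;
        for (int64_t i = 0; i < strings.length(); ++i) {
          total += strings.value_length(i);  // stand-in for the real row-length logic
        }
        (void)total;
        return arrow::Status::OK();
      }
    };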

@sighingnow
Contributor Author

Hi @pitrou, I would like to know if there are further comments on this pull request.

Thanks!

@bkietz
Member

bkietz commented Mar 1, 2024

@ursabot please benchmark

@ursabot

ursabot commented Mar 1, 2024

Benchmark runs are scheduled for commit 7e7e5c6. Watch https://buildkite.com/apache-arrow and https://conbench.ursa.dev for updates. A comment will be posted here when the runs are complete.

- std::shared_ptr<Array> casted,
- compute::Cast(data, /*to_type=*/utf8(), compute::CastOptions(), &ctx));
- casted_array_ = checked_pointer_cast<StringArray>(casted);
+ auto casted = compute::Cast(data, /*to_type=*/utf8(), compute::CastOptions(), &ctx);
Member

Perhaps not very important, but I think the algorithm should ideally be:

  1. if the input data is large_binary or large_utf8, cast it to large_utf8
  2. otherwise, try to cast it to utf8, and fallback to large_utf8
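A sketch of that ordering as a standalone helper (the function name is mine, not the PR's):

    #include <memory>
    #include "arrow/api.h"
    #include "arrow/compute/api.h"

    // 1. large_binary / large_utf8 inputs go straight to large_utf8;
    // 2. everything else tries utf8 first and falls back to large_utf8.
    arrow::Result<std::shared_ptr<arrow::Array>> CastForCsvWriter(
        const arrow::Array& data) {
      const arrow::Type::type id = data.type_id();
      if (id == arrow::Type::LARGE_STRING || id == arrow::Type::LARGE_BINARY) {
        return arrow::compute::Cast(data, arrow::large_utf8());
      }
      auto casted = arrow::compute::Cast(data, arrow::utf8());
      if (casted.ok()) {
        return casted;
      }
      // Offsets would overflow int32: fall back to 64-bit offsets.
      return arrow::compute::Cast(data, arrow::large_utf8());
    }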

Contributor Author

Fixed.

@github-actions bot added the awaiting committer review label and removed the awaiting review label Mar 1, 2024

Thanks for your patience. Conbench analyzed the 7 benchmarking runs that have been run so far on PR commit 7e7e5c6.

There was 1 benchmark result indicating a performance regression:

The full Conbench report has more details.

@sighingnow
Contributor Author

The regression of arrow-compute-vector-hash-benchmark shouldn't be caused by this PR.

@sighingnow
Contributor Author

The CI failure shouldn't be caused by this PR:

================================== FAILURES ===================================
_______________________ test_dateutil_tzinfo_to_string ________________________
    def test_dateutil_tzinfo_to_string():
        pytest.importorskip("dateutil")
        import dateutil.tz
    
        tz = dateutil.tz.UTC
        assert pa.lib.tzinfo_to_string(tz) == 'UTC'
        tz = dateutil.tz.gettz('Europe/Paris')
>       assert pa.lib.tzinfo_to_string(tz) == 'Europe/Paris'
E       AssertionError: assert 'Europe/Monaco' == 'Europe/Paris'
E         - Europe/Paris
E         + Europe/Monaco
pyarrow\tests\test_types.py:355: AssertionError
============================== warnings summary ===============================

@@ -108,6 +108,17 @@ int64_t CountQuotes(std::string_view s) {
constexpr int64_t kQuoteCount = 2;
constexpr int64_t kQuoteDelimiterCount = kQuoteCount + /*end_char*/ 1;

#ifndef CSV_WRITER_DISPATCH_STRING_ARRAY_TYPE
#define CSV_WRITER_DISPATCH_STRING_ARRAY_TYPE(array, func, ...) \
Member

Please replace this macro with usage of VisitArrayInline or so

Contributor Author

I'm afraid that would be a bit of an over-design. There are actually two methods, PopulateRows and UpdateRowLengths, that would require such a dispatch. That means I would need to define four visitor classes for two different methods in two different classes.

Any ideas? @bkietz

Contributor Author

Kindly pinging @bkietz: do you think we could just keep the current implementation, for simplicity?

Thanks!

Member

Well, a visitor might be overkill, but it would be better to remove the macro and write the code explicitly, IMHO.

Member

I also think a template helper function, or just writing the code explicitly, is a better way.
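For illustration, one shape such a template helper could take (my naming and structure, not what was merged): dispatch once on the type id and hand the concrete array to a generic callable, so both PopulateRows and UpdateRowLengths could reuse it.

    #include <cstdint>
    #include <utility>
    #include "arrow/api.h"

    // Call `visitor` with the array downcast to its concrete string type.
    template <typename Visitor, typename... Args>
    decltype(auto) DispatchStringArrayType(const arrow::Array& array, Visitor&& visitor,
                                           Args&&... args) {
      if (array.type_id() == arrow::Type::STRING) {
        return visitor(static_cast<const arrow::StringArray&>(array),
                       std::forward<Args>(args)...);
      }
      return visitor(static_cast<const arrow::LargeStringArray&>(array),
                     std::forward<Args>(args)...);
    }

    // Example use with a generic lambda; it compiles for both layouts because
    // StringArray and LargeStringArray expose the same accessor interface.
    inline int64_t TotalCharacterLength(const arrow::Array& casted) {
      return DispatchStringArrayType(casted, [](const auto& strings) -> int64_t {
        int64_t total = 0;
        for (int64_t i = 0; i < strings.length(); ++i) {
          total += strings.value_length(i);
        }
        return total;
      });
    }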

Contributor Author

Fixed. @pitrou, could you please take another look? Thanks!

@github-actions bot added the awaiting changes label and removed the awaiting committer review label Mar 4, 2024
@jingshi-ant

Hello, will this issue be resolved in the near future? We encountered the same problem.

@mapleFU
Member

mapleFU commented Mar 30, 2024

Would you mind rebasing this PR? It might fix some CI problems (and introduce some new ones).

@sighingnow
Contributor Author

> Would you mind rebasing this PR? It might fix some CI problems (and introduce some new ones).

Rebased against current main.

@github-actions bot added the awaiting change review label and removed the awaiting changes label Mar 30, 2024

@@ -146,7 +165,7 @@ class ColumnPopulator {

protected:
virtual Status UpdateRowLengths(int64_t* row_lengths) = 0;
- std::shared_ptr<StringArray> casted_array_;
+ std::shared_ptr<Array> array_;
Member

Could you just add a comment for array_ noting that it will be a StringArray or LargeStringArray?

Contributor Author

Fixed.

} else {
auto casted = compute::Cast(data, /*to_type=*/utf8(), compute::CastOptions(), &ctx);
if (casted.ok()) {
array_ = casted.ValueOrDie();
Member

Suggested change
array_ = casted.ValueOrDie();
array_ = std::move(casted).ValueOrDie();
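The reasoning behind the suggestion, as I read it: arrow::Result::ValueOrDie() has lvalue and rvalue overloads, and calling it on a moved Result lets the contained shared_ptr be moved out rather than copied. A small illustration (not code from the PR):

    #include <memory>
    #include <utility>
    #include "arrow/api.h"

    void IllustrateValueOrDie() {
      arrow::Result<std::shared_ptr<arrow::Array>> casted =
          arrow::MakeArrayOfNull(arrow::utf8(), /*length=*/3);
      // lvalue overload: returns a reference, so the assignment copies the
      // shared_ptr (an atomic refcount increment).
      std::shared_ptr<arrow::Array> copied = casted.ValueOrDie();
      // rvalue overload: moves the value out of the Result, no refcount bump.
      std::shared_ptr<arrow::Array> moved = std::move(casted).ValueOrDie();
      (void)copied;
      (void)moved;
    }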

Contributor Author

Fixed.

cpp/src/arrow/csv/writer.cc (resolved review thread)
Comment on lines 116 to 117
} else { \
return func<LargeStringArray>(__VA_ARGS__); \
Member

Also DCHECK that the type is large_string here?
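A guess at what that check could look like inside the macro's else branch, assuming the macro's array argument is an Array reference (DCHECK_EQ comes from arrow/util/logging.h):

    } else {                                                    \
      DCHECK_EQ(array.type_id(), ::arrow::Type::LARGE_STRING);  \
      return func<LargeStringArray>(__VA_ARGS__);               \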

Contributor Author

Fixed.

@sighingnow force-pushed the ht/fixes-csv-string-array branch 2 times, most recently from 7e1d00f to dcf7100, April 11, 2024 04:18
Member

@mapleFU left a comment


+1 from me. Besides, would it be hard to add a test?

At least we could use a large string array (with short strings in it) as input?

@sighingnow
Contributor Author

> +1 from me. Besides, would it be hard to add a test?
>
> At least we could use a large string array (with short strings in it) as input?

Testcase added.
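For reference, a sketch of what such a test could look like (my own construction using Arrow's test helpers, not the test actually added in the PR): write the same short values through a utf8 column and a large_utf8 column and require identical CSV output.

    #include <memory>
    #include <string>
    #include "arrow/api.h"
    #include "arrow/csv/writer.h"
    #include "arrow/io/api.h"
    #include "arrow/testing/gtest_util.h"
    #include "gtest/gtest.h"

    namespace {

    std::string WriteToCsv(const std::shared_ptr<arrow::Array>& array,
                           const std::shared_ptr<arrow::DataType>& type) {
      auto table = arrow::Table::Make(
          arrow::schema({arrow::field("col", type)}), {array});
      auto out = arrow::io::BufferOutputStream::Create().ValueOrDie();
      EXPECT_TRUE(arrow::csv::WriteCSV(*table, arrow::csv::WriteOptions::Defaults(),
                                       out.get())
                      .ok());
      return out->Finish().ValueOrDie()->ToString();
    }

    TEST(CsvWriterSketch, LargeStringColumnWithShortValues) {
      const char* json = R"(["a", "bb", null, "ccc"])";
      auto small = arrow::ArrayFromJSON(arrow::utf8(), json);
      auto large = arrow::ArrayFromJSON(arrow::large_utf8(), json);
      // The 64-bit-offset column should produce exactly the same CSV text.
      EXPECT_EQ(WriteToCsv(small, arrow::utf8()),
                WriteToCsv(large, arrow::large_utf8()));
    }

    }  // namespace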

@sighingnow
Contributor Author

Hi @mapleFU @pitrou, could you please take another look and help merge this PR?

Thanks!

@mapleFU
Member

mapleFU commented Apr 11, 2024

This looks OK to me, but I'll wait for @pitrou's and @bkietz's input.

Member

@pitrou left a comment


LGTM except for one minor comment

cpp/src/arrow/csv/writer.cc (outdated; resolved review thread)
sighingnow and others added 3 commits April 11, 2024 17:20
Signed-off-by: Tao He <sighingnow@gmail.com>
Co-authored-by: Antoine Pitrou <pitrou@free.fr>
Signed-off-by: Tao He <sighingnow@gmail.com>
@pitrou
Member

pitrou commented Apr 11, 2024

@github-actions crossbow submit -g cpp


Revision: ff4f7ea

Submitted crossbow builds: ursacomputing/crossbow @ actions-8c63406bed

Task Status
test-alpine-linux-cpp GitHub Actions
test-build-cpp-fuzz GitHub Actions
test-conda-cpp GitHub Actions
test-conda-cpp-valgrind Azure
test-cuda-cpp GitHub Actions
test-debian-12-cpp-amd64 GitHub Actions
test-debian-12-cpp-i386 GitHub Actions
test-fedora-39-cpp GitHub Actions
test-ubuntu-20.04-cpp GitHub Actions
test-ubuntu-20.04-cpp-bundled GitHub Actions
test-ubuntu-20.04-cpp-minimal-with-formats GitHub Actions
test-ubuntu-20.04-cpp-thread-sanitizer GitHub Actions
test-ubuntu-22.04-cpp GitHub Actions
test-ubuntu-22.04-cpp-20 GitHub Actions
test-ubuntu-22.04-cpp-emscripten GitHub Actions
test-ubuntu-22.04-cpp-no-threading GitHub Actions
test-ubuntu-24.04-cpp GitHub Actions
test-ubuntu-24.04-cpp-gcc-14 GitHub Actions

@pitrou
Member

pitrou commented Apr 11, 2024

@github-actions crossbow submit -g cpp

@pitrou merged commit 7288bd5 into apache:main Apr 11, 2024
36 checks passed
@pitrou removed the awaiting change review label Apr 11, 2024

Revision: ff4f7ea

Submitted crossbow builds: ursacomputing/crossbow @ actions-547e0b6dcb

(task list identical to the previous crossbow run above)


After merging your PR, Conbench analyzed the 4 benchmarking runs that have been run so far on merge-commit 7288bd5.

There were no benchmark performance regressions. 🎉

The full Conbench report has more details. It also includes information about 3 possible false positives for unstable benchmarks that are known to sometimes produce them.

@sighingnow deleted the ht/fixes-csv-string-array branch April 12, 2024 01:17
vibhatha pushed a commit to vibhatha/arrow that referenced this pull request Apr 15, 2024

tolleybot pushed a commit to tmct/arrow that referenced this pull request May 2, 2024

vibhatha pushed a commit to vibhatha/arrow that referenced this pull request May 25, 2024