
[C++][Parquet] Parquet write_to_dataset performance regression #35498

Closed · alexhudspith opened this issue May 8, 2023 · 17 comments · Fixed by #35565

Labels: Component: C++ · Component: Parquet · Critical Fix (bugfixes for security vulnerabilities, crashes, or invalid data) · Priority: Blocker (marks a blocker for the release) · Type: bug


alexhudspith commented May 8, 2023

Describe the bug, including details regarding any error messages, version, and platform.

On Linux, pyarrow.parquet.write_to_dataset shows a large performance regression in Arrow 12.0 versus 11.0.

The following results were collected using Ubuntu 22.04.2 LTS (5.15.0-71-generic), Intel Haswell 4-core @ 3.6GHz, 16 GB RAM, Samsung 840 Pro SSD. They are elapsed times in seconds to write a single int64 column of integers [0,..., length-1] with no compression and no multi-threading:

| Array length | Arrow 11 (s) | Arrow 12 (s) |
|-------------:|-------------:|-------------:|
| 1,000,000    | 0.1          | 0.1          |
| 2,000,000    | 0.2          | 0.4          |
| 4,000,000    | 0.3          | 1.6          |
| 8,000,000    | 0.8          | 6.2          |
| 16,000,000   | 2.3          | 24.4         |
| 32,000,000   | 6.5          | 94.1         |
| 64,000,000   | 13.5         | 371.7        |

The output directory was deleted before each run.

"""check.py"""
import sys
import time
import numpy as np
import pyarrow as pa
import pyarrow.parquet as pq

def main():
    path = '/tmp/test.parquet'
    length = 10_000_000 if len(sys.argv) < 2 else int(sys.argv[1])
    table = pa.Table.from_arrays([pa.array(np.arange(length))], names=['A'])
    t0 = time.perf_counter()
    pq.write_to_dataset(
        table, path, schema=table.schema, use_legacy_dataset=False, use_threads=False, compression=None
    )
    duration = time.perf_counter() - t0
    print(f'{duration:.2f}s')

if __name__ == '__main__':
    main()
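
Each row of the table above corresponds to a run like `python check.py 64000000` (the array length is read from the first command-line argument), with /tmp/test.parquet removed between runs.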

Running git bisect on local builds led me to this commit: 660d259 ([C++] Add ordered/segmented aggregation Substrait extension (#34627)).

Following that change, flamegraphs show a lot of additional time spent in arrow::util::EnsureAlignment calling glibc memcpy:

Before, ~1.3 s (ddd0a33): [flamegraph: good-ddd0a33 perf]

After, ~9.6 s (660d259): [flamegraph: bad-660d259 perf]

Reading and pyarrow.parquet.write_table appear unaffected.

Component(s)

C++, Parquet

alexhudspith changed the title from "[C++][Parquet] Parquet write_todataset performance regression" to "[C++][Parquet] Parquet write_to_dataset performance regression" on May 8, 2023
jorisvandenbossche (Member) commented

@alexhudspith Thanks a lot for the report and the detailed analysis.

It's unfortunate that this got into the release, as it seems this was actually visible in our own benchmarks as well (https://conbench.ursa.dev/benchmark-results/2b587cc1079f4e3a97f542e6f11e883e/); we need a better process for checking regressions before a release.

A call to EnsureAlignment was indeed added in the PR you reference (the entire for-loop is new):

```cpp
ExecBatch batch = morsel.Slice(offset, batch_size);
for (auto& value : batch.values) {
  if (value.is_array()) {
    ARROW_ASSIGN_OR_RAISE(value, arrow::util::EnsureAlignment(
                                     value.make_array(), ipc::kArrowAlignment,
                                     default_memory_pool()));
  }
}
```

I am not immediately sure of the reason for doing this here; cc @rtpsw @westonpace

jorisvandenbossche added this to the 12.0.1 milestone May 9, 2023
mapleFU (Member) commented May 9, 2023

@jorisvandenbossche So is this due to the alignment change (EnsureAlignment) between 11.0 and 12.0, which causes the unaligned arrays produced by morsel.Slice to be copied many more times?

jorisvandenbossche (Member) commented

I assume so, yes, based on the flamegraphs shown above. But I am not familiar with why this was added.

rtpsw (Contributor) commented May 9, 2023

> I am not immediately sure of the reason for doing this here; cc @rtpsw @westonpace

The reason for adding EnsureAlignment is that Flight generated misaligned buffers (#32276) and, more generally, that the combination of a zero-copy policy with a non-aligned format leads to misaligned buffers. Arrow requires aligned buffers, and I'm not sure there is a good way around this requirement.

Note that I haven't yet looked into the specifics of the current issue; it may be due to Flight or some other misaligned format used as input, or some other reason.

jorisvandenbossche (Member) commented May 9, 2023

Just so I understand: that specific change wasn't really related to the rest of the PR?

rtpsw (Contributor) commented May 9, 2023

> Just so I understand: that specific change wasn't really related to the rest of the PR?

IIRC, the PR failed internal integration testing without the EnsureAlignment change. I'm not sure whether this means the change is unrelated.

rtpsw (Contributor) commented May 10, 2023

Looking at the code, I suspect the degraded performance comes from the source table having misaligned numpy arrays, with every batch of each of these arrays getting realigned by EnsureAlignment: since the default batch size is itself aligned, batch-slicing preserves the misalignment. This would explain why the degradation worsens with larger arrays, which get sliced into more batches. One way to test this theory is to increase the batch size in line with the array sizes; the performance degradation should then shrink.
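
This theory can be checked from Python. A minimal sketch (not part of the original report; it only assumes pyarrow's `Array.buffers()`, `Array.slice()`, and `Buffer.address`):

```python
import numpy as np
import pyarrow as pa

# Zero-copy wrap of a numpy buffer: pyarrow reuses numpy's allocation.
arr = pa.array(np.arange(16_000_000))
buf = arr.buffers()[1]   # the int64 data buffer (buffers()[0] is the null bitmap)

print(buf.address % 64)  # frequently nonzero with glibc: not 64-byte aligned
print(buf.address % 8)   # 0: numpy does guarantee 8-byte (value) alignment

# A slice is a view over the very same buffer, so every batch cut from the
# source keeps the misalignment and triggers a copy inside EnsureAlignment.
print(arr.slice(131_072).buffers()[1].address == buf.address)  # True
```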

As for the cause, it looks like pa.Table.from_arrays([pa.array(np.arange(length))], names=['A']) produces arrays that are misaligned by Arrow's standards, due to zero-copy wrapping of misaligned numpy arrays, which the Arrow spec forbids. However, since this code is natural and has likely been accepted since the beginning, the realignment should probably be done within Arrow (perhaps with a warning), or be possible via Arrow configuration. Realigning the full arrays has a performance cost, of course, but a much lower one than realigning many batches of each array. Looking further out, I'd suggest adding facilities for obtaining Arrow-aligned numpy arrays and documenting them accordingly. Better yet, if possible, would be for numpy to support memory-alignment configuration, so that Arrow user code would not need to change.
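
To illustrate what such a realignment could look like, here is a hedged user-side sketch (realign_table and realign_column are hypothetical helpers, not Arrow APIs): it pays one up-front copy per column into Arrow-pool-allocated, 64-byte-aligned memory instead of one copy per batch.

```python
import numpy as np
import pyarrow as pa

def realign_column(col: pa.ChunkedArray) -> pa.ChunkedArray:
    # Hypothetical helper: take() gathers values into freshly allocated output
    # buffers, which come from Arrow's default memory pool and are therefore
    # 64-byte aligned.
    indices = pa.array(np.arange(len(col)))
    return col.take(indices)

def realign_table(table: pa.Table) -> pa.Table:
    # Hypothetical helper: rebuild the table from realigned columns.
    return pa.Table.from_arrays(
        [realign_column(col) for col in table.columns], schema=table.schema)
```

With the repro above, realigning the table once before write_to_dataset should avoid the per-batch copies, at the cost of a single O(n) gather.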

rtpsw (Contributor) commented May 10, 2023

@icexelloss, what are your thoughts about this?

rtpsw (Contributor) commented May 10, 2023

> IIRC, the PR failed internal integration testing without the EnsureAlignment change. I'm not sure whether this means the change is unrelated.

Digging this up for reference, the failure was raised by an invocation of util::CheckAlignment from CompareBinaryColumnToRow.

jorisvandenbossche (Member) commented

My understanding is that alignment is "recommended but not required for in-memory data"; it's only when serializing (IPC) that the requirement is enforced (https://arrow.apache.org/docs/dev/format/Columnar.html#buffer-alignment-and-padding)?

rtpsw (Contributor) commented May 10, 2023

> Better yet, if possible, would be for numpy to support memory-alignment configuration, so that Arrow user code would not need to change.

Looks like numpy already supports this via configurable memory routines (see NEP-49 and the corresponding PR).
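
Note that NEP-49's handler API (PyDataMem_SetHandler) is C-level; installing a 64-byte-aligned allocator requires a C extension. From Python one can at least inspect which handler backs an array. A hedged sketch, assuming numpy >= 1.22 where this introspection function was added:

```python
import numpy as np
from numpy.core.multiarray import get_handler_name

a = np.arange(1_000_000)
# 'default_allocator' unless a custom NEP-49 handler was installed.
print(get_handler_name(a))
```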

rtpsw (Contributor) commented May 10, 2023

> My understanding is that alignment is "recommended but not required for in-memory data"; it's only when serializing (IPC) that the requirement is enforced (https://arrow.apache.org/docs/dev/format/Columnar.html#buffer-alignment-and-padding)?

That sounds right per the spec, though ignoring the recommendation may lead to other kinds of performance degradation. In practice, existing Arrow code checks the alignment of in-memory buffers (at least in the place I pointed to), despite what the spec says, and changing that would be a much wider issue. Given all this, I think we need to consider what makes sense as a short-term resolution versus a longer-term one.

mapleFU (Member) commented May 10, 2023

I guess IPC can generate non-aligned buffers, like a 1-byte offset in an Int32Array, which would cause undefined behavior via type punning. But does it need to force subslice alignment here?

rtpsw (Contributor) commented May 10, 2023

> I guess IPC can generate non-aligned buffers, like a 1-byte offset in an Int32Array, which would cause undefined behavior via type punning.

Right, this is basically what happened.

> But does it need to force subslice alignment here?

I don't know whether dropping alignment enforcement entirely would be OK, but it sounds like a smaller alignment would do. Perhaps a good quick fix is to change ipc::kArrowAlignment in the code to the byte size of the value's type. My understanding is that numpy ensures this alignment condition.
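
As a rough illustration of the proposal (is_value_aligned is an illustrative name, not Arrow's C++ API): value alignment only requires the buffer address to be a multiple of the element width, e.g. 8 bytes for int64, rather than a multiple of 64.

```python
import numpy as np
import pyarrow as pa

def is_value_aligned(address: int, byte_width: int) -> bool:
    # Value alignment: the address is a multiple of the element size.
    # Weaker than Arrow's 64-byte kArrowAlignment, but it matches what
    # numpy's allocator guarantees in practice.
    return address % byte_width == 0

arr = pa.array(np.arange(1_000_000))
buf = arr.buffers()[1]
print(is_value_aligned(buf.address, arr.type.bit_width // 8))  # expect True
print(buf.address % 64 == 0)                                   # may be False
```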

jorisvandenbossche (Member) commented

Related issue about buffer alignment in Acero:

(that issue proposes enforcing 64-byte alignment in the Acero source node, but it was presented there as something to discuss)

westonpace (Member) commented

> I don't know whether dropping alignment enforcement entirely would be OK

Correct. As @mapleFU mentions, this could lead to undefined behavior via type punning (this is the error we were getting from Flight).

> I don't know whether dropping alignment enforcement entirely would be OK, but it sounds like a smaller alignment would do. Perhaps a good quick fix is to change ipc::kArrowAlignment in #35498 (comment) to the byte size of the value's type. My understanding is that numpy ensures this alignment condition.

This seems like the best solution.

> My understanding is that alignment is "recommended but not required for in-memory data"; it's only when serializing (IPC) that the requirement is enforced

Correct. Most of Arrow C++ aims to tolerate unaligned buffers; Acero, however, doesn't. These type-punning assumptions are subtle, and there are a lot of compute functions. If someone wanted to support this, a first step would be to create unit tests that run all the compute functions on unaligned buffers.

rtpsw (Contributor) commented May 11, 2023

> I don't know whether dropping alignment enforcement entirely would be OK, but it sounds like a smaller alignment would do. Perhaps a good quick fix is to change ipc::kArrowAlignment in #35498 (comment) to the byte size of the value's type. My understanding is that numpy ensures this alignment condition.

> This seems like the best solution.

I created #35541 as a proposed fix.

westonpace added a commit that referenced this issue May 29, 2023
…4-byte aligned buffers to requiring value-aligned buffers (#35565)

### Rationale for this change

Various compute kernels and Acero internals rely on type punning. This is only safe when the buffer has appropriate alignment (e.g. casting uint8_t* to uint32_t* is only safe if the buffer has 4-byte alignment). To avoid errors, we enforced 64-byte alignment in Acero. However, this is too strict: while Arrow's allocators will always generate 64-byte-aligned buffers, this is not the case for numpy's allocators (and presumably many others). This PR relaxes the constraint so that we only require value-aligned buffers.

### What changes are included in this PR?

The main complexity here is determining which buffers need to be aligned and by how much. A special flag, kMallocAlignment, can be specified when calling CheckAlignment or EnforceAlignment to require only value alignment rather than a particular byte count.

### Are these changes tested?

Yes

### Are there any user-facing changes?

No
* Closes: #35498

Lead-authored-by: Weston Pace <weston.pace@gmail.com>
Co-authored-by: Antoine Pitrou <pitrou@free.fr>
Co-authored-by: Benjamin Kietzman <bengilgit@gmail.com>
Signed-off-by: Weston Pace <weston.pace@gmail.com>
westonpace modified the milestones: 12.0.1, 13.0.0 on May 29, 2023
raulcd pushed a commit that referenced this issue May 30, 2023
…4-byte aligned buffers to requiring value-aligned buffers (#35565)