[C++][Parquet] Parquet write_to_dataset performance regression #35498
@alexhudspith Thanks a lot for the report and the detailed analysis. It's unfortunate that this got into the release, as it seems we could actually also see this in our own benchmarks (https://conbench.ursa.dev/benchmark-results/2b587cc1079f4e3a97f542e6f11e883e/ - we need a better process for catching regressions before a release). The relevant call is in arrow/cpp/src/arrow/acero/source_node.cc, lines 106 to 113 at de6c3cd.
I am not immediately sure of the reason for doing this here; cc @rtpsw @westonpace
@jorisvandenbossche So is this due to the alignment change?
I assume so, yes, based on the flamegraphs shown above. But I am not familiar with why this was added.
The reason for adding it was [...]. Note that so far I haven't looked into the specifics of the current issue - it may be due to Flight or some other misaligned format used as input, or some other reason.
But (for my understanding) that specific change wasn't really related to the rest of the PR?
IIRC, the PR failed internal integration testing without the [...] change.
Looking at the code, I suspect the reason for the degraded performance is that the source table has misaligned numpy arrays, and each batch of each of these arrays gets realigned by `EnsureAlignment`. As for the cause of the problem, it looks like [...].
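To illustrate the "misaligned numpy arrays" point: numpy's allocator gives value-aligned buffers, but 64-byte alignment is not guaranteed, and views constructed at an odd byte offset are not even value-aligned. A small sketch (the specific arrays are illustrative, not the reporter's data):

```python
import numpy as np

# A normally allocated numpy array is value-aligned: its address is a
# multiple of the 8-byte item size. 64-byte alignment is NOT guaranteed.
a = np.arange(1000, dtype=np.int64)
print(a.ctypes.data % a.itemsize)   # 0: value-aligned
print(a.flags["ALIGNED"])           # True
print(a.ctypes.data % 64)           # may be nonzero

# A deliberately misaligned view: int64 elements starting at byte offset 1.
raw = bytearray(8 * 4 + 1)
b = np.frombuffer(raw, dtype=np.int64, offset=1)
print(b.flags["ALIGNED"])           # False: the kind of buffer that would
                                    # have to be copied to realign it
```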
@icexelloss, what are your thoughts about this?
Digging this up for reference: the failure was raised by an invocation of [...].
My understanding is that alignment is "recommended but not required for in-memory data"; it's only when serializing (IPC) that the requirement is enforced (https://arrow.apache.org/docs/dev/format/Columnar.html#buffer-alignment-and-padding)?
Looks like numpy already supports this via configurable memory routines (see NEP-49 and the corresponding PR).
That sounds right per the spec, though ignoring the recommendation may lead to other kinds of performance degradation. In practice, the existing Arrow code checks the alignment of in-memory buffers (at least in the place I pointed to), despite what the spec says, and changing that would be a much wider issue. Given all this, I think we need to consider what makes sense as a resolution in the short term vs the longer term.
I guess IPC will generate non-aligned buffers, like 1-byte alignment in [...].
Right, this is basically what happened.
I don't know if dropping the alignment enforcement entirely would be OK, but it sounds like a smaller alignment would do. Perhaps a good quick fix is to change [...].
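For intuition about why a smaller alignment would do, here is a minimal stdlib-only model (not Arrow's actual implementation) of what an `EnsureAlignment`-style pass effectively does, and why a strict 64-byte requirement turns every misaligned batch into a copy:

```python
def ensure_alignment(address: int, data: bytes, required_alignment: int):
    """If the buffer's address satisfies the alignment requirement, pass it
    through zero-copy; otherwise copy it into freshly allocated (aligned)
    memory. These copies are the glibc memcpy calls in the flamegraphs."""
    if address % required_alignment == 0:
        return data, False        # zero-copy path
    return bytes(data), True      # realigned copy

# A 16-byte-aligned address (typical of malloc) holding int64 values fails
# a 64-byte check but passes an 8-byte (value-alignment) check:
_, copied_strict = ensure_alignment(0x7F0000000010, b"\x00" * 8, 64)
_, copied_relaxed = ensure_alignment(0x7F0000000010, b"\x00" * 8, 8)
assert copied_strict and not copied_relaxed
```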
Related issue about buffer alignment in Acero: [...] (which proposes to enforce alignment to 64 bytes in the Acero source node, though in that issue this was still up for discussion).
Correct. As @mapleFU mentions, this could lead to undefined behavior from type punning (this is the error we were getting from Flight).
This seems like the best solution.
Correct. Most of Arrow C++ aims to tolerate unaligned buffers; however, Acero doesn't. These sorts of type-punning assumptions are subtle and there are a lot of compute functions. If someone wanted to support this, a first step would be to create unit tests for all the compute functions on unaligned buffers.
I created #35541 as a proposed fix. |
…4-byte aligned buffers to requiring value-aligned buffers (#35565)

### Rationale for this change

Various compute kernels and Acero internals rely on type punning. This is only safe when the buffer has appropriate alignment (e.g. casting uint8_t* to uint32_t* is only safe if the buffer has 4-byte alignment). To avoid errors we enforced 64-byte alignment in Acero. However, this is too strict. While Arrow's allocators will always generate 64-byte aligned buffers, this is not the case for numpy's allocators (and presumably many others). This PR relaxes the constraint so that we only require value-aligned buffers.

### What changes are included in this PR?

The main complexity here is determining which buffers need to be aligned, and by how much. A special flag kMallocAlignment is added which can be specified when calling CheckAlignment or EnforceAlignment to require only value alignment rather than a particular byte count.

### Are these changes tested?

Yes

### Are there any user-facing changes?

No

* Closes: #35498

Lead-authored-by: Weston Pace <weston.pace@gmail.com>
Co-authored-by: Antoine Pitrou <pitrou@free.fr>
Co-authored-by: Benjamin Kietzman <bengilgit@gmail.com>
Signed-off-by: Weston Pace <weston.pace@gmail.com>
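The kMallocAlignment idea from the PR description can be sketched as a sentinel value in the alignment check. This is a hedged Python model of the described behavior, not Arrow's real C++ API (the sentinel's actual value and the function signatures differ):

```python
# Hypothetical sentinel modeled on the PR description, not Arrow's real constant.
kMallocAlignment = -1   # "require value alignment only"

def check_alignment(address: int, alignment: int, itemsize: int) -> bool:
    """Sketch of the relaxed check: the sentinel means 'address must be a
    multiple of the element width' (enough to make casts like
    uint8_t* -> uint32_t* safe) instead of a fixed byte count."""
    if alignment == kMallocAlignment:
        return address % itemsize == 0
    return address % alignment == 0

# A typical malloc'd int64 buffer passes the value-alignment check...
assert check_alignment(0x1010, kMallocAlignment, itemsize=8)
# ...but fails the old strict 64-byte check.
assert not check_alignment(0x1010, 64, itemsize=8)
```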
Describe the bug, including details regarding any error messages, version, and platform.
On Linux, `pyarrow.parquet.write_to_dataset` shows a large performance regression in Arrow 12.0 versus 11.0. The following results were collected using Ubuntu 22.04.2 LTS (5.15.0-71-generic), Intel Haswell 4-core @ 3.6GHz, 16 GB RAM, Samsung 840 Pro SSD. They are elapsed times in seconds to write a single int64 column of integers [0, ..., length-1] with no compression and no multi-threading. The output directory was deleted before each run.
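The timing methodology described above can be sketched as a small stdlib-only harness; `write_fn` stands in for a call such as `pyarrow.parquet.write_to_dataset(table, out_dir)`, and the function name and structure here are illustrative:

```python
import shutil
import time

def timed_write(write_fn, out_dir: str, repeats: int = 3) -> float:
    """Return the best-of-N elapsed time for write_fn(out_dir), deleting
    the output directory before each run (as in the report) so leftover
    files don't skew the results."""
    best = float("inf")
    for _ in range(repeats):
        shutil.rmtree(out_dir, ignore_errors=True)
        start = time.perf_counter()
        write_fn(out_dir)
        best = min(best, time.perf_counter() - start)
    return best
```

With pyarrow installed, it could be driven as `timed_write(lambda d: pyarrow.parquet.write_to_dataset(table, d), "bench_out")`.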
Running `git bisect` on local builds leads me to this commit: 660d259: [C++] Add ordered/segmented aggregation Substrait extension (#34627). Following that change, flamegraphs show a lot of additional time spent in `arrow::util::EnsureAlignment` calling glibc `memcpy`:

Before: ~1.3s (ddd0a33)
After: ~9.6s (660d259)

Reading and `pyarrow.parquet.write_table` appear unaffected.

Component(s): C++, Parquet