Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[C++] Alignment not enforced; undefined behavior in Parquet writer #33336

Closed
asfimport opened this issue Oct 24, 2022 · 6 comments · Fixed by #14488
Closed

[C++] Alignment not enforced; undefined behavior in Parquet writer #33336

asfimport opened this issue Oct 24, 2022 · 6 comments · Fixed by #14488

Comments

@asfimport
Copy link
Collaborator

It is possible to create arrays using unaligned memory addresses (e.g. for int64). This seems to be in line with the arrow specification which as far as I understand does not require alignment [1].

However, the C++ standard requires alignment, e.g. 8 byte alignment for int64. It is undefined behavior (UB) to create an unaligned pointer / accessing data via an unaligned pointer.

Typically, this is not an issue in practice on x86, since gcc and other compilers mostly emit instructions that can deal with unaligned data. However, for gcc 6.3.0 (and probably up to including gcc versions 7.X), this code:

for (int64_t i = 0; i < length; i++) {

creates an aligned move instruction (movdqa) for the expression {}values[i]{}. This, in turn, triggers a SIGSEGV in case values is called via an unaligned buffer. Later compiler versions (in particular gcc 9.X used to build the wheels published on pypi) will emit instructions that can deal with unaligned data (movdqu instead of movdqa).

The python script "test1.py" reproduces this issue on python-level; note that it will only trigger a SIGSEGV if compiling arrow with a compiler that emits movdqa for the code linked above, e.g. by using gcc 6.3.0 to compile arrow.

In the wild, unaligned buffers are rare, but can appear, e.g. as a result of deserializing pandas dataframes / numpy arrays using pickle protocol 5 that allows out-of-band byte buffers that are re-used as arrow array buffers.

I think the line to first enter the UB regime is this reinterpret_cast:

values = reinterpret_cast<const T*>(data.values()->data()) + data.offset();

[1]https://arrow.apache.org/docs/format/Columnar.html#buffer-alignment-and-padding merely "recommends" that buffers are aligned, but does not require it.

Reporter: Jochen Ott
Assignee: Antoine Pitrou / @pitrou

Original Issue Attachments:

PRs and other links:

Note: This issue was originally created as ARROW-18141. Please see the migration documentation for further details.

@asfimport
Copy link
Collaborator Author

Antoine Pitrou / @pitrou:
Hmm, this is unfortunate. It's true that we mostly don't test for unaligned buffers.

@asfimport
Copy link
Collaborator Author

Antoine Pitrou / @pitrou:
Could you try with draft PR #14488 (I don't have gcc 6.3.0 here)?

@asfimport
Copy link
Collaborator Author

Jochen Ott:
Thanks for the quick fix. Unfortunately, I currently don't have the time to try this on full arrow (I've set up all dependencies and such for arrow 5.0.0, so I'd need to backport this ...). However, I'm pretty confident this fixes the issue, as I tried the following code on godbolt.org (note that I had to slightly modify SafeLoadAs to compile on 6.3.0):

#include <inttypes.h>
#include <algorithm>
#include <type_traits>
#include <cstring>

template <typename T>
inline std::enable_if_t<std::is_trivially_copyable<T>::value, T> SafeLoadAs(
    const uint8_t* unaligned) {
  typename std::remove_const<T>::type ret;
  std::memcpy(&ret, unaligned, sizeof(T));
  return ret;
}

std::pair<int64_t, int64_t> minmax(const int64_t * values, int64_t n) {
    int64_t minval = 0, maxval = 0; // obviously not correct, but not important here
    for(int64_t i=0; i<n; ++i){
        // "old":
        int64_t value = values[i];
        // "new":
        // int64_t value = SafeLoadAs<int64_t>((const uint8_t *)(values + i));
        minval = std::min(minval, value);
        maxval = std::max(maxval, value);
    }
    return std::make_pair(minval, maxval);
}

Using compiler options "-O3 -march=sandybridge", gcc 6.3.0 produces the probelmatic, aligned instruction "vmovdqa" for the "old" code. For the "new" variant, it produces "vmovdqu", though, which seems like the best we could hope for.

@asfimport
Copy link
Collaborator Author

Antoine Pitrou / @pitrou:
[~jott] I'm sure it fixes this particular code path. My question was more if there were other similar issues down the road.

@asfimport
Copy link
Collaborator Author

Jochen Ott:
I did not find any other issues down the road, i.e. writing out parquet files with "write_statistics=False" works with unaligned buffers. Having said that, I would not be surprised  if there are more latent bugs similar to that (i.e. UB due to unaligned access on code-level that do not manifest because current compilers emit instructions that work even for unaligned data). To catch those, I think a code-level review of all the reinterpret_casts or array access would probably the best way to catch those.

@asfimport
Copy link
Collaborator Author

Antoine Pitrou / @pitrou:
Ok, thanks for the feedback. I have found a couple other places to migrate, otherwise I will declare the PR ready.

@asfimport asfimport added this to the 11.0.0 milestone Jan 11, 2023
@raulcd raulcd removed this from the 11.0.0 milestone Jan 11, 2023
@zeroshade zeroshade added this to the 12.0.0 milestone Feb 1, 2023
sjperkins pushed a commit to sjperkins/arrow that referenced this issue Feb 10, 2023
Some gcc versions (such as 6.3.0) may emit an aligned-only load instruction, but the Parquet writer can be called with unaligned buffers.
* Closes: apache#33336

Authored-by: Antoine Pitrou <antoine@python.org>
Signed-off-by: Matt Topol <zotthewizard@gmail.com>
gringasalpastor pushed a commit to gringasalpastor/arrow that referenced this issue Feb 17, 2023
Some gcc versions (such as 6.3.0) may emit an aligned-only load instruction, but the Parquet writer can be called with unaligned buffers.
* Closes: apache#33336

Authored-by: Antoine Pitrou <antoine@python.org>
Signed-off-by: Matt Topol <zotthewizard@gmail.com>
fatemehp pushed a commit to fatemehp/arrow that referenced this issue Feb 24, 2023
Some gcc versions (such as 6.3.0) may emit an aligned-only load instruction, but the Parquet writer can be called with unaligned buffers.
* Closes: apache#33336

Authored-by: Antoine Pitrou <antoine@python.org>
Signed-off-by: Matt Topol <zotthewizard@gmail.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging a pull request may close this issue.

4 participants