[C++] Alignment not enforced; undefined behavior in Parquet writer #33336
Comments
Antoine Pitrou / @pitrou: |
Antoine Pitrou / @pitrou: |
Jochen Ott:
#include <inttypes.h>
#include <algorithm>
#include <cstring>
#include <type_traits>
#include <utility>

// Load a T from a possibly unaligned address by copying its bytes.
template <typename T>
inline std::enable_if_t<std::is_trivially_copyable<T>::value, T> SafeLoadAs(
    const uint8_t* unaligned) {
  typename std::remove_const<T>::type ret;
  std::memcpy(&ret, unaligned, sizeof(T));
  return ret;
}

std::pair<int64_t, int64_t> minmax(const int64_t* values, int64_t n) {
  int64_t minval = 0, maxval = 0;  // obviously not correct, but not important here
  for (int64_t i = 0; i < n; ++i) {
    // "old":
    int64_t value = values[i];
    // "new":
    // int64_t value = SafeLoadAs<int64_t>((const uint8_t*)(values + i));
    minval = std::min(minval, value);
    maxval = std::max(maxval, value);
  }
  return std::make_pair(minval, maxval);
}

Using compiler options "-O3 -march=sandybridge", gcc 6.3.0 produces the problematic, aligned instruction "vmovdqa" for the "old" code. For the "new" variant, however, it produces "vmovdqu", which seems like the best we could hope for.
Antoine Pitrou / @pitrou: |
Jochen Ott: |
Antoine Pitrou / @pitrou: |
Some gcc versions (such as 6.3.0) may emit an aligned-only load instruction, but the Parquet writer can be called with unaligned buffers. * Closes: apache#33336 Authored-by: Antoine Pitrou <antoine@python.org> Signed-off-by: Matt Topol <zotthewizard@gmail.com>
It is possible to create arrays using unaligned memory addresses (e.g. for int64). This seems to be in line with the Arrow specification, which, as far as I understand, does not require alignment [1].
However, the C++ standard requires alignment, e.g. 8-byte alignment for int64. It is undefined behavior (UB) to create an unaligned pointer or to access data through one.
Typically, this is not an issue in practice on x86, since gcc and other compilers mostly emit instructions that can deal with unaligned data. However, for gcc 6.3.0 (and probably up to and including gcc 7.X), this code:
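As an illustration of the alignment requirement, here is a minimal sketch of how one could check whether an address is suitably aligned for a given type. The helper name `IsAlignedFor` is hypothetical and not part of Arrow; it only compares the address against `alignof(T)`.

```cpp
#include <cstddef>
#include <cstdint>

// Hypothetical helper (not an Arrow API): returns true if the address `p`
// is a multiple of alignof(T), i.e. a T object could legally live there.
template <typename T>
bool IsAlignedFor(const void* p) {
  return reinterpret_cast<std::uintptr_t>(p) % alignof(T) == 0;
}
```

For example, with a buffer declared as `alignas(8) unsigned char buf[24]`, `IsAlignedFor<std::int64_t>(buf)` holds while `IsAlignedFor<std::int64_t>(buf + 1)` does not.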
arrow/cpp/src/parquet/statistics.cc
Line 355 in 4591d76
creates an aligned move instruction (movdqa) for the expression values[i]. This, in turn, triggers a SIGSEGV if values points into an unaligned buffer. Later compiler versions (in particular gcc 9.X, used to build the wheels published on PyPI) emit instructions that can deal with unaligned data (movdqu instead of movdqa).
The Python script "test1.py" reproduces this issue at the Python level; note that it only triggers a SIGSEGV if Arrow is compiled with a compiler that emits movdqa for the code linked above, e.g. gcc 6.3.0.
In the wild, unaligned buffers are rare but can appear, e.g. when deserializing pandas dataframes or numpy arrays with pickle protocol 5, which allows out-of-band byte buffers that are then re-used as Arrow array buffers.
I think the line to first enter the UB regime is this reinterpret_cast:
arrow/cpp/src/parquet/column_writer.cc
Line 1592 in 33f2c0e
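One way to avoid the cast altogether is to keep the data as bytes and copy each element out with memcpy. The following is only a sketch under that assumption, not the fix that was merged; `MinMaxUnaligned` is a hypothetical name, and the function assumes n > 0.

```cpp
#include <algorithm>
#include <cstddef>
#include <cstdint>
#include <cstring>
#include <utility>

// Hypothetical helper (not an Arrow API): computes min and max of `n` int64
// values stored starting at `bytes`, which may be unaligned. Requires n > 0.
std::pair<std::int64_t, std::int64_t> MinMaxUnaligned(const std::uint8_t* bytes,
                                                      std::int64_t n) {
  std::int64_t first;
  std::memcpy(&first, bytes, sizeof(first));  // byte copy, no alignment assumed
  std::int64_t minval = first;
  std::int64_t maxval = first;
  for (std::int64_t i = 1; i < n; ++i) {
    std::int64_t value;
    std::memcpy(&value, bytes + static_cast<std::size_t>(i) * sizeof(value),
                sizeof(value));
    minval = std::min(minval, value);
    maxval = std::max(maxval, value);
  }
  return std::make_pair(minval, maxval);
}
```

Because the function never forms an `int64_t*` into the buffer, the compiler has no basis for assuming 8-byte alignment when choosing load instructions.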
[1] https://arrow.apache.org/docs/format/Columnar.html#buffer-alignment-and-padding merely "recommends" that buffers be aligned, but does not require it.
Reporter: Jochen Ott
Assignee: Antoine Pitrou / @pitrou
Original Issue Attachments:
PRs and other links:
Note: This issue was originally created as ARROW-18141. Please see the migration documentation for further details.