Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[C++] RLE decoder may not big-endian compatible #20620

Closed
asfimport opened this issue Dec 13, 2018 · 5 comments
Closed

[C++] RLE decoder may not big-endian compatible #20620

asfimport opened this issue Dec 13, 2018 · 5 comments
Assignees
Milestone

Comments

@asfimport
Copy link

This issue was found by Coverity. The RleDecoder::NextCounts method has the following code to fetch the repeated literal in repeated runs:

    bool result =
        bit_reader_.GetAligned<T>(static_cast<int>(BitUtil::CeilDiv(bit_width_, 8)),
                                  reinterpret_cast<T*>(&current_value_));

Coverity says this:

Pointer "&this->current_value_" points to an object whose effective type is "unsigned long long" (64 bits, unsigned) but is dereferenced as a narrower "unsigned int" (32 bits, unsigned). This may lead to unexpected results depending on machine endianness.

In addition, it's not obvious whether current_value_ also needs byte-swapping (presumably, at least in the Parquet file format, it's supposed to be stored in little-endian format in the RLE bitstream).

Reporter: Antoine Pitrou / @pitrou
Assignee: Kazuaki Ishizaki / @kiszk

PRs and other links:

Note: This issue was originally created as ARROW-4018. Please see the migration documentation for further details.

@asfimport
Copy link
Author

Antoine Pitrou / @pitrou:
Also [~emkornfield@gmail.com]

@asfimport
Copy link
Author

Micah Kornfield / @emkornfield:
Hmm, i think we've gone back and forth on Endianness support.  I know when the project started I thought it was important because at the time it seemed like Spark was intending to support both (I don't know if it still does).  

 

Are we actually clean in terms of endianness in other places?  I would need to investigate further, but it sounds strange to be slicing a long like coverity describes have you looked to see if this is intended?

@asfimport
Copy link
Author

Antoine Pitrou / @pitrou:

Are we actually clean in terms of endianness in other places?

Presumably no, because we're reinterpreting array bytes as larger types such as int64_t etc. And we're also serializing those bytes directly to disk or wire.

it sounds strange to be slicing a long like coverity describes have you looked to see if this is intended?

I think it's just a dirty implementation shortcut. Instead of doing:

    bool result =
        bit_reader_.GetAligned<T>(static_cast<int>(BitUtil::CeilDiv(bit_width_, 8)),
                                  reinterpret_cast<T*>(&current_value_));

The code could presumably be written as:

    T value;
    bool result =
        bit_reader_.GetAligned<T>(static_cast<int>(BitUtil::CeilDiv(bit_width_, 8)), &value);
    current_value_ = static_cast<uint64_t>(value);

@asfimport
Copy link
Author

Kazuaki Ishizaki / @kiszk:
Minor comment. This code should be

    T value = 0;
    bool result =
        bit_reader_.GetAligned<T>(static_cast<int>(BitUtil::CeilDiv(bit_width_, 8)), &value);
    current_value_ = static_cast<uint64_t>(value);

@asfimport
Copy link
Author

Antoine Pitrou / @pitrou:
Issue resolved by pull request 7144
#7144

@asfimport asfimport added this to the 1.0.0 milestone Jan 11, 2023
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

No branches or pull requests

2 participants