Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Fix misaligned reference and logic error in crc32 #1906

Merged
merged 1 commit into from
Jun 19, 2022
Merged

Fix misaligned reference and logic error in crc32 #1906

merged 1 commit into from
Jun 19, 2022

Conversation

saethlin
Copy link
Contributor

@saethlin saethlin commented Jun 19, 2022

Which issue does this PR close?

Closes #.

Rationale for this change

Previously, this code tried to turn a &[u8] into a &[u32] without
checking alignment. This means it could and did create misaligned
references, which is UB. This can be detected by running the tests with
-Zbuild-std --target=x86_64-unknown-linux-gnu (or whatever your host
is). This change adopts the approach from the murmurhash implementation.

The previous implementation also ignored the tail bytes. The loop at the
end treats num_bytes as if it is the full length of the slice, but it
isn't, num_bytes number of bytes after the last 4-byte group. This can
be observed for example by changing "hello" to just "hell" in the tests.
Under the old implementation, the test will still pass. Now, the value
that comes out changes, and "hello" and "hell" hash to different values.

What changes are included in this PR?

A soundness and correctness fix for crc32_hash

Are there any user-facing changes?

I don't know. Can users observe the values that come out of crc32_hash?

Previously, this code tried to turn a &[u8] into a &[u32] without
checking alignment. This means it could and did create misaligned
references, which is UB. This can be detected by running the tests with
-Zbuild-std --target=x86_64-unknown-linux-gnu (or whatever your host
is). This change adopts the approach from the murmurhash implementation.

The previous implementation also ignored the tail bytes. The loop at the
end treats num_bytes as if it is the full length of the slice, but it
isn't, num_bytes number of bytes after the last 4-byte group. This can
be observed for example by changing "hello" to just "hell" in the tests.
Under the old implementation, the test will still pass. Now, the value
that comes out changes, and "hello" and "hell" hash to different values.
@github-actions github-actions bot added the parquet Changes to the parquet crate label Jun 19, 2022
@codecov-commenter
Copy link

Codecov Report

Merging #1906 (5d6bd7c) into master (535cd20) will decrease coverage by 0.00%.
The diff coverage is 100.00%.

❗ Current head 5d6bd7c differs from pull request most recent head cba393e. Consider uploading reports for the commit cba393e to get more accurate results

@@            Coverage Diff             @@
##           master    #1906      +/-   ##
==========================================
- Coverage   83.42%   83.42%   -0.01%     
==========================================
  Files         214      214              
  Lines       57025    57018       -7     
==========================================
- Hits        47574    47567       -7     
  Misses       9451     9451              
Impacted Files Coverage Δ
parquet/src/util/hash_util.rs 95.16% <100.00%> (-0.50%) ⬇️
arrow/src/datatypes/datatype.rs 65.42% <0.00%> (-0.38%) ⬇️
parquet_derive/src/parquet_field.rs 65.75% <0.00%> (ø)
parquet/src/encodings/encoding.rs 93.65% <0.00%> (+0.19%) ⬆️

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 535cd20...cba393e. Read the comment docs.

Copy link
Contributor

@tustvold tustvold left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This looks good to me, I am a bit surprised at how incorrect this method was. Is it actually used anywhere, mainly wondering if we have a gap in our test coverage somewhere...

while offset < num_bytes {
hash = _mm_crc32_u8(hash, bytes[offset]);
offset += 1;
let remainder = bytes.len() % 4;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@tustvold tustvold merged commit ded6316 into apache:master Jun 19, 2022
mcheshkov pushed a commit to cube-js/arrow-rs that referenced this pull request Jul 16, 2024
Previously, this code tried to turn a &[u8] into a &[u32] without
checking alignment. This means it could and did create misaligned
references, which is UB. This can be detected by running the tests with
-Zbuild-std --target=x86_64-unknown-linux-gnu (or whatever your host
is). This change adopts the approach from the murmurhash implementation.

The previous implementation also ignored the tail bytes. The loop at the
end treats num_bytes as if it is the full length of the slice, but it
isn't, num_bytes number of bytes after the last 4-byte group. This can
be observed for example by changing "hello" to just "hell" in the tests.
Under the old implementation, the test will still pass. Now, the value
that comes out changes, and "hello" and "hell" hash to different values.
mcheshkov pushed a commit to cube-js/arrow-rs that referenced this pull request Jul 17, 2024
Previously, this code tried to turn a &[u8] into a &[u32] without
checking alignment. This means it could and did create misaligned
references, which is UB. This can be detected by running the tests with
-Zbuild-std --target=x86_64-unknown-linux-gnu (or whatever your host
is). This change adopts the approach from the murmurhash implementation.

The previous implementation also ignored the tail bytes. The loop at the
end treats num_bytes as if it is the full length of the slice, but it
isn't, num_bytes number of bytes after the last 4-byte group. This can
be observed for example by changing "hello" to just "hell" in the tests.
Under the old implementation, the test will still pass. Now, the value
that comes out changes, and "hello" and "hell" hash to different values.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
parquet Changes to the parquet crate
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

3 participants