Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[C++] Use signed arithmetic for frame of reference in DeltaBitPackEncoder #37939

Closed
etseidl opened this issue Sep 28, 2023 · 0 comments · Fixed by #37940
Closed

[C++] Use signed arithmetic for frame of reference in DeltaBitPackEncoder #37939

etseidl opened this issue Sep 28, 2023 · 0 comments · Fixed by #37940

Comments

@etseidl
Copy link
Contributor

etseidl commented Sep 28, 2023

Describe the enhancement requested

The current implementation of DeltaBitPackEncoder uses unsigned arithmetic to handle possible overflow when calculating deltas (see here). This has unfortunate consequences when encoding small negative deltas. As an example, writing a vector with values {1, 0, -1, 0, 1, 0, -1, 0, 1} produces the following output (starting at the delta binary header):

00000030:                          8001 0409 0202  ................
00000040: 2000 0000 feff ffff feff ffff 0000 0000   ...............
00000050: 0000 0000 feff ffff feff ffff 0000 0000  ................
00000060: 0000 0000 0000 0000 0000 0000 0000 0000  ................
00000070: 0000 0000 0000 0000 0000 0000 0000 0000  ................
00000080: 0000 0000 0000 0000 0000 0000 0000 0000  ................
00000090: 0000 0000 0000 0000 0000 0000 0000 0000  ................
000000a0: 0000 0000 0000 0000 0000 0000 0000 0000  ................
000000b0: 0000 0000 0000 0000 0000 0000 0000 0000  ................
000000c0: 0000 0000

The encoder uses a bit width of 32 for all values. If signed values are used instead, then the result is:

00000030:                     8001 0409 0201 0200  ................
00000040: 0000 a0a0 0000 0000 0000

Here the encoder can use 2 bits per value. This can result in much smaller files, especially in cases where the logical type is less than 32 bits.

Component(s)

C++

pitrou pushed a commit that referenced this issue Oct 3, 2023
…oding DELTA_BINARY_PACKED (#37940)

Closes #37939. 

### What changes are included in this PR?

This PR changes values used in the `DELTA_BINARY_PACKED` encoder to signed types. To gracefully handle overflow, arithmetic is still performed in the unsigned domain, but other operations such as computing the min and max deltas are done in the signed domain.

Using signed types ensures the optimal number of bits is used when encoding the deltas, which was not the case before if any negative deltas were encountered (which is obviously common).

### Are these changes tested?
I've included two tests that result in overflow.

### Are there any user-facing changes?
No

* Closes: #37939

Authored-by: seidl <seidl2@llnl.gov>
Signed-off-by: Antoine Pitrou <antoine@python.org>
@pitrou pitrou added this to the 14.0.0 milestone Oct 3, 2023
JerAguilon pushed a commit to JerAguilon/arrow that referenced this issue Oct 23, 2023
…en encoding DELTA_BINARY_PACKED (apache#37940)

Closes apache#37939. 

### What changes are included in this PR?

This PR changes values used in the `DELTA_BINARY_PACKED` encoder to signed types. To gracefully handle overflow, arithmetic is still performed in the unsigned domain, but other operations such as computing the min and max deltas are done in the signed domain.

Using signed types ensures the optimal number of bits is used when encoding the deltas, which was not the case before if any negative deltas were encountered (which is obviously common).

### Are these changes tested?
I've included two tests that result in overflow.

### Are there any user-facing changes?
No

* Closes: apache#37939

Authored-by: seidl <seidl2@llnl.gov>
Signed-off-by: Antoine Pitrou <antoine@python.org>
loicalleyne pushed a commit to loicalleyne/arrow that referenced this issue Nov 13, 2023
…en encoding DELTA_BINARY_PACKED (apache#37940)

Closes apache#37939. 

### What changes are included in this PR?

This PR changes values used in the `DELTA_BINARY_PACKED` encoder to signed types. To gracefully handle overflow, arithmetic is still performed in the unsigned domain, but other operations such as computing the min and max deltas are done in the signed domain.

Using signed types ensures the optimal number of bits is used when encoding the deltas, which was not the case before if any negative deltas were encountered (which is obviously common).

### Are these changes tested?
I've included two tests that result in overflow.

### Are there any user-facing changes?
No

* Closes: apache#37939

Authored-by: seidl <seidl2@llnl.gov>
Signed-off-by: Antoine Pitrou <antoine@python.org>
dgreiss pushed a commit to dgreiss/arrow that referenced this issue Feb 19, 2024
…en encoding DELTA_BINARY_PACKED (apache#37940)

Closes apache#37939. 

### What changes are included in this PR?

This PR changes values used in the `DELTA_BINARY_PACKED` encoder to signed types. To gracefully handle overflow, arithmetic is still performed in the unsigned domain, but other operations such as computing the min and max deltas are done in the signed domain.

Using signed types ensures the optimal number of bits is used when encoding the deltas, which was not the case before if any negative deltas were encountered (which is obviously common).

### Are these changes tested?
I've included two tests that result in overflow.

### Are there any user-facing changes?
No

* Closes: apache#37939

Authored-by: seidl <seidl2@llnl.gov>
Signed-off-by: Antoine Pitrou <antoine@python.org>
abandy added a commit to abandy/arrow that referenced this issue Mar 16, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
2 participants