Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[C#] Slice value buffers when writing slices of list arrays and binary arrays #41225

Closed
adamreeve opened this issue Apr 15, 2024 · 1 comment
Closed

Comments

@adamreeve
Copy link
Contributor

Describe the enhancement requested

This is a follow up to #40517, which made round-tripping of sliced arrays via the IPC format work. When writing list arrays and binary/string arrays the full values buffer is written and the value offsets and validity buffer are sliced, but to reduce file sizes the value buffer should be sliced and the value offsets adjusted. Eg. see the C++ implementation.

Component(s)

C#

CurtHagenlocher pushed a commit that referenced this issue Apr 16, 2024
… arrays in IPC format (#41230)

### Rationale for this change

This reduces file sizes when writing sliced binary or list arrays to IPC format.

### What changes are included in this PR?

Changes `ArrowStreamWriter` to write only the subset of the values that is needed rather than the full value buffer when writing a `ListArray` or `BinaryArray`, and compute shifted value offset buffers.

### Are these changes tested?

This code is covered by existing tests and the change doesn't introduce any difference in the observed array values, so I haven't added new tests or checks.

I did change how list arrays are compared though as we can no longer compare the value and value offset buffers directly, so the tests now get list items as arrays and create a new `ArrayComparer` to compare them. This meant that array offsets are no longer always zero, so I've changed the offset assertions to only be used in strict mode.

### Are there any user-facing changes?

Yes, this might reduce IPC file sizes for users writing sliced data.
* GitHub Issue: #41225

Authored-by: Adam Reeve <adreeve@gmail.com>
Signed-off-by: Curt Hagenlocher <curt@hagenlocher.org>
@CurtHagenlocher
Copy link
Contributor

Issue resolved by pull request 41230
#41230

@CurtHagenlocher CurtHagenlocher added this to the 17.0.0 milestone Apr 16, 2024
@raulcd raulcd modified the milestones: 17.0.0, 16.1.0 Apr 29, 2024
raulcd pushed a commit that referenced this issue Apr 29, 2024
… arrays in IPC format (#41230)

### Rationale for this change

This reduces file sizes when writing sliced binary or list arrays to IPC format.

### What changes are included in this PR?

Changes `ArrowStreamWriter` to write only the subset of the values that is needed rather than the full value buffer when writing a `ListArray` or `BinaryArray`, and compute shifted value offset buffers.

### Are these changes tested?

This code is covered by existing tests and the change doesn't introduce any difference in the observed array values, so I haven't added new tests or checks.

I did change how list arrays are compared though as we can no longer compare the value and value offset buffers directly, so the tests now get list items as arrays and create a new `ArrayComparer` to compare them. This meant that array offsets are no longer always zero, so I've changed the offset assertions to only be used in strict mode.

### Are there any user-facing changes?

Yes, this might reduce IPC file sizes for users writing sliced data.
* GitHub Issue: #41225

Authored-by: Adam Reeve <adreeve@gmail.com>
Signed-off-by: Curt Hagenlocher <curt@hagenlocher.org>
tolleybot pushed a commit to tmct/arrow that referenced this issue May 2, 2024
…binary arrays in IPC format (apache#41230)

### Rationale for this change

This reduces file sizes when writing sliced binary or list arrays to IPC format.

### What changes are included in this PR?

Changes `ArrowStreamWriter` to write only the subset of the values that is needed rather than the full value buffer when writing a `ListArray` or `BinaryArray`, and compute shifted value offset buffers.

### Are these changes tested?

This code is covered by existing tests and the change doesn't introduce any difference in the observed array values, so I haven't added new tests or checks.

I did change how list arrays are compared though as we can no longer compare the value and value offset buffers directly, so the tests now get list items as arrays and create a new `ArrayComparer` to compare them. This meant that array offsets are no longer always zero, so I've changed the offset assertions to only be used in strict mode.

### Are there any user-facing changes?

Yes, this might reduce IPC file sizes for users writing sliced data.
* GitHub Issue: apache#41225

Authored-by: Adam Reeve <adreeve@gmail.com>
Signed-off-by: Curt Hagenlocher <curt@hagenlocher.org>
tolleybot pushed a commit to tmct/arrow that referenced this issue May 4, 2024
…binary arrays in IPC format (apache#41230)

### Rationale for this change

This reduces file sizes when writing sliced binary or list arrays to IPC format.

### What changes are included in this PR?

Changes `ArrowStreamWriter` to write only the subset of the values that is needed rather than the full value buffer when writing a `ListArray` or `BinaryArray`, and compute shifted value offset buffers.

### Are these changes tested?

This code is covered by existing tests and the change doesn't introduce any difference in the observed array values, so I haven't added new tests or checks.

I did change how list arrays are compared though as we can no longer compare the value and value offset buffers directly, so the tests now get list items as arrays and create a new `ArrayComparer` to compare them. This meant that array offsets are no longer always zero, so I've changed the offset assertions to only be used in strict mode.

### Are there any user-facing changes?

Yes, this might reduce IPC file sizes for users writing sliced data.
* GitHub Issue: apache#41225

Authored-by: Adam Reeve <adreeve@gmail.com>
Signed-off-by: Curt Hagenlocher <curt@hagenlocher.org>
rok pushed a commit to tmct/arrow that referenced this issue May 8, 2024
…binary arrays in IPC format (apache#41230)

### Rationale for this change

This reduces file sizes when writing sliced binary or list arrays to IPC format.

### What changes are included in this PR?

Changes `ArrowStreamWriter` to write only the subset of the values that is needed rather than the full value buffer when writing a `ListArray` or `BinaryArray`, and compute shifted value offset buffers.

### Are these changes tested?

This code is covered by existing tests and the change doesn't introduce any difference in the observed array values, so I haven't added new tests or checks.

I did change how list arrays are compared though as we can no longer compare the value and value offset buffers directly, so the tests now get list items as arrays and create a new `ArrayComparer` to compare them. This meant that array offsets are no longer always zero, so I've changed the offset assertions to only be used in strict mode.

### Are there any user-facing changes?

Yes, this might reduce IPC file sizes for users writing sliced data.
* GitHub Issue: apache#41225

Authored-by: Adam Reeve <adreeve@gmail.com>
Signed-off-by: Curt Hagenlocher <curt@hagenlocher.org>
rok pushed a commit to tmct/arrow that referenced this issue May 8, 2024
…binary arrays in IPC format (apache#41230)

### Rationale for this change

This reduces file sizes when writing sliced binary or list arrays to IPC format.

### What changes are included in this PR?

Changes `ArrowStreamWriter` to write only the subset of the values that is needed rather than the full value buffer when writing a `ListArray` or `BinaryArray`, and compute shifted value offset buffers.

### Are these changes tested?

This code is covered by existing tests and the change doesn't introduce any difference in the observed array values, so I haven't added new tests or checks.

I did change how list arrays are compared though as we can no longer compare the value and value offset buffers directly, so the tests now get list items as arrays and create a new `ArrayComparer` to compare them. This meant that array offsets are no longer always zero, so I've changed the offset assertions to only be used in strict mode.

### Are there any user-facing changes?

Yes, this might reduce IPC file sizes for users writing sliced data.
* GitHub Issue: apache#41225

Authored-by: Adam Reeve <adreeve@gmail.com>
Signed-off-by: Curt Hagenlocher <curt@hagenlocher.org>
vibhatha pushed a commit to vibhatha/arrow that referenced this issue May 25, 2024
…binary arrays in IPC format (apache#41230)

### Rationale for this change

This reduces file sizes when writing sliced binary or list arrays to IPC format.

### What changes are included in this PR?

Changes `ArrowStreamWriter` to write only the subset of the values that is needed rather than the full value buffer when writing a `ListArray` or `BinaryArray`, and compute shifted value offset buffers.

### Are these changes tested?

This code is covered by existing tests and the change doesn't introduce any difference in the observed array values, so I haven't added new tests or checks.

I did change how list arrays are compared though as we can no longer compare the value and value offset buffers directly, so the tests now get list items as arrays and create a new `ArrayComparer` to compare them. This meant that array offsets are no longer always zero, so I've changed the offset assertions to only be used in strict mode.

### Are there any user-facing changes?

Yes, this might reduce IPC file sizes for users writing sliced data.
* GitHub Issue: apache#41225

Authored-by: Adam Reeve <adreeve@gmail.com>
Signed-off-by: Curt Hagenlocher <curt@hagenlocher.org>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

No branches or pull requests

3 participants