[Java] VectorSchemaRoot uses inefficient stream to copy fieldVectors #41573

Closed
schlosna opened this issue May 7, 2024 · 1 comment
schlosna commented May 7, 2024

Describe the bug, including details regarding any error messages, version, and platform.

While reviewing allocation profiling of an Arrow-intensive application, I noticed significant allocations due to `ArrayList#grow()` originating from `org.apache.arrow.vector.VectorSchemaRoot#getFieldVectors()`. That method uses an inefficient `fieldVectors.stream().collect(Collectors.toList())` to create a list copy, which leads to reallocations as the target list grows during collection. This could be replaced with a more efficient `new ArrayList<>(fieldVectors)` to make a pre-sized list copy or, even better, with an unmodifiable view via `Collections.unmodifiableList(fieldVectors)`.
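For illustration, here is a minimal sketch of the three options side by side. It uses a plain `List<String>` as a stand-in for the `fieldVectors` list, and the class and method names (`FieldVectorCopyDemo`, `streamCopy`, `presizedCopy`, `unmodifiableView`) are hypothetical, not Arrow APIs.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.List;
import java.util.stream.Collectors;

public class FieldVectorCopyDemo {

    // Current approach: the collector starts with an empty ArrayList and
    // grows it as elements arrive, which is where ArrayList#grow() shows up.
    static <T> List<T> streamCopy(List<T> source) {
        return source.stream().collect(Collectors.toList());
    }

    // Copy constructor: the backing array is sized from the source up front,
    // so no growth happens while copying.
    static <T> List<T> presizedCopy(List<T> source) {
        return new ArrayList<>(source);
    }

    // Unmodifiable view: a thin wrapper around the source, no element copy.
    // Mutating calls on the view throw UnsupportedOperationException.
    static <T> List<T> unmodifiableView(List<T> source) {
        return Collections.unmodifiableList(source);
    }

    public static void main(String[] args) {
        // Strings stand in for FieldVector instances (assumption for the demo).
        List<String> fieldVectors = new ArrayList<>(Arrays.asList("id", "name", "price"));

        System.out.println(streamCopy(fieldVectors));
        System.out.println(presizedCopy(fieldVectors));

        List<String> view = unmodifiableView(fieldVectors);
        try {
            view.add("extra");
        } catch (UnsupportedOperationException e) {
            System.out.println("view is read-only");
        }
    }
}
```

The two copies are snapshots, while the view tracks later changes to the underlying list and rejects mutation. A caller that currently modifies the returned list would see different behavior with the view.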

Component(s)

Java

lidavidm pushed a commit that referenced this issue May 8, 2024
…ldVectors (#41574)

### Rationale for this change

While reviewing allocation profiling of an Arrow-intensive application, I noticed significant allocations due to `ArrayList#grow()` originating from `org.apache.arrow.vector.VectorSchemaRoot#getFieldVectors()`. That method uses an inefficient `fieldVectors.stream().collect(Collectors.toList())` to create a list copy, which leads to reallocations as the target list grows during collection. This could be replaced with a more efficient `new ArrayList<>(fieldVectors)` to make a pre-sized list copy or, even better, with an unmodifiable view via `Collections.unmodifiableList(fieldVectors)`.

### What changes are included in this PR?

* Use `Collections.unmodifiableList(List)` to return an unmodifiable list view of `fieldVectors` from `getFieldVectors()`
* Pre-size the `fieldVectors` `ArrayList` in the static factory `VectorSchemaRoot#create(Schema, BufferAllocator)`
* `VectorSchemaRoot#setRowCount(int)` iterates over the instance's `fieldVectors` instead of a copied list (similar to the existing `allocateNew()`, `clear()`, and `contentToTSVString()`).
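
The three changes above can be sketched roughly as follows. This is not the code from the PR: `SchemaRootSketch` and its nested `Column` class are hypothetical stand-ins for `VectorSchemaRoot` and `FieldVector`, and the factory takes a list of names instead of a `Schema` and `BufferAllocator` so the example runs without Arrow on the classpath.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.List;

public class SchemaRootSketch {

    /** Hypothetical stand-in for an Arrow FieldVector. */
    static final class Column {
        final String name;
        int valueCount;

        Column(String name) {
            this.name = name;
        }

        void setValueCount(int count) {
            this.valueCount = count;
        }
    }

    private final List<Column> fieldVectors;
    private int rowCount;

    private SchemaRootSketch(List<Column> fieldVectors) {
        this.fieldVectors = fieldVectors;
    }

    // Factory: the number of fields is known up front, so the list is
    // allocated at its final capacity and never has to grow.
    static SchemaRootSketch create(List<String> fieldNames) {
        List<Column> vectors = new ArrayList<>(fieldNames.size());
        for (String name : fieldNames) {
            vectors.add(new Column(name));
        }
        return new SchemaRootSketch(vectors);
    }

    // Accessor: returns a read-only view instead of copying the list.
    List<Column> getFieldVectors() {
        return Collections.unmodifiableList(fieldVectors);
    }

    // Iterates the internal list directly instead of going through a copy.
    void setRowCount(int rowCount) {
        this.rowCount = rowCount;
        for (Column vector : fieldVectors) {
            vector.setValueCount(rowCount);
        }
    }

    int getRowCount() {
        return rowCount;
    }

    public static void main(String[] args) {
        SchemaRootSketch root = create(Arrays.asList("id", "name"));
        root.setRowCount(4);
        for (Column column : root.getFieldVectors()) {
            System.out.println(column.name + " valueCount=" + column.valueCount);
        }
    }
}
```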

### Are these changes tested?

These changes are covered by existing unit and integration tests.

### Are there any user-facing changes?

No

* GitHub Issue: #41573

Authored-by: David Schlosnagle <davids@palantir.com>
Signed-off-by: David Li <li.davidm96@gmail.com>
lidavidm commented May 8, 2024

Issue resolved by pull request 41574
#41574

@lidavidm lidavidm added this to the 17.0.0 milestone May 8, 2024
@lidavidm lidavidm closed this as completed May 8, 2024
vibhatha pushed a commit to vibhatha/arrow that referenced this issue May 25, 2024
…py fieldVectors (apache#41574)
