Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[SPARK-20474] Fixing OnHeapColumnVector reallocation #17773

Closed

Conversation

michal-databricks
Copy link
Contributor

What changes were proposed in this pull request?

OnHeapColumnVector reallocation copies to the new storage data up to 'elementsAppended'. This variable is only updated when using the ColumnVector.appendX API, while ColumnVector.putX is more commonly used.

How was this patch tested?

Tested using existing unit tests.

@hvanhovell
Copy link
Contributor

ok to test

@SparkQA
Copy link

SparkQA commented Apr 26, 2017

Test build #76184 has finished for PR 17773 at commit 3914766.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@@ -410,53 +410,53 @@ protected void reserveInternal(int newCapacity) {
int[] newLengths = new int[newCapacity];
int[] newOffsets = new int[newCapacity];
if (this.arrayLengths != null) {
System.arraycopy(this.arrayLengths, 0, newLengths, 0, elementsAppended);
System.arraycopy(this.arrayOffsets, 0, newOffsets, 0, elementsAppended);
System.arraycopy(this.arrayLengths, 0, newLengths, 0, capacity);
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nice catch. Do we also need to fix reserveInternal in OffHeapColumnVector? Additionally, after this change, do we even need elementsAppended anymore?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

elementsAppended is necessary to keep the tail position by append<TYPE>().

@sameeragarwal
Copy link
Member

add to whitelist

1 similar comment
@gatorsmile
Copy link
Member

add to whitelist

@SparkQA
Copy link

SparkQA commented Apr 26, 2017

Test build #76191 has finished for PR 17773 at commit 3914766.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA
Copy link

SparkQA commented Apr 26, 2017

Test build #76192 has finished for PR 17773 at commit 3914766.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@rxin
Copy link
Contributor

rxin commented Apr 26, 2017

Merging in master/branch-2.2.

asfgit pushed a commit that referenced this pull request Apr 26, 2017
## What changes were proposed in this pull request?
OnHeapColumnVector reallocation copies to the new storage data up to 'elementsAppended'. This variable is only updated when using the ColumnVector.appendX API, while ColumnVector.putX is more commonly used.

## How was this patch tested?
Tested using existing unit tests.

Author: Michal Szafranski <michal@databricks.com>

Closes #17773 from michal-databricks/spark-20474.

(cherry picked from commit a277ae8)
Signed-off-by: Reynold Xin <rxin@databricks.com>
@asfgit asfgit closed this in a277ae8 Apr 26, 2017
@kiszk
Copy link
Member

kiszk commented Apr 27, 2017

Do we need similar changes for OffHeapColumnVector?

@michal-databricks
Copy link
Contributor Author

Actually yes, I missed it becauseOffHeapColumnVector uses Platform.reallocateMemory, but this still has a parameter to indicate how much to copy (and it is currently based on elementsAppended.

@kiszk
Copy link
Member

kiszk commented Apr 27, 2017

Yes, I think it should see capacity. I will create a PR tomorrow for OffHeapColumnVector.

asfgit pushed a commit that referenced this pull request May 2, 2017
## What changes were proposed in this pull request?

As #17773 revealed `OnHeapColumnVector` may copy a part of the original storage.

`OffHeapColumnVector` reallocation also copies to the new storage data up to 'elementsAppended'. This variable is only updated when using the `ColumnVector.appendX` API, while `ColumnVector.putX` is more commonly used.
This PR copies the new storage data up to the previously-allocated size in`OffHeapColumnVector`.

## How was this patch tested?

Existing test suites

Author: Kazuaki Ishizaki <ishizaki@jp.ibm.com>

Closes #17811 from kiszk/SPARK-20537.

(cherry picked from commit afb21bf)
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
asfgit pushed a commit that referenced this pull request May 2, 2017
## What changes were proposed in this pull request?

As #17773 revealed `OnHeapColumnVector` may copy a part of the original storage.

`OffHeapColumnVector` reallocation also copies to the new storage data up to 'elementsAppended'. This variable is only updated when using the `ColumnVector.appendX` API, while `ColumnVector.putX` is more commonly used.
This PR copies the new storage data up to the previously-allocated size in`OffHeapColumnVector`.

## How was this patch tested?

Existing test suites

Author: Kazuaki Ishizaki <ishizaki@jp.ibm.com>

Closes #17811 from kiszk/SPARK-20537.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

7 participants