
Backport [SPARK-24133][SQL] Check for integer overflows when resizing WritableColumnVectors #21227

Closed
wants to merge 1 commit into apache:branch-2.3 from ala:cherry-pick-overflow-reserve

Conversation

@ala ala (Contributor) commented May 3, 2018

`ColumnVector`s store string data in one big byte array. Since the array size is capped at just under `Integer.MAX_VALUE`, a single `ColumnVector` cannot store more than 2GB of string data.
But since Parquet files commonly contain large blobs stored as strings, and `ColumnVector`s by default carry 4096 values, it's entirely possible to go past that limit. In such cases a negative capacity is requested from `WritableColumnVector.reserve()`. The call succeeds (the requested capacity is smaller than the already allocated capacity), and consequently `java.lang.ArrayIndexOutOfBoundsException` is thrown when the reader actually attempts to put the data into the array.
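
To make the failure concrete, here is a minimal standalone sketch (not Spark code; the byte counts are invented for illustration) of how the requested capacity wraps negative:

```java
public class NegativeCapacityDemo {
  public static void main(String[] args) {
    // Suppose the vector's child byte array already holds ~2.1 GB of string data,
    // and the reader wants to append another ~100 MB value.
    int usedBytes = 2_100_000_000;          // close to Integer.MAX_VALUE (2_147_483_647)
    int nextValueBytes = 100_000_000;

    // The sum exceeds 2^31 - 1, so 32-bit arithmetic wraps it to a negative number.
    int requiredCapacity = usedBytes + nextValueBytes;
    System.out.println(requiredCapacity);   // prints -2094967296

    // A negative requiredCapacity is trivially "smaller" than the current capacity,
    // so a check like `requiredCapacity > capacity` never triggers a resize, and the
    // later array write fails with ArrayIndexOutOfBoundsException instead.
  }
}
```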

This change introduces a simple check for integer overflow to `WritableColumnVector.reserve()` which should help catch the error earlier and provide a more informative exception. Additionally, the error message in `WritableColumnVector.throwUnsupportedException()` was corrected.

New unit tests were added.

Author: Ala Luszczak ala@databricks.com

Closes #21206 from ala/overflow-reserve.

(cherry picked from commit 8bd2702)
Signed-off-by: Ala Luszczak ala@databricks.com

…ColumnVectors

`ColumnVector`s store string data in one big byte array. Since the array size is capped at just under `Integer.MAX_VALUE`, a single `ColumnVector` cannot store more than 2GB of string data.
But since Parquet files commonly contain large blobs stored as strings, and `ColumnVector`s by default carry 4096 values, it's entirely possible to go past that limit. In such cases a negative capacity is requested from `WritableColumnVector.reserve()`. The call succeeds (the requested capacity is smaller than the already allocated capacity), and consequently `java.lang.ArrayIndexOutOfBoundsException` is thrown when the reader actually attempts to put the data into the array.

This change introduces a simple check for integer overflow to `WritableColumnVector.reserve()` which should help catch the error earlier and provide a more informative exception. Additionally, the error message in `WritableColumnVector.throwUnsupportedException()` was corrected, as it previously encouraged users to increase rather than reduce the batch size.

New unit tests were added (see the sketch after this commit message).

Author: Ala Luszczak <ala@databricks.com>

Closes apache#21206 from ala/overflow-reserve.

(cherry picked from commit 8bd2702)
Signed-off-by: Ala Luszczak <ala@databricks.com>
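
The commit message above mentions new unit tests. For illustration, here is a hedged sketch of the kind of check such a test exercises, assuming the public `OnHeapColumnVector(int, DataType)` constructor from Spark 2.3; this is not the actual test added by the commit:

```java
import org.apache.spark.sql.execution.vectorized.OnHeapColumnVector;
import org.apache.spark.sql.types.DataTypes;

public class ReserveOverflowCheck {
  public static void main(String[] args) {
    OnHeapColumnVector vector = new OnHeapColumnVector(4096, DataTypes.ByteType);
    try {
      // With the fix, a negative (overflowed) capacity request fails fast with an
      // informative RuntimeException instead of being silently accepted and blowing
      // up later with ArrayIndexOutOfBoundsException.
      vector.reserve(-1);
      throw new AssertionError("negative reserve() should have thrown");
    } catch (RuntimeException e) {
      System.out.println("rejected as expected: " + e.getMessage());
    } finally {
      vector.close();
    }
  }
}
```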
@ala ala (Contributor, Author) commented May 3, 2018

@gatorsmile I got rid of the references to the batch size config (not available in 2.3), and now it should be fine.

@@ -81,7 +81,9 @@ public void close() {
   }
 
   public void reserve(int requiredCapacity) {
-    if (requiredCapacity > capacity) {
+    if (requiredCapacity < 0) {
+      throwUnsupportedException(requiredCapacity, null);
@adrian-wang (Contributor) commented:
How about using ArithmeticException instead of null?

@ala (Contributor, Author) commented:

@adrian-wang Since this is just a backport, I would prefer to keep the modifications of the original commit to a minimum. Also, it might be a bit confusing to the user if we create an ArithmeticException here, as this is not the place where the invalid computation was performed, but just the place where we realize the input is not sane.

@adrian-wang (Contributor) commented:

thanks for the explanation!

@ala ala changed the title [SPARK-24133][SQL] Check for integer overflows when resizing WritableColumnVectors Backport [SPARK-24133][SQL] Check for integer overflows when resizing WritableColumnVectors May 3, 2018
@SparkQA commented May 3, 2018

Test build #90119 has finished for PR 21227 at commit 638aea7.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@hvanhovell (Contributor) commented:

LGTM - merging to 2.3. Thanks!

asfgit pushed a commit that referenced this pull request May 3, 2018
…ing WritableColumnVectors

`ColumnVector`s store string data in one big byte array. Since the array size is capped at just under `Integer.MAX_VALUE`, a single `ColumnVector` cannot store more than 2GB of string data.
But since Parquet files commonly contain large blobs stored as strings, and `ColumnVector`s by default carry 4096 values, it's entirely possible to go past that limit. In such cases a negative capacity is requested from `WritableColumnVector.reserve()`. The call succeeds (the requested capacity is smaller than the already allocated capacity), and consequently `java.lang.ArrayIndexOutOfBoundsException` is thrown when the reader actually attempts to put the data into the array.

This change introduces a simple check for integer overflow to `WritableColumnVector.reserve()` which should help catch the error earlier and provide a more informative exception. Additionally, the error message in `WritableColumnVector.throwUnsupportedException()` was corrected.

New unit tests were added.

Author: Ala Luszczak <ala@databricks.com>

Closes #21206 from ala/overflow-reserve.

(cherry picked from commit 8bd2702)
Signed-off-by: Ala Luszczak <ala@databricks.com>

Author: Ala Luszczak <ala@databricks.com>

Closes #21227 from ala/cherry-pick-overflow-reserve.
@ala ala closed this May 3, 2018