Backport [SPARK-24133][SQL] Check for integer overflows when resizing WritableColumnVectors #21227
Conversation
…ColumnVectors

`ColumnVector`s store string data in one big byte array. Since the array size is capped at just under `Integer.MAX_VALUE`, a single `ColumnVector` cannot store more than 2GB of string data. But since Parquet files commonly contain large blobs stored as strings, and `ColumnVector`s by default carry 4096 values, it's entirely possible to go past that limit. In such cases a negative capacity is requested from `WritableColumnVector.reserve()`. The call succeeds (the requested capacity is smaller than the already allocated capacity), and consequently `java.lang.ArrayIndexOutOfBoundsException` is thrown when the reader actually attempts to put the data into the array.

This change introduces a simple check for integer overflow to `WritableColumnVector.reserve()`, which should help catch the error earlier and provide a more informative exception. Additionally, the error message in `WritableColumnVector.throwUnsupportedException()` was corrected, as it previously encouraged users to increase rather than reduce the batch size.

New unit tests were added.

Author: Ala Luszczak <ala@databricks.com>

Closes apache#21206 from ala/overflow-reserve.

(cherry picked from commit 8bd2702)
Signed-off-by: Ala Luszczak <ala@databricks.com>
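To make the failure mode concrete, here is a minimal, self-contained sketch (illustrative class and numbers, not Spark's actual code) of how the requested capacity wraps negative and what the new check does:

```java
// Illustrative sketch only: why a negative capacity request appears,
// and how the sign check in reserve() fails fast.
public class OverflowDemo {

  static int capacity = 1 << 30;  // pretend ~1 GiB is already allocated

  static void reserve(int requiredCapacity) {
    if (requiredCapacity < 0) {
      // Patched behavior: a negative request can only mean int overflow.
      throw new RuntimeException(
          "Cannot reserve additional contiguous bytes in the vectorized reader"
              + " (requested = " + requiredCapacity + " bytes)");
    } else if (requiredCapacity > capacity) {
      capacity = (int) Math.min(Integer.MAX_VALUE, requiredCapacity * 2L);
    }
    // Pre-patch, a negative requiredCapacity fell through both branches
    // (it compares "smaller" than capacity), and the caller only failed
    // later with ArrayIndexOutOfBoundsException when writing the bytes.
  }

  public static void main(String[] args) {
    // 4096 rows of ~600 KB strings need ~2.4 GB, past Integer.MAX_VALUE.
    int needed = 4096 * 600_000;  // wraps to a negative int
    System.out.println(needed);   // prints -1837367296
    reserve(needed);              // with the check: throws immediately
  }
}
```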
@gatorsmile I got rid of the references to the batch size config (not available in 2.3), and now it should be fine.
```diff
@@ -81,7 +81,9 @@ public void close() {
   }
 
   public void reserve(int requiredCapacity) {
-    if (requiredCapacity > capacity) {
+    if (requiredCapacity < 0) {
+      throwUnsupportedException(requiredCapacity, null);
```
How about using `ArithmeticException` instead of `null`?
@adrian-wang Since this is just a backport, I would prefer to keep the modifications to the original commit to a minimum. Also, it might be a bit confusing to the user if we created an `ArithmeticException` here, as this is not the place where the invalid computation was performed, but just the place where we realize the input is not sane.
thanks for the explanation!
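For reference, the suggested alternative would have looked roughly like the fragment below (hypothetical, not what was merged, per the discussion above):

```java
// Hypothetical variant from the review discussion -- NOT the merged code.
// Attach an ArithmeticException as the cause instead of passing null:
if (requiredCapacity < 0) {
  throwUnsupportedException(requiredCapacity,
      new ArithmeticException("int overflow while computing required capacity"));
}
```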
Test build #90119 has finished for PR 21227 at commit
LGTM - merging to 2.3. Thanks!
…ing WritableColumnVectors

`ColumnVector`s store string data in one big byte array. Since the array size is capped at just under `Integer.MAX_VALUE`, a single `ColumnVector` cannot store more than 2GB of string data. But since Parquet files commonly contain large blobs stored as strings, and `ColumnVector`s by default carry 4096 values, it's entirely possible to go past that limit. In such cases a negative capacity is requested from `WritableColumnVector.reserve()`. The call succeeds (the requested capacity is smaller than the already allocated capacity), and consequently `java.lang.ArrayIndexOutOfBoundsException` is thrown when the reader actually attempts to put the data into the array.

This change introduces a simple check for integer overflow to `WritableColumnVector.reserve()`, which should help catch the error earlier and provide a more informative exception. Additionally, the error message in `WritableColumnVector.throwUnsupportedException()` was corrected.

New unit tests were added.

Author: Ala Luszczak <ala@databricks.com>

Closes #21206 from ala/overflow-reserve.

(cherry picked from commit 8bd2702)
Signed-off-by: Ala Luszczak <ala@databricks.com>

Author: Ala Luszczak <ala@databricks.com>

Closes #21227 from ala/cherry-pick-overflow-reserve.
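For readers who want to see the fail-fast behavior directly, here is a rough sketch in the spirit of the new unit tests (the class and constructor follow Spark 2.3's `org.apache.spark.sql.execution.vectorized` API; the actual added tests may differ):

```java
import org.apache.spark.sql.execution.vectorized.OnHeapColumnVector;
import org.apache.spark.sql.types.DataTypes;

public class ReserveOverflowCheck {
  public static void main(String[] args) {
    // A 4096-slot string vector, matching the default batch size
    // the Parquet reader allocates.
    OnHeapColumnVector vector = new OnHeapColumnVector(4096, DataTypes.StringType);
    try {
      vector.reserve(-1);  // stands in for a capacity request that overflowed
      System.out.println("no exception: pre-patch behavior");
    } catch (RuntimeException e) {
      System.out.println("failed fast: " + e.getMessage());
    } finally {
      vector.close();
    }
  }
}
```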