HADOOP-17905. Modify Text.ensureCapacity() to efficiently max out the backing array size #3423
Conversation
… backing array size
Change-Id: I3f804011519166204f63a9f1a4b297799210f163
Force-pushed fc99e97 to ba11e4d
```diff
@@ -301,9 +305,18 @@ public void clear() {
   */
  private boolean ensureCapacity(final int capacity) {
    if (bytes.length < capacity) {
      // use long to allow overflow
```
Optionally, we could also check here whether capacity > ARRAY_MAX_SIZE and fail gracefully with a good error message. It's not strictly necessary, though: System.arraycopy() would fail later anyway.
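Such a graceful check could look like the following sketch. This is a hypothetical illustration, not code from the patch: the class, the `allocate` method, and the `ARRAY_MAX_SIZE` value are all assumptions.

```java
// Hypothetical sketch: fail fast with a clear message instead of letting a
// later System.arraycopy() or array allocation blow up with a vague error.
public class CapacityCheck {
    // Practical cap also used by several JDK collections; assumed here.
    static final int ARRAY_MAX_SIZE = Integer.MAX_VALUE - 8;

    static byte[] allocate(int capacity) {
        if (capacity < 0 || capacity > ARRAY_MAX_SIZE) {
            throw new IllegalArgumentException(
                "Requested capacity " + capacity
                + " is outside the supported range [0, " + ARRAY_MAX_SIZE + "]");
        }
        return new byte[capacity];
    }

    public static void main(String[] args) {
        System.out.println(allocate(16).length); // prints 16
    }
}
```

The point of the explicit check is only diagnostics: the JVM would reject the oversized allocation anyway, but with a less helpful error.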
💔 -1 overall
This message was automatically generated.
```diff
@@ -73,6 +73,10 @@ protected CharsetDecoder initialValue() {
    }
  };

  // max size of the byte array, seems to be a safe choice for multiple VMs
```
Multiple VMs?
Maybe I should have written "different kinds of VMs" (OpenJDK, HotSpot, etc.). It's more of a practical value that will likely work across different versions. Some details: https://programming.guide/java/array-maximum-length.html. If this comment is confusing, I can remove it (or perhaps extend it a little bit?).
JVMs, then.
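For context on why `Integer.MAX_VALUE - 8` is the usual "safe choice": some JVMs reserve a few words per array for the object header and count them against the maximum length, so requesting the full `Integer.MAX_VALUE` elements can throw `OutOfMemoryError` even when heap is available. A tiny standalone sketch of the constant (the class name is made up for illustration; the header-size rationale is typical HotSpot behavior, not a guarantee):

```java
public class MaxArraySize {
    // Integer.MAX_VALUE minus a small margin for the array object header,
    // which some VMs count against the addressable array length.
    static final int ARRAY_MAX_SIZE = Integer.MAX_VALUE - 8;

    public static void main(String[] args) {
        System.out.println(ARRAY_MAX_SIZE); // prints 2147483639
    }
}
```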
hadoop-common-project/hadoop-common/src/main/java/org/apache/hadoop/io/Text.java
```java
//
// If the calculated value is beyond the size
// limit, we cap it to ARRAY_MAX_SIZE
int targetSize = (int)Math.min(ARRAY_MAX_SIZE,
```
I would separate it into two lines:

```java
int targetSize = Math.max(tmpCapacity, tmpLength + (tmpLength >> 1));
targetSize = (int)Math.min(ARRAY_MAX_SIZE, targetSize);
```
Why do we need the cast to int by the way?
The overloaded Math.max(long, long) will be used, which returns long, so it has to be cast back to an int. Therefore, in your suggestion, targetSize would also have to be a long (which, again, has to be cast to an int when you instantiate the array).
I would prefer to cast inside so it uses Math.max(int, int) and keeps everything as ints.
Splitting the min and the max onto separate lines is more a matter of readability.
@goiri could you comment on the latest changes?
I think the following is easier to read and requires less casting:

```java
int targetSize = (int) bytes.length + (bytes.length >> 1);
targetSize = Math.max(capacity, targetSize);
targetSize = Math.min(ARRAY_MAX_SIZE, targetSize);
```
The problem is that `bytes.length + (bytes.length >> 1)` might overflow and end up being negative, which results in choosing `capacity` all the time instead of `ARRAY_MAX_SIZE`. So we either temporarily store this as a `long` and cast back to `int`, or we directly check whether `targetSize` is negative after `int targetSize = (int) bytes.length + (bytes.length >> 1);`.
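The overflow described above is easy to demonstrate in isolation. A standalone sketch (not code from the patch; the length value is just a large example):

```java
public class OverflowDemo {
    public static void main(String[] args) {
        int length = 2_000_000_000;                      // large backing array length
        int grownInt = length + (length >> 1);           // sum exceeds Integer.MAX_VALUE, wraps negative
        long grownLong = (long) length + (length >> 1);  // widened to long first, so no overflow
        System.out.println(grownInt);   // prints -1294967296
        System.out.println(grownLong);  // prints 3000000000
    }
}
```

A negative `grownInt` is exactly why `Math.max(capacity, grownInt)` would always pick `capacity`, silently disabling the 1.5x growth.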
With the explanation it makes sense to me.
If we are protecting against overflows, then we should make it explicit:

```java
long targetSizeLong = bytes.length + (bytes.length >> 1);
int targetSize = (int) Math.min(targetSizeLong, ARRAY_MAX_SIZE);
targetSize = Math.max(capacity, targetSize);
```
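Putting the thread's suggestions together, one overflow-safe shape might be the following sketch. The identifiers and the standalone class are assumptions for illustration, not the merged patch; note the explicit `(long)` cast so the sum itself is computed in 64-bit arithmetic before any comparison:

```java
public class GrowthSketch {
    // Assumed practical cap on array length (see discussion above).
    static final int ARRAY_MAX_SIZE = Integer.MAX_VALUE - 8;

    // Grow by ~1.5x, honor the requested capacity, cap at ARRAY_MAX_SIZE.
    static int targetSize(int length, int capacity) {
        long grown = (long) length + (length >> 1); // widen before adding: no int overflow
        long target = Math.max(capacity, grown);    // never shrink below the request
        return (int) Math.min(ARRAY_MAX_SIZE, target); // cap, then narrow safely
    }

    public static void main(String[] args) {
        System.out.println(targetSize(100, 120));          // prints 150
        System.out.println(targetSize(2_000_000_000, 10)); // prints 2147483639 (capped)
    }
}
```

Because `grown` can never be negative here, the min/max ordering is no longer load-bearing, which sidesteps the subtle bug the reviewers were debating.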
Change-Id: Idb901f386dd46d9ee36f23d40a31ca5f3d2c2afc
Force-pushed b928024 to d82e53f
+1 LGTM.
Change-Id: Id79df0aaf0f7f6eed88627474c763ef938db26c8
…... (apache#3423) Change-Id: I28bc6338abcfddd76edd6d3f146b88f4c38a122b (cherry picked from commit e9944164fd215ef0b83486d8237a3ce2053b2dcc)
… out the... (apache#3423)" This reverts commit d37662f. Change-Id: I485958d9f7035d392511bed912e52e77a347d01f
… backing array size
Change-Id: I3cfae85000c5fa7aa86c40c2d7efa282958178bf
Description of PR
Allow org.apache.hadoop.io.Text to expand the underlying byte array to a safe maximum.
How was this patch tested?
For code changes:
LICENSE, LICENSE-binary, NOTICE-binary files?