[SPARK-23090][SQL] polish ColumnVector #20277
  private final ArrowVectorAccessor accessor;
  private ArrowColumnVector[] childColumns;

  private void ensureAccessible(int index) {
`ColumnVector` is a performance-critical place; we don't need index checking here, as in the other column vector implementations.
Test build #86179 has started for PR 20277 at commit |
@cloud-fan, did you run any benchmarks? I'd like to make sure that the change from abstract class to interface does not negatively impact performance.
Force-pushed from 08d06a7 to bc6d0af
@hvanhovell good idea. I ran the |
Test build #86240 has finished for PR 20277 at commit
|
Force-pushed from bc6d0af to 77e8c4b
    ensureAccessible(index, 1);
  }

  private void ensureAccessible(int index, int count) {
`ColumnVector` is a performance-critical place; we don't need index checking here, as in the other column vector implementations.
I agree with this for non-debug builds. Can we add an assert for this check at each call site for debugging?
P.S. Sorry for the slow reviews; I am on vacation this week.
How about we do it later? We need to find a central place to put this check, instead of doing it in every implementation.
Doing it later is fine. I agree that the check should be done in one place.
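For readers following the thread, the compromise discussed here (no index checks in normal runs, but a check available for debugging, kept in one place) can be sketched with Java's `assert` statement, which the JVM skips unless assertions are enabled with `-ea`. The class and method names below are hypothetical and are not part of Spark; `checkIndex` merely stands in for the "central place" mentioned above.

```java
// Hypothetical sketch, not Spark code: a tiny int vector whose index check
// lives in one helper and only runs when the JVM is started with -ea.
final class AssertedIntVector {
  private final int[] data;

  AssertedIntVector(int[] data) {
    this.data = data;
  }

  // Central place for the debug-only check. With assertions disabled (the
  // default), the JVM does not evaluate the assert condition.
  private void checkIndex(int rowId) {
    assert rowId >= 0 && rowId < data.length : "rowId out of range: " + rowId;
  }

  int getInt(int rowId) {
    checkIndex(rowId);
    return data[rowId];
  }
}
```

A Java array performs its own bounds check anyway, so the assertion in this sketch only illustrates the pattern; it is not a claim about how any Spark vector implementation behaves.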
Force-pushed from 77e8c4b to f3f9d5e
Test build #86222 has finished for PR 20277 at commit
|
Test build #86246 has finished for PR 20277 at commit
|
Test build #86245 has finished for PR 20277 at commit
|
Test build #86242 has finished for PR 20277 at commit
|
retest this please |
Test build #86254 has finished for PR 20277 at commit
|
Test build #86340 has finished for PR 20277 at commit
|
Force-pushed from 212c841 to eccdca1
Test build #86371 has finished for PR 20277 at commit
|
@@ -53,166 +41,83 @@ public int numNulls() {
   @Override
   public void close() {
     if (childColumns != null) {
-      for (int i = 0; i < childColumns.length; i++) {
-        childColumns[i].close();
+      for (ArrowColumnVector childColumn : childColumns) {
Will this be faster?
We should also apply a similar change to `WritableColumnVector.close()`.
The performance is the same; it's just a more standard way to iterate over an array in Java.
   /**
    * Returns the array for rowId.
    */
   public final ColumnarArray getArray(int rowId) {
-    return new ColumnarArray(arrayData(), getArrayOffset(rowId), getArrayLength(rowId));
+    return new ColumnarArray(getChild(0), getArrayOffset(rowId), getArrayLength(rowId));
Why change this?
Please also paste the benchmark result after the changes.
  s"$columnVar.getStruct($ordinal)"
} else {
  ctx.getValue(columnVar, dataType, ordinal)
}
Can't we use this API?
/**
* Returns the specialized code to access a value from a column vector for a given `DataType`.
*/
def getValue(vector: String, rowId: String, dataType: DataType): String = {
...
}
Ah, I didn't know there was such an API. I'll use it instead.
val value = if (key.dataType.isInstanceOf[StructType]) {
  s"vectors[$ordinal].getStruct(buckets[idx])"
} else {
  ctx.getValue(s"vectors[$ordinal]", "buckets[idx]", key.dataType)
ditto.
Test build #86372 has finished for PR 20277 at commit
|
Test build #86376 has finished for PR 20277 at commit
|
retest this please |
Test build #86380 has finished for PR 20277 at commit
|
retest this please, since the |
Test build #86394 has finished for PR 20277 at commit
|
@@ -55,164 +43,82 @@ public void close() {
     if (childColumns != null) {
       for (int i = 0; i < childColumns.length; i++) {
         childColumns[i].close();
         childColumns[i] = null;
Is it OK not to call `close()` while `ColumnVector.close()` is provided?
What do you mean? `ColumnVector.close` is part of the required interface.
@@ -55,164 +43,82 @@ public void close() {
     if (childColumns != null) {
       for (int i = 0; i < childColumns.length; i++) {
         childColumns[i].close();
         childColumns[i] = null;
       }
     }
We need to set `childColumns = null` after the for loop; otherwise, won't a `NullPointerException` be thrown if `close()` is called twice?
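The concern raised here can be illustrated with a small, self-contained sketch. The class below is hypothetical and is not the `ArrowColumnVector` from the diff; it only mirrors the shape of the loop. Because each element is set to `null` inside the loop, a second `close()` would dereference a `null` element unless the array reference itself is also cleared.

```java
// Hypothetical sketch, not Spark code: a vector that owns child vectors.
final class ClosableVectorSketch {
  private ClosableVectorSketch[] children;
  private int closeCalls = 0;

  ClosableVectorSketch(ClosableVectorSketch... children) {
    // Treat "no children" as null, mirroring the null check in close().
    this.children = children.length == 0 ? null : children;
  }

  void close() {
    closeCalls++;
    if (children != null) {
      for (int i = 0; i < children.length; i++) {
        children[i].close();
        children[i] = null;
      }
      // Without this line, a second close() would hit a null element above.
      children = null;
    }
  }

  int closeCalls() {
    return closeCalls;
  }
}
```

With the final assignment in place, calling `close()` twice on a parent closes each child exactly once and the second call does nothing beyond the null check.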
    key.dataType), key.name)})"""
// `ColumnVector.getStruct` is different from `InternalRow.getStruct`, it only takes an
// `ordinal` parameter.
val value = ctx.getValue(s"vectors[$ordinal]", key.dataType, "buckets[idx]")
`getValueFromVector` instead of `getValue`?
LGTM pending Jenkins. |
Test build #86459 has finished for PR 20277 at commit
|
Test build #86463 has finished for PR 20277 at commit
|
Jenkins, retest this please. |
Test build #86469 has finished for PR 20277 at commit
|
thanks, merging to master/2.3! |
## What changes were proposed in this pull request?

Several improvements:
* provide a default implementation for the batch get methods
* rename `getChildColumn` to `getChild`, which is more concise
* remove `getStruct(int, int)`, it's only used to simplify the codegen, which is an internal thing, we should not add a public API for this purpose.

## How was this patch tested?

existing tests

Author: Wenchen Fan <wenchen@databricks.com>

Closes #20277 from cloud-fan/column-vector.

(cherry picked from commit 5d680ca)
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
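As an illustration of the first improvement listed in the commit message, a default batch getter can be written once in the base class in terms of the single-value getter, so that each implementation only has to supply the latter. This is a minimal sketch under that assumption; the class names are hypothetical and the code is not copied from Spark.

```java
// Hypothetical sketch, not the actual Spark class: the base class supplies
// the batch getter once, on top of the abstract single-value getter.
abstract class BatchGetSketch {
  abstract int getInt(int rowId);

  // Default batch implementation: reads `count` values starting at `rowId`.
  int[] getInts(int rowId, int count) {
    int[] res = new int[count];
    for (int i = 0; i < count; i++) {
      res[i] = getInt(rowId + i);
    }
    return res;
  }
}

// Minimal subclass backed by a Java array, for illustration only.
final class ArrayBackedSketch extends BatchGetSketch {
  private final int[] data;

  ArrayBackedSketch(int[] data) {
    this.data = data;
  }

  @Override
  int getInt(int rowId) {
    return data[rowId];
  }
}
```

An implementation with a faster bulk path could still override `getInts`; the default only removes the need to do so.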