
[SPARK-23090][SQL] polish ColumnVector #20277

Closed
cloud-fan wants to merge 7 commits into master from the column-vector branch

Conversation

@cloud-fan (Contributor) commented on Jan 16, 2018:

What changes were proposed in this pull request?

Several improvements:

  • provide a default implementation for the batch get methods (see the sketch after this list)
  • rename getChildColumn to getChild, which is more concise
  • remove getStruct(int, int); it is only used to simplify codegen, which is an internal concern, so we should not add a public API for that purpose
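
As an illustration of the first item, here is a minimal sketch (hypothetical class and method names, not the exact Spark API) of default batch getters layered on the abstract single-value getters, which concrete vectors may override with faster bulk copies:

  // Hypothetical sketch of a ColumnVector-style abstraction with a default batch getter.
  public abstract class ExampleColumnVector {
    public abstract boolean isNullAt(int rowId);
    public abstract int getInt(int rowId);

    // Default implementation built on the single-value getter; subclasses backed by
    // contiguous memory can override this with a bulk copy.
    public int[] getInts(int rowId, int count) {
      int[] res = new int[count];
      for (int i = 0; i < count; i++) {
        res[i] = getInt(rowId + i);
      }
      return res;
    }
  }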

How was this patch tested?

existing tests

@cloud-fan (Contributor Author) commented on the following diff:

private final ArrowVectorAccessor accessor;
private ArrowColumnVector[] childColumns;

private void ensureAccessible(int index) {

ColumnVector is performance critical; we don't need index checking here, consistent with the other column vector implementations.

SparkQA commented on Jan 16, 2018:

Test build #86179 has started for PR 20277 at commit ba7ca0f.

@hvanhovell (Contributor) commented:

@cloud-fan did you run any benchmarks? I'd like to make sure the abstract-class-to-interface change does not negatively impact performance.

cloud-fan force-pushed the column-vector branch 3 times, most recently from 08d06a7 to bc6d0af on January 17, 2018 at 05:04.
@cloud-fan (Contributor Author) commented:

@hvanhovell good idea. I ran ColumnarBatchBenchmark and found a regression in getArray because it is no longer final. I've reverted the interface change; ColumnVector is still an abstract class.
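
For context on why final matters here, a minimal sketch (hypothetical names, not the actual Spark classes) of the pattern being preserved: a final helper on an abstract class can be bound statically and inlined by the JIT, whereas the same method as a non-final override target is a virtual call.

  // Hypothetical sketch: why keeping batch helpers `final` on an abstract class helps.
  public abstract class ExampleVector {
    public abstract int getArrayOffset(int rowId);
    public abstract int getArrayLength(int rowId);

    // `final` means no subclass can override this, so the JVM can bind the call
    // statically and inline it rather than treating it as a virtual dispatch.
    public final int getArrayEnd(int rowId) {
      return getArrayOffset(rowId) + getArrayLength(rowId);
    }
  }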

SparkQA commented on Jan 17, 2018:

Test build #86240 has finished for PR 20277 at commit 08d06a7.

  • This patch fails MiMa tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

ensureAccessible(index, 1);
}

private void ensureAccessible(int index, int count) {
@cloud-fan (Contributor Author) commented on the diff above:

ColumnVector is performance critical; we don't need index checking here, consistent with the other column vector implementations.

Member commented:

I agree with this for the non-debug version. Can we add an assert for this check at each caller site, for debugging?

P.S. Sorry for the slow reviews; I am on vacation this week.

@cloud-fan (Contributor Author) commented:

How about we do it later? We need to find a central place to put this check, instead of doing it in every implementation.

Member commented:

It is fine to do it later. I agree we should do the check in one place.
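
A minimal sketch, purely hypothetical (the helper name and placement are assumptions, not part of this PR), of what a single, debug-only bounds check could look like:

  // Hypothetical helper: one central, assert-based bounds check that costs nothing when
  // assertions are disabled (the JVM default) and is active under -ea for debugging.
  final class ColumnVectorChecks {
    private ColumnVectorChecks() {}

    static void checkAccessible(int rowId, int count, int capacity) {
      assert rowId >= 0 && count >= 0 && rowId + count <= capacity
          : "row range [" + rowId + ", " + (rowId + count) + ") exceeds capacity " + capacity;
    }
  }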

SparkQA commented on Jan 17, 2018:

Test build #86222 has finished for PR 20277 at commit 3cba91b.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds the following public classes (experimental):
  • public class OrcColumnVector implements org.apache.spark.sql.vectorized.ColumnVector
  • public abstract class WritableColumnVector implements ColumnVector
  • public final class ArrowColumnVector implements ColumnVector

SparkQA commented on Jan 17, 2018:

Test build #86246 has finished for PR 20277 at commit f3f9d5e.

  • This patch fails due to an unknown error code, -9.
  • This patch merges cleanly.
  • This patch adds no public classes.

SparkQA commented on Jan 17, 2018:

Test build #86245 has finished for PR 20277 at commit 77e8c4b.

  • This patch fails due to an unknown error code, -9.
  • This patch merges cleanly.
  • This patch adds no public classes.

SparkQA commented on Jan 17, 2018:

Test build #86242 has finished for PR 20277 at commit bc6d0af.

  • This patch fails due to an unknown error code, -9.
  • This patch merges cleanly.
  • This patch adds no public classes.

@cloud-fan (Contributor Author) commented:

retest this please

SparkQA commented on Jan 17, 2018:

Test build #86254 has finished for PR 20277 at commit f3f9d5e.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

SparkQA commented on Jan 18, 2018:

Test build #86340 has finished for PR 20277 at commit 212c841.

  • This patch fails to build.
  • This patch merges cleanly.
  • This patch adds no public classes.

SparkQA commented on Jan 19, 2018:

Test build #86371 has finished for PR 20277 at commit eccdca1.

  • This patch fails to build.
  • This patch merges cleanly.
  • This patch adds no public classes.

@@ -53,166 +41,83 @@ public int numNulls() {
   @Override
   public void close() {
     if (childColumns != null) {
-      for (int i = 0; i < childColumns.length; i++) {
-        childColumns[i].close();
+      for (ArrowColumnVector childColumn : childColumns) {
Contributor commented:

Will this be faster?

@jiangxb1987 (Contributor) commented on Jan 19, 2018:

We should also apply a similar change to WritableColumnVector.close().

@cloud-fan (Contributor Author) commented:

The performance is the same; it's just the more idiomatic way to iterate over an array in Java.

 /**
  * Returns the array for rowId.
  */
 public final ColumnarArray getArray(int rowId) {
-  return new ColumnarArray(arrayData(), getArrayOffset(rowId), getArrayLength(rowId));
+  return new ColumnarArray(getChild(0), getArrayOffset(rowId), getArrayLength(rowId));
Contributor commented:

Why change this?

@jiangxb1987 (Contributor) left a review comment:

Please also paste the benchmark result after the changes.

s"$columnVar.getStruct($ordinal)"
} else {
ctx.getValue(columnVar, dataType, ordinal)
}
Member commented:

Can't we use this API?

  /**
   * Returns the specialized code to access a value from a column vector for a given `DataType`.
   */
  def getValue(vector: String, rowId: String, dataType: DataType): String = {
    ...
  }

@cloud-fan (Contributor Author) commented:

Ah, I didn't know there was such an API. I'll use it instead.

val value = if (key.dataType.isInstanceOf[StructType]) {
s"vectors[$ordinal].getStruct(buckets[idx])"
} else {
ctx.getValue(s"vectors[$ordinal]", "buckets[idx]", key.dataType)
Member commented:

ditto.

SparkQA commented on Jan 19, 2018:

Test build #86372 has finished for PR 20277 at commit 37c82e6.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

SparkQA commented on Jan 19, 2018:

Test build #86376 has finished for PR 20277 at commit 3972093.

  • This patch fails due to an unknown error code, -9.
  • This patch merges cleanly.
  • This patch adds no public classes.

@cloud-fan (Contributor Author) commented:

retest this please

SparkQA commented on Jan 19, 2018:

Test build #86380 has finished for PR 20277 at commit 3972093.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@cloud-fan (Contributor Author) commented:

retest this please, since the ColumnarBatch PR has been merged.

SparkQA commented on Jan 19, 2018:

Test build #86394 has finished for PR 20277 at commit 3972093.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@@ -55,164 +43,82 @@ public void close() {
if (childColumns != null) {
for (int i = 0; i < childColumns.length; i++) {
childColumns[i].close();
childColumns[i] = null;
Member commented:

Is it OK not to call close() while ColumnVector.close() is provided?

@cloud-fan (Contributor Author) commented:

What do you mean? ColumnVector.close is a required interface method.

@@ -55,164 +43,82 @@ public void close() {
if (childColumns != null) {
for (int i = 0; i < childColumns.length; i++) {
childColumns[i].close();
childColumns[i] = null;
}
}
Member commented:

We need to set childColumns = null after the for loop; otherwise a NullPointerException will be thrown if close() is called twice.
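
As an illustration of the suggestion, a minimal sketch (hypothetical class, not the code merged in this PR) of an idempotent close() that clears the child array after the loop so a second call is a no-op:

  // Hypothetical sketch: nulling the array reference after the loop makes close() safe
  // to call twice; otherwise the second call would dereference already-nulled entries.
  public class ExampleStructVector implements AutoCloseable {
    private ExampleStructVector[] childColumns;

    @Override
    public void close() {
      if (childColumns != null) {
        for (ExampleStructVector child : childColumns) {
          child.close();
        }
        childColumns = null;  // a second close() now returns immediately
      }
    }
  }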

key.dataType), key.name)})"""
// `ColumnVector.getStruct` is different from `InternalRow.getStruct`, it only takes an
// `ordinal` parameter.
val value = ctx.getValue(s"vectors[$ordinal]", key.dataType, "buckets[idx]")
Member commented:

getValueFromVector instead of getValue?

@ueshin (Member) commented on Jan 22, 2018:

LGTM pending Jenkins.

SparkQA commented on Jan 22, 2018:

Test build #86459 has finished for PR 20277 at commit 55a288e.

  • This patch fails due to an unknown error code, -9.
  • This patch merges cleanly.
  • This patch adds no public classes.

SparkQA commented on Jan 22, 2018:

Test build #86463 has finished for PR 20277 at commit 0c22f5b.

  • This patch fails due to an unknown error code, -9.
  • This patch merges cleanly.
  • This patch adds no public classes.

@ueshin (Member) commented on Jan 22, 2018:

Jenkins, retest this please.

SparkQA commented on Jan 22, 2018:

Test build #86469 has finished for PR 20277 at commit 0c22f5b.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@cloud-fan (Contributor Author) commented:

thanks, merging to master/2.3!

asfgit pushed a commit that referenced this pull request Jan 22, 2018
## What changes were proposed in this pull request?

Several improvements:
* provide a default implementation for the batch get methods
* rename `getChildColumn` to `getChild`, which is more concise
* remove `getStruct(int, int)`, it's only used to simplify the codegen, which is an internal thing, we should not add a public API for this purpose.

## How was this patch tested?

existing tests

Author: Wenchen Fan <wenchen@databricks.com>

Closes #20277 from cloud-fan/column-vector.

(cherry picked from commit 5d680ca)
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
asfgit closed this in 5d680ca on Jan 22, 2018.