Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[SPARK-21745][SQL] Refactor ColumnVector hierarchy to make ColumnVector read-only and to introduce WritableColumnVector. #18958

Closed
wants to merge 18 commits into from

Conversation

ueshin
Copy link
Member

@ueshin ueshin commented Aug 16, 2017

What changes were proposed in this pull request?

This is a refactoring of ColumnVector hierarchy and related classes.

  1. make ColumnVector read-only
  2. introduce WritableColumnVector with write interface
  3. remove ReadOnlyColumnVector

How was this patch tested?

Existing tests.

@ueshin
Copy link
Member Author

ueshin commented Aug 16, 2017

cc @cloud-fan @BryanCutler

@SparkQA
Copy link

SparkQA commented Aug 16, 2017

Test build #80720 has finished for PR 18958 at commit cd0de39.

  • This patch fails due to an unknown error code, -9.
  • This patch merges cleanly.
  • This patch adds no public classes.

@ueshin
Copy link
Member Author

ueshin commented Aug 16, 2017

Jenkins, retest this please.

@SparkQA
Copy link

SparkQA commented Aug 16, 2017

Test build #80723 has finished for PR 18958 at commit cd0de39.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

assert (!columns[ordinal].isConstant);
columns[ordinal].putNotNull(rowId);
columns[ordinal].putBoolean(rowId, value);
((MutableColumnVector) columns[ordinal]).putNotNull(rowId);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Maybe move the assertion and the cast in a private getter?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sure, I'll add a private getter and update these.

" to false.";

if (cause != null) {
throw new RuntimeException(message, cause);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You are allowed to pass null as a cause to the RuntimeException constructor.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks. I'll update it.

public OffHeapColumnVector(int capacity, DataType type) {
super(capacity, type);

if (type instanceof ArrayType || type instanceof BinaryType || type instanceof StringType
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can you try to move this initialization logic into the parent class? We should be able to factor out the on/off-heap specific initialization logic into a separate method.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sure, I'll try it.

* Reserve a integer column for ids of dictionary.
*/
@Override
public OffHeapColumnVector reserveDictionaryIds(int capacity) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Same comment as in the constructor.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sure, I'll try it.

| aggregateBufferSchema, org.apache.spark.memory.MemoryMode.ON_HEAP, capacity);
| for (int i = 0 ; i < aggregateBufferBatch.numCols(); i++) {
| aggregateBufferBatch.setColumn(i, batch.column(i+${groupingKeys.length}));
| batchVectors = new org.apache.spark.sql.execution.vectorized
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This happens quite a few times. It might be better to create a static util method that creates the vectors for you.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sure, I'll try it.

| }
| // TODO: Possibly generate this projection in HashAggregate directly
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can you elaborate?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm sorry but I'm not sure because this is from original code.

* elements. This means that the put() APIs do not check as in common cases (i.e. flat schemas),
* the lengths are known up front.
*
* A ColumnVector should be considered immutable once originally created. In other words, it is not
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This contradicts the name of this class. Maybe reuseable is a better way of describing what is going on here. Also cc @michal-databricks

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

How about WritableColumnVector?

@hvanhovell
Copy link
Contributor

On a more generic level. We could also choose to make ColumnVectors immutable, and create builder classes to create (reusable) instances; this would create a better separation between the API's and make sure that the mutable vectors are not used incorrectly.

Copy link
Member

@BryanCutler BryanCutler left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for doing this @ueshin! From the perspective of using this for ArrowColumnVector batches, LGTM. I just had one question about removing the capacity var from ColumnarBatch, I think we can get away with just using numRows.

this.capacity = maxRows;
this.columns = new ColumnVector[schema.size()];
this.columns = columns;
this.capacity = capacity;
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Does capacity really mean anything in here anymore since the ColumnVectors are allocated and populated outside now? Could we just initialize this.numRows = 0 and delay initializing of this.filteredRows until setNumRows() is called?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I found some places referring ColumnarBatch.capacity(), so I'd be a little conservative to do that for now.

@ueshin
Copy link
Member Author

ueshin commented Aug 17, 2017

also cc @kiszk for another column vector pr.

@SparkQA
Copy link

SparkQA commented Aug 17, 2017

Test build #80766 has finished for PR 18958 at commit b6ab633.

  • This patch fails due to an unknown error code, -9.
  • This patch merges cleanly.
  • This patch adds no public classes.

@ueshin
Copy link
Member Author

ueshin commented Aug 17, 2017

Jenkins, retest this please.

@SparkQA
Copy link

SparkQA commented Aug 17, 2017

Test build #80775 has finished for PR 18958 at commit b6ab633.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@@ -450,14 +450,13 @@ class CodegenContext {
/**
* Returns the specialized code to set a given value in a column vector for a given `DataType`.
*/
def setValue(batch: String, row: String, dataType: DataType, ordinal: Int,
value: String): String = {
def setValue(vector: String, row: String, dataType: DataType, value: String): String = {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: row -> rowId

@@ -433,7 +434,8 @@ private void readBinaryBatch(int rowId, int num, ColumnVector column) throws IOE
}

private void readFixedLenByteArrayBatch(int rowId, int num,
ColumnVector column, int arrayLen) throws IOException {
MutableColumnVector column,
int arrayLen) throws IOException {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit:

private void xxx(
    para1: XX,
    para2: XX)

anyNullsSet = false;
}
}
public abstract void reset();
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

if ColumnVector is read-only, why we need a reset API?

this.resultArray = null;
this.resultStruct = null;
}
this.isConstant = true;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think isConstant should belong to MutableColumnVector, because it's used to indicate that this column vector should not be updated.

assert (!columns[ordinal].isConstant);
columns[ordinal].putNotNull(rowId);
columns[ordinal].putByte(rowId, value);
MutableColumnVector column = getColumnAsMutable(ordinal);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm a little afraid about this per-call type cast, but JVM should be able to optimize it perfectly, cc @kiszk

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In my understanding, cast still occurs at runtime. The cast operation may consist compare and branch.
I am thinking about how we can reduce the cost of operations.

Copy link
Member

@kiszk kiszk Aug 22, 2017

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@ueshin @cloud-fan
Since MutableColumnVector in each column in ColumnarBatch is immutable, we can create an array of MutableColumnVector by applying cast from ColumnVector at initialization. If an cast exception occurs, we can ignore it since the column will not call setter APIs. Then, each setter in refers to an element of the array without a cast.

What do you think?

@ueshin ueshin changed the title [SPARK-21745][SQL] Refactor ColumnVector hierarchy to make ColumnVector read-only and to introduce MutableColumnVector. [SPARK-21745][SQL] Refactor ColumnVector hierarchy to make ColumnVector read-only and to introduce WritableColumnVector. Aug 21, 2017
*/
protected int elementsAppended;

/**
* If this is a nested type (array or struct), the column for the child data.
*/
protected ColumnVector[] childColumns;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

can we move this to WritableColumnVector? I think ColumnVector only need ColumnVector getChildColumn(int ordinal), and WritableColumnVector can overwrite it to WritableColumnVector getChildColumn(int ordinal)

Copy link
Member Author

@ueshin ueshin Aug 21, 2017

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We need this field for ArrowColumnVector to store its child columns for now, too.
Do you want to make the method getChildColumn(int ordinal) abstract and move the field to more concrete classes to manage by themselves?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

yea, because mostly the child columns are of the same type of concrete column vector type.

@@ -307,64 +293,70 @@ public void update(int ordinal, Object value) {

@Override
public void setNullAt(int ordinal) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

one question, does the rows returned by ColumnarBatch.rowIterator have to be mutable?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It seems like the rows returned by ColumnarBatch.rowIterator doesn't need to be mutable with our current tests, but ColumnarBatch.Row still needs to be mutable, the write apis of which are used in HashAggregateExec.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

oh, then we really need to think about how to eliminate the per-call type cast...

@SparkQA
Copy link

SparkQA commented Aug 21, 2017

Test build #80918 has finished for PR 18958 at commit 4d94655.

  • This patch fails due to an unknown error code, -9.
  • This patch merges cleanly.
  • This patch adds no public classes.

@ueshin
Copy link
Member Author

ueshin commented Aug 21, 2017

Jenkins, retest this please.

@SparkQA
Copy link

SparkQA commented Aug 21, 2017

Test build #80923 has finished for PR 18958 at commit 4d94655.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

/**
* Initialize child columns.
*/
protected void initialize() {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We could move this method into the constructor.

@SparkQA
Copy link

SparkQA commented Aug 22, 2017

Test build #80962 has finished for PR 18958 at commit 9eb88a8.

  • This patch fails due to an unknown error code, -9.
  • This patch merges cleanly.
  • This patch adds no public classes.

@kiszk
Copy link
Member

kiszk commented Aug 22, 2017

retest this please

@SparkQA
Copy link

SparkQA commented Aug 22, 2017

Test build #80968 has finished for PR 18958 at commit 9eb88a8.

  • This patch fails PySpark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA
Copy link

SparkQA commented Aug 22, 2017

Test build #80979 has finished for PR 18958 at commit 65cd681.

  • This patch fails SparkR unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@ueshin
Copy link
Member Author

ueshin commented Aug 22, 2017

Jenkins, retest this please.

@SparkQA
Copy link

SparkQA commented Aug 22, 2017

Test build #80987 has finished for PR 18958 at commit 65cd681.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

column.putDecimal(rowId, value, precision);
}

private WritableColumnVector getColumnAsWritable(int ordinal) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: getWritableColumn

@cloud-fan
Copy link
Contributor

LGTM except some minor comments

@SparkQA
Copy link

SparkQA commented Aug 23, 2017

Test build #81038 has finished for PR 18958 at commit 8330870.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@kiszk
Copy link
Member

kiszk commented Aug 24, 2017

LGTM

@cloud-fan
Copy link
Contributor

thanks, merging to master!

@asfgit asfgit closed this in 9e33954 Aug 24, 2017
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
6 participants