
[SPARK-23388][SQL] Support for Parquet Binary DecimalType in VectorizedColumnReader #20580

Closed

Conversation

jamesthomp (Contributor)

What changes were proposed in this pull request?

Re-add support for Parquet binary DecimalType in VectorizedColumnReader.

How was this patch tested?

Existing test suite

@@ -691,11 +696,6 @@ public WritableColumnVector arrayData() {
*/
protected abstract WritableColumnVector reserveNewColumn(int capacity, DataType type);

protected boolean isArray() {
return type instanceof ArrayType || type instanceof BinaryType || type instanceof StringType ||
Member

Would it be better to minimize this change? i.e., just change `protected` to `public` without moving the method.

Contributor Author

Happy to keep it in the previous place, but I noticed that all the public methods in this class are grouped together, so I moved it to keep that consistent.

@kiszk (Member) commented Feb 12, 2018

I noticed this change in WritableColumnVector class.

The WritableColumnVector class will be a public API in Spark 2.3. It seems necessary to discuss changing the visibility of a method or field with @cloud-fan.

@@ -444,7 +444,7 @@ private void readBinaryBatch(int rowId, int num, WritableColumnVector column) {
// This is where we implement support for the valid type conversions.
// TODO: implement remaining type conversions
VectorizedValuesReader data = (VectorizedValuesReader) dataColumn;
if (column.dataType() == DataTypes.StringType || column.dataType() == DataTypes.BinaryType) {
if (column.isArray()) {
Member

Do we need new test cases for supporting a new type?

@kiszk (Member) commented Feb 12, 2018

IIUC, does this PR add support for the array type, too?

@jamesthomp (Contributor Author)

Yeah, I believe it will add support for the array type too. Spark previously supported these types, but that support was removed in this PR: 9c29c55#diff-7bdf5fd0ce0b1ccbf4ecf083611976e6R428

I'm just trying to add it back.

@kiszk (Member) commented Feb 12, 2018

@cloud-fan Is there any reason the above PR removed support for some types, such as Array?

@a10y commented Feb 12, 2018

As far as we can tell, this was an accidental breaking change: dropping support for this in the vectorized Parquet reader was never called out. We have Parquet datasets with binary columns whose logical type is DECIMAL that were loadable before the change and have since become unloadable, throwing in readBinaryBatch.
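For context (not part of the PR), Parquet's DECIMAL logical type stores the unscaled integer as big-endian two's-complement bytes, with the scale taken from the schema. A minimal Python sketch of the decoding a reader has to perform (the helper name is illustrative, not Spark code):

```python
from decimal import Decimal

def decode_parquet_decimal(raw: bytes, scale: int) -> Decimal:
    """Decode a Parquet binary DECIMAL value.

    Parquet stores the unscaled integer as big-endian,
    two's-complement bytes; the scale comes from the schema.
    """
    unscaled = int.from_bytes(raw, byteorder="big", signed=True)
    return Decimal(unscaled).scaleb(-scale)

# 123.45 with scale 2 has unscaled value 12345 = 0x3039
print(decode_parquet_decimal(b"\x30\x39", 2))  # 123.45
# Negative values use two's complement: 0xFF is -1
print(decode_parquet_decimal(b"\xff", 0))      # -1
```

This is the representation the vectorized reader's binary path has to handle when the column's Catalyst type is a large decimal.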

@jamesthomp (Contributor Author)

@kiszk - I've changed the implementation to no longer use column.isArray() and instead inline the decimal type check (so no changes are needed to the public API). I don't think you could actually hit this codepath with ArrayType, so that part was unnecessary.

As for testing, it might be easiest to check in a Parquet file with the binary decimal format and then check that Spark can read it?

@a10y commented Feb 12, 2018

If you do add a test, it should probably generate the Parquet file programmatically rather than checking it in. There are some examples in https://github.com/apache/spark/blob/master/sql/core/src/test/scala/org/apache/spark/sql/execution/datasources/parquet/ParquetInteroperabilitySuite.scala.

However, given that there weren't tests before and this just fixes a recently introduced break, it seems reasonable that tests for the vectorized reader could be added in a follow-up?

@cloud-fan (Contributor)

It was an accident, thanks for the fix!

Can we add a test? It's always good to have a test for a bug fix, even if the bug was introduced recently.

@jamesthomp (Contributor Author)

I'll see if I can generate a Parquet file with the right schema to add as a test, but I probably cannot look at this until tomorrow.

@cloud-fan (Contributor)

ok to test

@cloud-fan (Contributor)

You don't need to generate the Parquet file manually; just write a Parquet file using Spark and read it back. We can probably add this test in ParquetFileFormatSuite.

@@ -444,7 +444,8 @@ private void readBinaryBatch(int rowId, int num, WritableColumnVector column) {
// This is where we implement support for the valid type conversions.
// TODO: implement remaining type conversions
VectorizedValuesReader data = (VectorizedValuesReader) dataColumn;
if (column.dataType() == DataTypes.StringType || column.dataType() == DataTypes.BinaryType) {
if (column.dataType() == DataTypes.StringType || column.dataType() == DataTypes.BinaryType
|| DecimalType.isByteArrayDecimalType(column.dataType())) {
Member

Also, please fix the indentation.
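For reference, `DecimalType.isByteArrayDecimalType` in the diff above is true exactly when the decimal's unscaled value no longer fits in a 64-bit long, i.e. when Spark must read it from a byte-array column rather than an int or long column. A small Python sketch of that classification (the constants mirror Spark's `Decimal.MAX_INT_DIGITS` = 9 and `Decimal.MAX_LONG_DIGITS` = 18; the function names are illustrative, not Spark's API):

```python
MAX_INT_DIGITS = 9    # max precision whose unscaled value fits in a 32-bit int
MAX_LONG_DIGITS = 18  # max precision whose unscaled value fits in a 64-bit long

def is_byte_array_decimal(precision: int) -> bool:
    """True when the decimal must use the byte-array read path."""
    return precision > MAX_LONG_DIGITS

def physical_storage(precision: int) -> str:
    """Which physical representation a decimal of this precision uses."""
    if precision <= MAX_INT_DIGITS:
        return "int"
    if precision <= MAX_LONG_DIGITS:
        return "long"
    return "byte array"

print(physical_storage(9))   # int
print(physical_storage(18))  # long
print(physical_storage(19))  # byte array
```

This is why the extra clause in readBinaryBatch only needs to cover the byte-array case: smaller decimals never reach the binary read path.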

@SparkQA commented Feb 12, 2018

Test build #87335 has finished for PR 20580 at commit 378ce28.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@gatorsmile (Member)

LGTM.

Since this is a regression and a blocker for the Spark 2.3 release, I am merging it now. Please submit a follow-up PR to add the tests. Thanks!

asfgit pushed a commit that referenced this pull request Feb 12, 2018
…edColumnReader

## What changes were proposed in this pull request?

Re-add support for parquet binary DecimalType in VectorizedColumnReader

## How was this patch tested?

Existing test suite

Author: James Thompson <jamesthomp@users.noreply.github.com>

Closes #20580 from jamesthomp/jt/add-back-binary-decimal.

(cherry picked from commit 5bb1141)
Signed-off-by: gatorsmile <gatorsmile@gmail.com>
@asfgit asfgit closed this in 5bb1141 Feb 12, 2018