
[SPARK-23388][SQL] Support for Parquet Binary DecimalType in VectorizedColumnReader #20580

Closed

Conversation

jamesthomp (Contributor)

What changes were proposed in this pull request?

Re-add support for Parquet binary DecimalType in VectorizedColumnReader.

How was this patch tested?

Existing test suite

@@ -691,11 +696,6 @@ public WritableColumnVector arrayData() {
*/
protected abstract WritableColumnVector reserveNewColumn(int capacity, DataType type);

protected boolean isArray() {
return type instanceof ArrayType || type instanceof BinaryType || type instanceof StringType ||
Member

Would it be better to minimize this change? i.e., just change `protected` to `public` without moving the method.

Contributor Author

Happy to keep it in the previous place, but I noticed that all the public methods in this class are grouped together, so I moved it to keep that consistent.

@kiszk (Member) commented Feb 12, 2018

I noticed this change in WritableColumnVector class.

The WritableColumnVector class will be a public API in Spark 2.3. It seems necessary to discuss changing the visibility of a method or field with @cloud-fan.

@@ -444,7 +444,7 @@ private void readBinaryBatch(int rowId, int num, WritableColumnVector column) {
// This is where we implement support for the valid type conversions.
// TODO: implement remaining type conversions
VectorizedValuesReader data = (VectorizedValuesReader) dataColumn;
if (column.dataType() == DataTypes.StringType || column.dataType() == DataTypes.BinaryType) {
if (column.isArray()) {
Member

Do we need new test cases for supporting a new type?

@kiszk (Member) commented Feb 12, 2018

IIUC, does this PR add support for the array type, too?

@jamesthomp (Contributor Author)

Yeah, I believe it will add support for the array type too. Spark previously supported these types, but that support was removed in this PR: 9c29c55#diff-7bdf5fd0ce0b1ccbf4ecf083611976e6R428

I'm just trying to add it back.

@kiszk (Member) commented Feb 12, 2018

@cloud-fan Is there any reason the above PR removed support for some types, such as Array?

@a10y commented Feb 12, 2018

As far as we can tell, this was an accidental breaking change: dropping support for this in the vectorized Parquet reader was never called out. We have Parquet datasets with binary columns whose logical type is DECIMAL that were loadable before the change and have since become unloadable, throwing in readBinaryBatch.
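For context (not part of the PR), Parquet's DECIMAL logical type stores the unscaled integer as big-endian two's-complement bytes, with the scale taken from the schema. A minimal Python sketch of the decoding a reader has to perform (the helper name is illustrative, not Spark code):

```python
from decimal import Decimal

def decode_parquet_decimal(raw: bytes, scale: int) -> Decimal:
    """Decode a Parquet binary DECIMAL value.

    Parquet stores the unscaled integer as big-endian,
    two's-complement bytes; the scale comes from the schema.
    """
    unscaled = int.from_bytes(raw, byteorder="big", signed=True)
    return Decimal(unscaled).scaleb(-scale)

# 123.45 with scale 2 has unscaled value 12345 = 0x3039
print(decode_parquet_decimal(b"\x30\x39", 2))  # 123.45
# Negative values use two's complement: 0xFF is -1
print(decode_parquet_decimal(b"\xff", 0))      # -1
```

This is the representation the vectorized reader's binary path has to handle when the column's Catalyst type is a large decimal.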

@jamesthomp (Contributor Author)

@kiszk - I've changed the implementation to no longer use column.isArray() and instead inline the decimal type check (so no changes are needed to the public API). I don't think you could actually hit this codepath with ArrayType, so that part was unnecessary.

As for testing, it might be easiest to check in a Parquet file with the binary decimal format and then check that Spark can read it?

@a10y commented Feb 12, 2018

If you do add a test, it should probably generate the Parquet file programmatically rather than checking it in. There are some examples in https://github.com/apache/spark/blob/master/sql/core/src/test/scala/org/apache/spark/sql/execution/datasources/parquet/ParquetInteroperabilitySuite.scala.

However, given that there weren't tests before and this just fixes a recently introduced break, it seems reasonable that tests for the vectorized reader could be added in a follow-up?

@cloud-fan (Contributor)

It was an accident, thanks for the fix!

Can we add a test? It's always good to have a test for a bug fix, even if the bug was introduced recently.

@jamesthomp (Contributor Author)

I'll see if I can generate a Parquet file with the right schema to add as a test, but I probably cannot look at this until tomorrow.

@cloud-fan (Contributor)

ok to test

@cloud-fan (Contributor)

You don't need to generate the Parquet file manually; just write a Parquet file using Spark and read it back. We can probably add this test in ParquetFileFormatSuite.

@@ -444,7 +444,8 @@ private void readBinaryBatch(int rowId, int num, WritableColumnVector column) {
// This is where we implement support for the valid type conversions.
// TODO: implement remaining type conversions
VectorizedValuesReader data = (VectorizedValuesReader) dataColumn;
if (column.dataType() == DataTypes.StringType || column.dataType() == DataTypes.BinaryType) {
if (column.dataType() == DataTypes.StringType || column.dataType() == DataTypes.BinaryType
|| DecimalType.isByteArrayDecimalType(column.dataType())) {
Member

Also, please fix the indentation.
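For reference, `DecimalType.isByteArrayDecimalType` in the diff above is true exactly when the decimal's unscaled value no longer fits in a 64-bit long, i.e. when Spark must read it from a byte-array column rather than an int or long column. A small Python sketch of that classification (the constants mirror Spark's `Decimal.MAX_INT_DIGITS` = 9 and `Decimal.MAX_LONG_DIGITS` = 18; the function names are illustrative, not Spark's API):

```python
MAX_INT_DIGITS = 9    # max precision whose unscaled value fits in a 32-bit int
MAX_LONG_DIGITS = 18  # max precision whose unscaled value fits in a 64-bit long

def is_byte_array_decimal(precision: int) -> bool:
    """True when the decimal must use the byte-array read path."""
    return precision > MAX_LONG_DIGITS

def physical_storage(precision: int) -> str:
    """Which physical representation a decimal of this precision uses."""
    if precision <= MAX_INT_DIGITS:
        return "int"
    if precision <= MAX_LONG_DIGITS:
        return "long"
    return "byte array"

print(physical_storage(9))   # int
print(physical_storage(18))  # long
print(physical_storage(19))  # byte array
```

This is why the extra clause in readBinaryBatch only needs to cover the byte-array case: smaller decimals never reach the binary read path.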

@SparkQA commented Feb 12, 2018

Test build #87335 has finished for PR 20580 at commit 378ce28.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@gatorsmile (Member)

LGTM.

Since this is a regression and a blocker for the Spark 2.3 release, I am merging it now. Please submit a follow-up PR to add the tests. Thanks!

asfgit pushed a commit that referenced this pull request Feb 12, 2018
…edColumnReader

## What changes were proposed in this pull request?

Re-add support for parquet binary DecimalType in VectorizedColumnReader

## How was this patch tested?

Existing test suite

Author: James Thompson <jamesthomp@users.noreply.github.com>

Closes #20580 from jamesthomp/jt/add-back-binary-decimal.

(cherry picked from commit 5bb1141)
Signed-off-by: gatorsmile <gatorsmile@gmail.com>
@asfgit asfgit closed this in 5bb1141 Feb 12, 2018