Conversation

@sameeragarwal
Member

What changes were proposed in this pull request?

This PR adds support for `TimestampType` in the vectorized Parquet reader.

How was this patch tested?

  1. VectorizedColumnReader initially had a gating condition on `primitiveType.getPrimitiveTypeName() == PrimitiveType.PrimitiveTypeName.INT96` that made us fall back on parquet-mr for handling timestamps. This condition is now removed.
  2. The ParquetHadoopFsRelationSuite (which tests all supported Hive types, including TimestampType) fails when the gating condition is removed ([WIP][SPARK-13994][SQL] Investigate types not supported by vectorized parquet reader #11808) and should now pass with this change. Similarly, the ParquetHiveCompatibilitySuite SPARK-10177 timestamp test, which fails when the gating condition is removed, should now pass as well.
  3. Added tests in HadoopFsRelationTest that cover both the dictionary-encoded and non-encoded versions across all supported data types.
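For background (not part of the PR text itself): Parquet's INT96 timestamps pack 8 little-endian bytes of nanoseconds-within-day followed by a 4-byte Julian day number. A minimal, self-contained sketch of the conversion to microseconds since the Unix epoch, which is the kind of decoding this reader has to perform (class and method names here are illustrative, not from the patch):

```java
import java.nio.ByteBuffer;
import java.nio.ByteOrder;

public class Int96ToMicros {
    // Julian day number of the Unix epoch (1970-01-01).
    static final int JULIAN_DAY_OF_EPOCH = 2440588;
    static final long MICROS_PER_DAY = 86400L * 1000 * 1000;

    // An INT96 timestamp is 12 little-endian bytes: 8 bytes of
    // nanoseconds within the day, then a 4-byte Julian day number.
    static long int96ToMicros(byte[] int96) {
        ByteBuffer buf = ByteBuffer.wrap(int96).order(ByteOrder.LITTLE_ENDIAN);
        long nanosOfDay = buf.getLong();
        int julianDay = buf.getInt();
        return (julianDay - JULIAN_DAY_OF_EPOCH) * MICROS_PER_DAY + nanosOfDay / 1000;
    }

    public static void main(String[] args) {
        // Midnight at the epoch: Julian day 2440588, zero nanoseconds.
        byte[] epoch = ByteBuffer.allocate(12).order(ByteOrder.LITTLE_ENDIAN)
            .putLong(0L).putInt(JULIAN_DAY_OF_EPOCH).array();
        System.out.println(int96ToMicros(epoch)); // prints 0
    }
}
```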

@sameeragarwal
Member Author

cc @nongli

@SparkQA

SparkQA commented Mar 22, 2016

Test build #53757 has finished for PR 11882 at commit a612f8b.

  • This patch fails Scala style tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

```java
case INT96:
  if (column.dataType() == DataTypes.TimestampType) {
    for (int i = rowId; i < rowId + num; ++i) {
      Binary v = dictionary.decodeToBinary(dictionaryIds.getInt(i));
```
Contributor

Hm -- maybe we can do something a lot cheaper? At the very least maybe we can remove the creation of this `Binary` object, since we are turning it immediately into a `Long`.

Member Author

sure, that sounds good

Member Author

Added a TODO for converting the dictionary of binaries into longs to make it cheaper.
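The TODO above amounts to decoding each 12-byte dictionary entry to a long once, up front, so the per-row path avoids allocating `Binary` objects entirely. A self-contained sketch of that idea (plain byte arrays stand in for parquet-mr's `Dictionary`/`Binary`, which the real code would use; constants match the usual INT96 layout):

```java
import java.nio.ByteBuffer;
import java.nio.ByteOrder;

public class Int96Dictionary {
    static final int JULIAN_DAY_OF_EPOCH = 2440588;
    static final long MICROS_PER_DAY = 86400L * 1000 * 1000;

    // Decode every 12-byte INT96 dictionary entry to epoch microseconds
    // once, up front, so the per-row path becomes a plain array lookup
    // with no per-value allocation.
    static long[] decodeDictionaryToMicros(byte[][] entries) {
        long[] micros = new long[entries.length];
        for (int id = 0; id < entries.length; id++) {
            ByteBuffer buf = ByteBuffer.wrap(entries[id]).order(ByteOrder.LITTLE_ENDIAN);
            long nanosOfDay = buf.getLong();
            int julianDay = buf.getInt();
            micros[id] = (julianDay - JULIAN_DAY_OF_EPOCH) * MICROS_PER_DAY + nanosOfDay / 1000;
        }
        return micros;
    }

    public static void main(String[] args) {
        // One dictionary entry: 5000 nanoseconds past midnight at the epoch.
        byte[] entry = ByteBuffer.allocate(12).order(ByteOrder.LITTLE_ENDIAN)
            .putLong(5000L).putInt(JULIAN_DAY_OF_EPOCH).array();
        System.out.println(decodeDictionaryToMicros(new byte[][]{entry})[0]); // prints 5
    }
}
```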

@SparkQA

SparkQA commented Mar 22, 2016

Test build #53759 has finished for PR 11882 at commit 3656dae.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

```diff
-if (column.dataType() == DataTypes.LongType ||
+if (column.dataType() == DataTypes.LongType || column.dataType() == DataTypes.TimestampType ||
     DecimalType.is64BitDecimalType(column.dataType())) {
   defColumn.readLongs(
```
Contributor

Hmm, how does this work? `readLongs` expects to read Parquet int64 physical types. How is it able to read this other physical type?

Contributor

Do we have tests where this type is not dictionary encoded?

Member Author

Thanks, fixed. Also added tests in HadoopFsRelationTest that cover both the dictionary-encoded and non-encoded versions across all supported data types.
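A toy model of what "dictionary-encoded vs. non-encoded" means for a long-backed column such as timestamps (an illustration of the concept only, not the actual HadoopFsRelationTest code):

```java
import java.util.Arrays;

public class DictionaryVsPlain {
    // Plain encoding: the column stores every value directly.
    static long[] plainDecode(long[] stored) {
        return stored.clone();
    }

    // Dictionary encoding: the column stores each unique value once, plus
    // a small integer id per row; decoding is an array lookup.
    static long[] dictionaryDecode(long[] dict, int[] ids) {
        long[] out = new long[ids.length];
        for (int i = 0; i < ids.length; i++) {
            out[i] = dict[ids[i]];
        }
        return out;
    }

    public static void main(String[] args) {
        long[] plain = {1000L, 2000L, 1000L, 1000L, 2000L};
        long[] dict = {1000L, 2000L};
        int[] ids = {0, 1, 0, 0, 1};
        // Both encodings must decode to the same column values, which is
        // why a reader needs test coverage for each path separately.
        System.out.println(Arrays.equals(plainDecode(plain), dictionaryDecode(dict, ids))); // prints true
    }
}
```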

@SparkQA

SparkQA commented Mar 23, 2016

Test build #53829 has finished for PR 11882 at commit 1f511ab.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.


```java
originalTypes[i] = t.getOriginalType();

// TODO: Be extremely cautious in what is supported. Expand this.
```
Contributor

I think we can remove this check now too.

Member Author

sure, sounds good.

@nongli
Contributor

nongli commented Mar 23, 2016

LGTM

@SparkQA

SparkQA commented Mar 23, 2016

Test build #53940 has finished for PR 11882 at commit 00b1f12.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@asfgit asfgit closed this in 0a64294 Mar 23, 2016