Vectorized Reads of Parquet with Identity Partitions #1287
rdblue merged 2 commits into apache:master from
Conversation
Previously, vectorization was disabled whenever an underlying Iceberg table used Parquet files and also used identity transforms in its partitioning. To fix this, we extend the DummyVectorReader into a ConstantVectorReader, which is used when a column's value can be determined from the PartitionSpec. Then, when constructing the reader, we use a ConstantColumnVector to fill in the missing column.
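As a rough illustration of the idea (a minimal sketch; the class and method names below are placeholders, not the actual Iceberg types or signatures): a reader backing an identity-partitioned column never decodes Parquet pages and instead repeats the value derived from the partition data for every row in the batch.

```java
import java.util.Arrays;

// Hypothetical sketch of a constant-backed column reader: every batch is
// filled with the identity-partition value instead of being read from Parquet.
final class ConstantReaderSketch {
  private final Object constant;

  ConstantReaderSketch(Object partitionValue) {
    this.constant = partitionValue;  // value determined from the partition data
  }

  Object[] readBatch(int numRows) {
    Object[] batch = new Object[numRows];
    Arrays.fill(batch, constant);    // same value repeated for every row
    return batch;
  }
}
```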
CC @aokolnychyi

@samarthjain, can you help review this one?
arrow/src/main/java/org/apache/iceberg/arrow/vectorized/VectorHolder.java
spark/src/test/java/org/apache/iceberg/spark/source/TestIdentityPartitionData.java
    .option("vectorization-enabled", String.valueOf(vectorized))
-   .load(table.location()).orderBy("id").collectAsList();
+   .load(table.location()).orderBy("id")
+   .select("id", "date", "level", "message")
Isn't this the default? Why was it necessary to add select?
When I added the Hive import, the schema comes back in a different order. I think this may be an issue with the import code? I'm not sure, but I know the default column order does not come out the same way :/
That's suspicious. We'll have to look into why the schema has the wrong order. I see select before all the writes, so it shouldn't need the reorder here.
I'll try to figure out the actual issue today, but I agree it shouldn't work this way. My assumption is that the Hive table schema is just being listed in a different order or when we use SparkSchemaUtil the order is getting scrambled.
I spent some time digging into this. When you call saveAsTable it ends up in this bit of code in DataFrameWriter:

```scala
val tableDesc = CatalogTable(
  identifier = tableIdent,
  tableType = tableType,
  storage = storage,
  schema = new StructType,
  provider = Some(source),
  partitionColumnNames = partitioningColumns.getOrElse(Nil),
  bucketSpec = getBucketSpec)
```

which strips out whatever incoming schema you have (the schema is set to an empty StructType), so the new table is created without any information about the actual ordering of the columns you used in the create.
Then, when the Relation is resolved, the attributes are looked up again and the schema is created from the Attribute output. Long story short: saveAsTable doesn't care about your field ordering as far as I can tell. This is all in Spark, and I'm not sure we can do anything about it here.
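A hedged sketch of the practical workaround the test settles on above: pin the projection explicitly on read so the comparison doesn't depend on the order the catalog recorded. The session, location, and column names below are assumptions mirroring the test, not code from this PR.

```java
import java.util.List;
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SparkSession;

class ReadWithExplicitOrder {
  // Read the Iceberg table at 'location' with an explicit column order so a
  // row-by-row comparison is stable regardless of the catalog's field order.
  static List<Row> readOrdered(SparkSession spark, String location) {
    Dataset<Row> rows = spark.read()
        .format("iceberg")
        .load(location)
        .select("id", "date", "level", "message")  // pin the projection order
        .orderBy("id");
    return rows.collectAsList();
  }
}
```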
I'm fine with this, then. Thanks for looking into it!
Looks mostly good to me!
I wasn't really happy about doing the instance checking in Java (if dummy, then cast); it makes me long for Scala :P
I do think this is probably a minimal set of changes to get this in without breaking too much open.
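A minimal sketch of the check-then-cast pattern being referred to; the reader types here are placeholders rather than the exact Iceberg classes:

```java
// Placeholder types to illustrate the instanceof-then-cast dispatch in Java.
final class ReaderDispatchSketch {
  interface VectorReader {}

  static final class ConstantVectorReader implements VectorReader {
    final Object value;  // value determined from the identity partition

    ConstantVectorReader(Object value) {
      this.value = value;
    }
  }

  static Object resolve(VectorReader reader, Object valueFromParquet) {
    if (reader instanceof ConstantVectorReader) {
      // identity-partition column: no Parquet pages to decode
      return ((ConstantVectorReader) reader).value;
    }
    return valueFromParquet;  // normal vectorized read path
  }
}
```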
arrow/src/main/java/org/apache/iceberg/arrow/vectorized/VectorizedArrowReader.java
One minor comment, but generally looks good to me.
Renamed Classes Parameterized Others
Thanks @samarthjain and @rdblue, I applied all your comments! The only thing I couldn't address was the Hive (saveAsTable) reordering issue. But hopefully I can get some time to work on making a save.format(iceberg) do some column pruning with identity transforms and simplify this test later?

Merged. Thanks, @RussellSpitzer! Good to have this feature done.