Skip to content

Vectorized Reads of Parquet with Identity Partitions#1287

Merged
rdblue merged 2 commits intoapache:masterfrom
RussellSpitzer:IdentityColumnarReads
Aug 5, 2020
Merged

Vectorized Reads of Parquet with Identity Partitions#1287
rdblue merged 2 commits intoapache:masterfrom
RussellSpitzer:IdentityColumnarReads

Conversation

@RussellSpitzer
Copy link
Member

Previously vectorization would be disabled whenever an underlying iceberg table
was using Parquet files and also used Identity transforms in it's partitioning.

To fix this we extend the DummyVectorReader to be a ConstantVectorReader which is
used when a column's value can be determined from the PartitionSpec. Then when
constructing the reader we use a ConstantColumnVector to fill in the missing
column.

Previously vectorization would be disabled whenever an underlying iceberg table
was using Parquet files and also used Identity transforms in it's partitioning.
To fix this we extend the DummyVectorReader to be a ConstantVectorReader which is
used when a column's value can be determined from the PartitionSpec. Then when
constructing the reader we use a ConstantColumnVector to fill in the missing
column.
@RussellSpitzer
Copy link
Member Author

CC @aokolnychyi

@rdblue
Copy link
Contributor

rdblue commented Aug 3, 2020

@samarthjain, can you help review this one?

.option("vectorization-enabled", String.valueOf(vectorized))
.load(table.location()).orderBy("id").collectAsList();
.load(table.location()).orderBy("id")
.select("id", "date", "level", "message")
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Isn't this the default? Why was it necessary to add select?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

When I added in the Hive Import it gets the schema in a different order, I think this may be an issue with the import code? I'm not sure, but I know the default column order does not come out the same way :/

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That's suspicious. We'll have to look into why the schema has the wrong order. I see select before all the writes, so it shouldn't need the reorder here.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'll try to figure out the actual issue today, but I agree it shouldn't work this way. My assumption is that the Hive table schema is just being listed in a different order or when we use SparkSchemaUtil the order is getting scrambled.

Copy link
Member Author

@RussellSpitzer RussellSpitzer Aug 3, 2020

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I spent some time digging into this,
When you call saveAsTable it ends up in this bit of code in DataFrameWriter

    val tableDesc = CatalogTable(
      identifier = tableIdent,
      tableType = tableType,
      storage = storage,
      schema = new StructType,
      provider = Some(source),
      partitionColumnNames = partitioningColumns.getOrElse(Nil),
      bucketSpec = getBucketSpec)

Which strips out whatever incoming schema you have. So the new table is created without any information about the actual ordering of columns you used in the create.

Then when the Relation is resolved, that's when the attributes are looked up again and the schema is created from the Attribute output. So long story short, saveAsTable doesn't care about your field ordering as far as I can tell. This is all in Spark and I'm not sure we can do anything about it here.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm fine with this, then. Thanks for looking into it!

@rdblue
Copy link
Contributor

rdblue commented Aug 3, 2020

Looks mostly good to me!

@RussellSpitzer
Copy link
Member Author

RussellSpitzer commented Aug 3, 2020 via email

@samarthjain
Copy link
Collaborator

One minor comment, but generally looks good to me.

Renamed Classes
Parameterized Others
@RussellSpitzer
Copy link
Member Author

Thanks @samarthjain and @rdblue I applied all your comments! The only thing which I couldn't address was the Hive (saveAsTable) reordering thing. But hopefully I can get some time to working on making a save.format(iceberg) to due some column pruning with identity transforms and simplify this test later?

@rdblue rdblue merged commit 71de518 into apache:master Aug 5, 2020
@rdblue
Copy link
Contributor

rdblue commented Aug 5, 2020

Merged. Thanks, @RussellSpitzer! Good to have this feature done.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants

Comments