
Spark: Coerce shorts and bytes into ints in Parquet Writer #10349

Merged: 3 commits into apache:main on May 20, 2024

Conversation

shardulm94 (Contributor)

Fixes #10225

#9440 refactored the Spark Parquet writer to use a visitor pattern. In the process, it dropped the code that coerces bytes/shorts to ints. This PR adds the coercion code back, along with unit tests validating the change.
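
For context, here is a minimal sketch of the kind of type-based dispatch the coercion needs. The method shape and the `tinyints`/`shorts` factory names are assumed to mirror Iceberg's `ParquetValueWriters`, not copied from this PR:

```java
import org.apache.iceberg.parquet.ParquetValueWriters;
import org.apache.parquet.column.ColumnDescriptor;
import org.apache.spark.sql.types.ByteType;
import org.apache.spark.sql.types.DataType;
import org.apache.spark.sql.types.ShortType;

class IntWriterDispatch {
  // The Parquet column is a plain int32; if the incoming Spark column is
  // ByteType or ShortType, pick a writer that widens the value when writing.
  static ParquetValueWriters.PrimitiveWriter<?> ints(DataType type, ColumnDescriptor desc) {
    if (type instanceof ByteType) {
      return ParquetValueWriters.tinyints(desc); // reads bytes, writes int32
    } else if (type instanceof ShortType) {
      return ParquetValueWriters.shorts(desc); // reads shorts, writes int32
    }
    return ParquetValueWriters.ints(desc);
  }
}
```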

Note that even without the coercion code, the tests succeed on Spark 3.5 but fail on 3.3 and 3.4. They succeed on 3.5 because of apache/spark#40734, which routes CTAS/RTAS through AppendData; in that path Spark adds its own projection to handle the coercion.
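
To make the failure mode concrete, here is a hedged repro sketch (the catalog name, warehouse path, and table name are made up). On 3.3/3.4 this CTAS reached the Parquet writer without Spark's own projection, so the writer received ByteType/ShortType values for columns Iceberg stores as ints:

```java
import org.apache.spark.sql.SparkSession;

public class CoercionRepro {
  public static void main(String[] args) {
    // Local session with a Hadoop-backed Iceberg catalog (names assumed).
    SparkSession spark = SparkSession.builder()
        .master("local[2]")
        .config("spark.sql.catalog.demo", "org.apache.iceberg.spark.SparkCatalog")
        .config("spark.sql.catalog.demo.type", "hadoop")
        .config("spark.sql.catalog.demo.warehouse", "/tmp/iceberg-warehouse")
        .getOrCreate();

    // Iceberg has no byte/short types, so both columns become int in the
    // table schema; the writer must widen the incoming byte/short values.
    spark.sql("CREATE TABLE demo.db.t AS "
        + "SELECT CAST(1 AS TINYINT) AS b, CAST(2 AS SMALLINT) AS s");

    spark.stop();
  }
}
```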

So the change is not strictly necessary for Spark 3.5, but I am adding it anyway to:

  1. keep the code consistent between 3.3/3.4/3.5
  2. future-proof against Spark removing its auto-coercion at some point

github-actions bot added the spark label on May 17, 2024
shardulm94 (Contributor, Author)

cc: @Fokko @jkolash

@Fokko regarding your earlier comment: this is actually different from what visiting IntLogicalTypeAnnotation achieves. The visitor creates a ByteWriter when the Parquet logical type annotation (derived from the Iceberg schema) says the column is a byte. In this case, the table schema says the column is an int (not a byte), but the datatype of the Spark DataFrame is byte, hence the coercion is required.
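
A small sketch of the distinction; only parquet-mr's `LogicalTypeAnnotation` API and Spark's `DataType` are real here, the helper class and methods are hypothetical:

```java
import org.apache.parquet.schema.LogicalTypeAnnotation;
import org.apache.parquet.schema.LogicalTypeAnnotation.IntLogicalTypeAnnotation;
import org.apache.spark.sql.types.ByteType;
import org.apache.spark.sql.types.DataType;

class WriterChoice {
  // (1) Annotation-driven: the Parquet logical type is derived from the table
  // schema. An int(8) annotation means the table column itself is a byte, so
  // the visitor can pick a byte writer from the annotation alone.
  static boolean tableColumnIsByte(LogicalTypeAnnotation annotation) {
    return annotation instanceof IntLogicalTypeAnnotation
        && ((IntLogicalTypeAnnotation) annotation).getBitWidth() == 8;
  }

  // (2) Spark-type-driven: the table column is a plain int (no int(8)
  // annotation), but the DataFrame column is ByteType, so the writer must
  // still accept bytes and widen them; that is the case this PR restores.
  static boolean needsByteToIntCoercion(LogicalTypeAnnotation annotation, DataType sparkType) {
    return !tableColumnIsByte(annotation) && sparkType instanceof ByteType;
  }
}
```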

Fokko (Contributor) left a comment:

@shardulm94 This makes sense to me, thanks for fixing this 👍

shardulm94 merged commit 8d6bee7 into apache:main on May 20, 2024
31 checks passed

Linked issue that merging this PR may close: byte and short types in spark no longer auto coerce to int32 (#10225)