
Spark: Coerce shorts and bytes into ints in Parquet Writer #10349

Merged: 3 commits into apache:main on May 20, 2024

Conversation

shardulm94 (Contributor)

Fixes #10225

#9440 refactored the Spark Parquet writer to use a visitor pattern. In the process, it dropped the code that coerces bytes/shorts to ints. This PR adds the coercion code back, along with unit tests validating the change.
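
For context, here is a minimal sketch of the kind of type-based dispatch the coercion needs. The method shape and the `tinyints`/`shorts` factory names are assumed to mirror Iceberg's `ParquetValueWriters`, not copied from this PR:

```java
import org.apache.iceberg.parquet.ParquetValueWriters;
import org.apache.parquet.column.ColumnDescriptor;
import org.apache.spark.sql.types.ByteType;
import org.apache.spark.sql.types.DataType;
import org.apache.spark.sql.types.ShortType;

class IntWriterDispatch {
  // The Parquet column is a plain int32; if the incoming Spark column is
  // ByteType or ShortType, pick a writer that widens the value when writing.
  static ParquetValueWriters.PrimitiveWriter<?> ints(DataType type, ColumnDescriptor desc) {
    if (type instanceof ByteType) {
      return ParquetValueWriters.tinyints(desc); // reads bytes, writes int32
    } else if (type instanceof ShortType) {
      return ParquetValueWriters.shorts(desc); // reads shorts, writes int32
    }
    return ParquetValueWriters.ints(desc);
  }
}
```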

Note that even without the coercion code, the tests succeed on Spark 3.5 but fail on 3.3 and 3.4. They succeed on 3.5 because of apache/spark#40734, which routes CTAS/RTAS through AppendData; in that path Spark adds its own projection to handle the coercion.
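
To make the failure mode concrete, here is a hedged repro sketch (the catalog name, warehouse path, and table name are made up). On 3.3/3.4 this CTAS reached the Parquet writer without Spark's own projection, so the writer received ByteType/ShortType values for columns Iceberg stores as ints:

```java
import org.apache.spark.sql.SparkSession;

public class CoercionRepro {
  public static void main(String[] args) {
    // Local session with a Hadoop-backed Iceberg catalog (names assumed).
    SparkSession spark = SparkSession.builder()
        .master("local[2]")
        .config("spark.sql.catalog.demo", "org.apache.iceberg.spark.SparkCatalog")
        .config("spark.sql.catalog.demo.type", "hadoop")
        .config("spark.sql.catalog.demo.warehouse", "/tmp/iceberg-warehouse")
        .getOrCreate();

    // Iceberg has no byte/short types, so both columns become int in the
    // table schema; the writer must widen the incoming byte/short values.
    spark.sql("CREATE TABLE demo.db.t AS "
        + "SELECT CAST(1 AS TINYINT) AS b, CAST(2 AS SMALLINT) AS s");

    spark.stop();
  }
}
```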

So the change is not strictly necessary for Spark 3.5, but I am adding it anyway to:

  1. keep the code consistent between 3.3/3.4/3.5
  2. future-proof against Spark removing its auto-coercion at some point

github-actions bot added the spark label on May 17, 2024
shardulm94 (Contributor, Author)

cc: @Fokko @jkolash

@Fokko regarding your earlier comment: this is actually different from what visiting IntLogicalTypeAnnotation achieves. The visitor creates a ByteWriter when the Parquet logical type annotation (derived from the Iceberg schema) says the column is a byte. In this case, the table schema says the column is an int (not a byte), but the datatype of the Spark DataFrame is byte, hence the coercion is required.
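
A small sketch of the distinction; only parquet-mr's `LogicalTypeAnnotation` API and Spark's `DataType` are real here, the helper class and methods are hypothetical:

```java
import org.apache.parquet.schema.LogicalTypeAnnotation;
import org.apache.parquet.schema.LogicalTypeAnnotation.IntLogicalTypeAnnotation;
import org.apache.spark.sql.types.ByteType;
import org.apache.spark.sql.types.DataType;

class WriterChoice {
  // (1) Annotation-driven: the Parquet logical type is derived from the table
  // schema. An int(8) annotation means the table column itself is a byte, so
  // the visitor can pick a byte writer from the annotation alone.
  static boolean tableColumnIsByte(LogicalTypeAnnotation annotation) {
    return annotation instanceof IntLogicalTypeAnnotation
        && ((IntLogicalTypeAnnotation) annotation).getBitWidth() == 8;
  }

  // (2) Spark-type-driven: the table column is a plain int (no int(8)
  // annotation), but the DataFrame column is ByteType, so the writer must
  // still accept bytes and widen them; that is the case this PR restores.
  static boolean needsByteToIntCoercion(LogicalTypeAnnotation annotation, DataType sparkType) {
    return !tableColumnIsByte(annotation) && sparkType instanceof ByteType;
  }
}
```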

Fokko (Contributor) left a comment:

@shardulm94 This makes sense to me, thanks for fixing this 👍

shardulm94 merged commit 8d6bee7 into apache:main on May 20, 2024
31 checks passed

Linked issue that merging this PR may close: byte and short types in spark no longer auto coerce to int32 (#10225)