[SPARK-41151][FOLLOW-UP][SQL] Keep built-in file _metadata fields nullable value consistent #38777
Yaohua628 wants to merge 4 commits into apache:master
|
@cloud-fan @dongjoon-hyun @HeartSaVioR Sorry for the back and forth. In the previous PR, we changed the Please let me know WDYT. As @cloud-fan mentioned, it should be fine to write not-null data into a nullable column. My only concern is whether this change might break the existing stateful streaming schema compatibility check. Also, cc @ala to confirm. Thanks for all your help!
|
The state store schema checker handles nullability when checking compatibility. It allows not only identical nullability, but also the case where a column in the existing schema is nullable and the same column in the new schema is non-nullable. So making the columns non-nullable is OK from a compatibility point of view.
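The rule described in this comment can be sketched as a small predicate. This is a simplified illustration, not the state store schema checker's actual code: `Field`, `NullabilityCompat`, `fieldCompatible`, and `schemaCompatible` are hypothetical names, and data types are modeled as plain strings.

```scala
// Simplified, hypothetical model of a schema field; this is not Spark's StructField.
final case class Field(name: String, dataType: String, nullable: Boolean)

object NullabilityCompat {
  // The stored field accepts the new field when the types match and the new
  // field cannot carry nulls that the stored field forbids.
  def fieldCompatible(existing: Field, incoming: Field): Boolean =
    existing.dataType == incoming.dataType &&
      (existing.nullable || !incoming.nullable)

  // Fields are compared by position; names are not compared in this sketch.
  def schemaCompatible(existing: Seq[Field], incoming: Seq[Field]): Boolean =
    existing.length == incoming.length &&
      existing.zip(incoming).forall { case (e, i) => fieldCompatible(e, i) }
}
```

Under this model, an existing nullable `file_path` paired with a new non-nullable `file_path` is accepted, while the reverse direction (existing non-nullable, new nullable) is rejected.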
|
Let me see if there are further review comments today. I will merge this tomorrow if there is no outstanding comment. |
|
Thank you, Jungtaek! Also wanna confirm with @ala on nullability of |
|
Ah OK, let's wait for feedback from @ala and make sure this is clear before merging it.
|
Can one of the admins verify this patch? |
dongjoon-hyun left a comment
+1, LGTM.
Per the above discussion, gentle ping, @ala.
|
Sorry, I was sick for a couple of days. I think there's no issue for row index, so I think we can go ahead and merge this PR. I could imagine a scenario where a new metadata field is introduced gradually and is supported in only one of the readers at first, so it would be best not to make it a hard requirement that all fields be non-nullable.
|
@Yaohua628 Could you please push a new empty commit or rebase to master branch to retrigger build? Let's make sure build is green before merging this. |
|
@HeartSaVioR @ala some tests in I had a fix here to resolve failures (basically: keep the internal |
|
I don't have context for that, sorry. I'm OK either way if the nullability of column is guaranteed to not fluctuate. |
sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/FileSourceStrategy.scala
|
Addressed comments, thanks for taking a look! |
    // Change the `_tmp_metadata_row_index` to `row_index`,
    // and also change the nullability to not nullable,
    // which is consistent with the nullability of `row_index` field
    .get.withName(FileFormat.ROW_INDEX).withNullability(false)
Shall we update `fileFormatReaderGeneratedMetadataColumns` to set nullability to false?
Thanks, Wenchen, I tried that before, but it failed many test cases in FileMetadataStructRowIndexSuite. See this fix commit and this comment.
I don't have much context on `row_index` and am not sure what caused the issue. Any idea? Thanks!
|
Well, the issue seems to be that the vectorized reader recognizes the row index column as a "missing column" (i.e., a column that is not read from the file, but is instead populated by a higher layer in the reader). Since missing columns are normally populated with nulls, it's a problem if the data type is non-nullable. We could tweak this if condition so it does not throw on generated columns such as the row index, or use the workaround you put in place already.
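The failure mode described here can be modeled roughly as follows. This is a simplified sketch, not the vectorized reader's code: `Column`, `MissingColumnCheck`, `validate`, the `allowGenerated` flag, and the error text are all hypothetical; only the `_tmp_metadata_row_index` name comes from the diff in this PR.

```scala
// Hypothetical model of a requested column as the reader sees it.
final case class Column(name: String, nullable: Boolean, presentInFile: Boolean)

object MissingColumnCheck {
  // Columns that a higher layer of the reader fills in itself (assumed set).
  val generatedColumns: Set[String] = Set("_tmp_metadata_row_index")

  // Returns Left(error) in the situation where a strict reader would throw.
  // With allowGenerated = true, generated columns are exempt from the check,
  // which corresponds to the "tweak the condition" option in the comment.
  def validate(col: Column, allowGenerated: Boolean): Either[String, Unit] = {
    val missing = !col.presentInFile
    val exempt = allowGenerated && generatedColumns.contains(col.name)
    if (missing && !col.nullable && !exempt)
      Left(s"Non-nullable column is missing from the data file: ${col.name}")
    else
      Right(())
  }
}
```

The other option mentioned, the workaround already in the PR, corresponds to keeping the internal column nullable (so the check passes unchanged) and switching to non-nullable only when the column is renamed to `row_index`.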
|
|
Thanks for the explanation, @ala! I am OK either way. cc @cloud-fan @HeartSaVioR, feel free to merge it if you think it is OK.
|
It seems OK to me as well but I'll lean on @cloud-fan on the decision as I'm not an expert on this subject. |
|
thanks, merging to master/3.3! |
|
oh it conflicts with 3.3, @Yaohua628 can you open a backport PR? thanks! |
…lable value consistent

### What changes were proposed in this pull request?
A follow-up PR of apache#38683. Apart from making `_metadata` struct not nullable, we should also make all fields inside of `_metadata` not nullable (`file_path`, `file_name`, `file_modification_time`, `file_size`, `row_index`).

### Why are the changes needed?
Consistent nullability behavior for everything

### Does this PR introduce _any_ user-facing change?
No

### How was this patch tested?
New UTs

Closes apache#38777 from Yaohua628/spark-41151-follow-up.

Authored-by: yaohua <yaohua.zhao@databricks.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
…s nullable value consistent

### What changes were proposed in this pull request?
Cherry-pick #38777. Resolved conflicts in ac2d027

### Why are the changes needed?
N/A

### Does this PR introduce _any_ user-facing change?
N/A

### How was this patch tested?
N/A

Closes #38910 from Yaohua628/spark-41151-follow-up-3-3.

Authored-by: yaohua <yaohua.zhao@databricks.com>
Signed-off-by: Dongjoon Hyun <dongjoon@apache.org>
What changes were proposed in this pull request?
A follow-up PR of #38683.
Apart from making `_metadata` struct not nullable, we should also make all fields inside of `_metadata` not nullable (`file_path`, `file_name`, `file_modification_time`, `file_size`, `row_index`).
Why are the changes needed?
Consistent nullability behavior for everything
Does this PR introduce any user-facing change?
No
How was this patch tested?
New UTs
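For illustration, the schema shape this PR aims for might be modeled as below, assuming spark-sql is on the classpath. The field names come from the description above; the data types are assumptions not stated in this PR, and `MetadataSchemaSketch` and `fullyNonNullable` are hypothetical names, not part of Spark.

```scala
import org.apache.spark.sql.types._

object MetadataSchemaSketch {
  // Field names come from the PR description; the data types are assumptions.
  val metadataStruct: StructType = StructType(Seq(
    StructField("file_path", StringType, nullable = false),
    StructField("file_name", StringType, nullable = false),
    StructField("file_size", LongType, nullable = false),
    StructField("file_modification_time", TimestampType, nullable = false),
    StructField("row_index", LongType, nullable = false)
  ))

  // The `_metadata` struct itself was made non-nullable by the earlier PR.
  val metadataField: StructField =
    StructField("_metadata", metadataStruct, nullable = false)

  // True when the field is non-nullable and, if it is a struct,
  // every field directly inside it is non-nullable as well.
  def fullyNonNullable(field: StructField): Boolean = field.dataType match {
    case s: StructType => !field.nullable && s.fields.forall(f => !f.nullable)
    case _ => !field.nullable
  }
}
```

A unit test along these lines could assert `fullyNonNullable` on the `_metadata` field of a DataFrame schema read from a file source, though the tests actually added in this PR are not shown here.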