[SPARK-41151][FOLLOW-UP][SQL] Keep built-in file _metadata fields nullable value consistent#38777

Closed

Yaohua628 wants to merge 4 commits into apache:master from Yaohua628:spark-41151-follow-up

Conversation

@Yaohua628
Contributor

What changes were proposed in this pull request?

A follow-up PR of #38683.

Apart from making the `_metadata` struct not nullable, we should also make all fields inside `_metadata` not nullable (`file_path`, `file_name`, `file_modification_time`, `file_size`, `row_index`).
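To illustrate the intended end state, here is a rough sketch in plain Java (not Spark's actual `StructType` API; the class and field-type spellings here are illustrative assumptions) of the `_metadata` struct with every field marked non-nullable:

```java
import java.util.List;

public class MetadataSchemaSketch {
    // Minimal stand-in for a schema field: name, type name, nullability flag.
    public record Field(String name, String dataType, boolean nullable) {}

    // All built-in file metadata fields, each marked non-nullable,
    // mirroring the list in the PR description above.
    public static final List<Field> METADATA_FIELDS = List.of(
        new Field("file_path", "string", false),
        new Field("file_name", "string", false),
        new Field("file_size", "long", false),
        new Field("file_modification_time", "timestamp", false),
        new Field("row_index", "long", false)  // only for supported formats, e.g. Parquet
    );

    public static void main(String[] args) {
        boolean anyNullable = METADATA_FIELDS.stream().anyMatch(Field::nullable);
        System.out.println("any nullable field: " + anyNullable);  // prints "any nullable field: false"
    }
}
```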

Why are the changes needed?

Consistent nullability behavior across the `_metadata` struct and all of its fields.

Does this PR introduce any user-facing change?

No

How was this patch tested?

New UTs

@github-actions github-actions bot added the SQL label Nov 23, 2022
@Yaohua628
Contributor Author

@cloud-fan @dongjoon-hyun @HeartSaVioR Sorry for the back and forth.

In the previous PR, we changed `_metadata` to be not null. I just realized we should probably also make all fields inside `_metadata` (`file_path`, `file_name`, `file_modification_time`, `file_size`, `row_index`) not null, for consistency.

Please let me know WDYT. As @cloud-fan mentioned, it should be fine to write not-null data into a nullable column. My only concern is that this change might break the existing stateful streaming schema compatibility check.

Also, cc @ala to confirm that `row_index` will always be not null for the file formats that support it (e.g. Parquet).

Thanks for all your help!

@HeartSaVioR
Contributor

The state store schema checker handles compatibility for nullability. It allows not only exact equality, but also the case where a column in the existing schema is nullable while the corresponding column in the new schema is non-nullable. So making these columns non-nullable is OK from a compatibility point of view.
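The rule described above can be sketched as a single predicate (illustrative Java only, not the actual state store checker code; the method name is made up for this sketch):

```java
public class NullabilityCompatSketch {
    /**
     * Sketch of the nullability rule described above: equal nullability is
     * compatible, and an existing nullable column may become non-nullable
     * in the new schema; the reverse direction (non-nullable -> nullable)
     * is rejected.
     */
    public static boolean nullabilityCompatible(boolean existingNullable, boolean newNullable) {
        return existingNullable || !newNullable;
    }

    public static void main(String[] args) {
        System.out.println(nullabilityCompatible(true, false));  // nullable -> non-nullable: true
        System.out.println(nullabilityCompatible(false, true));  // non-nullable -> nullable: false
    }
}
```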

Contributor

@HeartSaVioR HeartSaVioR left a comment

+1

@HeartSaVioR
Contributor

Let me see if there are further review comments today. I will merge this tomorrow if there are no outstanding comments.

@Yaohua628
Contributor Author

Thank you, Jungtaek! I also want to confirm the nullability of `row_index` with @ala.

@HeartSaVioR
Contributor

Ah OK, let's wait for feedback from @ala and make sure everything is clear before merging.

@AmplabJenkins

Can one of the admins verify this patch?

Member

@dongjoon-hyun dongjoon-hyun left a comment

+1, LGTM.

According to the above discussion, gentle ping, @ala .

@ala
Contributor

ala commented Nov 29, 2022

Sorry, I was sick for a couple of days.

I think there's no issue for the row index, so we can go ahead and merge this PR. That said, I could imagine a scenario where a new metadata field is introduced gradually and at first supported by only one of the readers, so it would be best not to make non-nullability a hard requirement for all fields.

@HeartSaVioR
Contributor

@Yaohua628 Could you please push a new empty commit or rebase onto the master branch to retrigger the build? Let's make sure the build is green before merging this.

@Yaohua628
Contributor Author

@HeartSaVioR @ala some tests in FileMetadataStructRowIndexSuite are failing with:

java.io.IOException: Required column is missing in data file. Col: [_tmp_metadata_row_index]
[info] 	at org.apache.spark.sql.execution.datasources.parquet.VectorizedParquetRecordReader.checkColumn(VectorizedParquetRecordReader.java:375)
[info] 	at org.apache.spark.sql.execution.datasources.parquet.VectorizedParquetRecordReader.initializeInternal(VectorizedParquetRecordReader.java:349)
[info] 	at org.apache.spark.sql.execution.datasources.parquet.VectorizedParquetRecordReader.initialize(VectorizedParquetRecordReader.java:181)

I pushed a fix to resolve the failures (basically: keep the internal `_tmp_metadata_row_index` nullable, but `_metadata.row_index` is still not null). I don't fully understand what happened internally; could you take a look? Thanks!

@HeartSaVioR
Contributor

I don't have context on that, sorry. I'm OK either way, as long as the nullability of the column is guaranteed not to fluctuate.

@Yaohua628
Contributor Author

Addressed comments, thanks for taking a look!

// Change the `_tmp_metadata_row_index` to `row_index`,
// and also change the nullability to not nullable,
// which is consistent with the nullability of `row_index` field
.get.withName(FileFormat.ROW_INDEX).withNullability(false)
Contributor

shall we update fileFormatReaderGeneratedMetadataColumns to set nullability to false?

Contributor Author

Thanks, Wenchen! I tried that before, but it failed many test cases in FileMetadataStructRowIndexSuite. See this fix commit and this comment.

I don't have much context on the row_index, not sure what caused the issue, any idea? Thanks!

@ala
Contributor

ala commented Dec 1, 2022

Well, the issue seems to be that the vectorized reader recognizes the row index column as a "missing column" (i.e., a column that is not read from the file, but instead populated by a higher layer in the reader). Since missing columns are normally populated with nulls, it's a problem if the data type is non-nullable.

if (column.required()) {
  // Column is missing in data but the required data is non-nullable. This file is invalid.
  throw new IOException("Required column is missing in data file. Col: " +
      Arrays.toString(path));
}

We could tweak this if condition to not throw for generated columns / the row index, or use the workaround you already put in place.
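A minimal sketch of what such a tweak might look like (hypothetical Java only, not the actual VectorizedParquetRecordReader code; the method name is made up, and the `_tmp_metadata_row_index` column name is taken from the error message in this thread):

```java
public class MissingColumnSketch {
    // Internal name of the generated row-index column, as seen in the
    // "Required column is missing" error above.
    public static final String ROW_INDEX_TEMP_COLUMN = "_tmp_metadata_row_index";

    /**
     * Returns true when the reader should raise the
     * "Required column is missing in data file" error: a required
     * (non-nullable) column missing from the data file is an error,
     * except for generated columns such as the row index, which a
     * higher layer of the reader populates itself.
     */
    public static boolean shouldThrowMissingColumn(String[] path, boolean required) {
        boolean generatedRowIndex =
            path.length > 0 && ROW_INDEX_TEMP_COLUMN.equals(path[path.length - 1]);
        return required && !generatedRowIndex;
    }
}
```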

@Yaohua628
Contributor Author

Thanks for the explanation, @ala! I am OK either way. cc @cloud-fan @HeartSaVioR, feel free to merge it if you think it is OK.

@HeartSaVioR
Contributor

It seems OK to me as well, but I'll defer to @cloud-fan on the decision, as I'm not an expert on this subject.

@cloud-fan
Contributor

thanks, merging to master/3.3!

@cloud-fan cloud-fan closed this in d8a600e Dec 5, 2022
@cloud-fan
Contributor

oh it conflicts with 3.3, @Yaohua628 can you open a backport PR? thanks!

Yaohua628 added a commit to Yaohua628/spark that referenced this pull request Dec 5, 2022
…lable value consistent

A follow-up PR of apache#38683.

Apart from making `_metadata` struct not nullable, we should also make all fields inside of `_metadata` not nullable (`file_path`, `file_name`, `file_modification_time`, `file_size`, `row_index`).

Consistent nullability behavior for everything

No

New UTs

Closes apache#38777 from Yaohua628/spark-41151-follow-up.

Authored-by: yaohua <yaohua.zhao@databricks.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
dongjoon-hyun pushed a commit that referenced this pull request Dec 6, 2022
…s nullable value consistent

### What changes were proposed in this pull request?
Cherry-pick #38777. Resolved conflicts in ac2d027

### Why are the changes needed?
N/A

### Does this PR introduce _any_ user-facing change?
N/A

### How was this patch tested?
N/A

Closes #38910 from Yaohua628/spark-41151-follow-up-3-3.

Authored-by: yaohua <yaohua.zhao@databricks.com>
Signed-off-by: Dongjoon Hyun <dongjoon@apache.org>
beliefer pushed a commit to beliefer/spark that referenced this pull request Dec 18, 2022
…lable value consistent

### What changes were proposed in this pull request?
A follow-up PR of apache#38683.

Apart from making `_metadata` struct not nullable, we should also make all fields inside of `_metadata` not nullable (`file_path`, `file_name`, `file_modification_time`, `file_size`, `row_index`).

### Why are the changes needed?
Consistent nullability behavior for everything

### Does this PR introduce _any_ user-facing change?
No

### How was this patch tested?
New UTs

Closes apache#38777 from Yaohua628/spark-41151-follow-up.

Authored-by: yaohua <yaohua.zhao@databricks.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
