[SPARK-41151][FOLLOW-UP][SQL] Keep built-in file _metadata fields nullable value consistent#38777

Closed

Yaohua628 wants to merge 4 commits into apache:master from Yaohua628:spark-41151-follow-up

Conversation

@Yaohua628
Contributor

What changes were proposed in this pull request?

A follow-up PR of #38683.

Apart from making the `_metadata` struct not nullable, we should also make all fields inside `_metadata` not nullable (`file_path`, `file_name`, `file_modification_time`, `file_size`, `row_index`).
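To illustrate the intended end state, here is a rough sketch in plain Java (not Spark's actual `StructType` API; the class and field-type spellings here are illustrative assumptions) of the `_metadata` struct with every field marked non-nullable:

```java
import java.util.List;

public class MetadataSchemaSketch {
    // Minimal stand-in for a schema field: name, type name, nullability flag.
    public record Field(String name, String dataType, boolean nullable) {}

    // All built-in file metadata fields, each marked non-nullable,
    // mirroring the list in the PR description above.
    public static final List<Field> METADATA_FIELDS = List.of(
        new Field("file_path", "string", false),
        new Field("file_name", "string", false),
        new Field("file_size", "long", false),
        new Field("file_modification_time", "timestamp", false),
        new Field("row_index", "long", false)  // only for supported formats, e.g. Parquet
    );

    public static void main(String[] args) {
        boolean anyNullable = METADATA_FIELDS.stream().anyMatch(Field::nullable);
        System.out.println("any nullable field: " + anyNullable);  // prints "any nullable field: false"
    }
}
```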

Why are the changes needed?

Consistent nullability behavior across the `_metadata` struct and all of its fields.

Does this PR introduce any user-facing change?

No

How was this patch tested?

New UTs

@github-actions github-actions bot added the SQL label Nov 23, 2022
@Yaohua628
Contributor Author

@cloud-fan @dongjoon-hyun @HeartSaVioR Sorry for the back and forth.

In the previous PR, we changed `_metadata` to be not null. I just realized we should probably also make all fields inside `_metadata` (`file_path`, `file_name`, `file_modification_time`, `file_size`, `row_index`) not null, for consistency.

Please let me know WDYT. As @cloud-fan mentioned, it should be fine to write not-null data into a nullable column. My only concern is that this change might break the existing stateful streaming schema compatibility check.

Also, cc @ala to confirm that `row_index` will always be not null for the file formats that support it (e.g. Parquet).

Thanks for all your help!

@HeartSaVioR
Contributor

The state store schema checker handles compatibility for nullability. It allows not only exact equality, but also the case where a column in the existing schema is nullable while the corresponding column in the new schema is non-nullable. So making these columns non-nullable is OK from a compatibility point of view.
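The rule described above can be sketched as a single predicate (illustrative Java only, not the actual state store checker code; the method name is made up for this sketch):

```java
public class NullabilityCompatSketch {
    /**
     * Sketch of the nullability rule described above: equal nullability is
     * compatible, and an existing nullable column may become non-nullable
     * in the new schema; the reverse direction (non-nullable -> nullable)
     * is rejected.
     */
    public static boolean nullabilityCompatible(boolean existingNullable, boolean newNullable) {
        return existingNullable || !newNullable;
    }

    public static void main(String[] args) {
        System.out.println(nullabilityCompatible(true, false));  // nullable -> non-nullable: true
        System.out.println(nullabilityCompatible(false, true));  // non-nullable -> nullable: false
    }
}
```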

Contributor

@HeartSaVioR HeartSaVioR left a comment

+1

@HeartSaVioR
Contributor

Let me see if there are further review comments today. I will merge this tomorrow if there are no outstanding comments.

@Yaohua628
Contributor Author

Thank you, Jungtaek! I also want to confirm the nullability of `row_index` with @ala.

@HeartSaVioR
Contributor

Ah OK, let's wait for feedback from @ala and make sure everything is clear before merging.

@AmplabJenkins

Can one of the admins verify this patch?

Member

@dongjoon-hyun dongjoon-hyun left a comment

+1, LGTM.

According to the above discussion, gentle ping, @ala .

@ala
Contributor

ala commented Nov 29, 2022

Sorry, I was sick for a couple of days.

I think there's no issue for the row index, so we can go ahead and merge this PR. That said, I could imagine a scenario where a new metadata field is introduced gradually and at first supported by only one of the readers, so it would be best not to make non-nullability a hard requirement for all fields.

@HeartSaVioR
Contributor

@Yaohua628 Could you please push a new empty commit or rebase onto the master branch to retrigger the build? Let's make sure the build is green before merging this.

@Yaohua628
Contributor Author

@HeartSaVioR @ala some tests in FileMetadataStructRowIndexSuite are failing with:

java.io.IOException: Required column is missing in data file. Col: [_tmp_metadata_row_index]
[info] 	at org.apache.spark.sql.execution.datasources.parquet.VectorizedParquetRecordReader.checkColumn(VectorizedParquetRecordReader.java:375)
[info] 	at org.apache.spark.sql.execution.datasources.parquet.VectorizedParquetRecordReader.initializeInternal(VectorizedParquetRecordReader.java:349)
[info] 	at org.apache.spark.sql.execution.datasources.parquet.VectorizedParquetRecordReader.initialize(VectorizedParquetRecordReader.java:181)

I pushed a fix to resolve the failures (basically: keep the internal `_tmp_metadata_row_index` nullable, but `_metadata.row_index` is still not null). I don't fully understand what happened internally; could you take a look? Thanks!

@HeartSaVioR
Contributor

I don't have context on that, sorry. I'm OK either way, as long as the nullability of the column is guaranteed not to fluctuate.

@Yaohua628
Contributor Author

Addressed comments, thanks for taking a look!

// Change the `_tmp_metadata_row_index` to `row_index`,
// and also change the nullability to not nullable,
// which is consistent with the nullability of `row_index` field
.get.withName(FileFormat.ROW_INDEX).withNullability(false)
Contributor

shall we update fileFormatReaderGeneratedMetadataColumns to set nullability to false?

Contributor Author

Thanks, Wenchen! I tried that before, but it failed many test cases in FileMetadataStructRowIndexSuite. See this fix commit and this comment.

I don't have much context on the row_index, not sure what caused the issue, any idea? Thanks!

@ala
Contributor

ala commented Dec 1, 2022

Well, the issue seems to be that the vectorized reader recognizes the row index column as a "missing column" (i.e., a column that is not read from the file, but instead populated by a higher layer in the reader). Since missing columns are normally populated with nulls, it's a problem if the data type is non-nullable.

if (column.required()) {
  // Column is missing in data but the required data is non-nullable. This file is invalid.
  throw new IOException("Required column is missing in data file. Col: " +
      Arrays.toString(path));
}

We could tweak this if condition to not throw for generated columns / the row index, or use the workaround you already put in place.
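A minimal sketch of what such a tweak might look like (hypothetical Java only, not the actual VectorizedParquetRecordReader code; the method name is made up, and the `_tmp_metadata_row_index` column name is taken from the error message in this thread):

```java
public class MissingColumnSketch {
    // Internal name of the generated row-index column, as seen in the
    // "Required column is missing" error above.
    public static final String ROW_INDEX_TEMP_COLUMN = "_tmp_metadata_row_index";

    /**
     * Returns true when the reader should raise the
     * "Required column is missing in data file" error: a required
     * (non-nullable) column missing from the data file is an error,
     * except for generated columns such as the row index, which a
     * higher layer of the reader populates itself.
     */
    public static boolean shouldThrowMissingColumn(String[] path, boolean required) {
        boolean generatedRowIndex =
            path.length > 0 && ROW_INDEX_TEMP_COLUMN.equals(path[path.length - 1]);
        return required && !generatedRowIndex;
    }
}
```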

@Yaohua628
Contributor Author

Thanks for the explanation, @ala! I am OK either way. cc @cloud-fan @HeartSaVioR, feel free to merge it if you think it is OK.

@HeartSaVioR
Contributor

It seems OK to me as well, but I'll defer to @cloud-fan on the decision, as I'm not an expert on this subject.

@cloud-fan
Contributor

thanks, merging to master/3.3!

@cloud-fan cloud-fan closed this in d8a600e Dec 5, 2022
@cloud-fan
Contributor

oh it conflicts with 3.3, @Yaohua628 can you open a backport PR? thanks!

Yaohua628 added a commit to Yaohua628/spark that referenced this pull request Dec 5, 2022
…lable value consistent

A follow-up PR of apache#38683.

Apart from making `_metadata` struct not nullable, we should also make all fields inside of `_metadata` not nullable (`file_path`, `file_name`, `file_modification_time`, `file_size`, `row_index`).

Consistent nullability behavior for everything

No

New UTs

Closes apache#38777 from Yaohua628/spark-41151-follow-up.

Authored-by: yaohua <yaohua.zhao@databricks.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
dongjoon-hyun pushed a commit that referenced this pull request Dec 6, 2022
…s nullable value consistent

### What changes were proposed in this pull request?
Cherry-pick #38777. Resolved conflicts in ac2d027

### Why are the changes needed?
N/A

### Does this PR introduce _any_ user-facing change?
N/A

### How was this patch tested?
N/A

Closes #38910 from Yaohua628/spark-41151-follow-up-3-3.

Authored-by: yaohua <yaohua.zhao@databricks.com>
Signed-off-by: Dongjoon Hyun <dongjoon@apache.org>
beliefer pushed a commit to beliefer/spark that referenced this pull request Dec 18, 2022
…lable value consistent

### What changes were proposed in this pull request?
A follow-up PR of apache#38683.

Apart from making `_metadata` struct not nullable, we should also make all fields inside of `_metadata` not nullable (`file_path`, `file_name`, `file_modification_time`, `file_size`, `row_index`).

### Why are the changes needed?
Consistent nullability behavior for everything

### Does this PR introduce _any_ user-facing change?
No

### How was this patch tested?
New UTs

Closes apache#38777 from Yaohua628/spark-41151-follow-up.

Authored-by: yaohua <yaohua.zhao@databricks.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
