Add a serializer for FileScanTask #1698

stevenzwu · 2020-11-01T22:02:28Z

For batch/bounded mode, Java serialization (Serializable) works well because schema evolution is not a concern. If we are going to support streaming reads with long-running jobs, we need to consider schema evolution for checkpoint state. Otherwise, a change in the code might break Java serialization and the ability to restore from a checkpoint.
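One common way to make checkpoint state survive code changes is to write an explicit format version ahead of the fields, so that a newer reader can still decode bytes written by an older one. The following is only a minimal sketch of that idea using the JDK's data streams. `VersionedSplitSerializer`, `SplitState`, and the field layout are hypothetical names invented for illustration, not Iceberg or Flink classes, and the `length` field is assumed to have been added in a second format version.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.DataInputStream;
import java.io.DataOutputStream;
import java.io.IOException;

// Hypothetical sketch: none of these names are Iceberg or Flink classes.
final class VersionedSplitSerializer {
  static final int CURRENT_VERSION = 2;

  // Simplified stand-in for the state stored in a checkpoint.
  static final class SplitState {
    final String filePath;
    final long start;
    final long length; // assumed to have been added in format version 2

    SplitState(String filePath, long start, long length) {
      this.filePath = filePath;
      this.start = start;
      this.length = length;
    }
  }

  // Writes the current layout, prefixed with an explicit format version.
  static byte[] serialize(SplitState state) throws IOException {
    ByteArrayOutputStream bytes = new ByteArrayOutputStream();
    try (DataOutputStream out = new DataOutputStream(bytes)) {
      out.writeInt(CURRENT_VERSION);
      out.writeUTF(state.filePath);
      out.writeLong(state.start);
      out.writeLong(state.length);
    }
    return bytes.toByteArray();
  }

  // Writes the older version 1 layout, simulating a checkpoint taken before an upgrade.
  static byte[] serializeV1(String filePath, long start) throws IOException {
    ByteArrayOutputStream bytes = new ByteArrayOutputStream();
    try (DataOutputStream out = new DataOutputStream(bytes)) {
      out.writeInt(1);
      out.writeUTF(filePath);
      out.writeLong(start);
    }
    return bytes.toByteArray();
  }

  // Reads the version first, then decodes the layout that version implies.
  static SplitState deserialize(byte[] data) throws IOException {
    try (DataInputStream in = new DataInputStream(new ByteArrayInputStream(data))) {
      int version = in.readInt();
      switch (version) {
        case 1: {
          String filePath = in.readUTF();
          long start = in.readLong();
          return new SplitState(filePath, start, -1L); // length unknown in version 1
        }
        case 2: {
          String filePath = in.readUTF();
          long start = in.readLong();
          long length = in.readLong();
          return new SplitState(filePath, start, length);
        }
        default:
          throw new IOException("Unknown serialization version: " + version);
      }
    }
  }

  public static void main(String[] args) throws IOException {
    SplitState restored = deserialize(serializeV1("data/file-a.parquet", 4L));
    System.out.println(restored.filePath + " start=" + restored.start + " length=" + restored.length);
  }
}
```

Unlike default Java serialization, the byte layout here is decided by the serializer and not by the shape of the class, so renaming or refactoring `SplitState` would not by itself invalidate old checkpoints.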

Here are some high-level thoughts.

- Move most of the schema defined in DataFile to the parent interface ContentFile. Extend the schema with the additional fields in DataFile and DeleteFile.
- Add a schema to FileScanTask where the ResidualEvaluator and PartitionSpec fields will be defined as string type.
- The CombinedScanTask schema is straightforward. It should be just a collection of FileScanTask.
- Add a ScanTasks util class in iceberg-core that handles the serialization and deserialization of FileScanTask and CombinedScanTask.

One challenge is how to plug in custom field serializers for ResidualEvaluator and PartitionSpec.

Overall, this seems like a large change. I'm not sure if there is a simpler way.

stevenzwu · 2020-11-06T18:41:26Z

@openinx @JingsongLi @rdblue any comments?

openinx · 2021-01-07T10:30:24Z

Thanks @stevenzwu for bringing this up. It's indeed a problem for the Flink streaming reader because it currently depends on Java serialization in StreamingReaderOperator. The job can easily crash when we upgrade the Iceberg library version (which changes the CombinedScanTask classes) and restart the Flink job.

I'd prefer to define an Avro schema in BaseCombinedScanTask, similar to BaseDataFile. Then we could store the Avro-serialized bytes in the Flink state backend. Let me evaluate the work.

coolderli · 2021-06-23T02:26:40Z

@stevenzwu @openinx Is there a patch to fix this problem?

stevenzwu · 2022-08-23T22:14:19Z

@rdblue @pvary I updated the description with some high-level thoughts on how this can potentially be achieved. Can you please share your thoughts?

rdblue · 2022-08-23T23:47:09Z

@stevenzwu, I think that we should introduce a JSON format and parser for these tasks. The information in a FileScanTask has been stable for a really long time so it wouldn't be a problem to maintain it. And we've had other projects ask for this before as well, since it is more common to use JSON to serialize in some settings. Trino uses JSON for RPC and we've also recently discussed adding job planning to the REST catalog interface.

stevenzwu · 2022-08-24T00:37:48Z

@rdblue JSON would work, although it is less efficient in terms of space and serialization. But I see the benefit that it can be useful in some other scenarios. I can look into that direction.

stevenzwu · 2022-08-24T02:58:13Z

@aokolnychyi I would also like to get your input. With the recent changelog scan, we may also need to document the JSON format for those changelog scan tasks in the future. That is not needed right now, especially as we are still iterating on those interfaces.

aokolnychyi · 2022-09-08T15:04:02Z

I am also +1 on trying to come up with a reasonable JSON representation. Handling job planning via the REST catalog is something I'd be interested to see.

stevenzwu · 2022-09-08T16:06:00Z

Anton, thanks a lot for the input. Looks like we have a direction moving forward.

stevenzwu · 2023-07-08T04:46:03Z

This is completed via the 3 linked PRs.

stevenzwu mentioned this issue Jan 6, 2021

Flink: Support streaming reader. #1793

Merged

openinx self-assigned this Jan 7, 2021

stevenzwu mentioned this issue Dec 3, 2021

Flink: FLIP-27 Iceberg source split #3501

Merged

stevenzwu mentioned this issue Aug 22, 2022

API: Remove source type from Transform #5601

Merged

stevenzwu changed the title ~~FileScanTask Serializer for Flink source checkpointing~~ Add a serializer for FileScanTask Aug 24, 2022

stevenzwu mentioned this issue Feb 24, 2023

core: add JSON parser for ContentFile and FileScanTask #6934

Merged

stevenzwu mentioned this issue Apr 4, 2023

Flink: Data statistics operator sends local data statistics to coordinator and receive aggregated data statistics from coordinator for smart shuffling #7269

Merged

This was referenced Jul 8, 2023

Core: add missing start and length for FileScanTaskParser. #7936

Merged

Flink: switch to FileScanTaskParser for JSON serialization of IcebergSourceSplit #7978

Merged

stevenzwu closed this as completed Jul 8, 2023

stevenzwu mentioned this issue Jan 31, 2024

Core: complete task JSON serialization for other types (like data task, manifest task) #9597

Closed
