Add a serializer for FileScanTask #1698
Comments
@openinx @JingsongLi @rdblue any comment?
Thanks @stevenzwu for bringing this up. It's indeed a problem for the Flink streaming reader because it depends on Java serialization in StreamingReaderOperator now; it's easy to crash when we upgrade the Iceberg lib version (which changes the CombinedScanTask classes) and restart the Flink job. I'd prefer to define the Avro schema in
@stevenzwu @openinx Is there a patch to fix this problem?
@stevenzwu, I think that we should introduce a JSON format and parser for these tasks. The information in a
@rdblue JSON would work, although it is less efficient in terms of space and serialization. But I see the benefit that it can be useful in some other scenarios. I can look into that direction.
@aokolnychyi would also like to get your input. With the recent changelog scan, we may also need to document the JSON format for those changelog scan tasks in the future. Not needed right now, especially as we are still iterating on those interfaces.
I am also +1 on trying to come up with a reasonable JSON representation. Handling job planning via the REST catalog is something I'd be interested to see. |
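For illustration, a JSON representation of a `FileScanTask` along these lines might look roughly like the sketch below. All field names and the nesting are hypothetical, chosen here only to show the kind of information such a format would need to carry (content file metadata, split range, delete files, and the residual filter); this is not the format the project standardized.

```json
{
  "data-file": {
    "content": "DATA",
    "file-path": "s3://bucket/table/data/00000-0-a.parquet",
    "file-format": "PARQUET",
    "record-count": 5000,
    "file-size-in-bytes": 1048576
  },
  "start": 0,
  "length": 1048576,
  "delete-files": [],
  "residual-filter": { "type": "eq", "term": "id", "value": 42 }
}
```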
Anton, thanks a lot for the input. Looks like we have a direction moving forward. |
This was completed via the 3 PRs linked.
For batch/bounded mode, Java `Serializable` works well, as there is no concern about schema evolution. If we are going to support streaming reads with long-running jobs, we need to consider schema evolution for checkpoint state. Otherwise, a change in the code might break Java serialization and the ability to restore from checkpoint.
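The fragility comes from how Java computes the default `serialVersionUID`: it is a hash of the class's name, fields, and methods, so any structural change to a task class in a new release produces a different UID, and deserializing bytes from an old checkpoint fails with `InvalidClassException`. The stand-in classes below are illustrative toys, not Iceberg's real `CombinedScanTask`; they just demonstrate that changing the class shape changes the computed UID.

```java
import java.io.ObjectStreamClass;
import java.io.Serializable;

// Hypothetical stand-ins for two releases of a scan-task class
// (illustrative only, not Iceberg's actual classes).
class TaskV1 implements Serializable {
    String filePath;
}

class TaskV2 implements Serializable {
    String filePath;
    long splitOffset; // field added in a "newer release"
}

public class SerialUidDemo {
    // The JVM derives the default serialVersionUID from the class
    // structure; a different structure yields a different UID, which
    // is what makes old checkpoint bytes unreadable after an upgrade.
    static long uidOf(Class<?> cls) {
        return ObjectStreamClass.lookup(cls).getSerialVersionUID();
    }

    public static void main(String[] args) {
        System.out.println("TaskV1 uid = " + uidOf(TaskV1.class));
        System.out.println("TaskV2 uid = " + uidOf(TaskV2.class));
        System.out.println("UIDs differ: " + (uidOf(TaskV1.class) != uidOf(TaskV2.class)));
    }
}
```

Pinning an explicit `serialVersionUID` only suppresses the exception; it does not make old bytes decode correctly against a reshaped class, which is why an explicit, versioned schema (Avro or JSON) is the safer option for long-lived state.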
Here are some high-level thoughts.
- Move the Avro schema in `DataFile` up to the parent interface `ContentFile`. Extend the schema with the additional fields in `DataFile` and `DeleteFile`.
- Define an Avro schema for `FileScanTask`, where the `ResidualEvaluator` and `PartitionSpec` fields will be defined as string type.
- The `CombinedScanTask` schema is straightforward; it should be just a collection of `FileScanTask`.
- Introduce a `ScanTasks` util class in iceberg-core that handles the serialization and deserialization of `FileScanTask` and `CombinedScanTask`.

One challenge is how to plug in custom field serializers for `ResidualEvaluator` and `PartitionSpec`.

Overall, this seems like a large change; not sure if there is a simpler way.
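One way to sketch that plug-in point is a small generic codec that converts a field to and from its string form before it is placed into the Avro record. Everything below is illustrative: `StringFieldCodec` is not an Iceberg class, and the toy `Spec` type stands in for `PartitionSpec` (in Iceberg itself, `PartitionSpecParser.toJson`/`fromJson` would be the natural functions to plug in).

```java
import java.util.function.Function;

// Illustrative sketch only: a generic codec for fields stored as strings
// inside a task's Avro schema. Not Iceberg's actual API.
final class StringFieldCodec<T> {
    private final Function<T, String> encoder;
    private final Function<String, T> decoder;

    StringFieldCodec(Function<T, String> encoder, Function<String, T> decoder) {
        this.encoder = encoder;
        this.decoder = decoder;
    }

    String encode(T value) { return encoder.apply(value); }

    T decode(String encoded) { return decoder.apply(encoded); }
}

public class CodecDemo {
    // Toy stand-in for PartitionSpec, carrying only an id.
    record Spec(int specId) {}

    // A codec instance wiring in the two conversion functions.
    static final StringFieldCodec<Spec> SPEC_CODEC = new StringFieldCodec<>(
        spec -> Integer.toString(spec.specId()),
        s -> new Spec(Integer.parseInt(s)));

    public static void main(String[] args) {
        String encoded = SPEC_CODEC.encode(new Spec(7));
        Spec roundTripped = SPEC_CODEC.decode(encoded);
        System.out.println("round-trip spec id = " + roundTripped.specId());
    }
}
```

The benefit of this shape is that the Avro schema itself only ever sees plain strings, so evolving `PartitionSpec` or the residual expression format becomes a JSON-parser concern rather than an Avro schema change.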