Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add a serializer for FileScanTask #1698

Closed
stevenzwu opened this issue Nov 1, 2020 · 10 comments
Closed

Add a serializer for FileScanTask #1698

stevenzwu opened this issue Nov 1, 2020 · 10 comments
Assignees

Comments

@stevenzwu
Copy link
Contributor

stevenzwu commented Nov 1, 2020

For batch/bounded mode, Java Serializable works well as there is no concern of schema evolution. If we are going to support the streaming read with long-running jobs, we need to consider schema evolution for checkpoint state. Otherwise, change in the code might break the Java serialization and ability to restore from checkpoint.

Here are some high-level thoughts.

  • Move most of the schema defined in DataFile to parent interface ContentFile. Extend the schema with the additional fields in DataFile and DeleteFile.
  • Add a schema to FileScanTask where ResidualEvaluator and PartitionSpec fields will be defined as string type.
  • CombinedScanTask schema is straightforward. it should be just a collection of FileScanTask
  • Add ScanTasks util class in iceberg-core that handles the serialization and deserialization of FileScanTask and CombinedScanTask

One challenge is how to plugin custom field serializers for ResidualEvaluator and PartitionSpec.

Overall, this seems like a large change. not sure if there is a simpler way.

@stevenzwu
Copy link
Contributor Author

@openinx @JingsongLi @rdblue any comment?

@openinx
Copy link
Member

openinx commented Jan 7, 2021

Thanks @stevenzwu for bringing this up. It's indeed an problem for flink streaming reader because it depends on the java serialization in StreamingReaderOperator now, it's easy to crash when we upgrade iceberg lib version (which changes the CombinedScanTask classes) and restart the flink job.

I'd prefer to define the avro schema in BaseCombinedScanTask which is similar to BaseDataFile, then we could maintain the binary bytes which is serialized by avro approach to flink state backend. Let me evaluate the work.

@openinx openinx self-assigned this Jan 7, 2021
@coolderli
Copy link
Contributor

@stevenzwu @openinx Is there a patch to fix this problem ?

@stevenzwu
Copy link
Contributor Author

@rdblue @pvary I updated the description with some high-level thoughts on how this can potentially achieved. Can you please share your thoughts?

@rdblue
Copy link
Contributor

rdblue commented Aug 23, 2022

@stevenzwu, I think that we should introduce a JSON format and parser for these tasks. The information in a FileScanTask has been stable for a really long time so it wouldn't be a problem to maintain it. And we've had other projects ask for this before as well, since it is more common to use JSON to serialize in some settings. Trino uses JSON for RPC and we've also recently discussed adding job planning to the REST catalog interface.

@stevenzwu
Copy link
Contributor Author

stevenzwu commented Aug 24, 2022

@rdblue JSON would work, although it is less efficient in terms of space and serialization. But I see the benefit that it can be useful in some other scenarios. I can look into that direction.

@stevenzwu
Copy link
Contributor Author

@aokolnychyi would also like to get your input. With the recent changelog scan, we may also need to document the JSON format for those changelog scan tasks in the future. not needed right now especially as we are still iterating on those interfaces.

@stevenzwu stevenzwu changed the title FileScanTask Serializer for Flink source checkpointing Add a serializer for FileScanTask Aug 24, 2022
@aokolnychyi
Copy link
Contributor

I am also +1 on trying to come up with a reasonable JSON representation. Handling job planning via the REST catalog is something I'd be interested to see.

@stevenzwu
Copy link
Contributor Author

Anton, thanks a lot for the input. Looks like we have a direction moving forward.

@stevenzwu
Copy link
Contributor Author

this is completed via the 3 PRs linked

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

5 participants