Core: add HadoopConfigurable interface to serialize custom FileIO #2678
rdblue merged 4 commits into apache:master from jackye1995:hadoop-config-serialization
Conversation
This is great @jackye1995. This will potentially be needed for #2607 (passing per-catalog overrides to the Hadoop configuration) as well. Thank you! 👍
core/src/main/java/org/apache/iceberg/hadoop/HadoopConfigurable.java (outdated review thread, resolved)
```java
  }
}

public static byte[] serializeToBytesWithHadoopConfig(Object obj) {
```
Why introduce a separate method here rather than supporting HadoopConfigurable in serializeToBytes? It seems less useful if use of HadoopConfigurable isn't automatic and you need to remember to call the right method.
The original thinking was that org.apache.iceberg.hadoop.SerializableConfiguration is not always the default: SerializableTable uses a map-based serializer, and Spark has its own serializer as well. So I don't want users to blindly assume that serializeToBytes takes care of the Hadoop configuration in all places. If we are making it the default, then I will add documentation to serializeToBytes to make this clear.
I see what you mean. In that case, maybe it would make more sense to allow passing the serializeConfWith function into serializeToBytes as an option? If you don't pass it, then we could use the current SerializableConfiguration as the default.
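For illustration, a minimal sketch of that option, assuming the existing SerializableConfiguration and SerializableSupplier classes; the two-argument overload and its parameter name are illustrative, not necessarily what was merged:

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.ObjectOutputStream;
import java.io.UncheckedIOException;
import java.util.function.Function;
import org.apache.hadoop.conf.Configuration;
import org.apache.iceberg.hadoop.HadoopConfigurable;
import org.apache.iceberg.hadoop.SerializableConfiguration;
import org.apache.iceberg.util.SerializableSupplier;

public class SerializationSketch {
  private SerializationSketch() {
  }

  // Default: wrap any Hadoop Configuration in SerializableConfiguration,
  // so callers do not need to remember a separate method.
  public static byte[] serializeToBytes(Object obj) {
    return serializeToBytes(obj, conf -> new SerializableConfiguration(conf)::get);
  }

  // Optional: callers with their own Configuration serializer (e.g. Spark's)
  // pass it in instead of the default.
  public static byte[] serializeToBytes(
      Object obj, Function<Configuration, SerializableSupplier<Configuration>> confSerializer) {
    if (obj instanceof HadoopConfigurable) {
      // let the object swap its non-serializable Configuration for a
      // serializable supplier before Java serialization runs
      ((HadoopConfigurable) obj).serializeConfWith(confSerializer);
    }

    try (ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        ObjectOutputStream out = new ObjectOutputStream(bytes)) {
      out.writeObject(obj);
      return bytes.toByteArray();
    } catch (IOException e) {
      throw new UncheckedIOException("Failed to serialize object", e);
    }
  }
}
```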
core/src/main/java/org/apache/iceberg/hadoop/HadoopConfigurable.java (outdated review thread, resolved)
```diff
  this.splitOpenFileCost = options.get(SparkReadOptions.FILE_OPEN_COST).map(Long::parseLong).orElse(null);

- if (table.io() instanceof HadoopFileIO) {
+ if (table.io() instanceof HadoopConfigurable) {
```
I don't think this is correct. The logic in this if statement assumes that it can create a FileSystem for the table's location if its FileIO is a HadoopFileIO. That's not necessarily the case if the IO is just HadoopConfigurable, because I might have my own implementation that for some reason uses a Hadoop conf.
This seems related to the discussion on the dev list about the locality read configuration. In the use case I am trying to support, the reader actually needs to run that code path and turn locality read on by default even though the FileIO is not a HadoopFileIO. So we have both cases to support, and the preference for locality read cannot be determined purely from the FileIO implementation and file URI.
In that code path, because it runs at reader initialization time, a table property seems like the best place to store this default behavior, although I agree it is not elegant since it is Spark specific.
For now, I think using the HadoopConfigurable check instead of the HadoopFileIO check is more flexible: users who do not need locality for a HadoopConfigurable implementation are likely not using HDFS anyway and will fail the check, and they can also use the locality option to turn it off explicitly.
Actually, the Hadoop configuration would be a better place than a table property for this default. Since that code path requires a file system, a Hadoop configuration is needed there anyway, and this avoids putting Spark-specific configs in table properties.
I think it's a reasonable idea to be able to set a property somewhere and use locality. But this specifically should check for HadoopFileIO until we know what that property does and where it is, because the locality code path is going to get a FileSystem instance. If you aren't using HadoopFileIO, then there is no guarantee that we can get a file system or that it is the same as the IO instance.
Let's fix the locality problem later and keep this commit focused on updates to add HadoopConfigurable. With just that as the goal of this PR, I think this should be left as HadoopFileIO.
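To make the distinction concrete, here is a rough sketch of the guard being discussed; the helper shape and the scheme check are illustrative assumptions, not the actual Spark reader code:

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.iceberg.Table;
import org.apache.iceberg.hadoop.HadoopFileIO;

class LocalitySketch {
  // Illustrative: locality is only preferred when the table's IO is a
  // HadoopFileIO, because only then is it safe to resolve a FileSystem for
  // the table location from the IO's own configuration.
  static boolean localityPreferred(Table table) {
    if (!(table.io() instanceof HadoopFileIO)) {
      // a HadoopConfigurable IO carries a Hadoop conf, but that gives no
      // guarantee that table.location() resolves to a usable FileSystem
      return false;
    }

    HadoopFileIO io = (HadoopFileIO) table.io();
    try {
      FileSystem fs = new Path(table.location()).getFileSystem(io.conf());
      // default locality on only for block-located file systems like HDFS
      return "hdfs".equalsIgnoreCase(fs.getScheme());
    } catch (IOException e) {
      throw new UncheckedIOException(e);
    }
  }
}
```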
@jackye1995, I left a couple of comments. The main blocker is that this widens the check for HadoopFileIO.

@rdblue okay, I understand the concern. Let's address the locality read in another PR; I have reverted the change there.

Thanks @jackye1995! I merged this. And thanks to @kbendick for also reviewing!
Currently we have special handling for HadoopFileIO in different code paths to make sure the Hadoop configuration can be serialized and deserialized properly. This PR introduces a HadoopConfigurable interface to make this more generic, so that other custom Hadoop-configurable FileIO implementations can leverage the same code path.
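Based on the description and the serializeConfWith discussion above, the interface presumably looks roughly like this sketch; the exact signature in the merged code may differ:

```java
import java.util.function.Function;
import org.apache.hadoop.conf.Configurable;
import org.apache.hadoop.conf.Configuration;
import org.apache.iceberg.util.SerializableSupplier;

/**
 * Sketch: a FileIO (or other object) that carries a Hadoop Configuration
 * implements this so that serialization code paths can replace the
 * non-serializable Configuration with a serializable supplier of it.
 */
public interface HadoopConfigurable extends Configurable {
  /**
   * Before Java serialization, swap this object's internal Configuration
   * for the serializable form produced by the given function.
   */
  void serializeConfWith(Function<Configuration, SerializableSupplier<Configuration>> confSerializer);
}
```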