
Core: add HadoopConfigurable interface to serialize custom FileIO #2678

Merged
rdblue merged 4 commits into apache:master from jackye1995:hadoop-config-serialization on Jun 23, 2021

Conversation

@jackye1995 (Contributor)

Currently we have special handling for HadoopFileIO in different code paths to make sure the Hadoop configuration can be serialized and deserialized properly. This PR introduces a HadoopConfigurable interface so that other custom Hadoop-configurable FileIO implementations can leverage the same code path.
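For reference, a minimal sketch of the shape such an interface takes, using the names discussed in this PR (the exact merged code may differ):

```java
import java.util.function.Function;
import org.apache.hadoop.conf.Configurable;
import org.apache.hadoop.conf.Configuration;
import org.apache.iceberg.util.SerializableSupplier;

/**
 * Extends Hadoop's Configurable so that objects holding a Configuration
 * (which is not java.io.Serializable) can be asked to wrap it in a
 * serializable form before Java serialization runs.
 */
public interface HadoopConfigurable extends Configurable {

  /**
   * Accepts a function that turns the live Configuration into a serializable
   * supplier of an equivalent Configuration, which is stored in place of the
   * raw Configuration before the object is serialized.
   */
  void serializeConfWith(Function<Configuration, SerializableSupplier<Configuration>> confSerializer);
}
```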

@kbendick (Contributor) left a comment:


This is great @jackye1995. This will potentially be needed for #2607 (passing per-catalog overrides to the Hadoop configuration) as well. Thank you! 👍

```java
  }
}

public static byte[] serializeToBytesWithHadoopConfig(Object obj) {
```

Contributor:


Why introduce a separate method here rather than supporting HadoopConfigurable in serializeToBytes? It seems less useful if use of HadoopConfigurable isn't automatic and you need to remember to call the right method.

@jackye1995 (Contributor, Author):


The original thinking was that org.apache.iceberg.hadoop.SerializableConfiguration is not always the default: SerializableTable uses a map-based serializer, and Spark also has its own serializer. So I didn't want users to blindly assume that serializeToBytes takes care of the Hadoop configuration in all places. If we are making it the default, then I will add documentation to serializeToBytes to make this clear.

Contributor:


I see what you mean. In that case, maybe it would make more sense to allow passing the serializeConfWith function into serializeToBytes as an option? If you don't pass it, then we could use the current SerializableConfiguration as the default.
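A sketch of that suggestion: serializeToBytes handles HadoopConfigurable automatically, with org.apache.iceberg.hadoop.SerializableConfiguration as the default wrapper, and an overload lets the caller supply their own serializer (names come from this discussion; the merged code may differ in detail):

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.ObjectOutputStream;
import java.io.UncheckedIOException;
import java.util.function.Function;
import org.apache.hadoop.conf.Configuration;
import org.apache.iceberg.hadoop.HadoopConfigurable;
import org.apache.iceberg.hadoop.SerializableConfiguration;
import org.apache.iceberg.util.SerializableSupplier;

public class SerializationUtil {

  private SerializationUtil() {
  }

  // default: wrap any Hadoop Configuration with SerializableConfiguration
  public static byte[] serializeToBytes(Object obj) {
    return serializeToBytes(obj, conf -> new SerializableConfiguration(conf)::get);
  }

  // overload: the caller decides how the Configuration becomes serializable
  public static byte[] serializeToBytes(
      Object obj, Function<Configuration, SerializableSupplier<Configuration>> confSerializer) {
    if (obj instanceof HadoopConfigurable) {
      ((HadoopConfigurable) obj).serializeConfWith(confSerializer);
    }

    try (ByteArrayOutputStream baos = new ByteArrayOutputStream();
         ObjectOutputStream oos = new ObjectOutputStream(baos)) {
      oos.writeObject(obj);
      return baos.toByteArray();
    } catch (IOException e) {
      throw new UncheckedIOException("Failed to serialize object", e);
    }
  }
}
```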

```diff
  this.splitOpenFileCost = options.get(SparkReadOptions.FILE_OPEN_COST).map(Long::parseLong).orElse(null);

- if (table.io() instanceof HadoopFileIO) {
+ if (table.io() instanceof HadoopConfigurable) {
```

Contributor:


I don't think this is correct. The logic in this if statement assumes that it can create a FileSystem for the table's location if its FileIO is a HadoopFileIO. That's not necessarily the case if the IO is just HadoopConfigurable, because I might have my own implementation that for some reason uses a Hadoop conf.

@jackye1995 (Contributor, Author):


This seems to be related to the dev list discussion about the locality read configuration. The use case I am trying to support actually needs to run that code path and turn locality reads on by default, even though the FileIO is not a HadoopFileIO. So we have both cases to support, and it is not sufficient to determine the locality read preference purely from the FileIO implementation and the file URI.

Because that code path runs at reader initialization time, a table property seems like the best place to store this default behavior, although I agree it is not really elegant since the setting is Spark specific.

For now, I think the HadoopConfigurable check is more flexible than the HadoopFileIO check: users with a HadoopConfigurable IO who do not need locality are likely not on HDFS anyway and will fail the scheme check, and they can also use the locality option to explicitly turn it off.

@jackye1995 (Contributor, Author):


Actually, the Hadoop configuration would be a better place than a table property for this default. The code path requires a file system, so a Hadoop configuration is needed anyway, and this avoids placing Spark-specific configs in table properties.
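A hypothetical illustration of that idea; the configuration key below is invented for this example, not an actual Iceberg property:

```java
// hypothetical: the FileIO's own Hadoop Configuration carries the default,
// so no Spark-specific key has to live in the table properties
Configuration conf = ((HadoopConfigurable) table.io()).getConf();
boolean localityByDefault = conf.getBoolean("iceberg.spark.locality.enabled", false);
```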

Contributor:


I think it's a reasonable idea to be able to set a property somewhere and use locality. But this specifically should check for HadoopFileIO until we know what that property does and where it lives, because the locality code path is going to get a FileSystem instance. If you aren't using HadoopFileIO, there is no guarantee that we can get a file system, or that it is the same one the IO instance uses.

Let's fix the locality problem later and keep this commit focused on adding HadoopConfigurable. With just that as the goal of this PR, I think this check should be left as HadoopFileIO.
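A simplified sketch of the guard being described, assuming HadoopFileIO exposes its Configuration via conf(); the point is that only a HadoopFileIO guarantees the table location resolves to a FileSystem with the same configuration the IO uses (this is not the exact Spark reader code):

```java
import java.io.IOException;
import java.util.Optional;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.iceberg.Table;
import org.apache.iceberg.hadoop.HadoopFileIO;

class LocalityCheck {
  // prefer locality only for HadoopFileIO, and only when the file system is HDFS
  static boolean localityPreferred(Table table, Optional<String> localityOption) {
    if (!(table.io() instanceof HadoopFileIO)) {
      return false;
    }
    HadoopFileIO io = (HadoopFileIO) table.io();
    boolean isHdfs;
    try {
      FileSystem fs = new Path(table.location()).getFileSystem(io.conf());
      isHdfs = "hdfs".equalsIgnoreCase(fs.getScheme());
    } catch (IOException e) {
      isHdfs = false;
    }
    // an explicit "locality" read option still overrides the HDFS-based default
    return localityOption.map(Boolean::parseBoolean).orElse(isHdfs);
  }
}
```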

@rdblue (Contributor) commented on Jun 22, 2021:

@jackye1995, I left a couple of comments. The main blocker is that this widens the HadoopFileIO check, and I don't think that is correct. We can change it later if we want to change how locality works.

@jackye1995 (Contributor, Author):

@rdblue okay, I understand the concern. Let's address locality reads in another PR; I have reverted the change there.

rdblue merged commit 01393a0 into apache:master on Jun 23, 2021
@rdblue (Contributor) commented on Jun 23, 2021:

Thanks @jackye1995! I merged this. And thanks to @kbendick for also reviewing!
