
Core: add HadoopConfigurable interface to serialize custom FileIO #2678

Merged
rdblue merged 4 commits into apache:master from jackye1995:hadoop-config-serialization on Jun 23, 2021

Conversation

@jackye1995 (Contributor)

Currently we have special handling for HadoopFileIO in different code paths to make sure the Hadoop configuration can be serialized and deserialized properly. This PR introduces a HadoopConfigurable interface so that other custom Hadoop-configurable FileIO implementations can leverage the same code path.
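For reference, a minimal sketch of the shape such an interface takes, using the names discussed in this PR (the exact merged code may differ):

```java
import java.util.function.Function;
import org.apache.hadoop.conf.Configurable;
import org.apache.hadoop.conf.Configuration;
import org.apache.iceberg.util.SerializableSupplier;

/**
 * Extends Hadoop's Configurable so that objects holding a Configuration
 * (which is not java.io.Serializable) can be asked to wrap it in a
 * serializable form before Java serialization runs.
 */
public interface HadoopConfigurable extends Configurable {

  /**
   * Accepts a function that turns the live Configuration into a serializable
   * supplier of an equivalent Configuration, which is stored in place of the
   * raw Configuration before the object is serialized.
   */
  void serializeConfWith(Function<Configuration, SerializableSupplier<Configuration>> confSerializer);
}
```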

@kbendick (Contributor) left a comment:


This is great @jackye1995. This will potentially be needed for #2607 (passing per-catalog overrides to the Hadoop configuration) as well. Thank you! 👍

```java
  }
}

public static byte[] serializeToBytesWithHadoopConfig(Object obj) {
```

Contributor:


Why introduce a separate method here rather than supporting HadoopConfigurable in serializeToBytes? It seems less useful if use of HadoopConfigurable isn't automatic and you need to remember to call the right method.

@jackye1995 (Contributor, Author):


The original thinking was that org.apache.iceberg.hadoop.SerializableConfiguration is not always the default: SerializableTable uses a map-based serializer, and Spark also has its own serializer. So I didn't want users to blindly assume that serializeToBytes takes care of the Hadoop configuration in all places. If we are making it the default, then I will add documentation to serializeToBytes to make this clear.

Contributor:


I see what you mean. In that case, maybe it would make more sense to allow passing the serializeConfWith function into serializeToBytes as an option? If you don't pass it, then we could use the current SerializableConfiguration as the default.
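A sketch of that suggestion: serializeToBytes handles HadoopConfigurable automatically, with org.apache.iceberg.hadoop.SerializableConfiguration as the default wrapper, and an overload lets the caller supply their own serializer (names come from this discussion; the merged code may differ in detail):

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.ObjectOutputStream;
import java.io.UncheckedIOException;
import java.util.function.Function;
import org.apache.hadoop.conf.Configuration;
import org.apache.iceberg.hadoop.HadoopConfigurable;
import org.apache.iceberg.hadoop.SerializableConfiguration;
import org.apache.iceberg.util.SerializableSupplier;

public class SerializationUtil {

  private SerializationUtil() {
  }

  // default: wrap any Hadoop Configuration with SerializableConfiguration
  public static byte[] serializeToBytes(Object obj) {
    return serializeToBytes(obj, conf -> new SerializableConfiguration(conf)::get);
  }

  // overload: the caller decides how the Configuration becomes serializable
  public static byte[] serializeToBytes(
      Object obj, Function<Configuration, SerializableSupplier<Configuration>> confSerializer) {
    if (obj instanceof HadoopConfigurable) {
      ((HadoopConfigurable) obj).serializeConfWith(confSerializer);
    }

    try (ByteArrayOutputStream baos = new ByteArrayOutputStream();
         ObjectOutputStream oos = new ObjectOutputStream(baos)) {
      oos.writeObject(obj);
      return baos.toByteArray();
    } catch (IOException e) {
      throw new UncheckedIOException("Failed to serialize object", e);
    }
  }
}
```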

```diff
  this.splitOpenFileCost = options.get(SparkReadOptions.FILE_OPEN_COST).map(Long::parseLong).orElse(null);

- if (table.io() instanceof HadoopFileIO) {
+ if (table.io() instanceof HadoopConfigurable) {
```

Contributor:


I don't think this is correct. The logic in this if statement assumes that it can create a FileSystem for the table's location if its FileIO is a HadoopFileIO. That's not necessarily the case if the IO is just HadoopConfigurable, because I might have my own implementation that for some reason uses a Hadoop conf.

@jackye1995 (Contributor, Author):


This seems to be related to the dev list discussion about the locality read configuration. The use case I am trying to support actually needs to run that code path and turn locality reads on by default, even though the FileIO is not a HadoopFileIO. So we have both cases to support, and it is not sufficient to determine the locality read preference purely from the FileIO implementation and the file URI.

Because that code path runs at reader initialization time, a table property seems like the best place to store this default behavior, although I agree it is not really elegant since the setting is Spark specific.

For now, I think the HadoopConfigurable check is more flexible than the HadoopFileIO check: users with a HadoopConfigurable IO who do not need locality are likely not on HDFS anyway and will fail the scheme check, and they can also use the locality option to explicitly turn it off.

@jackye1995 (Contributor, Author):


Actually, the Hadoop configuration would be a better place than a table property for this default. The code path requires a file system, so a Hadoop configuration is needed anyway, and this avoids placing Spark-specific configs in table properties.
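A hypothetical illustration of that idea; the configuration key below is invented for this example, not an actual Iceberg property:

```java
// hypothetical: the FileIO's own Hadoop Configuration carries the default,
// so no Spark-specific key has to live in the table properties
Configuration conf = ((HadoopConfigurable) table.io()).getConf();
boolean localityByDefault = conf.getBoolean("iceberg.spark.locality.enabled", false);
```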

Contributor:


I think it's a reasonable idea to be able to set a property somewhere and use locality. But this specifically should check for HadoopFileIO until we know what that property does and where it lives, because the locality code path is going to get a FileSystem instance. If you aren't using HadoopFileIO, there is no guarantee that we can get a file system, or that it is the same one the IO instance uses.

Let's fix the locality problem later and keep this commit focused on adding HadoopConfigurable. With just that as the goal of this PR, I think this check should be left as HadoopFileIO.
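A simplified sketch of the guard being described, assuming HadoopFileIO exposes its Configuration via conf(); the point is that only a HadoopFileIO guarantees the table location resolves to a FileSystem with the same configuration the IO uses (this is not the exact Spark reader code):

```java
import java.io.IOException;
import java.util.Optional;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.iceberg.Table;
import org.apache.iceberg.hadoop.HadoopFileIO;

class LocalityCheck {
  // prefer locality only for HadoopFileIO, and only when the file system is HDFS
  static boolean localityPreferred(Table table, Optional<String> localityOption) {
    if (!(table.io() instanceof HadoopFileIO)) {
      return false;
    }
    HadoopFileIO io = (HadoopFileIO) table.io();
    boolean isHdfs;
    try {
      FileSystem fs = new Path(table.location()).getFileSystem(io.conf());
      isHdfs = "hdfs".equalsIgnoreCase(fs.getScheme());
    } catch (IOException e) {
      isHdfs = false;
    }
    // an explicit "locality" read option still overrides the HDFS-based default
    return localityOption.map(Boolean::parseBoolean).orElse(isHdfs);
  }
}
```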

@rdblue (Contributor) commented on Jun 22, 2021:

@jackye1995, I left a couple of comments. The main blocker is that this widens the HadoopFileIO check, and I don't think that is correct. We can change it later if we want to change how locality works.

@jackye1995 (Contributor, Author):

@rdblue okay, I understand the concern. Let's address locality reads in another PR; I have reverted the change there.

rdblue merged commit 01393a0 into apache:master on Jun 23, 2021
@rdblue (Contributor) commented on Jun 23, 2021:

Thanks @jackye1995! I merged this. And thanks to @kbendick for also reviewing!
