
Keep original location in HadoopInputFile if given #2170

Merged
rdblue merged 4 commits into apache:master from wypoon:normalize_file_path
Jan 29, 2021

Conversation

@wypoon (Contributor) commented Jan 28, 2021:

This is a fix for issue #2169.

In the getInputFile methods of org.apache.iceberg.spark.source.BaseDataReader and org.apache.iceberg.flink.source.DataIterator, the InputFile is looked up in a map by the path string. The keys in the map come from InputFile#location. Currently, when we create a HadoopInputFile using a location string, we do not store the string in a field; we construct an org.apache.hadoop.fs.Path from it and store only the Path. HadoopInputFile#location then returns path.toString() for this Path, which normalizes the string. If the path string in the manifest is not normalized, the lookup fails because the strings are not equal.
The fix is to store the original location string in HadoopInputFile when it is created using a location string, so that we can return this string in HadoopInputFile#location.
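The mismatch can be illustrated with a minimal, self-contained sketch. No Hadoop classes are used here; `collapseSlashes` is a hypothetical stand-in that mimics only one of the normalizations a round-trip through `org.apache.hadoop.fs.Path` performs (collapsing duplicate slashes):

```java
import java.util.HashMap;
import java.util.Map;

public class PathKeyMismatch {
  // Hypothetical stand-in for the normalization that Path#toString applies;
  // here we only collapse duplicate slashes in the path portion.
  static String collapseSlashes(String location) {
    int schemeEnd = location.indexOf("://") + 3;
    String prefix = location.substring(0, schemeEnd);
    String rest = location.substring(schemeEnd).replaceAll("/+", "/");
    return prefix + rest;
  }

  public static void main(String[] args) {
    // Map keyed by InputFile#location, i.e. the normalized form
    Map<String, String> inputFiles = new HashMap<>();
    String manifestPath = "s3a://bucket/warehouse//data/file.parquet";
    inputFiles.put(collapseSlashes(manifestPath), "the InputFile");

    // Lookup by the raw manifest string fails: the strings are not equal
    System.out.println(inputFiles.get(manifestPath));                  // null
    System.out.println(inputFiles.get(collapseSlashes(manifestPath))); // the InputFile
  }
}
```

Keeping the original string in HadoopInputFile makes the map key and the manifest string identical, so the lookup succeeds without any normalization step.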

(Original commit message:)

This is a fix for issue apache#2169.
In `BaseDataReader#getInputFile`, the `InputFile` is looked up in a map
by the path string. The keys in the map are normalized paths. If the
path string in the manifest is not normalized, the lookup will fail
since the strings are not equal. The fix is to normalize the path string
before looking it up in the map.
@github-actions github-actions bot added the flink label Jan 28, 2021
@steveloughran (Contributor) left a comment:

Might be best to use path.toURI() as the key, as that has the strictest guarantees of escaping things.

  protected InputFile getInputFile(FileScanTask task) {
    Preconditions.checkArgument(!task.isDataTask(), "Invalid task type");
-   return inputFiles.get(task.file().path().toString());
+   return getInputFile(task.file().path().toString());
Contributor:

Probably best to use path().toURI() for the best guarantee of handling spaces.
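The escaping concern can be seen with java.net.URI alone (an illustration only; `escapedLocation` is a hypothetical helper, and `s3a`/`bucket` are example values). The multi-argument URI constructor percent-encodes characters such as spaces, which is why a URI-based key is unambiguous:

```java
import java.net.URI;
import java.net.URISyntaxException;

public class UriEscaping {
  // Builds an s3a:// location via the multi-argument URI constructor,
  // which percent-encodes illegal characters such as spaces.
  static String escapedLocation(String bucket, String path) {
    try {
      return new URI("s3a", bucket, path, null).toString();
    } catch (URISyntaxException e) {
      throw new IllegalArgumentException(e);
    }
  }

  public static void main(String[] args) {
    System.out.println(escapedLocation("bucket", "/dir/my file.parquet"));
    // s3a://bucket/dir/my%20file.parquet
  }
}
```

A raw location string containing a literal space would never equal the escaped form, so mixing the two as map keys would reintroduce the same kind of mismatch.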

Contributor:

This isn't a Hadoop Path, it is a CharSequence. That's how we generally avoid issues like encoding.

  InputFile getInputFile(String location) {
-   return inputFiles.get(location);
+   // normalize the path before looking it up in the map
+   Path path = new Path(location);
Contributor:

I don't think using the Hadoop API directly is a good way to solve the problem. It sounds like we need to fix the keys in the map to match the original location from the input split instead.

wypoon (PR author):
For @steveloughran's information, the reason the keys in the inputFiles map became normalized paths is that the files for the scan tasks go through encryption and decryption:

    Stream<EncryptedInputFile> encrypted = keyMetadata.entrySet().stream()
        .map(entry -> EncryptedFiles.encryptedInput(io.newInputFile(entry.getKey()), entry.getValue()));

    // decrypt with the batch call to avoid multiple RPCs to a key server, if possible
    Iterable<InputFile> decryptedFiles = encryptionManager.decrypt(encrypted::iterator);

and the keys are the locations of the decrypted files. The call to io.newInputFile goes through either HadoopFileIO or S3FileIO (currently the two implementations of FileIO), and newInputFile normalizes the path. The normalization through HadoopFileIO is exactly what I'm replicating here. I think it will work for an S3 path too.

@rdblue, from the decrypted files, I do not see a natural way to recover the original path string written in the manifest. Instead, can we add a method to the FileIO interface to return the normalized path for a path String, and then HadoopFileIO and S3FileIO will have to implement it?

Contributor:

Sounds like we will need to fix the interfaces that are modifying the file location and pass the original.

Contributor:

I think that HadoopInputFile just needs to be updated so that fromLocation preserves the original location rather than using path.toString().
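That suggestion can be sketched roughly as follows. This is a simplified stand-alone class for illustration, not the actual org.apache.iceberg.hadoop.HadoopInputFile; `normalizedLocation` only mimics one normalization a Path round-trip performs (collapsing duplicate slashes):

```java
public class HadoopInputFileSketch {
  private final String location; // the caller's original string, kept verbatim

  private HadoopInputFileSketch(String location) {
    this.location = location;
  }

  // fromLocation keeps the string it was given instead of round-tripping it
  // through a Path, so location() matches the manifest key exactly.
  public static HadoopInputFileSketch fromLocation(String location) {
    return new HadoopInputFileSketch(location);
  }

  public String location() {
    return location;
  }

  // For contrast: what the old behavior effectively returned after the
  // string was round-tripped through a Path (simulated here by collapsing
  // duplicate slashes).
  public static String normalizedLocation(String location) {
    int schemeEnd = location.indexOf("://") + 3;
    return location.substring(0, schemeEnd)
        + location.substring(schemeEnd).replaceAll("/+", "/");
  }

  public static void main(String[] args) {
    String manifestPath = "hdfs://nn/warehouse//data/file.parquet";
    System.out.println(HadoopInputFileSketch.fromLocation(manifestPath).location());
    System.out.println(HadoopInputFileSketch.normalizedLocation(manifestPath));
  }
}
```

The design point is that normalization still happens internally wherever a Path is needed, but the externally visible location() stays byte-for-byte what the caller supplied.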

wypoon (PR author):

@rdblue thank you for the suggestion; I have implemented it and updated the PR description accordingly.

wypoon (PR author):
According to the javadoc for InputFile#location, this is "The fully-qualified location of the input file as a String." The only concern I have is if we ever call FileIO#newInputFile or HadoopInputFile.fromLocation with a location that is not fully-qualified.

@github-actions github-actions bot added the core label Jan 28, 2021
@wypoon changed the title from "Make lookup of data file by file path robust" to "Keep original location in HadoopInputFile if given" on Jan 28, 2021
@rdblue (Contributor) commented Jan 29, 2021:

Thanks, @wypoon! I'll merge this when tests are passing.

@wypoon force-pushed the normalize_file_path branch from 6ee16a8 to 4318d2e on January 29, 2021
@rdblue rdblue merged commit f63fa5c into apache:master Jan 29, 2021
@rdblue (Contributor) commented Jan 29, 2021:

Merged. Thank you for fixing this, @wypoon! And thanks for the review, @steveloughran!

@wypoon wypoon deleted the normalize_file_path branch September 16, 2021 17:26