
Keep original location in HadoopInputFile if given #2170

Merged
rdblue merged 4 commits into apache:master from wypoon:normalize_file_path
Jan 29, 2021

Conversation

@wypoon (Contributor) commented Jan 28, 2021:

This is a fix for issue #2169.

In the getInputFile methods of org.apache.iceberg.spark.source.BaseDataReader and org.apache.iceberg.flink.source.DataIterator, the InputFile is looked up in a map by the path string. The keys in the map come from InputFile#location. Currently, when we create a HadoopInputFile using a location string, we do not store the string in a field; we construct an org.apache.hadoop.fs.Path from it and store only the Path. HadoopInputFile#location then returns path.toString() for this Path, which normalizes the string. If the path string in the manifest is not normalized, the lookup fails because the strings are not equal.
The fix is to store the original location string in HadoopInputFile when it is created using a location string, so that we can return this string in HadoopInputFile#location.
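The mismatch can be illustrated with a minimal, self-contained sketch. No Hadoop classes are used here; `collapseSlashes` is a hypothetical stand-in that mimics only one of the normalizations a round-trip through `org.apache.hadoop.fs.Path` performs (collapsing duplicate slashes):

```java
import java.util.HashMap;
import java.util.Map;

public class PathKeyMismatch {
  // Hypothetical stand-in for the normalization that Path#toString applies;
  // here we only collapse duplicate slashes in the path portion.
  static String collapseSlashes(String location) {
    int schemeEnd = location.indexOf("://") + 3;
    String prefix = location.substring(0, schemeEnd);
    String rest = location.substring(schemeEnd).replaceAll("/+", "/");
    return prefix + rest;
  }

  public static void main(String[] args) {
    // Map keyed by InputFile#location, i.e. the normalized form
    Map<String, String> inputFiles = new HashMap<>();
    String manifestPath = "s3a://bucket/warehouse//data/file.parquet";
    inputFiles.put(collapseSlashes(manifestPath), "the InputFile");

    // Lookup by the raw manifest string fails: the strings are not equal
    System.out.println(inputFiles.get(manifestPath));                  // null
    System.out.println(inputFiles.get(collapseSlashes(manifestPath))); // the InputFile
  }
}
```

Keeping the original string in HadoopInputFile makes the map key and the manifest string identical, so the lookup succeeds without any normalization step.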

(Original commit message:)

This is a fix for issue apache#2169.
In `BaseDataReader#getInputFile`, the `InputFile` is looked up in a map
by the path string. The keys in the map are normalized paths. If the
path string in the manifest is not normalized, the lookup will fail
since the strings are not equal. The fix is to normalize the path string
before looking it up in the map.
@github-actions github-actions bot added the flink label Jan 28, 2021
@steveloughran (Contributor) left a comment:

Might be best to use path.toURI() as the key, as that has the strictest guarantees of escaping things.

  protected InputFile getInputFile(FileScanTask task) {
    Preconditions.checkArgument(!task.isDataTask(), "Invalid task type");
-   return inputFiles.get(task.file().path().toString());
+   return getInputFile(task.file().path().toString());
Contributor:

Probably best to use path().toURI() for the best guarantee of handling spaces.
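The escaping concern can be seen with java.net.URI alone (an illustration only; `escapedLocation` is a hypothetical helper, and `s3a`/`bucket` are example values). The multi-argument URI constructor percent-encodes characters such as spaces, which is why a URI-based key is unambiguous:

```java
import java.net.URI;
import java.net.URISyntaxException;

public class UriEscaping {
  // Builds an s3a:// location via the multi-argument URI constructor,
  // which percent-encodes illegal characters such as spaces.
  static String escapedLocation(String bucket, String path) {
    try {
      return new URI("s3a", bucket, path, null).toString();
    } catch (URISyntaxException e) {
      throw new IllegalArgumentException(e);
    }
  }

  public static void main(String[] args) {
    System.out.println(escapedLocation("bucket", "/dir/my file.parquet"));
    // s3a://bucket/dir/my%20file.parquet
  }
}
```

A raw location string containing a literal space would never equal the escaped form, so mixing the two as map keys would reintroduce the same kind of mismatch.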

Contributor:

This isn't a Hadoop Path, it is a CharSequence. That's how we generally avoid issues like encoding.

  InputFile getInputFile(String location) {
-   return inputFiles.get(location);
+   // normalize the path before looking it up in the map
+   Path path = new Path(location);
Contributor:

I don't think using the Hadoop API directly is a good way to solve the problem. It sounds like we need to fix the keys in the map to match the original location from the input split instead.

wypoon (PR author):
For @steveloughran's information, the reason the keys in the inputFiles map became normalized paths is that the files for the scan tasks go through encryption and decryption:

    Stream<EncryptedInputFile> encrypted = keyMetadata.entrySet().stream()
        .map(entry -> EncryptedFiles.encryptedInput(io.newInputFile(entry.getKey()), entry.getValue()));

    // decrypt with the batch call to avoid multiple RPCs to a key server, if possible
    Iterable<InputFile> decryptedFiles = encryptionManager.decrypt(encrypted::iterator);

and the keys are the locations of the decrypted files. The call to io.newInputFile goes through either HadoopFileIO or S3FileIO (currently the two implementations of FileIO), and newInputFile normalizes the path. The normalization through HadoopFileIO is exactly what I'm replicating here. I think it will work for an S3 path too.

@rdblue, from the decrypted files, I do not see a natural way to recover the original path string written in the manifest. Instead, can we add a method to the FileIO interface to return the normalized path for a path String, and then HadoopFileIO and S3FileIO will have to implement it?

Contributor:

Sounds like we will need to fix the interfaces that are modifying the file location and pass the original.

Contributor:

I think that HadoopInputFile just needs to be updated so that fromLocation preserves the original location rather than using path.toString().
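That suggestion can be sketched roughly as follows. This is a simplified stand-alone class for illustration, not the actual org.apache.iceberg.hadoop.HadoopInputFile; `normalizedLocation` only mimics one normalization a Path round-trip performs (collapsing duplicate slashes):

```java
public class HadoopInputFileSketch {
  private final String location; // the caller's original string, kept verbatim

  private HadoopInputFileSketch(String location) {
    this.location = location;
  }

  // fromLocation keeps the string it was given instead of round-tripping it
  // through a Path, so location() matches the manifest key exactly.
  public static HadoopInputFileSketch fromLocation(String location) {
    return new HadoopInputFileSketch(location);
  }

  public String location() {
    return location;
  }

  // For contrast: what the old behavior effectively returned after the
  // string was round-tripped through a Path (simulated here by collapsing
  // duplicate slashes).
  public static String normalizedLocation(String location) {
    int schemeEnd = location.indexOf("://") + 3;
    return location.substring(0, schemeEnd)
        + location.substring(schemeEnd).replaceAll("/+", "/");
  }

  public static void main(String[] args) {
    String manifestPath = "hdfs://nn/warehouse//data/file.parquet";
    System.out.println(HadoopInputFileSketch.fromLocation(manifestPath).location());
    System.out.println(HadoopInputFileSketch.normalizedLocation(manifestPath));
  }
}
```

The design point is that normalization still happens internally wherever a Path is needed, but the externally visible location() stays byte-for-byte what the caller supplied.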

wypoon (PR author):

@rdblue thank you for the suggestion; I have implemented it and updated the PR description accordingly.

wypoon (PR author):
According to the javadoc for InputFile#location, this is "The fully-qualified location of the input file as a String." The only concern I have is if we ever call FileIO#newInputFile or HadoopInputFile.fromLocation with a location that is not fully-qualified.

@github-actions github-actions bot added the core label Jan 28, 2021
@wypoon changed the title from "Make lookup of data file by file path robust" to "Keep original location in HadoopInputFile if given" on Jan 28, 2021
@rdblue (Contributor) commented Jan 29, 2021:

Thanks, @wypoon! I'll merge this when tests are passing.

@wypoon force-pushed the normalize_file_path branch from 6ee16a8 to 4318d2e on January 29, 2021
@rdblue rdblue merged commit f63fa5c into apache:master Jan 29, 2021
@rdblue (Contributor) commented Jan 29, 2021:

Merged. Thank you for fixing this, @wypoon! And thanks for the review, @steveloughran!

@wypoon wypoon deleted the normalize_file_path branch September 16, 2021 17:26