Core: Update version-hint.txt atomically #1559
Conversation

If I understand correctly, these operations should be applied regardless of an existing version file: consider the case where two concurrent writers both see that there is no existing version file and try to write to the path directly. That said, delete -> rename can be (should be?) replaced with rename with overwrite = true. If I'm not missing anything, this ensures the last writer wins (there's no way to ensure atomicity via atomic rename when the file may exist, so it's a best effort) and a partial file is not exposed.
I don't think we can use rename with overwrite, since it behaves differently on different FileSystems. That is why we decided to use delete + rename, which works consistently on every FS.

@pvary Thanks for the review. I've committed an improved version of the fix.
I guess it doesn't explain the case when overwrite = true. (If it did, HDFS would be breaking the contract. I agree the local filesystem breaks the contract even without the option, so we can't simply rely on the contract.) delete -> rename would work if we simply call delete regardless of whether the file exists, so it's not a big deal. I just wanted to try to ensure the last writer wins, though with concurrent writers it wouldn't be strange for any one of them to win.
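
For concreteness, a minimal sketch of the delete-then-rename sequence being discussed, assuming a Hadoop FileSystem `fs`; the class, method, and variable names here are illustrative, not the PR's actual code:

```java
import java.io.IOException;

import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

class VersionHintSwap {
  // Best-effort last-writer-wins: delete unconditionally, then rename the
  // temp file (already fully written) into place
  static void swapIn(FileSystem fs, Path tempFile, Path versionHintFile) throws IOException {
    // delete() simply returns false for a missing path, so no existence check is needed
    fs.delete(versionHintFile, false /* recursive */);
    if (!fs.rename(tempFile, versionHintFile)) {
      // another writer may have recreated the destination between the delete
      // and the rename; with concurrent writers, any one of them may win
    }
  }
}
```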

+1

If there is a rename with overwrite option, then I would prefer to use that for the cases where it is atomic. That would be safer. (Side note: local FS doesn't provide atomic rename, even without overwrite, so you shouldn't be using Hadoop tables with it except for tests.) Also, we don't care whether the version hint is completely up to date. We just want to be close to the real version. Being a version or two behind is okay.

Since the FileSystem API doesn't provide atomic renames, we chose to create the new file, then delete the old file, then rename the new file into place.

@rdblue When you have some time, could you please review my change? Thanks

Does the file system API provide a rename with overwrite option? If so, we should use that regardless of whether it is atomic. If not, then we should move forward with delete and then rename.

@rdblue I agree, rename with overwrite would be the best solution, but on HDFS the rename fails without raising any exception if the destination exists.

@rdblue According to the FileSystem rename API doc, what if I try the rename with overwrite and, if it returns false, fall back to delete and rename? What do you think?

To clarify, the API I mentioned was FileContext.rename. I don't have an HDFS cluster right now, but the result of the rename operation against the local filesystem via FileContext is quite different from what the document says. I also looked through the code path for how the namenode handles rename. Below is code you can run with spark-shell against Hadoop 2.7 & Hadoop 3.2: it correctly fails on an existing file in the destination, and correctly overwrites it if the overwrite option is provided.
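
A hedged Java reconstruction of the kind of check described (the commenter's original was presumably a Scala snippet for spark-shell); the class name and paths here are invented for illustration:

```java
import java.util.EnumSet;

import org.apache.hadoop.fs.CreateFlag;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileAlreadyExistsException;
import org.apache.hadoop.fs.FileContext;
import org.apache.hadoop.fs.Options;
import org.apache.hadoop.fs.Path;

public class FileContextRenameCheck {
  public static void main(String[] args) throws Exception {
    FileContext fc = FileContext.getLocalFSFileContext();
    Path src = new Path("/tmp/rename-check/src.txt");
    Path dst = new Path("/tmp/rename-check/dst.txt");

    // Create both files so that the destination already exists
    for (Path p : new Path[] {src, dst}) {
      try (FSDataOutputStream out = fc.create(
          p, EnumSet.of(CreateFlag.CREATE, CreateFlag.OVERWRITE), Options.CreateOpts.createParent())) {
        out.writeUTF(p.getName());
      }
    }

    // Without Options.Rename.OVERWRITE, FileContext rejects an existing destination
    try {
      fc.rename(src, dst);
    } catch (FileAlreadyExistsException e) {
      System.out.println("rename without overwrite failed as documented: " + e);
    }

    // With OVERWRITE, the destination is replaced and the rename succeeds
    fc.rename(src, dst, Options.Rename.OVERWRITE);
    System.out.println("rename with overwrite succeeded");
  }
}
```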

(It probably depends on the preference of using FileContext, though.)

```java
private void writeVersionToPath(FileSystem fs, Path path, int versionToWrite) throws IOException {
  try (FSDataOutputStream out = fs.create(path, false)) {
```
Boolean parameters need inline comments to explain what they are. What does false here do?

Good to understand the points about the rename with overwrite flag. Since there isn't an overwrite flag in the API we use, let's go with delete and rename for this. The current approach looks good to me. I think the only thing that needs to be updated is that the boolean args need to be documented. I agree with the choice to fail the create if the unique file already exists, but the comment should be there so it can be understood easily.

@rdblue Thanks for the review. I updated the code with an inline comment.
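
For illustration only, the documented call would look something like this; the exact comment wording in the merged change may differ:

```java
// false: fail if the unique temp file already exists instead of overwriting it
try (FSDataOutputStream out = fs.create(path, false /* createOrOverwrite */)) {
```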

```java
try {
  Path tempVersionHintFile = metadataPath(UUID.randomUUID().toString() + "-version-hint.temp");
  writeVersionToPath(fs, tempVersionHintFile, versionToWrite);
  fs.delete(versionHintFile, false);
```
This one needs a comment to clarify false as well.
Thanks. Fixed it.
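
Presumably the analogous fix, documenting the boolean as the recursive flag (again, the exact wording in the merged change may differ):

```java
fs.delete(versionHintFile, false /* recursive delete */);
```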

Thanks, @lcspinter! Merged.

@lcspinter @rdblue We observed a few times in our jobs that when the program exits (say, due to OOM) right after the delete but before the rename, we can no longer find the table. What we have to do is manually find the latest version and craft a version-hint file by hand. Since we do this delete-and-rename operation, can we also generate a version hint by listing files when the hint file is not found?
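
A hedged sketch of that fallback: list the metadata directory and recover the highest committed version from the file names. The class and helper names are invented, and it assumes uncompressed metadata file names of the form vN.metadata.json, so it is not the actual Iceberg implementation:

```java
import java.io.IOException;

import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

class VersionHintRecovery {
  // Scan the table's metadata directory for vN.metadata.json files and
  // return the highest N; 0 means no committed metadata was found
  static int findLatestVersionByListing(FileSystem fs, Path metadataDir) throws IOException {
    int maxVersion = 0;
    for (FileStatus status : fs.listStatus(metadataDir)) {
      String name = status.getPath().getName();
      if (name.matches("v\\d+\\.metadata\\.json")) {
        int version = Integer.parseInt(name.substring(1, name.indexOf('.')));
        maxVersion = Math.max(maxVersion, version);
      }
    }
    return maxVersion;
  }
}
```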

Under issue #1496 it was already discussed that we should find a way to update version-hint.txt atomically. At the moment, there is no known operation that would support atomic updates. Based on this, I think our best solution would be: