-
Notifications
You must be signed in to change notification settings - Fork 4.2k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Simplify intermediate data in Iceberg sink; use manifest files #31086
Conversation
import org.junit.runners.JUnit4; | ||
|
||
@RunWith(JUnit4.class) | ||
public class FileWriteResultTest implements Serializable { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This was only tests of the coder. And now the whole class is one autovalue thing that we don't even implement ourselves, except for helpers which are very well covered by testing of the sink.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM, just one suggestion
} catch (IOException exc) { | ||
throw new RuntimeException("Error decoding manifest file bytes"); | ||
} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can we include the table identifier in this error message for debuggability? or is it PII?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
PII
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We might be able to include the manifest file name. I think with the LGTM we have it would be worthwhile to merge so it is safe to release and then improve error message if we have time.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks. LGTM.
public DataFile dataFile() { | ||
return icebergDataWriter.toDataFile(); | ||
public ManifestFile getManifestFile() throws IOException { | ||
String manifestFilename = FileFormat.AVRO.addExtension(absoluteFilename + ".manifest"); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Seems like this file will end up being named <absoluteFilename>.manifest.avro
. Just pointing out in case we want to avoid double extension.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I deliberately want the double extension. Or at least for the name to have manifest
in it. And the extension is required by Iceberg because it detects filetype by extension in the decode path.
Checks are failing. Will not request review until checks are succeeding. If you'd like to override that behavior, comment |
Previously, the Iceberg FileWriteResult included use of SerializableCoder and some hacks around Avro.
Now it leverages Iceberg manifest files and uses AutoValueSchema on a very basic POJO, and the use of Iceberg's APIs is more robust and uses entrypoints that should continue to work indefinitely by writing and reading manifest files that point to data files.
This does double the number of files written, which is unfortunate. But it fixes the intermediate encoding to something that should continue to work for a wide variety of implementation choices.
Thank you for your contribution! Follow this checklist to help us incorporate your contribution quickly and easily:
addresses #123
), if applicable. This will automatically add a link to the pull request in the issue. If you would like the issue to automatically close on merging the pull request, commentfixes #<ISSUE NUMBER>
instead.CHANGES.md
with noteworthy changes.See the Contributor Guide for more tips on how to make review process smoother.
To check the build health, please visit https://github.com/apache/beam/blob/master/.test-infra/BUILD_STATUS.md
GitHub Actions Tests Status (on master branch)
See CI.md for more information about GitHub Actions CI or the workflows README to see a list of phrases to trigger workflows.