Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Simplify intermediate data in Iceberg sink; use manifest files #31086

Merged
merged 1 commit into from
Apr 23, 2024

Conversation

kennknowles
Copy link
Member

Previously, the Iceberg FileWriteResult included use of SerializableCoder and some hacks around Avro.

Now it leverages Iceberg manifest files and uses AutoValueSchema on a very basic POJO, and the use of Iceberg's APIs is more robust and uses entrypoints that should continue to work indefinitely by writing and reading manifest files that point to data files.

This does double the number of files written, which is unfortunate. But it fixes the intermediate encoding to something that should continue to work for a wide variety of implementation choices.


Thank you for your contribution! Follow this checklist to help us incorporate your contribution quickly and easily:

  • Mention the appropriate issue in your description (for example: addresses #123), if applicable. This will automatically add a link to the pull request in the issue. If you would like the issue to automatically close on merging the pull request, comment fixes #<ISSUE NUMBER> instead.
  • Update CHANGES.md with noteworthy changes.
  • If this contribution is large, please file an Apache Individual Contributor License Agreement.

See the Contributor Guide for more tips on how to make review process smoother.

To check the build health, please visit https://github.com/apache/beam/blob/master/.test-infra/BUILD_STATUS.md

GitHub Actions Tests Status (on master branch)

Build python source distribution and wheels
Python tests
Java tests
Go tests

See CI.md for more information about GitHub Actions CI or the workflows README to see a list of phrases to trigger workflows.

import org.junit.runners.JUnit4;

@RunWith(JUnit4.class)
public class FileWriteResultTest implements Serializable {
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This was only tests of the coder. And now the whole class is one autovalue thing that we don't even implement ourselves, except for helpers which are very well covered by testing of the sink.

Copy link
Contributor

@ahmedabu98 ahmedabu98 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM, just one suggestion

Comment on lines +55 to +57
} catch (IOException exc) {
throw new RuntimeException("Error decoding manifest file bytes");
}
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can we include the table identifier in this error message for debuggability? or is it PII?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

PII

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We might be able to include the manifest file name. I think with the LGTM we have it would be worthwhile to merge so it is safe to release and then improve error message if we have time.

Copy link
Contributor

@chamikaramj chamikaramj left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks. LGTM.

public DataFile dataFile() {
return icebergDataWriter.toDataFile();
public ManifestFile getManifestFile() throws IOException {
String manifestFilename = FileFormat.AVRO.addExtension(absoluteFilename + ".manifest");
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Seems like this file will end up being named <absoluteFilename>.manifest.avro. Just pointing out in case we want to avoid double extension.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I deliberately want the double extension. Or at least for the name to have manifest in it. And the extension is required by Iceberg because it detects filetype by extension in the decode path.

Copy link
Contributor

Checks are failing. Will not request review until checks are succeeding. If you'd like to override that behavior, comment assign set of reviewers

@kennknowles kennknowles merged commit 485c519 into apache:master Apr 23, 2024
22 checks passed
@kennknowles kennknowles deleted the filewriteresult branch April 23, 2024 23:28
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants