[FLINK-9113] [connectors] Use raw local file system for bucketing sink to prevent data loss #5861
Conversation
It seems that for Hadoop 2.8.3, truncating is supported for the raw local file system. I will need to adapt the test for that.
Changes look good! 👍
I had one comment about an outdated comment, you could change that while merging.
I'm assuming the updated tests failed without the fix?
@@ -1245,6 +1246,12 @@ else if (scheme != null && authority == null) {
	}

	fs.initialize(fsUri, finalConf);

	// By default we don't perform checksums on Hadoop's local filesystem and use the raw filesystem.
The "by default" is no longer accurate: we now always use the raw filesystem. It is a leftover from a previous version of the change that allowed this to be configured.
[FLINK-9113] [connectors] Use raw local file system for bucketing sink to prevent data loss

This change replaces Hadoop's LocalFileSystem (which is a checksumming file system) with the RawLocalFileSystem implementation. To compute checksums, the default file system flushes only in 512-byte intervals, which can lead to data loss during checkpointing. To guarantee exact results, we skip the checksum computation and perform a raw flush.

Negative effect: existing checksums are no longer maintained and thus become invalid.

This closes #5861.
What is the purpose of the change
This change replaces Hadoop's LocalFileSystem (which is a checksumming file system) with the RawLocalFileSystem implementation. To compute checksums, the default file system flushes only in 512-byte intervals, which can lead to data loss during checkpointing. To guarantee exact results, we skip the checksum computation and perform a raw flush.

Negative effect: existing checksums are no longer maintained and thus become invalid.
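The flush problem described above can be illustrated with a minimal, self-contained model. The `ChunkedStream` class below is a hypothetical stand-in for a checksumming stream, not Hadoop or Flink code: it buffers data in 512-byte checksum chunks and only makes complete chunks visible, so flushing at checkpoint time can leave a partial chunk behind, while a raw stream makes every written byte visible.

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.OutputStream;

// Simplified model (NOT Hadoop's actual classes) of why a checksumming
// stream that buffers 512-byte chunks can lose data on flush().
public class ChunkBufferDemo {
    static final int CHUNK = 512;

    // Hypothetical stand-in for a checksum chunk buffer: partial chunks
    // stay in memory until full; flush() does not push them out.
    static class ChunkedStream extends OutputStream {
        private final OutputStream out;
        private final byte[] buf = new byte[CHUNK];
        private int pos = 0;

        ChunkedStream(OutputStream out) { this.out = out; }

        @Override public void write(int b) throws IOException {
            buf[pos++] = (byte) b;
            if (pos == CHUNK) {           // only full chunks reach the file
                out.write(buf, 0, CHUNK);
                pos = 0;
            }
        }

        // Deliberately does not write the partial chunk, modeling the
        // pre-fix behavior at checkpoint time.
        @Override public void flush() throws IOException { out.flush(); }
    }

    public static void main(String[] args) throws IOException {
        ByteArrayOutputStream checksummed = new ByteArrayOutputStream();
        ByteArrayOutputStream raw = new ByteArrayOutputStream();

        byte[] record = new byte[700];    // 1 full chunk + 188 trailing bytes

        ChunkedStream cs = new ChunkedStream(checksummed);
        cs.write(record);                 // loops over write(int) internally
        cs.flush();                       // "checkpoint": partial chunk lost

        raw.write(record);
        raw.flush();

        System.out.println("checksummed visible bytes: " + checksummed.size()); // 512
        System.out.println("raw visible bytes: " + raw.size());                 // 700
    }
}
```

After the simulated checkpoint, the checksumming path has made only 512 of the 700 written bytes visible; the raw path has all 700.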
Brief change log
Verifying this change
Added a check that verifies the file length visible after a flush against the number of bytes written.
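A sketch of that kind of length check, using only the JDK (names and the file are illustrative, not the PR's actual test code): with a raw, non-checksumming stream, the file length observed after flush() must equal the byte count written so far, even when it is not a multiple of 512.

```java
import java.io.File;
import java.io.FileOutputStream;
import java.io.IOException;

// Hedged sketch of a flush-visibility length check.
public class FlushLengthCheck {
    public static void main(String[] args) throws IOException {
        File f = File.createTempFile("bucketing-sink-part", ".txt");
        f.deleteOnExit();

        byte[] data = new byte[700];  // deliberately not a multiple of 512
        try (FileOutputStream out = new FileOutputStream(f)) {
            out.write(data);
            out.flush();
            // A raw (non-checksumming) stream makes every written byte
            // visible immediately after flush().
            System.out.println("visible length: " + f.length());
        }
    }
}
```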
Does this pull request potentially affect one of the following parts:
@Public(Evolving): no

Documentation