
[FLINK-9113] [connectors] Use raw local file system for bucketing sink to prevent data loss #5861

Closed
wants to merge 3 commits

Conversation

@twalthr (Contributor) commented Apr 17, 2018

What is the purpose of the change

This change replaces Hadoop's LocalFileSystem (which is a checksumming filesystem) with the RawLocalFileSystem implementation. Because it computes checksums, the default filesystem flushes only in 512-byte intervals, which can lead to data loss during checkpointing. To guarantee exact results, we skip the checksum computation and perform a raw flush.

Negative effect: existing checksum files are no longer maintained and thus become invalid.
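The failure mode described above can be illustrated with a self-contained sketch (plain Java, no Hadoop dependency; all class and method names here are illustrative, not Flink's or Hadoop's): a stream that, like a checksumming filesystem, persists data only in complete 512-byte chunks will silently hold back the unaligned tail of a write even after a flush.

```java
import java.io.ByteArrayOutputStream;
import java.io.OutputStream;

/** Toy stream that, like a checksumming filesystem, persists data
 *  only in complete 512-byte chunks when flushed. Illustrative only;
 *  not Hadoop's or Flink's code. */
class ChunkAlignedStream extends OutputStream {
    static final int CHUNK = 512;
    private final ByteArrayOutputStream buffered = new ByteArrayOutputStream();
    private int persistedLength = 0;

    @Override
    public void write(int b) {
        buffered.write(b);
    }

    @Override
    public void flush() {
        // Only full chunks reach "disk"; the unaligned tail stays buffered.
        persistedLength = (buffered.size() / CHUNK) * CHUNK;
    }

    int persistedLength() {
        return persistedLength;
    }
}

public class FlushGranularityDemo {
    public static void main(String[] args) {
        ChunkAlignedStream out = new ChunkAlignedStream();
        for (int i = 0; i < 700; i++) {
            out.write(0); // a "checkpointed" write of 700 bytes
        }
        out.flush();
        // Only 512 bytes survive; the 188-byte tail would be lost on a
        // crash, which is the hazard the PR removes by flushing raw.
        System.out.println(out.persistedLength()); // prints 512
    }
}
```

A raw (non-checksumming) stream has no such chunk alignment, so everything written before the flush is visible on disk.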

Brief change log

  • Replace the local filesystem with the raw filesystem

Verifying this change

Added a check that verifies the file length and file size.
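A minimal standalone version of such a length check (hypothetical names; the PR's real check lives in Flink's test suite) writes through an unbuffered stream, flushes, and compares the length visible on disk with the number of bytes written, which is exactly what a reader recovering after a crash would see:

```java
import java.io.IOException;
import java.io.OutputStream;
import java.io.UncheckedIOException;
import java.nio.file.Files;
import java.nio.file.Path;

public class FileLengthCheck {
    /** Writes n bytes, flushes, and returns the length visible on disk
     *  at that moment. On a raw (non-checksumming) local filesystem
     *  this must equal n; a checksumming filesystem may hold back the
     *  tail that does not fill a complete checksum chunk. */
    static long flushedLengthOf(int n) {
        try {
            Path file = Files.createTempFile("bucket-part", ".tmp");
            long len;
            try (OutputStream out = Files.newOutputStream(file)) {
                out.write(new byte[n]);
                out.flush();
                // Observed while the writer is still open, i.e. mid-"checkpoint".
                len = Files.size(file);
            }
            Files.deleteIfExists(file);
            return len;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println(flushedLengthOf(700)); // 700: nothing held back
    }
}
```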

Does this pull request potentially affect one of the following parts:

  • Dependencies (does it add or upgrade a dependency): no
  • The public API, i.e., is any changed class annotated with @Public(Evolving): no
  • The serializers: no
  • The runtime per-record code paths (performance sensitive): no
  • Anything that affects deployment or recovery: JobManager (and its components), Checkpointing, Yarn/Mesos, ZooKeeper: no
  • The S3 file system connector: no

Documentation

  • Does this pull request introduce a new feature? no
  • If yes, how is the feature documented? not applicable

@twalthr (Contributor, Author) commented Apr 17, 2018

It seems that with Hadoop 2.8.3, truncating is supported for the raw local filesystem. I will need to adapt the test for that.

@aljoscha (Contributor) left a comment


Changes look good! 👍

I had one remark about an outdated code comment; you could change that while merging.

I'm assuming the updated tests failed without the fix?

@@ -1245,6 +1246,12 @@ else if (scheme != null && authority == null) {
}

fs.initialize(fsUri, finalConf);

// By default we don't perform checksums on Hadoop's local filesystem and use the raw filesystem.

The "by default" is not necessary anymore: we now always use the raw filesystem. This is a leftover from a previous version that allowed changing this behavior.
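For context on the shape of the fix being discussed: Hadoop's LocalFileSystem is a checksumming wrapper around a RawLocalFileSystem, and the wrapper exposes its raw delegate, so the fix unconditionally unwraps it. The pattern can be sketched with toy types (none of these are Hadoop classes; names are purely illustrative):

```java
/** Toy model of Hadoop's checksumming-wrapper arrangement; every type
 *  here is an illustrative stand-in, not Hadoop's API. */
interface ToyFileSystem {
    String describe();
}

class ToyRawFs implements ToyFileSystem {
    public String describe() { return "raw"; }
}

/** Wraps a raw filesystem and adds checksumming, analogous to how
 *  LocalFileSystem wraps RawLocalFileSystem. */
class ToyChecksumFs implements ToyFileSystem {
    private final ToyFileSystem raw;
    ToyChecksumFs(ToyFileSystem raw) { this.raw = raw; }
    ToyFileSystem getRawFs() { return raw; }
    public String describe() { return "checksum(" + raw.describe() + ")"; }
}

public class UnwrapDemo {
    /** The shape of the fix: unconditionally unwrap a checksumming
     *  local filesystem to its raw delegate before the sink uses it. */
    static ToyFileSystem unwrap(ToyFileSystem fs) {
        return (fs instanceof ToyChecksumFs) ? ((ToyChecksumFs) fs).getRawFs() : fs;
    }

    public static void main(String[] args) {
        System.out.println(unwrap(new ToyChecksumFs(new ToyRawFs())).describe()); // raw
        System.out.println(unwrap(new ToyRawFs()).describe()); // raw (already unwrapped)
    }
}
```

Because the unwrap is unconditional, there is no configuration path left that reinstates the checksumming wrapper, which is why the "by default" wording in the code comment is outdated.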

@asfgit asfgit closed this in 96f675c Apr 19, 2018
asfgit pushed a commit that referenced this pull request Apr 19, 2018
…k to prevent data loss


This closes #5861.
twalthr added a commit to twalthr/flink that referenced this pull request Apr 23, 2018
glaksh100 pushed a commit to lyft/flink that referenced this pull request Jun 5, 2018
glaksh100 pushed a commit to lyft/flink that referenced this pull request Jun 5, 2018
glaksh100 pushed a commit to lyft/flink that referenced this pull request Jun 6, 2018
glaksh100 pushed a commit to lyft/flink that referenced this pull request Jun 6, 2018
sampathBhat pushed a commit to sampathBhat/flink that referenced this pull request Jul 26, 2018
3 participants