
[FLINK-9113] [connectors] Use raw local file system for bucketing sink to prevent data loss #5861

Closed
wants to merge 3 commits

Conversation

@twalthr (Contributor) commented Apr 17, 2018

What is the purpose of the change

This change replaces Hadoop's LocalFileSystem (which is a checksumming filesystem) with the RawLocalFileSystem implementation. Because it computes checksums, the default filesystem flushes only in 512-byte intervals, which can lead to data loss during checkpointing. To guarantee exact results, we skip the checksum computation and perform a raw flush.

Negative effect: existing checksum files are no longer maintained and thus become invalid.
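The failure mode described above can be illustrated with a self-contained sketch (plain Java, no Hadoop dependency; all class and method names here are illustrative, not Flink's or Hadoop's): a stream that, like a checksumming filesystem, persists data only in complete 512-byte chunks will silently hold back the unaligned tail of a write even after a flush.

```java
import java.io.ByteArrayOutputStream;
import java.io.OutputStream;

/** Toy stream that, like a checksumming filesystem, persists data
 *  only in complete 512-byte chunks when flushed. Illustrative only;
 *  not Hadoop's or Flink's code. */
class ChunkAlignedStream extends OutputStream {
    static final int CHUNK = 512;
    private final ByteArrayOutputStream buffered = new ByteArrayOutputStream();
    private int persistedLength = 0;

    @Override
    public void write(int b) {
        buffered.write(b);
    }

    @Override
    public void flush() {
        // Only full chunks reach "disk"; the unaligned tail stays buffered.
        persistedLength = (buffered.size() / CHUNK) * CHUNK;
    }

    int persistedLength() {
        return persistedLength;
    }
}

public class FlushGranularityDemo {
    public static void main(String[] args) {
        ChunkAlignedStream out = new ChunkAlignedStream();
        for (int i = 0; i < 700; i++) {
            out.write(0); // a "checkpointed" write of 700 bytes
        }
        out.flush();
        // Only 512 bytes survive; the 188-byte tail would be lost on a
        // crash, which is the hazard the PR removes by flushing raw.
        System.out.println(out.persistedLength()); // prints 512
    }
}
```

A raw (non-checksumming) stream has no such chunk alignment, so everything written before the flush is visible on disk.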

Brief change log

  • Replace the local filesystem with the raw filesystem

Verifying this change

Added a check that verifies the file length and file size.
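A minimal standalone version of such a length check (hypothetical names; the PR's real check lives in Flink's test suite) writes through an unbuffered stream, flushes, and compares the length visible on disk with the number of bytes written, which is exactly what a reader recovering after a crash would see:

```java
import java.io.IOException;
import java.io.OutputStream;
import java.io.UncheckedIOException;
import java.nio.file.Files;
import java.nio.file.Path;

public class FileLengthCheck {
    /** Writes n bytes, flushes, and returns the length visible on disk
     *  at that moment. On a raw (non-checksumming) local filesystem
     *  this must equal n; a checksumming filesystem may hold back the
     *  tail that does not fill a complete checksum chunk. */
    static long flushedLengthOf(int n) {
        try {
            Path file = Files.createTempFile("bucket-part", ".tmp");
            long len;
            try (OutputStream out = Files.newOutputStream(file)) {
                out.write(new byte[n]);
                out.flush();
                // Observed while the writer is still open, i.e. mid-"checkpoint".
                len = Files.size(file);
            }
            Files.deleteIfExists(file);
            return len;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println(flushedLengthOf(700)); // 700: nothing held back
    }
}
```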

Does this pull request potentially affect one of the following parts:

  • Dependencies (does it add or upgrade a dependency): no
  • The public API, i.e., is any changed class annotated with @Public(Evolving): no
  • The serializers: no
  • The runtime per-record code paths (performance sensitive): no
  • Anything that affects deployment or recovery: JobManager (and its components), Checkpointing, Yarn/Mesos, ZooKeeper: no
  • The S3 file system connector: no

Documentation

  • Does this pull request introduce a new feature? no
  • If yes, how is the feature documented? not applicable

@twalthr (Contributor, Author) commented Apr 17, 2018

It seems that with Hadoop 2.8.3, truncating is supported for the raw local filesystem. I will need to adapt the test for that.

@aljoscha (Contributor) left a comment


Changes look good! 👍

I had one remark about an outdated code comment; you could change that while merging.

I'm assuming the updated tests failed without the fix?

@@ -1245,6 +1246,12 @@ else if (scheme != null && authority == null) {
}

fs.initialize(fsUri, finalConf);

// By default we don't perform checksums on Hadoop's local filesystem and use the raw filesystem.

The "by default" is not necessary anymore: we now always use the raw filesystem. This is a leftover from a previous version that allowed changing this behavior.
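For context on the shape of the fix being discussed: Hadoop's LocalFileSystem is a checksumming wrapper around a RawLocalFileSystem, and the wrapper exposes its raw delegate, so the fix unconditionally unwraps it. The pattern can be sketched with toy types (none of these are Hadoop classes; names are purely illustrative):

```java
/** Toy model of Hadoop's checksumming-wrapper arrangement; every type
 *  here is an illustrative stand-in, not Hadoop's API. */
interface ToyFileSystem {
    String describe();
}

class ToyRawFs implements ToyFileSystem {
    public String describe() { return "raw"; }
}

/** Wraps a raw filesystem and adds checksumming, analogous to how
 *  LocalFileSystem wraps RawLocalFileSystem. */
class ToyChecksumFs implements ToyFileSystem {
    private final ToyFileSystem raw;
    ToyChecksumFs(ToyFileSystem raw) { this.raw = raw; }
    ToyFileSystem getRawFs() { return raw; }
    public String describe() { return "checksum(" + raw.describe() + ")"; }
}

public class UnwrapDemo {
    /** The shape of the fix: unconditionally unwrap a checksumming
     *  local filesystem to its raw delegate before the sink uses it. */
    static ToyFileSystem unwrap(ToyFileSystem fs) {
        return (fs instanceof ToyChecksumFs) ? ((ToyChecksumFs) fs).getRawFs() : fs;
    }

    public static void main(String[] args) {
        System.out.println(unwrap(new ToyChecksumFs(new ToyRawFs())).describe()); // raw
        System.out.println(unwrap(new ToyRawFs()).describe()); // raw (already unwrapped)
    }
}
```

Because the unwrap is unconditional, there is no configuration path left that reinstates the checksumming wrapper, which is why the "by default" wording in the code comment is outdated.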

@asfgit asfgit closed this in 96f675c Apr 19, 2018
asfgit pushed a commit that referenced this pull request Apr 19, 2018
…k to prevent data loss


This closes #5861.
twalthr added a commit to twalthr/flink that referenced this pull request Apr 23, 2018
glaksh100 pushed a commit to lyft/flink that referenced this pull request Jun 5, 2018
glaksh100 pushed a commit to lyft/flink that referenced this pull request Jun 5, 2018
glaksh100 pushed a commit to lyft/flink that referenced this pull request Jun 6, 2018
glaksh100 pushed a commit to lyft/flink that referenced this pull request Jun 6, 2018
sampathBhat pushed a commit to sampathBhat/flink that referenced this pull request Jul 26, 2018
3 participants