Skip to content

Conversation

@pnowojski
Copy link
Contributor

@pnowojski pnowojski commented Oct 20, 2017

What is the purpose of the change

It depends whether to call hsync or hflush on the underlying file system and user preferences. Normally hflush is enough to protect against single machine HDFS failures and against TaskManagers failures. However if user is using S3 like file system, or wants to protect against whole HDFS rack power loss hsync must be used instead.

This is a stop gap solution until proper fix waiting for https://issues.apache.org/jira/browse/FLINK-5789

Verifying this change

Testing syncOnFlush is hard :( One could think about writing a unit test using mocks, but that would only copy the implementation.

I have added tests for duplicate() (and fixed StringWriter::duplicate which was not preserving the charset)

Does this pull request potentially affect one of the following parts:

  • Dependencies (does it add or upgrade a dependency): (yes / no)
  • The public API, i.e., is any changed class annotated with @Public(Evolving): (yes / no)
  • The serializers: (yes / no / don't know)
  • The runtime per-record code paths (performance sensitive): (yes / no / don't know)
  • Anything that affects deployment or recovery: JobManager (and its components), Checkpointing, Yarn/Mesos, ZooKeeper: (yes / no / don't know)

Documentation

  • Does this pull request introduce a new feature? (yes / no)
  • If yes, how is the feature documented? (not applicable / docs / JavaDocs / not documented)

It depends whether to call hsync or hflush on the underlying file system
and user preferences. Normally hflush is enough to protect against single
machine HDFS failures and against TaskManagers failures. However if user is
using S3 like file system, or wants to protect againt whole HDFS rack power
loss hsync must be used instead.
@aljoscha
Copy link
Contributor

Excellent changes, I'll merge along with a bunch of other stuff. 👍

@aljoscha
Copy link
Contributor

I merged this, could you please close the PR?

@pnowojski
Copy link
Contributor Author

Thanks!

@pnowojski pnowojski closed this Oct 24, 2017
@StephanEwen
Copy link
Contributor

A quick post-mortem comment here:

This adds a lot of equals() and hashCode() on classes where these are ill-defined.

For example: StreamWriterBase defines equals() and hashCode() just on a subset of configuration fields (here the sync field) and ignore the associated stream, because equals and hash is ill-defined on the stream. To me, the correct conclusion is that equals() and hashCode() are ill-defined on StreamWriterBase and should not be there!

Adding such methods just to make assertion statement in tests more compact wrongly pushes some specific test logic in to the main classes. The correct way is to adjust the assertions in the test, or, if there is a lot of repetitive checking, create a Matcher that matches "equality based on some fields" and replace assertEquals(X, Y) with assertThat(matcher, X, Y).

I think in this case, it is actually a reason for a follow-up patch that changes this.

@pnowojski
Copy link
Contributor Author

I assumed that it is up to concrete implementations of StreamWriterBase to implement equals completely, for example based on configuration. Alternatively including outStream in equality/hash implementation is also a solution, with slightly different semantic.

@StephanEwen
Copy link
Contributor

But do you understand what I mean? Semantics of code in the main scope should not be quirked to make assertions in tests shorter to write.

Equals/hashCode is usually not implemented on I/O classes, like the output stream, because it is not well defined.

@pnowojski
Copy link
Contributor Author

Yes I know what you mean. However including outStream to hashCode and equals wouldn't add any quirks.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants