Fix three bugs with segment publishing. #6155

Merged
merged 6 commits into apache:master from fix-isql-msc-txn-fail-bug on Aug 15, 2018

Conversation

@gianm gianm commented Aug 11, 2018

  1. In AppenderatorImpl: always use a unique path if requested, even if the segment
    was already pushed. This is important because if we don't do this, it causes
    the issue mentioned in #6124 (KafkaIndexTask can delete published segments on restart).
  2. In IndexerSQLMetadataStorageCoordinator: fix a bug that could cause it to return
    a "not published" result instead of throwing an exception when a metadata update
    failure was followed by some unrelated exception. This is done by resetting the
    AtomicBoolean that tracks which case we're in each time the callback runs.
  3. In BaseAppenderatorDriver: only kill segments if we get an affirmative "false"
    publish result; skip killing if we merely got an exception. The reason for this
    is that we want to avoid killing segments that are in an unknown state (see the
    sketch after this list).

Two other changes to clarify the contracts a bit and hopefully prevent future bugs:

  1. Return SegmentPublishResult from TransactionalSegmentPublisher, to make it
    more similar to announceHistoricalSegments (see the interface sketch below).
  2. Make it explicit, at multiple levels of javadocs, that a "false" publish result
    must indicate that the publish definitely did not happen. Unknown states must be
    exceptions. This helps BaseAppenderatorDriver do the right thing.

@clintropolis clintropolis left a comment

Makes sense to me 👍

fjy commented Aug 11, 2018

👍

@nishantmonu51 nishantmonu51 left a comment

LGTM. +1 after Travis.

@jihoonson

Please check this:

[INFO] PMD Failure: io.druid.segment.realtime.appenderator.TransactionalSegmentPublisher:22 Rule:UnusedImports Priority:4 Avoid unused imports such as 'io.druid.indexing.overlord.DataSourceMetadata'.

    @@ -323,6 +323,8 @@ public SegmentPublishResult inTransaction(
            final TransactionStatus transactionStatus
        ) throws Exception
        {
    +     definitelyNotUpdated.set(false);

Would you add a comment that this overwrites definitelyNotUpdated on retrying?

gianm (author):
Added.
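
The added comment presumably reads along these lines (a sketch, not the exact wording):

    // Set definitelyNotUpdated back to false upon retrying: inTransaction may
    // run the callback multiple times, and a previous attempt's failure must
    // not leak into this one.
    definitelyNotUpdated.set(false);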

        log.info("Our segments really do exist, awaiting handoff.");
      } else {
    -   throw new ISE("Failed to publish segments[%s]", segmentsAndMetadata.getSegments());
    +   throw new ISE("Failed to publish segments.");

Is this change to avoid overly large log messages? I feel this log sometimes helps.

gianm (author):
I was thinking it's not necessary, since the segments will get logged in the catch block via:

              log.warn(e, "Failed publish, not removing segments: %s", segmentsAndMetadata.getSegments());

    - if (txnFailure.get()) {
    -   return new SegmentPublishResult(ImmutableSet.of(), false);
    + if (definitelyNotUpdated.get()) {
    +   return SegmentPublishResult.fail();

What do you think about adding an exception to SegmentPublishResult on failure, so that callers can figure out why it failed?

gianm (author):
I think it's not necessary, since there is supposed to be only one reason: compare-and-swap failure with the metadata update.
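
For context, the fail() factory seen in this diff presumably just wraps the old constructor call (a sketch grounded in the two removed lines above):

    // Sketch: a failed result carries no segments and success == false.
    // Under the clarified contract, this means the publish definitely did
    // not happen; an unknown outcome must surface as an exception instead.
    public static SegmentPublishResult fail()
    {
      return new SegmentPublishResult(ImmutableSet.of(), false);
    }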

gianm commented Aug 13, 2018

[INFO] PMD Failure: io.druid.segment.realtime.appenderator.TransactionalSegmentPublisher:22 Rule:UnusedImports Priority:4 Avoid unused imports such as 'io.druid.indexing.overlord.DataSourceMetadata'.

This is happening because I referenced it in a javadoc. Apparently that's not good enough for the plugin. I removed the reference.
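
Schematically, the situation was an import used only from a javadoc tag, which PMD's UnusedImports rule does not count as a use (illustrative, not the exact code):

    import io.druid.indexing.overlord.DataSourceMetadata; // flagged unused by PMD

    /**
     * ... the javadoc previously linked {@link DataSourceMetadata} here; the
     * fix was to drop both the import and the reference.
     */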

@jihoonson

@gianm please check the build failure.

/home/travis/build/apache/incubator-druid/server/src/test/java/io/druid/segment/realtime/appenderator/StreamAppenderatorDriverTest.java:362: error: incompatible types: bad return type in lambda expression
    return (segments, commitMetadata) -> true;
                                         ^
    boolean cannot be converted to SegmentPublishResult
/home/travis/build/apache/incubator-druid/server/src/test/java/io/druid/segment/realtime/appenderator/StreamAppenderatorDriverTest.java:371: error: incompatible types: bad return type in lambda expression
      return false;
             ^
    boolean cannot be converted to SegmentPublishResult
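
The fix is to return the new type from the test lambdas, presumably something like this (a sketch; the two-argument constructor is the one visible earlier in this PR):

    // Before: boolean no longer matches the interface's return type.
    //   return (segments, commitMetadata) -> true;
    // After (sketch): wrap the outcome in a SegmentPublishResult.
    return (segments, commitMetadata) -> new SegmentPublishResult(segments, true);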

gianm commented Aug 14, 2018

@jihoonson thanks, I pushed again.

fjy commented Aug 14, 2018

👍

@jihoonson

There are still some build failures.

[ERROR] COMPILATION ERROR : 
[ERROR] /home/travis/build/apache/incubator-druid/server/src/test/java/io/druid/segment/realtime/appenderator/BatchAppenderatorDriverTest.java:[197,42] incompatible types: bad return type in lambda expression
    boolean cannot be converted to io.druid.indexing.overlord.SegmentPublishResult
[ERROR] Failed to execute goal org.apache.maven.plugins:maven-compiler-plugin:3.2:testCompile (default-testCompile) on project druid-server: Compilation failure
[ERROR] /home/travis/build/apache/incubator-druid/server/src/test/java/io/druid/segment/realtime/appenderator/BatchAppenderatorDriverTest.java:[197,42] incompatible types: bad return type in lambda expression

gianm commented Aug 15, 2018

OMG, sorry, I'll check more thoroughly before I push again.

@jihoonson

Hmm, now some unit tests are failing, and the failures look legitimate.

Failed tests: 
  StreamAppenderatorDriverFailTest.testFailDuringPublish 
Expected: (an instance of java.util.concurrent.ExecutionException and exception with cause an instance of io.druid.java.util.common.ISE and exception with message a string containing "Failed to publish segments[[DataSegment{size=0, shardSpec=NumberedShardSpec{partitionNum=0, partitions=0}, metrics=[], dimensions=[], version='abc123', loadSpec={}, interval=2000-01-01T00:00:00.000Z/2000-01-01T01:00:00.000Z, dataSource='foo', binaryVersion='0'}, DataSegment{size=0, shardSpec=NumberedShardSpec{partitionNum=0, partitions=0}, metrics=[], dimensions=[], version='abc123', loadSpec={}, interval=2000-01-01T01:00:00.000Z/2000-01-01T02:00:00.000Z, dataSource='foo', binaryVersion='0'}]]")
     but: exception with message a string containing "Failed to publish segments[[DataSegment{size=0, shardSpec=NumberedShardSpec{partitionNum=0, partitions=0}, metrics=[], dimensions=[], version='abc123', loadSpec={}, interval=2000-01-01T00:00:00.000Z/2000-01-01T01:00:00.000Z, dataSource='foo', binaryVersion='0'}, DataSegment{size=0, shardSpec=NumberedShardSpec{partitionNum=0, partitions=0}, metrics=[], dimensions=[], version='abc123', loadSpec={}, interval=2000-01-01T01:00:00.000Z/2000-01-01T02:00:00.000Z, dataSource='foo', binaryVersion='0'}]]" message was "io.druid.java.util.common.ISE: Failed to publish segments."
Stacktrace was: java.util.concurrent.ExecutionException: io.druid.java.util.common.ISE: Failed to publish segments.

@jihoonson jihoonson left a comment

@gianm thanks for the quick fix!

@fjy fjy merged commit 5ce3185 into apache:master Aug 15, 2018
gianm added a commit to implydata/druid-public that referenced this pull request Aug 16, 2018
* Fix three bugs with segment publishing.

* Remove javadoc-only import.

* Updates.

* Fix test.

* Fix tests.
jon-wei pushed a commit to jon-wei/druid that referenced this pull request Aug 17, 2018
jon-wei added a commit that referenced this pull request Aug 18, 2018
* [Backport] Fix three bugs with segment publishing. (#6155)

* Fix KafkaIndexTask
jon-wei added a commit to implydata/druid-public that referenced this pull request Aug 18, 2018
…che#6187)

* [Backport] Fix three bugs with segment publishing. (apache#6155)

* Fix KafkaIndexTask
leventov pushed a commit to metamx/druid that referenced this pull request Aug 19, 2018
…che#6187)

* [Backport] Fix three bugs with segment publishing. (apache#6155)

* Fix KafkaIndexTask
@gianm gianm deleted the fix-isql-msc-txn-fail-bug branch September 23, 2022 19:23