Skip to content

SAMZA-1359: Handle phantom container notifications cleanly during an RM fail-over#243

Closed
vjagadish1989 wants to merge 9 commits into
apache:masterfrom
vjagadish1989:am-bug
Closed

SAMZA-1359: Handle phantom container notifications cleanly during an RM fail-over#243
vjagadish1989 wants to merge 9 commits into
apache:masterfrom
vjagadish1989:am-bug

Conversation

@vjagadish1989
Copy link
Copy Markdown
Contributor

  1. Improved our container handling logic to be resilient to phantom notifications.
  2. Added a new metric to Samza's ContainerProcessManager module that tracks the number of such invalid notifications.
  3. Add a couple of tests that simulate this exact scenario above that we encountered during the cluster upgrade. (container starts -> container fails -> legitimate notification for the failure - container re-start -> RM fail-over -> phantom notification with a different exit code)
  4. As an aside, there are a whole bunch of tests in ContainerProcessManager that rely on Thread.sleep to ensure that threads get to run in a certain order. Removed this non-determinism and made them predictable.

val mFailedContainers = newGauge("failed-containers", () => state.failedContainers.get())
val mReleasedContainers = newGauge("released-containers", () => state.releasedContainers.get())
val mContainers = newGauge("container-count", () => state.containerCount)
val mInvalidNotifications = newGauge("invalid-notifications", () => state.invalidNotifications.get())
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

From what I can tell, these notifications are not "invalid". They are "duplicate", "redundant" or "ignored", but not "invalid"

I'd suggest a rename, but it's your call.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Makes sense, I'll rename them to be duplicate.

* @throws InterruptedException if the current thread is interrupted while waiting
*/
void awaitContainersStart(int numExpectedContainers) throws InterruptedException {
latch = new CountDownLatch(numExpectedContainers);
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

How does the code guarantee that this latch reassignment doesn't happen AFTER the latch.countDown in runStreamProcessor()?

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

  1. awaitContainersStart is not meant to be used in a multi-threaded context.

  2. The only guarantee is that, we will wait until 2 containers have started after the call to awaitContainersStart. The caller should ensure that awaitContainerStart is called before any action that triggers the startup. This is just an alternative to doing a Thread.sleep()

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Certainly better than Thread.sleep(). Thanks for explaining and +10000 to getting rid of Thread.sleep(). Appreciate it!

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I would suggest against an unbounded await. How about some reasonable timeout and then fail the test?

Instead of reassigning the count down latch you could use a semaphore and acquire the expected number of permits here and release them as containers start up.

Copy link
Copy Markdown
Contributor

@cpettitt-linkedin cpettitt-linkedin left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Kudos for improving the Thread.sleep situation!

* @throws InterruptedException if the current thread is interrupted while waiting
*/
void awaitContainersStart(int numExpectedContainers) throws InterruptedException {
latch = new CountDownLatch(numExpectedContainers);
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I would suggest against an unbounded await. How about some reasonable timeout and then fail the test?

Instead of reassigning the count down latch you could use a semaphore and acquire the expected number of permits here and release them as containers start up.

@vjagadish1989
Copy link
Copy Markdown
Contributor Author

@santhoshvenkat1988 @jmakes Review please!

Copy link
Copy Markdown
Contributor

@jmakes jmakes left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

Copy link
Copy Markdown
Contributor

@shanthoosh shanthoosh left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM.

Just curious. Wondering if it makes sense to make this layer capable of handling any kind of duplicate(bad) yarn notifications(successful_completion, preempted etc). We check a received container notification against valid possible state transitions before acting on it.

@asfgit asfgit closed this in 35143b6 Jul 25, 2017
asfgit pushed a commit that referenced this pull request Jul 26, 2017
Changes
* Fix checkstyle errors from #243
* Fix failure after bad merge in #244

Author: Shanthoosh Venkataraman <svenkataraman@linkedin.com>

Reviewers: Navina Ramesh <navina@apache.org>

Closes #252 from shanthoosh/fix_NPE_after_master_merge
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants