(TWILL-181) allow setting the maximum number of retries per runnable #23

Closed
wants to merge 1 commit into base: master

Conversation

3 participants
@serranom
Contributor

serranom commented Jan 17, 2017

Feature is complete. Submitting for review.

@hsaputra

Contributor

hsaputra commented Jan 19, 2017

Thanks for the PR. Could you kindly add more detail to the PR description about how the solution is implemented?
This will help reviewers compare the code with the description of what you were trying to do.

@serranom

Contributor

serranom commented Jan 19, 2017

@hsaputra, it follows the design in the ticket. Should those details be repeated here?

@serranom

Contributor

serranom commented Jan 19, 2017

Also, I noticed the test failure and I'm confused about it. It seems to be intermittent, because it succeeds in some runs. When it fails, the AM doesn't restart the containers:

2017-01-17T23:43:19,872Z WARN  o.a.t.i.a.RunningContainers [testing-docker-7dddd7e0-951d-4746-842a-bd7d0e7c25ce] [ApplicationMasterService] RunningContainers:handleCompleted(RunningContainers.java:476) - Container container_1484695869902_0027_01_000006 exited abnormally with state COMPLETE, exit code 1.
2017-01-17T23:43:19,872Z INFO  o.a.t.i.a.RunningContainers [testing-docker-7dddd7e0-951d-4746-842a-bd7d0e7c25ce] [ApplicationMasterService] RunningContainers:shouldRetry(RunningContainers.java:518) - 5 of 6 retries for runnable FailingServer.
2017-01-17T23:43:19,873Z INFO  o.a.t.i.a.RunningContainers [testing-docker-7dddd7e0-951d-4746-842a-bd7d0e7c25ce] [ApplicationMasterService] RunningContainers:handleCompleted(RunningContainers.java:479) - Re-request the container container_1484695869902_0027_01_000006 for exit code 1.
2017-01-17T23:43:19,873Z INFO  o.a.t.i.a.ApplicationMasterService [testing-docker-7dddd7e0-951d-4746-842a-bd7d0e7c25ce] [ApplicationMasterService] ApplicationMasterService:handleCompleted(ApplicationMasterService.java:502) - Re-request container for FailingServer with 1 instances.
2017-01-17T23:43:19,873Z DEBUG o.a.t.i.a.ApplicationMasterService [testing-docker-7dddd7e0-951d-4746-842a-bd7d0e7c25ce] [ApplicationMasterService] ApplicationMasterService:doRun(ApplicationMasterService.java:406) - Runnable container requests: 3
2017-01-17T23:43:19,873Z INFO  o.a.t.i.a.ApplicationMasterService [testing-docker-7dddd7e0-951d-4746-842a-bd7d0e7c25ce] [ApplicationMasterService] ApplicationMasterService:doRun(ApplicationMasterService.java:444) - Relaxing provisioning constraints for request f30d39fe-2d65-493d-b670-8b42192bf322
2017-01-17T23:43:20,875Z DEBUG o.a.t.i.a.ApplicationMasterService [testing-docker-7dddd7e0-951d-4746-842a-bd7d0e7c25ce] [ApplicationMasterService] ApplicationMasterService:doRun(ApplicationMasterService.java:406) - Runnable container requests: 3
2017-01-17T23:43:21,878Z DEBUG o.a.t.i.a.ApplicationMasterService [testing-docker-7dddd7e0-951d-4746-842a-bd7d0e7c25ce] [ApplicationMasterService] ApplicationMasterService:doRun(ApplicationMasterService.java:406) - Runnable container requests: 3
2017-01-17T23:43:22,880Z DEBUG o.a.t.i.a.ApplicationMasterService [testing-docker-7dddd7e0-951d-4746-842a-bd7d0e7c25ce] [ApplicationMasterService] ApplicationMasterService:doRun(ApplicationMasterService.java:406) - Runnable container requests: 3
2017-01-17T23:43:23,882Z DEBUG o.a.t.i.a.ApplicationMasterService [testing-docker-7dddd7e0-951d-4746-842a-bd7d0e7c25ce] [ApplicationMasterService] ApplicationMasterService:doRun(ApplicationMasterService.java:406) - Runnable container requests: 3
2017-01-17T23:43:24,884Z DEBUG o.a.t.i.a.ApplicationMasterService [testing-docker-7dddd7e0-951d-4746-842a-bd7d0e7c25ce] [ApplicationMasterService] ApplicationMasterService:doRun(ApplicationMasterService.java:406) - Runnable container requests: 3
2017-01-17T23:43:24,885Z INFO  o.a.t.i.LogOnlyEventHandler [testing-docker-7dddd7e0-951d-4746-842a-bd7d0e7c25ce] [ApplicationMasterService] LogOnlyEventHandler:launchTimeout(LogOnlyEventHandler.java:36) - Requested 2 containers for runnable FailingServer, only got 0 after 5012 ms.
2017-01-17T23:43:25,888Z DEBUG o.a.t.i.a.ApplicationMasterService [testing-docker-7dddd7e0-951d-4746-842a-bd7d0e7c25ce] [ApplicationMasterService] ApplicationMasterService:doRun(ApplicationMasterService.java:406) - Runnable container requests: 3
2017-01-17T23:43:26,890Z DEBUG o.a.t.i.a.ApplicationMasterService [testing-docker-7dddd7e0-951d-4746-842a-bd7d0e7c25ce] [ApplicationMasterService] ApplicationMasterService:doRun(ApplicationMasterService.java:406) - Runnable container requests: 3
2017-01-17T23:43:27,892Z DEBUG o.a.t.i.a.ApplicationMasterService [testing-docker-7dddd7e0-951d-4746-842a-bd7d0e7c25ce] [ApplicationMasterService] ApplicationMasterService:doRun(ApplicationMasterService.java:406) - Runnable container requests: 3

Any advice on debugging this?

@hsaputra

Contributor

hsaputra commented Jan 19, 2017

Ah, I see you set out the proposed design as a JIRA comment.
The design in JIRA usually changes once you start implementing; that's why I usually ask developers to add the actual design or plan, as implemented, to the PR.

@serranom

Contributor

serranom commented Jan 19, 2017

Gotcha. Here it is, cleaned up to reflect what I actually did (a rough usage sketch follows the list):

  • Each runnable can have a configured maximum number of retries. If not set, retries are unlimited, as before. Retries are scaled according to the number of instances of each Runnable.
  • Add withMaxTries(runnableName, int) to TwillPreparer.
  • Add withMaxTries(runnableName, int) to YarnTwillPreparer. This stores a map from runnableName to maxRetries.
  • This map becomes part of the twillRuntimeSpecification and RuntimeSpecification interface and is added to TwillRuntimeSpecificationCodec.
  • ApplicationMasterService.initRunningContainers is updated to pass a map from runnables to maxRetries.
  • RunningContainers is updated so that it keeps a count of the number of retries per runnable and uses it in handleCompleted() to decide whether to retry. Since every instance is the same as any other, if I'm starting 10 instances of a Runnable and want a max retry count of 3, that scales the total number of retries to 30: each instance gets (on average) 3 tries. Since the instances are interchangeable, there is no concept of a discrete instance being retried.
  • Logging is updated so that nothing special is logged when no max was set, and so that the number of retries remaining, and their exhaustion, are logged.
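
For illustration only, here is a minimal sketch of how a client might use the new per-runnable cap. The method name and signature follow the wording in this thread (withMaxTries/withMaxRetries), and the application and the runnable name "FailingServer" are hypothetical placeholders, so treat this as a sketch rather than the final API:

  import org.apache.twill.api.TwillApplication;
  import org.apache.twill.api.TwillController;
  import org.apache.twill.api.TwillRunnerService;

  // Sketch only: assumes the per-runnable max-retries setter described in this PR.
  public final class MaxRetriesExample {

    // Caps retries for the runnable named "FailingServer" at 3 per instance;
    // runnables without a configured cap keep the existing unlimited-retry behavior.
    public static TwillController launch(TwillRunnerService runner, TwillApplication app) {
      return runner.prepare(app)
          .withMaxRetries("FailingServer", 3)
          .start();
    }
  }
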
@serranom

Contributor

serranom commented Jan 24, 2017

A couple of questions, since it has been a week since this pull request was submitted. Note, I'm not irritated or anything like that... I just want to set my expectations for planning purposes.

  • What should I generally expect (in terms of time) for feedback on the merging of a pull request?
  • Is there any issue with this pull request (code-wise) that contributes to the time?
  • Is the failing CI check (which I've seen is intermittent) a contributor to the time? That is, does the fact that it has a failure mean it gets less attention?
  • Should I be doing anything different?

Thanks!

@hsaputra

Contributor

hsaputra commented Jan 24, 2017

Hi @serranom, usually you should get a review in about a week. I apologize for the delay, but sometimes there are delays simply due to volunteer time.
As for CI failures, it depends; sometimes ZK is flaky, so functional tests can fail.

You are doing the right thing; we as committers need to improve our level of service. Thanks for your patience.

@hsaputra

Contributor

hsaputra commented Jan 24, 2017

Picking up this PR for review

@hsaputra

Contributor

hsaputra commented Jan 24, 2017

Let's trigger new build, @serranom could you close and reopen this PR please?

@hsaputra

Contributor

hsaputra commented Jan 24, 2017

Reference: https://issues.apache.org/jira/browse/TWILL-181

Hi @serranom, could you also update the PR title to prefix it with (TWILL-181), as shown on the how-to-contribute page [1]?
This will trigger the Git hook to close the JIRA when the PR is merged.

[1] http://twill.apache.org/HowToContribute.html

@serranom serranom changed the title from Twill 181 to (TWILL-181) allow setting the maximum number of retries per runnable Jan 25, 2017

@serranom

Contributor

serranom commented Jan 25, 2017

Thanks Henry! No rush on my account, I was just checking in.

@poornachandra

@serranom The code changes mostly look good, just one comment

@@ -113,9 +117,11 @@ public Integer apply(BitSet input) {
private final Location applicationLocation;
private final Set<String> runnableNames;
private final Map<String, Map<String, String>> logLevels;
private final Map<String, Integer> maxRetries;

@poornachandra

poornachandra Jan 26, 2017

Contributor

The max retries count in this map will have to be updated when the instance count of a runnable changes, right?

@serranom

serranom Jan 26, 2017

Contributor

Good catch, yes, I missed that. I will update the pull request with a fix. When the number of instances increases, deltaInstances * maxRetries will be added to the instance-adjusted maximum. When the number of instances decreases, deltaInstances * maxRetries will be subtracted from it. Additionally, upon a decrease, if the number of retries already exceeds the instance-adjusted maximum, it will be set to that maximum. (A small sketch of this arithmetic follows.)
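
For illustration, a tiny self-contained sketch of that arithmetic; the class, field, and method names here are hypothetical, not the actual patch:

  // Hypothetical helper illustrating the instance-adjusted maximum described above.
  final class RetryBudget {
    private final int maxRetriesPerInstance;
    private int adjustedMax;  // instance-adjusted maximum = maxRetriesPerInstance * current instances
    private int retries;      // retries used so far

    RetryBudget(int maxRetriesPerInstance, int initialInstances) {
      this.maxRetriesPerInstance = maxRetriesPerInstance;
      this.adjustedMax = maxRetriesPerInstance * initialInstances;
    }

    // Called when the instance count changes by deltaInstances (may be negative).
    void onInstancesChanged(int deltaInstances) {
      adjustedMax += deltaInstances * maxRetriesPerInstance;
      if (deltaInstances < 0 && retries > adjustedMax) {
        retries = adjustedMax;  // clamp when scaling down
      }
    }

    // Returns true and consumes one retry if the budget allows another attempt.
    boolean shouldRetry() {
      if (retries >= adjustedMax) {
        return false;
      }
      retries++;
      return true;
    }
  }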

@poornachandra

poornachandra Jan 26, 2017

Contributor

On further thought, I think that if we track the restarts per instance id, it would simplify the contract for max retries. This would also be consistent with what the javadoc says:

Sets the maximum number of times (per instance) a runnable will be retried if it exits without success. The default behavior is to retry indefinitely.

@serranom

serranom Jan 26, 2017

Contributor

From my analysis, the association of instance ids does not necessarily correspond to specific processes. It looks like there is a pool of requests: a new request being serviced gets the lowest instance id, and a failed request gets put back on the queue. This is why I went with an instance-adjusted number of retries. Did I miss something?

@poornachandra

poornachandra Jan 26, 2017

Contributor

Instance ids for runnable instances go from 0 to num-instances - 1. If an instance gets restarted, then it retains the same instance id. Using this instance id we can track the number of times a particular instance was restarted.

You can get the instance id in the handleCompleted method by using the getInstanceId(RunId runId) method. (A rough sketch of this kind of tracking follows.)
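
For illustration, a minimal sketch of what per-instance retry tracking along those lines could look like; the class and field names are hypothetical, not Twill's actual RunningContainers code:

  import java.util.HashMap;
  import java.util.Map;

  // Hypothetical sketch: tracks retries per (runnable name, instance id) and
  // decides whether another retry is allowed for that particular instance.
  final class PerInstanceRetryTracker {
    private final Map<String, Map<Integer, Integer>> retriesByInstance =
        new HashMap<String, Map<Integer, Integer>>();

    // Returns true if the given instance has not yet exhausted maxRetries.
    boolean shouldRetry(String runnableName, int instanceId, int maxRetries) {
      Map<Integer, Integer> perInstance = retriesByInstance.get(runnableName);
      if (perInstance == null) {
        perInstance = new HashMap<Integer, Integer>();
        retriesByInstance.put(runnableName, perInstance);
      }
      Integer used = perInstance.get(instanceId);
      int next = (used == null ? 0 : used) + 1;
      perInstance.put(instanceId, next);
      return next <= maxRetries;
    }
  }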

@serranom

serranom Jan 27, 2017

Contributor

I'm taking a look at this.

@serranom

serranom Jan 28, 2017

Contributor

@poornachandra, I don't see anything linking one instance request to a retry of that failure. Failures get added to the multiset, which then results in a new request for that many instances. This means that, in the end, there is nothing tying one attempt for a particular instanceId to the next one. So we could keep track of retries by instanceId, but it would end up being functionally the same as keeping track of retries for the entire Runnable. I think tracking by instanceId would just add more complexity for little benefit.

@serranom

serranom Jan 28, 2017

Contributor

But on second thought, the additional complexity of tracking by instance id would handle the following case better:

  • initially start with x instances, all of which succeed.
  • add 1 more instance, which fails

If tracking by instance id, the new instance will get maxRetries retries. Without tracking by instance id, the new instance would get maxRetries*(x+1) retries. So I will add the complexity of keeping track by id; it seems worth it.

@poornachandra

poornachandra Jan 28, 2017

Contributor

👍

@serranom

Contributor

serranom commented Jan 26, 2017

@hsaputra, since I have a bug to fix, the updated pull request will trigger the rebuild. Thanks.

@hsaputra

Contributor

hsaputra commented Jan 27, 2017

It seems one of the failed tests is related to the new one you added: MaxRetriesTestRun.maxRetriesTwoInstances.

Could you quickly take a peek?
https://s3.amazonaws.com/archive.travis-ci.org/jobs/192819935/log.txt

@serranom

Contributor

serranom commented Jan 27, 2017

@hsaputra, that is the failure I see as intermittent. My comment above has a relevant log excerpt. I'm planning to dig into it today or tomorrow.

@serranom

Contributor

serranom commented Jan 28, 2017

I have discovered the cause of the intermittent failure.

This code (starting at line 703 of ApplicationMasterService; comments added by me):

  /*
   * The provisionRequest will either contain a single container (ALLOCATE_ONE_INSTANCE_AT_A_TIME), or all the
   * containers to satisfy the expectedContainers count. In the latter case, the provision request is complete once
   * all the containers have run at which point we poll() to remove the provisioning request.
   */
  if (expectedContainers.getExpected(runnableName) == runningContainers.count(runnableName) ||
    provisioning.peek().getType().equals(AllocationSpecification.Type.ALLOCATE_ONE_INSTANCE_AT_A_TIME)) {
    provisioning.poll();
  }

There is a case, when instances are failing (but not simultaneously), where the retries for the instances are spread over two invocations of ApplicationMasterService.handleCompleted. This means they become part of separate RunnableContainerRequests and are therefore provisioned separately. Because the code above does not anticipate this case, the first provisionRequest never appears to be satisfied, is never polled, and the total can never be met. This is an existing bug, since I think it could happen before my new code; it's just that my test tickles it. If there is agreement, I can file a new JIRA and PR, or just fix it in this PR. Let me know.

@serranom

Contributor

serranom commented Jan 29, 2017

I have new changes I will add to this PR once #29 is resolved since the tests will fail intermittently without that fix.

@serranom

Contributor

serranom commented Feb 2, 2017

I think this is set now. I've updated the PR with instance-based maxRetries changes.

@serranom

Contributor

serranom commented Feb 2, 2017

Ugh. I'll take a look; I may have messed up the merge.

@serranom

Contributor

serranom commented Feb 3, 2017

I found that there is still a race condition in ApplicationMasterService.launchRunnable. If the number of instances is increased right after the original request is fulfilled, the current logic can result in the original request not being polled, leaving future requests hanging. This seems to be a fairly unlikely case in the real world, but I'll file a JIRA for it. For now, I've reworked the test to wait until launchRunnable has done the polling for the original request. This PR is now complete.

@hsaputra

Contributor

hsaputra commented Feb 3, 2017

Thanks much for working on the patch and for the update, @serranom.

Did a review pass.
LGTM in general; I added one minor comment. Other than that, +1; will merge after the comment is addressed.

@serranom

Contributor

serranom commented Feb 3, 2017

My pleasure. It was a good way to get to know the Twill code base.

@hsaputra

Contributor

hsaputra commented Feb 4, 2017

I am having a problem applying the patch; could you kindly rebase the patch into a single commit?
Thanks.

@serranom

Contributor

serranom commented Feb 6, 2017

I looked at the latest CI run, and the failure is related to a twill-zookeeper project issue, not YARN.

@hsaputra

Contributor

hsaputra commented Feb 6, 2017

Looks like it's not related to this PR. Will merge by EOD if there are no more comments.

Thanks again for the hard work, @serranom

@asfgit asfgit closed this in 78fc263 Feb 7, 2017
