Change default machineCreateAttempts to 3 #487

Merged: 2 commits merged into apache:master on Dec 15, 2016

Conversation

drigodwin
Member

In testing I've found that increasing the machineCreateAttempts to 3 makes the provisioning of machines more robust. I think it should be made default.
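
For reference, this is roughly the location config I've been testing with (an illustrative snippet only; the aws-ec2 location and region are just examples):

```yaml
location:
  jclouds:aws-ec2:
    region: us-east-1
    # retry provisioning if a machine comes up unusable
    machineCreateAttempts: 3
```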

@aledsage
Contributor

@drigodwin interesting. I'm not sure who we should optimise for though, with the defaults.

The problem is if someone has misconfigured their location. For example, we'll wait several minutes for the machine to be sshable before reporting provisioning as failed. It's annoying that it takes 20 minutes (or whatever) to tell you that provisioning has failed. If we set machineCreateAttempts=3 then it will take 60 minutes before it tells you that it's failed.
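
To make the arithmetic concrete (an illustrative snippet; the waitForSshable key and the 20-minute value are just examples of the per-attempt ssh wait), the worst-case time-to-failure is roughly the per-attempt wait multiplied by machineCreateAttempts:

```yaml
location:
  jclouds:aws-ec2:
    # per-attempt wait for the new VM to become sshable (example value)
    waitForSshable: 20m
    # worst case ~3 x 20m = 60m before a misconfigured location is reported as failed
    machineCreateAttempts: 3
```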

It's a particular concern because of problems like https://issues.apache.org/jira/browse/JCLOUDS-1165, which can hit someone using the defaults in a common cloud (e.g. aws-ec2).

Maybe we compromise with machineCreateAttempts=2?

Much longer term, we could try to make the activities view clearer, to show that it's moved onto attempt number 2 (cc @m4rkmckenna @tbouron). That probably would require some significant restructuring of the low-level tasks executed for provisioning though.

@tbouron
Member

tbouron commented Dec 15, 2016

@aledsage From a UI perspective, I think the best way would be to have one subtask per attempt. It would then be obvious that attempt 1 failed, and for which reason. If we cannot do that, we could introduce a new RETRY status with an attempt field that would be rendered differently.

@m4rkmckenna thoughts?

@m4rkmckenna
Member

@tbouron A task per attempt would make it clear what happened

An hour to fail provisioning (due to misconfiguration) seems like terrible UX, so 2 attempts seems like a good compromise

@neykov
Member

neykov commented Dec 15, 2016

What happens to the machines if provisioning fails? I've seen cases where it would actually create the machine but then fail for some other reason, and the machines would be left running, never managed by Brooklyn.
If that's covered, then +1 for the change.

@drigodwin
Member Author

I presume you mean problems such as BROOKLYN-264 @neykov? I agree that hanging instances are a problem, but I'm not sure it's a good reason not to make this change here. I do agree with @aledsage that this could cause a long wait when a location is incorrectly specified, so I've reduced it to 2 as suggested.

@neykov
Member

neykov commented Dec 15, 2016

Yes, it's a similar problem. Agreed.

@bostko
Contributor

bostko commented Dec 15, 2016

In what cases did you need to increase this parameter?
For me it is usually a wrong location configuration, and in that case I'd rather fix the location configuration and relaunch.

If there is a cloud API which fails to create a VM, then that should be fixed at the jclouds level.
And if it's a failure in JcloudsLocation#obtain, then that should be fixed there somehow. I don't see when such a retry should be used at all.

@aledsage
Contributor

@bostko clouds will sometimes return a machine that is dead-on-arrival (DOA): it is impossible to ssh to that machine, no matter how long you wait. Sometimes (less often) the VM fails during provisioning, reporting a status of "starting" and then "error". It's not as simple as the cloud API having failed to create the VM - a VM exists, but it's unusable.

In those cases, jclouds will not retry. That's fine - we don't want jclouds to do the retry logic; we want it to tell us about the failure. We'll get back the id of the VM that was provisioned, and can then call jclouds to destroy it. Within Brooklyn, we can decide whether to retry.

It varies how often these DOA VMs happen. From my experience, it used to be about 1 in 50 in AWS, but is even less common now. In some other clouds, it's more common.

This param is particularly important when provisioning bigger clusters or many apps: the more machines you provision, the more likely you are to hit a DOA VM.
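
For example (a sketch blueprint; the entity types and sizes here are just illustrative), a 10-node cluster is fairly likely to hit at least one bad machine, and a second attempt lets the deployment carry on rather than fail outright:

```yaml
location:
  jclouds:aws-ec2:
    machineCreateAttempts: 2
services:
- type: org.apache.brooklyn.entity.group.DynamicCluster
  brooklyn.config:
    initialSize: 10
    # each member is a minimal machine-only entity, for illustration
    dynamiccluster.memberspec:
      $brooklyn:entitySpec:
        type: org.apache.brooklyn.entity.software.base.EmptySoftwareProcess
```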

@bostko
Contributor

bostko commented Dec 15, 2016

I haven't come across DOA machines myself.
I would vote for making it 2 by default.

@aledsage
Contributor

@neykov @drigodwin for VMs being "orphaned" (left running but not managed by Brooklyn), I think the current handling is good enough for now. The only times I know of it happening are:

  1. When Brooklyn was terminated during provisioning (before it got the VM id back from jclouds) - i.e. https://issues.apache.org/jira/browse/BROOKLYN-264
  2. When the cloud gives us an error, when we try to delete the VM (e.g. https://issues.apache.org/jira/browse/BROOKLYN-411)

For (2), there's not much we can do - but we could improve things a bit:

  1. Try to handle such cases in jclouds (but for BROOKLYN-411 we already retried 6 times!)
  2. Raise an event that the VM is "orphaned" (requires a better place to push such events, rather than just the info log; that's a separate topic!)

Of course if someone also sets destroyOnFailure: false then we'll leave behind an orphaned VM!
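
For completeness, the two knobs discussed here would be set together something like this (a sketch; the values are illustrative):

```yaml
location:
  jclouds:aws-ec2:
    machineCreateAttempts: 2
    # keep failed VMs around for debugging; note they are then left running,
    # orphaned and not managed by Brooklyn
    destroyOnFailure: false
```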

asfgit merged commit ac56b00 into apache:master on Dec 15, 2016
asfgit pushed a commit referencing this pull request on Dec 15, 2016
drigodwin deleted the patch-1 branch on December 15, 2016 at 16:11