Change default machineCreateAttempts to 3 #487

Merged: 2 commits merged into apache:master on Dec 15, 2016

Conversation

drigodwin
Member

In testing I've found that increasing the machineCreateAttempts to 3 makes the provisioning of machines more robust. I think it should be made default.
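
For reference, this is roughly the location config I've been testing with (an illustrative snippet only; the aws-ec2 location and region are just examples):

```yaml
location:
  jclouds:aws-ec2:
    region: us-east-1
    # retry provisioning if a machine comes up unusable
    machineCreateAttempts: 3
```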

@aledsage
Contributor

@drigodwin interesting. I'm not sure who we should optimise for though, with the defaults.

The problem is if someone has misconfigured their location. For example, we'll wait several minutes for the machine to be sshable before reporting provisioning as failed. It's annoying that it takes 20 minutes (or whatever) to tell you that provisioning has failed. If we set machineCreateAttempts=3 then it will take 60 minutes before it tells you that it's failed.
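
To make the arithmetic concrete (an illustrative snippet; the waitForSshable key and the 20-minute value are just examples of the per-attempt ssh wait), the worst-case time-to-failure is roughly the per-attempt wait multiplied by machineCreateAttempts:

```yaml
location:
  jclouds:aws-ec2:
    # per-attempt wait for the new VM to become sshable (example value)
    waitForSshable: 20m
    # worst case ~3 x 20m = 60m before a misconfigured location is reported as failed
    machineCreateAttempts: 3
```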

It's a particular concern because of problems like https://issues.apache.org/jira/browse/JCLOUDS-1165, which can hit someone using the defaults in a common cloud (e.g. aws-ec2).

Maybe we compromise with machineCreateAttempts=2?

Much longer term, we could try to make the activities view clearer, to show that it's moved onto attempt number 2 (cc @m4rkmckenna @tbouron). That probably would require some significant restructuring of the low-level tasks executed for provisioning though.

@tbouron
Member

tbouron commented Dec 15, 2016

@aledsage From a UI perspective, I think the best way would be to have one subtask per attempt. It would then be obvious that attempt 1 failed, and for which reason. If we cannot do that, we could introduce a new RETRY status with an attempt field that would be rendered differently.

@m4rkmckenna thoughts?

@m4rkmckenna
Member

@tbouron A task per attempt would make it clear what happened

An hour to fail provisioning (due to misconfiguration) seems like terrible UX, so 2 attempts seems like a good compromise

@neykov
Member

neykov commented Dec 15, 2016

What happens to the machines if provisioning fails? I've seen cases where it would actually create the machine but then fail for some other reason, and the machines would be left running, never managed by Brooklyn.
If that's covered, then +1 for the change.

@drigodwin
Member Author

I presume you mean problems such as BROOKLYN-264 @neykov? I agree that hanging instances are a problem, but I'm not sure it's a good reason not to make this change here. I do agree with @aledsage that this could cause a long wait when a location is incorrectly specified, so I've reduced it to 2 as suggested.

@neykov
Member

neykov commented Dec 15, 2016

Yes, it's a similar problem. Agreed.

@bostko
Contributor

bostko commented Dec 15, 2016

In what cases did you need to increase this parameter?
For me it is usually a wrong location configuration, and in that case I'd rather fix the location configuration and relaunch.

If there is a cloud API which fails to create a VM, then that should be fixed at the jclouds level.
And if it's a failure in JcloudsLocation#obtain, then that should be fixed there somehow. I don't see when such a retry should be used at all.

@aledsage
Contributor

@bostko clouds will sometimes return a machine that is dead-on-arrival (DOA): it is impossible to ssh to that machine, no matter how long you wait. Sometimes (less often) the VM fails during provisioning, reporting a status of "starting" and then "error". It's not as simple as the cloud API having failed to create the VM - a VM exists, but it's unusable.

In those cases, jclouds will not retry. That's fine - we don't want jclouds to do the retry logic; we want it to tell us about the failure. We'll get back the id of the VM that was provisioned, and can then call jclouds to destroy it. Within Brooklyn, we can decide whether to retry.

It varies how often these DOA VMs happen. From my experience, it used to be about 1 in 50 in AWS, but is even less common now. In some other clouds, it's more common.

This param is particularly important when provisioning bigger clusters or many apps: the more machines you provision, the more likely you are to hit a DOA VM.
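
For example (a sketch blueprint; the entity types and sizes here are just illustrative), a 10-node cluster is fairly likely to hit at least one bad machine, and a second attempt lets the deployment carry on rather than fail outright:

```yaml
location:
  jclouds:aws-ec2:
    machineCreateAttempts: 2
services:
- type: org.apache.brooklyn.entity.group.DynamicCluster
  brooklyn.config:
    initialSize: 10
    # each member is a minimal machine-only entity, for illustration
    dynamiccluster.memberspec:
      $brooklyn:entitySpec:
        type: org.apache.brooklyn.entity.software.base.EmptySoftwareProcess
```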

@bostko
Contributor

bostko commented Dec 15, 2016

I haven't come across DOA machines myself.
I would vote for making it 2 by default.

@aledsage
Contributor

@neykov @drigodwin for VMs being "orphaned" (left running but not managed by Brooklyn), I think the current handling is good enough for now. The only times I know of it happening are:

  1. When Brooklyn was terminated during provisioning (before it got the VM id back from jclouds) - i.e. https://issues.apache.org/jira/browse/BROOKLYN-264
  2. When the cloud gives us an error, when we try to delete the VM (e.g. https://issues.apache.org/jira/browse/BROOKLYN-411)

For (2), there's not much we can do - but we could improve things a bit:

  1. Try to handle such cases in jclouds (but for BROOKLYN-411 we already retried 6 times!)
  2. Raise an event that the VM is "orphaned" (requires a better place to push such events, rather than just the info log; that's a separate topic!)

Of course if someone also sets destroyOnFailure: false then we'll leave behind an orphaned VM!
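
For completeness, the two knobs discussed here would be set together something like this (a sketch; the values are illustrative):

```yaml
location:
  jclouds:aws-ec2:
    machineCreateAttempts: 2
    # keep failed VMs around for debugging; note they are then left running,
    # orphaned and not managed by Brooklyn
    destroyOnFailure: false
```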

asfgit merged commit ac56b00 into apache:master on Dec 15, 2016
asfgit pushed a commit referencing this pull request on Dec 15, 2016
drigodwin deleted the patch-1 branch on December 15, 2016 at 16:11