
Conversation

@GenTang
Contributor

@GenTang GenTang commented Jan 9, 2015

The boto API doesn't support tagging EC2 instances in the same call that launches them, so we use exception-handling code to wait until EC2 has had enough time to propagate the instance information.
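
For illustration, a minimal sketch of the retry-on-exception idea described above (the wrapper function and its parameters are placeholders, not the actual patch):

def tag_with_retry(instance, key, value):
    # Keep calling add_tag until EC2 has propagated information about the
    # freshly launched instance and the call stops failing.
    while True:
        try:
            instance.add_tag(key, value)
            return
        except Exception:
            # EC2 does not know about the instance yet; retry.
            pass

The review comments below suggest narrowing the except clause to the specific "instance not found" error and sleeping briefly between attempts.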

@AmplabJenkins

Can one of the admins verify this patch?

ec2/spark_ec2.py Outdated
Contributor


Would it make sense to insert a small wait here? Say, 0.1 seconds. Otherwise, if it takes even a second for the information about the new instance to propagate, we might spam EC2 with requests very quickly and get rate limited.

Also, can we catch the exception specific to an instance not existing? We don't want to catch other exceptions here, right?

Contributor Author


I think it takes some time for EC2 to return an "instance not existing" exception; that's why I left pass in the exception handler. However, maybe we should add a small wait to make sure we don't submit too many requests to EC2.

Yes, here we only want to catch the exception for an instance not existing. You are right, it is better to catch that specific exception. I will work on this.

@nchammas
Contributor

nchammas commented Jan 9, 2015

Thanks for working on this. I left a couple of comments inline.

cc @JoshRosen

@nchammas
Contributor

By the way, please also update the title of this PR to match the approach you are taking, since as you noted we can't actually use the same call to launch and tag instances. You can leave the JIRA tag at the beginning as-is.

@GenTang GenTang changed the title [SPARK-4983]Tag EC2 instances in the same call that launches them [SPARK-4983]exception handling about adding tags to EC2 instance Jan 10, 2015
@GenTang
Contributor Author

GenTang commented Jan 10, 2015

Since every EC2 error is raised as an EC2ResponseError exception, we use error_code to identify the specific "instance not existing" error.
If EC2 returns an instance-not-found error, we wait a short time and retry the request. Otherwise, we re-raise the same error.
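
A sketch of what this looks like with boto 2 (the wrapper function and its parameters are illustrative, not the exact patch):

import time

from boto.exception import EC2ResponseError

def add_tag_with_retry(instance, key, value, wait_seconds=0.5):
    while True:
        try:
            instance.add_tag(key, value)
            return
        except EC2ResponseError as e:
            if e.error_code != 'InvalidInstanceID.NotFound':
                # Some other EC2 error: re-raise it unchanged.
                raise
            # EC2 has not propagated the instance information yet; wait and retry.
            time.sleep(wait_seconds)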

ec2/spark_ec2.py Outdated
Contributor


Lower-case sleep() here?

Contributor Author


Sorry for the typo.

@nchammas
Contributor

@GenTang Did you test this on a few cluster launches to make sure it works?

@GenTang
Contributor Author

GenTang commented Jan 10, 2015

Yes, I reproduced the InvalidInstanceID.NotFound error by changing the instance ID before the add_tag action and then changing it back to the correct ID. However, it prints error information like the following to the screen:
ERROR:boto:400 Bad Request
ERROR:boto:
<Response><Errors><Error><Code>InvalidInstanceID.NotFound</Code><Message>The instance ID 'i-5aa99fb1' does not exist</Message></Error></Errors><RequestID>7bfc5342-1fc9-4431-b569-190b2f9c9e8c</RequestID></Response>
It is printed by the boto package.

@nchammas
Contributor

Are you saying that boto will print an error to screen even if we catch the exception?

@GenTang
Contributor Author

GenTang commented Jan 10, 2015

However, I ran into a really strange error a moment ago.
I launched a cluster containing 1 master and 1 slave with the script.
add_tag on the master succeeded after two tries, and add_tag on the slave succeeded without throwing the error. However, EC2 threw an InvalidInstanceID.NotFound error for the slave node at:

for i in cluster_instances:
    i.update()

in the wait_for_cluster_state function. It seems that the instance information had not propagated far enough for the update action, while it had already reached the point where the add_tag action could succeed.
I tried several times and it happened only once; I am not entirely sure why. Since wait_for_cluster_state is used for the launch and start actions (which need more than a minute to reach the ssh-ready state) and for destroy (which needs about a second to reach the terminated state), maybe the workaround is to add some extra waiting time before the update action by making the following change:

while True:
    time.sleep(5 * num_attempts + 1) 

at line 724

@GenTang GenTang closed this Jan 10, 2015
@GenTang GenTang reopened this Jan 10, 2015
@GenTang
Contributor Author

GenTang commented Jan 10, 2015

Yes, boto will print the error even when we catch the exception, but the script continues and the cluster is launched successfully.
It is just ugly to have such error information on the screen.
FYI, this used boto 2.34.0.

@nchammas
Contributor

Hmm, sucks that boto still prints that error.

Getting errors on i.update() seems to be part of AWS's general flakiness when propagating metadata info. The tag operations probably got servers that knew about the new instances, but the update operation got one that didn't.

It would be better if we could avoid adding arbitrary waits everywhere just to work around this problem with AWS. Is there something more direct we can do? Though honestly, maybe just adding a second inside wait_for_cluster_state() like you suggested is sufficient.

cc @JoshRosen @shivaram

@GenTang
Contributor Author

GenTang commented Jan 11, 2015

I understand why it still prints the exception error information even when we catch the exception. It is because we use

logging.basicConfig()

in the script, so the exception information is written to the log even though the exception is caught.
Maybe it is a stupid question, but why do we use logging in this script if we don't log anything in it?

@nchammas

@nchammas
Contributor

Hmm, I'm not sure. I wondered about that myself. Maybe this solution will work for us without having to turn off logging completely, though if one of the maintainers chimes in maybe we can remove that basicConfig() line entirely.
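
One common way to quiet boto's error output without removing basicConfig() entirely is to raise the threshold of boto's own logger; this is a sketch of that idea, not necessarily the exact solution referenced above:

import logging

# Keep the root logger configured as the script already does, but raise the
# boto logger's level so its ERROR records (e.g. the InvalidInstanceID.NotFound
# response) are no longer printed to the screen.
logging.basicConfig()
logging.getLogger('boto').setLevel(logging.CRITICAL)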

@GenTang
Contributor Author

GenTang commented Jan 11, 2015

Yeah, thanks for your solution. It works.
However, I would prefer to remove the logging package entirely if we don't really use it.

@nchammas
Contributor

Agreed. @shivaram / @JoshRosen?

@shivaram
Contributor

@nchammas @GenTang - The logging.basicConfig seems to have been around since the very beginning [1]. I don't know much about Python so I can't recommend keeping it or removing it. @JoshRosen can comment on that.

Other than that this solution looks fine to me. It is unfortunate that we have so many custom sleep calls across the file, but I don't think there is much else we can do given the EC2 API we have right now.

[1] https://github.com/mesos/spark/blob/08c50ad1fcf323f62c80dfeb8f1caaf164211e0b/ec2/spark_ec2.py#L538

Contributor


Revisiting this now, I wonder if it would just be simpler all around to insert a 5-second sleep right here and call it a day. That should give AWS enough time to get its act together.

Having the wait/retry logic all over the place seems more a burden than a benefit, honestly. What do you think?

Contributor Author


I agree with you. Since we can hit the same AWS metadata-propagation problem in the wait_for_cluster_state function, I think adding a waiting time is the better workaround.
The only disadvantage is that launching a cluster takes 5 more seconds.

Contributor


Well, in this case I think the 5 seconds will just come out of the time we'll spend waiting anyway for SSH to become available.

Contributor Author


Yeah, that's true. I will make a new commit as soon as possible.
I think I will add a 5-second sleep before tagging the master and make some modifications to the comment.

ec2/spark_ec2.py Outdated
Contributor Author


@nchammas
Maybe we should insert a print here to tell the user that we are waiting for the information to propagate, since there will be 5 idle seconds between the "Launched master in ..." message and the "Waiting for cluster to enter ..." message.
What do you think?

Contributor


Sure. Also, in the comments, reference SPARK-4983.

@GenTang GenTang changed the title [SPARK-4983]exception handling about adding tags to EC2 instance [SPARK-4983]insert waiting time before tagging ec2 instances Feb 1, 2015
@srowen
Member

srowen commented Feb 6, 2015

This seems fairly harmless and fixes a problem. Is it ready to merge as far as you guys are concerned?

@shivaram
Contributor

shivaram commented Feb 6, 2015

The latest update is a simple workaround and looks good to me.

ec2/spark_ec2.py Outdated
Contributor


Nit: Please rephrase this to "Waiting for AWS to propagate instance metadata..."
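
With that phrasing, the inserted lines would end up looking roughly like this (a sketch of the final shape of the change; the authoritative version is the PR diff):

import time

# SPARK-4983: give AWS a moment to propagate metadata about the newly
# launched instances before trying to tag them.
print("Waiting for AWS to propagate instance metadata...")
time.sleep(5)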

@nchammas
Contributor

nchammas commented Feb 6, 2015

Left 2 minor comments about phrasing, but otherwise this LGTM.

@nchammas
Contributor

nchammas commented Feb 6, 2015

Can someone trigger tests for this, by the way?

@nchammas
Contributor

nchammas commented Feb 6, 2015

Thank you @GenTang. This LGTM.

We could probably merge this right away, but the correct thing to do is wait for tests to run. @srowen can you trigger them please?

@shivaram
Contributor

shivaram commented Feb 6, 2015

Jenkins, test this please

@SparkQA

SparkQA commented Feb 6, 2015

Test build #26927 has started for PR 3986 at commit 13e257d.

  • This patch merges cleanly.

@SparkQA

SparkQA commented Feb 6, 2015

Test build #26927 has finished for PR 3986 at commit 13e257d.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@AmplabJenkins

Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/26927/

@nchammas
Contributor

nchammas commented Feb 6, 2015

This PR is ready to merge. So much discussion and effort ended up getting boiled down to 2 lines of code. :)

Thanks @GenTang for sticking through it. heh

@GenTang
Contributor Author

GenTang commented Feb 6, 2015

@nchammas You are welcome.
I am more than happy that this PR could finally be done. ^^

@JoshRosen
Contributor

Thanks for the fix. LGTM, so I'm going to merge this to master (1.4.0), branch-1.3 (1.3.0), and branch-1.2 (1.2.2).

asfgit pushed a commit that referenced this pull request Feb 6, 2015
The boto API doesn't support tag EC2 instances in the same call that launches them.
We add a five-second wait so EC2 has enough time to propagate the information so that
the tagging can succeed.

Author: GenTang <gen.tang86@gmail.com>
Author: Gen TANG <gen.tang86@gmail.com>

Closes #3986 from GenTang/spark-4983 and squashes the following commits:

13e257d [Gen TANG] modification of comments
47f0675 [GenTang] print the information
ab7a931 [GenTang] solve the issus spark-4983 by inserting waiting time
3179737 [GenTang] Revert "handling exceptions about adding tags to ec2"
6a8b53b [GenTang] Revert "the improvement of exception handling"
13e97a6 [GenTang] Revert "typo"
63fd360 [GenTang] typo
692fc2b [GenTang] the improvement of exception handling
6adcf6d [GenTang] handling exceptions about adding tags to ec2

(cherry picked from commit 0f3a360)
Signed-off-by: Josh Rosen <joshrosen@databricks.com>
@asfgit asfgit closed this in 0f3a360 Feb 6, 2015