
Conversation

@GenTang
Contributor

@GenTang GenTang commented Jan 9, 2015

The boto API doesn't support tagging EC2 instances in the same call that launches them, so we use exception-handling code to wait until EC2 has had enough time to propagate the instance information.
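
For illustration, a minimal sketch of the retry-on-exception idea described above (the wrapper function and its parameters are placeholders, not the actual patch):

def tag_with_retry(instance, key, value):
    # Keep calling add_tag until EC2 has propagated information about the
    # freshly launched instance and the call stops failing.
    while True:
        try:
            instance.add_tag(key, value)
            return
        except Exception:
            # EC2 does not know about the instance yet; retry.
            pass

The review comments below suggest narrowing the except clause to the specific "instance not found" error and sleeping briefly between attempts.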

@AmplabJenkins

Can one of the admins verify this patch?

ec2/spark_ec2.py Outdated
Contributor


Would it make sense to insert a small wait here? Say, 0.1 seconds. Otherwise, if it takes even a second for the information about the new instance to propagate, we might spam EC2 with requests very quickly and get rate limited.

Also, can we catch the exception specific to an instance not existing? We don't want to catch other exceptions here, right?

Contributor Author


I think it takes some time for EC2 to return an "instance not existing" exception; that's why I left pass in the exception handler. However, maybe we should add a small wait to make sure we don't submit too many requests to EC2.

Yes, here we only want to catch the exception for an instance not existing. You are right, it is better to catch that specific exception. I will work on this.

@nchammas
Contributor

nchammas commented Jan 9, 2015

Thanks for working on this. I left a couple of comments inline.

cc @JoshRosen

@nchammas
Contributor

By the way, please also update the title of this PR to match the approach you are taking, since as you noted we can't actually use the same call to launch and tag instances. You can leave the JIRA tag at the beginning as-is.

@GenTang GenTang changed the title [SPARK-4983]Tag EC2 instances in the same call that launches them [SPARK-4983]exception handling about adding tags to EC2 instance Jan 10, 2015
@GenTang
Contributor Author

GenTang commented Jan 10, 2015

Since every EC2 error is raised as an EC2ResponseError exception, we use error_code to identify the specific "instance not existing" error.
If EC2 returns an instance-not-found error, we wait a short time and retry the request. Otherwise, we re-raise the same error.
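
A sketch of what this looks like with boto 2 (the wrapper function and its parameters are illustrative, not the exact patch):

import time

from boto.exception import EC2ResponseError

def add_tag_with_retry(instance, key, value, wait_seconds=0.5):
    while True:
        try:
            instance.add_tag(key, value)
            return
        except EC2ResponseError as e:
            if e.error_code != 'InvalidInstanceID.NotFound':
                # Some other EC2 error: re-raise it unchanged.
                raise
            # EC2 has not propagated the instance information yet; wait and retry.
            time.sleep(wait_seconds)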

ec2/spark_ec2.py Outdated
Contributor


Lower-case sleep() here?

Contributor Author


Sorry for the typo.

@nchammas
Contributor

@GenTang Did you test this on a few cluster launches to make sure it works?

@GenTang
Contributor Author

GenTang commented Jan 10, 2015

Yes, I reproduced the InvalidInstanceID.NotFound error by changing the instance ID before the add_tag action and then changing it back to the correct ID. However, it prints error information like the following to the screen:
ERROR:boto:400 Bad Request
ERROR:boto:
<Response><Errors><Error><Code>InvalidInstanceID.NotFound</Code><Message>The instance ID 'i-5aa99fb1' does not exist</Message></Error></Errors><RequestID>7bfc5342-1fc9-4431-b569-190b2f9c9e8c</RequestID></Response>
It is printed by the boto package.

@nchammas
Contributor

Are you saying that boto will print an error to screen even if we catch the exception?

@GenTang
Contributor Author

GenTang commented Jan 10, 2015

However, I ran into a really strange error a moment ago.
I launched a cluster containing 1 master and 1 slave with the script.
add_tag on the master succeeded after two tries, and add_tag on the slave succeeded without throwing the error. However, EC2 threw an InvalidInstanceID.NotFound error for the slave node at:

for i in cluster_instances:
    i.update()

in the wait_for_cluster_state function. It seems that the instance information had not propagated far enough for the update action, while it had already reached the point where the add_tag action could succeed.
I tried several times and it happened only once; I am not entirely sure why. Since wait_for_cluster_state is used for the launch and start actions (which need more than a minute to reach the ssh-ready state) and for destroy (which needs about a second to reach the terminated state), maybe the workaround is to add some extra waiting time before the update action by making the following change:

while True:
    time.sleep(5 * num_attempts + 1) 

at line 724

@GenTang GenTang closed this Jan 10, 2015
@GenTang GenTang reopened this Jan 10, 2015
@GenTang
Contributor Author

GenTang commented Jan 10, 2015

Yes, boto will print the error even when we catch the exception, but the script continues and the cluster is launched successfully.
It is just ugly to have such error information on the screen.
FYI, this used boto 2.34.0.

@nchammas
Contributor

Hmm, sucks that boto still prints that error.

Getting errors on i.update() seems to be part of AWS's general flakiness when propagating metadata info. The tag operations probably got servers that knew about the new instances, but the update operation got one that didn't.

It would be better if we could avoid adding arbitrary waits everywhere just to work around this problem with AWS. Is there something more direct we can do? Though honestly, maybe just adding a second inside wait_for_cluster_state() like you suggested is sufficient.

cc @JoshRosen @shivaram

@GenTang
Contributor Author

GenTang commented Jan 11, 2015

I understand why it still prints the exception error information even when we catch the exception. It is because we use

logging.basicConfig()

in the script, so the exception information is written to the log even though the exception is caught.
Maybe it is a stupid question, but why do we use logging in this script if we don't log anything in it?

@nchammas

@nchammas
Contributor

Hmm, I'm not sure. I wondered about that myself. Maybe this solution will work for us without having to turn off logging completely, though if one of the maintainers chimes in maybe we can remove that basicConfig() line entirely.
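
One common way to quiet boto's error output without removing basicConfig() entirely is to raise the threshold of boto's own logger; this is a sketch of that idea, not necessarily the exact solution referenced above:

import logging

# Keep the root logger configured as the script already does, but raise the
# boto logger's level so its ERROR records (e.g. the InvalidInstanceID.NotFound
# response) are no longer printed to the screen.
logging.basicConfig()
logging.getLogger('boto').setLevel(logging.CRITICAL)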

@GenTang
Contributor Author

GenTang commented Jan 11, 2015

Yeah, thanks for your solution. It works.
However, I would prefer to remove the logging package entirely if we don't really use it.

@nchammas
Contributor

Agreed. @shivaram / @JoshRosen?

@shivaram
Contributor

@nchammas @GenTang - The logging.basicConfig seems to have been around since the very beginning [1]. I don't know much about Python so I can't recommend keeping it or removing it. @JoshRosen can comment on that.

Other than that this solution looks fine to me. It is unfortunate that we have so many custom sleep calls across the file, but I don't think there is much else we can do given the EC2 API we have right now.

[1] https://github.com/mesos/spark/blob/08c50ad1fcf323f62c80dfeb8f1caaf164211e0b/ec2/spark_ec2.py#L538

Contributor


Revisiting this now, I wonder if it would just be simpler all around to insert a 5-second sleep right here and call it a day. That should give AWS enough time to get its act together.

Having the wait/retry logic all over the place seems more a burden than a benefit, honestly. What do you think?

Contributor Author


I agree with you. Since we can hit the same AWS metadata-propagation problem in the wait_for_cluster_state function, I think adding a waiting time is the better workaround.
The only disadvantage is that launching a cluster takes 5 more seconds.

Contributor


Well, in this case I think the 5 seconds will just come out of the time we'll spend waiting anyway for SSH to become available.

Contributor Author


Yeah, that's true. I will make a new commit as soon as possible.
I think I will add a 5-second sleep before tagging the master and make some modifications to the comment.

ec2/spark_ec2.py Outdated
Contributor Author


@nchammas
Maybe we should insert a print here to tell the user that we are waiting for the information to propagate, since there will be 5 idle seconds between the "Launched master in ..." message and the "Waiting for cluster to enter ..." message.
What do you think?

Contributor


Sure. Also, in the comments, reference SPARK-4983.

@GenTang GenTang changed the title [SPARK-4983]exception handling about adding tags to EC2 instance [SPARK-4983]insert waiting time before tagging ec2 instances Feb 1, 2015
@srowen
Member

srowen commented Feb 6, 2015

This seems fairly harmless and fixes a problem. Is it ready to merge as far as you guys are concerned?

@shivaram
Contributor

shivaram commented Feb 6, 2015

The latest update is a simple workaround and looks good to me.

ec2/spark_ec2.py Outdated
Contributor


Nit: Please rephrase this to "Waiting for AWS to propagate instance metadata..."
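
With that phrasing, the inserted lines would end up looking roughly like this (a sketch of the final shape of the change; the authoritative version is the PR diff):

import time

# SPARK-4983: give AWS a moment to propagate metadata about the newly
# launched instances before trying to tag them.
print("Waiting for AWS to propagate instance metadata...")
time.sleep(5)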

@nchammas
Contributor

nchammas commented Feb 6, 2015

Left 2 minor comments about phrasing, but otherwise this LGTM.

@nchammas
Contributor

nchammas commented Feb 6, 2015

Can someone trigger tests for this, by the way?

@nchammas
Contributor

nchammas commented Feb 6, 2015

Thank you @GenTang. This LGTM.

We could probably merge this right away, but the correct thing to do is wait for tests to run. @srowen can you trigger them please?

@shivaram
Contributor

shivaram commented Feb 6, 2015

Jenkins, test this please

@SparkQA

SparkQA commented Feb 6, 2015

Test build #26927 has started for PR 3986 at commit 13e257d.

  • This patch merges cleanly.

@SparkQA

SparkQA commented Feb 6, 2015

Test build #26927 has finished for PR 3986 at commit 13e257d.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@AmplabJenkins

Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/26927/

@nchammas
Contributor

nchammas commented Feb 6, 2015

This PR is ready to merge. So much discussion and effort ended up getting boiled down to 2 lines of code. :)

Thanks @GenTang for sticking through it. heh

@GenTang
Contributor Author

GenTang commented Feb 6, 2015

@nchammas You are welcome.
I am more than happy that this PR could finally be done. ^^

@JoshRosen
Contributor

Thanks for the fix. LGTM, so I'm going to merge this to master (1.4.0), branch-1.3 (1.3.0), and branch-1.2 (1.2.2).

asfgit pushed a commit that referenced this pull request Feb 6, 2015
The boto API doesn't support tag EC2 instances in the same call that launches them.
We add a five-second wait so EC2 has enough time to propagate the information so that
the tagging can succeed.

Author: GenTang <gen.tang86@gmail.com>
Author: Gen TANG <gen.tang86@gmail.com>

Closes #3986 from GenTang/spark-4983 and squashes the following commits:

13e257d [Gen TANG] modification of comments
47f0675 [GenTang] print the information
ab7a931 [GenTang] solve the issus spark-4983 by inserting waiting time
3179737 [GenTang] Revert "handling exceptions about adding tags to ec2"
6a8b53b [GenTang] Revert "the improvement of exception handling"
13e97a6 [GenTang] Revert "typo"
63fd360 [GenTang] typo
692fc2b [GenTang] the improvement of exception handling
6adcf6d [GenTang] handling exceptions about adding tags to ec2

(cherry picked from commit 0f3a360)
Signed-off-by: Josh Rosen <joshrosen@databricks.com>
@asfgit asfgit closed this in 0f3a360 Feb 6, 2015