[SPARK-4983] Insert waiting time before tagging EC2 instances #3986
Conversation
Can one of the admins verify this patch?
ec2/spark_ec2.py
Outdated
Would it make sense to insert a small wait here? Say, 0.1 seconds. Otherwise, if it takes even a second for the information about the new instance to propagate, we might spam EC2 with requests very quickly and get rate limited.
Also, can we catch the exception specific to an instance not existing? We don't want to catch other exceptions here, right?
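The retry-with-a-small-wait idea the reviewer suggests could be sketched as follows. This is a hypothetical helper, not code from spark_ec2.py; the names tag_with_retries and is_retryable, and the default wait of 0.1 seconds, are assumptions drawn from the comment above:

```python
import time

def tag_with_retries(apply_tags, is_retryable, max_attempts=10, wait=0.1):
    # Hypothetical sketch of the reviewer's suggestion: retry the tagging
    # call, sleeping briefly between attempts so we don't spam EC2 and get
    # rate limited, and only retry on the specific error the caller deems
    # transient; anything else is re-raised immediately.
    for attempt in range(max_attempts):
        try:
            return apply_tags()
        except Exception as exc:
            if not is_retryable(exc) or attempt == max_attempts - 1:
                raise
            time.sleep(wait)  # give EC2 time to propagate instance info
```

The predicate keeps the broad except narrow in effect: unexpected errors still surface on the first attempt.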
I think it takes some time for EC2 to return the instance-not-existing exception. That's why I left pass in the exception handler. However, maybe we should add a small wait to ensure that we don't submit too many requests to EC2.
Yes, here we just want to catch the instance-not-existing exception. You are right, it is better to use a specific exception. I will work on this.
Thanks for working on this. I left a couple of comments inline. cc @JoshRosen
By the way, please also update the title of this PR to match the approach you are taking, since as you noted we can't actually use the same call to launch and tag instances. You can leave the JIRA tag at the beginning as-is.
As all EC2 errors raise an EC2ResponseError exception, we use error_code to identify the specific instance-not-existing error.
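The discrimination described here — catching boto's EC2ResponseError and inspecting its error_code — might look roughly like the sketch below. Since boto may not be available, a minimal stand-in class is used purely for illustration; the real exception lives in boto.exception, and the helper name is an assumption:

```python
class EC2ResponseError(Exception):
    # Minimal stand-in for boto.exception.EC2ResponseError (illustration
    # only). The real class carries the AWS error code, e.g.
    # "InvalidInstanceID.NotFound", on its error_code attribute.
    def __init__(self, error_code, message=""):
        super().__init__(message or error_code)
        self.error_code = error_code

def is_instance_not_found(exc):
    # Distinguish the transient "instance not yet visible" error from every
    # other EC2 failure, which should not be swallowed by retry logic.
    return (isinstance(exc, EC2ResponseError)
            and exc.error_code == "InvalidInstanceID.NotFound")
```

Checking error_code rather than the exception type alone is what keeps the handler from masking unrelated EC2 failures.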
ec2/spark_ec2.py
Outdated
Lower-case sleep() here?
Sorry for the typo.
@GenTang Did you test this on a few cluster launches to make sure it works?
Yes, I reproduced the InvalidInstanceID.NotFound error by changing the instance ID before the add_tag action and then changing it back to the correct ID. However, it prints the following error information to the screen:
Are you saying that boto will print an error to the screen even if we catch the exception?
However, I hit a really strange error a moment ago, in the wait_for_cluster_state function. It seems that the instance information had not yet propagated for the update action, while it had propagated far enough for the add_tag action to succeed, at line 724.
Yes, boto will print the error even if we catch the exception, but the script will continue and the cluster will be launched successfully.
Hmm, sucks that boto still prints that error. Getting errors on It would be better if we could avoid adding arbitrary waits everywhere just to work around this problem with AWS. Is there something more direct we can do? Though honestly, maybe just adding a second inside
I understand why it still prints the exception information even though we catch the exception. It is because of what we use in the script, so the exception information is added to the log even when it is caught.
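If the output is coming from boto's own use of the standard logging module, one common way to silence it is to raise the threshold on boto's logger. This is a general logging technique and an assumption about what the script could do, not necessarily what this PR ended up doing:

```python
import logging

# boto logs under the "boto" logger name; raising its level keeps handled
# exceptions from being echoed to the console while leaving the rest of the
# script's logging untouched.
logging.getLogger("boto").setLevel(logging.CRITICAL)
```

This avoids turning off logging completely, which matches the concern raised in the next comment.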
Hmm, I'm not sure. I wondered about that myself. Maybe this solution will work for us without having to turn off logging completely, though if one of the maintainers chimes in maybe we can remove that
Yeah, thanks for your solution. It works.
Agreed. @shivaram / @JoshRosen?
@nchammas @GenTang - The Other than that this solution looks fine to me. It is unfortunate that we have so many custom sleep calls across the file, but I don't think there is much else we can do given the EC2 API we have right now. [1] https://github.com/mesos/spark/blob/08c50ad1fcf323f62c80dfeb8f1caaf164211e0b/ec2/spark_ec2.py#L538
Revisiting this now, I wonder if it would just be simpler all around to insert a 5-second sleep right here and call it a day. That should give AWS enough time to get its act together.
Having the wait/retry logic all over the place seems like more of a burden than a benefit, honestly. What do you think?
I agree with you. Since we can hit the same AWS information-propagation problem in the wait_for_cluster_state function, I think adding a waiting time is the better workaround.
The only disadvantage is that we need 5 more seconds to launch a cluster.
Well, in this case I think the 5 seconds will just come out of the time we'll spend waiting anyway for SSH to become available.
Yeah, it is true. I will make a new commit as soon as possible.
I think I will add a 5-second sleep before tagging the master and make some modifications to the comments.
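The approach settled on here — announce the wait, then sleep for five seconds before tagging — could look something like the following sketch. The function name is illustrative (not the actual spark_ec2.py code), and the delay is parameterized only so the sketch stays easy to exercise:

```python
import time

def wait_for_metadata_propagation(seconds=5):
    # SPARK-4983: EC2 needs a moment to propagate new-instance metadata
    # before add_tag calls can succeed, so pause once before tagging
    # instead of sprinkling retry loops throughout the script.
    print("Waiting for AWS to propagate instance metadata...")
    time.sleep(seconds)

# In the launch path this would run right after the instances are started
# and immediately before the tagging calls.
```

A single up-front pause trades a small fixed delay for much simpler control flow than per-call retries.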
ec2/spark_ec2.py
Outdated
@nchammas
Maybe we should insert a print here to tell the user that we are waiting for the information to propagate, since there will be 5 seconds of idle time between launching the master and waiting for the cluster to enter.
What do you think?
Sure. Also, in the comments, reference SPARK-4983.
This seems fairly harmless and fixes a problem. Is it ready to merge as far as you guys are concerned?
The latest update is a simple workaround and looks good to me.
ec2/spark_ec2.py
Outdated
Nit: Please rephrase this to "Waiting for AWS to propagate instance metadata..."
Left 2 minor comments about phrasing, but otherwise this LGTM.
Can someone trigger tests for this, by the way?
Jenkins, test this please
Test build #26927 has started for PR 3986 at commit
Test build #26927 has finished for PR 3986 at commit
Test PASSed.
This PR is ready to merge. So much discussion and effort ended up getting boiled down to 2 lines of code. :) Thanks @GenTang for sticking through it. heh
@nchammas You are welcome.
Thanks for the fix. LGTM, so I'm going to merge this to
The boto API doesn't support tagging EC2 instances in the same call that launches them. We add a five-second wait so EC2 has enough time to propagate the information so that the tagging can succeed.

Author: GenTang <gen.tang86@gmail.com>
Author: Gen TANG <gen.tang86@gmail.com>

Closes #3986 from GenTang/spark-4983 and squashes the following commits:

13e257d [Gen TANG] modification of comments
47f0675 [GenTang] print the information
ab7a931 [GenTang] solve the issus spark-4983 by inserting waiting time
3179737 [GenTang] Revert "handling exceptions about adding tags to ec2"
6a8b53b [GenTang] Revert "the improvement of exception handling"
13e97a6 [GenTang] Revert "typo"
63fd360 [GenTang] typo
692fc2b [GenTang] the improvement of exception handling
6adcf6d [GenTang] handling exceptions about adding tags to ec2

(cherry picked from commit 0f3a360)
Signed-off-by: Josh Rosen <joshrosen@databricks.com>
As the boto API doesn't support tagging EC2 instances in the same call that launches them, we use exception-handling code to wait until EC2 has had enough time to propagate the information.