Update check_registered_slaves_aws.py ignore recently created asgs and sfrs #1930
Conversation
I wonder if we shouldn't be looking at each box individually, instead of the ASG/SFR? This check would still fire if we drastically scaled up an existing ASG/SFR, or if the SFR took a long time to fulfill the capacity request (i.e. if we're near our AWS instance limits)
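The age check being discussed could apply equally to a whole ASG/SFR or to each box individually. A minimal sketch of such a check (function name and defaults are hypothetical, not the PR's actual code):

```python
from datetime import datetime, timedelta, timezone

def is_old_enough(created_time, threshold_seconds=3600, now=None):
    # Skip resources (an ASG/SFR, or an individual instance if we check
    # per-box) that are younger than the threshold, so freshly launched
    # capacity isn't flagged as unregistered before it has had time to join.
    now = now or datetime.now(timezone.utc)
    return (now - created_time) >= timedelta(seconds=threshold_seconds)
```

Checking each instance's launch time instead of the ASG/SFR creation time would also cover the scale-up and slow-fulfillment cases described above.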
@@ -96,6 +97,9 @@
 AWS_SPOT_MODIFY_TIMEOUT = 30
 MISSING_SLAVE_PANIC_THRESHOLD = .3
+# Age threshold in seconds that an ASG or SFR must exceed
+# before being checked for slave registration.
+CHECK_REGISTERED_SLAVE_THRESHOLD = 3600
This should probably be defined as a getter on SystemPaastaConfig instead of a constant, so that when we want to tweak this we don't need to push a new version of Paasta.
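A minimal sketch of what such a getter could look like; the method name, config key, and class internals here are assumptions for illustration, not paasta's actual implementation:

```python
class SystemPaastaConfig:
    """Simplified stand-in for paasta's system config wrapper."""

    def __init__(self, config_dict):
        self.config_dict = config_dict

    def get_check_registered_slaves_threshold(self):
        # Fall back to one hour when the key is absent, so configs
        # deployed before the new key exists keep the current behavior.
        return self.config_dict.get("check_registered_slaves_threshold", 3600)
```

With a getter like this, operators can tweak the threshold via config without shipping a new paasta version.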
Good call. I thought about this problem but didn't know where the best place to make it configurable was.
My newest commit uses a new key in /etc/paasta/monitoring.json so we can deploy a tweak via puppet. See https://reviewboard.yelpcorp.com/r/329219 and my latest commit.
Yeah, good point. Let me rethink this.
Hmm, we're actually moving away from those S3 config files as we move to clusterman. AWS now supports tagging of SFRs (or, more correctly, passing a tag specification to the instances), so clusterman just uses the AWS API to fetch all the SFRs and then filters them based on tags.
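The fetch-then-filter approach described above can be sketched as a pure function over the `SpotFleetRequestConfigs` list that boto3's `ec2.describe_spot_fleet_requests()` returns; the tag key/value here are placeholders, and clusterman's real code will differ:

```python
def filter_sfrs_by_tag(sfr_configs, tag_key, tag_value):
    # Keep only spot fleet requests carrying the given tag. Each element
    # is a dict from boto3's describe_spot_fleet_requests response, which
    # includes a top-level "Tags" list of {"Key": ..., "Value": ...} dicts.
    return [
        sfr for sfr in sfr_configs
        if any(
            tag.get("Key") == tag_key and tag.get("Value") == tag_value
            for tag in sfr.get("Tags", [])
        )
    ]
```

Keeping the filtering separate from the API call makes it easy to unit-test against canned response data.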
@mattmb so with the move to clusterman this will be deprecated? Also, not sure if either of you saw this CR, but can you look at https://reviewboard.yelpcorp.com/r/327638/ as well?
Well, I think we'll have to refactor it to work the same way without the S3 files. But for now this will work, so I say ship.
👍
@EvanKrall my PR has been updated to use a value from /etc/paasta/monitoring.json. As for the suggestion to look at each box individually, I think that's outside the scope of my refactor for the ticket I'm working on. Sound fair?
Looks good to me! Don't block on that other suggestion - if that hypothetical becomes a problem we can look into it then.
@ddelnano do you need any more help on this? Would you like me to press the green button?
Sorry, I was just tied up with other work. Merging.
See OPS-13784 for details.
Todo:
- `aws autoscaling describe-auto-scaling-groups --group-name` gives me the output my unit tests expect: https://fluffy.yelpcorp.com/i/Xb526BdnKq03q5TRGlfv40Sd5bCVT6Cm.html
- `aws ec2 describe-spot-fleet-requests` gives me the output my unit tests expect: https://fluffy.yelpcorp.com/i/GHnrsHK3n3DjxcNFcPJ4zvKB27Tp4QLn.html

I noticed that check_registered_slaves_aws.py looks at /etc/paasta/cluster_autoscaling/{s3_bucket}/{cluster}.json files. However, I was unable to find any spot fleets in any of those files. Is that intended?
Unfortunately, I can't prove this fixes OPS-13784 unless it stays on stage for a while. My original plan for testing was to build a deb manually, install it on the hosts I care about, and disable puppet. However, cep1070 is going to explicitly ignore hosts with puppet disabled. I could put my manually built package on the apt repo so that I can install it through puppet, but I'd rather have a jenkins job do that for me than copy the deb around myself.