Update check_registered_slaves_aws.py ignore recently created asgs and sfrs #1930
Conversation
I wonder if we shouldn't be looking at each box individually, instead of the ASG/SFR? This check would still fire if we drastically scaled up an existing ASG/SFR, or if the SFR took a long time to fulfill the capacity request (i.e. if we're near our AWS instance limits)
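The age check being discussed could apply equally to a whole ASG/SFR or to each box individually. A minimal sketch of such a check (function name and defaults are hypothetical, not the PR's actual code):

```python
from datetime import datetime, timedelta, timezone

def is_old_enough(created_time, threshold_seconds=3600, now=None):
    # Skip resources (an ASG/SFR, or an individual instance if we check
    # per-box) that are younger than the threshold, so freshly launched
    # capacity isn't flagged as unregistered before it has had time to join.
    now = now or datetime.now(timezone.utc)
    return (now - created_time) >= timedelta(seconds=threshold_seconds)
```

Checking each instance's launch time instead of the ASG/SFR creation time would also cover the scale-up and slow-fulfillment cases described above.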
@@ -96,6 +97,9 @@
 AWS_SPOT_MODIFY_TIMEOUT = 30
 MISSING_SLAVE_PANIC_THRESHOLD = .3
+# Age threshold in seconds that an ASG or SFR must exceed
+# before being checked for slave registration.
+CHECK_REGISTERED_SLAVE_THRESHOLD = 3600
This should probably be defined as a getter on SystemPaastaConfig instead of a constant, so that when we want to tweak this we don't need to push a new version of Paasta.
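A minimal sketch of what such a getter could look like; the method name, config key, and class internals here are assumptions for illustration, not paasta's actual implementation:

```python
class SystemPaastaConfig:
    """Simplified stand-in for paasta's system config wrapper."""

    def __init__(self, config_dict):
        self.config_dict = config_dict

    def get_check_registered_slaves_threshold(self):
        # Fall back to one hour when the key is absent, so configs
        # deployed before the new key exists keep the current behavior.
        return self.config_dict.get("check_registered_slaves_threshold", 3600)
```

With a getter like this, operators can tweak the threshold via config without shipping a new paasta version.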
Good call. I thought about this problem but didn't know where the best place to make it configurable was.
My newest commit uses a new key in /etc/paasta/monitoring.json so we can deploy a tweak via puppet. See https://reviewboard.yelpcorp.com/r/329219 and my latest commit.
Yeah, good point. Let me rethink this.
Hmm, we're actually moving away from those S3 config files as we move to clusterman. AWS now supports tagging of SFRs (or, more correctly, passing a tag specification to the instances), so clusterman just uses the AWS API to fetch all the SFRs and then filters them based on tags.
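The fetch-then-filter approach described above can be sketched as a pure function over the `SpotFleetRequestConfigs` list that boto3's `ec2.describe_spot_fleet_requests()` returns; the tag key/value here are placeholders, and clusterman's real code will differ:

```python
def filter_sfrs_by_tag(sfr_configs, tag_key, tag_value):
    # Keep only spot fleet requests carrying the given tag. Each element
    # is a dict from boto3's describe_spot_fleet_requests response, which
    # includes a top-level "Tags" list of {"Key": ..., "Value": ...} dicts.
    return [
        sfr for sfr in sfr_configs
        if any(
            tag.get("Key") == tag_key and tag.get("Value") == tag_value
            for tag in sfr.get("Tags", [])
        )
    ]
```

Keeping the filtering separate from the API call makes it easy to unit-test against canned response data.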
@mattmb so with the move to clusterman this will be deprecated? Also, not sure if either of you saw this CR, but can you look at https://reviewboard.yelpcorp.com/r/327638/ as well?
Well, I think we'll have to refactor it to work the same way without the S3 files. But for now this will work, so I say ship.
👍
@EvanKrall my PR has been updated to use a value from /etc/paasta/monitoring.json. As for the suggestion to look at each box individually, I think that's outside the scope of my refactor for the ticket I'm working on. Sound fair?
Looks good to me! Don't block on that other suggestion - if that hypothetical becomes a problem we can look into it then.
@ddelnano do you need any more help on this? Would you like me to press the green button?
Sorry, I was just tied up with other work. Merging.
See OPS-13784 for details.
Todo:
- `aws autoscaling describe-auto-scaling-groups --group-name` gives me the output my unit tests expect: https://fluffy.yelpcorp.com/i/Xb526BdnKq03q5TRGlfv40Sd5bCVT6Cm.html
- `aws ec2 describe-spot-fleet-requests` gives me the output my unit tests expect: https://fluffy.yelpcorp.com/i/GHnrsHK3n3DjxcNFcPJ4zvKB27Tp4QLn.html

I noticed that check_registered_slaves_aws.py looks at /etc/paasta/cluster_autoscaling/{s3_bucket}/{cluster}.json files. However, I was unable to find any spot fleets in any of those files. Is that intended?
Unfortunately, I can't prove this fixes OPS-13784 unless it stays on stage for a while. My original plan for testing was to build a deb manually, install it on the hosts I care about, and disable puppet. However, cep1070 is going to explicitly ignore hosts with puppet disabled. I could put my manually built package on the apt repo so that I can install it through puppet, but I'd rather have a jenkins job do that for me than copy the deb around myself.