
Update check_registered_slaves_aws.py ignore recently created asgs and sfrs #1930

Conversation

Contributor

@ddelnano ddelnano commented Aug 7, 2018

See OPS-13784 for details.

Todo

I noticed that check_registered_slaves_aws.py looks at /etc/paasta/cluster_autoscaling/{s3_bucket}/{cluster}.json files. However, I was unable to find any spot fleets in any of those files. Is that intended?

Unfortunately I can't prove this fixes OPS-13784 unless it stays on stage for a while. My original plan for testing was to build a deb manually, install it on the hosts I care about, and disable puppet. However, cep1070 is going to explicitly ignore hosts with puppet disabled. I could put my manually built package on the apt repo so that I can install it through puppet, but I'd rather have a jenkins job do that than copy the deb around myself.
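For context, the fix adds an age check so the alert skips ASGs and SFRs younger than a threshold. A minimal sketch of that filtering logic, assuming each resource record carries a creation timestamp (the function and field names here are illustrative, not the actual paasta code):

```python
from datetime import datetime, timedelta, timezone

# Hypothetical sketch: skip autoscaling resources younger than the age
# threshold, so freshly created ASGs/SFRs aren't checked for slave
# registration before their instances have had time to launch.
CHECK_REGISTERED_SLAVE_THRESHOLD = 3600  # seconds

def is_old_enough(created_at, now=None, threshold=CHECK_REGISTERED_SLAVE_THRESHOLD):
    """Return True if the resource was created more than `threshold` seconds ago."""
    now = now or datetime.now(timezone.utc)
    return (now - created_at) > timedelta(seconds=threshold)

def resources_to_check(resources, now=None):
    """Filter out recently created ASGs/SFRs; each dict has a 'created_at' datetime."""
    return [r for r in resources if is_old_enough(r["created_at"], now=now)]
```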

@ghost ghost assigned ddelnano Aug 7, 2018
@ghost ghost added the in review label Aug 7, 2018
Member

@EvanKrall EvanKrall left a comment


I wonder if we shouldn't be looking at each box individually, instead of the ASG/SFR? This check would still fire if we drastically scaled up an existing ASG/SFR, or if the SFR took a long time to fulfill the capacity request (i.e. if we're near our AWS instance limits)

@@ -96,6 +97,9 @@

AWS_SPOT_MODIFY_TIMEOUT = 30
MISSING_SLAVE_PANIC_THRESHOLD = .3
# Age threshold in seconds that an ASG or SFR must exceed before
# being checked for slave registration.
CHECK_REGISTERED_SLAVE_THRESHOLD = 3600
Member


This should probably be defined as a getter on SystemPaastaConfig instead of a constant, so that when we want to tweak this we don't need to push a new version of Paasta.
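The suggested getter-with-default pattern might look roughly like this (a sketch; the method name, key name, and real SystemPaastaConfig internals are guesses, not the actual paasta API):

```python
# Hypothetical sketch of a SystemPaastaConfig-style getter: the value is
# read from the system config with a baked-in default, so tuning it only
# requires a config push rather than a new paasta release.
class SystemPaastaConfig:
    def __init__(self, config_dict):
        self.config_dict = config_dict

    def get_check_registered_slave_threshold(self):
        # Fall back to one hour when the key is absent from the config.
        return self.config_dict.get("check_registered_slave_threshold", 3600)
```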

Contributor Author


Good call. I thought about this problem but didn't know where the best place to make it configurable was.

Contributor Author


My newest commit uses a new key in /etc/paasta/monitoring.json so we can deploy a tweak via puppet. See https://reviewboard.yelpcorp.com/r/329219 and my latest commit.
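For illustration, the new key in /etc/paasta/monitoring.json might look something like this (the key name is a guess; the actual key used by the commit isn't shown in this conversation):

```json
{
  "check_registered_slaves_age_threshold": 3600
}
```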

@ddelnano
Contributor Author

ddelnano commented Aug 8, 2018

Yea good point. Let me rethink this.

@mattmb
Member

mattmb commented Aug 9, 2018

Hmm, we're actually moving away from those S3 config files as we move to clusterman, now that AWS supports tagging of SFRs (or, more correctly, passing a tag specification to the instances). Clusterman just uses the AWS API to fetch all the SFRs and then filters them based on tags.
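The tag-based filtering mattmb describes could be sketched like this. The input shape loosely follows boto3's describe_spot_fleet_requests response, but the tag name and function are illustrative, not clusterman's actual implementation:

```python
# Hypothetical sketch of filtering spot fleet requests by tag, instead of
# reading per-cluster S3 config files. Each SFR dict mimics the shape of
# a boto3 describe_spot_fleet_requests entry.
def sfrs_for_pool(sfr_configs, pool):
    """Keep only the SFRs whose tags mark them as belonging to `pool`."""
    matched = []
    for sfr in sfr_configs:
        tags = {t["Key"]: t["Value"] for t in sfr.get("Tags", [])}
        if tags.get("pool") == pool:  # the "pool" tag name is illustrative
            matched.append(sfr["SpotFleetRequestId"])
    return matched
```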

@ddelnano
Contributor Author

@mattmb so with the move to clusterman this will be deprecated? Also, not sure if either of you saw this CR, but can you look at https://reviewboard.yelpcorp.com/r/327638/ as well?

@mattmb
Member

mattmb commented Aug 13, 2018

Well, I think we'll have to refactor it to work the same way without the S3 files. But for now this will work, so I say ship.

Member

@mattmb mattmb left a comment


👍

@ddelnano
Contributor Author

ddelnano commented Aug 14, 2018

@EvanKrall my PR has been updated to use a value from /etc/paasta/monitoring.json.

As for

> I wonder if we shouldn't be looking at each box individually, instead of the ASG/SFR? This check would still fire if we drastically scaled up an existing ASG/SFR, or if the SFR took a long time to fulfill the capacity request (i.e. if we're near our AWS instance limits)

I think that's outside the scope of my refactor for the ticket I'm working on. Sound fair?

Member

@EvanKrall EvanKrall left a comment


Looks good to me! Don't block on that other suggestion - if that hypothetical becomes a problem we can look into it then.

@solarkennedy
Contributor

@ddelnano do you need any more help on this? Would you like me to press the green button?

@ddelnano
Contributor Author

Sorry I was just tied up with other work, merging.

@ddelnano ddelnano merged commit 0b42232 into master Aug 17, 2018
@ddelnano ddelnano deleted the u/ddelnano/have-check_registered_slaves_aws-ignore-recently-created-asgs-sfrs branch August 17, 2018 23:24
@ghost ghost removed the in review label Aug 17, 2018