
FIX: Heartbeat check per sidekiq process #7873

Merged: 9 commits into discourse:master on Aug 26, 2019

Conversation

OsamaSayegh (Member)

No description provided.

@discoursebot

You've signed the CLA, OsamaSayegh. Thank you! This pull request is ready for review.

@eviltrout (Contributor) left a comment:

Overall this method seems sound. I was a bit worried about having so many queues, but since there is only one worker per custom queue it should not be very inefficient.

I only have one minor quibble about the naming of a method.

The one tricky thing is testing. There are a couple of unit tests, which is good, but before we merge: how comprehensively have you tested the queues/workers running locally? There is a fair amount here that is not tested automatically.

(Inline review comments on lib/demon/sidekiq.rb and config/unicorn.conf.rb; all outdated and resolved.)
@OsamaSayegh (Member, Author)

@eviltrout the way I tested this change was to make sidekiq boot locally without any queues, by passing an empty array to cli.parse here:

cli.parse(options)

That way sidekiq couldn't process the jobs we enqueue here:

class Heartbeat < Jobs::Scheduled
  every 3.minutes

  def execute(args)
    # Enqueue a heartbeat job that a healthy sidekiq worker should pick up promptly.
    Jobs.enqueue(:run_heartbeat, {})
  end
end

Then I lowered the heartbeat check interval to 5 seconds (so that I didn't have to wait 30 minutes) and made the Heartbeat job run every 2 seconds. After that I ran UNICORN_SIDEKIQS=2 ./bin/unicorn, and shortly afterwards the check_sidekiq_heartbeat method ran and the 2 sidekiq processes that had been spawned were restarted. Does this sound like correct/enough testing?
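
Concretely, the no-queues tweak described above amounts to a local, throwaway edit along these lines (illustrative; it was never committed):

# Local-only change for testing: boot sidekiq with no queues at all, so the
# heartbeat jobs enqueued by Jobs::Heartbeat pile up and are never processed.
cli.parse([]) # instead of cli.parse(options)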


There is still one problem that I'd like to address before merging this: every time the master unicorn boots up, 2 unique keys are created in redis (heartbeat_queues_list_key and queues_last_heartbeat_hash_key). This means that if there are 3 app servers and you hit deploy, 6 redis keys are created per deploy. Over time this would accumulate lots of redis keys that aren't used at all, so I'd like to clean up keys as they become unneeded.

One way to do that would be to prefix the 2 keys with the server hostname, and then in the before_start method (which I just added) we can iterate over all redis keys and delete the ones that aren't needed. The prefix is needed to ensure no server deletes other servers' keys. That way, each time the master unicorn boots up, it would clean up the keys from its previous boot, roughly as in the sketch below. Does that sound good? Is there a better approach to this?
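
A minimal sketch of that cleanup idea, assuming hostname-prefixed keys and the redis-rb scan API (the key prefix and helper name are illustrative, not code from this PR):

# Hypothetical before_start cleanup: scoping keys by hostname ensures no
# server deletes another server's keys.
def cleanup_stale_heartbeat_keys(redis)
  prefix = "#{Discourse.current_hostname}:sidekiq_heartbeat" # illustrative prefix

  # On boot, before any new keys are written, drop everything this host
  # left behind from its previous boot.
  redis.scan_each(match: "#{prefix}:*") do |key|
    redis.del(key)
  end
end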

@eviltrout (Contributor)

> Does that sound good? Is there a better approach to this?

Keeping track of what keys you've set in the past can be messy, and it wouldn't account for the situation where we retire a host, in which case its old keys would exist forever.

Instead, I recommend giving the keys an EXPIRY, which you can bump whenever the heartbeat runs. If the heartbeat runs every 3 minutes, for example, you could set the keys to expire in 60 minutes or so, just to be very careful.

After one hour of not being used, the keys will vanish.
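
A minimal sketch of that expiry approach, assuming the redis-rb client (the hash key name comes from the discussion above; the TTL constant and helper name are illustrative):

# Bump the TTL on every heartbeat; if heartbeats stop (e.g. the host is
# retired), the key simply expires on its own.
HEARTBEAT_KEY_TTL = 60 * 60 # one hour, far longer than the 3-minute heartbeat

def record_heartbeat(redis, queue)
  redis.hset("queues_last_heartbeat_hash_key", queue, Time.now.to_i)
  redis.expire("queues_last_heartbeat_hash_key", HEARTBEAT_KEY_TTL)
end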


# Excerpt from the PR diff (the master process's heartbeat check):
restart = true
if !restarted
  Demon::Sidekiq.heartbeat_queues.each do |queue|
Inline comment (Member):

hmmm is this checking all heartbeat queues? Should it not just check heartbeats for the child sidekiq processes it has?

@OsamaSayegh (Member, Author) replied:

Demon::Sidekiq.heartbeat_queues returns the heartbeat queues for all sidekiq processes spawned by the unicorn master, and we check their heartbeats one by one here. Is this not what we are supposed to do?

Reply (Member):

Oh I see... so each process will give a different list here...

A confusing thing here is that if you call Demon::Sidekiq.heartbeat_queues you will get one list... and if you call this in a forked child process you will get an empty list. Maybe we should call this Demon::Sidekiq.child_process_heartbeat_queues?

@SamSaffron (Member)

Did you test with unicorn_sidekiqs set to 2 to see how it properly spawns and monitors? What are the blockers for a merge here? Overall this is a fantastic change!

@OsamaSayegh (Member, Author) commented on Jul 31, 2019

> Did you test with unicorn_sidekiqs set to 2 to see how it properly spawns and monitors? What are the blockers for a merge here? Overall this is a fantastic change!

Thanks! Yes, I did test with UNICORN_SIDEKIQS; I posted details of how I tested this change above: #7873 (comment).

Edit: I'm not aware of any blockers.

@SamSaffron (Member)

I am really liking this change... all I seem to have here are superficial comments. I would like you to merge at the beginning of your day, then watch on meta... then deploy another, bigger cluster and watch it as well. It's critical that this gets merged when you have a few hours to double-check that we are not getting a swamp of "restarting child process cause it died" messages.

@SamSaffron added the "👍 OP to merge" label (PR author can go ahead and merge!) and removed the "⚠ DO NOT MERGE" label on Aug 1, 2019
@SamSaffron (Member)

@OsamaSayegh can you have a look to see if you can merge it this week?

@OsamaSayegh (Member, Author)

@SamSaffron I've amended this PR so that it now keeps track of the special queues using a constant, QUEUE_IDS, rather than a redis list (a rough sketch of the idea is below). Let me know what you think.
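
A minimal sketch of what that might look like (the constant's contents and the queue-name format are assumptions for illustration, not the exact code in this PR):

# Illustrative: derive the heartbeat queue names from a fixed constant known
# at boot, so nothing has to be persisted in (or cleaned out of) redis.
NUM_SIDEKIQS = ENV["UNICORN_SIDEKIQS"].to_i # one special queue per child process

QUEUE_IDS = (0...NUM_SIDEKIQS).map { |i| "sidekiq_#{i}" }

def heartbeat_queue_name(queue_id)
  "#{queue_id}_heartbeat"
end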

@SamSaffron (Member)

I think this looks good, can you merge it Monday?

@OsamaSayegh (Member, Author)

Sure, will do

@OsamaSayegh merged commit 340855d into discourse:master on Aug 26, 2019
@OsamaSayegh deleted the sidekiq-worker-heartbeat branch on Aug 26, 2019 at 06:33
OsamaSayegh added a commit that referenced this pull request on Aug 27, 2019.

SamSaffron added a commit that referenced this pull request on Aug 30, 2019:

    This reverts commit e805d44.
    We now have mechanisms in place to ensure the heartbeat will always
    be scheduled, even if the scheduler is overloaded, per 098f938b.

davidtaylorhq added a commit that referenced this pull request on Aug 30, 2019.
Labels: 👍 OP to merge (PR author can go ahead and merge!)

5 participants