FIX: Heartbeat check per sidekiq process #7873
You've signed the CLA, OsamaSayegh. Thank you! This pull request is ready for review.
Overall this method seems sound. I was a bit worried about having so many queues but there is only one worker per custom queue so it should not be very inefficient.
I only have one minor quibble about the naming of a method.
The one tricky thing is testing. There are a couple of unit tests which is good, but before we merge, how comprehensively have you tested the queues/workers running locally? There is a fair amount that is not tested automatically here.
@eviltrout I tested this change by making sidekiq boot up locally without any queues, passing an empty array at discourse/lib/demon/sidekiq.rb, line 52 in fb2df0b.
That way sidekiq couldn't process the jobs we enqueue in discourse/app/jobs/scheduled/heartbeat.rb, lines 6 to 12 in fb2df0b.
Then I lowered the heartbeat check interval to 5 seconds (so that I don't have to wait 30 minutes) and made the There is still one problem that I'd like to address before merging this: the problem is that every time the master unicorn boots up, 2 unique keys are created in redis ( One way to do that would be to prefix the 2 keys with the server
Keeping track of what keys you've set in the past can be messy, and it wouldn't account for the situation where we retire a host and its previous keys would exist forever. Instead I recommend giving the keys an EXPIRY, which you can bump whenever the heartbeat runs. If the heartbeat runs every 3 minutes for example, you could set it as expiring in 60 minutes or something just to be very careful. After one hour of not being used, the keys will vanish.
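The expiry suggestion above can be sketched roughly as follows. This is not the code from the PR: `HeartbeatKeys`, the `heartbeat:<queue>` key format, and the in-memory `FakeRedis` stand-in are all hypothetical names made up for illustration. The stand-in mimics only `set(key, value, ex:)` and `get(key)`, which is the shape of the redis-rb calls one would likely use against a real server.

```ruby
# In-memory stand-in for a Redis client so the sketch runs without a server.
# It implements only set(key, value, ex:) and get(key).
class FakeRedis
  def initialize(clock)
    @clock = clock
    @data = {} # key => [value, expires_at or nil]
  end

  def set(key, value, ex: nil)
    expires_at = ex ? @clock.call + ex : nil
    @data[key] = [value.to_s, expires_at]
    "OK"
  end

  def get(key)
    value, expires_at = @data[key]
    return nil if value.nil?
    if expires_at && @clock.call >= expires_at
      @data.delete(key) # the key has expired, behave as if it never existed
      return nil
    end
    value
  end
end

# Hypothetical helper: every heartbeat rewrites the key and bumps its TTL,
# so keys belonging to retired hosts disappear on their own.
class HeartbeatKeys
  TTL_SECONDS = 60 * 60 # one hour, the "very careful" value suggested above

  def initialize(redis, clock: -> { Time.now.to_i })
    @redis = redis
    @clock = clock
  end

  def beat(queue)
    @redis.set(key_for(queue), @clock.call, ex: TTL_SECONDS)
  end

  def last_beat(queue)
    value = @redis.get(key_for(queue))
    value && value.to_i
  end

  private

  def key_for(queue)
    "heartbeat:#{queue}" # assumed key format, not taken from the PR
  end
end

# Usage with a controllable clock instead of real time.
now = 1_000
clock = -> { now }
keys = HeartbeatKeys.new(FakeRedis.new(clock), clock: clock)

keys.beat("queue_a")
now += 180                 # three minutes later the heartbeat runs again
keys.beat("queue_a")       # this bumps the expiry
now += 3_599
still_there = keys.last_beat("queue_a") # key is one second short of expiring
now += 1
gone = keys.last_beat("queue_a")        # a full hour without a beat
```

With this approach nothing has to remember which keys were created in the past; a key that stops being refreshed simply ages out.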
config/unicorn.conf.rb (Outdated)
restart = true
if !restarted
Demon::Sidekiq.heartbeat_queues.each do |queue|
hmmm is this checking all heartbeat queues? Should it not just check heartbeats for child sidekiq processes it has?
Demon::Sidekiq.heartbeat_queues returns all sidekiq processes that are spawned by unicorn master and we check their heartbeat one by one here. Is this not what we are supposed to do?
Oh I see... so each process will give a different list here ...
A confusing thing here is that if you call Demon::Sidekiq.heartbeat_queues you will get one list, and if you call this on a forked child process you will get an empty list. Maybe we call this Demon::Sidekiq.child_process_heartbeat_queues?
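To make the per-process check discussed in this thread concrete, here is a minimal sketch of a master process tracking one heartbeat queue per child and reporting the ones that have gone quiet. `ChildHeartbeatMonitor`, its methods, the queue names, and the 1800-second threshold are assumptions for illustration; the PR's actual implementation lives in `Demon::Sidekiq` and `config/unicorn.conf.rb` and may differ.

```ruby
# Hypothetical monitor owned by the master process: one heartbeat queue per
# child sidekiq process, each with the timestamp of its last heartbeat.
class ChildHeartbeatMonitor
  def initialize(max_age_seconds:)
    @max_age_seconds = max_age_seconds
    @last_beats = {} # queue name => Integer timestamp
  end

  # Called when the master spawns a child and assigns it a queue.
  def register(queue, now)
    @last_beats[queue] = now
  end

  # Called when a heartbeat job on the given queue has run.
  # Beats for queues this master does not own are ignored.
  def record_beat(queue, now)
    @last_beats[queue] = now if @last_beats.key?(queue)
  end

  # Queues whose child has not produced a heartbeat recently enough.
  def stale_queues(now)
    @last_beats.select { |_queue, at| now - at > @max_age_seconds }.keys
  end
end

monitor = ChildHeartbeatMonitor.new(max_age_seconds: 1800) # assumed threshold
monitor.register("sidekiq_heartbeat_a", 0)
monitor.register("sidekiq_heartbeat_b", 0)

monitor.record_beat("sidekiq_heartbeat_a", 1700) # child "a" is alive
monitor.record_beat("someone_elses_queue", 1700) # ignored, not our child

stale = monitor.stale_queues(1900) # only child "b" has gone quiet
```

Because the monitor only knows about queues it registered itself, each master sees a different list, which is the behaviour the proposed `child_process_heartbeat_queues` name is meant to signal.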
Did you test with unicorn_sidekiqs set to 2 to see that it properly spawns and monitors? What are the blockers for a merge here? Overall this is a fantastic change!
Thanks! Yes I did test with Edit: I'm not aware of any blockers.
I am really liking this change ... all I seem to have here are superficial comments. I would like you to merge at the beginning of your day, then watch on meta, then deploy to another bigger cluster and watch that as well. It is critical that this gets merged when you have a few hours to double check we are not getting a swamp of "restarting child process cause it died".
@OsamaSayegh can you have a look to see if you can merge it this week?
Co-Authored-By: Régis Hanol <regis@hanol.fr>
@SamSaffron I've amended this PR so now it keeps track of the special queues using a constant
I think this looks good, can you merge it Monday?
Sure, will do
This reverts commit 340855d.
This reverts commit e805d44. We now have mechanisms in place to ensure heartbeat will always be scheduled even if the scheduler is overloaded per: 098f938b
…)""" This reverts commit c349755.