Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

features to pause batch execution #1323

Draft
wants to merge 21 commits into
base: main
Choose a base branch
from

Conversation

pgvsalamander
Copy link
Contributor

@pgvsalamander pgvsalamander commented Apr 12, 2024

Implements #1319

Pausing jobs works as far as enqueueing/runners, unpausing does work but is done by hand. Opening this now so I have the diff to look at and to make it easy to discuss implementation.

job = GoodJob::Job.find( OtherJob.set(good_job_paused: true).perform_later(1234).job_id )
job.serialized_params.merge!('good_job_paused' => false)
job.update(scheduled_at: job.serialized_params['scheduled_at'] || Time.current)
GoodJob::Notifier.notify({ queue_name: job.queue_name, count: 1 })

@bensheldon
Copy link
Owner

bensheldon commented Apr 12, 2024

I think this looks good so far. I think some Readme-driven-development might be in order here.

What do you think about an interface like this:

my_job = MyJob.set(good_job_paused: true).perform_later(1234)
my_job.good_job_unpause

# or
job_id = my_job.job_id
MyJob.good_job_unpause(job_id) # or an array of IDs, or an array of ActiveJob::Base instances

A GoodJob::Batch is the most "public" internal API that GoodJob offers, so that might be a good target to implement #pause, #unpause and #paused? methods. I guess the semantics of a batch that is paused is if any jobs in that batch are paused.

Another thing we'll want to think about is under what job-states a job could be paused (after job creation). Those get a little messy because we'll probably want to take an advisory lock (to prevent pausing a job that is actively executing). But we can defer that too if we just want to focus on creating jobs at create too (we don't have to land everything all at once)

@pgvsalamander
Copy link
Contributor Author

pgvsalamander commented Apr 13, 2024

The snippet in my original post was just a proof of concept, I started with pausing/unpausing single jobs just to get a hand on the mechanics. I'm of the same mind about using the batch for the pause/unpause methods.

I'm keeping the 'pausing an arbitrary batch/jobs' feature in the back of my mind while I'm working, but for the first pass I'm just targeting pausing a batch at creation and adding jobs to it:

I've only skimmed the batch code, my understanding is that the options are automatically distributed to all jobs. If they aren't, then I'd just use paused: for the option as the prefix is a bit redundant in this context

batch = GoodJob::Batch.new(paused: true)
batch.save
batch.add do
  10.times { MyJob.perform_later(1234) } # All jobs added to the batch at this point are paused
end
batch.enqueue # can be called before or after #unpause
batch.unpause # Jobs will now start running

@bensheldon
Copy link
Owner

bensheldon commented Apr 13, 2024

These don't get distributed as options to a job, they are "properties" of the batch. So this won't work:

batch = GoodJob::Batch.new(good_job_paused: true)

@pgvsalamander
Copy link
Contributor Author

OK! I still need to document everything and add/update tests (I think the failures are caused by seed data containing NULL values), but I'd appreciate a once-over before that in case there's anything you'd like changed.

def paused?
# TODO: consider querying to see if any jobs within the batch are paused, and if/how that should be represented if that result does not match properties[:paused]
# I think there are probably going to need to be separate methods for "is the batch set to pause all jobs within it" and "does the batch contain any paused jobs" as those cases aren't always lined up
properties[:paused] || false
Copy link
Owner

@bensheldon bensheldon Apr 19, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't really want to support (at least right now, initially) having a special property of a batch ("paused") have a side effect or meaning to the library; I consider them arbitrary application data.

I realize it makes it a little less optimal to say: "To create a paused batch, enqueue any job that is paused" but I think that's a fine workaround for now. I sort of see the order here being:

  1. Design and implement the functionality to allow jobs to be paused/unpaused
    • I think you have all the changes for allowing scheduled_at to be nil
    • I think we need to handle the unpausing interface, especially in the dashboard
  2. Handle the situation in which a batch has paused jobs in it

And honestly, other than that, this looks great. I can help polish stuff up if whenever you're ok with where things are at.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Would you be open to storing a batch's paused state in a separate column/hash/etc to avoid the interference, or would you prefer to remove that entirely?

re: unpausing interface

# public API
j = OtherJob.set(good_job_paused: true).perform_later
j.good_job_unpause # calls unpause(self) on the current adapter

# public API (maybe?) I put this on the adapter by default, not sure where else to put it
GoodJob::Adapter#unpause(jobs_or_ids) # The 'real' method for unpausing arbitrary jobs, loads GoodJob::Job records, calls unpause on individual records

# internal API, basically just what Batch#unpause is doing
r = GoodJob::Job.find(my_job_id)
r.paused?
r.unpause # probably need to provide some option to disable the notifier within the method so the caller can batch notifications

What functionality did you have in mind for the dashboard? I could see a use for unpausing jobs by hand if we also had the capability to pause jobs by hand, but if they're only paused programmatically I don't see as much benefit.

Copy link
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Would you be open to storing a batch's paused state in a separate column/hash/etc to avoid the interference, or would you prefer to remove that entirely?

I'd like to defer that. So not one way or another in perpetuity, but for right now it looks like we have a path to allow for a pausing feature without changing any existing interfaces or database columns. This is mostly about my personal ability to handle complexity, not the technical stuff.

What functionality did you have in mind for the dashboard? I could see a use for unpausing jobs by hand if we also had the capability to pause jobs by hand, but if they're only paused programmatically I don't see as much benefit.

We'd need to show that the jobs are in a paused state on the dashboard, to answer the question of "why isn't this job running?" And it would be nice to also add options (simple ones) to pause and unpause through the UI.

On the programmitic side, I think the pause/unpause methods should be class methods on the job. e.g. OtherJob.unpause(jobs_or_ids)

@pgvsalamander
Copy link
Contributor Author

I noticed there are some situations where jobs will be enqueued with scheduled_at: nil, particularly using GoodJob::Execution.enqueue() and GoodJob::Adapter#enqueue_at(). I recall you mentioning that GoodJob has been setting scheduled_at: Time.current by default since a very old version. Is the intent that setting those defaults occurs at higher levels and the enqueue methods should accept whatever is handed to them, or have I bumped into an edge case that hasn't mattered previously?

@bensheldon
Copy link
Owner

Is the intent that setting those defaults occurs at higher levels and the enqueue methods should accept whatever is handed to them, or have I bumped into an edge case that hasn't mattered previously?

I don't consider those public methods and I would expect that job enqueues would go through the Adapter. Though you bring up a good point that I should doublecheck that scheduled_at does get set on everything (especially bulk enqueue).

Though reading the code now, maybe I'm wrong and it's changed. Though I do think it's intended given breadcrumbs like this:

def make_discrete
self.is_discrete = true
self.id = active_job_id
self.job_class = serialized_params['job_class']
self.executions_count ||= 0
current_time = Time.current
self.created_at ||= current_time
self.scheduled_at ||= current_time
end

We should do a separate PR if there are places that we discover are not reliably setting scheduled_at

@pgvsalamander
Copy link
Contributor Author

It looks like GoodJob::Adapter#enqueue() sets scheduled_at: nil by default and enqueue_at() tolerates nil values as well, I'll extract the updates I made here and submit them separately.

@pgvsalamander pgvsalamander marked this pull request as draft April 22, 2024 13:11
@pgvsalamander pgvsalamander changed the title WIP: features to pause batch execution features to pause batch execution Apr 22, 2024
@pgvsalamander
Copy link
Contributor Author

There are a handful of tests that assert that scheduled_at and created_at have the exact same or nearly the same value. Do you recall if those tests needed to be that strict, or if it was just a way of ensuring that scheduled_at was set to a sane value? Some of the updates to always set scheduled_at are setting the value sooner than before, leading to test failures because the values are a few milliseconds off where they were before.

As far as my understanding goes I don't think this should cause any issues, but a spot-check to make sure my understanding is right would be appreciated.

@bensheldon
Copy link
Owner

fyi, I did some exploratory testing here: #1332

The takeaway is that this feature might have to hold until I get a major release for GoodJob done, because if someone hasn't migrated to the new form of jobs (there is a temporary column called "is_discrete" to denote that migration), the scheduled_at value will be blank. It's not my favorite to defer for a major release, but it seems like time to do a major release anyway (it'll take me ~2 weeks from when I start) #764

There are a handful of tests that assert that scheduled_at and created_at have the exact same or nearly the same value.

Yes, if the job has been enqueued without a scheduled time, the scheduled_at and created_at should have the exact same value. From my exploratory testing, it seems like that maybe only happens exactly the same when the jobs have been migrated to Discrete, so other tests may have assertions from older behavior. I would expect a test I wrote today to either be like within(1.second).of(Time.current) for immediate jobs, and something like within(1.second).of(10.minutes.from_now) for something that's testing scheduled jobs

@pgvsalamander
Copy link
Contributor Author

OK, should I skip the "always set scheduled_at" PR then? Looking over things again it looks like most if not all of the places I set scheduled_at coincide with is_discrete logic.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
Status: Inbox
Development

Successfully merging this pull request may close these issues.

2 participants