2 workers picking up the same job #658

Open
carstengehling opened this Issue May 12, 2014 · 14 comments

Comments

Projects
None yet
9 participants

Updated: tested with delayed_job 4.0.1 - problem still persists

With the gems:

rails 3.2.13
delayed_job 4.0.0
delayed_job_active_record 4.0.0
pg 0.17.1

and PostgreSQL v. 9.1.11

Running 2 workers, they pick up and process the same job:

2014-05-12T15:04:00+0200: [Worker(delayed_job.1 host:waldseemuller pid:8470)] Job Class#perform (id=56439) RUNNING
2014-05-12T15:04:00+0200: [Worker(delayed_job.0 host:waldseemuller pid:8464)] Job Class#perform (id=56439) RUNNING
2014-05-12T15:04:00+0200: [Worker(delayed_job.0 host:waldseemuller pid:8464)] Job Class#perform (id=56439) COMPLETED after 0.1915
2014-05-12T15:04:00+0200: [Worker(delayed_job.1 host:waldseemuller pid:8470)] Job Class#perform (id=56439) COMPLETED after 0.2749

Is there a way around this?

Owner

albus522 commented May 12, 2014

First try upgrading. If the problem persists, reopen

@albus522 albus522 closed this May 12, 2014

Sorry - I've tried updating to 4.0.1 and the problem still persist.

2014-05-12T16:57:24+0200: [Worker(delayed_job.1 host:carsten-ThinkPad-X1-Carbon pid:24814)] Job Class#perform (id=19361) RUNNING
2014-05-12T16:57:24+0200: [Worker(delayed_job.0 host:carsten-ThinkPad-X1-Carbon pid:24807)] Job Class#perform (id=19361) RUNNING
2014-05-12T16:57:27+0200: [Worker(delayed_job.0 host:carsten-ThinkPad-X1-Carbon pid:24807)] Job Class#perform (id=19361) COMPLETED after 3.6808
2014-05-12T16:57:28+0200: [Worker(delayed_job.1 host:carsten-ThinkPad-X1-Carbon pid:24814)] Job Class#perform (id=19361) COMPLETED after 3.7396

How do I reopen?

Owner

albus522 commented May 12, 2014

What DJ configuration options do you have set?
What is the exact command you are using to run DJ?

@albus522 albus522 reopened this May 12, 2014

From config/initializers/delayed_job.rb:

Delayed::Worker.max_run_time =  24.hours
Delayed::Worker.max_attempts = 1
Delayed::Worker.destroy_failed_jobs = false

The workers are started with:

script/delayed_job start -n 2

I just tried with the latest commits on master branch - same result:

2014-05-12T18:44:24+0200: [Worker(delayed_job.0 host:carsten-ThinkPad-X1-Carbon pid:27915)] Starting job worker
2014-05-12T18:44:24+0200: [Worker(delayed_job.1 host:carsten-ThinkPad-X1-Carbon pid:27922)] Starting job worker
2014-05-12T18:44:26+0200: [Worker(delayed_job.0 host:carsten-ThinkPad-X1-Carbon pid:27915)] Job Class#perform (id=19788) RUNNING
2014-05-12T18:44:26+0200: [Worker(delayed_job.1 host:carsten-ThinkPad-X1-Carbon pid:27922)] Job Class#perform (id=19788) RUNNING
2014-05-12T18:44:28+0200: [Worker(delayed_job.0 host:carsten-ThinkPad-X1-Carbon pid:27915)] Job Class#perform (id=19788) COMPLETED after 1.9184
2014-05-12T18:44:28+0200: [Worker(delayed_job.1 host:carsten-ThinkPad-X1-Carbon pid:27922)] Job Class#perform (id=19788) COMPLETED after 1.9570

2014-05-12T18:44:30+0200: [Worker(delayed_job.0 host:carsten-ThinkPad-X1-Carbon pid:27915)] Job Class#perform (id=19795) RUNNING
2014-05-12T18:44:30+0200: [Worker(delayed_job.1 host:carsten-ThinkPad-X1-Carbon pid:27922)] Job Class#perform (id=19795) RUNNING
2014-05-12T18:44:31+0200: [Worker(delayed_job.0 host:carsten-ThinkPad-X1-Carbon pid:27915)] Job Class#perform (id=19795) COMPLETED after 0.6636
2014-05-12T18:44:31+0200: [Worker(delayed_job.1 host:carsten-ThinkPad-X1-Carbon pid:27922)] Job Class#perform (id=19795) COMPLETED after 0.7026

The job is defined like this:

class AnalysisJob

  def self.perform(analysis_id)
    analysis = Analysis.find(analysis_id)
    if analysis.transport_form == 'private'
      PrivateTransportAnalysisProcess.new(analysis).run
    else
      PublicTransportAnalysisProcess.new(analysis).run
    end
  end

end

The jobs are created like this:

AnalysisJob.delay.perform(id)
Collaborator

betamatt commented Jun 12, 2014

@carstengehling can you create a small sample app that reproduces the issue?

kubbing commented Nov 16, 2014

similar thing happening to me :/

jnimety commented Mar 27, 2015

I have a lead...

Mysql2::Error: Deadlock found when trying to get lock; try restarting transaction: DELETE FROM `delayed_jobs` WHERE `delayed_jobs`.`id` = 12345

Job runs and fails during row delete, another worker eventually picks it up and runs the job again. Noticed this about 10 times while our queue was processing approximately 4000 jobs. We have 6 workers.

jnimety commented Mar 27, 2015

delayed_job (4.0.6) btw

I've added the following index to match the WHERE clause that DJ is using for the row locking and unlocking. This should help with the speed of locking and unlocking rows.

add_index :delayed_jobs, [:run_at, :locked_at, :locked_by, :failed_at], :name => 'index_delayed_jobs_for_better_row_locking'

defeated commented May 2, 2015

@jnimety

Running multiple workers, we're running into this Deadlock error frequently as well when there are a ton of jobs in the queue.

mysql2
delayed_job 4.0.6
daemons 1.2.2

Worker(host:app01 pid:4191)] Job Searchable::SearchDestroyJob (id=15992398) FAILED (0 prior attempts) with ActiveRecord::StatementInvalid: Mysql2::Error: Deadlock found when trying to get lock; try restarting transaction: DELETE FROM `delayed_jobs` WHERE `delayed_jobs`.`id` = 15992398
rake aborted!

ActiveRecord::StatementInvalid: Mysql2::Error: Deadlock found when trying to get lock; try restarting transaction: UPDATE `delayed_jobs` SET `last_error` = 'Mysql2::Error: Deadlock found when trying to get lock; try restarting transaction: DELETE FROM `delayed_jobs` WHERE `delayed_jobs`.`id` = 15992398\n

which is causing the workers to crash and require a restart. (we're using the --monitor option so this happens automatically after some amount of time)

Looks related to collectiveidea/delayed_job_active_record#63

@dsander dsander referenced this issue in huginn/huginn Aug 21, 2015

Closed

Help with running multi workers #977

I observed the same symptoms.

Take a look at:
https://github.com/collectiveidea/delayed_job_active_record/blob/master/lib/delayed/backend/active_record.rb#L90

I am using the activerecord-postgis-adapter.
So for me:
ActiveRecord::Base.connection.adapter_name
=> "PostGIS"

I monkey patched this and have not seen any 2 workers picking up the same job since then.

hth

@sventantau We just ran into this same thing. Thank you for posting this, I would not have found this so soon. I will put a pull request for this.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment