Skip to content

HTTPS clone URL

Subversion checkout URL

You can clone with
or
.
Download ZIP

Loading…

2 workers picking up the same job #658

Open
carstengehling opened this Issue · 12 comments

7 participants

@carstengehling

Updated: tested with delayed_job 4.0.1 - problem still persists

With the gems:

rails 3.2.13
delayed_job 4.0.0
delayed_job_active_record 4.0.0
pg 0.17.1

and PostgreSQL v. 9.1.11

Running 2 workers, they pick up and process the same job:

2014-05-12T15:04:00+0200: [Worker(delayed_job.1 host:waldseemuller pid:8470)] Job Class#perform (id=56439) RUNNING
2014-05-12T15:04:00+0200: [Worker(delayed_job.0 host:waldseemuller pid:8464)] Job Class#perform (id=56439) RUNNING
2014-05-12T15:04:00+0200: [Worker(delayed_job.0 host:waldseemuller pid:8464)] Job Class#perform (id=56439) COMPLETED after 0.1915
2014-05-12T15:04:00+0200: [Worker(delayed_job.1 host:waldseemuller pid:8470)] Job Class#perform (id=56439) COMPLETED after 0.2749

Is there a way around this?

@albus522
Owner

First try upgrading. If the problem persists, reopen

@albus522 albus522 closed this
@carstengehling

Sorry - I've tried updating to 4.0.1 and the problem still persist.

2014-05-12T16:57:24+0200: [Worker(delayed_job.1 host:carsten-ThinkPad-X1-Carbon pid:24814)] Job Class#perform (id=19361) RUNNING
2014-05-12T16:57:24+0200: [Worker(delayed_job.0 host:carsten-ThinkPad-X1-Carbon pid:24807)] Job Class#perform (id=19361) RUNNING
2014-05-12T16:57:27+0200: [Worker(delayed_job.0 host:carsten-ThinkPad-X1-Carbon pid:24807)] Job Class#perform (id=19361) COMPLETED after 3.6808
2014-05-12T16:57:28+0200: [Worker(delayed_job.1 host:carsten-ThinkPad-X1-Carbon pid:24814)] Job Class#perform (id=19361) COMPLETED after 3.7396
@carstengehling

How do I reopen?

@albus522
Owner

What DJ configuration options do you have set?
What is the exact command you are using to run DJ?

@albus522 albus522 reopened this
@carstengehling

From config/initializers/delayed_job.rb:

Delayed::Worker.max_run_time =  24.hours
Delayed::Worker.max_attempts = 1
Delayed::Worker.destroy_failed_jobs = false

The workers are started with:

script/delayed_job start -n 2
@carstengehling

I just tried with the latest commits on master branch - same result:

2014-05-12T18:44:24+0200: [Worker(delayed_job.0 host:carsten-ThinkPad-X1-Carbon pid:27915)] Starting job worker
2014-05-12T18:44:24+0200: [Worker(delayed_job.1 host:carsten-ThinkPad-X1-Carbon pid:27922)] Starting job worker
2014-05-12T18:44:26+0200: [Worker(delayed_job.0 host:carsten-ThinkPad-X1-Carbon pid:27915)] Job Class#perform (id=19788) RUNNING
2014-05-12T18:44:26+0200: [Worker(delayed_job.1 host:carsten-ThinkPad-X1-Carbon pid:27922)] Job Class#perform (id=19788) RUNNING
2014-05-12T18:44:28+0200: [Worker(delayed_job.0 host:carsten-ThinkPad-X1-Carbon pid:27915)] Job Class#perform (id=19788) COMPLETED after 1.9184
2014-05-12T18:44:28+0200: [Worker(delayed_job.1 host:carsten-ThinkPad-X1-Carbon pid:27922)] Job Class#perform (id=19788) COMPLETED after 1.9570

2014-05-12T18:44:30+0200: [Worker(delayed_job.0 host:carsten-ThinkPad-X1-Carbon pid:27915)] Job Class#perform (id=19795) RUNNING
2014-05-12T18:44:30+0200: [Worker(delayed_job.1 host:carsten-ThinkPad-X1-Carbon pid:27922)] Job Class#perform (id=19795) RUNNING
2014-05-12T18:44:31+0200: [Worker(delayed_job.0 host:carsten-ThinkPad-X1-Carbon pid:27915)] Job Class#perform (id=19795) COMPLETED after 0.6636
2014-05-12T18:44:31+0200: [Worker(delayed_job.1 host:carsten-ThinkPad-X1-Carbon pid:27922)] Job Class#perform (id=19795) COMPLETED after 0.7026

The job is defined like this:

class AnalysisJob

  def self.perform(analysis_id)
    analysis = Analysis.find(analysis_id)
    if analysis.transport_form == 'private'
      PrivateTransportAnalysisProcess.new(analysis).run
    else
      PublicTransportAnalysisProcess.new(analysis).run
    end
  end

end

The jobs are created like this:

AnalysisJob.delay.perform(id)
@betamatt
Owner

@carstengehling can you create a small sample app that reproduces the issue?

@kubbing

similar thing happening to me :/

@jnimety

I have a lead...

Mysql2::Error: Deadlock found when trying to get lock; try restarting transaction: DELETE FROM `delayed_jobs` WHERE `delayed_jobs`.`id` = 12345

Job runs and fails during row delete, another worker eventually picks it up and runs the job again. Noticed this about 10 times while our queue was processing approximately 4000 jobs. We have 6 workers.

@jnimety

delayed_job (4.0.6) btw

@jhirbour

I've added the following index to match the WHERE clause that DJ is using for the row locking and unlocking. This should help with the speed of locking and unlocking rows.

add_index :delayed_jobs, [:run_at, :locked_at, :locked_by, :failed_at], :name => 'index_delayed_jobs_for_better_row_locking'
@defeated

@jnimety

Running multiple workers, we're running into this Deadlock error frequently as well when there are a ton of jobs in the queue.

mysql2
delayed_job 4.0.6
daemons 1.2.2

Worker(host:app01 pid:4191)] Job Searchable::SearchDestroyJob (id=15992398) FAILED (0 prior attempts) with ActiveRecord::StatementInvalid: Mysql2::Error: Deadlock found when trying to get lock; try restarting transaction: DELETE FROM `delayed_jobs` WHERE `delayed_jobs`.`id` = 15992398
rake aborted!

ActiveRecord::StatementInvalid: Mysql2::Error: Deadlock found when trying to get lock; try restarting transaction: UPDATE `delayed_jobs` SET `last_error` = 'Mysql2::Error: Deadlock found when trying to get lock; try restarting transaction: DELETE FROM `delayed_jobs` WHERE `delayed_jobs`.`id` = 15992398\n

which is causing the workers to crash and require a restart. (we're using the --monitor option so this happens automatically after some amount of time)

Looks related to collectiveidea/delayed_job_active_record#63

@dsander dsander referenced this issue in cantino/huginn
Open

Help with running multi workers #977

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Something went wrong with that request. Please try again.