Mysql2::Error: Deadlock when attempting to lock a job #63

Open
gorism opened this Issue · 106 comments
@gorism

We've just upgraded from 0.3.3 to 0.4.4, to resolve the race condition problem when running multiple workers. We're now seeing occasional MySQL deadlocks, which are unhandled exceptions and end up killing the worker.

For example:

svc1 Jun  5 12:35:17  Mysql2::Error: Deadlock found when trying to get lock; try restarting transaction: UPDATE `delayed_jobs` SET `locked_at` = '2013-06-05 12:35:16', `locked_by` = 'delayed_job.2 host:svc1 pid:20498' WHERE `delayed_jobs`.`queue` IN ('batch') AND ((run_at <= '2013-06-05 12:35:16' AND (locked_at IS NULL OR locked_at < '2013-06-05 08:35:16') OR locked_by = 'delayed_job.2 host:svc1 pid:20498') AND failed_at IS NULL) ORDER BY priority ASC, run_at ASC LIMIT 1

It would seem that, at the very least, this should be handled.

In case it matters, we're using delayed_job 3.0.4.

@jamsi

I get this also with: 4.0.0.beta2

@piotrb

+1 ..

Seems to happen to us when creating a delayed job from within another running job; not all the time of course, just very intermittently.

@philister

+1
delayed_job (3.0.5)
delayed_job_active_record (0.4.4)

@settinghead

Same here.
delayed_job (3.0.5)
delayed_job_active_record (0.4.4)

@cheneveld

+1

delayed_job (3.0.5)
activesupport (~> 3.0)
delayed_job_active_record (0.4.4)
activerecord (>= 2.1.0, < 4)
delayed_job (~> 3.0)

mysql(5.5.27)

@philister

Can we help with anything to solve this? Does anyone have a patch or hints? We are restarting the workers every 10 minutes because of this.
Thanks and regards, Ph.

@gorism

@philister We ended up forking and changing the logic to use a row level lock. I suspect the original implementation went to lengths to avoid using a lock, but in the end I'll accept a performance penalty for more reliability. You can find our fork here: https://github.com/doxo/delayed_job_active_record

Note that we do still get occasional deadlocks which result in restarts, but they are rare in comparison to what we were getting before. It would be even better if the gem could detect the deadlock and retry, but getting the root cause out of the exception raised by the MySQL adapter is not straightforward, and possibly version dependent. So I opted to live with the occasional deadlocks for now.
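
For the curious, the row-level-lock reservation is conceptually along these lines. This is a simplified sketch, not the exact code in our fork; reserve_with_row_lock is a made-up name, while ready_to_run and db_time_now follow the gem's existing helpers:

# Sketch of reserving a job with a row-level lock (SELECT ... FOR UPDATE).
# Illustrative only; not the exact code from the doxo fork.
def self.reserve_with_row_lock(worker, max_run_time = Worker.max_run_time)
  transaction do
    job = ready_to_run(worker.name, max_run_time)
            .order("priority ASC, run_at ASC")
            .lock(true)   # appends FOR UPDATE to the SELECT
            .first
    if job
      job.update_attributes!(:locked_at => db_time_now, :locked_by => worker.name)
    end
    job
  end
end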

@philister

@gorism Thanks a lot!

@philister philister referenced this issue in collectiveidea/delayed_job
Closed

DJ dies if an exception happens while reserving the job #536

@zxiest

@gorism great fix! Thanks! -- The original dj's workers were dying after deadlock. I'm happy to see my workers alive after some time.

@cainlevy

edit: see followup at #63 (comment). the new queries may not be a net gain, even without deadlocks.

We were bitten by this as well during an upgrade.

I set up a concurrency test on my dev machine using two workers and a self-replicating job. The job simply creates two more of itself, with semi-random priority and run_at values. This setup reliably repros the deadlock within seconds.
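
The self-replicating job was something along these lines (illustrative only; the class name and the ranges are made up):

# Illustrative version of the self-replicating job used in the concurrency test.
class SelfReplicatingJob
  def perform
    2.times do
      Delayed::Job.enqueue SelfReplicatingJob.new,
                           :priority => rand(10),           # semi-random priority
                           :run_at   => Time.now + rand(30) # semi-random run_at (seconds)
    end
  end
end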

The output of show engine innodb status says the contention is between the UPDATE query now used to reserve a job, and the DELETE query used to clean up a finished job. Surprising! Apparently the DELETE query first acquires a lock on the primary index (id) and then on the secondary index (priority, run_at). But the UPDATE query is using the (priority, run_at) index to scan the table, and is trying to grab primary key locks as it goes. Eventually the UPDATE and DELETE queries each grab one of two locks for a given row, try to acquire the other, and :boom:. MySQL resolves by killing the UPDATE, which crashes the worker.

The fix I've worked out locally is to replace the index on (priority, run_at) with an index on (priority, run_at, locked_by). This completely stabilizes my concurrency test! My theory is that it allows the UPDATE query's scan to skip over rows held by workers, which takes it out of contention with the DELETE query.
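
In migration form, the index change is roughly the following (assuming the default delayed_jobs_priority index name):

class WidenDelayedJobsPriorityIndex < ActiveRecord::Migration
  def up
    remove_index :delayed_jobs, :name => :delayed_jobs_priority
    add_index :delayed_jobs, [:priority, :run_at, :locked_by], :name => :delayed_jobs_priority
  end

  def down
    remove_index :delayed_jobs, :name => :delayed_jobs_priority
    add_index :delayed_jobs, [:priority, :run_at], :name => :delayed_jobs_priority
  end
end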

Hope this helps.

:lock::lock:

@cheneveld

+20

Thanks a bunch cainlevy. This did the trick. Great work.

@philister

Finally this did it for us, too. Thanks @cainlevy. (Other solutions didn't work in our case.)

@ngan

Upgraded from 0.3.3 to 0.4.4 and we're experiencing deadlock issues as well. We're running a total of 10 workers and usually they manage to keep the job queue down to 1000 or so. On mysql 5.1. @cainlevy's solution didn't work for us :-(.

------------------------
LATEST DETECTED DEADLOCK
------------------------
130811  3:19:18
*** (1) TRANSACTION:
TRANSACTION 2607F0BE7C, ACTIVE 0 sec, process no 7937, OS thread id 1364158784 fetching rows
mysql tables in use 1, locked 1
LOCK WAIT 5 lock struct(s), heap size 1248, 21 row lock(s)
MySQL thread id 2166505, query id 64381984645 init
UPDATE `delayed_jobs` SET `locked_at` = '2013-08-11 03:19:18', `locked_by` = 'delayed_job host:app3 pid:26732' WHERE ((run_at <= '2013-08-11 03:19:18' AND (locked_at IS NULL OR locked_at < '2013-08-10 23:19:18') OR locked_by = 'delayed_job host:app3 pid:26732') AND failed_at IS NULL) AND (priority <= '0') ORDER BY priority ASC, run_at ASC LIMIT 1
*** (1) WAITING FOR THIS LOCK TO BE GRANTED:
RECORD LOCKS space id 1853 page no 89833 n bits 80 index `PRIMARY` of table `production`.`delayed_jobs` trx id 2607F0BE7C lock_mode X locks rec but not gap waiting
Record lock, heap no 10 PHYSICAL RECORD: n_fields 14; compact format; info bits 32
 0: len 4; hex a1dd92ed; asc     ;;
 1: len 6; hex 002607f0c031; asc  &   1;;
 2: len 7; hex 000000549014ba; asc    T   ;;
 3: len 4; hex 80000000; asc     ;;
 4: len 4; hex 80000000; asc     ;;
 5: len 30; hex 2d2d2d2021727562792f6f626a6563743a44656c617965643a3a50657266; asc --- !ruby/object:Delayed::Perf; (total 1016 bytes);
 6: SQL NULL;
 7: len 8; hex 8000124f11d7316a; asc    O  1j;;
 8: len 8; hex 8000124f11d7316e; asc    O  1n;;
 9: SQL NULL;
 10: len 30; hex 64656c617965645f6a6f6220686f73743a61707030207069643a32333832; asc delayed_job host:app0 pid:2382; (total 31 bytes);
 11: SQL NULL;
 12: len 8; hex 8000124f11d7316a; asc    O  1j;;
 13: len 8; hex 8000124f11d7316a; asc    O  1j;;

*** (2) TRANSACTION:
TRANSACTION 2607F0C031, ACTIVE 0 sec, process no 7937, OS thread id 1363892544 updating or deleting
mysql tables in use 1, locked 1
3 lock struct(s), heap size 1248, 2 row lock(s), undo log entries 1
MySQL thread id 2166501, query id 64381985155 updating
DELETE FROM `delayed_jobs` WHERE `delayed_jobs`.`id` = 568169197
*** (2) HOLDS THE LOCK(S):
RECORD LOCKS space id 1853 page no 89833 n bits 80 index `PRIMARY` of table `production`.`delayed_jobs` trx id 2607F0C031 lock_mode X locks rec but not gap
Record lock, heap no 10 PHYSICAL RECORD: n_fields 14; compact format; info bits 32
 0: len 4; hex a1dd92ed; asc     ;;
 1: len 6; hex 002607f0c031; asc  &   1;;
 2: len 7; hex 000000549014ba; asc    T   ;;
 3: len 4; hex 80000000; asc     ;;
 4: len 4; hex 80000000; asc     ;;
 5: len 30; hex 2d2d2d2021727562792f6f626a6563743a44656c617965643a3a50657266; asc --- !ruby/object:Delayed::Perf; (total 1016 bytes);
 6: SQL NULL;
 7: len 8; hex 8000124f11d7316a; asc    O  1j;;
 8: len 8; hex 8000124f11d7316e; asc    O  1n;;
 9: SQL NULL;
 10: len 30; hex 64656c617965645f6a6f6220686f73743a61707030207069643a32333832; asc delayed_job host:app0 pid:2382; (total 31 bytes);
 11: SQL NULL;
 12: len 8; hex 8000124f11d7316a; asc    O  1j;;
 13: len 8; hex 8000124f11d7316a; asc    O  1j;;

*** (2) WAITING FOR THIS LOCK TO BE GRANTED:
RECORD LOCKS space id 1853 page no 33 n bits 648 index `delayed_jobs_priority` of table `production`.`delayed_jobs` trx id 2607F0C031 lock_mode X locks rec but not gap waiting
Record lock, heap no 406 PHYSICAL RECORD: n_fields 4; compact format; info bits 0
 0: len 4; hex 80000000; asc     ;;
 1: len 8; hex 8000124f11d7316a; asc    O  1j;;
 2: len 30; hex 64656c617965645f6a6f6220686f73743a61707030207069643a32333832; asc delayed_job host:app0 pid:2382; (total 31 bytes);
 3: len 4; hex a1dd92ed; asc     ;;

*** WE ROLL BACK TRANSACTION (2)
mysql> show indexes from delayed_jobs;
+--------------+------------+-----------------------+--------------+-------------+-----------+-------------+----------+--------+------+------------+---------+
| Table        | Non_unique | Key_name              | Seq_in_index | Column_name | Collation | Cardinality | Sub_part | Packed | Null | Index_type | Comment |
+--------------+------------+-----------------------+--------------+-------------+-----------+-------------+----------+--------+------+------------+---------+
| delayed_jobs |          0 | PRIMARY               |            1 | id          | A         |        2873 |     NULL | NULL   |      | BTREE      |         |
| delayed_jobs |          1 | delayed_jobs_locking  |            1 | locked_at   | A         |           1 |     NULL | NULL   | YES  | BTREE      |         |
| delayed_jobs |          1 | delayed_jobs_locking  |            2 | locked_by   | A         |           1 |     NULL | NULL   | YES  | BTREE      |         |
| delayed_jobs |          1 | delayed_jobs_priority |            1 | priority    | A         |           1 |     NULL | NULL   | YES  | BTREE      |         |
| delayed_jobs |          1 | delayed_jobs_priority |            2 | run_at      | A         |        1436 |     NULL | NULL   | YES  | BTREE      |         |
| delayed_jobs |          1 | delayed_jobs_priority |            3 | locked_by   | A         |        1436 |     NULL | NULL   | YES  | BTREE      |         |
+--------------+------------+-----------------------+--------------+-------------+-----------+-------------+----------+--------+------+------------+---------+
6 rows in set (0.00 sec)
@cainlevy

What do you get from an explain on a query with the same WHERE and ORDER clauses?

EXPLAIN SELECT * FROM delayed_jobs WHERE ((run_at <= '2013-08-11 03:19:18' AND (locked_at IS NULL OR locked_at < '2013-08-10 23:19:18') OR locked_by = 'delayed_job host:app3 pid:26732') AND failed_at IS NULL) AND (priority <= '0') ORDER BY priority ASC, run_at ASC LIMIT 1
@ngan
mysql> EXPLAIN SELECT * FROM delayed_jobs WHERE ((run_at <= '2013-08-11 03:19:18' AND (locked_at IS NULL OR locked_at < '2013-08-10 23:19:18') OR locked_by = 'delayed_job host:app3 pid:26732') AND failed_at IS NULL) AND (priority <= '0') ORDER BY priority ASC, run_at ASC LIMIT 1;
+----+-------------+--------------+-------+--------------------------------------------+-----------------------+---------+------+------+-------------+
| id | select_type | table        | type  | possible_keys                              | key                   | key_len | ref  | rows | Extra       |
+----+-------------+--------------+-------+--------------------------------------------+-----------------------+---------+------+------+-------------+
|  1 | SIMPLE      | delayed_jobs | range | delayed_jobs_locking,delayed_jobs_priority | delayed_jobs_priority | 5       | NULL |  589 | Using where |
+----+-------------+--------------+-------+--------------------------------------------+-----------------------+---------+------+------+-------------+
1 row in set (0.00 sec)
@cainlevy

I don't think we have the delayed_jobs_locking index on our table, but I guess it's not being used? The only other thing I can think of is that maybe MySQL 5.1's query planner isn't taking advantage of the updated delayed_jobs_priority index.

Hopefully you can reproduce the deadlock safely with a handful of workers running a self-replicating job. Let us know if you find anything more!

@ngan

I added the delayed_jobs_locking index for pulling other metrics. It would be nice if DJ-AR allowed us to specify the strategy for claiming jobs, because then I could use the old strategy and still get the other upgraded features.

@romanlehnert

Adding locked_by to the index (as described by @cainlevy) has the side effect for me that only a few (2-5) of my 10 workers are working at the same time. Maybe this affects the performance of the jobs table?

@t-anjan

Thanks @cainlevy. Adding locked_by to the index seems to have solved the problem for me as well. I did it yesterday, amidst the chaos of these deadlock errors being thrown all over the place in a queue of 100k jobs. As soon as this change was made, those deadlock issues were gone.

@ValMilkevich

+1 Thanks @cainlevy, works great.

@ngan

Is everyone for whom this is working on MySQL 5.5+?

@michaelglass

@ngan @cainlevy's fix doesn't help me on 5.5

@t-anjan

@ngan - I am on MySQL 5.5.31 and @cainlevy's fix DOES help me.

@cainlevy

MySQL 5.5.x

If my fix isn't working for someone, I'd recommend following the steps I described earlier to set up a repro someplace safe to experiment. It's the only way we're going to learn more.

@michaelglass

I'll publish my test setup in a moment. In the meantime: could not repro on 5.6.13.

@michaelglass

My experience with this patch: it does stop the deadlocks, but performance of my DJ doesn't improve after the deadlock is removed.

However, removing the MySQL optimizations makes performance jump from about 1 job/second/worker to about 10 jobs/second/worker. Seeing if upgrading to 5.6 will help one way or the other.

@zxiest

I had originally thought some of the fixes above fixed my deadlock issues but apparently I still got them. I think using multiple processes will cause race conditions with the db so I built a gem that is thread based: a manager thread pulls jobs from the db and distributes them to a threadpool of workers. I would totally welcome contributors to http://github.com/zxiest/delayed_job_active_record_threaded

@cainlevy

@zxiest That sounds like a good way to make DB polling more efficient, but if you're still having trouble with deadlocks, it'll probably come back when you run multiple machines.

@cainlevy

@michaelglass I presume you're measuring performance against a previous version of the gem? Are you measuring real-world throughput, or using an empty job?

@zxiest

@cainlevy yup, this might be an issue. However, if needed in the future, I can achieve mutual exclusivity by processing different queues on different servers at first, and if the queues get too large, I could use job_id % servers_number == server_id. I would still use protective locks for safety but avoid locking as much as possible =)

@michaelglass

@cainlevy thanks so much for being such a boss!

Here's my story:

  1. Saw a lot of deadlocks and slow queries in my log. Pushed your fix and queries were no longer reported to the slow query log / no new deadlocks in show innodb status; I thought that everything was better!
  2. Tested DJ throughput against my live web app first by counting rows processed. Ensured that this cap script was reporting equivalent numbers.
  3. With and without the fix, performance was about 1 job/s.
  4. Forked and removed the MySQL-specific logic. Performance jumped to about 80 jobs/s (with 25 workers). Throwing more workers at the queue didn't improve performance.

In isolation, I created this test harness and ran it in a couple of parallel irbs against default Homebrew MySQL (5.6.13) and the current version in the Percona apt repo (5.5.33-31.1). Couldn't repro the slowness there.

Haven't run against 5.5.31-30.3-log which is what I'm running in production. Testing that soon.

@cainlevy

@michaelglass So we eventually ran into performance problems as well. Seems like once something backed up the table enough, performance tanked. Limiting workers to specific priorities helped a little bit, but we couldn't fully catch up without switching back to the select/update query pattern.

@michaelglass

@cainlevy yup. Was my experience. I think we should revert those "performance" fixes, or at least revert them for 5.5+?

@cainlevy

@michaelglass I think so, yeah. Did you ever find out what happens under 5.6?

I forgot until afterwards, but we had some performance data that nicely showed the effects of the new reserve. We deployed it to this app on Sept 23, then ran into problems and reverted on Oct 1.

[graph: selects performance]

[graph: updates performance]

Green is mean of the 90th percentile, red is the upper bound of the 90th percentile, and purple is the total upper bound.

@michaelglass

looks like I don't see the same performance issues with Mysql 5.6

@michaelglass

lies. same problem with percona mysql 5.6. awful.

@PavelTyk

+1 Thanks @cainlevy, works great.

@joshuaxls

@cainlevy You're the man. Thanks for saving me a bunch of time. Index fixed it all up.

@cainlevy

keep in mind that even if the deadlocks stop, overall table performance might be down!

@michaelglass

@cainlevy might be worth editing the comment that everybody is reading and adding that caveat.

@aaronjensen

we're being bit by this too. What's the latest idea on this? is #75 the best fix?

@cainlevy

@aaronjensen basically. we ended up monkey-patching the Delayed::Backend::ActiveRecord::Job.reserve method so we could still get other library updates.

@aaronjensen

@cainlevy the reason that there are performance issues with your suggestion is that it causes the UPDATE to not use an index at all. MySQL will apparently not use an index for an UPDATE unless it matches the sort criteria exactly. I don't know why, but it seems to be the case. So the index replacement is more or less the same as deleting the original index outright.

@michaelglass

@aaronjensen my experience is that performance is actually slower than with no index. Is my memory not serving me?

@aaronjensen

@michaelglass well, on 5.6.15 (which is capable of EXPLAIN UPDATE, unlike 5.5.x it seems) it showed that no index was used for that UPDATE statement with the new index, but the old index was used if it was there. Using indexes on updates for ORDER BY is something fairly new to MySQL, believe it or not; it was added in 5.5.6 I think, so it doesn't surprise me that there are bugs/inconsistencies, and it wouldn't surprise me if you happened to have a version that it did work on :)

@michaelglass

@aaronjensen to respond to your comment from way back: We're using #75 in prod with great success.

@mfenner mfenner referenced this issue in articlemetrics/lagotto
Closed

MySQL deadlock conditions can kill delayed_job workers #38

@joshuapinter

:thumbsup:
@cainlevy's Index Fix works on MySQL 5.5.31.

@aaronjensen

@cainlevy's Index Fix works on MySQL 5.5.31.

I'd be careful with that fix. It's not really a fix; it seems to just cause MySQL to stop using an index at all (the new index doesn't seem to be used), which means your performance will get worse over time.

@joshuapinter

@aaronjensen You're not kidding, I woke up this morning to find only one worker running a task at any given time... So, I'm a little confused, have we found a fix for this, yet? What's the best advice?

@cainlevy

I ended up monkey-patching my Job.reserve method back to the original select+update strategy.

@joshuapinter

@cainlevy Any chance you have a fork I can pull from?

@cainlevy

I opted to patch the method so I wouldn't need to maintain a fork. If you're concerned about the API changing in a version update, I'd recommend a version check + exception.

https://gist.github.com/cainlevy/c6cfa67d44fe7427dea6

Really, this whole topic probably warrants a pull request with some kind of swappable strategy class.
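
The version check + exception mentioned above can be as simple as this (a hypothetical guard; the pinned version string is just an example of whatever you've actually verified against):

# Guard at the top of the monkey-patch file: fail loudly if the gem version
# changes before the patch has been re-verified.
expected = "4.0.0"
actual   = Gem.loaded_specs["delayed_job_active_record"].version.to_s
unless actual == expected
  raise "delayed_job_active_record #{actual} loaded; Job.reserve patch was verified against #{expected}"
end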

@aaronjensen aaronjensen referenced this issue from a commit in aaronjensen/delayed_job_active_record
@aaronjensen aaronjensen Retry transactions when deadlocked
Fixes #63
94108b7
@joshuapinter

That's awesome, @cainlevy. Thanks for sharing.

I agree with you. For now, I'll put this patch into production and see how she performs.

@aaronjensen

@joshuapinter I just opened a pull request that fixes it in a different way. It's untested, but it should do the trick. Maybe you could give it a shot? #90

@joshuapinter

@aaronjensen Looks straightforward enough. Seemed to work fine in development. Giving it a shot right now in staging. I'll let you know how it goes.

@aaronjensen aaronjensen referenced this issue from a commit in aaronjensen/delayed_job_active_record
@aaronjensen aaronjensen Retry delete when deadlocked
Fixes #63
2abdaf9
@aaronjensen

@joshuapinter it looks like #90 did some unnecessary stuff. #91 just retries deletes since delayed_job handles reservation retries.

@joshuapinter

@aaronjensen Have been using #91 in production for the last few hours. Seems to be working great. Thanks!

@joshuapinter

@aaronjensen Spoke too soon, getting the following error in delayed/backend/active_record.rb:113:

uninitialized constant Delayed::Backend::ActiveRecord::StatementInvalid
@aaronjensen aaronjensen referenced this issue from a commit in aaronjensen/delayed_job_active_record
@aaronjensen aaronjensen Retry delete when deadlocked
Fixes #63
cc268de
@aaronjensen

ah, thanks @joshuapinter we hadn't had a deadlock yet. I just updated, that should fix it.

@cainlevy

hmm, rescuing and retrying deadlocks seems like a bandaid. also, the delete query is only one side of the deadlock, and i'd expect both sides to fail sooner or later.

i'm not sure that the new reserve strategy is worth shoring up. maybe it depends on server version, but i posted a couple graphs above that show it actually performed worse in our system.

@aaronjensen

hmm, rescuing and retrying deadlocks seems like a bandaid.

I disagree, my understanding is that this is a common way of handling deadlocks, especially when you cannot control lock ordering easily as is the case in this.

also, the delete query is only one side of the deadlock, and i'd expect both sides to fail sooner or later.

The other side's failure is already handled here: https://github.com/collectiveidea/delayed_job/blob/master/lib/delayed/worker.rb#L281

i'm not sure that the new reserve strategy is worth shoring up. maybe it depends on server version, but i posted a couple graphs above that show it actually performed worse in our system.

Weren't those graphs when you were running with modified indexes? I could not get the modified indexes to be used at all, mysql refused to use them regardless of version, so it was the same as running without any indexes. I don't have data that says that this method is better (yet... our next big run will be on monday night so I'll have more data then with #91).

@michaelglass

just to reiterate: switching to Resque, if that's an option, works pretty well

@zxiest

I see a lot of people are still having issues. I have been using delayed_job_active_record_threaded (which I developed) on production, handling queues of 1M+ size, for the past year without any issues. It doesn't support multiple servers yet but otherwise, if you need to stick to a db solution, give it a shot =)

@cainlevy

@aaronjensen True, it's possible that our version of MySQL wasn't using the new index on the UPDATE...LIMIT query. Unfortunately it's only 5.5.x so I had to use EXPLAIN SELECT to try and get anything back, which would've missed any differences from the actual UPDATE version. And as I understand, your evidence was from 5.6.x, which would've missed any version differences.

I'd be reluctant to depend on versions though, and if the new strategy combined with deadlock retries is more performant than the old select and update, then cool!

@t-anjan

This seems to be a pretty serious issue. I too am seeing the problem of not all workers running together. Currently, I have added
gem 'delayed_job_active_record', :github => 'aaronjensen/delayed_job_active_record', :branch => 'retry-deadlock-on-delete'
to my Gemfile, and I reset the index changes suggested earlier in this issue. But it seems like the "all workers not running" problem still exists.

What are people doing at the moment to overcome this issue?

@aaronjensen

@t-anjan I'm not sure you're looking at the right issue; this issue is about a deadlock. I'm not sure what "all workers not running" is, so I doubt that the deadlock retry branch would help you.

@sg552

I am getting this problem all the time. all my workers will die in an hour when processing 1k jobs in queue.

gem 'delayed_job_active_record', '4.0.0'
gem 'mysql2', '0.3.14'

any ideas?

p.s. currently my solution is using God to control the delayed_job workers to keep them alive. it's definitely ugly. so I am waiting for someone who can help. thanks!

@aaronjensen

@sg552 we have been using #91 and it works well.

@sg552

@aaronjensen the patch (#91) doesn't work for me. Is there something I did incorrectly?

BTW, this is what I did:

I created a file at lib/delayed/backend/active_record.rb, and it looks like this:

require 'active_record/version'
module Delayed
  module Backend
    module ActiveRecord
      # A job object that is persisted to the database.
      # Contains the work object as a YAML field.
      class Job < ::ActiveRecord::Base
        include Delayed::Backend::Base
        # ......
        def destroy
          retries = 0
          begin
            super
          rescue ::ActiveRecord::StatementInvalid => e
            if e.message =~ /Deadlock found when trying to get lock/ && retries < 100
              retries += 1
              retry
            else
              raise
            end
          end
        end
      end
    end
  end
end

In my config/application.rb:

module MCacheCleaner
  class Application < Rails::Application
    config.autoload_paths += %W(#{config.root}/lib)
    config.time_zone = 'Asia/Harbin'
  end 
end 

Rails version: 4.0.0

@aaronjensen

@sg552 I'm not monkey patching, I'm actually pointing my Gemfile to my fork. If you're going to monkey patch, you should name the file differently and explicitly require it in your application.rb. Replacing the original file name is a bad idea because it probably won't load the original.

@sg552

hi @aaronjensen , thanks for reply, could you please tell me which url I should point to in my Gemfile?
could you give me an example?

thanks!

@michaelglass

see: http://bundler.io/v1.6/git.html
and #90

gem 'delayed_job_active_record', git: 'https://github.com/aaronjensen/delayed_job_active_record.git', branch: 'retry-deadlock-on-delete'
@sg552

@michaelglass thanks! Can I rename the gem name to be:

gem 'delayed_job_activerecord_retry-deadlock-on-delete', git: 'https://github.com/aaronjensen/delayed_job_active_record.git', branch: 'retry-deadlock-on-delete'

?

@aaronjensen

No, though I'm not sure why you would want to? You should remove the original line from your Gemfile; this takes its place.

@sg552

@michaelglass @aaronjensen got it. thanks! I will let you know if it works on my server.

@michaelglass

whoops. also I had a typo

@t-anjan

@aaronjensen - When I said "all workers are not running together", I was referring to the same issue which others faced. Specifically, @joshuapinter commented in this thread, saying the following:

@aaronjensen You're not kidding, I woke up this morning to find only one worker running a task at any given time... So, I'm a little confused, have we found a fix for this, yet? What's the best advice?

Even after using the retry-deadlock-on-delete branch, I still have the same issue.

@aaronjensen

@t-anjan you're probably facing a different issue than what my branch fixes. My branch fixes a very specific issue: deadlocks can happen between the update and the delete of a job on MySQL. Deadlocks are detected by the db and one of the transactions is killed, so this would not cause only one worker to work at once. That could be some sort of locking issue or something else entirely, I have no idea, but it's most likely not what my branch (and this issue) are about.

@sg552

@aaronjensen: thanks a lot! Your branch works perfectly on my server! Good job! The workers have been running for 2 days without any problems.

@csmuc

I created a pull request months ago, which would enable a little cleaner monkey-patch of the reserve method:
#89

I think it is useful to be able to revert to the conservative SQL statement (even better would be a configuration option). I haven't heard anything back yet.

@aaronjensen

the non-conservative SQL statement is fundamentally flawed without deadlock retry, so either #91 should be merged or the non-conservative method should just be removed. Ultimately, I don't think monkey-patching it out should be the encouraged approach.

@csmuc

I have even more issues with the non-conservative SQL statement. MySQL (master-master replication) writes this warning to the log file on every execution of the SQL:

[Warning] Unsafe statement written to the binary log using statement format since BINLOG_FORMAT = STATEMENT. The statement is unsafe because it uses a LIMIT clause. This is unsafe because the set of rows included cannot be predicted.

The mysqld.log is flooded with those messages (and MySQL probably doesn't spit out the warnings without a reason, either). I'd really like to get rid of the optimized version of the MySQL statement (preferably with a config option).

@aaronjensen

@csmuc yeah we get that too :/

@ndbroadbent

We fixed this issue by reverting to the less efficient job fetching query, where it fetches a batch of 5 jobs, and then tries to lock each one successively until it obtains a lock.
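
For reference, that batch-and-lock strategy looks roughly like the sketch below (a simplification, not the exact reverted code; lock_exclusively! and read_ahead are the gem's existing names):

# Simplified sketch of the older reservation strategy the commit restores.
def self.reserve(worker, max_run_time = Worker.max_run_time)
  jobs = ready_to_run(worker.name, max_run_time)
           .order("priority ASC, run_at ASC")
           .limit(worker.read_ahead)   # read_ahead defaults to 5
  # lock_exclusively! issues a guarded UPDATE, so only one worker wins each job;
  # losers simply move on to the next candidate in the batch.
  jobs.detect { |job| job.lock_exclusively!(max_run_time, worker.name) }
end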

I ran some tests on my local machine with 4 delayed job workers, and 10,000 no-op jobs:

Before: 41 Deadlocks in log/delayed_job.log
After: 0 Deadlocks

Here's the commit that reverts the change: ndbroadbent@d311b55

You can test it out by adding this line to your Gemfile:

gem 'delayed_job_active_record', github: 'ndbroadbent/delayed_job_active_record', branch: :fix_locking_issues
@t-anjan

I don't know if mine is a special case.

Using

gem 'delayed_job_active_record', :github => 'aaronjensen/delayed_job_active_record', :branch => 'retry-deadlock-on-delete'

was still giving me a bucket-load of deadlocks and very slow queue processing. As I had mentioned before, in addition to deadlocks, it appeared that not all my workers were processing the queue simultaneously.

Yesterday, I tried

gem 'delayed_job_active_record', github: 'ndbroadbent/delayed_job_active_record', branch: :fix_locking_issues

and I immediately noticed an improvement. On a queue of about 30,000 real jobs, when I ran

select * from delayed_jobs where locked_at is not null;

I always got back exactly as many rows as the number of workers I was running. And the whole queue was finished in under an hour with zero deadlocks.

Earlier it was taking more than 4 to 5 hours with tens of deadlocks. And the above query was always returning just one (or at most two) records.

@aaronjensen - It looks like the issue I described earlier in the thread is indeed related to this fix.

@ndbroadbent - Thanks! I was so close to switching from DJ completely.

@aaronjensen

@t-anjan that's great, I'm glad you found something that works for you.

@marcoshack

@ndbroadbent, do you know if your fix can help with the following error also?

Error while reserving job: Mysql2::Error: Lock wait timeout exceeded; try restarting transaction: 
UPDATE `delayed_jobs` SET `delayed_jobs`.`locked_at` = '2014-09-16 22:01:39', 
`delayed_jobs`.`locked_by` = 'delayed_job.1 host:job02 pid:17742' 
WHERE ((run_at <= '2014-09-16 22:01:39' AND (locked_at IS NULL OR
locked_at < '2014-09-16 18:01:39') OR locked_by = 'delayed_job.1 host:job02 pid:17742') 
AND failed_at IS NULL) AND `delayed_jobs`.`queue` IN ('retrieve_deliveries') 
ORDER BY priority ASC, run_at ASC LIMIT 1
@t-anjan

@marcoshack - That's the error we were experiencing too. And it was fixed by @ndbroadbent's change.

@marcoshack

That's awesome @ndbroadbent and @t-anjan, thank you for the feedback.

@yigitbacakoglu

First of all thanks @ndbroadbent !
We were having almost same problem with @t-anjan.
We're running an average of 25 workers and usually they manage to keep the job queue down to 1000 or so, but when it passes the 1000 threshold, it grows exponentially and takes almost 5 hours to process ~500k jobs (we see an average of 35k jobs in the queue during those 5 hours).

Also, we noticed that even though we have 50+ workers, only ~10 workers got jobs. So we made the changes @ndbroadbent mentioned above and also increased read_ahead to 25. After these changes, 100k jobs were processed within 30 minutes.

How does it work?

In such a system, where 20+ workers run, each worker tries to reserve a job; when almost all workers try at the same time, one of them reserves the job and the others hit a deadlock and wait 5 seconds before trying to reserve a new one. This causes significant delays on busy systems.

Changing this logic from
-> "give me 1 job and I will try to reserve it, otherwise I will wait 5 seconds" to
-> "give me 25 jobs and I will try to reserve one of them ASAP"
fixed our issue.
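
For reference, read_ahead is a standard Delayed::Worker setting; we bumped it in an initializer, roughly:

# config/initializers/delayed_job.rb
# With the batch-based reservation, each worker fetches this many candidate
# jobs per reservation attempt (the default is 5).
Delayed::Worker.read_ahead = 25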

@jdelStrother

With @ndbroadbent's reverted commit, it does fix deadlocks, but the select times are pretty bad. With a jobs table containing 2500 'live' jobs and 170k failed jobs, the SELECTs would typically take around 2200ms to complete.

Delayed::Backend::ActiveRecord::Job Load (2443.3ms)  SELECT  `delayed_jobs`.* FROM `delayed_jobs`  WHERE ((run_at <= '2014-12-17 10:24:09' AND (locked_at IS NULL OR locked_at < '2014-12-17 06:24:09') OR locked_by = 'delayed_job.3 host:dj2.audioboo.cloud.ec pid:13318') AND failed_at IS NULL)  ORDER BY priority ASC, run_at ASC LIMIT 5

If I wipe out all the old failed jobs, the SELECTs drop to a few milliseconds. Possibly I should just be clearing out failed jobs more regularly. Alternatively, I wonder if it might help the query if, when a job is marked as failed, we also bumped its priority to an arbitrarily high value.

@yigitbacakoglu

@jdelStrother which MySQL version do you use? Also, can you try restarting MySQL and observing the response time again?

@jdelStrother

We're still on 5.5.39. Restarting MySQL on our live site isn't really an option for us, but I'll see what I can do about reproducing it locally.

@ventsislaf ventsislaf referenced this issue from a commit in notonthehighstreet/delayed_job_active_record
@ventsislaf ventsislaf [#85070062] Sync the repo with the original repo
https://www.pivotaltracker.com/story/show/85070062

In order to use the gem in Rails 4.2:

* Use the code from the original repository
* Apply the patch from
  ndbroadbent@d311b55
  which fixes
  collectiveidea#63
bbd79f2
@ventsislaf ventsislaf referenced this issue in notonthehighstreet/delayed_job_active_record
Closed

[#85070062] Sync the repo with the original repo #1

@wolfpakz

I'm experiencing this problem and the deadlocks are crippling. The gem is effectively BROKEN as released. The commit from ndbroadbent@d311b55 does resolve the problem, but I've had to fork and apply his patch in order to get on with business.

It's a real bummer having to live on a fork of this gem. I'd greatly prefer to be on the mainline release, but it's not possible unless this is resolved.

What do you need to see for this to move forward?

@csmuc

The goal of my commit 0eecb8f was to at least enable a rather "clean" monkey-patch. We use this in production:

module Delayed
  module Backend
    module ActiveRecord
      class Job
        def self.reserve_with_scope(ready_scope, worker, now)
          reserve_with_scope_using_default_sql(ready_scope, worker, now)
        end
      end
    end
  end
end

Of course it would be much better to have an official config option or something.

@wolfpakz

@csmuc Thanks! That's a nice way back to the mainline release.

@aaronjensen

I'll repeat myself here: #63 (comment)

Requiring a monkey patch for a fundamentally flawed method of reserving jobs (it is only flawed if you do not have deadlock retry) makes no sense whatsoever. It should either be reverted to the method that works, or we should add deadlock retry to the current method.

I honestly do not understand what is happening here.

@gorism

I'm the person that originally opened this issue. Just wanted to say that I've now stopped hoping that someone will actually do something to address this in the gem. It is rather clear from the number of comments on this issue that the current approach is quite broken for a large number of users. In spite of many suggestions to fix the issue, and many pull requests with suggested fixes, it appears as broken today as it did nearly two years ago.

I ended up forking the gem, and just going with the "tried and true" original job reservation approach. I was initially concerned about the performance impacts (after all, this whole ordeal was initiated based on a desire to improve the reservation performance), but I have seen no noticeable performance degradation under load. Note that we only keep 7 days worth of failed jobs around, so perhaps at some point the performance would measurably suffer, if jobs aren't occasionally removed.

I'll check back in occasionally, because I am sort of curious just how long this will be allowed to remain an issue on the master branch. But at least I don't have to worry about cleaning up after deadlocked jobs anymore.

@yyyc514

Does no one at Collective Idea care about Delayed Job AR anymore? Do we need a new maintainer? Sorry for the mass CC, but I've also followed this thread for over a year, and after reading @gorism's frustration (plus all the other comments over time) I decided to escalate this a bit more.

It seems Collective Idea should take some action:

  1. Close this issue and explain why.
  2. Accept one of the various pull requests to fix this issue then close it.
  3. Find another maintainer of the project
  4. Officially say it's no longer supported and allow #3 to happen organically.

Ping Collective Idea: @danielmorrison @bryckbost @gaffneyc @laserlemon @jcarpenter88 @tbugai @ersatzryan @albus522 @jasonroelofs @brianhempel @emilford @mercilessrobot @manlycode @trestrantham @spncr2 @lauramosher

@albus522
Owner

@yyyc514 this project is working quite well except for a minority of people using an edge case scenario on MySQL. This issue primarily comes up when running a lot of really fast jobs against a MySQL database, at which point why are the jobs being delayed at all?

About the only solution that stands a chance of working long term is setting up a configuration option that allows users to change the job locker to the old style. If someone wants to put together a pull request for that, we can look at getting it merged, but it is not a priority item we will be working on directly right now.

@aaronjensen

@albus522 Thank you for chiming in. It sounds like you agree that deadlocks are possible given the current method for locking jobs, which is true.

There are really only two good ways to handle operations that deadlock:

  1. Change them to operations that cannot deadlock by changing the order the database acquires locks.
  2. Retry when there is a deadlock.

The second option has been working very well for us for nearly a year now. Could you reconsider that as an option? I have submitted a pull request that does this, and would be happy to tweak anything you'd like in it: #91

Thanks!

@gorism

@albus522 I can't refute or support your assessment of how much of an edge case minority I'm in. I certainly can describe what I consider valid reasons for having a task be run in the background independent of how long the task takes to execute.

But as a member of the edge case minority, I will offer that understanding your position on the issue would have been appreciated. In the absence of that information, quite a few people have spent what seems like a good chunk of time trying to come up with a solution and then present that solution for consideration. Those solutions apparently weren't going to stand a chance for inclusion, so those of us working on them could have just focused on what you were thinking was a reasonable approach instead.

@csmuc

Part of the minority as well :)

@albus522 "This issue primarily comes up when running a lot of really fast jobs against a mysql database, at which point why are the jobs being delayed at all."

One of the reasons I use ActiveRecord-backed DJ a lot: it helps me build robust distributed applications. I call remote services with DJs, not to delay them, but to replace the network calls with "INSERT INTO" statements, which are part of the local DB transaction. The DJ then takes care of doing the network call (and only that). I hope you guys don't mind that I plug my blog here, but those who are interested can read more here: http://blog.seiler.cc/post/112879365221/using-a-background-job-queue-to-implement-robust

@defeated defeated referenced this issue in collectiveidea/delayed_job
Open

2 workers picking up the same job #658
