Mysql2::Error: Deadlock when attempting to lock a job #63

Open
gorism opened this Issue Jun 5, 2013 · 126 comments

@gorism

gorism commented Jun 5, 2013

We've just upgraded from 0.3.3 to 0.4.4, to resolve the race condition problem when running multiple workers. We're now seeing occasional MySQL deadlocks, which are unhandled exceptions and end up killing the worker.

For example:

svc1 Jun  5 12:35:17  Mysql2::Error: Deadlock found when trying to get lock; try restarting transaction: UPDATE `delayed_jobs` SET `locked_at` = '2013-06-05 12:35:16', `locked_by` = 'delayed_job.2 host:svc1 pid:20498' WHERE `delayed_jobs`.`queue` IN ('batch') AND ((run_at <= '2013-06-05 12:35:16' AND (locked_at IS NULL OR locked_at < '2013-06-05 08:35:16') OR locked_by = 'delayed_job.2 host:svc1 pid:20498') AND failed_at IS NULL) ORDER BY priority ASC, run_at ASC LIMIT 1

It would seem that, at the very least, this should be handled.

In case it matters, we're using delayed_job 3.0.4.

@jamsi


jamsi commented Jun 5, 2013

I get this also with: 4.0.0.beta2

@piotrb


piotrb commented Jun 6, 2013

+1 ..

Seems to happen to us when creating a delayed job from within another running job; not all the time of course, just very intermittently.

@philister


philister Jun 24, 2013

+1
delayed_job (3.0.5)
delayed_job_active_record (0.4.4)


@settinghead


settinghead Jun 25, 2013

Same here.
delayed_job (3.0.5)
delayed_job_active_record (0.4.4)


@cheneveld


cheneveld Jul 4, 2013

+1

delayed_job (3.0.5)
activesupport (> 3.0)
delayed_job_active_record (0.4.4)
activerecord (>= 2.1.0, < 4)
delayed_job (> 3.0)

mysql (5.5.27)


@philister


philister Jul 11, 2013

Can we help with anything to solve this? Does anyone have a patch or hints? We are restarting the workers every 10 minutes because of this.
Thanks and regards, Ph.


@gorism


gorism commented Jul 12, 2013

@philister We ended up forking and changing the logic to use a row level lock. I suspect the original implementation went to lengths to avoid using a lock, but in the end I'll accept a performance penalty for more reliability. You can find our fork here: https://github.com/doxo/delayed_job_active_record

Note that we do still get occasional deadlocks which result in restarts, but rare in comparison to what we were getting before. It would be even better if the gem could detect the deadlock and retry, but getting the root cause out of the exception raised by the MySQL adapter is not straightforward, and possibly version dependent. So I opted to live with the occasional deadlocks for now.
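For illustration, the row-level-lock style of reservation described above boils down to something like the sketch below. This is not the doxo fork's actual code; the method name is made up, and ready_to_run is assumed to be the gem's existing scope.

# Illustrative sketch only; not the doxo fork's implementation.
module Delayed
  module Backend
    module ActiveRecord
      class Job
        # Reserve the next runnable job by locking its row (SELECT ... FOR UPDATE)
        # inside a transaction, then marking it as claimed by this worker.
        def self.reserve_with_row_lock(worker, max_run_time = Delayed::Worker.max_run_time)
          now = db_time_now
          transaction do
            job = ready_to_run(worker.name, max_run_time)
                    .order(:priority, :run_at)
                    .lock(true)   # row-level lock keeps competing workers off this row
                    .first
            if job
              job.locked_at = now
              job.locked_by = worker.name
              job.save!
            end
            job
          end
        end
      end
    end
  end
end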

@philister


philister Jul 15, 2013

@gorism Thanks a lot!


@philister philister referenced this issue in collectiveidea/delayed_job Jul 15, 2013

Closed

DJ dies if an exception happens while reserving the job #536

@zxiest


zxiest commented Jul 17, 2013

@gorism great fix! Thanks! -- The original dj's workers were dying after deadlock. I'm happy to see my workers alive after some time.

@cainlevy


cainlevy Jul 23, 2013

edit: see followup at #63 (comment). the new queries may not be a net gain, even without deadlocks.

We were bit by this as well during an upgrade.

I set up a concurrency test on my dev machine using two workers and a self-replicating job. The job simply creates two more of itself, with semi-random priority and run_at values. This setup reliably repros the deadlock within seconds.

The output of show engine innodb status says the contention is between the UPDATE query now used to reserve a job, and the DELETE query used to clean up a finished job. Surprising! Apparently the DELETE query first acquires a lock on the primary index (id) and then on the secondary index (priority, run_at). But the UPDATE query is using the (priority, run_at) index to scan the table, and is trying to grab primary key locks as it goes. Eventually the UPDATE and DELETE queries each grab one of two locks for a given row, try to acquire the other, and 💥. MySQL resolves by killing the UPDATE, which crashes the worker.

The fix I've worked out locally is to replace the index on (priority, run_at) with an index on (priority, run_at, locked_by). This completely stabilizes my concurrency test! My theory is that it allows the UPDATE query's scan to skip over rows held by workers, which takes it out of contention with the DELETE query.

Hope this helps.

🔒🔒
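For readers who want to try the same thing, here is a rough sketch of a self-replicating repro job and the index change described above. The class and migration names are made up, and the index name is assumed to match the gem's default delayed_jobs_priority index.

# Self-replicating job for a concurrency test: each run enqueues two more copies
# of itself with semi-random priority and run_at values.
class ReplicatingJob
  def perform
    2.times do
      Delayed::Job.enqueue ReplicatingJob.new,
                           priority: rand(3),
                           run_at: Time.now + rand(5).seconds
    end
  end
end

# Replace the (priority, run_at) index with (priority, run_at, locked_by).
class ExtendDelayedJobsPriorityIndex < ActiveRecord::Migration
  def up
    remove_index :delayed_jobs, name: "delayed_jobs_priority"
    add_index :delayed_jobs, [:priority, :run_at, :locked_by], name: "delayed_jobs_priority"
  end

  def down
    remove_index :delayed_jobs, name: "delayed_jobs_priority"
    add_index :delayed_jobs, [:priority, :run_at], name: "delayed_jobs_priority"
  end
end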


@cheneveld


cheneveld Jul 24, 2013

+20

Thanks a bunch cainlevy. This did the trick. Great work.


@philister


philister Jul 26, 2013

Finally this did it for us, too. Thanks @cainlevy (other solutions didn't work in our case).


@ngan


ngan commented Aug 11, 2013

Upgraded from 0.3.3 to 0.4.4 and we're experiencing deadlock issues as well. We're running a total of 10 workers and usually they manage to keep the job queue down to 1000 or so. On mysql 5.1. @cainlevy's solution didn't work for us :-(.

------------------------
LATEST DETECTED DEADLOCK
------------------------
130811  3:19:18
*** (1) TRANSACTION:
TRANSACTION 2607F0BE7C, ACTIVE 0 sec, process no 7937, OS thread id 1364158784 fetching rows
mysql tables in use 1, locked 1
LOCK WAIT 5 lock struct(s), heap size 1248, 21 row lock(s)
MySQL thread id 2166505, query id 64381984645 init
UPDATE `delayed_jobs` SET `locked_at` = '2013-08-11 03:19:18', `locked_by` = 'delayed_job host:app3 pid:26732' WHERE ((run_at <= '2013-08-11 03:19:18' AND (locked_at IS NULL OR locked_at < '2013-08-10 23:19:18') OR locked_by = 'delayed_job host:app3 pid:26732') AND failed_at IS NULL) AND (priority <= '0') ORDER BY priority ASC, run_at ASC LIMIT 1
*** (1) WAITING FOR THIS LOCK TO BE GRANTED:
RECORD LOCKS space id 1853 page no 89833 n bits 80 index `PRIMARY` of table `production`.`delayed_jobs` trx id 2607F0BE7C lock_mode X locks rec but not gap waiting
Record lock, heap no 10 PHYSICAL RECORD: n_fields 14; compact format; info bits 32
 0: len 4; hex a1dd92ed; asc     ;;
 1: len 6; hex 002607f0c031; asc  &   1;;
 2: len 7; hex 000000549014ba; asc    T   ;;
 3: len 4; hex 80000000; asc     ;;
 4: len 4; hex 80000000; asc     ;;
 5: len 30; hex 2d2d2d2021727562792f6f626a6563743a44656c617965643a3a50657266; asc --- !ruby/object:Delayed::Perf; (total 1016 bytes);
 6: SQL NULL;
 7: len 8; hex 8000124f11d7316a; asc    O  1j;;
 8: len 8; hex 8000124f11d7316e; asc    O  1n;;
 9: SQL NULL;
 10: len 30; hex 64656c617965645f6a6f6220686f73743a61707030207069643a32333832; asc delayed_job host:app0 pid:2382; (total 31 bytes);
 11: SQL NULL;
 12: len 8; hex 8000124f11d7316a; asc    O  1j;;
 13: len 8; hex 8000124f11d7316a; asc    O  1j;;

*** (2) TRANSACTION:
TRANSACTION 2607F0C031, ACTIVE 0 sec, process no 7937, OS thread id 1363892544 updating or deleting
mysql tables in use 1, locked 1
3 lock struct(s), heap size 1248, 2 row lock(s), undo log entries 1
MySQL thread id 2166501, query id 64381985155 updating
DELETE FROM `delayed_jobs` WHERE `delayed_jobs`.`id` = 568169197
*** (2) HOLDS THE LOCK(S):
RECORD LOCKS space id 1853 page no 89833 n bits 80 index `PRIMARY` of table `production`.`delayed_jobs` trx id 2607F0C031 lock_mode X locks rec but not gap
Record lock, heap no 10 PHYSICAL RECORD: n_fields 14; compact format; info bits 32
 0: len 4; hex a1dd92ed; asc     ;;
 1: len 6; hex 002607f0c031; asc  &   1;;
 2: len 7; hex 000000549014ba; asc    T   ;;
 3: len 4; hex 80000000; asc     ;;
 4: len 4; hex 80000000; asc     ;;
 5: len 30; hex 2d2d2d2021727562792f6f626a6563743a44656c617965643a3a50657266; asc --- !ruby/object:Delayed::Perf; (total 1016 bytes);
 6: SQL NULL;
 7: len 8; hex 8000124f11d7316a; asc    O  1j;;
 8: len 8; hex 8000124f11d7316e; asc    O  1n;;
 9: SQL NULL;
 10: len 30; hex 64656c617965645f6a6f6220686f73743a61707030207069643a32333832; asc delayed_job host:app0 pid:2382; (total 31 bytes);
 11: SQL NULL;
 12: len 8; hex 8000124f11d7316a; asc    O  1j;;
 13: len 8; hex 8000124f11d7316a; asc    O  1j;;

*** (2) WAITING FOR THIS LOCK TO BE GRANTED:
RECORD LOCKS space id 1853 page no 33 n bits 648 index `delayed_jobs_priority` of table `production`.`delayed_jobs` trx id 2607F0C031 lock_mode X locks rec but not gap waiting
Record lock, heap no 406 PHYSICAL RECORD: n_fields 4; compact format; info bits 0
 0: len 4; hex 80000000; asc     ;;
 1: len 8; hex 8000124f11d7316a; asc    O  1j;;
 2: len 30; hex 64656c617965645f6a6f6220686f73743a61707030207069643a32333832; asc delayed_job host:app0 pid:2382; (total 31 bytes);
 3: len 4; hex a1dd92ed; asc     ;;

*** WE ROLL BACK TRANSACTION (2)
mysql> show indexes from delayed_jobs;
+--------------+------------+-----------------------+--------------+-------------+-----------+-------------+----------+--------+------+------------+---------+
| Table        | Non_unique | Key_name              | Seq_in_index | Column_name | Collation | Cardinality | Sub_part | Packed | Null | Index_type | Comment |
+--------------+------------+-----------------------+--------------+-------------+-----------+-------------+----------+--------+------+------------+---------+
| delayed_jobs |          0 | PRIMARY               |            1 | id          | A         |        2873 |     NULL | NULL   |      | BTREE      |         |
| delayed_jobs |          1 | delayed_jobs_locking  |            1 | locked_at   | A         |           1 |     NULL | NULL   | YES  | BTREE      |         |
| delayed_jobs |          1 | delayed_jobs_locking  |            2 | locked_by   | A         |           1 |     NULL | NULL   | YES  | BTREE      |         |
| delayed_jobs |          1 | delayed_jobs_priority |            1 | priority    | A         |           1 |     NULL | NULL   | YES  | BTREE      |         |
| delayed_jobs |          1 | delayed_jobs_priority |            2 | run_at      | A         |        1436 |     NULL | NULL   | YES  | BTREE      |         |
| delayed_jobs |          1 | delayed_jobs_priority |            3 | locked_by   | A         |        1436 |     NULL | NULL   | YES  | BTREE      |         |
+--------------+------------+-----------------------+--------------+-------------+-----------+-------------+----------+--------+------+------------+---------+
6 rows in set (0.00 sec)
@cainlevy


cainlevy Aug 11, 2013

What do you get from an explain on a query with the same WHERE and ORDER clauses?

EXPLAIN SELECT * FROM delayed_jobs WHERE ((run_at <= '2013-08-11 03:19:18' AND (locked_at IS NULL OR locked_at < '2013-08-10 23:19:18') OR locked_by = 'delayed_job host:app3 pid:26732') AND failed_at IS NULL) AND (priority <= '0') ORDER BY priority ASC, run_at ASC LIMIT 1

@ngan


ngan commented Aug 11, 2013

mysql> EXPLAIN SELECT * FROM delayed_jobs WHERE ((run_at <= '2013-08-11 03:19:18' AND (locked_at IS NULL OR locked_at < '2013-08-10 23:19:18') OR locked_by = 'delayed_job host:app3 pid:26732') AND failed_at IS NULL) AND (priority <= '0') ORDER BY priority ASC, run_at ASC LIMIT 1;
+----+-------------+--------------+-------+--------------------------------------------+-----------------------+---------+------+------+-------------+
| id | select_type | table        | type  | possible_keys                              | key                   | key_len | ref  | rows | Extra       |
+----+-------------+--------------+-------+--------------------------------------------+-----------------------+---------+------+------+-------------+
|  1 | SIMPLE      | delayed_jobs | range | delayed_jobs_locking,delayed_jobs_priority | delayed_jobs_priority | 5       | NULL |  589 | Using where |
+----+-------------+--------------+-------+--------------------------------------------+-----------------------+---------+------+------+-------------+
1 row in set (0.00 sec)
@cainlevy


cainlevy Aug 11, 2013

I don't think we have the delayed_jobs_locking index on our table, but I guess it's not being used? The only other thing I can think of is that maybe MySQL 5.1's query planner isn't taking advantage of the updated delayed_jobs_priority index.

Hopefully you can reproduce the deadlock safely with a handful of workers running a self-replicating job. Let us know if you find anything more!


@ngan


ngan commented Aug 11, 2013

I added the delayed_jobs_locking index for pulling other metrics. It would be nice if DJ-AR allowed us to specify the strategy for claiming jobs; that way I could use the old strategy and still get the new upgraded features.

@romanlehnert


romanlehnert Sep 11, 2013

Adding locked_by to the index (as described by @cainlevy) has the side effect for me that out of 10 workers only a few (2-5) are working at the same time. Maybe this affects the performance of the jobs table?


@t-anjan


t-anjan commented Sep 12, 2013

Thanks @cainlevy. Adding locked_by to the index seems to have solved the problem for me as well. I did it yesterday, amidst the chaos of deadlock errors being thrown all over the place in a queue of 100k jobs. As soon as this change was made, those deadlock issues were gone.

@ValMilkevich


ValMilkevich Sep 12, 2013

+1 Thanks @cainlevy, works great.


@ngan


ngan commented Sep 12, 2013

Is everyone for whom this is working on MySQL 5.5+?

@michaelglass


michaelglass Sep 13, 2013

@ngan @cainlevy's fix doesn't help me on 5.5


@t-anjan


t-anjan commented Sep 13, 2013

@ngan - I am on MySQL 5.5.31 and @cainlevy's fix DOES help me.

@cainlevy


cainlevy Sep 13, 2013

MySQL 5.5.x

If my fix isn't working for someone, I'd recommend following the steps I described earlier to set up a repro someplace safe to experiment. It's the only way we're going to learn more.


@michaelglass


michaelglass Sep 17, 2013

I'll publish my test setup in a moment. In the meantime: could not repro on 5.6.13.


@michaelglass


michaelglass Sep 17, 2013

my experience with this patch: it does stop the deadlock but performance on my DJ doesn't improve after the deadlock is removed.

However, removing the mysql optimizations makes performance jump from about 1 job/second/worker to about 10 jobs/second/worker. Seeing if upgrading to 5.6 will help one way or the other.


@zxiest


zxiest commented Sep 18, 2013

I had originally thought some of the fixes above fixed my deadlock issues but apparently I still got them. I think using multiple processes will cause race conditions with the db so I built a gem that is thread based: a manager thread pulls jobs from the db and distributes them to a threadpool of workers. I would totally welcome contributors to http://github.com/zxiest/delayed_job_active_record_threaded
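Roughly, the manager-plus-threadpool shape described above could look like the sketch below. This is illustrative only, not the code of delayed_job_active_record_threaded; the class name and settings are made up.

require "thread"

# Sketch: one manager thread reserves jobs, a pool of worker threads runs them.
class ThreadedJobRunner
  def initialize(pool_size: 5)
    @queue     = SizedQueue.new(pool_size * 2)   # back-pressure between manager and workers
    @pool_size = pool_size
  end

  def run
    workers = Array.new(@pool_size) do
      Thread.new do
        while (job = @queue.pop)                 # nil shuts the worker down
          begin
            job.invoke_job                       # delayed_job API: run the payload
            job.destroy
          rescue StandardError => e
            job.update_column(:last_error, e.message)
          end
        end
      end
    end

    # Only the manager thread runs the reservation query, so worker threads
    # never race each other for rows.
    loop do
      jobs = Delayed::Job.where(failed_at: nil, locked_at: nil)
                         .where("run_at <= ?", Time.now)
                         .order(:priority, :run_at).limit(@pool_size).to_a
      jobs.each do |job|
        job.update_column(:locked_at, Time.now)  # mark as claimed before handing off
        @queue.push(job)
      end
      sleep(1) if jobs.empty?
    end
  ensure
    @pool_size.times { @queue.push(nil) }
    workers.each(&:join) if workers
  end
end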

@cainlevy


cainlevy Sep 18, 2013

@zxiest That sounds like a good way to make DB polling more efficient, but if you're still having trouble with deadlocks, it'll probably come back when you run multiple machines.


@cainlevy


cainlevy Sep 18, 2013

@michaelglass I presume you're measuring performance against a previous version of the gem? Are you measuring real-world throughput, or using an empty job?


@zxiest


zxiest commented Sep 18, 2013

@cainlevy yup, this might be an issue. However, if needed in the future, I can achieve mutual exclusivity by processing different queues on different servers at first, and if the queues are too large, I could use job_id % servers_number == server_id. I would still use protective locks for safety but avoid locking as much as possible =)
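The modulo partitioning idea amounts to a scope like this (a sketch; servers_number and server_id are hypothetical per-host settings):

# Each host only ever considers its own slice of the jobs table, so two hosts
# never try to reserve the same row. servers_number and server_id are
# hypothetical per-host configuration values.
servers_number = 3
server_id      = 0   # 0, 1, or 2 depending on the host

my_slice = Delayed::Job.where("id % ? = ?", servers_number, server_id)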

@michaelglass


michaelglass Sep 18, 2013

@cainlevy thanks so much for being such a boss!

Here's my story:

  1. Saw a lot of deadlocks and slow queries in my log. Pushed your fix, and queries were no longer reported to the slow query log / no new deadlocks in show innodb status; I thought that everything was better!
  2. Tested DJ throughput against my live web app first by counting rows processed. Ensured that this cap script was reporting equivalent numbers.
  3. With and without the fix, Performance was about 1 job/s.
  4. Forked and removed mysql specific logic. Performance jumped to about 80 jobs/s (with 25 workers). Throwing more workers at the queue didn't improve performance.

In isolation, I created this test harness and ran it in a couple of parallel irbs against default homebrew mysql (5.6.13) and current version in the percona apt repo (5.5.33-31.1). Couldn't repro slowness there.

Haven't run against 5.5.31-30.3-log which is what I'm running in production. Testing that soon.


@cainlevy


cainlevy commented Oct 1, 2013

@michaelglass So we eventually ran into performance problems as well. Seems like once something backed up the table enough, performance tanked. Limiting workers to specific priorities helped a little bit, but we couldn't fully catch up without switching back to the select/update query pattern.

@michaelglass


michaelglass Oct 13, 2013

@cainlevy yup. Was my experience. I think we should revert those "performance" fixes, or at least revert them for 5.5+?


@cainlevy


cainlevy Oct 14, 2013

@michaelglass I think so, yeah. Did you ever find out what happens under 5.6?

I forgot until afterwards, but we had some performance data that nicely showed the effects of the new reserve. We deployed it to this app on Sept 23, then ran into problems and reverted on Oct 1.

[graph: selects performance]

[graph: updates performance]

Green is mean of the 90th percentile, red is the upper bound of the 90th percentile, and purple is the total upper bound.


@michaelglass


michaelglass Oct 16, 2013

looks like I don't see the same performance issues with Mysql 5.6


@michaelglass


michaelglass Oct 23, 2013

lies. same problem with percona mysql 5.6. awful.


@yigitbacakoglu


yigitbacakoglu Dec 17, 2014

@jdelStrother which mysql version do you use? Also, can you try to restart mysql and observe the response time again?


@jdelStrother


jdelStrother Dec 17, 2014

We're still on 5.5.39. Restarting mysql on our live site isn't really an option for us, but I'll see what I can do about reproducing it locally


ventsislaf added a commit to notonthehighstreet/delayed_job_active_record that referenced this issue Jan 15, 2015

[#85070062] Sync the repo with the original repo
https://www.pivotaltracker.com/story/show/85070062

In order to use the gem in Rails 4.2:

* Use the code from the original repository
* Apply the patch from
  ndbroadbent@d311b55
  which fixes
  collectiveidea#63

@ventsislaf ventsislaf referenced this issue in notonthehighstreet/delayed_job_active_record Jan 15, 2015

Closed

[#85070062] Sync the repo with the original repo #1

@wolfpakz


wolfpakz commented Mar 9, 2015

I'm experiencing this problem and the deadlocks are crippling. The gem is effectively BROKEN as released. The commit from ndbroadbent@d311b55 does resolve the problem, but I've had to fork and apply his patch in order to get on with business.

It's a real bummer having to live on a fork of this gem. I'd greatly prefer to be on the mainline release, but it's not possible unless this is resolved.

What do you need to see for this to move forward?

@csmuc


Contributor

csmuc commented Mar 10, 2015

The goal of my commit 0eecb8f was to at least enable a rather "clean" monkey-patch. We use this in production:

module Delayed
  module Backend
    module ActiveRecord
      class Job
        # Always fall back to the original (non-MySQL-optimized) reservation SQL
        def self.reserve_with_scope(ready_scope, worker, now)
          reserve_with_scope_using_default_sql(ready_scope, worker, now)
        end
      end
    end
  end
end

Of course it would be much better to have an official config option or something.

@wolfpakz


wolfpakz Mar 20, 2015

@csmuc Thanks! That's a nice way back to the mainline release.


@aaronjensen


aaronjensen Mar 20, 2015

I'll repeat myself here: #63 (comment)

Requiring a monkey patch for a fundamentally flawed method (it is only flawed if you do not have deadlock retry) of reserving jobs makes no sense whatsoever. It should either be reverted to the method that works or we should add deadlock retry to the current method.

I honestly do not understand what is happening here.


@gorism


gorism commented Apr 15, 2015

I'm the person that originally opened this issue. Just wanted to say that I've now stopped hoping that someone will actually do something to address this in the gem. It is rather clear from the number of comments on this issue that the current approach is quite broken for a large number of users. In spite of many suggestions to fix the issue, and many pull requests with suggested fixes, it appears as broken today as it did nearly two years ago.

I ended up forking the gem, and just going with the "tried and true" original job reservation approach. I was initially concerned about the performance impacts (after all, this whole ordeal was initiated based on a desire to improve the reservation performance), but I have seen no noticeable performance degradation under load. Note that we only keep 7 days worth of failed jobs around, so perhaps at some point the performance would measurably suffer, if jobs aren't occasionally removed.

I'll check back in occasionally, because I am sort of curious just how long this will be allowed to remain an issue on the master branch. But at least I don't have to worry about cleaning up after deadlocked jobs anymore.

@yyyc514


yyyc514 commented Apr 17, 2015

Does no one at Collective Idea care about Delayed Job AR anymore? Do we need a new maintainer? Sorry for the mass CC but I've also followed this thread for over a year and after reading @gorism and his frustration (plus all the other comments over time) I decided to escalate this a bit more.

It seems Collective Idea should take some action:

  1. Close this issue and explain why.
  2. Accept one of the various pull requests to fix this issue then close it.
  3. Find another maintainer of the project
  4. Officially say it's no longer supported and allow #3 to happen organically.

Ping Collective Idea: @danielmorrison @bryckbost @gaffneyc @laserlemon @jcarpenter88 @tbugai @ersatzryan @albus522 @jasonroelofs @brianhempel @emilford @mercilessrobot @manlycode @trestrantham @spncr2 @lauramosher

@albus522


Member

albus522 commented Apr 17, 2015

@yyyc514 this project is working quite well except for a minority of people using an edge case scenario on mysql. This issue primarily comes up when running a lot of really fast jobs against a mysql database, at which point why are the jobs being delayed at all.

About the only solution that stands a chance at working long term, is setting up a configuration option that allows users to change the job locker to the old style. If someone wants to put together a pull request for that, we can look at getting that merged, but it is not a priority item we will be working on directly right now.

@aaronjensen


aaronjensen Apr 17, 2015

@albus522 Thank you for chiming in. It sounds like you agree that deadlocks are possible given the current method for locking jobs, which is true.

There are really only two good ways to handle operations that deadlock:

  1. Change them to operations that cannot deadlock by changing the order the database acquires locks.
  2. Retry when there is a deadlock.

The second option has been working very well for us for nearly a year now. Could you reconsider that as an option? I have submitted a pull request that does this, and would be happy to tweak anything you'd like in it: #91

Thanks!
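For illustration, option 2 (retrying on deadlock) can be wrapped around the reservation call roughly as sketched below. This is a hedged sketch, not the code from #91; before Rails 5.1 there is no ActiveRecord::Deadlocked, so it matches the message of the wrapped Mysql2::Error instead.

# Rough sketch of retrying the job reservation when MySQL reports a deadlock.
# Not the actual implementation of PR #91.
module Delayed
  module Backend
    module ActiveRecord
      class Job
        MAX_DEADLOCK_RETRIES = 3

        class << self
          alias_method :reserve_without_retry, :reserve

          def reserve(worker, max_run_time = Worker.max_run_time)
            attempts = 0
            begin
              reserve_without_retry(worker, max_run_time)
            rescue ::ActiveRecord::StatementInvalid => e
              raise unless e.message =~ /Deadlock found when trying to get lock/
              attempts += 1
              raise if attempts > MAX_DEADLOCK_RETRIES
              sleep(rand * 0.1)   # small jittered back-off before retrying
              retry
            end
          end
        end
      end
    end
  end
end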


@gorism


gorism commented Apr 17, 2015

@albus522 I can't refute or support your assessment of how much of an edge case minority I'm in. I certainly can describe what I consider valid reasons for having a task be run in the background independent of how long the task takes to execute.

But as a member of the edge case minority, I will offer that understanding your position on the issue would have been appreciated. In the absence of that information, quite a few people have spent what seems like a good chunk of time trying to come up with a solution and then present that solution for consideration. Those solutions apparently weren't going to stand a chance for inclusion, so those of us working on them could have just focused on what you were thinking was a reasonable approach instead.

@csmuc


Contributor

csmuc commented Apr 18, 2015

Part of the minority as well :)

@albus522 "This issue primarily comes up when running a lot of really fast jobs against a mysql database, at which point why are the jobs being delayed at all."

One of the reasons I use ActiveRecord-backed DJ a lot: it helps me build robust distributed applications. I call remote services with DJs, not to delay them, but to replace the network calls with "INSERT INTO" statements, which are part of the local DB transaction. The DJ then takes care of doing the network call (and only this). I hope you guys don't mind that I plug my blog here, but those who are interested can read more here: http://blog.seiler.cc/post/112879365221/using-a-background-job-queue-to-implement-robust
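That pattern reads roughly like the sketch below (the Order model, NotifyRemoteServiceJob, and RemoteService are made-up names):

# Sketch of the "network call as a row insert" pattern described above.
NotifyRemoteServiceJob = Struct.new(:order_id) do
  def perform
    RemoteService.notify(order_id)   # the only thing the job does: the network call
  end
end

ActiveRecord::Base.transaction do
  order = Order.create!(total: 100)
  # The enqueue is just an INSERT into delayed_jobs, so it commits (or rolls back)
  # atomically with the rest of the transaction. The worker performs the actual
  # network call later, outside the transaction.
  Delayed::Job.enqueue NotifyRemoteServiceJob.new(order.id)
end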

@defeated defeated referenced this issue in collectiveidea/delayed_job May 2, 2015

Open

2 workers picking up the same job #658

@whitslar


whitslar Jun 16, 2015

Minority here as well. We are using @csmuc 's monkeypatch which got rid of the deadlocks but then our select times got crazy slow (~200k rows, 2-3 seconds, MySQL 5.6.22, AWS RDS db.m1.small).

After monkeying around with indexes we found a pretty weird combination that seems to work well:

add_index :delayed_jobs, [:locked_by, :locked_at, :run_at] #has to be in this order
add_index :delayed_jobs, :run_at

Somehow this causes MySQL to do an index_merge with a sort_union and brings the select times down to a couple ms.

[screenshots]

Disclaimer: We haven't used this in production but I'll let you all know how it goes...
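Wrapped in a migration, that index combination would look roughly like this (a sketch; the migration class name is made up):

class AddDelayedJobsLockingIndexes < ActiveRecord::Migration
  def change
    add_index :delayed_jobs, [:locked_by, :locked_at, :run_at]  # has to be in this order
    add_index :delayed_jobs, :run_at
  end
end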


@mwalsher


mwalsher Jul 25, 2015

Same problem here, even with a relatively small number of jobs (~500 recurring hourly - some small, some large), 3 servers, 6 workers each, and only ~2500 rows in the DJ table. Although I even see the deadlocks if only one server is running. These deadlocks seem to cause our DB CPU to hit and maintain ~100% utilization in perpetuity. This is obviously a big problem...

Is there any sort of consensus on which fork or monkey patch solves the problem? @albus522?

@whitslar how have those indexes been working for you over time with @csmuc's monkey patch?


@csmuc


Contributor

csmuc commented Jul 25, 2015

The other week we tried out the MySQL version a couple of hours in production (by removing my monkey patch) and almost immediately experienced deadlocks.

I also would like to get rid of the monkey patch by having a configuration option. I guess the chances are low to get that merged into this repo, so I'm also interested in moving to a fork as well.

@whitslar


whitslar Jul 30, 2015

They've been working great. They've been in production for about 4 weeks now and we have had no issues. No deadlocks and the select times are great. Before the indexes, our production db (db.m3.xlarge, MySQL 5.6.22, ~200k rows, 9 workers) had been running between 70-80% cpu pretty regularly and now it's steady at 5-10%.


@csmuc


Contributor

csmuc commented Aug 4, 2015

PR to add a configuration option #111

@vijay-bv vijay-bv referenced this issue in collectiveidea/delayed_job Sep 11, 2015

Closed

Something is unstable about latest master #839

@smikkelsen


smikkelsen Sep 16, 2015

Ok, this is a huge gotcha. I just traced some major DB issues down to this problem here. Long story short, I just found a 47GB mysql-slow query log that completely filled up my hard drive and took down my app. This is one of the many side effects I've been noticing since I added multiple app servers. I only have 3 application servers and 1 worker per server, and this is consistently happening. Is there a fix for this in master, or do I need to use a fork?

I'm really not super good with DB tuning, so I'm not sure I follow all of the info posted above, but maybe this can add to the discussion. This is what mysql is flooding my error log with:

[Warning] Unsafe statement written to the binary log using statement format since BINLOG_FORMAT = STATEMENT. The statement is unsafe because it uses a LIMIT clause. This is unsafe because the set of rows included cannot be predicted. Statement: UPDATE `delayed_jobs` SET `delayed_jobs`.`locked_at` = '2015-09-16 04:30:59.000000', `delayed_jobs`.`locked_by` = 'delayed_job host:linkthrottle5 pid:808' WHERE ((run_at <= '2015-09-16 04:30:59.585578' AND (locked_at IS NULL OR locked_at < '2015-09-16 00:30:59.585623') OR locked_by = 'delayed_job host:linkthrottle5 pid:808') AND failed_at IS NULL) ORDER BY priority ASC, run_at ASC LIMIT 1

I think this is a different problem in a way because it is a replication specific problem. But something that should contribute to the decision made here I believe.

Ok, this is a huge gotcha. I just traced some major DB issues down to this problem here. Long story short, I just found a 47GB mysql-slow query log that completely filled up my hard drive and took down my app. This is one of the many side effects i've been noticing since I added multiple app servers. I only have 3 application servers and 1 worker per server and this is consistently happening. Is there a fix for this in master? or do I need to use a fork?

I'm really not super good with DB tuning, so i'm not sure I follow all of the info posted above, but maybe this can add to the discussion. This is what mysql is flooding my error log with:

[Warning] Unsafe statement written to the binary log using statement format since BINLOG_FORMAT = STATEMENT. The statement is unsafe because it uses a LIMIT clause. This is unsafe because the set of rows included cannot be predicted. Statement: UPDATE `delayed_jobs` SET `delayed_jobs`.`locked_at` = '2015-09-16 04:30:59.000000', `delayed_jobs`.`locked_by` = 'delayed_job host:linkthrottle5 pid:808' WHERE ((run_at <= '2015-09-16 04:30:59.585578' AND (locked_at IS NULL OR locked_at < '2015-09-16 00:30:59.585623') OR locked_by = 'delayed_job host:linkthrottle5 pid:808') AND failed_at IS NULL) ORDER BY priority ASC, run_at ASC LIMIT 1

I think this is a different problem in a way because it is a replication specific problem. But something that should contribute to the decision made here I believe.
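
A hedged side note on that warning (not something from the maintainers): it comes from statement-based binlogging, so one route is a database-level change, switching the server's binlog_format to ROW or MIXED. The application-level route is to fall back to the SELECT-based reservation strategy, which does not issue UPDATE ... LIMIT at all. Assuming a delayed_job_active_record version that already includes the reserve_sql_strategy option from PR #111, a sketch of an initializer that does this only when statement-based logging is detected might look like:

# config/initializers/delayed_job_config.rb
# Detect statement-based binlogging and switch to the SELECT-based reservation
# strategy, which avoids the replication-unsafe UPDATE ... LIMIT statement.
format = ActiveRecord::Base.connection.select_value("SELECT @@GLOBAL.binlog_format") rescue nil

if format.to_s.upcase == "STATEMENT"
  Delayed::Backend::ActiveRecord.configure do |config|
    config.reserve_sql_strategy = :default_sql
  end
end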

schmitzc added a commit to TeachingChannel/delayed_job_active_record that referenced this issue Sep 24, 2015

@danielristic

danielristic commented Dec 1, 2015

I'm also part of the growing minority of people who are experiencing this issue on real-life production setups. Deadlocks kill delayed_job processes and we end up having far fewer workers after spikes of jobs. We're considering monkey-patching, as the different forks do not contain the latest additions to delayed_job, or just moving to resque.

@jdelStrother

jdelStrother commented Dec 3, 2015

Another data point - here's what happened when we started monkeypatching to use the reserve_with_scope_using_default_sql method:

[screenshot: "screen shot 2015-12-03 at 13 50 36"]
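
For anyone curious what that monkeypatch can look like before #111 ships, here is a minimal sketch. It assumes the gem's class methods of that era (reserve_with_scope dispatching to reserve_with_scope_using_optimized_sql / reserve_with_scope_using_default_sql on Delayed::Backend::ActiveRecord::Job); check the method names and arity against the version you actually have installed.

# config/initializers/delayed_job_reserve_patch.rb
# Force the SELECT-based reservation path instead of the UPDATE ... LIMIT one.
Delayed::Backend::ActiveRecord::Job.singleton_class.prepend(Module.new do
  def reserve_with_scope(ready_scope, worker, now)
    reserve_with_scope_using_default_sql(ready_scope, worker, now)
  end
end)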

@csmuc

Contributor

csmuc commented Dec 3, 2015

There is an open PR to add a config option: #111

@danielristic

danielristic commented Dec 7, 2015

Wow. I've been using the reserve_with_scope_using_default_sql method on our production server running MySQL 5.5+ for the past 4 days and the processing time has been greatly reduced and no delayed_jobs worker process has been killed since I pushed the change. This PR really needs to be merged.

@klausmeyer

Contributor

klausmeyer commented Dec 29, 2015

Good news:
The PR (#111) was merged today and the configuration option will be part of the next release 😎

@aaronjensen

aaronjensen commented Dec 29, 2015

Awesome, now you can choose between a slower option and a fundamentally flawed option (see my above comments to see why it is fundamentally flawed). Not sure why I'm still subscribed to this thread... unsubscribing. good luck all.

@whitslar

whitslar commented Dec 30, 2015

Ahh I dunno, the original locking mechanism + the indexes I described above completely fixes the issue for us. And if I were to bet on it, it would fix the root issue for everyone else here, given they were on MySQL >= 5.6 (there were some weird things happening with index merge/sort unions on 5.5, but I haven't tested so I can't confirm, mainly because it only rears its head under a decent-size load). Has anyone else been using the indexes I described? Either way, very happy to have PR #111 merged!

@csmuc

Contributor

csmuc commented Feb 8, 2016

@whitslar haven't tried in production, but the indexes actually made the queries a bit slower on my dev box (MySQL 5.6.13, 2k records). Maybe I'll spend some time and play around with other indexes.

@smikkelsen

smikkelsen commented Mar 3, 2016

Not sure what I'm doing wrong, but I tried the master branch with the default lock method (legacy) and my logs were immediately flooded with deadlock errors:

Mysql2::Error: Deadlock found when trying to get lock; try restarting transaction

I put this in my Gemfile:

gem 'delayed_job_active_record', git: 'https://github.com/collectiveidea/delayed_job_active_record.git', branch: :master

and this in config/initializers/delayed_job_config.rb:

Delayed::Backend::ActiveRecord.configuration.reserve_sql_strategy = :default_sql

Any ideas why this is still a problem / perhaps got worse when I tried the fix?

Thanks

@albus522

Member

albus522 commented Mar 3, 2016

Because that method was rarely better, which is why it isn't the default.

@smikkelsen

smikkelsen commented Mar 3, 2016

So, bottom line: delayed_job will not work with MySQL and multiple workers?

@csmuc

Contributor

csmuc commented Mar 4, 2016

@smikkelsen the SQL must look something like this:

Delayed::Backend::ActiveRecord::Job Load (5.2ms)  SELECT `delayed_jobs`.* FROM `delayed_jobs` WHERE `delayed_jobs`.`queue` IN ('sms') AND ((run_at <= '2016-03-03 23:01:09' AND (locked_at IS NULL OR locked_at < '2016-03-03 19:01:09') OR locked_by = 'www02 pid:2225') AND failed_at IS NULL) ORDER BY priority ASC, run_at ASC, id ASC LIMIT 5

My Rails initializer:

Delayed::Backend::ActiveRecord.configure do |config|
  config.reserve_sql_strategy = :default_sql
end

I use it successfully in production (MySQL 5.6, 8 workers) and constantly ran into deadlocks with the other SQL strategy (which also littered the MySQL log files with replication warnings).

@ljfranklin referenced this issue in TalentBox/delayed_job_sequel Aug 2, 2017

Merged

Fix mysql deadlocks while reserving jobs #6

@brettwgreen

brettwgreen commented Dec 6, 2017

Running a ton of small jobs asynchronously is a perfectly reasonable thing to do, as is running multiple workers to offload the actual work. The queueing mechanism should work regardless... period. Shouldn't matter whether there are 5 big slow jobs or 10,000 fast ones.

@albus522

Member

albus522 commented Dec 6, 2017

@brettwgreen It is also reasonable to assume you optimize for your primary use case. No tool will ever work ideally for all situations.
