
Jobs randomly get stuck #228

Closed
mspanc opened this issue Jan 5, 2017 · 26 comments

mspanc commented Jan 5, 2017

I have found cases where jobs get stuck.

Here is a sample job (data retrieved via exq_ui):

PID: vault:<0.15483.6>
queue: heavy
module: Vault.Jobs.Data.Processing.AnalysisAudioDurationJob
args: 9b091696-5ce7-4e2c-a2f5-a904f784d964,true
started at: Wed Jan 04 2017 23:00:52 GMT+0100 (CET)

It ended with an exception and never got cleared from the queue, blocking a slot for other jobs.

4 Jan 2017 14:13:40.956 234 <190>1 2017-01-04T13:13:40.543771+00:00 app web.1 - - 13:13:40.543 [info] [Elixir.Vault.Jobs.Data.Processing.AnalysisAudioDurationJob #PID<0.19642.2> 9b091696-5ce7-4e2c-a2f5-a904f784d964] Saving
» 4 Jan 2017 14:13:40.957 158 <190>1 2017-01-04T13:13:40.544708+00:00 app web.1 - - 13:13:40.544 [error] Process #PID<0.19642.2> raised an exception
» 4 Jan 2017 14:13:40.957 144 <190>1 2017-01-04T13:13:40.544710+00:00 app web.1 - - ** (DBConnection.ConnectionError) ssl send: closed
» 4 Jan 2017 14:13:40.957 194 <190>1 2017-01-04T13:13:40.544711+00:00 app web.1 - - (ecto) lib/ecto/adapters/postgres/connection.ex:115: Ecto.Adapters.Postgres.Connection.execute/4
» 4 Jan 2017 14:13:40.957 163 <190>1 2017-01-04T13:13:40.544712+00:00 app web.1 - - (ecto) lib/ecto/adapters/sql.ex:243: Ecto.Adapters.SQL.sql_call/6
» 4 Jan 2017 14:13:40.957 171 <190>1 2017-01-04T13:13:40.544712+00:00 app web.1 - - (ecto) lib/ecto/adapters/sql.ex:441: Ecto.Adapters.SQL.execute_or_reset/7
» 4 Jan 2017 14:13:40.957 166 <190>1 2017-01-04T13:13:40.544713+00:00 app web.1 - - (ecto) lib/ecto/repo/queryable.ex:130: Ecto.Repo.Queryable.execute/5
» 4 Jan 2017 14:13:40.957 161 <190>1 2017-01-04T13:13:40.544714+00:00 app web.1 - - (ecto) lib/ecto/repo/queryable.ex:35: Ecto.Repo.Queryable.all/4
» 4 Jan 2017 14:13:40.957 161 <190>1 2017-01-04T13:13:40.544715+00:00 app web.1 - - (ecto) lib/ecto/repo/queryable.ex:68: Ecto.Repo.Queryable.one/4
» 4 Jan 2017 14:13:40.958 213 <190>1 2017-01-04T13:13:40.544716+00:00 app web.1 - - (vault) web/jobs/data/processing/result_parser.ex:68: Vault.Jobs.Data.Processing.ResultParser.parse_metadata_list/3
» 4 Jan 2017 14:13:40.958 199 <190>1 2017-01-04T13:13:40.544717+00:00 app web.1 - - (vault) web/jobs/data/processing/result_parser.ex:50: Vault.Jobs.Data.Processing.ResultParser.parse/3
» 4 Jan 2017 14:13:40.958 188 <190>1 2017-01-04T13:13:40.544767+00:00 app web.1 - - 13:13:40.544 [info] Queueing job 1eb24985-d7fc-4fb6-a327-e3606179649c to retry in 44.0 seconds
» 4 Jan 2017 14:13:40.958 174 <190>1 2017-01-04T13:13:40.547998+00:00 app web.1 - - 13:13:40.547 [info] {%DBConnection.ConnectionError{message: "ssl send: closed"},
» 4 Jan 2017 14:13:40.958 144 <190>1 2017-01-04T13:13:40.548000+00:00 app web.1 - - [{Ecto.Adapters.Postgres.Connection, :execute, 4,
» 4 Jan 2017 14:13:40.958 160 <190>1 2017-01-04T13:13:40.548001+00:00 app web.1 - - [file: 'lib/ecto/adapters/postgres/connection.ex', line: 115]},
» 4 Jan 2017 14:13:40.958 129 <190>1 2017-01-04T13:13:40.548002+00:00 app web.1 - - {Ecto.Adapters.SQL, :sql_call, 6,
» 4 Jan 2017 14:13:40.958 144 <190>1 2017-01-04T13:13:40.548003+00:00 app web.1 - - [file: 'lib/ecto/adapters/sql.ex', line: 243]},
» 4 Jan 2017 14:13:40.958 137 <190>1 2017-01-04T13:13:40.548003+00:00 app web.1 - - {Ecto.Adapters.SQL, :execute_or_reset, 7,
» 4 Jan 2017 14:13:40.959 144 <190>1 2017-01-04T13:13:40.548004+00:00 app web.1 - - [file: 'lib/ecto/adapters/sql.ex', line: 441]},
» 4 Jan 2017 14:13:40.959 130 <190>1 2017-01-04T13:13:40.548004+00:00 app web.1 - - {Ecto.Repo.Queryable, :execute, 5,
» 4 Jan 2017 14:13:40.959 146 <190>1 2017-01-04T13:13:40.548005+00:00 app web.1 - - [file: 'lib/ecto/repo/queryable.ex', line: 130]},
» 4 Jan 2017 14:13:40.959 126 <190>1 2017-01-04T13:13:40.548005+00:00 app web.1 - - {Ecto.Repo.Queryable, :all, 4,
» 4 Jan 2017 14:13:40.959 145 <190>1 2017-01-04T13:13:40.548006+00:00 app web.1 - - [file: 'lib/ecto/repo/queryable.ex', line: 35]},
» 4 Jan 2017 14:13:40.959 126 <190>1 2017-01-04T13:13:40.548007+00:00 app web.1 - - {Ecto.Repo.Queryable, :one, 4,
» 4 Jan 2017 14:13:40.959 145 <190>1 2017-01-04T13:13:40.548007+00:00 app web.1 - - [file: 'lib/ecto/repo/queryable.ex', line: 68]},
» 4 Jan 2017 14:13:40.959 162 <190>1 2017-01-04T13:13:40.548008+00:00 app web.1 - - {Vault.Jobs.Data.Processing.ResultParser, :parse_metadata_list, 3,
» 4 Jan 2017 14:13:40.959 160 <190>1 2017-01-04T13:13:40.548009+00:00 app web.1 - - [file: 'web/jobs/data/processing/result_parser.ex', line: 68]},
» 4 Jan 2017 14:13:40.959 148 <190>1 2017-01-04T13:13:40.548009+00:00 app web.1 - - {Vault.Jobs.Data.Processing.ResultParser, :parse, 3,
» 4 Jan 2017 14:13:40.959 161 <190>1 2017-01-04T13:13:40.548010+00:00 app web.1 - - [file: 'web/jobs/data/processing/result_parser.ex', line: 50]}]}
» 4 Jan 2017 14:13:40.959 229 <190>1 2017-01-04T13:13:40.548041+00:00 app web.1 - - 13:13:40.547 [info] Elixir.Vault.Jobs.Data.Processing.AnalysisAudioDurationJob[1eb24985-d7fc-4fb6-a327-e3606179649c] fail: 138109ms sec

It seems that the PID in the log and the one in exq_ui do not match.

akira (Owner) commented Jan 6, 2017

never got cleared from the queue, blocking slot for other jobs

How are you seeing this? Is this from the UI only, or are tasks actually not executing? What happens if concurrency is 1? Or is it just manifested by #227?
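
(A minimal sketch of what testing with concurrency 1 could look like, assuming the per-queue concurrency form of Exq's config; the "heavy" queue name is taken from the job pasted above:)

import Config

config :exq,
  # Hypothetical test config: run the "heavy" queue with a single worker,
  # so a stuck job immediately blocks all processing and is easy to spot.
  queues: [{"heavy", 1}]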

mspanc (Author) commented Jan 6, 2017

I see that no jobs are running, and I also don't see anything in the logs.

I am unable to test this with concurrency 1.

akira (Owner) commented Jan 21, 2017

@mspanc I can't seem to replicate this. Could it possibly be caused by #231? I am seeing a memory issue under heavy load. I'm still investigating, but I have a fix for the stats part of it. I wonder if you were seeing increased memory as well.

mspanc (Author) commented Jan 23, 2017

I no longer have an environment I can use to test this.

@jbrowning

We are experiencing this in production with Exq 0.8.3. Occasionally, jobs get stuck in a perpetual processing state and hoard a process slot. Restarting does not fix it, and I have to manually clear out the process list via Exq.Api.clear_processes/1.

Let me know if there's anything you'd like me to do to try to narrow down the issue.
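
For anyone else hitting this, the manual cleanup is roughly the following (a sketch, assuming the Exq.Api server is running under its default registered name, Exq.Api):

# Clears the list of in-progress workers that Exq tracks in Redis;
# assumes the Exq.Api process is registered under its default name.
Exq.Api.clear_processes(Exq.Api)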

akira (Owner) commented Mar 9, 2017

@jbrowning Can you try Exq version 0.8.4? I recently added something that could potentially fix this.

@jbrowning

Ok, sounds good. Will let you know.

Thanks for the quick response @akira

@jbrowning

@akira we've had 0.8.4 in production for a few weeks now and still end up with stuck jobs. It seems to occur most often when a large number of jobs (2000+) are added to the queue at once.

akira (Owner) commented Mar 29, 2017

@jbrowning thanks for the update. Will take a look.

akira (Owner) commented Apr 3, 2017

@jbrowning I haven't had a chance to narrow this down yet.

Can you give me the following info if possible:

  • Your mix.lock file, or the versions of the libraries Exq depends on (like Redix).
  • Roughly how many messages you process (per hour/day, etc.).
  • Is there a pattern in how long the system is up before this happens?
  • Are there any particular types of exceptions in the log that could cause this (do your workers raise anything, or do they have clean runs)?
  • Whether you use any custom middleware, etc.
  • If you run ExqUI, maybe you can paste the dashboard stats and get info from there.

Thanks!

@j-mcnally (Collaborator)

Based on past experience I've had with Sidekiq exhibiting this behaviour, it was often a case of blocking IO that never finished and didn't have timeout code. The classic example is network/HTTP IO, where an HTTP request would just wait forever and not realize it was disconnected, because Ruby. Could it be similar? Curious if you are doing web scraping / HTTP or any other blocking IO?

akira (Owner) commented Apr 6, 2017

@jbrowning I was able to reproduce this in one case; I will try to get a fix ASAP.

akira (Owner) commented Apr 10, 2017

@jbrowning If you can, upgrade to Exq 0.8.6. There was a situation where Process.exit would have caused #251, so this is one possibility.

However, as @j-mcnally mentioned, the other possibility is that the worker is stuck waiting on something application-specific (an HTTP request, etc.). In that case, the fix would not help. You should be careful to set timeouts for everything. You can also set a timeout in your worker by calling :timer.kill_after(:timer.seconds(600)) before your processing, which in this case would kill the process after 10 minutes:

defmodule MyWorker do
  def perform do
    # Kill this worker process if it is still running after 10 minutes
    :timer.kill_after(:timer.seconds(600))
    # continue processing
  end
end
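
Another option (a sketch of the same idea rather than anything Exq provides; MyTimeoutWorker and do_blocking_work/1 are hypothetical names) is to run the blocking call in a Task and abandon it after a fixed timeout, so a hung HTTP request cannot pin the worker forever:

defmodule MyTimeoutWorker do
  # Sketch: run the blocking call in a Task and give up after 10 minutes
  # instead of letting the worker hang indefinitely.
  @timeout :timer.minutes(10)

  def perform(arg) do
    task = Task.async(fn -> do_blocking_work(arg) end)

    case Task.yield(task, @timeout) || Task.shutdown(task, :brutal_kill) do
      {:ok, result} -> result
      _ -> raise "work timed out after #{@timeout} ms"
    end
  end

  # Placeholder for the application-specific blocking IO (HTTP call, etc.)
  defp do_blocking_work(arg), do: arg
end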

@jbrowning

@akira thanks again. Will deploy 0.8.6 today and see how it goes.

lpil (Contributor) commented Apr 26, 2017

How did it go @jbrowning ? :)

akira (Owner) commented Apr 26, 2017

@jbrowning FYI, I'm going to have some more updates that could potentially help; I'm hoping to get them in this weekend.

@jbrowning

@akira does 0.8.7 contain those updates? We had this occur again this weekend.

mspanc (Author) commented Jun 5, 2017

@jbrowning I know that advertising another project here is not the most gracious thing to do to the author of exq, but I see you've been struggling for quite a while. When I had the same issues with exq, I ended up writing https://github.com/mspanc/jumbo, and it has been working for us under really high loads for a few months already.

@stevedomin

@mspanc Great job on the library. Not having the jobs persisted is a bit of a killer in this case, no? Do you use hot code reload for deploys?

lpil (Contributor) commented Jun 6, 2017

Possibly a good conversation for the Jumbo repo? :)

@stevedomin

@lpil definitely, my bad.

mspanc (Author) commented Jun 6, 2017

@stevedomin mspanc/jumbo#1

akira (Owner) commented Jun 11, 2017

@jbrowning Yes, they include the fix for Exq running under high load (fixing the issue with memory increasing). With those changes I was able to run Exq for sustained periods with the maximum amount of messages going through.

I originally suspected you were having that issue; however, at this point I think this is not due to load but rather the type of message that causes it. It would be great to have a reproducible case for this. By any chance, do you see any stack traces for middleware in your exception logs?

@jbrowning

Again, thanks for all your help on this @akira. After deploying 0.9.0 we haven't seen any more "stuck jobs", but I have been seeing enqueue calls timing out.

{:timeout, {GenServer, :call, [Exq, {:enqueue, "default", Worker, [params], []}, 5000]}}

Were there any changes in 0.9.0 that would have caused the Exq service to block on these calls?

Happy to open this as a new issue if you'd rather do that.
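
For context, a sketch of how the caller could be kept from crashing while this is investigated (not something Exq provides; SafeEnqueue is a hypothetical wrapper): the timeout above surfaces as an exit from GenServer.call/3, so it can be caught explicitly.

defmodule SafeEnqueue do
  require Logger

  # Hypothetical wrapper: catch the :exit raised when the GenServer call
  # behind Exq.enqueue/4 times out, log it, and return an error tuple
  # instead of crashing the calling process.
  def enqueue(queue, worker, args) do
    try do
      Exq.enqueue(Exq, queue, worker, args)
    catch
      :exit, reason ->
        Logger.error("Exq enqueue timed out: #{inspect(reason)}")
        {:error, :timeout}
    end
  end
end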

lpil (Contributor) commented Jun 29, 2017

Perhaps we should close this issue and open a new one for that timeout? :)

@hubertlepicki

For people who stumble upon this error on production systems:

I believe the error:

  {:timeout, {GenServer, :call, [Exq, {:enqueue, "default", Worker, [params], []}, 5000]}}

may be happening in production because params is something that can't be serialized to JSON. For some reason I get the same timeout error when this happens in production, while in dev the error is different:

Exq.enqueue(Exq, "mailgun_mailer", Worker, [[{"hej", 1}]])
09:52:34.249 [error] GenServer Exq.Enqueuer terminating
** (Protocol.UndefinedError) protocol Jason.Encoder not implemented for {"hej", 1}, Jason.Encoder protocol must always be explicitly implemented. This protocol is implemented for: Ecto.Association.NotLoaded, Ecto.Schema.Metadata, Integer, Date, List, Decimal, NaiveDateTime, Time, Jason.Fragment, Float, BitString, Any, DateTime, Map, Atom            
    (jason) lib/jason.ex:150: Jason.encode!/2
    (exq) lib/exq/redis/job_queue.ex:447: Exq.Redis.JobQueue.to_job_serialized/5
    (exq) lib/exq/redis/job_queue.ex:22: Exq.Redis.JobQueue.enqueue/6
    (exq) lib/exq/enqueuer/server.ex:39: Exq.Enqueuer.Server.handle_cast/2
    (stdlib) gen_server.erl:637: :gen_server.try_dispatch/4
    (stdlib) gen_server.erl:711: :gen_server.handle_msg/6
    (stdlib) proc_lib.erl:249: :proc_lib.init_p_do_apply/3
Last message: {:"$gen_cast", {:enqueue, {#PID<0.496.0>, #Reference<0.91093133.1862008833.102800>}, "mailgun_mailer", Worker, [[{"hej", 1}]], []}}                              
State: %Exq.Enqueuer.Server.State{namespace: "exq", redis: Exq.Redis.Client}
** (exit) exited in: GenServer.call(Exq, {:enqueue, "mailgun_mailer", Worker, [[{"hej", 1}]], []}, 5000)                                                                       
    ** (EXIT) time out
    (elixir) lib/gen_server.ex:989: GenServer.call/3

while in production:

iex> Exq.enqueue(Exq, "mailgun_mailer", Worker, [[{"hej", 1}]])                                                                                   
** (exit) exited in: GenServer.call(Exq, {:enqueue, "mailgun_mailer", Worker, [[{"hej", 1}]], []}, 5000)                                                                        
    ** (EXIT) time out
    (elixir) lib/gen_server.ex:989: GenServer.call/3
iex> 09:52:34.249 [error] GenServer Exq.Enqueuer terminating 

and then only the timeout is reported to my exception tracking (AppSignal); the original error is there in the logs, but AppSignal doesn't associate it with the exception.
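
A defensive workaround (my own sketch, not part of Exq; EnqueueHelper and safe_enqueue/3 are hypothetical names) is to check that the args encode to JSON before handing them to Exq, assuming Jason is the configured serializer as in the stack trace above, so the failure shows up at the call site instead of as an opaque timeout:

defmodule EnqueueHelper do
  # Hypothetical helper: verify the args are JSON-encodable (assuming Jason
  # is the serializer in use) before enqueueing, so a bad payload fails
  # loudly at the call site rather than as a GenServer timeout.
  def safe_enqueue(queue, worker, args) do
    case Jason.encode(args) do
      {:ok, _json} -> Exq.enqueue(Exq, queue, worker, args)
      {:error, reason} -> {:error, {:unserializable_args, reason}}
    end
  end
end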
