
Jobs randomly get stuck #228

Closed
mspanc opened this issue Jan 5, 2017 · 26 comments

mspanc commented Jan 5, 2017

I have found cases where jobs get stuck.

Here is a sample job (data retrieved via exq_ui):

PID: vault:<0.15483.6>
queue: heavy
module: Vault.Jobs.Data.Processing.AnalysisAudioDurationJob
args: 9b091696-5ce7-4e2c-a2f5-a904f784d964,true
started at: Wed Jan 04 2017 23:00:52 GMT+0100 (CET)

It ended with an exception and never got cleared from the queue, blocking a slot for other jobs.

4 Jan 2017 14:13:40.956 234 <190>1 2017-01-04T13:13:40.543771+00:00 app web.1 - - 13:13:40.543 [info] [Elixir.Vault.Jobs.Data.Processing.AnalysisAudioDurationJob #PID<0.19642.2> 9b091696-5ce7-4e2c-a2f5-a904f784d964] Saving
» 4 Jan 2017 14:13:40.957 158 <190>1 2017-01-04T13:13:40.544708+00:00 app web.1 - - 13:13:40.544 [error] Process #PID<0.19642.2> raised an exception
» 4 Jan 2017 14:13:40.957 144 <190>1 2017-01-04T13:13:40.544710+00:00 app web.1 - - ** (DBConnection.ConnectionError) ssl send: closed
» 4 Jan 2017 14:13:40.957 194 <190>1 2017-01-04T13:13:40.544711+00:00 app web.1 - - (ecto) lib/ecto/adapters/postgres/connection.ex:115: Ecto.Adapters.Postgres.Connection.execute/4
» 4 Jan 2017 14:13:40.957 163 <190>1 2017-01-04T13:13:40.544712+00:00 app web.1 - - (ecto) lib/ecto/adapters/sql.ex:243: Ecto.Adapters.SQL.sql_call/6
» 4 Jan 2017 14:13:40.957 171 <190>1 2017-01-04T13:13:40.544712+00:00 app web.1 - - (ecto) lib/ecto/adapters/sql.ex:441: Ecto.Adapters.SQL.execute_or_reset/7
» 4 Jan 2017 14:13:40.957 166 <190>1 2017-01-04T13:13:40.544713+00:00 app web.1 - - (ecto) lib/ecto/repo/queryable.ex:130: Ecto.Repo.Queryable.execute/5
» 4 Jan 2017 14:13:40.957 161 <190>1 2017-01-04T13:13:40.544714+00:00 app web.1 - - (ecto) lib/ecto/repo/queryable.ex:35: Ecto.Repo.Queryable.all/4
» 4 Jan 2017 14:13:40.957 161 <190>1 2017-01-04T13:13:40.544715+00:00 app web.1 - - (ecto) lib/ecto/repo/queryable.ex:68: Ecto.Repo.Queryable.one/4
» 4 Jan 2017 14:13:40.958 213 <190>1 2017-01-04T13:13:40.544716+00:00 app web.1 - - (vault) web/jobs/data/processing/result_parser.ex:68: Vault.Jobs.Data.Processing.ResultParser.parse_metadata_list/3
» 4 Jan 2017 14:13:40.958 199 <190>1 2017-01-04T13:13:40.544717+00:00 app web.1 - - (vault) web/jobs/data/processing/result_parser.ex:50: Vault.Jobs.Data.Processing.ResultParser.parse/3
» 4 Jan 2017 14:13:40.958 188 <190>1 2017-01-04T13:13:40.544767+00:00 app web.1 - - 13:13:40.544 [info] Queueing job 1eb24985-d7fc-4fb6-a327-e3606179649c to retry in 44.0 seconds
» 4 Jan 2017 14:13:40.958 174 <190>1 2017-01-04T13:13:40.547998+00:00 app web.1 - - 13:13:40.547 [info] {%DBConnection.ConnectionError{message: "ssl send: closed"},
» 4 Jan 2017 14:13:40.958 144 <190>1 2017-01-04T13:13:40.548000+00:00 app web.1 - - [{Ecto.Adapters.Postgres.Connection, :execute, 4,
» 4 Jan 2017 14:13:40.958 160 <190>1 2017-01-04T13:13:40.548001+00:00 app web.1 - - [file: 'lib/ecto/adapters/postgres/connection.ex', line: 115]},
» 4 Jan 2017 14:13:40.958 129 <190>1 2017-01-04T13:13:40.548002+00:00 app web.1 - - {Ecto.Adapters.SQL, :sql_call, 6,
» 4 Jan 2017 14:13:40.958 144 <190>1 2017-01-04T13:13:40.548003+00:00 app web.1 - - [file: 'lib/ecto/adapters/sql.ex', line: 243]},
» 4 Jan 2017 14:13:40.958 137 <190>1 2017-01-04T13:13:40.548003+00:00 app web.1 - - {Ecto.Adapters.SQL, :execute_or_reset, 7,
» 4 Jan 2017 14:13:40.959 144 <190>1 2017-01-04T13:13:40.548004+00:00 app web.1 - - [file: 'lib/ecto/adapters/sql.ex', line: 441]},
» 4 Jan 2017 14:13:40.959 130 <190>1 2017-01-04T13:13:40.548004+00:00 app web.1 - - {Ecto.Repo.Queryable, :execute, 5,
» 4 Jan 2017 14:13:40.959 146 <190>1 2017-01-04T13:13:40.548005+00:00 app web.1 - - [file: 'lib/ecto/repo/queryable.ex', line: 130]},
» 4 Jan 2017 14:13:40.959 126 <190>1 2017-01-04T13:13:40.548005+00:00 app web.1 - - {Ecto.Repo.Queryable, :all, 4,
» 4 Jan 2017 14:13:40.959 145 <190>1 2017-01-04T13:13:40.548006+00:00 app web.1 - - [file: 'lib/ecto/repo/queryable.ex', line: 35]},
» 4 Jan 2017 14:13:40.959 126 <190>1 2017-01-04T13:13:40.548007+00:00 app web.1 - - {Ecto.Repo.Queryable, :one, 4,
» 4 Jan 2017 14:13:40.959 145 <190>1 2017-01-04T13:13:40.548007+00:00 app web.1 - - [file: 'lib/ecto/repo/queryable.ex', line: 68]},
» 4 Jan 2017 14:13:40.959 162 <190>1 2017-01-04T13:13:40.548008+00:00 app web.1 - - {Vault.Jobs.Data.Processing.ResultParser, :parse_metadata_list, 3,
» 4 Jan 2017 14:13:40.959 160 <190>1 2017-01-04T13:13:40.548009+00:00 app web.1 - - [file: 'web/jobs/data/processing/result_parser.ex', line: 68]},
» 4 Jan 2017 14:13:40.959 148 <190>1 2017-01-04T13:13:40.548009+00:00 app web.1 - - {Vault.Jobs.Data.Processing.ResultParser, :parse, 3,
» 4 Jan 2017 14:13:40.959 161 <190>1 2017-01-04T13:13:40.548010+00:00 app web.1 - - [file: 'web/jobs/data/processing/result_parser.ex', line: 50]}]}
» 4 Jan 2017 14:13:40.959 229 <190>1 2017-01-04T13:13:40.548041+00:00 app web.1 - - 13:13:40.547 [info] Elixir.Vault.Jobs.Data.Processing.AnalysisAudioDurationJob[1eb24985-d7fc-4fb6-a327-e3606179649c] fail: 138109ms sec

It seems that the PID in the log and the one in exq_ui do not match.

akira (Owner) commented Jan 6, 2017

never got cleared from the queue, blocking slot for other jobs

How are you seeing this? Is this from the UI only, or are tasks actually not executing? What happens if concurrency is 1? Or is it just manifested by #227?
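
(A minimal sketch of what testing with concurrency 1 could look like, assuming the per-queue concurrency form of Exq's config; the "heavy" queue name is taken from the job pasted above:)

import Config

config :exq,
  # Hypothetical test config: run the "heavy" queue with a single worker,
  # so a stuck job immediately blocks all processing and is easy to spot.
  queues: [{"heavy", 1}]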

mspanc (Author) commented Jan 6, 2017

I see that no jobs are running, and I also don't see anything in the logs.

I am unable to test this with concurrency 1.

akira (Owner) commented Jan 21, 2017

@mspanc I can't seem to replicate this. Could it possibly be caused by #231? I am seeing a memory issue under heavy load. I'm still investigating, but I have a fix for the stats part of it. I wonder if you were seeing increased memory as well.

mspanc (Author) commented Jan 23, 2017

I no longer have an environment I can use to test this.

@jbrowning

We are experiencing this in production with Exq 0.8.3. Occasionally, jobs get stuck in a perpetual processing state and hoard a process slot. Restarting does not fix it, and I have to manually clear out the process list via Exq.Api.clear_processes/1.

Let me know if there's anything you'd like me to do to try to narrow down the issue.
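
For anyone else hitting this, the manual cleanup is roughly the following (a sketch, assuming the Exq.Api server is running under its default registered name, Exq.Api):

# Clears the list of in-progress workers that Exq tracks in Redis;
# assumes the Exq.Api process is registered under its default name.
Exq.Api.clear_processes(Exq.Api)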

akira (Owner) commented Mar 9, 2017

@jbrowning Can you try Exq version 0.8.4? I recently added something that could potentially fix this.

@jbrowning

Ok, sounds good. Will let you know.

Thanks for the quick response @akira

@jbrowning

@akira we've had 0.8.4 in production for a few weeks now and still end up with stuck jobs. It seems to occur most often when a large number of jobs (2000+) are added to the queue at once.

akira (Owner) commented Mar 29, 2017

@jbrowning thanks for the update. Will take a look.

akira (Owner) commented Apr 3, 2017

@jbrowning I haven't had a chance to narrow this down yet.

Can you give me the following info if possible:

  • Your mix.lock file, or the versions of the libraries Exq depends on (like Redix).
  • Roughly how many messages you process (per hour/day, etc.).
  • Is there a pattern in how long the system is up before this happens?
  • Are there any particular types of exceptions in the log that could cause this (do your workers raise anything, or do they have clean runs)?
  • Whether you use any custom middleware, etc.
  • If you run ExqUI, maybe you can paste the dashboard stats and get info from there.

Thanks!

@j-mcnally (Collaborator)

Based on past experience I've had with Sidekiq exhibiting this behaviour, it was often a case of blocking IO that never finished and didn't have timeout code. The classic example is network/HTTP IO, where an HTTP request would just wait forever and not realize it was disconnected, because Ruby. Could it be similar? Curious if you are doing web scraping / HTTP or any other blocking IO?

akira (Owner) commented Apr 6, 2017

@jbrowning I was able to reproduce this in one case; I will try to get a fix ASAP.

akira (Owner) commented Apr 10, 2017

@jbrowning If you can, upgrade to Exq 0.8.6. There was a situation where Process.exit would have caused #251, so this is one possibility.

However, as @j-mcnally mentioned, the other possibility is that the worker is stuck waiting on something application-specific (an HTTP request, etc.). In that case, the fix would not help. You should be careful to set timeouts for everything. You can also set a timeout in your worker by calling :timer.kill_after(:timer.seconds(600)) before your processing, which in this case would kill the process after 10 minutes:

defmodule MyWorker do
  def perform do
    # Kill this worker process if it is still running after 10 minutes
    :timer.kill_after(:timer.seconds(600))
    # continue processing
  end
end
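
Another option (a sketch of the same idea rather than anything Exq provides; MyTimeoutWorker and do_blocking_work/1 are hypothetical names) is to run the blocking call in a Task and abandon it after a fixed timeout, so a hung HTTP request cannot pin the worker forever:

defmodule MyTimeoutWorker do
  # Sketch: run the blocking call in a Task and give up after 10 minutes
  # instead of letting the worker hang indefinitely.
  @timeout :timer.minutes(10)

  def perform(arg) do
    task = Task.async(fn -> do_blocking_work(arg) end)

    case Task.yield(task, @timeout) || Task.shutdown(task, :brutal_kill) do
      {:ok, result} -> result
      _ -> raise "work timed out after #{@timeout} ms"
    end
  end

  # Placeholder for the application-specific blocking IO (HTTP call, etc.)
  defp do_blocking_work(arg), do: arg
end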

@jbrowning

@akira thanks again. Will deploy 0.8.6 today and see how it goes.

lpil (Contributor) commented Apr 26, 2017

How did it go @jbrowning ? :)

akira (Owner) commented Apr 26, 2017

@jbrowning FYI, I'm going to have some more updates that could potentially help; I'm hoping to get them in this weekend.

@jbrowning

@akira does 0.8.7 contain those updates? We had this occur again this weekend.

mspanc (Author) commented Jun 5, 2017

@jbrowning I know that advertising another project here is not the most gracious thing to do to the author of exq, but I see you've been struggling for quite a while. When I had the same issues with exq, I ended up writing https://github.com/mspanc/jumbo, and it has been working for us under really high loads for a few months already.

@stevedomin

@mspanc Great job on the library. Not having the jobs persisted is a bit of a killer in this case, no? Do you use hot code reload for deploys?

lpil (Contributor) commented Jun 6, 2017

Possibly a good conversation for the Jumbo repo? :)

@stevedomin

@lpil definitely, my bad.

mspanc (Author) commented Jun 6, 2017

@stevedomin mspanc/jumbo#1

akira (Owner) commented Jun 11, 2017

@jbrowning Yes, they include the fix for Exq running under high load (fixing the issue with memory increasing). With those changes I was able to run Exq for sustained periods with the maximum amount of messages going through.

I originally suspected you were having that issue; however, at this point I think this is not due to load but rather the type of message that causes it. It would be great to have a reproducible case for this. By any chance, do you see any stack traces for middleware in your exception logs?

@jbrowning

Again, thanks for all your help on this @akira. After deploying 0.9.0 we haven't seen any more "stuck jobs", but I have been seeing enqueue calls timing out.

{:timeout, {GenServer, :call, [Exq, {:enqueue, "default", Worker, [params], []}, 5000]}}

Were there any changes in 0.9.0 that would have caused the Exq service to block on these calls?

Happy to open this as a new issue if you'd rather do that.
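
For context, a sketch of how the caller could be kept from crashing while this is investigated (not something Exq provides; SafeEnqueue is a hypothetical wrapper): the timeout above surfaces as an exit from GenServer.call/3, so it can be caught explicitly.

defmodule SafeEnqueue do
  require Logger

  # Hypothetical wrapper: catch the :exit raised when the GenServer call
  # behind Exq.enqueue/4 times out, log it, and return an error tuple
  # instead of crashing the calling process.
  def enqueue(queue, worker, args) do
    try do
      Exq.enqueue(Exq, queue, worker, args)
    catch
      :exit, reason ->
        Logger.error("Exq enqueue timed out: #{inspect(reason)}")
        {:error, :timeout}
    end
  end
end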

lpil (Contributor) commented Jun 29, 2017

Perhaps we should close this issue and open a new one for that timeout? :)

@hubertlepicki

For people who stumble upon this error on production systems:

I believe the error:

  {:timeout, {GenServer, :call, [Exq, {:enqueue, "default", Worker, [params], []}, 5000]}}

may be happening in production because params is something that can't be serialized to JSON. For some reason I get the same timeout error when this happens in production, while in dev the error is different:

Exq.enqueue(Exq, "mailgun_mailer", Worker, [[{"hej", 1}]])
09:52:34.249 [error] GenServer Exq.Enqueuer terminating
** (Protocol.UndefinedError) protocol Jason.Encoder not implemented for {"hej", 1}, Jason.Encoder protocol must always be explicitly implemented. This protocol is implemented for: Ecto.Association.NotLoaded, Ecto.Schema.Metadata, Integer, Date, List, Decimal, NaiveDateTime, Time, Jason.Fragment, Float, BitString, Any, DateTime, Map, Atom            
    (jason) lib/jason.ex:150: Jason.encode!/2
    (exq) lib/exq/redis/job_queue.ex:447: Exq.Redis.JobQueue.to_job_serialized/5
    (exq) lib/exq/redis/job_queue.ex:22: Exq.Redis.JobQueue.enqueue/6
    (exq) lib/exq/enqueuer/server.ex:39: Exq.Enqueuer.Server.handle_cast/2
    (stdlib) gen_server.erl:637: :gen_server.try_dispatch/4
    (stdlib) gen_server.erl:711: :gen_server.handle_msg/6
    (stdlib) proc_lib.erl:249: :proc_lib.init_p_do_apply/3
Last message: {:"$gen_cast", {:enqueue, {#PID<0.496.0>, #Reference<0.91093133.1862008833.102800>}, "mailgun_mailer", Worker, [[{"hej", 1}]], []}}                              
State: %Exq.Enqueuer.Server.State{namespace: "exq", redis: Exq.Redis.Client}
** (exit) exited in: GenServer.call(Exq, {:enqueue, "mailgun_mailer", Worker, [[{"hej", 1}]], []}, 5000)                                                                       
    ** (EXIT) time out
    (elixir) lib/gen_server.ex:989: GenServer.call/3

while in production:

iex> Exq.enqueue(Exq, "mailgun_mailer", Worker, [[{"hej", 1}]])                                                                                   
** (exit) exited in: GenServer.call(Exq, {:enqueue, "mailgun_mailer", Worker, [[{"hej", 1}]], []}, 5000)                                                                        
    ** (EXIT) time out
    (elixir) lib/gen_server.ex:989: GenServer.call/3
iex> 09:52:34.249 [error] GenServer Exq.Enqueuer terminating 

and then only the timeout is reported to my exception tracking (AppSignal); the original error is there in the logs, but AppSignal doesn't associate it with the exception.
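
A defensive workaround (my own sketch, not part of Exq; EnqueueHelper and safe_enqueue/3 are hypothetical names) is to check that the args encode to JSON before handing them to Exq, assuming Jason is the configured serializer as in the stack trace above, so the failure shows up at the call site instead of as an opaque timeout:

defmodule EnqueueHelper do
  # Hypothetical helper: verify the args are JSON-encodable (assuming Jason
  # is the serializer in use) before enqueueing, so a bad payload fails
  # loudly at the call site rather than as a GenServer timeout.
  def safe_enqueue(queue, worker, args) do
    case Jason.encode(args) do
      {:ok, _json} -> Exq.enqueue(Exq, queue, worker, args)
      {:error, reason} -> {:error, {:unserializable_args, reason}}
    end
  end
end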
