
Deadlocks and excessive CPU consumption with async responses. #447

Closed
obmarg opened this issue Oct 20, 2017 · 42 comments

obmarg (Contributor) commented Oct 20, 2017

I've (again) been load testing a proxy application that makes use of async responses, against the latest master branch of hackney.

I'm basically spinning up 1000 clients that talk to our service; each request those clients make kicks off a request using hackney. After a reasonably small number of requests (8000-9000 before 1ab27d4; it seems to be more after that) things start going wrong:

  1. One of my processes that was receiving a response crashes.
  2. Hackney manager crashes with
13:10:34.417 [error] GenServer :hackney_manager terminating
** (MatchError) no match of right hand side value: {:error, :normal}
    (hackney) /Users/graeme/src/thirdparty/hackney/src/hackney_manager.erl:525: :hackney_manager.clean_requests/5
    (hackney) /Users/graeme/src/thirdparty/hackney/src/hackney_manager.erl:499: :hackney_manager.handle_owner_exit/4
    (stdlib) gen_server.erl:616: :gen_server.try_dispatch/4
    (stdlib) gen_server.erl:686: :gen_server.handle_msg/6
    (stdlib) proc_lib.erl:247: :proc_lib.init_p_do_apply/3
  3. Hackney seems to be deadlocked somehow, as no further requests ever complete - though nothing ever crashes either.
  4. beam.smp CPU usage shoots up to 500%. This seems to be caused by hackney_sup, which permanently has a message queue of around 1000 messages.

I've done some bisecting around this issue, and it definitely seems to have been introduced as part of #443.
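A minimal sketch of such a load pattern (the module name, URL, and concurrency knobs below are hypothetical, not taken from the report above):

# Hedged sketch of the load pattern described above: many concurrent clients,
# each issuing requests that the proxy turns into outgoing hackney requests.
defmodule ProxyLoadTest do
  def run(clients \\ 1000, requests_per_client \\ 10) do
    1..clients
    |> Task.async_stream(
      fn _ ->
        for _ <- 1..requests_per_client do
          # Placeholder URL for the proxied service under test.
          _ = HTTPoison.get("http://localhost:4000/some/proxied/path")
        end
      end,
      max_concurrency: clients,
      timeout: :infinity
    )
    |> Stream.run()
  end
end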

benoitc (Owner) commented Oct 20, 2017

@obmarg is this with 1.10.1? The problem should be fixed in that version. Let me know.

obmarg (Contributor, Author) commented Oct 20, 2017

Yes, this is with 1.10.1. The fix you made in 1.10.1 does seem to have made the error condition less likely, but it still happens sooner or later.

benoitc (Owner) commented Oct 20, 2017

@obmarg can you share a snippet of your code? I would like to understand what it does.

benoitc (Owner) commented Oct 20, 2017

not sure how you get that error on line 525 though... I don't see how it can happen right now.

benoitc (Owner) commented Oct 20, 2017

@obmarg also what is in the supervisor queue? Did you look at it? (erlang:process_info or observer should do the trick)
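For reference, a quick way to look at that from IEx with standard process introspection (assuming hackney_sup is locally registered under that name, as it appears in observer):

pid = Process.whereis(:hackney_sup)
# How backed up is the supervisor's mailbox?
Process.info(pid, :message_queue_len)
# Peek at the first few queued messages.
{:messages, messages} = Process.info(pid, :messages)
Enum.take(messages, 3)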

obmarg (Contributor, Author) commented Oct 20, 2017

The code itself is reasonably simple, so sure. I've got this in one function:

HTTPoison.request(
  request.method,
  request.url,
  request.body,
  request.headers,
  hackney: [follow_redirect: false],
  timeout: deadline,
  recv_timeout: deadline,
  stream_to: self()
)

# conn here is the Plug.Conn from the surrounding handler (not shown in this snippet)
receive_response(conn)

Then there's the receive_response function:

  defp receive_response(conn, status_code \\ nil) do
    receive do
      %HTTPoison.Error{reason: :timeout} ->
        raise Securi.Exceptions.DeadlineExceeded
      %HTTPoison.Error{reason: {:closed, :timeout}} ->
        raise Securi.Exceptions.DeadlineExceeded
      %HTTPoison.Error{reason: :connect_timeout} ->
        raise Securi.Exceptions.DeadlineExceeded
      %HTTPoison.Error{reason: :closed} ->
        raise Securi.Exceptions.ConnectionClosedError
      %HTTPoison.Error{reason: reason} ->
        Logger.warn "HTTP Error: #{inspect reason}"
        raise Securi.Exceptions.UnknownHttpError
      %HTTPoison.AsyncStatus{code: status_code} ->
        receive_response(conn, status_code)
      %HTTPoison.AsyncHeaders{headers: headers} ->
        # Trimming this bit, but I just strip out some headers & send them with plug.
        conn = Plug.Conn.send_chunked(conn, status_code)
        receive_response(conn, status_code)
      %HTTPoison.AsyncChunk{chunk: chunk} ->
        {:ok, conn} = Plug.Conn.chunk(conn, chunk)
        receive_response(conn, status_code)
      %HTTPoison.AsyncEnd{} ->
        {:ok, conn}
    end
  end

It's usually the raise Securi.Exceptions.DeadlineExceeded code path being hit that triggers this issue, though not every time it is hit...

Here's a snippet of the supervisor queue.

[{:"$gen_call",
   {#PID<0.29361.0>, #Reference<0.640257458.685244423.43598>},
   {:start_child,
    {:default,
     {:hackney_pool, :start_link,
      [:default,
       [{:name, :default}, :async, {:stream_to, #PID<0.29362.0>},
        {:recv_timeout, 60000}, {:connect_timeout, 60000},
        {:follow_redirect, false}]]}, :permanent, 10000, :worker,
     [:hackney_pool]}}},
  {:"$gen_call", {#PID<0.28708.0>, #Reference<0.640257458.684982280.261322>},
   {:start_child,
    {:default,
     {:hackney_pool, :start_link,
      [:default,
       [{:name, :default}, :async, {:stream_to, #PID<0.28710.0>},
        {:recv_timeout, 60000}, {:connect_timeout, 60000},
        {:follow_redirect, false}]]}, :permanent, 10000, :worker,
     [:hackney_pool]}}},

It seems to just be repeats of those entries all the way down.

obmarg (Contributor, Author) commented Oct 20, 2017

Not sure if it's relevant, but it also appears that there are a bunch of Elixir.HTTPoison.transformer processes still around as well. (That's the intermediate process that HTTPoison uses to transform the hackney async messages into HTTPoison messages.)

Though I suppose that might make sense if we're expecting hackney_sup to successfully restart stream processes so they can continue sending...

obmarg (Contributor, Author) commented Oct 20, 2017

I've added a bit of logging to the monitor_child function:

monitor_child(Pid) ->
  erlang:monitor(process, Pid),
  unlink(Pid),

  receive
    {'EXIT', Pid, Reason} ->
      receive
        {'DOWN', _, process, Pid, normal} ->
          ok;
        {'DOWN', _, process, Pid, OtherReason} ->
          io:format("Down for other reason... ~n"),
          io:format(Reason),
          io:format(OtherReason),
          {error, Reason}
      end
  after 0 ->
          ok
  end.

Which gives me the following output:

Down for other reason...
normal
noproc

benoitc (Owner) commented Oct 20, 2017 via email

benoitc (Owner) commented Oct 20, 2017 via email

benoitc (Owner) commented Oct 23, 2017

@obmarg this should be fixed in master. I believe the issue was due to noproc not being matched. Let me know.

benoitc (Owner) commented Oct 24, 2017

@obmarg ping. Did you reproduce it with latest master?

obmarg (Contributor, Author) commented Oct 24, 2017

@benoitc The noproc match seems to have fixed the crash and the excessive CPU consumption that happened immediately afterwards.

However, I now seem to be getting a different issue: after a while, all my processes crash with a timeout when calling hackney_manager:start_async_response. Looking at hackney_manager in observer, it appears to be stuck in wait_async_response/1.
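For anyone else chasing this, the same thing can be checked without observer, using standard process introspection (assuming the manager is registered as :hackney_manager, as the crash log above suggests):

pid = Process.whereis(:hackney_manager)
# Where is the manager currently executing, and how backed up is its mailbox?
Process.info(pid, [:current_function, :message_queue_len, :status])
# e.g. [current_function: {:hackney_manager, :wait_async_response, 1}, ...]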

benoitc (Owner) commented Nov 7, 2017

@obmarg hrm, I need to look at it. I will do that sometime tomorrow.

naag commented Jan 11, 2018

We're also seeing this issue after upgrading from Hackney 1.8.0 to 1.10.1 (via HTTPoison 0.13.0) using the streaming function. Any update on this or do you need more information / support?

$ elixir --version
Erlang/OTP 20 [erts-9.1.5] [source] [64-bit] [smp:4:4] [ds:4:4:10] [async-threads:10] [hipe] [kernel-poll:false]

Elixir 1.5.2

benoitc (Owner) commented Jan 11, 2018

@naag can you let me know if latest master fixes it for you? I can do a release today.

naag commented Jan 11, 2018

We've got no way to reproduce the issue :-( It only came up after several days of successfully processing requests (this is also a kind of proxy application). As a hotfix, we currently restart our application once the issue is hit.

Do you have any hints on how we could force the issue to pop up?

benoitc (Owner) commented Jan 23, 2018

@naag no, not really; I can't reproduce it myself. I would advise moving to the latest stable, 1.11.0. The next major version will focus on making streams simpler.

kuon commented Jan 31, 2018

I don't know if this is related, but I use quantum to schedule jobs, and if I use fastimage in my job, the next invocation will hang on hackney_manager:wait_async_response.

In my job I do:

HTTPoison.get(some_url)
Fastimage.size(other_url)

Both calls work the first time, but on the next call, the HTTPoison.get call hangs indefinitely. Looking with observer, hackney is stuck in hackney_manager:wait_async_response.

I can reproduce the issue 100% of the time.

I use hackney 1.11.

I tried with hackney 1.9.0, and I don't have the issue. I tried with master as of dca6478 and still have the issue.

kuon commented Jan 31, 2018

I made an app that hangs; I hope it will help you debug the issue:

https://github.com/kuon/hackney_lock

benoitc (Owner) commented Jan 31, 2018 via email

kuon commented Jan 31, 2018

The point is that it hangs on the next "plain request", which is unrelated to fastimage. Fastimage uses a hackney stream and is causing the deadlock. It might be a bug in how fastimage is using hackney, but since the issue is not present with hackney 1.9, I thought it might be hackney's fault.

I don't know what it is about this particular URL that causes the bug, whether it's the 403 (which I am not sure I had yesterday) or something else (headers...).

Again it does not lock with hackney 1.9.

stephenmoloney commented:
I have some time in the second half of February allocated to do some work on Fastimage so I should have time to resolve this by March.

In the meantime, I'll have a quick look and post back.

benoitc (Owner) commented Jan 31, 2018

@kuon ah ok. I will have a look at fastimage to see what is happening. (cc @stephenmoloney)

stephenmoloney commented Jan 31, 2018

@kuon I took a look at your repository

Ok, here is a more understandable error:

Fastimage.size <url>
** (RuntimeError) error, could not open image file with error 403 due to reason, Forbidden
    (fastimage) lib/fastimage.ex:101: Fastimage.stream_chunks/5
    (fastimage) lib/fastimage.ex:66: Fastimage.recv/1
    (fastimage) lib/fastimage.ex:40: Fastimage.size/1

Seems like the image URL is protected by some sort of auth, hence the 403 error.

I think it crashes here

So because there is a 403 forbidden, Fastimage is just raising a runtime error.

One of the things I have planned for the February work is to add new functions for handling errors better: Fastimage.size!/1 and Fastimage.size/1, so errors can then be handled.

In that release, Fastimage.size/1 will return something like
{:error, "error, could not open image file with error 403 due to reason, Forbidden"}
and the error can then be handled.

If people want support for auth when fetching images, they could open a separate issue in Fastimage for that.

For now, if you do this, it will stop crashing your supervisor:

    try do
      Fastimage.size(image)
    rescue
      e in RuntimeError ->
        Logger.error(e.message)
    end

stephenmoloney commented Jan 31, 2018

This is not a hackney problem. It seems like an authorization issue.

kuon commented Jan 31, 2018

@stephenmoloney The problem is that it puts hackney into a state where the next request, from another process, hangs. In my original code, I do catch the fastimage exception in a try/rescue.

kuon commented Jan 31, 2018

I made an update to my demo repository with the actual URL list I had when encountering the issue. I know that some URLs are invalid, but now I catch the fastimage error, and no exception is raised.

You will notice that after having looped through all the URLs, it hangs on the second job run, here:

16:08:34.001 [info]  Performing job
16:08:34.001 [info]  Making plain get request

stephenmoloney commented:
@kuon If it stops there, that means your OpenGraph fetch is hanging... Otherwise you should see
[info] Making stream request using fastimage

      Logger.info("Getting image URL")

      with {:ok, %{image: image}} <- OpenGraph.fetch(url) do
        Logger.info("Making stream request using fastimage")

I don't think the hanging is due to fastimage.

kuon commented Jan 31, 2018

It is not the fastimage call that is hanging; what is hanging is the HTTPoison.get at the top (the one accessing Google).

Fastimage is causing hackney to lock on the next call from another process. I think it has something to do with fastimage using streams.

Basically, the sequence is the following (a minimal sketch of it is shown after the list):

  • spawn process A (first invocation of the job by the job scheduler):
      • http get: works
      • fastimage call: works
      • http get: works
      • fastimage call: works
      • ... (repeat the above for each URL)
  • spawn process B (second invocation of the job by the job scheduler):
      • http get: locks hackney
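A self-contained sketch of that sequence (the URLs and module name are placeholders; the real reproduction is in the hackney_lock repo linked above):

defmodule HackneyLockSketch do
  def run do
    # Process A: first job invocation - a plain GET, then a streaming
    # Fastimage call that fails (e.g. on a 403 URL) and is rescued.
    Task.async(fn ->
      {:ok, _} = HTTPoison.get("https://www.google.com")

      try do
        Fastimage.size("https://example.com/forbidden.jpg")
      rescue
        e in RuntimeError -> IO.puts("fastimage failed: " <> e.message)
      end
    end)
    |> Task.await(30_000)

    # Process B: second job invocation; this plain GET is the call that
    # reportedly hangs in hackney_manager:wait_async_response.
    Task.async(fn -> HTTPoison.get("https://www.google.com") end)
    |> Task.await(30_000)
  end
end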

stephenmoloney commented Jan 31, 2018

Ok, can you add this line just before the raise, retry, and see if that makes any difference?

See here:
https://github.com/stephenmoloney/fastimage/blob/master/lib/fastimage.ex#L101

Currently it is:

error_msg = "error, could not open image file with error #{status_code} due to reason, #{reason}"
raise(error_msg)

Change it to:

error_msg = "error, could not open image file with error #{status_code} due to reason, #{reason}"
close_stream(stream_ref)
raise(error_msg)

@benoitc do you think that the close_stream/1 function might make a difference?

I'm not 100% sure why it is happening, nor am I 100% sure if that will help.

kuon commented Jan 31, 2018

@stephenmoloney Your patch fixes the issue. Now I don't know if it's a bug in hackney (since it was working with 1.9) or if this should be expected.

stephenmoloney commented Jan 31, 2018

Yeah, glad to hear it. I think it's a problem with fastimage... I believe the stream should be closed.
I'll merge a PR if you make one.

kuon commented Jan 31, 2018

I still think there is something happening in hackney that should be fixed (as the OP of this issue is encountering).

benoitc (Owner) commented Feb 26, 2018

@kuon not sure what you mean by "(as the OP of this issue is encountering)"; can you clarify? I would be interested in some traces anyway. Can someone experiencing the issue enable tracing from the hackney tracing module?
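If it helps, one hedged way to collect such traces from IEx is Erlang's standard dbg tracer (a generic approach, not hackney's own tracing module), while reproducing the hang:

:dbg.tracer()
:dbg.p(:all, :c)                # enable call tracing for all processes
:dbg.tpl(:hackney_manager, :x)  # trace hackney_manager calls, with returns/exceptions
:dbg.tpl(:hackney_pool, :x)     # and the pool, since connection checkouts are involved

# ... reproduce the deadlock here ...

:dbg.stop_clear()               # stop tracing and clear all trace patterns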

kuon commented Feb 26, 2018

The OP said "Hackney seems to be deadlocked somehow"; that is what I am referring to, and it is based on a feeling rather than hard technical data. The issue is not present in 1.9, which is why I think hackney is at fault, and my hackney deadlocks when using streams, which is where that feeling comes from, but again, I have no hard evidence.

I pinned hackney 1.9 in my codebase, and I am working on something else right now. I will see if I can take some time to make traces, but not for a few days/weeks. My demo code is still up at https://github.com/kuon/hackney_lock; maybe it can be used to help pinpoint the issue.

stephenmoloney commented:
I'll try to get to the tracing if I can find the time in the next few days... but against which version? 1.11.0 seems like the best idea?

benoitc (Owner) commented Apr 3, 2018

please test #497 and let me know about the result.

benoitc (Owner) commented Apr 3, 2018

just tested it and it fixes the issue.

benoitc (Owner) commented Apr 3, 2018

fixed via f81ed84

benoitc closed this as completed Apr 3, 2018

obmarg (Contributor, Author) commented Apr 3, 2018

@benoitc tested, and I'm still getting deadlocks. The symptoms are slightly different from my initial report, though; would you rather I raise a new issue with the details?

benoitc (Owner) commented Apr 3, 2018

@obmarg yes please :) Also, make sure you're always reading the body...
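For the record, a hedged illustration of what "always reading the body" means for an async hackney response: drain every message up to done (or an error) so the connection can go back to the pool. The URL below is a placeholder.

# Drain an async hackney response completely; giving up early (e.g. on a
# deadline) without consuming the remaining messages or closing the request
# leaves the connection in limbo.
{:ok, ref} = :hackney.request(:get, "http://example.com/", [], "", [:async])

drain = fn drain ->
  receive do
    {:hackney_response, ^ref, :done} -> :ok
    {:hackney_response, ^ref, {:error, reason}} -> {:error, reason}
    {:hackney_response, ^ref, _status_headers_or_chunk} -> drain.(drain)
  after
    60_000 -> :hackney.close(ref)
  end
end

drain.(drain)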
