
Deadlocks and excessive CPU consumption with async responses. #447

Closed
obmarg opened this issue Oct 20, 2017 · 42 comments

obmarg (Contributor) commented Oct 20, 2017

I've (again) been load testing a proxy application that makes use of async responses, against the latest master branch of hackney.

I'm basically spinning up 1000 clients that talk to our service; each request those clients make kicks off a request using hackney. After a reasonably small number of requests (8000-9000 before 1ab27d4; it seems to be more after that) things start going wrong:

  1. One of my processes that was receiving a response crashes.
  2. Hackney manager crashes with
13:10:34.417 [error] GenServer :hackney_manager terminating
** (MatchError) no match of right hand side value: {:error, :normal}
    (hackney) /Users/graeme/src/thirdparty/hackney/src/hackney_manager.erl:525: :hackney_manager.clean_requests/5
    (hackney) /Users/graeme/src/thirdparty/hackney/src/hackney_manager.erl:499: :hackney_manager.handle_owner_exit/4
    (stdlib) gen_server.erl:616: :gen_server.try_dispatch/4
    (stdlib) gen_server.erl:686: :gen_server.handle_msg/6
    (stdlib) proc_lib.erl:247: :proc_lib.init_p_do_apply/3
  3. Hackney seems to be deadlocked somehow, as no further requests ever complete - though nothing ever crashes either.
  4. beam.smp CPU usage shoots up to 500%. This seems to be caused by hackney_sup, which permanently has a message queue of around 1000 messages.

I've done some bisecting around this issue, and it definitely seems to have been introduced as part of #443.
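A minimal sketch of such a load pattern (the module name, URL, and concurrency knobs below are hypothetical, not taken from the report above):

# Hedged sketch of the load pattern described above: many concurrent clients,
# each issuing requests that the proxy turns into outgoing hackney requests.
defmodule ProxyLoadTest do
  def run(clients \\ 1000, requests_per_client \\ 10) do
    1..clients
    |> Task.async_stream(
      fn _ ->
        for _ <- 1..requests_per_client do
          # Placeholder URL for the proxied service under test.
          _ = HTTPoison.get("http://localhost:4000/some/proxied/path")
        end
      end,
      max_concurrency: clients,
      timeout: :infinity
    )
    |> Stream.run()
  end
end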

benoitc (Owner) commented Oct 20, 2017

@obmarg is this with 1.10.1? The problem should be fixed in that version. Let me know.

obmarg (Contributor, Author) commented Oct 20, 2017

Yes, this is with 1.10.1. The fix you made in 1.10.1 does seem to have made the error condition less likely, but it still happens sooner or later.

benoitc (Owner) commented Oct 20, 2017

@obmarg can you share a snippet of your code? I would like to understand what it does.

benoitc (Owner) commented Oct 20, 2017

not sure how you get that error on line 525 though... I don't see how it can happen right now.

benoitc (Owner) commented Oct 20, 2017

@obmarg also what is in the supervisor queue? Did you look at it? (erlang:process_info or observer should do the trick)
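For reference, a quick way to look at that from IEx with standard process introspection (assuming hackney_sup is locally registered under that name, as it appears in observer):

pid = Process.whereis(:hackney_sup)
# How backed up is the supervisor's mailbox?
Process.info(pid, :message_queue_len)
# Peek at the first few queued messages.
{:messages, messages} = Process.info(pid, :messages)
Enum.take(messages, 3)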

obmarg (Contributor, Author) commented Oct 20, 2017

The code itself is reasonably simple, so sure. I've got this in one function:

HTTPoison.request(
  request.method,
  request.url,
  request.body,
  request.headers,
  hackney: [follow_redirect: false],
  timeout: deadline,
  recv_timeout: deadline,
  stream_to: self()
)

# conn here is the Plug.Conn from the surrounding handler (not shown in this snippet)
receive_response(conn)

Then there's the receive_response function:

  defp receive_response(conn, status_code \\ nil) do
    receive do
      %HTTPoison.Error{reason: :timeout} ->
        raise Securi.Exceptions.DeadlineExceeded
      %HTTPoison.Error{reason: {:closed, :timeout}} ->
        raise Securi.Exceptions.DeadlineExceeded
      %HTTPoison.Error{reason: :connect_timeout} ->
        raise Securi.Exceptions.DeadlineExceeded
      %HTTPoison.Error{reason: :closed} ->
        raise Securi.Exceptions.ConnectionClosedError
      %HTTPoison.Error{reason: reason} ->
        Logger.warn "HTTP Error: #{inspect reason}"
        raise Securi.Exceptions.UnknownHttpError
      %HTTPoison.AsyncStatus{code: status_code} ->
        receive_response(conn, status_code)
      %HTTPoison.AsyncHeaders{headers: headers} ->
        # Trimming this bit, but I just strip out some headers & send them with plug.
        conn = Plug.Conn.send_chunked(conn, status_code)
        receive_response(conn, status_code)
      %HTTPoison.AsyncChunk{chunk: chunk} ->
        {:ok, conn} = Plug.Conn.chunk(conn, chunk)
        receive_response(conn, status_code)
      %HTTPoison.AsyncEnd{} ->
        {:ok, conn}
    end
  end

It's usually the raise Securi.Exceptions.DeadlineExceeded code path being hit that triggers this issue, though not every time it is hit...

Here's a snippet of the supervisor queue.

[{:"$gen_call",
   {#PID<0.29361.0>, #Reference<0.640257458.685244423.43598>},
   {:start_child,
    {:default,
     {:hackney_pool, :start_link,
      [:default,
       [{:name, :default}, :async, {:stream_to, #PID<0.29362.0>},
        {:recv_timeout, 60000}, {:connect_timeout, 60000},
        {:follow_redirect, false}]]}, :permanent, 10000, :worker,
     [:hackney_pool]}}},
  {:"$gen_call", {#PID<0.28708.0>, #Reference<0.640257458.684982280.261322>},
   {:start_child,
    {:default,
     {:hackney_pool, :start_link,
      [:default,
       [{:name, :default}, :async, {:stream_to, #PID<0.28710.0>},
        {:recv_timeout, 60000}, {:connect_timeout, 60000},
        {:follow_redirect, false}]]}, :permanent, 10000, :worker,
     [:hackney_pool]}}},

It seems to just be repeats of those entries all the way down.

obmarg (Contributor, Author) commented Oct 20, 2017

Not sure if it's relevant, but it also appears that there are a bunch of Elixir.HTTPoison.transformer processes still around as well. (That's the intermediate process that HTTPoison uses to transform the hackney async messages into HTTPoison messages.)

Though I suppose that might make sense if we're expecting hackney_sup to successfully restart stream processes so they can continue sending...

obmarg (Contributor, Author) commented Oct 20, 2017

I've added a bit of logging to the monitor_child function:

monitor_child(Pid) ->
  erlang:monitor(process, Pid),
  unlink(Pid),

  receive
    {'EXIT', Pid, Reason} ->
      receive
        {'DOWN', _, process, Pid, normal} ->
          ok;
        {'DOWN', _, process, Pid, OtherReason} ->
          io:format("Down for other reason... ~n"),
          io:format(Reason),
          io:format(OtherReason),
          {error, Reason}
      end
  after 0 ->
          ok
  end.

Which gives me the following output:

Down for other reason...
normal
noproc

benoitc (Owner) commented Oct 20, 2017 via email

benoitc (Owner) commented Oct 20, 2017 via email

benoitc (Owner) commented Oct 23, 2017

@obmarg this should be fixed in master. I believe the issue was due to noproc not being matched. Let me know.

benoitc (Owner) commented Oct 24, 2017

@obmarg ping. Did you reproduce it with latest master?

obmarg (Contributor, Author) commented Oct 24, 2017

@benoitc The noproc match seems to have fixed the crash and the excessive CPU consumption that happened immediately afterwards.

However, I now seem to be getting a different issue: after a while, all my processes crash with a timeout when calling hackney_manager:start_async_response. Looking at hackney_manager in observer, it appears to be stuck in wait_async_response/1.
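For anyone else chasing this, the same thing can be checked without observer, using standard process introspection (assuming the manager is registered as :hackney_manager, as the crash log above suggests):

pid = Process.whereis(:hackney_manager)
# Where is the manager currently executing, and how backed up is its mailbox?
Process.info(pid, [:current_function, :message_queue_len, :status])
# e.g. [current_function: {:hackney_manager, :wait_async_response, 1}, ...]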

benoitc (Owner) commented Nov 7, 2017

@obmarg hrm, I need to look at it. I will do that sometime tomorrow.

naag commented Jan 11, 2018

We're also seeing this issue after upgrading from Hackney 1.8.0 to 1.10.1 (via HTTPoison 0.13.0) using the streaming function. Any update on this or do you need more information / support?

$ elixir --version
Erlang/OTP 20 [erts-9.1.5] [source] [64-bit] [smp:4:4] [ds:4:4:10] [async-threads:10] [hipe] [kernel-poll:false]

Elixir 1.5.2

benoitc (Owner) commented Jan 11, 2018

@naag can you let me know if latest master fixes it for you? I can do a release today.

naag commented Jan 11, 2018

We've got no way to reproduce the issue :-( It only came up after several days of successfully processing requests (this is also a kind of proxy application). As a hotfix, we currently restart our application once the issue is hit.

Do you have any hints on how we could force the issue to pop up?

benoitc (Owner) commented Jan 23, 2018

@naag no, not really; I can't reproduce it myself. I would advise moving to the latest stable, 1.11.0. The next major version will focus on making streams simpler.

kuon commented Jan 31, 2018

I don't know if this is related, but I use quantum to schedule jobs, and if I use fastimage in my job, the next invocation will hang on hackney_manager:wait_async_response.

In my job I do:

HTTPoison.get(some_url)
Fastimage.size(other_url)

Both calls work the first time, but on the next call, the HTTPoison.get call hangs indefinitely. Looking with observer, hackney is stuck in hackney_manager:wait_async_response.

I can reproduce the issue 100% of the time.

I use hackney 1.11.

I tried with hackney 1.9.0, and I don't have the issue. I tried with master as of dca6478 and still have the issue.

kuon commented Jan 31, 2018

I made an app that hangs; I hope it will help you debug the issue:

https://github.com/kuon/hackney_lock

benoitc (Owner) commented Jan 31, 2018 via email

kuon commented Jan 31, 2018

The point is that it hangs on the next "plain request", which is unrelated to fastimage. Fastimage uses a hackney stream and is causing the deadlock. It might be a bug in how fastimage is using hackney, but since the issue is not present with hackney 1.9, I thought it might be hackney's fault.

I don't know what it is about this particular URL that causes the bug, whether it's the 403 (which I am not sure I had yesterday) or something else (headers...).

Again it does not lock with hackney 1.9.

stephenmoloney commented:
I have some time in the second half of February allocated to do some work on Fastimage so I should have time to resolve this by March.

In the meantime, I'll have a quick look and post back.

benoitc (Owner) commented Jan 31, 2018

@kuon ah ok. I will have a look at fastimage to see what is happening. (cc @stephenmoloney)

stephenmoloney commented Jan 31, 2018

@kuon I took a look at your repository

Ok, here is a more understandable error:

Fastimage.size <url>
** (RuntimeError) error, could not open image file with error 403 due to reason, Forbidden
    (fastimage) lib/fastimage.ex:101: Fastimage.stream_chunks/5
    (fastimage) lib/fastimage.ex:66: Fastimage.recv/1
    (fastimage) lib/fastimage.ex:40: Fastimage.size/1

Seems like the image URL is protected by some sort of auth, hence the 403 error.

I think it crashes here

So because there is a 403 forbidden, Fastimage is just raising a runtime error.

One of the things I have planned for the February work is to add new functions for handling errors better: Fastimage.size!/1 and Fastimage.size/1, so errors can then be handled.

In that release, Fastimage.size/1 will return something like
{:error, "error, could not open image file with error 403 due to reason, Forbidden"}
and the error can then be handled.

If people want support for auth when fetching images, they could open a separate issue in Fastimage for that.

For now, if you do this, it will stop crashing your supervisor:

    try do
      Fastimage.size(image)
    rescue
      e in RuntimeError ->
        Logger.error(e.message)
    end

stephenmoloney commented Jan 31, 2018

This is not a hackney problem. It seems like an authorization issue.

kuon commented Jan 31, 2018

@stephenmoloney The problem is that it puts hackney into a state where the next request, from another process, hangs. In my original code, I do catch the fastimage exception in a try/rescue.

kuon commented Jan 31, 2018

I made an update to my demo repository with the actual URL list I had when encountering the issue. I know that some URLs are invalid, but now I catch the fastimage error, and no exception is raised.

You will notice that after having looped through all the URLs, it hangs on the second job run, here:

16:08:34.001 [info]  Performing job
16:08:34.001 [info]  Making plain get request

stephenmoloney commented:
@kuon If it stops there, that means your OpenGraph fetch is hanging... Otherwise you should see
[info] Making stream request using fastimage

      Logger.info("Getting image URL")

      with {:ok, %{image: image}} <- OpenGraph.fetch(url) do
        Logger.info("Making stream request using fastimage")

I don't think the hanging is due to fastimage.

kuon commented Jan 31, 2018

It is not the fastimage call that is hanging; what is hanging is the HTTPoison.get at the top (the one accessing Google).

Fastimage is causing hackney to lock on the next call from another process. I think it has something to do with fastimage using streams.

Basically, the sequence is the following (a minimal sketch of it is shown after the list):

  • spawn process A (first invocation of the job by the job scheduler):
      • http get: works
      • fastimage call: works
      • http get: works
      • fastimage call: works
      • ... (repeat the above for each URL)
  • spawn process B (second invocation of the job by the job scheduler):
      • http get: locks hackney
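A self-contained sketch of that sequence (the URLs and module name are placeholders; the real reproduction is in the hackney_lock repo linked above):

defmodule HackneyLockSketch do
  def run do
    # Process A: first job invocation - a plain GET, then a streaming
    # Fastimage call that fails (e.g. on a 403 URL) and is rescued.
    Task.async(fn ->
      {:ok, _} = HTTPoison.get("https://www.google.com")

      try do
        Fastimage.size("https://example.com/forbidden.jpg")
      rescue
        e in RuntimeError -> IO.puts("fastimage failed: " <> e.message)
      end
    end)
    |> Task.await(30_000)

    # Process B: second job invocation; this plain GET is the call that
    # reportedly hangs in hackney_manager:wait_async_response.
    Task.async(fn -> HTTPoison.get("https://www.google.com") end)
    |> Task.await(30_000)
  end
end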

stephenmoloney commented Jan 31, 2018

Ok, can you add this line just before the raise, retry, and see if that makes any difference?

See here:
https://github.com/stephenmoloney/fastimage/blob/master/lib/fastimage.ex#L101

Currently it is:

error_msg = "error, could not open image file with error #{status_code} due to reason, #{reason}"
raise(error_msg)

Change it to:

error_msg = "error, could not open image file with error #{status_code} due to reason, #{reason}"
close_stream(stream_ref)
raise(error_msg)

@benoitc do you think that the close_stream/1 function might make a difference?

I'm not 100% sure why it is happening, nor am I 100% sure if that will help.

kuon commented Jan 31, 2018

@stephenmoloney Your patch fixes the issue. Now I don't know if it's a bug in hackney (since it was working with 1.9) or if this should be expected.

stephenmoloney commented Jan 31, 2018

Yeah, glad to hear it. I think it's a problem with fastimage... I believe the stream should be closed.
I'll merge a PR if you make one.

kuon commented Jan 31, 2018

I still think there is something happening in hackney that should be fixed (as the OP of this issue is encountering).

benoitc (Owner) commented Feb 26, 2018

@kuon not sure what you mean by "(as the OP of this issue is encountering)"; can you clarify? I would be interested in some traces anyway. Can someone experiencing the issue enable tracing from the hackney tracing module?
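If it helps, one hedged way to collect such traces from IEx is Erlang's standard dbg tracer (a generic approach, not hackney's own tracing module), while reproducing the hang:

:dbg.tracer()
:dbg.p(:all, :c)                # enable call tracing for all processes
:dbg.tpl(:hackney_manager, :x)  # trace hackney_manager calls, with returns/exceptions
:dbg.tpl(:hackney_pool, :x)     # and the pool, since connection checkouts are involved

# ... reproduce the deadlock here ...

:dbg.stop_clear()               # stop tracing and clear all trace patterns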

kuon commented Feb 26, 2018

The OP said "Hackney seems to be deadlocked somehow"; that is what I am referring to, and it is based on a feeling rather than hard technical data. The issue is not present in 1.9, which is why I think hackney is at fault, and my hackney deadlocks when using streams, which is where that feeling comes from, but again, I have no hard evidence.

I pinned hackney 1.9 in my codebase, and I am working on something else right now. I will see if I can take some time to make traces, but not for a few days/weeks. My demo code is still up at https://github.com/kuon/hackney_lock; maybe it can be used to help pinpoint the issue.

stephenmoloney commented:
I'll try to get to the tracing if I can find the time in the next few days... but against which version? 1.11.0 seems like the best idea?

benoitc (Owner) commented Apr 3, 2018

please test #497 and let me know about the result.

benoitc (Owner) commented Apr 3, 2018

just tested it and it fixes the issue.

benoitc (Owner) commented Apr 3, 2018

fixed via f81ed84

benoitc closed this as completed Apr 3, 2018

obmarg (Contributor, Author) commented Apr 3, 2018

@benoitc tested, and I'm still getting deadlocks. The symptoms are slightly different from my initial report, though; would you rather I raise a new issue with the details?

benoitc (Owner) commented Apr 3, 2018

@obmarg yes please :) Also, make sure you're always reading the body...
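For the record, a hedged illustration of what "always reading the body" means for an async hackney response: drain every message up to done (or an error) so the connection can go back to the pool. The URL below is a placeholder.

# Drain an async hackney response completely; giving up early (e.g. on a
# deadline) without consuming the remaining messages or closing the request
# leaves the connection in limbo.
{:ok, ref} = :hackney.request(:get, "http://example.com/", [], "", [:async])

drain = fn drain ->
  receive do
    {:hackney_response, ^ref, :done} -> :ok
    {:hackney_response, ^ref, {:error, reason}} -> {:error, reason}
    {:hackney_response, ^ref, _status_headers_or_chunk} -> drain.(drain)
  after
    60_000 -> :hackney.close(ref)
  end
end

drain.(drain)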
