connections not returned to the pool on connection errors #643

Closed
InoMurko opened this issue Jun 30, 2020 · 16 comments · Fixed by #661
Comments

@InoMurko

So first, I did the same thing using these dependency versions:

  certifi 2.5.1
  hackney 1.15.2
  httpoison 1.6.0
  idna 6.0.0
  ssl_verify_fun 1.1.5
  unicode_util_compat 0.4.1

Connections were returned to the pool correctly (note in_use_count remains 0).

16 :hackney_pool.get_stats :default
[name: :default, max: 50, in_use_count: 0, free_count: 0, queue_count: 0]
17 2020-06-30 16:00:28.250 [error] module=Engine.Ethereum.HeightMonitor.Core function=fetch_height/2 ⋅Error retrieving Ethereum height: {:error, :econnrefused}⋅
17 :hackney_pool.get_stats :default
[name: :default, max: 50, in_use_count: 0, free_count: 0, queue_count: 0]
18 2020-06-30 16:00:29.054 [error] module=Engine.Ethereum.HeightMonitor.Core function=fetch_height/2 ⋅Error retrieving Ethereum height: {:error, :econnrefused}⋅
18 :hackney_pool.get_stats :default
[name: :default, max: 50, in_use_count: 0, free_count: 0, queue_count: 0]

Now I bumped httpoison (and everything else with it).

Upgraded:
  certifi 2.5.1 => 2.5.2
  hackney 1.15.2 => 1.16.0
  httpoison 1.6.0 => 1.7.0
  idna 6.0.0 => 6.0.1
  ssl_verify_fun 1.1.5 => 1.1.6
  unicode_util_compat 0.4.1 => 0.5.0 (minor)

The story here is that I retry establishing an HTTP connection until I succeed. But in this case, the pool gets maxed out because in_use_count keeps growing on {:error, :econnrefused} errors.
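
(Roughly, the retry pattern in question looks like the hypothetical sketch below; the module name and URL handling are made up for illustration, and the real calls go through HTTPoison from Engine.Ethereum.HeightMonitor.Core:)

defmodule HeightFetcher do
  # Hypothetical retry loop: keep polling the Ethereum node until the HTTP call succeeds.
  require Logger

  def fetch_height_with_retry(url) do
    case HTTPoison.get(url) do
      {:ok, %HTTPoison.Response{} = response} ->
        {:ok, response}

      {:error, %HTTPoison.Error{reason: reason}} ->
        Logger.error("Error retrieving Ethereum height: #{inspect({:error, reason})}")
        # On hackney 1.16.0 each failed attempt appears to leave in_use_count incremented.
        Process.sleep(800)
        fetch_height_with_retry(url)
    end
  end
end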

...
41 :hackney_pool.get_stats(:default)
[name: :default, max: 50, in_use_count: 41, free_count: 1, queue_count: 0]
42 2020-06-30 15:54:09.706 [error] module=Engine.Ethereum.HeightMonitor.Core function=fetch_height/2 ⋅Error retrieving Ethereum height: {:error, :econnrefused}⋅
42 :hackney_pool.get_stats(:default)2020-06-30 15:54:10.511 [error] module=Engine.Ethereum.HeightMonitor.Core function=fetch_height/2 ⋅Error retrieving Ethereum height: {:error, :econnrefused}⋅

[name: :default, max: 50, in_use_count: 43, free_count: 1, queue_count: 0]
43 2020-06-30 15:54:11.314 [error] module=Engine.Ethereum.HeightMonitor.Core function=fetch_height/2 ⋅Error retrieving Ethereum height: {:error, :econnrefused}⋅
43 :hackney_pool.get_stats(:default)
[name: :default, max: 50, in_use_count: 44, free_count: 1, queue_count: 0]
44 2020-06-30 15:54:12.117 [error] module=Engine.Ethereum.HeightMonitor.Core function=fetch_height/2 ⋅Error retrieving Ethereum height: {:error, :econnrefused}⋅
44 :hackney_pool.get_stats(:default)
[name: :default, max: 50, in_use_count: 45, free_count: 1, queue_count: 0]
45 :hackney_pool.get_stats(:default)2020-06-30 15:54:12.920 [error] module=Engine.Ethereum.HeightMonitor.Core function=fetch_height/2 ⋅Error retrieving Ethereum height: {:error, :econnrefused}⋅

[name: :default, max: 50, in_use_count: 46, free_count: 1, queue_count: 0]
46 :hackney_pool.get_stats(:default)2020-06-30 15:54:13.723 [error] module=Engine.Ethereum.HeightMonitor.Core function=fetch_height/2 ⋅Error retrieving Ethereum height: {:error, :econnrefused}⋅

[name: :default, max: 50, in_use_count: 47, free_count: 1, queue_count: 0]
47 :hackney_pool.get_stats(:default)
[name: :default, max: 50, in_use_count: 47, free_count: 1, queue_count: 0]
48 2020-06-30 15:54:14.526 [error] module=Engine.Ethereum.HeightMonitor.Core function=fetch_height/2 ⋅Error retrieving Ethereum height: {:error, :econnrefused}⋅
2020-06-30 15:54:15.329 [error] module=Engine.Ethereum.HeightMonitor.Core function=fetch_height/2 ⋅Error retrieving Ethereum height: {:error, :econnrefused}⋅
2020-06-30 15:54:16.133 [error] module=Engine.Ethereum.HeightMonitor.Core function=fetch_height/2 ⋅Error retrieving Ethereum height: {:error, :econnrefused}⋅
48 :hackney_pool.get_stats(:default)
[name: :default, max: 50, in_use_count: 50, free_count: 1, queue_count: 1]
49 :hackney_pool.get_stats(:default)
[name: :default, max: 50, in_use_count: 50, free_count: 1, queue_count: 1]
50 :hackney_pool.get_stats(:default)
[name: :default, max: 50, in_use_count: 50, free_count: 1, queue_count: 1]
51 2020-06-30 15:54:24.936 [error] module=Engine.Ethereum.HeightMonitor.Core function=fetch_height/2 ⋅Error retrieving Ethereum height: {:error, :checkout_timeout}⋅
@InoMurko
Author

InoMurko commented Jul 3, 2020

Erlang/OTP 23 [erts-11.0.2]
Elixir 1.10.2

@benoitc
Owner

benoitc commented Jul 5, 2020

Do you read the body or skip it explicitly? If the connection is not closed and the body is not read (or skipped, which will flush it), the socket will not be released to the pool.
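
(For reference, a minimal sketch of that rule from the caller's side, using hackney's streaming API without the :with_body option; the URL is a placeholder:)

url = "http://example.com/"

case :hackney.request(:get, url, [], "", []) do
  {:ok, _status, _headers, client_ref} ->
    # Read the body so the socket is released back to the pool...
    {:ok, _body} = :hackney.body(client_ref)

    # ...or, alternatively, skip it (which flushes it) or close the connection:
    #   :ok = :hackney.skip_body(client_ref)
    #   :hackney.close(client_ref)
    :ok

  {:error, reason} ->
    {:error, reason}
end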

@jfis

jfis commented Jul 5, 2020

I believe @InoMurko is saying there is no body to read because the connection is refused.

Should the connection get released back to the pool when the connection is refused?

certifi 2.5.2
hackney 1.16.0
idna 6.0.1
ssl_verify_fun 1.1.6
unicode_util_compat 0.5.0

Code (sorry for the Elixir; if you need this translated to Erlang, I will do my best):

nowhere = "http://127.0.0.1:55555"
for _ <- 1..51 do
  :hackney.post(nowhere, [], "", [:with_body]) |> IO.inspect()
  :hackney_pool.get_stats(:default) |> IO.inspect()
end

output:

{:error, :econnrefused}
[name: :default, max: 50, in_use_count: 1, free_count: 0, queue_count: 0]
{:error, :econnrefused}
[name: :default, max: 50, in_use_count: 2, free_count: 0, queue_count: 0]
{:error, :econnrefused}
[name: :default, max: 50, in_use_count: 3, free_count: 0, queue_count: 0]
...
[name: :default, max: 50, in_use_count: 48, free_count: 0, queue_count: 0]
{:error, :econnrefused}
[name: :default, max: 50, in_use_count: 49, free_count: 0, queue_count: 0]
{:error, :econnrefused}
[name: :default, max: 50, in_use_count: 50, free_count: 0, queue_count: 0]
{:error, :checkout_timeout}
[name: :default, max: 50, in_use_count: 50, free_count: 0, queue_count: 0]

@InoMurko
Author

InoMurko commented Jul 6, 2020

@jfis yeah, that's what I meant, thanks for clarifying!

@benoitc there is no body, because the connection is refused with {:error, :econnrefused}.

@benoitc
Owner

benoitc commented Jul 6, 2020

> I believe @InoMurko is saying there is no body to read because the connection is refused.
>
> Should the connection get released back to the pool when the connection is refused?

What do you mean by connection refused? Socket level or HTTP level?

@benoitc
Owner

benoitc commented Jul 6, 2020

> @benoitc there is no body, because the connection is refused with {:error, :econnrefused}.

In that case it should indeed release the socket. Sounds like something new; I will check.

@benoitc benoitc self-assigned this Jul 6, 2020
@benoitc benoitc added the bug label Jul 6, 2020
@kape1395

kape1395 commented Jul 7, 2020

I have similar symptoms after upgrading hackney from 1.15.1 to 1.16.0.
Each time I get a timeout (recv timeout in my case, not the connect_timeout), the connection is not returned to the pool:

hackney_pool:get_stats(default).
[{name,default},
 {max,5000},
 {in_use_count,72},
 {free_count,0},
 {queue_count,0}]

I checked the log, and the in_use_count matches the number of {error, timeout} results returned by:

Opts = [
        with_body,
        insecure,
        {recv_timeout, 120000}
    ],
hackney:post(Url, [{<<"Content-Type">>, <<"application/json">>}], Json, Opts)

@silviucpp

I have the same problem as @kape1395 describes.

@benoitc
Owner

benoitc commented Jul 19, 2020

I think I see the issue. I will have a look at it tomorrow. Thanks for the report.

@silviucpp

@benoitc an easy way to replicate this is to create a pool with a max of 4-5 connections and then make requests to a server that is stopped.

The first N requests (where N is the pool size) will fail right away with an econnrefused reason, and all subsequent ones with checkout_timeout.
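
(A minimal sketch of that reproduction; the pool name, port, and sizes below are placeholders chosen for illustration:)

# Start a small dedicated pool so it exhausts quickly.
:hackney_pool.start_pool(:repro, timeout: 15_000, max_connections: 5)

nowhere = "http://127.0.0.1:55555"

for _ <- 1..6 do
  # The first 5 requests fail with {:error, :econnrefused}; once the pool is
  # exhausted, the next one fails with {:error, :checkout_timeout}.
  :hackney.get(nowhere, [], "", [:with_body, {:pool, :repro}]) |> IO.inspect()
  :hackney_pool.get_stats(:repro) |> IO.inspect()
end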

@losvedir

I believe we have this same issue. We started running into occasional checkout_timeout issues when we upgraded hackney from 1.15.2 to 1.16.0. The only fix seems to be restarting the app (unless there is a good way to flush a pool?).

We added logging from :hackney_pool.get_stats/1 once a minute. Here's what the chart of min(in_use_count) looks like over a 24-hour period:

[Screenshot: chart of min(in_use_count) over a 24-hour period]

It appears that the pool is gradually filling (until it drops to 0 after I restarted the app). The big jump at 8:48pm corresponds exactly to when we had a brief network outage, and a flurry of :nxdomain warnings in our logs.

Our requests are all made via ex_aws, not via hackney directly, and the code that calls hackney seems to be here. I saw other issues on the tracker here about the pool filling up from not reading the body, but it looks like the requests there use the :with_body option, so I think that's not the problem in this case.
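
(For reference, a rough sketch of the once-a-minute stats logging described above, as a small GenServer; the module name is made up for illustration:)

defmodule PoolStatsLogger do
  # Hypothetical periodic logger for :hackney_pool.get_stats/1.
  use GenServer
  require Logger

  @interval :timer.minutes(1)

  def start_link(pool \\ :default) do
    GenServer.start_link(__MODULE__, pool)
  end

  @impl true
  def init(pool) do
    schedule()
    {:ok, pool}
  end

  @impl true
  def handle_info(:log, pool) do
    stats = :hackney_pool.get_stats(pool)
    Logger.info("hackney pool stats: #{inspect(stats)}")
    schedule()
    {:noreply, pool}
  end

  defp schedule, do: Process.send_after(self(), :log, @interval)
end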

@silviucpp

@losvedir right now, to work around this problem, I restart the pool when I receive checkout_timeout.

@losvedir

losvedir commented Aug 2, 2020

@silviucpp - How are you restarting the pool? I put up a branch of our app that logs the pool stats once a minute and stops the pool if the in_use_count is above a threshold. However, the very next minute the in_use_count is back where it was before stopping, so it seems like it's not "clearing" the pool. Is there a better way to empty the pool? I'm doing something like this:

stats = :hackney_pool.get_stats(pool) 
if stats[:in_use_count] >= @connection_pool_in_use_threshold do
  Logger.warn("stopping_pool pool=#{pool}")
  :hackney_pool.stop_pool(pool)
end

Also, I haven't been able to recreate the behavior in a minimal case. I put up this repo which is an app that starts up ranch to accept incoming connections, and also starts up a few GenServers that use hackney to make requests using a pool. I've tried various ways of having ranch fail (econnrefused, timeout, nxdomain), and can't replicate the pool filling up like in our production app. I suspect it might be some race condition or something involving a lot of requests.

@benoitc - are you still looking into this? You said "I think I see the issue." If you have an idea what it might be, I'd be happy to try to replicate the behavior in my example repo to try to get a consistently failing example for you to work with.

In the meantime, I think my only recourse is to switch back to 1.15.2. I have that branch running on a staging server and the behavior is once again correctly keeping in_use_count flat, not slowly growing over time.

@silviucpp

Hello

In my case when {error, checkout_timeout} occurs I'm doing:

hackney_pool:stop_pool(Pool),
hackney_pool:start_pool(Pool, [{timeout, 15000}, {max_connections, 20}])

Then I retry the query.

Silviu

@IceDragon200

IceDragon200 commented Sep 14, 2020

I was preparing this comment while testing, so some parts were added during the test. Anyway, on to the actual comment.

I'm digging into this issue at the moment. What I've found so far is that the connections are not even in the pool:

I'm looking at the pool checkout process at the moment

https://github.com/benoitc/hackney/blob/master/src/hackney_pool.erl#L86-L133

I inserted some Elixir.IO.inspect/1 calls at the top of each case branch (I'm using Elixir since I don't have a proper Erlang dev environment set up).

Issuing a request then causes an {:error, :no_socket, owner}, so the pool increments the counters regardless and returns the tuple; hackney_pool:do_checkout/7 then tries to establish a connection and fails, returning the connection error as expected (e.g. {:error, :econnrefused}).

What I found interesting is that a valid request (one that completes with an actual response) also increments the in-use counter and also hits a no_socket error; the worse part is that it doesn't decrement the in-use counter afterwards either.

I managed to trigger a checkout timeout with valid requests just by calling hackney:request/1 with a valid url 60 times in a row.

After some more testing and tweaking

It seems the error-related failures can be fixed by sticking a checkout_cancel call in their respective cases; the original code had it as an inline call, so I've moved that into a small function to quickly reuse it in all the cases.
A better option would be to refactor this function so it performs a handle_checkout_result which checks for the failures and cancels the checkout as required.

%% in hackney_pool.erl

do_checkout(Requester, Host, _Port, Transport, #client{options=Opts,
  mod_metrics=Metrics}=Client, ConnectTimeout, CheckoutTimeout) ->
  {Connection, ConnectOptions} = hackney_connection:new(Client),
  RequestRef = Client#client.request_ref,
  PoolName = proplists:get_value(pool, Opts, default),
  Pool = find_pool(PoolName, Opts),
  case catch gen_server:call(Pool, {checkout, Connection, Requester, RequestRef}, CheckoutTimeout) of
    {ok, Socket, Owner} ->
      %% stats
      ?report_debug("reuse a connection", [{pool, PoolName}]),
      _ = metrics:update_meter(Metrics, [hackney_pool, PoolName, take_rate], 1),
      _ = metrics:increment_counter(Metrics, [hackney_pool, Host, reuse_connection]),


      {ok, {PoolName, RequestRef, Connection, Owner, Transport}, Socket};
    {error, no_socket, Owner} ->
      ?report_trace("no socket in the pool", [{pool, PoolName}]),
      Begin = os:timestamp(),
      case hackney_connection:connect(Connection, ConnectOptions, ConnectTimeout) of
        {ok, Socket} ->
          case hackney_connection:controlling_process(Connection, Socket, Requester) of
            ok ->
              ?report_trace("new connection", []),
              ConnectTime = timer:now_diff(os:timestamp(), Begin)/1000,
              _ = metrics:update_histogram(Metrics, [hackney, Host, connect_time], ConnectTime),
              _ = metrics:increment_counter(Metrics, [hackney_pool, Host, new_connection]),
              {ok, {PoolName, RequestRef, Connection, Owner, Transport}, Socket};
            Error ->
              'Elixir.IO':inspect({pool, connect_error_controlling, Error}),
              catch hackney_connection:close(Connection, Socket),
              cancel_checkout(Pool, Connection, RequestRef),
              _ = metrics:increment_counter(Metrics, [hackney, Host, connect_error]),
              Error
           end;
        {error, timeout} ->
          _ = metrics:increment_counter(Metrics, [hackney, Host, connect_timeout]),
          cancel_checkout(Pool, Connection, RequestRef),
          {error, timeout};
        Error ->
          ?report_trace("connect error", []),
          'Elixir.IO':inspect({pool, connect_error, Error}),
          _ = metrics:increment_counter(Metrics, [hackney, Host, connect_error]),
          cancel_checkout(Pool, Connection, RequestRef),
          Error
      end;
    {error, Reason} ->
      {error, Reason};
    {'EXIT', {timeout, Reason}} ->
      % socket will still checkout so to avoid deadlock we send in a cancellation
      cancel_checkout(Pool, Connection, RequestRef),
      {error, checkout_timeout}
  end.

cancel_checkout(Pool, Connection, RequestRef) ->
  gen_server:cast(Pool, {checkout_cancel, Connection, RequestRef}).

The above change seems to fix the {:error, :econnrefused | :timeout | atom}-related failures.
I still haven't figured out why valid requests are exhausting the pool in my current tests; I'll update this comment with my findings (or failures).

Regarding the valid request failure

The issue was not reading the response body to complete the transaction:

for _ <- 1..60 do
  result = {:ok, 200, _headers, body_ref} = :hackney.request("http://localhost:7654")
  _ = :hackney.body(body_ref) #< the important bit: reading the body releases the socket
  IO.inspect {:response, result}
  IO.inspect {:pool_stats, :hackney_pool.get_stats(:default)}
end

IceDragon200 added a commit to IceDragon200/hackney that referenced this issue Sep 14, 2020
s3cur3 added a commit to X-Plane/weather-mirror that referenced this issue Oct 23, 2020
…ailed downloads, the pool is (permanently?) exhausted

Without this, after we hit a few HTTP errors (which of course NOAA serves us a lot, that's the whole point of this proxy), all (?) future downloads fail with this error:
{:error, %HTTPoison.Error{id: nil, reason: :checkout_timeout}}

Hackney issue: edgurgel/httpoison#414
More info: benoitc/hackney#643
@InoMurko
Author

Thanks everyone!!! <3
