connections not returned to the pool on connection errors #643

Closed
InoMurko opened this issue Jun 30, 2020 · 16 comments · Fixed by #661
Comments

@InoMurko

So first, I did the same thing using these dependency versions:

  certifi 2.5.1
  hackney 1.15.2
  httpoison 1.6.0
  idna 6.0.0
  ssl_verify_fun 1.1.5
  unicode_util_compat 0.4.1

Connections were returned to the pool correctly (note in_use_count remains 0).

16 :hackney_pool.get_stats :default
[name: :default, max: 50, in_use_count: 0, free_count: 0, queue_count: 0]
17 2020-06-30 16:00:28.250 [error] module=Engine.Ethereum.HeightMonitor.Core function=fetch_height/2 ⋅Error retrieving Ethereum height: {:error, :econnrefused}⋅
17 :hackney_pool.get_stats :default
[name: :default, max: 50, in_use_count: 0, free_count: 0, queue_count: 0]
18 2020-06-30 16:00:29.054 [error] module=Engine.Ethereum.HeightMonitor.Core function=fetch_height/2 ⋅Error retrieving Ethereum height: {:error, :econnrefused}⋅
18 :hackney_pool.get_stats :default
[name: :default, max: 50, in_use_count: 0, free_count: 0, queue_count: 0]

Now I bumped httpoison (and everything else with it).

Upgraded:
  certifi 2.5.1 => 2.5.2
  hackney 1.15.2 => 1.16.0
  httpoison 1.6.0 => 1.7.0
  idna 6.0.0 => 6.0.1
  ssl_verify_fun 1.1.5 => 1.1.6
  unicode_util_compat 0.4.1 => 0.5.0 (minor)

The story here is that I retry establishing an HTTP connection until I succeed. But in this case, the pool gets maxed out because in_use_count keeps growing on {:error, :econnrefused} errors.
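
(Roughly, the retry pattern in question looks like the hypothetical sketch below; the module name and URL handling are made up for illustration, and the real calls go through HTTPoison from Engine.Ethereum.HeightMonitor.Core:)

defmodule HeightFetcher do
  # Hypothetical retry loop: keep polling the Ethereum node until the HTTP call succeeds.
  require Logger

  def fetch_height_with_retry(url) do
    case HTTPoison.get(url) do
      {:ok, %HTTPoison.Response{} = response} ->
        {:ok, response}

      {:error, %HTTPoison.Error{reason: reason}} ->
        Logger.error("Error retrieving Ethereum height: #{inspect({:error, reason})}")
        # On hackney 1.16.0 each failed attempt appears to leave in_use_count incremented.
        Process.sleep(800)
        fetch_height_with_retry(url)
    end
  end
end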

...
41 :hackney_pool.get_stats(:default)
[name: :default, max: 50, in_use_count: 41, free_count: 1, queue_count: 0]
42 2020-06-30 15:54:09.706 [error] module=Engine.Ethereum.HeightMonitor.Core function=fetch_height/2 ⋅Error retrieving Ethereum height: {:error, :econnrefused}⋅
42 :hackney_pool.get_stats(:default)2020-06-30 15:54:10.511 [error] module=Engine.Ethereum.HeightMonitor.Core function=fetch_height/2 ⋅Error retrieving Ethereum height: {:error, :econnrefused}⋅

[name: :default, max: 50, in_use_count: 43, free_count: 1, queue_count: 0]
43 2020-06-30 15:54:11.314 [error] module=Engine.Ethereum.HeightMonitor.Core function=fetch_height/2 ⋅Error retrieving Ethereum height: {:error, :econnrefused}⋅
43 :hackney_pool.get_stats(:default)
[name: :default, max: 50, in_use_count: 44, free_count: 1, queue_count: 0]
44 2020-06-30 15:54:12.117 [error] module=Engine.Ethereum.HeightMonitor.Core function=fetch_height/2 ⋅Error retrieving Ethereum height: {:error, :econnrefused}⋅
44 :hackney_pool.get_stats(:default)
[name: :default, max: 50, in_use_count: 45, free_count: 1, queue_count: 0]
45 :hackney_pool.get_stats(:default)2020-06-30 15:54:12.920 [error] module=Engine.Ethereum.HeightMonitor.Core function=fetch_height/2 ⋅Error retrieving Ethereum height: {:error, :econnrefused}⋅

[name: :default, max: 50, in_use_count: 46, free_count: 1, queue_count: 0]
46 :hackney_pool.get_stats(:default)2020-06-30 15:54:13.723 [error] module=Engine.Ethereum.HeightMonitor.Core function=fetch_height/2 ⋅Error retrieving Ethereum height: {:error, :econnrefused}⋅

[name: :default, max: 50, in_use_count: 47, free_count: 1, queue_count: 0]
47 :hackney_pool.get_stats(:default)
[name: :default, max: 50, in_use_count: 47, free_count: 1, queue_count: 0]
48 2020-06-30 15:54:14.526 [error] module=Engine.Ethereum.HeightMonitor.Core function=fetch_height/2 ⋅Error retrieving Ethereum height: {:error, :econnrefused}⋅
2020-06-30 15:54:15.329 [error] module=Engine.Ethereum.HeightMonitor.Core function=fetch_height/2 ⋅Error retrieving Ethereum height: {:error, :econnrefused}⋅
2020-06-30 15:54:16.133 [error] module=Engine.Ethereum.HeightMonitor.Core function=fetch_height/2 ⋅Error retrieving Ethereum height: {:error, :econnrefused}⋅
48 :hackney_pool.get_stats(:default)
[name: :default, max: 50, in_use_count: 50, free_count: 1, queue_count: 1]
49 :hackney_pool.get_stats(:default)
[name: :default, max: 50, in_use_count: 50, free_count: 1, queue_count: 1]
50 :hackney_pool.get_stats(:default)
[name: :default, max: 50, in_use_count: 50, free_count: 1, queue_count: 1]
51 2020-06-30 15:54:24.936 [error] module=Engine.Ethereum.HeightMonitor.Core function=fetch_height/2 ⋅Error retrieving Ethereum height: {:error, :checkout_timeout}⋅
@InoMurko
Author

InoMurko commented Jul 3, 2020

Erlang/OTP 23 [erts-11.0.2]
Elixir 1.10.2

@benoitc
Owner

benoitc commented Jul 5, 2020

Do you read the body or skip it explicitly? If the connection is not closed and the body is not read (or skipped, which will flush it), the socket will not be released to the pool.
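
(For reference, a minimal sketch of that rule from the caller's side, using hackney's streaming API without the :with_body option; the URL is a placeholder:)

url = "http://example.com/"

case :hackney.request(:get, url, [], "", []) do
  {:ok, _status, _headers, client_ref} ->
    # Read the body so the socket is released back to the pool...
    {:ok, _body} = :hackney.body(client_ref)

    # ...or, alternatively, skip it (which flushes it) or close the connection:
    #   :ok = :hackney.skip_body(client_ref)
    #   :hackney.close(client_ref)
    :ok

  {:error, reason} ->
    {:error, reason}
end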

@jfis

jfis commented Jul 5, 2020

I believe @InoMurko is saying there is no body to read because the connection is refused.

Should the connection get released back to the pool when the connection is refused?

certifi 2.5.2
hackney 1.16.0
idna 6.0.1
ssl_verify_fun 1.1.6
unicode_util_compat 0.5.0

Code (sorry for the Elixir; if you need this translated to Erlang, I will do my best):

nowhere = "http://127.0.0.1:55555"
for _ <- 1..51 do
  :hackney.post(nowhere, [], "", [:with_body]) |> IO.inspect()
  :hackney_pool.get_stats(:default) |> IO.inspect()
end

output:

{:error, :econnrefused}
[name: :default, max: 50, in_use_count: 1, free_count: 0, queue_count: 0]
{:error, :econnrefused}
[name: :default, max: 50, in_use_count: 2, free_count: 0, queue_count: 0]
{:error, :econnrefused}
[name: :default, max: 50, in_use_count: 3, free_count: 0, queue_count: 0]
...
[name: :default, max: 50, in_use_count: 48, free_count: 0, queue_count: 0]
{:error, :econnrefused}
[name: :default, max: 50, in_use_count: 49, free_count: 0, queue_count: 0]
{:error, :econnrefused}
[name: :default, max: 50, in_use_count: 50, free_count: 0, queue_count: 0]
{:error, :checkout_timeout}
[name: :default, max: 50, in_use_count: 50, free_count: 0, queue_count: 0]

@InoMurko
Author

InoMurko commented Jul 6, 2020

@jfis yeah, that's what I meant, thanks for clarifying!

@benoitc there is no body, because the connection is refused with {:error, :econnrefused}.

@benoitc
Owner

benoitc commented Jul 6, 2020

> I believe @InoMurko is saying there is no body to read because the connection is refused.
>
> Should the connection get released back to the pool when the connection is refused?

What do you mean by connection refused? Socket level or HTTP level?

@benoitc
Owner

benoitc commented Jul 6, 2020

> @benoitc there is no body, because the connection is refused with {:error, :econnrefused}.

In that case it should indeed release the socket. Sounds like something new; I will check.

@benoitc benoitc self-assigned this Jul 6, 2020
@benoitc benoitc added the bug label Jul 6, 2020
@kape1395

kape1395 commented Jul 7, 2020

I have similar symptoms after upgrading hackney from 1.15.1 to 1.16.0.
Each time I get a timeout (recv timeout in my case, not the connect_timeout), the connection is not returned to the pool:

hackney_pool:get_stats(default).
[{name,default},
 {max,5000},
 {in_use_count,72},
 {free_count,0},
 {queue_count,0}]

I checked the log, and the in_use_count matches the number of {error, timeout} results returned by:

Opts = [
        with_body,
        insecure,
        {recv_timeout, 120000}
    ],
hackney:post(Url, [{<<"Content-Type">>, <<"application/json">>}], Json, Opts)

@silviucpp

I have the same problem as @kape1395 describes.

@benoitc
Owner

benoitc commented Jul 19, 2020

I think I see the issue. I will have a look at it tomorrow. Thanks for the report.

@silviucpp

@benoitc an easy way to replicate this is to create a pool with a max of 4-5 connections and then make requests to a server that is stopped.

The first N requests (where N is the pool size) will fail right away with an econnrefused reason, and all subsequent ones with checkout_timeout.
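
(A minimal sketch of that reproduction; the pool name, port, and sizes below are placeholders chosen for illustration:)

# Start a small dedicated pool so it exhausts quickly.
:hackney_pool.start_pool(:repro, timeout: 15_000, max_connections: 5)

nowhere = "http://127.0.0.1:55555"

for _ <- 1..6 do
  # The first 5 requests fail with {:error, :econnrefused}; once the pool is
  # exhausted, the next one fails with {:error, :checkout_timeout}.
  :hackney.get(nowhere, [], "", [:with_body, {:pool, :repro}]) |> IO.inspect()
  :hackney_pool.get_stats(:repro) |> IO.inspect()
end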

@losvedir

I believe we have this same issue. We started running into occasional checkout_timeout issues when we upgraded hackney from 1.15.2 to 1.16.0. The only fix seems to be restarting the app (unless there is a good way to flush a pool?).

We added logging from :hackney_pool.get_stats/1 once a minute. Here's what the chart of min(in_use_count) looks like over a 24-hour period:

[Screenshot: chart of min(in_use_count) over a 24-hour period]

It appears that the pool is gradually filling (until it drops to 0 after I restarted the app). The big jump at 8:48pm corresponds exactly to when we had a brief network outage, and a flurry of :nxdomain warnings in our logs.

Our requests are all made via ex_aws, not via hackney directly, and the code that calls hackney seems to be here. I saw other issues on the tracker here about the pool filling up from not reading the body, but it looks like the requests there use the :with_body option, so I think that's not the problem in this case.
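
(For reference, a rough sketch of the once-a-minute stats logging described above, as a small GenServer; the module name is made up for illustration:)

defmodule PoolStatsLogger do
  # Hypothetical periodic logger for :hackney_pool.get_stats/1.
  use GenServer
  require Logger

  @interval :timer.minutes(1)

  def start_link(pool \\ :default) do
    GenServer.start_link(__MODULE__, pool)
  end

  @impl true
  def init(pool) do
    schedule()
    {:ok, pool}
  end

  @impl true
  def handle_info(:log, pool) do
    stats = :hackney_pool.get_stats(pool)
    Logger.info("hackney pool stats: #{inspect(stats)}")
    schedule()
    {:noreply, pool}
  end

  defp schedule, do: Process.send_after(self(), :log, @interval)
end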

@silviucpp

@losvedir right now, to work around this problem, I restart the pool when I receive checkout_timeout.

@losvedir

losvedir commented Aug 2, 2020

@silviucpp - How are you restarting the pool? I put up a branch of our app that logs the pool stats once a minute and stops the pool if the in_use_count is above a threshold. However, the very next minute the in_use_count is back where it was before stopping, so it seems like it's not "clearing" the pool. Is there a better way to empty the pool? I'm doing something like this:

stats = :hackney_pool.get_stats(pool) 
if stats[:in_use_count] >= @connection_pool_in_use_threshold do
  Logger.warn("stopping_pool pool=#{pool}")
  :hackney_pool.stop_pool(pool)
end

Also, I haven't been able to recreate the behavior in a minimal case. I put up this repo which is an app that starts up ranch to accept incoming connections, and also starts up a few GenServers that use hackney to make requests using a pool. I've tried various ways of having ranch fail (econnrefused, timeout, nxdomain), and can't replicate the pool filling up like in our production app. I suspect it might be some race condition or something involving a lot of requests.

@benoitc - are you still looking into this? You said "I think I see the issue." If you have an idea what it might be, I'd be happy to try to replicate the behavior in my example repo to try to get a consistently failing example for you to work with.

In the meantime, I think my only recourse is to switch back to 1.15.2. I have that branch running on a staging server and the behavior is once again correctly keeping in_use_count flat, not slowly growing over time.

@silviucpp

Hello

In my case when {error, checkout_timeout} occurs I'm doing:

hackney_pool:stop_pool(Pool),
hackney_pool:start_pool(Pool, [{timeout, 15000}, {max_connections, 20}])

Then I retry the query.

Silviu

@IceDragon200

IceDragon200 commented Sep 14, 2020

I was preparing this comment while testing, so some parts were added during the test. Anyway, on to the actual comment.

I'm digging into this issue at the moment. What I've found so far is that the connections are not even in the pool:

I'm looking at the pool checkout process at the moment

https://github.com/benoitc/hackney/blob/master/src/hackney_pool.erl#L86-L133

I inserted some Elixir.IO.inspect/1 calls at the top of each case branch (I'm using Elixir since I don't have a proper Erlang dev environment set up).

Issuing a request then causes an {:error, :no_socket, owner}, so the pool increments the counters regardless and returns the tuple; hackney_pool:do_checkout/7 then tries to establish a connection and fails, returning the connection error as expected (e.g. {:error, :econnrefused}).

What I found interesting is that a valid request (one that completes with an actual response) also increments the in-use counter and also hits a no_socket error; the worse part is that it doesn't decrement the in-use counter afterwards either.

I managed to trigger a checkout timeout with valid requests just by calling hackney:request/1 with a valid url 60 times in a row.

After some more testing and tweaking

It seems the error-related failures can be fixed by sticking a checkout_cancel call in their respective cases; the original code had it as an inline call, so I've moved that into a small function to quickly reuse it in all the cases.
A better option would be to refactor this function so it performs a handle_checkout_result which checks for the failures and cancels the checkout as required.

%% in hackney_pool.erl

do_checkout(Requester, Host, _Port, Transport, #client{options=Opts,
  mod_metrics=Metrics}=Client, ConnectTimeout, CheckoutTimeout) ->
  {Connection, ConnectOptions} = hackney_connection:new(Client),
  RequestRef = Client#client.request_ref,
  PoolName = proplists:get_value(pool, Opts, default),
  Pool = find_pool(PoolName, Opts),
  case catch gen_server:call(Pool, {checkout, Connection, Requester, RequestRef}, CheckoutTimeout) of
    {ok, Socket, Owner} ->
      %% stats
      ?report_debug("reuse a connection", [{pool, PoolName}]),
      _ = metrics:update_meter(Metrics, [hackney_pool, PoolName, take_rate], 1),
      _ = metrics:increment_counter(Metrics, [hackney_pool, Host, reuse_connection]),


      {ok, {PoolName, RequestRef, Connection, Owner, Transport}, Socket};
    {error, no_socket, Owner} ->
      ?report_trace("no socket in the pool", [{pool, PoolName}]),
      Begin = os:timestamp(),
      case hackney_connection:connect(Connection, ConnectOptions, ConnectTimeout) of
        {ok, Socket} ->
          case hackney_connection:controlling_process(Connection, Socket, Requester) of
            ok ->
              ?report_trace("new connection", []),
              ConnectTime = timer:now_diff(os:timestamp(), Begin)/1000,
              _ = metrics:update_histogram(Metrics, [hackney, Host, connect_time], ConnectTime),
              _ = metrics:increment_counter(Metrics, [hackney_pool, Host, new_connection]),
              {ok, {PoolName, RequestRef, Connection, Owner, Transport}, Socket};
            Error ->
              'Elixir.IO':inspect({pool, connect_error_controlling, Error}),
              catch hackney_connection:close(Connection, Socket),
              cancel_checkout(Pool, Connection, RequestRef),
              _ = metrics:increment_counter(Metrics, [hackney, Host, connect_error]),
              Error
           end;
        {error, timeout} ->
          _ = metrics:increment_counter(Metrics, [hackney, Host, connect_timeout]),
          cancel_checkout(Pool, Connection, RequestRef),
          {error, timeout};
        Error ->
          ?report_trace("connect error", []),
          'Elixir.IO':inspect({pool, connect_error, Error}),
          _ = metrics:increment_counter(Metrics, [hackney, Host, connect_error]),
          cancel_checkout(Pool, Connection, RequestRef),
          Error
      end;
    {error, Reason} ->
      {error, Reason};
    {'EXIT', {timeout, Reason}} ->
      % socket will still checkout so to avoid deadlock we send in a cancellation
      cancel_checkout(Pool, Connection, RequestRef),
      {error, checkout_timeout}
  end.

cancel_checkout(Pool, Connection, RequestRef) ->
  gen_server:cast(Pool, {checkout_cancel, Connection, RequestRef}).

The above change seems to fix the {:error, :econnrefused | :timeout | atom}-related failures.
I still haven't figured out why valid requests are exhausting the pool in my current tests; I'll update this comment with my findings (or failures).

Regarding the valid request failure

The issue was not reading the response body to complete the transaction:

for _ <- 1..60 do
  result = {:ok, 200, _headers, body_ref} = :hackney.request("http://localhost:7654")
  _ = :hackney.body(body_ref) #< the important bit: reading the body releases the socket
  IO.inspect {:response, result}
  IO.inspect {:pool_stats, :hackney_pool.get_stats(:default)}
end

IceDragon200 added a commit to IceDragon200/hackney that referenced this issue Sep 14, 2020
s3cur3 added a commit to X-Plane/weather-mirror that referenced this issue Oct 23, 2020
…ailed downloads, the pool is (permanently?) exhausted

Without this, after we hit a few HTTP errors (which of course NOAA serves us a lot, that's the whole point of this proxy), all (?) future downloads fail with this error:
{:error, %HTTPoison.Error{id: nil, reason: :checkout_timeout}}

Hackney issue: edgurgel/httpoison#414
More info: benoitc/hackney#643
@InoMurko
Author

Thanks everyone!!! <3
