
Hackney pool overload #510

Open
grepz opened this issue Jun 6, 2018 · 2 comments

@grepz

commented Jun 6, 2018

Hello,

It is possible to overload the hackney pool and render it unresponsive, or even crash the Erlang node.
The scenario is as follows:

A process sends simple HTTP requests to a server that can handle no more than N RPS, at a rate of N * k (k >= 2). At some point the hackney pool queue size begins to grow indefinitely (with my numbers, the receiving server handles ~300 RPS while the client sends 600-700 RPS).
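
Below is a minimal sketch of the kind of load generator that triggers this (the module name and URL handling are illustrative, not part of the actual setup): it simply spawns more concurrent requesters than the pool has connections, so checkouts pile up faster than they can be served.

%% Sketch only: spawn NumWorkers processes that hammer Url through the default pool.
-module(pool_overload_sketch).
-export([run/2]).

run(Url, NumWorkers) ->
    [spawn(fun() -> loop(Url) end) || _ <- lists:seq(1, NumWorkers)],
    ok.

loop(Url) ->
    %% {pool, default} routes the request through the shared hackney pool.
    case hackney:request(get, Url, [], <<>>, [{pool, default}]) of
        {ok, _Status, _Headers, Ref} ->
            %% Read the body so the socket can be returned to the pool.
            _ = hackney:body(Ref);
        {error, _Reason} ->
            ok
    end,
    loop(Url).

With NumWorkers well above the pool size and a server that drains slower than the offered rate, the pool's message queue grows just like in the process info below.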

Process info for a pool shows this:

[{current_function,{queue,filter_r,2}},
 {initial_call,{proc_lib,init_p,5}},
 {status,runnable},
 {message_queue_len,305621},
 {messages,[{'$gen_cast',{checkout_cancel,{"tapi.sgrd.ix.wgcrowd.io",
                                           80,hackney_tcp},
                                          #Ref<0.0.1835010.103289>}},
            {'$gen_cast',{checkout_cancel,{"tapi.sgrd.ix.wgcrowd.io",80,
                                           hackney_tcp},
                                          #Ref<0.0.1572872.124197>}},
            {'$gen_cast',{checkout_cancel,{"tapi.sgrd.ix.wgcrowd.io",80,
                                           hackney_tcp},
                                          #Ref<0.0.1835009.101047>}},
            {'DOWN',#Ref<0.0.1572866.107847>,request,<0.15587.32>,
                    normal},
            {'DOWN',#Ref<0.0.1572868.125051>,request,<0.15587.32>,
                    normal},
            {'DOWN',#Ref<0.0.1572869.178066>,request,<0.15587.32>,
                    normal},
            {'DOWN',#Ref<0.0.1572865.99206>,request,<0.15586.32>,normal},
            {'DOWN',#Ref<0.0.1572865.158565>,request,<0.15586.32>,
                    normal},
            {'DOWN',#Ref<0.0.1572867.214897>,request,<0.15586.32>,
                    normal},
            {'DOWN',#Ref<0.0.1572866.107844>,request,<0.15584.32>,
                    normal},
            {'DOWN',#Ref<0.0.1572869.91611>,request,<0.15584.32>,normal},
            {'DOWN',#Ref<0.0.1572872.6542>,request,<0.15584.32>,normal},
            {'$gen_cast',{checkout_cancel,{"tapi.sgrd.ix.wgcrowd.io",80,
                                           hackney_tcp},
                                          #Ref<0.0.1835011.64499>}},
            {'$gen_cast',{checkout_cancel,{"tapi.sgrd.ix.wgcrowd.io",80,
                                           hackney_tcp},
                                          #Ref<0.0.1572872.124201>}},
            {'DOWN',#Ref<0.0.1572867.71002>,request,<0.15585.32>,normal},
            {'DOWN',#Ref<0.0.1572871.14102>,request,<0.15585.32>,normal},
            {'DOWN',#Ref<0.0.1572868.211148>,request,<0.15585.32>,
                    normal},
            {'$gen_cast',{checkout_cancel,{...},...}},
            {'$gen_cast',{checkout_cancel,...}},
            {'$gen_call',{...},...},
            {'$gen_cast',...},
            {...}|...]},
 {links,[<0.982.0>]},
 {dictionary,[{'$ancestors',[hackney_sup,<0.981.0>]},
              {'$initial_call',{hackney_pool,init,1}}]},
 {trap_exit,false},
 {error_handler,error_handler},
 {priority,normal},
 {group_leader,<0.980.0>},
 {total_heap_size,25188109},
 {heap_size,2984878},
 {stack_size,8939},
 {reductions,7087623884},
 {garbage_collection,[{min_bin_vheap_size,46422},
                      {min_heap_size,233},
                      {fullsweep_after,65535},
                      {minor_gcs,199}]},
 {suspending,[]}]
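
(For anyone reproducing this, a snapshot like the one above can be taken with something along these lines from the Erlang shell; it assumes the pools sit directly under the registered hackney_sup supervisor, which is what the $ancestors entry suggests:)

%% Sketch: list hackney_sup children and check their message queues.
Children = supervisor:which_children(hackney_sup),
[erlang:process_info(Pid, [registered_name, current_function,
                           message_queue_len, total_heap_size])
 || {_Id, Pid, _Type, _Mods} <- Children, is_pid(Pid)].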

Pool configuration looks like this:

  max_conn => 100,
  timeout => 3000
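
(For reference, a sketch of how an equivalent pool would be declared directly through hackney_pool, assuming hackney's documented option names timeout and max_connections; the pool name and URL are placeholders:)

%% Sketch: a dedicated pool with roughly the settings above.
PoolName = overloaded_pool,
ok = hackney_pool:start_pool(PoolName, [{timeout, 3000}, {max_connections, 100}]),
%% Requests opt into it via the pool option:
{ok, _Status, _Headers, _Ref} =
    hackney:request(get, <<"http://example.com/">>, [], <<>>, [{pool, PoolName}]).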

Since the message queue is huge, the pool becomes completely unresponsive.

My proposal: maybe it's a good idea to set up some back pressure or introduce a defensive mechanism that prevents the pool from locking up.
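
A minimal sketch of what caller-side back pressure could look like in the meantime, assuming a simple in-flight counter in ETS (module name, error value and limits are illustrative; a production setup would more likely use a regulator library such as jobs or sbroker):

%% Sketch only: cap in-flight requests so checkouts cannot outrun the pool.
-module(limited_client).
-export([init/1, request/2]).

init(MaxInFlight) ->
    ets:new(limited_client, [named_table, public]),
    ets:insert(limited_client, [{inflight, 0}, {max, MaxInFlight}]).

request(Url, Opts) ->
    [{max, Max}] = ets:lookup(limited_client, max),
    case ets:update_counter(limited_client, inflight, 1) of
        N when N > Max ->
            %% Shed load immediately instead of queueing inside the pool gen_server.
            ets:update_counter(limited_client, inflight, -1),
            {error, overloaded};
        _ ->
            try
                hackney:request(get, Url, [], <<>>, Opts)
            after
                ets:update_counter(limited_client, inflight, -1)
            end
    end.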

It is also somewhat confusing that hackney_pool answers {error, connect_timeout} when the gen_server:call to the pool times out. In the case described above no connection attempt was even made; the gen_server call simply timed out, which adds to the confusion.

@benoitc benoitc self-assigned this Jun 9, 2018

@seanmcevoy

Contributor

commented Oct 11, 2018

I noticed similar issues and, when testing, alleviated a lot of them with this fix:
#540
It's not a complete fix, but it definitely helps with the symptoms described here.
@benoitc

@indrekj


commented Mar 5, 2019

I'm not sure if it's the same problem, but my pools are getting stuck as well. I'm using hackney 1.15.0. Running get_stats on the pool returns:

iex(transporter@10.0.72.101)14> :hackney_pool.get_stats(:"client-logger")
[
  name: :"client-logger",
  max: 400,
  in_use_count: 0,
  free_count: 0,
  queue_count: 13961
]

in_use_count is 0 but the queue count is huge.
