
Hackney pool overload #510

Open
grepz opened this issue Jun 6, 2018 · 2 comments
grepz commented Jun 6, 2018

Hello,

It is possible to overload the hackney pool and render it unresponsive, or even crash the Erlang node.
The case is like this:

A process sends simple HTTP requests to a server that can handle no more than N RPS, at a rate of N * k (k >= 2). At some point the hackney pool queue size begins to grow indefinitely (with my numbers, the receiving server handles ~300 RPS while the client sends 600-700 RPS).
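
A rough way to reproduce this (a minimal sketch; the host, pool name, and numbers below are placeholders, not my exact setup):

%% Reproduction sketch (hypothetical host and pool; rates are placeholders).
%% Spawn far more concurrent callers than the target server can serve, all
%% going through one hackney pool, so checkout requests queue up in the pool.
-module(pool_flood).
-export([run/0]).

run() ->
    ok = hackney_pool:start_pool(flood_pool,
                                 [{max_connections, 100}, {timeout, 3000}]),
    [spawn(fun loop/0) || _ <- lists:seq(1, 700)],
    ok.

loop() ->
    %% Each request checks a connection out of flood_pool; once the pool is
    %% saturated, callers pile up in the pool process's message queue.
    _ = hackney:request(get, <<"http://example.com/">>, [], <<>>,
                        [{pool, flood_pool}, with_body]),
    loop().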

Process info for a pool shows this:

[{current_function,{queue,filter_r,2}},
 {initial_call,{proc_lib,init_p,5}},
 {status,runnable},
 {message_queue_len,305621},
 {messages,[{'$gen_cast',{checkout_cancel,{"tapi.sgrd.ix.wgcrowd.io",
                                           80,hackney_tcp},
                                          #Ref<0.0.1835010.103289>}},
            {'$gen_cast',{checkout_cancel,{"tapi.sgrd.ix.wgcrowd.io",80,
                                           hackney_tcp},
                                          #Ref<0.0.1572872.124197>}},
            {'$gen_cast',{checkout_cancel,{"tapi.sgrd.ix.wgcrowd.io",80,
                                           hackney_tcp},
                                          #Ref<0.0.1835009.101047>}},
            {'DOWN',#Ref<0.0.1572866.107847>,request,<0.15587.32>,
                    normal},
            {'DOWN',#Ref<0.0.1572868.125051>,request,<0.15587.32>,
                    normal},
            {'DOWN',#Ref<0.0.1572869.178066>,request,<0.15587.32>,
                    normal},
            {'DOWN',#Ref<0.0.1572865.99206>,request,<0.15586.32>,normal},
            {'DOWN',#Ref<0.0.1572865.158565>,request,<0.15586.32>,
                    normal},
            {'DOWN',#Ref<0.0.1572867.214897>,request,<0.15586.32>,
                    normal},
            {'DOWN',#Ref<0.0.1572866.107844>,request,<0.15584.32>,
                    normal},
            {'DOWN',#Ref<0.0.1572869.91611>,request,<0.15584.32>,normal},
            {'DOWN',#Ref<0.0.1572872.6542>,request,<0.15584.32>,normal},
            {'$gen_cast',{checkout_cancel,{"tapi.sgrd.ix.wgcrowd.io",80,
                                           hackney_tcp},
                                          #Ref<0.0.1835011.64499>}},
            {'$gen_cast',{checkout_cancel,{"tapi.sgrd.ix.wgcrowd.io",80,
                                           hackney_tcp},
                                          #Ref<0.0.1572872.124201>}},
            {'DOWN',#Ref<0.0.1572867.71002>,request,<0.15585.32>,normal},
            {'DOWN',#Ref<0.0.1572871.14102>,request,<0.15585.32>,normal},
            {'DOWN',#Ref<0.0.1572868.211148>,request,<0.15585.32>,
                    normal},
            {'$gen_cast',{checkout_cancel,{...},...}},
            {'$gen_cast',{checkout_cancel,...}},
            {'$gen_call',{...},...},
            {'$gen_cast',...},
            {...}|...]},
 {links,[<0.982.0>]},
 {dictionary,[{'$ancestors',[hackney_sup,<0.981.0>]},
              {'$initial_call',{hackney_pool,init,1}}]},
 {trap_exit,false},
 {error_handler,error_handler},
 {priority,normal},
 {group_leader,<0.980.0>},
 {total_heap_size,25188109},
 {heap_size,2984878},
 {stack_size,8939},
 {reductions,7087623884},
 {garbage_collection,[{min_bin_vheap_size,46422},
                      {min_heap_size,233},
                      {fullsweep_after,65535},
                      {minor_gcs,199}]},
 {suspending,[]}]

Pool configuration looks like this:

  max_conn => 100,
  timeout => 3000
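
(For reference, and assuming those keys map onto hackney_pool's max_connections and timeout options, the pool is started roughly like this; the pool name is just a placeholder:)

ok = hackney_pool:start_pool(my_pool,
                             [{max_connections, 100},   %% max_conn => 100
                              {timeout, 3000}]).        %% timeout => 3000 (ms)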

Since the message queue is huge, the pool becomes completely unresponsive.

My proposal: maybe it's a good idea to set up some back pressure or introduce a defensive mechanism that would help avoid locking up the pool.

Also, it is a bit unclear that hackney_pool answers {error, connect_timeout} when the gen_server:call to the pool times out. That adds to the confusion in the case described above: the connection has not even been attempted, since the gen_server call itself timed out.
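
To illustrate the back-pressure idea on the caller side, something like this could shed load before the pool's queue explodes (just a sketch; the pool name and threshold are made up, and it only uses hackney_pool:get_stats/1 plus a plain request):

-module(bp_client).
-export([request/1]).

-define(POOL, my_pool).     %% example pool name
-define(MAX_QUEUED, 1000).  %% example threshold for shedding load

%% Crude client-side back pressure: refuse to enqueue more checkouts when the
%% pool already has too many waiters, instead of piling onto its mailbox.
request(Url) ->
    Stats = hackney_pool:get_stats(?POOL),
    case proplists:get_value(queue_count, Stats, 0) of
        Queued when Queued > ?MAX_QUEUED ->
            {error, overloaded};
        _ ->
            hackney:request(get, Url, [], <<>>, [{pool, ?POOL}, with_body])
    end.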

seanmcevoy (Contributor) commented Oct 11, 2018

I noticed similar issues and, when testing, alleviated a lot of them with this fix:
#540
It's not a complete fix, but it definitely helps with the symptoms here.
@benoitc

indrekj commented Mar 5, 2019

I'm not sure if it's the same problem, but my pools are getting stuck as well. I'm using hackney 1.15.0. Running get_stats on the pool returns:

iex(transporter@10.0.72.101)14> :hackney_pool.get_stats(:"client-logger")
[
  name: :"client-logger",
  max: 400,
  in_use_count: 0,
  free_count: 0,
  queue_count: 13961
]

in_use_count is 0 but the queue count is huge.
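
One way to check whether the pool gen_server itself is the bottleneck (an Erlang shell sketch using only plain OTP calls; hackney_sup is the supervisor visible in the process dump above):

%% List hackney pool workers under hackney_sup and inspect their mailboxes.
%% A huge message_queue_len on a pool process matches the symptom in this issue.
[{Id,
  erlang:process_info(Pid, message_queue_len),
  erlang:process_info(Pid, current_function)}
 || {Id, Pid, worker, _Mods} <- supervisor:which_children(hackney_sup),
    is_pid(Pid)].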
