Process stuck with growing message queue #5420

@jamesaimonetti

Description

We are seeing steadily worsening performance in CouchDB 3.3.3.

Over time, queries to CouchDB take longer and eventually start returning 500s, and performance continues to degrade.

We've found a process with a growing mailbox:

process_info(pid(0,289,0)).
[{registered_name,couch_server_10},
 {current_function,{erts_internal,await_result,1}},
 {initial_call,{proc_lib,init_p,5}},
 {status,running},
 {message_queue_len,69312},
 {links,[<0.18892.3178>,<0.25843.3474>,<0.28109.2975>,
         <0.30613.3209>,<0.32351.3494>,<0.31224.3509>,<0.30413.3250>,
         <0.27158.3496>,<0.28042.3560>,<0.19662.3364>,<0.22065.3445>,
         <0.22667.3591>,<0.20881.3172>,<0.19280.3563>,<0.19642.3408>,
         <0.19654.3416>,<0.19041.3365>,<0.9166.3328>,<0.17046.2913>,
         <0.17074.3321>,<0.17825.3408>|...]},
 {dictionary,[{'$ancestors',[couch_primary_services,
                             couch_sup,<0.256.0>]},
              {'$initial_call',{couch_server,init,1}}]},
 {trap_exit,true},
 {error_handler,error_handler},
 {priority,normal}, 
 {group_leader,<0.255.0>},
 {total_heap_size,365113},
 {heap_size,46422},
 {stack_size,45},
 {reductions,99576710041},
 {garbage_collection,[{max_heap_size,#{error_logger => true,kill => true,size => 0}},
                      {min_bin_vheap_size,46422},
                      {min_heap_size,233},
                      {fullsweep_after,65535},
                      {minor_gcs,16048}]},
 {suspending,[]}]
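
For anyone checking their own nodes, here is a minimal sketch of one way to spot a process like this from a remote shell attached to the node. It is not necessarily how the process above was found, and `TopQueues` is a hypothetical helper name, not part of CouchDB. It uses only `erlang:processes/0` and `erlang:process_info/2`.

```erlang
%% Paste into a remote shell (remsh) attached to the affected node.
%% TopQueues is a hypothetical helper name, not a CouchDB function.
TopQueues = fun(N) ->
    Lens = [{Len, Pid}
            || Pid <- erlang:processes(),
               %% process_info/2 returns undefined for a process that has
               %% already exited; the pattern below skips those entries.
               {message_queue_len, Len} <-
                   [erlang:process_info(Pid, message_queue_len)]],
    %% Sort ascending by queue length, reverse, keep the N longest.
    Top = lists:sublist(lists:reverse(lists:sort(Lens)), N),
    %% Attach the registered name: [] if unnamed, undefined if exited.
    [{Len, Pid, erlang:process_info(Pid, registered_name)}
     || {Len, Pid} <- Top]
end.

%% Show the five processes with the longest mailboxes.
TopQueues(5).
```

Running it twice a few seconds apart shows whether a queue is actually growing or only momentarily long.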

Looking at the linked processes, we see many couch_db_updater processes that appear to be stuck in gen:do_call/4:

[{current_function,{gen,do_call,4}},
 {initial_call,{proc_lib,init_p,5}},
 {status,waiting},
 {message_queue_len,2},
 {links,[<7672.6048.3484>,<7672.289.0>]},
 {dictionary,[{'$ancestors',[<7672.25449.3401>]},
              {io_priority,{db_update,<<"shards/00000000-ffffffff/account/0d/a0/175b29c55f3888839e47caf2821e-202502.1738041440">>}},
              {last_id_merged,<<"202502-ledgers_monthly_rollover">>},
              {'$initial_call',{couch_db_updater,init,1}},
              {idle_limit,61000}]},
 {trap_exit,false},
 {error_handler,error_handler},
 {priority,normal},
 {group_leader,<7672.255.0>},
 {total_heap_size,4185},
 {heap_size,4185},
 {stack_size,44},
 {reductions,21157},
 {garbage_collection,[{max_heap_size,#{error_logger => true,kill => true,
                                       size => 0}},
                      {min_bin_vheap_size,46422},
                      {min_heap_size,233},
                      {fullsweep_after,65535},
                      {minor_gcs,0}]},
 {suspending,[]}]
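
With this many links, it can help to count the linked processes by their current function rather than inspecting them one by one. The following is a rough sketch under the assumption that it runs in a shell on the same node as the process being inspected; `LinkSummary` is a hypothetical helper name.

```erlang
%% Paste into a remote shell (remsh) attached to the affected node.
%% LinkSummary is a hypothetical helper name, not a CouchDB function.
LinkSummary = fun(Pid) ->
    {links, Links} = erlang:process_info(Pid, links),
    Funs = [MFA
            || L <- Links,
               %% Links may include ports, and process_info/2 only
               %% accepts pids that live on the local node.
               is_pid(L), node(L) =:= node(),
               %% Skip linked processes that exited in the meantime.
               {current_function, MFA} <-
                   [erlang:process_info(L, current_function)]],
    %% Count how many linked processes sit in each function.
    Counts = lists:foldl(
               fun(MFA, Acc) ->
                   maps:update_with(MFA, fun(C) -> C + 1 end, 1, Acc)
               end, #{}, Funs),
    %% Most common current function first.
    lists:reverse(lists:keysort(2, maps:to_list(Counts)))
end.

%% Example: summarize what the processes linked to couch_server_10 are doing.
LinkSummary(whereis(couch_server_10)).
```

If the registered name does not exist on the node, `whereis/1` returns `undefined` and the call fails with `badarg`, so check the name first.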

Steps to Reproduce

This develops over time, but it appears to be correlated with a number of tasks we run at the beginning of the month.

Expected Behaviour

CouchDB should not lock up: couch_server should keep up with its message queue.

Your Environment

  • CouchDB version used: 3.3.3 / OTP 24
  • Browser name and version: N/A
  • Operating system and version: CentOS

Additional Context

It's a 3-node cluster, and we see this on all three nodes.
