Hanging or 500 errors during concurrent replication #1093

Closed · ReeceStevens opened this issue Jan 5, 2018 · 20 comments

@ReeceStevens commented Jan 5, 2018

Expected Behavior

If multiple replications are occurring in different processes, they should not cause the database to return a 500 error (in the case of CouchDB 2.1.1) or hang for extended periods of time (CouchDB 1.6.1).

Current Behavior

I apologize in advance for the length of some of these log snippets; I am including the entirety of any line that begins with [error], which can be quite a bit of information.

CouchDB 2.1.1

If multiple threads trigger replications at the same time, an occasional server 500 error occurs. This is more likely when all replications refer to the same source database.

Logs show the following message:

[error] 2018-01-05T18:34:27.007398Z couchdb@127.0.0.1 <0.14271.1667> f5fe1295eb req_err(54226069) internal_server_error : No DB shards could be opened.
    [<<"fabric_util:get_shard/4 L185">>,<<"fabric_util:get_shard/4 L200">>,<<"fabric_util:get_shard/4 L200">>,<<"fabric:get_security/2 L146">>,<<"chttpd_auth_request:db_authorization_check/1 L91">>,<<"chttpd_auth_request:authorize_request/1 L19">>,<<"chttpd:process_request/1 L293">>,<<"chttpd:handle_request_int/1 L231">>]
[error] 2018-01-05T18:34:25.555735Z couchdb@127.0.0.1 <0.23659.1662> -------- Replicator, request GET to "http://localhost:5984/rs_src_6/60da00929026155788ff550e81ccdabe?revs=true&open_revs=%5B%221-4d79096a94eeb4a5b15bc94cb52a07dc%22%5D&latest=true" failed. The received HTTP error code is 500
[notice] 2018-01-05T18:34:25.560581Z couchdb@127.0.0.1 <0.17111.1664> -------- Retrying GET to http://localhost:5984/rs_src_6/60da00929026155788ff550e81ccdabe?revs=true&open_revs=%5B%221-4d79096a94eeb4a5b15bc94cb52a07dc%22%5D&latest=true in 4.0 seconds due to error {code,500}

and occasionally:

[notice] 2018-01-05T19:36:46.773242Z couchdb@127.0.0.1 <0.3476.1676> 946d87022f localhost:5984 127.0.0.1 undefined PUT /rs_dest_4/_local/2ea4663317c5b755669098f53783083c 500 ok 2595
[error] 2018-01-05T19:36:46.773493Z couchdb@127.0.0.1 <0.28639.1669> -------- Replicator, request PUT to "http://localhost:5984/rs_dest_4/_local/2ea4663317c5b755669098f53783083c" failed due to error {code,500}
[error] 2018-01-05T19:36:46.773568Z couchdb@127.0.0.1 <0.28639.1669> -------- Replication `2ea4663317c5b755669098f53783083c+create_target` (`http://localhost:5984/rs_src_0/` -> `http://localhost:5984/rs_dest_4/`) failed: {http_request_failed,"PUT",
                     "http://localhost:5984/rs_dest_4/_local/2ea4663317c5b755669098f53783083c",
                     {error,{code,500}}}
[error] 2018-01-05T19:36:46.773790Z couchdb@127.0.0.1 <0.350.0> -------- couch_replicator_scheduler : Transient job {"2ea4663317c5b755669098f53783083c","+create_target"} failed, removing. Error: <<"{http_request_failed,\"PUT\",\n                     \"http://localhost:5984/rs_dest_4/_local/2ea4663317c5b755669098f53783083c\",\n                     {error,{code,500}}}">>
[notice] 2018-01-05T19:36:46.773990Z couchdb@127.0.0.1 <0.11718.1671> d2ce14ca71 localhost:5984 127.0.0.1 undefined POST /_replicate 500 ok 68552
[error] 2018-01-05T19:36:46.774397Z couchdb@127.0.0.1 <0.28639.1669> -------- gen_server {couch_replicator_scheduler_job,{[50,101,97,52,54,54,51,51,49,55,99,53,98,55,53,53,54,54,57,48,57,56,102,53,51,55,56,51,48,56,51,99],[43,99,114,101,97,116,101,95,116,97,114,103,101,116]}} terminated with reason: {http_request_failed,"PUT","http://localhost:5984/rs_dest_4/_local/2ea4663317c5b755669098f53783083c",{error,{code,500}}} at couch_replicator_httpc:report_error/4(line:341) <= couch_replicator_httpc:send_req/3(line:67) <= couch_replicator_scheduler_job:update_checkpoint/2(line:824) <= couch_replicator_scheduler_job:update_checkpoint/3(line:814) <= couch_replicator_scheduler_job:do_checkpoint/1(line:787) <= couch_replicator_scheduler_job:do_last_checkpoint/1(line:551) <= gen_server:try_dispatch/4(line:615) <= gen_server:handle_msg/5(line:681)
  last msg: {'EXIT',<0.2642.1672>,normal}
     state: [{rep_id,{"2ea4663317c5b755669098f53783083c","+create_target"}},{source,"http://localhost:5984/rs_src_0/"},{target,"http://localhost:5984/rs_dest_4/"},{db_name,null},{doc_id,null},{options,[{checkpoint_interval,30000},{connection_timeout,30000},{create_target,true},{http_connections,20},{retries,5},{socket_options,[{keepalive,true},{nodelay,false}]},{use_checkpoints,true},{worker_batch_size,500},{worker_processes,4}]},{session_id,<<"f324fa5c79c24efe92eeaab39762d529">>},{start_seq,{0,0}},{source_seq,<<"1-g1AAAACreJzLYWBgYMlgTmEQTM4vTc5ISXIwNDLXMwBCwxygFFMiQ5L9____sxIZ8ShKcgCSSfVgdQx41OWxAEmGBiAFVLqfGLUHIGpB5mYBAGNWL7M">>},{committed_seq,{0,0}},{current_through_seq,{1,<<"1-g1AAAAEBeJzLYWBgYMlgTmEQTM4vTc5ISXIwNDLXMwBCwxygFFMiQ5L9____szKYExlzgQLs5oamKSkmFtg04DEmyQFIJtVDTWIAm2RiaWJqaGCcwsBZmpeSmpaZl5qCx4Q8FiDJ0ACkgIbsR5hiYZhsYGJkQZIpByCm_M9KZMgCAHQhRKU">>}},{highest_seq_done,{1,<<"1-g1AAAAEBeJzLYWBgYMlgTmEQTM4vTc5ISXIwNDLXMwBCwxygFFMiQ5L9____szKYExlzgQLs5oamKSkmFtg04DEmyQFIJtVDTWIAm2RiaWJqaGCcwsBZmpeSmpaZl5qCx4Q8FiDJ0ACkgIbsR5hiYZhsYGJkQZIpByCm_M9KZMgCAHQhRKU">>}}]
[error] 2018-01-05T19:36:46.774715Z couchdb@127.0.0.1 <0.28639.1669> -------- CRASH REPORT Process (<0.28639.1669>) with 0 neighbors exited with reason: {http_request_failed,"PUT","http://localhost:5984/rs_dest_4/_local/2ea4663317c5b755669098f53783083c",{error,{code,500}}} at gen_server:terminate/7(line:826) <= proc_lib:init_p_do_apply/3(line:240); initial_call: {couch_replicator_scheduler_job,init,['Argument__1']}, ancestors: [couch_replicator_scheduler_sup,couch_replicator_sup,...], messages: [], links: [<0.349.0>], dictionary: [{task_status_props,[{changes_pending,null},{checkpoint_interval,30000},...]},...], trap_exit: true, status: running, heap_size: 6772, stack_size: 27, reductions: 18107
[error] 2018-01-05T19:36:46.774881Z couchdb@127.0.0.1 <0.349.0> -------- Supervisor couch_replicator_scheduler_sup had child undefined started with {couch_replicator_scheduler_job,start_link,undefined} at <0.28639.1669> exit with reason {http_request_failed,"PUT","http://localhost:5984/rs_dest_4/_local/2ea4663317c5b755669098f53783083c",{error,{code,500}}} in context child_terminated

CouchDB 1.6.1

If multiple threads trigger replications at the same time, the server occasionally stalls for long periods of time and occasionally returns timeout errors. "Long periods of time" here means longer than if each thread were to run synchronously, one after the other.

Verbose logging shows the following message when the error occurs; the server then locks up:

[Fri, 05 Jan 2018 19:28:36 GMT] [error] [<0.4483.0>] {error_report,<0.32.0>,
                      {<0.4483.0>,crash_report,
                       [[{initial_call,
                          {mochiweb_acceptor,init,
                           ['Argument__1','Argument__2','Argument__3']}},
                         {pid,<0.4483.0>},
                         {registered_name,[]},
                         {error_info,
                          {exit,
                           {timeout,
                            {gen_server,call,[couch_httpd_vhost,get_state]}},
                           [{gen_server,call,2,
                             [{file,"gen_server.erl"},{line,204}]},
                            {couch_httpd_vhost,dispatch_host,1,
                             [{file,"couch_httpd_vhost.erl"},{line,96}]},
                            {couch_httpd,handle_request,5,
                             [{file,"couch_httpd.erl"},{line,217}]},
                            {mochiweb_http,headers,5,
                             [{file,"mochiweb_http.erl"},{line,94}]},
                            {proc_lib,init_p_do_apply,3,
                             [{file,"proc_lib.erl"},{line,240}]}]}},
                         {ancestors,
                          [couch_httpd,couch_secondary_services,
                           couch_server_sup,<0.33.0>]},
                         {messages,
                          [{#Ref<0.0.262145.92883>,
                            {vhosts_state,[],
                             ["_utils","_uuids","_session","_oauth","_users"],
                             #Fun<couch_httpd.8.11472519>}}]},
                         {links,[<0.99.0>,#Port<0.6333>]},
                         {dictionary,[{couch_rewrite_count,0}]},
                         {trap_exit,false},
                         {status,running},
                         {heap_size,2586},
                         {stack_size,27},
                         {reductions,14911}],
                        []]}}

Eventually, there is a request timeout error. This can push execution time to over 1.5x the synchronous upper limit, and sometimes substantially higher. Occasionally, this can also cause CouchDB 1.6.1 to freeze completely, with no log output or other indication of activity. When a crash occurs, the log output is:


[Fri, 05 Jan 2018 20:08:59 GMT] [error] [<0.6673.0>] ** Generic server couch_server terminating
** Last message in was {#Ref<0.0.1048579.8596>,
                        {ok,{db,<0.6686.0>,<0.6687.0>,nil,
                                <<"1515181979786019">>,<0.6688.0>,<0.6684.0>,
                                <0.6689.0>,
                                {db_header,6,1,0,
                                    {2125,{1,0,1963},95},
                                    {2042,1,83},
                                    nil,0,nil,nil,1000},
                                1,
                                {btree,<0.6684.0>,
                                    {2125,{1,0,1963},95},
                                    #Fun<couch_db_updater.10.21234455>,
                                    #Fun<couch_db_updater.11.21234455>,
                                    #Fun<couch_btree.5.131744168>,
                                    #Fun<couch_db_updater.12.21234455>,snappy},
                                {btree,<0.6684.0>,
                                    {2042,1,83},
                                    #Fun<couch_db_updater.13.21234455>,
                                    #Fun<couch_db_updater.14.21234455>,
                                    #Fun<couch_btree.5.131744168>,
                                    #Fun<couch_db_updater.15.21234455>,snappy},
                                {btree,<0.6684.0>,nil,
                                    #Fun<couch_btree.3.131744168>,
                                    #Fun<couch_btree.4.131744168>,
                                    #Fun<couch_btree.5.131744168>,nil,snappy},
                                1,<<"_users">>,
                                "/var/lib/couchdb/_users.couch",
                                [#Fun<couch_doc.8.104915127>],
                                [],nil,
                                {user_ctx,null,[],undefined},
                                nil,1000,
                                [before_header,after_header,on_file_open],
                                [{before_doc_update,
                                     #Fun<couch_users_db.before_doc_update.2>},
                                 {after_doc_read,
                                     #Fun<couch_users_db.after_doc_read.2>},
                                 sys_db,
                                 {user_ctx,
                                     {user_ctx,null,[<<"_admin">>],undefined}},
                                 nologifmissing,sys_db],
                                snappy,
                                #Fun<couch_users_db.before_doc_update.2>,
                                #Fun<couch_users_db.after_doc_read.2>}}}
** When Server state == {server,"/var/lib/couchdb",
                            {re_pattern,0,0,0,
                                <<69,82,67,80,140,0,0,0,16,0,0,0,1,0,0,0,255,
                                  255,255,255,255,255,255,255,0,0,0,0,0,0,0,0,
                                  0,0,64,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,
                                  0,0,0,0,0,0,0,0,0,0,0,125,0,72,25,106,0,0,0,
                                  0,0,0,0,0,0,0,0,0,254,255,255,7,0,0,0,0,0,0,
                                  0,0,0,0,0,0,0,0,0,0,106,0,0,0,0,16,171,255,
                                  3,0,0,0,128,254,255,255,7,0,0,0,0,0,0,0,0,0,
                                  0,0,0,0,0,0,0,98,27,114,0,72,0>>},
                            300,0,"Fri, 05 Jan 2018 19:52:51 GMT"}
** Reason for termination ==
** {kill,[{couch_server,handle_info,2,[{file,"couch_server.erl"},{line,479}]},
          {gen_server,try_dispatch,4,[{file,"gen_server.erl"},{line,615}]},
          {gen_server,handle_msg,5,[{file,"gen_server.erl"},{line,681}]},
          {proc_lib,init_p_do_apply,3,[{file,"proc_lib.erl"},{line,240}]}]}

[Fri, 05 Jan 2018 20:09:06 GMT] [error] [<0.6673.0>] {error_report,<0.32.0>,
                      {<0.6673.0>,crash_report,
                       [[{initial_call,{couch_server,init,['Argument__1']}},
                         {pid,<0.6673.0>},
                         {registered_name,couch_server},
                         {error_info,
                          {exit,kill,
                           [{gen_server,terminate,7,
                             [{file,"gen_server.erl"},{line,826}]},
                            {proc_lib,init_p_do_apply,3,
                             [{file,"proc_lib.erl"},{line,240}]}]}},
                         {ancestors,
                          [couch_primary_services,couch_server_sup,<0.33.0>]},
                         {messages,
                          [{'$gen_call',
                            {<0.6674.0>,#Ref<0.0.1048579.8601>},
                            {create,<<"_users">>,
                             [{before_doc_update,
                               #Fun<couch_users_db.before_doc_update.2>},
                              {after_doc_read,
                               #Fun<couch_users_db.after_doc_read.2>},
                              sys_db,
                              {user_ctx,
                               {user_ctx,null,[<<"_admin">>],undefined}},
                              nologifmissing,sys_db]}},
                           {'$gen_call',
                            {<0.6493.0>,#Ref<0.0.1048581.39353>},
                            {open,<<"rs_src_0">>,
                             [{user_ctx,
                               {user_ctx,null,
                                [<<"_admin">>],
                                <<"{couch_httpd_auth, default_authentication_handler}">>}}]}},
                            *********** SNIPPED (repeating) ***********
                         {links,[<0.83.0>]},
                         {dictionary,[]},
                         {trap_exit,true},
                         {status,running},
                         {heap_size,2586},
                         {stack_size,27},
                         {reductions,5686}],
                        []]}}

Possible Solution

Given that the issue occurs only intermittently and is more likely to show up with more threads, I am inclined to think this is a race condition. It also seems to happen more frequently when the replications share a source database, which might indicate the race involves reading the source database. That is only a hunch, however.

Steps to Reproduce (for bugs)

A script to replicate the issue is here: https://gist.github.com/ReeceStevens/35d2cb06f820d3f054c6ff8dc226ef17

  1. Make sure a CouchDB server is running on localhost:5984
  2. Run python test_parallel_replication.py --threads 10 --iterations 40 --single-source (we have had the most luck reproducing the issue with these parameters)
  3. Look for a 500 error, or for the elapsed execution time to exceed the synchronous upper limit (you will see a "LONGER THAN SYNCHRONOUS" log message when this is the case).
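
For anyone who doesn't want to open the gist, the core of the script is roughly the following (a minimal sketch of the same idea, not the gist itself; the database names, thread count, and use of the `requests` library are assumptions):

import threading

import requests

COUCH = "http://localhost:5984"

def replicate(i):
    # Trigger one replication out of a shared source into a per-thread
    # target, approximating the script's --single-source mode.
    resp = requests.post(f"{COUCH}/_replicate", json={
        "source": f"{COUCH}/rs_src_0",
        "target": f"{COUCH}/rs_dest_{i}",
        "create_target": True,
    })
    if resp.status_code >= 500:
        print(f"thread {i}: HTTP {resp.status_code}: {resp.text}")

threads = [threading.Thread(target=replicate, args=(i,)) for i in range(10)]
for t in threads:
    t.start()
for t in threads:
    t.join()

With enough concurrent threads, some of these POSTs may come back with 500s like those shown in the logs above.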

Context

We are using CouchDB 1.6.1; during product testing, we often perform replications of large documents simultaneously. We see about 3 or 4 test failures every run related to a hanging replication process; they are not the same tests failing each time, and when run in isolation they pass. We did not see the same behavior in CouchDB 2.1.1 but are blocked from moving to that version by #745. I believe we did not see this issue in 2.1.1 because it fails much more quickly there and we retry requests on failure.

Your Environment

  • Version used: 1.6.1 or 2.1.1
  • Operating System and version (desktop or mobile): Ubuntu 16.04 server
@elistevens

As Reece noted, while we are currently using 1.6.1 in production, we're intending to move to 2.1.x as soon as our blocking issues (#745 and this, so far) are resolved. Please don't interpret anything here as a request for an updated 1.x version. Cheers!

@wohali (Member) commented Jan 16, 2018

Have you tested against 1.7.1? 1.6.1 has serious known security issues, as disclosed in multiple places.

As you know, it is highly unlikely there will be another 1.x release at this point.

@ReeceStevens (Author)

Hi @wohali,

I have not yet tested against 1.7.1. Our application is safe on 1.6.1 since we operate an internal CouchDB instance (no access from the outside world), but we want to move to 2.1.x as soon as feasible. See @elistevens's message above.

I will get a 1.7.1 environment up and running to test; however, I anticipate we will still see the issues above, since the problems affect both 1.6.1 and 2.1.1. I'll let you know what I find.

I should have clarified in my earlier message: we definitely aren't requesting an additional 1.x release. I included the 1.6.1 and 2.1.1 information because we noticed abnormal behavior on both versions and thought the extra detail could be useful for an investigation. Currently, the blocking issue we are experiencing is that the server crashes during highly concurrent replication.

Thanks!

@wohali (Member) commented Jan 16, 2018

My guess is this is the same issue as in #796. Have you tried increasing your fd limit and max_dbs_open? Are you running all of your databases with the default q=8, or have you tried q=1 databases?
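
For context, shard count is fixed when a database is created; a minimal sketch of creating a q=1 database against 2.x, where the database name and admin credentials are placeholders:

import requests

# Create a database with a single shard (q=1) instead of the default q=8.
resp = requests.put(
    "http://localhost:5984/rs_src_0",
    params={"q": 1},
    auth=("admin", "secret"),
)
print(resp.status_code, resp.json())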

@wohali (Member) commented Jan 19, 2018

When you are seeing this problem, please do a GET /_node/_local/_system and look at the message queue sizes. Do you see any that have a high number, such as couch_server?
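
A quick way to watch those queue sizes while the test runs might look like this (a sketch; the admin credentials are placeholders, and it assumes the 2.x /_node/_local/_system endpoint mentioned above):

import requests

# Fetch node-level stats and flag any process whose message queue is
# backing up (e.g. couch_server).
stats = requests.get(
    "http://localhost:5984/_node/_local/_system",
    auth=("admin", "secret"),
).json()

for name, queue in stats["message_queues"].items():
    # couch_file and couch_db_updater report histogram dicts; the rest
    # are plain integers.
    backlog = queue["max"] if isinstance(queue, dict) else queue
    if backlog > 0:
        print(name, backlog)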

@geordanr

We increased max open files for CouchDB processes to 1048576; max_dbs_open is 500.

We lowered to q=4, but we can try lowering it further to 1.
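
For reference, the open-databases limit lives in local.ini under [couchdb]; a sketch with the value quoted above:

[couchdb]
max_dbs_open = 500

(The fd limit is raised at the OS level, e.g. via ulimit or the service manager, not in CouchDB's config.)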

@wohali (Member) commented Jan 19, 2018

Remember that max_dbs_open actually counts open shards, so you have to consider (# of dbs) * (# of shards per db).
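
For illustration: with q=4, max_dbs_open = 500 covers only about 125 concurrently open databases (500 / 4); at the default q=8, about 62.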

If you expect to be running with a lot of databases, you may also want to set this in your local.ini:

[couchdb]
update_lru_on_read = false

@wohali (Member) commented Jan 19, 2018

@ReeceStevens I'm unable to run your test case on my test environment (2GB RAM) - Python throws a MemoryError on line 28.

Can you provide a script that generates entropy without the use of numpy? Thanks.

@ReeceStevens (Author)

@wohali I have modified the script to remove the numpy dependency; let me know if that works for you. I will follow up on your previous comments soon, thanks for your patience.

@ReeceStevens (Author)

Hi there @wohali,

Preface: I only had CouchDB 2.0.0 to work with when trying your suggestions, but I believe the underlying behavior is identical to v2.1.1.

So I first watched the message queues while the problem occurred. I'm not sure what counts as a relatively large message queue size for CouchDB, but the only ones that rose significantly were file_server_2 and couch_index_server ("significantly" meaning on the order of 100). However, this was always a transient increase, and by the time couch started throwing 500 errors they were both back at zero. I never saw couch_server increase at all. FWIW, I will post the _system node responses after couch crashes at the bottom of this response.

Next, I ran with the update_lru_on_read = false option set in my local.ini. This seemed to trigger the error more quickly: before adding the option, CouchDB took about 24-26 iterations of my parallel replication script to crash; after, it took only 4. This was consistent across multiple trials.

_system node response with update_lru_on_read not set

{
  "uptime": 228,
  "memory": {
    "other": 108667722,
    "atom": 504409,
    "atom_used": 472138,
    "processes": 34621392,
    "processes_used": 34608264,
    "binary": 813285896,
    "code": 12084677,
    "ets": 1842064
  },
  "run_queue": 0,
  "ets_table_count": 86,
  "context_switches": 7444084,
  "reductions": 1233973042,
  "garbage_collection_count": 842909,
  "words_reclaimed": 1573709799,
  "io_input": 7708805364,
  "io_output": 6367883539,
  "os_proc_count": 0,
  "stale_proc_count": 0,
  "process_count": 876,
  "process_limit": 262144,
  "message_queues": {
    "couch_file": {
      "50": 0,
      "90": 0,
      "99": 0,
      "count": 216,
      "min": 0,
      "max": 0
    },
    "couch_db_updater": {
      "50": 0,
      "90": 0,
      "99": 0,
      "count": 176,
      "min": 0,
      "max": 0
    },
    "error_logger": 0,
    "couch_auth_cache": 0,
    "mango_sup": 0,
    "httpd_sup": 0,
    "couch_log_sup": 0,
    "couch_log_server": 0,
    "erl_prim_loader": 0,
    "global_group": 0,
    "standard_error_sup": 0,
    "erl_epmd": 0,
    "couch_epi_sup": 0,
    "couch_epi_functions_gen_couch_index": 0,
    "couch_epi_functions_gen_chttpd_auth": 0,
    "couch_epi_functions_gen_couch_db": 0,
    "couch_epi_functions_gen_global_changes": 0,
    "kernel_safe_sup": 0,
    "couch_epi_functions_gen_chttpd_handlers": 0,
    "couch_epi_functions_gen_chttpd": 0,
    "auth": 0,
    "code_server": 0,
    "rexi_buffer_couchdb@localhost": 0,
    "rexi_server_couchdb@localhost": 0,
    "mem3_sync_nodes": 0,
    "application_controller": 0,
    "httpc_sup": 0,
    "mochiweb_clock": 0,
    "mem3_sync": 0,
    "config_sup": 0,
    "mem3_sup": 0,
    "config": 0,
    "httpc_profile_sup": 0,
    "mem3_shards": 0,
    "global_changes_sup": 0,
    "inet_gethost_native_sup": 0,
    "httpc_manager": 0,
    "global_changes_server": 0,
    "httpc_handler_sup": 0,
    "runtime_tools_sup": 0,
    "mem3_nodes": 0,
    "net_kernel": 0,
    "user": 0,
    "ddoc_cache_sup": 0,
    "ftp_sup": 0,
    "rexi_sup": 0,
    "ddoc_cache_opener": 0,
    "rexi_server_sup": 0,
    "rexi_server_mon": 0,
    "rexi_server": 0,
    "ddoc_cache_lru": 0,
    "ioq_sup": 0,
    "chttpd_auth_cache_lru": 0,
    "memsup": 0,
    "mem3_events": 0,
    "disksup": 0,
    "ioq": 0,
    "couch_secondary_services": 0,
    "cpu_sup": 0,
    "couch_primary_services": 0,
    "couch_replicator_sup": 0,
    "ibrowse_sup": 0,
    "couch_replicator_manager": 0,
    "couch_replicator_job_sup": 0,
    "couch_uuids": 0,
    "rexi_buffer_mon": 0,
    "config_event": 0,
    "rexi_buffer_sup": 0,
    "ibrowse": 0,
    "couch_event_sup2": 0,
    "couch_event_server": 0,
    "couch_task_status": 0,
    "couch_event_os_sup": 0,
    "couch_sup": 0,
    "release_handler": 0,
    "couch_server": 0,
    "os_mon_sup": 0,
    "overload": 0,
    "couch_proc_manager": 0,
    "sasl_sup": 0,
    "couch_stats_sup": 0,
    "couch_stats_process_tracker": 0,
    "couch_plugin": 0,
    "couch_replication": 0,
    "alarm_handler": 0,
    "couch_os_daemons": 0,
    "tftp_sup": 0,
    "sasl_safe_sup": 0,
    "couch_stats_aggregator": 0,
    "timer_server": 0,
    "couch_httpd_vhost": 0,
    "folsom_sup": 0,
    "couch_index_sup": 0,
    "folsom_sample_slide_sup": 0,
    "couch_index_server": 0,
    "couch_httpd": 0,
    "rex": 0,
    "standard_error": 0,
    "init": 0,
    "net_sup": 0,
    "tls_connection_sup": 0,
    "chttpd_sup": 0,
    "kernel_sup": 0,
    "couch_external_manager": 0,
    "global_name_server": 0,
    "folsom_metrics_histogram_ets": 0,
    "file_server_2": 0,
    "couch_drv": 0,
    "ssl_sup": 0,
    "folsom_meter_timer_server": 0,
    "chttpd_auth_cache": 0,
    "setup_sup": 0,
    "inet_gethost_native": 0,
    "inet_db": 0,
    "inets_sup": 0,
    "ssl_manager": 0,
    "ssl_listen_tracker_sup": 0,
    "couch_compaction_daemon": 0,
    "chttpd": 0
  },
  "internal_replication_jobs": 0,
  "distribution": {}
}

_system node response with update_lru_on_read = false

{
  "uptime": 357,
  "memory": {
    "other": 107225738,
    "atom": 504409,
    "atom_used": 472138,
    "processes": 26205048,
    "processes_used": 26174352,
    "binary": 572270512,
    "code": 12084677,
    "ets": 1700232
  },
  "run_queue": 0,
  "ets_table_count": 86,
  "context_switches": 5537983,
  "reductions": 867477179,
  "garbage_collection_count": 787554,
  "words_reclaimed": 1465399318,
  "io_input": 5019941144,
  "io_output": 2954750043,
  "os_proc_count": 0,
  "stale_proc_count": 0,
  "process_count": 736,
  "process_limit": 262144,
  "message_queues": {
    "couch_file": {
      "50": 0,
      "90": 0,
      "99": 0,
      "count": 168,
      "min": 0,
      "max": 0
    },
    "couch_db_updater": {
      "50": 0,
      "90": 0,
      "99": 0,
      "count": 128,
      "min": 0,
      "max": 0
    },
    "error_logger": 0,
    "couch_auth_cache": 0,
    "mango_sup": 0,
    "httpd_sup": 0,
    "couch_log_sup": 0,
    "couch_log_server": 0,
    "erl_prim_loader": 0,
    "global_group": 0,
    "standard_error_sup": 0,
    "erl_epmd": 0,
    "couch_epi_sup": 0,
    "couch_epi_functions_gen_couch_index": 0,
    "couch_epi_functions_gen_chttpd_auth": 0,
    "couch_epi_functions_gen_couch_db": 0,
    "couch_epi_functions_gen_global_changes": 0,
    "kernel_safe_sup": 0,
    "couch_epi_functions_gen_chttpd_handlers": 0,
    "couch_epi_functions_gen_chttpd": 0,
    "auth": 0,
    "code_server": 0,
    "rexi_buffer_couchdb@localhost": 0,
    "rexi_server_couchdb@localhost": 0,
    "mem3_sync_nodes": 0,
    "application_controller": 0,
    "httpc_sup": 0,
    "mochiweb_clock": 0,
    "mem3_sync": 0,
    "config_sup": 0,
    "mem3_sup": 0,
    "config": 0,
    "httpc_profile_sup": 0,
    "mem3_shards": 0,
    "global_changes_sup": 0,
    "inet_gethost_native_sup": 0,
    "httpc_manager": 0,
    "global_changes_server": 0,
    "httpc_handler_sup": 0,
    "runtime_tools_sup": 0,
    "mem3_nodes": 0,
    "net_kernel": 0,
    "user": 0,
    "ddoc_cache_sup": 0,
    "ftp_sup": 0,
    "rexi_sup": 0,
    "ddoc_cache_opener": 0,
    "rexi_server_sup": 0,
    "rexi_server_mon": 0,
    "rexi_server": 0,
    "ddoc_cache_lru": 0,
    "ioq_sup": 0,
    "chttpd_auth_cache_lru": 0,
    "memsup": 0,
    "mem3_events": 0,
    "disksup": 0,
    "ioq": 0,
    "couch_secondary_services": 0,
    "cpu_sup": 0,
    "couch_primary_services": 0,
    "couch_replicator_sup": 0,
    "ibrowse_sup": 0,
    "couch_replicator_manager": 0,
    "couch_replicator_job_sup": 0,
    "couch_uuids": 0,
    "rexi_buffer_mon": 0,
    "config_event": 0,
    "rexi_buffer_sup": 0,
    "ibrowse": 0,
    "couch_event_sup2": 0,
    "couch_event_server": 0,
    "couch_task_status": 0,
    "couch_event_os_sup": 0,
    "couch_sup": 0,
    "release_handler": 0,
    "couch_server": 0,
    "os_mon_sup": 0,
    "couch_replication": 0,
    "overload": 0,
    "couch_proc_manager": 0,
    "sasl_sup": 0,
    "couch_stats_sup": 0,
    "couch_stats_process_tracker": 0,
    "couch_plugin": 0,
    "alarm_handler": 0,
    "couch_os_daemons": 0,
    "tftp_sup": 0,
    "sasl_safe_sup": 0,
    "couch_stats_aggregator": 0,
    "timer_server": 0,
    "couch_httpd_vhost": 0,
    "folsom_sup": 0,
    "couch_index_sup": 0,
    "folsom_sample_slide_sup": 0,
    "couch_index_server": 0,
    "couch_httpd": 0,
    "rex": 0,
    "standard_error": 0,
    "init": 0,
    "net_sup": 0,
    "tls_connection_sup": 0,
    "chttpd_sup": 0,
    "kernel_sup": 0,
    "couch_external_manager": 0,
    "global_name_server": 0,
    "folsom_metrics_histogram_ets": 0,
    "file_server_2": 0,
    "couch_drv": 0,
    "ssl_sup": 0,
    "folsom_meter_timer_server": 0,
    "chttpd_auth_cache": 0,
    "setup_sup": 0,
    "inet_gethost_native": 0,
    "inet_db": 0,
    "inets_sup": 0,
    "ssl_manager": 0,
    "ssl_listen_tracker_sup": 0,
    "couch_compaction_daemon": 0,
    "chttpd": 0
  },
  "internal_replication_jobs": 0,
  "distribution": {}
}

@wohali (Member) commented Jan 24, 2018

Thanks @ReeceStevens. So far, I can't reproduce with your script: 10 replications take 5-6 seconds to process, against an upper limit of ~41s or so. I tried increasing the number of iterations and threads, to no avail. My test machine is 2GB RAM, 1 CPU @ 3.3GHz, SSD.

Are you testing on HDD there? I can try and rig up a test on a slower disk.

@ReeceStevens (Author)

@wohali I had initially been testing on a CouchDB server under load when I used 10 threads. If you're using an otherwise unloaded database, you can raise the thread count to generate the error. --threads 40 --single-source definitely did the trick every time for me, even on an unloaded database. 40 threads is obviously a bit overkill, but this issue seems to be a race condition and occurs inconsistently; to get it to trigger every time, I had to lean on the database pretty hard.

If you set update_lru_on_read = false, it should happen very quickly, probably in under 10 iterations.

@ReeceStevens (Author)

And yes, the server we are testing on uses an HDD.

@wohali (Member) commented Jan 24, 2018

I can't reproduce the failure on that test machine even using --threads 100 --iterations 100 --single-source with update_lru_on_read = false, I'm afraid 🙍

I'll have to track down an HDD and try again tomorrow.

@ReeceStevens (Author)

Ahh, that is unfortunate. Perhaps disk latency is playing a role in all of this. Thanks, definitely let me know if the repro script doesn't work on an HDD either.

@wohali (Member) commented Feb 6, 2018

Hi @ReeceStevens, from my knowledge of your deployment, you are running on eCryptFS, right?

Can you try the same test again on the same hardware, but using a faster filesystem like ext4 or xfs?

@ReeceStevens (Author)

Hi @wohali, unfortunately encryption at rest is a hard requirement of our application, so we need to use eCryptFS. Is there anything else we can try?

@wohali (Member) commented Feb 7, 2018

@ReeceStevens Yup, I'm aware, I just mean for testing. I haven't been able to reproduce the bug, so I was hoping you might have a test system on the same hardware you can test against that doesn't use eCryptFS. There's another bug filed on here that I still believe might be disk latency related, and I'm trying to eliminate possibilities.

@ReeceStevens (Author)

@wohali sorry for the silence; I've been trying to put together a test environment for you where I can replicate the issue. I'm testing on a Linux machine with a slow spinning disk and an ext4 filesystem, stressing the CPU to see if I can generate the error. So far, I have not been able to reproduce the error I saw on our testing system. I will run the tests a bit longer, then try the same thing on an eCryptFS partition, and let you know how that goes.

@wohali (Member) commented Mar 5, 2018

@ReeceStevens In testing together, it looks like the candidate fix for #745 has resolved this problem. I'm going to tentatively close this issue out.

If you have any recurrence of this issue while testing, please let me know.

wohali closed this as completed Mar 5, 2018
H--o-l added a commit to adrienverge/coucharchive that referenced this issue Sep 24, 2018
1. According to CouchDB's docs [2] and a CouchDB maintainer [1]:
   > If you expect to be running with a lot of databases, you may also
   > want to set this in your local.ini:
   >     [couchdb]
   >     update_lru_on_read = false

   > This has lead to significant performance enhancements on very busy
   > clusters.

2. This new conf (`update_lru_on_read = false`) is the default since
   version 2.2.

As the `restore` operation can fail a lot due to performance issues, I propose
to set this option.
It does no harm and could improve performance for people using coucharchive with
CouchDB versions < 2.2.

[1]: apache/couchdb#1093 (comment)
[2]: http://docs.couchdb.org/en/2.2.0/whatsnew/2.2.html#id1