Replication with attachments never completes, {mp_parser_died,noproc} error #745

Closed
wohali opened this Issue Aug 10, 2017 · 36 comments

6 participants
@wohali
Member

wohali commented Aug 10, 2017

Expected Behavior

Replication of a DB with attachments into 2.1.0 should be successful.

Current Behavior

Replication crashes after a while with the following stack trace:

[notice] 2017-08-10T07:07:19.694049Z couchdb@127.0.0.1 <0.4248.0> 41b32fb786 127.0.0.1:5984 127.0.0.1 undefined PUT /dbname/05a410bd-3d15-4d32-a410-bd3d156d32c2?new_edits=false 201 ok 812
[error] 2017-08-10T07:07:19.712888Z couchdb@127.0.0.1 emulator -------- Error in process <0.2147.1> on node 'couchdb@127.0.0.1' with exit value:
{{nocatch,{mp_parser_died,noproc}},[{couch_att,'-foldl/4-fun-0-',3,[{file,"src/couch_att.erl"},{line,591}]},{couch_att,fold_streamed_data,4,[{file,"src/couch_att.erl"},{line,642}]},{couch_att,foldl,4,[{file,"src/couch_att.erl"},{line,595}]},{couch_httpd_multipart,atts_to_mp,4,[{file,"src/couch_httpd_multipart.erl"},{line,208}]}]}

[info] 2017-08-10T07:07:19.712956Z couchdb@127.0.0.1 <0.489.0> -------- Replication connection to: "127.0.0.1":5984 died with reason {{nocatch,{mp_parser_died,noproc}},[{couch_att,'-foldl/4-fun-0-',3,[{file,"src/couch_att.erl"},{line,591}]},{couch_att,fold_streamed_data,4,[{file,"src/couch_att.erl"},{line,642}]},{couch_att,foldl,4,[{file,"src/couch_att.erl"},{line,595}]},{couch_httpd_multipart,atts_to_mp,4,[{file,"src/couch_httpd_multipart.erl"},{line,208}]}]}
[error] 2017-08-10T07:07:19.713776Z couchdb@127.0.0.1 <0.4020.0> ef685c906e req_err(3669112652) badmatch : ok
    [<<"chttpd_db:db_doc_req/3 L780">>,<<"chttpd:process_request/1 L295">>,<<"chttpd:handle_request_int/1 L231">>,<<"mochiweb_http:headers/6 L91">>,<<"proc_lib:init_p_do_apply/3 L247">>]
[notice] 2017-08-10T07:07:19.714171Z couchdb@127.0.0.1 <0.4020.0> ef685c906e 127.0.0.1:5984 127.0.0.1 undefined PUT /dbname/1ab680f5-eb77-4450-b680-f5eb774450a2?new_edits=false 500 ok 1
[error] 2017-08-10T07:07:19.714284Z couchdb@127.0.0.1 <0.22189.0> -------- Replicator, request PUT to "http://127.0.0.1:5984/dbname/1ab680f5-eb77-4450-b680-f5eb774450a2?new_edits=false" failed due to error {error,
    {'EXIT',
        {{{nocatch,{mp_parser_died,noproc}},
          [{couch_att,'-foldl/4-fun-0-',3,
               [{file,"src/couch_att.erl"},{line,591}]},
           {couch_att,fold_streamed_data,4,
               [{file,"src/couch_att.erl"},{line,642}]},
           {couch_att,foldl,4,[{file,"src/couch_att.erl"},{line,595}]},
           {couch_httpd_multipart,atts_to_mp,4,
               [{file,"src/couch_httpd_multipart.erl"},{line,208}]}]},
         {gen_server,call,
             [<0.5894.0>,
              {send_req,
                  {{url,
                       "http://127.0.0.1:5984/dbname/1ab680f5-eb77-4450-b680-f5eb774450a2?new_edits=false",
                       "127.0.0.1",5984,undefined,undefined,
                       "/dbname/1ab680f5-eb77-4450-b680-f5eb774450a2?new_edits=false",
                       http,ipv4_address},
                   [{"Accept","application/json"},
                    {"Content-Length",278575},
                    {"Content-Type",
                     "multipart/related; boundary=\"dac3f5492529b83c6ba2be5e0894827f\""},
                    {"User-Agent","CouchDB-Replicator/2.1.0-f527f2a"}],
                   put,
                   {#Fun<couch_replicator_api_wrap.11.133909485>,
                    {<<{DOCUMENT HAS BEEN PARTIALLY CENSORED}>>,
                     [{att,
                          <<"9352f01c630c34550c81bf4c57ded4e9e5607e08f4c9a94b1383b111c5950b19">>,
                          <<"application/pdf">>,276333,276333,
                          <<90,37,115,74,154,16,223,165,110,51,141,171,118,182,
                            182,62>>,
                          3,
                          {follows,<0.22188.0>,#Ref<0.0.262145.216597>},
                          identity}],
                     <<"dac3f5492529b83c6ba2be5e0894827f">>,278575}},
                   [{response_format,binary},
                    {inactivity_timeout,30000},
                    {socket_options,[{keepalive,true},{nodelay,false}]}],
                   infinity}},
              infinity]}}}}
[notice] 2017-08-10T07:07:19.716645Z couchdb@127.0.0.1 <0.20928.0> -------- Retrying GET to https://mb-d46f6b75-bedb-496e-97e6-f230be51e571:*****@couchdb.icure.cloud:443/icure-mb-d46f6b75-bedb-496e-97e6-f230be51e571-healthdata/1ab680f5-eb77-4450-b680-f5eb774450a2?revs=true&open_revs=%5B%224-14513d194dd675cea97e2f415c71856b%22%5D&latest=true in 1.0 seconds due to error {http_request_failed,[80,85,84],[104,116,116,112,58,47,47,49,50,55,46,48,46,48,46,49,58,53,57,56,52,47,105,99,117,114,101,45,104,101,97,108,116,104,100,97,116,97,47,49,97,98,54,56,48,102,53,45,101,98,55,55,45,52,52,53,48,45,98,54,56,48,45,102,53,101,98,55,55,52,52,53,48,97,50,63,110,101,119,95,101,100,105,116,115,61,102,97,108,115,101],{error,{error,{'EXIT',{{{nocatch,{mp_parser_died,noproc}},[{couch_att,'-foldl/4-fun-0-',3,[{file,[115,114,99,47,99,111,117,99,104,95,97,116,116,46,101,114,108]},{line,591}]},{couch_att,fold_streamed_data,4,[{file,[115,114,99,47,99,111,117,99,104,95,97,116,116,46,101,114,108]},{line,642}]},{couch_att,foldl,4,[{file,[115,114,99,47,99,111,117,99,104,95,97,116,116,46,101,114,108]},{line,595}]},{couch_httpd_multipart,atts_to_mp,4,[{file,[115,114,99,47,99,111,117,99,104,95,104,116,116,112,100,95,109,117,108,116,105,112,97,114,116,46,101,114,108]},{line,208}]}]},{gen_server,call,[<0.5894.0>,{send_req,{{url,[104,116,116,112,58,47,47,49,50,55,46,48,46,48,46,49,58,53,57,56,52,47,105,99,117,114,101,45,104,101,97,108,116,104,100,97,116,97,47,49,97,98,54,56,48,102,53,45,101,98,55,55,45,52,52,53,48,45,98,54,56,48,45,102,53,101,98,55,55,52,52,53,48,97,50,63,110,101,119,95,101,100,105,116,115,61,102,97,108,115,101],[49,50,55,46,48,46,48,46,49],5984,undefined,undefined,[47,105,99,117,114,101,45,104,101,97,108,116,104,100,97,116,97,47,49,97,98,54,56,48,102,53,45,101,98,55,55,45,52,52,53,48,45,98,54,56,48,45,102,53,101,98,55,55,52,52,53,48,97,50,63,110,101,119,95,101,100,105,116,115,61,102,97,108,115,101],http,ipv4_address},[{[65,99,99,101,112,116],[97,112,112,108,105,99,97,116,105,111,110
,47,106,115,111,110]},{[67,111,110,116,101,110,116,45,76,101,110,103,116,104],278575},{[67,111,110,116,101,110,116,45,84,121,112,101],[109,117,108,116,105,112,97,114,116,47,114,101,108,97,116,101,100,59,32,98,111,117,110,100,97,114,121,61,34,100,97,99,51,102,53,52,57,50,53,50,57,98,56,51,99,54,98,97,50,98,101,53,101,48,56,57,52,56,50,55,102,34]},{[85,115,101,114,45,65,103,101,110,116],[67,111,117,99,104,68,66,45,82,101,112,108,105,99,97,116,111,114,47,50,46,49,46,48,45,102,53,50,55,102,50,97]}],put,{#Fun<couch_replicator_api_wrap.11.133909485>,{<<123,34,95,105,100,34,58,34,49,97,98,54,56,48,102,53,45,101,98,55,55,45,52,52,53,48,45,98,54,56,48,45,102,53,101,98,55,55,52,52,53,48,97,50,34,44,34,95,114,101,118,34,58,34,52,45,49,52,53,49,51,100,49,57,52,100,100,54,55,53,99,101,97,57,55,101,50,102,52,49,53,99,55,49,56,53,54,98,34,44,34,99,114,101,97,116,101,100,34,58,49,51,57,52,53,50,54,50,50,52,57,50,55,44,34,109,111,100,105,102,105,101,100,34,58,49,51,57,52,53,50,54,53,51,48,56,51,52,44,34,99,111,100,101,115,34,58,91,93,44,34,116,97,103,115,34,58,91,93,44,34,115,101,99,114,101,116,70,111,114,101,105,103,110,75,101,121,115,34,58,91,93,44,34,99,114,121,112,116,101,100,70,111,114,101,105,103,110,75,101,121,115,34,58,123,125,44,34,100,101,108,101,103,97,116,105,111,110,115,34,58,123,34,55,52,53,55,97,99,48,98,45,99,49,101,48,45,52,55,52,50,45,57,55,97,99,45,48,98,99,49,101,48,53,55,52,50,51,97,34,58,91,123,34,111,119,110,101,114,34,58,34,55,52,53,55,97,99,48,98,45,99,49,101,48,45,52,55,52,50,45,57,55,97,99,45,48,98,99,49,101,48,53,55,52,50,51,97,34,44,34,100,101,108,101,103,97,116,101,100,84,111,34,58,34,55,52,53,55,97,99,48,98,45,99,49,101,48,45,52,55,52,50,45,57,55,97,99,45,48,98,99,49,101,48,53,55,52,50,51,97,34,44,34,107,101,121,34,58,34,102,100,48,50,101,99,49,97,102,50,100,102,53,101,49,52,50,99,51,99,101,54,101,98,57,55,52,53,49,48,99,57,57,49,97,49,57,52,53,56,51,48,56,51,97,97,50,98,98,100,97,54,100,57,51,54,56,57,98,50,49,99,52,101,97,55,102,56,55,50,100,101,53,5
3,55,53,97,99,55,102,52,97,53,48,97,53,48,100,99,50,99,53,57,48,57,55,52,56,48,57,98,101,51,99,97,102,49,48,57,56,52,101,49,53,97,99,99,52,101,52,101,99,55,57,101,51,54,102,50,52,97,99,52,51,50,55,53,49,100,48,51,50,48,97,48,97,50,55,55,54,53,49,100,99,56,102,56,57,52,99,100,51,51,52,55,49,102,49,53,51,49,55,101,97,51,97,100,102,100,101,51,100,57,102,57,53,98,54,57,51,54,55,34,125,93,44,34,54,56,102,52,100,99,48,53,45,102,55,98,55,45,52,55,102,98,45,98,52,100,99,45,48,53,102,55,98,55,49,55,102,98,54,53,34,58,91,123,34,111,119,110,101,114,34,58,34,55,52,53,55,97,99,48,98,45,99,49,101,48,45,52,55,52,50,45,57,55,97,99,45,48,98,99,49,101,48,53,55,52,50,51,97,34,44,34,100,101,108,101,103,97,116,101,100,84,111,34,58,34,54,56,102,52,100,99,48,53,45,102,55,98,55,45,52,55,102,98,45,98,52,100,99,45,48,53,102,55,98,55,49,55,102,98,54,53,34,44,34,107,101,121,34,58,34,55,51,97,49,98,100,49,52,102,101,51,54,51,50,57,98,50,55,49,100,97,99,53,99,53,49,49,52,52,102,48,98,56,56,101,98,51,99,101,50,51,50,97,48,54,51,57,56,54,100,57,99,97,50,51,101,55,57,54,52,97,100,100,57,55,99,102,55,97,48,55,50,55,50,56,98,53,52,101,101,56,98,98,101,53,57,55,100,97,56,98,102,99,51,99,54,51,50,97,53,55,51,54,53,100,50,50,56,54,55,50,53,51,49,57,100,56,99,50,57,102,99,100,48,102,98,57,102,57,101,97,50,53,55,53,101,50,57,53,55,102,49,100,53,53,56,99,52,54,48,51,51,54,99,97,53,49,99,57,97,52,97,51,101,56,50,100,97,50,101,55,100,54,102,101,102,55,57,98,50,50,99,55,54,57,55,50,101,99,50,56,57,34,125,93,44,34,50,101,54,48,54,97,53,52,45,99,50,97,99,45,52,55,52,102,45,97,48,54,97,45,53,52,99,50,97,99,100,55,52,102,50,102,34,58,91,123,34,111,119,110,101,114,34,58,34,55,52,53,55,97,99,48,98,45,99,49,101,48,45,52,55,52,50,45,57,55,97,99,45,48,98,99,49,101,48,53,55,52,50,51,97,34,44,34,100,101,108,101,103,97,116,101,100,84,111,34,58,34,50,101,54,48,54,97,53,52,45,99,50,97,99,45,52,55,52,102,45,97,48,54,97,45,53,52,99,50,97,99,100,55,52,102,50,102,34,44,34,107,101,121,34,58,34,53,98,53,56,97,56,99,49,51,102,54,
52,102,57,55,57,97,101,100,100,99,100,50,49,56,98,98,57,102,101,50,51,49,55,97,51,52,101,56,100,52,57,56,55,52,99,102,57,102,52,57,51,56,100,49,57,97,99,98,100,50,50,97,97,56,52,50,50,53,102,50,102,51,51,100,100,50,97,51,100,101,97,57,55,48,56,55,57,99,50,57,51,53,50,50,49,53,102,100,102,50,101,57,57,51,98,53,56,101,49,48,50,51,101,54,51,102,99,55,54,50,51,99,53,97,101,48,98,101,53,54,99,55,53,48,57,102,50,55,101,50,101,100,55,97,99,54,55,54,55,102,98,50,99,55,101,48,101,49,57,50,51,55,54,98,101,53,48,50,57,56,97,51,51,48,55,48,54,55,54,52,99,102,100,54,99,56,57,53,100,55,102,34,125,93,125,44,34,97,116,116,97,99,104,109,101,110,116,69,110,99,114,121,112,116,105,111,110,75,101,121,115,34,58,91,93,44,34,97,116,116,97,99,104,109,101,110,116,73,100,34,58,34,57,51,53,50,102,48,49,99,54,51,48,99,51,52,53,53,48,99,56,49,98,102,52,99,53,55,100,101,100,52,101,57,101,53,54,48,55,101,48,56,102,52,99,57,97,57,52,98,49,51,56,51,98,49,49,49,99,53,57,53,48,98,49,57,34,44,34,100,111,99,117,109,101,110,116,84,121,112,101,34,58,34,105,110,118,111,105,99,101,34,44,34,109,97,105,110,85,116,105,34,58,34,99,111,109,46,97,100,111,98,101,46,112,100,102,34,44,34,110,97,109,101,34,58,34,68,79,80,80,76,69,82,32,77,73,78,70,32,50,56,32,48,50,32,49,52,32,40,49,49,47,48,51,47,49,52,41,34,44,34,111,116,104,101,114,85,116,105,115,34,58,91,34,100,121,110,46,97,103,107,56,119,115,109,50,34,93,44,34,106,97,118,97,95,116,121,112,101,34,58,34,111,114,103,46,116,97,107,116,105,107,46,105,99,117,114,101,46,101,110,116,105,116,105,101,115,46,68,111,99,117,109,101,110,116,34,44,34,114,101,118,95,104,105,115,116,111,114,121,34,58,123,125,44,34,95,114,101,118,105,115,105,111,110,115,34,58,123,34,115,116,97,114,116,34,58,52,44,34,105,100,115,34,58,91,34,49,52,53,49,51,100,49,57,52,100,100,54,55,53,99,101,97,57,55,101,50,102,52,49,53,99,55,49,56,53,54,98,34,44,34,49,52,101,55,98,57,54,50,57,101,99,97,50,56,55,48,102,48,48,101,100,102,100,51,100,99,56,54,56,101,54,49,34,44,34,98,101,99,98,50,48,51,56,55,102,56,
52,98,48,51,48,98,54,55,101,49,48,56,57,48,54,48,53,101,50,54,50,34,44,34,57,54,55,50,48,50,48,57,100,49,54,101,57,57,100,98,56,52,54,56,97,50,51,54,57,49,97,56,53,99,101,97,34,93,125,44,34,95,97,116,116,97,99,104,109,101,110,116,115,34,58,123,34,57,51,53,50,102,48,49,99,54,51,48,99,51,52,53,53,48,99,56,49,98,102,52,99,53,55,100,101,100,52,101,57,101,53,54,48,55,101,48,56,102,52,99,57,97,57,52,98,49,51,56,51,98,49,49,49,99,53,57,53,48,98,49,57,34,58,123,34,99,111,110,116,101,110,116,95,116,121,112,101,34,58,34,97,112,112,108,105,99,97,116,105,111,110,47,112,100,102,34,44,34,114,101,118,112,111,115,34,58,51,44,34,100,105,103,101,115,116,34,58,34,109,100,53,45,87,105,86,122,83,112,111,81,51,54,86,117,77,52,50,114,100,114,97,50,80,103,61,61,34,44,34,108,101,110,103,116,104,34,58,50,55,54,51,51,51,44,34,102,111,108,108,111,119,115,34,58,116,114,117,101,125,125,125>>,[{att,<<57,51,53,50,102,48,49,99,54,51,48,99,51,52,53,53,48,99,56,49,98,102,52,99,53,55,100,101,100,52,101,57,101,53,54,48,55,101,48,56,102,52,99,57,97,57,52,98,49,51,56,51,98,49,49,49,99,53,57,53,48,98,49,57>>,<<97,112,112,108,105,99,97,116,105,111,110,47,112,100,102>>,276333,276333,<<90,37,115,74,154,16,223,165,110,51,141,171,118,182,182,62>>,3,{follows,<0.22188.0>,#Ref<0.0.262145.216597>},identity}],<<100,97,99,51,102,53,52,57,50,53,50,57,98,56,51,99,54,98,97,50,98,101,53,101,48,56,57,52,56,50,55,102>>,278575}},[{response_format,binary},{inactivity_timeout,30000},{socket_options,[{keepalive,true},{nodelay,false}]}],infinity}},infinity]}}}}}}
[notice] 2017-08-10T07:07:19.727533Z couchdb@127.0.0.1 <0.3632.0> 351785348c 127.0.0.1:5984 127.0.0.1 undefined PUT /dbname/10363625-cd90-4233-b636-25cd90e23378?new_edits=false 201 ok 16
[notice] 2017-08-10T07:07:19.762312Z couchdb@127.0.0.1 <0.3879.0> 

Replication restarts, the error repeats and replication never finishes.

Feels like an instance of #574 which we thought had been resolved.

Your Environment

  • Version used: 2.1.0 release
  • Operating System and version (desktop or mobile): macOS 10.11
@wohali

Member Author

wohali commented Oct 5, 2017

Seeing this now in multiple production environments. In one case, it is potentially completely freezing a node participating in continuous replication with large attachments. In the other, it's a one-time replication that must be restarted many times before it runs to completion.

Discussion on IRC today with @nickva follows.

17:17 < vatamane> #574 was mostly about how 413 (request too big) are handled
17:17 <+Wohali> right, and in both cases these are large attachments
17:18 < vatamane> do you think they trigger the 413 that is, are they bigger
                  than the maximum size request setting?
17:19 <+Wohali> very likely
17:20 < vatamane> k, yeah this is tricky one
17:20 < vatamane> I was trying to think what the patch would be for ibrowse and
                  got thoroughly confused by its parsing state machine
17:22 < vatamane> mochiweb (the server) might also have to behave nicely to
                  ensure it actually sends out the 413 response before closing
                  the streams
17:23 <+Wohali> hm
17:23 <+Wohali> https://github.com/cmullaparthi/ibrowse/issues/105
17:24 <+Wohali> and of course https://github.com/cmullaparthi/ibrowse/issues/146
17:24 <+Wohali> which might prevent us from moving to 4.3
17:24 <+Wohali> or i guess 4.4 now
17:28 < vatamane> the fix here might also need the setting of erlang's socket
                  options
17:29 < vatamane> namely {exit_on_close, false}
17:30 < vatamane> i meant to say ibrowse lets us set socket options, i remember
                  trying but it wasn't enough
17:32 < vatamane> also i remember testing the server side with this script
https://gist.github.com/nickva/84bbe3a51b9ceda8bca8256148be1a18
17:32 < vatamane> it opens a plain socket for upload then even on send failure
                  tries to receive data
17:32 <+Wohali> right so we agree the issue is probably ibrowse?
17:33 < vatamane> 80% or so sure
17:36 <+Wohali> we could run that eunit test in a loop, I mean
17:37 < vatamane> yap could do that
17:37 < vatamane> I was doing it that way
17:37 < vatamane> with debugging and logging enabled
@calonso

calonso commented Oct 13, 2017

Hi everyone!!

I think I have some more information on this issue in the form of a side effect. My setup is a small cluster, with just 3 nodes continuously replicating a few databases from another, bigger one. Only 3 databases out of all the ones being replicated hold attachments and, by chance, the same node is responsible for replicating the 3 of them. That node throws the described error quite often (a few thousand times per hour), depending on the speed at which documents are received.

That particular node shows a continuous increase in the process_count metric read from the _system endpoint, growing at a rate similar to the error rate. The metric grows from about 1.2k processes at node start to a bit above 5k, at which point the node freezes: it stops responding on the clustered (5984) endpoint and doesn't replicate any more data. Annoyingly, it is not considered down by the cluster, so the other nodes do not take over its responsibilities.
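
The growth described above could be tracked before the node freezes. This is a minimal sketch, assuming the `_system` response is a JSON object with a `process_count` field (as read above). The helper names, the sampling interval and the 5000 threshold are illustrative assumptions, not CouchDB settings.

```python
import json


def process_count(system_body):
    """Extract process_count from the JSON body of GET /_node/<node>/_system."""
    return json.loads(system_body)["process_count"]


def growth_per_hour(samples):
    """Estimate process growth per hour from (unix_seconds, process_count) samples.

    Uses only the first and last sample, so this is a rough average slope.
    """
    (t0, c0), (t1, c1) = samples[0], samples[-1]
    if t1 <= t0:
        raise ValueError("samples must span a positive time interval")
    return (c1 - c0) * 3600.0 / (t1 - t0)


def hours_until(samples, threshold):
    """Hours until the count reaches threshold at the current rate (None if not growing)."""
    rate = growth_per_hour(samples)
    if rate <= 0:
        return None
    return max(0.0, (threshold - samples[-1][1]) / rate)
```

Feeding it a sample every few minutes and alerting when `hours_until` drops below some margin would give a warning well before the roughly 5k processes at which the node stopped responding here.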

After connecting Observer to that node to see which processes were there, I could see a lot of erlang:apply/2 processes in function couch_httpd_multipart:maybe_send_data/1 with 0 reductions and 0 messages in the queue, and also a lot of mochiweb_acceptor:init/4 processes in function couch_doc:-doc_from_multi_part_stream/3-fun-1-/1, some with 1 message in the queue, others with 0 messages and 0 reductions as well.

This node also has quite a few erlang:apply/2 processes in function couch_httpd_multipart:mp_parse_atts/2.

I think there may be something preventing the processes from exiting and that's why they pile up until it freezes.

Hope this helps.

@nickva

Contributor

nickva commented Oct 13, 2017

Hi Carlos,

I was wondering roughly how many attachments you have, and what their approximate size distribution is.

Are there also large documents, or document IDs longer than 4KB?

And is this still CouchDB 2.1.0, as mentioned above?

Also what are the values of these configuration parameters:

couchdb.max_document_size
httpd.max_http_request_size

Note that the default request size for httpd.max_http_request_size is 64MB. If you use the default, and your attachments are large, consider raising the limit there.

Basically, I'm trying to see whether the target cluster is rejecting requests because of one of those limits, or whether something else is going on.

@calonso

calonso commented Oct 14, 2017

Hi Nick,

I've been reviewing some of the mp_parser_died errors in the logs, and the documents, with their attachments, do end up in the database. I suppose the scheduler retries them and the replication eventually works. I've seen a few documents appear as errors in the logs two or three times, and others fail just once, and all of them are in the DB. I haven't found any error whose document is missing from the DB, but I haven't reviewed all errors one by one either.

The documents are not big, at least the ones I've reviewed: they are PDF and XLS files ranging from 4 to 60 KB. I haven't iterated through all of them to compute the distribution you suggest. Is there an easier CouchDB way to get that overview of attachment sizes?
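
One way to get that overview is to read the attachment stubs, which carry a `length` in bytes, from an `_all_docs?include_docs=true` response. This is a minimal sketch that only parses an already-fetched response body. The function names and bucket edges are assumptions for illustration, and on a large database the request itself should be paged rather than fetched in one go.

```python
import json


def attachment_sizes(all_docs_body):
    """Return (doc_id, attachment_name, length) tuples from an _all_docs body.

    Expects the response of GET /<db>/_all_docs?include_docs=true, where each
    attachment stub under "_attachments" has a "length" field in bytes.
    """
    sizes = []
    for row in json.loads(all_docs_body).get("rows", []):
        doc = row.get("doc") or {}
        for name, stub in doc.get("_attachments", {}).items():
            sizes.append((doc.get("_id"), name, stub["length"]))
    return sizes


def size_histogram(lengths, edges=(64 * 1024, 1024 ** 2, 64 * 1024 ** 2)):
    """Count lengths per bucket.

    Bucket i holds values below edges[i] (and at or above edges[i - 1]);
    the final bucket holds everything at or above the last edge.
    The edges must be sorted in ascending order.
    """
    counts = [0] * (len(edges) + 1)
    for n in lengths:
        idx = sum(1 for e in edges if n >= e)
        counts[idx] += 1
    return counts
```

With the default edges, the last bucket counts attachments of 64 MiB or more, which are the ones that could run into a 64MB request limit.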

I'm using CouchDB 2.1.0 here. As for the configs, both should be at their defaults, as I haven't set either of them. Checking the config values from _node/<node>/_config, there's nothing specified for couchdb.max_document_size, and httpd.max_http_request_size is 67108864, which I guess is the default value.
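
The same check could be scripted per node. This is a minimal sketch, assuming a single setting can be read at `/_node/<node>/_config/<section>/<key>` and comes back as a JSON string. The base URL and node name in the usage line are placeholders, and the helper names are hypothetical.

```python
import json
from urllib.parse import quote


def config_url(base, node, section, key):
    """Build the per-node config URL for one setting."""
    parts = [quote(p, safe="@") for p in (node, section, key)]
    return "{}/_node/{}/_config/{}/{}".format(base.rstrip("/"), *parts)


def parse_size(body):
    """Config values are returned as JSON strings; convert a byte count to int."""
    return int(json.loads(body))
```

For example, `config_url("http://127.0.0.1:5984", "couchdb@127.0.0.1", "httpd", "max_http_request_size")` builds the URL to fetch, and `parse_size('"67108864"')` turns the reply into the 64 MiB value reported above.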

@nickva

Contributor

nickva commented Oct 15, 2017

Hi Carlos,

Thanks for the additional info! So it seems that with retries they eventually finish. We'd still rather not have these errors to start with...

It doesn't seem like request/document/attachment size limits are involved in this case.

Now I am thinking perhaps it could be unreliable network connections or a large number of replications.

How reliable is the network in your setup? Is there any chance of intermittent connectivity issues, high latency, or running out of sockets?

Another question is how many replications are running at the same time: would there be more than 500 per cluster node? That's currently the max jobs value for the scheduler, and if there are more than that, the scheduling replicator stops some and starts others as it cycles through them. I'm wondering if that is an issue.

@calonso

calonso commented Oct 15, 2017

Hi Nick,

Although I've definitely seen some replication errors pointing to a closed connection on the source from time to time, they are very sparse, and I don't think we're affected by an unreliable network either. The source is a cluster hosted on Softlayer (central US, I think) and the target is a cluster in the Europe-West region of Google Compute Engine. Both platforms, while far apart in terms of distance, have very reliable and strong network links.

As for the number of replications, I don't think we're anywhere near that figure. I'm replicating on a three-node cluster, with each node running 9 replications.

Regards

@elistevens

elistevens commented Dec 4, 2017

We believe that we're seeing this in internal testing on single-node hosts doing local replication too. Our attachment sizes can be in the gigabytes.

@nickva

Contributor

nickva commented Dec 4, 2017

Hi @elistevens,

Thanks for your report. Would you be able to make a short script to reproduce the issue? Or at least describe the steps in more detail, for example: 1: clone couch at version X, erlang version Y, OS version Z, etc.; 2: build; 3: set up with these config parameters; 4: create 2 dbs; 5: populate with attachments of this size; ...

@nickva

Contributor

nickva commented Dec 4, 2017

@calonso, sorry for the delayed response. There is a minor fix in that code in 2.1.1. Would you be able to retry with that latest version to see if it results in the same error? If you do upgrade, take a look at the release notes regarding the vm.args file and localhost vs 127.0.0.1 node names.

@calonso

calonso commented Dec 5, 2017

Hi @nickva.

We updated to 2.1.1 a while ago and unfortunately we keep seeing the same error... :(

Thanks!

@nickva

Contributor

nickva commented Dec 6, 2017

@calonso, thanks for checking, it helps to know that.

@elistevens

elistevens commented Dec 7, 2017

Bah, my earlier draft response got eaten by a browser shutdown.

I don't have an easy repro script, sadly. We're seeing the issue under load during our test runs, but any single test seems to work fine when run in isolation. Our largest attachments are in the range of 100MB to 1GB. I know that's against recommended practices, but that wasn't clear when the bones of our architecture were laid down in ~2011.

We are running 2.1.1 on Ubuntu, using the official .debs.

@wohali

Member Author

wohali commented Dec 13, 2017

Hey @nickva I spent time today on getting a repro for this, as it's affecting more and more people. Bear with me on the setup, it's a little involved.

Set up a VM with the following parameters (I used VMWare Workstation):

  • 20GB HDD, 512 MB RAM (!), 1 CPU
  • Debian 8.latest (or 9, I tested on 8), 64-bit
  • CouchDB master - have it running on localhost:5984 with admin:password as the creds, n=1, and logging at debug level
  • sudo apt-get install stress python3 python3-virtualenv virtualenv python3-cxx-dev
  • mkdir 745 && cd 745 && virtualenv -p python3 venv && source venv/bin/activate
  • pip install RandomWords RandomIO requests docopt schema urllib3 chardet certifi idna
  • wget https://gist.github.com/wohali/1cd19b78c0a417dbeb9f66b3229f7b58/raw/6539d48fa9e05b021f344b80fbf0d7c3e7fcd6e4/makeit.py

Now you're ready to set up the test:

$ curl -X PUT http://admin:password@localhost:5984/foo
$ python ./makeit.py 10

Repeat the above a few times - get the DB to 1GB or larger. You can increase 10 but at some point you'll run out of RAM, so be careful. This script creates sample docs with a few fields and a 50MB attachment full of random bytes.

Now to run the test:

  • In one window, tail -f your couch log and grep for mp_parser.
  • In another window, stress the machine's CPU and I/O: stress --timeout 90m --cpu 1 --io 4. (You can add disk access to this with -d 8 if desired.)
  • In the window running the Python virtualenv, start the replication:
    curl http://admin:password@localhost:5984/_replicate -H "Content-Type: application/json" --data '{"create_target": true, "source": "foo", "target": "bar"}'

If the above succeeds, curl -X DELETE http://admin:password@localhost:5984/bar and try to replicate again.

This produces a failure for me within 10 minutes. The command line returns:

{"error":"error","reason":"{worker_died,<0.26846.9>,{process_died,<0.27187.9>,kaboom}}"}

and the logfile has errors identical to those in the original post above, and in #574.
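
For running the replicate-and-check step in a loop, the curl call and the failure reply above could be scripted. This is a minimal sketch using only the Python standard library: it builds the request without sending it and recognizes the `worker_died` reply. The function names are hypothetical, and the credentials are the `admin:password` pair from the setup above.

```python
import base64
import json
import urllib.request


def replicate_request(base, source, target, user, password):
    """Build (but do not send) a POST to /_replicate that creates the target."""
    body = json.dumps(
        {"create_target": True, "source": source, "target": target}
    ).encode("utf-8")
    token = base64.b64encode("{}:{}".format(user, password).encode("utf-8"))
    headers = {
        "Content-Type": "application/json",
        "Authorization": "Basic " + token.decode("ascii"),
    }
    return urllib.request.Request(
        base.rstrip("/") + "/_replicate", data=body, headers=headers, method="POST"
    )


def is_worker_crash(response_body):
    """True if a /_replicate error body reports a dead replication worker."""
    reply = json.loads(response_body)
    return reply.get("error") == "error" and "worker_died" in reply.get("reason", "")
```

Passing the request to `urllib.request.urlopen` sends it. A loop would delete `bar` after each successful run and stop once `is_worker_crash` returns true for the reply.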

@nickva

Contributor

nickva commented Jan 5, 2018

@wohali thanks for that test! I'll take a look at it when I get a chance. The kaboom thing is interesting, wonder if that's something we saw before or something new in relation to this bug. That comes from getting all the open revisions of a document. Definitely a good data point to have.

@nickva

Contributor

nickva commented Jan 18, 2018

Leaving it here as it might be relevant:

A similar issue was noticed with someone using attachments in the 10MB range. One attachment was larger, at around 50MB.

Investigation on the #couchdb-dev IRC channel implicated the couch_doc,'-doc_from_multi_part_stream/4-fun-1-',1 function (it looks ugly because it is an anonymous function with a mangled name).

Also possibly related in that case: there were intermittent network problems, with nodes being connected and disconnected.

@wohali

Member Author

wohali commented Jan 19, 2018

FYI the intermittent network problems are not a prerequisite for this problem to surface.

However, I think we are going in the right direction thinking this is related to incorrect attachment length calculation and/or incomplete network transfers.

@wohali

Member Author

wohali commented Jan 19, 2018

One other thing - there was a previous attempt at changing some of this behaviour that never landed that references some old JIRA tickets:

#138

@wohali

Member Author

wohali commented Jan 22, 2018

@nickva @davisp The client has stated that the "similar issue" (with the couch_doc,'-doc_from_multi_part_stream/4-fun-1-',1, function) only started after an upgrade from 2.0.0rc3 to 2.1.1.

Hopefully that narrows the git bisect a bit further?

@nickva

Contributor

nickva commented Jan 22, 2018

@wohali thanks, it does help!

@nickva

Contributor

nickva commented Jan 24, 2018

(From discussion on IRC)

This might be related to setting a lower max http request limit:

e767b34

Before that change the limit defaulted to 4GB, which is what the code has, but the default.ini file set it to 64MB, so that became the value being used. The max request limit will prevent larger attachments from replicating. Also, the 413 error is not always raised cleanly (see #574, also referenced in the top description).

To confirm whether this is the cause, or whether it affects this issue at all, try bumping:

[httpd]
max_http_request_size = 67108864 ; 64 MB

To a higher value, one that's two or three times the size of the largest attachment or document, perhaps (to account for some overhead).
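
As a worked example of that sizing rule, here is a minimal sketch. The factor of 3 and the helper names are assumptions for illustration, and the floor is the 64 MB default quoted above. For scale, the failing request in the original report carried a 276333-byte attachment in a 278575-byte multipart body, so the multipart framing itself adds only a couple of kilobytes.

```python
DEFAULT_LIMIT = 67108864  # 64 MB, the default.ini value quoted above


def suggested_limit(largest_bytes, factor=3):
    """Return a max_http_request_size of at least `factor` times the largest item.

    Never suggests going below the shipped default.
    """
    return max(DEFAULT_LIMIT, largest_bytes * factor)


def ini_fragment(limit):
    """Render the [httpd] section to place in local.ini."""
    return "[httpd]\nmax_http_request_size = {}\n".format(limit)
```

For the 50MB attachments used in the repro above, `suggested_limit(50 * 1024 * 1024)` gives 157286400 bytes (150 MiB).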

@nickva

Contributor

nickva commented Jan 24, 2018

Related to this, we also started to enforce http request limits more strictly:

5d7170c

@janl

Member

janl commented Jan 31, 2018

Moar data from affected nodes. I’ve listed processes by current_function and sorted them by occurrence.

current function count/group script:
%% Count processes per current_function and print the counts in ascending order.
io:format("~p", [
    lists:keysort(2,
        maps:to_list(lists:foldl(
            fun(Elm, Acc) ->
                case Elm of
                    {M, F, A} ->
                        N = maps:get({M, F, A}, Acc, 0),
                        maps:put({M, F, A}, N + 1, Acc);
                    _Else ->
                        %% Process exited before it could be inspected.
                        Acc
                end
            end,
            #{},
            lists:map(
                fun(Pid) ->
                    case process_info(Pid) of
                        undefined -> [];
                        Info -> proplists:get_value(current_function, Info)
                    end
                end,
                processes()
            )
        ))
    )
]).

Output

Output from an affected node:

[{{code_server,loop,1},1},
 {{couch_replicator_scheduler,stats_updater_loop,1},1},
 {{cpu_sup,measurement_server_loop,1},1},
 {{cpu_sup,port_server_loop,2},1},
 {{erl_eval,do_apply,6},1},
 {{erl_prim_loader,loop,3},1},
 {{erlang,hibernate,3},1},
 {{gen,do_call,4},1},
 {{global,loop_the_locker,1},1},
 {{global,loop_the_registrar,0},1},
 {{inet_gethost_native,main_loop,1},1},
 {{init,loop,1},1},
 {{mem3_shards,'-start_changes_listener/1-fun-0-',1},1},
 {{memsup,port_idle,1},1},
 {{net_kernel,ticker_loop,2},1},
 {{shell,shell_rep,4},1},
 {{standard_error,server_loop,1},1},
 {{user,server_loop,2},1},
 {{couch_changes,wait_updated,3},2},
 {{prim_inet,recv0,3},2},
 {{dist_util,con_loop,9},3},
 {{gen_event,fetch_msg,5},6},
 {{couch_os_process,'-init/1-fun-0-',2},23},
 {{application_master,loop_it,4},25},
 {{application_master,main_loop,2},25},
 {{prim_inet,accept0,2},29},
 {{couch_httpd_multipart,mp_parse_atts,2},31},
 {{fabric_db_update_listener,cleanup_monitor,3},210},
 {{fabric_db_update_listener,wait_db_updated,1},210},
 {{rexi_monitor,wait_monitors,1},210},
 {{rexi_utils,process_message,6},212},
 {{couch_event_listener,loop,2},412},
 {{couch_httpd_multipart,maybe_send_data,1},881},
 {{couch_doc,'-doc_from_multi_part_stream/4-fun-1-',1},912},
 {{gen_server,loop,6},2816}]

Unaffected node in the same cluster:

[{{code_server,loop,1},1},
 {{couch_ejson_compare,less,2},1},
 {{couch_index_server,get_index,3},1},
 {{couch_replicator_scheduler,stats_updater_loop,1},1},
 {{cpu_sup,measurement_server_loop,1},1},
 {{cpu_sup,port_server_loop,2},1},
 {{erl_eval,do_apply,6},1},
 {{erl_prim_loader,loop,3},1},
 {{fabric_util,get_shard,4},1},
 {{global,loop_the_locker,1},1},
 {{global,loop_the_registrar,0},1},
 {{inet_gethost_native,main_loop,1},1},
 {{init,loop,1},1},
 {{mem3_shards,'-start_changes_listener/1-fun-0-',1},1},
 {{memsup,port_idle,1},1},
 {{net_kernel,ticker_loop,2},1},
 {{prim_inet,recv0,3},1},
 {{shell,shell_rep,4},1},
 {{standard_error,server_loop,1},1},
 {{user,server_loop,2},1},
 {{couch_changes,wait_updated,3},2},
 {{dist_util,con_loop,9},3},
 {{erlang,hibernate,3},3},
 {{gen_event,fetch_msg,5},6},
 {{rexi,wait_for_ack,2},8},
 {{couch_os_process,'-init/1-fun-0-',2},16},
 {{application_master,loop_it,4},25},
 {{application_master,main_loop,2},25},
 {{prim_inet,accept0,2},28},
 {{couch_httpd_multipart,mp_parse_atts,2},37},
 {{fabric_db_update_listener,wait_db_updated,1},112},
 {{fabric_db_update_listener,cleanup_monitor,3},113},
 {{rexi_monitor,wait_monitors,1},114},
 {{rexi_utils,process_message,6},114},
 {{couch_event_listener,loop,2},344},
 {{couch_httpd_multipart,maybe_send_data,1},361},
 {{couch_doc,'-doc_from_multi_part_stream/4-fun-1-',1},398},
 {{gen_server,loop,6},10270}]

Output from a cluster that doesn’t have attachments:

[{{code_server,loop,1},1},
 {{cpu_sup,measurement_server_loop,1},1},
 {{cpu_sup,port_server_loop,2},1},
 {{erl_eval,do_apply,6},1},
 {{erl_prim_loader,loop,3},1},
 {{erts_code_purger,loop,0},1},
 {{fabric_db_update_listener,cleanup_monitor,3},1},
 {{fabric_db_update_listener,wait_db_updated,1},1},
 {{global,loop_the_locker,1},1},
 {{global,loop_the_registrar,0},1},
 {{init,loop,1},1},
 {{net_kernel,ticker_loop,2},1},
 {{rexi_monitor,wait_monitors,1},1},
 {{rexi_utils,process_message,6},1},
 {{shell,shell_rep,4},1},
 {{standard_error,server_loop,1},1},
 {{timer,sleep,1},1},
 {{user,server_loop,2},1},
 {{dist_util,con_loop,2},3},
 {{gen_event,fetch_msg,5},6},
 {{couch_changes,wait_updated,3},10},
 {{couch_event_listener,loop,2},19},
 {{application_master,loop_it,4},24},
 {{application_master,main_loop,2},24},
 {{couch_os_process,'-init/1-fun-0-',2},32},
 {{prim_inet,accept0,2},33},
 {{erlang,hibernate,3},75},
 {{gen_server,loop,6},2521}]
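
To make the difference between the affected and the unaffected node easier to spot, the counts can be compared side by side. Below is a minimal sketch in Python, assuming a few of the per-function counts have been copied by hand from the two outputs above into dictionaries. `ratio_report` is a hypothetical helper for illustration, not part of CouchDB.

```python
# Process counts per current function, copied from the two node outputs above.
# Keys are written as "module:function/arity" purely for readability.
affected = {
    "couch_httpd_multipart:mp_parse_atts/2": 31,
    "couch_httpd_multipart:maybe_send_data/1": 881,
    "couch_doc:-doc_from_multi_part_stream/4-fun-1-/1": 912,
    "couch_event_listener:loop/2": 412,
}
unaffected = {
    "couch_httpd_multipart:mp_parse_atts/2": 37,
    "couch_httpd_multipart:maybe_send_data/1": 361,
    "couch_doc:-doc_from_multi_part_stream/4-fun-1-/1": 398,
    "couch_event_listener:loop/2": 344,
}


def ratio_report(a, b):
    """Return (name, count_a, count_b, ratio) rows, largest ratio first."""
    rows = []
    for name in sorted(set(a) & set(b)):
        rows.append((name, a[name], b[name], a[name] / b[name]))
    return sorted(rows, key=lambda row: row[3], reverse=True)


for name, count_a, count_b, ratio in ratio_report(affected, unaffected):
    print(f"{name}: {count_a} vs {count_b} ({ratio:.2f}x)")
```

On these numbers the multipart sender and stream-reader processes stand out, while the parser count itself is roughly the same on both nodes.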
@nickva

Contributor

nickva commented Jan 31, 2018

Possible solution from davisp discussed on IRC:

https://gist.github.com/davisp/27cd7ab54cdffeaa6e96590df4f988f9

janl added a commit that referenced this issue Feb 16, 2018

janl added a commit to janl/couchdb that referenced this issue Feb 20, 2018

janl added a commit to janl/couchdb that referenced this issue Feb 20, 2018

nickva added a commit that referenced this issue Feb 21, 2018

Avoid unconditional retries in replicator's http client
In some cases the higher level code from `couch_replicator_api_wrap` needs to
handle retries explicitly and cannot cope with retries happening in the lower
level http client. In such cases it sets `retries = 0`.

For example:

https://github.com/apache/couchdb/blob/master/src/couch_replicator/src/couch_replicator_api_wrap.erl#L271-L275

The http client should then avoid unconditional retries and instead consult the
`retries` value. If `retries = 0`, it shouldn't retry and should instead bubble
the exception up to the caller.

This bug was discovered when attachments were replicated to a target cluster
and the target cluster's resources were constrained. Attachment `PUT` requests
were made from the context of an `open_revs` `GET` request; when a `PUT`
request timed out, it would be retried. However, because the retry didn't
bubble up to the `open_revs` code, the second `PUT` request would die with a
`noproc` error, since the old parser had exited by then. See issue #745 for
more.
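
The retry rule described in this commit message can be illustrated with a small sketch. This is Python rather than the replicator's Erlang, and `RequestFailed` and `send_with_retries` are hypothetical names invented for the example. It only shows the idea that `retries = 0` means "fail once and let the caller decide".

```python
class RequestFailed(Exception):
    """Hypothetical error standing in for a failed low-level HTTP request."""


def send_with_retries(do_request, retries):
    """Call do_request(), retrying at most `retries` times on failure.

    With retries == 0 the first failure is re-raised immediately, so the
    caller can restart the whole higher-level operation itself.
    """
    attempts_left = retries
    while True:
        try:
            return do_request()
        except RequestFailed:
            if attempts_left <= 0:
                raise  # bubble up instead of retrying unconditionally
            attempts_left -= 1


if __name__ == "__main__":
    calls = []

    def flaky():
        calls.append(1)
        if len(calls) < 3:
            raise RequestFailed("timeout")
        return "ok"

    print(send_with_retries(flaky, retries=2), "after", len(calls), "calls")
```

A caller that streams a request body from another in-flight request would pass `retries=0` and redo both requests together when the exception reaches it.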

nickva added a commit that referenced this issue Feb 22, 2018

nickva added a commit that referenced this issue Feb 22, 2018

davisp added a commit that referenced this issue Feb 22, 2018

Prevent chttpd multipart zombie processes
Occasionally it's possible to lose track of our RPC workers in the main
multipart parsing code. This change monitors each worker process and
then exits if all workers have exited before the parser considers itself
finished.

Fixes part of #745
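
As a rough analogy for the change described above, the sketch below models the parser as a loop that tracks which workers are still alive and gives up once none remain. It is written in Python with a `queue.Queue` standing in for an Erlang mailbox and monitors. The message shapes and the name `parser_loop` are assumptions made for illustration, not the actual chttpd code.

```python
import queue


def parser_loop(mailbox, workers):
    """Handle messages until parsing finishes or every worker is gone.

    `mailbox` is a queue.Queue of (kind, worker_id) tuples and `workers` is
    the set of worker ids being monitored. Both are stand-ins, not CouchDB's
    real data structures.
    """
    alive = set(workers)
    while True:
        kind, worker_id = mailbox.get()
        if kind == "done":
            return "finished"
        if kind == "down":
            alive.discard(worker_id)
            if not alive:
                # Nobody is left to consume data, so waiting any longer
                # would leave this process behind as a zombie.
                return "all_workers_exited"


if __name__ == "__main__":
    box = queue.Queue()
    for message in [("down", "w1"), ("down", "w2")]:
        box.put(message)
    print(parser_loop(box, {"w1", "w2"}))
```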

@davisp davisp referenced this issue Feb 22, 2018

Merged

Prevent chttpd multipart zombie processes #1178

1 of 3 tasks complete

davisp added a commit that referenced this issue Feb 23, 2018

@wohali

Member Author

wohali commented Mar 1, 2018

So in testing with a client, this no longer hangs/crashes/eats all the RAM, but it does still cause an issue where a too-large request body fails to transmit a document. The replicator thinks it HAS successfully transferred the document, and declares replication successful. A subsequent attempt to GET the document results in a 404.

Here is a censored excerpt from the logfile of the situation:

[notice] 2018-02-27T01:16:58.004513Z couchdb@127.0.0.1 <0.31509.279> 0f1a3bf01b localhost:5984 127.0.0.1 undefined GET /_scheduler/docs/_replicator/862944988a0e42e8e7567da18c863571 200 ok 0
[notice] 2018-02-27T01:17:07.843508Z couchdb@127.0.0.1 <0.24829.275> -------- Starting replication 2def458d381dd99fa1f4e4887b5b9775+create_target (http://localhost:5984/db1/ -> http://localhost:5984/db2/) from doc _replicator:862944988a0e42e8e7567da18c863571 worker_procesess:4 worker_batch_size:500 session_id:bfafe761a21c1ea012228fc2df6790a9
[notice] 2018-02-27T01:17:07.843558Z couchdb@127.0.0.1 <0.24829.275> -------- Document `862944988a0e42e8e7567da18c863571` triggered replication `2def458d381dd99fa1f4e4887b5b9775+create_target`

[notice] 2018-02-27T01:17:10.600494Z couchdb@127.0.0.1 <0.18679.277> 66a9b51dfd localhost:5984 127.0.0.1 undefined GET /db1/foo?revs=true&open_revs=%5B%222-d32df74e77cfebcc08455cda37518117%22%5D&latest=true 200 ok 2668
[error] 2018-02-27T01:17:10.869278Z couchdb@127.0.0.1 <0.23437.278> -------- Replicator: error writing document `foo` to `http://localhost:5984/db2/`: {error,request_body_too_large}
[notice] 2018-02-27T01:17:27.107006Z couchdb@127.0.0.1 <0.24829.275> -------- Replication `2def458d381dd99fa1f4e4887b5b9775+create_target` completed (triggered by `862944988a0e42e8e7567da18c863571`)

[notice] 2018-02-27T01:18:05.754340Z couchdb@127.0.0.1 <0.24230.274> 2a9c5d5507 localhost:5984 127.0.0.1 undefined GET /db2/foo 404 ok 1

Note that in extensive testing, this has only happened four times, so I'm not sure I can provide an easy reproducer here, but we'll keep at it.

Couch was running at the info log level for this test, so I'm going to bump it up to debug level and try the test again, hoping for a duplicate.

@nickva

Contributor

nickva commented Mar 1, 2018

If some requests fail with a 413 it's not surprising that it completes. It should bump the doc_write_failures stats in the completion record to indicate how many documents it failed to write.

The question is why does that one request fail with a 413 to start with.

Good call on debug logs. Also, what are the doc sizes involved, and how many revisions per document? Any attachments? And what are the [couchdb] max_document_size, [couchdb] max_attachment_size and [httpd] max_http_request_size params?

@wohali

Member Author

wohali commented Mar 1, 2018

Yes, attachments. You can see in the log that revisions per doc are low.

Everything else is default. (Working around this by bumping the defaults is just sweeping the error under the rug...)

Of course, we should be replicating the document. Why is it even getting a 413 in the first place? It's replicating a document from the same server to the same server, no settings have changed, surely it should be able to PUT a document it just did a GET of from itself. I believe this test (I didn't write it) runs in a loop, so the data is being replicated over and over from one database to the next on the same server.

Finally, even with a bumped doc_write_failures value, calling this a completed replication is a VERY surprising result. Unless you think to check /_scheduler/docs for the failed document count, or compare source/target document counts, you'd never know there was a failure.
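
Until that behaviour changes, a client can at least check for silent failures itself. The sketch below is one hedged way to do so in Python: it searches a parsed JSON response for every `doc_write_failures` value, wherever it appears, rather than relying on a particular response layout. The sample response is a trimmed, hypothetical example, and `find_values` and `has_write_failures` are names made up for this illustration.

```python
import json


def find_values(node, key):
    """Collect every value stored under `key` anywhere in parsed JSON."""
    found = []
    if isinstance(node, dict):
        for name, value in node.items():
            if name == key:
                found.append(value)
            else:
                found.extend(find_values(value, key))
    elif isinstance(node, list):
        for item in node:
            found.extend(find_values(item, key))
    return found


def has_write_failures(response):
    """True if any doc_write_failures counter in the response is non-zero."""
    return any(int(v) > 0 for v in find_values(response, "doc_write_failures"))


# Hypothetical, trimmed replication response used only for the example.
sample = json.loads(
    '{"ok": true, "history": [{"docs_read": 1, "docs_written": 0,'
    ' "doc_write_failures": 1}]}'
)

if has_write_failures(sample):
    print("replication reported completion, but some documents were not written")
```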

@wohali

Member Author

wohali commented Mar 2, 2018

We're now hard-rejecting attachments greater than max_http_request_size, sadly.

@nickva I have an excellent reproducible case:

  1. Build couchdb, master, latest.
  2. Set up my makeit.py script from this comment.
  3. Run dev/run -n 1 --with-admin-party-please.
  4. Run curl -X PUT localhost:15984/foo
  5. Edit the URL in makeit.py to reflect http://localhost:15984/foo.
  6. After entering the virtualenv for makeit.py, run: python ./makeit.py 10 --size=75000000
  7. Start a replication: curl -X PUT localhost:15984/_replicator/bar -d '{"source": "http://localhost:15984/foo", "target": "http://localhost:15984/bar", "create_target": true}'
  8. Watch the sparks fly.

/cc @janl

@janl

Member

janl commented Mar 2, 2018

Great repro Joan. I played with it and came up with this:

The python script uses the standalone attachment API: /db/doc/att. The handler for this request does NOT apply max_http_request_size (that check happens in chttpd:body/2 or couch_httpd:check_max_request_length(), neither of which is used by the standalone attachment API).

The twist now is that the replicator uses multipart requests and not standalone attachment requests. Multipart requests are subject to the max_http_request_size limit.

This leads to the observed behaviour that you can create an attachment in one db and can NOT replicate that attachment to another db on the same CouchDB node (or another node with the same max_http_request_size limit).

Applying max_http_request_size in the standalone attachment API is trivial[1], but leads to the next unfortunate behaviour:

Say you create a doc with two attachments, each with a length just under max_http_request_size. Each individual attachment write will succeed, but replicating the doc to another db will, again, produce a multipart request that overall is > max_http_request_size.

I haven’t checked this, but a conflicting doc with one attachment < max_http_request_size where the attachment data is conflicted might also produce a multipart http request > max_http_request_size to replicate both conflicting revisions and attachment bodies.

This leads us to having to decide:

  1. is max_http_request_size a hard hard hard limit or do we accept requests larger than that, if they are multipart http requests?
     • if yes, do we apply the max_document_size and max_attachment_size to individual chunks of the multipart request?
  2. if not 1., do we need to rewrite the replicator to not produce requests > max_http_request_size and potentially do attachments individually?
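
The arithmetic behind the two-attachment case can be sketched as follows. This is a Python illustration under stated assumptions: the byte counts are made up, the multipart body is approximated from below as document JSON plus attachment bytes (MIME boundaries and part headers are ignored), and the limit is assumed to be inclusive. `multipart_lower_bound` and `may_fit` are hypothetical helpers, not CouchDB functions.

```python
def multipart_lower_bound(doc_json_bytes, attachment_sizes):
    """Lower bound on a multipart body: doc JSON plus all attachment bytes.

    MIME boundaries and per-part headers are ignored, so a real request
    is somewhat larger than this.
    """
    return doc_json_bytes + sum(attachment_sizes)


def may_fit(doc_json_bytes, attachment_sizes, max_http_request_size):
    """False means the request certainly exceeds the limit."""
    return multipart_lower_bound(doc_json_bytes, attachment_sizes) <= max_http_request_size


limit = 1500                # same value as used in the shorter repro
attachments = [1400, 1400]  # illustrative sizes, each under the limit
doc_json = 200              # illustrative size of the document body

# Each attachment could be written on its own...
print(all(size <= limit for size in attachments))
# ...but one multipart request carrying both cannot fit.
print(may_fit(doc_json, attachments, limit))
```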

References:
[1]:

--- a/src/chttpd/src/chttpd_db.erl
+++ b/src/chttpd/src/chttpd_db.erl
@@ -1218,6 +1218,7 @@ db_attachment_req(#httpd{method=Method, user_ctx=Ctx}=Req, Db, DocId, FileNamePa
                 undefined -> <<"application/octet-stream">>;
                 CType -> list_to_binary(CType)
             end,
+           couch_httpd:check_max_request_length(Req),
            Data = fabric:att_receiver(Req, chttpd:body_length(Req)),
            ContentLen = case couch_httpd:header_value(Req,"Content-Length") of
                undefined -> undefined;
@janl

Member

janl commented Mar 2, 2018

Shorter repro that runs quickly, tests the 1 attachment > max_http_request_size as well as the 2 attachments < max_http_request_size but att1 + att2 > max_http_request_size cases.

Look for the two instances of "doc_write_failures":1 in the output.

#!/bin/sh

COUCH=http://127.0.0.1:15984
INT=http://127.0.0.1:15986
DBA=$COUCH/db
DBB=$COUCH/dbb

# cleanup
curl -X DELETE $DBA
curl -X DELETE $DBB

# setup
curl -X PUT $DBA
curl -X PUT $DBB

# config
curl -X PUT $INT/_config/httpd/max_http_request_size -d '"1500"'
curl -X PUT $INT/_config/replicator/retries_per_request -d '"1"'

# create an att > max_http_request_size, should succeed
# 3000 here as not to run into _local checkpoint size limits

curl -X PUT http://127.0.0.1:15984/db/doc/att --data-binary "$BODY3000" -Hcontent-type:application/octet-stream

# replicate, should succeed, but with one doc_write_failure
curl -X POST $COUCH/_replicate -d "{\"source\": \"$DBA\", \"target\": \"$DBB\"}" -H content-type:application/json



# create two atts, each < max_http_request_size, but att1+att2 > max_http_request_size


# cleanup
curl -X DELETE $DBA
curl -X DELETE $DBB

# setup
curl -X PUT $DBA
curl -X PUT $DBB



REV=`curl -sX PUT http://127.0.0.1:15984/db/doc1/att --data-binary "$BODY1500" -Hcontent-type:application/octet-stream | cut -b 31-64`

curl -X PUT http://127.0.0.1:15984/db/doc1/att2?rev=$REV --data-binary "$BODY1500" -Hcontent-type:application/octet-stream

# replicate, should succeed, but with one doc_write_failure
curl -X POST $COUCH/_replicate -d "{\"source\": \"$DBA\", \"target\": \"$DBB\"}" -H content-type:application/json
@janl

Member

janl commented Mar 5, 2018

I suggest we close this in favour of #1200.

tl;dr: CouchDB master works as expected, but has an unfortunate behaviour leading to replication failures when attachments are > max_http_request_size, the solution to which we’re discussing in #1200.

@janl janl closed this Mar 5, 2018

janl added a commit that referenced this issue Mar 5, 2018

tonysun83 added a commit that referenced this issue Mar 8, 2018

tonysun83 added a commit that referenced this issue Mar 8, 2018

nickva added a commit to cloudant/couchdb that referenced this issue Mar 8, 2018

Revert "re-enable "flaky" test in quest to nail down apache#745"
This reverts commit 4a73d03.

Latest Mochiweb 2.17 might have helped a bit, but running `soak-eunit
suites=couch_replicator_small_max_request_size_target` makes it fail after 10-15
runs locally for me.

nickva added a commit that referenced this issue Mar 8, 2018

nickva added a commit to cloudant/couchdb that referenced this issue Mar 23, 2018

janl added a commit that referenced this issue Mar 26, 2018

jiangphcn added a commit that referenced this issue May 18, 2018
