Database compaction stuck in a loop #2941

Closed
markusd opened this issue Jun 12, 2020 · 6 comments
Comments


markusd commented Jun 12, 2020

Description

3 of 9 nodes in my CouchDB cluster were stuck in a loop during database compaction. These nodes were issuing significantly more disk IO than the unaffected nodes, which is how I noticed the problem in the first place. I do not think the compaction tasks showed up in _active_tasks, at least not consistently (they may have appeared and disappeared intermittently).

The compaction files were much older than the database file itself:

-rw-rw-r--  1 couchdb root 16249164041 Jun 10 10:28 inventories_db.1590763566.couch
-rw-rw-r--  1 couchdb root  2852633672 Jun  2 06:43 inventories_db.1590763566.couch.compact.data
-rw-rw-r--  1 couchdb root     9287509 Jun  2 06:43 inventories_db.1590763566.couch.compact.meta

The logs showed the following loop repeating every few seconds:

[notice] 2020-06-10T10:54:57.537879Z couchdb@c-couchdb-2-m-6.c-couchdb-2-m <0.2700.0> -------- slack_dbs: adding <<"shards/00000000-0e38e38d/inventories_db.1590763566">> to internal compactor queue with priority 1662730747
[notice] 2020-06-10T10:54:57.538019Z couchdb@c-couchdb-2-m-6.c-couchdb-2-m <0.2700.0> -------- slack_dbs: Starting compaction for shards/00000000-0e38e38d/inventories_db.1590763566 (priority 1662730747)
[info] 2020-06-10T10:54:57.538072Z couchdb@c-couchdb-2-m-6.c-couchdb-2-m <0.10254.140> -------- Starting compaction for db "shards/00000000-0e38e38d/inventories_db.1590763566" at 333559
[notice] 2020-06-10T10:54:57.538354Z couchdb@c-couchdb-2-m-6.c-couchdb-2-m <0.2700.0> -------- slack_dbs: Started compaction for shards/00000000-0e38e38d/inventories_db.1590763566
[warning] 2020-06-10T10:54:59.521669Z couchdb@c-couchdb-2-m-6.c-couchdb-2-m <0.2700.0> -------- exit for compaction of ["shards/00000000-0e38e38d/inventories_db.1590763566"]: {function_clause,[{couch_emsort,set_options,[{ems,<0.7233.140>,undefined,10,100,0,0},{[9052929,8823270,8599979,8373646,8144824,7929093,7702555,7474977,7250283,7024316],7025740}],[{file,"src/couch_emsort.erl"},{line,157}]},{couch_emsort,open,2,[{file,"src/couch_emsort.erl"},{line,154}]},{couch_bt_engine_compactor,bind_emsort,3,[{file,"src/couch_bt_engine_compactor.erl"},{line,634}]},{couch_bt_engine_compactor,open_compaction_files,3,[{file,"src/couch_bt_engine_compactor.erl"},{line,109}]},{couch_bt_engine_compactor,start,4,[{file,"src/couch_bt_engine_compactor.erl"},{line,62}]}]}
[error] 2020-06-10T10:54:59.522030Z couchdb@c-couchdb-2-m-6.c-couchdb-2-m emulator -------- Error in process <0.21911.135> on node 'couchdb@c-couchdb-2-m-6.c-couchdb-2-m' with exit value:
{function_clause,[{couch_emsort,set_options,[{ems,<0.7233.140>,undefined,10,100,0,0},{[9052929,8823270,8599979,8373646,8144824,7929093,7702555,7474977,7250283,7024316],7025740}],[{file,"src/couch_emsort.erl"},{line,157}]},{couch_emsort,open,2,[{file,"src/couch_emsort.erl"},{line,154}]},{couch_bt_engine_compactor,bind_emsort,3,[{file,"src/couch_bt_engine_compactor.erl"},{line,634}]},{couch_bt_engine_compactor,open_compaction_files,3,[{file,"src/couch_bt_engine_compactor.erl"},{line,109}]},{couch_bt_engine_compactor,start,4,[{file,"src/couch_bt_engine_compactor.erl"},{line,62}]}]}
[info] 2020-06-10T10:54:59.522029Z couchdb@c-couchdb-2-m-6.c-couchdb-2-m <0.227.0> -------- db shards/00000000-0e38e38d/inventories_db.1590763566 died with reason {function_clause,[{couch_emsort,set_options,[{ems,<0.7233.140>,undefined,10,100,0,0},{[9052929,8823270,8599979,8373646,8144824,7929093,7702555,7474977,7250283,7024316],7025740}],[{file,"src/couch_emsort.erl"},{line,157}]},{couch_emsort,open,2,[{file,"src/couch_emsort.erl"},{line,154}]},{couch_bt_engine_compactor,bind_emsort,3,[{file,"src/couch_bt_engine_compactor.erl"},{line,634}]},{couch_bt_engine_compactor,open_compaction_files,3,[{file,"src/couch_bt_engine_compactor.erl"},{line,109}]},{couch_bt_engine_compactor,start,4,[{file,"src/couch_bt_engine_compactor.erl"},{line,62}]}]}

[notice] 2020-06-10T10:55:04.524763Z couchdb@c-couchdb-2-m-6.c-couchdb-2-m <0.2700.0> -------- slack_dbs: adding <<"shards/00000000-0e38e38d/inventories_db.1590763566">> to internal compactor queue with priority 1662730747

I deleted the .compact files manually, which restarted the compaction. It ran to completion on all nodes without issues.

Steps to Reproduce

I do not know. I simply replicated a database (6 million docs, 250 GB, q=18, n=3), the automatic background compaction started, and it got stuck.

Expected Behaviour

Compaction should not get stuck, and should not consume large amounts of disk IO without making progress.

Your Environment

  • CouchDB version used: 3.1.0
wohali (Member) commented Jun 12, 2020

Looks like a real bug. Thanks for the info on a workaround!

nickva (Contributor) commented Jul 11, 2020

Thanks for the report, @markusd, and good analysis, @wohali. It does look like a bug.

If this was a compaction file left over from before the upgrade to 3.1.0, this is what might have happened:

Previously, the emsort:get_state/1 result was just the root value {BB, PrevPos}; in 3.1.0 it was converted to a proplist [{root, Root}, ...]. There is an upgrade clause, but it only checks for an integer, and I think we'd want to check for the {_BB, _PrevPos} pattern there instead.
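
In rough terms, the missing upgrade handling might look like the sketch below. This is only an illustration, not the actual patch: the module and function names here are made up, and the real logic lives around couch_bt_engine_compactor:bind_emsort/3 and couch_emsort:set_options/2.

-module(emsort_state_upgrade_sketch).
-export([upgrade_state/1]).

%% Sketch only: normalize an emsort state read from an old .compact.meta
%% header into the post-3.1.0 proplist form before it reaches
%% couch_emsort:set_options/2.
upgrade_state(nil) ->
    [];
upgrade_state(Pos) when is_integer(Pos) ->
    %% the existing upgrade clause: a bare integer root position
    [{root, Pos}];
upgrade_state({_BB, _PrevPos} = Root) ->
    %% the clause suggested above: an old-format {BB, PrevPos} root tuple
    [{root, Root}];
upgrade_state(State) when is_list(State) ->
    %% already the new [{root, Root}, ...] proplist format
    State.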

@davisp, what do you think, is that about right?

rnewson (Member) commented Jul 13, 2020

Agreed, this is a real bug, and the cause is the commit you pointed out (123bf82). The old state is passed through as an option, and set_options (rightly) crashes when given an unexpected option.
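
For context, here is a simplified sketch of why this turns into a tight loop: every clause head of set_options/2 expects a list of {Key, Value} options, so the old-format state tuple falls through all of them. The record and clauses below are abbreviated, not the real source.

-module(couch_emsort_sketch).
-export([set_options/2]).

%% Abbreviated: the real #ems{} record has more fields and the real
%% function handles more option keys.
-record(ems, {fd, root, bb_chunk = 10, chain_chunk = 100}).

set_options(Ems, []) ->
    Ems;
set_options(Ems, [{root, Root} | Rest]) ->
    set_options(Ems#ems{root = Root}, Rest);
set_options(Ems, [{chain_chunk, Count} | Rest]) when is_integer(Count) ->
    set_options(Ems#ems{chain_chunk = Count}, Rest).

%% After 123bf82, an old on-disk state such as {[9052929, ...], 7025740} ends
%% up as the second argument here. No clause head matches a bare tuple, so the
%% call raises function_clause, the compactor process exits, and the
%% compaction scheduler (the slack_dbs channel in the log) re-queues the shard
%% a few seconds later, producing the loop seen above.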

kathy1121 commented

Thanks Markus, I hit the same issue and resolved it as you suggested. Much appreciated!

nickva (Contributor) commented Jul 13, 2020

Merged the fix to the 3.x branch: #3001


bdoyle0182 commented Dec 9, 2020

We're seeing a similar issue with random shard compactions after upgrading from 2.x to 3.1.1, though it might be completely unrelated. The compaction metadata file for a shard of about 30 GB blew up to about 500 GB over 24 hours while constantly hitting the error below. Similarly, we're seeing much higher disk IO on the affected nodes than on the unaffected ones.

<0.5181.0> -------- exit for compaction of ["shards/60000000-7fffffff/core_activations.1589336264"]: {badarith,[{couch_file,get_pread_locnum,3,[{file,"src/couch_file.erl"},{line,730}]},{lists,map,2,[{file,"lists.erl"},{line,1239}]},{lists,map,2,[{file,"lists.erl"},{line,1239}]},{couch_file,read_multi_raw_iolists_int,2,[{file,"src/couch_file.erl"},{line,719}]},{couch_file,handle_call,3,[{file,"src/couch_file.erl"},{line,507}]},{gen_server,try_handle_call,4,[{file,"gen_server.erl"},{line,636}]},{gen_server,handle_msg,6,[{file,"gen_server.erl"},{line,665}]},{proc_lib,init_p_do_apply,3,[{file,"proc_lib.erl"},{line,247}]}]}

-------- CRASH REPORT Process (<0.32443.1691>) with 3 neighbors crashed with reason: bad arithmetic expression at couch_file:get_pread_locnum/3(line:730) <= lists:map/2(line:1239) <= couch_file:read_multi_raw_iolists_int/2(line:719) <= couch_file:handle_call/3(line:507) <= gen_server:try_handle_call/4(line:636) <= gen_server:handle_msg/6(line:665) <= proc_lib:init_p_do_apply/3(line:247); initial_call: {couch_file,init,['Argument__1']}, ancestors: [<0.3352.1692>], message_queue_len: 0, messages: [], links: [<0.3352.1692>], dictionary: [{couch_file_fd,{{file_descriptor,prim_file,{#Port<0.1924847>,92}},"..."}},...], trap_exit: false, status: running, heap_size: 28690, stack_size: 27, reductions: 13483
