Database compaction stuck in a loop #2941

Closed
markusd opened this issue Jun 12, 2020 · 6 comments
Comments


markusd commented Jun 12, 2020

Description

3 of 9 nodes in my CouchDB cluster were stuck in a loop during database compaction. These nodes were issuing significantly more disk IO than the unaffected nodes, which is how I noticed the problem in the first place. I do not think the compaction tasks showed up in _active_tasks, at least not consistently (they may have appeared and disappeared intermittently).

The compaction files were much older than the database file itself:

-rw-rw-r--  1 couchdb root 16249164041 Jun 10 10:28 inventories_db.1590763566.couch
-rw-rw-r--  1 couchdb root  2852633672 Jun  2 06:43 inventories_db.1590763566.couch.compact.data
-rw-rw-r--  1 couchdb root     9287509 Jun  2 06:43 inventories_db.1590763566.couch.compact.meta

The logs showed the following loop repeating every few seconds:

[notice] 2020-06-10T10:54:57.537879Z couchdb@c-couchdb-2-m-6.c-couchdb-2-m <0.2700.0> -------- slack_dbs: adding <<"shards/00000000-0e38e38d/inventories_db.1590763566">> to internal compactor queue with priority 1662730747
[notice] 2020-06-10T10:54:57.538019Z couchdb@c-couchdb-2-m-6.c-couchdb-2-m <0.2700.0> -------- slack_dbs: Starting compaction for shards/00000000-0e38e38d/inventories_db.1590763566 (priority 1662730747)
[info] 2020-06-10T10:54:57.538072Z couchdb@c-couchdb-2-m-6.c-couchdb-2-m <0.10254.140> -------- Starting compaction for db "shards/00000000-0e38e38d/inventories_db.1590763566" at 333559
[notice] 2020-06-10T10:54:57.538354Z couchdb@c-couchdb-2-m-6.c-couchdb-2-m <0.2700.0> -------- slack_dbs: Started compaction for shards/00000000-0e38e38d/inventories_db.1590763566
[warning] 2020-06-10T10:54:59.521669Z couchdb@c-couchdb-2-m-6.c-couchdb-2-m <0.2700.0> -------- exit for compaction of ["shards/00000000-0e38e38d/inventories_db.1590763566"]: {function_clause,[{couch_emsort,set_options,[{ems,<0.7233.140>,undefined,10,100,0,0},{[9052929,8823270,8599979,8373646,8144824,7929093,7702555,7474977,7250283,7024316],7025740}],[{file,"src/couch_emsort.erl"},{line,157}]},{couch_emsort,open,2,[{file,"src/couch_emsort.erl"},{line,154}]},{couch_bt_engine_compactor,bind_emsort,3,[{file,"src/couch_bt_engine_compactor.erl"},{line,634}]},{couch_bt_engine_compactor,open_compaction_files,3,[{file,"src/couch_bt_engine_compactor.erl"},{line,109}]},{couch_bt_engine_compactor,start,4,[{file,"src/couch_bt_engine_compactor.erl"},{line,62}]}]}
[error] 2020-06-10T10:54:59.522030Z couchdb@c-couchdb-2-m-6.c-couchdb-2-m emulator -------- Error in process <0.21911.135> on node 'couchdb@c-couchdb-2-m-6.c-couchdb-2-m' with exit value:
{function_clause,[{couch_emsort,set_options,[{ems,<0.7233.140>,undefined,10,100,0,0},{[9052929,8823270,8599979,8373646,8144824,7929093,7702555,7474977,7250283,7024316],7025740}],[{file,"src/couch_emsort.erl"},{line,157}]},{couch_emsort,open,2,[{file,"src/couch_emsort.erl"},{line,154}]},{couch_bt_engine_compactor,bind_emsort,3,[{file,"src/couch_bt_engine_compactor.erl"},{line,634}]},{couch_bt_engine_compactor,open_compaction_files,3,[{file,"src/couch_bt_engine_compactor.erl"},{line,109}]},{couch_bt_engine_compactor,start,4,[{file,"src/couch_bt_engine_compactor.erl"},{line,62}]}]}
[info] 2020-06-10T10:54:59.522029Z couchdb@c-couchdb-2-m-6.c-couchdb-2-m <0.227.0> -------- db shards/00000000-0e38e38d/inventories_db.1590763566 died with reason {function_clause,[{couch_emsort,set_options,[{ems,<0.7233.140>,undefined,10,100,0,0},{[9052929,8823270,8599979,8373646,8144824,7929093,7702555,7474977,7250283,7024316],7025740}],[{file,"src/couch_emsort.erl"},{line,157}]},{couch_emsort,open,2,[{file,"src/couch_emsort.erl"},{line,154}]},{couch_bt_engine_compactor,bind_emsort,3,[{file,"src/couch_bt_engine_compactor.erl"},{line,634}]},{couch_bt_engine_compactor,open_compaction_files,3,[{file,"src/couch_bt_engine_compactor.erl"},{line,109}]},{couch_bt_engine_compactor,start,4,[{file,"src/couch_bt_engine_compactor.erl"},{line,62}]}]}

[notice] 2020-06-10T10:55:04.524763Z couchdb@c-couchdb-2-m-6.c-couchdb-2-m <0.2700.0> -------- slack_dbs: adding <<"shards/00000000-0e38e38d/inventories_db.1590763566">> to internal compactor queue with priority 1662730747

I deleted the .compact files manually, which restarted the compaction. It ran to completion on all nodes without issues.

Steps to Reproduce

I do not know. I simply replicated a database (6 million docs, 250 GB, q=18, n=3), the automatic background compaction started, and it got stuck.

Expected Behaviour

Compaction should not get stuck, and should not consume large amounts of disk IO without making progress.

Your Environment

  • CouchDB version used: 3.1.0
wohali (Member) commented Jun 12, 2020

Looks like a real bug. Thanks for the info on a workaround!

nickva (Contributor) commented Jul 11, 2020

Thanks for the report, @markusd, and good analysis, @wohali. It does look like a bug.

If this was a compaction file left over from before the upgrade to 3.1.0, this is what might have happened:

Previously, the emsort:get_state/1 result was just the root value {BB, PrevPos}; in 3.1.0 it was converted to a proplist [{root, Root}, ...]. There is an upgrade clause, but it only checks for an integer, and I think we'd want to check for the {_BB, _PrevPos} pattern there instead.
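
In rough terms, the missing upgrade handling might look like the sketch below. This is only an illustration, not the actual patch: the module and function names here are made up, and the real logic lives around couch_bt_engine_compactor:bind_emsort/3 and couch_emsort:set_options/2.

-module(emsort_state_upgrade_sketch).
-export([upgrade_state/1]).

%% Sketch only: normalize an emsort state read from an old .compact.meta
%% header into the post-3.1.0 proplist form before it reaches
%% couch_emsort:set_options/2.
upgrade_state(nil) ->
    [];
upgrade_state(Pos) when is_integer(Pos) ->
    %% the existing upgrade clause: a bare integer root position
    [{root, Pos}];
upgrade_state({_BB, _PrevPos} = Root) ->
    %% the clause suggested above: an old-format {BB, PrevPos} root tuple
    [{root, Root}];
upgrade_state(State) when is_list(State) ->
    %% already the new [{root, Root}, ...] proplist format
    State.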

@davisp, what do you think, is that about right?

rnewson (Member) commented Jul 13, 2020

Agreed, this is a real bug, and the cause is the commit you pointed out (123bf82). The old state is passed through as an option, and set_options (rightly) crashes when given an unexpected option.
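
For context, here is a simplified sketch of why this turns into a tight loop: every clause head of set_options/2 expects a list of {Key, Value} options, so the old-format state tuple falls through all of them. The record and clauses below are abbreviated, not the real source.

-module(couch_emsort_sketch).
-export([set_options/2]).

%% Abbreviated: the real #ems{} record has more fields and the real
%% function handles more option keys.
-record(ems, {fd, root, bb_chunk = 10, chain_chunk = 100}).

set_options(Ems, []) ->
    Ems;
set_options(Ems, [{root, Root} | Rest]) ->
    set_options(Ems#ems{root = Root}, Rest);
set_options(Ems, [{chain_chunk, Count} | Rest]) when is_integer(Count) ->
    set_options(Ems#ems{chain_chunk = Count}, Rest).

%% After 123bf82, an old on-disk state such as {[9052929, ...], 7025740} ends
%% up as the second argument here. No clause head matches a bare tuple, so the
%% call raises function_clause, the compactor process exits, and the
%% compaction scheduler (the slack_dbs channel in the log) re-queues the shard
%% a few seconds later, producing the loop seen above.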

kathy1121 commented

Thanks Markus, I hit the same issue and resolved it as you suggested. Much appreciated!

nickva (Contributor) commented Jul 13, 2020

Merged the fix to the 3.x branch: #3001


bdoyle0182 commented Dec 9, 2020

We're seeing a similar issue with random shard compactions after upgrading from 2.x to 3.1.1, though it might be completely unrelated. The compaction metadata file for a shard of about 30 GB blew up to about 500 GB over 24 hours while constantly hitting the error below. Similarly, we're seeing much higher disk IO on the affected nodes than on the unaffected ones.

<0.5181.0> -------- exit for compaction of ["shards/60000000-7fffffff/core_activations.1589336264"]: {badarith,[{couch_file,get_pread_locnum,3,[{file,"src/couch_file.erl"},{line,730}]},{lists,map,2,[{file,"lists.erl"},{line,1239}]},{lists,map,2,[{file,"lists.erl"},{line,1239}]},{couch_file,read_multi_raw_iolists_int,2,[{file,"src/couch_file.erl"},{line,719}]},{couch_file,handle_call,3,[{file,"src/couch_file.erl"},{line,507}]},{gen_server,try_handle_call,4,[{file,"gen_server.erl"},{line,636}]},{gen_server,handle_msg,6,[{file,"gen_server.erl"},{line,665}]},{proc_lib,init_p_do_apply,3,[{file,"proc_lib.erl"},{line,247}]}]}

-------- CRASH REPORT Process (<0.32443.1691>) with 3 neighbors crashed with reason: bad arithmetic expression at couch_file:get_pread_locnum/3(line:730) <= lists:map/2(line:1239) <= couch_file:read_multi_raw_iolists_int/2(line:719) <= couch_file:handle_call/3(line:507) <= gen_server:try_handle_call/4(line:636) <= gen_server:handle_msg/6(line:665) <= proc_lib:init_p_do_apply/3(line:247); initial_call: {couch_file,init,['Argument__1']}, ancestors: [<0.3352.1692>], message_queue_len: 0, messages: [], links: [<0.3352.1692>], dictionary: [{couch_file_fd,{{file_descriptor,prim_file,{#Port<0.1924847>,92}},"..."}},...], trap_exit: false, status: running, heap_size: 28690, stack_size: 27, reductions: 13483
