
Compaction Failures Similar to #2941 on Upgrade From 2.x to 3.1.1 #3292

Open

bdoyle0182 opened this issue Dec 9, 2020 · 8 comments

bdoyle0182 commented Dec 9, 2020

Description

We're seeing an issue similar to #2941, with random compaction failures on shards after upgrading from 2.x to 3.1.1, though it might be completely unrelated. The compaction metadata file blew up to about 500 GB over 24 hours for a shard that is only about 30 GB and is constantly hitting the error below. As in #2941, we're also seeing much higher disk I/O on the affected nodes compared to the unaffected ones. I've deleted the compaction files as discussed in the previous issue and things seem to be working fine now: compaction is running, the errors have stopped, and disk I/O has gone back down.

<0.5181.0> -------- exit for compaction of ["shards/60000000-7fffffff/core_activations.1589336264"]: {badarith,[{couch_file,get_pread_locnum,3,[{file,"src/couch_file.erl"},{line,730}]},{lists,map,2,[{file,"lists.erl"},{line,1239}]},{lists,map,2,[{file,"lists.erl"},{line,1239}]},{couch_file,read_multi_raw_iolists_int,2,[{file,"src/couch_file.erl"},{line,719}]},{couch_file,handle_call,3,[{file,"src/couch_file.erl"},{line,507}]},{gen_server,try_handle_call,4,[{file,"gen_server.erl"},{line,636}]},{gen_server,handle_msg,6,[{file,"gen_server.erl"},{line,665}]},{proc_lib,init_p_do_apply,3,[{file,"proc_lib.erl"},{line,247}]}]}

-------- CRASH REPORT Process (<0.32443.1691>) with 3 neighbors crashed with reason: bad arithmetic expression at couch_file:get_pread_locnum/3(line:730) <= lists:map/2(line:1239) <= couch_file:read_multi_raw_iolists_int/2(line:719) <= couch_file:handle_call/3(line:507) <= gen_server:try_handle_call/4(line:636) <= gen_server:handle_msg/6(line:665) <= proc_lib:init_p_do_apply/3(line:247); initial_call: {couch_file,init,['Argument__1']}, ancestors: [<0.3352.1692>], message_queue_len: 0, messages: [], links: [<0.3352.1692>], dictionary: [{couch_file_fd,{{file_descriptor,prim_file,{#Port<0.1924847>,92}},"..."}},...], trap_exit: false, status: running, heap_size: 28690, stack_size: 27, reductions: 13483

An example of a shard I haven't cleaned up yet:
-rw-r--r-- 1 1501 1501 16G Dec 9 17:13 core_activations.1589323247.couch
-rw-r--r-- 1 1501 1501 4.0G Dec 9 17:11 core_activations.1589323247.couch.compact.data
-rw-r--r-- 1 1501 1501 159G Dec 9 17:13 core_activations.1589323247.couch.compact.meta

Steps to Reproduce

Upgrade to 3.1.1 from 2.x mid compaction

Expected Behaviour

Compaction to complete as expected

Your Environment

If any specific environment details would be helpful, just let me know.

  • CouchDB version used: 3.1.1
  • Browser name and version:
  • Operating system and version: CentOS
wohali (Member) commented Dec 9, 2020

Are you able to test the fix that was merged in #3001?

bdoyle0182 (Author)

I was under the impression that fix was included in 3.1.1, which we upgraded to directly from 2.x, according to the release notes under bug fixes:
https://docs.couchdb.org/en/latest/whatsnew/3.1.html

wohali (Member) commented Dec 10, 2020

Yup, you're right, my mistake.

@nickva Does this look like anything that #3001 would have missed?

nickva (Contributor) commented Dec 10, 2020

This looks like a different failure from #3001.

@bdoyle0182 what version of 2.x was running on the old nodes? What Erlang VM version, OS (which version of CentOS), and file system was used?

Looking through the issues so far, I think this is the first instance of a badarith error from that part of the code.

nickva (Contributor) commented Dec 10, 2020

Looking at the code, the crash is coming from:

get_pread_locnum(File, Pos, Len) ->
    BlockOffset = Pos rem ?SIZE_BLOCK,

It looks like Pos (position) there is not an integer but something like undefined or eof.
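A quick illustration in an Erlang shell (not CouchDB code, just showing the failure mode of rem with a non-integer left operand, with output roughly as follows):

    1> 12345 rem 4096.
    57
    2> undefined rem 4096.
    ** exception error: an error occurred when evaluating an arithmetic expression
         in operator  rem/2
            called as undefined rem 4096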

That gets called from couch_file:pread_terms, which is only called from the compactor's copy_meta_data:

    Acc = merge_docids(Iter, Acc0),
    {ok, Infos} = couch_file:pread_terms(SrcFd, Acc#merge_st.locs),

or from the compactor's merge_docids:

merge_docids(Iter, #merge_st{locs=Locs}=Acc) when length(Locs) > 1000 ->
    #merge_st{
        src_fd=SrcFd,
        id_tree=IdTree0,
        seq_tree=SeqTree0,
        rem_seqs=RemSeqs
    } = Acc,
    {ok, Infos} = couch_file:pread_terms(SrcFd, Locs),
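Since the crash report above shows the offset calculation being applied via lists:map over the whole batch of locations, a single stale or non-integer location is enough to fail the entire pread_terms call. A minimal standalone sketch of that mechanism (the module name, block size value, and the undefined entry are illustrative assumptions, not CouchDB code):

    %% Standalone illustration: the batched read maps an offset calculation
    %% over every location, so one bad entry fails them all.
    -module(batch_read_demo).
    -export([run/0]).

    %% Same constant name as couch_file's ?SIZE_BLOCK; value assumed here.
    -define(SIZE_BLOCK, 4096).

    %% Mirrors the arithmetic that get_pread_locnum/3 performs per location.
    block_offset(Pos) ->
        Pos rem ?SIZE_BLOCK.

    run() ->
        GoodLocs = [0, 4097, 81920],
        BadLocs = GoodLocs ++ [undefined],  %% one stale/garbage location in the batch
        io:format("good: ~p~n", [lists:map(fun block_offset/1, GoodLocs)]),
        %% this map raises badarith, like the lists:map frames in the crash report
        Result = (catch lists:map(fun block_offset/1, BadLocs)),
        io:format("bad: ~p~n", [Result]).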

bdoyle0182 (Author) commented Dec 10, 2020

Current CouchDB: 3.1.1
Previous CouchDB: 2.3.1

Current Erlang: 20.3.8.24
Previous Erlang: 19.2

OS: CentOS Linux release 7.9.2009

File system: overlay

bdoyle0182 (Author)

Just to update: the compaction did complete successfully after being re-triggered, so deleting the compaction files seems like a fine remediation.

nickva (Contributor) commented Dec 11, 2020

@bdoyle0182 makes sense, thanks for confirming. This is probably an upgrade issue: we have upgrade code to handle the main .couch files, but I don't think it necessarily covers the .compact.* files.

@davisp I wonder if we can automatically detect the upgrade scenario and auto-delete or at least ignore the older compaction files when the format is upgraded?

We should update the docs to advise users to complete compactions on the 2.x nodes before they are upgraded to 3.x, or alternatively to delete the .compact.meta and .compact.data files after the upgrade.
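For anyone who hits this before the docs are updated, here is a rough cleanup sketch along those lines, run from an attached Erlang shell on each node, using only stdlib calls (the data directory path is an assumption, adjust to your setup, and compaction should not be running against those shards while you do this):

    DataDir = "/opt/couchdb/data",  %% assumption: adjust to your data directory
    Leftovers = filelib:wildcard(filename:join([DataDir, "shards", "*", "*.compact.*"])),
    [begin io:format("deleting ~s~n", [F]), file:delete(F) end || F <- Leftovers].

Re-triggering compaction afterwards rebuilds the .compact.data and .compact.meta files from scratch, which matches what worked above.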
