
Compaction dies constantly after a certain amount of documents #3571

Closed
schneuwlym opened this issue May 19, 2021 · 10 comments

Comments

@schneuwlym

Description

We have an issue with our CouchDB 3.1.1. We are using the default compaction configuration, and it works fine until the database reaches a certain number of documents (~76K). Then compaction dies and is never able to finish. It is restarted every two seconds and always dies immediately. So far the problem is consistent, and I haven't found any way (other than deleting the database) to fix it.

I read some other compaction-related issues, but in this case only version 3.1.1 was ever used: no upgrade, no migration or anything similar.

What I tried so far:

This is the log, which is repeated every two seconds:

[notice] 2021-05-19T14:42:38.848090Z couchdb@127.0.0.1 <0.460.0> -------- ratio_dbs: adding <<"shards/80000000-ffffffff/directory.1621404274">> to internal compactor queue with priority 2.100073355455779
[info] 2021-05-19T14:42:38.848533Z couchdb@127.0.0.1 <0.5146.0> -------- Starting compaction for db "shards/80000000-ffffffff/directory.1621404274" at 40726
[notice] 2021-05-19T14:42:38.848615Z couchdb@127.0.0.1 <0.460.0> -------- ratio_dbs: Starting compaction for shards/80000000-ffffffff/directory.1621404274 (priority 2.100073355455779)
[notice] 2021-05-19T14:42:38.849705Z couchdb@127.0.0.1 <0.460.0> -------- ratio_dbs: Started compaction for shards/80000000-ffffffff/directory.1621404274
[warning] 2021-05-19T14:42:38.893633Z couchdb@127.0.0.1 <0.460.0> -------- exit for compaction of ["shards/80000000-ffffffff/directory.1621404274"]: {undef,[{math,ceil,[1.6],[]},{couch_emsort,num_merges,2,[{file,"src/couch_emsort.erl"},{line,366}]},{couch_bt_engine_compactor,sort_meta_data,1,[{file,"src/couch_bt_engine_compactor.erl"},{line,508}]},{lists,foldl,3,[{file,"lists.erl"},{line,1263}]},{couch_bt_engine_compactor,start,4,[{file,"src/couch_bt_engine_compactor.erl"},{line,75}]}]}
[error] 2021-05-19T14:42:38.894691Z couchdb@127.0.0.1 emulator -------- Error in process <0.5148.0> on node 'couchdb@127.0.0.1' with exit value:
{undef,[{math,ceil,[1.6],[]},{couch_emsort,num_merges,2,[{file,"src/couch_emsort.erl"},{line,366}]},{couch_bt_engine_compactor,sort_meta_data,1,[{file,"src/couch_bt_engine_compactor.erl"},{line,508}]},{lists,foldl,3,[{file,"lists.erl"},{line,1263}]},{couch_bt_engine_compactor,start,4,[{file,"src/couch_bt_engine_compactor.erl"},{line,75}]}]}

[info] 2021-05-19T14:42:38.894453Z couchdb@127.0.0.1 <0.226.0> -------- db shards/80000000-ffffffff/directory.1621404274 died with reason {undef,[{math,ceil,[1.6],[]},{couch_emsort,num_merges,2,[{file,"src/couch_emsort.erl"},{line,366}]},{couch_bt_engine_compactor,sort_meta_data,1,[{file,"src/couch_bt_engine_compactor.erl"},{line,508}]},{lists,foldl,3,[{file,"lists.erl"},{line,1263}]},{couch_bt_engine_compactor,start,4,[{file,"src/couch_bt_engine_compactor.erl"},{line,75}]}]}

While the problem is occurring, inserting data is still possible, but I often get the following error message (by the way, I'm using python-cloudant):

500 Server Error: Internal Server Error unknown_error undefined for url: http://localhost:5984/directory

Steps to Reproduce

  1. Clean database
  2. Create a script that creates documents in an endless loop (pure JSON, no attachments, just one revision)
  3. After around 76K documents the compactor starts to fail.
  4. Inserts are still possible, but time and again an insert fails with the 500 Server Error shown above

I ran the stress test mentioned above on 3 nodes in parallel. All 3 nodes started to fail at around the same number of documents (70K-80K).

  • In the first node, I created the documents single threaded
  • In the second node, I created the documents using two threads
  • In the third node, I created the documents using four threads

Below is the script I used to reproduce the issue in my setup:

#!/usr/bin/env python

import signal
import sys
from cloudant.client import CouchDB
from cloudant.document import Document
from copy import deepcopy
from threading import Thread


USERNAME = 'admin'
PASSWORD = 'admin'
COUCHDB_URL = 'http://localhost:5984'
DB_NAME = 'directory'


cdb = CouchDB(USERNAME, PASSWORD, url=COUCHDB_URL, connect=True, auto_renew=True)

account_skeletton = { 'parameter 1': 0,
                      'parameter 2': True,
                      'parameter 3': '',
                      'parameter 4': '',
                      'parameter 5': [],
                      'parameter 6': [],
                      'description': '',
                      'enabled': True,
                      'firstname': '',
                      'parameter 7': False,
                      'lastname': '',
                      'parameter 8': '',
                      'number': '',
                      'parameter 9': '9301162291d5a0480270d97d6c4a6da3edd75aa5',
                      'parameter 10': 'cos02',
                      'parameter 11': '112233',
                      'parameter 12': 1620118266.572422,
                      'parameter 13': 0,
                      'parameter 14': 0.0,
                      'parameter 15': False,
                      'parameter 16': 4,
                      'parameter 17': '',
                      'parameter 18': '',
                      'parameter 19': 'user',
                      'userid': '',
                      'parameter 20': '',
                      'parameter 21': '',
                      'parameter 22': True}


if DB_NAME not in cdb.all_dbs():
    cdb.create_database(DB_NAME)


def signal_handler(sig, frame):
    print('You pressed Ctrl+C!')
    sys.exit(0)


def create_documents(start=0, thread_id=0):
    try:
        for i in xrange(start, 999999):  # xrange: this script targets Python 2.7
            # Document ID: thread id prefix followed by a zero-padded counter.
            number = '{}{:06}'.format(thread_id, i)
            print('create_documents: Creating document {}'.format(number))
            with Document(cdb[DB_NAME], number) as document:
                document.update(deepcopy(account_skeletton))
                document['firstname'] = 'FN {}'.format(number)
                document['lastname'] = 'LN {}'.format(number)
                document['number'] = number
                document['userid'] = number
    except Exception as err:
        print('create_documents: {}'.format(err))


def create_documents_threaded(threads=2):
    for i in xrange(threads):
        t = Thread(target=create_documents, args=(0, i))
        t.daemon = True  # daemon threads so Ctrl+C can terminate the script
        t.start()
    
    signal.signal(signal.SIGINT, signal_handler)
    print('Press Ctrl+C')
    signal.pause()

Expected Behaviour

Compaction doesn't fail :-)

Your Environment

  • CouchDB version used:
    {"couchdb":"Welcome","version":"3.1.1","git_sha":"ce596c65d","uuid":"08fb7cd0a10f35f6215a531742f7b356","features":["access-ready","partitioned","pluggable-storage-engines","reshard","scheduler"],"vendor":{"name":"The Apache Software Foundation"}}
  • python-cloudant: 2.14.0
  • python2.7
  • Operating system and version:
    • Our own Linux distribution
  • CouchDB running in a VM
    • Single Core (also changed to 2 cores, no difference)
    • 1 GB RAM (also tried increasing it, no difference)
  • To trigger this issue, I used an isolated node, no replication, no clustering

Additional Context

Below you can find the configuration. Most of it is default:

curl http://admin:admin@localhost:5984/_node/couchdb@127.0.0.1/_config | python -m json.tool
{
    "admins": {
        "admin": "-pbkdf2-d5b128e39ebe61b4f50fb9c2e3241c0ea1bc28f9,6b6e6d21c67f685f753d8fa1fe72db71,10"
    },
    "attachments": {
        "compressible_types": "text/*, application/javascript, application/json, application/xml",
        "compression_level": "8"
    },
    "chttpd": {
        "backlog": "512",
        "bind_address": "0.0.0.0",
        "max_db_number_for_dbs_info_req": "100",
        "port": "5984",
        "prefer_minimal": "Cache-Control, Content-Length, Content-Range, Content-Type, ETag, Server, Transfer-Encoding, Vary",
        "require_valid_user": "false",
        "server_options": "[{recbuf, undefined}]",
        "socket_options": "[{sndbuf, 262144}, {nodelay, true}]"
    },
    "cluster": {
        "n": "3",
        "q": "2"
    },
    "cors": {
        "credentials": "false"
    },
    "couch_httpd_auth": {
        "allow_persistent_cookies": "true",
        "auth_cache_size": "50",
        "authentication_db": "_users",
        "authentication_redirect": "/_utils/session.html",
        "iterations": "10",
        "require_valid_user": "false",
        "secret": "a0ec90afc5f896e3cf90e8c4adc9dafa",
        "timeout": "600"
    },
    "couch_peruser": {
        "database_prefix": "userdb-",
        "delete_dbs": "false",
        "enable": "false"
    },
    "couchdb": {
        "attachment_stream_buffer_size": "4096",
        "changes_doc_ids_optimization_threshold": "100",
        "database_dir": "/var/crypt/couchdb/couchdb",
        "default_engine": "couch",
        "default_security": "everyone",
        "file_compression": "snappy",
        "max_dbs_open": "500",
        "max_document_size": "8000000",
        "os_process_timeout": "5000",
        "single_node": "true",
        "users_db_security_editable": "false",
        "uuid": "08fb7cd0a10f35f6215a531742f7b356",
        "view_index_dir": "/var/crypt/couchdb/couchdb"
    },
    "couchdb_engines": {
        "couch": "couch_bt_engine"
    },
    "csp": {
        "enable": "true"
    },
    "feature_flags": {
        "partitioned||*": "true"
    },
    "httpd": {
        "allow_jsonp": "false",
        "authentication_handlers": "{couch_httpd_auth, cookie_authentication_handler}, {couch_httpd_auth, default_authentication_handler}",
        "bind_address": "127.0.0.1",
        "enable_cors": "false",
        "enable_xframe_options": "false",
        "max_http_request_size": "4294967296",
        "port": "5986",
        "secure_rewrites": "true",
        "socket_options": "[{sndbuf, 262144}]"
    },
    "indexers": {
        "couch_mrview": "true"
    },
    "ioq": {
        "concurrency": "10",
        "ratio": "0.01"
    },
    "ioq.bypass": {
        "compaction": "false",
        "os_process": "true",
        "read": "true",
        "shard_sync": "false",
        "view_update": "true",
        "write": "true"
    },
    "log": {
        "file": "/var/log/couchdb/couchdb.log",
        "level": "info",
        "writer": "file"
    },
    "query_server_config": {
        "os_process_limit": "100",
        "reduce_limit": "true"
    },
    "replicator": {
        "connection_timeout": "30000",
        "http_connections": "20",
        "interval": "60000",
        "max_churn": "20",
        "max_jobs": "500",
        "retries_per_request": "5",
        "socket_options": "[{keepalive, true}, {nodelay, false}]",
        "ssl_certificate_max_depth": "3",
        "startup_jitter": "5000",
        "verify_ssl_certificates": "true",
        "worker_batch_size": "500",
        "worker_processes": "4"
    },
    "smoosh": {
        "db_channels": "upgrade_dbs,ratio_dbs",
        "view_channels": "upgrade_views,ratio_views"
    },
    "ssl": {
        "port": "6984"
    },
    "uuids": {
        "algorithm": "sequential",
        "max_count": "1000"
    },
    "vendor": {
        "name": "The Apache Software Foundation"
    }
}
@oldrich-svec

oldrich-svec commented May 21, 2021

We have a similar issue (Ubuntu Server 20.04, Docker version of CouchDB 3.1.1).

We have a 30 GB database which is being replicated from another machine. Looking at the files, I can see that the database files take something like 100 GB, plus another ~30 GB for the compaction files.

The compaction starts but always dies before it finishes, so the database never gets compacted and the compaction files hang around forever.

I would also add that the source machine (which the replication comes from) runs CouchDB 3.1.0 on Windows Server 2019, and there compaction seems to work just fine.

Some logs:

couchdb-backup-service-01 | [info] 2021-05-21T08:27:57.466850Z nonode@nohost <0.222.0> -------- db shards/80000000-ffffffff/yoda_filesystem.1614838752 died with reason {{badarg,[{erlang,monitor,[process,{main,'clouseau@127.0.0.1'}],[]},{ioq,submit_request,2,[{file,"src/ioq.erl"},{line,187}]},{ioq,maybe_submit_request,1,[{file,"src/ioq.erl"},{line,150}]},{ioq,handle_info,2,[{file,"src/ioq.erl"},{line,123}]},{gen_server,try_dispatch,4,[{file,"gen_server.erl"},{line,616}]},{gen_server,handle_msg,6,[{file,"gen_server.erl"},{line,686}]},{proc_lib,init_p_do_apply,3,[{file,"proc_lib.erl"},{line,247}]}]},{gen_server,call,[ioq,{request,<0.24352.1>,{append_bin,[<<0,0,32,0>>,[<<6,66,160,149,155,150,189,119,66,50,121,52,14,190,9,143...}}
couchdb-backup-service-01 | [warning] 2021-05-21T08:27:57.467050Z nonode@nohost <0.428.0> -------- exit for compaction of ["shards/00000000-7fffffff/yoda_filesystem.1614838752"]: {{badarg,[{erlang,monitor,[process,{main,'clouseau@127.0.0.1'}],[]},{ioq,submit_request,2,[{file,"src/ioq.erl"},{line,187}]},{ioq,maybe_submit_request,1,[{file,"src/ioq.erl"},{line,150}]},{ioq,handle_info,2,[{file,"src/ioq.erl"},{line,123}]},{gen_server,try_dispatch,4,[{file,"gen_server.erl"},{line,616}]},{gen_server,handle_msg,6,[{file,"gen_server.erl"},{line,686}]},{proc_lib,init_p_do_apply,3,[{file,"proc_lib.erl"},{line,247}]}]},{gen_server,call,[ioq,{request,<0.22588.1>,{pread_iolist,40340132370},compaction,<0.22589.1>,undefined},infinity]}}
couchdb-backup-service-01 | [info] 2021-05-21T08:27:57.467190Z nonode@nohost <0.222.0> -------- db shards/00000000-7fffffff/yoda_filesystem.1614838752 died with reason {{badarg,[{erlang,monitor,[process,{main,'clouseau@127.0.0.1'}],[]},{ioq,submit_request,2,[{file,"src/ioq.erl"},{line,187}]},{ioq,maybe_submit_request,1,[{file,"src/ioq.erl"},{line,150}]},{ioq,handle_info,2,[{file,"src/ioq.erl"},{line,123}]},{gen_server,try_dispatch,4,[{file,"gen_server.erl"},{line,616}]},{gen_server,handle_msg,6,[{file,"gen_server.erl"},{line,686}]},{proc_lib,init_p_do_apply,3,[{file,"proc_lib.erl"},{line,247}]}]},{gen_server,call,[ioq,{request,<0.22588.1>,{pread_iolist,40340132370},compaction,<0.22589.1>,undefined},infinity]}}

couchdb-backup-service-01 | [info] 2021-05-21T08:27:57.468479Z nonode@nohost <0.28448.1> -------- Starting compaction for db "shards/80000000-ffffffff/yoda_filesystem.1614838752" at 125776
couchdb-backup-service-01 | [notice] 2021-05-21T08:27:57.468708Z nonode@nohost <0.428.0> -------- ratio_dbs: Started compaction for shards/80000000-ffffffff/yoda_filesystem.1614838752
couchdb-backup-service-01 | [error] 2021-05-21T08:27:57.469504Z nonode@nohost <0.22508.1> -------- gen_server ioq terminated with reason: bad argument in call to erlang:monitor(process, {main,'clouseau@127.0.0.1'}) at ioq:submit_request/2(line:187) <= ioq:maybe_submit_request/1(line:150) <= ioq:handle_info/2(line:123) <= gen_server:try_dispatch/4(line:616) <= gen_server:handle_msg/6(line:686) <= proc_lib:init_p_do_apply/3(line:247)
couchdb-backup-service-01 |   last msg: timeout
couchdb-backup-service-01 |      state: {state,10,0.01,{[{request,{main,'clouseau@127.0.0.1'},{open,<0.28430.1>,<<"shards/80000000-ffffffff/yodadev_features.1614849634/de6fcb4cc39fbd3e0c1d621d441e9057">>,<<"standard">>},other,{<0.28430.1>,#Ref<0.2791382983.2

@AdrianTute

A short addendum to what @schneuwlym wrote.
I was able to completely turn off auto-compaction (smoosh).
After inserting ~70k records, I manually triggered a compaction task.
It crashed at the very end (Fauxton showed ~99% progress) with the error mentioned above.
CPU and memory usage stayed moderate the whole time.

@nickva
Contributor

nickva commented Jun 4, 2021

That {undef,[{math,ceil,[1.6],[]},{couch_emsort,num_merges,2,... error is quite odd. I can't figure out where it is coming from.

I see the ceil call in couch_emsort:num_merges/2:

num_merges(BBChunk, NumBB) when NumBB > BBChunk ->
    RevNumBB = ceil(NumBB / BBChunk),
    FwdNumBB = ceil(RevNumBB / BBChunk),
    2 + num_merges(BBChunk, FwdNumBB).

That ceil/1 is the erlang:ceil BIF (https://github.com/erlang/otp/blob/8b29b1ca870e6b31a0f3da067ebf4b1b4ceaa969/erts/preloaded/src/erlang.erl#L566-L570), which seems to call a C NIF, not the math:ceil that the error indicates. (The [1.6] in the traceback is presumably one of those NumBB / BBChunk ratios being passed to ceil.)

@schneuwlym what version of Erlang are you running? I wonder if there is something related to that. ceil is a fairly new function, available only in Erlang 20+.

@schneuwlym
Author

Hi nickva

Thanks for your reply. We are using version 19.3.

erl -eval '{ok, Version} = file:read_file(filename:join([code:root_dir(), "releases", erlang:system_info(otp_release), "OTP_VERSION"])), io:fwrite(Version), halt().' -noshell
19.3
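
For a quicker check, erlang:system_info(otp_release) prints just the major release (which would be "19" here and is enough to tell 19 apart from 20+), for example:

erl -noshell -eval 'io:format("~s~n", [erlang:system_info(otp_release)]), halt().'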

@nickva
Contributor

nickva commented Jun 5, 2021

@schneuwlym Erlang 19 would explain why you got an undef error there. That ceil/1 function is not present in Erlang 19. Unfortunately, Erlang 19 is no longer supported for CouchDB 3.x releases.

From the error message it seems as if someone had "patched" the CouchDB release to compile on 19.x and replaced the undefined ceil/1 (which would otherwise have prevented compiling on releases < 20.0) with math:ceil/1. However, math:ceil/1 is also not defined in releases < 20.0; we just only find out about that at runtime.

4> catch ceil(1.6).    
{'EXIT',{{shell_undef,ceil,1,[]},
         [{shell,shell_undef,2,[{file,"shell.erl"},{line,1061}]},

5> catch math:ceil(1.6).
{'EXIT',{undef,[{math,ceil,[1.6],[]},
                {erl_eval,do_apply,6,[{file,"erl_eval.erl"},{line,674}]},
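
For what it's worth, the undef could have been avoided without ceil/1 at all: since NumBB and BBChunk are, as far as I can tell, integer chunk counts, the ceiling can be computed with plain integer arithmetic. A minimal sketch of such an OTP-19-compatible variant of the clause quoted above (hypothetical, not what CouchDB ships):

%% Hypothetical helper: ceiling of A/B for positive integers,
%% avoiding both erlang:ceil/1 (OTP 20+) and math:ceil/1 (also OTP 20+).
ceil_div(A, B) when is_integer(A), is_integer(B), B > 0 ->
    (A + B - 1) div B.

num_merges(BBChunk, NumBB) when NumBB > BBChunk ->
    RevNumBB = ceil_div(NumBB, BBChunk),
    FwdNumBB = ceil_div(RevNumBB, BBChunk),
    2 + num_merges(BBChunk, FwdNumBB).

With integer division there is nothing version-specific left in that code path.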

nickva added a commit that referenced this issue Jun 5, 2021
It doesn't really work as we have functionality relying on 20.0+
features. One particular instance is in [1].

Issue: #3571

[1] https://github.com/apache/couchdb/blob/ce596c65d9d7f0bc5d9937bcaf6253b343015690/src/couch/src/couch_emsort.erl#L363-L366
@schneuwlym
Author

Hi nickva, thanks for your reply.

Indeed, it seems our packager patched the source to build CouchDB:

--- a/src/couch/src/couch_emsort.erl.clean      2021-01-14 17:18:40.436549175 +0000
+++ a/src/couch/src/couch_emsort.erl    2021-01-14 17:17:39.128103923 +0000
@@ -133,6 +133,8 @@
 -export([add/2, merge/1, merge/2, sort/1, iter/1, next/1]).
 -export([num_kvs/1, num_merges/1]).

+-import(math, [ceil/1]).
+
 -record(ems, {
     fd,
     root,
--- a/src/couch/rebar.config.script.clean       2021-01-14 17:28:34.570100193 +0000
+++ b/src/couch/rebar.config.script     2021-01-14 19:06:56.523186136 +0000
@@ -107,7 +107,7 @@
         };
     {unix, _} when SMVsn == "1.8.5" ->
         {
-            "-DXP_UNIX -I/usr/include/js -I/usr/local/include/js",
+            "-DXP_UNIX " ++ os:getenv("JS_CFLAGS"),
             "-L/usr/local/lib -lmozjs185 -lm"
         };
     {win32, _} when SMVsn == "60" ->
@@ -164,7 +164,7 @@
 CouchJSEnv = case SMVsn of
     "1.8.5" ->
         [
-            {"CFLAGS", JS_CFLAGS ++ " " ++ CURL_CFLAGS},
+            {"CFLAGS", JS_CFLAGS ++ " " ++ CURL_CFLAGS ++ os:getenv("JS_CFLAGS")},
             {"LDFLAGS", JS_LDFLAGS ++ " " ++ CURL_LDFLAGS}
         ];
     _ ->

But what I don't understand is that the dependency page (https://docs.couchdb.org/en/3.1.1/install/unix.html#dependencies) lists Erlang OTP 19.x as a requirement. Am I misreading the line "Erlang OTP (19.x, 20.x >= 21.3.8.5, 21.x >= 21.2.3, 22.x >= 22.0.5)", or is this information wrong? Since it is comma-separated, I assumed that 19.x is fully supported...

Which Erlang version should we try? 22, or should we already go for the latest, e.g. 24? Since 24 is not mentioned in the list, I guess we should go with 22, right?

Regards
Mathias

@nickva
Contributor

nickva commented Jun 7, 2021

@schneuwlym that was a mistake on our part; we documented it as "soft" supported in the release notes for 3.0: https://docs.couchdb.org/en/3.1.1/whatsnew/3.0.html

19.x - “soft” support only. No longer tested, but should work.

Basically, we're saying we won't go out of our way to break it, but it may break accidentally at some point and we're not testing it. With ceil, I think the idea was that a compilation failure would make it obvious that it won't build. In retrospect, we should have taken a firmer stance and explicitly stated in the documentation at that point that Erlang versions < 20 are not supported.

I already updated the rebar config file to disallow Erlang 19 and will update the dependencies list in the unix.html docs file too.

As for which versions to try: the binary packages we release ship with the latest versions of 20. In production at Cloudant I have seen 20 run for a few years without any issues, so you could pick 20.3.8.26, for example. The downside is that the Erlang developers promise to support only the two versions behind the current one. If that's a concern, perhaps pick the latest patch version of 23 and make sure to periodically check for fixes.
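
For reference, the rebar-side guard mentioned above can be as small as a version regex in rebar.config; a rough sketch (assuming rebar's require_otp_vsn option is available in the bundled rebar, and not necessarily the exact change that was committed):

%% Hypothetical rebar.config entry: refuse to build on OTP releases
%% other than 20-22.
{require_otp_vsn, "20|21|22"}.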

@wohali
Member

wohali commented Jun 7, 2021

Note that 23.x and 24.x are not yet supported in CouchDB 3, unless you are building from the 3.x branch directly. For 3.1.1 you cannot go any newer than 22.x.

See: https://docs.couchdb.org/en/3.1.1/install/unix.html#installation-from-source for the versions supported at the time 3.1.1 was released. We acknowledge 19.x was incorrectly included there, and will support 23.x and 24.x with the forthcoming 3.2 release.

@nickva
Contributor

nickva commented Jun 7, 2021

@wohali good point, thanks for clarifying

@schneuwlym
Author

Hi

Updating Erlang to 22 definitely seems to fix our issue!

Thank you very much for your help!

Best regards
Mathias
