Hanging or 500 errors during concurrent replication #1093
Comments
As Reece noted, while we are currently using 1.6.1 in production, we're intending to move to 2.1.x as soon as our blocking issues (#745 and this, so far) are resolved. Please don't interpret anything here as a request for an updated 1.x version. Cheers!
Have you tested against 1.7.1? 1.6.1 has serious known security issues, as disclosed in multiple places. As you know, it is highly unlikely there will be another 1.x release at this point.
Hi @wohali, I have not yet tested against 1.7.1. Our application is safe on 1.6.1 since we operate an internal CouchDB instance (no outside-world access), but we want to move to 2.1.x as soon as is feasible; see @elistevens's message above. I will get a 1.7.1 environment up and running to test; however, I anticipate we will still see the issues described above, since there are problems with both 1.6.1 and 2.1.1. I'll let you know what I find. I should have clarified in my earlier message: we definitely aren't requesting an additional 1.x release. I included both 1.6.1 and 2.1.1 information because we noticed abnormal behavior on both versions and thought the extra detail could be useful for an investigation. Currently, the blocking issue we are experiencing is that the server crashes during highly concurrent replication. Thanks!
My guess is this is the same issue as in #796. Have you tried increasing your fd limit and max_open_dbs? Are you running all of your databases with the default
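(For reference, a sketch of where the CouchDB-side knob lives; the value below is purely illustrative, not a recommendation. The documented config key is `max_dbs_open`, and the fd limit itself is raised at the OS level, e.g. via ulimit or your service manager, not in this file:)

```ini
; local.ini -- illustrative value only
[couchdb]
max_dbs_open = 5000
```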
When you are seeing this problem, please do a
We increased max open files for CouchDB processes to 1048576; we lowered to
Remember: if you expect to be running with a lot of databases, you may also want to set this in your local.ini:
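```ini
[couchdb]
update_lru_on_read = false
```

This has led to significant performance enhancements on very busy clusters.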
@ReeceStevens I'm unable to run your test case on my test environment (2GB RAM) - Python throws a MemoryError on line 28. Can you provide a script that generates entropy without the use of numpy? Thanks.
@wohali I have modified the script to remove the numpy dependency-- let me know if that works for you. I will be following up on your previous comments soon, thanks for your patience.
Hi there @wohali,

Preface: I only had CouchDB 2.0.0 to work with when trying your suggestions, but I believe the underlying behavior is identical to v2.1.1.

So I first watched message queues while the problem occurred-- I'm not sure what a relatively large message queue size is for CouchDB, but the only ones that rose significantly were

Next, I ran with the
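(A hedged sketch of one way to watch those queues on CouchDB 2.1+, via the `_system` endpoint; the node name and URL below are placeholder assumptions:)

```python
# Sketch: dump CouchDB per-process message queue lengths via the _system
# endpoint (CouchDB 2.1+). Node name and host are placeholders.
import requests

url = "http://localhost:5984/_node/couchdb@localhost/_system"
stats = requests.get(url).json()

for name, size in stats.get("message_queues", {}).items():
    # Some entries are plain integers, others are {"count": ..., ...} dicts.
    count = size.get("count") if isinstance(size, dict) else size
    if count:
        print(f"{name}: {count}")
```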
Thanks @ReeceStevens. So far, I can't reproduce with your script. 10 replications are taking 5-6 seconds to process with an upper limit of ~41s or so. I tried increasing the number of iterations and threads to no avail. My test machine is 2GB, 1 CPU @ 3.3GHz, SSD. Are you testing on HDD there? I can try and rig up a test on a slower disk.
@wohali I had initially been testing on a CouchDB server under load when I used 10 threads. If you're using an otherwise-unloaded database, you can up the thread count to generate the error. If you set
And yes, the server we are testing is using an HDD.
I can't get a replication on that test machine even using

I'll have to track down an HDD and try again tomorrow.
Ahh, that is unfortunate. Perhaps disk latency is playing a role in all of this. Thanks-- definitely let me know if the repro script doesn't reproduce the issue on an HDD either.
Hi @ReeceStevens, from my knowledge of your deployment, you are running on eCryptFS, right? Can you try the same test again on the same hardware, but using a faster filesystem like
Hi @wohali -- unfortunately, at-rest encryption is a hard requirement of our application, so we need to use eCryptFS. Is there anything else we can try?
@ReeceStevens Yup, I'm aware, I just mean for testing. I haven't been able to reproduce the bug, so I was hoping you might have a test system on the same hardware you can test against that doesn't use eCryptFS. There's another bug filed on here that I still believe might be disk latency related, and I'm trying to eliminate possibilities.
@wohali sorry for the silence, I've been trying to get a test environment together for you where I can replicate the issue. I'm testing on a Linux machine with a slow spinning disk, filesystem
@ReeceStevens In testing together, it looks like the candidate fix for #745 has resolved this problem. I'm going to tentatively close this issue out. If you have any recurrence of this issue while testing, please let me know.
1. According to CouchDB's docs [2] and a CouchDB maintainer [1]:

   > If you expect to be running with a lot of databases, you may also
   > want to set this in your local.ini:
   >
   > [couchdb]
   > update_lru_on_read = false
   >
   > This has led to significant performance enhancements on very busy
   > clusters.

2. This new conf (`update_lru_on_read = false`) is the default since version 2.2.

As the `restore` operation can fail a lot due to performance issues, I propose to set this option. It does no harm and could improve performance for people using coucharchive with CouchDB versions < 2.2.

[1]: apache/couchdb#1093 (comment)
[2]: http://docs.couchdb.org/en/2.2.0/whatsnew/2.2.html#id1
Expected Behavior
If multiple replications are occurring in different processes, they should not cause the database to return a 500 error (in the case of CouchDB 2.1.1) or hang for extended periods of time (CouchDB 1.6.1).
Current Behavior
I apologize in advance for the length of some of these log snippets-- I am including the entirety of lines beginning with `[error]`, which can sometimes be quite a bit of information.

CouchDB 2.1.1
If multiple threads are triggering replications at the same time, an occasional server 500 error will occur. This is more likely to occur if all replications are referring to the same source database.
Logs show the following message:
and occasionally:
CouchDB 1.6.1
If multiple threads are triggering replications at the same time, the server will occasionally stall for long periods and occasionally return timeout errors. "Long periods of time" here means longer than the total time it would take if each thread ran its replication synchronously, one after another.
Verbose logging shows the following message when the error occurs, then locks up:
Eventually, there is a request timeout error. This can cause execution time to reach over 1.5x the synchronous upper limit, and sometimes substantially longer. Occasionally, this can also cause CouchDB 1.6.1 to completely freeze with no log output or other indications of activity. When a crash occurs, the log output is:
Possible Solution
Based on the fact that the issue only occurs intermittently and is more likely to show up with more threads, I am inclined to think this is a race condition. It also seems to happen more frequently when the replications share the same source database, which might indicate that the race involves reading the source database. That is only a hunch, however.
Steps to Reproduce (for bugs)
A script to replicate the issue is here: https://gist.github.com/ReeceStevens/35d2cb06f820d3f054c6ff8dc226ef17
Run it against a CouchDB instance at localhost:5984:

python test_parallel_replication.py --threads 10 --iterations 40 --single-source

(we have had the most luck replicating the issue with these parameters)
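For readers who don't want to open the gist, the core of the reproduction is roughly the sketch below -- not the gist verbatim; the database names, thread count, and `create_target` flag are illustrative assumptions. Each thread POSTs to CouchDB's /_replicate endpoint, all pulling from one source database:

```python
# Sketch: trigger many concurrent replications against one CouchDB node.
import threading
import requests

COUCH = "http://localhost:5984"
SOURCE = "repro_source"                        # hypothetical database name
TARGETS = [f"repro_target_{i}" for i in range(10)]

def replicate(target):
    # POST /_replicate blocks until a non-continuous replication finishes,
    # so each thread holds one HTTP request open for the duration.
    resp = requests.post(
        f"{COUCH}/_replicate",
        json={
            "source": f"{COUCH}/{SOURCE}",
            "target": f"{COUCH}/{target}",
            "create_target": True,
        },
    )
    print(target, resp.status_code)  # 500s appear here when the bug hits

threads = [threading.Thread(target=replicate, args=(t,)) for t in TARGETS]
for t in threads:
    t.start()
for t in threads:
    t.join()
```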
Context
We are using CouchDB 1.6.1; during product testing, we often perform simultaneous replications of large documents. We see about 3 or 4 test failures every run related to a hanging replication process-- they are not the same tests failing each time, and when run in isolation they pass. We did not see the same hanging behavior in CouchDB 2.1.1, but we are blocked from moving to that version by #745. I believe we did not see this issue in 2.1.1 because it fails much more quickly there and we retry requests on failure.
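(A hedged sketch of what such retry-on-failure logic might look like; `replicate_with_retry` is a hypothetical helper name, and the attempt count and backoff values are illustrative, not our actual harness:)

```python
# Sketch: retry a replication request on 5xx responses, waiting a little
# longer after each failed attempt.
import time
import requests

def replicate_with_retry(payload, url="http://localhost:5984/_replicate",
                         attempts=3, backoff=2.0):
    for attempt in range(attempts):
        resp = requests.post(url, json=payload)
        if resp.status_code < 500:
            return resp
        time.sleep(backoff * (attempt + 1))  # linear backoff between retries
    return resp
```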
Your Environment