[Distributed Index] Index writers intermittently fail to register with CBGT #1392
@sethrosetter was seeing this in performance tests as well.
I just reproduced this while testing the changes in couchbaselabs/mobile-testkit#192.
The test code waited 60 seconds for the pindexes to get spread among the index writers before giving up. The test queries the CBGT REST API via Sync Gateway node 6029f3e7b24e155f, but when I manually queried node c31eb50eea9c5a83 I got the same response. @steveyen I think I'm going to need your help debugging CBGT on this one. Btw, we're using CBGT commit f654301a739cfe122fc6ca7f8e3368d654db5503 from December 18, 2015.
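The actual wait loop lives in the mobile-testkit repo; below is a rough Go sketch of what it does, assuming cbgt's /api/cfg REST endpoint and its lowerCamelCase JSON field names. The node URL, poll interval, and "spread" predicate here are illustrative, not the testkit's exact check.

```go
package main

import (
	"encoding/json"
	"fmt"
	"net/http"
	"time"
)

// cfgResponse models just the part of cbgt's /api/cfg JSON this check
// needs: the plan's pindex -> nodes assignments.
type cfgResponse struct {
	PlanPIndexes struct {
		PlanPIndexes map[string]struct {
			Nodes map[string]json.RawMessage `json:"nodes"`
		} `json:"planPIndexes"`
	} `json:"planPIndexes"`
}

// countWriters returns how many distinct node UUIDs appear in the plan.
func countWriters(nodeURL string) (int, error) {
	resp, err := http.Get(nodeURL + "/api/cfg")
	if err != nil {
		return 0, err
	}
	defer resp.Body.Close()
	var cfg cfgResponse
	if err := json.NewDecoder(resp.Body).Decode(&cfg); err != nil {
		return 0, err
	}
	writers := map[string]bool{}
	for _, pp := range cfg.PlanPIndexes.PlanPIndexes {
		for nodeUUID := range pp.Nodes {
			writers[nodeUUID] = true
		}
	}
	return len(writers), nil
}

// waitForPIndexSpread polls until the pindexes are spread across at
// least wantWriters nodes, mirroring the test's 60-second wait.
func waitForPIndexSpread(nodeURL string, wantWriters int, timeout time.Duration) error {
	deadline := time.Now().Add(timeout)
	for time.Now().Before(deadline) {
		if n, err := countWriters(nodeURL); err == nil && n >= wantWriters {
			return nil
		}
		time.Sleep(2 * time.Second)
	}
	return fmt.Errorf("pindexes not spread across %d writers within %v", wantWriters, timeout)
}
```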
The issue looks like it's easily reproducible: the second time I ran it I got the same error.
On the main issue... From the "CBGT CFG output", only a single node is listed there, so I probably need to sit down (virtually) with @tleyden tomorrow to understand this test better.

On a side issue... In the "CBGT diag output", somehow there was junk in that file that made it non-JSON (not JSON-parsable), which I manually cleaned up and gist'ed here: https://gist.github.com/steveyen/92bd56519c5d2cadfb87

I think it wasn't valid JSON output because sync-gateway's PIndex implementation seems to not implement the Dest.Stats() method correctly: https://godoc.org/github.com/couchbase/cbgt#Dest

I see the sync-gateway implementation looks like... Instead, here's an example "bare minimum" no-op implementation: https://github.com/couchbase/cbgt/blob/master/pindex_impl_blackhole.go#L124
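For reference, a minimal no-op Stats() in the spirit of the linked blackhole example could look like the sketch below. MyDest is a hypothetical stand-in for sync-gateway's Dest type; the one hard requirement is that the method writes valid JSON to the writer, since the diag output embeds it verbatim.

```go
package main

import "io"

// MyDest is a hypothetical stand-in for sync-gateway's cbgt.Dest implementation.
type MyDest struct{}

// Stats must write valid JSON to w; cbgt's diag output embeds it verbatim,
// so writing nothing (or non-JSON) corrupts the whole diag document.
func (d *MyDest) Stats(w io.Writer) error {
	_, err := w.Write([]byte("null")) // smallest valid JSON value, per the blackhole example
	return err
}
```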
Some other thoughts, btw, in case you wanted to diagnose a little before we can hook up (or we can do this together).
Given that, my next step would be to check that they both have the same view of the cfg.
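A quick way to do that check, sketched in Go against cbgt's /api/cfg endpoint; the node URLs are hypothetical, and a byte-for-byte comparison is only a crude first pass.

```go
package main

import (
	"bytes"
	"fmt"
	"io/ioutil"
	"net/http"
)

// fetchCfg grabs one node's view of the cfg via cbgt's REST API.
func fetchCfg(nodeURL string) ([]byte, error) {
	resp, err := http.Get(nodeURL + "/api/cfg")
	if err != nil {
		return nil, err
	}
	defer resp.Body.Close()
	return ioutil.ReadAll(resp.Body)
}

func main() {
	// Hypothetical admin URLs for the two Sync Gateway nodes.
	cfgA, errA := fetchCfg("http://sg1:4985")
	cfgB, errB := fetchCfg("http://sg2:4985")
	if errA != nil || errB != nil {
		fmt.Println("fetch failed:", errA, errB)
		return
	}
	if bytes.Equal(cfgA, cfgB) {
		fmt.Println("both nodes report the same cfg")
	} else {
		fmt.Println("cfg views differ; diff the two JSON payloads")
	}
}
```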
Reproduced with valid diag:
Unexpected errors:
Current theory:
Thanks for finding this! A cbgt fix is up for review here: http://review.couchbase.org/#/c/58359/1
Also, a branch is pushed up here for testing if you need it: https://github.com/steveyen/cbgt/tree/fix-CfgCB-no-caching
@steveyen still seeing this error:
Full SG logs: https://gist.github.com/tleyden/f7265c748f247351499f
@steveyen after doing a few more tests with the latest experimental cbgt commit I saw this situation happen once: https://gist.github.com/tleyden/1c506f3f6de8c5ffd885
Sync Gw logs:
I guess the only noteworthy thing is that the Sync Gw that didn't get any pindexes had this log message:
but the Sync Gateway process didn't immediately exit in this case, which might have something to do with the Sync Gateway code.
Thanks @tleyden - in that last one, it looks like both nodes made it into nodeDefsKnown, but only one node made it into nodeDefsWanted, which explains the unbalanced pindex assignments. The KEY_ENOENT response code is surprising, so I'll have to come up with some explanation for that. I have a theory, though, for the previous experiment where you reported the KEY_EEXISTS error response, and am working on a solution.
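For diagnosing this kind of imbalance, the known/wanted node-def sets can be diffed directly from the cfg. This sketch assumes cbgt's CfgGetNodeDefs helper and NODE_DEFS_* constants, with cfg construction elided; the printout is illustrative.

```go
package main

import (
	"fmt"

	"github.com/couchbase/cbgt"
)

// reportUnwantedNodes prints any node that registered as "known" but
// never made it into "wanted" - such a node gets no pindexes assigned.
func reportUnwantedNodes(cfg cbgt.Cfg) error {
	known, _, err := cbgt.CfgGetNodeDefs(cfg, cbgt.NODE_DEFS_KNOWN)
	if err != nil {
		return err
	}
	wanted, _, err := cbgt.CfgGetNodeDefs(cfg, cbgt.NODE_DEFS_WANTED)
	if err != nil {
		return err
	}
	if known == nil || wanted == nil {
		return fmt.Errorf("node defs not found in cfg")
	}
	for uuid := range known.NodeDefs {
		if _, ok := wanted.NodeDefs[uuid]; !ok {
			fmt.Println("node known but not wanted:", uuid)
		}
	}
	return nil
}
```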
Hi @tleyden - on the KEY_ENOENT, that doesn't actually look like a cbgt-related error message, and I couldn't find it by grep'ing the cbgt codebase. Perhaps it's a sync-gateway-related error message? I see it comes from go-couchbase: https://github.com/couchbase/go-couchbase/blob/master/util/viewmgmt.go#L34 Relatedly, IIRC, cbgt doesn't use views.
Hi @tleyden - could you give https://github.com/steveyen/cbgt/tree/fix-CfgCB-no-caching a retry/retest?
@steveyen the first try passed! Running it in a loop...
OK, retesting in progress.
Here are the results for 9450dbf73a9b577ce9a7e0d273f67cfa24a4375e; it looks like one of the Sync Gateways exited on startup: https://gist.github.com/tleyden/e892996c7b7d89b07fca I'm assuming the ... Here is the cbgt diag output for the SG node that didn't exit:
^^ updated previous comment to add the cbgt diag output
Thanks @tleyden - I see another bug in my attempted bug fix. Another fix coming shortly.
Hi @tleyden
Tested cbgt commit 66190973e1311b5a06fc173b64765292126c1051; it looks like both nodes still aren't in nodeDefsKnown: https://gist.github.com/tleyden/5e49cb5ef6cbd427eb93 SG logs:
^^ updated comment with logs
@steveyen actually, in this last test I think the pindexes are correctly partitioned; it's just that nodeDefsKnown isn't accurate. I am going to change the validation to look at the pindex distribution rather than nodeDefsKnown, unless you have any major objections.
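A sketch of that revised validation, assuming cbgt's CfgGetPlanPIndexes helper and plan structs; the "spread" threshold is this sketch's own choice, not the test's exact criterion.

```go
package main

import (
	"fmt"

	"github.com/couchbase/cbgt"
)

// pindexCountsByNode counts pindexes per node straight from the plan,
// instead of trusting nodeDefsKnown.
func pindexCountsByNode(cfg cbgt.Cfg) (map[string]int, error) {
	plan, _, err := cbgt.CfgGetPlanPIndexes(cfg)
	if err != nil || plan == nil {
		return nil, fmt.Errorf("no planPIndexes in cfg: %v", err)
	}
	counts := map[string]int{}
	for _, pp := range plan.PlanPIndexes {
		for nodeUUID := range pp.Nodes {
			counts[nodeUUID]++
		}
	}
	return counts, nil
}

// isSpread treats the distribution as balanced enough if every index
// writer holds at least one pindex; stricter checks could compare
// min/max counts per node.
func isSpread(counts map[string]int, numWriters int) bool {
	return len(counts) >= numWriters
}
```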
@steveyen indeed, changing the verification code to ignore nodeDefsKnown ... If you make any further fixes regarding nodeDefsKnown ...
@steveyen actually, I take it back! Shortly after writing #1392 (comment), I saw another case with an unbalanced pindex distribution.
Diag:
SG logs:
I added a functional test running under Jenkins to detect this error. |
Note to self: steps to re-run:
1. Get sync gw code (one time)
2. Update cbgt submodule
3. Build on Jenkins
4. Kick off functional test
Not sure why, but the next clue to follow up on (for me) is that on one of the nodes, halfway through one of the logs (sg1.log), the nodedef got unregistered or disappeared from nodeDefs-wanted...
In here: https://gist.github.com/tleyden/7aa381512ca9d1be20cd
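One way to catch that disappearance with a timestamp is to watch the cfg key itself. This sketch assumes cbgt's Cfg.Subscribe, CfgEvent, and CfgNodeDefsKey as I recall them; it just logs the wanted set on every change.

```go
package main

import (
	"log"

	"github.com/couchbase/cbgt"
)

// watchNodeDefsWanted logs the membership of nodeDefs-wanted every
// time the key changes, so a disappearing node shows up in the log
// with a timestamp.
func watchNodeDefsWanted(cfg cbgt.Cfg) error {
	key := cbgt.CfgNodeDefsKey(cbgt.NODE_DEFS_WANTED)
	ch := make(chan cbgt.CfgEvent)
	if err := cfg.Subscribe(key, ch); err != nil {
		return err
	}
	go func() {
		for ev := range ch {
			// Re-read the node defs on every change event.
			wanted, _, err := cbgt.CfgGetNodeDefs(cfg, cbgt.NODE_DEFS_WANTED)
			if err != nil || wanted == nil {
				log.Printf("cfg event %v, err: %v", ev, err)
				continue
			}
			uuids := make([]string, 0, len(wanted.NodeDefs))
			for uuid := range wanted.NodeDefs {
				uuids = append(uuids, uuid)
			}
			log.Printf("nodeDefs-wanted now: %v", uuids)
		}
	}()
	return nil
}
```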
See couchbase/sync_gateway#1392 In a cluster using cfg_cb, there can be startup races where multiple nodes are registering, and overwrite each other because (previously) the CAS implementation was a local-only cfgMem cache. This change implements CAS correctly via server-side Couchbase, at the tradeoff of performance in order to have more correctness. Change-Id: I9aeed7ba0a17f98a91203ba68935515f768af2e3
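The race that commit describes, restated against cbgt's Cfg interface: two nodes Get the same value and CAS, both modify it, and both Set. With server-side CAS, the losing Set fails and must re-read and retry; with the old local-only cfgMem CAS cache, both Sets could "succeed" and one node's registration was silently overwritten. A hedged sketch of the retry pattern follows; the mutate function and retry bound are illustrative.

```go
package main

import (
	"fmt"

	"github.com/couchbase/cbgt"
)

// casRetrySet applies mutate to the value at key using optimistic CAS:
// re-read and retry whenever another node updated the key in between.
func casRetrySet(cfg cbgt.Cfg, key string, mutate func([]byte) []byte) error {
	for attempt := 0; attempt < 100; attempt++ {
		val, cas, err := cfg.Get(key, 0)
		if err != nil {
			return err
		}
		if _, err := cfg.Set(key, mutate(val), cas); err == nil {
			return nil // our write won; no concurrent update slipped in
		}
		// Set failed, e.g. a CAS mismatch because another node wrote
		// between our Get and Set; loop and re-read. A production
		// version would distinguish CAS mismatches from other errors.
	}
	return fmt.Errorf("casRetrySet: too much contention on %s", key)
}
```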
Tested cbgt commit f8ec14652230214ac01ed6c67a689d1de7c0fa79 locally via:
Got the following cbgt diag output:
Since I didn't capture the SG logs in the previous run, I did another run on cbgt commit f8ec14652230214ac01ed6c67a689d1de7c0fa79, this time with the SG logs:
Tested cbgt commit 77314d030a687affd1925bb603f1f35fa2188b5d:
@steveyen for this last test, I was surprised that there was nothing in the SG logs that said anything about adding pindexes.
Testing cbgt commit 63e659a47b9058d3ef825867ba4e787c87f79c75:
Tested cbgt commit f01d8cd4500302f73e955060536610a52a0d68bd, got a fatal error initializing cbgt:
Full logs: https://gist.github.com/tleyden/c334446a104be58f06cb#file-sg1-logs-L128
Testing cbgt commit ece3eb8a68fc584cd59dd8f8552c4c75e670b0b0. So far, 31/31 runs have passed.
See couchbase/sync_gateway#1392 In a cluster using cfg_cb, there can be startup races where multiple nodes are registering, and overwrite each other because (previously) the CAS implementation was a local-only cfgMem cache. This change implements CAS correctly via server-side Couchbase, at the tradeoff of performance in order to have more correctness. Change-Id: I9aeed7ba0a17f98a91203ba68935515f768af2e3 Reviewed-on: http://review.couchbase.org/58359 Reviewed-by: Steve Yen <steve.yen@gmail.com> Tested-by: Steve Yen <steve.yen@gmail.com>
Pushed that set of cbgt commits to master, ending here: couchbase/cbgt@06193ca
I was running a perf scenario targeting 2 index readers / 2 index writers (2 IR / 2 IW).
Couchbase Sync Gateway/feature/distributed_index(807177a)
The run parameters were 10k pushers / 10k pullers for the entire cluster.
I kicked it off using the following command:
Once the test started, nothing was being written to the index bucket, and I saw the following error in the logs:
https://gist.github.com/sethrosetter/780ccff52cf7ac5852e4#file-gistfile1-txt-L13
Full logs here - https://s3.amazonaws.com/sync-gateway-logs/snapshot-2015-12-31-15-38-57-sglogs.zip
Unfortunately, this is not reproducible every time. When I reran the test, all of the writers registered correctly.