
[Distributed Index] Index writers intermittently fail to register with CBGT #1392

Closed
sethrosetter opened this issue Jan 1, 2016 · 40 comments

@sethrosetter
Contributor

I was running a perf scenario targeting 2 IR / 2 IW
Couchbase Sync Gateway/feature/distributed_index(807177a)

The run parameters were 10k pushers / 10k pullers for the entire cluster.

I kicked it off using the following command

python performance_tests/run_tests.py --use-gateload --gen-gateload-config --reset-sync-gw --number-pushers=5000 --number-pullers=5000 --test-id="10k-10k-2IR-2IW"

Once the test started, nothing was being written to the index bucket, and I saw the following error in the logs:
https://gist.github.com/sethrosetter/780ccff52cf7ac5852e4#file-gistfile1-txt-L13

Full logs here - https://s3.amazonaws.com/sync-gateway-logs/snapshot-2015-12-31-15-38-57-sglogs.zip

Unfortunately, this is not reproducible every time. When I reran the test, all of the writers registered correctly.

@sethrosetter sethrosetter changed the title [Distributed Index] Index writers fail to register with CBGT [Distributed Index] Index writers intermittently fail to register with CBGT Jan 1, 2016
@adamcfraser adamcfraser added the bug label Jan 4, 2016
@adamcfraser adamcfraser added this to the 1.3 milestone Jan 4, 2016
@tleyden tleyden modified the milestones: 1.2.0, 1.3 Jan 5, 2016
@tleyden
Contributor

tleyden commented Jan 5, 2016

@sethrosetter was seeing this in performance tests as well.

@tleyden
Contributor

tleyden commented Jan 7, 2016

I just reproduced this while testing the changes in couchbaselabs/mobile-testkit#192.

The test code waited 60 seconds for the pindexes to get spread among the index writers before giving up.

The test queries the CBGT REST API via Sync Gateway node 6029f3e7b24e155f, but when I manually queried node c31eb50eea9c5a83 I got the same response.

@steveyen I think I'm going to need your help debugging CBGT on this one. Btw we're using CBGT commit f654301a739cfe122fc6ca7f8e3368d654db5503 from December 18, 2015.

@tleyden
Contributor

tleyden commented Jan 7, 2016

The issue looks like it's easily reproducible: the second time I ran it I got the same error.

python performance_tests/run_tests.py --number-pullers 1000 --number-pushers 1000 --use-gateload --gen-gateload-config --reset-sync-gw --test-id="perftest"

@steveyen
Member

steveyen commented Jan 7, 2016

On the main issue...

From the "CBGT CFG output", only a single node is listed there, so probably need to sit down (virtually) with @tleyden tomorrow to understand this test better.

On a side issue...

The "CBGT diag output", somehow there was junk in that file that made it not-JSON (not json parsable), which I manually cleaned up and gist'ed here...

https://gist.github.com/steveyen/92bd56519c5d2cadfb87

I think it wasn't valid JSON output because sync-gateway's PIndex implementation doesn't seem to implement the Dest.Stats() method correctly...

https://godoc.org/github.com/couchbase/cbgt#Dest

I see the sync-gateway implementation looks like...

https://github.com/couchbase/sync_gateway/blob/master/src/github.com/couchbase/sync_gateway/base/sgw_pindex.go#L352

Instead, here's an example "bare minimum" no-op implementation...

https://github.com/couchbase/cbgt/blob/master/pindex_impl_blackhole.go#L124
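
For what it's worth, here is a rough sketch of what such a bare-minimum Stats() could look like on the Sync Gateway side, assuming cbgt's `Stats(io.Writer) error` signature on the Dest interface; the `SyncGatewayPIndex` type and package names are placeholders, not the actual sgw_pindex.go code:

```go
package base

import "io"

// SyncGatewayPIndex is a stand-in for Sync Gateway's Dest implementation
// in sgw_pindex.go; the real type has more fields and methods.
type SyncGatewayPIndex struct{}

// Stats writes this pindex's stats as JSON. A bare-minimum, no-op version
// (modeled on cbgt's blackhole pindex) just emits valid JSON so that the
// aggregated cbgt diag output stays parsable.
func (s *SyncGatewayPIndex) Stats(w io.Writer) error {
	_, err := w.Write([]byte("{}"))
	return err
}
```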

@steveyen
Member

steveyen commented Jan 7, 2016

Some other thoughts, btw, in case you want to diagnose a little before we can hook up (or we can do this together).

  • each node has its own diag output, so if you grab cbgt diag from each node, that would help.
  • one thing I would double check is to make sure the nodes are pointed at the same cfg backend and can both successfully access that shared cfg database/bucket.

Given that, my next step would be to check that they both have the same view of the cfg.
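
One way to do that check, as a rough sketch only: open the shared cfg bucket directly and dump which node UUIDs each node-defs record contains. This assumes cbgt's NewCfgCB, CfgGetNodeDefs, and NODE_DEFS_* helpers (names and signatures may differ by cbgt version), and the server URL / bucket name are placeholders:

```go
package main

import (
	"fmt"
	"log"

	"github.com/couchbase/cbgt"
)

func main() {
	// Open the same Couchbase-backed cfg that the Sync Gateway nodes use.
	cfg, err := cbgt.NewCfgCB("http://couchbase-server:8091", "index-bucket")
	if err != nil {
		log.Fatal(err)
	}

	// Dump the node UUIDs registered under "known" vs "wanted"; a node
	// missing from either never registered (or was clobbered) in the cfg.
	for _, kind := range []string{cbgt.NODE_DEFS_KNOWN, cbgt.NODE_DEFS_WANTED} {
		nodeDefs, _, err := cbgt.CfgGetNodeDefs(cfg, kind)
		if err != nil || nodeDefs == nil {
			log.Printf("no node defs for kind %q: %v", kind, err)
			continue
		}
		for uuid := range nodeDefs.NodeDefs {
			fmt.Printf("kind=%s node=%s\n", kind, uuid)
		}
	}
}
```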

@tleyden
Contributor

tleyden commented Jan 7, 2016

@tleyden
Contributor

tleyden commented Jan 7, 2016

Unexpected errors:

@tleyden
Contributor

tleyden commented Jan 7, 2016

Current theory:

  • There is a race condition in CBGT that happens when Sync Gateways are started up at the exact same time
  • Each CBGT pulls its CFG from Couchbase Server and caches it locally
  • Each CBGT writes itself to the NodeDefsKnown
  • Each CBGT does a non-CAS Set to Couchbase Server
  • One CBGT clobbers the other one, which becomes "lost" (sketched in the code below)
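
To make the clobbering concrete, here is a rough sketch. The Cfg interface below only mirrors the shape of cbgt's Get/Set-with-CAS methods and the merge logic is simplified, so none of this is the actual cbgt code:

```go
package sketch

import "fmt"

// Cfg mirrors the shape of cbgt's Cfg Get/Set methods (simplified sketch).
type Cfg interface {
	Get(key string, cas uint64) (val []byte, casOut uint64, err error)
	Set(key string, val []byte, cas uint64) (casOut uint64, err error)
}

// registerRacy shows the clobbering behavior: each node reads nodeDefsKnown,
// adds itself, then writes with cas=0 (no compare-and-swap), so two nodes
// starting at the same moment can overwrite each other's registration.
func registerRacy(cfg Cfg, key string, addSelf func([]byte) []byte) error {
	val, _, err := cfg.Get(key, 0)
	if err != nil {
		return err
	}
	_, err = cfg.Set(key, addSelf(val), 0) // last writer wins; the other node is "lost"
	return err
}

// registerCAS retries with the CAS value returned by Get, so a concurrent
// update by the other node shows up as a CAS mismatch and we re-read and
// merge instead of clobbering. (Real code would only retry on a CAS-mismatch
// error; here any Set failure is assumed to be one.)
func registerCAS(cfg Cfg, key string, addSelf func([]byte) []byte) error {
	for attempt := 0; attempt < 100; attempt++ {
		val, cas, err := cfg.Get(key, 0)
		if err != nil {
			return err
		}
		if _, err := cfg.Set(key, addSelf(val), cas); err == nil {
			return nil
		}
		// assume the Set failure was a CAS mismatch; retry with a fresh read
	}
	return fmt.Errorf("registerCAS: too many CAS retries")
}
```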

@steveyen
Member

steveyen commented Jan 8, 2016

Thanks for finding this!

cbgt fix up for review here: http://review.couchbase.org/#/c/58359/1

@steveyen
Member

steveyen commented Jan 8, 2016

Also, a branch pushed up here for testing if you need: https://github.com/steveyen/cbgt/tree/fix-CfgCB-no-caching

@tleyden
Contributor

tleyden commented Jan 8, 2016

@steveyen still seeing this error:

2016/01/08 01:44:15 MCResponse status=KEY_EEXISTS, opcode=SET, opaque=0, msg: Data exists for key

Full SG logs: https://gist.github.com/tleyden/f7265c748f247351499f

@tleyden
Contributor

tleyden commented Jan 8, 2016

@steveyen after doing a few more tests with the latest experimental cbgt commit I saw this situation happen once:

https://gist.github.com/tleyden/1c506f3f6de8c5ffd885

  • Both nodes are in nodeDefsKnown
  • All pindexes are assigned to a single node (even after the test waited approximately 60 seconds)

Sync Gw logs:

I guess the only noteworthy thing is that the Sync Gw that didn't get any pindexes had this log message:

2016/01/08 02:19:45 Error checking view version: MCResponse status=KEY_ENOENT, opcode=GET, opaque=0, msg: Not found

but the Sync Gateway process didn't immediately exit in this case, which might have something to do with the Sync Gateway code.

@steveyen
Member

steveyen commented Jan 8, 2016

Thanks @tleyden - in that last one, looks like both nodes made it into nodeDefsKnown, but only one node made it into nodeDefsWanted, which explains the unbalanced pindex assignments.

The KEY_ENOENT response code is surprising, so I'll have to come up with some explanation for that.

I have a theory, though, for the previous experiment where you reported the KEY_EEXISTS error response, and am in the process of thinking through a solution.

@steveyen
Member

steveyen commented Jan 8, 2016

Hi @tleyden - on the KEY_ENOENT, that actually doesn't look like a cbgt-related error message, and I couldn't find it by grep'ing the cbgt codebase. Perhaps it's a sync-gateway-related error message? I see it comes from go-couchbase...

https://github.com/couchbase/go-couchbase/blob/master/util/viewmgmt.go#L34

Related, iirc, cbgt doesn't use views.

@steveyen
Member

steveyen commented Jan 8, 2016

Hi @tleyden
I pushed an additional commit up to the branch...

https://github.com/steveyen/cbgt/tree/fix-CfgCB-no-caching
steveyen/cbgt@9f80823

Could you give that a retry/retest?
Thanks,
Steve

@tleyden
Contributor

tleyden commented Jan 8, 2016

@steveyen the first try passed! running it in a loop ...

@tleyden
Contributor

tleyden commented Jan 8, 2016

Ok retesting in progress.

@tleyden
Contributor

tleyden commented Jan 8, 2016

Here are the results for 9450dbf73a9b577ce9a7e0d273f67cfa24a4375e; it looks like one of the Sync Gateways exited on startup:

https://gist.github.com/tleyden/e892996c7b7d89b07fca

I'm assuming the 2016/01/08 16:27:57 MCResponse status=KEY_EEXISTS, opcode=SET, opaque=0, msg: Data exists for key message is coming from CBGT, based on the format of the date string. I'm not sure what is causing it to exit; I'm assuming there is some Sync Gateway code that calls exit, and will try to track that down and improve the logging.

Here is the cbgt diag output for the SG node that didn't exit:

https://gist.github.com/9b5f39a1d89fc55eb775

@tleyden
Contributor

tleyden commented Jan 8, 2016

^^ updated previous comment to add the cbgt diag output

@steveyen
Member

steveyen commented Jan 8, 2016

Thanks @tleyden - I see another bug in my attempted bug fix. Another fix coming shortly.

@steveyen
Member

steveyen commented Jan 8, 2016

Hi @tleyden
Yet another fix just pushed up to...

https://github.com/steveyen/cbgt/tree/fix-CfgCB-no-caching

steveyen/cbgt@6619097

@tleyden
Contributor

tleyden commented Jan 8, 2016

Tested cbgt commit 66190973e1311b5a06fc173b64765292126c1051; it looks like both nodes are still not in nodeDefsKnown:

https://gist.github.com/tleyden/5e49cb5ef6cbd427eb93

SG logs:

https://gist.github.com/tleyden/c85c54b8fe4bbef9ee04

@tleyden
Contributor

tleyden commented Jan 8, 2016

^^ updated comment with logs

@tleyden
Contributor

tleyden commented Jan 8, 2016

@steveyen actually in this last test, I think the pindexes are correctly partitioned, it's just that the nodeDefsKnown isn't accurate. I am going to change the validation to look at the pindex distribution rather than nodeDefsKnown, unless you have any major objections.

@tleyden
Contributor

tleyden commented Jan 8, 2016

@steveyen indeed, changing the verification code to ignore nodeDefsKnown and just look at the pindex distribution under `planPIndexes` seems to have worked -- it's run about 5 times so far without detecting any issues, so it looks fixed from our end.

If you make any further fixes regarding nodeDefsKnown and you want me to retest, let me know.
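
For context, the check I switched to amounts to the following idea, sketched here in Go against cbgt's planPIndexes structures (the actual verification lives in the Python test framework; cbgt.CfgGetPlanPIndexes and the PlanPIndex field names are assumptions about the cbgt API): count pindexes per node UUID and require every expected index writer to hold at least one.

```go
package sketch

import (
	"fmt"

	"github.com/couchbase/cbgt"
)

// pindexCountsByNode tallies, from the planPIndexes record in the shared cfg,
// how many pindexes are assigned to each node UUID. A balanced 2-IW cluster
// should show both index-writer UUIDs with roughly equal, nonzero counts.
func pindexCountsByNode(cfg cbgt.Cfg) (map[string]int, error) {
	planPIndexes, _, err := cbgt.CfgGetPlanPIndexes(cfg)
	if err != nil || planPIndexes == nil {
		return nil, fmt.Errorf("no planPIndexes: %v", err)
	}
	counts := map[string]int{}
	for _, planPIndex := range planPIndexes.PlanPIndexes {
		for nodeUUID := range planPIndex.Nodes {
			counts[nodeUUID]++
		}
	}
	return counts, nil
}
```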

@tleyden
Contributor

tleyden commented Jan 8, 2016

@steveyen actually I take it back! Shortly after writing #1392 (comment), I saw another case with an unbalanced pindex distribution:

Diag:

Sg logs:

@tleyden
Contributor

tleyden commented Jan 8, 2016

I added a functional test running under Jenkins to detect this error.

@tleyden
Contributor

tleyden commented Jan 8, 2016

Note to self: steps to re-run

Get sync gw code (one time)

  • git clone sync gateway
  • switch to feature/issue_1392_cbgt branch
  • git submodule init && git submodule update

Update cbgt submodule

  • cd to cbgt submodule
  • git checkout -t fix-CfgCB-no-caching
  • git pull (to get latest commit)
  • cd to sync gw project dir
  • git add .
  • git commit .. git push

Build on Jenkins

Kick off functional test

@steveyen
Member

steveyen commented Jan 9, 2016

Not sure why, but the next clue to follow up on (for me) is that on one of the nodes, halfway through one of the logs (sg1.log), the nodedef got unregistered or disappeared from the nodeDefsWanted...

2016/01/08 22:02:43 planner: PlannerOnce, err: planner: no NodeDef, uuid: 632e8da1877161a9

In here: https://gist.github.com/tleyden/7aa381512ca9d1be20cd

steveyen added a commit to steveyen/cbgt that referenced this issue Jan 11, 2016
See couchbase/sync_gateway#1392

In a cluster using cfg_cb, there can be startup races where multiple
nodes are registering, and overwrite each other because (previously)
the CAS implementation was a local-only cfgMem cache.

This change implements CAS correctly via server-side Couchbase, at the
tradeoff of performance in order to have more correctness.

Change-Id: I9aeed7ba0a17f98a91203ba68935515f768af2e3
@tleyden
Contributor

tleyden commented Jan 11, 2016

Tested cbgt commit f8ec14652230214ac01ed6c67a689d1de7c0fa79 locally via:

$ python provision/install_sync_gateway.py --branch feature/issue_1392_cbgt
$ for NUM in `seq 1 1 100`; do python -m pytest --capture=no functional_tests/test_cbgt_pindex.py; done

Got the following cbgt diag output:

@tleyden
Contributor

tleyden commented Jan 11, 2016

Since I didn't capture the SG logs in the previous run, I did another run on cbgt commit f8ec14652230214ac01ed6c67a689d1de7c0fa79, this time with the sg logs:

@tleyden
Contributor

tleyden commented Jan 11, 2016

Tested cbgt commit 77314d030a687affd1925bb603f1f35fa2188b5d:

@tleyden
Contributor

tleyden commented Jan 11, 2016

@steveyen for this last test, I was surprised that there was nothing in the sg logs that said anything about adding pindexes

@tleyden
Contributor

tleyden commented Jan 11, 2016

Testing cbgt commit 63e659a47b9058d3ef825867ba4e787c87f79c75:

@tleyden
Contributor

tleyden commented Jan 11, 2016

Tested cbgt commit f01d8cd4500302f73e955060536610a52a0d68bd, got a fatal error initializing cbgt:

2016/01/11 23:31:00 Fatal Error initializing CBGT: Unable to initialize CBGT index: manager_api: could not save indexDefs, err: CAS mismatch

Full logs:

https://gist.github.com/tleyden/c334446a104be58f06cb#file-sg1-logs-L128

@tleyden
Contributor

tleyden commented Jan 12, 2016

Testing cbgt commit ece3eb8a68fc584cd59dd8f8552c4c75e670b0b0

So far 31/31 runs have passed

ns-codereview pushed a commit to couchbase/cbgt that referenced this issue Jan 12, 2016
See couchbase/sync_gateway#1392

In a cluster using cfg_cb, there can be startup races where multiple
nodes are registering, and overwrite each other because (previously)
the CAS implementation was a local-only cfgMem cache.

This change implements CAS correctly via server-side Couchbase, at the
tradeoff of performance in order to have more correctness.

Change-Id: I9aeed7ba0a17f98a91203ba68935515f768af2e3
Reviewed-on: http://review.couchbase.org/58359
Reviewed-by: Steve Yen <steve.yen@gmail.com>
Tested-by: Steve Yen <steve.yen@gmail.com>
@steveyen
Member

pushed that set of cbgt commits to master, ending here: couchbase/cbgt@06193ca

@tleyden
Contributor

tleyden commented Jan 12, 2016
