Initialize cache after DCP start #3558

Closed
adamcfraser opened this issue May 11, 2018 · 12 comments

Comments

@adamcfraser (Contributor) commented May 11, 2018

We should retrieve lastSequence() after starting the DCP feed, to guarantee that there's no race between cache initialization and DCP startup.
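
In code terms, the requested ordering looks roughly like this (a minimal sketch with hypothetical helper names, not the actual database.go startup code):

// Sketch only: illustrates the ordering this issue asks for, not the actual
// database.go startup code. startDCPFeed, getLastSequence and initChangeCache
// are hypothetical stand-ins.
package main

func startDCPFeed() error        { return nil } // would start the mutation (DCP) feed
func getLastSequence() uint64    { return 0 }   // would read the current last sequence
func initChangeCache(seq uint64) {}             // would seed the cache's initial validFrom

func startDatabase() error {
	// Start the feed first...
	if err := startDCPFeed(); err != nil {
		return err
	}
	// ...then capture lastSequence. Any sequence allocated after this point is
	// guaranteed to arrive over the already-running feed, so the cache's
	// initial validFrom can no longer silently skip a sequence.
	initChangeCache(getLastSequence())
	return nil
}

func main() { _ = startDatabase() }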

@adamcfraser added this to the 2.2.0 milestone May 11, 2018

@adamcfraser changed the title from "Investigate potential startup race for cache.initialSequence" to "Initialize cache after DCP start" May 11, 2018

@adamcfraser added the icebox and ready labels and removed the icebox label May 14, 2018

@adamcfraser (Contributor Author) commented May 14, 2018

First step should be to write a unit test to repro - might require leaky bucket.

tleyden pushed a commit that referenced this issue May 21, 2018

@tleyden added the in progress and review labels and removed the ready label May 21, 2018

@tleyden (Contributor) commented May 23, 2018

@adamcfraser here is my understanding of the issue -- can you confirm this? (annotation below)

┌────┐                   ┌────┐                     ┌─────────┐                        ┌─────────┐
│Sg1 │                   │Sg2 │                     │  CB KV  │                        │ CB DCP  │
└────┘                   └────┘                     └─────────┘                        └─────────┘
                                                                                                  
                                                                                                  
                    Get Latest Sequence                                                           
  ────────────────────────────────────────────────────────▶                                       
                                                                                                  
  ◀────────────────────────────────────────────────────────                                       
                           100                                                                    
                                                                                                  
                                                                                                  
                               Write Doc w/ Seq 101                                               
                           ───────────────────────────────▶                                       
                                                                                                  
                                        Mutation notification Doc w/ Seq 101                      
                           ◀───────────────────────────────────────────────────────────────────   
                                                                                                  
                                                                                                  
                                                                                                  
                                             Start DCP Feed                                       
  ────────────────────────────────────────────────────────────────────────────────────────────▶   
                                                                                                  
                                                                                                  
                                                                                                  
      Write Doc w/ Seq 102                                                                        
  ───────────────────────────────▶                                                                
                                                                                                  
                                                                                                  
                                   Mutation notification Doc w/ Seq 102                           
  ◀────────────────────────────────────────────────────────────────────────────────────────────   
  • SG1 starts up while SG2 has already been running for some time
  • SG1 gets the last seq of 100
  • After SG1 gets the last seq of 100, but before SG1 starts the DCP feed, SG2 writes a doc with seq 101
  • SG1 starts the DCP feed, but its cache is in an invalid state: the cache thinks it is valid from seq 100, yet it is missing the doc with seq 101, so it is really only valid from seq 102 (which arrives later)
@adamcfraser (Contributor Author) commented May 23, 2018

That's the general idea, but there are some additional considerations:

  • For nodes doing import/migrate (or accel), 'Start the DCP Feed' resumes from where it left off. In other scenarios, 'Start the DCP feed' retrieves the vbucket high seqnos and starts from there.
  • It isn't necessary for the doc w/ seq 102 to originate from sg1
  • Caches (and valid from) are per channel. 'Get latest sequence' is relevant because it's actually 'Get latest sequence, and store it as the initial validFrom value whenever a new channel cache is instantiated'.
  • Channel caches are initialized in response to a document in that channel arriving over DCP, or an incoming changes request for that channel. Changes requests aren't serviced for a database until the DCP feed has been started.

As discussed on #3598, I think the cleanest approach is to defer initialization of the cache's 'initial validFrom' until the first channel cache is initialized. That's guaranteed to only happen after DCP has been started.
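
A rough sketch of that deferral, using hypothetical type and field names (the real change cache is considerably more involved): the first channel cache creation, which can only happen once DCP is running, is what captures the initial validFrom.

// Sketch only; not the actual change_cache.go. Types, fields and the
// lastSequence callback are hypothetical.
package cache

import "sync"

type channelCache struct {
	validFrom uint64 // earliest sequence this channel cache is authoritative for
}

type changeCache struct {
	lock            sync.Mutex
	initialLoaded   bool
	initialSequence uint64
	channels        map[string]*channelCache
	lastSequence    func() uint64 // e.g. the database context's LastSequence
}

// getChannelCache lazily creates the per-channel cache. Channel caches are only
// created once DCP is running (a doc for the channel arrived over the feed, or a
// changes request came in after feed startup), so initialSequence is read
// post-DCP-start and the race in the diagram above can't occur.
func (c *changeCache) getChannelCache(channel string) *channelCache {
	c.lock.Lock()
	defer c.lock.Unlock()
	if !c.initialLoaded {
		c.initialSequence = c.lastSequence() // first channel cache: capture validFrom post-DCP-start
		c.initialLoaded = true
	}
	if c.channels == nil {
		c.channels = make(map[string]*channelCache)
	}
	cache, ok := c.channels[channel]
	if !ok {
		cache = &channelCache{validFrom: c.initialSequence}
		c.channels[channel] = cache
	}
	return cache
}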

@tleyden (Contributor) commented May 23, 2018

Ok thanks. I was able to repro locally w/ the following steps.

Steps to repro

  • On a feature branch, add an artificial sleep during startup to database.go:
	// If this is an xattr import node, resume DCP feed where we left off.  Otherwise only listen for new changes (FeedNoBackfill)
	feedMode := uint64(sgbucket.FeedNoBackfill)
	if context.UseXattrs() && context.autoImport {
		feedMode = sgbucket.FeedResume
	}


	base.Infof(base.KeyAll, "Got last seq: %v. Sleeping for a while", lastSeq)
	time.Sleep(time.Second * 60)
	base.Infof(base.KeyAll, "Done sleeping.  ")
	
	// If not using channel index or using channel index and tracking docs, start the tap feed
	if options.IndexOptions == nil || options.TrackDocs {
		base.Infof(base.KeyDCP, "Starting mutation feed on bucket %v due to either channel cache mode or doc tracking (auto-import/bucketshadow)", base.MD(bucket.GetName()))

		if err = context.mutationListener.Start(bucket, options.TrackDocs, feedMode, func(bucket string, err error) {

  • Create a docker network called cb and create a Couchbase Server running in that network, with container name couchbase: docker network create cb && docker run -d --name couchbase --network cb -p 8091:8091 couchbase/server:5.5.0-beta
  • Run SG1 via docker run -it -p 14984:4984 -p 14985:4985 --network cb --volume /Users/tleyden/tmp/sg-docker:/tmp/ tleyden5iwx/sync_gateway:feature_issue_3558_int_cache_post_dcp sync_gateway /tmp/basic-couchbase-bucket.json and config: https://gist.github.com/tleyden/b3809502840005862cda8f23ce0ebd80
  • Wait until SG1 starts up
  • Insert a doc to SG1 via echo '{"value":"foo", "channels":["testuser"]}' | http POST http://localhost:14984/db/ -- in my case this doc was given Seq 3
  • Run SG2 via docker run -it -p 24984:4984 -p 24985:4985 --network cb --volume /Users/tleyden/tmp/sg-docker:/tmp/ tleyden5iwx/sync_gateway:feature_issue_3558_int_cache_post_dcp sync_gateway /tmp/basic-couchbase-bucket.json
  • While SG2 is paused for 60 seconds during startup, after getting latest sequence but before starting DCP, insert several more docs to SG1 using the same curl request as above
  • Wait for SG2 to startup after 60 second pause
  • Query SG2 changes feed

Actual

curl -X GET \
  'http://localhost:24984/db/_changes?limit=1000&style=main_only&active_only=false&include_docs=false&filter=&channels=testuser&doc_ids=[testdoc]&feed=normal&since=&heartbeat=0&timeout=300000'

{
    "results": [
        {
            "seq": 2,
            "id": "_user/testuser",
            "changes": []
        },
        {
            "seq": 3,
            "id": "d0d7680140e8aba3d303e5d3d475fb1c",
            "changes": [
                {
                    "rev": "1-5eb803e4848d3235a5f60e061b637541"
                }
            ]
        }
    ],
    "last_seq": "3"
}

Expected

{
    "results": [
        {
            "seq": 2,
            "id": "_user/testuser",
            "changes": []
        },
        {
            "seq": 3,
            "id": "d0d7680140e8aba3d303e5d3d475fb1c",
            "changes": [
                {
                    "rev": "1-5eb803e4848d3235a5f60e061b637541"
                }
            ]
        },
        {
            "seq": 4,
            "id": "5b3ce61b0b75b4519725ae798a052f0b",
            "changes": [
                {
                    "rev": "1-5eb803e4848d3235a5f60e061b637541"
                }
            ]
        },
        {
            "seq": 5,
            "id": "720c216780bebd007935eae97094c867",
            "changes": [
                {
                    "rev": "1-5eb803e4848d3235a5f60e061b637541"
                }
            ]
        },
        {
            "seq": 6,
            "id": "07ef5406314287d74ee2d27c87c0030d",
            "changes": [
                {
                    "rev": "1-5eb803e4848d3235a5f60e061b637541"
                }
            ]
        },
        {
            "seq": 7,
            "id": "4a9eb5d3958fcce44b9d254121c89b76",
            "changes": [
                {
                    "rev": "1-5eb803e4848d3235a5f60e061b637541"
                }
            ]
        }
    ],
    "last_seq": "7"
}
@adamcfraser (Contributor Author) commented May 23, 2018

@tleyden It would be a clearer repro scenario if you wrote an additional doc after SG2's DCP feed starts. The current 'actual' is a valid result (no missing data, based on last_seq in the changes response). If you write an additional doc after SG2's DCP feed starts, I expect you'd see sequences 2,3,8 in the changes response, which better illustrates the cache problem.

@tleyden (Contributor) commented May 23, 2018

Extending the previous steps to reproduce, I just wrote a doc to SG1 using the same curl request as above, and now:

SG2 changes feed

{
    "results": [
        {
            "seq": "2",
            "id": "_user/testuser",
            "changes": []
        },
        {
            "seq": "3",
            "id": "d0d7680140e8aba3d303e5d3d475fb1c",
            "changes": [
                {
                    "rev": "1-5eb803e4848d3235a5f60e061b637541"
                }
            ]
        },
        {
            "seq": "3::4",
            "id": "5b3ce61b0b75b4519725ae798a052f0b",
            "changes": [
                {
                    "rev": "1-5eb803e4848d3235a5f60e061b637541"
                }
            ]
        },
        {
            "seq": "3::8",
            "id": "b05d8b1597c276ae43033f40ff5c1bef",
            "changes": [
                {
                    "rev": "1-5eb803e4848d3235a5f60e061b637541"
                }
            ]
        }
    ],
    "last_seq": "3::8"
}

SG1 changes feed

{
    "results": [
        {
            "seq": 2,
            "id": "_user/testuser",
            "changes": []
        },
        {
            "seq": 3,
            "id": "d0d7680140e8aba3d303e5d3d475fb1c",
            "changes": [
                {
                    "rev": "1-5eb803e4848d3235a5f60e061b637541"
                }
            ]
        },
        {
            "seq": 4,
            "id": "5b3ce61b0b75b4519725ae798a052f0b",
            "changes": [
                {
                    "rev": "1-5eb803e4848d3235a5f60e061b637541"
                }
            ]
        },
        {
            "seq": 5,
            "id": "720c216780bebd007935eae97094c867",
            "changes": [
                {
                    "rev": "1-5eb803e4848d3235a5f60e061b637541"
                }
            ]
        },
        {
            "seq": 6,
            "id": "07ef5406314287d74ee2d27c87c0030d",
            "changes": [
                {
                    "rev": "1-5eb803e4848d3235a5f60e061b637541"
                }
            ]
        },
        {
            "seq": 7,
            "id": "4a9eb5d3958fcce44b9d254121c89b76",
            "changes": [
                {
                    "rev": "1-5eb803e4848d3235a5f60e061b637541"
                }
            ]
        },
        {
            "seq": 8,
            "id": "b05d8b1597c276ae43033f40ff5c1bef",
            "changes": [
                {
                    "rev": "1-5eb803e4848d3235a5f60e061b637541"
                }
            ]
        }
    ],
    "last_seq": "8"
}
@adamcfraser (Contributor Author) commented May 23, 2018

@tleyden Ah that's interesting - the sequence buffering (done while processing the DCP feed) knows that it expects to see all sequences after last_seq, hasn't seen the expected sequences between 4 and 7, and has flagged them as skipped.

The result is that we won't see any correctness issues until SG abandons the skipped sequences (after 60 minutes).
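
For context, a toy sketch of the skipped-sequence bookkeeping being described here (hypothetical names, far simpler than the real change cache); the 3::4 / 3::8 compound sequences in the SG2 changes feed above are the visible effect of the unresolved gap.

// Toy sketch of the skipped-sequence bookkeeping described above; hypothetical
// names, not the real change cache implementation.
package cache

import "time"

type skippedEntry struct {
	seq       uint64
	timeAdded time.Time
}

type sequenceBuffer struct {
	nextSequence uint64         // next sequence expected on the feed
	skipped      []skippedEntry // gaps we jumped over, pending arrival or abandonment
	maxWait      time.Duration  // e.g. 60 * time.Minute
}

// processSequence is called for each sequence arriving over DCP. Any gap between
// nextSequence and the arriving sequence is recorded as skipped; until those
// entries are resolved, the stable ("low") sequence stays pinned before the gap.
func (b *sequenceBuffer) processSequence(seq uint64) {
	if seq < b.nextSequence {
		return // a late arrival of a previously skipped sequence would be resolved here
	}
	now := time.Now()
	for s := b.nextSequence; s < seq; s++ {
		b.skipped = append(b.skipped, skippedEntry{seq: s, timeAdded: now})
	}
	b.nextSequence = seq + 1
}

// abandonExpired drops skipped entries older than maxWait. Per the follow-up
// comments below, a channel query run before abandoning can find sequences that
// were genuinely written, which is what auto-corrects the changes feed.
func (b *sequenceBuffer) abandonExpired() {
	cutoff := time.Now().Add(-b.maxWait)
	kept := b.skipped[:0]
	for _, e := range b.skipped {
		if e.timeAdded.After(cutoff) {
			kept = append(kept, e)
		}
	}
	b.skipped = kept
}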

@tleyden (Contributor) commented May 24, 2018

After waiting 60 mins, SG2 seems to have autocorrected itself and is now showing the expected changes:

curl -X GET \
  'http://localhost:24984/db/_changes?limit=1000&style=main_only&active_only=false&include_docs=false&filter=&channels=testuser&doc_ids=[testdoc]&feed=normal&since=&heartbeat=0&timeout=300000'

{
    "results": [
        {
            "seq": 2,
            "id": "_user/testuser",
            "changes": []
        },
        {
            "seq": 3,
            "id": "d0d7680140e8aba3d303e5d3d475fb1c",
            "changes": [
                {
                    "rev": "1-5eb803e4848d3235a5f60e061b637541"
                }
            ]
        },
        {
            "seq": 4,
            "id": "5b3ce61b0b75b4519725ae798a052f0b",
            "changes": [
                {
                    "rev": "1-5eb803e4848d3235a5f60e061b637541"
                }
            ]
        },
        {
            "seq": 5,
            "id": "720c216780bebd007935eae97094c867",
            "changes": [
                {
                    "rev": "1-5eb803e4848d3235a5f60e061b637541"
                }
            ]
        },
        {
            "seq": 6,
            "id": "07ef5406314287d74ee2d27c87c0030d",
            "changes": [
                {
                    "rev": "1-5eb803e4848d3235a5f60e061b637541"
                }
            ]
        },
        {
            "seq": 7,
            "id": "4a9eb5d3958fcce44b9d254121c89b76",
            "changes": [
                {
                    "rev": "1-5eb803e4848d3235a5f60e061b637541"
                }
            ]
        },
        {
            "seq": 8,
            "id": "b05d8b1597c276ae43033f40ff5c1bef",
            "changes": [
                {
                    "rev": "1-5eb803e4848d3235a5f60e061b637541"
                }
            ]
        }
    ],
    "last_seq": "8"
}

In the SG2 logs, it looks like it's backfilling the cache by querying the bucket:

https://gist.github.com/tleyden/1b9ca649bfea35514622ce4e8de9d092#file-gistfile1-txt-L84-L91

SG1 logs for comparison:

https://gist.github.com/tleyden/d834a327fb30eb15aa062806afc9c709

@tleyden (Contributor) commented May 24, 2018

@adamcfraser my initial attempt on this change:

the cleanest approach is to defer initialization of the cache's 'initial validFrom' until the first channel cache is initialized. That's guaranteed to only happen after DCP has been started.

is in d49743e, and I'm no longer able to repro the issue via the steps in #3558 (comment) using that commit.

Currently it re-initializes (lazily) c.initialSequence and c.nextSequence on the first callback to DocChanged, using the latest value of c.context.LastSequence(). My first attempt was to remove initialization of those values from ChangeCache.Init() and completely defer initialization until the first callback to DocChanged, but that was returning an empty changes feed. (I haven't dug into that much further.)
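
Roughly what that lazy reload looks like (a sketch; only initialSequence, nextSequence and initialSequenceLazyLoaded are named in the change, while the surrounding type and the lastSequence hook here are hypothetical):

// Sketch of the lazy reload described above; not the exact d49743e diff.
package cache

import "sync"

type lazyChangeCache struct {
	lock                      sync.Mutex
	initialSequence           uint64
	nextSequence              uint64
	initialSequenceLazyLoaded bool
	lastSequence              func() uint64 // e.g. c.context.LastSequence
}

// DocChanged is the DCP mutation callback. On the very first callback the initial
// and next sequence are reloaded from LastSequence(); since DocChanged only fires
// after the DCP feed has started, the reloaded value cannot predate the feed and
// the startup window described in this issue is closed.
func (c *lazyChangeCache) DocChanged(docID string, seq uint64) {
	c.lock.Lock()
	if !c.initialSequenceLazyLoaded {
		c.initialSequence = c.lastSequence()
		c.nextSequence = c.initialSequence + 1
		c.initialSequenceLazyLoaded = true
	}
	c.lock.Unlock()

	// ...normal per-mutation processing (buffering, channel cache updates) would follow.
	_, _ = docID, seq
}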

@adamcfraser (Contributor Author) commented May 24, 2018

On my previous comment (about sequences being abandoned after an hour): I'd forgotten that the star channel query we do before abandoning would find the 'missing' sequences, so that explains the auto-correction you saw.

@tleyden (Contributor) commented May 24, 2018

the star channel query we do before abandoning would find the 'missing' sequences

here?

tleyden added a commit that referenced this issue May 30, 2018

Fixes issue #3558 Initialize cache after DCP start (#3607)
* Add experimental dockerfile

* Fix attempt for build err

* Trying to figure out how to get source commit/branch

* Revert "Trying to figure out how to get source commit/branch"

This reverts commit bd5e502.

* Add artificial delay after getting last seq, but before starting dcp feed

* Reload c.initialSequence in DocChanged() as long as c.initialSequenceLazyLoaded == false

* Notes from call w/ Adam

* Add c.getInitialSequence() + _getNextSequence()

* Fixes

* Add comments

* Notes from call w/ Adam

* Sketch out changes and test

* Change the changeindex iface

* Fix data race

WARNING: DATA RACE
Write at 0x00c4201fa2e0 by goroutine 9:
  github.com/couchbase/sync_gateway/db.(*changeCache)._addToCache()
      /drone/godeps/src/github.com/couchbase/sync_gateway/db/change_cache.go:689 +0x4ff
  github.com/couchbase/sync_gateway/db.(*changeCache).processEntry()
      /drone/godeps/src/github.com/couchbase/sync_gateway/db/change_cache.go:650 +0x1f6
  github.com/couchbase/sync_gateway/db.(*changeCache).DocChangedSynchronous()
      /drone/godeps/src/github.com/couchbase/sync_gateway/db/change_cache.go:549 +0x20a4

Previous read at 0x00c4201fa2e0 by goroutine 42:
  github.com/couchbase/sync_gateway/db.(*changeCache).waitForSequence()
      /drone/godeps/src/github.com/couchbase/sync_gateway/db/util_testing.go:25 +0xbd
  github.com/couchbase/sync_gateway/db.TestDocDeletionFromChannelCoalesced()
      /drone/godeps/src/github.com/couchbase/sync_gateway/db/changes_test.go:233 +0x5fe
  testing.tRunner()
      /usr/local/go/src/testing/testing.go:746 +0x16c

* Use getNextSequence() in a few more places

* Fix another race

https://gist.github.com/tleyden/dccaa3cf2be85160051a793333c24467

* Remove dockerfile, moved to a separate PR

* Gofmt

* PR cleanup

* PR cleanup

* Remove check that’s leftover from first attempt

* Add comments

* First round of PR feedback

* Address PR feedback about avoiding external locking
@tleyden (Contributor) commented May 31, 2018

Fixed in #3607

@tleyden closed this May 31, 2018

bbrks added a commit that referenced this issue Jun 7, 2018

Fixes issue #3558 Initialize cache after DCP start (#3607)

adamcfraser added a commit that referenced this issue Oct 4, 2018

Fixes issue #3558 Initialize cache after DCP start (#3607)

adamcfraser added a commit that referenced this issue Oct 4, 2018

Fixes issue #3558 Initialize cache after DCP start (#3607) (#3780)