New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

2.0 replication doesn't notify clients when unable to send rev message #3738

Closed
adamcfraser opened this Issue Sep 11, 2018 · 8 comments

Comments

Projects
None yet
5 participants
@adamcfraser
Copy link
Contributor

adamcfraser commented Sep 11, 2018

During pull replication, the following sequence of messages is expected:

  1. Sync Gateway sends changes message to client.
  2. Client returns response indicating which revisions they need.
  3. Sync Gateway sends rev message for each revision that's requested.
  4. Client returns response indicating whether revision was successfully persisted.

In the scenario that Sync Gateway is unable to perform step 3 (most commonly because the revision body is no longer available), it responds by not sending any rev message at all.

Couchbase Lite is currently monitoring replication status by comparing the number of revisions requested in step 2 with the number of rev messages received in step 3. In the event that Sync Gateway fails to send a rev message, replication becomes stuck in a busy state on the client.

There are at least two expected scenarios where SG is known to be unable to send a rev, that shouldn't terminate replication:

  • Document has been purged from SG's backing bucket, but is still resident in SG's channel cache
  • Document has been updated on SG under convergence, so old revision body is unavailable

For expected cases, Sync Gateway could send an rev message with a new error property in this scenario to allow CBL to reconcile it's count. This property could also be used in future to differentiate between error scenarios to trigger additional CBL action.

This doesn't address how to handle unexpected errors (that replication should eventually retry). I don't know of a mechanism that Lite could do this, other than stopping the entire replication and restarting with the previous since value. Before triggering that, should Sync Gateway perform a backoff retry on rev message?

@djpongh djpongh added the ffc label Sep 11, 2018

@snej

This comment has been minimized.

Copy link
Member

snej commented Sep 11, 2018

LiteCore has a mechanism for this but I don’t recall the details (I’m away from the computer). IIRC there’s a different message it sends in case of error. I’ll post details later.

@snej

This comment has been minimized.

Copy link
Member

snej commented Sep 12, 2018

It sends a no-reply norev message, with an error property containing the error number (as well as the same id, rev, sequence properties as a rev message.)

@adamcfraser adamcfraser added this to the Iridium milestone Sep 12, 2018

@adamcfraser adamcfraser added the backlog label Sep 17, 2018

@adamcfraser adamcfraser modified the milestones: Iridium, 2.1.1 Sep 17, 2018

@adamcfraser

This comment has been minimized.

Copy link
Contributor

adamcfraser commented Sep 17, 2018

Should be straightforward to repro this scenario in a unit test along the following lines:

  1. Write 5 documents to a channel
  2. Issue a purge for the document through SG
  3. Flush rev cache using FlushRevisionCache
  4. Perform a one-shot pull replication via the 2.0 replicator

Expected:

  • four rev messages, one norev message

Actual:

  • four rev messages
  • (with actual CBL, results in replication stuck in 'busy' waiting for 5th rev/norev)

@adamcfraser adamcfraser added ready and removed backlog labels Sep 17, 2018

tleyden pushed a commit to couchbase/couchbase-lite-core that referenced this issue Sep 17, 2018

@tleyden

This comment has been minimized.

Copy link
Contributor

tleyden commented Sep 17, 2018

@snej I added PR to update the spec and assigned it to you: couchbase/couchbase-lite-core#566

snej added a commit to couchbase/couchbase-lite-core that referenced this issue Sep 17, 2018

tleyden pushed a commit that referenced this issue Sep 19, 2018

@pasin

This comment has been minimized.

Copy link

pasin commented Sep 20, 2018

couchbase/couchbase-lite-ios#2208 has been opened, and it seems to be the same issue here.

tleyden added a commit to couchbase/couchbase-lite-core that referenced this issue Sep 25, 2018

@tleyden

This comment has been minimized.

Copy link
Contributor

tleyden commented Sep 25, 2018

@pasin can you check this PR to see if it matches the Couchbase Lite expectations on what fields should be sent in the norev messages?

couchbase/couchbase-lite-core#569

@tleyden tleyden added in progress review and removed ready labels Sep 25, 2018

tleyden pushed a commit that referenced this issue Sep 25, 2018

tleyden added a commit that referenced this issue Sep 26, 2018

Feature/issue 3738 norev (#3763)
* Repro attempt for SG #3738

* /_flush_rev_cache endpoint for easy manual testing

* TODOs on fix

* Clean up test

* First pass at implmenting norev

* Update test helper to deal with norev messages

* Translate error to status code

* Avoid panic if there is a marshal err for the sequence

* Revert "/_flush_rev_cache endpoint for easy manual testing"

This reverts commit 9e54548.

* Update test comment

* PR feedback
@pasin

This comment has been minimized.

Copy link

pasin commented Sep 26, 2018

@tleyden From Puller.cc, it doesn't seem that LiteCore handles error and reason in particular.

@tleyden

This comment has been minimized.

Copy link
Contributor

tleyden commented Sep 27, 2018

Fixed in #3763

@tleyden tleyden closed this Sep 27, 2018

snej added a commit to couchbase/couchbase-lite-core that referenced this issue Oct 2, 2018

Update description of error field, add "reason" field. (#569)
* Update description of error field, add "reason" field.

Related to couchbase/sync_gateway#3738

* Since norev messages are always solicited, update text regarding whether optional

@djpongh djpongh added the bug label Oct 8, 2018

adamcfraser added a commit that referenced this issue Oct 19, 2018

Feature/issue 3738 norev (#3763)
* Repro attempt for SG #3738

* /_flush_rev_cache endpoint for easy manual testing

* TODOs on fix

* Clean up test

* First pass at implmenting norev

* Update test helper to deal with norev messages

* Translate error to status code

* Avoid panic if there is a marshal err for the sequence

* Revert "/_flush_rev_cache endpoint for easy manual testing"

This reverts commit 9e54548.

* Update test comment

* PR feedback
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment