
"_changes" feed not 100% reliable #3721

Closed
stemail23 opened this Issue Aug 24, 2018 · 16 comments


stemail23 commented Aug 24, 2018

Sync Gateway version

Couchbase Sync Gateway/1.5.1(4;cb9522c) (multiple instances, dynamically scaled)

Operating system

Ubuntu 14.04.3 LTS

Expected behavior

When repeatedly requesting GET /{db}/_changes from a Sync Gateway database on the admin port, with the since parameter set to the last_seq value returned by the previous request, the expectation is that 100% of updated documents will be reliably returned in the responses. In other words, this strategy ought to yield a guaranteed feed of every single upserted document in a Sync Gateway database over time.
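
In pseudo-code, the monitoring strategy is essentially the following. This is a simplified sketch rather than our actual monitor; the database name "db" and the default admin port 4985 are placeholders:

```python
# Simplified sketch of the change-monitor loop (not the production code).
import requests

ADMIN_CHANGES_URL = "http://localhost:4985/db/_changes"   # placeholder db/port

def follow_changes(since="0"):
    while True:
        resp = requests.get(ADMIN_CHANGES_URL, params={"since": since})
        resp.raise_for_status()
        body = resp.json()
        for change in body["results"]:
            handle_change(change)            # application-specific processing
        # last_seq may be a plain sequence ("11045528") or a compound
        # "stable::seq" value ("11045497::11045500"); it is fed back verbatim
        # as the next since value.
        since = body["last_seq"]

def handle_change(change):
    print(change["seq"], change["id"])
```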

Actual behavior

After extensive investigation following repeated anecdotal reports from users of a production implementation backed by Sync Gateway, it is demonstrably the case that upserted documents are omitted from this changes feed with significant regularity. This appears to occur most frequently when the database sequence transitions to, or from, a compound sequence (stable_seq::seq) representation.
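
(For clarity, a compound sequence is a value of the form stable_seq::seq. A trivial helper to split one, purely for illustration:)

```python
# Illustration only: split a Sync Gateway sequence value into its
# (stable_seq, seq) components; plain sequences have no stable prefix.
def parse_seq(value):
    text = str(value)
    if "::" in text:
        stable, seq = text.split("::", 1)
        return int(stable), int(seq)
    return None, int(text)

# parse_seq("11045497::11045500") -> (11045497, 11045500)
# parse_seq(11045528)             -> (None, 11045528)
```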

A Typical Example

A recent example that I investigated followed this pattern. The logs from the change monitor application reveal the following:

  1. query made with since: 11045496
  2. response contains 1 document (seq: 11045497), and last_seq: 11045497
  3. subsequent query made with since: 11045497
  4. response contains 2 documents (seq: 11045497::11045499 & 11045497::11045500), and last_seq: 11045497::11045500
  5. subsequent query made with since: 11045497::11045500
  6. response contains 1 document (seq: 11045497::11045501), and last_seq: 11045497::11045501
  7. subsequent query made with since: 11045497::11045501
  8. response contains 26 documents (seq: 11045497::11045502 … 11045497::11045527), and last_seq: 11045497::11045527
  9. subsequent query made with since: 11045497::11045527
  10. response contains 1 document (seq: 11045528), and last_seq: 11045528

This is a total of 31 documents between seq: 11045497 and seq: 11045528.

A short time later, once the compound sequence situation has resolved, the following manual confirmation is invoked:

  1. query made with since=11045496&limit=32
  2. response contains 32 documents (seq: 11045497 … 11045528), and last_seq: 11045528

This is a total of 32 documents between seq: 11045497 and seq: 11045528.

Analysis of the returned documents reveals that, in the first example, the response in step 4 was missing the document returned as seq: 11045498 in the second example. In other words, the response to the query in step 3 is missing the expected document seq: 11045497::11045498. (The log files surrounding this snippet include exhaustive data for several days both before and after this sequence, and the document at seq: 11045498 is demonstrably missing from the change monitor's output.)
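
(The gap was pinpointed by comparing the sequence numbers recorded by the change monitor against those returned by the later verification query; conceptually the comparison amounts to something like this hypothetical helper:)

```python
# Hypothetical illustration of the cross-check: find sequences returned by the
# verification query that the change monitor never emitted.
def missing_sequences(monitor_seqs, verification_seqs):
    def base(seq):
        # strip any "stable::" prefix so compound and plain forms compare
        return int(str(seq).split("::")[-1])
    seen = {base(s) for s in monitor_seqs}
    return sorted(base(s) for s in verification_seqs if base(s) not in seen)

# missing_sequences(["11045497", "11045497::11045499", "11045497::11045500"],
#                   ["11045497", "11045498", "11045499", "11045500"])
# -> [11045498]
```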

Observations

  1. It would seem that the occurrence rate of missed document changes increased significantly when we recently introduced replicating clients into our system (previously, the Sync Gateway database was only accessed by services directly over the REST API). The speculation is that these replicating clients (perhaps by invoking the previously unused bulk "set" APIs on the Sync Gateway endpoints?) have increased the frequency of whatever root cause leads to the unreliability. Note that this is speculation only.

  2. We are using this changes feed to detect the presence of replicated "transactional" documents, and also to recognise changes to "state" documents within Sync Gateway. Our system is absolutely contingent on the guaranteed detection of such document changes in order to trigger application-critical business logic and subsequent flow-on document mutations. The unreliability of this changes feed is therefore a mission-critical situation for us, and our production system is currently unhealthy, with no immediately obvious mitigation strategy.

  3. Perhaps most alarmingly: on the assumption that the remote clients (using Couchbase-provided client libraries) essentially utilise the same changes feed mechanism to implement pull replication (albeit over the public port, and with user-context channel filtering), there is now a significant concern that such pull replication might also not be 100% reliable and could result in document mutations failing to replicate down to clients. Again, this constitutes a mission-critical failure mode for us.

  4. A search of the forum and issue history for Sync Gateway yields various similar-sounding issues, which may or may not be related, for example:

Thanks for your attention to this issue.

adamcfraser commented Aug 24, 2018

@stemail23 Thanks for the detailed info - are you able to post any of the logs mentioned to a gist?

adamcfraser commented Aug 24, 2018

The compound sequence numbers (like 11045497::11045499) indicate that Sync Gateway is still waiting for sequence 11045498 to arrive over the server's mutation feed (i.e. 11045497 represents the last contiguous sequence seen by that Sync Gateway node).

Based on that, the problem step in your description above is step 9: a request with since: 11045497::11045527 that returns last_seq: 11045528 should also include 11045498, assuming that document's ID wasn't subsequently modified.

Getting the full logs will help us dig into what exactly is going on in this scenario.

adamcfraser commented Aug 24, 2018

Information on the timing of these requests will also be useful, along with whether Sync Gateway is running with any cache settings customized in the config.

By default, Sync Gateway will give up on sequences that aren't seen after 60 minutes (configurable using cache.max_wait_skipped). The described behaviour could be related to this timeout being reached.
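
For reference, that timeout would be overridden with a per-database config fragment roughly like the one below (the database name is a placeholder, and the value is in milliseconds if I recall the config docs correctly):

```json
{
  "databases": {
    "db": {
      "cache": {
        "max_wait_skipped": 3600000
      }
    }
  }
}
```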

stemail23 commented Aug 24, 2018

Thanks for the response, Adam. I'll attempt to dig out the corresponding Sync Gateway logs on Monday if I can, but it will be awkward: they'll be distributed across three instances and heavily interleaved with other activity. Also, our Sync Gateway is configured with "log": ["REST"], so only the HTTP activity is logged; I'm not sure how helpful that will be. Is there some other logging we should enable to assist?

What I can tell you is that the above mentioned sequence occurred over a span of less than a minute, and that we don't explicitly configure cache.max_wait_skipped, so I assume it's not a timeout in this case.

I can also confirm that the doc at seq: 11045498 has not been subsequently modified and is still at _rev: 1

adamcfraser commented Aug 24, 2018

@stemail23 Thanks - that's useful information. Logs may still be useful, as some scenarios may trigger warning level logging.

A few more questions:

  • Are the changes requests listed all being made against the same Sync Gateway node, or would they be potentially issued against different nodes?
  • Are the changes requests all being issued against the admin port?
  • How much time passed between the original sequence and the subsequent since=11045496&limit=32 request? I'm curious whether the results for that request were likely to be resident in the cache, or if it would have triggered a view query.
  • Are these feed=longpoll changes requests?
  • Are any other properties being set on the changes requests (filters, limit, etc.)?

adamcfraser commented Aug 25, 2018

@stemail23 One additional clarification, based on the existing logs. In your step 10, does the document returned have the sequence 11045528, or 11045497::11045528?

adamcfraser commented Aug 25, 2018

I've reviewed the skipped sequence code in 1.5.1 and written a new unit test to closely mirror this scenario, but I haven't been able to repro yet.

The point of error is the response in step 10. For a changes request with since: 11045497::11045527, there should be two possible outcomes:

  • The Sync Gateway node is still waiting on 11045498, and so has oldestSkippedSequence=11045498. In that scenario, Sync Gateway treats the since value as 11045527, but maintains the low sequence value as 11045497. The changes response should send a single change (and last_seq) as 11045497::11045528
  • The Sync Gateway node has received 11045498, and has oldestSkippedSequence != 11045498 (either zero, or a different skipped sequence). In that scenario, Sync Gateway treats the since value as 11045497, and should be (re-)sending all 31 changes from 11045498 through 11045528, with no low sequence (i.e. non-compound sequences)

The stated response in step 10 is neither of these - it's only sending 11045528, but it's not setting the low sequence value.

The low sequence value is initialized at the start of the changes loop, and doesn't get modified during the single iteration that would be expected for a one-shot changes request. Given that, I don't see any race conditions where the low sequence value could change between the initial getChanges and subsequent lowSequence check when creating the change entry.

stemail23 commented Aug 27, 2018

Adam,

It is taking me an extremely long time to try to find any relevant log entries: there are many GB of log files to hunt through, distributed across 6 different Sync Gateway machines. I'm still hunting, but in the meantime, here are some responses to various questions:

Are the changes requests listed all being made against the same Sync Gateway node, or would they be potentially issued against different nodes?

Our infrastructure has 3 public facing Sync Gateway nodes, and 3 internal nodes (for a total of 6 Sync Gateway instances). All the change monitoring described above is being load balanced across the 3 internal nodes, so yes, it is very likely that the change requests are being made against different instances.

Are the changes requests all being issued against the admin port?

Yes

How much time passed between the original sequence and the subsequent since=11045496&limit=32 request? I'm curious whether the results for that request were likely to be resident in the cache, or if it would have triggered a view query.

In this case, several days had passed between them. What is interesting, though, is how I became aware of the missed change notification: I'm currently running two essentially identical monitors and correlating the changes detected by both to try to find any differences. In this case, one of the monitors correctly detected the change, whereas the other did not. The monitor that successfully detected the change also went into a compound sequence state at approximately the same time as the failing one, but then correctly re-emitted sequences 11045498 … 11045528 when it recovered to a single sequence number, again at approximately the same time as the failing monitor (approximately 15 seconds elapsed).

Are these feed=longpoll changes requests?

Yes

Are any other properties being set on the changes requests (filters, limit, etc.)?

include_docs=true
feed=longpoll
limit=5000 (although this limit is not being reached)
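
So each poll is effectively GET /{db}/_changes?feed=longpoll&include_docs=true&limit=5000&since={last_seq}, issued against the admin port.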

One additional clarification, based on the existing logs. In your step 10, does the document returned have the sequence 11045528, or 11045497::11045528?

It is seq: 11045528 (i.e. the compound sequence situation appears to have resolved, even though the missing seq: 11045498 was never returned).

stemail23 commented Aug 28, 2018

I have uploaded some log files from Sync Gateway at the time of the example. There are 2 minutes' worth of logs from 6 Sync Gateway machines and the log files from the change monitor that missed the change.

https://gist.github.com/stemail23/7f2d91eaa406ea70a00f0a609620c0e7

stemail23 commented Aug 30, 2018

@adamcfraser, what causes a Sync Gateway instance to go into compound sequence mode? I understand that it is "still waiting for a sequence over the mutation feed", but, more explicitly, what does that mean and what causes it? I would like to build the simplest system possible to try to replicate and demonstrate the issue. On a single CB server and single SG instance running on the same machine, I have been unable to trigger the compound sequence situation, no matter what load I attempt to put it under.

adamcfraser commented Aug 31, 2018

@stemail23 Thanks for the details/logs - I've been out of the office for the past few days, but will take a look and see if I can get to the bottom of this.

Sync Gateway buffers sequences arriving over the DCP feed for ordering. Compound sequence mode occurs in response to a late-arriving sequence on the server's mutation feed (DCP): if the next expected sequence number doesn't appear for 5 seconds (by default), it will be skipped and compound sequences will be sent to clients. In general this occurs under load on either Couchbase Server or Sync Gateway. It's also more likely when one Sync Gateway node is under much heavier load than the others in the cluster.
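
Conceptually, the buffering behaves something like the sketch below. This is only an illustration of the idea, not Sync Gateway's actual implementation:

```python
import time

class SequenceBuffer:
    """Illustrative sketch only (not Sync Gateway's actual code): buffer
    out-of-order sequences from the mutation feed, release them in contiguous
    order, and eventually skip a sequence that never arrives."""

    def __init__(self, next_expected=1, max_wait_pending=5.0):
        self.next_expected = next_expected   # next contiguous sequence
        self.max_wait_pending = max_wait_pending
        self.pending = {}                    # seq -> arrival time
        self.skipped = set()                 # sequences given up on

    def receive(self, seq):
        """Handle one mutation from the feed; return sequences that can now
        be exposed to changes clients."""
        if seq in self.skipped:
            self.skipped.discard(seq)        # late arrival of a skipped seq
            return [seq]
        self.pending[seq] = time.monotonic()
        return self._release_contiguous()

    def _release_contiguous(self):
        released = []
        while self.next_expected in self.pending:
            self.pending.pop(self.next_expected)
            released.append(self.next_expected)
            self.next_expected += 1
        return released

    def check_pending(self):
        """Called periodically: if there is a gap and the buffered sequences
        have waited longer than max_wait_pending, skip the missing sequence.
        While skipped sequences are outstanding, clients see compound
        'stable::seq' values."""
        if self.pending and self.next_expected not in self.pending:
            if time.monotonic() - min(self.pending.values()) > self.max_wait_pending:
                self.skipped.add(self.next_expected)
                self.next_expected += 1
                return self._release_contiguous()
        return []

    def stable_seq(self):
        # highest sequence with no earlier gaps
        return (min(self.skipped) - 1) if self.skipped else self.next_expected - 1
```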

If you're attempting to repro, you could artificially tune the wait time down (https://docs.couchbase.com/sync-gateway/2.1/config-properties.html#databases-foo_db-cache-max_wait_pending).
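
For anyone following along, that tuning would be a per-database config fragment along these lines (the database name is a placeholder; the value is in milliseconds, going by the linked docs):

```json
{
  "databases": {
    "db": {
      "cache": {
        "max_wait_pending": 100
      }
    }
  }
}
```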

stemail23 commented Sep 3, 2018

Thanks for looking into it, @adamcfraser. I'll have a look at using max_wait_pending to see if I can force the symptoms.

adamcfraser self-assigned this Sep 4, 2018

adamcfraser added this to the Iridium milestone Sep 4, 2018

adamcfraser commented Sep 14, 2018

@stemail23 I haven't been able to reproduce this yet, but I'm continuing to investigate. If you've had any success generating a standalone reproducible scenario, definitely keep me in the loop.

JFlath added the ffc label Sep 28, 2018

adamcfraser modified the milestones: Iridium, 2.1.1 Oct 1, 2018

adamcfraser commented Oct 3, 2018

I'm able to reproduce this with a longpoll changes request, under the following scenario:

  1. Sequence n is skipped
  2. Some sequences after n are replicated to the client, up to sequence m.
  3. Client issues longpoll changes request with since=n-1::m. No new results found, longpoll goes into wait mode
  4. Sequences n and m+1 arrive over DCP

Expected: SG returns changes entries for n through m+1
Actual: SG returns change entry for m+1 only

Late-arriving sequence handling is correct for one-shot and continuous changes, but longpoll requests that go into wait mode can miss sequences due to the way the since value is modified here. This results in the changes loop running with since=m when the changes feed is woken up, even though the system low sequence has changed.

The fix is to preserve the low sequence value if the initial longpoll iteration doesn't return results.

stemail23 commented Oct 4, 2018

Adam, thanks for persisting with this issue and hopefully making a breakthrough! I’ll see if I can incorporate the fix in a build of our system and see if it appears to resolve the issues we’re seeing.

djpongh added the bug label Oct 18, 2018

adamcfraser closed this Nov 5, 2018

adamcfraser removed the ready label Nov 5, 2018
