MongoDB cluster issues: retry on disconnect does not fail over, and write errors do not fail over #1508

Open
boutell opened this Issue Jul 27, 2018 · 9 comments

boutell commented Jul 27, 2018

In the event that a member of the replica set becomes unable to respond during a particular find() or insert() or similar operation, there is no automatic failover. The operation fails with an error. Our preference would be to automatically retry it.

According to notes by Thomas Chraibi based on input from Charles Sarrazin of MongoDB, one approach to achieve that is to simply re-attempt the find() or insert() call in question. This will result in the MongoDB driver discovering a functioning replica set node to reconnect to, as in the following pseudocode:

const client = ...
const coll = client.db('cool').collection('llamas');

function attemptInsert(doc, callback) {
  coll.insert(doc, (err, result) => {
    if (err) {
      // you might want to check for some errors, as they might be unrecoverable
      // if (err.msg === 'SomeImportantErrorType') return callback(new Error('unrecoverable!'));

      // now recursively call `attemptInsert` to perform server selection again
      attemptInsert(doc, callback);
      return;
    }

    callback(null, result); // success
  });
}
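
A bounded variant of that sketch might look like the following; the attempt cap and delay are illustrative assumptions on our part, not part of the notes:

// A bounded variant of the same idea: give up after a fixed number of
// attempts and wait briefly between them, so a persistent failure
// surfaces as an error instead of looping forever. The cap and delay
// below are illustrative values, not recommendations.
const MAX_ATTEMPTS = 5;
const RETRY_DELAY_MS = 500;

function attemptInsertWithLimit(doc, callback, attempt) {
  attempt = attempt || 0;
  coll.insert(doc, (err, result) => {
    if (!err) {
      return callback(null, result);
    }
    if (attempt + 1 >= MAX_ATTEMPTS) {
      return callback(err);
    }
    // Re-invoking the operation performs server selection again, so a
    // healthy member can be picked on the next attempt.
    setTimeout(() => attemptInsertWithLimit(doc, callback, attempt + 1), RETRY_DELAY_MS);
  });
}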

In addition, there is apparently some sort of issue with our autoReconnect configuration:

        autoReconnect: true,
        // retry forever
        reconnectTries: Number.MAX_VALUE,
        reconnectInterval: 1000
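
For context, these options are supplied when connecting; a minimal sketch, assuming the 3.x driver (which accepts them as top-level options to MongoClient.connect):

// Minimal sketch of where those options live (assumes the 3.x driver,
// which accepts them as top-level options to MongoClient.connect).
const { MongoClient } = require('mongodb');

MongoClient.connect('mongodb://localhost:27017/cool', {
  autoReconnect: true,
  // retry forever
  reconnectTries: Number.MAX_VALUE,
  reconnectInterval: 1000
}, (err, client) => {
  if (err) {
    return console.error(err);
  }
  const coll = client.db('cool').collection('llamas');
  // ... use coll here
});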

Apparently this will keep retrying the connection to a node that is down for as long as that node is down, which is not ideal.

However, it is unclear to me why this should occur, given that find() and insert() operations will apparently continue to make new connections to other nodes as needed, according to the pseudocode provided above.

So, more clarification is needed on the following points before implementation can be completed:

  • In what situation does the autoReconnect behavior come into play?
  • If it is undesirable, what approach would ensure we eventually get connected again to an appropriate node?
  • If new find() and insert() operations already reconnect as needed, is there any value in using autoReconnect at all? What value would that be?
  • What MongoDB errors can be safely classed as "this node is ill or unavailable," as opposed to "you did something you should not have" (examples: oversize document, illegal operation, unique key, etc)?
boutell commented Aug 9, 2018

Hello @mbroadst @csarrazi, any word on this? We do need your input in order to help our mutual customer with this. Thanks!

mbroadst commented Aug 10, 2018

Hi @boutell, sorry, I lost track of this issue in the mix. Here are the answers to your questions:

In what situation does the autoReconnect behavior come into play?
autoreconnect relates specifically to the connection pool: whether the pool is resilient to network failures, attempting to reconnect its connections to the server that pool is associated with.

If it is undesirable, what approach would ensure we eventually get connected again to an appropriate node?
autoreconnect does provide the resiliency you desire in this case, connecting to the same node you were once connected to.

If new find() and insert() operations already reconnect as needed, is there any value in using autoReconnect at all? What value would that be?
The find and insert operations themselves do not presently reconnect.

What MongoDB errors can be safely classed as "this node is ill or unavailable," as opposed to "you did something you should not have" (examples: oversize document, illegal operation, unique key, etc)?
A budding list is growing here, based on a number of our specs.
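
To give a rough flavor of what that classification looks like in practice, here is a sketch; the error name and codes are illustrative, drawn from the retryable-writes work, and not an exhaustive or authoritative list:

// Sketch of a "safe to retry" check. The error name and codes below are
// illustrative, drawn from the retryable-writes work; they are not an
// exhaustive or authoritative list.
function isRetryableError(err) {
  if (!err) {
    return false;
  }
  // Network-level failures surface as MongoNetworkError in the node driver.
  if (err.name === 'MongoNetworkError') {
    return true;
  }
  // Server-side codes that mean "this node cannot serve you right now",
  // as opposed to a bad request (duplicate key, oversize document, etc.).
  const retryableCodes = [
    6,     // HostUnreachable
    7,     // HostNotFound
    89,    // NetworkTimeout
    91,    // ShutdownInProgress
    189,   // PrimarySteppedDown
    9001,  // SocketException
    10107, // NotMaster
    11600, // InterruptedAtShutdown
    11602, // InterruptedDueToReplStateChange
    13435, // NotMasterNoSlaveOk
    13436  // NotMasterOrSecondary
  ];
  return retryableCodes.indexOf(err.code) !== -1;
}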

Now to be a little more specific:
The connection pool already has support for auto reconnection, and an ability to complete/retry operations once such a connection is reconnected. This is an OK design for a single-server connection, but it is not ideal for any more complicated topology, for a few reasons:

  • you lose the benefits of a distributed topology, waiting for the operation to complete on a recovering node
  • there is no guarantee that the node ever recovers, and therefore your operation may be stuck in limbo, or fail outright without ever having been retried.

There are further issues with the way this retryability has been implemented, specifically that there can essentially be no ordering of operations once they are called and fall into this queue; you can queue up five writes, they can each select a different server, and one of them could potentially be stuck for quite some time (and maybe forever), while you think it went through. Very scary indeed.

We have been trying to tackle these issues holistically in core engineering (drivers + server) at MongoDB for the past year, and have come up with some better solutions: retryable writes and retryable reads. Retryable writes are already available today in the driver, and require version 3.6+ of the server. Retryable reads are in the specification process and should be available within the next two quarters.

Now back to "what can we do today." I provided the example code above as an example of something that might end up being a general skeleton for retryable reads in the node driver. When connected to a replica set, each operation called on the driver first performs "server selection," allowing the user to specify things like a read preference. Once a server is selected the operation is queued/written on the connection pool associated with that internal representation of the server. In order to provide the kind of highly available retryability that our existing retryability has led users to believe they have, one would need to move the retryability up a level, and incorporate server selection. Essentially, once the operation fails with a retryable error, the topology should retry server selection and enqueue the operation on an available node for immediate execution.

Unfortunately, we cannot speed up our timetable to implement retryable reads for y'all, but perhaps we can work together on a module that prototypes it, if that suits your needs? I imagine the best place to inject this would be at the topology level (which is unfortunately a bit in flux at the moment); monkey-patching the primitives to provide operation re-execution would be a good place to start.

Hope this helps, please let me know if I can clarify anything.

boutell commented Aug 16, 2018

Hi @mbroadst,

Ludo visited us here in Philadelphia and provided some information toward a better understanding. We'd like to run our combined understanding past you.

  • Autoreconnect: no good with replica sets for all the reasons I already summarized. Fine here.
  • However, everything I was hoping about "the next request" working was completely wrong. (: Specifically, the driver will never start trying requests on another node on its own, even though it knows a list of nodes via the URI (old style) or via mongodb+srv DNS records (new style).
  • This means that we must, at some point, respond to errors relating to the network or a disabled node by forcing the driver to perform server selection. Simply retrying inserts/updates/reads will never do this (except possibly using "retryable writes" in the latest driver and mongodb version).
  • So we need to choose the right way of forcing server selection with existing drivers.
  • We could simply open a new connection with the original URI. This would work and would be upwardly compatible with mongodb+srv, however it may not be the most efficient option available.
  • Specifically, querying the topology and healing the connection in some way might be a bit more efficient. However, this layer is "in flux" as you earlier mentioned.
  • So the most backwards and forwards compatible approach is to try opening new connections in these scenarios, and then retrying the operations.
  • We can then impose our own time limits on how much we retry and so forth.

Is our understanding correct?

Thanks!

mbroadst commented Aug 16, 2018

Hi @boutell, answers inline:

  • Autoreconnect: no good with replica sets for all the reasons I already summarized. Fine here.

Yep!

  • However, everything I was hoping about "the next request" working was completely wrong. (: Specifically, the driver will never start trying requests on another node on its own, even though it knows a list of nodes via the URI (old style) or via mongodb+srv DNS records (new style).

This is incorrect. The driver does do server discovery and monitoring, and keeps an internal up-to-date list of servers in the topology. Each time you execute an operation, a server is selected from the internal topology description based on the provided read preference (or primary by default).
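
For instance (just a sketch; the query and read preference here are arbitrary examples), a read preference can be supplied per operation, and server selection picks an eligible member each time it runs:

// Sketch: a per-operation read preference on a Collection instance;
// server selection picks an eligible member of the topology each time
// this runs. The query and preference are arbitrary examples.
collection.find({ type: 'llama' }, { readPreference: 'secondaryPreferred' })
  .toArray((err, docs) => {
    // docs came back from whichever eligible server was selected
  });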

This is what I was referring to before when I provided the initial pseudocode for retryability:

  • call the operation (let's say collection.insert)
  • the operation selects the current primary
  • an error occurs, and collection.insert is called again (per the above pseudocode)
  • the operation selects the new primary

  • This means that we must, at some point, respond to errors relating to the network or a disabled node by forcing the driver to perform server selection. Simply retrying inserts/updates/reads will never do this (except possibly using "retryable writes" in the latest driver and mongodb version).

This is incorrect; see above. I think the rest of your questions are operating on this assumption, so I'll stop here. Retryable writes actually implement something close to the pseudo-code I initially provided, but they also use a server-side transaction number to ensure write-at-most-once semantics.

  • So we need to choose the right way of forcing server selection with existing drivers.
  • We could simply open a new connection with the original URI. This would work and would be
    upwardly compatible with mongodb+srv, however it may not be the most efficient option available.
  • Specifically, querying the topology and healing the connection in some way might be a bit more efficient. However, this layer is "in flux" as you earlier mentioned.
  • So the most backwards and forwards compatible approach is to try opening new connections in these scenarios, and then retrying the operations.
  • We can then impose our own time limits on how much we retry and so forth.
mbroadst commented Aug 17, 2018

Ah, that's a huge help, thank you. So some developers think the driver does
more than it does (the autoreconnect misunderstanding), and others think it
does less than it does (:

Indeed! Unfortunately, it's proven somewhat difficult to find time to properly document all of this in my short tenure here. My improved SDAM layer should also alleviate much of this, and will include design documentation - more on that later.

So a reasonable strategy would be:

  • Don't use autoreconnect.

IMHO, if your goal is resiliency then I would simply never use autoreconnect. There is a marginal case for its use with a single server connection, but only just so.

  • If an individual request fails, and the error code smells like it's
    network-y or broken-node-y, simply try that request again (up to some limit
    of our choosing).

Yep! But I want to make some things very clear: this implementation of retryability (all client side) is subject to errors for writes specifically. The design of retryable writes requires a server-side transaction number, which allows us to verify that a write was made to the oplog at most one time. If you implement retryability on the client side using the pseudo-code I provided above, you run the risk of having multiple writes reach the oplog.
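
For completeness, here is a sketch of enabling retryable writes from the connection string (assumes a driver release that supports them and a MongoDB 3.6+ replica set; the hosts, replica set name, database and collection names are placeholders):

// Sketch: enabling retryable writes via the URI (assumes a driver release
// that supports them and a MongoDB 3.6+ replica set; hosts and names here
// are placeholders).
const { MongoClient } = require('mongodb');

const uri = 'mongodb://localhost:27018,localhost:27019,localhost:27020/cool?replicaSet=rs&retryWrites=true';

MongoClient.connect(uri, (err, client) => {
  if (err) {
    return console.error(err);
  }
  client.db('cool').collection('llamas').insertOne({ name: 'test' }, (err, res) => {
    // a transient failure during this write is retried once by the driver,
    // and the server-side transaction number keeps it at-most-once
  });
});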

We can do that.

Just to confirm, this means that as long as you're connected to a replica
set, the driver is capable of eventually getting "back on the air" even if
connection is completely lost to all of the nodes for a period of time? To
the point where the TCP connection(s) close(s)?

Yes. If you have a spare afternoon (ha ha), you might want to peruse our Server Discovery and Monitoring specification. The node driver presently implements most of this specification, and will maintain an active monitoring thread for each seed in a seed list for the duration of your application. During this time, it will continuously update its internal knowledge of the topology, and use up-to-date information each time an operation is executed. This provides high availability, and because the isMaster responses from each node in the replica set contain knowledge about new and removed members, the internal state of the driver keeps up even with added nodes not in the initial seed list.
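
As an illustration (a sketch; assumes a 3.x driver release that emits the SDAM monitoring events described in that specification on the client), you can watch that internal knowledge update:

// Sketch: observing the driver's internal view of the topology via the
// SDAM monitoring events (assumes a 3.x driver release that emits these
// events on the MongoClient).
const { MongoClient } = require('mongodb');

MongoClient.connect('mongodb://localhost:27018,localhost:27019,localhost:27020/cool', (err, client) => {
  if (err) {
    return console.error(err);
  }

  client.on('serverDescriptionChanged', event => {
    // e.g. a member transitioning to Unknown when it goes down
    console.log('server changed:', event.address, event.newDescription.type);
  });

  client.on('topologyDescriptionChanged', event => {
    // the driver's current picture of the whole replica set
    console.log('topology changed:', event.newDescription.type);
  });
});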

That is, if we connect with this URI, the driver can figure it out and try
other nodes for as long as it has to, and eventually even notice the
original node is back:

mongodb://localhost:27018,localhost:27019,localhost:27020

(Or mongodb+srv, of course)

Unfortunately, mongos instances do not naturally monitor their topology, so the plasticity described above does not apply to them. If the initially provided seeds are mongos instances, that seed list will remain the static list of monitored seeds in the driver for the duration of the application.

But with a single-server URI like this, we would have to use the
autoreconnect option, or else reconnect ourselves:

mongodb://localhost:27017

A little more background on the work that is presently going into the driver. The mongo driver right now has a sort of broken concept of "connecting" to a topology (e.g. MongoClient.connect). What's really going on when you provide your connection string is that the driver is parsing it, finding individual seeds, handshaking and starting monitors for each of them, building an internal view of the topology and creating a connection pool for each known node. When you execute an operation, it first does server selection based on your read preference (by default this is primary), selects that server, then requests one connection from the pool associated with the server and sends the request along. The new SDAM layer actually allows us to drop MongoClient.connect completely:

const client = new MongoClient('mongodb://localhost:27017');
client.db('foo').collection('bar').insertOne({ some: 'document' }, (err, res) => {
  // do something
});

And this is a purer form of what we were talking about above - there is no real concept of "connecting". You are simply always asking to execute some operation, against some preference of server, and it's up to the driver to figure it out for you.

Finally, I mentioned retryable writes above, but in the next two quarters we will be specifying and implementing retryable reads. This will boil down to the same thing you are likely to be implementing soon for Apostrophe (if that's the path you choose), in that it will retry a read operation (only the initial read, not a getMore) if some sort of "retryable error" occurs (network blip, not master, etc).

Hope that helps


boutell added a commit that referenced this issue Sep 4, 2018

2.65.0:
* **Important fix for MongoDB replica sets:** previously we used the `autoReconnect` option of the MongoDB driver by default. From now on, we use it only if the MongoDB URI does not refer to a replica set. The use of `autoReconnect` is [inappropriate with a replica set](#1508) because it will keep trying to connect to the node that went down. Leaving this option out results in automatic use of nodes that are up. Also see the [apostrophe-db-mongo-3-driver](https://npmjs.org/package/apostrophe-db-mongo-3-driver) module for a way to use the newer `mongodb+srv` URIs. Thanks to Matt Broadstone of MongoDB for his advice.

* An `apostrophe-file` now has a default URL. The default `_url` property of an `apostrophe-file` piece is simply the URL of the file itself. This allows `apostrophe-file` to be included in your configuration for [apostrophe-permalinks](https://npmjs.org/package/apostrophe-permalinks); picking a PDF in this way generates a direct link to the PDF, which is what the user expects. Note that if the developer elects to set up an `apostrophe-files-pages` module that extends `apostrophe-pieces-pages`, that will still take precedence, so there is no bc break.

* Clicking directly from one rich text widget into another did not work properly; the toolbar did not appear in this situation. This bug has been fixed. The bug only occurred when clicking in a second rich text widget without any intervening clicks outside of all rich text widgets.