
Change MongoDB Connection's interface to implement a .run() #975

Closed
sohkai opened this issue Dec 20, 2016 · 8 comments
sohkai (Contributor) commented Dec 20, 2016

From #890 (review).

pymongo's interface is very different from rethinkdb's: queries are run directly through the MongoDB connection object rather than through a .run() method. We should discuss whether Connections should still be expected to have a .run() method.

sohkai added this to the MongoDB integration milestone Dec 20, 2016
r-marques (Contributor) commented
I like what we do with rethinkdb because we have a single place to handle exceptions and deal with connection errors.

I have been looking at the pymongo documentation and I think it would be possible to do the same thing we do with rethinkdb.

pymongo's operations allow us to construct queries without executing them; we can then use bulk_write to execute them against the database.
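A rough sketch of the "construct now, execute later" pattern described above. Stand-in `InsertOne`/`UpdateOne` classes and a fake collection are used here so the example runs without pymongo or a live MongoDB server; in real code these would be `pymongo.InsertOne` / `pymongo.UpdateOne` passed to `Collection.bulk_write()`.

```python
class InsertOne:
    """Stand-in for pymongo.InsertOne: records a document to insert."""
    def __init__(self, document):
        self.document = document

class UpdateOne:
    """Stand-in for pymongo.UpdateOne: records a filter and an update."""
    def __init__(self, filter, update):
        self.filter = filter
        self.update = update

class FakeCollection:
    """Executes queued operations in one place, mirroring bulk_write()."""
    def __init__(self):
        self.docs = []

    def bulk_write(self, requests):
        # Single choke point: connection errors for *all* queued
        # operations could be handled here, much like rethinkdb's .run().
        for op in requests:
            if isinstance(op, InsertOne):
                self.docs.append(dict(op.document))
            elif isinstance(op, UpdateOne):
                for doc in self.docs:
                    if all(doc.get(k) == v for k, v in op.filter.items()):
                        doc.update(op.update["$set"])
        return len(requests)

# Build the operations first, without touching the database...
requests = [
    InsertOne({"_id": 1, "status": "new"}),
    UpdateOne({"_id": 1}, {"$set": {"status": "done"}}),
]
# ...then execute them all through one call.
coll = FakeCollection()
coll.bulk_write(requests)
```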

sohkai (Contributor, Author) commented Dec 20, 2016

From #977, when this is done, we should also port some of the RethinkDB connection tests to MongoDB's connection.

sohkai changed the title from "Revisit Connection.run()" to "Change MongoDB Connection's interface to implement a .run()" Dec 20, 2016
r-marques (Contributor) commented
I moved this issue to a different milestone.
At this point I don't know exactly what needs to be done; pymongo handles database connections very differently from the rethinkdb driver.

pymongo tries to reconnect by itself, so we don't need to handle that ourselves as we do with rethinkdb.
Another big difference is that with rethinkdb you connect to a specific node in a cluster. With mongodb you connect to the cluster itself, and the driver takes care of connecting to a new node when the current one goes down, etc.

This raises a lot of new questions that should be addressed when we actually run tests against a cluster.

r-marques (Contributor) commented
A note that we got from MongoDB regarding our connection code:

> Retry loop not needed, as the driver will either block until a connection is available or time out.

From the pymongo documentation:

> We get an AutoReconnect exception. This means that the driver was not able to connect to the old primary (which makes sense, as we killed the server), but that it will attempt to automatically reconnect on subsequent operations. When this exception is raised our application code needs to decide whether to retry the operation or to simply continue, accepting the fact that the operation might have failed.
>
> On subsequent attempts to run the query we might continue to see this exception. Eventually, however, the replica set will failover and elect a new primary (this should take no more than a couple of seconds in general). At that point the driver will connect to the new primary and the operation will succeed.

So the driver automatically reconnects if the connection goes down or the primary changes, but it still raises an AutoReconnect exception if we run a query while it is reconnecting.

> accepting the fact that the operation might have failed

This means that the operation may or may not have gone through, so for inserts it's important that we catch a DuplicateKeyError when retrying.
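A minimal sketch of the retry logic described above. Stand-in exception classes are defined in place of `pymongo.errors.AutoReconnect` and `pymongo.errors.DuplicateKeyError` so the example runs without a live replica set; the `run_with_retry` helper and the simulated insert are hypothetical.

```python
class AutoReconnect(Exception):
    """Stand-in: raised while the driver is reconnecting to a new primary."""

class DuplicateKeyError(Exception):
    """Stand-in: raised when an insert's _id already exists."""

def run_with_retry(operation, retries=5):
    """Retry an operation through a failover (hypothetical helper)."""
    for attempt in range(retries):
        try:
            return operation()
        except AutoReconnect:
            # Failover in progress; the previous attempt MAY have landed
            # before the connection dropped, so just try again.
            continue
        except DuplicateKeyError:
            # For inserts this means an earlier attempt actually went
            # through: treat it as success rather than an error.
            return None
    raise AutoReconnect("gave up after %d attempts" % retries)

# Simulate a failover: the first attempt drops mid-insert (but the write
# landed), so the second attempt sees the duplicate _id.
calls = []
def flaky_insert():
    calls.append(1)
    if len(calls) == 1:
        raise AutoReconnect()
    raise DuplicateKeyError()

run_with_retry(flaky_insert)  # completes without raising
```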

vrde (Contributor) commented Jan 23, 2017

> pymongo's operations allow us to construct queries without executing them; we can then use bulk_write to execute them against the database.

That's pretty interesting, but unfortunately there is no Find, FindOne, or Aggregate operation 😿.

vrde (Contributor) commented Jan 25, 2017

I've explored the bulk API, and it's too limited to support our use case: it only supports find and insert as a first operator, while we also use count, find_one, and aggregate.

vrde (Contributor) commented Jan 27, 2017

Here we go! Two PRs for this issue. I want to give some more context on the two solutions.

#1100 implements a new object called Lazy that can be used to describe a behavior and execute it later. The basic idea is to emulate the RethinkDB interface for querying the DB. Lazy objects allow composition, so complex queries can be built up across several lines.

#1109's approach is simpler: it just uses lambda functions to achieve similar behavior.
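For readers following along, the Lazy idea could be sketched roughly like this. This is a hypothetical, much-simplified version, not the actual #1100 implementation: it records attribute accesses, calls, and indexing without executing them, then replays the chain on a concrete object via `run()`.

```python
class Lazy:
    """Record a chain of operations now, replay it on a target later."""
    def __init__(self, steps=None):
        self.steps = steps or []

    def __getattr__(self, name):
        # Attribute access is recorded, not performed.
        return Lazy(self.steps + [("getattr", name, None)])

    def __getitem__(self, key):
        return Lazy(self.steps + [("getitem", key, None)])

    def __call__(self, *args, **kwargs):
        return Lazy(self.steps + [("call", args, kwargs)])

    def run(self, target):
        # Replay the recorded chain against a concrete object.
        for kind, a, b in self.steps:
            if kind == "getattr":
                target = getattr(target, a)
            elif kind == "getitem":
                target = target[a]
            else:
                target = target(*a, **b)
        return target

# Describe the query without executing it...
query = Lazy().split(',')[1].strip()
# ...then execute it later against real data.
result = query.run("a, b ,c")
```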

#1100 is one order of magnitude slower than #1109, but that's not a deal-breaker IMO:

```python
In [10]: def func1(): return Lazy().split(',')[1].split(' ').pop(1).strip()

In [11]: def func2(): return lambda x: x.split(',')[1].split(' ').pop(1).strip()

In [12]: %timeit func1()
The slowest run took 7.23 times longer than the fastest. This could mean that an intermediate result is being cached.
100000 loops, best of 3: 7.73 µs per loop

In [13]: %timeit func2()
The slowest run took 27.28 times longer than the fastest. This could mean that an intermediate result is being cached.
1000000 loops, best of 3: 138 ns per loop
```

#1100 uses less code (and this is a plus), while #1109 is more ergonomic. Ideas?

sohkai (Contributor, Author) commented Feb 10, 2017

Closing as #1100 implements this.

sohkai closed this as completed Feb 10, 2017