Performance regression querying changes using _doc_ids filter #1737

Closed
garethbowen opened this Issue Nov 13, 2018 · 6 comments


garethbowen commented Nov 13, 2018

When upgrading from v1.7.1 to v2.2.0 I noticed our replication was taking longer. I investigated further and found the problem was specifically in relation to the initial request for changes. We use the _doc_ids filter so we can replicate only certain documents and this has been stable and performant on the 1.x versions.
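For readers unfamiliar with the filter, here is a minimal sketch of what such a changes request might look like. It only builds the request and does not send it. The function name, host, and database name are placeholders for illustration and are not taken from the script referenced in this issue.

```javascript
// Builds (but does not send) a _changes request restricted to specific doc IDs.
// CouchDB accepts the ID list as a JSON body on POST /{db}/_changes?filter=_doc_ids.
// buildDocIdsChangesRequest, baseUrl and dbName are hypothetical names for this sketch.
function buildDocIdsChangesRequest(baseUrl, dbName, docIds) {
  // Ensure the base ends with a slash so the relative path resolves under it.
  const base = baseUrl.endsWith('/') ? baseUrl : baseUrl + '/';
  const url = new URL(`${encodeURIComponent(dbName)}/_changes`, base);
  url.searchParams.set('filter', '_doc_ids');
  return {
    url: url.toString(),
    method: 'POST',
    headers: { 'Content-Type': 'application/json' },
    body: JSON.stringify({ doc_ids: docIds }),
  };
}

// Example: request changes for two documents in a local test database.
const request = buildDocIdsChangesRequest('http://localhost:5984', 'testdb', ['doc-1', 'doc-2']);
console.log(request.method, request.url);
console.log(request.body);
```

The returned object can be passed to whichever HTTP client you use; timing that call is how the numbers below were compared in spirit, though the exact measurement method here is an assumption.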

Steps to Reproduce

Use this node script to create a database and fill it with 1 million docs and then query it for specific IDs.

Context

In my testing using the above script I got responses in 1 to 2ms on v1.7.1 and 2500 to 2600ms on 2.2.0.
In our production database with real world data and about 8 million docs it takes less than a second on v1.7.1 and around 40 seconds on v2.2.0.

This has affected real world performance for users trying to replicate their data.

Your Environment

What I've tried

  • changing clustering from n=3 q=8 to n=1 q=1 - no improvement
  • setting changes_doc_ids_optimization_threshold to 1, 100, and a very large value - no improvement
  • using a very large seq_interval parameter - no improvement
  • using a mango selector instead of _doc_ids - worse performance

Member

rnewson commented Nov 14, 2018

ok, I think I get this finally. Before 2.0, we had an optimization for the _doc_ids and _design filters:

commit bfa0a8900163edd4f85c7bbf5b595de9885cfbf9
Author: Filipe David Borba Manana <fdmanana@apache.org>
Date:   Tue Sep 20 22:55:29 2011 +0000

    Efficient implementation of builtin filters

    Currently, the builtin changes filters "_doc_ids" and "_design"
    are not very efficient because they fold the entire seq btree
    and then filter the values by document ID.
    This implementation avoids that by doing direct lookups against
    the id btree, and then, for continuous changes requests, it
    just listens for database update events and does partial seq
    btree folds.

    COUCHDB-1288

The clustered code for _changes does not use it.
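To make the cost difference concrete, here is a toy model of the two strategies the commit message describes. It is not CouchDB code: a plain array stands in for the seq btree and a `Map` stands in for the id btree, and all names are made up for this sketch. Sorting the lookup results by sequence is an assumption made so both functions return rows in the same order.

```javascript
// Toy model only: seqIndex is an array ordered by update sequence,
// idIndex is a Map keyed by document ID. Neither is a real btree.

// Strategy 1: fold the whole sequence index and filter by ID.
// Work grows with the number of documents in the database.
function changesByFullFold(seqIndex, docIds) {
  const wanted = new Set(docIds);
  const rows = [];
  let visited = 0;
  for (const entry of seqIndex) {
    visited++;
    if (wanted.has(entry.id)) rows.push(entry);
  }
  return { rows, visited };
}

// Strategy 2: look each requested ID up directly in the ID index.
// Work grows with the number of requested IDs.
function changesByIdLookup(idIndex, docIds) {
  const rows = [];
  let visited = 0;
  for (const id of docIds) {
    visited++;
    const entry = idIndex.get(id);
    if (entry) rows.push(entry);
  }
  rows.sort((a, b) => a.seq - b.seq); // keep sequence order, as a changes feed would
  return { rows, visited };
}

// Build a small sample database and compare how many entries each strategy touches.
const seqIndex = [];
const idIndex = new Map();
for (let seq = 1; seq <= 1000; seq++) {
  const entry = { seq, id: `doc-${seq}` };
  seqIndex.push(entry);
  idIndex.set(entry.id, entry);
}
const wantedIds = ['doc-900', 'doc-5'];
console.log('fold visited:', changesByFullFold(seqIndex, wantedIds).visited);
console.log('lookup visited:', changesByIdLookup(idIndex, wantedIds).visited);
```

With the fold, the number of entries visited tracks the database size, which is consistent with the slowdown growing from 1 million to 8 million docs in the report above.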


Member

rnewson commented Nov 14, 2018

we pass {default,main_only} down as the filter, so the optimization does not fire.

It's not yet clear if this can be restored given how the clustered version of changes feed has to work, but at least this is progress.


Member

rnewson commented Nov 14, 2018

so fabric_rpc:changes always calls couch_db:fold_changes.


Member

wohali commented Nov 29, 2018

@garethbowen This is fixed by #1771 and will appear in CouchDB 2.3.0.

wohali closed this Nov 29, 2018

wohali added this to the 2.3.0 milestone Nov 29, 2018


garethbowen commented Nov 29, 2018

Awesome. Thanks for the quick turnaround!


garethbowen commented Dec 3, 2018

I ran the test script above against 2.3.0-RC1 and got responses in 4 to 14ms which is slightly slower than 1.7.1 (1 to 2ms) and massively faster than 2.2.0 (2500 to 2600ms). Great work @janl !
