Skip to content
This repository was archived by the owner on May 25, 2021. It is now read-only.

Conversation

@dholth
Copy link

@dholth dholth commented Dec 3, 2014

This is a rebase for https://issues.apache.org/jira/browse/COUCHDB-2310 against post-1.6.x CouchDB.

The idea is that when it is finished every GET option should eventually be supported in the _bulk_get API, making CouchDB easier to learn than a hypothetical _bulk_get API supporting only a subtly different subset of the GET API.

Is it acceptable to refactor the GET handling?

Is anyone willing to help get this polished and accepted into CouchDB?

I wrote this patch because without it CouchDB replication to PouchDB is too slow to be useful. After this patch, formerly impractical replications that would take ~10,000 requests now take a pleasantly acceptable ~100 requests.

https://github.com/dholth/pouchdb-bulk-get is a PouchDB plugin for the other end.

Thanks.

@kxepal
Copy link
Member

kxepal commented Dec 3, 2014

You made it!

However, while for 1.x it will work fine, for 2.0 your patch is need in a bit more work since you'd add node level HTTP handler, not the cluster one which is served by apache/couchdb-chttpd module so feature in current implementation wouldn't be accessible for the rest world in common 2.0 cluster setup.

@jo
Copy link

jo commented Dec 3, 2014

Thanks for this work!

@nolanlawson
Copy link
Member

@dholth This is awesome.

So I take it this is the patch to PouchDB so that it doesn't have to be a plugin? Once this commit is merged into CouchDB, we'll be happy to accept a patch, although unfortunately we'll also need some kind of logic to check if the server supports bulk_get at all; probably a GET followed by checking for a 404 will be good enough.

@benoitc
Copy link
Contributor

benoitc commented Dec 18, 2014

Just to say that rcouch include a bulk_get api actully compatible with couchbase lite:

http://docs.rcouch.org/en/latest/api/database/bulk-api.html#post--db-_bulk_get

It is actually compatible with couchdb 1.6 and used in production. It has been ported to the new master of couchdb in the the coming beta of rcouch. It is planned on our side to propose a patch as well to couchdb master. Maybe we could find a way to have the same protocol used everywhere. Not sure actually that pouchdb is compatible with couchbase lite.

@nolanlawson
Copy link
Member

@benoitc This is news to me; I don't think anyone even mentioned it in COUCHDB-2310.

If there's an existing standard that CBLite and rcouch are using, then I'd definitely prefer to implement that one for PouchDB.

It is actually compatible with couchdb 1.6

Not seeing it in my own Couch 1.6.0, maybe I'm doing something wrong?

curl -X POST localhost:5984/test/_bulk_get?revs=true -H 'Content-type:application/json' -d '{docs: []}'

I get "Referer header required," which seems to be the same error if you try to post to a non-existing underscore-starting path (e.g. /test/_foobar).

@kxepal
Copy link
Member

kxepal commented Dec 18, 2014

@nolanlawson compatible with couchdb 1.6 is about rcouch codebase itself I believe. There is no bulk_get in any of CouchDB version, but we can make it in 1.7. For 2.0 there is need a bit more work for that.

@rnewson
Copy link
Member

rnewson commented Dec 18, 2014

any change to the http layer unfortunately has to be done twice to be acceptable for inclusion. once here, and again in https://github.com/apache/couchdb-chttpd. This change will only affect the node-local (and 127.0.0.1 by default) interface in CouchDB 2.0.

@nolanlawson
Copy link
Member

@rnewson OK, but do we all agree that @benoitc's spec is the one to go with? It seems like it meets all the requirements we discussed in COUCHDB-2310.

Unfortunately @dholth that would mean a bit of a rewrite, since your spec is similar but not exactly the same. What do you think?

@nolanlawson
Copy link
Member

@benoitc One worry: it looks like your spec only returns type multipart/related, even if the client only says Accept: application/json – is that right? I ask because PouchDB hasn't implemented multipart support yet.

@benoitc
Copy link
Contributor

benoitc commented Dec 18, 2014

@nolanlawson it also works for Accept: Application/json:
https://github.com/rcouch/couchdb-couch-httpd/blob/master/src/couch_httpd_bulk_get.erl#L42

I should document it...

@dholth
Copy link
Author

dholth commented Dec 18, 2014

To add _bulk_get support to PouchDB you'd want to copy my version of
replicate.js into PouchDB itself. It has a couple of extra require()
calls that PouchDB would not need. Otherwise it is nearly identical to
the standard replicate.js.

If anything fails we just assume it's not available and switch to using
GET for the rest of the replication:
https://github.com/dholth/pouchdb-bulk-get/blob/master/replicate.js#L308

The patch file is outdated but replicate.js is up-to-date as of today.

Having someone else write the Erlang part is fine by me. Here's the
couchbase _bulk_get which appears to have been recently documented.
Claims to be multipart/mime only.

http://developer.couchbase.com/mobile/develop/references/sync-gateway/rest-api/database/post-bulk-get/index.html

I did think my design of allowing all GET options in _bulk_get was
easier to understand/document and it would be nice to be able to pass
common parameters as GET parameters to the POST _bulk_get request.
CouchDB has enough inconsistencies in its API as it is.

My CouchDB 2.0 patch attempts to allow all the GET options in _bulk_get
but I have not really tested it.

On Thu, Dec 18, 2014, at 03:22 PM, Benoit Chesneau wrote:

@nolanlawson[1] it also works for Accept: Application/json:
https://github.com/rcouch/couchdb-couch-httpd/blob/master/src/couch_httpd_bulk_get.erl#L42

I should document it...

— Reply to this email directly or view it on GitHub[2].

Links:

  1. https://github.com/nolanlawson
  2. rebased _bulk_get patch #18 (comment)

@kxepal
Copy link
Member

kxepal commented Apr 21, 2015

Hi there, short update.

I'd started porting this PR over 2.0. First note that in current state it doesn't works and breaks any document update, so don't try to merge it (:

Second, while fixing it I found that initial format doesn't assumes any possible errors:

$ echo '{"docs": [{"id": "foo"}, {"id": "bar"}]}' | http http://localhost:15986/test/_bulk_get  
[
    [{"_id": "foo", "_rev": "1-967a00dff5e02add41819138abb3284d"}],
    [{"_id": "bar", "_rev": "1-6aaf7080aad9d1a9482e07c46581dac0"}]
]

What the response should be if document id is missed or wrong or document not exists or some parameters are malformed? Oblivious solution is:

$ echo '{"docs": [{"id": "foo"}, {"id": "bar"}]}' | http http://localhost:15986/test/_bulk_get  
[
    [{"_id": "foo", "_rev": "1-967a00dff5e02add41819138abb3284d"}],
    [{"error": "not_found", "id": "bar", "reason": "not today"}]
]

But such array of arrays makes me sad panda.

I'd started to look on current implementations and found no agreement on what the response should looks like:

  • Couchbase assumes the response to be always multipart and request payload should contain array of objects where "id" field is mandatory, "rev" and "atts_since" are optional and no mention of the others. So, passing open_revs will not work for Couchbase or make it incompatible with.
  • RCouch returns JSON response in the fashion {"results": [{"id": "foo", "docs": [...]}, ...]} what is completely different from proposed patch and pouchdb-bulk-get implementation.
  • Each of these are incompatible with pouchdb-bulk-get project which was proposed as "standard implementation"

Ok, now I'm really confused now. If I make /_bulk_get compatible with PouchDB that will means that it will be not compatible with the Couchbase/RCouch and be very annoying for people who'll try use it outside of replication context (it's stupid to not).

If I start fixing oblivious flaws of proposed implementation, then we'll have yet another incompatible with the others /_bulk_get implementation. Not cool.

At this moment I want to flip the table and make this feature from scratch, starting first from the actual proposal about the API of final implementation, finding agreement on this in CouchDB team and call PouchDB and Couchbase folks to the table (once I put it back) for the further implementation as a part of replication API. And make this short and quickly since @dholth made all the work, current devil is in the details and shape of things.

Otherwise we (CouchDB) will end with semi-compatible with *ouchDB world resource that's broken for everyday usage. I'm -1 on such turn of events.

Certainly, I may be wrong at all of this, so any ideas are welcome.

Nudge list: @nolanlawson @dholth @janl @benoitc

@dholth
Copy link
Author

dholth commented Apr 21, 2015

I may be the only pouchdb bulk get user. A top level array is a little
better for streaming. Does get return json for 404? That is what bulk
get should do.

On Tue, Apr 21, 2015, at 08:52 AM, Alexander Shorin wrote:

Hi there, short update.

I'd started porting this PR over 2.0. First note that in current
state it doesn't works and breaks any document update, so don't try
to merge it (:

Second, while fixing it I found that initial format doesn't assumes
any possible errors:

$ echo '{"docs": [{"id": "foo"}, {"id": "bar"}]}' | http
http://localhost:15986/test/_bulk_get
[ [{"_id": "foo", "_rev": "1-967a00dff5e02add41819138abb3284d"}],
[{"_id": "bar", "_rev": "1-6aaf7080aad9d1a9482e07c46581dac0"}] ]

What the response should be if document id is missed or wrong or
document not exists or some parameters are malformed? Oblivious
solution is:

$ echo '{"docs": [{"id": "foo"}, {"id": "bar"}]}' | http
http://localhost:15986/test/_bulk_get
[ [{"_id": "foo", "_rev": "1-967a00dff5e02add41819138abb3284d"}],
[{"error": "not_found", "id": "bar", "reason": "not today"}] ]

But such array of arrays makes me sad panda.

I'd started to look on current implementations and found no agreement
on what the response should looks like:

  • Couchbase[1] assumes the response to be always multipart and
    request payload should contain array of objects where "id" field is
    mandatory, "rev" and "atts_since" are optional and no mention of
    the others. So, passing open_revs will not work for Couchbase or
    make it incompatible with.
  • RCouch[2] returns JSON response in the fashion {"results": [{"id":
    "foo", "docs": [...]}, ...]} what is completely different from
    proposed patch and pouchdb-bulk-get implementation.
  • Each of these are incompatible with pouchdb-bulk-get project which
    was proposed as "standard implementation"

Ok, now I'm really confused now. If I make /_bulk_get compatible with
PouchDB that will means that it will be not compatible with the
Couchbase/RCouch and be very annoying for people who'll try use it
outside of replication context (it's stupid to not).

If I start fixing oblivious flaws of proposed implementation, then
we'll have yet another incompatible with the others /_bulk_get
implementation. Not cool.

At this moment I want to flip the table and make this feature from
scratch, starting first from the actual proposal about the API of
final implementation, finding agreement on this in CouchDB team and
call PouchDB and Couchbase folks to the table (once I put it back) for
the further implementation as a part of replication API. And make this
short and quickly since @dholth[3] made all the work, current devil is
in the details and shape of things.

Otherwise we (CouchDB) will end with semi-compatible with *ouchDB
world resource that's broken for everyday usage. I'm -1 on such turn
of events.

Certainly, I may be wrong at all of this, so any ideas are welcome.

Nudge list: @nolanlawson[4] @dholth[5] @janl[6] @benoitc[7]

— Reply to this email directly or view it on GitHub[8].

Links:

  1. http://developer.couchbase.com/mobile/develop/references/sync-gateway/rest-api/database/post-bulk-get/index.html
  2. https://github.com/rcouch/couchdb-couch-httpd/blob/master/src/couch_httpd_bulk_get.erl#L45
  3. https://github.com/dholth
  4. https://github.com/nolanlawson
  5. https://github.com/dholth
  6. https://github.com/janl
  7. https://github.com/benoitc
  8. rebased _bulk_get patch #18 (comment)

@dholth
Copy link
Author

dholth commented Apr 21, 2015

However assuming RCouch has users why not do their json. Double check to
see what arguments PouchDB _bulk_get actually uses.

On Tue, Apr 21, 2015, at 09:08 AM, Daniel Holth wrote:

I may be the only pouchdb bulk get user. A top level array is a little
better for streaming. Does get return json for 404? That is what bulk
get should do.

On Tue, Apr 21, 2015, at 08:52 AM, Alexander Shorin wrote:

Hi there, short update.

I'd started porting this PR over 2.0. First note that in current
state it doesn't works and breaks any document update, so don't try
to merge it (:

Second, while fixing it I found that initial format doesn't assumes
any possible errors:

$ echo '{"docs": [{"id": "foo"}, {"id": "bar"}]}' | http
http://localhost:15986/test/_bulk_get
[ [{"_id": "foo", "_rev": "1-967a00dff5e02add41819138abb3284d"}],
[{"_id": "bar", "_rev": "1-6aaf7080aad9d1a9482e07c46581dac0"}] ]

What the response should be if document id is missed or wrong or
document not exists or some parameters are malformed? Oblivious
solution is:

$ echo '{"docs": [{"id": "foo"}, {"id": "bar"}]}' | http
http://localhost:15986/test/_bulk_get
[ [{"_id": "foo", "_rev": "1-967a00dff5e02add41819138abb3284d"}],
[{"error": "not_found", "id": "bar", "reason": "not today"}] ]

But such array of arrays makes me sad panda.

I'd started to look on current implementations and found no agreement
on what the response should looks like:

  • Couchbase[1] assumes the response to be always multipart and
    request payload should contain array of objects where "id" field
    is mandatory, "rev" and "atts_since" are optional and no mention
    of the others. So, passing open_revs will not work for Couchbase
    or make it incompatible with.
  • RCouch[2] returns JSON response in the fashion {"results": [{"id":
    "foo", "docs": [...]}, ...]} what is completely different from
    proposed patch and pouchdb-bulk-get implementation.
  • Each of these are incompatible with pouchdb-bulk-get project which
    was proposed as "standard implementation"

Ok, now I'm really confused now. If I make /_bulk_get compatible with
PouchDB that will means that it will be not compatible with the
Couchbase/RCouch and be very annoying for people who'll try use it
outside of replication context (it's stupid to not).

If I start fixing oblivious flaws of proposed implementation, then
we'll have yet another incompatible with the others /_bulk_get
implementation. Not cool.

At this moment I want to flip the table and make this feature from
scratch, starting first from the actual proposal about the API of
final implementation, finding agreement on this in CouchDB team and
call PouchDB and Couchbase folks to the table (once I put it back)
for the further implementation as a part of replication API. And make
this short and quickly since @dholth[3] made all the work, current
devil is in the details and shape of things.

Otherwise we (CouchDB) will end with semi-compatible with *ouchDB
world resource that's broken for everyday usage. I'm -1 on such turn
of events.

Certainly, I may be wrong at all of this, so any ideas are welcome.

Nudge list: @nolanlawson[4] @dholth[5] @janl[6] @benoitc[7]

— Reply to this email directly or view it on GitHub[8].

Links:

  1. http://developer.couchbase.com/mobile/develop/references/sync-gateway/rest-api/database/post-bulk-get/index.html
  2. https://github.com/rcouch/couchdb-couch-httpd/blob/master/src/couch_httpd_bulk_get.erl#L45
  3. https://github.com/dholth
  4. https://github.com/nolanlawson
  5. https://github.com/dholth
  6. https://github.com/janl
  7. https://github.com/benoitc
  8. rebased _bulk_get patch #18 (comment)

@kxepal
Copy link
Member

kxepal commented Apr 21, 2015

I may be the only pouchdb bulk get user. A top level array is a little better for streaming.

Agree about streaming.

Does get return json for 404? That is what bulk get should do.

It cannot, because of streaming nature. Like /_bulk_docs, it sends response status 200 and headers before any payload processing (otherwise you'll loose streaming nature) what means that in case of any errors you cannot change response status to 404 / 500 or whatever. All you can is to send error as object in the response stream (as /_bulk_docs does). However, PouchDB-Get is not able to handle such case and will count these errors as docs.

@kxepal
Copy link
Member

kxepal commented Apr 21, 2015

However assuming RCouch has users why not do their json. Double check to see what arguments PouchDB _bulk_get actually uses.

Well, I'm fine to break compatibility, but I wonder for which side I have to do this.

@dholth
Copy link
Author

dholth commented Apr 21, 2015

The principle behind my _bulk_get is that it is exactly lots of GET
returned in an array, the first implementation was even written as a lua
intermediary between nginx and CouchDB.

This "exactly GET" property makes _bulk_get easier to document and
understand. All other implementations seemed to want to remove some of
the weird GET options that no one understands.

For a 404, the JSON response is:

{"error":"not_found","reason":"missing"}

So _bulk_get would return exactly that and maybe add an id field, or you
might have to remember the order of the original request.

[ [{"_id": "foo", "_rev": "1-967a00dff5e02add41819138abb3284d"}],
[{"error": "not_found", "_id": "bar", "reason": "missing"}] ]

I checked, and unfortunately my _bulk_get appears to fail entirely if an
element is not found.

I'm not sure about the array of arrays. Does GET sometimes return an
array, and sometimes it does not?

Thanks!

On Tue, Apr 21, 2015, at 09:22 AM, Alexander Shorin wrote:

However assuming RCouch has users why not do their json. Double check
to see what arguments PouchDB _bulk_get actually uses.

Well, I'm fine to break compatibility, but I wonder for which side I
have to do this.

— Reply to this email directly or view it on GitHub[1].

Links:

  1. rebased _bulk_get patch #18 (comment)

@kxepal
Copy link
Member

kxepal commented Apr 21, 2015

@dholth About the errors...how do you feel with https://github.com/couchbase/sync_gateway/wiki/Bulk-GET ?

Specifically:

If there's an error getting a document revision, most likely because it doesn't exist, its corresponding JSON body in the response will contain only the properties "id", "error", "reason" and "status", just as in a response from _all_docs.

Thought status isn't applicable for CouchDB nor PouchDB (afaik).

I'm not sure about the array of arrays. Does GET sometimes return an array, and sometimes it does not?

Yes, if you request document with open_revs query param and explicit application/json Accept header.

@nolanlawson
Copy link
Member

If I make /_bulk_get compatible with PouchDB

As @dholth said, his implementation is a PouchDB plugin. PouchDB does not depend on it.

A top level array is a little better for streaming.

The streaming stuff (pouchdb-replication-stream, pouchdb-load, socket-pouch, etc.) is a parallel effort . I consider it a long-term plan, whereas _bulk_get/_bulk_revs is the short-term fix.

Well, I'm fine to break compatibility,

The stance of PouchDB (and I think @daleharvey would agree) is that we will implement whatever CouchDB implements. CouchDB is our gold standard, not RCouch or Couchbase.

That being said, if Couchbase has a solution that handles errors (even if a bit inelegantly), then I think we should prefer their version instead of inventing a third version. RCouch has few users (AFAIK), but Couchbase Mobile is pretty popular for iOS/Android development. So we would benefit by tapping in to their ecosystem.

PouchDB's only point of contention with the Couchbase API is that we need a JSON response rather than a multipart/mixed response. Basically if the client says Accept: application/json, then the server has to return JSON.

@kxepal
Copy link
Member

kxepal commented Apr 21, 2015

@nolanlawson Thanks for the input. Ok, seems things getting simpler now, since I was confused different implementations. Good, I'll stay with Couchbase implementation then with RCouch improvements to support application/json response. Basically, that means to backport RCouch implementation (: And everyone should be happy then.

@nolanlawson
Copy link
Member

👍 😺

@kxepal
Copy link
Member

kxepal commented Oct 7, 2015

@dholth Hi! could you close your PR here please? This feature is available via chttpd.

@dholth
Copy link
Author

dholth commented Oct 7, 2015

Hooray for improved replication.

@dholth dholth closed this Oct 7, 2015
janl pushed a commit to janl/couchdb-couch that referenced this pull request Nov 28, 2015
If chttpd got a {timeout, Error} error then it would return
that tuple to the client. Since Error is some internal specifics
this isn't particularly useful information to return to the client.

This commit converts it to our usual timeout response.

This closes apache#18

COUCHDB-2425

Signed-off-by: Alexander Shorin <kxepal@apache.org>
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

6 participants