Treat image attachments as binary blobs;tools/dump fetch attachments #246

wbtuomela · 2014-10-22T00:30:57Z

The _all_docs view only returns attachment stubs and doesn't accept the attachments=true attribute so I changed the dump tool to fetch full documents in the main loop. Also added guard check to mulitpart writer to not attempt to decode images and not bloat them with a utf-8 encoding. There is probably a better way to detect binary mimetypes, if pointed in the right direction I can improve that check.

dump, since the view wont include the data field even with attachments=true&include_docs=true

kxepal · 2014-10-22T08:42:52Z

couchdb/multipart.py

-        elif 'charset' not in params:
+        # exclude images from being treated as a string
+        # XXX: is there a better way to do this??
+        elif 'charset' not in params and 'image/' not in ctype:


I think there is need to make reverse check to encode any text mimes text/* plus few common ones like application/json, application/xml counting everyone else as binaries.

djc · 2014-10-22T08:46:29Z

@kxepal did you see my earlier comment about the mimetype check? I think your approach would be an improvement over the current code, but I'm a little bit worried about have a hard-coded list of exceptional mimetypes.

kxepal · 2014-10-22T08:57:44Z

@djc oh, sorry - I didn't received any notification about, just this one.

If the goal to keep the requests down then the _all_docs view needs to be replaced with a temporary view with attachments=true.

No need for: _all_docs are also accepts attachments=true parameter, but this would limit supported version up to 1.5 release IIRC. Also, this will cause significant overhead since you'll have to decode attachments from base64 encoding. Using old good way about fetching doc with attachments as multipart may be slow, but more tolerate for system resources.

I think your approach would be an improvement over the current code, but I'm a little bit worried about have a hard-coded list of exceptional mimetypes.

They aren't so much to worry about. However, more wise solution may be always to leave attachments as is in binary form. Do we have any reasons to work with some of them as with text?

djc · 2014-10-22T12:35:14Z

Ah yeah, so @traxxas, are you on something pre-1.5? Are you in a position to upgrade to 1.5?

As for relying on attachments=true versus retrieving the attachments per document, the relative efficiency would very much depend on how many attachments there are in the common case as well as how large they are. If there are many (smaller) attachments, retrieving them all at once, then decoding should be more efficient. If there are just a few, then retrieving the attachments separately is probably faster. However, we don't know this in advance. I guess in this case maybe doing separate retrieval makes sense, since it also alleviates the compatibility concerns with pre-1.5 CouchDB; and it feels like attachments are probably relatively rare even in databases that use them.

It may make sense to leave attachments in binary data, but of course the MultipartWriter doesn't have a clear notion of whether a content block represents an attachment or not. Actually, I'm not sure it even makes sense to add a ;charset param to the mimetype if content is not a util.utype (some of that code got added during the Python 3 port where I may have had limited context). Something like the below does run into test failures, but if we don't actually have a charset in the provided mimetype, perhaps we shouldn't be guessing about it. (We should probably add a test which does set the charset in the provided mimetype, to make sure that works correctly.)

diff --git a/couchdb/multipart.py b/couchdb/multipart.py
index 9d4fc78..392db10 100644
--- a/couchdb/multipart.py
+++ b/couchdb/multipart.py
@@ -150,11 +150,6 @@ class MultipartWriter(object):
             else:
                 content = content.encode('utf-8')
                 mimetype = mimetype + ';charset=utf-8'
-        elif 'charset' not in params:
-            try:
-                content.decode('utf-8')
-            finally:
-                mimetype = mimetype + ';charset=utf-8'

         headers['Content-Type'] = mimetype
         if content:

djc · 2014-10-27T13:41:54Z

@traxxas any progress?

Treat image attachments as binary blobs. Fetch docs with attachments in

3108d02

dump, since the view wont include the data field even with attachments=true&include_docs=true

kxepal reviewed Oct 22, 2014
View reviewed changes

djc added a commit that referenced this pull request Nov 1, 2014

Don't try to decode binary data (see #246)

3dd954d

djc closed this in 4473b09 Nov 1, 2014

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

Treat image attachments as binary blobs;tools/dump fetch attachments #246

Treat image attachments as binary blobs;tools/dump fetch attachments #246

Uh oh!

wbtuomela commented Oct 22, 2014

Uh oh!

kxepal Oct 22, 2014

Uh oh!

djc commented Oct 22, 2014

Uh oh!

kxepal commented Oct 22, 2014

Uh oh!

djc commented Oct 22, 2014

Uh oh!

djc commented Oct 27, 2014

Uh oh!

Uh oh!

Treat image attachments as binary blobs;tools/dump fetch attachments #246

Treat image attachments as binary blobs;tools/dump fetch attachments #246

Uh oh!

Conversation

wbtuomela commented Oct 22, 2014

Uh oh!

kxepal Oct 22, 2014

Choose a reason for hiding this comment

Uh oh!

djc commented Oct 22, 2014

Uh oh!

kxepal commented Oct 22, 2014

Uh oh!

djc commented Oct 22, 2014

Uh oh!

djc commented Oct 27, 2014

Uh oh!

Uh oh!