Skip to content

Treat image attachments as binary blobs;tools/dump fetch attachments #246

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Closed
wants to merge 1 commit into from

Conversation

wbtuomela
Copy link

The _all_docs view only returns attachment stubs and doesn't accept the attachments=true attribute so I changed the dump tool to fetch full documents in the main loop. Also added guard check to mulitpart writer to not attempt to decode images and not bloat them with a utf-8 encoding. There is probably a better way to detect binary mimetypes, if pointed in the right direction I can improve that check.

dump, since the view wont include the data field even with
attachments=true&include_docs=true
elif 'charset' not in params:
# exclude images from being treated as a string
# XXX: is there a better way to do this??
elif 'charset' not in params and 'image/' not in ctype:
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think there is need to make reverse check to encode any text mimes text/* plus few common ones like application/json, application/xml counting everyone else as binaries.

@djc
Copy link
Owner

djc commented Oct 22, 2014

@kxepal did you see my earlier comment about the mimetype check? I think your approach would be an improvement over the current code, but I'm a little bit worried about have a hard-coded list of exceptional mimetypes.

@kxepal
Copy link
Collaborator

kxepal commented Oct 22, 2014

@djc oh, sorry - I didn't received any notification about, just this one.

If the goal to keep the requests down then the _all_docs view needs to be replaced with a temporary view with attachments=true.

No need for: _all_docs are also accepts attachments=true parameter, but this would limit supported version up to 1.5 release IIRC. Also, this will cause significant overhead since you'll have to decode attachments from base64 encoding. Using old good way about fetching doc with attachments as multipart may be slow, but more tolerate for system resources.

I think your approach would be an improvement over the current code, but I'm a little bit worried about have a hard-coded list of exceptional mimetypes.

They aren't so much to worry about. However, more wise solution may be always to leave attachments as is in binary form. Do we have any reasons to work with some of them as with text?

@djc
Copy link
Owner

djc commented Oct 22, 2014

Ah yeah, so @traxxas, are you on something pre-1.5? Are you in a position to upgrade to 1.5?

As for relying on attachments=true versus retrieving the attachments per document, the relative efficiency would very much depend on how many attachments there are in the common case as well as how large they are. If there are many (smaller) attachments, retrieving them all at once, then decoding should be more efficient. If there are just a few, then retrieving the attachments separately is probably faster. However, we don't know this in advance. I guess in this case maybe doing separate retrieval makes sense, since it also alleviates the compatibility concerns with pre-1.5 CouchDB; and it feels like attachments are probably relatively rare even in databases that use them.

It may make sense to leave attachments in binary data, but of course the MultipartWriter doesn't have a clear notion of whether a content block represents an attachment or not. Actually, I'm not sure it even makes sense to add a ;charset param to the mimetype if content is not a util.utype (some of that code got added during the Python 3 port where I may have had limited context). Something like the below does run into test failures, but if we don't actually have a charset in the provided mimetype, perhaps we shouldn't be guessing about it. (We should probably add a test which does set the charset in the provided mimetype, to make sure that works correctly.)

diff --git a/couchdb/multipart.py b/couchdb/multipart.py
index 9d4fc78..392db10 100644
--- a/couchdb/multipart.py
+++ b/couchdb/multipart.py
@@ -150,11 +150,6 @@ class MultipartWriter(object):
             else:
                 content = content.encode('utf-8')
                 mimetype = mimetype + ';charset=utf-8'
-        elif 'charset' not in params:
-            try:
-                content.decode('utf-8')
-            finally:
-                mimetype = mimetype + ';charset=utf-8'

         headers['Content-Type'] = mimetype
         if content:

@djc
Copy link
Owner

djc commented Oct 27, 2014

@traxxas any progress?

djc added a commit that referenced this pull request Nov 1, 2014
@djc djc closed this in 4473b09 Nov 1, 2014
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants