Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Allow accents in file names #17

Open
wants to merge 2 commits into
base: master
Choose a base branch
from
Open

Conversation

xsuchy
Copy link
Contributor

@xsuchy xsuchy commented Nov 24, 2012

Closes: #16

@basak
Copy link
Owner

basak commented Nov 25, 2012

Thanks for working on this!

There are a few points which I think stop me from merging this quite yet. My understanding of this feature is that we need to:

  1. Since we're in Python 2 (as boto currently only supports 2), incoming command line arguments are of type str, so if --transcode-names is specified then first we need to convert name arguments to unicode using .decode(sys.getdefaultencoding()).
  2. Then anything we send to Amazon, together with any operations on the cache need to happen on these unicode names encoded with .encode(args.transcode_names) first.
  3. It would make sense for the cache to remain in the same encoding as the vaults, since then the cache is always valid and never dependent on the --transcode-names= encoding used on a particular invocation.
  4. Output to the terminal that refers to archive names should be in the user's own locale encoding, which Python will do for us automatically if we print unicode strings as normal.
  5. The ref: and name: prefixes still need to work, both for input and output. This one gets a bit messy, since it means that the Cache class either needs to be aware of the encoding going on (since _archive_ref is defined in there, or we somehow need to pull that handling out of that class.

I am confused as to why you seem to have encode and decode reversed in this proposal. Is this because I've mixed up steps 1 and 2 here? I tried fixing this up as I expected this to work, and I came up with this: http://paste.ubuntu.com/1385888/; but I haven't done anything for item 5 yet. I will continue to work on this when I can.

this will encode name to utf8 before sending
closes: basak#16
which is more flexible and allow user to decide which codec will be used to encode characters with ord() > 125
@xsuchy
Copy link
Contributor Author

xsuchy commented Oct 23, 2013

rebased to current master

@xsuchy
Copy link
Contributor Author

xsuchy commented Oct 23, 2013

Addressing your comments (after long pause):

  1. you can not use sys.getdefaultencoding(), because that return 'ascii' and it will cause
>>> a='ýíáží'
>>> a.decode('ascii')
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
UnicodeDecodeError: 'ascii' codec can't decode byte 0xc3 in position 0: ordinal not in range(128)
  1. yes, that is what I'm doing - in archive_upload()
if not isinstance(name, unicode) and self.args.transcode_names:
    name = name.decode(self.args.transcode_names)

which will convert name to unicode and boto itself will then correctly handle upload of unicode names (u'foo')

  1. Hmm, true..., but the fact is that if you use filenames with ord(128)+, you could not upload it at all without --transcode-name, so it will neither never appear in vault nor in cache.
  2. I recall I had some problem, but IIRC it was cause by those raise "foo %r" and once I replaced %r with %s then I thing we can leave out all encode and it will work as well. But I'm used to decode on input, encode on output - which is generally recommended as it can save you from potential problems.
  3. I'm not sure I follow here.

decode convert type(str) to unicode. I do that before upload and boto will handle upload of unicode correctly. But only if it is u'foo' and not 'foo'. If you can not flip it, but you can do boto work by decoding and then encoding it again, but it will store not so expected string because

>>> 'ýíáží'.decode('utf-8').encode('utf-8')
'\xc3\xbd\xc3\xad\xc3\xa1\xc5\xbe\xc3\xad'

@aspiers
Copy link

aspiers commented Apr 25, 2015

@basak Was there still something blocking merge of this?

@xsuchy
Copy link
Contributor Author

xsuchy commented Apr 16, 2016

I rebased the patch to master and add there some python3 compatibility code.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Could not upload files with diacritics in name
3 participants