Allow accents in file names #17

xsuchy · 2012-11-24T07:23:03Z

Closes: #16

basak · 2012-11-25T18:18:05Z

Thanks for working on this!

There are a few points which I think stop me from merging this quite yet. My understanding of this feature is that we need to:

Since we're in Python 2 (as boto currently only supports 2), incoming command line arguments are of type str, so if --transcode-names is specified then first we need to convert name arguments to unicode using .decode(sys.getdefaultencoding()).
Then anything we send to Amazon, together with any operations on the cache need to happen on these unicode names encoded with .encode(args.transcode_names) first.
It would make sense for the cache to remain in the same encoding as the vaults, since then the cache is always valid and never dependent on the --transcode-names= encoding used on a particular invocation.
Output to the terminal that refers to archive names should be in the user's own locale encoding, which Python will do for us automatically if we print unicode strings as normal.
The ref: and name: prefixes still need to work, both for input and output. This one gets a bit messy, since it means that the Cache class either needs to be aware of the encoding going on (since _archive_ref is defined in there, or we somehow need to pull that handling out of that class.

I am confused as to why you seem to have encode and decode reversed in this proposal. Is this because I've mixed up steps 1 and 2 here? I tried fixing this up as I expected this to work, and I came up with this: http://paste.ubuntu.com/1385888/; but I haven't done anything for item 5 yet. I will continue to work on this when I can.

this will encode name to utf8 before sending closes: basak#16

which is more flexible and allow user to decide which codec will be used to encode characters with ord() > 125

xsuchy · 2013-10-23T05:38:04Z

rebased to current master

xsuchy · 2013-10-23T18:32:08Z

Addressing your comments (after long pause):

you can not use sys.getdefaultencoding(), because that return 'ascii' and it will cause

>>> a='ýíáží'
>>> a.decode('ascii')
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
UnicodeDecodeError: 'ascii' codec can't decode byte 0xc3 in position 0: ordinal not in range(128)

yes, that is what I'm doing - in archive_upload()

if not isinstance(name, unicode) and self.args.transcode_names:
    name = name.decode(self.args.transcode_names)

which will convert name to unicode and boto itself will then correctly handle upload of unicode names (u'foo')

Hmm, true..., but the fact is that if you use filenames with ord(128)+, you could not upload it at all without --transcode-name, so it will neither never appear in vault nor in cache.
I recall I had some problem, but IIRC it was cause by those raise "foo %r" and once I replaced %r with %s then I thing we can leave out all encode and it will work as well. But I'm used to decode on input, encode on output - which is generally recommended as it can save you from potential problems.
I'm not sure I follow here.

decode convert type(str) to unicode. I do that before upload and boto will handle upload of unicode correctly. But only if it is u'foo' and not 'foo'. If you can not flip it, but you can do boto work by decoding and then encoding it again, but it will store not so expected string because

>>> 'ýíáží'.decode('utf-8').encode('utf-8')
'\xc3\xbd\xc3\xad\xc3\xa1\xc5\xbe\xc3\xad'

aspiers · 2015-04-25T11:08:02Z

@basak Was there still something blocking merge of this?

xsuchy · 2016-04-16T21:27:57Z

I rebased the patch to master and add there some python3 compatibility code.

xsuchy added 2 commits October 23, 2013 07:33

allow accents in names of files

b69e18f

this will encode name to utf8 before sending closes: basak#16

change --convert-utf8 to --transcode-names=CODEC

40ba355

which is more flexible and allow user to decide which codec will be used to encode characters with ord() > 125

xsuchy mentioned this pull request Apr 24, 2015

UnicodeDecodeError on upload #52

Closed

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Allow accents in file names #17

Allow accents in file names #17

xsuchy commented Nov 24, 2012

basak commented Nov 25, 2012

xsuchy commented Oct 23, 2013

xsuchy commented Oct 23, 2013

aspiers commented Apr 25, 2015

xsuchy commented Apr 16, 2016

Allow accents in file names #17

Are you sure you want to change the base?

Allow accents in file names #17

Conversation

xsuchy commented Nov 24, 2012

basak commented Nov 25, 2012

xsuchy commented Oct 23, 2013

xsuchy commented Oct 23, 2013

aspiers commented Apr 25, 2015

xsuchy commented Apr 16, 2016