Could not upload files with diacritics in name #16

Open
xsuchy opened this issue Nov 22, 2012 · 14 comments · May be fixed by #17

Comments

@xsuchy
Contributor

xsuchy commented Nov 22, 2012

I have a file named "./2009/Agátka ve školce/PC090374.JPG"
and I'm trying to upload it using:
`glacier archive upload --name "./2009/Agátka ve školce/PC090374.JPG" Photos "./2009/Agátka ve školce/PC090374.JPG"`

I end up with this traceback:

Traceback (most recent call last):
  File "/usr/local/bin/glacier", line 618, in <module>
    App().main()
  File "/usr/local/bin/glacier", line 604, in main
    args.func(args)
  File "/usr/local/bin/glacier", line 416, in archive_upload
    archive_id = vault.create_archive_from_file(file_obj=args.file, description=name)
  File "/home/mirek/glacier-cli/boto/glacier/vault.py", line 163, in create_archive_from_file
    part_size=part_size)
  File "/home/mirek/glacier-cli/boto/glacier/vault.py", line 126, in create_archive_writer
    description)
  File "/home/mirek/glacier-cli/boto/glacier/layer1.py", line 479, in initiate_multipart_upload
    response_headers=response_headers)
  File "/home/mirek/glacier-cli/boto/glacier/layer1.py", line 83, in make_request
    raise UnexpectedHTTPResponseError(ok_responses, response)
boto.glacier.exceptions.UnexpectedHTTPResponseError

Not sure if this is a problem in boto or in glacier-cli.

Will investigate later.

@basak
Owner

basak commented Nov 22, 2012

Amazon Glacier does not permit anything non-ASCII in the name. Details are here: http://docs.amazonwebservices.com/amazonglacier/latest/dev/api-archive-post.html

"The description must be less than or equal to 1,024 characters. The allowable characters are 7-bit ASCII without control codes, specifically ASCII values 32—126 decimal or 0x20—0x7E hexadecimal."

The error message you get from glacier-cli is not helpful though, and I will leave this issue open to fix that.
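
The constraint quoted above can be checked locally before any request is sent, which would also allow a clearer error than the UnexpectedHTTPResponseError in the traceback. A minimal sketch in Python 3 syntax; the function name `check_description` is hypothetical and is not part of glacier-cli or boto:

```python
MAX_DESCRIPTION_LENGTH = 1024  # limit quoted from the Glacier documentation


def check_description(description):
    """Raise ValueError if the description breaks the quoted Glacier rule."""
    if len(description) > MAX_DESCRIPTION_LENGTH:
        raise ValueError("description is longer than 1,024 characters")
    for position, char in enumerate(description):
        # Allowed: 7-bit ASCII without control codes, i.e. 32..126 (0x20..0x7E).
        if not 32 <= ord(char) <= 126:
            raise ValueError(
                "character %r at position %d is outside ASCII 32-126"
                % (char, position)
            )
    return description
```

Called with the file name from the report, this would raise on the first accented character instead of failing inside the HTTP layer.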

@xsuchy
Contributor Author

xsuchy commented Nov 22, 2012

Boto can handle it by passing in decoded UTF-8. So we just have to pass it to boto as unicode and not as ascii. I tested this patch:

diff --git a/glacier.py b/glacier.py
index 784736a..b18d072 100755
--- a/glacier.py
+++ b/glacier.py
@@ -395,6 +395,7 @@ class App(object):
     def archive_list(self, args):
         archive_list = list(self.cache.get_archive_list(args.vault))
         if archive_list:
+            # FIXME problem here
             print(*archive_list, sep="\n")

     def archive_upload(self, args):
@@ -412,6 +413,8 @@ class App(object):
                 raise RuntimeError('Archive name not specified. Use --name')
             name = os.path.basename(full_name)

+        if not isinstance(name, unicode):
+            name = name.decode('utf-8')
         vault = self.connection.get_vault(args.vault)
         archive_id = vault.create_archive_from_file(file_obj=args.file, description=name)
         self.cache.add_archive(args.vault, name, archive_id)

The second part makes uploading work, but archive list then fails.
At the FIXME, the value of archive_list in my case is:
[u'./somefile', u'./2009/Ag\xe1tka ve \u0161kolce/IMG_3876.jpg']
The first item is printed, but when it iterates to the second item, it fails with this traceback:

Traceback (most recent call last):
  File "/usr/local/bin/glacier", line 621, in <module>
    App().main()
  File "/usr/local/bin/glacier", line 607, in main
    args.func(args)
  File "/usr/local/bin/glacier", line 399, in archive_list
    print(*archive_list, sep="\n")

Which is weird to me, because when I try to reproduce it in the Python console, it works:

$ python
Python 2.7.3 (default, Aug  9 2012, 17:23:57) 
[GCC 4.7.1 20120720 (Red Hat 4.7.1-5)] on linux2
Type "help", "copyright", "credits" or "license" for more information.
>>> from __future__ import print_function
>>> a=[u'./2009/Ag\xe1tka ve \u0161kolce/IMG_3876.jpg']
>>> a
[u'./2009/Ag\xe1tka ve \u0161kolce/IMG_3876.jpg']
>>> print(*a, sep="\n")
./2009/Agátka ve školce/IMG_3876.jpg
>>> import iso8601
>>> print(*a, sep="\n")
./2009/Agátka ve školce/IMG_3876.jpg
>>> from __future__ import unicode_literals
>>> print(*a, sep="\n")
./2009/Agátka ve školce/IMG_3876.jpg
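
The traceback above is cut off before its final line, so the actual exception is not shown. One common cause of a console-versus-script difference like this in Python 2, offered here only as an assumption, is that `print` encodes unicode strings with the output stream's encoding, and that encoding falls back to ASCII when none is set. The sketch below (Python 3 syntax, hypothetical helper name `encode_for_stream`) shows that encoding step in isolation: strict ASCII encoding of the archive name fails, while an explicit error handler does not.

```python
def encode_for_stream(text, encoding="ascii"):
    """Encode text for a stream, escaping what the stream cannot represent."""
    return text.encode(encoding, errors="backslashreplace")


name = u"./2009/Ag\xe1tka ve \u0161kolce/IMG_3876.jpg"

try:
    name.encode("ascii")  # what an ASCII-only stream effectively attempts
    strict_ok = True
except UnicodeEncodeError:
    strict_ok = False

# Non-ASCII characters come out as backslash escapes instead of raising.
escaped = encode_for_stream(name)
```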

@basak
Owner

basak commented Nov 22, 2012

I don't understand. If Amazon will only take ASCII in the range 32-126, how do you expect glacier-cli or boto to encode it for sending to Amazon? If, after a disaster, you use a different tool for recovery, how will that tool know how to decode your encoded archive names?

@xsuchy
Contributor Author

xsuchy commented Nov 22, 2012

Let's take the character 'á':
http://www.fileformat.info/info/unicode/char/e1/index.htm
If you pass it as is, boto passes it as U+00E1, which is outside the range.
But if you pass it as u'á', it will be encoded and passed as '\xc3\xa1' (an 8-character-long string), which is within the range.
This transformation is applied only to characters outside the range; characters within 0-128 are left intact. This leaves a small corner case open, but I assume no one puts \n or \r in a file name :)

@basak
Owner

basak commented Nov 22, 2012

I don't follow. "\xc3" is 195 decimal, which is greater than the Amazon 126 limit, no?

@basak
Owner

basak commented Nov 22, 2012

I think I've just understood what you are trying to do and now get what you mean by "8 character long string".

The problem, though, is that this overloads the backslash character. If Amazon Glacier gives glacier-cli an archive with the description '\xc3\xa1' (an 8-byte-long literal), then how does glacier-cli know whether to create a filename of exactly 8 ASCII bytes ['\', 'x', 'c', '3', ...] or a filename of exactly 1 UTF-8 character 'á'?

Fundamentally, glacier-cli is a front end for Amazon Glacier, and Glacier doesn't support Unicode so neither can glacier-cli without introducing ambiguities in decoding which harms interoperability with other tools. So I regret that glacier-cli will never be able to support Unicode archive names by default.

If you want to add functionality so that the user can specify some kind of mapping as a command line option (that won't be default), then I'd be happy to accept that. It would need to either be some accepted standard method or be done in a pluggable way to support multiple mappings, and needs to be free of conversion ambiguities.

Alternatively, a wrapper to glacier-cli might be able to do this, or users could use git-annex which keeps filename metadata in the annex instead of in the special remote.
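
The decoding ambiguity described above can be reproduced in a few lines. This is a sketch only: `naive_escape` is a hypothetical stand-in for the scheme discussed in this thread (escape only the characters outside the ASCII range, leave everything else untouched) and is not code from glacier-cli or boto.

```python
def naive_escape(name):
    """Escape only non-ASCII characters as \\xNN per UTF-8 byte; keep ASCII as is."""
    pieces = []
    for char in name:
        if ord(char) < 128:
            pieces.append(char)  # backslashes pass through unescaped
        else:
            pieces.extend("\\x%02x" % byte for byte in char.encode("utf-8"))
    return "".join(pieces)


one_character = "\u00e1"         # the single character 'á'
eight_characters = "\\xc3\\xa1"  # a literal backslash, 'x', 'c', '3', ...

assert len(naive_escape(one_character)) == 8
# Two different file names map to the same description, so decoding is ambiguous.
assert naive_escape(one_character) == naive_escape(eight_characters)
```

A scheme that also escapes the escape character itself avoids this collision, which is the "free of conversion ambiguities" property asked for above.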

@xsuchy
Contributor Author

xsuchy commented Nov 22, 2012

I do not suppose anyone names files in the escaped UTF-8 form, but OK. Having this as an option that is off by default is fine with me as well. What about --allow-utf8?

I found the problem with archive list, so I may be able to provide a patch soon.

@basak
Owner

basak commented Nov 22, 2012

But what encoding would --allow-utf8 use?

I'm just looking at http://docs.python.org/2/library/codecs.html#standard-encodings. Why don't we pick one of these? A suitable one would be a coding that converts from Unicode to something that Amazon Glacier can accept (ie. fits into the range 32-126). How about quopri-codec? Quoted-printable is a fairly standard way of embedding Unicode data into a 7-bit stream, right?

I'd prefer --convert-utf8 to make it clear that what goes into Glacier is being modified in some way.

So then glacier-cli could do a simple name_to_send = local_unicode_name.encode('quopri-codec') on the way in, and local_unicode_name = name_received.decode('quopri-codec') on the way back, if (and only if) --convert-utf8 was specified. I'm having some trouble with the details of this, but I hope you get the gist.

How does this sound?
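
The quoted-printable round trip proposed above might look roughly like this in Python 3, using the standard `quopri` module rather than the codec name. The variable names follow the comment above; everything else is an illustrative assumption, not glacier-cli code.

```python
import quopri

local_unicode_name = "./2009/Ag\u00e1tka ve \u0161kolce/PC090374.JPG"

# Quoted-printable works on bytes, so the name is UTF-8 encoded first.
name_to_send = quopri.encodestring(local_unicode_name.encode("utf-8")).decode("ascii")

# Reverse both steps on the way back.
name_received = name_to_send
restored_name = quopri.decodestring(name_received.encode("ascii")).decode("utf-8")

assert restored_name == local_unicode_name
```

Two details would likely need care in a real implementation. Python's quoted-printable encoder inserts soft line breaks (a "=" followed by a newline) into long output, and a newline is outside the 32-126 range. The encoded form is also longer than the original, which counts against the 1,024-character limit.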

@xsuchy
Contributor Author

xsuchy commented Nov 22, 2012

On 11/22/2012 02:14 PM, basak wrote:

> I'm just looking at http://docs.python.org/2/library/codecs.html#standard-encodings. Why don't we pick one of these?

But UTF8 is one of these - unicode_escape :)

> fits into the range 32-126). How about quopri-codec? Quoted-printable is
> a fairly standard way of embedding Unicode data into a 7-bit stream, right?

I do not agree. That is the standard in the email world, but in my experience
UTF is the standard everywhere else.

> I'd prefer --convert-utf8 to make it clear that what goes into Glacier
> is being modified in some way.

OK, --convert-utf8 then.

Mirek

@basak
Owner

basak commented Nov 23, 2012

But UTF-8 is not unicode_escape! We cannot use UTF-8 since Amazon is not 8-bit clean for Glacier archive descriptions. And if we use unicode_escape, then we're limiting our interoperability only to other Python tools.

Is there a common encoding that is 7-bit friendly that is generally accepted and not Python-specific? Apart from quoted-printable, I only see base64 and hex.
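
For comparison, the base64 and hex options mentioned above can be sketched as follows (Python 3, standard library only; the sample name is taken from the original report). Both produce output that stays inside the 32-126 range, at the cost of length and readability.

```python
import base64
import binascii

local_name = "Ag\u00e1tka ve \u0161kolce"
raw = local_name.encode("utf-8")  # 16 characters become 18 bytes

as_base64 = base64.b64encode(raw).decode("ascii")  # 4 output characters per 3 bytes
as_hex = binascii.hexlify(raw).decode("ascii")     # 2 output characters per byte

# Both encodings are reversible without ambiguity.
assert base64.b64decode(as_base64).decode("utf-8") == local_name
assert binascii.unhexlify(as_hex).decode("utf-8") == local_name
```

The size overhead matters because the description is capped at 1,024 characters, and neither form is readable when a vault is listed with another tool.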

@xsuchy
Contributor Author

xsuchy commented Nov 23, 2012

On 23.11.2012 09:33, basak wrote:

> Is there a common encoding that is 7-bit friendly that is generally
> accepted and not Python-specific? Apart from quoted-printable, I only
> see base64 and hex.

Hmm, I think everybody will have a different opinion.
So what about

--convert-name=CODE
where CODE is anything from
http://docs.python.org/2/library/codecs.html#standard-encodings
Then the admin can decide which one to use.

I already have the patch ready, and looking at it, I see that such a change
is possible and will in fact be trivial. I will have to test it again though.

@basak
Owner

basak commented Nov 23, 2012

That sounds absolutely fine. How about --transcode-names=... for the name of the option? That would be even more specific about what it actually does (now that we know!), and in some cases more than one name is being converted.

@xsuchy
Contributor Author

xsuchy commented Nov 23, 2012

--transcode-names=
I agree with you.

I will send pull request on Monday.
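
A rough sketch of how the agreed --transcode-names option could behave, written in Python 3. This is not the code from the pull request: the helper names `transcode_name` and `restore_name` and the argparse wiring are hypothetical, and the sketch only handles text codecs that `str.encode` accepts (for example unicode_escape or utf-7).

```python
import argparse


def transcode_name(name, codec):
    """Encode a local name with the chosen codec for use as a description."""
    encoded = name.encode(codec)
    # Reject codecs whose output Glacier would refuse (see the quoted rule).
    if len(encoded) > 1024 or any(not 32 <= byte <= 126 for byte in encoded):
        raise ValueError("codec %r does not produce a valid description" % codec)
    return encoded.decode("ascii")


def restore_name(description, codec):
    """Reverse transcode_name for a description read back from the vault."""
    return description.encode("ascii").decode(codec)


parser = argparse.ArgumentParser(prog="glacier")
# Hypothetical wiring; off by default, as agreed in the discussion.
parser.add_argument("--transcode-names", metavar="CODEC", default=None)
args = parser.parse_args(["--transcode-names=unicode_escape"])

local_name = "Ag\u00e1tka ve \u0161kolce"
description = transcode_name(local_name, args.transcode_names)
assert restore_name(description, args.transcode_names) == local_name
```

Checking the encoded result against the 32-126 range means a codec such as utf-8 is rejected up front rather than failing at upload time.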

xsuchy added a commit to xsuchy/glacier-cli-1 that referenced this issue Nov 24, 2012
this will encode name to utf8 before sending
closes: basak#16
@xsuchy xsuchy linked a pull request Nov 24, 2012 that will close this issue
@nomeata

nomeata commented Oct 17, 2013

This is just biting me. What is the status of this request?

xsuchy added a commit to xsuchy/glacier-cli-1 that referenced this issue Oct 23, 2013
this will encode name to utf8 before sending
closes: basak#16