Impossible to access files with an accent in the name #1294

Closed
EtienneDepaulis opened this Issue Nov 20, 2012 · 15 comments

@EtienneDepaulis

Hello,

We are using fog to manage files on our S3 server.
We have a file with an accent in its name, and it appears in the list when we do:

directory.files

But when we try to access it with the same path name:

directory.files.get "uploads/courrier_debut_saison_comm_rég_KP.pdf"

fog returns nil.

Any idea ?

(.encoding on the string returns #<Encoding:UTF-8>)

@seanhandley
Member

What's the operating system of the host machine?

@EtienneDepaulis

OS X (local server)
It's a French computer BTW

@fcheung
Contributor
fcheung commented Nov 20, 2012

Do you know what S3 does as far as collation goes? It may be that you have to give it strings in a particular canonical form for things to work properly (and it's worth remembering that OS X gives you an almost, but not quite, canonically decomposed form for file names).

@EtienneDepaulis

@fcheung What do you mean by collation ? I'm providing a simple string :s

@fcheung
Contributor
fcheung commented Nov 20, 2012

Nothing's ever simple when Unicode is involved. Collation means how strings are compared and ordered, which is important in a Unicode world because different languages can have different rules and because something like é can be stored in more than one way. I've no idea what S3 does in this respect. I would naively assume that you'd be OK if you're just using the string returned from an S3 list operation.

It's probably also worth checking that fog is doing the percent escaping properly - have a look with tcpdump or similar to see what the request actually looks like
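
As a rough guide to what to look for in such a capture, here is a minimal sketch of how the two possible spellings of é come out once percent-escaped. It uses ERB::Util.url_encode from the Ruby standard library purely for illustration; fog's own escaping may differ, and the short file names are made up.

```ruby
require "erb"

# Two spellings of the same visible file name (hypothetical short names)
nfc_name = "KP\u00E9.pdf"   # precomposed é (U+00E9)
nfd_name = "KPe\u0301.pdf"  # plain e followed by combining acute accent (U+0301)

# Percent-escape each UTF-8 byte outside the unreserved set
nfc_escaped = ERB::Util.url_encode(nfc_name)
nfd_escaped = ERB::Util.url_encode(nfd_name)

puts nfc_escaped  # KP%C3%A9.pdf
puts nfd_escaped  # KPe%CC%81.pdf
```

If the request on the wire shows e%CC%81 where the stored key has %C3%A9 (or the other way round), the two names are different byte strings even though they print identically.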

@EtienneDepaulis

If I retrieve the file list from S3, then select the correct file and do .key then an .inspect on the string (a string with which I can retrieve the file without any problem), I have exactly the same result as an .inspect on my previous string :s

Any idea why ?

@fcheung
Contributor
fcheung commented Nov 20, 2012

Do the raw bytes differ (ie s.bytes.to_a) ?

@EtienneDepaulis

Nope :s

#source : directory.files
k1 = file.key
=> "uploads/direct/e3641dd12ba502374ea5a012162ccaf3/courrier_debut_saison_comm_reg_KPé.pdf" 

#source : amazon S3 URL stored in DB
uri = URI.parse(URI.encode(url))
k2 = URI.decode(uri.path[1..-1])
 => "uploads/direct/e3641dd12ba502374ea5a012162ccaf3/courrier_debut_saison_comm_reg_KPé.pdf" 

k1.bytes.to_a
=> [117, 112, 108, 111, 97, 100, 115, 47, 100, 105, 114, 101, 99, 116, 47, 101, 51, 54, 52, 49, 100, 100, 49, 50, 98, 97, 53, 48, 50, 51, 55, 52, 101, 97, 53, 97, 48, 49, 50, 49, 54, 50, 99, 99, 97, 102, 51, 47, 99, 111, 117, 114, 114, 105, 101, 114, 95, 100, 101, 98, 117, 116, 95, 115, 97, 105, 115, 111, 110, 95, 99, 111, 109, 109, 95, 114, 101, 103, 95, 75, 80, 195, 169, 46, 112, 100, 102] 

k2.bytes.to_a
=> [117, 112, 108, 111, 97, 100, 115, 47, 100, 105, 114, 101, 99, 116, 47, 101, 51, 54, 52, 49, 100, 100, 49, 50, 98, 97, 53, 48, 50, 51, 55, 52, 101, 97, 53, 97, 48, 49, 50, 49, 54, 50, 99, 99, 97, 102, 51, 47, 99, 111, 117, 114, 114, 105, 101, 114, 95, 100, 101, 98, 117, 116, 95, 115, 97, 105, 115, 111, 110, 95, 99, 111, 109, 109, 95, 114, 101, 103, 95, 75, 80, 101, 204, 129, 46, 112, 100, 102] 

Any idea on what to try next ? I'm starting to become mad :s

@fcheung
Contributor
fcheung commented Nov 20, 2012

Those byte sequences look different to me.
The first ends with

80, 195, 169, 46, 112, 100, 102]
which is P, 0xc3, 0xa9 (the UTF-8 encoding of é, U+00E9).
The second ends with
80, 101, 204, 129, 46, 112, 100, 102]

which is P, e, 0xcc, 0x81 (a plain e followed by the UTF-8 encoding of the combining acute accent, U+0301).

So both byte sequences result in the same sequence of glyphs when printed on screen, but s3 appears to consider them as being different. Which one of those strings works?

Fred.
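
For anyone wanting to reproduce the comparison, a small sketch that rebuilds the two endings quoted above from their raw bytes and prints them in hex and as code points. The variable names are illustrative only.

```ruby
# Rebuild "P" + é from the byte values quoted in the thread
tail_from_listing = [80, 195, 169].pack("C*").force_encoding("UTF-8")
tail_from_db      = [80, 101, 204, 129].pack("C*").force_encoding("UTF-8")

# Helper lambda: render a string's bytes as space-separated hex
hex = ->(s) { s.bytes.map { |b| format("%02x", b) }.join(" ") }

puts hex.call(tail_from_listing)          # 50 c3 a9
puts hex.call(tail_from_db)               # 50 65 cc 81
p tail_from_listing.codepoints            # [80, 233]
p tail_from_db.codepoints                 # [80, 101, 769]
p tail_from_listing == tail_from_db       # false
```

Comparing codepoints is often easier to read than raw bytes, since 233 (U+00E9) versus 101, 769 (e plus U+0301) makes the difference obvious.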

@EtienneDepaulis

Sorry @fcheung, I meant "Nope, they differ" :s

Both strings print the same on the screen, but only the first one (the one retrieved from directory.files) returns the correct file.

Thanks for your help so far

@maximeg
maximeg commented Nov 20, 2012

Hello Etienne, just saw it on Twitter.

You have two kinds of é:

[195, 169] => two byte UTF8 "é" => NFC
[101, 204, 129] => "e" + combining acute accent (U+0301) => NFD

Which uploader are you using (paperclip, carrierwave, dragonfly, a homemade) ?
And what ORM/DB ?
It seems like an issue when storing/retrieving the filename.
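
To make the two forms concrete, a minimal sketch using String#unicode_normalize and String#unicode_normalized? from the Ruby standard library (available from Ruby 2.2, so newer than this thread). This is only an illustration of NFC versus NFD, not something fog does for you.

```ruby
nfc = "\u00E9"    # bytes [195, 169]
nfd = "e\u0301"   # bytes [101, 204, 129]

p nfc.bytes                               # [195, 169]
p nfd.bytes                               # [101, 204, 129]

# Converting between the two forms
p nfd.unicode_normalize(:nfc) == nfc      # true
p nfc.unicode_normalize(:nfd) == nfd      # true

# Checking which form a string is already in
p nfc.unicode_normalized?(:nfc)           # true
p nfd.unicode_normalized?(:nfc)           # false
```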

@EtienneDepaulis

Hey @maximeg ;)

I'm using jQuery File Upload without any specific uploader (very large files), which POSTs an S3 URL to a very simple controller that stores the value in a Postgres database.

Is it not possible to convert an NFD string to NFC format ?

@maximeg
maximeg commented Nov 20, 2012

I think I get it... On Mac OS, filenames are in NFD, so when your jQuery thing sends back the filename of the just-uploaded file, it sends NFD... and maybe PostgreSQL doesn't care. (Filenames on the Web and on Linux are NFC, and S3 does the math.)

See http://unicode-utils.rubyforge.org/UnicodeUtils.html

@EtienneDepaulis

Long day ;)

Here is the final solution:

In the Gemfile:

gem 'unf'

=> https://github.com/knu/ruby-unf

Then a simple .to_nfc on the path_name !
It's finally working. Thanks all for your help 😃
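
For later readers, a rough sketch of the idea behind the fix. It assumes Ruby 2.2 or newer, where String#unicode_normalize(:nfc) from the standard library can stand in for the unf gem's .to_nfc used above. The nfc_key helper, the shortened key, and the Hash standing in for the bucket listing are all hypothetical; with fog the lookup would be the directory.files.get call from the top of this issue.

```ruby
# Hypothetical helper: bring a key into NFC before using it for a lookup
def nfc_key(key)
  key.unicode_normalize(:nfc)
end

# Key as stored in the DB: NFD, as sent from an OS X browser
stored_key = "uploads/courrier_KPe\u0301.pdf"

# A plain Hash stands in for the bucket here; its key is in NFC
bucket = { "uploads/courrier_KP\u00E9.pdf" => :the_file }

p bucket[stored_key]           # nil (bytes differ)
p bucket[nfc_key(stored_key)]  # :the_file

# With fog, the equivalent would be:
#   directory.files.get(nfc_key(stored_key))
```

Normalizing once, when the URL is first saved, would avoid having to remember it at every lookup, but that is a design choice rather than something decided in this thread.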

@geemus
Member
geemus commented Nov 26, 2012

Thanks for working through this!

@EtienneDepaulis - what a tricky one... Thanks for sharing the solution.
