Impossible to access files with an accent in the name #1294

EtienneDepaulis · 2012-11-20T10:56:48Z

Hello,

We are using fog to manage files on our S3 server.
We have a file with an accent in its name, and it appears in the list when we do:

directory.files

But when we try to access it with the same path name:

directory.files.get "uploads/courrier_debut_saison_comm_rég_KP.pdf"

fog returns nil.

Any idea ?

(.encoding on the string returns #<Encoding:UTF-8>)

seanhandley · 2012-11-20T11:05:10Z

What's the operating system of the host machine?

EtienneDepaulis · 2012-11-20T11:19:23Z

OS X (local server)
It's a French computer BTW

fcheung · 2012-11-20T11:24:23Z

Do you know what S3 does as far as collation goes? It may be that you have to give it strings in a particular canonical form for things to work properly (and it's worth remembering that OS X gives you an almost, but not quite, canonically decomposed form for file names).

EtienneDepaulis · 2012-11-20T11:54:28Z

@fcheung What do you mean by collation ? I'm providing a simple string :s

fcheung · 2012-11-20T12:00:01Z

Nothing's ever simple when Unicode is involved. Collation means how strings are compared and ordered, which is important in a Unicode world because different languages can have different rules and because something like é can be stored in more than one way. I've no idea what S3 does in this respect. I would naively assume that you'd be OK if you're just using the string returned from an S3 list operation.

It's probably also worth checking that fog is doing the percent escaping properly - have a look with tcpdump or similar to see what the request actually looks like
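
As an illustration of what to look for in the captured request: the two ways of storing é produce different percent-escaped paths. This is a minimal sketch using Ruby's standard `URI.encode_www_form_component`, which is an assumption for illustration and not necessarily the escaping fog performs internally; the literal strings are made up.

```ruby
require 'uri'

# Two spellings of the same visible character, written with explicit escapes
precomposed = "\u00E9"   # é as a single code point (U+00E9)
decomposed  = "e\u0301"  # e followed by the combining acute accent (U+0301)

escaped_precomposed = URI.encode_www_form_component(precomposed)
escaped_decomposed  = URI.encode_www_form_component(decomposed)

puts escaped_precomposed  # %C3%A9
puts escaped_decomposed   # e%CC%81
```

If the request on the wire contains `e%CC%81` where the stored key would escape to `%C3%A9`, the escaping itself is fine and the input string is the part that differs.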

EtienneDepaulis · 2012-11-20T14:20:17Z

If I retrieve the file list from S3, then select the correct file and do .key then an .inspect on the string (a string with which I can retrieve the file without any problem), I have exactly the same result as an .inspect on my previous string :s

Any idea why ?

fcheung · 2012-11-20T14:25:58Z

Do the raw bytes differ (ie s.bytes.to_a) ?

EtienneDepaulis · 2012-11-20T17:36:50Z

Nope :s

# source: directory.files
k1 = file.key
=> "uploads/direct/e3641dd12ba502374ea5a012162ccaf3/courrier_debut_saison_comm_reg_KPé.pdf"

# source: Amazon S3 URL stored in DB
uri = URI.parse(URI.encode(url))
k2 = URI.decode(uri.path[1..-1])
=> "uploads/direct/e3641dd12ba502374ea5a012162ccaf3/courrier_debut_saison_comm_reg_KPé.pdf"

k1.bytes.to_a
=> [117, 112, 108, 111, 97, 100, 115, 47, 100, 105, 114, 101, 99, 116, 47, 101, 51, 54, 52, 49, 100, 100, 49, 50, 98, 97, 53, 48, 50, 51, 55, 52, 101, 97, 53, 97, 48, 49, 50, 49, 54, 50, 99, 99, 97, 102, 51, 47, 99, 111, 117, 114, 114, 105, 101, 114, 95, 100, 101, 98, 117, 116, 95, 115, 97, 105, 115, 111, 110, 95, 99, 111, 109, 109, 95, 114, 101, 103, 95, 75, 80, 195, 169, 46, 112, 100, 102] 

k2.bytes.to_a
=> [117, 112, 108, 111, 97, 100, 115, 47, 100, 105, 114, 101, 99, 116, 47, 101, 51, 54, 52, 49, 100, 100, 49, 50, 98, 97, 53, 48, 50, 51, 55, 52, 101, 97, 53, 97, 48, 49, 50, 49, 54, 50, 99, 99, 97, 102, 51, 47, 99, 111, 117, 114, 114, 105, 101, 114, 95, 100, 101, 98, 117, 116, 95, 115, 97, 105, 115, 111, 110, 95, 99, 111, 109, 109, 95, 114, 101, 103, 95, 75, 80, 101, 204, 129, 46, 112, 100, 102]

Any idea on what to try next ? I'm starting to become mad :s

fcheung · 2012-11-20T17:58:44Z

Those byte sequences look different to me
First ends with

80, 195, 169, 46, 112, 100, 102]
Which is P, 0xc3, 0xa9 (which is UTF-8 for é)
Second ends with
80, 101, 204, 129, 46, 112, 100, 102]

Which is P, e, 0xcc, 0x81 (which is UTF-8 for e followed by the combining acute accent, U+0301)

So both byte sequences result in the same sequence of glyphs when printed on screen, but s3 appears to consider them as being different. Which one of those strings works?

Fred.
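
The byte comparison above can be reproduced in a few lines. This is a sketch that rebuilds only the tail of each key (from "P" onward) from the bytes printed in the thread and lists the code points; the variable names are made up for illustration.

```ruby
# Tail bytes of each key as printed in the thread (starting at "P")
tail_k1 = [80, 195, 169, 46, 112, 100, 102]
tail_k2 = [80, 101, 204, 129, 46, 112, 100, 102]

# Rebuild UTF-8 strings from the raw bytes
s1 = tail_k1.pack("C*").force_encoding("UTF-8")
s2 = tail_k2.pack("C*").force_encoding("UTF-8")

# Code points make the difference visible even though both print alike
cp1 = s1.codepoints.map { |c| format("U+%04X", c) }
cp2 = s2.codepoints.map { |c| format("U+%04X", c) }

puts cp1.join(" ")  # second entry is U+00E9
puts cp2.join(" ")  # second and third entries are U+0065 U+0301
puts s1 == s2       # false: Ruby compares strings byte-wise
```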


EtienneDepaulis · 2012-11-20T18:01:49Z

Sorry @fcheung, I meant "Nope, they differ" :s

Both strings print the same on the screen but only the first one (the one retrieved from directory.files) returns the correct file.

Thanks for your help so far

maximeg · 2012-11-20T18:05:00Z

Hello Etienne, just saw it on Twitter.

You have two kinds of é

[195, 169] => two byte UTF8 "é" => NFC
[101, 204, 129] => "e" + combining acute accent (U+0301) => NFD

Which uploader are you using (paperclip, carrierwave, dragonfly, something homemade) ?
And what ORM/DB ?
It seems like an issue when storing/retrieving the filename.

EtienneDepaulis · 2012-11-20T18:08:21Z

Hey @maximeg ;)

I'm using jQuery File Upload without any specific uploader (very large files), which POSTs an S3 URL to a very simple controller that stores the value in a Postgres database.

Is it not possible to convert an NFD string to NFC format ?

maximeg · 2012-11-20T18:19:35Z

I think I get it... In Mac OS, filenames are in NFD, so when your jQuery thing sends back the filename of the just-uploaded file, it will send NFD... and maybe PostgreSQL doesn't care. (Filenames on the Web and on Linux are NFC, and S3 does the math.)

See http://unicode-utils.rubyforge.org/UnicodeUtils.html
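
For reference, converting NFD to NFC is possible without any extra library in later Ruby versions. This sketch assumes Ruby 2.2 or newer, where `String#unicode_normalize` is built in (it did not exist when this thread was written); the sample string is made up.

```ruby
# String#unicode_normalize ships with Ruby 2.2 and later (an assumption:
# it was not available at the time of this thread)
nfd = "re\u0301g"                  # "rég" stored as e + combining accent
nfc = nfd.unicode_normalize(:nfc)  # compose into a single code point

puts nfd.bytes.inspect             # [114, 101, 204, 129, 103]
puts nfc.bytes.inspect             # [114, 195, 169, 103]
puts nfc.unicode_normalized?(:nfc) # true
```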

EtienneDepaulis · 2012-11-20T18:23:49Z

Long day ;)

Here is the final solution:

In the Gemfile:

gem 'unf'

=> https://github.com/knu/ruby-unf

Then a simple .to_nfc on the path_name !
It's finally working. Thanks all for your help 😃
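
A rough sketch of the fix, for readers who want to see its shape. The helper name `nfc_key` and the in-memory `bucket` are hypothetical; the Hash only stands in for the byte-exact key matching observed in this thread and is not fog code. The sketch uses Ruby's built-in `String#unicode_normalize` (Ruby 2.2+) so it runs without gems; with the unf gem, the equivalent call is `.to_nfc`.

```ruby
# Hypothetical helper: normalize a key to NFC before asking S3 for it.
# With the unf gem from this thread, the body would be path.to_nfc instead.
def nfc_key(path)
  path.encode("UTF-8").unicode_normalize(:nfc)
end

# In-memory stand-in for a bucket: a Hash matches String keys byte for byte,
# which mimics the behaviour seen above. Not real fog code.
bucket = { "uploads/KP\u00E9.pdf" => :file }

from_db = "uploads/KPe\u0301.pdf"  # NFD key, as an OS X client might send it

miss = bucket[from_db]             # nil, because the bytes differ
hit  = bucket[nfc_key(from_db)]    # :file, after normalization

puts miss.inspect
puts hit.inspect

# With fog, the call from the original report would then look something like:
#   directory.files.get(nfc_key(path_name))
```

Normalizing once, at the point where the URL is stored in the database, would avoid having to remember the conversion at every lookup.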

geemus · 2012-11-26T17:08:18Z

Thanks for working through this!

@EtienneDepaulis - what a tricky one... Thanks for sharing the solution.

EtienneDepaulis closed this as completed Nov 20, 2012

EtienneDepaulis reopened this Nov 20, 2012

EtienneDepaulis closed this as completed Nov 20, 2012
