Invalid byte sequence in UTF-8 (in file name) #288

Closed
mlarocque opened this Issue Feb 29, 2012 · 6 comments

I'm syncing files on our server to S3 successfully. However, one particular directory - uploads (where people upload a variety of documents) - fails with the following:

[2012/02/29 06:58:10][message] Generating checksums for /storage/data/uploads
[2012/02/29 06:58:15][error] ModelError: Backup for Backup to S3 (cnbc_s3_backup) Failed!
[2012/02/29 06:58:15][error] An Error occured which has caused this Backup to abort before completion.
[2012/02/29 06:58:15][error] Reason: ArgumentError
[2012/02/29 06:58:15][error] invalid byte sequence in UTF-8
[2012/02/29 06:58:15][error]
[2012/02/29 06:58:15][error] Backtrace:
[2012/02/29 06:58:15][error] /home/mlarocque/.rvm/gems/ruby-1.9.3-p125/gems/backup-3.0.23/lib/backup/syncer/cloud.rb:101:in `split'
[2012/02/29 06:58:15][error] /home/mlarocque/.rvm/gems/ruby-1.9.3-p125/gems/backup-3.0.23/lib/backup/syncer/cloud.rb:101:in `local_files'
[2012/02/29 06:58:15][error] /home/mlarocque/.rvm/gems/ruby-1.9.3-p125/gems/backup-3.0.23/lib/backup/syncer/cloud.rb:93:in `all_file_names'

It appears that one or more of the filenames has characters which backup doesn't like.

Any ideas?

ghost commented Feb 29, 2012

Would validating the user's uploaded file names be an option?
I don't think backup should change the names, since those files when restored would be different.

I agree that validating file names should be done, and I believe it is now. These files are from a 5+ year old Rails app and there have been a ton of file uploads over that time. Perhaps it might be possible for backup to fail somewhat less spectacularly and identify the file that is causing the issue?

Contributor

tomash commented Mar 1, 2012

This occurs only in Ruby 1.9, which is strict about strings whose bytes are not valid in their declared encoding.

Usually this is solvable with iconv, converting the offending names either to UTF-8 (recommended) or down to ASCII. Who's in to fix it?
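
A minimal sketch of the conversion suggested here, assuming the offending names were written as ISO-8859-1 (Latin-1) bytes. That source encoding is a guess, and the helper name `to_utf8_name` is hypothetical. It uses `String#force_encoding` and `String#encode` from Ruby 1.9's core rather than the separate iconv library:

```ruby
# Hypothetical helper: return a valid UTF-8 copy of a file name.
# Assumption: bytes that are not valid UTF-8 were written as ISO-8859-1.
def to_utf8_name(name)
  utf8 = name.dup.force_encoding(Encoding::UTF_8)
  return utf8 if utf8.valid_encoding?

  # Reinterpret the raw bytes as Latin-1, then transcode them to UTF-8.
  name.dup.force_encoding(Encoding::ISO_8859_1).encode(Encoding::UTF_8)
end

# "résumé.doc" as it would be stored by a Latin-1 client (made-up example name).
legacy = "r\xE9sum\xE9.doc".dup.force_encoding(Encoding::BINARY)
fixed  = to_utf8_name(legacy)
puts fixed.valid_encoding?
```

As noted earlier in the thread, renaming changes what a restore would produce, so a conversion like this suits a one-off cleanup of the upload directory better than something backup would do silently.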

ghost added a commit that referenced this issue Mar 7, 2012

Brian D. Burns update Cloud Syncers
- Syncer::Cloud namespace
  `sync_with S3` -> `sync_with Cloud::S3`
  `sync_with CloudFiles` -> `sync_with Cloud::CloudFiles`
- Warn user if paths contain invalid UTF-8 byte sequences (#288)
14334ef

ghost commented Mar 7, 2012

@mlarocque This has been taken care of in the 'cloud-syncers' branch if you're still needing this :)
https://github.com/meskyanichi/backup/tree/cloud-syncers
It will simply skip any file paths with invalid UTF-8 characters and log a warning with a reference to the skipped path.
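
The skip-and-warn behaviour described here can be sketched as follows. This is not the code from the 'cloud-syncers' branch: `partition_paths` is a hypothetical helper and the sample paths are made up. It only illustrates the idea using `String#valid_encoding?`:

```ruby
# Hypothetical helper: split paths into [valid, invalid] by UTF-8 validity.
def partition_paths(paths)
  paths.partition do |path|
    path.dup.force_encoding(Encoding::UTF_8).valid_encoding?
  end
end

paths = [
  "uploads/report.pdf",
  "uploads/caf\u00E9.txt",  # valid UTF-8, non-ASCII
  "uploads/caf\xE9.txt"     # Latin-1 byte, invalid as UTF-8
]
valid, skipped = partition_paths(paths)

skipped.each do |path|
  # inspect escapes the bad bytes so the warning itself stays printable.
  warn "Skipping #{path.inspect}: invalid UTF-8 byte sequence"
end
```

Only the third path is skipped. A valid UTF-8 name with non-ASCII characters passes through untouched.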

There were other changes made and issues addressed in this branch, as well as those addressed in 'develop' (which cloud-syncers is based on). So, it may take several days before we're comfortable releasing a gem with these updates.

You can add the following to your Gemfile for now, so you can sync and identify those bad files.

gem 'backup',
  :git => 'git://github.com/meskyanichi/backup.git',
  :branch => 'cloud-syncers'

ghost added a commit that referenced this issue Mar 10, 2012

Brian D. Burns update Cloud Syncers
- Syncer::Cloud namespace
  `sync_with S3` -> `sync_with Cloud::S3`
  `sync_with CloudFiles` -> `sync_with Cloud::CloudFiles`
- Warn user if paths contain invalid UTF-8 byte sequences (#288)
603c46a

ghost closed this Feb 19, 2013

bjensen commented Jul 31, 2013

@burns UTF-8 is allowed on the filesystem and on S3... So shouldn't a fix handle UTF-8 characters in filenames?

ghost commented Jul 31, 2013

@bjensen The problem here was invalid UTF-8. If the file name is valid UTF-8, then there's no problem.

This issue was closed.
