utf-16le mistaken for binary #6

Closed
mcandre opened this issue Aug 28, 2014 · 6 comments

@mcandre

mcandre commented Aug 28, 2014

ptools mistakenly treats some Unicode files as binary.

Example:

$ echo "test" > test.ascii
$ file -i test.ascii
test.ascii: text/plain; charset=us-ascii
$ irb
> require 'ptools'
 => true 
> File.binary?('test.ascii')
 => false 
> exit

$ iconv -f ascii -t utf-16 test.ascii > test.utf-16
$ file -i test.utf-16 
test.utf-16: text/plain; charset=utf-16le
$ irb
> require 'ptools'
 => true 
> File.binary?('test.utf-16')
 => true 
> exit
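
The thread does not say how File.binary? decides, so the following is only a sketch of why a byte-oriented heuristic could misfire here. It builds the same text in ASCII and in UTF-16LE with a BOM (roughly what iconv produced above) and measures the fraction of bytes that are not printable ASCII. The helper non_text_ratio is hypothetical and is not ptools' actual code.

```ruby
# The same text as plain ASCII and as UTF-16LE with a leading BOM.
ascii = "test\n".b
utf16 = "\xFF\xFE".b + "test\n".encode("UTF-16LE").b

# Hypothetical heuristic (an assumption, not ptools' implementation):
# fraction of bytes outside printable ASCII and common whitespace.
def non_text_ratio(bytes)
  return 0.0 if bytes.empty?

  odd = bytes.each_byte.count do |b|
    !(b.between?(0x20, 0x7E) || [0x09, 0x0A, 0x0D].include?(b))
  end
  odd.to_f / bytes.bytesize
end

puts non_text_ratio(ascii) # every byte is printable or whitespace
puts non_text_ratio(utf16) # the BOM plus one NUL per character count as non-text
```

In UTF-16LE every ASCII character is followed by a NUL byte, so more than half of the bytes look non-textual to a check like this, even though the file is plain text.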

System:

$ specs gem:ptools ruby file os
Specs:

specs 0.11
https://github.com/mcandre/specs#readme

gem list | grep ptools
ptools (1.2.6)

bundle --version
Bundler version 1.6.2

gem --version
2.2.2

ruby --version
ruby 2.0.0p481 (2014-05-08 revision 45883) [x86_64-linux]

file --version
file-5.14
magic file from /etc/magic:/usr/share/misc/magic

lsb_release -a
No LSB modules are available.
Distributor ID: Ubuntu
Description:    Ubuntu 14.04.1 LTS
Release:    14.04
Codename:   trusty
@djberg96
Owner

djberg96 commented Sep 5, 2014

I'm not sure how to fix it. Help wanted.

@mcandre
Author

mcandre commented Sep 14, 2014

We could add a check in File.binary? for a UTF-16 encoding. If the file is UTF-16, treat it as text; otherwise, fall back to the existing binary check.

A hacky way to implement this check is to shell out to the Unix file program to identify the file's encoding.

A purer Ruby approach is to look for the byte order mark (BOM) in the first few bytes of the file: http://unicode.org/faq/utf_bom.html#BOM
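
A minimal sketch of the BOM approach, assuming it is enough to compare the first four bytes against the standard Unicode BOM signatures. The names BOMS and bom_encoding are hypothetical and are not part of ptools.

```ruby
# Standard BOM signatures. The UTF-32 entries come first because the
# UTF-32LE BOM (FF FE 00 00) begins with the UTF-16LE BOM (FF FE).
BOMS = {
  "UTF-32LE" => "\xFF\xFE\x00\x00".b,
  "UTF-32BE" => "\x00\x00\xFE\xFF".b,
  "UTF-8"    => "\xEF\xBB\xBF".b,
  "UTF-16LE" => "\xFF\xFE".b,
  "UTF-16BE" => "\xFE\xFF".b
}.freeze

# Hypothetical helper: returns the encoding name implied by the file's
# BOM, or nil when the file does not start with one.
def bom_encoding(path)
  head = File.binread(path, 4) || "".b # guard against a nil read on an empty file
  match = BOMS.find { |_name, bom| head.start_with?(bom) }
  match && match.first
end
```

A caller could treat any non-nil result as text before running the usual binary check. Two caveats: a file without a BOM is not detected at all, and a UTF-16LE file whose first character is U+0000 is indistinguishable from UTF-32LE by its first four bytes.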

@mcandre
Author

mcandre commented Sep 14, 2014

Update: The Charlock Holmes gem can detect UTF-16!

> CharlockHolmes::EncodingDetector.new.detect(File.read('test.utf-16'))
 => {:type=>:text, :encoding=>"UTF-16BE", :ruby_encoding=>"UTF-16BE", :confidence=>100} 

> !(CharlockHolmes::EncodingDetector.new.detect(File.read('test.utf-16'))[:encoding] =~ /UTF-16/).nil?
 => true

We should also add a check for iso-8859-1 encoded files (they're text, not binary).
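
ISO-8859-1 cannot be detected from a signature, since every byte value is valid in that encoding, so any check has to be a heuristic. One possible sketch, assuming that control bytes (0x00-0x1F other than tab, LF and CR, plus 0x7F-0x9F) are what mark a file as binary. The helper latin1_text? is hypothetical and is not part of ptools.

```ruby
# Hypothetical helper: true when every byte is a printable ISO-8859-1
# character (0x20-0x7E or 0xA0-0xFF) or common whitespace.
def latin1_text?(bytes)
  bytes.each_byte.all? do |b|
    b.between?(0x20, 0x7E) || b >= 0xA0 || [0x09, 0x0A, 0x0D].include?(b)
  end
end

latin1 = "caf\u00E9\n".encode("ISO-8859-1").b
puts latin1_text?(latin1)      # accented text passes
puts latin1_text?("\x00\x01".b) # control bytes fail
```

This is deliberately strict: Windows-1252 text that uses the 0x80-0x9F range and some UTF-8 text would be rejected, so it would only make sense as one input among several.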

@djberg96
Owner

Using the Unix file command rather defeats the purpose of the ptools gem, and it can't be used on Windows. I'd rather not add a third-party dependency, either. Surely a pure Ruby solution is possible.

@djberg96
Owner

djberg96 commented Jun 7, 2020

@mcandre How's it look? #35

@djberg96
Owner

djberg96 commented Jun 7, 2020

Fixed in 1.3.6, which was released today. Thanks for the report!

djberg96 closed this as completed Jun 7, 2020