Skip to content

HTTPS clone URL

Subversion checkout URL

You can clone with
or
.
Download ZIP
Character encoding detection, brought to you by ICU
Ruby C C++

Fetching latest commit…

Cannot retrieve the latest commit at this time

Failed to load latest commit information.
benchmark
ext/charlock_holmes
lib
spec
.gitignore
.rspec
Gemfile
Gemfile.lock
MIT-LICENSE
README.md
Rakefile
charlock_holmes.gemspec

README.md

CharlockHolmes

Character encoding detecting library for Ruby using ICU

Usage

First you'll need to require it

require 'charlock_holmes'

Encoding detection

contents = File.read('test.xml')
detection = CharlockHolmes::EncodingDetector.detect(contents)
# => {:encoding => 'UTF-8', :confidence => 100, :type => :text}

# optionally there will be a :language key as well, but
# that's mostly only returned for legacy encodings like ISO-8859-1

NOTE: CharlockHolmes::EncodingDetector.detect will return nil if it was unable to find an encoding.

For binary content, :type will be set to :binary

Though it's more efficient to reuse once detector instance:

detector = CharlockHolmes::EncodingDetector.new

detection1 = detector.detect(File.read('test.xml'))
detection2 = detector.detect(File.read('test2.json'))

# and so on...

String monkey patch

Alternatively, you can just use the detect_encoding method on the String class

require 'charlock_holmes/string'

contents = File.read('test.xml')

detection = contents.detect_encoding

Ruby 1.9 specific

NOTE: This method only exists on Ruby 1.9+

If you want to use this library to detect and set the encoding flag on strings, you can use the detect_encoding! method on the String class

require 'charlock_holmes/string'

contents = File.read('test.xml')

# this will detect and set the encoding of `contents`, then return self
contents.detect_encoding!

Transcoding

Being able to detect the encoding of some arbitrary content is nice, but what you probably want is to be able to transcode that content into an encoding your application is using.

content = File.read('test2.txt')
detection = CharlockHolmes::EncodingDetector.detect(content)
utf8_encoded_content CharlockHolmes::Converter.convert content, detection[:encoding], 'UTF-8'

The first parameter is the content to transcode, the second is the source encoding (the encoding the content is assumed to be in), and the third parameter is the destination encoding.

Installing

If the traditional gem install charlock_holmes doesn't work, you may need to specify the path to your installation of ICU using the --with-icu-dir option during the gem install.

At the time of writing, Homebrew for OSX installs ICU (icu4c is the package name) as a keg-only install so you'll have to specify the location during the gem install:

gem install charlock_holmes -- --with-icu-dir=/usr/local/Cellar/icu4c/4.4.1

If you're using Bundler and need to specify a custom path(s), you can do so with the bundle config command:

bundle config build.charlock_holmes --with-icu-dir=/usr/local/Cellar/icu4c/4.4.1

Something went wrong with that request. Please try again.