Efficient Pure Ruby Unicode Normalization (eprun)

(pronounced e-prune)

The Talk

Please see the Internationalization & Unicode Conference 37 talk on Implementing Normalization in Pure Ruby - the Fast and Easy Way.

Directories and Files

  • lib/normalize.rb: The core normalization code.
  • lib/string_normalize.rb: String#normalize.
  • lib/generate.rb: Generation script, generates lib/normalize_tables.rb from data/UnicodeData.txt and data/CompositionExclusions.txt. This needs to be run only once when updating to a new Unicode version.
  • lib/normalize_tables.rb: Data used for normalization, automatically generated by lib/generate.rb.
  • data/: All three files in this directory are downloaded from the Unicode Character Database. They are currently at Unicode version 6.3. They need to be updated for a newer Unicode version (happens about once a year).
  • test/test_normalize.rb: Tests for lib/string_normalize.rb, using data/NormalizationTest.txt.
  • benchmark/benchmark.rb: Runs the benchmark with example text files. Automatically checks for existing gems/libraries; if e.g. the unicode_utils gem is not available, that part of the benchmark is skipped. This also applies to eprun, which will not be run on Ruby 1.8.
  • benchmark/Deutsch_.txt, Japanese_.txt, Korean_.txt, Vietnamese_.txt: Example texts extracted from random Wikipedia pages (see http://en.wikipedia.org/wiki/Wikipedia:Random). The languages were chosen based on the number of characters affected by normalization (Deutsch < Japanese < Vietnamese < Korean). These files differ somewhat in length, so the results cannot be compared directly across languages. Any other file whose name ends in "_.txt" will also be included in the benchmark.
  • benchmark/benchmark_results.rb: Results of the benchmark for eprun, unicode_utils, ActiveSupport::Multibyte (version 3.0.0), twitter_cldr, and the unicode gem. The eprun, unicode_utils, and unicode normalizations are run 100 times each, ActiveSupport::Multibyte is run 10 times each, and twitter_cldr is run only once (didn't want to wait any longer).
  • benchmark/benchmark_results_jruby.txt: Results of the benchmark when using JRuby (excludes the unicode gem), version 1.7.4 (1.9.3p392, 2013-05-16 2390d3b on Java HotSpot(TM) Client VM 1.7.0_07-b10 [Windows 7-x86]).
  • benchmark/benchmark.pl: Runs the benchmark using Perl, both with the xsub (i.e. C) version (run 100 times) and with the pure Perl version (run 10 times).
  • benchmark/benchmark_results_pl.txt: Results of Perl benchmarks.
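
As a rough illustration of the kind of work a generation script such as lib/generate.rb has to do, the sketch below reads a few lines in the UnicodeData.txt format and collects canonical decompositions, compatibility decompositions, and non-zero combining classes. This is not the project's actual code: the method name parse_unicode_data, the SAMPLE constant, and the shape of the returned hashes are all assumptions made for this example, and a real generator would also read data/CompositionExclusions.txt and write out lib/normalize_tables.rb.

```ruby
# Hypothetical sketch, not the code in lib/generate.rb.
# A few lines in the semicolon-separated UnicodeData.txt format:
# field 0 = code point, field 3 = canonical combining class,
# field 5 = decomposition mapping (a leading <tag> marks a compatibility mapping).
SAMPLE = <<~DATA
  00A0;NO-BREAK SPACE;Zs;0;CS;<noBreak> 0020;;;;N;NON-BREAKING SPACE;;;;
  00C5;LATIN CAPITAL LETTER A WITH RING ABOVE;Lu;0;L;0041 030A;;;;N;LATIN CAPITAL LETTER A RING;;;00E5;
  030A;COMBINING RING ABOVE;Mn;230;NSM;;;;;N;NON-SPACING RING ABOVE;;;;
  212B;ANGSTROM SIGN;Lu;0;L;00C5;;;;N;ANGSTROM UNIT;;;00E5;
DATA

# Returns three hashes keyed by single-character UTF-8 strings
# (hypothetical helper; the names are chosen for this example only).
def parse_unicode_data(text)
  canonical = {}        # char => canonical decomposition
  compatibility = {}    # char => compatibility decomposition
  combining_class = {}  # char => non-zero canonical combining class

  text.each_line do |line|
    fields = line.chomp.split(';', -1)
    next if fields.size < 6

    char = [fields[0].hex].pack('U')
    ccc = fields[3].to_i
    combining_class[char] = ccc unless ccc.zero?

    mapping = fields[5]
    next if mapping.empty?

    if mapping.start_with?('<')
      # Strip the <tag> and keep the code points that follow it.
      code_points = mapping.sub(/\A<[^>]+>\s*/, '').split.map(&:hex)
      compatibility[char] = code_points.pack('U*')
    else
      canonical[char] = mapping.split.map(&:hex).pack('U*')
    end
  end

  [canonical, compatibility, combining_class]
end

CANONICAL, COMPATIBILITY, COMBINING_CLASS = parse_unicode_data(SAMPLE)
```

With the sample above, CANONICAL maps U+00C5 to the two-character sequence U+0041 U+030A, COMPATIBILITY maps U+00A0 to a plain space, and COMBINING_CLASS records class 230 for U+030A.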

TODOs and Ideas

  • Publish as a gem, or several gems.
  • Deal better with encodings other than UTF-8.
  • Add methods such as String#nfc, String#nfd,...
  • Add methods for normalization variants.
  • See talk for more.
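
The String#nfc, String#nfd,... idea above could look roughly like the following sketch. It is only an illustration under stated assumptions: it delegates to String#unicode_normalize, which is built into Ruby 2.2 and later, as a stand-in, because this README does not show the signature of eprun's own String#normalize. In eprun itself the method bodies would call its own normalization code instead.

```ruby
# Hypothetical convenience methods; the stand-in String#unicode_normalize
# requires Ruby 2.2 or later and is NOT eprun's own API.
class String
  %i[nfc nfd nfkc nfkd].each do |form|
    # Defines String#nfc, String#nfd, String#nfkc and String#nfkd,
    # each returning a new string in the corresponding normalization form.
    define_method(form) { unicode_normalize(form) }
  end
end

decomposed = "A\u030A"      # LATIN CAPITAL LETTER A + COMBINING RING ABOVE
composed = decomposed.nfc   # single code point U+00C5
```

Defining the methods in a loop keeps the four forms consistent with each other; a gem would probably want to offer them through a refinement or an opt-in require rather than patching String unconditionally.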