Example of not ordering by codepoint

1 parent b939ed5 commit 7b77d26cef3fb7133693db1336808c3ac26ba8ba @candlerb candlerb committed Jun 21, 2010
Showing with 14 additions and 0 deletions.
  1. +14 −0 string19.rb
@@ -816,6 +816,20 @@ def b.to_str
# REFERENCE: rb_str_cmp_m in string.c
+# It's important to realise that ruby 1.9 does not sort by codepoint; it
+# sorts by bytes. It's a convenient property of the UTF-8 encoding that
+# lower codepoints sort before higher ones, but this does not hold for all
+# encodings, not even for all encodings of Unicode. Here's an example of
+# where the distinction is important:
+
+ s1 = 97.chr("UTF-8") # a
+ s2 = 257.chr("UTF-8") # ā
+ is true, s1 < s2 # expected
+
+ s1 = 97.chr("UTF-16LE") # a
+ s2 = 257.chr("UTF-16LE") # ā
+ is false, s1 < s2 # not ordered by codepoint
+
# In ruby 1.9 these questions have to be considered for symbols too, since
# symbols now have string-like properties. As far as I can see, the same
# rules are applied to symbols as for strings. In particular, this means
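
The UTF-16LE result in the hunk above follows from the byte layout: "a" is stored as 0x61 0x00 and "ā" as 0x01 0x01, so the bytewise comparison is decided by the first byte. A minimal sketch of this, and of one possible way to get codepoint order when it is wanted, is below. The variable names are illustrative only and are not part of string19.rb; only core String and Array methods are used.

```ruby
# Illustrative names only; none of these variables appear in string19.rb.
a     = 97.chr("UTF-16LE")    # "a"
a_bar = 257.chr("UTF-16LE")   # "ā"

# Inspect the raw bytes that String#<=> actually compares.
bytes_a     = a.bytes.to_a      # low byte first: 0x61, 0x00
bytes_a_bar = a_bar.bytes.to_a  # low byte first: 0x01, 0x01

# Default sort uses String#<=>, i.e. byte order, so "ā" comes first here.
bytewise = [a, a_bar].sort

# Sorting on the codepoint arrays instead gives codepoint order.
by_codepoint = [a, a_bar].sort_by { |s| s.codepoints.to_a }
```

Sorting on `codepoints` is only one option. Transcoding to UTF-8 with `encode("UTF-8")` before comparing should give the same order, since UTF-8 byte order matches codepoint order as noted in the comment above.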
