
Replace alternatives.markdown, python 3 got it right

1 parent 7f55690 commit 9fffab3c3b56ff9a58d0e41259d11dd4858f20ec @candlerb committed Jan 28, 2011
Showing with 26 additions and 226 deletions.
  1. +26 −226 alternatives.markdown
@@ -1,234 +1,34 @@
-Towards an alternative approach
-===============================
+I've removed what I wrote here originally, because basically python 3.0 has
+got it right, and you might as well read about it here:
+http://docs.python.org/release/3.0.1/whatsnew/3.0.html#text-vs-data-instead-of-unicode-vs-8-bit
-The top three issues I have with ruby 1.9 are:
+But in summary:
-1. It doesn't add anything to simple expressions like "a << b" except that
- they can now crash under certain input conditions - and this is the sort of
- thing which is difficult to pick up in unit testing.
+* There are two types of object: strings (unicode text) and bytes (data).
+* The two are incompatible. For example, you cannot concatenate strings and
+ bytes, and they always compare as different.
+* A string is a sequence of unicode characters. There is no 'encoding'
+ associated with it, because characters exist independently of their encoded
+ representation.
+* When you convert between text and data (i.e. the external representation
+ of that text), then you specify what encoding to use. The default is
+ picked up from LANG unless you override it, as in ruby 1.9.
+* When you open a file, you open it in either text or binary mode (r/rb),
+ and what you get when you read it is either strings or bytes respectively.
- Similarly, regular expression matches are more likely to crash given
- unexpected or malformed input.
+This to me is hugely sensible and logical. Some practical consequences are:
-2. There are myriad rules and inconsistencies - e.g. that data read from a
- File defaults to the environment locale, but data read from a Socket
- defaults to ASCII-8BIT.
+1. If there's a problem with your program, it will crash early and
+*consistently* (e.g. if you opened a file in binary mode and tried to treat
+it as text, or vice versa).
-3. The same program and data can behave differently on different machines,
- dependent on the environment locale.
+Ruby may run OK if you feed it some data (e.g. data which happens to be
+ASCII-only) but crash when you feed it something else, as the sketch below
+illustrates.
-This last point is in some cases the desired behaviour, because tools like
-'sed' also behave in this way:
+2. Strings have no encoding dimension, so both program analysis and unit
+testing are totally straightforward.
- $ echo "über" >/tmp/str
- $ sed -e 's/.//' /tmp/str
- ber
- $ env LC_ALL=C sed -e 's/.//' /tmp/str
- �ber
+3. External libraries need only document whether they accept (and return)
+strings or bytes. If you make a wrong assumption, again your program will
+crash early and you can immediately fix it.
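To make that contrast concrete, here is a minimal ruby 1.9 illustration (the
string values are invented for the example):

    # encoding: utf-8
    msg = "résumé"                                 # tagged UTF-8
    msg << "ok"                                    # ASCII-only operand: works
    msg << "caf\xe9".force_encoding("ISO-8859-1")  # raises Encoding::CompatibilityError

    # Likewise, a regexp match on malformed input:
    "\xfcber".force_encoding("UTF-8") =~ /./       # raises ArgumentError:
                                                   #   invalid byte sequence in UTF-8

Python 3 fails at the same point every time, because mixing strings and bytes
is always an error rather than an error only for some byte values.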
-But note that:
-
-1. sed is inherently a text-processing tool, whereas ruby is often used
- for processing binary data;
-
-2. sed doesn't crash when given invalid input:
- $ echo -e "\xfcber"
- �ber
- $ echo -e "\xfcber" >/tmp/str
- $ sed -e 's/.//' /tmp/str
- �er
-
-3. sed doesn't need to introduce its own library of encodings, but just
- uses the facilities provided by the underlying OS. (Having said that, I
- don't know exactly *how* sed deals with encodings. Does it handle UTF-8
- specially? Are other encodings mbrtowc'd or iconv'd, and then converted
- back again for output?)
-
-Anyway, given all this, how do I think ruby should have dealt with the issue
-of encodings?
-
-
-Option 0: Don't tag strings
----------------------------
-
-What I mean is, leave Strings as one-dimensional arrays of bytes, not tagged
-with any encoding. This basically rolls things back to ruby 1.8, and this
-is what I'm sticking with.
-
-For people who want to use non-ASCII text, make them work with UTF-8.
-There is regexp support for this in 1.8 already. Some extra methods could
-be added to make life more convenient, e.g.
-
-* counting the number of characters: `str.charsize`
-* extracting characters: `str.substr(0..50)`
-* transcoding: `str.encode("ISO-8859-1", "UTF-8")`
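For what it's worth, the first two could be roughed out on top of plain byte
strings in ruby 1.8 itself. The method names are the hypothetical ones above;
only unpack("U*") and the /u regexp flag are real:

    class String
      def charsize
        unpack("U*").size            # number of UTF-8 characters, not bytes
      end

      def substr(range)
        scan(/./mu)[range].join      # slice by character, not by byte
      end
    end

    "über".charsize       # => 4
    "über".substr(0..1)   # => "üb"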
-
-
-Option 1: binary and UTF-8
---------------------------
-
-Python 3.0 and Erlang both have two distinct data structures, one for binary
-data and one for text. In ruby this could be implemented as two classes, e.g.
-String and Binary, or as a String with a one-bit binary flag.
-
-You'd need some way to distinguish a binary literal from a string one,
-maybe just Binary.new(...)
-
-The main difference between this and option 0 is that [], length, chop etc
-would work differently on binaries and strings, whereas option 0 above would
-have different methods like String#substr, String#charsize, String#charchop
-etc.
-
-TODO: flesh out the various cases like what happens when combining String
-and Binary.
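As a starting sketch for that TODO, one hypothetical shape such a Binary class
could take (none of this is real ruby core API):

    class Binary
      def initialize(bytes)
        @bytes = bytes.dup.force_encoding("BINARY")
      end

      def length
        @bytes.bytesize              # length is a byte count, unlike String#length
      end

      def [](index)
        @bytes.getbyte(index)        # indexing yields byte values, not characters
      end

      def +(other)
        raise TypeError, "can't mix Binary and String" unless other.is_a?(Binary)
        Binary.new(@bytes + other.to_s)
      end

      def to_s
        @bytes
      end
    end

String + Binary would then fail with TypeError in both directions, which is
exactly the incompatibility this option wants.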
-
-Going with either option 0 or 1 would eliminate most of the complexity
-inherent in ruby 1.9.
-
-All non-UTF-8 data would be transcoded at the boundary (something which is
-needed for stateful encodings like ISO-2022-JP anyway).
-
-What you'd lose is the ability to handle things like EUC-JP and GB2312
-"natively", without transcoding to UTF-8 and back again. Is that important?
-Aren't these "legacy" character sets anyway? If it is important, you could
-still have an external library for dealing with them natively.
-
-UTF-16 and UTF-32 would also need transcoding, but this is lossless.
-
-You'd lose the ability to write ruby scripts in non-UTF-8 character sets,
-but on the plus side, all the rules for #encoding tags would no longer be
-needed. Note that ruby 1.9 requires constants to start with capital 'A' to
-'Z', so it's not possible to write programs entirely in non-Roman scripts
-anyway.
-
-Programs which use non-UTF-8 data would have to be written to take this into
-account. e.g.
-
- File.open("/path/to/data", "r:IBM437") # transcode to UTF-8
- File.open("/path/to/data2", "w:IBM437") # transcode from UTF-8
-
-I have no objection to making "r:locale" and "w:locale" available, but IMO
-that should not be the default.
-
-
-Option 2: Band-aids
--------------------
-
-Given that so much effort has been invested in tagging strings throughout
-ruby 1.9, and the huge loss of face which would be involved in reversing
-that decision, I don't expect this ever to happen.
-
-So could we apply some tweaks to the current system to make it more
-reasonable? Here are some options.
-
-* When opening a text file ("r" or "w" as opposed to "rb" or "wb") then
- make the external encoding default to UTF-8. If you want it to be
- different then use "r:<encoding>" or "r:locale" when opening a file.
-
- Or even make it default to US-ASCII, like source encodings do. This
- is consistent and *forces* people to decide whether to open a file as
- UTF-8, some other encoding, or guess from the locale.
-
- (Making both files and source encodings default to UTF-8 is perhaps
- more helpful though)
-
-* Have a universally-compatible "BINARY" encoding. Any operation between
- BINARY and FOO gives encoding BINARY, and transcoding between BINARY and
- any other encoding is a null operation.
-
-* Treat invalid characters in the same way as String#[] does, i.e. never
- raise an exception. In particular, regexp matching always succeeds.
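For comparison, the first of these is close to something you can already opt
into with real ruby 1.9 API, by overriding the locale-derived default yourself
(the filename here is made up):

    Encoding.default_external = Encoding::UTF_8       # instead of the locale

    File.open("data.txt", "r")            { |f| f.external_encoding }
    # => #<Encoding:UTF-8>, regardless of LANG
    File.open("data.txt", "r:ISO-8859-1") { |f| f.external_encoding }
    # => #<Encoding:ISO-8859-1>, an explicit per-file override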
-
-Whilst this may make programs less fragile, it could also end up with the
-set of rules for Strings becoming even more complex, not less.
-
-
-Option 3: Automatic transcoding
--------------------------------
-
-There seems to me to be little benefit in having every single String in your
-program tagged with an encoding, if the main thing it does is introduce
-opportunities for Ruby to raise exceptions when these strings encounter each
-other.
-
-But if Ruby transcoded strings automatically as required, this might actually
-become useful.
-
-Consider: I'm building up a string of UTF-8 characters. Then I append a
-string of ISO-8859-1. Hey presto, it is converted to UTF-8 and appended, and
-the program continues happily. Ditto when interpolating:
-
- "This error message incorporates #{str1} and #{str2}"
-
-where str1 and str2 could be of different encodings. They would both be
-transcoded to the source encoding of the outer string.
-
-Proposed rules:
-
-* Everything is compatible with everything else, by definition.
-
-* If I combine str1 (encoding A) with str2 (encoding B), then str2 is
-transcoded to encoding A before the operation starts, and the result is of
-encoding A (a sketch of this appears after this list).
-
-* If I match str (encoding S) with regexp (encoding R), then *regexp* is
-transcoded to encoding S automatically.
-
- Consider, for example, that
-
- str =~ /abc/
-
- would work even if str were in a wide encoding like UTF-16BE, which
- would contain "\x00a\x00b\x00c", because the regexp has been transcoded
- to a UTF-16BE regexp behind the scenes.
-
- For efficiency, multiple encoding representations of the same regexp
- could be stored as a cache inside the regexp object itself, generated
- on demand.
-
-* Have a binary regexp /./n which matches one *byte* always (whereas /./
-would match one *character* in the source string, in the source's encoding)
-
-* Transcoding errors could still occur, but I think these should normally
-default to substituting a ? character. If you want to raise an exception
-then use the `encode` or `encode!` methods with appropriate arguments to
-request this behaviour.
-
-* Transcoding to or from ASCII-8BIT is a null operation, so if you are
-working with binary all you need to do is ensure one of your arguments is
-ASCII-8BIT.
-
-* As another example:
-
- s2 = s1.tr("as","Aß")
-
- would first transcode both "as" and "Aß" to the encoding of s1, before
- running the tr method. This would therefore work even if s1 were in a wide
- encoding, and s2 would still be in a wide encoding. It would also work if s1
- were in ISO-8859-1 but the source encoding of the file were UTF-8, since
- both have a representation for "ß".
-
-* To be fully consistent, transcoding should take place on output as well as
-input. e.g. if STDIN's external encoding is taken from the locale, then
-STDOUT's external encoding should also be taken from the locale. Writing
-a string tagged as ISO-8859-1 to STDOUT should transcode it automatically.
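To show that the combining and regexp rules only need machinery ruby 1.9
already has, here is a rough sketch. The helper names combine/match are made
up; String#encode and Regexp.new are real API, and :undef => :replace
substitutes a replacement character rather than raising, as the
transcoding-error rule above suggests:

    # encoding: utf-8
    def combine(a, b)
      a + b.encode(a.encoding, :invalid => :replace, :undef => :replace)
    end

    def match(str, re)
      str =~ Regexp.new(re.source.encode(str.encoding), re.options)
    end

    s1 = "café double".encode("ISO-8859-1")
    combine(s1, "über").encoding    # => ISO-8859-1: the result takes s1's encoding
    match(s1, /é/)                  # => 3, whereas s1 =~ /é/ raises
                                    #    Encoding::CompatibilityError today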
-
-There are some issues to consider though. For example, what happens for
-`str1<=>str2` where they are of different encodings? Do we transcode str2
-to str1's encoding just for the comparison, and then throw that
-representation away? This could make repeated comparisons (e.g. for
-sorting) very expensive. Perhaps the alternative representations need to
-be cached, similar to the ascii_only? flag.
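One possible shape for that cache, purely as an illustration (CachedString is
an invented name, not a proposal for core):

    class CachedString
      def initialize(str)
        @str  = str
        @reps = { str.encoding => str }    # alternative representations, by encoding
      end

      def in_encoding(enc)
        @reps[enc] ||= @str.encode(enc)    # transcode once, reuse on later comparisons
      end

      def <=>(other)
        @str <=> other.in_encoding(@str.encoding)
      end
    end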
-
-I haven't worked this all the way through, but I believe you would end up
-with a much simpler set of rules for combining strings of different
-encodings. Encoding.compatible? would be dropped completely, and
-String#ascii_only? would become purely an optimisation (so that transcoding
-becomes a null operation in common cases)
-
-It would also be much easier to reason about encodings, because for an
-expression like s3 = s1 + s2 the encoding of s3 will always be the encoding
-of s1.
-
-This does introduce some asymmetry, but it's still got to be better than
-raising exceptions. It's quite a fundamental change though.
