Replace alternatives.markdown, python 3 got it right
candlerb committed Jan 28, 2011
1 parent 7f55690 commit 9fffab3
Showing 1 changed file with 26 additions and 226 deletions.
alternatives.markdown
@@ -1,234 +1,34 @@
Towards an alternative approach
===============================
I've removed what I wrote here originally, because basically python 3.0 has
got it right, and you might as well read about it here:
http://docs.python.org/release/3.0.1/whatsnew/3.0.html#text-vs-data-instead-of-unicode-vs-8-bit

But in summary:

* There are two types of object: strings (unicode text) and bytes (data).

* The two are incompatible. For example, you cannot concatenate strings and
bytes, and they always compare as different.

* A string is a sequence of unicode characters. There is no 'encoding'
associated with it, because characters exist independently of their encoded
representation.

* When you convert between text and data (i.e. the external representation
of that text), then you specify what encoding to use. The default is
picked up from LANG unless you override it, as in ruby 1.9.

* When you open a file, you open it in either text or binary mode (r/rb),
and what you get when you read it is either strings or bytes respectively.

This to me is hugely sensible and logical. Some practical consequences are:

1. If there's a problem with your program, it will crash early and
*consistently* (e.g. if you opened a file in binary mode and tried to treat
it as text, or vice versa).

Ruby may run OK if you feed it some data (e.g. data which happens to be
ASCII-only) but crash when you feed it something else.

2. Strings have no encoding dimension, so both program analysis and unit
testing are totally straightforward.

3. External libraries need only document whether they accept (and return)
strings or bytes. If you make a wrong assumption, again your program will
crash early and you can immediately fix it.

The top three issues I have with ruby 1.9 are:

1. It doesn't add anything to simple expressions like "a << b" except that
they can now crash under certain input conditions - and this is the sort of
thing which is difficult to pick up in unit testing. (A concrete example
appears below, after the notes on sed.)

Similarly, regular expression matches are more likely to crash given
unexpected or malformed input.

2. There are myriad rules and inconsistencies - e.g. that data read from a
File defaults to the environment locale, but data read from a Socket
defaults to ASCII-8BIT.

3. The same program and data can behave differently on different machines,
dependent on the environment locale.

This last point is in some cases the desired behaviour, because tools like
'sed' also behave in this way:

    $ echo "über" >/tmp/str
    $ sed -e 's/.//' /tmp/str
    ber
    $ env LC_ALL=C sed -e 's/.//' /tmp/str
    �ber

But note that:

1. sed is inherently a text-processing tool, whereas ruby is often used
for processing binary data;

2. sed doesn't crash when given invalid input:

        $ echo -e "\xfcber"
        �ber
        $ echo -e "\xfcber" >/tmp/str
        $ sed -e 's/.//' /tmp/str
        �er

3. sed doesn't need to introduce its own library of encodings, but just
uses the facilities provided by the underlying OS. (Having said that, I
don't know exactly *how* sed deals with encodings. Does it handle UTF-8
specially? Are other encodings mbrtowc'd or iconv'd, and then converted
back again for output?)
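
To make the first of the ruby 1.9 issues above concrete, here is a rough
sketch of the kind of thing that bites you (assumes ruby 1.9 or later; the
`append` helper is purely for illustration):

    # encoding: utf-8
    def append(a, b)
      a << b
    end

    utf8 = "résumé"                              # UTF-8, non-ASCII

    append(utf8, "ok".encode("ISO-8859-1"))      # works: this data happens to be ASCII-only

    begin
      append(utf8, "café".encode("ISO-8859-1"))  # same expression, non-ASCII data
    rescue Encoding::CompatibilityError => e
      puts e.message   # incompatible character encodings: UTF-8 and ISO-8859-1
    end

The same `a << b` passes or raises depending purely on the bytes that happen
to flow through it, which is exactly what makes it hard to catch in unit
tests.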

Anyway, given all this, how do I think ruby should have dealt with the issue
of encodings?


Option 0: Don't tag strings
---------------------------

What I mean is, leave Strings as one-dimensional arrays of bytes, not tagged
with any encoding. This basically rolls things back to ruby 1.8, and this
is what I'm sticking with.

For people who want to use non-ASCII text, make them work with UTF-8.
There is regexp support for this in 1.8 already. Some extra methods could
be added to make life more convenient, e.g.

* counting the number of characters: `str.charsize`
* extracting characters: `str.substr(0..50)`
* transcoding: `str.encode("ISO-8859-1", "UTF-8")`
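
A rough sketch of how these convenience methods might be layered on top of
ruby 1.8's byte strings, assuming UTF-8 text and the bundled Iconv library
(the method bodies here are only illustrative, not a worked-out design):

    require 'iconv'

    class String
      # Number of characters rather than bytes, assuming the bytes are UTF-8.
      def charsize
        scan(/./mu).size
      end

      # Extract a character range rather than a byte range.
      def substr(range)
        scan(/./mu)[range].join
      end

      # Transcode between encodings; the strings themselves stay untagged.
      def encode(to, from = "UTF-8")
        Iconv.conv(to, from, self)
      end
    end

    s = "über alles"
    s.charsize       # => 10 characters (11 bytes)
    s.substr(0..3)   # => "über"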


Option 1: binary and UTF-8
--------------------------

Python 3.0 and Erlang both have two distinct data structures, one for binary
data and one for UTF-8 text. This could be implemented as two classes, e.g.
String and Binary, or as a String with a one-bit binary flag.

You'd need some way to distinguish a binary literal from a string one,
maybe just Binary.new(...)

The main difference between this and option 0 is that [], length, chop etc
would work differently on binaries and strings, whereas option 0 above would
have different methods like String#substr, String#charsize, String#charchop
etc.
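
For example, a minimal sketch of what a separate Binary class might look
like (the class name and method set are just assumptions for illustration):

    class Binary
      def initialize(bytes)
        @bytes = bytes.unpack("C*")    # keep an array of byte values, no encoding
      end

      def length
        @bytes.size                    # counts bytes, not characters
      end

      def [](index)
        @bytes[index]                  # indexing returns a byte value
      end

      def <<(other)
        # Strings and Binaries are deliberately incompatible, so mistakes
        # surface immediately rather than depending on the data.
        raise TypeError, "can't append String to Binary" if other.is_a?(String)
        @bytes.concat(other.to_a)
        self
      end

      def to_a
        @bytes.dup
      end
    end

    b = Binary.new("\xfcber")   # four raw bytes, not text
    b.length                    # => 4
    b[0]                        # => 252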

TODO: flesh out the various cases like what happens when combining String
and Binary.

Going with either option 0 or 1 would eliminate most of the complexity
inherent in ruby 1.9.

All non-UTF-8 data would be transcoded at the boundary (something which is
needed for stateful encodings like ISO-2022-JP anyway).

What you'd lose is the ability to handle things like EUC-JP and GB2312
"natively", without transcoding to UTF-8 and back again. Is that important?
Aren't these "legacy" character sets anyway? If it is important, you could
still have an external library for dealing with them natively.

UTF-16 and UTF-32 would also need transcoding, but this is lossless.
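
A quick check of that, using behaviour ruby 1.9 already has:

    s = "über ß"
    s == s.encode("UTF-16BE").encode("UTF-8")   # => true: nothing is lost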

You'd lose the ability to write ruby scripts in non-UTF-8 character sets,
but on the plus side, all the rules for #encoding tags would no longer be
needed. Note that ruby 1.9 requires constants to start with capital 'A' to
'Z', so it's not possible to write programs entirely in non-Roman scripts
anyway.

Programs which use non-UTF-8 data would have to be written to take this into
account. e.g.

File.open("/path/to/data", "r:IBM437") # transcode to UTF-8
File.open("/path/to/data2", "w:IBM437") # transcode from UTF-8

I have no objection to making "r:locale" and "w:locale" available, but IMO
that should not be the default.


Option 2: Band-aids
-------------------

Given that so much effort has been invested in tagging strings throughout
ruby 1.9, and the huge loss of face which would be involved in reversing
that decision, I don't expect this ever to happen.

So could we apply some tweaks to the current system to make it more
reasonable? Here are some options.

* When opening a text file ("r" or "w" as opposed to "rb" or "wb") then
make the external encoding default to UTF-8. If you want it to be
different then use "r:<encoding>" or "r:locale" when opening a file (see
the sketch after this list).

Or even make it default to US-ASCII, like source encodings do. This
is consistent and *forces* people to decide whether to open a file as
UTF-8, some other encoding, or guess from the locale.

(Making both files and source encodings default to UTF-8 is perhaps
more helpful though)

* Have a universally-compatible "BINARY" encoding. Any operation between
BINARY and FOO gives encoding BINARY, and transcoding between BINARY and
any other encoding is a null operation.

* Treat invalid characters in the same way as String#[] does, i.e. never
raise an exception. In particular, regexp matching always succeeds.

Whilst this may make programs less fragile, it could also end up with the
set of rules for Strings becoming even more complex, not less.
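
As a sketch of the first band-aid, using mode strings that already exist in
ruby 1.9 (only what a bare "r" defaults to would change):

    # create a small sample file; the string's own bytes (UTF-8) are written
    File.open("notes.txt", "w") { |f| f.write("über\n") }

    File.open("notes.txt", "r:UTF-8")      { |f| f.read }   # explicit UTF-8
    File.open("notes.txt", "r:ISO-8859-1") { |f| f.read }   # some other explicit choice
    File.open("notes.txt", "r")            { |f| f.read }   # today: follows the locale;
                                                             # under this proposal: UTF-8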


Option 3: Automatic transcoding
-------------------------------

There seems to me to be little benefit in having every single String in your
program tagged with an encoding, if the main thing it does is introduce
opportunities for Ruby to raise exceptions when these strings encounter each
other.

But if Ruby transcoded strings automatically as required, this might actually
become useful.

Consider: I'm building up a string of UTF-8 characters. Then I append a
string of ISO-8859-1. Hey presto, it is converted to UTF-8 and appended, and
the program continues happily. Ditto when interpolating:

"This error message incorporates #{str1} and #{str2}"

where str1 and str2 could be of different encodings. They would both be
transcoded to the source encoding of the outer string.
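
For instance, here is a monkey-patch sketch of the behaviour just described
(this is not how ruby 1.9 behaves today; it only illustrates the proposed
semantics, and it is shown with explicit + rather than interpolation, since
interpolation can't easily be patched from ruby code):

    # encoding: utf-8
    class String
      alias_method :orig_plus, :+

      def +(other)
        # transcode the right-hand operand to the receiver's encoding first
        other = other.encode(encoding) if other.encoding != encoding
        orig_plus(other)
      end
    end

    str1 = "héllo"                          # UTF-8 (the source encoding)
    str2 = "wörld".encode("ISO-8859-1")     # a different encoding

    msg = "This error message incorporates " + str1 + " and " + str2
    msg.encoding    # => UTF-8: str2 was silently transcoded, nothing raised

With a rule like this, `s3 = s1 + s2` always ends up with the encoding of
s1, as discussed at the end of this section.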

Proposed rules:

* Everything is compatible with everything else, by definition.

* If I combine str1 (encoding A) with str2 (encoding B), then str2 is
transcoded to encoding A before the operation starts, and the result is of
encoding A.

* If I match str (encoding S) with regexp (encoding R), then *regexp* is
transcoded to encoding S automatically (see the sketch after this list).

Consider, for example, that

        str =~ /abc/

would work even if str were in a wide encoding like UTF-16BE, which
would contain "\x00a\x00b\x00c", because the regexp has been transcoded
to a UTF-16BE regexp behind the scenes.

For efficiency, multiple encoding representations of the same regexp
could be stored as a cache inside the regexp object itself, generated
on demand.

* Have a binary regexp /./n which matches one *byte* always (whereas /./
would match one *character* in the source string, in the source's encoding)

* Transcoding errors could still occur, but I think these should normally
default to substituting a ? character. If you want to raise an exception
then use the `encode` or `encode!` methods with appropriate arguments to
request this behaviour.

* Transcoding to or from ASCII-8BIT is a null operation, so if you are
working with binary all you need to do is ensure one of your arguments is
ASCII-8BIT.

* As another example:

        s2 = s1.tr("as","Aß")

would first transcode both "as" and "Aß" to the encoding of s1, before
running the tr method. This would therefore work even if s1 were in a wide
encoding, and s2 would still be in a wide encoding. It would also work if s1
were in ISO-8859-1 but the source encoding of the file were UTF-8, since
both have a representation for "ß".

* To be fully consistent, transcoding should take place on output as well as
input. e.g. if STDIN's external encoding is taken from the locale, then
STDOUT's external encoding should also be taken from the locale. Writing
a string tagged as ISO-8859-1 to STDOUT should transcode it automatically.
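
Here is a small sketch of the regexp rule above, written as a helper
function since today's ruby raises instead. The helper name is made up, and
a non-ASCII-compatible encoding like UTF-16BE can't actually be used for
regexps in current ruby, so ISO-8859-1 stands in:

    # encoding: utf-8
    def match_in_subject_encoding(str, pattern_source)
      # transcode the pattern into the subject string's encoding, then match
      Regexp.new(pattern_source.encode(str.encoding)) =~ str
    end

    s = "über alles".encode("ISO-8859-1")

    # /über/ =~ s raises Encoding::CompatibilityError in ruby 1.9;
    # with the pattern transcoded first, the match just works:
    match_in_subject_encoding(s, "über")   # => 0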

There are some issues to consider though. For example, what happens for
`str1<=>str2` where they are of different encodings? Do we transcode str2
to str1's encoding just for the comparison, and then throw that
representation away? This could make repeated comparisons (e.g. for
sorting) very expensive. Perhaps the alternative representations need to
be cached, similar to the ascii_only? flag.

I haven't worked this all the way through, but I believe you would end up
with a much simpler set of rules for combining strings of different
encodings. Encoding.compatible? would be dropped completely, and
String#ascii_only? would become purely an optimisation (so that transcoding
becomes a null operation in common cases).

It would also be much easier to reason about encodings, because for an
expression like s3 = s1 + s2 the encoding of s3 will always be the encoding
of s1.

This does introduce some asymmetry, but it's still got to be better than
raising exceptions. It's quite a fundamental change though.
