EEP: 10
Title: Representing Unicode characters in Erlang
Version: $Id: unicode_in_erlang.txt,v 1.12 2008/11/12 14:29:51 pan Exp $
Last-Modified: $Date$
Author: Patrik Nyblom
Status: Draft
Type: Standards Track
Content-Type: text/x-rst
Created: 07-05-2008
Erlang-Version: R12B-4
Post-History: 01-jan-1970
Abstract
========
This EEP suggests a standard representation of Unicode [2]_ characters in
Erlang, as well as the basic functionality to deal with them.
Motivation
==========
As Unicode characters are more widely used, the need for a common
representation of Unicode characters in Erlang arises. Up until now,
the Erlang programmer writing Unicode programs has had to decide on
his or her own representation and has had little or no help from the
standard libraries.
Implementing functions in the libraries dealing with all possible
combinations and variants of Unicode representation in Erlang is
considered both extremely time consuming and confusing to the future
user of the standard library.
One common representation, dealing with both binaries and lists, is
therefore desirable, making Unicode handling in the standard libraries
easier to implement and giving a more consistent result.
Once the representation is agreed upon, implementation can be done
incrementally. This EEP only outlines the most basic functionality the
system should provide. The Unicode support is by no means complete if
this EEP is implemented, but implementation will be feasible.
The EEP also suggests library functions and bit syntax to deal with
alternative encodings. However, one *standard* encoding is suggested,
which will be what library functions in Erlang are expected to
support, while other representations are supported only in terms of
conversion.
Rationale
=========
Preconditions
-------------
Erlang traditionally represents text strings as lists of bytes (8-bit
entities), where the characters are encoded in ISO-8859-1 (latin1).
As the use of Unicode characters becomes more widespread, the demand
for a common view of how to represent Unicode characters in Erlang
arises.
Unicode is a character encoding standard where all known, living and
historical written languages are represented in one single character
set, which of course results in characters demanding more than eight
bits each for representation.
Lists
-----
Regardless of the representation, the Unicode character set is a
superset of the latin1 character set, while latin1 in its turn is a superset
of the traditional 7-bit US-ASCII character set. Representing Unicode
characters in Erlang lists is therefore quite naturally done by
allowing characters in lists to take on values higher than 255.
Therefore a Unicode string can, in Erlang, be conveniently stored
as a list where each element represents one single Unicode
character. The following list:
.. _ex1:
[1050,1072,1082,1074,
1086,32,1077,32,85,110,105,99,111,100,101,32,63]
\- would represent the Bulgarian translation of "What is Unicode ?" (which
looks something like "KAKBO e Unicode ?" with only the last part
in latin letters). The last part
([32,85,110,105,99,111,100,101,32,63]) is plain latin1 as the string
"Unicode ?" is written in latin letters, while the first part contains
characters that cannot be represented in a single byte. In essence, the
string is encoded in the Unicode encoding standard UTF-32, one
32-bit entity for each character, which is more than sufficient for one
Unicode character per position.
However, the currently most common representation of Unicode
characters is UTF-8 [1]_, in which the characters are stored in one to
four 8-bit entities organized in such a way that plain 7-bit US-ASCII is
untouched, while characters 128 and upwards are split over more than
one byte. The advantage of this coding is that e.g. characters having
a meaning to the file/operating system are kept intact and that many
strings in western languages do not occupy more space when transformed
into Unicode. In such an encoding, the above mentioned Bulgarian
string (ex1_) would be represented as the list
[208,154,208,176,208,186,208,178,208,
190,32,208,181,32,85,110,105,99,111,100,101,32,63], where the first
part, containing the Bulgarian script letters, occupies more bytes per
character, while the trailing part "Unicode ?" is identical to the
plain and more intuitive encoding of one character per list element.
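The byte values above can be reproduced with a small hand-written
encoder. The following is only a minimal sketch and not part of this
proposal: the module name utf8_sketch and its function names are
hypothetical, and surrogate code points are deliberately not rejected
in order to keep the example short::

    -module(utf8_sketch).
    -export([encode/1]).

    %% Encode a flat list of Unicode code points as a flat list of
    %% UTF-8 bytes (one to four bytes per code point).
    encode(CodePoints) ->
        lists:flatmap(fun encode_char/1, CodePoints).

    %% One byte: 7-bit US-ASCII is passed through untouched.
    encode_char(C) when is_integer(C), C >= 0, C < 16#80 ->
        [C];
    %% Two bytes: 110xxxxx 10xxxxxx
    encode_char(C) when is_integer(C), C >= 16#80, C < 16#800 ->
        [16#C0 bor (C bsr 6),
         16#80 bor (C band 16#3F)];
    %% Three bytes: 1110xxxx 10xxxxxx 10xxxxxx
    encode_char(C) when is_integer(C), C >= 16#800, C < 16#10000 ->
        [16#E0 bor (C bsr 12),
         16#80 bor ((C bsr 6) band 16#3F),
         16#80 bor (C band 16#3F)];
    %% Four bytes: 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx
    encode_char(C) when is_integer(C), C >= 16#10000, C =< 16#10FFFF ->
        [16#F0 bor (C bsr 18),
         16#80 bor ((C bsr 12) band 16#3F),
         16#80 bor ((C bsr 6) band 16#3F),
         16#80 bor (C band 16#3F)].

Applied to the Bulgarian string (ex1_), utf8_sketch:encode/1 yields
the 23-element list shown above: each of the six Cyrillic letters
becomes two bytes, while the eleven US-ASCII characters are unchanged.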
In spite of being less intuitive, the UTF-8 encoding is the one most
widely spread and supported by operating systems and terminal
emulators. UTF-8 is therefore the most convenient way to communicate
text to external entities (files, drivers, terminals and so on).
When dealing with lists in Erlang, the advantages of using one list
element per character seem to be greater than the advantage of not
having to convert a UTF-8 character string before e.g. printing
it on a terminal. This is especially true as the current Erlang
implementation allows for all current Unicode characters to
occupy the same memory space as a latin1 character would (bearing in
mind that each character is represented as an integer and the list
element can contain integers up to 16#7ffffff on 32-bit
implementations, which is far larger than the largest current Unicode
character 16#10ffff). A further advantage is that routines like
io:format can easily cope with latin1 characters and Unicode
characters alike, as the first 256 characters of Unicode happen
to correspond exactly to the latin1 character set. It would seem that
lists have a very natural way of dealing with Unicode characters.
Binaries
--------
Binaries on the other hand would suffer greatly from a scheme where
every character is encoded with a fixed width capable of representing
numbers up to 16#10ffff. The standardized way of doing this would be
what's commonly referred to as UTF-32, i.e. one 32-bit word for each
character. Even a UTF-16 representation would guarantee to double the
memory requirements for all text strings encoded in binaries, while
UTF-8 would for most common cases be the most space-saving
representation.
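The space argument can be checked with a few lines of code. This is a
sketch under an assumption: it uses unicode:characters_to_binary/3 as
found in later Erlang/OTP releases, whereas this draft only proposes
such a function. The module name size_sketch is hypothetical::

    -module(size_sketch).
    -export([sizes/1]).

    %% Return the number of bytes a Unicode list occupies when stored
    %% in a binary as UTF-8, UTF-16 and UTF-32 respectively.
    sizes(UnicodeList) ->
        [{Enc, byte_size(unicode:characters_to_binary(UnicodeList,
                                                      unicode, Enc))}
         || Enc <- [utf8, utf16, utf32]].

For the 17-character Bulgarian string (ex1_) this gives 23, 34 and 68
bytes, and for the pure US-ASCII string "Unicode ?" it gives 9, 18 and
36 bytes.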
Binaries are often used to represent data to be sent
to external programs, which also speaks in favor of the UTF-8
representation.
There are however problems with the UTF-8 representation, most
obviously the fact that characters occupy a variable number of
positions (bytes) in the binary, so that traversal is somewhat more
tedious. An extension to the bit syntax where UTF-8 characters can be
matched in the head of a string conveniently would ease up the
situation, but as of today, no such primitives are present. UTF-8
encoded characters are also only backward compatible with 7-bit US-ASCII,
and there are only probabilistic approaches to determining if a
sequence of bytes represents Unicode characters encoded as UTF-8 or
plain latin1. A library function in Erlang therefore needs to be
informed about the way characters are encoded in a binary to be able
to interpret them correctly. A latin1 character above 127 will be
displayed incorrectly if written to a terminal set for displaying
UTF-8 encoded Unicode and vice versa. As a common example,
io:format("~s~n",[MyBinaryString]) would need to be informed about
whether the string is encoded in UTF-8 or latin1 to display it
correctly on a terminal.
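The ambiguity is easy to demonstrate: the same bytes decode to
different characters depending on the assumed encoding. The sketch
below assumes unicode:characters_to_list/2 as found in later Erlang/OTP
releases; the module name ambiguity_sketch is hypothetical::

    -module(ambiguity_sketch).
    -export([interpretations/1]).

    %% Decode the same binary twice: first assuming one latin1
    %% character per byte, then assuming UTF-8.
    interpretations(Bin) when is_binary(Bin) ->
        {unicode:characters_to_list(Bin, latin1),
         unicode:characters_to_list(Bin, utf8)}.

For the two bytes <<195,165>> the latin1 reading is the two characters
[195,165], while the UTF-8 reading is the single character [229].
Nothing in the binary itself tells which reading was intended.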
The formatting functions actually present a whole set of challenges
regarding Unicode characters. New formatting controls will be needed
to inform the formatting functions in the io and io_lib modules that
strings are in Unicode or that input is in UTF-8. This is however
solvable, as discussed below.
My conclusion so far is that as binaries are often used to save space
and commonly utilized when communicating with external entities, the
UTF-8 advantages seem to supersede the disadvantages in the
binary case. It therefore seems sensible to commonly encode Unicode
characters in binaries as UTF-8. Of course any
representation is possible, but UTF-8 would be the most common
case and can therefore be regarded as the Erlang standard
representation.
Combinations of lists and binaries
----------------------------------
To further complicate things, Erlang has the concept of
iolists (or iodata). An iolist is any (or almost any) combination of integers
and binaries representing a sequence of bytes, e.g.
[[85],110,[105,[99]],111,<<100,101>>] as a
representation of the string "Unicode". When sending data to drivers
and in many BIFs this rather convenient representation is accepted
(convenient when constructing, less convenient when traversing).
When dealing with Unicode strings, a similar abstraction would be
desirable, and with the above suggested conventions, that would mean
that a Unicode character string could be a list with any combination
of integers ranging from 0 to 16#10ffff and binaries with Unicode
characters encoded as UTF-8. Converting such data to a plain list or a
plain UTF-8 binary would be easily done as long as one knows how the
characters are encoded to begin with. It would however not necessarily
be an iolist. Furthermore, conversion functions need to be aware of
the original intention of the list to behave correctly. If one wants
to convert an iolist containing latin1 characters in both list part
and binary part to UTF-8, the list part cannot be misinterpreted, as
latin1 and Unicode are alike for all latin1 characters, but the binary
part can, as latin1 characters above 127 are encoded in two bytes if
the binary contains UTF-8 encoded characters, but only one byte when
latin1 encoding is used. The same of course holds for other encodings:
if a binary encoded in UTF-32 were to be converted to UTF-8, the process
would also differ from the process of converting latin1 characters.
If we stick with the idea of representing Unicode as one character per
list element in lists and as UTF-8 in binaries, we could have the
following definitions:
* A latin1 list: a list containing characters in the range 0..255
* A latin1 binary: a binary consisting of bytes each representing one
letter in the ISO-8859-1 character set.
* A Unicode list: a list consisting solely of integers in the range
0..16#10ffff.
* A Unicode binary: a binary with Unicode characters encoded
as UTF-8
* A mixed latin1 list: a possibly deep list containing any combination
of integers in the range 0..255 and latin1 binaries.
* A mixed Unicode list: a possibly deep list containing integers in the
range 0..16#10ffff and Unicode binaries.
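The difference between the two kinds of mixed list can be sketched as
follows. The example assumes unicode:characters_to_binary/1 as found in
later Erlang/OTP releases; the module name mixed_sketch and its
function names are hypothetical::

    -module(mixed_sketch).
    -export([latin1_to_binary/1, unicode_to_binary/1, is_iodata/1]).

    %% A mixed latin1 list is ordinary iodata, so the existing BIF
    %% can flatten it into a latin1 binary.
    latin1_to_binary(Mixed) ->
        iolist_to_binary(Mixed).

    %% A mixed Unicode list: integers 0..16#10ffff and UTF-8 binaries,
    %% flattened into one UTF-8 binary.
    unicode_to_binary(Mixed) ->
        unicode:characters_to_binary(Mixed).

    %% True if Term is accepted as iodata, which among other things
    %% requires every integer in it to be in the range 0..255.
    is_iodata(Term) ->
        try iolist_size(Term) of
            _ -> true
        catch
            error:badarg -> false
        end.

The mixed Unicode list [1050,<<208,176>>,[1082]] converts to the UTF-8
binary <<208,154,208,176,208,186>>, but mixed_sketch:is_iodata/1
returns false for it because of the integers above 255.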
Conversion routines
-------------------
Conversion functions between latin1 lists and latin1 binaries as well
as from mixed latin1 lists to latin1 binaries are already present in
the system as list_to_binary, binary_to_list, and iolist_to_binary.
Conversion between Unicode lists, Unicode binaries, and from
mixed Unicode lists could in a similar way be provided by functions like::
unicode:list_to_utf8(UM) -> Bin
Where UM is a mixed Unicode list and the result is a UTF-8 binary, and::
unicode:utf8_to_list(Bin) -> UL
Where Bin is a binary consisting of Unicode characters encoded as
UTF-8 and UL is a plain list of Unicode characters.
To allow for conversion to and from latin1 the functions::
unicode:latin1_list_to_utf8(LM) -> Bin
and::
unicode:latin1_list_to_unicode_list(LM) -> UL
would do the same job. Actually latin1_list_to_unicode_list is not necessary
in this context, as it is more of an iolist-function, but should be
present for completeness.
The fact that lists of integers representing latin1 characters are a
subset of the lists containing Unicode characters might however be more
confusing than useful to utilize when converting from mixed lists to
UTF-8 coded binaries. I think a good approach would be to
differentiate the functions dealing with latin1 characters and Unicode
so that mixed lists are expected to contain only numbers 0..255 if the
binaries are expected to contain latin1 bytes. For functions like
io:format, the same thing should be true, i.e. ~s means latin1 mixed
lists and ~ts means Unicode mixed lists (with binaries in
UTF-8). Passing a list with an integer > 255 to ~s would be an error
with this approach, just like passing the same thing to
latin1_list_to_utf8/1. See below for more discussions about the io system.
The list_to_utf8/1 and latin1_list_to_utf8/1 functions can be
combined into a single function, characters_to_binary/2, like this::
unicode:characters_to_binary(ML,InEncoding) -> binary()
ML := A mixed Unicode list or a mixed latin1 list
InEncoding := {latin1 | unicode}
The word "characters" is used to denote a possibly complex
representation of characters in the encoding concerned, like a short
word for "a possibly mixed and deep list of characters and/or binaries
in either latin1 representation or Unicode".
Giving latin1 as the encoding would mean that all of ML should be
interpreted as latin1 characters, implying that integers > 255 in the
list would be an error. Giving unicode as the encoding would mean that
all integers 0..16#10ffff are accepted and the binaries are expected
to already be UTF-8 coded.
In the same way, conversion to lists of Unicode characters could be done with a function::
unicode:characters_to_list(ML, InEncoding) -> list()
ML := A mixed Unicode list or a mixed latin1 list
InEncoding := {latin1 | unicode}
I think the approach of two simple conversion functions
characters_to_binary/2 and characters_to_list/2 is attractive, despite the fact
that certain combinations of in-data would be somewhat harder to
convert (e.g. combinations of Unicode characters > 255 in a list with
binaries in latin1). Extending the bit syntax to cope with UTF-8 would
make it easy to write special conversion functions to handle those
rare situations where the above mentioned functions cannot do the job.
To accommodate other encodings, the characters_to_binary functionality
could be extended to handle other encodings as well. A more general
functionality could be provided with the following functions
(preferably placed in their own module, the module name 'unicode'
being a good name-candidate):
**characters_to_binary(ML) -> binary() | {error, Encoded, Rest} | {incomplete, Encoded, Rest}**
Same as characters_to_binary(ML,unicode,unicode).
**characters_to_binary(ML,InEncoding) -> binary() | {error, Encoded, Rest} | {incomplete, Encoded, Rest}**
Same as characters_to_binary(ML,InEncoding,unicode).
**characters_to_binary(ML,InEncoding, OutEncoding) -> binary() | {error, Encoded, Rest} | {incomplete, Encoded, Rest}**
Types:
- ML := A mixed list of integers or binaries corresponding to the
InEncoding or a binary in the InEncoding
- InEncoding := { latin1 | unicode | utf8 | utf16 | utf32 }
- OutEncoding := { latin1 | unicode | utf8 | utf16 | utf32 }
- Encoded := binary()
- Rest := Mixed list as specified for ML.
The option 'unicode' is an alias for utf8, as this is the
preferred encoding for Unicode characters in binaries. Error tuples
are returned when the data cannot be encoded/decoded due to errors
in indata and incomplete tuples when the indata is possibly correct
but truncated.
**characters_to_list(ML) -> list() | {error, Encoded, Rest} | {incomplete, Encoded, Rest}**
Same as characters_to_list(ML,unicode).
**characters_to_list(ML,InEncoding) -> list() | {error, Encoded, Rest} | {incomplete, Encoded, Rest}**
Types:
- ML := A mixed list of integers or binaries corresponding to the
InEncoding or a binary in the InEncoding
- InEncoding := { latin1 | unicode | utf8 | utf16 | utf32 }
- Encoded := list()
- Rest := Mixed list as specified for ML.
Here also the option 'unicode' denotes the default Erlang encoding
of utf8 in binaries and is therefore an alias for utf8. Error- and
incomplete-tuples are returned in the same way as for
characters_to_binary.
Note that the datatypes returned upon success are well defined and
guard tests for them exist (is_list/1 and is_binary/1), which is why I suggest not
returning the clunky {ok, Data} tuples even though the error and
incomplete tuples can be returned. This makes the functions simpler to
use when the encoding is known to be correct, while return values can
still be checked easily.
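A caller that does want to check the result can separate the three
return shapes with a guard and two tuple patterns. This is a sketch
assuming unicode:characters_to_binary/1 with the return values listed
above, as found in later Erlang/OTP releases; the module name
convert_sketch and the atoms invalid and truncated are hypothetical::

    -module(convert_sketch).
    -export([to_utf8/1]).

    %% Convert mixed Unicode data to a UTF-8 binary, telling apart
    %% success, invalid input and truncated input.
    to_utf8(Chars) ->
        case unicode:characters_to_binary(Chars) of
            Bin when is_binary(Bin) ->
                {ok, Bin};
            {error, Encoded, Rest} ->
                %% Encoded holds what was converted before the fault.
                {error, {invalid, Encoded, Rest}};
            {incomplete, Encoded, Rest} ->
                %% The input ended in the middle of a character.
                {error, {truncated, Encoded, Rest}}
        end.

Code that trusts its input can skip the wrapper and use the returned
binary directly, which is the point of not wrapping the success value
in a tuple.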
Bit syntax
----------
Using Erlang bit syntax on binaries containing Unicode characters
in UTF-8 could be facilitated by a new type. The type name utf8 would
be preferable to utf-8, as dashes ("-") have special meaning in bit
syntax separating type, signedness, endianness and units.
The utf8 type in bit syntax matching would convert a UTF-8
coded character in the binary to an integer regardless of how many bytes it
occupies, leaving the trailing part of the binary to be matched against the
rest of the bit syntax matching expression.
When constructing binaries, an integer converted to UTF-8 could
consequently occupy between one and four bytes in the resulting binary.
As bit syntax is often used to interpret data from various external
sources, it would be useful to have corresponding utf16 and utf32
types as well. While UTF-8, UTF-16 and UTF-32 are easily interpreted
with the current bit syntax implementation, the suggested specific
types would be convenient for the programmer. Also Unicode imposes
restrictions in terms of range and has some forbidden ranges which are best
handled using a built in bit syntax type.
The utf16 and utf32 types need to have an endianness option, as UTF-16
and UTF-32 can be stored as big or little endian entities.
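With such a type, traversing and constructing UTF-8 binaries becomes
a matter of ordinary pattern matching. The sketch below assumes the
utf8 type in the form later adopted by Erlang/OTP; the module name
bits_sketch is hypothetical::

    -module(bits_sketch).
    -export([decode/1, encode/1]).

    %% Walk a UTF-8 binary one character at a time, regardless of how
    %% many bytes each character occupies. Input that is not valid
    %% UTF-8 matches no clause and raises function_clause.
    decode(<<Ch/utf8, Rest/binary>>) ->
        [Ch | decode(Rest)];
    decode(<<>>) ->
        [].

    %% Build a UTF-8 binary from a flat list of code points; each
    %% integer occupies between one and four bytes in the result.
    encode(CodePoints) ->
        << <<Ch/utf8>> || Ch <- CodePoints >>.

For example, bits_sketch:decode(<<208,154,32,63>>) returns [1050,32,63].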
Formatting functions
--------------------
Given a default Unicode character representation in Erlang, let's dig
deeper into the formatting functions. I suggest the concept of
formatting control sequence modifiers, an extra character between the
"~" and the control character, denoting Unicode input/output. The
letter "t" (for translate) is not used in any formatting functions
today, making it a good candidate. The meaning of the modifier should
be such that e.g. the formatting control "~ts" means a string in
Unicode while "~s" means a string in iso-latin-1. The reason for not
simply introducing a new single control character, is that the
suggested modifier can be applicable to various control characters,
like e.g. "p" or even "w", while a new single control character for
Unicode strings would only be a replacement for the current "s"
control character.
Although the io-protocol in Erlang from the beginning did not impose
any limit on what characters could be transferred between a client and
an io_server, demands for better performance
from the io-system in Erlang have made later
implementations use binaries for communication, which in practice has
made the io-protocol contain bytes, not general characters.
Furthermore, the fact that the io-system currently works with
characters that can be represented as bytes has been utilized in numerous
applications, so that output from io-functions (e.g. io_lib:format)
has been sent directly to entities only accepting byte input (like
sockets) or that io_servers have been implemented assuming only
character ranges of 0 - 255. Of course this can be changed, but such a
change might include lower performance from the io-system as well as
large changes to code already in production (aka "customer code").
The io-system in Erlang currently works around an assumption that data
is always a stream of bytes. Although this was not the original
intention, this is how it is used today. This means that a latin1
string can be sent to a terminal or a file in much the same way; there
will never be any conversion needed. This might not always hold for
terminals, but in case of terminals there is always one single
conversion needed, namely that from the byte-stream to whatever the
terminal likes. A disk-file is a stream of bytes as well as a terminal
is, at least as far as the Erlang io-system is concerned. Furthermore
the io_lib formatting function always returns (possibly) deep lists of
integers, each representing one character, making it hard to
differentiate between different encodings. The result is then sent as
is by functions like io:format to the io_server where it is finally
put on the disk. The servers also accept binaries, but they are never
produced by io_lib:format.
When Erlang starts supporting Unicode characters, the world changes a
little. A file might contain text in UTF-8 or in iso-latin-1 and there is
no telling from the list produced by e.g. io_lib:format
what the user originally intended.
Suggested solution
...................
To make a solution that as far as possible does not break current code
and also keeps (or reverts to) the original intention of the io-system
protocol, I suggest a scheme where the formatting functions that
return lists, keep to the current behavior as far as possible.
So the io_lib:format function returns a (possibly deep) list of
integers 0..255 (latin1, which can be viewed as a subset of Unicode)
if used without translation modifiers. If the translation modifiers
are used, it will however return a possibly deep list of integers in
the complete unicode range. Going back to the Bulgarian string (ex1_),
let's look at the following::
1> UniString = [1050,1072,1082,1074,
1086,32,1077,32,85,110,105,99,111,100,101,32,63].
2> io_lib:format("~s",[UniString]).
\- here the Unicode string violates the mixed latin1 list property and a
badarg exception will be raised. This behavior should be retained. On
the other hand::
3> io_lib:format("~ts",[UniString]).
\- would return a (deep) list with the Unicode string as a list of integers::
[[1050,1072,1082,1074,1086,32,1077,32,85,110,105,99,111,100,
101,32,63]]
The downside of introducing integers > 255 in the result list is of course
that the return value of the function is no longer valid iodata(), but on
the other hand, the following code::
lists:flatten(io_lib:format("~ts",[UniString]))
will give a result similar to that of a non-Unicode version.
As the format modifier "t" is new, the possibility to get integers >
255 in the resulting deep list will not break old code. To get
iodata() in UTF-8, one could simply do::
unicode:characters_to_binary(io_lib:format("~ts",[UniString]),
unicode, unicode)
As before, directly formatting (with ~s) a list of characters > 255
would be an error, but with the "t" modifier it would work.
When it comes to range checking and backward compatibility::
6> io:format(File,"~s",[UniString]).
\- would as before throw the badarg exception, while::
7> io:format(File,"~ts",[UniString]).
\- would be accepted.
The corresponding behavior of io:fread/2,3 would be to expect Unicode data in this call::
11> io:fread(File,'',"~ts").
\- but expect latin1 in this::
12> io:fread(File,'',"~s").
The actual io-protocol, on the other hand, should deal only with
Unicode, meaning that when data is converted to binaries for sending,
all data should be translated into UTF-8. When lists of integers are
used in communication, the latin1 and Unicode representations are the
same, so no conversion or restrictions apply. Recall that the
io-system is built so that characters should have one interpretation
regardless of the io-server. The only possible encoding would be a
Unicode one.
As we are communicating heavily between processes (the client and
server processes in the io-system), converting the data to Unicode
binaries (UTF-8) is the most efficient strategy for larger amounts of
data.
Generally, writing
to an io-server using the file-module will only be possible with
byte-oriented data, while using the io-module will work on Unicode
characters. Calling the function file\:write/2 will send the bytes to
the file as is, as files are byte-oriented, but when writing on a file
using the io-module, Unicode characters are expected and handled.
The io-protocol will make conversions of bytes into Unicode when
sending to io-servers, but if the file is byte-oriented, the
conversion back will make this transparent to the user. All bytes are
representable in UTF-8 and can be converted back and forth without hassle.
The incompatible change will have to be to the put_chars function in
io. It should only allow Unicode data, not iodata() as it is
documented to do now. The big change is that any binaries provided
to the function need to be in UTF-8. However, most usage of this
function is restricted to lists, so this incompatible change is
expected not to cause trouble for users.
To handle possible Unicode text data on a file, one should be able to
provide encoding parameters when opening a file. A file should by
default be opened for byte (or latin1) encoding, while the option to
open it for e.g. utf8 translation should be available.
Let's look at some examples:
Example 1 - common byte-oriented writing
........................................
A file is opened as usual with file\:open. We then want to write bytes
to it:
- Using file\:write with iodata() (bytes), the data is converted into
UTF-8 by the io-protocol, but the io-server will convert it back to
latin1 before actually putting the bytes on file. For better
performance, the file could be opened in raw mode, avoiding all
conversion.
- Using file\:write with data already converted to UTF-8 by the user,
the io-protocol will embed this in yet another layer of UTF-8
encoding, the file-server will unpack it and we will end up with the
UTF-8 bytes written to the file as expected.
- Using io:put_chars, the io-server will return an error if any of the
Unicode characters sent are not possible to represent in one
byte. Characters representable in latin1 will however be written
nicely even though they might be encoded as UTF-8 in binaries sent
to io:put_chars. As long as the io_lib:format function is used
without the translation-modifier, everything will be valid latin1
and all return values will be lists, so the result is both valid Unicode *and*
possible to write on a default file. Old code will function as
before, except when feeding io:put_chars with latin1 binaries; in
that case the call should be replaced with a file\:write call.
Example 2 - Unicode-oriented writing
....................................
A file is opened using a parameter telling that Unicode data should be
written in a defined encoding, in this case we'll select UTF-16/bigendian to
avoid mix-ups with the native UTF-8 encoding. We open the file with
file\:open(Name,[write,{encoding,utf16,bigendian}]).
- Using file\:write with iodata(), the io-protocol will convert into
the default Unicode representation (UTF-8) and send the data to the
io-server, which will in turn convert the data to UTF-16 and put it
on the file. The file is to be regarded as a text file and all
iodata() sent to it will be regarded as text.
- If the data is already in Unicode representation (say UTF-8) it
should not be written to this type of file using file\:write,
io:put_chars is expected to be used (which is not a problem as
Unicode data should not exist in old code and this is only a problem
when the file is opened to translate).
- If the data is in the Erlang default Unicode format, it can be
written to the file using io:put_chars. This works for all types of
lists with integers and for binaries in UTF-8; for other
representations (most notably latin1 in binaries) the data should be
converted using unicode:characters_to_XXX(Data,latin1) prior to
sending. For latin1 mixed lists (iodata()), file\:write can also be
used directly.
To sum up this case - Unicode strings (including latin1 lists) are
written to a converting file using io:put_chars, but pure iodata() can
also be implicitly converted to the encoding by using file\:write.
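Writing to a translating file can be sketched as below. Note the
assumption: the example uses the option spelling {encoding, {utf16, big}}
found in later Erlang/OTP releases, which differs from the
{encoding,utf16,bigendian} form suggested in this draft. The module
name write_sketch is hypothetical::

    -module(write_sketch).
    -export([write_utf16/2]).

    %% Write a mixed Unicode list to Path through a file opened for
    %% big-endian UTF-16 translation, then return the bytes that
    %% actually ended up in the file.
    write_utf16(Path, Chars) ->
        {ok, File} = file:open(Path, [write, {encoding, {utf16, big}}]),
        %% io:put_chars takes Unicode data; the io-server converts it
        %% to the encoding chosen when the file was opened.
        ok = io:put_chars(File, Chars),
        ok = file:close(File),
        {ok, Bytes} = file:read_file(Path),
        Bytes.

Writing [1050,<<208,176>>] this way leaves the four bytes
<<4,26,4,48>> in the file, i.e. the two characters 1050 and 1072 as
16-bit big-endian units.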
Example 3 - raw writing
.......................
A file opened for raw access will only handle bytes, it cannot be used
together with io:put_chars.
- Data formatted with io_lib:format can still be written to a raw file
using file\:write. The data will end up being written as is. If the
translation modifier is consistently used when formatting, the file
will get the native UTF-8 encoding, if no translation modifiers are
used, the file will have latin1 encoding (each character in the list
returned from io_lib:format will be representable as a latin1
byte). If data is generated in different ways, the conversion
functions will have to be used.
- Data written with file\:write will be put on the file directly, no
conversion to and from Unicode representation will happen.
Example 4 - byte-oriented reading
.................................
When a file is opened for reading, much the same things apply as for
writing.
- file\:read on any file will expect the io-protocol to deliver data as
Unicode. Each byte will be converted to Unicode by the io_server and
turned back to a byte by file\:read
- If the file actually contains Unicode characters, they will be byte-wise
converted to Unicode and then back, giving file\:read the
original encoding. If read as (or converted to) binaries they can
then easily be converted back to the Erlang default representation
by means of the conversion routines.
- If the file is read with io:get_chars, all characters will be
returned in a list as expected. All characters will be latin1, but
that is a subset of Unicode and there will be no difference to
reading a translating file. If the file however contains Unicode
converted characters and is read in this way, the return value from
io:get_chars will be hard to interpret, but that is to be
expected. If such a functionality is desired, the list can be
converted to a binary with list_to_binary and then explored as a
Unicode entity in the encoding the file actually has.
Example 5 - Unicode file reading
................................
As when writing, reading Unicode converting files is best done with
the io-module. Let's once again assume UTF-16 on the file.
- When reading using file\:read, the UTF-16 data will be converted into
a Unicode representation native to Erlang and sent to the
client. If the client is using file\:read, it will translate the data
back to bytes in the same way as bytes were translated to Unicode
for the protocol when writing. If everything is representable as bytes,
the function will succeed, but if any Unicode character larger than
255 is present, the function will fail with a decoding error.
- Unicode data in the range over code-point 255 cannot be retrieved by
use of the file-module. The io-module should be used instead.
- io:get_chars and io:get_line will work on the Unicode data provided
by the io-protocol. All Unicode returns will be as Unicode lists as
expected. The fread function will return lists with integers > 255 only
when the translation modifier is supplied.
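Reading from a translating file can be sketched in the same way. The
example assumes the option spelling {encoding, {utf16, big}} found in
later Erlang/OTP releases rather than the form suggested in this
draft; the module name read_sketch and the limit of 1024 characters
are hypothetical::

    -module(read_sketch).
    -export([read_utf16/1]).

    %% Read up to 1024 characters from a big-endian UTF-16 file.
    %% Returns a Unicode list (one integer per character), or eof if
    %% the file is empty.
    read_utf16(Path) ->
        {ok, File} = file:open(Path, [read, {encoding, {utf16, big}}]),
        Result = io:get_chars(File, '', 1024),
        ok = file:close(File),
        Result.

A file containing the bytes <<4,26,4,48>> is returned as the Unicode
list [1050,1072].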
Example 6 - raw reading
.......................
As with writing, only the file module can be used and only byte
oriented data is read. If encoded, the encoding will remain when
reading and writing raw files.
Conclusions from the examples
.............................
With this solution, the file module is consistent with latin1
io_servers (aka common files) and raw files. A file type, a translating
file, is added for the io-module to be able to get implicit conversion
of its Unicode data (another example of such an io_server with
implicit conversion would of course be the
terminal). Interface-wise, common files behave as before and we only
get added functionality.
The downsides are the subtly changed behavior of io:put_chars and the
performance impact by the conversion to and from Unicode
representations when using the file module on non-raw files with
default (latin1/byte) encoding. The latter may be possible to change
by extending the io-protocol to tag whole chunks of data as bytes
(latin1) or Unicode, but using raw files for writing large amounts of
data is often the better solution in those cases.
Specification
=============
Convention
----------
I suggest the convention of letting the Unicode representation in
lists be one character per element, in binaries UTF-8 and in mixed
Unicode entities a combination of those.
Conversion to and from latin1 and UTF-8
---------------------------------------
I also suggest a module 'unicode', containing functions for
converting between representations of Unicode. The default format for
all functions should be utf8 in binaries to point out this as the
preferred internal representation of Unicode characters in binaries.
The two main conversion functions should be characters_to_binary/3 and
characters_to_list/2 as described above.
Bit syntax
----------
I suggest an extension to the bit syntax, allowing matching and
construction in UTF-8 coding, e.g.::
<<Ch/utf8,_/binary>> = BinString
as well as::
MyBin = <<Ch/utf8,More/binary>>
Optionally UTF-16 could be supported in a similar way for binaries, e.g.::
<<Ch/utf16-little,_/binary>> = BinString
UTF-32 will need to be supported in a similar way as UTF-16, both for
completeness and for the range-checking that will be involved when
converting Unicode characters.
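The effect of the endianness option and of the range checking can be
sketched as follows, assuming the utf8, utf16 and utf32 types in the
form later adopted by Erlang/OTP. The module name utf_types_sketch and
the atom not_a_unicode_character are hypothetical::

    -module(utf_types_sketch).
    -export([encodings/1, try_utf8/1]).

    %% Encode one code point in several UTF forms.
    encodings(Ch) ->
        [{utf8,         <<Ch/utf8>>},
         {utf16_big,    <<Ch/utf16-big>>},
         {utf16_little, <<Ch/utf16-little>>},
         {utf32_big,    <<Ch/utf32-big>>}].

    %% Construction raises badarg for integers that are not allowed
    %% as Unicode characters, such as values above 16#10ffff or in
    %% the surrogate range 16#D800..16#DFFF.
    try_utf8(Ch) ->
        try <<Ch/utf8>> of
            Bin -> {ok, Bin}
        catch
            error:badarg -> {error, not_a_unicode_character}
        end.

For the character 1050 the four encodings are <<208,154>>, <<4,26>>,
<<26,4>> and <<0,0,4,26>>, while utf_types_sketch:try_utf8(16#D800)
returns the error tuple.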
Formatting
----------
I finally suggest the "t" modifier to control sequences in the
formatting functions, which expects mixed lists of integers
0..16#10ffff and binaries with UTF-8 coded Unicode characters. The
functions in io and io_lib will retain their current
functionality for code not using the translation modifier, but will
return Unicode characters when ordered to.
The fread function should in the same way accept Unicode data only
when the "t" modifier is used.
The io-protocol needs to be changed to always handle Unicode characters.
Options given when opening a file will allow for implicit conversion of
text files.
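The suggested formatting behaviour, including binaries in the
argument, can be sketched as below. The example assumes the "t"
modifier as implemented by io_lib:format/2 in later Erlang/OTP
releases; the module name format_sketch is hypothetical::

    -module(format_sketch).
    -export([format_unicode/1, format_latin1/1]).

    %% "~ts": the argument is a mixed Unicode list, so integers may
    %% exceed 255 and binaries are read as UTF-8.
    format_unicode(Chars) ->
        lists:flatten(io_lib:format("~ts", [Chars])).

    %% "~s": the argument is a mixed latin1 list, so binaries are read
    %% as one character per byte and integers > 255 raise badarg.
    format_latin1(Chars) ->
        lists:flatten(io_lib:format("~s", [Chars])).

With the same two bytes in a binary, format_sketch:format_unicode(<<208,176>>)
returns [1072], while format_sketch:format_latin1(<<208,176>>) returns
[208,176].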
References
==========
.. [1] http://www.ietf.org/rfc/rfc3629.txt - The UTF-8 RFC.
.. [2] http://www.unicode.org/ - The Unicode homepage, containing downloadable versions of the standard(s).
Copyright
=========
This document has been placed in the public domain.
..
Local Variables:
mode: indented-text
indent-tabs-mode: nil
sentence-end-double-space: t
fill-column: 70
coding: utf-8
End: