Add new `Data.Text.Encoding.decodeLatin1` ISO-8859-1 decoding function #18

Merged
3 commits merged into bos:master on Dec 3, 2012


@hvr
Contributor
hvr commented Mar 3, 2012

This has about an order of magnitude lower runtime and/or call overhead compared with the more generic text-icu approach, e.g. according to criterion
with GHC 7.4.1 on Linux/x86_64:

  • 12 times faster for empty input strings,
  • 6 times faster for 16-byte strings, and
  • 3 times faster for 1024-byte strings.

decodeLatin1 is also faster than decodeUtf8 for plain ASCII input:

  • 2 times faster for 16-byte input strings,
  • ~40% faster for 1024-byte strings.
@hvr
Contributor
hvr commented Mar 3, 2012

...well... wrongly guessed the next pull req sequence number ;-)

@hvr
Contributor
hvr commented Mar 3, 2012

here's the benchmark code I used: https://gist.github.com/1965035

@bos
Owner
bos commented Mar 6, 2012

I'm dubious about this, since it opens the door to "what about every other encoding under the sun?"

By the way, the function is poorly named: Latin1 usually means code page 1252, which is not the same as ISO-8859-1. Encodings are a messy, messy subject, which is why I've been so anxious to keep this stuff confined to a different package.

At least some of the slowness of the encoding handling in text-icu actually comes from the unnecessary copy that's performed (back before I knew that arrays could be pinned during FFI calls). Eliminate that, and I bet you'd see a substantial improvement for all encodings.

@hvr
Contributor
hvr commented Mar 6, 2012

I'm dubious about this, since it opens the door to "what about every other encoding under the sun?"

I suspected you'd say something like that... :-)

IMHO, if a decodeUtf8 is part of text, then a decodeIso88591 should be too, due to its relationship to the Unicode character set: Unicode was designed so that the lowest 256 code points correspond exactly to the ISO-8859-1 code points, with the aim of providing a direct upgrade path.

Also, ISO-8859-1 holds a special place, imho, as it's the mainstream character set used as the default (or even the only) encoding by various RFCs. Supporting it out of the box (together with UTF-8) would be convenient. Wikipedia also states that ISO-8859-1 is "by far the most popular 8-bit character set in the world".

Moreover, decodeIso88591 is just a two-orders-of-magnitude-or-so faster version of T.pack . B.unpack; having to depend on a rather heavyweight package such as text-icu to perform this kind of operation seems like a lot of overhead (imho).

Finally, decodeIso88591 is even faster than decodeUtf8 for decoding 7-bit ASCII afaics.

...do the arguments above help in convincing you somewhat about the special status of ISO-8859-1, and that it deserves special handling together with the UTF-* and ASCII encodings already supported natively in the text package? :-)

PS: There's one thing I really dislike about the text-icu API: it seems to require using the IO monad to select a character set, which forces client code to mess with the IO monad even when the character conversion would otherwise be a pure computation.

By the way, the function is poorly named: Latin1 usually means code page 1252, which is not the same as ISO-8859-1

I wasn't aware of this; do you have any backing reference for Latin1 referring to CP1252? Also, according to Wikipedia's ISO-8859-1 entry:

ISO-8859-1 is the IANA preferred charset name for this standard when supplemented with the C0 and C1 control codes from ISO/IEC 6429. The following other aliases are registered for ISO-8859-1: ISO_8859-1, iso-ir-100, csISOLatin1, latin1, l1, IBM819, CP819.

@bos
Owner
bos commented Mar 6, 2012

See the Wikipedia page for info on the confusion around CP1252.

@hvr
Contributor
hvr commented Mar 8, 2012

See the Wikipedia page for info on the confusion around CP1252.

Actually, the Wikipedia entry doesn't say that "Latin1 usually means CP1252" at all, but rather that some people, incapable of looking up the ISO-8859-1 spec, commonly mis-declare windows-1252-encoded text as ISO-8859-1... (btw, even the base Java java.nio.charset.Charset documentation refers to ISO-8859-1 as "ISO Latin Alphabet No. 1, a.k.a. ISO-LATIN-1", albeit with some emphasis on the ISO term)

So it's rather "ISO-8859-1 usually means CP1252", but it seems absurd to me not to use the term ISO-8859-1 anymore just because a few broken implementations (which ones, btw? Java didn't seem affected the last time I checked) have managed to pollute its official ISO-standardized meaning in some communities... :-/

I don't have any strong feelings wrt whether the function is named decodeLatin1 vs. decodeIsoLatin1 or decodeISO88591 (if putting the "ISO" association into the function name helps avoid confusion).

(I was just somewhat surprised by this latin1=cp1252 business, as it's the first time I've heard of it... which OTOH might just mean that I've successfully managed to avoid exposure to certain broken products over the last 20 years... :-) )

And btw, fwiw, the ICU libraries don't seem to associate latin-1 with windows-1252 either:

Data.Text.ICU.Convert> aliases "windows-1252"
["ibm-5348_P100-1997","ibm-5348","windows-1252","cp1252"]

Data.Text.ICU.Convert> open "latin-1" Nothing
Converter "ISO-8859-1"

Data.Text.ICU.Convert> aliases "latin-1"
["ISO-8859-1","ibm-819","IBM819","cp819","latin1","8859_1","csISOLatin1","iso-ir-100","ISO_8859-1:1987","l1","819"]

...all that being said, are you just dubious, or actually against putting an optimized go-to implementation of T.pack . B.unpack into the text package?

@bos
Owner
bos commented Jun 24, 2012

OK, I'll take it.

But first:

  • Please add a QuickCheck test that demonstrates that the function works.
  • Also, a lazy variant is needed.
@hvr
Contributor
hvr commented Aug 15, 2012

great, I'll look into it

@hvr
Contributor
hvr commented Sep 23, 2012

fyi, 6e202e8e258f9cc21b130361c4f277459e33f465 is an updated changeset, adding a QuickCheck test for the strict D.T.E.decodeLatin1 version and adding the missing whitespace in the while () expression

A commit providing the lazy-text decodeLatin1 version shall follow shortly (if I don't get side-tracked again)

hvr added some commits Sep 23, 2012
@hvr hvr Add new `Data.Text.Encoding.decodeLatin1` ISO-8859-1 decoding function
This has about an order of magnitude lower runtime and/or call-overhead as
compared to the more generic `text-icu` approach, e.g. according to criterion
with GHC 7.4.1 on Linux/x86_64:

 * 12 times faster for empty input strings,
 * 6 times faster for 16-byte strings, and
 * 3 times faster for 1024-byte strings.

`decodeLatin1` is also faster compared to using `decodeUtf8` for plain ASCII:

 * 2 times faster for 16-byte input strings,
 * ~40% faster for 1024-byte strings.
7c06306
@hvr hvr Add `Data.Text.Lazy.Encoding.decodeLatin1` ISO-8859-1 decoding function
See 7c06306 for more information
3cec9a2
@hvr hvr Optimize latin1-to-UTF16 C-implementation by using 32-bit loads ea0e522
@bos bos merged commit 54fca94 into bos:master Dec 3, 2012
@bos

On a vaguely modern x86/x86_64 system, unaligned loads are basically free.

@hvr hvr deleted the hvr:pull-req-16 branch May 23, 2013