Unicode Support and UTF Encoding #164

Shinmera · 2015-11-11T12:42:23Z

Unicode is the de facto standard for text nowadays. As such, Clasp must support it in order to run a lot of useful software. As an initial suggestion, using UTF-32 internally for strings would be a good choice, since it fits every Unicode code point into a single character and thus allows constant-time access on strings. The size should not be a problem on modern systems. For external formats, UTF-8 and UTF-16 support should also be added.

Since Clasp's main purpose is interaction with C++ libraries, a variety of support functions and mechanisms might have to be added to ease the conversion and sharing of string data between Clasp and external or bound libraries. This might necessitate supporting several string representations internally, so that strings can be handled relatively efficiently without a conversion every time the Clasp/library boundary is crossed.

Bike · 2020-06-30T16:21:21Z

Currently character strings (as opposed to base strings) are UTF-32. It looks like we have UTF-8 and UCS-4 (which is the same as UTF-32, I think) as external formats. We also have UCS-2, which I think is not actually equivalent to UTF-16. We don't have any support for Unicode algorithms such as normalization forms. We also handle some character functions incorrectly, e.g. (char-downcase #\CYRILLIC_CAPITAL_LETTER_IE_WITH_GRAVE) => #\CYRILLIC_CAPITAL_LETTER_IE_WITH_GRAVE, where the result should be the corresponding small letter.
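
To illustrate where UCS-2 and UTF-16 differ, here is a minimal sketch of UTF-16 encoding for a single code point. The function name UTF16-CODE-UNITS is hypothetical and is not part of Clasp; it only shows that code points above U+FFFF need a surrogate pair, which UCS-2 cannot express.

```lisp
;; Sketch only: UTF16-CODE-UNITS is a hypothetical helper, not Clasp API.
(defun utf16-code-units (code-point)
  "Return a list of 16-bit code units encoding CODE-POINT in UTF-16."
  (cond ((<= #xD800 code-point #xDFFF)
         ;; Surrogate values are reserved and are not valid scalar values.
         (error "Surrogate code points cannot be encoded: ~X" code-point))
        ((< code-point #x10000)
         ;; Basic Multilingual Plane: one code unit, same as UCS-2.
         (list code-point))
        ((<= code-point #x10FFFF)
         ;; Supplementary planes: split the 20-bit offset into a high
         ;; and a low surrogate. UCS-2 has no equivalent mechanism.
         (let ((offset (- code-point #x10000)))
           (list (+ #xD800 (ldb (byte 10 10) offset))
                 (+ #xDC00 (ldb (byte 10 0) offset)))))
        (t
         (error "Not a Unicode code point: ~X" code-point))))
```

For example, (utf16-code-units #x0400) yields a single unit, while (utf16-code-units #x1F600) yields the pair #xD83D #xDE00.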

attila-lendvai · 2020-12-24T14:58:52Z

a potentially crazy idea: integrate babel into the build/bootstrap process and use it?

https://github.com/cl-babel/babel

it only covers the external format reading/writing/conversion, though.

Serentty · 2021-07-07T09:57:33Z

> As an initial suggestion, using UTF-32 internally for strings would be a good choice since it will fit the entirety of Unicode into a single character and thus allow constant time access on strings.

Constant-time access to code points isn't really that useful. In most situations where you would think it's what you want, it actually isn't, because code points don't correspond exactly to meaningful user-facing characters.
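
A small sketch of the mismatch between code points and user-facing characters, assuming an implementation whose characters cover Unicode (as Clasp's UTF-32 strings do). The variable names are made up for illustration.

```lisp
;; "é" written two ways. Both render as one user-perceived character.
(defparameter *decomposed*
  ;; LATIN SMALL LETTER E followed by U+0301 COMBINING ACUTE ACCENT.
  (coerce (list #\e (code-char #x0301)) 'string))

(defparameter *precomposed*
  ;; U+00E9 LATIN SMALL LETTER E WITH ACUTE.
  (string (code-char #x00E9)))

(length *decomposed*)                 ; two code points
(length *precomposed*)                ; one code point
(string= *decomposed* *precomposed*)  ; false: STRING= does not normalize
```

Indexing the decomposed string at 0 gives a bare "e", which is not what a user would call the first character.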

Shinmera · 2021-07-07T10:00:07Z

It is in the context of Common Lisp, because most sequence functions use aref et al. to access strings, which are indexed per character, and thus require constant-time random access to be efficient at all.

Serentty · 2021-07-07T10:01:52Z

I see. So they can't use something more like an iterator? The language mandates that they go index by index, and that implementation detail can't be hidden?

Shinmera · 2021-07-07T10:04:24Z

Yes. Strings are vectors, and violating the assumption that they can be accessed in constant time per index will lead to lots of problems. You'd have to devise a separate string type entirely, with an API that does not follow the vector one. I think @Bike did make a library for UTF-8 strings using the extensible sequences API as a proof of concept once, for instance.
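
To make the cost concrete, here is a rough sketch of what index-based access would involve if a string were stored as UTF-8 octets. UTF8-NTH-CODE-POINT-START is a hypothetical helper, not taken from Clasp or from any library mentioned here, and it assumes well-formed input.

```lisp
;; Sketch only: a hypothetical helper over a vector of UTF-8 octets.
(defun utf8-nth-code-point-start (octets n)
  "Return the index in OCTETS where the N-th code point (zero-based) begins.
Assumes OCTETS holds well-formed UTF-8."
  (let ((seen -1))
    (dotimes (i (length octets)
                (error "Fewer than ~D code points in OCTETS." (1+ n)))
      ;; Continuation octets have the bit pattern 10xxxxxx; every other
      ;; octet starts a new code point.
      (unless (= (ldb (byte 2 6) (aref octets i)) #b10)
        (incf seen)
        (when (= seen n)
          (return i))))))
```

Every lookup walks the octets from the start, so an aref-style loop over such a string would take quadratic time, whereas the same loop over a UTF-32 string stays linear.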

Bike · 2021-07-07T13:44:35Z

> I see. So they can't use something more like an iterator? The language mandates that they go index by index, and that implementation detail can't be hidden?

In the "sequences" extension to the language, you can implement an iteration protocol which is then used by standard sequence functions (map, reduce, etc.). But strings are vectors, and the iterator implementation for vectors does use indices.

As Shinmera said, I did write a library to use UTF-8 encoded strings, which defines iteration without relying on indices so much. For example here is the "next" function: https://github.com/Bike/utf8string/blob/master/utf8string.lisp#L456-L464

Could I ask why this issue has been revived? I mean, we never closed it, because we lack some aspects of Unicode support (like anything analogous to sb-unicode), but we do have UTF-32 strings and such now and have for a while.

Shinmera added the enhancement label Nov 11, 2015

Shinmera mentioned this issue Nov 11, 2015

character and base-char are confused #153

Closed

Shinmera added this to the 1.0.0 milestone Nov 11, 2015

Bike mentioned this issue Dec 11, 2019

Rearrange string code to support displaced strings and prepare for Unicode #179

Closed

kpoeck mentioned this issue Dec 27, 2020

we don't handle char-downcase or char-upcase for unicode chars correctly #1120

Closed
