Unicode Support and UTF Encoding #164

Open
Shinmera opened this issue Nov 11, 2015 · 7 comments

@Shinmera
Contributor

Unicode is the de facto standard for text nowadays. As such, Clasp must support it in order to run a lot of useful software. As an initial suggestion, using UTF-32 internally for strings would be a good choice since it fits any Unicode code point into a single character and thus allows constant-time access on strings. The size should not be a problem on modern systems. For external formats, UTF-8 and UTF-16 support should also be added.

Since Clasp's main purpose is interaction with C++ libraries, a variety of support functions and mechanisms might have to be added to ease the conversion and sharing of string data between Clasp and external or bound libraries. This might necessitate supporting different string representations internally, so that strings can be handled relatively efficiently without a conversion every time the Clasp/library boundary is crossed.

@Bike
Member

Bike commented Jun 30, 2020

Currently character strings (as opposed to base strings) are UTF-32. It looks like we have UTF-8 and UCS-4 (which is the same as UTF-32, I think) as external formats. We also have UCS-2, which I think is not actually equivalent to UTF-16. We don't have any support for Unicode algorithms like normal forms, etc. We also do not handle some character functions correctly, e.g. (char-downcase #\CYRILLIC_CAPITAL_LETTER_IE_WITH_GRAVE) => #\CYRILLIC_CAPITAL_LETTER_IE_WITH_GRAVE, where the result should be #\CYRILLIC_SMALL_LETTER_IE_WITH_GRAVE.
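
To illustrate why UCS-2 and UTF-16 differ, here is a minimal sketch of UTF-16 encoding for a single code point. The function name utf16-code-units is hypothetical and not part of Clasp. Within the Basic Multilingual Plane the result is one 16-bit unit, the same as UCS-2; above it, UTF-16 needs a surrogate pair, which UCS-2 cannot express.

```lisp
;; Hypothetical helper, not part of Clasp: encode one code point as UTF-16.
(defun utf16-code-units (code-point)
  "Return a list of one or two 16-bit code units encoding CODE-POINT."
  (cond ((<= #xD800 code-point #xDFFF)
         (error "~X is a surrogate, not a scalar value." code-point))
        ((< code-point #x10000)
         ;; Basic Multilingual Plane: a single unit, identical to UCS-2.
         (list code-point))
        ((<= code-point #x10FFFF)
         ;; Supplementary planes: split the 20-bit offset over a surrogate pair.
         (let ((offset (- code-point #x10000)))
           (list (+ #xD800 (ldb (byte 10 10) offset))
                 (+ #xDC00 (ldb (byte 10 0) offset)))))
        (t (error "~X is outside the Unicode range." code-point))))
```

For instance, (utf16-code-units #x1F600) yields the pair #xD83D #xDE00, while (utf16-code-units #x41) yields the single unit #x41.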

@attila-lendvai
Contributor

attila-lendvai commented Dec 24, 2020

a potentially crazy idea: integrate babel into the build/bootstrap process and use it?

https://github.com/cl-babel/babel

it only covers the external format reading/writing/conversion, though.

@Serentty

Serentty commented Jul 7, 2021

> As an initial suggestion, using UTF-32 internally for strings would be a good choice since it fits any Unicode code point into a single character and thus allows constant-time access on strings.

Constant-time access to code points isn't really that useful. In most situations where you would think it's what you want, it actually isn't, because code points don't correspond exactly to user-facing characters.
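
A small sketch of the mismatch, assuming an implementation whose characters are Unicode code points (so that char-code-limit exceeds #x301). The variable name is made up for illustration.

```lisp
;; "é" written as a base letter plus a combining mark:
;; U+0065 LATIN SMALL LETTER E followed by U+0301 COMBINING ACUTE ACCENT.
(defparameter *decomposed-e-acute*
  (coerce (list (code-char #x65) (code-char #x301)) 'string))

;; LENGTH and CHAR work on code points, not on what a reader sees as one
;; character: the string has two elements, and the second is the bare accent.
(length *decomposed-e-acute*)
(char-code (char *decomposed-e-acute* 1))
```

A reader sees one character here, but the string holds two elements, so indexing by code point can land in the middle of a user-perceived character.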

@Shinmera
Contributor Author

Shinmera commented Jul 7, 2021

It is useful in the context of Common Lisp, because most sequence functions use aref et al. to access strings, which are indexed per character, and thus require constant-time random access to be efficient at all.
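
As a rough sketch of the cost involved, assume a string were stored as a vector of well-formed UTF-8 octets. Finding the n-th character then means scanning from the start, as in the hypothetical helper below (utf8-offset-of is not an existing function), whereas aref on a UTF-32 string is a single indexed load.

```lisp
;; Hypothetical helper: find where the INDEX-th code point starts in a
;; vector of well-formed UTF-8 octets. The scan is linear in the offset.
(defun utf8-offset-of (octets index)
  "Return the position in OCTETS where code point number INDEX starts."
  (let ((seen 0))
    (dotimes (pos (length octets)
                  (error "Index ~D is out of bounds." index))
      ;; Continuation octets look like #b10xxxxxx; every other octet
      ;; starts a new code point.
      (unless (= (ldb (byte 2 6) (aref octets pos)) #b10)
        (when (= seen index)
          (return pos))
        (incf seen)))))
```

With the octets of "héllo", (utf8-offset-of (vector #x68 #xC3 #xA9 #x6C #x6C #x6F) 2) has to walk past the two-octet "é" before it can report that the third character starts at offset 3.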

@Serentty

Serentty commented Jul 7, 2021

I see. So they can't use something more like an iterator? The language mandates that they go index by index, and that implementation detail can't be hidden?

@Shinmera
Contributor Author

Shinmera commented Jul 7, 2021

Yes. Strings are vectors, and violating the assumption that they can be accessed in constant time per index will lead to lots of problems. You'd have to devise a separate string type entirely, with an API that does not follow the vector one. I think @Bike did once make a library for UTF-8 strings using the extensible sequences API as a proof of concept, for instance.

@Bike
Member

Bike commented Jul 7, 2021

> I see. So they can't use something more like an iterator? The language mandates that they go index by index, and that implementation detail can't be hidden?

In the "sequences" extension to the language, you can implement an iteration protocol which is then used by standard sequence functions (map, reduce, etc.). But strings are vectors, and the iterator implementation for vectors does use indices.

As Shinmera said, I did write a library to use UTF-8 encoded strings, which defines iteration without relying on indices so much. For example here is the "next" function: https://github.com/Bike/utf8string/blob/master/utf8string.lisp#L456-L464

Could I ask why this issue has been revived? I mean, we never closed it, because we lack some aspects of Unicode support (like anything analogous to sb-unicode), but we do have UTF-32 strings and such now and have for a while.
