-
Notifications
You must be signed in to change notification settings - Fork 24
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Reader flag to support unicode escapes in string literals #59
Comments
A few things:
What's the use case for unicode escapes in string literals? Are you getting edn that you have no control over that is broken in this way? Are you dumping raw bytes of some other format (JSON) directly into an edn stream, crossing your fingers and hoping everything will be fine? Do you need to persist or transmit edn in a character encoding other than UTF-8?
|
Hello, thank you for the response. The motivation are invisible, zero-width characters like \u200c. I keep in .edn files certain configuration and unit test data for that configuration. One of the examples I must support includes text containing \u200c. If I simply save it as UTF-8, other people reading this file will have difficulty understanding what is going on. As for printing, I also was thinking it should not produce unicode escapes, even for characters, because I don't know how to decide what characters should use the escaped representation. |
Yes, I can see the utility of that. I'll see what I can do about that over the weekend. |
Please have a look at what I just pushed to the develop branch. |
@bpsm , thank you, that would suit my needs. So you don't' see a need for a special flag to enable Unicode escapes in string literals, it's fine to be always enabled? |
Yes, I think I'll just have it always enabled. It seems to me like a sensible addition and inoffensive enough to leave always on. It occurred to me (after I'd implemented it) that there's actually another approach you could have taken to address your use case. (Maybe you'd thought of it, but I hadn't.)
And then install a corresponding handler for the tag #my/unicode-escaped which scans the string for things that look like
|
If you keep this feature always on, it's less work for me - just update the version number in dependency. But to be honest I have to tell you one pro-flag consideration: people who want only valid EDN to enter their system. So if they can parse it with edn-java and store somewhere, then any other component down the processing chain is guaranteed to read the file. That's why I reported the issue mentioning the flag. The workaround with tags is interesting, I haven't thought of it. If you reject this issue and not implement I maybe use it. But would prefer to avoid - my configs are complex enough already for users to edit, EDN is not as widely know as JSON or XML, the config includes regular expressions represented as EDN string literals, so users already have to deal with double escaping of \ (regex syntax + edn string literals), adding tags and custom Unicode escapes into the mix would not be friendly. Ideally, IMHO, the spec would include unicode escapes for string literals. Missing them, while they are already specified for characters looks like the result of the spec being a bit drafty, not polished. |
I hadn't considered the use case of using edn-java to check incoming edn for well-formedness before persisting the original bytes. (Rather than, printing the just parsed edn back out to fresh bytes). I suppose that would argue in favor of making this extension off by default. (Maximum compatibility.) I'll have a look at what adding this as an option to the parser configuration would look like. |
Please have a look at this change which makes \uXXXX an option to be configured by installing a TagHandler into the Parser.Config and tell me if you think that would be acceptable to you. |
@bpsm , thank you, this solution would work for me. FWIW, I'm not sure why you decided to express this option as a pseudo tag handler. Other pseudo tag handlers allow the user to customize the representation of EDN numbers in the object structure created by the parser. In the case of Unicode escapes not much room for custom behaviour, only one of the two predefined variants to chose: support or not support. Also, I noticed passing through the tag handler takes some redundant memory allocations: firs the Unicode escape text is put nnto the scanner buffer, then extracted as a substring of it and passed to the tag handler, the tag handler takes the substring of unicode escape to pass to parseInteger, then the resulting character is converted into a string and returned from the tag handler - 3 temporary string objects are created. Of course, unicode escapes would usually be rare so these allocations will not be significant usually. If you like the code as is, I'm fine. Thank you again for your support. |
@bpsm, another aspect to consider is escapes support by the https://github.com/clojure/tools.reader from official clojure github project. (Of course, double escapes are used here) So it supports unicode and even octal escapes by default. In this case, maybe edn-java does not need to make this feature optional, probably I was overthinking. |
Yea, I think you're right. I'll have edn-java follow clojure.tools.read.edn's lead on this. So, I'll go back to making support for "\uXXXX" always on. (This reminds me of another thing I have yet to create an issue for, Clojure's reader supports unicode characters in symbol and keyword names, though the spec makes it sound like it doesn't and edn-java does not currently support this. Do you have any thoughts on this?) The garden path that lead me to (ab)using a TagHandler for in-string unicode escaping was something like this:
Shrug. |
This is implemented in 0.7.0 (But 0.7.0 is not currently on Maven central because of GPG issues I have been unable to resolve.) |
Thank you! |
Despite the edn format doesn't specify unicode escapes in string literals (unlike for characters), in practice, it is very inconvenient sometimes. Optional support for unicode escapes in string literals, managed by reader config flag is very desirable.
See also: edn-format/edn#65
Looking into the code I think unicode escapes are not even supported in characters. This is a bug - the EDN specification requires unicode escapes in characters: https://github.com/edn-format/edn#characters.
The text was updated successfully, but these errors were encountered: