ENSIP-15: Normalization Standard #134
Conversation
I moved my stuff into the ens-improvement-proposals/ensip-15/ directory. I added 3 additional markdown files, which enable me to remove the links to my repo files. I believe the spec.json approach has worked so far. Last I checked, djstrong had a compliant Python implementation. To clarify, you want me to add a more technical document, like DNSSEC, which includes some snippets and direct repo links?
Technical docs would be warmly welcomed! But for the ENSIP itself, all that's required is that an independent implementer can create a compliant implementation given only the standard.
* For user convenience, leading and trailing whitespace should be trimmed before normalization, as all whitespace codepoints are disallowed. Inner characters should remain unmodified.
* No string transformations (like case-folding) should be applied.
1. Split the name into [labels](./ensip-1-ens.md#name-syntax).
They're trivial, but the `split` and `join` functions need to be defined somewhere.
Sorry for the late edits — okay, adding Split and Join sections
I had left them undefined since it's the same mechanics as ENSIP-1 (which I referenced.)
There are Join and Split sections:
- Split
  - Partition a name into labels, separated by `2E (.) FULL STOP`, and return the resulting array.
  - Example: `"abc.123.eth"` → `["abc", "123", "eth"]`
- Join
  - Assemble an array of labels into a name, inserting `2E (.) FULL STOP` between each label, and return the resulting string.
  - Example: `["abc", "123", "eth"]` → `"abc.123.eth"`
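As a minimal sketch (function names are illustrative, not from the spec), Split and Join operate on U+002E FULL STOP exactly like Python's `str.split` / `str.join`:

```python
# Illustrative sketch of the Split and Join operations from the thread
# above; real implementations follow the label syntax in ENSIP-1.

def split_name(name: str) -> list[str]:
    """Partition a name into labels separated by 2E (.) FULL STOP."""
    return name.split("\u002E")

def join_labels(labels: list[str]) -> str:
    """Assemble labels into a name, inserting 2E (.) FULL STOP between each."""
    return "\u002E".join(labels)

split_name("abc.123.eth")           # ["abc", "123", "eth"]
join_labels(["abc", "123", "eth"])  # "abc.123.eth"
```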
Given a string, convert to codepoints, and produce a list of **Text** and **Emoji** tokens, each with a payload of codepoints.
1. Allocate an empty codepoint buffer.
1. Find the longest emoji sequence that matches the remaining input.
How do you do this? It should be specified or linked to from here.
@adraffy Maybe it is simpler to explain using a regular expression constructed as an alternation of all emojis.
My convention was that the initial paragraph describes the process of the following itemized steps.
Is the question how to convert a string to codepoints? If so, can I make a general disclaimer that all strings are Unicode strings and 1:1 with a sequence of codepoints?
> @adraffy Maybe it is simpler to explain using a regular expression constructed as an alternation of all emojis.

The naive regex for this is enormous (48KB), although you're correct that it's conceptually simple, since it's just `/^(emoji1|emoji2|...)/u`, where `\uFE0F` is replaced with `\uFE0F?` and `*` is replaced with `\*`.

In v8, it appears to take only 3MB of memory to store that regex (~30K DFA states). I wonder if that's actually a better way to parse the emoji... I'll have to do some tests.
My question here is how do you "find the longest emoji sequence that matches the remaining input"?
> @adraffy Maybe it is simpler to explain using a regular expression constructed as an alternation of all emojis.
>
> The naive regex for this is enormous (48KB), although you're correct that it's conceptually simple, since it's just `/^(emoji1|emoji2|...)/u`, where `\uFE0F` is replaced with `\uFE0F?` and `*` is replaced with `\*`.
>
> In v8, it appears to take only 3MB of memory to store that regex (~30K DFA states). I wonder if that's actually a better way to parse the emoji... I'll have to do some tests.
@adraffy It is not better for performance; in Python we have done that instead of using a trie as you did.
I have proposed this here for answering @Arachnid's question. Maybe the regex is simpler than explaining the trie.
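The alternation idea discussed above can be sketched as follows. This is a hypothetical illustration: the emoji list is a tiny stand-in (a real implementation would derive the full set of Emoji Sequences from `spec.json`), and Python's `re` needs no `/u` flag since its strings are already Unicode:

```python
import re

# Tiny stand-in for the real Emoji Sequence set from spec.json.
EMOJI_SEQUENCES = [
    "\U0001F468",                            # 1F468 man
    "\U0001F468\U0001F3FB",                  # 1F468 1F3FB man: light skin tone
    "\U0001F468\U0001F3FB\u200D\U0001F4BB",  # 1F468 1F3FB 200D 1F4BB man technologist
]

def build_emoji_regex(sequences):
    # Sort longest-first so the alternation prefers the longest match,
    # then make FE0F (emoji variation selector) optional after escaping.
    alts = [
        re.escape(seq).replace("\uFE0F", "\uFE0F?")
        for seq in sorted(sequences, key=len, reverse=True)
    ]
    return re.compile("^(?:" + "|".join(alts) + ")")

EMOJI_RE = build_emoji_regex(EMOJI_SEQUENCES)

# Matches the full 4-codepoint "man technologist" sequence, not just "man".
m = EMOJI_RE.match("\U0001F468\U0001F3FB\u200D\U0001F4BB.eth")
assert m.group(0) == "\U0001F468\U0001F3FB\u200D\U0001F4BB"
```

Note the longest-first sort: Python's alternation tries branches left to right, so ordering the branches by descending length is what makes the match greedy over whole sequences.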
I added the following to indicate the concept of longest:
- The longest sequence prevents matching on a shorter sequence that has the same initial codepoints.
- Example: `👨🏻‍💻 [1F468 1F3FB 200D 1F4BB]`
  - Match (1): `👨 [1F468]` man
  - Match (2): `👨🏻 [1F468 1F3FB]` man: light skin tone
  - Match (4): `👨🏻‍💻 [1F468 1F3FB 200D 1F4BB]` man technologist: light skin tone ← longest match
Kinda related, I spun off the emoji stuff into a new repo https://github.com/adraffy/emoji.js where I combine a simple compressor with the regex matching idea to get full UTS-51 support for 8KB.
It's not the "concept of longest" that is unspecified - it's how you determine the longest sequence of emoji. You should provide specific directions on identifying where an emoji sequence ends.
Ah, gotcha. I added the following near the top of the document:

> An Emoji Sequence is a single entity composed of one or more emoji characters and emoji components.

And made sure I was using the term Emoji Sequence consistently throughout the document (except for the `spec.json` part, where I say the word emoji a million times).
I also added djstrong's regex suggestion.
Also added definition of a string:
- A string is a sequence of Unicode codepoints.
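The definition above is directly realized by Python, where a `str` is already a 1:1 sequence of Unicode codepoints (the helper names here are illustrative, not from the spec):

```python
# Sketch: converting between a string and its codepoint sequence.
def to_codepoints(s: str) -> list[int]:
    return [ord(ch) for ch in s]

def from_codepoints(cps: list[int]) -> str:
    return "".join(chr(cp) for cp in cps)

to_codepoints("eth")              # [101, 116, 104]
from_codepoints([101, 116, 104])  # "eth"
```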
### Definitions
* Terms in **bold** throughout this document correspond with [components of `spec.json`](#description-of-specjson).
* An **Emoji Sequence** is a [single entity composed of one or more](https://unicode.org/reports/tr29/#Grapheme_Cluster_Boundaries) emoji characters and emoji components.
Can you also explain how implementers determine if something is an emoji character or emoji component?
* Unicode version `15.0.0`
* Normalization is a living specification and should use the latest stable version of Unicode.
* [`spec.json`](./ensip-15/spec.json) contains all [necessary data](#description-of-specjson) for normalization.
* [`nf.json`](./ensip-15/nf.json) contains all necessary data for [Unicode Normalization Forms](https://unicode.org/reports/tr15/) NFC and NFD.
How exactly is `nf.json` supposed to be used to aid in the implementation? How does an implementer reference the `ranks` key, and how is one supposed to interpret the lists in there?
I'm curious if this can actually be used as a reference with static value mapping (as `spec.json` tends to help with) or if, during implementation, NFC and/or NFD actually need to be applied to the text tokens.
An explainer of this file would be quite helpful.
You are correct, I didn't provide an explanation.
My thinking was that the NF version issue is really only important on the web, where there are almost no space-efficient client-side Unicode libraries, and my library and the official ENS library cover that use-case. I was assuming any non-web user would just import a library with the matching Unicode version.
I will include a section in my next update.
For reference, `ranks` is just a dense packing of Unicode combining classes. Because NF orders combining marks, a static replacement table is insufficient.
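The point about ordering can be shown with a small sketch. This is not the `nf.json` format itself; it uses the stdlib's `unicodedata` combining classes (which is what `ranks` densely packs) to perform the canonical ordering step that a static replacement table cannot express:

```python
import unicodedata

def canonical_order(cps: list[int]) -> list[int]:
    """Reorder adjacent combining marks by canonical combining class (CCC),
    the stable exchange step from Unicode canonical ordering."""
    out = list(cps)
    i = 1
    while i < len(out):
        a = unicodedata.combining(chr(out[i - 1]))
        b = unicodedata.combining(chr(out[i]))
        if a > b > 0:  # both are marks and out of order: swap and back up
            out[i - 1], out[i] = out[i], out[i - 1]
            i = max(1, i - 1)
        else:
            i += 1
    return out

# "e" + COMBINING ACUTE (CCC 230) + COMBINING CEDILLA (CCC 202):
# the cedilla must be reordered before the acute.
assert canonical_order([0x65, 0x301, 0x327]) == [0x65, 0x327, 0x301]
```

Since the output depends on the relative ranks of neighboring marks, not on any single codepoint in isolation, a one-codepoint-to-one-codepoint mapping table cannot reproduce it.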
Makes sense. Thanks for the prompt reply 👌🏼
* Normalization is the process of canonicalizing a name before [hashing](./ensip-1-ens.md#namehash-algorithm).
* It is idempotent: applying normalization multiple times produces the same result.
* For user convenience, leading and trailing whitespace should be trimmed before normalization, as all whitespace codepoints are disallowed. Inner characters should remain unmodified.
Just a small note / nit (for the next draft) that this can read as though implementers are meant to strip names for user convenience, since it's under the `Algorithm` section. This will lead to many failing tests that look for invalid leading and trailing characters.
I'm not sure I follow this one, can you suggest what the remedy would be? I don't see how "leading and trailing whitespace" can be conflated with names/labels.
Sorry, take or leave this suggestion if it isn't clear.

> For user convenience, leading and trailing whitespace should be trimmed before normalization, as all whitespace codepoints are disallowed. Inner characters should remain unmodified.

From my point of view, listing this bullet under Algorithm makes me think it's part of the normalization algorithm to strip any leading and trailing whitespace a user may have inputted. I originally implemented this, and it led to failing negative-case tests because they were looking for errors related to this. Once I realized this is meant from the user's perspective, I stopped trimming leading and trailing whitespace in our implementation, and the tests began correctly failing with the expected failures.
1. Start with the set of **ALL** groups.
1. For each unique character in the label:
   * If the character is **Confused** (a member of a **Whole Confusable**):
     * Retain groups with **Whole Confusable** characters excluding the **Confusable Extent** of the matching **Confused** character.
As an implementer, I'm having a difficult time walking through this section. I'm not sure what this line means exactly. I think giving one example flow might help or maybe a bit better clarification.
edit: I think with the examples below it helps... but since Confusable Extent is defined as the mapping of multiple characters and multiple groups, maybe this is closer to "Retain a set of the groups from all Confusable Extents excluding the groups in the Confusable Extent of the matching character." Is that what this is asking for?
This algorithm was difficult to explain, but it's more digestible once you work through a few examples or see it visually. This tool may help: https://adraffy.github.io/ens-normalize.js/test/confused.html
The Unicode suggestion is also very terse and suffers from the problem that it's only hypothetical: "As usual, this algorithm is intended only as a definition; implementations should use an optimized routine that produces the same result." It also doesn't determine how to choose a "winner" and interacts weirdly with the Augmented Scripts and Script Extensions. From my POV, this is why this logic is implemented virtually nowhere on the web.
For reference, this is my implementation, which is only a few lines. My code uses an additional optimization, the "Confusable Extent Complement": instead of retaining the excluded, you just intersect a pre-computed complement (for every confusable character, I've precomputed the groups that are outside that character's confusable extent.)
Another way of describing the algorithm is that you're trying to create a function which generates confusable strings. You already know the string you have is "valid" (according to the previous step of Validate: every character belongs to at least one group.)
Starting from a generator of all strings, you take each character in the actual string, and fine-tune the generator to produce similar strings. However, instead of making it only generate the original string, you make it generate all of the OTHER (confusable) strings EXCEPT original string.
If you can successfully produce a function of this kind, then the original string is confusable by construction — there exists a confusable string that isn't linked to the original string (or its extents.)
Since some groups have confusables between their own characters and some characters are shared between groups, the Confusable Extent concept is necessary to ensure that the constructed generator doesn't produce strings that are just valid-BUT-different substitutions of the original string.
> Retain groups with Whole Confusable characters excluding the Confusable Extent of the matching Confused character.

> "Retain a set of the groups from all Confusable Extents excluding the groups in the Confusable Extent of the matching character." Is that what this is asking for?

The extent is only relevant for the current confusable: it excludes everything the character could be substituted with.
- Prior Groups: `{A, B, C, D}`
- Current Character: `X` (placeholder)
- Current Confusable: `X → {C, D}`, `Y → {B}`
- Union of Confusable Groups: `{B, C, D}`
- Current Character Confusable Extent: `{X} × {C, D}`
- Confusable Groups not in Extent: `{B}`
- Retained Groups: `{B}`
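The walk-through above can be traced in a few lines of toy Python. This is only a model of this one step (real Confusable Extents map sets of characters to sets of groups; the names `X`, `Y`, `A`–`D` are the placeholders from the example):

```python
# Toy model of the "retain" step for a single Confused character.
prior_groups = {"A", "B", "C", "D"}

# Whole Confusable membership: which groups each confusable character spans.
whole_confusable = {"X": {"C", "D"}, "Y": {"B"}}

def retained_after(char: str, prior: set[str]) -> set[str]:
    union = set().union(*whole_confusable.values())  # {B, C, D}
    extent_groups = whole_confusable[char]           # X's extent: {C, D}
    # Keep only Whole Confusable groups OUTSIDE the character's extent.
    return prior & (union - extent_groups)

assert retained_after("X", prior_groups) == {"B"}
```

This reproduces the example exactly: the union of confusable groups is `{B, C, D}`, X's extent contributes `{C, D}`, and the retained result is `{B}`.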
The Confusable Extent concept is my own. There may be a simpler way to explain it. The current algorithm is designed to be efficient and minimal (so it can be implemented on-chain). It's possible a more direct algorithm exists.
Thank you for taking the time to explain this. I really appreciate it. My intention here is just to chime in where I couldn't understand the spec, in hopes of making it more clear for others. I think I found out where my implementation was going "wrong", and it's not immediately clear to me if this is intentional or not. I'm passing all the tests now, but I still have some questions.

Taking an example label from the tests, "ᎫᏦᎥ" (codepoints [5035, 5094, 5029]), we end up with all […]. From the spec:

In this case there are no buffered characters, because all were confused. Using a comparison such as:

```python
>>> buffer
set()
>>> all(cp in retained_group.valid_cps() for cp in buffer)
True
```

With the above in mind, really any group would contain all of the empty-set buffer. If you used any list of cps for […]. I checked with this exact example, and this is how the other Python implementation works (namehash's implementation), and they based their code off the js code. I haven't checked whether the js is a similar case.

This really tripped me up, because if it is intended this way, as I implemented it, I would only compare if the buffer was non-empty; otherwise no group would truly contain all of an empty character set. If anything, this feels like it should not be an invalid name, because there is no remaining group that contains all of the buffered characters (there are no buffered characters). It's possible I'm just too fried from looking at only this for the past few days 😄

edit: Asserting that none of the remaining groups contain any of the buffered characters should look more like:

```python
if any(cp in retained_group.valid_cps() for cp in buffer):
    raise InvalidName(f"...: {real_group} / {retained_group.name}")
```

And if I do use this, the specific test case I mentioned would not raise this exception, since there are no characters to look for. I originally had something similar to this, and that was causing about 122 failures that were very difficult to debug until I got here.
Ah yes, you are correct, that should be ALL:
And yes, ALL of the empty set is true. I can make this more clear. If you finish the loop without exiting early, you have at least 1 group. If there are no buffered characters, as you describe, it's confusable, since another group has a confusable character for each character in the original string. However, if there are buffered characters, there's a chance some of those characters aren't in the final set (which could mean that string isn't normalizable and therefore not confusable). I perform this check at the end of the loop because the excluded set keeps shrinking as you progress through the string. For reference, characters that get buffered must belong to multiple groups and can come from two sources: […]
Both are a consequence of the following logic: if you have a Cyrillic
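The vacuous-truth point in this exchange can be demonstrated standalone (`valid_cps` here is a stand-in set, not the spec's data):

```python
# all() over an empty iterable is vacuously True, while any() is False.
# This is why the containment check must be phrased with all() for the
# empty-buffer (all-characters-confused) case to count as confusable.
buffer = set()            # no buffered characters: every character was confused
valid_cps = {0x61, 0x62}  # stand-in for retained_group.valid_cps()

assert all(cp in valid_cps for cp in buffer)      # vacuously True
assert not any(cp in valid_cps for cp in buffer)  # False: nothing to find
```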
No, that makes sense. Changing […]
This is incomplete and requires some input/suggestions.

- How do I pick an ENSIP number? I chose `15` as it was next in the sequence.
- What should I do about the links to my repo? For example, chars-disallowed.js is just a massive file with various comments. Maybe these types of files should be formatted into separate document(s)?
- Where should I put static reference files? I created a new directory `normalization-standard/`, however these files potentially should be versioned or dated (e.g. this is the Unicode 15 deployment.)
- Normalization really has (2) parts: the derivation of the spec and the implementation of the spec. For the ENSIP, I mostly describe the implementation and only outline/hyperlink to the derivation (as it requires significantly more explanation.) For example, without links to my code, I probably need a document that explains the script-groups.js in more detail.