
ENSIP-15: Normalization Standard #134

Merged — 14 commits merged into ensdomains:master on Jun 5, 2023

Conversation

adraffy
Contributor

@adraffy adraffy commented Apr 3, 2023

This is incomplete and requires some input/suggestions.

  1. How do I pick an ENSIP number? I chose 15 as it was next in the sequence.

  2. What should I do about the links to my repo? For example, chars-disallowed.js is just a massive file with various comments. Maybe these types of files should be formatted into separate document(s)?

  3. Where should I put static reference files? I created a new directory normalization-standard/; however, these files potentially should be versioned or dated (e.g., this is the Unicode 15 deployment).

  4. Normalization really has two parts: the derivation of the spec and the implementation of the spec. For the ENSIP, I mostly describe the implementation and only outline/hyperlink to the derivation (as it requires significantly more explanation). For example, without links to my code, I probably need a document that explains script-groups.js in more detail.

@Arachnid
Member

  1. 15 is fine.
  2. Please add an 'ensip-15/' folder under 'ens-improvement-proposals' and put them in there.
  3. As above; see Add section describing DNSSEC (#2).
  4. That seems fine by and large. Ideally the spec should allow someone to create a new implementation from nothing but the spec, but that can include data files (just not code).

@adraffy
Contributor Author

adraffy commented Apr 24, 2023

I moved my stuff into the ens-improvement-proposals/ensip-15/ directory. I added 3 additional markdown files which enable me to remove the links to my repo files.

I believe the spec.json approach has worked so far. Last I checked, djstrong had a compliant Python implementation.


To clarify, you want me to add a more technical document, like DNSSEC which includes some snippets and direct repo links?

@Arachnid
Member

Technical docs would be warmly welcomed! But for the ENSIP itself, all that's required is that an independent implementer can create a compliant implementation given only the standard.

* For user convenience, leading and trailing whitespace should be trimmed before normalization, as all whitespace codepoints are disallowed. Inner characters should remain unmodified.
* No string transformations (like case-folding) should be applied.

1. Split the name into [labels](./ensip-1-ens#name-syntax).
Member

They're trivial, but the split and join functions need to be defined somewhere.

Contributor Author

Sorry for the late edits — okay, adding Split and Join sections

I had left them undefined since the mechanics are the same as in ENSIP-1 (which I referenced).

Contributor Author

There are now Split and Join sections:

  • Split
    • Partition a name into labels, separated by 2E (.) FULL STOP, and return the resulting array.
    • Example: "abc.123.eth" → ["abc", "123", "eth"]
  • Join
    • Assemble an array of labels into a name, inserting 2E (.) FULL STOP between each label, and return the resulting string.
    • Example: ["abc", "123", "eth"] → "abc.123.eth"
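These operations are simple enough to sketch directly; in Python (function names are illustrative, not mandated by the spec):

```python
def split_name(name: str) -> list:
    # Partition a name into labels on U+002E (.) FULL STOP.
    return name.split("\u002E")

def join_labels(labels: list) -> str:
    # Assemble labels back into a name, inserting U+002E between each.
    return "\u002E".join(labels)

assert split_name("abc.123.eth") == ["abc", "123", "eth"]
assert join_labels(["abc", "123", "eth"]) == "abc.123.eth"
```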

Given a string, convert to codepoints, and produce a list of **Text** and **Emoji** tokens, each with a payload of codepoints.

1. Allocate an empty codepoint buffer.
1. Find the longest emoji sequence that matches the remaining input.
Member

How do you do this? It should be specified or linked to from here.


@adraffy Maybe it is simpler to explain it as a regular expression constructed as an alternation of all emoji.

Contributor Author

My convention was that the initial paragraph describes the process of the following itemized steps.

Is the question how to convert a string to codepoints? If so, can I make a general disclaimer that all strings are Unicode strings and 1:1 with a sequence of codepoints?

Contributor Author

> @adraffy Maybe it is simpler to explain it as a regular expression constructed as an alternation of all emoji.

The naive regex for this is enormous (48KB), although you're correct that it's conceptually simple, since it's just /^(emoji1|emoji2|...)/u where \uFE0F is replaced with \uFE0F? and * is replaced with \*

In v8, it appears to take only ~3MB of memory to store that regex (~30K DFA states). I wonder if that's actually a better way to parse the emoji... I'll have to do some tests.
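A sketch of that construction, under the simplifying assumptions above (a toy whitelist instead of the real spec.json data, and Python's re instead of v8's engine):

```python
import re

def build_emoji_regex(sequences):
    """Build a longest-match regex from a whitelist of emoji sequences.

    Alternation tries branches left to right, so sorting longest-first
    yields the longest match; FE0F (emoji variation selector) is made
    optional and '*' is escaped, per the comment above.
    """
    def branch(seq):
        return seq.replace("*", "\\*").replace("\uFE0F", "\uFE0F?")
    branches = [branch(s) for s in sorted(sequences, key=len, reverse=True)]
    return re.compile("(?:" + "|".join(branches) + ")")

# Toy whitelist (illustrative; the real list comes from spec.json):
MAN = "\U0001F468"
MAN_LIGHT = "\U0001F468\U0001F3FB"
MAN_TECH = "\U0001F468\U0001F3FB\u200D\U0001F4BB"
pattern = build_emoji_regex([MAN, MAN_LIGHT, MAN_TECH])
m = pattern.match(MAN_TECH + ".eth")
assert m.group(0) == MAN_TECH  # longest branch wins
```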

Member

My question here is how do you "find the longest emoji sequence that matches the remaining input"?


> > @adraffy Maybe it is simpler to explain it as a regular expression constructed as an alternation of all emoji.
>
> The naive regex for this is enormous (48KB), although you're correct that it's conceptually simple, since it's just /^(emoji1|emoji2|...)/u where \uFE0F is replaced with \uFE0F? and * is replaced with \*
>
> In v8, it appears to take only ~3MB of memory to store that regex (~30K DFA states). I wonder if that's actually a better way to parse the emoji... I'll have to do some tests.

@adraffy It is not better for performance; in Python we did that instead of using a trie as you did.

I proposed this here to answer @Arachnid's question. Maybe the regex is simpler than explaining the trie.

Contributor Author

I added the following to indicate the concept of longest:

  • The longest sequence prevents matching on a shorter sequence that has the same initial codepoints.
  • Example: 👨🏻‍💻 [1F468 1F3FB 200D 1F4BB]
    • Match (1): 👨️ [1F468] man
    • Match (2): 👨🏻 [1F468 1F3FB] man: light skin tone
    • Match (4): 👨🏻‍💻 [1F468 1F3FB 200D 1F4BB] man technologist: light skin tone ← longest match
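As a concrete (non-normative) way to realize "find the longest emoji sequence", here is a greedy longest-prefix scan over a set of codepoint tuples; a trie, as used in the reference implementation, is an optimization of the same idea:

```python
def longest_emoji_match(cps, emoji, max_len):
    """Return the longest whitelisted emoji sequence that prefixes cps,
    or None. emoji is a set of codepoint tuples (from spec.json);
    max_len bounds the scan to the length of the longest sequence."""
    for n in range(min(max_len, len(cps)), 0, -1):
        prefix = tuple(cps[:n])
        if prefix in emoji:
            return prefix
    return None

# The worked example above: the 4-codepoint sequence wins over its prefixes.
EMOJI = {(0x1F468,), (0x1F468, 0x1F3FB), (0x1F468, 0x1F3FB, 0x200D, 0x1F4BB)}
cps = [0x1F468, 0x1F3FB, 0x200D, 0x1F4BB, 0x2E]  # followed by '.'
assert longest_emoji_match(cps, EMOJI, 4) == (0x1F468, 0x1F3FB, 0x200D, 0x1F4BB)
```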

Kinda related, I spun off the emoji stuff into a new repo https://github.com/adraffy/emoji.js where I combine a simple compressor with the regex matching idea to get full UTS-51 support for 8KB.

Member

It's not the "concept of longest" that is unspecified - it's how you determine the longest sequence of emoji. You should provide specific directions on identifying where an emoji sequence ends.

Contributor Author

Ah, gotcha. I added the following near the top of the document:

An Emoji Sequence is a single entity composed of one or more emoji characters and emoji components.

And made sure I was using the word Emoji Sequence consistently throughout the document (except for the spec.json part where I say the word emoji a million times.)

I also added djstrong's regex suggestion.

Contributor Author

Also added definition of a string:

  • A string is a sequence of Unicode codepoints.

### Definitions

* Terms in **bold** throughout this document correspond with [components of `spec.json`](#description-of-specjson).
* An **Emoji Sequence** is a [single entity composed of one or more](https://unicode.org/reports/tr29/#Grapheme_Cluster_Boundaries) emoji characters and emoji components.
Member

Can you also explain how implementers determine if something is an emoji character or emoji component?

@Arachnid Arachnid merged commit 7d0e70f into ensdomains:master Jun 5, 2023
* Unicode version `15.0.0`
* Normalization is a living specification and should use the latest stable version of Unicode.
* [`spec.json`](./ensip-15/spec.json) contains all [necessary data](#description-of-specjson) for normalization.
* [`nf.json`](./ensip-15/nf.json) contains all necessary data for [Unicode Normalization Forms](https://unicode.org/reports/tr15/) NFC and NFD.

How exactly is nf.json supposed to be used to aid in the implementation? How does an implementer reference the ranks key and how is one supposed to interpret the lists in there?

I'm curious if this can actually be used as a reference with static value mapping (as spec.json tends to help with) or if during implementation NFC and / or NFD actually need to be applied to the text tokens.

An explainer of this file would be quite helpful.

Contributor Author

You are correct, I didn't provide an explanation.

My thinking was that the NF version issue is really only important on the web, where there are almost no space-efficient client-side Unicode libraries, and my library and the official ENS library cover that use case. I was assuming any non-web user would just import a library with the matching Unicode version.

I will include a section in my next update.


For reference, ranks is just a dense packing of Unicode combining classes.

Because NF orders combining marks, a static replacement table is insufficient.
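To illustrate why: NFD's canonical reordering step sorts adjacent combining marks by combining class (the same information that ranks packs). A sketch using Python's stdlib combining classes:

```python
import unicodedata

def canonical_order(cps):
    """Bubble-style canonical reordering from UAX #15: a mark with a
    lower nonzero combining class may not follow one with a higher
    class; starters (class 0) act as barriers and are never moved."""
    cps = list(cps)
    i = 1
    while i < len(cps):
        a = unicodedata.combining(chr(cps[i - 1]))
        b = unicodedata.combining(chr(cps[i]))
        if a > b > 0:
            cps[i - 1], cps[i] = cps[i], cps[i - 1]
            i = max(i - 1, 1)
        else:
            i += 1
    return cps

# U+0301 (acute, ccc=230) must come after U+0323 (dot below, ccc=220):
assert canonical_order([ord("q"), 0x301, 0x323]) == [ord("q"), 0x323, 0x301]
```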


Makes sense. Thanks for the prompt reply 👌🏼


* Normalization is the process of canonicalizing a name before [hashing](./ensip-1-ens.md#namehash-algorithm).
* It is idempotent: applying normalization multiple times produces the same result.
* For user convenience, leading and trailing whitespace should be trimmed before normalization, as all whitespace codepoints are disallowed. Inner characters should remain unmodified.
@fselmo fselmo Jun 16, 2023

Just a small note / nit (for the next draft): this can read as though implementers are meant to strip names for user convenience, since it's under the Algorithm section. That will lead to many failing tests that look for invalid leading and trailing characters.

Contributor Author

I'm not sure I follow this one, can you suggest what the remedy would be? I don't see how "leading and trailing whitespace" can be conflated with names/labels.

@fselmo fselmo Jun 19, 2023

Sorry, take or leave this suggestion if it isn't clear.

> For user convenience, leading and trailing whitespace should be trimmed before normalization, as all whitespace codepoints are disallowed. Inner characters should remain unmodified.

From my point of view, listing this bullet under Algorithm makes me think it's part of the normalization algorithm to strip any leading and trailing whitespace a user may have entered, for the user's convenience. I originally implemented this and it led to failing negative-case tests, because they were looking for errors related to this. Once I realized this is meant from the user's perspective, I stopped trimming leading and trailing whitespace in our implementation and the tests began failing with the expected errors.
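One way to read the resolution: trimming belongs in a UI wrapper around the normalizer, not in the algorithm itself. A hypothetical sketch (ens_normalize here is a minimal stand-in that only rejects whitespace, not the real algorithm):

```python
def ens_normalize(name):
    # Minimal stand-in for the real algorithm: reject any whitespace
    # codepoint (the real spec disallows far more than this).
    for ch in name:
        if ch.isspace():
            raise ValueError("disallowed whitespace: U+%04X" % ord(ch))
    return name  # real normalization omitted

def normalize_user_input(raw):
    # UI-layer convenience: trim leading/trailing whitespace before
    # normalizing; inner whitespace still (correctly) raises.
    return ens_normalize(raw.strip())

assert normalize_user_input("  abc.eth ") == "abc.eth"
```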

1. Start with the set of **ALL** groups.
1. For each unique character in the label:
* If the character is **Confused** (a member of a **Whole Confusable**):
* Retain groups with **Whole Confusable** characters excluding the **Confusable Extent** of the matching **Confused** character.
@fselmo fselmo Jun 16, 2023

As an implementer, I'm having a difficult time walking through this section. I'm not sure what this line means exactly. I think giving one example flow might help or maybe a bit better clarification.


edit: I think with the examples below it helps... but since Confusable Extent is defined as the mapping of multiple characters and multiple groups, maybe this is closer to "Retain a set of the groups from all Confusable Extents excluding the groups in the Confusable Extent of the matching character." Is that what this is asking for?

Contributor Author

@adraffy adraffy Jun 19, 2023

This algorithm was difficult to explain, but it's more digestible once you work through a few examples or see it visually. This tool may help: https://adraffy.github.io/ens-normalize.js/test/confused.html

The Unicode suggestion is also very terse and suffers from the problem that it's only hypothetical: "As usual, this algorithm is intended only as a definition; implementations should use an optimized routine that produces the same result." It also doesn't determine how to choose a "winner" and interacts weirdly with the Augmented Scripts and Script Extensions. From my POV, this is why this logic is implemented virtually nowhere on the web.

For reference, this is my implementation, which is only a few lines. My code uses an additional optimization, the "Confusable Extent Complement": instead of retaining the excluded groups, you just intersect with a pre-computed complement (for every confusable character, I've precomputed the groups that are outside that character's confusable extent).


Another way of describing the algorithm is that you're trying to create a function which generates confusable strings. You already know the string you have is "valid" (according to the previous step of Validate: every character belongs to at least one group.)

Starting from a generator of all strings, you take each character in the actual string and fine-tune the generator to produce similar strings. However, instead of making it generate only the original string, you make it generate all of the OTHER (confusable) strings EXCEPT the original string.

If you can successfully produce a function of this kind, then the original string is confusable by construction — there exists a confusable string that isn't linked to the original string (or its extents).

Since some groups have confusables between their own characters and some characters are shared between groups, the Confusable Extent concept is necessary to ensure that the constructed generator doesn't produce strings that are just valid-BUT-different substitutions of the original string.


> Retain groups with Whole Confusable characters excluding the Confusable Extent of the matching Confused character.

> "Retain a set of the groups from all Confusable Extents excluding the groups in the Confusable Extent of the matching character." Is that what this is asking for?

The extent is only relevant for the current confusable: it excludes everything the character could be substituted with.

Prior Groups: {A, B, C, D}
Current Character: X (placeholder)
Current Confusable: X → {C, D}, Y → {B}
Union of Confusable Groups: {B, C, D}
Current Character Confusable Extent: {X} x {C, D}
Confusable Groups not in Extent: {B}
Retained Groups: {B}
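That worked example can be expressed as a few set operations (heavily simplified: real Confusable Extents span multiple characters and groups, and the real data comes from spec.json):

```python
def retain_groups(prior, whole_confusable, char):
    """One retention step of the whole-confusable check.

    whole_confusable maps each confusable character to the set of
    groups it appears in; the extent is simplified here to the groups
    of the matching character."""
    union = set().union(*whole_confusable.values())
    extent = whole_confusable[char]
    return prior & (union - extent)

# Worked example from the comment above:
prior = {"A", "B", "C", "D"}
wc = {"X": {"C", "D"}, "Y": {"B"}}
assert retain_groups(prior, wc, "X") == {"B"}
```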


The Confusable Extent concept is my own. There may be a simpler way to explain it. The current algorithm is designed to be efficient and minimal (so it can be implemented on-chain). It's possible a more direct algorithm exists.

@fselmo

fselmo commented Jun 19, 2023

Thank you for taking the time to explain this. I really appreciate it. My intention here is just to chime in where I couldn't understand the spec in hopes of making it more clear for others.

I think I found out where my implementation was going "wrong" and it's not immediately clear to me if this is intentional or not. I'm passing all the tests now but I still have some questions.

Taking an example label from the tests, "ᎫᏦᎥ" (codepoints [5035, 5094, 5029]), we end up with all confused characters. The retained groups are Latin, Han, Japanese, Korean.

From the spec:

> If any Confused characters were found:
> - Assert none of the remaining groups contain any of the buffered characters.

In this case there are no buffered characters because all were confused. Using a comparison such as all() in Python, the following pseudocode still ends up evaluating to True even though the buffer is empty:

```python
>>> buffer
set()
>>> all(cp in retained_group.valid_cps() for cp in buffer)
True
```

With the above in mind, really any group would contain all of the empty set buffer. If you used any list of cps for group.valid_cps() above, this would evaluate to True, even though there are no characters that we are actually comparing.

I checked with this exact example and this is how the other python implementation is working (namehash's implementation) and they based their code off the js code. I haven't checked the js if that's a similar case.

This really tripped me up, because if it is intended this way, as I implemented it, I would only compare when the buffer was non-empty; otherwise no group would truly contain all of an empty character set. If anything, this feels like it should not be an invalid name, because no remaining group contains all of the buffered characters when there are no buffered characters.

It's possible I'm just too fried from looking at only this for the past few days 😄
If this is all how it's supposed to evaluate, I think a description for the case of an empty buffer would help quite a bit.


edit:

Asserting that none of the remaining groups contain any of the buffered characters should look more like:

```python
if any(cp in retained_group.valid_cps() for cp in buffer):
    raise InvalidName(f"...: {real_group} / {retained_group.name}")
```

And if I do use this, the specific test case I mentioned would not raise this exception since there are no characters to look for. I originally had something similar to this and that was causing about 122 failures that were very difficult to debug until I got here.

@adraffy
Contributor Author

adraffy commented Jun 20, 2023

Ah yes, you are correct, that should be ALL:

> Assert none of the remaining groups contain ALL of the buffered characters.

And yes, ALL of empty set is true. I can make this more clear.
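The vacuous-truth behavior under discussion is easy to demonstrate:

```python
# all() over an empty collection is vacuously True, so with an empty
# buffer every remaining group trivially "contains all buffered
# characters" and the label is reported as whole-script confusable.
buffer = set()
group_valid_cps = {0x61, 0x62}  # any group at all
assert all(cp in group_valid_cps for cp in buffer) is True

# With a non-empty buffer, the check is substantive:
buffer = {0x5A}
assert all(cp in group_valid_cps for cp in buffer) is False
```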


If you finish the loop without exiting early, you have at least 1 group.

If there are no buffered characters, as you describe, it's confusable, since another group has a confusable character for each character in the original string.

However, if there are buffered characters, there's a chance some of those characters aren't in the final set (which could mean the string isn't normalizable and therefore not confusable). I perform this check at the end of the loop because the excluded set keeps shrinking as you progress through the string.


For reference, characters that get buffered must belong to multiple groups and can come from two sources:

  1. non-confusable, non-unique (e.g. _)
  2. confusable but valid (Whole Confusable but not Confused) (e.g. a)

Both are a consequence of the following logic:

> If the character is Confused (a member of a Whole Confusable):
>   If the character is Unique, the label is not confusable.
>   Otherwise: ...

If you have a Cyrillic а, it's important to know that it confuses with Latin a.
But if you have Latin a, it doesn't confuse with anything since it was selected as a winner.

@fselmo

fselmo commented Jun 20, 2023

No, that makes sense. Changing any to all resolves all my issues, haha. It makes sense why an empty buffer means an invalid label, but that might benefit from an explicit statement in the spec, since there are no characters to compare at that point. Thanks again for the help 👍🏼
