
ENSIP-15: Normalization Standard #134

Merged — 14 commits merged into ensdomains:master on Jun 5, 2023

Conversation

adraffy
Contributor

@adraffy adraffy commented Apr 3, 2023

This is incomplete and requires some input/suggestions.

  1. How do I pick an ENSIP number? I chose 15 as it was next in the sequence.

  2. What should I do about the links to my repo? For example, chars-disallowed.js is just a massive file with various comments. Maybe these types of files should be formatted into separate document(s)?

  3. Where should I put static reference files? I created a new directory normalization-standard/; however, these files potentially should be versioned or dated (e.g., this is the Unicode 15 deployment).

  4. Normalization really has two parts: the derivation of the spec and the implementation of the spec. For the ENSIP, I mostly describe the implementation and only outline/hyperlink to the derivation (as it requires significantly more explanation). For example, without links to my code, I probably need a document that explains script-groups.js in more detail.

@Arachnid
Member

  1. 15 is fine.
  2. Please add an 'ensip-15/' folder under 'ens-improvement-proposals' and put them in there.
  3. As above; see Add section describing DNSSEC (#2).
  4. That seems fine by and large. Ideally the spec should allow someone to create a new implementation from nothing but the spec, but that can include data files (just not code).

@adraffy
Contributor Author

adraffy commented Apr 24, 2023

I moved my stuff into the ens-improvement-proposals/ensip-15/ directory. I added 3 additional markdown files which enable me to remove the links to my repo files.

I believe the spec.json approach has worked so far. Last I checked, djstrong had a compliant Python implementation.


To clarify, you want me to add a more technical document, like DNSSEC which includes some snippets and direct repo links?

@Arachnid
Member

Technical docs would be warmly welcomed! But for the ENSIP itself, all that's required is that an independent implementer can create a compliant implementation given only the standard.

* For user convenience, leading and trailing whitespace should be trimmed before normalization, as all whitespace codepoints are disallowed. Inner characters should remain unmodified.
* No string transformations (like case-folding) should be applied.

1. Split the name into [labels](./ensip-1-ens#name-syntax).
Member

They're trivial, but the split and join functions need to be defined somewhere.

Contributor Author

Sorry for the late edits — okay, adding Split and Join sections

I had left them undefined since the mechanics are the same as in ENSIP-1 (which I referenced).

Contributor Author

There are now Split and Join sections:

  • Split
    • Partition a name into labels, separated by 2E (.) FULL STOP, and return the resulting array.
    • Example: "abc.123.eth" → ["abc", "123", "eth"]
  • Join
    • Assemble an array of labels into a name, inserting 2E (.) FULL STOP between each label, and return the resulting string.
    • Example: ["abc", "123", "eth"] → "abc.123.eth"
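These operations are simple enough to sketch directly; in Python (function names are illustrative, not mandated by the spec):

```python
def split_name(name: str) -> list:
    # Partition a name into labels on U+002E (.) FULL STOP.
    return name.split("\u002E")

def join_labels(labels: list) -> str:
    # Assemble labels back into a name, inserting U+002E between each.
    return "\u002E".join(labels)

assert split_name("abc.123.eth") == ["abc", "123", "eth"]
assert join_labels(["abc", "123", "eth"]) == "abc.123.eth"
```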

Given a string, convert to codepoints, and produce a list of **Text** and **Emoji** tokens, each with a payload of codepoints.

1. Allocate an empty codepoint buffer.
1. Find the longest emoji sequence that matches the remaining input.
Member

How do you do this? It should be specified or linked to from here.


@adraffy Maybe it is simpler to explain it as a regular expression constructed as an alternation of all emoji.

Contributor Author

My convention was that the initial paragraph describes the process of the following itemized steps.

Is the question how to convert a string to codepoints? If so, can I make a general disclaimer that all strings are Unicode strings and 1:1 with a sequence of codepoints?

Contributor Author

> @adraffy Maybe it is simpler to explain it as a regular expression constructed as an alternation of all emoji.

The naive regex for this is enormous (48KB), although you're correct that it's conceptually simple, since it's just /^(emoji1|emoji2|...)/u where \uFE0F is replaced with \uFE0F? and * is replaced with \*

In v8, it appears to take only ~3MB of memory to store that regex (~30K DFA states). I wonder if that's actually a better way to parse the emoji... I'll have to do some tests.
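A sketch of that construction, under the simplifying assumptions above (a toy whitelist instead of the real spec.json data, and Python's re instead of v8's engine):

```python
import re

def build_emoji_regex(sequences):
    """Build a longest-match regex from a whitelist of emoji sequences.

    Alternation tries branches left to right, so sorting longest-first
    yields the longest match; FE0F (emoji variation selector) is made
    optional and '*' is escaped, per the comment above.
    """
    def branch(seq):
        return seq.replace("*", "\\*").replace("\uFE0F", "\uFE0F?")
    branches = [branch(s) for s in sorted(sequences, key=len, reverse=True)]
    return re.compile("(?:" + "|".join(branches) + ")")

# Toy whitelist (illustrative; the real list comes from spec.json):
MAN = "\U0001F468"
MAN_LIGHT = "\U0001F468\U0001F3FB"
MAN_TECH = "\U0001F468\U0001F3FB\u200D\U0001F4BB"
pattern = build_emoji_regex([MAN, MAN_LIGHT, MAN_TECH])
m = pattern.match(MAN_TECH + ".eth")
assert m.group(0) == MAN_TECH  # longest branch wins
```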

Member

My question here is how do you "find the longest emoji sequence that matches the remaining input"?


> > @adraffy Maybe it is simpler to explain it as a regular expression constructed as an alternation of all emoji.
>
> The naive regex for this is enormous (48KB), although you're correct that it's conceptually simple, since it's just /^(emoji1|emoji2|...)/u where \uFE0F is replaced with \uFE0F? and * is replaced with \*
>
> In v8, it appears to take only ~3MB of memory to store that regex (~30K DFA states). I wonder if that's actually a better way to parse the emoji... I'll have to do some tests.

@adraffy It is not better for performance; in Python we did that instead of using a trie as you did.

I proposed this here to answer @Arachnid's question. Maybe the regex is simpler than explaining the trie.

Contributor Author

I added the following to indicate the concept of longest:

  • The longest sequence prevents matching on a shorter sequence that has the same initial codepoints.
  • Example: 👨🏻‍💻 [1F468 1F3FB 200D 1F4BB]
    • Match (1): 👨️ [1F468] man
    • Match (2): 👨🏻 [1F468 1F3FB] man: light skin tone
    • Match (4): 👨🏻‍💻 [1F468 1F3FB 200D 1F4BB] man technologist: light skin tone ← longest match
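As a concrete (non-normative) way to realize "find the longest emoji sequence", here is a greedy longest-prefix scan over a set of codepoint tuples; a trie, as used in the reference implementation, is an optimization of the same idea:

```python
def longest_emoji_match(cps, emoji, max_len):
    """Return the longest whitelisted emoji sequence that prefixes cps,
    or None. emoji is a set of codepoint tuples (from spec.json);
    max_len bounds the scan to the length of the longest sequence."""
    for n in range(min(max_len, len(cps)), 0, -1):
        prefix = tuple(cps[:n])
        if prefix in emoji:
            return prefix
    return None

# The worked example above: the 4-codepoint sequence wins over its prefixes.
EMOJI = {(0x1F468,), (0x1F468, 0x1F3FB), (0x1F468, 0x1F3FB, 0x200D, 0x1F4BB)}
cps = [0x1F468, 0x1F3FB, 0x200D, 0x1F4BB, 0x2E]  # followed by '.'
assert longest_emoji_match(cps, EMOJI, 4) == (0x1F468, 0x1F3FB, 0x200D, 0x1F4BB)
```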

Kinda related, I spun off the emoji stuff into a new repo https://github.com/adraffy/emoji.js where I combine a simple compressor with the regex matching idea to get full UTS-51 support for 8KB.

Member

It's not the "concept of longest" that is unspecified - it's how you determine the longest sequence of emoji. You should provide specific directions on identifying where an emoji sequence ends.

Contributor Author

Ah, gotcha. I added the following near the top of the document:

An Emoji Sequence is a single entity composed of one or more emoji characters and emoji components.

And made sure I was using the word Emoji Sequence consistently throughout the document (except for the spec.json part where I say the word emoji a million times.)

I also added djstrong's regex suggestion.

Contributor Author

Also added definition of a string:

  • A string is a sequence of Unicode codepoints.

### Definitions

* Terms in **bold** throughout this document correspond with [components of `spec.json`](#description-of-specjson).
* An **Emoji Sequence** is a [single entity composed of one or more](https://unicode.org/reports/tr29/#Grapheme_Cluster_Boundaries) emoji characters and emoji components.
Member

Can you also explain how implementers determine if something is an emoji character or emoji component?

@Arachnid Arachnid merged commit 7d0e70f into ensdomains:master Jun 5, 2023
* Unicode version `15.0.0`
* Normalization is a living specification and should use the latest stable version of Unicode.
* [`spec.json`](./ensip-15/spec.json) contains all [necessary data](#description-of-specjson) for normalization.
* [`nf.json`](./ensip-15/nf.json) contains all necessary data for [Unicode Normalization Forms](https://unicode.org/reports/tr15/) NFC and NFD.

How exactly is nf.json supposed to be used to aid in the implementation? How does an implementer reference the ranks key and how is one supposed to interpret the lists in there?

I'm curious if this can actually be used as a reference with static value mapping (as spec.json tends to help with) or if during implementation NFC and / or NFD actually need to be applied to the text tokens.

An explainer of this file would be quite helpful.

Contributor Author

You are correct, I didn't provide an explanation.

My thinking was that the NF version issue is really only important on the web, where there are almost no space-efficient client-side Unicode libraries, and my library and the official ENS library cover that use case. I was assuming any non-web user would just import a library with the matching Unicode version.

I will include a section in my next update.


For reference, ranks is just a dense packing of Unicode combining classes.

Because NF orders combining marks, a static replacement table is insufficient.
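To illustrate why: NFD's canonical reordering step sorts adjacent combining marks by combining class (the same information that ranks packs). A sketch using Python's stdlib combining classes:

```python
import unicodedata

def canonical_order(cps):
    """Bubble-style canonical reordering from UAX #15: a mark with a
    lower nonzero combining class may not follow one with a higher
    class; starters (class 0) act as barriers and are never moved."""
    cps = list(cps)
    i = 1
    while i < len(cps):
        a = unicodedata.combining(chr(cps[i - 1]))
        b = unicodedata.combining(chr(cps[i]))
        if a > b > 0:
            cps[i - 1], cps[i] = cps[i], cps[i - 1]
            i = max(i - 1, 1)
        else:
            i += 1
    return cps

# U+0301 (acute, ccc=230) must come after U+0323 (dot below, ccc=220):
assert canonical_order([ord("q"), 0x301, 0x323]) == [ord("q"), 0x323, 0x301]
```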


Makes sense. Thanks for the prompt reply 👌🏼


* Normalization is the process of canonicalizing a name before [hashing](./ensip-1-ens.md#namehash-algorithm).
* It is idempotent: applying normalization multiple times produces the same result.
* For user convenience, leading and trailing whitespace should be trimmed before normalization, as all whitespace codepoints are disallowed. Inner characters should remain unmodified.
@fselmo fselmo Jun 16, 2023

Just a small note / nit (for the next draft): this can read as though implementers are meant to strip names for user convenience, since it's under the Algorithm section. That will lead to many failing tests that look for invalid leading and trailing characters.

Contributor Author

I'm not sure I follow this one, can you suggest what the remedy would be? I don't see how "leading and trailing whitespace" can be conflated with names/labels.

@fselmo fselmo Jun 19, 2023

Sorry, take or leave this suggestion if it isn't clear.

> For user convenience, leading and trailing whitespace should be trimmed before normalization, as all whitespace codepoints are disallowed. Inner characters should remain unmodified.

From my point of view, listing this bullet under Algorithm makes me think it's part of the normalization algorithm to strip any leading and trailing whitespace a user may have entered, for the user's convenience. I originally implemented this and it led to failing negative-case tests, because they were looking for errors related to this. Once I realized this is meant from the user's perspective, I stopped trimming leading and trailing whitespace in our implementation and the tests began failing with the expected errors.
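One way to read the resolution: trimming belongs in a UI wrapper around the normalizer, not in the algorithm itself. A hypothetical sketch (ens_normalize here is a minimal stand-in that only rejects whitespace, not the real algorithm):

```python
def ens_normalize(name):
    # Minimal stand-in for the real algorithm: reject any whitespace
    # codepoint (the real spec disallows far more than this).
    for ch in name:
        if ch.isspace():
            raise ValueError("disallowed whitespace: U+%04X" % ord(ch))
    return name  # real normalization omitted

def normalize_user_input(raw):
    # UI-layer convenience: trim leading/trailing whitespace before
    # normalizing; inner whitespace still (correctly) raises.
    return ens_normalize(raw.strip())

assert normalize_user_input("  abc.eth ") == "abc.eth"
```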

1. Start with the set of **ALL** groups.
1. For each unique character in the label:
* If the character is **Confused** (a member of a **Whole Confusable**):
* Retain groups with **Whole Confusable** characters excluding the **Confusable Extent** of the matching **Confused** character.
@fselmo fselmo Jun 16, 2023

As an implementer, I'm having a difficult time walking through this section. I'm not sure what this line means exactly. I think giving one example flow might help or maybe a bit better clarification.


edit: I think with the examples below it helps... but since Confusable Extent is defined as the mapping of multiple characters and multiple groups, maybe this is closer to "Retain a set of the groups from all Confusable Extents excluding the groups in the Confusable Extent of the matching character." Is that what this is asking for?

Contributor Author

@adraffy adraffy Jun 19, 2023

This algorithm was difficult to explain, but it's more digestible once you work through a few examples or see it visually. This tool may help: https://adraffy.github.io/ens-normalize.js/test/confused.html

The Unicode suggestion is also very terse and suffers from the problem that it's only hypothetical: "As usual, this algorithm is intended only as a definition; implementations should use an optimized routine that produces the same result." It also doesn't determine how to choose a "winner" and interacts weirdly with the Augmented Scripts and Script Extensions. From my POV, this is why this logic is implemented virtually nowhere on the web.

For reference, this is my implementation, which is only a few lines. My code uses an additional optimization, the "Confusable Extent Complement": instead of retaining the excluded groups, you just intersect with a pre-computed complement (for every confusable character, I've precomputed the groups that are outside that character's confusable extent).


Another way of describing the algorithm is that you're trying to create a function which generates confusable strings. You already know the string you have is "valid" (according to the previous step of Validate: every character belongs to at least one group.)

Starting from a generator of all strings, you take each character in the actual string and fine-tune the generator to produce similar strings. However, instead of making it generate only the original string, you make it generate all of the OTHER (confusable) strings EXCEPT the original string.

If you can successfully produce a function of this kind, then the original string is confusable by construction — there exists a confusable string that isn't linked to the original string (or its extents).

Since some groups have confusables between their own characters and some characters are shared between groups, the Confusable Extent concept is necessary to ensure that the constructed generator doesn't produce strings that are just valid-BUT-different substitutions of the original string.


> Retain groups with Whole Confusable characters excluding the Confusable Extent of the matching Confused character.

> "Retain a set of the groups from all Confusable Extents excluding the groups in the Confusable Extent of the matching character." Is that what this is asking for?

The extent is only relevant for the current confusable: it excludes everything the character could be substituted with.

Prior Groups: {A, B, C, D}
Current Character: X (placeholder)
Current Confusable: X → {C, D}, Y → {B}
Union of Confusable Groups: {B, C, D}
Current Character Confusable Extent: {X} x {C, D}
Confusable Groups not in Extent: {B}
Retained Groups: {B}
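That worked example can be expressed as a few set operations (heavily simplified: real Confusable Extents span multiple characters and groups, and the real data comes from spec.json):

```python
def retain_groups(prior, whole_confusable, char):
    """One retention step of the whole-confusable check.

    whole_confusable maps each confusable character to the set of
    groups it appears in; the extent is simplified here to the groups
    of the matching character."""
    union = set().union(*whole_confusable.values())
    extent = whole_confusable[char]
    return prior & (union - extent)

# Worked example from the comment above:
prior = {"A", "B", "C", "D"}
wc = {"X": {"C", "D"}, "Y": {"B"}}
assert retain_groups(prior, wc, "X") == {"B"}
```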


The Confusable Extent concept is my own. There may be a simpler way to explain it. The current algorithm is designed to be efficient and minimal (so it can be implemented on-chain). It's possible a more direct algorithm exists.

@fselmo

fselmo commented Jun 19, 2023

Thank you for taking the time to explain this. I really appreciate it. My intention here is just to chime in where I couldn't understand the spec in hopes of making it more clear for others.

I think I found out where my implementation was going "wrong" and it's not immediately clear to me if this is intentional or not. I'm passing all the tests now but I still have some questions.

Taking an example label from the tests, "ᎫᏦᎥ" (codepoints [5035, 5094, 5029]), we end up with all confused characters. The retained groups are Latin, Han, Japanese, Korean.

From the spec:

> If any Confused characters were found:
> - Assert none of the remaining groups contain any of the buffered characters.

In this case there are no buffered characters because all were confused. Using a comparison such as all() in Python, the following pseudocode still ends up evaluating to True even though the buffer is empty:

```python
>>> buffer
set()
>>> all(cp in retained_group.valid_cps() for cp in buffer)
True
```

With the above in mind, really any group would contain all of the empty set buffer. If you used any list of cps for group.valid_cps() above, this would evaluate to True, even though there are no characters that we are actually comparing.

I checked with this exact example and this is how the other python implementation is working (namehash's implementation) and they based their code off the js code. I haven't checked the js if that's a similar case.

This really tripped me up, because if it is intended this way, as I implemented it, I would only compare when the buffer was non-empty; otherwise no group would truly contain all of an empty character set. If anything, this feels like it should not be an invalid name, because no remaining group contains all of the buffered characters when there are no buffered characters.

It's possible I'm just too fried from looking at only this for the past few days 😄
If this is all how it's supposed to evaluate, I think a description for the case of an empty buffer would help quite a bit.


edit:

Asserting that none of the remaining groups contain any of the buffered characters should look more like:

```python
if any(cp in retained_group.valid_cps() for cp in buffer):
    raise InvalidName(f"...: {real_group} / {retained_group.name}")
```

And if I do use this, the specific test case I mentioned would not raise this exception since there are no characters to look for. I originally had something similar to this and that was causing about 122 failures that were very difficult to debug until I got here.

@adraffy
Contributor Author

adraffy commented Jun 20, 2023

Ah yes, you are correct, that should be ALL:

> Assert none of the remaining groups contain ALL of the buffered characters.

And yes, ALL of empty set is true. I can make this more clear.
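The vacuous-truth behavior under discussion is easy to demonstrate:

```python
# all() over an empty collection is vacuously True, so with an empty
# buffer every remaining group trivially "contains all buffered
# characters" and the label is reported as whole-script confusable.
buffer = set()
group_valid_cps = {0x61, 0x62}  # any group at all
assert all(cp in group_valid_cps for cp in buffer) is True

# With a non-empty buffer, the check is substantive:
buffer = {0x5A}
assert all(cp in group_valid_cps for cp in buffer) is False
```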


If you finish the loop without exiting early, you have at least 1 group.

If there are no buffered characters, as you describe, it's confusable, since another group has a confusable character for each character in the original string.

However, if there are buffered characters, there's a chance some of those characters aren't in the final set (which could mean the string isn't normalizable and therefore not confusable). I perform this check at the end of the loop because the excluded set keeps shrinking as you progress through the string.


For reference, characters that get buffered must belong to multiple groups and can come from two sources:

  1. non-confusable, non-unique (e.g. _)
  2. confusable but valid (Whole Confusable but not Confused) (e.g. a)

Both are a consequence of the following logic:

> If the character is Confused (a member of a Whole Confusable):
>   If the character is Unique, the label is not confusable.
>   Otherwise: ...

If you have a Cyrillic а, it's important to know that it confuses with Latin a.
But if you have Latin a, it doesn't confuse with anything since it was selected as a winner.

@fselmo

fselmo commented Jun 20, 2023

No, that makes sense. Changing any to all resolves all my issues, haha. It makes sense why an empty buffer means an invalid label, but that might benefit from an explicit statement in the spec, since there are no characters to compare at that point. Thanks again for the help 👍🏼
