Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Ensure unicode characters are handled correctly #282

Closed
Arachnid opened this issue May 8, 2019 · 11 comments · Fixed by #324
Closed

Ensure unicode characters are handled correctly #282

Arachnid opened this issue May 8, 2019 · 11 comments · Fixed by #324
Assignees
Labels
bug Something isn't working permanent registrar

Comments

@Arachnid
Copy link
Member

Arachnid commented May 8, 2019

For example, https://manager.ens.domains/name/%F0%9F%91%A9%E2%80%8D%E2%9D%A4%EF%B8%8F%E2%80%8D%F0%9F%91%A8.eth should be treated as too short (it's 4 unicode characters) - but the UI doesn't say it's too short, or offer the opportunity to register it.

@makoto makoto self-assigned this May 8, 2019
@makoto
Copy link
Member

makoto commented May 8, 2019

Ok, this is quite confusing.

Currently ENS manager counts the characters as 7 chars as they are double bytes characters.

When I copy&pasted the url, the URL bar shows as 3 characters.

Screenshot 2019-05-08 at 17 41 41

However, if I copy&pste the emoji part into console, I can see that it is actually 5 chars because there are 2 whitespace which separates 3 characters.

Screenshot 2019-05-08 at 17 45 22

I can further verify that by converting the unicode into array which is suggested as an easy way to count unicode characters.

const emoji = [...name]

Screenshot 2019-05-08 at 17 45 28

If I want to add fuel into the fire, 👩‍❤‍👨and 👩‍❤️‍👨 are actually the same according to this.

Can you point to the rule our smart contract is following with some example emojis and their expected number lengths so that I can try my best to match that.🤯

@makoto
Copy link
Member

makoto commented May 8, 2019

Should 💩💩💩💩.eth be counted as 4 chars or 8 chars?

Screenshot 2019-05-08 at 18 13 16

Apparently Alex Van de Sande owned it and just migrated

http://localhost:3000/name/%F0%9F%92%A9%F0%9F%92%A9%F0%9F%92%A9%F0%9F%92%A9.eth
https://twitter.com/avsa/status/1126154257466634241

@Arachnid
Copy link
Member Author

Arachnid commented May 8, 2019

We should be counting unicode characters. That means that 'wide' characters only count as one (contrary to JS). This is what the contract does, too - it accepts UTF-8 encoded characters and counts codepoints.

👩‍❤️‍👨 is five characters (👩‍❤‍👨 with ZWJ characters between them). Alex's name is four.

@makoto
Copy link
Member

makoto commented May 8, 2019

Then is converting unicode into array as below cover all the cases or are there any edge cases I have to consider?

const emoji = [...name]

@Arachnid
Copy link
Member Author

Arachnid commented May 8, 2019

I'm not a JS expert, but if it converts it to unicode codepoints, then you're sorted.

@cfl0ws cfl0ws added bug Something isn't working permanent registrar labels May 10, 2019
@makoto
Copy link
Member

makoto commented May 13, 2019

They look different and url are also different but the namehash looks the same.

emoji 👩‍❤️‍👨👩‍❤️‍👨👩‍❤️‍👨
namehash 0xbb76480c9276a2e796e1eaaa4e7f18b1f350e294d466793ed00b05dc69cd5be9
url https://manager.ens.domains/search/%F0%9F%91%A9%E2%80%8D%E2%9D%A4%EF%B8%8F%E2%80%8D%F0%9F%91%A8%F0%9F%91%A9%E2%80%8D%E2%9D%A4%EF%B8%8F%E2%80%8D%F0%9F%91%A8%F0%9F%91%A9%E2%80%8D%E2%9D%A4%EF%B8%8F%E2%80%8D%F0%9F%91%A8

emoji 👩‍❤‍👨👩‍❤‍👨👩‍❤‍👨
namehash 0xbb76480c9276a2e796e1eaaa4e7f18b1f350e294d466793ed00b05dc69cd5be9
url http://manager.ens.domains/search/%F0%9F%91%A9%E2%80%8D%E2%9D%A4%E2%80%8D%F0%9F%91%A8%F0%9F%91%A9%E2%80%8D%E2%9D%A4%E2%80%8D%F0%9F%91%A8%F0%9F%91%A9%E2%80%8D%E2%9D%A4%E2%80%8D%F0%9F%91%A8

Because the namehash are the same, ens manger treat them the same. Is this a bug on eth-ens-namehash or should we do some special encoding before passing into namehash?

@Arachnid
Copy link
Member Author

What's happening is that namehash is normalising the name first, and normalisation is removing the ZWJ characters that create the combined emoji. That does mean that those two names are considered to be the same.

We should make sure we're checking the length of a name after normalisation.

@makoto
Copy link
Member

makoto commented May 14, 2019

So 👩‍❤️‍👨 is displayed as 👩‍❤‍👨 and counted as 3? Also do we need to normalize URL string?

@Arachnid
Copy link
Member Author

So 👩‍❤️‍👨 is displayed as 👩‍❤‍👨 and counted as 3?

That's right.

Also do we need to normalize URL string?

Yes; any time a user visits a non-normalised URL, we should normalise it and redirect them.

@makoto
Copy link
Member

makoto commented May 15, 2019

Ok, here is a bit more detailed analysis.

Please have a look at this.

Screenshot 2019-05-15 at 19 38 45

👩‍❤️‍👨actually has 6 chars because there are 2 zwj between the heart and man.

Parsing it with idna-uts46 lib (which is put by @jefflau ) takes 1 zwj char away to deconstruct the combined char into 3 chars but there are still 2 zwj chars. If I manually remove all zwj chars (eg: [...res].filter((r) => r !== zwj).join('')), then it becomes 3 but loses the color (red) hence the left size shows grey heart. 6 chars and 5 chars the ones which creates same name hash whereas 3 chars does generate completely different hash.

Now I have the following options.

  • a: 👩‍❤️‍👨counts as 6
  • b: 👩‍❤‍👨counts as 5
  • c: 👩❤👨(this looks like b but heart is grey) counts as 3

When I try to check how many they will be counted at Solidity using remix with this gist they are counted as same.

As A&B generates same namehash while c is different. I will convert all into (b), making it count as 5:

FYI, 🏳️‍🌈looks also combined emoji of 🏳🌈but we can retain the rainbow flag shape after being parsed.

Screenshot 2019-05-15 at 20 30 34

Z͑ͫ̓ͪ̂ͫ̽͏̴̙̤̞͉͚̯̞̠͍A̴̵̜̰͔ͫ͗͢L̠ͨͧͩ͘G̴̻͈͍͔̹̑͗̎̅͛́Ǫ̵̹̻̝̳͂̌̌͘!͖̬̰̙̗̿̋ͥͥ̂ͣ̐́́͜͞ . gets malformed.

@Arachnid
Copy link
Member Author

All that needs to be done is this:

  1. Normalise the name.
  2. Count the characters in the name after normalisation.

Whatever UTS-46 normalisation does is correct, and what we should do everywhere.

For normalisation and hashing alike, you can use the eth-ens-namehash library, rather than importing idna-uts46 directly.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bug Something isn't working permanent registrar
Projects
None yet
Development

Successfully merging a pull request may close this issue.

3 participants