Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Write our to_ascii function (drop ICU dependency) #89

Closed
lemire opened this issue Jan 21, 2023 · 15 comments · Fixed by #216
Closed

Write our to_ascii function (drop ICU dependency) #89

lemire opened this issue Jan 21, 2023 · 15 comments · Fixed by #216

Comments

@lemire
Copy link
Member

lemire commented Jan 21, 2023

It is not impossibly hard to drop the ICU dependency. It would simplify builds and improve engineering.

I estimate that it would take about a week of work.

@miguelteixeiraa
Copy link
Contributor

miguelteixeiraa commented Jan 25, 2023

Is the to_ascii function that lives in src/unicode.cpp the work related to this issue already? Or this issue is to replace that function?

@anonrig
Copy link
Member

anonrig commented Jan 25, 2023

That's correct @miguelteixeiraa. ToASCII refers to domain to ascii algorithm in the URL spec. Currently, we are using ICU in non-windows environments, and our own implementation in Windows. We want to get rid of ICU while keeping the test results happy.

@miguelteixeiraa
Copy link
Contributor

Thinking about pick this one up. Does this issue have any urgency? Is anyone already working on this? Also please let me know if there is another one that would be more interesting to pick up now.

@lemire
Copy link
Member Author

lemire commented Jan 27, 2023

@miguelteixeiraa It would be a great issue.

I have a nasty PR that demonstrates what it might look like...
#68

Don't copy my code, it is not good. It was done quickly. One mistake I did was to try to stay in UTF-8. It is a bad idea. Move to UTF-32 right away and process in UTF-32 mode throughout, getting back to UTF-8 only at the end.

But I proved that you could almost get it done in a few hours. In truth, I probably did less than 20% of the required work, but it was enough to pass 90% of our tests. Getting it to be correct takes many hours.

UTF-32 is pretty simple. Each character is a 32-bit integer (but only just about the first million values are allowed, so 32-bit is too much, but that's fine). The largest possible value is 0x10FFFF. But, actually, anything over 0xE01EF is disallowed for us.

Here are some starting points.

  1. Currently, we take in UTF-8 and we produce UTF-32. When we are done, we need to go back to UTF-32. Now all the necessary functions can be found there : https://github.com/simdutf/simdutf/tree/master/src/scalar We could cleanly add conversion functions to ada.
  2. We need to convert to and from punycode. The format is slightly insane. I tried doing it quickly on my own, but it is silly work.
  3. We need to implement mapping (see below). We want to implement non transitional there is no funny business.
  4. We need to implement normalization.
  5. We need to implement validity checks.

Reference to the standard: https://www.unicode.org/reports/tr46/#ToUnicode

Step 3 is a fun part. It depends on the Unicode version. For Unicode 15, the table is https://www.unicode.org/Public/idna/15.0.0/IdnaMappingTable.txt

Important: we want to be able to update the future Unicode releases so that the table should not be hard-coded... we should have a script that goes from the table to our implementation.

  • Map. For each code point in the domain_name string, look up the status value in Section 5, IDNA Mapping Table, and take the following actions:
    • disallowed: Leave the code point unchanged in the string, and record that there was an error.
    • ignored: Remove the code point from the string. This is equivalent to mapping the code point to an empty string.
    • mapped: Replace the code point in the string by the value for the mapping in Section 5, IDNA Mapping Table.
    • valid: Leave the code point unchanged in the string.
  • Normalize. Normalize the domain_name string to Unicode Normalization Form C. See https://dev.w3.org/cvsweb/charlint/charlint.pl?rev=1.28;content-type=text%2Fplain for a Perl script that does it.
  • Break. Break the string into labels at U+002E ( . ) FULL STOP.
  • Convert/Validate. For each label in the domain_name string:
    • If the label starts with “xn--”:
      • Attempt to convert the rest of the label to Unicode according to Punycode [[RFC3492 (https://www.unicode.org/reports/tr46/#RFC3492)]. If that conversion fails, record that there was an error, and continue with the next label. Otherwise replace the original label in the string by the results of the conversion.
      • Verify that the label meets the validity criteria in Section 4.1, Validity Criteria for Nontransitional Processing. If any of the validity criteria are not satisfied, record that there was an error.
    • If the label does not start with “xn--”:
      Verify that the label meets the validity criteria in Section 4.1, Validity Criteria for the input Processing choice (Transitional or Nontransitional). If any of the validity criteria are not satisfied, record that there was an error.

What you want to do is write some kind of script that reads the table and outputs code. That's more or less or ICU does it, I think.

Take 0x2e. It is the '.' in ascii. It remains a dot. Fine. But look later...

3002          ; mapped                 ; 002E          # 1.1  IDEOGRAPHIC FULL STOP

So character 0x3002 must be mapped to 0x2e.

Some characters are ignored, we just skip them.

Now we must somehow go from the table to efficient C++ code. This has been done before, and I have not looked at how people do it. So here is my naive take.

At the moment, 5986 different characters are mapped to 2945 distinct characters. This can be implemented using a key-value map, using 32 bits for the key and 32 bits (or less) for the value. That's about 50 kB right there. Maybe less if you get clever. A flat data structure with a binary search might work to locate the values and the mapping.

Next we have the disallowed and the ignored values. This could be implemented using a flat list where you have 2 bits per unicode code points (1114111), so that's 2 MB. Each code point can be allowed, disallowed, ignored or mapped. But we probably don't want to use 2MB. So just represent the data as an list of ranges. Each range can be defined by its starting position (a value no larger than 1114111) and two bits representing whether it is allowed, disallowed, ignored or mapped. So each range can be represented using 32 bits. The table has something like 9000 lines, but most of it is descriptive, there are far fewer than 9000 ranges in there... So, let us say 24 kB to store that. Implementation-wise, a simple binary search through the ranges will do.

So mapping, unallowed and so forth... we are talking about less than 100kB.

Normalization is tiny. So I am hopeful we can do all of that in less than 150kB

I recommend breaking down the implementation into the steps I have outlined. At each stage, write some tests. Don't be a cowboy like I did... it is too complicated to do in one sitting.

@lemire
Copy link
Member Author

lemire commented Jan 27, 2023

I could probably even do the mapping and scripting over the big table. I have a good idea of how to do it.

The normalization only looks mysterious, but if you scan through the perl code, you see that there is relatively little involved.

@lemire
Copy link
Member Author

lemire commented Jan 27, 2023

For some context, currently we rely on ICU which requires something like 40 MB. I think that implementing just what we need would be far smaller. Less than 200 kB.

@miguelteixeiraa
Copy link
Contributor

Seems definitely hard. I like it. Thanks for sharing these ideas with me! I'll work on this.

@miguelteixeiraa
Copy link
Contributor

miguelteixeiraa commented Jan 30, 2023

This task is hard 😵

I don't know when I'm going to finish this, but if anyone already has an idea in mind and thinks they could implement it quickly, please don't get hung up on me.

I was thinking about storing the idna table in an interval map like this one:

template <typename T, typename U>
class IntervalMap {
    std::map<std::pair<T, T>, U> storage;

public:
    IntervalMap() {}

    void insert(std::pair<T, T> interval, U value)
    {
        storage[interval] = value;
    }

    U find(T value)
    {
        auto it = storage.lower_bound(std::make_pair(value, value));
        if (it == storage.end() || value < it->first.first) {
            if (it == storage.begin())
                return U();
            --it;
        }
        if (value >= it->first.first && value <= it->first.second)
            return it->second;
        return U();
    }
};

then the table would be created like this:

IntervalMap<unsigned int, std::pair<std::string, std::vector<unsigned int>>> idna_table;

idna_table.insert(std::make_pair(0x2FA1C, 0x2FA1C), {"mapped", {0x9F3B}});
idna_table.insert(std::make_pair(0x2FA1D, 0x2FA1D), {"mapped", {0x2A600}});
idna_table.insert(std::make_pair(0x2FA1E, 0x2FFFD), {"disallowed", {}});
idna_table.insert(std::make_pair(0x2FFFE, 0x2FFFF), {"disallowed", {}});
idna_table.insert(std::make_pair(0x30000, 0x3134A), {"valid", {}});

and access would be like this:

idna_table.find(0x2FA12).second  // the list of chars 

but maybe this implementation is too costly.. what do you think?

the construction of this interval map could be done from a script, like this very ugly one here:

    url = "https://www.unicode.org/Public/idna/15.0.0/IdnaMappingTable.txt"
    response = requests.get(url)
    mapping_table = response.text
    
    lines = []
    
    for line in mapping_table.split("\n"):
        if(line.startswith("#") or ";" not in line):
            continue
        
        line = line[0:line.index("#")]
        line = line.split(";")
        line = [x.strip() for x in line if x.strip() != ""]
        final_line = "idna_table.insert(std::make_pair("
        if(".." in line[0]):
            range = line[0].split("..")
            final_line+=f"0x{range[0]}" + ", " + f"0x{range[1]}"
        else:
            final_line+=f"0x{line[0]}, 0x{line[0]}"
        
        final_line+="), {" + "\"" + line[1] + "\", {"
        
        if(len(line) > 2 and line[2] != "0xNV8"):
            mapped_value = line[2].split(" ")
            for value in mapped_value:
                final_line += f"0x{value}" + ", "
            final_line = final_line[:len(final_line) - 2]
            final_line += "}});"
        else:
            final_line+="}});"
        
        print(final_line)

@miguelteixeiraa
Copy link
Contributor

miguelteixeiraa commented Jan 30, 2023

I could also ignore the "range/interval" thing, use unordered_map to store the whole table and get an O(1)

@lemire
Copy link
Member Author

lemire commented Jan 30, 2023

@miguelteixeiraa This task is hard 😵

Do not try to do it all, and especially not all at once. It is indeed quite a large project... there is a reason nobody does ever does it!

I have a good idea how to implement this table and I can do it. You leave it up to me, I can probably do it this week.

@lemire
Copy link
Member Author

lemire commented Jan 30, 2023

@miguelteixeiraa @anonrig I think we could start a secondary project, maybe https://github.com/ada-url/idna/, and work on it there. Because, indeed, it will be non-trivial and it will take some time (weeks).

@lemire
Copy link
Member Author

lemire commented Jan 30, 2023

@miguelteixeiraa @anonrig And it could be cool to have it as a separate component.

@miguelteixeiraa
Copy link
Contributor

I think we could start a secondary project, maybe https://github.com/ada-url/idna/, and work on it there.

And it could be cool to have it as a separate component.

Sounds like a good idea! : )

@lemire
Copy link
Member Author

lemire commented Feb 1, 2023

Note that it is an evolving target : whatwg/url#744

@lemire
Copy link
Member Author

lemire commented Feb 4, 2023

@miguelteixeiraa Can you have a look at https://github.com/ada-url/idna ?

Next step would be to write some tests for to_ascii. It is not complete because it does not do mapping, it does not scan for forbidden characters and it does not check for bidi conditions. But it is otherwise a good sketch.

@lemire lemire mentioned this issue Feb 9, 2023
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging a pull request may close this issue.

3 participants