Skip to content

Loading…

[Pre-commit discussion] Faster hostnameFromURI #953

Closed
chrisaljoudi opened this Issue · 8 comments

2 participants

@chrisaljoudi
Owner

I've been able to tune and rewrite parts of URI.hostnameFromURI and have gotten a result that appears to be significantly faster on all browsers I've tested (it was more than two times as fast (100% faster) on Safari, ~25% faster on Chrome, and 15-20% on Firefox).

To start, here's the benchmark.

I haven't committed this yet, @gorhill, because I wanted to know what you thought about this. There are lots of different optimizations/improvements, so we can selectively use them if some of them are problematic.

Here's the new hostnameFromURI:

function charCodeTableFromStr(str) {
    var obj = {};
    for (var i = 0, n = str.length; i < n; i++) {
        obj[str.charCodeAt(i)] = true;
    }
    return obj;
}
var notAllowedInScheme = charCodeTableFromStr("/?#");
var allowedInBareAuth = charCodeTableFromStr("0123456789ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz-.");

// The most used function, so it better be fast.
URI.hostnameFromURI = function(uri) {
    var i = 0, len = uri.length;
    if(uri.lastIndexOf('http:', 0) === 0) {
        i = 7;
    }
    else if(uri.lastIndexOf('https:', 0) === 0) {
        i = 8;
    }
    else if(uri.lastIndexOf('//', 0) === 0) {
        i = 2;
    }
    else {  // some other scheme
        var cur;
        while(i < len) {
            cur = uri.charCodeAt(i);
            if(cur === 58) {   // is cur == ':' ?
                break;
            }
            else if(notAllowedInScheme[cur]) {
                return '';
            }
            i++;
        }
        i += 3;
        if(i >= len) {
            return '';
        }
    }
    var j = i;
    while(j < len && uri.charCodeAt(j) !== 47) { // indexOf '/'
         j++;
    }
    uri = uri.slice(i, j); // this is the authority; reuse `uri`
    len = j - i, i = 0;
    while(i < len) {
        if (!allowedInBareAuth[uri.charCodeAt(i++)]) {
            // found a character indicating this is not a bare authority
            i = 0; 
            break;
        }
    }
    if(i) { // we're good if i wasn't reset 
        return uri.toLowerCase();  // this is the most common exit point
    }
    var matches = reHostFromAuthority.exec(uri);
    if (!matches) {
        matches = reIPv6FromAuthority.exec(uri);
        if (!matches) {
            return '';
        }
    }
    return matches[1].toLowerCase();
};
@gorhill

It fails with data URI and this special case: http://./, which I have encountered once. It's supposed to return an empty string.

I extended the benchmark so as to also act as a test bench:
http://jsperf.com/uritools-hostname-from-uri/3

I had to comment out the tests which fail to complete the benchmark.

Also, are you sure that the Safari benchmark wasn't a result of the browser optimizing away what it might have seen as pointless -- since host is never read from, only written to? It's why now I always put test/throw in benchmarks on JSperf, I've often seen Firefox optimizing away tests.

Edit: Notice how the benchmark results are different with http://jsperf.com/uritools-hostname-from-uri/3.

@gorhill

Also, I just want to remind that this is a key core piece of code, and whatever change will have to go through a whole lot of testing. Touching such core piece of code which has sustained the test of time I find this quite unnerving at this point, as stability is a top priority given how the user base has grown. The last thing I want is a major breakage user-side because we didn't foresee some cases.

@chrisaljoudi
Owner

@gorhill thanks for helping out!

I had a bunch of tiny screw-ups in there; here's a fixed version with all the tests you set up.

Also, are you sure that the Safari benchmark wasn't a result of the browser optimizing away what it might have seen as pointless -- since host is never read from, only written to?

You're right that given only the original benchmark, that could be possible — however, the new benchmark eliminates that possibility.

@chrisaljoudi
Owner

Here are the results so far from my machine (obviously, more tests would help):

bench

@chrisaljoudi
Owner

After thinking about it some more and experimenting, it seems like the benefits are currently minimal compared to the fragility a hand-coded string matcher would cause.

Closing for now.

@gorhill

Err.. I was ready to go ahead with this, I just wanted some time to look at the code and match it against the RFC.

@gorhill

Just adding the simplest regexp for the most common case encountered, http and https, we can gain something, without compromising the reliability of the whole thing. Example, since most URI tested will be http or https, if we use this as first best-guess regexp:

var reVeryNaiveButFast = /^https?:\/\/([0-9a-z_][0-9a-z._-]+)\//;

...
var matches = reVeryNaiveButFast.exec(uri);
if ( matches ) {
    return matches[1];
}
...

There is some gain, and yet in case of no match, the current code path is used. That regexp is essentially just one which combine a couple of the steps done afterward. In most case, it will be a match, yet when it is not, it will fall back on the time-tested code.

Edit: Note that the gain is best shown when almost all URI tested are a match, of course, which is something expected in reality. I suppose it would be a good idea to have a benchmark for test purpose, and one for speed purpose thinking of it.

@chrisaljoudi
Owner

@gorhill great idea — created the benchmark, and it shows similar, consistent improvements.

@gorhill gorhill added a commit that referenced this issue
@gorhill gorhill related to #953 d2a6a38
@AlexVallat AlexVallat added a commit to AlexVallat/uBlock that referenced this issue
@gorhill gorhill related to #953 efc0c8f
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Something went wrong with that request. Please try again.