url: find scheme with a "perfect hash" #12347

Closed

@bagder (Member) commented Nov 17, 2023

Instead of a loop that scans over the potentially 30+ scheme names, use a "perfect hash" table. This works because the set of schemes is known and cannot change within a build. The hash algorithm and table size are chosen so that each scheme maps to exactly one table entry, with no collisions.

The perfect hash is generated by a separate tool (schemetable.c) that is provided as well.
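
To illustrate the idea, here is a minimal sketch of such a lookup. It is not the code from this PR: the six-scheme set, the table size of 13, the initial value 1, the shift of 5, and the names `scheme_table`, `scheme_hash` and `find_scheme` are all assumptions chosen so the example stays small. The slot positions were worked out by hand for this toy hash.

```c
#include <assert.h>
#include <ctype.h>
#include <stddef.h>
#include <string.h>

#define TABLE_SIZE 13 /* illustrative size, not the one used in the PR */

/* Each scheme sits in the slot its hash selects; unused slots stay NULL. */
static const char *const scheme_table[TABLE_SIZE] = {
  [1] = "http", [3] = "ftp", [4] = "https",
  [8] = "dict", [11] = "file", [12] = "smb"
};

/* Toy hash: start at 1, shift left by 5 and add each lowercased byte. */
static unsigned int scheme_hash(const char *s, size_t len)
{
  unsigned int c = 1;
  while(len--)
    c = (c << 5) + (unsigned int)tolower((unsigned char)*s++);
  return c % TABLE_SIZE;
}

/* Returns the canonical scheme name, or NULL if 's' is not a known scheme. */
static const char *find_scheme(const char *s, size_t len)
{
  const char *cand;
  size_t i;
  assert(s != NULL);
  cand = scheme_table[scheme_hash(s, len)];
  /* One slot to inspect: the hash never needs a second probe. */
  if(!cand || strlen(cand) != len)
    return NULL;
  for(i = 0; i < len; i++)
    if(tolower((unsigned char)s[i]) != cand[i])
      return NULL;
  return cand;
}
```

Calling `find_scheme("HTTP", 4)` returns the `"http"` entry, while an unknown scheme such as `find_scheme("gopher", 6)` returns NULL after a single table probe. The final comparison is still required because unknown input can hash to an occupied slot.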

@bagder bagder added the URL label Nov 17, 2023
@bagder bagder closed this Nov 17, 2023
@bagder bagder deleted the bagder/scheme-perfect-hash branch November 17, 2023 09:46
@bagder bagder restored the bagder/scheme-perfect-hash branch November 17, 2023 10:29
@bagder bagder reopened this Nov 17, 2023
@bagder bagder marked this pull request as ready for review November 17, 2023 12:04
@bagder bagder force-pushed the bagder/scheme-perfect-hash branch 3 times, most recently from 225e05b to 2409fc1 Compare November 17, 2023 16:20
@dfandrich (Contributor) commented:

I tried using gperf to create an alternative perfect hash function for comparison, with these results. The gperf method uses an extra 3 array lookups in the hash function, plus the additional size penalty of an extra const lookup array and a redundant gperf_case_strcmp() function and arrays. Its advantage is that it can be automatically regenerated when a new protocol is added, without having to manually find a new perfect hash function (which could happen with the method proposed here). All the #ifdefs around the protocols would need to be added manually to the gperf code, though, every time it needs to be re-run. Since new protocols aren't added all that frequently, the current PR is probably good enough. I'd add a paragraph or two of documentation explaining how to add a new protocol, since it's not as simple as adding it to scheme2num.c and rerunning.

A considerably simpler alternative would be to simply sort the table and use bsearch(). Worst-case would be 5 string comparisons/loops for that one, versus 2 for this PR and 32 for the original. Average case is probably more like 4.5 for bsearch, 2 for this PR and 16 for the original, so not far off but with a big boost in simplicity.
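
A rough sketch of that bsearch() alternative, for comparison. The six-scheme table and the names `sorted_schemes`, `cmp_scheme` and `find_scheme_bsearch` are assumptions for illustration, and the key is assumed to be already lowercased and NUL-terminated.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* Must be kept sorted in strcmp() order for bsearch() to work. */
static const char *const sorted_schemes[] = {
  "dict", "file", "ftp", "http", "https", "smb"
};
#define NUM_SORTED (sizeof(sorted_schemes) / sizeof(sorted_schemes[0]))

/* bsearch() passes the key first and a pointer to the array element second. */
static int cmp_scheme(const void *key, const void *elem)
{
  return strcmp((const char *)key, *(const char *const *)elem);
}

/* Returns the table entry matching the lowercase scheme, or NULL. */
static const char *find_scheme_bsearch(const char *lowercase_scheme)
{
  const char *const *hit;
  assert(lowercase_scheme != NULL);
  hit = bsearch(lowercase_scheme, sorted_schemes, NUM_SORTED,
                sizeof(sorted_schemes[0]), cmp_scheme);
  return hit ? *hit : NULL;
}
```

Adding a scheme here only means inserting it at the right sorted position; no hash parameters need to be regenerated, at the cost of a few more string comparisons per lookup.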

@bagder (Member, Author) commented Nov 17, 2023

> gperf

I think the gperf approach is worse than this PR. Slower and more complicated. Tweaking the hash when we get more entries is not likely to be a problem.

> Average case

The current version (without this PR) is sorted by (assumed) protocol popularity, on the basis that scheme use is not actually random: URLs are more likely to use http or https than smb or dict. This PR makes that assumption moot, since the lookup time depends only on the scheme length.

> I'd add a paragraph or two of documentation explaining how to add a new protocol since it's not as simple as adding it to scheme2num.c and rerunning.

It can be that simple, but it will get even better if you try tweaking the hash function. I actually just did and managed to shrink the table a little more... 😁

@bagder (Member, Author) commented Nov 17, 2023

I've now improved schemetable.c so that it can search for the optimal config for its hash algorithm. It runs over a range of different initial and shift values to see which combo produces the smallest output array. It helped me reduce the table by a few more entries, down to 67.

It also means that when adding a scheme or two to the table, we can just rerun the program and it can find a new optimal combo by itself.
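
The kind of search described above could look roughly like the sketch below. This is not schemetable.c itself: the scheme list, the search ranges (shift 1 to 8, initial value 0 to 999, table size below 128) and all names are assumptions made for illustration.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

#define MAX_TABLE 128 /* illustrative upper bound on the table size */

static const char *const schemes[] = {
  "dict", "file", "ftp", "http", "https", "smb"
};
#define NUM_SCHEMES (sizeof(schemes) / sizeof(schemes[0]))

struct hash_config {
  unsigned int init;
  unsigned int shift;
  unsigned int size; /* 0 means "no collision-free config found" */
};

static unsigned int hash_with(const char *s, unsigned int init,
                              unsigned int shift, unsigned int size)
{
  unsigned int c = init;
  while(*s)
    c = (c << shift) + (unsigned char)*s++;
  return c % size;
}

/* True if every scheme lands in its own slot for this config. */
static int is_perfect(unsigned int init, unsigned int shift,
                      unsigned int size)
{
  unsigned char used[MAX_TABLE];
  size_t i;
  assert(size > 0 && size <= MAX_TABLE);
  memset(used, 0, sizeof(used));
  for(i = 0; i < NUM_SCHEMES; i++) {
    unsigned int h = hash_with(schemes[i], init, shift, size);
    if(used[h])
      return 0; /* collision */
    used[h] = 1;
  }
  return 1;
}

/* Brute force: keep the config that yields the smallest table. */
static struct hash_config find_best_config(void)
{
  struct hash_config best = {0, 0, 0};
  unsigned int init, shift, size;
  for(shift = 1; shift <= 8; shift++) {
    for(init = 0; init < 1000; init++) {
      for(size = (unsigned int)NUM_SCHEMES; size < MAX_TABLE; size++) {
        if(best.size && size >= best.size)
          break; /* cannot beat the best found so far */
        if(is_perfect(init, shift, size)) {
          best.init = init;
          best.shift = shift;
          best.size = size;
          break;
        }
      }
    }
  }
  return best;
}
```

Rerunning `find_best_config()` after editing the `schemes` array is what makes adding a scheme cheap: the search picks new parameters by itself, and the generator can then emit the table for them.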

This tool generates a scheme-matching table.
Instead of a loop to scan over the potentially 30+ scheme names, use a
"perfect hash" table. This works fine because the set of schemes is
known and cannot change in a build. The hash algorithm and table size
are chosen so that each scheme maps to exactly one table entry.

The perfect hash is generated by a separate tool (scheme2num.c) that
needs to be provided as well if we decide to go with this route.
@bagder bagder closed this in b2d8f3f Nov 19, 2023
@bagder bagder deleted the bagder/scheme-perfect-hash branch November 19, 2023 12:59