Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Documenting hashing technique #618

Merged
merged 7 commits into from
Mar 21, 2024
Merged

Documenting hashing technique #618

merged 7 commits into from
Mar 21, 2024

Conversation

lemire
Copy link
Member

@lemire lemire commented Mar 20, 2024

As remarked by @the-moisrex in issue 617, Premature optimiztion, the hashing technique to identify the protocol is unnecessary if the protocol string is predictable. By simply testing all six possible strings, starting from the most probable, it is likely that you can often get better performance in many practical cases.

It is a trade-off between best average or best-case performance and worst-case performance. A naive approach, in the worst case, may need to do six different string comparisons which could be potentially expensive. We made the choice to go with an option that is slightly less optimal in the case where we can predict, at compile time, the protocol, but has the benefit of having well understood worst-case performance. It is a good old "O(1)" performance vs "O(N)" performance where N is not too large.

This trade-off can be revisited at any time. It is worth documenting it in the code, however. This is what this PR does.

See also : #619

@lemire lemire requested a review from anonrig March 20, 2024 19:52
@lemire lemire mentioned this pull request Mar 20, 2024
@the-moisrex
Copy link
Contributor

I'm just gonna leave this here; it's something from my own implementation but I've changed mine because I needed to ignore tabs and newlines.

You should not be aiming for the worse case scenario; which, the worse case scenario of this algorithm is not as bad as you think it is. If you have 10000 character of scheme characters, worse case is not 10000. Worse case would be sum of all of the special schemes which is just 21.

I have a benchmark for this as well.

        /**
         * @return 0 if unknown, otherwise return the port
         */
        [[nodiscard]] constexpr stl::uint16_t known_port() const noexcept {
            // NOLINTBEGIN(*-magic-numbers)
            switch (this->size()) {
                case 2:
                    if (this->operator[](0) == 'w' && this->operator[](1) == 's') {
                        return 80U;
                    }
                    break;
                case 3:
                    if (*this == "wss") {
                        return 443U;
                    }
                    if (*this == "ftp") {
                        return 21U;
                    }
                    break;
                case 4:
                    if (*this == "http") {
                        return 80U;
                    }
                    break;
                case 5:
                    if (*this == "https") {
                        return 443U;
                    }
                    break;
                default: break;
            }
            return 0U;
            // NOLINTEND(*-magic-numbers)
        }

@lemire
Copy link
Member Author

lemire commented Mar 20, 2024

@the-moisrex I understand that these functions may be bottlenecks in your applications. In ada, with the current code, identifying the protocol or or the special port, does not even register when profiling. It may be possible to gain 1% or 2% in some tests but we can rule out significant gains from the profiling data alone, at least in the tests that we have done so far.

We are still open to revisiting design choices, but this particular component of the code is unlikely to become a priority.

I should point out that the code routine you offer is a form of hashing: you use the key length as the hash function and then branch out.

There is evidently a wide range of design choices, and I invite you to pursue your investigations in your projects.

@ada-url ada-url locked as resolved and limited conversation to collaborators Mar 20, 2024
include/ada/scheme-inl.h Show resolved Hide resolved
Co-authored-by: Yagiz Nizipli <yagiz@nizipli.com>
@lemire lemire merged commit 7dc9e67 into main Mar 21, 2024
21 checks passed
@lemire lemire deleted the documenting_hashing_technique branch March 21, 2024 13:23
This pull request was closed.
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants