URL comparison #65
Comments
First we need to research the subject and figure out what a conforming implementation looks like (see RFC 3986). We should also check what other URL libraries do in terms of comparison. The notes can be collected here in this issue.
So it seems like this issue and #8 are almost the same issue. Each normalization strategy corresponds to a comparison strategy. The main difference for us, because we care about memory allocation, is that we probably want separate normalization and comparison algorithms, with comparison working as if the underlying strings were normalized, instead of comparison reusing the normalization algorithms.

Boost.URL design (TL;DR)

The final choice here is between String comparison and Syntax-Based normalization. String comparison is spelled assert( u1.string() == u2.string() ); and Syntax-Based comparison is spelled assert( u1 == u2 );

The only difference between normalization and comparison is that comparison acts as if the URLs were normalized without modifying them. This has a cost when the URLs are different, but the cost is always constant per code point and does not require any memory allocations from the heap.

Other libs

Other libraries, such as folly and apache, don't include normalization. JavaScript's URL library doesn't include normalization either, but people seem to use some well-known libraries such as sindresorhus/normalize-url. sindresorhus/normalize-url, however, has options that are not closely related to what RFC 3986 describes as normalization. It implicitly assumes URLs are always http, doesn't include the RFC normalization rules, and includes lots of rules that go beyond even http normalization, because the normalized URLs would point to different HTTP resources. In practice, these are at most some useful conversion functions for URLs, and Boost.URL provides alternatives for each of these functions.

Methods

These are the normalization/comparison strategies ordered by their probability of false negatives and cost. In the order of RFC 3986 (section 6.2), from lowest cost and highest probability of false negatives to highest cost and lowest probability: Simple String Comparison, Syntax-Based Normalization, Scheme-Based Normalization, and Protocol-Based Normalization.
Only the first two make sense for the level of abstraction of Boost.URL. Scheme-Based Normalization fits better in Boost.HTTP. Protocol-Based Normalization makes sense in some web spiders.

Trade-offs

False negatives (two URLs to the same resource being considered different) can never be completely eliminated because they depend a lot on the context. Example: the same website served from two servers. Comparison will return false for the same resource. We can only reduce false negatives further with rules augmented by the scheme, the protocol, and the context. Minimizing false negatives also has an extra cost for each normalization. So the goal is to minimize false negatives and completely eliminate false positives.

Note on Relative references

Applications should not compare relative references directly to identify resources. Fragments should often be ignored when the comparison is used to select a network action. A positive example is HTML anchors, which represent the same resource. A negative example is a Git commit tag, which represents different resources.

More details on each method

Comparison Methods and Normalizations by the probability of false negatives and cost:
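To make the difference between String comparison and Syntax-Based normalization concrete, here is a minimal sketch that covers only the percent-encoding part of the RFC 3986 syntax-based rules (uppercase hex digits in escapes, decode escapes of unreserved characters). It does not touch case normalization of scheme and host or dot-segment removal. The names normalize_pct, hex_value, and is_unreserved are hypothetical helpers for illustration, not Boost.URL API, and this version allocates a new string.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <string_view>

// Hypothetical helper: value of a hex digit, or -1 if c is not one.
inline int hex_value(char c) {
    if (c >= '0' && c <= '9') return c - '0';
    if (c >= 'a' && c <= 'f') return c - 'a' + 10;
    if (c >= 'A' && c <= 'F') return c - 'A' + 10;
    return -1;
}

// RFC 3986 unreserved set: ALPHA / DIGIT / "-" / "." / "_" / "~"
inline bool is_unreserved(int c) {
    return (c >= 'a' && c <= 'z') || (c >= 'A' && c <= 'Z') ||
           (c >= '0' && c <= '9') ||
           c == '-' || c == '.' || c == '_' || c == '~';
}

// Hypothetical sketch: normalize percent-encoding only.
// Escapes of unreserved characters are decoded; all other
// escapes are kept, with their hex digits uppercased.
std::string normalize_pct(std::string_view s) {
    static const char digits[] = "0123456789ABCDEF";
    std::string out;
    out.reserve(s.size());
    for (std::size_t i = 0; i < s.size(); ++i) {
        if (s[i] == '%' && i + 2 < s.size()) {
            int hi = hex_value(s[i + 1]);
            int lo = hex_value(s[i + 2]);
            if (hi >= 0 && lo >= 0) {
                int c = hi * 16 + lo;
                if (is_unreserved(c)) {
                    out.push_back(static_cast<char>(c));
                } else {
                    out.push_back('%');
                    out.push_back(digits[hi]);
                    out.push_back(digits[lo]);
                }
                i += 2; // skip the two hex digits
                continue;
            }
        }
        out.push_back(s[i]); // literal (or malformed escape) is copied as-is
    }
    return out;
}
```

With this sketch, "/%7euser" and "/~user" are different under plain string comparison but equal once both sides go through normalize_pct, while "/a%2Fb" and "/a/b" stay different because "/" is reserved.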
Some updates: I've implemented the Syntax-Based comparison as if the components were normalized to avoid allocating memory. The problem we have is the In practice, I think this is rarely going to be
So I'm still exploring the solution with no allocations and then we can later include some of these optimizations.
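For readers who want a picture of what "as if the components were normalized" can mean without allocating, here is a rough sketch under my own assumptions. It is not the Boost.URL implementation. It walks two percent-encoded strings in lockstep and compares one normalized token at a time, so nothing is copied to the heap. The names token, next_token, and equal_as_if_normalized are hypothetical, and only percent-encoding normalization is covered.

```cpp
#include <cassert>
#include <cstddef>
#include <string_view>

// Hypothetical helper: value of a hex digit, or -1 if c is not one.
inline int hex_value(char c) {
    if (c >= '0' && c <= '9') return c - '0';
    if (c >= 'a' && c <= 'f') return c - 'a' + 10;
    if (c >= 'A' && c <= 'F') return c - 'A' + 10;
    return -1;
}

// RFC 3986 unreserved set: ALPHA / DIGIT / "-" / "." / "_" / "~"
inline bool is_unreserved(int c) {
    return (c >= 'a' && c <= 'z') || (c >= 'A' && c <= 'Z') ||
           (c >= '0' && c <= '9') ||
           c == '-' || c == '.' || c == '_' || c == '~';
}

// One unit of the normalized string: either a literal character, or an
// octet that must stay percent-encoded because it is not unreserved.
struct token {
    bool escaped;
    unsigned char value;
};

// Reads the token starting at s[i] and advances i past it.
// Precondition: i < s.size().
inline token next_token(std::string_view s, std::size_t& i) {
    if (s[i] == '%' && i + 2 < s.size()) {
        int hi = hex_value(s[i + 1]);
        int lo = hex_value(s[i + 2]);
        if (hi >= 0 && lo >= 0) {
            int c = hi * 16 + lo;
            i += 3;
            // An escaped unreserved character normalizes to the literal.
            return token{!is_unreserved(c), static_cast<unsigned char>(c)};
        }
    }
    return token{false, static_cast<unsigned char>(s[i++])};
}

// Compares a and b as if both had their percent-encoding normalized.
// No heap allocation; the work per input character is constant.
bool equal_as_if_normalized(std::string_view a, std::string_view b) {
    std::size_t i = 0, j = 0;
    while (i < a.size() && j < b.size()) {
        token x = next_token(a, i);
        token y = next_token(b, j);
        if (x.escaped != y.escaped || x.value != y.value) return false;
    }
    return i == a.size() && j == b.size();
}
```

The trade-off is the one described above: when the inputs are already normalized, this does more work per character than a plain string comparison, but it never allocates.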
fix #8, fix boostorg#65
url_view needs to be equality comparable using the RFC algorithm. We might also consider a lexicographic comparison for containers.
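As a sketch of what a lexicographic comparison for containers could look like, the following orders percent-encoded strings token by token, so that strings which are equal after percent-encoding normalization are also equivalent as container keys. Everything here is hypothetical (compare_as_if_normalized, next_key, normalized_less) and unrelated to the actual url_view interface. The ordering places escaped octets after all literal characters, which is an arbitrary but consistent choice and is not necessarily the byte order of the normalized strings.

```cpp
#include <cassert>
#include <cstddef>
#include <set>
#include <string>
#include <string_view>

// Hypothetical helper: value of a hex digit, or -1 if c is not one.
inline int hex_value(char c) {
    if (c >= '0' && c <= '9') return c - '0';
    if (c >= 'a' && c <= 'f') return c - 'a' + 10;
    if (c >= 'A' && c <= 'F') return c - 'A' + 10;
    return -1;
}

// RFC 3986 unreserved set: ALPHA / DIGIT / "-" / "." / "_" / "~"
inline bool is_unreserved(int c) {
    return (c >= 'a' && c <= 'z') || (c >= 'A' && c <= 'Z') ||
           (c >= '0' && c <= '9') ||
           c == '-' || c == '.' || c == '_' || c == '~';
}

// Reads one normalized token at s[i], advances i, and returns a sort key:
// 0..255 for a literal character, 256..511 for an octet that stays escaped.
// Precondition: i < s.size().
inline int next_key(std::string_view s, std::size_t& i) {
    if (s[i] == '%' && i + 2 < s.size()) {
        int hi = hex_value(s[i + 1]);
        int lo = hex_value(s[i + 2]);
        if (hi >= 0 && lo >= 0) {
            int c = hi * 16 + lo;
            i += 3;
            return is_unreserved(c) ? c : 256 + c;
        }
    }
    return static_cast<unsigned char>(s[i++]);
}

// Three-way lexicographic comparison over normalized tokens.
// Returns a negative value, zero, or a positive value.
int compare_as_if_normalized(std::string_view a, std::string_view b) {
    std::size_t i = 0, j = 0;
    while (i < a.size() && j < b.size()) {
        int x = next_key(a, i);
        int y = next_key(b, j);
        if (x != y) return x < y ? -1 : 1;
    }
    if (i < a.size()) return 1;  // b is a proper prefix of a
    if (j < b.size()) return -1; // a is a proper prefix of b
    return 0;
}

// Comparator usable with ordered containers such as std::set or std::map.
struct normalized_less {
    bool operator()(const std::string& a, const std::string& b) const {
        return compare_as_if_normalized(a, b) < 0;
    }
};
```

A std::set<std::string, normalized_less> would then keep a single entry for "/~user", "/%7Euser", and "/%7euser".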